Introduction to R
R
looks like before tutorials vs. How you look like after learning R
Set up R
Download R 4.0.2 from R project It asks you to pick up a CRAN mirror. There is one for Canada/University of Toronto.
Next, you may download a text editor. Text editors make our life easier for using R. I recommend RStudio.
Once Rstudio installed, launch it and you are good to go!
The frontend
- Before I walk you through the interface, some fun first. You can change the default color scheme of your RStudio interface. Go to Tools>Global Options>Appearance. Here you can change the font type and size, and color scheme. I am using Modern. I think it’s cool. Here is how my interface looks like:
\(~\)
- The ordering/layout of these windows/tabs might be different on your screen. That’s ok. On my screen, please notice that the console window is in the up right corner. The window with files-plots tabs is in the down right corner. And the environment window is to the left. You can change this layout as you wish. Whatever is more convenient for you. Go to Tools>Global Options>Pane Layout. Here are my settings:
\(~\)
- What are these tabs then?
- Everything on
R
is an object. The environment tab stores the objects you saved/created during the session. The object can be a list of values, data frames, or a function. For instance, as you may see below, in my environment, there is data entitled “vdem15”, a function entitled “thewitcher” and a list of values entitled “lucky” stored.
- The files tab displays all files in your default workspace. The plots tab displays the plots/figures that you’ll create.
- The console tab is where you can type your commands and see the output. But if you’d like to keep track of your commands, you should start an
R
script. If you used STATA before, it’s similar to do-files. Press Ctrl+Shift+N to open a new script. You can type and run yourR
commands here.
You are good to go! I know R
might be intimidating for some of you. But hang in there. You’ll be able to pull off cool stuff with it. I’ve prepared this html for you on R
, for instance. Or you can design interactive plots like this (it’s called Rosenbrock’s valley):
Some Basics
Let’s check out the basic syntax and command operators on R.
Mathematical Operations
Open your R script and type in the following commands. Then click Ctrl + Enter. With this shortcut, you can run the command on the current line or any selected lines.
### You can type your notes/headings after a hashtag.
105 + 105 #you may perform arithmetic operations
365/12
5^-2
(3*3) + (2/5)
### There are some built-in functions in R.
log(100) #natural log
seq(2,6) #create a sequence of numbers from 2 to 6
seq(1,12, by=2) #create a sequence of numbers from 1 to 12 that increases by 2.
seq(0,1, length=11) #a sequence from 0 to 1 with specific length/total number of elements
5:8 #an alternative notation for integer-sequence
sum(5:8) #take a sum of numbers from 5 to 8
mean(10:20) #find mean of numbers from 10 to 20
sqrt(16) #square root of a non-negative number
#You can always look through documentations of these functions to seek help and remember the notation.
?seq
Creating Objects
We store information in R sessions as an object with an assigned name. To that aim, we use the “<-” assignment operator.
### Assignments
result <- sqrt(36) + sum(4:15) / 2^4
result #Write the object name and hit Enter in the console, or Ctrl+Enter in R script. It'll print the result in the Console.
## [1] 13.125
Please note that result object has just been stored as a value in your environment. If you assign different value to a stored object, it’ll be replaced. Be advised! You can assign numeric values, functions, strings of characters to an object.
## [1] "i am a nerd"
Trick: you can both store and print the object at the same time by putting the assignment in parentheses.
You can list and remove these stored objects.
### Listing and removing objects
a1 <- "5789" #treats it as a character
a2 <- 5789 #treats it as a number
ls() #lists all objects in the environment
## [1] "a1" "a2" "result"
## [1] "a1" "a2"
rm(a1) #removes a1 object from the environment
rm(list=ls(pat="a")) #removes all objects that contain letter a from the environment
Each object has two intrinsic values: its class and length. e.g. An object might be logical, numeric, character, function, etc.
## [1] "character"
## [1] "numeric"
## [1] "logical"
## [1] "function"
You may use numeric objects for subsequent mathematical operations.
Vectors and Lists
We can combine/concatenate multiple elements and objects into one object - a vector.
### Concatenate
a1 <- c("Tenet","is", "a", "good", "movie")
length(a1) #the length of the vector/how many elements
## [1] 5
## [1] 4
## [1] 1 2 3 4 4 72 21
## [1] 7
a4 <- c(a2, a3) #You can combine vectors. Please note that logical elements are coerced into numeric.
print(a4)
## [1] 1 0 0 1 1 2 3 4 4 72 21
## [1] "Tenet" "is" "a" "good" "movie" "1" "2" "3" "4"
## [10] "4" "72" "21"
You can access specific elements of vectors through indexing.
## [1] "good"
## [1] 0 0 1 1 2 3
## [1] 0 3
## [1] "Tenet" "is" "a" "movie"
## [1] "Tenet" "is" "good"
## [1] "Tenet"
You can use specific elements of numeric vectors for mathematical operations. Below you see the cumulative number of active COVID cases for Toronto for the last week of August.
Date | Cases |
---|---|
30/8/2020 | 371 |
29/8/2020 | 350 |
28/8/2020 | 300 |
27/8/2020 | 267 |
26/8/2020 | 226 |
25/8/2020 | 192 |
24/8/2020 | 174 |
Table 1: Number of cumulative COVID cases for Toronto
covid <- c(174, 192, 226, 267, 300, 350, 371) #Storing number of cumulative cases
covid.increase <- covid[-1] - covid[-7]
(covid.percent <- (covid.increase / covid[-7])*100) #daily percentage increase for cumulative cases
## [1] 10.34483 17.70833 18.14159 12.35955 16.66667 6.00000
###More built-in functions
mean(covid.percent) #the average percentage increase for the last week of august
max(covid.percent) #maximum value of daily percentage increase
min(covid.percent) #minimum value
(max(covid.percent) - min(covid.percent)) / length(covid.percent) #You can use these basic functions for arithmetic operations, too.
Vectorized arithmetic is also possible.
a1 <- seq(from=5, to=45, by=5)
sort(a1, decreasing = TRUE) #sorts the elements of a1 in a decreasing fashion
## [1] 45 40 35 30 25 20 15 10 5
a2 <- rep(5, times=length(a1)) #rep is another built-in function for R. repeats number 5 for the length of a1 vector.
print(a1/a2) #divide the vectors
## [1] 1 2 3 4 5 6 7 8 9
Lists are objects that include elements of different types.
## [1] 3
## [1] "list"
Functions
Please take a moment to take a stock of built-in functions we have learned so far.
Another built-in function is names
which assigns names to elements in a vector. Let’s label the cumulative COVID cases with their respective dates.
###Assigning names and saving as date
covid.dates <- as.Date(c("2020-08-24", "2020-08-25", "2020-08-26",
"2020-08-27", "2020-08-28", "2020-08-29",
"2020-08-30")) #as.Date is another built-in R function that allows you to store dates. Watch out for the appropriate date format.
names(covid) <- covid.dates #labeling COVID items with dates
print(covid)
## 2020-08-24 2020-08-25 2020-08-26 2020-08-27 2020-08-28 2020-08-29 2020-08-30
## 174 192 226 267 300 350 371
You may also use paste function quite often.
sentence <- c("toronto", "is", "the", "best", "city", "in the world")
paste(sentence, collapse = " ")
## [1] "toronto is the best city in the world"
You can also create your own functions to avoid typing the same command over and over again. User-defined functions offer us efficiency. Let’s say as part of your job, you are expected to report weekly 1) the cumulative number of COVID cases over a week for Toronto, 2) the average number of new cases for the week, 3) maximum number of new cases, 4) minimum number, 5) daily percentage change. Instead of writing the code for each week and run it separately, you may simply create a function to render the process more efficient!
31.08.2020 | 30.08.2020 | 29.08.2020 | 28.08.2020 | 27.08.2020 | 26.08.2020 | 25.08.2020 |
42 | 29 | 52 | 34 | 43 | 40 | 23 |
24.08.2020 | 23.08.2020 | 22.08.2020 | 21.08.2020 | 20.08.2020 | 19.08.2020 | 18.08.2020 |
35 | 34 | 21 | 27 | 30 | 31 | 28 |
Table 2: Number of daily reported COVID cases for Toronto
###User-defined functions
covid.week1 <- c(28, 31, 30, 27, 21, 34, 35) #storing number of cases for each week
covid.week2 <- c(23, 40, 43, 34, 52, 29, 42)
covid.summary <- function(x){ #creating a function with one input, x, titled covid.summary
out.total <- sum(x) #cumulative cases
out.mean <- mean(x) #average
out.max <- max(x) #maximum
out.min <- min(x) #minimum
out.percent <- mean((x[-1] - x[-7])*100 / x[-7]) #average daily percentage change
out.final <- c(out.total, out.mean, out.max, out.min, out.percent) #final output
names(out.final) <- c("Cumulative Cases", "Average Number", "Maximum", "Minimum", "Average Per. Change") #labeling the output, be careful with the ordering!
return(out.final) #return function will call the output here
}
covid.summary(covid.week1) #Calling the function covid.summary supplying covid.week1 vector as an argument
## Cumulative Cases Average Number Maximum Minimum
## 206.000000 29.428571 35.000000 21.000000
## Average Per. Change
## 6.685366
## Cumulative Cases Average Number Maximum Minimum
## 263.00000 37.57143 52.00000 23.00000
## Average Per. Change
## 19.00347
There are different ways in which you can define your arguments. e.g. You can create a function with multiple arguments/inputs. Let’s say you are asked to calculate weekly percentage change in average number of new reported cases for Toronto.
covid.weekly <- function(w1, w2) { #two arguments defined as w1 and w2 that would be referred to in the function
out.percent <- (mean(w2) - mean(w1))*100/mean(w1)
names(out.percent) <- "Weekly Percentage Change"
return(out.percent)
}
covid.weekly(covid.week1, covid.week2) #supply covid.week1 and covid.week2 for w1 and w2, respectively
## Weekly Percentage Change
## 27.6699
Data
- We don’t really save our data manually with vectors. More often than not, we import external files to
R
. Often it’s either a .csv (Excel), .txt (Text), .dta (STATA) or RData files. RData is a collection ofR
objects (i.e.R
output). R
will automatically try to import data from your working directory. Check the Files tab to see your current working directory. Ideally, you’d want to have your project files in one designated place. Do not save to / import from your Desktop - it’ll be chaotic.- There are different ways in which you can assign/change your working directory. Below is one of them. Know that it exists as a method, and use it just for now, until I introduce a new technique toward the end.
setwd("C:/Users/Semuhi/Documents/University of Toronto/Fall 2020/TA - Statistics for Political Scientists/Tutorials/Week 1") #another built-in function to set working directory
getwd() #check working directory
- Please download the datasets from Imai’s website for Chapter 1. Place these files in your working directory.
UNpop <- read.csv("UNpop.csv") #read.csv is a built-in function to help us import csv. files into an R object. Don't forget to assign this data to a specific object, otherwise it'll not be stored in your environment.
class(UNpop)
length(UNpop)
load("UNpop.RData") #for Rdata, we use load function.
- Check your Environment tab. UNpop object is a data.frame with 2 variables and 7 observations. Data frame is an
R
object of collection of vectors. In this case, it’s as if there are two vectors/columns (therefore the length of UNpop is 2) merged into a data frame.
year <- seq(from=1950, to=2010, by=10)
world.pop <- c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
UNpop1 <- data.frame(year, world.pop) #data.frame is another built-in function that creates data frames.
View(UNpop) #view function to check the data.
View(UNpop1) #Compare UNpop and UNpop1.
Let’s work on this data a bit.
## [1] 2
## [1] 7
## year world.pop
## 1 1950 2525779
## 2 1960 3026003
## 3 1970 3691173
## 4 1980 4449049
## 5 1990 5320817
## 6 2000 6127700
## year world.pop
## Min. :1950 Min. :2525779
## 1st Qu.:1965 1st Qu.:3358588
## Median :1980 Median :4449049
## Mean :1980 Mean :4579529
## 3rd Qu.:1995 3rd Qu.:5724258
## Max. :2010 Max. :6916183
Please familiarize yourself with $
operator. That allows us to access variables/columns from data frames and individual elements from objects.
## [1] 1950 1960 1970 1980 1990 2000 2010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2525779 3358588 4449049 4579529 5724258 6916183
## [1] 1950 1960 1970 1980 1990 2000 2010
## [1] 1950 1960 1970
Let’s create a new dataset for running some descriptive statistics. Assume that this is a dataset on schools in an imaginary province that lays out information on their status, funding, and number of teachers.
### Expand Grid
schools <- expand.grid(status=c("Public", "Private"),
funding=seq(1500, 2000, by=100),
teacher=c(seq(5,15,by=5),NA))
#expand.grid is a built-in function that creates a data frame with all possible combinations of given vectors.
head(schools)
## status funding teacher
## 1 Public 1500 5
## 2 Private 1500 5
## 3 Public 1600 5
## 4 Private 1600 5
## 5 Public 1700 5
## 6 Private 1700 5
## [1] NA
mean(schools$teacher, na.rm = TRUE) #TRUE and FALSE are logical statements in R. na.rm option allows you to discard them for taking the average.
## [1] 10
## [1] 10
You can save/export your data to your working directory.
Packages
R
package is a collection of coding, data, and documentation to expand R
functionalities. You can think of them as apps we install on our phones. Our phone can make a call, send a text, etc., but with extra apps, you can shoot a TikTok video… I will introduce to you some of the important packages over the course of the term. You can also create your own package later!
In order to use these packages (say apps if you want), we must install them first. One useful package is “foreign” that allows us to import data from other statistical software such as STATA.
### Install Packages
install.packages("foreign") #Write this in console, not in script; you need to install this only once in your computer.
library("foreign") #Once installed, you must use the library command in your R script for each R session.
## It's like each time you need to use an app, you must tap on the icon, right? Library function does just that.
Another quite useful package is here. Instead of using setwd()
, you can simply use this package. Because it creates path relative to your top-level directory, it helps you to make your script easily reproducible. If the first line of your script is setwd("C:), Jennry Bryan might come and set your computer on fire. It’s not cool.
Instead, use the here package.
Other Syntax and Operators
An operator helps us with mathematical and logical manipulations.
- The built-in operators for arithmetic operators are
+, -, *, /, ^
etc. - Relational and logical operators:
>, <, ==, <=, >=, !=
.!
introduces negation.!=
means not equal.&, |
are logical AND - OR operators. - Other operators:
:
is a colon operator that implies series in a sequence.%in%
denotes if an element belongs to a vector or not.
You may find some examples below.
### Operators
schools$status[!(schools$funding<1900)] #list status of schools with funding more than 1900.
#Let's calculate total number of teachers in public schools.
sum(schools$teacher[schools$status=="Public"], na.rm=TRUE) #access teachers in public schools. Notice the use of double brackets. This is subsetting with a logical operation with "=="
sum(schools$teacher[schools$status=="Public" & schools$funding>1900], na.rm=TRUE) #total number of teachers in public schools with funding more than 1900.
Other Tips and Shortcuts
- If you cannot run your
R
code and keep getting errors, it is either you forgot to add a column or comma somewhere, or it’s just you need to “turn it off and on again”. Go to Session>Restart R. But that means you must run the whole code again.
You can comment out blocks of code by selecting lines of code and use the following shortcut: Ctrl + Shift + C. You can take it back with the same shortcut.
You can edit several lines at the same time by pressing ALT.
Get familiar with some of these RStudio shortcuts. It’ll make your life easier.
You can create a heading in the navigation button at the end of your R script tab by adding at least 4 #### or
- - - -
at the end of your lines.If you cannot figure something out, just Google it. 80% of learning how to code is just that.
rm(list=ls())
will clear user-defined objects in your environment. It does not restart your session. You’ll see some people putting at the beginning of their code. If someone else runs your code, as they have some objects stored in their session, you’d just erase everything. Not cool. Don’t use it in your R script. If you want to clear your environment anyway, use it in the Console tab.
A Sample R Script
##--------------------------------------------------------------##
## Tutorial #1 ##
## Introduction to R ##
## Semuhi Sinanoglu ##
## Sep. 2020 ##
##--------------------------------------------------------------##
#install.packages(c('foreign', 'here'))
library(foreign)
library(here)
## Get data ------------------------------------------------------
UNpop <- read.csv("UNpop.csv")
UNpop.analysis <- UNpop #keep the original in the environment and use the new one for data manipulation
## Descriptive Statistics ----------------------------------------
summary(UNpop.analysis)
Project-based Workflow
For every new research project/homework, I highly encourage you to start an
R
project. Each project will be self-contained and easily reproducible, especially used with here package. It’ll help you to have a file system structure in the sense that all of your files for a project will be stored in a designated folder.Go to File->New Project->New Directory->New Project and then create your new project in a designated folder.
- Rproj can help you access your other files stored in the working directory in the Files tab. It also allows us to switch to recent projects. Check the right corner of your screen for a scrolldown with the Rproj icon.