How R looks like before tutorials vs. How you look like after learning R

Set up R

  1. Download R 4.0.2 from R project It asks you to pick up a CRAN mirror. There is one for Canada/University of Toronto.

  2. Next, you may download a text editor. Text editors make our life easier for using R. I recommend RStudio.

  3. Once Rstudio installed, launch it and you are good to go!

The frontend

  1. Before I walk you through the interface, some fun first. You can change the default color scheme of your RStudio interface. Go to Tools>Global Options>Appearance. Here you can change the font type and size, and color scheme. I am using Modern. I think it’s cool. Here is how my interface looks like:

\(~\)

  1. The ordering/layout of these windows/tabs might be different on your screen. That’s ok. On my screen, please notice that the console window is in the up right corner. The window with files-plots tabs is in the down right corner. And the environment window is to the left. You can change this layout as you wish. Whatever is more convenient for you. Go to Tools>Global Options>Pane Layout. Here are my settings:

\(~\)

  1. What are these tabs then?
  • Everything on R is an object. The environment tab stores the objects you saved/created during the session. The object can be a list of values, data frames, or a function. For instance, as you may see below, in my environment, there is data entitled “vdem15”, a function entitled “thewitcher” and a list of values entitled “lucky” stored.

  • The files tab displays all files in your default workspace. The plots tab displays the plots/figures that you’ll create.
  • The console tab is where you can type your commands and see the output. But if you’d like to keep track of your commands, you should start an R script. If you used STATA before, it’s similar to do-files. Press Ctrl+Shift+N to open a new script. You can type and run your R commands here.

You are good to go! I know R might be intimidating for some of you. But hang in there. You’ll be able to pull off cool stuff with it. I’ve prepared this html for you on R, for instance. Or you can design interactive plots like this (it’s called Rosenbrock’s valley):

Some Basics

Let’s check out the basic syntax and command operators on R.

Mathematical Operations

Open your R script and type in the following commands. Then click Ctrl + Enter. With this shortcut, you can run the command on the current line or any selected lines.

### You can type your notes/headings after a hashtag.
105 + 105 #you may perform arithmetic operations
365/12
5^-2
(3*3) + (2/5)

### There are some built-in functions in R.
log(100) #natural log
seq(2,6) #create a sequence of numbers from 2 to 6
seq(1,12, by=2) #create a sequence of numbers from 1 to 12 that increases by 2.
seq(0,1, length=11) #a sequence from 0 to 1 with specific length/total number of elements
5:8 #an alternative notation for integer-sequence
sum(5:8) #take a sum of numbers from 5 to 8
mean(10:20) #find mean of numbers from 10 to 20
sqrt(16) #square root of a non-negative number

#You can always look through documentations of these functions to seek help and remember the notation.
?seq

Creating Objects

We store information in R sessions as an object with an assigned name. To that aim, we use the “<-” assignment operator.

### Assignments
result <- sqrt(36) + sum(4:15) / 2^4
result #Write the object name and hit Enter in the console, or Ctrl+Enter in R script. It'll print the result in the Console.
## [1] 13.125

Please note that result object has just been stored as a value in your environment. If you assign different value to a stored object, it’ll be replaced. Be advised! You can assign numeric values, functions, strings of characters to an object.

#string of characters stored, so they are in double quotation marks.
(result <- "i am a nerd") 
## [1] "i am a nerd"

Trick: you can both store and print the object at the same time by putting the assignment in parentheses.

You can list and remove these stored objects.

### Listing and removing objects
a1 <- "5789" #treats it as a character
a2 <- 5789 #treats it as a number
ls() #lists all objects in the environment
## [1] "a1"     "a2"     "result"
ls(pat="a") #lists all objects that include the letter a in their name
## [1] "a1" "a2"
rm(a1) #removes a1 object from the environment
rm(list=ls(pat="a")) #removes all objects that contain letter a from the environment

Each object has two intrinsic values: its class and length. e.g. An object might be logical, numeric, character, function, etc.

### Object attributes
a1 <- "Tenet"
class(a1) #class function tells us the main attribute/type.
## [1] "character"
a2 <- 2020
class(a2)
## [1] "numeric"
a3 <- TRUE
class(a3)
## [1] "logical"
class(seq)
## [1] "function"

You may use numeric objects for subsequent mathematical operations.

x <- sum(1:100)
y <- x/50

Vectors and Lists

We can combine/concatenate multiple elements and objects into one object - a vector.

### Concatenate
a1 <- c("Tenet","is", "a", "good", "movie")
length(a1) #the length of the vector/how many elements
## [1] 5
a2 <- c(TRUE, FALSE, FALSE, TRUE)
length(a2)
## [1] 4
a3 <- c(seq(1,4), sqrt(16), 47+25, sum(6:8))
print(a3) #It will print the object in the console. 
## [1]  1  2  3  4  4 72 21
length(a3)
## [1] 7
a4 <- c(a2, a3) #You can combine vectors. Please note that logical elements are coerced into numeric.
print(a4)
##  [1]  1  0  0  1  1  2  3  4  4 72 21
a5 <- c(a1, a3) #Combined with character, numeric coerced into character,
print(a5)
##  [1] "Tenet" "is"    "a"     "good"  "movie" "1"     "2"     "3"     "4"    
## [10] "4"     "72"    "21"

You can access specific elements of vectors through indexing.

### Indexing
a1[4] #we use square brackets for indexing. This means fourth element.
## [1] "good"
a4[2:7] #elements 2 through 7
## [1] 0 0 1 1 2 3
a4[c(2,7)] #second and seventh element
## [1] 0 3
a1[-4] #omit fourth element
## [1] "Tenet" "is"    "a"     "movie"
a1[-c(3,5)] #omit third and fifth element
## [1] "Tenet" "is"    "good"
a1[-(2:7)] #omit elements 2 through 7
## [1] "Tenet"

You can use specific elements of numeric vectors for mathematical operations. Below you see the cumulative number of active COVID cases for Toronto for the last week of August.

Date Cases
30/8/2020 371
29/8/2020 350
28/8/2020 300
27/8/2020 267
26/8/2020 226
25/8/2020 192
24/8/2020 174

Table 1: Number of cumulative COVID cases for Toronto

covid <- c(174, 192, 226, 267, 300, 350, 371) #Storing number of cumulative cases
covid.increase <- covid[-1] - covid[-7]
(covid.percent <- (covid.increase / covid[-7])*100) #daily percentage increase for cumulative cases
## [1] 10.34483 17.70833 18.14159 12.35955 16.66667  6.00000
###More built-in functions
mean(covid.percent) #the average percentage increase for the last week of august
max(covid.percent) #maximum value of daily percentage increase
min(covid.percent) #minimum value
(max(covid.percent) - min(covid.percent)) / length(covid.percent) #You can use these basic functions for arithmetic operations, too. 

Vectorized arithmetic is also possible.

a1 <- seq(from=5, to=45, by=5)
sort(a1, decreasing = TRUE) #sorts the elements of a1 in a decreasing fashion
## [1] 45 40 35 30 25 20 15 10  5
a2 <- rep(5, times=length(a1)) #rep is another built-in function for R. repeats number 5 for the length of a1 vector.
print(a1/a2) #divide the vectors
## [1] 1 2 3 4 5 6 7 8 9

Lists are objects that include elements of different types.

dates <- list(c("Sep", "3", "2020"), c("Month", "Day", "Year"), c(1,2))
length(dates)
## [1] 3
class(dates)
## [1] "list"

Functions

Please take a moment to take a stock of built-in functions we have learned so far.  

Another built-in function is names which assigns names to elements in a vector. Let’s label the cumulative COVID cases with their respective dates.

###Assigning names and saving as date
covid.dates <- as.Date(c("2020-08-24", "2020-08-25", "2020-08-26",
                         "2020-08-27", "2020-08-28", "2020-08-29",
                         "2020-08-30")) #as.Date is another built-in R function that allows you to store dates. Watch out for the appropriate date format.
names(covid) <- covid.dates #labeling COVID items with dates
print(covid)
## 2020-08-24 2020-08-25 2020-08-26 2020-08-27 2020-08-28 2020-08-29 2020-08-30 
##        174        192        226        267        300        350        371

You may also use paste function quite often.

sentence <- c("toronto", "is", "the", "best", "city", "in the world")
paste(sentence, collapse = " ")
## [1] "toronto is the best city in the world"

You can also create your own functions to avoid typing the same command over and over again. User-defined functions offer us efficiency. Let’s say as part of your job, you are expected to report weekly 1) the cumulative number of COVID cases over a week for Toronto, 2) the average number of new cases for the week, 3) maximum number of new cases, 4) minimum number, 5) daily percentage change. Instead of writing the code for each week and run it separately, you may simply create a function to render the process more efficient!

31.08.2020 30.08.2020 29.08.2020 28.08.2020 27.08.2020 26.08.2020 25.08.2020
42 29 52 34 43 40 23
24.08.2020 23.08.2020 22.08.2020 21.08.2020 20.08.2020 19.08.2020 18.08.2020
35 34 21 27 30 31 28

Table 2: Number of daily reported COVID cases for Toronto

###User-defined functions
covid.week1 <- c(28, 31, 30, 27, 21, 34, 35) #storing number of cases for each week
covid.week2 <- c(23, 40, 43, 34, 52, 29, 42)

covid.summary <- function(x){ #creating a function with one input, x, titled covid.summary
  out.total <- sum(x) #cumulative cases
  out.mean <- mean(x) #average
  out.max <- max(x) #maximum
  out.min <- min(x) #minimum
  out.percent <- mean((x[-1] - x[-7])*100 / x[-7]) #average daily percentage change
  out.final <- c(out.total, out.mean, out.max, out.min, out.percent) #final output
  names(out.final) <- c("Cumulative Cases", "Average Number", "Maximum", "Minimum", "Average Per. Change") #labeling the output, be careful with the ordering! 
  return(out.final) #return function will call the output here
}

covid.summary(covid.week1) #Calling the function covid.summary supplying covid.week1 vector as an argument
##    Cumulative Cases      Average Number             Maximum             Minimum 
##          206.000000           29.428571           35.000000           21.000000 
## Average Per. Change 
##            6.685366
covid.summary(covid.week2)
##    Cumulative Cases      Average Number             Maximum             Minimum 
##           263.00000            37.57143            52.00000            23.00000 
## Average Per. Change 
##            19.00347

There are different ways in which you can define your arguments. e.g. You can create a function with multiple arguments/inputs. Let’s say you are asked to calculate weekly percentage change in average number of new reported cases for Toronto.

covid.weekly <- function(w1, w2) { #two arguments defined as w1 and w2 that would be referred to in the function
  out.percent <- (mean(w2) - mean(w1))*100/mean(w1)
  names(out.percent) <- "Weekly Percentage Change"
  return(out.percent)
}

covid.weekly(covid.week1, covid.week2) #supply covid.week1 and covid.week2 for w1 and w2, respectively
## Weekly Percentage Change 
##                  27.6699

Data

  • We don’t really save our data manually with vectors. More often than not, we import external files to R. Often it’s either a .csv (Excel), .txt (Text), .dta (STATA) or RData files. RData is a collection of R objects (i.e. R output).
  • R will automatically try to import data from your working directory. Check the Files tab to see your current working directory. Ideally, you’d want to have your project files in one designated place. Do not save to / import from your Desktop - it’ll be chaotic.
  • There are different ways in which you can assign/change your working directory. Below is one of them. Know that it exists as a method, and use it just for now, until I introduce a new technique toward the end.
setwd("C:/Users/Semuhi/Documents/University of Toronto/Fall 2020/TA - Statistics for Political Scientists/Tutorials/Week 1") #another built-in function to set working directory
getwd() #check working directory
  • Please download the datasets from Imai’s website for Chapter 1. Place these files in your working directory.
UNpop <- read.csv("UNpop.csv") #read.csv is a built-in function to help us import csv. files into an R object. Don't forget to assign this data to a specific object, otherwise it'll not be stored in your environment.
class(UNpop)
length(UNpop)

load("UNpop.RData") #for Rdata, we use load function. 
  • Check your Environment tab. UNpop object is a data.frame with 2 variables and 7 observations. Data frame is an R object of collection of vectors. In this case, it’s as if there are two vectors/columns (therefore the length of UNpop is 2) merged into a data frame.
year <- seq(from=1950, to=2010, by=10)
world.pop <- c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
UNpop1 <- data.frame(year, world.pop) #data.frame is another built-in function that creates data frames.
View(UNpop) #view function to check the data.
View(UNpop1) #Compare UNpop and UNpop1.

Let’s work on this data a bit.

ncol(UNpop) #gives number of columns in a data frame
## [1] 2
nrow(UNpop) #gives number of rows in a data frame
## [1] 7
head(UNpop) #returns upper left part of an object; good for quick views
##   year world.pop
## 1 1950   2525779
## 2 1960   3026003
## 3 1970   3691173
## 4 1980   4449049
## 5 1990   5320817
## 6 2000   6127700
summary(UNpop) #descriptive statistics for this data frame (mean, median, percentile, etc.)
##       year        world.pop      
##  Min.   :1950   Min.   :2525779  
##  1st Qu.:1965   1st Qu.:3358588  
##  Median :1980   Median :4449049  
##  Mean   :1980   Mean   :4579529  
##  3rd Qu.:1995   3rd Qu.:5724258  
##  Max.   :2010   Max.   :6916183

Please familiarize yourself with $ operator. That allows us to access variables/columns from data frames and individual elements from objects.

### $ Operator and Brackets
UNpop$year #accessing year column
## [1] 1950 1960 1970 1980 1990 2000 2010
(descriptive <- summary(UNpop$world.pop))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 2525779 3358588 4449049 4579529 5724258 6916183
UNpop[,"year"] #extract year column: using brackets with a comma to separate rows and columns
## [1] 1950 1960 1970 1980 1990 2000 2010
UNpop[c(1,2,3),"year"] #extract first three rows of the year column
## [1] 1950 1960 1970

Let’s create a new dataset for running some descriptive statistics. Assume that this is a dataset on schools in an imaginary province that lays out information on their status, funding, and number of teachers.

### Expand Grid
schools <- expand.grid(status=c("Public", "Private"), 
                       funding=seq(1500, 2000, by=100), 
                       teacher=c(seq(5,15,by=5),NA)) 
#expand.grid is a built-in function that creates a data frame with all possible combinations of given vectors.
head(schools)
##    status funding teacher
## 1  Public    1500       5
## 2 Private    1500       5
## 3  Public    1600       5
## 4 Private    1600       5
## 5  Public    1700       5
## 6 Private    1700       5
mean(schools$teacher) #not gonna work because of the missing values NA
## [1] NA
mean(schools$teacher, na.rm = TRUE) #TRUE and FALSE are logical statements in R. na.rm option allows you to discard them for taking the average.
## [1] 10
mean(na.omit(schools$teacher)) # na.omit(x) function suppresses the observations with missing data.
## [1] 10

You can save/export your data to your working directory.

### Saving Data
write.csv(schools, file="school.csv") #saving as csv file
save(schools, file="school.RData") #saving as RData

Packages

R package is a collection of coding, data, and documentation to expand R functionalities. You can think of them as apps we install on our phones. Our phone can make a call, send a text, etc., but with extra apps, you can shoot a TikTok video… I will introduce to you some of the important packages over the course of the term. You can also create your own package later!

In order to use these packages (say apps if you want), we must install them first. One useful package is “foreign” that allows us to import data from other statistical software such as STATA.

### Install Packages
install.packages("foreign") #Write this in console, not in script; you need to install this only once in your computer. 
library("foreign") #Once installed, you must use the library command in your R script for each R session. 
## It's like each time you need to use an app, you must tap on the icon, right? Library function does just that. 

Another quite useful package is here. Instead of using setwd() , you can simply use this package. Because it creates path relative to your top-level directory, it helps you to make your script easily reproducible. If the first line of your script is setwd("C:), Jennry Bryan might come and set your computer on fire. It’s not cool.

Instead, use the here package.

### Install Packages
install.packages("here")
library("here")

Other Syntax and Operators

An operator helps us with mathematical and logical manipulations.

  • The built-in operators for arithmetic operators are +, -, *, /, ^ etc.
  • Relational and logical operators: >, <, ==, <=, >=, != . ! introduces negation. != means not equal. &, | are logical AND - OR operators.
  • Other operators: : is a colon operator that implies series in a sequence. %in% denotes if an element belongs to a vector or not.

You may find some examples below.

### Operators
schools$status[!(schools$funding<1900)] #list status of schools with funding more than 1900.
#Let's calculate total number of teachers in public schools.
sum(schools$teacher[schools$status=="Public"], na.rm=TRUE) #access teachers in public schools. Notice the use of double brackets. This is subsetting with a logical operation with "=="

sum(schools$teacher[schools$status=="Public" & schools$funding>1900], na.rm=TRUE) #total number of teachers in public schools with funding more than 1900. 

Other Tips and Shortcuts

  • If you cannot run your R code and keep getting errors, it is either you forgot to add a column or comma somewhere, or it’s just you need to “turn it off and on again”. Go to Session>Restart R. But that means you must run the whole code again.

  • You can comment out blocks of code by selecting lines of code and use the following shortcut: Ctrl + Shift + C. You can take it back with the same shortcut.

  • You can edit several lines at the same time by pressing ALT.

  • Get familiar with some of these RStudio shortcuts. It’ll make your life easier.

  • You can create a heading in the navigation button at the end of your R script tab by adding at least 4 #### or - - - - at the end of your lines.

  • If you cannot figure something out, just Google it. 80% of learning how to code is just that.

  • rm(list=ls()) will clear user-defined objects in your environment. It does not restart your session. You’ll see some people putting at the beginning of their code. If someone else runs your code, as they have some objects stored in their session, you’d just erase everything. Not cool. Don’t use it in your R script. If you want to clear your environment anyway, use it in the Console tab.

A Sample R Script

##--------------------------------------------------------------##
##                       Tutorial #1                            ##
##                    Introduction to R                         ##
##                     Semuhi Sinanoglu                         ##
##                        Sep. 2020                             ##
##--------------------------------------------------------------##

#install.packages(c('foreign', 'here'))

library(foreign)
library(here)

## Get data ------------------------------------------------------
UNpop <- read.csv("UNpop.csv")
UNpop.analysis <- UNpop #keep the original in the environment and use the new one for data manipulation

## Descriptive Statistics ----------------------------------------
summary(UNpop.analysis)

Project-based Workflow

  • For every new research project/homework, I highly encourage you to start an R project. Each project will be self-contained and easily reproducible, especially used with here package. It’ll help you to have a file system structure in the sense that all of your files for a project will be stored in a designated folder.

  • Go to File->New Project->New Directory->New Project and then create your new project in a designated folder.

  • Rproj can help you access your other files stored in the working directory in the Files tab. It also allows us to switch to recent projects. Check the right corner of your screen for a scrolldown with the Rproj icon.