Introduction to R: Tutorial

Intensive Statistics R Teaching

This post serves as the primary tutorial notes for the “Introduction to R” sessions of the Intensive Statistics course for MCom (Economics) students at Stellenbosch University (2024). This lecture is intended to offer a cursory introduction to enable students to perform basic operations in R and RStudio.

Wihan Marais
2024-01-23

Tutorial materials can be downloaded from here.

The tutorial is accompanied by a short assignment to be completed by students.

This post was last updated on 04 August 2024.

Why R?

  1. R is open-source.

    • Free updates and dissemination.

    • Widespread availability of helpful resources like Stack Overflow.

  2. R uses packages.

    • R consists of Base R coupled with third-party libraries of pre-written code, or packages. You do not need to reinvent the wheel.

  3. R uses predictive coding (Ctrl/Cmd + Space is very useful).

  4. R is compatible with Markdown.

    • Author, connect to data, and run code in R Markdown, RStudio’s native authoring framework for data science.

    • See this 1-minute video summary of what R Markdown entails.

Setup

Before you start, please proceed with the following steps, and download and install the necessary software on your machine:

  1. R (The programming language you will be using)

“R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.”

Select the appropriate link corresponding to your machine’s operating system and follow the subsequent instructions:

Execute the newly downloaded R-4.3.2-win.exe file and follow the instructions. If you encounter any difficulties, please ask for help!

  2. RStudio Desktop (The program you will be using to interface with the R code)

“RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.”

Once you have completed downloading and installing R, do the same for RStudio. Execute the newly downloaded RStudio-2023.12.0-369.exe file and follow the subsequent instructions.

  3. After completing Steps 1 and 2, open up RStudio. It should look something like this:

This is the default RStudio Desktop workspace. The Console is on the left, the Environment pane is top-right and the Files pane is bottom-right. You can change the layout of your workspace in the taskbar with View > Panes > Pane Layout. You can change the aesthetics of your workspace in the taskbar with Tools > Global Options > Appearance.

  4. File > New Project > New directory > New project > Choose directory location and name1

It is advisable to work from an R project, as this keeps track of your most current workspace and variable environment. Its location will serve as your “root folder” or main directory for subsequent operations in the project. To work from a particular project, open it with File > Open Project or start your session in RStudio by clicking the relevant .Rproj file.

To ensure that you are working from the correct project, check whether your project’s name appears in the top right-hand corner.

Once operating from a new project, you should be faced with the default workspace, as illustrated in Step 3 above. The pane on the left is called the Console. As in Stata’s Command window, code can be typed and executed directly in your console.

Try: Enter the code getwd() in your console to observe the location of your working directory, which is also the location of your project.
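
For example (the printed path below is purely illustrative; yours will point to wherever you created your project):

getwd()
[1] "C:/Users/student/Documents/intro_to_r"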

Alternatively, code can be stored in and executed from a script or .R file, much like do-files in Stata.

  5. Open a new script in the taskbar with File > New File > R Script or with Ctrl/Cmd + Shift + N. Save the .R file in your working directory with File > Save or Ctrl/Cmd + S.

After creating and saving a new script, you should observe a workspace layout like the one summarised in Table 1.

Table 1: Workspace Layout
Location       Function
Top left       Script/RMarkdown
Top right      Global environment
Bottom left    Console
Bottom right   Files/Plots/Packages/Help

Basics

With the software installed and operating, and your new project and script created, you are ready to get started with R and RStudio. Proceed by copying the code from this post, or by working from the tutorial’s corresponding master script available here.

NB: Highlighted code in the console or script can be executed using Ctrl/Cmd + Enter. When working from a script, output will typically appear in the console.

Packages

We rely on packages to access pre-written functions performing specific tasks.2 Packages can be downloaded and installed from CRAN, the official repository of packages, or GitHub, which often acts as a repository for third-party libraries or beta versions.

Installing, loading and maintaining packages can be tedious. Packages need only be installed with install.packages("x") once, after which only library(x) is necessary.

# First install the package from CRAN
install.packages("pacman")

# Load the installed package into your workspace
library(pacman)

If a package x has already been installed and loaded, install.packages("x") will produce a warning that the package is in use and will not be installed.

It can be difficult to keep track of all the packages on your machine. The pacman package allows us to easily install and load packages from CRAN with p_load(), and from GitHub with p_load_gh(). This will install and load a new package, or merely load previously installed ones. Let’s install the packages we will be using in this tutorial.

# From CRAN
p_load(fixest, tidyverse, huxtable, modelsummary, glue, skimr) # or
pacman::p_load(fixest, tidyverse, huxtable, modelsummary, glue, skimr)

# From GitHub, specifying "profile/repository name"
p_load_gh("BlakeRMills/MetBrewer")

NB: Functions are called using the function’s name, e.g., p_load(), or its full name, e.g., pacman::p_load(). The latter is useful when functions from different packages share the same name.
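
For instance, filter() exists in both the stats and dplyr packages. Once the tidyverse is loaded, the double colon makes your intention explicit; a small sketch using the built-in mtcars data:

# filter() from base R's stats package: applies a linear (moving-average) filter
stats::filter(1:10, rep(1/3, 3))

# filter() from dplyr: subsets rows of a data frame
dplyr::filter(mtcars, mpg > 30)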

Directories can be viewed in the bottom-right pane, in addition to plot outputs, currently loaded packages, and help files. Should you ever require help or additional information regarding a specific command, add a ? before that command and run the code, for example:

?pacman::p_load()

Environment

R is an object-oriented language. Objects of various classes (scalars, matrices, data frames, vectors, etc.) can be stored in memory for later use. Once named and saved, these objects will appear in your global environment.

We use the assignment operators <- or = to name and save objects.

Shortcut: Alt/Option + Minus to get <-

# object name <- (or =) value(s)
a <- 10
hello <- "Hello world!"
test = TRUE

# Determine the class of an object
class(a)
[1] "numeric"
class(hello)
[1] "character"
class(test)
[1] "logical"

By executing the relevant code, objects should appear in your global environment like this:

Report these variables in your output by running the following:

a
[1] 10
# or
print(hello)
[1] "Hello world!"
# or by using the glue package for something more fancy
glue::glue("It's {test}. I saved a variable which contains {hello} and I stored the number {a}.")
It's TRUE. I saved a variable which contains Hello world! and I stored the number 10.

Arrays

An array object is equivalent to a vector of values of the same class. Arrays can be created by concatenating values using the function c().

x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
z <- c(7, 8, 9, 10)

# Useful functions to perform on arrays/vectors
sum(x)
[1] 10
min(x)
[1] 1
median(x)
[1] 2.5
# summary() provides a summary of the functions above
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.75    2.50    2.50    3.25    4.00 
# Missing values denoted by NA
x_with_missing <- c(1, 2, 3, NA)

# Take care to properly treat missing values:
sum(x_with_missing)
[1] NA
sum(x_with_missing, na.rm = T)
[1] 6

Data frames

Data frames, consisting of rows and columns, are the workhorse of statistical analysis in R. They can be created in various ways. Rows and columns can also be named.

# data.frame() can create columns from arrays and assign column names
df_1 <- data.frame(A = x, B = y, C = z)

# Some useful operations
colnames(df_1)
[1] "A" "B" "C"
df_1_copy <- df_1
colnames(df_1_copy) <- c("col1", "col2", "col3")
colnames(df_1_copy)
[1] "col1" "col2" "col3"
nrow(df_1)
[1] 4
ncol(df_1)
[1] 3
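
Rows can be named in much the same way as columns; a brief sketch using the copied data frame from above:

# Assign row names and reference a cell by name
rownames(df_1_copy) <- c("r1", "r2", "r3", "r4")
df_1_copy["r2", "col1"]
[1] 2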

Specific rows, columns, and cells can be referenced as follows:

# Return column "A" as a vector
df_1$A 
[1] 1 2 3 4
# df_1[row no., column no.] - empty implies all
df_1[ , 1]
[1] 1 2 3 4
# Using tidyverse's pipe operator %>%
df_1 %>% pull(A) 
[1] 1 2 3 4

Shortcut: Ctrl/Cmd + Shift + M to get %>%

There are numerous ways to subset a data frame.

# Similarly with rows
# Return row 2 as a single row data frame
df_1[2, ]

# Return row 2-3 as a two row data frame 
df_1[2:3, ]

# Return cell in row 2 column 1
df_1[2, 1]
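
The tidyverse offers equivalent, and often more readable, subsetting verbs; a sketch of some common ones:

# dplyr equivalents of the base R subsetting above
df_1 %>% slice(2)        # return row 2
df_1 %>% slice(2:3)      # return rows 2-3
df_1 %>% filter(A > 1)   # return rows meeting a condition
df_1 %>% select(A, B)    # return columns by name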

Let’s create a new column named “D”:

# Create a new column "D" that is the sum of A and B
df_1$D <- df_1$A + df_1$B

# is the same as
df_1 <- df_1 %>% mutate(D = A + B)

Reading data

Data is often imported from external files, such as .csv files. Let’s consider an example. If you haven’t done so already, download the tut_materials.zip file using this link. Extract the files in the .zip folder and copy the data folder into your working directory.

Instead of repeatedly reading the data from the source files, it is wise to save it as a data frame. Now, read the data from the Ireland_energy.csv file in the data folder using read.csv(), and create a data frame called ire_energy with an assignment operator. The file contains Ireland’s energy consumption for 1980-2018.

ire_energy <- read.csv(file = "data/Ireland_energy.csv", header = TRUE)
# The "file" argument refers to the relative file path from your root directory
# The "header" argument is set to true because the .csv file contains column headings

Do the same for a corresponding file containing data on Ireland’s population from Ireland_population.csv, and create a data frame called ire_pop.

ire_pop <- read.csv("data/Ireland_population.csv")
# Sometimes it's unnecessary to spell out the arguments

When calling your function, access the tip with Ctrl/Cmd + Space. Note the potential arguments for the particular function and some arguments’ default settings, e.g., header = TRUE.

We now have two data frames. Take a peek at the first 5 observations in ire_energy and the last 5 observations in ire_pop with the following commands:

head(ire_energy, 5) # same as ire_energy[1:5, ]
tail(ire_pop, 5) # same as ire_pop[(nrow(ire_pop) - 4): nrow(ire_pop), ]

# skim() can provide useful overviews of data frames
skimr::skim(ire_pop)

Manipulating data

We can merge the two data frames on the basis of some common value, e.g. Year, to create a single data frame called ireland_df.

ireland_df <- merge(x = ire_energy, y = ire_pop, by.x = "Year", by.y = "Year") 
# merge() merges data frames x and y on the basis of some column
# by.x for x's column and by.y for y's column
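
Since both data frames share an identically named Year column, the tidyverse's join functions offer an equivalent route. A sketch, where inner_join() keeps only years present in both data frames, as merge() does by default:

# dplyr equivalent of the merge() above
ireland_joined <- ire_energy %>% 
  inner_join(ire_pop, by = "Year")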

Thereafter, create and save a new column called ln_energy_pc, representing the natural logarithm of Ireland’s per capita energy consumption.

ireland_df <- ireland_df %>% 
  mutate(ln_energy_pc = log(GJ/Population))
# You should recognise mutate() from before
# log()'s default setting implies natural logarithmic transformation

# Instead of tidyverse piping, you could have done this:
ireland_df <- mutate(.data = ireland_df, ln_energy_pc = log(GJ/Population))

# But piping is more useful when you require multiple consecutive operations
# For example, everything we've done thus far could've been condensed
ireland_df <- read.csv("data/Ireland_energy.csv") %>% 
  merge(x = ., # full stop represents the result of all previous operations
        y = read.csv("data/Ireland_population.csv"),
        by.x = "Year",
        by.y = "Year") %>% 
  mutate(ln_energy_pc = log(GJ/Population))

It is sensible to ensure that your data frame’s variables are in the appropriate format or class. For example, ireland_df’s Year column is of the class integer.

ireland_df$Year %>% 
  class(.)
[1] "integer"

Many useful functions require that your vectors/columns be of class date. Transform the Year column from an integer to a date.

ireland_df <- ireland_df %>%  
  mutate(Year = glue::glue("{Year}-01-01")) 
# creates  "1980-01-01" instead of 1980 
# but the result is of class `character` 

ireland_df <- ireland_df %>%
  mutate(Year = as.Date(x = Year, format = "%Y-%m-%d"))
# as.Date() renders characters of a given format into dates
# "%Y-%m-%d" means yyyy-mm-dd

# Confirm that data is chronological
ireland_df <- ireland_df %>% 
  arrange(Year) # this is equivalent to sorting by Year

Let’s see what our new data frame ireland_df looks like. Run view(ireland_df) or click the appropriate icon in your Global Environment to view the data frame in your workspace.

When reporting your data frame in a .Rmd or Markdown document, you can create a table using the huxtable package.

ireland_df %>%
  filter(Year > as.Date("2013-01-01")) %>%
  # subsets data for entries after 2013
  
  as_hux() %>% 
  # or huxtable::as_hux() to transform data frame into huxtable object
  # hereafter code to define certain aesthetic qualities of our table
  
  theme_basic() %>%
  # use a theme to make tables more presentable, 
  # e.g. theme_article() or theme_compact()

  set_number_format(col = c(2,4),value = 2) %>%
  # set number of decimals to 2 in the 2nd and 4th column
  
  set_font_size(10) %>% 
  set_caption("Ireland's energy consumption after 2013")
Table 2: Ireland's energy consumption after 2013
Year        GJ            Population  ln_energy_pc
2014-01-01  603311823.70  4662713     4.86
2015-01-01  632644273.00  4708040     4.90
2016-01-01  665078014.40  4762595     4.94
2017-01-01  674283626.40  4813138     4.94
2018-01-01  695195813.70  4876547     4.96

Writing data

We can now write this data frame back to a .csv file. Let’s save it in the data folder in our working directory.

ireland_df %>% # data frame to be written to csv
  write.csv(x = ., # ireland_df is piped into "."
            file = "data/ireland_complete.csv", # file path and file name we choose
            row.names = FALSE) # because we have no row names
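
As a quick sanity check, you might read the file straight back and confirm that its dimensions match; a small sketch:

# Read the newly written file back into memory
check_df <- read.csv("data/ireland_complete.csv")
nrow(check_df) == nrow(ireland_df)
[1] TRUE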

Time series analysis

Let’s perform some time series analysis with this data, particularly with respect to Ireland’s energy consumption per capita.

Visualising time series

It is always useful to eyeball the data before proceeding with formal analysis. Here are some examples using Base R:

# plot()'s default is a scatterplot
# inputting a single vector
ireland_df$ln_energy_pc %>%
  plot(.) 
# undefined x-axis
# inputting a data frame with two columns
ireland_df %>% 
  select(Year, ln_energy_pc) %>% 
  plot(.)
# lines instead of points
ireland_df %>% 
  select(Year, ln_energy_pc) %>%
  plot(., type = "l")

As opposed to visualisations with Base R, I prefer ggplot. It offers a comprehensive suite of graphs that can be flexibly tweaked to your liking. ggplot is part of the so-called tidyverse. To illustrate, let us make a line graph by sequentially adding layers to our “canvas”.

# ggplot() creates an empty 'canvas'
ireland_df %>% 
  ggplot() 
# aes() provides the coordinates (or "mapping")
ireland_df %>% 
  ggplot(aes(x = Year, y = ln_energy_pc)) 
# on the canvas, we add layers with pluses (+)
# add one of many existing themes
ireland_df %>% 
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw()
# add a scatterplot
# it inherits the previously defined mapping
ireland_df %>% 
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point()
# add a line
ireland_df %>% 
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point() +
  geom_line()
# add the appropriate labels
# add more breaks to the x-axis and change its labels
# change the range of the y-axis
ireland_df %>% 
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point() +
  geom_line() +
  labs(title = "Ireland's Primary Energy Consumption", 
       y = "ln(GJs Per Capita)", 
       x = "Date") +
  scale_x_date(date_labels = "`%y", 
               date_breaks = "2 year") +
  scale_y_continuous(limits = c(4, 5.5))
# and customise as you please
ireland_df %>% 
  ggplot(aes(x = Year, y = ln_energy_pc)) +
  theme_bw() +
  geom_point(aes(color = ifelse(Year < as.Date("2000-01-01"), "Before 2000", 
                                ifelse(Year > as.Date("2000-01-01"), "After 2000", "2000"))), 
             size = 1.5) +
  geom_line(alpha = 0.5, 
            color = "lightgrey", 
            size = 1) +
  labs(title = "Ireland's Primary Energy Consumption", 
       y = "ln(GJs Per Capita)", 
       x = "Date") +
  scale_x_date(date_labels = "`%y", 
               date_breaks = "2 year") +
  scale_y_continuous(limits = c(4, 5.5)) +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
    axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0), 
                                size = 10),
    axis.title.x = element_text(margin = margin(t = 0, r = 0, b = 0, l = 0), 
                                size = 10),
    axis.text.x = element_text(angle = 45),
    legend.position = "bottom", 
    legend.margin = margin(t = -10, r = 0, b = 0, l = 0),
    legend.title = element_blank()) +
  geom_label(data = . %>% filter(Year == as.Date("2000-01-01")),
             aes(label = round(ln_energy_pc,1)),
             nudge_y = 0.15,
             size = 3,
             color = met.brewer("Austria", type = "discrete")[1]) +
  geom_hline(aes(color = "Mean", yintercept = mean(ln_energy_pc)),
             size = 1,
             linetype = "dashed",
             show.legend = F) +
  scale_color_manual(values = met.brewer("Austria", type = "discrete"))
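
To keep a finished figure, ggplot2's ggsave() writes the most recently rendered plot to disk. The file name and dimensions below are merely illustrative:

# Save the last plot to your working directory
ggsave("ireland_energy.png", width = 8, height = 5, dpi = 300)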

Autocorrelation

Now that we know what the series looks like, proceed to compute and plot the autocorrelation and partial autocorrelation functions of Ireland’s per capita energy consumption.

ireland_df %>% 
  select(ln_energy_pc) %>% # isolate ln_energy_pc in data frame
  acf(plot = T, # create a plot
      type = "correlation") # standard ACF
ireland_df %>% 
  select(ln_energy_pc) %>% 
  acf(plot = T,
      type = "partial") # PACF option

Unit root tests

Use the urca package to test for stationarity by performing Augmented Dickey-Fuller (ADF) tests with ur.df().

# load the urca package
p_load(urca)

# ur.df() requires a vector/array
# you should recognise pull() from before
test_vector <- ireland_df %>% 
  pull(ln_energy_pc)

my_ADF1 <- ur.df(
  y = test_vector, # vector
  type = "trend", # type  of ADF - trend + constant
  lags = 5, # max number of lags
  selectlags = "AIC") # lag selection criteria
  
# use summary() to present the saved ADF object
# summary() wraps many different kinds of objects
summary(my_ADF1) 

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression trend 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.087796 -0.019856  0.009542  0.029923  0.045227 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.3937065  0.2278056   1.728   0.0950 .
z.lag.1     -0.0797608  0.0498511  -1.600   0.1208  
tt           0.0001319  0.0010831   0.122   0.9039  
z.diff.lag1  0.1723326  0.1771983   0.973   0.3391  
z.diff.lag2  0.3115141  0.1794628   1.736   0.0936 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03842 on 28 degrees of freedom
Multiple R-squared:  0.296, Adjusted R-squared:  0.1954 
F-statistic: 2.943 on 4 and 28 DF,  p-value: 0.03778


Value of test-statistic is: -1.6 1.6437 2.1039 

Critical values for test statistics: 
      1pct  5pct 10pct
tau3 -4.15 -3.50 -3.18
phi2  7.02  5.13  4.31
phi3  9.31  6.73  5.61

Is the null hypothesis rejected? How about an ADF test that specifies only a constant/drift?

my_ADF2 <- ur.df(
  y = test_vector,
  type = "drift", # type  of ADF - with drift
  lags = 5, 
  selectlags = "AIC")

summary(my_ADF2) 

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.088270 -0.019198  0.009114  0.030175  0.044932 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.37689    0.17811   2.116   0.0430 *
z.lag.1     -0.07569    0.03633  -2.083   0.0461 *
z.diff.lag1  0.16583    0.16608   0.999   0.3263  
z.diff.lag2  0.30364    0.16454   1.845   0.0752 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03776 on 29 degrees of freedom
Multiple R-squared:  0.2956,    Adjusted R-squared:  0.2228 
F-statistic: 4.057 on 3 and 29 DF,  p-value: 0.01595


Value of test-statistic is: -2.0834 2.5446 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1  7.06  4.86  3.94
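
If the null of a unit root cannot be rejected in levels, a common next step is to test the first difference of the series; a sketch under the same settings:

# Test the first difference of per capita energy consumption
my_ADF3 <- ur.df(
  y = diff(test_vector), # diff() computes first differences
  type = "drift",
  lags = 5,
  selectlags = "AIC")

summary(my_ADF3)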

Cross section analysis

The fixest package aids the estimation of various kinds of regression models. Let’s run a few simple OLS regressions to illustrate. Our example tries to answer the question: what bearing did education have on wages in 1974? We will also control for age and marital status.

Instead of sourcing .csv files locally, we can download such files directly from the internet. As before, we can use read.csv() to create a data frame called cs_df.

# Replace a local file path with a web address
# Subset the data to only those observations in 1974
# To restrict memory usage, select only the relevant columns
cs_df <- read.csv('https://raw.githubusercontent.com/stata2r/stata2r.github.io/main/data/cps_long.csv') %>% 
  filter(year == 1974) %>% 
  select(wage, educ, age, marr)

Descriptive statistics

Before we run the regressions, we should probably get a better picture of the data we are dealing with. Take note of the dimensions of the sample and its variables’ types and distributions.

# Get an overview of the sample
# Do you notice any issues?
skimr::skim(cs_df)
Table 3: Data summary
Name cs_df
Number of rows 15992
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
wage 0 1 14016.80 9569.80 0 4403.45 15123.58 23584.18 25862.32 ▆▂▃▃▇
educ 0 1 12.03 2.87 0 11.00 12.00 13.00 18.00 ▁▁▂▇▂
age 0 1 33.23 11.05 16 24.00 31.00 42.00 55.00 ▆▇▆▅▅
marr 0 1 0.71 0.45 0 0.00 1.00 1.00 1.00 ▃▁▁▁▇
# Get an impression of wage by marital status
cs_df %>% 
  select(wage, marr) %>% 
  group_by(marr) %>% 
  skimr::skim()
Table 3: Data summary
Name Piped data
Number of rows 15992
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables marr

Variable type: numeric

skim_variable marr n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
wage 0 0 1 7681.53 8325.02 0 386.47 4325.08 12987.49 25862.32 ▇▂▂▁▂
wage 1 0 1 16582.74 8818.61 0 10237.66 18677.69 25781.50 25862.32 ▃▂▂▃▇

Regressions

We can perform standard OLS regressions with fixest’s feols() function. This function requires two arguments, formula (or fml) and data. Formulas are given in the format y ~ x1 + x2, and data refers to a data frame.

# Our first model
model1 <- feols(fml = wage ~ educ, data = cs_df)

# Adding an explanatory continuous variable: age
model2 <- feols(wage ~ educ + age, cs_df)

# Adding a categorical variable
model3 <- feols(wage ~ educ + age + factor(marr), cs_df)

# As before, use summary() to display the results of model1
summary(model1)
OLS estimation, Dep. Var.: wage
Observations: 15,992
Standard-errors: IID 
             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 10280.103   324.5497 31.6750 < 2.2e-16 ***
educ          310.679    26.2467 11.8369 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 9,527.8   Adj. R2: 0.008624

Visualising results

Consider using the huxreg() function from the huxtable package. It enables the neat presentation of multiple regression models in a single table. Let’s consider the standard version first.

huxreg(model1, model2, model3)
Table 4:
                (1)             (2)             (3)
(Intercept)     10280.103 ***   -4194.788 ***   -4175.649 ***
                (324.550)       (381.879)       (365.526)
educ            310.679 ***     493.332 ***     439.230 ***
                (26.247)        (23.960)        (22.977)
age                             369.539 ***     256.760 ***
                                (6.228)         (6.650)
factor(marr)1                                   6152.185 ***
                                                (160.809)
N               15992           15992           15992
R2              0.009           0.188           0.256
logLik          -169209.949     -167618.468     -166918.049
AIC             338423.898      335242.937      333844.097
*** p < 0.001; ** p < 0.01; * p < 0.05.

However, you may want to improve the look of your regression table with some aesthetic adjustments, as we did for Table 2. Please note all of the optional arguments to the huxreg() function itself.

huxreg("Model 1" = model1, "Model 2" = model2, "Model 3" = model3,
       statistics = c("N" = "nobs", "R-squared" = "r.squared"),
       stars = c(`*` = 0.1, `**` = 0.05, `***` = 0.01, `****` = 0.001),
       number_format = 2,
       coefs = c(
         "Education" = "educ",
         "Age" = "age",
         "Married" = "factor(marr)1")) %>%
set_font_size(8) %>%
set_caption("My Regression Table")
Table 5: My Regression Table
            Model 1         Model 2         Model 3
Education   310.68 ****     493.33 ****     439.23 ****
            (26.25)         (23.96)         (22.98)
Age                         369.54 ****     256.76 ****
                            (6.23)          (6.65)
Married                                     6152.18 ****
                                            (160.81)
N           15992           15992           15992
R-squared   0.01            0.19            0.26
**** p < 0.001; *** p < 0.01; ** p < 0.05; * p < 0.1.

fixest also contains various plotting functions based on your regressions. For example, consider the following coefficient plot which graphs the coefficients for each model:

# Notice that models need to be entered as a list() object
coefplot(list(model1, model2, model3))

Interaction effects

You may also need to run regressions using interaction effects. Consider the following example where the effect of education is moderated by marital status:

# Same as before, but "*" denotes an interaction
model4 = feols(wage ~ educ*factor(marr), data = cs_df)

# What do our results say?
summary(model4)
OLS estimation, Dep. Var.: wage
Observations: 15,992
Standard-errors: IID 
                     Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)          -45.0497   585.6946 -0.076917 9.3869e-01    
educ                 642.0020    47.5094 13.513149  < 2.2e-16 ***
factor(marr)1      14178.7540   676.8164 20.949189  < 2.2e-16 ***
educ:factor(marr)1  -438.3299    54.8425 -7.992522 1.4117e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8,614.8   Adj. R2: 0.18942
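
Reading the coefficients together: for unmarried respondents, an additional year of education is associated with roughly 642 more in wages, whereas for married respondents the slope is the sum of the main effect and the interaction, 642.00 - 438.33 ≈ 203.67.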

We can visualise the differential effects of education on wages by marital status using ggplot and geom_smooth(). The latter can draw separate regression lines for married and unmarried respondents.

cs_df %>% 
  ggplot(aes(x = educ, y = wage)) +
  theme_bw() +
  geom_point(alpha = 0.5) + # creates a scatterplot
  geom_smooth(formula = y ~ x, # x, y inherited from aes()
              method = "lm", # specifies linear model
              aes(color = factor(marr)), # creates two regression lines
              se = T, # display confidence interval
              level = 0.95) + # confidence level to 95%
  theme(legend.position = "bottom") +
  labs(y = "Wage", x = "Years of Education", color = "Married", 
       title = "The effect of education on wage by marital status")

End

That was a cursory introduction to R and RStudio with a focus on some basic operations, data visualisation, and a little bit of econometrics. I hope it was useful! You can consult the Acknowledgements and Further Reading sections for additional resources. Thank you for your attention and please feel free to reach out if you encounter any issues.

Acknowledgements

Lecture notes are compiled from the following resources:

Further Reading

Should you need additional resources to get started, try the following:


  1. If you are planning on applying version control to your new project, it is useful to check the Create git repository box.↩︎

  2. Packages are extensions to the R language. They contain code, data, and documentation in a standardised collection format that can be installed by users, typically via a centralised software repository such as CRAN (The Comprehensive R Archive Network). CRAN is a network of FTP (file transfer protocol) and web servers around the world that store identical, up-to-date versions of code and documentation for R.↩︎

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/WihanZA/wihan_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Marais (2024, Jan. 23). Wihan Marais: Introduction to R: Tutorial. Retrieved from https://www.wihanza.com/posts/2024-01-23-introduction-to-r-tutorial/

BibTeX citation

@misc{marais2024introduction,
  author = {Marais, Wihan},
  title = {Wihan Marais: Introduction to R: Tutorial},
  url = {https://www.wihanza.com/posts/2024-01-23-introduction-to-r-tutorial/},
  year = {2024}
}