This post serves as the primary tutorial notes for the “Introduction to R” sessions of the Intensive Statistics course for MCom (Economics) students at Stellenbosch University (2024). This lecture is intended to offer a cursory introduction to enable students to perform basic operations in R and RStudio.
Tutorial materials can be downloaded from here.
The tutorial is accompanied by a short assignment to be completed by students.
This post was last updated on 04 August 2024.
R is open-source.
Free updates and dissemination.
Widespread availability of helpful resources like stackoverflow.
R uses packages.
R uses predictive coding (Ctrl/Cmd + Space
is very useful).
R is compatible with Markdown.
Author, connect to data, and run code in R Markdown, RStudio’s native authoring framework for data science.
See this 1-minute video summary of what R Markdown entails.
Before you start, please proceed with the following steps, and download and install the necessary software on your machine:
“R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.”
Select the appropriate link corresponding to your machine’s operating system and follow the subsequent instructions:
Execute the newly downloaded R-4.3.2-win.exe
file and follow the instructions. If you encounter any difficulties, please ask for help!
R
code)“RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.”
Once you have completed downloading and installing R
, do the same for RStudio. Execute the newly downloaded RStudio-2023.12.0-369.exe
file and follow the instructions that follow.
File
> New Project
> New directory
> New project
> Choose directory location and name
1It is advisable to work from a R
project, as this keeps track of your most current workspace and variable environment. Its location will serve as your “root folder” or main directory for subsequent operations in the project. To work from a particular project, open a new project with File
> Open Project
or starting your session in R Studio by clicking the relevant .Rproj
file.
Once operating from a new project, you should be faced with the default workspace, as illustrated in Step 3 above. The pane on the left is called the Console. As in Stata
, code can be typed and executed directly in your console (or “Command” in Stata
.)
Try: Enter the code
getwd()
in your console to observe the location of your working directory, the location of your project.
Alternatively, this code can be stored in- and executed from a script or .R
file, much like do-files in Stata
.
File
> New File
> R Script
or with Ctrl/Cmd + Shift + N
. Save the .R
file in your working directory with File
> Save
or Ctrl/Cmd + S
.After creating and saving a new script, you should observe a workspace layout like the one summarised in Table 1.
Location | Function |
---|---|
Top left | Script/RMarkdown |
Top right | Global environment |
Bottom left | Console |
Bottom right | Files/Plots/Packages/Help |
With the software installed and operating, and your new project and script created, you are ready to get started with R
and R Studio. Proceed by copying the code from this post, or by working from the tutorial’s corresponding master script available here.
NB: Highlighted code in the console or script can be executed using
Ctrl/Cmd + Enter
. When working from a script, output will typically appear in the console.
We rely on packages to access pre-written functions performing specific tasks.2 Packages can be downloaded and installed from CRAN, the official repository of packages, or GitHub, which often acts as an repository for third-party libraries or beta-versions.
Installing, loading and maintaining packages can be tedious. Packages need only be installed with install.packages("x")
once, after which only load(x)
is necessary.
# First install the package from CRAN
install.packages("pacman")
# Load the installed package into your workspace
library(pacman)
If a package x
has been installed already, install.packages("x")
will produce the following error:
It can be difficult to keep track of all the packages on your machine. The pacman
package allows us to easily install and load packages from CRAN with p_load()
, and from GitHub with p_load_gh()
. This will install and load a new package, or merely load previously installed ones. Let’s install the packages we will be using in this tutorial.
NB: Functions are called using the function’s name, e.g.,
p_load()
, or its full name, e.g.,pacman::p_load()
. The latter is useful when functions from different packages share the same name.
Directories can be viewed to the bottom-right pane, in addition to plot outputs, currently loaded packages, and help files. Should you ever require help or additional information regarding a specific command, add a ?
before that command and run the code, for example:
?pacman::p_load()
R
is an object-orientated language. Objects of various classes (scalars, matrices, data frames, vectors, etc.) can be stored in memory for later use. Once named and saved, these objects will appear in your global environment.
We use the assignment operators <-
or =
to name and save objects.
Shortcut:
Alt/Option + Minus
to get<-
# object name <- (or =) value(s)
a <- 10
hello <- "Hello world!"
test = TRUE
# Determine the class of an object
class(a)
[1] "numeric"
class(hello)
[1] "character"
class(test)
[1] "logical"
By executing the relevant code, objects should appear in your global environment like this:
Report these variables in your output by running the following:
a
[1] 10
# or
print(hello)
[1] "Hello world!"
# or by using the glue package for something more fancy
glue::glue("It's {test}. I saved a variable which contains {hello} and I stored the number {a}.")
It's TRUE. I saved a variable which contains Hello world! and I stored the number 10.
An array object is equivalent to a vector of values from the same class. Arrays can be created by concatenating values using the function c()
.
x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
z <- c(7, 8, 9, 10)
# Useful functions to perform on arrays/vectors
sum(x)
[1] 10
min(x)
[1] 1
median(x)
[1] 2.5
# summary() provides a summary of the functions above
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.75 2.50 2.50 3.25 4.00
# Missing values denoted by NA
x_with_missing <- c(1, 2, 3, NA)
# Take care to properly treat missing values:
sum(x_with_missing)
[1] NA
sum(x_with_missing, na.rm = T)
[1] 6
Data frames, consisting of rows and columns, are the workhorse of statistical analysis in R
. They can be created in various ways. Rows and columns can also be named.
# data.frame() can create columns from arrays and assign column names
df_1 <- data.frame(A = x, B = y, C = z)
# Some useful operations
colnames(df_1)
[1] "A" "B" "C"
[1] "col1" "col2" "col3"
nrow(df_1)
[1] 4
ncol(df_1)
[1] 3
Specific rows, columns, and cells can be referenced as follows:
# Return column "A" as a vector
df_1$A
[1] 1 2 3 4
# df_1[row no., column no.] - empty implies all
df_1[ , 1]
[1] 1 2 3 4
# Using tidyverse's pipe operator %>%
df_1 %>% pull(A)
[1] 1 2 3 4
Shortcut:
Ctrl/Cmd + Shift + M
to get %>%
There are numerous ways to subset a data frame.
# Similarly with rows
# Return row 2 as a single row data frame
df_1[2, ]
# Return row 2-3 as a two row data frame
df_1[2:3, ]
# Return cell in row 2 column 1
df_1[2, 1]
Let’s create a new column named “D”:
# Create a new column "D" that is the sum of A and B
df_1$D <- df_1$A + df_1$B
# is the same as
df_1 <- df_1 %>% mutate(D = A + B)
Data is often imported from external files, such as .csv
files. Let’s consider an example. If you haven’t done so already, download the tut_materials.zip
file using this link. Extract the files in the .zip
folder and copy the data
folder into your working directory.
Instead of repeatedly reading the data from the source files, it is wise to create a data frame. Now, read the data from the Ireland_energy.csv
file in the data
folder using read.csv()
, and create a data frame called ire_energy
with an assignment operator. The data represents Ireland’s energy consumption data for 1980-2018.
ire_energy <- read.csv(file = "data/Ireland_energy.csv", header = TRUE)
# The "file" argument refers to the relative file path from your root directory
# The "header" argument is set to true because the .csv file contains column headings
Do the same for a corresponding file containing data on Ireland’s population from Ireland_population.csv
, and create a data frame called ire_pop
.
ire_pop <- read.csv("data/Ireland_population.csv")
# Sometimes it's unnecessary to spell out the arguments
We now have two data frames. Take a peak at the first 5 observations in ire_energy
and the last 5 observations in ire_pop
with the following commands:
We can merge the two data frames on the basis of some common value, e.g. Year
, to create a single one called ireland_df
.
ireland_df <- merge(x = ire_energy, y = ire_pop, by.x = "Year", by.y = "Year")
# merge() merges data frames x and y on the basis of some column
# by.x for x's column and by.y for y's column
Thereafter, create and save a new column called ln_energy_pc
, representing the natural logarithm of Ireland’s per capita energy consumption.
ireland_df <- ireland_df %>%
mutate(ln_energy_pc = log(GJ/Population))
# You should recognise mutate() from before
# log()'s default setting implies natural logarithmic transformation
# Instead of tidyverse piping, you could have done this:
ireland_df <- mutate(.data = ireland_df, ln_energy_pc = log(GJ/Population))
# But piping is more useful when you require multiple consecutive operations
# For example, everything we've done thus far could've been condensed
ireland_df <- read.csv("data/Ireland_energy.csv") %>%
merge(x = ., # full stop represents the result of all previous operations
y = read.csv("data/Ireland_population.csv"),
by.x = "Year",
by.y = "Year") %>%
mutate(ln_energy_pc = log(GJ/Population))
It is sensible to ensure that data frame’s variables are in the appropriate format or class. For example, ireland_df
’s Year
column is of the class integer
.
ireland_df$Year %>%
class(.)
[1] "integer"
Many useful functions require that your vectors/columns be of class date
. Transform the Year
column from an integer to a date.
ireland_df <- ireland_df %>%
mutate(Year = glue::glue("{Year}-01-01"))
# creates "1980-01-01" instead of 1980
# but the result is of class `character`
ireland_df <- ireland_df %>%
mutate(Year = as.Date(x = Year, format = "%Y-%m-%d"))
# as.Date() renders characters of a given format into dates
# "%Y-%m-%d" means yyyy-mm-dd
# Confirm that data is chronological
ireland_df <- ireland_df %>%
arrange(Year) # this is equivalent to sorting by Year
Let’s see what our new data frame ireland_df
looks like. Run view(ireland_df)
or click the appropriate icon in your Global Environment to view the data frame in your workspace.
When reporting your data frame in a .Rmd
or Markdown document, you can create a table using the huxtable
package.
ireland_df %>%
filter(Year > as.Date("2013-01-01")) %>%
# subsets data for entries after 2013
as_hux() %>%
# or huxtable::as_hux() to transform data frame into huxtable object
# hereafter code to define certain aesthetic qualities of our table
theme_basic() %>%
# use a theme to make tables more presentable,
# e.g. theme_article() or theme_compact()
set_number_format(col = c(2,4),value = 2) %>%
# set number of decimals to 2 in the 2nd and 4th column
set_font_size(10) %>%
set_caption("Ireland's energy consumption after 2013")
Year | GJ | Population | ln_energy_pc |
---|---|---|---|
2014-01-01 | 603311823.70 | 4662713 | 4.86 |
2015-01-01 | 632644273.00 | 4708040 | 4.90 |
2016-01-01 | 665078014.40 | 4762595 | 4.94 |
2017-01-01 | 674283626.40 | 4813138 | 4.94 |
2018-01-01 | 695195813.70 | 4876547 | 4.96 |
We can now write this data frame back to a .csv
file. Let’s save it in the data
folder in our working directory.
ireland_df %>% # data frame to be written to csv
write.csv(x = ., # ireland_df is piped into "."
file = "data/ireland_complete.csv", # file path and file name we choose
row.names = FALSE) # because we have no row names
Let’s perform some time series analysis with this data, particularly with respect to Ireland’s energy consumption per capita.
It is always useful to eyeball the data before proceeding with formal analysis. Here are some examples using Base R:
# plot()'s default is a scatterplot
# inputting a single vector
ireland_df$ln_energy_pc %>%
plot(.)
# undefined x-axis
# inputting a data frame with two columns
ireland_df %>%
select(Year, ln_energy_pc) %>%
plot(.)
# lines instead of points
ireland_df %>%
select(Year, ln_energy_pc) %>%
plot(., type = "l")
As opposed to visualisations with Base R, I prefer ggplot. It offers a comprehensive suite of graphs that can be flexibly tweaked to your liking. ggplot
is part of the so-called tidyverse
. To illustrate, let us make a line graph and consider the sequential adding of layers to our “canvas”.
# ggplot() creates an empty 'canvas'
ireland_df %>%
ggplot()
# aes() provides the coordinates (or "mapping")
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc))
# on the canvas, we add layers with pluses (+)
# add one of many existing themes
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw()
# add a scatterplot
# it inherits the previously defined mapping
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point()
# add a line
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point() +
geom_line()
# add the appropriate labels
# add more breaks to the x-axis and change its labels
# change the range of the y-axis
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point() +
geom_line() +
labs(title = "Ireland's Primary Energy Consumption",
y = "ln(GJs Per Capita)",
x = "Date") +
scale_x_date(date_labels = "`%y",
date_breaks = "2 year") +
scale_y_continuous(limits = c(4, 5.5))
# and customise as you please
ireland_df %>%
ggplot(aes(x = Year, y = ln_energy_pc)) +
theme_bw() +
geom_point(aes(color = ifelse(Year < as.Date("2000-01-01"), "Before 2000",
ifelse(Year > as.Date("2000-01-01"), "After 2000", "2000"))),
size = 1.5) +
geom_line(alpha = 0.5,
color = "lightgrey",
size = 1) +
labs(title = "Ireland's Primary Energy Consumption",
y = "ln(GJs Per Capita)",
x = "Date") +
scale_x_date(date_labels = "`%y",
date_breaks = "2 year") +
scale_y_continuous(limits = c(4, 5.5)) +
theme(plot.title = element_text(hjust = 0.5, size = 10),
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0),
size = 10),
axis.title.x = element_text(margin = margin(t = 0, r = 0, b = 0, l = 0),
size = 10),
axis.text.x = element_text(angle = 45),
legend.position = "bottom",
legend.margin = margin(t = -10, r = 0, b = 0, l = 0),
legend.title = element_blank()) +
geom_label(data = . %>% filter(Year == as.Date("2000-01-01")),
aes(label = round(ln_energy_pc,1)),
nudge_y = 0.15,
size = 3,
color = met.brewer("Austria", type = "discrete")[1]) +
geom_hline(aes(color = "Mean", yintercept = mean(ln_energy_pc)),
size = 1,
linetype = "dashed",
show.legend = F) +
scale_color_manual(values = met.brewer("Austria", type = "discrete"))
Now that we know what the series looks like, proceed to compute and plot the autocorrelation- and partial autocorrelation function of Ireland’s per capita energy consumption.
ireland_df %>%
select(ln_energy_pc) %>% # isolate ln_energy_pc in data frame
acf(plot = T, # create a plot
type = "correlation") # standard ACF
ireland_df %>%
select(ln_energy_pc) %>%
acf(plot = T,
type = "partial") # PACF option
Use the urca
package to test for stationarity by performing Augmented Dickey-Fuller (ADF) tests with ur.df()
.
# load the urca package
p_load(urca)
# ur.df() requires a vector/array
# you should recognise pull() from before
test_vector <- ireland_df %>%
pull(ln_energy_pc)
my_ADF1 <- ur.df(
y = test_vector, # vector
type = "trend", # type of ADF - trend + constant
lags = 5, # max number of lags
selectlags = "AIC") # lag selection criteria
# use summary() to present the saved ADF object
# summary() wraps many different kinds of objects
summary(my_ADF1)
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression trend
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.087796 -0.019856 0.009542 0.029923 0.045227
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3937065 0.2278056 1.728 0.0950 .
z.lag.1 -0.0797608 0.0498511 -1.600 0.1208
tt 0.0001319 0.0010831 0.122 0.9039
z.diff.lag1 0.1723326 0.1771983 0.973 0.3391
z.diff.lag2 0.3115141 0.1794628 1.736 0.0936 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03842 on 28 degrees of freedom
Multiple R-squared: 0.296, Adjusted R-squared: 0.1954
F-statistic: 2.943 on 4 and 28 DF, p-value: 0.03778
Value of test-statistic is: -1.6 1.6437 2.1039
Critical values for test statistics:
1pct 5pct 10pct
tau3 -4.15 -3.50 -3.18
phi2 7.02 5.13 4.31
phi3 9.31 6.73 5.61
Is the null hypothesis rejected? How about an ADF test that specifies only a constant/drift?
my_ADF2 <- ur.df(
y = test_vector,
type = "drift", # type of ADF - with drift
lags = 5,
selectlags = "AIC")
summary(my_ADF2)
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Test regression drift
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.088270 -0.019198 0.009114 0.030175 0.044932
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.37689 0.17811 2.116 0.0430 *
z.lag.1 -0.07569 0.03633 -2.083 0.0461 *
z.diff.lag1 0.16583 0.16608 0.999 0.3263
z.diff.lag2 0.30364 0.16454 1.845 0.0752 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03776 on 29 degrees of freedom
Multiple R-squared: 0.2956, Adjusted R-squared: 0.2228
F-statistic: 4.057 on 3 and 29 DF, p-value: 0.01595
Value of test-statistic is: -2.0834 2.5446
Critical values for test statistics:
1pct 5pct 10pct
tau2 -3.58 -2.93 -2.60
phi1 7.06 4.86 3.94
The fixest
package aids the estimation of various kinds of regression models. Let’s run a few simple OLS regressions to illustrate. Our example tries to answer the question: What bearing does education have on wages in 1974?. We will also control for age and marital status.
Instead of sourcing .csv
files locally, we can download such files directly from the internet. As before, we can use read.csv()
to create a data frame called cs_df
.
# Replace a local file path with a web address
# Subset the data to only those observations in 1974
# To restrict memory usage, select only the relevant columns
cs_df <- read.csv('https://raw.githubusercontent.com/stata2r/stata2r.github.io/main/data/cps_long.csv') %>%
filter(year == 1974) %>%
select(wage, educ, age, marr)
Before we run the regressions, we should probably get a better picture of the data we are dealing with. Take note of the dimensions of the sample and its variables’ types and distributions.
# Get an overview of the sample
# Do you notice any issues?
skimr::skim(cs_df)
Name | cs_df |
Number of rows | 15992 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
wage | 0 | 1 | 14016.80 | 9569.80 | 0 | 4403.45 | 15123.58 | 23584.18 | 25862.32 | ▆▂▃▃▇ |
educ | 0 | 1 | 12.03 | 2.87 | 0 | 11.00 | 12.00 | 13.00 | 18.00 | ▁▁▂▇▂ |
age | 0 | 1 | 33.23 | 11.05 | 16 | 24.00 | 31.00 | 42.00 | 55.00 | ▆▇▆▅▅ |
marr | 0 | 1 | 0.71 | 0.45 | 0 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
# Get an impression of wage by marital status
cs_df %>%
select(wage, marr) %>%
group_by(marr) %>%
skimr::skim()
Name | Piped data |
Number of rows | 15992 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | marr |
Variable type: numeric
skim_variable | marr | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
wage | 0 | 0 | 1 | 7681.53 | 8325.02 | 0 | 386.47 | 4325.08 | 12987.49 | 25862.32 | ▇▂▂▁▂ |
wage | 1 | 0 | 1 | 16582.74 | 8818.61 | 0 | 10237.66 | 18677.69 | 25781.50 | 25862.32 | ▃▂▂▃▇ |
We can perform standard OLS regressions with fixest
’s feols()
function. This function requires two arguments, formula
(or fml
) and data
. Formulas are given in the format y ~ x1 + x2
, and data
refers to a data frame.
# Our first model
model1 <- feols(fml = wage ~ educ, data = cs_df)
# Adding an explanatory continuous variable: age
model2 <- feols(wage ~ educ + age, cs_df)
# Adding a categorical variable
model3 <- feols(wage ~ educ + age + factor(marr), cs_df)
# As before, use summary() to display the results of model1
summary(model1)
OLS estimation, Dep. Var.: wage
Observations: 15,992
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10280.103 324.5497 31.6750 < 2.2e-16 ***
educ 310.679 26.2467 11.8369 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 9,527.8 Adj. R2: 0.008624
Consider using the huxreg()
function from the huxtable
package. Its enables the neat presentation of multiple regression models in a single table. Let’s consider the standard version first.
huxreg(model1,model2,model3)
(1) | (2) | (3) | |
---|---|---|---|
(Intercept) | 10280.103 *** | -4194.788 *** | -4175.649 *** |
(324.550) | (381.879) | (365.526) | |
educ | 310.679 *** | 493.332 *** | 439.230 *** |
(26.247) | (23.960) | (22.977) | |
age | 369.539 *** | 256.760 *** | |
(6.228) | (6.650) | ||
factor(marr)1 | 6152.185 *** | ||
(160.809) | |||
N | 15992 | 15992 | 15992 |
R2 | 0.009 | 0.188 | 0.256 |
logLik | -169209.949 | -167618.468 | -166918.049 |
AIC | 338423.898 | 335242.937 | 333844.097 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
However, you may want to improve the look of your regression table with some aesthetic adjustments, as we did for Table 2. Please note all of the optional arguments to the huxreg()
function itself.
huxreg("Model 1" = model1, "Model 2" = model2, "Model 3" = model3,
statistics = c("N" = "nobs", "R-squared" = "r.squared"),
stars = c(`*` = 0.1, `**` = 0.05, `***` = 0.01, `****` = 0.001),
number_format = 2,
coefs = c(
"Education" = "educ",
"Age" = "age",
"Married" = "factor(marr)1")) %>%
set_font_size(8) %>%
set_caption("My Regression Table")
Model 1 | Model 2 | Model 3 | |
---|---|---|---|
Education | 310.68 **** | 493.33 **** | 439.23 **** |
(26.25) | (23.96) | (22.98) | |
Age | 369.54 **** | 256.76 **** | |
(6.23) | (6.65) | ||
Married | 6152.18 **** | ||
(160.81) | |||
N | 15992 | 15992 | 15992 |
R-squared | 0.01 | 0.19 | 0.26 |
**** p < 0.001; *** p < 0.01; ** p < 0.05; * p < 0.1. |
fixest
also contains various plotting functions based on your regressions. For example, consider the following coefficient plot which graphs the coefficients for each model:
# Notice that models need to be entered as a list() object
coefplot(list(model1, model2, model3))
You may also need to run regressions using interaction effects. Consider the following example where the effect of education is moderated by marital status:
# Same as before, but "*" denotes an interaction
model4 = feols(wage ~ educ*factor(marr), data = cs_df)
# What do our results say?
summary(model4)
OLS estimation, Dep. Var.: wage
Observations: 15,992
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) -45.0497 585.6946 -0.076917 9.3869e-01
educ 642.0020 47.5094 13.513149 < 2.2e-16 ***
factor(marr)1 14178.7540 676.8164 20.949189 < 2.2e-16 ***
educ:factor(marr)1 -438.3299 54.8425 -7.992522 1.4117e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8,614.8 Adj. R2: 0.18942
We can visualise the differential effects of education on wages by marital status using ggplot
and a plot called geom_smooth
. The latter can represent regression lines for married and unmarried respondents.
cs_df %>%
ggplot(aes(x = educ, y = wage)) +
theme_bw() +
geom_point(alpha = 0.5) + # creates a scatterplot
geom_smooth(formula = y ~ x, # x, y inherited from aes()
method = "lm", # specifies linear model
aes(color = factor(marr)), # creates two regression lines
se = T, # display confidence interval,
level = 0.95) + # confidence level to 95%
theme(legend.position = "bottom") +
labs(y = "Wage", x = "Years of Education", color = "Married",
title = "The effect of education on wage by marital status")
That was the cursory introduction to R
and R Studio with a focus on some basic operations, data visualisations, and a little bit of econometrics. I hope it was useful! You can consult the Acknowledgements and Further Reading sections for additional resources. Thank you for your attention and please feel free to reach out if you encounter any issues.
Lecture notes are compiled from the following resources:
R Intro (2018) by Grant R. McDermott and Ed Rubin.
Data Science for Economics and Finance: Getting you staRted (2021) by N.F. Katzke.
Should you need additional resources to get started, try the following:
Data Science Programming Methods (STAT 447) by Dirk Eddelbuettel (University of Illinois)
Data Science for Economists (EC 607) by Grant McDermott (University of Oregon).
Use your university credentials to sign up for a GitHub Pro account.
Download GitHub Desktop for free and use version control for all your projects.
Excellent data visualisation guide with R code at Data to Viz
If you are planning on applying version control to your new project, it is useful to check Create git repository
.↩︎
Packages are extensions to the R
language. They contain code, data, and documentation in a standardised collection format that can be installed by users, typically via a centralised software repository such as CRAN (The Comprehensive R Archive Network). CRAN is a network of ftp (file transfer protocol) and web servers around the world that store identical, up-to-date, versions of code and documentation for R.↩︎
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/WihanZA/wihan_distill, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Marais (2024, Jan. 23). Wihan Marais: Introduction to R: Tutorial. Retrieved from https://www.wihanza.com/posts/2024-01-23-introduction-to-r-tutorial/
BibTeX citation
@misc{marais2024introduction, author = {Marais, Wihan}, title = {Wihan Marais: Introduction to R: Tutorial}, url = {https://www.wihanza.com/posts/2024-01-23-introduction-to-r-tutorial/}, year = {2024} }