## You might need to install this. To do so, run this:
## install.packages("fixest")
library(fixest)
6 Running regressions in R
One of the most common and flexible ways to analyze data is using a linear regression model. While R comes with the lm function to run regressions, I recommend the fixest package. This uses the same syntax as lm but has a bunch of features that make many things simpler and nicer.
To begin, we will use a data set on the SAT scores of NYC public schools:
nyc <- read.csv("data/nyc_sat.csv")
str(nyc)
'data.frame': 435 obs. of 22 variables:
$ school_id : chr "02M260" "06M211" "01M539" "02M294" ...
$ school_name : chr "Clinton School Writers and Artists" "Inwood Early College for Health and Information Technologies" "New Explorations into Science, Technology and Math High School" "Essex Street Academy" ...
$ borough : chr "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
$ building_code : chr "M933" "M052" "M022" "M445" ...
$ street_address : chr "425 West 33rd Street" "650 Academy Street" "111 Columbia Street" "350 Grand Street" ...
$ city : chr "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
$ state : chr "NY" "NY" "NY" "NY" ...
$ zip_code : int 10001 10002 10002 10002 10002 10002 10002 10002 10002 10002 ...
$ latitude : num 40.8 40.9 40.7 40.7 40.7 ...
$ longitude : num -74 -73.9 -74 -74 -74 ...
$ phone_number : chr "212-695-9114" "718-935-3660 " "212-677-5190" "212-475-4773" ...
$ start_time : chr "" "8:30 AM" "8:15 AM" "8:00 AM" ...
$ end_time : chr "" "3:00 PM" "4:00 PM" "2:45 PM" ...
$ student_enrollment : int NA 87 1735 358 383 416 255 545 329 363 ...
$ percent_white : chr "" "3.4%" "28.6%" "11.7%" ...
$ percent_black : chr "" "21.8%" "13.3%" "38.5%" ...
$ percent_hispanic : chr "" "67.8%" "18.0%" "41.3%" ...
$ percent_asian : chr "" "4.6%" "38.5%" "5.9%" ...
$ average_score_sat_math : int NA NA 657 395 418 613 410 634 389 438 ...
$ average_score_sat_reading: int NA NA 601 411 428 453 406 641 395 413 ...
$ average_score_sat_writing: int NA NA 601 387 415 463 381 639 381 394 ...
$ percent_tested : chr "" "" "91.0%" "78.9%" ...
## Looks like NYC :-)
plot(latitude ~ longitude, data = nyc)
6.0.0.1 Exercise
Whenever working with a new dataset, it is good to plot variables to get a sense of the data. In this class, a lot of data is clean already, but in the wild, you will see a lot of weirdness. For example, the Census Bureau oftentimes codes NA in monthly income as 9999. If you don’t look at the data and instead just run regressions, your results will be really wonky from these “very high income earners”.
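As a quick illustration of handling a sentinel value like this, here is a sketch using a made-up income vector (not actual Census data): we recode the sentinel to NA before computing anything.

```r
## Hypothetical monthly income vector where 9999 means "missing"
income <- c(1200, 9999, 3400, 9999, 2100)

## Recode the sentinel to NA so it no longer distorts summaries
income[income == 9999] <- NA

mean(income)               # NA, because R propagates missing values
mean(income, na.rm = TRUE) # mean of the real observations only
```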
To familiarize ourselves with this data, let’s plot a few key variables.
- First, create a histogram of average_score_sat_math and average_score_sat_reading
- Print out percent_white; what’s the problem here?
- Create a scatter plot comparing SAT reading score with SAT math score. Please use quality x and y labels. Additionally, use a high-quality title. Try and write a descriptive title that gives your reader a “key takeaway”.
6.1 First regression by hand
First, let’s use some code to fix the percent variables. I simply delete the % from the string and then convert to a number:
nyc$percent_white <- as.numeric(gsub("%", "", nyc$percent_white))
nyc$percent_black <- as.numeric(gsub("%", "", nyc$percent_black))
nyc$percent_hispanic <- as.numeric(gsub("%", "", nyc$percent_hispanic))
nyc$percent_asian <- as.numeric(gsub("%", "", nyc$percent_asian))
nyc$percent_tested <- as.numeric(gsub("%", "", nyc$percent_tested))
Now, let’s do our first regression by hand. We want to regress average SAT reading score on average SAT math score. Recall the formula for the OLS coefficients \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are given by
\[ \hat{\beta}_1 = \frac{\text{cov}(X, Y)}{\text{var}(X)} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
Let’s grab the math and reading scores and put them in variables x and y for ease of writing. But first, we need to drop NAs in either x or y. To do this, we will grab the two columns from nyc and then use the na.omit() function to drop any rows that contain an NA in any variable.
## need to drop NAs manually
vars <- nyc[, c("average_score_sat_math", "average_score_sat_reading")]
vars <- na.omit(vars)
x <- vars$average_score_sat_math
y <- vars$average_score_sat_reading
## No NAs
sum(is.na(x)) + sum(is.na(y))
[1] 0
Now that x and y contain no NAs, we can calculate the regression coefficients.
## Calculate regression by hand
ybar <- mean(y)
xbar <- mean(x)
var_x <- var(x)
cov_x_y <- cov(x, y)

beta_1_hat <- cov_x_y / var_x
beta_0_hat <- ybar - xbar * beta_1_hat
Let’s plot our data and our estimated regression line. We can plot the regression line using abline(). abline is similar to points/lines in that it needs to come after a call to plot. abline takes two required arguments: a for the intercept and b for the slope.
plot(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
abline(
  a = beta_0_hat, b = beta_1_hat,
  col = "purple", lwd = 2
)
6.1.1 Forecasting
We can use our fitted model to produce fitted values. For example, if we want to predict the average SAT reading score for a school with an average SAT math score of 450, we can do the following:
beta_0_hat + beta_1_hat * 450
[1] 438.12
Or, we can do in-sample prediction using our dataframe:
nyc$average_score_sat_reading_hat <- (beta_0_hat + beta_1_hat * nyc$average_score_sat_math)
These, of course, will all fall on our line of best fit:
plot(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
abline(
  a = beta_0_hat, b = beta_1_hat,
  col = "purple", lwd = 2
)
points(average_score_sat_reading_hat ~ average_score_sat_math, data = nyc, col = "purple")
Another interesting property of regression is that \((\bar{X}, \bar{Y})\) will fall on the regression line. That is, the fitted value for a school with SAT math score of \(\bar{X}\) is \(\bar{Y}\):
ybar == beta_0_hat + beta_1_hat * xbar
[1] TRUE
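A word of caution: this check used exact equality (==), which happened to work here, but floating-point arithmetic can introduce tiny rounding errors even when the math is exact. A safer habit is all.equal(), which compares up to a small numerical tolerance. A classic illustration:

```r
## Exact algebra, but floating-point arithmetic can introduce tiny errors
a <- 0.1 + 0.2
a == 0.3                  # FALSE: a is stored as 0.30000000000000004
isTRUE(all.equal(a, 0.3)) # TRUE: equal up to numerical tolerance
```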
6.1.1.1 Exercise
- Predict the SAT reading score for a school with an SAT math score of 600
6.2 First regression by function
Now, let’s compare our estimates to the ones produced by R’s built-in function lm. The lm function operates very similarly to plot: it takes two arguments, the formula y ~ x and the data argument. I will store the results of lm in a variable called est_lm so that I can use it in later functions.
## When lines get too long, use new lines to make reading the code easier
est_lm <- lm(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
print(est_lm)
Call:
lm(formula = average_score_sat_reading ~ average_score_sat_math,
data = nyc)
Coefficients:
(Intercept) average_score_sat_math
78.8797 0.7983
cat(paste0("\nbeta_0_hat = ", round(beta_0_hat, 4)))
beta_0_hat = 78.8797
cat(paste0("\nbeta_1_hat = ", round(beta_1_hat, 4)))
beta_1_hat = 0.7983
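Rather than reading numbers off the printout, you can extract them programmatically with coef(), which returns a named vector (the same call works on est_lm). A minimal sketch on a small made-up dataset:

```r
## Tiny made-up dataset for illustration
df <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
fit <- lm(y ~ x, data = df)

## coef() returns a named vector of estimates
b <- coef(fit)
b["(Intercept)"] # 0.15 for this data
b["x"]           # 1.94 for this data
```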
If we want more information, we pass est_lm to summary:
summary(est_lm)
Call:
lm(formula = average_score_sat_reading ~ average_score_sat_math,
data = nyc)
Residuals:
Min 1Q Median 3Q Max
-139.868 -9.486 2.426 13.195 71.166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.87972 7.26968 10.85 <2e-16 ***
average_score_sat_math 0.79831 0.01656 48.19 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.05 on 373 degrees of freedom
(60 observations deleted due to missingness)
Multiple R-squared: 0.8616, Adjusted R-squared: 0.8613
F-statistic: 2323 on 1 and 373 DF, p-value: < 2.2e-16
As mentioned above, we will use the fixest package in this class. The function feols is equivalent to the lm function in base R and takes the same formula and data arguments. I will store the results of feols in a variable called est so that I can use it in later functions. To display the results, I will use the print function.
## Using fixest package
est <- feols(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
NOTE: 60 observations removed because of NA values (LHS: 60, RHS: 60).
print(est)
OLS estimation, Dep. Var.: average_score_sat_reading
Observations: 375
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.879717 7.269680 10.8505 < 2.2e-16 ***
average_score_sat_math 0.798312 0.016565 48.1936 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 23.0 Adj. R2: 0.861257
6.2.1 Forecasting
To forecast fitted values, we will use the predict function. The predict function takes as its first argument the result of lm or feols. In addition, we pass the newdata argument, a data.frame of values at which to predict. The newdata argument must contain the same \(X\) variable(s) that we used in our regression.
# Predict yhat in-sample
nyc$yhat <- predict(est, newdata = nyc)
plot(
  average_score_sat_reading ~ average_score_sat_math, data = nyc
)
abline(coef = coef(est), col = "orange")
points(
  yhat ~ average_score_sat_math, data = nyc,
  col = "orange"
)
We can also forecast for different values by creating a new data.frame containing hypothetical values:
predict_df <- data.frame(average_score_sat_math = seq(300, 750, by = 10))
predict_df$yhat <- predict(est, newdata = predict_df)
plot(yhat ~ average_score_sat_math, data = predict_df)
6.2.1.1 Exercise
- Run a regression of average_score_sat_math as the outcome variable and percent_white as the predictor variable using feols. Interpret the coefficients in words.
Answer:
- Create a plot of the raw data as well as the fitted regression line. Make sure each axis has high-quality labels.
6.2.2 Regression with indicator variables
feols(average_score_sat_math ~ 0 + i(borough), data = nyc)
NOTE: 60 observations removed because of NA values (LHS: 60).
OLS estimation, Dep. Var.: average_score_sat_math
Observations: 375
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
borough::Bronx 404.357 6.82985 59.2044 < 2.2e-16 ***
borough::Brooklyn 416.404 6.47607 64.2989 < 2.2e-16 ***
borough::Manhattan 455.888 7.16687 63.6104 < 2.2e-16 ***
borough::Queens 462.362 8.13954 56.8045 < 2.2e-16 ***
borough::Staten Island 486.200 21.38083 22.7400 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 67.2 Adj. R2: 0.114643
feols(average_score_sat_math ~ i(borough), data = nyc)
NOTE: 60 observations removed because of NA values (LHS: 60).
OLS estimation, Dep. Var.: average_score_sat_math
Observations: 375
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 404.3571 6.82985 59.20435 < 2.2e-16 ***
borough::Brooklyn 12.0465 9.41203 1.27991 2.0138e-01
borough::Manhattan 51.5305 9.90005 5.20508 3.2223e-07 ***
borough::Queens 58.0052 10.62540 5.45911 8.7887e-08 ***
borough::Staten Island 81.8429 22.44519 3.64634 3.0411e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 67.2 Adj. R2: 0.117004
feols(average_score_sat_math ~ i(borough, ref = "Manhattan"), data = nyc)
NOTE: 60 observations removed because of NA values (LHS: 60).
OLS estimation, Dep. Var.: average_score_sat_math
Observations: 375
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 455.88764 7.16687 63.610428 < 2.2e-16 ***
borough::Bronx -51.53050 9.90005 -5.205076 3.2223e-07 ***
borough::Brooklyn -39.48397 9.65937 -4.087634 5.3483e-05 ***
borough::Queens 6.47468 10.84510 0.597014 5.5086e-01
borough::Staten Island 30.31236 22.55003 1.344227 1.7970e-01
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 67.2 Adj. R2: 0.117004
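These three regressions are the same model in different parameterizations: without an intercept, each coefficient is simply that borough’s mean math score; with an intercept, the intercept is the reference borough’s mean and the other coefficients are differences from it. Here is a self-contained check with made-up grouped data, using base R’s factor dummies rather than fixest’s i() so it runs without the package:

```r
## Made-up grouped data: three groups with means 2, 11, and 21
df <- data.frame(
  g = factor(rep(c("A", "B", "C"), each = 3)),
  y = c(1, 2, 3, 10, 11, 12, 20, 21, 22)
)

## No intercept: one coefficient per group, equal to the group means
coef(lm(y ~ 0 + g, data = df)) # gA = 2, gB = 11, gC = 21

## With intercept: reference group's mean, plus differences from it
coef(lm(y ~ g, data = df))     # (Intercept) = 2, gB = 9, gC = 19

## Same numbers directly from the data
tapply(df$y, df$g, mean)
```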
6.2.3 Interactions
workers <- read.csv("data/cps.csv")
str(workers)
'data.frame': 15992 obs. of 9 variables:
$ age : int 45 21 38 48 18 22 48 18 48 45 ...
$ education: int 11 14 12 6 8 11 10 11 9 12 ...
$ black : int 0 0 0 0 0 0 0 0 0 0 ...
$ hispanic : int 0 0 0 0 0 0 0 0 0 0 ...
$ married : int 1 0 1 1 1 1 1 0 1 1 ...
$ nodegree : int 1 0 0 1 1 1 1 1 1 0 ...
$ re74 : num 21517 3176 23039 24994 1669 ...
$ re75 : num 25244 5853 25131 25244 10728 ...
$ re78 : num 25565 13496 25565 25565 9861 ...
workers$college <- 1 - workers$nodegree
For interactions between two discrete variables, we need a special syntax:
feols(re78 ~ i(black, i.college), workers)
OLS estimation, Dep. Var.: re78
Observations: 15,992
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12817.225 146.402 87.548265 < 2.2e-16 ***
black::0:college::1 3153.177 173.126 18.213228 < 2.2e-16 ***
black::1:college::0 -2152.322 445.900 -4.826912 1.3995e-06 ***
black::1:college::1 216.947 396.580 0.547045 5.8436e-01
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 9,510.4 Adj. R2: 0.02795
For interactions between a discrete variable and a continuous variable:
feols(re78 ~ 0 + i(black) + i(black, age), workers)
OLS estimation, Dep. Var.: re78
Observations: 15,992
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
black::0 10610.336 247.64760 42.84449 < 2.2e-16 ***
black::1 7791.504 869.97852 8.95597 < 2.2e-16 ***
black::0:age 134.108 7.06422 18.98411 < 2.2e-16 ***
black::1:age 129.046 25.24742 5.11126 3.237e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 9,499.7 Adj. R2: 0.030079
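The i(black) + i(black, age) specification fits a separate intercept and a separate age slope for each group, which is equivalent to running the regression within each group separately. A small check with made-up data, again using base R formula syntax (g + g:x) instead of fixest’s i():

```r
## Made-up data: two groups with different intercepts and slopes
set.seed(1)
df <- data.frame(
  g = factor(rep(c("a", "b"), each = 50)),
  x = rep(1:50, 2)
)
df$y <- ifelse(df$g == "a", 1 + 2 * df$x, 5 + 0.5 * df$x) + rnorm(100)

## One regression with group-specific intercepts and slopes
full <- lm(y ~ 0 + g + g:x, data = df)

## Equivalent to two separate regressions, one per group
fit_a <- lm(y ~ x, data = subset(df, g == "a"))
fit_b <- lm(y ~ x, data = subset(df, g == "b"))

coef(full) # ga, gb, ga:x, gb:x match the per-group fits
```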