6  Running regressions in R

One of the most common and flexible ways to analyze data is with a linear regression model. While R comes with the lm function to run regressions, I recommend the fixest package. It uses the same syntax as lm but adds a number of features that make many tasks simpler and nicer.

## You might need to install this. To do so, run this:
## install.packages("fixest")
library(fixest)

To begin, we will use a data set on the SAT scores of NYC public schools.

nyc <- read.csv("data/nyc_sat.csv")
str(nyc)
'data.frame':   435 obs. of  22 variables:
 $ school_id                : chr  "02M260" "06M211" "01M539" "02M294" ...
 $ school_name              : chr  "Clinton School Writers and Artists" "Inwood Early College for Health and Information Technologies" "New Explorations into Science, Technology and Math High School" "Essex Street Academy" ...
 $ borough                  : chr  "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
 $ building_code            : chr  "M933" "M052" "M022" "M445" ...
 $ street_address           : chr  "425 West 33rd Street" "650 Academy Street" "111 Columbia Street" "350 Grand Street" ...
 $ city                     : chr  "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
 $ state                    : chr  "NY" "NY" "NY" "NY" ...
 $ zip_code                 : int  10001 10002 10002 10002 10002 10002 10002 10002 10002 10002 ...
 $ latitude                 : num  40.8 40.9 40.7 40.7 40.7 ...
 $ longitude                : num  -74 -73.9 -74 -74 -74 ...
 $ phone_number             : chr  "212-695-9114" "718-935-3660  " "212-677-5190" "212-475-4773" ...
 $ start_time               : chr  "" "8:30 AM" "8:15 AM" "8:00 AM" ...
 $ end_time                 : chr  "" "3:00 PM" "4:00 PM" "2:45 PM" ...
 $ student_enrollment       : int  NA 87 1735 358 383 416 255 545 329 363 ...
 $ percent_white            : chr  "" "3.4%" "28.6%" "11.7%" ...
 $ percent_black            : chr  "" "21.8%" "13.3%" "38.5%" ...
 $ percent_hispanic         : chr  "" "67.8%" "18.0%" "41.3%" ...
 $ percent_asian            : chr  "" "4.6%" "38.5%" "5.9%" ...
 $ average_score_sat_math   : int  NA NA 657 395 418 613 410 634 389 438 ...
 $ average_score_sat_reading: int  NA NA 601 411 428 453 406 641 395 413 ...
 $ average_score_sat_writing: int  NA NA 601 387 415 463 381 639 381 394 ...
 $ percent_tested           : chr  "" "" "91.0%" "78.9%" ...
## Looks like NYC :-)
plot(latitude ~ longitude, data = nyc)

6.0.0.1 Exercise

Whenever you work with a new dataset, it is good to plot variables to get a sense of the data. In this class, a lot of the data is already clean, but in the wild you will see a lot of weirdness. For example, the Census Bureau oftentimes codes missing monthly income as 9999. If you don’t look at the data and instead just run regressions, these “very high income earners” will make your results really wonky.

To familiarize ourselves with this data, let’s plot a few key variables.

  1. First, create a histogram of average_score_sat_math and one of average_score_sat_reading.
  1. Print out percent_white; what’s the problem here?
  1. Create a scatter plot comparing SAT reading score with SAT math score. Use high-quality x and y labels, and write a descriptive title that gives your reader a “key takeaway”.

6.1 First regression by hand

First, let’s use some code to fix the percent variables. I simply delete the % from each string and then convert the result to a number:

nyc$percent_white <- as.numeric(gsub("%", "", nyc$percent_white))
nyc$percent_black <- as.numeric(gsub("%", "", nyc$percent_black))
nyc$percent_hispanic <- as.numeric(gsub("%", "", nyc$percent_hispanic))
nyc$percent_asian <- as.numeric(gsub("%", "", nyc$percent_asian))
nyc$percent_tested <- as.numeric(gsub("%", "", nyc$percent_tested))
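The five nearly identical lines above can also be written as one loop over the column names. This is just a sketch on a toy data frame (df and pct_cols are illustrative names, not from the chapter); with the real data, you would loop over the percent_* columns of nyc instead:

```r
## Same cleaning as above, written as a loop over the percent columns.
## (Toy data frame for illustration; with the real data, use nyc.)
df <- data.frame(
  percent_white = c("", "3.4%", "28.6%"),
  percent_black = c("", "21.8%", "13.3%")
)
pct_cols <- c("percent_white", "percent_black")
for (col in pct_cols) {
  df[[col]] <- as.numeric(gsub("%", "", df[[col]]))
}
df$percent_white  # NA 3.4 28.6 (the empty string becomes NA)
```

Note that as.numeric("") returns NA with a coercion warning, which is exactly what we want for the blank entries.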

Now, let’s do our first regression by hand. We want to regress average SAT reading score on average SAT math score. Recall the formula for the OLS coefficients \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are given by

\[ \hat{\beta}_1 = \frac{\text{cov}(X, Y)}{\text{var}(X)} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

Let’s grab the math and reading scores and put them in variables x and y for ease of writing. But first, we need to drop NAs in either column. To do this, we will grab the two columns from nyc and then use the na.omit() function to drop any rows that contain an NA.

## need to drop NAs manually
vars <- nyc[, c("average_score_sat_math", "average_score_sat_reading")]
vars <- na.omit(vars) 
x <- vars$average_score_sat_math
y <- vars$average_score_sat_reading

## No NAs
sum(is.na(x)) + sum(is.na(y))
[1] 0

Now that x and y contain no NAs, we can compute the OLS coefficients:

## Calculate regression by hand
ybar <- mean(y)
xbar <- mean(x)
var_x <- var(x)
cov_x_y <- cov(x, y)

beta_1_hat <- cov_x_y / var_x
beta_0_hat <- ybar - xbar * beta_1_hat
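As a quick sanity check, the same two formulas recover a known slope and intercept when applied to made-up data that lies exactly on a line (a toy example, not part of the NYC analysis; separate variable names are used so we don’t clobber x and y):

```r
## Made-up data lying exactly on the line y = 2 + 3x
x_toy <- c(1, 2, 3, 4, 5)
y_toy <- 2 + 3 * x_toy

b1 <- cov(x_toy, y_toy) / var(x_toy)  # slope formula
b0 <- mean(y_toy) - mean(x_toy) * b1  # intercept formula
c(b0, b1)  # 2 3
```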

Let’s plot our data and our estimated regression line. We can plot the regression line using abline(). abline is similar to points/lines in that it needs to come after a call to plot. abline takes two required arguments: a for the intercept and b for the slope.

plot(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
abline(
  a = beta_0_hat, b = beta_1_hat, 
  col = "purple", lwd = 2
)

6.1.1 Forecasting

We can use our fitted model to produce fitted values. For example, if we want to predict the average SAT reading score for a school with an average SAT math score of 450, we can do the following:

beta_0_hat + beta_1_hat * 450
[1] 438.12

Or, we can do in-sample prediction using our dataframe:

nyc$average_score_sat_reading_hat <- (beta_0_hat + beta_1_hat * nyc$average_score_sat_math)

These, of course, will all fall on our line of best fit:

plot(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
abline(
  a = beta_0_hat, b = beta_1_hat, 
  col = "purple", lwd = 2
)
points(average_score_sat_reading_hat ~ average_score_sat_math, data = nyc, col = "purple")

Another interesting property of regression is that the point \((\bar{X}, \bar{Y})\) always falls on the regression line. That is, the fitted value for a school with an SAT math score of \(\bar{X}\) is \(\bar{Y}\). (Checking this with == works here, but exact comparison of floating-point numbers is fragile in general; all.equal() is the safer tool.)

ybar == beta_0_hat + beta_1_hat * xbar
[1] TRUE

6.1.1.1 Exercise

  1. Predict the SAT reading score for a school with an SAT math score of 600

6.2 First regression by function

Now, let’s compare our estimates to the ones produced by R’s built-in function lm. The lm function operates very similarly to plot: it takes two arguments: the formula y ~ x and the data argument. I will store the results of lm in a variable called est_lm so that I can use it in later functions.

## When lines get too long, use new lines to make reading the code easier
est_lm <- lm(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
print(est_lm)

Call:
lm(formula = average_score_sat_reading ~ average_score_sat_math, 
    data = nyc)

Coefficients:
           (Intercept)  average_score_sat_math  
               78.8797                  0.7983  
cat(paste0("\nbeta_0_hat = ", round(beta_0_hat, 4)))

beta_0_hat = 78.8797
cat(paste0("\nbeta_1_hat = ", round(beta_1_hat, 4)))

beta_1_hat = 0.7983
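Rather than eyeballing printed output, coefficients can be pulled out of a fitted model with coef(), which returns a named vector. A minimal sketch using R’s built-in cars data as a stand-in (since the NYC csv may not be at hand):

```r
## coef() returns a named vector of estimates
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)
b["(Intercept)"]
b["speed"]

## The by-hand formula gives the same slope
slope_by_hand <- cov(cars$speed, cars$dist) / var(cars$speed)
all.equal(unname(b["speed"]), slope_by_hand)  # TRUE
```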

If we want more information, we pass est_lm to summary:

summary(est_lm)

Call:
lm(formula = average_score_sat_reading ~ average_score_sat_math, 
    data = nyc)

Residuals:
     Min       1Q   Median       3Q      Max 
-139.868   -9.486    2.426   13.195   71.166 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            78.87972    7.26968   10.85   <2e-16 ***
average_score_sat_math  0.79831    0.01656   48.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.05 on 373 degrees of freedom
  (60 observations deleted due to missingness)
Multiple R-squared:  0.8616,    Adjusted R-squared:  0.8613 
F-statistic:  2323 on 1 and 373 DF,  p-value: < 2.2e-16

As mentioned above, we will use the fixest package in this class. The function feols is equivalent to the lm function in base R and takes the same formula and data arguments. I will store the results of feols in a variable called est so that I can use it in later functions. To display the results, I will use the print function.

## Using fixest package
est <- feols(
  average_score_sat_reading ~ average_score_sat_math,
  data = nyc
)
NOTE: 60 observations removed because of NA values (LHS: 60, RHS: 60).
print(est)
OLS estimation, Dep. Var.: average_score_sat_reading
Observations: 375
Standard-errors: IID 
                        Estimate Std. Error t value  Pr(>|t|)    
(Intercept)            78.879717   7.269680 10.8505 < 2.2e-16 ***
average_score_sat_math  0.798312   0.016565 48.1936 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 23.0   Adj. R2: 0.861257

6.2.1 Forecasting

To forecast fitted values, we will use the predict function. The predict function takes as its first argument the result of lm or feols. In addition, we will pass the newdata argument, which must contain the same \(X\) variable(s) that we used in our regression.

# Predict yhat in-sample
nyc$yhat <- predict(est, newdata = nyc)

plot(
  average_score_sat_reading ~ average_score_sat_math, data = nyc
)
abline(coef = coef(est), col = "orange")
points(
  yhat ~ average_score_sat_math, data = nyc, 
  col = "orange"
)

We can also forecast for different values by creating a new data.frame containing hypothetical values:

predict_df <- data.frame(average_score_sat_math = seq(300, 750, by = 10))
predict_df$yhat <- predict(est, newdata = predict_df)
plot(yhat ~ average_score_sat_math, data = predict_df)
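Under the hood, predict() is doing exactly the arithmetic we did by hand: \(\hat{\beta}_0 + \hat{\beta}_1 x\) for each row of newdata. A self-contained sketch with the built-in cars data:

```r
fit <- lm(dist ~ speed, data = cars)
new_df <- data.frame(speed = c(10, 20))

## predict() vs. computing beta_0_hat + beta_1_hat * x ourselves
p_auto <- predict(fit, newdata = new_df)
p_manual <- coef(fit)[1] + coef(fit)[2] * new_df$speed
all.equal(unname(p_auto), p_manual)  # TRUE
```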

6.2.1.1 Exercise

  1. Run a regression with average_score_sat_math as the outcome variable and percent_white as the predictor variable using feols. Interpret the coefficients in words.

Answer:

  1. Create a plot of the raw data as well as the fitted regression line. Make sure each axis has high-quality labels.

6.2.2 Regression with indicator variables

The i() function from fixest turns a categorical variable like borough into a set of indicator (dummy) variables. Dropping the intercept with 0 + gives each borough its own mean; keeping the intercept makes each coefficient a difference from the omitted reference category; and the ref argument chooses which category is omitted.

feols(average_score_sat_math ~ 0 + i(borough), data = nyc)
NOTE: 60 observations removed because of NA values (LHS: 60).
OLS estimation, Dep. Var.: average_score_sat_math
Observations: 375
Standard-errors: IID 
                       Estimate Std. Error t value  Pr(>|t|)    
borough::Bronx          404.357    6.82985 59.2044 < 2.2e-16 ***
borough::Brooklyn       416.404    6.47607 64.2989 < 2.2e-16 ***
borough::Manhattan      455.888    7.16687 63.6104 < 2.2e-16 ***
borough::Queens         462.362    8.13954 56.8045 < 2.2e-16 ***
borough::Staten Island  486.200   21.38083 22.7400 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 67.2   Adj. R2: 0.114643
feols(average_score_sat_math ~ i(borough), data = nyc)
NOTE: 60 observations removed because of NA values (LHS: 60).
OLS estimation, Dep. Var.: average_score_sat_math
Observations: 375
Standard-errors: IID 
                       Estimate Std. Error  t value   Pr(>|t|)    
(Intercept)            404.3571    6.82985 59.20435  < 2.2e-16 ***
borough::Brooklyn       12.0465    9.41203  1.27991 2.0138e-01    
borough::Manhattan      51.5305    9.90005  5.20508 3.2223e-07 ***
borough::Queens         58.0052   10.62540  5.45911 8.7887e-08 ***
borough::Staten Island  81.8429   22.44519  3.64634 3.0411e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 67.2   Adj. R2: 0.117004
feols(average_score_sat_math ~ i(borough, ref = "Manhattan"), data = nyc)
NOTE: 60 observations removed because of NA values (LHS: 60).
OLS estimation, Dep. Var.: average_score_sat_math
Observations: 375
Standard-errors: IID 
                        Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)            455.88764    7.16687 63.610428  < 2.2e-16 ***
borough::Bronx         -51.53050    9.90005 -5.205076 3.2223e-07 ***
borough::Brooklyn      -39.48397    9.65937 -4.087634 5.3483e-05 ***
borough::Queens          6.47468   10.84510  0.597014 5.5086e-01    
borough::Staten Island  30.31236   22.55003  1.344227 1.7970e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 67.2   Adj. R2: 0.117004
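The same three regressions can be run in base R with factor(), where relevel() plays the role of the ref argument. A sketch on made-up borough data (i() itself is fixest-specific):

```r
## Toy borough data (illustrative, not the NYC dataset)
b <- c("Bronx", "Brooklyn", "Bronx", "Manhattan", "Brooklyn", "Manhattan")
score <- c(400, 410, 404, 455, 420, 460)

## No intercept: one mean per borough, like 0 + i(borough)
coef(lm(score ~ 0 + factor(b)))

## With an intercept: differences from the first level (Bronx)
coef(lm(score ~ factor(b)))

## Changing the omitted category, like i(borough, ref = "Manhattan")
coef(lm(score ~ relevel(factor(b), ref = "Manhattan")))
```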

6.2.3 Interactions

workers <- read.csv("data/cps.csv")
str(workers)
'data.frame':   15992 obs. of  9 variables:
 $ age      : int  45 21 38 48 18 22 48 18 48 45 ...
 $ education: int  11 14 12 6 8 11 10 11 9 12 ...
 $ black    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hispanic : int  0 0 0 0 0 0 0 0 0 0 ...
 $ married  : int  1 0 1 1 1 1 1 0 1 1 ...
 $ nodegree : int  1 0 0 1 1 1 1 1 1 0 ...
 $ re74     : num  21517 3176 23039 24994 1669 ...
 $ re75     : num  25244 5853 25131 25244 10728 ...
 $ re78     : num  25565 13496 25565 25565 9861 ...
workers$college <- 1 - workers$nodegree

For interactions between two discrete variables, fixest uses a special syntax, i(var1, i.var2), which creates an indicator for each combination of the two variables relative to the omitted base group:

feols(re78 ~ i(black, i.college), workers)
OLS estimation, Dep. Var.: re78
Observations: 15,992
Standard-errors: IID 
                     Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)         12817.225    146.402 87.548265  < 2.2e-16 ***
black::0:college::1  3153.177    173.126 18.213228  < 2.2e-16 ***
black::1:college::0 -2152.322    445.900 -4.826912 1.3995e-06 ***
black::1:college::1   216.947    396.580  0.547045 5.8436e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 9,510.4   Adj. R2: 0.02795
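A fully interacted dummy regression reproduces the group means: the intercept is the mean of the base group and each coefficient shifts it. A sketch on made-up data (not the CPS data), using base R’s factor() interaction syntax:

```r
## Two binary groups with four cells; cell means are 5, 8, 6, 12
g1 <- c(0, 0, 0, 0, 1, 1, 1, 1)
g2 <- c(0, 0, 1, 1, 0, 0, 1, 1)
y  <- c(5, 5, 8, 8, 6, 6, 12, 12)

fit <- lm(y ~ factor(g1) * factor(g2))
coef(fit)
## (Intercept) = 5 : mean of the (g1 = 0, g2 = 0) base group
## factor(g1)1 = 1 : 6 - 5
## factor(g2)1 = 3 : 8 - 5
## interaction = 3 : 12 - (5 + 1 + 3)
```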

For interactions between a discrete variable and a continuous variable, we pass the continuous variable as the second argument of i(). Combined with 0 + i(black), this estimates a separate intercept and a separate slope for each group:

feols(re78 ~ 0 + i(black) + i(black, age), workers)
OLS estimation, Dep. Var.: re78
Observations: 15,992
Standard-errors: IID 
              Estimate Std. Error  t value  Pr(>|t|)    
black::0     10610.336  247.64760 42.84449 < 2.2e-16 ***
black::1      7791.504  869.97852  8.95597 < 2.2e-16 ***
black::0:age   134.108    7.06422 18.98411 < 2.2e-16 ***
black::1:age   129.046   25.24742  5.11126 3.237e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 9,499.7   Adj. R2: 0.030079
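Because this last model is fully interacted, the estimated slope for each group matches what you would get from running the regression separately within that group. A sketch on made-up data, using base R’s interaction syntax:

```r
## Made-up data: two groups with different intercepts and slopes
set.seed(42)
g <- rep(c(0, 1), each = 50)
x <- runif(100)
y <- 1 + 2 * x + g * (0.5 + 1.5 * x) + rnorm(100, sd = 0.1)

full <- lm(y ~ factor(g) * x)      # one intercept and slope per group
grp0 <- lm(y ~ x, subset = g == 0) # group-0-only regression

## Slope for group 0 is the same either way
all.equal(unname(coef(full)["x"]), unname(coef(grp0)["x"]))  # TRUE
```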