ModernDive

11 Inference for Regression

11.1 Refresher: Professor evaluations data

Let’s revisit the professor evaluations data that we analyzed using multiple regression with one numerical and one categorical predictor. In particular

  • \(y\): outcome variable of instructor evaluation score
  • predictor variables
    • \(x_1\): numerical explanatory/predictor variable of age
    • \(x_2\): categorical explanatory/predictor variable of gender
library(ggplot2)
library(dplyr)
library(moderndive)

load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
  select(score, ethnicity, gender, language, age, bty_avg, rank)

First, recall that we had two competing potential models to explain professors’ teaching scores:

  1. Model 1: No interaction term. i.e. both male and female profs have the same slope describing the associated effect of age on teaching score
  2. Model 2: Includes an interaction term. i.e. we allow for male and female profs to have different slopes describing the associated effect of age on teaching score

11.1.1 Refresher: Visualizations

Recall the plots we made for both these models:

Model 1: no interaction effect included

Figure 11.1: Model 1: no interaction effect included

Model 2: interaction effect included

Figure 11.2: Model 2: interaction effect included

11.1.2 Refresher: Regression tables

Last, let’s recall the regressions we fit. First, the regression with no interaction effect: note the use of + in the formula.

score_model_2 <- lm(score ~ age + gender, data = evals)
get_regression_table(score_model_2)
Table 11.1: Model 1: Regression table with no interaction effect included
term estimate std_error statistic p_value conf_low conf_high
intercept 4.484 0.125 35.79 0.000 4.238 4.730
age -0.009 0.003 -3.28 0.001 -0.014 -0.003
gendermale 0.191 0.052 3.63 0.000 0.087 0.294

Second, the regression with an interaction effect: note the use of * in the formula.

score_model_3 <- lm(score ~ age * gender, data = evals)
get_regression_table(score_model_3)
Table 11.2: Model 2: Regression table with interaction effect included
term estimate std_error statistic p_value conf_low conf_high
intercept 4.883 0.205 23.80 0.000 4.480 5.286
age -0.018 0.004 -3.92 0.000 -0.026 -0.009
gendermale -0.446 0.265 -1.68 0.094 -0.968 0.076
age:gendermale 0.014 0.006 2.45 0.015 0.003 0.024

11.1.3 Refresher: Residual analysis

Let’s compute the residuals using augment() to see if there is a pattern.

regression_points <- get_regression_points(score_model_3)
regression_points
# A tibble: 463 x 6
      ID score   age gender score_hat residual
   <int> <dbl> <dbl> <fct>      <dbl>    <dbl>
 1     1  4.70  36.0 female      4.25   0.448 
 2     2  4.10  36.0 female      4.25  -0.152 
 3     3  3.90  36.0 female      4.25  -0.352 
 4     4  4.80  36.0 female      4.25   0.548 
 5     5  4.60  59.0 male        4.20   0.399 
 6     6  4.30  59.0 male        4.20   0.0990
 7     7  2.80  59.0 male        4.20  -1.40  
 8     8  4.10  51.0 male        4.23  -0.133 
 9     9  3.40  51.0 male        4.23  -0.833 
10    10  4.50  40.0 female      4.18   0.318 
# ... with 453 more rows

First the histogram:

ggplot(regression_points, aes(x = residual)) +
  geom_histogram(binwidth = 0.25, color = "white") +
  labs(x = "Residual")
Model 2 (with interaction) histogram of residual

Figure 11.3: Model 2 (with interaction) histogram of residual

Second, the residuals as compared to the predictor variables:

  • \(x_1\): numerical explanatory/predictor variable of age
  • \(x_2\): categorical explanatory/predictor variable of gender
ggplot(regression_points, aes(x = age, y = residual)) +
  geom_point() +
  labs(x = "age", y = "Residual") +
  geom_hline(yintercept = 0, col = "blue", size = 1) +
  facet_wrap(~ gender)
Model 2 (with interaction) residuals vs predictor

Figure 11.4: Model 2 (with interaction) residuals vs predictor

11.1.4 Script of R code

An R script file of all R code used in this chapter is available here.