Problem Statement

The Corporate Average Fuel Economy (CAFE) bill was proposed by Senators John McCain and John Kerry to improve the fuel economy of cars and light trucks sold in the United States. However, a critical vote on an amendment in March 2002 threatened to postpone CAFE indefinitely. The amendment charged the National Highway Traffic Safety Administration with developing a new standard, the effect being to put the McCain-Kerry bill on indefinite hold. It passed by a vote of 62-38.

A political question of interest is whether there is evidence of monetary influence on a senator’s vote. Scott Preston, a professor of statistics at SUNY Oswego, collected data on this vote, which includes our response variable, the vote of each senator (Yes or No), and, as an explanatory variable, the monetary contributions that each of the 100 senators received over his or her lifetime from car manufacturers.

Anyone with an interest in U.S. politics might naturally be led to ask the following questions, which we will proceed to answer:

  • Including contributions in the model, what is the effect of party affiliation on the vote in the CAFE amendment context?
  • Including party in the model, what is the effect of increased contributions on the vote in the CAFE amendment context? (Tweaked a bit from Cannon et al. 2013 [Chapter 10])

Competing Hypotheses

There are many hypothesis tests to run here. The model in this case corresponds to the log odds of a Yes vote as the response and LogContr = log(Contribution + 1) and Party as the explanatory variables for our population. We can denote that with the following equation: \[ \log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \, LogContr + \beta_2 \, Party.\] (Note that this dataset has been modified slightly to create a dichotomous outcome for the Party variable. The modification involves Senator James Jeffords of Vermont, who was a Republican until 2001, when he left the party to become an Independent and began caucusing with the Democrats. His Party value has thus been changed from I to D.)

Since Party has two levels (R and D), we encode this as one dummy variable with D as the baseline (since it occurs first alphabetically in the list of two parties). This model (from our sample) would help us determine if there is a statistical difference in the intercepts of predicting Vote based on LogContr for the two parties in the Senate, assuming that the coefficient on LogContr is the same for both curves:

\[\widehat{\log \left(\dfrac{\mathbb{P}(Vote = 1)}{\mathbb{P}(Vote = 0)} \right)} = b_0 + b_1 \cdot LogContr + b_2 \cdot Party.\]
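
As a small standalone illustration (not yet using the CAFE data) of how R’s default treatment coding picks the alphabetically first level as the baseline:

# default treatment contrasts: the first level (D) is the baseline, coded 0
contrasts(factor(c("D", "R")))
##   R
## D 0
## R 1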

In words

  • Null hypothesis: The coefficients on the explanatory variables of the logistic regression modeling log(odds) of Vote as a function of log contribution and party are zero.

  • Alternative hypothesis: At least one of the coefficients on the explanatory variables of the logistic regression modeling log(odds) of Vote as a function of log contribution and party is nonzero.

In symbols (with annotations)

  • \(H_0: \beta_i = 0\), where \(\beta_i\) represents the population coefficient on each explanatory variable of the logistic regression modeling log(odds) of Vote as a function of log contribution and party.
  • \(H_A:\) At least one \(\beta_i \ne 0\)

Set \(\alpha\)

It’s important to set the significance level before beginning to test with the data. Let’s set the significance level at 5% here.

Exploring the sample data

library(dplyr)
library(knitr)
library(ggplot2)
library(Stat2Data)
data(CAFE)
# Recode Senator Jeffords' party from "I" to "D" and drop the redundant Dem indicator
CAFE <- CAFE %>%
  mutate(Party = sub(pattern = "I", replacement = "D", 
                    x = Party, fixed = TRUE)) %>%
  mutate(Party = as.character(Party)) %>%
  select(-Dem)
options(digits = 5, scipen = 20, width = 90)
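
A quick look at the structure of the prepared data gives a sense of the variables used below (a brief sketch; glimpse() is re-exported by dplyr, which is loaded above):

# compact overview of the variables and their types
glimpse(CAFE)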

The scatterplot below shows the relationship between log contributions, vote, and party.

ggplot(CAFE, aes(x = LogContr, y = factor(Vote), color = Party)) +
  geom_point()

Guess about statistical significance

It seems that party affiliation and whether a Senator voted “yes” are related. There are far fewer Republicans who voted against the bill than Democrats. There also appears to be a relationship between contributions and voting “yes”.
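
A quick two-way table of party by vote makes this impression concrete (a brief sketch using base R’s table()):

# counts of No (0) and Yes (1) votes within each party
with(CAFE, table(Party, Vote))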

Check conditions

Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.

  1. Linear relationship between the log odds of the response and the predictors: This condition is difficult to check without binning the data and computing something called the empirical logit function.

    We will assume that this condition is met here.

  2. Independent observations and errors: If cases are selected at random, the independent observations condition is met. If no time series-like patterns emerge in the residuals plot, the independent errors condition is met.

    Without independence, the formulas for standard errors and the \(p\)-value calculations will be wrong. Here, as with randomness, there is a naive version of the question with a simple answer, and a subtler and harder version of the question. The simple version ignores conditioning and asks, “If you know one senator’s vote, does that help you predict any other senator’s vote?” A quick look at the votes state-by-state gives an equally quick answer of “Yes.” Senators from the same state tend to vote the same way.

    If you do not condition on party and contribution amount, votes are clearly not independent. What matters, however, is the conditional version of the question, which is the more subtle version: If two senators belong to the same party and receive the same campaign amounts, are their votes independent?

    There are many ways that independence can fail, but most failures result from lurking variables—one or more shared features that are not part of the model. One way to check independence is to use the applied context to guess at a possible lurking variable, then compare predictions with and without that variable in the model.

    For the CAFE data, one possible lurking variable is a senator’s state. It is reasonable to think that senators from the same state tend to vote the same way more often than predicted by the independence condition. This prediction is one we can check: For each senator, we use our glm model to compute a fitted \(\mathbb{P}(Yes)\) given party and contribution level. We then use these fitted probabilities to compute the chance that both senators from a state vote the same way, assuming that votes are independent. This lets us compute the expected number of states where both senators vote the same way, if the independence condition holds. We can then compare the expected number with the observed number. If votes are in fact independent, observed and expected should be close. The table below shows the actual and expected numbers, and a code sketch of this calculation appears at the end of this list of conditions.

                     Together   Split   Total
    Actual               38.0    12.0    50.0
    If independent       31.1    18.9    50.0

    Observed and expected are far enough apart to call into question the condition of independence. (This can be tested using a chi-square test and does produce a statistically significant result.) The evidence is all the stronger because the direction of the difference supports the idea that senators from the same state tend to vote the same way, even after you adjust for party and campaign contributions.

    Where does this leave us in relation to the model? Agnostic at best. We have no clear basis to justify the independence part of the model, and some very substantial basis for thinking the model is incomplete. At the same time, we have no compelling basis for declaring the model to be so clearly wrong as to make tests and intervals also clearly wrong.

    It is reasonable to go ahead with inference, but it is prudent to regard the results as tentative guidelines only. This caveat aside, it is important to recognize that, even without formal inference, the logistic model may still have great value. It may fit well, and it provides a simple, accurate, and useful description of a meaningful pattern.

    Checking conditions is the hard part of inference, as the last example illustrates. You really have to think. By comparison, getting the numbers for inference is often a walk in the park. Keep in mind, however, as you go through the next few sections, that what matters is not the numbers themselves, but what they do tell you, and what they cannot tell you, even if you want them to.
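
Below is a minimal sketch of the expected-count check described under condition 2. It fits the same logistic model used in the rest of this analysis and assumes the CAFE data include a State variable identifying each senator’s state; if no such column is present, the state would first have to be recovered from the senator labels. (It also uses dplyr, loaded above.)

# Sketch of the independence check: compare the observed number of states where
# both senators vote the same way with the number expected under independence.
# Assumes a State column exists in CAFE (an assumption; adjust if necessary).
CAFE_glm <- glm(Vote ~ LogContr + Party, family = binomial, data = CAFE)
CAFE_pairs <- CAFE %>%
  mutate(p_yes = fitted(CAFE_glm)) %>%   # fitted P(Yes) for each senator
  group_by(State) %>%
  filter(n() == 2) %>%                   # states with both senators in the data
  summarize(
    same_observed = as.numeric(Vote[1] == Vote[2]),
    # chance both senators vote the same way if their votes are independent
    same_expected = p_yes[1] * p_yes[2] + (1 - p_yes[1]) * (1 - p_yes[2])
  )
# expected vs. observed number of states voting together
c(Actual = sum(CAFE_pairs$same_observed),
  `If independent` = sum(CAFE_pairs$same_expected))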

Test statistic

The test statistics are random variables based on the sample data. Here, we want to look at a way to estimate the population coefficients \(\beta_i\). A good guess is given by the sample coefficients \(B_i\). Recall that these sample coefficients are themselves random variables that would vary if different samples were (theoretically) collected.

We next look at our fitted regression coefficients from our sample of data:

CAFE_mult_glm <- glm(Vote ~ LogContr + Party, 
                     family = binomial, data = CAFE)
summary(CAFE_mult_glm)
## 
## Call:
## glm(formula = Vote ~ LogContr + Party, family = binomial, data = CAFE)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.412  -0.657   0.360   0.562   2.362  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -8.573      2.339   -3.66  0.00025 ***
## LogContr       2.166      0.613    3.53  0.00041 ***
## PartyR         1.733      0.580    2.99  0.00283 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.813  on 99  degrees of freedom
## Residual deviance:  87.336  on 97  degrees of freedom
## AIC: 93.34
## 
## Number of Fisher Scoring iterations: 5

We are looking to see how likely it is for us to have observed sample coefficients \(b_{i, obs}\) or more extreme, assuming that the population coefficients are 0 (that is, assuming the null hypothesis is true). If the conditions are met and assuming \(H_0\) is true, we can “standardize” these original test statistics \(B_i\) into \(Z\) statistics that follow a \(N(0, 1)\) distribution.

\[ Z =\dfrac{ B_i - 0}{ {SE}_i } \sim N(0, 1) \]

where \({SE}_i\) represents the (estimated) standard deviation of the sampling distribution of the sample coefficient \(B_i\).

Observed test statistic

While one could compute these observed test statistics by “hand”, the focus here is on the set-up of the problem and on understanding which formula for the test statistic applies. We can use the glm function here to fit the logistic regression model and conduct the hypothesis tests. (We’ve already run this code earlier in the analysis, but it is shown here again for clarity.)

CAFE_mult_glm <- glm(Vote ~ LogContr + Party, 
                     family = binomial, data = CAFE)
summary(CAFE_mult_glm)$coefficients
##             Estimate Std. Error z value   Pr(>|z|)
## (Intercept)  -8.5730    2.33924 -3.6649 0.00024748
## LogContr      2.1659    0.61308  3.5328 0.00041123
## PartyR        1.7328    0.58045  2.9853 0.00283317
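
To connect this output with the formula for \(Z\) above, the \(z\) values in the table can be recomputed directly as each estimate divided by its standard error (a quick sketch):

coefs <- summary(CAFE_mult_glm)$coefficients
# z statistic for each coefficient: (estimate - 0) / standard error
coefs[, "Estimate"] / coefs[, "Std. Error"]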

We skip formal interpretations of the coefficients here, but we can note the signs on the coefficients and what they mean for the relationships between the variables. We see that the sign on LogContr is positive. As we suspected, this corresponds to a positive relationship between voting “yes” on the bill and contributions (holding the other variable fixed).

The sign on PartyR is also positive. This agrees with our first intuitions: we predict Republicans are more likely to vote “yes” than Democrats.
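
One common way to read these positive signs is on the odds scale: exponentiating the fitted coefficients gives estimated odds ratios, so values above 1 correspond to higher odds of a “yes” vote (a brief sketch):

# convert log-odds coefficients to (multiplicative) odds ratios
exp(coef(CAFE_mult_glm))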

Compute \(p\)-values

The \(p\)-values correspond to the probability of observing a \(z\) statistic as extreme as the one computed from \(b_{i, obs}\), or more extreme, in our null distribution. We see that the (Intercept), LogContr, and PartyR terms are statistically significant at the 5% level.

We show below how we can obtain one of these \(p\)-values (for PartyR) in R directly:

2 * pnorm(2.9853, lower.tail = FALSE)
## [1] 0.002833

State conclusion

We, therefore, have sufficient evidence to reject the null hypothesis for LogContr and for Party, assuming the other term is in the model. Our initial guess that these variables would be related to a “Yes” vote is supported by our statistical analyses here.


Cannon, Ann R., George W. Cobb, Bradley A. Hartlaub, Julie M. Legler, Robin H. Lock, Thomas L. Moore, Allan J. Rossman, and Jeffrey A. Witmer. 2013. STAT2: Building Models for a World of Data.