Problem Statement

Mileage is often thought of as a good predictor of the sale price of a used car. Does this conjecture also hold for so-called “luxury cars”: Porsches, Jaguars, and BMWs? More precisely, do the slopes and intercepts differ when comparing mileage and price for these three brands of cars? To answer this question, data were randomly selected from an Internet car sale site. (Tweaked a bit from Cannon et al. 2013 [Chapter 1 and Chapter 4])

Competing Hypotheses

There are many hypothesis tests to run here. It’s important to first think about the model that we will fit to address these questions. We want to predict Price (in thousands of dollars) based on Mileage (in thousands of miles). A simple linear regression equation for this would be \(\hat{Price} = b_0 + b_1 * Mileage\).
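As a minimal sketch of this starting point, the simple linear regression can be fit with `lm()`. The object name `simple_fit` is only illustrative, and the sketch assumes the `ThreeCars` data from the Stat2Data package, which is the data set used later in this analysis.

```r
library(Stat2Data)  # provides the ThreeCars data used in this analysis
data(ThreeCars)

# Simple linear regression: Price (thousands of dollars) on Mileage (thousands of miles)
simple_fit <- lm(Price ~ Mileage, data = ThreeCars)

# coef() returns the estimates b_0 (intercept) and b_1 (slope on Mileage)
coef(simple_fit)
```

This fit ignores car type entirely, so it pools all three brands into a single line.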

We are dealing with a more complicated example in this case, though. We also need to include CarType in our model. Since CarType has three levels (BMW, Jaguar, and Porsche), we encode it as two dummy variables with BMW as the baseline (since it occurs first alphabetically in the list of three car types). This model would help us determine whether there is a statistically significant difference in the intercepts of the lines predicting Price from Mileage for the three car types, assuming that the slope is the same for all three lines:

\[\hat{Price} = b_0 + b_1 * Mileage + b_2 * Porsche + b_3 * Jaguar.\]
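To see how R builds the two dummy variables, here is a small sketch using only base R. The data frame `car_types` is a hypothetical stand-in with one row per car type, and the level names (in particular the spelling "Porsche") are assumed rather than read from the real data.

```r
# One row per car type; no prices or mileages are needed to see the coding.
# car_types is a hypothetical stand-in, not the real data set.
car_types <- data.frame(CarType = factor(c("BMW", "Jaguar", "Porsche")))

# Factor levels are sorted alphabetically, so BMW becomes the baseline level
levels(car_types$CarType)

# model.matrix() shows the columns lm() would construct for CarType:
# an intercept plus one 0/1 dummy for each non-baseline level
dummies <- model.matrix(~ CarType, data = car_types)
dummies
```

The BMW row has a 0 in both dummy columns, which is what it means for BMW to be the baseline.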

This is not exactly what the problem is asking for, though. It asks whether there is also a difference in the slopes of the three fitted lines for the three car types. To allow for that, we need to add interaction terms between Mileage and each of the dummy variables Porsche and Jaguar. BMW remains the baseline: there is no separate BMW:Mileage term in the model, and setting Jaguar and Porsche equal to 0 leaves \(b_1\) as the slope for BMW:

\[\hat{Price} = b_0 + b_1 * Mileage + b_2 * Porsche + b_3 * Jaguar + b_4 * Mileage * Jaguar + b_5 * Mileage * Porsche.\]
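The interaction model can be sketched in R as follows. The names `interaction_fit` and `slopes` are illustrative, and the coefficient labels assume that the car types in `ThreeCars` are spelled "BMW", "Jaguar", and "Porsche"; if the data use different labels, the lookups below would need to change.

```r
library(Stat2Data)  # provides the ThreeCars data used in this analysis
data(ThreeCars)

# Mileage * CarType expands to Mileage + CarType + Mileage:CarType
interaction_fit <- lm(Price ~ Mileage * CarType, data = ThreeCars)
b <- coef(interaction_fit)

# Look coefficients up by name rather than position, because R's ordering
# need not match the b_0, ..., b_5 numbering used in the equation
slopes <- c(
  BMW     = unname(b["Mileage"]),
  Jaguar  = unname(b["Mileage"] + b["Mileage:CarTypeJaguar"]),
  Porsche = unname(b["Mileage"] + b["Mileage:CarTypePorsche"])
)
slopes
```

Each non-baseline slope is the BMW slope plus the corresponding interaction coefficient, so an interaction coefficient near zero means that brand's line is roughly parallel to BMW's.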

In words

  • Null hypothesis: The coefficients on all of the terms (including the interaction terms) of the least squares regression modeling price as a function of mileage and car type are zero.

  • Alternative hypothesis: At least one of the coefficients (including those on the interaction terms) of the least squares regression modeling price as a function of mileage and car type is nonzero.

In symbols (with annotations)

  • \(H_0: \beta_i = 0\) for every \(i\), where \(\beta_i\) represents the population coefficient corresponding to \(b_i\) in the least squares regression modeling price as a function of mileage and car type.
  • \(H_A:\) At least one \(\beta_i \ne 0\)

Set \(\alpha\)

It’s important to set the significance level before running any tests on the data. Let’s set the significance level at 5% here.
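Once the model is fit, one way to compare each coefficient against this significance level is sketched below. It uses the per-coefficient t-tests reported by `summary()`, which is an assumption about how the tests are carried out; the names `alpha`, `fit`, `coef_table`, and `reject_null` are illustrative.

```r
library(Stat2Data)  # provides the ThreeCars data used in this analysis
data(ThreeCars)

alpha <- 0.05  # the 5% significance level chosen above

fit <- lm(Price ~ Mileage * CarType, data = ThreeCars)

# One row per coefficient: estimate, standard error, t statistic, p-value
coef_table <- summary(fit)$coefficients

# TRUE where the t-test of H_0: beta_i = 0 rejects at level alpha
reject_null <- coef_table[, "Pr(>|t|)"] < alpha
reject_null
```

Fixing `alpha` in the code before looking at any p-value mirrors the point above: the threshold is chosen first, and the data are consulted second.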

Exploring the sample data

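library(dplyr)      # data manipulation: select(), mutate(), %>%
library(knitr)      # report formatting
library(ggplot2)    # plotting
library(Stat2Data)  # provides the ThreeCars data set
data(ThreeCars)
# Keep only the variables used in the model and store CarType as character
ThreeCars <- ThreeCars %>%
  select(CarType, Price, Mileage) %>%
  mutate(CarType = as.character(CarType))
# Output formatting: 5 significant digits, prefer fixed notation, 90-character lines
options(digits = 5, scipen = 20, width = 90)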

The scatterplot below shows the relationship between mileage, price, and car type.

# ggplot() replaces qplot(), which is deprecated as of ggplot2 3.4.0
ggplot(ThreeCars, aes(x = Mileage, y = Price, color = CarType)) +
  geom_point()
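To preview whether the slopes might differ, we can overlay a separate least squares line for each car type. This is a sketch rather than part of the original analysis. The object name `p` is illustrative. Because `color = CarType` groups the data, `geom_smooth()` fits one line per car type.

```r
library(ggplot2)
library(Stat2Data)  # provides the ThreeCars data used in this analysis
data(ThreeCars)

# Points colored by car type, plus one fitted line per car type;
# se = FALSE hides the confidence bands to keep the plot uncluttered
p <- ggplot(ThreeCars, aes(x = Mileage, y = Price, color = CarType)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```

Lines fit separately by group coincide with the fitted lines from the interaction model, so visibly non-parallel lines here hint at nonzero interaction coefficients.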