Students at Virginia Tech studied which vehicles come to a complete stop at an intersection with four-way stop signs, selecting at random the cars to observe. They looked at several factors to see which (if any) were associated with coming to a complete stop. (They defined a complete stop as “the speed of the vehicle will become zero at least for an [instant]”). Some of these variables included the age of the driver, how many passengers were in the vehicle, and type of vehicle. The variable we are going to investigate is the arrival position of vehicles approaching an intersection all traveling in the same direction. They classified this arrival pattern into three groups: whether the vehicle arrives alone, is the lead in a group of vehicles, or is a follower in a group of vehicles. The students studied one specific intersection in Northern Virginia at a variety of different times. Because random assignment was not used, this is an observational study. Also note that no vehicle from one group is paired with a vehicle from another group. In other words, there is independence between the different groups of vehicles. (Tweaked a bit from Tintle et al. 2014 [p. 8-2 - 8-13])
Null hypothesis: There is no association between the arrival position of the vehicle and whether or not it comes to a complete stop.
Alternative hypothesis: There is an association between the arrival position of the vehicle and whether or not it comes to a complete stop.
Null hypothesis: The long-run probability that a single vehicle will stop is the same as the long-run probability that a lead vehicle will stop, which is the same as the long-run probability that a following vehicle will stop. In other words, all three long-run probabilities are the same.
Alternative hypothesis: At least one of these long-run probabilities (the parameters) is different from the others.
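The parameter version of the hypotheses can also be written in symbols. The notation \(\pi_{\text{single}}\), \(\pi_{\text{lead}}\), and \(\pi_{\text{follow}}\) for the three long-run probabilities of a complete stop is our own shorthand here, not something taken from the source:

\[H_0: \pi_{\text{single}} = \pi_{\text{lead}} = \pi_{\text{follow}} \qquad \text{versus} \qquad H_A: \text{at least one of these probabilities differs from the others}\]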
It’s important to set the significance level before carrying out the test on the data. Let’s set the significance level at 5% here.
| | Single Vehicle | Lead Vehicle | Following Vehicle | TOTAL |
|---|---|---|---|---|
| Complete Stop | 151 (0.858) | 38 (0.905) | 76 (0.776) | 265 |
| Not Complete Stop | 25 (0.142) | 4 (0.095) | 22 (0.224) | 51 |
| Total | 176 | 42 | 98 | 316 |
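The values in parentheses in the table are conditional proportions within each arrival position (each count divided by its column total). As a minimal sketch, assuming we type the counts from the table into a matrix (the object names `observed` and `col_props` are ours, chosen for illustration), they can be reproduced with `prop.table()`:

```r
# Observed counts copied from the table above
# (rows: stop outcome, columns: arrival position)
observed <- matrix(
  c(151, 38, 76,
     25,  4, 22),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    stop = c("Complete Stop", "Not Complete Stop"),
    arrival = c("Single Vehicle", "Lead Vehicle", "Following Vehicle")
  )
)

# margin = 2 divides each cell by its column total
col_props <- prop.table(observed, margin = 2)
round(col_props, 3)
```

Each column of `col_props` sums to 1, which is why these are called conditional proportions: they condition on the arrival position.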
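library(ggplot2)  # provides ggplot(), aes(), and geom_bar()

# One row per observed vehicle: stop outcome and arrival position.
# Note: these counts include 5 lead vehicles that did not stop (317 rows in all),
# whereas the table above reports 4 (316 in all); the output shown further
# down reflects these 317 rows.
stop <- c(rep("complete", 265), rep("not_complete", 52))
vehicle_type <- c(rep("single", 151), rep("lead", 38), rep("follow", 76),
                  rep("single", 25), rep("lead", 5), rep("follow", 22))
df <- data.frame(stop, vehicle_type)

# Stacked bars rescaled so each arrival position sums to 1
ggplot(data = df, mapping = aes(x = vehicle_type, fill = stop)) +
  geom_bar(position = "fill", color = "black") +
  xlab("\nArrival Position of Vehicle") +
  ylab("Conditional Probability\n")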
We are looking to see if a difference exists in the heights of the bars corresponding to complete. Based solely on the picture, we have reason to believe that a difference exists, since the follow bar seems to be lower than the other two by quite a big margin. BUT… it’s important to use statistics to see if that difference is actually statistically significant!
Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met.
Independence: Each case that contributes a count to the table must be independent of all the other cases in the table.
This condition is met since cars were selected at random to observe.
Sample size: Each cell count must have at least 5 expected cases.
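This is met: the smallest expected count in the table of expected counts below is 6.78, which is at least 5.
Degrees of freedom: We need 3 or more columns in the table.
This is met: the table above has three columns, one for each arrival position.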
The test statistic is a random variable based on the sample data. Here, we want to look for deviations from what we would expect in the cells of the table if the null hypothesis were true. This requires us to calculate expected counts via
\[\text{Expected Count}_{\text{row } i, \text{col } j} = \dfrac{\text{row } i \text{ total} \times \text{column } j \text{ total}}{\text{table total}}\]
| | Single Vehicle | Lead Vehicle | Following Vehicle | TOTAL |
|---|---|---|---|---|
| Complete Stop | 151 (147.59) | 38 (35.22) | 76 (82.18) | 265 |
| Not Complete Stop | 25 (28.41) | 4 (6.78) | 22 (15.82) | 51 |
| Total | 176 | 42 | 98 | 316 |
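The expected counts in parentheses follow directly from the formula above. The following is a minimal sketch, assuming the counts are entered by hand from the table (the names `observed` and `expected` are ours); it also checks the sample size condition:

```r
# Observed counts copied from the table
# (rows: stop outcome, columns: arrival position)
observed <- matrix(
  c(151, 38, 76,
     25,  4, 22),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    stop = c("Complete Stop", "Not Complete Stop"),
    arrival = c("Single Vehicle", "Lead Vehicle", "Following Vehicle")
  )
)

# Expected count = row total * column total / table total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Sample size condition: every expected count should be at least 5
all(expected >= 5)
```

`outer()` builds every row total times column total product in one step, so no explicit loop over the cells is needed.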
\(X^2 = \sum_{\text{all cells in the table}} \dfrac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}\)
Assuming the conditions outlined above are met, \(X^2 \sim \chi^2(df = (R - 1) \times (C - 1))\) where \(R\) is the number of rows in the table and \(C\) is the number of columns.
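To make the reference distribution concrete, here is a short sketch that computes the degrees of freedom for a table with 2 rows and 3 columns and the cutoff a \(\chi^2\) statistic would need to exceed at the 5% significance level. The variable names are ours, chosen for illustration:

```r
n_rows <- 2  # complete stop, not complete stop
n_cols <- 3  # single, lead, following

# Degrees of freedom for a test of independence: (R - 1) * (C - 1)
deg_free <- (n_rows - 1) * (n_cols - 1)

# Upper 5% cutoff of the chi-square distribution with deg_free df
alpha <- 0.05
critical_value <- qchisq(1 - alpha, df = deg_free)

deg_free
round(critical_value, 3)
```

With 2 degrees of freedom, the cutoff is about 5.99, so an observed statistic below that value would not lead to rejecting the null hypothesis at the 5% level.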
While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the inference function in the oilabs package to perform this analysis for us.
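library(oilabs)  # provides inference()

# Theoretical (chi-square) hypothesis test of stop outcome by arrival position
inference(data = df, y = stop,
          x = vehicle_type,
          statistic = "proportion",
          type = "ht",
          alternative = "greater",
          method = "theoretical",
          show_eda_plot = FALSE,
          show_inf_plot = FALSE)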
## Response variable: categorical (2 levels)
## Explanatory variable: categorical (3 levels)
## Observed:
## y
## x complete not_complete
## follow 76 22
## lead 38 5
## single 151 25
##
## Expected:
## y
## x complete not_complete
## follow 81.924 16.0757
## lead 35.946 7.0536
## single 147.129 28.8707
##
## H0: vehicle_type and stop are independent
## HA: vehicle_type and stop are dependent
## chi_sq = 3.9476, df = 2, p_value = 0.1389
We see here that the \(x^2_{obs}\) value is about 3.95 with \(df = (2 - 1)(3 - 1) = 2\).
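For readers who want to see where the reported statistic comes from, the sketch below recomputes it by hand from the observed counts printed in the output above, using the \(X^2\) formula. The object names are ours, and this is only an illustration of the formula, not how the inference function is implemented:

```r
# Observed counts as printed in the output above
# (rows: arrival position, columns: stop outcome)
observed <- matrix(
  c( 76, 22,
     38,  5,
    151, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    x = c("follow", "lead", "single"),
    y = c("complete", "not_complete")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of (observed - expected)^2 / expected over all cells
x2_obs <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

round(x2_obs, 4)  # 3.9476, matching the reported chi_sq
deg_free          # 2
```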
The \(p\)-value, the probability of observing a \(\chi^2_{df = 2}\) value of 3.95 or more in our null distribution, is about 0.139 (roughly 14%). This can also be calculated in R directly:
1 - pchisq(3.9476, df = 2)
## [1] 0.13893
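An equivalent way to get the same upper-tail probability is to ask pchisq for it directly with `lower.tail = FALSE`, which avoids the subtraction from 1. As a sanity check, a \(\chi^2\) distribution with 2 degrees of freedom has the closed-form upper tail \(e^{-x/2}\), so the two values should agree. The variable names below are ours:

```r
x2_obs <- 3.9476  # test statistic reported above

# Upper-tail probability requested directly
p_upper <- pchisq(x2_obs, df = 2, lower.tail = FALSE)

# Closed form that holds only when df = 2
p_closed <- exp(-x2_obs / 2)

all.equal(p_upper, p_closed)
round(p_upper, 4)  # 0.1389
```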
Note that we could also do this test directly, without invoking the inference function, by using the chisq.test function.
chisq.test(x = table(df$vehicle_type, df$stop), correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: table(df$vehicle_type, df$stop)
## X-squared = 3.95, df = 2, p-value = 0.14
Because the \(p\)-value of about 0.14 is larger than our 5% significance level, we do not have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference existed in the proportions was not backed up by this statistical analysis. We do not have evidence to suggest that there is an association between the arrival position of the vehicle and whether or not it comes to a complete stop.
Tintle, Nathan, Beth Chance, George Cobb, Allan Rossman, Soma Roy, Todd Swanson, and Jill VanderStoep. 2014. Introduction to Statistical Investigations.