The pnwflights14
R package is a modified version of Hadley Wickham’s nycflights13
package and contains information about all flights that departed from the two major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in 2014: 162,049 flights in total. To help understand what causes delays, it also includes a number of other useful datasets:
weather
: hourly meterological data for each airportplanes
: construction information about each planeairports
: airport names and locationsairlines
: translation between two letter carrier codes and namesIn this example, we will focus on a random sample of flights departing Portland with no missing information:
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>%
select(-year, -origin)
set.seed(2016)
pdx_rs <- na.omit(pdx_flights) %>% sample_n(6000)
We are interested in testing whether there is a difference in the mean departure delays by airline/carrier
for departing flights from Portland in 2014. The carrier
variable is coded as a 2-character code. The corresponding airlines
dataset mentioned above contains the full name of the airline and we will link this together with our flights
dataset later.
Null hypothesis: The mean departure delays are the same for all 11 carriers for all 2014 flights departing Portland, Oregon.
Alternative hypothesis: At least one of the population mean departure delays is different.
Null hypothesis: There is no association between departure delay and airline in the population of interest.
Alternative hypothesis: There is an association between the variables in the population.
It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here.
library(dplyr)
library(knitr)
library(ggplot2)
library(reedoilabs)
pdx_summ <- pdx_rs %>% group_by(carrier) %>%
summarize(sample_size = n(),
mean = mean(dep_delay),
sd = sd(dep_delay),
minimum = min(dep_delay),
lower_quartile = quantile(dep_delay, 0.25),
median = median(dep_delay),
upper_quartile = quantile(dep_delay, 0.75),
max = max(dep_delay))
kable(pdx_summ)
carrier | sample_size | mean | sd | minimum | lower_quartile | median | upper_quartile | max |
---|---|---|---|---|---|---|---|---|
AA | 231 | 13.20779 | 75.622 | -18 | -5 | -2 | 5.0 | 782 |
AS | 1422 | 0.33052 | 20.067 | -20 | -7 | -5 | -1.0 | 204 |
B6 | 175 | 4.20571 | 22.688 | -13 | -7 | -3 | 4.5 | 144 |
DL | 580 | 2.14828 | 22.262 | -14 | -5 | -3 | 0.0 | 272 |
F9 | 142 | 9.57746 | 56.272 | -19 | -7 | -3 | 7.0 | 590 |
HA | 37 | -1.81081 | 12.927 | -12 | -7 | -6 | -2.0 | 64 |
OO | 1111 | 4.50405 | 28.384 | -15 | -6 | -4 | 0.0 | 302 |
UA | 715 | 9.05455 | 34.287 | -16 | -5 | -2 | 5.0 | 249 |
US | 247 | 2.20243 | 25.027 | -15 | -6 | -3 | 1.0 | 258 |
VX | 87 | 2.33333 | 18.733 | -12 | -7 | -4 | 0.0 | 85 |
WN | 1253 | 12.06385 | 30.318 | -10 | -2 | 2 | 15.0 | 474 |
The side-by-side boxplot below also shows the mean for each group highlighted by the red dots.
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot") +
stat_summary(fun.y = "mean", geom = "point", color = "red")
The outliers do muddy this plot up quite a bit. We can remove them and zoom in on the interquartile range by using the R code below:
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) +
geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim = c(-20, 45)) +
stat_summary(fun.y = "mean", geom = "point", color = "red")
We are looking to see if a difference exists in the mean departure time in the 11 levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that a difference exists, but the overlap of the boxplots is potentially a bit concerning.
Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.
Independent observations: The observations are independent both within and across groups.
A pseudorandom generator was used in R to select a sample of 3000 flights from our population of all flights. We also assume that the different airlines act independently, which may not always be the case, but shouldn’t affect the results greatly here.
Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30.
qplot(sample = dep_delay, data = pdx_rs, facets = ~ carrier)
The quantile-quantile plots here should result in straight lines matching the theoretical quantiles of a standard normal distribution if they were in fact normal. We have serious reason to doubt they are normally distributed, which intuitively makes sense since flight departure delays are likely to have a right-skewed distribution.
The Central Limit Theorem can be invoked with reasonable certainty if each of the groups has more than 30 observations (sample size). We can see in the table above that HA
has the fewest but it is at 37, so we can assume this condition is met even though the distributions are heavily skewed.
Constant variance: The variance in the groups is about equal from one group to the next.
The standard deviances in the table above leads us to possibly suspect that the standard deviations of the groups in the population are not similar. AA
has a standard deviation of 76.622 minutes while HA
has a standard deviation of 12.927 minutes. These are on the same order of magnitude so we can proceed, albeit with some caution.
The test statistic is a random variable based on the sample data. Here, we want to look at the ratio of variability between the groups over the variability within the groups. This measure we will call \(F\) and it represents a measure of the difference in means. A big observed \(F\) ratio corresponds to the variability between groups over-powering the variability within groups. A small observed \(F\) ratio means that the within group variability is much larger than the between group variability.
\(F\) is the defined as the ratio
\[ F = \frac{MSG}{MSE}. \]
Here, \(MSG\) is the within group variability. As a formula,
\[ MSG = \dfrac{\sum_{i = 1}^k n_i (\bar{X}_i - \bar{X})}{k - 1} \] where \(\bar{X}_i\) is the mean for each group \(i\), and \(\bar{X}\) is the overall mean.
Notice that this is very similar to the variance for looking at the group means compared to the overall mean. In other words, this is the between group variability.
Also, note that \(MSE\) can be thought of as a pooled variance estimate, which can be thought as a measure of the within group variability:
\[MSE = \dfrac{\sum_{i, j} (X_{ij} - \bar{X}_j)^2}{n_{total} - k} \]
where \(n_{total} = n_1 + n_2 + \cdots + n_k\) with \(n_i\) being the sample size of group \(i\).
While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the inference
function in the reedoilabs
package to perform this analysis for us. Note that to obtain the F value
given here, you divide the observed \(MSG\) value of 13,033 by the observed \(MSE\) value of 966.
inference(x = pdx_rs$carrier,
y = pdx_rs$dep_delay,
est = "mean",
type = "ht",
alternative = "greater",
method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
##
## Summary statistics:
## n_AA = 231, mean_AA = 13.208, sd_AA = 75.622
## n_AS = 1422, mean_AS = 0.3305, sd_AS = 20.067
## n_B6 = 175, mean_B6 = 4.2057, sd_B6 = 22.688
## n_DL = 580, mean_DL = 2.1483, sd_DL = 22.262
## n_F9 = 142, mean_F9 = 9.5775, sd_F9 = 56.272
## n_HA = 37, mean_HA = -1.8108, sd_HA = 12.927
## n_OO = 1111, mean_OO = 4.5041, sd_OO = 28.384
## n_UA = 715, mean_UA = 9.0545, sd_UA = 34.287
## n_US = 247, mean_US = 2.2024, sd_US = 25.027
## n_VX = 87, mean_VX = 2.3333, sd_VX = 18.733
## n_WN = 1253, mean_WN = 12.064, sd_WN = 30.318
##
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 10 130335 13033 13.5 <0.0000000000000002
## Residuals 5989 5785272 966
##
## Pairwise tests: t tests with pooled SD
## AA AS B6 DL F9 HA OO UA US VX
## AS 0.0000 NA NA NA NA NA NA NA NA NA
## B6 0.0039 0.1197 NA NA NA NA NA NA NA NA
## DL 0.0000 0.2352 0.4428 NA NA NA NA NA NA NA
## F9 0.2734 0.0007 0.1260 0.0107 NA NA NA NA NA NA
## HA 0.0064 0.6791 0.2847 0.4525 0.0472 NA NA NA NA NA
## OO 0.0001 0.0008 0.9061 0.1390 0.0671 0.2241 NA NA NA NA
## UA 0.0775 0.0000 0.0644 0.0001 0.8547 0.0382 0.0023 NA NA NA
## US 0.0001 0.3823 0.5142 0.9817 0.0243 0.4639 0.2925 0.0028 NA NA
## VX 0.0054 0.5596 0.6461 0.9587 0.0870 0.4969 0.5305 0.0569 0.9731 NA
## WN 0.6073 0.0000 0.0017 0.0000 0.3663 0.0075 0.0000 0.0389 0.0000 0.0048
We see here that the \(f_{obs}\) value is around 13.5 with \(df_G = k - 1 = 11 - 1 = 10\) and \(df_E = n_{total} - k = 6000 - 11 = 5989\).
The \(p\)-value—the probability of observing an \(F(df_G = 10, df_E = 5989)\) value of 15.4 or more in our null distribution—is essentially 0. This can also be calculated in R directly:
pf(13.5, df1 = 10, df2 = 5989, lower.tail = FALSE)
## [1] 0.0000000000000000000000086392
Note that we could also do this test directly without invoking the inference
function using the aov
function. aov
stands for analysis of variance and its form is similar to what is done using the lm
function with linear regression. It fits an analysis of variance model to the data in the same way that lm
fits a linear regression model to the data. summary
displays the resulting ANOVA table for the model fit.
pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## carrier 10 130335 13033 13.5 <0.0000000000000002 ***
## Residuals 5989 5785272 966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference existed in the means was backed by this statistical analysis. We have evidence to suggest that departure delay is related to carrier. If we’d like to get a sense for how the codes relate to the actual names of the airlines, we can use the inner_join
function in the dplyr
package to match the carrier
code with its corresponding name
in the airlines
table:
data(airlines, package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summ %>% select(carrier, mean, sd),
y = airlines, by = "carrier")
kable(pdx_join)
carrier | mean | sd | name |
---|---|---|---|
AA | 13.20779 | 75.622 | American Airlines Inc. |
AS | 0.33052 | 20.067 | Alaska Airlines Inc. |
B6 | 4.20571 | 22.688 | JetBlue Airways |
DL | 2.14828 | 22.262 | Delta Air Lines Inc. |
F9 | 9.57746 | 56.272 | Frontier Airlines Inc. |
HA | -1.81081 | 12.927 | Hawaiian Airlines Inc. |
OO | 4.50405 | 28.384 | SkyWest Airlines Inc. |
UA | 9.05455 | 34.287 | United Air Lines Inc. |
US | 2.20243 | 25.027 | US Airways Inc. |
VX | 2.33333 | 18.733 | Virgin America |
WN | 12.06385 | 30.318 | Southwest Airlines Co. |
With the conditions near being (or possibly) violated, one should use randomization to compare our \(p\)-value there with the value here to see if the assumptions may have been violated. One could also assess whether the sampling distribution of the \(F\) statistic matches well with a Fisher’s \(F\) distribution using randomization as well. If the conditions are reasonable, the next step would be to calculate pairwise analyses to better understand the reasons for the rejection of the null hypotheses in the ANOVA.