Problem Statement

The pnwflights14 R package is a modified version of Hadley Wickham’s nycflights13 package and contains information about all flights that departed from the two major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in 2014: 162,049 flights in total. To help understand what causes delays, it also includes a number of other useful datasets:

  • weather: hourly meterological data for each airport
  • planes: construction information about each plane
  • airports: airport names and locations
  • airlines: translation between two letter carrier codes and names

In this example, we will focus on a random sample of flights departing Portland with no missing information:

library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
set.seed(2016)
pdx_rs <- na.omit(pdx_flights) %>% sample_n(6000)

We are interested in testing whether there is a difference in the mean departure delays by airline/carrier for departing flights from Portland in 2014. The carrier variable is coded as a 2-character code. The corresponding airlines dataset mentioned above contains the full name of the airline and we will link this together with our flights dataset later.

Competing Hypotheses

In words

  • Null hypothesis: The mean departure delays are the same for all 11 carriers for all 2014 flights departing Portland, Oregon.

  • Alternative hypothesis: At least one of the population mean departure delays is different.

Another way with words

  • Null hypothesis: There is no association between departure delay and airline in the population of interest.

  • Alternative hypothesis: There is an association between the variables in the population.

Set \(\alpha\)

It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here.

Exploring the sample data

library(dplyr)
library(knitr)
library(ggplot2)
library(reedoilabs)
pdx_summ <- pdx_rs %>% group_by(carrier) %>%
  summarize(sample_size = n(),
    mean = mean(dep_delay),
    sd = sd(dep_delay),
    minimum = min(dep_delay),
    lower_quartile = quantile(dep_delay, 0.25),
    median = median(dep_delay),
    upper_quartile = quantile(dep_delay, 0.75),
    max = max(dep_delay))
kable(pdx_summ)
carrier sample_size mean sd minimum lower_quartile median upper_quartile max
AA 231 13.20779 75.622 -18 -5 -2 5.0 782
AS 1422 0.33052 20.067 -20 -7 -5 -1.0 204
B6 175 4.20571 22.688 -13 -7 -3 4.5 144
DL 580 2.14828 22.262 -14 -5 -3 0.0 272
F9 142 9.57746 56.272 -19 -7 -3 7.0 590
HA 37 -1.81081 12.927 -12 -7 -6 -2.0 64
OO 1111 4.50405 28.384 -15 -6 -4 0.0 302
UA 715 9.05455 34.287 -16 -5 -2 5.0 249
US 247 2.20243 25.027 -15 -6 -3 1.0 258
VX 87 2.33333 18.733 -12 -7 -4 0.0 85
WN 1253 12.06385 30.318 -10 -2 2 15.0 474

The side-by-side boxplot below also shows the mean for each group highlighted by the red dots.

qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot") +
      stat_summary(fun.y = "mean", geom = "point", color = "red")

The outliers do muddy this plot up quite a bit. We can remove them and zoom in on the interquartile range by using the R code below:

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")

Guess about statistical significance

We are looking to see if a difference exists in the mean departure time in the 11 levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that a difference exists, but the overlap of the boxplots is potentially a bit concerning.

Check conditions

Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.

  1. Independent observations: The observations are independent both within and across groups.

    A pseudorandom generator was used in R to select a sample of 3000 flights from our population of all flights. We also assume that the different airlines act independently, which may not always be the case, but shouldn’t affect the results greatly here.

  2. Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30.

qplot(sample = dep_delay, data = pdx_rs, facets = ~ carrier)

The quantile-quantile plots here should result in straight lines matching the theoretical quantiles of a standard normal distribution if they were in fact normal. We have serious reason to doubt they are normally distributed, which intuitively makes sense since flight departure delays are likely to have a right-skewed distribution.

The Central Limit Theorem can be invoked with reasonable certainty if each of the groups has more than 30 observations (sample size). We can see in the table above that HA has the fewest but it is at 37, so we can assume this condition is met even though the distributions are heavily skewed.

  1. Constant variance: The variance in the groups is about equal from one group to the next.

    The standard deviances in the table above leads us to possibly suspect that the standard deviations of the groups in the population are not similar. AA has a standard deviation of 76.622 minutes while HA has a standard deviation of 12.927 minutes. These are on the same order of magnitude so we can proceed, albeit with some caution.

Test statistic

The test statistic is a random variable based on the sample data. Here, we want to look at the ratio of variability between the groups over the variability within the groups. This measure we will call \(F\) and it represents a measure of the difference in means. A big observed \(F\) ratio corresponds to the variability between groups over-powering the variability within groups. A small observed \(F\) ratio means that the within group variability is much larger than the between group variability.

\(F\) is the defined as the ratio

\[ F = \frac{MSG}{MSE}. \]

Here, \(MSG\) is the within group variability. As a formula,

\[ MSG = \dfrac{\sum_{i = 1}^k n_i (\bar{X}_i - \bar{X})}{k - 1} \] where \(\bar{X}_i\) is the mean for each group \(i\), and \(\bar{X}\) is the overall mean.

Notice that this is very similar to the variance for looking at the group means compared to the overall mean. In other words, this is the between group variability.

Also, note that \(MSE\) can be thought of as a pooled variance estimate, which can be thought as a measure of the within group variability:

\[MSE = \dfrac{\sum_{i, j} (X_{ij} - \bar{X}_j)^2}{n_{total} - k} \]

where \(n_{total} = n_1 + n_2 + \cdots + n_k\) with \(n_i\) being the sample size of group \(i\).

Observed test statistic

While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the inference function in the reedoilabs package to perform this analysis for us. Note that to obtain the F value given here, you divide the observed \(MSG\) value of 13,033 by the observed \(MSE\) value of 966.

inference(x = pdx_rs$carrier, 
          y = pdx_rs$dep_delay, 
          est = "mean",
          type = "ht", 
          alternative = "greater", 
          method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## 
## Summary statistics:
## n_AA = 231, mean_AA = 13.208, sd_AA = 75.622
## n_AS = 1422, mean_AS = 0.3305, sd_AS = 20.067
## n_B6 = 175, mean_B6 = 4.2057, sd_B6 = 22.688
## n_DL = 580, mean_DL = 2.1483, sd_DL = 22.262
## n_F9 = 142, mean_F9 = 9.5775, sd_F9 = 56.272
## n_HA = 37, mean_HA = -1.8108, sd_HA = 12.927
## n_OO = 1111, mean_OO = 4.5041, sd_OO = 28.384
## n_UA = 715, mean_UA = 9.0545, sd_UA = 34.287
## n_US = 247, mean_US = 2.2024, sd_US = 25.027
## n_VX = 87, mean_VX = 2.3333, sd_VX = 18.733
## n_WN = 1253, mean_WN = 12.064, sd_WN = 30.318
## 
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##             Df  Sum Sq Mean Sq F value              Pr(>F)
## x           10  130335   13033    13.5 <0.0000000000000002
## Residuals 5989 5785272     966                            
## 
## Pairwise tests: t tests with pooled SD 
##        AA     AS     B6     DL     F9     HA     OO     UA     US     VX
## AS 0.0000     NA     NA     NA     NA     NA     NA     NA     NA     NA
## B6 0.0039 0.1197     NA     NA     NA     NA     NA     NA     NA     NA
## DL 0.0000 0.2352 0.4428     NA     NA     NA     NA     NA     NA     NA
## F9 0.2734 0.0007 0.1260 0.0107     NA     NA     NA     NA     NA     NA
## HA 0.0064 0.6791 0.2847 0.4525 0.0472     NA     NA     NA     NA     NA
## OO 0.0001 0.0008 0.9061 0.1390 0.0671 0.2241     NA     NA     NA     NA
## UA 0.0775 0.0000 0.0644 0.0001 0.8547 0.0382 0.0023     NA     NA     NA
## US 0.0001 0.3823 0.5142 0.9817 0.0243 0.4639 0.2925 0.0028     NA     NA
## VX 0.0054 0.5596 0.6461 0.9587 0.0870 0.4969 0.5305 0.0569 0.9731     NA
## WN 0.6073 0.0000 0.0017 0.0000 0.3663 0.0075 0.0000 0.0389 0.0000 0.0048

We see here that the \(f_{obs}\) value is around 13.5 with \(df_G = k - 1 = 11 - 1 = 10\) and \(df_E = n_{total} - k = 6000 - 11 = 5989\).

Compute \(p\)-value

The \(p\)-value—the probability of observing an \(F(df_G = 10, df_E = 5989)\) value of 15.4 or more in our null distribution—is essentially 0. This can also be calculated in R directly:

pf(13.5, df1 = 10, df2 = 5989, lower.tail = FALSE)
## [1] 0.0000000000000000000000086392

Note that we could also do this test directly without invoking the inference function using the aov function. aov stands for analysis of variance and its form is similar to what is done using the lm function with linear regression. It fits an analysis of variance model to the data in the same way that lm fits a linear regression model to the data. summary displays the resulting ANOVA table for the model fit.

pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)
##               Df  Sum Sq Mean Sq F value              Pr(>F)    
## carrier       10  130335   13033    13.5 <0.0000000000000002 ***
## Residuals   5989 5785272     966                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

State conclusion

We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference existed in the means was backed by this statistical analysis. We have evidence to suggest that departure delay is related to carrier. If we’d like to get a sense for how the codes relate to the actual names of the airlines, we can use the inner_join function in the dplyr package to match the carrier code with its corresponding name in the airlines table:

data(airlines, package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summ %>% select(carrier, mean, sd), 
           y = airlines, by = "carrier")
kable(pdx_join)
carrier mean sd name
AA 13.20779 75.622 American Airlines Inc.
AS 0.33052 20.067 Alaska Airlines Inc.
B6 4.20571 22.688 JetBlue Airways
DL 2.14828 22.262 Delta Air Lines Inc.
F9 9.57746 56.272 Frontier Airlines Inc.
HA -1.81081 12.927 Hawaiian Airlines Inc.
OO 4.50405 28.384 SkyWest Airlines Inc.
UA 9.05455 34.287 United Air Lines Inc.
US 2.20243 25.027 US Airways Inc.
VX 2.33333 18.733 Virgin America
WN 12.06385 30.318 Southwest Airlines Co.

Final note

With the conditions near being (or possibly) violated, one should use randomization to compare our \(p\)-value there with the value here to see if the assumptions may have been violated. One could also assess whether the sampling distribution of the \(F\) statistic matches well with a Fisher’s \(F\) distribution using randomization as well. If the conditions are reasonable, the next step would be to calculate pairwise analyses to better understand the reasons for the rejection of the null hypotheses in the ANOVA.