Problem Statement

The pnwflights14 R package is a modified version of Hadley Wickham’s nycflights13 package and contains information about all flights that departed from the two major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in 2014: 162,049 flights in total. To help understand what causes delays, it also includes a number of other useful datasets:

weather: hourly meterological data for each airport
planes: construction information about each plane
airports: airport names and locations
airlines: translation between two letter carrier codes and names

In this example, we will focus on a random sample of flights departing Portland with no missing information:

library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
set.seed(2016)
pdx_rs <- na.omit(pdx_flights) %>% sample_n(6000)

We are interested in testing whether there is a difference in the mean departure delays by airline/carrier for departing flights from Portland in 2014. The carrier variable is coded as a 2-character code. The corresponding airlines dataset mentioned above contains the full name of the airline and we will link this together with our flights dataset later.

Competing Hypotheses

In words

Null hypothesis: The mean departure delays are the same for all 11 carriers for all 2014 flights departing Portland, Oregon.
Alternative hypothesis: At least one of the population mean departure delays is different.

Another way with words

Null hypothesis: There is no association between departure delay and airline in the population of interest.
Alternative hypothesis: There is an association between the variables in the population.

Set \(\alpha\)

It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here.

Exploring the sample data

library(dplyr)
library(knitr)
library(ggplot2)
library(reedoilabs)

pdx_summ <- pdx_rs %>% group_by(carrier) %>%
  summarize(sample_size = n(),
    mean = mean(dep_delay),
    sd = sd(dep_delay),
    minimum = min(dep_delay),
    lower_quartile = quantile(dep_delay, 0.25),
    median = median(dep_delay),
    upper_quartile = quantile(dep_delay, 0.75),
    max = max(dep_delay))
kable(pdx_summ)

carrier	sample_size	mean	sd	minimum	lower_quartile	median	upper_quartile	max
AA	231	13.20779	75.622	-18	-5	-2	5.0	782
AS	1422	0.33052	20.067	-20	-7	-5	-1.0	204
B6	175	4.20571	22.688	-13	-7	-3	4.5	144
DL	580	2.14828	22.262	-14	-5	-3	0.0	272
F9	142	9.57746	56.272	-19	-7	-3	7.0	590
HA	37	-1.81081	12.927	-12	-7	-6	-2.0	64
OO	1111	4.50405	28.384	-15	-6	-4	0.0	302
UA	715	9.05455	34.287	-16	-5	-2	5.0	249
US	247	2.20243	25.027	-15	-6	-3	1.0	258
VX	87	2.33333	18.733	-12	-7	-4	0.0	85
WN	1253	12.06385	30.318	-10	-2	2	15.0	474

The side-by-side boxplot below also shows the mean for each group highlighted by the red dots.

qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot") +
      stat_summary(fun.y = "mean", geom = "point", color = "red")

The outliers do muddy this plot up quite a bit. We can remove them and zoom in on the interquartile range by using the R code below:

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")

Guess about statistical significance

We are looking to see if a difference exists in the mean departure time in the 11 levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that a difference exists, but the overlap of the boxplots is potentially a bit concerning.

Check conditions

Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.

Independent observations: The observations are independent both within and across groups.

A pseudorandom generator was used in R to select a sample of 3000 flights from our population of all flights. We also assume that the different airlines act independently, which may not always be the case, but shouldn’t affect the results greatly here.
Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30.

qplot(sample = dep_delay, data = pdx_rs, facets = ~ carrier)

The quantile-quantile plots here should result in straight lines matching the theoretical quantiles of a standard normal distribution if they were in fact normal. We have serious reason to doubt they are normally distributed, which intuitively makes sense since flight departure delays are likely to have a right-skewed distribution.

The Central Limit Theorem can be invoked with reasonable certainty if each of the groups has more than 30 observations (sample size). We can see in the table above that HA has the fewest but it is at 37, so we can assume this condition is met even though the distributions are heavily skewed.

Constant variance: The variance in the groups is about equal from one group to the next.

The standard deviances in the table above leads us to possibly suspect that the standard deviations of the groups in the population are not similar. AA has a standard deviation of 76.622 minutes while HA has a standard deviation of 12.927 minutes. These are on the same order of magnitude so we can proceed, albeit with some caution.

Test statistic

The test statistic is a random variable based on the sample data. Here, we want to look at the ratio of variability between the groups over the variability within the groups. This measure we will call \(F\) and it represents a measure of the difference in means. A big observed \(F\) ratio corresponds to the variability between groups over-powering the variability within groups. A small observed \(F\) ratio means that the within group variability is much larger than the between group variability.

\(F\) is the defined as the ratio

\[ F = \frac{MSG}{MSE}. \]

Here, \(MSG\) is the within group variability. As a formula,

\[ MSG = \dfrac{\sum_{i = 1}^k n_i (\bar{X}_i - \bar{X})}{k - 1} \] where \(\bar{X}_i\) is the mean for each group \(i\), and \(\bar{X}\) is the overall mean.

Notice that this is very similar to the variance for looking at the group means compared to the overall mean. In other words, this is the between group variability.

Also, note that \(MSE\) can be thought of as a pooled variance estimate, which can be thought as a measure of the within group variability:

\[MSE = \dfrac{\sum_{i, j} (X_{ij} - \bar{X}_j)^2}{n_{total} - k} \]

where \(n_{total} = n_1 + n_2 + \cdots + n_k\) with \(n_i\) being the sample size of group \(i\).

Observed test statistic

While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the inference function in the reedoilabs package to perform this analysis for us. Note that to obtain the F value given here, you divide the observed \(MSG\) value of 13,033 by the observed \(MSE\) value of 966.

inference(x = pdx_rs$carrier, 
          y = pdx_rs$dep_delay, 
          est = "mean",
          type = "ht", 
          alternative = "greater", 
          method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## 
## Summary statistics:
## n_AA = 231, mean_AA = 13.208, sd_AA = 75.622
## n_AS = 1422, mean_AS = 0.3305, sd_AS = 20.067
## n_B6 = 175, mean_B6 = 4.2057, sd_B6 = 22.688
## n_DL = 580, mean_DL = 2.1483, sd_DL = 22.262
## n_F9 = 142, mean_F9 = 9.5775, sd_F9 = 56.272
## n_HA = 37, mean_HA = -1.8108, sd_HA = 12.927
## n_OO = 1111, mean_OO = 4.5041, sd_OO = 28.384
## n_UA = 715, mean_UA = 9.0545, sd_UA = 34.287
## n_US = 247, mean_US = 2.2024, sd_US = 25.027
## n_VX = 87, mean_VX = 2.3333, sd_VX = 18.733
## n_WN = 1253, mean_WN = 12.064, sd_WN = 30.318
## 
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##             Df  Sum Sq Mean Sq F value              Pr(>F)
## x           10  130335   13033    13.5 <0.0000000000000002
## Residuals 5989 5785272     966                            
## 
## Pairwise tests: t tests with pooled SD 
##        AA     AS     B6     DL     F9     HA     OO     UA     US     VX
## AS 0.0000     NA     NA     NA     NA     NA     NA     NA     NA     NA
## B6 0.0039 0.1197     NA     NA     NA     NA     NA     NA     NA     NA
## DL 0.0000 0.2352 0.4428     NA     NA     NA     NA     NA     NA     NA
## F9 0.2734 0.0007 0.1260 0.0107     NA     NA     NA     NA     NA     NA
## HA 0.0064 0.6791 0.2847 0.4525 0.0472     NA     NA     NA     NA     NA
## OO 0.0001 0.0008 0.9061 0.1390 0.0671 0.2241     NA     NA     NA     NA
## UA 0.0775 0.0000 0.0644 0.0001 0.8547 0.0382 0.0023     NA     NA     NA
## US 0.0001 0.3823 0.5142 0.9817 0.0243 0.4639 0.2925 0.0028     NA     NA
## VX 0.0054 0.5596 0.6461 0.9587 0.0870 0.4969 0.5305 0.0569 0.9731     NA
## WN 0.6073 0.0000 0.0017 0.0000 0.3663 0.0075 0.0000 0.0389 0.0000 0.0048

We see here that the \(f_{obs}\) value is around 13.5 with \(df_G = k - 1 = 11 - 1 = 10\) and \(df_E = n_{total} - k = 6000 - 11 = 5989\).

Compute \(p\)-value

The \(p\)-value—the probability of observing an \(F(df_G = 10, df_E = 5989)\) value of 15.4 or more in our null distribution—is essentially 0. This can also be calculated in R directly:

pf(13.5, df1 = 10, df2 = 5989, lower.tail = FALSE)

## [1] 0.0000000000000000000000086392

Note that we could also do this test directly without invoking the inference function using the aov function. aov stands for analysis of variance and its form is similar to what is done using the lm function with linear regression. It fits an analysis of variance model to the data in the same way that lm fits a linear regression model to the data. summary displays the resulting ANOVA table for the model fit.

pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)

##               Df  Sum Sq Mean Sq F value              Pr(>F)    
## carrier       10  130335   13033    13.5 <0.0000000000000002 ***
## Residuals   5989 5785272     966                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

State conclusion

We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference existed in the means was backed by this statistical analysis. We have evidence to suggest that departure delay is related to carrier. If we’d like to get a sense for how the codes relate to the actual names of the airlines, we can use the inner_join function in the dplyr package to match the carrier code with its corresponding name in the airlines table:

data(airlines, package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summ %>% select(carrier, mean, sd), 
           y = airlines, by = "carrier")
kable(pdx_join)

carrier	mean	sd	name
AA	13.20779	75.622	American Airlines Inc.
AS	0.33052	20.067	Alaska Airlines Inc.
B6	4.20571	22.688	JetBlue Airways
DL	2.14828	22.262	Delta Air Lines Inc.
F9	9.57746	56.272	Frontier Airlines Inc.
HA	-1.81081	12.927	Hawaiian Airlines Inc.
OO	4.50405	28.384	SkyWest Airlines Inc.
UA	9.05455	34.287	United Air Lines Inc.
US	2.20243	25.027	US Airways Inc.
VX	2.33333	18.733	Virgin America
WN	12.06385	30.318	Southwest Airlines Co.

Final note

With the conditions near being (or possibly) violated, one should use randomization to compare our \(p\)-value there with the value here to see if the assumptions may have been violated. One could also assess whether the sampling distribution of the \(F\) statistic matches well with a Fisher’s \(F\) distribution using randomization as well. If the conditions are reasonable, the next step would be to calculate pairwise analyses to better understand the reasons for the rejection of the null hypotheses in the ANOVA.

ANOVA Example for Portland departing flights in 2014