Problem Statement

The azflights24 R package is a modified version of Hadley Wickham’s nycflights13 package and contains information about all flights that departed from the five major airports of Arizona in 2024, including PHX in Phoenix and TUS in Tucson. To help understand what causes delays, it also includes a number of other useful datasets:

weather: hourly meteorological data for each airport
planes: construction information about each plane
airports: airport names and locations
airlines: translation between two letter carrier codes and names

In this example, we will focus on a random sample of flights departing Phoenix (PHX) with no missing information:

data("flights", package = "azflights24")
phx_flights <- flights |>
  filter(origin == "PHX") |>
  select(carrier, dep_delay) |>
  drop_na()
set.seed(2016)
phx_rs <- phx_flights |> sample_n(6000)

We are interested in testing whether there is a difference in the mean departure delays by airline/carrier for departing flights from Phoenix in 2024. The carrier variable is coded as a 2-character code. The corresponding airlines dataset mentioned above contains the full name of the airline and we will link this together with our flights dataset later.

Competing Hypotheses

In words

Null hypothesis: The mean departure delays are the same for all 12 carriers for all 2024 flights departing Phoenix, Arizona.
Alternative hypothesis: At least one of the population mean departure delays is different.

Another way with words

Null hypothesis: There is no association between departure delay and airline in the population of interest.
Alternative hypothesis: There is an association between the variables in the population.

Set \(\alpha\)

It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here.

Exploring the sample data

phx_summ <- phx_rs |>
  group_by(carrier) |>
  summarize(sample_size = n(),
    mean = mean(dep_delay),
    sd = sd(dep_delay),
    minimum = min(dep_delay),
    lower_quartile = quantile(dep_delay, 0.25),
    median = median(dep_delay),
    upper_quartile = quantile(dep_delay, 0.75),
    max = max(dep_delay))
kable(phx_summ)

carrier	sample_size	mean	sd	minimum	lower_quartile	median	upper_quartile	max
AA	1683	14.81640	60.7271	-17	-5.00	-2.0	8.00	782
AS	147	6.57143	31.1327	-22	-8.50	-4.0	6.00	189
B6	31	34.87097	79.8932	-17	-5.50	7.0	23.50	363
DL	344	7.12209	52.3905	-18	-6.00	-2.0	4.00	833
F9	284	11.53873	47.9082	-18	-10.00	-5.5	8.25	487
G4	14	30.28571	63.3299	-14	-10.75	7.5	35.75	192
HA	15	-0.53333	5.5403	-10	-3.00	-2.0	0.00	12
MQ	266	4.69549	20.9382	-10	-4.00	-2.0	0.00	138
NK	35	15.97143	31.8937	-14	-4.50	4.0	20.50	109
OO	787	6.07497	38.8151	-13	-7.00	-4.0	0.00	631
UA	310	10.21613	61.9166	-18	-6.75	-4.0	2.00	839
WN	2084	12.35701	27.1891	-13	-2.00	2.0	15.00	239

The side-by-side boxplot below also shows the mean for each group highlighted by the red dots. We remove the outliers and zoom in on the interquartile range to make the plot more readable.

phx_rs |> ggplot(aes(x = carrier, y = dep_delay)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun = "mean", geom = "point", color = "red")

Guess about statistical significance

We are looking to see if a difference exists in the mean departure delay across the levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that a difference exists, but the overlap of the boxplots is potentially a bit concerning.

Non-traditional methods

We’ll use the infer package to carry out the simulation-based workflow with the \(F\) statistic. Because the dataset is large, we use 1,000 replicates here.

Observed statistic

F_hat <- phx_rs |>
  specify(dep_delay ~ carrier) |>
  hypothesize(null = "independence") |>
  calculate(stat = "F")
F_hat

## Response: dep_delay (numeric)
## Explanatory: carrier (factor)
## Null Hypothesis: ...
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  4.04

Randomization for Hypothesis Test

Under the null of no association, the carrier labels are exchangeable, so infer shuffles them and recomputes the \(F\) statistic to build the null distribution.

set.seed(2018)
null_distn <- phx_rs |>
  specify(dep_delay ~ carrier) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "F")

A big \(F\) ratio corresponds to between-group variability over-powering within-group variability, so this is a one-sided (right-tailed) test.

null_distn |>
  visualize() +
  shade_p_value(obs_stat = F_hat, direction = "greater")

We can also overlay the theoretical \(F\) distribution:

null_distn |>
  visualize(method = "both") +
  shade_p_value(obs_stat = F_hat, direction = "greater")

Calculate \(p\)-value

pvalue <- null_distn |>
  get_p_value(obs_stat = F_hat, direction = "greater")
pvalue

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.001

So our \(p\)-value is 0.001 and we reject the null hypothesis at the 5% level.

Traditional methods

Check conditions

Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.

Independent observations: The observations are independent both within and across groups.

A pseudorandom generator was used in R to select a sample of 6000 flights from our population of all flights. We also assume that the different airlines act independently, which may not always be the case, but shouldn’t affect the results greatly here.
Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30.

phx_rs |> ggplot(aes(sample = dep_delay)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ carrier)

The Q-Q plots should result in straight lines matching the theoretical quantiles of a standard normal distribution if the data were in fact normal. We have serious reason to doubt they are normally distributed, which intuitively makes sense since flight departure delays are likely to have a right-skewed distribution. The Central Limit Theorem can be invoked with reasonable certainty since each of the groups has many more than 30 observations.

Constant variance: The variance in the groups is about equal from one group to the next.

The standard deviations in the table above lead us to possibly suspect that the standard deviations of the groups in the population are not similar, but they are on the same order of magnitude so we proceed with some caution.

Test statistic

The \(F\) statistic is the ratio

\[ F = \frac{MSG}{MSE}, \]

where \(MSG\) measures the between group variability and \(MSE\) is a pooled estimate of the within group variability.

Observed test statistic

We fit the model with lm() and summarize it with the moderndive package. get_regression_summaries() reports the overall \(F\) statistic and \(p\)-value — which for a single categorical predictor is exactly the one-way ANOVA \(F\)-test.

phx_mod <- lm(dep_delay ~ carrier, data = phx_rs)
get_regression_summaries(phx_mod)

## # A tibble: 1 × 9
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.007         0.006 2040.  45.2  45.2      4.04       0    11  6000

We see that the statistic (the observed \(F\) value) matches the F_hat computed with infer above.

State conclusion

We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference existed in the means was backed by this statistical analysis. We have evidence to suggest that departure delay is related to carrier. If we’d like to get a sense for how the codes relate to the actual names of the airlines, we can use inner_join() to match the carrier code with its corresponding name in the airlines table:

data(airlines, package = "azflights24")
phx_join <- phx_summ |>
  select(carrier, mean, sd) |>
  inner_join(airlines, by = "carrier")
kable(phx_join)

carrier	mean	sd	name
AA	14.81640	60.7271	American Airlines Inc.
AS	6.57143	31.1327	Alaska Airlines Inc.
B6	34.87097	79.8932	JetBlue Airways
DL	7.12209	52.3905	Delta Air Lines Inc.
F9	11.53873	47.9082	Frontier Airlines Inc.
G4	30.28571	63.3299	Allegiant Air
HA	-0.53333	5.5403	Hawaiian Airlines Inc.
MQ	4.69549	20.9382	Envoy Air
NK	15.97143	31.8937	Spirit Air Lines
OO	6.07497	38.8151	SkyWest Airlines Inc.
UA	10.21613	61.9166	United Air Lines Inc.
WN	12.35701	27.1891	Southwest Airlines Co.

Comparing results

The simulation-based and theory-based approaches give very similar \(p\)-values here. With the normality condition violated but large group sizes, the Central Limit Theorem supports the theory-based test, and the close agreement between the randomization and \(F\)-distribution results is reassuring. If the conditions are reasonable, the next step would be to calculate pairwise analyses to better understand the reasons for the rejection of the null hypothesis in the ANOVA.

ANOVA Example for Phoenix departing flights in 2024