The azflights24 R package is a modified version of
Hadley Wickham’s nycflights13 package and contains
information about all flights that departed from the five major airports
of Arizona in 2024, including PHX in Phoenix and TUS in Tucson. To help
understand what causes delays, it also includes a number of other useful
datasets:
weather: hourly meteorological data for each
airportplanes: construction information about each planeairports: airport names and locationsairlines: translation between two letter carrier codes
and namesIn this example, we will focus on a random sample of flights departing Phoenix (PHX) with no missing information:
data("flights", package = "azflights24")
phx_flights <- flights |>
filter(origin == "PHX") |>
select(carrier, dep_delay) |>
drop_na()
set.seed(2016)
phx_rs <- phx_flights |> sample_n(6000)We are interested in testing whether there is a difference in the
mean departure delays by airline/carrier for departing
flights from Phoenix in 2024. The carrier variable is coded
as a 2-character code. The corresponding airlines dataset
mentioned above contains the full name of the airline and we will link
this together with our flights dataset later.
Null hypothesis: The mean departure delays are the same for all 12 carriers for all 2024 flights departing Phoenix, Arizona.
Alternative hypothesis: At least one of the population mean departure delays is different.
Null hypothesis: There is no association between departure delay and airline in the population of interest.
Alternative hypothesis: There is an association between the variables in the population.
It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here.
phx_summ <- phx_rs |>
group_by(carrier) |>
summarize(sample_size = n(),
mean = mean(dep_delay),
sd = sd(dep_delay),
minimum = min(dep_delay),
lower_quartile = quantile(dep_delay, 0.25),
median = median(dep_delay),
upper_quartile = quantile(dep_delay, 0.75),
max = max(dep_delay))
kable(phx_summ)| carrier | sample_size | mean | sd | minimum | lower_quartile | median | upper_quartile | max |
|---|---|---|---|---|---|---|---|---|
| AA | 1683 | 14.81640 | 60.7271 | -17 | -5.00 | -2.0 | 8.00 | 782 |
| AS | 147 | 6.57143 | 31.1327 | -22 | -8.50 | -4.0 | 6.00 | 189 |
| B6 | 31 | 34.87097 | 79.8932 | -17 | -5.50 | 7.0 | 23.50 | 363 |
| DL | 344 | 7.12209 | 52.3905 | -18 | -6.00 | -2.0 | 4.00 | 833 |
| F9 | 284 | 11.53873 | 47.9082 | -18 | -10.00 | -5.5 | 8.25 | 487 |
| G4 | 14 | 30.28571 | 63.3299 | -14 | -10.75 | 7.5 | 35.75 | 192 |
| HA | 15 | -0.53333 | 5.5403 | -10 | -3.00 | -2.0 | 0.00 | 12 |
| MQ | 266 | 4.69549 | 20.9382 | -10 | -4.00 | -2.0 | 0.00 | 138 |
| NK | 35 | 15.97143 | 31.8937 | -14 | -4.50 | 4.0 | 20.50 | 109 |
| OO | 787 | 6.07497 | 38.8151 | -13 | -7.00 | -4.0 | 0.00 | 631 |
| UA | 310 | 10.21613 | 61.9166 | -18 | -6.75 | -4.0 | 2.00 | 839 |
| WN | 2084 | 12.35701 | 27.1891 | -13 | -2.00 | 2.0 | 15.00 | 239 |
The side-by-side boxplot below also shows the mean for each group highlighted by the red dots. We remove the outliers and zoom in on the interquartile range to make the plot more readable.
phx_rs |> ggplot(aes(x = carrier, y = dep_delay)) +
geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim = c(-20, 45)) +
stat_summary(fun = "mean", geom = "point", color = "red")We are looking to see if a difference exists in the mean departure delay across the levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that a difference exists, but the overlap of the boxplots is potentially a bit concerning.
We’ll use the infer package to carry
out the simulation-based workflow with the \(F\) statistic. Because the dataset is
large, we use 1,000 replicates here.
F_hat <- phx_rs |>
specify(dep_delay ~ carrier) |>
hypothesize(null = "independence") |>
calculate(stat = "F")
F_hat## Response: dep_delay (numeric)
## Explanatory: carrier (factor)
## Null Hypothesis: ...
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 4.04
Under the null of no association, the carrier labels are
exchangeable, so infer shuffles them and recomputes the
\(F\) statistic to build the null
distribution.
set.seed(2018)
null_distn <- phx_rs |>
specify(dep_delay ~ carrier) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "F")A big \(F\) ratio corresponds to between-group variability over-powering within-group variability, so this is a one-sided (right-tailed) test.
We can also overlay the theoretical \(F\) distribution:
Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.
Independent observations: The observations are independent both within and across groups.
A pseudorandom generator was used in R to select a sample of 6000 flights from our population of all flights. We also assume that the different airlines act independently, which may not always be the case, but shouldn’t affect the results greatly here.
Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30.
The Q-Q plots should result in straight lines matching the theoretical quantiles of a standard normal distribution if the data were in fact normal. We have serious reason to doubt they are normally distributed, which intuitively makes sense since flight departure delays are likely to have a right-skewed distribution. The Central Limit Theorem can be invoked with reasonable certainty since each of the groups has many more than 30 observations.
Constant variance: The variance in the groups is about equal from one group to the next.
The standard deviations in the table above lead us to possibly suspect that the standard deviations of the groups in the population are not similar, but they are on the same order of magnitude so we proceed with some caution.
The \(F\) statistic is the ratio
\[ F = \frac{MSG}{MSE}, \]
where \(MSG\) measures the between group variability and \(MSE\) is a pooled estimate of the within group variability.
We fit the model with lm() and summarize it with the
moderndive package.
get_regression_summaries() reports the overall \(F\) statistic and \(p\)-value — which for a single categorical
predictor is exactly the one-way ANOVA \(F\)-test.
## # A tibble: 1 × 9
## r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.007 0.006 2040. 45.2 45.2 4.04 0 11 6000
We see that the statistic (the observed \(F\) value) matches the F_hat
computed with infer above.
We, therefore, have sufficient evidence to reject the null
hypothesis. Our initial guess that a statistically significant
difference existed in the means was backed by this statistical analysis.
We have evidence to suggest that departure delay is related to carrier.
If we’d like to get a sense for how the codes relate to the actual names
of the airlines, we can use inner_join() to match the
carrier code with its corresponding name in
the airlines table:
data(airlines, package = "azflights24")
phx_join <- phx_summ |>
select(carrier, mean, sd) |>
inner_join(airlines, by = "carrier")
kable(phx_join)| carrier | mean | sd | name |
|---|---|---|---|
| AA | 14.81640 | 60.7271 | American Airlines Inc. |
| AS | 6.57143 | 31.1327 | Alaska Airlines Inc. |
| B6 | 34.87097 | 79.8932 | JetBlue Airways |
| DL | 7.12209 | 52.3905 | Delta Air Lines Inc. |
| F9 | 11.53873 | 47.9082 | Frontier Airlines Inc. |
| G4 | 30.28571 | 63.3299 | Allegiant Air |
| HA | -0.53333 | 5.5403 | Hawaiian Airlines Inc. |
| MQ | 4.69549 | 20.9382 | Envoy Air |
| NK | 15.97143 | 31.8937 | Spirit Air Lines |
| OO | 6.07497 | 38.8151 | SkyWest Airlines Inc. |
| UA | 10.21613 | 61.9166 | United Air Lines Inc. |
| WN | 12.35701 | 27.1891 | Southwest Airlines Co. |
The simulation-based and theory-based approaches give very similar \(p\)-values here. With the normality condition violated but large group sizes, the Central Limit Theorem supports the theory-based test, and the close agreement between the randomization and \(F\)-distribution results is reassuring. If the conditions are reasonable, the next step would be to calculate pairwise analyses to better understand the reasons for the rejection of the null hypothesis in the ANOVA.