Average income varies from one region of the country to another, and it often reflects both lifestyles and regional living expenses. Suppose a new graduate is considering a job in two locations, Cleveland, OH and Sacramento, CA, and he wants to see whether the average income in one of these cities is higher than the other. He would like to conduct a hypothesis test based on two randomly selected samples from the 2000 Census. (Tweaked a bit from Diez, Barr, and Cetinkaya-Rundel 2015 [Chapter 5])
In plain English:
Null hypothesis: There is no association between income and location (Cleveland, OH and Sacramento, CA).
Alternative hypothesis: There is an association between income and location (Cleveland, OH and Sacramento, CA).
In terms of the population means:
Null hypothesis: The mean income is the same for both cities.
Alternative hypothesis: The mean income is different for the two cities.
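Equivalently, writing \(\mu_{cle}\) and \(\mu_{sac}\) for the unknown population mean incomes (the notation used later in this section), the hypotheses can be stated as
\[ H_0: \mu_{sac} - \mu_{cle} = 0 \quad \text{versus} \quad H_A: \mu_{sac} - \mu_{cle} \neq 0. \]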
It’s important to set the significance level before beginning any analysis of the data. Let’s set the significance level at 5% here.
library(dplyr)
library(knitr)
library(ggplot2)
download.file("http://ismayc.github.io/teaching/sample_problems/cleSac.txt",
              destfile = "cleSac.txt",
              method = "curl")
cleSac <- read.delim("cleSac.txt") %>%
  rename(metro_area = Metropolitan_area_Detailed,
         income = Total_personal_income) %>%
  na.omit()
inc_summ <- cleSac %>% group_by(metro_area) %>%
  summarize(sample_size = n(),
            mean = mean(income),
            sd = sd(income),
            minimum = min(income),
            lower_quartile = quantile(income, 0.25),
            median = median(income),
            upper_quartile = quantile(income, 0.75),
            max = max(income))
kable(inc_summ)
| metro_area | sample_size | mean | sd | minimum | lower_quartile | median | upper_quartile | max |
|---|---|---|---|---|---|---|---|---|
| Cleveland_ OH | 212 | 27467 | 27681 | 0 | 8475 | 21000 | 35275 | 152400 |
| Sacramento_ CA | 175 | 32428 | 35774 | 0 | 8050 | 20000 | 49350 | 206900 |
The boxplot below also shows the mean for each group highlighted by the red dots.
cleSac %>% ggplot(aes(x = metro_area, y = income)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", color = "red")
We are looking to see if a difference exists in the mean income of the two levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that no difference exists. The distributions of income seem similar and the means fall in roughly the same place.
Next we will assign some key values to variable names in R:
xbar_cle <- inc_summ$mean[1]     # sample mean income for Cleveland
xbar_sac <- inc_summ$mean[2]     # sample mean income for Sacramento
sd_cle <- inc_summ$sd[1]         # sample standard deviation for Cleveland
sd_sac <- inc_summ$sd[2]         # sample standard deviation for Sacramento
obs_diff <- xbar_sac - xbar_cle  # observed difference in sample means
n_cle <- inc_summ$sample_size[1] # sample size for Cleveland
n_sac <- inc_summ$sample_size[2] # sample size for Sacramento
In order to see whether the observed sample mean for Sacramento of 32427.543 is statistically different than that for Cleveland of 27467.066, we need to account for the sample sizes. Note that this is the same as looking to see if \(\bar{x}_{sac} - \bar{x}_{cle}\) is statistically different than 0. We also need to determine a process that replicates how the original group sizes of 212 and 175 were selected.
We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using shuffling from that simulated population to account for sampling variability.
library(mosaic)
set.seed(2016)
many_shuffles <- do(10000) *
  (cleSac %>%
     mutate(income = shuffle(income)) %>%
     group_by(metro_area) %>%
     summarize(mean_inc = mean(income))
  )
null_distn <- many_shuffles %>%
  group_by(.index) %>%
  summarize(diffmean = diff(mean_inc))
null_distn %>% ggplot(aes(x = diffmean)) +
  geom_histogram(bins = 30, color = "white")
We can next use this distribution to calculate our \(p\)-value. Recall that this is a two-tailed test, so we will be looking for values that are greater than or equal to 4960.477 or less than or equal to -4960.477.
null_distn %>% ggplot(aes(x = diffmean)) +
  geom_histogram(bins = 30, color = "white") +
  geom_vline(color = "red", xintercept = obs_diff) +
  geom_vline(color = "red", xintercept = -obs_diff)
pvalue <- null_distn %>%
  filter( (diffmean >= obs_diff) | (diffmean <= -obs_diff) ) %>%
  nrow() / nrow(null_distn)
pvalue
pvalue
## [1] 0.122
So our \(p\)-value is 0.122 and we fail to reject the null hypothesis at the 5% level. You can also see from the histogram above that the observed difference is not very far into the tail of the null distribution.
We can also create a confidence interval for the unknown population parameter \(\mu_{sac} - \mu_{cle}\) using our sample data with bootstrapping. Here we will bootstrap each of the groups with replacement instead of shuffling. This is done using the `groups` argument in the `resample` function to fix the size of each group to be the same as the original group sizes of 175 for Sacramento and 212 for Cleveland.
boot_means <- do(10000) *
  cleSac %>%
  resample(replace = TRUE, groups = metro_area) %>%
  group_by(metro_area) %>%
  summarize(mean_inc = mean(income))
Next, we calculate the difference in sample means for each of the 10,000 replications:
boot_distn <- boot_means %>%
  group_by(.index) %>%
  summarize(diffmean = diff(mean_inc))
boot_distn %>% ggplot(aes(x = diffmean)) +
  geom_histogram(bins = 30, color = "white")
(ci_boot <- boot_distn %>% summarize(lower = quantile(diffmean, probs = 0.025),
                                     upper = quantile(diffmean, probs = 0.975)))
## # A tibble: 1 × 2
## lower upper
## <dbl> <dbl>
## 1 -1513 11459
We see that 0 is contained in this confidence interval as a plausible value of \(\mu_{sac} - \mu_{cle}\) (the unknown population parameter). This matches with our hypothesis test results of failing to reject the null hypothesis. Since zero is a plausible value of the population parameter, we do not have evidence that Sacramento incomes are different than Cleveland incomes.
Interpretation: We are 95% confident that the true mean yearly income for those living in Sacramento is between 1512.59 dollars lower and 11458.85 dollars higher than the true mean yearly income for those living in Cleveland.
Note: You could also use the null distribution based on randomization, shifted to have its center at \(\bar{x}_{sac} - \bar{x}_{cle} = \$4960.48\) instead of at 0, and calculate its percentiles, as sketched below. The confidence interval produced via this method should be comparable to the one done using bootstrapping above.
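Here is a minimal sketch of that shift method, reusing the `null_distn` and `obs_diff` objects from above (the name `ci_shift` is introduced here just for illustration):
# Recenter the randomization null distribution at the observed difference
# and take its middle 95% as an interval (ci_shift is an illustrative name)
ci_shift <- null_distn %>%
  mutate(shifted = diffmean + obs_diff) %>%
  summarize(lower = quantile(shifted, probs = 0.025),
            upper = quantile(shifted, probs = 0.975))
ci_shift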
Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met.
Independent observations: The observations are independent in both groups.
This condition is met since the cases are randomly selected from each city.
Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30.
cleSac %>% ggplot(aes(x = income)) +
  geom_histogram(color = "white", binwidth = 20000) +
  facet_wrap(~ metro_area)
We have some reason to doubt the normality assumption here, since both histograms show deviation from what a normal model would predict for each group. The sample sizes for each group are greater than 100, though, so the large-sample condition is met and the theory-based method should still perform well.
Independent samples: The samples should be collected without any natural pairing.
There is no mention of there being a relationship between those selected in Cleveland and in Sacramento.
The test statistic is a random variable based on the sample data. Here, we are interested in seeing if our observed difference in sample means (\(\bar{x}_{sac, obs} - \bar{x}_{cle, obs}\) = 4960.477) is statistically different than 0. Assuming that conditions are met and the null hypothesis is true, we can use the \(t\) distribution to standardize the difference in sample means (\(\bar{X}_{sac} - \bar{X}_{cle}\)) using the approximate standard error of \(\bar{X}_{sac} - \bar{X}_{cle}\) (invoking \(S_{sac}\) and \(S_{cle}\) as estimates of unknown \(\sigma_{sac}\) and \(\sigma_{cle}\)).
\[ T = \dfrac{ (\bar{X}_1 - \bar{X}_2) - 0}{ \sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}} } \sim t\left(df = \min(n_1 - 1, n_2 - 1)\right) \] where 1 = Sacramento and 2 = Cleveland, with \(S_1^2\) and \(S_2^2\) the sample variances of the incomes of the two cities, respectively, and \(n_1 = 175\) for Sacramento and \(n_2 = 212\) for Cleveland.
Important note: This is frequently interpreted as the “variability in the sample means” divided by the “variability in the samples”.
Note that we could also do (ALMOST) this test directly using the `t.test` function. The `x` and `y` arguments are expected to both be numeric vectors here, so we’ll need to appropriately filter our data set.
cleveland <- cleSac %>% filter(metro_area == "Cleveland_ OH")
sacramento <- cleSac %>% filter(metro_area != "Cleveland_ OH")
t.test(y = cleveland$income, x = sacramento$income,
       alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: sacramento$income and cleveland$income
## t = 2, df = 300, p-value = 0.1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1543 11464
## sample estimates:
## mean of x mean of y
## 32428 27467
Note that the degrees of freedom reported above differ from what we used in specifying the test statistic. The degrees of freedom used here come from the Satterthwaite approximation, which involves a quite complicated formula (sketched below). For most problems, the much simpler choice of the smaller of the two sample sizes minus one will suffice.
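For the curious, here is a minimal sketch of that Satterthwaite (Welch) formula, built from the summary values stored earlier; the names `se2_sac`, `se2_cle`, and `df_welch` are introduced here just for illustration:
# Welch-Satterthwaite approximate degrees of freedom, computed from the
# stored sample standard deviations and sample sizes (illustrative names)
se2_sac <- sd_sac^2 / n_sac  # estimated variance of the Sacramento sample mean
se2_cle <- sd_cle^2 / n_cle  # estimated variance of the Cleveland sample mean
df_welch <- (se2_sac + se2_cle)^2 /
  (se2_sac^2 / (n_sac - 1) + se2_cle^2 / (n_cle - 1))
df_welch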
While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and on understanding which formula for the test statistic applies.
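Still, as a quick sanity check, the observed statistic falls out directly from the values stored earlier (the name `t_obs` is introduced here just for illustration):
# Observed test statistic: difference in sample means divided by its
# approximate standard error (t_obs is an illustrative name)
t_obs <- obs_diff / sqrt( (sd_sac^2 / n_sac) + (sd_cle^2 / n_cle) )
t_obs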
We see here that the observed test statistic value is around 1.5 with \(df = \min(212 - 1, 175 - 1) = 174\). Recall that for large degrees of freedom, the \(t\) distribution is roughly equal to the standard normal curve, so our difference in `df` between the Satterthwaite and “min” variations doesn’t really matter.
Let’s investigate how the null distribution compares for the permutation testing results to what we use with the \(t\) distribution in the traditional method. To get the results on the same scale we will need to divide what we have in the null distribution of the permutation test (\(\bar{X}_{sac} - \bar{X}_{cle}\)) by the appropriate denominator to create a \(T\) statistic:
\[ T = \dfrac{\bar{X}_{sac} - \bar{X}_{cle}}{\sqrt{\dfrac{S_{sac}^2}{n_{sac}} + \dfrac{S_{cle}^2}{n_{cle}}}} \]
null_distn <- null_distn %>%
  mutate(t_stat = diffmean / sqrt( (sd_sac^2 / n_sac) + (sd_cle^2 / n_cle) ))
ggplot(data = null_distn, mapping = aes(x = t_stat)) +
  geom_histogram(aes(y = after_stat(density)), color = "white", binwidth = 0.3) +
  stat_function(fun = dt,
                args = list(df = min(n_sac - 1, n_cle - 1)),
                color = "royalblue", linewidth = 2)
We see that the two distributions match up very closely, as we would expect since the conditions for the traditional test were met.
The \(p\)-value (the probability of observing a \(t_{174}\) value of 1.5 or more extreme in either direction under our null distribution) is around 0.135. This can also be calculated in R directly:
2 * pt(1.5, df = min(212 - 1, 175 - 1), lower.tail = FALSE)
## [1] 0.135
We can also approximate by using the standard normal curve:
2 * pnorm(1.5, lower.tail = FALSE)
## [1] 0.134
Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping; a direct comparison is sketched below.
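One way to make that comparison explicit is to pull the theory-based interval out of the `t.test` result via its `conf.int` component and print it alongside the bootstrap interval (the name `t_ci` is introduced here just for illustration):
# Theory-based (Welch) confidence interval next to the bootstrap interval
t_ci <- t.test(y = cleveland$income, x = sacramento$income,
               alternative = "two.sided")$conf.int
t_ci
ci_boot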
We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial impression from the boxplot that no statistically significant difference exists between the means was backed up by this analysis. We do not have evidence to suggest that the true mean income differs between Cleveland, OH and Sacramento, CA based on these data.
Looking at the bootstrap distribution and the null distribution we created, it makes sense that the traditional and non-traditional methods produce such similar results for the \(p\)-value and the confidence interval: both distributions look very similar to normal distributions. With the conditions met as well, we would expect the traditional (formula-based) and non-traditional (computation-based) methods to lead to similar results.
Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel. 2015. *OpenIntro Statistics*, Third Edition.