Problem Statement

Trace metals in drinking water affect the flavor and an unusually high concentration can pose a health hazard. Ten pairs of data were taken measuring zinc concentration in bottom water and surface water at 10 randomly selected locations on a stretch of river. Do the data suggest that the true average concentration in the surface water is smaller than that of bottom water? (Note that units are not given.) [Tweaked a bit from https://onlinecourses.science.psu.edu/stat500/node/51]

Competing Hypotheses

In words

  • Null hypothesis: The mean concentration in the bottom water is the same as that of the surface water at different paired locations.

  • Alternative hypothesis: The mean concentration in the surface water is smaller than that of the bottom water at different paired locations.

In symbols (with annotations)

  • \(H_0: \mu_{diff} = 0\), where \(\mu_{diff}\) represents the mean difference in concentration for surface water minus bottom water.
  • \(H_A: \mu_{diff} < 0\)

Set \(\alpha\)

It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here.

Exploring the sample data

library(dplyr)
library(knitr)
library(ggplot2)
library(readr)
download.file("ismayc.github.io/teaching/sample_problems/zinc_tidy.csv",
  destfile = "zinc_tidy.csv",
  method = "curl")
zinc_tidy <- read_csv("zinc_tidy.csv")

We want to look at the differences in surface - bottom for each location:

zinc_diff <- zinc_tidy %>% 
  group_by(loc_id) %>% 
  summarize(pair_diff = diff(concentration))

Descriptive statistics are now given.

zinc_summ <- zinc_diff %>%
  summarize(sample_size = n(),
    mean = mean(pair_diff),
    sd = sd(pair_diff),
    minimum = min(pair_diff),
    lower_quartile = quantile(pair_diff, 0.25),
    median = median(pair_diff),
    upper_quartile = quantile(pair_diff, 0.75),
    max = max(pair_diff))
kable(zinc_summ)
sample_size mean sd minimum lower_quartile median upper_quartile max
10 -0.08 0.05 -0.18 -0.11 -0.08 -0.04 -0.02

The histogram below also shows the distribution of pair_diff.

zinc_diff %>% ggplot(aes(x = pair_diff)) +
  geom_histogram(binwidth = 0.04, color = "white")

Guess about statistical significance

We are looking to see if the sample paired mean difference of -0.08 is statistically less than 0. They seem to be quite close, but we have a small number of pairs here. Let’s guess that we will fail to reject the null hypothesis.


Non-traditional methods

Collecting summary info

Next we will assign some key values to variable names in R:

obs_diff <- zinc_summ$mean
n_pairs <- zinc_summ$sample_size

Randomization for Hypothesis Test

In order to look to see if the observed sample mean difference \(\bar{x}_{diff} = -0.08\) is statistically less than 0, we need to account for the number of pairs. We also need to determine a process that replicates how the paired data was selected in a way similar to how we calculated our original difference in sample means.

We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came and then generate samples using shuffling from that simulated population to account for sampling variability. In this case, we will shuffle along each paired location. So values that were on the bottom of location 1 may now be switched to be on the surface or vice versa.

library(mosaic)
set.seed(2016)
many_shuffles <- do(10000) * 
  (zinc_tidy %>%
     mutate(concentration = shuffle(concentration, groups = loc_id)) %>% 
     group_by(loc_id) %>% 
     summarize(pair_diff = diff(concentration))
   )
null_distn <- many_shuffles %>% 
  group_by(.index) %>%
  summarize(mean_diff = mean(pair_diff))
null_distn %>% ggplot(aes(x = mean_diff)) +
  geom_histogram(bins = 30, color = "white")

We can next use this distribution to observe our \(p\)-value. Recall this is a left-tailed test so we will be looking for values that are less than or equal to -0.08 for our \(p\)-value.

null_distn %>% ggplot(aes(x = mean_diff)) +
  geom_histogram(bins = 30, color = "white") +
  geom_vline(color = "red", xintercept = obs_diff)

Calculate \(p\)-value

pvalue <- null_distn %>%
  filter(mean_diff <= obs_diff) %>%
  nrow() / nrow(null_distn)
pvalue
## [1] 0.0009

So our \(p\)-value is essentially 0 and we reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are far into the left tail of the null distribution.

Bootstrapping for Confidence Interval

We can also create a confidence interval for the unknown population parameter \(\mu_{diff}\) using our sample data (the calculated differences) with bootstrapping. This is similar to the bootstrapping done in a one sample mean case, except now our data is differences instead of raw numerical data.

boot_distn <- do(10000) * 
  resample(zinc_diff, replace = TRUE) %>% 
  summarize(mean_diff = mean(pair_diff))
boot_distn %>% ggplot(aes(x = mean_diff)) +
  geom_histogram(bins = 30, color = "white")

(ci_boot <- boot_distn %>% summarize(lower = quantile(mean_diff, probs = 0.025),
    upper = quantile(mean_diff, probs = 0.975)))
##   lower upper
## 1 -0.11 -0.05

We see that 0 is not contained in this confidence interval as a plausible value of \(\mu_{diff}\) (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. Since zero is not a plausible value of the population parameter and since the entire confidence interval falls below zero, we have evidence that surface zinc concentration levels are lower, on average, than bottom level zinc concentrations.

Interpretation: We are 95% confident the true mean zinc concentration on the surface is between 0.11 units smaller to 0.05 units smaller than on the bottom.

Note: You could also use the null distribution based on randomization with a shift to have its center at \(\bar{x}_{diff} = -0.08\) instead of at 0 and calculate its percentiles. The confidence interval produced via this method should be comparable to the one done using bootstrapping above.


Traditional methods

Check conditions

Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.

  1. Independent observations: The observations among pairs are independent.

The locations are selected independently through random sampling so this condition is met.

  1. Approximately normal: The distribution of population of differences is normal or the number of pairs is at least 30.

    The histogram above does show some skew so we have reason to doubt the population being normal based on this sample. We also only have 10 pairs which is fewer than the 30 needed. A theory-based test may not be valid here.

Test statistic

The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population mean difference \(\mu_{diff}\). A good guess is the sample mean difference \(\bar{X}_{diff}\). Recall that this sample mean is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely is it for us to have observed a sample mean of \(\bar{x}_{diff, obs} = 0.0804\) or larger assuming that the population mean difference is 0 (assuming the null hypothesis is true). If the conditions are met and assuming \(H_0\) is true, we can “standardize” this original test statistic of \(\bar{X}_{diff}\) into a \(T\) statistic that follows a \(t\) distribution with degrees of freedom equal to \(df = n - 1\):

\[ T =\dfrac{ \bar{X}_{diff} - 0}{ S_{diff} / \sqrt{n} } \sim t (df = n - 1) \]

where \(S\) represents the standard deviation of the sample differences and \(n\) is the number of pairs.

Observed test statistic

While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the t.test function on the differences to perform this analysis for us.

stats::t.test(x = zinc_diff$pair_diff, 
       alternative = "less",
       mu = 0)
## 
##  One Sample t-test
## 
## data:  zinc_diff$pair_diff
## t = -5, df = 9, p-value = 0.0004
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##   -Inf -0.05
## sample estimates:
## mean of x 
##     -0.08

We see here that the \(t_{obs}\) value is around -5.

Compute \(p\)-value

The \(p\)-value—the probability of observing an \(t_{obs}\) value of or less in our null distribution of a \(t\) with 9 degrees of freedom—is 0.0004. This can also be calculated in R directly:

pt(-5, df = nrow(zinc_diff) - 1, lower.tail = TRUE)
## [1] 0.00037

State conclusion

We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample mean difference was not statistically less than the hypothesized mean of 0 has been invalidated here. Based on this sample, we have evidence that the mean concentration in the bottom water is greater than that of the surface water at different paired locations.


Comparing results

Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \(p\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions were not met since the number of pairs was small, but the sample data was not highly skewed. Using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) lead to similar results.