A particular brand of candy-coated chocolate comes in five different colors: brown, yellow, orange, green, and coffee. The manufacturer of the candy says the candies are distributed in the following proportions: brown - 40%, yellow - 20%, orange = 20%, and the remaining are split evenly between green and coffee. A random sample of 580 pieces of this candy are collected. Does this random sample provide evidence against the manufacturer’s claim? [Tweaked a bit from https://onlinecourses.science.psu.edu/stat500/node/56]
Null hypothesis: The true proportions of a particular brand of candy-coated chocolate match what the manufacturer states: brown - 40%, coffee - 10%, green - 10%, orange = 20%, and yellow - 20%.
Alternative hypothesis: The distribution of candy proportions differs from what the manufacturer states.
It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here.
library(dplyr)
library(knitr)
library(ggplot2)
#download.file("https://raw.githubusercontent.com/ismayc/ismayc.github.io/master/teaching/sample_problems/candies.csv", destfile = "candies.csv")
candies <- read.csv("candies.csv")
prop.table(table(candies$candy_colors))
##
## brown coffee green orange yellow
## 0.386207 0.101724 0.082759 0.224138 0.205172
The bar below also shows the distribution of candy_colors
.
qplot(x = candy_colors, data = candies, geom = "bar")
These proportions all look relatively close to the expected distributions. The sample size is relatively big, so that might lead us to reject the null hypothesis. Green and orange both have observed proportions that may lead us to reject. Let’s make the guess that we will reject \(H_0\), but it might be close.
Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.
The cases are selected independently through random sampling so this condition is met.
Expected cell counts: All expected cell counts are at least 5.
\(580 \cdot 0.1 = 58\) and since 0.1 is the smallest expected proportion, this condition is met.
Degrees of freedom: The degrees of freedom must be at least 2.
There are five different groups of candy colors here so our degrees of freedom is 4.
The test statistic is a random variable based on the sample data. Here, we want to look at a way to measure the differences between what we actually observed for counts and what we expected for counts. The Pearson’s \(\chi^2\) test statistic does just this but looking at the ratio of the squared difference in observed minus expected counts divided by expected counts and then summing this ratio over all cells. If the conditions are met and assuming the null hypothesis is true, this test statistic follows a \(\chi^2\) distribution with \(k - 1\) degrees of freedom with \(k\) corresponding to the number of groups of the variable of interest.
\[ ( X^2 = \sum_{\text{all cells}} \dfrac{\text{(observed - expected)}^2}{\text{expected}} \sim \chi^2 (df = k - 1) \]
While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. The inference
function in the oilabs
package does not currently have a built-in analysis for the Goodness of Fit test. We instead will use the chisq.test
function:
chisq.test(x = table(candies$candy_colors),
p = c(0.4, 0.1, 0.1, 0.2, 0.2))
##
## Chi-squared test for given probabilities
##
## data: table(candies$candy_colors)
## X-squared = 3.78, df = 4, p-value = 0.44
We see here that the \(x^2_{obs}\) value is around 3.78 with 4 degrees of freedom.
The \(p\)-value—the probability of observing an \(x^2_{obs}\) value of 3.78 or more in our null distribution—is around 44%. This can also be calculated in R directly:
pchisq(3.78, df = 4, lower.tail = FALSE)
## [1] 0.4366
We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial guess that we would find evidence that the distribution differed compared to what was expected was incorrect. Based on this sample, we have do not evidence that the true proportions of candy colors for this type of candy is different from what the manufacturer claimed.