Problem Statement

A particular brand of candy-coated chocolate comes in five different colors: brown, yellow, orange, green, and coffee. The manufacturer of the candy says the candies are distributed in the following proportions: brown - 40%, yellow - 20%, orange - 20%, and the remaining are split evenly between green and coffee. A random sample of 580 pieces of this candy is collected. Does this random sample provide evidence against the manufacturer’s claim? [Tweaked a bit from https://onlinecourses.science.psu.edu/stat500/node/56]

Competing Hypotheses

In words

  • Null hypothesis: The true proportions of a particular brand of candy-coated chocolate match what the manufacturer states: brown - 40%, coffee - 10%, green - 10%, orange - 20%, and yellow - 20%.

  • Alternative hypothesis: The distribution of candy proportions differs from what the manufacturer states.

In symbols (with annotations)

  • \(H_0: P_1 = p_{1, 0}, P_2 = p_{2, 0}, P_3 = p_{3, 0}, P_4 = p_{4, 0}, P_5 = p_{5, 0}\), where \(P_i\) represents the true proportion of candies of color \(i\) (1 is “brown”, 2 is “coffee”, 3 is “green”, 4 is “orange”, and 5 is “yellow”) and \((p_{1, 0}, p_{2, 0}, p_{3, 0}, p_{4, 0}, p_{5, 0}) = (0.4, 0.1, 0.1, 0.2, 0.2)\).
  • \(H_a:\) At least one \(P_i \ne p_{i, 0}\) for \(i \in \{ 1, \ldots, 5 \}\)
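
For later use, the hypothesized proportions in \(H_0\) can be stored in R. This is a small setup sketch; the vector name null_props is my own, and the colors are listed in alphabetical order to match how R will tabulate them later.

# Hypothesized proportions under H_0 (brown, coffee, green, orange, yellow)
null_props <- c(brown = 0.4, coffee = 0.1, green = 0.1, orange = 0.2, yellow = 0.2)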

Set \(\alpha\)

It’s important to set the significance level before analyzing the data. Let’s set the significance level at 5% here.

Exploring the sample data

library(dplyr)
library(knitr)
library(ggplot2)
#download.file("https://raw.githubusercontent.com/ismayc/ismayc.github.io/master/teaching/sample_problems/candies.csv", destfile = "candies.csv")
candies <- read.csv("candies.csv")
prop.table(table(candies$candy_colors))
## 
##    brown   coffee    green   orange   yellow 
## 0.386207 0.101724 0.082759 0.224138 0.205172

The bar chart below also shows the distribution of candy_colors.

qplot(x = candy_colors, data = candies, geom = "bar")

Guess about statistical significance

These proportions all look relatively close to the hypothesized proportions. The sample size is relatively large, though, so even modest deviations might turn out to be statistically significant. Green and orange show the largest departures from their expected proportions, which may lead us to reject. Let’s make the guess that we will reject \(H_0\), but it might be close.
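
To make this comparison concrete, we can line up the observed proportions with the hypothesized ones. This snippet assumes the null_props vector sketched earlier.

# Observed proportions next to the hypothesized proportions
obs_props <- prop.table(table(candies$candy_colors))
round(rbind(observed = as.numeric(obs_props), hypothesized = null_props), 3)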

Check conditions

Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met.

  1. Independent observations: The observations are collected independently.

    The cases are selected independently through random sampling, so this condition is met.

  2. Expected cell counts: All expected cell counts are at least 5.

    The smallest hypothesized proportion is 0.1, and \(580 \cdot 0.1 = 58 \ge 5\), so this condition is met (see the quick check after this list).

  3. Degrees of freedom: The degrees of freedom must be at least 2.

    There are five different groups of candy colors here, so the degrees of freedom are \(5 - 1 = 4\) and this condition is met.
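
Here is the quick check of the expected counts mentioned in condition 2, assuming the null_props vector from earlier; each expected count is the sample size times the hypothesized proportion.

# Expected counts under H_0: sample size times hypothesized proportions
expected_counts <- nrow(candies) * null_props
expected_counts                # 232, 58, 58, 116, 116
all(expected_counts >= 5)      # TRUE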

Test statistic

The test statistic is a random variable based on the sample data. Here, we want a way to measure the differences between the counts we actually observed and the counts we expected. Pearson’s \(\chi^2\) test statistic does just this: for each cell it takes the squared difference between the observed and expected counts, divides by the expected count, and then sums this ratio over all cells. If the conditions are met and the null hypothesis is true, this test statistic follows a \(\chi^2\) distribution with \(k - 1\) degrees of freedom, where \(k\) is the number of groups of the variable of interest.

\[ X^2 = \sum_{\text{all cells}} \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}} \sim \chi^2 (df = k - 1) \]
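
To make the formula concrete, here is a sketch of this calculation done directly in R, again assuming the null_props vector from earlier; the result should agree with the chisq.test output in the next section.

# Chi-squared statistic computed directly from the formula above
observed <- table(candies$candy_colors)
expected <- sum(observed) * null_props
x2_obs <- sum((observed - expected)^2 / expected)
x2_obs                         # roughly 3.78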

Observed test statistic

While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and on understanding which formula for the test statistic applies. The inference function in the oilabs package does not currently have a built-in analysis for the Goodness of Fit test, so we will instead use the chisq.test function:

chisq.test(x = table(candies$candy_colors), 
           p = c(0.4, 0.1, 0.1, 0.2, 0.2))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(candies$candy_colors)
## X-squared = 3.78, df = 4, p-value = 0.44
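
As an aside, chisq.test returns an htest object, so the statistic, expected counts, and \(p\)-value can also be saved and extracted directly:

# Save the result to pull out individual components
gof <- chisq.test(x = table(candies$candy_colors),
                  p = c(0.4, 0.1, 0.1, 0.2, 0.2))
gof$statistic   # observed X-squared value
gof$expected    # expected counts under H_0
gof$p.value     # p-value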

We see here that the \(X^2_{obs}\) value is around 3.78 with 4 degrees of freedom.

Compute \(p\)-value

The \(p\)-value, the probability of observing an \(X^2_{obs}\) value of 3.78 or more in our null distribution, is around 44%. This can also be calculated in R directly:

pchisq(3.78, df = 4, lower.tail = FALSE)
## [1] 0.4366
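
One way to see why this \(p\)-value is so large is to plot the \(\chi^2\) distribution with 4 degrees of freedom and mark the observed statistic. The sketch below uses ggplot2, which is already loaded above; it is an optional illustration rather than part of the test itself.

# Null distribution (chi-squared with df = 4) with the observed statistic marked
ggplot(data.frame(x = c(0, 15)), aes(x = x)) +
  stat_function(fun = dchisq, args = list(df = 4)) +
  geom_vline(xintercept = 3.78, linetype = "dashed")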

State conclusion

We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial guess that we would find evidence that the distribution differed from what was expected was incorrect. Based on this sample, we do not have evidence that the true proportions of candy colors for this type of candy differ from what the manufacturer claimed.