Topic 14: Inference for One or More Categorical Variables With Many Levels
This activity introduces \(\chi^2\) (Chi-Squared) tests for Goodness of Fit and Independence — two methods for testing claims involving categorical data with more than two levels. Data used in this activity is simulated, based on the US Census Bureau, The Sentencing Project, and the 2019 Behavioral Risk Factor Surveillance System.
Chi-Squared Tests for Goodness of Fit and Independence
In this activity, we consider two methods for testing claims corresponding to categorical data with possibly more than two levels. The first is called the \(\chi^2\) Goodness of Fit Test, used to test whether a sample provides evidence that an assumed discrete distribution is not an appropriate model for a categorical variable. The second is called the \(\chi^2\) Test of Independence, used to test whether two categorical variables are associated with one another. You’ll be exposed to video explanations and worked examples before trying problems on your own.
Chi-Squared Goodness of Fit
Consider a sociologist interested in better understanding incarceration rates in the state of New Hampshire. The researcher wants to determine whether minority populations are disproportionately represented in State Penitentiaries. Using estimates from the United States Census Bureau, the population of New Hampshire had the following racial breakdown as of 2019: white (89.8%), Black (1.8%), Hispanic (4.0%), other (4.4%). A reasonable expectation is that the incarcerated population should roughly reflect this same distribution.
The researcher took a random sample of 300 inmates in State Penitentiaries across New Hampshire and observed the following results: 243 inmates were white, 24 were Black, 15 were Hispanic, and 18 were of another race. The researcher wants to determine whether this sample provides evidence that the state prison population does not reflect the racial demographics of the State.
We’ll come back to this example shortly, but first let’s watch Dr. Çetinkaya-Rundel introduce the \(\chi^2\) Goodness of Fit Test.
Dr. Çetinkaya-Rundel discussed an example of racial bias in jury selection. Before applying these techniques to our New Hampshire example, let’s recap a few key ideas.
The \(\chi^2\) test statistic does not follow a normal distribution — instead, it follows a \(\chi^2\)-distribution. Although this is a new distribution, the principles remain familiar. The test statistic measures how far our sample falls from what was expected under the null hypothesis, and the \(p\)-value is the corresponding tail probability. Since the \(\chi^2\) test statistic is always non-negative, we are always interested in the upper tail.
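To build intuition for this new distribution, here is a quick sketch in base R (the boundary value 7.81 is the standard \(\chi^2\) critical value for 3 degrees of freedom at \(\alpha = 0.05\); no packages required). It shows that the \(\chi^2\) distribution has no probability below zero and that the upper-tail area can be computed two equivalent ways:

```r
# The chi-squared distribution lives on [0, Inf): no area to the left of 0
pchisq(0, df = 3)

# Upper-tail (right-tail) area beyond a boundary value, two equivalent ways
1 - pchisq(7.81, df = 3)
pchisq(7.81, df = 3, lower.tail = FALSE)
```

The `lower.tail = FALSE` form avoids the subtraction and is slightly more accurate for very large test statistics, but `1 - pchisq()` is the pattern we'll use throughout this activity.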
Running a \(\chi^2\) Goodness of Fit test requires the following conditions:
- The sample must be taken randomly.
- If sampling without replacement, the sample should include less than 10% of the entire population.
- The sample must be large enough that each group has an expected count of at least 5 observations.
Now, back to our application. As a reminder, New Hampshire is estimated to be 89.8% white, 1.8% Black, 4.0% Hispanic, and 4.4% of other races. We have a sample of 300 inmates: 243 white, 24 Black, 15 Hispanic, and 18 of another race. We want to know whether this sample provides evidence that the prison population does not reflect the State’s racial demographics.
What is the variable of interest in this study?
The variable of interest in this study is:
How many measured levels of the variable of interest are there?
In order to answer the question as posed, we should:
Which of the following are the correct hypotheses associated with this test? Select all that apply.
Now compute the expected counts for each racial group in the sample of 300 inmates.
Expected count — white inmates:
What proportion of New Hampshire’s population is white?
The expected count is the total number of prisoners multiplied by the proportion of the State’s population in that racial group.
___ * ___
- There are 300 inmates in the sample.
300 * ___
- There are 300 inmates in the sample.
- According to the US Census Bureau, the proportion of white residents in NH was about 0.898.
300 * 0.898
Expected count — Black inmates:
Use the same approach as the previous question. What proportion of New Hampshire residents are Black?
There are still 300 inmates in the sample and the proportion of Black residents in NH, according to the US Census Bureau in 2019, was about 0.018.
300 * 0.018
Expected count — Hispanic inmates:
Same approach again. What proportion of New Hampshire residents are Hispanic?
300 * 0.040
Expected count — inmates of other races:
Same approach. What proportion of New Hampshire residents identify as another race?
300 * 0.044
Now we’re ready to compute the test statistic. Recall from Dr. Çetinkaya-Rundel’s video that the \(\chi^2\) statistic is:
\[\chi^2 = \sum_{i = 1}^{k}{\frac{\left(\text{observed} - \text{expected}\right)^2}{\text{expected}}}\]
where \(k\) is the number of groups. As a reminder, the \(\Sigma\) symbol indicates that we should add terms together. You’ll calculate \(\displaystyle{\frac{\left(\text{observed} - \text{expected}\right)^2}{\text{expected}}}\) for each group (white, Black, Hispanic, and other) and add those quantities together.
Use the code block below to compute the \(\chi^2\) test statistic. The block is pre-populated to get you started.
Fill in the expected vector with the four expected counts you just computed — in the same order as the observed counts: white, Black, Hispanic, other.
The expected counts are 269.4, 5.4, 12.0, and 13.2 for white, Black, Hispanic, and other respectively.
observed <- c(243, 24, 15, 18)
expected <- c(269.4, 5.4, 12.0, 13.2)
test_stat <- sum((observed - expected)^2 / expected)
test_stat
Now compute the degrees of freedom for the \(\chi^2\)-distribution associated with this test.
For a Goodness of Fit test, how are the degrees of freedom related to the number of groups?
The degrees of freedom for a Goodness of Fit test is one less than the number of groups.
There are 4 racial groups, so the degrees of freedom is \(4 - 1 = 3\).
4 - 1
Use the code block below to compute the \(p\)-value. The function pchisq(q, df) returns the probability to the left of the boundary value q under a \(\chi^2\) distribution with df degrees of freedom.
Use your test statistic and degrees of freedom with pchisq(). Which tail are you interested in?
The \(p\)-value is the area to the right of the test statistic. Since pchisq() gives the left-tail area, how do you obtain the right-tail area?
Just like with pnorm(), to obtain the area to the right of a boundary value in the \(\chi^2\) distribution, we’ll subtract from 1.
1 - pchisq(___, df = ___)
The second argument to pchisq() is the degrees of freedom (df). The degrees of freedom for this test is 3.
1 - pchisq(___, df = 3)
The first argument to pchisq() is the boundary value – that’s our test statistic. We calculated that test statistic to be about 69.15.
1 - pchisq(69.15, df = 3)
Assume the test was conducted at the \(\alpha = 0.05\) level of significance.
What is the result of the test?
The result of the test means that:
This application is based on 2019 data from the US Census Bureau and The Sentencing Project.
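As a check on the hand computation above (not part of the graded exercise), base R's chisq.test() runs the entire Goodness of Fit test in one call when given the observed counts and the hypothesized proportions. The proportions must sum to 1:

```r
# Observed counts: white, Black, Hispanic, other
observed <- c(243, 24, 15, 18)

# Hypothesized proportions from the 2019 Census estimates (sum to 1)
census_props <- c(0.898, 0.018, 0.040, 0.044)

# Reports the X-squared statistic, df = 3, and the p-value
chisq.test(x = observed, p = census_props)
```

You should see the same test statistic (about 69.15) and degrees of freedom (3) that we computed step by step.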
Chi-Squared Test of Independence
Now that you’ve completed a Goodness of Fit test, let’s consider another application of the \(\chi^2\)-distribution. We’ll work through an application in which we are interested in determining whether household income (IncomeLevel) and adolescent drug use (DrugUse) are independent. We’ll use a simulated dataset based on metrics from the 2019 Behavioral Risk Factor Surveillance System. The simulated data is stored in a data frame called BRFSSsim.
First, let’s watch Dr. Çetinkaya-Rundel introduce the \(\chi^2\) Test of Independence.
We return to our simulated BRFSS data. Each row of the dataset represents a response from a single adolescent in the USA. We are interested in determining, at the 5% level of significance, whether there is evidence to suggest an association between IncomeLevel, which has levels Poverty, LowIncome, MiddleIncome, and HighIncome, and DrugUse, a yes/no response indicating whether the individual reported using any illicit drug in the past year.
We assume the following marginal distributions: approximately 16% of households fall below the poverty line, 38% are low income, 35% are middle income, and 11% are high income. Additionally, approximately 14% of adolescents are estimated to have used illicit drugs in any given year. We’ll use these assumed percentages when computing our expected counts.
How many variables of interest are there in this researcher’s study?
In order to answer their question, the researcher should:
The variables of interest to the researcher are:
The hypotheses associated with the researcher’s test are:
Under the assumption of the null hypothesis (independence), compute the expected number of the 2,500 adolescents in each of the following drug-use groups.
Expected count — Poverty and drug use:
If two events A and B are independent, what is \(\mathbb{P}\left[A \text{ and } B\right]\)?
If events are independent, \(\mathbb{P}\left[A \text{ and } B\right] = \mathbb{P}\left[A\right] \times \mathbb{P}\left[B\right]\). What are the probabilities of coming from a poverty household and of having used drugs?
\(\mathbb{P}\left[\text{Poverty}\right] = 0.16\) and \(\mathbb{P}\left[\text{DrugUse}\right] = 0.14\). How do you go from a probability to an expected count?
Multiply the total number of adolescents (2,500) by the joint probability.
2500 * 0.16 * 0.14
Expected count — Low Income and drug use:
Use the same approach as the previous question, but update the probability of being in the Low Income group.
\[\text{Expected Count} = \mathbb{P}\left[\text{low income}\right]\cdot \mathbb{P}\left[\text{drug use}\right]\cdot n\]
2500 * 0.38 * 0.14
Expected count — Middle Income and drug use:
Use the same approach as for calculating the previous two expected counts.
Use the same approach as for calculating the previous two expected counts. The marginal probability of a randomly selected household being middle income is 0.35.
2500 * 0.35 * 0.14
Expected count — High Income and drug use:
Once more, use the same approach.
Once more, use the same approach. This time, the marginal probability of a randomly chosen household being high income is 0.11.
2500 * 0.11 * 0.14
You’ve now computed the expected counts for the drug-use row. We’ll build the no-drug-use row the same way: repeat the calculations above, but replace 0.14 (the probability of a randomly chosen individual having used illicit drugs in the last 12 months) with 0.86 (the probability that they have not).
The full expected and observed tables are shown below for reference.
Expected Results:
| | Poverty | Low Income | Middle Income | High Income |
|---|---|---|---|---|
| No Drug Use | 344 | 817 | 752.5 | 236.5 |
| Drug Use | 56 | 133 | 122.5 | 38.5 |
Observed Results:
| | Poverty | Low Income | Middle Income | High Income |
|---|---|---|---|---|
| No Drug Use | 341 | 752 | 796 | 221 |
| Drug Use | 78 | 153 | 116 | 43 |
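The expected table above can also be rebuilt programmatically. This is an optional sketch using base R's outer() (the variable names here are our own, not from the BRFSSsim dataset): under independence, each cell is \(n \cdot \mathbb{P}[\text{row}] \cdot \mathbb{P}[\text{column}]\), which outer() computes for every cell at once.

```r
n <- 2500

# Assumed marginal probabilities from the activity
income_probs <- c(Poverty = 0.16, LowIncome = 0.38, MiddleIncome = 0.35, HighIncome = 0.11)
drug_probs   <- c(NoDrugUse = 0.86, DrugUse = 0.14)

# Each cell of the 2 x 4 table is n * P[row] * P[column]
expected_table <- n * outer(drug_probs, income_probs)
expected_table
```

The first row reproduces the No Drug Use expected counts (344, 817, 752.5, 236.5) and the second row the Drug Use counts (56, 133, 122.5, 38.5).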
The \(\chi^2\) test statistic formula is the same as for Goodness of Fit:
\[\chi^2 = \sum_{i = 1}^{k}{\frac{\left(\text{observed} - \text{expected}\right)^2}{\text{expected}}}\]
where \(k\) is again the total number of groups. There are four income levels and two drug-use categories, so \(4\times 2 = 8\) total groups. Use the code block below to compute the \(\chi^2\) test statistic.
Start by copying and pasting the code we used to compute the test statistic in the previous example.
observed <- c(243, 24, 15, 18)
expected <- c(269.4, 5.4, 12.0, 13.2)
test_stat <- sum((observed - expected)^2 / expected)
test_stat
The observed and expected count vectors must be replaced.
observed <- c(___)
expected <- c(___)
test_stat <- sum((observed - expected)^2 / expected)
test_stat
Type in the observed counts from our scenario. Pay special attention to the order you use.
observed <- c(341, 752, 796, 221, 78, 153, 116, 43)
expected <- c(___)
test_stat <- sum((observed - expected)^2 / expected)
test_stat
Type in the expected counts you calculated earlier. Note that you must use the same ordering here as you did for the observed counts.
observed <- c(341, 752, 796, 221, 78, 153, 116, 43)
expected <- c(344, 817, 752.5, 236.5, 56, 133, 122.5, 38.5)
test_stat <- sum((observed - expected)^2 / expected)
test_stat
Run the code – no additional changes are required.
Now compute the degrees of freedom associated with this test for independence.
For a Test of Independence, the degrees of freedom depend on the number of groups in each of the two categorical variables. How many levels does IncomeLevel have? How many does DrugUse have?
There are \(k = 4\) levels for the income variable and there are \(\ell = 2\) levels for the drug use variable.
For the \(\chi^2\) test for independence, the degrees of freedom is the product \(\left(k - 1\right)\left(\ell - 1\right)\).
(4 - 1) * (2 - 1)
Now compute the \(p\)-value associated with this test.
We are using the \(\chi^2\) distribution again. What function computes probabilities from this distribution?
pchisq(q, df) gives the area to the left of q. We are always interested in the upper tail for \(\chi^2\) tests. How do you find the right-tail area?
We’ll use 1 - pchisq() to find the area to the right of our test statistic.
1 - pchisq(___, df = ___)
The test statistic you calculated is approximately 21.25 and the degrees of freedom is 3.
1 - pchisq(21.24924, df = 3)
Recall that the test was conducted at the \(\alpha = 0.05\) level of significance.
What is the result of the test?
The result of the test means that:
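As an aside, base R's chisq.test() can also run a test of independence directly on a contingency table. One caveat: chisq.test() estimates expected counts from the table's own row and column totals rather than from the assumed population percentages we used, so its test statistic will differ somewhat from the 21.25 we computed by hand. A sketch (the table and dimension names below are our own labels):

```r
# Observed contingency table: rows are drug use, columns are income level
observed_table <- matrix(
  c(341, 752, 796, 221,   # No Drug Use row
     78, 153, 116,  43),  # Drug Use row
  nrow = 2, byrow = TRUE,
  dimnames = list(DrugUse = c("No", "Yes"),
                  IncomeLevel = c("Poverty", "LowIncome", "MiddleIncome", "HighIncome"))
)

res <- chisq.test(observed_table)
res$parameter   # degrees of freedom: (2 - 1) * (4 - 1) = 3
```

Either way, the degrees of freedom match the \((k - 1)(\ell - 1) = 3\) we computed, and the conclusion at \(\alpha = 0.05\) is the same.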
Submit
If you are part of a course with an instructor who is grading your work on these activities, please copy and submit both of the hashes below using the method your instructor has requested.
The hash below encodes your responses to the multiple choice and checkbox questions in this activity.
Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.
You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.
Summary
The \(\chi^2\) Goodness of Fit test is used when we have a single categorical variable with two or more levels and want to test whether a population follows an assumed distribution. The null hypothesis specifies the assumed proportions for each level; the alternative hypothesis simply states that the distribution is different.
The \(\chi^2\) Test of Independence is used when we have two categorical variables and want to test whether they are associated. The null hypothesis states that the variables are independent; under independence, the expected count for each cell is the total sample size times the product of the marginal probabilities.
Both tests use the same \(\chi^2\) test statistic: \[\chi^2 = \sum_{i=1}^{k} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}\] where \(k\) is the number of groups (GoF) or cells (independence).
The degrees of freedom differ between the two tests:
- Goodness of Fit: \(df = (\text{number of groups}) - 1\)
- Test of Independence: \(df = (\text{levels in first variable} - 1) \times (\text{levels in second variable} - 1)\)
The \(p\)-value is always the upper tail of the \(\chi^2\) distribution. We compute this area with
1 - pchisq(test_stat, df).
Conditions for inference require a random sample and expected counts of at least 5 in each group or cell.
With this activity, you’ve now developed tools for testing claims involving categorical data across a wide range of scenarios — single proportions, two-proportion comparisons, goodness of fit, and tests of independence. In the coming activities, we’ll make a significant shift and begin working with numerical data. This will introduce the \(t\)-distribution, which becomes necessary when we don’t know the population standard deviation — which is almost always. The core logic of hypothesis testing and confidence intervals remains unchanged; what changes is the distribution we use and the standard error formula we apply.