Topic 13: Inference for Categorical Data (Lab)
In this lab, we explore what’s at play when making inference about population proportions using categorical data. We work with real survey data on global atheism, practice using the inference() function from the {statsr} package, and investigate how sample size and population proportion affect the margin of error.
This is a derivative of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. The original lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America. The source for the story was a poll that asked people: “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person, or a convinced atheist?” This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data. In this activity we take a look at the atheism survey and explore what’s at play when making inference about population proportions using categorical data.
The Survey
You can find the press release for the WIN-Gallup International poll on the Global Index of Religion and Atheism here. Please take a moment to review the report and then address the following questions.
In the first paragraph of the report, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?
The title of the report is Global Index of Religiosity and Atheism. To generalize the report’s findings to the global human population, what must we assume about the sampling method?
Do you expect that the required assumption was satisfied? What are the implications of this?
Turn your attention to Table 6 (pages 14 and 15) of the report, which summarizes the sample size and response percentages for all 57 countries. While this is a useful format for summarizing the data, we will base our analysis on the original data set of individual responses to the survey. These original responses are available in a data frame named atheism from the {openintro} package, which has been loaded for you.
What does each row of Table 6 correspond to? What does each row of the atheism data frame correspond to?
To investigate the link between these two ways of organizing this data, take a look at the estimated proportion of atheists in the United States. Towards the bottom of Table 6, we see that this is 5%. We should be able to arrive at the same number using the atheism data.
Run the command in the code block below and be sure to understand what the code is doing — you’ll be asked to do something similar shortly. Here we create a new data frame called us12 containing only the rows in atheism associated with respondents from the United States in 2012. We then calculate the proportion of atheist responses. Does the result agree with the percentage in Table 6? If not, why might it differ?
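The interactive code block itself isn't reproduced here; a sketch of what it does, assuming the atheism data frame labels US respondents as "United States" in its nationality column (you can check the labels with unique(atheism$nationality)):

```r
library(openintro)  # provides the atheism data frame
library(dplyr)      # provides filter(), count(), mutate()

# Keep only the 2012 respondents from the United States
us12 <- atheism |>
  filter(nationality == "United States", year == 2012)

# Proportion of each response level; "atheist" should land near Table 6's 5%
us12 |>
  count(response) |>
  mutate(proportion = n / sum(n))
```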
Inference on Proportions
As you noted previously, Table 6 provides statistics — calculations made from the sample of 51,927 people. We’d like insight into the population parameters instead. You answer the question “What proportion of people in your sample reported being atheists?” with a sample statistic, while the question “What proportion of people on Earth would report being atheists?” is answered with an estimate of the parameter.
You’ll use what you’ve learned about inferential tools for estimating population proportions to answer questions related to the WIN-Gallup poll. Additionally, you’ll explore how the value of the population proportion can impact the margin of error for a confidence interval.
As long as the conditions for inference are reasonably well satisfied, we can either calculate the standard error and construct the confidence interval by hand, or allow the inference() function from the {statsr} package to do it for us.
Run the following code block to construct a confidence interval for the proportion of atheists in the US in 2012.
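That block is interactive and not shown here; assuming the us12 data frame created earlier, the call would look roughly like this:

```r
library(statsr)  # provides inference()

# 95% confidence interval (the default) for the proportion of atheists
inference(y = response, data = us12,
          statistic = "proportion", type = "ci",
          method = "theoretical", success = "atheist")
```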
Let’s pause for a moment to go through the arguments of this function:
- y — the response variable of interest: response
- data — the data frame containing the response column
- statistic — the parameter we're estimating: "proportion" (other options include "mean" and "median")
- type — the type of inference: "ci" for a confidence interval or "ht" for a hypothesis test
- method — "theoretical" or "simulation" based; we use the theoretical framework throughout this course
- success — since we are estimating a proportion, we specify which level counts as a "success": "atheist"
The default confidence level is 95% (conf_level = 0.95), though this can be adjusted.
Although formal confidence intervals and hypothesis tests don’t appear explicitly in the WIN-Gallup report, suggestions of inference appear at the bottom of page 6: “In general, the error margin for surveys of this kind is ±3–5% at 95% confidence.” We will check the validity of this claim shortly.
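As a back-of-envelope version of that check (a sketch, not from the report): country-level samples in this poll are typically around 1,000 respondents, and the worst case for the margin of error is p = 0.5.

```r
# Worst-case margin of error at 95% confidence for a sample of n = 1000
n <- 1000
p <- 0.5
1.96 * sqrt(p * (1 - p) / n)  # about 0.031, consistent with the claimed 3-5%
```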
Use the code block below to help you answer the question that follows.
What is the relationship between the margin of error and the width of a confidence interval?
The margin of error is half the width of the confidence interval — think of it as the "radius" around the point estimate:

(___ - ___) / 2

Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

We can take the upper bound for the confidence interval minus the lower bound to find the entire width of the interval. Dividing that by two gives the margin of error:

(0.0634 - 0.0364) / 2
Using the code block below and the inference() function, calculate confidence intervals for the proportion of atheists in 2012 in two other countries of your choice, and report the associated margins of error. Be sure to note whether the conditions for inference are met. It will be helpful to create new data sets for each of the two countries first, and then use these data sets in the inference() function.
Revisit the code block where we created the us12 data frame. What would you need to change to get data from a different country?
Change the nationality value in filter() to match a country of your choice. You can check the available country names by running unique(atheism$nationality).
Create a data frame for each country, then pass it to inference() the same way we did for us12.
country1 <- atheism |>
filter(nationality == "___", year == 2012)
inference(y = response, data = country1,
statistic = "proportion", type = "ci",
method = "theoretical", success = "atheist")

How Does the Proportion Affect the Margin of Error?
Imagine you’ve set out to survey 1,000 people on two questions: are you female? and are you left-handed? Since both sample proportions were calculated from the same sample size, they should have the same margin of error, right? Not so fast! While the margin of error does change with sample size, it is also affected by the proportion itself.
Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This feeds into the margin of error for a 95% confidence interval: \(ME = 1.96 \times SE = 1.96 \times \sqrt{p(1-p)/n}\). Since the population proportion \(p\) appears in this formula, it makes sense that the margin of error depends on the population proportion. We can visualize this relationship by plotting \(ME\) vs. \(p\).
The code block below creates a vector p from 0 to 1 in steps of 0.01, calculates the corresponding margin of error for each value of \(p\) (using \(ME \approx 2 \times SE\)), and plots the relationship.
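A base-R sketch of such a block (the sample size n = 1000 is an assumption of this sketch; any fixed n produces the same shape):

```r
n <- 1000                       # assumed sample size
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p) / n) # ME ~= 2 * SE for a 95% interval

plot(me ~ p, type = "l",
     xlab = "Population proportion (p)",
     ylab = "Margin of error (ME)")
```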
Describe the relationship between the population proportion \(p\) and the margin of error.
Which of the following are implications of your answer above?
We now know that both sample size and the population proportion impact the margin of error. Often, pollsters have requirements for the margin of error — for example, estimating a president’s net favorability rating to within ±3 percentage points. They can use these requirements, along with any prior knowledge or intuition about the population proportion, to estimate how much data they need to collect. The required sample size can be estimated using the formula below (a rearrangement of the margin of error formula):
\[n \geq \left(\frac{Z_{\alpha/2}}{M_E}\right)^2 \cdot p\left(1 - p\right)\]
where \(Z_{\alpha/2}\) is the critical value for the desired confidence level, \(M_E\) is the desired margin of error, and \(p\) is an estimate for the population proportion. If no estimate is available, use \(p = 0.5\) as a conservative worst-case choice.
Success-Failure Condition
You must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if the sample is random and both \(np \geq 10\) and \(n(1-p) \geq 10\). This rule of thumb is easy to follow, but it raises an interesting question: what’s so special about the number 10?
The short answer is: nothing. The "best" cutoff for such a rule of thumb is, to some degree, arbitrary. However, once \(np\) and \(n(1-p)\) both reach 10, the sampling distribution of \(\hat{p}\) is close enough to normal that confidence intervals and hypothesis tests based on the normal approximation perform well.
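A quick way to apply the rule of thumb to any (n, p) pair — success_failure_ok is a hypothetical helper written for this sketch, not part of the lab's code:

```r
# TRUE when both expected counts, np and n(1-p), are at least 10
success_failure_ok <- function(n, p) {
  n * p >= 10 && n * (1 - p) >= 10
}

success_failure_ok(1040, 0.1)  # TRUE: expected counts are 104 and 936
success_failure_ok(50, 0.05)   # FALSE: only 2.5 expected successes
```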
We can investigate the interplay between \(n\) and \(p\) and the shape of the sampling distribution using simulations. The code block below simulates 5,000 samples of size 1,040 from a population with a true atheist proportion of 0.1, computes \(\hat{p}\) for each sample, and plots a histogram of the results.
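A base-R sketch of such a simulation (the variable names and the set.seed() call are choices of this sketch, not necessarily those of the lab's block):

```r
set.seed(42)   # for reproducibility; an addition of this sketch
n <- 1040      # sample size
p <- 0.1       # true proportion of atheists
n_sims <- 5000 # number of simulated samples

# Draw each sample and record its sample proportion of "atheist" responses
p_hats <- replicate(n_sims, {
  sim <- sample(c("atheist", "non_atheist"), size = n,
                replace = TRUE, prob = c(p, 1 - p))
  mean(sim == "atheist")
})

hist(p_hats, main = "p_hat when n = 1040, p = 0.1", xlab = "p_hat")
```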
Use the code block below to repeat this simulation with n = 400 and p = 0.1. Plot your result and compare it to the original. What impact does lowering the number of observations have?
Start by copying the simulation code from the previous block. Which two values at the top of the code need to change?
Change n to 400 and keep p at 0.1. Everything else can stay the same — just update the plot title to reflect the new parameters.
Now re-run the experiment with n = 1040 and p = 0.02. Think about the impact that a smaller population proportion has on the distribution of \(\hat{p}\).
Same approach as before — copy the original simulation code and update n and p. Don’t forget to update the plot title.
Finally, re-run the experiment with n = 400 and p = 0.02. Compare all four distributions. How does this connect back to the success-failure condition for inference?
Same approach again — update n to 400 and p to 0.02. After running all four simulations, think about which combinations satisfy \(np \geq 10\) and \(n(1-p) \geq 10\).
Referring to Table 6 in the WIN-Gallup report, Australia has a sample proportion of 0.1 on a sample size of 1,040, and Ecuador has a sample proportion of 0.02 on 400 subjects. Suppose these point estimates are the true population proportions. Given the shapes of their respective sampling distributions, is it sensible to proceed with inference and report margins of error as the report does?
On Your Own
The question of atheism was asked by WIN-Gallup International in a similar survey conducted in 2005. Table 4 on page 12 of the report summarizes survey results from 2005 and 2012 for 39 countries. Try answering the following questions on your own. The code blocks below are available for any calculations you need.
1. Answer the following two questions using the inference() function. As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
- Is there convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012? Create new data sets for respondents from Spain in both years, form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap.
- Is there convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012?
Start by creating a data frame containing only responses from Spain. How did we create us12 earlier?
You don’t need to filter on year — both 2005 and 2012 data are present for Spain. Pass the year variable to the optional x argument in inference() to automatically create groups for each year.
spain <- atheism |>
filter(nationality == "Spain")
inference(y = response, x = year, data = spain,
statistic = "proportion", type = "ci",
method = "theoretical", success = "atheist")

You can ignore the warning about converting year to a factor — R is doing exactly what you asked it to do.
2. If in fact there has been no change in the atheism index in any of the countries listed in Table 4, in how many of those countries would you expect to detect a change (at a significance level of 0.05) simply by chance? Hint: Look up Type 1 error in your textbook.
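As with the earlier hints, the arithmetic behind this one is short: under the assumption of no change anywhere, each of the 39 tests has a 5% chance of a false positive (a Type 1 error), so the expected count is

```r
# Expected number of Type 1 errors across 39 tests at alpha = 0.05
39 * 0.05  # 1.95, i.e. about 2 countries
```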
3. Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for \(p\). How many people would you have to sample to ensure that you are within the guidelines?
Recall the sample size formula introduced earlier in this activity. Which document contains this formula?
The formula is on the Standard Error Decision Tree and was also written out earlier in this activity. What value should you use for \(p\) when you have no prior estimate?
Since you have no prior estimate for \(p\), use \(p = 0.5\) as a conservative worst-case choice. This value maximizes the required sample size.
# Critical value for 95% CI
z_star <- 1.96
# Desired margin of error
ME <- 0.01
# Sample size formula with p = 0.5
n <- ((z_star / ME)^2) * 0.5 * (1 - 0.5)
ceiling(n)

Submit
If you are part of a course with an instructor who is grading your work on these activities, please copy and submit the hash below using the method your instructor has requested (there is only a question hash for this activity, no exercise hash).
The hash below encodes your responses to the multiple choice questions in this activity.
Since there were no code cell exercises in this activity, there is no exercise hash to generate. You’ll see exercise hashes in future activities.
Summary
- Sample statistics estimate population parameters. The proportions reported in the WIN-Gallup poll are sample statistics — they estimate the true population proportions for each country, but they are not the population parameters themselves.
- Conditions for inference must be checked. For inference on a proportion to be valid, the sample must be random and the success-failure condition (\(np \geq 10\) and \(n(1-p) \geq 10\)) must be satisfied. When the condition isn’t met — as with Ecuador’s sample — the sampling distribution of \(\hat{p}\) is not approximately normal, and inference isn’t reliable.
- The margin of error depends on both \(n\) and \(p\). The margin of error is not determined by sample size alone — the population proportion also plays a role. Margins of error are largest when \(p\) is near 0.5 and smallest when \(p\) is near 0 or 1.
- The inference() function automates the calculation. For a single proportion, inference() computes the confidence interval using the theoretical framework. Passing a grouping variable to the x argument allows comparison between groups.
- Sample size planning works backwards from the margin of error. Given a desired margin of error and confidence level, we can solve for the minimum sample size needed. When no prior estimate of \(p\) is available, using \(p = 0.5\) gives a conservative upper bound on the required sample size.
In this lab you applied inferential tools for a single proportion and began to compare proportions across groups. In the coming activities, we’ll extend inference to numerical data — introducing the \(t\)-distribution and exploring one- and two-sample tests and confidence intervals for means. The framework remains the same: a point estimate, a standard error, a critical value or test statistic, and a conclusion stated in context.