Topic 9: Foundations for Inference
This activity provides an introduction to the Central Limit Theorem. We’ll see the Central Limit Theorem in action through several simulations which construct the sampling distribution. We’ll have opportunities to change parameters of the population distribution and the size of the samples being drawn. The goal is to discover connections between population parameters, sample size, and the resulting sampling distribution.
Foundations for Inference
In this activity we’ll begin investigating the true power of statistics — using sample data to make accurate claims about a population, even when we don’t have access to the entire population. We start here by exploring the connection between a population distribution and the distribution of sample means, often called the sampling distribution. We’ll do this through a series of interactive code blocks which you will run and use to answer questions.
Exploring the Connection Between Population and Sampling Distributions
Start by viewing the following video from the New York Times.
The video claimed that the sampling distribution can help us answer questions about the population. This is really important because, as we mentioned in our first activity, a census is almost always impossible. Use the code blocks below to explore the connection between the population and the sampling distribution for several different populations.
The sampling distribution of the sample mean is a theoretical distribution consisting of the means of all possible samples of a fixed size (\(n\) elements) drawn from a population.
This will be our central object of study in this activity. Our goal will be to understand how the sampling distribution of the sample mean is connected to:
- the population distribution, and
- the sample size.
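To make the definition of the sampling distribution given above concrete, here is a minimal sketch in R using a made-up population of just four values. The values and the names population, n, and sampleMeans are illustrative assumptions, not part of the activity. The sketch also assumes samples are drawn without replacement, so "all possible samples" means every combination of n elements.

```r
# A tiny, made-up population (illustrative values only)
population <- c(2, 4, 6, 8)
n <- 2  # fixed sample size

# combn() lists every possible sample of size n (without replacement)
# and applies mean() to each one
sampleMeans <- combn(population, n, FUN = mean)

print(sampleMeans)        # 3 4 5 5 6 7
print(mean(population))   # 5
print(mean(sampleMeans))  # 5
```

With only six possible samples the distribution is far from smooth, but its mean already matches the population mean exactly. For realistic populations the list of all possible samples is far too long to write out, which is why the rest of the activity relies on simulation.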
You’ll encounter several large code blocks in this activity. You do not need to understand all of the code in each block. Focus instead on the pictures that result each time you run the code. You are invited to change the first few lines of each block (the parameters), and you should do so! You are not expected to modify the remaining code, though.
A Normally Distributed Population
Work with the following code block to explore the connection between the population distribution and sampling distribution when the population follows a normal distribution. Use your explorations to answer the questions that follow.
Try changing the following parameters one at a time and re-running the code — the mean, the standard deviation (sd), the sample size, and the number of samples (numSamps). What changes do you notice about the population distribution? What changes do you notice about the sampling distribution?
Pay particular attention to what happens to the shape and spread of the sampling distribution (the second plot) as you change the sample size. What do you notice?
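If the interactive block is not available, the following is a minimal sketch of the kind of simulation described above. It is not the activity's own code: the names popMean, popSD, sampleSize, and sampleMeans, the starting values, and the fixed seed are assumptions chosen for illustration (only numSamps is named in the text). The curve on the second plot is simply a normal curve with the same center and spread as the simulated means.

```r
set.seed(42)  # fixed seed so reruns match (an added assumption)

# Parameters to experiment with (hypothetical names and values)
popMean    <- 10    # population mean
popSD      <- 2     # population standard deviation
sampleSize <- 25    # number of observations in each sample
numSamps   <- 5000  # number of samples to draw

# Each column of the matrix is one sample; colMeans() gives one mean per sample
draws <- matrix(rnorm(sampleSize * numSamps, mean = popMean, sd = popSD),
                nrow = sampleSize)
sampleMeans <- colMeans(draws)

par(mfrow = c(1, 2))
# First plot: the population distribution
curve(dnorm(x, mean = popMean, sd = popSD),
      from = popMean - 4 * popSD, to = popMean + 4 * popSD,
      main = "Population", xlab = "value", ylab = "density")
# Second plot: the simulated sampling distribution with a fitted normal curve
hist(sampleMeans, breaks = 40, freq = FALSE,
     main = "Sampling distribution", xlab = "sample mean")
curve(dnorm(x, mean = mean(sampleMeans), sd = sd(sampleMeans)),
      add = TRUE, lwd = 2)

# Numeric summary of the simulated sampling distribution
print(c(center = mean(sampleMeans), spread = sd(sampleMeans)))
```

Change one parameter at a time and compare the printed center and spread with the population values you chose.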
Which of the following regarding the population and sampling distributions is true?
Try various values for sample size. What is true about the mean of the sampling distribution?
Try various values of the mean, standard deviation, and sample size. What can be said about the sampling distribution?
Now that you’ve tried various values for the parameters, what can be said about the connection between sample size and the spread of the sampling distribution?
Okay, so if a population distribution is approximately normal, then the sampling distribution is also nearly normal – what’s the big deal?
The real takeaway here is that, if it’s reasonable to assume that our data comes from an approximately normal population, then our sample mean is a reliable estimate of the true population mean — and the larger our sample, the more reliable that estimate becomes.
This might seem obvious, but it’s actually the foundation of everything that follows in our course. We almost never have access to an entire population, so we rely on samples. Knowing that our sample mean is a trustworthy stand-in for the population mean — and being able to quantify how trustworthy it is based on sample size — is what makes statistical inference possible.
Okay, so what if we can’t assume that the population we are interested in follows a “nearly”-normal distribution? Do you believe in “magic”? (Disclaimer: There is actually no magic, or trickery, involved in what you are about to see – just rock-solid mathematics. Get ready to have your mind blown!)
A Uniformly Distributed Population
A population is said to be uniformly distributed between some minimum value A and a maximum value B if all values between A and B are equally likely to be observed.
Work with the following code block to explore the connection between the population distribution and sampling distribution when the population follows a uniform distribution. Note that in the first plot, assuming a normal distribution for the population is an extremely poor choice. Use your explorations to answer the questions that follow.
Again, try changing the minimum value, maximum value, sample size, and number of samples (numSamps) one at a time and re-running the code. What changes do you notice about the population distribution? What changes do you notice about the sampling distribution?
Pay particular attention to what happens to the shape of the sampling distribution (the second plot) even for very small sample sizes like 2 or 3. How does this compare to what you observed with the normally distributed population?
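As a rough stand-in for the interactive block, the sketch below draws samples from a uniform population and plots the sample means for a few small sample sizes side by side. The names minVal, maxVal, and sampleMeans, the chosen sample sizes, and the fixed seed are assumptions for illustration, not the activity's own code.

```r
set.seed(1)  # fixed seed so reruns match (an added assumption)

# Parameters to experiment with (hypothetical names and values)
minVal   <- 0     # population minimum (A)
maxVal   <- 10    # population maximum (B)
numSamps <- 5000  # number of samples drawn for each sample size

par(mfrow = c(2, 2))
for (sampleSize in c(1, 2, 3, 10)) {
  # One sample mean per replication
  sampleMeans <- replicate(numSamps,
                           mean(runif(sampleSize, min = minVal, max = maxVal)))
  hist(sampleMeans, breaks = 40, freq = FALSE, xlim = c(minVal, maxVal),
       main = paste("n =", sampleSize), xlab = "sample mean")
}

# Center and spread of the last set of sample means (n = 10)
print(c(center = mean(sampleMeans), spread = sd(sampleMeans)))
```

The panel for n = 1 is just the flat population itself, so comparing it with the other panels shows how quickly averaging changes the shape.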
Which of the following regarding the sampling distributions is true?
Try various values for sample size. What is true about the mean of the sampling distribution?
Try various values for the parameters. What can be said about the connection between sample size and the spread of the sampling distribution?
Given a uniformly distributed population, the sampling distribution is approximately normal for samples of size 2 or greater.
- The mean of the sampling distribution is the population mean.
- The spread of the sampling distribution is related to the sample size: larger samples result in narrower distributions.
The Impact of Skew on the Sampling Distribution
We’ve seen that the sampling distribution is nearly normal for any sample size when the population is nearly normally distributed, and nearly normal for samples of size at least two when the population is uniformly distributed. What if we move in a very different direction and consider a population that is extremely skewed?
We’ve encountered skew in our course already, and we know that it describes the effect of the mean being pulled away from the center of our distribution. This witchcraft (read: mathematics) certainly cannot apply in the face of strongly skewed distributions, can it?
Work with the following code block to explore the connection between the population distribution and sampling distribution when the population follows a strongly skewed distribution. Use your explorations to answer the questions that follow.
Try changing the shape and rate parameters one at a time — can you figure out what each one controls by watching how the first plot changes? Then try increasing the sample size gradually. What do you observe?
Try sample sizes of 5, 10, 20, 30, 50, and 100 in sequence. How does the shape of the sampling distribution (the second plot) change as the sample size grows? How does this compare to what you observed with the uniform population?
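The sketch below offers one way to put numbers on what the plots show. It assumes the skewed population is a gamma distribution, which is a guess based on the shape and rate parameters mentioned above, and it uses a small hypothetical helper, skewness(), to measure how lopsided the simulated sample means are (values near 0 indicate symmetry). The parameter values and the fixed seed are also assumptions.

```r
set.seed(7)  # fixed seed so reruns match (an added assumption)

# Parameters to experiment with (assumed gamma population)
shape    <- 1     # smaller values give stronger right skew
rate     <- 0.5   # rescales the population; the mean is shape / rate
numSamps <- 5000  # number of samples drawn for each sample size
sampleSizes <- c(5, 10, 20, 30, 50, 100)

# Hypothetical helper: sample skewness (about 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

results <- sapply(sampleSizes, function(n) {
  # Each column is one sample of size n; colMeans() gives the sample means
  draws <- matrix(rgamma(n * numSamps, shape = shape, rate = rate), nrow = n)
  sampleMeans <- colMeans(draws)
  c(n = n, center = mean(sampleMeans),
    spread = sd(sampleMeans), skew = skewness(sampleMeans))
})

# One row per sample size
print(round(t(results), 3))
```

Reading down the skew column shows how much lopsidedness remains in the sampling distribution at each sample size.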
Try running the code with samples of size 5, 10, and 20. Are the resulting sampling distributions nearly normal?
Try samples of size 30, 50, and 100. What can be said about the resulting sampling distributions?
Try various values for the other parameters. What is true about the mean of the sampling distribution?
Since you’ve tried various values for the parameters, what can be said about the connection between sample size and the spread of the sampling distribution?
Good work on the previous sets of questions. Think about your answers and use them to answer the following questions about the connection between population distributions and sampling distributions in general.
Consider a generic population distribution. What can be said about the shape of the sampling distribution?
For a generic population distribution, what can be said about the mean of the sampling distribution?
For a generic population distribution, what can be said about the connection between sample size and the spread of the sampling distribution?
In all of these cases, the mean of the sampling distribution falls close to the mean of the population distribution. The more important observation concerns the spread of the sampling distribution: it shrinks as the sample size grows.
We’ve been working backwards here: we assumed we knew the population distribution and took many thousands of samples to construct the sampling distributions. This is exactly the opposite of the real statistical scenario, in which we don’t know the population distribution, we can’t collect thousands of samples, and we generally have just one sample of a fixed size.
Luckily, from the experiments we’ve run (and from mathematics which has been proven to work), we know that our sample mean falls “near” the population mean, and we can describe what “near” means since it depends on the size of our sample. This is how I was able to draw those normal distributions in all the plots on the right, and it is what makes statistics work!
The Central Limit Theorem
What you just discovered through the simulations and questions above is what statisticians call the Central Limit Theorem — the result discussed in the CreatureCast video at the beginning of this activity. This is one of the most important theorems in all of statistics, and it is what will allow us to make and test claims about populations for the remainder of our course.
Regardless of the shape of a population distribution, for sample sizes large enough to overcome skew, the distribution of sample means (the sampling distribution) is approximately normal. Furthermore:
- The mean of the sampling distribution equals the population mean (\(\mu\)).
- The spread of the sampling distribution is described by the standard error (\(S_E\)), which depends on the population standard deviation and the sample size.
For a population with mean \(\mu\) and standard deviation \(\sigma\), the sampling distribution of sample means from samples of size \(n\) is approximately:
\[\bar{X}_n \sim N\left(\mu,~S_E = \frac{\sigma}{\sqrt{n}}\right)\]
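One way to check the standard error formula is to compare it with the spread of simulated sample means. The sketch below does this for a normal population. The values of mu and sigma, the sample sizes, and the names theoreticalSE and simulatedSE are illustrative assumptions, and the agreement it shows is approximate because it comes from simulation.

```r
set.seed(2024)  # fixed seed so reruns match (an added assumption)

mu       <- 50     # assumed population mean
sigma    <- 8      # assumed population standard deviation
numSamps <- 10000  # number of samples drawn for each sample size
sampleSizes <- c(1, 4, 16, 64)

# Standard error from the formula sigma / sqrt(n)
theoreticalSE <- sigma / sqrt(sampleSizes)

# Standard deviation of simulated sample means for each sample size
simulatedSE <- sapply(sampleSizes, function(n) {
  draws <- matrix(rnorm(n * numSamps, mean = mu, sd = sigma), nrow = n)
  sd(colMeans(draws))
})

print(data.frame(n = sampleSizes,
                 theoreticalSE = theoreticalSE,
                 simulatedSE = round(simulatedSE, 3)))
```

Because the sample size sits under a square root, quadrupling the sample size cuts the standard error in half: here the formula gives 8, 4, 2, and 1.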
Sample Problem: Suppose you have a 46 square foot wall which you want to cover with spray paint. The brand of spray paint you plan to use is known to have coverage which is approximately normal, with an average coverage of 10 square feet per can and a standard deviation of 1.5 square feet. Use the code block below to answer the questions that follow.
For the first question, you have a 46 square foot wall and plan to buy four cans of paint. How many square feet would each can need to cover?
Divide the 46 square feet by 4 to find the required coverage per can.
For the second question, we know that coverage from a single can of spray paint is approximately normal. Start with a picture of a normal distribution that includes the mean (average coverage) and the boundary value of coverage you are interested in. Shade the region corresponding to the probability you want, then use pnorm().
You’re looking for the probability of at least 11.5 square feet of coverage. This means you’re looking for the probability of being to the right of that boundary value. Use 1 - pnorm().
Use 1 - pnorm(11.5, 10, 1.5) to find the probability that a single can covers at least 11.5 square feet.
For the third question, draw another picture — but this time use what you learned about the Central Limit Theorem. The sampling distribution is also approximately normal (since the population is), with the same mean but a smaller spread. The spread of the sampling distribution is the standard error: \(S_E = \sigma / \sqrt{n}\).
The standard error in coverage for a random sample of four cans is \(S_E = 1.5 / \sqrt{4} = 0.75\).
Similar to the second question, use 1 - pnorm() to find the probability of the cans providing an average coverage of at least 11.5 square feet.
Using 1 - pnorm(11.5, 10, 1.5/sqrt(4)) will calculate the probability for you.
How many square feet would each can need to cover if you wanted to use only four cans to cover the entire 46 square foot wall?
What is the probability that a single can of spray paint covers at least 11.5 square feet?
What is the probability that a random sample of four cans of spray paint covers an average of at least 11.5 square feet?
Should we expect to cover the entire wall with only four cans of spray paint, or should we plan to buy five?
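The hints above can be collected into a single short script. This is a sketch rather than the activity's own code block: the variable names are assumptions, and the final five-can calculation goes one step beyond the hints as one possible way to think about the last question.

```r
wallArea <- 46   # square feet to cover
numCans  <- 4    # cans we plan to buy
mu       <- 10   # average coverage per can (square feet)
sigma    <- 1.5  # standard deviation of coverage per can

# Question 1: coverage each can would need to provide
neededPerCan <- wallArea / numCans  # 11.5

# Question 2: probability that a single can covers at least that much
pSingle <- 1 - pnorm(neededPerCan, mean = mu, sd = sigma)

# Question 3: probability that four cans average at least that much,
# using the standard error sigma / sqrt(n) as the spread
se    <- sigma / sqrt(numCans)  # 0.75
pFour <- 1 - pnorm(neededPerCan, mean = mu, sd = se)

# Extra (not in the hints): the same calculation for five cans
pFive <- 1 - pnorm(wallArea / 5, mean = mu, sd = sigma / sqrt(5))

print(c(neededPerCan = neededPerCan, pSingle = pSingle,
        pFour = pFour, pFive = pFive))
```

The boundary of 11.5 square feet is one standard deviation above the mean for a single can (probability about 0.159) but two standard errors above the mean for the average of four cans (probability about 0.023). With five cans, the chance of covering the wall rises to roughly 0.88.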
Summary
In this activity you discovered one of the most important results in all of statistics through hands-on simulation. Here are the key takeaways and a look at what’s ahead.
- The sampling distribution describes the behavior of sample means. When we take many samples of size \(n\) from a population and compute the mean of each, the resulting distribution of those means is called the sampling distribution.
- The mean of the sampling distribution equals the population mean. Regardless of the population shape or sample size, the sampling distribution is centered at the true population mean \(\mu\), so the center of a simulated sampling distribution always falls close to \(\mu\).
- Larger samples produce narrower sampling distributions. The spread of the sampling distribution — measured by the standard error \(S_E = \sigma / \sqrt{n}\) — shrinks as sample size grows. Larger samples give us more precise estimates of the population mean.
- This is the Central Limit Theorem. For large enough sample sizes, the sampling distribution is approximately normal, regardless of the shape of the population distribution. For normally distributed populations, any sample size works. For skewed populations, larger sample sizes (30 or more for moderate skew) are needed.
- Skew increases the required sample size. The more extreme the skew in the population distribution, the larger the sample size must be before the sampling distribution becomes approximately normal.
- The CLT is what makes statistical inference possible. Because we can predict the shape and spread of the sampling distribution, we can make reliable probability statements about how close a single sample mean is likely to be to the true population mean — even when we only have one sample.
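As a small illustration of the kind of probability statement described in the last takeaway, the sketch below computes the chance that a single sample mean lands within a chosen margin of the true population mean, for several sample sizes. The values of sigma and margin and all variable names are assumptions for illustration. The calculation also assumes the sampling distribution is approximately normal.

```r
sigma  <- 1.5  # assumed population standard deviation
margin <- 0.5  # how close to the population mean we want the sample mean to be
sampleSizes <- c(4, 16, 36, 100)

# Standard error for each sample size
se <- sigma / sqrt(sampleSizes)

# P(|sample mean - mu| <= margin) under a normal sampling distribution
pWithin <- pnorm(margin / se) - pnorm(-margin / se)

print(data.frame(n = sampleSizes,
                 standardError = se,
                 pWithin = round(pWithin, 3)))
```

The probability climbs toward 1 as the sample size grows, which is the sense in which larger samples give more precise estimates.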
Now that you understand why sample means behave predictably, the next activities will put this to work. We’ll use the Central Limit Theorem as the foundation for confidence intervals — a way of expressing how precisely a sample mean estimates the population mean — and hypothesis tests — a formal framework for deciding whether observed data is consistent with a specific claim about a population. Everything you’ve built intuition for in this activity will be essential going forward.