Topic 10: Introduction to Inference Lab
In this lab, we investigate the ways in which statistics from a random sample can serve as point estimates for population parameters. We dig deeper into the Central Limit Theorem, explore connections between sample size and uncertainty, and introduce the notion of the confidence interval.
This is a derivative of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
Intro to Inference Lab
In this activity, we investigate the ways in which statistics from a random sample of data can serve as point estimates for population parameters. This activity builds on the content from Topic 9, where we were introduced to the sampling distribution and the Central Limit Theorem. Here we will become more familiar with these two ideas and learn how to use the sampling distribution to make claims about population-level data.
The Data
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be a dataset containing all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. While we typically don’t have access to population-level data, this will allow us to see the Central Limit Theorem in action and to see how close our estimates come to the corresponding true population mean.
In this lab we investigate what we can learn about the full population of these home sales by taking smaller samples. We’ll see how well our small samples can be used to estimate a population parameter. The data has been loaded for you in a dataset called ames.
Use the code block below to answer some basic questions about the ames dataset.
You can call a data frame by name to print it out. You can also use functions like head(), dim(), and glimpse() to learn more about it.
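For instance, the first two of these functions work on any data frame. Here is a quick sketch using the built-in mtcars dataset as a stand-in, since the ames data is only loaded inside the lab environment:

```r
# mtcars is a small built-in data frame standing in for ames here
head(mtcars)   # prints the first six rows
dim(mtcars)    # returns the number of rows, then columns: 32 11

# glimpse() gives a similar compact overview, but it comes from the
# dplyr package, which is loaded for you in this lab
```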
ames Dataset I
What does each observation in the ames dataset represent?
ames Dataset II
How many observations are there in the ames dataset?
ames Dataset III
How many variables are there in the ames dataset?
We see that there are quite a few variables in the dataset — enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet and the sale price. To save some effort throughout the lab, we’ll create two variables with short names that represent these two variables. The code block below is pre-set to define the area vector — add a second line which creates the price vector.
Start by copying the existing code and making changes to it. What needs to change?
area <- ames |>
pull(area)
area <- ames |>
pull(area)
We want to use a different variable name because (i) we don’t want to overwrite the existing area variable, and (ii) we want the name to be meaningful.
We also don’t want to extract the area column again.
area <- ames |>
pull(area)
___ <- ames |>
pull(___)
Store the results into a new variable called price, as requested in the prompt. What column from the ames data frame do you want to extract?
area <- ames |>
pull(area)
price <- ames |>
pull(___)
Store the results into a new variable called price, as requested in the prompt. You want to extract the price column.
area <- ames |>
pull(area)
price <- ames |>
pull(price)
area <- ames |>
pull(area)
price <- ames |>
pull(price)
Initial Exploration
Let’s look at the distributions of price and area in our population of home sales by calculating a few summary statistics. Use the code block below to find the mean and median for both price and area. Use your results to answer the questions that follow.
The area and price objects are vectors of values. Pass them, one at a time, to the mean() and median() functions.
Fill in the blanks:
mean(___)
median(___)
mean(___)
median(___)
Which of the following is the mean of the price variable?
Which of the following is the median of the price variable?
What does the relationship between the mean and median tell you about the distribution of price?
Which of the following is the mean of the area variable?
Which of the following is the median of the area variable?
What does the relationship between the mean and median tell you about the distribution of area?
Use the code block below to verify your answers about the shapes of the distributions. The code to produce the histogram and boxplot for area is already present. Add the code necessary to produce a histogram and boxplot for price. Are you able to see the skew in the distributions?
Start by copying and pasting the code for the existing plots. What needs to be changed?
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
In the second set of plots, we’ll need to replace the variable being plotted and rewrite the labels to describe what’s being plotted.
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = ___)) +
labs(title = "Histogram of ___",
x = "___", y = "Count")
ggplot() +
geom_boxplot(aes(x = ___)) +
labs(title = "Boxplot of ___",
x = "___")
You’ll want to map price to the x aesthetic in both the histogram and boxplot.
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = price)) +
labs(title = "Histogram of ___",
x = "___", y = "Count")
ggplot() +
geom_boxplot(aes(x = price)) +
labs(title = "Boxplot of ___",
x = "___")
Now provide meaningful titles and axis labels.
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = price)) +
labs(title = "Histogram of Selling Prices",
x = "Sale Price ($)", y = "Count")
ggplot() +
geom_boxplot(aes(x = price)) +
labs(title = "Boxplot of Selling Prices",
x = "Sale Price ($)")
Taking Samples
In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we usually take a sample of the population and use that to understand the properties of the population.
If we were interested in estimating the mean living area in Ames based on a sample, we can use the sample() function to survey the population. Try running the code in the block below a few times — what happens?
This command collects a simple random sample of size 50 from the vector area and assigns it to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2,930 home sales.
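To make the mechanics concrete, here is a minimal sketch of that command. Since the lab’s area vector isn’t available outside the lab environment, a placeholder vector of made-up values stands in for it:

```r
set.seed(42)                                      # reproducibility for this sketch
fake_area <- rnorm(2930, mean = 1500, sd = 500)   # placeholder, NOT the real Ames data

samp1 <- sample(fake_area, 50)   # simple random sample of 50 values
length(samp1)                    # 50
mean(samp1)                      # a point estimate; it varies from sample to sample
```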
Run the code block below to produce a histogram of all observations in the population alongside a histogram of only those houses in our sample. How does the sample distribution compare to the population distribution?
If we’re interested in estimating the average living area of homes in Ames using our sample, our best single guess is the sample mean. Run the code block below and use the result to answer the question that follows.
Which of the following is true?
Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1,499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
Use the code block below to take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1?
Copy and paste the code from the previous code block. What should be changed?
samp1 <- sample(area, 50)
samp1
Change samp1 to samp2 in both lines.
samp2 <- sample(area, 50)
samp2
If you took a sample of size 50, a sample of size 100, and a sample of size 200, which sample would you expect to result in the most accurate estimate of the population mean?
Constructing a Sampling Distribution
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The sampling distribution can help us understand this variability. Because we have access to the entire population, we can build up the sampling distribution for the sample mean by repeating the above steps many times.
The code below generates 5,000 samples of size 50 and computes the sample mean of each. Run the code to see the distribution of sample means.
You may notice a warning message about bin width. The following code block shows how to adjust the number of bins using the bins argument. Run it to see the difference.
sample_means50 I
How many elements are there in sample_means50?
sample_means50 II
Where is the sampling distribution centered? That is, what is the mean of the distribution of sample means?
sample_means50 III
Would you expect the distribution to change if we collected 50,000 sample means instead of 5,000?
Interlude: The for Loop
Let’s take a break from the statistics for a moment to understand the code you used to generate all of those sample means and build the sampling distribution.
You may have just run your first ever for loop — a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type it out over and over again. Without the for loop, filling in just the first four entries of sample_means50 would require code like this:
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
You would only need to copy and paste that pattern 4,996 more times to fill in the entire vector! With the for loop, those thousands of lines are compressed into a handful. Here’s a simple loop — read the code and guess what will happen, then run it and see if you were right.
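A loop of that kind might look like the following sketch, which fills a vector one slot at a time (the values here are hypothetical, not the lab’s data):

```r
# A minimal for loop: store the square of i in the i-th slot
squares <- rep(NA, 5)      # initialize a container of five NA entries
for (i in 1:5) {
  squares[i] <- i^2        # overwrite slot i on each iteration
}
squares                    # 1 4 9 16 25
```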
Now, back to the original code. Let’s consider it line by line:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
- Line 1 initializes a vector of 5,000 NA entries called sample_means50. This vector will store the results generated within the for loop.
- Line 2 calls the for loop. It can be read as: “for every integer i from 1 to 5000, run the following lines of code.” The loop runs once when i = 1, once when i = 2, and so on up to (and including) i = 5000.
- Lines 3–4 are the body of the loop — the code that runs on every iteration. Each time through, we take a random sample of size 50 from area, compute its mean, and store it as the \(i^{th}\) element of sample_means50.
To make sure you understand what you’ve done, build and run a smaller version in the code block below. Initialize a vector of 100 zeros called sample_means_small, run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, iterating from 1 to 100. Print the output by including sample_means_small after the loop.
Start with the original for loop which creates sample_means50. What needs to be changed?
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
We’ll actually be changing quite a few items. We’ll also want to print out the resulting list of sample means.
___ <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, 50)
___[i] <- mean(samp)
}
#Print out the list of sample means
___
In the first line, the name of the container will be sample_means_small, and it will be initialized by repeating zero 100 times. We can initialize the container with any value we like (NA or 0 are natural choices) because the values will just be overwritten shortly.
sample_means_small <- rep(0, 100)
for(i in 1:___){
samp <- sample(area, 50)
___[i] <- mean(samp)
}
#Print out the list of sample means
___
In the second line, the for loop needs to overwrite all of the entries in the sample_means_small container, one element at a time. There are 100 slots in sample_means_small, so the for loop will run for i between 1 and 100.
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
___[i] <- mean(samp)
}
#Print out the list of sample means
___
In the last line inside the for loop, we’re overwriting the contents of sample_means_small one slot at a time. Fill that blank with sample_means_small.
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
#Print out the list of sample means
___
The sample means are now contained in sample_means_small. Print them out by typing sample_means_small in place of the blank on the last line.
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
#Print out the list of sample means
sample_means_small
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
sample_means_small
sample_means_small I
How many elements are there in sample_means_small?
sample_means_small II
What does each element of sample_means_small represent?
Sample Size and the Sampling Distribution
Now that we have a better understanding of the mechanics of our code, let’s return to the reason we used a for loop: to compute an approximation for the sampling distribution. To get a sense of the effect that sample size has on the sampling distribution, build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100. Call them sample_means10 and sample_means100 respectively. The code to create sample_means50 is pre-populated for reference.
Copy and paste all of the code to create sample_means50 twice and update it to create sample_means10 and sample_means100.
Fill in the blanks for both new objects:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
___ <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, ___)
___[i] <- mean(samp)
}
___ <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, ___)
___[i] <- mean(samp)
}
Fill in the blanks for sample_means100 similarly:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
sample_means10 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
}
sample_means100 <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, ___)
___[i] <- mean(samp)
}
Fill in the blanks for sample_means100 similarly:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
sample_means10 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
}
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
# Samples of 50 houses at a time
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
# Samples of 10 houses at a time
sample_means10 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
}
# Samples of 100 houses at a time
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
Now that you’ve created the three sampling distributions, I’ve done a bit of “behind the scenes” work to combine them into a single data frame. This makes it convenient to compare them in a single faceted plot. Run the code block below to see the effect that different sample sizes have on the sampling distribution.
When the sample size is larger, what happens to the center of the sampling distribution?
What happens to the spread of the sampling distribution as the sample size increases?
Confidence Intervals
Based on a sample, what can we infer about the population? In practice, we’ll have just a single sample. In this case, the best estimate of the average living area of houses sold in Ames would be the sample mean \(\bar{x}\). That serves as a reasonable point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting a certain number of standard errors to the point estimate. For a 95% confidence interval, that number comes from the normal distribution — specifically, the value that cuts off 2.5% in each tail. If you remember the Empirical Rule, you’ll recall that approximately 95% of observations fall within two standard deviations of the mean. A more precise value is 1.96, and for now we’ll use this as our multiplier.
Using 1.96 assumes we know quite a bit about the population. In practice, we almost never do — and in later activities we’ll introduce a refinement that accounts for this uncertainty. For now, 1.96 gives us a good working approximation and lets us focus on the core idea: that a confidence interval puts a range of plausible values around our point estimate.
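As a sketch of the arithmetic, here is the interval computed from made-up sample statistics (your own sample’s mean and standard deviation will differ):

```r
# Hypothetical sample statistics -- made up for illustration only
n    <- 60
xbar <- 1520    # sample mean (sq ft)
s    <- 490     # sample standard deviation (sq ft)

se    <- s / sqrt(n)        # standard error of the mean
lower <- xbar - 1.96 * se   # lower bound of the 95% confidence interval
upper <- xbar + 1.96 * se   # upper bound of the 95% confidence interval
c(lower, upper)             # roughly (1396, 1644)
```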
Use the code block below to take a single sample of size 60 of areas from houses in Ames. Call it samp.
Look earlier in the activity for where we took samples of square footage areas for homes. Adapt that code here.
___ <- sample(___, ___)
samp <- sample(area, 60)
samp <- sample(area, 60)
Run the code below to construct the confidence interval.
This is an important inference: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the calculated lower and upper bounds.
There are a few conditions that must be met for this interval to be valid – the questions that follow test what you recall from our previous discussion on the Central Limit Theorem.
For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true? Select all that apply.
Which of the following is the correct interpretation of the 95% confidence interval?
In this case we have the luxury of knowing the true population mean since we have data on the entire population. We calculated it earlier — it is 1,499.69 square feet.
Every time you run the code to build samp and construct the confidence interval you will get a slightly different confidence interval.
If you built these confidence intervals over and over, about what proportion of the intervals do you expect to contain the true population mean?
Your response to the previous question highlights a real intricacy. The level of confidence – here 95% – indicates the confidence that we have in the procedure. That is, in the long run, if we were to collect data and construct confidence intervals for the average square footage of homes sold, about 95% of those intervals would contain the population mean (\(\mu\)).
The claim in the box above can be difficult to grasp. We’ll use visuals to clarify what is being said. We’re going to draw many samples (a hundred of them) and construct a confidence interval from the randomly chosen properties in each. Here is the rough outline:
- Obtain a random sample.
- Calculate and store the sample’s mean and standard deviation.
- Repeat steps 1 and 2 one hundred times.
- Use these stored statistics to calculate many confidence intervals.
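Those four steps can be sketched as follows. This is a minimal version with a placeholder population standing in for the lab’s area vector:

```r
set.seed(1)                                       # reproducible sketch
fake_area <- rnorm(2930, mean = 1500, sd = 500)   # placeholder population

samp_mean <- rep(NA, 100)
samp_sd   <- rep(NA, 100)

for (i in 1:100) {
  samp         <- sample(fake_area, 60)   # step 1: obtain a random sample
  samp_mean[i] <- mean(samp)              # step 2: store its mean...
  samp_sd[i]   <- sd(samp)                #         ...and standard deviation
}                                         # step 3: repeat 100 times

# Step 4: build one 95% interval per sample
lower <- samp_mean - 1.96 * samp_sd / sqrt(60)
upper <- samp_mean + 1.96 * samp_sd / sqrt(60)

# Proportion of intervals capturing the true (placeholder) mean
coverage <- mean(lower < mean(fake_area) & mean(fake_area) < upper)
coverage   # should be near 0.95
```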
Read the code below — does it make sense to you? What do you think the result will be? Once you think you know, run the code block and see if you were right. Run it multiple times and observe what changes. What do the red highlighted intervals indicate? About how many red intervals do you observe on average?
Good work through this activity. I hope it has made the Central Limit Theorem and sampling distributions a bit more intuitive. We’ll do more with confidence intervals in the coming activities. If you are interested in doing more, go back through this activity using the price data in place of area.
Submit
If you are part of a course with an instructor who is grading your work on these activities, please copy and submit both of the hashes below using the method your instructor has requested.
The hash below encodes your responses to the multiple choice and checkbox questions in this activity.
Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.
You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.
Summary
In this activity you connected the abstract machinery of the Central Limit Theorem to a concrete application — estimating the average home size in Ames, Iowa. Here are the key takeaways and a look at what’s ahead.
- Sample statistics are point estimates for population parameters. The sample mean \(\bar{x}\) is our best single guess for the population mean \(\mu\), but we must acknowledge that the estimate \(\bar{x}\) will vary from sample to sample.
- The sampling distribution describes that variability. By simulating many samples and computing each mean, we can see how much our estimate is likely to fluctuate — and that fluctuation decreases as sample size grows.
- Larger samples produce more precise estimates. The standard error \(S_E = s/\sqrt{n}\) quantifies this — as \(n\) increases, \(S_E\) decreases, and our sampling distribution narrows.
- A confidence interval quantifies our uncertainty. A 95% confidence interval is constructed as \(\bar{x} \pm 1.96 \times S_E\). Rather than a single point estimate, it gives a range of plausible values for the population mean.
- “95% confident” has a specific meaning. If we repeated the sampling process many times and built a confidence interval each time, approximately 95% of those intervals would contain the true population mean. Any single interval either contains the true mean or it doesn’t — we just don’t know which.
- The for loop is a powerful tool for simulation. It allowed us to build entire sampling distributions and collections of confidence intervals that would be impossibly tedious to construct by hand.
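The second and third takeaways can be checked directly by simulation: the standard deviation of many simulated sample means should land close to the theoretical standard error. A sketch with a placeholder population:

```r
set.seed(7)
pop <- rnorm(10000, mean = 1500, sd = 500)   # placeholder population

means50 <- rep(NA, 2000)
for (i in 1:2000) {
  means50[i] <- mean(sample(pop, 50))        # mean of one sample of size 50
}

sd(means50)          # empirical spread of the sampling distribution
sd(pop) / sqrt(50)   # theoretical standard error, about 70
```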
In this activity you built confidence intervals informally using the formula \(\bar{x} \pm 1.96 \times S_E\). You saw that these confidence intervals most often did contain the true value of the population parameter we were seeking. In the coming activities we’ll formalize this process, explore some intricacies, and extend these ideas beyond just estimating the value of a single population mean.