Topic 10: Introduction to Inference Lab
In this lab, we investigate the ways in which statistics from a random sample can serve as point estimates for population parameters. We dig deeper into the Central Limit Theorem, explore connections between sample size and uncertainty, and introduce the notion of the confidence interval.
This is a derivative of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
Intro to Inference Lab
In this activity, we investigate the ways in which statistics from a random sample of data can serve as point estimates for population parameters. This activity builds on the content from Topic 9, where we were introduced to the sampling distribution and the Central Limit Theorem. Here we will become more familiar with these two ideas and learn how to use the sampling distribution to make claims about population-level data.
The Data
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be a dataset containing all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. While we typically don’t have access to population-level data, this will allow us to see the Central Limit Theorem in action and to see how close our estimates come to the corresponding true population mean.
In this lab we investigate what we can learn about the full population of these home sales by taking smaller samples. We’ll see how well our small samples can be used to estimate a population parameter. The data has been loaded for you in a dataset called ames.
Use the code block below to answer some basic questions about the ames dataset.
You can call a data frame by name to print it out. You can also use functions like head(), dim(), and glimpse() to learn more about it.
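For instance, the first two of these functions work on any data frame. Here is a quick sketch using the built-in mtcars dataset as a stand-in, since the ames data is only loaded inside the lab environment:

```r
# mtcars is a small built-in data frame standing in for ames here
head(mtcars)   # prints the first six rows
dim(mtcars)    # returns the number of rows, then columns: 32 11

# glimpse() gives a similar compact overview, but it comes from the
# dplyr package, which is loaded for you in this lab
```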
ames Dataset I
What does each observation in the ames dataset represent?
ames Dataset II
How many observations are there in the ames dataset?
ames Dataset III
How many variables are there in the ames dataset?
We see that there are quite a few variables in the dataset — enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet and the sale price. To save some effort throughout the lab, we’ll create two variables with short names that represent these two variables. The code block below is pre-set to define the area vector — add a second line which creates the price vector.
Start by copying the existing code and making changes to it. What needs to change?
area <- ames |>
pull(area)
area <- ames |>
pull(area)
We want to use a different variable name because (i) we don’t want to overwrite the existing area variable, and (ii) we want the name to be meaningful.
We also don’t want to extract the area column again.
area <- ames |>
pull(area)
___ <- ames |>
pull(___)
Store the results into a new variable called price, as requested in the prompt. What column from the ames data frame do you want to extract?
area <- ames |>
pull(area)
price <- ames |>
pull(___)
Store the results into a new variable called price, as requested in the prompt. You want to extract the price column.
area <- ames |>
pull(area)
price <- ames |>
pull(price)
area <- ames |>
pull(area)
price <- ames |>
pull(price)
Initial Exploration
Let’s look at the distributions of price and area in our population of home sales by calculating a few summary statistics. Use the code block below to find the mean and median for both price and area. Use your results to answer the questions that follow.
The area and price objects are vectors of values. Pass them, one at a time, to the mean() and median() functions.
Fill in the blanks:
mean(___)
median(___)
mean(___)
median(___)
Which of the following is the mean of the price variable?
Which of the following is the median of the price variable?
What does the relationship between the mean and median tell you about the distribution of price?
Which of the following is the mean of the area variable?
Which of the following is the median of the area variable?
What does the relationship between the mean and median tell you about the distribution of area?
Use the code block below to verify your answers about the shapes of the distributions. The code to produce the histogram and boxplot for area is already present. Add the code necessary to produce a histogram and boxplot for price. Are you able to see the skew in the distributions?
Start by copying and pasting the code for the existing plots. What needs to be changed?
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
In the second set of plots, we’ll need to replace the variable being plotted and rewrite the labels to describe what’s being plotted.
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = ___)) +
labs(title = "Histogram of ___",
x = "___", y = "Count")
ggplot() +
geom_boxplot(aes(x = ___)) +
labs(title = "Boxplot of ___",
x = "___")
You’ll want to map price to the x aesthetic in both the histogram and boxplot.
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = price)) +
labs(title = "Histogram of ___",
x = "___", y = "Count")
ggplot() +
geom_boxplot(aes(x = price)) +
labs(title = "Boxplot of ___",
x = "___")
Now provide meaningful titles and axis labels.
ggplot() +
geom_histogram(aes(x = area)) +
labs(title = "Histogram of Home Size",
x = "Size (sq ft)", y = "Count")
ggplot() +
geom_boxplot(aes(x = area)) +
labs(title = "Boxplot of Home Sizes",
x = "Size (sq ft)")
ggplot() +
geom_histogram(aes(x = price)) +
labs(title = "Histogram of Selling Prices",
x = "Sale Price ($)", y = "Count")
ggplot() +
geom_boxplot(aes(x = price)) +
labs(title = "Boxplot of Selling Prices",
x = "Sale Price ($)")
Taking Samples
In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we usually take a sample of the population and use that to understand the properties of the population.
If we were interested in estimating the mean living area in Ames based on a sample, we can use the sample() function to survey the population. Try running the code in the block below a few times — what happens?
This command collects a simple random sample of size 50 from the vector area and assigns it to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2,930 home sales.
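To make the mechanics concrete, here is a minimal sketch of that command. Since the lab’s area vector isn’t available outside the lab environment, a placeholder vector of made-up values stands in for it:

```r
set.seed(42)                                      # reproducibility for this sketch
fake_area <- rnorm(2930, mean = 1500, sd = 500)   # placeholder, NOT the real Ames data

samp1 <- sample(fake_area, 50)   # simple random sample of 50 values
length(samp1)                    # 50
mean(samp1)                      # a point estimate; it varies from sample to sample
```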
Run the code block below to produce a histogram of all observations in the population alongside a histogram of only those houses in our sample. How does the sample distribution compare to the population distribution?
If we’re interested in estimating the average living area of homes in Ames using our sample, our best single guess is the sample mean. Run the code block below and use the result to answer the question that follows.
Which of the following is true?
Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1,499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
Use the code block below to take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1?
Copy and paste the code from the previous code block. What should be changed?
samp1 <- sample(area, 50)
samp1
Change samp1 to samp2 in both lines.
samp2 <- sample(area, 50)
samp2
If you took a sample of size 50, a sample of size 100, and a sample of size 200, which sample would you expect to result in the most accurate estimate of the population mean?
Constructing a Sampling Distribution
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The sampling distribution can help us understand this variability. Because we have access to the entire population, we can build up the sampling distribution for the sample mean by repeating the above steps many times.
The code below generates 5,000 samples of size 50 and computes the sample mean of each. Run the code to see the distribution of sample means.
You may notice a warning message about bin width. The following code block shows how to adjust the number of bins using the bins argument. Run it to see the difference.
sample_means50 I
How many elements are there in sample_means50?
sample_means50 II
Where is the sampling distribution centered? That is, what is the mean of the distribution of sample means?
sample_means50 III
Would you expect the distribution to change if we collected 50,000 sample means instead of 5,000?
Interlude: The for Loop
Let’s take a break from the statistics for a moment to understand the code you used to generate all of those sample means and build the sampling distribution.
You may have just run your first ever for loop — a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type it out over and over again. Without the for loop, filling in just the first four entries of sample_means50 would require code like this:
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
You would only need to copy and paste that pattern 4,996 more times to fill in the entire vector! With the for loop, those thousands of lines are compressed into a handful. Here’s a simple loop — read the code and guess what will happen, then run it and see if you were right.
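A loop of that kind might look like the following sketch, which fills a vector one slot at a time (the values here are hypothetical, not the lab’s data):

```r
# A minimal for loop: store the square of i in the i-th slot
squares <- rep(NA, 5)      # initialize a container of five NA entries
for (i in 1:5) {
  squares[i] <- i^2        # overwrite slot i on each iteration
}
squares                    # 1 4 9 16 25
```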
Now, back to the original code. Let’s consider it line by line:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
- Line 1 initializes a vector of 5,000 NA entries called sample_means50. This vector will store the results generated within the for loop.
- Line 2 calls the for loop. It can be read as: “for every integer i from 1 to 5000, run the following lines of code.” The loop runs once when i = 1, once when i = 2, and so on up to (and including) i = 5000.
- Lines 3–4 are the body of the loop — the code that runs on every iteration. Each time through, we take a random sample of size 50 from area, compute its mean, and store it as the \(i^{th}\) element of sample_means50.
To make sure you understand what you’ve done, build and run a smaller version in the code block below. Initialize a vector of 100 zeros called sample_means_small, run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, iterating from 1 to 100. Print the output by including sample_means_small after the loop.
Start with the original for loop which creates sample_means50. What needs to be changed?
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
We’ll actually be changing quite a few items. We’ll also want to print out the resulting list of sample means.
___ <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, 50)
___[i] <- mean(samp)
}
#Print out the list of sample means
___
In the first line, the name of the container will be sample_means_small, and it will be initialized by repeating zero 100 times. We can initialize the container with any value we like (NA or 0 are natural choices) because the values will just be overwritten shortly.
sample_means_small <- rep(0, 100)
for(i in 1:___){
samp <- sample(area, 50)
___[i] <- mean(samp)
}
#Print out the list of sample means
___
In the second line, the for loop needs to overwrite all of the entries in the sample_means_small container, one element at a time. There are 100 slots in sample_means_small, so the for loop will run for i between 1 and 100.
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
___[i] <- mean(samp)
}
#Print out the list of sample means
___
In the last line inside the for loop, we’re overwriting the contents of sample_means_small one slot at a time. Fill that blank with sample_means_small.
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
#Print out the list of sample means
___
The sample means are now contained in sample_means_small. Print them out by typing sample_means_small in place of the blank on the last line.
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
#Print out the list of sample means
sample_means_small
sample_means_small <- rep(0, 100)
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
sample_means_small
sample_means_small I
How many elements are there in sample_means_small?
sample_means_small II
What does each element of sample_means_small represent?
Sample Size and the Sampling Distribution
Now that we have a better understanding of the mechanics of our code, let’s return to the reason we used a for loop: to compute an approximation for the sampling distribution. To get a sense of the effect that sample size has on the sampling distribution, build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100. Call them sample_means10 and sample_means100 respectively. The code to create sample_means50 is pre-populated for reference.
Copy and paste all of the code to create sample_means50 twice and update it to create sample_means10 and sample_means100.
Fill in the blanks for both new objects:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
___ <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, ___)
___[i] <- mean(samp)
}
___ <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, ___)
___[i] <- mean(samp)
}
Fill in the blanks for sample_means100 similarly:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
sample_means10 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
}
sample_means100 <- rep(___, ___)
for(i in 1:___){
samp <- sample(area, ___)
___[i] <- mean(samp)
}
Fill in the blanks for sample_means100 similarly:
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
sample_means10 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
}
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
# Samples of 50 houses at a time
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
# Samples of 10 houses at a time
sample_means10 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
}
# Samples of 100 houses at a time
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
Now that you’ve created the three sampling distributions, I’ve done a bit of “behind the scenes” work to combine them into a single data frame. This makes it convenient to compare them in a single faceted plot. Run the code block below to see the effect that different sample sizes have on the sampling distribution.
When the sample size is larger, what happens to the center of the sampling distribution?
What happens to the spread of the sampling distribution as the sample size increases?
Confidence Intervals
Based on a sample, what can we infer about the population? In practice, we’ll have just a single sample. In this case, the best estimate of the average living area of houses sold in Ames would be the sample mean \(\bar{x}\). That serves as a reasonable point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting a certain number of standard errors to the point estimate. For a 95% confidence interval, that number comes from the normal distribution — specifically, the value that cuts off 2.5% in each tail. If you remember the Empirical Rule, you’ll recall that approximately 95% of observations fall within two standard deviations of the mean. A more precise value is 1.96, and for now we’ll use this as our multiplier.
Using 1.96 assumes we know quite a bit about the population. In practice, we almost never do — and in later activities we’ll introduce a refinement that accounts for this uncertainty. For now, 1.96 gives us a good working approximation and lets us focus on the core idea: that a confidence interval puts a range of plausible values around our point estimate.
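As a sketch of the arithmetic, here is the interval computed from made-up sample statistics (your own sample’s mean and standard deviation will differ):

```r
# Hypothetical sample statistics -- made up for illustration only
n    <- 60
xbar <- 1520    # sample mean (sq ft)
s    <- 490     # sample standard deviation (sq ft)

se    <- s / sqrt(n)        # standard error of the mean
lower <- xbar - 1.96 * se   # lower bound of the 95% confidence interval
upper <- xbar + 1.96 * se   # upper bound of the 95% confidence interval
c(lower, upper)             # roughly (1396, 1644)
```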
Use the code block below to take a single sample of size 60 of areas from houses in Ames. Call it samp.
Look earlier in the activity for where we took samples of square footage areas for homes. Adapt that code here.
___ <- sample(___, ___)
samp <- sample(area, 60)
samp <- sample(area, 60)
Run the code below to construct the confidence interval.
This is an important inference: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the calculated lower and upper bounds.
There are a few conditions that must be met for this interval to be valid – the questions that follow test what you recall from our previous discussion on the Central Limit Theorem.
For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true? Select all that apply.
Which of the following is the correct interpretation of the 95% confidence interval?
In this case we have the luxury of knowing the true population mean since we have data on the entire population. We calculated it earlier — it is 1,499.69 square feet.
Every time you run the code to build samp and construct the confidence interval you will get a slightly different confidence interval.
If you built these confidence intervals over and over, about what proportion of the intervals do you expect to contain the true population mean?
Your response to the previous question highlights a real intricacy. The level of confidence – here 95% – indicates the confidence that we have in the procedure. That is, in the long run, if we were to collect data and construct confidence intervals for the average square footage of homes sold, about 95% of those intervals would contain the population mean (\(\mu\)).
The claim in the box above can be difficult to grasp. We’ll use visuals to clarify what is being said. We’re going to draw many samples (a hundred of them) and construct a confidence interval from the randomly chosen properties in each. Here is the rough outline:
- Obtain a random sample.
- Calculate and store the sample’s mean and standard deviation.
- Repeat steps 1 and 2 one hundred times.
- Use these stored statistics to calculate many confidence intervals.
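Those four steps can be sketched as follows. This is a minimal version with a placeholder population standing in for the lab’s area vector:

```r
set.seed(1)                                       # reproducible sketch
fake_area <- rnorm(2930, mean = 1500, sd = 500)   # placeholder population

samp_mean <- rep(NA, 100)
samp_sd   <- rep(NA, 100)

for (i in 1:100) {
  samp         <- sample(fake_area, 60)   # step 1: obtain a random sample
  samp_mean[i] <- mean(samp)              # step 2: store its mean...
  samp_sd[i]   <- sd(samp)                #         ...and standard deviation
}                                         # step 3: repeat 100 times

# Step 4: build one 95% interval per sample
lower <- samp_mean - 1.96 * samp_sd / sqrt(60)
upper <- samp_mean + 1.96 * samp_sd / sqrt(60)

# Proportion of intervals capturing the true (placeholder) mean
coverage <- mean(lower < mean(fake_area) & mean(fake_area) < upper)
coverage   # should be near 0.95
```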
Read the code below — does it make sense to you? What do you think the result will be? Once you think you know, run the code block and see if you were right. Run it multiple times and observe what changes. What do the red highlighted intervals indicate? About how many red intervals do you observe on average?
Good work through this activity. I hope it has made the Central Limit Theorem and sampling distributions a bit more intuitive. We’ll do more with confidence intervals in the coming activities. If you are interested in doing more, go back through this activity using the price data in place of area.
Submit
If you are part of a course with an instructor who is grading your work on these activities, please copy and submit both of the hashes below using the method your instructor has requested.
The hash below encodes your responses to the multiple choice and checkbox questions in this activity.
Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.
You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.
Summary
In this activity you connected the abstract machinery of the Central Limit Theorem to a concrete application — estimating the average home size in Ames, Iowa. Here are the key takeaways and a look at what’s ahead.
- Sample statistics are point estimates for population parameters. The sample mean \(\bar{x}\) is our best single guess for the population mean \(\mu\), but we must acknowledge that the estimate \(\bar{x}\) will vary from sample to sample.
- The sampling distribution describes that variability. By simulating many samples and computing each mean, we can see how much our estimate is likely to fluctuate — and that fluctuation decreases as sample size grows.
- Larger samples produce more precise estimates. The standard error \(S_E = s/\sqrt{n}\) quantifies this — as \(n\) increases, \(S_E\) decreases, and our sampling distribution narrows.
- A confidence interval quantifies our uncertainty. A 95% confidence interval is constructed as \(\bar{x} \pm 1.96 \times S_E\). Rather than a single point estimate, it gives a range of plausible values for the population mean.
- “95% confident” has a specific meaning. If we repeated the sampling process many times and built a confidence interval each time, approximately 95% of those intervals would contain the true population mean. Any single interval either contains the true mean or it doesn’t — we just don’t know which.
- The for loop is a powerful tool for simulation. It allowed us to build entire sampling distributions and collections of confidence intervals that would be impossibly tedious to construct by hand.
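The second and third takeaways can be checked directly by simulation: the standard deviation of many simulated sample means should land close to the theoretical standard error. A sketch with a placeholder population:

```r
set.seed(7)
pop <- rnorm(10000, mean = 1500, sd = 500)   # placeholder population

means50 <- rep(NA, 2000)
for (i in 1:2000) {
  means50[i] <- mean(sample(pop, 50))        # mean of one sample of size 50
}

sd(means50)          # empirical spread of the sampling distribution
sd(pop) / sqrt(50)   # theoretical standard error, about 70
```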
In this activity you built confidence intervals informally using the formula \(\bar{x} \pm 1.96 \times S_E\). You saw that these confidence intervals most often did contain the true value of the population parameter we were seeking. In the coming activities we’ll formalize this process, explore some intricacies, and extend these ideas beyond just estimating the value of a single population mean.