Topic 17: Inference for Numerical Data (Lab)
In this lab, we work with data on 1,000 pregnancies recorded in North Carolina in 2004. We revisit the inference() function from the {statsr} package — this time applying it to numerical data. We’ll conduct hypothesis tests and construct confidence intervals for means.
This is a derivative of a product of OpenIntro released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.
Inference for Numerical Data
During 2004, the state of North Carolina released a large data set containing information on births recorded in the state. This data set is useful to researchers studying the relationship between the habits, practices, and demographic characteristics of expectant mothers and the birth outcomes of their children. This includes important questions about equity in prenatal care access and outcomes. We will work with a random sample of 1,000 observations from this data set, loaded as nc.
The nc dataset contains 13 variables:
| Variable | Description |
|---|---|
| fage | father’s age in years |
| mage | mother’s age in years |
| mature | maturity status of mother |
| weeks | length of pregnancy in weeks |
| premie | whether the birth was classified as premature or full-term |
| visits | number of hospital visits during pregnancy |
| marital | whether mother is married or not married at birth |
| gained | weight gained by mother during pregnancy in pounds |
| weight | weight of the baby at birth in pounds |
| lowbirthweight | whether baby was classified as low birthweight or not |
| gender | gender of the baby |
| habit | status of the mother as a nonsmoker or smoker |
| whitemom | whether mom is white or not white |
Use the code block below to explore the nc data frame and answer the questions that follow.
Try running nc on its own.
nc
What are the cases in this data set?
How many cases are there in our sample?
As a first step in any analysis, it’s good practice to review summaries of the data. Use summary() or skim() from the {skimr} package to get an overview of the nc data frame, and use the output to help you answer the questions that follow.
Pipe the nc data frame into your favorite summary function.
nc |>
  ___
I like the skim() function from {skimr} because it provides lots of useful information about the data set.
nc |>
  skim()
As you review the variable summaries, consider which variables are categorical and which are numerical.
Which of the variables in the nc dataset are numerical? Select all that apply.
For the numerical variables in the nc dataset, are there any obvious outliers? Feel free to use the code block below to explore further with plots.
Boxplots do a nice job at visually identifying outliers. Do you remember how to construct a plot?
Start by piping your data frame into ggplot().
nc |>
  ggplot()
Now, add a boxplot layer and map your numerical variables to the x aesthetic one at a time.
nc |>
  ggplot() +
  geom_boxplot(aes(x = ___))
For example, we can plot the mother’s age (mage).
nc |>
  ggplot() +
  geom_boxplot(aes(x = mage))
If plotting one variable at a time feels inefficient, it is possible to plot them all at once!
nc |>
  select_if(~is.numeric(.)) |>
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "value") |>
  ggplot() +
  geom_boxplot(aes(x = value,
                   y = variable))
Exploratory Analysis
Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step — it helps us quickly visualize trends, identify associations, and develop research questions.
Create a side-by-side boxplot of habit and weight. Add the following labels to your plot:
labs(
  title = "Birthweight by Smoking Status",
  x = "",
  y = "Birthweight (lbs)"
)
What does the plot suggest about the relationship between these two variables?
Start by piping nc into ggplot().
nc |>
  ggplot()
Add a geom_boxplot() layer. You’ll need x and y aesthetic mappings.
nc |>
  ggplot() +
  geom_boxplot(aes(x = ___, y = ___))
The habit variable encodes whether the mother is a smoker or not, and the weight variable contains the birth weight of the baby.
Now don’t forget the labels.
nc |>
  ggplot() +
  geom_boxplot(aes(x = habit, y = weight)) +
  labs(___)
nc |>
  ggplot() +
  geom_boxplot(aes(x = habit, y = weight)) +
  labs(
    title = "Birthweight by Smoking Status",
    x = "",
    y = "Birthweight (lbs)"
  )
The boxplots show how the medians of the two distributions compare, but we can also compare the means directly. The following code groups the data by habit and computes the mean weight for each group. Think about what the output will look like before running it, then run it and reflect on what it tells you.
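A minimal sketch of that grouped summary is shown here. It assumes {dplyr} is installed and that nc is the copy shipped with {statsr}; the column name mean_weight and the na.rm = TRUE guard are illustrative choices, not taken from the lab.

```r
library(dplyr)
library(statsr)

# Assumption: nc is loaded from {statsr}; in the lab it is already in the session.
data(nc)

# One row per level of habit, with the average birth weight (lbs) in that group.
# na.rm = TRUE is a defensive assumption in case weight has missing values.
habit_means <- nc |>
  group_by(habit) |>
  summarize(mean_weight = mean(weight, na.rm = TRUE))

habit_means
```

If habit has missing values, they appear as their own NA row in the output.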
What does the output from the code cell above tell you?
Remember that we can’t make population-level claims from summary statistics alone. Because of sampling variation, we know that we would obtain different results (different average birth weights) if we were to collect a new sample. Statistical inference helps us quantify how different those results may be. Inference is what allows us to make population-level claims.
Inference
There is an observed difference in average birth weights between the two groups — but is this difference statistically significant? To answer this we’ll conduct a hypothesis test. First, let’s check whether the conditions necessary for inference are satisfied. Use the code block below to obtain sample sizes for each group.
Pipe nc into count() and pass the grouping variable as an argument.
nc |>
  count(habit)
How many groups are being considered?
Which of the following are the groups? Select all that apply.
Are the conditions for inference satisfied?
The hypotheses for testing whether average birth weights differ between smoking and non-smoking mothers are:
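Using the notation this lab uses for the confidence interval, and assuming the two-sided alternative that matches the word "differ", the hypotheses can be written as:

```latex
\[
H_0: \mu_{\text{nonsmoker}} - \mu_{\text{smoker}} = 0
\qquad \text{versus} \qquad
H_A: \mu_{\text{nonsmoker}} - \mu_{\text{smoker}} \neq 0
\]
```

Here \(\mu_{\text{nonsmoker}}\) and \(\mu_{\text{smoker}}\) denote the population mean birth weights for babies of non-smoking and smoking mothers.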
Now let’s use the inference() function from {statsr} to conduct the hypothesis test. Here’s a reminder of the key arguments:
- y: the response variable (weight)
- x: the explanatory variable that splits data into groups (habit)
- data: the data frame (nc)
- statistic: the parameter of interest ("mean")
- type: "ht" for hypothesis test or "ci" for confidence interval
- null: the null value (for a hypothesis test about a difference in means, this is 0)
- alternative: "less", "greater", or "twosided"
- method: "theoretical" or "simulation"
The code block below runs the hypothesis test. Review the output, then modify the code to instead construct a confidence interval for the difference in average birth weights between the two groups. When switching to type = "ci", remove the null and alternative arguments — they don’t apply to confidence intervals.
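The hypothesis-test call would look roughly like the sketch below. It mirrors the hypothesis-test call reused in the On Your Own hints; loading nc with data(nc) from {statsr} is an assumption about how the data reach your session.

```r
library(statsr)

# Assumption: nc is loaded from {statsr}; in the lab it is already in the session.
data(nc)

# Two-sided test of H0: no difference in mean birth weight between habit groups.
inference(
  y = weight,
  x = habit,
  data = nc,
  statistic = "mean",
  type = "ht",
  null = 0,
  alternative = "twosided",
  method = "theoretical"
)
```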
First run the code as-is to see the hypothesis test results. Then identify which arguments need to change for a confidence interval.
Change type = "ht" to type = "ci". Confidence intervals don’t have null values or alternative hypotheses.
Remove the null and alternative arguments and change type to "ci".
inference(
  y = weight,
  x = habit,
  data = nc,
  statistic = "mean",
  type = "ci",
  method = "theoretical"
)
By default, inference() reports the interval for \(\mu_{\text{nonsmoker}} - \mu_{\text{smoker}}\) because R orders factor levels alphabetically. You can reverse this using the order argument. Run the code below to see the result.
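A sketch of the reversed comparison, using the same call that appears in the On Your Own hints (loading nc via data(nc) from {statsr} is an assumption):

```r
library(statsr)

# Assumption: nc is loaded from {statsr}; in the lab it is already in the session.
data(nc)

# order = c("smoker", "nonsmoker") asks for mu_smoker - mu_nonsmoker
# instead of the default alphabetical ordering.
inference(
  y = weight,
  x = habit,
  data = nc,
  statistic = "mean",
  type = "ci",
  method = "theoretical",
  order = c("smoker", "nonsmoker")
)
```

With the theoretical method, reversing the order flips the sign of the point estimate, so the reported interval should be the mirror image of the default one.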
On Your Own
Use the code block below to work through the following tasks. Each task includes hints if you need them.
1. Construct a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context.
Use inference() with y = weeks and data = nc. Since there’s no grouping variable, leave out x entirely.
Start by copying and pasting our most recent call to the inference() function. Editing it is much easier than starting from scratch!
inference(
  y = weight,
  x = habit,
  data = nc,
  statistic = "mean",
  type = "ci",
  method = "theoretical",
  order = c("smoker", "nonsmoker")
)
What needs to be changed?
- The variable of interest (y) is weeks instead of weight.
- We’re not comparing groups, so get rid of the grouping variable x altogether.
- We don’t have groups, so setting an order here doesn’t make sense; remove it.
inference(
  y = weeks,
  data = nc,
  statistic = "mean",
  type = "ci",
  method = "theoretical"
)
2. Calculate a new confidence interval for the same parameter at the 90% confidence level. You can change the confidence level by adding conf_level = 0.90 as an argument to inference().
To change the confidence level, add the conf_level argument to the call to inference(). It can go anywhere! (Just remember that arguments are separated by commas.)
inference(
  y = weeks,
  data = nc,
  statistic = "mean",
  type = "ci",
  method = "theoretical",
  conf_level = 0.90
)
3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers differs from the average weight gained by mature mothers.
Copy and paste an earlier call to the inference() function where we were conducting a hypothesis test.
inference(
  y = weight,
  x = habit,
  data = nc,
  statistic = "mean",
  type = "ht",
  null = 0,
  alternative = "twosided",
  method = "theoretical"
)
What needs to be changed?
- We’re interested in weight gained (gained) by the mother during the pregnancy. Change y to reflect this.
- The grouping variable is whether the mother is young or mature (mature). Update x with this change.
No other changes are necessary.
inference(
  y = ___,
  x = ___,
  data = nc,
  statistic = "mean",
  type = "ht",
  null = 0,
  alternative = "twosided",
  method = "theoretical"
)
inference(
  y = gained,
  x = mature,
  data = nc,
  statistic = "mean",
  type = "ht",
  null = 0,
  alternative = "twosided",
  method = "theoretical"
)
4. Determine the age cutoff that separates younger and mature mothers. Use a method of your choice and explain how it works.
How might you find the oldest young mother? What about the youngest old mother?
This is a great opportunity for grouping. Start with the nc data frame and then group_by() the mature variable.
nc |>
  group_by(mature)
Then let’s summarize the data. How?
nc |>
  group_by(mature) |>
  summarize(___)
Summarize the data to find the minimum mother’s age (mage) in each group.
nc |>
  group_by(mature) |>
  summarize(
    min_age = min(mage)
  )
This gives us part of the answer, but we still can’t be completely certain about the age cutoff between younger and mature mothers. Add another summary statistic to help solidify your answer.
nc |>
  group_by(mature) |>
  summarize(
    min_age = min(mage),
    ___ = ___
  )
Summarize the data to find the minimum and maximum mother’s age (mage) in each group.
nc |>
  group_by(mature) |>
  summarize(
    min_age = min(mage),
    max_age = max(mage)
  )
5. Choose a pair of numerical and categorical variables and formulate a research question that can be answered with a hypothesis test or confidence interval. Use inference() to answer it, report the statistical results, and provide a plain-language interpretation.
Explore and experiment with questions that are of interest to you. Start with an existing call to the inference() function that does something close to what you want to do. Make the edits necessary in order to conduct your desired investigation.
Summary
- The inference() function from {statsr} handles both hypothesis tests (type = "ht") and confidence intervals (type = "ci") for means and proportions. The key arguments change depending on the task: for a hypothesis test you supply null and alternative hypotheses; for a confidence interval you can optionally set conf_level.
- Conditions for inference on means require that observations are independent and that either the sample size is large enough to rely on the CLT or the population distribution is approximately normal. Group sizes well above 30 generally satisfy this even with moderate skew.
- The order argument in inference() controls which group is subtracted from which in a two-sample comparison. The default is alphabetical; use order = c("group1", "group2") to set your preferred direction.
- Exploratory analysis comes first. Side-by-side boxplots, grouped summaries, and counts help you understand the data before running any formal inference. Don’t skip this step.
  - One very important caution: exploratory analysis should inform how you conduct your inference, not what hypotheses you test. Generating a hypothesis after seeing a pattern in the data, then testing that same hypothesis on the same data, inflates the risk of a false positive. This is often referred to as p-hacking or fishing. Hypotheses should always either be established before looking at the data, or tested on a fresh, independent sample.
- Scope of inference matters. Results from this sample can be generalized to North Carolina births in 2004, but not necessarily to other states, years, or populations.
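The caution about testing hypotheses suggested by the data can be illustrated with a small simulation. The sketch below uses only base R and made-up data, not the nc sample. The number of simulated studies, the number of candidate variables, the group size, and the seed are all arbitrary assumptions chosen for illustration. Every comparison is between two groups drawn from the same distribution, so any "significant" result is a false positive.

```r
set.seed(17)  # arbitrary seed so the sketch is reproducible

n_sims  <- 2000  # simulated "studies" (assumed value)
n_vars  <- 10    # candidate outcome variables inspected per study (assumed value)
n_group <- 50    # observations per group (assumed value)

# For each study, run a two-sample t-test on every candidate variable.
# Keep the p-value of the first variable (a hypothesis fixed in advance)
# and the smallest p-value (the comparison that "looked most interesting").
p_summary <- replicate(n_sims, {
  p_values <- replicate(n_vars, {
    group_a <- rnorm(n_group)
    group_b <- rnorm(n_group)
    t.test(group_a, group_b)$p.value
  })
  c(first = p_values[1], best = min(p_values))
})

# False-positive rate at the 5% level under each strategy.
prespecified_rate  <- mean(p_summary["first", ] < 0.05)
cherry_picked_rate <- mean(p_summary["best", ] < 0.05)

c(prespecified = prespecified_rate, cherry_picked = cherry_picked_rate)
```

The pre-specified rate should land near the nominal 5%. The cherry-picked rate should be far higher: with ten independent tests, theory gives about \(1 - 0.95^{10} \approx 0.40\).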
The next activity introduces ANOVA — a method for comparing means across more than two groups simultaneously. ANOVA extends the two-sample \(t\)-test framework and introduces a new test statistic (the \(F\)-statistic) along with a new distribution (the \(F\)-distribution). The core ideas of hypothesis testing remain unchanged though.