Topic 18: Inference for Comparing Many Group Means (ANOVA)

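// Render feedback for a checkbox question.
// response: comma-joined string of selected values; n: comma-joined correct answer (e.g. "1,2,3").
function ok_checkbox(response, n) {
  // Nothing selected yet
  if (!response || response.length === 0) 
    return html`<span style="color:purple">You haven't answered yet.</span>`;
  // Selection matches the answer key exactly
  if (response.toString() === n) 
    return html`<span style="color:green">Correct ✓</span>`;
  return html`<span style="color:red">Not Yet! ✗</span>`;
}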

About

This activity introduces Analysis of Variance (ANOVA) for comparing means across more than two groups simultaneously. We’ll test whether average petal width differs across three varieties of iris using the famous iris dataset collected by Anderson and published by Fisher. We’ll also introduce the Tukey Honest Significant Differences test for pairwise follow-up comparisons.

ANalysis Of VAriance (ANOVA)

In this activity, we consider a method for comparing means from multiple groups. We’ll use a very famous dataset containing measurements on three varieties of iris: setosa, versicolor, and virginica. Our goal is to determine whether the iris data provides significant evidence of a difference in average petal widths across the three varieties.

Three varieties of iris: setosa, versicolor, and virginica

Notice that none of the tools we’ve encountered so far can be applied directly here. Our hypothesis testing strategies have been limited to comparing numerical measures across one or two populations. We could apply three separate one-versus-one tests (versicolor vs. setosa, versicolor vs. virginica, and setosa vs. virginica), but the probability of an erroneous conclusion grows with each additional test. At a 5% significance level, and treating the three tests as independent, the probability of making at least one Type I error across three comparisons is \(1 - (0.95)^3 \approx 14.26\%\), nearly three times the nominal 5%. A better option is ANalysis Of VAriance: an ANOVA test.
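
The 14.26% figure can be reproduced in a few lines of R. This is a minimal sketch that treats the three pairwise tests as independent, which is a simplifying assumption; the names alpha, n_tests, and fwer are illustrative and not part of the activity.

```r
# Per-test significance level used in the text
alpha <- 0.05

# Number of one-versus-one comparisons among 3 groups
n_tests <- choose(3, 2)

# Chance of at least one Type I error, assuming independent tests
fwer <- 1 - (1 - alpha)^n_tests
round(100 * fwer, 2)
```

Changing the 3 in choose(3, 2) to a larger number of groups shows how quickly this probability climbs as more pairwise tests are run.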

Watch the video below from Dr. Çetinkaya-Rundel introducing ANOVA.

Dr. Çetinkaya-Rundel’s video walked through the details of ANOVA. The key ideas to carry forward are:

ANOVA tests whether there is significant evidence of a difference in means across three or more populations.
The test statistic follows an \(F\)-distribution rather than the normal or \(t\)-distribution.
The \(p\)-value is the area under the \(F\)-distribution to the right of the computed \(F\)-statistic (the upper tail).
The interpretation of the \(p\)-value is the same as always — the probability of observing a sample at least as extreme as ours, assuming the null hypothesis is true.
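
The upper-tail area described above can be computed with R's pf() function. The sketch below uses made-up values (an \(F\)-statistic of 4.2 with 2 and 27 degrees of freedom) purely for illustration; they do not come from the iris data.

```r
# Hypothetical values chosen only for illustration
f_stat    <- 4.2  # observed F-statistic
df_groups <- 2    # numerator (between-group) degrees of freedom
df_error  <- 27   # denominator (within-group) degrees of freedom

# Upper-tail area of the F-distribution beyond f_stat
p_value <- pf(f_stat, df1 = df_groups, df2 = df_error, lower.tail = FALSE)
p_value
```

Setting lower.tail = FALSE asks for the area to the right of f_stat, which is the tail an ANOVA test uses.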

Use the code block below to explore the iris dataset as you answer the questions that follow.
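
The interactive code block is not reproduced in this text. As a rough sketch of the kind of exploration it supports, the base R commands below inspect the built-in iris data frame; which commands to run is a suggestion, not the activity's prescribed code.

```r
# iris ships with base R in the datasets package
data(iris)

# Structure: variable names and types
str(iris)

# Number of observations overall and per variety
nrow(iris)
table(iris$Species)

# Mean petal width within each variety
aggregate(Petal.Width ~ Species, data = iris, FUN = mean)
```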

Check Your Understanding: ANOVA I

Which of the following are the correct hypotheses for testing whether average petal widths differ across the three iris varieties?

mutable ok_response = (response, n) => { return html`Loading...` };
viewof q1 = Inputs.radio(
  new Map([
    ["H₀: μ_setosa = μ_versicolor = μ_virginica; Hₐ: μ_setosa ≠ μ_versicolor ≠ μ_virginica", 1],
    ["H₀: μ_setosa = μ_versicolor = μ_virginica = 0; Hₐ: μ_setosa ≠ μ_versicolor ≠ μ_virginica ≠ 0", 2],
    ["H₀: μ_setosa = μ_versicolor = μ_virginica; Hₐ: At least one variety has a different average petal width.", 3],
    ["H₀: μ_setosa = μ_versicolor = μ_virginica; Hₐ: All varieties have different average petal widths.", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q1_selected") ?? "null")}
);

{
  localStorage.setItem("q1_selected", JSON.stringify(q1));
  localStorage.setItem("q1_correct", "3");
  localStorage.setItem("q1_result", q1 === null ? "unattempted" : (q1 == 3 ? "correct" : "incorrect"));
}

ok_response(q1, "3");

Check Your Understanding: ANOVA II

How many groups are involved in this hypothesis test?

viewof q2 = Inputs.radio(
  new Map([
    ["1", 1],
    ["2", 2],
    ["3", 3],
    ["4", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q2_selected") ?? "null")}
);

{
  localStorage.setItem("q2_selected", JSON.stringify(q2));
  localStorage.setItem("q2_correct", "3");
  localStorage.setItem("q2_result", q2 === null ? "unattempted" : (q2 == 3 ? "correct" : "incorrect"));
}

ok_response(q2, "3");

Check Your Understanding: ANOVA III

How many degrees of freedom are due to groups?

viewof q3 = Inputs.radio(
  new Map([
    ["1", 1],
    ["2", 2],
    ["3", 3],
    ["4", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q3_selected") ?? "null")}
);

{
  localStorage.setItem("q3_selected", JSON.stringify(q3));
  localStorage.setItem("q3_correct", "2");
  localStorage.setItem("q3_result", q3 === null ? "unattempted" : (q3 == 2 ? "correct" : "incorrect"));
}

ok_response(q3, "2");

Check Your Understanding: ANOVA IV

Use the code block above to find the number of observations in iris. What are the total degrees of freedom?

viewof q4 = Inputs.radio(
  new Map([
    ["99", 1],
    ["100", 2],
    ["149", 3],
    ["150", 4],
    ["2", 5],
    ["3", 6]
  ]),
  {value: JSON.parse(localStorage.getItem("q4_selected") ?? "null")}
);

{
  localStorage.setItem("q4_selected", JSON.stringify(q4));
  localStorage.setItem("q4_correct", "3");
  localStorage.setItem("q4_result", q4 === null ? "unattempted" : (q4 == 3 ? "correct" : "incorrect"));
}

ok_response(q4, "3");

Running an ANOVA Test in R

Dr. Çetinkaya-Rundel mentions in the video that ANOVA computations are tedious and prone to error — so we use software. In R, the aov() function runs an ANOVA test. Store the result of running aov(Petal.Width ~ Species, data = iris) in an object called ANOVAtable, then view the results by passing that object to summary().


ANOVAtable <- aov(Petal.Width ~ Species, data = iris)
summary(ANOVAtable)

Use the output to answer the following questions.

Check Your Understanding: ANOVA Table I

How are the Mean Sq values related to the other values in the ANOVA table?

viewof q5 = Inputs.radio(
  new Map([
    ["Mean Sq is obtained by dividing Sum Sq (sum of squared deviations) by Df (degrees of freedom).", 1],
    ["Mean Sq is the F value divided by Sum Sq.", 2],
    ["Mean Sq values are unrelated to the other entries and are output as standalone meaningful values.", 3],
    ["Mean Sq values are the product of the F value and the p-value.", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q5_selected") ?? "null")}
);

{
  localStorage.setItem("q5_selected", JSON.stringify(q5));
  localStorage.setItem("q5_correct", "1");
  localStorage.setItem("q5_result", q5 === null ? "unattempted" : (q5 == 1 ? "correct" : "incorrect"));
}

ok_response(q5, "1");

Check Your Understanding: ANOVA Table II

What is the test statistic associated with this ANOVA test?

viewof q6 = Inputs.radio(
  new Map([
    ["2e-16", 1],
    ["40.21", 2],
    ["960", 3],
    ["2 and 147", 4],
    ["There is no test statistic associated with an ANOVA test.", 5]
  ]),
  {value: JSON.parse(localStorage.getItem("q6_selected") ?? "null")}
);

{
  localStorage.setItem("q6_selected", JSON.stringify(q6));
  localStorage.setItem("q6_correct", "3");
  localStorage.setItem("q6_result", q6 === null ? "unattempted" : (q6 == 3 ? "correct" : "incorrect"));
}

ok_response(q6, "3");

Check Your Understanding: ANOVA Table III

What is the \(p\)-value associated with this ANOVA test?

viewof q7 = Inputs.radio(
  new Map([
    ["0.05", 1],
    ["0.10", 2],
    ["0.04", 3],
    ["A number smaller than 0.0000000000000002", 4],
    ["-14", 5]
  ]),
  {value: JSON.parse(localStorage.getItem("q7_selected") ?? "null")}
);

{
  localStorage.setItem("q7_selected", JSON.stringify(q7));
  localStorage.setItem("q7_correct", "4");
  localStorage.setItem("q7_result", q7 === null ? "unattempted" : (q7 == 4 ? "correct" : "incorrect"));
}

ok_response(q7, "4");

Check Your Understanding: ANOVA Table IV

What is the conclusion of the test?

viewof q8 = Inputs.radio(
  new Map([
    ["Since p < α, we reject the null hypothesis and accept the alternative.", 1],
    ["Since p < α, we fail to reject the null hypothesis.", 2],
    ["Since p < α, we accept the null hypothesis.", 3],
    ["Since p ≥ α, we fail to reject the null hypothesis.", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q8_selected") ?? "null")}
);

{
  localStorage.setItem("q8_selected", JSON.stringify(q8));
  localStorage.setItem("q8_correct", "1");
  localStorage.setItem("q8_result", q8 === null ? "unattempted" : (q8 == 1 ? "correct" : "incorrect"));
}

ok_response(q8, "1");

Check Your Understanding: ANOVA Table V

What does the conclusion mean in the context of our original question?

viewof q9 = Inputs.radio(
  new Map([
    ["The iris data provides significant evidence to suggest that the average petal width is not the same across all three varieties.", 1],
    ["The iris data provides significant evidence to suggest that the average petal width is the same across all three varieties.", 2],
    ["The iris data does not provide significant evidence to suggest that at least one variety has a different average petal width.", 3],
    ["The iris data provides significant evidence to suggest that all three varieties have different average petal widths.", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q9_selected") ?? "null")}
);

{
  localStorage.setItem("q9_selected", JSON.stringify(q9));
  localStorage.setItem("q9_correct", "1");
  localStorage.setItem("q9_result", q9 === null ? "unattempted" : (q9 == 1 ? "correct" : "incorrect"));
}

ok_response(q9, "1");
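
The relationships among the columns of the ANOVA table can be checked by recomputing them by hand. The sketch below assumes the table is pulled out of summary() as a data frame; the names fit, tab, mean_sq, f_stat, and p_value are illustrative.

```r
fit <- aov(Petal.Width ~ Species, data = iris)
tab <- summary(fit)[[1]]  # the ANOVA table as a data frame

# Mean Sq is Sum Sq divided by Df, row by row
mean_sq <- tab[["Sum Sq"]] / tab[["Df"]]

# F is the between-group mean square over the residual mean square
f_stat <- mean_sq[1] / mean_sq[2]

# p-value: upper-tail area of the F-distribution beyond f_stat
p_value <- pf(f_stat, tab[["Df"]][1], tab[["Df"]][2], lower.tail = FALSE)

# Compare the hand-computed values with those reported by summary()
all.equal(mean_sq, tab[["Mean Sq"]])
all.equal(f_stat, tab[["F value"]][1])
```

R prints the \(p\)-value as <2e-16 because it falls below the display threshold, while p_value holds the actual computed number.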

The Tukey Test

The ANOVA result tells us that at least one variety has a different average petal width — but it doesn’t tell us which one. The Tukey Honest Significant Differences (HSD) Test can fill that gap. When ANOVA returns a significant result, the Tukey Test runs a collection of pairwise comparisons with adjusted \(p\)-values that account for the multiple comparisons being made.

Since our ANOVA was significant, we’ll follow up with a Tukey Test. In R, we pass the result of aov() directly to TukeyHSD(). Run the code block below and use the output to answer the question that follows.
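
The interactive code block is not reproduced in this text. A minimal sketch of the call described above follows; it refits the model so that it runs on its own, and the name tukey_result is illustrative.

```r
# Refit the ANOVA model so this block is self-contained
ANOVAtable <- aov(Petal.Width ~ Species, data = iris)

# Pairwise comparisons with adjusted p-values (95% family-wise confidence by default)
tukey_result <- TukeyHSD(ANOVAtable)
tukey_result
```

Each row of the output is one pair of varieties, with the estimated difference (diff), the confidence interval bounds (lwr and upr), and the adjusted \(p\)-value (p adj).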

Check Your Understanding: Tukey Test

Which pairs of iris varieties have significantly different average petal widths? Select all that apply.

viewof q10 = Inputs.checkbox(
  new Map([
    ["Versicolor and Setosa", 1],
    ["Virginica and Setosa", 2],
    ["Virginica and Versicolor", 3]
  ]),
  {value: JSON.parse(localStorage.getItem("q10_selected") ?? "[]") ?? []}
);

{
  localStorage.setItem("q10_selected", JSON.stringify(q10));
  localStorage.setItem("q10_correct", "1,2,3");
  localStorage.setItem("q10_result", (!q10 || q10.length === 0) ? "unattempted" : (q10.toString() === "1,2,3" ? "correct" : "incorrect"));
}

ok_checkbox(q10.toString(), "1,2,3");

You may have used the adjusted \(p\)-value column to make your decisions, or you may have noticed that none of the confidence intervals contain 0 — either way leads to the same conclusion. We can also visualize these confidence intervals directly. Run the code block below to produce a plot of the pairwise differences and their confidence intervals.
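
The interactive code block is not reproduced in this text. One way to draw such a plot, assuming base R graphics, is to pass the TukeyHSD() result to plot(); the object names are illustrative.

```r
# Refit the model and rerun the Tukey test so this block is self-contained
ANOVAtable <- aov(Petal.Width ~ Species, data = iris)
tukey_result <- TukeyHSD(ANOVAtable)

# One horizontal interval per pair of varieties
plot(tukey_result)
```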

The dashed vertical line at zero represents no difference. Since all three confidence intervals fall entirely to one side of zero, we have visual confirmation that each pair of varieties differs significantly in average petal width.

Which do you find more useful for communicating the results — the table, the plot, or a combination of both?

Submit

If you are part of a course with an instructor who is grading your work on these activities, please copy and submit both of the hashes below using the method your instructor has requested.

Question Hash

The hash below encodes your responses to the multiple choice and checkbox questions in this activity.

function buildQuestionResults() {
  return {
    notebook: "Topic 18: Inference for Comparing Many Group Means (ANOVA)",
    type: "questions",
    timestamp: new Date().toISOString(),
    questions: {
      q1_anova_hypotheses: {
        selected: q1,
        correct_answer: "3",
        result: q1 === null ? "unattempted" : (q1 == 3 ? "correct" : "incorrect")
      },
      q2_anova_num_groups: {
        selected: q2,
        correct_answer: "3",
        result: q2 === null ? "unattempted" : (q2 == 3 ? "correct" : "incorrect")
      },
      q3_anova_group_df: {
        selected: q3,
        correct_answer: "2",
        result: q3 === null ? "unattempted" : (q3 == 2 ? "correct" : "incorrect")
      },
      q4_anova_total_df: {
        selected: q4,
        correct_answer: "3",
        result: q4 === null ? "unattempted" : (q4 == 3 ? "correct" : "incorrect")
      },
      q5_anova_mean_sq: {
        selected: q5,
        correct_answer: "1",
        result: q5 === null ? "unattempted" : (q5 == 1 ? "correct" : "incorrect")
      },
      q6_anova_test_stat: {
        selected: q6,
        correct_answer: "3",
        result: q6 === null ? "unattempted" : (q6 == 3 ? "correct" : "incorrect")
      },
      q7_anova_pvalue: {
        selected: q7,
        correct_answer: "4",
        result: q7 === null ? "unattempted" : (q7 == 4 ? "correct" : "incorrect")
      },
      q8_anova_conclusion: {
        selected: q8,
        correct_answer: "1",
        result: q8 === null ? "unattempted" : (q8 == 1 ? "correct" : "incorrect")
      },
      q9_anova_conclusion_context: {
        selected: q9,
        correct_answer: "1",
        result: q9 === null ? "unattempted" : (q9 == 1 ? "correct" : "incorrect")
      },
      q10_tukey_significant_pairs: {
        selected: q10,
        correct_answer: "1,2,3",
        result: (!q10 || q10.length === 0) ? "unattempted" : (q10.toString() === "1,2,3" ? "correct" : "incorrect")
      }
    }
  };
}

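function toBase64(str) {
  // Encode as UTF-8 bytes first so btoa() only ever sees Latin-1 characters
  // (same output as the deprecated unescape(encodeURIComponent(str)) idiom).
  const bytes = new TextEncoder().encode(str);
  return btoa(Array.from(bytes, (b) => String.fromCharCode(b)).join(""));
}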

question_hash = {
  q1; q2; q3; q4; q5; q6; q7; q8; q9; q10;
  return toBase64(JSON.stringify(buildQuestionResults()));
}

html`<div style="font-family: monospace; font-size: 0.85em; background: #f5f5f5; padding: 12px; border-radius: 6px; word-break: break-all; border: 1px solid #ddd; user-select: all; cursor: pointer;" onclick="navigator.clipboard.writeText(this.innerText)">
  ${question_hash}
</div>
<p style="margin-top: 8px; font-size: 0.9em; color: #555;">
  Click the box to copy to clipboard.
</p>`

Exercise Hash

Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.

You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.

Summary

Main Takeaways

ANOVA tests whether there is significant evidence of a difference in means across three or more populations. The null hypothesis states that all group means are equal; the alternative states that at least one differs.
The \(F\)-statistic is the test statistic for ANOVA. It measures the ratio of variation between groups to variation within groups — a large \(F\) value suggests the group means are more spread out than would be expected by chance alone.
The \(p\)-value from an ANOVA test is the upper tail area of the \(F\)-distribution beyond the observed \(F\)-statistic. Its interpretation is exactly the same as in our earlier hypothesis tests.
Running ANOVA in R uses aov(response ~ group, data = df) followed by summary() to view the table.
ANOVA only tells you that at least one group mean differs — it does not identify which pair(s). A Tukey HSD Test, run via TukeyHSD(), performs all pairwise comparisons with adjusted \(p\)-values and should be used as a follow-up when ANOVA is significant.
Why not just run multiple \(t\)-tests? Each additional test inflates the probability of at least one false positive. With three groups and \(\alpha = 0.05\), running all three pairwise \(t\)-tests gives roughly a \(1 - 0.95^3 \approx 14\%\) chance of at least one Type I error (treating the tests as independent). ANOVA followed by the Tukey adjustment keeps that risk near the nominal 5%.

Looking Ahead

The final activity in this series is a regression lab. We’ll shift from comparing group means to modeling the relationship between two numerical variables, using one to predict the other. Linear regression extends the inference framework you’ve built throughout this course into a new and widely used context.