Single Populations and Comparisons Between Two Groups
November 14, 2024
Reminders on statistical inference
Additional uncertainty with inference on means
The \(t\)-distributions
Examples
We use statistical inference to make or test claims about population parameters which we cannot measure directly
Confidence intervals provide a range of plausible values for a population parameter
\[\left(\begin{array}{c}\text{point}\\ \text{estimate}\end{array}\right) \pm \left(\begin{array}{c}\text{critical}\\ \text{value}\end{array}\right)\left(\begin{array}{c}\text{standard}\\ \text{error}\end{array}\right)\]
Hypothesis tests provide a framework for testing claims about a population parameter, with major steps including…
Inference On... | "Test" Name |
---|---|
One Binary Categorical Variable | One Sample z |
Association Between Two Binary Categorical Variables | Two Sample z |
One MultiClass Categorical Variable | Chi-Squared GOF |
Associations Between Two MultiClass Categorical Variables | Chi-Squared Independence |
One Numerical Variable | One Sample t |
Association Between a Numerical Variable and a Binary Categorical Variable | Two Sample t |
Association Between a Numerical Variable and a MultiClass Categorical Variable | |
Association Between a Numerical Variable and a Single Other Numerical Variable | |
Association Between a Numerical Variable and Many Other Variables | |
Association Between a Categorical Variable and Many Other Variables | ✘ |
When doing inference, we utilize the sampling distribution
With categorical data, our sampling distribution has a mean and standard error which depend only on a single parameter – the population proportion(s) \(p\)/\(p_1\)/\(p_2\)
For numerical data, the mean of the sampling distribution is the population mean \(\mu\) and the standard error is given by \(\sigma/\sqrt{n}\)
With means, we have additional uncertainty associated with estimating the sampling distribution because we must approximate two population parameters (\(\mu\) and \(\sigma\)) from our sample data, rather than just one
Because of this additional uncertainty, using the normal distribution is too optimistic and we use a “penalized” distribution instead
The distribution we use to account for this additional uncertainty is a \(t\)-distribution, one member of a family of distributions
These distributions were first used by a brewmaster at Guinness who was trying to estimate alcohol content of beer using small samples
These distributions have “heavier tails” than the normal distribution, reflecting greater variability when using a sample standard deviation \(s\) to estimate the population standard deviation \(\sigma\)
As sample size grows, the \(t\)-distribution becomes closer to the normal distribution, since more data provides a more stable estimate of \(\sigma\)
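The convergence toward the normal distribution can be checked numerically. The sketch below compares the cutoff that leaves 2.5% of the area in the upper tail for several \(t\)-distributions against the corresponding normal cutoff. The particular degrees of freedom are arbitrary choices for illustration, not values from these slides.

```r
# Degrees of freedom to compare (arbitrary illustrative values)
dfs <- c(2, 5, 10, 30, 100, 1000)

# Cutoff leaving 2.5% of the area in the upper tail of each t-distribution
t_crit <- qt(0.975, df = dfs)

# The same cutoff for the standard normal distribution (about 1.96)
z_crit <- qnorm(0.975)

# The gap between the t and normal cutoffs shrinks as df grows
comparison <- data.frame(
  df     = dfs,
  t_crit = round(t_crit, 3),
  gap    = round(t_crit - z_crit, 3)
)
print(comparison)
```

Every \(t\) cutoff is larger than the normal cutoff, which is the "penalty" for estimating \(\sigma\) with \(s\), and the penalty becomes negligible for large samples.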
Each \(t\)-distribution is characterized by its degrees of freedom (\(\texttt{df}\)), which typically equals \(n - 1\) for a single sample
This accounts for the number of observations in our sample that are free to vary after using one value to estimate the mean
pt(-1.96, df = 11)
\(\approx\) 0.0379072
1 - pt(2.1, df = 26)
\(\approx\) 0.0227907
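As a brief sketch of how these tail areas relate to one another: `pt()` returns the area to the left of a cutoff, so an upper-tail area can be found either by subtracting from 1 or with the `lower.tail = FALSE` argument. The two-tailed calculation at the end is an added illustration, not something taken from the examples above.

```r
# Area to the left of -1.96 under the t-distribution with 11 df
lower_area <- pt(-1.96, df = 11)

# Area to the right of 2.1 under the t-distribution with 26 df,
# computed two equivalent ways
upper_a <- 1 - pt(2.1, df = 26)
upper_b <- pt(2.1, df = 26, lower.tail = FALSE)

# t-distributions are symmetric about 0, so the area to the right
# of 2.1 matches the area to the left of -2.1
upper_c <- pt(-2.1, df = 26)

# Two-tailed area: both tails beyond |2.1| (illustrative addition)
two_tailed <- 2 * pt(-abs(2.1), df = 26)

print(c(lower_area = lower_area, upper_a = upper_a, two_tailed = two_tailed))
```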
qt(0.90, df = 17)
\(\approx\) 1.33
Each tail contains half of the 5% of area that remains, so each tail contains 2.5% of the area
qt(1 - 0.025, df = 13)
\(\approx\) 2.16
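The step from a confidence level to the probability passed to `qt()` can be wrapped in a small helper. This is only a sketch: the function name `t_critical` is hypothetical and introduced here for illustration.

```r
# Hypothetical helper: critical value for a two-sided confidence interval.
# The area outside the interval (alpha) is split evenly between the tails.
t_critical <- function(conf_level, df) {
  alpha <- 1 - conf_level
  qt(1 - alpha / 2, df = df)
}

# A 95% interval with 13 degrees of freedom; same as qt(1 - 0.025, df = 13)
t_critical(0.95, df = 13)
```

Splitting `alpha` in half is what distinguishes a confidence-interval critical value from a one-sided cutoff such as `qt(0.90, df = 17)`, where all of the remaining area sits in a single tail.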
Scenario: A car manufacturer claims that the average annual repair cost for their new model is no more than $400. To validate this claim, a consumer group gathers a sample of 30 car owners and finds an average annual repair cost of $450 with a standard deviation of $150. Does this sample provide evidence that the car manufacturer’s claim is incorrect?
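A minimal sketch of the test statistic for this scenario, assuming the hypotheses are \(H_0: \mu = 400\) against \(H_a: \mu > 400\) (an upper-tailed test, inferred from the question being asked). The variable names are illustrative.

```r
# Summary statistics from the scenario
x_bar <- 450   # sample mean repair cost
mu_0  <- 400   # value claimed by the manufacturer
s     <- 150   # sample standard deviation
n     <- 30    # sample size

# Standard error of the sample mean, using s in place of sigma
se <- s / sqrt(n)

# Test statistic: how many standard errors x_bar sits above mu_0
t_stat <- (x_bar - mu_0) / se

# Degrees of freedom for a single sample
df <- n - 1

round(t_stat, 2)  # about 1.83
```

The \(p\)-value is then the area to the right of this test statistic under the \(t\)-distribution with `df` degrees of freedom.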
(1 - pt(1.83, df = 29))
\(\approx\) 0.0387735
Scenario: Researchers are interested in understanding how social media usage varies between different educational stages. They surveyed a sample of 14 high school and 21 college students to measure the average time they spend on social media platforms daily. The average time spent for high school students was 2.8 hours per day with a standard deviation of 0.9 hours. For college students, the average was 3.5 hours per day with a standard deviation of 1.2 hours. Conduct a test at the 10% level of significance to determine whether this sample provides evidence to suggest that college students spend more time per day on social media than high school students.
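A sketch of the test statistic for this comparison, assuming the difference is taken as high school minus college, so that the alternative \(H_a: \mu_{HS} - \mu_{C} < 0\) corresponds to a lower-tailed test. The degrees of freedom use the conservative choice of the smaller sample size minus one; software such as `t.test()` would use a different (Welch) approximation. Variable names are illustrative.

```r
# Summary statistics from the scenario
x_hs  <- 2.8; s_hs  <- 0.9; n_hs  <- 14   # high school students
x_col <- 3.5; s_col <- 1.2; n_col <- 21   # college students

# Standard error for a difference in two independent sample means
se <- sqrt(s_hs^2 / n_hs + s_col^2 / n_col)

# Test statistic for the difference (high school minus college)
t_stat <- (x_hs - x_col) / se

# Conservative degrees of freedom: smaller sample size minus one
df <- min(n_hs, n_col) - 1

round(t_stat, 2)  # about -1.97
```

The \(p\)-value is the area to the left of this test statistic, which is then compared against the 10% significance level.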
pt(-1.97, df = 13)
\(\approx\) 0.0352605
Scenario: City planners are investigating the average commute time for residents to work. They survey 17 residents to estimate the average commute time in minutes. The average commute time in the sample was 35 minutes with a standard deviation of 10 minutes. Find a 95% confidence interval for the average commute time.
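One way to carry out this interval calculation is sketched below, following the (point estimate) \(\pm\) (critical value)(standard error) structure. The variable names are illustrative and the printed interval is rounded.

```r
# Summary statistics from the scenario
x_bar <- 35   # sample mean commute time (minutes)
s     <- 10   # sample standard deviation (minutes)
n     <- 17   # sample size

# Standard error of the sample mean
se <- s / sqrt(n)

# Critical value for 95% confidence with n - 1 = 16 degrees of freedom
t_star <- qt(1 - 0.025, df = n - 1)

# Margin of error and interval bounds
margin <- t_star * se
ci <- c(lower = x_bar - margin, upper = x_bar + margin)

round(ci, 2)  # roughly 29.86 to 40.14 minutes
```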
Scenario: A city council is evaluating the efficiency of two emergency services (ambulance and fire department) based on their average response times. They collect data from 40 recent incidents handled by the ambulance service and 35 recent incidents that the Fire Department responded to. The average response time for the ambulance service was 8 minutes with a standard deviation of 2 minutes, while the average response time for the fire department was 10 minutes with a standard deviation of 2.5 minutes. Build a 98% confidence interval for the difference in average response times.
\[\bar{x}_F - \bar{x}_A = 10 - 8 = 2\]
qt(1 - 0.01, df = 34)
\(\approx\) 2.44
\[\to \left\{\begin{array}{lcl} 2 - 1.2878 & = & 0.7122\\ 2 + 1.2878 & = & 3.2878\end{array}\right.\]
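The intermediate steps behind this interval can be sketched as follows. The standard error formula for two independent samples and the conservative degrees of freedom (smaller sample size minus one) are assumed here; they are consistent with the `df = 34` used above. Variable names are illustrative.

```r
# Summary statistics from the scenario
x_a <- 8;  s_a <- 2;   n_a <- 40   # ambulance service
x_f <- 10; s_f <- 2.5; n_f <- 35   # fire department

# Point estimate: difference in sample means (fire minus ambulance)
point_est <- x_f - x_a

# Standard error for a difference in two independent sample means
se <- sqrt(s_a^2 / n_a + s_f^2 / n_f)

# Critical value for 98% confidence: 1% of the area in each tail
t_star <- qt(1 - 0.01, df = min(n_a, n_f) - 1)

# Margin of error using the critical value rounded to 2.44
margin_rounded <- round(t_star, 2) * se

ci <- c(lower = point_est - margin_rounded,
        upper = point_est + margin_rounded)
round(ci, 4)
```

Using the unrounded critical value instead of 2.44 shifts each bound by less than 0.001 minutes.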
The following slides contain several examples to try
As usual, there are more than we’ll be able to complete during this class meeting
The examples include a mixture of scenarios involving a single population mean or comparison between two population means
The examples also include a mixture of questions which are best answered with confidence intervals or best answered with hypothesis tests
You’ll need to determine which scenario you are in and which tools to apply
Scenario: A coffee company is interested in the average amount of coffee consumed by office workers per day. They survey a sample of 22 office workers to estimate this average and observe a mean consumption of 3.2 cups with a standard deviation of 1.1 cups. Construct a 90% confidence interval for the average number of cups of coffee consumed per day by office workers.
Scenario: A university researcher investigates whether students in a demanding academic program sleep less on average than those in a less rigorous program. They survey two samples of students, including 45 from a very rigorous program and 50 from a less rigorous program. The average sleep time for students in the very rigorous program was 6.2 hours with a standard deviation of 1.3 hours. The students in the less rigorous program averaged 7.5 hours of sleep with a standard deviation of 1 hour. Construct a 95% confidence interval for the difference in average sleep.
Scenario: Ornithologists studying a specific species of migratory bird believe the average duration of one leg of their migration flight is about 6 hours. They observe a sample of 25 birds this season and find a mean flight duration of 5.5 hours with a standard deviation of 1.2 hours. Does this sample provide evidence to suggest that the duration differs from 6 hours?
Scenario: A company has introduced a new wellness program aimed at increasing employees’ daily step count. Prior to the program, the average daily step count was 6,000 steps. After one month, a sample of 50 employees reports a mean step count of 6,400 steps with a standard deviation of 1,200 steps. Find a 90% confidence interval for the average daily step count after implementing this new wellness program.
Scenario: Public health officials are examining differences in hydration habits between two cities. They survey 50 residents from each city to determine how much water they consume on average each day to promote health awareness initiatives. The mean water consumption in City A was 2.7 liters, with a standard deviation of 1.2 liters. The average water consumption in City B was 3.1 liters, with a standard deviation of 1.4 liters. Conduct a test to determine whether the sample provides evidence for a difference in average water consumption between the two cities.
Scenario: A career services office at a university wants to estimate the average starting salary of its recent graduates. They collect data from a sample of 34 graduates who landed jobs after graduation and found an average salary of $52,000 with a standard deviation of $8,000. Construct a 95% confidence interval for the average starting salary of recent graduates.