Topic 7: Discrete Distributions Lab
In this lab, we learn how to simulate data and compare simulated data to observed data in order to informally test a hypothesis.
This is a derivative of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. The original lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.
Discrete Distributions and Simulation
The previous two activities discussed probability distributions. In this seventh activity you’ll use computer simulation to test a theory by building and comparing distributions. Simulations are helpful because they let us repeat a task far more quickly than we could in real life, and they let us model tasks that would be impossible or unethical to perform in real life. Simulation is a powerful tool for understanding processes in our world.
Kobe Bryant and the Hot Hand
Basketball players who make several baskets in succession are often described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which contradicts the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and concluded that successive shots are independent events. This paper started a controversy that continues to this day, as you can see by searching for “hot hand basketball”.
We do not expect to resolve this controversy today. However, in this activity we’ll apply one approach to answering questions like this. The goals for the remainder of this activity are to:
- think about the effects of independent and dependent events,
- learn how to simulate shooting streaks in R, and
- compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.
Which of the following assumptions must be made if we are to simulate a set of basketball shots as a binomial experiment?
Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 NBA Finals earned him the title Most Valuable Player and many spectators commented on how he appeared to show a hot hand. Let’s look at the first several rows of data from those games.
| vs | game | quarter | time | description | shot |
|---|---|---|---|---|---|
| ORL | 1 | 1 | 9:47 | Kobe Bryant makes 4-foot two point shot | H |
| ORL | 1 | 1 | 9:07 | Kobe Bryant misses jumper | M |
| ORL | 1 | 1 | 8:11 | Kobe Bryant misses 7-foot jumper | M |
| ORL | 1 | 1 | 7:41 | Kobe Bryant makes 16-foot jumper (Derek Fisher assists) | H |
| ORL | 1 | 1 | 7:03 | Kobe Bryant makes driving layup | H |
| ORL | 1 | 1 | 6:01 | Kobe Bryant misses jumper | M |
In this data frame, every row records a shot taken by Kobe Bryant. If he hit the shot (made a basket), a hit (H) is recorded in the column named shot, otherwise a miss (M) is recorded.
Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks. For this lab, we define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.
For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:
\[\textrm{H M | M | H H M | M | M | M}\]
To verify this, run the command kobe_basket |> pull(shot) |> head(n = 9) in the code block below.
The pull() function extracts a column from a data frame as a vector, and head(n = 9) returns the first 9 elements.
kobe_basket |> pull(shot) |> head(n = 9)
Within the nine shot attempts, there are six streaks, which are separated by a “|” above. Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).
What does a streak length of 1 mean — how many hits and misses are in a streak of 1?
What does a streak of length 0 mean?
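The streak definition above can be checked by hand on the nine first-quarter shots. The following is a minimal base-R sketch; `streak_lengths()` is a hypothetical helper written only for illustration and is not the `calc_streak()` implementation from {openintro}, which may treat the end of a sequence differently.

```r
# Hypothetical helper: every miss closes a streak, and the streak length is
# the number of hits since the previous miss (or since the first shot).
streak_lengths <- function(shots) {
  miss_pos <- which(shots == "M")
  lengths <- diff(c(0, miss_pos)) - 1
  # Assumption: hits after the final miss are counted as one last streak.
  trailing <- length(shots) - max(c(0, miss_pos))
  if (trailing > 0) lengths <- c(lengths, trailing)
  lengths
}

# The nine first-quarter shots from Game 1, as listed in the text
first_quarter <- c("H", "M", "M", "H", "H", "M", "M", "M", "M")
streak_lengths(first_quarter)  # 1 0 2 0 0 0
```

A streak of length 0 arises whenever a miss immediately follows another miss (or opens the sequence), which is why four of the six streaks here are zero.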
In this activity you have access to the function calc_streak() from the {openintro} package. This function calculates the lengths of all shooting streaks. Store the result of kobe_basket |> pull(shot) |> calc_streak() in a new object called kobe_streak, and then call kobe_streak to see the streak lengths printed out.
Use the assignment operator <- to store the result in kobe_streak.
kobe_streak <- ___
Start with the kobe_basket data frame, then extract the values from the shot column with pull(), and then calculate the streak lengths with calc_streak().
kobe_streak <- kobe_basket |>
  pull(shot) |>
  calc_streak()
The result from calc_streak() is a data frame with a single column named length. Now use ggplot to make a barplot of kobe_streak.
Before trying to build the plot, simply type and run kobe_streak to verify the contents of the kobe_streak variable.
Start with the kobe_streak data frame, and then take out a canvas to plot on with ggplot(). Next add a bar graph layer with geom_bar(), using length as the x aesthetic. Remember that layers are added to a ggplot with +, not with the pipe.
kobe_streak |>
  ggplot() +
  geom_bar(aes(x = length))
Note that instead of making a histogram, we chose to make a bar plot of the streak data. A bar plot is preferable here since our variable is discrete (a count) rather than continuous.
Describe the distribution of Kobe’s streaks from the 2009 NBA Finals.
What was Kobe’s typical streak length?
How long was Kobe’s longest streak?
Compared to What?
We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had a hot hand? What can we compare them to?
To answer these questions, let’s return to the idea of independence. Two processes are independent if the outcome of one process doesn’t affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot.
Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (his shooting percentage) is about 45%, or in probability notation,
\[\mathbb{P}\left[\textrm{shot 1 = H}\right] = 0.45\]
If he makes the first shot and has a hot hand (not independent shots), then the probability that he makes his second shot would go up to, say, 60%,
\[\mathbb{P}\left[\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}\right] = 0.60\]
As a result of these increased probabilities, you’d expect Kobe to have longer streaks. Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hit his first shot, the probability that he makes the second is still 0.45.
\[\mathbb{P}\left[\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}\right] = 0.45\]
In other words, making the first shot did nothing to affect the probability that he’d make his second shot. If Kobe’s shots are independent, then he’d have the same probability of hitting every shot regardless of his past shots: 45%.
Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has a hot hand? We can compare his streak lengths to those of someone without a hot hand: an independent shooter.
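To make the two models concrete, here is a minimal sketch that simulates both kinds of shooter and estimates the probability of a hit given that the previous shot was a hit. The helpers `simulate_shooter()` and `hit_rate_after_hit()` are hypothetical names introduced for illustration, the 0.60 figure is the illustrative value used above rather than an estimate from data, and the seed is arbitrary.

```r
set.seed(2009)  # arbitrary seed, only so the sketch is reproducible
n_shots <- 100000

# Hypothetical helper: p_base applies after a miss (and on the first shot),
# p_after_hit applies to any shot that follows a hit.
simulate_shooter <- function(n, p_base, p_after_hit) {
  shots <- character(n)
  prev_hit <- FALSE
  for (i in seq_len(n)) {
    p <- if (prev_hit) p_after_hit else p_base
    prev_hit <- runif(1) < p
    shots[i] <- if (prev_hit) "H" else "M"
  }
  shots
}

# Proportion of hits among the shots that immediately follow a hit
hit_rate_after_hit <- function(shots) {
  n <- length(shots)
  after_hit <- shots[-n] == "H"
  mean(shots[-1][after_hit] == "H")
}

independent <- simulate_shooter(n_shots, p_base = 0.45, p_after_hit = 0.45)
hot_hand    <- simulate_shooter(n_shots, p_base = 0.45, p_after_hit = 0.60)

hit_rate_after_hit(independent)  # should be close to 0.45
hit_rate_after_hit(hot_hand)     # should be close to 0.60
```

With this many shots the two conditional hit rates separate clearly. With only a few games of real data the difference would be much harder to see, which is part of why the question is controversial.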
Simulations in R
We’ll come back to Kobe shortly, but for now let’s think about how we might simulate an independent shooter. While we don’t have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R. In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. For example, the following code block is set up to simulate a single flip of a fair coin. Run it a few times and see what happens — then edit the code to simulate 10 flips at a time, or a hundred, or maybe even a hundred thousand! The computer can flip 100,000 coins much faster than you and I could ever hope to.
To increase the number of simulated flips, edit the size parameter.
The first argument c("heads", "tails") represents the outcomes and can be thought of as a hat with two slips of paper in it: one slip says heads and the other says tails. The function sample draws one slip from the hat and tells us if it was a head or a tail.
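The code cell itself is not reproduced in this text. A call consistent with the description above might look like the following sketch, where the object names `outcomes`, `one_flip`, and `ten_flips` are chosen here for illustration.

```r
# The "hat" with two slips of paper
outcomes <- c("heads", "tails")

# Draw one slip: a single flip of a fair coin
one_flip <- sample(outcomes, size = 1, replace = TRUE)
one_flip

# Draw ten times, putting the slip back after each draw
ten_flips <- sample(outcomes, size = 10, replace = TRUE)
ten_flips
```

Because the draws are random, the output will generally change each time the code is run.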
As you’ve discovered, if you wanted to simulate flipping a fair coin 100 times, you could either run the sample() function 100 times or, more simply, adjust the size argument, which governs how many flips to simulate. (The replace = TRUE argument indicates that we put the slip of paper back in the hat before drawing again.) Use the code block below to simulate 100 flips of a fair coin, storing the results in an object called sim_fair_coin. Then use table(sim_fair_coin) to compute a frequency table for the outcomes. What do you notice?
Start with the code provided earlier to simulate your coin flips, but adjust the size argument.
Use the assignment operator (<-) to store the results in sim_fair_coin.
sim_fair_coin <- ___
Once you’ve simulated the flips, use table(sim_fair_coin) to compute your frequency table.
sim_fair_coin <- sample(c("heads", "tails"),
                        size = ___,
                        replace = TRUE)
table(sim_fair_coin)
Since there are only two elements in the outcomes, the probability that we “flip” a coin and it lands heads is 0.5. Say we’re trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called prob, which provides a vector of two probability weights. Execute the following code cell to simulate this unfair coin.
The argument prob = c(0.2, 0.8) indicates that for the two elements in the outcomes vector, we want to select the first one, heads, with probability 0.2 and the second one, tails, with probability 0.8. Another way of thinking about this is to imagine the outcome space as a bag of 10 chips, where 2 chips are labeled “head” and 8 chips are labeled “tail”. Therefore at each draw, the probability of drawing a chip that says “head” is 20%, and “tail” is 80%.
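The unfair-coin cell is not reproduced in this text. A sketch matching the description might look like this, where the name `sim_unfair_coin`, the number of flips, and the seed are choices made here for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Weights in prob line up with the outcomes: heads 0.2, tails 0.8
sim_unfair_coin <- sample(c("heads", "tails"),
                          size = 100000,
                          replace = TRUE,
                          prob = c(0.2, 0.8))

# Observed proportions; with this many flips they should sit near 0.2 and 0.8
prop.table(table(sim_unfair_coin))
```

With only 100 flips the observed proportion of heads can easily land a few percentage points away from 20%, so a larger size is used here to make the weights visible.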
Simulating the Independent Shooter
Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50% we type:
sim_basket <- sample(c("H", "M"), size = 1, replace = TRUE)
To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots. Use the code block below to help you find the number of shots we need to simulate.
The simulated shooter should take the same number of shots as Kobe took in the series.
Every row of the kobe_basket data frame corresponds to a shot taken by Kobe Bryant in the series.
There are several ways to find the number of rows in a data frame — try nrow(), dim(), or glimpse().
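As a small illustration of the row-counting functions, the sketch below rebuilds only the six rows shown in the table earlier (without the description column) as a stand-in data frame called `first_shots`, a name chosen here. The same calls applied to the full kobe_basket data frame give the number of shots in the series.

```r
# Stand-in data: the six rows displayed earlier, not the full data set
first_shots <- data.frame(
  vs      = rep("ORL", 6),
  game    = rep(1, 6),
  quarter = rep(1, 6),
  time    = c("9:47", "9:07", "8:11", "7:41", "7:03", "6:01"),
  shot    = c("H", "M", "M", "H", "H", "M")
)

nrow(first_shots)  # 6: one row per shot
dim(first_shots)   # 6 5: rows, then columns
```

glimpse() from {dplyr} reports the same row count along with a preview of each column.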
How many shots did Kobe Bryant take during the 2009 NBA Finals?
What should the shooting percentage be for our simulated random independent shooter?
Use your answers from the previous questions, our discussion about simulated coin flipping, and the code block below to simulate an independent shooter which we can compare to Kobe Bryant. Store the results of your simulated shots in an object called sim_basket.
Use the assignment operator (<-) to assign the outcome to sim_basket.
Use the sample() function to simulate the shooter.
sim_basket <- sample(c("H", "M"),
                     ___,
                     ___,
                     ___)
For Kobe Bryant, the probability of hitting a shot is 0.45 and missing is 0.55. Use this to fill in the prob argument. How many shots should we simulate? Fill in the size argument, and remember to set replace = TRUE.
sim_basket <- sample(c("H", "M"),
                     size = 133,
                     prob = c(0.45, 0.55),
                     replace = TRUE)
Note that we’ve named the new vector sim_basket, the same name that we gave to the previous vector reflecting a shooting percentage of 50%. In this situation, R overwrites the old object with the new one, so always make sure that you don’t need the information in an old vector before reassigning its name.
With the results of the simulation saved as sim_basket, we have the data necessary to compare Kobe to our independent shooter. Execute the following code cell to see Kobe’s actual shots alongside our simulated shooter’s shots. I’ve created a fixed version of sim_basket using a random seed so that we will all see the same simulated data for comparison.
Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%. We know that our simulated data is from a shooter that has independent shots — that is, we know the simulated shooter does not have a hot hand.
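The seed used for the fixed version is not given in this text. The sketch below shows the general idea with an arbitrary seed of 42, and checks the number of shots and the observed shooting percentage of the simulated shooter.

```r
set.seed(42)  # arbitrary seed; the lab's actual seed is not stated

sim_basket <- sample(c("H", "M"),
                     size = 133,
                     prob = c(0.45, 0.55),
                     replace = TRUE)

length(sim_basket)       # 133 simulated shots
mean(sim_basket == "H")  # observed hit rate; near 0.45 but rarely exactly 0.45
table(sim_basket)        # counts of hits and misses
```

Setting the same seed before the same call reproduces the same sequence of shots, which is how a simulation can be shared so that everyone sees identical data.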
Use the calc_streak() function in the code block below to compute the streak lengths of sim_basket and store the result in an object called sim_streak.
Use the calc_streak() function on sim_basket and store the result as sim_streak.
sim_streak <- ___
sim_streak <- calc_streak(sim_basket)
Now you have everything you need to investigate whether there is evidence that Kobe Bryant had a hot hand in the 2009 NBA Finals. Use the code blocks below each question to help you answer.
1. Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the simulated player’s longest streak of baskets in 133 shots?
One approach is to produce a barplot of sim_streak, just like you did earlier for kobe_streak, and compare the two distributions visually.
Start with the sim_streak data frame, and then take out a canvas to draw a plot with ggplot(). Now add a barplot layer to your plot with geom_bar(), mapping the streak length to the x aesthetic.
sim_streak |>
  ggplot() +
  geom_bar(aes(x = length))
2. If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning.
There’s no single correct answer here. The code block is here for you to actually try it out. Copy, paste, and execute the simulation code to create a new independent shooter. What do you observe?
Next, calculate the streak lengths and store them. Then plot the results.
sim_basket <- sample(c("H", "M"),
                     size = 133,
                     prob = c(0.45, 0.55),
                     replace = TRUE)
sim_streak <- calc_streak(sim_basket)
sim_streak |>
  ggplot() +
  geom_bar(aes(x = length))
Run the code several times to see what happens. Discuss what you see and compare it to Kobe’s distribution of streaks.
3. How does Kobe Bryant’s distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence Kobe’s shooting patterns exhibit the hot hand phenomenon?
There’s no single correct answer here either, but you might use ggplot() to compare Kobe’s actual streak length distribution to that of your simulated shooter side by side. As a reminder, your simulated shooter does not have a hot hand.
A shooter with a hot hand should have a distribution of streaks that is different from the simulated streak distributions you are generating. In particular, a shooter with a hot hand would have more long streaks of made shots.
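As a reference point that is not part of the original lab: under independence, a streak of length k means k hits followed by a miss, so its probability is 0.45^k × 0.55, which is a geometric distribution. The sketch below computes these probabilities with base R’s dgeom(), treating a miss as the "success" that ends the streak. It ignores the fact that the last streak in a finite series can be cut short.

```r
p_hit <- 0.45
k <- 0:6  # streak lengths to evaluate

# dgeom(k, prob) gives prob * (1 - prob)^k; here prob is the miss probability
p_streak <- dgeom(k, prob = 1 - p_hit)

# Same values computed directly from the formula, for comparison
p_manual <- (1 - p_hit) * p_hit^k

round(p_streak, 3)
all.equal(p_streak, p_manual)  # TRUE
```

More than half of an independent 45% shooter’s streaks are expected to have length 0, and long streaks become rare quickly. A hot-hand shooter would be expected to put more weight on the larger values of k.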
Run the code below several times. Does it look like Kobe Bryant has a streak distribution that differs from our independent shooters’?
sim_basket <- sample(c("H", "M"),
                     size = 133,
                     prob = c(0.45, 0.55),
                     replace = TRUE)
sim_streak <- calc_streak(sim_basket)
p1 <- kobe_streak |>
  ggplot() +
  geom_bar(aes(x = length)) +
  labs(title = "Kobe")
p2 <- sim_streak |>
  ggplot() +
  geom_bar(aes(x = length)) +
  labs(title = "Simulated Independent Shooter")
# Combining two plots with + places them side by side; this relies on the {patchwork} package being loaded
p1 + p2
Submit
If you are part of a course with an instructor who is grading your work on these activities, please copy and submit both of the hashes below using the method your instructor has requested.
The hash below encodes your responses to the multiple choice and checkbox questions in this activity.
Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.
You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.
Summary
In this lab you explored the power of computer simulation as a tool for informally examining hypotheses about real-world processes. Below are the key takeaways and a heads-up about what to expect moving forward.
- Simulation allows us to model independent processes. By using R’s sample() function with appropriate outcome spaces and probabilities, we can generate data from a process we fully control: one where we know shots are independent.
- The prob argument in sample() lets us tune our simulation to match real-world conditions. Setting prob = c(0.45, 0.55) ensures our simulated shooter has the same shooting percentage as Kobe Bryant, making the comparison meaningful.
- Streak length is a useful summary of shooting patterns. The calc_streak() function converts a sequence of hits and misses into a distribution of streak lengths, which gives us a way to compare Kobe’s actual performance to what we’d expect from an independent shooter.
- Comparing observed data to simulated data is a form of informal hypothesis testing. If Kobe’s streak distribution looks similar to those of our simulated independent shooters, we have little evidence of a hot hand. If it looks dramatically different, we might reconsider.
- Simulation results vary from run to run. Each simulated shooter produces a slightly different streak distribution, which is a reminder that random processes are inherently variable — we should look for consistent patterns across many simulations rather than drawing conclusions from a single run.
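The last takeaway can be sketched in a few lines of base R by repeating the independent-shooter simulation many times and recording the longest run of hits each time. The helper `longest_hit_run()`, the choice of 1,000 repetitions, and the seed are assumptions made here for illustration. The sketch uses rle() rather than calc_streak() so that it runs without any packages.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: length of the longest run of consecutive hits
longest_hit_run <- function(shots) {
  runs <- rle(shots)
  max(c(0, runs$lengths[runs$values == "H"]))
}

# Simulate 1,000 independent shooters, each taking 133 shots at 45%
longest_streaks <- replicate(1000, {
  sim <- sample(c("H", "M"), size = 133, prob = c(0.45, 0.55), replace = TRUE)
  longest_hit_run(sim)
})

# How often each "longest streak" value occurred across the simulations
table(longest_streaks)
```

Comparing an observed longest streak against a table like this gives a rough sense of whether it would be unusual for an independent shooter, which is the same logic the later hypothesis-testing activities formalize.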
In this lab you used simulation informally — generating data from a known process and comparing it to observed data by eye. While we won’t continue using simulation as our primary tool, the underlying logic carries forward throughout the rest of the course. In the coming activities we’ll formalize this process of comparing what we observe to what we’d expect if some assumption were true, and we’ll develop rigorous statistical methods for deciding when the difference between the two is large enough to matter. That process — hypothesis testing — is one of the most important ideas in all of statistics, and everything you did in this lab was quietly building intuition for it.