Topic 7: Discrete Distributions Lab

function ok_checkbox(response, n) {
  if (!response || response.length === 0) 
    return html`<span style="color:purple">You haven't answered yet.</span>`;
  if (response.toString() === n) 
    return html`<span style="color:green">Correct ✓</span>`;
  return html`<span style="color:red">Not Yet! ✗</span>`;
}

About

In this lab, we learn how to simulate data and compare simulated data to observed data in order to informally test a hypothesis.

License

This is a derivative of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. The original lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.

Discrete Distributions and Simulation

The previous two activities discussed probability distributions. In this seventh activity you’ll engage in computer simulation to test a theory by building and comparing distributions. Computer simulations can be really helpful because they allow us to repeat tasks very quickly (much more quickly than we could in real life) and they also allow us to model tasks that may be impossible or unethical to perform in real life. Simulation is a truly powerful tool for helping us understand processes in our world.

Kobe Bryant and the Hot Hand

Basketball players who make several baskets in succession are often described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which contradicts the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence against this belief and concluded that successive shots are independent events (link to paper). This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.

We do not expect to resolve this controversy today. However, in this activity we’ll apply one approach to answering questions like this. The goals for the remainder of this activity are to:

think about the effects of independent and dependent events,
learn how to simulate shooting streaks in R, and
compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.

Check Your Understanding: Binomial Assumptions

Which of the following assumptions must be made if we are to simulate a set of basketball shots as a binomial experiment?

viewof q1 = Inputs.checkbox(
  new Map([
    ["Every shot has two possible outcomes — a 'hit' or a 'miss'", 1],
    ["If a player 'hits' a shot, then the likelihood that they make their next shot goes up", 2],
    ["Shots are independent — that is, the hot hand phenomenon does not exist", 3],
    ["We must simulate at least 50 shots", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q1_selected") ?? "[]") ?? []}
);

{
  localStorage.setItem("q1_selected", JSON.stringify(q1));
  localStorage.setItem("q1_correct", "1,3");
  localStorage.setItem("q1_result", (!q1 || q1.length === 0) ? "unattempted" : (q1.toString() === "1,3" ? "correct" : "incorrect"));
}

ok_checkbox(q1.toString(), "1,3");

Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 NBA Finals earned him the title of Most Valuable Player, and many spectators commented on how he appeared to show a hot hand. Let’s look at the first several rows of data from those games.

vs	game	quarter	time	description	shot
ORL	1	1	9:47	Kobe Bryant makes 4-foot two point shot	H
ORL	1	1	9:07	Kobe Bryant misses jumper	M
ORL	1	1	8:11	Kobe Bryant misses 7-foot jumper	M
ORL	1	1	7:41	Kobe Bryant makes 16-foot jumper (Derek Fisher assists)	H
ORL	1	1	7:03	Kobe Bryant makes driving layup	H
ORL	1	1	6:01	Kobe Bryant misses jumper	M

In this data frame, every row records a shot taken by Kobe Bryant. If he hit the shot (made a basket), a hit (H) is recorded in the column named shot; otherwise, a miss (M) is recorded.

Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks. For this lab, we define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.

For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:

\[\textrm{H M | M | H H M | M | M | M}\]

To verify this, run the command kobe_basket |> pull(shot) |> head(n = 9) in the code block below.

Within the nine shot attempts, there are six streaks, which are separated by a “|” above. Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).
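
To make the counting rule concrete, here is a minimal base R sketch that recovers those six lengths from the nine shots above. The helper name streak_lengths is hypothetical and is not the {openintro} function. It only counts streaks that are closed by a miss, so its handling of the start and end of a sequence may differ from calc_streak().

```r
# First-quarter sequence from Game 1, copied from the example above
shots <- c("H", "M", "M", "H", "H", "M", "M", "M", "M")

# Hypothetical helper (not calc_streak): every miss closes a streak, so the
# streak length is the number of hits since the previous miss
streak_lengths <- function(shots) {
  miss_positions <- which(shots == "M")
  # Position of the previous miss; 0 stands for "before the first shot"
  previous_miss <- c(0, head(miss_positions, -1))
  miss_positions - previous_miss - 1
}

streak_lengths(shots)
```

Each value counts the hits between one miss and the next, which is why a miss that directly follows another miss contributes a streak of length zero.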

Check Your Understanding: Streak Lengths I

What does a streak length of 1 mean — how many hits and misses are in a streak of 1?

mutable ok_response = (response, n) => { return html`Loading...` };
viewof q2 = Inputs.radio(
  new Map([
    ["One hit", 1],
    ["One hit and one miss", 2],
    ["One hit and any number of misses", 3],
    ["Just a miss", 4],
    ["Any number of misses", 5]
  ]),
  {value: JSON.parse(localStorage.getItem("q2_selected") ?? "null")}
);

{
  localStorage.setItem("q2_selected", JSON.stringify(q2));
  localStorage.setItem("q2_correct", "2");
  localStorage.setItem("q2_result", q2 === null ? "unattempted" : (q2 == 2 ? "correct" : "incorrect"));
}

ok_response(q2, "2");

Check Your Understanding: Streak Lengths II

What does a streak of length 0 mean?

viewof q3 = Inputs.radio(
  new Map([
    ["One hit", 1],
    ["One hit and one miss", 2],
    ["One hit and any number of misses", 3],
    ["Just a single miss", 4],
    ["Any number of misses", 5]
  ]),
  {value: JSON.parse(localStorage.getItem("q3_selected") ?? "null")}
);

{
  localStorage.setItem("q3_selected", JSON.stringify(q3));
  localStorage.setItem("q3_correct", "4");
  localStorage.setItem("q3_result", q3 === null ? "unattempted" : (q3 == 4 ? "correct" : "incorrect"));
}

ok_response(q3, "4");

In this activity you have access to the function calc_streak() from the {openintro} package. This function calculates the lengths of all shooting streaks. Store the result of kobe_basket |> pull(shot) |> calc_streak() in a new object called kobe_streak, and then call kobe_streak to see the streak lengths printed out.

Hint 6 (Solved)

Start with the kobe_basket data frame, and then extract the values from the shot column, and then calculate the streak lengths with calc_streak().

kobe_streak <- kobe_basket |>
  pull(shot) |>
  calc_streak()

The result from calc_streak() is a data frame with a single column named length. Now use ggplot() to make a bar plot of kobe_streak.

Hint 6 (Solved)

Start with the kobe_streak data frame, and then take out a canvas to plot on. Next add a bar graph layer with length as the x aesthetic.

kobe_streak |>
  ggplot() +
  geom_bar(aes(x = length))


kobe_streak |> 
  ggplot() + 
  geom_bar(aes(x = length))

Note that instead of making a histogram, we chose to make a bar plot from a table of the streak data. A bar plot is preferable here since our variable is discrete — counts — instead of continuous.

Check Your Understanding: Kobe’s Streak Distribution

Describe the distribution of Kobe’s streaks from the 2009 NBA Finals.

viewof q4 = Inputs.radio(
  new Map([
    ["The distribution is uniform", 1],
    ["The distribution is symmetric", 2],
    ["The distribution is skewed left", 3],
    ["The distribution is skewed right", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q4_selected") ?? "null")}
);

{
  localStorage.setItem("q4_selected", JSON.stringify(q4));
  localStorage.setItem("q4_correct", "4");
  localStorage.setItem("q4_result", q4 === null ? "unattempted" : (q4 == 4 ? "correct" : "incorrect"));
}

ok_response(q4, "4");

Check Your Understanding: Typical Streak Length

What was Kobe’s typical streak length?

viewof q5 = Inputs.radio(
  new Map([
    ["0", 1],
    ["1", 2],
    ["2", 3],
    ["3", 4],
    ["4", 5],
    ["5", 6]
  ]),
  {value: JSON.parse(localStorage.getItem("q5_selected") ?? "null")}
);

{
  localStorage.setItem("q5_selected", JSON.stringify(q5));
  localStorage.setItem("q5_correct", "1");
  localStorage.setItem("q5_result", q5 === null ? "unattempted" : (q5 == 1 ? "correct" : "incorrect"));
}

ok_response(q5, "1");

Check Your Understanding: Longest Streak

How long was Kobe’s longest streak?

viewof q6 = Inputs.radio(
  new Map([
    ["0", 1],
    ["1", 2],
    ["2", 3],
    ["3", 4],
    ["4", 5],
    ["5", 6]
  ]),
  {value: JSON.parse(localStorage.getItem("q6_selected") ?? "null")}
);

{
  localStorage.setItem("q6_selected", JSON.stringify(q6));
  localStorage.setItem("q6_correct", "5");
  localStorage.setItem("q6_result", q6 === null ? "unattempted" : (q6 == 5 ? "correct" : "incorrect"));
}

ok_response(q6, "5");

Compared to What?

We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had hot hands? What can we compare them to?

To answer these questions, let’s return to the idea of independence. Two processes are independent if the outcome of one process doesn’t affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.

A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot.

Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (his shooting percentage) is about 45%, or in probability notation,

\[\mathbb{P}\left[\textrm{shot 1 = H}\right] = 0.45\]

If he makes the first shot and has a hot hand (not independent shots), then the probability that he makes his second shot would go up to, say, 60%,

\[\mathbb{P}\left[\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}\right] = 0.60\]

As a result of these increased probabilities, you’d expect Kobe to have longer streaks. Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hit his first shot, the probability that he makes the second is still 0.45.

\[\mathbb{P}\left[\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}\right] = 0.45\]

In other words, making the first shot did nothing to affect the probability that he’d make his second shot. If Kobe’s shots are independent, then he’d have the same probability of hitting every shot regardless of his past shots: 45%.
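
The two models can be compared with a little arithmetic. The sketch below uses the 45% and 60% figures from the text. The variable names are illustrative, and the streak formula assumes an unlimited sequence of independent shots, so it ignores the fact that a real game eventually ends.

```r
p_hit <- 0.45

# Probability of making the first two shots under each model
p_two_independent <- p_hit * 0.45  # second shot unaffected by the first
p_two_hot_hand    <- p_hit * 0.60  # second shot boosted after a hit

c(independent = p_two_independent, hot_hand = p_two_hot_hand)

# Under independence, a streak of length k is k hits followed by one miss
k <- 0:4
p_streak <- p_hit^k * (1 - p_hit)
round(p_streak, 3)
```

Under independence, the streak probabilities shrink by a factor of 0.45 with each additional hit, which is why long streaks should be fairly rare for an independent shooter.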

Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has hot hands? We can compare his streak lengths to someone without hot hands: an independent shooter.

Simulations in R

We’ll come back to Kobe shortly, but for now let’s think about how we might simulate an independent shooter. While we don’t have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R. In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. For example, the following code block is set up to simulate a single flip of a fair coin. Run it a few times and see what happens — then edit the code to simulate 10 flips at a time, or a hundred, or maybe even a hundred thousand! The computer can flip 100,000 coins much faster than you and I could ever hope to.

The first argument c("heads", "tails") represents the outcomes and can be thought of as a hat with two slips of paper in it: one slip says heads and the other says tails. The function sample draws one slip from the hat and tells us if it was a head or a tail.
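
For readers following along outside the interactive page, a single flip might look like the sketch below. The object name coin_outcomes is an illustrative choice rather than something defined by the lab.

```r
# The "hat": two slips of paper, one per outcome
coin_outcomes <- c("heads", "tails")

# Draw one slip; rerunning this line can give a different result each time
one_flip <- sample(coin_outcomes, size = 1, replace = TRUE)
one_flip
```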

As you’ve discovered, if you wanted to simulate flipping a fair coin 100 times, you could either run the sample() function 100 times or, more simply, adjust the size argument, which governs how many flips to simulate. The replace = TRUE argument indicates that we put the slip of paper back in the hat before drawing again. Use the code block below to simulate 100 flips of a fair coin, storing the results in an object called sim_fair_coin. Then use table(sim_fair_coin) to compute a frequency table for the outcomes. What do you notice?

Hint 3

Once you’ve simulated the flips, use table(sim_fair_coin) to compute your frequency table.

sim_fair_coin <- sample(c("heads", "tails"), 
                        size = ___, 
                        replace = TRUE)
table(sim_fair_coin)

Since there are only two elements in the outcomes, the probability that we “flip” a coin and it lands heads is 0.5. Say we’re trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called prob, which provides a vector of two probability weights. Execute the following code cell to simulate this unfair coin.

The argument prob = c(0.2, 0.8) indicates that for the two elements in the outcomes vector, we want to select the first one, heads, with probability 0.2 and the second one, tails, with probability 0.8. Another way of thinking about this is to imagine the outcome space as a bag of 10 chips, where 2 chips are labeled “head” and 8 chips are labeled “tail”. Therefore at each draw, the probability of drawing a chip that says “head” is 20%, and “tail” is 80%.
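
A minimal sketch of such a weighted simulation is shown below. The object name sim_unfair_coin and the choice of 100 flips are assumptions made for illustration.

```r
# Weights are matched to outcomes by position: heads gets 0.2, tails gets 0.8
sim_unfair_coin <- sample(c("heads", "tails"),
                          size = 100,
                          replace = TRUE,
                          prob = c(0.2, 0.8))

# Frequency table; heads should appear roughly 20 times, but results vary
table(sim_unfair_coin)
```

Because the draws are random, the count of heads will rarely be exactly 20. It should settle closer to 20% of the flips as size grows.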

Simulating the Independent Shooter

Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50% we type:

sim_basket <- sample(c("H", "M"), size = 1, replace = TRUE)

To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots. Use the code block below to help you find the number of shots we need to simulate.

Check Your Understanding: Number of Shots

How many shots did Kobe Bryant take during the 2009 NBA Finals?

viewof q7 = Inputs.radio(
  new Map([
    ["0", 1],
    ["10", 2],
    ["123", 3],
    ["133", 4],
    ["76", 5],
    ["6", 6],
    ["45", 7]
  ]),
  {value: JSON.parse(localStorage.getItem("q7_selected") ?? "null")}
);

{
  localStorage.setItem("q7_selected", JSON.stringify(q7));
  localStorage.setItem("q7_correct", "4");
  localStorage.setItem("q7_result", q7 === null ? "unattempted" : (q7 == 4 ? "correct" : "incorrect"));
}

ok_response(q7, "4");

Check Your Understanding: Shooting Percentage

What should the shooting percentage be for our simulated random independent shooter?

viewof q8 = Inputs.radio(
  new Map([
    ["50%", 1],
    ["45%", 2],
    ["60%", 3],
    ["40%", 4]
  ]),
  {value: JSON.parse(localStorage.getItem("q8_selected") ?? "null")}
);

{
  localStorage.setItem("q8_selected", JSON.stringify(q8));
  localStorage.setItem("q8_correct", "2");
  localStorage.setItem("q8_result", q8 === null ? "unattempted" : (q8 == 2 ? "correct" : "incorrect"));
}

ok_response(q8, "2");

Use your answers from the previous questions, our discussion about simulated coin flipping, and the code block below to simulate an independent shooter that we can compare to Kobe Bryant. Store the results of your simulated shots in an object called sim_basket.

Hint 3

For Kobe Bryant, the probability of hitting a shot is 0.45 and missing is 0.55. Use this to fill in the prob argument.

sim_basket <- sample(c("H", "M"), 
                     ___, 
                     prob = c(0.45, 0.55),
                     ___)

Hint 4

How many shots should we simulate? Fill in the size argument, and remember to set replace = TRUE.

sim_basket <- sample(c("H", "M"), 
                     size = ___, 
                     prob = c(0.45, 0.55),
                     replace = ___)


sim_basket <- sample(c("H", "M"), 
                     size = 133, 
                     prob = c(0.45, 0.55), 
                     replace = TRUE)

Note that we’ve named the new vector sim_basket, the same name that we gave to the previous vector reflecting a shooting percentage of 50%. In this situation, R overwrites the old object with the new one, so always make sure that you don’t need the information in an old vector before reassigning its name.

With the results of the simulation saved as sim_basket, we have the data necessary to compare Kobe to our independent shooter. Execute the following code cell to see Kobe’s actual shots alongside our simulated shooter’s shots. I’ve created a fixed version of sim_basket using a random seed so that we will all see the same simulated data for comparison.

Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%. We know that our simulated data is from a shooter that has independent shots — that is, we know the simulated shooter does not have a hot hand.
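
Fixing a random seed is what makes a simulation repeatable. The sketch below illustrates the idea with set.seed(). The seed value 2009 and the object names are hypothetical and are not the ones used to build the fixed data for this activity.

```r
# Setting the same seed before each call reproduces the same "random" draws
set.seed(2009)  # hypothetical seed value
first_run <- sample(c("H", "M"), size = 133,
                    prob = c(0.45, 0.55), replace = TRUE)

set.seed(2009)  # reset to the same seed
second_run <- sample(c("H", "M"), size = 133,
                     prob = c(0.45, 0.55), replace = TRUE)

identical(first_run, second_run)  # TRUE: both runs match shot for shot
```

Without the calls to set.seed(), the two vectors would almost certainly differ.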

Use the calc_streak() function in the code block below to compute the streak lengths of sim_basket and store the result in an object called sim_streak.


sim_streak <- calc_streak(sim_basket)

Okay, now you’ve got everything you need in order to help you discover whether there is evidence to suggest that Kobe Bryant had a hot hand in the 2009 NBA Finals. Use the code blocks below each question to help you answer.

1. Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the simulated player’s longest streak of baskets in 133 shots?

Hint 5 (Solved)

Start with the sim_streak data frame, and then take out a canvas to draw a plot. Now add a bar plot layer to your plot, mapping the streak length to the x aesthetic.

sim_streak |>
  ggplot() + 
  geom_bar(aes(x = length))

2. If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above? Exactly the same? Somewhat similar? Totally different? Explain your reasoning.

Hint 2

Next, calculate the streak lengths and store them.

sim_basket <- sample(c("H", "M"), 
                     size = 133, 
                     prob = c(0.45, 0.55), 
                     replace = TRUE)

sim_streak <- calc_streak(sim_basket)

Hint 3

Now, plot the results.

sim_basket <- sample(c("H", "M"), 
                     size = 133, 
                     prob = c(0.45, 0.55), 
                     replace = TRUE)

sim_streak <- calc_streak(sim_basket)

sim_streak |>
  ggplot() + 
  geom_bar(aes(x = length))

Hint 4 (Solved)

Run the code several times to see what happens. Discuss what you see and compare it to Kobe’s distribution of streaks.

sim_basket <- sample(c("H", "M"), 
                     size = 133, 
                     prob = c(0.45, 0.55), 
                     replace = TRUE)

sim_streak <- calc_streak(sim_basket)

sim_streak |>
  ggplot() + 
  geom_bar(aes(x = length))

3. How does Kobe Bryant’s distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence Kobe’s shooting patterns exhibit the hot hand phenomenon?

Hint 3

Run the code below several times. Does it look like Kobe Bryant has a streak distribution that differs from our independent shooters’?

sim_basket <- sample(c("H", "M"), 
                     size = 133, 
                     prob = c(0.45, 0.55), 
                     replace = TRUE)

sim_streak <- calc_streak(sim_basket)

p1 <- kobe_streak |>
  ggplot() + 
  geom_bar(aes(x = length)) + 
  labs(
    title = "Kobe"
  )

p2 <- sim_streak |>
  ggplot() + 
  geom_bar(aes(x = length)) + 
  labs(
    title = "Simulated Independent Shooter"
  )

# Placing two plots side by side with + assumes the {patchwork} package is loaded
p1 + p2

Submit

If you are part of a course with an instructor who is grading your work on these activities, please copy and submit both of the hashes below using the method your instructor has requested.

Question Hash

The hash below encodes your responses to the multiple choice and checkbox questions in this activity.

function buildQuestionResults() {
  return {
    notebook: "Topic 7: Discrete Distributions Lab",
    type: "questions",
    timestamp: new Date().toISOString(),
    questions: {
      q1_binomial_assumptions: {
        selected: q1,
        correct_answer: "1,3",
        result: (!q1 || q1.length === 0) ? "unattempted" : (q1.toString() === "1,3" ? "correct" : "incorrect")
      },
      q2_streak_length_1: {
        selected: q2,
        correct_answer: "2",
        result: q2 === null ? "unattempted" : (q2 == 2 ? "correct" : "incorrect")
      },
      q3_streak_length_0: {
        selected: q3,
        correct_answer: "4",
        result: q3 === null ? "unattempted" : (q3 == 4 ? "correct" : "incorrect")
      },
      q4_streak_distribution: {
        selected: q4,
        correct_answer: "4",
        result: q4 === null ? "unattempted" : (q4 == 4 ? "correct" : "incorrect")
      },
      q5_typical_streak: {
        selected: q5,
        correct_answer: "1",
        result: q5 === null ? "unattempted" : (q5 == 1 ? "correct" : "incorrect")
      },
      q6_longest_streak: {
        selected: q6,
        correct_answer: "5",
        result: q6 === null ? "unattempted" : (q6 == 5 ? "correct" : "incorrect")
      },
      q7_num_shots: {
        selected: q7,
        correct_answer: "4",
        result: q7 === null ? "unattempted" : (q7 == 4 ? "correct" : "incorrect")
      },
      q8_shooting_percentage: {
        selected: q8,
        correct_answer: "2",
        result: q8 === null ? "unattempted" : (q8 == 2 ? "correct" : "incorrect")
      }
    }
  };
}

function toBase64(str) {
  return btoa(unescape(encodeURIComponent(str)));
}

question_hash = {
  q1; q2; q3; q4; q5; q6; q7; q8;
  return toBase64(JSON.stringify(buildQuestionResults()));
}

html`<div style="font-family: monospace; font-size: 0.85em; background: #f5f5f5; padding: 12px; border-radius: 6px; word-break: break-all; border: 1px solid #ddd; user-select: all; cursor: pointer;" onclick="navigator.clipboard.writeText(this.innerText)">
  ${question_hash}
</div>
<p style="margin-top: 8px; font-size: 0.9em; color: #555;">
  Click the box to copy to clipboard.
</p>`

Exercise Hash

Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.

You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.

Summary

In this lab you explored the power of computer simulation as a tool for informally examining hypotheses about real-world processes. Below are the key takeaways and a heads-up about what to expect moving forward.

Main Takeaways

Simulation allows us to model independent processes. By using R’s sample() function with appropriate outcome spaces and probabilities, we can generate data from a process we fully control — one where we know shots are independent.
The prob argument in sample() lets us tune our simulation to match real-world conditions. Setting prob = c(0.45, 0.55) ensures our simulated shooter has the same shooting percentage as Kobe Bryant, making the comparison meaningful.
Streak length is a useful summary of shooting patterns. The calc_streak() function converts a sequence of hits and misses into a distribution of streak lengths, which gives us a way to compare Kobe’s actual performance to what we’d expect from an independent shooter.
Comparing observed data to simulated data is a form of informal hypothesis testing. If Kobe’s streak distribution looks similar to those of our simulated independent shooters, we have little evidence of a hot hand. If it looks dramatically different, we might reconsider.
Simulation results vary from run to run. Each simulated shooter produces a slightly different streak distribution, which is a reminder that random processes are inherently variable — we should look for consistent patterns across many simulations rather than drawing conclusions from a single run.
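
One way to look across many simulations, rather than a single run, is sketched below in base R. The helper longest_streak, the choice of 1,000 repetitions, and the object names are illustrative assumptions. The value 4 is the longest streak recorded for Kobe in this lab's answer key.

```r
# Hypothetical helper: length of the longest run of hits in a shot sequence
longest_streak <- function(shots) {
  runs <- rle(shots)                          # run-length encoding of H/M
  hit_runs <- runs$lengths[runs$values == "H"]
  if (length(hit_runs) == 0) 0 else max(hit_runs)
}

# Simulate 1,000 independent shooters, each taking 133 shots at 45%
sim_longest <- replicate(1000, {
  sim_basket <- sample(c("H", "M"), size = 133,
                       prob = c(0.45, 0.55), replace = TRUE)
  longest_streak(sim_basket)
})

# How often does an independent shooter match or beat Kobe's longest streak?
observed_longest <- 4
table(sim_longest)
mean(sim_longest >= observed_longest)
```

If a large share of the simulated shooters reach a longest streak of at least 4, then a streak of that length is not, by itself, strong evidence of a hot hand.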

Looking Ahead

In this lab you used simulation informally — generating data from a known process and comparing it to observed data by eye. While we won’t continue using simulation as our primary tool, the underlying logic carries forward throughout the rest of the course. In the coming activities we’ll formalize this process of comparing what we observe to what we’d expect if some assumption were true, and we’ll develop rigorous statistical methods for deciding when the difference between the two is large enough to matter. That process — hypothesis testing — is one of the most important ideas in all of statistics, and everything you did in this lab was quietly building intuition for it.