<- 1
alpha <- 1
beta #Fewer points results in more jagged pictures
<- 500
grid_pts
#Create prior distribution
<- tibble(
my_dists pi = seq(0, 1, length.out = grid_pts), #possible proportion values
prior_prob = dbeta(pi, alpha, beta) #prior probability
)
#Plot prior distribution
%>%
my_dists ggplot() +
geom_area(aes(x = pi, y = prior_prob),
fill = "purple", alpha = 0.4) +
geom_line(aes(x = pi, y = prior_prob),
linetype = "dashed") +
labs(
title = "Prior Distribution",
x = "π",
y = ""
+
) ylim(c(-0.2, 3))
From the Prior to the Posterior
(Biology; Chytrid Fungus in Frogs)
This activity is a sample of our first activity (From the Prior to the Posterior) with a biology context laid over the framework. This sample has been constructed with coauthor Dr. Katie Duryea, a member of the Biology faculty at Southern New Hampshire University. This sample activity provides an example of what versions of this activity from other domains/contexts may look like.
Goals and Objectives
Goals: The goals of this activity are as follows:
- Show students an alternative to the Frequentist approach to capturing a population parameter.
- Students discover the impact of choice of prior on their plausible range.
- Students discover the impact of amount of data on their plausible range.
Course Objectives: This activity would map to course-level objectives similar to the following. Note that this is not an exhaustive list, nor should it be interpreted that objectives must be phrased identically. This is to give an idea as to the wide variety of contexts this activity might be placed in.
- Students will evaluate a research question using appropriate statistical techniques
- Students will correctly identify the type of data they are working with
- Students will evaluate literature and/or prior research to generate hypotheses for a research question
- Students will learn about different statistical models and approaches
- Students will interpret coefficients from a statistical model
- Students will evaluate the underlying assumptions of a statistical approach
- Students will consider the ethical implications of statistical approaches
- Students will gather data using methodologies appropriate to the context
Biology Objectives:
- Students will examine the ecological impact of different pathogens, both native and introduced
- Students will learn about how different statistical approaches can be applied to biological data, and how to compare approaches
- Students will consider how different statistical approaches can help scientists to learn more about an ecological concept or conservation concern
Background Information
The following subsections outline the background information for this activity from both a statistics and domain-specific lens.
Data Analysis and Bayesian Thinking
There are many statistical tools which can be used to test hypotheses about population parameters. Broadly speaking, these tools fall into three categories:
- Classical/Frequentist methods
- Simulation-based methods
- Bayesian methods
Perhaps you’ve encountered Frequentist methods previously. These methods depend on distribution assumptions and the Central Limit Theorem (for example, “my data is approximately normally distributed”). In this notebook, we’ll introduce Bayesian methods. In particular, you’ll explore how your prior belief (controlled via your choice of prior distribution) and the strength of your observed data work together to produce updated beliefs (a posterior distribution).
In Bayesian inference, we approach our tasks with some prior belief about the value of our population parameter. This is natural, because it matches our lived experience as humans. We use that prior belief, in conjunction with our data, to produce an updated version of our beliefs. Again, likely matching our individual approaches to interacting with the world we live in and following what you might refer to as the scientific method.
In this interactive notebook, you’ll see the foundations of the Bayesian approach to inference on a population proportion – the proportion of frogs in New Hampshire which have Chytrid Fungal ingections – in action. You’ll explore how your prior belief (controlled via your choice of prior distribution) and the strength of your observed data work together to produce updated beliefs.
About Chytrid
Chytrid Fungus is an infectious fungal disease that can be fatal to amphibians and has caused some species to become extinct. Check out this video from Chris Egnoto to learn more about this threat to amphibian life.
Purpose
Let’s try to estimate the proportion of frogs in southern New Hampshire which are impacted by Chytrid Fungus.
Prior Assumptions
Perhaps today is our first time learning about Chytrid Fungus and frogs. We may have no expectation about what our population proportion (\(\pi\)) might be. In the Frequentist approach, we would let the data completely determine the inference we make about the proportion. In Bayesian inference we take a different approach, assigning a prior distribution over the population proportion we are looking for. Since today is the first we’ve heard of Chytrid Fungus, let’s set up a prior that indicates nearly all values of \(\pi\) are equally likely. (Note: The code to generate the distribution is shown below because you’ll have opportunities to make small updates to it in the activity on the next tab.)
The Beta-distribution we use here is commonly used with binomial data. It is determined by two shape parameters alpha
(\(\alpha\)) and beta
(\(\beta\)). We can think of \(\alpha\) as the number of previously observed successes and \(\beta\) as the number of previously observed failures. Choosing \(\alpha = 1\) and \(\beta = 1\), results in a prior that allows for both successes and failures (\(\pi \neq 0\) and \(\pi \neq 1\)) but which has no other certainty about the true value of \(\pi\).
Notice that the prior assumption we’ve made above (via the purple distribution) is that all proportions for Chytrid Fungus infected frogs are equally plausible. In the interactive activity on the next tab, we’ll explore how newly observed data updates that prior assumption to obtain a more informed posterior understanding of the proportion of frogs in New Hampshire infected by Chytrid Fungus.
We’ll start with a reminder of where our pre-reading left off.
Purpose
We are attempting to estimate the proportion of frogs in southern New Hampshire which are impacted by Chytrid Fungus.
Prior Assumptions
Perhaps this activity is our first time learning about Chytrid Fungus and frogs. In this case, we would have no expectation about what our population proportion (\(\pi\)) might be. In the Frequentist approach to parameter estimation, we would let the data completely determine the inference we make about the proportion. In Bayesian inference, we take a different approach and assign a prior distribution over the population proportion we are looking for. Assuming that this activity is the first time we’ve heard of Chytrid, we’ve set up a prior that indicates nearly all values of \(\pi\) are equally likely.
<- 1
alpha <- 1
beta #Fewer points results in more jagged pictures
<- 500
grid_pts
#Create prior distribution
<- tibble(
my_dists pi = seq(0, 1, length.out = grid_pts), #possible proportion values
prior_prob = dbeta(pi, alpha, beta) #prior probability
)
#Plot prior distribution
%>%
my_dists ggplot() +
geom_area(aes(x = pi, y = prior_prob),
fill = "purple", alpha = 0.4) +
geom_line(aes(x = pi, y = prior_prob),
linetype = "dashed") +
labs(
title = "Prior Distribution",
x = "π",
y = ""
+
) ylim(c(-0.2, 3))
The Beta-distribution we use here is commonly used with binomial data. It is determined by two shape parameters alpha
(\(\alpha\)) and beta
(\(\beta\)). We can think of \(\alpha\) as the number of previously observed successes and \(\beta\) as the number of previously observed failures. Choosing \(\alpha = 1\) and \(\beta = 1\), results in a prior that allows for both successes and failures (\(\pi \neq 0\) and \(\pi \neq 1\)) but which has no other certainty about the true value of \(\pi\).
Observed Data
Data was collected from 13 frogs in Pembroke, Chichester, and Manchester, NH. Data from those frogs is read in and processed below.
<- read_csv(
initial_frogs "https://raw.githubusercontent.com/mcduryea/Evolution-Simulations/refs/heads/main/chytrid_small.csv"
)
<- initial_frogs %>%
num_frogs nrow()
<- initial_frogs %>%
num_pos filter(chytrid == "y") %>%
nrow()
print(paste0("Of the ", num_frogs,
" frogs observed, the number positive for Chytrid Fungus was ",
".")) num_pos,
[1] "Of the 13 frogs observed, the number positive for Chytrid Fungus was 9."
Obtaining the Posterior
Now we’ll use our data to update the prior distribution and obtain the posterior distribution for our population proportion. We obtain the posterior distribution by multiplying the prior by the likelihood function and then dividing by a normalizing factor to ensure that the result is a probability density (that is, the total probability is 1). The likelihood measures the probability of observing our data at each possible value of the population proportion.
\[\underbrace{\mathbb{P}\left[\text{proportion} | \text{data}\right]}_{\text{posterior}} = \frac{\overbrace{\mathbb{P}\left[\text{data} | \text{proportion}\right]}^{\text{likelihood}}\cdot \overbrace{\mathbb{P}\left[\text{proportion}\right]}^{\text{prior}}}{\underbrace{\mathbb{P}\left[\text{data}\right]}_{\text{normalizing factor}}}\]
#Construct the posterior
<- my_dists %>%
my_dists mutate(
#Compute likelihood using binomial distribution
likelihood = choose(num_frogs, num_pos)*pi^num_pos*(1 - pi)^(num_frogs - num_pos),
#Compute posterior as likelihood*prior
post_prob = likelihood*prior_prob,
#Normalize posterior
post_prob_normalized = post_prob/(sum(post_prob)*1/grid_pts))
#Plot prior and posterior
%>%
my_dists ggplot() +
geom_area(aes(x = pi, y = prior_prob, fill = "prior"),
alpha = 0.4) +
geom_line(aes(x = pi, y = prior_prob),
linetype = "dashed") +
geom_area(aes(x = pi, y = post_prob_normalized, fill = "posterior"),
alpha = 0.7) +
geom_line(aes(x = pi, y = post_prob_normalized),
linetype = "dashed") +
labs(
title = "Prior and Posterior Distributions",
x = "π",
y = "",
fill = "Distribution"
+
) scale_fill_manual(values = c("prior" = "purple", "posterior" = "orange"))
Notice that, after seeing the data, our posterior estimate for the proportion of frogs impacted by Chytrid has been updated. It looks very unlikely that \(\pi\) is less than 0.25 (those population proportions seem incompatible with our observed data), and the most likely proportions are between 0.5 and 0.8.
Investigating Further
Now that you’ve seen one Bayesian analysis to estimate the proportion of frogs impacted by Chytrid, let’s reproduce the analysis and explore the impact of (i) choice of prior, (ii) strength of data, and (iii) the combination of data and prior.
In this section, you’ll have an opportunity to reproduce the analysis above, but with a different choice of prior distribution on \(\pi\).
In the 2019 collection year, Dr. Duryea noted that approximately 60% of frogs were impacted by Chytrid.
Recall that, in the Beta distribution, the two parameters alpha
(\(\alpha\)) and beta
(\(\beta\)) can be thought of as the number of prior successes and prior failures, respectively. Update the code chunk below to choose an alpha
and beta
that will result in the 60% estimate observed by Dr. Duryea in her research. Run the code to construct and visualize your prior.
How does this prior compare to our original prior? Use your understanding of alpha
and beta
to justify what you are seeing.
Now that you’ve built this new prior, we’ll have you re-run the code below to obtain the observed data from our small collection of 13 frogs from Steven’s Pond. Don’t make any changes to the code just yet.
Now run the code chunk below to use the observed data to calculate and plot the updated posterior.
How does this new posterior compare to the original posterior distribution we estimated? What effect has our choice of stronger prior had on the range of proportions you think our population parameter may live within? Does this result match your expectations?
In this section you’ll have an opportunity to reproduce the analysis above, but with different data sources for updating. In particular, you’ll explore the impact that the number of observations can have on the posterior distribution.
Dr. Duryea’s students have finished processing their most recent collection of samples. She has provided us with a larger collection of observed data. Let’s see the impact that this larger data set has on updating the prior distribution.
The code chunk below is set to create a weakly informative prior that assumes a population proportion of Chytrid Fungus infection of 60%, backed by a total of 10 observations. Run the code to construct and visualize this prior.
Now that we have this prior, run the code cell below to obtain the larger collection of data recently processed by Dr. Duryea’s students.
Now run the code chunk below to use the observed data to calculate and plot the updated posterior.
How does this new posterior compare to the original posterior distribution we estimated? What effect has our larger collection of observations (larger sample size) had on the range of proportions you think our population parameter may live within? Does this result match your expectations?
Now that you’ve seen how different priors and different data (separately) result in different inferences, return to the previous section and examine how stronger priors (for example alpha = 600
and beta = 400
) interact with our data to influence our inference.
What conclusions can you draw about whether the prior or the observed data has greater influence over the resulting posterior?