Modeling Genetic Drift with Markov Chains
Genetic Drift
Genetic drift is the process in which versions of a trait in a population fluctuate over time. (Nature Education, 2014b) For example, the proportion of blondes, brunettes, people with black hair and redheads changes from generation to generation. A person’s observable traits (blonde, brunette, black hair, redhead) is called a phenotype. (Nature Education, 2014a) This phenotype is determined by a person’s set of genes called their genotype. (Nature Education, 2014a) Because a person gets one chromosome from their mother and one from their father, each person has two versions of each gene. These alternative forms of a gene are called alleles. (Gleichmann, 2020) In a scenario when there are two alternative forms of a gene (alleles), one allele is written as a capitalized letter (A for example) and the other is written as a lowercase letter (a for example). Each parent has two of these alleles and has an equal chance of passing on either of them to their child.
We can use punnet squares to examine the potential alleles that a child of two parents with specific alleles can have and the probabilities that a child will have a certain set of alleles. It is important to note that the probabilities listed are not the percent of their children that will have each pair of alleles but rather, the probabilities that a single child will have each allele combination. Each child’s allele combination is independent from the other children. Below are all of the possible parent combinations in a population with two forms of a gene and the probabilities that they will have a child of each genotype. If two AA parents have a child, both parents will give their child an A allele with 100% probability so the child will have an AA genotype (assuming there is no mutation). Mutation rates are typically lower than 0.00001 per gene per generation (Balin & Cascalho, 2009) so this is a realistic scenario. In a punnet square, the alleles of one parent goes on the top of the square and the alleles of the other goes on the side. Then the column and row of a cell determines the alleles that a parent gives the child and the frequency of cells with that combination determines the probability that a child has that allele combination. In the following diagram and simulation the number 1 represents having an AA genotype (called homozygous dominant), 2 represents having an Aa genotype (called heterozygous), and 3 represents having an aa genotype (homozygous recessive).
The following is a simulation of how the frequencies of these genotypes may change over time. The simulation models births according to a Poisson process with rate equal to lambda times the population divided by 2 because 2 parents are needed to reproduce. The death is also modeled according to a Poisson process with rate mu times the number of people in the population.
Code
# get which allele combination is born
get_allele_birth <- function(LastEvent){
totalFreq <- LastEvent$TotalPeople
AAfreq <- LastEvent$AA.freq
Aafreq <- LastEvent$Aa.freq
aafreq <- LastEvent$aa.freq
population <- c(rep(1, AAfreq), rep(2, Aafreq), rep(3, aafreq))
#parents <- sample(c(1,2,3), 2, replace=F, prob=c(AAfreq, Aafreq, aafreq))
parents <- sample(population, size=2 , replace=F)
#print(parents)
if(parents[1] == 1 && parents[2] == 1){ #Two AA parents
return(1)
}
if(parents[1] == 2 && parents[2] == 1 | parents[1] == 1 && parents[2] == 2 ){ #AA Aa parents
child <- sample(c(1,2), 1, prob=c(0.5, 0.5))
return(child)
}
if(parents[1] == 1 && parents[2] == 3 | parents[1] == 3 && parents[2] == 1 ){ #AA aa parents
return(2)
}
if(parents[1] == 2 && parents[2] == 2){ #Two Aa parents
child <- sample(c(1,2,3), 1, prob=c(0.25, 0.5, 0.25))
return(child)
}
if(parents[1] == 2 && parents[2] == 3 | parents[1] == 3 && parents[2] == 2 ){ #Aa aa parents
child <- sample(c(2,3), 1, prob=c(0.5, 0.5))
return(child)
}
if(parents[1] == 3 && parents[2] == 3){ #Two aa parents
return(3)
}
}
Code
# get which allele combination dies
get_allele_death <- function(LastEvent){
totalFreq <- LastEvent$TotalPeople
AAfreq <- LastEvent$AA.freq/totalFreq
Aafreq <- LastEvent$Aa.freq/totalFreq
aafreq <- LastEvent$aa.freq/totalFreq
death <- rng_respecting_sample(c(1,2,3), size=1 , replace=F, prob=c(AAfreq, Aafreq, aafreq))
#sample(c(1,2,3), 1, prob=c(AAfreq, Aafreq, aafreq))
return(death)
}
Code
# run the simulation
run_genetic_drift <- function(init.AA, init.Aa, init.aa, steps, lambda, mu){
init.pop <- init.AA + init.Aa + init.aa
Events <- data.frame("WaitTime"=c(0), "AA.freq"=c(init.AA), "Aa.freq"=c(init.Aa), "aa.freq"=c(init.aa), "CummulativeTime"=c(0), "TotalPeople"=c(init.pop), "BarWidth" = c(0))
for(i in 1:steps){
new.row <- Events[i, ]
current.pop <- Events[i, ]$AA.freq + Events[i, ]$aa.freq +Events[i, ]$Aa.freq
wait <- get_wait(current.pop, lambda, mu)
new.row$WaitTime <- wait
new.row$CummulativeTime <- new.row$CummulativeTime + wait
stepType <- get_step_type(current.pop, lambda, mu)
if(stepType == 1){ #birth
allele_birth <- get_allele_birth(Events[i,])
new.row[, 1+allele_birth] <- Events[i, 1+ allele_birth] + 1
}
if(stepType == 2){ #death
allele_death <- get_allele_death(Events[i,])
new.row[, 1+allele_death] <- Events[i, 1+ allele_death] - 1
}
new.row$TotalPeople <- sum(new.row$AA.freq, new.row$Aa.freq, new.row$aa.freq)
row.names(new.row) <- i+1
Events[i-1,7] <- wait
Events <- rbind(Events, new.row)
}
return(Events)
}
Modeling two genotypes with equal start frequencies
With rates lambda=3 and mu=1 and the time being equal to 100 years, the simulation has the assumptions that a birth happens at a rate of 3 births per person (because two people are needed to reproduce this is really equal to 3 births per 2 people) every 100 years. We also assume that people live to be 100 on average. The following plots are the count of each genotype in the population and the frequency of each genotype in the population.
Code
set.seed(3)
Events <- run_genetic_drift(init.AA=10, init.Aa=10, init.aa=0, steps=10000, lambda=3, mu=1)
plot(Events$CummulativeTime, Events$AA.freq,
type = "s",
xlab = "Time (hundreds of years)", ylab = "Count of Genotypes",
main = "Genotype Count", col="blue", ylim=c(-1,1000))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(1, 900, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)
Code
plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
type = "s",
xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)
It looks like these genotypes converge to a stationary distribution over time. We can run this again with another seed to see if this occurs again.
Code
set.seed(2)
Events <- run_genetic_drift(init.AA=10, init.Aa=10, init.aa=0, steps=10000, lambda=3, mu=1)
plot(Events$CummulativeTime, Events$AA.freq,
type = "s",
xlab = "Time (hundreds of years)", ylab = "Count of Genotypes",
main = "Genotype Count", col="blue", ylim=c(-1,1000))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(1, 900, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)
Code
plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
type = "s",
xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)
The genotypes oscilate in the beginning of the simulation and then look to converge to a stationary distribution again. I ran this simulation many other times with different seeds and this pattern occured each time. We can approximate the stationary distribution of each genotype by finding the long run distribution of each genotype in this simulation:
Code
summary_AA.freq <- Events %>%
group_by(AA.freq/TotalPeople) %>%
summarise(total_wait_time = sum(WaitTime))
binned_summary_AA.freq <- summary_AA.freq %>% mutate(new_bin = cut(`AA.freq/TotalPeople`, breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)))
binned.freqs <- binned_summary_AA.freq %>%
group_by(new_bin) %>%
summarise(Frequency = sum(total_wait_time))
binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime)
ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) +
geom_bar(stat="identity", fill="red") +
labs(title="Long Run Frequency Distribution of AA", x="Frequency of allele in population", y="Percentage of time in this state")
Code
summary_Aa.freq <- Events %>%
group_by(Aa.freq/TotalPeople) %>%
summarise(total_wait_time = sum(WaitTime))
binned_summary_Aa.freq <- summary_Aa.freq %>% mutate(new_bin = cut(`Aa.freq/TotalPeople`, breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)))
binned.freqs <- binned_summary_Aa.freq %>%
group_by(new_bin) %>%
summarise(Frequency = sum(total_wait_time))
binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime)
ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) +
geom_bar(stat="identity", fill="blue") +
labs(title="Long Run Frequency Distribution of Aa", x="Frequency of allele in population", y="Percentage of time in this state")
Code
summary_aa.freq <- Events %>%
group_by(aa.freq/TotalPeople) %>%
summarise(total_wait_time = sum(WaitTime))
binned_summary_aa.freq <- summary_aa.freq %>% mutate(new_bin = cut(`aa.freq/TotalPeople`, breaks=c(-0.00001, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)))
binned.freqs <- binned_summary_aa.freq %>%
group_by(new_bin) %>%
summarise(Frequency = sum(total_wait_time))
binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime)
ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) +
geom_bar(stat="identity", fill="green") +
labs(title="Long Run Frequency of aa", x="Frequency of allele in population", y="Percentage of time in this state")
Modeling a Mutation
When a mutation happens in a population, one allele in a person is changed to be a different version of a gene. We can model this by making only one member of the population be Aa and the other members be AA.
Code
set.seed(3)
Events <- run_genetic_drift(init.AA=10, init.Aa=1, init.aa=0, steps=500, lambda=3, mu=1)
plot(Events$CummulativeTime, Events$AA.freq,
type = "s",
xlab = "Time (hundreds of years)", ylab = "Count of Genotypes",
main = "Genotype Counts", col="blue", ylim=c(-1,40))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(0.5, 35, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)
Code
plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
type = "s",
xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)
In this simulation, the proportion of people with the mutated gene increases but eventually the mutation disappears completely from the popluation. We can run this simulation again to see if this happens every time:
Note the scale change on the axes of this plot
Code
set.seed(7)
Events <- run_genetic_drift(init.AA=10, init.Aa=1, init.aa=0, steps=10000, lambda=3, mu=1)
plot(Events$CummulativeTime, Events$AA.freq,
type = "s",
xlab = "Time(hundreds of years)", ylab = "Count of Genotypes",
main = "Genotype Frequencies", col="blue", ylim=c(-1,1000))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(0.5, 950, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)
Code
plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
type = "s",
xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)
In this simulation, the mutation did not disappear from the population and instead looks to converge to some stationary distribution although the frequency of the genotypes doesn’t look to be totally constant at the end of the simulation.
Modeling inbreeding in a small population
Inbreeding is known to be detrimental to a population because it leads to homozygosity of alleles. (McDaniel, 2011) If an allele is heterozygous (Aa), then the phenotype that is expressed is either the one corresponding to the dominant allele (A) or a mixture of the alleles (this is called codominance (Codominance, 2024)). Homozygosity of recessive alleles (aa) can sometimes be dangerous when the phenotype expressed by the aa genotype combination is a harmful disease. For example, the gene for sickle cell anemia is represented by the recessive a allele. If a person has the AA genotype, they have regularly shaped red blood cells. If a person has the Aa genotype, they have mild sickle cell anemia but are protected from malaria. If a person has the aa genotype, they have severe sickle cell anemia. (Helzner, 2023) This is why increasing the population of recessive homozygotes can be harmful to a population. This can occur when there is a large amount of inbreeding in a population. The following simulation is a simulation of inbreeding in a population of 2 which has two children every generation who then mate with eachother and have 2 children, etc. This can be modeled by the following state diagram of the alleles of the populations: These probabilities come from the original punnett squares and calculating the probabilities of having two children of a certain set of alleles.
The corresponding transition matrix is:
Code
# Code by Dr. Ross
shape_transition_matrix_to_plot <- function(P, state_names = NULL) {
n_states = nrow(P)
if (is.null(state_names)) state_names = 1:n_states
as.data.frame.table(t(P), responseName = "transition_probability") |>
mutate_if(is.factor, as.integer) |>
rename(after_state = Var1, before_state = Var2) |>
mutate(after_state = state_names[after_state],
before_state = state_names[before_state]) |>
relocate(before_state) |>
mutate(before_state = factor(before_state),
after_state = factor(after_state)) |>
mutate(before_state = fct_rev(before_state))
}
plot_transition_matrix <- function(P, state_names = NULL, n_step = 1) {
n_states = nrow(P)
if (is.null(state_names)) state_names = 1:n_states
P_plot = shape_transition_matrix_to_plot(P, state_names)
P_plot |>
ggplot(aes(x = after_state,
y = before_state,
fill = transition_probability,
label = sprintf("%0.3f", round(transition_probability, 3)))) +
geom_raster() +
geom_text(aes(color = after_scale(best_contrast(fill,
c("white", "black")))),
show.legend = FALSE) +
scale_fill_viridis(option = "magma",
limits = c(0, 1)) +
scale_x_discrete(expand = c(0, 0),
position = "top") +
scale_y_discrete(expand = c(0, 0)) +
labs(x = "Next state",
y = "Current state",
title = paste(n_step, "-step transition matrix", sep =))
}
state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa")
plot_transition_matrix(P, state_names)
Code
# Code by Dr. Ross
simulate_single_DTMC_path <- function(initial_distribution, transition_matrix, last_time){
n_states = nrow(transition_matrix) # number of states
states = 1:n_states # state space
X = rep(NA, last_time + 1) # state at time n; +1 to include time 0
X[1] = sample(states, 1, replace = TRUE, prob = initial_distribution) # initial state
for (n in 2:(last_time + 1)){
X[n] = sample(states, 1, replace = TRUE, prob = transition_matrix[X[n-1], ])
}
return(X)
}
Code
# Code by Dr. Ross
simulate_DTMC_paths <- function(initial_distribution, transition_matrix, last_time, n_paths = 1) {
# Columns of output:
# - n = time (possible values 0:last_time)
# - path (possible values 1:n_paths)
# - X = state (possible values 1:n_states); n_states is nrow(P)
replicate(n_paths, simulate_single_DTMC_path(initial_distribution, transition_matrix, last_time)) |>
as.data.frame() |>
mutate(n = 0:last_time) |>
pivot_longer(cols = !n, names_to = "path", values_to = "X")
}
Code
# Code by Dr. Ross
set.seed(1)
plot_DTMC_paths <- function(initial_distribution, transition_matrix, state_names, last_time, n_paths = 1) {
n_states = nrow(transition_matrix)
X_df = simulate_DTMC_paths(initial_distribution,
transition_matrix,
last_time,
n_paths)
if (n_paths > 1) {
X_df = X_df |>
# X is jittered vertically; for jitter need continuous X
# but X needs to be a factor for discrete colors
mutate(X_jitter = jitter(X),
n_jitter = jitter(n))
} else {
X_df = X_df |>
mutate(X_jitter = X,
n_jitter = n)
}
alpha_value = min(1, 1 / log10(n_states * n_paths * last_time))
path_plot <- X_df |>
ggplot(aes(x = n_jitter,
y = X_jitter,
group = path)) +
geom_point(aes(col = factor(X)),
alpha = alpha_value) +
scale_color_viridis(discrete = TRUE,
labels = paste(state_names, " (", 1:n_states, ")", sep = ""),
guide = guide_legend(reverse = TRUE)) +
geom_line(col = "gray",
alpha = alpha_value) +
scale_x_continuous(breaks = 0:last_time) +
scale_y_continuous(breaks = 1:n_states) +
labs(x = "Time (generation)",
y = "Population Genotypes",
col = "State",
title = paste(n_paths, "sample paths with starting population Aa Aa")) +
theme_bw() +
theme(panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_line(colour = "black"),
axis.text.x = element_text(angle = 90, size = 8)) +
# coord_cartesian is for aligning with marginal bar plot
coord_cartesian(xlim = c(0, last_time))
path_plot
}
initial_distribution <- c(0,0,0,1,0,0)
state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa")
plot_DTMC_paths(initial_distribution, P, state_names, 20, 10)
AA AA and aa aa are absorbing states, which every population eventually goes to. Populations with high rates of inbreeding have high rates of homozygotes. This emphasizes the populations going to the absorbing states quickly when there is a small inbreeding population. We can calculate the mean number of generations to get to one of these absorbing states when we start at each population geonotype
Code
# Code by Dr. Ross
mean_time_to_absorption <- function(transition_matrix, state_names = NULL) {
absorbing_states = which(diag(transition_matrix) == 1)
if (length(absorbing_states) == 0) stop("There are no absorbing states.")
n_states = nrow(transition_matrix)
transient_states = setdiff(1:n_states, absorbing_states)
Q = transition_matrix[transient_states, transient_states]
mtta = solve(diag(nrow(Q)) - Q, rep(1, nrow(Q)))
if (is.null(state_names)) state_names = 1:n_states
data.frame(start_state = state_names[transient_states],
mean_time_to_absorption = mtta)
}
state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa")
mean_time_to_absorption(P, state_names)
start_state mean_time_to_absorption
1 AA Aa 4.833333
2 AA aa 6.666667
3 Aa Aa 5.666667
4 Aa aa 4.833333
When inbreeding occurs especially in a small population, there is a high probability that the population will converge to one of these absorbing states. And we can see that this happens quickly in a small population. This is why random breeding is important the the health of a population. It also emphasizes why genetic variation comes from the mating of different genotypes.
Genetic drift and different scenarios in genetics and evolution can be well modeled by Markov chains because a child’s genotype only depends on their parent’s genotype and when this genotype is known, the child’s genotype is conditionally independent of generation’s before the last.
Bibliography
Balin, S. J., & Cascalho, M. (2009). The rate of mutation of a single gene. Nucleic Acids Research, 38(5), 1575–1582. https://doi.org/10.1093/nar/gkp1119
Codominance. (2024, March). Genome.gov. https://www.genome.gov/genetics-glossary/Codominance#:~:text=%E2%80%8BCodominance&text=Codominance%2C%20as%20it%20relates%20to
Gleichmann, N. (2020). Gene Vs Allele: Definition, Difference and Comparison. Neuroscience from Technology Networks. https://www.technologynetworks.com/neuroscience/articles/gene-vs-allele-definition-difference-and-comparison-331835
Helzner, C. (2023, November). Heterozygote Advantage | Definition & Examples. https://study.com/academy/lesson/heterozygote-advantage-example-lesson-quiz.html#:~:text=When%20a%20single%20copy%20of,a%20deadly%20disease%20in%20homozygotes.
McDaniel, B. T. (2011). Inbreeding - an overview | ScienceDirect Topics. Www.sciencedirect.com. https://www.sciencedirect.com/topics/immunology-and-microbiology/inbreeding#:~:text=Inbreeding%20generally%20has%20deleterious%20effects
Nature Education. (2014a). phenotype / phenotypes | Learn Science at Scitable. Www.nature.com. https://www.nature.com/scitable/definition/phenotype-phenotypes-35/
Nature Education. (2014b). random genetic drift / genetic drift | Learn Science at Scitable. Nature.com. https://www.nature.com/scitable/definition/genetic-drift-201/