Modeling Genetic Drift with Markov Chains

Author

Bena Smith

Code

library(DDD)
library(markovchain)
library(colourvalues)
library(igraph)
library(prismatic)
library(viridis)
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)

Genetic Drift

Genetic drift is the process in which versions of a trait in a population fluctuate over time. (Nature Education, 2014b) For example, the proportion of blondes, brunettes, people with black hair and redheads changes from generation to generation. A person’s observable traits (blonde, brunette, black hair, redhead) is called a phenotype. (Nature Education, 2014a) This phenotype is determined by a person’s set of genes called their genotype. (Nature Education, 2014a) Because a person gets one chromosome from their mother and one from their father, each person has two versions of each gene. These alternative forms of a gene are called alleles. (Gleichmann, 2020) In a scenario when there are two alternative forms of a gene (alleles), one allele is written as a capitalized letter (A for example) and the other is written as a lowercase letter (a for example). Each parent has two of these alleles and has an equal chance of passing on either of them to their child.

We can use punnet squares to examine the potential alleles that a child of two parents with specific alleles can have and the probabilities that a child will have a certain set of alleles. It is important to note that the probabilities listed are not the percent of their children that will have each pair of alleles but rather, the probabilities that a single child will have each allele combination. Each child’s allele combination is independent from the other children. Below are all of the possible parent combinations in a population with two forms of a gene and the probabilities that they will have a child of each genotype. If two AA parents have a child, both parents will give their child an A allele with 100% probability so the child will have an AA genotype (assuming there is no mutation). Mutation rates are typically lower than 0.00001 per gene per generation (Balin & Cascalho, 2009) so this is a realistic scenario. In a punnet square, the alleles of one parent goes on the top of the square and the alleles of the other goes on the side. Then the column and row of a cell determines the alleles that a parent gives the child and the frequency of cells with that combination determines the probability that a child has that allele combination. In the following diagram and simulation the number 1 represents having an AA genotype (called homozygous dominant), 2 represents having an Aa genotype (called heterozygous), and 3 represents having an aa genotype (homozygous recessive).

Code

knitr::include_graphics("PunnetSquares.png")

The following is a simulation of how the frequencies of these genotypes may change over time. The simulation models births according to a Poisson process with rate equal to lambda times the population divided by 2 because 2 parents are needed to reproduce. The death is also modeled according to a Poisson process with rate mu times the number of people in the population.

Code

# Get wait times between an event
get_wait <- function(current.pop, lambda, mu){
  rate <- ((current.pop)/2)*lambda + current.pop*mu
  waitTime <- rexp(1, rate)
  return(waitTime)
  
}

Code

# Get if there is a death or birth
get_step_type <- function(current.pop, lambda, mu){
  prob.birth <- ((current.pop/2)*lambda)/ ((current.pop/2)*lambda + current.pop*mu)
  prob.death <- 1-prob.birth
  type <- sample(c(1,2), 1, prob=c(prob.birth, prob.death))
  return(type)  # 1 if birth 2 if death
}

Code

# get which allele combination is born
get_allele_birth <- function(LastEvent){
  
  totalFreq <- LastEvent$TotalPeople
  
  AAfreq <- LastEvent$AA.freq
  Aafreq <- LastEvent$Aa.freq
  aafreq <- LastEvent$aa.freq

  population <- c(rep(1, AAfreq), rep(2, Aafreq), rep(3, aafreq))
  
  #parents <- sample(c(1,2,3), 2, replace=F, prob=c(AAfreq, Aafreq, aafreq))
  parents <- sample(population, size=2 , replace=F)

  #print(parents)
  
  if(parents[1] == 1 && parents[2] == 1){ #Two AA parents
    return(1)
  } 
  
  if(parents[1] == 2 && parents[2] == 1 | parents[1] == 1 && parents[2] == 2 ){ #AA Aa parents
    child <- sample(c(1,2), 1, prob=c(0.5, 0.5))
    return(child)
  } 
  
  if(parents[1] == 1 && parents[2] == 3 | parents[1] == 3 && parents[2] == 1 ){ #AA aa parents
    return(2)
  } 
  
  if(parents[1] == 2 && parents[2] == 2){ #Two Aa parents
    child <- sample(c(1,2,3), 1, prob=c(0.25, 0.5, 0.25))
    return(child)
  } 
  
  if(parents[1] == 2 && parents[2] == 3 | parents[1] == 3 && parents[2] == 2 ){ #Aa aa parents
    child <- sample(c(2,3), 1, prob=c(0.5, 0.5))
    return(child)
  } 
  
  if(parents[1] == 3 && parents[2] == 3){ #Two aa parents
    return(3)
  } 
  
 
    
}

Code

# get which allele combination dies
get_allele_death <- function(LastEvent){
  
  
  totalFreq <- LastEvent$TotalPeople
  
  AAfreq <- LastEvent$AA.freq/totalFreq
  Aafreq <- LastEvent$Aa.freq/totalFreq
  aafreq <- LastEvent$aa.freq/totalFreq
  
  death <-   rng_respecting_sample(c(1,2,3), size=1 , replace=F, prob=c(AAfreq, Aafreq, aafreq))

  #sample(c(1,2,3), 1, prob=c(AAfreq, Aafreq, aafreq))
  
  return(death)
}

Code

# run the simulation
run_genetic_drift <- function(init.AA, init.Aa, init.aa, steps, lambda, mu){
  
  init.pop <- init.AA + init.Aa + init.aa

  Events <- data.frame("WaitTime"=c(0), "AA.freq"=c(init.AA), "Aa.freq"=c(init.Aa), "aa.freq"=c(init.aa), "CummulativeTime"=c(0), "TotalPeople"=c(init.pop), "BarWidth" = c(0))


  for(i in 1:steps){
    new.row <- Events[i, ]

    current.pop <- Events[i, ]$AA.freq + Events[i, ]$aa.freq +Events[i, ]$Aa.freq

    wait <- get_wait(current.pop, lambda, mu)
    
    new.row$WaitTime <- wait
    new.row$CummulativeTime <- new.row$CummulativeTime + wait
    
    stepType <- get_step_type(current.pop, lambda, mu)

    
    if(stepType == 1){ #birth

      allele_birth <- get_allele_birth(Events[i,])
      new.row[, 1+allele_birth] <- Events[i, 1+ allele_birth] + 1
      }
    
    if(stepType == 2){ #death
      allele_death <- get_allele_death(Events[i,])
      new.row[, 1+allele_death] <- Events[i, 1+ allele_death] - 1

      
    }
    
    new.row$TotalPeople <- sum(new.row$AA.freq, new.row$Aa.freq, new.row$aa.freq)
    row.names(new.row) <- i+1

    Events[i-1,7] <- wait 
    Events <- rbind(Events, new.row)
    
  }


  return(Events)
  
  
}

Modeling two genotypes with equal start frequencies

With rates lambda=3 and mu=1 and the time being equal to 100 years, the simulation has the assumptions that a birth happens at a rate of 3 births per person (because two people are needed to reproduce this is really equal to 3 births per 2 people) every 100 years. We also assume that people live to be 100 on average. The following plots are the count of each genotype in the population and the frequency of each genotype in the population.

Code

set.seed(3)
Events <- run_genetic_drift(init.AA=10, init.Aa=10, init.aa=0, steps=10000, lambda=3, mu=1)

plot(Events$CummulativeTime, Events$AA.freq,
  type = "s",
  xlab = "Time (hundreds of years)", ylab = "Count of Genotypes",
  main = "Genotype Count", col="blue", ylim=c(-1,1000))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(1, 900, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)

Code

plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
  type = "s",
  xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
  main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)

It looks like these genotypes converge to a stationary distribution over time. We can run this again with another seed to see if this occurs again.

Code

set.seed(2)
Events <- run_genetic_drift(init.AA=10, init.Aa=10, init.aa=0, steps=10000, lambda=3, mu=1)

plot(Events$CummulativeTime, Events$AA.freq,
  type = "s",
  xlab = "Time (hundreds of years)", ylab = "Count of Genotypes",
  main = "Genotype Count", col="blue", ylim=c(-1,1000))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(1, 900, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)

Code

plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
  type = "s",
  xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
  main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)

The genotypes oscilate in the beginning of the simulation and then look to converge to a stationary distribution again. I ran this simulation many other times with different seeds and this pattern occured each time. We can approximate the stationary distribution of each genotype by finding the long run distribution of each genotype in this simulation:

Code

summary_AA.freq <- Events %>%
  group_by(AA.freq/TotalPeople) %>%
  summarise(total_wait_time = sum(WaitTime))

binned_summary_AA.freq <- summary_AA.freq %>% mutate(new_bin = cut(`AA.freq/TotalPeople`, breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)))


binned.freqs <- binned_summary_AA.freq %>% 
  group_by(new_bin) %>% 
  summarise(Frequency = sum(total_wait_time))

binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime)

ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) +
  geom_bar(stat="identity", fill="red") +
  labs(title="Long Run Frequency Distribution of AA", x="Frequency of allele in population", y="Percentage of time in this state")

Code

summary_Aa.freq <- Events %>%
  group_by(Aa.freq/TotalPeople) %>%
  summarise(total_wait_time = sum(WaitTime))

binned_summary_Aa.freq <- summary_Aa.freq %>% mutate(new_bin = cut(`Aa.freq/TotalPeople`, breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)))


binned.freqs <- binned_summary_Aa.freq %>% 
  group_by(new_bin) %>% 
  summarise(Frequency = sum(total_wait_time))

binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime)

ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) +
  geom_bar(stat="identity", fill="blue") +
  labs(title="Long Run Frequency Distribution of Aa", x="Frequency of allele in population", y="Percentage of time in this state")

Code

summary_aa.freq <- Events %>%
  group_by(aa.freq/TotalPeople) %>%
  summarise(total_wait_time = sum(WaitTime))

binned_summary_aa.freq <- summary_aa.freq %>% mutate(new_bin = cut(`aa.freq/TotalPeople`, breaks=c(-0.00001, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)))


binned.freqs <- binned_summary_aa.freq %>% 
  group_by(new_bin) %>% 
  summarise(Frequency = sum(total_wait_time))

binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime)

ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) +
  geom_bar(stat="identity", fill="green") +
  labs(title="Long Run Frequency of aa", x="Frequency of allele in population", y="Percentage of time in this state")

Modeling a Mutation

When a mutation happens in a population, one allele in a person is changed to be a different version of a gene. We can model this by making only one member of the population be Aa and the other members be AA.

Code

set.seed(3)
Events <- run_genetic_drift(init.AA=10, init.Aa=1, init.aa=0, steps=500, lambda=3, mu=1)

plot(Events$CummulativeTime, Events$AA.freq,
  type = "s",
  xlab = "Time (hundreds of years)", ylab = "Count of Genotypes",
  main = "Genotype Counts", col="blue", ylim=c(-1,40))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(0.5, 35, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)

Code

plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
  type = "s",
  xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
  main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)

In this simulation, the proportion of people with the mutated gene increases but eventually the mutation disappears completely from the popluation. We can run this simulation again to see if this happens every time:

Note the scale change on the axes of this plot

Code

set.seed(7)
Events <- run_genetic_drift(init.AA=10, init.Aa=1, init.aa=0, steps=10000, lambda=3, mu=1)

plot(Events$CummulativeTime, Events$AA.freq,
  type = "s",
  xlab = "Time(hundreds of years)", ylab = "Count of Genotypes",
  main = "Genotype Frequencies", col="blue", ylim=c(-1,1000))
lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green")
legend(0.5, 950, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=1)

Code

plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople,
  type = "s",
  xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes",
  main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3))
lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red")
lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green")
legend(6, -0.1, legend=c("AA", "Aa", "aa"),
col=c("blue", "red", "green"), lty=1,
cex=0.75)

In this simulation, the mutation did not disappear from the population and instead looks to converge to some stationary distribution although the frequency of the genotypes doesn’t look to be totally constant at the end of the simulation.

Modeling inbreeding in a small population

Inbreeding is known to be detrimental to a population because it leads to homozygosity of alleles. (McDaniel, 2011) If an allele is heterozygous (Aa), then the phenotype that is expressed is either the one corresponding to the dominant allele (A) or a mixture of the alleles (this is called codominance (Codominance, 2024)). Homozygosity of recessive alleles (aa) can sometimes be dangerous when the phenotype expressed by the aa genotype combination is a harmful disease. For example, the gene for sickle cell anemia is represented by the recessive a allele. If a person has the AA genotype, they have regularly shaped red blood cells. If a person has the Aa genotype, they have mild sickle cell anemia but are protected from malaria. If a person has the aa genotype, they have severe sickle cell anemia. (Helzner, 2023) This is why increasing the population of recessive homozygotes can be harmful to a population. This can occur when there is a large amount of inbreeding in a population. The following simulation is a simulation of inbreeding in a population of 2 which has two children every generation who then mate with eachother and have 2 children, etc. This can be modeled by the following state diagram of the alleles of the populations: These probabilities come from the original punnett squares and calculating the probabilities of having two children of a certain set of alleles.

Code

knitr::include_graphics("InbreedingStateDiagram.png")

The corresponding transition matrix is:

Code

P <- t(matrix(c(1, 0, 0, 0, 0, 0,
              0.25, 0.5, 0, 0.25, 0, 0,
              0, 0, 0, 1, 0, 0,
              0.0625, 0.25, 0.125, 0.25, 0.25, 0.0625,
              0, 0, 0, 0.25, 0.5, 0.25,
              0, 0, 0, 0, 0, 1), nrow = 6))

Code

# Code by Dr. Ross

shape_transition_matrix_to_plot <- function(P, state_names = NULL) {
  
  n_states = nrow(P)
  
  if (is.null(state_names)) state_names = 1:n_states
  
  as.data.frame.table(t(P), responseName = "transition_probability") |>
    mutate_if(is.factor, as.integer) |>
    rename(after_state = Var1, before_state = Var2) |>
    mutate(after_state = state_names[after_state],
           before_state = state_names[before_state]) |>
    relocate(before_state) |>
    mutate(before_state = factor(before_state),
           after_state = factor(after_state)) |>
    mutate(before_state = fct_rev(before_state))
}

plot_transition_matrix <- function(P, state_names = NULL, n_step = 1) {
  
  n_states = nrow(P)
  
  if (is.null(state_names)) state_names = 1:n_states
  

  P_plot = shape_transition_matrix_to_plot(P, state_names)
  
  P_plot |>
    ggplot(aes(x = after_state,
               y = before_state,
               fill = transition_probability,
               label = sprintf("%0.3f", round(transition_probability, 3)))) +
    geom_raster() +
    geom_text(aes(color = after_scale(best_contrast(fill,
                                                    c("white", "black")))),
              show.legend = FALSE) +
    scale_fill_viridis(option = "magma",
                       limits = c(0, 1)) +
    scale_x_discrete(expand = c(0, 0),
                     position = "top") +
    scale_y_discrete(expand = c(0, 0)) +
    labs(x = "Next state",
         y = "Current state",
         title = paste(n_step, "-step transition matrix", sep =))
  
}

state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa")
plot_transition_matrix(P, state_names)

Code

# Code by Dr. Ross

simulate_single_DTMC_path <- function(initial_distribution, transition_matrix, last_time){
  
  n_states = nrow(transition_matrix) # number of states
  
  states = 1:n_states # state space
  
  X = rep(NA, last_time + 1) # state at time n; +1 to include time 0
  
  X[1] = sample(states, 1, replace = TRUE, prob = initial_distribution) # initial state
  
  for (n in 2:(last_time + 1)){
    
    X[n] = sample(states, 1, replace = TRUE, prob = transition_matrix[X[n-1], ])
    
  }
  
  return(X)
  
}

Code

# Code by Dr. Ross

simulate_DTMC_paths <- function(initial_distribution, transition_matrix, last_time, n_paths = 1) {
  # Columns of output:
  # - n = time (possible values 0:last_time)
  # - path (possible values 1:n_paths)
  # - X = state (possible values 1:n_states); n_states is nrow(P)
  
  
  replicate(n_paths, simulate_single_DTMC_path(initial_distribution, transition_matrix, last_time)) |>
    as.data.frame() |>
    mutate(n = 0:last_time) |>
    pivot_longer(cols = !n, names_to = "path", values_to = "X")
  
}

Code

# Code by Dr. Ross

set.seed(1)
plot_DTMC_paths <- function(initial_distribution, transition_matrix, state_names, last_time, n_paths = 1) {
  
  n_states = nrow(transition_matrix)
  
  X_df = simulate_DTMC_paths(initial_distribution,
                             transition_matrix,
                             last_time,
                             n_paths)
  
  if (n_paths > 1) {
    X_df = X_df |>
      # X is jittered vertically; for jitter need continuous X
      # but X needs to be a factor for discrete colors
      mutate(X_jitter = jitter(X),
             n_jitter = jitter(n))
  } else {
    X_df = X_df |>
      mutate(X_jitter = X,
             n_jitter = n)
  }
  alpha_value = min(1, 1 / log10(n_states * n_paths * last_time))
  
  path_plot <- X_df |>
    ggplot(aes(x = n_jitter,
               y = X_jitter,
               group = path)) +
    geom_point(aes(col = factor(X)),
               alpha = alpha_value) +
    scale_color_viridis(discrete = TRUE,
                        labels = paste(state_names, " (", 1:n_states, ")", sep = ""),
                        guide = guide_legend(reverse = TRUE)) +
    geom_line(col = "gray",
              alpha = alpha_value) +
    scale_x_continuous(breaks = 0:last_time) +
    scale_y_continuous(breaks = 1:n_states) +
    labs(x = "Time (generation)",
         y = "Population Genotypes",
         col = "State",
         title = paste(n_paths, "sample paths with starting population Aa Aa")) +
    theme_bw() +
    theme(panel.grid.major.y = element_blank(),
          panel.grid.minor.y = element_line(colour = "black"),
          axis.text.x = element_text(angle = 90, size = 8)) +
    # coord_cartesian is for aligning with marginal bar plot
    coord_cartesian(xlim = c(0, last_time))
  
  
  path_plot
}

initial_distribution <- c(0,0,0,1,0,0)
state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa")
plot_DTMC_paths(initial_distribution, P, state_names, 20, 10)

AA AA and aa aa are absorbing states, which every population eventually goes to. Populations with high rates of inbreeding have high rates of homozygotes. This emphasizes the populations going to the absorbing states quickly when there is a small inbreeding population. We can calculate the mean number of generations to get to one of these absorbing states when we start at each population geonotype

Code

# Code by Dr. Ross
mean_time_to_absorption <- function(transition_matrix, state_names = NULL) {

  absorbing_states = which(diag(transition_matrix) == 1)
  
  if (length(absorbing_states) == 0) stop("There are no absorbing states.")
  
  n_states = nrow(transition_matrix)
  
  transient_states = setdiff(1:n_states, absorbing_states)
  
  Q = transition_matrix[transient_states, transient_states]
  
  mtta = solve(diag(nrow(Q)) - Q, rep(1, nrow(Q)))
  
  if (is.null(state_names)) state_names = 1:n_states
  
  data.frame(start_state = state_names[transient_states],
             mean_time_to_absorption = mtta)
}

state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa")
mean_time_to_absorption(P, state_names)

  start_state mean_time_to_absorption
1       AA Aa                4.833333
2       AA aa                6.666667
3       Aa Aa                5.666667
4       Aa aa                4.833333

When inbreeding occurs especially in a small population, there is a high probability that the population will converge to one of these absorbing states. And we can see that this happens quickly in a small population. This is why random breeding is important the the health of a population. It also emphasizes why genetic variation comes from the mating of different genotypes.

Genetic drift and different scenarios in genetics and evolution can be well modeled by Markov chains because a child’s genotype only depends on their parent’s genotype and when this genotype is known, the child’s genotype is conditionally independent of generation’s before the last.

Bibliography

Balin, S. J., & Cascalho, M. (2009). The rate of mutation of a single gene. Nucleic Acids Research, 38(5), 1575–1582. https://doi.org/10.1093/nar/gkp1119

Codominance. (2024, March). Genome.gov. https://www.genome.gov/genetics-glossary/Codominance#:~:text=%E2%80%8BCodominance&text=Codominance%2C%20as%20it%20relates%20to

Gleichmann, N. (2020). Gene Vs Allele: Definition, Difference and Comparison. Neuroscience from Technology Networks. https://www.technologynetworks.com/neuroscience/articles/gene-vs-allele-definition-difference-and-comparison-331835

Helzner, C. (2023, November). Heterozygote Advantage | Definition & Examples. https://study.com/academy/lesson/heterozygote-advantage-example-lesson-quiz.html#:~:text=When%20a%20single%20copy%20of,a%20deadly%20disease%20in%20homozygotes.

McDaniel, B. T. (2011). Inbreeding - an overview | ScienceDirect Topics. Www.sciencedirect.com. https://www.sciencedirect.com/topics/immunology-and-microbiology/inbreeding#:~:text=Inbreeding%20generally%20has%20deleterious%20effects

Nature Education. (2014a). phenotype / phenotypes | Learn Science at Scitable. Www.nature.com. https://www.nature.com/scitable/definition/phenotype-phenotypes-35/

Nature Education. (2014b). random genetic drift / genetic drift | Learn Science at Scitable. Nature.com. https://www.nature.com/scitable/definition/genetic-drift-201/

--- title: "Modeling Genetic Drift with Markov Chains" author: "Bena Smith" format: html: code-fold: true code-line-numbers: true code-tools: true editor: source embed-resources: true --- ```{r message=FALSE} library(DDD) library(markovchain) library(colourvalues) library(igraph) library(prismatic) library(viridis) library(ggplot2) library(dplyr) library(tidyr) library(forcats) ``` # Genetic Drift Genetic drift is the process in which versions of a trait in a population fluctuate over time. (Nature Education, 2014b) For example, the proportion of blondes, brunettes, people with black hair and redheads changes from generation to generation. A person's observable traits (blonde, brunette, black hair, redhead) is called a phenotype. (Nature Education, 2014a) This phenotype is determined by a person's set of genes called their genotype. (Nature Education, 2014a) Because a person gets one chromosome from their mother and one from their father, each person has two versions of each gene. These alternative forms of a gene are called alleles. (Gleichmann, 2020) In a scenario when there are two alternative forms of a gene (alleles), one allele is written as a capitalized letter (A for example) and the other is written as a lowercase letter (a for example). Each parent has two of these alleles and has an equal chance of passing on either of them to their child. We can use punnet squares to examine the potential alleles that a child of two parents with specific alleles can have and the probabilities that a child will have a certain set of alleles. It is important to note that the probabilities listed are not the percent of their children that will have each pair of alleles but rather, the probabilities that a single child will have each allele combination. Each child's allele combination is independent from the other children. Below are all of the possible parent combinations in a population with two forms of a gene and the probabilities that they will have a child of each genotype. If two AA parents have a child, both parents will give their child an A allele with 100% probability so the child will have an AA genotype (assuming there is no mutation). Mutation rates are typically lower than 0.00001 per gene per generation (Balin & Cascalho, 2009) so this is a realistic scenario. In a punnet square, the alleles of one parent goes on the top of the square and the alleles of the other goes on the side. Then the column and row of a cell determines the alleles that a parent gives the child and the frequency of cells with that combination determines the probability that a child has that allele combination. In the following diagram and simulation the number 1 represents having an AA genotype (called homozygous dominant), 2 represents having an Aa genotype (called heterozygous), and 3 represents having an aa genotype (homozygous recessive). ```{r} knitr::include_graphics("PunnetSquares.png") ``` The following is a simulation of how the frequencies of these genotypes may change over time. The simulation models births according to a Poisson process with rate equal to lambda times the population divided by 2 because 2 parents are needed to reproduce. The death is also modeled according to a Poisson process with rate mu times the number of people in the population. ```{r} # Get wait times between an event get_wait <- function(current.pop, lambda, mu){ rate <- ((current.pop)/2)*lambda + current.pop*mu waitTime <- rexp(1, rate) return(waitTime) } ``` ```{r} # Get if there is a death or birth get_step_type <- function(current.pop, lambda, mu){ prob.birth <- ((current.pop/2)*lambda)/ ((current.pop/2)*lambda + current.pop*mu) prob.death <- 1-prob.birth type <- sample(c(1,2), 1, prob=c(prob.birth, prob.death)) return(type) # 1 if birth 2 if death } ``` ```{r} # get which allele combination is born get_allele_birth <- function(LastEvent){ totalFreq <- LastEvent$TotalPeople AAfreq <- LastEvent$AA.freq Aafreq <- LastEvent$Aa.freq aafreq <- LastEvent$aa.freq population <- c(rep(1, AAfreq), rep(2, Aafreq), rep(3, aafreq)) #parents <- sample(c(1,2,3), 2, replace=F, prob=c(AAfreq, Aafreq, aafreq)) parents <- sample(population, size=2 , replace=F) #print(parents) if(parents[1] == 1 && parents[2] == 1){ #Two AA parents return(1) } if(parents[1] == 2 && parents[2] == 1 | parents[1] == 1 && parents[2] == 2 ){ #AA Aa parents child <- sample(c(1,2), 1, prob=c(0.5, 0.5)) return(child) } if(parents[1] == 1 && parents[2] == 3 | parents[1] == 3 && parents[2] == 1 ){ #AA aa parents return(2) } if(parents[1] == 2 && parents[2] == 2){ #Two Aa parents child <- sample(c(1,2,3), 1, prob=c(0.25, 0.5, 0.25)) return(child) } if(parents[1] == 2 && parents[2] == 3 | parents[1] == 3 && parents[2] == 2 ){ #Aa aa parents child <- sample(c(2,3), 1, prob=c(0.5, 0.5)) return(child) } if(parents[1] == 3 && parents[2] == 3){ #Two aa parents return(3) } } ``` ```{r} # get which allele combination dies get_allele_death <- function(LastEvent){ totalFreq <- LastEvent$TotalPeople AAfreq <- LastEvent$AA.freq/totalFreq Aafreq <- LastEvent$Aa.freq/totalFreq aafreq <- LastEvent$aa.freq/totalFreq death <- rng_respecting_sample(c(1,2,3), size=1 , replace=F, prob=c(AAfreq, Aafreq, aafreq)) #sample(c(1,2,3), 1, prob=c(AAfreq, Aafreq, aafreq)) return(death) } ``` ```{r} # run the simulation run_genetic_drift <- function(init.AA, init.Aa, init.aa, steps, lambda, mu){ init.pop <- init.AA + init.Aa + init.aa Events <- data.frame("WaitTime"=c(0), "AA.freq"=c(init.AA), "Aa.freq"=c(init.Aa), "aa.freq"=c(init.aa), "CummulativeTime"=c(0), "TotalPeople"=c(init.pop), "BarWidth" = c(0)) for(i in 1:steps){ new.row <- Events[i, ] current.pop <- Events[i, ]$AA.freq + Events[i, ]$aa.freq +Events[i, ]$Aa.freq wait <- get_wait(current.pop, lambda, mu) new.row$WaitTime <- wait new.row$CummulativeTime <- new.row$CummulativeTime + wait stepType <- get_step_type(current.pop, lambda, mu) if(stepType == 1){ #birth allele_birth <- get_allele_birth(Events[i,]) new.row[, 1+allele_birth] <- Events[i, 1+ allele_birth] + 1 } if(stepType == 2){ #death allele_death <- get_allele_death(Events[i,]) new.row[, 1+allele_death] <- Events[i, 1+ allele_death] - 1 } new.row$TotalPeople <- sum(new.row$AA.freq, new.row$Aa.freq, new.row$aa.freq) row.names(new.row) <- i+1 Events[i-1,7] <- wait Events <- rbind(Events, new.row) } return(Events) } ``` ## Modeling two genotypes with equal start frequencies With rates lambda=3 and mu=1 and the time being equal to 100 years, the simulation has the assumptions that a birth happens at a rate of 3 births per person (because two people are needed to reproduce this is really equal to 3 births per 2 people) every 100 years. We also assume that people live to be 100 on average. The following plots are the count of each genotype in the population and the frequency of each genotype in the population. ```{r} set.seed(3) Events <- run_genetic_drift(init.AA=10, init.Aa=10, init.aa=0, steps=10000, lambda=3, mu=1) plot(Events$CummulativeTime, Events$AA.freq, type = "s", xlab = "Time (hundreds of years)", ylab = "Count of Genotypes", main = "Genotype Count", col="blue", ylim=c(-1,1000)) lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green") legend(1, 900, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=1) plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople, type = "s", xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes", main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3)) lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green") legend(6, -0.1, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=0.75) ``` It looks like these genotypes converge to a stationary distribution over time. We can run this again with another seed to see if this occurs again. ```{r} set.seed(2) Events <- run_genetic_drift(init.AA=10, init.Aa=10, init.aa=0, steps=10000, lambda=3, mu=1) plot(Events$CummulativeTime, Events$AA.freq, type = "s", xlab = "Time (hundreds of years)", ylab = "Count of Genotypes", main = "Genotype Count", col="blue", ylim=c(-1,1000)) lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green") legend(1, 900, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=1) plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople, type = "s", xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes", main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3)) lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green") legend(6, -0.1, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=0.75) ``` The genotypes oscilate in the beginning of the simulation and then look to converge to a stationary distribution again. I ran this simulation many other times with different seeds and this pattern occured each time. We can approximate the stationary distribution of each genotype by finding the long run distribution of each genotype in this simulation: ```{r} summary_AA.freq <- Events %>% group_by(AA.freq/TotalPeople) %>% summarise(total_wait_time = sum(WaitTime)) binned_summary_AA.freq <- summary_AA.freq %>% mutate(new_bin = cut(`AA.freq/TotalPeople`, breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))) binned.freqs <- binned_summary_AA.freq %>% group_by(new_bin) %>% summarise(Frequency = sum(total_wait_time)) binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime) ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) + geom_bar(stat="identity", fill="red") + labs(title="Long Run Frequency Distribution of AA", x="Frequency of allele in population", y="Percentage of time in this state") ``` ```{r} summary_Aa.freq <- Events %>% group_by(Aa.freq/TotalPeople) %>% summarise(total_wait_time = sum(WaitTime)) binned_summary_Aa.freq <- summary_Aa.freq %>% mutate(new_bin = cut(`Aa.freq/TotalPeople`, breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))) binned.freqs <- binned_summary_Aa.freq %>% group_by(new_bin) %>% summarise(Frequency = sum(total_wait_time)) binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime) ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) + geom_bar(stat="identity", fill="blue") + labs(title="Long Run Frequency Distribution of Aa", x="Frequency of allele in population", y="Percentage of time in this state") ``` ```{r} summary_aa.freq <- Events %>% group_by(aa.freq/TotalPeople) %>% summarise(total_wait_time = sum(WaitTime)) binned_summary_aa.freq <- summary_aa.freq %>% mutate(new_bin = cut(`aa.freq/TotalPeople`, breaks=c(-0.00001, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))) binned.freqs <- binned_summary_aa.freq %>% group_by(new_bin) %>% summarise(Frequency = sum(total_wait_time)) binned.freqs$Frequency <- binned.freqs$Frequency/ max(Events$CummulativeTime) ggplot(data=binned.freqs, aes(x=new_bin, y=Frequency)) + geom_bar(stat="identity", fill="green") + labs(title="Long Run Frequency of aa", x="Frequency of allele in population", y="Percentage of time in this state") ``` ## Modeling a Mutation When a mutation happens in a population, one allele in a person is changed to be a different version of a gene. We can model this by making only one member of the population be Aa and the other members be AA. ```{r} set.seed(3) Events <- run_genetic_drift(init.AA=10, init.Aa=1, init.aa=0, steps=500, lambda=3, mu=1) plot(Events$CummulativeTime, Events$AA.freq, type = "s", xlab = "Time (hundreds of years)", ylab = "Count of Genotypes", main = "Genotype Counts", col="blue", ylim=c(-1,40)) lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green") legend(0.5, 35, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=1) plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople, type = "s", xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes", main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3)) lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green") legend(6, -0.1, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=0.75) ``` In this simulation, the proportion of people with the mutated gene increases but eventually the mutation disappears completely from the popluation. We can run this simulation again to see if this happens every time: Note the scale change on the axes of this plot ```{r} set.seed(7) Events <- run_genetic_drift(init.AA=10, init.Aa=1, init.aa=0, steps=10000, lambda=3, mu=1) plot(Events$CummulativeTime, Events$AA.freq, type = "s", xlab = "Time(hundreds of years)", ylab = "Count of Genotypes", main = "Genotype Frequencies", col="blue", ylim=c(-1,1000)) lines(Events$CummulativeTime, Events$Aa.freq, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq, type="s", col="green") legend(0.5, 950, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=1) plot(Events$CummulativeTime, Events$AA.freq/Events$TotalPeople, type = "s", xlab = "Time (hundreds of years)", ylab = "Frequency of Genotypes", main = "Genotype Frequencies", col="blue", ylim=c(-0.5,1.3)) lines(Events$CummulativeTime, Events$Aa.freq/Events$TotalPeople, type="s", col="red") lines(Events$CummulativeTime, Events$aa.freq/Events$TotalPeople, type="s", col="green") legend(6, -0.1, legend=c("AA", "Aa", "aa"), col=c("blue", "red", "green"), lty=1, cex=0.75) ``` In this simulation, the mutation did not disappear from the population and instead looks to converge to some stationary distribution although the frequency of the genotypes doesn't look to be totally constant at the end of the simulation. # Modeling inbreeding in a small population Inbreeding is known to be detrimental to a population because it leads to homozygosity of alleles. (McDaniel, 2011) If an allele is heterozygous (Aa), then the phenotype that is expressed is either the one corresponding to the dominant allele (A) or a mixture of the alleles (this is called codominance (Codominance, 2024)). Homozygosity of recessive alleles (aa) can sometimes be dangerous when the phenotype expressed by the aa genotype combination is a harmful disease. For example, the gene for sickle cell anemia is represented by the recessive a allele. If a person has the AA genotype, they have regularly shaped red blood cells. If a person has the Aa genotype, they have mild sickle cell anemia but are protected from malaria. If a person has the aa genotype, they have severe sickle cell anemia. (Helzner, 2023) This is why increasing the population of recessive homozygotes can be harmful to a population. This can occur when there is a large amount of inbreeding in a population. The following simulation is a simulation of inbreeding in a population of 2 which has two children every generation who then mate with eachother and have 2 children, etc. This can be modeled by the following state diagram of the alleles of the populations: These probabilities come from the original punnett squares and calculating the probabilities of having two children of a certain set of alleles. ```{r} knitr::include_graphics("InbreedingStateDiagram.png") ``` The corresponding transition matrix is: ```{r} P <- t(matrix(c(1, 0, 0, 0, 0, 0, 0.25, 0.5, 0, 0.25, 0, 0, 0, 0, 0, 1, 0, 0, 0.0625, 0.25, 0.125, 0.25, 0.25, 0.0625, 0, 0, 0, 0.25, 0.5, 0.25, 0, 0, 0, 0, 0, 1), nrow = 6)) ``` ```{r} # Code by Dr. Ross shape_transition_matrix_to_plot <- function(P, state_names = NULL) { n_states = nrow(P) if (is.null(state_names)) state_names = 1:n_states as.data.frame.table(t(P), responseName = "transition_probability") |> mutate_if(is.factor, as.integer) |> rename(after_state = Var1, before_state = Var2) |> mutate(after_state = state_names[after_state], before_state = state_names[before_state]) |> relocate(before_state) |> mutate(before_state = factor(before_state), after_state = factor(after_state)) |> mutate(before_state = fct_rev(before_state)) } plot_transition_matrix <- function(P, state_names = NULL, n_step = 1) { n_states = nrow(P) if (is.null(state_names)) state_names = 1:n_states P_plot = shape_transition_matrix_to_plot(P, state_names) P_plot |> ggplot(aes(x = after_state, y = before_state, fill = transition_probability, label = sprintf("%0.3f", round(transition_probability, 3)))) + geom_raster() + geom_text(aes(color = after_scale(best_contrast(fill, c("white", "black")))), show.legend = FALSE) + scale_fill_viridis(option = "magma", limits = c(0, 1)) + scale_x_discrete(expand = c(0, 0), position = "top") + scale_y_discrete(expand = c(0, 0)) + labs(x = "Next state", y = "Current state", title = paste(n_step, "-step transition matrix", sep =)) } state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa") plot_transition_matrix(P, state_names) ``` ```{r} # Code by Dr. Ross simulate_single_DTMC_path <- function(initial_distribution, transition_matrix, last_time){ n_states = nrow(transition_matrix) # number of states states = 1:n_states # state space X = rep(NA, last_time + 1) # state at time n; +1 to include time 0 X[1] = sample(states, 1, replace = TRUE, prob = initial_distribution) # initial state for (n in 2:(last_time + 1)){ X[n] = sample(states, 1, replace = TRUE, prob = transition_matrix[X[n-1], ]) } return(X) } ``` ```{r} # Code by Dr. Ross simulate_DTMC_paths <- function(initial_distribution, transition_matrix, last_time, n_paths = 1) { # Columns of output: # - n = time (possible values 0:last_time) # - path (possible values 1:n_paths) # - X = state (possible values 1:n_states); n_states is nrow(P) replicate(n_paths, simulate_single_DTMC_path(initial_distribution, transition_matrix, last_time)) |> as.data.frame() |> mutate(n = 0:last_time) |> pivot_longer(cols = !n, names_to = "path", values_to = "X") } ``` ```{r} # Code by Dr. Ross set.seed(1) plot_DTMC_paths <- function(initial_distribution, transition_matrix, state_names, last_time, n_paths = 1) { n_states = nrow(transition_matrix) X_df = simulate_DTMC_paths(initial_distribution, transition_matrix, last_time, n_paths) if (n_paths > 1) { X_df = X_df |> # X is jittered vertically; for jitter need continuous X # but X needs to be a factor for discrete colors mutate(X_jitter = jitter(X), n_jitter = jitter(n)) } else { X_df = X_df |> mutate(X_jitter = X, n_jitter = n) } alpha_value = min(1, 1 / log10(n_states * n_paths * last_time)) path_plot <- X_df |> ggplot(aes(x = n_jitter, y = X_jitter, group = path)) + geom_point(aes(col = factor(X)), alpha = alpha_value) + scale_color_viridis(discrete = TRUE, labels = paste(state_names, " (", 1:n_states, ")", sep = ""), guide = guide_legend(reverse = TRUE)) + geom_line(col = "gray", alpha = alpha_value) + scale_x_continuous(breaks = 0:last_time) + scale_y_continuous(breaks = 1:n_states) + labs(x = "Time (generation)", y = "Population Genotypes", col = "State", title = paste(n_paths, "sample paths with starting population Aa Aa")) + theme_bw() + theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_line(colour = "black"), axis.text.x = element_text(angle = 90, size = 8)) + # coord_cartesian is for aligning with marginal bar plot coord_cartesian(xlim = c(0, last_time)) path_plot } initial_distribution <- c(0,0,0,1,0,0) state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa") plot_DTMC_paths(initial_distribution, P, state_names, 20, 10) ``` AA AA and aa aa are absorbing states, which every population eventually goes to. Populations with high rates of inbreeding have high rates of homozygotes. This emphasizes the populations going to the absorbing states quickly when there is a small inbreeding population. We can calculate the mean number of generations to get to one of these absorbing states when we start at each population geonotype ```{r} # Code by Dr. Ross mean_time_to_absorption <- function(transition_matrix, state_names = NULL) { absorbing_states = which(diag(transition_matrix) == 1) if (length(absorbing_states) == 0) stop("There are no absorbing states.") n_states = nrow(transition_matrix) transient_states = setdiff(1:n_states, absorbing_states) Q = transition_matrix[transient_states, transient_states] mtta = solve(diag(nrow(Q)) - Q, rep(1, nrow(Q))) if (is.null(state_names)) state_names = 1:n_states data.frame(start_state = state_names[transient_states], mean_time_to_absorption = mtta) } state_names <- c("AA AA", "AA Aa", "AA aa", "Aa Aa", "Aa aa", "aa aa") mean_time_to_absorption(P, state_names) ``` When inbreeding occurs especially in a small population, there is a high probability that the population will converge to one of these absorbing states. And we can see that this happens quickly in a small population. This is why random breeding is important the the health of a population. It also emphasizes why genetic variation comes from the mating of different genotypes. Genetic drift and different scenarios in genetics and evolution can be well modeled by Markov chains because a child's genotype only depends on their parent's genotype and when this genotype is known, the child's genotype is conditionally independent of generation's before the last. # Bibliography Balin, S. J., & Cascalho, M. (2009). The rate of mutation of a single gene. Nucleic Acids Research, 38(5), 1575–1582. https://doi.org/10.1093/nar/gkp1119 Codominance. (2024, March). Genome.gov. https://www.genome.gov/genetics-glossary/Codominance#:~:text=%E2%80%8BCodominance&text=Codominance%2C%20as%20it%20relates%20to Gleichmann, N. (2020). Gene Vs Allele: Definition, Difference and Comparison. Neuroscience from Technology Networks. https://www.technologynetworks.com/neuroscience/articles/gene-vs-allele-definition-difference-and-comparison-331835 Helzner, C. (2023, November). Heterozygote Advantage | Definition & Examples. https://study.com/academy/lesson/heterozygote-advantage-example-lesson-quiz.html#:~:text=When%20a%20single%20copy%20of,a%20deadly%20disease%20in%20homozygotes. McDaniel, B. T. (2011). Inbreeding - an overview | ScienceDirect Topics. Www.sciencedirect.com. https://www.sciencedirect.com/topics/immunology-and-microbiology/inbreeding#:~:text=Inbreeding%20generally%20has%20deleterious%20effects Nature Education. (2014a). phenotype / phenotypes | Learn Science at Scitable. Www.nature.com. https://www.nature.com/scitable/definition/phenotype-phenotypes-35/ Nature Education. (2014b). random genetic drift / genetic drift | Learn Science at Scitable. Nature.com. https://www.nature.com/scitable/definition/genetic-drift-201/