Replication of Origin

Author

Skylar Pietz

Intro to R for Bioinformatics Notebook

In this intro notebook, I will gain familiarity with the folowing:

  • Using the arrow (<-) operator to store objects in variables

  • Using R’s sample() function to create a random “genome” for the purpose of testing functions

  • Building functions to process and analyze genetic data, learning about for loops and implementing them when necessary.

  • Reading in a real bacterial genome from a text file.

  • Analyzing the genome using the ideas and tools developed in the first part of the notebook.

Challenge 1

# Challenge 1

nucleotides <- c("A", "C", "G", "T")
nucleotides
[1] "A" "C" "G" "T"

Challenge 1 involved creating a list of the four nucleotides A, C, G, and T, using the c() function.

Challenge 2

# Challenge 2

Genome <- 15 
randGenome <- sample(nucleotides, size = Genome, replace = TRUE)
randGenome
 [1] "G" "A" "T" "C" "C" "C" "T" "T" "A" "C" "T" "A" "C" "T" "A"
randGenome <- paste(randGenome, collapse = "")
randGenome
[1] "GATCCCTTACTACTA"

In Challenge 2 I used sample() to create a random string of 15 nucleotides, called randGenome. The paste() method was used to collapse the genome of nucleotides down to a single string, so it’s easier to read. When I compared my collapsed genome with a classmate, I noticed that we generated very different genomes.

Challenge 3

# Challenge 3

Genome <- 1500 
randGenome <- sample(nucleotides, size = Genome, replace = TRUE)
randGenome <- paste(randGenome, collapse = "" )
randGenome
[1] "ACCATAGTGCACTGGATAGCGCGCCGGACATTTATCCAAAACGGAACCAAGCATATGCGCGGCGGAAAAGTATCCGCCAGTACCTAGTATTTTGCTCAAGCAGGTATCTGCTACCGCGAGCATGGGAAAGATCCTCGTTTCGTTAGAGGGGCGATGTGACCATGCAGACTCAATACAGCACTCCCGAGTGAGTAAAGTGCCCCACCTGTTGAGTGCCATTTTGAACTGCCGTGGGAATTATCGGCTGGGTATCTCTCTTATAACTGGATGCAGATAGGCGTCGTCCCTGATTTGAAATCCGTAACTAATCACAGGTGTATTTACCATGAGAGACGTCGGAGCGTTATATGCCAGCAAAGATCAACAAAAAACAAATACGCAAACTGTTCATTGATCGCTATACCAATTTAAGTCGGCGACAAGGCCACCGCTAGACGGTAGCACTCTCAGTCATTCGATCCGAAGACTACACAACGGGAGTTATGGCCCATAGACGTTTTCAACCCCAAGTAATTGATACAGATTTCTCGAACCAAAACGCCGAAAAATAAAGCTAACATCGTCACAGCTAGCGCCATTCAGACCGTGCACACGCACGGCTATGAGTTGTGATGACAATATCTCTGCCCCTCGCGTGAAAGACTATAATGAGTCGAGAGCCTTGAATTTGTGATATCAAGCCCGTGCTACGTCACCTCTTCGGGCAAACGTTCCTAGCTCTTGTTGCTTTTAGTCGCCAGTGCCCAATGGTCCAACAGAAACGTCTAGGGCGCGACAATACTTAATACCATGAGCTGTGGGAGCAACACTCACGCGTCTCCGAAAGAGAAGCCTTAAGGCTTTAATTTAGCGAAGCACCCGTGGTTGAAAACACTAGGCCCAGTTCTAAATTTCAAAAACCTAACGACGACAGCGGGGCATGAGAGGTACTCTCCTCATAACTGCATCTCTGGTCTCATAAACTACATGAGCGGGTCACTTAAGTTGACCTCAAGACTTCTGTGATCGGATGGTCCTCGGCCACAACCTGGGAACAAGAAAGGGTACAGGACAAGTGATTTAGAGTGGCGATCGACATGCCACAAGTCGTGATAAGGGGGGGATCATAGGTCGATTAGTTCAGCACCGGTGCGCAAAATAGCAATAATGCCCACATAGGGGCGTCGTGTAGCCCGGGACGTTAGCAAATTCATCCACCATCATCTTAGAGTGGTAAGCACTGGAGCATTTTCGACGCCCTTGAATTCTCTGTAGAGGACTGCCTACGTTGAAAGTTTCGGTCAGAGTGCCACGTGGTCGAGGCTACGAGCGAATCTCAAGCCCAAAAAGACTCTCAGGCCCTAGTGTGCAACGGTACGTCGTCGGAATTTGGAGGTCTAAAGATGTGGTATCGGGCGGAACAGATAACTTTATTCGATCGCGAGACTGTGATGTACGTATTTAACGCAATACTGGGAGCGTCTCTAGCAGGCCACTGACAGTGATCAGTTGAGGGCTTTG"
# Challenge 3 continued

set.seed(215)
genomeLength <- 1500

randGenome <- sample(nucleotides, size = genomeLength, replace = TRUE)

randGenome <- paste(randGenome, collapse = "")
randGenome
[1] "TGGAATCTTTAATGTTCCGACCTTTGAGCCTCAAGTTGGACTACCGCGGCCCCCGACCGAGAGCTAACTCTAAATGATGCAAGCAGCGCGTGCCGAGTCTCTGTGGACAGGATCACATTGGATGTGAGCCAAGATCTCGAGATGAAAAACAAGTAGGTTCTCTATCGCTGGCTCCAATCATTTTTATCCTTTGCTGCATACTTATGGCCCTCCAACCTTGCGAATTGCGGCATTACGTTCTATGCACGAAAGAGGCACGAGATGTATATCACTCTTCAGCTATCTGACACAGGTAGGCATTGGATGTCAGCGTACGACCGGGGTAGGCAACCCCTTCTGTTGCCCCGCTGGCCGGATGCCAAGGTGTTTTACATCGGTATTTTAAAGATGTGAACTTAAGTAACCTACTTAGCTGACGGGAAGGTGCACATGTATGTGTGACTCTACACCAAGATGCACTTAGGCATCAATAAAAGTTGCCGGTTTGTAATCCTTGACAGTAATCGAGATAATTACTTGCGGGCACATAACCTACTCGGTTCGTCCCCCTTAGCGTACGGGGGGGTGGGGAACCATCAGCGTTCGTATGTTAGTCCTCCGGCATTTTTGGTCACGGCCCTCCATGAAATACATACTAGCGTGCCATTCCGGTCAAATGGCCCTGCGAAGGATGCGGTGCGGTAAGCGTGTGGGCCATACTTGGGCAGATTGAGTTATAAGAAACAGGCAAATTGGGAACTAGTGCGGCACATGTACAGCTCGCACTTCCTGGGGGGCGACTACCGAATTCCTAGTATGTCTGAGTAGCGTACCGGGCCAGCTTTGGGTCCTCTTACCTTGTAATGGGATCGGCCTTATGGGATGCCATGACGACACTGCGGGAGGTTAATAAGTGTCCGCGATAAAGCCTCGTGGTCGAAATCTGTTACGTGCTCTTCTGTCATTTGCCATATTGGTAAAGTTCAAGAGCCACCCTATCTATTGACTCCTAGATCTCCCAAATGTATCCTCCAACAGTACTTCACCTGGATGGTAACCCCACCTCAATCACCACTTAGGTCTCCAGTACTTTTGAGCCGCCTACGCGCTCTCGAAATAAGCTAAGCTAGCTACGGTCCGTGTTAGGCGAAAAATGAATCCCTTTCCAGCGCTACCAAGTGTAGTCGGATGCGGTTGACTTATGCCTGCCCGGTGAGCAGAATTTAAATCGCTGTACCAATCAGCACTACCACAAGGCGAACGGGGCAGCCACGACTGGCGCCACAAGCATGGAAGGAGTGTCGGGCGAAATCAGGGCCGCATTAAGCCGTTTTATCACGGGTAAAGAGGTCACATGTGCACAGAGCCAGATAGTCGGCAGTGAAGGGCACAACGCTTTTTCGGCAGGAAGACCCTTGCACCTTTATCACACCGCGGGTCATCACCGGCCAAAGCAGTCGGCGGTCAACGTTAAAATTATTGCATGTCTAGGTAAAGCAGGACCAGAGAGACGTCGAAATC"

Challenge 3 consisted of using code to generate a random genome of 1500 nucleotides long. Once again, I used the paste() method in code to collapse it from a list down to a single string. I also added another line, using set.seed(215) to initialize a random number generator with the seed 215. And once again, the paste() function was used to collapse the list down to one single string.

Challenge 4

# Challenge 4

set.seed(215)
Genome <- 100
randGenome <- sample(nucleotides, size = Genome, replace = TRUE)
randGenome <- paste(randGenome, collapse = "")
randGenome
[1] "TGGAATCTTTAATGTTCCGACCTTTGAGCCTCAAGTTGGACTACCGCGGCCCCCGACCGAGAGCTAACTCTAAATGATGCAAGCAGCGCGTGCCGAGTCT"

Here, I used set.seed(215) to generate a random genome (randGenome) consisting of 100 nucleotides, collapsed down to a single string with the paste() function. I counted the frequency of Adenine (A) in the resulting “genome” above and got 23.

for Loops

mySum <- 0

for(i in 1:10){
  mySum <- mySum + i
  print(mySum)
}
[1] 1
[1] 3
[1] 6
[1] 10
[1] 15
[1] 21
[1] 28
[1] 36
[1] 45
[1] 55

Before starting Challenge 5, I began learning different ways to run a new programming technique called the for loop. In programming, a for loop allows us to run a set of instructions over and over for some predetermined number of iterations. Doing so will help me create loops to analyze genomes.

Challenge 5

# Challenge 5 

myProduct <- 1
for(j in 1:15){
  myProduct <- myProduct * j
  print(myProduct)
}
[1] 1
[1] 2
[1] 6
[1] 24
[1] 120
[1] 720
[1] 5040
[1] 40320
[1] 362880
[1] 3628800
[1] 39916800
[1] 479001600
[1] 6227020800
[1] 87178291200
[1] 1.307674e+12

In Challenge 5, I wrote and executed my own for loop in order to analyze a genome.

Challenge 6

# Challenge 6

set.seed(215)
Genome <- 10
randGenome <- sample(nucleotides, size = Genome, replace = TRUE)
randGenome <- paste(randGenome, collapse = "")
randGenome
[1] "TGGAATCTTT"
for(j in 1:nchar(randGenome)){
  print(str_sub(randGenome, start = j, end = j))
}
[1] "T"
[1] "G"
[1] "G"
[1] "A"
[1] "A"
[1] "T"
[1] "C"
[1] "T"
[1] "T"
[1] "T"

Flow Control

Challenge 6 involved generating a random genome substring consisting of 10 nucleotides and using the paste() method to collapse the genome to a single string rather than a list. Next, I wrote a for loop to print out each individual nucleotide instead of the entire thing. Next, I added the 1:nchar() function to run the for loop through all nucleotides in the string. The str_sub() function was used to extract individual nucleotides. By using str_sub() and using j as both the start and end for subsetting, I was able to have the for loop extract each individual nucleotide.

Challenge 7

nucleotides <- c("A", "C", "G", "T")
genomeLength <- 10

randGenome <- paste(
  sample(nucleotides, size = genomeLength, replace = TRUE),
                   collapse = "")
print(randGenome)
[1] "AATGTTCCGA"
for(i in 1:nchar(randGenome)){
  if(str_sub(randGenome, start = i, end = i) == "A"){
    print(str_sub(randGenome, start = i, end = i))
  }
}
[1] "A"
[1] "A"
[1] "A"
countA <- 0 
for(i in 1:nchar(randGenome)){
  if(str_sub(randGenome, start = i, end = i) == "A"){
    countA <- countA + 1
    print(countA)
  }
}
[1] 1
[1] 2
[1] 3

In Challenge 7, I created a genome 10 nucleotides long using the (randGenome) function. I then adapted the for loop in order to count the number of Adenine (A) in randGenome. Using these functions, I was able to determine that there were 3 Adenine.

Challenge 8

countG <- 0 
for(i in 1:nchar(randGenome)){
  if(str_sub(randGenome, start = i, end = i) == "G"){
    countG <- countG + 1
    print(countG)
  }
}
[1] 1
[1] 2
countC <- 0 
for(i in 1:nchar(randGenome)){
  if(str_sub(randGenome, start = i, end = i) == "C"){
    countC <- countC + 1
    print(countC)
  }
}
[1] 1
[1] 2
countT <- 0 
for(i in 1:nchar(randGenome)){
  if(str_sub(randGenome, start = i, end = i) == "T"){
    countT <- countT + 1
    print(countT)
  }
}
[1] 1
[1] 2
[1] 3

Vibrio Cholerae Chromosome DNA

vib_c <- scan("VibrioCholerae.txt", what = "character", sep = NULL)

Challenge 8 was similar to Challenge 7 in that I created a randome nucleotide and used the for loop to count the frequencies of each of the four individual nucleotides. I then was able to perform these functions on a real genome from the Vibrio Cholerae chromosome DNA, consisting of 1,108,250 nucleotides. I read this large genome into R with the scan() function.

Challenge 9

countA <- 0 
for(i in 1:nchar(vib_c)){
  if(str_sub(vib_c, start = i, end = i) == "A"){
    countA <- countA + 1
  }
}

print(countA)
[1] 293942
countG <- 0 
for(i in 1:nchar(vib_c)){
  if(str_sub(vib_c, start = i, end = i) == "G"){
    countG <- countG + 1
  }
}

print(countG)
[1] 256024
countC <- 0 
for(i in 1:nchar(vib_c)){
  if(str_sub(vib_c, start = i, end = i) == "C"){
    countC <- countC + 1
  }
}

print(countC)
[1] 263573
countT <- 0 
for(i in 1:nchar(vib_c)){
  if(str_sub(vib_c, start = i, end = i) == "T"){
    countT <- countT + 1
  }
}

print(countT)
[1] 294711

In Challenge 9, I was able to use the code from Challenge 8 to count the frequency of each nucleotide in the Vibrio Cholerae chromosome. After doing this, the frequency of adenine was 293942, Guanine was 256024, Cytosine was 263573, and Thymine was 294711.

Challenge 10

rosalinddata <- scan("rosalind_dna.txt", what = "character", sep = NULL)
countA <- 0 
for(i in 1:nchar(rosalinddata)){
  if(str_sub(rosalinddata, start = i, end = i) == "A"){
    countA <- countA + 1
  }
}

print(countA)
[1] 223
countG <- 0 
for(i in 1:nchar(rosalinddata)){
  if(str_sub(rosalinddata, start = i, end = i) == "G"){
    countG <- countG + 1
  }
}

print(countG)
[1] 202
countC <- 0 
for(i in 1:nchar(rosalinddata)){
  if(str_sub(rosalinddata, start = i, end = i) == "C"){
    countC <- countC + 1
  }
}

print(countC)
[1] 233
countT <- 0 
for(i in 1:nchar(rosalinddata)){
  if(str_sub(rosalinddata, start = i, end = i) == "T"){
    countT <- countT + 1
  }
}

print(countT)
[1] 201

In Challenge 10, I used code once again to scan in another genome from the rosalind website (similar to what I did in the previous challenge with the Vibrio Cholerae genome). I then used the code from Challenge 9 in order to calculate the frequency of each nucleotide in the genome string. I determined the frequency of Adenine in the Rosalind DNA genome to be 223, Guanine was 202, Cytosine was 233, and Thymine was 201. I got the Rosalind Challenge correct!

Summary

In all, I learned a lot throughout this notebook. Including:

  • Created a list of nucleotides

  • Used the sample() function to create a random “genome” using the characters found in the nucleotides list.

  • Used paste() with the parameter setting collapse = "" to collapse my random genome into a single long string of nucleotides.

  • Used set.seed() to set a seed for a random number generator to make my results reproducible over time and across different machines (and different researchers too).

  • Used a for loop to repeat a set of simple instructions for a pre-determined number of iterations.

  • Used programmatic flow control in the form of if/else if/else statements to run code only when a particular condition is satisfied.

  • Constructed a for loop and used if statements to count the frequency of each of the four nucleotides in a genome string.

  • Solved a bioinformatics problem on the rosalind platform!

Finding the Replication Origin, Part 1

nucleotide_frequency <- function(genomeString, nucleotide = "A"){
  count <- 0
  for(i in 1:nchar(genomeString)){
    if(str_sub(genomeString, start = i, end = i) == nucleotide){
      count <- count + 1
    }
  }
  return(count)
}

nucleotide_frequency("ACTTGCGGGTATCGAG", "G")
[1] 6

Challenge 1

genomeLength <- 2000

randGenome <- sample(nucleotides, size = genomeLength, replace = TRUE)

randGenome <- paste(randGenome, collapse = "")
randGenome
[1] "CCTTTGAGCCTCAAGTTGGACTACCGCGGCCCCCGACCGAGAGCTAACTCTAAATGATGCAAGCAGCGCGTGCCGAGTCTCTGTGGACAGGATCACATTGGATGTGAGCCAAGATCTCGAGATGAAAAACAAGTAGGTTCTCTATCGCTGGCTCCAATCATTTTTATCCTTTGCTGCATACTTATGGCCCTCCAACCTTGCGAATTGCGGCATTACGTTCTATGCACGAAAGAGGCACGAGATGTATATCACTCTTCAGCTATCTGACACAGGTAGGCATTGGATGTCAGCGTACGACCGGGGTAGGCAACCCCTTCTGTTGCCCCGCTGGCCGGATGCCAAGGTGTTTTACATCGGTATTTTAAAGATGTGAACTTAAGTAACCTACTTAGCTGACGGGAAGGTGCACATGTATGTGTGACTCTACACCAAGATGCACTTAGGCATCAATAAAAGTTGCCGGTTTGTAATCCTTGACAGTAATCGAGATAATTACTTGCGGGCACATAACCTACTCGGTTCGTCCCCCTTAGCGTACGGGGGGGTGGGGAACCATCAGCGTTCGTATGTTAGTCCTCCGGCATTTTTGGTCACGGCCCTCCATGAAATACATACTAGCGTGCCATTCCGGTCAAATGGCCCTGCGAAGGATGCGGTGCGGTAAGCGTGTGGGCCATACTTGGGCAGATTGAGTTATAAGAAACAGGCAAATTGGGAACTAGTGCGGCACATGTACAGCTCGCACTTCCTGGGGGGCGACTACCGAATTCCTAGTATGTCTGAGTAGCGTACCGGGCCAGCTTTGGGTCCTCTTACCTTGTAATGGGATCGGCCTTATGGGATGCCATGACGACACTGCGGGAGGTTAATAAGTGTCCGCGATAAAGCCTCGTGGTCGAAATCTGTTACGTGCTCTTCTGTCATTTGCCATATTGGTAAAGTTCAAGAGCCACCCTATCTATTGACTCCTAGATCTCCCAAATGTATCCTCCAACAGTACTTCACCTGGATGGTAACCCCACCTCAATCACCACTTAGGTCTCCAGTACTTTTGAGCCGCCTACGCGCTCTCGAAATAAGCTAAGCTAGCTACGGTCCGTGTTAGGCGAAAAATGAATCCCTTTCCAGCGCTACCAAGTGTAGTCGGATGCGGTTGACTTATGCCTGCCCGGTGAGCAGAATTTAAATCGCTGTACCAATCAGCACTACCACAAGGCGAACGGGGCAGCCACGACTGGCGCCACAAGCATGGAAGGAGTGTCGGGCGAAATCAGGGCCGCATTAAGCCGTTTTATCACGGGTAAAGAGGTCACATGTGCACAGAGCCAGATAGTCGGCAGTGAAGGGCACAACGCTTTTTCGGCAGGAAGACCCTTGCACCTTTATCACACCGCGGGTCATCACCGGCCAAAGCAGTCGGCGGTCAACGTTAAAATTATTGCATGTCTAGGTAAAGCAGGACCAGAGAGACGTCGAAATCCATCGCCTGGTCGCCGTAATTAACGGCTCGAGTAAAACAAGATTAAACCCTGGAGCTTACATCCATAAAATTGGATGAGGGCGACGTATTCGACATCCATTCTTCTATGATGGCTAGCACAAACACCACCGGATGTACGGGCCTCCATCCTAATGTCGACGATTCGAGGTTACGTCTAAGGCTCGGCTTACCTCTCCGTGCCTCCGATTAAACGATATAAAAAGGGGAACTACACTTGTTCTTTGTATCACGATGCGTTTACCTCAATTCAATAGGCCGCTTTTCGGCGACTGCCCGCAGATACCTACAATTTGGAGCTTGAATGCAAAGCTGGCTGTTGCCGGTCGTTGCACATGTCGGACTAGTGGCGTATAATGATTGCGTGGCACTTGTGACACATCTCAATAAGCTACTGATGACTAATACTCGTCATGTTGACAGGTGATAACGAGACCGAATCGAACATTGCCGTTTGGACGATAACTCTGGCTAACAGTAACGTCTGGCCATGCGGGGCACG"
nucleotide_frequency <- function(randGenome, nucleotide = "A"){
  count <- 0
  for(i in 1:nchar(randGenome)){
    if(str_sub(randGenome, start = i, end = i) == nucleotide){
      count <- count + 1
    }
  }
  return(count)
}
nucleotide_frequency("CATCGCCTGGTCGCCGTAATTAACGGCTCGAGTAAAACAAGATTAAACCCTGGAGCTTACATCCATAAAATTGGATGAGGGCGACGTATTCGACATCCATTCTTCTATGATGGCTAGCACAAACACCACCGGATGTACGGGCCTCCATCCTAATGTCGACGATTCGAGGTTACGTCTAAGGCTCGGCTTACCTCTCCGTGCCTCCGATTAAACGATATAAAAAGGGGAACTACACTTGTTCTTTGTATCACGATGCGTTTACCTCAATTCAATAGGCCGCTTTTCGGCGACTGCCCGCAGATACCTACAATTTGGAGCTTGAATGCAAAGCTGGCTGTTGCCGGTCGTTGCACATGTCGGACTAGTGGCGTATAATGATTGCGTGGCACTTGTGACACATCTCAATAAGCTACTGATGACTAATACTCGTCATGTTGACAGGTGATAACGAGACCGAATCGAACATTGCCGTTTGGACGATAACTCTGGCTAACAGTAACGTCTGGCCATGCGGGGCACGGCATGTAATTTTTGGGGACTGCATGAAATCACTTTAGGATAGTAGATTAGTACTAGTAAGGGCAGGTGACTATAGGGGCCGAGTCTCCGCAACAGTAAATGACATTAGAAGACAAGATGCAACGAGGGGTGGAGGGTCCATGAATGCAGGGCTAGTCGCTACTAGGTCGCTTCAGTAGGAAGCCTTTACTCGCTGGAGTGATAAACTAGACAGCCCACTGCGCCTGATAGCACGTACATAGGGATATGCGTTAATGGGCGCACACGAACAAGTAAACGCTAAGCACGCCCATATTTAAGCTGGCGCCGGTAGGAATATAGTAGAGGACATTCGTTGCAACCCAGTGCAATTCTCATGCATGCTGTGTTGCAACACGAAGGACCGCGCCAGTAAGGGATTCTCCGATCAAATCGTGCTTATCCGAAGCCATGTAGGGAAGCACGGAAGAACGATGGTGGGAAGAACTCATGCATCATATATCAGTTCGTTACTACCGGGGCTATAGTCGGAAAGAATTGGATAGAACCCGACCTGGTAGTGCCCTCCGGTTCTGCTCTCCGGCACGAATCGGGTAGTTCGGAAGGCATTGTTTAAGAATTACGAATATGCACCAAATTCGAGAATCTGCACCATGGCCCAAGTTACAAATCTACAGCATTGGAACCTGGGGAAAATACAAGTTTTGTGATTAGGCTGGCGCCACTCCAACAGAGTTTCGGACTCAGGATAACAACTTCAAGACGGGCCACGGATGAGGCAATCTGAAATAAGCCGCCTGTAGAGACGTAAGCTAGGCAGAGCCCTACGCGGCGAGATGTGCAAGCCCTCGTGTCCGGTCCATCGCCCCCTCCGATATATAACCAATCAAGATCAATTGTATCGCTGGTACTGCCGTGAATCCGTTAATACGCCGGGGCTCTATTTATAAAGGAGTATTTATATCCTGGGATAGCGACCGGGAAGCCAATGTATAGGGACTTAAAATTGTTTAAAGACGAGATAGAAGATAGGATGCCGATTCTTTTAATCCCATGATGGTTCCCATTCGAGATTGGGAAACAGTCAAAGGCAGGTCTGGGCGAGTCGGGGATATACGCTTGGCGCAATGGGATCGTTTTAACGTGTAGCACGCATTTCTTTACCTTTGATGAACACATAGCTGATAGCAGTAGAACTTTCCCGGTTCATGAATTATCTGGCTGCCCGAACGGAGAAAGAATGTAGTATAATCTAGTCTGATAAGGAGTTTTATAGTAAGCGTCGGCCATACATCCACGTGGTGTTTGGCCGCCGTGGTTGCATGCTGCTCACTTTTCAGAGCTCGCACTCCGGTTAACGTGTGGATCACATGGGTGGAACGATGAGCAGCGACCACGCCCAGAAATCCGCACGGAACAGCGATCAAGCGGAACACTTATGTTACCTTAGCTGTACTCGCTGATAAGGCACGGACGCCAGCAATTACTACGTTTGTGATAGA", "C")
[1] 453

In the first challenge, I used the sample() function in conjunction with the paste(..., collapse = "") method to generate a random genome of length at least 2000. I then used the nucleotide_frequency() function to count the frequency of Cytosine in the random genome I constructed. I was able to determine that Cytosine appeared 453 times.

If I don’t provide a second argument to the nucleotide_frequency() function, then it will count the total number of all nucleotides, not just C.

Challenge 2

rand_Genome <- function(k) { 
  nuceleotides <- c ("A", "C", "T", "G")

rand_Genome <- sample(nucleotides, size = k, replace = TRUE)

rand_Genome <- paste(rand_Genome, collapse = "")
return(rand_Genome)
}
rand_Genome (k=15)
[1] "GCATGTAATTTTTGG"

In challenge 2, I built a function rand_genome() that takes a single parameter k. This denotes the number of nucleotides in the genome I want to generate. My function returned a single genome string of length k.

Challenge 3

generate_3_mers<- function(genomeString) {
  list_3_mers <- c()
  for(i in 1:(nchar(genomeString) - 1)){
  list_3_mers <- list_3_mers %>%
  append(str_sub(genomeString, start = i, end = i + 2))
    }
  return(list_3_mers)
}
  list3mers <- generate_3_mers(rand_Genome(k=2000))

Challenge 3 involved adapting a loop above to walk through a longer string (a random genome) and storing consecutive substrings of 3 nucleotides. I built a function called generate_3_mers() to generate all of the substrings of 3 nucleotides (called 3-mers) in a genome string. I then used the rand_genome() function to construct a random genome of 2000 nucleotides in length. Next, the generate_3_mers() function was applied to compute a list of all the 3-mers in my random genome. My function collected a list of 3-mers. This function prooduced a total of 994 3-mers. It produced only 3-mers as final elements. No 2-mers or single nucleotide were produced.

tail(list3mers, n=25)
 [1] "CCT" "CTT" "TTT" "TTA" "TAT" "ATA" "TAT" "ATC" "TCC" "CCA" "CAA" "AAA"
[13] "AAA" "AAA" "AAC" "ACC" "CCT" "CTA" "TAC" "ACG" "CGC" "GCG" "CGC" "GCA"
[25] "CA" 

If I had a list of 25 3-mers from the tail, it will contain 24 3-mers and 1 2-mer.

Challenge 4

generate_k_mers <- function(genomeString, k=3) {
  list_codon <- c()
  
  for(i in seq(1, nchar(genomeString) - k + 1, by = k)) {
  list_codon <- list_codon %>%
  append(str_sub(genomeString, start = i, end = i + k - 1))
    }
  return(list_codon)
}
generate_k_mers(rand_Genome(9))
[1] "CGT" "AGT" "TTA"
genomeString <- rand_Genome(15)
generate_3_mers <- function(genomeString) {
  list_3_mers <- c()
  for(i in 1:(nchar(genomeString) - 2)){
    list_3_mers <- list_3_mers %>%
      append(str_sub(genomeString, start = i, end = i + 2))}
  return(list_3_mers)
}
genomeString
[1] "GCACTATCCTAGTGT"
generate_3_mers(genomeString)
 [1] "GCA" "CAC" "ACT" "CTA" "TAT" "ATC" "TCC" "CCT" "CTA" "TAG" "AGT" "GTG"
[13] "TGT"

In Challenge 4, I created an updated version of the generate_3_mers() function so that it can generate k-mers of any length I desire. K controls the number of nucleotides in each K-mer. This function then returns a list of all k-mers in genomeString.

Challenge 5

count_patterns <- function(string, pattern) {
  count <- 0
  
  for(i in 1: nchar(string)){
    if(str_sub(string, i, i + str_length(pattern)-1) == pattern){
      count = count + 1
    }
    }
  return(count)
  }
count_patterns(rand_Genome(2000), "CTG")
[1] 31

Rosalind Challenge

rosalind_string <- scan("rosalind_string.txt", what = "character", sep = NULL)
count_pattern <- function(myString, pattern){
  count <- 0
  k <- nchar(pattern)
  for(i in 1:nchar(myString)){
    if(str_sub(myString, start = i, end = i+k - 1) == pattern){
      count = count + 1
    }
  }
  return(count)
}
  
count_pattern (rosalind_string, "GTGGCCAGT")
[1] 22
rosalind_string <- "GGTGGCCATAAGTGGCCATAGTGGCCAGTGGCCATAAAAGTGGCCAGCGGTGGCCAAGCCGTGGCCAGCGTGGCCATTAAAGGGTGGCCACGTGAGTGGCCAGTGGCCAGTGGCCATTTTGTGGCCAGGTGTGGCCATAGTGGCCAGGTGGCCAGACGTGGCCAGGTGGCCACCTGGCTGTGGCCACGTGGCCAAGTGGCCAGGTGGCCAGTGGCCACGAGTGGCCAATAGTGGCCAGAGTGGCCATACCGTGGCCAGGTGGCCAACGTGGCCAGAAGTGGCCAGTGGCCAGTGGCCAAATGTGGCCAGTGGCCATGTGGCCAGGTGGCCACGGACAAGGTGGCCAGTGGCCAGCGGGTGGCCAGTGTGGCCAAGTGGCCAAGTGTGGCCAGTGGCCACTTATGGTGGCCAGTGGCCAAGTGGCCACCTATGTGGCCATATGGTGGCCATGTGGCCAAGTGGCCATGGTGGCCACGTGGCCAAGTGGCCACCTTAGGTTGGTGGCCAGCGGTGGCCAGTGGCCACAATCAATGTGGCCACGGTGGCCATTAAGTGGCCATATTGATGAGGCCCGGTGGCCAGGTGGCCAGTGGCCAAAGGGGTGGTGGCCATACGGTGGCCATTGTGGCCATTCGTGGCCAGGGTGTGGCCACCCGTGGCCAATGAGGTGGCCACGTGGCCAAAGATGTGGCCAACTGGAGTGGCCAATGTGGCCAAGTGGCCACCTCCACGTGGCCACTAGCGTGGCCATCACCGACCTGGTGGCCAGTGGCCAATGGTGGCCAGTGGCCAGTGGCCAGTGGCCAATACTGGGTGGCCAGTGGCCAGCGACGTGGCCATGTGGCCAGTGGCCACCTGTGGCCACGTGGCCACCAGTGGCCAGTGGCCAGTGGCCACAAAGTGGCCAGTGGCCAAGGTGGCCAATGCTGTGGCCATTCGAGTGGCCA"

In this last Challenge, I wrote a new function called count_pattern() that counts occurrences of a particular pattern within a larger genomeString. My function took two arguments, genomeString and the pattern I want to count occurrences of. My function returned the count of occurrences of the pattern within the genomeString. My function accomplished what I intended it to. It counted 22 patterns of “CTG” in my randome genome string that was 2000 nucleotides long. I then used this function to count the number of occurrences of the pattern in the genome string and submit the result to Rosalind. I solved the Rosalind Challenge! There were 22 patterns of GTGGCCAGT in the Rosalind DNA string.

Finding the Replication Origin, Part II

In this notebook, I will continue using code to determine the replication origin in a genome. DNA replication begins at a certain position in the genome called the origin. The origin consists of a specific set of DNA sequences that signal to the cell’s machinery the correct position to start replication. In bacteria, one of the proteins that helps with replication is called DnaA. This protein binds to sequences in the genome called DnaA boxes, which are strings of 9 nucleotides that are repeated more frequently around the origin of replication.

I am challenged to combine my generate_k_mers() and count_patterns() functions into a new function called generate_frequent_k_mers(). This should then build a list of the most frequent k-mers. My function should take two parameters, genomeString and k.

Challenge 1

generate_k_mers <- function(genome_String, k=3) {
  list_codon <- c()

  for(i in seq(1, nchar(genome_String) - k+ 1, by = k)) {
  list_codon <- list_codon %>%
  append(str_sub(genome_String, start = i, end = i + k-1))
    }
  return(list_codon)
}
generate_k_mers("ACACAGACATCCCACCCC")
[1] "ACA" "CAG" "ACA" "TCC" "CAC" "CCC"

Challenge 1 involves finding the most frequent words in a string. I began by practicing with a small genome and used the generate_k_mers function to count the 3-mers in that sequence. I notice that that 3-mer “ACA” is repeated. Next in this challenge, I figured out how to combine my generate_k_mers function with my count_ pattern function so that it will count and report the most frequently occurring k-mers. I considered what’s needed in this new function, called generate_frequent_kmers , by asking things like…

  • How can I use my existing generate_k_mers function to generate a list of k-mers of length k from a genomeString?

  • How can I use my count_pattern function to count this list of k-mers that I generated? This may involve a for loop!

  • How will I tell my function to return only the most frequent patterns?

find_frequent_kmers <- function(genome_String, k) {
  kmers <- generate_k_mers(genome_String, k)
  kmers <- unique(kmers)
  kmer_counts <- rep(0, length(kmers))
  for(i in 1: length(kmers)) {
    kmer_counts[i] <- count_pattern(genome_String, kmers[i])
  }
    max_freq <- max(kmer_counts)
    freq_kmers <- kmers[kmer_counts == max_freq]
    return(freq_kmers)
}

find_frequent_kmers("ACACAGACATCCCACCCC",3)
[1] "ACA" "CCC"

I read in the sample sequence from the Notebook guide and called it genomeString. I instructed R to report the 3-mers in the sequence. It reported the 3-mers without counting the repeated ACA twice. But it should just report “ACA” since this 3-mer is the most frequent.

generate_k_mers <- function(genome_String, k) {
  list_k_mers <- c()

  for(i in 1:(nchar(genome_String) - (k-1))){
  list_k_mers <- list_k_mers %>%
  append(str_sub(genome_String, start = i, end = i + k-1))
    }
  return(list_k_mers)
}
generate_k_mers("ACACAGACATCCCACCCC", k=3)
 [1] "ACA" "CAC" "ACA" "CAG" "AGA" "GAC" "ACA" "CAT" "ATC" "TCC" "CCC" "CCA"
[13] "CAC" "ACC" "CCC" "CCC"

Rosalind Word Problem

I will now test the ability of my function to try and solve the Rosalind Frequent Words Problem. Before downloading the whole data set, I will test my code on a sample data set first. I read in the Rosalind sequence and called it genome. I then used the find_frequent_kmers function to count the 4-mers.

SAMPLE PROBLEM

genome <- "ACGTTGCATGTCGCATGATGCATGAGAGCT"
find_frequent_kmers <- function(genome, k) {
  #Get the Kmers
  kmers <- generate_k_mers(genome, k)
  kmers <- unique(kmers)
  #Count occurences 
  kmer_counts <- rep(0, length(kmers))
  for(i in 1: length(kmers)) {
    kmer_counts[i] <- count_pattern(genome, kmers[i])
  }
  max_freq <- max(kmer_counts)
  freq_kmers <- kmers[kmer_counts==max_freq]
  return(freq_kmers)
  
}

find_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", k = 4)
[1] "GCAT" "CATG"

CHALLENGE PROBLEM

Here is the Rosalind sequence that I read in from the word problem. I called it rosalindGenome.

rosalindGenome <- "TCTCTCATAATAAGGAAGTGCTAAGAATAAGGAAAATAAGGAATCTCTCATGTGCTAAGGGGTCCTCTTCTCTCATAATAAGGAAAATAAGGAAAATAAGGAAAATAAGGAATCTCTCATTCTCTCATGGGTCCTCTTAGGCAGGGGGGTCCTCTAATAAGGAATAGGCAGGGGTGCTAAGGGGTCCTCTTCTCTCATAATAAGGAATAGGCAGGGGTGCTAAGGGGTCCTCTGTGCTAAGTAGGCAGGGGGGTCCTCTGTGCTAAGTCTCTCATGGGTCCTCTAATAAGGAATCTCTCATAATAAGGAAAATAAGGAATCTCTCATTCTCTCATGGGTCCTCTGGGTCCTCTTCTCTCATTAGGCAGGGTCTCTCATTAGGCAGGGGTGCTAAGGGGTCCTCTAATAAGGAAGGGTCCTCTAATAAGGAATAGGCAGGGAATAAGGAAGTGCTAAGTAGGCAGGGGGGTCCTCTTAGGCAGGGTCTCTCATGGGTCCTCTTCTCTCATGTGCTAAGAATAAGGAATAGGCAGGGAATAAGGAATCTCTCATAATAAGGAAAATAAGGAAGTGCTAAGTCTCTCATGGGTCCTCTGGGTCCTCTAATAAGGAAGTGCTAAGGGGTCCTCTGTGCTAAGTCTCTCATTAGGCAGGGTAGGCAGGGGGGTCCTCTTCTCTCATGGGTCCTCTGGGTCCTCTTAGGCAGGGTAGGCAGGGAATAAGGAATAGGCAGGGGTGCTAAGTCTCTCATAATAAGGAATAGGCAGGGTAGGCAGGGGGGTCCTCTGGGTCCTCTTAGGCAGGGAATAAGGAAGTGCTAAGGTGCTAAGGGGTCCTCTGTGCTAAGTAGGCAGGGGGGTCCTCTGTGCTAAGGGGTCCTCTGTGCTAAGGGGTCCTCTGGGTCCTCT"
find_frequent_kmers <- function(rosalindGenome, k) {
  #Get the k mers
  kmers <- generate_k_mers(rosalindGenome, k)
  kmers <- unique(kmers)
  #Count occurences 
  kmer_counts <- rep(0, length(kmers))
  for(i in 1: length(kmers)) {
    kmer_counts[i] <- count_pattern(rosalindGenome, kmers[i])
  }
  max_freq <- max(kmer_counts)
  freq_kmers <- kmers[kmer_counts==max_freq]
  return(freq_kmers)
  
}

find_frequent_kmers(rosalindGenome, k = 12)
[1] "GTGCTAAGGGGT" "TGCTAAGGGGTC" "GCTAAGGGGTCC" "CTAAGGGGTCCT" "TAAGGGGTCCTC"
[6] "AAGGGGTCCTCT"

As seen above, I used my find_frequent_kmers function to run through the entire sequence, pulling k-mers from rosalindGenome. The code stores the list of the most frequently occurring k-mers as frequent_kmers. The kmer length listed by Rosalind that I must find is 12, so I set the k value to equal 12.

frequent_kmers<- find_frequent_kmers(rosalindGenome, k=12)
noquote(frequent_kmers)
[1] GTGCTAAGGGGT TGCTAAGGGGTC GCTAAGGGGTCC CTAAGGGGTCCT TAAGGGGTCCTC
[6] AAGGGGTCCTCT

I used noquote(frequent_kmers) to print the k-mers out in a single line with no quotations so I can copy and paste it in the submussion box using the correct format. I got the Rosalind Word Problem correct! This confirms that my code is working properly.

Challenge 2

While undergoing DNA replication, DNA polymerase can make mistakes while it replicates the genome. Unfortunately the nucleotides in these DnaA boxes we’re searching for, don’t appear the way we need them to. Additionally, the replication signal we are looking for may appear as both the k-mer signaling that replication should begin, or as its reverse complement. Because of this, we need to create a code to count reverse complements as well as near k-mers.

A and T are complementary base pairs, as well as G and C. This means that in double-stranded DNA, we should see across from each Adenine (A) a Thymine (T), across from each Cytosine (C) a Guanine (G), and vice-versa. This means that if we look across from a segment of DNA, we can see the reverse complement of that segment. The “reverse” refers to the fact that the complementary sequence is also anti-parallel, meaning 5’ and 3’ ends are oriented in opposite directions on each side of the double strand. Thus, we read the complementary sequence in reverse order.

Challenge 2 involves solving the Reverse-Complement Problem. I must do this by inputting a pattern of nucleotides called genomeSubStringand hopefully get an output of the reverse-complement of genomeSubString. I created the reverse_complement function to run through the whole sequence from the input, and output the reverse complement of the sequence. The code will return the complementary nucleotide each time it loops past a nucleotide, using if statements to tell the code to run only if the target nucleotide is present. Meaning, every time the code encounters an Adenine, the reverse_complement function will return Thymine, and vice versa. This is also applied for Cytosine and Guanine, using if statements as well.

reverse_complement <- function(substring){
  #reverse the substring
  substring_list <- str_split(substring,"")[[1]]
  rev_substring <- rev(substring_list)
  #Get complimentary Base-Pairs
  rev_complement <- c()
  for(i in 1:length(rev_substring)){
    if(rev_substring[i]=="A"){
      rev_complement <- append(rev_complement, "T")
    }
    if(rev_substring[i]=="T"){
      rev_complement <- append(rev_complement, "A")
    }
    if(rev_substring[i]=="G"){
      rev_complement <- append(rev_complement, "C")
    }
    if(rev_substring[i]=="C"){
      rev_complement <- append(rev_complement, "G")
    }
    }
  rev_complement <- paste(rev_complement, collapse = "")
  return(rev_complement)
}
reverse_complement("AAAACCCGGT")
[1] "ACCGGGTTTT"

In the code above I tested my code with the sample data set from the Rosalind reverse complement problem. This confirmed my code is working so I will now apply it with the challenge data set below. I first downloaded and saved the data sequence to my laptop, then read the data into R. Lastly, I input the data to the reverse_complement function.

revcompGenome <- scan("rosalind_ba1c (3).txt", what = "character", sep = NULL)
reverse_complement(revcompGenome)
[1] "CCGAAAGACTCGAATATCTCTACGATCGTACTTAGAAGGTGTTCATGTCAATCAAGCCGTGCACATGGACTGCGGCCAACGTATATCAAAGCATCCGCGGGTTTGGAGATGGGCAGTATTAATGGCCGCCTTCCATGGTCTCCTCGACGAGGCAGTTAGACGAGGCTGTCTTACTGTAGGTGCGAGTCCCAAGATTGTGTACCAAGGTTGCCGTTTAGAACACCGGGCCCGTTGCACATATTCGGAGAGTCCCGAAGGTGGCTGGCTCATTCGGAGCTTATGGAAGTATTATCTTGCAATACCAAGGGCTGGACAATGGGGTCAGAAGCACCGCTACAAAGAACGTACAGCTTCCCAATTTCGGCCAAGGAGAGCTTTACGAATTGGAAGCTACTACCTATTCTGTAGGGGACATATAGACCCGCATGGGGACACTCTATCACCTGCGGTAGCCTCAAGCACGCCGAAGCGCCTTACTCCTACCCGTCAAGATACGGTTGGTCGATTTGAAGTCTTCCACTGCGGGTTCAGTCGAACCTCGGTGTTGGGCTTGAAGGTAACACTCCTCTCTAGAGTCGTCTGGGCCAAGCGGCTCCTGCATCGCGCTTGCTTATTGGATATTCACATTCCCGGCTGGACAGCAAGGGACCGCAACTAGCTAGAGGACCTGGGCTGTCGCCAGTTCAGTAATGACCAGAAAAACCCAGGCTTCGGACTCCCACTGACATCCTGTGGAGGTCAGATCGCATCATACACGATTCGGTCCAAAGTAGGCGCACGCCATTACAGGCGACGTTCAGTAACAACCAATACCCCCGAACTACACCGCCATAGTCACCCTGCGACCATCAACTTAATCATCTAGTACGAGAGCCCCACGGATGAGACTCAGGCAAATTTTCGATGTTTTTCAGAACCCCAAGTTGGCTCATCAGTCAGCCCTACAAATGACGCTTGATCATAGTGATGATTGGAGTTTCCGAGTATAGAGCACTCCTACCGATTGCGGCATCCTTGCCCGGATCGCAGTAGAGTACGCTGCATTTCACAGCACGATATAGTTGGAGTCAACGTCGCCGGCTATTGAGTTGGATTGCAATTACGTTAGAAGACCGAACGTTCCCTTGCTTAGGTCTACTCCAAGGAGTACTTGGGACGGTACACAATCCGATTACGACTCCCGCCGCAAAATTCCAAGTTGATAAGCACGCGTAAATTTTGACCTATCGTCACTCCAGCGCCGTGCATTACGACGCGACTAACAATCTCTGATCTATAGACCTAGCTAGAAAGCGGTGTTGGGGCACCGAGACCCACGTATCCGCGTGGGGCTCATCATGTTCACTGACGGATGTGGGCGGATAGGGATAGCCAATTCGAAAATGAGTCCCAACTGAGATTATCGTCTCCGGGTAGAATTCCCTAATGACTAAACACGTCCGTTTGATTCTGCTTATGCCCTCAGAACCGTGAGAAGAAACTCGCAAACACGTAATGGAGAGTGGCTACAATCCCTGACTGAGGCAGATAGACTGAATCAGATTGACACCACGCAGTACTGTTTACGCTGACTAATATAAACGACATCCTGTTTTTACGACGCGACGTCTCGACGCCTAACGCACCGTACGATGATAAACTATCCAAGGAGCTCTCCAGCATATTCGACAAGTACTCGGTAGAGGAAGCGCTAAGCTAAAGTCGGAAGTCTGTATATGATTGTTCGAGTTGTAATTCAGAAATGAATGGCTTTAATCGTTGCGGAATCCCTTCTTGCGTGTGCGCTACGGCCAAGTGGACGAAGCTCAGTTAAGCGCGTGCAGACTCAAGTCACGCTCCGAAGGATTCGCTGGGACAGTCCCTTCTTACTTTTACGATTCCCCCATTGCCGCAGCCTCTTCATATTAAAATTCGAAGGGAGCGAAATTTCACCTCATGGAAGTACCTACTCTGGTCAACGCCGGATGGCATGACGGTCTATGCGCATTCTATGCCAATCGACCTATGTGGCGCAGTTGACGTACTAGAGTTTACGATTTAGCAGTGGTTGGTAAAATAGCGAGCGGGCGGCTTTTCCGGGTGACCCGTAGCAGGACTGACACCTAAGACAGATTTTAGGATACCTACAAAGGCTGTCCGGCAATTAACATATTTTGGCGTCCGACTGTAAAGCATGCCGTGGTTCGCTGCGTGGTGGGAGACCGCCCTCGTAGACGCCAAAATTCACGCCACAGTGAGCGGGGATAGCTGTTCCCATCGGGCTGAAAATGAGACACATAGCGTACCCACCCGAGGCATCGTCCCTTGATGTTGCATATAGTGTTGACGGCTCCGACGAAGAACGGATAAAGCGGCCTCTTCTTTTCTCAAGCGGCCATTGCTTGAGCTCACGGGTGGTTCTGCTATCCTTACGAATTAGAACAAGGCACGTCCTGTAGCATGAAACGCTGGCGCAAGATCTGCGGTGTAATTATTGAATTGCTGTCTTCTAACGCCGAGCCGCATGAGACCCGCGTCTGGTGTTGTCGCATGTGTCTAAACGCATGTGGCTACTCAATTCCATCCTCTGGCGCGCGCCACATCACTATCAATCAGGTTACTGCTCAATCTGTAGATACGCACACTGAGGATAACAGGCTTGTTTCGTCGCCTTTGTGCATTGAGCCGGTCCTCCGATTAAAGAATCGATTAGTACTGGGGCAATACAGCTTCTATTCAATACACGGGGGCCCAAATGGCGCGCGGTCGTTTTTTACCGTCGTTGCTTTATCTACGAACCAAGCTGCTGCGATCTTGCTGGGTAGTTAGTGTATATGTCTTATCTCAGCGAGTAGGGTTCTCAGGCACGTAGTCAATCTCTTCAGCTTATTCCCTGTTATCGGCATTATTATAAAACTGAACACCACGCCACTCGGCGCTCCAATAAAGGTAAGGAGCGCATCAAGAGAGAAATAGACCCTTTGCGCCCCTCGCGCGAAAATGGCGGTTAGATCAAATGAAATCTCGTGACAACCGCGGACGGTATAGGTGCGGCTTCCCGAATGGGGGGCAATGCAACTCATCTAAACTACACTGTGCAGTTCTGCGATGTGTTGAAGGCGAAAACACCAAACCCAATCTCATGCCTCCAGATGTCCGGTGTGCTACCCCGTGACGACAGGGCACGGAACACTCACGCCCCGACTCGCGATAAGCTAGAAAGCGGCCCCTCGGCTGTCGGACCGAGCGTCCACGGGCCGCTTACCAGAGACCAGCTGCTGTCATGCCGATGCATGTGGTCCAAAACAGGGGAGGATTACTGCCGATACGTCCTCCATCCAAGCCCGACCGAAGTTACGCTGTCCGCGGAACTCTGACCCTGGGAGCGGCCCCGACCAGCTTTAGGTCGACCGTCGAAACCGATAGCCCCATCGTAATATCGACGGTTGAGTTCTGTATTCGTGACGCGGTTCACACAGTCTAGGGTTGGGACTTGCATTAAGCTCTTCAGTGAGAGAGGATCTGCGACCGCTCCGGATCAGTGTTTCATCTCGCTGTAGAAGGAGGGGTTGTGGAAACCAACGGCAAGCTCCCTGCATTTCTCAGCAGTCGACTTCGCATTATGACTCCATTTAAGAGATTGACTAGCAAGAATTTCGTGAAACCAATTGTTAGCAGAGTCAGTAACTTAATGGTATAAGACTCAAGGGACTGAGGTATTGAAACTTAGTGACCGCTACGCATTTATGCTGCGGCGCCTAAGTCTAGCGCGGGTATCTCTCCAACGGGAACGAGATAGAGTGTCCGCGCCTTCATATGGTCTGGGGCGGGCTCTTTGGTGTACCTTTTAGTGCCATTCCTTCTATGTCCCCTGGAGCCCGGACCGGCGTGCGCGGTCTGGAGCGGCCTATCCGGTGCCGAATGAGTAACCACCCGCTAACATACCGCACAGACCAATTATAATTCATTGTCTCTCAATCTCTCATACGCTCTACACCGGTTTGATCATGCACCAGCATTGTGTCCAGTAGCTTACGCCGTAGTAAGTAGAGTGTTAGCATATTTTATTGGCTACCACATCTGTATCAACATTTCCCGCCATACGTACTCTGGTGGTCGTGGAATATATCACAGTTGATTATAATAGCGGCGAATGGTCACTATCAATACCATGAAGCCAGCGCTCCTGAGGGGATTATCAACATTAGTGGATCATACAGTTCAAGTACTCCGACCAATCCATGTGTTTCTAAATGCATAAAATGACTTTCGTCAAATGAATGAGACCATACTTGATCATATAAACACACGCGAACCCTAAATTCCACCGTATCAACTCCTAGCAAATGCCTGTAGCGAAAGATCAATCTACTATAGACCGTGATGAGTCTGGTGGGTTCAGCCCCGAGCGCTTCCCTCGCCCGAGATGAACGCCAGATGGGTCTCGTAGGACTTCAAGTCTCCCGGTATGGTAATCAACCAACATCTCCGGGAATGCGTTCAAGAGCTAGACGGCTTAAGACTTCACATAGCAAGACTCGGATTAGGTGGCCCGTATGCTGTCAGTTTGGGTGCCGAACACCTTCATTTAAGGCTAGTGTTCGTCCAGCCGCCGTACCCTTTTCACTGGCGGTGTGTTATGACTCATCACTTTTCATATATGGACTCCTGTTCAATCCTAAGCCATATATTAGCATAACACAACAACTATTGATCGTATAGTAGTTTAAAAGATAGAGGGACTTCAATGTAGCTACGTAAACATGGTCTGAATCTAATCGCAACGCTAAAGCATAGACCCAGCGAGGAACTACGAAGACGATTGTAGGTTCGAGTACAAGCGCAGGACAAATACACACTTAGTTTTCGGATCAATCGGGATCGGGAGCATGACAGCGCCACGGTATAGTCCCTTCTTTTGTGAAAACGCTCGCAAGAGTACTTGCAGAATGTCAACATGCTCTGAACCGAACAACCAGACCGGCAAGTACCTGCCTTCTGATATATACGTCCCTCAACGGGTCTGCGCAGGCAGGAGAAAACTGTCAAACGTGTTAGCGCGTCAAGCGCGCCGGACTCGGTACAGCTGCTAGCCGCCCCGGGCACTGACGAACTCTATTCATAATCGCCTGGTCGCTTCCACGGGAGCATGAGATGCAAAACAAAAACGCCAAACATGAGGGCGGTACATTTGCTCGCAAAGCAATTGTAATACCGATGGCGTAAGGCGAAATGTCTACCGCCCCAGTGGTCGTAAACTACCGATTGTATGGTTTTAGCTGTGTGAAGTGTCACGTATAACTACAGTTGGCTTATTGAGGGCGAATGCGAGCAACCCCGTAATCGTCACGAGTCTTATATGCACGAGGCCGTAATTATACGTGCGAATCTTCTATAGCCTCCACGGCGTATAGCGCTTCGAGCGAATAAATTATATCAAAATTTCGCACCCAGTCGCTTCGGGAGTTCCCTTGCAAAAAAACTGATACGTTGTTCTTTTATGCTAGCCAAAGTCATGTGTAAGGCTTTGGTGTGTCAGGGATACTACATATCGCGCTAAATAATTCGTATCCCCGACATCTCTCCCAGTCAAGGTCGTTAGTGCATATCTCGCATCCGGGCGCGCGAATCAGGATTACGTTACATCGCCAGTGCTCATGGTCCGGTATTGCGCTTTGTAACGTTCATTTGACGGCCTCACTCATCGCCGAGTACACTACCCATTTGTATAGAAACGAAGGACCTCAACCTCAAGGTGATTATCCTCTAAAAGAAGCGTCTAGATACTACCTCCCCTATGCCGTTGTCTTCGAATTCCCGATACACGCAAGGAGGTTGTCGGTTGGCGGCGCGTTCGTAGTAGAGTATTGCGAGACATGAAGCTACCGGACCTTCTTGCGACGATATATCTCAAGTAACAGCATGAAGCATGTCTCCCCCCCAGTCAGGAGATGCCTAGCTTAATCCACATGCACCAACCATCAGTTATTATCATCAACGAGTTACCCCGATCCAGGCCGCTCGATATACACTAGAACAAACACGGCGTACGACCCAATTATATTCGCTTGGTTTGCCGACAACGTCCGTCCGCTTGGCGGATTAAAAATACTTGACGCTGGCAGAAAAGGCCCCTTTAACGCTCGCAGGAGCACCTGGGGTAGGGGATCCCGCTCAGATCCGGCACACAGTAGGCGCTTACCGCAAAAGGAGTCTTACAAAGTTGTAATACGCTGTCTAAATAGGGGTCGCATCGAAGTCCACCTTTAGACGATTTCAGAACCGCAACGCGTGTCGCTCCACGAAAGCTACGAAGTATCCATCGCACAAGCCGCACGCTAAATATGTGTTTTTCACCAGAGCCTCTGACCAGCTAATTACCCTGGCTCCTCCAGTATTCAACTGGCAGGATCCATAATGCGGACTCGCTCCATGCACTAGAACCTGCGTCCCAAGGAAGGTTGCCAACTCTCGGCAAGAACTGGAATACATGCGGAGTCCAAGAACAGATCACAGCAACGGAAAGAGCTATGGGGCTAGAAGTCGGGCTGGCAACATCGTGGGACATACCCACGCGGCTTCCTCATGAAGAAGCTGTTGGTAGGAGTGTGTCCCGCCGTTCGGCTCCTCGAAACCTGACGAATTTCCCTACCCGGCCACCGGGGGTTGCGGTTTTAGATGCCCCGATCCCTCACTATCATGAACGAGTTTACGCAGATGCTTAAGAGTGATCGTAAATGGCCAGAACGGTACAAGACCTTGAGATTCACAGAAGTTCATACGCGTCGGTCCACGACGGTATTCATGTCATGTCACCCTAGACTAGAAACTCATCATTCCAAGGTCGTGCGCTTTCTCTATGAGTGTTAAATGTGGGTTGATTTGTTGAGGACCCTTTGATGTCGAGGGAGAATTATACACCTTCCATGATCAGCTCAATGTGCCAGTCTGAGACTTAGGAGCTGCAAGATTTGCCTTAGGGGGTGACAATACGTCACGGCGATATGCACAAAGTGTCGCATGAGGATGTCGTGCCCCCGGAATTCAATCTCAAACGGTAAGTAGGCATAACTCACTTTGTTTCATTCTGGAACAATATGTGCGCACATCGGCCACGTAACGCAACTACATTCTCATGGGGGGTAGGTAATGGTCAGTAGGGCGCGGTATAGTCGAGCCTGGAGTGCGACCTGAATGCGTTCGTCCCGCGTAAGCAAGCTTGATGTGGAACTCGCTGATGAAATTCCACGCCTAATGAATGAGGGGTTTATGTCCTGTACCAGCTGCACTTATAATGGGAACCTGCCTGCTGAGGAATCGCAATACTGGAGTGCAACTCGGCGAATTTAGGTCGTAGTTAGTCGACCAGGCTGTATGGAGAAACCGCGCACAGAATTCGATGGGAAGAATGTTATTTTCTTAATCGAGGAAAAAGACAGGCGCCAAGATAATTTCATCCTGTCCCTCGGCACGTGATACGGTAGGGAATTAGCACAGGATGCCGTACAGATGTCGTTAGGCTAGAGATGACAATGGGTCCAAACGGTCGGGGATCAGTGTAGGTTATTTCATGTCCGCGCTCACCTCCAATCGGCCCAAGCAATAAGCGTTACTGCTCATTTTCTCCATCAGAGAGTACTAAAACTTATCGGAAGGGCCGCGACCCTCGGTCTGCTCCTCGGACCTGCAATCCATCGCTCTAATGAGTCTCCACTTTGAGGATAAAGAGAAGCGAGGTAGCTGGCACTCAGCTGCAACTTAGGAATGGGAGGACAAGAATAGCACCGCACACGGGGCTTTTGTTCGCTTATAGTAACCGTCCAAAGTACACCTATCAGGAGAGGAAGGCCCGGCCGGTACGCGGATGGGTGGGATGCGGGCTTAACGTCCCATCAAGTCAACTGACGTGTAAGTCGTTCAACTCGCCTGACCCCGGGTCAGGGGTGGCCTTACCCAACGGGATTGATTTGAAGTGTTCTACCTTTCTTTGAGAACAAAAAAATTCGACACAGACCCCCCTTATAGCCAGCACAGACCCCACCGCAGCCCGCCCCTCTCCCTGACACTAGGCCGTCTAGGGGTAGCGGAATACGTAGTGAGCCTGCAGGCCGTCTAGATCAATCGCGTTCCCTCAGGGCAAAGAGCGAGAGCGCCCCGATTCATGCGCATAGACTCATTTTCGGGCCGCTATGTTGGTGAAGCGCTAAGCTAGAGTGCTTCTTGCGTGCTTGGGAACACGCGTATGCTACAGATAGAATTTACCTAGGCATAGTGCCGCTCAGCTCTGTGGGCAATAGGCGAGGGATTTATCGGCGTCGCAGGCGAGACGTGGGTGACGTCCGGTTGATATAATTTCCTATAGGCTGCAGTCCCCGTGCGATGCGGTCAGAACCGATTCCGTTCCAGCCATCGCAACACGTCTGACTGTGCCTGAGCAATCAGAATGTTAAGGGATCTATGGGCGACCCCACGGTGGATTTCCGAACCGTCTTTGTGAAACGCAGGCATTGATCAGGAGCCGGAAAACAAAAGCCTTCTCCCCGTGTACAATGTTGCTGCTCTAGCGGTATACCGCCACGCCGCTACCACAGTGAGAGAATGGCAGCAGGCTAGTGGATGATTCTGTTGGAAGTAGGTATTTCTAGAACAATACACGACTCAGCCAGGATGATATGTCGTCATTTTAGGCTAGCGCTAACAAGGACTCTTCCCGGCCAGCTCCAGTTACAGTGTTACAGACATATTCGGTGTTGGAGGAAGAAAGCGTAGTTAGGCCCGGTGAAACGTCTTACGCGACGCCAAGTATCGCTCCCGTAAGCACACAGTACACCAC"

I found the reverse complement of a string and got the Rosalind Complement Problem correct!

In summary, this notebook helped me create a code that would search a genome for frequent “near” k-mers and their reverse-complements. These are the genetic substrings which will signal replication to the DNA polymerase, and therefore help us get closer to finding the origin of replication.

Finding the Replication Origin, Part III

Challenge 1

The first challenge in Replication Origin Notebook Part III involves implementing a clump-finding algorithm. I first created a dictionary of all possible k-mers, called k_mers_dict, and initialized it with counts of 0 at the beginning of my function, called initalize_k_mer_dict.

initialize_k_mer_dict <- function(k){
  nucleotides <- c("A", "C", "G", "T")

  k_mers_dict <- expand.grid(rep(list(nucleotides), k)) %>%
    unite("k_mers", everything(), remove = TRUE, sep = "") %>%
    unique() %>%
    mutate(count = 0)
  
  return(k_mers_dict)
}

k_mers_dict <- initialize_k_mer_dict(9)

I then implemented clump_finding() on a small example. This function pulls from a sequence called genomeString and takes parameters L, t, and k. L indicates the size of the window we are observing and instructs the code to look in a specific area of the genome. T counts the frequency of occurrences, while k indicates the integer length of k-mers to be returned. If working correctly, this code should result in a set of k-mers forming (L, t)-clumps within genomeString.

clump_finding <- function(genomeString, L, t, k){
  initial_window <- str_sub(genomeString, i, L)
  my_kmer_dict <- intialize_kmer_dict (k)
  for(i in 1: L-k+1){
    curr_kmer<- str_sub(initial_window, i, i+k-1)
    curr_count<- count_pattern(initial_window, curr_kmer)
    curr_row<- which(my_kmer_dict$kmer==curr_kmer)
    my_kmer_dict$count[curr_row]<-curr_count
  }
  candidates<- my_kmer_dict%>%
    filter(count>=t)%>%
    pull(k_mer)
  for(j in 2:(ng -(L-2))){
    rem_pattern <- str_sub(genome, j -1, j-1+ (k-1))
    add_pattern <- str_sub(genome, j + (L-2)-(k-1), j+ (L-2))
    
    rem_row <- which(kmer_dict$k_mers == rem_pattern)
    add_row <- which(kmer_dict$k_mers == add_pattern)
    
    kmer_dict$count[rem_row] <- kmer_dict$count [rem_row] - 1
    kmer_dict$count [add_row] <- kmer_dict$count [add_row] +1
    
    if(kmer_dict$count[add_row] >= t){
      candidates <- append(candidates, add_pattern)
      candidates <- unique(candidates)
    }
  }
  return(candidates) 
}

As previously stated, my kmer_dict function initializes the table of k-mers, and the for statement intructs the code to observe a certain window and thus sets up the initial window length. The curr_kmer function tells the code to float through the initial window and find a specific pattern. The curr_count function counts how many times that pattern shows up in the initial window, and the curr_row function shows where the k-mer appears in the k-mer dictionary kmer_dict.

Challenge 2

Adding functionality, or optimize!

This challenge involves finding all 9-mers corresponding to (500, 3)-clumps in the E. Coli genome. However, we modified it during class so that we did not utilize the E.Coli genome. In order to call this a function, it must pass a window length, t-value, and k-value. We scanned in the E.Coli genome and tried to run and test our code in class, but then realized it takes too long. The clump_finding function works on smaller genomes, but was too slow to run through E.Coli’s 4.6 million nucleotide genome. This suggests that the code needs to be further optimized in order to run this genome.

In Replication Origin Notebooks I, II, and III I learned many new coding methods and tools. These include creating lists and sample genomes, while also using the set.seed() function to set a seed for random number generation to ensure consistent results. I enjoyed using the count_pattern() function count occurrences of a particular pattern within a larger genomeString. I also used code to make for loops, and if/else statements to make sure the code is only ran when specific criteria are reached. Two functions created in notebook 4 were generate_k_mers(), which will generate all strings of k consecutive nucleotides within a genome, and count_patterns() which will count the number of occurrences of a particular pattern within a genome. I was able to search a genome for frequent “near”-k-mers and their reverse-complements.

Throughout notebooks I, II, and III, I learned how to scan in downloaded datasets from my laptop into R in order to help me solve many different bioinformatics word problems and challenges on the Rosalind website. After many trial and errors, I got them all correct! This recent notebook was very interesting because it required combining some of the previously developed functions in order to build the clump_finding() algorithm. Even though future work is still required, I enjoyed gaining new knowledge around these tools and functions because it pushed me closer to discovering the replication origin of a genome! I thoroughly enjoyed this project because I was able to apply what I learned in my Biology and Genetics classes to a different platform! Although it was very challenging, I learned the importance of code and how it can help us solve big or small scale problems.