Starter Notebook

Author

Ainsley Owens

Intro to R Notebook

In this notebook, I will be exploring several functions of R and learning tools to complete the Replication Origin Notebooks. This will include using the arrow operator <-to store objects in variables, using R’s sample() function to create random genomes for testing functions, and building functions to process and analyze genetic data. I will also be learning about for loops and reading in data from outside sources. These tools will be used to analyze genomes in future work.

Challenge 1

Here I am using the c() function to put four nucleotides (A, G, C, T) into a list.

Nucleotides <-c ("A","G","C","T")
Nucleotides

[1] "A" "G" "C" "T"

Challenge 2

Now, I will create a random string of 15 nucleotides, titled randGenome, using the sample() tool. Using the <- will store the list as randGenome.

genome <- 15
randGenome <- sample(Nucleotides, size= genome, replace= TRUE)
randGenome

 [1] "A" "T" "A" "A" "G" "A" "T" "A" "G" "C" "A" "T" "C" "T" "C"

Now, I will collapse() the genome to create a single string and make it easier to read.

paste (randGenome, collapse="")

[1] "ATAAGATAGCATCTC"

Challenge 3

Now, I will create a random genome that is 1500 nucleotides long, using the sample() function again, naming this genome randGenome. I used the paste() function to collapse the list into a single string.

genome <- 1500
randGenome <- sample(Nucleotides, size= genome, replace= TRUE)
paste (randGenome, collapse="")

[1] "AGTGCAATACGATAAACAGAAGGATAGAGCATACTCGTGCTCGGCCAGGAGATAGTGAAGGTTGCTTAGGTGGATTCGATAGTCTCCTCCAATAATCCAGTCAGGAGCACCAAATTACAGCCAACTCGCAAGAATGCTGGCATGTAATTGGAGTTAGCGCGACGACCAGAAGCTAGTCTTAGAATCTCCGGCGCCGTGACTAATTTCGCATTCGATTAGCCAAAACGATAAATACCTGAAGGCGACTTACCGTGATAATTGAACGAGAGCGCTTGCACCGGGTCGTCCTGTTTTCGCCTATCGTCAACCGCACTAGCTGTCTAGCACAGACGGGGCATGTGCTAGCCTACCATGCCGTGGTAAAACAAAACCTAAGCAGTGGATTAAGTGGCTACACTCTCTGTGTACGATCTACCGTCTTTGTAGATGCAAGTGAAGTGTATGCCTAATTCCACTCTCTACCGCCGGTCCCTGGATTACTCACGGATTTACCGGCGAGTGAAGCATACGGGATGGGAGTCCTGCGCATGAGGGGTATGCCCGTTCTAACAATTTTAGAGAAAACATTCGAATTAAGCGGAGCCCCCGACTAAGCATTGCGTGTAGGCAATCTCAGCACGGTAGATCTGCGCCCACATCACGCATGCAACGACTATGGTTTTCACCTTACACCGATCTGGTTGACGGAGTCTCAGACTCCTCTCTCAGATACTAACTCCCACCTGGCCGTAACTCTTTGTACTCGTTTTGTAATGCCAGAAAACTGAGGCCCACTATCTGCCCTCTGCAGATGCCCAGATTATACCTTCTTGGCCGGCGAGGTGTGGTAATAGGGCACATCTCTCAACTTTATTATGTAAAATCCTTGCGTAAGTGATTAATGACGCACACGGCATCTCCCTATGGCTGATCTTACTAGCACTATTGTCGTTGTGCGATACATTGGATCGTATCTTACATCAAAATGAACGTACCGATTTGCATGGGCGTCAATTGACGTTGCCCGTACGATGTCGTGAAGCTCAGCTTAGAGCTTTCGAATAACTTTTTGGCTAGCGTAAGACGAGAAAGATACTAACGCGACAGATGGTGCTCAGACCGGCAGACCCTTTTTGACCTCTAAGCTAAATTGCGAACCATGCTAAATAGATTCTGGGAGCACAACGGCGCGCATAGAATGCAATATGCCAGCGTGGCTATACTTGTAGGAAATTGGCTTAAGTGTACCTGCCAGGTCAATTTAGGCCGTCCAACCGATAACTACCAGCGCAATAGGTAAGCGTAAAGGCTATCCCGGTGCCGTGTCGCACGTAGATCTAGTAAAGGGTAGGGTCCCTCTTAGATATGAAGGTTAAGAAGCCCGGTTTGCATAAGTCCAGTTACGCAATTAAGGCTGACAGTAAATAAACCTCTTATGACATAGTGCAGTAGTCACAGTAAGCAGTACGAAACGGAACTTTTCTTAGTCGCGTGACGCAGCCTAGTCTATTCTGCCGGTTC"

Here, I am using set.seed(215) to initialize the random number generation with the seed 215, using the paste() function to collapse the list into one string. The set.seed(215) function guarantees that the same random values will be produced each time the code is ran, useful for reproducible results.

set.seed(215)
genomeLength <- 1500

randGenome <- sample(Nucleotides, size = genomeLength, replace = TRUE)

randGenome <- paste(randGenome, collapse = "")
randGenome

[1] "TCCAATGTTTAATCTTGGCAGGTTTCACGGTGAACTTCCAGTAGGCGCCGGGGGCAGGCACACGTAAGTGTAAATCATCGAACGACGCGCTCGGCACTGTGTCTCCAGACCATGAGATTCCATCTCACGGAACATGTGCACATCAAAAAGAACTACCTTGTGTATGCGTCCGTGGAATGATTTTTATGGTTTCGTCGATAGTTATCCGGGTGGAAGGTTCGCAATTCGCCGATTAGCTTGTATCGAGCAAACACCGAGCACATCTATATGAGTGTTGACGTATGTCAGAGACCTACCGATTCCATCTGACGCTAGCAGGCCCCTACCGAAGGGGTTGTCTTCGGGGCGTCCGGCCATCGGAACCTCTTTTAGATGCCTATTTTAAACATCTCAAGTTAACTAAGGTAGTTACGTCAGCCCAACCTCGAGATCTATCTCTCAGTGTAGAGGAACATCGAGTTACCGATGAATAAAACTTCGGCCTTTCTAATGGTTCAGACTAATGCACATAATTAGTTCGCCCGAGATAAGGTAGTGCCTTGCTGGGGGTTACGCTAGCCCCCCCTCCCCAAGGATGACGCTTGCTATCTTACTGGTGGCCGATTTTTCCTGAGCCGGGTGGATCAAATAGATAGTACGCTCGGATTGGCCTGAAATCCGGGTCGCAACCATCGCCTCGCCTAACGCTCTCCCGGATAGTTCCCGACATTCACTTATAACAAAGACCGAAATTCCCAAGTACTCGCCGAGATCTAGACGTGCGAGTTGGTCCCCCCGCAGTAGGCAATTGGTACTATCTGTCACTACGCTAGGCCCGGACGTTTCCCTGGTGTTAGGTTCTAATCCCATGCCGGTTATCCCATCGGATCAGCAGAGTCGCCCACCTTAATAACTCTGGCGCATAAACGGTGCTCCTGCAAATGTCTTAGCTCGTGTTGTCTGATTTCGGATATTCCTAAACTTGAACACGGAGGGTATGTATTCAGTGGTACATGTGGGAAATCTATGGTGGAAGACTAGTTGAGGTCCATCCTAAGGGGAGGTGAATGAGGAGTTACCTGTGGACTAGTTTTCACGGCGGTAGCGCGTGTGCAAATAACGTAACGTACGTAGCCTGGCTCTTACCGCAAAAATCAATGGGTTTGGACGCGTAGGAACTCTACTGCCATCGCCTTCAGTTATCGGTCGGGCCTCACGACAATTTAAATGCGTCTAGGAATGACGAGTAGGAGAACCGCAAGCCCCGACGGAGCAGTCCGCGGAGAACGATCCAACCACTCTGCCCGCAAATGACCCGGCGATTAACGGCTTTTATGAGCCCTAAACACCTGAGATCTCGAGACACGGACATACTGCCGACTCAACCCGAGAAGCGTTTTTGCCGACCAACAGGGTTCGAGGTTTATGAGAGGCGCCCTGATGAGGCCGGAAACGACTGCCGCCTGAAGCTTAAAATTATTCGATCTGTACCTAAACGACCAGGACACACAGCTGCAAATG"

Challenge 4

Using set.seed(215), to ensure the same random sample is created each time I run the code, I created a random genome composed of 100 nucleotides, done by setting the genomeLength to <-100. The paste() function collapsed the list into one single strand.

set.seed(215)
genomeLength <- 100
randGenome <- sample(Nucleotides, size = genomeLength, replace = TRUE)

randGenome <- paste(randGenome, collapse = "")
randGenome

[1] "TCCAATGTTTAATCTTGGCAGGTTTCACGGTGAACTTCCAGTAGGCGCCGGGGGCAGGCACACGTAAGTGTAAATCATCGAACGACGCGCTCGGCACTGT"

Challenge 5

Here, I am learning how to create a for loop so I can create for loops to analyze genomes. A for loop is a control-flow construct used to run through a dataset and apply the same set of operations on each item in the dataset. To create a for loop, you must set an iterator and provide the instructions to be run each time through the loop. This simple for loop will multiply myProduct. I first initialized the container, myProduct, with a starting value of 1. By setting the for(j in ) value to 1:15, this instructs the loop to run through the range of integer values from 1 through 15. I then provided the condition on the iterator within a set of parentheses, ending the line with an open bracket. I will indent the next line, as the indented lines following the open bracket are where the instructions to be run through the loop are located.

myProduct <- 1

for(j in 1:15){
  myProduct <- myProduct*j
  print(myProduct)
}

[1] 1
[1] 2
[1] 6
[1] 24
[1] 120
[1] 720
[1] 5040
[1] 40320
[1] 362880
[1] 3628800
[1] 39916800
[1] 479001600
[1] 6227020800
[1] 87178291200
[1] 1.307674e+12

Challenge 6

Here, I created a random genome consisting of 10 nucleotides. I then created a for loop intended to print each individual nucleotide as opposed to the whole string. I used 1:nchar() to run the for loop through all characters in the string. I used the str_sub() function to extract individual nucleotides. The string subset function takes three arguments: the string to be subset, the position to start subsetting, and the position to end subsetting the string. If start and end are the same, a single character will be extracted. By using str_sub() and using j as both the start and end for subsetting, I was able to have the for loop print the individual nucleotides.

set.seed(215)
genomeLength <- 10
randGenome <- sample(Nucleotides, size = genomeLength, replace = TRUE)

randGenome <- paste(randGenome, collapse = "")
randGenome

[1] "TCCAATGTTT"

for(j in 1:nchar(randGenome)){
  print (str_sub(randGenome,start= j, end=j))
         }

[1] "T"
[1] "C"
[1] "C"
[1] "A"
[1] "A"
[1] "T"
[1] "G"
[1] "T"
[1] "T"
[1] "T"

Challenge 7

Flow control is a programming concept used to execute code only when certain conditions are satisfied. This can be achieved using if, else if, and else statements. The code in Challenge 6 was adapted to only print occurrences of Adenine ("A"). This is done by using an if statement. I first initialized the container, adenineCount, with a starting value of 0. The for line instructs the code to run through all of the characters in the random genome. The if line instructs the code to only run if the current nucleotide in the random genome is Adenine ("A"). Because I already initialized the container with a starting value of 0, I can instruct the code to increment the value of the container by one every time an Adenine occurs, by following the for line, with adenineCount <- adenineCount +1. After closing the brackets for the flow control, I used the print() function to return the occurrences of Adenine.

adenineCount <- 0
for(i in 1:nchar(randGenome)){
  if(str_sub(randGenome, start = i, end = i) == "A"){
    adenineCount <- adenineCount +1
  }
}
 print(adenineCount)

[1] 2

Challenge 8

Now that I have created a for loop to count the number of occurrences of Adenine in the string, I will adapt the loop to count the frequencies of each of the four nucleotides. I did so by setting containers for each nucleotide, thymineCount, cytosineCount, adenineCount, and guanineCount, to a starting value of zero. I then repeated the code used in Challenge 7, and changed the nucleotide in the if line to count each nucleotide, as well as the nucleotideCount line to the matching nucleotide to increment by 1 each time the nucleotide occurs. After closing the brackets for my loops, I printed each nucleotide count, using the print(c(nucleotideCount)) function.

thymineCount <- 0
cytosineCount <- 0
adenineCount <- 0
guanineCount <- 0
for(i in 1:nchar(randGenome)){
  if(str_sub(randGenome, start = i, end = i) == "T"){
   thymineCount <- thymineCount +1
  
  }
   if(str_sub(randGenome, start = i, end = i) == "A"){
   adenineCount <- adenineCount +1
   }
   
  if(str_sub(randGenome, start = i, end = i) == "C"){
   cytosineCount <- cytosineCount +1
  
  }
  if(str_sub(randGenome, start = i, end = i) == "G"){
   guanineCount <- guanineCount +1
  
  }
}
print(c(thymineCount, adenineCount, cytosineCount, guanineCount))

[1] 5 2 2 1

Here, I read in the Vibrio Cholerae genome, composed of 1,108,250 nucleotides, to test my code on a real genome. I named the dataset vib_c, and scanned the data in using the file path in my computer in quotations, followed by , what = "character", sep = NULL. These lines following the scan function must be in parentheses.

vib_c <- scan("C:/Users/owens/OneDrive/Desktop/VibrioCholerae.txt", what = "character", sep = NULL)

Challenge 9

Now I can use the for loop I created to count all of the occurrences of the individual nucleotides in the cholera genome, by replacing randGenome with vib_c, to pull the nucleotides from the cholera genome.

adenineCount <- 0
cytosineCount <- 0
guanineCount <- 0
thymineCount <- 0
for(i in 1:nchar(vib_c)){
  if(str_sub(vib_c, start = i, end = i) == "T"){
   thymineCount <- thymineCount +1
  
  }
   if(str_sub(vib_c, start = i, end = i) == "A"){
   adenineCount <- adenineCount +1
   }
   
  if(str_sub(vib_c, start = i, end = i) == "C"){
   cytosineCount <- cytosineCount +1
  
  }
  if(str_sub(vib_c, start = i, end = i) == "G"){
   guanineCount <- guanineCount +1
  
  }
}
print(c(thymineCount, adenineCount, cytosineCount, guanineCount))

[1] 294711 293942 263573 256024

Challenge 10

Here, I used the scan() function to read in a Rosalind genome, from the Counting DNA Nucleotides problem, to check that my code is working properly. I used the for loop previously created in Challenge 9 to count the occurrences of each nucleotide in the genome.

rosalindSequence <- scan("C:/Users/owens/Downloads/rosalind_dna (6).txt", what = "character")

I got this Rosalind problem correct, meaning that my code is working properly to count nucleotides.

rosalindSequence <- "CTTACAAATGAAGCCATATCCTCCCCCCCACTGGCGCGAAATAGAAACCCAAGCCATGTGTTAGGGCATTATAGGGATTGCGGGGTACGGCTCCTTGGCAACTAATAGATGTGAGGCTCTACACAAGTCCAAGGGCCTTTTCCACTCTAGGCTCATTATAAGCTACCGGCCCCAGTGTATCGTGGGCGCGTACCTGTCAGATAACTTTTCTCCGGGATTTGAGTCGACTCGTGGGTTTAAGATCGGTCTATCCCCCACATCCTGACTGAGGTGCACTTTACAATAATTGGTAGTTTAAACAAGGGCTATGCCCGCCGGCCTCAGACGTTGTTGCAGGAGGCAATGGCCAAAACTGCTTCCAGTAGGTTTTTAGGATCCACTACGACCAAAGCTAGCATTCAGTTGCCTGAGACGCGACTTCCAGTACTTGGCCCCGACGACAGAGGGCTTACAATTGTAGCCCCCCGTTAACCAGGGTTGCACCGGCCCAGATGCCGGCGGATGGTGTAGACTGCGAATGGAGAAACGTATCAGTACCAGGTGATCAAGTGCGAATAGGCAGCGACCTAGATTGGGCAACTTCATCGATCCCATTCGAGGCTCGGGTAGTCGGTTACACCATGGCGTCGCGGTAATGCCCTGCGAACACATTTGCAAGTCTTCGCCAATATACGTGCCTCCCGTGTAATCGGTCGAACCAGAAGCATCAGCCGGATCGGCAGGCGCTAAGCGGTAACCTATCCGCGATTGTTTTCCCAGTTCGTGCTCTTGTTACCCGGATAAATAGAAAGTTATTCATGGGAAGTACGTTCGCAGTCATAACCGCTCGTTAAATTATGTATACTCTTCAGTCAACGCGCGTTTCTACTTGTACATGGTCCCGGGTCCATCTGATGTGGGTCCGTTTAATCTCGAGGTTAGAAGGAGAAACCGATGCCGCCATCAACAAGCG
"
thymineCount <- 0
adenineCount <- 0
cytosineCount <- 0
guanineCount <- 0
for(i in 0:nchar(rosalindSequence)){
  if(str_sub(rosalindSequence, start = i, end = i) == "T"){
   thymineCount <- thymineCount +1
  
  }
   if(str_sub(rosalindSequence, start = i, end = i) == "A"){
   adenineCount <- adenineCount +1
   }
   
  if(str_sub(rosalindSequence, start = i, end = i) == "C"){
   cytosineCount <- cytosineCount +1
  
  }
  if(str_sub(rosalindSequence, start = i, end = i) == "G"){
   guanineCount <- guanineCount +1
  
  }
}
print(c(adenineCount,cytosineCount, guanineCount, thymineCount))

[1] 229 249 242 232

Throughout this notebook, I learned several skills. I was able to create a list of nucleotides using c(). Using the sample() function, I was able to use the nucleotide list I provided to create a random genome. The collapse() function allowed me to collapse the genome into one single strand, as opposed to separate single characters. I learned that the set.seed() function allows for the code to set a seed for random number generation to enable reproducible results over time. I learned how to create a for loop to repeat instructions provided, and I used if/else statements to ensure the code is only ran when certain criteria is met. This notebook also taught me how to scan in data from other sources, like the vibrio genome and the Rosalind sequence, using the scan() function. I learned how to correctly solve bioinformatics problems on the Rosalind platform.

Replication Origin I

In this notebook, I will be working towards developing code that can be used to find the origin of replication in a genome. During DNA replication, proteins must locate the origin where the process begins. Locating specific patterns in the genome are the first step in the process of determining the replication origin. Throughout this notebook, I will be creating and using reusable code.

Challenge 1

In the first code block, I am reading in the nucleotide_frequency function for further use. There are two arguments, a string (genomeString) and a nucleotide (nucleotide). This function counts the number of occurrences of specific nucleotides in the string. I set the container, count, to a starting value of 0. The for loop instructs the code to run through all characters in the string, and the if statement instructs the code to only run if the nucleotide indicated is present. By setting count to count + 1 following the if statement, the code is set to increment by 1 every time the target nucleotide occurs. I am then returning the count of nucleotide occurrences.

nucleotide_frequency <- function(genomeString, nucleotide = "A"){
  count <- 0
  for(i in 1:nchar(genomeString)){
    if(str_sub(genomeString, start = i, end = i) == nucleotide){
      count <- count + 1
    }
  }
  return(count)
}

Using the sample() function alongside the paste() and collapse"", I generated a random genome with a length of 2000 nucleotides, and stored it as randGenome. Using the nucleotide_frequency() function, I instructed the code to pull from the randGenome sample I just created, and indicated the target nucleotide to count as nucleotide="C" to count the occurrences of Cytosine.

  genome <- 2000
randGenome <- sample(Nucleotides, size= genome, replace= TRUE)
randGenome <- paste (randGenome, collapse="")
nucleotide_frequency(randGenome,nucleotide="C")

[1] 499

Challenge 2

In this challenge, I created a function to compute random genomes. This will make future work in this notebook easier because I no longer need to create genome lists, but instead can pull using the rand_genome() function. I provided the function with the nucleotide list to use, denoted the size parameter as k, and used the paste() and collapse'' functions to create one string. To do so, I just need to insert rand_genome() and place the desired genome length in the parentheses, using the k parameter.

rand_genome <- function(k){
  nucleotides <-c ("A", "T", "C", "G")
  rand_genome <- sample(nucleotides, size= k, replace= TRUE)
  rand_genome <- paste(rand_genome, collapse= "")
  return(rand_genome)
}
rand_genome(k=15)

[1] "CTCCCCTATCCTAGC"

Challenge 3

For this challenge, I will create a for loop that can run all the characters in a genome and count consecutive characters. This for loop will count and return all the substrings of 3 nucleotides in the genome, known as 3-mers. I used the rand_genome() function within the new generate_3_mers function that I created to generate a 2000 nucleotide sequence. The generate_3_mers function then reported all of the 3-mers in the genome.

generate_3_mers <- function(genomeString) {
  list_3_mers <- c()

  for(i in 1:(nchar(genomeString) - 1)){
  list_3_mers <- list_3_mers %>%
  append(str_sub(genomeString, start = i, end = i + 2))
    }
  return(list_3_mers)
}

list_3_mers <- generate_3_mers(rand_genome(k=2000))

To check if the function collected only 3-mers, I used the tail() function to print the last 25 substrings collected, using n=. I was able to call the list_3_mers data because I named the list in the previous code block. The tail function showed that the last substring was a 2-mer, not a 3-mer (“AA”).

tail(list_3_mers, n= 25)

 [1] "CCT" "CTT" "TTT" "TTT" "TTA" "TAG" "AGA" "GAT" "ATT" "TTT" "TTG" "TGG"
[13] "GGG" "GGA" "GAG" "AGA" "GAG" "AGT" "GTT" "TTA" "TAA" "AAA" "AAA" "AAA"
[25] "AA"

Challenge 4

For this challenge, I am adapting the function from the previous challenge to create a function that reports k-mers of any size. I did so by adding the k value. I named the list list_k_mers so I am able to call upon this list during future work. The value next to rand_genome at the bottom of the code block determines how many characters the genome will be comprised of, while the value next to myGenome in the last line of code determines how long the k-mers reported by the function are. If I give the rand_genome function a value of 10, named myGenome, then the genome will be 10 nucleotides long. If I give the generate_k_mers function a value of 5, pulling from myGenome, then the function will report the k-mers that are 5 nucleotides long.

generate_k_mers <- function(genomeString, k) {
  list_k_mers <- c()

  for(i in 1:(nchar(genomeString) - (k-1))){
  list_k_mers <- list_k_mers %>%
  append(str_sub(genomeString, start = i, end = i + k-1))
    }
  return(list_k_mers)
}
myGenome <- rand_genome(10)
myGenome

[1] "TTGATCTCTA"

generate_k_mers (myGenome, 5)

[1] "TTGAT" "TGATC" "GATCT" "ATCTC" "TCTCT" "CTCTA"

Challenge 5

Here, I created a function, count_pattern, to count occurrences of a particular pattern within a genome string. The bottom line of code, count_pattern(myGenome, "") tells R which pattern to count throughout the genome. This is function that will be adapted later in the challenge, so this code block is just the outline. In the if statement, the end = i + 1), the +1 value can be adjusted to account for a larger pattern. The count_pattern() line can also be adjusted for future use. The (myGenome, "TA") can be adapted to a particular string and pattern, which will be seen next.

count_pattern <- function(genomeString, pattern){
  count <- 0
  for (i in 1:nchar(genomeString)){
    if(str_sub(genomeString, start = i, end = i+1) == pattern){
      count = count + 1
    }
  }
  return(count)
}
  
count_pattern(myGenome, "TA")

[1] 1

Here, I read in the Rosalind sequence to test that the code is properly working, naming it rosalindString using the <- arrow tool.

rosalindString <- "CGCTGCAATGTGCAATGTTGCAATGTGCAATGAATTATGCAATGAATGCAATGTTGCAATGACTTGCAATGTATGCAATGGTGCAATGTTGCAATGTGCAATGTGCAATGACCATGCAATGTGCAATGTGCAATGAATTGCAATGTGCAATGTGCAATGTGCAATGTATGGCTGCAATGTGCAATGTGCAATGTGCAATGGTGCAATGCCAGTGCAATGGTTAAGTGCAATGAGTCTTGACTGCAATGTTTTGCAATGGTTTTGCAATGTGCAATGCTGCAATGGAATGCAATGTGCAATGCTGCAATGTGCAATGTGCGTAAGTGCAATGTGCAATGATGCAATGCCTGTCCATGCAATGTGCAATGTCGGTGCAATGTCATTATGCAATGTGAAGCGAATATGCAATGATGCAATGGTGCAATGTGCAATGTTGCAATGTGCAATGCAATGCAATGGTGCAATGATTTTGCAATGTTGCAATGCCTATACCTAGTCCATGCAATGTGCAATGGTAAATGCAATGGAGTGCAATGCTATGCCTTGCAATGTTGCAATGTGCAATGCTCCTGCAATGGCATGCAATGTGCAATGGTGCAATGGAACTGCAATGCTGCAATGATCCAGATCCGTGTGCAATGTTTGCAATGTGCAATGAGGGCGTTGCAATGCCGAGAGCTGCAATGGTGCAATGGTGCAATGACAGATGCAATGGTGCAATGGATGCAATGCCTGCAATGCTGCAATGTGCAATGTGCAATGATGCAATGTGCAATGTGCAATGGCCATTGCAATGTGCAATGTGCAATGTGATCAATTGCAATGTGCAATGTTGAATGCAATGCTATTCGGTCTGCAATGTTTGCAATGTGTGCAATGTGTGCAATGCTGCAATGTTTGCTGCAATGAAATATGCAATGCCCTTTGCAATGTGCAATGTTGCAATGATGCAATGGCCCGCTGCAATGAACATGTTGCAATGTGCAATG"

In this code block, the outline function is adapted to run the Rosalind sequence. I adjusted the function by changing the if statement to i+8, because the pattern that is being looked for in the sequence is 9 characters long. In the last line of code, count_pattern(), I replaced the string to pull from rosalindString, and I copied in the pattern that is being counted, provided by Rosalind. Copying my solution into Rosalind, I determined that my answer is correct, meaning the code is working properly. (I adjusted this code to be functional in Replication Origin II, changing the end= i+8 to end= i+ k -1. Having the end= i+8 caused the count_pattern function to not be usable in Challenge 1 of Replication Origin II, because the string was not 9 nucleotides long.)

count_pattern <- function(genomeString, pattern){
  count <- 0
  k<- nchar(pattern)
  for (i in 1:nchar(genomeString)){
    if(str_sub(genomeString, start = i, end = i+k-1) == pattern){
      count = count + 1
    }
  }
  return(count)
}
  
count_pattern("ATATAT", "AT")

[1] 3

While completing this notebook, I learned several skills and concepts. I learned to use functions to write reusable code, which can be helpful when trying to find similar things in several different genomes. Using functions to write reusable code prevents having to rewrite code several times to complete the same task within different genomes. I have worked more to learn how to understand and break down genomes using code. In this notebook, I learned how to count the number of occurrences of patterns in the genome.

Replication Origin II

In this notebook, I will continue developing code with the goal of finding the replication origin in a genome. DNA replication begins in a certain position on the genome, known as the origin. This origin consists of a set of DNA sequences that signals the cells to start replication.

Challenge 1

In bacteria, one of the proteins involved in replication is known as DnaA. DnaA binds to DnaA boxes, sequences in the genome, strings of 9 nucleotides that are more frequently repeated around the origin of replication.

In this challenge, I will write code to search for frequently occurring 9-mers. I will begin by using a previously created function to practice counting k-mers. Using the generate_k_mers function I previously created, I read in the sample sequence from the notebook guide and named it myGenome. By changing the value on the last line of code next to myGenome, I instructed R to report the 3-mers in the sequence.

generate_k_mers <- function(genomeString, k) {
  list_k_mers <- c()

  for(i in 1:(nchar(genomeString) - (k-1))){
  list_k_mers <- list_k_mers %>%
  append(str_sub(genomeString, start = i, end = i + k-1))
    }
  return(list_k_mers)
}
myGenome <- "ACACAGACATCCCACCCC"
myGenome

[1] "ACACAGACATCCCACCCC"

generate_k_mers (myGenome, 3)

 [1] "ACA" "CAC" "ACA" "CAG" "AGA" "GAC" "ACA" "CAT" "ATC" "TCC" "CCC" "CCA"
[13] "CAC" "ACC" "CCC" "CCC"

I will now adapt the generate_k_mers function to create a function that will count and report frequently occurring k-mers. This code block is intended to find and count k-mers in the genome, loop through the list of k-mers and return the most frequently occurring k-mers. By giving k a value of 4, this code will return the most frequent 4-mer. I can change this k value to determine what length of most frequent k-mer is returned.

frequent_kmers <- function(genome, k) {
  kmers <- generate_k_mers(genome, k)
  kmers <- unique(kmers)
  kmer_counts <- rep(0, length(kmers))
  for(i in 1: length(kmers)) {
    kmer_counts[i] <- count_pattern(genome, kmers[i])
  }
  max_count <-max (kmer_counts)
  frequent_kmers <- kmers[kmer_counts==max_count]
  return(frequent_kmers)
  
}

smallGenome <- "ATCCATTAT"
frequent_kmers(smallGenome, k=4)

[1] "ATCC" "TCCA" "CCAT" "CATT" "ATTA" "TTAT"

Now, to ensure my code is working properly, I will test it on a Rosalind sequence. This code block is using the Rosalind sample sequence, named rosalindGenome, before downloading the full genome.

find_frequent_kmers <- function(genome, k) {
  kmers <- generate_k_mers(genome, k)
  kmers <- unique(kmers)
  kmer_count <- rep(0, length(kmers))
  for(i in 1: length(kmers)) {
    kmer_count[i] <- count_pattern(genome, kmers[i])
  }
  max_count <-max (kmer_count)
  frequent_kmers <- kmers[kmer_count==max_count]
  return(frequent_kmers)
  
}

rosalindGenome <- "ACGTTGCATGTCGCATGATGCATGAGAGCT"
find_frequent_kmers(rosalindGenome, k = 4)

[1] "GCAT" "CATG"

Here, I read in the full Rosalind sequence, as data, and then named the data genomeString.

data <-"ACCGCAGAGTCCCTATGGGCAGCCGCACGCCCTATGGGACCGCAGAGTACCGCAGAGTTGCCACGGATGCCACGGACCCTATGGGTGCCACGGATGCCACGGACAGCCGCACGTGCCACGGACAGCCGCACGCAGCCGCACGACCGCAGAGTACCGCAGAGTACCGCAGAGTTGCCACGGATGCCACGGACCCTATGGGTGCCACGGACAGCCGCACGTGCCACGGACCCTATGGGGGAGTGGTTGCCACGGACCCTATGGGCAGCCGCACGCAGCCGCACGACCGCAGAGTGGAGTGGTTGCCACGGAACCGCAGAGTGGAGTGGTTGCCACGGACAGCCGCACGTGCCACGGACCCTATGGGGGAGTGGTTGCCACGGACCCTATGGGACCGCAGAGTGGAGTGGTACCGCAGAGTCAGCCGCACGTGCCACGGAACCGCAGAGTACCGCAGAGTGGAGTGGTCCCTATGGGCCCTATGGGCAGCCGCACGGGAGTGGTCAGCCGCACGTGCCACGGATGCCACGGACAGCCGCACGACCGCAGAGTGGAGTGGTACCGCAGAGTCAGCCGCACGCAGCCGCACGTGCCACGGAACCGCAGAGTCAGCCGCACGGGAGTGGTGGAGTGGTTGCCACGGAACCGCAGAGTACCGCAGAGTGGAGTGGTCAGCCGCACGACCGCAGAGTTGCCACGGACAGCCGCACGTGCCACGGAACCGCAGAGTCAGCCGCACGTGCCACGGACAGCCGCACGTGCCACGGATGCCACGGACAGCCGCACGTGCCACGGAACCGCAGAGTGGAGTGGTCAGCCGCACGCCCTATGGGGGAGTGGTCAGCCGCACGTGCCACGGACCCTATGGGACCGCAGAGTGGAGTGGTACCGCAGAGT"
#The genomeString
genomeString <- data[1]
genomeString

[1] "ACCGCAGAGTCCCTATGGGCAGCCGCACGCCCTATGGGACCGCAGAGTACCGCAGAGTTGCCACGGATGCCACGGACCCTATGGGTGCCACGGATGCCACGGACAGCCGCACGTGCCACGGACAGCCGCACGCAGCCGCACGACCGCAGAGTACCGCAGAGTACCGCAGAGTTGCCACGGATGCCACGGACCCTATGGGTGCCACGGACAGCCGCACGTGCCACGGACCCTATGGGGGAGTGGTTGCCACGGACCCTATGGGCAGCCGCACGCAGCCGCACGACCGCAGAGTGGAGTGGTTGCCACGGAACCGCAGAGTGGAGTGGTTGCCACGGACAGCCGCACGTGCCACGGACCCTATGGGGGAGTGGTTGCCACGGACCCTATGGGACCGCAGAGTGGAGTGGTACCGCAGAGTCAGCCGCACGTGCCACGGAACCGCAGAGTACCGCAGAGTGGAGTGGTCCCTATGGGCCCTATGGGCAGCCGCACGGGAGTGGTCAGCCGCACGTGCCACGGATGCCACGGACAGCCGCACGACCGCAGAGTGGAGTGGTACCGCAGAGTCAGCCGCACGCAGCCGCACGTGCCACGGAACCGCAGAGTCAGCCGCACGGGAGTGGTGGAGTGGTTGCCACGGAACCGCAGAGTACCGCAGAGTGGAGTGGTCAGCCGCACGACCGCAGAGTTGCCACGGACAGCCGCACGTGCCACGGAACCGCAGAGTCAGCCGCACGTGCCACGGACAGCCGCACGTGCCACGGATGCCACGGACAGCCGCACGTGCCACGGAACCGCAGAGTGGAGTGGTCAGCCGCACGCCCTATGGGGGAGTGGTCAGCCGCACGTGCCACGGACCCTATGGGACCGCAGAGTGGAGTGGTACCGCAGAGT"

#The value of k
k <- as.integer(data[2])
k

[1] NA

Now, I used the find_frequent_kmers function to run through the entire sequence, pulling from genomeString. The code stores the list of most frequently occurring k-mers as frequent_kmers. The desired kmer length to find listed by Rosalind is 14, so I set the k value to equal 14. I used noquote(frequent_kmers) to print the kmers out in a single line with no quotations, and got the Rosalind challenge correct, meaning that my code is working properly.

frequent_kmers<- find_frequent_kmers(genomeString, k=14)
noquote(frequent_kmers)

[1] CAGCCGCACGTGCC AGCCGCACGTGCCA GCCGCACGTGCCAC CCGCACGTGCCACG CGCACGTGCCACGG
[6] GCACGTGCCACGGA

During DNA replication, DNA polymerase can make mistakes, meaning that nucleotides may not appear in DnaA boxes the way we may expect them to. The replication signal that we are looking for to lead us to the origin may appear as the k-mer signaling that replication should begin, or as its reverse compliment. To account for this, we must create a function that counts reverse compliments as well as k-mers.

Adenine and thymine are complementary base pairs, while guanine and cytosine are as well. In double stranded DNA, these complementary base pairs ( A & T, G & C) should be across from each other. If we look across from a segment of DNA, we can see the reverse complement of that segment. This refers to the fact that the complementary sequence is anti-parallel, meaning 5’ and 3’ ends are oriented in opposite directions on each side of the double strand. Meaning, we read the complementary sequence in reverse order.

Challenge 2

In this challenge, I will write a function to compute the reverse_compliment() of a genomeSubString.

This reverse_compliment function is created to run through the entire sequence that is provided as input, and output the reverse compliment of this sequence. The code is instructed to return the complementary nucleotide each time it loops past a nucleotide, using if statements to instruct the code to run only if the target nucleotide is present. Meaning, every time the code encounters an Adenine, the reverse_compliment function will return Thymine, and vice versa. This is also applied for Cytosine and Guanine, using if statements as well.

I am using the Rosalind Reverse Compliment Problem sample dataset, listing it as substring to ensure that my code is working properly.

reverse_compliment <- function(substring) {
  substring_list <- str_split(substring, "") [[1]]
rev_substring <- rev(substring_list)
reverse_compliment <- c()
for(i in 1:length (rev_substring)) {
  if(rev_substring[i]=="A"){
    reverse_compliment <- append(reverse_compliment, "T")
  }
if(rev_substring[i]=="T"){
  reverse_compliment <- append(reverse_compliment, "A")
}
if(rev_substring[i]=="C"){
  reverse_compliment <- append(reverse_compliment, "G")
}
if(rev_substring[i]=="G"){
  reverse_compliment <- append(reverse_compliment, "C")
}
}

reverse_compliment <- paste(reverse_compliment, collapse = "")
return(reverse_compliment)
}
substring <- "AAAACCCGGT" 
reverse_compliment(substring)

[1] "ACCGGGTTTT"

Here, I am again testing to ensure my code is working properly to return the reverse compliment by running the Rosalind Reverse Complement Problem dataset.

Here, I am scanning in the dataset using the scan() function, naming it revcomp_genome.

revcomp_genome <- scan ("C:/Users/owens/Downloads/rosalind_ba1c (1).txt", what= "character")

Now that I have scanned in the dataset, and am now running the reverse_compliment function I created through the Rosalind genome. I solved the Rosalind problem correctly, proving that my code is correctly returning the reverse complement of the genome input.

reverse_compliment(revcomp_genome)

[1] "ACTTGTCGTGTCGCTGCCCGGCGGTAACCAGGTGTGTAGCACTGGGCGTATCAACTTAGACTAATAGTGAAGTCTGACAAGAGGTGTAAGGAGAAAAAAACCAGTGAGTATACATGTCTTGCGCAGCGTGGTTTAGAACGAAACCAGCGGGCCTATAGATCTAATTCCGTAATGGTGCAATTCTATACCCTATATCTCGTCAAGGCGGGTACCAGAATTCCCGAGGCTTCCTTGTCCTACTACGGTAGTCATGTGAAGCAGGCCCACTCGCAGTGAGATACCGGCCCGCAGACAGATTAACACCCACGAGCTGGGTGGAGCCGACTCGAGAGTTTTCTATGATCATTAGTATTCATCCTCCCGATGAAGTATGCCATCGTAAACTTTACACCGCACCTTAGTTCAATGTATAATGAGTATTCAGCGGAGGCCATGCTCGTATGTGTTTTGTTTGTATGGATTACCTTCGCGATCTACACTGGCGGCTGTATTGGTACACATCCTCATCCCCGTAGTATAGGAATGAAGTGAGGCTGGTGAGCCTGAAAAAATACCGGAGAGCAAATGTAATAGTGTAGAAAAGCACTGTCCAAAAAATGTCCCTTTTGTGGCGGTTAACATGAAGTTTTAATTCCGGAGATAGTGTTATCTAGGCCGTTTATCCTTCCTTTCGCCTGCTGGTTGACCCCTGCACGCATCTGCTGTACGAACGACACCAGCACAAGAGATCGCCAGGCCTGCGCCTATGTGGGCCCTTGTCTGTGGAAGCGGGTATGGGGAGCCTCACTTTAGTGAGCGTGTCTGAGCGATGCGAATGGCGTTTATTTCGTGTCACACGGGGAATTCTGTGGGGAGTGATTATGGGTAATTTCGAGACTTGGGAGCTGAATGAACCCGAGAGCTCTTTGAGAGAACCTGGTCCAGAGCGCCTACGTGGTCTCACCAGGATTTCTTTCAATGCATGAAAACGCTGAATCAATGCCCTGTCTTCTACTGGTTTTTCAACCTTGATGTGCCGTGTAAAACTAGTTTCAGCAAATGATCCTGGAGTTAAAGATAATTGACACACTTAGTCATTTGGCTGGTGTACATCCTTCAGAAGTTCAGGGGGGTTCGGAATAGGACTGCGAGGCCGAATGATCAGACCCCGAGGATGATCAGGAGAGTACACGTGGCACGGGAACACTGAATTGCCACTCCCGTGGAGACACAGTAATTGGCAACCGTCCAAAATAACATCGTAATGCTTCTGGTTATTATACAATTGCGAGTGTACTGTATTTCCATGGAGAGATCTGCAAAAGCAACGGTTCCGATCTCCCCAATTAACTTAAATTGTATCTGTATGGCCCTCTTCAAAACGATCGCTTCGCAACCCCAGGGTAACAGATGCATCACCCCTTGGGCAGGTCTGAAGATGTCGCAAGTATTTAAAACTGCTCTGCCGAACAAATCGTGGTGTGCTAGTTCGAGGAGTGGGACGGAAGGTACGAGCCAGCGCAAAAAAAGCTCTGGACTCCGCTGCGCCTCTTATACTGATAGGTCCTTCCTGACCTCTCTTTGGATAACACTTTTACTTGCCTCCCCCGTGGACGCATTCCCCCGCAGCTGACACAAAGTAACACACCAAGGAACACGCATGCAACCGATCACTTGTCCGGGACCATTAGCCCGTTGTCAGCACCTGGGGTGGAAAAGGAGACGTAGGACGCATCGGTCCATGATCAAACCACGTTTCAAGAGAGGAAGCCGAGAAAGCTGATGAACGCACCGGAACCACGCGGACTCATTACGACAGCCTAGAACAGTGTTCGTCTCATCTAACGCCCGTCTTTCTACACGTGATTGTAAGATTGGCTGCTGATAGGGCTTTGCCAAACATGTGTAGTCCAACTGCATGATGTCAAAGCTAACGGCGGCTAAAAAACTGGGAATCGTTTCTCTCACTCAGCCGTGCCGTCGTTCGATGACTAAAGGGCTTGCATACATGCGGGCAAGCATTATTAAATGCAATTTACTACGCGTATTTCAGATCACCGAGCGGGCCTAATGTACGCTTATAGTTGAAGTGGTGTTAAGCAAAGGTGATAGCCTTGGTCGTTGAGCGGGACATCTAGTGGACGGGTTAGGAGCACACTAGGAGCACCCAGCTTAAACGCTTAGACGGGCCGGCTTAAGTCCATATTCCGTCTGTACCGAATATGGCAGTGAGAACTCTAGAAAAACACGATTTTGCGTGTACTAAGATAGTGATCCCACAGCAAGGGGCCACTATATAAACTACTTTCTGAATCCCACCATAAGTATCACTAAAGCGAACGTGGCAGTGATCACACCCGTGCTTGATCGTCGTTATGGCTATTCCTACATTGTACTGATAATAGGAATTACCACAGTGTCTTGCGATAGTAACCTCACGAGAGGAAGACACCCGAGCGCTTCATTATCCTAACAAAGAGTGTTATTAGACCAAGTCTCATACCTAGCGGTAATGGTTGCGGCCTAACTTGGCGTGATACAGCGAGGGGCACCCAGTGTGCTCGCTTTTGACTGTTCCCTGTATGTGTGGGACACCGTTCCCAAACGTCCTGAAAGCCCACAGGTACATTTAGTAGCGCATCAGGAAATAATCTCATACCTGTTAAACTCAACATAGAAGCGACCCTAGCTCTGCTTCAGTCTGCTGAGGAATCTCAAAGTGGCATGACCAAAAATGTGCAGTGTCGCTTGATTTCGGCGCGACAGTACGCCAAACACTCCTTCCCGTGAAGACCCGTCTTTGTACAATGCTTCTCGAAAGCGTGAATCTCCTTGGACGTACAAAGCTGTCATTGAGAATGCACACAATAGTTATTGGCCCGGCGAATCAGCATTAAAGGACTTAACAAAAACATCAACAGGCCCGCATTTGAAAACGAAGTGTAGGGACTCTGCAGGACCGTAAAGCGTATCAGAACGGTCGAACTATTGTGTCGGGGGTTTTAAGGGAGGGGAGGACACGTCTTATCGAGCAACGATTTATGCGGTCACGAGAGAGAACACACACCGCAAGTCCACGCTACCATCAATAATTAACTAGGGCTAACCGCACGCATAGTAATGCTAGTTTCGCGTGACTGAGCGTGCAAACTCGACAATGCAGTGCTTAGTGGAGGCCGGGCATAAAAAAGCGACGGACCTCACCGTTGTGGGGTTTTCTCCTCTACCTATTGAATGCTTCTTCTGTACCTTTGAACACACTCCGTCGGTTAGCTCAGAGTTTTGTGGTAAAGAAACATATTTATGATTCCTCAACAACAGTTCGACATATCTATCGTGCTTTTAATTCTTTTTTAAACGGTCCAGGCCTCTTGGCGCTCTGACCGACTAATTTAAATTGTAGCGTGATCACTCACGAGGTCGACATTGGACTGTCGCCCGAGTCCAAGTTATCCAAGCGCGCGCATCACCTCGAAGTTAGCAACGGTACGCTGTGATGTTCAATCTTCGTACGACTCTTTGCCAACAGATTGAGGGAAGGATACCTAAGCACGTCGTATCTGAATGGATAGCTGAAAGCATTTGGGGAACGGATCAGATACTCCATATCAGCGGAACGACTAACCGCCCTATTTATTCCCGATGCCAATATGCGATCATGTTCTACGTCTAAAACTGGTGACTCGTGAATGAGCCAGATAAAGCGCTAATCAACTGTCGGTGCAGGGGTGCAGGCTGAATATGAAACATGAGCTTCGGAAATGGAATCTTCAAAGTTCTTCGCGGAACGGATCTGATACCCTCGCTTCTCTTATGTATTCTATAGGAGGAGCGACCGCGTATCTTAATTGAAGGTTTCAGAGCCATATTGGTCCCTGTCACCACAGTCGCCGGTTGATCGTTCGGGTTAGCACCGATCAATCCTCGGTCCAAAAGGGGCCGAGGTCGATCGCACGAGAGTACCCGCGGTATTCTCTCTCCAATATCGAATCTTGACCTGAACCTGCCGAATCTTTTGTTAGCTTAATCGACCACAACTCGACGCGCACCCTACCCTGCTAGCCGATCCGTCTCCGACGCACTGCTTCAACTTAGACCTGCGGTAGAACTGGTATCAATCTACTTGAGCGACGGTACCCTCGTTACTCCTCCCCAAGGGATGGGCATCTATCGTAGTATAATAGGACCCGATCCACGCCCTTCCCGACAAGTGACCGTCGTCAACTAAAGGAATTGTTTGCGACAATACCCGAGGGGACGGGGGTTAACTCCGACAGATGGGGCACTCGCTTAACGCTTGGTACAGTAGTCCGCCTTACCTGCCAACTGAAGGCGCAAGTTACCGCCTTGTCACAATGACTACAAGGAGTGGGCGCACTGTACCCAAGTTTCCTAGTGTTAGCCGTGTATGCTTGATAGGGGAGCGATCACCTCGTTCGGTTCACCTAAGCACCCGGCAGGATCTGAGCTGTGGTGCTCGTATGTTTGTAGGAACTCCAGTTCGAGCCTCAGATCTGAACAAGCCGAGGGGTGCAACGGAATTATATACTCTCGTTCAAGTTGCCGAGATGGGCCGTGCAGAAGAGGCGGCCGAACTTGTTCGGTTTTCGGTGAACTCCAAAACCGAACGCGACCAACTATTGAAACGTAACACAGTGCTGAATTTGCAAAGCCAATACATTGTACGCCCCCGCGTCGGTCGGAAGCCCCTGTTTAAGATGGATGGGGGTATGAGATGTCCCTGCTAAGAACACAGTAGAATCACGGTGATGGGGCAAGGTGGGCAAAATTTTTCAAATTGTATAGCGCTGTTCGTTTAGACCGGCTAAAATCGCGGCCAGGACAGGGGTATAGAGGACTGTCGTCCTGAAACCTTAAGGGTGTCCCTGCTGTAGCTCACCTAAGAGCGAGCTCTGGCGCGTCCAGTGGTGTTTCTAAATAACTCCAGTCCTCTTAGCGTCATAACACCAAAGGAGGTAGACGAAAGGAACCGTATTTAAACCATCAATGATACCCTCTCCAATCCTTTTTGGCAATCCATGAATTCCAGGTGGTGACCTGCCTCATTGTCTCAGCATCCTGTTTCGCAAACGGTTGTCCATTGGTATAAGAACTGTATCCGCTGTAGCGTGTATTAACAAAAACCGTTGGCATTTGGATAGCCACACTCCCGATGCCTACCTGATGGCGTCGGCCAGGACCTAGCCACGCAGATGCGGTGGAATCCGCAAGTACTGTCTAGCATAGAAGTGGATCTCATGGCCCATTTTAGGTACCTATTCGTGGGCTGTGACTTATCTTAGCCTGAGTAGGCATGATTCGACGCATCCTTTCCTATAGTAATACTTTCTGGAGATGCATATCCTCCTTTTGACAACGGGACTAGGGGACATTTAAAATGTGACCGTGCCGCACCAAGTAACTAACTCATGAAGTGCACCTGTAATTTATCGAATTGTATTCGATGTACCCGCGGAACGCCGGGCGTTTACCACTCCTTCCATCGGCCATTTGAGTGCCCGAAGCAAGTCAAGCTGTGAGTTATCGTCAATTACGTCTTGTACGTTCGTGCAAAAACCAAATGAAGATGAGTTATTACCACGCGCAGTAGGGATTTAGGTGCTGTTCGGGTGTCTTTCTCCGGGAAGAATGACTGGCCTGACATGCTAAACCTTGGGAACCGTAGGCTTTGATGTACGAGCACGCAAGGACGATGTGTGCTATGGTCGTGGGCCTCCACCGGATTGTGGCGACCACCTGCCAGAGAGGTGGACTTAAAAGTTACCACAGAAACGATGACGACTGCGAAAAACTGACATGGAAATTTCGTCGCGCTATGGCAAAGAACCGCGCCAATCACGATCTCACCAAGATTTGCTTCATGGACTGGACCCCCGGAGAGCGGTTGGACTGGGGACAAAAAGGTTTCCAGGACCATGCGCGGACCTAGAGCTACGGTGCGGTTCTCTAAGGGGTTTTTTAACCTCAAGCTTCCTTGACTCGTATCGGATAACCTAACGAGCACAAGTCAGATTTCAAATGTGGCTTCCGGACTTGCAGCTGACACCACCTGGATGGGTCGTCGATAGAAAGCCAGTGACAAGCGTCAGGTCCGCGAACGTTACGGCGGCATGTATTCAGTTTTAAGTAAACCGGCCTGGAGCGTGCCGCGAGTCCGTCTGAGGGATAAGCTTACAGTACAACCTATGAAATATTGCCTATAACCCAGAACATTATTGCTGCCCGATTTAGACCCGCGCTACTTCTTCATCCTTAACCAGTGTTGCTCGGGACTTGTCCTCCAAGTGTCGTCGGCGCAGAAGGTCCAGTCTCTTTACCTCTCACCGTACCGGATACGATAGTTACCTCCCACACCAGGTCCTCTATGCTGGTCAGCGATGTTCGTGATCACGAGAACCCAACTCGTTGTCAGTCCATTTATACACACTCTTGCAGTACTTATGGCATCTACGCTCAGAGAGTAGCTTCATATAACAGACAGGCGTAATCTGGACATGTTTCTCTCGAACGCAGGGGGTTTGCCTTTGAGGGTGTAAACGGCTTCTTGCAGTCGAATTCATAGGACAGGGGCTAGACCCACTAACACGGTTCGAATATACGCCGGGAGGTCTGCACAAAGGTTTATGGCGGCAGGAGTTGCCAGCATCTGTTTCTTGAGGTCAAGCATTGGACCATATATAAAGTCATCCGAGTCGACGAATACGATCGGTTGCACCATCGAATCTCCACCCCAGTGTTCCAAATCTTGCAACGGAGCTAGTTCTAAATCCCCTCGCCCTGGCCTCTGACGAATACGTCGAGGCATTTTTAAGGACCCACGAATCGTGCTCGAGGCGACGGATAGCTCCTCGACCGAGCCGTGACCTTTCGTGATAACGAGGATGGTGGGGCTGGGCATCACCCCAATCTGGGACTGGTCCGACTCAAGTCAAGGAGCTCGATCATCTTACTAGAATGTTGTTACTGATGTGGAGAGTAACACCATTCGAGGTGCATAAGTTCGTTTGGCGGGACAGATTACATCTAATTGTTCGTTGGGCCATACCGTATCAAATTTTTAGCCACGTTGACCGATTTCCGGATAGTTTATCTAAAGGTCATTTGCACCGATATCGAAGAAATGCGCAAGCTGGGAGTGTTTCACAGTACAGGATGGGGGTCCTGCGCTATGTTTTGCTAGGAGAGGCAGAATGTATGGGCAAAGTGCCTATTTCCCCGTGTGAAAGCGCTGTGATCGCATGAGCGTGGAGATCGACTACCAGGAATATACCGGTAGAGTATGGTCTCCACTATTCTGCGGTACGACGCTAAGCTTGTTTAGTCCTGGGATCGAAGCTATTAAAGGTGCGTCCCGTCATGCAGAGTCTCTGACACTCTTGTGCGGGAGTCCTACATTCGGCGAGAGCAGTGTAGAGAGCGCGTTGGCACGGTTGTGATCCCAATGGACTCTCCGTACGTCAGCTATCTCCCTCCCTCAAGGAACCGTCAAGCCGTCGCAAAGAAGCGAAAGGCCCCTGCACGTGTTACTTGGTGAGGAACATGGCCCGCATGGATATTATTTGTCATGTTCCAGAACAGGAACGATGGAGAACCAGAGTCTGGTGAAGATGCCGGGCGCACTGCCTGCCAATCGTAAGACGGTGATCTGCACCCTCCTGCGTAAGGGTGTACTGTGGGTAGGTTAGGTGGTTGGGGCCGAACATTTCCGGAAGAGGTCGAGCTTTGTTGGAGCAGAGCCGTCTCATTAGCAACCGTGCGTACCATGTGCTGCCCAGGCGATGAGTCCAGGCGTCAGTAACAAGCGCGAGTGCGTTGGGTATGCTCTTTGGGCGAGTCCCCCAACCACTCCACACTAGGTGCAAGACTGGTTATAGCCTCGACGGTAGAACGTGCCAGATTCACCTGAAGTAGAAGAGCCATTAAATTCCCGAACGAGTATGCGTGTTAGCGTCTTGTGCGCTGAGAGGTCATGCAGCCAAGCTAGTGTGATCTACTGTGCAAACAGCTAGATAAAGTGCGTGGTTGTTGCGAAATTATCCTGATAGCGTCACGACATTGGGGACGCCCGCTTCCTAAGGTTCGCTTACGGTTTGTCGCCGCAAAGTCGACCACCAATCAGTTATAATAGGTCACCGCTACTAAAAGAGATCATCATCATGAGGATAGCCGCGAAAGTGCTCTGCCGGGAGTTAACGGTGGTTTAGACCTGCACCCCCGAAGAAGAGCCTCGGGATACCCGTACTTAAGTTACTCTTATATTTCACAGTCAGACAGGGATCGTCCTCAGTTTCTCCGTCAGAAATAGTCGGCCGGATAAGCTCTCGAACTAGGCCTCAGCTAGAGCAGTAGGCCGACAAGCTTGCTCACTCATATATGTGGACAATAACCAACCCTAGATCATACCGCAAACTGTCGCGTAAATAGTCATTTTGATTGTACGTCGCACCAAATTCTGTCGACGATGCAAGCGGCATCGCGCTTTAACTAAAACCTCAGTCGGCTGAAAGTCCT"

In this notebook, I created code to search a genome for frequent near k-mers and their reverse complements. Finding these helps us get closer to finding the origin of replication, as the code will return the genetic substrings that signal to begin replication to DNA polymerase.

Replication Origin III

In the Replication Origin II notebook, I was able to find k-mers, near k-mers, the most frequent k-mers and their reverse compliments. Scientists have identified that DnaA boxes within different species of bacteria consist of different nucleotide sequences. To find DnaA boxes in certain bacteria, I need to find high-frequency 9-mers within a specific section of the genome.

In this code block, I am creating a dictionary of all possible k-mers, using the initalize_k_mer_dict function and initialized it with a starting value of 0, naming it k_mers_dict, indicating a length of 9.

initialize_k_mer_dict <- function(k){
  nucleotides <- c("A", "C", "G", "T")

  k_mers_dict <- expand.grid(rep(list(nucleotides), k)) %>%
    unite("k_mers", everything(), remove = TRUE, sep = "") %>%
    unique() %>%
    mutate(count = 0)
  
  return(k_mers_dict)
}

k_mers_dict <- initialize_k_mer_dict(9)

In this code, I created a clump_finding function. This function pulls from genomeString and takes the parameters L, T, and K. L indicates the length of the window that we are looking at and instructs the code to look in a specific area of the genome. The parameter T counts the frequency of occurrences, while K indicates the length of k-mers we want returned. The first line of code, kmer_dict, initializes the table of k-mers. The for statement tells the code what window to look at, setting up the initial window length. The curr_kmer function instructs the code to float through the initial window and find a pattern, while curr_count counts how many times that pattern appears in the initial window. The curr_row function shows where the k-mer appears in the k-mer dictionary kmer_dict. The candidates function represents possible candidates for DnaA boxes in the initial window. If the count exceeds the t value, the k-mer is added to the list of candidates. Following the candidate function, the code will slide down the sequence one nucleotide at a time until it runs through the rest of the genome. Subtracting L-2 from add_pattern prevents the code from running out of the window, reaching the edge of the window without falling off. The rem_pattern and add_pattern functions move through the genome while maintaining the window, adjusting the counts for k-mers. The rem_row function allows us to remove k-mer counts for k-mers that are no longer in the window as we move through the genome. Add_row adds the new nucleotide to the window, moving down the sequence. J represents the left end point of the current window. The final if statement instructs the code to add the pattern to the list of candidates if it exceeds the threshold and updates the list of candidates. The code will run through the entire genome, leaving the last window with a final nucleotide on the right side without falling off the genome.

clump_finding <- function(genomeString, L, t, k){
  
  kmer_dict <- initalize_k_mer_dict(k)
  ng <- nchar(genome)
  for(i in 1:(L-(k-1))){
    curr_kmer <- str_sub(initial_window, i, i+ (k-1))
    curr_count <- count+pattern(initial_window, curr_kmer)
    curr_row <- which(kmer_dict$k_mers==curr_kmer)
    kmer_dict$count[curr_row] <- curr_count
  }
  candidates <- kmer_dict %>%
    filter(count >=t) %>%
    pull(k_mers)
  for(j in 2:(ng-(L-2))){
    rem_pattern <- str_sub(genome, j-1, j-1+ (k-1))
    add_pattern <- str_sub(genome, j + (L-2)-(k-1),j+ (L-2))
    
    rem_row <- which(kmer_dict$k_mers == rem_pattern)
    add_row <- which(kmer_dict$k_mers == add_pattern)
    
    kmer_dict$count[rem_row] <- kmer_dict$count [rem_row] - 1
    kmer_dict$count [add_row] <- kmer_dict$count [add_row] +1
    
    if(kmer_dict$count[add_row] >= t){
      candidates <- append(candidates, add_pattern)
      candidates <- unique(candidates)
    }
  }
  return(candidates) 
}

In order to call this function, it must pass a window length, t-value, and k-value. To test our code in class, we scanned in the E. coli genome and attempted to run the code on this genome, discovering that it takes too long. The clump_finding function does work on smaller genomes, but was too slow to run through the 4.6 million nucleotide E. coli genome. This suggests that the code needs to be further optimized to run this genome.

In Replication Origin Notebooks I, II, and III I learned several new coding methods and tools. Including but not limited to creating lists and sample genomes with the c() and sample() functions, as well as using the set.seed() function to set a seed for random number generation to ensure reproducible results. I also learned how to create for loops, and if/else statements to ensure the code is only ran when specific criteria are met. I also learned how to scan in datasets, and correctly solve bioinformatics problems on the Rosalind problem. I learned how to use functions to write reusable code and count the number of occurrences of patterns in a list, or genome in this case. These tools and functions all brought me closer to finding the replication origin of a genome, although future work is still required.