Tuesday, January 17, 2012

Entry 001: Regular Expressions to Identify Synonymous Codon Locations in a Sequence

What if you need to take a DNA sequence and identify all the synonymous codon codon locations, perhaps to create a shuffling of a sequence without altering the amino acid structure?

First, lets address the problem of finding all the synonymous codons. The way I approached this problem was to first create a tab delimited text file with all the synonymous codons for each amino acid on one line. Here is a Google Document version of the file https://docs.google.com/document/pub?id=1yaWcwXHOFYIIdvlB4VPMmXptiz1SK1JTcpoxKrXnYHw&amp

The first question to ask is what do we need to address this problem? We need a list which maps all synonymous codons to each other in some manner, and we need to be able to parse this list in some easy to understand format. To make the data structure generally accessible for a variety of problems, I would also like to make it able to accept a codon input, and return all synonymous codon possibilities. This suggests a hash data structure. The following code implements one possible structure, which is easy to understand, but not very efficient.

my %synonymous_codons;
open (S_CODONS, '<' , "./syn_codons.txt" ) | | die "Cannot open file\n";
while ( my $line = <S_CODONS> ) {
        chomp $line;
        my @temporary_array = split ( /\t/ , $line );
        foreach my $element ( @temporary_array ){
                $synonymous_codons{ $element } = \@temporary_array;
        }
}
close S_CODONS; 

Basically, this builds an hash table of array references. Each key of the hash is a codon ( 'AAA', 'GCG', etc.) and it returns a reference to the same array where all of the synonymous codons are stored.

It is important to note that when initializing a variable inside a loop, normally as the loop terminates variables created inside the loop are eliminated from the stack (local memory). However, creating a reference to the temporary array inside the loop causes perl to remember it as a global variable. Perhaps perl preserves the data in local memory for all local variables, but only allows you to access it when in the appropriate scope. I am not sure exactly how the compiler keeps track of this, but it is enough to say that this method works. Please reply with a better answer, if you know how this works.

Now lets return to the original problem of finding all synonymous codon positions. Assuming that we need to identify all synonymous codon locations, as opposed to a single amino acid's codons, the following code will iterate over all the arrays in this hash.

 my %unique;
foreach my $syn_codon_ref ( grep { ! $unique{$_} ++ } values %synonymous_codons){
                     [ Some code here]
}
Lets examine the foreach line. What we want the code to do is iterate over each synonymous array once. "values %synonymous_codons" will return multiple references to the same array, because there are a number of reference to the same array equal to the number of synonymous codons for each amino acid. The hash %unique keeps counts of how many times a value is iterated over in the values list, and  grep { ! } only returns the value if the count for the element passed by values %synonymous_codons is undefined ( the value has not occurred before).

Remember that perl references have a string representation, and look something like this, ARRAY(0x8f3d48) or this, HASH(0x1272960). The %unique hash stores scalar values for keys equal to the strings representing each array reference. Passing an array reference to the %unique hash in this case acts like any other string key.


Next we need to do something with each of these arrays, now that we can iterate over them in an controlled way. I do this with a regular expression. What we want is a expression which will find any of the synonymous codons, in a frame specific manner, of an arbitrary DNA sequence. We want to ignore the start codon, and the stop codon as well, since these do not really count in this application. We also want to store the location of each match to the synonymous codons.The following expression will do this.

my @index; 
my $match_string = join '|', map { quotemeta $_ } @{$syn_codon_ref}; 
while ( $sequences[$i] =~ m/\G([ACGT]{3})+?($match_string)/g){ 
         pos($sequences[$i])=$-[2]; 
         push(@index,$-[2]); 
}



First, lets look at the $match_string.  @{$syn_codon_ref} turns our array reference back into an array, or rather it points perl to the appropriate array location. map { quotemeta $_ }  takes each element of the current array and interprets it as a string, and join '|' concatenates these elements as a string. If the synonymous codon array held all leucine codons, the $match_string variable should contain something like this "TTA|TTG|CTT|CTC|CTA|CTG". The '|' character ( referred to as the pipe character ), tells the regular expression engine that any of the strings contained inside the pipe delimited string can be matched. It is equalivalent to an logical 'OR' statement.

The regular expression m/\G([ACGT]{3})+?($match_string)/g  matches the synonymous codons in a frame specific way. ^ matches the beginning of the string, [AGCT] matches any of the 4 nucleotide characters, [ACGT]{3} matches exactly 3 of the nucleotides in a row, and [ACGT]{3}+? matches 1 or more instances of three A,C,G, or T consecutively, but not greedily (? character after any regular expression unit signifies the not greedy option) . The not greedy option is essential for this regular expression. Since in a DNA string, every in frame synonymous codon is going to match [ACGT]{3}+, we need to tell the compiler to match the minimum number of these codons, while allowing the rest of the expression to match. The $match_string variable tells the expression to match the first instance of any synonymous codon in the string. \G is a metacharacter which anchors the match to begin at the position in pos($sequences[$i]).

@index holds all match locations of the current set of synonymous codons. The @- variable stores the locations of the last regular expression string match, where $-[0] is the location of the beginning of the entire regular expression match (in this case it should hold zero for all normal DNA strings), and $-[1], $-[2], $-[3], ... each hold the match location for all capture groups in the regular expression. When inserting parentheses inside a regular expression, you are asking the compiler to remember where it matches the things inside the parentheses. So, ($match_string)  in this instance will tell the compiler to remember where it matched the synonymous codon in the DNA sequence. Since the above regular expression contains two capture groups (  ([ACGT]{3})  and ($match_string) ), the location of the synonymous codon is held in $-[2].

**Important note** Always remember that perl indexes string characters starting at zero, not at one. Therefore, if you print out the location of the match, always increment the number by one so that normal (non-programming) people will not be confused when looking at your data.

Finally, using all these tools together, we have a way to index all synonymous codon locations

my $dna_seq=<>; ##pass DNA sequence string into program 
my %synonymous_codons; 
open (S_CODONS, '<' , "./syn_codons.txt" ) | | die "Cannot open file\n"; 
while ( my $line = <S_CODONS> ) { 
        chomp $line; 
        my @temporary_array = split ( /\t/ , $line ); 
        foreach my $element ( @temporary_array ){ 
                $synonymous_codons{ $element } = \@temporary_array; 
        }    
} 
close S_CODONS;  
my %unique; 
foreach my $syn_codon_ref ( grep { ! $unique{$_} ++ } values %synonymous_codons){ 
    my @index; 
    my $match_string = join '|', map { quotemeta $_ } @{$syn_codon_ref}; 
    while ( $dna_seq =~ m/^([ACGT]{3})+?($match_string).{3,}/g){ 
         push(@index,$-[2]); 
    } 
    [Do things to synonymous locations here] 
}







This is clearly not the most efficient way to do this task. I feel like it is very adaptable to other tasks however, and I always prefer to adapt code for new tasks. Please feel free to improve my code and post the result as a comment, or if you decide to use the code given here, please post a link to my blog as a comment inside your program.

If you post something very cool, or completely better than what I have, I will edit this post to include it, and reference your account with the credit.


Justin Gardin 

Introduction

Hi Interweb!

I am a Ph.D candidate in a Microbiology Department with a Masters in Applied Mathematics.  I would like to begin an in silico experiment, with the goal of spreading the things I am learning about bioinformatics in perl, and hopefully learning some stuff from those who read the posts I submit. Periodically, I will create posts at this site with explanations about tricks I have discovered or rediscovered  to address bioinformatics questions using the perl programming language. I hope that these posts will be informative to someone, and perhaps informative to me through the vast community of people who really know how to program in perl (those who don't just fake it like I do ;-) ).  Let the learning begin!


Justin Gardin