oilcan

DNA custom function/calculation

Hey all,

Just curious whether any of you have seen a calculation or a custom function that converts a DNA sequence to an amino acid sequence, or vice versa.

I figure it wouldn't be too terribly difficult to make one myself, but if somebody has already built one, it would certainly be a time saver.

Thanks.


C'mon, you mean you don't know how to convert DNA sequences in your head? What's wrong with you?


If it's as simple as changing letters, it should be fairly easy. The converter I found online seemed to suggest that was the case. But then again, I'm not a rocket scientist.

http://www.geneseo.edu/~eshamb/php/dna.php

Leucine (aac → uug)

Glycine (ccg → ggc)

Proline (ggt → cca)


http://www.cbs.dtu.dk/courses/27619/codon.html

That site has a nice, simple table illustrating the DNA to amino acid conversion.

Every three characters of a DNA sequence form a 'codon', which codes for a particular amino acid; each amino acid is represented in a protein sequence by a single letter.
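To make the table lookup concrete, here is a minimal sketch of the DNA to amino acid direction. It is written in Python rather than FileMaker calculation syntax, purely to illustrate the logic. It assumes the standard genetic code, a sequence containing only A, C, G and T, and a reading frame that starts at the first base. The names `CODON_TABLE` and `translate` are made up for this example, and `*` stands for a stop codon.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order:
# TTT, TTC, TTA, TTG, TCT, ... GGG. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, starting at the first base."""
    seq = dna.upper()
    protein = []
    # Step through complete triplets; trailing bases that do not fill
    # a whole codon are ignored.
    for i in range(0, len(seq) - 2, 3):
        protein.append(CODON_TABLE[seq[i:i + 3]])
    return "".join(protein)

print(translate("GCAACAACC"))  # ATT
```

Any character outside A, C, G, T raises a `KeyError` in this sketch; a real solution would need to decide how to report such input.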

You could convert a DNA chain to an amino acid chain easily enough, but doing the reverse accurately would not be possible, since in many cases multiple codons code for the same amino acid. Well, you could make a DNA sequence that would work, but it might not be the same DNA sequence that mother nature invented.
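A rough way to see how ambiguous the reverse direction is: group the codons by amino acid and count how many DNA sequences would give the same protein. This is a Python sketch assuming the standard genetic code; `CODONS_FOR` and `count_encodings` are hypothetical names used only here.

```python
from collections import defaultdict
from itertools import product
from math import prod

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Reverse lookup: amino acid letter -> every codon that codes for it.
CODONS_FOR = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR[aa].append("".join(codon))

def count_encodings(protein):
    """Number of distinct DNA sequences that translate to this protein."""
    return prod(len(CODONS_FOR[aa]) for aa in protein.upper())

print(CODONS_FOR["L"])         # the six leucine codons
print(count_encodings("ATT"))  # 64
```

Even a three-residue peptide such as "ATT" has 64 possible DNA sequences, so a reverse conversion can only ever return one candidate out of many.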

My original thought for my current database project was to be able to calculate an amino acid sequence by supplying a DNA sequence. I'm not sure if I'm still going to use this approach, though it might be nice to develop this functionality anyway.

Anyway, as I said, although I am still quite noobish at designing calc fields, it's probably not the hardest thing in the world to do. I was mainly looking to see whether this type of calc had ever popped up on these forums before; the search function gives me zilch for DNA.

Thanks for the responses.


The part that is unambiguous could be done by a simple Substitute() with multiple arguments.

The other direction... it depends on what you want.


Yeah, I guess the point is that we wouldn't want the reverse direction, protein sequence to DNA sequence, as there is no surefire way to give back the original gene. You could give back one that codes for the amino acid sequence in question, but other important features of the original DNA sequence, such as restriction sites, would change if you didn't get the original sequence exactly. So for any realistic use, only the DNA to amino acid conversion is feasible.

Thanks for the tip on using Substitute. Still learning all of the various calc functions.


"we wouldn't want the reverse direction"

OK, then it should be relatively simple. You need to mind your delimiters though, and include them in the searchString parameter; otherwise you may create a mess if three single-letter codes (SLCs) in a row are falsely identified as a codon.

For example, to convert a DNA string like "GCA, ACA, ACC" you would use something like:

Substitute (
    TrimAll ( DNA & "," ; 0 ; 3 ) ;
    [ "GCA," ; "A" ] ;
    [ "ACA," ; "T" ] ;
    [ "ACC," ; "T" ] ;
    ...
    [ "ATT," ; "I" ] ;
    ...
)

Note that the result for this example is "ATT", which also happens to be a valid codon; since it is not followed by a comma, it will not be substituted again.
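To show why the comma matters, here is a small Python sketch (not FileMaker syntax) that mimics a multi-pair substitution, assuming the pairs are applied one after another in the order listed. `PAIRS` is a deliberately tiny, hypothetical subset of the codon table, and both function names are invented for the example.

```python
# Hypothetical subset of the codon table, applied in order,
# one search/replace pair at a time.
PAIRS = [("GCA", "A"), ("ACA", "T"), ("ACC", "T"), ("AGC", "S"), ("ATT", "I")]

def substitute_naive(dna):
    """Replace codons with no delimiters: matches can straddle codon boundaries."""
    text = dna.replace(" ", "").replace(",", "")
    for codon, letter in PAIRS:
        text = text.replace(codon, letter)
    return text

def substitute_delimited(dna):
    """Replace 'codon + comma' so only whole, correctly framed codons match."""
    text = dna.replace(" ", "") + ","   # append the trailing delimiter
    for codon, letter in PAIRS:
        text = text.replace(codon + ",", letter)
    return text

print(substitute_delimited("GCA, ACA, ACC"))  # ATT
print(substitute_delimited("AGC, ATT"))       # SI
print(substitute_naive("AGC, ATT"))           # AI (wrong)
```

In the last call the undelimited text "AGCATT" contains "GCA" across the boundary between the two codons, so the naive version replaces the wrong three letters. With the comma in the search string, every match ends on a real codon boundary.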


Hmm, I see where you are going with that. However, I think this is slightly different from what I envisioned. First off, delimiters are probably out of the question. The DNA sequences we'd be working with can be quite long, potentially thousands of characters, and given their size they will clearly be copied and pasted in. The source won't be delimited by commas; it will be a straight series such as 'GATACACCAAAGATTTTAGAGGGACCG'.

As I think about this, several checks should be in place, mostly as warning messages. First, determine whether the number of characters in the DNA string is divisible by 3; if not, warn that the string isn't a full reading frame. Next, check whether the last codon is a stop codon, and whether any stop codons appear elsewhere in the sequence, and warn the user accordingly. Do the same for the start codon (the standard start codon is ATG, which codes for methionine). Perhaps at the end of all this it could output multiple amino acid sequences if there are multiple start and stop codons within a sequence.

It would probably be better to approach this as a script rather than a calculation field, so it would have more user interaction: a button that performs the conversion script and feeds back appropriate error messages if necessary, much like the converter linked in sbg2's post.
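The checks described above could be sketched roughly as follows. This is Python, not a FileMaker script, and it assumes the standard stop codons (TAA, TAG, TGA), ATG as the only start codon, and a reading frame that begins at the first base. The function name and the warning wording are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def check_reading_frame(dna):
    """Return a list of warning strings for a raw, undelimited DNA string."""
    seq = "".join(dna.split()).upper()   # drop whitespace from pasted text
    warnings = []
    if len(seq) % 3 != 0:
        warnings.append("length is not a multiple of 3")
    # Only complete codons are examined; leftover bases are skipped.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] != START_CODON:
        warnings.append("does not begin with the start codon ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        warnings.append("does not end with a stop codon")
    # Stop codons anywhere before the final codon.
    internal = [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    if internal:
        positions = ", ".join(str(i) for i in internal)
        warnings.append("internal stop codon at codon index " + positions)
    return warnings

print(check_reading_frame("ATGGCATAA"))     # []
print(check_reading_frame("ATGTAAGCATGA"))  # one warning: internal stop at index 1
```

Splitting a sequence into several proteins at each start and stop pair is left out here; the list of internal stop positions would be the natural starting point for that.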

I appreciate the dialogue on this. It's really helping me through the thought process of how this thing should work. As I said, I'm not sure if my working solution will require this functionality any more or not, but I think I might attempt this thing on my own just for practice. Can't hurt having a script around that can do that. Many thanks.


If there are no delimiters, then you'll need some form of recursion, either a script or a custom function. You would take the first 3 letters of the DNA sequence, replace them with the matching single-letter code, and pass the rest to the next iteration.
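The recursive idea (first three letters, then the rest) looks roughly like this. It is a Python sketch rather than a FileMaker custom function, assuming the standard genetic code and input limited to A, C, G and T; `translate_recursive` is a hypothetical name.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_recursive(dna):
    """Translate the first codon, then recurse on the remainder."""
    if len(dna) < 3:              # base case: no complete codon left
        return ""
    first = CODON_TABLE[dna[:3].upper()]
    return first + translate_recursive(dna[3:])

print(translate_recursive("GATACACCAAAGATTTTAGAGGGACCG"))  # DTPKILEGP
```

One caveat for this Python version: the default recursion limit is about 1000 calls, so a sequence of several thousand bases would need the limit raised or a plain loop instead. Whatever environment you build this in, it is worth checking its recursion limit against your longest sequence.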

It would be difficult to determine in advance if there are any stop codons throughout the sequence - again, because of the ambiguity. It's not a very good coding system: thousands of triplets with no delimiters are obviously not designed for human reading; OTOH, for machine reading it would be easy to come up with an unambiguous alphabet.


Did you ever develop a script to convert nucleotides into amino acids? I am also thinking about this, but don't want to reinvent the wheel.


I couldn't find a pre-made solution to convert nucleotide sequences into corresponding amino acid sequences, so I created two scripts and some value lists to do the job. Each value list contains all possible nucleotide combinations corresponding to a single amino acid. For example, Value List 'A_ALA' for alanine contains: GCT, GCC, GCA, GCM, GCK, GCW, GCS, GCB, GCD, GCH, GCV, and GCN. Notice that some of the letters correspond to ambiguities, hence the IUPAC codes (R, Y, M, K, W, S, B, D, H, V, N) used here. This is important if one is doing population sequencing on HIV viral sequences, where mixed bases are expected. One script loops through all 11679 nucleotide records in our FM database, evaluating each nucleotide sequence one by one. An internal loop steps through the codons in each sequence and concatenates an amino acid letter onto the end of the growing amino acid chain. I thought this would be difficult, but it was not too bad. Our only problem now is that the code doesn't handle entire HIV genomes that consist of multiple genes and real frame shifts. Oh well, baby steps!!!
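For anyone wanting to try the ambiguity handling outside FileMaker, here is a rough Python sketch of the same idea. Note that it takes a different route from the value lists described above: instead of storing every ambiguous codon per amino acid, it expands each IUPAC letter into the bases it can stand for and checks whether all resulting codons agree. Returning "X" when they disagree is an assumption of this sketch, not something taken from the scripts above, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def translate_ambiguous_codon(codon):
    """Return the amino acid if every expansion agrees, otherwise "X"."""
    options = [IUPAC[base] for base in codon.upper()]
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*options)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"

def translate_ambiguous(dna):
    """Translate a sequence that may contain IUPAC ambiguity codes."""
    return "".join(
        translate_ambiguous_codon(dna[i:i + 3])
        for i in range(0, len(dna) - 2, 3)
    )

print(translate_ambiguous("ATGGCNATR"))  # MAX
```

Here GCN resolves to alanine because all four expansions code for it, while ATR could be isoleucine (ATA) or methionine (ATG) and so comes back as "X".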
