Support start codons in translation tables #62
Comments
Can you give me an example with a sequence?
One codon in a given translation table ( |
I know, this should be marked as a feature request. Translation tables assign a different amino acid (basically M) to alternative start codons. The standard table doesn't. It's documented under the link I posted. Best,
Sorry, I didn't read your request for an example. I will post one today!
As an example, see locus ERS432292_01433 in CWZP01000026. Protein sequence: Nucleotide sequence: Translation with table 11, as given in the genome annotation, with seqkit gives: Note that the translation table specifies GUG as an alternative start codon. |
I get it: we should translate the first codon as 'M' when it acts as a start codon.
I think to do this correctly, you need to (1) assume that genes are complete, so that the first codon is actually a start codon, and (2) check whether the first codon is a valid start codon for the specified translation table. I don't know where and how you keep the translation tables, but I suppose this information needs to be added. For instance, table 11 has the following extra start codons in addition to ATG: TTG, CTG, ATT, ATC, ATA, GTG. BTW: there is a similar issue with alternative stop codons for some of the more exotic translation tables.
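To make the two conditions above concrete, here is a minimal Python sketch. It is not seqkit's implementation: the names `translate`, `complete_gene`, `CODON_TABLE` and `START_CODONS_TABLE_11` are made up for illustration, and the start codon set is simply the table 11 list quoted above.

```python
from itertools import product

# Codons in NCBI order (bases T, C, A, G; third base varies fastest).
BASES = "TCAG"
# Amino acids for table 11; for non-start positions these match the standard table.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Start codons of table 11 as listed in this thread.
START_CODONS_TABLE_11 = {"ATG", "TTG", "CTG", "ATT", "ATC", "ATA", "GTG"}


def translate(seq, complete_gene=True):
    """Translate a nucleotide sequence in frame 1 with table 11.

    If complete_gene is True (assumption 1) and the first codon is a valid
    start codon for the table (check 2), the first residue is emitted as 'M'.
    """
    seq = seq.upper().replace("U", "T")
    protein = []
    # Ignore a trailing partial codon.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if i == 0 and complete_gene and codon in START_CODONS_TABLE_11:
            protein.append("M")
        else:
            # Unknown or ambiguous codons become 'X'.
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)
```

With this sketch, a GTG at position 0 is translated as M only when the gene is treated as complete; a GTG anywhere else stays V.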
The tables are here; start codons are flagged. All data come from https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes. (1) I assume the genes are complete; (2) done.
@fungs Any concerns? Can I close this issue?
@shenwei356 that is great, I will compile and test the current master next week and report if things are working well. Thx!
Hi @shenwei356, I tested the current master flagged as v10.0.1 (note: it says "New version available: seqkit v0.10.0") on a larger set of protein coding nucleotide sequences. The deviation of translated versus deposited protein sequences is reduced and the program seems to work very well. I do however have some suggestions for safer handling.
So basically it works very well but I think it needs a little better control. Best,
Great work! I guess 2 can be neglected, the behavior is documented here now. It's a very special case which is only of importance when working with novel and erroneous data. Best,
Hi, thanks again for this wonderful software; with every new release it replaces more programs in my pipelines. Now for this simple feature request.
Translation tables come with alternative start codons: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG11
Would it be possible to treat the lookup for the start codon (the first codon of the input sequence) differently, so that it maps to the correct amino acid? When I map sequences with correct frame and offset to protein space, some of them use alternative genetic codes (e.g. plastid-encoded ones), which leads to a wrong translation of the first codon. This could be an additional command line switch and would only make sense for complete gene sequences.
Best,
Johannes