Support start codons in translation tables #62

Closed
fungs opened this issue Jan 28, 2019 · 13 comments

@fungs

fungs commented Jan 28, 2019

Hi, thanks again for this wonderful software; with every new release it replaces more programs in my pipelines. Now for this simple feature request.

Translation tables come with alternative start codons: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG11

Would it be possible to treat the lookup for the start codon (the first codon of the input sequence) differently, so that it maps to the correct amino acid? When I map sequences with correct frame and offset to protein space, some of them use alternative genetic codes (e.g. plastid-encoded ones), which leads to a wrong translation of the first codon. This could be an additional command line switch and would only make sense for complete gene sequences.

Best,
Johannes

@shenwei356
Owner

Can you give some examples with sequences?

@shenwei356
Owner

Each codon in a given translation table (-T/--transl-table) is translated to exactly one amino acid, whether or not it is a start codon.

@fungs
Author

fungs commented Jan 29, 2019

I know, this should be marked as a feature request.

Translation tables have alternative amino acids (basically M) for start codons. The standard table doesn't. It's documented under the link I posted.

Best,
Johannes

@fungs
Author

fungs commented Jan 29, 2019

Sorry, I missed your request for an example. I will post some today!

@fungs
Author

fungs commented Jan 29, 2019

As an example, see locus ERS432292_01433 in CWZP01000026.

Protein sequence:
MDSPAEYKIFLNIILAPVDDGLRFTYSGDNFFYREISMISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERRYFCIEFTASAGCDAGRVLSGLPISGCV

Nucleotide sequence:
GTGGACTCGCCAGCAGAATATAAAATTTTCCTCAACATCATCCTCGCACCAGTCGACGACGGTTTACGCTTTACGTATAGTGGCGACAATTTTTTTTATCGGGAAATCTCAATGATCAGTCTGATTGCGGCGTTAGCGGTAGATCGCGTTATCGGCATGGAAAACGCCATGCCGTGGAACCTGCCTGCCGATCTCGCCTGGTTTAAACGCAACACCTTAAATAAACCCGTGATTATGGGCCGCCATACTTGGGAATCAATCGGTCGTCCGTTGCCAGGACGCAAAAATATTATCCTCAGCAGTCAACCGGGTACGGACGATCGCGTAACGTGGGTGAAGTCGGTGGATGAAGCCATCGCGGCGTGTGGTGACGTACCAGAAATCATGGTGATTGGCGGCGGTCGCGTTTATGAGCAGTTCCTGCCAAAAGCGCAGAAACTGTATCTGACGCATATCGACGCAGAAGTGGAAGGCGACACCCATTTCCCGGATTACGAGCCGGATGACTGGGAATCGGTATTCAGCGAATTCCACGATGCTGATGCGCAGAACTCTCACAGCTATTGCTTTGAGATTCTGGAGCGGCGGTACTTTTGTATAGAATTTACGGCTAGTGCCGGATGCGACGCCGGTCGCGTCTTATCCGGCCTTCCTATATCAGGCTGTGTT

Translation with table 11, as given in the genome annotation, with seqkit gives:
VDSPAEYKIFLNIILAPVDDGLRFTYSGDNFFYREISMISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERRYFCIEFTASAGCDAGRVLSGLPISGCV

Note that the translation table specifies GUG as an alternative start codon.
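
To make the requested behaviour concrete, here is a minimal Python sketch (not seqkit's code). It assumes the amino-acid string of NCBI table 11 in the usual TCAG base order and uses the table 11 start codons; the function name `translate` and the flag `first_codon_as_start` are hypothetical names chosen for illustration.

```python
from itertools import product

# Amino acids of NCBI table 11 in TCAG base order (same as the standard code).
AAS_TABLE_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AAS_TABLE_11)
}
# Start codons flagged for table 11 (ATG plus six alternatives).
START_CODONS_11 = {"ATG", "TTG", "CTG", "ATT", "ATC", "ATA", "GTG"}


def translate(seq, first_codon_as_start=False):
    """Translate frame 1 of seq; optionally map a valid first codon to 'M'."""
    seq = seq.upper().replace("U", "T")
    protein = []
    # Ignore trailing bases that do not form a complete codon.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if i == 0 and first_codon_as_start and codon in START_CODONS_11:
            protein.append("M")
        else:
            # Unknown or ambiguous codons become 'X' in this sketch.
            protein.append(CODON_TO_AA.get(codon, "X"))
    return "".join(protein)


prefix = "GTGGACTCGCCAGCA"  # first 15 bases of the example locus
print(translate(prefix))                             # VDSPA
print(translate(prefix, first_codon_as_start=True))  # MDSPA
```

With the flag off, the first residue is V, as in the seqkit output above; with it on, the first residue is M, as in the deposited protein.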

@shenwei356
Owner

I get it, we should mark start codon's product as 'M' when it acts as start codon.

$ seqkit seq t.fa 
>seq
GTGGACTCGCCAGCAGAATATAAAATTTTCCTCAACATCATCCTCGCACCAGTCGACGAC
GGTTTACGCTTTACGTATAGTGGCGACAATTTTTTTTATCGGGAAATCTCAATGATCAGT
CTGATTGCGGCGTTAGCGGTAGATCGCGTTATCGGCATGGAAAACGCCATGCCGTGGAAC
CTGCCTGCCGATCTCGCCTGGTTTAAACGCAACACCTTAAATAAACCCGTGATTATGGGC
CGCCATACTTGGGAATCAATCGGTCGTCCGTTGCCAGGACGCAAAAATATTATCCTCAGC
AGTCAACCGGGTACGGACGATCGCGTAACGTGGGTGAAGTCGGTGGATGAAGCCATCGCG
GCGTGTGGTGACGTACCAGAAATCATGGTGATTGGCGGCGGTCGCGTTTATGAGCAGTTC
CTGCCAAAAGCGCAGAAACTGTATCTGACGCATATCGACGCAGAAGTGGAAGGCGACACC
CATTTCCCGGATTACGAGCCGGATGACTGGGAATCGGTATTCAGCGAATTCCACGATGCT
GATGCGCAGAACTCTCACAGCTATTGCTTTGAGATTCTGGAGCGGCGGTACTTTTGTATA
GAATTTACGGCTAGTGCCGGATGCGACGCCGGTCGCGTCTTATCCGGCCTTCCTATATCA
GGCTGTGTT
>seq2
GTGGACTCGCCAGCAtaaGTGGACTCGCCAGCA
>seq2
GTGGACTCGCCAGCAtaattgGTGGACTCGCCAGCA

$ cat t.fa | seqkit translate -T 11
>seq
MDSPAEYKIFLNIILAPVDDGLRFTYSGDNFFYREISMISLIAALAVDRVIGMENAMPWN
LPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIA
ACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDA
DAQNSHSYCFEILERRYFCIEFTASAGCDAGRVLSGLPISGCV
>seq2
MDSPA*MDSPA
>seq2
MDSPA*MVDSPA

shenwei356 added a commit that referenced this issue Jan 29, 2019
@fungs
Author

fungs commented Jan 29, 2019

I get it, we should mark start codon's product as 'M' when it acts as start codon.

I think that to do this correctly, you need to (1) assume that genes are complete, so that the first codon is actually a start codon, and (2) check whether the first codon is a valid start codon for the specified translation table.

I don't know where and how you keep the translation tables but this information needs to be added I suppose. For instance, table 11 has the following extra start codons: TTG, CTG, ATT, ATC, ATA, GTG in addition to ATG.
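
Check (2) could be sketched roughly as follows. This is only an illustration in Python, not how seqkit stores its tables; `TABLE_11_START_CODONS` and `has_valid_start` are hypothetical names, and the codon set is the one listed above for table 11.

```python
# Start codons for translation table 11, as listed above (assumed complete).
TABLE_11_START_CODONS = frozenset(
    {"ATG", "TTG", "CTG", "ATT", "ATC", "ATA", "GTG"}
)


def has_valid_start(seq, start_codons=TABLE_11_START_CODONS):
    """Return True if the first codon of seq is a start codon of the table."""
    # Accept lower case and RNA input by normalising to upper-case DNA.
    codon = seq[:3].upper().replace("U", "T")
    return len(codon) == 3 and codon in start_codons


print(has_valid_start("GTGGACTCG"))  # True: GTG is a table 11 start codon
print(has_valid_start("GACTCGCCA"))  # False: GAC is not a start codon
```

Assumption (1), that the gene is complete, cannot be verified from the sequence alone, so it would have to stay a decision made by the user.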

BTW: there is a similar issue with alternative stop codons for some of the more exotic translation tables.

@shenwei356
Owner

The tables are here, start codons are flagged. All data come from https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

(1) I assume the genes are complete; (2) done.

@shenwei356
Owner

@fungs any concerns? Can I close this issue?

@fungs
Author

fungs commented Feb 8, 2019

@shenwei356 that is great, I will compile and test the current master next week and report if things are working well. Thx!

@fungs
Author

fungs commented Feb 11, 2019

Hi @shenwei356,

I tested the current master flagged as v10.0.1 (note: it says "New version available: seqkit v0.10.0") on a larger set of protein coding nucleotide sequences. The deviation of translated versus deposited protein sequences is reduced and the program seems to work very well. I do however have some suggestions for safer handling.

  1. Currently, the start codon auto-detection triggers automatically, with no way to turn it off. I'd advise not enabling this by default and adding a new command line parameter instead. Otherwise the feature might kick in unintentionally when translating partial gene sequences that don't start at the beginning of the gene.

  2. I saw that the auto-detection triggers both at the beginning of a sequence and after a stop codon. However, I had sequences that seem to contain a faulty stop codon (due either to a sequencing error or to an undocumented/novel genetic code). This may then alter the subsequent amino acid. One example is a variant of gene AJ131405 I have, where the sequence TTATAGATC is annotated as ...QI... but becomes ...*M... with translation table 11.
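
The difference described in point 2 can be reproduced with a small Python sketch. This is not seqkit's implementation; the `start_mode` parameter and its values `"first"` and `"after_stop"` are hypothetical names, the table is assumed to be NCBI table 11 in TCAG order, and the fragment is assumed to be read in frame as TTA TAG ATC.

```python
from itertools import product

# NCBI table 11 amino acids in TCAG base order, plus its start codons.
AAS_TABLE_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AAS_TABLE_11)
}
START_CODONS_11 = {"ATG", "TTG", "CTG", "ATT", "ATC", "ATA", "GTG"}


def translate(seq, start_mode="first"):
    """Translate frame 1 of seq with start-codon handling.

    start_mode="first":      only the first codon may be read as a start.
    start_mode="after_stop": the first codon and any codon directly after
                             a stop codon may be read as a start.
    """
    if start_mode not in ("first", "after_stop"):
        raise ValueError("unknown start_mode: " + start_mode)
    seq = seq.upper().replace("U", "T")
    protein = []
    expect_start = True  # the first codon is always a start candidate
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if expect_start and codon in START_CODONS_11:
            aa = "M"
        else:
            aa = CODON_TO_AA.get(codon, "X")
        protein.append(aa)
        # Re-arm start detection only in "after_stop" mode.
        expect_start = start_mode == "after_stop" and aa == "*"
    return "".join(protein)


print(translate("TTATAGATC", start_mode="first"))       # L*I
print(translate("TTATAGATC", start_mode="after_stop"))  # L*M
```

In `"after_stop"` mode the ATC following the (possibly faulty) TAG is read as a start codon and becomes M, which is the effect reported above; restricting detection to the first codon leaves it as I.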

So basically it works very well but I think it needs a little better control.

Best,
Johannes

@shenwei356
Owner

  1. New option: -M, --init-codon-as-M (translate initial codon at beginning to 'M').
  2. Hard to solve.

@fungs
Author

fungs commented Feb 14, 2019

Great work! I guess point 2 can be neglected; the behavior is documented here now. It's a very special case that only matters when working with novel or erroneous data.

Best,
Johannes
