Support start codons in translation tables #62

fungs · 2019-01-28T19:22:45Z

Hi, thanks again for this wonderful software; with every new release it replaces more programs in my pipelines. Now for this simple feature request.

Translation tables come with alternative start codons: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG11

Would it be possible to treat the lookup for the start codon (the first codon of the input sequence) differently, so that it maps to the correct amino acid? When I map sequences with correct frame and offset to protein space, some of them use alternative genetic codes (e.g. plastid-encoded ones), which leads to a wrong translation of the first codon. This could be an additional command line switch and would only make sense for complete gene sequences.

Best,
Johannes

shenwei356 · 2019-01-28T23:32:39Z

Can you give an example with a sequence?

shenwei356 · 2019-01-29T08:27:05Z

One codon in a given translation table (-T/--transl-table) is only translated to one amino acid, no matter whether it's a start codon or not.

fungs · 2019-01-29T09:17:44Z

I know, this should be marked as a feature request.

Translation tables have alternative amino acids (basically M) for start codons. The standard table doesn't. It's documented under the link I posted.

Best,
Johannes

fungs · 2019-01-29T09:19:10Z

Sorry, I didn't read your example request, will post you some today!

fungs · 2019-01-29T10:32:28Z

As an example, see locus ERS432292_01433 in CWZP01000026.

Protein sequence:
MDSPAEYKIFLNIILAPVDDGLRFTYSGDNFFYREISMISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERRYFCIEFTASAGCDAGRVLSGLPISGCV

Nucleotide sequence:
GTGGACTCGCCAGCAGAATATAAAATTTTCCTCAACATCATCCTCGCACCAGTCGACGACGGTTTACGCTTTACGTATAGTGGCGACAATTTTTTTTATCGGGAAATCTCAATGATCAGTCTGATTGCGGCGTTAGCGGTAGATCGCGTTATCGGCATGGAAAACGCCATGCCGTGGAACCTGCCTGCCGATCTCGCCTGGTTTAAACGCAACACCTTAAATAAACCCGTGATTATGGGCCGCCATACTTGGGAATCAATCGGTCGTCCGTTGCCAGGACGCAAAAATATTATCCTCAGCAGTCAACCGGGTACGGACGATCGCGTAACGTGGGTGAAGTCGGTGGATGAAGCCATCGCGGCGTGTGGTGACGTACCAGAAATCATGGTGATTGGCGGCGGTCGCGTTTATGAGCAGTTCCTGCCAAAAGCGCAGAAACTGTATCTGACGCATATCGACGCAGAAGTGGAAGGCGACACCCATTTCCCGGATTACGAGCCGGATGACTGGGAATCGGTATTCAGCGAATTCCACGATGCTGATGCGCAGAACTCTCACAGCTATTGCTTTGAGATTCTGGAGCGGCGGTACTTTTGTATAGAATTTACGGCTAGTGCCGGATGCGACGCCGGTCGCGTCTTATCCGGCCTTCCTATATCAGGCTGTGTT

Translation with table 11, as given in the genome annotation, with seqkit gives:
VDSPAEYKIFLNIILAPVDDGLRFTYSGDNFFYREISMISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERRYFCIEFTASAGCDAGRVLSGLPISGCV

Note that the translation table specifies GUG as an alternative start codon.
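The discrepancy in this example can be checked with a small comparison. The following is a minimal Python sketch, not part of seqkit; the helper name `mismatches` is hypothetical, and only the first ten residues of the two protein sequences quoted above are used.

```python
def mismatches(deposited, translated):
    """Return (1-based position, deposited residue, translated residue) per difference."""
    if len(deposited) != len(translated):
        raise ValueError("sequences must have the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(deposited, translated), start=1)
        if a != b
    ]


deposited = "MDSPAEYKIF"   # first 10 residues of the deposited protein
translated = "VDSPAEYKIF"  # first 10 residues of the table 11 translation
print(mismatches(deposited, translated))  # [(1, 'M', 'V')]
```

Only the first residue differs: the initial GTG is read as V instead of M.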

shenwei356 · 2019-01-29T11:25:47Z

I get it, we should mark start codon's product as 'M' when it acts as start codon.

$ seqkit seq t.fa 
>seq
GTGGACTCGCCAGCAGAATATAAAATTTTCCTCAACATCATCCTCGCACCAGTCGACGAC
GGTTTACGCTTTACGTATAGTGGCGACAATTTTTTTTATCGGGAAATCTCAATGATCAGT
CTGATTGCGGCGTTAGCGGTAGATCGCGTTATCGGCATGGAAAACGCCATGCCGTGGAAC
CTGCCTGCCGATCTCGCCTGGTTTAAACGCAACACCTTAAATAAACCCGTGATTATGGGC
CGCCATACTTGGGAATCAATCGGTCGTCCGTTGCCAGGACGCAAAAATATTATCCTCAGC
AGTCAACCGGGTACGGACGATCGCGTAACGTGGGTGAAGTCGGTGGATGAAGCCATCGCG
GCGTGTGGTGACGTACCAGAAATCATGGTGATTGGCGGCGGTCGCGTTTATGAGCAGTTC
CTGCCAAAAGCGCAGAAACTGTATCTGACGCATATCGACGCAGAAGTGGAAGGCGACACC
CATTTCCCGGATTACGAGCCGGATGACTGGGAATCGGTATTCAGCGAATTCCACGATGCT
GATGCGCAGAACTCTCACAGCTATTGCTTTGAGATTCTGGAGCGGCGGTACTTTTGTATA
GAATTTACGGCTAGTGCCGGATGCGACGCCGGTCGCGTCTTATCCGGCCTTCCTATATCA
GGCTGTGTT
>seq2
GTGGACTCGCCAGCAtaaGTGGACTCGCCAGCA
>seq2
GTGGACTCGCCAGCAtaattgGTGGACTCGCCAGCA

$ cat t.fa | seqkit translate -T 11
>seq
MDSPAEYKIFLNIILAPVDDGLRFTYSGDNFFYREISMISLIAALAVDRVIGMENAMPWN
LPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIA
ACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDA
DAQNSHSYCFEILERRYFCIEFTASAGCDAGRVLSGLPISGCV
>seq2
MDSPA*MDSPA
>seq2
MDSPA*MVDSPA

fungs · 2019-01-29T12:08:43Z

> I get it, we should mark start codon's product as 'M' when it acts as start codon.

I think to do this correctly, you need to (1) assume that genes are complete, so that the first codon is actually a start codon, and (2) check whether the first codon is a valid start codon for the specified translation table.

I don't know where and how you keep the translation tables but this information needs to be added I suppose. For instance, table 11 has the following extra start codons: TTG, CTG, ATT, ATC, ATA, GTG in addition to ATG.
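The two conditions above can be sketched in Python as follows. This is an illustration only, not seqkit's implementation: the names `translate` and `init_codon_as_m` are hypothetical, the table is built from the NCBI amino acid string for table 11, and the start codon set is the one listed above. The sketch assumes the input is a complete gene in frame.

```python
from itertools import product

# NCBI ordering: bases T, C, A, G, with the first base varying slowest.
AAS_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE_11 = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), AAS_11)
}
# Start codons of table 11 as listed in this thread.
START_CODONS_11 = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}


def translate(seq, init_codon_as_m=False):
    """Translate frame 1 with table 11; optionally emit 'M' for a valid first codon."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE_11.get(codon, "X")  # 'X' for codons with ambiguous bases
        # Condition (2): only a valid start codon at position 0 becomes 'M'.
        if i == 0 and init_codon_as_m and codon in START_CODONS_11:
            aa = "M"
        protein.append(aa)
    return "".join(protein)


print(translate("GTGGACTCGCCAGCA"))                        # VDSPA
print(translate("GTGGACTCGCCAGCA", init_codon_as_m=True))  # MDSPA
```

A first codon that is not in the start set (for example GAC) is translated normally even when the flag is on.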

BTW: there is a similar issue with alternative stop codons for some of the more exotic translation tables.

shenwei356 · 2019-01-29T12:29:04Z

The tables are here, start codons are flagged. All data come from https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

(1) I assume the genes are complete; (2) done.

shenwei356 · 2019-02-08T17:45:26Z

@fungs any concern? can I close this issue?

fungs · 2019-02-08T18:01:54Z

@shenwei356 that is great, I will compile and test the current master next week and report if things are working well. Thx!

fungs · 2019-02-11T10:36:37Z

Hi @shenwei356,

I tested the current master flagged as v10.0.1 (note: it says "New version available: seqkit v0.10.0") on a larger set of protein coding nucleotide sequences. The deviation of translated versus deposited protein sequences is reduced and the program seems to work very well. I do however have some suggestions for safer handling.

1. Currently, the start codon auto-detection triggers automatically, without the possibility to turn it off. I'd advise not enabling this by default and adding a new command line parameter instead. Otherwise the feature might kick in unintentionally when translating partial gene sequences that don't start at the beginning of the gene.
2. I saw that the auto-detection triggers both at the beginning of a sequence and after a stop codon. I had, however, sequences which seem to contain a faulty stop codon (either due to sequencing error or due to an undocumented/novel genetic code). This may then alter the subsequent amino acid. One example is a variant of gene AJ131405 I have, where the sequence TTATAGATC is annotated as ...QI... but becomes ...*M... with translation table 11.
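The two behaviours described above can be reproduced with a small Python sketch. It is an illustration under assumptions, not seqkit's code: the function `translate_table11` and its flags `m_at_start` and `m_after_stop` are hypothetical names, and the start codon set is the table 11 list given earlier in this thread.

```python
from itertools import product

AAS_11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE_11 = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AAS_11)}
STARTS_11 = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}


def translate_table11(seq, m_at_start=True, m_after_stop=False):
    """Translate frame 1; each flag controls one place where a start codon becomes 'M'."""
    seq = seq.upper()
    out = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        aa = TABLE_11.get(codon, "X")
        is_first = i == 0
        follows_stop = bool(out) and out[-1] == "*"
        if codon in STARTS_11 and (
            (is_first and m_at_start) or (follows_stop and m_after_stop)
        ):
            aa = "M"
        out.append(aa)
    return "".join(out)


# The fragment from the AJ131405 variant: ATC after the TAG stop.
print(translate_table11("TTATAGATC"))                     # L*I
print(translate_table11("TTATAGATC", m_after_stop=True))  # L*M
```

With `m_after_stop=True` the sketch also reproduces the `MDSPA*MVDSPA` output shown earlier for the second seq2 record; restricting the rule to the first codon leaves residues after an internal stop unchanged.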

So basically it works very well but I think it needs a little better control.

Best,
Johannes

shenwei356 · 2019-02-14T10:34:31Z

1. New option: -M, --init-codon-as-M    translate initial codon at beginning to 'M'
2. Hard to solve.

fungs · 2019-02-14T10:49:20Z

Great work! I guess 2 can be neglected, the behavior is documented here now. It's a very special case which is only of importance when working with novel and erroneous data.

Best,
Johannes

shenwei356 added a commit that referenced this issue Jan 29, 2019

#62 and #63

5002a6d

shenwei356 added the enhancement label Feb 11, 2019

shenwei356 added a commit that referenced this issue Feb 14, 2019

seqkit translate: add option -M #62

f1ae471

fungs closed this as completed Feb 14, 2019

shenwei356 mentioned this issue Feb 25, 2019

Partial codon translation #64

Closed
