FragPipe-ready fasta headers and redundancy reduction #221

Open
MiguelCos opened this issue Aug 24, 2021 · 5 comments

@MiguelCos

MiguelCos commented Aug 24, 2021

Hello @acesnik ,

I am opening this issue to share some thoughts on what I see as problems with the format of the fasta file that Spritz outputs for use in FragPipe, following the discussion started in #Nesvilab/FragPipe#263.

I am already working on an R script that should solve at least 80% of Problem 1, and I hope to share it here soon (this week).

Problem 1: the headers.
The format does not seem to fit what FragPipe/Philosopher expects as a 'mock' of the UniProt format. I think the mz at the beginning is part of the problem, as is the fact that the descriptions of the variant proteins are extremely long.

My solution is to extract all the variant information into a tabular annotation (something like a reduced version of a BED file) and build a very simple header from it: encode the variant in the protein ID section of the header and add a reduced description. The short IDs can then be mapped back to the 'reduced BED file' to recover each variant's identifiers and annotations.

I also found that some peptide sequences for the variants are appended to the protein/transcript ID section of the header, which makes the header very long as well.
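
To make the idea concrete, here is a minimal sketch in Python (not the R script mentioned above). It replaces every header with a short one and keeps the full original header in a lookup table. The `VAR000001`-style IDs, the `sp|...|...` layout, the `variant entry` description, and the function names are all assumptions chosen for illustration; the exact layout Philosopher accepts should be checked against its documentation.

```python
import csv
import io


def simplify_headers(fasta_lines, prefix="VAR"):
    """Replace each FASTA header with a short one.

    Returns (rewritten_lines, lookup_rows). Each lookup row maps a short ID
    back to the full original header so no annotation is lost.
    """
    rewritten, lookup = [], []
    n = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            n += 1
            short_id = f"{prefix}{n:06d}"  # hypothetical ID scheme
            lookup.append({"short_id": short_id, "original_header": line[1:]})
            # Assumed UniProt-like layout: >db|accession|entry_name description
            rewritten.append(f">sp|{short_id}|{short_id} variant entry")
        else:
            rewritten.append(line)  # sequence lines pass through unchanged
    return rewritten, lookup


def lookup_to_tsv(lookup):
    """Serialize the lookup rows as a tab-separated table."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["short_id", "original_header"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(lookup)
    return buf.getvalue()


# Made-up input header, standing in for a long Spritz header.
example = [
    ">Protein_X1_var1 some very long variant description",
    "LADAGALADRIELGANDALFKTHEGIMLIK",
]
new_lines, table = simplify_headers(example)
print("\n".join(new_lines))
print(lookup_to_tsv(table))
```

A running index keeps the ID section short no matter what the original header contains; the trade-off is that the short ID carries no meaning on its own, so the lookup table has to travel with the fasta file.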

Problem 2: Redundancy

I am trying to describe the problem the best I can here:

The output from spritz looks like this (allow me a pop reference):

>Protein_X1_wt
LADYGALADRIELGANDALFKTHEGIMLIK
>Protein_X1_var1
LADAGALADRIELGANDALFKTHEGIMLIK
>Protein_X1_var2
LADYGALADRIELGENDALFKTHEGIMLIK

This means that protein/transcript X1 has 3 versions: One WT, and two variants. But each variant is present in a different tryptic peptide.

I would like to have all variants for a protein summarized in one unique 'variant' protein, so it would be easier to filter identified variants by their unique peptides, and it would also reduce the search space. In the end, when identifying sequence variants, our evidence for their existence is the tryptic peptide identification, so I don't think it is necessary to have a protein entry for each of the called variants.

>Protein_X1_wt
LADYGALADRIELGANDALFKTHEGIMLIK
>Protein_X1_var1_n_var2
LADAGALADRIELGENDALFKTHEGIMLIK
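
As a quick check of this idea, the sketch below digests the toy sequences above in silico and compares the peptide sets from the separate entries with those from the merged entry. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P) and no missed cleavages; the function name is hypothetical.

```python
def tryptic_peptides(seq):
    """Fully tryptic peptides: cleave after K/R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R
    return peptides


entries = {
    "Protein_X1_wt": "LADYGALADRIELGANDALFKTHEGIMLIK",
    "Protein_X1_var1": "LADAGALADRIELGANDALFKTHEGIMLIK",
    "Protein_X1_var2": "LADYGALADRIELGENDALFKTHEGIMLIK",
}
merged = "LADAGALADRIELGENDALFKTHEGIMLIK"  # Protein_X1_var1_n_var2

wt = set(tryptic_peptides(entries["Protein_X1_wt"]))
separate = set().union(*(tryptic_peptides(s) for s in entries.values()))
combined = wt | set(tryptic_peptides(merged))

print(sorted(separate - wt))  # variant-specific peptides: ['IELGENDALFK', 'LADAGALADR']
print(separate == combined)   # True for this toy case
```

For this toy case the two layouts yield the same fully tryptic peptides. That equality depends on the assumptions: once a missed cleavage is allowed, the merged entry produces LADAGALADRIELGENDALFK, which neither separate variant contains, and it no longer produces LADAGALADRIELGANDALFK.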

Does it make sense and do you think it is actually a problem?

I'll share here my partial solution to problem 1 as soon as I have it.

Best wishes,
Miguel

@acesnik
Collaborator

acesnik commented Aug 24, 2021

Hi @MiguelCos,

Thanks for the message!

Having a lookup table for the variants sounds like a good idea, for sure.

On the redundancy, one thing to be careful about is that Spritz performs some combinatorics with heterozygous variations. It amends sequences with homozygous variations, and since both the reference and the alternate allele are possible for heterozygous variations, it expands the combinations of those possible peptides. Some of those combinations may be lost if all the variants are combined into a single entry.
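
The expansion described here can be sketched in a few lines of Python. This only illustrates the combinatorics and is not Spritz's actual implementation: the function name, the `(position, alternate residue)` representation, the restriction to single-residue substitutions, and the choice to treat both toy variants as heterozygous are all assumptions.

```python
from itertools import product


def expand_heterozygous(reference, homozygous, heterozygous):
    """Enumerate sequences for every ref/alt choice at heterozygous sites.

    Variants are (0-based position, alternate residue) tuples.
    Homozygous substitutions are always applied.
    """
    base = list(reference)
    for pos, alt in homozygous:
        base[pos] = alt
    sequences = []
    # One boolean per heterozygous site: False keeps the reference allele.
    for choice in product([False, True], repeat=len(heterozygous)):
        seq = base.copy()
        for use_alt, (pos, alt) in zip(choice, heterozygous):
            if use_alt:
                seq[pos] = alt
        sequences.append("".join(seq))
    return sequences


wt = "LADYGALADRIELGANDALFKTHEGIMLIK"
# Toy variants from the first post: Y->A at index 3, A->E at index 14.
combos = expand_heterozygous(wt, homozygous=[], heterozygous=[(3, "A"), (14, "E")])
for seq in combos:
    print(seq)
```

Two heterozygous sites give four sequences: the reference, each single variant, and the double variant. A single merged entry keeps only the last of these, so any peptide that spans both sites with a mixed ref/alt combination would be missing from the search space.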

Anthony

@acesnik
Collaborator

acesnik commented Aug 24, 2021

Are you using combined.spritz.snpeff.protein.fasta or combined.spritz.snpeff.protein.withdecoys.fasta?

@MiguelCos
Author

Hello Anthony,

I have been using the combined.spritz.snpeff.protein.withdecoys.fasta.

@acesnik
Collaborator

acesnik commented Aug 26, 2021

That's great. Thanks for the info!

@MiguelCos
Author

Hello Anthony @acesnik

I just finished an R script that converts combined.spritz.snpeff.protein.withdecoys.fasta into a format convenient for FragPipe.

https://github.com/MiguelCos/spritz_fasta_2_fragpipe_adaptation

The repo contains a small sample fasta and the sample output.

If you check the annotation file, you will see that I didn't give particularly meaningful names to the columns, because I am not sure how to refer to each piece of information associated with each variant. Is there any documentation that explains how to interpret those fields and what their actual 'names' are?

I used the script on two different datasets and in both cases, Philosopher seemed to parse the fasta properly (it didn't crash when using the LFQ pipeline, and the TMT report tables were properly generated using the TMT pipeline). I need to look a little bit closer, but in general, it seems to be working as it should.

Also, many thanks for your clarification regarding the redundancy 'problem'. It then makes sense to keep the variant sequences as they are!

Best wishes,
Miguel
