Generating proteogenomic database for Pseudomonas with VCF called from WGS (or exome seq) data #185

Open
animesh opened this issue Aug 23, 2020 · 6 comments
@animesh

animesh commented Aug 23, 2020

I am wondering how I can add a species such as Pseudomonas aeruginosa?

The FASTA file for the reference proteome is available at https://www.uniprot.org/proteomes/UP000002438. Any ideas on how to proceed would be appreciated :)

@trishorts

You want to do Spritz for Pseudomonas?

@acesnik
Collaborator

acesnik commented Aug 23, 2020

Spritz is currently built to call variants from eukaryotes with RNA-Seq data, so this would take a new workflow.

What type of sequencing data do you have for the sample (e.g. exome, genome)?

Here's the Ensembl genome for Pseudomonas: https://bacteria.ensembl.org/Pseudomonas_aeruginosa_pao1/Info/Index. There's no reference VCF like the one we're using for human in GATK.

@acesnik
Collaborator

acesnik commented Aug 23, 2020

We would also need to implement support for other codon tables for this feature (#164).
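To make the codon-table point concrete, here is a minimal Python sketch. It is not Spritz's implementation, and the names (`translate_cds`, `BACTERIAL_STARTS`, and so on) are hypothetical. It assumes NCBI translation table 11, the one normally used for bacteria, which assigns the same amino acids as the standard table but accepts more initiation codons, each read as methionine at the start of a CDS.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order; NCBI tables 1 and 11 share these.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Initiation codons listed by NCBI for table 1 (standard) and table 11 (bacterial).
STANDARD_STARTS = {"ATG", "CTG", "TTG"}
BACTERIAL_STARTS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}


def translate_cds(cds, start_codons):
    """Translate a coding sequence, reading an accepted start codon as M."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if i == 0 and codon in start_codons:
            protein.append("M")  # initiator codon is decoded as methionine
            continue
        aa = CODON_TABLE[codon]
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)
```

With a GTG-initiated gene, `translate_cds("GTGAAATAA", BACTERIAL_STARTS)` gives `MK`, while the same call with `STANDARD_STARTS` gives `VK`, so the first residue of many bacterial proteins depends on which table is applied.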

@animesh
Author

animesh commented Aug 24, 2020

I have WGS data for this bacterium, which seems to have diverged from the main reference based on the assembly, so using the canonical proteome is clearly suboptimal. I see that a GFF is available at ftp://ftp.ensemblgenomes.org/pub/bacteria/current/gff3/bacteria_67_collection/pseudomonas_aeruginosa/, so perhaps one could use it to call the variants and create a strain-specific VCF?
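One way to get a strain-specific VCF from paired-end WGS reads is sketched below in Python. It is only an outline under stated assumptions: BWA, samtools, and bcftools are installed, the reference genome is available as a FASTA file (alignment and calling need the sequence itself; the GFF supplies the gene annotation used later), and the function name, file names, and thread count are all hypothetical. The function only assembles the shell commands so they can be inspected before anything is run.

```python
import shlex


def variant_calling_commands(reference, reads_1, reads_2, sample="strain", threads=4):
    """Return shell commands for a simple haploid variant-calling run.

    Assumes bwa, samtools and bcftools are on PATH; all file names are
    placeholders chosen for illustration.
    """
    ref, r1, r2 = (shlex.quote(str(p)) for p in (reference, reads_1, reads_2))
    bam = shlex.quote(f"{sample}.sorted.bam")
    vcf = shlex.quote(f"{sample}.vcf.gz")
    return [
        # Build the BWA index for the reference genome.
        f"bwa index {ref}",
        # Align paired-end reads and sort the alignments by coordinate.
        f"bwa mem -t {threads} {ref} {r1} {r2} | samtools sort -o {bam} -",
        # Index the sorted BAM.
        f"samtools index {bam}",
        # Call variants, treating the bacterial genome as haploid.
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv --ploidy 1 -Oz -o {vcf}",
    ]


if __name__ == "__main__":
    for command in variant_calling_commands("PAO1.fa", "reads_1.fq.gz", "reads_2.fq.gz"):
        print(command)
```

The resulting VCF must use the same contig names as the annotation database it is later matched against, so it is worth checking the reference FASTA headers before calling variants.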

@acesnik
Collaborator

acesnik commented Aug 27, 2020

This is definitely a good direction to take Spritz. It's also good that the GFF file is available. I know @rmmiller22 was working on vervet monkey samples, which had that situation, i.e. no reference VCF available.

I unfortunately don't have the bandwidth to add this feature to Spritz right now, but we'll keep you posted as we work towards this goal.

By the way, what tool do you typically use to align WGS reads to bacterial genomes? Bowtie/BWA?

@acesnik
Collaborator

acesnik commented Aug 27, 2020

Oh, an option in the meantime: you could generate a VCF file for your sample by other means and run it through the custom SnpEff fork that is part of Spritz with the options -fastaProt {file} and -xmlProt {file} specified. This should generate FASTA and XML files that could be used in MetaMorpheus or other search software. SnpEff has ~270 different Pseudomonas references. One of them is Pseudomonas_aeruginosa, which you could use for this analysis with

java -Xmx16G -jar snpEff.jar -v -stats {output.html} -fastaProt {output.protfa} -xmlProt {output.protxml} Pseudomonas_aeruginosa {input.vcf} > {output.vcf}

where the bracketed bits are replaced with your desired input/output files.
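For scripting this step, a small Python wrapper along the following lines may help. It is a sketch, not part of Spritz: the function names, output file names, and default heap size are assumptions, and the flags are the ones spelled out in the full command above (`-xmlProt` is described there as coming from the Spritz SnpEff fork, so it is not expected to work with a stock SnpEff jar).

```python
import subprocess
from pathlib import Path


def build_snpeff_command(vcf_in, genome="Pseudomonas_aeruginosa",
                         jar="snpEff.jar", heap="16g", out_prefix="sample"):
    """Return (argv, output_vcf_path) for the SnpEff annotation step.

    The heap size and the output file names are illustrative assumptions.
    """
    argv = [
        "java", f"-Xmx{heap}", "-jar", str(jar),
        "-v",
        "-stats", f"{out_prefix}.snpeff.html",
        "-fastaProt", f"{out_prefix}.protein.fasta",
        "-xmlProt", f"{out_prefix}.protein.xml",  # Spritz fork option
        genome, str(vcf_in),
    ]
    return argv, Path(f"{out_prefix}.snpeff.vcf")


def run_snpeff(vcf_in, **kwargs):
    """Run SnpEff and write the annotated VCF that it prints to stdout."""
    argv, vcf_out = build_snpeff_command(vcf_in, **kwargs)
    with open(vcf_out, "w") as handle:
        subprocess.run(argv, stdout=handle, check=True)
    return vcf_out
```

Calling `run_snpeff("strain.vcf", out_prefix="pa")` would write `pa.snpeff.vcf` next to the protein FASTA and XML files, provided Java and the jar are available in the working directory.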

acesnik changed the title from "how to add species for variant call" to "Generating proteogenomic database for Pseudomonas with VCF called from WGS data" on Mar 24, 2021
acesnik changed the title from "Generating proteogenomic database for Pseudomonas with VCF called from WGS data" to "Generating proteogenomic database for Pseudomonas with VCF called from WGS (or exome seq) data" on Nov 29, 2022