Documentation update: GTF/GFF and biotype annotation #1086

Open
MatthiasZepper opened this issue Oct 5, 2023 · 6 comments
@MatthiasZepper
Member

MatthiasZepper commented Oct 5, 2023

Description of feature

One issue that repeatedly confuses users of the rnaseq pipeline is which reference transcriptome annotation to provide, as GTF or GFF3 (which the pipeline then converts to GTF), and where best to obtain it.

Since the proposal for a new pipeline / subworkflow for reference bundle preparation is somewhat stalled, I believe some updates to the pipeline's documentation can't hurt. Specifically, we need:

Since this is a documentation-only task, I believe it is well suited for the Hackathon.

Thanks!

@lazappi

lazappi commented Jun 5, 2024

In addition to the points raised above, it would be great to have some guidance on the version of the genome/annotation that the pipeline expects/recommends. This is especially true for GENCODE which provides complete and primary assembly genomes, and reference chromosome, primary assembly and complete annotations, each with a basic and comprehensive version. I tried to work out what is pulled from iGenomes if you just provide a genome name but didn't have much luck.

The STAR documentation recommends the primary assembly genome and the "most comprehensive" annotation (presumably for the primary assembly). This is in contrast to the GENCODE webpage which suggests the basic version should be used by most people.

For Salmon, it is less clear but most examples seem to use the GENCODE transcripts FASTA which only covers the reference chromosomes. ENSEMBL provides a similar cDNA FASTA file but I'm not sure exactly what is included here and it seems to contain ~50,000 fewer transcripts. Kallisto seems to recommend the ENSEMBL cDNA FASTA or the primary assembly genome with a GTF using kallisto | bustools.

Following both of these recommendations could result in some differences between STAR-Salmon and pseudoalignment with Salmon when both a GTF and a transcripts FASTA are provided (depending on how files are passed around), as one would use the GTF for the primary assembly while the other would use a transcripts FASTA covering only the reference chromosomes. There seem to be only around 60 extra transcripts on the additional parts of the primary assembly, though, so the difference is small and could be avoided by providing just the GTF.
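To check how far a GTF and a transcripts FASTA actually diverge for a given release, the two transcript ID sets can be compared directly. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, it assumes the GTF has `transcript` feature lines carrying a `transcript_id` attribute, and it assumes the transcript ID is the first token of each FASTA header (before any `|` or whitespace). Version suffixes such as `.2` are stripped by default, since some sources keep the version in the ID and others do not.

```python
import gzip
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
VERSION_SUFFIX = re.compile(r"\.\d+$")


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def normalise(transcript_id, strip_version):
    """Optionally drop a trailing '.<number>' version suffix."""
    return VERSION_SUFFIX.sub("", transcript_id) if strip_version else transcript_id


def gtf_transcript_ids(lines, strip_version=True):
    """Collect transcript IDs from the 'transcript' feature lines of a GTF."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(normalise(match.group(1), strip_version))
    return ids


def fasta_transcript_ids(lines, strip_version=True):
    """Collect the first header token of every FASTA record."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            first_token = re.split(r"[|\s]", line[1:].strip(), maxsplit=1)[0]
            ids.add(normalise(first_token, strip_version))
    return ids


def compare_transcript_sets(gtf_lines, fasta_lines, strip_version=True):
    """Return the shared and source-specific transcript IDs."""
    gtf_ids = gtf_transcript_ids(gtf_lines, strip_version)
    fasta_ids = fasta_transcript_ids(fasta_lines, strip_version)
    return {
        "shared": gtf_ids & fasta_ids,
        "gtf_only": gtf_ids - fasta_ids,
        "fasta_only": fasta_ids - gtf_ids,
    }
```

With real files, both inputs can be opened with `open_text` and passed straight to `compare_transcript_sets`; the size of `gtf_only` then gives the number of transcripts that quantification against the FASTA alone would miss.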

Sorry if that's too much information. I'd gone down a bit of a rabbit hole with this and thought it would be good to write it somewhere for future reference.

@MatthiasZepper
Member Author

No need to apologize for too much information! It is fantastic that you put in the effort to research and document this. It would be even better to write it down directly in the pipeline's documentation.

How would you feel about adding this to the pipeline's documentation?

@lazappi

lazappi commented Jun 6, 2024

I'm happy to try and contribute some text but maybe it would be good for people more familiar with the pipeline than me to confirm what is recommended. This is what I think so far:

General advice

  • Use the most complete genome without additional loci, patches etc.
  • Use the most comprehensive annotation for that genome (note that this means quantifying lots of features beyond protein-coding genes that people often aren't interested in)

GENCODE

  • Genome FASTA: {assembly}.primary_assembly.genome.fa.gz
  • Annotation GTF: gencode.{release}.primary_assembly.annotation.gtf.gz (comprehensive not "basic")

ENSEMBL

  • Genome FASTA: {species}.{assembly}.dna.primary_assembly.fa.gz
  • Annotation GTF: {species}.{assembly}.{release}.gtf.gz

Both

  • Do not supply a transcriptome FASTA (or BED files, indexes etc.) unless they have been generated from a previous run of the pipeline (same version?) with the same genome/GTF using --save_reference (particularly if doing both alignment and pseudoalignment)

This is based on human and is probably similar for mouse but I'm less sure about other species.
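The file name patterns listed above can be written down as templates, which makes it easier to record exactly which files a given analysis used. This is a small sketch only: the function name is hypothetical, and the assembly, release, and species values in the usage note are illustrative choices for human, not recommendations taken from this thread.

```python
# File name templates copied from the GENCODE and ENSEMBL lists above.
TEMPLATES = {
    "gencode": {
        "fasta": "{assembly}.primary_assembly.genome.fa.gz",
        "gtf": "gencode.{release}.primary_assembly.annotation.gtf.gz",
    },
    "ensembl": {
        "fasta": "{species}.{assembly}.dna.primary_assembly.fa.gz",
        "gtf": "{species}.{assembly}.{release}.gtf.gz",
    },
}


def reference_file_names(source, assembly, release, species=None):
    """Fill in the genome FASTA and annotation GTF file names for one source."""
    key = source.lower()
    if key not in TEMPLATES:
        raise ValueError(f"unknown source: {source!r}")
    if key == "ensembl" and species is None:
        raise ValueError("the Ensembl templates need a species name")
    fields = {"assembly": assembly, "release": release, "species": species}
    return {kind: template.format(**fields) for kind, template in TEMPLATES[key].items()}
```

For example, `reference_file_names("gencode", "GRCh38", "v44")` yields `GRCh38.primary_assembly.genome.fa.gz` and `gencode.v44.primary_assembly.annotation.gtf.gz`, while `reference_file_names("ensembl", "GRCh38", "110", species="Homo_sapiens")` yields `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` and `Homo_sapiens.GRCh38.110.gtf.gz`.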

If there is a general consensus that these are the recommendations, I'm happy to write this up properly and add a section to the docs.

@tdanhorn

tdanhorn commented Jun 9, 2024

Thank you for taking this on! I agree with all of your suggestions. A couple of additional points:

  • It may be worthwhile spelling out clearly that -- unlike for other pipelines with complex reference file requirements that really benefit from a curated collection like iGenomes (I'm looking at you, sarek!) -- for RNAseq analysis all you need are two reference files -- the genome assembly (FASTA) and the gene annotation (GTF/GFF). Everything else can be created from these, and the pipeline will do that.
  • For many (most?) Ensembl species there is no separate "primary" assembly, it is the same as "toplevel" and is also called that. (Human and mouse have "primary", but fly, cow, and dog have only "toplevel". The difference between the two is that "toplevel" contains haplotypes and patches, but these mostly exist for human and mouse.)
  • It makes sense to use a precomputed STAR index, because building it is a very costly step (probably the highest memory consumption of the whole pipeline). A potential problem is that an index made by one STAR version is not always compatible with other versions, so:
    • either make sure you use a STAR version compatible with the one used by the pipeline (ideally the same) for making the index, or
    • let the pipeline create the index the first time (with --save_reference) and then reuse that.
  • To make use of all QC features, the GTF should contain gene_biotype. (This is generally the case for Ensembl and Gencode.)
  • To ensure reproducibility as well as the ability to share and document a workflow easily, the genome assembly and gene annotation (GTF/GFF) have to be "well defined", i.e. have a clear version that fully defines the content. (Not a big obstacle for the assembly typically, but gene annotation changes all the time. Both Ensembl and Gencode have clear versions/releases.)
  • I don't know if the docs spell this out clearly, but GFFs will be converted to GTFs, so if you have both, use the GTF.
  • While it may be tempting to use gene names (symbols) as primary identifiers, these are often not unique and also change over time, which can cause problems. That's why Ensembl (and by extension Gencode) uses stable IDs that are unique (and associated with specific genomic coordinates), but can be linked to symbols (most of the time; for some genes there is no symbol).
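Two of the points above, the presence of gene_biotype and the non-uniqueness of gene symbols, can be checked on a GTF before running the pipeline. The following is a rough sketch under stated assumptions: the function names are hypothetical, the GTF is assumed to contain `gene` feature lines with quoted `gene_id` and `gene_name` attributes, and the biotype key is a parameter in case an annotation uses a different attribute name.

```python
import re
from collections import defaultdict

# Matches quoted GTF attributes such as: gene_id "ENSG00000000003";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(column):
    """Turn the ninth GTF column into a dict (repeated keys keep the last value)."""
    return dict(ATTRIBUTE.findall(column))


def summarise_gene_records(lines, biotype_key="gene_biotype"):
    """Count genes, list those lacking a biotype, and find symbols shared by several IDs."""
    total = 0
    missing_biotype = []
    ids_by_name = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = parse_attributes(fields[8])
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        total += 1
        if biotype_key not in attrs:
            missing_biotype.append(gene_id)
        name = attrs.get("gene_name")
        if name:
            ids_by_name[name].add(gene_id)
    ambiguous = {
        name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1
    }
    return {
        "genes": total,
        "missing_biotype": missing_biotype,
        "ambiguous_names": ambiguous,
    }
```

An empty `missing_biotype` list suggests the biotype-based QC can use every gene, and a non-empty `ambiguous_names` dict shows concretely why symbols are a poor primary key for that annotation.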

@pinin4fjords
Member

Nice work, all. It would be lovely to make all this explicit in the docs.

@lazappi

lazappi commented Jun 11, 2024

I opened a draft PR which tries to incorporate what was said here. I figured it would be easier to give more specific comments where there is some actual text to look at. More comments/contributions welcome!
