Documentation update: GTF/GFF and biotype annotation #1086

MatthiasZepper · 2023-10-05T12:55:50Z

Description of feature

One issue that repeatedly confuses users of the rnaseq pipeline is the reference transcriptome annotation, which must be provided as GTF or GFF3 (the latter is then converted to GTF by the pipeline), and where best to obtain it.

Since the proposal for a new pipeline / subworkflow for reference bundle preparation is somewhat stalled, I believe some updates to the pipeline's documentation can't hurt. Specifically, we need:

Some clarification of which reference files are needed if iGenomes is not used (thanks Thomas Danhorn!).
Some troubleshooting information on the lack of biotype annotation in NCBI GTFs, noting that those from Ensembl or Gencode might be a good alternative (thanks Thomas Danhorn!). It may also be worthwhile to mention why the biotype annotation is needed in the first place, and that there is a --skip_biotype_qc parameter in case it is not available.
Some troubleshooting information on the Salmon error Error: no valid ID found for GFF record, which occurs if a GTF file is provided that contains gene entries with empty transcript_id "" fields, like those recently distributed by RefSeq and Ensembl. It should be mentioned that such files must be preprocessed by deleting the respective entries with grep -v 'transcript_id ""' original.gtf > filtered.gtf to work with the pipeline.
A hint pointing out that Ensembl stopped linking the GTF files on their Ensembl Genomes species pages, but that they are still available via FTP download (thanks Alexandru Mizeranschi!).
A brief info box highlighting that GTF and GFF3 are similar, but not identical, and pointing to a good resource for further information.
Possibly adding some troubleshooting information on how people can run the GFF3 to GTF conversion step in isolation in case this fails during a pipeline run.
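
The grep filter mentioned in the list above can be wrapped into a small script. This is only a minimal sketch, assuming an uncompressed GTF; the function names filter_empty_transcript_ids and count_empty_transcript_ids are made up for illustration and are not part of the pipeline.

```shell
# Sketch: remove GTF records whose transcript_id attribute is empty.
# Function names are hypothetical; input is assumed to be uncompressed.

filter_empty_transcript_ids() {
    # $1: input GTF, $2: output GTF
    # grep exits with status 1 when it selects no lines, which is not an
    # error here, so only statuses above 1 are treated as failures.
    grep -v 'transcript_id ""' "$1" > "$2" || [ $? -eq 1 ]
}

count_empty_transcript_ids() {
    # $1: GTF file; prints the number of records with an empty transcript_id.
    grep -c 'transcript_id ""' "$1" || [ $? -eq 1 ]
}
```

Calling filter_empty_transcript_ids original.gtf filtered.gtf writes the cleaned file; running count_empty_transcript_ids on the result is a quick way to confirm that no empty entries remain.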

Since this is a documentation-only task, I believe it is well suited for the Hackathon.

Thanks!

lazappi · 2024-06-05T13:43:04Z

In addition to the points raised above, it would be great to have some guidance on the version of the genome/annotation that the pipeline expects/recommends. This is especially true for GENCODE which provides complete and primary assembly genomes, and reference chromosome, primary assembly and complete annotations, each with a basic and comprehensive version. I tried to work out what is pulled from iGenomes if you just provide a genome name but didn't have much luck.

The STAR documentation recommends the primary assembly genome and the "most comprehensive" annotation (presumably for the primary assembly). This is in contrast to the GENCODE webpage which suggests the basic version should be used by most people.

For Salmon, it is less clear but most examples seem to use the GENCODE transcripts FASTA which only covers the reference chromosomes. ENSEMBL provides a similar cDNA FASTA file but I'm not sure exactly what is included here and it seems to contain ~50,000 fewer transcripts. Kallisto seems to recommend the ENSEMBL cDNA FASTA or the primary assembly genome with a GTF using kallisto | bustools.

Following both of these recommendations could result in some differences between STAR-Salmon and Salmon pseudoalignment when both a GTF and a transcripts FASTA are provided (depending on how files are passed around), as one would use the GTF for the primary assembly while the other would use a transcripts FASTA covering just the reference chromosomes. There seem to be only ~60 extra transcripts on the additional parts of the primary assembly, though, so the difference is small and could be avoided by providing just the GTF.
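
One way to check how far a GTF and a transcripts FASTA diverge is to compare their transcript IDs. The following is a rough sketch, not something the pipeline does itself: the function names are hypothetical, both files are assumed to be uncompressed, and IDs are assumed to be written identically in both files (including any version suffix) with the ID as the first token of each FASTA header.

```shell
# Sketch: list transcript IDs found in a GTF but not in a transcripts FASTA.
# All function names are hypothetical.

gtf_transcript_ids() {
    # $1: GTF file; prints unique, non-empty transcript_id values.
    grep -o 'transcript_id "[^"][^"]*"' "$1" | cut -d '"' -f 2 | sort -u
}

fasta_transcript_ids() {
    # $1: FASTA file; prints the first header token, cut at a space or "|".
    sed -n 's/^>\([^ |]*\).*/\1/p' "$1" | sort -u
}

transcripts_missing_from_fasta() {
    # $1: GTF, $2: transcripts FASTA
    ids_dir=$(mktemp -d)
    gtf_transcript_ids "$1" > "$ids_dir/gtf.txt"
    fasta_transcript_ids "$2" > "$ids_dir/fasta.txt"
    # comm -23 prints lines that occur only in the first sorted file.
    comm -23 "$ids_dir/gtf.txt" "$ids_dir/fasta.txt"
    rm -r "$ids_dir"
}
```

Piping the output of transcripts_missing_from_fasta through wc -l gives the number of transcripts that only the GTF knows about.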

Sorry if that's too much information. I'd gone down a bit of a rabbit hole with this and thought it would be good to write it somewhere for future reference.

MatthiasZepper · 2024-06-05T17:54:51Z

No need to apologize for too much information! Actually, it is fantastic that you put in the effort to research and document it. But it would be even better to write it down directly in the documentation of the pipeline.

How would you feel about adding this to the pipeline's documentation?

lazappi · 2024-06-06T06:27:47Z

I'm happy to try and contribute some text but maybe it would be good for people more familiar with the pipeline than me to confirm what is recommended. This is what I think so far:

General advice

Use the most complete genome without additional loci, patches etc.
Use the most comprehensive annotation for that genome (note that this means quantifying lots of features beyond protein-coding genes that people often aren't interested in)

GENCODE

Genome FASTA: {assembly}.primary_assembly.genome.fa.gz
Annotation GTF: gencode.{release}.primary_assembly.annotation.gtf.gz (comprehensive not "basic")

ENSEMBL

Genome FASTA: {species}.{assembly}.dna.primary_assembly.fa.gz
Annotation GTF: {species}.{assembly}.{release}.gtf.gz

Both

Do not supply a transcriptome FASTA (or BED files, indexes etc.) unless they have been generated from a previous run of the pipeline (same version?) with the same genome/GTF using --save_reference (particularly if doing both alignment and pseudoalignment)

This is based on human and is probably similar for mouse but I'm less sure about other species.
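
To make the placeholders in the file name patterns above concrete, here is a small sketch that expands them. The function names are hypothetical, and the assembly, release and species values in the comments are only example inputs, not a recommendation of a particular release.

```shell
# Sketch: expand the GENCODE and ENSEMBL file name patterns listed above.
# Function names are hypothetical; argument values are examples only.

gencode_reference_files() {
    # $1: assembly (e.g. GRCh38), $2: release (e.g. v46)
    printf '%s.primary_assembly.genome.fa.gz\n' "$1"
    printf 'gencode.%s.primary_assembly.annotation.gtf.gz\n' "$2"
}

ensembl_reference_files() {
    # $1: species (e.g. Homo_sapiens), $2: assembly, $3: release (e.g. 112)
    printf '%s.%s.dna.primary_assembly.fa.gz\n' "$1" "$2"
    printf '%s.%s.%s.gtf.gz\n' "$1" "$2" "$3"
}
```

Each function prints the genome FASTA name first and the annotation GTF name second; those two files are what would then be handed to the pipeline.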

If there is general consensus that these are the recommendations, I'm happy to write this up properly and add a section to the docs.

tdanhorn · 2024-06-09T05:04:19Z

Thank you for taking this on! I agree with all of your suggestions. A couple of additional points:

It may be worthwhile spelling out clearly that -- unlike for other pipelines with complex reference file requirements that really benefit from a curated collection like iGenomes (I'm looking at you, sarek!) -- for RNAseq analysis all you need are two reference files -- the genome assembly (FASTA) and the gene annotation (GTF/GFF). Everything else can be created from these, and the pipeline will do that.
For many (most?) Ensembl species there is no separate "primary" assembly, it is the same as "toplevel" and is also called that. (Human and mouse have "primary", but fly, cow, and dog have only "toplevel". The difference between the two is that "toplevel" contains haplotypes and patches, but these mostly exist for human and mouse.)
It makes sense to use a precomputed STAR index, because building it is a very costly step (probably the highest memory consumption of the whole pipeline). A potential problem is that an index made by one STAR version is not always compatible with other versions, so:
- either make sure you use a STAR version compatible with the one used by the pipeline (ideally the same) for making the index, or
- let the pipeline create the index the first time (with --save_reference) and then reuse that.
To make use of all QC features, the GTF should contain gene_biotype. (This is generally the case for Ensembl and Gencode.)
To ensure reproducibility as well as the ability to share and document a workflow easily, the genome assembly and gene annotation (GTF/GFF) have to be "well defined", i.e. have a clear version that fully defines the content. (Not a big obstacle for the assembly typically, but gene annotation changes all the time. Both Ensembl and Gencode have clear versions/releases.)
I don't know if the docs spell this out clearly, but GFFs will be converted to GTFs, so if you have both, use the GTF.
While it may be tempting to use gene names (symbols) as primary identifiers, these tend to be not unique and also change over time, which can cause problems. That's why Ensembl (and by extension Gencode) uses numeric IDs that are unique (and associated with specific genomic coordinates), but can be linked to symbols (most of the time; for some genes there is no symbol).
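
Whether a GTF carries the biotype attribute mentioned above can be checked before starting a run. This is a minimal sketch with a hypothetical function name, assuming an uncompressed GTF. The attribute name is a parameter because not every source uses the same key; GENCODE files, for example, name it gene_type rather than gene_biotype.

```shell
# Sketch: count GTF records that lack a given attribute.
# The function name is hypothetical; input is assumed to be uncompressed.

count_records_missing_attribute() {
    # $1: GTF file, $2: attribute name (defaults to gene_biotype)
    attribute=${2:-gene_biotype}
    # Drop comment lines, then count the records without the attribute.
    # grep -c exits with status 1 when the count is 0, which is fine here.
    grep -v '^#' "$1" | grep -v -c "[[:space:]]${attribute} \"" || [ $? -eq 1 ]
}
```

A result of 0 means every record has the attribute; a result equal to the number of records suggests the file has no biotype annotation at all.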

pinin4fjords · 2024-06-10T07:58:14Z

Nice work all, it would be lovely to make all this explicit in the docs.

lazappi · 2024-06-11T08:15:54Z

I opened a draft PR which tries to incorporate what was said here. I figured it would be easier to give more specific comments where there is some actual text to look at. More comments/contributions welcome!

MatthiasZepper added first-timers-only Good for newcomers enhancement labels Oct 5, 2023

jvturatsinze self-assigned this Oct 16, 2023

JGawra unassigned jvturatsinze Oct 18, 2023

lazappi mentioned this issue Jun 11, 2024

Add reference recommendations to usage docs #1314

Merged

11 tasks
