Add reference recommendations to usage docs #1314

Merged
merged 15 commits into from
Jul 15, 2024
Add reference files section to usage docs
lazappi committed Jun 11, 2024
commit d40544285d222be26c0bdef92c5449dddcd9781b
40 changes: 40 additions & 0 deletions docs/usage.md
@@ -55,6 +55,46 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p

> **NB:** The `group` and `replicate` columns were replaced with a single `sample` column as of v3.1 of the pipeline. The `sample` column is essentially a concatenation of the `group` and `replicate` columns, however it now also offers more flexibility in instances where replicate information is not required e.g. when sequencing clinical samples. If all values of `sample` have the same number of underscores, fields defined by these underscore-separated names may be used in the PCA plots produced by the pipeline, to regain the ability to represent different groupings.

## Reference files

The only reference files required by the pipeline are a FASTA file with the reference genome sequence and a GTF/GFF file with a gene annotation. All other reference files can be created from those by the pipeline. However, selecting the appropriate reference genome and annotation to use for analysis can still be difficult. Here we provide some advice on what is expected by the pipeline:

:::note
**GENCODE vs ENSEMBL**

Two of the most common sources of genomic references are GENCODE (for mouse and human) and ENSEMBL (for many organisms). There has been an effort to standardise information between the two sources and now the references [should be consistent](https://www.gencodegenes.org/pages/faq.html) regardless of where they are obtained from (for mouse and human).

However, while the information is consistent, there are still some practical differences. GENCODE prefixes chromosome names with `chr` (e.g. `chr1`, `chr2`, ...) while ENSEMBL uses simple `1`, `2`, etc. Different names can also be used for sequences outside the reference chromosomes. GENCODE also attaches version suffixes to gene and transcript identifiers (e.g. `ENSG00000254647.1`). For these reasons, resources from the two sources cannot be mixed and it is important to stick to one reference source. Some of the steps in the pipeline expect an ENSEMBL reference by default, so it is important to set the `--gencode` option if your reference comes from GENCODE.
:::
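Because mixed sources usually show up as mismatched sequence names, a quick check before starting a run can save time. The following is a minimal sketch, not part of the pipeline: the function names are our own, and it assumes a standard FASTA (names are the first word of each `>` header) and a tab-separated GTF/GFF (names are in the first column).

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    return gzip.open(path, "rt") if path.suffix == ".gz" else open(path, "rt")


def fasta_seq_names(path):
    """Return the sequence names (first word of each '>' header line)."""
    with open_text(path) as handle:
        return {
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        }


def gtf_seq_names(path):
    """Return the sequence names found in column 1 of a GTF/GFF file."""
    with open_text(path) as handle:
        return {
            line.split("\t", 1)[0]
            for line in handle
            if line.strip() and not line.startswith("#")
        }


def missing_from_fasta(fasta_path, gtf_path):
    """Names used by the annotation that do not exist in the genome."""
    return gtf_seq_names(gtf_path) - fasta_seq_names(fasta_path)
```

If `missing_from_fasta` returns names such as `1` or `chr1`, the genome and annotation most likely come from different sources.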

### Reference genome

It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the so-called "primary assembly", which includes the reference chromosomes as well as some additional scaffolds. For the human assembly GRCh38 (hg38), this would be the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from ENSEMBL. These files are preferred as they cover the largest amount of the reference genome without including multiple copies of the same sequence, which can confuse aligners such as STAR. Most other species (fly, cow, dog etc.) do not have a primary assembly, in which case the complete reference sequence, or "toplevel" assembly, should be used. The difference between the two is the inclusion of alternative loci (haplotypes), but these do not typically exist for species outside mouse and human.

### Gene annotation

Gene annotations are updated more frequently than the reference genome sequence and there are more options to consider here. Because annotations can be updated frequently, you should rely on sources that include well-defined, versioned releases such as ENSEMBL or GENCODE. We generally recommend using the most recent release in order to have the most up-to-date gene annotations. However, if you are planning to combine your data with a dataset that was processed in the past, you may want to use the annotation version that was used previously for greater consistency. Once you have decided on a release to use, you can then select an annotation file. This should be the most comprehensive annotation that matches the reference genome you are using. So if you are using the human primary assembly, you would want the comprehensive annotation for the primary assembly (the `gencode.{release}.primary_assembly.annotation.gtf.gz` file from GENCODE or the `Homo_sapiens.GRCh38.{release}.gtf.gz` file from ENSEMBL). For something like fly, you would want the annotation matching the toplevel assembly (e.g. `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from ENSEMBL). As well as the comprehensive annotations for the primary and toplevel assemblies, and for just the reference chromosomes, GENCODE also provides "basic" annotations which only include representative transcripts, but we do not recommend using these.

Gene annotations typically provide a primary identifier for each feature as well as a more common name. For example, the ENSEMBL ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While the gene names may be more familiar and easier to understand, it is important to retain and use the primary identifiers as they are unique for a given annotation and are much easier to map between annotation versions or sources.
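To illustrate keeping primary identifiers alongside names, here is a minimal sketch that builds an ID-to-name lookup from the `gene` records of a GTF. It is not pipeline code and the helper names are hypothetical. It assumes the attribute column uses the usual `key "value";` layout with `gene_id` and `gene_name` keys, and that a version suffix, if present, is a dot followed by digits. Unquoted attribute values are ignored, and for repeated keys only the last value is kept.

```python
import re

# Matches quoted GTF attributes such as: gene_id "ENSG00000254647.1";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_attributes(field):
    """Parse the ninth GTF column into a dict of quoted attributes."""
    return dict(ATTRIBUTE_PATTERN.findall(field))


def strip_version(identifier):
    """Drop a trailing '.N' version suffix from an identifier, if any."""
    return re.sub(r"\.\d+$", "", identifier)


def gene_names_by_id(gtf_lines):
    """Map unversioned gene IDs to gene names using 'gene' records only."""
    names = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "gene":
            continue
        attributes = parse_gtf_attributes(columns[8])
        if "gene_id" in attributes:
            gene_id = strip_version(attributes["gene_id"])
            names[gene_id] = attributes.get("gene_name", "")
    return names
```

Keeping the lookup keyed by identifier means that names can be added to results tables for display without losing the stable key.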

To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or ENSEMBL but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
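If you are unsure whether your annotation carries a biotype field, a quick scan of the first records can tell you. This is a minimal sketch with a hypothetical helper name, not something the pipeline provides. It assumes GTF-style `key "value";` attributes. Note that GENCODE GTF files name this attribute `gene_type` rather than `gene_biotype`, so pass the attribute name that matches your source.

```python
import re


def has_attribute(gtf_lines, attribute="gene_biotype", max_records=1000):
    """Return True if any of the first `max_records` records has the attribute."""
    pattern = re.compile(rf'\b{re.escape(attribute)} "')
    seen = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        if pattern.search(columns[8]):
            return True
        seen += 1
        if seen >= max_records:
            break
    return False
```

A `False` result for the attribute name relevant to your source suggests that `--skip_biotype_qc` should be set.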

:::note
**GTF vs GFF**

GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations. GTF (Gene Transfer Format) is a specific implementation of this format corresponding to GFF version 2. The pipeline can accept both GFF and GTF, but any GFF files will be converted to GTF, so if a GTF is available for your annotation of choice it is better to provide that directly.

More information and links to further resources are [available from ENSEMBL](https://www.ensembl.org/info/website/upload/gff.html).
:::
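File extensions are not always reliable, so it can help to look at the attribute column itself. The sketch below is a rough heuristic with a hypothetical function name, not a validator: it assumes GTF attributes are written as `key "value";` pairs and GFF3 attributes as `key=value` pairs, and it only inspects the start of the field.

```python
import re

# GFF3 attributes start with key=value, e.g. ID=gene:G1;Name=INS
GFF3_START = re.compile(r'\s*[^\s=;"]+=')
# GTF attributes start with key "value";, e.g. gene_id "G1";
GTF_START = re.compile(r'\s*[^\s=;"]+ "')


def guess_annotation_format(attribute_field):
    """Guess 'GFF3', 'GTF' or 'unknown' from the ninth column of a record."""
    if GFF3_START.match(attribute_field):
        return "GFF3"
    if GTF_START.match(attribute_field):
        return "GTF"
    return "unknown"
```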

### Reference transcriptome

As well as the reference genome sequence and annotation, it is possible to provide a reference transcriptome FASTA file. These can be obtained from GENCODE or ENSEMBL, but it is important to note that the sequences they provide only cover the reference chromosomes and can result in inconsistencies if you have provided a primary or toplevel genome assembly and annotation. For this reason, we recommend not providing a transcriptome FASTA and instead letting the pipeline create it from the provided genome and annotation. As with the aligner indexes, it is possible to save the created transcriptome FASTA and BED files to a central location and provide them to future pipeline runs in order to avoid having multiple copies on your system, but it is important to make sure that all genome, annotation, transcriptome and index versions match.

### Indexes

Creating the index files required for the alignment and/or pseudoalignment steps can be computationally intensive and the files produced are quite large. To avoid repeating this work and having multiple redundant files, we recommend saving the indexes using the `--save_reference` option and moving them to a central location where they can be accessed by future pipeline runs. When doing this, it is important to record the genome and annotation versions they correspond to so you can easily locate the correct index to use. You should also record the version of the program that built the index, as an index produced with one version may not have a format compatible with other versions.
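One lightweight way to keep this record is to store a small manifest file next to each saved index. The sketch below shows one possible convention, not something the pipeline produces or reads: the file name `index_manifest.json`, the field names and the function itself are all assumptions made for illustration.

```python
import json
from datetime import date
from pathlib import Path


def write_index_manifest(index_dir, genome, annotation, tool, tool_version):
    """Write a JSON manifest describing how an index was built."""
    manifest = {
        "genome": genome,              # e.g. assembly name and source
        "annotation": annotation,      # e.g. source and release
        "tool": tool,                  # program that built the index
        "tool_version": tool_version,  # version of that program
        "created": date.today().isoformat(),
    }
    path = Path(index_dir) / "index_manifest.json"
    path.write_text(json.dumps(manifest, indent=2) + "\n")
    return path
```

Checking the manifest before reusing an index makes it easier to confirm that the genome, annotation and program versions match the current run.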

## Adapter trimming options

[Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) is a wrapper tool around Cutadapt and FastQC to perform quality and adapter trimming on FastQ files. Trim Galore! will automatically detect and trim the appropriate adapter sequence. It is the default trimming tool used by this pipeline; however, you can use fastp instead by specifying the `--trimmer fastp` parameter. [fastp](https://github.com/OpenGene/fastp) is a tool designed to provide fast, all-in-one preprocessing for FastQ files. It has been developed in C++ with multithreading support to achieve higher performance. You can specify additional options for Trim Galore! and fastp via the `--extra_trimgalore_args` and `--extra_fastp_args` parameters, respectively.