Jack Humphrey & Erica Brophy
Towfique Raj Lab
Mount Sinai, New York
2021-2023
Snakemake pipeline for running multi-cohort QTL meta-analysis using the linear mixed model mmQTL to account for diverse ancestries and repeated donors.
Pipeline allows for multiple cohorts to be uniformly processed for joint QTL mapping and meta-analysis of multiple phenotypes, including expression, transcript usage, splice junction usage (Leafcutter) and splicing event usage (SUPPA).
User must assemble a dataset key (see example) created by the user in the format of a dataframe. The dataset key contains 5 rows: 1. Dataset, 2. sampleKey, 3. VCF, 4. Phenotype file, 5. Covariate file; where the column provides the path to each of these files.
In this example, 3 datasets from 2 human brain cohorts are combined:
dataset | sample_key | genotypes | phenotypes | PEER | counts |
---|---|---|---|---|---|
cohort1_frontal_cortex | cohort1_temporal_sample_key.tsv | cohort1.vcf | cohort1_frontal_gene_tpm.tsv | 10 | cohort1_frontal_gene_counts.tsv |
cohort1_temporal_cortex | cohort1_frontal_sample_key.tsv | cohort1.vcf | cohort1_temporal_gene_tpm.tsv | 15 | cohort1_temporal_gene_counts.tsv |
cohort2_frontal_cortex | cohort2_frontal_sample_key.tsv | cohort2.vcf | cohort2_frontal_gene_tpm.tsv | 15 | cohort2_frontal_gene_tpm.tsv |
The inputs required by the user are a sampleKey matching RNA samples to DNA samples, a VCF file containing the genotype information, a phenotype matrix and a gene annotation file (GENCODE or similar).
The sample key must have two columns named participant_id listing the names of the genotype IDs and sample_id listing the name of the phenotype sample IDs.
The genotypes must be in VCF format with all chromsomes in a single file.
The phenotype matrix must be a tab-separated table with one feature per row, the first column should be the unique identifier of the feature, followed by each sample ID in the dataset. The matrix should contain normalised values for each sample. See below for types of phenotypes that mmQTL can work with.
The PEER factor should be a number of PEER factors to be regressed out of the phenotype matrix in each dataset. Ideally this has already been determined using the QTL-mapping-pipeline
counts For mapping gene expression QTLs, TPM values are used for removing lowly expressed genes, and read counts are used as the phenotype. For all other QTL types a counts column is not necessary.
- Gene expression
This must be TPM normalised. A phenotype metadata file should be created. This can be created from a GTF file using scripts/get_gene_meta_from_gtf.R
in "gene" mode.
- Transcripts
The user must prepare a transcript expression matrix with TPM normalised transcript expression, from Kallisto, RSEM or similar.
If the user wishes to treat each transcript as it's own feature - (transcript QTLs), in the config.yaml set group: False
If the user wishes to divide each transcript by the total gene TPM, (transcript usage QTLs), set group: True
Transcript metadata must be created from the same GTF as used to create the transcript expression matrix using scripts/get_gene_meta_from_gtf.R
in "transcript" mode.
- SUPPA local splicing events
python3.4 suppa.py generateEvents -i <input-file.gtf> -o <output-file> -f ioe -e <list-of-events>
- Junction usage QTLs (Leafcutter)
If the user wants to perform junction usage QTLs (Leafcutter splicing QTLs) then provide a list of junction files, created by regtools.
The user must provide an exon metadata created from a GTF of the user's choice using scripts/get_gene_meta_from_gtf.R
in "exon" mode. Phenotype metadata for the junctions will be created by the pipeline.
Then prepare the full leafcutter junction matrix for the entire cohort:
Rscript scripts/prepare_leafcutter_matrix.R --prefix <prefix> --junctions <junction list> --exons <exon metadata>
Where prefix is a directory followed by the stem of a filename, ie "junctions/test"
prepare_qtl_metadata.lsf
creates all QTL metadata files based on your GTF and list of junctions
- minDatasets
This sets the minimum number of datasets that a feature must be present in to be used for QTL meta-analysis (default = 2)
- phenoThreshold
Minimum transcripts per million (TPM) threshold for keeping a gene or transcript for QTL mapping. Default = 1
- phenoFraction
Minimum fraction of the samples in each cohort that must meet the phenoThreshold
. Default = 0.5