This reposistory hosts a Snakemake pipeline used to call 3' UTR cleavage sites de novo from the Mouse Cell Atlas dataset. The final output of the pipeline is the Mouse UTRome used in the scUTRquant pipeline for quantifying 3' UTR isoforms from scRNA-seq data.
Please see our accompanying bioRxiv manuscript.
The original pipeline was run with
- Snakemake 6.8
- Conda 4.11.0
- Mamba 0.21.2
and assumes use of --use-conda
to source all other software.
Rule resources:
values are compatible with Snakemake profiles, which we recommend using. The included lsf.yaml
provides some adjustments to default resource allocations specific to our cluster, mostly providing for jobs with longer run times.
⚠️ This is not trivial pipeline! If you require scUTRquant (kallisto) indices with specific settings, please file an Issue and we may be able to help generating this. For example, we may be able to provide intermediate files to "bootstrap" the pipeline and avoid downloading and aligning the raw data.
After cloning this repository and adjusting the config.yaml
to point at a local HISAT index (hisatIndex
), all outputs for the UTRome can be generated with
snakemake --use-conda
Parameters are adjustable in the config.yaml
:
- epsilon (e): distance for initial merging of cleavage sites
- threshold (t): minimum TPM for cutoff (per cell type)
- gencodeVersion (gc): version of GENCODE to use (default 25)
- polyASiteTPM (pas): minimum TPM for polyASite database entries
- non-internal priming likelihood (f): minimum posterior likelihood for not being an internal priming site
- truncation width (w): maximum transcript distance to cleavage site
- mergeDistance (m): distance within which to merge isoforms when quantifying in scUTRquant (kallisto)
Alternatively, one can generate a specific file directly, such as a kallisto index with a direct command:
snakemake --use-conda data/kdx/utrome.e30.t5.gc25.pas3.f0.9999.w500.kdx