This workflow is designed to analyze metatranscriptomes.
0.0.3
To run this workflow you will need a Docker instance (Docker ≥ v2.1.0.3) and Cromwell. All third-party tools are pulled from Docker Hub.
cromwell ≥ 54
bbtools ≥ v38.94
Python ≥ v3.7.6
featureCounts ≥ v2.0.2
R ≥ v3.6.0
edgeR ≥ v3.28.1 (R package)
pandas ≥ v1.0.5 (python package)
gffutils ≥ v0.10.1 (python package)
metaT uses the same databases used for metagenome annotation. See the README here for the required databases. For QC databases, see here.
The submit script requests a node and launches Cromwell. Cromwell manages the workflow and uses Shifter to run the applications.
java -Dconfig.file=wdls/shifter.conf -jar /full/path/to/cromwell-XX.jar run -i input.json /full/path/to/wdls/metaT.wdl
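If the command above is launched from a Python wrapper rather than typed by hand, it can be assembled as an argument list. The following is a minimal sketch, not part of the workflow: the function name `build_cromwell_command` is hypothetical, and the paths are the same placeholders used in the command above (including the unspecified `cromwell-XX.jar` version).

```python
import shlex


def build_cromwell_command(cromwell_jar, inputs_json, wdl, config="wdls/shifter.conf"):
    """Return the Cromwell 'run' invocation as an argument list.

    Hypothetical helper: it only mirrors the command line shown in this README.
    """
    return [
        "java",
        f"-Dconfig.file={config}",  # Cromwell backend config (Shifter)
        "-jar", str(cromwell_jar),
        "run",
        "-i", str(inputs_json),     # workflow inputs
        str(wdl),                   # workflow definition
    ]


cmd = build_cromwell_command(
    "/full/path/to/cromwell-XX.jar",
    "input.json",
    "/full/path/to/wdls/metaT.wdl",
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when paths contain spaces.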
microbiomedata/meta_t:latest. The Dockerfile can be found in the Docker/metatranscriptomics/ directory.
microbiomedata/bbtools:38.94
scanon/nmdc-meta:v0.0.1
bfoster1/img-omics:0.1.7
scanon/im-trnascan:v0.0.1
scanon/im-last:v0.0.1
scanon/im-hmmsearch:v0.0.0
{
"nmdc_metat.proj": "gold:Ga0370541",
"nmdc_metat.input_file": "/global/cfs/cdirs/m3408/aim2/metatranscriptomics/metaT/test_data/small_test/test_smaller_interleave.fastq.gz",
"nmdc_metat.git_url": "https://github.com/microbiomedata/mg_annotation/releases/tag/0.1",
"nmdc_metat.url_base": "https://data.microbiomedata.org/data/",
"nmdc_metat.outdir": "/global/cfs/cdirs/m3408/aim2/metatranscriptomics/metaT/test_data/test_small_out",
"nmdc_metat.resource": "NERSC - Cori",
"nmdc_metat.url_root": "https://data.microbiomedata.org/data/",
"nmdc_metat.database": "/global/cfs/cdirs/m3408/aim2/database/",
"nmdc_metat.activity_id": "test-activity-id",
"nmdc_metat.threads": 64,
"nmdc_metat.metat_folder": "/global/cfs/cdirs/m3408/aim2/metatranscriptomics/metaT"
}
proj
: A unique name for your project or sample.
input_file
: Full path to the fastq file. The file must be interleaved paired-end fastq.
git_url
: A link to this version of the workflow. Update it to match the version you downloaded.
url_base
: A web location where all the data objects from this run will be stored.
url_root
: Same as url_base.
outdir
: Full path to the folder where all the important outputs will be saved.
resource
: A short description or name of where the data was processed.
database
: Full path to the folder where the RQC (RQCFilterData/) and IMG (img/) annotation databases are located. Within the IMG folder, the following folders are expected:
Cath-FunFam COG IMG-NR Pfam Product_Name_Mappings Rfam SMART SuperFamily TIGRFAM
This folder should also be set in the Cromwell config file.
threads
: Number of threads.
activity_id
: A unique ID for the project.
metat_folder
: Full path to the metaT folder.
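Before submitting a run, it can help to check the input JSON and the database layout against the descriptions above. The sketch below is illustrative only. The function names `check_inputs` and `check_database` are hypothetical, treating every listed key as required is an assumption, and so is the expectation that the IMG subfolders sit under `img/` inside the database folder.

```python
from pathlib import Path

# Keys described in this README. Treating all of them as required is an assumption.
REQUIRED_KEYS = [
    "proj", "input_file", "git_url", "url_base", "outdir", "resource",
    "url_root", "database", "activity_id", "threads", "metat_folder",
]

# Subfolders this README expects inside the IMG database folder.
IMG_SUBFOLDERS = [
    "Cath-FunFam", "COG", "IMG-NR", "Pfam", "Product_Name_Mappings",
    "Rfam", "SMART", "SuperFamily", "TIGRFAM",
]


def check_inputs(inputs, prefix="nmdc_metat."):
    """Return a list of human-readable problems found in the inputs dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if prefix + key not in inputs:
            problems.append(f"missing key: {prefix}{key}")
    threads = inputs.get(prefix + "threads")
    if threads is not None:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
            problems.append("threads must be a positive integer")
    # url_root is documented as being the same as url_base.
    if inputs.get(prefix + "url_base") != inputs.get(prefix + "url_root"):
        problems.append("url_base and url_root differ")
    return problems


def check_database(database_dir):
    """Return the expected database subfolders that are missing on disk."""
    db = Path(database_dir)
    missing = []
    if not (db / "RQCFilterData").is_dir():
        missing.append("RQCFilterData")
    for name in IMG_SUBFOLDERS:
        # Assumption: IMG subfolders are located under <database>/img/.
        if not (db / "img" / name).is_dir():
            missing.append(f"img/{name}")
    return missing


# Deliberately incomplete example to show the kind of report produced.
example = {"nmdc_metat.proj": "gold:Ga0370541", "nmdc_metat.threads": 64}
for problem in check_inputs(example):
    print(problem)
```

For a real run, the dict would come from `json.load` on the input file, and `check_database` would be called with the value of `nmdc_metat.database`.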
All outputs can be found in the outdir folder. It contains the following subfolders:
outdir/annotation
: Contains GFF files from the annotation run.
outdir/assembly
: Contains FASTA files from the assembly.
outdir/mapback
: Contains the BAM file in which reads were mapped back to the contigs.
outdir/metat_output
: Contains two JSON files, one for sense and one for antisense, with records for features, their annotations, read counts from featureCounts, and FPKM values.
outdir/qa
: Contains cleaned reads and a file with associated statistics.
The output file is a JSON-formatted file called out.json whose records contain RPKMs, read counts, and information from annotation. An example JSON record:
{
"read_count": 2,
"rpkm": 750750.751,
"featuretype": "CDS",
"seqid": "contig_3",
"id": "contig_3_126_347",
"source": "GeneMark.hmm_2 v1.05",
"start": 126,
"end": 347,
"length": 222,
"strand": "+",
"frame": "0",
"extra": [],
"product": "hypothetical protein"
}
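To make the fields of such a record concrete, the sketch below parses the example and recomputes its `rpkm` value with the standard definition, RPKM = reads × 10^9 / (feature length × total mapped reads). The total of 12 mapped reads is an assumption: it is not stated in this README and was chosen only because it reproduces the example value. The workflow's own normalization code may differ in detail.

```python
import json


def rpkm(read_count, feature_length, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads (standard definition)."""
    return read_count * 1e9 / (feature_length * total_mapped_reads)


# Subset of the example record shown above.
record = json.loads("""
{
  "read_count": 2,
  "rpkm": 750750.751,
  "featuretype": "CDS",
  "seqid": "contig_3",
  "id": "contig_3_126_347",
  "start": 126,
  "end": 347,
  "length": 222,
  "strand": "+"
}
""")

# Coordinates are inclusive, so length = end - start + 1.
length = record["end"] - record["start"] + 1

# Assumption: 12 total mapped reads (not given in the README).
value = rpkm(record["read_count"], length, total_mapped_reads=12)
print(length, round(value, 3))  # 222 750750.751
```

The very large RPKM is expected for a tiny test dataset: with only a handful of mapped reads, the per-million scaling inflates every value.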
To test the workflow, we have provided a small test dataset and step-by-step guidance. See the test_data folder.