MAGnet is a computational tool that uses reference based method to refine taxonomic assignment and provide accurate abundance estimation. This pipeline handles reference genome selection, download using NCBI Datasets, read alignment using Minimap2, and breadth and depth of genome coverage calculations.
- Python 3.9
- NCBI Datasets v15.27.1
- Minimap2 v2.24-r1122
- Samtools v1.15.1
- Biopython
- Pandas
- Ete3 v3.1.2
- BWA v0.7.17
usage: magnet.py [-h] -c CLASSIFICATION -i FASTQ [-I FASTQ2] [-m {ont,illumina}] -o OUTPUT [-t TAXID_IDX]
[-a ABUNDANCE_IDX] [--min-abundance MIN_ABUNDANCE] [--min-mapq MIN_MAPQ]
[--min-covscore MIN_COVSCORE] [--threads THREADS] [--include-mag] [--subspecies]
Universal Taxonomic Classification Verifier.
optional arguments:
-h, --help show this help message and exit
-c CLASSIFICATION, --classification CLASSIFICATION
Path to the Taxonomic Classification Report. Accepting csv/tsv file format, other text formats
are treated as tsv.
-i FASTQ, --fastq FASTQ
Path to the first fastq file.
-I FASTQ2, --fastq2 FASTQ2
Path to the second fastq file for paired-end reads.
-m {ont,illumina}, --mode {ont,illumina}
Modes for different sequencing platforms [ont, illumina]. Default:[ont]
-o OUTPUT, --output OUTPUT
Path to the output directory.
-t TAXID_IDX, --taxid-idx TAXID_IDX
The column index (0-based) of the taxids. Default:[0]
-a ABUNDANCE_IDX, --abundance-idx ABUNDANCE_IDX
The column index (0-based) of the abundance. Default:[None]
--min-abundance MIN_ABUNDANCE
Minimum abundance (0-1) for pre-filtering, exclude taxa below the threshold.
--min-mapq MIN_MAPQ Minimum MAPQ for primary alignments. Default:[20]
--min-covscore MIN_COVSCORE
Minimum Coverage Score for supplementary alignments. Default:[0.7]
--threads THREADS Number of threads for Multi-threading. Default:[1]
--include-mag Include metagenomic assemble genomes. Default:[False]
--subspecies Verify taxonomic classification at subspecies rank. Default:[False]
Since Lemur report uses a format that taxid is the first column, we set '-t' to '0', which is already the default setting, and can be omit. Using the following command to run Magnet.
magnet.py -c {lemur_report_file} -i {input_fastq_file} -o {output_path} -m ont
Since kraken2 report uses a format that taxid is the fifth column, we set '-t' to '4', and since the abundance estimation is located in the first column, we set '-a' to '0'. To filter out potiential noise at low abundance, we could set '--min-abundance' to 0.001. Therefore, we could use the following command.
magnet.py -c {kraken2_report_file} -i {input_fastq_file} -o {output_path} -m ont -t 4 -a 0 --min-abundance 0.001
The following table is an example of the output csv file 'cluster_representative.csv'.
Taxonomy ID | Assembly Accession ID | Source Database | Is Representative | Assembly Level | Organism of Assembly | Strain | Total Length | Downloaded | Species | Cluster Representative | Cluster Members | Secondary Breadth | Secondary Expected | Secondary Score | Secondary Depth | Primary Breadth | Primary Expected | Primary Score | Primary Depth | Consensus ANI | Combined PS and ANI (Sqrt(ANI)xPSx100) | Presence/Absence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2678528 | GCF_028867355.1 | SOURCE_DATABASE_REFSEQ | False | Complete Genome | Listeria sp. LM90SB2 | LM90SB2 | 2915834.0 | True | Listeria sp. LM90SB2 | True | GCF_028867355.1,GCF_000196035.1 | 1.0 | 1.0 | 1.0 | 3195.6499999657044 | 1.0 | 1.0 | 1.0 | 3188.0799997530726 | 0.9957277140090639 | 99.79 | Present |
96241 | GCF_006094475.1 | SOURCE_DATABASE_REFSEQ | True | Complete Genome | Bacillus spizizenii ATCC 6633 = JCM 2499 | ATCC 6633 | 4045538.0 | True | Bacillus spizizenii | True | GCF_006094475.1 | 0.9999987640704401 | 0.9999999999999607 | 0.9999987640704794 | 30.85103792256793 | 0.9999987640704401 | 0.999999999999958 | 0.999998764070482 | 30.786837976602836 | 0.9952862581301721 | 99.76 | Present |
1613 | GCF_029961225.1 | SOURCE_DATABASE_REFSEQ | True | Complete Genome | Limosilactobacillus fermentum | EFEL6800 | 2103331.0 | True | Limosilactobacillus fermentum | True | GCF_029961225.1 | 0.17646569702852932 | 0.1786636068369798 | 0.9876980553154514 | 1.1147205952452701 | 0.17275099611416372 | 0.17304955697814606 | 0.9982747088799537 | 1.0992270709852185 | 0.853534466277766 | 92.23 | Present |
287 | GCF_000006765.1 | SOURCE_DATABASE_REFSEQ | True | Complete Genome | Pseudomonas aeruginosa PAO1 | PAO1 | 6264404.0 | True | Pseudomonas aeruginosa | True | GCF_024714315.1,GCF_000006765.1 | 0.9736496879830867 | 1.0 | 0.9736496879830867 | 133.57679009269043 | 0.9736029157761855 | 1.0 | 0.9736029157761855 | 133.2863715317914 | 0.9790992928162352 | 96.34 | Present |
4932 | GCF_000146045.2 | SOURCE_DATABASE_REFSEQ | True | Complete Genome | Saccharomyces cerevisiae S288C | S288C | 12071326.0 | True | Saccharomyces cerevisiae | True | GCF_000146045.2 | 0.9702460413067091 | 0.9988842650498445 | 0.9713297878991953 | 7.005664504492986 | 0.9551612822296097 | 0.9987129198063666 | 0.9563922357335672 | 6.96673509016021 | 0.8611351208262882 | 88.75 | Present |
28901 | GCF_000006945.2 | SOURCE_DATABASE_REFSEQ | True | Complete Genome | Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 | LT2 | 4951383.0 | True | Salmonella enterica | True | GCF_000006945.2,GCF_014492125.1,GCF_022556855.1,GCF_014491785.1 | 0.7829200506438564 | 0.8320860674724839 | 0.9409123421835766 | 2.2776524817939543 | 0.7817768582280826 | 0.8295483061281343 | 0.9424126991193291 | 2.2617706866885654 | 0.9299024245256468 | 90.88 | Present |
562 | GCF_000008865.2 | SOURCE_DATABASE_REFSEQ | True | Complete Genome | Escherichia coli O157:H7 str. Sakai | Sakai substr. RIMD 0509952 | 5594605.0 | True | Escherichia coli | True | GCF_000008865.2 | 0.6591498747494352 | 0.7795045795270928 | 0.8456010292451726 | 2.2926196898951465 | 0.6569027483105632 | 0.7772529211094938 | 0.8451595747917728 | 2.284965162259128 | 0.8996365155332352 | 80.16 | Present |
1913579 | GCF_001875785.1 | SOURCE_DATABASE_REFSEQ | False | Scaffold | Bacillus sp. FMQ74 | FMQ74 | 4161595.0 | True | Bacillus sp. FMQ74 | True | GCF_001875785.1 | 0.003557530225790833 | 0.03775324221474807 | 0.0942311180998676 | 10.81661600810537 | 0.002592275317516481 | 0.02381212312913672 | 0.10886367853291296 | 9.295791620318873 | 0.905627438899158 | 10.36 | Absent |
The table contains the following information on the reference genome that are being selected for alignment:
- Taxonomy ID: The taxonomy ID of the species that users queries
- Assembly Accession ID: The assembly accession ID of the genome that is being selected as reference genome for the alignment process
- Source Database: source of the assembly, part of the metadata from NCBI Database
- Is Representative: Whether or not the assembly is labeled as representative genome in NCBI database
- Assembly Level: the assembly level of the assembly, can be one of the following: complete genome, chromosome, contig, scaffold
- Organism of Assembly: The description of the assembly, may contain strain level information
- Strain: The specific strain used for alignment
- Species: taxonomy name of the species
- Cluster Representative: Whether the genome is a cluster representative after ANI clustering
- Cluster Members: Members of the same cluster
After the reference genomes are being selected, the pipeline create a reference in multi-fasta format by concatenate the reference genomes, and aligns the reads (in fasta format) to this concatenated reference, and calculated the following:
- Secondary Breadth Coverage: The proportion of the genome has reads aligned, including secondary alignment and alignments with low MAPQ.
- Secondary Expected Coverage: An estimation of the expected breadth of coverage based on the assumption that the alignment of the read is randomly distributed along the entire reference genome, excluding secondary alignment and alignments with low MAPQ.
- Secondary Coverage Score: Secondary Breadth Coverage / Secondary Expected Coverage
- Secondary Depth Coverage: The mean depth of coverage, calculated with aligned regions only, excluding secondary alignment and alignments with low MAPQ.
- Primary Breadth Coverage: The proportion of the genome has reads aligned, excluding secondary alignment and alignments with low MAPQ.
- Primary Expected Coverage: An estimation of the expected breadth of coverage based on the assumption that the alignment of the read is randomly distributed along the entire reference genome, excluding secondary alignment and alignments with low MAPQ.
- Primary Coverage Score: Primary Breadth Coverage / Primary Expected Coverage
- Primary Depth Coverage: The mean depth of coverage, calculated with aligned regions only, excluding secondary alignment and alignments with low MAPQ.
- Consensus ANI: The ANI is calculated by comparing the consensus genome created by the primary alignment with the original reference genome. Unaligned regions are excluded.
- Combined PS and ANI (Sqrt(ANI)xPSx100)
- Presence/Absence Calls