
AmpliconNet: 16S rRNA neural network classifier using the direct sequence

Publication:

A. Kishk and M. El-Hadidi, "AmpliconNet: Sequence Based Multi-layer Perceptron for Amplicon Read Classification Using Real-time Data Augmentation," 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 2018, pp. 2413-2418. doi: 10.1109/BIBM.2018.8621287

Usage

Prediction against FASTQ / FASTA files

python src/predict --dir_path test_fastq  --database models/V46 --output_dir test_pred --input_type fastq

The test_pred directory will contain one prediction file for each FASTQ/FASTA file in the test_fastq directory. The reference model has to be changed according to the HVR primers of the study.

For more parameters:

python src/predict --help

Building a taxonomy table

Generate a BIOM-compatible taxonomy table:

python src/Predict_Taxonomy_Table.py --pred_dir ./test_pred/ --o-taxa_table ./test_Biom_taxon.csv --biom_taxon_table True --target_rank all

This taxonomy table can be imported into MEGAN (Import > Text (CSV) format > Classification > Taxonomy).

Example stacked bar chart generated in MEGAN from the taxonomy table for the 3 files in test_fastq (figure omitted).
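
The table can also be inspected outside MEGAN. Below is a minimal sketch in Python with pandas; it assumes the CSV has taxa as rows and one read-count column per input FASTQ/FASTA file, so check the header of your own file, as the exact layout depends on the --target_rank and --biom_taxon_table options.

import pandas as pd
# Load the taxonomy table written by Predict_Taxonomy_Table.py.
# Assumed layout: taxa as rows, one count column per FASTQ/FASTA file.
taxa = pd.read_csv("test_Biom_taxon.csv", index_col=0)
print(taxa.head())          # first few taxa and their per-sample counts
print(taxa.sum(axis=0))     # total assigned reads per sample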

Converting the taxonomy table to BIOM 2.1.7 file

(for tools other than MEGAN, since MEGAN supports only BIOM 1.0)

biom convert -i test_Biom_taxon.csv -o test_Biom_taxon_hdf5.biom --table-type="Taxon table" --to-hdf5
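
The converted file can then be loaded programmatically. A minimal sketch using the biom-format Python package (pip install biom-format), with the file name taken from the conversion command above:

import biom
# Load the BIOM 2.1 (HDF5) table produced by `biom convert`.
table = biom.load_table("test_Biom_taxon_hdf5.biom")
print(table.shape)                          # (number of taxa, number of samples)
print(table.ids(axis="sample"))             # one sample ID per input FASTQ/FASTA file
print(table.ids(axis="observation")[:5])    # first few taxon IDs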

Training a new model

python src/SILVA_header_2_csv.py --silva_path SILVA_132_SSURef_tax_silva.fasta  --silva_header SILVA_header_All_Taxa.csv

python src/preprocess.py --hvr_database V2_SILVA.fa --silva_header SILVA_header_All_Taxa.csv --output_dir models/V2

python src/train.py --database_dir models/V2 --kmer_size 6  --batch_size 250 --training_mode mlp_sk

python src/evaluate.py --database_dir models/V2 --kmer_size 6 --batch_size 250  --training_mode mlp_sk
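
The --kmer_size option controls how each read is turned into the sequence of overlapping k-mers that the network consumes. Below is a minimal illustrative sketch of that tokenisation in Python; the actual vocabulary and encoding are built in src/preprocess.py, so this only shows the idea.

from itertools import product
K = 6
# Assign an integer ID to every possible 6-mer; 0 is kept for padding/unknown k-mers.
VOCAB = {"".join(p): i + 1 for i, p in enumerate(product("ACGT", repeat=K))}
def tokenize(read, k=K):
    # Return the IDs of all overlapping k-mers, preserving their order (and thus position).
    return [VOCAB.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]
print(tokenize("ACGTACGTACGT"))   # 12-base read -> 7 overlapping 6-mers -> 7 token IDs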

What is taxonomic classification?

A nice brief introduction to taxonomic classification by Rob Knight: https://www.youtube.com/watch?v=HkwFdzFLZ0I

Our BIBM Presentation:

https://youtu.be/8RTjeJYX-0k

AmpliconNet in Arabic:

Part 1: https://youtu.be/qxK9fxugMf0
Part 2: https://youtu.be/kNfPmOuA0Nk

Abstract

Taxonomic assignment is the core of targeted metagenomics approaches, which aim to assign sequencing reads to their corresponding taxa. Sequence similarity searching and machine learning (ML) are the two approaches most commonly used for taxonomic assignment based on the 16S rRNA gene. Similarity-based approaches require high computational resources, while ML approaches do not need these resources at prediction time. Most of these ML approaches depend on k-mer frequencies rather than the direct sequence, which leads to low accuracy on short reads because k-mer frequencies do not capture k-mer position. Moreover, training an ML taxonomic classifier depends on a specific read length, so prediction performance may drop as read length decreases. In this study, we built a neural network classifier for 16S rRNA reads based on the SILVA database (version 132). Modeling was performed on the direct sequences using a convolutional neural network (CNN) and other neural network architectures such as the multi-layer perceptron (MLP) and recurrent neural network (RNN). To reduce the modeling time on direct sequences, in-silico PCR was applied to the SILVA database, generating a total of 14 subset databases with universal primers for each single or paired hypervariable region (HVR). In this study, we illustrate the results for the V2 database model on ~1850 genus-level classes. To simulate sequencing fragmentation, we trained on variable-length subsequences, from 50 bases up to the full length of the HVR, resampled randomly in each training iteration. A simple MLP model with global max pooling gives 0.93 test accuracy at the genus level for 100-base subsequences and 0.96 accuracy at the genus level on the full-length V2 HVR. We present AmpliconNet, a novel method that models the direct amplicon sequence using an MLP over a sequence of k-mers, about 20 times faster than a CNN in training and 10 times faster in prediction.
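
As a rough illustration of the kind of model the abstract describes, here is a minimal Keras sketch of an MLP over a sequence of k-mer tokens with global max pooling, plus a random-length cropping step standing in for the real-time augmentation. The layer sizes, class count, and cropping routine are assumptions for illustration, not the published architecture from the paper.

import numpy as np
from tensorflow.keras import layers, models
VOCAB_SIZE = 4 ** 6 + 1     # all 6-mers plus a padding/unknown ID
NUM_CLASSES = 1850          # roughly the number of genus-level classes reported for V2
model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),     # embed each k-mer token
    layers.GlobalMaxPooling1D(),          # pool over positions -> tolerant to varying read length
    layers.Dense(512, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
def random_crop(tokens, min_len=50):
    # Real-time augmentation sketch: keep a random-length window of the tokenised read.
    if len(tokens) <= min_len:
        return tokens
    length = np.random.randint(min_len, len(tokens) + 1)
    start = np.random.randint(0, len(tokens) - length + 1)
    return tokens[start:start + length]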

License

MIT