De Novo identification of species-specific genes for microbial profiling.

trinezac/SG_optimization

Signature Gene optimization

We propose a method for identifying a set of de novo representative genes, termed signature genes (SGs), which can be used both to measure the relative abundance of each metagenomic species (MGS) with high precision and as phylogenetic markers. An initial set of 100 genes that correlate with the median gene abundance profile of the MGS is selected. However, even in samples with high sequencing depth and high species abundance, some genes in the initial set may go undetected, leading to inconsistencies in the estimated MGS abundance. We use a variant of the coupon collector's problem to evaluate the probability of detecting a given number of genes in a sample, given that the species is present, and to score the performance of a gene set. This allows us to reject abundance measurements for which the number of detected genes deviates significantly from the number expected for the set. Within each sample, the expected read count per gene can be approximated by the discrete negative binomial (NB) distribution, as reads are assumed to map in proportion to gene length and to show biological variability. A rank-based negative binomial model is used to assess the performance of different gene sets across a large set of samples, facilitating the identification of an optimal signature gene set for the MGS.
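
The coupon collector idea described above can be illustrated with a small sketch. It is not the code in this repository: it assumes the simplest textbook case, in which every read mapped to the gene set hits each gene with equal probability (that is, all genes have equal length), whereas the method itself is described as a variant that accounts for gene length. The function names and the example numbers are hypothetical.

```python
def detected_gene_distribution(n_genes, n_reads):
    """Return a list p where p[k] = P(exactly k distinct genes are hit)
    after n_reads reads, each landing uniformly on one of n_genes genes.

    Assumption for illustration only: all genes are equally likely targets.
    """
    probs = [0.0] * (n_genes + 1)
    probs[0] = 1.0  # before any read, no gene has been detected
    for _ in range(n_reads):
        nxt = [0.0] * (n_genes + 1)
        for k, p in enumerate(probs):
            if p == 0.0:
                continue
            # The read hits one of the k genes that were already detected.
            nxt[k] += p * k / n_genes
            # The read hits one of the (n_genes - k) genes not yet detected.
            if k < n_genes:
                nxt[k + 1] += p * (n_genes - k) / n_genes
        probs = nxt
    return probs


def expected_detected(n_genes, n_reads):
    """Closed-form expected number of distinct genes detected."""
    return n_genes * (1.0 - (1.0 - 1.0 / n_genes) ** n_reads)


def lower_tail(probs, k_observed):
    """P(K <= k_observed): how surprising it is to see this few genes."""
    return sum(probs[: k_observed + 1])


if __name__ == "__main__":
    # Hypothetical sample: 200 reads mapped to a 100-gene set, 60 genes seen.
    dist = detected_gene_distribution(100, 200)
    print("expected genes detected:", expected_detected(100, 200))
    print("P(K <= 60):", lower_tail(dist, 60))
```

A sample whose lower-tail probability falls below some chosen cutoff would be a candidate for rejection under this simplified model; the cutoff and the equal-probability assumption are illustrative choices, not values taken from the method.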
