This repository holds code for analysis of single-cell RNA-sequencing data from Drosophila olfactory projection neurons.
ICIM code is in the Python module sct.
A simple example of usage is analysis/ICIM_Example.ipynb.
It boils down to:
    import sct
    myICIM = sct.ICIM(X, df)
    myICIM.calc()
    marker_genes = myICIM.get_all_markers()
Code for preprocessing sequence data (mapping reads to the genome and counting reads mapping to genes) is contained in the pipeline directory.
Code for visualization and analysis, including reproducing figures shown in the paper, is in the analysis directory.
Preprocessed sequence data and various intermediate files used during analysis are in the data directory.
Raw sequence data can be obtained from the NCBI Gene Expression Omnibus (accession GSE100058), which links to the reads deposited in the Sequence Read Archive.
ICIM is an unsupervised machine-learning algorithm that we introduced to identify genes that are informative for separating cell types in single-cell RNA-seq data.
Input: reduced count matrix X (gene x cell) and full count matrix df (gene x cell). The reduced count matrix may optionally be prefiltered to remove genes from consideration by ICIM. In practice, we use log-transformed counts.
Output: list of genes that distinguish populations.
You can use the list of genes for further dimensionality reduction and clustering.
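As a rough illustration of that downstream step, the sketch below log-transforms a toy gene x cell count matrix, keeps only the rows for a list of marker genes, and projects the cells onto principal components with a plain SVD. It is not part of sct: the helper names (log_transform, pca_on_markers), the log2(count + 1) transform, the toy gene names, and the use of a bare NumPy array with a separate list of gene names are all assumptions made for the example. In real use, marker_genes would come from get_all_markers().

```python
import numpy as np


def log_transform(counts):
    """Return log2(count + 1); the pseudocount of 1 is an assumption."""
    return np.log2(np.asarray(counts, dtype=float) + 1.0)


def pca_on_markers(values, gene_names, marker_genes, n_components=2):
    """Project cells onto principal components using only marker genes.

    values: gene x cell array; gene_names: row labels of values.
    Returns a cell x n_components array of coordinates.
    """
    row_of = {gene: i for i, gene in enumerate(gene_names)}
    rows = [row_of[gene] for gene in marker_genes]
    cells_by_genes = values[rows, :].T           # cells x marker genes
    centered = cells_by_genes - cells_by_genes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data: 4 hypothetical genes (rows) measured in 5 cells (columns).
gene_names = ["geneA", "geneB", "geneC", "geneD"]
counts = np.array([
    [0, 5, 0, 7, 1],
    [3, 0, 4, 0, 2],
    [1, 1, 1, 1, 1],
    [0, 9, 1, 8, 0],
])
logged = log_transform(counts)

# Stand-in for the list returned by get_all_markers().
marker_genes = ["geneA", "geneB", "geneD"]
coords = pca_on_markers(logged, gene_names, marker_genes)
```

The resulting coordinates have one row per cell and can be passed to any clustering or embedding method.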
Adjustable parameters are:
Related to filtering for informative genes for cell type identification:
- N = number of overdispersed genes considered at each step
- correlation_cutoff = minimum Pearson correlation for identifying correlated genes
- min_hits = minimum number of correlated genes required to keep a gene
- exclude_max = number of highest-expressing cells excluded when calculating correlation (for robustness to outliers)
- dropout_rate_low = minimum fraction of cells in which a gene must be undetected for the gene to be kept
- dropout_rate_high = maximum fraction of cells in which a gene may be undetected for the gene to be kept
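To make the filtering parameters more concrete, here is a minimal NumPy sketch of one way to read them. It is an interpretation of the descriptions above, not the code in sct: the function names are hypothetical, "absent" is taken to mean a zero count, and the outlier-robust correlation is assumed to drop the exclude_max cells with the highest expression of the first gene before computing Pearson correlation.

```python
import numpy as np


def dropout_mask(counts, dropout_rate_low, dropout_rate_high):
    """Boolean mask of genes whose dropout rate lies within the bounds.

    counts is gene x cell; the dropout rate of a gene is taken to be
    the fraction of cells with a zero count for that gene.
    """
    rate = (np.asarray(counts) == 0).mean(axis=1)
    return (rate >= dropout_rate_low) & (rate <= dropout_rate_high)


def robust_pearson(x, y, exclude_max):
    """Pearson correlation after dropping the exclude_max cells with highest x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    keep = order[:len(x) - exclude_max]          # drop the top-expressing cells
    return np.corrcoef(x[keep], y[keep])[0, 1]


def enough_hits_mask(X, correlation_cutoff, min_hits):
    """Boolean mask of genes correlated with at least min_hits other genes."""
    corr = np.corrcoef(X)                        # gene x gene Pearson matrix
    np.fill_diagonal(corr, 0.0)                  # ignore self-correlation
    hits = (corr >= correlation_cutoff).sum(axis=1)
    return hits >= min_hits


# Toy checks on small arrays.
counts = np.array([[0, 0, 0, 0], [1, 0, 2, 0], [1, 2, 3, 4]])
kept_by_dropout = dropout_mask(counts, 0.2, 0.9)

# A single outlier cell no longer dominates once it is excluded.
r_robust = robust_pearson([1, 2, 3, 4, 100], [1, 2, 3, 4, 0], exclude_max=1)

X = np.array([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]], dtype=float)
kept_by_hits = enough_hits_mask(X, correlation_cutoff=0.9, min_hits=1)
```

In the toy data, only the gene absent from half of the cells passes the dropout filter, and only the two perfectly correlated genes pass the hit filter.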
Related to termination:
- stop_condition = termination condition, either "linkage_dist" or "num_cells"
- N_stop = number of cells in a subpopulation below which iteration stops, ignored unless stop_condition = "num_cells"
- linkage_dist_stop = linkage distance above which iteration stops, ignored unless stop_condition = "linkage_dist"
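The interplay of the three termination parameters can be summarized in a small helper. This is only a sketch of the semantics described above, with a hypothetical function name and signature; it does not reproduce how sct computes the linkage distance or splits subpopulations.

```python
def should_stop(stop_condition, num_cells, linkage_dist, N_stop, linkage_dist_stop):
    """Return True if iteration on a subpopulation should stop.

    Only the threshold matching stop_condition is consulted; the other
    one is ignored, mirroring the parameter descriptions.
    """
    if stop_condition == "num_cells":
        return num_cells < N_stop
    if stop_condition == "linkage_dist":
        return linkage_dist > linkage_dist_stop
    raise ValueError(
        'stop_condition must be "linkage_dist" or "num_cells", got %r' % (stop_condition,)
    )


# Illustrative values only; they are not defaults taken from sct.
stop_small = should_stop("num_cells", num_cells=30, linkage_dist=0.1,
                         N_stop=50, linkage_dist_stop=0.2)
stop_far = should_stop("linkage_dist", num_cells=30, linkage_dist=0.1,
                       N_stop=50, linkage_dist_stop=0.2)
```

With these values the subpopulation is small enough to stop under "num_cells", while under "linkage_dist" iteration continues because the distance has not exceeded the threshold.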