scHPF is a tool for de novo discovery of both discrete and continuous expression patterns in single-cell RNA-sequencing (scRNA-seq). We find that scHPF’s sparse low-dimensional representations, non-negativity, and explicit modeling of variable sparsity across genes and cells produce highly interpretable factors.
Algorithmic details, benchmarking against alternative methods, and scHPF's application to a spatially sampled high-grade glioma can be found in our paper at Molecular Systems Biology.
scHPF requires Python >= 3.6 and the packages:
- numba <=0.40, >=0.35 (numba 0.41 reduces scHPF's performance. Currently working to resolve.)
- scikit-learn
- pandas
- (optional) loompy
The easiest way to setup a python environment for scHPF is with anaconda.
conda create -n schpf_p37 python=3.7 scikit-learn numba=0.40 pandas
# older anaconda
source activate schpf_p37
# XOR newer anaconda
conda activate schpf_p37
# Optional, for using loom files as input to preprocessing
pip install -U loompy
Once you have set up the environment, clone this repository and install.
git clone [email protected]:simslab/scHPF.git
cd scHPF
pip install .
scHPF's prep command intakes a molecular count matrix for an scRNA-seq experiment and formats it for training. Note that scHPF is specifically designed for scRNA-seq data with unique molecular identifiers (UMIs), and only takes integer molecular counts.
scHPF prep currently accepts two input file formats:
-
A whitespace-delimited matrix formatted like:
ENSEMBL_ID GENE_NAME UMICOUNT_CELL0 UMICOUNT_CELL1 ...
The matrix should not have a header, but may be compressed with gzip or bzip2. -
A loom file (loompy.org). The loom file must have at least one of the row attributes
Accession
orGene
, whereAccession
is an ENSEMBL id andGene
is a gene name.
We recommend restricting analysis to protein-coding genes. The -w
/--whitelist
option removes all genes in the input data that are not in a two column, tab-delimited text file of ENSEMBL gene ids and names.
Whitelists for human and mouse are provided in the resources folder.
Filtering txt input details: For whitespace-delimited UMI-count files, filtering is performed using the input matrix's ENSEMBL_ID
by default, but can be done with GENE_NAME
using the --filter-by-gene-name
flag. This is useful for data that does not include a gene id.
Filtering loom input details: For loom files, we filter the Accession
row attribute against the whitelist's ENSEMBLE_ID
if Accession
is present in the loom's row attributes, and filter the Gene
row attribute against the GENE_NAME
in the whitelist otherwise.
To preprocess genome-wide UMI counts for a typical run, use the command:
scHPF prep -i UMICOUNT_MATRIX -o OUTDIR -m 5 -w GENE_WHITELIST
As written, the command formats data for training and only includes genes that are:
- on the whitelist (eg protein coding, see resources folder) and
- that we observe transcripts of in at least 5 cells.
After running this command, OUTDIR
should contain a matrix market file, train.mtx
, and an ordered list of genes, genes.txt
. An optional prefix argument can be added, which is prepended to to the output file names.
More options and details for preprocessing can be viewed with
scHPF prep -h
scHPF's train command accepts two formats:
- Matrix Market (.mtx) files, where rows are cells, columns are genes, and values are nonzero molecular counts. Matrix market files are output by the current scHPF prep command.
- Tab-delimited COO matrix coordinates, output by the previous version of the preprocessing command. These files are essentially the same as .mtx files, except they do not have a header and are zero indexed.
To train an scHPF using data output from the prep command:
scHPF train -i TRAIN_FILE -o OUTDIR -p PREFIX -k 7 -t 5
This command performs approximate Bayesian inference on scHPF with, in this instance, seven factors and five different random initializations. scHPF will automatically select the trial with the highest log-likelihood, and save the model in the OUTDIR
in a serialized joblib file.
If you get an error like Inconsistency detected by ld.so: dl-version.c: 224: _dl_check_map_versions and are running numba 0.40.0, try downgrading to 0.39.0
More options and details for training can be viewed with
scHPF train -h
To get gene- and cell-scores in a tab-delimited file, ordered like the genes and cells in the train file and with a column for each factor:
scHPF score -m MODEL_JOBLIB -o OUTDIR -p PREFIX
To also generate a tab-delimited file of gene names, ranked by gene-score for each factor:
scHPF score -m MODEL_JOBLIB -o OUTDIR -p PREFIX -g GENE_FILE
GENE_FILE
is intended to be the gene.txt
file output by the scHPF prep command (see above), but can in theory be any tab-delimited file where the number of rows is equal to the number of genes in the scHPF model. The score command automatically uses the 1st (zero-indexed) column of GENE_FILE
(or the only column if there is only one); however, the column used can be specified with --name-col
.
Coming soon. This implementation has a scikit-learn-like interface.
@article {msb2019scHPF,
author = {Levitin, Hanna Mendes and Yuan, Jinzhou and Cheng, Yim Ling and Ruiz, Francisco JR and Bush, Erin C and Bruce, Jeffrey N and Canoll, Peter and Iavarone, Antonio and Lasorella, Anna and Blei, David M and Sims, Peter A},
title = {De novo gene signature identification from single-cell RNA-seq with hierarchical Poisson factorization},
volume = {15},
number = {2},
elocation-id = {e8557},
year = {2019},
doi = {10.15252/msb.20188557},
publisher = {EMBO Press},
URL = {https://msb.embopress.org/content/15/2/e8557},
eprint = {https://msb.embopress.org/content/15/2/e8557.full.pdf},
journal = {Molecular Systems Biology}
}
More detailed tutorials on using scHPF to analyze scRNA-seq data, such as code for enrichment analysis on factors, worked jupyter notebooks, and an example selecting the number of factors will be posted soon. In the meantime please open an issue and I will try to provide whatever help and guidance I can.
Contributions to scHPF are welcome. Please get in touch if you would like to discuss. To contribute, please fork scHPF, make your changes, and submit a pull request.