👋 Hey! :D 👋
SampleQC
is an R package for robust multivariate, multi-celltype, multi-sample quality control for single cell RNA-seq. QC is typically done by looking at univariate measures of experimental quality for each cell (such as library size, number of features observed, mitochondrial proportion), and identifying cells that are outliers in some way (e.g. high mitochondrial proportion or small library size).
The main flaw in standard QC approaches that SampleQC
seeks to correct is that unimodal outlier detection implicitly assumes that the sample consists of only one celltype, introducing biases in the cells excluded. It can preferentially exclude whole cell types with extreme QC metrics (e.g. small library sizes; naturally high mitochondrial proportions), even though these cells may be perfectly healthy. Related to this, identifying cell outliers within each sample individually means that if a celltype is present in only a small proportion in a given sample, it may be identified as an outlier and excluded, although it is healthy.
Our method SampleQC
addresses these problems by robustly fitting a multivariate Gaussian mixture model, across multiple samples simultaneously. It is intended for large, complex datasets with many samples, although should also work on smaller datasets. By default it uses log counts, log feature-counts and mitochondrial proportions, although users can select which QC metrics they wish to use (including e.g. CITE-seq QC metrics).
To use this development version of the package, run the following lines in R:
# preliminaries
install.packages("devtools")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SingleCellExperiment")
BiocManager::install("BiocStyle")
# install this repo
devtools::install_github('wmacnair/SampleQC')
# set up libraries
library('SampleQC')
SampleQC
can be broken down into two main parts:
SampleQC
first calculates dissimilarities between sample distributions, and uses these to identify 'sample groups', i.e. groups of samples with similar distributions. This also provides embeddings of the samples, allowing users to check for batch effects.- Within each sample group,
SampleQC
fits a statistical model to identify cells that are potential outliers, based on how unlikely they are under the statistical model.
SampleQC
starts from a data.frame
containing the QC metrics for all your cells, qc_df. This data.frame
is easily obtainable from whichever object contains your single cell data. (We also considered writing functions that take single cell container objects as inputs, such as SingleCellExperiment
or Seurat
objects. However, using a data.frame
is more flexibile, and future-proofs SampleQC
against new types of data and changes to other packages.)
This data.frame
should contain the following columns:
cell_id
, a unique identifier for each cellsample_id
, identifier for experimental sample- at least one QC metric (user's choice, e.g. log_counts, log_feats, and logit_mito)
- optionally, some additional annotations of the samples, e.g. batch_id, patient_id, condition
The optional annotations are for checking whether there are differences in sample-level statistics that are consistent across some groups, e.g. log_counts is consistently low in a particular batch.
You then run a quick preparation function. For SingleCellExperiment
objects, it looks like this:
# calculate QC metrics with `scater`
sce = scater::addPerCellQC(sce,
subsets = list(mito = grep("MT-", rownames(sce))))
# call prep function
qc_dt = make_qc_dt(colData(sce),
sample_var = 'sample_id',
qc_names = c('log_counts', 'log_feats', 'logit_mito'),
annot_vars = 'condition'
)
For Seurat
objects, it looks like this:
# add mito QC metrics
seu[["percent.mt"]] = PercentageFeatureSet(seu, pattern = "^MT-")
# call prep function
qc_dt = make_qc_dt(seu@meta.data,
sample_var = 'sample_id',
qc_names = c('log_counts', 'log_feats', 'logit_mito'),
annot_vars = 'condition'
)
An important point about the QC metrics you use: SampleQC
uses a mixture of Gaussian distributions to identify cell outliers. This means that the model will fit better for metrics that are more approximately normal. Therefore, metrics like proportions (percent.mt) should be transformed first, e.g. via the inverse logistic function. For the variable percent.mt, SampleQC
does this transformation automatically, into the variable logit_mito in the qc_dt object.
We first define some parameters that we want to use.
# which QC metrics do we want to use?
qc_names = c('log_counts', 'log_feats', 'logit_mito')
# which discrete-valued variables do we want to annotate the samples with?
annots_disc = c('well_id', 'patient_id', 'condition')
# which continuous-valued variables do we want to annotate the samples with?
annots_cont = NULL
SampleQC
generates some additional automatic annotations, such as median mitochondrial proportion by sample, sample size, and some others.
We then calculate distances between all the samples, and embed this matrix via dimensionality reduction options. SampleQC
stores everything neatly in a SingleCellExperiment
object (mainly in the colData
entry).
qc_obj = calc_pairwise_mmds(qc_dt, qc_names,
annots_disc = annots_disc, annots_cont = annots_cont, n_cores = 4)
table(colData(qc_obj)$group_id)
Next we fit Gaussian mixture models, either one to each of the sample groupings that we found, or to the whole dataset. The user needs to specify how many clusters to fit in each group of samples. The quickest way to do this is to start with K=1
for each cluster, plot the results, and then inspect the outputs to find which value is best for each cluster. Fitting to real data can require a couple of user iterations, for example trying multiple different values of K.
To fit to each of the sample groupings individually, you use the parameter K_list
. We recommend first specifying 1 component for each group and rendering a report: K=1
is extremely quick to fit, and the diagnostic plots in the rendering allow you to check the appropriate number of components for each sample group.
qc_obj = fit_sampleqc(qc_obj, K_list=rep(1, get_n_groups(qc_obj)))
Once the model has fit, you can render an html report and check whether it makes sense.
# define you project name and where you want to save these reports
proj_name = 'my_project'
save_dir = '/home/work/my_project/qc/'
dir.create(save_dir)
# render the report
make_sampleqc_report(qc_obj, save_dir, proj_name)
This allows you to check whether the number of components for each group looks correct. If not, you can rerun with a different specification of K_list
:
qc_obj = fit_sampleqc(qc_obj, K_list=c(2,3,2,2))
To fit one model to the whole of the dataset, you use the parameter K_all
. You might want to do this when you have relatively few samples (e.g. 30 or fewer) and / or when your samples have very similar distributions of QC metrics.
qc_obj = fit_sampleqc(qc_obj, K_all=2)
Once you're happy that the model is identifying outliers correctly, you can extract the outliers found by SampleQC
by the following:
outliers_dt = get_outliers(qc_obj)
There are a couple of ways of tweaking exactly how you get your outliers using the get_outliers
function. If the plots in the section Checking for outliers and QC batches via MMD of the SampleQC
report indicate that one group of samples is universally of bad quality, then you can choose to exclude all of those samples via the argument exc_groups
. If there are groups of cells with sufficient numbers to be modelled as a group by SampleQC
, but you want to exclude them (for example, a large cluster of cells with extremely high mitochondrial proportions), then you can specify clusters to be excluded with the argument exc_clusters
. You can see more details by calling ?get_outliers
.
To replicate the analyses, in the preprint, please go to the companion GitHub repo here.
Please add anything you like to the Issues page. All feedback enthusiastically received!
Cheers
Will
SampleQC
requires R >= 4.0.0
.