Skip to content

Commit

Permalink
Identification of batch effects and number of clusters
Browse files Browse the repository at this point in the history
  • Loading branch information
Mahdi Zamanighomi committed Jul 31, 2018
1 parent 2627084 commit bd682be
Show file tree
Hide file tree
Showing 2 changed files with 613 additions and 0 deletions.
204 changes: 204 additions & 0 deletions vignettes/BatchEffect_NumberOfCluster.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,204 @@
---
title: "Identification of batch effect and number of clusters when mutiple plates are available"
author: "Mahdi Zamanighomi"
date: "July 31, 2018"
output: html_document
---

Here we will walk through an example analysis of batch effect that requires mutiple scATAC-seq plates. We will also explain a new procedure to identify the number of clusters in addition to Gap Statistic. To better visualize clustering results, we will provide a t-SNE plot of single cells that utilizes cluster specific peaks (also named differential peaks).

# Data availability and pre-processing

We used RA-induced scATAC-seq data from Zamanighomi et al. Nature Communications (2018) and Duren et al. PNAS (2018). This data can be downloaded from GEO under accession numbers GSE107651 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107651) and GSE115970 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115970). To obtain bam files, we followed Duren's procedures to align scATAC-seq data to mm9 genome and remove duplicates.

We next used samtools to merge the resulting bam files.
```{r engine='bash', eval=FALSE}
samtools merge [AGGREGATE_BAM] *.bam
```

The merged bam file was employed as input into MACS2 to call for narrow peaks, according to the following code.
```{r engine='bash', eval=FALSE}
macs2 callpeak -t [AGGREGATE_BAM] -f BAMPE -n [PEAK_FILE_NAME] -p 0.1
```

# Obtaining and filtering counts

We obtained the ForeGroundMatrix and BackGroundMatrix matrices as explained in the ExampleWorkflow.Rmd of scABC's vignette. This data was not uploaded into GitHub due to the file size limit, but can be easily reproduced following the below code. We will use these matrices to perform the remaining analysis.

```{r}
# always run
setwd("~/Documents/scABC/vignettes/")
library(devtools)
devtools::install_github("timydaley/scABC", force = TRUE)
library(scABC)
```

```{r cache=TRUE}
# not evaluated, preprocessing
bamfile.table = read.table(file = "~/Documents/RAinduction/Samples.txt") # sample names (column 1) with plate numbers (column 2)
bamfiles = paste0("~/Documents/RAinduction/bams/", bamfile.table[,1], "_001.trim.sort.nuc.uniq.rmdup.bam") # bam files
peaks = selectPeaks("~/Documents/RAinduction/Peaks.narrowPeak") # MACS2 narrow peaks
ForeGround = getCountsMatrix2(bamfiles, peaks) # peak read counts across cells. getCountsMatrix2 is a faster version of getCountsMatrix
ForeGroundFiltered = filterPeaks(ForeGround$ForeGroundMatrix, peaks,1,10) # keep peaks with at last 1 read in minimum 10 cells
peaks = ForeGroundFiltered$peaks # filtered peaks
BackGround = getBackground2(bamfiles, peaks) # read counts for peaks background. getBackground2 is a faster version of getBackground
ForeGroundBackGroundFiltered = filterSamples(ForeGround = ForeGroundFiltered$ForeGroundMatrix,
BackGround = BackGround$BackGroundMatrix) # filter extremely shallow sequenced cells
ForeGroundMatrix = ForeGroundBackGroundFiltered$ForeGroundMatrix # final peak counts
BackGroundMatrix = ForeGroundBackGroundFiltered$BackGroundMatrix # final background counts
```

# Detailed workflow

Let's first use Gap Statistic to determine the number of clusters.

```{r cache=TRUE}
nClusters = 1:10
BackGroundMedian = apply(BackGroundMatrix, 2, median)
GapStat = getGapStat(ForeGroundMatrix, BackGroundMedian,
nClusters=nClusters, quiet = TRUE)
GapStat$nClusterOptimal
plotGapStat(GapStat, nClusters = nClusters, main = "RAinduction_Day4")
```

Gap Statistic resulted in just 1 cluster and can underestimate the number of clusters due to the high similarity between clusters. We emphasize that Gap Statistic performs well when clusters are distinct. In the case of RA-treated cells however, we need to identify developmental stages that are not distinct. To resolve this issue, we recommend to increase the number of clusters to various numbers (e.g. K = 5, 10, 15, and 20) and consider clusters that have cells from the majority of paltes. If a cluster is mainly specific to one plate, we may suffer from batch effect and thus should eliminate such clusters from the analysis. See below for the detailed analysis.

```{r cache=TRUE}
# let's compute the landmarks using 4 clusters.
LandMarks = computeLandmarks(ForeGround = ForeGroundMatrix,
BackGround = BackGroundMatrix,
nCluster = 4, nTop = 10000)
LandMarkAssignments = assign2landmarks(ForeGroundMatrix, LandMarks)
# assigning landmarks to cells
Cell2LandmarkCorrelation = cbind(apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,1], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,2], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,3], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,4], method = 'spearman')))
# plate info
cell.plate = bamfile.table[match(colnames(ForeGroundMatrix),paste0("~/Documents/RAinduction/bams/",bamfile.table[,1], "_001.trim.sort.nuc.uniq.rmdup.bam")), 2]
# colors
library(gplots)
library(RColorBrewer)
library(devtools)
source_url("https://raw.githubusercontent.com/obigriffith/biostar-tutorials/master/Heatmaps/heatmap.3.R")
scalered <- colorRampPalette(c("white", "red"), space = "rgb")(256)
# heatmap colors
index_cluster = order(LandMarkAssignments) # order cells according to cluster assignments
rcols1 = brewer.pal(6, "Accent")[1:6]
rowcols1 = rcols1[cell.plate[index_cluster]]
rcols2 = brewer.pal(4, "Dark2")[1:4]
rowcols2 = rcols2[LandMarkAssignments[index_cluster]]
rowcols = rbind(rowcols1, rowcols2)
rownames(rowcols) = c("plate", "cluster")
# normalize each row in the landmark correlation matrix for a better visualization
Cell2LandmarkCorrelationNormalized = rowMeans(abs(Cell2LandmarkCorrelation[index_cluster,]))
Cell2LandmarkCorrelationNormalized = Cell2LandmarkCorrelation[index_cluster,]/Cell2LandmarkCorrelationNormalized
Cell2LandmarkCorrelationNormalized[Cell2LandmarkCorrelationNormalized<0] <- 0
# plot heatmap
heatmap.3(Cell2LandmarkCorrelationNormalized, dendrogram='none', Rowv=FALSE, Colv=FALSE,
trace='none', col = scalered, margin = c(5, 5), density.info = "none",
RowSideColors = rowcols, RowSideColorsSize=2, symm=F,symkey=F,
symbreaks=F, scale="none")
legend("bottomleft", legend = c(paste0("plate ", c(1,2,3,4,5,6)), paste0("cluster ", 1:4)), col = c(rcols1, rcols2), border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 15)
```

Here, we observe three similar clusters that are not plate-specific. On the other hand, cluster 4 is specific to one plate and should be removed from the analysis (batch effect). To ensure that we have three clusters, we will increase the number of clusters to a larger value as follows.

```{r}
# let's increase the number of clusters to ensure that we can at most define three medoids.
LandMarks_test = computeLandmarks(ForeGround = ForeGroundMatrix,
BackGround = BackGroundMatrix,
nCluster = 6, nTop = 10000)
LandMarkAssignments_test = assign2landmarks(ForeGroundMatrix, LandMarks_test)
# assigning landmarks to cells
Cell2LandmarkCorrelation_test = cbind(apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,1], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,2], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,3], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,4], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,5], method = 'spearman')),
apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,6], method = 'spearman')))
# heatmap colors
index_cluster_test = order(LandMarkAssignments_test) # order cells according to cluster assignments
rcols1 = brewer.pal(6, "Accent")[1:6]
rowcols1 = rcols1[cell.plate[index_cluster_test]]
rcols2 = brewer.pal(6, "Dark2")[1:6]
rowcols2 = rcols2[LandMarkAssignments_test[index_cluster_test]]
rowcols = rbind(rowcols1, rowcols2)
rownames(rowcols) = c("plate", "cluster")
# normalize each row in the landmark correlation matrix for a better visualization
Cell2LandmarkCorrelationNormalized_test = rowMeans(abs(Cell2LandmarkCorrelation_test[index_cluster_test,]))
Cell2LandmarkCorrelationNormalized_test = Cell2LandmarkCorrelation_test[index_cluster_test,]/Cell2LandmarkCorrelationNormalized_test
Cell2LandmarkCorrelationNormalized_test[Cell2LandmarkCorrelationNormalized_test<0] <- 0
# plot heatmap
heatmap.3(Cell2LandmarkCorrelationNormalized_test, dendrogram='none', Rowv=FALSE, Colv=FALSE,
trace='none', col = scalered, margin = c(5, 5), density.info = "none",
RowSideColors = rowcols, RowSideColorsSize=2, symm=F,symkey=F,
symbreaks=F, scale="none")
legend("bottomleft", legend = c(paste0("plate ", c(1,2,3,4,5,6)), paste0("cluster ", 1:6)), col = c(rcols1, rcols2), border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 15)
```

Increasing the number of clusters led to the similar three clusters when K was set to 4. Other clusters are jsut specific to a plate, presenting batch effect.

We emphasize that when K=3, we exactly arrive at the same three clusers with no batch effect. However in some cases, it is possible to first observe plate-specific clusters and then detect real clusters when K becomes larger. We therefore recommend to always increase K to ensure that scABC finds all distinct medoids free of batch effects.

To identify differential peaks, we first need to remove plate-specific clusters. We can use all clustering results with different K as long as plate-specific clusters are removed. For instance, we will use K = 4.

```{r cache=TRUE}
# let's use clustering results for K=4 and remove cluster 4 that was plate-specific.
index_keep = which(!LandMarkAssignments==4)
# cluster specific peaks
PeakSelection = getClusterSpecificPvalue(ForeGround=ForeGroundMatrix[,index_keep], cluster_assignments = LandMarkAssignments[index_keep], background_medians = BackGroundMedian[index_keep])
PeakPvals = PeakSelection$pvalue
# the top 1000 cluster specific peaks for each cluster.
Diff_peaks = apply(PeakPvals, 2, function(x) order(x))
Diff_peaks_union = union(Diff_peaks[1:1000,1],union(Diff_peaks[1:1000,2],Diff_peaks[1:1000,3]))
x = t(ForeGroundMatrix[Diff_peaks_union, index_keep])
d = as.dist(1 - cor(x, method = "spearman"))
col.clus = hclust(d, method ="complete")
x[which(x > 5)] = 5
# heatmap colors
index_cluster = order(LandMarkAssignments[index_keep]) # order cells according to cluster assignments
rcols1 = brewer.pal(6, "Accent")[1:6]
rowcols1 = rcols1[cell.plate[index_keep][index_cluster]]
rcols2 = brewer.pal(3, "Dark2")[1:3]
rowcols2 = rcols2[LandMarkAssignments[index_keep][index_cluster]]
rowcols = rbind(rowcols1, rowcols2)
rownames(rowcols) = c("plate", "cluster")
# plot heatmap
heatmap.3(x[index_cluster,], dendrogram='none', Rowv=FALSE, Colv=col.clus, trace='none', col = scalered,
margin = c(5, 5), density.info = "none", RowSideColors = rowcols,
RowSideColorsSize=2, main = "Cluster specific peaks")
legend("bottomleft", legend = c(paste0("plate ", c(1,2,3,4,5,6)), paste0("cluster ", 1:3)), col = c(rcols1, rcols2), border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 15)
```

Now, we plot the t-SNE of clustering results using differential peaks.

```{r cache=TRUE}
# t-SNE for clustering results using cluster specefic peaks
library(Rtsne)
d_tSNE = as.dist(1 - cor(ForeGroundMatrix[Diff_peaks_union, index_keep], method = "spearman"))
tsne <- Rtsne(d_tSNE, is_distance = TRUE, check_duplicates = FALSE, pca = FALSE, perplexity=40, theta=0.5, dims=2)
# t-SNE colors
rcols = brewer.pal(3, "Dark2")[1:3]
rowcols = rcols[LandMarkAssignments[index_keep]]
# plot t-SNE
plot(tsne$Y,col=rowcols,type = "p", pch = 16)
legend("bottomleft", legend = c(paste0("cluster ", 1:3)), col=rcols, border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 16)
```

As a final note, we observe that clusters 1 and 3 are slightly mixed in the t-SNE plot. We should not consider these mixed cells as an incorrect clustering result since the separation between cells cannot be always captured in two dimensions.
409 changes: 409 additions & 0 deletions vignettes/BatchEffect_NumberOfCluster.html

Large diffs are not rendered by default.

0 comments on commit bd682be

Please sign in to comment.