Identification of batch effects and number of clusters

SUwonglab · Jul 31, 2018 · bd682be · bd682be
1 parent 2627084
commit bd682be
Show file tree

Hide file tree

Showing 2 changed files with 613 additions and 0 deletions.
diff --git a/vignettes/BatchEffect_NumberOfCluster.Rmd b/vignettes/BatchEffect_NumberOfCluster.Rmd
@@ -0,0 +1,204 @@
+---
+title: "Identification of batch effect and number of clusters when mutiple plates are available"
+author: "Mahdi Zamanighomi"
+date: "July 31, 2018"
+output: html_document
+---
+
+Here we will walk through an example analysis of batch effect that requires mutiple scATAC-seq plates. We will also explain a new procedure to identify the number of clusters in addition to Gap Statistic. To better visualize clustering results, we will provide a t-SNE plot of single cells that utilizes cluster specific peaks (also named differential peaks).
+
+# Data availability and pre-processing
+
+We used RA-induced scATAC-seq data from Zamanighomi et al. Nature Communications (2018) and Duren et al. PNAS (2018). This data can be downloaded from GEO under accession numbers GSE107651 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107651) and GSE115970 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115970). To obtain bam files, we followed Duren's procedures to align scATAC-seq data to mm9 genome and remove duplicates.
+
+We next used samtools to merge the resulting bam files.
+```{r engine='bash', eval=FALSE}
+samtools merge [AGGREGATE_BAM] *.bam
+```
+
+The merged bam file was employed as input into MACS2 to call for narrow peaks, according to the following code.
+```{r engine='bash', eval=FALSE}
+macs2 callpeak -t [AGGREGATE_BAM] -f BAMPE -n [PEAK_FILE_NAME] -p 0.1
+```
+
+# Obtaining and filtering counts
+
+We obtained the ForeGroundMatrix and BackGroundMatrix matrices as explained in the ExampleWorkflow.Rmd of scABC's vignette. This data was not uploaded into GitHub due to the file size limit, but can be easily reproduced following the below code. We will use these matrices to perform the remaining analysis.
+
+```{r}
+# always run
+setwd("~/Documents/scABC/vignettes/")
+library(devtools)
+devtools::install_github("timydaley/scABC", force = TRUE)
+library(scABC)
+```
+
+```{r cache=TRUE}
+# not evaluated, preprocessing
+bamfile.table = read.table(file = "~/Documents/RAinduction/Samples.txt") # sample names (column 1) with plate numbers (column 2)
+bamfiles = paste0("~/Documents/RAinduction/bams/", bamfile.table[,1], "_001.trim.sort.nuc.uniq.rmdup.bam") # bam files
+peaks = selectPeaks("~/Documents/RAinduction/Peaks.narrowPeak") # MACS2 narrow peaks
+ForeGround = getCountsMatrix2(bamfiles, peaks) # peak read counts across cells. getCountsMatrix2 is a faster version of getCountsMatrix
+ForeGroundFiltered = filterPeaks(ForeGround$ForeGroundMatrix, peaks,1,10) # keep peaks with at last 1 read in minimum 10 cells
+peaks = ForeGroundFiltered$peaks # filtered peaks
+BackGround = getBackground2(bamfiles, peaks) # read counts for peaks background. getBackground2 is a faster version of getBackground
+ForeGroundBackGroundFiltered = filterSamples(ForeGround = ForeGroundFiltered$ForeGroundMatrix, 
+ BackGround = BackGround$BackGroundMatrix) # filter extremely shallow sequenced cells
+ForeGroundMatrix = ForeGroundBackGroundFiltered$ForeGroundMatrix # final peak counts
+BackGroundMatrix = ForeGroundBackGroundFiltered$BackGroundMatrix # final background counts
+```
+
+# Detailed workflow
+
+Let's first use Gap Statistic to determine the number of clusters.
+
+```{r cache=TRUE}
+nClusters = 1:10
+BackGroundMedian = apply(BackGroundMatrix, 2, median)
+GapStat = getGapStat(ForeGroundMatrix, BackGroundMedian, 
+ nClusters=nClusters, quiet = TRUE)
+GapStat$nClusterOptimal
+plotGapStat(GapStat, nClusters = nClusters, main = "RAinduction_Day4")
+```
+
+Gap Statistic resulted in just 1 cluster and can underestimate the number of clusters due to the high similarity between clusters. We emphasize that Gap Statistic performs well when clusters are distinct. In the case of RA-treated cells however, we need to identify developmental stages that are not distinct. To resolve this issue, we recommend to increase the number of clusters to various numbers (e.g. K = 5, 10, 15, and 20) and consider clusters that have cells from the majority of paltes. If a cluster is mainly specific to one plate, we may suffer from batch effect and thus should eliminate such clusters from the analysis. See below for the detailed analysis.
+
+```{r cache=TRUE}
+# let's compute the landmarks using 4 clusters.
+LandMarks = computeLandmarks(ForeGround = ForeGroundMatrix, 
+ BackGround = BackGroundMatrix, 
+ nCluster = 4, nTop = 10000)
+LandMarkAssignments = assign2landmarks(ForeGroundMatrix, LandMarks)
+
+# assigning landmarks to cells
+Cell2LandmarkCorrelation = cbind(apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,1], method = 'spearman')), 
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,2], method = 'spearman')),
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,3], method = 'spearman')),
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks[,4], method = 'spearman')))
+# plate info
+cell.plate = bamfile.table[match(colnames(ForeGroundMatrix),paste0("~/Documents/RAinduction/bams/",bamfile.table[,1], "_001.trim.sort.nuc.uniq.rmdup.bam")), 2]
+
+# colors
+library(gplots)
+library(RColorBrewer)
+library(devtools)
+source_url("https://raw.githubusercontent.com/obigriffith/biostar-tutorials/master/Heatmaps/heatmap.3.R")
+scalered <- colorRampPalette(c("white", "red"), space = "rgb")(256)
+
+# heatmap colors
+index_cluster = order(LandMarkAssignments) # order cells according to cluster assignments
+rcols1 = brewer.pal(6, "Accent")[1:6]
+rowcols1 = rcols1[cell.plate[index_cluster]]
+rcols2 = brewer.pal(4, "Dark2")[1:4]
+rowcols2 = rcols2[LandMarkAssignments[index_cluster]]
+rowcols = rbind(rowcols1, rowcols2)
+rownames(rowcols) = c("plate", "cluster")
+
+# normalize each row in the landmark correlation matrix for a better visualization
+Cell2LandmarkCorrelationNormalized = rowMeans(abs(Cell2LandmarkCorrelation[index_cluster,]))
+Cell2LandmarkCorrelationNormalized = Cell2LandmarkCorrelation[index_cluster,]/Cell2LandmarkCorrelationNormalized 
+Cell2LandmarkCorrelationNormalized[Cell2LandmarkCorrelationNormalized<0] <- 0
+
+# plot heatmap
+heatmap.3(Cell2LandmarkCorrelationNormalized, dendrogram='none', Rowv=FALSE, Colv=FALSE,
+ trace='none', col = scalered, margin = c(5, 5), density.info = "none", 
+ RowSideColors = rowcols, RowSideColorsSize=2, symm=F,symkey=F,
+ symbreaks=F, scale="none")
+legend("bottomleft", legend = c(paste0("plate ", c(1,2,3,4,5,6)), paste0("cluster ", 1:4)), col = c(rcols1, rcols2), border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 15)
+```
+
+Here, we observe three similar clusters that are not plate-specific. On the other hand, cluster 4 is specific to one plate and should be removed from the analysis (batch effect). To ensure that we have three clusters, we will increase the number of clusters to a larger value as follows.
+
+```{r}
+# let's increase the number of clusters to ensure that we can at most define three medoids.
+LandMarks_test = computeLandmarks(ForeGround = ForeGroundMatrix, 
+ BackGround = BackGroundMatrix, 
+ nCluster = 6, nTop = 10000)
+LandMarkAssignments_test = assign2landmarks(ForeGroundMatrix, LandMarks_test)
+
+# assigning landmarks to cells
+Cell2LandmarkCorrelation_test = cbind(apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,1], method = 'spearman')), 
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,2], method = 'spearman')),
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,3], method = 'spearman')),
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,4], method = 'spearman')),
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,5], method = 'spearman')),
+ apply(ForeGroundMatrix, 2, function(x) cor(x, LandMarks_test[,6], method = 'spearman')))
+
+# heatmap colors
+index_cluster_test = order(LandMarkAssignments_test) # order cells according to cluster assignments
+rcols1 = brewer.pal(6, "Accent")[1:6]
+rowcols1 = rcols1[cell.plate[index_cluster_test]]
+rcols2 = brewer.pal(6, "Dark2")[1:6]
+rowcols2 = rcols2[LandMarkAssignments_test[index_cluster_test]]
+rowcols = rbind(rowcols1, rowcols2)
+rownames(rowcols) = c("plate", "cluster")
+
+# normalize each row in the landmark correlation matrix for a better visualization
+Cell2LandmarkCorrelationNormalized_test = rowMeans(abs(Cell2LandmarkCorrelation_test[index_cluster_test,]))
+Cell2LandmarkCorrelationNormalized_test = Cell2LandmarkCorrelation_test[index_cluster_test,]/Cell2LandmarkCorrelationNormalized_test 
+Cell2LandmarkCorrelationNormalized_test[Cell2LandmarkCorrelationNormalized_test<0] <- 0
+
+# plot heatmap
+heatmap.3(Cell2LandmarkCorrelationNormalized_test, dendrogram='none', Rowv=FALSE, Colv=FALSE,
+ trace='none', col = scalered, margin = c(5, 5), density.info = "none", 
+ RowSideColors = rowcols, RowSideColorsSize=2, symm=F,symkey=F,
+ symbreaks=F, scale="none")
+legend("bottomleft", legend = c(paste0("plate ", c(1,2,3,4,5,6)), paste0("cluster ", 1:6)), col = c(rcols1, rcols2), border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 15)
+```
+
+Increasing the number of clusters led to the similar three clusters when K was set to 4. Other clusters are jsut specific to a plate, presenting batch effect.
+
+We emphasize that when K=3, we exactly arrive at the same three clusers with no batch effect. However in some cases, it is possible to first observe plate-specific clusters and then detect real clusters when K becomes larger. We therefore recommend to always increase K to ensure that scABC finds all distinct medoids free of batch effects.
+
+To identify differential peaks, we first need to remove plate-specific clusters. We can use all clustering results with different K as long as plate-specific clusters are removed. For instance, we will use K = 4.
+
+```{r cache=TRUE}
+# let's use clustering results for K=4 and remove cluster 4 that was plate-specific.
+index_keep = which(!LandMarkAssignments==4)
+
+# cluster specific peaks
+PeakSelection = getClusterSpecificPvalue(ForeGround=ForeGroundMatrix[,index_keep], cluster_assignments = LandMarkAssignments[index_keep], background_medians = BackGroundMedian[index_keep])
+PeakPvals = PeakSelection$pvalue
+
+# the top 1000 cluster specific peaks for each cluster.
+Diff_peaks = apply(PeakPvals, 2, function(x) order(x))
+Diff_peaks_union = union(Diff_peaks[1:1000,1],union(Diff_peaks[1:1000,2],Diff_peaks[1:1000,3]))
+x = t(ForeGroundMatrix[Diff_peaks_union, index_keep])
+d = as.dist(1 - cor(x, method = "spearman"))
+col.clus = hclust(d, method ="complete")
+x[which(x > 5)] = 5
+
+# heatmap colors
+index_cluster = order(LandMarkAssignments[index_keep]) # order cells according to cluster assignments
+rcols1 = brewer.pal(6, "Accent")[1:6]
+rowcols1 = rcols1[cell.plate[index_keep][index_cluster]]
+rcols2 = brewer.pal(3, "Dark2")[1:3]
+rowcols2 = rcols2[LandMarkAssignments[index_keep][index_cluster]]
+rowcols = rbind(rowcols1, rowcols2)
+rownames(rowcols) = c("plate", "cluster")
+
+# plot heatmap
+heatmap.3(x[index_cluster,], dendrogram='none', Rowv=FALSE, Colv=col.clus, trace='none', col = scalered, 
+ margin = c(5, 5), density.info = "none", RowSideColors = rowcols, 
+ RowSideColorsSize=2, main = "Cluster specific peaks")
+legend("bottomleft", legend = c(paste0("plate ", c(1,2,3,4,5,6)), paste0("cluster ", 1:3)), col = c(rcols1, rcols2), border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 15)
+```
+
+Now, we plot the t-SNE of clustering results using differential peaks.
+
+```{r cache=TRUE}
+# t-SNE for clustering results using cluster specefic peaks
+library(Rtsne)
+d_tSNE = as.dist(1 - cor(ForeGroundMatrix[Diff_peaks_union, index_keep], method = "spearman"))
+tsne <- Rtsne(d_tSNE, is_distance = TRUE, check_duplicates = FALSE, pca = FALSE, perplexity=40, theta=0.5, dims=2)
+
+# t-SNE colors
+rcols = brewer.pal(3, "Dark2")[1:3]
+rowcols = rcols[LandMarkAssignments[index_keep]]
+
+# plot t-SNE
+plot(tsne$Y,col=rowcols,type = "p", pch = 16)
+legend("bottomleft", legend = c(paste0("cluster ", 1:3)), col=rcols, border=FALSE, bty="n", y.intersp = 1, cex= 1, pch = 16)
+```
+
+As a final note, we observe that clusters 1 and 3 are slightly mixed in the t-SNE plot. We should not consider these mixed cells as an incorrect clustering result since the separation between cells cannot be always captured in two dimensions.
diff --git a/vignettes/BatchEffect_NumberOfCluster.html b/vignettes/BatchEffect_NumberOfCluster.html