Skip to content

Cancer subtype discovery using consensus clustering approach to find pattern in expression data (RNA-seq)

Notifications You must be signed in to change notification settings

Akmazad/Unsupervised_Clustering_for_Gene_Expression_data

 
 

Repository files navigation

Unsupervised clustering of gene expression data (RNA-seq) for cancer subtype discovery

Unsupervised class discovery is a data mining method to identify unknown possible groups (clusters) of items solely based on intrinsic features and no external variables. Basically clustering includes four steps:

1) Data preparation,

2) Dissimilarity matrix calculation,

3) applying clustering algorithms,

4) Assessing cluster assignment

I use a dataset coming from an RNA-seq experiment on 476 patients with non-muscle invasive bladder cancer. To make this tutorial reproducible , clinical data can be downloaded from this repo and normalized RNA-seq data could be obtained through this link . So if you are willing to use these materials, start your analysis from feature selection section onward.


1) Data preparation

In this step we need to filter out incomplete cases and low expressed genes, then transforming/normalizing gene expression values. Usually analysis would start from a raw count matrix comming from an RNA-seq experiment. I keep those genes that have expression in 10% of samples with a count of 10 or higher. All missing values must be removed or one need to impute their value if want to keep them. In my dataset there is no missing value. For transformation/normalization I use variance stabilizing transformation (VST) implemented as vst function from DESeq2 packages which at the same time will normalize the raw count also. Using other type of transformation like Z score, log transformation are also quiet commmon. If the expression matrix contains estimated transcript counts (like RSEM) or counts normalized for sequencing depth (like FPKM) , log2 transformation of the values would prepare the data for clustering. See the following examples:

  • Using log2(RSEM) gene expression value and removing genes with NA values more than 10% across samples. Then selecting top 25% most-varying genes by standard deviation of gene expression across samples (A. Gordon Robertson et al., Cell,2017).

  • Using FPKM matrix and keeping genes with (log2(FPKM+1)>2 in at least 10% of samples and selecting a subsets (2K, 4K, 6K ) of MAD ranked genes to identify stable classes (Jakob Hedegaard et al, Cancer Cell, 2016).

Feature selection: Because there are a large number of features (gene) in expression matrix , a feature selection step should be done to limit the analysis to those genes that possibly explain variation between samples in the cohort. So it is recommended to do select features for clustering. I select 2k, 4k and 6k top genes based on median absolute deviation (MAD) . A number of other methods like "feature selection based on the most variance", "feature dimension reduction and extraction based on Principal Component Analysis (PCA)", and "feature selection based on Cox regression model" could be applied, as well. To read more see the bioconductor package CancerSubtypes manual.

#_________________________________reading data and processing_________________________________#
# reading count data
rna <- read.table("Uromol1_CountData.v1.csv", header = T, sep = ",", row.names = 1)
head(rna[1:5, 1:5], 5)
#                   U0001 U0002 U0006 U0007 U0010
#ENSG00000000003.13  1458   228  1800  3945   293
#ENSG00000000005.5      0     0     9     0     0
#ENSG00000000419.11   594    23   792  1378   139
#ENSG00000000457.12   548    22  1029   976   148
#ENSG00000000460.15    53     2   190   136    47

dim(rna)
# [1] 60483   476  this is a typical output from hts-seq count matrix with more than 60,000 genes

# dissecting dataset based on gene-type
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- getBM(attributes= c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), 
               mart= mart)

# see gene types returned by biomaRt
data.frame(table(genes$gene_biotype))[order(- data.frame(table(genes$gene_biotype))[,2]), ]

# we will continue with protein coding here:
rna <- rna[substr(rownames(rna),1,15) %in% genes$ensembl_gene_id[genes$gene_biotype == "protein_coding"],]

dim(rna)
#[1] 19581   476 These are protein coding genes

# reading associated clinical data
clinical.exp <- read.table("uromol_clinic.csv", sep = ",", header = T, row.names = 1)
head(clinical.exp[1:5,1:5], 5)
#      UniqueID   CLASS  BASE47    CIS X12.gene.signature
#U0603    U0603 luminal luminal no-CIS          high_risk
#U0497    U0497 luminal   basal no-CIS           low_risk
#U0839    U0839 luminal luminal    CIS           low_risk
#U1043    U1043 luminal luminal no-CIS          high_risk
#U0566    U0566 luminal   basal no-CIS           low_risk

# making sure about sample order in rna and clincal.exp dataset
all(rownames(clinical.exp) %in% colnames(rna))
#[1] TRUE
all(rownames(clinical.exp) == colnames(rna))
#[1] FALSE
# reordering rna dataset
rna <- rna[, rownames(clinical.exp)]
#______________________ Data tranformation & Normalization ______________________________#
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = rna,
                              colData = clinical.exp,
                              design = ~ 1) # 1 passed to the function because of no model
# pre-filteration, however while using DESeq2 package it is not necessary, because the function automatically will filter out low count genes
# Keeping genes with expression in 10% of samples with a count of 10 or higher
keep <- rowSums(counts(dds) >= 10) >= round(ncol(rna)*0.1)
dds <- dds[keep,]

# vst tranformation
vsd <- assay(vst(dds)) # For a fully unsupervised transformation one can set blind = TRUE (which is the default).
#_________________________________# Feature Selection _________________________________#
# top 5K based on MAD 

mads=apply(vsd,1,mad)
# check data distribution
hist(mads, breaks=nrow(vsd)*0.1)
# selecting features
mad2k=vsd[rev(order(mads))[1:2000],]
#mad4k=vsd[rev(order(mads))[1:4000],]
#mad6k=vsd[rev(order(mads))[1:6000],]

2) Dissimilarity matrix calculation and 3) applying clustering algorithms:

Clustering is about grouping similar samples into one cluster and keeping them far from disimilar samples based on distance measure. There is a diverse list of dissimilarity matrix calculation methods (distance measures). To see list of available distance measures please check ?stats::dist and ?vegan::vegdist(). The latter need to have the vegan package to be installed:

  • For log-transformed gene expression, Euclidean based measures can be applied.
  • For RNA-seq normalised counts, correlation based measures (Pearson, Spearman) or a Poisson-based distance can be used.

Clustering algorithms: To see a list of algorithms please check diceR package vignettes. Most widely used algorithms are partitional clustering and hierarchical clustering.

Partitional clustering are clustering methods used to classify samples into multiple clusters based on their similarity. The algorithms required to specify the number of clusters to be generated. The commonly used partitional clustering, include:

  • K-means clustering (KM): each cluster is represented by the center or means of the data points belonging to the cluster. This method is sensitive to outliers.
  • K-medoids clustering or PAM (Partitioning Around Medoids), in which, each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers.

Hierarchical clustering (HC) in contrast to partitional clustering, this method does not require to pre-specify the number of clusters to be generated. HC can be grouped into two classes.

  • Agglomerative:"bottom-up" approach, each observation is initially considered as a cluster of its own (leaf), and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive: "top-down" approach, This begins with the root so all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Some words on how to compute distances between clusters: Indeed there are diffrent way to do cluster agglomeration (i.e, linkage ). When one using stat package, possible methods include “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”. Generally complete linkage and Ward’s method are preferred. More reading materials on this can be find here. As a result hierarchical clustering provides a tree-based representation of the objects, which is also known as dendrogram.

In practice HC, KM and PAM are commonly used for gene exoression data.

To quantitatively determine the number and membership of possible clusters within the dataset, I will use Consensus Clustering (CC) approach. Applying this method has proved to be effective in new cancer subclasses discoveries. For more information on the methodology please refere to the seminal paper by Monti et al. (2003) and ConsensusClusterPlus package manual.

To determine the number of cluster based on CC, there are several graphics which would help to this extent. (1) A Color-coded heatmap corresponding to the consensus matrix that represent consensus values from 0–1; white corresponds to 0 (never clustered togather) and dark blue to 1 (allways clustered togather). (2) Consensus Cumulative Distribution Function (CDF) plot, This graphic lets one to determine at what number of clusters,k, the CDF reaches an approximate maximum. So consensus and cluster confidence is at a maximum at this k. (3) Based on CDF plot, a chart for relative change in area under CDF curve can be plotted. This plot allows a user to determine the relative increase in consensus and determine k at which there is no appreciable increase.

#_________________________________# Clustering & Cluster assignmnet validation _________________________________#
# finding optimal clusters by CC
library(ConsensusClusterPlus)
results = ConsensusClusterPlus(mad2k,
                               maxK=6,
                               reps=50,
                               pItem=0.8,
                               pFeature=1,
                               title= "geneExp",
                               clusterAlg="pam",
                               distance="spearman",
                               seed=1262118388.71279,
                               plot="pdf")

For this post, I selected 80% item resampling (pItem), 80% gene resampling (pFeature), a maximum evalulated k of 6 so that cluster counts of 2,3,4,5,6 are evaluated (maxK), 50 resamplings (reps), partition around medoids algorithm (clusterAlg) Spearman correlation distances (distance), gave the output a title (title), and asked to have graphical results written to a pdf plot file. I also set a random seed so that this example is repeatable (seed).

The above command with return sevral plots which is helful to make decision about cluster number especially plots showing consensus CDF and relative change in area under CDF curve.

alt-text-1 alt-text-2

So by looking at the plots, the optimal number of clusters would be four in this case. Below is concesus plot of samples when k = 4. Result for other k values can be find in the result pdf file. This is not in agreement with the original paper, where the authors reported three subtypes. However in a new report from the same group using a new pipeline they conculded that NMIBC to have four diffrent subtypes! ref. alt-text-1

[4] Assessing cluster assignment

Assessing cluster assignment or cluster validation indicate to the procedure of assessing the goodness of clustering results. Alboukadel Kassambara has published a detailed pot on this topic. In this tutorial I will use Silhouette method for cluster assessment. this method can be used to investigate the separation distance between the obtained clusters. The Silhouette plot reflects a measure of how close each data point in one cluster is to a points in the neighboring clusters. This measure, Silhouette width, has a range of -1 to +1. Value near +1 show that the sample is far away from the closeset data point from neighboring cluster. A negative value may indicate wrong cluster assignment and a value close to 0 means an arbitrary cluster assignment to that data point.

Since by clustering one should expect to find cluster of samples who are significantly different in terms of survival probability, here as a kind of assessing cluster assignment in biological perspective, I perform survival analysis between clusters (subtypes) to see significant differences, if any.

The output of ConsensusClusterPlus is a list, and result for each of k values like k = 4, can be accessed by results[[4]]. I save the result in a new object and then will move forward with calculating and then plotting Silhouette plot.

#_________________________________#Assessing cluster assignment _________________________________#
#Silhouette width analysis
cc4 = results[[4]]

# calcultaing Silhouette width using the cluster package 
library(cluster)
cc4Sil = silhouette(x = cc4[[3]], # x is a numeric vector that indicates cluster assignment for each data point
                    dist = as.matrix(1- cc4[[4]])) # dist should be a distance matrix and NOT a similarity matrix, so I subtract the matrix from one to get that dist matrix

#For visualization:
library(factoextra)
fviz_silhouette(cc4Sil, palette = "jco",
                 ggtheme = theme_classic())

The above function return a summary of Silhouette width for each cluster:

  cluster size ave.sil.width
1       1  130          0.75
2       2  199          0.83
3       3   66          0.65
4       4   81          0.72

The from the below image one can get an idea on the average Silhoutte width as well as size of each cluster. alt-text-1

#_________________________________#Assessing cluster assignment _________________________________#
#Survival analysis
### preparing dataset for survival analysis
cc4Class = data.frame(cc4$consensusClass)
cc4Class$ID = rownames(cc4Class)
cc4Class = data.frame(cc4Class[match(rownames(clinical.exp),cc4Class$ID),])
all(cc4Class$ID == rownames(clinical.exp))

# new encoding for status, time and cluster
clinical.exp$status = ifelse(clinical.exp$Progression.to.T2. == "NO", 0,1)
clinical.exp$time = as.numeric(clinical.exp$Progression.free.survival..months. * 30)
clinical.exp$cluster = as.factor(cc4Class$cc4.consensusClass)

library(survival)

res.cox <- coxph(Surv(time, status) ~ cluster, data = clinical.exp)
res.cox
Call:
coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp)

             coef exp(coef) se(coef)      z       p
cluster2 -1.47638   0.22846  0.48357 -3.053 0.00226
cluster3  0.22532   1.25272  0.42213  0.534 0.59351
cluster4 -2.39949   0.09076  1.03301 -2.323 0.02019

Likelihood ratio test=21.88  on 3 df, p=6.908e-05
n= 476, number of events= 31 
summary(res.cox)
Call:
coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp)

  n= 476, number of events= 31 

             coef exp(coef) se(coef)      z Pr(>|z|)   
cluster2 -1.47638   0.22846  0.48357 -3.053  0.00226 **
cluster3  0.22532   1.25272  0.42213  0.534  0.59351   
cluster4 -2.39949   0.09076  1.03301 -2.323  0.02019 * 
---
Signif. codes:  0***0.001**0.01*0.05.0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
cluster2   0.22846     4.3771   0.08855    0.5894
cluster3   1.25272     0.7983   0.54769    2.8653
cluster4   0.09076    11.0175   0.01198    0.6874

Concordance= 0.714  (se = 0.039 )
Likelihood ratio test= 21.88  on 3 df,   p=7e-05
Wald test            = 16.52  on 3 df,   p=9e-04
Score (logrank) test = 22.14  on 3 df,   p=6e-05

Refrences

1- Biostar posts: https://www.biostars.org/p/321773/ https://www.biostars.org/p/225315/ https://www.biostars.org/p/74223/ https://www.biostars.org/p/281161/ https://www.biostars.org/p/273107/

2- https://www.datanovia.com/en/courses/partitional-clustering-in-r-the-essentials/

3- https://2-bitbio.com/2017/10/clustering-rnaseq-data-using-k-means.html

About

Cancer subtype discovery using consensus clustering approach to find pattern in expression data (RNA-seq)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages

  • R 100.0%