Useful links:
WGCNA: an R package for weighted correlation network analysis
book; Weighted Network Analysis
The WGCNA pipeline is expected an input matrix of normalized expression values including samples in columns and gene names on rows. There is no limitation for the methods exploring the expression values; RNA-Seq or microarray methods. We can use GEO or TCGA expression profiles for this analysis.
BiocManager::install("WGCNA")
library(WGCNA)
Usually we need to rotate or transpose the rows with the columns in the matrix using t
function.
rows => samples , columns => genes
The first and important step is clustering the samples to identify the outliers. To do this we use flashClust
package.
library(flashClust)
To perform heirarichal clustering we use hclust
function to construct clusters. The samples are clustered by distance between them based on the expression values.
-> hclust(dist(matrix), method = "average")
The graphical tree output is based on the height values on axis y between samples. The outlier is a sample locates in a far distance of others without any connections to others. This sample has a height more than other samples and is located in top of the tree with a high height. It means that the expression values in this sample are not matched or close to others. The outlier sample should be remove.
-
It would be more than one outlier samples in a matrix.
-
Removing the outliers is based on a cut-line to cut the tree cluster according to the highest height in which the samples are close together. It means that other samples above this line display as outliers because they are in a far distance of others.
#line cut for removing the outlier sample
# the highest height is 100 for example
abline(h = 100, col = "red")
<- cutreeStatic(sampleTree, cutHeight = 100)
table(clust)
- clust 1 contains the samples we want to keep.
keepsample <- clust==1
- The outlier samples are removed from the main matrix.
We need a table of trait information such as clinical data, molecular characteristics, or phenotypic and genotypic features to analyze the correlation and relationship between traits and gene expressions. In this step, it is needed to investigate the relationships of traits with gene modules to identify hub genes correlated strongly with an important features of samples.
-
Trait could be the features or characteristics of genes such as regulation levels.
-
Clinical data includes staging, mutations, molecular or cellular features, etc...
Choosing a soft power (β) is an important step to detect modules.
-
power number is a critical index to identify gene module packing
-
β parameter will be to calculate our adjacency matrix.
-
The
pickSoftThreshold
function calculates multiple networks all based on different β values and returns a data frame with the R2 values for the networks scale-free topology model fit as well as the mean connectivity measures.
pickSoftThreshold(matrix)
Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.0278 0.345 0.456 747.00 762.0000 1210.0
## 2 2 0.1260 -0.597 0.843 254.00 251.0000 574.0
## 3 3 0.3400 -1.030 0.972 111.00 102.0000 324.0
## 4 4 0.5060 -1.420 0.973 56.50 47.2000 202.0
## 5 5 0.6810 -1.720 0.940 32.20 25.1000 134.0
## 6 6 0.9020 -1.500 0.962 19.90 14.5000 94.8
## 7 7 0.9210 -1.670 0.917 13.20 8.6800 84.1
## 8 8 0.9040 -1.720 0.876 9.25 5.3900 76.3
## 9 9 0.8590 -1.700 0.836 6.80 3.5600 70.5
## 10 10 0.8330 -1.660 0.831 5.19 2.3800 65.8
## 11 12 0.8530 -1.480 0.911 3.33 1.1500 58.1
## 12 14 0.8760 -1.380 0.949 2.35 0.5740 51.9
## 13 16 0.9070 -1.300 0.970 1.77 0.3090 46.8
## 14 18 0.9120 -1.240 0.973 1.39 0.1670 42.5
## 15 20 0.9310 -1.210 0.977 1.14 0.0951 38.7
Plot the R2 values as a function of the soft thresholds
We should be maximizing the R2 (β) value and minimizing mean connectivity.
par(mfrow=c(1,2))
plot(sft$fitIndices[,1],
-sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology
Model Fit,
signed Rˆ2",type="n",main=paste("Scale independence"))
text(sft$fitIndices[,1],
-sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,col="red")
abline(h=0.80,col="red")
plot(sft$fitIndices[,1],sft$fitIndices[,5],type="n",
xlab="Soft Threshold (power)",ylab="Mean Connectivity",
main=paste("Mean connectivity"))
text(sft$fitIndices[,1],sft$fitIndices[,5],labels=powers,
col="red")
- We can determine the soft power threshold which is a number as it is the β that retains the highest mean connectivity (above zero) while reaching an R2 value above 0.80.
NOTE: the higher the value, the stronger the connection strength will be of highly correlated gene expression profiles and the more devalued low correlations will be.
Now we have the soft threshold power determined we can call on the adjacency
function. This function calculates the similarity measurement and transforms the similarity by the adjacency function and generates a weighted network adjacency matrix.
adjacency(matrix, power = 3)
-
Turn adjacency into topological overlap
-
creating a matrix for showing the neighbors similarity correlation
TOM <- TOMsimilarity(adjacency)
To convert this matrix into a dissimilarity matrix we can subtract the TOM object from 1.
dissTOM <- 1-TOM
The dissimilarity/distance measures are then clustered using linkage hierarchical clustering and a dendrogram (cluster tree) of genes is constructed.
hierTOM = hclust(as.dist(dissTOM),method="average")
#plotting the dendrogram
sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04)
To identify modules from this gene dendrogram, we can use the cutreeDynamic
function.
Modules <- cutreeDynamic(dendro = geneTree, distM = TOM, deepSplit = 2, pamRespectsDendro = FALSE, minClusterSize = 30)
table(Modules)
Modules
## 0 1 2 3 4 5 6
## 88 614 316 311 257 235 225
- Here we can see 6 modules were created with the number of genes belong to them. The Label 0 Module is reserved for unassigned genes (genes that do not fit in any module).
A ME (Module Eigengene) is the standardized gene expression profile for a given module.
To identify the Module Eigengene we can call on the expression data into the moduleEigengenes
function.
MElist <- moduleEigengenes(expression.data, colors = ModuleColors)
MEs <- MElist$eigengenes
head(MEs)
MEblue MEbrown MEturquis MEgreen MEyellow
## F2_2 0.013902476 0.0410177922 0.007072125 0.12978459 0.006276361
## F2_3 0.066675342 -0.0009540238 0.072447744 -0.07777835 0.010326534
## F2_14 0.066711912 -0.0841292811 0.062700422 -0.19072152 0.003707524
## F2_15 -0.064480250 0.0909333146 0.050275810 0.04077621 -0.019067137
## F2_19 0.063634038 -0.0709378322 0.016600588 -0.04036901 0.017796637
## F2_20 -0.001201217 0.0653004166 0.049766750 0.10391289 -0.040252274
## MEgrey
## F2_2 0.006971934
## F2_3 -0.016017527
## F2_14 -0.041321626
## F2_15 -0.014390509
## F2_19 -0.023401174
## F2_20 0.113170728
Calculate dissimilarity of module eigengenes
MEdiss <- 1- cor(MEs)
-
heirerichal clustering for eigengenes to show closed or overlapped modules if they're exist
-
put the cutoff line for more than 0.25 distance which it means 0.75 correlation
-
I want to merge each two modules have more than 0.75 correlation
-
merging close modules
merge <- mergeCloseModules(datExpr, dynamicColor, cutHeight = 0.25, verbose = 3)
mergeColors <- merge$colors
table(mergeColors)
mergedME <- merge$newMEs # Eigengenes of the new merged modules
sizeGrWindow(12, 9)
plotDendroAndColors(geneTree, cbind(dynamicColor, mergeColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
correlation between each ME and each trait is calculated
modulTraitcor <- cor(MEs, datTraits, use = "p")
correlations with p.values and significance values
moduleTraitPval <- corPvalueStudent(modulTraitcor, nsamples)
adjPval <- p.adjust(moduleTraitPval)
Visualization of the module-trait association, displaying correlations and their p-values
sizeGrWindow(12,9)
textMatrix <- paste(signif(modulTraitcor, 2), "\n(",
signif(moduleTraitPval, 1), ")", sep = "")
dim(textMatrix) <- dim(modulTraitcor)
par(mar = c(6, 8.5, 3, 1))
labeledHeatmap(Matrix = module.trait.correlation,
xLabels = names(datTraits),
yLabels = names(mergedMEs),
ySymbols = names(mergedMEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.4,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
- We can use gene mutation status ( or binary traits) as trait features for analyze the correlation. There would be two columns as mutated and non-mutated states for samples holding mutated genes and non-mutated genes respectively.
- We can use the gene significance along with the genes intramodular connectivity to identify potential target genes associated with a particular trait of interest.
Connectivity - how connected a speficic node is in the network (how many nodes have high correlation with that node). High connectivity indicates a hub gene (central to many nodes).
Whole Network connectivity - a measure for how well the node is connected throughout the entire system Intramodular connectivity - a measure for how well the node is connected within its assigned module. Also an indicator for how well that node belongs to its module. This is also known as module membership (MM).
Calculate the module membership and the associated p-values. first I should make p.vaues of each gene corr to their own modules
MMpvalue <- as.data.frame(corPvalueStudent(as.matrix(geneMODmem), nSamples = nsamples))
#make name of module membership columns
names(geneMODmem) <- paste("MM", modNames, sep = "")
names(MMpvalue) <- paste("p.mm", modNames, sep = "")
Calculate the gene significance and associated p-values
geneTraitSig <- as.data.frame(cor(datExpr, datTraits, use = "p"))
GSpvalue <- as.data.frame(corPvalueStudent(as.matrix(geneTraitSig), nsamples))
names(geneTraitSig) = paste("GS.", names(datTraits), sep="")
names(GSpvalue) = paste("p.GS.", names(datTraits), sep="")
Using these two parameters we can identify the hub genes. The MM > 0.80 and GS > 0.20 cutoffs are used together for detecting hub genes.
- hub genes are the genes released from a particular module with both criteria.
brown = "brown" #the significant module with trait in heatmap
column <- match(brown, modNames)
moduleGenes <- moduleColors==brown
sizeGrWindow(3,3)
par(mfrow = c(2,4))
verboseScatterplot(abs(geneMODmem[moduleGenes, column]),
abs(geneTraitSig[moduleGenes, 1]),
xlab = paste("Module Membership in", brown,"module"),
ylab = "Gene significance for brafv600eplus",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = brown, abline = T)
abline(h = 0.20,v = 0.80, col = "red")
Now we can extract the hub genes as a table based on two criteria simultaneously. The hub genes have MM values more than 80% membership in the module, and GS values more than 20% significant correlation with a particular trait.
Probes = names(datExpr)
inModule <- is.finite(match(moduleColors, brown))
modProbes <- Probes[inModule]
modProbes <- data.frame(modProbes)
GenemmPval <- MMpvalue[MMpvalue$p.mmbrown<0.05,]
names(GenemmPval)
GeneGSpval <- data.frame(GSpvalue[GSpvalue$p.GS.v600e_plus<0.05,])
goodgeneMM <- modProbes[modProbes$modProbes %in% as.character(rownames(GenemmPval)),]
goodGS <- modProbes[modProbes$modProbes %in% as.character(rownames(GeneGSpval)),]
MMgene <- goodgeneMM[goodgeneMM %in%
as.character(rownames(geneMODmem)[geneMODmem$MMbrown>0.80])]
GSgene <- goodGS[goodGS %in%
as.character(rownames(geneTraitSig)[geneTraitSig$GS.v600e_plus>0.20])]
MM_GSGene <- MMgene[MMgene %in% as.character(GSgene)]
MM_GSGene <- data.frame(MM_GSGene)
Plot the relationships among the eigengenes and the trait
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(5,4,1,2), cex.lab = 0.8, xLabelsAngle
= 90)
With this heatmap we can identify groups of correlated eigengenes called meta modules. Modules with mutual correlations stronger than their correlation with the specified clinical trait would be grouped into a meta module.