This GitHub repository contains all the code for reproducing the results from our manuscript "Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data", which can be found here.
The website containing extra results and figures from our analysis can be found here.
The scripts needed to create a coexpression network with any of our tested workflows can be found in the src
directory. Most scripts are R scripts and are described below. Some scripts have accompanying data files in the data
directory that can be used to reproduce our workflow, but can be replaced with custom files to use our workflow with your own data. Calculating the pearson correlations to build the network, the network transformations, and evaluation requires installation of the c++ library, Sleipnir. The naive and tissue-aware gold standards we used for evaluation can be found in the "gold_standards" directory.
The docs
and website
directories are only related to the extra results website (linked above) and nothing should be pulled from these directories for reproduction or use of our workflow.
To download the GTEx dataset, use the script gtex_download_and_normalize.R
. It will read a file from data but requires no arguments. Otherwise, see below.
Script:
download_and_normalize.R
Required R packages:
tidyverse, recount
Purpose: Download entire projects from the Recount2 database and output five directories:
counts
: count data for all datasets where the first column in each file is 'gene' and the remaining columns are individual samplescpm
: cpm-normalized datasets where the first column in each file is 'gene' and the remaining columns are individual samplestpm
: tpm-normalized datasets where the first column in each file is 'gene' and the remaining columns are individual samplesrpkm
: rpkm-normalized datasetes where the first column in each file is 'gene' and the remaining columns are individual samplesmetadata
: available metadata for all datasetsrse
: the Rdata object for each dataset Each remaining step of the pipeline is designed to take a directory as an argument to complete the step for every dataset in the directory, so these directories can now be put through the rest of the workflow as desired.
Arguments:
The single argument is the path to a file that is used to select data for download and normalization. We created this file using the Recount2 metadata that can be accessed with the all_metadata function in the recount
R package. The metadata was filtered for our desired characteristics and the following columns were selected to write to file: project
, sample
, run
, and sharq_beta_tissue
.
Use from command line:
$ Rscript download_and_normalize.R selected_projects-tissue-sample.tsv
We have modified the download_and_normalize.R
script to only download the data that met our criteria, so this step can be skipped to repeat our analysis. If other data is being used, this step can be executed to filter out samples that have over half zero-expression for lncRNA, antisense RNA, and protein coding genes.
Script:
sample_filter.R
Required R packages:
tidyverse
Purpose:
Removes samples from each dataset that have over half zero-expression for lncRNA, antisense RNA, and protein coding genes.
Arguments:
Path to directory of datasets to be sample filtered.
Use from command line:
$ Rscript sample_filter.R path/directory_to_be_sample_filtered
Script:
gene_filtering_by_cpm.R
Required R packages:
tidyverse
Purpose:
Remove genes which have universally low expression.
Arguments:
The first argument is the path to the directory of datasets to be gene filtered. The second argument is the path to the file that contains the genes to keep. To repeat our analysis with the SRA genes, use SRA_genes_to_keep_by_cpm_filter.txt
in the data
directory. To repeat our analysis with the GTEx genes, use GTEx_genes_to_keep_by_cpm_filter.txt
in the data
directory.
Use from command line:
$ Rscript gene_filtering_by_cpm.R path/directory_to_be_gene_filtered path/genes_to_keep_file.txt
If no within-sample normalization is desired, continue the pipeline with the counts
directory that has been through any necessary sample selection and gene filtering. Choosing CPM, RPKM, or TPM requires the use of the cpm
, rpkm
, or tpm
directories that have been sample and/or gene filtered as necessary.
The options are none
, TMM
, upper quartile
, CTF
, CUF
, or quantile
normalization. If no between-sample normalization is desired, skip this step.
TMM normalization requires count data and should not be paired with CPM, RPKM, or TPM.
Script:
TMM_normalize.R
Required R packages:
tidyverse, edgeR
Purpose:
TMM normalize each dataset in a directory.
Arguments:
Path to directory of datasets to be TMM normalized.
Use from command line:
$ Rscript TMM_normalize.R path/directory_to_be_normalized
Upper quartile normalization requires count data and should not be paired with CPM, RPKM, or TPM.
Script:
upper_quartile_normalize.R
Required R packages:
tidyverse, edgeR
Purpose:
Upper quartile normalize each dataset in a directory.
Arguments:
Path to directory of datasets to be upper quartile normalized.
Use from command line:
$ Rscript upper_quartile_normalize.R path/directory_to_be_normalized
CTF requires count data and should not be paired with CPM, RPKM, or TPM.
Script:
CTF_normalize.R
Required R packages:
tidyverse, edgeR
Purpose:
CTF normalize each dataset in a directory.
Arguments:
Path to directory of datasets to be CTF normalized.
Use from command line:
$ Rscript CTF_normalize.R path/directory_to_be_normalized
CUF requires count data and should not be paired with CPM, RPKM, or TPM.
Script:
CUF_normalize.R
Required R packages:
tidyverse, edgeR
Purpose:
CTF normalize each dataset in a directory.
Arguments:
Path to directory of datasets to be CUF normalized.
Use from command line:
$ Rscript CUF_normalize.R path/directory_to_be_normalized
Quantile normalization can be paired with counts, TPM, RPKM, or TPM.
Script:
quantile_normalize.R
Required R packages:
tidyverse, preprocessCore
Purpose:
Quantile normalize each dataset in a directory.
Arguments:
Path to directory of datasets to be quantile normalized.
Use from command line:
$ Rscript quantile_normalize.R path/directory_to_be_normalized
This step is not necessary if all gene types are of interest, but the script below will retain only genes types used in our analysis (lncRNA, antisense RNA, and protein coding genes)
Script:
gene_type_filter_lncrna_antirna_proteincoding.R
Required R packages:
tidyverse
Purpose:
Remove all gene types that are not lncRNA, antisense RNA, or protein coding.
Arguments:
Path to directory of datasets to be gene type filtered.
Use from command line:
$ Rscript gene_type_filter_lncrna_antirna_proteincoding.R path/directory_to_be_gene-type_filtered
Script:
asinh_transform.R
Required R packages:
tidyverse
Purpose:
Transform data with hyperbolic arcsine function.
Arguments:
Path to directory of datasets to be transformed.
Use from command line:
$ Rscript asinh_transform.R path/directory_to_be_transformed
This step requires the use of the Sleipnir c++ library. Once Sleipnir has been loaded, we use the following command on each individual dataset to output the network edgelist file:
$ Distancer -i input_dataset.pcl -d pearson -o output_filename.dat -c -s 0 -z
This command was incorporated into a simple bash script to iterate over a directory.
The options are none
, CLR
, or wTO
. If no network transformation is desired, skip this step.
CLR is another option in the Sleipnir c++ library. Once Sleipnir has been loaded, we use the following command on each individual dataset to output the CLR transformed network edgelist:
$ Dat2Dab -i input_dataset.dat -Y -o output_file.dat
This command was incorporated into a simple bash script to iterate over a directory.
Our use of the wTO
R package required that our correlation edgelist be converted to an adjacency matrix before using wTO. For speed, we used a python script to convert the edgelist to an adjacency matrix, then used an R script to use the wTO package and output the transformed edgelist.
Script:
Edgelist_to_matrix.py
(python 3)
Purpose:
Take an edgelist and turn to an adjacency matrix and nodelist for the adjacency matrix.
Arguments:
- dir1 = the directory that the intial edgelist is in
- edgelist = file in dir1 that should be converted to an adjacency matrix
- dir2 = the directory that the output adjacency matrices and nodelists should be saved to
- binary = flag that can be used to output a binary rather than weighted adjacency matrix
- diags = what diagonals should be in the adjacency matrix (default = 0)
Use from command line:
$ python Edgelist_to_matrix.py -dir1 path/input_directory -dir2 path/output_directory -edgelist edgelist_filename.tsv
Script:
edgelist_to_wTO_edgelist.R
Required R packages:
tidyverse, wTO
Purpose:
wTO tranform the network edges.
Arguments:
The first argument is the path to the directory that output files should be placed in. The second argument is the path to the adjacency matrix output by Edgelist_to_matrix.py
. The third argument is the path to the nodelist output by Edgelist_to_matrix.py
.
Use from command line:
$ Rscript edgelist_to_wTO_edgelist.R path/output_directory adjacency_matrix_file.txt nodelist_file.nodelist
If desired, networks can be evaluated on the functional gold standards described in the manuscript. This requires the use of the Sleipnir c++ library. Once Sleipnir has been loaded, we use the following command on each individual dataset:
$ DChecker -w gold_standard.dab -b 20000 -i input_network.dat -o evaluation_output.txt
This command was incorporated into a simple bash script to iterate over a directory. The gold standard can be any gold standard file found in the gold_standards
directory. The evaluation output will include the number of true positives/false positives/true negatives/false negatives at 20,000 cutoffs so that auPRC, auROC, or other desired metrics can be calculated from the file. In our analysis, the R package pracma was used to calculate the area under the precision-recall curve and the area under the ROC curve.
VST and rlog are the data transformations we tested to compare to the hyperbolic arcsine transformation. Both transformations can only be used with count data. We perform sample selection and gene filtering by cpm before doing the transformation, then gene type filtering followed by correlation calculation can be performed to build the network. Network transformation can still be done if desired.
Script:
vst_transform.R
Required R packages:
tidyverse, DESeq2
Purpose:
VST transform each dataset in a directory.
Arguments:
Path to directory of datasets to be transformed.
Use from command line:
$ Rscript vst_transform.R path/directory_to_be_transformed
Script:
rlog_transform.R
Required R packages:
tidyverse, DESeq2
Purpose:
rlog transform each dataset in a directory.
Arguments:
Path to directory of datasets to be transformed.
Use from command line:
$ Rscript rlog_transform.R path/directory_to_be_transformed
For support please contact Kayla Johnson at [email protected].
This repository and all its contents are released under the BSD 3-Clause License; See LICENSE.md.
If you use this work, please cite:
Johnson KA, Krishnan A (2020) Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data. bioRxiv doi.org/10.1101/2020.09.22.308577.
Kayla A Johnson, Arjun Krishnan*
*General correspondence should be addressed to AK at [email protected].
This work was primarily supported by US National Institutes of Health (NIH) grants R35 GM128765 to AK and in part by MSU start-up funds to AK.
We thank the members of the Krishnan Lab for helpful discussion. We are particularly grateful to Anna Yannakopoulos and Christopher A. Mancuso for code advice, and Stephanie Hickey for suggestions on the manuscript.