This version of PCxN is optimized for speed by adjusting code structure and using C++(Rcpp package). The changes do not alter the idea of the original PCxN method found here.
The Rcpp package is widely used to speed-up R scripts by using lower level C++ calculations, majorly decreasing looping overheads (among other things). A large number of R functions are supported by the package (e.g. colMeans()
) but inevitably a few have to be implemented manually, or worst case (very slow) imported from R (e.g. cor.shrink()
).
The original version is based on two nested loops (through tissues and process pair elements). The latter one is distributed between cores. Turning these 2 loops into Rcpp would be hard as we would need to implement all fucntions included to C++. The main problem would be the mclapply()
function, which would require us to find a C++ that provides the same functionality.
What we can do is restructure the code in the following ways:
- Precalculate the Summary matrices (joint and disjoint)
- Subset the pair elements. That means only calculate interesting relationships (e.g. pathways to drugs)
- Enable the concatenationg of result matrices that come from running PCxN with different arguments
A more detailed report of the runtime comparisons can be found here In short, the steps to be implemented:
- Pre-calculate the matrix that holds the disjoint summaries(estimated represent 40% of the geneset pairs).
- Pre-calculate the matrix that holds the joint summaries (multi-core)
- Select desired relationships feature (e.g. pathway - pathway, CMAP - pathway etc)
- Replace/remove Shrinkage, if decided so.
- Concatenate result matrices
R code | Status | Note |
---|---|---|
Pre-calculate the matrix that holds the disjoint summaries | Done | |
Pre-calculate the matrix that holds the joint summaries (multi-core) | On hold | Need multicore |
Desired relationships feature | Done | disjoint + sub |
Replace/remove Shrinkage | Not decided | |
Concatenate matrices | Done |
We have to test whether we can add up matrices which are results of different PCxN runs. Towards that direction we will run a mini experiment of concatenating different sized matrices with different argumetns and see how the actual numbers in the concatenated matrix are affected.
The geneset file DPD.Hs.gs.mini.PDN.RDS is used. We build 3 versions of PCxN (main_01_disjoint.R):
- Base: using the first 50 gene sets from the above file to produce a baseline output matrix with correlations, p-value and adjusted p-values
- Plus_10: This PCxN version uses the 50 base gene sets and 10 new ones.
- Plus_20: This version uses the 50 base gene sets and 20 new ones (not the same as the 10 new ones above)
Once we have the 3 matrices we will cross-check the correlations, p-values and adjusted p-values for any changes.
The results show that only the p.Adjust column changes (as expected), which means we can concatenate matrices like that if we just recalculate the adjusted p-value in estimates02. The resulting matrices can be found here: /shared/hidelab2/shared/Sokratis/PCxN_Plos/Concatenate experiment
Implementing current PCxN functions in Rcpp. Four functions have been translated to C++ but at the moment they don't offer a speed advantage (yet). More effort will go towards implementing them as effieciently as possible. A new function has been created (precalculate_matrices.cpp
) that pre-calculates both joint and disjoint matrices. To gain speed with this function, C++ and multicores have to be used.
C++ code | Status | Note |
---|---|---|
GetSummary.cpp | Implemented | Optimization |
OverlapCoefficient.cpp | Done | |
ShrinkCor.cpp | Done | On dis + sub |
ShrinkPCor.cpp | Done | |
precalculate_matrices.cpp | On hold |
The original PCxN code was built and run for 1,330 gene sets/pathways. When we increased the pathways to 5000, memory issues occcured during execution on sharc. Using rmem=48G, mem=48G and 14 cores scripts that handled tissues with more than one GSEs got stopped shortly after the first completed GSE. One thing we can do is adjust the scripts to clean-up after each loop but still the memory needed at any time will definitely be considerably more than before. Only considering a subset of pairs would do wonders for that problem.
- [C++] Try and avoid ALL imports from R. Especially in repeated parts of the code it results in way slower execution. Specifically important case:
cor.shrink()
in ShrinkCor.cpp and ShrinkPCor.cpp scripts. This is the current reason we can't gain speed from these two scripts. - The code is structured to run per tissue. By splitting the batches(each one having a few tissues) we can cut down the total runtime.
- In Shrink(P)Cor : currently I am importing the cor.shrink function from R and run in C++. Altought the results are identical, this is not speed optimized. The optimal solution would be using a pure C++ function
Software
This PCxN version is working and has the following characteristics:
- All arguments that may change (e.g. genesets file or relationships desired) can be handled through the batch script. No R code changes required.
- Pick relationships feature
- All outputs found in /shared/hidelab2/shared/Sokratis/PCxN_Plos/output_improved_PCxN
- Enable concatenate relationship matrices
Run batch script improved_PCxN_sharc
by first adjusting:
- Batch memory used
- Batch number of cores (suggested 14)
- Batch max time to let running (suggested ?)
- rels (gene sets relationships desired). Use: 1,2,3,4,5,6 to request all relationships. If you need only some of the relationships use any subset of the following numbers:
Desired Pairs | No |
---|---|
pathway-CMAP | 1 |
pathway-CTD | 2 |
pathway-PharmGKB | 3 |
CMAP-CTD | 4 |
CMAP-PharmGKB | 5 |
CMAP.up-CMAP.down | 6 |
pathway-L1000CDS2 | 7 |
L1000CDS2.up-L1000CDS2.down | 8 |
L1000CDS2.up-L1000CDS2.up | 9 |
L1000CDS2.down-L1000CDS2.down | 10 |
pathway-pathway | 11 |
- make gene set file (see
genesets
) and put intodata
- make output file of correct structure (see
example_output_folder
) - modify
src/PCxN_sharc_1
andsrc/PCxN_sharc_2
with correct output file name, relationships (rels, see above) and gene set file name - run
PCxN_sharc_1
with 1-134 jobs (approx ram 5gb * 4 cores, approx 1h per tissue) - run
PCxN_sharc_2
with 1:300 jobs, if it fails, check how many parts and adjust (see details inpowerful_PCxN
) - also adjust the memory (currently 1 part successfully finishes with 8 cores * 20gb, part 1 took about 1h for approx 8000 gene sets)
More info in powerful_PCxN
on github
Additional tools that provide flexibility to PCxN.
This tool allows the combination of 2+ PCxN output matrices into one. The combination requires both output matrices to be calculated beforehand. Adjusted p-values are re-calculated and duplicates are removed (check Concatenate matrices section above).
Due to the simplicity and low resource requirements, this script can be run both locally(P-combiner_local.R
) and on sharc (P-combiner_sharc.R
). The latter should only be used in extreme cases where a large number of matrices needs to be combined or the matrices themselves are very large in size.
Specify the path to parts folder with first part of the parts name eg. output1/mean_pcor2_barcode/res/pcxn_mean_pcor2_barcode
(output of improved_PCxN_estimates02.R
) of all matrices without _parts11.RDS
. In both local/sharc:
- The first argument is composed of the matrix names (the part before "_part1.RDS") seperated by commas.
- The second argument is composed of number of parts for each run separated by commas.
- The third argument is the entire name of the combined matrix without the .RDS filename extension (e.g. matri1_matrix2).
Example
- Local(load the function first):
p_combiner(inputs = c("pcxn_conc_base","pcxn_conc_plus_10","pcxn_conc_plus_20"), nparts = c(2,2,2), output_name = "pcxn_conc_base_plus_10_plus_20")
- Sharc:
Rscript P-combiner_sharc.R pcxn_conc_base,pcxn_conc_plus_10,pcxn_conc_plus_20 2,2,2 pcxn_conc_base_plus_10_plus_20
This tool takes a PcxN output matrix as a base and runs/adds the results of a different set of genesets (much like running a new PCxN without the pairs that have already been calculated). Checks that same named genesets have an identical gene member list. The tools requires only the base matrix to be pre-calculated and calculates only the necessary pairs (the ones that don't already exist in the base matrix).
Place the base matrix, base geneset and new geneset files in the data folder. you will handle the batch script P-adder_PCxN_sharc
. In sharc fill in the following variables:
- base_matrix: The name of the base(already calculated) matrix you added earlier
- gs_base: The geneset used for the base matrix above
- gs_new: The geneset to be used for the new data you want to compute
- old_folder: The path(including a backlash at the end of it) where the part_1, part_2.RDS (etc.) files for the base matirx are kept
- rels: Which relationships you desire for the new computation (works identically to improved PCxN)
The output files (combined matrix in PCxN and PDN style) can be found in the output_adder_PCxN folder.
This tool converts the original PCxN output (matrix where each row represents a pathway pair with p-value/adj. p-value/cor/overlap) to the PDN-style input(square matrix, where rows and columns are the same pathways and the cells only hold correlation between the different pathway pairs). The converted file is named "square_" + original name.
Place the result matrix (from estiamtes03.R) in the same folder as the converter script.In both local/sharc:
- The only argument is the original result's name as a string
Example
- Local(load the function first):
convert_pcxn_result("original_PCxN_result.RDS")
- Sharc:
- svd (cran)
- corpcor (cran)
- parallel (in R core packages)
- metap (cran)