Deep Onco AI

This analysis pipeline allows training multiple machine-learning models on various subsets of the data in large cell-line repositories such as the CCLE and, in the future, the GDSC. Overall predictors can then be designed by ensembling the most successful models for each drug.

The goals of the current pipeline are to:

  • evaluate the predictive performance of the different algorithms on the different data subsets, for each drug.
  • retrieve and compare the most important features of these models, for each drug:
    • do similar algorithms pick up similar signals?
    • which algorithms are better at which task?
    • are the observations conserved across drugs?
  • build ensemble predictors:
    • do ensembles of multiple predictors perform better than single algorithms? Which ones, and in which cases?
    • what are the 'best' ensembles for each drug, and can a biomarker signature of resistance/sensitivity be derived?

Description and rationale

The prediction of individual patients' response to chemotherapy is a central problem in oncology. Cell lines recapitulate some of the characteristics of the patients' original tumors and can be screened with various drugs. Using the baseline gene and protein expression of the cells, it is possible to build predictors of response for cell lines and, ultimately, for patients before treatment or after relapse.

Getting Started

Dependencies

  • Python3
  • Conda
  • Snakemake (see Snakemake getting started)
  • All other necessary modules are installed automatically during the execution of the Snakemake pipeline; each 'chunk' has its own set of dependencies and its own environment.
  • Some data must be downloaded separately into the /data folder: CCLE_RNAseq_genes_rpkm_20180929 can be found on the CCLE website.

Use

General

The pipeline runs on its own, driven by the settings in the config.yaml configuration file. Here is some info about YAML. The configuration is organized as follows:

  • data:
    • omics:
      • omic1
      • omic2
      • ...
    • targets:
      • target1
      • target2
      • ...
  • modeling:
    • options...

The data part lists the different data subsets used for the analysis. The pipeline recognizes which omic types are needed from which database and imports the necessary files. A minimal sketch of how this configuration could be read is shown below.
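
For orientation, here is a minimal sketch of how such a configuration could be read in Python, assuming config.yaml is standard YAML parsed with PyYAML; the exact keys nested under each omic and target are described in the sections below.

import yaml

# Read the pipeline configuration (minimal sketch; the real config.yaml
# contains the omic, target and modeling entries described in this README).
with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

omics = config["data"]["omics"]      # one entry per omic data subset
targets = config["data"]["targets"]  # one entry per drug-response target
modeling = config["modeling"]        # cross-validation, seeds, metric, ensembling options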

Under each omic, the following info is allowed:

  • omic:
    • name:
    • database:
    • filtering:
      • filter1
      • filter2
      • ...
    • feature_engineering:
      • feature_selection:
        • selection1
        • selection2
      • transformations:
        • transformation1
        • transformation2

The following filters are available:

  • sample_completeness: removes samples with insufficient data
  • feature_completeness: removes features with insufficient data
  • feature_variance: removes features with insufficient variance
  • cross-correlation: removes cross-correlated features (uses an approximate optimization for large datasets instead of the exact solution)

Furthermore, filters are divided into 'fast' and 'slow'. Fast filters are fitted and applied first to reduce the size of the dataset. Slow filters (cross-correlation) are fitted on the result of this first pass and applied subsequently. Filters are additive, i.e. only the features that pass all the filters in each of the two filtering steps are retained.

The following selections are available:

  • importance: selects the top features according to XGBoost
  • predictivity: selects the top features by cross-elimination (not recommended for large datasets)

These methods will select the features with the most signal for the next step.

The following transformations are available:

  • PCA
  • t-SNE
  • Polynomial combination
  • 'OR gate' combination

Future updates will include ICA and RPA. The selected features are transformed or combined into a new dataset of features; transformations are, however, no longer used in the published version of the pipeline. A sketch of a full omic entry, combining filters, selections and transformations, is shown below.
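
Putting the pieces together, a single omic entry could look roughly like the YAML below (parsed here with PyYAML). This is a hypothetical sketch: the filter, selection and transformation names come from the lists above, but the exact key spellings and any parameter values are illustrative only.

import yaml

# Hypothetical sketch of one 'omic' entry combining filtering and feature engineering.
omic_entry = yaml.safe_load("""
name: RNA
database: CCLE
filtering:
  - name: sample_completeness   # fast filter
    enabled: true
  - name: feature_completeness  # fast filter
    enabled: true
  - name: feature_variance      # fast filter
    enabled: true
  - name: cross-correlation     # slow filter, applied after the fast ones
    enabled: true
feature_engineering:
  feature_selection:
    - name: importance          # top features according to XGBoost
      enabled: true
    - name: predictivity        # cross-elimination, costly on large datasets
      enabled: false
  transformations:
    - name: PCA                 # transformations are not used in the published version
      enabled: false
""")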

Under each target, the following info is allowed:

  • target:
    • name
    • database
    • responses (which metric is used for the response)
    • target_drug_name
    • filtering:
      • filter1
      • filter2
      • ...
    • normalization
    • target_engineering:
      • method1
      • method2

The same filter types are applicable to both the 'omics' and the 'targets'. Target normalization is only used when the overall normalization is deactivated (in master_script.py). For target engineering, only the quantization method is currently implemented; thresholding will become active in a future release. A sketch of a target entry is shown below.
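
By analogy, a target entry could look roughly like this hypothetical sketch; the field names follow the list above, while the drug name, response metric and other values are purely illustrative.

import yaml

# Hypothetical sketch of one 'target' entry.
target_entry = yaml.safe_load("""
name: drug_response
database: CCLE            # illustrative; GDSC support is planned
responses: AUC            # which metric is used for the response
target_drug_name: Lapatinib
filtering:
  - name: sample_completeness
    enabled: true
normalization:
  enabled: false          # only used when the overall normalization is deactivated
target_engineering:
  - name: quantization    # currently the only implemented method
    enabled: true
  - name: thresholding    # planned for a future release
    enabled: false
""")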

The different omics or targets can be commented out of the analysis. Within each omic or target, the individual filters and other methods can be disabled (enabled: false).

In the modeling part, the options used for the analysis can be specified: for example, the number of folds for the different cross-validations, the random seeds, the search depth of the hyperparameter optimization step, the metric used for performance, and the configuration of the ensembling step. A rough sketch is shown below.
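
As a rough illustration, the modeling section might look like the sketch below; every key name here is a hypothetical placeholder for the options just mentioned, not the exact spelling used in the shipped config.yaml.

import yaml

# Hypothetical sketch of the 'modeling' options.
modeling_options = yaml.safe_load("""
cross_validation_folds: 5
random_seeds: [42, 43]
hyperparameter_search_depth: 30   # depth of the Bayesian optimization
metric: roc_auc                   # performance metric
ensembling:
  enabled: true
""")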

Step-by-step

  • create a new environment:
$ conda install -n base -c conda-forge mamba
$ conda activate base
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
$ conda activate snakemake
  • copy the file config.yaml into a new folder and modify it as needed
  • specify this folder name in config_snake.yaml as the input, and a path for the output
  • run the pipeline from the top-level folder:
snakemake --cores 1 --use-conda --configfile workflow\config_snake.yaml

Note that on Unix systems the path separator is a forward slash (workflow/config_snake.yaml), and that your default conda environment might be called 'root' rather than 'base'.

  • the results of each step are written as pickle objects. Here is some info about pickle (a minimal loading example is shown after this list).
  • the analysis is recorded in the Snakemake run log.
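
For example, an intermediate result can be inspected like this (the file name is hypothetical and depends on the step and on the output path configured in config_snake.yaml):

import pickle

# Load one of the intermediate pickle objects written by the pipeline.
with open("results/step_output.pkl", "rb") as fh:  # hypothetical path
    result = pickle.load(fh)
print(type(result))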

Structure

Alright, you want to dig into the code. Here is some useful info:

  • the main code is in 'master_script.py'
  • the highest-level functions are located in config.py
  • data is organized with samples as lines and features as columns
  • the Dataset class is used throughout the project. It contains a pandas DataFrame and two pandas Series giving the 'omic' and 'database' of each feature (column). A rough sketch is shown after this list.
  • there are two Filter classes: one for samples and one for features. The samples flavor is applied only once, whereas internal validation would require feature preprocessing to be fitted on the training set only and then applied to both the training and test subsets. Because the number of samples may be too low at this stage of the project, this is not implemented yet, but the filtering concept is already in place. The features flavor of filters requires an instance of the Rule class, whereas the samples flavor does not.
  • filters are separated into fast (applied first) and slow (applied second) to decrease computation time.
  • the pipeline can be run either in 'optimization' mode, where the hyperparameters of each classifier are tuned with Bayesian optimization, or in 'standard' mode, where a predefined set of hyperparameters is used for all trainings. So far, only small increases in predictive performance have been observed with hyper-optimized models, and only minimal differences between hyper-optimized and default parameter values, but this has not been investigated in full.
  • ...
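
As a rough sketch of the Dataset container described above (attribute and helper names are illustrative; see the actual class in the code base):

import pandas as pd

class Dataset:
    """Samples-by-features table plus per-feature 'omic' and 'database' labels."""
    def __init__(self, dataframe: pd.DataFrame, omic: pd.Series, database: pd.Series):
        self.dataframe = dataframe  # rows = samples, columns = features
        self.omic = omic            # one label per column, e.g. 'RNA'
        self.database = database    # one label per column, e.g. 'CCLE'

    def subset(self, omic: str) -> pd.DataFrame:
        # hypothetical helper: keep only the columns belonging to one omic type
        columns = self.omic[self.omic == omic].index
        return self.dataframe.loc[:, columns]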

Visualizations

Both data and results are visualized. Here is the list of all available plots for each 'omic':

  • general distribution of the features before and after log transformation (if more than 100 features are present, a random sample of 100 features is plotted)
  • mean vs variance scatterplot, before and after log transformation
  • analysis of missing data: fraction of data present per sample, per feature, and binary map of missing data
  • analysis of missing data correlation: per sample, per feature: histograms of correlation coefficients and heatmaps of cross-correlations of missing data presence
  • correlation analysis: histograms of correlation coefficients and heatmap of data cross-correlations, for both samples and features
  • target analysis: distributions of raw values, and visualization of the thresholds on the distribution of normalized values

Help

Contact the authors for any help in using the tools.

Contributing

Contributions are welcome from members of the group. Look for the TODO keyword. Here is a brief list of things yet to implement:

  • upsampling with SMOTE and VAEs
  • thresholding of responses in combination with the quantization
  • loading of the 'BinarizedIC50' values (alternative targets)
  • more models to hyper-optimize (NN architectures)
  • stacking with more algorithms for scikit-learn or others
  • stack of stacks
  • compile gene-level versus transcript level expression
  • more filters (outliers, ...)
  • grid-search or other to compare with bayesian search
  • add dunder methods for classes
  • unit tests
  • include GDSC data

Authors

Version History

  • 0.3 (June 2024): paper finally written, updates pushed
  • 0.2 (May 2022): working pipeline, up to and including model explanation
  • 0.1 (April 2021): initial release

License

?

Acknowledgments
