Develop an efficient algorithm for classifying drugs based on their biological activity.
The project goal is to predict the Mechanism of Action (MoA) response(s) of different samples (sig_id) using various inputs such as gene expression data and cell viability data.
The Connectivity Map, a project within the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS) present this challenge to advance drug development through improvements to MoA prediction algorithms. (1)
What is the Mechanism of Action (MoA) of a drug? And why is it important?
In pharmacology, the term mechanism of action (MoA) refers to the specific biochemical interaction through which a drug substance produces its pharmacological effect. (2) A mechanism of action usually includes mention of the specific molecular targets to which the drug binds, such as an enzyme or receptor. (3)
In the past, drugs were often derived from natural sources or traditional remedies without a clear understanding of how they worked. For example, paracetamol (known as acetaminophen in the US) was used clinically for decades before its biological mechanisms were fully understood. However, with technological advances, drug discovery has shifted towards a more targeted approach. Scientists now aim to identify a specific protein associated with a disease and develop a molecule that can interact with it. They use a mechanism of action (MoA) label to describe a molecule's biological activity.
How do we determine the MoAs of a new drug?
One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression (for example, GEO or the EMBL-EBI Expression Atlas) or cell viability patterns of drugs with known MoAs.
Based on the MoA annotations, the accuracy of solutions will be evaluated on the average value of the logarithmic loss function applied to each drug-MoA annotation pair.
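$$
\text{score} = -\frac{1}{M} \sum_{m=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \left[ y_{i,m} \log(\hat{y}_{i,m}) + (1 - y_{i,m}) \log(1 - \hat{y}_{i,m}) \right]
$$

where:
- $N$ represents the number of samples
- $M$ represents the number of MoA targets
- $y_{i,m}$ represents the true label of sample $i$ for MoA target $m$
- $\hat{y}_{i,m}$ represents the predicted probability of sample $i$ for MoA target $m$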
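The metric described above, the binary log loss averaged over every sample and MoA target pair, can be sketched in plain Python as follows. The function name and the clipping bound `eps` are assumptions made for illustration; clipping only keeps the logarithm finite when a prediction is exactly 0 or 1.

```python
import math


def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over all (sample, MoA target) pairs.

    y_true: list of N rows, each holding M binary labels (0 or 1).
    y_pred: list of N rows, each holding M predicted probabilities.
    eps: clipping bound (an assumption) that keeps log() finite.
    """
    n_samples = len(y_true)
    n_targets = len(y_true[0])
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # clip into [eps, 1 - eps]
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / (n_samples * n_targets)


# Two samples, two MoA targets (made-up numbers).
y_true = [[1, 0], [0, 0]]
y_pred = [[0.9, 0.1], [0.2, 0.5]]
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))
```

Because every target contributes equally, confident wrong predictions on rare MoAs are penalized as heavily as on common ones, which is why probabilities close to 0 or 1 should be used with care.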
In this challenge, we can access a unique dataset that combines gene expression and cell viability data. This data is based on a new technology that measures human cells' responses to drugs in a pool of 100 different cell types, solving the problem of identifying which cell types are better suited for a given drug. Additionally, we have access to MoA annotations for over 5,000 drugs in this dataset.
The training data provides an optional set of MoA labels that are not included in the test data and are not used for scoring.
List of files:
- `train_features.csv` - Features for the training set. Features `g-` signify gene expression data, and `c-` signify cell viability data. `cp_type` indicates samples treated with a compound (cp_vehicle) or with a control perturbation (ctrl_vehicle); control perturbations have no MoAs; `cp_time` and `cp_dose` indicate treatment duration (24, 48, 72 hours) and dose (high or low).
- `train_drug.csv` - This file contains an anonymous `drug_id` for the training set only.
- `train_targets_scored.csv` - The binary MoA targets that are scored.
- `train_targets_nonscored.csv` - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.
- `test_features.csv` - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.
- `sample_submission.csv` - A submission file in the correct format.
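As a small illustration of the column naming described in the file list, the sketch below groups feature columns by their `g-` and `c-` prefixes. The helper name `split_feature_columns` and the tiny in-memory header are assumptions made for this example; the real files contain many more columns.

```python
import csv
import io


def split_feature_columns(header):
    """Group column names into gene expression, cell viability, and other."""
    groups = {"gene": [], "cell": [], "other": []}
    for name in header:
        if name.startswith("g-"):
            groups["gene"].append(name)
        elif name.startswith("c-"):
            groups["cell"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Made-up header standing in for the first line of train_features.csv.
sample_csv = io.StringIO("sig_id,cp_type,cp_time,cp_dose,g-0,g-1,c-0\n")
header = next(csv.reader(sample_csv))
groups = split_feature_columns(header)
print(groups["gene"], groups["cell"], groups["other"])
```

Keeping the two feature families separate makes it easier to apply different preprocessing to each of them later on.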
⚠️ Important: The data processing pipeline is integrated with the Kaggle API, so make sure to configure your Kaggle API credentials before getting started.
To build a preprocessing pipeline, run the command `make data` in your terminal. This command triggers a chain of scripts in the following order:
- Check the `data/raw` directory to make sure there is a training dataset.
- Download the dataset from the Kaggle server if `data/raw` is empty, or skip this step otherwise.
- Extract the downloaded dataset to the `data/raw` directory.
- Delete the downloaded zip file.
- Perform feature engineering tasks to prepare the dataset for training.
- Perform feature selection.
- Save the prepared dataset in the `data/processed` directory.
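The first four steps of this chain (check, download, extract, delete) can be sketched as below. This is not the project's actual script: `ensure_raw_data` and the `download` callable are hypothetical names, and the download is replaced by a stand-in that writes a small zip file so the example runs without Kaggle credentials.

```python
import tempfile
import zipfile
from pathlib import Path


def ensure_raw_data(raw_dir, download):
    """Make sure raw_dir holds the dataset; fetch and unpack it if not.

    raw_dir: path to the data/raw directory.
    download: hypothetical callable that saves the archive into raw_dir
        and returns its path (the project uses the Kaggle API for this).
    Returns True if a download happened, False if it was skipped.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    if any(raw_dir.iterdir()):       # data already present: skip download
        return False
    archive = Path(download(raw_dir))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(raw_dir)       # extract into data/raw
    archive.unlink()                 # delete the downloaded zip file
    return True


def fake_download(target_dir):
    """Stand-in for the Kaggle download: writes a one-file zip archive."""
    archive = Path(target_dir) / "dataset.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("train_features.csv", "sig_id,g-0\nid_000,0.1\n")
    return archive


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw"
    first = ensure_raw_data(raw, fake_download)    # downloads and extracts
    second = ensure_raw_data(raw, fake_download)   # skipped, data exists
    files = sorted(p.name for p in raw.iterdir())

print(first, second, files)
```

Passing the download step in as a callable keeps the check-and-extract logic testable offline; the same emptiness check is what lets a manually placed dataset bypass the download.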
Alternatively, you can retrieve the dataset manually. This does not interfere with the preprocessing pipeline, which skips the download step when `data/raw` is not empty.
To obtain the data manually, follow these steps:
- Sign in to your Kaggle account or sign up if you still need one.
- Accept the MoA competition rules. This grants you full access to the MoA competition data.
- Download the dataset.
- Unzip the downloaded `lish_moa.zip` file to the `data/raw` project directory.
In our project, we use four deep learning models: FNN (Feedforward Neural Network), ResNet (Residual Network), FTTransformer (Feature Tokenizer + Transformer), and TabNet.
FNN is a plain fully connected network and serves as the baseline model. ResNet adds residual (skip) connections, which ease gradient propagation when training deeper networks. FTTransformer converts each feature into an embedding token and applies Transformer attention layers over the tokens, which suits high-dimensional tabular data. TabNet uses sequential attention to select which features to process at each decision step.
By incorporating these models, we explore diverse approaches and combine their strengths to improve multi-label drug classification performance.
This project's automation workflow is built on GNU Make, which uses a Makefile as its core. The Makefile defines rules that are exposed as `make` commands. These commands connect all the processes in the project at a high level of abstraction. Refer to the following table for all this project's `make` commands.
Command | Description | Prerequisite |
---|---|---|
`make env` | Create a virtual environment | |
`source moa activate` | Activate virtual environment | |
`make test_env` | Test virtual environment | |
`make requirements` | Install dependencies | test_environment |
`make raw_data` | Download and extract data from Kaggle | |
`make data` | Make data preprocessing pipeline | raw_data |
`make train` | Initialize model training | data |
`make pred` | Make prediction | train |
`make report` | Create report | |
`make clean` | Delete all compiled Python files | |
`make lint` | Lint using flake8 | |
`make help` | List all targets and descriptions | |
What Does the Prerequisite Column Mean?
In the Prerequisite column, you can see which commands require a specific condition to be met. For instance, the `make train` command requires `data` to be prepared beforehand. However, you do not have to run the prerequisites manually. When you execute the target command, it automatically runs the prerequisite and only proceeds if it is successful.
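To make the prerequisite chain concrete, here is a minimal Python sketch of how a target's run order follows from the Prerequisite column. It only illustrates the idea and is not how GNU Make is implemented: the `PREREQUISITES` mapping is transcribed from the table, `run_order` is a hypothetical helper, and failure handling and cycle detection are left out.

```python
# Prerequisites transcribed from the table; targets without one map to [].
PREREQUISITES = {
    "test_environment": [],
    "requirements": ["test_environment"],
    "raw_data": [],
    "data": ["raw_data"],
    "train": ["data"],
    "pred": ["train"],
}


def run_order(target, prerequisites, seen=None):
    """Return targets in the order they would run: prerequisites first."""
    if seen is None:
        seen = []
    for dep in prerequisites.get(target, []):
        run_order(dep, prerequisites, seen)   # resolve dependencies first
    if target not in seen:
        seen.append(target)                   # then the target itself
    return seen


print(run_order("pred", PREREQUISITES))
# ['raw_data', 'data', 'train', 'pred']
```

In other words, asking for the last target in the chain is enough to trigger everything that has to happen before it.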
Follow these steps to set up the Kaggle API credentials:
- Create a new Kaggle API token according to the instructions.
- Save the obtained `kaggle.json` file to the `~/.kaggle` folder.
Note: If you need to store the Kaggle API token in a different location, set the KAGGLE_CONFIG_DIR environment variable to the directory where you keep the Kaggle API token kaggle.json. For example, on a Unix-based machine, the command would look like this:
export KAGGLE_CONFIG_DIR=/home/user/miniconda3/envs/moa/bin
For your security, ensure that other users of your computer do not have read access to your credentials:
chmod 600 ~/.kaggle/kaggle.json
You can also choose to export your Kaggle username and token to the environment:
export KAGGLE_USERNAME=niander_wallace
export KAGGLE_KEY=xxxxxxxxxxxxxx
Follow the documentation to learn more about the Kaggle API and how to use Kaggle CLI tools.
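Before running the pipeline, it can help to check which of the options above is actually in effect. The sketch below is a hypothetical pre-flight check, not part of the Kaggle client: the function names and the lookup order (environment variables first, then `KAGGLE_CONFIG_DIR`, then `~/.kaggle`) are assumptions made for illustration. The demo writes a placeholder token into a temporary directory.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def find_kaggle_credentials(env, home):
    """Return (username, key, source) or None if nothing is configured.

    env: mapping of environment variables (for example os.environ).
    home: the user's home directory.
    The lookup order is an assumption made for this sketch.
    """
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return env["KAGGLE_USERNAME"], env["KAGGLE_KEY"], "environment"
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", Path(home) / ".kaggle"))
    token_file = config_dir / "kaggle.json"
    if not token_file.is_file():
        return None
    token = json.loads(token_file.read_text())
    return token["username"], token["key"], str(token_file)


def is_private(path):
    """True if group and others have no permissions (mode 600 qualifies)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


with tempfile.TemporaryDirectory() as home:
    kaggle_dir = Path(home) / ".kaggle"
    kaggle_dir.mkdir()
    token_path = kaggle_dir / "kaggle.json"
    # Placeholder values, same as in the export example above.
    token_path.write_text(
        json.dumps({"username": "niander_wallace", "key": "xxxxxxxxxxxxxx"})
    )
    os.chmod(token_path, 0o600)
    from_file = find_kaggle_credentials({}, home)
    from_env = find_kaggle_credentials(
        {"KAGGLE_USERNAME": "niander_wallace", "KAGGLE_KEY": "xxxxxxxxxxxxxx"},
        home,
    )
    private = is_private(token_path)
    missing = find_kaggle_credentials({}, kaggle_dir)  # no token below here

print(from_file[2], from_env[2], private, missing)
```

In a real session you would call `find_kaggle_credentials(os.environ, Path.home())` and fail early with a clear message if it returns `None`.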
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project
├── configs            <- Config files for implemented models and more
├── data
│   ├── predictions    <- Predicted targets
│   ├── processed      <- The final canonical data sets for modeling. Obtained after
│   │                     preprocessing, merging, cleaning, feature engineering, etc.
│   └── raw            <- The original, immutable data dump. Should be considered as read-only.
├── logs               <- Logs and tensorboard event files
├── drafts             <- Drafts, hypothesis testing
│
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. A naming convention is a number (for ordering),
│   │                     the creator's initials, and a short `-` delimited description, e.g.
│   │                     `1.0-os-initial-data-exploration`
│   ├── exploratory    <- Contains initial explorations
│   └── reports        <- Works that can be exported as html to the reports directory
│
├── notes              <- Notes, ideas, experiment tracking, etc.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project
│
├── test_environment   <- Test python environment is set up correctly
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
To reproduce the solution, do the following:
- Clone the repository:
git clone https://github.com/oleksandrsirenko/mechanisms-of-action-moa-prediction.git moa
- Get the dataset manually or configure the Kaggle API to automate this process.
- Create the virtual environment for the project:
make environment
- Activate the virtual environment:
source moa activate
- Install dependencies:
make requirements
- Prepare dataset:
make data
- Train models:
make train
- Make predictions:
make prediction
- Mechanisms of Action (MoA) Prediction. Retrieved from https://www.kaggle.com/c/lish-moa.
- Spratto, G.R., & Woods, A.L. (2010). Delmar Nurse's Drug Handbook. Cengage Learning. ISBN 978-1-4390-5616-5.
- Grant, R.L., Combs, A.B., & Acosta, D. (2010). Experimental Models for the Investigation of Toxicological Mechanisms. In McQueen, C.A. (Ed.), Comprehensive Toxicology (2nd ed., p. 204). Oxford: Elsevier. ISBN 978-0-08-046884-6.
- Corsello, S.M., et al. (2020). Discovering the anticancer potential of non-oncology drugs by systematic viability profiling. Nature Cancer. Advanced online publication. DOI: 10.1038/s43018-019-0018-6.
- Gene Expression Omnibus (GEO). Retrieved from https://www.ncbi.nlm.nih.gov/geo/.
- EMBL-EBI Expression Atlas. Retrieved from https://www.ebi.ac.uk/gxa/home.
- Subramanian, A., et al. (2017). A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell, 171(6), 1437-1452.e17. DOI: 10.1016/j.cell.2017.10.049.
- Connectopedia. Retrieved from https://clue.io/connectopedia/glossary.
- Henze, M. (n.d.). Explorations of Action - MoA EDA. Retrieved from https://www.kaggle.com/code/headsortails/explorations-of-action-moa-eda/report.
- Yamlahi, A. (n.d.). Drugs MoA classification: EDA. Retrieved from https://www.kaggle.com/code/amiiiney/drugs-moa-classification-eda/notebook.
- Levin, R., Cherepanova, V., Schwarzschild, A., et al. (2022). Transfer Learning with Deep Tabular Models. arXiv:2206.15306. Retrieved from https://arxiv.org/abs/2206.15306.
Please note that this is a partial list of references; additional sources may have been consulted during the project.
In progress
TODO:
- Define project structure
- Automate workflow with Makefile
- Integrate Kaggle API
- Create helper functions
- Create a data preprocessing pipeline
- Make Dataset class
- Build MLP model
- Construct a training loop
- Build ResNet model
- Implement model factory
- Implement cross-validation
- Monitor and log experiments
- Build FTTransformer for transfer learning
- Build TabNet model for transfer learning
- Conduct feature engineering
- Tune hyperparameters
- Perform model interpretation and explainability, compare models
- Ensemble models (MLP, ResNet, FTTransformer, and TabNet)
- Make an inference using the ensemble of models
- Document and organize code
- Automate report fetching
- Prepare visualizations and figures to support the findings
- Write a research report or paper summarizing the findings
Stay tuned for the upcoming results!
If you'd like to contribute, please fork the repository and use a feature branch. Pull requests are warmly welcome. Here's a quick guide:
- Fork the repo (button on the top right).
- Clone it to your local system (`git clone https://github.com/oleksandrsirenko/mechanisms-of-action-moa-prediction.git`).
- Make a new branch (`git checkout -b feature_branch`).
- Make your changes.
- Push the branch (`git push origin feature_branch`).
- Open a Pull Request on the GitHub page of this repository.
Before creating a Pull Request, please ensure your code follows the style guidelines and passes all the tests.
Oleksandr Sirenko - Data Scientist
- Github: @oleksandrsirenko
- LinkedIn: Oleksandr Sirenko
Your name could be here!
Interested in contributing? We're open to collaboration. Please follow the contributing guidelines.
The content of this project is licensed under the MIT license.
I'd like to express my gratitude to the Kaggle community for the inspiration and the vast pool of shared knowledge that helps me grow as a Data Scientist. I also want to thank my colleagues Martin Henze and Amin Yamlahi for their outstanding EDAs, which helped me understand the competition data. Links to their work are listed in the references.
If you have any feedback, questions, or ideas for future enhancements, feel free to reach out to me.
Stay tuned for project updates and announcements.
Feel free to contact me at [email protected] for any project-related queries.