first commit
michael1788 committed Oct 26, 2021
0 parents commit fa609bb
Showing 76 changed files with 1,686,785 additions and 0 deletions.
129 changes: 129 additions & 0 deletions .gitignore
@@ -0,0 +1,129 @@
# SPECIFIC to this repo
*.sdf
*.npy
*.txt
*.zip
*.pkl
*.h5
!pretrained_models/CLM.h5
*.hdf5
*.ckpt
*.index
*.ckpt*
*.pickle
*.png
*.jpg
*.out
*.csv
!/data/*
!/data/*/*
experiments/outputs/

# mac stuff
*.DS_Store

# GENERAL python and mac stuff
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
133 changes: 133 additions & 0 deletions README.md
@@ -0,0 +1,133 @@
# Perplexity-based molecule ranking and bias estimation of chemical language models

## Table of Contents
1. [Description](#Description)
2. [Requirements](#Requirements)
3. [How to run an experiment](#Run)
4. [How to cite this work](#Cite)
5. [License](#License)
6. [Address](#Address)


### Description<a name="Description"></a>

This is the supporting code for the paper «Perplexity-based molecule ranking and bias estimation of chemical language models».

**Abstract of the paper**: Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as the simplified molecular input line entry systems (SMILES) strings, in a rule-free manner. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score allows identifying and ranking the most promising molecular designs based on the probabilities learned by the CLM. Using perplexity to compare “greedy” (beam search) with “explorative” (multinomial sampling) methods for SMILES generation, certain advantages of multinomial sampling become apparent. Additionally, perplexity scoring is performed to identify undesired model biases introduced during model training and allows the development of a new ranking system to remove those undesired biases.

### Requirements<a name="Requirements"></a>

First, you need to [clone the repository](https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository):

```
git clone git@github.com:michael1788/CLM_perplexity.git
```
Then, you can run the following command, which will create a conda virtual environment and install all the needed packages (if you don't have conda, you can follow the instructions to install it [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)).

```
cd CLM_perplexity/
conda env create -f environment.yml
```

Once the installation is done, you can activate the virtual conda environment for this project:

```
conda activate plex
```
Please note that you will need to activate this virtual conda environment every time you want to use this project.

### How to run an experiment<a name="Run"></a>

You can run an example experiment based on the data used in the paper by following the procedure described in A. and B.
The output after each step can be found in *CLM_perplexity/experiments/outputs/FT/{configuration_file_name}/*.

A. Compare the perplexity of SMILES strings generated with multinomial sampling and the beam search.

A1. Process the data to fine-tune the chemical language model (CLM):
Note: the first set of commands takes as argument the path to the configuration file (in this example, *configfiles/FT/A01.ini* where A01.ini is the name of the configuration file, and FT stands for Fine-Tuning).
```
cd experiments/
sh run_data_processing.sh configfiles/FT/A01.ini
```
Note: the pretrained weights of the CLM are provided.

A2. Fine-tune the CLM:
```
sh run_training.sh configfiles/FT/A01.ini
```

A3. Generate SMILES strings with multinomial sampling:
Note: the following commands take as arguments the path to the configuration file and the range of epochs at which you want to carry out the experiment (start, step, end). In this example, the experiment will be done for SMILES strings sampled at epochs 2 and 4.
```
sh run_generation_multinomial.sh configfiles/FT/A01.ini 2 2 4
```
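
As a rough illustration of how the three numeric arguments might expand into epochs, the sketch below assumes that the end value is inclusive, which matches the example above (`2 2 4` giving epochs 2 and 4). The function name `epochs_from_range` is hypothetical and is not part of this repository.

```python
def epochs_from_range(start, step, end):
    """Expand (start, step, end) into a list of epochs.

    Assumption: `end` is inclusive, as suggested by the README example.
    """
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return list(range(start, end + 1, step))


print(epochs_from_range(2, 2, 4))  # [2, 4]
```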

A4. Process the generated SMILES strings:
```
sh run_process_multinomial_generated.sh configfiles/FT/A01.ini 2 2 4
```

A5. Extract the probabilities from the CLM:
```
sh run_proba_extraction_multinomial.sh configfiles/FT/A01.ini 2 2 4
```

A6. Compute the perplexity:
```
sh run_get_perplexity_multinomial.sh configfiles/FT/A01.ini 2 2 4
```
You can find a .csv file with the results in the *outputs/* directory, under *perplexity/*.
Note: only the de novo molecules (with respect to the pretraining and fine-tuning data) will be considered. You can change the argument in the bash file (.sh) if you also want to include molecules that are not de novo.
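
For orientation, the perplexity of a SMILES string is commonly defined as the exponential of the mean negative log-probability of its tokens. The sketch below follows that standard definition. It is not taken from the scripts in this repository, and the function name and the input format (a list of per-token probabilities) are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sequence from its per-token probabilities.

    Standard definition: exp(-(1/N) * sum(log p_i)).
    Assumption: `token_probs` holds the probability the model assigned
    to each token that actually occurs in the SMILES string.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)
```

A lower value means the model was, on average, more confident about each token. A sequence whose tokens all have probability 1 has a perplexity of exactly 1.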

A7. Generate SMILES strings with the beam search:
```
sh run_generation_beam.sh configfiles/FT/A01.ini 2 2 4
```

A8. Process the beam search generated SMILES strings:
```
sh run_process_beam_generated.sh configfiles/FT/A01.ini 2 2 4
```

A9. Extract the probabilities from the CLM:
```
sh run_proba_extraction_beam.sh configfiles/FT/A01.ini 2 2 4
```

A10. Compute the perplexity:
```
sh run_get_perplexity_beam.sh configfiles/FT/A01.ini 2 2 4
```
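
If you prefer to chain steps A1 to A10 rather than typing each command, something like the following sketch could work. The script names are copied from the steps above; the helper names `build_commands` and `run_all`, and the assumption that you launch it from the repository root (so that `experiments/` is the working directory for the scripts), are hypothetical and not part of this repository.

```python
import subprocess

# Steps A1 and A2 take only the configuration file.
CONFIG_ONLY = ["run_data_processing.sh", "run_training.sh"]

# Steps A3 to A10 also take the epoch range (start, step, end).
WITH_EPOCHS = [
    "run_generation_multinomial.sh",
    "run_process_multinomial_generated.sh",
    "run_proba_extraction_multinomial.sh",
    "run_get_perplexity_multinomial.sh",
    "run_generation_beam.sh",
    "run_process_beam_generated.sh",
    "run_proba_extraction_beam.sh",
    "run_get_perplexity_beam.sh",
]


def build_commands(config, start, step, end):
    """Return the A1-A10 command lines in the order given in the README."""
    epoch_args = [str(start), str(step), str(end)]
    commands = [["sh", script, config] for script in CONFIG_ONLY]
    commands += [["sh", script, config] + epoch_args for script in WITH_EPOCHS]
    return commands


def run_all(config, start, step, end, cwd="experiments"):
    """Run every step in order and stop at the first failure (check=True)."""
    for cmd in build_commands(config, start, step, end):
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `run_all("configfiles/FT/A01.ini", 2, 2, 4)` would then correspond to the example shown in steps A1 to A10.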

B. Compute the delta rank of the SMILES strings generated with multinomial sampling.

B1. Extract the probabilities with the pretrained CLM:
```
sh run_proba_extraction_multinomial_from_pretrained.sh configfiles/FT/A01.ini 2 2 4
```

B2. Compute the perplexity:
```
sh run_get_perplexity_multinomial_from_pretrained.sh configfiles/FT/A01.ini 2 2 4
```

B3. Finally, compute the delta:
```
sh run_get_delta.sh configfiles/FT/A01.ini 2 2 4
```
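
To give an idea of what a delta between two rankings can look like, the sketch below ranks the same molecules by perplexity under the fine-tuned and the pretrained CLM and subtracts the two ranks. This is only an illustration: the sign convention (fine-tuned rank minus pretrained rank), the tie handling, and the function names are assumptions, and the actual computation is the one performed by `run_get_delta.sh`.

```python
def ranks(perplexities):
    """Map each SMILES to a 1-based rank, lowest perplexity first.

    Ties keep the insertion order of the input dict (stable sort).
    """
    ordered = sorted(perplexities, key=perplexities.get)
    return {smi: i for i, smi in enumerate(ordered, start=1)}


def delta_rank(ppl_finetuned, ppl_pretrained):
    """Rank difference per molecule (assumed: fine-tuned minus pretrained).

    Only molecules scored by both models are ranked and compared.
    """
    common = [smi for smi in ppl_finetuned if smi in ppl_pretrained]
    r_ft = ranks({smi: ppl_finetuned[smi] for smi in common})
    r_pt = ranks({smi: ppl_pretrained[smi] for smi in common})
    return {smi: r_ft[smi] - r_pt[smi] for smi in common}
```

Under this convention, a negative delta means the molecule is ranked better (lower perplexity rank) by the fine-tuned model than by the pretrained one.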

### How to cite this work<a name="Cite"></a>
```
tbd
```

### License<a name="License"></a>
[MIT License](LICENSE)

### Address<a name="Address"></a>
MODLAB
ETH Zurich
Inst. of Pharm. Sciences
HCI H 413
Vladimir-Prelog-Weg 4
CH-8093 Zurich
10 changes: 10 additions & 0 deletions data/ft_files/CHEMBL1836_10.txt
@@ -0,0 +1,10 @@
CC(=O)c1ccc(NS(=O)(=O)c2cc(C(=O)Nc3ccc(C(=O)O)cc3)ccc2C)cc1
COc1ccc(-c2ccc(OCc3cc(C(=O)NS(=O)(=O)c4ccc(C)cn4)oc3C)cc2)cc1
O=C(O)c1ccc(CC[C@H]2C(=O)CC[C@@H]2/C=C/C(O)Cc2ccccc2F)cc1
CCCCCC(O)CCc1cccc(=O)n1CCCCCCC(=O)O
C[C@@H](NC(=O)c1c(Cl)sc(Cl)c1Cc1cccc(Cl)c1)c1ccc(C(=O)O)cc1
Cc1cc(C(=O)O)cc(C)c1NC(=O)c1cc(-c2cccc(CO)c2)cc2ccccc12
CCn1c2ccccc2c2cc(-c3nc4cc(C(=O)O)c(C)cc4n3CCOC)ccc21
Cc1cc(Cl)ccc1-c1cccc([C@H](O)CC[C@H]2CCCC(=O)N2CCSCCCC(=O)O)c1
COC(=O)CCC/C=C\C[C@H]1C(=O)C(C)(C)[C@@H](O)[C@@H]1/C=C/C(O)CCc1ccccc1
O=C(O)c1ccc(C2(NC(=O)C3CC4CC4CN3Cc3ccc(C(F)(F)F)cc3)CC2)cc1
20 changes: 20 additions & 0 deletions data/ft_files/CHEMBL1836_20.txt
@@ -0,0 +1,20 @@
CCCCC[C@H](O)/C=C/[C@H]1[C@H](O)CC(=O)N1C/C=C/CCCC(=O)O
CCOc1c2c(c(OCC)c3ncccc13)CN(c1ccc(CCNC(=O)Cc3ccccc3OC)cc1C)C2=O
CCCCC[C@@H](O)/C=C/[C@H]1[C@H](O)CC(=O)[C@@H]1C/C=C\CCCC(=O)O
CCCCC[C@H](O)/C=C/[C@H]1CCC(=O)N1CCCc1cccc(C(=O)O)c1
C[C@H](NC(=O)c1c(C(F)(F)F)nn(C)c1Oc1ccc(Cl)c(Cl)c1)c1ccc(C(=O)O)cc1
C[C@H](NC(=O)c1c(Cl)sc(Cl)c1Cc1cccc(C(F)(F)F)c1)c1ccc(C(=O)O)cc1
C[C@@H](c1ccccc1)[C@H](O)/C=C/[C@H]1CC(F)(F)C(=O)N1CCCc1ccc(C(=O)O)s1
C[C@@](O)(C/C=C/[C@H]1CCC(=O)[C@@H]1CCSc1nc(C(=O)O)cs1)CCC(F)(F)F
CCn1c2ccc(Cl)cc2c2cc(-c3nc4c(C)c(C(=O)O)ccc4n3CCOC)ccc21
CC#CCC1([C@H](O)/C=C/[C@H]2COC(=O)N2CCSc2nc(C(=O)O)cs2)CCC1
Cc1oc(CC(=O)O)cc1COc1ccc(-c2ccccc2)cc1
CCCCCC(O)CCN1CCC(=O)N1CCc1ccc(C(=O)O)cc1
C[C@H](NC(=O)c1cc(F)cc2c1N(Cc1ccc(C(F)(F)F)cc1)CC2)c1ccc(C(=O)O)cc1
C/C(=C/C=C/[C@H]1CCC(=O)N1CCc1ccc(C(=O)O)cc1)CC1CCCC1
CCOc1ccccc1CC(=O)NS(=O)(=O)Cc1ccc(N2Cc3c(c(OCC)c4cccnc4c3OCC)C2=O)c(C)c1
O=C(O)COC[C@H]1CC[C@H](COC(=O)N(c2ccccc2)c2ccc(Cl)cc2)CC1
C[C@H](NC(=O)c1cccc2c1N(Cc1cccc(C(N)=O)c1)CC2)c1ccc(C(=O)O)cc1
CCOc1c2c(c(OCC)c3ncccc13)CN(c1ccc(CS(=O)(=O)NC(=O)Cc3c(Cl)cccc3Cl)cc1C)C2=O
Cc1cc(C)c2sc(C(=O)N(C)C)c(-c3ccc(CCNC(=O)NS(=O)(=O)c4ccccc4Cl)cc3)c2c1
C[C@H](NC(=O)c1cccc2ccn(Cc3cc(Cl)cc(OC(F)(F)F)c3)c12)c1ccc(C(=O)O)cc1
40 changes: 40 additions & 0 deletions data/ft_files/CHEMBL1836_40.txt
@@ -0,0 +1,40 @@
CCOc1c2c(c(OCC)c3ncccc13)CN(c1ccc(CCNC(=O)Cc3ccccc3OC)cc1C)C2=O
COc1ccc(Cn2nnc(-c3ccccc3)c2C(=O)N[C@@H](C)c2ccc(C(=O)O)cc2)cc1F
COc1ccc(-c2ccc(OCc3cc(C(=O)O)oc3C)cc2)cc1
Cc1ccc(S(=O)(=O)NC(=O)NCCc2ccc(-c3c(C(=O)N4CCCC4)sc4c(C)cc(C)cc34)cc2)cc1
CCOc1c2c(c(OCC)c3ccccc13)C(=O)N(c1ccc(CS(=O)(=O)NC(=O)C(c3ccccc3)c3ccccc3)cc1)C2
C[C@H](NC(=O)c1cc(Cl)ccc1Oc1cccc(F)c1)c1ccc(C(=O)O)cc1
Cc1ccc(C(=O)O)c(C)c1NC(=O)c1cc(-c2cccc(Cl)c2)cc2ccccc12
CCCCC[C@H](O)/C=C/[C@H]1[C@H](O)CC(=O)[C@@H]1C/C=C\CCCC(=O)O
CC(C)=C/C=C/[C@H]1CCC(=O)N1CCc1ccc(C(=O)O)cc1
Cc1cc(CC2(NC(=O)NS(=O)(=O)c3ccccc3Cl)CC2)ccc1N1Cc2c(c(OCC(F)(F)F)c3cccnc3c2OCC(F)(F)F)C1=O
O=C(O)c1ccc(C2(NC(=O)C3CC4(CCN3Cc3ccc(C(F)(F)F)cc3)CC4)CC2)cc1
Cc1ccc(-c2cccc(CO)c2)nc1C(=O)Nc1c(C)ccc(C(=O)O)c1C
CCOc1c2c(c(OCC)c3ccccc13)C(=O)N(c1ccc(CC3(NC(=O)NS(=O)(=O)c4ccccc4Br)CC3)cc1C)C2
COCc1cccc(C[C@H](O)/C=C/[C@H]2CCCC(=O)N2CCCCCCC(=O)O)c1
O=C(O)c1ccc(CCN2C(=O)CCN2CCC(O)Cc2cccc(Cl)c2)cc1
O=C(O)CCCCCCN1C(=O)CC[C@@H]1/C=C/C(O)c1cccc(Oc2ccccc2)c1
CCCCC[C@H](O)/C=C/[C@H]1CCC(=O)N1CCc1ccc(C(=O)O)s1
O=C(O)c1ccc(CCN2C(=O)CCN2CC[C@@H](O)Cc2cccc(I)c2)cc1
O=C(O)c1ccc(CNC(=O)c2c(Cl)sc(Cl)c2Cc2cccc(Cl)c2)cc1
C[C@H](NC(=O)c1c(Cl)sc(Cl)c1Cc1cccc(Cl)c1)c1ccc(C(=O)O)cc1
O=C(O)c1ccc(CC[C@H]2C(=O)CC[C@@H]2/C=C/C(O)Cc2ccc3ccccc3c2)cc1
CCCOc1c2c(c(OCCC)c3ccccc13)C(=O)N(c1ccc(CC(=O)O)cc1)C2
C[C@@H](CCCc1ccccc1)[C@H](O)/C=C/[C@H]1CC(F)(F)C(=O)N1CCCc1ccc(C(=O)O)s1
CC#CCC1([C@H](O)/C=C/[C@H]2COC(=O)N2CCSc2nc(C(=O)O)cs2)CCC1
CCOc1c2c(c(OCC)c3ccccc13)C(=O)N(c1ccc(CC3(NC(=O)NS(=O)(=O)c4ccccc4C(F)(F)F)CC3)cc1C)C2
C[C@H](NC(=O)c1cccc2ccn(Cc3cccc(C(F)(F)F)c3)c12)c1ccc(C(=O)O)cc1
C[C@H](NC(=O)c1cccc2c1N(Cc1ccc(C(=O)O)cc1)CC2)c1ccc(C(=O)O)cc1
CC1(C)C(=O)[C@H](C/C=C\CCCC(=O)O)[C@@H](/C=C/C(O)CCc2sc3ccccc3c2Cl)[C@@H]1O
COC(=O)CCC/C=C\C[C@H]1C(=O)C(C)(C)[C@@H](O)[C@@H]1/C=C/C(O)CCc1ccccc1
O=C(O)c1ccc(CC[C@H]2C(=O)CC[C@@H]2/C=C/C(O)Cc2ccc3ccsc3c2)cc1
C[C@H](NC(=O)c1cc(Cl)cnc1Oc1cccc(C#N)c1)c1ccc(C(=O)O)cc1
CCOc1c2c(c(OCC)c3ccccc13)C(=O)N(c1ccc(CC3(NC(=O)NS(=O)(=O)c4cccc(Cl)c4Cl)CC3)cc1C)C2
O=C(O)c1csc(SCCN2C(=O)OC[C@@H]2/C=C/[C@@H](O)C2(CCCCF)CCC2)n1
O=C(O)c1ccc(CNC(=O)c2cc(Cl)cnc2Oc2ccc(F)cc2)cc1
Cc1sc(C)c(C(=O)NC2(c3ccc(C(=O)O)cc3)CC2)c1Cc1ccc(C(F)(F)F)cc1
CC1(C)C(=O)[C@H](C/C=C\CCCC(=O)O)[C@@H](/C=C/C(O)Cc2ccccc2)[C@@H]1O
O=C(O)c1ccc(CCN2C(=O)CCN2CCC(O)Cc2cccc(OC(F)(F)F)c2)cc1
C[C@H](NC(=O)c1cc(Cl)cnc1Oc1cccc(C(F)(F)F)c1)c1ccc(C(=O)O)cc1
O=C(O)c1ccc(C2(NC(=O)C3CC4CC4CN3Cc3ccc(C(F)(F)F)cc3)CC2)cc1
CCn1c2ccccc2c2cc(-c3nc4cc(C(=O)O)ccc4n3C(C)C)ccc21
5 changes: 5 additions & 0 deletions data/ft_files/CHEMBL1836_5.txt
@@ -0,0 +1,5 @@
Cc1oc(CC(=O)O)cc1COc1ccc(-c2ccccc2)cc1
O=C(O)COCCCCN1C(=O)CC[C@@H]1/C=C/[C@@H](O)Cc1cccc(C(F)(F)F)c1
CCn1c2ccccc2c2cc(-c3nc4cc(C(=O)O)ccc4n3CC3CC3)ccc21
O=C(O)CCCSCCN1C(=O)CCC[C@@H]1/C=C/[C@@H](O)CCC1CC1
CCCCC(C)(C)[C@H](O)/C=C/[C@H]1CCC(=O)N1CCSc1nc(C(=O)O)cs1
10 changes: 10 additions & 0 deletions data/ft_files/CHEMBL1945_10.txt
@@ -0,0 +1,10 @@
COc1ccc2c(c1)C(CNC(C)=O)Cc1ccccc1C2=O
CC(=O)N/C=C/c1cccc2ccccc12
CCNC(=S)NCc1cc2c(ccc3ccc(OC)cc32)o1
CC(=O)NCC(Cc1ccccc1)c1ccccc1
COc1ccc2[nH]c(Cn3ccc4ccccc43)c(CCNC(C)=O)c2c1
CCC(=O)NCCCc1cccc2nc(CC)oc12
CCCC1Nc2cccc([C@@H]3C[C@H]3CNC(C)=O)c2O1
CCCc1nc2cccc(CCCNC(=O)NCC)c2o1
CCCC(=O)NCC1Cc2cccc3ccc(OC)c1c23
COc1ccc2ccc3occ(CCNC(C)=O)c3c2c1
20 changes: 20 additions & 0 deletions data/ft_files/CHEMBL1945_20.txt
@@ -0,0 +1,20 @@
COc1ccc2scc(CCNC(=O)CI)c2c1
COc1ccc2[nH]c(Cc3ccccc3)c(CCNC(C)=O)c2c1
COc1ccc2c(I)cn(CCNC(C)=O)c2n1
CC(=O)NCCc1coc2ccc(OC(C)C)cc12
CC(=O)NCCc1c(-c2ccccc2)nn2ccc3c(c12)CCO3
CCOC(=O)c1[nH]c2ccc(OC)c3c2c1[C@@H](CNC(C)=O)CC3
CC(=O)NCCC1=C(c2csc3ccccc23)Cc2ccc3c(c21)CCO3
CCC(=O)NCC1(c2c[nH]c3ccc(OC)cc23)CCC1
COc1cc2c(CCNC(C)=O)c[nH]c2cc1O
CC(C)c1nc2cccc(CCCNC(=O)C3CC3)c2o1
COc1ccc2oc(Cc3ccccc3)c(CCNC(C)=O)c2c1
CCC(=O)NCCc1ccnc2ccc(OC)cc12
COc1ccc2ccc3c(c2c1)C(CNS(C)(=O)=O)CC3
COc1ccc2[nH]c(CN3CCCC3)c(CCNC(C)=O)c2c1
COc1ccc2ccc3c(c2c1)C(CNC(C)=O)CC3
CCOc1ccc(OC)cc1CCCNC(C)=O
CC(C)C(=O)NCCCc1cccc2nc(-c3ccccc3)oc12
CCCCOc1ccc2ccc(OC)cc2c1CCNC(=O)CC
COc1ccc2cccc(CCC(N)=O)c2c1
O=C(CBr)NCCc1c(Sc2ccc(F)cc2)sc2ccc(F)cc12