🦺 SAFE

Sequential Attachment-based Fragment Embedding (SAFE) is a novel molecular line notation that represents molecules as an unordered sequence of fragment blocks to improve molecule design using generative models.

Paper | Docs | 🤗 Model | 🤗 Training Dataset

Overview of SAFE

SAFE is the deep learning molecular representation. It's an encoding leveraging a peculiarity in the decoding schemes of SMILES, to allow representation of molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings, and thus are able to preserve the same amount of information. The intuitive representation of molecules as an ordered sequence of connected fragments greatly simplifies the following tasks often encountered in molecular design:

de novo design
superstructure generation
scaffold decoration
motif extension
linker generation
scaffold morphing.

The construction of a SAFE strings requires defining a molecular fragmentation algorithm. By default, we use [BRICS], but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by datamol or RDKit.

Installation

You can install safe using pip:

pip install safe-mol

You can use conda/mamba:

mamba install -c conda-forge safe-mol

Datasets and Models

Type	Name	Infos	Size	Comment
Model	datamol-io/safe-gpt	87M params	350M	Default model
Training Dataset	datamol-io/safe-gpt	1.1B rows	250GB	Training dataset
Drug Benchmark Dataset	datamol-io/safe-drugs	26 rows	20 kB	Benchmarking dataset

Usage

Please refer to the documentation, which contains tutorials for getting started with safe and detailed descriptions of the functions provided, as well as an example of how to get started with SAFE-GPT.

API

We summarize some key functions provided by the safe package below.

Function	Description
`safe.encode`	Translates a SMILES string into its corresponding SAFE string.
`safe.decode`	Translates a SAFE string into its corresponding SMILES string. The SAFE decoder just augment RDKit's `Chem.MolFromSmiles` with an optional correction argument to take care of missing hydrogen bonds.
`safe.split`	Tokenizes a SAFE string to build a generative model.

Examples

Translation between SAFE and SMILES representations

import safe

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"

# SMILES -> SAFE -> SMILES translation
try:
    ibuprofen_sf = safe.encode(ibuprofen)  # c12ccc3cc1.C3(C)C(=O)O.CC(C)C2
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True)  # CC(C)Cc1ccc(C(C)C(=O)O)cc1
except safe.EncoderError:
    pass
except safe.DecoderError:
    pass

ibuprofen_tokens = list(safe.split(ibuprofen_sf))

Training/Finetuning a (new) model

A command line interface is available to train a new model, please run safe-train --help. You can also provide an existing checkpoint to continue training or finetune on you own dataset.

For example:

safe-train --config <path to config> \
    --model-path <path to model> \
    --tokenizer  <path to tokenizer> \
    --dataset <path to dataset> \
    --num_labels 9 \
    --torch_compile True \
    --optim "adamw_torch" \
    --learning_rate 1e-5 \
    --prop_loss_coeff 1e-3 \
    --gradient_accumulation_steps 1 \
    --output_dir "<path to outputdir>" \
    --max_steps 5

References

If you use this repository, please cite the following related paper:

@misc{noutahi2023gotta,
      title={Gotta be SAFE: A New Framework for Molecular Design},
      author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C Lim and Prudencio Tossou},
      year={2023},
      eprint={2310.10773},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Note that all data and model weights of SAFE are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage. See DATA_LICENSE for details.

This code base is licensed under the Apache-2.0 license. See LICENSE for details.

Development lifecycle

Setup dev environment

mamba create -n safe -f env.yml
mamba activate safe

pip install --no-deps -e .

Tests

You can run tests locally with:

pytest

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
.github		.github
docs		docs
expts		expts
safe		safe
tests		tests
.gitignore		.gitignore
DATA_LICENSE		DATA_LICENSE
LICENSE		LICENSE
README.md		README.md
env.yml		env.yml
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🦺 SAFE

Sequential Attachment-based Fragment Embedding (SAFE) is a novel molecular line notation that represents molecules as an unordered sequence of fragment blocks to improve molecule design using generative models.

Overview of SAFE

Installation

Datasets and Models

Usage

API

Examples

Translation between SAFE and SMILES representations

Training/Finetuning a (new) model

References

License

Development lifecycle

Setup dev environment

Tests

About

Releases

Packages

Languages

License

timofoloyo/safe

Folders and files

Latest commit

History

Repository files navigation

🦺 SAFE

Sequential Attachment-based Fragment Embedding (SAFE) is a novel molecular line notation that represents molecules as an unordered sequence of fragment blocks to improve molecule design using generative models.

Overview of SAFE

Installation

Datasets and Models

Usage

API

Examples

Translation between SAFE and SMILES representations

Training/Finetuning a (new) model

References

License

Development lifecycle

Setup dev environment

Tests

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages