OpenBioML/bio-chem-lm

High level goals

This repository is dedicated to the development of (large) language models and language/structure models in the bio-chem space.

Further details can be found here

Bio-LM PubChem Selfies

We are training an ELECTRA-style model on the PubChem dataset with SELFIES representations. SELFIES is a string-based chemical language that builds on SMILES but is more robust: every SELFIES string decodes to a valid molecule. More info about SELFIES can be found here.
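To make the setup above more concrete, here is a minimal sketch of how a SELFIES string could be split into tokens and turned into an ELECTRA-style replaced-token-detection example. It is an illustration under stated assumptions, not this repository's pipeline: the function names `split_selfies_tokens` and `corrupt` are hypothetical, the tokenizer is a plain regex (the `selfies` package offers `split_selfies` for the same purpose), and a random replacement stands in for the small generator network that ELECTRA normally uses to propose replacements.

```python
import random
import re


def split_selfies_tokens(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols, e.g. '[C]', '[=C]'."""
    return re.findall(r"\[[^\[\]]+\]", selfies)


def corrupt(tokens, vocab, replace_prob, rng):
    """Replace some tokens at random and label each position.

    Label 1 means "replaced", 0 means "original". As in ELECTRA, a position
    whose sampled replacement equals the original token is labeled 0.
    NOTE: random sampling is a stand-in for a learned generator (assumption).
    """
    corrupted, labels = [], []
    for tok in tokens:
        new_tok = rng.choice(vocab) if rng.random() < replace_prob else tok
        corrupted.append(new_tok)
        labels.append(int(new_tok != tok))
    return corrupted, labels


# Benzene written in SELFIES.
benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
tokens = split_selfies_tokens(benzene)

# Toy vocabulary built from this single molecule (assumption for the demo).
vocab = sorted(set(tokens))
corrupted, labels = corrupt(tokens, vocab, replace_prob=0.15, rng=random.Random(0))

print(tokens)
print(corrupted)
print(labels)
```

The discriminator would then be trained to predict `labels` from `corrupted`, which gives a training signal at every token position rather than only at masked ones.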

We have released the dataset on HuggingFace Datasets; it contains ~110M compounds in total.

We will perform a hyperparameter search using Maximal Update Parameterization (muP) on a small model to find hyperparameters that transfer to a larger model. To launch a sweep on the cluster, run

sbatch --array=1-N mup_train.sh
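With `--array=1-N`, Slurm starts N array tasks and exposes each task's index in the `SLURM_ARRAY_TASK_ID` environment variable. One common way to use this in a sweep is to map each index to one hyperparameter configuration. The sketch below shows that pattern only; the search space, the `config_for_task` helper, and the idea that `mup_train.sh` selects its configuration this way are all assumptions, not a description of the script in this repository.

```python
import itertools
import os

# Hypothetical search space; the real sweep may use different values.
LEARNING_RATES = [1e-4, 3e-4, 1e-3]
WIDTHS = [128, 256]


def config_for_task(task_id: int) -> dict:
    """Map a 1-based Slurm array task ID to one point on the grid."""
    grid = list(itertools.product(LEARNING_RATES, WIDTHS))
    if not 1 <= task_id <= len(grid):
        raise ValueError(f"task_id must be in 1..{len(grid)}, got {task_id}")
    learning_rate, width = grid[task_id - 1]
    return {"learning_rate": learning_rate, "width": width}


if __name__ == "__main__":
    # Falls back to task 1 when run outside of a Slurm array job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(config_for_task(task_id))
```

With this hypothetical grid there are six configurations, so the matching launch would use `--array=1-6`.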
