csjackson0/chemnlp

 
 

ChemNLP project 🧪🚀

The ChemNLP project aims to

  1. create an extensive chemistry dataset and
  2. use it to train large language models (LLMs) that can leverage the data for a wide range of chemistry applications.

For more details see our information material section below.

Information material

Community

Feel free to join the #chemnlp channel on our OpenBioML Discord server to discuss the project in more detail.

Contributing

ChemNLP is an open-source project - your involvement is warmly welcome! If you're excited to join us, we recommend the following steps:

Note on the "ChemNLP" name

Our OpenBioML ChemNLP project is not affiliated with the ChemNLP library from NIST; we use "ChemNLP" as a general term to highlight our project focus. The datasets and models we create through our project will have a unique and recognizable name when we release them.

About OpenBioML.org

See https://openbioml.org, especially our approach and partners.

Installation and set-up

Create a new conda environment with Python 3.8:

conda create -n chemnlp python=3.8
conda activate chemnlp

To install the chemnlp package (and required dependencies):

pip install chemnlp

If you are developing the Python package:

pip install -e "chemnlp[dev]"  # to install development dependencies

If extra dependencies are required (e.g. for dataset creation) but are not needed for the main package, please add them to the dataset_creation variable in pyproject.toml and ensure the change is reflected in the conda.yml file.

Note

If you are working on model training, request access to the wandb project chemnlp and log in to wandb with your API key as described here.

Adding a new dataset (to the model training pipeline)

We specify datasets by creating a new function here, named after the dataset on Hugging Face. At present, the function must accept a tokenizer and return the tokenized train and validation datasets.
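
As a rough illustration of that contract, the sketch below shows the general shape such a function might take. Everything in it is an assumption made for illustration: the function name `my_org_my_dataset`, the `toy_tokenizer` stand-in, and the inlined example strings are hypothetical and not part of the chemnlp codebase. A real dataset function would presumably load its splits from Hugging Face and receive a real tokenizer from the training pipeline.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a tokenizer is modelled as any callable that maps a string
# to a dict containing "input_ids". Real tokenizers return richer objects.
Tokenizer = Callable[[str], Dict[str, List[int]]]


def toy_tokenizer(text: str) -> Dict[str, List[int]]:
    """Hypothetical stand-in tokenizer: one id per whitespace-separated token.

    The "id" is simply the token length, which keeps the example deterministic.
    """
    return {"input_ids": [len(token) for token in text.split()]}


def my_org_my_dataset(tokenizer: Tokenizer) -> Tuple[List[dict], List[dict]]:
    """Hypothetical dataset function, named after a made-up Hugging Face dataset.

    Accepts a tokenizer and returns (tokenized_train, tokenized_validation).
    """
    # Assumption: the raw splits are inlined here so the sketch runs offline.
    raw_splits = {
        "train": ["CCO is ethanol", "C1=CC=CC=C1 is benzene"],
        "validation": ["O is water"],
    }
    tokenized_train = [tokenizer(text) for text in raw_splits["train"]]
    tokenized_validation = [tokenizer(text) for text in raw_splits["validation"]]
    return tokenized_train, tokenized_validation


# Usage: the caller supplies the tokenizer and receives both splits.
train_ds, val_ds = my_org_my_dataset(toy_tokenizer)
```

Taking the tokenizer as a parameter, rather than constructing it inside the function, lets the same dataset function be reused with whichever tokenizer the model being trained requires.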
