This is the final project for UVA's graduate-level NLP course, Fall 2023. The project theme is "NLP for social good": the goal is to apply NLP techniques to a problem that has a positive social impact. Our group decided to build a DNA/protein sequence classifier that predicts the viral source of a given protein sequence. The classifier is trained on a dataset of DNA/protein sequences and their corresponding viral origins. The dataset was derived from the NCBI Virus data portal.
Our best model is available for use on Hugging Face 🤗.
We've provided a script to download the data. Run the following command to download the data to the data directory:
sh get_data.sh
The data will be in two separate files: covid.fasta and flu.fasta. Each file contains a list of protein sequences in FASTA format. The data is formatted as follows.
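In general, a FASTA record is a header line starting with ">" followed by one or more lines of sequence. As a rough illustration, here is a minimal sketch of reading such records in plain Python. The `parse_fasta` helper and the example records are hypothetical and are not part of this repository; the headers in covid.fasta and flu.fasta will look different.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from lines of FASTA text."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; they are joined back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration only (assumption: not taken from the real files).
example = """>example_record_1 hypothetical description
MGYINVFAFPFTIYSLLLCR
MNFRNYIAQVDVVNFNLT
>example_record_2 hypothetical description
MKTIIALSYIFCLVFA
"""

records = list(parse_fasta(example.splitlines()))
for header, sequence in records:
    print(header, len(sequence))
```

To read one of the downloaded files instead, pass an open file handle, for example `parse_fasta(open("data/covid.fasta"))`, assuming the download script placed the file there.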
We've provided a preprocess script that will convert the data into a format that can be used by the model. Run the following command to preprocess the data:
cd data
python preprocess.py
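The details of preprocess.py are not described here. As a rough idea of what a preprocessing step for this kind of task might involve, the sketch below pairs each sequence with the label of the file it came from, drops empty and duplicate sequences, and shuffles the result. The `build_labeled_rows` function, its parameters, and the "covid"/"flu" label strings are assumptions for illustration, not the script's actual behavior.

```python
import random
from typing import Dict, List, Tuple


def build_labeled_rows(
    sequences_by_label: Dict[str, List[str]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Return shuffled (sequence, label) rows with empty and duplicate sequences removed."""
    rows = []
    seen = set()
    for label, sequences in sequences_by_label.items():
        for seq in sequences:
            seq = seq.strip().upper()  # normalize case and whitespace
            if not seq or seq in seen:
                continue  # a repeated sequence keeps only its first label
            seen.add(seq)
            rows.append((seq, label))
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(rows)
    return rows


# Hypothetical input: in practice these lists would come from the FASTA files.
rows = build_labeled_rows({"covid": ["MGYINV", "mgyinv", ""], "flu": ["MKTIIA"]})
for sequence, label in rows:
    print(label, sequence)
```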
To begin, install the dependencies:
pip install -r requirements.txt
Included in the repository is a notebook that will enable you to train a model on the data. The notebook is called train.ipynb. You can run the notebook in Google Colab or on your local machine.
Models can be exported using the model.export() method. An exported model can be loaded again using from_pretrained. Alternatively, you can load the pretrained model from Hugging Face:
from dna_classification.models import DNASequenceClassifier
model = DNASequenceClassifier("nleroy917/viral-sequence-prediction")
virus = model.predict("MGYINVFAFPFTIYSLLLCRMNFRNYIAQVDVVNFNLT")
print(virus) # covid
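To classify several sequences at once, a small helper can tally the predicted labels. This is a minimal sketch that assumes only what the example above shows: a model object with a `predict` method that takes one sequence string and returns a label string. The `count_predictions` helper and the `StubClassifier` stand-in are hypothetical; the stub's rule is arbitrary and exists only so the sketch runs without downloading a model.

```python
from collections import Counter
from typing import Iterable, Protocol


class SequenceClassifier(Protocol):
    """Anything with a predict(sequence) -> label method."""

    def predict(self, sequence: str) -> str: ...


def count_predictions(model: SequenceClassifier, sequences: Iterable[str]) -> Counter:
    """Tally the labels the model predicts for each sequence."""
    return Counter(model.predict(seq) for seq in sequences)


class StubClassifier:
    """Stand-in model with an arbitrary rule, for illustration only."""

    def predict(self, sequence: str) -> str:
        return "covid" if sequence.startswith("MGY") else "flu"


counts = count_predictions(StubClassifier(), ["MGYINV", "MKTIIA", "MGYAAA"])
print(counts)
```

With the real classifier, the model loaded in the example above would be passed in place of `StubClassifier()`.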