Run GPT-2 trained on Discord messages.
The goal of this project is to train GPT-2 on your Discord server's message history.
See the Instructions section for details on how to use the project.
Python and Docker must be installed.
Complete sections in this order:
If you intend to use a GPU to train the model, see GPU Setup.
If you will only use a CPU, complete the steps in CPU Only Setup.
Install Python dependencies:
pip3 install -r requirements.txt
Training on a GPU can be much faster than training on a CPU. However, a few things must be set up first, and it can be a little tricky.
To set up GPU support:
- Install Nvidia CUDA
- Install Nvidia cuDNN
On Windows the installation process is a little confusing. Once you download cuDNN, extract the ZIP file and open the `cuda` folder. Inside this folder there should be a `lib`, `include`, and `bin` folder. Find the location where CUDA is installed (likely `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7`). Then copy the `lib`, `include`, and `bin` folders from cuDNN into the CUDA directory. If prompted, select to overwrite any files.
- Install Conda
- Open the "Anaconda Prompt" application and run:
conda env create -f conda-environment.yml
- Make sure to run all commands in the Anaconda environment
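Once the environment is active, a quick way to confirm PyTorch can actually see the GPU is to query the CUDA runtime. This is a minimal check, assuming PyTorch was installed into the environment as described above:

```python
import torch

# Reports whether the CUDA runtime and a compatible GPU are visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the first visible GPU device.
    print("Device:", torch.cuda.get_device_name(0))
```

If this prints `CUDA available: False`, the cuDNN/CUDA installation steps above likely need to be revisited before running training with the GPU.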
The training process can be broken into these broad steps:
A tool named Discord Chat Exporter is used to download all messages from a Discord server as JSON files. This tool is run via Docker. To do so:
- Create a copy of `.env-example` named `.env` and fill in your own values
- The `DISCORD_TOKEN` value must be a Discord API token for a bot
- The bot must have the `bot` OAuth2 scope
- The bot must have the permissions: Read Messages/View Channels, Read Message History
- Invite this bot using the URL generated by the Discord Developer Dashboard OAuth2 > URL Generator page
- Source the `.env` file: `source ./.env`
- Run the exporter tool: `./scripts/download-discord.sh`
This will download Discord messages into the `discord-messages/` directory. One JSON file will be created for each channel in the server you specified via the `DISCORD_GUILD` env var.
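The variable names below are the ones referenced in the steps above; the values are placeholders, not real credentials. This is a sketch of what a filled-in `.env` might look like (the `.env-example` file in the repository is the authoritative reference, and may contain additional variables):

```shell
# Discord bot API token used by Discord Chat Exporter (placeholder value)
export DISCORD_TOKEN="your-bot-token-here"
# ID of the server whose channels should be exported (placeholder value)
export DISCORD_GUILD="123456789012345678"
```

The `export` keyword ensures the variables are visible to child processes after `source ./.env`.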
The contents of the JSON message dumps in the `discord-messages/` directory are not suitable for machine learning training. First they must be processed into a form which works better for the model.
- The message dump files must be combined into one file which only contains the relevant details about each message. To do this run: `python ./src/combine_training_data.py`
This will create the `training-data/discord-messages.txt` file.
- A custom tokenizer needs to be built for the training data. To do this run: `python ./src/build_tokenizer.py`
This will create several files in the `training-data/` directory which store the parameters of the tokenizer.
- The messages need to be broken up by tokens and encoded into a more efficient format. Run: `python ./src/encode_dataset.py`
This will create the `training-data/discord-messages.tar.gz` file, which will hold the encoded dataset.
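The combining step can be pictured roughly as follows. This is a simplified sketch, not the actual `combine_training_data.py`; it assumes the JSON layout produced by Discord Chat Exporter (a top-level `messages` array where each message has `author.name` and `content` fields):

```python
import json
from pathlib import Path

def combine_dumps(dump_dir: str, out_file: str) -> None:
    """Flatten per-channel JSON dumps into one 'author: content' line per message."""
    lines = []
    for dump in sorted(Path(dump_dir).glob("*.json")):
        data = json.loads(dump.read_text(encoding="utf-8"))
        for msg in data.get("messages", []):
            content = msg.get("content", "").strip()
            if content:  # skip empty messages (attachment-only, embeds, etc.)
                lines.append(f"{msg['author']['name']}: {content}")
    Path(out_file).write_text("\n".join(lines), encoding="utf-8")
```

The resulting plain-text file is what the tokenizer and encoder steps then operate on.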
Once the Discord messages have been prepared they can be fed into the model for training.
- Run the training script: `python ./src/train.py`
If you have a GPU and completed the Dependency Installation - GPU Setup steps, then you can run the training script in your Anaconda environment with the `--gpu` option: `python ./src/train.py --gpu`
While training is occurring you can type `quit` into the terminal and the training process will gracefully exit when the current training epoch is finished.
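The graceful-quit behavior can be implemented with a background thread that watches standard input and sets a flag which the training loop checks between epochs. The sketch below illustrates the idea; it is not the actual `train.py` code:

```python
import sys
import threading

class QuitWatcher:
    """Background thread that sets a flag when 'quit' is typed on stdin."""
    def __init__(self, stream=sys.stdin):
        self.should_quit = threading.Event()
        # Daemon thread: does not block process exit when training finishes.
        self._thread = threading.Thread(target=self._watch, args=(stream,), daemon=True)
        self._thread.start()

    def _watch(self, stream):
        for line in stream:
            if line.strip().lower() == "quit":
                self.should_quit.set()
                break

# Usage inside a training loop (hypothetical epoch function):
# watcher = QuitWatcher()
# for epoch in range(num_epochs):
#     train_one_epoch()
#     if watcher.should_quit.is_set():
#         break  # exit gracefully after the current epoch
```

Checking the flag only between epochs is what makes the shutdown graceful: the current epoch always runs to completion before the process exits.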
Once the model has been trained, prompts can be supplied and it will try to respond appropriately.
- Run the evaluate script: `python ./src/evaluate.py --interactive-prompt`
Type in a prompt for the model and hit enter; you should see the model's responses printed in the terminal.
The `--prompt-template-file <FILE PATH>` option specifies a file which prompts will be plugged into like a template. The text `<PROMPT>` will be replaced with the prompt entered by the user.
For example, if you had a file named `prompt-template.txt` with the contents:
AI#1234: How are you doing?
USER#5678: <PROMPT>
AI#1234:
Then if you ran the evaluate script like so:
python ./src/evaluate.py --prompt "Who goes there?" --prompt-template-file ./prompt-template.txt
The model would be given the prompt:
AI#1234: How are you doing?
USER#5678: Who goes there?
AI#1234:
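Under the hood, this template substitution amounts to a single string replacement. A minimal sketch of the idea (not the actual `evaluate.py` code):

```python
def apply_prompt_template(template: str, prompt: str) -> str:
    """Replace the <PROMPT> placeholder in a template with the user's prompt."""
    return template.replace("<PROMPT>", prompt)

# Mirrors the example above:
template = "AI#1234: How are you doing?\nUSER#5678: <PROMPT>\nAI#1234:"
print(apply_prompt_template(template, "Who goes there?"))
```

Templating like this lets you frame the user's input as one turn of an ongoing conversation, which tends to steer the model toward chat-style completions.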
The `conda-environment.yml` file contains a snapshot of Python and Anaconda dependencies. To generate this file:
- Create and activate a new Anaconda environment by running:
conda create -n brap-brain
conda activate brap-brain
- Install CUDA and PyTorch Anaconda packages by running: `conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge`
This command was generated from the PyTorch installation instructions; select Windows, Conda, Python, and CUDA to get this command for the most recent version of PyTorch.
- Install Python dependencies:
pip3 install -r requirements.txt
- Export the Anaconda environment configuration file:
conda env export -f conda-environment.yml
It turns out the `aitextgen` library is a wrapper around the GPT-2 code. It can run different, smaller models, but it is also capable of running the full-sized GPT-2 model. I rewrote the entire brap-brain code base using the aitextgen library, as it is a much cleaner and better wrapper around the GPT-2 code.