
Brap Brain

Run GPT-2 trained on Discord messages.

Table Of Contents

  • Overview
  • Instructions
  • Development

Overview

The goal of this project is to train GPT-2 on your Discord server's message history.

See the Instructions section for details on how to use the project.

Instructions

Python and Docker must be installed.

Complete sections in this order:

  1. Dependency Installation
  2. Train The Model
  3. Use The Model

Dependency Installation

If you intend to use a GPU to train the model, see GPU Setup.
If you will only use a CPU, complete the steps in CPU Only Setup.

CPU Only Setup

Install Python dependencies:

pip3 install -r requirements.txt

GPU Setup

Training on a GPU can be much faster than training on a CPU. However, a few things must be set up first, and the setup can be a little tricky.

To set up GPU support:

  1. Install Nvidia CUDA
  2. Install Nvidia cuDNN
    On Windows the installation process is a little confusing. Once you download cuDNN, extract the ZIP file and open the cuda folder. Inside this folder there should be lib, include, and bin folders. Find the location where CUDA is installed (likely C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7). Then copy the lib, include, and bin folders from cuDNN into the CUDA directory. If prompted, choose to overwrite any files.
  3. Install Conda
  4. Open the "Anaconda Prompt" application and run:
    conda env create -f conda-environment.yml
  5. Make sure to run all commands in the Anaconda environment
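
After setup you can sanity check that PyTorch was installed with CUDA support and can see the GPU. A quick check, run inside the Anaconda environment:

import torch

# True only if PyTorch was built with CUDA and a GPU is visible
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))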

Train The Model

The training process can be broken into these broad steps:

  1. Obtain Raw Discord Messages
  2. Prepare Discord Messages For Training
  3. Run Training Process

Obtain Raw Discord Messages

A tool named Discord Chat Exporter is used to download all messages from a Discord server as JSON files. This tool is run via Docker. To use it:

  1. Create a copy of .env-example named .env and fill in your own values
  • The DISCORD_TOKEN value must be a Discord API token for a bot
  • The bot must have the bot OAuth2 scope
  • The bot must have the permissions:
    • Read Messages/View Channels
    • Read Message History
  • Invite this bot using the URL generated by the Discord Developer Dashboard OAuth2 > URL Generator page
  2. Source the .env file:
    source ./.env
  3. Run the exporter tool:
    ./scripts/download-discord.sh
    This will download Discord messages into the discord-messages/ directory. One JSON file will be created for each channel in the server specified via the DISCORD_GUILD env var.
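
To sanity check a dump you can load one of the JSON files in Python. The exact schema is defined by Discord Chat Exporter; the file name and the fields used below (a top-level messages list whose entries have author and content) are assumptions for illustration:

import json

# Load one exported channel dump (hypothetical file name)
with open("discord-messages/general.json", encoding="utf-8") as f:
    dump = json.load(f)

messages = dump["messages"]
print(len(messages), "messages")
print(messages[0]["author"]["name"], ":", messages[0]["content"])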

Prepare Discord Messages For Training

The contents of the JSON message dumps in the discord-messages/ directory are not suitable for training. They must first be processed into a form which works better for the model.

  1. The message dump files must be combined into one file containing only the relevant details of each message. To do this run:
    python ./src/combine_training_data.py
    This will create the training-data/discord-messages.txt file.
  2. A custom tokenizer needs to be built for the training data. To do this run:
    python ./src/build_tokenizer.py
    This will create several files in the training-data/ directory which store the parameters of the tokenizer.
  3. The messages need to be broken up into tokens and encoded into a more efficient format. Run:
    python ./src/encode_dataset.py
    This will create the training-data/discord-messages.tar.gz file, which will hold the encoded dataset.
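
These steps map closely onto aitextgen's tokenizer and dataset utilities. Below is a minimal sketch of the pipeline, assuming Discord Chat Exporter's JSON layout, the NAME#1234: message line format shown in the prompt template example later, and aitextgen's train_tokenizer and TokenDataset APIs; the exact arguments and file names the real scripts use may differ:

import glob
import json

from aitextgen.tokenizers import train_tokenizer
from aitextgen.TokenDataset import TokenDataset

PLAIN_TEXT = "training-data/discord-messages.txt"

# Step 1: flatten every channel dump into one plain text file
with open(PLAIN_TEXT, "w", encoding="utf-8") as out:
    for path in glob.glob("discord-messages/*.json"):
        with open(path, encoding="utf-8") as f:
            dump = json.load(f)
        for msg in dump["messages"]:
            if not msg["content"]:
                continue  # skip messages with no text (e.g. attachments only)
            author = msg["author"]
            out.write(f"{author['name']}#{author['discriminator']}: {msg['content']}\n")

# Step 2: train a byte pair encoding tokenizer on the combined text
train_tokenizer(PLAIN_TEXT, save_path="training-data")

# Step 3: tokenize the text and cache it as a compressed dataset
data = TokenDataset(
    PLAIN_TEXT,
    tokenizer_file="training-data/aitextgen.tokenizer.json",
    save_cache=True,
    cache_destination="training-data/discord-messages.tar.gz",
)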

Run Training Process

Once the Discord messages have been prepared, they can be fed into the model for training.

  1. Run the training script:
    python ./src/train.py
    If you have a GPU and completed the Dependency Installation - GPU Setup steps, then you can run the training script in your Anaconda environment with the --gpu option:
    python ./src/train.py --gpu
    While training is occurring you can type quit into the terminal and the training process will gracefully exit when the current training epoch finishes.
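
For reference, here is a minimal sketch of what the training step does with aitextgen; the config, step counts, and file locations are assumptions for illustration (train.py's --gpu option corresponds to moving training to the GPU):

from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.utils import GPT2ConfigCPU

# Load the encoded dataset produced by encode_dataset.py
data = TokenDataset("training-data/discord-messages.tar.gz", from_cache=True)

# Build a small GPT-2 model from scratch with the custom tokenizer
ai = aitextgen(
    tokenizer_file="training-data/aitextgen.tokenizer.json",
    config=GPT2ConfigCPU(),
)

# Train, saving checkpoints and printing sample output periodically
ai.train(data, num_steps=5000, generate_every=1000, save_every=1000)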

Use The Model

Once the model has been trained, prompts can be supplied and it will try to respond appropriately.

  1. Run the evaluate script:
    python ./src/evaluate.py --interactive-prompt
    Type in a prompt for the model and hit enter; you should see the model's responses printed in the terminal.
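
Under the hood this amounts to loading the trained model and calling aitextgen's generate method. A minimal sketch; the model folder name is an assumption (aitextgen saves to trained_model by default):

from aitextgen import aitextgen

# Load the trained model and custom tokenizer from disk
ai = aitextgen(
    model_folder="trained_model",
    tokenizer_file="training-data/aitextgen.tokenizer.json",
)

# Generate a response to a prompt
ai.generate(prompt="Who goes there?", max_length=100, temperature=0.9)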

Prompt Template Files

The --prompt-template-file <FILE PATH> option specifies a file into which prompts will be plugged like a template. The text <PROMPT> will be replaced with the prompt entered by the user.

For example, if you had a file named prompt-template.txt with the contents:

AI#1234: How are you doing?
USER#5678: <PROMPT>
AI#1234:

Then if you ran the evaluate script like so:

python ./src/evaluate.py --prompt "Who goes there?" --prompt-template-file ./prompt-template.txt

The model would be given the prompt:

AI#1234: How are you doing?
USER#5678: Who goes there?
AI#1234:
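
The template mechanism is plain string substitution; a minimal sketch:

# Read the template and substitute the user's prompt into it
with open("prompt-template.txt", encoding="utf-8") as f:
    template = f.read()

prompt = template.replace("<PROMPT>", "Who goes there?")
# prompt now holds the three-line text shown above, ready for the model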

Development

Creating The Anaconda Environment

The conda-environment.yml file contains a snapshot of Python and Anaconda dependencies. To generate this file:

  1. Create a new Anaconda environment by running:
    conda create -n brap-brain
    conda activate brap-brain
  2. Install CUDA and PyTorch Anaconda packages by running:
    conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
    This command was generated from the PyTorch installation instructions; select Windows, Conda, Python, and CUDA there to get the command for the most recent version of PyTorch.
  3. Install Python dependencies:
    pip3 install -r requirements.txt
  4. Export the Anaconda environment configuration file:
    conda env export -f conda-environment.yml

Use Of aitextgen

It turns out the aitextgen library is a wrapper around the GPT-2 code. It can run different, smaller models, but it is also capable of running the full-sized GPT-2 model. I rewrote the entire brap-brain code base using aitextgen, as it is a much cleaner and better wrapper around the GPT-2 code.
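
For reference, aitextgen can also load OpenAI's released GPT-2 checkpoints directly; a small example (checkpoint sizes per OpenAI's releases):

from aitextgen import aitextgen

ai_small = aitextgen(tf_gpt2="124M")   # smallest released GPT-2
ai_full = aitextgen(tf_gpt2="1558M")   # the full-sized GPT-2 model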
