Run GPT-2 trained on Discord messages.
The goal of this project is to train GPT-2 on your Discord server's message history.
See the Instructions section for details on how to use the project.
Python and Docker must be installed.
Complete sections in this order:
If you intend to use a GPU to train the model, see GPU Setup.
If you will only use a CPU, complete the steps in CPU Only Setup.
Install Python dependencies:
pip3 install -r requirements.txt
Training on a GPU can be much faster than training on a CPU. However, a few things must be set up first, and it can be a little tricky.
To set up GPU support:
- Install Nvidia CUDA
- Install Nvidia cuDNN
On Windows the installation process is a little confusing. Once you download cuDNN, extract the ZIP file and open the `cuda` folder. Inside this folder there should be a `lib`, `include`, and `bin` folder. Find the location where CUDA is installed (likely `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7`). Then copy the `lib`, `include`, and `bin` folders from cuDNN into the CUDA directory. If prompted, select to overwrite any files.
- Install Conda
- Open the "Anaconda Prompt" application and run:
conda env create -f conda-environment.yml
- Make sure to run all commands in the Anaconda environment
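Once the environment is active, a quick way to confirm PyTorch can actually see the GPU is to query the CUDA runtime. This is a minimal check, assuming PyTorch was installed into the environment as described above:

```python
import torch

# Reports whether the CUDA runtime and a compatible GPU are visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the first visible GPU device.
    print("Device:", torch.cuda.get_device_name(0))
```

If this prints `CUDA available: False`, the cuDNN/CUDA installation steps above likely need to be revisited before running training with the GPU.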
The training process can be broken into these broad steps:
A tool named Discord Chat Exporter is used to download all messages from a Discord server as JSON files. This tool is run via Docker. To do so:
- Create a copy of `.env-example` named `.env` and fill in your own values
- The `DISCORD_TOKEN` value must be a Discord API token for a bot
- The bot must have the `bot` OAuth2 scope
- The bot must have the permissions: Read Messages/View Channels, Read Message History
- Invite this bot using the URL generated by the Discord Developer Dashboard OAuth2 > URL Generator page
- Source the `.env` file: `source ./.env`
- Run the exporter tool: `./scripts/download-discord.sh`
This will download Discord messages into the `discord-messages/` directory. One JSON file will be created for each channel in the server you specified via the `DISCORD_GUILD` env var.
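The variable names below are the ones referenced in the steps above; the values are placeholders, not real credentials. This is a sketch of what a filled-in `.env` might look like (the `.env-example` file in the repository is the authoritative reference, and may contain additional variables):

```shell
# Discord bot API token used by Discord Chat Exporter (placeholder value)
export DISCORD_TOKEN="your-bot-token-here"
# ID of the server whose channels should be exported (placeholder value)
export DISCORD_GUILD="123456789012345678"
```

The `export` keyword ensures the variables are visible to child processes after `source ./.env`.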
The contents of the JSON message dumps in the `discord-messages/` directory are not suitable for machine learning training. First they must be processed into a form which works better for the model.
- The message dump files must be combined into one file which only contains the relevant details about each message. To do this run: `python ./src/combine_training_data.py`
This will create the `training-data/discord-messages.txt` file.
- A custom tokenizer needs to be built for the training data. To do this run: `python ./src/build_tokenizer.py`
This will create several files in the `training-data/` directory which store the parameters of the tokenizer.
- The messages need to be broken up by tokens and encoded into a more efficient format. Run: `python ./src/encode_dataset.py`
This will create the `training-data/discord-messages.tar.gz` file, which will hold the encoded dataset.
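The combining step can be pictured roughly as follows. This is a simplified sketch, not the actual `combine_training_data.py`; it assumes the JSON layout produced by Discord Chat Exporter (a top-level `messages` array where each message has `author.name` and `content` fields):

```python
import json
from pathlib import Path

def combine_dumps(dump_dir: str, out_file: str) -> None:
    """Flatten per-channel JSON dumps into one 'author: content' line per message."""
    lines = []
    for dump in sorted(Path(dump_dir).glob("*.json")):
        data = json.loads(dump.read_text(encoding="utf-8"))
        for msg in data.get("messages", []):
            content = msg.get("content", "").strip()
            if content:  # skip empty messages (attachment-only, embeds, etc.)
                lines.append(f"{msg['author']['name']}: {content}")
    Path(out_file).write_text("\n".join(lines), encoding="utf-8")
```

The resulting plain-text file is what the tokenizer and encoder steps then operate on.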
Once the Discord messages have been prepared they can be fed into the model for training.
- Run the training script: `python ./src/train.py`
If you have a GPU and completed the Dependency Installation - GPU Setup steps, then you can run the training script in your Anaconda environment with the `--gpu` option: `python ./src/train.py --gpu`
While training is occurring you can type `quit` into the terminal and the training process will gracefully exit when the current training epoch is finished.
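The graceful-quit behavior can be implemented with a background thread that watches standard input and sets a flag which the training loop checks between epochs. The sketch below illustrates the idea; it is not the actual `train.py` code:

```python
import sys
import threading

class QuitWatcher:
    """Background thread that sets a flag when 'quit' is typed on stdin."""
    def __init__(self, stream=sys.stdin):
        self.should_quit = threading.Event()
        # Daemon thread: does not block process exit when training finishes.
        self._thread = threading.Thread(target=self._watch, args=(stream,), daemon=True)
        self._thread.start()

    def _watch(self, stream):
        for line in stream:
            if line.strip().lower() == "quit":
                self.should_quit.set()
                break

# Usage inside a training loop (hypothetical epoch function):
# watcher = QuitWatcher()
# for epoch in range(num_epochs):
#     train_one_epoch()
#     if watcher.should_quit.is_set():
#         break  # exit gracefully after the current epoch
```

Checking the flag only between epochs is what makes the shutdown graceful: the current epoch always runs to completion before the process exits.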
Once the model has been trained, prompts can be supplied and it will try to respond appropriately.
- Run the evaluate script: `python ./src/evaluate.py --interactive-prompt`
Type in a prompt for the model and hit enter; you should see the model's responses printed in the terminal.
The `--prompt-template-file <FILE PATH>` option specifies a file which prompts will be plugged into like a template. The text `<PROMPT>` will be replaced with the prompt entered by the user.
For example, if you had a file named `prompt-template.txt` with the contents:
AI#1234: How are you doing?
USER#5678: <PROMPT>
AI#1234:
Then if you ran the evaluate script like so:
python ./src/evaluate.py --prompt "Who goes there?" --prompt-template-file ./prompt-template.txt
The model would be given the prompt:
AI#1234: How are you doing?
USER#5678: Who goes there?
AI#1234:
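Under the hood, this template substitution amounts to a single string replacement. A minimal sketch of the idea (not the actual `evaluate.py` code):

```python
def apply_prompt_template(template: str, prompt: str) -> str:
    """Replace the <PROMPT> placeholder in a template with the user's prompt."""
    return template.replace("<PROMPT>", prompt)

# Mirrors the example above:
template = "AI#1234: How are you doing?\nUSER#5678: <PROMPT>\nAI#1234:"
print(apply_prompt_template(template, "Who goes there?"))
```

Templating like this lets you frame the user's input as one turn of an ongoing conversation, which tends to steer the model toward chat-style completions.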
The `conda-environment.yml` file contains a snapshot of Python and Anaconda dependencies. To generate this file:
- Create and activate a new Anaconda environment by running:
conda create -n brap-brain
conda activate brap-brain
- Install CUDA and PyTorch Anaconda packages by running: `conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge`
This command was generated from the PyTorch installation instructions; select Windows, Conda, Python, and CUDA to get this command for the most recent version of PyTorch.
- Install Python dependencies:
pip3 install -r requirements.txt
- Export the Anaconda environment configuration file:
conda env export -f conda-environment.yml
It turns out the `aitextgen` library is a wrapper around the GPT-2 code. It can run different, smaller models, but it is also capable of running the full-sized GPT-2 model. I rewrote the entire brap-brain code base using the aitextgen library, as it is a much cleaner and better wrapper around the GPT-2 code.