Improved documentation
ConnorJL committed May 23, 2019
1 parent 45ec9e9 commit a105fe5
Showing 7 changed files with 80 additions and 357 deletions.
32 changes: 0 additions & 32 deletions GPT2-1.7B-adafactor.json

This file was deleted.

18 changes: 8 additions & 10 deletions GPT2-1.7B.json
@@ -3,15 +3,13 @@
"encoder_path": "gs:https://openwebtext/stuff/encoder",
"n_vocab": 50257,
"embed_dropout": 0.1,
"opt_params": {
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01
},
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01,
"train_batch_size": 512,
"attn_dropout": 0.1,
"train_steps": 10000,
@@ -26,7 +24,7 @@
"n_embd": 1600,
"input": "openwebtext_longbiased",
"model": "GPT2",
"model_dir": "gs:https://connors-models/GPT2-1.7B",
"model_path": "gs:https://connors-models/GPT2-1.7B",
"n_ctx": 1024,
"predict_path": "logs/predictions_1.7B.txt",
"n_layer": 48
18 changes: 8 additions & 10 deletions GPT2-117M.json
@@ -3,15 +3,13 @@
"encoder_path": "gs:https://openwebtext/stuff/encoder",
"n_vocab": 50257,
"embed_dropout": 0.1,
"opt_params": {
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01
},
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01,
"train_batch_size": 32,
"attn_dropout": 0.1,
"train_steps": 10000,
@@ -26,7 +24,7 @@
"n_embd": 768,
"input": "openwebtext",
"model": "GPT2",
"model_dir": "gs:https://connors-models/GPT2-117M-long",
"model_path": "gs:https://connors-models/GPT2-117M",
"n_ctx": 1024,
"predict_path": "logs/predictions.txt",
"n_layer": 12
18 changes: 8 additions & 10 deletions GPT2-345M.json
@@ -3,15 +3,13 @@
"encoder_path": "gs:https://openwebtext/stuff/encoder",
"n_vocab": 50257,
"embed_dropout": 0.1,
"opt_params": {
"lr": 0.00025,
"warmup_steps": 2000,
"weight_decay": 0.01,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW"
},
"lr": 0.00025,
"warmup_steps": 2000,
"weight_decay": 0.01,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"opt_name": "adamW",
"train_batch_size": 8,
"attn_dropout": 0.1,
"train_steps": 10000,
@@ -26,7 +24,7 @@
"n_embd": 1024,
"input": "openwebtext",
"model": "GPT2",
"model_dir": "gs:https://connors-models/GPT2-345M",
"model_path": "gs:https://connors-models/GPT2-345M",
"n_ctx": 1024,
"predict_path": "logs/predictions.txt",
"n_layer": 24
56 changes: 47 additions & 9 deletions README.md
@@ -1,7 +1,7 @@
# GPT2
**This is not the official GPT2 implementation!**

An implementation of training for [GPT2](https://openai.com/blog/better-language-models/) that supports both GPUs and TPUs. The dataset scripts are hack-y and will probably need to be adapted to your needs.
An implementation of training for [GPT2](https://openai.com/blog/better-language-models/) that supports both GPUs and TPUs. The dataset scripts are a bit hack-y and will probably need to be adapted to your needs.
## Requirements
For GPUs:

@@ -15,6 +15,11 @@ For TPUs:

`pip3 install --upgrade oauth2client`

For generating the dataset (in addition to Tensorflow):

`pip3 install ftfy tqdm newspaper3k`


## Training
To train a model, define its parameters in a .json file (see examples) and then simply call
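The exact invocation is collapsed in this diff view; purely as an illustration (the flag names below are assumptions, check main.py for the real interface):

```sh
# Illustrative only -- flag names are assumptions, see main.py for the real ones.
python3 main.py --model GPT2-117M.json                      # GPU training
python3 main.py --model GPT2-1.7B.json --tpu your-tpu-name  # TPU training
```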

@@ -25,7 +30,7 @@ Using a TPU is optional, it runs fine on GPUs without modification. (Note: Evalu
This assumes you have a version of the openwebtext corpus stored in an accessible location; if you don't, see below for how to generate your own version.

## Generating Text
To predict you can either pass the prompt directly on the command line, or have it read from a file. (This is useful for prompts that include new lines.) Text is output to the console and to the file specified in the "predict_path" parameter.
To predict you can either pass the prompt directly on the command line, or have it read from a file. (This is useful for prompts that include new lines.) Text is output to the console and to the file specified in the "predict_path" parameter. You need a model checkpoint and a copy of the BPE encoder at an accessible location for this to work. (Change the "model_path" and "encoder_path" parameters in the .json.)

From command line:
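The concrete commands are collapsed in this view as well; an illustrative sketch (again, flag names are assumptions):

```sh
# Illustrative only -- check main.py for the actual prediction flags.
python3 main.py --model GPT2-117M.json --predict_text "My prompt goes here"
# or, to read a multi-line prompt from a file:
python3 main.py --model GPT2-117M.json --predict_file prompt.txt
```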

@@ -40,15 +45,48 @@ Prediction on TPUs is not supported.
## Generating the Dataset
GPT2 is trained on the webtext corpus, which is basically all websites linked to from Reddit with at least 3 karma. Since the dataset is huge and contains a lot of copyrighted material, I can't provide a download here. Instead I'll describe how I got it. Be aware it cost me around 500€ in cloud compute resources to download and process the whole thing, but I'm not claiming I was optimally efficient.
1. Use the download script from [here](https://github.com/jcpeterson/openwebtext) to download the archives (I used the prefiltered URLs file)
2. Use *datasets/extract_text.py* and *datasets/run_newspaper_extract.py* to extract the text.
2. Use *datasets/run_newspaper_extract.py* to extract the text.
3. Once you have the raw .txt files, use *datasets/create_tfrecords.py* to encode them into correct .tfrecords files (see the sketch after this list).
4. Place the .tfrecords files into a Google Storage bucket. (This is mandatory if you're using TPUs.)
5. Change the "data_path" parameter to point to where your files are located and, if necessary, adapt the functions in *inputs.py* to open the correct filenames, in case you changed them.
5. Change the "data_path" parameter to point to where your files are located and, if necessary, adapt the functions in inputs.py to open the correct filenames, in case you changed them.
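Step 3 amounts to serializing BPE token ids into tf.train.Example records. A minimal sketch of that idea follows; the feature name and record layout are assumptions, not necessarily what *datasets/create_tfrecords.py* produces:

```python
# Sketch only: write BPE-encoded documents as TFRecords (TF 1.x API).
import tensorflow as tf

def write_tfrecord(encoded_docs, path):
    """encoded_docs: iterable of lists of BPE token ids (ints)."""
    with tf.python_io.TFRecordWriter(path) as writer:
        for tokens in encoded_docs:
            # Store each document as a single int64 feature list named "text"
            # (the actual feature name used by the repo may differ).
            feature = {"text": tf.train.Feature(
                int64_list=tf.train.Int64List(value=tokens))}
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
```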


## Explanation of Parameters
The way the code is set up, you pass all the model parameters in a .json file. Note that any paths also support Google Storage paths.

* **model**: A string that refers to which model to use. This should always just be "GPT2"
* **model_dir**: Where to save and load checkpoints from
* **n_ctx**: Number of tokens the model looks at
Because passing two dozen parameters over the command line would be tedious, you pass all the model parameters in a .json file. Note that any paths also support Google Storage paths and *must* be gs:// paths if you're running on TPUs. (A full example config is sketched after the parameter lists below.)

Values you'll definitely want to change:
* **model_path**: Where to save and load checkpoints from
* **data_path**: Where your .tfrecords files are located
* **encoder_path**: Path to the BPE encoder files. To get these, use the download_model.py script from [here](https://github.com/openai/gpt-2) to download any model; you will also get a folder called "encoder", which is what this parameter should point to (only required for prediction; see the example below)
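For **encoder_path**, an illustrative way to obtain the encoder files (the script name and usage are taken from openai/gpt-2 and may have changed; paths are placeholders):

```sh
# Illustrative only -- assumes openai/gpt-2's download_model.py still works this way.
git clone https://github.com/openai/gpt-2
cd gpt-2
python3 download_model.py 117M
# Copy the encoder files somewhere accessible, e.g. a GCS bucket,
# then point "encoder_path" at that location:
gsutil cp -r models/117M gs://your-bucket/encoder
```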

Values you'll probably want to change:
* **train_batch_size**: Batch size during the training phase (varies depending on your model and hardware)
* **eval_batch_size**: Batch size during evaluation
* **predict_batch_size**: Batch size during prediction
* **predict_path**: Where to save predictions (point this to a text file to append to)

Model parameters:
* **model**: A string that refers to which model to use. This should always just be "GPT2" (no other models are implemented here)
* **n_ctx**: Number of tokens the model looks at (default: 1024)
* **n_vocab**: Size of vocabulary (default: 50257)
* **n_embd**: Dimension of embedding layers
* **n_layer**: Number of layers in the model
* **n_head**: Number of attention heads (default: n_embd / 64)
* **scale**: Factor by which to scale initializations of weights (default: 1/sqrt(n_layer))
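For example, with the GPT2-117M.json values shown earlier in this commit (n_embd = 768, n_layer = 12), the defaults work out to n_head = 768 / 64 = 12 and scale = 1 / sqrt(12) ≈ 0.29.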

Training parameters:
* **input**: Which input function to use (default: "openwebtext")
* **lr**: Learning rate (default: 0.00025)
* **warmup_steps**: Number of (linear) warmup steps (default: 2000)
* **opt_name**: Name of the optimizer; currently only "adamW" is implemented (default: "adamW")
* **beta1**: Adam beta1 parameter (default: 0.9)
* **beta2**: Adam beta2 parameter (default: 0.98)
* **epsilon**: Adam epsilon parameter (default: 1e-9)
* **weight_decay**: Weight decay parameter (default: 0.01)
* **train_steps**: Number of training steps to take between evaluations
* **eval_steps**: Number of steps per evaluation
* **max_steps**: The maximum number of training steps (important for declining lr)
* **iterations**: Number of iterations to perform on TPUs (only required for TPUs) (default: 100)
* **embed_dropout**: Dropout chance on the word embedding (default: 0.1)
* **attn_dropout**: Dropout chance on attention layers (default: 0.1)
* **res_dropout**: Dropout chance on residual connections (default: 0.1)
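To tie the parameter lists together, here is an illustrative config in the spirit of the GPT2-117M.json shown earlier in this commit. Values not visible in this diff (data_path, the eval/predict batch sizes, eval_steps, max_steps, iterations) are placeholders, and the optimizer key is written as "opt_name" per the list above even though the 117M and 1.7B configs shown use "name":

```json
{
  "model": "GPT2",
  "model_path": "gs://your-bucket/GPT2-117M",
  "data_path": "gs://your-bucket/openwebtext-tfrecords",
  "encoder_path": "gs://your-bucket/encoder",
  "predict_path": "logs/predictions.txt",
  "input": "openwebtext",
  "train_batch_size": 32,
  "eval_batch_size": 32,
  "predict_batch_size": 1,
  "n_ctx": 1024,
  "n_vocab": 50257,
  "n_embd": 768,
  "n_layer": 12,
  "n_head": 12,
  "opt_name": "adamW",
  "lr": 0.00025,
  "warmup_steps": 2000,
  "beta1": 0.9,
  "beta2": 0.98,
  "epsilon": 1e-9,
  "weight_decay": 0.01,
  "train_steps": 10000,
  "eval_steps": 100,
  "max_steps": 500000,
  "iterations": 100,
  "embed_dropout": 0.1,
  "attn_dropout": 0.1,
  "res_dropout": 0.1
}
```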
4 changes: 2 additions & 2 deletions main.py
@@ -76,7 +76,7 @@
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(args.tpu)

run_config = tf.contrib.tpu.RunConfig(
model_dir=params["model_dir"],
model_dir=params["model_path"],
cluster=tpu_cluster_resolver,
save_checkpoints_secs=60*10,
session_config=tf.ConfigProto(
@@ -100,7 +100,7 @@
# Non TPU setup
params["batch_size"] = params["train_batch_size"]
run_config = tf.estimator.RunConfig(
model_dir=params["model_dir"],
model_dir=params["model_path"],
session_config=tf.ConfigProto(
# log_device_placement=True,
# allow_soft_placement=True
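For context, main.py presumably builds this params dict by loading the .json passed on the command line; a rough sketch of the non-TPU path (not the file's exact code):

```python
# Sketch only: load the config and hand it to a TF 1.x Estimator run config.
import json
import tensorflow as tf

with open("GPT2-117M.json") as f:
    params = json.load(f)

params["batch_size"] = params["train_batch_size"]

run_config = tf.estimator.RunConfig(
    model_dir=params["model_path"],   # the key this commit renames from "model_dir"
    save_checkpoints_secs=60 * 10,
    session_config=tf.ConfigProto())

# estimator = tf.estimator.Estimator(model_fn=model_fn,
#                                    params=params, config=run_config)
```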