Improved documentation
ConnorJL committed May 23, 2019
1 parent 45ec9e9 commit a105fe5
Showing 7 changed files with 80 additions and 357 deletions.
32 changes: 0 additions & 32 deletions GPT2-1.7B-adafactor.json

This file was deleted.

18 changes: 8 additions & 10 deletions GPT2-1.7B.json
@@ -3,15 +3,13 @@
"encoder_path": "gs:https://openwebtext/stuff/encoder",
"n_vocab": 50257,
"embed_dropout": 0.1,
"opt_params": {
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01
},
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01,
"train_batch_size": 512,
"attn_dropout": 0.1,
"train_steps": 10000,
@@ -26,7 +24,7 @@
"n_embd": 1600,
"input": "openwebtext_longbiased",
"model": "GPT2",
"model_dir": "gs:https://connors-models/GPT2-1.7B",
"model_path": "gs:https://connors-models/GPT2-1.7B",
"n_ctx": 1024,
"predict_path": "logs/predictions_1.7B.txt",
"n_layer": 48
18 changes: 8 additions & 10 deletions GPT2-117M.json
@@ -3,15 +3,13 @@
"encoder_path": "gs:https://openwebtext/stuff/encoder",
"n_vocab": 50257,
"embed_dropout": 0.1,
"opt_params": {
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01
},
"lr": 0.00025,
"warmup_steps": 2000,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW",
"weight_decay": 0.01,
"train_batch_size": 32,
"attn_dropout": 0.1,
"train_steps": 10000,
@@ -26,7 +24,7 @@
"n_embd": 768,
"input": "openwebtext",
"model": "GPT2",
"model_dir": "gs:https://connors-models/GPT2-117M-long",
"model_path": "gs:https://connors-models/GPT2-117M",
"n_ctx": 1024,
"predict_path": "logs/predictions.txt",
"n_layer": 12
18 changes: 8 additions & 10 deletions GPT2-345M.json
@@ -3,15 +3,13 @@
"encoder_path": "gs:https://openwebtext/stuff/encoder",
"n_vocab": 50257,
"embed_dropout": 0.1,
"opt_params": {
"lr": 0.00025,
"warmup_steps": 2000,
"weight_decay": 0.01,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"name": "adamW"
},
"lr": 0.00025,
"warmup_steps": 2000,
"weight_decay": 0.01,
"beta1": 0.9,
"beta2": 0.98,
"epsilon": 1e-9,
"opt_name": "adamW",
"train_batch_size": 8,
"attn_dropout": 0.1,
"train_steps": 10000,
@@ -26,7 +24,7 @@
"n_embd": 1024,
"input": "openwebtext",
"model": "GPT2",
"model_dir": "gs:https://connors-models/GPT2-345M",
"model_path": "gs:https://connors-models/GPT2-345M",
"n_ctx": 1024,
"predict_path": "logs/predictions.txt",
"n_layer": 24
56 changes: 47 additions & 9 deletions README.md
@@ -1,7 +1,7 @@
# GPT2
**This is not the official GPT2 implementation!**

An implementation of training for [GPT2](https://openai.com/blog/better-language-models/) that supports both GPUs and TPUs. The dataset scripts are hack-y and will probably need to be adapted to your needs.
An implementation of training for [GPT2](https://openai.com/blog/better-language-models/) that supports both GPUs and TPUs. The dataset scripts are a bit hack-y and will probably need to be adapted to your needs.
## Requirements
For GPUs:

@@ -15,6 +15,11 @@ For TPUs:

`pip3 install --upgrade oauth2client`

For generating the dataset (in addition to Tensorflow):

`pip3 install ftfy tqdm newspaper3k`


## Training
To train a model, define its parameters in a .json file (see examples) and then simply call
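The exact invocation is collapsed in this diff view; purely as an illustration (the flag names below are assumptions, check main.py for the real interface):

```sh
# Illustrative only -- flag names are assumptions, see main.py for the real ones.
python3 main.py --model GPT2-117M.json                      # GPU training
python3 main.py --model GPT2-1.7B.json --tpu your-tpu-name  # TPU training
```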

@@ -25,7 +30,7 @@ Using a TPU is optional, it runs fine on GPUs without modification. (Note: Evalu
This assumes you have a version of the openwebtext corpus stored in an accessible location; if you don't, see below for how to generate your own version.

## Generating Text
To predict you can either pass the prompt directly on the command line, or have it read from a file. (This is useful for prompts that include new lines.) Text is output to the console and to the file specified in the "predict_path" parameter.
To predict you can either pass the prompt directly on the command line, or have it read from a file. (This is useful for prompts that include new lines.) Text is output to the console and to the file specified in the "predict_path" parameter. You need a model checkpoint and a copy of the BPE encoder at an accessible location for this to work. (Change the "model_path" and "encoder_path" parameters in the .json.)

From command line:
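The concrete commands are collapsed in this view as well; an illustrative sketch (again, flag names are assumptions):

```sh
# Illustrative only -- check main.py for the actual prediction flags.
python3 main.py --model GPT2-117M.json --predict_text "My prompt goes here"
# or, to read a multi-line prompt from a file:
python3 main.py --model GPT2-117M.json --predict_file prompt.txt
```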

@@ -40,15 +45,48 @@ Prediction on TPUs is not supported.
## Generating the Dataset
GPT2 is trained on the webtext corpus, which is basically all websites linked to from Reddit with at least 3 karma. Since the dataset is huge and contains a lot of copyrighted material, I can't provide a download here. Instead I'll describe how I got it. Be aware it cost me around 500€ in cloud compute resources to download and process the whole thing, but I'm not claiming I was optimally efficient.
1. Use the download script from [here](https://github.com/jcpeterson/openwebtext) to download the archives (I used the prefiltered URLs file)
2. Use *datasets/extract_text.py* and *datasets/run_newspaper_extract.py* to extract the text.
2. Use *datasets/run_newspaper_extract.py* to extract the text.
3. Once you have the raw .txt files, use *datasets/create_tfrecords.py* to encode them into correct .tfrecords files (see the sketch after this list).
4. Place the .tfrecords files into a Google Storage bucket. (This is mandatory if you're using TPUs.)
5. Change the "data_path" parameter to point to where your files are located and, if necessary, adapt the functions in *inputs.py* to open the correct filenames, in case you changed them.
5. Change the "data_path" parameter to point to where your files are located and, if necessary, adapt the functions in inputs.py to open the correct filenames, in case you changed them.
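Step 3 amounts to serializing BPE token ids into tf.train.Example records. A minimal sketch of that idea follows; the feature name and record layout are assumptions, not necessarily what *datasets/create_tfrecords.py* produces:

```python
# Sketch only: write BPE-encoded documents as TFRecords (TF 1.x API).
import tensorflow as tf

def write_tfrecord(encoded_docs, path):
    """encoded_docs: iterable of lists of BPE token ids (ints)."""
    with tf.python_io.TFRecordWriter(path) as writer:
        for tokens in encoded_docs:
            # Store each document as a single int64 feature list named "text"
            # (the actual feature name used by the repo may differ).
            feature = {"text": tf.train.Feature(
                int64_list=tf.train.Int64List(value=tokens))}
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
```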


## Explanation of Parameters
The way the code is set up, you pass all the model parameters in a .json file. Note that any paths also support Google Storage paths.

* **model**: A string that refers to which model to use. This should always just be "GPT2"
* **model_dir**: Where to save and load checkpoints from
* **n_ctx**: Number of tokens the model looks at
Because passing two dozen parameters over the command line would be tedious, you pass all the model parameters in a .json file. Note that any paths also support Google Storage paths and *must* be gs:// paths if you're running on TPUs. (A full example config is sketched after the parameter lists below.)

Values you'll definitely want to change:
* **model_path**: Where to save and load checkpoints from
* **data_path**: Where your .tfrecords files are located
* **encoder_path**: Path to the BPE encoder files. To get these, use the download_model.py script from [here](https://github.com/openai/gpt-2) to download any model; you will also get a folder called "encoder", which is what this parameter should point to (only required for prediction; see the example below)
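For **encoder_path**, an illustrative way to obtain the encoder files (the script name and usage are taken from openai/gpt-2 and may have changed; paths are placeholders):

```sh
# Illustrative only -- assumes openai/gpt-2's download_model.py still works this way.
git clone https://github.com/openai/gpt-2
cd gpt-2
python3 download_model.py 117M
# Copy the encoder files somewhere accessible, e.g. a GCS bucket,
# then point "encoder_path" at that location:
gsutil cp -r models/117M gs://your-bucket/encoder
```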

Values you'll probably want to change:
* **train_batch_size**: Batch size during the training phase (varies depending on your model and hardware)
* **eval_batch_size**: Batch size during evaluation
* **predict_batch_size**: Batch size during prediction
* **predict_path**: Where to save predictions (point this to a text file to append to)

Model parameters:
* **model**: A string that refers to which model to use. This should always just be "GPT2" (no other models are implemented here)
* **n_ctx**: Number of tokens the model looks at (default: 1024)
* **n_vocab**: Size of vocabulary (default: 50257)
* **n_embd**: Dimension of embedding layers
* **n_layer**: Number of layers in the model
* **n_head**: Number of attention heads (default: n_embd / 64)
* **scale**: Factor by which to scale initializations of weights (default: 1/sqrt(n_layer))
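For example, with the GPT2-117M.json values shown earlier in this commit (n_embd = 768, n_layer = 12), the defaults work out to n_head = 768 / 64 = 12 and scale = 1 / sqrt(12) ≈ 0.29.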

Training parameters:
* **input**: Which input function to use (default: "openwebtext")
* **lr**: Learning rate (default: 0.00025)
* **warmup_steps**: Number of (linear) warmup steps (default: 2000)
* **opt_name**: Name of the optimizer; currently only "adamW" is implemented (default: "adamW")
* **beta1**: Adam beta1 parameter (default: 0.9)
* **beta2**: Adam beta2 parameter (default: 0.98)
* **epsilon**: Adam epsilon parameter (default: 1e-9)
* **weight_decay**: Weight decay parameter (default: 0.01)
* **train_steps**: Number of training steps to take between evaluations
* **eval_steps**: Number of steps per evaluation
* **max_steps**: The maximum number of training steps (important for declining lr)
* **iterations**: Number of iterations to perform on TPUs (only required for TPUs) (default: 100)
* **embed_dropout**: Dropout chance on the word embedding (default: 0.1)
* **attn_dropout**: Dropout chance on attention layers (default: 0.1)
* **res_dropout**: Dropout chance on residual connections (default: 0.1)
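To tie the parameter lists together, here is an illustrative config in the spirit of the GPT2-117M.json shown earlier in this commit. Values not visible in this diff (data_path, the eval/predict batch sizes, eval_steps, max_steps, iterations) are placeholders, and the optimizer key is written as "opt_name" per the list above even though the 117M and 1.7B configs shown use "name":

```json
{
  "model": "GPT2",
  "model_path": "gs://your-bucket/GPT2-117M",
  "data_path": "gs://your-bucket/openwebtext-tfrecords",
  "encoder_path": "gs://your-bucket/encoder",
  "predict_path": "logs/predictions.txt",
  "input": "openwebtext",
  "train_batch_size": 32,
  "eval_batch_size": 32,
  "predict_batch_size": 1,
  "n_ctx": 1024,
  "n_vocab": 50257,
  "n_embd": 768,
  "n_layer": 12,
  "n_head": 12,
  "opt_name": "adamW",
  "lr": 0.00025,
  "warmup_steps": 2000,
  "beta1": 0.9,
  "beta2": 0.98,
  "epsilon": 1e-9,
  "weight_decay": 0.01,
  "train_steps": 10000,
  "eval_steps": 100,
  "max_steps": 500000,
  "iterations": 100,
  "embed_dropout": 0.1,
  "attn_dropout": 0.1,
  "res_dropout": 0.1
}
```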
4 changes: 2 additions & 2 deletions main.py
@@ -76,7 +76,7 @@
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(args.tpu)

run_config = tf.contrib.tpu.RunConfig(
model_dir=params["model_dir"],
model_dir=params["model_path"],
cluster=tpu_cluster_resolver,
save_checkpoints_secs=60*10,
session_config=tf.ConfigProto(
@@ -100,7 +100,7 @@
# Non TPU setup
params["batch_size"] = params["train_batch_size"]
run_config = tf.estimator.RunConfig(
model_dir=params["model_dir"],
model_dir=params["model_path"],
session_config=tf.ConfigProto(
# log_device_placement=True,
# allow_soft_placement=True
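For context, main.py presumably builds this params dict by loading the .json passed on the command line; a rough sketch of the non-TPU path (not the file's exact code):

```python
# Sketch only: load the config and hand it to a TF 1.x Estimator run config.
import json
import tensorflow as tf

with open("GPT2-117M.json") as f:
    params = json.load(f)

params["batch_size"] = params["train_batch_size"]

run_config = tf.estimator.RunConfig(
    model_dir=params["model_path"],   # the key this commit renames from "model_dir"
    save_checkpoints_secs=60 * 10,
    session_config=tf.ConfigProto())

# estimator = tf.estimator.Estimator(model_fn=model_fn,
#                                    params=params, config=run_config)
```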