Commit 3471404

Merge pull request #128 from EleutherAI/orz

Update readme to load preshuffled datasets

haileyschoelkopf committed Nov 11, 2023
2 parents 0c3f210 + dc24af5 commit 3471404

Showing 5 changed files with 87 additions and 43 deletions.
63 changes: 42 additions & 21 deletions README.md
@@ -177,43 +177,64 @@ python3 main.py --model hf-causal-experimental --model_args pretrained=../gpt-n

We provide a tool to view particular portions of the training dataloader used by all models during training, at `utils/batch_viewer.py`.

First, we need to clone the Pythia repository:
```
git clone https://github.com/EleutherAI/pythia
cd pythia
```
Next, we must install dependencies:
```
pip install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch/
pip install numpy tqdm huggingface_hub
```

Next, we must download the appropriate dataset. We provide preshuffled versions of both the standard and the deduplicated Pile. Download the appropriate one using Hugging Face's utilities as follows:

> Tip: Make sure to replace `path/to/*` with the appropriate paths where you intend to save the datasets downloaded from Hugging Face.
- To download the standard version, use
```py
from huggingface_hub import hf_hub_download
# hf_hub_download fetches a single file; repeat this call for each of the 21 .bin shards and for document.idx
hf_hub_download(repo_id="EleutherAI/pile-standard-pythia-preshuffled", repo_type="dataset", filename="document-00000-of-00020.bin", cache_dir="path/to/local/folder")
```
- To download the deduplicated version, use
```py
from huggingface_hub import hf_hub_download
# As above; the shard filename is illustrative, so repeat for each shard and the index file
hf_hub_download(repo_id="EleutherAI/pile-deduped-pythia-preshuffled", repo_type="dataset", filename="document-00000-of-00020.bin", cache_dir="path/to/local/folder")
```
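
Alternatively, the whole dataset repository can be fetched in one call. A minimal sketch, assuming `snapshot_download` from the same `huggingface_hub` library:

```py
from huggingface_hub import snapshot_download

# Downloads every .bin shard plus the index file in one call; adjust repo_id and path as needed
snapshot_download(repo_id="EleutherAI/pile-standard-pythia-preshuffled", repo_type="dataset", cache_dir="path/to/local/folder")
```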

You can now merge the files by using the script `utils/unshard_mmap.py`:

```sh
python3 utils/unshard_mmap.py --input_file "path/to/local/folder/document-00000-of-00020.bin" --num_shards 21 --output_dir "path/to/merged/folder/"
```

Make sure to also copy the index file to the merged folder, using the command
```sh
cp path/to/local/folder/document.idx path/to/merged/folder/document.idx
```
Now, we're all set up to run `utils/batch_viewer.py`!

```sh
python3 utils/batch_viewer.py \
  --start_iteration 0 \
  --end_iteration 1000 \
  --load_path path/to/merged/folder/document \
  --save_path path/to/save/folder/ \
  --conf_dir utils/dummy_config.yml
```

This will save a separate file containing all the indices as a numpy array.

You can now load this using numpy as follows:

```py
import numpy as np

indices = np.load("path/to/save/folder/indicies.npy")  # note: batch_viewer.py writes the file as "indicies.npy"
```

These indices contain tokenized sequences of integers of shape (None, 2049), where each integer corresponds to a unique token index.
Note that documents are concatenated and separated by an `EOD` token, so a given sample or batch may not start with an EOD token. During training, target tokens are shifted left by 1: a model with sequence length 2048 therefore requires sequences of length 2049 for training (for more details, refer to [this comment](https://github.com/EleutherAI/pythia/issues/123#issuecomment-1791136253)).
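
For illustration, here is a minimal sketch of how one saved 2049-token sample decomposes into model inputs and targets (the indexing below is illustrative, not taken from the training code):

```py
import numpy as np

# one row of the saved (None, 2049) array
sample = np.load("path/to/save/folder/indicies.npy")[0]

inputs = sample[:-1]    # tokens 0..2047, fed to the model
targets = sample[1:]    # tokens 1..2048, the left-shifted training targets
assert inputs.shape == targets.shape == (2048,)
```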

## Pythia Paper Replication

8 changes: 6 additions & 2 deletions predictable-memorization/README.md
@@ -3,9 +3,11 @@
This folder documents our work using Pythia to study memorization of particular sequences in the training dataset, and includes instructions to reproduce our analyses where possible.

## Reproducing Memorization Results
The memorization evaluation script `memorization/eval_memorization.py` assumes that you are running the script in a distributed process, ideally via Slurm. It also assumes that you are using S3 to load and save Pythia's preshuffled Pile datasets (refer [here](https://github.com/EleutherAI/pythia/blob/main/README.md#dataset-viewer) for details on how to download them), though using a local filesystem for the preshuffled datasets is also supported.
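
The script reads its configuration from environment variables. A hedged sketch of the environment it expects (variable names are taken from `eval_memorization.py`; all values shown are hypothetical):

```py
import os

os.environ["MODEL"] = "pythia-1.4b-deduped"  # "deduped" in the name selects the deduplicated Pile
os.environ["CHECKPOINT"] = "143000"          # training step to evaluate
os.environ["RANK"] = "0"                     # set per process by your launcher
os.environ["WORLD_SIZE"] = "1"               # total number of processes
os.environ["MASTER_ADDR"] = "localhost"      # rendezvous host for torch.distributed
os.environ["BUCKET"] = "your-s3-bucket"      # only needed when loading from S3
```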

If you want to reproduce the evaluation, consider the following steps.

1. Change the `prefix` local variable of `generate_function()` to point to the right document path.

2. If you are not using [Slurm](https://slurm.schedmd.com/documentation.html), you need to change the global variables inside the script, such as `RANK` and `NUM_PROCS` (world size), to point to the right environment variables.

@@ -23,9 +25,11 @@

## Reproducing Figures

Refer to `memorization/eda.ipynb` for details on replication.

## Reproducing Scaling Laws Plots

Refer to `memorization/eda.ipynb` for details on replication.

## Citation Details

54 changes: 35 additions & 19 deletions predictable-memorization/eval_memorization.py
@@ -3,7 +3,8 @@
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), "..", "utils"))
from mmap_dataset import MMapIndexedDataset
import logging
import time
import datetime
@@ -19,7 +20,10 @@
import time
from tqdm import trange

def generate_dataset(batch_size, start_seq_idx, end_seq_idx, mp_queue,
                     using_s3 = False,
                     prefetch_max = 128
                     ):
    """Wrapper function to prefetch pile sequences

    Intended to run in a separate `multiprocessing.Process`, this function will continuously prefetch
@@ -30,6 +34,7 @@ def generate_dataset(...)
        start_seq_idx (int): Sequence index of first sequence to be evaluated by current rank
        end_seq_idx (int): Sequence index of last sequence to be evaluated by current rank
        mp_queue (multiprocessing.Queue): Instance of multiprocessing Queue, to add sequences into
        using_s3 (bool): If your datasets are located in s3, set this to True
        prefetch_max (int): Maximum number of sequences that can be pre-fetched into the queue

    Env Vars:
@@ -38,27 +43,32 @@ def generate_dataset(...)
"""

# Load Pile dataset
prefix = 'orz/pile/standard/document.bin'
prefix = '/scratch/pile/standard/document.bin'
if "deduped" in os.environ['MODEL']:
prefix = 'orz/pile/deduped/document.bin'
s3 = boto3.client('s3')
buff_size = 2049*1024*2
buff_size = 2049*batch_size*2
if using_s3 == False:
mmap_ds = MMapIndexedDataset(prefix, skip_warmup=True)

# Iterate over pile and add sequences to mp_queue
context_tokens = []
true_continuation = []
i = 0
for i in range(start_seq_idx, end_seq_idx + 1, buff_size // (2049*2)):
dataset = s3.get_object(
Bucket = 's-eai-neox-west',
Key = prefix,
Range = f'bytes={i*2049*2}-{i*2049*2 + buff_size}'
)
data = dataset['Body'].read(buff_size)
data = np.frombuffer(data, dtype = np.uint16).reshape(-1, 2049)
for i in range(start_seq_idx, end_seq_idx + 1, batch_size):
if using_s3:
dataset = s3.get_object(
Bucket = os.environ['BUCKET'],
Key = prefix,
Range = f'bytes={i*2049*2}-{i*2049*2 + buff_size}'
)
data = dataset['Body'].read(buff_size)
data = np.frombuffer(data, dtype = np.uint16).reshape(-1, 2049)
else:
data = mmap_ds[i:i+batch_size]
context_tokens.extend(data[:, :32].tolist())
true_continuation.extend(data[:,32:64].tolist())
i += buff_size // (2049*2)
i += len(context_tokens)

if len(context_tokens) == batch_size:
# (start index of batch, context tokens, true continuation)
@@ -95,8 +105,8 @@ def score(model, context_tokens, true_continuation):
        accuracies (torch.Tensor): Accuracies of shape (batch_size,)
    """
    with torch.no_grad():
        context_tokens = torch.tensor(context_tokens).to('cuda')
        true_continuation = torch.tensor(true_continuation).to('cuda')

        # temperature 0.0 with top_k/top_p disabled gives greedy decoding
        generations = model.generate(context_tokens, temperature = 0.0, top_k = 0, top_p = 0, max_length = 64, min_length = 64)

@@ -114,23 +124,29 @@ def main():
    RANK = int(os.environ['RANK'])
    LOCAL_RANK = RANK
    NUM_PROCS = int(os.environ['WORLD_SIZE'])

    # Eval configuration variables
    MODEL = os.environ['MODEL']
    CHECKPOINT = int(os.environ['CHECKPOINT'])

    # Distributed initializations
    # os.environ['MASTER_ADDR'] = os.environ['SLURM_LAUNCH_NODE_IPADDR']
    # os.environ['MASTER_PORT'] = '12128'
    logging.basicConfig(format = f'rank-{RANK}:' + '%(levelname)s:%(message)s', level = logging.INFO)
    logging.info(f"Initializing torch distributed with gpus {torch.cuda.device_count()}")

    # Initialize torch distributed
    torch.cuda.set_device(RANK)
    dist.init_process_group(
        "nccl",
        world_size = NUM_PROCS,
        rank = RANK
    )
    store = dist.TCPStore(os.environ['MASTER_ADDR'], port = 12125,
        world_size = NUM_PROCS, is_master = RANK == 0, timeout = datetime.timedelta(hours=3))

    dist.barrier()

@@ -192,7 +208,7 @@ def main():
    s3 = boto3.client('s3')
    s3.put_object(
        Body = '\n'.join(memorization_evals).encode(),
        Bucket = os.environ['BUCKET'],
        Key = f'memorization-evals/evals-running/memorization_{MODEL}_{CHECKPOINT}/rank-{RANK}.csv'
    )
    dist.barrier()
2 changes: 1 addition & 1 deletion utils/batch_viewer.py
@@ -39,6 +39,6 @@
    filename = os.path.join(args.save_path, "indicies.npy")

    dataset = MMapIndexedDataset(args.load_path, skip_warmup = True)
    # each training iteration consumes a batch of 1024 sequences, so scale iteration indices to sequence indices
    indicies = dataset[args.start_iteration*1024: args.end_iteration*1024 + 1]
    np.save(filename, indicies)

3 changes: 3 additions & 0 deletions utils/mmap_dataset.py
@@ -171,6 +171,9 @@ def __init__(self, path, skip_warmup=False):
        self._index = None
        self._bin_buffer = None

        # accept dataset paths given with or without the .bin/.idx extension
        if path.endswith(".bin") or path.endswith(".idx"):
            path = path[:-4]

        self._do_init(path, skip_warmup)

    def __getstate__(self):