Added MPU from Sid's MegatronPipeline #88

Closed

Conversation

@glebshevchukk

Issue #75: Added the mpu code that Sid wrote for https://github.com/EleutherAI/MegatronPipeline and updated the train.py file to use it for model building. Only tested on a single machine / GPU, so I'm not sure if it works correctly across machines.

Also includes changes for Issue #40: the previous dataset for TFRecords plus another one that loads directly from JSON.

To run, try
sh scripts/train_gpt3small.sh
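
For readers who want a concrete picture of what "loads directly from JSON" can mean, here is a rough, hypothetical sketch of a dataset that serves samples from a directory of sharded .jsonl files. This is not the code in this PR; the class name, the "text" field, the tokenizer interface, and the shard layout are all assumptions made for illustration only.

```python
import json
from pathlib import Path

from torch.utils.data import Dataset


class ShardedJsonlDataset(Dataset):
    """Illustrative sketch only: serve tokenized sequences from *.jsonl shards.

    The PR's actual JsonShardedDataset may differ in every detail.
    """

    def __init__(self, shard_dir, tokenize, seq_len):
        self.tokenize = tokenize  # callable: str -> list[int] (assumed interface)
        self.seq_len = seq_len
        self.samples = []
        # Read every shard up front; a real implementation would likely index lazily.
        for shard in sorted(Path(shard_dir).glob("*.jsonl")):
            with shard.open() as f:
                for line in f:
                    self.samples.append(json.loads(line)["text"])  # "text" key is assumed

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        tokens = self.tokenize(self.samples[idx])
        return tokens[: self.seq_len]
```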

@sdtblck commented Jan 23, 2021

Awesome! This is a lot of changes; I'll try to look through it tomorrow.

@StellaAthena

I believe I have correctly resolved conflicts between the two branches. train.py has been touchy recently though, and I want to take a second look at it.

@StellaAthena linked an issue Jan 23, 2021 that may be closed by this pull request
from torch.utils.data.dataloader import default_collate
import pathlib
from functools import partial
import logging

These new imports are not used

"""
Dataset that gets sequences from a set of sharded jsonl files
"""
class JsonShardedDataset(Dataset):

It would be helpful to have the doc-comment attached to the class. So do:

class JsonShardedDataset(Dataset):
    """
    Dataset that gets sequences from a set of sharded jsonl files
    """

instead of it dangling above.

@StellaAthena commented Jan 26, 2021

@ShivanshuPurohit reports getting this to work on two RTX 3090 GPUs using the following config:

{
  "train_batch_size": 1280,
  "gradient_accumulation_steps": 80,
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": true,
  "tensorboard": {
    "enabled": true,
    "output_path": "./logs",
    "job_name": "gptneox"
  },
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 2e-4,
      "freeze_step": 2,
      "cuda_aware": true
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.00015,
      "warmup_num_steps": 5000
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": 1,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
}
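
As a point of reference for how a config like this gets consumed: it is passed on the command line via --deepspeed_config (visible in the launch command in the logs below) and handed to deepspeed.initialize, which reads the batch size, optimizer, fp16, and ZeRO settings from the JSON. A minimal sketch, assuming the standard DeepSpeed entry point rather than whatever train.py actually does (build_engine is just an illustrative name):

```python
import argparse

import deepspeed
import torch


def build_engine(model: torch.nn.Module):
    # Parse the standard DeepSpeed flags (--deepspeed, --deepspeed_config, ...).
    parser = argparse.ArgumentParser()
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    # The engine picks up everything else (optimizer, scheduler, ZeRO stage, fp16)
    # from the JSON file named by --deepspeed_config.
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
    )
    return engine, optimizer, lr_scheduler
```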

@StellaAthena left a comment

This fails rather egregiously, though I am not totally sure why. Are you hardcoding a path somewhere?

Here's the complete output:

root@eleuther-neox-576949d46-98xcg:/app# sh scripts/train_gpt3small.sh
[2021-01-26 18:08:43,241] [INFO] [runner.py:286:main] Using IP address of 10.140.82.188 for node 10.140.82.188
[2021-01-26 18:08:43,243] [INFO] [multinode_runner.py:51:get_cmd] Running on the following workers: 10.140.82.188,10.140.23.80,10.141.250.227,10.141.113.189
[2021-01-26 18:08:43,243] [INFO] [runner.py:358:main] cmd = pdsh -f 1024 -w 10.140.82.188,10.140.23.80,10.141.250.227,10.141.113.189 export NCCL_SHM_DISABLE=1; export NCCL_DEBUG=info; export PYTHONPATH=/app;  cd /app; /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyIxMC4xNDAuODIuMTg4IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiMTAuMTQwLjIzLjgwIjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiMTAuMTQxLjI1MC4yMjciOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN10sICIxMC4xNDEuMTEzLjE4OSI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=10.140.82.188 --master_port=29500 train_mpu.py --deepspeed --deepspeed_config 'configs/deepspeed_zero1.json'
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:71:main] 0 NCCL_SHM_DISABLE 1
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:71:main] 0 NCCL_DEBUG info
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=0
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:100:main] dist_world_size=32
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:71:main] 1 NCCL_SHM_DISABLE 1
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:71:main] 1 NCCL_DEBUG info
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=1
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:100:main] dist_world_size=32
10.140.23.80: [2021-01-26 18:10:11,012] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:71:main] 2 NCCL_SHM_DISABLE 1
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:71:main] 2 NCCL_DEBUG info
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=2
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:100:main] dist_world_size=32
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: [2021-01-26 18:10:28,113] [INFO] [launch.py:71:main] 3 NCCL_SHM_DISABLE 1
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:71:main] 3 NCCL_DEBUG info
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=3
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:100:main] dist_world_size=32
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753082: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753082: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753112: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.753112: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.756943: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.756974: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.771442: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.771471: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.780042: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.780077: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812464: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812463: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812463: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)

When I run it on only one node, I get much the same thing:

mkdir: cannot create directory 'logs': File exists
[2021-01-26 18:14:48,754] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-26 18:14:49,262] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 train_mpu.py --deepspeed --deepspeed_config configs/deepspeed_zero1.json
[2021-01-26 18:14:49,896] [INFO] [launch.py:71:main] 0 NCCL_SHM_DISABLE 1
[2021-01-26 18:14:49,896] [INFO] [launch.py:71:main] 0 NCCL_DEBUG info
[2021-01-26 18:14:49,896] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2021-01-26 18:14:49,896] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=8, node_rank=0
[2021-01-26 18:14:49,896] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2021-01-26 18:14:49,896] [INFO] [launch.py:100:main] dist_world_size=8
[2021-01-26 18:14:49,896] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
2021-01-26 18:14:51.106938: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.106938: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.106973: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.106975: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.113718: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.113749: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.114798: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.114826: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.129631: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.129671: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.166912: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.166943: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.189289: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.189318: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.196366: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.196392: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)

@glebshevchukk

Fixed, I forgot to remove one line when getting rid of the JSON sharded stuff -_-
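
For context, the stale line is visible in the tracebacks above; the fix is presumably just dropping the removed name from the import in gpt_neox/__init__.py, roughly like this (the original line is copied from the traceback, the exact fix is an assumption):

```python
# gpt_neox/__init__.py -- sketch of the likely one-line fix
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset  # JsonShardedDataset dropped
```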

@leogao2 commented Jan 26, 2021

@StellaAthena I think the reason is that the containers are out of date. The fix would be either to set up some kind of pipeline to automatically rebuild the containers, or to patch it by adding a git pull to the kubernetes script.
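
A minimal version of that patch could be as simple as having the kubernetes script run something like `cd /app && git pull` before it invokes `sh scripts/train_gpt3small.sh`; the `/app` path and the script name are taken from the logs above, and which script to modify is an assumption.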

@leogao2 commented Jan 26, 2021

This commit should fix the problem


Successfully merging this pull request may close these issues.

Implement the MPU from Megatron