Added MPU from Sid's MegatronPipeline #88

Closed

Conversation

@glebshevchukk

Issue #75: Added the mpu code that Sid wrote for https://github.com/EleutherAI/MegatronPipeline and updated the train.py file to use it for model building. Only tested on a single machine / GPU, so I'm not sure if it works correctly across machines.

Also includes changes for Issue #40: the previous dataset for TFRecords plus another one that loads directly from JSON.

To run, try
sh scripts/train_gpt3small.sh
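
For readers who want a concrete picture of what "loads directly from JSON" can mean, here is a rough, hypothetical sketch of a dataset that serves samples from a directory of sharded .jsonl files. This is not the code in this PR; the class name, the "text" field, the tokenizer interface, and the shard layout are all assumptions made for illustration only.

```python
import json
from pathlib import Path

from torch.utils.data import Dataset


class ShardedJsonlDataset(Dataset):
    """Illustrative sketch only: serve tokenized sequences from *.jsonl shards.

    The PR's actual JsonShardedDataset may differ in every detail.
    """

    def __init__(self, shard_dir, tokenize, seq_len):
        self.tokenize = tokenize  # callable: str -> list[int] (assumed interface)
        self.seq_len = seq_len
        self.samples = []
        # Read every shard up front; a real implementation would likely index lazily.
        for shard in sorted(Path(shard_dir).glob("*.jsonl")):
            with shard.open() as f:
                for line in f:
                    self.samples.append(json.loads(line)["text"])  # "text" key is assumed

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        tokens = self.tokenize(self.samples[idx])
        return tokens[: self.seq_len]
```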

@sdtblck commented Jan 23, 2021

Awesome! This is a lot of changes; I'll try to look through it tomorrow.

@StellaAthena

I believe I have correctly resolved conflicts between the two branches. train.py has been touchy recently though, and I want to take a second look at it.

@StellaAthena linked an issue Jan 23, 2021 that may be closed by this pull request
from torch.utils.data.dataloader import default_collate
import pathlib
from functools import partial
import logging

These new imports are not used

"""
Dataset that gets sequences from a set of sharded jsonl files
"""
class JsonShardedDataset(Dataset):

It would be helpful to have the doc-comment attached to the class. So do:

class JsonShardedDataset(Dataset):
    """
    Dataset that gets sequences from a set of sharded jsonl files
    """

instead of it dangling above.

@StellaAthena commented Jan 26, 2021

@ShivanshuPurohit reports getting this to work on two RTX 3090 GPUs using the following config:

{
  "train_batch_size": 1280,
  "gradient_accumulation_steps": 80,
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": true,
  "tensorboard": {
    "enabled": true,
    "output_path": "./logs",
    "job_name": "gptneox"
  },
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 2e-4,
      "freeze_step": 2,
      "cuda_aware": true
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.00015,
      "warmup_num_steps": 5000
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": 1,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
}
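
As a point of reference for how a config like this gets consumed: it is passed on the command line via --deepspeed_config (visible in the launch command in the logs below) and handed to deepspeed.initialize, which reads the batch size, optimizer, fp16, and ZeRO settings from the JSON. A minimal sketch, assuming the standard DeepSpeed entry point rather than whatever train.py actually does (build_engine is just an illustrative name):

```python
import argparse

import deepspeed
import torch


def build_engine(model: torch.nn.Module):
    # Parse the standard DeepSpeed flags (--deepspeed, --deepspeed_config, ...).
    parser = argparse.ArgumentParser()
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    # The engine picks up everything else (optimizer, scheduler, ZeRO stage, fp16)
    # from the JSON file named by --deepspeed_config.
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
    )
    return engine, optimizer, lr_scheduler
```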

@StellaAthena left a comment

This fails rather egregiously, though I am not totally sure why. Are you hardcoding a path somewhere?

Here's the complete output:

root@eleuther-neox-576949d46-98xcg:/app# sh scripts/train_gpt3small.sh
[2021-01-26 18:08:43,241] [INFO] [runner.py:286:main] Using IP address of 10.140.82.188 for node 10.140.82.188
[2021-01-26 18:08:43,243] [INFO] [multinode_runner.py:51:get_cmd] Running on the following workers: 10.140.82.188,10.140.23.80,10.141.250.227,10.141.113.189
[2021-01-26 18:08:43,243] [INFO] [runner.py:358:main] cmd = pdsh -f 1024 -w 10.140.82.188,10.140.23.80,10.141.250.227,10.141.113.189 export NCCL_SHM_DISABLE=1; export NCCL_DEBUG=info; export PYTHONPATH=/app;  cd /app; /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyIxMC4xNDAuODIuMTg4IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiMTAuMTQwLjIzLjgwIjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiMTAuMTQxLjI1MC4yMjciOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN10sICIxMC4xNDEuMTEzLjE4OSI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=10.140.82.188 --master_port=29500 train_mpu.py --deepspeed --deepspeed_config 'configs/deepspeed_zero1.json'
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:71:main] 0 NCCL_SHM_DISABLE 1
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:71:main] 0 NCCL_DEBUG info
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=0
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:100:main] dist_world_size=32
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:71:main] 1 NCCL_SHM_DISABLE 1
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:71:main] 1 NCCL_DEBUG info
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=1
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:100:main] dist_world_size=32
10.140.23.80: [2021-01-26 18:10:11,012] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:71:main] 2 NCCL_SHM_DISABLE 1
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:71:main] 2 NCCL_DEBUG info
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=2
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:100:main] dist_world_size=32
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: [2021-01-26 18:10:28,113] [INFO] [launch.py:71:main] 3 NCCL_SHM_DISABLE 1
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:71:main] 3 NCCL_DEBUG info
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=3
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:100:main] dist_world_size=32
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753082: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753082: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753112: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.753112: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.756943: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.756974: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.771442: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.771471: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.780042: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.780077: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812464: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812463: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812463: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188:   File "train_mpu.py", line 8, in <module>
10.140.82.188:     from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188:   File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188:     from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)

When I run it on only one node, I get much the same thing:

mkdir: cannot create directory 'logs': File exists
[2021-01-26 18:14:48,754] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-26 18:14:49,262] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 train_mpu.py --deepspeed --deepspeed_config configs/deepspeed_zero1.json
[2021-01-26 18:14:49,896] [INFO] [launch.py:71:main] 0 NCCL_SHM_DISABLE 1
[2021-01-26 18:14:49,896] [INFO] [launch.py:71:main] 0 NCCL_DEBUG info
[2021-01-26 18:14:49,896] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2021-01-26 18:14:49,896] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=8, node_rank=0
[2021-01-26 18:14:49,896] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2021-01-26 18:14:49,896] [INFO] [launch.py:100:main] dist_world_size=8
[2021-01-26 18:14:49,896] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
2021-01-26 18:14:51.106938: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.106938: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.106973: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.106975: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.113718: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.113749: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.114798: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.114826: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.129631: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.129671: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.166912: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.166943: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.189289: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.189318: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.196366: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.196392: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
  File "train_mpu.py", line 8, in <module>
    from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
  File "/app/gpt_neox/__init__.py", line 3, in <module>
    from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)

@glebshevchukk

Fixed, I forgot to remove one line when getting rid of the JSON sharded stuff -_-
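
For context, the stale line is visible in the tracebacks above; the fix is presumably just dropping the removed name from the import in gpt_neox/__init__.py, roughly like this (the original line is copied from the traceback, the exact fix is an assumption):

```python
# gpt_neox/__init__.py -- sketch of the likely one-line fix
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset  # JsonShardedDataset dropped
```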

@leogao2 commented Jan 26, 2021

@StellaAthena I think the reason is that the containers are out of date. The fix would be either to set up some kind of pipeline to automatically rebuild the containers, or to patch it by adding a git pull to the kubernetes script.
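
A minimal version of that patch could be as simple as having the kubernetes script run something like `cd /app && git pull` before it invokes `sh scripts/train_gpt3small.sh`; the `/app` path and the script name are taken from the logs above, and which script to modify is an assumption.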

@leogao2 commented Jan 26, 2021

This commit should fix the problem


Successfully merging this pull request may close these issues.

Implement the MPU from Megatron