Added MPU from Sid's MegatronPipeline #88
Awesome! This is a lot of changes; I'll try to look through it tomorrow.
I believe I have correctly resolved conflicts between the two branches.
gpt_neox/data_utils.py
from torch.utils.data.dataloader import default_collate
import pathlib
from functools import partial
import logging
These new imports are not used
gpt_neox/datasets.py
"""
Dataset that gets sequences from a set of sharded jsonl files
"""
class JsonShardedDataset(Dataset):
It would be helpful to have the docstring attached to the class. So do:
    class JsonShardedDataset(Dataset):
        """
        Dataset that gets sequences from a set of sharded jsonl files
        """
instead of leaving it dangling above the class.
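To make the difference concrete, here is a minimal sketch. The class names `Base`, `DanglingDoc`, and `AttachedDoc` are made up for illustration, and `Base` only stands in for `Dataset` so the snippet runs without torch. A string placed above a class is just a discarded expression, so only the attached version ends up in `__doc__`:

```python
class Base:  # hypothetical stand-in for torch.utils.data.Dataset
    pass


# The string below is a bare expression statement; Python evaluates it
# and throws it away, so it never becomes DanglingDoc's docstring.
"""
Dataset that gets sequences from a set of sharded jsonl files
"""
class DanglingDoc(Base):
    pass


class AttachedDoc(Base):
    """
    Dataset that gets sequences from a set of sharded jsonl files
    """


print(DanglingDoc.__doc__)          # no docstring was attached
print(AttachedDoc.__doc__.strip())  # the text that help() would show
```

With the attached form, `help(JsonShardedDataset)` and most documentation tools should pick the text up.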
@ShivanshuPurohit reports getting this to work on two RTX 3090 GPUs using the following configs
This fails rather egregiously, though I am not totally sure why. Are you hardcoding a path somewhere?
Here's the complete output:
root@eleuther-neox-576949d46-98xcg:/app# sh scripts/train_gpt3small.sh
[2021-01-26 18:08:43,241] [INFO] [runner.py:286:main] Using IP address of 10.140.82.188 for node 10.140.82.188
[2021-01-26 18:08:43,243] [INFO] [multinode_runner.py:51:get_cmd] Running on the following workers: 10.140.82.188,10.140.23.80,10.141.250.227,10.141.113.189
[2021-01-26 18:08:43,243] [INFO] [runner.py:358:main] cmd = pdsh -f 1024 -w 10.140.82.188,10.140.23.80,10.141.250.227,10.141.113.189 export NCCL_SHM_DISABLE=1; export NCCL_DEBUG=info; export PYTHONPATH=/app; cd /app; /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyIxMC4xNDAuODIuMTg4IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiMTAuMTQwLjIzLjgwIjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiMTAuMTQxLjI1MC4yMjciOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN10sICIxMC4xNDEuMTEzLjE4OSI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XX0= --node_rank=%n --master_addr=10.140.82.188 --master_port=29500 train_mpu.py --deepspeed --deepspeed_config 'configs/deepspeed_zero1.json'
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:71:main] 0 NCCL_SHM_DISABLE 1
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:71:main] 0 NCCL_DEBUG info
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=0
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:100:main] dist_world_size=32
10.140.82.188: [2021-01-26 18:08:44,573] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:71:main] 1 NCCL_SHM_DISABLE 1
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:71:main] 1 NCCL_DEBUG info
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=1
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.140.23.80: [2021-01-26 18:10:11,011] [INFO] [launch.py:100:main] dist_world_size=32
10.140.23.80: [2021-01-26 18:10:11,012] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:71:main] 2 NCCL_SHM_DISABLE 1
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:71:main] 2 NCCL_DEBUG info
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=2
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:100:main] dist_world_size=32
10.141.250.227: [2021-01-26 18:09:25,110] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.140.23.80: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: [2021-01-26 18:10:28,113] [INFO] [launch.py:71:main] 3 NCCL_SHM_DISABLE 1
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:71:main] 3 NCCL_DEBUG info
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:78:main] WORLD INFO DICT: {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.250.227': [0, 1, 2, 3, 4, 5, 6, 7], '10.141.113.189': [0, 1, 2, 3, 4, 5, 6, 7]}
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:84:main] nnodes=4, num_local_procs=8, node_rank=3
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'10.140.82.188': [0, 1, 2, 3, 4, 5, 6, 7], '10.140.23.80': [8, 9, 10, 11, 12, 13, 14, 15], '10.141.250.227': [16, 17, 18, 19, 20, 21, 22, 23], '10.141.113.189': [24, 25, 26, 27, 28, 29, 30, 31]})
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:100:main] dist_world_size=32
10.141.113.189: [2021-01-26 18:10:28,114] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.250.227: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.141.113.189: /usr/bin/python3: can't open file 'train_mpu.py': [Errno 2] No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753082: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753082: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.753112: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.753112: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.756943: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.756974: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.771442: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.771471: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.780042: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.780077: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812464: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812463: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812463: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: 2021-01-26 18:08:45.812498: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
10.140.82.188: Traceback (most recent call last):
10.140.82.188: File "train_mpu.py", line 8, in <module>
10.140.82.188: from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
10.140.82.188: File "/app/gpt_neox/__init__.py", line 3, in <module>
10.140.82.188: from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
10.140.82.188: ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
When I run it on only one node, I get much the same thing:
mkdir: cannot create directory 'logs': File exists
[2021-01-26 18:14:48,754] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-26 18:14:49,262] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 train_mpu.py --deepspeed --deepspeed_config configs/deepspeed_zero1.json
[2021-01-26 18:14:49,896] [INFO] [launch.py:71:main] 0 NCCL_SHM_DISABLE 1
[2021-01-26 18:14:49,896] [INFO] [launch.py:71:main] 0 NCCL_DEBUG info
[2021-01-26 18:14:49,896] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2021-01-26 18:14:49,896] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=8, node_rank=0
[2021-01-26 18:14:49,896] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2021-01-26 18:14:49,896] [INFO] [launch.py:100:main] dist_world_size=8
[2021-01-26 18:14:49,896] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
2021-01-26 18:14:51.106938: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.106938: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.106973: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.106975: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.113718: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.113749: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.114798: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.114826: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.129631: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.129671: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.166912: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.166943: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.189289: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.189318: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-26 18:14:51.196366: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-26 18:14:51.196392: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
Traceback (most recent call last):
File "train_mpu.py", line 8, in <module>
from gpt_neox import (GPTNeoX, AutoregressiveWrapper, TFRecordDataset, extract_tarfile,
File "/app/gpt_neox/__init__.py", line 3, in <module>
from gpt_neox.datasets import TextSamplerDataset, TFRecordDataset,JsonShardedDataset
ImportError: cannot import name 'JsonShardedDataset' from 'gpt_neox.datasets' (/app/gpt_neox/datasets.py)
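As an aside for anyone reading these logs: the `--world_info` argument in the launch commands appears to be base64-encoded JSON that maps each host to its GPU indices. A small sketch for decoding it, using the single-node value copied from the command above (the helper name `decode_world_info` is made up here, and the URL-safe base64 alphabet is an assumption):

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a --world_info value into a {host: [gpu indices]} dict."""
    # Assumption: the launcher used URL-safe base64 on a JSON string.
    return json.loads(base64.urlsafe_b64decode(encoded))


# Value copied verbatim from the single-node launch command above.
single_node = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
print(decode_world_info(single_node))
```

The decoded value should match the `WORLD INFO DICT` line that the launcher prints right after starting.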
Fixed. I forgot to remove one line when getting rid of the JSON sharded stuff -_-
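For context, the failure mode in the tracebacks can be reproduced with a throwaway package. This is only a sketch of the general pattern, not the repo's code: `demo_pkg` and `import_demo_package` are made-up names. An `__init__.py` that still imports a name no longer defined in the module fails at import time, before any training code runs:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_demo_package(datasets_source: str) -> str:
    """Build a temporary package and report whether importing it works."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "datasets.py").write_text(datasets_source)
        # __init__.py re-exports two names, like gpt_neox/__init__.py does.
        (pkg / "__init__.py").write_text(
            "from demo_pkg.datasets import TextSamplerDataset, JsonShardedDataset\n"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_pkg")
            return "ok"
        except ImportError as err:
            return f"ImportError: {err}"
        finally:
            # Undo the path change and forget the throwaway modules.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
                del sys.modules[name]


# The class was removed from datasets.py, but __init__.py still imports it.
print(import_demo_package("class TextSamplerDataset: pass\n"))
# Both names are defined again, so the import succeeds.
print(import_demo_package(
    "class TextSamplerDataset: pass\nclass JsonShardedDataset: pass\n"
))
```

Either restoring the class or dropping the stale name from the `__init__.py` import line makes the package importable again.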
@StellaAthena I think the reason is that the containers are out of date. The fix would be either to set up some kind of pipeline that builds the containers automatically, or to patch it by adding a git pull to the Kubernetes script.
This commit should fix the problem.
Issue #75: Added the mpu code that Sid wrote for https://github.com/EleutherAI/MegatronPipeline and updated the train.py file to use it for model building. I have only tested this on a single machine / GPU, so I'm not sure whether it works correctly across machines.
Also includes changes for Issue #40: the previous dataset for TFRecords and another one for loading directly from JSON.
To run, try
sh scripts/train_gpt3small.sh