Expected behavior
Setting ds_accelerator to cuda (auto detect)
[2023-11-14 16:09:59,189] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-11-14 16:09:59,189] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-11-14 16:09:59,189] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-11-14 16:09:59,189] [INFO] [launch.py:163:main] dist_world_size=4
[2023-11-14 16:09:59,189] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
[2023-11-14 16:10:05,177] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,177] [INFO] [comm.py:594:init_distributed] cdb=None
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 4
building HFTokenizer tokenizer ...
[2023-11-14 16:10:05,209] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,209] [INFO] [comm.py:594:init_distributed] cdb=None
padded vocab (size: 50277) with 411 dummy tokens (new size: 50688)
setting tensorboard ...
initializing torch distributed ...
[2023-11-14 16:10:05,376] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,376] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-11-14 16:10:05,376] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-11-14 16:10:05,405] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,405] [INFO] [comm.py:594:init_distributed] cdb=None
initializing model parallel with size 4
MPU DP: [0]
MPU DP: [1]
MPU DP: [2]
MPU DP: [3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0, 1, 2, 3]
setting random seeds to 1234 ...
[2023-11-14 16:10:06,287] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/media/h/nvme/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/media/h/nvme/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3}
[2023-11-14 16:10:06,697] [INFO] [module.py:358:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=37
0: EmbeddingPipe
1: _pre_transformer_block
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: ParallelTransformerLayerPipe
15: ParallelTransformerLayerPipe
16: ParallelTransformerLayerPipe
17: ParallelTransformerLayerPipe
18: ParallelTransformerLayerPipe
19: ParallelTransformerLayerPipe
20: ParallelTransformerLayerPipe
21: ParallelTransformerLayerPipe
22: ParallelTransformerLayerPipe
23: ParallelTransformerLayerPipe
24: ParallelTransformerLayerPipe
25: ParallelTransformerLayerPipe
26: ParallelTransformerLayerPipe
27: ParallelTransformerLayerPipe
28: ParallelTransformerLayerPipe
29: ParallelTransformerLayerPipe
30: ParallelTransformerLayerPipe
31: ParallelTransformerLayerPipe
32: ParallelTransformerLayerPipe
33: ParallelTransformerLayerPipe
34: _post_transformer_block
35: NormPipe
36: ParallelLinearPipe
loss: partial
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/media/h/nvme/gpt-neox/megatron/training.py", line 192, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/media/h/nvme/gpt-neox/megatron/training.py", line 633, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/media/h/nvme/gpt-neox/megatron/training.py", line 407, in get_model
    model = GPT2ModelPipe(
  File "/media/h/nvme/gpt-neox/megatron/model/gpt2_model.py", line 127, in __init__
    super().__init__(
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 199, in __init__
    self._build()
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 246, in _build
    module = layer.build()
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 73, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/media/h/nvme/gpt-neox/megatron/model/transformer.py", line 759, in __init__
    self.attention = ParallelSelfAttention(
  File "/media/h/nvme/gpt-neox/megatron/model/transformer.py", line 351, in __init__
    from megatron.model.flash_attention import (
  File "/media/h/nvme/gpt-neox/megatron/model/flash_attention.py", line 7, in <module>
    from flash_attn import flash_attn_triton
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 8, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
[2023-11-14 16:10:09,253] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345637
[2023-11-14 16:10:09,275] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345638
[2023-11-14 16:10:09,276] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345639
[2023-11-14 16:10:09,296] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345640
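The undefined symbol in the ImportError is a mangled C++ name. As a rough aid for reading it, the sketch below decodes simple Itanium-ABI nested names (plain length-prefixed identifiers only; templates, parameters and substitutions are not handled). The helper name `demangle_nested_name` is made up for this illustration; `c++filt` does the same job properly.

```python
import re


def demangle_nested_name(symbol: str) -> str:
    """Decode a simple nested name such as _ZN3foo3barE into foo::bar."""
    if not (symbol.startswith("_ZN") and symbol.endswith("E")):
        raise ValueError("not a simple nested name: " + symbol)
    body = symbol[3:-1]  # drop the "_ZN" prefix and the closing "E"
    parts = []
    pos = 0
    while pos < len(body):
        # Each component is a decimal length followed by that many characters.
        match = re.match(r"\d+", body[pos:])
        if match is None:
            raise ValueError("unsupported mangling at offset %d" % pos)
        length = int(match.group())
        start = pos + match.end()
        if start + length > len(body):
            raise ValueError("truncated component at offset %d" % pos)
        parts.append(body[start:start + length])
        pos = start + length
    return "::".join(parts)


if __name__ == "__main__":
    print(demangle_nested_name("_ZN3c104cuda20CUDACachingAllocator9allocatorE"))
```

The decoded name lives in the `c10::cuda` namespace, i.e. in PyTorch's own CUDA library rather than in flash-attn. That would be consistent with the flash-attn extension having been compiled against a different PyTorch build than the installed torch 1.13.1, although this is an inference from the symbol name, not something the log confirms.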
Describe the bug
Training fails when flash attention is enabled.
Global attention works.
To Reproduce
Steps to reproduce the behavior:
{
  "pipe_parallel_size": 0,
  "model_parallel_size": 4,
  "num_layers": 32,
  "hidden_size": 2560,
  "num_attention_heads": 32,
  "seq_length": 2048,
  "max_position_embeddings": 2048,
  "pos_emb": "rotary",
  "rotary_pct": 0.25,
  "no_weight_tying": true,
  "gpt_j_residual": true,
  "output_layer_parallelism": "column",
  "attention_config": [[["flash"], 32]],
  "scaled_upper_triang_masked_softmax_fusion": true,
  "bias_gelu_fusion": true,
  "init_method": "small_init",
  "output_layer_init_method": "wang_init",
  "optimizer": {
    "type": "CPU_Adam",
    "params": {
      "lr": 0.00016,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },
  "min_lr": 1.6e-05,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
  },
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 2,
  "data_impl": "mmap",
  "num_workers": 1,
  "checkpoint_activations": true,
  "checkpoint_num_layers": 1,
  "partition_activations": true,
  "synchronize_each_layer": true,
  "gradient_clipping": 1.0,
  "weight_decay": 0.1,
  "hidden_dropout": 0,
  "attention_dropout": 0,
  "fp16": {
    "fp16": true,
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "train_iters": 143000,
  "lr_decay_iters": 143000,
  "distributed_backend": "nccl",
  "lr_decay_style": "cosine",
  "warmup": 0.01,
  "checkpoint_factor": 1000,
  "extra_save_iters": [64,128,256,512],
  "eval_interval": 40000,
  "eval_iters": 10,
  "log_grad_norm": true,
  "log_interval": 10,
  "steps_per_print": 10,
  "wall_clock_breakdown": true,
  "tokenizer_type": "HFTokenizer"
}
Proposed solution
I don't know what to do. I guess there is a mismatch between the FlashAttention version and the PyTorch version, or between FlashAttention and the system CUDA.
I have tested gpt-neox v1 and v2 in combination with CUDA 11.8 and 12.3.
FlashAttention v2 reportedly does not support Turing GPUs, but I face the same problem with FlashAttention v1.
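To narrow down the suspected version mismatch, a small check like the following could be run inside the same virtualenv. It is only a sketch: `probe` and `report` are hypothetical helper names, and the module names are the ones that appear in the traceback.

```python
import importlib


def probe(module_name):
    """Try to import a module; return (ok, detail)."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # an undefined-symbol failure surfaces as ImportError
        return False, "%s: %s" % (type(exc).__name__, exc)
    return True, str(getattr(module, "__version__", "no __version__ attribute"))


def report():
    # torch is probed first so its shared libraries are already loaded
    # when the flash-attn extension module is imported.
    for name in ("torch", "flash_attn", "flash_attn_2_cuda"):
        ok, detail = probe(name)
        print("%-18s %-4s %s" % (name, "ok" if ok else "FAIL", detail))
    ok, _ = probe("torch")
    if ok:
        import torch
        # CUDA version PyTorch itself was built with (None for CPU-only builds).
        print("torch built for CUDA:", torch.version.cuda)


if __name__ == "__main__":
    report()
```

If `flash_attn_2_cuda` fails here with the same undefined symbol while `torch` imports fine, the problem is reproducible outside gpt-neox. One thing that may then be worth trying is uninstalling flash-attn and rebuilding it in this environment with `pip install flash-attn --no-build-isolation`, so that it compiles against the installed torch.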
Environment (please complete the following information):
absl-py 2.0.0
aiohttp 3.8.6
aiosignal 1.3.1
anyio 3.7.1
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.1.0
autopep8 2.0.4
best-download 0.0.9
boto3 1.28.84
botocore 1.31.84
cachetools 5.3.2
certifi 2023.7.22
cfgv 3.4.0
chardet 5.2.0
charset-normalizer 3.3.2
clang-format 17.0.4
click 8.1.7
cmake 3.27.7
colorama 0.4.6
coverage 7.3.2
cupy-cuda111 12.2.0
DataProperty 1.0.1
datasets 2.14.6
deepspeed 0.9.3+a48c649
dill 0.3.7
distlib 0.3.7
distro 1.8.0
docker-pycreds 0.4.0
einops 0.7.0
exceptiongroup 1.1.3
execnet 2.0.2
fastrlock 0.8.2
filelock 3.13.1
flash-attn 2.2.1
frozenlist 1.4.0
fsspec 2023.10.0
ftfy 6.1.1
fused-kernels 0.0.1
gitdb 4.0.11
GitPython 3.1.40
google-auth 2.23.4
google-auth-oauthlib 1.0.0
grpcio 1.59.2
h11 0.14.0
hf_transfer 0.1.4
hjson 3.1.0
httpcore 1.0.2
httpx 0.25.1
huggingface-hub 0.19.0
identify 2.5.31
idna 3.4
importlib-metadata 6.8.0
iniconfig 2.0.0
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
jsonlines 4.0.0
lm-dataformat 0.0.20
lm-eval 0.3.0
Markdown 3.5.1
MarkupSafe 2.1.3
mbstrdecoder 1.1.3
mpi4py 3.1.5
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
networkx 3.1
ninja 1.11.1.1
nltk 3.8.1
nodeenv 1.8.0
numexpr 2.8.6
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
openai 1.2.3
packaging 23.2
pandas 2.0.3
pathvalidate 3.2.0
pip 23.3.1
platformdirs 3.11.0
pluggy 1.3.0
portalocker 2.8.2
pre-commit 3.5.0
protobuf 4.25.0
psutil 5.9.6
py 1.11.0
py-cpuinfo 9.0.0
pyarrow 14.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.11.1
pycodestyle 2.11.1
pycountry 22.3.5
pydantic 1.10.13
pytablewriter 1.2.0
pytest 7.4.3
pytest-cov 4.1.0
pytest-forked 1.6.0
pytest-xdist 3.4.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.10.3
rehash 1.0.1
requests 2.31.0
requests-oauthlib 1.3.1
rouge-score 0.1.2
rsa 4.9
s3transfer 0.7.0
sacrebleu 1.5.0
safetensors 0.4.0
scikit-learn 1.3.2
scipy 1.10.1
sentencepiece 0.1.99
sentry-sdk 1.34.0
setproctitle 1.3.3
setuptools 56.0.0
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
sqlitedict 2.1.0
sympy 1.12
tabledata 1.3.3
tcolorpy 0.1.4
tensorboard 2.13.0
tensorboard-data-server 0.7.2
threadpoolctl 3.2.0
tiktoken 0.5.1
tokenizers 0.13.3
tomli 2.0.1
torch 1.13.1
tqdm 4.66.1
tqdm-multiprocess 0.0.11
transformers 4.30.2
triton 2.0.0.dev20221202
typepy 1.3.2
typing_extensions 4.8.0
tzdata 2023.3
ujson 5.8.0
urllib3 1.26.18
virtualenv 20.24.6
wandb 0.16.0
wcwidth 0.2.9
Werkzeug 3.0.1
wheel 0.41.3
xxhash 3.4.1
yarl 1.9.2
zipp 3.17.0
zstandard 0.22.0
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
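The package list above contains NVIDIA wheels tagged both `cu11` and `cu12`, plus `cupy-cuda111`, next to a CUDA 11.8 toolkit. A quick way to see such a mix at a glance is to group installed packages by their CUDA tag. This is a sketch; `group_by_cuda_suffix` is a hypothetical helper, and the tag pattern is an assumption based on the names in this list.

```python
import re
from collections import defaultdict
from importlib import metadata


def group_by_cuda_suffix(package_names):
    """Group package names by a trailing CUDA tag such as -cu11 or -cuda111."""
    groups = defaultdict(list)
    for name in package_names:
        normalized = name.lower().replace("_", "-")
        match = re.search(r"-(cu(?:da)?\d+)$", normalized)
        if match:
            groups[match.group(1)].append(normalized)
    return dict(groups)


if __name__ == "__main__":
    names = [dist.metadata["Name"] for dist in metadata.distributions()]
    installed = sorted(name for name in names if name)
    for tag, packages in sorted(group_by_cuda_suffix(installed).items()):
        print(tag, len(packages), ", ".join(packages))
```

More than one tag in the output does not prove anything by itself, but it hints that packages built for different CUDA or PyTorch versions ended up in the same environment, which fits the suspected mismatch.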