ImportError: /media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: #1079

Drzhivago264 · 2023-11-14T05:26:57Z

Describe the bug
Training fails when flash attention is enabled.
Training with global attention works.

To Reproduce
Steps to reproduce the behavior:

1. Follow the host setup in this repository.
2. Run the training script with this config:
{
"pipe_parallel_size": 0,
"model_parallel_size": 4,

"num_layers": 32,
"hidden_size": 2560,
"num_attention_heads": 32,
"seq_length": 2048,
"max_position_embeddings": 2048,
"pos_emb": "rotary",
"rotary_pct": 0.25,
"no_weight_tying": true,
"gpt_j_residual": true,
"output_layer_parallelism": "column",

"attention_config": [[["flash"], 32]],

"scaled_upper_triang_masked_softmax_fusion": true,
"bias_gelu_fusion": true,

"init_method": "small_init",
"output_layer_init_method": "wang_init",

"optimizer": {
"type": "CPU_Adam",
"params": {
"lr": 0.00016,
"betas": [0.9, 0.95],
"eps": 1.0e-8
}
},
"min_lr": 1.6e-05,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": True,
"allgather_bucket_size": 500000000,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 500000000,
"contiguous_gradients": True,
},

"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 2,
"data_impl": "mmap",
"num_workers": 1,

"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

"gradient_clipping": 1.0,
"weight_decay": 0.1,
"hidden_dropout": 0,
"attention_dropout": 0,

"fp16": {
"fp16": true,
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1
},

"train_iters": 143000,
"lr_decay_iters": 143000,
"distributed_backend": "nccl",
"lr_decay_style": "cosine",
"warmup": 0.01,
"checkpoint_factor": 1000,
"extra_save_iters": [64,128,256,512],
"eval_interval": 40000,
"eval_iters": 10,

"log_grad_norm": true,

"log_interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,

"tokenizer_type": "HFTokenizer"
}

Expected behavior
Setting ds_accelerator to cuda (auto detect)
[2023-11-14 16:09:59,189] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-11-14 16:09:59,189] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-11-14 16:09:59,189] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-11-14 16:09:59,189] [INFO] [launch.py:163:main] dist_world_size=4
[2023-11-14 16:09:59,189] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
[2023-11-14 16:10:05,177] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,177] [INFO] [comm.py:594:init_distributed] cdb=None
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 4

building HFTokenizer tokenizer ...
[2023-11-14 16:10:05,209] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,209] [INFO] [comm.py:594:init_distributed] cdb=None
padded vocab (size: 50277) with 411 dummy tokens (new size: 50688)
setting tensorboard ...
initializing torch distributed ...
[2023-11-14 16:10:05,376] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,376] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-11-14 16:10:05,376] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-11-14 16:10:05,405] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,405] [INFO] [comm.py:594:init_distributed] cdb=None
initializing model parallel with size 4
MPU DP: [0]
MPU DP: [1]
MPU DP: [2]
MPU DP: [3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0, 1, 2, 3]
setting random seeds to 1234 ...
[2023-11-14 16:10:06,287] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/media/h/nvme/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/media/h/nvme/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3}
[2023-11-14 16:10:06,697] [INFO] [module.py:358:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=37
0: EmbeddingPipe
1: _pre_transformer_block
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: ParallelTransformerLayerPipe
15: ParallelTransformerLayerPipe
16: ParallelTransformerLayerPipe
17: ParallelTransformerLayerPipe
18: ParallelTransformerLayerPipe
19: ParallelTransformerLayerPipe
20: ParallelTransformerLayerPipe
21: ParallelTransformerLayerPipe
22: ParallelTransformerLayerPipe
23: ParallelTransformerLayerPipe
24: ParallelTransformerLayerPipe
25: ParallelTransformerLayerPipe
26: ParallelTransformerLayerPipe
27: ParallelTransformerLayerPipe
28: ParallelTransformerLayerPipe
29: ParallelTransformerLayerPipe
30: ParallelTransformerLayerPipe
31: ParallelTransformerLayerPipe
32: ParallelTransformerLayerPipe
33: ParallelTransformerLayerPipe
34: _post_transformer_block
35: NormPipe
36: ParallelLinearPipe
loss: partial
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/media/h/nvme/gpt-neox/megatron/training.py", line 192, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/media/h/nvme/gpt-neox/megatron/training.py", line 633, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/media/h/nvme/gpt-neox/megatron/training.py", line 407, in get_model
    model = GPT2ModelPipe(
  File "/media/h/nvme/gpt-neox/megatron/model/gpt2_model.py", line 127, in __init__
    super().__init__(
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 199, in __init__
    self._build()
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 246, in _build
    module = layer.build()
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 73, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/media/h/nvme/gpt-neox/megatron/model/transformer.py", line 759, in __init__
    self.attention = ParallelSelfAttention(
  File "/media/h/nvme/gpt-neox/megatron/model/transformer.py", line 351, in __init__
    from megatron.model.flash_attention import (
  File "/media/h/nvme/gpt-neox/megatron/model/flash_attention.py", line 7, in <module>
    from flash_attn import flash_attn_triton
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 8, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
[2023-11-14 16:10:09,253] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345637
[2023-11-14 16:10:09,275] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345638
[2023-11-14 16:10:09,276] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345639
[2023-11-14 16:10:09,296] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345640
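
The undefined symbol in the ImportError is a mangled C++ name. As a minimal sketch (the helper `demangle_nested_name` is a hypothetical name, and it only handles plain nested names of the form `_ZN<len><id>...E`, not templates or function signatures), it can be decoded like this:

```python
import re


def demangle_nested_name(symbol: str) -> str:
    """Decode a simple Itanium-ABI nested name such as _ZN3foo3barE."""
    if not (symbol.startswith("_ZN") and symbol.endswith("E")):
        raise ValueError(f"not a simple nested name: {symbol!r}")
    body = symbol[3:-1]  # strip the "_ZN" prefix and the closing "E"
    parts = []
    pos = 0
    while pos < len(body):
        # Each component is a decimal length followed by that many characters.
        match = re.match(r"\d+", body[pos:])
        if match is None:
            raise ValueError(f"unsupported mangling at offset {pos}")
        length = int(match.group())
        start = pos + match.end()
        if start + length > len(body):
            raise ValueError("component length runs past the end of the symbol")
        parts.append(body[start:start + length])
        pos = start + length
    return "::".join(parts)


print(demangle_nested_name("_ZN3c104cuda20CUDACachingAllocator9allocatorE"))
# c10::cuda::CUDACachingAllocator::allocator
```

The decoded name lives in `c10`, PyTorch's core library, so the flash-attn extension appears to expect something the installed torch build does not export. That reading is an inference from the symbol name, not something confirmed here.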

Proposed solution
I don't know what to do. I suspect a mismatch between the flash-attn version and the PyTorch version, or between flash-attn and the system CUDA.
I have tested gpt-neox v1 and v2 in combination with CUDA 11.8 and 12.3.
It has been suggested that flash-attn v2 does not support Turing GPUs, but I face the same problem with flash-attn v1.
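
To compare versions without importing `flash_attn` (which is what fails), the installed package metadata can be read directly. This is a rough sketch using only the standard library; the helper name `installed_versions` and the choice of packages to query are my own assumptions:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Reads metadata only, so a broken compiled extension does not matter.
    found = installed_versions(["torch", "flash-attn", "triton", "deepspeed"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

This only reports what is installed now. It cannot tell which torch version a compiled flash-attn extension was originally built against.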

Environment (please complete the following information):

GPUs: 3 x RTX 2080 Ti and 1 x RTX 3060.
Configs: attached above
Package Version

absl-py 2.0.0
aiohttp 3.8.6
aiosignal 1.3.1
anyio 3.7.1
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.1.0
autopep8 2.0.4
best-download 0.0.9
boto3 1.28.84
botocore 1.31.84
cachetools 5.3.2
certifi 2023.7.22
cfgv 3.4.0
chardet 5.2.0
charset-normalizer 3.3.2
clang-format 17.0.4
click 8.1.7
cmake 3.27.7
colorama 0.4.6
coverage 7.3.2
cupy-cuda111 12.2.0
DataProperty 1.0.1
datasets 2.14.6
deepspeed 0.9.3+a48c649
dill 0.3.7
distlib 0.3.7
distro 1.8.0
docker-pycreds 0.4.0
einops 0.7.0
exceptiongroup 1.1.3
execnet 2.0.2
fastrlock 0.8.2
filelock 3.13.1
flash-attn 2.2.1
frozenlist 1.4.0
fsspec 2023.10.0
ftfy 6.1.1
fused-kernels 0.0.1
gitdb 4.0.11
GitPython 3.1.40
google-auth 2.23.4
google-auth-oauthlib 1.0.0
grpcio 1.59.2
h11 0.14.0
hf_transfer 0.1.4
hjson 3.1.0
httpcore 1.0.2
httpx 0.25.1
huggingface-hub 0.19.0
identify 2.5.31
idna 3.4
importlib-metadata 6.8.0
iniconfig 2.0.0
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
jsonlines 4.0.0
lm-dataformat 0.0.20
lm-eval 0.3.0
Markdown 3.5.1
MarkupSafe 2.1.3
mbstrdecoder 1.1.3
mpi4py 3.1.5
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
networkx 3.1
ninja 1.11.1.1
nltk 3.8.1
nodeenv 1.8.0
numexpr 2.8.6
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
openai 1.2.3
packaging 23.2
pandas 2.0.3
pathvalidate 3.2.0
pip 23.3.1
platformdirs 3.11.0
pluggy 1.3.0
portalocker 2.8.2
pre-commit 3.5.0
protobuf 4.25.0
psutil 5.9.6
py 1.11.0
py-cpuinfo 9.0.0
pyarrow 14.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.11.1
pycodestyle 2.11.1
pycountry 22.3.5
pydantic 1.10.13
pytablewriter 1.2.0
pytest 7.4.3
pytest-cov 4.1.0
pytest-forked 1.6.0
pytest-xdist 3.4.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.10.3
rehash 1.0.1
requests 2.31.0
requests-oauthlib 1.3.1
rouge-score 0.1.2
rsa 4.9
s3transfer 0.7.0
sacrebleu 1.5.0
safetensors 0.4.0
scikit-learn 1.3.2
scipy 1.10.1
sentencepiece 0.1.99
sentry-sdk 1.34.0
setproctitle 1.3.3
setuptools 56.0.0
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
sqlitedict 2.1.0
sympy 1.12
tabledata 1.3.3
tcolorpy 0.1.4
tensorboard 2.13.0
tensorboard-data-server 0.7.2
threadpoolctl 3.2.0
tiktoken 0.5.1
tokenizers 0.13.3
tomli 2.0.1
torch 1.13.1
tqdm 4.66.1
tqdm-multiprocess 0.0.11
transformers 4.30.2
triton 2.0.0.dev20221202
typepy 1.3.2
typing_extensions 4.8.0
tzdata 2023.3
ujson 5.8.0
urllib3 1.26.18
virtualenv 20.24.6
wandb 0.16.0
wcwidth 0.2.9
Werkzeug 3.0.1
wheel 0.41.3
xxhash 3.4.1
yarl 1.9.2
zipp 3.17.0
zstandard 0.22.0

Cuda: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
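
The package list above contains both `-cu11` and `-cu12` NVIDIA wheels next to torch 1.13.1. A small sketch for spotting that kind of mix in `pip list` output follows; the helper name `cuda_wheel_majors` is hypothetical, and the sample text is a shortened excerpt of the list above:

```python
import re
from collections import defaultdict


def cuda_wheel_majors(pip_list_text):
    """Group nvidia-*-cuNN packages from `pip list` output by CUDA major version."""
    groups = defaultdict(list)
    for line in pip_list_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = re.fullmatch(r"nvidia-[a-z0-9-]+-cu(\d+)", fields[0])
        if match:
            groups[match.group(1)].append(fields[0])
    return dict(groups)


sample = """\
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cudnn-cu11 8.5.0.96
torch 1.13.1
"""
groups = cuda_wheel_majors(sample)
print(sorted(groups))  # ['11', '12']
```

More than one key in the result may mean that PyTorch builds for different CUDA versions were installed into the same environment at different times. That is a guess about how this environment ended up in its current state, not a confirmed cause.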

linjiadegou2 · 2024-03-01T08:06:01Z

You can try flash_attn-2.3.0.

Drzhivago264 added the bug label Nov 14, 2023

Drzhivago264 closed this as completed Nov 14, 2023

Drzhivago264 reopened this Nov 14, 2023

Drzhivago264 closed this as completed Nov 14, 2023
