Dockerfile error #933

Closed
KonradWygladacz opened this issue May 12, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@KonradWygladacz

I used git clone to download this repository and then downloaded the Slim weights. Next, I built the image and ran the container. I intended to generate text by executing the command ./deepy.py generate.py ./configs/20B.yml, but I encountered the following error:

/usr/local/lib/python3.8/dist-packages/fused_kernels-0.0.1-py3.8-linux-x86_64.egg/scaled_upper_triang_masked_softmax_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor7optionsEv

====================================================================================================

ERROR: Fused kernels configured but not properly installed. Please run `pip install /lustre/scratch/tmp/1503311/gpt-neox/megatron/fused_kernels` to install them

And after running pip install /lustre/scratch/tmp/1503311/gpt-neox/megatron/fused_kernels I got this:

Processing ./megatron/fused_kernels
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: fused-kernels
  Building wheel for fused-kernels (setup.py) ... error

  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [40 lines of output]
      running bdist_wheel
      running build
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/lustre/scratch/tmp/1503311/gpt-neox/megatron/fused_kernels/setup.py", line 43, in <module>
          setup(
        File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 144, in setup
          return distutils.core.setup(**attrs)
        File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3/dist-packages/wheel/bdist_wheel.py", line 223, in run
          self.run_command('build')
        File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3.8/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 87, in run
          _build_ext.run(self)
        File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/home/kehl1152/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/home/kehl1152/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (11.1) mismatches the version that was used to compile
      PyTorch (11.7). Please make sure to use the same CUDA versions.

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for fused-kernels
  Running setup.py clean for fused-kernels
Failed to build fused-kernels
ERROR: Could not build wheels for fused-kernels, which is required to install pyproject.toml-based projects

Can you tell me what a solution might be?
Also, it seems to me that there is a mistake in line 15 of the Dockerfile:
FROM nvidia/cuda:11.1.1-devel-ubuntu20.04
It uses cuda:11.1, but I think it should be cuda:11.7.
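
A quick way to confirm the mismatch inside the container is to compare the two CUDA versions directly (a minimal check, assuming nvcc and Python are both on PATH):

    nvcc --version                                        # CUDA toolkit in the image (11.1 here)
    python -c "import torch; print(torch.version.cuda)"   # CUDA version torch was built with (11.7 here)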

@KonradWygladacz added the bug label on May 12, 2023
@Quentin-Anthony
Member

If you'd like to use CUDA 11.7, change it locally. I don't think a global update to the Dockerfile CUDA version is necessary.

@KonradWygladacz
Author

> If you'd like to use CUDA 11.7, change it locally. I don't think a global update to the Dockerfile CUDA version is necessary.

So, what should be done about this error if it is not caused by the CUDA version?

@StellaAthena
Member

> If you'd like to use CUDA 11.7, change it locally. I don't think a global update to the Dockerfile CUDA version is necessary.

> So, what should be done about this error if it is not caused by the CUDA version?

You can either change your local copy of the file to say CUDA 11.7, or you can install a version of PyTorch compiled with CUDA 11.1.
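
For the latter, a historical install command along these lines should work (a sketch; the exact pin for each version is listed on the PyTorch previous-versions page, and the wheel index may change over time):

    pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html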

@Quentin-Anthony
Member

It is caused by the CUDA version, but by the CUDA version your local torch was built with. I'm recommending that you personally switch to the CUDA 11.7 NVIDIA Docker image (FROM nvidia/cuda:11.7.0-devel-ubuntu20.04), test whether it works, and we'll consider updating the Dockerfile if so.

@KonradWygladacz
Author

> It is caused by the CUDA version, but by the CUDA version your local torch was built with. I'm recommending that you personally switch to the CUDA 11.7 NVIDIA Docker image (FROM nvidia/cuda:11.7.0-devel-ubuntu20.04), test whether it works, and we'll consider updating the Dockerfile if so.

I have changed the CUDA version to 11.7, and now I am encountering the following error.

=> ERROR [15/19] RUN pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+h  22.1s
------
 > [15/19] RUN pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@a651e2c24ecf97cbf367fd3f330df36760e1c597:
#0 1.467 Using pip 23.1.2 from /usr/local/lib/python3.8/dist-packages/pip (python 3.8)
#0 1.635 DEPRECATION: --build-option and --global-option are deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use --config-settings. Discussion can be found at https://github.com/pypa/pip/issues/11859
#0 1.636 WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option.
#0 1.661 Collecting git+https://github.com/NVIDIA/apex.git@a651e2c24ecf97cbf367fd3f330df36760e1c597
#0 1.662   Cloning https://github.com/NVIDIA/apex.git (to revision a651e2c24ecf97cbf367fd3f330df36760e1c597) to /tmp/pip-req-build-lich98no
#0 1.663   Running command git version
#0 1.686   git version 2.25.1
#0 1.686   Running command git clone --filter=blob:none https://github.com/NVIDIA/apex.git /tmp/pip-req-build-lich98no
#0 1.692   Cloning into '/tmp/pip-req-build-lich98no'...
#0 3.897   Updating files: 100% (423/423), done.
#0 3.902   Running command git show-ref a651e2c24ecf97cbf367fd3f330df36760e1c597
#0 3.911   Running command git rev-parse -q --verify 'sha^a651e2c24ecf97cbf367fd3f330df36760e1c597'
#0 3.918   Running command git fetch -q https://github.com/NVIDIA/apex.git a651e2c24ecf97cbf367fd3f330df36760e1c597
#0 4.500   Running command git rev-parse FETCH_HEAD
#0 4.509   a651e2c24ecf97cbf367fd3f330df36760e1c597
#0 4.509   Running command git rev-parse HEAD
#0 4.515   8b7a1ff183741dd8f9b87e7bafd04cfde99cea28
#0 4.516   Running command git checkout -q a651e2c24ecf97cbf367fd3f330df36760e1c597
#0 5.260   Resolved https://github.com/NVIDIA/apex.git to commit a651e2c24ecf97cbf367fd3f330df36760e1c597
#0 5.261   Running command git submodule update --init --recursive -q
#0 13.59   Running command git rev-parse HEAD
#0 13.60   a651e2c24ecf97cbf367fd3f330df36760e1c597
#0 13.60   Preparing metadata (setup.py): started
#0 13.60   Running command python setup.py egg_info
#0 17.44   No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#0 17.44
#0 17.44   Warning: Torch did not find available GPUs on this system.
#0 17.44    If your intention is to cross-compile, this is not an error.
#0 17.44   By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
#0 17.44   Volta (compute capability 7.0), Turing (compute capability 7.5),
#0 17.44   and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
#0 17.44   If you wish to cross-compile for a single specific architecture,
#0 17.44   export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
#0 17.44
#0 17.44
#0 17.45
#0 17.45   torch.__version__  = 1.8.1+cu111
#0 17.45
#0 17.45
#0 17.45   running egg_info
#0 17.45   creating /tmp/pip-pip-egg-info-2tnchlp5/apex.egg-info
#0 17.45   writing /tmp/pip-pip-egg-info-2tnchlp5/apex.egg-info/PKG-INFO
#0 17.45   writing dependency_links to /tmp/pip-pip-egg-info-2tnchlp5/apex.egg-info/dependency_links.txt
#0 17.45   writing top-level names to /tmp/pip-pip-egg-info-2tnchlp5/apex.egg-info/top_level.txt
#0 17.45   writing manifest file '/tmp/pip-pip-egg-info-2tnchlp5/apex.egg-info/SOURCES.txt'
#0 17.51   reading manifest file '/tmp/pip-pip-egg-info-2tnchlp5/apex.egg-info/SOURCES.txt'
#0 17.51   writing manifest file '/tmp/pip-pip-egg-info-2tnchlp5/apex.egg-info/SOURCES.txt'
#0 17.51   /tmp/pip-req-build-lich98no/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
#0 17.51     warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
#0 17.74   Preparing metadata (setup.py): finished with status 'done'
#0 17.75 Building wheels for collected packages: apex
#0 17.75   Building wheel for apex (setup.py): started
#0 17.75   Running command python setup.py bdist_wheel
#0 19.43   No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#0 19.43
#0 19.43   Warning: Torch did not find available GPUs on this system.
#0 19.43    If your intention is to cross-compile, this is not an error.
#0 19.44   By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
#0 19.44   Volta (compute capability 7.0), Turing (compute capability 7.5),
#0 19.44   and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
#0 19.44   If you wish to cross-compile for a single specific architecture,
#0 19.44   export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
#0 19.44
#0 19.45
#0 19.45
#0 19.45   torch.__version__  = 1.8.1+cu111
#0 19.45
#0 19.45
#0 19.45   /tmp/pip-req-build-lich98no/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
#0 19.46     warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
#0 19.47
#0 19.47   Compiling cuda extensions with
#0 19.47   nvcc: NVIDIA (R) Cuda compiler driver
#0 19.47   Copyright (c) 2005-2022 NVIDIA Corporation
#0 19.47   Built on Wed_Jun__8_16:49:14_PDT_2022
#0 19.47   Cuda compilation tools, release 11.7, V11.7.99
#0 19.47   Build cuda_11.7.r11.7/compiler.31442593_0
#0 19.47   from /usr/local/cuda/bin
#0 19.47
#0 19.47   Traceback (most recent call last):
#0 19.48     File "<string>", line 2, in <module>
#0 19.48     File "<pip-setuptools-caller>", line 34, in <module>
#0 19.48     File "/tmp/pip-req-build-lich98no/setup.py", line 171, in <module>
#0 19.48       check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
#0 19.48     File "/tmp/pip-req-build-lich98no/setup.py", line 102, in check_cuda_torch_binary_vs_bare_metal
#0 19.48       raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
#0 19.48   RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.1.
#0 19.48   In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
#0 19.76   error: subprocess-exited-with-error
#0 19.76  
#0 19.76   × python setup.py bdist_wheel did not run successfully.
#0 19.76   │ exit code: 1
#0 19.76   ╰─> See above for output.
#0 19.76  
#0 19.76   note: This error originates from a subprocess, and is likely not a problem with pip.
#0 19.76   full command: /usr/bin/python3 -u -c '
#0 19.76   exec(compile('"'"''"'"''"'"'
#0 19.76   # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
#0 19.76   #
#0 19.76   # - It imports setuptools before invoking setup.py, to enable projects that directly
#0 19.76   #   import from `distutils.core` to work with newer packaging standards.
#0 19.76   # - It provides a clear error message when setuptools is not installed.
#0 19.76   # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
#0 19.76   #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
#0 19.76   #     manifest_maker: standard file '"'"'-c'"'"' not found".
#0 19.76   # - It generates a shim setup.py, for handling setup.cfg-only projects.
#0 19.76   import os, sys, tokenize
#0 19.76  
#0 19.76   try:
#0 19.76       import setuptools
#0 19.76   except ImportError as error:
#0 19.76       print(
#0 19.76           "ERROR: Can not execute `setup.py` since setuptools is not available in "
#0 19.76           "the build environment.",
#0 19.76           file=sys.stderr,
#0 19.76       )
#0 19.76       sys.exit(1)
#0 19.76  
#0 19.76   __file__ = %r
#0 19.76   sys.argv[0] = __file__
#0 19.76  
#0 19.76   if os.path.exists(__file__):
#0 19.76       filename = __file__
#0 19.76       with tokenize.open(__file__) as f:
#0 19.76           setup_py_code = f.read()
#0 19.76   else:
#0 19.76       filename = "<auto-generated setuptools caller>"
#0 19.76       setup_py_code = "from setuptools import setup; setup()"
#0 19.76  
#0 19.76   exec(compile(setup_py_code, filename, "exec"))
#0 19.76   '"'"''"'"''"'"' % ('"'"'/tmp/pip-req-build-lich98no/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' --cpp_ext --cuda_ext bdist_wheel -d /tmp/pip-wheel-2wc338jj
#0 19.76   cwd: /tmp/pip-req-build-lich98no/
#0 19.76   Building wheel for apex (setup.py): finished with status 'error'
#0 19.76   ERROR: Failed building wheel for apex
#0 19.76   Running setup.py clean for apex
#0 19.76   Running command python setup.py clean
  [... same CUDA warnings and version-check traceback as in the bdist_wheel step above ...]
#0 21.48   RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.1.
#0 21.48   In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
#0 21.71   error: subprocess-exited-with-error
#0 21.71  
#0 21.71   × python setup.py clean did not run successfully.
#0 21.71   │ exit code: 1
#0 21.71   ╰─> See above for output.
#0 21.71  
#0 21.71   note: This error originates from a subprocess, and is likely not a problem with pip.
#0 21.71   full command: [same <pip-setuptools-caller> invocation as above, ending in --cpp_ext --cuda_ext clean --all]
#0 21.71   cwd: /tmp/pip-req-build-lich98no
#0 21.72   ERROR: Failed cleaning build dir for apex
#0 21.72 Failed to build apex
#0 21.72 ERROR: Could not build wheels for apex, which is required to install pyproject.toml-based projects
------
Dockerfile:105
--------------------
 103 |    
 104 |     ## Install APEX
 105 | >>> RUN pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@a651e2c24ecf97cbf367fd3f330df36760e1c597
 106 |    
 107 |     COPY megatron/ megatron
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" git+https://github.com/NVIDIA/apex.git@a651e2c24ecf97cbf367fd3f330df36760e1c597" did not complete successfully: exit code: 1

@StellaAthena
Member

When it says torch.__version__ = 1.8.1+cu111, it means PyTorch version 1.8.1 compiled for compatibility with CUDA 11.1. If you wish to use CUDA 11.7, you need to install the PyTorch build that is compiled for compatibility with it. You can find instructions on how to download it here.
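
For example, a torch build targeting CUDA 11.7 can be installed with a command along these lines (the specific version pin is illustrative; check the previous-versions page for the build you need):

    pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117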

@Quentin-Anthony
Member

Piggybacking off @StellaAthena's comment, the specific issue is with Apex:

#0 21.48   RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.1.
#0 21.48   In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).
#0 21.71   error: subprocess-exited-with-error

Your possible fixes here are to either comment out this check in Apex's setup.py (this usually works for a minor CUDA version mismatch; see here for an example), or to install PyTorch compiled for CUDA 11.7. If you need a specific torch version, a good list is at https://pytorch.org/get-started/previous-versions/
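
For the first option, a rough sketch of commenting out the check (at your own risk; the sed pattern assumes the call site looks like it does at the pinned commit, i.e. setup.py line 171 in your traceback):

    git clone https://github.com/NVIDIA/apex.git && cd apex
    git checkout a651e2c24ecf97cbf367fd3f330df36760e1c597
    # comment out the call to check_cuda_torch_binary_vs_bare_metal
    sed -i 's/^\( *\)check_cuda_torch_binary_vs_bare_metal/\1# check_cuda_torch_binary_vs_bare_metal/' setup.py
    pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .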

@KonradWygladacz
Author

I managed to start training using the 6-7B.yml configuration, but when I try to use flash-attention by adding the following to the config: "attention_config": [[["flash"], 32]], I encounter the error below. I should mention that I'm working with 4 A100 cards; each of the four ranks prints the same traceback, so it is shown only once here. What can fix this error?

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/opt/gpt-neox/megatron/training.py", line 192, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/opt/gpt-neox/megatron/training.py", line 606, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/opt/gpt-neox/megatron/training.py", line 388, in get_model
    model = GPT2ModelPipe(
  File "/opt/gpt-neox/megatron/model/gpt2_model.py", line 123, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 205, in __init__
    self._build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 254, in _build
    module = layer.build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 71, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/opt/gpt-neox/megatron/model/transformer.py", line 752, in __init__
    self.attention = ParallelSelfAttention(
  File "/opt/gpt-neox/megatron/model/transformer.py", line 347, in __init__
    from megatron.model.flash_attention import (
  File "/opt/gpt-neox/megatron/model/flash_attention.py", line 8, in <module>
    import flash_attn_cuda
ImportError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory

@StellaAthena
Member

This still seems to be a version-mismatch issue; see pytorch/pytorch#91186
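
One way to see which shared library is unresolved (a diagnostic sketch; find_spec locates the extension's .so without executing it, assuming flash_attn_cuda is on sys.path):

    SO=$(python -c "import importlib.util as u; print(u.find_spec('flash_attn_cuda').origin)")
    ldd "$SO" | grep -i 'not found'

If libtorch_cuda_cpp.so shows up as missing, the flash-attn extension was built against a different torch build than the one currently installed; rebuilding flash-attn against the installed torch should make it link against the libraries that torch actually ships.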

@Quentin-Anthony
Member

Closing due to inactivity. Feel free to re-open if this is encountered again!
