
ModuleNotFoundError: No module named 'deepspeed.ops.op_builder' on import deepspeed #425

Closed
shankyemcee opened this issue Oct 15, 2021 · 10 comments
Labels
bug Something isn't working

Comments

@shankyemcee

I am getting this error on importing deepspeed. I am currently using torch 1.8.0 and installed the packages in requirements.txt as directed. I am also unable to install APEX from the link provided.

@shankyemcee shankyemcee added the bug Something isn't working label Oct 15, 2021
@EricHallahan
Contributor

It would be of great assistance if you could provide more detail about the nature of the issue. What is your environment? What steps can we take to reproduce? We ask that you follow the Issue template for bug reports, as it specifies the information we are interested in.

@shankyemcee
Author

Describe the bug

import deepspeed
Traceback (most recent call last):
line 9, in
from ..op_builder import CPUAdamBuilder
ModuleNotFoundError: No module named 'deepspeed.ops.op_builder'
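
A relative import failing inside an installed package often means Python is resolving deepspeed from an unexpected location, e.g. a source checkout instead of the pip-installed copy in site-packages. A minimal diagnostic sketch (the locate helper below is illustrative, not part of DeepSpeed):

```python
import importlib.util
from typing import Optional

def locate(module: str) -> Optional[str]:
    """Return the file that backs a module, or None if it is not importable.

    If the reported path points into a source checkout rather than
    site-packages, submodules generated at build time (such as
    deepspeed.ops.op_builder) may be absent, which produces exactly
    this kind of ModuleNotFoundError.
    """
    spec = importlib.util.find_spec(module)
    return spec.origin if spec else None

# Example: a stdlib package resolves to its installed __init__.py
print(locate("json"))
```

Running locate("deepspeed") from the failing interpreter would show which copy of the package is actually being imported.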

To Reproduce
Steps to reproduce the behavior:

  1. Create a new Python virtual environment (Python 3.8.0)
  2. pip install torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0
  3. pip install -r requirements/requirements.txt
  4. start python terminal
  5. import deepspeed (above error shows)

Expected behavior
Import of deepspeed succeeds.


Environment (please complete the following information):

  • GPUs: Nvidia Geforce RTX 2070 Super (CUDA 11.2)

Additional context
Add any other context about the problem here.
I am not able to install APEX through the link provided:

pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@e2083df5eb96643c61613b9df48dd4eea6b07690

Error given below:

No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2'

Warning: Torch did not find available GPUs on this system.
 If your intention is to cross-compile, this is not an error.
By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
Volta (compute capability 7.0), Turing (compute capability 7.5),
and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
If you wish to cross-compile for a single specific architecture,
export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.



torch.__version__  = 1.8.0+cpu

I already have the CUDA 11.2 toolkit installed. Could this be due to a PyTorch mismatch? I am using the CUDA 10.2 build of torch.
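
The +cpu suffix in the output above is the tell: PyTorch encodes the compute backend in the local-version segment of the wheel. A small sketch of how those suffixes read (the build_flavor helper is my illustration, not a PyTorch API):

```python
def build_flavor(version: str) -> str:
    """Classify a PyTorch version string by its '+' local-version suffix.

    Wheels are tagged like '1.8.0+cpu' or '1.9.1+cu111'; a plain
    version such as '1.8.0' carries no suffix at all.
    """
    _, _, local = version.partition("+")
    if local == "cpu":
        return "cpu-only"
    if local.startswith("cu") and local[2:].isdigit():
        digits = local[2:]
        # 'cu111' -> CUDA 11.1, 'cu102' -> CUDA 10.2
        return f"built for CUDA {digits[:-1]}.{digits[-1]}"
    return "no suffix"
```

So torch.__version__ = 1.8.0+cpu means the wheel that got installed has no CUDA support at all, regardless of which toolkit is on the machine.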

@EricHallahan
Contributor

I must note that pip install torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 installs PyTorch wheels retrieved from PyPI, not from PyTorch directly. I also see that the APEX install log says you have the CPU version of PyTorch installed, which conflicts with the other information I have (both installing from PyPI and your own words). Following that trail, I was able to replicate the APEX install issue with torch==1.8.0+cpu.

I therefore recommend installing PyTorch as outlined in the PyTorch Getting Started guide. In your case that would be pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html (alternatively, any of the CUDA 11.1 builds of PyTorch since 1.8 will do). If this gives you trouble, install CUDA toolkit 11.1 for the installation of DeepSpeed, APEX, and the Megatron kernels: if memory serves, APEX will complain and installation will fail if the CUDA toolkit version does not match the one used to build PyTorch.
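
That last point can be made concrete: an APEX source build compares the installed toolkit's CUDA version with the one PyTorch was compiled against and bails out on a mismatch. A rough sketch of a check in that spirit (wheel_matches_toolkit is my illustration, not APEX code):

```python
def wheel_matches_toolkit(wheel_suffix: str, toolkit_version: str) -> bool:
    """True if a '+cuXYZ' wheel suffix denotes the same CUDA as 'X.Y'.

    'cu111' -> '11.1', 'cu102' -> '10.2'; APEX's setup performs a
    comparison along these lines and aborts when the versions disagree.
    """
    if not (wheel_suffix.startswith("cu") and wheel_suffix[2:].isdigit()):
        return False
    digits = wheel_suffix[2:]
    return f"{digits[:-1]}.{digits[-1]}" == toolkit_version

# The reporter's mix, a cu102 or cpu wheel against an 11.2 toolkit,
# fails a check like this; a cu111 wheel with toolkit 11.1 passes.
```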

@shankyemcee
Author

Thanks, I followed your instructions for installing torch directly from PyTorch without changing my NVIDIA toolkit, and I'm not getting the previous error anymore. But now I'm getting an error that the triton version is not available:

ERROR: Could not find a version that satisfies the requirement triton==0.4.2 (from deepspeed) (from versions: 0.1, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0)
ERROR: No matching distribution found for triton==0.4.2 (from deepspeed)

I am using python 3.8.0 and updated my pip version as well.

@EricHallahan
Contributor

EricHallahan commented Oct 17, 2021

ERROR: Could not find a version that satisfies the requirement triton==0.4.2 (from deepspeed) (from versions: 0.1, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0)
ERROR: No matching distribution found for triton==0.4.2 (from deepspeed)

This means that pip cannot find the requested package version for your system configuration on PyPI. In your case this is because the versions of Triton listed (all versions prior to 0.4.0) were published on PyPI as source distributions, unlike subsequent versions, which are published as binary wheels.

This indicates to me that you could be attempting to install Triton on an OS other than Linux (there are only Linux Triton wheels published on PyPI). I should also note that at the moment you should only need Triton for DeepSpeed's sparse attention, so if you do not intend to use sparse attention you should be able to skip it.

I would therefore like to ask two questions:

  1. Do you happen to be attempting to run GPT-NeoX on an OS other than Linux (like Windows)? If so, that is important information to communicate to us here.
  2. Have you attempted to train a model after installing the dependencies manually (with the exclusion of Triton)? If not, I recommend trying to do so and reporting back with your results. You can also try to install Triton from source, but if you are not interested in sparse attention it is probably not worth the trouble.
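
One way to install the dependencies manually while excluding Triton is to filter its pin out of requirements.txt before handing the file to pip. A hypothetical sketch (strip_requirement is just a text filter I am proposing, not a pip feature):

```python
import re

def strip_requirement(requirements: str, name: str) -> str:
    """Drop one package's pin from requirements.txt-style text."""
    kept = []
    for line in requirements.splitlines():
        spec = line.split("#", 1)[0].strip()                  # ignore comments
        pkg = re.split(r"[<>=!~\[; ]", spec, maxsplit=1)[0]   # bare package name
        if pkg.lower() != name.lower():
            kept.append(line)
    return "\n".join(kept)
```

Writing the filtered text to a new file and running pip install -r on that file would install everything except the named package.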

@shankyemcee
Author

To answer your questions: yes, I am currently trying to run GPT-NeoX on Windows, but that's just to test whether the model works for my use case. I plan to later run it on a cloud compute cluster called Compute Canada, which I'm sure I'll run into problems with as well.

How do I install DeepSpeed excluding Triton? I am getting this error when trying to install DeepSpeed from:
git+git:https://github.com/EleutherAI/DeeperSpeed.git@c45ec1c0ac05803c02763d7e43be67919e475a2c#egg=deepspeed

@StellaAthena
Member

To answer your questions: yes, I am currently trying to run GPT-NeoX on Windows, but that's just to test whether the model works for my use case. I plan to later run it on a cloud compute cluster called Compute Canada, which I'm sure I'll run into problems with as well.

How do I install DeepSpeed excluding Triton? I am getting this error when trying to install DeepSpeed from: git+git:https://github.com/EleutherAI/DeeperSpeed.git@c45ec1c0ac05803c02763d7e43be67919e475a2c#egg=deepspeed

Windows is a nightmare and you shouldn’t be trying to install things on it to test.

@shankyemcee
Author

Well, I am currently trying to run GPT-NeoX as part of a larger project, which is all on Windows. Is there any way to fix these issues on Windows? If not, would you recommend trying the GPT-Neo repository instead?

@StellaAthena
Member

Well, I am currently trying to run GPT-NeoX as part of a larger project, which is all on Windows. Is there any way to fix these issues on Windows? If not, would you recommend trying the GPT-Neo repository instead?

I assumed that your cloud services were Linux, as I am not aware of any high-performance computing platforms that run Windows. That's also why I said you shouldn't use Windows for testing: code tends to behave very differently on Windows than on other operating systems.

To confirm, the cloud platform you will be using for this project runs Windows as well?

@shankyemcee
Author

Well, the cloud platform is Linux, but I usually build my models on my own system, which is Windows. If it's really that much of an issue, I'll run GPT-NeoX inside a virtual machine. I was actually trying to install the dependencies on the cloud platform and I was still getting the triton error:
ERROR: Could not find a version that satisfies the requirement triton==0.4.2 (from deepspeed) (from versions: 0.1, 0.1.1, 0.1.2, 0.1.3, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.3.0, 1.1.0+computecanada)
@EricHallahan recommended skipping the Triton installation, but how can I skip it in the DeepSpeed installation?
