
RuntimeError: CUDA out of memory #11

Open
happygirlzt opened this issue Aug 1, 2022 · 9 comments

Comments

@happygirlzt

Hi there, thank you very much for open-sourcing the work!
I wonder what devices you used for this work. I tried to run training on a machine with 8 Tesla V100-SXM2-16GB GPUs, but could not get it to fit in memory. I also found that the code only utilizes 2 GPUs, even though I did not restrict it; I modified the device setting inside run.py, but still only 2 GPUs are used.
Please kindly advise. Thank you in advance!
[Screenshot attached, 2022-08-01]

@happygirlzt (Author)

Hi @pkuzqh, I've hit another issue when running the code.

[Screenshot attached, 2022-08-02]

@pkuzqh (Owner) commented Aug 3, 2022

If you want to change the batch size, change the number in the "args" dict. If you want to use multiple GPUs, modify `model = nn.DataParallel(model, device_ids=[0, 1])`.
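A minimal sketch of both changes, assuming a stand-in model and an illustrative "batch_size" key (the repository's actual `args` dict and model come from run.py):

```python
import torch
import torch.nn as nn

# Stand-in for the repository's model; the real one is built in run.py.
model = nn.Linear(8, 2)

# 1. The batch size lives in the "args" dict (key name assumed here).
args = {"batch_size": 16}

# 2. List every GPU DataParallel should use; [0, 1] is the repo's default.
if torch.cuda.is_available():
    model = model.cuda()
device_ids = [0, 1, 2] if torch.cuda.device_count() >= 3 else None
model = nn.DataParallel(model, device_ids=device_ids)

# DataParallel splits each batch along dim 0 across the listed GPUs
# (on a CPU-only machine it simply runs the wrapped module unchanged).
out = model(torch.randn(args["batch_size"], 8))
print(tuple(out.shape))
```

Lowering `args["batch_size"]` is the usual first fix for the out-of-memory error; adding GPUs to `device_ids` shrinks the per-GPU slice of each batch instead.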

@happygirlzt (Author)

Hi @pkuzqh, thank you for the reply. The CUDA out-of-memory issue is resolved, but I ran into the new error shown above. Please kindly advise, thanks.

@pkuzqh (Owner) commented Aug 3, 2022

How many GPUs are you using, and what is the batch size?

@happygirlzt (Author)

Three. I set device_ids=[1, 2, 3] in train(), and the batch size is 16.

@pkuzqh (Owner) commented Aug 5, 2022

You need to change the number "4" in lines 103-106 to a multiple of 3, and the batch size also needs to be a multiple of 3.
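The divisibility requirement can be illustrated with plain arithmetic; the `shards()` helper below is hypothetical and only mirrors what an even per-GPU split demands, not the repository's code:

```python
def shards(batch_size, n_gpus):
    """Split a batch evenly across n_gpus; fail if it does not divide."""
    if batch_size % n_gpus != 0:
        raise ValueError(f"batch size {batch_size} is not a multiple of {n_gpus}")
    return [batch_size // n_gpus] * n_gpus

print(shards(18, 3))   # [6, 6, 6]: a multiple of 3 splits evenly over 3 GPUs
try:
    shards(16, 3)      # the batch size of 16 mentioned above does not
except ValueError as err:
    print(err)
```

So with device_ids=[1, 2, 3], a batch size such as 15 or 18 divides cleanly, while 16 leaves a remainder.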

@happygirlzt (Author)

OK, thank you very much @pkuzqh! It runs now. However, I saw that in train() the number of epochs is 100000 (for epoch in range(100000):). Is that intended?

@happygirlzt (Author)

BTW, for inference it looks like testDefect4j.py can only use 1 GPU? I have 4 GPUs, but only one was used, and it caused an OOM issue.
[Screenshot attached, 2022-08-08]

@pkuzqh (Owner) commented Aug 12, 2022

You can use `nn.DataParallel` to run on multiple GPUs in testDefect4J.py.
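A minimal inference sketch, assuming a stand-in model in place of the one testDefect4J.py actually loads (with no device_ids argument, DataParallel uses every visible GPU):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)            # stand-in for the model testDefect4J.py loads
if torch.cuda.is_available():
    model = model.cuda()
model = nn.DataParallel(model)     # no device_ids: use all visible GPUs
model.eval()

with torch.no_grad():              # inference only: no gradients, less GPU memory
    preds = model(torch.randn(4, 8))
print(tuple(preds.shape))
```

Wrapping with `torch.no_grad()` also helps with the OOM here, since inference does not need the activation buffers that training keeps for backpropagation.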
