Inability to Utilize Specific CUDA Devices for Training in Raindrop Model #304

Closed
1 of 2 tasks
islaxu opened this issue Feb 5, 2024 · 3 comments


islaxu commented Feb 5, 2024

1. System Info

Python Version: 3.9
PyTorch Version: 2.1.0
CUDA Version: 12.1
GPU Model: NVIDIA RTX 4090

2. Information

  • The official example scripts
  • My own created scripts

3. Reproduction

device_list = ["cuda:3", "cuda:4"]
raindrop = Raindrop(
    n_steps=resampled_data.shape[1],
    n_features=resampled_data.shape[2],
    n_classes=2,
    n_layers=2,
    ……
    device=device_list,
)

4. Expected behavior

  1. Define the Raindrop model, whose _setup_device should allow device selection (e.g., cpu, a single cuda device, or a list of cuda devices).
  2. Specify a single CUDA device by passing "cuda:2" as the device argument. The model still runs on cuda:0 despite this specification.
    My workaround was to add:
    specific_device = torch.device("cuda:2")
    torch.cuda.set_device(specific_device)
  3. The workaround above lets me run on the specified CUDA device, but it does not solve the problem for multiple CUDA devices. When I specify ["cuda:0", "cuda:1"] as the device parameter, the model still runs only on cuda:0.
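
For readers hitting the same problem, another possible workaround is to restrict which GPUs the process can see through the CUDA_VISIBLE_DEVICES environment variable. This is a different technique from the torch.cuda.set_device call above and from the fix later applied in PyPOTS. The sketch below is only illustrative: the helper names cuda_indices and remap_for_visible_devices are hypothetical and not part of PyPOTS or PyTorch. It assumes device names of the form "cuda:N" and relies on the fact that, once the variable is set, the visible GPUs are renumbered from 0.

```python
import os
from typing import List, Sequence, Tuple, Union


def cuda_indices(device: Union[str, Sequence[str]]) -> List[int]:
    """Extract GPU indices from "cuda:N" or from a list of such strings."""
    names = [device] if isinstance(device, str) else list(device)
    indices = []
    for name in names:
        kind, _, index = name.partition(":")
        if kind != "cuda" or not index.isdigit():
            raise ValueError(f"expected 'cuda:N', got {name!r}")
        indices.append(int(index))
    return indices


def remap_for_visible_devices(
    device: Union[str, Sequence[str]],
) -> Tuple[str, List[str]]:
    """Return the CUDA_VISIBLE_DEVICES value and the renumbered device names."""
    indices = cuda_indices(device)
    env_value = ",".join(str(i) for i in indices)
    # After restricting visibility, the chosen GPUs appear as cuda:0, cuda:1, ...
    remapped = [f"cuda:{i}" for i in range(len(indices))]
    return env_value, remapped


env_value, remapped = remap_for_visible_devices(["cuda:3", "cuda:4"])
# Assumption: this runs before CUDA is initialized, ideally before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = env_value
print(env_value, remapped)  # 3,4 ['cuda:0', 'cuda:1']
```

With this approach the renumbered names (here "cuda:0" and "cuda:1") would be passed as the device argument instead of the original ones, so a model that defaults to cuda:0 still lands on one of the intended GPUs.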
islaxu added the bug label Feb 5, 2024
Owner

WenjieDu commented Feb 5, 2024

Hi there 👋,

Thank you so much for your attention to PyPOTS! You can follow me on GitHub to receive the latest news about PyPOTS. If you find PyPOTS helpful to your work, please star⭐️ this repository. Your star is your recognition, and it helps more people notice PyPOTS and grows the PyPOTS community. It matters and is definitely a kind of contribution to the community.

I have received your message and will respond ASAP. Thank you for your patience! 😃

Best,
Wenjie

WenjieDu self-assigned this Feb 6, 2024
WenjieDu added this to the v0.4 milestone Feb 6, 2024
Owner

WenjieDu commented Mar 13, 2024

Hi Jiaying @islaxu, this bug has been fixed now that the problem in issue #306 is solved. I also tested it on our server with 8 GPUs by running the command pytest tests/classification/raindrop.py -s. Below is the console log:

============================================================================== test session starts ==============================================================================
platform linux -- Python 3.10.13, pytest-8.1.1, pluggy-1.4.0
rootdir: /home/wdu/PyPOTS
plugins: cov-4.1.0, typeguard-4.1.2, anyio-4.3.0, xdist-3.5.0
collecting ... 
2024-03-13 15:48:37 [INFO]: Have set the random seed as 2023 for numpy and pytorch.
2024-03-13 15:48:37 [INFO]: Running tests for a classification model Raindrop...
2024-03-13 15:48:37 [INFO]: Using the given device: [device(type='cuda', index=0), device(type='cuda', index=1), device(type='cuda', index=2), device(type='cuda', index=3), device(type='cuda', index=4), device(type='cuda', index=5), device(type='cuda', index=6), device(type='cuda', index=7)]
2024-03-13 15:48:37 [INFO]: Model files will be saved to testing_results/classification/Raindrop/20240313_T154837
2024-03-13 15:48:37 [INFO]: Tensorboard file will be saved to testing_results/classification/Raindrop/20240313_T154837/tensorboard
2024-03-13 15:48:37 [INFO]: Model has been allocated to the given multiple devices: [device(type='cuda', index=0), device(type='cuda', index=1), device(type='cuda', index=2), device(type='cuda', index=3), device(type='cuda', index=4), device(type='cuda', index=5), device(type='cuda', index=6), device(type='cuda', index=7)]
2024-03-13 15:48:37 [INFO]: Raindrop initialized with the given hyperparameters, the number of trainable parameters: 333,290
collected 5 items

tests/classification/raindrop.py 
2024-03-13 15:48:45 [INFO]: Epoch 001 - training loss: 2.8150, validating loss: 0.0502
2024-03-13 15:48:49 [INFO]: Epoch 002 - training loss: 0.0231, validating loss: 0.0037
2024-03-13 15:48:53 [INFO]: Epoch 003 - training loss: 0.0053, validating loss: 0.0018
2024-03-13 15:48:57 [INFO]: Epoch 004 - training loss: 0.0030, validating loss: 0.0011
2024-03-13 15:49:02 [INFO]: Epoch 005 - training loss: 0.0020, validating loss: 0.0007
2024-03-13 15:49:02 [INFO]: Finished training.
2024-03-13 15:49:02 [INFO]: Saved the model to testing_results/classification/Raindrop/20240313_T154837/Raindrop.pypots
2024-03-13 15:49:02 [WARNING]: 🚨DeprecationWarning: The method classify is deprecated. Please use `predict` instead.
2024-03-13 15:49:02 [INFO]: Lazy-loading Raindrop ROC_AUC: 1.0, PR_AUC: 1.0, F1: 1.0, Precision: 1.0, Recall: 1.0
2024-03-13 15:49:02 [INFO]: Saved the model to testing_results/classification/Raindrop/saved_Raindrop_model.pypots
2024-03-13 15:49:02 [INFO]: Model loaded successfully from testing_results/classification/Raindrop/saved_Raindrop_model.pypots
2024-03-13 15:49:08 [INFO]: Epoch 001 - training loss: 0.0014, validating loss: 0.0005
2024-03-13 15:49:14 [INFO]: Epoch 002 - training loss: 0.0010, validating loss: 0.0004
2024-03-13 15:49:21 [INFO]: Epoch 003 - training loss: 0.0008, validating loss: 0.0003
2024-03-13 15:49:27 [INFO]: Epoch 004 - training loss: 0.0006, validating loss: 0.0002
2024-03-13 15:49:33 [INFO]: Epoch 005 - training loss: 0.0005, validating loss: 0.0002
2024-03-13 15:49:33 [INFO]: Finished training.
2024-03-13 15:49:33 [ERROR]: ❌ File testing_results/classification/Raindrop/20240313_T154837/Raindrop.pypots exists. Saving operation aborted.
2024-03-13 15:49:33 [INFO]: Saved the model to testing_results/classification/Raindrop/20240313_T154837/Raindrop.pypots
2024-03-13 15:49:34 [INFO]: Lazy-loading Raindrop ROC_AUC: 1.0, PR_AUC: 1.0, F1: 1.0, Precision: 1.0, Recall: 1.0

======================================================================== 5 passed, 3 warnings in 59.16s =========================================================================
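
To confirm device placement from a log like the one above without reading it by eye, the device entries can be extracted with a regular expression. This is a minimal sketch that assumes the "device(type='cuda', index=N)" format shown in the log lines above. The function name devices_in_log_line is hypothetical and not a PyPOTS utility.

```python
import re
from typing import List

# Matches entries such as device(type='cuda', index=3).
# Entries without an index (for example a plain CPU device) are not matched.
DEVICE_PATTERN = re.compile(r"device\(type='(\w+)', index=(\d+)\)")


def devices_in_log_line(line: str) -> List[str]:
    """Return device names such as 'cuda:3' found in a single log line."""
    return [f"{kind}:{index}" for kind, index in DEVICE_PATTERN.findall(line)]


line = (
    "2024-03-13 15:48:37 [INFO]: Using the given device: "
    "[device(type='cuda', index=0), device(type='cuda', index=1)]"
)
print(devices_in_log_line(line))  # ['cuda:0', 'cuda:1']
```

Comparing the returned list with the device argument passed to the model gives a quick check that every requested GPU was picked up.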

Author

islaxu commented Jul 23, 2024

Thank you for your help! 😀
