Parallel training on multi GPUs #121

Closed
WenjieDu opened this issue May 20, 2023 · 0 comments · Fixed by #125
Labels
enhancement (New feature or request), new feature (Proposing to add a new feature)

WenjieDu (Owner) commented May 20, 2023

1. Feature description

Enable training PyPOTS NN models on multiple CUDA devices in parallel.

Parallel training on multiple GPUs for acceleration is useful, and this feature is on our list, but without priority, mainly because:

  1. If your dataset is very large, PyPOTS provides a lazy-loading strategy that loads only the necessary data samples during training. Simply using multiple GPU devices for training cannot ease the memory load, because your data still has to be loaded into RAM before being distributed to the GPUs;
  2. Unlike LLMs, neural network models for time-series modeling are usually not large. A single GPU can already accelerate training to a good speed. So far, you can even run all models in PyPOTS on a laptop CPU at an acceptable training and inference speed, especially since laptops nowadays generally have at least 4 cores. I'm not saying training on multiple GPUs is useless; in some extreme scenarios, it can be very helpful.

Recently, this feature was requested by a member of our community who is using PyPOTS to train a GRU-D model for a POTS classification task. The training takes too much time (even after increasing the value of num_workers), and they have 4 GPUs on the machine but cannot use them for parallel training to speed things up. Therefore, I'm considering adding parallel training in the following release. I have implemented it with DataParallel (see the sketch below), although PyTorch recommends DistributedDataParallel instead (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#comparison-between-dataparallel-and-distributeddataparallel). As I mentioned above, I don't think this is a necessary feature, so I'm postponing that redesign.
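For reference, here is a minimal sketch of the DataParallel approach, using a toy PyTorch model as a stand-in; the `TinyImputer` class and the dummy training step are hypothetical placeholders for illustration, not PyPOTS's actual internal API.

```python
# Minimal sketch: wrapping a model with torch.nn.DataParallel so each batch
# is split across the visible GPUs. TinyImputer is a hypothetical stand-in
# for a PyPOTS NN model.
import torch
import torch.nn as nn


class TinyImputer(nn.Module):  # hypothetical placeholder model
    def __init__(self, n_features: int, n_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_features),
        )

    def forward(self, x):
        return self.net(x)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyImputer(n_features=10)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each GPU, scatters the batch,
    # and gathers the outputs back on the default device.
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
model = model.to(device)

# Dummy training step to show that the wrapped model is used as usual.
x = torch.randn(32, 10, device=device)
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
```

Switching to DistributedDataParallel would instead launch one process per GPU (e.g. via torchrun), which is the redesign mentioned above.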

2. Motivation

Speed up the training process.

3. Your contribution

Will make a PR to add this feature.

WenjieDu added the enhancement and new feature labels on May 20, 2023