Parallel training on multi GPUs #121

Closed
WenjieDu opened this issue May 20, 2023 · 0 comments · Fixed by #125
Labels
enhancement (New feature or request), new feature (Proposing to add a new feature)

WenjieDu (Owner) commented May 20, 2023

1. Feature description

Enable training PyPOTS NN models on multiple CUDA devices in parallel.

Parallel training on multiple GPUs for acceleration is useful, and this feature is on our list, but without priority, mainly because:

  1. If your dataset is very large, PyPOTS provides a lazy-loading strategy that loads only the necessary data samples during training. Simply using multiple GPU devices for training cannot ease the memory load, because your data still has to be loaded into RAM before being distributed to the GPUs;
  2. Unlike LLMs, neural network models for time-series modeling are usually not large. A single GPU can already accelerate training to a good speed. So far, you can even run all models in PyPOTS on a laptop CPU at an acceptable training and inference speed, especially since laptops nowadays generally have at least 4 cores. I'm not saying training on multiple GPUs is useless; in some extreme scenarios, it can be very helpful.

Recently, this feature was requested by a member of our community who is using PyPOTS to train a GRU-D model for a POTS classification task. The training takes too much time (even after increasing the value of num_workers), and they have 4 GPUs on the machine but cannot use them for parallel training to speed things up. Therefore, I'm considering adding parallel training in the following release. I have implemented it with DataParallel (see the sketch below), although PyTorch recommends DistributedDataParallel instead (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#comparison-between-dataparallel-and-distributeddataparallel). As I mentioned above, I don't think this is a necessary feature, so I'm postponing that redesign.
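For reference, here is a minimal sketch of the DataParallel approach, using a toy PyTorch model as a stand-in; the `TinyImputer` class and the dummy training step are hypothetical placeholders for illustration, not PyPOTS's actual internal API.

```python
# Minimal sketch: wrapping a model with torch.nn.DataParallel so each batch
# is split across the visible GPUs. TinyImputer is a hypothetical stand-in
# for a PyPOTS NN model.
import torch
import torch.nn as nn


class TinyImputer(nn.Module):  # hypothetical placeholder model
    def __init__(self, n_features: int, n_hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_features),
        )

    def forward(self, x):
        return self.net(x)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyImputer(n_features=10)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each GPU, scatters the batch,
    # and gathers the outputs back on the default device.
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
model = model.to(device)

# Dummy training step to show that the wrapped model is used as usual.
x = torch.randn(32, 10, device=device)
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
```

Switching to DistributedDataParallel would instead launch one process per GPU (e.g. via torchrun), which is the redesign mentioned above.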

2. Motivation

Speed up the training process.

3. Your contribution

Will make a PR to add this feature.

WenjieDu added the enhancement and new feature labels on May 20, 2023