CN115794387A - LSF-based single-host multi-GPU distributed pytorch parallel computing method - Google Patents

LSF-based single-host multi-GPU distributed pytorch parallel computing method

Info

Publication number
CN115794387A
Authority
CN
China
Prior art keywords
gpu
data
distributed
lsf
rank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211431240.8A
Other languages
Chinese (zh)
Inventor
蒋鹏飞
单晓冬
徐恩格
王小龙
鲍复劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou International Science Park Data Center Co ltd
Original Assignee
Suzhou International Science Park Data Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou International Science Park Data Center Co ltd filed Critical Suzhou International Science Park Data Center Co ltd
Priority to CN202211431240.8A
Publication of CN115794387A
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)

Abstract

The invention relates to an LSF-based single-host multi-GPU distributed PyTorch parallel computing method, and belongs to the field of computers. The method comprises two parts: a first part, resource application and scheduling; and a second part, training a deep learning model with the applied resources. In the existing nn.DataParallel approach, a single process computes the model parameters and distributes them to each GPU during every batch; each GPU computes its own gradients, the gradients are gathered onto GPU0 and averaged, GPU0 performs back propagation to update the parameters, and the parameters are then propagated from GPU0 to the other GPUs. GPU utilization is therefore typically low. Moreover, nn.DataParallel requires all GPUs to be on the same node, and Apex cannot be used for mixed-precision training. Compared with the existing DataParallel mode, the proposed method is faster, more efficient, and achieves higher GPU occupancy.

Description

LSF-based single-host multi-GPU distributed PyTorch parallel computing method
Technical Field
The invention belongs to the field of computers, and relates to an LSF-based single-host multi-GPU distributed PyTorch parallel computing method.
Background
In recent years, deep learning technology has developed rapidly in image and natural language processing. To give models higher precision and stronger generalization capability, model structures are often designed to be deeper and more complex, and the training data sets are also larger. The forward- and backward-propagation steps in model iteration are typically compute-intensive tasks with a large number of operations. Although a GPU (Graphics Processing Unit) can provide stronger computing power in hardware, and a model can be optimized algorithmically to accelerate convergence, the resources provided by a single machine still cannot meet large-scale training tasks. Distributed computing can effectively alleviate this problem by partitioning the training task and executing it in parallel on multiple nodes.
PyTorch is an open-source Python machine learning library based on Torch, used for applications such as natural language processing. It was introduced in January 2017 by Facebook AI Research (FAIR) on the basis of Torch. It is a Python-based scientific computing package that provides two high-level functions:
1. Tensor computation with strong GPU acceleration (similar to NumPy).
2. Deep neural networks built on an automatic differentiation system.
A brief illustration of both follows.
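As an illustration of these two capabilities (a minimal sketch for orientation only, not taken from the patent text):

import torch

# 1. Tensor computation with a NumPy-like API, GPU-accelerated when CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 3, device=device)
y = x @ x.t()

# 2. Automatic differentiation for deep neural networks
w = torch.randn(3, requires_grad=True)
loss = (w * w).sum()
loss.backward()        # w.grad now holds d(loss)/dw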
LSF (Load Sharing Facility) is industry-oriented, commercial-grade workload management software from IBM. Its strong resource scheduling and management capability allows it to allocate all kinds of IT resources to execute distributed tasks at higher speed, with better load balancing, more reliably and at lower cost. For deep learning training tasks, LSF can allocate GPU resources efficiently and flexibly and helps create and manage distributed computing environments, thereby accelerating training. When running in an LSF environment, the simplest and most direct way to accelerate neural-network model training is to use a GPU; however, the video memory of a single GPU is often limited, and when a relatively large model or a huge number of parameters exhausts the video memory, the simplest and most drastic remedy is to add more GPUs. This raises a core problem: in order to train with multiple GPUs, a way must be found to distribute the data and the model across the GPUs and to coordinate the training process, so that training speed actually increases.
At present, PyTorch can compute on a single GPU card, but parallel computing across multiple GPUs does not yet deliver a corresponding multiple of single-card efficiency, which greatly reduces the cost-effectiveness of using multiple cards.
Disclosure of Invention
In view of this, the present invention aims to provide an LSF-based single-host multi-GPU distributed PyTorch parallel computing method that solves the technical problem that a single task cannot call on multiple GPU computing resources in a supercomputing center, so that software can improve its computing capability on a multi-GPU platform and the project can be used in any supercomputing center.
In order to achieve the purpose, the invention provides the following technical scheme:
the LSF-based single-host multi-GPU distributed type pitorch parallel computing method comprises two parts:
a first part: resource application and scheduling;
a second part: and (4) training a deep learning model by using resources.
The first part is completed under an LSF cluster;
applying for computing resources through instructions of the LSF, including:
the total number of the jobs to be created is equal to the total number of the GPUs applied;
and the number of GPUs of a single host.
Optionally, the second part is implemented inside the program;
first of all, each LSF job exclusively occupies one process and one GPU, and the deep learning model is based on the PyTorch framework;
in the first step, each job reads "LSF_PM_TASKID" from the environment as the rank of its task;
in the second step, a distributed process group is initialized using the torch.distributed library, with the parameters rank, world_size, init_method and backend; rank identifies each process, world_size is the total number of processes, init_method indicates where and how to find the other processes, and backend specifies the communication backend used, for which the invention uses nccl; NCCL is the communication backend developed by NVIDIA for GPU parallel computing;
in the third step, the training data set is read; the data set is partitioned with torch.utils.data.distributed.DistributedSampler in PyTorch, and each process obtains its corresponding data slice by setting num_replicas to world_size and rank to the rank of the current process; when DistributedSampler is used for data slicing, the batch size trained on each process is the overall batch size divided by the total number of processes; torch.utils.data.DataLoader is then used to read the data of each slice; num_workers is set greater than 1 to start subprocesses that accelerate data loading, and pin_memory is set to True so that data are read directly onto the GPU exclusively occupied by the process, reducing transfer time;
in the fourth step, the related packages, including argparse, torch.distributed and DistributedDataParallel, are imported, and a parameter local_rank is added after the import succeeds, telling the current program which GPU to run on; finally, nccl is selected as the designated communication mode;
in the fifth step, the DataLoader is wrapped; what is needed here is to replace the Sampler with DistributedSampler and assign it as the sampler of the DataLoader; each GPU, i.e. each process, fetches data from its own DataLoader, and the designated DistributedSampler ensures that the GPUs fetch non-overlapping data;
in the sixth step, the model is wrapped with DistributedDataParallel;
and in the seventh step, the input data are placed on the designated GPU; a code sketch covering these seven steps follows.
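A minimal end-to-end sketch of the seven steps is given below; the world size of 8, the tcp://127.0.0.1:23456 rendezvous address, the 1-based interpretation of LSF_PM_TASKID, and torchvision's FakeData/ResNet-18 stand-ins for the user's own data set and model are all assumptions made for illustration:

import argparse
import os

import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)   # GPU index on this host (step 4)
    args = parser.parse_args()

    # Step 1: the rank of this task comes from the LSF environment
    rank = int(os.environ["LSF_PM_TASKID"]) - 1                # assumes 1-based LSF task IDs
    world_size = 8                                             # total number of processes/GPUs (assumption)

    # Step 2: initialize the distributed process group over the NCCL backend
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",   # placeholder address
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(args.local_rank)

    # Steps 3-5: data set, DistributedSampler and DataLoader (per-process batch = total / world_size)
    dataset = torchvision.datasets.FakeData(transform=torchvision.transforms.ToTensor())
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=256 // world_size,
                        sampler=sampler, num_workers=2, pin_memory=True)

    # Step 6: wrap the model with DistributedDataParallel
    model = torchvision.models.resnet18().cuda(args.local_rank)
    ddp_model = DDP(model, device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Step 7: move each batch onto the designated GPU and train
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for images, labels in loader:
            images = images.cuda(args.local_rank, non_blocking=True)
            labels = labels.cuda(args.local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(images), labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()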
The invention has the beneficial effects that:
the Dataparallel is easier to use (only a single GPU model is simply packaged) and therefore becomes the mainstream single-node multi-card using mode. model = nn. Dataparallell (model) it uses a process to compute the model parameters, then during each batch it is distributed to each GPU, each GPU computes its respective gradient, then it is summed to GPU0 for averaging, it is backproplated by GPU0 to update the parameters, and then the parameters of the model are propagated by GPU0 to the other GPUs. This way of use communication speed becomes a bottleneck and GPU utilization is typically low. And nn. Dataparallell requires that all GPUs are on the same node (distributed is not supported), and the Apex cannot be used for mixed precision training. Compared with the existing dataparrell mode, the method is higher in speed, high in efficiency and higher in GPU occupation.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic diagram of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustration only and are not intended to limit the invention; they are schematic representations rather than actual drawings. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms indicating an orientation or positional relationship, such as "upper", "lower", "left", "right", "front" and "rear", are based on the orientation or positional relationship shown in the drawings, are used only for convenience and simplification of description, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation; the terms describing positional relationships in the drawings are therefore used for illustrative purposes only and are not to be construed as limiting the present invention, and their specific meaning can be understood by those skilled in the art according to the specific situation.
Please refer to fig. 1, which illustrates the LSF-based single-host multi-GPU distributed PyTorch parallel computing method.
Writing the LSF job submission script:
#BSUB -q HPC.S1.GPU.X785.sha
This specifies the job submission queue.
#BSUB -n 8
This sets the number of cores used to 8; the job starts 8 tasks and uses 8 GPUs in total.
#BSUB -gpu "num=4:mode=exclusive_process"
Each host is set to use 4 GPUs, and the tasks will exclusively occupy the allocated GPUs. This statement makes the environment variable CUDA_VISIBLE_DEVICES equal to '0,1,2,3' on each host, i.e. GPUs number 0, 1, 2 and 3 are available.
#BSUB -o %J.out
#BSUB -e %J.err
An output file and an error file are specified.
python resnet_v_6.py
This is the job execution statement; it sends the job to run the distributed computation on the host. Inside resnet_v_6.py, each task can derive its rank and its exclusive GPU from the environment that LSF sets up, as in the short sketch below.
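A minimal sketch of that mapping inside the Python program (the 1-based interpretation of LSF_PM_TASKID and the modulo mapping to a local GPU index are assumptions, not stated in the original text):

import os

import torch

rank = int(os.environ["LSF_PM_TASKID"]) - 1    # global rank of this task, assuming 1-based task IDs
gpus_per_host = 4                              # matches "num=4" in the job script
local_rank = rank % gpus_per_host              # index into CUDA_VISIBLE_DEVICES='0,1,2,3'
torch.cuda.set_device(local_rank)              # bind this process to its exclusive GPU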
1. The process group is initialized; because GPU communication is used, the backend should be set to nccl. However, experiments show that even if it is incorrectly written as gloo, NCCL is automatically used inside DDP as the communication module.
2. Because the model is later wrapped with DDP for training, DDP automatically synchronizes the model weights of all ranks to the weights of rank 0, so the model weights only need to be loaded on rank 0. This holds for PyTorch version 1.12.1; lower versions do not appear to have this property, in which case the weights must be loaded separately on each rank and a map_location must be passed to the load call, as shown by the two lines of code noted below.
3. The optimizer is created here for the model rather than for the ddp_model wrapped with DDP, so that the optimizer weights can be read more easily for single-GPU training.
4. The optimizer weights are loaded onto the GPU occupied by this process. Without the map_location parameter, the load would place the weights on the device on which they were originally saved.
5. The optimizer takes over the loaded weights. Experiments show that even if the weights are not on the GPU where the optimizer resides, they are migrated without error; of course, loading them directly onto the corresponding GPU reduces data transfer.
6. DDP wraps the model and copies the model to the corresponding GPU. The model copies of all ranks are kept consistent with rank 0. Note that DDP does not copy the optimizer, so the optimizers of all processes must be consistent at initialization: either none of them loads the weights, or all of them do.
7. Training of the model begins here. The data must be transferred to the corresponding GPU device.
8. In backward, after the models of all processes have computed their gradients, the gradients are averaged (not summed). That is, DDP adds a hook to the backward function in which an all-reduce (ring all-reduce) of the model gradients of all processes is executed. This can be verified by feeding different data into the model of each process: after backward, the models hold the same gradient, which is indeed the average of the gradients of all processes. It can also be verified that backward blocks each process from using the gradient; a process can read and use the gradient only after all processes have completed backward. This ensures that all processes are consistent with respect to the gradient.
9. The optimizer of each process updates the weights of its model copy using the gradient. Because the model weights and optimizer weights of every process are consistent at initialization, and the back-propagated gradients are consistent in every iteration, the models of all processes remain consistent throughout training.
10. Since the weights of all processes are kept consistent, only one process is needed for saving.
11. The IP address and port of rank 0 are defined and mp.spawn is used; this only needs to be defined in the main process and does not need to be defined separately in each subprocess.
12. The subprocesses are created, passing in: the function called by each subprocess (whose first parameter must be rank), the arguments of that function (excluding the rank parameter), the number of subprocesses, and whether to wait for all subprocesses to finish being created before execution starts. A sketch of points 1 to 12 follows.
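A minimal runnable sketch corresponding to points 1 to 12 is given below; the linear model, loss, checkpoint file names and the 127.0.0.1:29500 rendezvous address are placeholders chosen for illustration rather than the patent's own code:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

MODEL_CKPT = "model.pt"          # hypothetical checkpoint file names
OPTIM_CKPT = "optimizer.pt"

def worker(rank, world_size):
    # (1) initialize the process group; nccl is the backend for GPU communication
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # (3) create the model and its optimizer before DDP wrapping
    model = nn.Linear(128, 10).cuda(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # (2) load the model weights on rank 0 only; DDP later synchronizes them to the other ranks
    if rank == 0 and os.path.exists(MODEL_CKPT):
        model.load_state_dict(torch.load(MODEL_CKPT, map_location=f"cuda:{rank}"))

    # (4)(5) load the optimizer weights onto the GPU occupied by this process
    if os.path.exists(OPTIM_CKPT):
        optimizer.load_state_dict(torch.load(OPTIM_CKPT, map_location=f"cuda:{rank}"))

    # (6) DDP wraps the model; every rank's copy is kept consistent with rank 0
    ddp_model = DDP(model, device_ids=[rank])

    # (7)-(9) training loop: data go to this rank's GPU, backward all-reduces the gradients
    loss_fn = nn.MSELoss()
    for _ in range(10):
        x = torch.randn(32, 128).cuda(rank)
        y = torch.randn(32, 10).cuda(rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()      # (8) DDP's hook averages the gradients across all processes
        optimizer.step()     # (9) every rank applies the same averaged gradient

    # (10) weights are identical on all ranks, so rank 0 alone saves them
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), MODEL_CKPT)
        torch.save(optimizer.state_dict(), OPTIM_CKPT)
    dist.destroy_process_group()

def main():
    # (11) rendezvous address of rank 0, defined once in the main process
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder IP
    os.environ["MASTER_PORT"] = "29500"       # placeholder port
    world_size = torch.cuda.device_count()
    # (12) spawn the subprocesses; rank is passed automatically as the first argument of worker
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()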
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method is characterized in that the method comprises two parts:
a first part: resource application and scheduling;
a second part: training a deep learning model using the applied resources.
2. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method of claim 1, wherein: the first part is completed under an LSF cluster;
computing resources are applied for through LSF instructions, specifying:
the total number of jobs to be created, which is equal to the total number of GPUs requested;
and the number of GPUs on a single host.
3. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method of claim 2, wherein: the second part is implemented inside the program;
first of all, each LSF job exclusively occupies one process and one GPU, and the deep learning model is based on the PyTorch framework;
in the first step, each job reads "LSF_PM_TASKID" from the environment as the rank of its task;
in the second step, a distributed process group is initialized using the torch.distributed library, with the parameters rank, world_size, init_method and backend; rank identifies each process, world_size is the total number of processes, init_method indicates where and how to find the other processes, and backend specifies the communication backend used, for which the invention uses nccl; NCCL is the communication backend developed by NVIDIA for GPU parallel computing;
in the third step, the training data set is read; the data set is partitioned with torch.utils.data.distributed.DistributedSampler in PyTorch, and each process obtains its corresponding data slice by setting num_replicas to world_size and rank to the rank of the current process; when DistributedSampler is used for data slicing, the batch size trained on each process is the overall batch size divided by the total number of processes; torch.utils.data.DataLoader is then used to read the data of each slice; num_workers is set greater than 1 to start subprocesses that accelerate data loading, and pin_memory is set to True so that data are read directly onto the GPU exclusively occupied by the process, reducing transfer time;
in the fourth step, the related packages, including argparse, torch.distributed and DistributedDataParallel, are imported, and a parameter local_rank is added after the import succeeds, telling the current program which GPU to run on; finally, nccl is selected as the designated communication mode;
in the fifth step, the DataLoader is wrapped; what is needed here is to replace the Sampler with DistributedSampler and assign it as the sampler of the DataLoader; each GPU, i.e. each process, fetches data from its own DataLoader, and the designated DistributedSampler ensures that the GPUs fetch non-overlapping data;
in the sixth step, the model is wrapped with DistributedDataParallel;
and in the seventh step, the input data are placed on the designated GPU.
CN202211431240.8A 2022-11-14 2022-11-14 LSF-based single-host multi-GPU distributed pytorch parallel computing method Pending CN115794387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211431240.8A CN115794387A (en) 2022-11-14 2022-11-14 LSF-based single-host multi-GPU distributed pytorch parallel computing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211431240.8A CN115794387A (en) 2022-11-14 2022-11-14 LSF-based single-host multi-GPU distributed pytorch parallel computing method

Publications (1)

Publication Number Publication Date
CN115794387A true CN115794387A (en) 2023-03-14

Family

ID=85438007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211431240.8A Pending CN115794387A (en) LSF-based single-host multi-GPU distributed pytorch parallel computing method

Country Status (1)

Country Link
CN (1) CN115794387A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004028339A2 (en) * 2002-09-27 2004-04-08 Brigham And Women's Hospital, Inc. Treatment of patients with multiple sclerosis based on gene expression changes in central nervous system tissues
CN110471766A (en) * 2019-08-06 2019-11-19 北京华恒盛世科技有限公司 A kind of GPU resource scheduling system and method based on CUDA
CN110795241A (en) * 2019-10-18 2020-02-14 北京并行科技股份有限公司 Job scheduling management method, scheduling center and system
CN112035238A (en) * 2020-09-11 2020-12-04 曙光信息产业(北京)有限公司 Task scheduling processing method and device, cluster system and readable storage medium
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
III: "怎么使用Pytorch进行多卡训练", pages 1 - 5, Retrieved from the Internet <URL:https://www.yisu.com/zixun/742871.html> *
颀周等: "pytorch网络训练 单机多卡GPU加速?", pages 1 - 8, Retrieved from the Internet <URL:https://www.zhihu.com/question/456026738/answer/2712223808> *

Similar Documents

Publication Publication Date Title
US20240176601A1 (en) Method and system of command buffer between a cpu and gpu
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
CA2959528C (en) Specifying components in graph-based programs
WO2021000970A1 (en) Deep learning algorithm compiling method, device, and related product.
US20200042856A1 (en) Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
US7647590B2 (en) Parallel computing system using coordinator and master nodes for load balancing and distributing work
WO2022048557A1 (en) Ai model training method and apparatus, and computing device and storage medium
CN114968559B (en) LSF-based multi-host multi-GPU distributed arrangement deep learning model method
CN103886547A (en) Technique For Storing Shared Vertices
US10922092B2 (en) Administrator-monitored reinforcement-learning-based application manager
US8615770B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN112783554A (en) Persistent scratchpad memory for inter-program data exchange
CN102081544B (en) Application generation system and method
US11042640B2 (en) Safe-operation-constrained reinforcement-learning-based application manager
CN112183735A (en) Method and device for generating operation data and related product
CN114925591A (en) Automatic parallel strategy searching method based on polyhedron model modeling and related equipment
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN115794387A (en) LSF-based single-host multi-GPU distributed pytorch parallel computing method
US20230297349A1 (en) Bandwidth-Aware Computational Graph Mapping
US20230023545A1 (en) Methods and systems for deep learning chip design generation
CN113420466B (en) Cross-platform automatic performance optimization oriented unit computing component and method
JP7251416B2 (en) Information processing program and information processing method
Li et al. Deep learning and machine learning with gpgpu and cuda: Unlocking the power of parallel computing
CN118034695A (en) Calculation map compiling method, compiling device, calculating device and storage medium
US20230297643A1 (en) Non-rectangular matrix computations and data pattern processing using tensor cores

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20230314)