CN115794387A - LSF-based single-host multi-GPU distributed PyTorch parallel computing method - Google Patents
LSF-based single-host multi-GPU distributed PyTorch parallel computing method
- Publication number
- CN115794387A (Application CN202211431240.8A)
- Authority
- CN
- China
- Prior art keywords
- gpu
- data
- distributed
- lsf
- rank
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Multi Processors (AREA)
Abstract
The invention relates to an LSF-based single-host multi-GPU distributed PyTorch parallel computing method, and belongs to the field of computers. The method comprises two parts: a first part, resource application and scheduling; a second part, training a deep learning model with the allocated resources. In the existing nn.DataParallel approach, a single process holds the model parameters; in each batch the parameters are distributed to every GPU, each GPU computes its own gradients, the gradients are gathered onto GPU0 and averaged, GPU0 performs back propagation to update the parameters, and the updated parameters are then broadcast from GPU0 to the other GPUs. GPU utilization is therefore typically low. Moreover, nn.DataParallel requires all GPUs to be on the same node, and Apex cannot be used for mixed-precision training. Compared with the existing DataParallel approach, the proposed method is faster, more efficient, and achieves higher GPU utilization.
Description
Technical Field
The invention belongs to the field of computers, and relates to an LSF-based single-host multi-GPU distributed PyTorch parallel computing method.
Background
In recent years, deep learning has developed rapidly in image processing and natural language processing. To give models higher accuracy and stronger generalization ability, model structures are designed ever deeper and more complex, and the amount of training data keeps growing. The forward- and backward-propagation steps of model iteration are typically computation-intensive tasks with a large number of operations. Although a GPU (Graphics Processing Unit) provides stronger computing power in hardware, and models can be optimized algorithmically to accelerate convergence, the resources of a single machine still cannot satisfy large-scale training tasks. Distributed computing effectively alleviates this problem by partitioning the training task and executing it in parallel on multiple nodes.
PyTorch is an open-source Python machine learning library based on Torch, used for applications such as natural language processing. It was released by the Facebook AI Research lab (FAIR) in January 2017 on the basis of Torch. It is a Python-based scientific computing package that provides two high-level features:
1. Tensor computation with strong GPU acceleration (similar to NumPy).
2. Deep neural networks built on an automatic differentiation system.
LSF (Load Sharing Facility) is industry-oriented, commercial-grade workload management software from IBM. Its powerful resource scheduling and management capability allows it to distribute all kinds of IT resources so that distributed tasks execute faster, with better load balance, higher reliability, and lower cost. For deep learning training tasks, LSF can allocate GPU resources efficiently and flexibly and helps to create and manage a distributed computing environment, thereby accelerating training. Under LSF, the simplest and most direct way to accelerate the training of a neural network model is to use a GPU; however, the video memory of a single GPU is limited, and when a relatively large model or a huge parameter count exhausts that memory, the simplest and most forceful remedy is to add more GPUs. This raises a core problem: to train with multiple GPUs, a way must be found to distribute the data and the model across the GPUs and to coordinate the training process, so that training actually becomes faster.
At present, PyTorch computes on a single GPU card; without parallel computation across multiple GPUs, an efficiency several times that of a single card cannot be reached, which greatly reduces the cost-effectiveness of using multiple cards.
Disclosure of Invention
In view of this, the present invention aims to provide an LSF-based single-host multi-GPU distributed PyTorch parallel computing method, which solves the technical problem that a single task cannot invoke multiple GPU computing resources in a supercomputing center, so that software can improve its computing capability on a multi-GPU platform; the scheme can be used in any supercomputing center.
In order to achieve the purpose, the invention provides the following technical scheme:
the LSF-based single-host multi-GPU distributed type pitorch parallel computing method comprises two parts:
a first part: resource application and scheduling;
a second part: training a deep learning model using the allocated resources.
The first part is completed on an LSF cluster;
computing resources are applied for through LSF instructions, which specify:
the total number of jobs to be created, equal to the total number of GPUs applied for;
and the number of GPUs per host.
Optionally, the second part is implemented inside the training program.
As a precondition, each LSF job exclusively occupies one process and one GPU, and the deep learning model is based on the PyTorch framework.
In the first step, each job reads 'LSF_PM_TASKID' from the environment as the rank of its task.
In the second step, a distributed process group is initialized using the torch.distributed library, with the parameters rank, world_size, init_method and backend; rank identifies each process, world_size is the total number of processes, init_method indicates where and how the other processes are found, and backend specifies the backend communication method; the invention uses nccl as the communication backend, NCCL being the communication backend developed by NVIDIA for GPU parallel computing. A minimal sketch of these two steps is given below.
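As an illustration of the first two steps, a minimal sketch follows. It assumes that LSF_PM_TASKID is 1-based (so 1 is subtracted to obtain a 0-based rank) and that a TCP init_method on the local host is used; the helper name, address and port are illustrative rather than prescribed by the invention.

```python
import os
import torch.distributed as dist

def init_distributed(world_size: int, master_addr: str = "127.0.0.1", master_port: int = 29500) -> int:
    """Derive the process rank from LSF and initialize the process group.

    world_size should equal the total number of LSF tasks (#BSUB -n).
    """
    rank = int(os.environ["LSF_PM_TASKID"]) - 1            # step one: LSF task id -> 0-based rank (assumes 1-based ids)
    dist.init_process_group(
        backend="nccl",                                     # step two: NCCL backend for GPU communication
        init_method=f"tcp://{master_addr}:{master_port}",   # where and how the processes find each other
        world_size=world_size,                              # total number of processes
        rank=rank,                                          # identifier of this process
    )
    return rank
```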
In the third step, the training data set is read. The data set is partitioned with torch.utils.data.distributed.DistributedSampler; by setting num_replicas to world_size and rank to the rank of the current process, each process obtains its own data slice. When DistributedSampler is used for data slicing, the batch size trained on each process is the overall batch size divided by the total number of processes. A torch.utils.data.DataLoader is then used to read the data of each slice; num_workers is set greater than 1 so that worker subprocesses accelerate data reading, and pin_memory is set to True so that batches are transferred more quickly to the GPU exclusively occupied by the process, reducing the time spent on data transmission. A sketch of this step is given below.
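A minimal sketch of the third step follows; the dataset object, global batch size and number of workers are illustrative placeholders.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, global_batch_size: int, world_size: int, rank: int) -> DataLoader:
    # Each process receives a non-overlapping slice of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    # With DistributedSampler, the per-process batch size is the overall batch
    # size divided by the total number of processes.
    per_process_batch = global_batch_size // world_size
    return DataLoader(
        dataset,
        batch_size=per_process_batch,
        sampler=sampler,      # shuffling is handled by the sampler
        num_workers=4,        # worker subprocesses to speed up data reading (illustrative value)
        pin_memory=True,      # page-locked memory speeds up the transfer to this process's GPU
    )
```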
In the fourth step, the related packages are imported, including argparse, torch.distributed and DistributedDataParallel; after the imports succeed, a parameter local_rank is added, which tells the current program which GPU to run on; finally, nccl is selected as the communication mode (a sketch combining steps four to seven follows step seven).
In the fifth step, the DataLoader is wrapped: the Sampler is changed to DistributedSampler and then assigned as the sampler of the DataLoader; each GPU, i.e. each process, fetches data from its own DataLoader, and the designated DistributedSampler guarantees that the data fetched by different GPUs do not overlap.
In the sixth step, the model is wrapped with DistributedDataParallel.
In the seventh step, the input data are placed on the designated GPU.
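Steps four to seven can be combined into a sketch such as the following; the --local_rank argument, loss function, optimizer and epoch count are illustrative, and in the LSF setting the GPU index could equally be derived from the rank obtained in the first step.

```python
import argparse
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_and_train(model, loader, loss_fn, optimizer, epochs: int = 1):
    # Step four: a local_rank parameter tells the program which GPU to run on.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    device = torch.device("cuda", args.local_rank)
    torch.cuda.set_device(device)
    model = model.to(device)

    # Step six: wrap the model so that gradients are all-reduced across processes.
    ddp_model = DDP(model, device_ids=[args.local_rank])

    for epoch in range(epochs):
        loader.sampler.set_epoch(epoch)               # step five: the DistributedSampler reshuffles each epoch
        for inputs, targets in loader:
            # Step seven: put the input data on the GPU assigned to this process.
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()                           # gradients are averaged across processes here
            optimizer.step()
```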
The invention has the beneficial effects that:
the Dataparallel is easier to use (only a single GPU model is simply packaged) and therefore becomes the mainstream single-node multi-card using mode. model = nn. Dataparallell (model) it uses a process to compute the model parameters, then during each batch it is distributed to each GPU, each GPU computes its respective gradient, then it is summed to GPU0 for averaging, it is backproplated by GPU0 to update the parameters, and then the parameters of the model are propagated by GPU0 to the other GPUs. This way of use communication speed becomes a bottleneck and GPU utilization is typically low. And nn. Dataparallell requires that all GPUs are on the same node (distributed is not supported), and the Apex cannot be used for mixed precision training. Compared with the existing dataparrell mode, the method is higher in speed, high in efficiency and higher in GPU occupation.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic diagram of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustration only and are not intended to limit the invention; they are schematic representations rather than exact depictions. To better illustrate the embodiments, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product. It will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments correspond to the same or similar components. In the description of the invention, terms indicating an orientation or positional relationship such as "upper", "lower", "left", "right", "front" and "rear" are based on the orientation or positional relationship shown in the drawings; they are used only for convenience and simplicity of description and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only and are not to be construed as limiting the invention; their specific meaning can be understood by those skilled in the art according to the specific situation.
Please refer to fig. 1, which illustrates the LSF-based single-host multi-GPU distributed PyTorch parallel computing method.
First, an LSF job submission script is written.
#BSUB -q HPC.S1.GPU.X785.sha
Specifies the job submission queue.
#BSUB -n 8
Sets the number of cores to 8; the job therefore starts 8 tasks and uses 8 GPUs in total.
#BSUB -gpu "num=4:mode=exclusive_process"
Sets each host to use 4 GPUs, which the job occupies exclusively. With this directive, the environment variable CUDA_VISIBLE_DEVICES takes the value '0,1,2,3' on each host, i.e. GPUs number 0, 1, 2 and 3 are available.
#BSUB -o %J.out
#BSUB -e %J.err
Specify the output file and the error file.
python resnet_v_6.py
The job execution statement; it launches the distributed computation of the training program on the host.
1. The process group is initialized; since GPU communication is used, the backend should be nccl. Experiments show, however, that even if gloo is written by mistake, DDP automatically uses NCCL internally as the communication module.
2. Since the model is later wrapped with DDP for training, the model weights of all ranks are automatically synchronized to the weights of rank 0 when the model is wrapped, so the weights only need to be read on rank 0. This holds for PyTorch version 1.12.1; lower versions do not appear to have this property, in which case the weights have to be loaded separately on each rank and map_location has to be passed to torch.load, as shown by the two lines of code noted below.
3. The optimizer is created for the model here rather than for the DDP-wrapped ddp_model, which makes it easier to reuse the optimizer weights for single-GPU training.
4. The optimizer weights are read onto the GPU occupied by the process. Without the map_location parameter, torch.load would read the weights onto the device on which they were originally saved.
5. The optimizer loads the weights. Experiments show that even if the weights are not on the GPU where the optimizer resides, they are migrated without error; of course, loading them directly onto the corresponding GPU reduces data transfer.
6. DDP wraps the model and places a copy of the model on the corresponding GPU. The model copies of all ranks are kept consistent with rank 0. Note that DDP does not copy the optimizer, so the optimizers of the processes must be consistent at initialization: either none of them or all of them read the saved weights.
7. Training of the model begins here. The data need to be transferred to the corresponding GPU device.
8. In backward, after the models of all processes have computed their gradients, the gradients are averaged (not summed); that is, DDP registers a hook in the backward pass in which a ring all-reduce of the model gradients of all processes is executed. This can be verified by feeding different data to the model of each process: after backward, the models hold the same gradient, which is indeed the average of the gradients of all processes. It can also be verified that backward blocks each process from using the gradient until all processes have finished backward, which guarantees the consistency of all processes with respect to the gradient.
9. The optimizer of each process updates its model copy using the gradient. Because the model weights and optimizer weights of all processes are consistent at initialization, and the back-propagated gradients are identical in every iteration, the models of all processes remain consistent throughout training.
10. Since the weights of all processes remain consistent, only one process needs to save them.
11. The IP address and port of rank 0 are defined and mp.spawn is used; this only needs to be done in the main process and does not have to be repeated in each sub-process.
12. The sub-processes are created by passing in: the function called by each sub-process (whose first parameter must be the rank), the arguments of that function (excluding the rank parameter), the number of sub-processes, and whether to block until all sub-processes have finished. A condensed sketch of this workflow is given below.
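Putting items 1 to 12 together, a condensed per-process sketch might look as follows. The model (a torchvision ResNet stands in for resnet_v_6.py), the checkpoint path, the learning rate and the address/port values are illustrative assumptions, and the data loading and training loop of items 7 to 9 are omitted for brevity.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet18   # placeholder for the model trained by resnet_v_6.py

CKPT = "checkpoint.pt"                     # hypothetical checkpoint path

def worker(rank: int, world_size: int):
    # Item 1: initialize the process group; NCCL is the backend for GPU communication.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = resnet18().cuda(rank)
    # Item 4: map_location keeps loaded tensors on the GPU occupied by this process.
    ckpt = torch.load(CKPT, map_location=f"cuda:{rank}") if os.path.exists(CKPT) else None

    # Item 2: when the model is wrapped below, DDP broadcasts rank 0's weights to all
    # ranks (observed with PyTorch 1.12.1), so model weights only need loading on rank 0.
    if rank == 0 and ckpt is not None:
        model.load_state_dict(ckpt["model"])

    # Items 3, 5 and 6: the optimizer is created for the plain model (easier to reuse for
    # single-GPU training); DDP does not copy optimizer state, so every rank restores
    # its own optimizer weights to stay consistent.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    if ckpt is not None:
        optimizer.load_state_dict(ckpt["optimizer"])

    ddp_model = DDP(model, device_ids=[rank])   # item 6: every copy stays consistent with rank 0

    # Items 7-9: training loop omitted; inputs go to cuda:rank, backward() averages the
    # gradients across ranks, and each optimizer applies the same averaged gradient.

    # Item 10: all ranks remain consistent, so only rank 0 saves the weights.
    if rank == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, CKPT)
    dist.destroy_process_group()

if __name__ == "__main__":
    # Items 11 and 12: rank 0's address and port are defined once in the main process;
    # mp.spawn creates one sub-process per GPU and passes the rank as the first argument.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```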
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (3)
1. An LSF-based single-host multi-GPU distributed PyTorch parallel computing method, characterized in that the method comprises two parts:
a first part: resource application and scheduling;
a second part: training a deep learning model using the allocated resources.
2. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method of claim 1, wherein the first part is completed on an LSF cluster;
computing resources are applied for through LSF instructions, which specify:
the total number of jobs to be created, equal to the total number of GPUs applied for;
and the number of GPUs per host.
3. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method of claim 2, wherein the second part is implemented inside the training program;
as a precondition, each LSF job exclusively occupies one process and one GPU, and the deep learning model is based on the PyTorch framework;
in a first step, each job reads 'LSF_PM_TASKID' from the environment as the rank of its task;
in a second step, a distributed process group is initialized using the torch.distributed library, with the parameters rank, world_size, init_method and backend; rank identifies each process, world_size is the total number of processes, init_method indicates where and how the other processes are found, and backend specifies the backend communication method; the invention uses nccl as the communication backend, NCCL being the communication backend developed by NVIDIA for GPU parallel computing;
in a third step, the training data set is read; the data set is partitioned with torch.utils.data.distributed.DistributedSampler, and by setting num_replicas to world_size and rank to the rank of the current process each process obtains its own data slice; when DistributedSampler is used for data slicing, the batch size trained on each process is the overall batch size divided by the total number of processes; a torch.utils.data.DataLoader is then used to read the data of each slice; num_workers is set greater than 1 so that worker subprocesses accelerate data reading, and pin_memory is set to True so that batches are transferred more quickly to the GPU exclusively occupied by the process, reducing the time spent on data transmission;
in a fourth step, the related packages are imported, including argparse, torch.distributed and DistributedDataParallel; after the imports succeed, a parameter local_rank is added, which tells the current program which GPU to run on; finally, nccl is selected as the communication mode;
in a fifth step, the DataLoader is wrapped: the Sampler is changed to DistributedSampler and assigned as the sampler of the DataLoader; each GPU, i.e. each process, fetches data from its own DataLoader, and the designated DistributedSampler guarantees that the data fetched by different GPUs do not overlap;
in a sixth step, the model is wrapped with DistributedDataParallel;
and in a seventh step, the input data are placed on the designated GPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211431240.8A CN115794387A (en) | 2022-11-14 | 2022-11-14 | LSF-based single-host multi-GPU distributed PyTorch parallel computing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211431240.8A CN115794387A (en) | 2022-11-14 | 2022-11-14 | LSF-based single-host multi-GPU distributed PyTorch parallel computing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115794387A true CN115794387A (en) | 2023-03-14 |
Family
ID=85438007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211431240.8A Pending CN115794387A (en) | 2022-11-14 | 2022-11-14 | LSF-based single-host multi-GPU distributed type pytorech parallel computing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115794387A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004028339A2 (en) * | 2002-09-27 | 2004-04-08 | Brigham And Women's Hospital, Inc. | Treatment of patients with multiple sclerosis based on gene expression changes in central nervous system tissues |
CN110471766A (en) * | 2019-08-06 | 2019-11-19 | 北京华恒盛世科技有限公司 | A kind of GPU resource scheduling system and method based on CUDA |
CN110795241A (en) * | 2019-10-18 | 2020-02-14 | 北京并行科技股份有限公司 | Job scheduling management method, scheduling center and system |
CN112035238A (en) * | 2020-09-11 | 2020-12-04 | 曙光信息产业(北京)有限公司 | Task scheduling processing method and device, cluster system and readable storage medium |
CN114968559A (en) * | 2022-05-06 | 2022-08-30 | 苏州国科综合数据中心有限公司 | LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model |
Non-Patent Citations (2)
Title |
---|
III: "How to use PyTorch for multi-GPU training" (怎么使用Pytorch进行多卡训练), pages 1-5, Retrieved from the Internet <URL:https://www.yisu.com/zixun/742871.html> *
Qizhou et al. (颀周等): "PyTorch network training: single-machine multi-GPU acceleration?" (pytorch网络训练 单机多卡GPU加速?), pages 1-8, Retrieved from the Internet <URL:https://www.zhihu.com/question/456026738/answer/2712223808> *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20240176601A1 (en) | Method and system of command buffer between a cpu and gpu | |
EP4036803A1 (en) | Neural network model processing method and apparatus, computer device, and storage medium | |
CA2959528C (en) | Specifying components in graph-based programs | |
WO2021000970A1 (en) | Deep learning algorithm compiling method, device, and related product. | |
US20200042856A1 (en) | Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit | |
US7647590B2 (en) | Parallel computing system using coordinator and master nodes for load balancing and distributing work | |
WO2022048557A1 (en) | Ai model training method and apparatus, and computing device and storage medium | |
CN114968559B (en) | LSF-based multi-host multi-GPU distributed arrangement deep learning model method | |
CN103886547A (en) | Technique For Storing Shared Vertices | |
US10922092B2 (en) | Administrator-monitored reinforcement-learning-based application manager | |
US8615770B1 (en) | System and method for dynamically spawning thread blocks within multi-threaded processing systems | |
CN112783554A (en) | Persistent scratchpad memory for inter-program data exchange | |
CN102081544B (en) | Application generation system and method | |
US11042640B2 (en) | Safe-operation-constrained reinforcement-learning-based application manager | |
CN112183735A (en) | Method and device for generating operation data and related product | |
CN114925591A (en) | Automatic parallel strategy searching method based on polyhedron model modeling and related equipment | |
US8959497B1 (en) | System and method for dynamically spawning thread blocks within multi-threaded processing systems | |
CN115794387A (en) | LSF-based single-host multi-GPU distributed type pytorech parallel computing method | |
US20230297349A1 (en) | Bandwidth-Aware Computational Graph Mapping | |
US20230023545A1 (en) | Methods and systems for deep learning chip design generation | |
CN113420466B (en) | Cross-platform automatic performance optimization oriented unit computing component and method | |
JP7251416B2 (en) | Information processing program and information processing method | |
Li et al. | Deep learning and machine learning with gpgpu and cuda: Unlocking the power of parallel computing | |
CN118034695A (en) | Calculation map compiling method, compiling device, calculating device and storage medium | |
US20230297643A1 (en) | Non-rectangular matrix computations and data pattern processing using tensor cores |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20230314 |