CN115794387A - LSF-based single-host multi-GPU distributed pytorch parallel computing method - Google Patents

LSF-based single-host multi-GPU distributed pytorch parallel computing method

Info

Publication number
CN115794387A
Authority
CN
China
Prior art keywords
gpu
data
distributed
lsf
rank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211431240.8A
Other languages
Chinese (zh)
Inventor
蒋鹏飞
单晓冬
徐恩格
王小龙
鲍复劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou International Science Park Data Center Co ltd
Original Assignee
Suzhou International Science Park Data Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou International Science Park Data Center Co ltd filed Critical Suzhou International Science Park Data Center Co ltd
Priority to CN202211431240.8A
Publication of CN115794387A
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)

Abstract

The invention relates to an LSF-based single-host multi-GPU distributed PyTorch parallel computing method, and belongs to the field of computers. The method comprises two parts: a first part, resource application and scheduling; and a second part, training a deep learning model with the applied resources. In the existing nn.DataParallel approach, a single process computes the model parameters and distributes them to each GPU during every batch; each GPU computes its own gradients, the gradients are gathered onto GPU0 and averaged, GPU0 performs back propagation to update the parameters, and the parameters are then propagated from GPU0 to the other GPUs. GPU utilization is therefore typically low. Moreover, nn.DataParallel requires all GPUs to be on the same node, and Apex cannot be used for mixed-precision training. Compared with the existing DataParallel mode, the proposed method is faster, more efficient, and achieves higher GPU occupancy.

Description

LSF-based single-host multi-GPU distributed PyTorch parallel computing method
Technical Field
The invention belongs to the field of computers, and relates to an LSF-based single-host multi-GPU distributed PyTorch parallel computing method.
Background
In recent years, deep learning technology has developed rapidly in image and natural language processing. To give models higher precision and stronger generalization capability, model structures are often designed to be deeper and more complex, and the training data sets are also larger. The forward- and backward-propagation steps in model iteration are typically compute-intensive tasks with a large number of operations. Although a GPU (Graphics Processing Unit) can provide stronger computing power in hardware, and a model can be optimized algorithmically to accelerate convergence, the resources provided by a single machine still cannot meet large-scale training tasks. Distributed computing can effectively alleviate this problem by partitioning the training task and executing it in parallel on multiple nodes.
PyTorch is an open-source Python machine learning library based on Torch, used for applications such as natural language processing. It was introduced in January 2017 by Facebook AI Research (FAIR) on the basis of Torch. It is a Python-based scientific computing package that provides two high-level functions:
1. Tensor computation with strong GPU acceleration (similar to NumPy).
2. Deep neural networks built on an automatic differentiation system.
A brief illustration of both follows.
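As an illustration of these two capabilities (a minimal sketch for orientation only, not taken from the patent text):

import torch

# 1. Tensor computation with a NumPy-like API, GPU-accelerated when CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(3, 3, device=device)
y = x @ x.t()

# 2. Automatic differentiation for deep neural networks
w = torch.randn(3, requires_grad=True)
loss = (w * w).sum()
loss.backward()        # w.grad now holds d(loss)/dw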
LSF (Load Sharing Facility) is industry-oriented, commercial-grade workload management software from IBM. Its strong resource scheduling and management capability allows it to allocate all kinds of IT resources to execute distributed tasks at higher speed, with better load balancing, more reliably and at lower cost. For deep learning training tasks, LSF can allocate GPU resources efficiently and flexibly and helps create and manage distributed computing environments, thereby accelerating training. When running in an LSF environment, the simplest and most direct way to accelerate neural-network model training is to use a GPU; however, the video memory of a single GPU is often limited, and when a relatively large model or a huge number of parameters exhausts the video memory, the simplest and most drastic remedy is to add more GPUs. This raises a core problem: in order to train with multiple GPUs, a way must be found to distribute the data and the model across the GPUs and to coordinate the training process, so that training speed actually increases.
At present, PyTorch can compute on a single GPU card, but parallel computing across multiple GPUs does not yet deliver a corresponding multiple of single-card efficiency, which greatly reduces the cost-effectiveness of using multiple cards.
Disclosure of Invention
In view of this, the present invention aims to provide an LSF-based single-host multi-GPU distributed PyTorch parallel computing method that solves the technical problem that a single task cannot call on multiple GPU computing resources in a supercomputing center, so that software can improve its computing capability on a multi-GPU platform and the project can be used in any supercomputing center.
In order to achieve the purpose, the invention provides the following technical scheme:
the LSF-based single-host multi-GPU distributed type pitorch parallel computing method comprises two parts:
a first part: resource application and scheduling;
a second part: and (4) training a deep learning model by using resources.
The first part is completed under an LSF cluster;
applying for computing resources through instructions of the LSF, including:
the total number of the jobs to be created is equal to the total number of the GPUs applied;
and the number of GPUs of a single host.
Optionally, the second part is implemented inside the program;
first of all, each LSF job exclusively occupies one process and one GPU, and the deep learning model is based on the PyTorch framework;
in the first step, each job reads "LSF_PM_TASKID" from the environment as the rank of its task;
in the second step, a distributed process group is initialized using the torch.distributed library, with the parameters rank, world_size, init_method and backend; rank identifies each process, world_size is the total number of processes, init_method indicates where and how to find the other processes, and backend specifies the communication backend used, for which the invention uses nccl; NCCL is the communication backend developed by NVIDIA for GPU parallel computing;
in the third step, the training data set is read; the data set is partitioned with torch.utils.data.distributed.DistributedSampler in PyTorch, and each process obtains its corresponding data slice by setting num_replicas to world_size and rank to the rank of the current process; when DistributedSampler is used for data slicing, the batch size trained on each process is the overall batch size divided by the total number of processes; torch.utils.data.DataLoader is then used to read the data of each slice; num_workers is set greater than 1 to start subprocesses that accelerate data loading, and pin_memory is set to True so that data are read directly onto the GPU exclusively occupied by the process, reducing transfer time;
in the fourth step, the related packages, including argparse, torch.distributed and DistributedDataParallel, are imported, and a parameter local_rank is added after the import succeeds, telling the current program which GPU to run on; finally, nccl is selected as the designated communication mode;
in the fifth step, the DataLoader is wrapped; what is needed here is to replace the Sampler with DistributedSampler and assign it as the sampler of the DataLoader; each GPU, i.e. each process, fetches data from its own DataLoader, and the designated DistributedSampler ensures that the GPUs fetch non-overlapping data;
in the sixth step, the model is wrapped with DistributedDataParallel;
and in the seventh step, the input data are placed on the designated GPU; a code sketch covering these seven steps follows.
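A minimal end-to-end sketch of the seven steps is given below; the world size of 8, the tcp://127.0.0.1:23456 rendezvous address, the 1-based interpretation of LSF_PM_TASKID, and torchvision's FakeData/ResNet-18 stand-ins for the user's own data set and model are all assumptions made for illustration:

import argparse
import os

import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)   # GPU index on this host (step 4)
    args = parser.parse_args()

    # Step 1: the rank of this task comes from the LSF environment
    rank = int(os.environ["LSF_PM_TASKID"]) - 1                # assumes 1-based LSF task IDs
    world_size = 8                                             # total number of processes/GPUs (assumption)

    # Step 2: initialize the distributed process group over the NCCL backend
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",   # placeholder address
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(args.local_rank)

    # Steps 3-5: data set, DistributedSampler and DataLoader (per-process batch = total / world_size)
    dataset = torchvision.datasets.FakeData(transform=torchvision.transforms.ToTensor())
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=256 // world_size,
                        sampler=sampler, num_workers=2, pin_memory=True)

    # Step 6: wrap the model with DistributedDataParallel
    model = torchvision.models.resnet18().cuda(args.local_rank)
    ddp_model = DDP(model, device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Step 7: move each batch onto the designated GPU and train
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for images, labels in loader:
            images = images.cuda(args.local_rank, non_blocking=True)
            labels = labels.cuda(args.local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(images), labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()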
The invention has the beneficial effects that:
the Dataparallel is easier to use (only a single GPU model is simply packaged) and therefore becomes the mainstream single-node multi-card using mode. model = nn. Dataparallell (model) it uses a process to compute the model parameters, then during each batch it is distributed to each GPU, each GPU computes its respective gradient, then it is summed to GPU0 for averaging, it is backproplated by GPU0 to update the parameters, and then the parameters of the model are propagated by GPU0 to the other GPUs. This way of use communication speed becomes a bottleneck and GPU utilization is typically low. And nn. Dataparallell requires that all GPUs are on the same node (distributed is not supported), and the Apex cannot be used for mixed precision training. Compared with the existing dataparrell mode, the method is higher in speed, high in efficiency and higher in GPU occupation.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic diagram of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustration only and are not intended to limit the invention; they are schematic representations rather than actual drawings. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms indicating an orientation or positional relationship, such as "upper", "lower", "left", "right", "front" and "rear", are based on the orientation or positional relationship shown in the drawings, are used only for convenience and simplification of description, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation; the terms describing positional relationships in the drawings are therefore used for illustrative purposes only and are not to be construed as limiting the present invention, and their specific meaning can be understood by those skilled in the art according to the specific situation.
Please refer to fig. 1, which illustrates the LSF-based single-host multi-GPU distributed PyTorch parallel computing method.
Writing the LSF job submission script:
#BSUB -q HPC.S1.GPU.X785.sha
This specifies the job submission queue.
#BSUB -n 8
This sets the number of cores used to 8; the job starts 8 tasks and uses 8 GPUs in total.
#BSUB -gpu "num=4:mode=exclusive_process"
Each host is set to use 4 GPUs, and the tasks will exclusively occupy the allocated GPUs. This statement makes the environment variable CUDA_VISIBLE_DEVICES equal to '0,1,2,3' on each host, i.e. GPUs number 0, 1, 2 and 3 are available.
#BSUB -o %J.out
#BSUB -e %J.err
An output file and an error file are specified.
python resnet_v_6.py
This is the job execution statement; it sends the job to run the distributed computation on the host. Inside resnet_v_6.py, each task can derive its rank and its exclusive GPU from the environment that LSF sets up, as in the short sketch below.
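A minimal sketch of that mapping inside the Python program (the 1-based interpretation of LSF_PM_TASKID and the modulo mapping to a local GPU index are assumptions, not stated in the original text):

import os

import torch

rank = int(os.environ["LSF_PM_TASKID"]) - 1    # global rank of this task, assuming 1-based task IDs
gpus_per_host = 4                              # matches "num=4" in the job script
local_rank = rank % gpus_per_host              # index into CUDA_VISIBLE_DEVICES='0,1,2,3'
torch.cuda.set_device(local_rank)              # bind this process to its exclusive GPU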
1. The process group is initialized; because GPU communication is used, the backend should be set to nccl. However, experiments show that even if it is incorrectly written as gloo, NCCL is automatically used inside DDP as the communication module.
2. Because the model is later wrapped with DDP for training, DDP automatically synchronizes the model weights of all ranks to the weights of rank 0, so the model weights only need to be loaded on rank 0. This holds for PyTorch version 1.12.1; lower versions do not appear to have this property, in which case the weights must be loaded separately on each rank and a map_location must be passed to the load call, as shown by the two lines of code noted below.
3. The optimizer is created here for the model rather than for the ddp_model wrapped with DDP, so that the optimizer weights can be read more easily for single-GPU training.
4. The optimizer weights are loaded onto the GPU occupied by this process. Without the map_location parameter, the load would place the weights on the device on which they were originally saved.
5. The optimizer takes over the loaded weights. Experiments show that even if the weights are not on the GPU where the optimizer resides, they are migrated without error; of course, loading them directly onto the corresponding GPU reduces data transfer.
6. DDP wraps the model and copies the model to the corresponding GPU. The model copies of all ranks are kept consistent with rank 0. Note that DDP does not copy the optimizer, so the optimizers of all processes must be consistent at initialization: either none of them loads the weights, or all of them do.
7. Training of the model begins here. The data must be transferred to the corresponding GPU device.
8. In backward, after the models of all processes have computed their gradients, the gradients are averaged (not summed). That is, DDP adds a hook to the backward function in which an all-reduce (ring all-reduce) of the model gradients of all processes is executed. This can be verified by feeding different data into the model of each process: after backward, the models hold the same gradient, which is indeed the average of the gradients of all processes. It can also be verified that backward blocks each process from using the gradient; a process can read and use the gradient only after all processes have completed backward. This ensures that all processes are consistent with respect to the gradient.
9. The optimizer of each process updates the weights of its model copy using the gradient. Because the model weights and optimizer weights of every process are consistent at initialization, and the back-propagated gradients are consistent in every iteration, the models of all processes remain consistent throughout training.
10. Since the weights of all processes are kept consistent, only one process is needed for saving.
11. The IP address and port of rank 0 are defined and mp.spawn is used; this only needs to be defined in the main process and does not need to be defined separately in each subprocess.
12. The subprocesses are created, passing in: the function called by each subprocess (whose first parameter must be rank), the arguments of that function (excluding the rank parameter), the number of subprocesses, and whether to wait for all subprocesses to finish being created before execution starts. A sketch of points 1 to 12 follows.
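A minimal runnable sketch corresponding to points 1 to 12 is given below; the linear model, loss, checkpoint file names and the 127.0.0.1:29500 rendezvous address are placeholders chosen for illustration rather than the patent's own code:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

MODEL_CKPT = "model.pt"          # hypothetical checkpoint file names
OPTIM_CKPT = "optimizer.pt"

def worker(rank, world_size):
    # (1) initialize the process group; nccl is the backend for GPU communication
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # (3) create the model and its optimizer before DDP wrapping
    model = nn.Linear(128, 10).cuda(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # (2) load the model weights on rank 0 only; DDP later synchronizes them to the other ranks
    if rank == 0 and os.path.exists(MODEL_CKPT):
        model.load_state_dict(torch.load(MODEL_CKPT, map_location=f"cuda:{rank}"))

    # (4)(5) load the optimizer weights onto the GPU occupied by this process
    if os.path.exists(OPTIM_CKPT):
        optimizer.load_state_dict(torch.load(OPTIM_CKPT, map_location=f"cuda:{rank}"))

    # (6) DDP wraps the model; every rank's copy is kept consistent with rank 0
    ddp_model = DDP(model, device_ids=[rank])

    # (7)-(9) training loop: data go to this rank's GPU, backward all-reduces the gradients
    loss_fn = nn.MSELoss()
    for _ in range(10):
        x = torch.randn(32, 128).cuda(rank)
        y = torch.randn(32, 10).cuda(rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()      # (8) DDP's hook averages the gradients across all processes
        optimizer.step()     # (9) every rank applies the same averaged gradient

    # (10) weights are identical on all ranks, so rank 0 alone saves them
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), MODEL_CKPT)
        torch.save(optimizer.state_dict(), OPTIM_CKPT)
    dist.destroy_process_group()

def main():
    # (11) rendezvous address of rank 0, defined once in the main process
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder IP
    os.environ["MASTER_PORT"] = "29500"       # placeholder port
    world_size = torch.cuda.device_count()
    # (12) spawn the subprocesses; rank is passed automatically as the first argument of worker
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()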
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method is characterized in that the method comprises two parts:
a first part: resource application and scheduling;
a second part: training a deep learning model using the applied resources.
2. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method of claim 1, wherein: the first part is completed under an LSF cluster;
computing resources are applied for through LSF instructions, specifying:
the total number of jobs to be created, which is equal to the total number of GPUs requested;
and the number of GPUs on a single host.
3. The LSF-based single-host multi-GPU distributed PyTorch parallel computing method of claim 2, wherein: the second part is implemented inside the program;
first of all, each LSF job exclusively occupies one process and one GPU, and the deep learning model is based on the PyTorch framework;
in the first step, each job reads "LSF_PM_TASKID" from the environment as the rank of its task;
in the second step, a distributed process group is initialized using the torch.distributed library, with the parameters rank, world_size, init_method and backend; rank identifies each process, world_size is the total number of processes, init_method indicates where and how to find the other processes, and backend specifies the communication backend used, for which the invention uses nccl; NCCL is the communication backend developed by NVIDIA for GPU parallel computing;
in the third step, the training data set is read; the data set is partitioned with torch.utils.data.distributed.DistributedSampler in PyTorch, and each process obtains its corresponding data slice by setting num_replicas to world_size and rank to the rank of the current process; when DistributedSampler is used for data slicing, the batch size trained on each process is the overall batch size divided by the total number of processes; torch.utils.data.DataLoader is then used to read the data of each slice; num_workers is set greater than 1 to start subprocesses that accelerate data loading, and pin_memory is set to True so that data are read directly onto the GPU exclusively occupied by the process, reducing transfer time;
in the fourth step, the related packages, including argparse, torch.distributed and DistributedDataParallel, are imported, and a parameter local_rank is added after the import succeeds, telling the current program which GPU to run on; finally, nccl is selected as the designated communication mode;
in the fifth step, the DataLoader is wrapped; what is needed here is to replace the Sampler with DistributedSampler and assign it as the sampler of the DataLoader; each GPU, i.e. each process, fetches data from its own DataLoader, and the designated DistributedSampler ensures that the GPUs fetch non-overlapping data;
in the sixth step, the model is wrapped with DistributedDataParallel;
and in the seventh step, the input data are placed on the designated GPU.
CN202211431240.8A 2022-11-14 2022-11-14 LSF-based single-host multi-GPU distributed pytorch parallel computing method Pending CN115794387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211431240.8A CN115794387A (en) 2022-11-14 2022-11-14 LSF-based single-host multi-GPU distributed pytorch parallel computing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211431240.8A CN115794387A (en) 2022-11-14 2022-11-14 LSF-based single-host multi-GPU distributed pytorch parallel computing method

Publications (1)

Publication Number Publication Date
CN115794387A true CN115794387A (en) 2023-03-14

Family

ID=85438007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211431240.8A Pending CN115794387A (en) LSF-based single-host multi-GPU distributed pytorch parallel computing method

Country Status (1)

Country Link
CN (1) CN115794387A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004028339A2 (en) * 2002-09-27 2004-04-08 Brigham And Women's Hospital, Inc. Treatment of patients with multiple sclerosis based on gene expression changes in central nervous system tissues
CN110471766A (en) * 2019-08-06 2019-11-19 北京华恒盛世科技有限公司 A kind of GPU resource scheduling system and method based on CUDA
CN110795241A (en) * 2019-10-18 2020-02-14 北京并行科技股份有限公司 Job scheduling management method, scheduling center and system
CN112035238A (en) * 2020-09-11 2020-12-04 曙光信息产业(北京)有限公司 Task scheduling processing method and device, cluster system and readable storage medium
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
III: "怎么使用Pytorch进行多卡训练", pages 1 - 5, Retrieved from the Internet <URL:https://www.yisu.com/zixun/742871.html> *
颀周等: "pytorch网络训练 单机多卡GPU加速?", pages 1 - 8, Retrieved from the Internet <URL:https://www.zhihu.com/question/456026738/answer/2712223808> *

Similar Documents

Publication Publication Date Title
US20240176601A1 (en) Method and system of command buffer between a cpu and gpu
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
CA2959528C (en) Specifying components in graph-based programs
WO2021000970A1 (en) Deep learning algorithm compiling method, device, and related product.
US20200042856A1 (en) Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
US7647590B2 (en) Parallel computing system using coordinator and master nodes for load balancing and distributing work
WO2022048557A1 (en) Ai model training method and apparatus, and computing device and storage medium
CN114968559B (en) LSF-based multi-host multi-GPU distributed arrangement deep learning model method
CN103886547A (en) Technique For Storing Shared Vertices
US10922092B2 (en) Administrator-monitored reinforcement-learning-based application manager
US8615770B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN112783554A (en) Persistent scratchpad memory for inter-program data exchange
CN102081544B (en) Application generation system and method
US11042640B2 (en) Safe-operation-constrained reinforcement-learning-based application manager
CN112183735A (en) Method and device for generating operation data and related product
CN114925591A (en) Automatic parallel strategy searching method based on polyhedron model modeling and related equipment
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN115794387A (en) LSF-based single-host multi-GPU distributed pytorch parallel computing method
US20230297349A1 (en) Bandwidth-Aware Computational Graph Mapping
US20230023545A1 (en) Methods and systems for deep learning chip design generation
CN113420466B (en) Cross-platform automatic performance optimization oriented unit computing component and method
JP7251416B2 (en) Information processing program and information processing method
Li et al. Deep learning and machine learning with gpgpu and cuda: Unlocking the power of parallel computing
CN118034695A (en) Calculation map compiling method, compiling device, calculating device and storage medium
US20230297643A1 (en) Non-rectangular matrix computations and data pattern processing using tensor cores

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20230314)