CN106844024B - GPU/CPU scheduling method and system of self-learning running time prediction model - Google Patents
- Publication number
- CN106844024B (application CN201611251972.3A)
- Authority
- CN
- China
- Prior art keywords
- program
- gpu
- parameter
- cpu
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides a GPU/CPU scheduling method based on a self-learning running-time prediction model, relating to the technical fields of large-scale heterogeneous computing and cloud computing. The method comprises: preprocessing the source code, generating running-state identifiers corresponding to the source code together with the parameters required for program operation, and storing them in an XML file; setting a prediction function, calculating its regression parameter θ from the running time of the program at a given stage, as returned by the running-state identifiers, and the parameter set of that stage, and storing θ in the XML file; and, when the program is called again, looking up the XML file corresponding to the program, calculating new normalized parameters, substituting them into the prediction function to obtain a predicted running time for the current run, and obtaining the time that reallocating the program to another node would consume; if the reallocation time is lower than the predicted running time, the program is allocated to a CPU node.
Description
Technical Field
The invention relates to the technical field of large-scale heterogeneous computing and cloud computing, in particular to a GPU/CPU scheduling method and system of a self-learning operation time prediction model.
Background
With the development and popularization of GPGPU (general-purpose computing on graphics processing units) technology, more and more computing clusters use GPUs (graphics processing units) together with CPUs (central processing units) for heterogeneous parallel computing, in order to solve large-scale computation problems.
In the prior art, one approach unifies the CPU and GPU into a whole loaded into a computing cluster, forming a method and device for hybrid CPU/GPU parallel computing: in a computing node to which pending tasks have been scheduled, the CPU preprocesses the scheduled tasks one by one and maps each task into the GPU's video memory after preprocessing. Another approach distributes each task of a dataflow program to a suitable computing platform according to the task's computing characteristics and the volume of data communicated between tasks. However, some task types, such as image and video processing, are relatively uniform and highly repetitive, with each task differing only in the data and parameters involved in the computation. Meanwhile, the GPU is suited to large-scale parallel tasks, whereas tasks with little parallelism and serial tasks are better processed on the CPU. Since the GPU device is installed on the PCI-E interface, a node equipped with a GPU can run GPU and CPU tasks independently; if a GPU task contains a segment that uses only the CPU, the GPU sits idle during that segment.
For highly repetitive data-processing tasks containing both CPU and GPU segments, the key to scheduling lies in correctly predicting the processing time and data volume of each segment and deciding whether the task should migrate to another node. Most prior art cannot predict the processing time, so tasks wait blindly in queues while GPU resources sit idle because a serial CPU task occupies the system. This defect can be remedied by predicting a task's running time on the GPU and the CPU: when a task will occupy the CPU for a long time, it is packed off to a CPU-only computing node, and the GPU can proceed to the next computing task, thereby reducing idle hardware across the system and improving computing efficiency.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a GPU/CPU scheduling method and system of a self-learning operation time prediction model. The invention discloses a GPU/CPU scheduling method of a self-learning operation time prediction model, which comprises the following steps:
step 1, preprocessing a source code, generating an operation state identifier corresponding to the source code and a parameter required by program operation, and storing the operation state identifier and the parameter in an XML file;
step 2, setting a prediction function, calculating a regression parameter theta of the prediction function according to the running time of the program at a certain stage returned by the running state identifier and the parameter set at the stage, and storing the regression parameter theta in an XML file;
and 3, when the program is called again, searching the XML file corresponding to the program, calculating the normalization parameter, substituting the normalization parameter into the prediction function, acquiring the running time predicted value of the program in the current running, acquiring the consumed time required by the program to be redistributed to another node, and if the newly distributed consumed time is lower than the running time predicted value, distributing the program to the CPU node.
In the aforementioned GPU/CPU scheduling method with a self-learning runtime prediction model, the preprocessing of the source code in step 1 includes generating corresponding signals when data is exchanged between GPU and CPU memory, including a copy-to-GPU-memory signal and a copy-to-CPU-memory signal.
In the above GPU/CPU scheduling method for the self-learning runtime prediction model, in step 3 the normalization parameter is generated by the following formula:

X_i = (X'_i - μ) / S_i,

where X_i is the normalized parameter, X'_i is the original operating parameter, μ is the mean, and S_i is the standard deviation.
In the above GPU/CPU scheduling method for the self-learning runtime prediction model, in step 2 the prediction function h_θ(X) is set to

h_θ(X) = θ^T X,

where, assuming the program needs n parameters, the regression parameter of the prediction function is θ = [θ_0 θ_1 … θ_n]^T, X = [1 X_1 … X_n]^T, and h is the predicted running time; a mean square error function is designed as

J(θ) = (1/2N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k))²,

where y^(k) is the recorded running time of the program at a certain stage and N is the number of recorded runs. For each parameter j = 0, 1, …, n, the following update is computed repeatedly until the above function converges:

θ_j := θ_j - α (1/N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k)) X_j^(k),

and the finally obtained θ is recorded in the XML file.
In the above GPU/CPU scheduling method with the self-learning runtime prediction model, in step 3 the time consumed by reallocating the program to another node is obtained through the following formula:

t = m / v,

where m is the size of the file that the program needs to migrate and v is the average network speed.
The invention also provides a GPU/CPU scheduling system of the self-learning operation time prediction model, which comprises:
the initialization module is used for preprocessing the source code, generating an operation state identifier corresponding to the source code and parameters required by program operation, and storing the operation state identifier and the parameters in an XML file;
the calculation theta module is used for setting a prediction function, calculating a regression parameter theta of the prediction function according to the running time of the program at a certain stage returned by the running state identifier and the parameter set at the stage, and storing the regression parameter theta in an XML file;
and the distribution module is used for searching the XML file corresponding to the program when the program is called again, calculating the normalization parameter, substituting the normalization parameter into the prediction function, acquiring the running time predicted value of the program in the running process, acquiring the consumed time required by the program to be redistributed to another node, and distributing the program to the CPU node if the newly distributed consumed time is lower than the running time predicted value.
In the GPU/CPU scheduling system with the self-learning runtime prediction model, the initialization module preprocesses the source code and generates corresponding signals when data is exchanged between GPU and CPU memory, the signals including a copy-to-GPU-memory signal and a copy-to-CPU-memory signal.
The GPU/CPU scheduling system of the self-learning runtime prediction model described above, wherein the allocation module generates the normalization parameter by the following formula:

X_i = (X'_i - μ) / S_i,

where X_i is the normalized parameter, X'_i is the original operating parameter, μ is the parameter mean, and S_i is the parameter standard deviation.
The GPU/CPU scheduling system of the self-learning runtime prediction model described above, wherein the calculate-θ module sets the prediction function h_θ(X) to

h_θ(X) = θ^T X,

where, assuming the program needs n parameters, the regression parameter of the prediction function is θ = [θ_0 θ_1 … θ_n]^T, X = [1 X_1 … X_n]^T, and h is the predicted running time; a mean square error function is designed as

J(θ) = (1/2N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k))²,

where y^(k) is the recorded running time of the program at a certain stage and N is the number of recorded runs. For each parameter j = 0, 1, …, n, the following update is computed repeatedly until the above function converges:

θ_j := θ_j - α (1/N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k)) X_j^(k),

and θ is recorded in the XML file.
The GPU/CPU scheduling system of the self-learning runtime prediction model described above, wherein the allocation module obtains the time consumed by reallocating the program to another node through the following formula:

t = m / v,

where m is the size of the file that the program needs to migrate and v is the average network speed.
According to the scheme, the invention has the advantages that:
in a cluster system with both CPU and GPU, the invention can utilize GPU resources to the maximum extent, and allocate programs occupying CPU for a long time and having GPU idle to nodes with only CPU, thereby reducing the running time of the programs as a whole.
Drawings
FIG. 1 is a flow chart of a GPU/CPU scheduling method of the self-learning runtime prediction model of the present invention.
FIG. 2 is a topological structure diagram of the GPU/CPU scheduling method of the self-learning runtime prediction model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
FIG. 1 is a flow chart of the GPU/CPU scheduling method of the self-learning runtime prediction model of the present invention. The system comprises a source-code preprocessor, a scheduling server, a database, and a computing cluster with multiple GPU and CPU nodes; the program to be run and its data parameters are the input, and the program's computed result is the output. The predicted time is calculated from the historical running times recorded in the database, the actual running time is recorded back into the database, and the model is updated. The method specifically comprises the following steps:
step 1 preprocesses the source code of the program.
For each program source code that is to run in the system for the first time, the source-code preprocessor is run, and signals are emitted respectively on memory allocation, on CPU/GPU memory exchange, and on kernel-function execution. Specifically, in this embodiment, for CUDA source code, each signal-emitting function is designed on a signal-and-slot framework; the slot function runs on the dispatch server and receives the corresponding signals. The step specifically comprises the following substeps:
(1-1) when an operation copying from CPU memory to GPU memory is detected in the program, the corresponding SIGNAL(cudaMemcpyToGPU(void *src, void *dst, size_t size)) is emitted, where src is the source memory address, dst is the destination memory address, and size is the size of the memory to be copied;
(1-2) when an operation copying from GPU memory to CPU memory is detected, the corresponding SIGNAL(cudaMemcpyToCPU(void *src, void *dst, size_t size)) is emitted, with src, dst, and size as above;
(1-3) when the GPU-side kernel function is detected to start executing, the corresponding SIGNAL(cudaKernelLaunch()) is emitted to mark the start of the GPU program;
(1-4) when the GPU-side kernel function is detected to finish, a synchronization function must be run, and the corresponding SIGNAL(cudaSync()) is emitted to mark the end of the GPU program;
(1-5) the slot function runs on the dispatch server, receives each signal, and records it as a time node in the database of step 2; if the program is being run for the second or a later time, step 3 is executed.
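The timing side of substeps (1-1) through (1-5) can be sketched in miniature as follows. The Python class and its methods are illustrative stand-ins (the patent's implementation targets CUDA/C++ with a signal-and-slot framework); only the signal names mirror the patent's signals:

```python
import time

class SchedulerSlot:
    """Toy stand-in for the dispatch server's slot function in (1-5):
    every emitted signal is recorded as a (signal_name, timestamp) time
    node, from which per-segment running times can be derived."""

    def __init__(self):
        self.events = []

    def receive(self, signal_name):
        # record the signal together with a monotonic timestamp
        self.events.append((signal_name, time.monotonic()))

    def segment_times(self):
        """Elapsed time between consecutive signals, labeled by the
        signal that closes each segment."""
        return [(b[0], b[1] - a[1])
                for a, b in zip(self.events, self.events[1:])]
```

For example, the span between a received `cudaKernelLaunch` and the following `cudaSync` gives the GPU kernel segment's running time.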
Step 2 establishing the database and generating the XML document
This step establishes or indexes the corresponding XML file and database fields according to whether the program is running for the first time, and records the running time and similar data required by step 3. It specifically comprises the following substeps:
(2-1) if the program to be executed is running for the first time, a unique identifier is generated and associated with an XML file; the XML file records the integer and floating-point parameters required by the program and, for some programs such as image processing, the characteristics of the processed data, including but not limited to the length and width of the images, their resolution, and their number. The following fields are generated in the database:
ProgramID int primary key not null identity(1,1),
XmlPos char(50)
(2-2) if the program has run before, the corresponding XML file name is looked up in the database by program name, and the parameters required by this run are appended to the XML file, including but not limited to the length and width of the images, their resolution, and their number.
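The per-run bookkeeping of substeps (2-1)/(2-2) can be sketched as follows. The patent fixes the database fields but not an XML schema, so the element names below are our own illustrative choice:

```python
import xml.etree.ElementTree as ET

def record_run(xml_path, params, stage_times):
    """Append one run's parameters (e.g. image width/height, resolution,
    count) and measured stage times to the program's XML file, creating
    the file on the first run as in (2-1)."""
    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
    except (FileNotFoundError, ET.ParseError):
        root = ET.Element("program")
        tree = ET.ElementTree(root)
    run = ET.SubElement(root, "run")
    for name, value in params.items():
        ET.SubElement(run, "param", name=name).text = str(value)
    for stage, seconds in stage_times.items():
        ET.SubElement(run, "stage", name=stage).text = str(seconds)
    tree.write(xml_path)
```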
Step 3 parameter preprocessing
The parameters to be used are preprocessed, parameters of floating point and integer types are selected, and normalization processing is carried out. The method specifically comprises the following substeps:
(3-1) record all parameters of integer and floating-point type in the XML file, and calculate the mean μ and standard deviation S_i;
(3-2) denote the i-th parameter by X'_i and normalize it according to the formula

X_i = (X'_i - μ) / S_i,

recording the result in the XML file in preparation for step 4.
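Substeps (3-1)/(3-2) amount to z-score normalization of each parameter over the program's recorded runs. A minimal sketch (the function name and the use of Python are our own choices, not from the patent):

```python
def normalize_history(history):
    """Per-parameter z-score normalization over recorded runs:
    X_i = (X'_i - mu_i) / S_i, using the mean and (population)
    standard deviation of parameter i across the historical
    parameter sets. history: one list of parameter values per run."""
    n_runs = len(history)
    n_params = len(history[0])
    columns = []
    for i in range(n_params):
        col = [run[i] for run in history]
        mu = sum(col) / n_runs
        s = (sum((x - mu) ** 2 for x in col) / n_runs) ** 0.5 or 1.0
        columns.append([(x - mu) / s for x in col])
    # transpose back to one normalized parameter vector per run
    return [list(v) for v in zip(*columns)]
```

The `or 1.0` guards against a parameter with zero spread, which would otherwise divide by zero.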
Step 4 calculating the predicted time
This step determines whether the task is deployed on a GPU or a CPU node according to the normalized parameters calculated in step 3. It specifically comprises the following substeps:
(4-1) if the program to be run is running for the first time, skip to step 5;
(4-2) obtain from the XML file the prediction-function regression parameter θ = [θ_0 θ_1 … θ_n]^T and the normalized program parameters of step 3, X = [1 X_1 … X_n]^T, and substitute them into the following prediction function to compute the predicted running time:

h_θ(X) = θ^T X,

where h represents the computed time the program needs to run and T denotes matrix transposition;
(4-3) given this time, and assuming the network speed is known and varies little, the time cost (elapsed time) of reallocating the program to another node can be approximated as

t = m / v,

where m is the size of the file in which the program resides and v is the average network speed; if the reallocation time is lower than the predicted running time, the program can absorb the time overhead of node reallocation and may be allocated to a CPU node.
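The decision of substeps (4-2)/(4-3) — predict h = θ^T X and compare it with the transfer cost t = m / v — can be sketched as follows (the function and parameter names are illustrative):

```python
def should_migrate(theta, x, file_size_bytes, net_speed_bps):
    """Return True if the program should be reallocated to a CPU-only
    node: the migration cost t = m / v is lower than the predicted
    running time h = theta^T . x. x is the normalized parameter vector
    with a leading 1 for the bias term theta_0."""
    h = sum(t * xi for t, xi in zip(theta, x))   # predicted run time (s)
    t_move = file_size_bytes / net_speed_bps     # transfer cost (s)
    return t_move < h
```

For instance, with θ = [2.0, 1.0] and X = [1.0, 3.0], the predicted time is 5 s; a 100 MB file over a 125 MB/s link costs 0.8 s to move, so migration pays off.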
(4-4) while the program executes, its signals are received and the running time of each program segment from step 1 is recorded back into the XML file, to participate in the computation of step 5.
Step 5 calculating a fitting function
This step computes the time prediction function using a linear model: the historical running times recorded in step 4 are retrieved and the running-time fitting parameter θ is recomputed, improving subsequent prediction accuracy for the program. The step specifically comprises the following substeps:
(5-1) the invention designs the fitted prediction function as a linear function; the prediction function h_θ(X) is set to

h_θ(X) = θ^T X,

where, assuming the program needs n parameters, the regression parameter of the prediction function is θ = [θ_0 θ_1 … θ_n]^T and the normalized parameter vector obtained in step 3 is X = [1 X_1 … X_n]^T; the program's n parameters correspond to the n+1 prediction parameters θ_0, …, θ_n (θ_0 being the bias term);
(5-2) corresponding to (5-1), the mean square error function is designed as

J(θ) = (1/2N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k))²,

where y^(k) is the running time of the program at the given stage recorded in step 4 for the k-th recorded run and N is the number of recorded runs; for each parameter j = 0, 1, …, n, the following update is computed repeatedly until the above function converges:

θ_j := θ_j - α (1/N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k)) X_j^(k),

where α is the learning rate; θ is then recorded in the XML file;
(5-3) when the program runs for the k-th time, the corresponding XML file is looked up in the database, the normalized parameter X^(k) is calculated and substituted into the prediction function h_θ(X), and the predicted running time h required by the program at that stage is obtained;
(5-4) when the program has run to completion, the time of each program segment is recorded, the prediction function is updated, and the current normalized parameters are substituted into the mean square error function;
(5-5) for each of θ_0 through θ_n, the above update is computed repeatedly until the mean square error function converges; the θ stored in the XML file is then updated again, and program execution ends.
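The fitting loop of substeps (5-1) through (5-5) is ordinary batch gradient descent for a linear model. A minimal sketch, assuming the XML history has already been loaded into plain lists (the learning rate and iteration count are our own choices):

```python
def fit_theta(xs, ys, alpha=0.1, iters=2000):
    """Batch gradient descent for the linear model h_theta(X) = theta^T X.
    xs: normalized parameter vectors, each with a leading 1 for the bias;
    ys: measured stage running times. Returns the fitted theta that would
    be stored back into the program's XML file."""
    n = len(xs[0])
    m = len(xs)
    theta = [0.0] * n
    for _ in range(iters):
        # gradient of J(theta) = (1/2m) * sum (h(x) - y)^2
        grad = [0.0] * n
        for x, y in zip(xs, ys):
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            for j in range(n):
                grad[j] += err * x[j] / m
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta
```

With two runs, X = [1, 0] taking 1 s and X = [1, 1] taking 3 s, the loop converges to θ ≈ [1, 2], i.e. a 1 s base cost plus 2 s per unit of the normalized parameter.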
FIG. 2 is a topology structure diagram of the GPU/CPU scheduling method and system of the self-learning runtime prediction model of the present invention, the system includes a scheduling server, a database server, a plurality of GPU computation nodes and CPU nodes. The physical connection of the various parts may be through a gigabit switch.
The invention also provides a GPU/CPU scheduling system of the self-learning operation time prediction model, which comprises:
the initialization module is used for preprocessing the source code, generating an operation state identifier corresponding to the source code and parameters required by program operation, and storing the operation state identifier and the parameters in an XML file;
the calculation theta module is used for setting a prediction function, calculating a regression parameter theta of the prediction function according to the running time of the program at a certain stage returned by the running state identifier and the parameter set at the stage, and storing the regression parameter theta in an XML file;
and the distribution module is used for searching the XML file corresponding to the program when the program is called again, calculating the normalization parameter, substituting the normalization parameter into the prediction function, acquiring the running time predicted value of the program in the running process, acquiring the consumed time required by the program to be redistributed to another node, and distributing the program to the CPU node if the newly distributed consumed time is lower than the running time predicted value.
Further, the initialization module preprocesses the source code and generates corresponding signals when data is exchanged between GPU and CPU memory, the signals including a copy-to-GPU-memory signal and a copy-to-CPU-memory signal.
Further, the allocation module generates the normalized parameter by the following formula:

X_i = (X'_i - μ) / S_i,

where X_i is the normalized parameter, X'_i is the original operating parameter, μ is the parameter mean, and S_i is the parameter standard deviation.
Still further, the calculate-θ module sets the prediction function h_θ(X) to

h_θ(X) = θ^T X,

where, assuming the program needs n parameters, the regression parameter of the prediction function is θ = [θ_0 θ_1 … θ_n]^T, X = [1 X_1 … X_n]^T, and h is the predicted running time; a mean square error function is designed as

J(θ) = (1/2N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k))²,

where y^(k) is the recorded running time of the program at a certain stage and N is the number of recorded runs. For each parameter j = 0, 1, …, n, the following update is computed repeatedly until the above function converges:

θ_j := θ_j - α (1/N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k)) X_j^(k),

and θ is recorded in the XML file.
Further, the allocation module obtains the time consumed by reallocating the program to another node through the following formula:

t = m / v,

where m is the size of the file that the program needs to migrate and v is the average network speed.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A GPU/CPU scheduling method of a self-learning operation time prediction model is characterized by comprising the following steps:
step 1, preprocessing a source code, generating an operation state identifier corresponding to the source code and a parameter required by program operation, and storing the operation state identifier and the parameter in an XML file;
step 2, setting a prediction function, calculating a regression parameter theta of the prediction function according to the running time of the program at a certain stage returned by the running state identifier and the parameter set at the stage, and storing the regression parameter theta in an XML file;
and 3, when the program is called again, searching the XML file corresponding to the program, calculating a normalization parameter, substituting the normalization parameter into a prediction function, acquiring an operation time predicted value of the program in the current operation, acquiring transmission time consumed by the program to be redistributed to a GPU node or a CPU node, and if the transmission time is lower than the operation time predicted value, distributing the program to the CPU node.
2. The method as claimed in claim 1, wherein the preprocessing of the source code in step 1 comprises generating corresponding signals when the GPU and the CPU memory are exchanged, the corresponding signals comprising signals copied to the GPU memory and signals copied to the CPU memory.
3. The GPU/CPU scheduling method with the self-learning runtime prediction model as claimed in claim 1, wherein in said step 3 the normalization parameter is generated by the following formula:

X_i = (X'_i - μ) / S_i,

where X_i is the normalized parameter, X'_i is the original operating parameter, μ is the mean, and S_i is the standard deviation.
4. The method of claim 1, wherein in step 2 the prediction function h_θ(X) is set to

h_θ(X) = θ^T X,

where, assuming the program needs n parameters, the regression parameter of the prediction function is θ = [θ_0 θ_1 … θ_n]^T, X = [1 X_1 … X_n]^T, and h is the predicted running time; a mean square error function is designed as

J(θ) = (1/2N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k))²,

where y^(k) is the recorded running time of the program at a certain stage and N is the number of recorded runs; for each parameter j = 0, 1, …, n, the following update is computed repeatedly until the above function converges:

θ_j := θ_j - α (1/N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k)) X_j^(k),

and the finally obtained θ is recorded in an XML file.
6. A GPU/CPU scheduling system for self-learning runtime prediction models, comprising:
the initialization module is used for preprocessing the source code, generating an operation state identifier corresponding to the source code and parameters required by program operation, and storing the operation state identifier and the parameters in an XML file;
the calculation theta module is used for setting a prediction function, calculating a regression parameter theta of the prediction function according to the running time of the program at a certain stage returned by the running state identifier and the parameter set at the stage, and storing the regression parameter theta in an XML file;
and the distribution module is used for searching the XML file corresponding to the program when the program is called again, calculating the normalization parameter, substituting the normalization parameter into the prediction function, obtaining the running time predicted value of the program in the current running, obtaining the transmission time required by the program to be redistributed to the GPU node or the CPU node, and distributing the program to the CPU node if the transmission time is lower than the running time predicted value.
7. The self-learning runtime prediction model GPU/CPU scheduling system of claim 6, wherein the preprocessing of the source code in the initialization module includes generating corresponding signals during GPU/CPU memory exchange, the corresponding signals including a signal copied to GPU memory and a signal copied to CPU memory.
8. The self-learning runtime prediction model GPU/CPU scheduling system of claim 6, wherein the allocation module generates the normalization parameter by the following formula:

X_i = (X'_i - μ) / S_i,

where X_i is the normalized parameter, X'_i is the original operating parameter, μ is the parameter mean, and S_i is the parameter standard deviation.
9. The self-learning runtime prediction model GPU/CPU scheduling system of claim 6, wherein the calculate-θ module sets the prediction function h_θ(X) to

h_θ(X) = θ^T X,

where, assuming the program needs n parameters, the regression parameter of the prediction function is θ = [θ_0 θ_1 … θ_n]^T, X = [1 X_1 … X_n]^T, and h is the predicted running time; a mean square error function is designed as

J(θ) = (1/2N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k))²,

where y^(k) is the recorded running time of the program at a certain stage and N is the number of recorded runs; for each parameter j = 0, 1, …, n, the following update is computed repeatedly until the above function converges:

θ_j := θ_j - α (1/N) Σ_{k=1}^{N} (h_θ(X^(k)) - y^(k)) X_j^(k),

and θ is recorded in an XML file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611251972.3A CN106844024B (en) | 2016-12-30 | 2016-12-30 | GPU/CPU scheduling method and system of self-learning running time prediction model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611251972.3A CN106844024B (en) | 2016-12-30 | 2016-12-30 | GPU/CPU scheduling method and system of self-learning running time prediction model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106844024A CN106844024A (en) | 2017-06-13 |
CN106844024B true CN106844024B (en) | 2020-06-05 |
Family
ID=59114064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611251972.3A Active CN106844024B (en) | 2016-12-30 | 2016-12-30 | GPU/CPU scheduling method and system of self-learning running time prediction model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106844024B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796591B (en) * | 2019-09-25 | 2023-11-03 | 广东浪潮大数据研究有限公司 | GPU card using method and related equipment |
CN111522837B (en) * | 2020-04-23 | 2023-06-23 | 北京百度网讯科技有限公司 | Method and apparatus for determining time consumption of deep neural network |
CN116627433B (en) * | 2023-07-18 | 2024-01-09 | 鹏城实验室 | Real-time parameter prediction method, system, equipment and medium for AI processor |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605493A (en) * | 2013-11-29 | 2014-02-26 | 哈尔滨工业大学深圳研究生院 | Parallel sorting learning method and system based on graphics processing unit |
CN105468439A (en) * | 2015-11-19 | 2016-04-06 | 华东师范大学 | Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework |
-
2016
- 2016-12-30 CN CN201611251972.3A patent/CN106844024B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605493A (en) * | 2013-11-29 | 2014-02-26 | 哈尔滨工业大学深圳研究生院 | Parallel sorting learning method and system based on graphics processing unit |
CN105468439A (en) * | 2015-11-19 | 2016-04-06 | 华东师范大学 | Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework |
Non-Patent Citations (1)
Title |
---|
Self-learning load-balancing scheduling algorithm for GPU heterogeneous clusters; Liu Hui et al.; Journal of Xi'an Shiyou University; 20150531; Vol. 30 (No. 3); pp. 105-110 *
Also Published As
Publication number | Publication date |
---|---|
CN106844024A (en) | 2017-06-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105956021B (en) | A kind of automation task suitable for distributed machines study parallel method and its system | |
CN106776005B (en) | Resource management system and method for containerized application | |
CN110515739B (en) | Deep learning neural network model load calculation method, device, equipment and medium | |
US12106154B2 (en) | Serverless computing architecture for artificial intelligence workloads on edge for dynamic reconfiguration of workloads and enhanced resource utilization | |
US20240111586A1 (en) | Multi-policy intelligent scheduling method and apparatus oriented to heterogeneous computing power | |
CN111258744A (en) | Task processing method based on heterogeneous computation and software and hardware framework system | |
JP2020537784A (en) | Machine learning runtime library for neural network acceleration | |
CN104050042B (en) | The resource allocation methods and device of ETL operations | |
US20120233486A1 (en) | Load balancing on heterogeneous processing clusters implementing parallel execution | |
CN111079921A (en) | Efficient neural network training and scheduling method based on heterogeneous distributed system | |
CN114741207B (en) | GPU resource scheduling method and system based on multi-dimensional combination parallelism | |
CN112328378A (en) | Task scheduling method, computer device and storage medium | |
US11544113B2 (en) | Task scheduling for machine-learning workloads | |
CN103401939A (en) | Load balancing method adopting mixing scheduling strategy | |
Wang et al. | An efficient and non-intrusive GPU scheduling framework for deep learning training systems | |
US20210390405A1 (en) | Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof | |
CN104243617A (en) | Task scheduling method and system facing mixed load in heterogeneous cluster | |
CN113391918A (en) | Method, apparatus and computer program product for processing a computing job | |
Song et al. | Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small | |
US20210319298A1 (en) | Compute-based subgraph partitioning of deep learning models for framework integration | |
CN113296905A (en) | Scheduling method, scheduling device, electronic equipment, storage medium and software product | |
CN106844024B (en) | GPU/CPU scheduling method and system of self-learning running time prediction model | |
US20110131554A1 (en) | Application generation system, method, and program product | |
Wang et al. | Lube: Mitigating bottlenecks in wide area data analytics | |
CN112905317A (en) | Task scheduling method and system under rapid reconfigurable signal processing heterogeneous platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |