CN112035238B - Task scheduling processing method and device, cluster system and readable storage medium - Google Patents
- Publication number
- CN112035238B (application CN202010957856.3A)
- Authority
- CN
- China
- Prior art keywords
- task
- job
- environment
- executing
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a task scheduling processing method, a task scheduling processing device, a cluster system, and a readable storage medium, and relates to the technical field of cluster task processing. The method comprises the following steps: acquiring a job task sent by a scheduling node in a cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters; determining the task type of the job task according to the identifier characterizing the task type in the job task; invoking a preprocessing component corresponding to the task type and initializing a task environment to obtain a running environment for executing the HPC task or the AI task; and executing the job task through the running environment according to the task content of the job task to obtain an execution result. This can solve the problems that the task type executed by a computing node is single and the hardware resource utilization rate is low.
Description
Technical Field
The present invention relates to the field of clustered task processing technologies, and in particular, to a task scheduling processing method, device, clustered system, and readable storage medium.
Background
With the development of computer cluster processing technology, supercomputer performance keeps increasing. Cluster systems typically need to support the computation of both high performance computing (High Performance Computing, HPC) tasks and artificial intelligence (Artificial Intelligence, AI) tasks. Currently, the hardware resources of a cluster system are generally divided into small clusters or computing nodes oriented to different fields, and each small cluster or computing node performs a single type of task. For example, a small cluster used to perform HPC tasks cannot perform AI tasks, resulting in low utilization of the cluster's hardware resources.
Disclosure of Invention
The application provides a task scheduling processing method, a task scheduling processing device, a cluster system, and a readable storage medium, which can solve the problems that computing nodes in a cluster execute only a single task type and that hardware resource utilization is low.
In order to achieve the above object, the technical solution provided by the embodiments of the present application is as follows:
In a first aspect, an embodiment of the present application provides a task scheduling processing method, applied to a computing node in a cluster system, where the method includes:
acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters;
determining the task type of the job task according to the identifier characterizing the task type in the job task;
invoking a preprocessing component corresponding to the task type, and initializing a task environment to obtain an operating environment for executing the HPC task or the AI task;
and executing the job task through the running environment according to the task content of the job task to obtain an execution result.
In the above embodiment, the computing node may preprocess the task environment according to the task type to obtain a running environment for executing the HPC task or the AI task, and may then execute the HPC task or the AI task based on the obtained running environment, which solves the problems that the task type executed by the computing node is single and the hardware resource utilization rate is low.
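The first-aspect flow can be sketched as follows. This is a purely illustrative sketch, not the patent's implementation: the identifier values, component names, and environment objects are all assumptions.

```python
# Minimal sketch of the first-aspect flow on a computing node.
# Identifier values and component names are illustrative assumptions.

AI_TASK, HPC_TASK = "AI", "HPC"  # assumed first / second identifiers

def preprocess_hpc(task):
    # Hypothetical HPC preprocessing: load modules, set up MPI environment, etc.
    return {"type": HPC_TASK, "ready": True}

def preprocess_ai(task):
    # Hypothetical AI preprocessing: select framework, accelerator, container.
    return {"type": AI_TASK, "ready": True}

# Association between the task-type identifier and its preprocessing component.
PREPROCESSORS = {HPC_TASK: preprocess_hpc, AI_TASK: preprocess_ai}

def handle_job(job):
    task_type = job["type_id"]           # read the characterizing identifier
    env = PREPROCESSORS[task_type](job)  # invoke the matching preprocessor
    assert env["ready"]                  # running environment initialized
    return f"executed {task_type} task: {job['content']}"  # execute the job

result = handle_job({"type_id": "AI", "content": "train model"})
```

The dictionary dispatch mirrors the stored association between task types and preprocessing components described later in step S150.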
With reference to the first aspect, in some optional embodiments, invoking a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an execution environment for executing the HPC task or the AI task, including:
When the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, and initializing a task environment to obtain an operation environment for executing the HPC task;
When the job task is an AI task, calling a preprocessing component corresponding to the AI task, and initializing a task environment to obtain an operation environment for executing the AI task.
In the above embodiment, for the HPC task and the AI task, the task environments are preprocessed respectively to obtain the corresponding operation environments, so that the computing node can execute the job tasks with different task types.
With reference to the first aspect, in some optional embodiments, the preprocessing component includes a general processing component and an AI framework processing component, calls the preprocessing component corresponding to the AI task, initializes a task environment to obtain an execution environment for executing the AI task, and includes:
invoking the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
invoking the AI frame processing component, and selecting a processing frame and an accelerator corresponding to the AI task;
And creating a container for executing the subtasks according to the target hardware resources, the processing framework and the accelerator to obtain an operation environment for executing the AI tasks.
In the above-described embodiments, the computing node is enabled to perform AI tasks by creating a container and a runtime environment for performing AI tasks.
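The container-creation step above can be illustrated as assembling a container specification from the selected target hardware, processing framework, and accelerator. The field names and image naming scheme are assumptions; the patent does not specify a concrete format.

```python
# Illustrative sketch: building a container specification for an AI subtask
# from the selected hardware, processing framework, and accelerator.
# All field names and the image naming scheme are assumptions.

def build_container_spec(hardware, framework, accelerator):
    return {
        "image": f"{framework.lower()}-runtime",  # hypothetical image name
        "gpus": hardware["gpu_ids"],              # target GPUs for the subtask
        "cpu_cores": hardware["core_ids"],        # pinned CPU cores
        "accelerator": accelerator,               # selected accelerator
    }

spec = build_container_spec(
    {"gpu_ids": [0, 1], "core_ids": list(range(8))},
    "TensorFlow",
    "nccl",
)
```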
With reference to the first aspect, in some optional embodiments, the processing framework comprises a DL framework.
With reference to the first aspect, in some optional embodiments, the method further includes:
clearing the association relation of the target hardware resources corresponding to the job task, and deleting the container.
In the above embodiment, after the execution result is obtained, the association relation, the container, and the like are deleted, which facilitates the execution of a new task by the computing node and prevents the running environment of the current job task from affecting the execution of the new task.
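The post-execution cleanup described above can be sketched as releasing the hardware association and removing the container so a new task starts from a clean state. The in-memory registry below is an assumed stand-in for whatever bookkeeping a real node would use.

```python
# Sketch of post-execution cleanup: clear the job-to-hardware association
# and remove the container. The registry structure is an assumption.

allocations = {"job-42": {"container": "c1", "gpus": [0, 1]}}
free_gpus = set()

def cleanup(job_id):
    alloc = allocations.pop(job_id)  # clear the job -> hardware association
    free_gpus.update(alloc["gpus"])  # return GPUs to the free pool
    return f"removed container {alloc['container']}"

msg = cleanup("job-42")
```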
With reference to the first aspect, in some optional embodiments, acquiring a job task sent by a scheduling node in the cluster system includes:
acquiring the job task sent by the HPC scheduler of the scheduling node in the cluster system.
In the above embodiment, the HPC scheduler may schedule both AI tasks and HPC tasks, thereby overcoming the limitation that an HPC scheduler can only schedule HPC tasks.
In a second aspect, an embodiment of the present application further provides a task scheduling processing method, which is applied to a cluster system, where the cluster system includes a submitting node, a scheduling node, and a plurality of computing nodes, and the method includes:
The submitting node generates a job task according to the task parameters, wherein the job task comprises an HPC task or an AI task;
the scheduling node acquires the job task from the submitting node;
the scheduling node determines a computing node matched with the task parameters of the job task from a plurality of computing nodes as a target computing node;
The target computing node determines the task type of the job task according to the identifier characterizing the task type in the job task;
The target computing node invokes a preprocessing component corresponding to the task type, initializes a task environment and obtains an operating environment for executing the HPC task or the AI task;
And the target computing node executes the job task through the running environment according to the task content of the job task to obtain an execution result.
In a third aspect, an embodiment of the present application further provides a task scheduling processing device, which is applied to a computing node in a cluster system, where the device includes:
an acquisition unit, configured to acquire a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters;
a determining unit, configured to determine the task type of the job task according to the identifier characterizing the task type in the job task;
a preprocessing unit, configured to invoke a preprocessing component corresponding to the task type and initialize a task environment to obtain a running environment for executing the HPC task or the AI task;
and an execution unit, configured to execute the job task through the running environment according to the task content of the job task to obtain an execution result.
In a fourth aspect, an embodiment of the present application further provides a server, where the server includes a memory and a processor coupled to each other, and the memory stores a computer program, where the computer program, when executed by the processor, causes the server to perform the method described above.
In a fifth aspect, an embodiment of the present application further provides a cluster system, where the cluster system includes a submitting node, a scheduling node, and a plurality of computing nodes, where:
the submitting node is used for generating a job task according to the task parameters, wherein the job task comprises an HPC task or an AI task;
the scheduling node is used for acquiring the job task from the submitting node;
the scheduling node is further used for determining a computing node matched with the task parameters of the job task from a plurality of computing nodes as a target computing node;
The target computing node is used for determining the task type of the job task according to the identifier characterizing the task type in the job task;
The target computing node is also used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the target computing node is also used for executing the job task through the running environment according to the task content of the job task to obtain an execution result.
In a sixth aspect, embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the above-described method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It is to be understood that the following drawings illustrate only certain embodiments of the application and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other relevant drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of the communication connections of a cluster system according to an embodiment of the present application.
Fig. 2 is a block diagram of hardware resources of a computing node according to an embodiment of the present application.
Fig. 3 is a schematic flow chart of a task scheduling processing method according to an embodiment of the present application.
Fig. 4 is a second flowchart of a task scheduling processing method according to an embodiment of the present application.
Fig. 5 is a functional block diagram of a task scheduling processing device according to an embodiment of the present application.
Reference numerals: 10-cluster system; 20-computing node; 30-scheduling node; 40-submitting node; 300-task scheduling processing device; 310-acquisition unit; 320-determination unit; 330-preprocessing unit; 340-execution unit.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. It should be noted that the terms "first," "second," and the like are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
The applicant has found that the hardware resources of current cluster systems generally need to be divided into small clusters oriented to different domains, where a small cluster typically includes one or more computing nodes. Generally, the execution environments required for tasks in different domains differ, so each small cluster is usually only capable of executing tasks of its assigned domain and cannot execute tasks of other domains. For example, a small cluster for performing HPC tasks cannot perform AI tasks. Therefore, in current cluster systems, the task type executed by a computing node is single and hardware resources are underutilized.
In view of the above problems, the present applicant has long studied and studied to solve the above problems, and the following examples are set forth. Embodiments of the present application will be described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
First embodiment
Referring to fig. 1, an embodiment of the present application provides a cluster system 10, which can be used to execute each step in the task scheduling processing method described below, so as to solve the problem that the task type executed by a computing node 20 is single, and thus the hardware resources cannot be fully utilized.
In this embodiment, cluster system 10 may include a submitting node 40, a scheduling node 30, and a plurality of computing nodes 20. Each node (e.g., submitting node 40, scheduling node 30, computing node 20) in cluster system 10 is a server. One node may operate in at least one of the roles of submitting node 40, scheduling node 30, and computing node 20. For example, a node may operate as submitting node 40 and, in addition, as scheduling node 30 or computing node 20. In general, however, submitting node 40, scheduling node 30, and computing node 20 are distinct nodes.
In this embodiment, the submitting node 40 may establish a communication connection with the user terminal through a network for data interaction. The submitting node 40 may establish a communication connection with the scheduling node 30 over a network for data interaction. Scheduling node 30 may establish a communication connection with one or more computing nodes 20 over a network for data interaction.
For example, the user terminal may send information about the job task that needs to be performed to the submitting node 40. Submitting node 40 may generate a script file for the job task based on the relevant information of the job task. The script file is the job task in a form the computer can "understand". In addition, submitting node 40 may send the script file of the job task to scheduling node 30. Scheduling node 30 may send the script file to the corresponding target computing node 20, and the job task corresponding to the script file is then executed by target computing node 20. The target computing node 20 may be one or more computing nodes 20.
The user terminal may be, but is not limited to, a smartphone, a personal computer (Personal Computer, PC), a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a mobile internet device (Mobile Internet Device, MID), etc. The network may be, but is not limited to, a wired network or a wireless network.
Referring to fig. 2, in the present embodiment, the hardware resources included in the computing node 20 include, but are not limited to, a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), and a memory. It will be appreciated that a CPU may be provided with one or more cores, and the number of cores included in the processor may be set according to the actual situation. For example, the CPU may be a single-core processor, or a dual-core processor.
In a computing node 20, the number of cores and graphics processors may be set according to the actual situation. As one example, computing node 20 may include N central processors, M graphics processors as shown in fig. 2. N, M are integers greater than 2, which may be the same or different and may be set according to practical situations. The hardware resources of different computing nodes 20 may be the same or different, and may be set according to actual situations. For example, the number of cores, the number of graphics processors, the operating parameters of the cores, the operating parameters of the graphics processors, may all be different for different compute nodes 20.
Referring to fig. 3, the embodiment of the present application further provides a task scheduling processing method, which can be applied to the above-mentioned cluster system 10, and the corresponding nodes in the cluster system 10 cooperate with each other to execute each step in the method. The method may comprise the steps of:
Step S110, the submitting node generates a job task according to task parameters, wherein the job task comprises an HPC task or an AI task;
step S120, a scheduling node acquires the job task from the submitting node;
step S130, the scheduling node determines a computing node matched with the task parameters of the job task from a plurality of computing nodes as a target computing node;
Step S140, the target computing node determines the task type of the job task according to the identifier characterizing the task type in the job task;
Step S150, the target computing node invokes a preprocessing component corresponding to the task type, and initializes a task environment to obtain an operation environment for executing the HPC task or the AI task;
step S160, the target computing node executes the job task through the operation environment according to the task content of the job task, so as to obtain an execution result.
In this embodiment, the computing node may preprocess the task environment according to the task type to obtain a running environment for executing the HPC task or the AI task, and may then execute the HPC task or the AI task based on the obtained running environment, which solves the problems that the task type executed by the computing node is single and the hardware resource utilization rate is low.
The steps in the method will be described in detail as follows:
In step S110, after acquiring the task parameters, the submitting node may automatically generate a job script according to the task parameters. The job script is the job task in a form the computer can "understand". If the task parameters include a first identifier characterizing an AI task, the submitting node may generate an AI task based on the task parameters. If the task parameters include a second identifier characterizing an HPC task, the submitting node may generate an HPC task based on the task parameters. The first identifier and the second identifier are different and may be numbers or characters, so that AI tasks and HPC tasks can be distinguished; both may be set according to the actual situation. In addition, the job task generated by the submitting node includes an identifier of its task type, so that the computing node can execute job tasks of different types accordingly. For example, an AI task may include the first identifier characterizing it as an AI task, and an HPC task may include the second identifier characterizing it as an HPC task.
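A toy rendering of step S110 might look as follows, with the submitting node turning task parameters into a job script that embeds the characterizing identifier. The script format and identifier values are assumptions for illustration only.

```python
# Toy sketch of step S110: render task parameters into a job script that
# carries the task-type identifier. Format and values are assumptions.

def render_job_script(params):
    header = f"#TASK_TYPE={params['type_id']}"  # characterizing identifier
    body = f"run {params['content']}"           # task content
    return "\n".join([header, body])

script = render_job_script({"type_id": "HPC", "content": "weather_model"})
```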
In this embodiment, the submitting node may obtain the task parameters from the user terminal. The format of the task parameters submitted by the user terminal may be a specified format, for example, the JSON format, so that the submitting node can read each sub-parameter in the task parameters. The task parameters are usually parameters uploaded to the submitting node by the user terminal according to actual demands and can be set according to the actual situation. The task parameters may include, but are not limited to, an identifier characterizing the task type, the hardware requirements needed to perform the task (e.g., the number of cores needed to perform the task, the rated clock frequency at which the cores/CPUs run, the number of GPUs, the rated clock frequency at which the GPUs run), user information, task content, environment variables, etc. For example, if the job task is an AI task, the task parameters of the AI task include, but are not limited to, user information, processing framework, image file, hardware requirements required to perform the task, DL (Deep Learning) parameters, and the like.
The image file may be understood as a file formed from the data of the task parameters and may serve as a backup of the task parameters. The processing framework may include a DL framework or other frameworks. Processing frameworks and DL parameters are well known to those skilled in the art. For example, the processing framework may be, but is not limited to, TensorFlow, PyTorch, MXNet, Caffe, Keras, or other frameworks known to those skilled in the art. DL parameters include, but are not limited to, learning rate, threshold, etc.
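As an illustration of the specified JSON format mentioned above, a task-parameter payload for an AI task might look like the following. Every field name here is an assumption; the patent does not define a schema.

```python
import json

# Hypothetical task-parameter payload in the specified JSON format.
# All field names are assumptions for illustration.
payload = """
{
  "type_id": "AI",
  "user": "alice",
  "framework": "PyTorch",
  "hardware": {"cores": 16, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 8},
  "dl_params": {"learning_rate": 0.001, "threshold": 0.5}
}
"""

params = json.loads(payload)          # submitting node reads sub-parameters
framework = params["framework"]
lr = params["dl_params"]["learning_rate"]
```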
In step S120, the scheduling node may automatically obtain the job tasks generated by the submitting node. For example, the scheduling node may obtain, every preset time period, the job tasks generated within that period; the preset time period may be set according to the actual situation, for example, 1 minute, 10 minutes, or 1 hour. Alternatively, the submitting node may automatically send the generated job task to the scheduling node, so that the scheduling node obtains the job task. It is to be understood that the manner in which the scheduling node acquires job tasks may be set according to the actual situation and is not specifically limited herein.
In step S130, the scheduling node may select, according to the current operation condition of each computing node in the cluster system and in combination with the hardware requirement information required for executing the task carried by the job task, one or more computing nodes capable of meeting the hardware requirement for executing the current job task from multiple computing nodes as target computing nodes, and then send the job task to the target computing nodes.
It will be appreciated that the hardware capabilities of the selected target computing node are capable of meeting the requirements for performing the job task. That is, the parameters of the various hardware resources of the target computing node are all greater than or equal to the parameters of the various hardware resources characterized by the hardware requirements needed to perform the job task.
In this embodiment, the scheduling node may acquire the operation parameters of each computing node in the cluster system in real time, or acquire the operation parameters of each computing node in the cluster system when receiving the job task. The operating parameters include total hardware resource information and idle hardware resource information of each node. The total hardware resource information includes, but is not limited to, the number of CPUs included in the node, the number of cores of each CPU, the rated clock frequency of each CPU in operation, the number of GPUs, the rated clock frequency of each GPU in operation, the total capacity of the memory, the identity of the cores, the identity of the GPUs, and the like. The idle hardware resource information includes, but is not limited to, an identity of a CPU that does not perform a job task, an identity of a core of the CPU that does not perform a job task, a remaining capacity of a memory, and the like. Wherein, the larger the rated clock frequency is, the stronger the operation capability of the CPU or GPU is.
Referring to fig. 2 again, assume that a cluster system includes a computing node A and a computing node B. Computing node A includes 8 CPUs, each CPU includes 8 cores, the rated operating frequency (main frequency) of each CPU is 4.0 GHz, 4 GPUs are provided, the video memory of each GPU is 8 GB, and the rated operating frequency of each GPU is 1500 MHz. Computing node B includes 8 CPUs, each CPU includes 4 cores, the rated operating frequency (main frequency) of each CPU is 4.0 GHz, the video memory of each GPU is 4 GB, and the rated operating frequency of each GPU is 1000 MHz. If the current task is an AI task, the hardware requirements for executing the AI task include: at least 16 cores, a CPU/core main frequency of not less than 4.0 GHz, at least 4 GPUs, GPU video memory of not less than 8 GB, and a GPU operating frequency of not less than 1000 MHz. Because computing node A meets the hardware requirements for executing the task and computing node B does not, the scheduling node may select computing node A as the target computing node based on the hardware requirements required for executing the task, and then send the job task to the target computing node.
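The selection rule above, where every hardware parameter of a node must meet or exceed the task's requirement, can be sketched as follows. The node figures follow the worked example; node B's GPU count is not stated in the text and is an assumption here.

```python
# Sketch of the scheduling-node matching rule: a node qualifies as target
# only if every hardware parameter meets or exceeds the task's requirement.
# Figures follow the worked example; node B's GPU count is an assumption.

node_a = {"cores": 64, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 8, "gpu_mhz": 1500}
node_b = {"cores": 32, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 4, "gpu_mhz": 1000}
requirement = {"cores": 16, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 8, "gpu_mhz": 1000}

def meets(node, req):
    # Every required parameter must be satisfied (>=), not just most of them.
    return all(node[k] >= v for k, v in req.items())

targets = [name for name, node in (("A", node_a), ("B", node_b))
           if meets(node, requirement)]
```

Node B fails only on GPU video memory (4 GB < 8 GB), which is enough to exclude it, matching the example's conclusion.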
In step S140, the target computing node may determine a task type of the job task according to the identifier carried in the job task. For example, if the identifier of the job task is the first identifier for characterizing the AI task, the job task is determined to be an AI task, and the task type is an AI class. If the identification of the job task is the second identification representing the HPC task, determining that the job task is the HPC task, and the task type is the HPC class.
In step S150, the target computing node may store the association relationship between the preprocessing component and the task type in advance. That is, the preprocessing component of the AI task is associated with the identification of the AI class and the preprocessing component of the HPC task is associated with the identification of the HPC class. After determining the task type of the job task, the target computing node may automatically select a preprocessing component corresponding to the task type according to the identification of the task type. And then, running the preprocessing component and initializing the task environment to obtain the running environment for executing the current job task.
In step S160, after obtaining the execution environment for executing the current job task, the target computing node may execute the job task through the execution environment, thereby obtaining an execution result. The process of executing the job task by the computing node is well known to those skilled in the art, and will not be described herein. The execution result corresponds to the execution task and can be determined according to the actual situation. For example, the purpose of the HPC task is to create a weather forecast model, and the result of the execution is a weather forecast model. The purpose of the AI task is to create a face recognition model, and the execution result obtained is a face recognition model.
If there are multiple target computing nodes, the target computing nodes may negotiate with each other to subdivide the job task into multiple subtasks, and each target computing node then executes its corresponding subtask. The subdivision and negotiation process of the job task is well known to those skilled in the art and is not described herein.
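One simple way the subdivision could work is an even round-robin split of work items across the target nodes. This split strategy is an assumption; the patent leaves the negotiation protocol to the implementer.

```python
# Sketch of splitting a job into subtasks across several target nodes.
# The round-robin split is an assumption; the patent does not fix a protocol.

def split_job(work_items, nodes):
    plan = {node: [] for node in nodes}
    for i, item in enumerate(work_items):
        plan[nodes[i % len(nodes)]].append(item)  # round-robin assignment
    return plan

plan = split_job(["shard0", "shard1", "shard2", "shard3"], ["nodeA", "nodeB"])
```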
As an alternative embodiment, step S110 may further include: acquiring the job task sent by the HPC scheduler of the scheduling node in the cluster system.
Understandably, in the present embodiment, the HPC scheduler may have a function of scheduling HPC tasks and AI tasks. After the submitting node generates the job task according to the job parameters, the scheduling node can select a corresponding target computing node according to the task type of the job task and the hardware requirement required by executing the task, so that the task scheduling is realized, and the problem that the HPC scheduler cannot schedule the AI task is solved.
The HPC scheduler may be, but is not limited to, an LSF (Load Sharing Facility) or Slurm scheduler. Slurm is an open-source job scheduler for Linux and Unix-like kernels that can be used on computer clusters.
As an alternative embodiment, step S150 may include:
When the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, and initializing a task environment to obtain an operation environment for executing the HPC task;
When the job task is an AI task, calling a preprocessing component corresponding to the AI task, and initializing a task environment to obtain an operation environment for executing the AI task.
It is appreciated that the target computing node may select the corresponding preprocessing component based on the particular task type of the job task. If the job task is an HPC task, the target computing node calls a preprocessing component corresponding to the HPC task, and initializes the task environment through the preprocessing component to obtain an operating environment for executing the HPC task. If the job task is an AI task, the target computing node calls a preprocessing component corresponding to the AI task, initializes a task environment, and obtains an operation environment for executing the AI task. Based on the method, the target computing node can build an operation environment corresponding to the task type according to the task type of the job task, so that the AI task and the HPC task can be executed, and the problem that the computing node can only execute a single type of task is solved.
In this embodiment, the preprocessing component may generally comprise multiple classes of components, each of which is used to build a corresponding part of the task environment. When the preprocessing components are run, the components cooperate with one another to build the running environment for executing the current job task.
As an alternative embodiment, when the task is an AI task, the preprocessing component includes a general purpose processing component, an AI framework processing component. Step S150 may further include:
invoking the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
invoking the AI frame processing component, and selecting a processing frame and an accelerator corresponding to the AI task;
And creating a container for executing the subtasks according to the target hardware resources, the processing framework and the accelerator to obtain an operation environment for executing the AI tasks.
It is understood that the target computing node may divide the job task into a plurality of subtasks; the manner of dividing subtasks is well known to those skilled in the art and is not described here. When the job task is an AI task, the computing node may, by calling the general processing component, analyze the task parameters (such as the parameters of the hardware resources required for executing the task), environment variables, collected user information, user group files, and the like in the job task, and then select, according to the amount of computation required by each subtask, the hardware resources needed to execute that subtask from the hardware resources of all target computing nodes as the subtask's target hardware resources. The target hardware resources include, but are not limited to, the identity of the target computing node, the identity of the CPU, the identity of the kernel, and the identity of the GPU. The environment variables may be determined according to the actual situation; for example, they may be parameters of the computing node's operating-system running environment, such as the temporary folder location and the system folder location.
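As a hedged sketch of this selection step, the following greedy routine places each subtask on the node with the most free cores. The data shapes and the greedy policy are assumptions for illustration; the patent only requires that resources be chosen according to each subtask's computation load.

```python
# Hypothetical sketch of selecting target hardware resources per subtask:
# each subtask is placed on the node that currently has the most free cores.
def select_resources(subtasks, nodes):
    """subtasks: {subtask_name: cores_needed}; nodes: {node_id: free_cores}."""
    free = dict(nodes)  # work on a copy so the input is not mutated
    placement = {}
    for name, need in subtasks.items():
        node = max(free, key=free.get)  # node with the most free cores
        if free[node] < need:
            raise RuntimeError(f"no node can host subtask {name}")
        free[node] -= need
        placement[name] = {"node": node, "cores": need}
    return placement
```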
The computing node can call the AI framework processing component to select the processing framework and the accelerator corresponding to the AI task. Understandably, the AI task may carry information about the processing framework and the accelerator required for executing it. For example, the AI task may carry information characterizing that its processing framework is TensorFlow and that the required accelerator is an Nvidia accelerator. Of course, the accelerator may also be of another type, such as an AMD accelerator; the type of accelerator is not specifically limited here.
In order to facilitate understanding of the process of implementing preprocessing by a computing node, the following will illustrate the implementation process of the computing node to obtain a corresponding operating environment by performing preprocessing:
When the target computing node receives the job task sent by the scheduling node, it starts by executing the Prolog and then detects the task type of the job task. Here, Prolog does not refer to the logic programming language: in scheduler terminology, the Prolog can be understood as the preamble script run before a job, and the Epilog as the closing script run after the job completes. The scheduler arranges for Prolog code to run at the beginning of each job and Epilog code at its end.
When the job task is detected to be the AI task, the universal Prolog of the AI task is executed, and the universal processing component of the AI task and the AI framework processing component are called. When the job task is detected as an HPC task, a pre-processing component of the HPC task may be directly invoked.
The execution process of calling the general processing component and the AI framework processing component of the AI task may be as follows. The general processing component acquires the task content/task parameters, environment variables, user information, user group files, and the like of the job task; it then selects the corresponding hardware resources as the target hardware resources for each subtask of the AI task, and selects the accelerator type (Nvidia or AMD) and the DL framework according to the task content. Next, the AI framework processing component allocates, for each subtask of the AI task, the hardware resources required to create a container for executing that subtask; the hardware resources required for creating the container are the subtask's target hardware resources. Finally, a container for executing the subtask is created according to the selected target hardware resources, processing framework, and accelerator, and the container's information is recorded, for example, the mapping between the container and the subtask and between the container and the target hardware resources. At this point, the running environment for executing the AI task has been created.
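The container-creation step above could be realized, for instance, by assembling a `docker run` invocation from the selected resources. The sketch below is an assumption-laden illustration: the image names, the `--cpuset-cpus`/`--gpus` flags, and the resource dictionary layout are illustrative choices, not part of the patent.

```python
# Hypothetical: build the command that would create a container for one AI
# subtask from its target hardware resources, DL framework, and accelerator.
def container_command(subtask_id, resources, framework, accelerator):
    """resources: {'cores': [int, ...], 'gpu': int} -- assumed layout."""
    image = {"tensorflow": "tensorflow/tensorflow:latest",
             "pytorch": "pytorch/pytorch:latest"}[framework]
    cmd = ["docker", "run", "--name", f"subtask-{subtask_id}",
           "--cpuset-cpus", ",".join(str(c) for c in resources["cores"])]
    if accelerator == "nvidia":
        cmd += ["--gpus", f"device={resources['gpu']}"]  # pin the GPU
    cmd.append(image)
    return cmd
```

Recording the returned command alongside the subtask id would provide the container-to-subtask mapping mentioned above.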
When the job task is an HPC task, the compute node may directly invoke a pre-processing component of the HPC task to change the current task environment of the compute node to a runtime environment capable of executing the HPC task.
In this embodiment, the default task environment of the computing node may be a running environment capable of executing HPC tasks. When the job task is an HPC task, if the task environment is not the running environment for executing the HPC, invoking a preprocessing component of the HPC task to restore the task environment to a default running environment.
As an alternative embodiment, the method may further comprise: and clearing the association relation of the target hardware resources corresponding to the job task and the container.
Understandably, after step S140, the computing node may further clear the association relationship between the job task and its target hardware resources, the container, the environment variables, and the temporary files and temporary data generated while executing the job task. Clearing the association relationship, the container, and similar data facilitates the execution of new tasks by the computing node: the task environment is restored to its state before the task was executed, so the running environment of the current task does not affect the execution of a new task.
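A post-processing (Epilog-style) cleanup of this kind might look like the following sketch. The state dictionary layout and function name are assumptions used for illustration.

```python
# Hypothetical post-processing sketch: after the job finishes, the node
# removes the container record, drops the resource association, and deletes
# temporary data, restoring the environment for the next job.
import os
import shutil
import tempfile

def cleanup(node_state, job_id):
    node_state["containers"].pop(job_id, None)      # remove container record
    node_state["resource_map"].pop(job_id, None)    # drop resource association
    tmp = node_state["tmp_dirs"].pop(job_id, None)  # delete temp data, if any
    if tmp and os.path.isdir(tmp):
        shutil.rmtree(tmp)
    return node_state
```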
After the execution result is obtained, the cluster system may store it, or the computing node may send it to the user terminal so that the user can view it. Alternatively, the computing node sends the execution result to the submitting node through the scheduling node, and the submitting node then sends it to the user terminal.
Based on this design, the hardware resources of the cluster system are shared, and the same computing node can simultaneously bear multiple kinds of tasks, such as high-performance computing and artificial intelligence, thereby improving the utilization rate of the hardware resources. Uniformly allocating the hardware resources of AI tasks avoids the situation in which an AI distributed task occupies part of the hardware resources yet cannot run, wasting those resources. In addition, the creation and destruction of containers can be completed through the preprocessing and post-processing of the computing nodes based on the HPC scheduler, realizing container arrangement and scheduling and thus the HPC scheduler's support for AI tasks. Fused scheduling of HPC tasks and AI tasks is achieved while retaining the flexibility, speed, and convenience of containers.
Second embodiment
Referring to fig. 4, the present application further provides another task scheduling processing method, which can be applied to computing nodes in a cluster system. The method may comprise the steps of:
step S210, acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters;
Step S220, determining the task type of the job task according to the identification of the characterization task type in the job task;
step S230, calling a preprocessing component corresponding to the task type, and initializing a task environment to obtain an operation environment for executing the HPC task or the AI task;
And step S240, executing the job task through the running environment according to the task content of the job task to obtain an execution result.
As can be appreciated, compared with the task scheduling processing method in the first embodiment, the implementation procedure and technical effects of the method in the second embodiment are similar; the difference is that the method in the second embodiment is applied to a computing node, and each step in the method is performed by the computing node. Of course, the task scheduling processing method in the second embodiment may further include other steps, for example, the other steps performed by the computing node in the first embodiment, which are not described here. The computing node executing the task scheduling processing method is the target computing node determined by the scheduling node.
Referring to fig. 5, an embodiment of the present application further provides a task scheduling processing device 300, which may be applied to a computing node in a cluster system and configured to execute the steps performed by the computing node. The task scheduling processing device 300 includes at least one software functional module that may be stored in a storage module in the form of software or firmware, or solidified in the server operating system (OS). The processing module is configured to execute the executable modules stored in the storage module, such as the software functional modules and computer programs included in the task scheduling processing device 300.
The task scheduling processing device 300 may include an acquisition unit 310, a determination unit 320, a preprocessing unit 330, and an execution unit 340.
An acquiring unit 310, configured to acquire a job task sent by a scheduling node in the cluster system, where the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters.
A determining unit 320, configured to determine a task type of the job task according to the identifier of the characterization task type in the job task.
And the preprocessing unit 330 is configured to invoke a preprocessing component corresponding to the task type, initialize a task environment, and obtain an execution environment for executing the HPC task or the AI task.
And the execution unit 340 is configured to execute the job task through the running environment according to the task content of the job task, so as to obtain an execution result.
Optionally, the preprocessing unit 330 is configured to: when the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, and initializing a task environment to obtain an operation environment for executing the HPC task; when the job task is an AI task, calling a preprocessing component corresponding to the AI task, and initializing a task environment to obtain an operation environment for executing the AI task.
Optionally, the preprocessing component includes a general purpose processing component and an AI framework processing component. The preprocessing unit 330 is further configured to: invoking the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task; invoking the AI frame processing component, and selecting a processing frame and an accelerator corresponding to the AI task; and creating a container for executing the subtasks according to the target hardware resources, the processing framework and the accelerator to obtain an operation environment for executing the AI tasks.
Optionally, the task scheduling processing device 300 may further include a clearing unit, configured to clear the association relationship of the target hardware resource corresponding to the job task and the container.
Alternatively, the obtaining unit 310 is configured to: and acquiring the job task sent by the HPC scheduler of the scheduling node in the cluster system.
It should be noted that, for convenience and brevity of description, specific working processes of the cluster system, the task scheduling processing device 300, and the computing node described above may refer to corresponding processes of each step in the foregoing method, and will not be described in detail herein.
In this embodiment, the servers (e.g., computing nodes) in the cluster system may include a processing module, a communication module, a storage module, and a task scheduling processing device 300, where the processing module, the communication module, the storage module, and the task scheduling processing device 300 are directly or indirectly electrically connected to each other to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The processing module may be an integrated circuit chip with signal processing capabilities, and may be a general-purpose processor. For example, the processor may be a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a network processor (Network Processor, NP), or the like. The disclosed methods, steps, and logic diagrams of the embodiments of the present application may also be implemented or performed with a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The memory module may be, but is not limited to, random access memory, read only memory, programmable read only memory, erasable programmable read only memory, electrically erasable programmable read only memory, and the like. In this embodiment, the storage module may be configured to store information about a job task. Of course, the storage module may also be used to store a program, and the processing module executes the program after receiving the execution instruction.
The communication module is used for establishing communication connection between the node and other nodes in the cluster system through a network and receiving and transmitting data through the network.
The embodiment of the application also provides a computer readable storage medium. The readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to execute the task scheduling processing method as described in the above embodiments.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented in hardware, or by means of software plus a necessary general hardware platform. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (a CD-ROM, a USB flash drive, a mobile hard disk, etc.) and includes several instructions for causing a computer device (a personal computer, a server, a network device, etc.) to execute the method described in each implementation scenario of the present application.
In summary, the present application provides a task scheduling processing method, a device, a cluster system and a readable storage medium. The method comprises the following steps: acquiring a job task sent by a scheduling node in a cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters; determining the task type of the job task according to the identification of the characterization task type in the job task; invoking a preprocessing component corresponding to the task type, and initializing a task environment to obtain an operating environment for executing an HPC task or an AI task; and executing the job task through the running environment according to the task content of the job task to obtain an execution result. In the scheme, the computing node can preprocess the task environment according to the task type to obtain the running environment for executing the HPC task or the AI task, and then can execute the HPC task or the AI task based on the obtained running environment, so that the problems of single task type and low hardware resource utilization rate of the computing node are solved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, system and method may be implemented in other manners as well. The above-described apparatus, system, and method embodiments are merely illustrative, for example, flow charts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (8)
1. A method for task scheduling processing, applied to a computing node in a cluster system, the method comprising:
acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters;
determining the task type of the job task according to the identification of the characterization task type in the job task;
invoking a preprocessing component corresponding to the task type, and initializing a task environment to obtain an operating environment for executing the HPC task or the AI task;
executing the job task through the running environment according to the task content of the job task to obtain an execution result;
invoking a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an operating environment for executing the HPC task or the AI task, wherein the preprocessing component comprises:
When the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, and initializing a task environment to obtain an operation environment for executing the HPC task;
When the job task is an AI task, invoking a preprocessing component corresponding to the AI task, and initializing a task environment to obtain an operation environment for executing the AI task;
The preprocessing component comprises a general processing component and an AI frame processing component, calls the preprocessing component corresponding to the AI task, initializes a task environment to obtain an operation environment for executing the AI task, and comprises the following steps:
Invoking the general processing component, and selecting target hardware resources corresponding to subtasks in the AI task, wherein the target hardware resources corresponding to each subtask are selected and obtained from the hardware resources of all target computing nodes according to the operation amount required by the corresponding subtask, and the target hardware resources comprise the identity of the target computing node, the identity of a CPU, the identity of a kernel and the identity of a GPU;
Invoking the AI frame processing component, and selecting a processing frame and an accelerator corresponding to the AI task, wherein the AI task carries processing frame information and accelerator information required by the AI task;
And creating a container for executing the subtasks according to the target hardware resources, the processing framework and the accelerator to obtain an operation environment for executing the AI tasks.
2. The method according to claim 1, wherein the method further comprises:
and clearing the association relation of the target hardware resources corresponding to the job task and the container.
3. The method of claim 1, wherein obtaining job tasks sent by scheduling nodes in the cluster system comprises:
And acquiring the job task sent by the HPC scheduler of the scheduling node in the cluster system.
4. The task scheduling processing method is characterized by being applied to a cluster system, wherein the cluster system comprises a submitting node, a scheduling node and a plurality of computing nodes, and the method comprises the following steps:
The submitting node generates a job task according to the task parameters, wherein the job task comprises an HPC task or an AI task;
the scheduling node acquires the job task from the submitting node;
the scheduling node determines a computing node matched with the task parameters of the job task from a plurality of computing nodes as a target computing node;
The target computing node determines the task type of the job task according to the identification of the characterization task type in the job task;
The target computing node invokes a preprocessing component corresponding to the task type, initializes a task environment and obtains an operating environment for executing the HPC task or the AI task;
The target computing node executes the job task through the running environment according to the task content of the job task to obtain an execution result;
When the job task is an HPC task, the target computing node calls a preprocessing component corresponding to the HPC task, and initializes a task environment to obtain an operating environment for executing the HPC task; when the job task is an AI task, invoking a preprocessing component corresponding to the AI task, and initializing a task environment to obtain an operation environment for executing the AI task; the invoking the preprocessing component corresponding to the AI task, initializing a task environment, and obtaining an operation environment for executing the AI task, including: invoking a general processing component, and selecting target hardware resources corresponding to subtasks in the AI task, wherein the target hardware resources corresponding to each subtask are selected and obtained from the hardware resources of all target computing nodes according to the operation amount required by the corresponding subtask, and the target hardware resources comprise the identity of the target computing node, the identity of a CPU, the identity of a kernel and the identity of a GPU; invoking an AI frame processing component, and selecting a processing frame and an accelerator corresponding to the AI task; and creating a container for executing the subtasks according to the target hardware resources, the processing framework and the accelerator to obtain an operation environment for executing the AI task, wherein the AI task carries processing framework information and accelerator information required for representing the AI task.
5. A task scheduling processing device, characterized in that it is applied to a computing node in a cluster system, the device comprising:
the task management system comprises an acquisition unit, a scheduling unit and a processing unit, wherein the acquisition unit acquires a job task sent by a scheduling node in the cluster system, and the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters;
The determining unit is used for determining the task type of the job task according to the identification of the characterization task type in the job task;
the preprocessing unit is used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
the execution unit is used for executing the job task through the running environment according to the task content of the job task to obtain an execution result;
The preprocessing unit is specifically configured to call a preprocessing component corresponding to an HPC task when the job task is the HPC task, and initialize a task environment to obtain an operating environment for executing the HPC task; when the job task is an AI task, invoking a preprocessing component corresponding to the AI task, and initializing a task environment to obtain an operation environment for executing the AI task; the invoking the preprocessing component corresponding to the AI task, initializing a task environment, and obtaining an operation environment for executing the AI task, including: invoking a general processing component, and selecting target hardware resources corresponding to subtasks in the AI task, wherein the target hardware resources corresponding to each subtask are selected and obtained from the hardware resources of all target computing nodes according to the operation amount required by the corresponding subtask, and the target hardware resources comprise the identity of the target computing node, the identity of a CPU, the identity of a kernel and the identity of a GPU; invoking an AI frame processing component, and selecting a processing frame and an accelerator corresponding to the AI task, wherein the AI task carries processing frame information and accelerator information required by representing the AI task; and creating a container for executing the subtasks according to the target hardware resources, the processing framework and the accelerator to obtain an operation environment for executing the AI tasks.
6. A server comprising a memory and a processor coupled to each other, the memory storing a computer program which, when executed by the processor, causes the server to perform the method of any of claims 1-5.
7. A cluster system comprising a commit node, a schedule node, and a plurality of compute nodes, wherein:
the submitting node is used for generating a job task according to the task parameters, wherein the job task comprises an HPC task or an AI task;
the scheduling node is used for acquiring the job task from the submitting node;
the scheduling node is further used for determining a computing node matched with the task parameters of the job task from a plurality of computing nodes as a target computing node;
The target computing node is used for determining the task type of the job task according to the identification of the characterization task type in the job task;
The target computing node is also used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
the target computing node is further used for executing the job task through the running environment according to the task content of the job task to obtain an execution result;
The target computing node is specifically configured to call a preprocessing component corresponding to an HPC task when the job task is the HPC task, and initialize a task environment to obtain an operating environment for executing the HPC task; when the job task is an AI task, invoking a preprocessing component corresponding to the AI task, and initializing a task environment to obtain an operation environment for executing the AI task; the invoking the preprocessing component corresponding to the AI task, initializing a task environment, and obtaining an operation environment for executing the AI task, including: invoking a general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task; invoking an AI frame processing component, and selecting a processing frame and an accelerator corresponding to the AI task; and creating a container for executing the subtasks according to the target hardware resources, the processing framework and the accelerator to obtain an operation environment for executing the AI tasks.
8. A computer readable storage medium, characterized in that a computer program is stored in the readable storage medium which, when run on a computer, causes the computer to perform the method according to any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010957856.3A CN112035238B (en) | 2020-09-11 | 2020-09-11 | Task scheduling processing method and device, cluster system and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112035238A CN112035238A (en) | 2020-12-04 |
CN112035238B true CN112035238B (en) | 2024-07-19 |
Family
ID=73589022
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010957856.3A Active CN112035238B (en) | 2020-09-11 | 2020-09-11 | Task scheduling processing method and device, cluster system and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112035238B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112527277B (en) * | 2020-12-16 | 2023-08-18 | 平安银行股份有限公司 | Visualized calculation task arrangement method and device, electronic equipment and storage medium |
CN113127096A (en) * | 2021-04-27 | 2021-07-16 | 上海商汤科技开发有限公司 | Task processing method and device, electronic equipment and storage medium |
CN114968559B (en) * | 2022-05-06 | 2023-12-01 | 苏州国科综合数据中心有限公司 | LSF-based multi-host multi-GPU distributed arrangement deep learning model method |
CN115756822B (en) * | 2022-10-18 | 2024-03-19 | 超聚变数字技术有限公司 | Method and system for optimizing high-performance computing application performance |
CN115794387A (en) * | 2022-11-14 | 2023-03-14 | 苏州国科综合数据中心有限公司 | LSF-based single-host multi-GPU distributed PyTorch parallel computing method |
CN115964147A (en) * | 2022-12-27 | 2023-04-14 | 浪潮云信息技术股份公司 | High-performance calculation scheduling method, device, equipment and readable storage medium |
CN116629382B (en) * | 2023-05-29 | 2024-01-02 | 上海和今信息科技有限公司 | Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes |
CN116594755B (en) * | 2023-07-13 | 2023-09-22 | 太极计算机股份有限公司 | Online scheduling method and system for multi-platform machine learning tasks |
CN116860463A (en) * | 2023-09-05 | 2023-10-10 | 之江实验室 | Distributed self-adaptive spaceborne middleware system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110389826A (en) * | 2018-04-20 | 2019-10-29 | 伊姆西Ip控股有限责任公司 | For handling the method, equipment and computer program product of calculating task |
CN111414234A (en) * | 2020-03-20 | 2020-07-14 | 深圳市网心科技有限公司 | Mirror image container creation method and device, computer device and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120005682A1 (en) * | 2010-06-30 | 2012-01-05 | International Business Machines Corporation | Holistic task scheduling for distributed computing |
US9268613B2 (en) * | 2010-12-20 | 2016-02-23 | Microsoft Technology Licensing, Llc | Scheduling and management in a personal datacenter |
US10387179B1 (en) * | 2014-12-16 | 2019-08-20 | Amazon Technologies, Inc. | Environment aware scheduling |
CN106331150B (en) * | 2016-09-18 | 2018-05-18 | 北京百度网讯科技有限公司 | For dispatching the method and apparatus of Cloud Server |
CN108089924A (en) * | 2017-12-18 | 2018-05-29 | 郑州云海信息技术有限公司 | A kind of task run method and device |
CN109324793A (en) * | 2018-10-24 | 2019-02-12 | 北京奇虎科技有限公司 | Support the processing system and method for algorithm assembly |
CN111338784B (en) * | 2020-05-25 | 2020-12-22 | 南栖仙策(南京)科技有限公司 | Method and system for realizing integration of code warehouse and computing service |
- 2020-09-11: CN application CN202010957856.3A granted as patent CN112035238B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN112035238A (en) | 2020-12-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112035238B (en) | Task scheduling processing method and device, cluster system and readable storage medium | |
US8862933B2 (en) | Apparatus, systems and methods for deployment and management of distributed computing systems and applications | |
CN109117252B (en) | Method and system for task processing based on container and container cluster management system | |
CN114741207B (en) | GPU resource scheduling method and system based on multi-dimensional combination parallelism | |
CN109766319B (en) | Compression task processing method and device, storage medium and electronic equipment | |
CN112256444B (en) | DAG-based service processing method, device, server and storage medium | |
CN112860387A (en) | Distributed task scheduling method and device, computer equipment and storage medium | |
CN116795647A (en) | Method, device, equipment and medium for managing and scheduling heterogeneous resources of database | |
CN117215764A (en) | Computing power resource processing method, device, equipment and storage medium | |
CN113051049B (en) | Task scheduling system, method, electronic device and readable storage medium | |
CN108829516B (en) | Resource virtualization scheduling method for graphic processor | |
CN113626173A (en) | Scheduling method, device and storage medium | |
CN114489963A (en) | Management method, system, equipment and storage medium of artificial intelligence application task | |
CN116795492A (en) | Resource scheduling method, device and equipment of cloud platform and readable storage medium | |
CN116932147A (en) | Streaming job processing method and device, electronic equipment and medium | |
CN114168294B (en) | Method and device for distributing compiling resources, electronic equipment and storage medium | |
CN115390992A (en) | Virtual machine creating method, device, equipment and storage medium | |
CN114675954A (en) | Task scheduling method and device | |
CN113176941A (en) | Method for mixed deployment of on-line/off-line service, cluster system and electronic equipment | |
CN116032928B (en) | Data collaborative computing method, device, system, electronic device and storage medium | |
CN116743589B (en) | Cloud host migration method and device and electronic equipment | |
CN115098223B (en) | Scheduling method, device and system for container instance | |
US20240111604A1 (en) | Cloud computing qos metric estimation using models | |
CN112817573B (en) | Method, apparatus, computer system, and medium for building a streaming computing application | |
CN114035942A (en) | Resource scheduling method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20211012
Address after: 100089 Building 36, Courtyard 8, Dongbeiwang West Road, Haidian District, Beijing
Applicant after: Dawning Information Industry (Beijing) Co.,Ltd.
Applicant after: ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.
Address before: Building 36, Yard 8, Dongbeiwangxi Road, Haidian District, Beijing
Applicant before: Dawning Information Industry (Beijing) Co.,Ltd.
GR01 | Patent grant | ||