CN118349283A - Method and apparatus for executing non-blocking macroinstruction multistage pipeline processor for distributed cluster system
- Publication number: CN118349283A
- Application number: CN202410469221.7A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
Abstract
The invention provides a method and an apparatus for executing a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system. The method comprises: detecting a message buffer interface and a signaling buffer interface to determine whether a new message or signaling has been input; in response to a new message or signaling input, detecting whether the task identifier and target port number in the message or signaling meet the current processing requirements, and if so entering the next processing stage, otherwise discarding the packet; sending the target port number and a start command to a processing unit for instruction fetch, decode, operand fetch, execute, and write-back processing; and outputting the processed message or signaling. By decoupling computation instructions from communication instructions, receiving and processing them separately through the message buffer interface and the signaling buffer interface, and completing parallel instruction processing based on a control logic unit and a multistage pipeline processing unit, the invention improves the communication efficiency of the network system and the parallel task-processing efficiency of the distributed cluster system.
Description
Technical Field
The present invention relates to the field of macro instruction processors, and in particular, to a method and apparatus for executing a non-blocking macro instruction multistage pipeline processor in a distributed cluster system.
Background
Current distributed deep-learning platforms use a single, rigid communication/computation pattern in which communication instructions and computation instructions cannot be combined effectively, causing interaction delays between communication and computation. Communication instructions here are the instructions related to transferring data between the nodes of the distributed system, such as synchronization signaling, producer-consumer communication, parameter synchronization, and the sending and receiving of gradient-update messages. Computation instructions are the instructions that carry computing tasks, such as the forward-propagation and backward-propagation steps that perform specific mathematical operations during model training. That communication and computation instructions are not effectively combined means they are highly coupled, which limits the parallelism and efficiency of computation and communication operations and results in interaction delays between the two. In addition, current distributed cluster systems are often subject to blocking: if a node cannot accept a new request because it is blocked, that node's resources may be wasted.
Disclosure of Invention
In view of the foregoing, the present invention provides a method and apparatus for executing a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system, so as to solve at least one of the above-mentioned problems.
In order to achieve the above purpose, the present invention adopts the following scheme:
According to a first aspect of the present invention, there is provided an execution method for a non-blocking macro-instruction multistage pipeline processor of a distributed cluster system, the method comprising: detecting a message buffer interface and a signaling buffer interface by hardware logic to determine whether a new message or signaling has been input; in response to a new message or signaling input, detecting whether the task identifier and target port number in the message or signaling data packet meet the current processing requirements, and if so, reading the data packet directly from the message buffer interface or the signaling buffer interface into a register and entering the next processing stage, otherwise discarding it; sending the target port number and a start command to a multistage pipeline processing unit for instruction fetch, decode, operand fetch, execute, and write-back processing; and encapsulating and outputting the processed message or signaling through an independent message or signaling logic interface.
As an embodiment of the present invention, the above method further comprises: if a newly input message is detected to be a trigger message, locating the corresponding subroutine area for processing according to the message content and a preset offset.
As an embodiment of the present invention, the above method further comprises: when the message buffer interface and the signaling buffer interface are both empty, arbitrating among the macro-instruction programs corresponding to the available port numbers, and after a port number is determined, sending the determined port number and a start command to the multistage pipeline processing unit.
As an embodiment of the present invention, the multistage pipeline processing unit comprises an instruction fetch module, a decode module, an operand fetch module, an execute module, and a write-back module, and sending the target port number and the start command to the multistage pipeline processing unit for instruction fetch, decode, operand fetch, execute, and write-back processing comprises: sending the target port number and the start command to the instruction fetch module, which queries the corresponding instruction address register according to the target port number to obtain an instruction and sends the instruction to the decode module; the decode module receives the instruction, performs preliminary decoding to obtain an instruction code, and outputs the instruction code to the operand fetch module; if the decode module encounters a pipeline stall, it enters a delayed-processing state and outputs the decoding result in the next cycle; the operand fetch module parses, from the instruction code, the operation codes required by the two-stage logic of the execute module and the write-back module and passes them on; the execute module completes execution of the instruction based on the operation code; and the write-back module completes write operations, stack operations, and control-flow operations based on the execution result of the execute module and the operation code.
As an embodiment of the present invention, the above method further comprises: suspending execution of the current-stage function when the instruction fetch module detects an instruction memory-access conflict, and resuming execution of the current-stage function after the conflict is resolved.
As an embodiment of the present invention, when the instruction fetch module detects an instruction memory-access conflict, the method further comprises: temporarily pushing the conflicting instructions onto a stack until, after the conflict is resolved, they are popped from the stack and execution continues.
As an embodiment of the present invention, the above method further comprises: dynamically allocating the resources of the non-blocking macro-instruction multistage pipeline processor according to the current execution status of messages or signaling and the resource-occupation state.
As an embodiment of the present invention, the operand fetch module is further configured to receive the pipeline delay, pipeline discard, and atomic-operation-end signals generated by the write-back module; upon receiving an atomic-operation-end signal, the operand fetch module discards the current operation and returns control to the non-blocking macro-instruction multistage pipeline processor; upon receiving a pipeline discard signal, the operand fetch module discards the current operation and continues executing subsequent instructions; the instruction fetch module and the decode module are also configured to receive the atomic-operation-end signal generated by the write-back module, to discard the current operation based on that signal, and then to return control to the non-blocking macro-instruction multistage pipeline processor.
As an embodiment of the present invention, after the write-back module generates an atomic-operation-end signal, the non-blocking macro-instruction multistage pipeline processor temporarily stores the current state, then receives and parses a new message or signaling input, and a new round of pipeline operation is performed by the multistage pipeline processing unit based on the parsing result.
According to a second aspect of the present invention, there is provided a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system, comprising: a control logic unit configured to detect a message buffer interface and a signaling buffer interface by hardware logic to determine whether a new message or signaling has been input, and, in response to a new message or signaling input, to detect whether the task identifier and target port number in the message or signaling data packet meet the current processing requirements, reading the data packet directly from the message buffer interface or the signaling buffer interface into a register and sending the target port number and a start command to a multistage pipeline processing unit if they do, and discarding it otherwise; the multistage pipeline processing unit, configured to perform instruction fetch, decode, operand fetch, execute, and write-back processing based on the target port number and the start command; and an output unit configured to encapsulate and output the processed message or signaling through an independent message or signaling logic interface.
As an embodiment of the present invention, the control logic unit is further configured to: if a newly input message is detected to be a trigger message, locate the corresponding subroutine area for processing according to the message content and a preset offset.
As an embodiment of the present invention, the control logic unit is further configured to: when the message buffer interface and the signaling buffer interface are both empty, arbitrate among the macro-instruction programs corresponding to the available port numbers, and after a port number is determined, send the determined port number and a start command to the multistage pipeline processing unit.
As an embodiment of the present invention, the multistage pipeline processing unit comprises an instruction fetch module, a decode module, an operand fetch module, an execute module, and a write-back module, wherein the instruction fetch module is configured to query the corresponding instruction address register according to the target port number to obtain an instruction and send the instruction to the decode module; the decode module is configured to perform preliminary decoding after receiving an instruction to obtain an instruction code and output the instruction code to the operand fetch module, entering the Delay state if it encounters a pipeline stall and outputting the decoding result in the next cycle; the operand fetch module is configured to parse, from the instruction code, the operation codes required by the two-stage logic of the execute module and the write-back module and to pass them on; the execute module is configured to complete the execution of instructions based on the operation code; and the write-back module is configured to complete write operations, stack operations, and control-flow operations based on the execution result of the execute module and the operation code.
As an embodiment of the present invention, the instruction fetch module is further configured to suspend execution of the current-stage function when it detects an instruction memory-access conflict, and to resume execution of the current-stage function after the conflict is resolved.
As an embodiment of the present invention, the apparatus further comprises a stack-push processing unit configured, when the instruction fetch module detects an instruction memory-access conflict, to temporarily push the conflicting instructions onto a stack until, after the conflict is resolved, they are popped from the stack and execution continues.
As an embodiment of the present invention, the apparatus further comprises a resource scheduling unit configured to dynamically allocate the resources of the non-blocking macro-instruction multistage pipeline processor according to the current execution status of messages or signaling and the resource-occupation state.
As an embodiment of the present invention, the operand fetch module is further configured to receive the pipeline delay, pipeline discard, and atomic-operation-end signals generated by the write-back module; upon receiving an atomic-operation-end signal, the operand fetch module discards the current operation and returns control to the non-blocking macro-instruction multistage pipeline processor; upon receiving a pipeline discard signal, the operand fetch module discards the current operation and continues executing subsequent instructions; the instruction fetch module and the decode module are also configured to receive the atomic-operation-end signal generated by the write-back module, to discard the current operation based on that signal, and then to return control to the non-blocking macro-instruction multistage pipeline processor.
As an embodiment of the present invention, after the write-back module generates an atomic-operation-end signal, the non-blocking macro-instruction multistage pipeline processor temporarily stores the current state, then receives and parses a new message or signaling input, and a new round of pipeline operation is performed by the multistage pipeline processing unit based on the parsing result.
According to a third aspect of the present invention there is provided an electronic device comprising a memory, a processor and a computer program stored on said memory and executable on said processor, the processor implementing the steps of the above method when executing said computer program.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
According to the above technical solutions, the execution method and apparatus for a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system decouple computation instructions from communication instructions, transmit them separately over a data channel and a signaling channel, receive them separately through the message buffer interface and the signaling buffer interface, and complete parallel instruction processing based on the control logic unit and the multistage pipeline processing unit, thereby improving the communication efficiency of the network system when applied to a distributed cluster system. In addition, through effective resource scheduling and a conflict-detection mechanism, the application resolves resource contention and data dependencies among multiple instructions so that no blocking occurs; after an atomic operation ends, the current state can be temporarily stored, a new message or signaling input received and parsed, and a new round of pipeline operation performed by the multistage pipeline processing unit based on the parsing result. The non-blocking macro-instruction processor can thus handle parallel tasks effectively, improving execution efficiency and response speed.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a flow chart of a method for executing a non-blocking macro instruction multi-stage pipelined processor for a distributed cluster system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an execution flow of a multistage pipeline processing unit according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for executing a non-blocking macro instruction multi-stage pipelined processor for a distributed cluster system according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a multistage pipeline processing unit according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system according to another embodiment of the present application;
Fig. 7 is a schematic block diagram of a system configuration of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present invention and their descriptions herein are for the purpose of explaining the present invention, but are not to be construed as limiting the invention.
The technical terms related to the present application will be briefly described below:
The term "distributed cluster": a computer system architecture connects multiple independent computers (nodes) through a network to collectively perform tasks or applications. By dispersing computing tasks to a plurality of computing nodes, the distributed cluster not only improves computing efficiency and system reliability, but also provides excellent expandability and flexibility, and becomes an important technical foundation for dealing with modern large-scale computing problems.
The term "macroinstruction processor": an advanced processor design is capable of executing complex Instructions consisting of multiple micro-Instructions (microinstructions) or sequences of operations, known as Macro-Instructions (Macro-Instructions).
Fig. 1 is a flow chart of a method for executing a non-blocking macro instruction multi-stage pipeline processor in a distributed cluster system according to an embodiment of the present application, the method includes the following steps:
step S101: the message buffer interface and the signaling buffer interface are detected based on hardware logic to determine whether a new message or signaling is input.
A message buffer (Message FIFO) interface is dedicated to buffering messages received from other nodes of the cluster system or from external sources; these messages may include data updates, processing requests, or other forms of information exchange.
A signaling buffer (Command FIFO) interface is used to receive and buffer the control signaling that manages and coordinates tasks and resources in the cluster, such as commands to start or stop tasks and resource-allocation instructions.
Both the message buffer interface and the signaling buffer interface use a first-in, first-out queue management mechanism to ensure that data are processed in arrival order. In this embodiment, the messages in the message buffer interface contain computation instructions, and the signaling in the signaling buffer interface contains communication instructions.
Both messages and signaling take the form of data packets that comprise a data payload and a header. The payload is the actual user or computation data on which the processor performs its computing tasks; the header contains the metadata used for routing decisions, such as the destination address and source address, ensuring that the packet can be delivered correctly to the destination processor. Each core of the non-blocking macro-instruction multistage pipeline processor contains a router, a packet manager, and local storage: the router determines the forwarding path of a packet and ensures it reaches its destination along an optimal path; the packet manager handles packet queues and priorities and manages packet reception and transmission; the local storage provides temporary or persistent storage for the data needed by the computations. This design allows each processor core to handle packet reception, computation, and forwarding independently.
In this embodiment, the non-blocking macro-instruction processor periodically checks the message buffer interface and the signaling buffer interface to determine whether new input is waiting to be processed; this detection mechanism enables the processor to respond promptly to external events and internal instructions.
As part of its "non-blocking" behavior, the macro-instruction processor can dynamically adjust its internal configuration according to the current workload and system state, for example by adjusting resource allocation or changing instruction-execution priorities, to optimize processing performance. In addition, the application detects the message buffer interface and the signaling buffer interface through hardware logic, which allows the processor to monitor the message and signaling interfaces independently while the core processes computing tasks; data processing and signal detection can therefore occur simultaneously, and neither side waits for or blocks on the other. Because messages and signaling travel over independent data channels, the non-blocking macro-instruction processor can handle external communication tasks efficiently without interrupting the main computing tasks, which is particularly important in a multi-task, high-concurrency distributed cluster environment and fully embodies the non-blocking design concept. A minimal behavioral sketch follows.
Step S102: and in response to the new message or signaling input, detecting whether the task identifier and the target port number in the message or signaling data packet meet the current processing requirement, if so, directly reading the message or signaling data packet from the message cache interface or the signaling cache interface into a register, and entering step S103, and if not, discarding.
In this embodiment, each message or signaling packet contains a TASK identifier (task_id) that uniquely identifies the TASK or job with which the packet is associated, which allows the processor to determine whether the packet belongs to a TASK queue that is currently being processed or pending. The packet also specifies a destination port number (Channel) that indicates the processing unit of the particular program to which the packet should be routed. In a distributed cluster system, different processing units may monitor different port numbers to handle a particular type of task.
It is assumed that the present application has a macro-instruction executor in the distributed cluster system that needs to process 8 programs in Active state, each program being assigned a unique target port number, ranging from 0 to 7. The macro-instruction executor, upon start-up, initializes its internal resources and configuration, including allocating the necessary processing resources to each destination port number (0-7), each active program being allocated a unique destination port number that is used to identify and route its packets and instructions during program execution.
The macro-instruction executor continually detects the input queue to identify new message or signaling packets, each containing a destination port number representing the program with which it is associated. Depending on the destination port number in the packet, the macro-instruction executor routes the packet to the corresponding processing queue, which ensures that the packet for each program can be correctly identified and processed. The macro instruction executor can process the data packets of all active programs simultaneously, and each program has one independent processing queue and resource allocation, so that the programs can be executed in parallel without interference. When a program completes its task, its state is updated to inactive, and the macro-instruction executor may reclaim the resources used by the program for use by other tasks.
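A minimal sketch of the packet check and routing described above, assuming the 8-port example; the field names follow the TASK_ID/Channel terminology of step S102, while the particular set of active tasks is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    task_id: int    # TASK_ID: the task/job this packet belongs to
    channel: int    # destination port number (0..7 in the 8-program example)
    payload: bytes  # user or computation data

NUM_PORTS = 8
ACTIVE_TASKS = {17, 42}  # hypothetical tasks currently being processed

def accept(pkt: Packet, queues) -> bool:
    """Admit the packet only if its task identifier and port meet current requirements."""
    if pkt.task_id not in ACTIVE_TASKS or not (0 <= pkt.channel < NUM_PORTS):
        return False                    # requirement not met: discard the packet
    queues[pkt.channel].append(pkt)     # per-program queue => interference-free parallelism
    return True

queues = [[] for _ in range(NUM_PORTS)]            # one independent queue per program
accept(Packet(task_id=17, channel=3, payload=b"grad"), queues)
```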
Preferably, the execution method of the present application further includes a function similar to an interrupt handler: when a newly input message is detected to be a trigger message (Vector message), the processor locates the corresponding subroutine area for processing according to the message content and a preset offset. The subroutine area corresponding to each Vector needs no Call or Return instruction; after execution it can transfer directly to the next task or return to the state before the interruption. This interrupt-handler-like capability improves the distributed cluster system's ability and efficiency in handling asynchronous events.
Further preferably, the execution method of the present application also includes an arbitration process: when the message buffer interface and the signaling buffer interface are both empty, the processor arbitrates among the macro-instruction programs corresponding to the available port numbers, determines a port number, and then sends the determined port number and a start command to the multistage pipeline processing unit.
When both the message buffer interface and the signaling buffer interface in the distributed cluster system are empty, no external message or signaling is currently waiting to be processed. In this case the system needs to make efficient use of its computing resources to keep operating efficiently. By arbitrating among the macro-instruction programs corresponding to the available port numbers, the system selects the task to execute next, maintaining efficient utilization of the processing units and optimizing the overall allocation of computing resources.
Since each macro-instruction program is associated with a particular port number that identifies its communication and data-flow paths, the arbitration process decides which program's port number to execute. After the port number is determined, the system generates a start command containing all the information required to execute the selected macro program, such as its start address and execution parameters. The start command and the determined port number are sent to the multistage pipeline processing unit to begin the program's execution flow. In this way the system dynamically adjusts its task-execution queue, ensuring efficient use of computing resources even in the absence of external input; this helps optimize overall system efficiency and responsiveness, maximizing resource utilization and throughput through the rational scheduling of internal tasks. A possible arbitration policy is sketched below.
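The patent does not fix a particular arbitration policy; the sketch below uses round-robin as one plausible choice, with a hypothetical start-command tuple.

```python
def arbitrate(active_ports: set, last_granted: int, num_ports: int = 8):
    """Pick the next runnable macro-instruction program when both FIFOs are empty."""
    for offset in range(1, num_ports + 1):          # rotate fairly over all ports
        port = (last_granted + offset) % num_ports
        if port in active_ports:
            return port, ("START", port)            # (determined port number, start command)
    return None, None                               # no runnable program this cycle

port, cmd = arbitrate({1, 4, 6}, last_granted=4)    # grants port 6, then 1, then 4, ...
```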
Step S103: and sending the target port number and the starting command to a multistage pipeline processing unit to perform instruction fetching, decoding, fetching, executing and write-back processing.
In this embodiment, the target port number and the start command are used as inputs, and through a series of processes of the multistage pipeline processing unit, execution of instructions can be efficiently implemented, especially in an application scenario of high performance calculation or needing quick response, the pipeline technology significantly improves throughput and execution efficiency of the processor by processing different stages of multiple instructions in parallel.
Preferably, the multistage pipeline processing unit of this embodiment comprises an instruction fetch module, a decode module, an operand fetch module, an execute module, and a write-back module; as shown in fig. 2, this step may further include the following sub-steps:
Step S1031: and sending the target port number and the starting command to an instruction fetching module, inquiring a corresponding instruction address register by the instruction fetching module according to the target port number to obtain an instruction, and sending the instruction to a decoding module.
Taking the above-mentioned 8 active programs as an example, the Instruction fetching module may have 8 Instruction address registers (Instruction ADDRESS REG) and 8 stack pointer registers (Stack Pointer Reg), if the message received in step S102 is a Reset message, the Instruction address register and the stack pointer register corresponding to the message are set to initial values, which enables the system to ensure that the system starts executing from a known, clean state.
The instruction address register is used for storing the memory address of the next instruction to be executed, and the stack pointer register is used for managing the state of a call stack (CALL STACK) or an execution stack (execution stack). After receiving the target port number and the start command, the instruction fetching module queries a corresponding instruction address register according to the target port number, fetches an instruction from an address stored in the instruction address register, and transmits the instruction to a subsequent module for execution. After one instruction Class is executed, the instruction fetching module refreshes the corresponding instruction address register to the latest value, and then gives control right to the non-blocking macro instruction processor of the distributed cluster system.
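A behavioral sketch of the fetch stage just described, assuming the 8-program example; the memory layout and method names are illustrative, not part of the patent.

```python
class InstructionFetch:
    NUM_PORTS = 8

    def __init__(self, instruction_memory):
        self.imem = instruction_memory            # shared instruction store
        self.iar = [0] * self.NUM_PORTS           # Instruction Address Reg, one per port
        self.sp = [0] * self.NUM_PORTS            # Stack Pointer Reg, one per port

    def reset(self, port: int):
        """A Reset message restores a known, clean starting state for the port."""
        self.iar[port] = 0
        self.sp[port] = 0

    def fetch(self, port: int):
        """Query the port's address register, fetch, and refresh the register."""
        instruction = self.imem[self.iar[port]]
        self.iar[port] += 1                       # refresh to the latest value
        return instruction                        # handed on to the decode module
```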
Preferably, the instruction fetch module in this embodiment also supports exception handling and interrupt handling: when the instruction fetch module detects an instruction memory-access conflict, execution of the current-stage function is suspended and is resumed after the conflict is resolved.
An instruction memory-access conflict is a conflict encountered by instructions when accessing memory during execution. It arises mainly when the processor attempts to access multiple memory locations at the same time and data dependencies or resource contention among those accesses prevent all memory requests from being satisfied simultaneously. For example, in this embodiment, when the operand fetch module performs a read-RAM operation while the write-back module performs a write-RAM operation, the instruction fetch module suspends execution of the current-stage function and resumes it after the conflict is resolved. By introducing this interrupt capability, the non-blocking macro-instruction multistage pipeline processor can effectively manage and mitigate the impact of instruction memory-access conflicts, improving execution efficiency and processor performance.
Preferably, when the instruction fetch module detects an instruction memory-access conflict, the method of the present application further comprises: temporarily pushing the conflicting instructions onto a stack until, after the conflict is resolved, they are popped from the stack and execution continues, as sketched below; this effectively avoids the long blocking that a memory-access conflict would otherwise cause.
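A minimal sketch of this temporary push, with invented names; in the real processor this happens in fetch-stage hardware rather than software.

```python
class ConflictStack:
    """Parks instructions that hit a memory-access conflict instead of stalling."""
    def __init__(self):
        self._parked = []

    def park(self, instruction):
        self._parked.append(instruction)   # push on conflict detection

    def resume(self):
        """Pop and resume the parked instruction once the conflict is resolved."""
        return self._parked.pop() if self._parked else None
```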
Preferably, the method of the present application further comprises: dynamically allocating the resources of the non-blocking macro-instruction multistage pipeline processor according to the current execution status of messages or signaling and the resource-occupation state; for example, priority queuing, time-slice round-robin, or other fair scheduling algorithms can be used to ensure that all instructions get equal and effective access to the resources they need. In addition, a feedback system can be introduced to adjust the resource-allocation strategy according to real-time feedback on instruction execution, improving resource utilization and execution efficiency.
Step S1032: and the decoding module receives the instruction, performs preliminary decoding to obtain an instruction code, outputs the instruction code to the fetch module, enters a delay processing state if the decoding module encounters a state of pipeline pause, and outputs a decoding result in the next period.
The design of the decoding module of the present application allows it to dynamically adapt to the current execution environment during processing, running in two states: normal and Delay, which enhance the ability of the processor to handle different execution challenges, especially in the face of pipeline stalls.
In Normal state, the decode module receives instructions from the instruction fetch module that have been fetched from memory and are ready to decode, and the decode process includes parsing the instruction code of the instructions, determining the type of instruction (e.g., arithmetic operations, logical operations, data transfer operations, etc.), and identifying the operand.
The Delay state is designed to address pipeline stall situations that may be caused by various reasons, such as data dependency, resource contention, or the inability to continue processing new instructions because the previous instruction has not completed execution. When the decode module encounters such a condition, it pauses the current decoding operation and enters the Delay state. In this state, decoding and delivery of instructions is deferred until execution can continue safely. This design allows the pipeline to flexibly handle various delays in execution while maintaining data integrity and execution accuracy.
Therefore, the decoding module dynamically switches the operation state according to the current state of the pipeline, and the state switching mechanism improves the adaptability of the processor to the execution barrier and helps to reduce the performance loss caused by the pause.
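The two-state behavior can be modeled as follows; `decode_bits` stands in for the real preliminary decode, and the 32-bit opcode/operand split is an assumption made for the sketch.

```python
NORMAL, DELAY = "Normal", "Delay"

def decode_bits(instruction: int) -> dict:
    # hypothetical preliminary decode: split a 32-bit word into fields
    return {"opcode": instruction >> 24, "operands": instruction & 0xFFFFFF}

class Decode:
    def __init__(self):
        self.state = NORMAL
        self.pending = None       # decoding result held across a stall

    def step(self, instruction: int, pipeline_stalled: bool):
        """One clock cycle of the decode stage; returns an instruction code or None."""
        if self.state == DELAY:                     # drain the deferred result
            self.state = NORMAL                     # (upstream held its instruction)
            out, self.pending = self.pending, None
            return out
        if pipeline_stalled:                        # stall: decode but hold one cycle
            self.state = DELAY
            self.pending = decode_bits(instruction)
            return None
        return decode_bits(instruction)             # Normal: decode and pass on
```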
Step S1033: and the fetching module analyzes the operation codes required by the two-stage logic of the execution module and the write-back module according to the instruction codes and transmits the operation codes backwards.
The fetch module transmits the parsed operation code and necessary operation information to the execution module and the write-back module to guide them to complete specific logic and write-back operations. In this level logic, the fetch module also performs a read RAM or a read reg operation based on the instruction code.
Preferably, the fetch module is further configured to receive a pipeline delay (pipeline_delay), a pipeline discard (pipeline_discard), and an atomic operation end signal generated by the write back module;
When an atomic operation ending signal is received, the fetch module discards the current operation and gives control rights to the non-blocking macro instruction multi-stage pipeline processor, wherein the atomic operation refers to an operation which cannot be interrupted by other threads in the execution process, and under a multithreading environment, the atomic operation can ensure the consistency and the integrity of data under the condition that the locking of resources is not needed;
when receiving the pipeline discarding signal, the fetching module discards the current operation and continues to execute the subsequent instruction;
Further preferably, the instruction fetching module and the decoding module may also receive an atomic operation end signal generated by the write-back module, discard a current operation based on the atomic operation end signal, and then give control right to the non-blocking macro instruction multistage pipeline processor.
This complex signal processing and state management mechanism embodies the high complexity and flexibility of non-blocking macro-instruction multi-stage pipelined processors in terms of instruction execution, exception management, and task scheduling. Through the fine control, the non-blocking macro-instruction multistage pipeline processor can effectively cope with various conditions encountered in the execution process, keep high-efficiency operation, and simultaneously ensure the accuracy of execution and the stability of a system.
Further preferably, after the write-back module generates an atomic-operation-end signal, the non-blocking macro-instruction multistage pipeline processor may temporarily store the current state, then receive and parse a new message or signaling input, and have the multistage pipeline processing unit perform a new round of pipeline operation based on the parsing result. In this way the non-blocking macro-instruction processor handles parallel tasks effectively, improving execution efficiency and response speed: it can turn immediately to new input after completing a critical step of a task, achieving efficient multitasking, which is a concrete realization of its non-blocking feature. The hand-off is sketched below.
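A sketch of this hand-off, with all hooks (the poll and start functions) invented for illustration; it shows only the order of operations: park the state, parse new input, launch a new round.

```python
class ContextSwitcher:
    """Models the non-blocking turn-around after an atomic-operation-end signal."""
    def __init__(self, poll_interfaces, start_pipeline):
        self.saved_states = []                  # temporarily stored pipeline states
        self.poll_interfaces = poll_interfaces  # checks message/signaling interfaces
        self.start_pipeline = start_pipeline    # launches a new pipeline round

    def on_atomic_end(self, current_state):
        self.saved_states.append(current_state)   # temp-store the current state
        new_input = self.poll_interfaces()         # receive and parse new input
        if new_input is not None:
            self.start_pipeline(new_input)         # new round of pipeline operation
```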
Step S1034: the execution module completes execution of instructions based on the opcode.
The execution module is responsible for executing stages in a multi-stage pipelined processor architecture, at which the processor performs specific logical or arithmetic operations on instructions that were resolved and ready at the previous stage. The operation of the execution module is not limited to performing simple arithmetic operations, but also involves complex data processing and logic decisions: for example, the execution module needs to process Data from RAM (RAM Data) and Data from registers (REGISTER DATA), including reading Data from specified memory addresses and retrieving Data from register files; it is also necessary to perform addition or subtraction operations, which may involve processing integer or floating point data, according to the instruction requirements, and to correctly process the carry (carry) or borrow (borrow) generated in the operation; when it is desired to write the updated data back to RAM, the execution module performs a specific sequence of operations: firstly, original 64-bit data in a target memory address is read to ensure that only a target data segment is modified without affecting other data; integrating new data (from a register or a calculation result) to be written into the read original data to replace the original corresponding paragraph or value; the replaced complete 64-bit data is written back to the same address of the RAM. In addition, the carry signal (CARRY FLAG) generated by the addition-subtraction operation may also be used as a conditional predicate to control execution of conditional jump instructions, e.g., in some conditional branch instructions, jumps are only executed when the carry flag is set (or not set).
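The read-modify-write sequence in the third bullet can be sketched as follows; the field layout (bit offset and width) is invented for illustration.

```python
def write_field(ram: dict, addr: int, new_value: int, bit_lo: int, width: int):
    """Replace one field of a 64-bit word without disturbing its neighbors."""
    word = ram[addr]                                          # 1) read original 64-bit data
    mask = ((1 << width) - 1) << bit_lo
    word = (word & ~mask) | ((new_value << bit_lo) & mask)    # 2) splice in the new segment
    ram[addr] = word & 0xFFFFFFFFFFFFFFFF                     # 3) write the whole word back

ram = {0x10: 0xAABBCCDDEEFF0011}
write_field(ram, 0x10, 0x7F, bit_lo=8, width=8)    # only bits 8..15 change
```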
The design of the execution module in this embodiment represents the high complexity and flexibility of the non-blocking macro-instruction multistage pipeline processor in the execution stage, which is capable of handling various data operations, supporting complex program logic, and responding to condition changes during instruction execution.
Step S1035: the write-back module completes write operation, stack operation and control flow operation based on the execution result of the execution module and the operation code.
The write-back module is the last stage in a multi-stage pipelined processor architecture, whose main responsibility is to complete the final step of instruction execution, write the execution result back to the target location, and handle the control flow operations in the instruction.
The write operations include writing to RAM and writing to registers: a write-RAM operation writes the execution result or processed data to a specified location in memory, which may be the result of an arithmetic operation, the output of a data-processing operation, or data produced by other instructions; a write-Reg operation updates specific registers in the register file, writing the execution result directly into one or more target registers for use by subsequent instructions.
Stack operations include PUSH and POP: PUSH pushes data (such as function-call parameters, local variables, or return addresses) onto the call stack; POP removes data from the top of the call stack and may load it into registers or memory.
Control-flow operations include CALL, RETURN, and jump operations. CALL performs a function call: it saves the return address (the address of the instruction following the CALL instruction) on the call stack and jumps to the function's start address. RETURN returns from a function to the call site, typically by popping the return address from the call stack and setting the program counter (PC) to that address, resuming the execution flow from before the call. A jump operation changes the program's execution flow according to the instruction or execution conditions; it may be unconditional or based on a specific condition (such as the result of a comparison). When a jump must be taken, the write-back module generates a jump signal and sends it back to the instruction fetch module to indicate that the current execution path must change; the instruction fetch module then updates the instruction address register to point to the new instruction address. These control-flow operations are sketched below.
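A compact sketch of these control-flow operations as seen by the write-back stage; the opcode names mirror the description above, while the function signature is an assumption.

```python
def writeback_control(op: str, pc: int, call_stack: list,
                      target: int | None = None, carry: bool = False) -> int:
    """Return the next program-counter value for the instruction fetch module."""
    if op == "CALL":
        call_stack.append(pc + 1)              # save the return address
        return target                          # jump to the function's start address
    if op == "RETURN":
        return call_stack.pop()                # resume at the saved address
    if op == "JUMP":
        return target                          # unconditional jump
    if op == "JUMP_IF_CARRY":
        return target if carry else pc + 1     # conditional on the carry flag
    return pc + 1                              # default: fall through
```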
The design and functionality of the write-back module reflects the complexity of the processor in completing instruction execution, managing control flow, and maintaining program state consistency. By performing these operations accurately, the writeback module ensures proper implementation of program logic and provides support for efficient operation of the pipeline processor.
Preferably, the modules coordinate within the same clock cycle to complete data flow, control, and transmission. Because the modules operate in parallel in one clock cycle, the data they execute within that cycle carry no data dependence on each other, and each module's control instructions determine whether it executes. The system can therefore process more instructions and data per clock cycle, markedly improving processing speed and throughput and making full use of hardware resources. Because the data executed by the modules in the same clock cycle have no data dependence, this design effectively avoids data collisions and dependencies, reducing the delay and complexity that extra dependence handling would require; each module processes its data independently without waiting for the results of the others. Finally, the design is modular: since all modules can work in parallel within the same clock cycle, the system is easier to extend and maintain; functions can be added or performance improved by adding new modules or optimizing existing ones, without rebuilding the whole system. The cycle-level behavior is sketched below.
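This cycle-level parallelism can be pictured with the toy model below: five stage functions each consume the previous cycle's pipeline latch, so five different instructions advance in one tick. Stage internals are stubbed out; only the timing structure is the point.

```python
def clock_tick(latches: dict, stages: tuple):
    """One clock cycle: every stage works on a different instruction."""
    fetch, decode, opfetch, execute, writeback = stages
    # Evaluate back-to-front so each stage sees last cycle's latch value,
    # mimicking all five stages firing in parallel on one clock edge.
    writeback(latches["ex_wb"])
    latches["ex_wb"] = execute(latches["of_ex"])
    latches["of_ex"] = opfetch(latches["de_of"])
    latches["de_of"] = decode(latches["if_de"])
    latches["if_de"] = fetch()

latches = dict.fromkeys(["if_de", "de_of", "of_ex", "ex_wb"])
stages = (lambda: "instr", lambda x: x, lambda x: x, lambda x: x, print)
for _ in range(5):
    clock_tick(latches, stages)   # the first instruction reaches write-back on tick 5
```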
Step S104: and encapsulating and outputting the processed message or signaling by using an independent message or signaling logic interface.
According to the above technical solution, the execution method for a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system provided by the invention decouples computation instructions from communication instructions, transmits them separately over a data channel and a signaling channel, receives them separately through the message buffer interface and the signaling buffer interface, and completes parallel instruction processing based on the control logic unit and the multistage pipeline processing unit, so that the communication efficiency of the network system can be improved when the method is applied to a distributed cluster system.
FIG. 3 is a flow chart of another execution method for a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system according to an embodiment of the present application; the method comprises the following steps:
Step S301: task source code is received and a vectorizable data loop body is determined from the source code.
The non-blocking macro-instruction multistage pipeline processor first receives execution source code for deep learning or other computationally intensive tasks, and through specific analysis tools or compiler technology, a data loop body capable of vectorizing can be identified from the execution source code, and the step is completed at a software level, so as to optimize the execution efficiency of the code.
Step S302: and carrying out vectorization processing on the instruction of the data circulation body to generate a communication instruction and a calculation instruction.
Once it is determined which loop bodies can be vectorized, the non-blocking macroinstruction multi-stage pipeline processor vectorizes the instructions of the loop bodies to generate corresponding communication instructions and storage instructions. These instructions are designed to be efficiently executed on a non-blocking macroinstruction multistage pipelined processor.
Step S303: the communication instructions and the calculation instructions are respectively packaged into single data packets, and each data packet comprises a data load and a header.
The communication instructions and the computation instructions are packaged into individual data packets, each data packet containing a data payload and a header. These packets are identified and routed within the core mesh. Each non-blocking macro-instruction multistage pipeline processor calculates data packets, and the core comprises a router, a data packet manager and a storage, so that synchronous coordination between on-chip and off-chip is supported.
Step S304: the message buffer interface and the signaling buffer interface are detected based on hardware logic to determine whether a new message data packet or a signaling data packet is input.
Step S305: and in response to the new message or signaling input, detecting whether the task identifier and the target port number in the message or signaling data packet meet the current processing requirement, if so, directly reading the message or signaling data packet from the message cache interface or the signaling cache interface into a register, and entering into step S306, and if not, discarding.
Step S306: and sending the target port number and the starting command to a multistage pipeline processing unit to perform instruction fetching, decoding, fetching, executing and write-back processing.
Step S307: and encapsulating and outputting the processed message or signaling by using an independent message or signaling logic interface.
According to the technical scheme, the execution method of the non-blocking macro-instruction multistage pipeline processor for the distributed cluster system provided by the invention is used for decoupling the calculation instruction and the communication instruction, respectively transmitting the calculation instruction and the communication instruction through the data channel and the signaling channel, respectively receiving the calculation instruction and the communication instruction by utilizing the message buffer interface and the signaling buffer interface, and completing the parallel processing of the instructions based on the control logic unit and the multistage pipeline processing unit, so that the operation and the execution of the calculation instruction and the communication instruction are decoupled, and the communication efficiency of the network system can be improved when the execution method is applied to the distributed cluster system.
FIG. 4 is a schematic diagram of a non-blocking macro-instruction multistage pipeline processor for a distributed cluster system according to an embodiment of the present application, which includes: control logic unit 410, multi-stage pipeline processing unit 420, and output unit 430, connected in sequence, wherein:
A control logic unit 410, configured to detect the message buffer interface and the signaling buffer interface by hardware logic to determine whether a new message or signaling has been input; and, in response to a new message or signaling input, to detect whether the task identifier and target port number in the message or signaling data packet meet the current processing requirements, reading the data packet directly from the message buffer interface or the signaling buffer interface into a register and sending the target port number and a start command to the multistage pipeline processing unit 420 if they do, and discarding it otherwise.
A multistage pipeline processing unit 420, configured to perform instruction fetch, decode, operand fetch, execute, and write-back processing based on the target port number and the start command.
And an output unit 430 for encapsulating and outputting the processed message or signaling by using an independent message or signaling logic interface.
Preferably, the control logic unit 410 is further configured to: if a newly input message is detected to be a trigger message, locate the corresponding subroutine area for processing according to the message content and a preset offset.
Preferably, the control logic unit 410 is further configured to: when the message buffer interface and the signaling buffer interface are both empty, arbitrate among the macro-instruction programs corresponding to the available port numbers, and after a port number is determined, send the determined port number and a start command to the multistage pipeline processing unit.
Preferably, as shown in fig. 5, the multistage pipeline processing unit 420 comprises an instruction fetch module 421, a decode module 422, an operand fetch module 423, an execute module 424, and a write-back module 425, wherein the instruction fetch module 421 is configured to query the corresponding instruction address register according to the target port number to obtain an instruction and send the instruction to the decode module 422; the decode module 422 is configured to perform preliminary decoding after receiving an instruction to obtain an instruction code and output the instruction code to the operand fetch module 423, entering the delayed-processing state if it encounters a pipeline stall and outputting the decoding result in the next cycle; the operand fetch module 423 is configured to parse, from the instruction code, the operation codes required by the two-stage logic of the execute module and the write-back module and to pass them on; the execute module 424 is configured to complete execution of instructions based on the operation code; and the write-back module 425 is configured to complete write operations, stack operations, and control-flow operations based on the execution result of the execute module and the operation code.
Preferably, the instruction fetch module 421 is further configured to stop execution of the present-stage function when detecting an instruction memory conflict, and resume execution of the present-stage function after the conflict is resolved.
Preferably, when the instruction fetch module detects an instruction access conflict, the apparatus further includes a stack-push processing unit configured to temporarily push the conflicting instruction onto a stack; after the conflict is resolved, the instruction is popped from the stack and its execution continues.
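A small sketch of this temporary stack-push handling follows, under the assumption of a fixed-depth LIFO buffer; the depth and the push/pop interface are not specified by the patent.

```c
/* Temporary stack for instructions hit by an access conflict: push the
 * instruction, stall the stage, then pop and resume once the conflict clears. */
#include <stdbool.h>
#include <stdint.h>

#define CONFLICT_STACK_DEPTH 8   /* assumed depth */

static uint32_t conflict_stack[CONFLICT_STACK_DEPTH];
static int      conflict_top = -1;

bool conflict_push(uint32_t instr)
{
    if (conflict_top + 1 >= CONFLICT_STACK_DEPTH)
        return false;                 /* stack full: caller must keep stalling */
    conflict_stack[++conflict_top] = instr;
    return true;
}

bool conflict_pop(uint32_t *instr)    /* call once the conflict is resolved */
{
    if (conflict_top < 0)
        return false;
    *instr = conflict_stack[conflict_top--];
    return true;
}
```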
Preferably, the apparatus further includes a resource scheduling unit configured to dynamically allocate the resources of the non-blocking macro-instruction multistage pipeline processor according to the current execution status of messages or signaling and the resource occupancy state.
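One possible shape for such a scheduler is sketched below: pipeline slots are granted in proportion to buffer occupancy. The proportional policy, the types, and the slot model are all assumptions; the patent states only that allocation follows the current execution and resource-occupancy state.

```c
/* Proportional-share sketch (an assumption): grant pipeline slots in
 * proportion to the current occupancy of message vs. signaling traffic. */
#include <stdint.h>

typedef struct { uint32_t msg_occupancy; uint32_t sig_occupancy; } occupancy_t;
typedef struct { uint32_t msg_slots; uint32_t sig_slots; } alloc_t;

alloc_t schedule_resources(occupancy_t occ, uint32_t total_slots)
{
    alloc_t a;
    uint64_t sum = (uint64_t)occ.msg_occupancy + occ.sig_occupancy;
    if (sum == 0)
        a.msg_slots = total_slots / 2;  /* idle system: split evenly */
    else
        a.msg_slots = (uint32_t)((uint64_t)total_slots * occ.msg_occupancy / sum);
    a.sig_slots = total_slots - a.msg_slots;
    return a;
}
```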
Preferably, the operand fetch module 423 is further configured to receive the pipeline-delay, pipeline-discard, and atomic-operation-end signals generated by the write-back module 425. Upon receiving the atomic-operation-end signal, the operand fetch module 423 discards the current operation and returns control to the non-blocking macro-instruction multistage pipeline processor; upon receiving the pipeline-discard signal, it discards the current operation and continues executing subsequent instructions. The instruction fetch module 421 and the decode module 422 are likewise configured to receive the atomic-operation-end signal generated by the write-back module 425, to discard the current operation based on that signal, and then to return control to the non-blocking macro-instruction multistage pipeline processor.
Preferably, after the write-back module generates an atomic-operation-end signal, the non-blocking macro-instruction multistage pipeline processor temporarily stores its current state, then receives and parses a new message or signaling input; based on the parsing result, the multistage pipeline processing unit performs a new round of pipeline operation.
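This atomic-operation-end sequence resembles a lightweight context switch, as the sketch below shows; saved_state_t and the helper functions are hypothetical names, not the patent's interface.

```c
/* Sketch of the atomic-operation-end sequence: save the current pipeline
 * state, accept and parse the next input, then launch a new pipeline round. */
#include <stdint.h>

typedef struct { uint32_t pc; uint32_t regs[8]; } saved_state_t;

extern saved_state_t pipeline_capture_state(void);
extern int           receive_and_parse_input(uint8_t *port_out);
extern void          pipeline_start(uint8_t port);

static saved_state_t shelf;             /* temporary storage for the suspended state */

void on_atomic_end(void)
{
    shelf = pipeline_capture_state();   /* temporarily store the current state */
    uint8_t port;
    if (receive_and_parse_input(&port)) /* parse the new message or signaling */
        pipeline_start(port);           /* begin a new round of pipeline operation */
}
```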
Further, the structure of the non-blocking macro-instruction multistage pipeline processor for a distributed cluster system may also be seen in FIG. 6, where each module of the multistage pipeline processing unit 420 additionally references the input and output of the related data segments, signaling segments, and virtual-translation segments when performing its function. That is, the non-blocking macro-instruction processor of the present application has an instruction buffer, a data buffer, a virtual-to-physical address translation buffer, and a storage system, which supply the multistage pipeline processing unit 420 with the data segments, signaling segments, and virtual-translation segments required for processing.
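For the virtual-to-physical address translation buffer mentioned above, a toy lookup might read as follows; the entry count, page size, and the fall-back to the storage system on a miss are assumptions made for the sketch.

```c
/* Toy virtual-to-physical translation buffer lookup. */
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12                      /* assumed 4 KiB pages */

typedef struct { uint32_t vpn; uint32_t ppn; bool valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
            return true;                    /* hit: translated address */
        }
    }
    return false;                           /* miss: fall back to the storage system */
}
```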
According to the above technical scheme, the execution apparatus of the non-blocking macro-instruction multistage pipeline processor for a distributed cluster system provided by the invention likewise decouples calculation instructions from communication instructions: the two are transmitted over a data channel and a signaling channel respectively, received through the message buffer interface and the signaling buffer interface respectively, and processed in parallel by the control logic unit and the multistage pipeline processing unit, so the apparatus can improve the communication efficiency of the network system when applied to a distributed cluster system.
An embodiment of the invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the above method when executing the program.
An embodiment of the invention also provides a computer-readable storage medium storing a computer program for executing the above method.
As shown in FIG. 7, the electronic device 600 may further include a communication module 110, an input unit 120, an audio processor 130, a display 160, and a power supply 170. It is noted that the electronic device 600 need not include all of the components shown in FIG. 7; in addition, the electronic device 600 may include components not shown in FIG. 7, for which reference may be made to the related art.

As shown in FIG. 7, the central processor 100, sometimes also referred to as a controller or operation controller, may include a microprocessor or other processor device and/or logic device; the central processor 100 receives inputs and controls the operation of the various components of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable device. It may store related information, as well as a program for processing that information, and the central processor 100 can execute the program stored in the memory 140 to implement information storage, processing, and the like.
The input unit 120 provides an input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 140 may be a solid-state memory, such as a read-only memory (ROM), a random access memory (RAM), a SIM card, or the like. It may also be a memory that retains information even when powered down and that can be selectively erased and rewritten with new data, an example of which is sometimes referred to as an EPROM. The memory 140 may also be some other type of device. The memory 140 includes a buffer memory 141 (sometimes referred to as a buffer) and may include an application/function storage 142 for storing application programs and function programs, or flows by which the central processor 100 executes the operations of the electronic device 600.
The memory 140 may also include a data store 143 for storing data such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. A driver storage 144 of the memory 140 may include various drivers of the electronic device for its communication functions and/or for performing other functions of the electronic device (e.g., a messaging application, an address book application, etc.).
The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. A communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide an input signal and receive an output signal, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, etc., may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and to receive audio input from the microphone 132 to implement usual telecommunication functions. The audio processor 130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 130 is also coupled to the central processor 100 so that sound can be recorded locally through the microphone 132 and so that sound stored locally can be played through the speaker 131.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are provided only to help understand the method and core idea of the invention. Meanwhile, those skilled in the art may make variations to the specific embodiments and the scope of application in accordance with the idea of the invention; in view of the above, the content of this description should not be construed as limiting the invention.
Claims (20)
1. An execution method of a non-blocking macroinstruction multistage pipeline processor for a distributed cluster system, the method comprising:
detecting a message buffer interface and a signaling buffer interface based on hardware logic to determine whether a new message or signaling is input;
in response to a new message or signaling input, detecting whether the task identifier and the target port number in the message or signaling data packet meet the current processing requirements; if so, reading the message or signaling data packet from the message buffer interface or the signaling buffer interface directly into a register and entering the next stage of processing; if not, discarding the data packet;
sending the target port number and a start command to a multistage pipeline processing unit to perform instruction fetch, decode, operand fetch, execute, and write-back processing;
and encapsulating and outputting the processed message or signaling through an independent message or signaling logic interface.
2. The execution method of a non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 1, the method further comprising: if a newly input message is detected to be a trigger message, locating it to the corresponding subprogram area for processing according to the message content and a preset offset.
3. The execution method of a non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 1, the method further comprising: when both the message buffer interface and the signaling buffer interface are empty, arbitrating among the macroinstruction programs corresponding to the available port numbers, and after a port number is determined, sending the determined port number and a start command to the multistage pipeline processing unit.
4. The method of claim 1, wherein the multistage pipeline processing unit includes an instruction fetch module, a decode module, an operand fetch module, an execution module, and a write-back module, and wherein sending the target port number and the start command to the multistage pipeline processing unit to perform instruction fetch, decode, operand fetch, execute, and write-back processing includes:
sending the target port number and the start command to the instruction fetch module, which queries the instruction address register corresponding to the target port number to obtain an instruction and sends the instruction to the decode module;
the decode module receiving the instruction, performing preliminary decoding to obtain an instruction code, and outputting the instruction code to the operand fetch module, the decode module entering a delayed-processing state if it encounters a pipeline stall and outputting the decoding result in the next cycle;
the operand fetch module resolving, from the instruction code, the operation codes required by the two-stage logic of the execution module and the write-back module, and passing them onward;
the execution module completing execution of the instruction based on the operation code;
and the write-back module completing write operations, stack operations, and control-flow operations based on the execution result of the execution module and the operation code.
5. The execution method of a non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 4, further comprising:
suspending execution of the current-stage function when the instruction fetch module detects an instruction access conflict, and resuming execution of the current-stage function after the conflict is resolved.
6. The method of claim 5, wherein when the instruction fetch module detects an instruction memory access conflict, the method further comprises: temporarily pushing the instruction with the access conflict onto a stack, and, after the conflict is resolved, popping the instruction from the stack and continuing its execution.
7. The execution method of a non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 1, the method further comprising:
dynamically allocating the resources of the non-blocking macroinstruction multistage pipeline processor according to the current execution status of messages or signaling and the resource occupancy state.
8. The method of claim 4, wherein the operand fetch module is further configured to receive the pipeline-delay, pipeline-discard, and atomic-operation-end signals generated by the write-back module;
when the atomic-operation-end signal is received, the operand fetch module discards the current operation and returns control to the non-blocking macroinstruction multistage pipeline processor;
when the pipeline-discard signal is received, the operand fetch module discards the current operation and continues executing subsequent instructions;
the instruction fetch module and the decode module are also configured to receive the atomic-operation-end signal generated by the write-back module, discard the current operation based on that signal, and then return control to the non-blocking macroinstruction multistage pipeline processor.
9. The method of claim 8, wherein after the write-back module generates the atomic-operation-end signal, the non-blocking macroinstruction multistage pipeline processor temporarily stores its current state, then receives and parses a new message or signaling input, and a new round of pipeline operation is performed by the multistage pipeline processing unit based on the parsing result.
10. A non-blocking macroinstruction multistage pipeline processor for a distributed cluster system, the non-blocking macroinstruction multistage pipeline processor comprising:
a control logic unit, configured to detect a message buffer interface and a signaling buffer interface based on hardware logic to determine whether a new message or signaling input exists, and, in response to a new message or signaling input, to detect whether the task identifier and the target port number in the message or signaling data packet meet the current processing requirements, read the data packet from the message buffer interface or the signaling buffer interface directly into a register and send the target port number and a start command to a multistage pipeline processing unit if so, and discard the data packet if not;
a multistage pipeline processing unit, configured to perform instruction fetch, decode, operand fetch, execute, and write-back processing based on the target port number and the start command; and
an output unit, configured to encapsulate and output the processed message or signaling through an independent message or signaling logic interface.
11. The non-blocking macroinstruction multistage pipeline processor of claim 10, wherein the control logic unit is further configured to: if a newly input message is detected to be a trigger message, locate it to the corresponding subprogram area for processing according to the message content and a preset offset.
12. The non-blocking macroinstruction multistage pipeline processor of claim 10, wherein the control logic unit is further configured to: when both the message buffer interface and the signaling buffer interface are empty, arbitrate among the macroinstruction programs corresponding to the available port numbers, and after a port number is determined, send the determined port number and a start command to the multistage pipeline processing unit.
13. The non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 10, wherein the multistage pipeline processing unit comprises an instruction fetch module, a decode module, an operand fetch module, an execution module, and a write-back module, wherein:
the instruction fetch module is configured to query the instruction address register corresponding to the target port number to obtain an instruction and send the instruction to the decode module;
the decode module is configured to perform preliminary decoding of a received instruction to obtain an instruction code and output the instruction code to the operand fetch module, the decode module entering a delayed-processing state if it encounters a pipeline stall and outputting the decoding result in the next cycle;
the operand fetch module is configured to resolve, from the instruction code, the operation codes required by the two-stage logic of the execution module and the write-back module, and to pass them onward;
the execution module is configured to complete execution of the instruction based on the operation code; and
the write-back module is configured to complete write operations, stack operations, and control-flow operations based on the execution result of the execution module and the operation code.
14. The non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 13, wherein the instruction fetch module is further configured to suspend execution of its current-stage function upon detecting an instruction memory access conflict, and to resume execution of that function after the conflict is resolved.
15. The non-blocking macroinstruction multistage pipeline processor of claim 14, further comprising a stack-push processing unit configured to, when the instruction fetch module detects an instruction access conflict, temporarily push the conflicting instruction onto a stack, the instruction being popped from the stack and its execution continued after the conflict is resolved.
16. The non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 10, further comprising a resource scheduling unit configured to dynamically allocate the resources of the non-blocking macroinstruction multistage pipeline processor according to the current execution status of messages or signaling and the resource occupancy state.
17. The non-blocking macroinstruction multistage pipeline processor of claim 13, wherein the operand fetch module is further configured to receive the pipeline-delay, pipeline-discard, and atomic-operation-end signals generated by the write-back module;
when the atomic-operation-end signal is received, the operand fetch module discards the current operation and returns control to the non-blocking macroinstruction multistage pipeline processor;
when the pipeline-discard signal is received, the operand fetch module discards the current operation and continues executing subsequent instructions;
the instruction fetch module and the decode module are also configured to receive the atomic-operation-end signal generated by the write-back module, discard the current operation based on that signal, and then return control to the non-blocking macroinstruction multistage pipeline processor.
18. The non-blocking macroinstruction multistage pipeline processor for a distributed cluster system of claim 17, wherein after the write-back module generates the atomic-operation-end signal, the non-blocking macroinstruction multistage pipeline processor temporarily stores its current state, then receives and parses a new message or signaling input, and performs a new round of pipeline operation with the multistage pipeline processing unit based on the parsing result.
19. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
20. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410469221.7A | 2024-04-18 | 2024-04-18 | Method and apparatus for executing non-blocking macroinstruction multistage pipeline processor for distributed cluster system
Publications (1)
Publication Number | Publication Date
---|---
CN118349283A | 2024-07-16
Family
ID=91822605
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410469221.7A (Pending) | 2024-04-18 | 2024-04-18 | Method and apparatus for executing non-blocking macroinstruction multistage pipeline processor for distributed cluster system
Country Status (1)
Country | Link
---|---
CN | CN118349283A
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN1437102A | 2002-09-11 | 2003-08-20 | 北京南思达科技发展有限公司 | Macroinstruction collecting symmetrical parallel system structure micro processor
CN101169710A | 2006-10-26 | 2008-04-30 | 中国科学院计算技术研究所 | Method for renaming state register and processor using the method
CN112540789A | 2019-09-23 | 2021-03-23 | 阿里巴巴集团控股有限公司 | Instruction processing device, processor and processing method thereof
CN116932050A | 2022-03-30 | 2023-10-24 | 宸芯科技股份有限公司 | Instruction processing apparatus, microprocessor, and device
CN117270969A | 2023-05-30 | 2023-12-22 | 云南大学 | Four-stage pipeline stack processor
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210406027A1 (en) | Advanced processor architecture | |
KR102228501B1 (en) | Compiler method | |
KR102167059B1 (en) | Synchronization on a multi-tile processing array | |
US8230144B1 (en) | High speed multi-threaded reduced instruction set computer (RISC) processor | |
US6671827B2 (en) | Journaling for parallel hardware threads in multithreaded processor | |
US11321272B2 (en) | Instruction set | |
US6662297B1 (en) | Allocation of processor bandwidth by inserting interrupt servicing instructions to intervene main program in instruction queue mechanism | |
EP1880289B1 (en) | Transparent support for operating system services | |
US10963003B2 (en) | Synchronization in a multi-tile processing array | |
US8417918B2 (en) | Reconfigurable processor with designated processing elements and reserved portion of register file for interrupt processing | |
US11416440B2 (en) | Controlling timing in computer processing | |
CN116414464B (en) | Method and device for scheduling tasks, electronic equipment and computer readable medium | |
US12086592B2 (en) | Processor, processing method, and related device for accelerating graph calculation | |
US9841994B2 (en) | Implementation of multi-tasking on a digital signal processor with a hardware stack | |
KR102145457B1 (en) | Direction indicator | |
CN118349283A (en) | Method and apparatus for executing non-blocking macroinstruction multistage pipeline processor for distributed cluster system | |
US20030014558A1 (en) | Batch interrupts handling device, virtual shared memory and multiple concurrent processing device | |
CN115454506A (en) | Instruction scheduling apparatus, method, chip, and computer-readable storage medium | |
CN117931729B (en) | Vector processor memory access instruction processing method and system | |
US12141092B2 (en) | Instruction set | |
EP4195036A1 (en) | Graph instruction processing method and device | |
CN114638351A (en) | Processing method, NPU and electronic equipment | |
WO2021253359A1 (en) | Image instruction processing method and apparatus | |
CN116685964A (en) | Processing method of operation acceleration, using method of operation accelerator and operation accelerator |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |