CN117931391A - Lossless and efficient data processing method based on RDMA and network interface card
- Publication number: CN117931391A
- Application number: CN202311721650.0A
- Authority: CN (China)
- Prior art keywords: scheduling, queue, WQE, module, stage
- Legal status: Pending
Classifications
- G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F13/28: Handling requests for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), cycle steal
- G06F13/4022: Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
- G06F13/4221: Bus transfer protocol on a parallel input/output bus, e.g. ISA, EISA, PCI, SCSI
- G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU], to service a request
- G06F2213/0026: PCI express
- G06F2213/28: DMA
Abstract
The invention discloses a lossless and efficient data processing method based on RDMA and a network interface card, belonging to the field of data-center data communication. The method solves the problem of efficiently scheduling requests from different QPs when large numbers of QPs are concurrent in RDMA, ensuring that request signals are not lost, QP requests are not reordered, and scheduling cycles are not wasted. Combined with congestion control, it also ensures scheduling fairness across different QPs, prevents large messages or an excess of pending requests in a single scheduling round from blocking other QPs, and thereby solves the head-of-line blocking problem.
Description
Technical Field
The invention belongs to the field of data-center data communication, and particularly relates to a lossless and efficient data processing method based on RDMA and a network interface card.
Background
RDMA technology is increasingly widely deployed in data centers to build lossless networks with high bandwidth, low latency, high throughput, and zero packet loss. When an RDMA-based network interface card (RNIC) processes large-scale concurrent IO requests, coarse-grained IO request scheduling easily causes severe head-of-line blocking, with requests from different QPs blocking one another. IO requests carrying large messages occupy resources for a long time, delaying the processing of small messages, so fairness among multiple QPs cannot be guaranteed. Moreover, under massive concurrent IO, requests that are not processed in time are easily lost, severely degrading upper-layer application performance.
Disclosure of Invention
The invention aims to address the above shortcomings of the prior art by providing a network interface card and data processing method based on RDMA technology that efficiently schedules requests from different QPs when large numbers of QPs are concurrent in a host, ensures that request signals are not lost, QP requests are not reordered, and scheduling cycles are not wasted, and, combined with congestion control, effectively guarantees scheduling fairness across different QPs, thereby solving the head-of-line blocking problem.
The invention adopts the following technical scheme for solving the technical problems:
An RDMA-based network interface card comprises a PCIe BAR register processing module, a first-stage scheduling module, a QPC state table, a DB verification module, a QPN-QID mapping module, a second-stage scheduling module, a scheduling policy configuration module, a congestion control module, a WQE processing module, and a DMA engine:
The PCIe BAR register processing module is used for parsing and processing the doorbell signal DB in software-hardware interaction;
the first-stage scheduling module is responsible for first-stage scheduling of the doorbell signal DB based on PageID and priority, ensuring that no doorbell signal DB is lost;
a QPC state table for buffering QPC state information;
the DB verification module is used for verifying the validity of the DB and whether the state information of the corresponding QP is correct;
the QPN-QID mapping module is used for allocating to the DB an enqueue ID for entering the second-stage scheduling module;
the second-stage scheduling module is responsible for second-stage DB scheduling based on HostID and QPN, ensuring that different QPs of different Hosts obtain fair scheduling;
the scheduling policy configuration module is used for configuring the number of scheduling levels, the size of each scheduler, the scheduling algorithm and other policies in the second-stage scheduling module;
the congestion control module is used for allocating Credit to each QP and controlling the size of the message each QP may send in one scheduling round;
the WQE processing module is used for prefetching and processing WQEs; if a WQE cannot be completely processed, breakpoint information of the DB processing is returned to the second-stage scheduling module;
the DMA engine is used for direct data transfer between the RNIC and the Host;
wherein PageID is a page ID, representing the ID of the BAR space address to which software writes the DB; QPC represents the context information of a QP, caching QP address information; QP represents an RDMA connection queue pair; HostID denotes the serial-number ID of a host; QPN represents the serial-number ID of a QP; Credit is a credit value; WQE represents an RDMA work request; RNIC represents an RDMA network card;
A lossless and efficient RDMA data stream processing method based on the above RDMA network interface card specifically comprises the following steps:
Step 1, when a new WQE is generated in the SQ of the target host, the host generates a doorbell signal DB and sends the doorbell signal DB to the RNIC; namely, DB information is written into Doorbell space allocated by RNIC for the QP through a PCIe interface;
wherein SQ represents a queue of a transmitting end;
Step 2, the RNIC resolves the PageID from the Doorbell register address, together with the Doorbell's QPN and priority (CoS) information, and adds the Doorbell to the first-stage scheduler; the first-stage scheduler adopts a hierarchical scheduling structure supporting M Groups, each Group further divisible into 4 priority queues; an SP+WRR scheduling algorithm is used within each Group and an RR scheduling algorithm between Groups; SP is the strict-priority scheduling algorithm; WRR is the weighted round-robin scheduling algorithm;
Step 3, a first layer scheduler of the first stage scheduling module adopts a configured scheduling algorithm to select a queue to be scheduled, and a DB to be processed is added into an output queue of the scheduler;
Step 4, the second-layer arbiter of the first-stage scheduling module uses RR scheduling to write the DBs output by the first-layer scheduler into its output queue;
Step 5, the DB verification module takes the DB at the queue head out of the second-layer output queue of the first-stage scheduler for validity verification, judging whether the PageID of the DB is consistent with the PageID in the QPC bound to the DB's QPN and whether the QP state is normal; if consistent, the DB is input into the QPN-QID mapping module; if inconsistent, the DB is discarded and an error is returned to the target host; wherein QPN represents the serial-number ID of the QP; QID represents the local ID allocated to the QP in the RNIC, unique within the RDMA system; the QPN-QID mapping module maps and looks up between the QPN and the local QID, using the QID as the index of the local QP Context;
Step 6, after the DB enters the QPN-QID mapping module, the module looks up the corresponding GroupID according to the HostID in the DB and maps the DB by QPN into the input queue corresponding to that GroupID, ensuring that QPs and DBs of the same HostID are placed in the same scheduling group;
wherein GroupID represents the ID of a scheduling group, and different HostIDs can be placed in different scheduling Groups;
wherein HostID represents the ID of the host; in a scenario supporting virtualization, one VM corresponds to an independent HostID;
Step 7, each level scheduler in the second level scheduling module outputs DB to the last output queue of the module;
Step 8, the WQE processing module takes the DB from the head of the output queue and reads the QPC state required for WQE processing according to the QPN; the QPC contains Max_Burst_Size and Max_Batch_WQE_count; Max_Burst_Size represents the maximum number of bytes allowed to be transmitted in a single scheduling period; Max_Batch_WQE_count represents the maximum number of WQEs allowed to be acquired in a single scheduling period, i.e., the maximum number of requests allowed to be processed in a single scheduling period;
Step 9, the WQE processing module requests a Credit from the congestion control module; wherein Credit is the size of the transmittable message allocated by the congestion control algorithm for each QP;
Step 10, the WQE processing module obtains no more than N WQEs from the SQ through the DMA engine according to Max_Batch_WQE_count and the currently cacheable number of WQEs WQE_Available_count, where N = min(WQE_Available_count, Max_Batch_WQE_count); WQE_Available_count represents the maximum number of WQEs that can currently be cached; N represents the number of WQEs processed in a single scheduling period;
Step 11, the WQE processing module processes the cached WQEs one by one, updating the consumption pointer of the WQE in the QPC and the Credit value; after each WQE is processed, the Credit consumed by that WQE is subtracted from the current value; if, while processing a WQE, the remaining Credit is insufficient to process the complete WQE, the interrupt state of the WQE processing is returned after the remaining Credit is consumed, cached in the Interrupted DB State table in the second-stage scheduling module, and the bitmap bit corresponding to the queue is set to 1; wherein the Interrupted DB State includes HostID, QPN, Produce_Index, Target_WQE_Index and Walk_Offset; Produce_Index represents the currently processed WQE_Index; Walk_Offset represents the data pointer up to which the currently unfinished WQE has been sent; Target_WQE_Index represents the WQE position up to which the current DB needs to process; the Credit is a token value representing the number of bytes the QP may currently send; the Interrupted DB State is an interrupted-DB information table caching the interrupt information of DB scheduling; the bitmap is a bit map in which each bit corresponds to a QPN: if the QPN's interrupt information is valid the corresponding bit is set to 1, otherwise 0;
Step 12, when the second-stage scheduling module schedules a queue in the next period, it judges through the bitmap whether an interrupted DB state exists for the queue, i.e., whether the bitmap bit corresponding to the queue is 1; if bitmap = 0, a new DB is read from the scheduling input queue for processing; if bitmap = 1, the DB interrupt state is read preferentially, and the DB interrupt state information is formed into a new DB and sent to the WQE processing module;
Step 13, the WQE processing module, according to the DB information, resumes processing MR data from the Walk_Offset of the WQE pointed to by Produce_Index, judges according to the Credit whether the WQE can be completely processed, and repeats steps 11-13 until the DB is no longer interrupted and can be completely sent out, after which the bitmap bit is cleared to 0; wherein Produce_Index is the consumption pointer, representing the current position up to which the SQ has been processed; Walk_Offset represents the virtual address of the interrupt location within a single WQE's processing.
As a further preferred embodiment of the RDMA-based lossless efficient data stream processing method of the present invention, the basic steps of Doorbell joining the first stage scheduler include:
Step 2.1: the first stage scheduling module obtains the GroupID of the target queue according to the Hash (PageID), and adds Doorbell into the corresponding priority queue according to the CoS selection in the DB; if the queue is not full, directly writing Doorbell to the tail of the queue, and if Doorbell of the QPN exists in the queue, merging the two into 1 DB; if the queue is full, turning to step 2.2;
wherein CoS (Class of Service) represents a priority channel, typically supporting 8 priorities;
Step 2.2: the first stage scheduling module records Doorbell information in an Overflow Buffer; the Overflow Buffer is a shared Buffer of all first-stage dispatch queues; allocating an entry for each priority of each GroupID in an Overflow Buffer, and caching DB information in sequence in a linked list mode; if Doorbell is added into the Overflow Buffer, if the information of the QPN where the Doorbell is located already exists in the Overflow Buffer, the old Doorbell is replaced by the new Doorbell, that is, in the Buffer, only one latest Doorbell is cached for the same QP; and recording whether each priority of the GroupID is Doorbell or not in the Overflow Buffer through a bitmap, if Doorbell information is available, setting the bit position to be 1, otherwise, setting the bit position to be 0.
As a further preferable scheme of the RDMA-based lossless and efficient data stream processing method, only the QPN of the DB is cached in the Overflow Buffer; when the QPN is scheduled, the latest WQE_Index is obtained by reading the DB Record cached in the target host, a new DB is regenerated and added to the first-layer scheduling output queue of the scheduler, and the next layer of scheduling is executed; wherein the DB Record is content cached on the host side, recording the producer index and consumer index of the SQ.
As a further preferred scheme of the RDMA-based lossless efficient data stream processing method of the present invention, the scheduling method of the first-layer arbiter (Arbiter) for different CoS queues within the same Group is as follows:
Step 3.1, if a queue with GroupID = n and CoS = m is currently polled, it is first judged whether the queue is empty; if not empty, the DB at the queue head is taken out and added to the scheduler's output queue; if the queue is empty, go to step 3.2;
Step 3.2, the bitmap of the queue is read from the Overflow Buffer to judge whether DB information for the queue is cached; if so, the DB information is taken from the head of the linked list and added to the corresponding scheduling output queue, then deleted from the linked list; this round of scheduling ends and the next scheduling period is awaited; if not, the queue is skipped, the next queue is polled, and the round of scheduling ends; when a scheduling queue is full, its DB information is cached in the Overflow Buffer, which is shared by all scheduling queues.
As a further preferred scheme of the RDMA-based lossless efficient data stream processing method of the present invention, HostID may be represented by PF ID + VF ID; the PF ID is the physical-function ID, representing the ID of a physical channel in PCIe; the VF ID is the virtual-function ID, representing the ID of a PCIe virtual channel, with one virtual channel corresponding to one virtual machine.
As a further preferable scheme of the lossless and efficient data stream processing method based on RDMA, the second-stage scheduling module also adopts hierarchical multi-level scheduling, and the number of scheduling levels, the Group size of each layer, and the scheduling algorithm of each layer's scheduler can be configured through the scheduling policy module.
As a further preferred embodiment of the RDMA-based lossless efficient data stream processing method of the present invention, the scheduling algorithms include, but are not limited to, the SP, RR, WRR and DWRR scheduling algorithms;
wherein SP is the strict-priority scheduling algorithm;
RR is the round-robin scheduling algorithm;
WRR is the weighted round-robin scheduling algorithm;
DWRR is the deficit weighted round-robin scheduling algorithm.
As a further preferred scheme of the RDMA-based lossless efficient data stream processing method of the present invention, the first-stage input queue of the second-stage scheduling module contains 2 cache entries (Entry) for caching DBs, where the Entry at the queue head represents the DB being scheduled and the second Entry represents the DB to be scheduled next.
As a further preferred solution of the RDMA-based lossless efficient data stream processing method of the present invention, a DB joins the input queue according to the following method: if the queue is empty, the new DB is added at the head of the queue; if the queue holds only 1 DB, the new DB is added at the tail; if a DB already occupies the tail, the new DB replaces the old DB cached in the queue.
As a further preferred scheme of the RDMA-based lossless efficient data stream processing method of the present invention, in step 11 the Interrupted DB State includes HostID, QPN, Produce_Index, Target_WQE_Index and Walk_Offset; Produce_Index indicates the currently processed WQE_Index, Walk_Offset indicates the data pointer up to which the currently unfinished WQE has been sent, and Target_WQE_Index indicates the WQE position up to which the current DB needs to process.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. Multi-stage scheduling is realized, with scheduling queues designed at different scheduling granularities, avoiding the severe head-of-line blocking caused by executing all DBs serially;
2. The first-stage scheduler module adopts a general DB allocation method based on the process PageID; unlike the traditional method of binding DB type to address, it decouples DB type from address, improving the utilization of DB space;
3. In the first-stage scheduler module, the DB queue management and scheduling methods are designed based on comprehensive factors such as Host, process, and priority, enabling scheduling at different granularities; without reordering requests of the same QP, high-priority requests can be scheduled effectively, thereby solving the head-of-line blocking problem under large-scale QP concurrency;
4. In the first-stage scheduler, a lossless DB processing method with a shared Overflow Buffer ensures that DB information is not lost under high concurrency, realizing lossless processing of requests; meanwhile, only DB state information is cached, and the evicted DB information is recovered by reading the DB Record, effectively saving cache;
5. In the second-stage scheduling module, the maximum amount of data generated by the WQEs processed in each scheduling period is limited by constraints such as Credit, WQE count and message size, ensuring fairness of multi-QP scheduling, preventing large messages from permanently occupying resources and blocking other QPs, and relieving head-of-line blocking;
6. In the second-stage scheduling module, by adding an Interrupted DB State per scheduling queue to cache the DB processing interrupt state, WQE breakpoint resumption is supported during scheduling, fine-grained QP scheduling is realized, and efficient lossless processing of requests is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a lossless and efficient data processing method applied to a bare metal scene according to embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the lossless and efficient data processing method applied to a virtual-machine scenario according to embodiment 2 of the present invention.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The objects and effects of the present invention will become more apparent from the following detailed description of the preferred embodiments and the accompanying drawings, it being understood that the specific embodiments described herein are merely illustrative of the invention and not limiting thereof.
The method provided by the embodiments of the present disclosure is applied to a network interface card (RNIC) in a data storage device, implemented based on remote direct memory access (RDMA).
RDMA protocol is supported in the target host, and data transmission and reception are realized through a Queue Pair (QP). Each QP contains a Send Queue (SQ) and a Receive Queue (RQ), where the SQ is responsible for sending messages and the RQ is responsible for receiving messages, and each QP's SQ and RQ may be associated with a Completion Queue (CQ), respectively. Each QP has a locally unique QP Number (QPN). Some state information of QP, including QPN, CQN, QP address, QP length, etc. is stored in QP Context (QPC), and a QPC table is maintained in RNIC to cache registered QPC information.
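For illustration only, the QPC fields named above can be modeled as in the following sketch (Python is used as pseudocode here and in the later sketches); the field names, types, and default values are assumptions for the sketch, not the layout claimed by the invention:

```python
# Minimal sketch of a QPC entry holding the fields this description names;
# names, types, and defaults are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class QPC:
    qpn: int                        # locally unique QP number
    cqn: int                        # associated completion queue number
    sq_addr: int                    # host address of the SQ ring
    sq_len: int                     # number of WQE slots in the SQ
    page_id: int                    # BAR page bound to this QP's doorbell
    qp_state: str = "RTS"           # simplified QP state flag
    produce_index: int = 0          # consumption pointer: next WQE to process
    max_burst_size: int = 65536     # max bytes sent per scheduling period
    max_batch_wqe_count: int = 8    # max WQEs fetched per scheduling period
```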
Multiple send queues, e.g., SQ1, SQ2, SQ3, may exist in the target host. When the target host issues a Work Request (WR), a Work Queue Element (WQE) is written to the available space in the SQ, and the driver notifies the network card by writing a doorbell signal (Doorbell, DB) to the RNIC bound to that SQ. After the RNIC receives the DB, it reads the QPC Table using the information carried in the DB, such as the QPN and WQE_index, calculates the address of the WQE within the SQ, and then initiates a DMA request to read and process the WQE contents.
The RNIC and the target host are connected by a PCIe bus, and the DB is written into a PCIe BAR register. In the RNIC, PCIe BAR space is managed in units of pages (Pages); different processes are allocated different BAR space registers, and the BAR spaces are isolated from each other. N general-purpose DBs are allocated in each Page; when the target host sends a Doorbell to the RNIC, the Doorbell is written through the PCIe interface to an available Doorbell in the BAR space bound to the process. A 2-bit DB_Type field in the Doorbell identifies the Doorbell type: DB_Type = 00 is a reserved value, DB_Type = 01 represents a CQ Doorbell, DB_Type = 10 represents an SQ Doorbell, and DB_Type = 11 represents an EQ/AEQ Doorbell.
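As a rough sketch of the doorbell handling just described: only the 2-bit DB_Type encoding is specified above, so every other bit position and the 4 KiB page size in the decoder below are assumptions:

```python
# Illustrative doorbell decode; the bit layout beyond DB_Type is assumed.
DB_TYPES = {0b00: "reserved", 0b01: "CQ", 0b10: "SQ", 0b11: "EQ_AEQ"}
PAGE_SIZE = 4096  # assumed BAR page granularity

def decode_doorbell(raw: int, bar_offset: int) -> dict:
    return {
        "db_type": DB_TYPES[raw & 0x3],      # bits [1:0], per the text
        "cos": (raw >> 2) & 0x7,             # assumed bits [4:2], 8 priorities
        "qpn": (raw >> 8) & 0xFFFFFF,        # assumed bits [31:8]
        "wqe_index": raw >> 32,              # assumed bits [63:32]
        "page_id": bar_offset // PAGE_SIZE,  # page the BAR write landed in
    }
```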
The essence of the DB is a request notification signal issued by the target host; the RNIC processes the corresponding WR or WQE according to the DB issuing order and the carried parameters, so processing a DB is essentially processing a WR. To solve the problems of request scheduling and processing under large-scale QP concurrency, the present invention effectively relieves the head-of-line blocking caused by scheduling across different QPs, guarantees scheduling fairness among QPs, and ensures that no DB is lost under request bursts.
As shown in Fig. 1 and Fig. 2, the efficient and lossless DB signal processing method proposed by the present invention comprises the following basic steps:
Step 1: when a new WQE is generated in the SQ in the target host, the host generates a doorbell DB that is sent to the RNIC. I.e., writing DB information over the PCIe interface into Doorbell space allocated by RNIC for that QP.
Wherein SQ represents a queue of a transmitting end;
Step 2: Based on the Doorbell register address, the RNIC resolves information such as the PageID of the address and the Doorbell's QPN and priority (CoS), and adds the Doorbell to the first-stage scheduler. The first-stage scheduler adopts a hierarchical scheduling structure supporting M Groups; each Group can be further divided into 4 priority queues; an SP+WRR scheduling algorithm is used within each Group, and an RR scheduling algorithm between Groups. SP is the strict-priority scheduling algorithm; WRR is the weighted round-robin scheduling algorithm.
The basic steps for a Doorbell to join the first-stage scheduler include:
Step 2.1: the first stage dispatch module obtains the GroupID of the target queue according to Hash (PageID) and adds Doorbell to the corresponding priority queue according to CoS selection in DB. If the queue is not full, doorbell is written directly to the tail of the queue, if there is Doorbell of the QPN in the queue, it can be combined into 1 DB. If the queue is full, go to step 2.2.
Step 2.2: the first stage scheduling module records Doorbell information in the Overflow Buffer. The Overflow Buffer is a shared Buffer of all first-stage dispatch queues. And allocating an entry for each priority of each GroupID in the Overflow Buffer, and caching DB information in sequence in a linked list mode. If Doorbell is added to the Overflow Buffer, the information of the QPN where the Doorbell is located already exists in the Overflow Buffer, then the old Doorbell is replaced by the new Doorbell, that is, in the Buffer, only one latest Doorbell is cached for the same QP.
And recording whether each priority of the GroupID is Doorbell or not in the Overflow Buffer through a bitmap, if Doorbell information is available, setting the bit position to be 1, otherwise, setting the bit position to be 0.
Optionally, to save the Buffer, only the QPN of the DB is cached in the Overflow Buffer, and when the QPN is scheduled, the latest wqe_index is obtained by reading the DB Record cached in the target host, and a new DB is regenerated and added to the first layer scheduling output queue of the scheduler, so as to execute the next layer scheduling.
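A minimal sketch of steps 2.1-2.2 follows, with assumed group count, queue depth, and doorbell fields; the dictionary keyed by QPN stands in for the hardware linked list, keeping only the latest Doorbell per QP as described:

```python
# Sketch of steps 2.1-2.2: hash the PageID to a Group, pick the CoS queue,
# merge doorbells for the same QPN, and spill to the shared Overflow Buffer
# with a per-(Group, CoS) bitmap when the queue is full.
from collections import deque

NUM_GROUPS, NUM_COS, QUEUE_DEPTH = 4, 4, 16   # assumed M = 4, depth 16

queues = {(g, c): deque() for g in range(NUM_GROUPS) for c in range(NUM_COS)}
overflow = {(g, c): {} for g in range(NUM_GROUPS) for c in range(NUM_COS)}
ovfl_bitmap = {(g, c): 0 for g in range(NUM_GROUPS) for c in range(NUM_COS)}

def stage1_enqueue(db: dict) -> None:
    g = db["page_id"] % NUM_GROUPS       # modulo stands in for Hash(PageID)
    c = db["cos"] % NUM_COS              # assumed 8 CoS folded onto 4 queues
    for queued in queues[(g, c)]:
        if queued["qpn"] == db["qpn"]:   # same QPN already queued: merge
            queued["wqe_index"] = db["wqe_index"]
            return
    if len(queues[(g, c)]) < QUEUE_DEPTH:
        queues[(g, c)].append(db)        # queue not full: write to the tail
        return
    overflow[(g, c)][db["qpn"]] = db     # full: keep only the newest DB
    ovfl_bitmap[(g, c)] = 1              # mark this (Group, CoS) non-empty
```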
Step 3: the first layer scheduler of the first stage scheduler selects a queue to be scheduled using a configured scheduling algorithm, and adds the DB to be processed to the output queue of the scheduler, wherein the scheduling algorithm includes, but is not limited to, round Robin (RR) scheduling and the like.
The scheduling method of the first layer Arbiter for different CoS queues in the same Group is as follows:
Step 3.1: If the current poll reaches a queue with GroupID = n and CoS = m, it is first determined whether the queue is empty. If not empty, the DB at the queue head is taken out and added to the scheduler's output queue; if the queue is empty, go to step 3.2.
Step 3.2: The bitmap of the queue is read from the Overflow Buffer to judge whether DB information for this queue is cached. If so, the DB information is taken from the head of the linked list and added to the corresponding scheduling output queue, then deleted from the linked list; this round of scheduling ends and the next scheduling cycle is awaited. If not, the queue is skipped, the next queue is polled, and the round of scheduling ends.
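Continuing the sketch above, the per-queue arbitration of steps 3.1-3.2 might look as follows; the round-robin walk over (Group, CoS) pairs that invokes this routine is omitted:

```python
# Sketch of steps 3.1-3.2: serve the head of a non-empty queue; otherwise
# consult the Overflow Buffer bitmap and replay the oldest spilled doorbell.
def poll_cos_queue(g: int, c: int, output_queue: deque) -> bool:
    if queues[(g, c)]:                        # step 3.1: non-empty queue
        output_queue.append(queues[(g, c)].popleft())
        return True
    if ovfl_bitmap[(g, c)]:                   # step 3.2: replay from overflow
        qpn = next(iter(overflow[(g, c)]))    # head of the linked list
        output_queue.append(overflow[(g, c)].pop(qpn))
        if not overflow[(g, c)]:
            ovfl_bitmap[(g, c)] = 0           # buffer drained: clear the bit
        return True
    return False                              # skip; RR moves to next queue
```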
Step 4: the second stage Arbiter of the first stage scheduler writes the DB output by the first stage Arbiter into the output queue using RR scheduling.
Step 5: and the DB verification module takes out the DB of the queue head from the output queue of the second stage Arbiter of the first stage scheduler to perform validity verification, and judges whether the PageID of the DB is consistent with PageID in the QPC bound with the QPN of the DB and whether the QP is in a normal state. If the data are consistent, the DB is input into the QPN-QID mapping module, and if the data are inconsistent, the DB is discarded and an error is returned to the target host. Wherein QPN represents the serial number ID of QP; QID represents the local ID allocated for QP in RNIC, unique in RDMA system; the QPN-QID mapping module is used for mapping and searching the QPN and the local QID, and takes the QID as the index of the local QP Context.
Step 6: after the DB inputs the QPN-QID mapping module, the module searches the corresponding GroupID according to the HostID in the DB, and maps the GroupID into the input queue corresponding to the GroupID according to the QPN, thereby ensuring that the QPDB with the same HostID is placed in the same scheduling group. Wherein GroupID represents the ID of a dispatch group, and different HostIDs can be placed on different dispatches Goup;
where HostID represents the host's ID, if in a scenario supporting virtualization, one VM corresponds to an independent HostID.
Specifically, the HostID may be represented by PF ID + VF ID. The PF ID is the physical-function ID, representing the ID of a physical channel in PCIe; the VF ID is the virtual-function ID, representing the ID of a PCIe virtual channel, with one virtual channel corresponding to one virtual machine.
The second-stage scheduling module also adopts hierarchical multi-level scheduling, and the number of scheduling levels, the Group size of each layer, and the scheduling algorithm of each layer's scheduler can be configured through the scheduling policy module. The scheduling algorithms include, but are not limited to, the SP, RR, WRR and DWRR scheduling algorithms.
The first-stage input queue of the second-stage scheduling module contains 2 entries (Entry) for caching DBs, where the Entry at the queue head represents the DB being scheduled and the second Entry represents the DB to be scheduled next. A DB joins the input queue according to the following method:
if the queue is empty, adding a new DB into the head of the queue; if the queue has only 1 DB, adding a new DB into the tail of the queue; if there is also a DB at the end of the queue, the new DB is cached in the queue in place of the old DB.
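A sketch of this 2-entry rule, with the list positions standing in for the two hardware entries:

```python
# Sketch of the 2-entry input-queue rule: slot 0 is the DB being scheduled,
# slot 1 the DB to be scheduled next; a third arrival overwrites slot 1 so
# only the newest pending DB is kept.
def input_queue_insert(q: list, db: dict) -> None:
    if len(q) < 2:
        q.append(db)    # empty queue -> head; one entry -> tail
    else:
        q[1] = db       # full: replace the old pending DB with the new one
```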
Step 7: each level scheduler in the second level scheduling module outputs the DB to the module's last output queue.
Step 8: the WQE processing module fetches DB from the head of the output queue and reads the QPC state required for processing the WQE according to QPN. In QPC, max_burst_size, max_batch_wqe_count are included, representing the maximum number of messages that a single WQE can send and the maximum number of WQEs that can be processed at a time, respectively. Max_burst_size represents the maximum number of bytes allowed to be transmitted in a single scheduling period; max_Batch_WQE_count represents the maximum number of WQEs that a single scheduling period is allowed to acquire, i.e., the maximum number of messages that a single scheduling processing period is allowed to process.
Step 9: the WQE processing module requests a Credit, which is the size of the transmittable message allocated by the congestion control algorithm for each QP, from the congestion control module according to (HostID, QPN).
Step 10: the WQE processing module obtains not more than N WQEs from the SQ through the DMA engine according to Max_batch_WQE_count and the number of WQEs which can be cached currently, wherein N=min (WQE_available_count, max_batch_WQE_count). Where n=min (wqe_available_count, max_batch_wqe_count); WQE_available_count represents the maximum number of WQEs Available; n represents the number of WQEs that can be processed in a single scheduling cycle, where N takes the minimum of WQE_available_count and Max_batch_WQE_count.
Step 11: the WQE processing module processes cached WQEs one by one, and updates a consumption pointer of the WQE in the QPC and a Credit value which is obtained by subtracting the Credit value consumed by the WQE from the current value when each WQE is processed. If the residual Credit is insufficient to process a complete WQE in the process of processing the WQE, returning the interrupt state of the WQE processing after the residual Credit is consumed, caching the interrupt state in a Interupted DB State table in a second scheduling module, and setting the bitmap corresponding to the queue to 1. Wherein Interupted DB State includes HostID, QPN and Produce _Index, target_WQE_Index and Walk_Offset, produce _Index represents currently processed WQE_Index, walk_offset represents a data pointer that currently unprocessed WQE has sent, target_WQE_Index represents a WQE position that the current DB needs to process; crodit is a token value representing the number of bytes currently sent by the QP; interupted DB State is an interrupt DB information table for caching interrupt information of DB schedule; the bitmap is a bitmap, each bit in the bitmap corresponds to a QPN, if the QPN information is valid, the corresponding bit position 1 is set, and if invalid, 0 is set.
Step 12: when the second stage scheduling module schedules to a certain queue in the next period, firstly judging whether the DB state of the interrupt exists in the queue through the Bitmap, namely whether the Bitmap corresponding to the queue is 1. If bitmap=0, reading a new DB process from the dispatch input queue; if bitmap=1, the DB interrupt state is read preferentially, and the DB interrupt state information is formed into a new DB and sent to the WQE processing module.
Step 13: and the WQE processing module starts to continuously process the MR data from the walk_Offset of the WQE pointed by Produce _Index according to the information of the DB, judges whether the WQE can be completely processed according to the Credit, and repeats the steps 11-13 until the DB is not interrupted any more and can be completely sent out, and then sets the bitmap to 0. Wherein Produce _index represents a consumption pointer, and represents the current location where SQ is processed; walk_offset represents the virtual address of the interrupt location of a single WQE process.
This method mainly targets the case where one WQE corresponds to a large message, or one DB requires multiple WQEs to be processed, consuming more Credit, so that several scheduling periods may be needed to process a single DB.
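A condensed sketch of steps 10-13 follows; the fetch_wqes callback, the WQE fields length and addr, and the keying of the Interrupted DB State table by (HostID, QPN) are assumptions made so the breakpoint bookkeeping can be shown end to end:

```python
# Sketch of steps 10-13: fetch at most N = min(WQE_Available_count,
# Max_Batch_WQE_count) WQEs, spend Credit per WQE, and on exhaustion park
# an Interrupted-DB-State record so the next period resumes mid-WQE.
def process_db(db, qpc, credit, wqe_available_count, fetch_wqes,
               interrupted, bitmap):
    n = min(wqe_available_count, qpc.max_batch_wqe_count)   # step 10
    for wqe in fetch_wqes(qpc, qpc.produce_index, n):       # DMA read from SQ
        if credit < wqe["length"]:                          # step 11: Credit
            interrupted[(db["host_id"], db["qpn"])] = {     # runs out mid-WQE
                "produce_index": qpc.produce_index,
                "target_wqe_index": db["wqe_index"],
                "walk_offset": wqe["addr"] + credit,        # resume point
            }
            bitmap[db["qpn"]] = 1          # step 12 will replay this DB first
            return 0                       # remaining Credit fully consumed
        credit -= wqe["length"]            # whole WQE sent
        qpc.produce_index += 1             # advance the consumption pointer
    bitmap[db["qpn"]] = 0                  # step 13: DB completely processed
    return credit
```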
This completes the entire two-stage DB scheduling and processing method. Through this method, no DB is lost, the head-of-line blocking problem of QP scheduling is effectively reduced, and fairness of WQE processing under multi-QP concurrency is guaranteed.
The invention provides a lossless and efficient RDMA-based data processing method and network interface card, with 2 embodiments targeting a bare-metal scenario and a cloud-host scenario respectively. In both scenarios the basic modules and the method are essentially the same, as shown in the technical scheme above, but the processing of each module in the RNIC differs slightly between the two.
In the RNIC, the BAR space bound to each Host is unique, and therefore the PageID is unique, whether in the bare-metal scenario or the cloud-host scenario; the DBs generated in both cases join the first-stage scheduler through the same process.
In the bare-metal scenario, only one host creates QPs and issues requests, and the QPN is unique. In the cloud-host scenario, multiple virtual machines may exist in the host and can be regarded as multiple hosts; each virtual machine may create QPs and issue requests, the QPN is unique within a VM, and QPNs may be identical across different VMs. The QPC table must therefore be maintained and managed at Host granularity.
When performing the QPN-QID mapping, if there is only one host, as in the bare-metal scenario, that host may occupy the scheduling resources of all Groups; if there are multiple hosts, as in the cloud-host scenario, Groups must be allocated per host.
Likewise, in the congestion control module a Credit must be maintained for each QP, and the management and maintenance of the Credit is also differentiated by Host and QPN. Wherein PageID is a page ID, representing the ID of the BAR space address to which software writes the DB; QPC represents the context information of a QP, caching QP address information; QP represents an RDMA connection queue pair; HostID denotes the serial-number ID of a host; QPN represents the serial-number ID of a QP; Credit is a credit value; WQE represents an RDMA work request; RNIC represents an RDMA network card.
The multi-stage scheduling designs scheduling queues at different scheduling granularities, avoiding the severe head-of-line blocking caused by executing all DBs serially.
The first-stage scheduler module adopts a general DB allocation method based on the process PageID; unlike the traditional method of binding DB type to address, it decouples DB type from address and improves the utilization of DB space.
In the first-stage scheduler module, the DB queue management and scheduling methods are designed based on comprehensive factors such as Host, process, and priority, enabling scheduling at different granularities; without reordering requests of the same QP, high-priority requests can be scheduled effectively, thereby solving the head-of-line blocking problem under large-scale QP concurrency.
In the first-stage scheduler, a lossless DB processing method with a shared Overflow Buffer ensures that DB information is not lost under high concurrency, realizing lossless processing of requests; meanwhile, only DB state information is cached, and the evicted DB information is recovered by reading the DB Record, effectively saving cache.
In the second-stage scheduling module, the amount of data generated by the WQEs processed in each scheduling period is limited by the Credit and WQE-count constraints, guaranteeing fairness of multi-QP scheduling, preventing large messages from permanently occupying resources and blocking other QPs, and relieving the head-of-line blocking problem.
In the second-stage scheduling module, by adding an Interrupted DB State per scheduling queue to cache the DB processing interrupt state, WQE breakpoint resumption is supported during scheduling, fine-grained QP scheduling is realized, and efficient lossless processing of requests is guaranteed.
It will be appreciated by persons skilled in the art that the foregoing describes a preferred embodiment of the invention and is not intended to limit the invention to the specific embodiments described; modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for elements thereof. All technical features of the embodiments may be freely combined according to actual needs within the scope of the present invention.
Finally, it should be noted that: the foregoing description is only illustrative of the preferred embodiments of the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements or changes may be made without departing from the spirit and principles of the present invention.
Claims (10)
1. An RDMA-based network interface card, characterized in that it comprises a PCIe BAR register processing module, a first-stage scheduling module, a QPC state table, a DB verification module, a QPN-QID mapping module, a second-stage scheduling module, a scheduling policy configuration module, a congestion control module, a WQE processing module, and a DMA engine:
the PCIe BAR register processing module is used for parsing and processing the doorbell signal DB in software-hardware interaction;
the first-stage scheduling module is responsible for first-stage scheduling of the doorbell signal DB based on PageID and priority, ensuring that no doorbell signal DB is lost;
a QPC state table for buffering QPC state information;
the DB verification module is used for verifying the validity of the DB and whether the state information of the corresponding QP is correct;
the QPN-QID mapping module is used for allocating to the DB an enqueue ID for entering the second-stage scheduling module;
the second-stage scheduling module is responsible for second-stage DB scheduling based on HostID and QPN, ensuring that different QPs of different Hosts obtain fair scheduling;
the scheduling policy configuration module is used for configuring the number of scheduling levels, the size of each scheduler, the scheduling algorithm and other policies in the second-stage scheduling module;
the congestion control module is used for allocating Credit to each QP and controlling the size of the message each QP may send in one scheduling round;
the WQE processing module is used for prefetching and processing WQEs; if a WQE cannot be completely processed, breakpoint information of the DB processing is returned to the second-stage scheduling module;
the DMA engine is used for direct data transfer between the RNIC and the Host;
wherein PageID is a page ID, representing the ID of the BAR space address to which software writes the DB; QPC represents the context information of a QP, caching QP address information; QP represents an RDMA connection queue pair; HostID denotes the serial-number ID of a host; QPN represents the serial-number ID of a QP; Credit is a credit value; WQE represents an RDMA work request; RNIC represents an RDMA network card.
2. A lossless and efficient RDMA data stream processing method based on the RDMA network interface card of claim 1, characterized in that it specifically comprises the following steps:
Step 1, when a new WQE is generated in the SQ of the target host, the host generates a doorbell signal DB and sends it to the RNIC; namely, DB information is written through the PCIe interface into the Doorbell space allocated by the RNIC for that QP;
wherein SQ represents a queue of the transmitting end;
Step 2, the RNIC resolves the PageID from the Doorbell register address, together with the Doorbell's QPN and priority (CoS) information, and adds the Doorbell to the first-stage scheduler; the first-stage scheduler adopts a hierarchical scheduling structure supporting M Groups, each Group further divisible into 4 priority queues; an SP+WRR scheduling algorithm is used within each Group and an RR scheduling algorithm between Groups; SP is the strict-priority scheduling algorithm; WRR is the weighted round-robin scheduling algorithm;
Step 3, the first-layer scheduler of the first-stage scheduling module selects a queue to schedule using the configured scheduling algorithm, and adds the DB to be processed to the scheduler's output queue;
Step 4, the second-layer arbiter of the first-stage scheduling module uses RR scheduling to write the DBs output by the first-layer scheduler into its output queue;
Step 5, the DB verification module takes the DB at the queue head out of the second-layer output queue of the first-stage scheduler for validity verification, judging whether the PageID of the DB is consistent with the PageID in the QPC bound to the DB's QPN and whether the QP state is normal; if consistent, the DB is input into the QPN-QID mapping module; if inconsistent, the DB is discarded and an error is returned to the target host; wherein QPN represents the serial-number ID of the QP; QID represents the local ID allocated to the QP in the RNIC, unique within the RDMA system; the QPN-QID mapping module maps and looks up between the QPN and the local QID, using the QID as the index of the local QP Context;
Step 6, after the DB enters the QPN-QID mapping module, the module looks up the corresponding GroupID according to the HostID in the DB and maps the DB by QPN into the input queue corresponding to that GroupID, ensuring that QPs and DBs of the same HostID are placed in the same scheduling group;
wherein GroupID represents the ID of a scheduling group, and different HostIDs can be placed in different scheduling Groups;
wherein HostID represents the ID of the host; in a scenario supporting virtualization, one VM corresponds to an independent HostID;
Step 7, each-level scheduler in the second-stage scheduling module outputs DBs to the module's final output queue;
Step 8, the WQE processing module takes the DB from the head of the output queue and reads the QPC state required for WQE processing according to the QPN; the QPC contains Max_Burst_Size and Max_Batch_WQE_count; Max_Burst_Size represents the maximum number of bytes allowed to be transmitted in a single scheduling period; Max_Batch_WQE_count represents the maximum number of WQEs allowed to be acquired in a single scheduling period, i.e., the maximum number of requests allowed to be processed in a single scheduling period;
Step 9, the WQE processing module requests a Credit from the congestion control module; wherein the Credit is the size of the transmittable message allocated by the congestion control algorithm for each QP;
Step 10, the WQE processing module obtains no more than N WQEs from the SQ through the DMA engine according to Max_Batch_WQE_count and the currently cacheable number of WQEs WQE_Available_count, where N = min(WQE_Available_count, Max_Batch_WQE_count); WQE_Available_count represents the maximum number of WQEs that can currently be cached; N represents the number of WQEs processed in a single scheduling period;
Step 11, the WQE processing module processes the cached WQEs one by one, updating the consumption pointer of the WQE in the QPC and the Credit value; after each WQE is processed, the Credit consumed by that WQE is subtracted from the current value; if, while processing a WQE, the remaining Credit is insufficient to process the complete WQE, the interrupt state of the WQE processing is returned after the remaining Credit is consumed, cached in the Interrupted DB State table in the second-stage scheduling module, and the bitmap bit corresponding to the queue is set to 1; wherein the Interrupted DB State includes HostID, QPN, Produce_Index, Target_WQE_Index and Walk_Offset; Produce_Index represents the currently processed WQE_Index; Walk_Offset represents the data pointer up to which the currently unfinished WQE has been sent; Target_WQE_Index represents the WQE position up to which the current DB needs to process; the Credit is a token value representing the number of bytes the QP may currently send; the Interrupted DB State is an interrupted-DB information table caching the interrupt information of DB scheduling; the bitmap is a bit map in which each bit corresponds to a QPN: if the QPN's interrupt information is valid the corresponding bit is set to 1, otherwise 0;
Step 12, when the second-stage scheduling module schedules a queue in the next period, it judges through the bitmap whether an interrupted DB state exists for the queue, i.e., whether the bitmap bit corresponding to the queue is 1; if bitmap = 0, a new DB is read from the scheduling input queue for processing; if bitmap = 1, the DB interrupt state is read preferentially, and the DB interrupt state information is formed into a new DB and sent to the WQE processing module;
Step 13, the WQE processing module, according to the DB information, resumes processing MR data from the Walk_Offset of the WQE pointed to by Produce_Index, judges according to the Credit whether the WQE can be completely processed, and repeats steps 11-13 until the DB is no longer interrupted and can be completely sent out, after which the bitmap bit is cleared to 0; wherein Produce_Index is the consumption pointer, representing the current position up to which the SQ has been processed; Walk_Offset represents the virtual address of the interrupt location within a single WQE's processing.
3. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that the basic steps for a Doorbell to join the first-stage scheduler include:
Step 2.1: the first-stage scheduling module obtains the GroupID of the target queue according to Hash(PageID), and adds the Doorbell to the corresponding priority queue selected by the CoS in the DB; if the queue is not full, the Doorbell is written directly to the tail of the queue, and if a Doorbell of the same QPN already exists in the queue, the two are merged into 1 DB; if the queue is full, go to step 2.2;
wherein CoS (Class of Service) represents a priority channel, typically supporting 8 priorities;
Step 2.2: the first-stage scheduling module records the Doorbell information in an Overflow Buffer; the Overflow Buffer is a buffer shared by all first-stage dispatch queues; an entry is allocated for each priority of each GroupID in the Overflow Buffer, and DB information is cached in order in a linked list; when a Doorbell is added to the Overflow Buffer and information for that Doorbell's QPN already exists in the buffer, the old Doorbell is replaced by the new one, that is, only the latest Doorbell is cached for any given QP; a bitmap records whether each priority of each GroupID has a Doorbell in the Overflow Buffer: if Doorbell information is present the bit is set to 1, otherwise 0.
4. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that only the QPN of the DB is cached in the Overflow Buffer; when the QPN is scheduled, the latest WQE_Index is obtained by reading the DB Record cached in the target host, a new DB is regenerated and added to the first-layer scheduling output queue of the scheduler, and the next layer of scheduling is executed; wherein the DB Record is content cached on the host side, recording the producer index and consumer index of the SQ.
5. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that the scheduling method of the first-layer arbiter (Arbiter) for different CoS queues within the same Group is as follows:
Step 3.1, if a queue with GroupID = n and CoS = m is currently polled, it is first judged whether the queue is empty; if not empty, the DB at the queue head is taken out and added to the scheduler's output queue; if the queue is empty, go to step 3.2;
Step 3.2, the bitmap of the queue is read from the Overflow Buffer to judge whether DB information for the queue is cached; if so, the DB information is taken from the head of the linked list and added to the corresponding scheduling output queue, then deleted from the linked list; this round of scheduling ends and the next scheduling period is awaited; if not, the queue is skipped, the next queue is polled, and the round of scheduling ends; when a scheduling queue is full, its DB information is cached in the Overflow Buffer, which is shared by all scheduling queues.
6. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that HostID may be represented by PF ID + VF ID; the PF ID is the physical-function ID, representing the ID of a physical channel in PCIe; the VF ID is the virtual-function ID, representing the ID of a PCIe virtual channel, with one virtual channel corresponding to one virtual machine.
7. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that the second-stage scheduling module also adopts hierarchical multi-level scheduling, and the number of scheduling levels, the Group size of each layer, and the scheduling algorithm of each layer's scheduler can be configured through the scheduling policy module.
8. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that the scheduling algorithms include but are not limited to the SP, RR, WRR and DWRR scheduling algorithms;
wherein SP is the strict-priority scheduling algorithm;
RR is the round-robin scheduling algorithm;
WRR is the weighted round-robin scheduling algorithm;
DWRR is the deficit weighted round-robin scheduling algorithm.
9. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that the first-stage input queue of the second-stage scheduling module contains 2 cache entries (Entry) for caching DBs, where the Entry at the queue head represents the DB being scheduled and the second Entry represents the DB to be scheduled next.
10. The RDMA-based lossless efficient data stream processing method according to claim 2, characterized in that a DB joins the input queue according to the following method: if the queue is empty, the new DB is added at the head of the queue; if the queue holds only 1 DB, the new DB is added at the tail; if a DB already occupies the tail, the new DB replaces the old DB cached in the queue.
Priority Application
- CN202311721650.0A, filed 2023-12-14 (priority date 2023-12-14): Lossless and efficient data processing method based on RDMA and network interface card

Publication
- CN117931391A, published 2024-04-26; status: Pending (substantive examination requested)