CN118192883A - Remote storage access-oriented request merging and scheduling method and device - Google Patents
- Publication number
- CN118192883A (application CN202410144629.7A)
- Authority
- CN
- China
- Prior art keywords
- request
- rdma
- requests
- merged
- merging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
- G06F13/4282—Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0629—Configuration or reconfiguration of storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0659—Command handling arrangements, e.g. command buffers, queues, command scheduling
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/133—Protocols for remote procedure calls [RPC]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2213/00—Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F2213/0026—PCI express
Abstract
The invention discloses a request merging and scheduling method and device for remote storage access, which comprise merging a plurality of consecutive I/O requests into one I/O request by exploiting the property that an SGL can point to a plurality of scattered memory fragments, so that the merged I/O request contains the scatter-gather elements (SGEs) of the plurality of consecutive I/O requests, and the work of the plurality of consecutive I/O requests is completed between the host end and the target end of an RDMA network through a single group of operations on the merged I/O request. The invention aims to better schedule I/O requests based on the characteristics of NVMe over RDMA network storage, merging the SGLs of a plurality of I/O requests while ensuring timeliness through a timer, so as to effectively reduce the number of bilateral operations, free CPU computing power, and fully exploit NVMeoF remote storage performance.
Description
Technical Field
The invention relates to the technical field of network storage, in particular to a request merging and scheduling method and device for remote storage access.
Background
NVMe (Non-Volatile Memory Express) is a host controller interface specification redefined for PCIe SSDs. It was developed to take full advantage of the high-speed performance of solid-state storage, providing higher performance than the conventional SATA and SAS interfaces by optimizing command queues and reducing the latency of I/O operations. One significant advantage of NVMe is its low latency and high IOPS (I/O operations per second). It supports up to 64K queue pairs, each consisting of an SQ (Submission Queue) and a CQ (Completion Queue), with a maximum depth of 64K per queue. At the driver layer, NVMe is co-designed with the blk-mq (multi-queue block layer) mechanism of the Linux kernel, so that multithreaded I/O from a multi-core CPU does not contend for locks on a single SSD request queue, avoiding the performance bottleneck of a single request queue. NVMe also optimizes the way the drive communicates with the system: it reduces the number of CPU cycles required for driver operation, which means higher system efficiency and lower power consumption. In addition, the NVMe protocol includes support for power management, which makes it excellent at saving power and extending device lifetime. With the development of solid-state drive technology and falling prices, NVMe has become the first choice for high-performance storage solutions, widely used in environments from personal computers to data centers. It is particularly suitable for applications requiring high-speed data access, such as big-data analysis, high-performance computing, real-time data processing, and gaming. As technology continues to advance, NVMe will continue to drive the development of storage technology, providing faster, more reliable data storage and access solutions.
NVMe over Fabrics (NVMeoF) is a network protocol aimed at extending the high-performance, low-latency characteristics of NVMe technology into a wider network environment. It allows a device connected through a network to access a remote NVMe storage device at near-local performance. The core of this technology is to encapsulate NVMe commands and data in a network transport protocol and then transmit them over the network, thereby enabling remote storage access. NVMeoF utilizes existing network technologies such as Fibre Channel, Ethernet, and InfiniBand, over which NVMe commands and data are transmitted. Particularly when using network technologies that support Remote Direct Memory Access (RDMA), NVMeoF can achieve very low latency and high throughput, which is important for performance-sensitive applications. RDMA allows data to be moved directly in memory between the server and the storage device, reducing CPU intervention and additional memory copy operations, thereby improving efficiency. The introduction of NVMeoF provides a significant performance boost for data centers and cloud infrastructure. It enables a data center to use NVMe storage devices more effectively, providing faster data access, better scalability, and higher resource utilization. In addition, NVMeoF supports a more flexible storage architecture, allowing large-scale distributed storage systems to be built while maintaining fast access to storage. As the demand of data centers and businesses for high-speed, efficient storage solutions continues to grow, NVMeoF has become a key technology helping them handle challenges such as big data, real-time analysis, and high-performance computing. Its ability to increase storage network performance and efficiency makes it an important component of modern data center storage architecture.
RDMA (Remote Direct Memory Access) is a highly efficient network communication technology that allows one computer to directly access memory on another computer without passing through the CPU or operating system of the remote system; it significantly reduces CPU load and system delay during network communication and improves the efficiency of data transmission. In conventional network communication, data must pass through multiple levels of the operating system, requiring the CPU to process data packets, perform protocol stack operations, and copy data between systems. This process not only increases latency but also occupies valuable CPU resources. RDMA, in contrast, allows a network adapter to read or write the memory of a remote host directly, without intervention by the remote host's CPU. This is achieved through memory regions registered in advance on the participating hosts; the addresses and access keys of these memory regions are shared with other hosts in the network. When an RDMA transfer occurs, the data is moved directly from the sender's memory to the receiver's memory, bypassing the conventional network stack of the operating system. This direct memory access mechanism of RDMA brings several significant advantages. First, it greatly reduces delay during data transfer because it eliminates the processing time of the network stack in the operating system. Second, since the CPU no longer participates in the actual data transmission, CPU utilization drops markedly, improving the overall performance and efficiency of the system. In addition, since data copy operations inside the host are reduced, the throughput of data transmission is also improved. Because of these features, RDMA is particularly favored in high-performance computing, big-data processing, cloud computing, and other scenarios with high requirements on network performance. It plays a key role in modern data centers and enterprise-level storage solutions, particularly in applications that process large amounts of data and for which latency and throughput are critical.
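To make the registration step above concrete, the following userspace sketch registers one memory region with libibverbs; the protection domain pd and the buffer are assumed to already exist, and the function name is illustrative rather than part of any real driver.

```c
/* Minimal sketch of RDMA memory registration (libibverbs). The returned
 * mr->lkey / mr->rkey are the keys shared with remote peers so their NICs
 * can access buf directly, without involving this host's CPU. */
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_region(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```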
In RDMA, a scatter-gather list (Scatter Gather List, abbreviated SGL) is a critical data structure used to handle efficient transmission of multiple non-contiguous blocks of data scattered in memory. Since the core feature of RDMA is the ability to transfer data directly between the memories of network participants, SGLs play an important role in RDMA operations. During RDMA communication, data may not always be stored in one contiguous memory block. An SGL allows an RDMA operation to specify a list of multiple memory segments, each with its own address and length. With an SGL, RDMA is able to read data from, or write data to, these scattered memory locations in one operation, without copying the data into a contiguous memory region in advance. Fig. 1 illustrates the correspondence between the SGL and the WR linked list, where ibv_send_wr is an RDMA work request item, wr_id is the number of the request, sg_list is a pointer to the scatter-gather list (SGL), num_sge is the number of scatter-gather elements (SGEs) contained in the SGL, next is a pointer to the next work request item, ibv_sge is the array made up of scatter-gather elements, addr is the address of an element, length is the length of the data segment the element points to, and N1 bytes to N3 bytes are data segments scattered in memory. As can be seen from fig. 1, each node of the WR linked list includes an SGL, which is an array of one or more scatter-gather elements (SGEs), each pointing to a Buffer holding data that needs to be sent.
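The structures named above come from libibverbs, so the correspondence of fig. 1 can be shown directly in C. The following sketch fills one work request whose SGL gathers three scattered buffers; the buffer pointers, lengths, and lkey are assumed to come from a prior registration, and the function name is illustrative.

```c
/* Sketch: one ibv_send_wr whose SGL points at three scattered buffers,
 * mirroring the WR/SGL layout of fig. 1. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

void build_wr_with_sgl(struct ibv_send_wr *wr, struct ibv_sge sge[3],
                       void *b1, uint32_t n1, void *b2, uint32_t n2,
                       void *b3, uint32_t n3, uint32_t lkey)
{
    sge[0] = (struct ibv_sge){ (uintptr_t)b1, n1, lkey };
    sge[1] = (struct ibv_sge){ (uintptr_t)b2, n2, lkey };
    sge[2] = (struct ibv_sge){ (uintptr_t)b3, n3, lkey };

    memset(wr, 0, sizeof(*wr));
    wr->wr_id = 1;                  /* number of the request */
    wr->sg_list = sge;              /* pointer to the SGL */
    wr->num_sge = 3;                /* SGEs contained in the SGL */
    wr->opcode = IBV_WR_SEND;
    wr->send_flags = IBV_SEND_SIGNALED;
    wr->next = NULL;                /* last node of the WR linked list */
}
```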
In RDMA networks, the process of data transfer using an SGL is shown in fig. 2. The SGL array includes 3 SGEs, of lengths N1, N2, and N3 bytes respectively. As can be seen from fig. 2, the three Buffers are not contiguous; they are distributed throughout the memory. After the RDMA hardware reads the SGL, it performs a gather operation, so that what appears on the RDMA hardware wire is N3+N2+N1 consecutive bytes. The advantage of this approach is that it significantly improves the efficiency and flexibility of data transmission. It reduces the need for data copying and memory reorganization, thereby reducing the burden on the CPU and lowering latency. This is extremely important for high-performance computing and large-scale data processing applications, which typically involve large, discrete data sets. In general, the use of SGLs in RDMA reinforces the core advantage of RDMA, namely providing efficient, low-latency direct memory access, which is especially critical to modern high-performance network communication and data center technologies.
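Posting such a WR is a single verb call; a minimal sketch follows, assuming qp is an already-connected queue pair and wr is the request built above. The NIC then walks the SGL and gathers the scattered buffers into the contiguous byte stream seen on the wire in fig. 2.

```c
/* Sketch: post the gather WR; on failure bad_wr points at the first
 * work request that could not be posted. */
#include <infiniband/verbs.h>
#include <stdio.h>

int post_gather_send(struct ibv_qp *qp, struct ibv_send_wr *wr)
{
    struct ibv_send_wr *bad_wr = NULL;
    int ret = ibv_post_send(qp, wr, &bad_wr);

    if (ret)
        fprintf(stderr, "post_send failed (wr_id %llu)\n",
                bad_wr ? (unsigned long long)bad_wr->wr_id : 0ULL);
    return ret;
}
```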
RDMA_READ and RDMA_WRITE provide efficient data read and write capabilities in RDMA communication. RDMA_READ is used to read data from a remote memory, while RDMA_WRITE is used to write data to a remote memory. Both operations transfer data directly between memories, bypassing the conventional network stack and operating system kernel, thereby greatly reducing delay and CPU overhead and improving data transfer efficiency. This is particularly important in high-performance computing and data center environments that require fast data exchange.
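In libibverbs terms, a one-sided operation differs from a SEND only in its opcode and in carrying the peer's address and key; the sketch below builds an RDMA_READ work request, with remote_addr and rkey assumed to have been obtained from the peer (for example, from an SGL entry in a command capsule).

```c
/* Sketch: a one-sided RDMA_READ WR. The local NIC pulls data from the
 * keyed remote buffer into the local SGE(s); the remote CPU is not
 * involved in the transfer. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

void build_rdma_read(struct ibv_send_wr *wr, struct ibv_sge *local_sge,
                     int num_sge, uint64_t remote_addr, uint32_t rkey)
{
    memset(wr, 0, sizeof(*wr));
    wr->opcode = IBV_WR_RDMA_READ;
    wr->sg_list = local_sge;               /* local destination buffers */
    wr->num_sge = num_sge;
    wr->wr.rdma.remote_addr = remote_addr; /* keyed remote buffer */
    wr->wr.rdma.rkey = rkey;
    wr->send_flags = IBV_SEND_SIGNALED;
}
```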
As shown in fig. 3, the RDMA_WRITE operation allows the Target end to write data directly into the memory of the Host end. The RDMA_WRITE and RDMA_SEND operations should use the same RDMA Queue Pair (QP) to ensure consistency and ordering of the communication between the Host end and the Target end. The detailed flow is as follows: 1. The Host end sends a command capsule to the controller through an RDMA_SEND operation; the capsule contains or points to the SGL (Scatter Gather List) required for the data transfer. 2. The Target end uses RDMA_WRITE operations to transfer data to the Host end. Each RDMA_WRITE is associated with a keyed Host-end memory buffer (an SGL entry) and one or more local memory buffers, which ensures that data is transferred directly from the memory of the Target end to the designated location in the memory of the Host end. 3. After the transfer is complete, the Target end sends a response capsule back to the Host end using an RDMA_SEND or RDMA_SEND_INVALIDATE operation, the latter of which may invalidate the memory key.
As shown in fig. 4, the RDMA_READ operation allows the Target end to read data directly from the memory of the Host end. The RDMA_READ and RDMA_SEND operations should use the same RDMA Queue Pair (QP) to ensure consistency and ordering of the communication between the Host end and the Target end. The detailed flow is as follows: 1. The Host end sends a command capsule to the controller through an RDMA_SEND operation; the capsule contains or points to the SGL (Scatter Gather List) required for the data transfer, and may directly contain data referenced by an offset address in an SGL within the capsule. 2. For command data residing at the Host end, the Target end uses RDMA_READ operations to transfer the data from the Host end's memory to the Target end. Each RDMA_READ is associated with a keyed remote Host-end memory buffer (an SGL entry) and one or more local memory buffers. 3. After the transfer is complete, the Target end sends a response capsule back to the Host end using an RDMA_SEND or RDMA_SEND_INVALIDATE operation, the latter of which may invalidate the memory key. From the above procedures it can be seen that the corresponding commands and responses must be sent by RDMA_SEND operations around the RDMA_READ and RDMA_WRITE operations. In RDMA networks, both SEND/RECV operations and READ/WRITE operations are used for data transfer, but they differ significantly in mode of operation and overhead. SEND/RECV is a bilateral operation: the sender sends data using a SEND operation, and the receiver must post RECV work requests in advance to receive the data. This operation involves the CPUs of both the sender and the receiver; the receiver's CPU must process the receive queue and manage the receive buffers. The data is first copied into the sender's memory and then sent over the network to the receiver's memory. Because of the CPU processing and memory copying involved, SEND/RECV operations typically have higher latency and lower throughput than READ/WRITE. READ/WRITE is a unilateral operation that allows data to be read from or written to the memory of another host directly, without intervention from the remote CPU. These operations transfer data directly between memories, eliminating extra memory copy steps and improving efficiency. READ/WRITE operations generally provide lower latency and higher throughput, especially when large amounts of data are transferred.
I/O request scheduling is the process in the operating system of managing and optimizing read and write requests to storage devices. Its key task is to decide which I/O operations are performed first and how to perform them efficiently, so as to improve the performance of the storage device and the efficiency of the overall system. In processing I/O requests, the scheduler must consider factors such as the priority of each request, the physical location of the data, and the performance characteristics of the storage device. In I/O request scheduling, the operating system organizes and orders pending I/O requests according to certain algorithms or policies. For example, it may sort requests to reduce head movement on a mechanical hard disk, or merge multiple adjacent small requests to reduce the number of operations on a solid-state disk. This optimization aims to reduce access latency to the storage device, improve data throughput, and allocate resources fairly in a multitasking environment. However, existing I/O request scheduling policies are mainly optimized for local storage and do not fully consider the influence of network overhead in network storage scenarios, so they cannot effectively improve performance there.
As shown in FIG. 5, in the I/O flow at the Host end, a bio (Block I/O) is processed into a request (I/O request) and enters the I/O request scheduler; the I/O request scheduler organizes and sorts the pending I/O requests according to a certain algorithm or policy and issues them to the nvme_rdma_queue_rq function in the RDMA driver for processing. The nvme_rdma_queue_rq function is responsible for building the corresponding RDMA commands from the I/O requests and ensuring that the commands are properly sent and processed to complete efficient data access operations. Because the existing I/O request scheduling policies are mainly optimized for local storage, the influence of network overhead in network storage scenarios is not fully considered, and system performance cannot be effectively improved.
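For orientation, the following is a heavily simplified, hypothetical sketch of the kernel-side step just described, not the actual driver source; blk_rq_map_sg is the block-layer helper that turns a request's segments into a scatterlist (its exact signature varies across kernel versions).

```c
/* Hypothetical sketch of the queue_rq step: map the block request into
 * a scatterlist that later becomes the SGL of the RDMA command. */
#include <linux/blk-mq.h>
#include <linux/scatterlist.h>
#include <linux/errno.h>

static int sketch_queue_rq(struct request_queue *q, struct request *rq,
                           struct scatterlist *sgl)
{
    int nents = blk_rq_map_sg(q, rq, sgl);  /* one entry per segment */

    if (nents <= 0)
        return -EIO;
    /* ...build the NVMe command capsule around sgl and post the
     * RDMA_SEND, following the flow of fig. 5... */
    return nents;
}
```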
Disclosure of Invention
The technical problem the invention aims to solve: aiming at the above problems in the prior art, the invention provides a request merging and scheduling method and device for remote storage access, which aim to better schedule I/O requests based on the characteristics of NVMe over RDMA network storage, merging the SGLs of a plurality of I/O requests while ensuring timeliness through a timer, so as to effectively reduce the number of bilateral operations, free CPU computing power, and fully exploit NVMeoF remote storage performance.
In order to solve the technical problems, the invention adopts the following technical scheme:
A request merging and scheduling method for remote storage access comprises: setting the I/O request scheduler to the NOOP scheduling policy; after a block I/O is processed into an I/O request, placing the I/O request into a FIFO queue through the I/O request scheduler; merging a plurality of consecutive I/O requests into one I/O request by exploiting the property that an SGL can point to a plurality of scattered memory fragments, so that the merged I/O request contains the scatter-gather elements (SGEs) of the plurality of consecutive I/O requests; and completing the work of the plurality of consecutive I/O requests between the host end and the target end of an RDMA network through a single group of operations on the merged I/O request. The core merging step is sketched below.
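The core of the scheme is an SGE-level append; the following illustrative sketch (userspace types for brevity, all names hypothetical) shows how the SGEs of one more consecutive request are folded into the SGL of the request being merged.

```c
/* Illustrative sketch of the merge step: append the SGEs of the next
 * consecutive request to the merged request's SGL, so one work request
 * eventually covers what previously took N of them. */
#include <infiniband/verbs.h>

int merge_into_sgl(struct ibv_sge *merged_sgl, int merged_count,
                   const struct ibv_sge *req_sges, int req_num_sge,
                   int max_sge)
{
    int i;

    if (merged_count + req_num_sge > max_sge)
        return -1;                 /* would exceed transport capability */
    for (i = 0; i < req_num_sge; i++)
        merged_sgl[merged_count + i] = req_sges[i];
    return merged_count + req_num_sge; /* new SGE count after merging */
}
```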
Optionally, the merging of multiple consecutive I/O requests into one I/O request refers to merging multiple consecutive write requests into one write request or merging multiple consecutive read requests into one read request.
Optionally, merging a plurality of consecutive write requests into one write request means that N consecutive write requests, which would otherwise require 2N RDMA_SEND bilateral operations and N RDMA_READ unilateral operations between the host end and the target end, are merged into one write request; the merged write request contains the scatter-gather elements (SGEs) of the plurality of consecutive write requests, and only 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation are performed between the host end and the target end of the RDMA network for the merged write request.
Optionally, performing 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation between the host end and the target end through the merged write request includes: the host end sends an RDMA_SEND command to the target end; after receiving the RDMA_SEND command, the target end issues an RDMA_READ to the host end, so that the data of the N consecutive write requests is transferred from the host end to the target end; after receiving the data of the N consecutive write requests, the target end sends an RDMA_SEND response to the host end, thereby completing the write operation of the merged write request.
Optionally, merging a plurality of consecutive read requests into one read request means that N consecutive read requests, which would otherwise require 2N RDMA_SEND bilateral operations and N RDMA_WRITE unilateral operations between the host end and the target end, are merged into one read request; the merged read request contains the scatter-gather elements (SGEs) of the plurality of consecutive read requests, and only 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation are performed between the host end and the target end of the RDMA network for the merged read request.
Optionally, performing 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation between the host end and the target end through the merged read request includes: the host end sends an RDMA_SEND command to the target end; after receiving the RDMA_SEND command, the target end issues an RDMA_WRITE to the host end, so that the data of the N consecutive read requests is written directly from the target end into the memory of the host end; after the transfer of the data of the N consecutive read requests is complete, the target end sends an RDMA_SEND response to the host end, thereby completing the read operation of the merged read request.
Optionally, merging a plurality of consecutive I/O requests into one I/O request includes controlling the number of merged I/O requests based on a merging rule derived from transmission capability: each time an I/O request is merged, the number of scatter-gather elements (SGEs) contained in the merged I/O request is computed; if the number of SGEs contained in the merged I/O request reaches a threshold set according to the transmission capability of the NVMeoF remote storage network, merging into the current merged I/O request stops, and if further I/O requests still need to be merged, the next round of merging continues from the next I/O request to be merged. A sketch of this rule follows.
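A hedged sketch of how such a threshold could be derived: the device's max_sge is a real attribute reported by ibv_query_device, while capsule_limit stands in for the capsule-size-derived bound and is an assumed input supplied by configuration.

```c
/* Sketch: derive the SGE merge threshold from transmission capability,
 * taking the more restrictive of the device limit and a configured
 * capsule-size-derived limit. */
#include <infiniband/verbs.h>

int compute_sge_threshold(struct ibv_context *ctx, int capsule_limit)
{
    struct ibv_device_attr attr;

    if (ibv_query_device(ctx, &attr))
        return -1;                          /* query failed */
    return attr.max_sge < capsule_limit ? attr.max_sge : capsule_limit;
}
```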
Optionally, merging a plurality of consecutive I/O requests into one I/O request includes controlling the number of merged I/O requests based on a latency-oriented merging rule: a timer is started after the first I/O request enters the nvme_rdma_queue_rq function in the RDMA driver, and before each subsequent I/O request is merged, it is determined whether the timer has expired; if not, the I/O request is allowed to be merged, otherwise it is not, as sketched below.
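A minimal kernel-style sketch of this latency rule follows, assuming a single merge window for brevity (a driver would keep one per queue); max_wait_ns is an assumed, configuration-chosen bound, not a value taken from the source.

```c
/* Sketch of the latency rule: arm a deadline when the first request
 * arrives; later requests may merge only while the deadline has not
 * passed. */
#include <linux/ktime.h>
#include <linux/types.h>

static ktime_t merge_deadline;

static void arm_merge_window(u64 max_wait_ns)
{
    merge_deadline = ktime_add_ns(ktime_get(), max_wait_ns);
}

static bool may_merge_now(void)
{
    return ktime_before(ktime_get(), merge_deadline);
}
```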
In addition, the invention also provides a remote storage access-oriented request merging and scheduling device which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the remote storage access-oriented request merging and scheduling method.
Furthermore, the present invention provides a computer readable storage medium having stored therein a computer program for being programmed or configured by a microprocessor to perform the remote storage access oriented request merging and scheduling method.
Compared with the prior art, the invention has the following advantages: the invention fully considers the co-design of I/O request scheduling and the RDMA transmission flow, schedules I/O requests based on the characteristics of NVMe over RDMA network storage, and merges the SGLs of a plurality of I/O requests while ensuring timeliness through a timer, so as to effectively reduce the number of bilateral operations, free CPU computing power, and fully exploit NVMeoF remote storage performance.
Drawings
Fig. 1 is a schematic diagram of SGL organization in the prior art.
Fig. 2 is a schematic diagram of an SGL transmission procedure in the prior art.
FIG. 3 is an interactive schematic diagram of an RDMA_WRITE process in the prior art.
FIG. 4 is an interactive schematic diagram of an RDMA_READ process in the prior art.
FIG. 5 is a schematic diagram of the I/O flow at the host side of an RDMA network in the prior art.
FIG. 6 is a schematic diagram of a basic flow of a method according to an embodiment of the invention.
FIG. 7 is a prior art process for handling a write request for an RDMA network.
FIG. 8 is a process of handling a write request to an RDMA network in a method according to an embodiment of the present invention.
FIG. 9 is a prior art process for handling read requests for RDMA networks.
FIG. 10 is a process of handling a read request from an RDMA network in a method according to an embodiment of the present invention.
Fig. 11 is a schematic diagram of a transmission process triggered by the number of SGEs in the method according to the embodiment of the present invention.
Fig. 12 is a schematic diagram of a transmission process triggered by a timer in the method according to the embodiment of the present invention.
Detailed Description
As shown in FIG. 6, the request merging and scheduling method for remote storage access in this embodiment includes: setting the I/O request scheduler to the NOOP scheduling policy; after a block I/O is processed into an I/O request, it enters the I/O request scheduler, which places it into a FIFO queue; a plurality of consecutive I/O requests are then merged into one I/O request by exploiting the property that an SGL can point to a plurality of scattered memory fragments, so that the merged I/O request contains the scatter-gather elements (SGEs) of the plurality of consecutive I/O requests, and the work of the plurality of consecutive I/O requests is completed between the host end and the target end of the RDMA network through a single group of operations on the merged I/O request.
In this embodiment, merging a plurality of consecutive I/O requests into one I/O request refers to merging a plurality of consecutive write requests into one write request or merging a plurality of consecutive read requests into one read request.
FIG. 7 shows a prior-art process for handling write requests in an RDMA network. For a plurality of write requests at the Host end, the queue_rq of the NVMe over RDMA driver layer prepares data for each upper-layer request in turn, constructs a request command, and sends the command to the Target end through an RDMA_SEND operation. After receiving the command, the Target end uses an RDMA_READ operation to transfer the command data residing at the Host end from the Host end's memory to the Target end. After the transfer is complete, the Target end sends a response capsule back to the Host end using an RDMA_SEND or RDMA_SEND_INVALIDATE operation, the latter of which may invalidate the memory key. Taking the scenario shown in fig. 7 as an example, each of the 3 write requests requires 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation, so processing 3 consecutive write requests requires 6 bilateral operations and 3 unilateral operations.
FIG. 8 shows the process of handling write requests in an RDMA network in the method of this embodiment. As shown in fig. 8, merging a plurality of consecutive write requests into one write request in this embodiment means that N consecutive write requests, which would otherwise require 2N RDMA_SEND bilateral operations and N RDMA_READ unilateral operations between the host end and the target end, are merged into one write request; the merged write request contains the scatter-gather elements (SGEs) of the plurality of consecutive write requests, and only 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation are completed between the host end and the target end of the RDMA network for the merged write request. In the nvme_rdma_queue_rq function shown in fig. 8, by exploiting the property that an SGL can point to a plurality of non-contiguous data blocks scattered in memory, a plurality of consecutive write requests are combined into one write request, which then requires only 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation to process, improving write performance.
Referring to fig. 8, performing 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation between the host end and the target end through the merged write request includes: the host end sends an RDMA_SEND command to the target end; after receiving the RDMA_SEND command, the target end issues an RDMA_READ to the host end, so that the data of the N consecutive write requests is transferred from the host end to the target end; after receiving the data of the N consecutive write requests, the target end sends an RDMA_SEND response to the host end, thereby completing the write operation of the merged write request.
FIG. 9 shows a prior-art process for handling read requests in an RDMA network. For a plurality of read requests at the Host end, the queue_rq of the NVMe over RDMA driver layer prepares a memory area for each upper-layer request in turn, constructs a request command, and sends the command to the Target end through an RDMA_SEND operation. After receiving the command, the Target end uses an RDMA_WRITE operation to transfer the data directly from the Target end's memory to the designated location in the Host end's memory. After the transfer is complete, the Target end sends a response capsule back to the Host end using an RDMA_SEND or RDMA_SEND_INVALIDATE operation, the latter of which may invalidate the memory key. Taking the scenario shown in fig. 9 as an example, each of the 3 read requests requires 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation, so processing 3 consecutive read requests requires 6 bilateral operations and 3 unilateral operations.
FIG. 10 shows the process of handling read requests in an RDMA network in the method of this embodiment. As shown in fig. 10, merging a plurality of consecutive read requests into one read request in this embodiment means that N consecutive read requests, which would otherwise require 2N RDMA_SEND bilateral operations and N RDMA_WRITE unilateral operations between the host end and the target end, are merged into one read request; the merged read request contains the scatter-gather elements (SGEs) of the plurality of consecutive read requests, and only 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation are completed between the host end and the target end of the RDMA network for the merged read request. In the nvme_rdma_queue_rq function shown in fig. 10, by exploiting the property that an SGL can point to a plurality of non-contiguous data blocks scattered in memory, a plurality of consecutive read requests are combined into one read request, which then requires only 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation to process, improving read performance.
Referring to fig. 10, performing 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation between the host end and the target end through the merged read request includes: the host end sends an RDMA_SEND command to the target end; after receiving the RDMA_SEND command, the target end issues an RDMA_WRITE to the host end, so that the data of the N consecutive read requests is written directly from the target end into the memory of the host end; after the transfer is complete, the target end sends an RDMA_SEND response to the host end, thereby completing the read operation of the merged read request.
When a request arrives at the queue_rq in the NVMe over RDMA driver, the request is scheduled according to RDMA network conditions (such as the maximum number of SGEs allowed in one request by the RDMA network, the packet size allowed by the routers and switches in the network, and so on), and the SGLs contained in the command capsules are combined by exploiting the property that an SGL can point to a plurality of scattered memory fragments; on the premise that delay is kept within a reasonable range, the number of bilateral operations is reduced, lowering the occupation of computing resources and improving throughput to a certain extent. How to combine requests during the transmission process is the key part of the scheduling method. If the number of merged SGEs exceeds the number of SGEs allowed by the RDMA network, or the merged request size exceeds the command capsule size specified by the RDMA network, errors may result. If the wait for requests to merge is too long, an unacceptable increase in latency may result. Thus, this embodiment designs the merging rules for requests from two aspects, transmission capability and latency. On the one hand, when merging a plurality of consecutive I/O requests into one I/O request, this embodiment controls the number of merged I/O requests based on the transmission-capability rule: each time an I/O request is merged, the number of scatter-gather elements (SGEs) contained in the merged I/O request is computed; if this number reaches the threshold set according to the transmission capability of the NVMeoF remote storage network, merging into the current merged request stops, and if further I/O requests still need to be merged, the next round of merging continues from the next I/O request to be merged. The threshold may be set to the more restrictive of two constraints: the maximum number of SGEs the RDMA network allows within one request, and the limit implied by the packet size allowed by the routers and switches in the network. On the other hand, when merging a plurality of consecutive I/O requests into one I/O request, this embodiment controls the number of merged I/O requests based on the latency rule: a timer is started after the first I/O request enters the nvme_rdma_queue_rq function in the RDMA driver, and before each subsequent I/O request is merged, it is determined whether the timer has expired; if not, the I/O request is allowed to be merged, otherwise it is not.
Fig. 11 illustrates the transmission process triggered by the number of SGEs, taking the transmission of read requests as an example. In this example it is assumed that the RDMA network allows a maximum of 3 SGEs within one request, and that all switches and routers in the network support packets of this size, so the threshold for the number of merged SGEs is set to 3. An SGE counter and a timer are set for each request to be sent in the nvme_rdma_queue_rq function at the Host end. The SGE counter counts the number of SGEs contained in the SGL of the request, and its value is updated whenever the SGEs contained in a newly arrived request are merged into the request. The timer defines the maximum waiting time after the request enters the queue. In fig. 11, the timer starts when the first read request enters the queue and the SGE counter value is updated. Before the timer expires, the SGE counter reaches the set threshold, the send mechanism is triggered, and the request is sent. Fig. 12 illustrates the timer-triggered send process, again taking read requests as an example: the timer for the request starts when the request reaches the nvme_rdma_queue_rq function; no subsequent requests arrive after the SGEs contained in the second request are merged, until the timer times out. Although the SGE counter has not reached the set threshold, the timer timeout triggers the send mechanism and the request is sent. The timer threshold is generally set according to the latency acceptable in the specific application environment: it can be set lower in latency-sensitive environments to guarantee latency, and higher in latency-insensitive environments so that as many SGEs as possible are merged and transmitted together.
In summary, in the request merging and scheduling method for remote storage access of this embodiment, the I/O scheduler is set to NOOP; the scheduler places the I/O requests into one FIFO queue and then executes them one by one, without reordering them, performing only appropriate merging of I/O requests that are consecutive on disk. When a request reaches the nvme_rdma_queue_rq function in the NVMe over RDMA driver, the method schedules the request according to RDMA network conditions (the maximum number of SGEs allowed in one request by the RDMA network and the packet size allowed by the routers and switches in the network), merges the SGLs contained in the command capsules by exploiting the property that an SGL can point to a plurality of scattered memory fragments, and reduces the number of bilateral operations on the premise that delay is kept within a reasonable range, so as to reduce the occupation of computing resources and improve throughput to a certain extent.
In addition, the embodiment also provides a remote storage access-oriented request merging and scheduling device, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the remote storage access-oriented request merging and scheduling method.
Furthermore, the present embodiment also provides a computer readable storage medium having stored therein a computer program for being programmed or configured by a microprocessor to perform the remote storage access oriented request merging and scheduling method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.
Claims (10)
1. A request merging and scheduling method for remote storage access, characterized by comprising: setting an I/O request scheduler to a NOOP scheduling policy; after a block I/O is processed into an I/O request, placing the I/O request into a FIFO queue through the I/O request scheduler; and merging a plurality of consecutive I/O requests into one I/O request by exploiting the property that an SGL can point to a plurality of scattered memory fragments, wherein the merged I/O request comprises the scatter-gather elements (SGEs) of the plurality of consecutive I/O requests, and the plurality of consecutive I/O requests are completed between a host side and a target side of an RDMA network through a single group of operations on the merged I/O request.
2. The request merging and scheduling method for remote storage access according to claim 1, wherein merging a plurality of consecutive I/O requests into one I/O request means merging a plurality of consecutive write requests into one write request or merging a plurality of consecutive read requests into one read request.
3. The request merging and scheduling method for remote storage access according to claim 2, wherein merging a plurality of consecutive write requests into one write request means that N consecutive write requests, which would otherwise require 2N RDMA_SEND bilateral operations and N RDMA_READ unilateral operations between the host side and the target side, are merged into one write request; the merged write request comprises the scatter-gather elements (SGEs) of the plurality of consecutive write requests, and 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation are completed between the host side and the target side of the RDMA network through the merged write request.
4. The request merging and scheduling method for remote storage access according to claim 3, wherein performing 2 RDMA_SEND bilateral operations and 1 RDMA_READ unilateral operation between the host side and the target side through the merged write request comprises: the host side sends an RDMA_SEND command to the target side; after receiving the RDMA_SEND command, the target side issues an RDMA_READ to the host side, so that the data of the N consecutive write requests is transferred from the host side to the target side; after receiving the data of the N consecutive write requests, the target side sends an RDMA_SEND response to the host side, thereby completing the write operation of the merged write request.
5. The request merging and scheduling method for remote storage access according to claim 2, wherein merging a plurality of consecutive read requests into one read request means that N consecutive read requests, which would otherwise require 2N RDMA_SEND bilateral operations and N RDMA_WRITE unilateral operations between the host side and the target side, are merged into one read request; the merged read request comprises the scatter-gather elements (SGEs) of the plurality of consecutive read requests, and 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation are completed between the host side and the target side of the RDMA network through the merged read request.
6. The request merging and scheduling method for remote storage access according to claim 5, wherein performing 2 RDMA_SEND bilateral operations and 1 RDMA_WRITE unilateral operation between the host side and the target side through the merged read request comprises: the host side sends an RDMA_SEND command to the target side; after receiving the RDMA_SEND command, the target side issues an RDMA_WRITE to the host side, so that the data of the N consecutive read requests is written directly from the target side into the memory of the host side; after the transfer is complete, the target side sends an RDMA_SEND response to the host side, thereby completing the read operation of the merged read request.
7. The request merging and scheduling method for remote storage access according to claim 1, wherein merging a plurality of consecutive I/O requests into one I/O request comprises controlling the number of merged I/O requests based on a merging rule derived from transmission capability: each time an I/O request is merged, the number of scatter-gather elements (SGEs) contained in the merged I/O request is computed; if the number of SGEs contained in the merged I/O request reaches a threshold set according to the transmission capability of the NVMeoF remote storage network, merging into the current merged I/O request stops, and if further I/O requests still need to be merged, the next round of merging continues from the next I/O request to be merged.
8. The request merging and scheduling method for remote storage access according to claim 1, wherein merging a plurality of consecutive I/O requests into one I/O request comprises controlling the number of merged I/O requests based on a latency-oriented merging rule: a timer is started after the first I/O request enters the nvme_rdma_queue_rq function in the RDMA driver, and before each subsequent I/O request is merged, it is determined whether the timer has expired; if not, the I/O request is allowed to be merged, otherwise it is not.
9. A request merging and scheduling device for remote storage access, comprising a microprocessor and a memory connected with each other, wherein the microprocessor is programmed or configured to perform the request merging and scheduling method for remote storage access of any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored therein, wherein the computer program is used to be programmed or configured by a microprocessor to perform the request merging and scheduling method for remote storage access of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410144629.7A | 2024-02-01 | 2024-02-01 | Remote storage access-oriented request merging and scheduling method and device
Publications (1)
Publication Number | Publication Date
---|---
CN118192883A (en) | 2024-06-14
Family
ID=91405740
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202410144629.7A | Remote storage access-oriented request merging and scheduling method and device | 2024-02-01 | 2024-02-01
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118192883A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN118672954A (en) * | 2024-08-26 | 2024-09-20 | National University of Defense Technology | NVMeoF request instruction transmission method and system based on interrupt merging
- 2024-02-01: application CN202410144629.7A filed in CN; published as CN118192883A (status: Pending)
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination