CN110457238B - Method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache

Info

Publication number
CN110457238B
CN110457238B
Authority
CN
China
Prior art keywords
access
fifo queue
access request
cache
memory
Prior art date
Legal status
Active
Application number
CN201910601175.0A
Other languages
Chinese (zh)
Other versions
CN110457238A (en)
Inventor
Li Bingchao (李炳超)
Current Assignee
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date
Filing date
Publication date
Application filed by Civil Aviation University of China
Priority to CN201910601175.0A
Publication of CN110457238A
Application granted
Publication of CN110457238B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 - Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 - Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 - Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842 - Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12 - Replacement control
    • G06F12/121 - Replacement control using replacement algorithms
    • G06F12/128 - Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10 - Providing a specific technical effect
    • G06F2212/1016 - Performance improvement
    • G06F2212/1024 - Latency reduction
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache, comprising the following steps: the memory access request at the head of the FIFO queue accesses the L1 cache and its tag is compared with the tags in the L1 cache; if the request incurs a reservation stall, it is popped from the head of the FIFO queue and placed at the tail of the FIFO queue; a first control logic controls the routing of the request after it is popped from the FIFO head; a second control logic and first and third control signals are constructed to pipeline the memory access instructions between the warp scheduler and the load/store unit, so that the next memory access instruction can be processed as soon as the address merging unit in the load/store unit has merged all access requests, and access requests can be generated and stored whenever the FIFO queue has a free entry. Compared with the prior art, the method reduces the stall time of memory access requests and speeds up their processing, while also reducing the waiting time of memory access instructions and speeding up their processing.

Description

Method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache
Technical Field
The invention relates to the field of GPU (graphics processing unit) cache (cache memory) architecture, and in particular to a method for alleviating the stalls incurred by GPU memory access requests and memory access instructions when accessing the L1 cache (first-level cache).
Background
In recent years, GPUs have evolved into multi-threaded, high-performance, general-purpose parallel computing platforms, and their computing power continues to grow rapidly, attracting more and more applications to be accelerated on GPUs.
At the software level, when an application runs on a GPU, its work must first be subdivided into many threads that can execute independently, and these threads are then organized into thread blocks. At the hardware level, a GPU consists of multiple streaming multiprocessors, an on-chip interconnection network and memory. Each streaming multiprocessor contains a register file, scalar processors, a load/store unit (read/write unit), shared memory, caches and other resources supporting multi-threaded parallel execution. Threads are dispatched to the streaming multiprocessors in units of thread blocks, and within a streaming multiprocessor the hardware further subdivides each thread block into thread bundles (warps), which are the most basic execution units of the GPU [1]. In an NVIDIA GPU, a warp consists of 32 threads that execute in parallel.
When a warp executes a memory access instruction, each thread generates a memory access request. To reduce the number of requests, the requests generated by the same warp are merged by an address merging unit inside the streaming multiprocessor of the GPU: if the addresses accessed by the requests of a warp fall within the same data block (for example, 128 bytes), they can be merged into a single access request [2]. However, because some programs have irregular access patterns, a memory access instruction of a warp may still produce many access requests even after address merging; these requests are placed into a FIFO (first-in first-out) queue and access the cache in bursts. On the other hand, the cache capacity inside a streaming multiprocessor is small (16 KB to 96 KB) while the number of threads can reach several thousand, so the average cache capacity per thread is only tens of bytes and the cache miss rate is very high. When an access request misses in the cache, a cache line (cache-line) is selected according to the replacement policy and its data is replaced, and the request then continues to access the next-level memory (the L2 cache (second-level cache) or DRAM (dynamic random access memory)). From the moment the old data is replaced until the new data returns from the next-level memory and is stored into the cache line, the line is said to be in the reserved state. A cache line in the reserved state cannot be replaced by other missing access requests. If too many access requests leave the cache lines in the reserved state, a subsequent access request that misses has nothing it can replace and therefore stalls [3] until the data of some cache line returns and its reservation ends; this phenomenon is called a reservation stall. The GPU processes the access requests of a warp in first-in first-out order, and since an access request usually needs hundreds of cycles to access the next-level memory, the other access requests in the load/store unit must also wait hundreds of cycles until the stalled request resumes, which lowers the processing efficiency of access requests.
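As an illustration of the address merging rule just described, the following Python sketch merges the per-thread addresses of one warp into one request per 128-byte data block; the block size follows the example above, and the function and variable names are hypothetical rather than part of the patented hardware.

BLOCK_SIZE = 128  # bytes per data block / cache line, as in the example above

def coalesce(addresses):
    """Merge the per-thread addresses of one warp into one request per data block."""
    blocks = []  # distinct blocks, kept in first-touch order
    for addr in addresses:
        block = addr // BLOCK_SIZE
        if block not in blocks:
            blocks.append(block)
    # each distinct block becomes one memory access request
    return [b * BLOCK_SIZE for b in blocks]

# Regular pattern: 32 threads read consecutive 4-byte words -> a single request.
assert len(coalesce([0x1000 + 4 * tid for tid in range(32)])) == 1

# Irregular pattern: a 128-byte stride per thread -> 32 requests, which are then
# placed into the FIFO queue and access the L1 cache in a burst.
assert len(coalesce([0x1000 + 128 * tid for tid in range(32)])) == 32
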
On the other hand, the load/store unit can currently hold only one warp memory access instruction at a time. That is, until all access requests of the instruction currently in the load/store unit have been processed, the warp scheduler cannot send another memory access instruction to the load/store unit, even if the FIFO queue has free entries. If an access request of the current instruction incurs a reservation stall, the next memory access instruction must also wait hundreds of cycles, which lowers the processing efficiency of the warp memory access instructions.
References
[1] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[2] NVIDIA Corporation, NVIDIA CUDA C Programming Guide, 2019.
[3] W. Jia, K. A. Shaw, M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors," International Symposium on High Performance Computer Architecture (HPCA), pp. 272-283, 2014.
Disclosure of Invention
The invention provides a method for alleviating the stalls that occur when GPU memory access requests and memory access instructions access the cache. The method reorders the access requests that incur reservation stalls, which reduces the stall time of access requests in the load/store unit and improves their processing efficiency; in addition, by pipelining the memory access instructions, it reduces the time instructions wait outside the load/store unit and improves the processing efficiency of memory access instructions, as described in detail below:
a method for slowing down memory access requests of a GPU and stopping when instructions access cache comprises the following steps:
accessing the L1 cache by the access request positioned at the head of the FIFO queue, comparing the tag of the access request with the tag in the L1 cache, if the access request with reservation pause exists, popping the access request from the head of the FIFO queue, and placing the access request into the tail of the FIFO queue;
the first control logic controls the trend of the memory access request after being popped from the FIFO queue head;
and constructing a second control logic, a first control signal and a third control signal for performing pipeline processing on the access instructions between the thread bundle scheduler and the reading unit, so that the next access instruction can be processed when the address merging unit in the reading unit merges all the access requests, and the access requests can be generated and stored when free items exist in the FIFO queue.
The first control logic controls the routing of the memory access request after it is popped from the head of the FIFO queue as follows:
when the access result of the request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through the inverter;
the first tri-state gate is then in the conducting state and the second tri-state gate is in the high-impedance state, so the access request is transferred to the tail of the FIFO queue after being popped from the head of the FIFO queue.
Further, constructing the second control logic and the first and third control signals for pipelining the memory access instructions between the warp scheduler and the load/store unit specifically comprises:
1) if the access requests have not all been merged by the address merging unit, the third control signal is false, informing the warp scheduler that no other memory access instruction may be sent to the load/store unit; otherwise, the third control signal is true, informing the warp scheduler that another memory access instruction may be sent to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address merging unit through the first control signal, so as to control the generation of access requests.
Sending the state of whether the FIFO queue is full to the address merging unit through the first control signal to control the generation of access requests specifically comprises:
if the FIFO queue is full, the first control signal is false, informing the address merging unit to suspend merging access requests until the FIFO queue has a free entry;
otherwise, the address merging unit is informed to continue merging access requests and to place the merged requests at the tail of the FIFO queue.
Preferably, the method further comprises: conflict handling when a memory access request generated by the address merging unit and a memory access request popped from the head of the FIFO queue are to be placed at the tail of the FIFO queue in the same cycle.
Wherein the conflict handling specifically comprises:
the memory access request popped from the head of the FIFO queue is given high priority and is placed at the tail of the FIFO queue through the second control logic;
the address merging unit suspends the generation of new access requests until no request popped from the head of the FIFO queue remains to be placed at the tail of the FIFO queue.
Further, placing the access request at the tail of the FIFO queue through the second control logic comprises:
when the access result of the request in the L1 cache is a hit or a miss, the second control signal is true, input path2 of the multiplexer is gated, and the access request generated by the address merging unit is placed at the tail of the FIFO queue;
when the access result of the request in the L1 cache is a reservation stall, the second control signal is false, input path1 of the multiplexer is gated, and the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
The technical solution provided by the invention has the following beneficial effects:
1. the invention reorders the access requests that incur reservation stalls, so that subsequent access requests can continue to access the L1 cache, which reduces the stall time of access requests;
2. other memory access instructions no longer need to wait until all access requests of the instruction currently in the load/store unit have been processed; as soon as the address merging unit in the load/store unit has finished merging the access requests of all threads, the warp scheduler can send another memory access instruction to the load/store unit for processing, which reduces the waiting time of memory access instructions and improves their processing efficiency.
Drawings
FIG. 1 is a schematic structural diagram of the scheme for alleviating the stalls of GPU memory access requests and memory access instructions in the L1 cache according to the present invention;
FIG. 2 is a schematic diagram of a memory access request incurring a reservation stall;
FIG. 3 is a graph comparing performance results after the present invention is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
Referring to FIG. 1, an embodiment of the present invention provides a method for alleviating stalls when GPU memory access requests and memory access instructions access the cache, the method comprising the following steps:
101: the tag of the memory access request is compared with the tags in the L1 cache, and the FIFO queue is reordered if a request incurs a reservation stall;
The memory access request at the head of the FIFO (first-in first-out) queue accesses the L1 cache, and its tag is first compared with the tags in the L1 cache, with three possible cases:
if a cache hit occurs, the request is popped from the head of the FIFO queue and then accesses the hit cache line; or,
if a cache miss occurs, the request is popped from the head of the FIFO queue and sent to the next-level memory; or,
if a reservation stall occurs, the request is popped from the head of the FIFO queue and placed at the tail of the FIFO queue, so that the other requests in the FIFO queue can continue to access the L1 cache in the next cycle without stalling, which speeds up the processing of memory access requests.
To this end, the embodiment of the present invention designs a corresponding data path1 connecting the head of the FIFO queue to its tail, and a corresponding first control logic 1 that controls the routing of a memory access request after it is popped from the head of the FIFO queue.
The data path1 is a data line that transmits the access request information, which generally includes: address information, warp index and read/write information.
The first control logic 1 controls the routing of the request popped from the head of the FIFO queue: when the access result r of the request in the L1 cache is a hit or a miss, the control signal c2 is true and becomes false after passing through the inverter, so tri-state gate 1 is in the high-impedance state and tri-state gate 2 is in the conducting state, indicating that the request is discarded after being popped from the head of the FIFO queue; when the access result r of the request in the L1 cache is a reservation stall, the control signal c2 is false and becomes true after passing through the inverter, so tri-state gate 1 is in the conducting state and tri-state gate 2 is in the high-impedance state, indicating that the request is sent to the tail of the FIFO queue after being popped from the head.
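For illustration, the behaviour of step 101 can be sketched in Python as follows; the small four-way set model and all names are assumptions made for this sketch, whereas the patent realizes the same decision in hardware through data path1, control signal c2 and the two tri-state gates.

from collections import deque

WAYS = 4  # ways per L1 set, matching the 4-way configuration of Embodiment 2

class L1Set:
    """Tag state of one cache set: each allocated tag is 'valid' or 'reserved'."""
    def __init__(self):
        self.lines = {}

    def probe(self, tag):
        if self.lines.get(tag) == "valid":
            return "hit"
        replaceable = [t for t, s in self.lines.items() if s == "valid"]
        if len(self.lines) < WAYS or replaceable:
            if len(self.lines) >= WAYS:
                del self.lines[replaceable[0]]   # replace a non-reserved line
            self.lines[tag] = "reserved"         # waits for data from the next level
            return "miss"
        return "stall"                           # every line in the set is reserved

def process_head(fifo, sets):
    """One cycle of step 101: the FIFO head request probes the L1 cache."""
    name, set_idx, tag = fifo.popleft()
    result = sets[set_idx].probe(tag)
    if result == "stall":
        # c2 is false: data path1 carries the request back to the FIFO tail, so the
        # next request can access the L1 cache in the following cycle
        fifo.append((name, set_idx, tag))
    return name, result

With four requests already holding reservations in a set, a fifth request to the same set returns "stall" in this model and reappears at the FIFO tail instead of blocking the head, which is the reordering effect described above.
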
102: carrying out pipeline processing on the access instruction;
the method specifically comprises the following steps:
1) If the access requests generated by the current access instruction are not completely synthesized by the address merging unit, the control signal c3 is false, and the thread bundle scheduler is informed that other access instructions cannot be sent to the reading unit;
2) If the access requests generated by the current access instruction are completely synthesized by the address merging unit, the control signal c3 is true and informs the thread bundle scheduler that other access instructions can be sent to the reading unit;
3) Continuously detecting the state of the FIFO queue;
if the FIFO queue is full, the control signal c1 is false, and the address merging unit is informed to suspend merging the access requests until the FIFO queue has an idle entry;
if the FIFO is not full, the control signal c1 is true, and the address merging unit is informed that the access requests can be merged continuously and placed at the tail of the FIFO queue.
For this purpose, the FIFO controller sends the status of whether the FIFO queue is full to the address merge unit via control signal c1 to control the generation of the access request.
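A minimal sketch of the handshake carried by control signals c1 and c3, written in Python under the assumption of a 32-entry FIFO queue (the depth used in Embodiment 2); the class and method names are illustrative, not part of the patented circuit.

from collections import deque

FIFO_DEPTH = 32  # FIFO queue entries, as in Embodiment 2

class LoadStoreUnitModel:
    def __init__(self):
        self.fifo = deque()      # merged requests waiting to access the L1 cache
        self.pending = deque()   # requests of the current instruction not yet stored

    def c3(self):
        # true once every request of the current instruction has been merged and
        # stored, so the warp scheduler may issue the next memory access instruction
        return not self.pending

    def c1(self):
        # true while the FIFO queue has a free entry, so the address merging unit
        # may keep generating requests and placing them at the FIFO tail
        return len(self.fifo) < FIFO_DEPTH

    def issue(self, requests):
        """The warp scheduler sends an instruction; only legal while c3 is true."""
        assert self.c3(), "scheduler must wait until c3 is true"
        self.pending.extend(requests)

    def merge_one(self):
        """The address merging unit stores one request into the FIFO when c1 allows."""
        if self.pending and self.c1():
            self.fifo.append(self.pending.popleft())
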
103: conflict handling when an access request generated by the address merging unit and an access request popped from the head of the FIFO queue are to be placed at the tail of the FIFO queue in the same cycle.
If an access request generated by the address merging unit and an access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle, the request popped from the head is given high priority and is placed at the tail first. During this time, the address merging unit suspends the generation of new access requests until no request popped from the head of the FIFO queue remains to be placed at the tail.
To this end, a corresponding control logic 2 is designed at the tail of the FIFO queue, and the tag comparison result r of the access request is used as a control signal to select the input of the FIFO tail. When the access result r of the request in the L1 cache is a hit or a miss, the control signal c2 is true and gates input path2 of the multiplexer in control logic 2, meaning that the access request generated by the address merging unit is placed at the tail of the FIFO queue; when the access result r is a reservation stall, the control signal c2 is false and gates input path1 of the multiplexer in control logic 2, meaning that the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
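The selection performed at the FIFO tail by control logic 2 can be summarized by the following Python sketch; the function and argument names are illustrative assumptions.

def fifo_tail_mux(c2, recirculated_req, merged_req):
    """Select the request written to the FIFO tail in this cycle.

    c2 is true when the head request hit or missed, and false on a reservation
    stall. Returns the selected request and whether the merging unit must pause.
    """
    if not c2 and recirculated_req is not None:
        # path1 wins: the recirculated request has priority, so the address
        # merging unit suspends generating new requests for this cycle
        return recirculated_req, True
    # path2: accept the request generated by the address merging unit
    return merged_req, False

In this model the address merging unit simply retries in the next cycle in which no recirculated request needs the tail, which matches the conflict handling described above.
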
Example 2
The following further explains and verifies Embodiment 1 of the present invention against the way the prior art handles reservation stalls of memory access requests, as described in detail below:
The memory access request FIFO queue has 32 entries. The GPU configuration is: 15 streaming multiprocessors; 6 DRAM channels; a maximum of 1536 threads per streaming multiprocessor; a 128 KB register file and 48 KB of shared memory per streaming multiprocessor; an L1 cache that is 4-way set-associative with 128-byte cache lines, 32 sets and 16 KB total capacity; an L2 cache (second-level cache) that is 8-way set-associative with 128-byte cache lines and 128 KB total capacity; an L1 cache access latency of 1 cycle; an L2 cache access latency of 120 cycles; and a DRAM access latency of 220 cycles.
As shown in FIG. 2, assume that all cache lines in the L1 cache are available in the initial state (cache cold start), and that the access requests stored in the FIFO queue, req-a0, req-a1, req-a2, ..., req-a20, come from the memory access instruction inst-a. Based on the address mapping, req-a0 ... req-a4 access set-0 of the L1 cache and req-a5 ... req-a9 access set-1 of the L1 cache.
Following the first-in first-out access order, req-a0 first accesses the L1 cache and incurs a cache miss, so one cache line in set-0 is allocated to req-a0 and put into the reserved state (R); req-a0 is then sent to the next-level memory and, at the same time, popped from the head of the FIFO queue by the FIFO controller. In the following three cycles the remaining three cache lines of set-0 are allocated to req-a1, req-a2 and req-a3 respectively, so all cache lines in set-0 are then in the reserved state (R). When req-a4 goes on to access set-0 and misses, there is no cache line in set-0 that can be allocated, so req-a4 incurs a reservation stall: the FIFO controller cannot pop req-a4 from the head of the FIFO queue and must instead wait for req-a0, req-a1, req-a2 or req-a3 to return from the next-level memory and release the corresponding reservation. Although requests such as req-a5 do not need to access set-0, and therefore do not need to wait for req-a0, req-a1, req-a2 and req-a3 to return, they still have to wait because req-a4 is blocked in front of them, which greatly reduces the processing efficiency of memory access requests.
On the other hand, all the access requests stored in the FIFO queue at this point belong to inst-a. Although the FIFO queue still has free entries, the FIFO controller informs the warp scheduler through control signal c1 that the load/store unit still cannot process other memory access instructions such as inst-b, so those instructions are also affected by the reservation stall of inst-a, which greatly reduces the processing efficiency of memory access instructions.
As shown in FIG. 1, after the embodiment of the present invention is adopted, when req-a4 incurs a reservation stall the FIFO controller pops req-a4 from the head of the FIFO queue while control signal c2 opens data path1; req-a4 is placed at the tail of the FIFO queue and req-a5 becomes the new head of the queue, so the reservation stall is avoided.
In the next cycle req-a5 accesses set-1, which increases the processing speed of the access requests. Moreover, the address merging unit has by now merged all the access requests of inst-a, so it informs the warp scheduler that the load/store unit can accept inst-b. Assume that inst-b generates 24 access requests in total (req-b0 ... req-b23); req-b0 and req-a4 then need to be placed at the tail of the FIFO queue at the same time, and an access conflict occurs.
In the embodiment of the invention, req-a4, which incurred the reservation stall, is given the higher priority, so req-a4 is placed into the FIFO queue first; during this cycle, control signal c2 directs the address merging unit in the load/store unit to suspend merging the access requests of inst-b. When the access conflict at the tail of the FIFO queue is over, control signal c2 allows the address merging unit to continue merging the access requests of inst-b. The embodiment of the invention therefore also improves the processing speed of memory access instructions.
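The improved behaviour of Example 2 can be replayed with the small script below, assuming the set mapping given above (req-a0 ... req-a4 to set-0, req-a5 ... req-a9 to set-1) and four ways per set; the variable names are illustrative. The resulting trace shows req-a4 being recirculated to the FIFO tail while req-a5 is processed in the very next cycle.

from collections import deque

reserved = {0: 0, 1: 0}   # number of reserved lines per set (4 ways each)
fifo = deque(
    [(f"req-a{i}", 0) for i in range(5)] +        # req-a0 ... req-a4 -> set-0
    [(f"req-a{i}", 1) for i in range(5, 10)]      # req-a5 ... req-a9 -> set-1
)

trace = []
for _ in range(7):                 # seven cycles
    name, s = fifo.popleft()
    if reserved[s] < 4:
        reserved[s] += 1           # miss: allocate a line and mark it reserved
        trace.append((name, "miss"))
    else:
        fifo.append((name, s))     # reservation stall: recirculate to the tail
        trace.append((name, "stall, moved to tail"))

# trace == [("req-a0", "miss"), ("req-a1", "miss"), ("req-a2", "miss"),
#           ("req-a3", "miss"), ("req-a4", "stall, moved to tail"),
#           ("req-a5", "miss"), ("req-a6", "miss")]
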
As shown in FIG. 3, the average (geometric mean, GM) performance of the GPU improves by 23% after the embodiment of the present invention is adopted.
In the embodiment of the present invention, except where a specific device model is described, the models of the other devices are not limited, as long as the devices can perform the functions described above.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A method for alleviating stalls when GPU memory access requests and memory access instructions access the cache, characterized by comprising the following steps:
the memory access request at the head of the FIFO queue accesses the L1 cache, and its tag is compared with the tags in the L1 cache; if the request incurs a reservation stall, it is popped from the head of the FIFO queue and placed at the tail of the FIFO queue;
a first control logic controls the routing of the memory access request after it is popped from the head of the FIFO queue;
and a second control logic and first and third control signals are constructed to pipeline the memory access instructions between the warp scheduler and the load/store unit, so that the next memory access instruction can be processed as soon as the address merging unit in the load/store unit has merged all access requests, and access requests can be generated and stored whenever the FIFO queue has a free entry.
2. The method for alleviating stalls when GPU memory access requests and memory access instructions access the cache according to claim 1, wherein the first control logic controls the routing of the memory access request after it is popped from the head of the FIFO queue as follows:
when the access result of the request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through the inverter;
the first tri-state gate is then in the conducting state and the second tri-state gate is in the high-impedance state, indicating that the access request is transferred to the tail of the FIFO queue after being popped from the head of the FIFO queue.
3. The method for alleviating stalls when GPU memory access requests and memory access instructions access the cache according to claim 1, wherein constructing the second control logic and the first and third control signals for pipelining the memory access instructions between the warp scheduler and the load/store unit specifically comprises:
1) if the access requests have not all been merged by the address merging unit, the third control signal is false, informing the warp scheduler that no other memory access instruction may be sent to the load/store unit; otherwise, the third control signal is true, informing the warp scheduler that another memory access instruction may be sent to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address merging unit through the first control signal, so as to control the generation of access requests.
4. The method for alleviating stalls when GPU memory access requests and memory access instructions access the cache according to claim 3, wherein sending the state of whether the FIFO queue is full to the address merging unit through the first control signal to control the generation of access requests specifically comprises:
if the FIFO queue is full, the first control signal is false, informing the address merging unit to suspend merging access requests until the FIFO queue has a free entry;
otherwise, the address merging unit is informed to continue merging access requests and to place the merged requests at the tail of the FIFO queue.
5. The method for alleviating stalls when GPU memory access requests and memory access instructions access the cache according to any one of claims 1-4, characterized in that the method further comprises:
conflict handling when a memory access request generated by the address merging unit and a memory access request popped from the head of the FIFO queue are to be placed at the tail of the FIFO queue in the same cycle.
6. The method for alleviating stalls when GPU memory access requests and memory access instructions access the cache according to claim 5, wherein the conflict handling specifically comprises:
giving high priority to the memory access request popped from the head of the FIFO queue, and placing it at the tail of the FIFO queue through the second control logic;
the address merging unit suspends the generation of new access requests until no request popped from the head of the FIFO queue remains to be placed at the tail of the FIFO queue.
7. The method for alleviating stalls when GPU memory access requests and memory access instructions access the cache according to claim 6, wherein placing the access request at the tail of the FIFO queue through the second control logic comprises:
when the access result of the request in the L1 cache is a hit or a miss, the second control signal is true, input path2 of the multiplexer is gated, and the access request generated by the address merging unit is placed at the tail of the FIFO queue;
when the access result of the request in the L1 cache is a reservation stall, the second control signal is false, input path1 of the multiplexer is gated, and the access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
CN201910601175.0A 2019-07-04 2019-07-04 Method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache Active CN110457238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache

Publications (2)

Publication Number Publication Date
CN110457238A CN110457238A (en) 2019-11-15
CN110457238B (en) 2023-01-03

Family

ID=68482257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910601175.0A Active CN110457238B (en) 2019-07-04 2019-07-04 Method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache

Country Status (1)

Country Link
CN (1) CN110457238B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736900B (en) * 2020-08-17 2020-11-27 广东省新一代通信与网络创新研究院 Parallel double-channel cache design method and device
CN112817639B (en) * 2021-01-13 2022-04-08 中国民航大学 Method for accessing register file by GPU read-write unit through operand collector
CN113722111A (en) * 2021-11-03 2021-11-30 北京壁仞科技开发有限公司 Memory allocation method, system, device and computer readable medium
CN114595070B (en) * 2022-05-10 2022-08-12 上海登临科技有限公司 Processor, multithreading combination method and electronic equipment
CN114637609B (en) * 2022-05-20 2022-08-12 沐曦集成电路(上海)有限公司 Data acquisition system of GPU (graphic processing Unit) based on conflict detection
CN114647516B (en) * 2022-05-20 2022-08-23 沐曦集成电路(上海)有限公司 GPU data processing system based on FIFO structure with multiple inputs and single output
CN116302504B (en) * 2023-02-23 2024-08-27 海光信息技术股份有限公司 Thread block processing system, method and related equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103927277A (en) * 2014-04-14 2014-07-16 中国人民解放军国防科学技术大学 CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device
CN104461758A (en) * 2014-11-10 2015-03-25 中国航天科技集团公司第九研究院第七七一研究所 Exception handling method and structure tolerant of missing cache and capable of emptying assembly line quickly
CN106407063A (en) * 2016-10-11 2017-02-15 东南大学 Method for simulative generation and sorting of access sequences at GPU L1 Cache

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
A Load Balancing Technique for Memory Channels; Byoungchan Oh et al.; MEMSYS; 2018-10-04; pp. 1-12 *
A modified post-TnL vertex cache for the multi-shader embedded GPUs; Jizeng Wei et al.; IEICE Electronics Express; 2015-04-27; vol. 12, no. 10, pp. 1-12 *
An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns; Bingchao Li et al.; ACM Transactions on Architecture and Code Optimization; 2019-06-30; vol. 16, no. 3, pp. 1-24 *
Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management; Bingchao Li et al.; 2017 IEEE International Parallel and Distributed Processing Symposium; 2017-12-31; pp. 82-91 *
Exploring new features of high-bandwidth memory for GPUs; Bingchao Li et al.; IEICE Electronics Express; 2016-06-28; vol. 13, no. 14, pp. 1-12 *
Improving SIMD utilization with thread-lane shuffled compaction in GPGPU; Li Bingchao et al.; Chinese Journal of Electronics; 2015-10-31; vol. 24, no. 4, pp. 684-688 *
MRPB: Memory Request Prioritization for Massively Parallel Processors; Wenhao Jia et al.; International Symposium on High Performance Computer Architecture; 2014-12-31; pp. 272-283 *
NVIDIA Tesla: A Unified Graphics and Computing Architecture; Erik Lindholm et al.; IEEE Micro; 2008-12-31; pp. 39-55 *
Research on performance optimization of cache replacement algorithms on heterogeneous multi-core processors; Fan Qingwen; China Master's Theses Full-text Database, Information Science and Technology; 2018-07-15; I137-66 *
Research on GPU parallel implementation of image resampling and tile cache strategy optimization; Zhang Tingting; China Master's Theses Full-text Database, Information Science and Technology; 2018-08-15; I140-440 *

Also Published As

Publication number Publication date
CN110457238A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110457238B (en) Method for alleviating stalls when GPU (graphics processing unit) memory access requests and memory access instructions access the cache
US11334262B2 (en) On-chip atomic transaction engine
US9898409B2 (en) Issue control for multithreaded processing
US6732242B2 (en) External bus transaction scheduling system
KR100936601B1 (en) Multi-processor system
US20060206635A1 (en) DMA engine for protocol processing
JP4322259B2 (en) Method and apparatus for synchronizing data access to local memory in a multiprocessor system
US10019283B2 (en) Predicting a context portion to move between a context buffer and registers based on context portions previously used by at least one other thread
WO2016101664A1 (en) Instruction scheduling method and device
US20140129784A1 (en) Methods and systems for polling memory outside a processor thread
GB2421328A (en) Scheduling threads for execution in a multi-threaded processor.
CN108549574A (en) Threading scheduling management method, device, computer equipment and storage medium
US8180998B1 (en) System of lanes of processing units receiving instructions via shared memory units for data-parallel or task-parallel operations
WO2003038602A2 (en) Method and apparatus for the data-driven synchronous parallel processing of digital data
US9870315B2 (en) Memory and processor hierarchy to improve power efficiency
US11868306B2 (en) Processing-in-memory concurrent processing system and method
CN112817639B (en) Method for accessing register file by GPU read-write unit through operand collector
US6016531A (en) Apparatus for performing real time caching utilizing an execution quantization timer and an interrupt controller
US11899970B2 (en) Storage system and method to perform workload associated with a host
Fang et al. Core-aware memory access scheduling schemes
CN110647357A (en) Synchronous multithread processor
EP4160423B1 (en) Memory device, memory device operating method, and electronic device including memory device
Gu et al. Cart: Cache access reordering tree for efficient cache and memory accesses in gpus
Sahoo et al. CAMO: A novel cache management organization for GPGPUs
CN114661352A (en) Method and device for accessing fine-grained cache through GPU multi-grained access memory request

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant