CN111459551B - Microprocessor with highly advanced branch predictor

Info

Publication number: CN111459551B (application CN202010289064.3A; published earlier as CN111459551A)
Authority: CN (China)
Inventors: 巩凡工, 杨梦晨
Current assignee: Shanghai Zhaoxin Semiconductor Co., Ltd.
Original assignee: VIA Alliance Semiconductor Co., Ltd.
Legal status: Active (application granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3802: Instruction prefetching
    • G06F 9/3804: Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F 9/3806: Instruction prefetching for branches using address prediction, e.g. return stack, branch history buffer
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A microprocessor in which a branch predictor and an instruction cache are decoupled by a fetch target queue, and in which the branch predictor performs branch prediction on N instruction addresses in parallel. In the current cycle, those addresses that are not bypassed by a predicted taken jump and that do not overlap the addresses already loaded into the fetch target queue in the previous cycle are pushed into the fetch target queue, to be popped in a later cycle as fetch addresses of the instruction cache.

Description

Microprocessor with highly advanced branch predictor
Technical Field
The present application relates to instruction fetching in microprocessors.
Background
In computer architecture, a branch predictor is a digital circuit that predicts the outcome of branch instructions, such as 'if-then-else' conditional branches, 'call' instructions, 'return' instructions, and unconditional 'jump' instructions. A branch predictor effectively accelerates instruction fetching and significantly improves the performance of a pipelined microprocessor.
In conventional branch prediction techniques, the branch predictor typically uses the same address as, and runs in synchronization with, the instruction-fetch operation of the instruction cache (L1i). Branch prediction can therefore proceed no faster than instruction fetching, and any delay in either the branch predictor or the instruction cache stalls the other. Improving branch prediction technology is therefore an important issue in the art.
Disclosure of Invention
In the microprocessor introduced here, a branch predictor performing multi-stage pipelined operations and an instruction cache are decoupled by a fetch target queue, and the fetch target queue is filled flexibly to accommodate the branch predictor's parallel prediction of multiple addresses.
A microprocessor implemented according to one embodiment of the present application includes an instruction cache, a branch predictor, and a fetch target queue. The instruction cache fetches instructions according to a fetch address. The branch predictor performs branch prediction on N instruction addresses in parallel, where N is an integer greater than 1. The fetch target queue is coupled between the branch predictor and the instruction cache. In the current cycle, those of the N instruction addresses for which the branch predictor has completed branch prediction in parallel that are not bypassed by a predicted taken jump and do not overlap the instruction addresses pushed into the fetch target queue in the previous cycle are pushed into the fetch target queue, to be popped in a later cycle as fetch addresses of the instruction cache.
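For illustration only, the following C++ sketch models this push rule; the type and function names are the editor's assumptions, not the patent's circuit, and the wrap case (in which the address of an adjacent fetch unit is also pushed) is omitted here and treated in the detailed description.

#include <cstdint>
#include <deque>
#include <vector>

// Each cycle the branch predictor finishes prediction for N addresses in
// parallel. An address is pushed into the fetch target queue (FTQ) only if
// it is meaningful (not bypassed by an earlier predicted-taken jump in the
// same group) and was not already pushed in the previous cycle. The
// instruction cache pops one address per cycle as its fetch address AddrL1i.
struct Prediction {
    uint64_t addr;           // fetch-unit address finishing the U stage
    bool     taken;          // a taken branch is predicted in this fetch unit
    bool     alreadyPushed;  // overlapped with the previous cycle's group
};

class FetchTargetQueue {
    std::deque<uint64_t> q_;
public:
    void pushCycle(const std::vector<Prediction>& group) {
        for (const Prediction& p : group) {
            if (!p.alreadyPushed)
                q_.push_back(p.addr);
            if (p.taken)        // later addresses in the group are bypassed;
                break;          // prediction resumes at the jump target
        }
    }
    bool pop(uint64_t& addrL1i) {      // next fetch address for the L1i
        if (q_.empty()) return false;  // cache falls back to other sources
        addrL1i = q_.front();
        q_.pop_front();
        return true;
    }
};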
In one embodiment, the branch predictor performs a first pipeline operation with multiple stages, and the instruction cache performs a second pipeline operation with multiple stages.
In one embodiment, if, among the N instruction addresses for which the branch predictor completes branch prediction in parallel in the current cycle, a taken branch instruction is predicted in one of the corresponding N fetch units and that branch instruction spans two fetch units, the instruction address of the fetch unit adjacent to the one containing the taken branch instruction is also pushed into the fetch target queue, to be popped in a later cycle as a fetch address of the instruction cache.
In one embodiment, N is 3, the fetch unit corresponding to each instruction address is M bytes, and M is an integer. The instruction addresses on which the branch predictor performs branch prediction in parallel in the current cycle are PC, PC+M, and PC+2M.
In one embodiment, one instruction address overlaps between the N instruction addresses completing branch prediction in parallel in the current cycle and those of the previous cycle.
When the instruction address PC is not bypassed by a taken jump and was not pushed into the fetch target queue in the previous cycle: the fetch target queue provides a first cell storing the instruction address PC; the fetch target queue provides a second cell storing the instruction address PC+M when PC is not predicted taken, or when PC is predicted taken but the taken branch instruction spans two fetch units; the fetch target queue provides a third cell storing the instruction address PC+2M when neither PC nor PC+M is predicted taken, or when PC is not predicted taken and PC+M is predicted taken but the taken branch instruction spans two fetch units; and the fetch target queue provides a fourth cell storing the instruction address PC+3M when neither PC nor PC+M is predicted taken but PC+2M is predicted taken and the taken branch instruction spans two fetch units.
When the instruction addresses PC and PC+M are not bypassed by a taken jump but PC was already pushed into the fetch target queue in the previous cycle: the fetch target queue provides a first cell storing the instruction address PC+M; the fetch target queue provides a second cell storing the instruction address PC+2M when PC+M is not predicted taken, or when PC+M is predicted taken but the taken branch instruction spans two fetch units; and the fetch target queue provides a third cell storing the instruction address PC+3M when PC+M is not predicted taken but PC+2M is predicted taken and the taken branch instruction spans two fetch units.
In one embodiment, two instruction addresses overlap between the N instruction addresses completing branch prediction in parallel in the current cycle and those of the previous cycle.
When the instruction address PC is not bypassed by a taken jump and was not pushed into the fetch target queue in the previous cycle: the fetch target queue provides a first cell storing the instruction address PC; the fetch target queue provides a second cell storing the instruction address PC+M when PC is not predicted taken, or when PC is predicted taken but the taken branch instruction spans two fetch units; the fetch target queue provides a third cell storing the instruction address PC+2M when neither PC nor PC+M is predicted taken, or when PC is not predicted taken and PC+M is predicted taken but the taken branch instruction spans two fetch units; and the fetch target queue provides a fourth cell storing the instruction address PC+3M when neither PC nor PC+M is predicted taken but PC+2M is predicted taken and the taken branch instruction spans two fetch units.
When the instruction addresses PC and PC+M are not bypassed by a taken jump but were already pushed into the fetch target queue in the previous cycle: the fetch target queue provides a first cell storing the instruction address PC+2M; and the fetch target queue provides a second cell storing the instruction address PC+3M when PC+2M is predicted taken and the taken branch instruction spans two fetch units.
In one embodiment, none of the N instruction addresses completing branch prediction in parallel in the current cycle overlap with those of the previous cycle.
When the instruction address PC is not bypassed by a taken jump: the fetch target queue provides a first cell storing the instruction address PC; the fetch target queue provides a second cell storing the instruction address PC+M when PC is not predicted taken, or when PC is predicted taken but the taken branch instruction spans two fetch units; the fetch target queue provides a third cell storing the instruction address PC+2M when neither PC nor PC+M is predicted taken, or when PC is not predicted taken and PC+M is predicted taken but the taken branch instruction spans two fetch units; and the fetch target queue provides a fourth cell storing the instruction address PC+3M when neither PC nor PC+M is predicted taken but PC+2M is predicted taken and the taken branch instruction spans two fetch units.
In one embodiment, a cell of the fetch target queue includes at least three fields. A first field stores a taken-jump flag indicating whether a taken branch instruction is predicted in the fetch unit corresponding to the instruction address stored in the cell. A second field stores a wrap flag indicating whether the taken branch instruction spans two fetch units. A third field stores the instruction address.
Drawings
FIG. 1 illustrates a microprocessor 100 according to one embodiment of the present application;
FIG. 2 details the design of the FTQ and the multiplexer 116 according to one embodiment of the present application;
FIG. 3 illustrates how multiple addresses for parallel branch prediction are pushed into the fetch target queue FTQ, and how the contents of the FTQ cell are popped out, according to one embodiment of the present disclosure;
FIGS. 4A, 4B, and 4C illustrate how the address AddrBP of each cell is filled in, according to various embodiments of the present application;
FIG. 5A illustrates when the synchronization signal Sync is pulled up again after a flush occurs, according to one embodiment of the present application;
FIG. 5B illustrates how the synchronization signal Sync changes if a jump from address 60 to address 200 is predicted (while the fetch target queue FTQ is not empty) in cycle T5 of FIG. 5A; and
FIG. 5C illustrates when the synchronization signal Sync is pulled up after a flush occurs and a jump target address is then predicted, according to one embodiment of the present application.
List of reference numerals
100: a microprocessor;
102: instructions;
104: an instruction cache;
106: a decoder;
108: an execution unit;
110: a branch predictor;
114: a jump target address;
116: a multiplexer;
118: a self-increment address (dedicated to the instruction cache);
120, 122: flush addresses;
124: a multiplexer;
202: a multiplexer;
204: the address (Addr) popped from a cell of the fetch target queue FTQ;
206, 208: multiplexers;
410, 420, 430: tables;
AddrBP: an instruction address;
AddrEqual: a comparison signal;
AddrL1i: an instruction fetch address;
BTACQ: a queue;
FTQ: a fetch target queue;
RdPtr: a read pointer;
PDQ: a queue;
Sync: a synchronization signal;
T: a taken-jump flag;
T0 … T7: cycles;
TargPtr: a pointer;
W: a wrap (cross-fetch-unit) flag;
WrAddr: an address;
WrapTargPtr: a pointer; and
WrPtr, WrPtr0 … WrPtr3: write pointers, where WrPtr0 is the starting write pointer.
Detailed Description
The following description sets forth various embodiments of the invention. It is made for the purpose of illustrating the general principles of the invention and is not meant to limit the invention. The actual scope of the invention shall be determined by the claims.
FIG. 1 illustrates a microprocessor 100 according to one embodiment of the present application.
Based on the fetch address AddrL1i, an instruction 102 is fetched from an instruction cache 104 (L1i, as is well known in the art), decoded by a decoder 106, and finally executed by an execution unit 108. The fetch unit may be 16 bytes (16B); that is, one fetch operation fetches 16 bytes of instructions. Unlike the conventional technique, in which branch prediction is performed in synchronization with the fetch address AddrL1i of the instruction cache 104, the microprocessor 100 of the present application is designed so that the instruction address AddrBP on which the branch predictor 110 performs branch prediction runs far ahead of the fetch address AddrL1i of the instruction cache 104.
Referring to FIG. 1, microprocessor 100 provides a fetch target queue (FTQ) coupled between branch predictor 110 and instruction cache 104. The queue stores at least one instruction address AddrBP on which branch predictor 110 has performed branch prediction, to be popped as the fetch address AddrL1i of instruction cache 104; the instruction address AddrBP used for branch prediction runs ahead of the fetch address AddrL1i. Specifically, all or part of the instruction addresses AddrBP for which branch predictor 110 performs branch prediction are pushed into the fetch target queue FTQ and later popped as the fetch address AddrL1i of instruction cache 104. The fetch target queue FTQ thereby decouples branch predictor 110 from instruction cache 104: branch predictor 110 does not need to perform branch prediction on the fetch address AddrL1i in synchronization with instruction cache 104, but instead performs branch prediction on instruction addresses AddrBP independently. According to the present disclosure, branch predictor 110 may operate substantially ahead of instruction cache 104.
This decoupled design (the fetch target queue FTQ between branch predictor 110 and instruction cache 104) can greatly improve the branch prediction and instruction fetch efficiency of the microprocessor. Because branch predictor 110 predicts branch jumps early, meaningless instruction addresses AddrBP (those not in the direction of the predicted branch) are never pushed into the fetch target queue FTQ and are thus eliminated. Only meaningful addresses (in the direction of the predicted branch) are pushed into the fetch target queue FTQ, forming a fetch trace that directs the fetching of instructions from instruction cache 104.
This paragraph describes the signal flow when a branch jump is predicted. As shown, the instruction address AddrBP self-increments into branch predictor 110 every cycle for branch prediction. When a jump is predicted, the jump target address 114 updates the instruction address AddrBP, so that branch predictor 110 instead performs branch prediction starting from the jump target address. In addition, if the fetch target queue FTQ happens to be empty, the instruction address AddrBP updated to the jump target address 114 may bypass the fetch target queue FTQ and immediately serve as the fetch address AddrL1i of instruction cache 104; multiplexer 116 provides the path for this direct transfer. Thereafter, instruction cache 104 uses its (instruction-cache-dedicated) self-increment address 118, which increments cycle by cycle from the jump target address, to update the fetch address AddrL1i via multiplexer 116. After the contents of the fetch target queue FTQ catch up with the requests of instruction cache 104 (i.e., the self-increment address 118 also exists in the fetch target queue FTQ, being equal to one of the instruction addresses AddrBP pushed into the queue), instruction cache 104 may switch back to using the address popped from the fetch target queue FTQ as the fetch address AddrL1i.
Another situation to discuss is when the running pipeline of microprocessor 100 requires a flush of the fetch address AddrL1i. For example, branch predictor 110 may be inaccurate, so the fetch trace carried in the fetch target queue FTQ may be wrong, and the decoder 106 or execution units 108 at the back end of the pipeline may initiate a flush upon recognizing the earlier branch misprediction; a flush may also be initiated when an exception occurs during operation of the decoder 106 or execution unit 108. When a flush is initiated, the decoder 106 or execution unit 108 returns a flush address 120 or 122, which flushes the fetch address AddrL1i via multiplexer 116 and simultaneously empties the fetch target queue FTQ. In subsequent cycles, instruction cache 104 continues fetching from the flush address 120/122, providing its self-increment address 118 to multiplexer 116 cycle by cycle so that the fetch address AddrL1i is updated accordingly. In addition, the flush address 120/122 is also coupled through multiplexer 124 and output as the instruction address AddrBP, so that branch predictor 110 likewise switches to performing branch prediction from the flush address 120/122 and wastes no further prediction in the wrong direction.
When a flush occurs in the pipeline of microprocessor 100, instruction cache 104 first takes the flush address 120/122 as the fetch address AddrL1i and then updates it with the self-increment address 118 that increments from the flush address 120/122. Thus, although the fetch target queue FTQ is empty, instruction cache 104 does not stall for lack of content popped from the queue. After the self-increment address 118 also exists in the fetch target queue FTQ (i.e., the self-increment address 118 equals one of the instruction addresses AddrBP subsequently pushed into the queue), instruction cache 104 may switch back to using the address popped from the fetch target queue FTQ as the fetch address AddrL1i.
In summary, the fetch address AddrL1i of instruction cache 104 has two kinds of sources: from the preceding stage (popped from the fetch target queue FTQ, or the instruction address AddrBP transferred directly), and from the instruction-cache-dedicated self-increment address 118 (which increments from the jump target address when a branch jump is predicted while no instruction address is available to pop from the fetch target queue FTQ, or increments from the flush address when a flush occurs). The corresponding operating modes are hereinafter called synchronous mode and asynchronous mode. In synchronous mode, multiplexer 116 takes the address popped from the fetch target queue FTQ, or the instruction address AddrBP directly, as the source of the fetch address AddrL1i. In asynchronous mode, multiplexer 116 takes the other inputs 120, 122, or 118 as the source of the fetch address AddrL1i. With this switching between synchronous and asynchronous modes, instruction cache 104 is hardly ever delayed by branch prediction, and the fetch target queue FTQ is used to full effect.
In addition, a requested instruction may not be found in instruction cache 104 (a cache miss). In that case, instruction cache 104 first completes the instruction load, then receives the missed address popped again from the fetch target queue FTQ, and thereby completes the fetch.
The microprocessor 100 of FIG. 1 also includes a queue PDQ, discussed below.
Branch predictor 110 typically records various kinds of branch prediction information or tables in memory (e.g., SRAM), for example a branch target address cache (BTAC) and a branch history table (BHT). The branch target address cache BTAC may carry the branch type, branch target address …, and so on, of the branch instructions contained in a fetch unit. The branch history table BHT is looked up to predict the branch direction, i.e., to judge whether a branch is taken or not taken. This information and these tables are updated as branch predictor 110 operates. Because branch predictor 110 runs far ahead of the fetch address AddrL1i, the accumulated update information is considerable, so it is first pushed into the queue PDQ. The update information is popped from the queue PDQ at appropriate times and sent to the back end of the microprocessor 100 pipeline for use in updating the BTAC and BHT.
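As a rough illustration only (the field names below are assumptions; the actual BTAC/BHT entry formats are not specified at this level of the patent), the update information accumulated by the far-ahead branch predictor can be modeled as records queued in the PDQ until the back end is ready to update the BTAC and BHT:

#include <cstdint>
#include <queue>

// Hypothetical record of one prediction-information update.
struct PredictionUpdate {
    uint64_t fetchUnitAddr;  // fetch unit whose prediction info changed
    uint8_t  branchType;     // branch type carried by the BTAC
    uint64_t targetAddr;     // branch target address carried by the BTAC
    bool     taken;          // direction outcome used to train the BHT
};

// Filled by branch predictor 110 as it runs ahead; drained at appropriate
// times toward the back-end logic that updates the BTAC and BHT.
std::queue<PredictionUpdate> pdq;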
In one embodiment, within the multi-stage (e.g., up to 20-stage) pipeline of microprocessor 100, instruction cache 104 and branch predictor 110 each occupy four stages: C, I, B, and U. An instruction address AddrBP must complete branch prediction (reach the final U stage) without being bypassed by the predicted jump of a leading address in order to be pushed into the fetch target queue FTQ. Thus only meaningful addresses (in the direction of the predicted branch) are pushed into the fetch target queue FTQ.
FIG. 2 details the design of the FTQ and the multiplexer 116 according to one embodiment of the present application.
Each cell (entry) of the fetch target queue FTQ can store three kinds of information: a taken-jump flag (predicted-taken flag) T; a cross-fetch-unit flag (wrap flag) W; and an instruction address AddrBP. In one embodiment, a cell has 50 bits. The instruction address AddrBP takes 48 bits. The taken-jump flag T may occupy one bit and indicates whether a taken branch instruction is predicted in the fetch unit indicated by the instruction address AddrBP. The wrap flag W may occupy one bit and indicates whether the taken branch instruction crosses fetch units (i.e., whether it "wraps"): that is, whether the corresponding fetch unit contains a branch instruction predicted taken that spans two fetch units (16B), part of the taken branch instruction lying in the first 16B and the rest in the adjacent second 16B. Note that when the fetch unit at instruction address AddrBP contains no branch instruction, or contains a branch instruction that is not predicted taken, the wrap flag W need not be set even if that branch instruction crosses fetch units, because without a taken jump the next fetch unit (16B) is not bypassed and must be fetched in order anyway, so no marking is required.
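A cell with this layout can be sketched as a C++ bit-field (a notational convenience only; the hardware packs the same 50 bits):

#include <cstdint>

// One FTQ cell as described above: 1-bit taken-jump flag T, 1-bit wrap
// flag W, and a 48-bit instruction address AddrBP -- 50 bits in total.
struct FtqCell {
    uint64_t t      : 1;   // taken-jump flag
    uint64_t w      : 1;   // wrap (cross-fetch-unit) flag
    uint64_t addrBP : 48;  // instruction address of the fetch unit
};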
Each entry of information (composed of T, W, and AddrBP) is pushed into the corresponding cell of the fetch target queue FTQ according to the write pointer WrPtr. The read pointer RdPtr selects the cell whose information (T, W, and AddrBP) is popped from the fetch target queue FTQ.
The multiplexer 116 of FIG. 1 may, in the embodiment of FIG. 2, be implemented by a combination of three multiplexers 202, 206, and 208. In addition to receiving the address (AddrBP) 204 popped from a cell of the fetch target queue FTQ, multiplexer 202 also receives the address WrAddr that is about to be written into a cell of the queue. When the fetch target queue FTQ is empty, multiplexer 202 sends the address WrAddr onward to multiplexer 206 directly, avoiding the latency of a queue access. The multiplexer 206, controlled by a synchronization signal Sync, determines whether the fetch target queue FTQ operates in the aforementioned synchronous or asynchronous mode. When the synchronization signal Sync is true, the fetch target queue FTQ operates in synchronous mode: the address provided by the front end (from multiplexer 202), whether popped from the fetch target queue FTQ or received directly as WrAddr, is output via multiplexers 206 and 208 as the fetch address AddrL1i of instruction cache 104. When the synchronization signal Sync is false, the fetch target queue FTQ operates in asynchronous mode: the fetch address AddrL1i of instruction cache 104 may be the instruction-cache self-increment address 118 incrementing from the jump target address 114 when a taken jump is predicted, or the self-increment address 118 incrementing from the flush address 120/122 when a flush occurs.
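The selection behavior of this multiplexer chain can be summarized by the following sketch (a behavioral abstraction under the assumptions above, not the actual circuit):

#include <cstdint>

// In synchronous mode (Sync true) the fetch address comes from the FTQ pop
// (204), or directly from WrAddr when the queue is empty (the bypass through
// multiplexer 202). In asynchronous mode it comes from the instruction-cache
// self-increment address 118, which starts from the jump target address 114
// or from the flush address 120/122.
uint64_t selectAddrL1i(bool sync, bool ftqEmpty,
                       uint64_t ftqPopAddr,   // 204: popped from the FTQ
                       uint64_t wrAddr,       // address about to fill a cell
                       uint64_t selfIncAddr)  // 118: self-increment address
{
    if (sync)
        return ftqEmpty ? wrAddr : ftqPopAddr;  // front-end (synchronous) path
    return selfIncAddr;                         // asynchronous path
}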
When a requested instruction is not found in instruction cache 104 (a cache miss), the read pointer RdPtr returns to the cell holding the missed address. After instruction cache 104 completes the instruction load, the missed address pointed to by the read pointer RdPtr is popped again from the fetch target queue FTQ, and instruction cache 104 performs the fetch.
In one embodiment, branch predictor 110 performs branch prediction on multiple addresses in parallel. FIG. 3 illustrates how multiple addresses predicted in parallel are pushed into the fetch target queue FTQ, and how the contents of its cells are popped, according to one embodiment of the present application. The following example performs branch prediction on three instruction addresses in parallel. In particular, when one of the three instruction addresses completing branch prediction in a cycle is predicted to contain a taken branch instruction in its fetch unit, and that taken branch instruction itself crosses fetch units (i.e., wraps, spanning two 16B units), the address of the fetch unit adjacent to the one containing the taken branch instruction is also pushed into the fetch target queue FTQ in the same cycle. The addresses pushed into the fetch target queue FTQ in parallel may therefore number up to four (3+1), stored starting from the cell pointed to by the pointer WrPtr0. The pointers WrPtr1, WrPtr2, and WrPtr3 point to the three cells that follow; together, the pointers WrPtr0 … WrPtr3 enable four cells to be filled in parallel. Operation of the fetch target queue FTQ additionally uses the pointers RdPtr, TargPtr, and WrapTargPtr. The read pointer RdPtr designates the cell whose contents are popped. If the fetch unit of the instruction address AddrBP pointed to by the read pointer RdPtr is predicted to contain a taken branch instruction, the pointer TargPtr marks the following cell, which holds the jump target address of that taken branch instruction. If the taken branch instruction itself crosses fetch units (wraps, i.e., the W flag is set), the pointer WrapTargPtr points to the cell storing the address of the adjacent fetch unit. Thanks to the pointers TargPtr and WrapTargPtr, when a taken branch instruction is encountered (and/or wraps), its jump target address can be obtained directly from the fetch target queue FTQ, with no extra storage resources required.
FIGS. 4A, 4B, and 4C illustrate how the address AddrBP of each cell is filled in, according to various embodiments of the present disclosure. Branch predictor 110 performs branch prediction on three addresses in parallel through a multi-stage first pipeline (C/I/B/U stages); instruction cache 104 performs a multi-stage second pipeline (e.g., also C/I/B/U stages). As shown in FIGS. 5A, 5B, and 5C below, the branch prediction pipeline comprises four stages C/I/B/U, and each fetch unit corresponds to a fetch of 16 bytes (16B) of instructions, so the addresses completing branch prediction in the current cycle are labeled PC, PC+16, and PC+32. The write condition of a cell also considers a flag, afterbr. Because the three instruction addresses AddrBP predicted in parallel may overlap by one or two addresses between consecutive cycles, the flag afterbr distinguishes the cases: when afterbr is true, the address PC completing branch prediction (finishing the U-stage pipeline operation) did not appear in the previous cycle's prediction; for example, PC is the first address after a branch jump or the first address after a flush. When the flag afterbr is false (~afterbr), the address PC completing branch prediction (finishing the U-stage pipeline operation) already appeared in the previous cycle's prediction.
FIG. 4A shows adjacent cycles of parallel branch prediction overlapping by exactly one address. For example, the first cycle sends instruction addresses A, B, C into the first stage of the branch prediction pipeline (the C stage shown in FIG. 5), and the second cycle sends addresses C, D, E into the C stage; among the self-increment addresses C, D, E, address C overlaps the previous cycle. Among the addresses PC, PC+16, and PC+32 completing the last stage of the branch prediction pipeline (the U stage shown in FIG. 5) in the current cycle, if the address PC is determined not to be bypassed by a taken jump, the instruction addresses AddrBP pushed into the fetch target queue FTQ fill the relevant cells according to table 410.
When the address PC was not pushed into the fetch target queue in the previous cycle (afterbr), up to the four cells pointed to by pointers WrPtr0 … WrPtr3 are available in this example. The cell pointed to by WrPtr0 stores the address PC, with no additional condition. The cell pointed to by WrPtr1 stores the address PC+16 when PC is not predicted taken, or when PC is predicted taken but the taken branch instruction itself crosses fetch units (i.e., wraps). The cell pointed to by WrPtr2 stores the address PC+32 when neither PC nor PC+16 is predicted taken, or when PC is not predicted taken but PC+16 is predicted taken and the taken branch instruction itself wraps. The cell pointed to by WrPtr3 stores the address PC+48 when neither PC nor PC+16 is predicted taken but PC+32 is predicted taken and the taken branch instruction itself wraps.
When the address PC was already pushed into the fetch target queue in the previous cycle (~afterbr), and PC+16 is not bypassed by a taken jump, the instruction addresses AddrBP pushed into the fetch target queue FTQ are filled in as follows. The cell pointed to by WrPtr0 stores the address PC+16, with no additional condition. The cell pointed to by WrPtr1 stores the address PC+32 when PC+16 is not predicted taken, or when PC+16 is predicted taken but the taken branch instruction itself wraps. The cell pointed to by WrPtr2 stores the address PC+48 when PC+16 is not predicted taken but PC+32 is predicted taken and the taken branch instruction itself wraps.
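Read as code, table 410's fill rules for this N = 3, M = 16 example look as follows. This is a sketch of the conditions just described (function and parameter names are illustrative), with taken[i] and wrap[i] denoting the predicted-taken and wrap flags of address PC + 16*i:

#include <cstdint>
#include <vector>

// Fill rules of table 410 (FIG. 4A, one-address overlap between cycles).
// afterbr is true when PC did not appear in the previous cycle's group.
// Returns the addresses stored at WrPtr0, WrPtr1, ... this cycle, assuming
// PC (or PC+16 in the ~afterbr case) is not bypassed by a taken jump.
std::vector<uint64_t> fillPerTable410(uint64_t pc, bool afterbr,
                                      const bool taken[3], const bool wrap[3])
{
    std::vector<uint64_t> cells;
    if (afterbr) {
        cells.push_back(pc);                                   // WrPtr0
        if (!taken[0] || wrap[0])
            cells.push_back(pc + 16);                          // WrPtr1
        if (!taken[0] && (!taken[1] || wrap[1]))
            cells.push_back(pc + 32);                          // WrPtr2
        if (!taken[0] && !taken[1] && taken[2] && wrap[2])
            cells.push_back(pc + 48);                          // WrPtr3
    } else {  // PC was already pushed in the previous cycle
        cells.push_back(pc + 16);                              // WrPtr0
        if (!taken[1] || wrap[1])
            cells.push_back(pc + 32);                          // WrPtr1
        if (!taken[1] && taken[2] && wrap[2])
            cells.push_back(pc + 48);                          // WrPtr2
    }
    return cells;
}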
FIG. 4B shows adjacent cycles of parallel branch prediction overlapping by two addresses. For example, the first cycle sends instruction addresses A, B, C into the first stage of the branch prediction pipeline (the C stage shown in FIG. 5), and the second cycle sends addresses B, C, D into the C stage; among the self-increment addresses B, C, D, the two addresses B, C overlap the previous cycle. Among the addresses PC, PC+16, and PC+32 completing the last stage of the branch prediction pipeline (the U stage shown in FIG. 5) in the current cycle, if the address PC is determined not to be bypassed by a taken jump, the instruction addresses AddrBP pushed into the fetch target queue FTQ fill the relevant cells according to table 420. Compared with table 410, the differences lie in the ~afterbr rows; the afterbr rows are the same as in table 410 and are not repeated below.
When the addresses PC and PC+16 were already pushed into the fetch target queue in the previous cycle (~afterbr) and PC+16 is not bypassed by a taken jump, the instruction addresses AddrBP are filled in as follows. The cell pointed to by WrPtr0 stores the address PC+32, with no additional condition. The cell pointed to by WrPtr1 stores the address PC+48 when PC+32 is predicted taken and the taken branch instruction itself wraps.
FIG. 4C shows adjacent cycles of parallel branch prediction with no overlap at all. For example, the first cycle sends instruction addresses A, B, C into the first stage of the branch prediction pipeline (the C stage shown in FIG. 5), and the second cycle sends addresses E, F, G into the C stage; none of the addresses E, F, G overlap the previous cycle. Among the addresses PC, PC+16, and PC+32 completing the last stage of the branch prediction pipeline (the U stage shown in FIG. 5) in the current cycle, if the address PC is determined not to be bypassed by a taken jump, the flag afterbr need not be considered in this example. Table 430 shows how the instruction addresses AddrBP are filled in.
In other embodiments, the number of instruction addresses that branch predictor 110 processes in parallel may be another number N; the number of cells filled per cycle can then reach N+1. The fetch unit is likewise not limited to 16 bytes and may be another number of bytes M.
In summary, filling the cells of the fetch target queue FTQ involves considering whether an instruction address AddrBP is meaningful (not bypassed by a taken jump) and checking whether the instruction address AddrBP already filled the fetch target queue FTQ in the previous cycle (e.g., checking the flag afterbr, i.e., whether it overlaps the previous cycle).
The following paragraphs discuss the source of the fetch address AddrL1i received by instruction cache 104, illustrating how that source is switched as the fetch target queue FTQ operates in synchronous or asynchronous mode. Taking FIG. 3 as an example, the fetch target queue FTQ operates according to the pointers WrPtr0 (hereinafter the starting write pointer) and RdPtr (hereinafter the read pointer). In one embodiment, instruction cache 104 is switched between synchronous and asynchronous mode by comparing the starting write pointer WrPtr0 with the read pointer RdPtr, thereby toggling the synchronization signal Sync of FIG. 2 that sets the source of the fetch address AddrL1i of instruction cache 104.
In one embodiment, the read pointer RdPtr increments every cycle. Branch predictor 110 pushes the instruction addresses AddrBP predicted in parallel into the fetch target queue FTQ at the cells indicated by the parallel write pointers (WrPtr0, WrPtr1, …), based on the starting write pointer WrPtr0; that is, the starting write pointer WrPtr0 is updated every cycle to point to the first cell occupied by the instruction addresses pushed into the fetch target queue FTQ in the current cycle. As described above, when a flush occurs in the pipeline of microprocessor 100, the fetch target queue FTQ is emptied, and in response to the returned flush address 120/122, the starting write pointer WrPtr0 and the read pointer RdPtr are initialized to point to the starting cell of the fetch target queue FTQ. Likewise, when branch predictor 110 predicts a jump target address (i.e., branch prediction finds a taken branch instruction at the instruction address AddrBP) and no instruction address is available to pop from the fetch target queue FTQ, the read pointer RdPtr and the starting write pointer WrPtr0 are both repointed to the empty cell following the cells of the fetch target queue FTQ that already store instruction addresses. In response to the flush or jump events above, the fetch address AddrL1i of instruction cache 104 may be the flush address 120/122, or the jump target address (114, passed directly from AddrBP, bypassing the fetch target queue FTQ), followed cycle by cycle by the instruction-cache-dedicated self-increment address 118; the synchronization signal Sync is pulled down (de-asserted). After the read pointer RdPtr becomes equal to any of the parallel write pointers (WrPtr0, WrPtr1, …) that actually stored an address, instruction cache 104 switches back to using the instruction address AddrBP popped from the fetch target queue FTQ as the fetch address AddrL1i. In one embodiment, when the read pointer RdPtr equals any such parallel write pointer in a first cycle, instruction cache 104 switches back to the popped instruction address AddrBP as the fetch address AddrL1i in the cycle following that first cycle; the synchronization signal Sync is pulled up (asserted).
The embodiment above determines whether instruction cache 104 switches from asynchronous mode back to synchronous mode by pointer comparison, i.e., whether it switches from the instruction-cache self-increment address 118 back to the instruction address AddrBP popped from the fetch target queue FTQ as the fetch address AddrL1i. In other embodiments, the same decision may be made by directly comparing the instruction-cache self-increment address 118 with the instruction addresses AddrBP in the fetch target queue FTQ. Note that the pointer comparison of the first embodiment, which compares the parallel write pointers (WrPtr0, WrPtr1, …) against the read pointer RdPtr, consumes few resources: the pointers RdPtr, WrPtr0, WrPtr1, … are typically very short, e.g., only three bits, compared with directly comparing 48-bit addresses as in the second embodiment. Microprocessor 100 can therefore compare pointers without spending many resources and quickly decide whether to pull the synchronization signal Sync up or down.
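A sketch of that pointer-comparison decision follows (the helper name and the cells-written guard are assumptions). The guard also encodes the FIG. 5C caveat, discussed later, that a match against an empty cell must not switch modes:

#include <cstdint>

// Pull Sync up once the read pointer RdPtr equals any parallel write
// pointer that actually stored an address this cycle. Comparing the short
// FTQ pointers (e.g., 3 bits each) is far cheaper than comparing full
// 48-bit addresses.
bool shouldAssertSync(uint8_t rdPtr,
                      const uint8_t wrPtrs[4],   // WrPtr0 ... WrPtr3
                      int cellsWritten)          // cells filled this cycle
{
    for (int i = 0; i < cellsWritten; ++i)
        if (rdPtr == wrPtrs[i])
            return true;   // the cache's next address now exists in the FTQ
    return false;          // remain in asynchronous mode (Sync de-asserted)
}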
FIG. 5A illustrates when the synchronization signal Sync is pulled up after a flush occurs, according to an embodiment of the present application. In this example, branch predictor 110 performs branch prediction on three addresses in parallel, with one address overlapping between adjacent cycles (refer to FIG. 4A); branch predictor 110 performs a multi-stage first pipeline (C/I/B/U stages) and instruction cache 104 performs a multi-stage second pipeline (e.g., also C/I/B/U stages). For simplicity, each branch prediction cycle is labeled only with its starting address; the other two consecutive addresses are not labeled. The starting write pointer WrPtr0 of the fetch target queue FTQ is indicated by the open arrow on the left, and the read pointer RdPtr by the solid arrow on the right. For simplicity, the other write pointers (WrPtr1 … WrPtr3) are not labeled.
In cycle T0, corresponding to an address flush event, the synchronization signal Sync is pulled down (de-asserted) and the fetch target queue FTQ enters asynchronous mode: the queue is empty, and the starting write pointer WrPtr0 and read pointer RdPtr are initialized to the starting cell. The flush address 10 and the following addresses 20, 30 are input to branch predictor 110 for C-stage pipeline operations in parallel; i.e., branch predictor 110 switches to performing branch prediction from the flush address 10. At the same time, the flush address 10 is also input to instruction cache 104 for its C-stage pipeline operation, i.e., the flush address 10 also flushes the fetch address AddrL1i of instruction cache 104.
In asynchronous mode, in cycle T1, the read pointer RdPtr moves to the next cell. The self-increment addresses 30, 40, 50 (only the starting address 30 is labeled in the figure) are input to branch predictor 110 for C-stage pipeline operations in parallel. The instruction-cache self-increment address 20 is also input to instruction cache 104 for its C-stage pipeline operation.
In cycle T2, the read pointer RdPtr moves to the next cell. The self-increment addresses 50, 60, 70 (only the starting address 50 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel. The instruction-cache self-increment address 30 is also input to instruction cache 104 for its C-stage pipeline operation.
In cycle T3, the read pointer RdPtr moves to the next cell. The branch prediction of addresses 10, 20, 30 (only the starting address 10 is labeled) reaches the U-stage pipeline operation (i.e., completes branch prediction), and no jump is predicted. According to FIG. 4A, the addresses 10, 20, 30 completing branch prediction in the current cycle fill the fetch target queue FTQ, and the pointers WrPtr0, WrPtr1, and WrPtr2 are updated to point to those three cells, the starting write pointer WrPtr0 pointing to the first cell occupied by the instruction addresses pushed into the fetch target queue FTQ in the current cycle (T3), i.e., the cell occupied by address 10. At T3, the self-increment addresses 70, 80, 90 (only the starting address 70 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel. Also at T3, the instruction-cache self-increment address 40 is input to instruction cache 104 for its C-stage pipeline operation.
In cycle T4, the read pointer RdPtr moves to the next cell. The branch prediction of addresses 30, 40, 50 (only the starting address 30 is labeled) reaches the U-stage pipeline operation (i.e., completes branch prediction), and no jump is predicted. According to FIG. 4A, of the addresses 30, 40, 50 completing branch prediction in the current cycle, addresses 40 and 50, which do not overlap the previous cycle, fill the fetch target queue FTQ, and the pointers WrPtr0 and WrPtr1 are updated to point to those two cells, the starting write pointer WrPtr0 pointing to the first cell occupied in the current cycle (here, the cell occupied by address 40). At T4, the self-increment addresses 90, A0, B0 (only the starting address 90 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel. Also at T4, the instruction-cache self-increment address 50 is input to instruction cache 104 for its C-stage pipeline operation. Note that in the current cycle T4, the read pointer RdPtr equals one of the parallel write pointers, WrPtr1; the comparison signal AddrEqual is pulled up, and that cell does carry an address (50), which satisfies the condition for pulling up the synchronization signal Sync.
In cycle T5, the synchronization signal Sync is pulled up in response to the comparison signal AddrEqual, and the read pointer RdPtr moves to the next cell. The branch prediction of addresses 50, 60, 70 (only the starting address 50 is labeled) reaches the U-stage pipeline operation (i.e., completes branch prediction), and no jump is predicted. According to FIG. 4A, of the addresses 50, 60, 70 completing branch prediction in the current cycle, addresses 60 and 70, which do not overlap the previous cycle, fill the fetch target queue FTQ, and the pointers WrPtr0 and WrPtr1 are updated to point to those two cells, the starting write pointer WrPtr0 pointing to the first cell occupied in the current cycle (here, the cell occupied by address 60). At T5, the addresses B0, C0, D0 (only the starting address B0 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel. Because the synchronization signal Sync is pulled up, the fetch target queue FTQ switches from asynchronous mode back to synchronous mode, i.e., instruction cache 104 switches back to obtaining the fetch address AddrL1i from the fetch target queue FTQ. Based on the read pointer RdPtr, the address 60 popped from the fetch target queue FTQ is the fetch address AddrL1i and is input to instruction cache 104 for its C-stage pipeline operation. Because the mode has switched back to synchronous, the comparison of pointers WrPtr and RdPtr need not continue.
In cycle T6, the read pointer RdPtr moves to the next cell. The branch prediction of addresses 70, 80, 90 (only the starting address 70 is labeled) reaches the U-stage pipeline operation (i.e., completes branch prediction), and no jump is predicted. According to FIG. 4A, of the addresses 70, 80, 90 completing branch prediction in the current cycle, addresses 80 and 90, which do not overlap the previous cycle, fill the queue, and the pointers WrPtr0 and WrPtr1 are updated to point to those two cells, the starting write pointer WrPtr0 pointing to the first cell occupied in the current cycle (here, the cell occupied by address 80). In particular, when the pointer WrPtr0 already points to the last cell of the fetch target queue FTQ, the pointer WrPtr1 wraps around to a cell occupied by a stale, already-consumed address (FIG. 5A shows an embodiment in which address 90 overwrites the first cell, whose content is stale; any address in a cell before the cell pointed to by the read pointer RdPtr is stale). At T6, the addresses D0, E0, F0 (only the starting address D0 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel. Since the synchronization signal Sync remains asserted at T6 (synchronous mode), based on the read pointer RdPtr, the address 70 popped from the fetch target queue FTQ is the fetch address AddrL1i and is input to instruction cache 104 for its C-stage pipeline operation.
In cycle T7, the read pointer RdPtr moves to the next cell. The branch prediction of addresses 90, A0, B0 (only the starting address 90 is labeled) reaches the U-stage pipeline operation (i.e., completes branch prediction), and no jump is predicted. According to FIG. 4A, of the addresses 90, A0, B0 completing branch prediction in the current cycle, addresses A0 and B0, which do not overlap the previous cycle, fill the queue, and the pointers WrPtr0 and WrPtr1 are updated to point to those two cells, the starting write pointer WrPtr0 pointing to the first cell occupied in the current cycle (i.e., the cell occupied by address A0). The addresses F0, G0, H0 (only the starting address F0 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel. Since the synchronization signal Sync remains asserted at T7 (synchronous mode), based on the read pointer RdPtr, the address 80 popped from the fetch target queue FTQ is the fetch address AddrL1i and is input to instruction cache 104 for its C-stage pipeline operation.
In FIG. 5A, the flush event leaves the fetch target queue FTQ unable, during cycles T0 … T4, to prepare the corresponding addresses before instruction cache 104 needs them as the fetch address AddrL1i. However, this condition does not stall the operation of instruction cache 104 at all: the post-flush addresses 10, 20, 30, 40, 50 are supplied to instruction cache 104 as the fetch address AddrL1i cycle by cycle through the other paths (120/122, 118).
FIG. 5B illustrates how the synchronization signal Sync changes if a jump from address 60 to address 200 is predicted (while the fetch target queue FTQ is not empty) in cycle T5 of FIG. 5A.
In contrast with FIG. 5A, cycle T5 of FIG. 5B does not push address 70 into the fetch target queue FTQ, since address 70 is bypassed by the jump taken at address 60. The jump target addresses 200, 210, 220 (only the starting address 200 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel.
In cycle T6, because a jump was predicted in the previous cycle T5 and no instruction address is available to pop from the fetch target queue FTQ (address 60 and the addresses before it have already been popped, the read pointer RdPtr having moved past their cells), the synchronization signal Sync is pulled down, and the read pointer RdPtr and the starting write pointer WrPtr0 are both repointed to the empty cell following the cells of the fetch target queue FTQ that already store instruction addresses AddrBP (here, the empty cell after the cell occupied by address 60). The branch prediction of addresses 70, 80, 90 reaches the U-stage pipeline operation (i.e., completes branch prediction) but is meaningless and does not push the fetch target queue FTQ, because those addresses have been bypassed by the jump. At T6, the self-increment addresses 220, 230, 240 (only the starting address 220 is labeled) are input to branch predictor 110 for C-stage pipeline operations in parallel. Meanwhile, at T6, since the synchronization signal Sync has been pulled down and asynchronous mode entered, the jump target address 200 is input directly to instruction cache 104, without going through the fetch target queue FTQ, for its C-stage pipeline operation.
In cycle T7, the read pointer RdPtr moves to the next cell. The branch prediction of addresses 90, a0 and B0 proceeds to U-stage pipeline operations (i.e., branch prediction is completed), but is meaningless and will not push the FTQ because the jump has been avoided. The branch prediction uses the self-increment addresses 240, 250, 260 (only the start address 240 is labeled in the figure) to input into the branch predictor 110, and the C-stage pipeline operation is performed in parallel. Because of the asynchronous mode, the instruction cache 104 is pipelined with the instruction cache incrementation address 210.
In cycle T8, the read pointer RdPtr moves to the next cell. The branch prediction for addresses 200, 210, 220 (only start address 200 is labeled) proceeds to the U-stage pipeline operation (i.e., branch prediction is complete) and no jump occurs. According to FIG. 4A, the addresses 200, 210, 220 are filled in with the pointers WrPtr0, WrPtr1, and WrPtr2 correct to point to the three cells, where the starting write pointer WrPtr0 corrects to point to the first cell occupied by the instruction address that pushed the fetching target queue FTQ in the current cycle (here, the cell occupied by address 200). At T8, the branch predictor 110 is input with the incrementation addresses 260, 270, 280 (only the start address 260 is shown) and the C-stage pipeline operation is performed in parallel. Also at T8, because of the asynchronous mode, the incrementation address 220 for the instruction cache is also input into the instruction cache 104 for C-stage pipelining. It is noted that in the current cycle T8, the read pointer RdPtr is equal to one of the parallel pointers WrPtr 2. The comparison signal AddrEqual pulls up. The cell does carry an address (220) that is consistent with the condition for the Sync signal Sync to be pulled up.
At cycle T9, synchronization signal Sync is pulled up in response to compare signal AddrEqual, read pointer RdPtr is moved to the next cell, and the branch prediction for addresses 220, 230, 240 (only start address 220 is labeled) proceeds to the U-stage pipeline operation (i.e., branch prediction is complete) and no jump occurs. According to FIG. 4A, addresses 230, 240 of the addresses 220, 230, 240 that do not overlap with the previous cycle for which branch prediction completed in the current cycle are filled, and the pointers WrPtr0 and WrPtr1 are aligned to point to the two cells, where the starting write pointer WrPtr0 is aligned to point to the first cell (here, the cell occupied by address 230) occupied by the instruction address that pushed the fetching target queue FTQ in the current cycle. At T9, the branch predictor 110 is inputted with the self-increment addresses 280, 290, 300 (only the start address 280 is labeled) for branch prediction, and C-stage pipeline operations are performed in parallel. Due to the pull of the synchronization signal Sync, the instruction fetch target queue FTQ is switched from asynchronous mode back to synchronous mode, i.e., the instruction cache 104 switches back to fetch the instruction fetch address AddrL1i from the instruction fetch target queue FTQ. Based on the read pointer RdPtr, the instruction fetch target queue FTQ pop address 230 is the instruction fetch address AddrL1i, which is input into the instruction cache 104 for C-stage pipelining. Because the mode has been switched back to synchronous mode, the comparison of pointers WrPtr and RdPtr need not be done any more.
In FIG. 5B, the jump predicted in cycle T5 leaves the fetch target queue FTQ unable, during cycles T6 through T8, to prepare the corresponding address before the instruction cache 104 needs the fetch address AddrL1i. However, this condition does not stall the operation of the instruction cache 104 at all. The jump target addresses 200, 210, 220 are still supplied, cycle by cycle, to the instruction cache 104 as the fetch address AddrL1i through other paths (path 114, which passes AddrBP directly and bypasses the fetch target queue FTQ, or path 118, the self-increment path provided by the instruction cache 104).
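Behaviorally, the selection of the fetch address AddrL1i amounts to a small multiplexer between the FTQ pop port and the two bypass paths. The sketch below is only an approximation under the same hypothetical model; in hardware, paths 114 and 118 are wires rather than function arguments:

    def select_fetch_address(ftq, addr_bp_bypass, self_increment):
        # Synchronous mode: pop AddrL1i from the fetch target queue.
        if ftq.sync and ftq.cells[ftq.rd_ptr] is not None:
            addr = ftq.cells[ftq.rd_ptr]         # popped at RdPtr
            ftq.rd_ptr = (ftq.rd_ptr + 1) % len(ftq.cells)
            return addr
        # Asynchronous mode: take the jump target straight from the branch
        # predictor (path 114) when one is present, otherwise keep
        # self-incrementing inside the instruction cache (path 118).
        return addr_bp_bypass if addr_bp_bypass is not None else self_increment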
FIG. 5C illustrates, according to one embodiment of the present application, how the synchronization signal Sync is pulled high when a jump target address is predicted after a flush.
Compared with FIG. 5A, in FIG. 5C a branch jump from address 10 to address 200 is predicted in cycle T3. The addresses predicted by the branch predictor 110 at T3 include 10, 20, and 30 (only the start address 10 is labeled). In response to the prediction that address 10 jumps to address 200, addresses 20 and 30 are skipped and not pushed into the fetch target queue FTQ; only address 10 is filled in. The jump target addresses 200, 210, 220 (only the start address 200 is labeled) are input to the branch predictor 110 for the parallel C-stage pipeline operation.
In cycle T4, the read pointer RdPtr and the start write pointer WrPtr0 are both realigned to point to the blank cell following the cell in the fetch target queue FTQ that stores the instruction address AddrBP (here, the blank cell following the cell occupied by address 10). The branch prediction for addresses 30, 40, 50 proceeds to the U-stage pipeline operation (i.e., branch prediction completes), but the result is meaningless and does not push the fetch target queue FTQ, because those addresses were skipped by the taken jump. At T4, the self-incremented addresses 220, 230, 240 (only the start address 220 is labeled) are input to the branch predictor 110, and the C-stage pipeline operation proceeds in parallel. Meanwhile, at T4, because of the asynchronous mode, the jump target address 200 is input directly into the instruction cache 104 for the C-stage pipeline operation, without going through the fetch target queue FTQ. The comparison signal AddrEqual is pulled high; however, the cell pointed to by the start write pointer WrPtr0 is blank, so the condition for switching back to synchronous mode is not satisfied.
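This is exactly the edge case that the hypothetical check_resync sketch above guards against: the pointers compare equal, yet the matched cell is still blank, so Sync stays low. Illustratively, reusing the FetchTargetQueue and check_resync sketches:

    # Hypothetical illustration of FIG. 5C, cycle T4.
    ftq = FetchTargetQueue()
    ftq.sync = False                             # asynchronous after the jump
    ftq.rd_ptr = ftq.wr_ptr0 = 1                 # both realigned to blank cell
    addr_equal, sync = check_resync(ftq, parallel_write_ptrs=[1, 2, 3])
    assert addr_equal and not sync               # AddrEqual high, cell empty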
In cycle T5, the read pointer RdPtr moves to the next cell. The branch prediction for addresses 50, 60, 70 proceeds to the U-stage pipeline operation (i.e., branch prediction completes), but the result is meaningless and does not push the fetch target queue FTQ, because those addresses were skipped by the taken jump. The self-incremented addresses 240, 250, 260 (only the start address 240 is labeled in the figure) are input to the branch predictor 110 for branch prediction, and the C-stage pipeline operation proceeds in parallel. Because of the asynchronous mode, the self-incremented instruction-cache address 210 is input into the instruction cache 104 for the C-stage pipeline operation.
In cycle T6, the read pointer RdPtr moves to the next cell. The branch prediction for addresses 200, 210, 220 (only the start address 200 is labeled) proceeds to the U-stage pipeline operation (i.e., branch prediction completes), and no jump occurs. According to FIG. 4A, addresses 200, 210, 220 are filled into the fetch target queue FTQ, with the parallel write pointers WrPtr0, WrPtr1, and WrPtr2 realigned to point to the three cells; the start write pointer WrPtr0 is realigned to point to the first cell occupied by the instruction addresses pushed into the fetch target queue FTQ in the current cycle (here, the cell occupied by address 200). At T6, the self-incremented addresses 260, 270, 280 (only the start address 260 is labeled) are input to the branch predictor 110, and the C-stage pipeline operation proceeds in parallel. Also at T6, because of the asynchronous mode, the self-incremented instruction-cache address 220 is input into the instruction cache 104 for the C-stage pipeline operation. Note that in the current cycle T6 the read pointer RdPtr equals one of the parallel write pointers, WrPtr2, so the comparison signal AddrEqual is pulled high, and the pointed-to cell does carry an address (220); this satisfies the condition for the synchronization signal Sync to be pulled high.
In cycle T7, the synchronization signal Sync is pulled high in response to the comparison signal AddrEqual, the read pointer RdPtr moves to the next cell, and the branch prediction for addresses 220, 230, 240 (only the start address 220 is labeled) proceeds to the U-stage pipeline operation (i.e., branch prediction completes), with no jump occurring. According to FIG. 4A, of the addresses 220, 230, 240 for which branch prediction completes in the current cycle, the addresses 230, 240 that do not overlap with the previous cycle are filled in, with the pointers WrPtr0 and WrPtr1 realigned to point to the two cells; the start write pointer WrPtr0 is realigned to point to the first cell occupied by the instruction addresses pushed into the fetch target queue FTQ in the current cycle (here, the cell occupied by address 230). At T7, the self-incremented addresses 280, 290, 300 (only the start address 280 is labeled) are input to the branch predictor 110 for branch prediction, and the C-stage pipeline operation proceeds in parallel. Because the synchronization signal Sync is pulled high, the fetch target queue FTQ switches from asynchronous mode back to synchronous mode; that is, the instruction cache 104 switches back to taking the fetch address AddrL1i from the fetch target queue FTQ. Based on the read pointer RdPtr, the fetch target queue FTQ pops address 230 as the fetch address AddrL1i, which is input into the instruction cache 104 for the C-stage pipeline operation. Because the mode has switched back to synchronous, the comparison of the write pointers WrPtr with the read pointer RdPtr no longer needs to be performed.
In FIG. 5C, the fetch target queue FTQ does not supply an address to the instruction cache 104 as the fetch address AddrL1i until cycle T7. However, this condition does not stall the operation of the instruction cache 104 at all. The flush address 10, its jump target address 200, and the subsequent addresses 210 and 220 are all still supplied promptly to the instruction cache 104 as the fetch address AddrL1i through the other paths. Although resources are still consumed fetching the meaningless addresses 20, 30, and 40, the instruction cache 104 remains a highly efficient design.
Whether for flush events or for predicted branch jump events (i.e., predicted jump target addresses), decoupling the instruction cache 104 from the branch predictor 110 by means of the fetch target queue FTQ exhibits significant efficacy. Any microprocessor architecture that uses the fetch target queue FTQ in this way falls within the scope of the present application.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention.

Claims (18)

1. A microprocessor, comprising:
an instruction cache for fetching instructions according to an instruction fetch address;
a branch predictor for performing branch prediction on N instruction addresses in parallel, wherein N is an integer greater than 1; and
an instruction fetch target queue coupled between the branch predictor and the instruction cache,
wherein:
in the current cycle, each instruction address that is not skipped by a jump and does not overlap the instruction addresses pushed into the instruction fetch target queue in the previous cycle is pushed into the instruction fetch target queue, to be popped in a subsequent cycle as the instruction fetch address of the instruction cache.
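As a rough illustration of this push rule, and not the patented implementation, the filter can be written as a short comprehension; the predicted addresses, the set of jump-skipped addresses, and the set pushed last cycle are assumed inputs from the branch predictor, and all names are hypothetical:

    def addresses_to_push(predicted_addrs, skipped_by_jump, pushed_last_cycle):
        # Keep only addresses that were neither skipped by a predicted taken
        # jump nor already pushed into the FTQ in the previous cycle.
        return [a for a in predicted_addrs
                if a not in skipped_by_jump and a not in pushed_last_cycle]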
2. The microprocessor of claim 1, wherein:
the branch predictor performs a first multi-stage pipeline operation; and
the instruction cache performs a second multi-stage pipeline operation.
3. The microprocessor of claim 1, wherein:
in the current cycle, if the N instruction addresses for which the branch predictor completes branch prediction in parallel predict that a taken branch instruction exists in the corresponding N fetch units, and the taken branch instruction crosses fetch units, the instruction address of the fetch unit adjacent to the fetch unit in which the taken branch instruction is located is also pushed into the instruction fetch target queue, to be popped in a subsequent cycle as the instruction fetch address of the instruction cache.
4. The microprocessor of claim 1, wherein:
N is 3;
the fetch unit corresponding to each instruction address is M bytes, where M is a number; and
in the current cycle, the instruction addresses for which the branch predictor performs branch prediction in parallel include PC, PC + M, and PC + 2M.
5. The microprocessor of claim 4, wherein:
the N instruction addresses for which branch prediction is completed in parallel in the current cycle overlap those of the previous cycle by one instruction address.
6. The microprocessor as recited in claim 5, wherein, when instruction address PC is not skipped by a jump and was not pushed into the instruction fetch target queue in the previous cycle:
the instruction fetch target queue provides a first cell, storing instruction address PC;
the instruction fetch target queue provides a second cell, storing instruction address PC + M when instruction address PC does not take a jump, or takes a jump whose taken branch instruction crosses fetch units;
the instruction fetch target queue provides a third cell, storing instruction address PC + 2M when neither instruction address PC nor PC + M takes a jump, or when instruction address PC does not take a jump and instruction address PC + M takes a jump whose taken branch instruction crosses fetch units; and
the instruction fetch target queue provides a fourth cell, storing instruction address PC + 3M when neither instruction address PC nor PC + M takes a jump and instruction address PC + 2M takes a jump whose taken branch instruction crosses fetch units.
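Claims 4 through 6 together describe, for N = 3 with M-byte fetch units, which of PC, PC + M, PC + 2M, and PC + 3M occupy cells. That case analysis condenses into the following sketch; takes_jump and crosses_unit are hypothetical predicates standing in for the predictor's per-fetch-unit outputs, and the code is illustrative only:

    def cells_for_pc(pc, m, takes_jump, crosses_unit):
        # takes_jump(a): the fetch unit at address a predicts a taken jump.
        # crosses_unit(a): the taken branch at address a crosses fetch units.
        filled = [pc]                            # first cell: PC itself
        if not takes_jump(pc) or crosses_unit(pc):
            filled.append(pc + m)                # second cell: PC + M
            if not takes_jump(pc) and (not takes_jump(pc + m)
                                       or crosses_unit(pc + m)):
                filled.append(pc + 2 * m)        # third cell: PC + 2M
                if (not takes_jump(pc + m) and takes_jump(pc + 2 * m)
                        and crosses_unit(pc + 2 * m)):
                    filled.append(pc + 3 * m)    # fourth cell: PC + 3M
        return filled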
7. The microprocessor as recited in claim 5, wherein, when instruction addresses PC and PC + M are not skipped by a jump but instruction address PC was already pushed into the instruction fetch target queue in the previous cycle:
the instruction fetch target queue provides a first cell, storing instruction address PC + M;
the instruction fetch target queue provides a second cell, storing instruction address PC + 2M when instruction address PC + M does not take a jump, or takes a jump whose taken branch instruction crosses fetch units; and
the instruction fetch target queue provides a third cell, storing instruction address PC + 3M when instruction address PC + M does not take a jump and instruction address PC + 2M takes a jump whose taken branch instruction crosses fetch units.
8. The microprocessor of claim 4, wherein:
the N instruction addresses for which branch prediction is completed in parallel in the current cycle overlap those of the previous cycle by two instruction addresses.
9. The microprocessor as recited in claim 8, wherein, when instruction address PC is not skipped by a jump and was not pushed into the instruction fetch target queue in the previous cycle:
the instruction fetch target queue provides a first cell, storing instruction address PC;
the instruction fetch target queue provides a second cell, storing instruction address PC + M when instruction address PC does not take a jump, or takes a jump whose taken branch instruction crosses fetch units;
the instruction fetch target queue provides a third cell, storing instruction address PC + 2M when neither instruction address PC nor PC + M takes a jump, or when instruction address PC does not take a jump and instruction address PC + M takes a jump whose taken branch instruction crosses fetch units; and
the instruction fetch target queue provides a fourth cell, storing instruction address PC + 3M when neither instruction address PC nor PC + M takes a jump and instruction address PC + 2M takes a jump whose taken branch instruction crosses fetch units.
10. The microprocessor as recited in claim 8, wherein, when instruction addresses PC and PC + M are not skipped by a jump but were already pushed into the instruction fetch target queue in the previous cycle:
the instruction fetch target queue provides a first cell, storing instruction address PC + 2M; and
the instruction fetch target queue provides a second cell, storing instruction address PC + 3M when instruction address PC + 2M takes a jump whose taken branch instruction crosses fetch units.
11. The microprocessor of claim 4, wherein:
the N instruction addresses for which branch prediction is completed in parallel in the current cycle do not overlap those of the previous cycle by any instruction address.
12. The microprocessor as recited in claim 11, wherein, when instruction address PC is not skipped by a jump:
the instruction fetch target queue provides a first cell, storing instruction address PC;
the instruction fetch target queue provides a second cell, storing instruction address PC + M when instruction address PC does not take a jump, or takes a jump whose taken branch instruction crosses fetch units;
the instruction fetch target queue provides a third cell, storing instruction address PC + 2M when neither instruction address PC nor PC + M takes a jump, or when instruction address PC does not take a jump and instruction address PC + M takes a jump whose taken branch instruction crosses fetch units; and
the instruction fetch target queue provides a fourth cell, storing instruction address PC + 3M when neither instruction address PC nor PC + M takes a jump and instruction address PC + 2M takes a jump whose taken branch instruction crosses fetch units.
13. The microprocessor as recited in claim 1, wherein each cell of the instruction fetch target queue comprises:
a first field, storing a taken-jump flag that marks whether a taken branch instruction is predicted in the fetch unit corresponding to the instruction address stored in the cell;
a second field, storing a cross-fetch-unit flag that marks whether the taken branch instruction crosses fetch units; and
a third field, storing the instruction address.
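The three fields of claim 13 map naturally onto a small record type; the sketch below uses illustrative Python field names that are not part of the claim language:

    from dataclasses import dataclass

    @dataclass
    class FtqCell:
        taken_jump: bool          # first field: taken-jump flag
        crosses_fetch_unit: bool  # second field: cross-fetch-unit flag
        address: int              # third field: the instruction address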
14. The microprocessor as recited in claim 1, wherein the instruction fetch target queue performs parallel push operations with (N+1) parallel write pointers.
15. The microprocessor of claim 14, wherein the parallel write pointers with which the branch predictor pushes instruction addresses into the instruction fetch target queue in the current cycle include a start write pointer, pointing to the first cell occupied by the instruction addresses pushed into the instruction fetch target queue in the current cycle, and pointers incremented from the start write pointer.
16. The microprocessor as recited in claim 1, wherein the instruction fetch target queue performs pop operations with a read pointer.
17. The microprocessor as recited in claim 16, wherein, if the fetch unit corresponding to the instruction address pointed to by the read pointer is predicted to contain a taken branch instruction, a jump target pointer points to the jump target address of the taken branch instruction.
18. The microprocessor as recited in claim 17, wherein, if the taken branch instruction crosses fetch units, a cross-fetch-unit pointer points to the instruction address corresponding to the fetch unit adjacent to the fetch unit in which the taken branch instruction is located.
CN202010289064.3A 2020-04-14 2020-04-14 Microprocessor with highly advanced branch predictor Active CN111459551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010289064.3A CN111459551B (en) 2020-04-14 2020-04-14 Microprocessor with highly advanced branch predictor

Publications (2)

Publication Number Publication Date
CN111459551A CN111459551A (en) 2020-07-28
CN111459551B true CN111459551B (en) 2022-08-16

Family

ID=71678710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010289064.3A Active CN111459551B (en) 2020-04-14 2020-04-14 Microprocessor with highly advanced branch predictor

Country Status (1)

Country Link
CN (1) CN111459551B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112130897A (en) * 2020-09-23 2020-12-25 上海兆芯集成电路有限公司 Microprocessor
CN115934171B (en) * 2023-01-16 2023-05-16 北京微核芯科技有限公司 Method and apparatus for scheduling branch predictors for multiple instructions

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5630157A (en) * 1991-06-13 1997-05-13 International Business Machines Corporation Computer organization for multiple and out-of-order execution of condition code testing and setting instructions
CN1521635A (en) * 2003-01-14 2004-08-18 智权第一公司 Apparatus and method for resolving deadlock fetch conditions involving branch target address cache
CN1542625A (en) * 2003-01-14 2004-11-03 智权第一公司 Apparatus and method for efficiently updating branch target address cache
CN102508643A (en) * 2011-11-16 2012-06-20 刘大可 Multicore-parallel digital signal processor and method for operating parallel instruction sets
CN105426160A (en) * 2015-11-10 2016-03-23 北京时代民芯科技有限公司 Instruction classified multi-emitting method based on SPRAC V8 instruction set
CN109388429A (en) * 2018-09-29 2019-02-26 古进 The task distribution method of MHP heterogeneous multiple-pipeline processor
CN110825442A (en) * 2019-04-30 2020-02-21 海光信息技术有限公司 Instruction prefetching method and processor

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7519799B2 (en) * 2003-11-18 2009-04-14 Intel Corporation Apparatus having a micro-instruction queue, a micro-instruction pointer programmable logic array and a micro-operation read only memory and method for use thereof
CN102306092B (en) * 2011-07-29 2014-04-09 北京北大众志微系统科技有限责任公司 Method and device for realizing instruction cache path selection in superscaler processor
CN105718241B (en) * 2016-01-18 2018-03-13 北京时代民芯科技有限公司 A kind of sort-type mixed branch forecasting system based on SPARC V8 architectures
CN106406823B (en) * 2016-10-10 2019-07-05 上海兆芯集成电路有限公司 Branch predictor and method for operating branch predictor
CN107688468B (en) * 2016-12-23 2020-05-15 北京国睿中数科技股份有限公司 Method for verifying branch instruction and branch prediction function in speculative execution processor

Also Published As

Publication number Publication date
CN111459551A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111459550B (en) Microprocessor with highly advanced branch predictor
US6898699B2 (en) Return address stack including speculative return address buffer with back pointers
US6553488B2 (en) Method and apparatus for branch prediction using first and second level branch prediction tables
US11403103B2 (en) Microprocessor with multi-step ahead branch predictor and having a fetch-target queue between the branch predictor and instruction cache
KR100431168B1 (en) A method and system for fetching noncontiguous instructions in a single clock cycle
US7117347B2 (en) Processor including fallback branch prediction mechanism for far jump and far call instructions
US5606682A (en) Data processor with branch target address cache and subroutine return address cache and method of operation
US6338136B1 (en) Pairing of load-ALU-store with conditional branch
US7159098B2 (en) Selecting next instruction line buffer stage based on current instruction line boundary wraparound and branch target in buffer indicator
JP3919802B2 (en) Processor and method for scheduling instruction operations in a processor
US5553254A (en) Instruction cache access and prefetch process controlled by a predicted instruction-path mechanism
US7516312B2 (en) Presbyopic branch target prefetch method and apparatus
KR101081674B1 (en) A system and method for using a working global history register
JP2011150712A (en) Unaligned memory access prediction
US5935238A (en) Selection from multiple fetch addresses generated concurrently including predicted and actual target by control-flow instructions in current and previous instruction bundles
US5964869A (en) Instruction fetch mechanism with simultaneous prediction of control-flow instructions
CN111459551B (en) Microprocessor with highly advanced branch predictor
EP1662377B1 (en) Branch predicting apparatus and branch predicting method using return address stacks
US7689816B2 (en) Branch prediction with partially folded global history vector for reduced XOR operation time
US7913068B2 (en) System and method for providing asynchronous dynamic millicode entry prediction
US5748976A (en) Mechanism for maintaining data coherency in a branch history instruction cache
CN112130897A (en) Microprocessor
US6421774B1 (en) Static branch predictor using opcode of instruction preceding conditional branch
US9507600B2 (en) Processor loop buffer
US20050144427A1 (en) Processor including branch prediction mechanism for far jump and far call instructions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee after: Shanghai Zhaoxin Semiconductor Co.,Ltd.

Address before: 201210 Room 301, No. 2537, Jinke Road, Zhangjiang High Tech Park, Shanghai

Patentee before: VIA ALLIANCE SEMICONDUCTOR Co.,Ltd.