Zen 5's 2-ahead branch predictor: how a 30 year old idea allows for new tricks (chipsandcheese.com)
222 points by matt_d 18 hours ago | 119 comments





Here's a great explanation of branch prediction, starting from the earliest implementations: https://danluu.com/branch-prediction/

Godbolt recently did an ELI5 on computerphile about everything CPU[0] and about branch prediction in particular[1].

[0]: https://www.youtube.com/watch?v=nhXevKMm3JI&list=PLzH6n4zXuc...

[1]: https://www.youtube.com/watch?v=nczJ58WvtYo&list=PLzH6n4zXuc...


It will be interesting to see the SMT performance; I am expecting this to provide benefits and be further refined in future generations. With Zen 5c we get 192 cores, or 384 vCPUs. We should be getting 256 cores with Zen 6c next year, which means a dual-socket 1U server could potentially have 512 cores with 1024 vCPUs.

Whatever web app scaling issues we had in 2014 could now fit into a single server, assuming we somehow manage to cool the thing. Even at 1 RPS per vCPU that is ~1000 RPS, excluding cache hits. Even the HN front page doesn't hit the server at 1000 page views per second.


Serving web pages is cheap. You’ll probably hit network I/O limits before you saturate the cores.

I wonder about its HPC performance. I think cooling this won't be a big problem, but it might be a wet one, requiring DLC (direct liquid cooling) after a certain point.


SMT could use a lot more benchmarking investigations.

Intuitively, having more tasks working on the same problem at half speed should have a memory usage cost. Are apps commonly using more memory for no speed gain when using SMT?

In a lot of published benchmarks it seems most apps don't noticeably benefit in execution speed.


SMT also allows the cores to hide memory latency by executing the other thread(s). Taken to the extreme end this would give you a barrel CPU.

The opposite might be true: the better you are at utilising the CPU pipeline, the less space you have to run a second thread, so SMT's benefits might diminish.

Web app scaling issues are usually around database latency.

But database performance is also a lot about CPU and RAM speed plus concurrent use of those resources.

In my experience, scaling is limited by database throughput, not latency.

Man I sure hope that your server does more than 1 RPS :)

I wish I were joking, but there are plenty of production instances doing less than 10 RPS per vCPU. (I think "request" doesn't explain it well, since it could be a page view but it could also be an API request. In this case it mostly refers to a page view.)

It's always interesting to see decades-old papers, sometimes published with little to no fanfare, suddenly become "state of the art" because hardware has become powerful enough.

For example, Z-buffers[1], used by 3D video games. When first published, it wasn't even the main topic of the paper, just a side note, because it required an expensive amount of memory to run.

Turns out megabytes are quite cheap a few decades later, and every realtime 3D renderer ended up using it.

[1] https://en.wikipedia.org/wiki/Z-buffering


Another example is Low Density Parity Check Codes [1]. Discovered in 1962 by Robert Gallager but abandoned and forgotten about for decades due to being computationally impractical. It looks like there was a 38 year gap in the literature until rediscovered by David MacKay [2].

The first mainstream use was in 2003. It is now used in WiFi, Ethernet and 5G.

[1] https://en.wikipedia.org/wiki/Low-density_parity-check_code

[2] https://scholar.google.com/scholar?q=%22low+density+parity+c...


I sometimes wonder if there’s an academic career hidden in there for an engineer: go to the library and read what the CS folks were publishing on physical papers, maybe there are some ideas that can actually be implemented now that weren’t practical back then.

In a series of books by David Brin [0] there is a galaxy-wide institution known as the library, and civilizations regularly mine its millions of years of data for suddenly-relevant-again techniques and technologies.

I remember one bit where a species had launched some tricky fleet-destroying weapon to surprise their enemies with esoteric physics, only to have it reversed against them, possibly because the Librarian that once helped their research-agent wasn't entirely unbiased.

[0] https://en.m.wikipedia.org/wiki/Uplift_Universe


Also, in Vinge's A Deepness in the Sky, there aren't really "programmers" as we know them anymore, but "programmer-archeologists" who just search the archives for code components to reuse.

That's pretty much what any (decent) programmer does today as well. You first search your code base to see if the application already does something like it; if not, whether there is a published library. Where this starts to fail is the idea that connecting those already-written pieces together is easy.

..And then ask an LLM for code

Also: In the Destiny mythic sci-fi franchise, the human golden age ended with a mysterious apocalypse, leaving "cryptarchs" (crypto-archeologists) to try to rebuild from arcane fragments of encrypted data or unknown formats.

Mind editing that to give a spoiler alert?

Don't worry, it's nowhere near the main plot or characters, just a small "meanwhile, elsewhere" vignette. Basically emphasizing the "why bother everything's already invented" mentality of most client-races, and how deep access and query-secrecy have big impacts.

Academic? Perhaps, applied:

"When Soviets Won the Cold War: Wading Through Pyotr Ufimstev's Work on Stealth" (26.03.2024)

https://news.ycombinator.com/item?id=39830671

> In the early 1970s, Lockheed Martin engineer Denys Overholser discovered the key to stealth technology hidden in a stack of translated Soviet technical papers. Disregarded by the Soviet academic elite, and unheard of in the United States, Pyotr Ufimstev had worked out calculations that would help win ...

Not sure of the details of this story, but in general, having enough people that you can grant some of them time just to search for something interesting doesn't seem unrealistic.


Yes, "read 10 year old papers as a source of ideas ripe for commercialization" IS common advice in universities.

A post-doc in my chemistry lab had the saying, “two weeks in the lab will save you a day in the library”

Heck. Look at 10 year old product launch PRs from big tech. Anything that Google launched 10 years ago and killed, but seems like a good idea today is probably also easier to do. And if you look 5-10 years before that, you can find the Yahoo launch PR where they did the same thing ;p

The whole AI trend is in part due to things that are now possible on GPU supercomputers with gigabytes of RAM, backed by petabytes of data and running at full GPU speed. Some of the algorithms date back to before the AI winter; it's just that we can now do the same thing with a ton more data and faster.

All of the main algorithms do (multi-layer perceptrons and stochastic gradient descent are from the 50s and 60s!). Basically the only thing that changed is we decided to multiply some of the outputs of the multi-layer perceptrons by each other and softmax it (attention) before passing them back into more layers. Almost all of the other stuff is just gravy to make it converge faster (and run faster on modern hardware).
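For reference, that "multiply and softmax" step is the scaled dot-product attention from the transformer paper, roughly:

  Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q, K and V are linear projections of the layer outputs and d_k is the key dimension.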

Image and video compression are like that. Ideas for mainframes in the '80s are realtime algorithms now.

I guess they are _soft_ realtime algorithms now?

'Realtime' isn't a specific thing, there's just 'fast enough'. Oldschool "render each frame during the scan line staying ahead of the electron beam" graphics was pretty hardcore though.

'Realtime' actually has multiple meanings.

At least one of them is very, very specific and is the one that Wikipedia uses in https://en.wikipedia.org/wiki/Real-time_computing

> Real-time computing (RTC) is the computer science term for hardware and software systems subject to a "real-time constraint", for example from event to system response.[1] Real-time programs must guarantee response within specified time constraints, often referred to as "deadlines".[2]

Strictly speaking, this definition doesn't say anything about how tight those deadlines are, as long as you can guarantee some deadlines.

There's also 'soft' real time where you try to hit your deadlines often enough, but there are no guarantees and a missed deadline is not the end of the world. Games are a good example of that, including the example of chasing the electron beam.

ABS brakes are an example of a 'hard' real time system: the deadlines aren't nearly as tight as for video game frames, but you really, really can't afford to miss them.


I wonder if current AI training's effort to hoover up all the training data they can find will accidentally give us most of the benefits of that?

A human can only read so much, so has to discriminate. But our machines are omnivorous readers.


Z-buffers don't just require roughly another frame buffer's worth of memory, but also lots of read and write bandwidth per pixel. This extra memory bandwidth requirement made them expensive to implement well. High-end implementations used dedicated RAM channels for them, but on lower-end hardware they "stole" a lot of memory bandwidth from a shared memory interface; e.g. some N64 games optimised by drawing a software-managed background/foreground with the Z-buffer disabled to avoid the cost of reading and updating the depth information.

Another example is Rust's borrow checker, which has roots in substructural type system papers from decades earlier. Many academics considered substructural type systems dead (killed by GC, more or less) until Rust resurrected the idea by combining it with some new ideas from C++ of the time.

"with some new ideas from C++ of the time"

Could you elaborate on that?


I assume it's the deterministic, implicit, synchronous, single-use destructor.

Affine logic only says what you're able to do with a value, it doesn't say anything about what happens when that value goes out of scope.


> suddenly becomes "state of the art" because hardware have become powerful enough.

I'd rather say that we were capable of such a design for several decades, but only now has the set of trade-offs in place made this attractive. Single-core performance was stifled in the last two decades by prioritizing horizontal scaling (more cores), thus the complexity / die area of each individual core became a concern. I imagine that if this trend had not taken place for some reason, and CPU designers had primarily pursued single-core performance, we'd have seen this implemented much sooner.

Regarding the Z-buffer, I kind of get that this would appear as a side note, it's a simple concept. Perhaps an even better example is ray tracing - the concept is quite obvious even to people without a 3D graphics background, but it was just impractical (for real-time rendering) in terms of performance until recently. What I find interesting is that we haven't found a simpler approach to approximate true-to-life rendering and need to fall back on this old, sort of naive (and expensive) solution.


The EEVDF scheduling algorithm is also a good example. Designed back in 1995 and it's the default process scheduler of Linux now.

On the software side, garbage collection was well explored by academia for more than two decades before the JVM brought it to wide commercial adoption ca. 1995.

Lisp and Smalltalk and BASIC (to name a few) were popular languages before the JVM that used GC.

When I was in high school, I wrote a Z80 disassembler in BASIC and printed out the entire TRS-80 Model 3 ROM on a thick stack of fan folded greenline paper. I spent a lot of free time figuring out what was what and commenting the code by pen in the right margin.

The (string) garbage collector was amazingly frugal and correspondingly slow. Because the garbage collector typically ran only when memory was critically low it used almost no memory, just 4 bytes I think. (it was also possible to invoke it any time with the FRE(0) function call IIRC)

The routine would sweep all of the symbol table and expression stack looking for string pointers, and would remember the lowest (highest? I forget which) string base address which hadn't already been compacted. After each pass it would move that one string and fix the symbol table pointer (or expression stack string pointer), then do another pass. As a result, it was quadratic in the number of live strings -- so if you had a large number of small strings, it was possible for your program to lock up solidly for 30 seconds or more.
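A rough sketch of that one-string-per-sweep compaction, just as I read the description above (the data layout and names are mine, not the actual ROM code):

  #include <cstring>
  #include <vector>

  struct StrDesc { char* ptr; int len; };              // symbol table / expression stack entry

  // Strings get packed upward against `top`; one string is moved per full sweep of
  // the roots, which is what makes it quadratic in the number of live strings.
  void compact(std::vector<StrDesc*>& roots, char* top) {
      char* packed = top;                               // everything at/above `packed` is done
      for (;;) {
          StrDesc* best = nullptr;                      // highest not-yet-compacted string
          for (StrDesc* d : roots)                      // full sweep of every root, every pass
              if (d->ptr < packed && (!best || d->ptr > best->ptr))
                  best = d;
          if (!best) break;                             // nothing left to move
          packed -= best->len;
          std::memmove(packed, best->ptr, best->len);   // slide the one string up
          best->ptr = packed;                           // fix its single pointer
      }
  }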


Not academia: GC (and lots of other PLT) was mature and productized then, just missed by the majority of the industry living in the dark ages. In the 90s, enterprisey architecture-astronaut cultures went Java/C++ and hackerish cultures went Perl/Python seasoned with C.

Speculative predictors have been subjected to a number of attacks to weasel out private data. Given that so many of the common ISAs are vulnerable, are they taking steps to reduce the impact of such attacks?

The vulnerability is speculative execution, not branch prediction. The branch predictor is the thing you have to trick to force the processor to speculatively execute code in the victim program. Furthermore you also need a valid timing source to read out the results of the speculative execution.

As for how to stop that, short of boiling the ocean[0], you don't. Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.

Actual practical mitigations for speculative execution vulnerabilities are varied, but at a minimum you have to ensure process separation between a victim process holding secrets and any potential attackers that may have the opportunity to influence victim process execution. Intel was caught with their hands in their pants speculating across rings, which is why you could read kernel or hypervisor memory from userspace, but on not-poorly-designed CPUs the main victim you have to worry about is HTML iframes. Different origins aren't allowed to make HTTP requests to one another[1], but they can transclude[2] one another without permission. That traditionally loaded information from the origin into the attacker's process, which could be exfiltrated with timing attacks.

The web's solution to this was actually not to process-separate iframes, at least not initially, but to take away shared-memory multithreading entirely. If you deny the attacker a timing reference then it doesn't matter what they can make the victim speculatively execute. But to do this you have to take away multithreading because otherwise a thread can just repeatedly write known data in a loop to create a clock.

[0] https://hackaday.com/2013/08/02/the-mill-cpu-architecture/

[1] At least not without the target origin allowing it via CORS

[2] e.g. hotlink images or embed iframes


> Speculative execution is so valuable for performance that a computer without it is completely unusable.

Jim Keller's view aligns with this and goes further. My interpretation of his thinking is that predictors and speculation are the only meaningful features of CPUs today. ISA doesn't matter anymore because the power of modern compilers makes high performance software highly portable and all CPUs end up bottlenecked on the quality and capacity of the predictors, regardless of the ISA. For example, the burden of x86 complexity no longer matters because it amounts to a "tax" small enough to be lost in the noise.

That's from a designer making high performance RISC-V CPUs.


This article has several paragraphs discussing how the decode width of x86 front ends is limited by the need to discover instruction boundaries, which in turn limits the useful issue width. Big ARM cores have much larger decode width than x86 cores, so they don’t need SMT to keep their execution units busy.

ISA doesn’t matter any more because all the CISCiest CISCs and the RISCiest RISCs have been discarded (except for RISC V…) so modern CPUs don’t have to cope with multiple memory addresses per instruction or indirect addressing.


This is so annoying about the hype around ARM, for which even smart people fall. Yes, Apple Silicon is good, but it's not because of ARM ISA. I still keep hearing that ARM is RISC which is wrong since the 1990s.

I think a few other things like memory models also matter and affect cpu architecture. E.g. the x86 total store order vs arm64's model. You potentially get to do a few nice optimizations on arm64 vs x86. I'm not sure how much of a difference that makes though.

The first pentium had branch prediction so of course it had speculative execution.

What it didn't have is out of order execution, which greatly increases the speculation window.


> Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.

Pentiums have branch prediction and speculative execution. You need to go back to i486 if you don't want speculative execution. Most of the socket 5/7 processors from other makers also had branch predictors and speculative execution, but not the Centaur Winchip. The Cyrix 5x86 for socket 3 (486) had speculative execution, but it was disabled by default and is reported to be buggy (but helps performance on published benchmarks).


Interactions between speculative execution and virtual memory translation and caches are exploitable. It's not an inherent vulnerability in prediction.

Sure, as long as it’s not observable in any way, it’s not observable. But the problem is that speculation has repeatedly been found to be observable in unexpected ways on both brands.

AMD, for example, has observability issues in all Zen 1/2/3 processors that leak enough data to break KASLR, which remain unpatched in all chips except Milan (specifically Epyc only, not Ryzen). They didn’t expect cache ways to be visible in that fashion, and it’s observable that the model is incorrectly implemented.

The idea cannot fail, only be failed, because if it’s observable then obviously you just did it wrong.


The point was, you can engineer around problems that aren't _inherent_ in the design, but arise as an interaction between systems. You just change the rules of interaction between systems to mitigate the vulnerabilities.

So, to your original question, we're always going to have prediction, and we're going to have to solve any vulnerabilities that arise downstream. Which, thankfully, is always an available option.


As a novice in this area, it's not clear to me after reading this what exactly the 2-ahead branch predictor is.

My understanding is that they do not predict the target of the next branch but of the one after the next (2-ahead). This is probably much harder than next-branch prediction but does allow initiating code fetch much earlier to feed even deeper pipelines.

Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.


Interestingly you can build a branch predictor that predicts the second branch without predicting the first.

A branch predictor result is just a tuple of ("branch instruction address", "branch target address") that hints to the processor that when it encounters the given branch instruction in the future (at the "branch instruction address") it will likely branch to the branch target, so it makes sense to start fetching that address and filling the instruction pipeline with whatever steps are safe to perform before the jump is actually performed.

Now, commonly this branch happens to be at the end of the current basic block and I assume some branch predictors may also leverage this fact in order to encode only offsets from the current instruction pointer.

But there is no reason why the branch location can't come after some other branches are taken. As long as the CPU eventually gets to that branch location, the prediction will be useful. If the IP never reaches that location, it's as if the branch was never actually taken.
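A minimal sketch of such a table (a direct-mapped BTB); the field names, table size, and indexing are illustrative assumptions, not any real core's design:

  #include <array>
  #include <cstdint>
  #include <optional>

  struct BtbEntry {
      uint64_t branch_pc = 0;   // address of the branch instruction
      uint64_t target_pc = 0;   // address it was last seen jumping to
      bool     valid     = false;
  };

  struct Btb {
      std::array<BtbEntry, 4096> entries{};

      // Hit: start fetching from the predicted target. Miss: just fall through.
      std::optional<uint64_t> predict(uint64_t fetch_pc) const {
          const BtbEntry& e = entries[fetch_pc % entries.size()];
          if (e.valid && e.branch_pc == fetch_pc) return e.target_pc;
          return std::nullopt;
      }

      // Called when the branch actually resolves, to record/refresh the pairing.
      void train(uint64_t branch_pc, uint64_t taken_target) {
          entries[branch_pc % entries.size()] = {branch_pc, taken_target, true};
      }
  };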


> Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.

I'm not involved in CPU design, I just read a lot, but...

I think you need to do something special to have a second prediction, because you have to track three windows of out of order execution:

Window 0: code you're definitely running but is still being completed.

Window 1: code from the branch you think will be taken

Window 2: code from the 2-ahead branch you think will be taken.

If you figure out that the window 1 branch isn't taken, you have to drop the whole pipeline (pipeline bubble). But if you figure out that window 1 is taken, then window 1 becomes window 0 and window 2 becomes window 1.

With a 1 ahead predictor, the pipeline stalls if you get to a conditional branch while speculating in window 1, because the processor can't manage three instruction windows.

IMHO, it sounds like if the core is doing SMT and both threads are active, each thread only gets 1-ahead prediction because the two windows are statically divided between the CPU threads. This may mean a) a significant boost for some loads when SMT is not in use, and b) SMT can branch predict in both threads in the same cycle; I don't think that was possible on AMD before (no idea for other vendors).


With a typical OoO CPU you have reorder buffers in the hundreds of instructions, so you are not speculating two or three branches but potentially dozens ahead of non-speculative execution (and that's why prediction accuracy is so important).

So the question remains: what's the innovation here? I'm sure there is something, but it is not simply speculating two ahead.

I need to read further, but this might be an optimization in the fetch stage to avoid a bubble when fetching consecutive taken branches.


> Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

I'd think if you are at PC N and there are branches at N+1 and N+2, predicting just branch N+2 is fine because you predicted the N+1 branch previously, at PC N-1.


Building on the sibling comment:

  if (a) { ... }

  if (b) { return x; } else { return y; }
The two branches can be wholly independent, but predicting the second is still a two-ahead prediction.

I wonder what they did before this change. Branch predictor hardware may not have accounted for depth beyond a single conditional branch, but the pipeline was probably always kept filled, just unpredicted.

Ah, that makes sense in the context of the article - thanks!

You're not alone. Non-novice, same here. Article spends a lot of time explaining the absolute basics of branch prediction but then when it gets to 2-ahead it just skips over explaining it...

I think it just predicts 2 branches per cycle instead of 1. So it can evaluate the result of n+2 ahead of time instead of only n+1 (typical branch prediction). How this works without wrecking the L1 cache, I'm not sure. It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

> Zen 5 can look farther forward in the instruction stream beyond the 2nd taken branch and as a result Zen 5 can have 3 prediction windows where all 3 windows are useful in producing instructions for decoding.

The original paper is open access but I haven't read far into it: https://dl.acm.org/doi/10.1145/237090.237169


> It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

The frontend is already predicting dozens of branches ahead of what the backend can actually confirm. Looking ahead by one extra branch doesn't really hurt.

Also, modern TAGE branch predictors are scary accurate, well above 99% on most code (including unpredictable indirect jumps). Besides, the majority of branch prediction targets are already in the L1 cache, it only predicts them because it saw them recently.

The branch predictor in Apple's M1 actually takes advantage of the latter fact. It doesn't predict what address to fetch next, it predicts which cacheline in L1 holds the target. So you only actually get branch predictions for targets in L1.


It's from around 30 years ago; my guess is it's referring to this[0] paper from 1996. It's above my head, but it seems to help with branch prediction issues arising with both many instruction units and high clock speeds, which were sort of either/or in the '90s, but I think most modern processors are both.

Multiple-block ahead branch predictors

Abstract:

A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current instruction block is not used for predicting the address of the next instruction block, but rather for predicting the block following the next instruction block. This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle. Furthermore, pipelining the branch prediction process can also be done by means of our predictor for "speed demon" processors to achieve higher clock rate or to improve the prediction accuracy by means of bigger prediction structures. Moreover, and unlike the previously-proposed multiple predictor schemes, multiple-block ahead branch predictors can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high-performance on superscalar processors.

[0] https://dl.acm.org/doi/10.1145/237090.237169

EDIT: oops. Looks like eyegor posted the link earlier. Oh well, enjoy the abstract.


You can check out the seminal paper linked in the article. Or start by summarizing the paper with Gemini, Claude, ChatGPT, etc. to get a high level overview (and then confirm the answer by reading the paper).


We would need more branch hints? https://github.com/ziglang/zig/issues/5177

Cold, warm, warmer, omit hot as it is the default? Sometimes you would set all branches to be cold except one
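For comparison, C++20 only exposes two hint levels today ([[likely]] / [[unlikely]]); a hedged sketch of the "everything cold except one path" pattern (the function and cases are made up for illustration):

  #include <cstdio>

  int handle_status(int status) {
      if (status == 0) [[likely]] {       // the single hot path
          return 0;
      } else if (status == -1) [[unlikely]] {
          std::puts("timeout");           // cold: rare timeout
          return 1;
      } else [[unlikely]] {
          std::puts("error");             // cold: everything else
          return 2;
      }
  }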


that's probably a bad idea but I would like to learn why:

why, when we have a conditional branch, can't we just fetch and prepare instructions for both possible branches and then discard the incorrect one?

is this that much harder, or are there other reasons that make this not worth it?


Because it's rare for a branch result to be random. The compiler/runtime/cpu/etc can often guess which result is more likely, and correctly not do the extra work in the first place, and so that's usually the better strategy than spending silicon and heat on the wrong answer just in case.

I think a lot of people don't have an intuition about how accurate branch prediction can be, but if you look at your own code, you'll quickly realize "well, yeah, control flow is almost always going to go this way and we just have this branch so we can handle the exceptional case" -- compilers can often deduce this pretty well themselves now, and cpus/jits/runtimes can develop some pretty impressive heuristics as well, and when all those fail you can often add explicit hints in your code that tell your compiler/etc what you expect if they can't guess.
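As a sketch of how little state it takes to exploit that bias, here's the classic per-branch 2-bit saturating counter (a toy illustration; real predictors like TAGE layer much more history on top of this):

  #include <cstdint>

  struct TwoBitCounter {
      uint8_t state = 2;                         // 0,1 = predict not-taken; 2,3 = predict taken
      bool predict() const { return state >= 2; }
      void update(bool taken) {                  // saturate toward the observed direction
          if (taken)  { if (state < 3) ++state; }
          else        { if (state > 0) --state; }
      }
  };

A branch that is taken, say, 99% of the time ends up mispredicted only around the rare not-taken occurrences, so accuracy closely tracks the bias.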


> it's rare for a branch result to be random

How rare, though?

QuickSort has fundamentally unpredictable branches, and it’s a pretty widely used algorithm. Binary search, B-trees also come to mind.


Binary searching is quite slow and should be used sparingly, not necessarily because of branch misprediction but because of memory stalls - you're almost always guaranteed to have a cache miss during the search. Similarly for B-trees, it's going to be memory stalls that you're probably more focused on addressing, not branch mispredicts.

This probably depends on the size of the area to be searched, and just how hot that region is. After all, if it's fairly small, there won't be any cache misses, and the data structure does use less memory than a typical hash table, which is itself an advantage.

If the size of the data is small, a linear search through a contiguous array is going to be far faster than anything more complex.

Yep; though we'd have to test a few cases to figure out what the cutoffs are here, and if there's any middle ground left for a divide-and-conquer strategy.

It's also definitely going to depend on the cost of the hash function and comparison function - for something like strings, where those can be quite expensive, binary search probably has a better chance of applicability than for GUIDs, say.


Well, it depends on how expensive your comparison is compared to a cache miss.

B-trees are cache friendly, spending more time doing linear scans at each level to keep the tree shallow, and thus indirection less frequent. They're designed for high latency indirection - loading pages from spinning rust.

Qsort can be implemented branchless by converting the branch into a data dependency.
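A hedged sketch of that trick: a Lomuto-style partition where the comparison result feeds arithmetic instead of a conditional jump (illustrative code, not any particular library's qsort):

  #include <cstddef>
  #include <utility>

  // Elements smaller than the pivot end up in [0, store); the swap is unconditional
  // and the comparison only advances the store index, so there's no data-dependent branch.
  size_t partition_branchless(int* a, size_t n, int pivot) {
      size_t store = 0;
      for (size_t i = 0; i < n; ++i) {
          bool lt = a[i] < pivot;    // 0 or 1, consumed as data
          std::swap(a[store], a[i]);
          store += lt;
      }
      return store;
  }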

Also happens to be why quicksort loses to almost anything else on small arrays, even bubble sort

Loops are usually much more one way than the other.

In benchmarks, branch predictors guess correctly 90-95% of the time.

Disclaimer: I don't work in this field, just an enthusiast.

As far as I can tell, branch predictors have always been too good for it to be worth it. Modern CPUs have instruction reorder buffers that are hundreds of instructions deep, so even if only 8 of those instructions are conditional jumps, there are 2^8 = 256 different paths your program could take. If your branch predictor predicts all 8 correctly >50% of the time (it does), doing 256x the work to cover your ass is not worth it.


Divergence. On an OoO pipeline, fetch might be ahead of retire by hundreds of instructions. That's dozens of branches, and if each caused execution to speculatively fork, it wouldn't scale.

You could potentially do it only for predicted "unpredictable" branches. Now the tradeoff is wasted power and execution units for dead work and so far the tradeoff was just not worth it.

Some form of this has been experimented with. In the late '90s SPARC experimented with scouting threads, and as mentioned elsewhere in the thread, Intel's efficient cores can fetch (but not execute) across branches.


It's a sub-optimal strategy.

A modern TAGE branch predictor is correct well over 99% of the time. So those extra instructions for the other side of the branch are almost always discarded.

Worse, the frontend is fetching dozens of branches ahead of where the backend can actually confirm which direction to take. What are you going to do at the next branch, start decoding four possible branches? then 8, 16, 32 possible branches? Remember, most of the time you are going to throw it away.

If you actually have the hardware to fetch from multiple instruction streams in parallel (which Intel's Gracemont/Goldmont/Skymont and now AMDs Zen 5 do), the better strategy is to assume your branch predictor is actually correct 100% of the time. Follow one side of the branch, then the one after it.

Intel's Skymont actually decodes the next 3 branch targets in parallel because it has three decoders, each 3-wide. Intel actually inserts fake branches to break up large blocks of code, so that all three decoders are always active decoding different parts of the upcoming instruction stream. The three uop streams are later merged, allowing Skymont to maintain an effective decode bandwidth of 9 instructions per cycle.

If you executed both sides of the branch, you would only slightly reduce the branch mispredict delay in the rare case the branch prediction was wrong. Instead, by following one side of the branch and then the next two or three predictions, Intel and AMD can make multiple decoders do work in parallel. Intel is doing 9-wide with three simpler 3-wide decoders, and AMD can do 8-wide with two simpler 4-wide decoders.


That's called speculative execution and IIRC all modern CPUs are doing it.

It requires more silicon to hold more microarchitectural state and more execution units to fully exploit the technique, but superscalar CPUs already have those since they are essential to exploit instruction level parallelism in non-branchy code. The rest is "just" a lot of headaches to handle complicated stuff such as aliasing, interrupts, ... But hardware engineers are such wizards they can do these things too.

Turns out however that speculative execution opens up a possibility of abusing a cache timing side channel to extract information from data touched by branches of code that has been only speculatively executed but whose architectural side effects were not committed (i.e. not "really" executed).

Which includes code that had been explicitly not executed because of a conditional check (e.g. permissions, ...)

A familiar instance of such an attack is Spectre [1]

1: https://en.m.wikipedia.org/wiki/Spectre_(security_vulnerabil...


We reached 90% accuracy decades ago. Depending on workload modern chips can do way better.

So basically it’s just nowhere near worth it. Much better to use those chip resources for another thread or core.


It's a huge waste of energy and in some cases it would even be slower because you'd execute more instructions overall. If the branch mispredict rate is around 1% it's simply not worth paying a penalty 99% of the time to get a gain 1% of the time. Maybe it would be worth doing on low-confidence branches.

The vast vast vast majority of instruction pipelines are sparse. You can pack in non-dependent instructions essentially for free.

Most branches are biased one way or the other. "Fetching down both paths" means not exploiting any information you might have gathered about a branch being biased - I think that would be equivalent to randomly predicting the branch (except for it would cost more than a random predictor because you're actually fetching both ways instead of just one).

- Side effects; how do you handle two different writes to memory?

- Double the execution units; very expensive for wide vector units

- Massive waste of energy as half the resources will always be wasted no matter what

- Bad scaling, i.e. four branches ahead would require 16x the resources


Handling two different writes to memory is not really a concern - existing speculative/out-of-order processors already solve this issue by completing (performing the architectural side effects of) instructions in order. So even if two writes are made, one in each branch, by the time the write is meant to be completed, the prior branch is resolved and we know which write is actually meant to be made; the bad one can be discarded.

Doubling the execution units also isn't strictly needed - you can use the existing out-of-order core to send two sets of instructions through the same functional units. There will be more contention for the resources, possibly causing stalls, but you don't need to fully double everything.

Things similar to this idea are already done in processors - simultaneous multithreading, early branch resolution, conditional instructions, are all ideas that have similar implementation difficulties. So the reason this specific idea is not done is more in line with your last two points rather than the first two.


Transistor count; now you have to duplicate all the decode and speculative execution circuitry for both possible branches

No, the same circuits would execute them interleaved just like they execute multiple hardware threads now

With the SMT core count having to be one less.

> why when we have a conditional branch we cannot just fetch and prepare instructions for both possible branches and then discard the incorrect one?

We actually do that. It's called a GPU. And it sucks for general code.


Now all it needs is more memory bandwidth, because those two memory channels on the consumer AM5 socket are pathetic given the compute this will deliver, and especially in comparison with even the most basic ASi.

I moved to an M2 Max from a chunky Zen setup and it's a revelation how much the memory bandwidth improvement accelerates intensive data work. Also for heavy-ish multitasking the Zen setup's narrow memory pipe would often choke.


There are very few applications that are actually memory bandwidth bound but aren't more suited to a GPU than a CPU.

The reason people have been looking at Apple Silicon for LLMs in particular is that even though they are more suited to GPUs, they also require a lot of VRAM and NVIDIA charges an extortionate amount of money for GPUs with a lot of VRAM.

What AMD should really do if they want to steal NVIDIA's thunder is to sell consumer GPUs with 64-128GB of VRAM.


> What AMD should really do if they want to steal NVIDIA's thunder is to sell consumer GPUs with 64-128GB of VRAM.

For gaming, it's an overkill (esp. given the price), it would be useful only for LLM enthusiasts which isn't that big of a market.


Then how come so many people are buying GPUs with that amount of VRAM for extraordinary amounts of money?

Nobody said it was for gaming. Though of course consumers could use it for both.


But AMD already offer GPUs with that amount of VRAM for extraordinary amounts of money, so there'll be no change?

And from what I hear they're selling every one they're producing, even to the level where there's a ~6 month waiting list. So even if they reduced the price, the only difference would be that the list would get longer and AMD would get less money. No more devices would actually make it into people's hands, being production limited.


The production limit isn't in the amount of VRAM. They could offer any of their existing consumer GPUs in versions with more VRAM with prices that add twice their own cost for the extra VRAM (but still far less than the datacenter versions) and thereby sell the same GPUs for higher margins.

This might even leave gamers with more GPUs, because right now there are people buying e.g. four 24GB consumer GPUs to run a large model, and they could instead buy one 96GB GPU and leave three more for gamers.


You're not really saying "They're not offering a product" so much as "They're not offering a product at a price I want", which is really a very different statement.

They're intentionally differentiated in the market and marked up due to the relatively higher R&D costs of targeting a much smaller market niche, with its own demands on development. I doubt you'll just be running games on them, after all.

If that is your niche, AMD is effectively saying they don't want your custom at that price. No point selling products that lose money, after all. AMD have done it before - arguably are doing it right now as their graphics BU lost money last year if you exclude APUs and consoles - and it went badly each time (despite what some people say about "Market Share!" on some internet forums).


That strategy doesn't win though. The enterprise is going to buy the enterprise GPU anyway because they're spending someone else's money and it's a tax write off and the enterprise GPUs are faster and have ECC memory in addition to having more VRAM. But giving something for hobbyists to play with is how you build an ecosystem. So not only do they get higher margins on their consumer GPUs, they get more code written for their architecture, which lets them sell more of the expensive GPUs.

Not doing that is making a mistake.


>> Now when Zen 5 has two threads active, the decode clusters and the accompanying fetch pipes are statically partitioned.

This sounds like a big boost for hyper threading performance. My Zen1 gets about 25 percent faster due to HT. Has anyone tested the newer ones in this regard?


High SMT speedups aren’t a good thing, because those are pipeline bubbles - resources that can’t be saturated by the first thread.

The ideal speedup from SMT is 0% because you’d be getting full output for a single thread. Like if you got a 50% speedup from SMT, in an ideal world that means a single thread could run 50% faster than it is, but it’s being stalled by pipeline bubbles.


Yes, while in my older tests on Intel Skylake derivatives I have also obtained SMT speedups around 25%, on the newer Zen 3 (a 5900X) I have obtained at most a 20% speedup in the most SMT friendly task that I have ever encountered, i.e. the compilation of a big software project (the comparison being done between optimal parameters for SMT disabled vs. SMT enabled, which for a 5900X have been determined to be "make -j13" vs. "make -j24").

An example of a multithreaded benchmark that is not SMT friendly is the GeekBench 6 multithreaded test, where Zen 3 with SMT disabled (12 threads on a 5900X) is slightly faster than with SMT enabled (24 threads on a 5900X).


For a fully compute-bound workload, you're certainly correct.

That's rare though. All it takes is a couple stalls waiting on memory where a second thread is able to make progress to make that "ideal speedup" certainly be nonzero.


Regardless though why would it potentially being higher in newer architectures be viewed as a good thing?

Because most code is not running anywhere near saturation of the available resources, and the problem is only getting worse as cores get wider. I mean, look at the Zen5 block diagram - there are 6 ALUs and 4 AGUs on the integer side alone! That's almost two entire Zen1 cores worth of execution resources, which is pretty amazing. Very, very little real world code is going to be able to get anywhere near saturating that every cycle. SMT helps improve the utilization of the execution resources that have already been paid for in the core.

I'll give another example from my own experience. I write a lot of code in the computer graphics domain. Some of the more numeric-oriented routines are able to saturate the execution resources, and get approximately 0% speedup from SMT.

Importantly though, there are other routines that make heavy use of lookup tables. Even though the tables reside completely within L1 cache, there are some really long dependency chains where the 3-4 cycle wait for L1 chains up and causes some really low utilization of the ALUs. Or at least, that's my theory. :) Regardless, in that code running SMT provides about a 30% speedup "for free", which is quite impressive.
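As a hedged toy of that kind of dependency chain (not my actual graphics code): each load's address depends on the previous result, so even with L1 hits the loop serializes at a few cycles per step and leaves most execution ports idle for the sibling SMT thread:

  #include <cstdint>
  #include <vector>

  uint32_t chase(const std::vector<uint32_t>& table, uint32_t start, int steps) {
      uint32_t idx = start;
      for (int i = 0; i < steps; ++i)
          idx = table[idx];   // next index isn't known until this load completes
      return idx;
  }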

I was uncertain for a while whether SMT had a future, but I think for x86 in general it provides some pretty significant gains, for a design complexity that has already been 'paid' for.


With the continuous improvement of out-of-order execution, the SMT gains have been diminishing from Zen 1 to Zen 4.

However you are right that Zen 5, like also the Intel Lion Cove core, has a jump in the number of available execution resources and it is likely that out-of-order execution will not be enough to keep them busy.

This may lead to a higher SMT gain on Zen 5, perhaps on average around 30% (from typically under 20% with Zen 3 or Zen 4), like in the Intel presentation where they compared a Lion Cove without SMT with a Lion Cove with SMT. In the context of a hybrid CPU, where the MT-performance can be better provided by efficient cores than by SMT, Intel has chosen to omit SMT, for better PPA (performance per power and area), but in the context of their future server CPU with big cores, they will use cores with SMT (and with wider SIMD execution units, to increase the energy efficiency).


> Regardless though why would it potentially being higher in newer architectures be viewed as a good thing?

Because SMT getting faster is a nearly free side-effect. We didn't add extra units to speed up SMT at the cost of single-thread speed. We added extra units to speed up the single thread, and they just happened to speed up SMT even more (at least for the purpose of this theoretical). That's better than speeding up SMT the same percent, or not speeding up SMT at all.

Imagine if I took a CPU and just made SMT slower, no other changes. That would be a bad thing even though it gets the speedup closer to 0%, right? And then if I undo that it's a good thing, right?


I still have no idea what a 2-ahead branch predictor is.

You could start by reading the old research papers linked in the article.

In general, the older research papers suppose that the reader knows much less about such subjects, because they were still much more niche knowledge at that time.


4 years after graduating college, my decision to dive into computer architecture classes has borne no fruit except my ability to loosely understand what writeups like this are talking about. But, I guess that's the point, isn't it? This is fascinating stuff, whether or not I need to know it.

Did the paper author get a cut from AMD?

They have no reason to give him any cut.

On the contrary, he points out that one of their most important innovations is not that innovative, because it was analyzed in detail almost 30 years ago.

Before AMD, Intel had started to use such multiple decoders in their Atom cores, now rebranded as E-cores (efficient cores). The multiple decoders have been first used in Tremont (2020), then in Gracemont (2021) and Crestmont (2023), and now in Skymont (2024).

It is hard to predict which of the decoders of Skymont (triple 3-instruction decoder) or of Zen 5 (double 4-instruction decoder) is better. AMD has made the choice of the double decoder because their core uses SMT, and a double decoder is easy to partition into two decoders when both simultaneous threads are active.



