> the so-called better results were for the compressed extension, not for the no...

brucehoult · on Dec 2, 2021

Indeed.

The "C" extension is technically optional, but I'm not aware of anyone who has made or sold a production chip without it -- generally only student projects or tiny cores for FPGAs running very simple programs don't have it.

My estimate is if you have even 200 to 300 instructions in your code it's cheaper to implement "C" than to build the extra SRAM/cache to hold the bigger code without it.

adrian_b · on Dec 3, 2021

The compressed encoding has good code density, but low speed.

The compressed RISC-V encoding must be compared with the ARMv8-M encoding not with the ARMv8-A.

The base 32-bit RISC-V encoding may be compared with the ARMv8-A, because only it can have comparable performance.

All the comparisons where RISC-V has better code density compare the compressed encoding with the 32-bit ARMv8-A. This is a classical example of apples-to-oranges, because the compressed encoding will never have a performance in the same league with ARMv8-A.

When the comparisons are matched, 16-bit RISC-V encoding with 16-bit ARMv8-M and 32-bit RISC-V with 32-bit ARMv8-A, RISC-V always loses in code density in both comparisons, because only the RISC-V branch instructions are frequently shorter than those of ARM, while all the other instructions are frequently longer.

There are good reasons to use RISC-V for various purposes, where either the lack of royalties or the easy customization of the instruction set are important, but claiming that it should be chosen not because it is cheaper, but because it were better, looks like the story with the sour grapes.

The value of RISC-V is not in its instruction set, because there are thousands of people who could design better ISAs in a week of work.

What is valuable about RISC-V is the set of software tools, compilers, binutils, debuggers etc. While a better ISA can be done in a week, recreating the complete software environment would need years of work.

FullyFunctional · on Dec 3, 2021

> The compressed encoding has good code density, but low speed.

That's 100% nonsense. They have the same performance and in fact, some pipelines can get better performance because they fetch a fixed number of bytes and with compressed instructions, that means more instructions fetched.

The rest of the argument falls apart resting on this fallacy.

adrian_b · on Dec 3, 2021

They have the same performance only in low performance CPUs intended for embedded applications.

If you want to use a RISC-V at a performance level good enough for being used in something like a mobile phone or a personal computer, you need to simultaneously decode at least 8 instructions per clock cycle and preferably much more, because to match 8 instructions of other CPUs you need at least 10 to 12 RISC-V instructions and sometimes much more.

Nobody has succeeded to simultaneously decode a significant number of compressed RISC-V instructions and it is unlikely that anyone would attempt this, because the cost in area and power of a decoder able to do this is much larger than the cost of a decoder for simultaneous decoding of fixed-length instructions.

This is the reason why also ARM uses a compressed encoding in their -M CPUs for embedded applications but a 32-bit fixed-length encoding in their -A CPUs for applications where more than 1 watt per core is available and high performance is needed.

brucehoult · on Dec 4, 2021

You're just making stuff up.

ARM doesn't have any cores that do 8 wide decode. Neither do Intel or AMD. Apple has, but Apple is not ARM and doesn't share their designs with ARM or ARM customers.

Cortex X-1 and X-1 have 5 wide decode. Cortex A78 and Neoverse N1 have 4 wide decode.

ARM uses compressed encoding in their 32 bit A-series CPUs, for example the Cortex A7, A15 and so on. The A15 is pretty fast, running at up to 2.5 GHz. It was used in phones such as the Galaxy S4 and Note 3 back before 64 bit became a selling point.

Several organisations are making wide RISC-V implementations. Most of them aren't disclosing what they are doing, but one has actually published details of how it's 4-8 wide RISC-V decoder works -- they decode 16 bytes of code at a time, which is 4 instructions if they are all 32 bit instructions, 8 instructions if they are all 16 bit instructions, somewhere between for a mix.

https://github.com/MoonbaseOtago/vroom

Everything is there, in the open, including the GPL licensed SystemVerilog source code. It's not complex. The decode scheme is modular and extensible to as wide as you want, with no increase in complexity, just slightly longer latency.

There are practical limits to how wide is useful not because you can't build it, but because most code has a branch every 5 or 6 instructions on average. You can build a 20-wide machine if you want -- it just won't be any faster because it doesn't fit most of the code you'll be executing.

amelius · on Dec 3, 2021

Doesn't decompression imply that there is some extra latency?

socialdemocrat · on Dec 3, 2021

No, it is just part of the regular instruction decoding. It is not like it is zip compressed. It is just 400 logic gates added to the decoder… which is nothing.

FullyFunctional · on Dec 3, 2021

All the implementation I know of all does the same thing: they expand the compressed instruction into a non-compressed instruction. For all (most?), this required an additional stage in the decoder. So in that sense, supporting C mean a slight increase in the branch mispredict penalty, but the instruction itself takes the same path with the same latency regardless of it being compressed or not.

Completely aside, compressed instruction hurt in a completely different way: as specified RISC-V happily allows instructions to be split across two cache lines, which could be from two different pages even. THIS is a royal pain in the ass and rules out certain implementation tricks. Also, the variable length instructions means more stages before you can act on the stream, including for example rename them. However a key point here is that it isn't a per-instruction penalty, it a penalty paid for all instruction if the pipeline support any variable length instructions.

amelius · on Dec 3, 2021

Yes, but the logic signal needs to ripple through those gates, which takes time.

kragen · on Dec 4, 2021

It potentially adds latency, but doesn't drop throughput.