Hacker News new | past | comments | ask | show | jobs | submit login

Hmmm... I think this argument is solid. Albeit biased from GMP's perspective, but bignums are used all the time in RSA / ECC, and probably other common tasks, so maybe its important enough to analyze at this level.

2-instructions to work with 64-bits, maybe 1 more instruction / macro-op for the compare-and-jump back up to a loop, and 1 more instruction for a loop counter of somekind?

So we're looking at ~4 instructions for 64-bits on ARM/x86, but ~9-instructions on RISC-V.

The loop will be performed in parallel in practice however due to Out-of-order / superscalar execution, so the discussion inside the post (2 instruction on x86 vs 7-instructions on RISC-V) probably is the closest to the truth.

----------

Question: is ~2-clock ticks per 64-bits really the ideal? I don't think so. It seems to me that bignum arithmetic is easily SIMD. Carries are NOT accounted for in x86 AVX or ARM NEON instructions, so x86, ARM, and RISC-V will probably be best.

I don't know exactly how to write a bignum addition loop in AVX off the top of my head. But I'd assume it'd be similar to the 7-instructions listed here, except... using 256-bit AVX-registers or 512-bit AVX512 registers.

So 7-instructions to perform 512-bits of bignum addition is 73-bits-per-clock cycle, far superior in speed to the 32-bits-per-clock cycle from add + adc (the 64-bit code with implicit condition codes).

AVX512 is uncommon, but AVX (256-bit) is common on x86 at least: leading to ~36-bits-per-clock tick.

----------

ARM has SVE, which is ambiguous (sometimes 128-bits, sometimes 512-bits). RISC-V has a bunch of competing vector instructions.

..........

Ultimately, I'm not convinced that the add + adc methodology here is best anymore for bignums. With a wide-enough vector, it seems more important to bring forth big 256-bit or 512-bit vector instructions for this use case?

EDIT: How many bits is the typical bignum? I think add+adc probably is best for 128, 256, or maybe even 512-bits. But moving up to 1024, 2048, or 4096 bits, SIMD might win out (hard to say without me writing code, but just a hunch).

2048-bit RSA is the common bignum, right? Any other bignums that are commonly used? EDIT2: Now that I think of it, addition isn't the common operation in RSA, but instead multiplication (and division which is based on multiplication).




> RISC-V has a bunch of competing vector instructions.

There is only one standard V extension. Alibaba made a chip with a prerelease version of that V extension which is thus incompatible with the final version, but in practice that just means that the vector unit on that chip is not used because it is incompatible, not that there are now competing standards


GMP is basically a worst-case example since it uses a lot of overflow. The RISC-V architecture has been extensively studies and for most cases it's a little more dense than (say) ARM when compared like-for-like.


> So 7-instructions to perform 512-bits of bignum addition is 73-bits-per-clock cycle, far superior in speed to the 32-bits-per-clock cycle from add + adc (the 64-bit code with implicit condition codes).

add+adc should still be 64 bits per cycle. adc doesn't just add the carry bit, it's an add instruction which includes the usual operands, plus the carry bit from the previous add or adc.


Can you treat the whole vector register as a single bignum on x86? If so, I totally missed that.


No.

Which is why I'm sure add / adc will still win at 128-bits, or 256-bits.

The main issue is that the vector-add instructions are missing carry-out entirely, so recreating the carry will be expensive. But with a big enough number, that carry propagation is parallelizable in log2(n), so a big enough bignum (like maybe 1024-bits) will probably be more efficient for SIMD.


even AVX512 dies




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: