
deviations from RISC-V #2

Open · brucehoult opened this issue Apr 3, 2022 · 6 comments

@brucehoult

> There are 31 registers, x1 to x31, along with x0, which is always zero. Use the same registers for both integers and floats. (this latter point deviates from RISC-V, because we are targeting creating a GPU, where locality is based around each of thousands of tiny cores, rather than around the FP unit vs the integer ALU).

RISC-V has the "Zfinx" extension, specifically for this. So if you follow that then you're not deviating.

https://github.com/riscv/riscv-zfinx

What else?

@hughperkins (Owner)

Interesting. Good info. Thank you :) Yes, effectively, my current ISA is aligned with the Zfinx extension, as you say.

That said, I intend to move towards using BF16 at some point, because it provides the same dynamic range as single precision, but with half the number of bits. BF16 is becoming increasingly popular for machine learning. In ML, the quantization of having only a 7-bit mantissa manifests itself as noise, and ML training loves noise.
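
(For illustration: BF16 is just the top half of an IEEE-754 single, the same sign bit and 8 exponent bits, hence the same dynamic range, but only 7 explicit mantissa bits. A minimal C sketch of the conversion, with made-up helper names, truncating rather than rounding:)

    #include <stdint.h>
    #include <string.h>

    // BF16 keeps FP32's sign bit and 8 exponent bits, and drops the low
    // 16 mantissa bits; that dropped precision is the quantization noise above.
    uint16_t float_to_bf16(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (uint16_t)(bits >> 16);  // truncate; real hardware would typically round-to-nearest-even
    }

    float bf16_to_float(uint16_t h) {
        uint32_t bits = (uint32_t)h << 16;  // the low mantissa bits come back as zeros
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

For example, bf16_to_float(float_to_bf16(3.14159f)) gives 3.140625f.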

When I move towards BF16, I haven't quite decided how I intend to do that. There are two options I see:

  1. pack two BF16 numbers into a single 32-bit register
  2. use 16-bit registers

I'm fairly tempted by the second option, which would be a deviation from Zhinx, if I understand correctly? Actually, both options would be, since Zhinx would use a full 32-bit register to store each 16-bit half float, if I understand correctly?
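
(For concreteness, a hypothetical C sketch of option 1, two BF16 lanes packed into one 32-bit register; the names here are made up:)

    #include <stdint.h>

    // Option 1: treat one 32-bit register as two independent BF16 lanes.
    // A packed add/mul would then operate on both lanes at once, SIMD-style.
    static inline uint32_t bf16x2_pack(uint16_t lane0, uint16_t lane1) {
        return (uint32_t)lane0 | ((uint32_t)lane1 << 16);
    }

    static inline uint16_t bf16x2_lane0(uint32_t v) { return (uint16_t)(v & 0xffff); }
    static inline uint16_t bf16x2_lane1(uint32_t v) { return (uint16_t)(v >> 16); }

Option 2 would instead halve the width of the register file itself.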

Big picture, I intend to only support BF16 floats. No 32-bit, no 64-bit, no FP16. This will keep the cores small and lightweight, and then we can either pack a lot of cores onto the same size die, or shrink the die, keeping tape-out costs lower.

@hughperkins (Owner)

> What else?

As far as what else...

@hughperkins (Owner)

hughperkins commented Apr 4, 2022

(Update: the controller is now capable of allocating GPU memory, and passing data back and forth to the GPU :) https://github.com/hughperkins/VeriGPU/blob/8fcaf074e50d798e6b14930027c0ad862f206dd4/prot/verilator/prot_unified_source/verilator_driver.cpp ) (Edit: I could do with a PCIe4 interface; opportunity for someone to add one whilst I'm working on the C++ kernel compilation/launch bits.)

@hughperkins (Owner)

Question: are you aware of any way of persuading clang/llvm to generate Zfinx-compatible assembly? I just now realized that if I:

  • use clang to separate out kernels, in a single-source scenario, into LLVM IR files,
  • and then use LLVM's llc to convert these LLVM IR files into riscv32 assembly files,

... then the assembly files will plausibly use more total int + float registers than I will have room for.
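
(For reference, a sketch of that two-step pipeline; the flags assume an llc build new enough to know +zfinx, and kernel.c is just a stand-in name for a separated-out kernel:)

    # 1. compile the kernel source to LLVM IR for the riscv32 target
    clang --target=riscv32 -O2 -S -emit-llvm kernel.c -o kernel.ll

    # 2. lower the IR to riscv32 assembly, keeping floats in the integer registers
    llc -march=riscv32 -mattr=+zfinx kernel.ll -o kernel.s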

@hughperkins
Copy link
Owner

Might be in llvm-14 :)

/usr/local/opt/llvm-14.0.0/bin/llc --march riscv32 -mattr=help 2>&1 | grep zfinx
  zfinx            - 'Zfinx' (Float in Integer).

@hughperkins (Owner)

Ok, so:

  • the bad news is that zfinx isn't in llvm 14. It's not even in main
  • the good news is that https://github.com/sunshaoce has fixed up +zfinx in https://reviews.llvm.org/D122918 , so that at least loads, stores, additions and multiplications are working now :) (I believe that a lot more than these operations are working, but at least my simple float kernels below compile and run now :)))) )

    __global__ void sum_floats(float *in, unsigned int numValues, float *p_out) {
        // sum the floats in in, and write the result to *p_out
        // we assume just a single thread/core for now
        float out = 0.0;
        for (unsigned int i = 0; i < numValues; i++) {
            out += in[i];
        }
        *p_out = out;
    }

and

    __global__ void mul_floats(float *in, unsigned int numValues, float *p_out) {
        // multiply the floats in in, and write the result to *p_out
        // we assume just a single thread/core for now
        float out = 1.0;
        for (unsigned int i = 0; i < numValues; i++) {
            out *= in[i];
        }
        *p_out = out;
    }
