
deviations from RISC-V #2

Open · brucehoult opened this issue Apr 3, 2022 · 6 comments

@brucehoult

> There are 31 registers, x1 to x31, along with x0, which is always zero. Use the same registers for both integers and floats. (this latter point deviates from RISC-V, because we are targeting creating a GPU, where locality is based around each of thousands of tiny cores, rather than around the FP unit vs the integer ALU).

RISC-V has the "Zfinx" extension, specifically for this. So if you follow that then you're not deviating.

https://github.com/riscv/riscv-zfinx

What else?

@hughperkins (Owner)

Interesting. Good info. Thank you :) Yes, effectively, my current ISA is aligned with the Zfinx extension, as you say.

That said, I intend to move towards using BF16 at some point, because it provides the same dynamic range as single precision, but with half the number of bits. BF16 is becoming increasingly popular for machine learning. In ML, the quantization of having only a 7-bit mantissa manifests itself as noise, and ML training loves noise.
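
(For illustration: BF16 is just the top half of an IEEE-754 single, the same sign bit and 8 exponent bits, hence the same dynamic range, but only 7 explicit mantissa bits. A minimal C sketch of the conversion, with made-up helper names, truncating rather than rounding:)

    #include <stdint.h>
    #include <string.h>

    // BF16 keeps FP32's sign bit and 8 exponent bits, and drops the low
    // 16 mantissa bits; that dropped precision is the quantization noise above.
    uint16_t float_to_bf16(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (uint16_t)(bits >> 16);  // truncate; real hardware would typically round-to-nearest-even
    }

    float bf16_to_float(uint16_t h) {
        uint32_t bits = (uint32_t)h << 16;  // the low mantissa bits come back as zeros
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

For example, bf16_to_float(float_to_bf16(3.14159f)) gives 3.140625f.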

When I move towards BF16, I haven't quite decided how I intend to do that. There are two options I see:

  1. pack two BF16 numbers into a single 32-bit register
  2. use 16-bit registers

I'm fairly tempted by the second option, which would be a deviation from Zhinx, if I understand correctly? Actually, both options would be, since Zhinx would use a full 32-bit register to store each 16-bit half float, if I understand correctly?
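
(For concreteness, a hypothetical C sketch of option 1, two BF16 lanes packed into one 32-bit register; the names here are made up:)

    #include <stdint.h>

    // Option 1: treat one 32-bit register as two independent BF16 lanes.
    // A packed add/mul would then operate on both lanes at once, SIMD-style.
    static inline uint32_t bf16x2_pack(uint16_t lane0, uint16_t lane1) {
        return (uint32_t)lane0 | ((uint32_t)lane1 << 16);
    }

    static inline uint16_t bf16x2_lane0(uint32_t v) { return (uint16_t)(v & 0xffff); }
    static inline uint16_t bf16x2_lane1(uint32_t v) { return (uint16_t)(v >> 16); }

Option 2 would instead halve the width of the register file itself.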

Big picture, I intend to only support BF16 floats. No 32-bit, no 64-bit, no FP16. This will keep the cores small and lightweight, and then we can either pack a lot of cores onto the same size die, or shrink the die, keeping tape-out costs lower.

@hughperkins (Owner)

> What else?

As far as what else...

@hughperkins (Owner)

hughperkins commented Apr 4, 2022

(Update: the controller is now capable of allocating GPU memory, and passing data back and forth to the GPU :) https://github.com/hughperkins/VeriGPU/blob/8fcaf074e50d798e6b14930027c0ad862f206dd4/prot/verilator/prot_unified_source/verilator_driver.cpp ) (Edit: I could do with a PCIe4 interface; opportunity for someone to add one whilst I'm working on the C++ kernel compilation/launch bits.)

@hughperkins (Owner)

Question: are you aware of any way of persuading clang/llvm to generate Zfinx-compatible assembly? I just now realized that if I:

  • use clang to separate out kernels, in a single-source scenario, into LLVM IR files,
  • and then use LLVM's llc to convert these LLVM IR files into riscv32 assembly files,

... then the assembly files will plausibly use more total int + float registers than I will have room for.
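
(For reference, a sketch of that two-step pipeline; the flags assume an llc build new enough to know +zfinx, and kernel.c is just a stand-in name for a separated-out kernel:)

    # 1. compile the kernel source to LLVM IR for the riscv32 target
    clang --target=riscv32 -O2 -S -emit-llvm kernel.c -o kernel.ll

    # 2. lower the IR to riscv32 assembly, keeping floats in the integer registers
    llc -march=riscv32 -mattr=+zfinx kernel.ll -o kernel.s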

@hughperkins
Copy link
Owner

Might be in llvm-14 :)

/usr/local/opt/llvm-14.0.0/bin/llc --march riscv32 -mattr=help 2>&1 | grep zfinx
  zfinx            - 'Zfinx' (Float in Integer).

@hughperkins (Owner)

Ok, so:

  • the bad news is that zfinx isn't in llvm 14. It's not even in main
  • the good news is that https://github.com/sunshaoce has fixed up +zfinx in https://reviews.llvm.org/D122918 , so that at least loads, stores, additions and multiplications are working now :) (I believe that a lot more than these operations are working, but at least my simple float kernels below compile and run now :)))) )

    __global__ void sum_floats(float *in, unsigned int numValues, float *p_out) {
        // sum the floats in in, and write the result to *p_out
        // we assume just a single thread/core for now
        float out = 0.0;
        for (unsigned int i = 0; i < numValues; i++) {
            out += in[i];
        }
        *p_out = out;
    }

and

    __global__ void mul_floats(float *in, unsigned int numValues, float *p_out) {
        // multiply the floats in in, and write the result to *p_out
        // we assume just a single thread/core for now
        float out = 1.0;
        for (unsigned int i = 0; i < numValues; i++) {
            out *= in[i];
        }
        *p_out = out;
    }
