RLSL: a Rust to SPIR-V Compiler (docs.google.com)
199 points by Impossible on Oct 21, 2019 | 45 comments



I've been working on a project with a similar goal to RLSL, called Emu. It's a procedural macro that automatically off-loads portions of code to run on a GPU. Here's a quick example...

  #[gpu_use]
  fn main() {
      let mut a = vec![0.2; 1000];
      let mut b = vec![0.3; 1000];
      let mut c = vec![0.0; 1000];

      gpu_do!(load(a));
      gpu_do!(load(b));
      gpu_do!(load(c));
      gpu_do!(launch());
      for i in 0..1000 {
          c[i] = a[i] + b[i];
      }
      gpu_do!(read(c));
      
      println!("{:?}", c);
  }
Emu is currently open-sourced as a 100% stable Rust library, and while it only supports a tiny subset of Rust, that subset is well-defined, with useful compile-time errors and a comprehensive test suite.

So if you're looking for a way to contribute to single-source GPGPU for Rust, please consider helping expand Emu's supported subset of Rust. The repository is at https://www.github.com/calebwin/emu

I will say that since Emu works at the AST/syntax level, RLSL is of great interest to me because it works instead at the MIR level, which allows it to more easily support a larger subset of Rust.


- So RLSL can work with Emu?

- Would it mean most general Rust code could be made to work on the GPU? Or is it that you want Emu to work at the MIR level?

- Do you plan to actually try to do it?

Emu seems like a really cool project either way. :)


- Maybe there is a component of RLSL that could be useful. I have to think more about what I want that component to be.

- I want Emu to support general Rust code but still use stable Rust and provide really nice compile-time errors. Maybe Emu could do AST-level checking to (1) ensure that only the legal, transpilable-to-SPIR-V subset is used, (2) infer the kernel parameters, and (3) infer global and local work sizes, and then do MIR-level compilation to OpenCL or SPIR-V?

- At the moment, I want to focus on AST-level compilation because I think many applications (AI, ML, simulations, etc.) can still technically be implemented without a huge subset of Rust.


I was planning to write a tiny SVM in Rust just as a plaything, so I would probably use Emu to see if I can speed it up....

Does Emu have a getting-started guide other than the docs?


https://docs.rs/em contains not only documentation but also a comprehensive explanation of how to use Emu effectively.

I would recommend looking through it first. Of course, if you have questions feel free to ask - https://gitter.im/talk-about-emu/thoughts


Since you've talked about single-source GPGPU for Rust, I should point out that I'm working towards a pre-RFC to add SYCL-like multi-target for Rust.


Woah excellent to see Embark partnering with the RLSL project!

As someone who does a lot of creative-coding contracts with lots of video, graphics, and real-time requirements, I've long found RLSL to be one of the Rust projects that excites me the most. The idea of writing graphics and compute shaders in Rust, a modern language with a decent type system, standard package manager, module system, etc., is very exciting. It makes a lot of sense that Embark sees the potential in this for game dev too.

The ability to share Rust code between the CPU and GPU alone will be so useful. The number of times I've had to port some pre-existing Rust function to GLSL or vice versa is silly.

Obviously the Rust that compiles to SPIR-V will be a subset of the Rust that compiles to LLVM or WASM, but this still opens up so many doors for improved ergonomics when interacting with the GPU from a Rust program.

I've long dreamed of an API that allows me to casually send a closure of almost arbitrary Rust code (with the necessary lifetime + Send + Sync + SpirvCompatible compile-time checks) off to the GPU for execution and get back a GPU Future of the result. It looks like this may well become possible in the future :)
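
For the curious, here is a minimal sketch of the kind of API I mean. Everything here (`Gpu`, `spawn`, `SpirvCompatible`) is hypothetical, and this placeholder just runs the closure on the CPU:

  use std::future::Future;

  /// Hypothetical marker: "this closure only uses the SPIR-V-compatible subset".
  trait SpirvCompatible {}

  /// Hypothetical handle to a GPU queue.
  struct Gpu;

  impl Gpu {
      /// Compile the closure to SPIR-V, dispatch it, and return a future of the result.
      fn spawn<F, T>(&self, f: F) -> impl Future<Output = T>
      where
          F: FnOnce() -> T + Send + Sync + SpirvCompatible + 'static,
          T: Send + 'static,
      {
          // Placeholder: a real implementation would submit the compiled kernel
          // and resolve the future once the readback completes.
          async move { f() }
      }
  }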


> Woah excellent to see Embark partnering with the RLSL project!

IIRC, they actually hired the guy behind RLSL a few months ago.


I thought it was the self-driving truck company.


This is a great write-up. I think Rust could be a very nice language for shader-like applications, which are already very functional and don't involve a lot of shared, mutable state across threads.

In HPC, we're very much interested in GPU compute programming, rather than shader programming. In CUDA codes, you're typically doing transformations from input buffers directly into output buffers in your CUDA kernels. This should immediately raise red flags for a Rust developer - you've got shared, mutable state across threads!

Consider this simple CUDA-ish Rust code with threads independently executing over 0..input.len() (ignore the bounds bugs at i = 0 and i = input.len()):

  fn stencil(i: usize, input: &[f32], out: &mut [f32]) {
      out[i] = (input[i - 1] + input[i] + input[i + 1]) / 3.0;
  }
(The `i` is a conceit around computing the index from thread/block IDs, but the input and output arrays are pretty similar to the style CUDA promotes.)

It's obvious to me, the programmer, that I don't have any aliasing issue - each thread is only mutating at a single index in the output array. However, Rust is not smart enough to see this. If they allowed the definition of the kernel as is, you could easily write multi-threaded code that has shared mutable access to individual memory locations, violating Rust's memory model. OK, you force the kernel to look more like this, then:

  // `input` is the slice [i-1, i+1]
  fn stencil(input: &[f32], out: &mut f32) {
      *out = (input[0] + input[1] + input[2]) / 3.0;
  }
And you'd enforce Rust's invariants at the kernel launch site, computing the valid slices at some higher level in the library in some "unsafe" code. But this only solves the simple case where you have some array mapping to another array where the index relationship is obvious, and it's easily provable that there are no aliasing issues. Start layering in things like unique indirect indexing, or perhaps non-unique indexing but with atomic reductions, and it becomes difficult to phrase your correct program in safe(!) Rust in a way that is compatible with the borrow checker, at least without having to build a bunch of abstractions to express each of your parallel patterns. Having to build a bunch of bespoke abstractions may not be scalable to the types of developers building big scientific codes.
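
For what it's worth, the launch-site idea can at least be sketched on the CPU. Here's a hypothetical `launch_stencil` that hands each kernel invocation a disjoint `&mut f32`, so safe Rust is satisfied without the kernel ever seeing the whole output buffer:

  // The kernel from above, unchanged.
  fn stencil(input: &[f32], out: &mut f32) {
      *out = (input[0] + input[1] + input[2]) / 3.0;
  }

  // Hypothetical launch site: pair each 3-wide window of the input with a
  // single mutable output element, skipping the boundary elements.
  fn launch_stencil(input: &[f32], output: &mut [f32]) {
      for (window, out) in input.windows(3).zip(output[1..].iter_mut()) {
          stencil(window, out);
      }
  }
But as soon as the index relationship stops being this regular (indirect or non-unique indexing), an iterator pairing like this is no longer enough, which is exactly the problem described above.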

Anyway, I'm curious if the folks at "Embark" have spent any time thinking about the issue of shared, mutable state in GPU programming with Rust. It seems like a deal breaker from where I stand.


There's no memory safety problem if the data you're racing on is pointer-free. Rust allows racing on memory with Relaxed atomics [1]. (Yes, I know about potential UB with floating point and whatnot; this is solvable.) Happily, GPU programming tends to use a lot of indices as opposed to direct pointers, because it makes CPU/GPU memory management easier--indices are valid no matter where in the address space the data in question is mapped.

Of course, you might then ask "what's the point of using Rust at all?" The answer is that a lot of code can be expressed in regular patterns idiomatic to the Rust type system and borrow checker. This isn't particularly different from any other type of programming. You sometimes see people say "Rust is useless because you can't safely write an arbitrary graph with direct pointers". The obvious fallacy with this argument is that most code doesn't need an arbitrary graph with direct pointers; for the small amount of code that does, you can drop into unsafe in a targeted way and still have a much safer program than one written in e.g. C. GPU programming is no different.

[1] https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

(Note that on x86 Relaxed atomics turn into plain mov instructions, and on ARM they turn into plain ldr and str.)
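
As a concrete CPU-side sketch of what "racing" on pointer-free data looks like in today's Rust: four threads deliberately overwrite the same AtomicU32 buffer with Relaxed stores, the final values are nondeterministic, and yet there is no UB and no locking.

  use std::sync::atomic::{AtomicU32, Ordering};
  use std::thread;

  fn main() {
      // Pointer-free shared state: a buffer of plain u32 "slots".
      let buf: Vec<AtomicU32> = (0..1024).map(|_| AtomicU32::new(0)).collect();
      thread::scope(|s| {
          for t in 1..=4u32 {
              let buf = &buf;
              s.spawn(move || {
                  // Every thread writes every slot; whichever store lands last wins.
                  for slot in buf {
                      slot.store(t, Ordering::Relaxed);
                  }
              });
          }
      });
      assert!(buf.iter().all(|slot| (1..=4).contains(&slot.load(Ordering::Relaxed))));
  }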


Huh - to make sure I understand correctly, is it your view (ignoring portability concerns) that it should be legal to store and load directly from a non-mut &[f32] when the platform has relaxed-atomic memory consistency by default? Or just that you should use a hypothetical "AtomicF32" with relaxed loads and stores and not worry about it from a performance perspective?


There should be an AtomicF32 type that supports unsynchronized reads/writes on GPU.
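
std doesn't have one today; as a rough illustration of the idea, a CPU-side stand-in can bit-cast through AtomicU32 (this is just a sketch, not what a GPU-side type would actually look like):

  use std::sync::atomic::{AtomicU32, Ordering};

  /// Rough stand-in: an f32 stored as its bit pattern in an AtomicU32.
  struct AtomicF32(AtomicU32);

  impl AtomicF32 {
      fn new(v: f32) -> Self {
          AtomicF32(AtomicU32::new(v.to_bits()))
      }
      fn load(&self) -> f32 {
          f32::from_bits(self.0.load(Ordering::Relaxed))
      }
      fn store(&self, v: f32) {
          self.0.store(v.to_bits(), Ordering::Relaxed);
      }
  }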


Thanks, that makes a lot of sense!


Your point is especially important, as most of the GPGPU performance gains come from really clever use of shared memory. That is the very model of shared state with concurrent access.

Usually, your shader or kernel is required to synchronize threads that need to see each other's changes. I'm sure that there are at least some kernels out there that skip synchronization and still work because there are enough other instructions between write and read accesses that cover the issue up.


> shared memory.

I've done enough Hacker News discussions to note that the lay-reader won't understand what this means, and this probably needs more elaboration.

"Shared Memory" is a special memory area inside of GPUs where grids (NVidia) or workgroups (AMD)... a group of 32x to 1024x SIMD threads... can perform inter-thread communications in an outrageously fast way.

Shared memory is extremely small: roughly 64kB in size. Optimizing shared memory access involves resolving bank conflicts and a lot of very low-level thought. At a minimum, you need to consider how you fill shared memory (the memcpy in) as well as how to get the final data out (the memcpy out).

---------

In many cases, synchronization to-and-from shared memory only requires a __threadfence() instruction. Maybe only a __syncwarp() instruction in some cases.

A lot of thought goes into the "ordering" of memory accesses, to make shared-memory as fast as possible. See here for further details: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39....

See "39.2.3 Avoiding Bank Conflicts" for how shared memory is optimized in practice. You're not only thinking about the shared-state, but also the average number of other threads hitting any particular bank. On a say... 32-bank system, you'll want all 32-banks to be utilized as much as possible. If all 1024 threads are accessing bank#0, you'll be 1/32th the speed (and banks #1 through #31 are all wasted).

A "bank" is basically the implicitly "RAID0" arrangement of shared-memory. The precise number of banks is somewhere between 16x banks or 32x banks, depending on architecture. But regardless, if all GPU-threads access memory location #0, that will all hit bank#0, which is slow. Instead, you want GPU-threads to "spread out" over the banks (maybe GPU-thread#0 should access Bank0 / Memory location #0. GPU-thread#1 should access bank1 / memory-location #16. GPU thread #2 should access bank2 / memory-location #32. Etc. etc.)


> I've done enough Hacker News discussions to note that the lay-reader won't understand what this means, and this probably needs more elaboration.

This is why I didn't mention it, but I agree that shared memory and typical uses present a particularly challenging problem for Rust.


Yeah, it's a little bit difficult to imagine Rust's concurrency primitives being used in the context of GPGPU programming. Like, are we talking about using channels to pass state between threads instead of using tools like barriers for synchronization? Seems like it would negate the advantages GPUs are designed to enable.


Rust does not force you to use channels, and lets you use stuff like https://doc.rust-lang.org/std/sync/atomic/fn.fence.html if you'd like.
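
For reference, the fence-based publish/consume pattern looks like this on the CPU (nothing GPU-specific, just std atomics; the names and values are made up for illustration):

  use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};

  static DATA: AtomicU32 = AtomicU32::new(0);
  static READY: AtomicBool = AtomicBool::new(false);

  fn producer() {
      DATA.store(42, Ordering::Relaxed);
      fence(Ordering::Release); // publish everything written above ...
      READY.store(true, Ordering::Relaxed);
  }

  fn consumer() -> Option<u32> {
      if READY.load(Ordering::Relaxed) {
          fence(Ordering::Acquire); // ... before anything read after this point
          Some(DATA.load(Ordering::Relaxed))
      } else {
          None
      }
  }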


I understand Rust's advantages in lots of use cases, but can someone elaborate on the benefits of using Rust over other languages for shaders?


Some advantages cited in the slides are:

1. consistency with their existing codebase

2. leveraging Rust tooling (i.e. code completion, error messages etc)

3. being able to compile the same code for CPU and GPU, for example to test and debug GPU code

None of these are Rust-specific, but it would make sense if you're already all-in on Rust. Should be an interesting experiment.


I'd also be curious to see some of the disadvantages?


The standard library is a bit awkward in places with its pointer usage. This is why I have a custom one.

From a language point of view, it fits really well. I am also constrained by syntax, as I want rlsl to be a strict subset of Rust.

Closures capture by reference, which means you always have to capture explicitly with move (no pointers in structs in SPIR-V), although I am fairly confident that this can be fixed.
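
For illustration, the explicit-move version looks like this; capturing by reference would put a pointer into the closure's environment, which a SPIR-V struct can't hold:

  fn make_scaler(scale: f32) -> impl Fn(f32) -> f32 {
      // `move` copies `scale` into the closure instead of borrowing it.
      move |x| x * scale
  }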

Also, MIR optimization passes are pretty weak right now, and rlsl would require an optimizer in the future, like SPIR-V <-> LLVM. I am not sure how good spirv-opt is right now.


Exciting talk! And I'm happy to know that one of the leading shader devs from the Rust community has successfully "embarked" :)

Technology-wise, the compiler goes from MIR to SPIR-V. This is specific to Rust and different from the other direction Khronos has been exploring: https://github.com/KhronosGroup/LLVM-SPIRV-Backend . It's a bit sad that we can't all have nice things.


MIR seems the better choice in Rust land in general. I wonder if Cranelift could be an alternative approach here in the future.


I didn't choose Cranelift because there was no real benefit to it. If the focus had been on optimizations, I might have used it. The IR seems more friendly compared to MIR, although I think there is a rewrite coming to make MIR more suitable for optimizations as well :).

And I do quite a lot of transformations at the MIR level, so I am looking forward to a more optimization friendly IR.


Just to be clear, SPIR-V has no relation to RISC-V?


No relation. SPIR-V is the fifth iteration of the Standard Portable Intermediate Representation, originally based on the LLVM IR.

It's an intermediate representation between the high level shader programming language and the GPU's native machine code. It's expected that the GPU will compile the SPIR-V to its own internal instruction set, rather than executing it natively.

RISC-V, on the other hand, is a native instruction set for CPUs.


SPIR-V really is an intermediate representation of a program. There's no way that this can be executed without any further translation. But it stops driver developers from having to write and ship complex compiler front ends that deviate from language specs in subtle ways that the shader writers need to test for and work around.


Yep, now shader writers only need to test for and work around backend bugs in drivers.

Also it makes gamedevs slightly happier that they don't have to ship the plain text source code of their shaders with games.

It doesn't provide any real security against RE, as SPIR-V is pretty close to being tokenized/SSAed GLSL and decompilers are trivial to implement. But if it makes them happier, so be it.


> SPIR-V is pretty close to being tokenized/SSAed GLSL

SPIR-V is in SSA form; GLSL most definitely is not.


I don't think that the V in SPIR-V has anything to do with the number 5. I think it's just that SPIR was renamed to SPIR-V with the major version that was introduced alongside the Vulkan graphics API and made SPIR usable for more than just OpenCL.


IIRC somebody said during one of 2015 SIGGRAPH presentations when it was introduced that it's the fifth iteration if you count SPIR pre-1.0, 1.0, 1.1, and 1.2 preceding it. They may have been joking.

Of course, the V in Vulkan also stems from 5. Vulkan was originally going to be OpenGL 5. That plan changed, but they chose a name starting with V as a nod to that history.


I'm sure that the name is also a nod to Mantle, the direct predecessor. I haven't seen the Mantle documentation, but rumor has it that Vulkan 1.0 initially was little more than Mantle with renamed identifiers.


Yeah. There were a lot of reasons why it was a good name. This panel briefly discussed the naming process: https://youtu.be/0b3x5Tlh6g0?t=2m30s


After Khronos decided to catch up with what all other 3D APIs were doing from the start on their shader support, and how CUDA always worked.


There is no relation.


Call me dumb, but I'm always confused by SPIR-V. Does it have any relationship to Vulkan or OpenGL?


Graphics programming is fundamentally a form of client/server programming.

Your (usually) C++ is the client, shaders are the server, and OpenGL/Vulkan/Direct3D are basically your client libraries for talking to the server.

Your OpenGL driver has a compiler in it that JIT-compiles GLSL into shader programs your GPU can run. In this world, SPIR-V is sort of like JVM/WASM bytecode: a compilation target for shader languages that's a bit nicer for drivers to work with.


> Graphics programming is fundamentally a form of client/server programming.

Thanks, in that way it is a lot easier to understand. So do you mean GLSL would be compiled into SPIR-V?



Yes. It's the underlying "language" of shader programs in Vulkan. OpenGL 4.6 adds support for loading SPIR-V and a few other things like that.


This is amazing, and I hope to see things like this extended. There are so many gotchas with GLSL that it's hard to get going, and something with a really strict compiler would make that much less of a burden on an engineer. It would be really cool to see whole-program and profile-guided optimization across GPU inputs, shader stages, and GPU outputs.


This is a great candidate for the Vulkano rust library: https://github.com/vulkano-rs/vulkano

It's currently using shaderc to compile GLSL -> SPIR-V, but it's clunky and takes forever to compile.
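
For anyone who hasn't seen that flow, it's roughly this (using the shaderc crate; the exact API may differ slightly between versions):

  // Roughly the current GLSL -> SPIR-V flow with the shaderc crate.
  fn compile_vertex_shader(source: &str) -> Vec<u32> {
      let mut compiler = shaderc::Compiler::new().unwrap();
      let artifact = compiler
          .compile_into_spirv(source, shaderc::ShaderKind::Vertex, "shader.vert", "main", None)
          .unwrap();
      artifact.as_binary().to_vec() // the SPIR-V words that get handed to Vulkan
  }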

Would be great to have a more Rust-centric way to send SPIR-V over to the GPU.


Is the video of the talk posted anywhere?



