I've been working on a project with similar goal to RLSL, called Emu. It's a procedural macro that automatically off-loads portions of code to run on a GPU. Here's a quick example...
#[gpu_use]
fn main() {
    let mut a = vec![0.2; 1000];
    let mut b = vec![0.3; 1000];
    let mut c = vec![0.0; 1000]; // must be `mut`: the loop below writes to it
    gpu_do!(load(a));
    gpu_do!(load(b));
    gpu_do!(load(c));
    gpu_do!(launch());
    for i in 0..1000 {
        c[i] = a[i] + b[i];
    }
    gpu_do!(read(c));
    println!("{:?}", c);
}
Emu is currently open source as a library written in 100% stable Rust, and while it only supports a tiny subset of Rust, that subset is well-defined, with useful compile-time errors and a comprehensive test suite.
So if you're looking for a way to contribute to single-source GPGPU for Rust, please consider helping expand Emu's supported subset of Rust. The repository is at https://www.github.com/calebwin/emu
I will say that since Emu works at the AST/syntax level, RLSL is of great interest to me because it works instead at the MIR level which allows it to more easily support a large subset of Rust.
- Maybe there is a component of RLSL that could be useful. I have to think more about what I want that component to be.
- I want Emu to support general Rust code but still use stable Rust and provide really nice compile-time errors. Maybe Emu could do AST-level checking to (1) ensure that only legal transpilable-to-SPIR-V subset is used, (2) infer the kernel parameters, (3) infer global work size, local work size and then do MIR-level compilation to OpenCL or SPIR-V?
- At the moment, I want to focus on AST-level compilation because I think many applications (AI, ML, simulations, etc.) can still technically be implemented without a huge subset of Rust.
Woah excellent to see Embark partnering with the RLSL project!
As someone who does a lot of creative-coding contracts with lots of video, graphics and real-time requirements, RLSL has long been one of the Rust projects that excites me the most. The idea of writing graphics and compute shaders in Rust, a modern language with a decent type system, standard package manager, module system, etc., is very exciting. It makes a lot of sense that Embark sees the potential in this for game dev too.
The ability to share Rust code between the CPU and GPU alone will be so useful. The number of times I've had to port some pre-existing Rust function to GLSL or vice versa is silly.
Obviously the Rust that compiles to SPIR-V will be a subset of the Rust that compiles to LLVM or WASM, but this still opens up so many doors for improved ergonomics when interacting with the GPU from a Rust program.
I've long dreamed of an API that allows me to casually send a closure of almost arbitrary Rust code (with the necessary 'lifetime + Send + Sync + SpirvCompatible compile-time checks) off to the GPU for execution and get back a GPU Future of the result. It looks like this may well become possible in the future :)
This is a great write-up. I think Rust could be a very nice language for shader-like applications, which are already very functional and don't involve a lot of shared, mutable state across threads.
In HPC, we're very much interested in GPU compute programming rather than shader programming. In CUDA code, your kernels typically transform input buffers directly into output buffers. This should immediately raise red flags for a Rust developer: you've got shared, mutable state across threads!
Consider this simple CUDA-ish Rust code with threads independently executing over 0..in.len() (ignore the bounds bugs at i = 0 and i = in.len()):
(The `i` is a conceit around computing indexing from thread/block IDs, but the input and output arrays are pretty similar to the style CUDA promotes).
It's obvious to me, the programmer, that I don't have any aliasing issue: each thread is only mutating a single index in the output array. However, Rust is not smart enough to see this. If the compiler allowed the kernel definition as-is, you could easily write multi-threaded code with shared mutable access to individual memory locations, violating Rust's memory model. OK, so you force the kernel to look more like this:
// `input` is the 3-element slice covering [i-1, i+1]
// (`in` is a Rust keyword, so it can't be used as a parameter name)
fn stencil(input: &[f32], out: &mut f32) {
    *out = (input[0] + input[1] + input[2]) / 3.0;
}
And you'd enforce Rust's invariants at the kernel launch site, computing the valid slices at some higher level in the library in some "unsafe" code. But this only solves the simple case where one array maps to another and the index relationship is obvious, so it's easily provable that there are no aliasing issues. Start layering in things like unique indirect indexing, or perhaps non-unique indexing with atomic reductions, and it becomes difficult to phrase your correct program in safe(!) Rust in a way that is compatible with the borrow checker, at least without building a bunch of abstractions to express each of your parallel patterns. Having to build bespoke abstractions may not be scalable to the types of developers building big scientific codes.
Anyway, I'm curious if the folks at "Embark" have spent any time thinking about the issue of shared, mutable state in GPU programming with Rust. It seems like a deal breaker from where I stand.
There's no memory safety problem if the data you're racing on is pointer-free. Rust allows racing on memory with Relaxed atomics [1]. (Yes, I know about potential UB with floating point and whatnot; this is solvable.) Happily, GPU programming tends to use a lot of indices as opposed to direct pointers, because it makes CPU/GPU memory management easier--indices are valid no matter where in the address space the data in question is mapped.
Of course, you might then ask "what's the point of using Rust at all?" The answer is that a lot of code can be expressed in regular patterns idiomatic to the Rust type system and borrow checker. This isn't particularly different from any other type of programming. You sometimes see people say "Rust is useless because you can't safely write an arbitrary graph with direct pointers". The obvious fallacy with this argument is that most code doesn't need an arbitrary graph with direct pointers; for the small amount of code that does, you can drop into unsafe in a targeted way and still have a much safer program than one written in e.g. C. GPU programming is no different.
Huh - to make sure I understand correctly, is it your view (ignoring portability concerns) that it should be legal to store and load directly from a non-mut &[f32] when the platform has relaxed-atomic memory consistency by default? Or just that you should use a hypothetical "AtomicF32" with relaxed loads and stores and not worry about it from a performance perspective?
Your point is especially important as most of the GPGPU performance gains come from really clever use of shared memory. That is the very model of shared state with concurrent access.
Usually, your shader or kernel is required to synchronize threads that need to see each other's changes. I'm sure that there are at least some kernels out there that skip synchronization and still work because there are enough other instructions between write and read accesses that cover the issue up.
I've done enough Hacker News discussions to note that the lay-reader won't understand what this means, and this probably needs more elaboration.
"Shared memory" is a special memory area inside GPUs where a thread block (NVIDIA) or workgroup (AMD), a group of 32 to 1024 SIMD threads, can perform inter-thread communication in an outrageously fast way.
Shared memory is extremely small: roughly 64kB in size. Optimizing shared memory access involves resolving bank-conflicts, and lots of very-low level thought. At a minimum, you need to consider how you fill shared memory (the memcpy in) as well as how to get the final data out (the memcpy out).
---------
In many cases, synchronization to and from shared memory only requires a __syncthreads() barrier. Maybe only a __syncwarp() in some warp-local cases.
See "39.2.3 Avoiding Bank Conflicts" (GPU Gems 3, Chapter 39) for how shared memory is optimized in practice. You're not only thinking about the shared state, but also about the average number of threads hitting any particular bank. On a, say, 32-bank system, you'll want all 32 banks utilized as much as possible. If all 1024 threads access bank #0, you'll run at 1/32nd the speed (and banks #1 through #31 are all wasted).
A "bank" is basically an implicit "RAID 0"-style striping of shared memory. The precise number of banks is either 16 or 32, depending on architecture. Regardless, if all GPU threads access memory location #0, they all hit bank #0, which is slow. Instead, you want GPU threads to spread out over the banks: on a 32-bank part with 4-byte words, consecutive words live in consecutive banks, so GPU thread #0 accessing word #0 hits bank 0, thread #1 accessing word #1 hits bank 1, and so on.
Yeah, it's a little bit difficult to imagine Rust's concurrency primitives being used in the context of GPGPU programming. Like, are we talking about using channels to pass state between threads instead of tools like barriers for synchronization? That seems like it would negate the advantages GPUs are designed to enable.
The standard library is a bit awkward at places with its pointer usage. This is why I have a custom one.
From a language point of view, it fits really well. I am also constrained with syntax as I want rlsl to be a strict subset of rust.
Closures capture by reference, which means you always have to capture explicitly with `move` (no pointers in structs in SPIR-V). Although I am fairly confident that this can be fixed.
Also, MIR optimization passes are pretty weak right now, and rlsl would require an optimizer in the future, like SPIR-V <-> LLVM. I am not sure how good spirv-opt is right now.
Exciting talk! And I'm happy to know that one of the leading shader devs from Rust community has successfully "embarked" :)
Technology-wise, the compiler goes from MIR to SPIR-V. This is specific to Rust and different from the other direction Khronos has been exploring: https://github.com/KhronosGroup/LLVM-SPIRV-Backend . It's a bit sad that we can't all have nice things.
I didn't choose cranelift because there was no real benefit to it. If the focus had been on optimizations, I might have used it. The IR seems friendlier compared to MIR, although I think there is a rewrite coming to make MIR more suitable for optimizations as well :).
And I do quite a lot of transformations at the MIR level, so I am looking forward to a more optimization friendly IR.
No relation. SPIR-V is the fifth iteration of the Standard Portable Intermediate Representation, originally based on the LLVM IR.
It's an intermediate representation between the high level shader programming language and the GPU's native machine code. It's expected that the GPU will compile the SPIR-V to its own internal instruction set, rather than executing it natively.
RISC-V, on the other hand, is a native instruction set for CPUs.
SPIR-V really is an intermediate representation of a program. There's no way that this can be executed without any further translation. But it stops driver developers from having to write and ship complex compiler front ends that deviate from language specs in subtle ways that the shader writers need to test for and work around.
Yep, now shader writers only need to test for and work around backend bugs in drivers.
Also it makes gamedevs slightly happier that they don't have to ship the plain text source code of their shaders with games.
It doesn't provide any real security against RE, as SPIR-V is pretty close to tokenized/SSA'd GLSL and decompilers are trivial to implement. But if it makes them happier, so be it.
I don't think that the V in SPIR-V has anything to do with the number 5. I think it's just that SPIR was renamed to SPIR-V with the major version that was introduced alongside the Vulkan graphics API and made SPIR usable for more than just OpenCL.
IIRC somebody said during one of 2015 SIGGRAPH presentations when it was introduced that it's the fifth iteration if you count SPIR pre-1.0, 1.0, 1.1, and 1.2 preceding it. They may have been joking.
Of course, the V in Vulkan also stems from 5. Vulkan was originally going to be OpenGL 5. That plan changed, but they chose a name starting with V as a nod to that history.
I'm sure that the name is also a nod to Mantle, the direct predecessor. I haven't seen the Mantle documentation, but rumor has it that Vulkan 1.0 was initially little more than Mantle with renamed identifiers.
Graphics programming is fundamentally a form of client/server programming.
Your (usually) c++ is the client, shaders are the server, OpenGL/Vulkan/Direct3D are basically your client libraries to talk to the server.
Your OpenGL driver has a compiler in it that JIT-compiles GLSL into shader programs your GPU can run. In this world, SPIR-V is sort of like JVM/WASM bytecode: a compilation target for shader languages that's a bit nicer for drivers to work with.
This is amazing and I hope to see things like this extended. There are so many gotchas with glsl that it's hard to get going and something with a really strict compiler would make that a lot less of a burden on an engineer. It would be really cool to see whole program & profiling guided optimization between GPU inputs, shader stages, and GPU outputs.