A minimal, extensible OpenCL, Vulkan (with WGSL), CUDA, NNAPI (Android) and host CPU array manipulation engine / framework written in Rust.
This crate provides tools for executing custom array and automatic differentiation operations.
The latest published version is 0.7.x (April 14th, 2023). A lot has changed since then. Version 0.7.x can be found in the custos-0.7 branch.
Add "custos" as a dependency:
[dependencies]
custos = "0.7.0"
# to disable the default features (cpu, cuda, opencl, static-api, blas, macro) and use your own set of features:
#custos = {version = "0.7.0", default-features=false, features=["opencl", "blas"]}
To make specific devices usable, activate the corresponding features:
Feature | Device | Notes |
---|---|---|
cpu | CPU | Uses heap allocations. |
stack | Stack | Usable in no-std environments, as it uses stack-allocated Buffers without requiring alloc or std. Practically only supports the Base module. |
opencl | OpenCL | Automatically maps unified memory. |
cuda | CUDA | |
vulkan | Vulkan | Shaders are written in WGSL. Supports unified memory. |
nnapi | NnapiDevice | Lazy module is mandatory. |
untyped | Untyped | Removes the need for Buffer's generic parameters. (CPU and CUDA only for now) |
custos ships combinable modules. Selecting different modules changes how operations are executed. New modules can be added in user code.
use custos::prelude::*;
// Autograd, Base = Modules
let device = CPU::<Autograd<Base>>::new();
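The nesting in `CPU::<Autograd<Base>>` can be pictured as a stack of generic wrappers, each adding behaviour around the module it wraps. The following is a minimal std-only sketch of that idea. The `Module` trait, the `layers` and `describe` methods and the `Cpu` struct are hypothetical stand-ins written for this illustration; they are not the custos types or traits.

```rust
// Hypothetical stand-ins for illustration only, NOT the custos API.

/// A module reports the layers an operation would pass through, outermost first.
trait Module {
    fn layers(&self) -> Vec<&'static str>;
}

/// Innermost module: default behaviour, wraps nothing.
#[derive(Default)]
struct Base;

/// Wrapper modules hold the module they extend.
#[derive(Default)]
struct Cached<M> {
    inner: M,
}

#[derive(Default)]
struct Autograd<M> {
    inner: M,
}

impl Module for Base {
    fn layers(&self) -> Vec<&'static str> {
        vec!["Base"]
    }
}

impl<M: Module> Module for Cached<M> {
    fn layers(&self) -> Vec<&'static str> {
        let mut layers = vec!["Cached"];
        layers.extend(self.inner.layers());
        layers
    }
}

impl<M: Module> Module for Autograd<M> {
    fn layers(&self) -> Vec<&'static str> {
        let mut layers = vec!["Autograd"];
        layers.extend(self.inner.layers());
        layers
    }
}

/// A device is generic over its module stack.
struct Cpu<M> {
    modules: M,
}

impl<M: Module + Default> Cpu<M> {
    fn new() -> Self {
        Cpu { modules: M::default() }
    }

    /// Joins the layer names in the order they wrap each other.
    fn describe(&self) -> String {
        self.modules.layers().join(" -> ")
    }
}

fn main() {
    let device = Cpu::<Autograd<Cached<Base>>>::new();
    println!("{}", device.describe()); // prints "Autograd -> Cached -> Base"
}
```

Because the stack is part of the device type, the chosen combination is fixed at compile time, and a user-defined wrapper only needs to implement the same trait to take part.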
To make specific modules usable for building a device, activate the corresponding features:
Feature | Module | Description |
---|---|---|
on by default | Base | Default behaviour. |
autograd | Autograd | Enables running automatic differentiation. |
cached | Cached | Reuses allocations on demand. |
fork | Fork | Decides whether the CPU or GPU is faster for an operation, then uses the faster device for subsequent computations. (unified memory devices) |
lazy | Lazy | Lazy execution of operations and lazy intermediate allocations. Enables support for CUDA graphs. |
graph | Graph | Adds a memory-usage-optimizable graph and fusing of unary operations in combination with Lazy. |
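As a rough picture of what "reuses allocations on demand" can mean, here is a std-only sketch of an allocation cache. It makes no claim about how the Cached module is implemented: `AllocCache`, its `retrieve` method, the `(id, len)` key and `mul_cached` are all assumptions made up for this example.

```rust
use std::collections::HashMap;

/// Hypothetical allocation cache (not the custos `Cached` module).
/// Buffers are keyed by a caller-chosen id and the requested length.
struct AllocCache {
    buffers: HashMap<(u64, usize), Vec<f32>>,
    /// Number of fresh allocations performed so far.
    allocations: usize,
}

impl AllocCache {
    fn new() -> Self {
        AllocCache { buffers: HashMap::new(), allocations: 0 }
    }

    /// Returns the buffer stored for `(id, len)`, allocating it on first request.
    fn retrieve(&mut self, id: u64, len: usize) -> &mut Vec<f32> {
        let allocations = &mut self.allocations;
        self.buffers.entry((id, len)).or_insert_with(|| {
            *allocations += 1;
            vec![0.0; len]
        })
    }
}

/// Element-wise multiplication that writes into a cached output buffer.
fn mul_cached<'a>(cache: &'a mut AllocCache, id: u64, lhs: &[f32], rhs: &[f32]) -> &'a [f32] {
    let out = cache.retrieve(id, lhs.len());
    for ((l, r), o) in lhs.iter().zip(rhs).zip(out.iter_mut()) {
        *o = l * r;
    }
    out.as_slice()
}

fn main() {
    let mut cache = AllocCache::new();
    let lhs = [1.0, 2.0, 3.0];
    let rhs = [4.0, 5.0, 6.0];

    // The same call site (id 0) runs many times but allocates only once.
    for _ in 0..100 {
        let out = mul_cached(&mut cache, 0, &lhs, &rhs);
        assert_eq!(out, &[4.0, 10.0, 18.0][..]);
    }
    println!("allocations: {}", cache.allocations); // prints "allocations: 1"
}
```

The trade-off in this sketch is that a repeated call with the same key overwrites the previous result, so a caller that needs both results has to copy the first one out or use a different id.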
Usage of these modules when writing custom operations: modules.md and modules_usage.rs.
If an operation should be affected by a module, that operation must call the corresponding custos code.
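The opt-in nature of modules can be illustrated with a small std-only sketch: an operation that routes its work through a module hook is affected by the module, while one that does the work directly is not. The `Module` trait, its `submit` hook and the `Base` and `Lazy` structs below are hypothetical and only mimic the idea; the actual hooks custos expects are described in modules.md and modules_usage.rs.

```rust
// Hypothetical stand-ins for illustration only, NOT the custos API.

/// A unit of work that an operation hands over to a module.
type Op = Box<dyn FnMut(&mut [f32])>;

trait Module {
    /// Hook that opted-in operations call instead of doing the work themselves.
    fn submit(&mut self, data: &mut [f32], op: Op);
}

/// Default behaviour: run the work immediately.
struct Base;

impl Module for Base {
    fn submit(&mut self, data: &mut [f32], mut op: Op) {
        op(&mut *data);
    }
}

/// Lazy behaviour: record the work and run it later.
struct Lazy {
    queue: Vec<Op>,
}

impl Module for Lazy {
    fn submit(&mut self, _data: &mut [f32], op: Op) {
        self.queue.push(op);
    }
}

impl Lazy {
    /// Executes all recorded work in submission order.
    fn run(&mut self, data: &mut [f32]) {
        for mut op in self.queue.drain(..) {
            op(&mut *data);
        }
    }
}

/// Opted-in operation: the module decides when the scaling happens.
fn scale<M: Module>(module: &mut M, data: &mut [f32], factor: f32) {
    module.submit(
        data,
        Box::new(move |d: &mut [f32]| {
            for x in d.iter_mut() {
                *x *= factor;
            }
        }),
    );
}

/// Operation that never calls the hook: always eager, whatever the module.
fn scale_direct(data: &mut [f32], factor: f32) {
    for x in data.iter_mut() {
        *x *= factor;
    }
}

fn main() {
    let mut eager = [1.0, 2.0];
    scale(&mut Base, &mut eager, 2.0);
    assert_eq!(eager, [2.0, 4.0]); // Base ran the work right away

    let mut lazy = Lazy { queue: Vec::new() };
    let mut deferred = [1.0, 2.0];
    scale(&mut lazy, &mut deferred, 2.0);
    assert_eq!(deferred, [1.0, 2.0]); // nothing has run yet
    lazy.run(&mut deferred);
    assert_eq!(deferred, [2.0, 4.0]);

    let mut direct = [1.0, 2.0];
    scale_direct(&mut direct, 2.0);
    assert_eq!(direct, [2.0, 4.0]); // unaffected by any module
}
```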
Remaining features:
Feature | Description |
---|---|
static-api | Enables the creation of Buffers without providing a device. |
std | Adds standard library support. |
no-std | For no-std environments; activates the stack feature. |
macro | Re-export of custos-macro. |
blas | Adds gemm functions of the system's (selected) BLAS library. |
half | Adds support for half precision floats. |
serde | Adds serialization and deserialization support. |
json | Adds convenience functions for serialization and deserialization to and from JSON. |
Implement an operation for CPU:
- If you want to implement your own operations for all compute devices, consider looking at implement_operations.rs or modules_usage.rs. To see it at a larger scale, look at custos-math (outdated, requires custos 0.7) or sliced (for automatic differentiation examples).
This operation is only affected by the Cached module (and partially Autograd).
use custos::prelude::*;
use std::ops::{Deref, Mul};
pub trait MulBuf<T: Unit, S: Shape = (), D: Device = Self>: Sized + Device {
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, Self, S>;
}

impl<Mods, T, S, D> MulBuf<T, S, D> for CPU<Mods>
where
    Mods: Retrieve<Self, T, S>,
    T: Unit + Mul<Output = T> + Copy + 'static,
    S: Shape,
    D: Device,
    D::Base<T, S>: Deref<Target = [T]>,
{
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, Self, S> {
        let mut out = self.retrieve(lhs.len(), (lhs, rhs)).unwrap(); // unwrap or return error (update trait)

        for ((lhs, rhs), out) in lhs.iter().zip(rhs.iter()).zip(&mut out) {
            *out = *lhs * *rhs;
        }

        out
    }
}
Many more usage examples can be found in the tests and examples folders (or in the unary operation file, custos-math and sliced).