
The DietGPU library currently contains two primary APIs: a C++ API that operates on raw device pointers, and a Python / PyTorch API that operates on PyTorch tensors.

Common API information

Batching

The library compresses an arbitrary batch of arrays. Each array must have a contiguous memory layout; multi-dimensional PyTorch tensors are simply interpreted as 1-d data. Each tensor in the batch can have a different size.

For the PyTorch API and the float compressor, all tensors in the batch must have the same dtype.
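
As a concrete illustration, the following PyTorch sketch builds a batch that satisfies these constraints (contiguous tensors of differing sizes, all sharing one dtype); it is an example of valid input, not a prescribed calling convention:

```python
# A minimal sketch of building a compression batch: contiguous tensors of
# different sizes are fine, but (for the float compressor) they must share a dtype.
import torch

dev = torch.device("cuda:0")

# Multi-dimensional tensors are treated as flat 1-d data, so shapes may differ freely.
batch = [
    torch.randn(1024, dtype=torch.float16, device=dev),
    torch.randn(32, 128, dtype=torch.float16, device=dev),  # viewed as 4096 elements
    torch.randn(10, dtype=torch.float16, device=dev),
]

# Each tensor must be contiguous in memory.
assert all(t.is_contiguous() for t in batch)
# For the PyTorch float compressor, dtypes must match across the batch.
assert len({t.dtype for t in batch}) == 1
```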

Output size

Compressed data is written either into a matrix on the device or into a collection of output arrays. Each matrix row or output array must be at least max_float_compressed_output_size bytes (float compressor) or max_any_compressed_output_size bytes (byte-wise ANS compressor) in size; otherwise a memory access violation may occur.
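
For illustration, the sketch below allocates an over-provisioned output matrix for a batch. MAX_COMP_BYTES is a placeholder standing in for the value reported by max_float_compressed_output_size (or max_any_compressed_output_size) for this batch, since the exact Python binding is not shown here:

```python
# A minimal sketch of over-provisioning the compressed output.
import torch

dev = torch.device("cuda:0")
batch = [torch.randn(n, dtype=torch.float16, device=dev) for n in (1024, 4096, 10)]

# Placeholder for the per-tensor bound reported by max_float_compressed_output_size
# (float compressor) or max_any_compressed_output_size (byte-wise ANS compressor).
MAX_COMP_BYTES = 1 << 20

# One matrix row per input tensor; each row must be at least MAX_COMP_BYTES bytes,
# otherwise compression may write out of bounds.
out = torch.empty(len(batch), MAX_COMP_BYTES, dtype=torch.uint8, device=dev)
```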

Only the device knows the actual compressed sizes (the act of compressing determines them); if the host wishes to truncate the output or copy the data elsewhere, it must copy this size information back from the device. To keep the operation fully asynchronous (and to allow the d2h sync to be deferred until the sizes are actually needed), the host destination of this memcpy should ideally be pinned memory.
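
A hedged sketch of this pattern, assuming the compressor has produced a per-tensor compressed-size tensor on the device (how the sizes are actually returned depends on the API in use):

```python
# A minimal sketch of deferring the device-to-host size readback.
import torch

sizes_dev = torch.zeros(3, dtype=torch.int32, device="cuda")  # filled by the compressor

# Pinned host memory lets the copy run truly asynchronously on the stream.
sizes_host = torch.empty_like(sizes_dev, device="cpu").pin_memory()
sizes_host.copy_(sizes_dev, non_blocking=True)

# ... enqueue more GPU work here ...

# Synchronize only at the point where the host actually needs the sizes.
torch.cuda.current_stream().synchronize()
actual_bytes = sizes_host.tolist()
```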

Warning: realizing compressed memory savings

As stated above, compression (in both its C++ and Python forms) writes its output into an over-provisioned region of memory. Stopping here means that the resulting output is in fact larger than the input. If compression is being performed in order to store the data locally, a new exactly-sized memory region should be allocated, the compressed data copied into it, and the old over-provisioned output deallocated.
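
A hedged sketch of this step, reusing the over-provisioned output matrix and per-tensor sizes from the previous examples:

```python
# A minimal sketch of reclaiming the over-provisioned space once the actual
# compressed sizes are known; `out` and `actual_bytes` stand in for the values
# produced in the earlier steps.
import torch

out = torch.empty(3, 1 << 20, dtype=torch.uint8, device="cuda")  # over-provisioned
actual_bytes = [1000, 2000, 500]                                  # from the size readback

# Copy each row's valid prefix into an exactly-sized allocation.
compact = [out[i, :n].clone() for i, n in enumerate(actual_bytes)]

# Drop the over-provisioned matrix so its memory can be reused.
del out
```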

For networking purposes (e.g., sending the compressed data to another GPU or to a CPU), such a device memory allocation and memcpy are unnecessary, which is why the API does not perform them by default.

Temporary memory

DietGPU requires some amount of temporary global memory scratch space on the device for intermediate computations. In both the Python and C++ APIs, this is managed with an optional pre-allocated region of memory provided to the compression functions. If the temporary memory provided is not sufficiently large, a warning with a better bound on the desired size is written to stderr, and the overflow is handled by calling cudaMalloc and cudaFree within the scope of the compression call, which is a device/host synchronization point. Properly sizing the temporary memory is therefore important for keeping compression and decompression completely asynchronous (no spurious d2h/h2d interactions).
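
A hedged sketch of pre-allocating a reusable scratch buffer; the buffer size and the way it is passed to the compression call are placeholders and should be taken from the bound the library reports and from the actual function signature:

```python
# A minimal sketch of caller-provided temporary scratch space.
import torch

# A generously sized scratch buffer (placeholder size), reused across many
# compression calls so that no per-call cudaMalloc/cudaFree (and hence no
# implicit device/host sync) is needed.
temp_mem = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Hypothetical call shape only; the real op name and argument order may differ:
# compressed, sizes_dev = compress_batch(batch, out, temp_mem)
```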

This aspect of the current API is still under development. Newer versions of CUDA provide a stream-oriented memory allocator, and PyTorch provides a caching allocator that could be used as well, but neither of these is currently integrated. A future version of the API may be able to avoid this temporary memory entirely (or at least drastically reduce the requirement).