dietgpu: add optional checksum and version number information #16
Summary:
This diff adds two features that are useful when DietGPU is used to store numeric data in persistent storage (e.g., for checkpointing large language model training).
1. An optional checksum of the data to be compressed can be calculated and stored in the archive, with a matching option to check the stored checksum against the decompressed data after decompression. This is an additional safeguard for the archival integrity of DietGPU.
All compress/decompress functions gain an additional optional parameter that enables the checksum feature; it is the first optional parameter of each function.
There are two checksum flags, one for byte-wise ANS and one for the float compression. With float compression, the ANS flag cannot be used.
Computing and comparing the checksum adds another full pass over the data, so it is not recommended for online, in-memory compression/decompression (e.g., shipping data over a network).
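To illustrate why the checksum costs one extra pass, here is a minimal host-side sketch. It is not the DietGPU implementation: this description does not say which checksum algorithm is used, so 32-bit FNV-1a stands in for it, and the names `checksum32` and `checksumMatches` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in checksum (32-bit FNV-1a). ASSUMPTION: the real DietGPU checksum
// may be a different algorithm; this only shows the "one full pass" shape.
std::uint32_t checksum32(const std::uint8_t* data, std::size_t size) {
  std::uint32_t hash = 2166136261u;  // FNV-1a 32-bit offset basis
  for (std::size_t i = 0; i < size; ++i) {
    hash ^= data[i];
    hash *= 16777619u;  // FNV-1a 32-bit prime
  }
  return hash;
}

// Hypothetical post-decompression check: recompute the checksum over the
// decompressed bytes and compare it with the value stored in the archive.
bool checksumMatches(const std::uint8_t* decompressed,
                     std::size_t size,
                     std::uint32_t storedChecksum) {
  return checksum32(decompressed, size) == storedChecksum;
}
```

At compression time the value returned by `checksum32` would be written into the archive; at decompression time `checksumMatches` reads every decompressed byte once more, which is the additional pass mentioned above.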
2. A version number has been added to both the ANS and float data headers inside the archive. This will be used to safeguard against data format changes in subsequent iterations of DietGPU, in order to detect when a newer (or older) version of the library is being used to unpack an older (or newer, respectively) versioned archive.
Either failure (checksum mismatch or version mismatch) will cause a fatal assert within the CUDA kernels that check for this.
Also added a missing check that the float type expected upon decompression matches the float type stored in the archive.
In order to add this feature, the data layout of the DietGPU archive has changed to include space for the checksum in the headers.
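As a rough illustration of the three checks described above (version, float type, checksum), here is a host-side sketch. The struct layout, field widths, constant `kFormatVersion`, and all names are assumptions made for this example and do not reflect the actual DietGPU header format. The sketch returns a status code rather than asserting, so the result does not depend on whether `NDEBUG` is defined.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical format version baked into the library build.
constexpr std::uint32_t kFormatVersion = 1;

// Hypothetical header; the real DietGPU headers are laid out differently.
struct ArchiveHeader {
  std::uint32_t version;    // format version written at compression time
  std::uint32_t floatType;  // float type tag written at compression time
  std::uint32_t checksum;   // checksum of the original data (if enabled)
};

enum class HeaderStatus {
  Ok,
  VersionMismatch,
  FloatTypeMismatch,
  ChecksumMismatch,
};

// Validate the header against what the caller expects. The checksum is only
// compared when the caller enabled the optional checksum feature.
HeaderStatus validateHeader(const ArchiveHeader& header,
                            std::uint32_t expectedFloatType,
                            bool useChecksum,
                            std::uint32_t computedChecksum) {
  if (header.version != kFormatVersion) {
    return HeaderStatus::VersionMismatch;
  }
  if (header.floatType != expectedFloatType) {
    return HeaderStatus::FloatTypeMismatch;
  }
  if (useChecksum && header.checksum != computedChecksum) {
    return HeaderStatus::ChecksumMismatch;
  }
  return HeaderStatus::Ok;
}
```

A caller could map any status other than `HeaderStatus::Ok` to a host-side error.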
Issues / to do:
The assert has no effect when NDEBUG is defined. The assert needs to be replaced with a d2h copy of the error status, with an error raised on the host side instead.

Differential Revision: D41829599