dietgpu: add optional checksum and version number information #16

wickedfoo · 2022-12-09T03:15:23Z

Summary:
This diff adds two features that are useful when DietGPU is used to store numeric data in persistent storage (e.g., for checkpointing large language model training).

**1** An optional checksum of the data to be compressed can now be calculated and stored in the archive, along with an option to check the stored checksum against the decompressed data after decompression. This is an additional safeguard for the archival integrity of DietGPU.

All compress/decompress functions gain an optional parameter that enables the checksum feature; it is the first optional parameter of each function.

There are two checksum flags, one for byte-wise ANS and one for the float compression. With float compression, the ANS flag cannot be used.

Computing and comparing the checksum adds another full pass over the data, so it is not recommended for online, in-memory compression/decompression (e.g., shipping data over a network).
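As a rough host-side illustration of what an 8-bit xor checksum involves (this is not DietGPU's kernel code; `xorChecksum8` and `checksumMatches` are hypothetical names invented for this sketch), the checksum is one extra linear pass over the data at compression time and another at verification time:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of DietGPU): xor-reduce every byte of the
// input into a single 8-bit value. This is one full pass over the data.
inline uint8_t xorChecksum8(const uint8_t* data, size_t size) {
  assert(data != nullptr || size == 0);
  uint8_t checksum = 0;
  for (size_t i = 0; i < size; ++i) {
    checksum ^= data[i];
  }
  return checksum;
}

// Hypothetical helper: recompute the checksum over the decompressed bytes
// and compare it with the value that was stored at compression time.
inline bool checksumMatches(
    uint8_t stored,
    const std::vector<uint8_t>& decompressed) {
  return stored == xorChecksum8(decompressed.data(), decompressed.size());
}
```

A single flipped bit changes the xor of the buffer, so `checksumMatches` returns false for it; two corruptions that cancel each other out would go unnoticed, which is a known weakness of a plain xor reduction.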

**2** A version number has been added to both the ANS and float data headers inside the archive. It guards against data format changes in subsequent iterations of DietGPU by detecting when a newer (or older) version of the library is being used to unpack an older (or newer, respectively) versioned archive.

Either failure (checksum mismatch or version mismatch) will cause a fatal assert within the CUDA kernels that check for this.

Also added a previously missing check that the float type expected upon decompression matches the float type stored in the archive.

In order to add this feature, the data layout of the DietGPU archive has changed to include space for the checksum in the headers.
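The checks described above (version, float type, checksum) can be pictured with a small host-side sketch. Everything here is an assumption made for illustration: the `SketchHeader` layout, the field widths, the version value, and the enum names are hypothetical and do not reflect the actual DietGPU header format. The sketch also returns a status code instead of asserting, purely to keep it testable on the host.

```cpp
#include <cstdint>

// Hypothetical version value; the real library defines its own.
constexpr uint8_t kSketchVersion = 1;

// Hypothetical float type tags for illustration only.
enum class SketchFloatType : uint8_t { kFloat16 = 0, kBFloat16 = 1, kFloat32 = 2 };

// Hypothetical header layout: NOT the actual DietGPU archive format.
struct SketchHeader {
  uint8_t version;      // format version written by the compressor
  uint8_t floatType;    // float type of the stored data
  uint8_t hasChecksum;  // nonzero if a checksum was stored
  uint8_t checksum;     // 8-bit checksum of the uncompressed data
};

enum class HeaderStatus {
  kOk,
  kVersionMismatch,
  kFloatTypeMismatch,
  kChecksumMismatch
};

// Validate a header against what the caller expects. The checksum is only
// compared when verification was requested and a checksum was stored.
inline HeaderStatus validateHeader(
    const SketchHeader& header,
    SketchFloatType expectedType,
    uint8_t computedChecksum,
    bool verifyChecksum) {
  if (header.version != kSketchVersion) {
    return HeaderStatus::kVersionMismatch;
  }
  if (header.floatType != static_cast<uint8_t>(expectedType)) {
    return HeaderStatus::kFloatTypeMismatch;
  }
  if (verifyChecksum && header.hasChecksum != 0 &&
      header.checksum != computedChecksum) {
    return HeaderStatus::kChecksumMismatch;
  }
  return HeaderStatus::kOk;
}
```

Checking the version first means an archive written in an unknown format is rejected before any of its other fields are interpreted.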

Issues / to do:

- The checksum is currently only 8 bits wide, due to the (lack of a) data alignment expectation, and is just a simple xor reduction. The ANS compressor can accept any input regardless of alignment (only byte alignment is required), while the float compressor expects float word-sized alignment (e.g., float16 requires 2 byte alignment). Ideally the checksum kernel should compute a full uint32 checksum (if xor), or just use something like crc32. This is a todo.
- Asserts will be compiled out if `NDEBUG` is defined. The assert needs to be replaced with a d2h copy of the error status, with the error raised on the host side instead.
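One way the first to-do item could be approached, sketched on the host side under stated assumptions: a 32-bit xor checksum can be made independent of pointer alignment by reading the data byte by byte and folding each byte into the lane given by its offset from the start of the buffer. `xorChecksum32` is a hypothetical name, and this is not how DietGPU implements (or necessarily will implement) its checksum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of DietGPU): 32-bit xor checksum that only
// requires byte alignment. Byte i is xor-ed into lane (i % 4), which equals
// xor-ing little-endian 32-bit words of the buffer with a zero-padded tail,
// without ever performing a 4-byte load from a possibly unaligned address.
inline uint32_t xorChecksum32(const uint8_t* data, size_t size) {
  assert(data != nullptr || size == 0);
  uint32_t checksum = 0;
  for (size_t i = 0; i < size; ++i) {
    checksum ^= static_cast<uint32_t>(data[i]) << (8 * (i % 4));
  }
  return checksum;
}
```

Because the lane depends on the byte's offset within the buffer rather than on its address, the same input yields the same checksum wherever it sits in memory. Unlike the 8-bit variant, this version also notices two adjacent bytes being swapped, though it remains a plain xor and is weaker than something like crc32.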

Differential Revision: D41829599

fbshipit-source-id: 2792278e899680107df609b60149528708c94987

facebook-github-bot · 2022-12-09T03:16:05Z

This pull request was exported from Phabricator. Differential Revision: D41829599

facebook-github-bot · 2022-12-13T20:07:02Z

This pull request has been merged in 0e2ccba.

dietgpu: add optional checksum and version number information (facebookresearch#16)

facebook-github-bot added the CLA Signed and fb-exported labels Dec 9, 2022

facebook-github-bot closed this in 0e2ccba Dec 13, 2022

facebook-github-bot added the Merged label Dec 13, 2022

jamesthesnake mentioned this pull request Dec 29, 2022

dietgpu: add optional checksum and version number information (#16) jamesthesnake/dietgpu#3

Merged

jamesthesnake added a commit to jamesthesnake/dietgpu that referenced this pull request Dec 29, 2022

Merge pull request #3 from facebookresearch/main

72fb42c

dietgpu: add optional checksum and version number information (facebookresearch#16)
