
dietgpu: add optional checksum and version number information #16

Closed
wants to merge 1 commit

Conversation

wickedfoo
Contributor

Summary:
This diff adds two features that are useful when DietGPU is used to store numeric data in persistent storage (e.g., for checkpointing large language model training).

**1** An optional checksum: a checksum of the uncompressed input data is computed at compression time and stored in the archive, and an option is provided to verify the decompressed output against the stored checksum after decompression. This is an additional safeguard for the archival integrity of DietGPU.

All compress/decompress functions gain an additional optional parameter that enables the checksum feature; it is the first optional parameter of each function.
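For illustration only, here is a minimal sketch of the calling pattern, using hypothetical `...Sketch` declarations rather than the real DietGPU entry points (whose batched signatures and exact parameter names are not reproduced here). It shows the intended property: the flag defaults to off, so existing call sites are unaffected.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical, simplified declarations for illustration only; the real
// DietGPU float codec functions are batched and take a device resource object.
void floatCompressSketch(const void* in, uint32_t inWords, void* outArchive,
                         uint32_t* outArchiveSize, cudaStream_t stream,
                         bool useChecksum = false);
void floatDecompressSketch(const void* inArchive, void* out,
                           uint32_t outCapacity, cudaStream_t stream,
                           bool useChecksum = false);

void roundTripWithChecksum(const void* devIn, uint32_t numFloat16Words,
                           void* devArchive, void* devOut,
                           uint32_t outCapacity, cudaStream_t stream) {
  uint32_t archiveSize = 0;

  // Compression: a checksum of the uncompressed input is computed on the GPU
  // and stored in the archive header.
  floatCompressSketch(devIn, numFloat16Words, devArchive, &archiveSize, stream,
                      /*useChecksum=*/true);

  // Decompression: the checksum is recomputed over the decompressed output
  // and compared against the stored value; a mismatch is fatal.
  floatDecompressSketch(devArchive, devOut, outCapacity, stream,
                        /*useChecksum=*/true);
}
```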

There are two checksum flags, one for byte-wise ANS and one for float compression. When using float compression, the ANS flag cannot be used.

Computing and comparing the checksum adds another full pass over the data, so it is not recommended for online, in-memory compression/decompression (e.g., shipping data over a network).

**2** A version number has been added to both the ANS and float data headers inside the archive. This guards against data format changes in future versions of DietGPU by detecting when a newer (or older) version of the library is used to unpack an older (or newer, respectively) archive.

Either failure (checksum mismatch or version mismatch) causes a fatal assert within the CUDA kernels that perform these checks.

Also added a previously missing check that the float type expected at decompression time matches the float type recorded in the archive.

To support these features, the data layout of the DietGPU archive has changed to include space for the checksum in the headers.
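To make the layout change concrete, below is a hypothetical header sketch; the actual DietGPU ANS and float header layouts, field names, and widths are not given in this description, so everything here beyond the two additions being described (a format version and space for a checksum) is an assumption.

```cpp
#include <cstdint>

// Hypothetical archive header sketch; field names, widths, and ordering are
// assumptions, and the real ANS and float headers differ.
struct ArchiveHeaderSketch {
  uint32_t magic;            // identifies a DietGPU archive
  uint32_t version;          // format version; a mismatch is fatal on decompress
  uint32_t floatType;        // expected float type (float archives only)
  uint32_t checksum;         // checksum of the uncompressed data (unused if disabled)
  uint32_t uncompressedSize; // size of the original, uncompressed data
  // ... remaining ANS / float compression metadata ...
};
```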

Issues / to do:
- the checksum is currently only 8 bits, because no particular data alignment can be assumed, and it is just a simple xor reduction. The ANS compressor accepts input with any alignment (only byte alignment is required), while the float compressor expects float word-sized alignment (e.g., float16 requires 2-byte alignment). The checksum kernel should ideally compute a full uint32 checksum (if xor), or use something like crc32; a sketch of the xor-reduction idea follows this list. This is a todo.
- asserts are compiled out if `NDEBUG` is defined. The assert should be replaced with a device-to-host (d2h) copy of the error status, with the error raised on the host side instead.
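The sketch below illustrates the xor-reduction idea under the weakest alignment assumption (byte-aligned input); it is not the DietGPU kernel. Because only single bytes can be loaded portably in that case, the folded value carries at most 8 bits of information even if accumulated in a `uint32_t`, which is why widening the checksum (or switching to crc32) needs an alignment-aware load path.

```cuda
#include <cstdint>

// Illustrative byte-wise xor-reduction checksum (not the DietGPU kernel).
// *outChecksum must be zero-initialized by the caller.
// Assumes blockDim.x is a multiple of 32 so every warp is fully populated.
__global__ void xorChecksumSketch(const uint8_t* data,
                                  uint32_t size,
                                  uint32_t* outChecksum) {
  uint32_t local = 0;

  // Grid-stride loop: each thread folds a strided subset of the input bytes.
  for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += gridDim.x * blockDim.x) {
    local ^= data[i];
  }

  // Warp-level xor reduction.
  for (int offset = 16; offset > 0; offset /= 2) {
    local ^= __shfl_down_sync(0xffffffff, local, offset);
  }

  // One lane per warp folds its partial result into the global checksum;
  // xor is associative and commutative, so the order of atomics is irrelevant.
  if ((threadIdx.x & 31) == 0) {
    atomicXor(outChecksum, local);
  }
}
```

For the second bullet, copying the comparison result back with `cudaMemcpyAsync` and raising the error on the host would be one way to avoid relying on device-side `assert`, which disappears under `NDEBUG`.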

Differential Revision: D41829599

fbshipit-source-id: 2792278e899680107df609b60149528708c94987
@facebook-github-bot added the CLA Signed and fb-exported labels on Dec 9, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D41829599

@facebook-github-bot
Contributor

This pull request has been merged in 0e2ccba.

jamesthesnake added a commit to jamesthesnake/dietgpu that referenced this pull request Dec 29, 2022
dietgpu: add optional checksum and version number information (facebookresearch#16)