
streaming multipack for pretraining dataset #959

Merged

Conversation

jinwonkim93 (Contributor)
This pull request adds support for multipacking in streaming pretraining datasets. Because these datasets are far too large to load entirely into memory, the data is streamed and packed on the fly, which improves both efficiency and scalability.

A guide for writing the config is still needed.
This multipack implementation does not use BatchSamplerDataCollatorForSeq2Seq; it uses only DataCollatorForSeq2Seq, because the packing is done through the Hugging Face datasets map function.
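To illustrate the approach described above, here is a minimal sketch (not the PR's actual code) of greedy sequence packing done inside a batched `map` call, the pattern that works with Hugging Face streaming datasets; the function name `pack_batch` and the fixed `max_seq_len` are assumptions for illustration:

```python
# Hedged sketch: greedy packing of tokenized examples into fixed-length
# sequences, suitable for use with datasets' batched .map() on a
# streaming (IterableDataset) pipeline. Not the PR's exact code.

def pack_batch(batch, max_seq_len=8):
    """Concatenate input_ids across the batch, then split the result
    into max_seq_len-sized packs, dropping the trailing remainder."""
    concatenated = [tok for ids in batch["input_ids"] for tok in ids]
    usable = (len(concatenated) // max_seq_len) * max_seq_len
    packs = [concatenated[i:i + max_seq_len]
             for i in range(0, usable, max_seq_len)]
    # For causal-LM pretraining, labels are a copy of the inputs.
    return {"input_ids": packs, "labels": [p[:] for p in packs]}

# With a real streaming dataset this would be applied roughly as:
#   ds = load_dataset(..., streaming=True)
#   ds = ds.map(tokenize_fn).map(pack_batch, batched=True)
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]}
packed = pack_batch(batch, max_seq_len=8)
```

Because packing happens inside `map`, no batch-sampler-aware collator is needed downstream, which matches the choice of the plain `DataCollatorForSeq2Seq` noted above.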

Review comment on src/axolotl/utils/data.py (outdated, resolved)
winglian (Collaborator) commented Jan 5, 2024

Here's a patch file I used to test a C4 pretraining dataset with TinyLlama. Multi-GPU doesn't currently work with this, since I think it needs a proper data collator to pad the samples to the same sequence length.
patch0.patch
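A minimal sketch of the kind of collator described above, padding every sample in a batch to the batch's longest sequence so all ranks see identically shaped tensors; `pad_collate` and its defaults are hypothetical, not code from this PR:

```python
# Hedged sketch: pad a list of tokenized samples to a common length.
# -100 is the conventional ignore index for Hugging Face loss masking.

def pad_collate(samples, pad_token_id=0, label_pad_id=-100):
    max_len = max(len(s["input_ids"]) for s in samples)
    input_ids, labels, attention_mask = [], [], []
    for s in samples:
        pad = max_len - len(s["input_ids"])
        input_ids.append(s["input_ids"] + [pad_token_id] * pad)
        # Fall back to input_ids as labels for causal-LM pretraining.
        labels.append(list(s.get("labels", s["input_ids"])) + [label_pad_id] * pad)
        attention_mask.append([1] * len(s["input_ids"]) + [0] * pad)
    return {"input_ids": input_ids,
            "labels": labels,
            "attention_mask": attention_mask}

batch = pad_collate([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
```

In a real setup these lists would be converted to tensors; the point is that uniform shapes per batch are what distributed data parallelism needs, which is why single-GPU runs succeed while multi-GPU runs fail without such padding.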

casper-hansen (Collaborator)

Would this streaming feature work with S3, GCS, Azure Blob Storage?

winglian (Collaborator) commented Jan 5, 2024

This PR is ready for review and should resolve #1026. @mhenrichsen

mhenrichsen (Collaborator)

Confirmed working on a single GPU. Currently fails on multi-GPU.

@winglian winglian changed the title [WIP] streaming multipack for pretraining dataset streaming multipack for pretraining dataset Jan 6, 2024
@winglian winglian merged commit 553c80f into OpenAccess-AI-Collective:main Jan 6, 2024
5 of 6 checks passed
4 participants