Fix tfrecord dataset to load less files into memory #41

Closed
sdtblck opened this issue Jan 5, 2021 · 0 comments
Labels
bug Something isn't working

Comments

sdtblck (Contributor) commented Jan 5, 2021

The current tfrecord dataset loads one tfrecord file at a time into memory.
With the DeepSpeed distributed wrapper, the dataset repeats that load once per sample on every GPU.
It might be better to preprocess / prefetch n samples, write them to disk, and then load only the required sample from disk at train time.
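
A minimal sketch of the "write samples to disk, then load the correct one by index" idea, not the repository's actual implementation. It assumes every sample is a fixed-length sequence of int32 token ids; `write_samples`, `DiskSampleDataset`, and `seq_len` are hypothetical names introduced for illustration. Each lookup seeks straight to one record, so only that sample is read into memory.

```python
import os
import struct


def write_samples(path, samples, seq_len):
    """Preprocessing step: write fixed-length samples to a flat binary file.

    Assumption: each sample is a sequence of exactly `seq_len` int32 token ids.
    """
    fmt = f"<{seq_len}i"  # little-endian, seq_len 4-byte ints per record
    with open(path, "wb") as f:
        for tokens in samples:
            if len(tokens) != seq_len:
                raise ValueError(f"expected {seq_len} tokens, got {len(tokens)}")
            f.write(struct.pack(fmt, *tokens))


class DiskSampleDataset:
    """Map-style dataset that reads a single sample from disk per lookup."""

    def __init__(self, path, seq_len):
        self.path = path
        self._fmt = f"<{seq_len}i"
        self._record_size = struct.calcsize(self._fmt)
        # Number of samples follows from the file size; nothing is loaded yet.
        self._len = os.path.getsize(path) // self._record_size

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if not 0 <= idx < self._len:
            raise IndexError(idx)
        with open(self.path, "rb") as f:
            # Jump to the requested record and read only that record.
            f.seek(idx * self._record_size)
            record = f.read(self._record_size)
        return list(struct.unpack(self._fmt, record))
```

Because the class exposes `__len__` and `__getitem__`, it could plausibly sit behind an index-based distributed sampler, with each GPU requesting only the indices assigned to it. Variable-length samples would need an offset index alongside the data file, which this sketch leaves out.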

@StellaAthena StellaAthena added the bug Something isn't working label Jan 6, 2021
@StellaAthena StellaAthena added this to To do in 1T or BUST via automation Jan 6, 2021
@StellaAthena StellaAthena moved this from To do to In progress in 1T or BUST Jan 6, 2021
@StellaAthena StellaAthena removed this from In progress in 1T or BUST Jan 14, 2021

4 participants