
Encoding the dataset #2

Open
ketan0 opened this issue Dec 7, 2021 · 1 comment

Comments

ketan0 commented Dec 7, 2021

Did you generate one giant TFRecord for the Lakh MIDI dataset, or did you process the data in shards? If the latter, how exactly does one shard the data with the pipelines you have in place? I'm finding that a single giant TFRecord generated with convert_dir_to_note_sequences is too large to load into memory when running scripts/generate_song_data_beam.py.
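For reference, one common workaround (not necessarily what the repo authors did) is to split the serialized records across several TFRecord shard files instead of one giant file, then feed each shard to the downstream pipeline separately. Below is a minimal pure-Python sketch of the shard-assignment logic; the `write_sharded` helper and its round-robin policy are illustrative assumptions, and plain binary writes stand in for `tf.io.TFRecordWriter`. The filenames follow TensorFlow's `prefix-00000-of-00005` shard-naming convention.

```python
import os

def shard_name(prefix, index, num_shards):
    # TensorFlow-style shard naming: e.g. notes.tfrecord-00000-of-00005
    return f"{prefix}-{index:05d}-of-{num_shards:05d}"

def write_sharded(records, prefix, num_shards, out_dir):
    """Distribute serialized records round-robin across num_shards files.

    `records` is an iterable of bytes (e.g. serialized NoteSequence protos).
    Plain binary writes stand in for tf.io.TFRecordWriter here, so this
    sketch only shows the sharding policy, not the TFRecord framing.
    """
    paths = [os.path.join(out_dir, shard_name(prefix, i, num_shards))
             for i in range(num_shards)]
    handles = [open(p, "wb") for p in paths]
    try:
        for i, rec in enumerate(records):
            # Round-robin assignment keeps shard sizes roughly balanced.
            handles[i % num_shards].write(rec)
    finally:
        for h in handles:
            h.close()
    return paths
```

Each resulting shard is small enough to load independently, so a per-shard run of the downstream script never needs the whole dataset in memory at once.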

@BenokanDeepBlue

Hello Ketan,

I had a similar issue and managed to solve it by removing Reshuffle from the pipeline. This sacrifices parallelism and slows the process down, but at least it no longer tries to load the entire dataset into memory and runs smoothly (though very slowly).
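The trade-off described above can be illustrated without Beam itself: a Reshuffle-style stage materializes and redistributes the full collection (peak memory proportional to the dataset), whereas dropping it leaves a single streamed pass (one record in memory at a time, but sequential). A hypothetical sketch of the two behaviors; the function names are illustrative, not taken from generate_song_data_beam.py:

```python
import random

def process_materialized(records, transform, seed=0):
    # Reshuffle-like behavior: collect everything and shuffle it so work
    # can be redistributed across workers. Peak memory scales with the
    # whole dataset, which is what blows up on a giant TFRecord.
    data = list(records)
    random.Random(seed).shuffle(data)
    return [transform(r) for r in data]

def process_streamed(records, transform):
    # Reshuffle removed: a lazy, sequential pass. No parallel
    # redistribution (slower), but only one record is held at a time.
    for r in records:
        yield transform(r)
```

In the real Beam pipeline the analogous change is deleting the `beam.Reshuffle()` step between the read and the expensive transform, accepting the loss of work rebalancing in exchange for bounded memory.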
