lm_dataformat is outdated #552

Closed
65536william opened this issue Feb 13, 2022 · 0 comments · Fixed by #662

Assignees
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@65536william

Describe the bug
When I ran tools/preprocess_data.py to tokenize my dataset, the generated .bin and .idx files were empty. The cause turned out to be lm_dataformat, the library that reads the dataset into the tokenization logic: requirements.txt specifies version 0.0.19, which doesn't support uncompressed .jsonl files. If you pass in a raw jsonl file, it isn't read, and no error is raised either.

Since the README doesn't mention that jsonl has to be compressed to jsonl.zst via the zstd library, this is likely to be a hurdle for those with smaller datasets kept as plain jsonl.

Proposed solution
Upgrade lm_dataformat to version 0.0.20, which adds support for uncompressed jsonl. Additionally, it would be nice to throw a helpful error if nothing actually gets tokenized, since that would indicate that reading the dataset has failed.
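
To make the second suggestion concrete, here is a minimal sketch of such a check. It is not taken from tools/preprocess_data.py: the helper name `iter_checked` and its arguments are hypothetical, and it assumes only that the reader hands documents to the tokenizer as an iterable. It passes documents through unchanged and raises once the input is exhausted if nothing was read.

```python
from typing import Iterable, Iterator


def iter_checked(docs: Iterable[str], source: str) -> Iterator[str]:
    """Yield documents unchanged; raise if the reader produced none.

    `docs` stands in for whatever iterable the dataset reader returns,
    and `source` is only used to build the error message. Both names
    are hypothetical and not part of the existing script.
    """
    count = 0
    for doc in docs:
        count += 1
        yield doc
    # Only reached once the underlying reader is exhausted.
    if count == 0:
        raise ValueError(
            f"No documents were read from {source!r}. "
            "Check that the installed reader supports this file format "
            "(for example, uncompressed .jsonl vs .jsonl.zst)."
        )
```

Wrapping the reader's output this way would turn the silent empty .bin/.idx case into an explicit failure. Because it is a generator, the check only fires after the whole input has been consumed.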

@65536william 65536william added the bug Something isn't working label Feb 13, 2022
@Mistobaan Mistobaan self-assigned this Feb 21, 2022
@StellaAthena StellaAthena added the good first issue Good for newcomers label Mar 22, 2022
@StellaAthena StellaAthena linked a pull request Sep 18, 2022 that will close this issue
3 participants