Describe the bug
When running `tools/preprocess_data.py` to tokenize my dataset, I was confused why the generated `.bin` and `.idx` files were empty. It turns out that `lm_dataformat`, the library that actually reads the dataset for tokenization, was pinned to version 0.0.19 in `requirements.txt`. That version of the library does not support uncompressed `.jsonl` files, so if you pass in a raw jsonl file, it won't be read, and no error is raised either.

Since the README doesn't mention the requirement to compress jsonl to jsonl.zst via the `zstd` library, this is likely to be a hurdle for those with smaller datasets kept as plain jsonl.
Proposed solution
Upgrade `lm_dataformat` to version 0.0.20, which adds support for uncompressed jsonl. Additionally, it would be nice to throw a helpful error if nothing actually gets tokenized, since that would indicate that reading the dataset has failed.
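The safety check could be as simple as wrapping the document iterator and raising once it is exhausted without having yielded anything, instead of silently writing empty `.bin`/`.idx` files. A hypothetical sketch, with names that are illustrative rather than taken from `preprocess_data.py`:

```python
def require_nonempty(docs, source="dataset"):
    """Pass documents through unchanged, but raise if the
    underlying reader produced nothing at all."""
    count = 0
    for doc in docs:
        count += 1
        yield doc
    if count == 0:
        raise RuntimeError(
            f"No documents were read from {source}; check the file "
            "format (e.g. lm_dataformat < 0.0.20 silently skips "
            "uncompressed .jsonl files)."
        )
```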