Describe the bug
When running `tools/preprocess_data.py` to tokenize my dataset, I was confused why the generated `.bin` and `.idx` files were empty. It turns out that `lm_dataformat`, the library that actually reads the dataset for tokenization, was pinned to version 0.0.19 in `requirements.txt`. That version of the library does not support uncompressed `.jsonl` files, so if you pass in a raw jsonl file, it won't be read, and no error is raised either.

Since the README doesn't mention the requirement to compress jsonl to jsonl.zst via the `zstd` library, this is likely to be a hurdle for those with smaller datasets kept as plain jsonl.
Proposed solution
Upgrade `lm_dataformat` to version 0.0.20, which adds support for uncompressed jsonl. Additionally, it would be nice to throw a helpful error if nothing actually gets tokenized, since that would indicate that reading the dataset has failed.
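The safety check could be as simple as wrapping the document iterator and raising once it is exhausted without having yielded anything, instead of silently writing empty `.bin`/`.idx` files. A hypothetical sketch, with names that are illustrative rather than taken from `preprocess_data.py`:

```python
def require_nonempty(docs, source="dataset"):
    """Pass documents through unchanged, but raise if the
    underlying reader produced nothing at all."""
    count = 0
    for doc in docs:
        count += 1
        yield doc
    if count == 0:
        raise RuntimeError(
            f"No documents were read from {source}; check the file "
            "format (e.g. lm_dataformat < 0.0.20 silently skips "
            "uncompressed .jsonl files)."
        )
```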