
Handling multiple fields of the custom input data in the preprocess_data.py #455

Closed
sameeravithana opened this issue Nov 5, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@sameeravithana

Describe the bug
The preprocess_data script expects the JSON input to contain a "text" key regardless of the json-keys passed in the arguments. This is because lmd.Reader(fname).stream_data() hard-codes the "text" key when reading the input.

To Reproduce
Steps to reproduce the behavior:
Run with custom input file with fields other than "text".

Expected behavior
The preprocessing should extract the JSON values for the specific json-keys passed in the arguments.

Proposed solution
Modify lm_dataformat to accept a parameter specifying which JSON key to read.
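The proposed behavior can be sketched as follows. Note that `stream_jsonl_key` is a hypothetical helper written for illustration, not lm_dataformat's actual API:

```python
import json

def stream_jsonl_key(lines, key="text"):
    # Hypothetical helper (not lm_dataformat's API): yield the value
    # stored under `key` in each JSONL record, skipping records without it
    # instead of raising KeyError as in the traceback below.
    for line in lines:
        ob = json.loads(line)
        if key in ob:
            yield ob[key]

docs = ['{"abstract": "A short summary."}', '{"abstract": "Another one."}']
print(list(stream_jsonl_key(docs, key="abstract")))
```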

Error

File "tools/preprocess_data.py", line 193, in <module>
    main()
  File "tools/preprocess_data.py", line 163, in main
    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
  File "tools/preprocess_data.py", line 143, in <genexpr>
    encoded_docs = (encoder.encode(doc) for doc in fin)
  File "tools/preprocess_data.py", line 120, in yield_from_files
    yield from yielder(fname, semaphore)
  File "tools/preprocess_data.py", line 113, in yielder
    for f in filter(lambda x: x, lmd.Reader(fname).stream_data()):
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 116, in stream_data
    yield from self._stream_data(get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 149, in _stream_data
    yield from self.read_jsonl(f, get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 207, in read_jsonl
    yield from handle_jsonl(rdr, get_meta, autojoin_paragraphs, para_joiner)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 99, in handle_jsonl
    text = ob['text']
KeyError: 'text'
@sameeravithana sameeravithana added the bug Something isn't working label Nov 5, 2021
@EricHallahan
Contributor

lm_dataformat==0.0.20 gained the ability to use a specified key other than 'text'. Removing the strict pin in requirements/requirements.txt will unfortunately not be enough to solve this issue, because lm_eval also requires lm_dataformat==0.0.19.

There should be nothing preventing GPT-NeoX or the evaluation harness from functioning with a forced install of lm_dataformat>=0.0.20; it just has not been verified for use with the evaluation harness. pytest and pybind11 are also known to be constrained in this fashion.

I propose the following plan of action to resolve the issue:

  • Update the requirements of lm_eval
  • Update the requirements of GPT-NeoX
    (ready to merge)
  • Add an argument to the preprocessing script to let the user specify their desired key
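The last step in the plan above could look like the following sketch. The flag name `--jsonl-text-key` is hypothetical (preprocess_data.py already exposes `--json-keys`); the point is just to show wiring a user-chosen key into the reader step:

```python
import argparse

# Sketch only: `--jsonl-text-key` is an illustrative flag name, not an
# option that preprocess_data.py actually defines.
parser = argparse.ArgumentParser()
parser.add_argument("--jsonl-text-key", default="text",
                    help="JSON key to read document text from")

# argparse converts dashes in the option string to underscores in the
# attribute name, so the value is available as args.jsonl_text_key.
args = parser.parse_args(["--jsonl-text-key", "abstract"])
print(args.jsonl_text_key)  # -> abstract
```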

@sameeravithana
Author

@EricHallahan you are right: lm_dataformat>=0.0.20 accepts jsonl_key as an argument to _stream_data. But we have to modify the stream_data and _stream_data_threaded functions to accept jsonl_key as well, since the preprocess_data script calls stream_data. I also have to move the operations that save the *.bin and *.idx files inside the for key in args.json_keys loop; the next step is to test lm_eval.
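Threading jsonl_key through the public entry point, as described above, can be sketched like this. The method names match the thread, but the bodies are illustrative stand-ins, not lm_dataformat's actual code:

```python
class Reader:
    # Illustrative stand-in for lm_dataformat's Reader; real records come
    # from files, not an in-memory list.
    def __init__(self, records):
        self.records = records

    def _stream_data(self, get_meta=False, jsonl_key="text"):
        # In lm_dataformat>=0.0.20 this method accepts jsonl_key.
        for ob in self.records:
            if jsonl_key in ob:
                yield ob[jsonl_key]

    def stream_data(self, get_meta=False, jsonl_key="text"):
        # Forward jsonl_key so callers of the public entry point
        # (like preprocess_data.py) can select a field other than "text".
        yield from self._stream_data(get_meta, jsonl_key=jsonl_key)

r = Reader([{"abstract": "a"}, {"full_text": "b"}, {"abstract": "c"}])
print(list(r.stream_data(jsonl_key="abstract")))  # -> ['a', 'c']
```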

@EricHallahan
Contributor

I have opened #456 to integrate the relaxed dependency versions.

@StellaAthena
Member

StellaAthena commented Nov 6, 2021

I’ve opened a branch of the eval harness to make sure that lm_dataformat>=0.0.20 doesn’t break it.

> I also have to introduce the operations within the loop of for key in args.json_keys to save the *.bin and *.idx files; next step is to test lm_eval.

@SamTube405 can you elaborate a bit on what your data looks like and why your use case involves multiple keys?

@sameeravithana
Author

@StellaAthena For example, we have multiple sections (e.g., abstract, full text) extracted from scientific articles and stored in the same jsonl file, and we plan to train on them in parallel.
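A record like the one described above might look as follows; the field names "abstract" and "full_text" are examples of the multiple sections, not a fixed schema:

```python
import json

# One JSONL line holding two text fields from the same article.
line = json.dumps({"abstract": "We study ...",
                   "full_text": "1. Introduction ..."})

# With per-key extraction, each key yields its own document stream:
for key in ["abstract", "full_text"]:
    print(key, "->", json.loads(line)[key])
```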

@StellaAthena
Member

@EricHallahan I have confirmed that nothing goes wrong if you use lm_dataformat>=0.0.20 in the Eval Harness, and opened a PR in that repo to update the requirements.
