
Handling multiple fields of the custom input data in the preprocess_data.py #455

Closed
sameeravithana opened this issue Nov 5, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@sameeravithana

Describe the bug
The preprocess_data script expects the JSON input to contain a "text" key regardless of the json-keys passed in the arguments. This is because lmd.Reader(fname).stream_data() hard-codes the "text" key when reading the input.

To Reproduce
Steps to reproduce the behavior:
Run with custom input file with fields other than "text".

Expected behavior
The preprocessing should extract the JSON values for the specific json-keys passed in the arguments.

Proposed solution
Modify lm_dataformat to accept a parameter specifying which JSON key to read.
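The proposed behavior can be sketched as follows. Note that `stream_jsonl_key` is a hypothetical helper written for illustration, not lm_dataformat's actual API:

```python
import json

def stream_jsonl_key(lines, key="text"):
    # Hypothetical helper (not lm_dataformat's API): yield the value
    # stored under `key` in each JSONL record, skipping records without it
    # instead of raising KeyError as in the traceback below.
    for line in lines:
        ob = json.loads(line)
        if key in ob:
            yield ob[key]

docs = ['{"abstract": "A short summary."}', '{"abstract": "Another one."}']
print(list(stream_jsonl_key(docs, key="abstract")))
```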

Error

File "tools/preprocess_data.py", line 193, in <module>
    main()
  File "tools/preprocess_data.py", line 163, in main
    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
  File "tools/preprocess_data.py", line 143, in <genexpr>
    encoded_docs = (encoder.encode(doc) for doc in fin)
  File "tools/preprocess_data.py", line 120, in yield_from_files
    yield from yielder(fname, semaphore)
  File "tools/preprocess_data.py", line 113, in yielder
    for f in filter(lambda x: x, lmd.Reader(fname).stream_data()):
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 116, in stream_data
    yield from self._stream_data(get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 149, in _stream_data
    yield from self.read_jsonl(f, get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 207, in read_jsonl
    yield from handle_jsonl(rdr, get_meta, autojoin_paragraphs, para_joiner)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 99, in handle_jsonl
    text = ob['text']
KeyError: 'text'
@sameeravithana sameeravithana added the bug Something isn't working label Nov 5, 2021
@EricHallahan
Contributor

lm_dataformat==0.0.20 gained the ability to use a specified key other than 'text'. Removing the strict pin in requirements/requirements.txt will unfortunately not be enough to solve this issue, because lm_eval also requires lm_dataformat==0.0.19.

There should be nothing preventing GPT-NeoX or the evaluation harness from functioning with a forced install of lm_dataformat>=0.0.20; it just has not been verified for use with the evaluation harness. pytest and pybind11 are also known to be constrained in this fashion.

I propose the following plan of action to resolve the issue:

  • Update the requirements of lm_eval
  • Update the requirements of GPT-NeoX
    (ready to merge)
  • Add an argument to the preprocessing script to let the user specify their desired key
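The last step in the plan above could look like the following sketch. The flag name `--jsonl-text-key` is hypothetical (preprocess_data.py already exposes `--json-keys`); the point is just to show wiring a user-chosen key into the reader step:

```python
import argparse

# Sketch only: `--jsonl-text-key` is an illustrative flag name, not an
# option that preprocess_data.py actually defines.
parser = argparse.ArgumentParser()
parser.add_argument("--jsonl-text-key", default="text",
                    help="JSON key to read document text from")

# argparse converts dashes in the option string to underscores in the
# attribute name, so the value is available as args.jsonl_text_key.
args = parser.parse_args(["--jsonl-text-key", "abstract"])
print(args.jsonl_text_key)  # -> abstract
```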

@sameeravithana
Author

@EricHallahan you are right: lm_dataformat>=0.0.20 accepts jsonl_key as an argument to _stream_data. But we have to modify the stream_data and _stream_data_threaded functions to accept jsonl_key as well, since the preprocess_data script calls stream_data. I also have to move the operations that save the *.bin and *.idx files inside the for key in args.json_keys loop; the next step is to test lm_eval.
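Threading jsonl_key through the public entry point, as described above, can be sketched like this. The method names match the thread, but the bodies are illustrative stand-ins, not lm_dataformat's actual code:

```python
class Reader:
    # Illustrative stand-in for lm_dataformat's Reader; real records come
    # from files, not an in-memory list.
    def __init__(self, records):
        self.records = records

    def _stream_data(self, get_meta=False, jsonl_key="text"):
        # In lm_dataformat>=0.0.20 this method accepts jsonl_key.
        for ob in self.records:
            if jsonl_key in ob:
                yield ob[jsonl_key]

    def stream_data(self, get_meta=False, jsonl_key="text"):
        # Forward jsonl_key so callers of the public entry point
        # (like preprocess_data.py) can select a field other than "text".
        yield from self._stream_data(get_meta, jsonl_key=jsonl_key)

r = Reader([{"abstract": "a"}, {"full_text": "b"}, {"abstract": "c"}])
print(list(r.stream_data(jsonl_key="abstract")))  # -> ['a', 'c']
```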

@EricHallahan
Contributor

I have opened #456 to integrate the relaxed dependency versions.

@StellaAthena
Member

StellaAthena commented Nov 6, 2021

I’ve opened a branch of the eval harness to make sure that lm_dataformat>=0.0.20 doesn’t break it.

> I also have to introduce the operations within the loop of for key in args.json_keys to save the *.bin and *.idx files; next step is to test lm_eval.

@SamTube405 can you elaborate a bit on what your data looks like and why your use case involves multiple keys?

@sameeravithana
Author

@StellaAthena For example, we have multiple sections (e.g., abstract, full text) extracted from scientific articles and stored in the same jsonl file, and we plan to train on them in parallel.
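A record like the one described above might look as follows; the field names "abstract" and "full_text" are examples of the multiple sections, not a fixed schema:

```python
import json

# One JSONL line holding two text fields from the same article.
line = json.dumps({"abstract": "We study ...",
                   "full_text": "1. Introduction ..."})

# With per-key extraction, each key yields its own document stream:
for key in ["abstract", "full_text"]:
    print(key, "->", json.loads(line)[key])
```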

@StellaAthena
Member

@EricHallahan I have confirmed that nothing goes wrong if you use lm_dataformat>=0.0.20 in the Eval Harness, and opened a PR in that repo to update the requirements.
