Handling multiple fields of the custom input data in the preprocess_data.py #455
Comments
There should be nothing preventing GPT-NeoX or the evaluation harness from functioning when using a forced install of I propose the following plan of action to resolve the issue:
@EricHallahan you are right,
I have opened #456 to integrate the relaxed dependency versions.
I’ve opened a branch of the eval harness to make sure that
@SamTube405 can you elaborate a bit on what your data looks like, and why your use case involves multiple keys?
@StellaAthena For example, we have multiple sections (e.g., abstracts, full texts) extracted from scientific articles that we store within the same jsonl file, and we plan to train on them in parallel.
@EricHallahan I have confirmed that nothing goes wrong if you use
Describe the bug
The preprocess_data script expects a "text" field in the JSON input regardless of the json-keys passed as arguments. This is because lmd.Reader(fname).stream_data() expects a "text" field in the JSON input.
To Reproduce
Steps to reproduce the behavior:
Run the script on a custom input file whose records have fields other than "text".
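For illustration, a minimal sketch of the failure mode. This is not the actual lm_dataformat code: `read_text_only`, the sample record, and its field names are hypothetical, and the reader simply hard-codes the "text" key the way the description above says the real reader does.

```python
import json


def read_text_only(lines):
    """Hypothetical reader that always looks up the "text" field."""
    for line in lines:
        record = json.loads(line)
        # Raises KeyError when the record has no "text" field.
        yield record["text"]


# Illustrative record with custom fields only (made-up sample data).
sample = ['{"abstract": "A short abstract.", "full_text": "The article body."}']

try:
    list(read_text_only(sample))
    failed = False
except KeyError:
    failed = True
```

With a record like this one, `failed` ends up `True`, because no "text" field is present.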
Expected behavior
Preprocessing should extract the JSON fields named by the json-keys argument.
Proposed solution
Modify lm_dataformat to accept a parameter naming the specific JSON key to read.
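A rough sketch of what key-aware reading could look like, written against the standard `json` module only. The function name `stream_jsonl_fields` and its `json_keys` parameter are assumptions made for illustration; they are not part of lm_dataformat or preprocess_data.py.

```python
import json


def stream_jsonl_fields(lines, json_keys=("text",)):
    """Yield (key, value) pairs for each requested key of each JSONL record.

    Hypothetical helper: `lines` is any iterable of JSON-encoded strings,
    for example an open .jsonl file.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key in json_keys:
            if key not in record:
                raise KeyError(f"line {line_number}: missing key {key!r}")
            yield key, record[key]


# Illustrative records with two custom fields (made-up sample data).
sample_lines = [
    '{"abstract": "First abstract.", "full_text": "First body."}',
    '{"abstract": "Second abstract.", "full_text": "Second body."}',
]

docs = list(stream_jsonl_fields(sample_lines, json_keys=("abstract", "full_text")))
```

Each requested key is emitted separately, so a caller could route the "abstract" and "full_text" values to separate outputs. Raising on a missing key is one possible choice; skipping such records silently would be another.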
Error