Adding jsonl chunked dataset #52
Conversation
On first pass this looks well executed. I agree that splitting on the tokenizer would make more sense and, more importantly, will prevent alternative tokenizers or future datasets from causing problems. I'm curious about the attached graph. How many data points did you actually collect? Is it just the five corners, or is there good evidence that load time is proportional to 10^x? I am reviewing this on my phone and therefore cannot run the code myself. I will aim to do so tomorrow or Monday, assuming nobody merges the commit before then.
Thanks @StellaAthena for the feedback. Following what we talked about today in Discord, I added an option to pre-tokenize the data and parallelized the sharding script. In the experiment above, I only used 5 points because sharding took a long time. I re-ran the experiment (pre-tokenizing vs. not pre-tokenizing) with the new code and got much less variability in loading time (on the order of a 200 ms difference for 1 GB of data).

Some stats from the pre-tokenizing/non-pre-tokenizing runs, both using 16 workers and loading examples 100 at a time after pre-processing a ~1 GB data file:

===not pre-tokenizing===
===pre-tokenizing===

Some takeaways:
gpt_neox/data_utils.py
Outdated
path_shards, single_file_chunk, single_file_chunk_line = 0, 1, 0

dataset_name = path[last_slash+1:ext_index]
extension = path[ext_index:]
there are much easier and more robust ways to do this with pathlib - if a filename has a dot anywhere other than the extension, for instance, this will break. i'll add changes tomorrow.
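A pathlib-based version of that extraction might look roughly like this (a sketch, not the change that was ultimately committed):

```python
from pathlib import Path

def split_dataset_path(path):
    """Split a data file path into (dataset_name, extension) using pathlib.

    Unlike slicing on the last '/' and the first '.', this handles dots
    elsewhere in the filename, e.g. 'data/pile.v2.jsonl' -> ('pile.v2', '.jsonl').
    """
    p = Path(path)
    return p.stem, p.suffix

print(split_dataset_path("data/pile.v2.jsonl"))  # ('pile.v2', '.jsonl')
```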
gpt_neox/datasets.py
Outdated
import linecache
import numpy as np

PAD_TOKEN = 50257
we can get the pad token from the tokenizer
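Reading the pad token from the tokenizer might look roughly like this (a sketch using a stand-in tokenizer object; the `pad_token_id`/`eos_token_id` attribute names follow the HuggingFace convention and are an assumption about how gpt-neox wires this up):

```python
class StubTokenizer:
    """Stand-in for a real tokenizer (e.g. a GPT-2 BPE tokenizer)."""
    pad_token_id = None
    eos_token_id = 50256

def get_pad_token_id(tokenizer):
    # GPT-2 style tokenizers often ship without a dedicated pad token,
    # so fall back to EOS rather than hard-coding a magic constant.
    if tokenizer.pad_token_id is not None:
        return tokenizer.pad_token_id
    return tokenizer.eos_token_id

print(get_pad_token_id(StubTokenizer()))  # 50256
```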
gpt_neox/data_utils.py
Outdated
start = i*seq_length
stop = (i+1)*seq_length
trunc_words = all_words[start:stop]
🤔 I'd say it would be safer to shard by file size or something, rather than word count (1 word ≈ 2.5 tokens, so you get very different shard lengths here, and it already makes the tests inconclusive)
I would think that num tokens is the desired measurement here. Would you prefer file size to that?
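Sharding by token count, as discussed above, could be sketched like this (illustrative only; `tokenize` defaults to a whitespace split as a stand-in, where a real run would pass the actual tokenizer's encode so shard sizes match what the model sees):

```python
def shard_by_tokens(documents, tokens_per_shard, tokenize=str.split):
    """Group documents into shards of roughly equal token count."""
    shards, current, count = [], [], 0
    for doc in documents:
        current.append(doc)
        count += len(tokenize(doc))
        if count >= tokens_per_shard:
            shards.append(current)
            current, count = [], 0
    if current:  # flush any partial final shard
        shards.append(current)
    return shards

print(shard_by_tokens(["a b", "c d", "e"], tokens_per_shard=3))
# [['a b', 'c d'], ['e']]
```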
gpt_neox/datasets.py
Outdated
raise Exception("An example has no words in it.")

if len(line) < self.seq_length:
    line.extend([PAD_TOKEN for _ in range(self.seq_length-len(line))])
How often does this happen? This should really be very rare or it will start to affect performance... I guess this is another pro for pre-tokenization - we can ensure it never happens at all
I added line 62 for debugging more than anything; right now there shouldn't be any examples of length 0.
Lines 64/65 are used all the time: I did this to decrease the size of the pre-tokenized files, since almost every example has fewer than 2048 tokens in it, and storing padding tokens would greatly increase the size of the files.
Timed it, and each run of lines 64/65 for a single example takes 1-2 ms.
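The pad-on-load behavior being discussed could be sketched roughly as follows (the `PAD_TOKEN` value is taken from the diff above; as noted earlier in the thread, it should really come from the tokenizer):

```python
PAD_TOKEN = 50257  # from the diff above; ideally read from the tokenizer instead

def pad_to_length(line, seq_length):
    """Pad a tokenized example up to seq_length.

    Examples are stored unpadded on disk to keep pre-tokenized files small;
    padding is applied only at load time.
    """
    if len(line) < seq_length:
        line = line + [PAD_TOKEN] * (seq_length - len(line))
    return line

print(pad_to_length([1, 2], 4))  # [1, 2, 50257, 50257]
```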
Mostly looks great! but a few bits i'm not so sure about, as mentioned above. I think i'll try to fix those bits tomorrow, then test it out inside a train script, and if it works well, we'll merge
For next time, I strongly recommend not displaying a plot like this as a line graph, or, if you do, inserting a marker at each measured point. It's easy for an incautious reader to falsely assume that for large chunks there's a linear relationship between the axes, when in fact there is no evidence of this at all. It's not a big deal, but it can avoid unfortunate confusions.
… same number of examples, but length of each example is still different)
…ired context length
Changed loading behavior so that it pulls lines until it reaches the required sequence length. Also working on a separate version that chunks directly onto a single line on branch stream_loading
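The pull-until-full behavior described here could be sketched roughly as follows (an illustration, not the branch's actual code; it assumes lines are already tokenized and carries leftover tokens into the next example rather than padding):

```python
def examples_from_lines(token_lines, seq_length):
    """Yield fixed-length examples by concatenating tokenized lines.

    Lines are pulled until seq_length tokens are buffered; the remainder
    carries over into the next example, so no mid-stream padding is needed.
    """
    buffer = []
    for tokens in token_lines:
        buffer.extend(tokens)
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

print(list(examples_from_lines([[1, 2, 3], [4, 5], [6]], seq_length=2)))
# [[1, 2], [3, 4], [5, 6]]
```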
…ely on knowing seq size!
Set main to the stream_loading branch that does not chunk based on known chunk size.
Pushed a new version of the code that doesn't explicitly chunk based on seq_length, and replaced the test code with more discrete test cases.
…n train.py but requires more testing
This PR implements and tests a new dataset that loads data from chunked jsonl files instead of tfrecords and tokenizes on the fly. It also adds a helper function to convert decompressed jsonl files from the-pile to a set of chunked data files and a metadata file.
Test instructions are located in tests/test_dataset.md. Running this produces the following graph which shows how load time changes with chunk size:
![Screen Shot 2021-01-09 at 8 50 16 PM](https://user-images.githubusercontent.com/26977778/104115180-2baf3300-52c1-11eb-81c5-e8c544c2c679.png)
One concern with this approach is how sharding is done: currently, sequences are divided by splitting on the " " char; it might be better to split directly using a tokenizer.
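Splitting with the tokenizer itself, as suggested, might look roughly like this (a sketch; `encode` defaults to a whitespace split as a stand-in, where a real run would pass the tokenizer's encode function so chunk boundaries match true token counts):

```python
def chunk_with_tokenizer(text, seq_length, encode=str.split):
    """Chunk one document into fixed-size pieces of tokens.

    Chunking after tokenization avoids the word-vs-token mismatch that
    splitting on " " introduces.
    """
    tokens = encode(text)
    return [tokens[i:i + seq_length] for i in range(0, len(tokens), seq_length)]

print(chunk_with_tokenizer("a b c d e", seq_length=2))
# [['a', 'b'], ['c', 'd'], ['e']]
```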