
Adding jsonl chunked dataset #52

Closed
wants to merge 25 commits

Conversation

glebshevchukk

This PR implements and tests a new dataset that loads data from chunked jsonl files instead of tfrecords and tokenizes on the fly. It also adds a helper function to convert decompressed jsonl files from the-pile to a set of chunked data files and a metadata file.

Test instructions are located in tests/test_dataset.md. Running this produces the following graph which shows how load time changes with chunk size:
[Screenshot: plot of load time vs. chunk size]

One concern with this approach is how sharding is done: currently, sequences are divided by splitting on the " " character. It might be better to split directly using a tokenizer.
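For concreteness, a minimal sketch of the sharding step described above might look like the following; the function and file names are illustrative rather than the PR's actual API, and it assumes each Pile jsonl line stores its document under a "text" key:

```python
import json
from pathlib import Path

def shard_jsonl(input_path, output_dir, words_per_shard=2048):
    """Split a decompressed Pile jsonl file into fixed-size shards plus a metadata file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    metadata, buffer, shard_idx = {}, [], 0
    with open(input_path) as f:
        for line in f:
            text = json.loads(line)["text"]   # Pile lines keep the document under "text"
            buffer.extend(text.split(" "))    # current approach: split on the space char
            while len(buffer) >= words_per_shard:
                shard_path = output_dir / f"shard_{shard_idx}.jsonl"
                shard_path.write_text(json.dumps(" ".join(buffer[:words_per_shard])) + "\n")
                metadata[shard_idx] = {"path": str(shard_path), "num_words": words_per_shard}
                buffer = buffer[words_per_shard:]
                shard_idx += 1
    (output_dir / "metadata.json").write_text(json.dumps(metadata))
```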

@StellaAthena
Member

On first pass this looks well executed. I agree that splitting on the tokenizer would make more sense and, more importantly, will prevent alternative tokenizers or future datasets from causing problems.

I’m curious about the attached graph. How many data points did you actually collect? Is it just the five corners, or is there good evidence that load time is proportional to 10^x?

I am reviewing this on my phone and therefore cannot run the code myself. I will aim to do so tomorrow or Monday assuming nobody merges the commit before then.

@glebshevchukk
Author

Thanks @StellaAthena for the feedback. Following what we talked about today in Discord, I added an option to pre-tokenize the data and parallelized the sharding script.

In the above experiment, I only used 5 points because sharding took a long time. I re-ran the experiment comparing pre-tokenizing vs. not pre-tokenizing with the new code and got much less variability in loading time (on the order of a 200 ms difference for 1 GB of data).

Some stats from the pre-tokenizing/not-pre-tokenizing runs, both using 16 workers and loading examples 100 at a time after pre-processing a ~1 GB data file:

===not pre-tokenizing===
Average sharding time: 22860.43780649925 ms
Average shard dir size: 3403237354.25 bytes
Average loading time for 10000 seqs: 13855.113772748155 ms

===pre-tokenizing===
Average sharding time: 136226.53418425034 ms
Average shard dir size: 2742269740.25 bytes
Average loading time for 10000 seqs: 1501.960203750059 ms

Some takeaways:

  1. Pre-tokenizing definitely takes more sharding time but decreases loading time substantially. Neither is a serious bottleneck, so we can go with either approach.
  2. The actual chunk size doesn't really matter for loading time, but it does affect the size of the metadata dictionary.

path_shards,single_file_chunk,single_file_chunk_line = 0,1,0

dataset_name = path[last_slash+1:ext_index]
extension = path[ext_index:]
Contributor

There are much easier and more robust ways to do this with pathlib - if a filename has a dot anywhere other than the extension, for instance, this will break. I'll add changes tomorrow.
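For reference, a pathlib-based version might look roughly like this (the path is illustrative):

```python
from pathlib import Path

path = Path("/data/the_pile/00.jsonl")  # illustrative path
dataset_name = path.stem   # "00": filename without the final suffix
extension = path.suffix    # ".jsonl": only the final suffix, so dots elsewhere in the name are safe
```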

import linecache
import numpy as np

PAD_TOKEN=50257
Contributor

we can get the pad token from the tokenizer
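For example, assuming a HuggingFace-style GPT-2 tokenizer (the repo's actual tokenizer setup may differ), something like:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# GPT-2 ships without a dedicated pad token, so a common workaround is to reuse EOS.
tokenizer.pad_token = tokenizer.eos_token
PAD_TOKEN = tokenizer.pad_token_id  # 50256 for GPT-2, rather than a hard-coded constant
```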


start = i*seq_length
stop = (i+1)*seq_length
trunc_words = all_words[start:stop]
Contributor

@sdtblck Jan 12, 2021

🤔 I'd say it would be safer to shard by file size or something, rather than word count (1 word ≈ 2.5 tokens, so you end up with very different shard lengths here, and it already makes the tests inconclusive).

Member

I would think that num tokens is the desired measurement here. Would you prefer file size to that?
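A rough sketch of sharding by token count, assuming a tokenizer with an .encode() method; the helper below is hypothetical, not part of this PR:

```python
def iter_token_shards(documents, tokenizer, tokens_per_shard):
    """Yield lists of token ids, each (except possibly the last) exactly tokens_per_shard long."""
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer.encode(doc))
        while len(buffer) >= tokens_per_shard:
            yield buffer[:tokens_per_shard]
            buffer = buffer[tokens_per_shard:]
    if buffer:
        yield buffer  # final partial shard, to be padded at load time
```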

raise Exception("An example has no words in it.")

if len(line) < self.seq_length:
    line.extend([PAD_TOKEN for _ in range(self.seq_length - len(line))])
Contributor

How often does this happen? This should really be very rare or it will start to affect performance... I guess this is another pro for pre-tokenization - we can ensure it never happens at all.

Author

I added line 62 for debugging more than anything; right now there shouldn't be any examples of length 0.

Lines 64/65 are used all the time: I did this to decrease the size of the pre-tokenized files, since almost every example has fewer than 2048 tokens in it and storing padding tokens would greatly increase the file size.

Author

Timed it out: each run of lines 64/65 for a single example takes 1-2 ms.

@sdtblck
Contributor

sdtblck commented Jan 12, 2021

Mostly looks great! But there are a few bits I'm not so sure about, as mentioned above. I think I'll try to fix those bits tomorrow, then test it out inside a train script, and if it works well, we'll merge.

@StellaAthena
Member

> In the above experiment, I only used 5 points because sharding took a long time. I re-ran the experiment comparing pre-tokenizing vs. not pre-tokenizing with the new code and got much less variability in loading time (on the order of a 200 ms difference for 1 GB of data).

For next time, I strongly recommend not displaying a plot like this as a line graph, or, if you do, inserting a marker at each measured point. It's easy for an incautious reader to falsely assume that for large chunks there's a linear relationship between the axes, when in fact there is no evidence of this at all. It's not a big deal, but it avoids unfortunate confusion.

@glebshevchukk
Author

Changed loading behavior so that it pulls lines until it reaches the required sequence length. Also working on a separate version that chunks directly onto a single line on the stream_loading branch.
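A minimal sketch of what "pull lines until seq_length is reached" could look like, assuming each line of a shard file holds a JSON list of token ids (names are illustrative, not the PR's actual code):

```python
import json
import linecache

def load_example(shard_path, start_line, seq_length, pad_token):
    tokens, line_no = [], start_line
    while len(tokens) < seq_length:
        line = linecache.getline(shard_path, line_no)
        if not line:  # ran off the end of the shard; pad the remainder
            tokens.extend([pad_token] * (seq_length - len(tokens)))
            break
        tokens.extend(json.loads(line))
        line_no += 1
    return tokens[:seq_length]
```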

@glebshevchukk
Author

Pushed a new version of the code that doesn't explicitly chunk based on seq_length and replaced test code with more discrete test cases.
