
Adding jsonl chunked dataset #52

Closed
wants to merge 25 commits

Conversation

glebshevchukk

This PR implements and tests a new dataset that loads data from chunked jsonl files instead of tfrecords and tokenizes on the fly. It also adds a helper function to convert decompressed jsonl files from the-pile to a set of chunked data files and a metadata file.

Test instructions are located in tests/test_dataset.md. Running this produces the following graph which shows how load time changes with chunk size:
[Screenshot: plot of load time vs. chunk size]

One concern with this approach is how sharding is done: currently, sequences are divided by splitting on the " " character. It might be better to split directly using a tokenizer.
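For concreteness, a minimal sketch of the sharding step described above might look like the following; the function and file names are illustrative rather than the PR's actual API, and it assumes each Pile jsonl line stores its document under a "text" key:

```python
import json
from pathlib import Path

def shard_jsonl(input_path, output_dir, words_per_shard=2048):
    """Split a decompressed Pile jsonl file into fixed-size shards plus a metadata file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    metadata, buffer, shard_idx = {}, [], 0
    with open(input_path) as f:
        for line in f:
            text = json.loads(line)["text"]   # Pile lines keep the document under "text"
            buffer.extend(text.split(" "))    # current approach: split on the space char
            while len(buffer) >= words_per_shard:
                shard_path = output_dir / f"shard_{shard_idx}.jsonl"
                shard_path.write_text(json.dumps(" ".join(buffer[:words_per_shard])) + "\n")
                metadata[shard_idx] = {"path": str(shard_path), "num_words": words_per_shard}
                buffer = buffer[words_per_shard:]
                shard_idx += 1
    (output_dir / "metadata.json").write_text(json.dumps(metadata))
```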

@StellaAthena
Member

On first pass this looks well executed. I agree that splitting on the tokenizer would make more sense and, more importantly, will prevent alternative tokenizers or future datasets from causing problems.

I’m curious about the attached graph. How many data points did you actually collect? Is it just the five corners, or is there good evidence that load time is proportional to 10^x?

I am reviewing this on my phone and therefore cannot run the code myself. I will aim to do so tomorrow or Monday assuming nobody merges the commit before then.

@glebshevchukk
Author

Thanks @StellaAthena for the feedback. Following what we talked about today in Discord, I added an option to pre-tokenize the data and parallelized the sharding script.

In the above experiment, I only used 5 points because sharding took a long time. I re-ran the experiment comparing pre-tokenizing vs. not pre-tokenizing with the new code and got much less variability in loading time (on the order of a 200 ms difference for 1 GB of data).

Some stats from the pre-tokenizing/not-pre-tokenizing runs, both using 16 workers and loading examples 100 at a time after pre-processing a ~1 GB data file:

===not pre-tokenizing===
Average sharding time: 22860.43780649925 ms
Average shard dir size: 3403237354.25 bytes
Average loading time for 10000 seqs: 13855.113772748155 ms

===pre-tokenizing===
Average sharding time: 136226.53418425034 ms
Average shard dir size: 2742269740.25 bytes
Average loading time for 10000 seqs: 1501.960203750059 ms

Some takeaways:

  1. Pre-tokenizing definitely takes more sharding time but decreases loading time substantially. Neither is a serious bottleneck, so we can go with either approach.
  2. The actual chunk size doesn't really matter for loading time, but it does affect the size of the metadata dictionary.

path_shards,single_file_chunk,single_file_chunk_line = 0,1,0

dataset_name = path[last_slash+1:ext_index]
extension = path[ext_index:]
Contributor

There are much easier and more robust ways to do this with pathlib - if a filename has a dot anywhere other than the extension, for instance, this will break. I'll add changes tomorrow.
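For reference, a pathlib-based version might look roughly like this (the path is illustrative):

```python
from pathlib import Path

path = Path("/data/the_pile/00.jsonl")  # illustrative path
dataset_name = path.stem   # "00": filename without the final suffix
extension = path.suffix    # ".jsonl": only the final suffix, so dots elsewhere in the name are safe
```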

import linecache
import numpy as np

PAD_TOKEN=50257
Contributor

we can get the pad token from the tokenizer
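For example, assuming a HuggingFace-style GPT-2 tokenizer (the repo's actual tokenizer setup may differ), something like:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# GPT-2 ships without a dedicated pad token, so a common workaround is to reuse EOS.
tokenizer.pad_token = tokenizer.eos_token
PAD_TOKEN = tokenizer.pad_token_id  # 50256 for GPT-2, rather than a hard-coded constant
```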


start = i*seq_length
stop = (i+1)*seq_length
trunc_words = all_words[start:stop]
Contributor

@sdtblck Jan 12, 2021

🤔 I'd say it would be safer to shard by file size or something, rather than word count (1 word ≈ 2.5 tokens, so you end up with very different shard lengths here, and it already makes the tests inconclusive).

Member

I would think that num tokens is the desired measurement here. Would you prefer file size to that?
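A rough sketch of sharding by token count, assuming a tokenizer with an .encode() method; the helper below is hypothetical, not part of this PR:

```python
def iter_token_shards(documents, tokenizer, tokens_per_shard):
    """Yield lists of token ids, each (except possibly the last) exactly tokens_per_shard long."""
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer.encode(doc))
        while len(buffer) >= tokens_per_shard:
            yield buffer[:tokens_per_shard]
            buffer = buffer[tokens_per_shard:]
    if buffer:
        yield buffer  # final partial shard, to be padded at load time
```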

raise Exception("An example has no words in it.")

if len(line) < self.seq_length:
    line.extend([PAD_TOKEN for _ in range(self.seq_length - len(line))])
Contributor

How often does this happen? This should really be very rare or it will start to affect performance... I guess this is another pro for pre-tokenization - we can ensure it never happens at all.

Author

I added line 62 for debugging more than anything; right now there shouldn't be any examples of length 0.

Lines 64/65 are used all the time: I did this to decrease the size of the pre-tokenized files, since almost every example has fewer than 2048 tokens in it and storing padding tokens would greatly increase the file size.

Author

Timed it out: each run of lines 64/65 for a single example takes 1-2 ms.

@sdtblck
Contributor

sdtblck commented Jan 12, 2021

Mostly looks great! But there are a few bits I'm not so sure about, as mentioned above. I think I'll try to fix those bits tomorrow, then test it out inside a train script, and if it works well, we'll merge.

@StellaAthena
Member

> In the above experiment, I only used 5 points because sharding took a long time. I re-ran the experiment comparing pre-tokenizing vs. not pre-tokenizing with the new code and got much less variability in loading time (on the order of a 200 ms difference for 1 GB of data).

For next time, I strongly recommend not displaying a plot like this as a line graph, or, if you do, inserting a marker at each measured point. It's easy for an incautious reader to falsely assume that for large chunks there's a linear relationship between the axes, when in fact there is no evidence of this at all. It's not a big deal, but it avoids unfortunate confusion.

@glebshevchukk
Author

Changed loading behavior so that it pulls lines until it reaches the required sequence length. Also working on a separate version that chunks directly onto a single line on the stream_loading branch.
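A minimal sketch of what "pull lines until seq_length is reached" could look like, assuming each line of a shard file holds a JSON list of token ids (names are illustrative, not the PR's actual code):

```python
import json
import linecache

def load_example(shard_path, start_line, seq_length, pad_token):
    tokens, line_no = [], start_line
    while len(tokens) < seq_length:
        line = linecache.getline(shard_path, line_no)
        if not line:  # ran off the end of the shard; pad the remainder
            tokens.extend([pad_token] * (seq_length - len(tokens)))
            break
        tokens.extend(json.loads(line))
        line_no += 1
    return tokens[:seq_length]
```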

@glebshevchukk
Author

Pushed a new version of the code that doesn't explicitly chunk based on seq_length and replaced test code with more discrete test cases.
