add dynamic dataset for processing/tokenizing examples lazily #46

Closed
trisongz wants to merge 7 commits

Conversation

@trisongz (Contributor) commented Jan 5, 2021

  • Adds pysimdjson as a requirement for fast, memory-efficient JSON parsing
  • Adds a DynamicDataset class intended for jsonlines files, leveraging TF's TextLineDataset (C++ I/O, num_parallel_reads) as an iterator
  • Tries to automatically determine input files to mitigate user error
  • Determines the total lines in each file to get total_examples
  • Maintains shuffle/seed

Downsides:

  • Text is truncated at max_length

trisongz requested a review from sdtblck and from a code-owner team, January 5, 2021 17:52
@sdtblck (Contributor) commented Jan 5, 2021

Very neat @trisongz !

The only problem I can foresee is that __getitem__'s idx doesn't actually correspond to anything and just calls next on the iterator. Won't this result in repeated data with data parallelism?

Say you have one worker grabbing idxs 0-16, then another grabbing 16-32 (which is how DeepSpeed's data loading works). These are just going to grab the same data regardless, no?

@hypnopump

@sdtblck I think this could be solved with batching?

See: https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset -> .batch() method
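
For reference, a minimal sketch of what letting tf.data do the batching might look like; the file pattern, shuffle buffer, and batch size are placeholders, not values from this PR:

import tensorflow as tf

# Illustrative sketch of hypnopump's suggestion: let tf.data handle batching so
# each fetch returns a contiguous batch of raw JSON lines instead of single lines.
files = tf.data.Dataset.list_files("/data/*.json")            # placeholder glob
lines = tf.data.TextLineDataset(files, num_parallel_reads=4)  # parallel C++ line reads
batched = lines.shuffle(buffer_size=10_000).batch(16)         # batches of raw JSON lines

for batch in batched.take(1):
    # each batch is a tf.Tensor of byte strings, one JSON line per entry
    print(batch.numpy()[:2])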

@trisongz (Contributor, Author)

@sdtblck - This implementation should now work as expected.

Basic test:

from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer
import glob

# DynamicDataset is the class added in this PR (gpt_neox/datasets.py); import path assumed
from gpt_neox.datasets import DynamicDataset

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

input_folder = '/data/*.json'
input_files = glob.glob(input_folder)

train_ds = DynamicDataset(input_files, tokenizer, max_seq_len=512, debug=True) # debug prints stuff.
loader = DataLoader(train_ds, batch_size=4, shuffle=True)

for x, item in enumerate(loader):
    print(item)
    if x > 5:
        break

How it works:

Upon init, it constructs a file index that maps each file to its start and stop index within the total line count across all files.

File IDX Start: 0 - File IDX End: 73
File IDX Start: 74 - File IDX End: 189
File IDX Start: 190 - File IDX End: 253
...
File IDX Start: 11044 - File IDX End: 11083
File IDX Start: 11084 - File IDX End: 11103
File IDX Start: 11104 - File IDX End: 11163
Total Files: 83. Total Lines: 11164

When the dataloader requests an index, the file is found by iterating through the file index to get that file's read function. The line index within the file is then found by subtracting the file's start index from the true index (the index being requested).

True IDX: 10578
File Line IDX: 87 (which means this file starts at IDX 10491)
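
A minimal sketch of that lookup, with hypothetical names and structure rather than the PR's exact code:

# Illustrative sketch of the index lookup described above.
file_index = [
    {"file": "a.json", "start": 0,  "end": 73},
    {"file": "b.json", "start": 74, "end": 189},
    # ... one entry per file, built once at init from each file's line count
]

def locate(true_idx):
    """Map a global (true) index to (file, line index within that file)."""
    for entry in file_index:
        if entry["start"] <= true_idx <= entry["end"]:
            # line index = true index minus the file's start index
            return entry["file"], true_idx - entry["start"]
    raise IndexError(true_idx)

print(locate(87))  # ('b.json', 13)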

Benchmarking on 8 vCPU / 64 GB RAM. Total Local JSON Files: 83. Total Lines: 11164. Shuffle = True, max_length = 512, padding = max_length

# with 0 Workers, Batch Size 4

Epoch 0 - Step 500 - Total Examples: 2004 [Data Its: 501 x Batch Size: 4]
Examples/Sec: 29.97 - Total Time: 66.88 secs
Epoch 0 - Step 1000 - Total Examples: 4004 [Data Its: 1001 x Batch Size: 4]
Examples/Sec: 32.96 - Total Time: 121.49 secs
Epoch 0 - Step 1500 - Total Examples: 6004 [Data Its: 1501 x Batch Size: 4]
Examples/Sec: 32.94 - Total Time: 182.29 secs
Epoch 0 - Step 2000 - Total Examples: 8004 [Data Its: 2001 x Batch Size: 4]
Examples/Sec: 33.40 - Total Time: 239.67 secs
Epoch 0 - Step 2500 - Total Examples: 10004 [Data Its: 2501 x Batch Size: 4]
Examples/Sec: 35.12 - Total Time: 284.85 secs

Completed Epoch 0 - Total Examples: 11164 [Data Its: 2791 x Batch Size: 4 w/ Num Workers: 0]
Total Time: 319.74 secs for Epoch

# with 4 Workers, Batch Size 4

Epoch 0 - Step 500 - Total Examples: 2004 [Data Its: 501 x Batch Size: 4]
Examples/Sec: 40.20 - Total Time: 49.85 secs
Epoch 0 - Step 1000 - Total Examples: 4004 [Data Its: 1001 x Batch Size: 4]
Examples/Sec: 51.46 - Total Time: 77.81 secs
Epoch 0 - Step 1500 - Total Examples: 6004 [Data Its: 1501 x Batch Size: 4]
Examples/Sec: 53.14 - Total Time: 112.99 secs
Epoch 0 - Step 2000 - Total Examples: 8004 [Data Its: 2001 x Batch Size: 4]
Examples/Sec: 54.54 - Total Time: 146.77 secs
Epoch 0 - Step 2500 - Total Examples: 10004 [Data Its: 2501 x Batch Size: 4]
Examples/Sec: 54.51 - Total Time: 183.52 secs

Completed Epoch 0 - Total Examples: 11164 [Data Its: 2791 x Batch Size: 4 w/ Num Workers: 4]
Total Time: 205.98 secs for Epoch

# with 8 Workers, Batch Size 8

Epoch 0 - Step 500 - Total Examples: 4008 [Data Its: 501 x Batch Size: 8]
Examples/Sec: 89.50 - Total Time: 44.78 secs
Epoch 0 - Step 1000 - Total Examples: 8008 [Data Its: 1001 x Batch Size: 8]
Examples/Sec: 89.60 - Total Time: 89.37 secs

Completed Epoch 0 - Total Examples: 11168 [Data Its: 1396 x Batch Size: 8 w/ Num Workers: 8]
Total Time: 121.01 secs for Epoch

# with 16 Workers, Batch Size 32 (CPU @ 100% utilization)

Completed Epoch 0 - Total Examples: 11168 [Data Its: 349 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 68.29 secs for Epoch

Epoch 1 - Step 0 - Total Examples: 11200 [Data Its: 350 x Batch Size: 32]
Examples/Sec: 158.58 - Total Time: 70.63 secs
Completed Epoch 1 - Total Examples: 22336 [Data Its: 698 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 132.57 secs for Epoch

Epoch 2 - Step 0 - Total Examples: 22368 [Data Its: 699 x Batch Size: 32]
Examples/Sec: 165.83 - Total Time: 134.88 secs
Completed Epoch 2 - Total Examples: 33504 [Data Its: 1047 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 197.65 secs for Epoch

Epoch 3 - Step 0 - Total Examples: 33536 [Data Its: 1048 x Batch Size: 32]
Examples/Sec: 167.82 - Total Time: 199.83 secs
Completed Epoch 3 - Total Examples: 44672 [Data Its: 1396 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 262.35 secs for Epoch

Based on initial findings, this implementation should scale with higher worker counts and is not necessarily bottlenecked by I/O. During runtime, only around 2 GB of VM memory was allocated, which suggests this implementation can scale up to larger files as well, although seeking through larger files may prove a tad slower.

@sdtblck (Contributor) commented Jan 12, 2021

Awesome! Yes, this is exactly how I envisioned it. I imagine loading times could be sped up significantly by sharding files as in #52.

Will try to merge both, then benchmark.

@trisongz (Contributor, Author)

Adding token cache implementation, removing implicit padding

  • Adds dynamic chunking through a token cache: overflow tokens from tokenized examples go into a token cache, which is drained first on the next call (sketched below)
  • Every example added to the token_cache is terminated with tokenizer.eos_token_id
  • Removes implicit padding unless the sequence still falls under max_length after filling from the token cache
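
A rough sketch of the chunking described in the bullets above, as a hypothetical helper rather than the PR's exact implementation:

class TokenCacheChunker:
    """Hypothetical helper illustrating the token-cache chunking idea."""

    def __init__(self, tokenizer, max_seq_len):
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len
        self.token_cache = []  # overflow tokens carried over to the next call

    def __call__(self, text):
        # the cache is consumed first, then the freshly tokenized example,
        # terminated with the EOS token
        ids = (self.token_cache
               + self.tokenizer(text)["input_ids"]
               + [self.tokenizer.eos_token_id])
        chunk, self.token_cache = ids[:self.max_seq_len], ids[self.max_seq_len:]
        if len(chunk) < self.max_seq_len:
            # pad only when the sequence is still short after filling from the cache
            chunk += [self.tokenizer.pad_token_id] * (self.max_seq_len - len(chunk))
        return chunk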

Benchmarks - Total Files: 83. Total Lines: 11164

Running Test with Batch Size: 8 - Num Workers: 0

tensor([[  317, 11674,  4570,  ..., 14325,   447,   247],
        [23120,   317, 11674,  ...,    12,    24,  7125],
        [  317, 11674,   940,  ..., 50257, 50257, 50257],
        ...,
        [  604,  2791,   317,  ...,  3701,  5221,  8581],
        [ 8699,   317, 11674,  ...,  5870,    34, 20866],
        [ 8915,   317, 11674,  ...,    13, 25139, 19482]])
tensor([[  838, 12341, 10725,  ...,    18,    13, 43627],
        [ 4764,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,  2481,  ..., 20553,  6469,    32],
        ...,
        [  317, 11674,  6420,  ...,   389,  2672,    11],
        [32158, 12341, 10725,  ...,   257,  6074,   329],
        [  317, 11674,  2624,  ..., 17406,   515,  4149]])
tensor([[ 5550, 30709, 10979,  ..., 50257, 50257, 50257],
        [ 2608,   317, 11674,  ...,    17,   737,   383],
        [  362,   317, 11674,  ..., 30579,   422,   262],
        ...,
        [ 1987,   317, 11674,  ...,   357,  4177,  5603],
        [ 2026,   317, 11674,  ...,   355,   890,   355],
        [12341, 10725,  2623,  ..., 50257, 50257, 50257]])
Epoch 0 - Step 500 - Total Examples: 4008 [Data Its: 501 x Batch Size: 8]
Examples/Sec: 36.19 - Total Time: 110.75 secs
Epoch 0 - Step 1000 - Total Examples: 8008 [Data Its: 1001 x Batch Size: 8]
Examples/Sec: 34.52 - Total Time: 232.01 secs
Completed Epoch 0 - Total Examples: 11168 [Data Its: 1396 x Batch Size: 8 w/ Num Workers: 0]
Total Time: 329.24 secs for Epoch

Running Test with Batch Size: 8 - Num Workers: 8

tensor([[  317, 11674,  2623,  ..., 50257, 50257, 50257],
        [12341, 10725,  2623,  ..., 50257, 50257, 50257],
        [32158,   317, 11674,  ...,   902,    13,  8655],
        ...,
        [ 5433, 12341, 10725,  ...,   383,   779,   286],
        [  220,   220,   220,  ...,   220,   220,   220],
        [42756,    40,   940,  ..., 50257, 50257, 50257]])
tensor([[38147,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,  1157,  ...,   364,  1276, 19361],
        [34951,   317, 11674,  ...,   317,    18,    13],
        ...,
        [  317, 11674,  2623,  ..., 13614, 20537,  3047],
        [  220,   220,   220,  ...,   220,   220,   220],
        [  317, 11674,  2623,  ..., 50257, 50257, 50257]])
tensor([[  807,   317, 11674,  ...,   357,   273,  7548],
        [  317, 11674,  4101,  ...,   284,   307, 28308],
        [12341, 10725,  2623,  ..., 50257, 50257, 50257],
        ...,
        [38158,   317, 11674,  ...,  7603,   262,  9016],
        [21761,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,   940,  ..., 39373,  4892, 22987]])
Epoch 0 - Step 500 - Total Examples: 4008 [Data Its: 501 x Batch Size: 8]
Examples/Sec: 83.37 - Total Time: 48.08 secs
Epoch 0 - Step 1000 - Total Examples: 8008 [Data Its: 1001 x Batch Size: 8]
Examples/Sec: 86.50 - Total Time: 92.58 secs
Completed Epoch 0 - Total Examples: 11168 [Data Its: 1396 x Batch Size: 8 w/ Num Workers: 8]
Total Time: 124.90 secs for Epoch

Running Test with Batch Size: 32 - Num Workers: 16

tensor([[  317, 11674,  4570,  ..., 13771,  2142,   317],
        [  220,   220,   220,  ...,   220,   220,   220],
        [ 4764,   317, 11674,  ...,  1398,  8224,    11],
        ...,
        [19035,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,  2598,  ...,    12, 41355, 15119],
        [ 1160,   317, 11674,  ..., 25793,  2236,   307]])
tensor([[12341, 10725,  1954,  ..., 50257, 50257, 50257],
        [  317, 11674,  2623,  ...,  7266, 18797, 12931],
        [  317, 11674,  1157,  ...,  8472,    11,  9851],
        ...,
        [  317, 11674,  2623,  ..., 50257, 50257, 50257],
        [27253,   317, 11674,  ..., 50257, 50257, 50257],
        [ 6740, 12341, 10725,  ...,  1626,   262,  1957]])
tensor([[  317, 11674,  1485,  ..., 50257, 50257, 50257],
        [  220,   220,   220,  ...,   220,   220,   220],
        [  317, 11674,  2481,  ..., 10435,  3722,   314],
        ...,
        [22613, 12341, 10725,  ...,    14,  1186,   557],
        [12341, 10725,  2996,  ...,  3328,   257,  4217],
        [  317, 11674,  4101,  ..., 50257, 50257, 50257]])
Completed Epoch 0 - Total Examples: 11168 [Data Its: 349 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 77.89 secs for Epoch
Epoch 1 - Step 0 - Total Examples: 11200 [Data Its: 350 x Batch Size: 32]
Examples/Sec: 138.93 - Total Time: 80.62 secs
Epoch 2 - Step 0 - Total Examples: 22368 [Data Its: 699 x Batch Size: 32]
Examples/Sec: 149.11 - Total Time: 150.01 secs
Epoch 3 - Step 0 - Total Examples: 33536 [Data Its: 1048 x Batch Size: 32]
Examples/Sec: 149.51 - Total Time: 224.31 secs

The new implementation should maintain the same performance as before.

@trisongz (Contributor, Author) commented Jan 13, 2021

@sdtblck

The latest commit fixes issues with the FastTokenizer and adds stitching of short text sequences to further minimize padding.

Completed Epoch 0 - Total Examples: 11168 [Data Its: 349 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 23.67 secs for Epoch
tensor([[ 4521,   317, 11674,  ...,  3682,    11,   290],
        [17827,   317, 11674,  ...,   286,  6723, 47503],
        [ 8579,    40,  6420,  ...,  7406, 16588, 15615],
        ...,
        [  940,   317, 11674,  ...,   290, 36220,    14],
        [ 2548, 12341, 10725,  ...,   642, 40410,  4146],
        [10232,   317, 11674,  ...,   290,    14,   273]])
Epoch 1 - Step 0 - Total Examples: 11200 [Data Its: 350 x Batch Size: 32]
Examples/Sec: 454.25 - Total Time: 24.66 secs
tensor([[  940,   317, 11674,  ...,    11,  1630,    11],
        [   17,   317, 11674,  ...,   807,  2857,   422],
        [ 8579,    40,  2623,  ...,  1222,  9027,    13],
        ...,
        [   35,  3727,    40,  ...,   287,   543,   284],
        [ 4051,   317, 11674,  ...,  1459,  3722,   286],
        [18298, 12341, 10725,  ...,  6266,   284,   262]])
tensor([[16562,   317, 11674,  ...,   329,  2563,    12],
        [ 8579, 10725,  3901,  ...,  1179,  1352,   379],
        [ 4051, 12341, 10725,  ...,   362,    13,    23],
        ...,
        [ 8579,    40,  3829,  ...,   262, 29857,   378],
        [ 8579,    40,   940,  ...,   329,  7216, 10405],
        [23815,   317, 11674,  ...,   307, 10911,   351]])
bad idx - getting random
bad idx - getting random
bad idx - getting random
Completed Epoch 1 - Total Examples: 22336 [Data Its: 698 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 42.42 secs for Epoch
Epoch 2 - Step 0 - Total Examples: 22368 [Data Its: 699 x Batch Size: 32]
Examples/Sec: 514.33 - Total Time: 43.49 secs

Using the FastTokenizer, there is a 3-4x speedup, from ~150 examples/sec to 500+ examples/sec with the same params.
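
For reference, a minimal sketch of that swap, assuming the same DynamicDataset usage (and import path) as in the earlier test:

import glob
from transformers import GPT2TokenizerFast
from gpt_neox.datasets import DynamicDataset  # import path assumed, as in the test above

# Roughly, the only change needed for the reported speedup is swapping in the
# Rust-backed fast tokenizer; everything else matches the earlier test.
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

input_files = glob.glob('/data/*.json')
train_ds = DynamicDataset(input_files, tokenizer, max_seq_len=512)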

@StellaAthena (Member)

I received the following error when trying to integrate this into the training code. My modifications can be found here

Trying example
    for i, data in enumerate(train_loader):
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/dataloader.py", line 83, in __next__
    return next(self.data)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/dataloader.py", line 99, in <genexpr>
    self.data = (x for x in self.dataloader)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
get some ;)
Trying example
    data = self._next_data()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/clone/stella-test/gpt-neox/gpt_neox/datasets.py", line 259, in __getitem__
    return self.tokenize_example(self.parse_json(ex.strip()))
  File "/root/clone/stella-test/gpt-neox/gpt_neox/datasets.py", line 244, in tokenize_example
    tokenized = self.tokenizer(ex, max_length=self.max_seq_len, truncation=True, return_overflowing_tokens=True)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2320, in __call__
    "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
AssertionError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

@trisongz (Contributor, Author)

What does the example you're passing to it look like?

@StellaAthena (Member)

Superseded by codebase refactoring.

Successfully merging this pull request may close these issues:

  • Fix tfrecord dataset to load less files into memory
  • Write dataset class that tokenizes on the fly