add dynamic dataset for processing/tokenizing examples lazily #46

Closed
trisongz wants to merge 7 commits

Conversation

@trisongz (Contributor) commented Jan 5, 2021

  • Adds pysimdjson as a requirement for fast, memory-efficient JSON parsing
  • Adds a DynamicDataset class intended for jsonlines files, leveraging TF's TextLineDataset (C++ I/O, num_parallel_reads) as an iterator
  • Tries to automatically determine input files to mitigate user error
  • Determines the total lines in each file to get total_examples
  • Maintains shuffle/seed

Downsides:

  • Text is truncated at max_length

trisongz requested a review from sdtblck and from a code-owner team, January 5, 2021 17:52
@sdtblck (Contributor) commented Jan 5, 2021

Very neat @trisongz !

The only problem I can foresee is that __getitem__'s idx doesn't actually correspond to anything and just calls next on the iterator. Won't this result in repeated data with data parallelism?

Say you have one worker grabbing idxs 0-16, then another grabbing 16-32 (which is how DeepSpeed's data loading works). These are just going to grab the same data regardless, no?

@hypnopump

@sdtblck I think this could be solved with batching?

See: https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset -> .batch() method
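
For reference, a minimal sketch of what letting tf.data do the batching might look like; the file pattern, shuffle buffer, and batch size are placeholders, not values from this PR:

import tensorflow as tf

# Illustrative sketch of hypnopump's suggestion: let tf.data handle batching so
# each fetch returns a contiguous batch of raw JSON lines instead of single lines.
files = tf.data.Dataset.list_files("/data/*.json")            # placeholder glob
lines = tf.data.TextLineDataset(files, num_parallel_reads=4)  # parallel C++ line reads
batched = lines.shuffle(buffer_size=10_000).batch(16)         # batches of raw JSON lines

for batch in batched.take(1):
    # each batch is a tf.Tensor of byte strings, one JSON line per entry
    print(batch.numpy()[:2])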

@trisongz (Contributor, Author)

@sdtblck - This implementation should now work as expected.

Basic test:

from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer
import glob

# DynamicDataset is the class added in this PR (gpt_neox/datasets.py); import path assumed
from gpt_neox.datasets import DynamicDataset

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

input_folder = '/data/*.json'
input_files = glob.glob(input_folder)

train_ds = DynamicDataset(input_files, tokenizer, max_seq_len=512, debug=True) # debug prints stuff.
loader = DataLoader(train_ds, batch_size=4, shuffle=True)

for x, item in enumerate(loader):
    print(item)
    if x > 5:
        break

How it works:

Upon init, it constructs a file index that maps each file to its start and stop index within the total line count across all files.

File IDX Start: 0 - File IDX End: 73
File IDX Start: 74 - File IDX End: 189
File IDX Start: 190 - File IDX End: 253
...
File IDX Start: 11044 - File IDX End: 11083
File IDX Start: 11084 - File IDX End: 11103
File IDX Start: 11104 - File IDX End: 11163
Total Files: 83. Total Lines: 11164

When the dataloader requests an index, the file is found by iterating through the file index to get that file's read function. The line index within the file is then found by subtracting the file's start index from the true index (the index being requested).

True IDX: 10578
File Line IDX: 87 (which means this file starts at IDX 10491)
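
A minimal sketch of that lookup, with hypothetical names and structure rather than the PR's exact code:

# Illustrative sketch of the index lookup described above.
file_index = [
    {"file": "a.json", "start": 0,  "end": 73},
    {"file": "b.json", "start": 74, "end": 189},
    # ... one entry per file, built once at init from each file's line count
]

def locate(true_idx):
    """Map a global (true) index to (file, line index within that file)."""
    for entry in file_index:
        if entry["start"] <= true_idx <= entry["end"]:
            # line index = true index minus the file's start index
            return entry["file"], true_idx - entry["start"]
    raise IndexError(true_idx)

print(locate(87))  # ('b.json', 13)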

Benchmarking on 8 vCPU / 64 GB RAM. Total Local JSON Files: 83. Total Lines: 11164. Shuffle = True, max_length = 512, padding = max_length

# with 0 Workers, Batch Size 4

Epoch 0 - Step 500 - Total Examples: 2004 [Data Its: 501 x Batch Size: 4]
Examples/Sec: 29.97 - Total Time: 66.88 secs
Epoch 0 - Step 1000 - Total Examples: 4004 [Data Its: 1001 x Batch Size: 4]
Examples/Sec: 32.96 - Total Time: 121.49 secs
Epoch 0 - Step 1500 - Total Examples: 6004 [Data Its: 1501 x Batch Size: 4]
Examples/Sec: 32.94 - Total Time: 182.29 secs
Epoch 0 - Step 2000 - Total Examples: 8004 [Data Its: 2001 x Batch Size: 4]
Examples/Sec: 33.40 - Total Time: 239.67 secs
Epoch 0 - Step 2500 - Total Examples: 10004 [Data Its: 2501 x Batch Size: 4]
Examples/Sec: 35.12 - Total Time: 284.85 secs

Completed Epoch 0 - Total Examples: 11164 [Data Its: 2791 x Batch Size: 4 w/ Num Workers: 0]
Total Time: 319.74 secs for Epoch

# with 4 Workers, Batch Size 4

Epoch 0 - Step 500 - Total Examples: 2004 [Data Its: 501 x Batch Size: 4]
Examples/Sec: 40.20 - Total Time: 49.85 secs
Epoch 0 - Step 1000 - Total Examples: 4004 [Data Its: 1001 x Batch Size: 4]
Examples/Sec: 51.46 - Total Time: 77.81 secs
Epoch 0 - Step 1500 - Total Examples: 6004 [Data Its: 1501 x Batch Size: 4]
Examples/Sec: 53.14 - Total Time: 112.99 secs
Epoch 0 - Step 2000 - Total Examples: 8004 [Data Its: 2001 x Batch Size: 4]
Examples/Sec: 54.54 - Total Time: 146.77 secs
Epoch 0 - Step 2500 - Total Examples: 10004 [Data Its: 2501 x Batch Size: 4]
Examples/Sec: 54.51 - Total Time: 183.52 secs

Completed Epoch 0 - Total Examples: 11164 [Data Its: 2791 x Batch Size: 4 w/ Num Workers: 4]
Total Time: 205.98 secs for Epoch

# with 8 Workers, Batch Size 8

Epoch 0 - Step 500 - Total Examples: 4008 [Data Its: 501 x Batch Size: 8]
Examples/Sec: 89.50 - Total Time: 44.78 secs
Epoch 0 - Step 1000 - Total Examples: 8008 [Data Its: 1001 x Batch Size: 8]
Examples/Sec: 89.60 - Total Time: 89.37 secs

Completed Epoch 0 - Total Examples: 11168 [Data Its: 1396 x Batch Size: 8 w/ Num Workers: 8]
Total Time: 121.01 secs for Epoch

# with 16 Workers, Batch Size 32 (CPU @ 100% utilization)

Completed Epoch 0 - Total Examples: 11168 [Data Its: 349 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 68.29 secs for Epoch

Epoch 1 - Step 0 - Total Examples: 11200 [Data Its: 350 x Batch Size: 32]
Examples/Sec: 158.58 - Total Time: 70.63 secs
Completed Epoch 1 - Total Examples: 22336 [Data Its: 698 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 132.57 secs for Epoch

Epoch 2 - Step 0 - Total Examples: 22368 [Data Its: 699 x Batch Size: 32]
Examples/Sec: 165.83 - Total Time: 134.88 secs
Completed Epoch 2 - Total Examples: 33504 [Data Its: 1047 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 197.65 secs for Epoch

Epoch 3 - Step 0 - Total Examples: 33536 [Data Its: 1048 x Batch Size: 32]
Examples/Sec: 167.82 - Total Time: 199.83 secs
Completed Epoch 3 - Total Examples: 44672 [Data Its: 1396 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 262.35 secs for Epoch

Based on initial findings, this implementation should scale with higher worker counts and is not necessarily bottlenecked by I/O. During runtime, only around 2 GB of VM memory was allocated, which suggests this implementation can scale up to larger files as well, although seeking through larger files may prove a tad slower.

@sdtblck (Contributor) commented Jan 12, 2021

Awesome! Yes, this is exactly how I envisioned it. I imagine loading times could be sped up significantly by sharding files as in #52.

Will try to merge both, then benchmark.

@trisongz (Contributor, Author)

Adding token cache implementation, removing implicit padding

  • Adds dynamic chunking through a token cache: overflow tokens from tokenized examples go into a token cache, which is drained first on the next call (sketched below)
  • Every example added to the token_cache is terminated with tokenizer.eos_token_id
  • Removes implicit padding unless the sequence still falls under max_length after filling from the token cache
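
A rough sketch of the chunking described in the bullets above, as a hypothetical helper rather than the PR's exact implementation:

class TokenCacheChunker:
    """Hypothetical helper illustrating the token-cache chunking idea."""

    def __init__(self, tokenizer, max_seq_len):
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len
        self.token_cache = []  # overflow tokens carried over to the next call

    def __call__(self, text):
        # the cache is consumed first, then the freshly tokenized example,
        # terminated with the EOS token
        ids = (self.token_cache
               + self.tokenizer(text)["input_ids"]
               + [self.tokenizer.eos_token_id])
        chunk, self.token_cache = ids[:self.max_seq_len], ids[self.max_seq_len:]
        if len(chunk) < self.max_seq_len:
            # pad only when the sequence is still short after filling from the cache
            chunk += [self.tokenizer.pad_token_id] * (self.max_seq_len - len(chunk))
        return chunk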

Benchmarks - Total Files: 83. Total Lines: 11164

Running Test with Batch Size: 8 - Num Workers: 0

tensor([[  317, 11674,  4570,  ..., 14325,   447,   247],
        [23120,   317, 11674,  ...,    12,    24,  7125],
        [  317, 11674,   940,  ..., 50257, 50257, 50257],
        ...,
        [  604,  2791,   317,  ...,  3701,  5221,  8581],
        [ 8699,   317, 11674,  ...,  5870,    34, 20866],
        [ 8915,   317, 11674,  ...,    13, 25139, 19482]])
tensor([[  838, 12341, 10725,  ...,    18,    13, 43627],
        [ 4764,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,  2481,  ..., 20553,  6469,    32],
        ...,
        [  317, 11674,  6420,  ...,   389,  2672,    11],
        [32158, 12341, 10725,  ...,   257,  6074,   329],
        [  317, 11674,  2624,  ..., 17406,   515,  4149]])
tensor([[ 5550, 30709, 10979,  ..., 50257, 50257, 50257],
        [ 2608,   317, 11674,  ...,    17,   737,   383],
        [  362,   317, 11674,  ..., 30579,   422,   262],
        ...,
        [ 1987,   317, 11674,  ...,   357,  4177,  5603],
        [ 2026,   317, 11674,  ...,   355,   890,   355],
        [12341, 10725,  2623,  ..., 50257, 50257, 50257]])
Epoch 0 - Step 500 - Total Examples: 4008 [Data Its: 501 x Batch Size: 8]
Examples/Sec: 36.19 - Total Time: 110.75 secs
Epoch 0 - Step 1000 - Total Examples: 8008 [Data Its: 1001 x Batch Size: 8]
Examples/Sec: 34.52 - Total Time: 232.01 secs
Completed Epoch 0 - Total Examples: 11168 [Data Its: 1396 x Batch Size: 8 w/ Num Workers: 0]
Total Time: 329.24 secs for Epoch

Running Test with Batch Size: 8 - Num Workers: 8

tensor([[  317, 11674,  2623,  ..., 50257, 50257, 50257],
        [12341, 10725,  2623,  ..., 50257, 50257, 50257],
        [32158,   317, 11674,  ...,   902,    13,  8655],
        ...,
        [ 5433, 12341, 10725,  ...,   383,   779,   286],
        [  220,   220,   220,  ...,   220,   220,   220],
        [42756,    40,   940,  ..., 50257, 50257, 50257]])
tensor([[38147,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,  1157,  ...,   364,  1276, 19361],
        [34951,   317, 11674,  ...,   317,    18,    13],
        ...,
        [  317, 11674,  2623,  ..., 13614, 20537,  3047],
        [  220,   220,   220,  ...,   220,   220,   220],
        [  317, 11674,  2623,  ..., 50257, 50257, 50257]])
tensor([[  807,   317, 11674,  ...,   357,   273,  7548],
        [  317, 11674,  4101,  ...,   284,   307, 28308],
        [12341, 10725,  2623,  ..., 50257, 50257, 50257],
        ...,
        [38158,   317, 11674,  ...,  7603,   262,  9016],
        [21761,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,   940,  ..., 39373,  4892, 22987]])
Epoch 0 - Step 500 - Total Examples: 4008 [Data Its: 501 x Batch Size: 8]
Examples/Sec: 83.37 - Total Time: 48.08 secs
Epoch 0 - Step 1000 - Total Examples: 8008 [Data Its: 1001 x Batch Size: 8]
Examples/Sec: 86.50 - Total Time: 92.58 secs
Completed Epoch 0 - Total Examples: 11168 [Data Its: 1396 x Batch Size: 8 w/ Num Workers: 8]
Total Time: 124.90 secs for Epoch

Running Test with Batch Size: 32 - Num Workers: 16

tensor([[  317, 11674,  4570,  ..., 13771,  2142,   317],
        [  220,   220,   220,  ...,   220,   220,   220],
        [ 4764,   317, 11674,  ...,  1398,  8224,    11],
        ...,
        [19035,   317, 11674,  ..., 50257, 50257, 50257],
        [  317, 11674,  2598,  ...,    12, 41355, 15119],
        [ 1160,   317, 11674,  ..., 25793,  2236,   307]])
tensor([[12341, 10725,  1954,  ..., 50257, 50257, 50257],
        [  317, 11674,  2623,  ...,  7266, 18797, 12931],
        [  317, 11674,  1157,  ...,  8472,    11,  9851],
        ...,
        [  317, 11674,  2623,  ..., 50257, 50257, 50257],
        [27253,   317, 11674,  ..., 50257, 50257, 50257],
        [ 6740, 12341, 10725,  ...,  1626,   262,  1957]])
tensor([[  317, 11674,  1485,  ..., 50257, 50257, 50257],
        [  220,   220,   220,  ...,   220,   220,   220],
        [  317, 11674,  2481,  ..., 10435,  3722,   314],
        ...,
        [22613, 12341, 10725,  ...,    14,  1186,   557],
        [12341, 10725,  2996,  ...,  3328,   257,  4217],
        [  317, 11674,  4101,  ..., 50257, 50257, 50257]])
Completed Epoch 0 - Total Examples: 11168 [Data Its: 349 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 77.89 secs for Epoch
Epoch 1 - Step 0 - Total Examples: 11200 [Data Its: 350 x Batch Size: 32]
Examples/Sec: 138.93 - Total Time: 80.62 secs
Epoch 2 - Step 0 - Total Examples: 22368 [Data Its: 699 x Batch Size: 32]
Examples/Sec: 149.11 - Total Time: 150.01 secs
Epoch 3 - Step 0 - Total Examples: 33536 [Data Its: 1048 x Batch Size: 32]
Examples/Sec: 149.51 - Total Time: 224.31 secs

The new implementation should maintain the same performance as before.

@trisongz (Contributor, Author) commented Jan 13, 2021

@sdtblck

The latest commit fixes issues with the FastTokenizer and adds stitching of short text sequences to further minimize padding.

Completed Epoch 0 - Total Examples: 11168 [Data Its: 349 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 23.67 secs for Epoch
tensor([[ 4521,   317, 11674,  ...,  3682,    11,   290],
        [17827,   317, 11674,  ...,   286,  6723, 47503],
        [ 8579,    40,  6420,  ...,  7406, 16588, 15615],
        ...,
        [  940,   317, 11674,  ...,   290, 36220,    14],
        [ 2548, 12341, 10725,  ...,   642, 40410,  4146],
        [10232,   317, 11674,  ...,   290,    14,   273]])
Epoch 1 - Step 0 - Total Examples: 11200 [Data Its: 350 x Batch Size: 32]
Examples/Sec: 454.25 - Total Time: 24.66 secs
tensor([[  940,   317, 11674,  ...,    11,  1630,    11],
        [   17,   317, 11674,  ...,   807,  2857,   422],
        [ 8579,    40,  2623,  ...,  1222,  9027,    13],
        ...,
        [   35,  3727,    40,  ...,   287,   543,   284],
        [ 4051,   317, 11674,  ...,  1459,  3722,   286],
        [18298, 12341, 10725,  ...,  6266,   284,   262]])
tensor([[16562,   317, 11674,  ...,   329,  2563,    12],
        [ 8579, 10725,  3901,  ...,  1179,  1352,   379],
        [ 4051, 12341, 10725,  ...,   362,    13,    23],
        ...,
        [ 8579,    40,  3829,  ...,   262, 29857,   378],
        [ 8579,    40,   940,  ...,   329,  7216, 10405],
        [23815,   317, 11674,  ...,   307, 10911,   351]])
bad idx - getting random
bad idx - getting random
bad idx - getting random
Completed Epoch 1 - Total Examples: 22336 [Data Its: 698 x Batch Size: 32 w/ Num Workers: 16]
Total Time: 42.42 secs for Epoch
Epoch 2 - Step 0 - Total Examples: 22368 [Data Its: 699 x Batch Size: 32]
Examples/Sec: 514.33 - Total Time: 43.49 secs

Using the FastTokenizer, there is a 3-4x speedup, from ~150 examples/sec to 500+ examples/sec with the same params.
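
For reference, a minimal sketch of that swap, assuming the same DynamicDataset usage (and import path) as in the earlier test:

import glob
from transformers import GPT2TokenizerFast
from gpt_neox.datasets import DynamicDataset  # import path assumed, as in the test above

# Roughly, the only change needed for the reported speedup is swapping in the
# Rust-backed fast tokenizer; everything else matches the earlier test.
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

input_files = glob.glob('/data/*.json')
train_ds = DynamicDataset(input_files, tokenizer, max_seq_len=512)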

@StellaAthena (Member)

I received the following error when trying to integrate this into the training code. My modifications can be found here

Trying example
    for i, data in enumerate(train_loader):
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/dataloader.py", line 83, in __next__
    return next(self.data)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/deepspeed/runtime/dataloader.py", line 99, in <genexpr>
    self.data = (x for x in self.dataloader)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
get some ;)
Trying example
    data = self._next_data()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/clone/stella-test/gpt-neox/gpt_neox/datasets.py", line 259, in __getitem__
    return self.tokenize_example(self.parse_json(ex.strip()))
  File "/root/clone/stella-test/gpt-neox/gpt_neox/datasets.py", line 244, in tokenize_example
    tokenized = self.tokenizer(ex, max_length=self.max_seq_len, truncation=True, return_overflowing_tokens=True)
  File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2320, in __call__
    "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
AssertionError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

@trisongz (Contributor, Author)

What does the example you're passing to it look like?

@StellaAthena (Member)

Superseded by codebase refactoring.

Successfully merging this pull request may close these issues:

  • Fix tfrecord dataset to load less files into memory
  • Write dataset class that tokenizes on the fly