add dynamic dataset for processing/tokenizing examples lazily #46
Conversation
- Adds pysimdjson as a requirement for memory-efficient and fast JSON parsing.
- Adds a DynamicDataset class intended for jsonlines files, leveraging TF's TextLineDataset C++ I/O as an iterator.
Very neat @trisongz! The only problem I can foresee is that get_item's idx doesn't actually correspond to anything and just calls next on the iterator; won't this result in repeated data with data parallelism? Say you have one worker grabbing idxs 0-16 and another grabbing 16-32 (which is how DeepSpeed's dataloading works). These are just going to grab the same data regardless, no?
@sdtblck I think this could be solved with batching? See the .batch() method: https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset
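A minimal sketch of that suggestion, assuming tf.data handles both line reading and batching so each consumer pulls whole batches rather than arbitrary indices (the file paths and batch size here are illustrative, not from the PR):

```python
import tensorflow as tf

files = ["/data/shard_0.jsonl", "/data/shard_1.jsonl"]  # assumed file paths
ds = tf.data.TextLineDataset(files)  # C++-backed line reader
ds = ds.batch(16)                    # yield fixed-size batches of raw JSON lines

for batch in ds.take(2):
    # batch is a tensor of byte strings, one JSON line per element
    print(batch.numpy()[:2])
```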
@sdtblck - This implementation should now work as expected. Basic test:

```python
from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer
import glob

# DynamicDataset is the class added in this PR; import it from wherever it lives in the repo.

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

input_folder = '/data/*.json'
input_files = glob.glob(input_folder)

train_ds = DynamicDataset(input_files, tokenizer, max_seq_len=512, debug=True)  # debug prints extra info
loader = DataLoader(train_ds, batch_size=4, shuffle=True)

for x, item in enumerate(loader):
    print(item)
    if x > 5:
        break
```

How it works: upon init, the dataset constructs a file index that maps each file to its start and stop indices relative to the total number of lines across all files.
When the dataloader requests an index, the file is found by iterating through the file index to get that file's read function. The line index within the file is then found by subtracting the file's start index from the true index (the index being requested).
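A minimal sketch of that lookup; the names (file_index, read_fn) and the tuple layout are assumptions for illustration, not the PR's exact code:

```python
from typing import Callable, List, Tuple

# Each entry: (start_idx, stop_idx, read_fn), where read_fn(line_idx) returns one raw line.
FileEntry = Tuple[int, int, Callable[[int], str]]

def lookup(file_index: List[FileEntry], true_idx: int) -> str:
    """Map a global example index to a (file, line) pair and return the raw line."""
    for start, stop, read_fn in file_index:
        if start <= true_idx < stop:
            # line index within the file = global index minus the file's start index
            return read_fn(true_idx - start)
    raise IndexError(f"index {true_idx} is outside the total line count")
```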
Benchmarking on 8 vCPU / 64 GB RAM. Total local JSON files: 83. Total lines: 11164. shuffle=True, max_length=512, padding='max_length'.
Based on initial findings, this implementation should scale with higher worker counts and is not necessarily bottlenecked by I/O. During runtime only around 2 GB of VM memory was allocated, which suggests this implementation can also scale to larger files, although seeking through larger files may prove somewhat slower.
Awesome! Yes, this is exactly how I envisioned it. I imagine loading times could be sped up significantly by sharding files as in #52. Will try to merge both, then benchmark.
- Adds dynamic chunking through a token cache: overflow tokens from tokenized examples go into the token cache, which is drawn from first on the next call.
- Every example added to token_cache is appended with tokenizer.eos_token_id.
- Removes implicit padding unless the sequence falls under max_length after filling from the token cache.
Adding token cache implementation, removing implicit padding
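A rough sketch of the token-cache idea described above, under assumed names and structure (not the PR's exact implementation): overflow tokens from one example are stashed and consumed first when building the next sequence, so sequences stay close to max_seq_len without padding every example.

```python
from typing import List

class TokenCache:
    def __init__(self, max_seq_len: int, eos_token_id: int):
        self.max_seq_len = max_seq_len
        self.eos_token_id = eos_token_id
        self.cache: List[int] = []

    def build_sequence(self, new_tokens: List[int]) -> List[int]:
        # every example pushed into the cache is terminated with EOS
        self.cache.extend(new_tokens + [self.eos_token_id])
        # take up to max_seq_len tokens; keep the overflow for the next call
        seq, self.cache = self.cache[: self.max_seq_len], self.cache[self.max_seq_len :]
        return seq  # may be shorter than max_seq_len; pad only in that case
```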
Benchmarks - Total Files: 83. Total Lines: 11164
Running Test with Batch Size: 8 - Num Workers: 0
Running Test with Batch Size: 8 - Num Workers: 8
Running Test with Batch Size: 32 - Num Workers: 16
The new implementation should maintain the same performance as before.
The latest commit fixes issues with FastTokenizer and adds stitching of short text sequences to further minimize padding.
Using FastTokenizer, there is a 3-4x speedup, from 150 examples/sec to 500+ examples/sec with the same params.
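For reference, the fast-tokenizer swap in the earlier test script would look roughly like this (a sketch; the rest of the script is assumed unchanged):

```python
from transformers import GPT2TokenizerFast

# Rust-backed fast tokenizer; drop-in replacement for GPT2Tokenizer in the test above
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
```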
I received the following error when trying to integrate this into the training code. My modifications can be found here
What does the example you're passing to it look like?
Superseded by codebase refactoring.
Downsides: