MidiDataset optimizations #68
Merged
11 commits
0230715  MidiDataset can initialize with an iterator and only expand when nece… (honglu2875)
c1d170d  reduce some memory overhead (we are starting to have >100k MidiDict a… (honglu2875)
deec979  classmethod+property is better... (honglu2875)
6e58f09  remove functools import (honglu2875)
00cef03  Merge branch 'EleutherAI:main' into dev (honglu2875)
7073a1a  use separate workers to build dataset instead of process pool (honglu2875)
69922b4  merge (honglu2875)
8bbc14b  add jsonl.zst support; unit test; fix bug (honglu2875)
95f492d  receive context length via commandline. It's more convenient than dig… (honglu2875)
93301fa  fix a minor output format mismatch when grad_checkpoint is true (honglu2875)
607c438  format and small changes (loubbrad)
add jsonl.zst support; unit test; fix bug
commit 8bbc14b2f7498db3fd8f079a7531a80036b3c33b
@@ -0,0 +1,71 @@
import builtins
import contextlib
import io
import zstandard
import jsonlines
import json


class Reader:
    """Reader for the jsonl.zst format."""

    def __init__(self, path: str):
        """Initializes the reader.

        Args:
            path (str): Path to the file.
        """
        self.path = path

    def __iter__(self):
        with builtins.open(self.path, 'rb') as fh:
            cctx = zstandard.ZstdDecompressor()
            reader = io.BufferedReader(cctx.stream_reader(fh))
            yield from jsonlines.Reader(reader)


class Writer:
    """Writer for the jsonl.zst format."""

    def __init__(self, path: str):
        """Initializes the writer.

        Args:
            path (str): Path to the file.
        """
        self.path = path

    def __enter__(self):
        self.fh = builtins.open(self.path, 'wb')
        self.cctx = zstandard.ZstdCompressor()
        self.compressor = self.cctx.stream_writer(self.fh)
        return self

    def write(self, obj):
        self.compressor.write(json.dumps(obj).encode('UTF-8') + b'\n')

    def __exit__(self, exc_type, exc_value, traceback):
        self.compressor.flush(zstandard.FLUSH_FRAME)
        self.fh.flush()
        self.compressor.close()
        self.fh.close()


@contextlib.contextmanager
def open(path: str, mode: str = "r"):
    """Read/Write a jsonl.zst file.

    Args:
        path (str): Path to the file.
        mode (str): Mode to open the file in. Only 'r' and 'w' are supported.

    Returns:
        Reader or Writer: Reader if mode is 'r', Writer if mode is 'w'.
    """
    if mode == 'r':
        yield Reader(path)
    elif mode == 'w':
        with Writer(path) as writer:
            yield writer
    else:
        raise ValueError(f"Unsupported mode '{mode}'")
I'm guessing that you are using this when building HF datasets. If so, I'm happy to add jsonl.zst functionality for streaming MidiDataset and TokenizedDatasets.
The only blocker to replacing everything with the jsonl.zst reader/writer was that it doesn't allow for mmap indexing: you can't jump to a position and retrieve its content directly without first reading everything before it. I don't know if that's a theoretical impossibility or just an implementation limit of zstandard.
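
On the random-access question: a single compressed stream does have to be decoded from its start, but a file can also be built from independently compressed frames plus an offset index, which allows direct lookups. The following is a minimal sketch of that idea, not something this PR implements. The helper names `write_indexed` and `read_record` are hypothetical, and `zlib` stands in as the codec so the sketch runs on the standard library alone.

```python
import json
import os
import tempfile
import zlib


def write_indexed(path, records, compress=zlib.compress):
    """Write each record as its own compressed frame and return byte offsets.

    offsets[i] is where frame i starts; the last entry is the total file size.
    """
    offsets = [0]
    with open(path, "wb") as fh:
        for obj in records:
            # Each call compresses one record independently of the others.
            frame = compress(json.dumps(obj).encode("utf-8"))
            fh.write(frame)
            offsets.append(offsets[-1] + len(frame))
    return offsets


def read_record(path, offsets, i, decompress=zlib.decompress):
    """Fetch record i by seeking to its frame; earlier frames are never read."""
    start, end = offsets[i], offsets[i + 1]
    with open(path, "rb") as fh:
        fh.seek(start)
        return json.loads(decompress(fh.read(end - start)))


def demo():
    """Round-trip a few records and look one up by index."""
    records = [{"id": n, "notes": list(range(n))} for n in range(5)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.bin")
        offsets = write_indexed(path, records)
        return read_record(path, offsets, 3), offsets
```

With the zstandard package, `zstandard.ZstdCompressor().compress` and `zstandard.ZstdDecompressor().decompress` should slot in as the two callables, since each `compress()` call emits one complete frame. The trade-off is compression ratio: one frame per record compresses worse than a single stream, so grouping several records per frame is a common compromise.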