
MidiDataset optimizations #68

Merged: 11 commits into EleutherAI:main on Nov 22, 2023
Conversation

honglu2875 (Contributor) commented Nov 9, 2023

  • MidiDataset can now be initialized with an iterator and is only expanded when necessary (on len() or random access).
  • Reduced some memory/initialization overhead by changing the program_to_instrument dict into a getter function (we are starting to have >100k MidiDataset entries and could have more in the future).

Note: I thought we might hide some latency by doing json.decode asynchronously during tokenizer encoding, but it doesn't seem to help much. However, I have a separate script to build a static Hugging Face tokenized dataset, and that does seem to help.
Update: it's now slightly faster using persistent workers that read from queues in while True loops, instead of a process pool. The processes are spun up only once, so the tokenizers are pickled only once.
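The persistent-worker pattern described above can be sketched roughly as follows (a minimal illustration with stand-in work, not the actual training code; the doubling stands in for tokenizer encoding):

```python
# Workers are spawned once and pull tasks from a queue in a `while True`
# loop, so expensive per-worker state (e.g. an unpickled tokenizer) is
# built exactly once instead of once per task as with a process pool.
import multiprocessing as mp

SENTINEL = None  # signals a worker to shut down


def worker(in_q, out_q):
    # One-time setup would happen here (e.g. deserialize a tokenizer).
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        out_q.put(item * 2)  # stand-in for tokenizer.encode(item)


def run(items, num_workers=2):
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    for it in items:
        in_q.put(it)
    for _ in procs:
        in_q.put(SENTINEL)
    results = [out_q.get() for _ in items]
    for p in procs:
        p.join()
    return sorted(results)  # queue order is nondeterministic
```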

def _load():
    with jsonlines.open(load_path) as reader:
        for entry in reader:
            yield MidiDict.from_msg_dict(entry)
Contributor:

This seems fine, but maybe include an option for when this functionality should be used e.g. stream=True or something.
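The lazy-expansion idea could be sketched like this (hypothetical names, not the project's actual API): the dataset wraps an iterator and only materializes it into a list when len() or indexing forces it to, while plain iteration stays streaming.

```python
class LazyDataset:
    """Wraps an iterator; expands to a list only on len() or random access."""

    def __init__(self, entries):
        self._iter = iter(entries)
        self._data = None  # None means "not yet expanded"

    def _expand(self):
        if self._data is None:
            self._data = list(self._iter)

    def __len__(self):
        self._expand()
        return len(self._data)

    def __getitem__(self, idx):
        self._expand()
        return self._data[idx]

    def __iter__(self):
        # Streaming iteration does not require expansion.
        if self._data is not None:
            return iter(self._data)
        return self._iter
```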

@@ -320,7 +320,7 @@ def custom_forward(*args):

             return custom_forward

-        hidden_states = torch.utils.checkpoint.checkpoint(
+        hidden_states, _ = torch.utils.checkpoint.checkpoint(
Contributor:

Did this end up fixing something?

import json


class Reader:
Contributor:

I'm guessing that you are using this when building hf datasets. If so I'm happy to add jsonl.zst functionality for streaming MidiDataset and TokenizedDatasets.

Contributor (Author):

The only blocker to replacing everything with a jsonl.zst reader/writer was that it doesn't allow mmap-style indexing: you can't seek to an arbitrary entry and retrieve it directly without decompressing everything before it. I don't know if that's a theoretical impossibility or just an implementation limit of zstandard.
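For contrast, the random access that plain (uncompressed) jsonl allows can be sketched as a simple byte-offset index (illustrative helper names, not the repo's code): record where each line starts once, then seek directly to any entry without reading its predecessors.

```python
import io
import json


def build_index(f):
    """Record the byte offset of every line in an uncompressed jsonl stream."""
    offsets, pos = [], 0
    for line in f:  # f opened in binary mode so offsets are byte-accurate
        offsets.append(pos)
        pos += len(line)
    return offsets


def read_entry(f, offsets, i):
    """Jump straight to entry i without decoding the preceding entries."""
    f.seek(offsets[i])
    return json.loads(f.readline())
```

With a zstd-compressed stream, `seek` on the compressed file does not correspond to a decodable position, which is the limitation described above.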

@@ -114,8 +113,11 @@ def __init__(
             }
         ]

+    @classmethod
+    @property
+    def program_to_instrument(cls):
Contributor:

I'm assuming this is here to speed up the process of building MidiDatasets. Does it make much of a difference?
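One caveat worth noting: chaining `@classmethod` with `@property` was only ever supported in CPython 3.9–3.12 (deprecated in 3.11, removed in 3.13). A more portable way to get the same "build the dict lazily, once per class" behavior is a memoized classmethod (a sketch with a placeholder mapping, not the real instrument table):

```python
import functools


class MidiDictSketch:  # hypothetical stand-in for the real class
    @classmethod
    @functools.lru_cache(maxsize=1)
    def program_to_instrument(cls):
        # Built on first call and cached, so per-instance construction
        # cost disappears when many MidiDicts are created.
        return {p: ("piano" if p < 8 else "other") for p in range(128)}
```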

     tokenizer = TokenizerLazy()
     dataset = TokenizedDataset.build(
         tokenizer=tokenizer,
         save_path=args.save_path,
         midi_dataset_path=args.load_path,
-        max_seq_len=config["max_seq_len"],
+        max_seq_len=args.l,
Contributor:

Good idea. Might as well also remove the dataset_gen_args from the config json.

@loubbrad loubbrad merged commit 4cd90fc into EleutherAI:main Nov 22, 2023
1 check passed
@honglu2875 honglu2875 deleted the dev branch November 22, 2023 19:26