MidiDataset optimizations #68

Merged (11 commits), Nov 22, 2023
Receive context length via the command line. It's more convenient than digging into the config file every time.
honglu2875 committed Nov 10, 2023
commit 95f492dd8bd1f076664e3a058f0bb6092afb8a78
5 changes: 2 additions & 3 deletions aria/run.py
@@ -124,22 +124,21 @@ def _parse_tokenized_dataset_args():
     argp.add_argument("load_path", help="path midi_dict dataset")
     argp.add_argument("save_path", help="path to save dataset")
     argp.add_argument("-s", help="also produce shuffled", action="store_true")
+    argp.add_argument("-l", help="max sequence length", type=int, default=2048)

     return argp.parse_args(sys.argv[2:])


 def build_tokenized_dataset(args):
     from aria.tokenizer import TokenizerLazy
     from aria.data.datasets import TokenizedDataset
-    from aria.config import load_config

-    config = load_config()["data"]["dataset_gen_args"]
     tokenizer = TokenizerLazy()
     dataset = TokenizedDataset.build(
         tokenizer=tokenizer,
         save_path=args.save_path,
         midi_dataset_path=args.load_path,
-        max_seq_len=config["max_seq_len"],
+        max_seq_len=args.l,
Contributor

Good idea. We might as well also remove dataset_gen_args from the config JSON.

         overwrite=True,
     )
     if args.s:
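For readers who want to try the new flag in isolation, here is a minimal sketch of the argument parsing after this commit. It is not the code from aria/run.py: the function takes `argv` explicitly (the original reads `sys.argv[2:]`), and the `prog` string and the file names `in.json` and `out.jsonl` are placeholders chosen for illustration.

```python
import argparse


def parse_tokenized_dataset_args(argv):
    # Same four arguments as in the diff above. `argv` is passed in
    # explicitly (an assumption for testability); the original function
    # parses sys.argv[2:] instead.
    argp = argparse.ArgumentParser(prog="run.py tokenized-dataset")  # hypothetical prog name
    argp.add_argument("load_path", help="path midi_dict dataset")
    argp.add_argument("save_path", help="path to save dataset")
    argp.add_argument("-s", help="also produce shuffled", action="store_true")
    argp.add_argument("-l", help="max sequence length", type=int, default=2048)
    return argp.parse_args(argv)


# Without -l, the default of 2048 applies.
default_args = parse_tokenized_dataset_args(["in.json", "out.jsonl"])

# With -l, the value is converted to int and exposed as `args.l`.
custom_args = parse_tokenized_dataset_args(["in.json", "out.jsonl", "-l", "4096", "-s"])

print(default_args.l, custom_args.l, custom_args.s)  # prints: 2048 4096 True
```

Because `-l` has no long form, argparse stores the value under the attribute name `l`, which is why the diff passes `max_seq_len=args.l`.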