
how to use when --mask-before-token have values #995

Closed
xealml opened this issue Jul 13, 2023 · 2 comments · Fixed by #1056
Labels: feature request (New feature or request)

Comments

@xealml

xealml commented Jul 13, 2023

I process the input with this script:

python ./tools/preprocess_data_with_mask.py \
            --input /data1/limenglin/other_data/train.txt \
            --output-prefix ./train_on_post_comment_with_mask \
            --vocab ./data/gpt2-vocab.json \
            --merge-file ./data/gpt2-merges.txt \
            --dataset-impl mmap \
            --tokenizer-type GPT2BPETokenizer \
            --append-eod \
            --workers 40 \
            --mask-before-token 58,2257,7227,60
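For context, here is a minimal sketch of what a "mask before token" preprocessing step can look like. This is an illustration of the idea only, not the actual `preprocess_data_with_mask.py` implementation; the function name is hypothetical, and only the marker IDs `58,2257,7227,60` come from the command above:

```python
def build_label_mask(token_ids, marker, ignore_index=-100):
    """Illustrative sketch: for each occurrence of the marker token
    sequence, mask out (ignore) every token up to and including the
    marker, keeping later tokens as training targets. This mirrors
    the idea behind --mask-before-token, not the exact gpt-neox code."""
    labels = list(token_ids)
    n, m = len(token_ids), len(marker)
    last_end = 0
    i = 0
    while i + m <= n:
        if token_ids[i:i + m] == marker:
            # Mask everything from the previous cut point through
            # the end of this marker occurrence.
            for j in range(last_end, i + m):
                labels[j] = ignore_index
            last_end = i + m
            i += m
        else:
            i += 1
    return labels

marker = [58, 2257, 7227, 60]          # IDs passed via --mask-before-token
tokens = [10, 11] + marker + [20, 21, 22]
print(build_label_mask(tokens, marker))
# → [-100, -100, -100, -100, -100, -100, 20, 21, 22]
```

Under this reading, the label document would hold the masked copy of each sequence while the text document holds the raw tokens.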

and I got two files, text_document and label_document.
I want to know how to modify the yml file to use label_document. Could you update the README so I can follow?
Thanks.

In local_setup.yml, like this?

"data_path": "./data_text_document",
"label_data_paths": ["./data_label_document"],

but I get this error:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/data1/limenglin/gpt-neox/megatron/training.py", line 226, in pretrain
    iteration = train(
  File "/data1/limenglin/gpt-neox/megatron/training.py", line 794, in train
    loss_dict, skipped_iter = train_step(
  File "/data1/limenglin/gpt-neox/megatron/training.py", line 700, in train_step
    reduced_loss = train_step_pipe(
  File "/data1/limenglin/gpt-neox/megatron/training.py", line 750, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
    self._exec_schedule(sched)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1374, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 790, in _exec_load_micro_batch
    batch = self._next_batch()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 626, in _next_batch
    batch = self.batch_fn(batch)
  File "/data1/limenglin/gpt-neox/megatron/training.py", line 329, in get_batch_pipe
    tokens, labels, loss_mask, attention_mask, position_ids = _get_batch(
  File "/data1/limenglin/gpt-neox/megatron/training.py", line 276, in _get_batch
    data_b = mpu.broadcast_data(keys, data, datatype)
  File "/data1/limenglin/gpt-neox/megatron/mpu/data.py", line 91, in broadcast_data
    key_size, key_numel, total_numel = _build_key_size_numel_dictionaries(keys, data)
  File "/data1/limenglin/gpt-neox/megatron/mpu/data.py", line 44, in _build_key_size_numel_dictionaries
    assert data[key].dim() < max_dim, "you should increase MAX_DATA_DIM"
KeyError: 'label'
@xealml xealml added the feature request New feature or request label Jul 13, 2023
@xealml (Author)

xealml commented Jul 13, 2023

I may have found a method: set

  "train-data-paths": ["./data_text_document"],
   "test-data-paths": ["./data_text_document"],
   "valid-data-paths": ["./data_text_document"],
  "label_data_paths": ["./data_label_document"],

@dashstander dashstander self-assigned this Sep 29, 2023
@dashstander (Contributor)

Hi @xealml , did setting up your config like this

  "train-data-paths": ["./data_text_document"],
   "test-data-paths": ["./data_text_document"],
   "valid-data-paths": ["./data_text_document"],
  "label_data_paths": ["./data_label_document"],

fix your issue?

I'm not sure this has anything to do with --mask-before-token. The error you're getting is because, for one reason or another, the code is taking this path rather than the one that includes the labels key.
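That explanation matches the shape of the traceback: the batch dict was built without a "label" entry, so indexing it with a key list that includes "label" raises the KeyError before the size assertion is ever reached. A tiny sketch of this failure mode (names are illustrative, not the actual gpt-neox internals):

```python
def build_size_dict(keys, data):
    """Toy stand-in for _build_key_size_numel_dictionaries: look up
    every requested key in the batch dict and record its size."""
    return {key: len(data[key]) for key in keys}

batch = {"text": [1, 2, 3]}  # labels never got loaded into the batch
try:
    build_size_dict(["text", "label"], batch)
except KeyError as e:
    print("missing key:", e)
```

Switching the config to the train/valid/test path keys (as in the comment above) is what makes the label dataset get built, so "label" ends up in the batch dict.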
