pretrain_dataset broken #1026

Closed
mhenrichsen opened this issue Jan 1, 2024 · 2 comments
Labels
bug Something isn't working

Comments

mhenrichsen (Collaborator) commented Jan 1, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Should stream a Hugging Face dataset.

Current behaviour

[2024-01-01 11:53:55,332] [INFO] [axolotl.load_model:517] [PID:20811] [RANK:2] GPU memory usage after model load: 2.062GB (+0.087GB cache, +1.755GB misc)
[2024-01-01 11:53:55,340] [INFO] [axolotl.load_model:552] [PID:20811] [RANK:2] converting modules to torch.bfloat16 for flash attention
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 136, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "/workspace/axolotl/src/axolotl/core/trainer_builder.py", line 210, in get_train_dataloader
    sampler = self._get_train_sampler()
  File "/workspace/axolotl/src/axolotl/core/trainer_builder.py", line 161, in _get_train_sampler
    RandomSampler(self.train_dataset),
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 106, in __init__
    if not isinstance(self.num_samples, int) or self.num_samples <= 0:
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 114, in num_samples
    return len(self.data_source)
TypeError: object of type 'IterableDataset' has no len()

Steps to reproduce

Run training with the YAML config below.

Config yaml

base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

pretraining_dataset:
  - path: mhenrichsen/terra
    type: completion

#datasets:
#  - path: DDSC/dagw_reddit_filtered_v1.0.0
#    type: completion
dataset_prepared_path:
val_set_size: 0.001
output_dir: ./tiny

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

wandb_project: tiny-danskgpt
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 16
num_epochs: 2
max_steps: 200000
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00005

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 4
debug:
deepspeed: deepspeed/zero2.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

The datasets library may have been updated in a way that broke this functionality.
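
For context, here is a minimal sketch of the failure mode (a hypothetical repro, not the axolotl code path; it assumes the datasets and torch versions from the Docker image and that mhenrichsen/terra exposes a train split). A streamed Hugging Face dataset is an IterableDataset, which has no __len__, so torch's RandomSampler cannot be constructed on top of it:

# Hypothetical repro sketch: a streamed HF dataset is an IterableDataset
# without __len__, so RandomSampler fails during __init__.
from datasets import load_dataset
from torch.utils.data import RandomSampler

streamed = load_dataset("mhenrichsen/terra", split="train", streaming=True)

try:
    RandomSampler(streamed)  # __init__ reads num_samples, which calls len(data_source)
except TypeError as err:
    print(err)  # object of type 'IterableDataset' has no len()

This matches the traceback above: the streaming pretraining dataset ends up as trainer.train_dataset, but _get_train_sampler still wraps it in a RandomSampler.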

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

docker

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
mhenrichsen added the bug label on Jan 1, 2024
NanoCode012 (Collaborator) commented Jan 8, 2024

Hey, I'm not sure if you uploaded an old config. It should be the following:

pretraining_dataset: mhenrichsen/terra

Am I right?

Edit: Also, the linked PR seems to have been merged. Was your issue solved?
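
For clarity, the two config forms side by side; whether the scalar form is strictly required is an assumption based on this comment, not something confirmed elsewhere in the thread:

# Form used in the issue's config (list of mappings):
pretraining_dataset:
  - path: mhenrichsen/terra
    type: completion

# Form suggested above (single dataset path):
pretraining_dataset: mhenrichsen/terra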

NanoCode012 (Collaborator) commented Mar 30, 2024

Closing as it's stale. Feel free to reopen if this is still happening.
