RuntimeError: The expanded size of the tensor (1) must match the existing size (10) at non-singleton dimension 2 #870

Closed
crazyofapple opened this issue Apr 4, 2023 · 4 comments
Labels: bug (Something isn't working)

crazyofapple (Contributor) commented Apr 4, 2023

Describe the bug
RuntimeError: The expanded size of the tensor (1) must match the existing size (10) at non-singleton dimension 2. Target sizes: [1, 4, 1, 10]. Tensor sizes: [1, 1, 10, 10]
File "generate.py", line 59, in main
generate_samples_input_from_file(
File "/share/home/gpt-neox/megatron/text_generation_utils.py", line 620, in generate_samples_input_from_file
generated_texts = generate_samples_from_prompt(
File "/share/home/gpt-neox/megatron/text_generation_utils.py", line 485, in generate_samples_from_prompt
for (
File "/share/home/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens
logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel)
File "/share/home/gpt-neox/megatron/text_generation_utils.py", line 137, in forward_model
return model.module(model_inputs)
File "/share/home/.conda/envs/gpt-neox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/ldf/gpt-neox/megatron/model/utils.py", line 168, in forward
x = func(forward_input)
File "/share/home/ldf/gpt-neox/megatron/model/utils.py", line 161, in exec_func
inputs = layer(inputs)
File "/share/home/.conda/envs/gpt-neox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/ldf/gpt-neox/megatron/model/transformer.py", line 807, in forward
return super().forward(hidden_states, attention_mask), attention_mask
File "/share/home/ldf/gpt-neox/megatron/model/transformer.py", line 769, in forward
attention_output, attention_bias = self.attention(
File "/share/home/.conda/envs/gpt-neox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/ldf/gpt-neox/megatron/model/transformer.py", line 609, in forward
context_layer = self.attention(
File "/share/home/ldf/gpt-neox/megatron/model/transformer.py", line 391, in attention
attention_probs = self.scale_mask_softmax(attention_scores, attention_mask)
File "/share/home/.conda/envs/gpt-neox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in call_impl
return forward_call(*input, **kwargs)
File "/share/home/ldf/gpt-neox/megatron/model/fused_softmax.py", line 146, in forward
return self.forward_torch_softmax(input, mask)
File "/share/home/ldf/gpt-neox/megatron/model/fused_softmax.py", line 190, in forward_torch_softmax
mask_output = self.mask_func(input, mask) if mask is not None else input
File "/share/home/ldf/gpt-neox/megatron/model/gpt2_model.py", line 48, in gpt2_attention_mask_func
attention_scores.masked_fill_(ltor_mask, -10000.0)
RuntimeError: The expanded size of the tensor (1) must match the existing size (10) at non-singleton dimension 2. Target sizes: [1, 4, 1, 10]. Tensor sizes: [1, 1, 10, 10]
wandb: Waiting for W&B process to finish... (failed 1).
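
For context, the shapes in the error are consistent with incremental decoding: with num_attention_heads = 32 and model_parallel_size = 8, each rank holds 4 heads, the scores cover only the newly generated token (query length 1), while the causal mask still spans the full 10-token context. A minimal standalone sketch that reproduces the same broadcast failure (shapes copied from the traceback; this is not the actual NeoX code path):

import torch

# [batch, heads_per_rank, query_len, key_len], as in the error message
attention_scores = torch.zeros(1, 4, 1, 10)
# Full left-to-right (causal) mask over the 10-token context
ltor_mask = torch.ones(1, 1, 10, 10, dtype=torch.bool)

# In-place masked_fill_ must broadcast the mask onto the scores; dim 2 of
# the mask (10) cannot broadcast against the scores' query dim (1), so:
attention_scores.masked_fill_(ltor_mask, -10000.0)
# RuntimeError: The expanded size of the tensor (1) must match the existing
# size (10) at non-singleton dimension 2.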

To Reproduce
python deepy.py generate.py -d configs 6-7B.yml slurm_local.yml text_generation.yml

Environment (please complete the following information):

  • GPUs: A100 * 8
  • Configs:
  • data_impl ....................... mmap........................updated
    data_path ....................... llm_data/gpt_neox_wudao/data_content_documentupdated
    dynamic_loss_scale .............. True........................updated
    eval_iters ...................... 10..........................updated
    fp16 ............................ {'fp16': True, 'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
    gas ............................. 1...........................updated
    global_num_gpus ................. 8...........................updated
    gradient_clipping ............... 1.0.........................updated
    hidden_dropout .................. 0...........................updated
    hidden_size ..................... 4096........................updated
    init_method ..................... small_init..................updated
    keep_last_n_checkpoints ......... 4...........................updated
    load ............................ checkpoints.................updated
    log_dir ......................... logs........................updated
    log_interval .................... 100.........................updated
    lr .............................. 0.00012.....................updated
    lr_decay_iters .................. 320000......................updated
    lr_decay_style .................. cosine......................updated
    max_position_embeddings ......... 2048........................updated
    maximum_tokens .................. 102.........................updated
    merge_file ...................... tokenizer_files/gpt2-merges.txtupdated
    min_lr .......................... 1.2e-05.....................updated
    model_parallel_size ............. 8...........................updated
    no_weight_tying ................. True........................updated
    num_attention_heads ............. 32..........................updated
    num_layers ...................... 32..........................updated
    num_samples ..................... 10..........................updated
    optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.00012, 'betas': [0.9, 0.95], 'eps': 1e-08}}updated
    optimizer_type .................. Adam........................updated
    output_layer_init_method ........ wang_init...................updated
    output_layer_parallelism ........ column......................updated
    partition_activations ........... True........................updated
    pipe_parallel_size .............. 1...........................updated
    pos_emb ......................... rotary......................updated
    precision ....................... fp16........................updated
    sample_input_file ............... sample_input.txt............updated
    sample_output_file .............. sample_output.txt...........updated
    save ............................ checkpoints.................updated
    save_iters ...................... [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000]updated
    seq_length ...................... 2048........................updated
    sparsity_config ................. {}..........................updated
    synchronize_each_layer .......... True........................updated
    temperature ..................... 1.0.........................updated
    tensorboard_dir ................. tensorboard.................updated
    text_gen_type ................... input-file..................updated
    train_batch_size ................ 4...........................updated
    train_iters ..................... 320000......................updated
    train_micro_batch_size_per_gpu .. 4...........................updated
    use_wandb ....................... True........................updated
    user_script ..................... generate.py.................updated
    vocab_file ...................... tokenizer_files/gpt2-vocab.jsonupdated
    wall_clock_breakdown ............ True........................updated
    wandb_group ..................... w66f77n7....................updated
    weight_decay .................... 0.1.........................updated
    zero_allgather_bucket_size ...... 500000000...................updated
    zero_contiguous_gradients ....... True........................updated
    zero_optimization ............... {'stage': 2, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True}updated
    zero_reduce_bucket_size ......... 500000000...................updated
    zero_reduce_scatter ............. True........................updated
    zero_stage ...................... 2...........................updated
    activation ...................... gelu........................default
    adlr_autoresume ................. False.......................default
    adlr_autoresume_interval ........ 1000........................default
    amp ............................. None........................default
    apply_query_key_layer_scaling ... False.......................default
    attention_softmax_in_fp32 ....... False.......................default
    autotuning ...................... None........................default
    autotuning_run .................. None........................default
    base_shapes_file ................ None........................default
    bias_dropout_fusion ............. False.......................default
    bias_gelu_fusion ................ False.......................default
    char_level_ppl .................. False.......................default
    checkpoint_in_cpu ............... False.......................default
    checkpoint_num_layers ........... 1...........................default
    checkpoint_scale ................ linear......................default
    checkpoint_validation_with_forward_pass False................default
    comment ......................... None........................default
    contiguous_checkpointing ........ False.......................default
    coord_check ..................... False.......................default
    curriculum_learning ............. None........................default
    curriculum_seqlen ............... 0...........................default
    deepscale ....................... False.......................default
    deepscale_config ................ None........................default
    deepspeed ....................... True........................default
    deepspeed_activation_checkpointing True......................default
    deepspeed_mpi ................... False.......................default
    deepspeed_slurm ................. False.......................default
    detect_nvlink_pairs ............. False.......................default
    distributed_backend ............. nccl........................default
    do_test ......................... None........................default
    do_train ........................ None........................default
    do_valid ........................ None........................default
    dump_state ...................... False.......................default
    eod_mask_loss ................... False.......................default
    eval_interval ................... 1000........................default
    eval_results_prefix ............. ............................default
    eval_tasks ...................... None........................default
    exclude ......................... None........................default
    exit_interval ................... None........................default
    extra_save_iters ................ None........................default
    finetune ........................ False.......................default
    flops_profiler .................. None........................default
    fp16_lm_cross_entropy ........... False.......................default
    fp32_allreduce .................. False.......................default
    git_hash ........................ None........................default
    gmlp_attn_dim ................... 64..........................default
    gpt_j_residual .................. False.......................default
    gpt_j_tied ...................... False.......................default
    gradient_accumulation_steps ..... 1...........................default
    gradient_noise_scale_cpu_offload False.......................default
    gradient_noise_scale_n_batches .. 5...........................default
    gradient_predivide_factor ....... 1.0.........................default
    hostfile ........................ None........................default
    hysteresis ...................... 2...........................default
    include ......................... None........................default
    init_method_std ................. 0.02........................default
    is_pipe_parallel ................ False.......................default
    iteration ....................... None........................default
    launcher ........................ pdsh........................default
    layernorm_epsilon ............... 1e-05.......................default
    lazy_mpu_init ................... False.......................default
    local_rank ...................... None........................default
    log_grad_norm ................... False.......................default
    log_grad_pct_zeros .............. False.......................default
    log_gradient_noise_scale ........ False.......................default
    log_optimizer_states ............ False.......................default
    log_param_norm .................. False.......................default
    loss_scale ...................... None........................default
    loss_scale_window ............... 1000.0......................default
    make_vocab_size_divisible_by .... 128.........................default
    master_addr ..................... None........................default
    master_port ..................... 29500.......................default
    min_scale ....................... 1.0.........................default
    mmap_warmup ..................... False.......................default
    mup_attn_temp ................... 1.0.........................default
    mup_embedding_mult .............. 1.0.........................default
    mup_init_scale .................. 1.0.........................default
    mup_output_temp ................. 1.0.........................default
    mup_rp_embedding_mult ........... 1.0.........................default
    mup_width_scale ................. 2...........................default
    no_load_optim ................... False.......................default
    no_load_rng ..................... False.......................default
    no_save_optim ................... False.......................default
    no_save_rng ..................... False.......................default
    no_ssh_check .................... False.......................default
    norm ............................ layernorm...................default
    num_gpus ........................ None........................default
    num_nodes ....................... -1..........................default
    num_unique_layers ............... None........................default
    num_workers ..................... 2...........................default
    onnx_safe ....................... False.......................default
    opt_pos_emb_offset .............. 0...........................default
    override_lr_scheduler ........... False.......................default
    padded_vocab_size ............... None........................default
    param_sharing_style ............. grouped.....................default
    pipe_partition_method ........... type:transformer|mlp........default
    prescale_gradients .............. False.......................default
    profile_backward ................ False.......................default
    prompt_end ...................... "\n"........................default
    rank ............................ None........................default
    recompute ....................... False.......................default
    return_logits ................... False.......................default
    rms_norm_epsilon ................ 1e-08.......................default
    rotary_emb_base ................. 10000.......................default
    rotary_pct ...................... 1.0.........................default
    rpe_max_distance ................ 128.........................default
    rpe_num_buckets ................. 32..........................default
    save_base_shapes ................ False.......................default
    scaled_masked_softmax_fusion .... False.......................default
    scaled_upper_triang_masked_softmax_fusion False..............default
    scalenorm_epsilon ............... 1e-08.......................default
    scheduler ....................... None........................default
    seed ............................ 1234........................default
    short_seq_prob .................. 0.1.........................default
    soft_prompt_tuning .............. None........................default
    sparse_gradients ................ False.......................default
    split ........................... 969, 30, 1..................default
    steps_per_print ................. 10..........................default
    test_data_paths ................. None........................default
    test_data_weights ............... None........................default
    tokenizer_type .................. GPT2BPETokenizer............default
    top_k ........................... 0...........................default
    top_p ........................... 0.0.........................default
    train_data_paths ................ None........................default
    train_data_weights .............. None........................default
    use_bnb_optimizer ............... False.......................default
    use_checkpoint_lr_scheduler ..... False.......................default
    use_cpu_initialization .......... False.......................default
    use_mup ......................... False.......................default
    use_shared_fs ................... True........................default
    valid_data_paths ................ None........................default
    valid_data_weights .............. None........................default
    wandb_host ...................... https://api.wandb.ai........default
    wandb_init_all_ranks ............ False.......................default
    wandb_project ................... neox........................default
    wandb_team ...................... None........................default
    warmup .......................... 0.01........................default
    weight_by_num_documents ......... False.......................default
    weighted_sampler_alpha .......... 0.3.........................default
    world_size ...................... None........................default
    zero_allow_untested_optimizer ... False.......................default
crazyofapple added the bug label Apr 4, 2023

Stormcode1 commented
Were you able to figure out the issue? I've been running into the same issue with different datasets and have been going nuts trying to figure it out.

TissueC commented May 17, 2023

Same here. I'd appreciate it if any information could be provided.

TissueC commented May 17, 2023

I have looked into it: if you put multiple (>1) lines in sample_input.txt and generate, the error occurs. What an absurd bug.
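
If multi-line input is the trigger, one plausible mechanism is that the cached causal mask no longer matches the incremental attention scores. A workaround sketch (my own guess, not the project's actual fix; slice_mask is a hypothetical helper) would be to slice the mask down to the current query window before gpt2_attention_mask_func applies it:

import torch

def slice_mask(ltor_mask: torch.Tensor, attention_scores: torch.Tensor) -> torch.Tensor:
    # attention_scores: [batch, heads, query_len, key_len]
    # ltor_mask:        [1, 1, seq_len, seq_len] causal mask
    q_len, k_len = attention_scores.shape[-2], attention_scores.shape[-1]
    # Keep only the last q_len query rows and the first k_len key columns,
    # so the mask can broadcast against a single-token score tensor.
    return ltor_mask[..., -q_len:, :k_len]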

TissueC commented May 18, 2023

Undo this commit and the bug will be fixed: 17b84d7
