
Training speed in bf16 mode is slow. #660

Open
frankang opened this issue Aug 29, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@frankang

Describe the bug
I trained a model from scratch on RTX 3090 cards (torch.cuda.is_bf16_supported() returns True) in bfloat16 mode, but the training speed turned out to be about 1/3 of the fp16 speed (I used the "elapsed time per iteration (ms)" log field for the comparison).

To Reproduce
I borrowed the following settings from the configs/small_bf16.yml file for the bf16 mode; the other settings are the same as in the fp16 mode.

  "zero_optimization": {
  "stage": 0,
  ...
  },

   # precision settings
   "fp16": {
     "enabled": true,
     "type": "bfloat16", # set bf16 as precision
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   "fp32_allreduce": True, # without a patch to torch, bf16 models have to do the allreduce in fp32

Expected behavior
According to huggingface/transformers#14608,

"bf16 is 2-3% slower than fp16".
So I was expecting a speed close to that of fp16 mode.
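For reference, one way to check whether the gap comes from the matmul kernels themselves (rather than from communication or the fp32 allreduce) is to time a large matmul in both dtypes on a single GPU. A minimal sketch (shapes and iteration counts are arbitrary):

  import time
  import torch

  def time_matmul(dtype, n=4096, iters=50):
      # Time n x n matmuls on the current CUDA device in the given dtype.
      a = torch.randn(n, n, device="cuda", dtype=dtype)
      b = torch.randn(n, n, device="cuda", dtype=dtype)
      torch.cuda.synchronize()
      start = time.time()
      for _ in range(iters):
          a @ b
      torch.cuda.synchronize()
      return (time.time() - start) / iters

  print("fp16 ms/iter:", 1000 * time_matmul(torch.float16))
  print("bf16 ms/iter:", 1000 * time_matmul(torch.bfloat16))

On Ampere cards the two should land within a few percent of each other; if they do, the 3x slowdown is more likely in the fp32 allreduce and optimizer path than in the bf16 kernels.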

Proposed solution
If you have an idea for how we can fix this problem, describe it here.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • GPUs: 8 x RTX 3090

  • Configs:

    data_impl ....................... mmap........................updated
    data_path ....................... data/enron/enron_text_documentupdated
    dynamic_loss_scale .............. True........................updated
    eval_interval ................... 2000........................updated
    eval_iters ...................... 10..........................updated
    fp16 ............................ {'enabled': True, 'type': 'bfloat16', 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
    fp32_allreduce .................. True........................updated
    gas ............................. 8...........................updated
    global_num_gpus ................. 8...........................updated
    gpt_j_residual .................. True........................updated
    gradient_accumulation_steps ..... 8...........................updated
    gradient_clipping ............... 1.0.........................updated
    hidden_dropout .................. 0...........................updated
    hidden_size ..................... 4096........................updated
    hostfile ........................ /job/hostfile...............updated
    init_method ..................... small_init..................updated
    is_pipe_parallel ................ True........................updated
    load ............................ ./6B-tensor-para-exp_checkpoints-mp8-pp1updated
    log_dir ......................... logs........................updated
    log_interval .................... 10..........................updated
    lr .............................. 0.0001......................updated
    lr_decay_iters .................. 40000.......................updated
    lr_decay_style .................. cosine......................updated
    max_position_embeddings ......... 2048........................updated
    merge_file ...................... data/gpt2-merges.txt........updated
    min_lr .......................... 9.7e-06.....................updated
    model_parallel_size ............. 8...........................updated
    no_weight_tying ................. True........................updated
    num_attention_heads ............. 16..........................updated
    num_layers ...................... 28..........................updated
    optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0001, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
    optimizer_type .................. Adam........................updated
    output_layer_init_method ........ wang_init...................updated
    output_layer_parallelism ........ column......................updated
    pipe_parallel_size .............. 1...........................updated
    pos_emb ......................... rotary......................updated
    precision ....................... bfloat16....................updated
    rotary_pct ...................... 0.25........................updated
    save ............................ ./6B-tensor-para-exp_checkpoints-mp8-pp1updated
    save_interval ................... 5000........................updated
    scaled_upper_triang_masked_softmax_fusion  True...............updated
    seq_length ...................... 2048........................updated
    sparsity_config ................. {}..........................updated
    split ........................... 995,4,1.....................updated
    synchronize_each_layer .......... True........................updated
    tensorboard_dir ................. tensorboard.................updated
    text_gen_type ................... unconditional...............updated
    train_batch_size ................ 16..........................updated
    train_iters ..................... 40000.......................updated
    train_micro_batch_size_per_gpu .. 2...........................updated
    use_wandb ....................... False.......................updated
    user_script ..................... train.py....................updated
    vocab_file ...................... data/gpt2-vocab.json........updated
    wandb_group ..................... Gf8av4SsLGBZ2riRHtJk4z_321ozt7aupdated
    warmup .......................... 0.1.........................updated
    zero_allgather_bucket_size ...... 500000000...................updated
    zero_contiguous_gradients ....... True........................updated
    zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False}updated
    zero_reduce_bucket_size ......... 500000000...................updated
    zero_reduce_scatter ............. True........................updated
    zero_stage ...................... 0...........................updated
    activation ...................... gelu........................default
    adlr_autoresume ................. False.......................default
    adlr_autoresume_interval ........ 1000........................default
    amp ............................. None........................default
    apply_query_key_layer_scaling ... False.......................default
    attention_softmax_in_fp32 ....... False.......................default
    bias_dropout_fusion ............. False.......................default
    char_level_ppl .................. False.......................default
    checkpoint_in_cpu ............... False.......................default
    checkpoint_num_layers ........... 1...........................default
    checkpoint_validation_with_forward_pass  False................default
    contiguous_checkpointing ........ False.......................default
    deepscale ....................... False.......................default
    deepscale_config ................ None........................default
    deepspeed ....................... True........................default
    deepspeed_activation_checkpointing  True......................default
    deepspeed_mpi ................... False.......................default
    detect_nvlink_pairs ............. False.......................default
    distributed_backend ............. nccl........................default
    do_test ......................... None........................default
    do_train ........................ None........................default
    do_valid ........................ None........................default
    dump_state ...................... False.......................default
    eod_mask_loss ................... False.......................default
    eval_results_prefix ............. ............................default
    eval_tasks ...................... None........................default
    exclude ......................... None........................default
    exit_interval ................... None........................default
    finetune ........................ False.......................default
    flops_profiler .................. None........................default
    fp16_lm_cross_entropy ........... False.......................default
    git_hash ........................ 5e0d614.....................default
    gmlp_attn_dim ................... 64..........................default
    gradient_noise_scale_cpu_offload  False.......................default
    gradient_noise_scale_n_batches .. 5...........................default
    gradient_predivide_factor ....... 1.0.........................default
    hysteresis ...................... 2...........................default
    include ......................... None........................default
    init_method_std ................. 0.02........................default
    iteration ....................... None........................default
    keep_last_n_checkpoints ......... None........................default
    launcher ........................ pdsh........................default
    layernorm_epsilon ............... 1e-05.......................default
    lazy_mpu_init ................... False.......................default
    local_rank ...................... None........................default
    log_grad_norm ................... False.......................default
    log_grad_pct_zeros .............. False.......................default
    log_gradient_noise_scale ........ False.......................default
    log_optimizer_states ............ False.......................default
    log_param_norm .................. False.......................default
    loss_scale ...................... None........................default
    loss_scale_window ............... 1000.0......................default
    make_vocab_size_divisible_by .... 128.........................default
    master_addr ..................... None........................default
    master_port ..................... 29500.......................default
    maximum_tokens .................. 64..........................default
    min_scale ....................... 1.0.........................default
    mmap_warmup ..................... False.......................default
    no_load_optim ................... False.......................default
    no_load_rng ..................... False.......................default
    no_save_optim ................... False.......................default
    no_save_rng ..................... False.......................default
    norm ............................ layernorm...................default
    num_gpus ........................ None........................default
    num_nodes ....................... -1..........................default
    num_samples ..................... 1...........................default
    num_unique_layers ............... None........................default
    num_workers ..................... 2...........................default
    onnx_safe ....................... False.......................default
    opt_pos_emb_offset .............. 0...........................default
    override_lr_scheduler ........... False.......................default
    padded_vocab_size ............... None........................default
    param_sharing_style ............. grouped.....................default
    partition_activations ........... False.......................default
    pipe_partition_method ........... type:transformer|mlp........default
    prescale_gradients .............. False.......................default
    profile_backward ................ False.......................default
    rank ............................ None........................default
    recompute ....................... False.......................default
    rms_norm_epsilon ................ 1e-08.......................default
    rotary_emb_base ................. 10000.......................default
    rpe_max_distance ................ 128.........................default
    rpe_num_buckets ................. 32..........................default
    sample_input_file ............... None........................default
    sample_output_file .............. samples.txt.................default
    scaled_masked_softmax_fusion .... False.......................default
    scalenorm_epsilon ............... 1e-08.......................default
    scheduler ....................... None........................default
    seed ............................ 1234........................default
    short_seq_prob .................. 0.1.........................default
    soft_prompt_tuning .............. None........................default
    sparse_gradients ................ False.......................default
    steps_per_print ................. 10..........................default
    temperature ..................... 0.0.........................default
    test_data_paths ................. None........................default
    test_data_weights ............... None........................default
    tokenizer_type .................. GPT2BPETokenizer............default
    top_k ........................... 0...........................default
    top_p ........................... 0.0.........................default
    train_data_paths ................ None........................default
    train_data_weights .............. None........................default
    use_bnb_optimizer ............... False.......................default
    use_checkpoint_lr_scheduler ..... False.......................default
    use_cpu_initialization .......... False.......................default
    valid_data_paths ................ None........................default
    valid_data_weights .............. None........................default
    wall_clock_breakdown ............ False.......................default
    wandb_host ...................... https://api.wandb.ai........default
    wandb_init_all_ranks ............ False.......................default
    wandb_project ................... neox........................default
    wandb_team ...................... None........................default
    weight_by_num_documents ......... False.......................default
    weight_decay .................... 0.01........................default
    weighted_sampler_alpha .......... 0.3.........................default
    world_size ...................... None........................default
    zero_allow_untested_optimizer ... False.......................default
    ---------------- end of arguments ----------------
    
    
@frankang frankang added the bug Something isn't working label Aug 29, 2022
@frankang frankang changed the title Training speed in bf16 model is slow. Training speed in bf16 mode is slow. Aug 29, 2022
@StellaAthena
Member

Did your comparison fp16 model use ZeRO? I notice that you're not using it here.

@frankang
Author

frankang commented Oct 3, 2022

Yes, my comparison fp16 model is using ZeRO stage 1.
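For reference, the ZeRO stage-1 block on that fp16 run looks like the zero_optimization dump above with only the stage changed (a sketch, not the exact config):

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },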
