
Changed is_pipe_parallel setting to fix pipeline-parallel inference #866

Merged: 4 commits merged into main from curt/parallel-inference on Apr 21, 2023

Conversation

curt-tigges (Contributor)

Fix for #854
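
The linked issue concerns pipeline-parallel inference failing because the derived is_pipe_parallel flag determines whether the network is built as a plain sequential module or as a GPT2ModelPipe. The sketch below illustrates the kind of derived-flag logic this PR adjusts; the class and attribute names are illustrative assumptions for readability, not the literal gpt-neox diff.

```python
# Hypothetical sketch of deriving an is_pipe_parallel flag from
# pipe_parallel_size. The threshold (>= 2 vs >= 1) is the kind of detail this
# PR changes; the class and attribute names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ParallelArgs:
    pipe_parallel_size: int = 0
    is_pipe_parallel: bool = False

    def calculate_derived(self) -> None:
        # "Before"-style rule: only treat the model as pipeline-parallel when
        # two or more pipeline stages are requested:
        #     self.is_pipe_parallel = self.pipe_parallel_size >= 2
        # "After"-style rule: any nonzero pipe_parallel_size builds the
        # pipeline module, so training and inference construct the same model
        # (and the same checkpoint key layout).
        self.is_pipe_parallel = self.pipe_parallel_size >= 1


if __name__ == "__main__":
    args = ParallelArgs(pipe_parallel_size=1)
    args.calculate_derived()
    print(args.is_pipe_parallel)  # True under the ">= 1" rule
```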

@crazyofapple (Contributor)

/gpt-neox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPT2ModelPipe:
Missing key(s) in state_dict: "0.word_embeddings.weight", "2.input_layernorm.scale", "2.attention.query_key_value.weight", "2.attention.query_key_value.bias", "2.attention.rotary_emb.inv_freq", "2.attention.dense.weight", "2.attention.dense.bias", "2.post_attention_layernorm.scale", "2.mlp.dense_h_to_4h.weight", "2.mlp.dense_h_to_4h.bias", "2.mlp.dense_4h_to_h.weight", "2.mlp.dense_4h_to_h.bias", "3.input_layernorm.scale", "3.attention.query_key_value.weight", "3.attention.query_key_value.bias", "3.attention.rotary_emb.inv_freq", "3.attention.dense.weight", "3.attention.dense.bias", "3.post_attention_layernorm.scale", "3.mlp.dense_h_to_4h.weight", "3.mlp.dense_h_to_4h.bias", "3.mlp.dense_4h_to_h.weight", "3.mlp.dense_4h_to_h.bias", "4.input_layernorm.scale", "4.attention.query_key_value.weight", "4.attention.query_key_value.bias", "4.attention.rotary_emb.inv_freq", "4.attention.dense.weight", "4.attention.dense.bias", "4.post_attention_layernorm.scale", "4.mlp.dense_h_to_4h.weight", "4.mlp.dense_h_to_4h.bias", "4.mlp.dense_4h_to_h.weight", "4.mlp.dense_4h_to_h.bias", "5.input_layernorm.scale", "5.attention.query_key_value.weight", "5.attention.query_key_value.bias", "5.attention.rotary_emb.inv_freq", "5.attention.dense.weight", "5.attention.dense.bias", "5.post_attention_layernorm.scale", "5.mlp.dense_h_to_4h.weight", "5.mlp.dense_h_to_4h.bias", "5.mlp.dense_4h_to_h.weight", "5.mlp.dense_4h_to_h.bias", "6.input_layernorm.scale", "6.attention.query_key_value.weight", "6.attention.query_key_value.bias", "6.attention.rotary_emb.inv_freq", "6.attention.dense.weight", "6.attention.dense.bias", "6.post_attention_layernorm.scale", "6.mlp.dense_h_to_4h.weight", "6.mlp.dense_h_to_4h.bias", "6.mlp.dense_4h_to_h.weight", "6.mlp.dense_4h_to_h.bias", "7.input_layernorm.scale", "7.attention.query_key_value.weight", "7.attention.query_key_value.bias", "7.attention.rotary_emb.inv_freq", "7.attention.dense.weight", "7.attention.dense.bias", "7.post_attention_layernorm.scale", "7.mlp.dense_h_to_4h.weight", "7.mlp.dense_h_to_4h.bias", "7.mlp.dense_4h_to_h.weight", "7.mlp.dense_4h_to_h.bias", "8.input_layernorm.scale", "8.attention.query_key_value.weight", "8.attention.query_key_value.bias", "8.attention.rotary_emb.inv_freq", "8.attention.dense.weight", "8.attention.dense.bias", "8.post_attention_layernorm.scale", "8.mlp.dense_h_to_4h.weight", "8.mlp.dense_h_to_4h.bias", "8.mlp.dense_4h_to_h.weight", "8.mlp.dense_4h_to_h.bias", "9.input_layernorm.scale", "9.attention.query_key_value.weight", "9.attention.query_key_value.bias", "9.attention.rotary_emb.inv_freq", "9.attention.dense.weight", "9.attention.dense.bias", "9.post_attention_layernorm.scale", "9.mlp.dense_h_to_4h.weight", "9.mlp.dense_h_to_4h.bias", "9.mlp.dense_4h_to_h.weight", "9.mlp.dense_4h_to_h.bias", "10.input_layernorm.scale", "10.attention.query_key_value.weight", "10.attention.query_key_value.bias", "10.attention.rotary_emb.inv_freq", "10.attention.dense.weight", "10.attention.dense.bias", "10.post_attention_layernorm.scale", "10.mlp.dense_h_to_4h.weight", "10.mlp.dense_h_to_4h.bias", "10.mlp.dense_4h_to_h.weight", "10.mlp.dense_4h_to_h.bias", "11.input_layernorm.scale", "11.attention.query_key_value.weight", "11.attention.query_key_value.bias", "11.attention.rotary_emb.inv_freq", "11.attention.dense.weight", "11.attention.dense.bias", "11.post_attention_layernorm.scale", "11.mlp.dense_h_to_4h.weight", "11.mlp.dense_h_to_4h.bias", "11.mlp.dense_4h_to_h.weight", "11.mlp.dense_4h_to_h.bias", "12.input_layernorm.scale", 
"12.attention.query_key_value.weight", "12.attention.query_key_value.bias", "12.attention.rotary_emb.inv_freq", "12.attention.dense.weight", "12.attention.dense.bias", "12.post_attention_layernorm.scale", "12.mlp.dense_h_to_4h.weight", "12.mlp.dense_h_to_4h.bias", "12.mlp.dense_4h_to_h.weight", "12.mlp.dense_4h_to_h.bias", "13.input_layernorm.scale", "13.attention.query_key_value.weight", "13.attention.query_key_value.bias", "13.attention.rotary_emb.inv_freq", "13.attention.dense.weight", "13.attention.dense.bias", "13.post_attention_layernorm.scale", "13.mlp.dense_h_to_4h.weight", "13.mlp.dense_h_to_4h.bias", "13.mlp.dense_4h_to_h.weight", "13.mlp.dense_4h_to_h.bias", "14.input_layernorm.scale", "14.attention.query_key_value.weight", "14.attention.query_key_value.bias", "14.attention.rotary_emb.inv_freq", "14.attention.dense.weight", "14.attention.dense.bias", "14.post_attention_layernorm.scale", "14.mlp.dense_h_to_4h.weight", "14.mlp.dense_h_to_4h.bias", "14.mlp.dense_4h_to_h.weight", "14.mlp.dense_4h_to_h.bias", "15.input_layernorm.scale", "15.attention.query_key_value.weight", "15.attention.query_key_value.bias", "15.attention.rotary_emb.inv_freq", "15.attention.dense.weight", "15.attention.dense.bias", "15.post_attention_layernorm.scale", "15.mlp.dense_h_to_4h.weight", "15.mlp.dense_h_to_4h.bias", "15.mlp.dense_4h_to_h.weight", "15.mlp.dense_4h_to_h.bias", "16.input_layernorm.scale", "16.attention.query_key_value.weight", "16.attention.query_key_value.bias", "16.attention.rotary_emb.inv_freq", "16.attention.dense.weight", "16.attention.dense.bias", "16.post_attention_layernorm.scale", "16.mlp.dense_h_to_4h.weight", "16.mlp.dense_h_to_4h.bias", "16.mlp.dense_4h_to_h.weight", "16.mlp.dense_4h_to_h.bias", "17.input_layernorm.scale", "17.attention.query_key_value.weight", "17.attention.query_key_value.bias", "17.attention.rotary_emb.inv_freq", "17.attention.dense.weight", "17.attention.dense.bias", "17.post_attention_layernorm.scale", "17.mlp.dense_h_to_4h.weight", "17.mlp.dense_h_to_4h.bias", "17.mlp.dense_4h_to_h.weight", "17.mlp.dense_4h_to_h.bias", "18.input_layernorm.scale", "18.attention.query_key_value.weight", "18.attention.query_key_value.bias", "18.attention.rotary_emb.inv_freq", "18.attention.dense.weight", "18.attention.dense.bias", "18.post_attention_layernorm.scale", "18.mlp.dense_h_to_4h.weight", "18.mlp.dense_h_to_4h.bias", "18.mlp.dense_4h_to_h.weight", "18.mlp.dense_4h_to_h.bias", "19.input_layernorm.scale", "19.attention.query_key_value.weight", "19.attention.query_key_value.bias", "19.attention.rotary_emb.inv_freq", "19.attention.dense.weight", "19.attention.dense.bias", "19.post_attention_layernorm.scale", "19.mlp.dense_h_to_4h.weight", "19.mlp.dense_h_to_4h.bias", "19.mlp.dense_4h_to_h.weight", "19.mlp.dense_4h_to_h.bias", "20.input_layernorm.scale", "20.attention.query_key_value.weight", "20.attention.query_key_value.bias", "20.attention.rotary_emb.inv_freq", "20.attention.dense.weight", "20.attention.dense.bias", "20.post_attention_layernorm.scale", "20.mlp.dense_h_to_4h.weight", "20.mlp.dense_h_to_4h.bias", "20.mlp.dense_4h_to_h.weight", "20.mlp.dense_4h_to_h.bias", "21.input_layernorm.scale", "21.attention.query_key_value.weight", "21.attention.query_key_value.bias", "21.attention.rotary_emb.inv_freq", "21.attention.dense.weight", "21.attention.dense.bias", "21.post_attention_layernorm.scale", "21.mlp.dense_h_to_4h.weight", "21.mlp.dense_h_to_4h.bias", "21.mlp.dense_4h_to_h.weight", "21.mlp.dense_4h_to_h.bias", "22.input_layernorm.scale", 
"22.attention.query_key_value.weight", "22.attention.query_key_value.bias", "22.attention.rotary_emb.inv_freq", "22.attention.dense.weight", "22.attention.dense.bias", "22.post_attention_layernorm.scale", "22.mlp.dense_h_to_4h.weight", "22.mlp.dense_h_to_4h.bias", "22.mlp.dense_4h_to_h.weight", "22.mlp.dense_4h_to_h.bias", "23.input_layernorm.scale", "23.attention.query_key_value.weight", "23.attention.query_key_value.bias", "23.attention.rotary_emb.inv_freq", "23.attention.dense.weight", "23.attention.dense.bias", "23.post_attention_layernorm.scale", "23.mlp.dense_h_to_4h.weight", "23.mlp.dense_h_to_4h.bias", "23.mlp.dense_4h_to_h.weight", "23.mlp.dense_4h_to_h.bias", "24.input_layernorm.scale", "24.attention.query_key_value.weight", "24.attention.query_key_value.bias", "24.attention.rotary_emb.inv_freq", "24.attention.dense.weight", "24.attention.dense.bias", "24.post_attention_layernorm.scale", "24.mlp.dense_h_to_4h.weight", "24.mlp.dense_h_to_4h.bias", "24.mlp.dense_4h_to_h.weight", "24.mlp.dense_4h_to_h.bias", "25.input_layernorm.scale", "25.attention.query_key_value.weight", "25.attention.query_key_value.bias", "25.attention.rotary_emb.inv_freq", "25.attention.dense.weight", "25.attention.dense.bias", "25.post_attention_layernorm.scale", "25.mlp.dense_h_to_4h.weight", "25.mlp.dense_h_to_4h.bias", "25.mlp.dense_4h_to_h.weight", "25.mlp.dense_4h_to_h.bias", "26.input_layernorm.scale", "26.attention.query_key_value.weight", "26.attention.query_key_value.bias", "26.attention.rotary_emb.inv_freq", "26.attention.dense.weight", "26.attention.dense.bias", "26.post_attention_layernorm.scale", "26.mlp.dense_h_to_4h.weight", "26.mlp.dense_h_to_4h.bias", "26.mlp.dense_4h_to_h.weight", "26.mlp.dense_4h_to_h.bias", "27.input_layernorm.scale", "27.attention.query_key_value.weight", "27.attention.query_key_value.bias", "27.attention.rotary_emb.inv_freq", "27.attention.dense.weight", "27.attention.dense.bias", "27.post_attention_layernorm.scale", "27.mlp.dense_h_to_4h.weight", "27.mlp.dense_h_to_4h.bias", "27.mlp.dense_4h_to_h.weight", "27.mlp.dense_4h_to_h.bias", "28.input_layernorm.scale", "28.attention.query_key_value.weight", "28.attention.query_key_value.bias", "28.attention.rotary_emb.inv_freq", "28.attention.dense.weight", "28.attention.dense.bias", "28.post_attention_layernorm.scale", "28.mlp.dense_h_to_4h.weight", "28.mlp.dense_h_to_4h.bias", "28.mlp.dense_4h_to_h.weight", "28.mlp.dense_4h_to_h.bias", "29.input_layernorm.scale", "29.attention.query_key_value.weight", "29.attention.query_key_value.bias", "29.attention.rotary_emb.inv_freq", "29.attention.dense.weight", "29.attention.dense.bias", "29.post_attention_layernorm.scale", "29.mlp.dense_h_to_4h.weight", "29.mlp.dense_h_to_4h.bias", "29.mlp.dense_4h_to_h.weight", "29.mlp.dense_4h_to_h.bias", "30.input_layernorm.scale", "30.attention.query_key_value.weight", "30.attention.query_key_value.bias", "30.attention.rotary_emb.inv_freq", "30.attention.dense.weight", "30.attention.dense.bias", "30.post_attention_layernorm.scale", "30.mlp.dense_h_to_4h.weight", "30.mlp.dense_h_to_4h.bias", "30.mlp.dense_4h_to_h.weight", "30.mlp.dense_4h_to_h.bias", "31.input_layernorm.scale", "31.attention.query_key_value.weight", "31.attention.query_key_value.bias", "31.attention.rotary_emb.inv_freq", "31.attention.dense.weight", "31.attention.dense.bias", "31.post_attention_layernorm.scale", "31.mlp.dense_h_to_4h.weight", "31.mlp.dense_h_to_4h.bias", "31.mlp.dense_4h_to_h.weight", "31.mlp.dense_4h_to_h.bias", "32.input_layernorm.scale", 
"32.attention.query_key_value.weight", "32.attention.query_key_value.bias", "32.attention.rotary_emb.inv_freq", "32.attention.dense.weight", "32.attention.dense.bias", "32.post_attention_layernorm.scale", "32.mlp.dense_h_to_4h.weight", "32.mlp.dense_h_to_4h.bias", "32.mlp.dense_4h_to_h.weight", "32.mlp.dense_4h_to_h.bias", "33.input_layernorm.scale", "33.attention.query_key_value.weight", "33.attention.query_key_value.bias", "33.attention.rotary_emb.inv_freq", "33.attention.dense.weight", "33.attention.dense.bias", "33.post_attention_layernorm.scale", "33.mlp.dense_h_to_4h.weight", "33.mlp.dense_h_to_4h.bias", "33.mlp.dense_4h_to_h.weight", "33.mlp.dense_4h_to_h.bias", "35.norm.scale", "36.final_linear.weight".
Unexpected key(s) in state_dict: "sequential.0.word_embeddings.weight", "sequential.2.input_layernorm.scale", "sequential.2.attention.query_key_value.weight", "sequential.2.attention.query_key_value.bias", "sequential.2.attention.rotary_emb.inv_freq", "sequential.2.attention.dense.weight", "sequential.2.attention.dense.bias", "sequential.2.post_attention_layernorm.scale", "sequential.2.mlp.dense_h_to_4h.weight", "sequential.2.mlp.dense_h_to_4h.bias", "sequential.2.mlp.dense_4h_to_h.weight", "sequential.2.mlp.dense_4h_to_h.bias", "sequential.3.input_layernorm.scale", "sequential.3.attention.query_key_value.weight", "sequential.3.attention.query_key_value.bias", "sequential.3.attention.rotary_emb.inv_freq", "sequential.3.attention.dense.weight", "sequential.3.attention.dense.bias", "sequential.3.post_attention_layernorm.scale", "sequential.3.mlp.dense_h_to_4h.weight", "sequential.3.mlp.dense_h_to_4h.bias", "sequential.3.mlp.dense_4h_to_h.weight", "sequential.3.mlp.dense_4h_to_h.bias", "sequential.4.input_layernorm.scale", "sequential.4.attention.query_key_value.weight", "sequential.4.attention.query_key_value.bias", "sequential.4.attention.rotary_emb.inv_freq", "sequential.4.attention.dense.weight", "sequential.4.attention.dense.bias", "sequential.4.post_attention_layernorm.scale", "sequential.4.mlp.dense_h_to_4h.weight", "sequential.4.mlp.dense_h_to_4h.bias", "sequential.4.mlp.dense_4h_to_h.weight", "sequential.4.mlp.dense_4h_to_h.bias", "sequential.5.input_layernorm.scale", "sequential.5.attention.query_key_value.weight", "sequential.5.attention.query_key_value.bias", "sequential.5.attention.rotary_emb.inv_freq", "sequential.5.attention.dense.weight", "sequential.5.attention.dense.bias", "sequential.5.post_attention_layernorm.scale", "sequential.5.mlp.dense_h_to_4h.weight", "sequential.5.mlp.dense_h_to_4h.bias", "sequential.5.mlp.dense_4h_to_h.weight", "sequential.5.mlp.dense_4h_to_h.bias", "sequential.6.input_layernorm.scale", "sequential.6.attention.query_key_value.weight", "sequential.6.attention.query_key_value.bias", "sequential.6.attention.rotary_emb.inv_freq", "sequential.6.attention.dense.weight", "sequential.6.attention.dense.bias", "sequential.6.post_attention_layernorm.scale", "sequential.6.mlp.dense_h_to_4h.weight", "sequential.6.mlp.dense_h_to_4h.bias", "sequential.6.mlp.dense_4h_to_h.weight", "sequential.6.mlp.dense_4h_to_h.bias", "sequential.7.input_layernorm.scale", "sequential.7.attention.query_key_value.weight", "sequential.7.attention.query_key_value.bias", "sequential.7.attention.rotary_emb.inv_freq", "sequential.7.attention.dense.weight", "sequential.7.attention.dense.bias", "sequential.7.post_attention_layernorm.scale", "sequential.7.mlp.dense_h_to_4h.weight", "sequential.7.mlp.dense_h_to_4h.bias", "sequential.7.mlp.dense_4h_to_h.weight", "sequential.7.mlp.dense_4h_to_h.bias", "sequential.8.input_layernorm.scale", "sequential.8.attention.query_key_value.weight", "sequential.8.attention.query_key_value.bias", "sequential.8.attention.rotary_emb.inv_freq", "sequential.8.attention.dense.weight", "sequential.8.attention.dense.bias", "sequential.8.post_attention_layernorm.scale", "sequential.8.mlp.dense_h_to_4h.weight", "sequential.8.mlp.dense_h_to_4h.bias", "sequential.8.mlp.dense_4h_to_h.weight", "sequential.8.mlp.dense_4h_to_h.bias", "sequential.9.input_layernorm.scale", "sequential.9.attention.query_key_value.weight", "sequential.9.attention.query_key_value.bias", "sequential.9.attention.rotary_emb.inv_freq", "sequential.9.attention.dense.weight", 
"sequential.9.attention.dense.bias", "sequential.9.post_attention_layernorm.scale", "sequential.9.mlp.dense_h_to_4h.weight", "sequential.9.mlp.dense_h_to_4h.bias", "sequential.9.mlp.dense_4h_to_h.weight", "sequential.9.mlp.dense_4h_to_h.bias", "sequential.10.input_layernorm.scale", "sequential.10.attention.query_key_value.weight", "sequential.10.attention.query_key_value.bias", "sequential.10.attention.rotary_emb.inv_freq", "sequential.10.attention.dense.weight", "sequential.10.attention.dense.bias", "sequential.10.post_attention_layernorm.scale", "sequential.10.mlp.dense_h_to_4h.weight", "sequential.10.mlp.dense_h_to_4h.bias", "sequential.10.mlp.dense_4h_to_h.weight", "sequential.10.mlp.dense_4h_to_h.bias", "sequential.11.input_layernorm.scale", "sequential.11.attention.query_key_value.weight", "sequential.11.attention.query_key_value.bias", "sequential.11.attention.rotary_emb.inv_freq", "sequential.11.attention.dense.weight", "sequential.11.attention.dense.bias", "sequential.11.post_attention_layernorm.scale", "sequential.11.mlp.dense_h_to_4h.weight", "sequential.11.mlp.dense_h_to_4h.bias", "sequential.11.mlp.dense_4h_to_h.weight", "sequential.11.mlp.dense_4h_to_h.bias", "sequential.12.input_layernorm.scale", "sequential.12.attention.query_key_value.weight", "sequential.12.attention.query_key_value.bias", "sequential.12.attention.rotary_emb.inv_freq", "sequential.12.attention.dense.weight", "sequential.12.attention.dense.bias", "sequential.12.post_attention_layernorm.scale", "sequential.12.mlp.dense_h_to_4h.weight", "sequential.12.mlp.dense_h_to_4h.bias", "sequential.12.mlp.dense_4h_to_h.weight", "sequential.12.mlp.dense_4h_to_h.bias", "sequential.13.input_layernorm.scale", "sequential.13.attention.query_key_value.weight", "sequential.13.attention.query_key_value.bias", "sequential.13.attention.rotary_emb.inv_freq", "sequential.13.attention.dense.weight", "sequential.13.attention.dense.bias", "sequential.13.post_attention_layernorm.scale", "sequential.13.mlp.dense_h_to_4h.weight", "sequential.13.mlp.dense_h_to_4h.bias", "sequential.13.mlp.dense_4h_to_h.weight", "sequential.13.mlp.dense_4h_to_h.bias", "sequential.14.input_layernorm.scale", "sequential.14.attention.query_key_value.weight", "sequential.14.attention.query_key_value.bias", "sequential.14.attention.rotary_emb.inv_freq", "sequential.14.attention.dense.weight", "sequential.14.attention.dense.bias", "sequential.14.post_attention_layernorm.scale", "sequential.14.mlp.dense_h_to_4h.weight", "sequential.14.mlp.dense_h_to_4h.bias", "sequential.14.mlp.dense_4h_to_h.weight", "sequential.14.mlp.dense_4h_to_h.bias", "sequential.15.input_layernorm.scale", "sequential.15.attention.query_key_value.weight", "sequential.15.attention.query_key_value.bias", "sequential.15.attention.rotary_emb.inv_freq", "sequential.15.attention.dense.weight", "sequential.15.attention.dense.bias", "sequential.15.post_attention_layernorm.scale", "sequential.15.mlp.dense_h_to_4h.weight", "sequential.15.mlp.dense_h_to_4h.bias", "sequential.15.mlp.dense_4h_to_h.weight", "sequential.15.mlp.dense_4h_to_h.bias", "sequential.16.input_layernorm.scale", "sequential.16.attention.query_key_value.weight", "sequential.16.attention.query_key_value.bias", "sequential.16.attention.rotary_emb.inv_freq", "sequential.16.attention.dense.weight", "sequential.16.attention.dense.bias", "sequential.16.post_attention_layernorm.scale", "sequential.16.mlp.dense_h_to_4h.weight", "sequential.16.mlp.dense_h_to_4h.bias", "sequential.16.mlp.dense_4h_to_h.weight", 
"sequential.16.mlp.dense_4h_to_h.bias", "sequential.17.input_layernorm.scale", "sequential.17.attention.query_key_value.weight", "sequential.17.attention.query_key_value.bias", "sequential.17.attention.rotary_emb.inv_freq", "sequential.17.attention.dense.weight", "sequential.17.attention.dense.bias", "sequential.17.post_attention_layernorm.scale", "sequential.17.mlp.dense_h_to_4h.weight", "sequential.17.mlp.dense_h_to_4h.bias", "sequential.17.mlp.dense_4h_to_h.weight", "sequential.17.mlp.dense_4h_to_h.bias", "sequential.18.input_layernorm.scale", "sequential.18.attention.query_key_value.weight", "sequential.18.attention.query_key_value.bias", "sequential.18.attention.rotary_emb.inv_freq", "sequential.18.attention.dense.weight", "sequential.18.attention.dense.bias", "sequential.18.post_attention_layernorm.scale", "sequential.18.mlp.dense_h_to_4h.weight", "sequential.18.mlp.dense_h_to_4h.bias", "sequential.18.mlp.dense_4h_to_h.weight", "sequential.18.mlp.dense_4h_to_h.bias", "sequential.19.input_layernorm.scale", "sequential.19.attention.query_key_value.weight", "sequential.19.attention.query_key_value.bias", "sequential.19.attention.rotary_emb.inv_freq", "sequential.19.attention.dense.weight", "sequential.19.attention.dense.bias", "sequential.19.post_attention_layernorm.scale", "sequential.19.mlp.dense_h_to_4h.weight", "sequential.19.mlp.dense_h_to_4h.bias", "sequential.19.mlp.dense_4h_to_h.weight", "sequential.19.mlp.dense_4h_to_h.bias", "sequential.20.input_layernorm.scale", "sequential.20.attention.query_key_value.weight", "sequential.20.attention.query_key_value.bias", "sequential.20.attention.rotary_emb.inv_freq", "sequential.20.attention.dense.weight", "sequential.20.attention.dense.bias", "sequential.20.post_attention_layernorm.scale", "sequential.20.mlp.dense_h_to_4h.weight", "sequential.20.mlp.dense_h_to_4h.bias", "sequential.20.mlp.dense_4h_to_h.weight", "sequential.20.mlp.dense_4h_to_h.bias", "sequential.21.input_layernorm.scale", "sequential.21.attention.query_key_value.weight", "sequential.21.attention.query_key_value.bias", "sequential.21.attention.rotary_emb.inv_freq", "sequential.21.attention.dense.weight", "sequential.21.attention.dense.bias", "sequential.21.post_attention_layernorm.scale", "sequential.21.mlp.dense_h_to_4h.weight", "sequential.21.mlp.dense_h_to_4h.bias", "sequential.21.mlp.dense_4h_to_h.weight", "sequential.21.mlp.dense_4h_to_h.bias", "sequential.22.input_layernorm.scale", "sequential.22.attention.query_key_value.weight", "sequential.22.attention.query_key_value.bias", "sequential.22.attention.rotary_emb.inv_freq", "sequential.22.attention.dense.weight", "sequential.22.attention.dense.bias", "sequential.22.post_attention_layernorm.scale", "sequential.22.mlp.dense_h_to_4h.weight", "sequential.22.mlp.dense_h_to_4h.bias", "sequential.22.mlp.dense_4h_to_h.weight", "sequential.22.mlp.dense_4h_to_h.bias", "sequential.23.input_layernorm.scale", "sequential.23.attention.query_key_value.weight", "sequential.23.attention.query_key_value.bias", "sequential.23.attention.rotary_emb.inv_freq", "sequential.23.attention.dense.weight", "sequential.23.attention.dense.bias", "sequential.23.post_attention_layernorm.scale", "sequential.23.mlp.dense_h_to_4h.weight", "sequential.23.mlp.dense_h_to_4h.bias", "sequential.23.mlp.dense_4h_to_h.weight", "sequential.23.mlp.dense_4h_to_h.bias", "sequential.24.input_layernorm.scale", "sequential.24.attention.query_key_value.weight", "sequential.24.attention.query_key_value.bias", "sequential.24.attention.rotary_emb.inv_freq", 
"sequential.24.attention.dense.weight", "sequential.24.attention.dense.bias", "sequential.24.post_attention_layernorm.scale", "sequential.24.mlp.dense_h_to_4h.weight", "sequential.24.mlp.dense_h_to_4h.bias", "sequential.24.mlp.dense_4h_to_h.weight", "sequential.24.mlp.dense_4h_to_h.bias", "sequential.25.input_layernorm.scale", "sequential.25.attention.query_key_value.weight", "sequential.25.attention.query_key_value.bias", "sequential.25.attention.rotary_emb.inv_freq", "sequential.25.attention.dense.weight", "sequential.25.attention.dense.bias", "sequential.25.post_attention_layernorm.scale", "sequential.25.mlp.dense_h_to_4h.weight", "sequential.25.mlp.dense_h_to_4h.bias", "sequential.25.mlp.dense_4h_to_h.weight", "sequential.25.mlp.dense_4h_to_h.bias", "sequential.26.input_layernorm.scale", "sequential.26.attention.query_key_value.weight", "sequential.26.attention.query_key_value.bias", "sequential.26.attention.rotary_emb.inv_freq", "sequential.26.attention.dense.weight", "sequential.26.attention.dense.bias", "sequential.26.post_attention_layernorm.scale", "sequential.26.mlp.dense_h_to_4h.weight", "sequential.26.mlp.dense_h_to_4h.bias", "sequential.26.mlp.dense_4h_to_h.weight", "sequential.26.mlp.dense_4h_to_h.bias", "sequential.27.input_layernorm.scale", "sequential.27.attention.query_key_value.weight", "sequential.27.attention.query_key_value.bias", "sequential.27.attention.rotary_emb.inv_freq", "sequential.27.attention.dense.weight", "sequential.27.attention.dense.bias", "sequential.27.post_attention_layernorm.scale", "sequential.27.mlp.dense_h_to_4h.weight", "sequential.27.mlp.dense_h_to_4h.bias", "sequential.27.mlp.dense_4h_to_h.weight", "sequential.27.mlp.dense_4h_to_h.bias", "sequential.28.input_layernorm.scale", "sequential.28.attention.query_key_value.weight", "sequential.28.attention.query_key_value.bias", "sequential.28.attention.rotary_emb.inv_freq", "sequential.28.attention.dense.weight", "sequential.28.attention.dense.bias", "sequential.28.post_attention_layernorm.scale", "sequential.28.mlp.dense_h_to_4h.weight", "sequential.28.mlp.dense_h_to_4h.bias", "sequential.28.mlp.dense_4h_to_h.weight", "sequential.28.mlp.dense_4h_to_h.bias", "sequential.29.input_layernorm.scale", "sequential.29.attention.query_key_value.weight", "sequential.29.attention.query_key_value.bias", "sequential.29.attention.rotary_emb.inv_freq", "sequential.29.attention.dense.weight", "sequential.29.attention.dense.bias", "sequential.29.post_attention_layernorm.scale", "sequential.29.mlp.dense_h_to_4h.weight", "sequential.29.mlp.dense_h_to_4h.bias", "sequential.29.mlp.dense_4h_to_h.weight", "sequential.29.mlp.dense_4h_to_h.bias", "sequential.30.input_layernorm.scale", "sequential.30.attention.query_key_value.weight", "sequential.30.attention.query_key_value.bias", "sequential.30.attention.rotary_emb.inv_freq", "sequential.30.attention.dense.weight", "sequential.30.attention.dense.bias", "sequential.30.post_attention_layernorm.scale", "sequential.30.mlp.dense_h_to_4h.weight", "sequential.30.mlp.dense_h_to_4h.bias", "sequential.30.mlp.dense_4h_to_h.weight", "sequential.30.mlp.dense_4h_to_h.bias", "sequential.31.input_layernorm.scale", "sequential.31.attention.query_key_value.weight", "sequential.31.attention.query_key_value.bias", "sequential.31.attention.rotary_emb.inv_freq", "sequential.31.attention.dense.weight", "sequential.31.attention.dense.bias", "sequential.31.post_attention_layernorm.scale", "sequential.31.mlp.dense_h_to_4h.weight", "sequential.31.mlp.dense_h_to_4h.bias", 
"sequential.31.mlp.dense_4h_to_h.weight", "sequential.31.mlp.dense_4h_to_h.bias", "sequential.32.input_layernorm.scale", "sequential.32.attention.query_key_value.weight", "sequential.32.attention.query_key_value.bias", "sequential.32.attention.rotary_emb.inv_freq", "sequential.32.attention.dense.weight", "sequential.32.attention.dense.bias", "sequential.32.post_attention_layernorm.scale", "sequential.32.mlp.dense_h_to_4h.weight", "sequential.32.mlp.dense_h_to_4h.bias", "sequential.32.mlp.dense_4h_to_h.weight", "sequential.32.mlp.dense_4h_to_h.bias", "sequential.33.input_layernorm.scale", "sequential.33.attention.query_key_value.weight", "sequential.33.attention.query_key_value.bias", "sequential.33.attention.rotary_emb.inv_freq", "sequential.33.attention.dense.weight", "sequential.33.attention.dense.bias", "sequential.33.post_attention_layernorm.scale", "sequential.33.mlp.dense_h_to_4h.weight", "sequential.33.mlp.dense_h_to_4h.bias", "sequential.33.mlp.dense_4h_to_h.weight", "sequential.33.mlp.dense_4h_to_h.bias", "sequential.35.norm.scale", "sequential.36.final_linear.weight".

@StellaAthena StellaAthena self-assigned this Apr 17, 2023
@StellaAthena StellaAthena added the bug ("Something isn't working") label Apr 17, 2023
@curt-tigges (Contributor, Author)

I have read the CLA Document and I hereby sign the CLA

@Quentin-Anthony (Member)

@crazyofapple -- You're seeing this error because you saved a sequential checkpoint before this PR (with self.pipe_parallel_size >= 2, which produced a sequential model/checkpoint) and are now trying to load it with this PR applied (with self.pipe_parallel_size >= 1), which tries to convert the checkpoint to a GPT2ModelPipe and fails.

If you need to load those model weights intact, you'll have to leave this commit out. Otherwise, delete that old checkpoint and update to this commit.
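
For anyone who does need to reuse such an old checkpoint: the "Unexpected key(s)" above carry a `sequential.` prefix while the "Missing key(s)" do not, so one possible workaround (not part of this PR, and untested here) is to remap the keys before calling load_state_dict. A minimal sketch, assuming the prefix is the only difference and that the checkpoint path and top-level layout below match your setup:

```python
import torch

# Hypothetical key-remapping workaround: strip the "sequential." prefix from the
# old checkpoint so its keys line up with what GPT2ModelPipe expects, e.g.
# "sequential.0.word_embeddings.weight" -> "0.word_embeddings.weight".
# The file path and the "module" wrapper key are illustrative assumptions.
ckpt = torch.load("old_checkpoint/mp_rank_00_model_states.pt", map_location="cpu")
state_dict = ckpt["module"] if isinstance(ckpt, dict) and "module" in ckpt else ckpt

remapped = {
    (k[len("sequential."):] if k.startswith("sequential.") else k): v
    for k, v in state_dict.items()
}

# model = ...  # the GPT2ModelPipe instance built after this PR
# missing, unexpected = model.load_state_dict(remapped, strict=False)
# print(missing, unexpected)  # sanity-check that the remap resolved the mismatch
```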

@Quentin-Anthony Quentin-Anthony merged commit 1faff79 into main Apr 21, 2023
@Quentin-Anthony Quentin-Anthony deleted the curt/parallel-inference branch April 21, 2023 17:21
bzantium pushed a commit that referenced this pull request Apr 26, 2023
* add flash_attn_kvpacked

* fix formatting

* accept changes from main & resolve conflicts

* Error

Signed-off-by: Dashiell Stander <[email protected]>

* errors

Signed-off-by: Dashiell Stander <[email protected]>

* feat(ci): add pip caching to CI

* Set training attribute appropriately

Signed-off-by: Dashiell Stander <[email protected]>

* Split up FlashAttention methods

Signed-off-by: Dashiell Stander <[email protected]>

* Comment out clear_cache

Signed-off-by: Dashiell Stander <[email protected]>

* Just remove clear_cache

Signed-off-by: Dashiell Stander <[email protected]>

* Fix pre-commit formatting

Signed-off-by: Dashiell Stander <[email protected]>

* Changed is_pipe_parallel setting to fix pipeline-parallel inference (#866)

* Changed is_pipe_parallel setting to fix pipeline-parallel inference

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>

* feat: improve typing

* Added DeeperSpeed to requirements.txt

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Update train.py

update train.py 
1. black formatter.
2. remove unnecessary import
3. add more arguments

* Update utils.py

Black formatting
Add logic required to expand "~"

* Update train.py

removed num_proc
temporarily disabled emoji
added continuing subword prefix option (does not work well with ByteLevel)

* Update utils.py

improve reader error handling

* Update train.py

add whitespace related handling.
add whitespace argument expose
reconstruct pre_tokenizer_list
add more whitespace to check tokenizer invertibility

* Update train.py

* Update utils.py

remove unnecessary print

* Update train.py

set dropout default to None
import path related code.
Change normalizer
change buffer_tokens
change whitespace reservation handling

* Update train.py

Clear whitespace_reservation TODO
add single_whitespace argument (might be necessary for invertibility)

* Create .gitignore

add gitignore file to ignore artifacts

* Update train.py

add directory parsing error checks
add more metrics
(tokenizer reconstructions, unicode fallback portion)

* Update preprocess.py

path handling changes
black formatting

* Update train.py

change from GPT2TokenizerFast to PreTrainedTokenizerFast class

* Update train.py

enhanced test string

* Update utils.py

add logic to handle jsonl, txt input
add logic to handle folder with jsonl,txt or arrow dataset

* Update train.py

add byte_fallback option expose
(incompatible with current transformer wrapper)
change dataset_loading with new util.py
add dataset shuffling option

* Update utils.py

fix error in loading sequence

* Update train.py

fix whitespace preservation logic

* Update train.py

simplify data loading logic.
remove unnecessary special tokens

* Update train.py

remove emoji related code

* Update train.py

add whitespace processing regex
r"\s{16,}"

* update tokenizer

add whitespace pretokenizer
(only processes looong whitespaces)

* Update train.py

* Update train.py

add camel case regex

* Update train.py

separate camel_case regex

* Update train.py

* Update train.py

---------

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: Satpal Singh Rathore <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Saurav Maheshkar <[email protected]>
Co-authored-by: Stella Biderman <[email protected]>
Co-authored-by: Curt Tigges <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>