[pull] main from EleutherAI:main #2

Merged
105 commits merged on Jul 29, 2024
Changes from 1 commit
Commits
105 commits
e277bc7
fix lion optimizer documentation (#1067)
jahatef Oct 31, 2023
f574f22
Fix preprocess_data.py link (#1064)
Quentin-Anthony Oct 31, 2023
fcc5af5
Edge-casing for multi-GPU HF-to-NeoX conversion (#1065)
haileyschoelkopf Nov 1, 2023
8c9fc00
Create tools __init__.py for import (#1068)
Quentin-Anthony Nov 1, 2023
a10f69c
Pin version of `lm_eval` (#1070)
haileyschoelkopf Nov 1, 2023
41f019e
fixed case when ntasks_per_node is used instead (#1069)
AIproj Nov 1, 2023
90aa131
Update README.md
StellaAthena Nov 5, 2023
04dc2ba
When processing mlp.dense_4h_to_h.bias and attention.dense.bias, tp_r…
kyuheejang Nov 7, 2023
f214358
Merge pull request #1072 from kyuheejang/Fixing-neox-to-huggingface
StellaAthena Nov 7, 2023
d8028f8
Resolve error in the `test_neoxargs_usage` unit test (#1074)
mkerin Nov 8, 2023
10bf788
Update neox_args.py (#1081)
jahatef Nov 16, 2023
f48d3a6
Update README.md (#1082)
StellaAthena Nov 22, 2023
efea81f
Update README.md
StellaAthena Nov 30, 2023
3be59a4
Extend ci suite (#1080)
mkerin Dec 4, 2023
a2b2020
Patch coverity scan (#1090)
jaimemcc-intel Dec 4, 2023
050f560
Corrects FLOPs formula as per 1093 (#1094)
StellaAthena Dec 6, 2023
f19b2ec
Update CODEOWNERS
StellaAthena Dec 19, 2023
07166da
Bump transformers from 4.30.2 to 4.36.0 in /requirements (#1097)
dependabot[bot] Dec 20, 2023
9283eff
Pins old DeeperSpeed until bug is fixed (#1095)
StellaAthena Dec 20, 2023
9eef954
Update README.md
StellaAthena Dec 22, 2023
a48e09e
Update README.md
StellaAthena Dec 22, 2023
613e5a6
Update NeoXArgs docs automatically
invalid-email-address Dec 22, 2023
be7eeda
Update README.md
StellaAthena Dec 22, 2023
2117afc
Update README.md
StellaAthena Dec 22, 2023
8dba5b6
Update NeoXArgs docs automatically
invalid-email-address Dec 22, 2023
f161245
Add QK Normalization (#1100)
lintangsutawika Dec 22, 2023
7fb3b3c
Update README.md
StellaAthena Dec 22, 2023
a7509f0
Update README.md
StellaAthena Dec 22, 2023
8eaac4e
Merge branch 'main' into StellaAthena-patch-4-1
StellaAthena Dec 22, 2023
4d5a811
Update NeoXArgs docs automatically
invalid-email-address Dec 22, 2023
05cc29c
Merge pull request #1099 from EleutherAI/StellaAthena-patch-4-1
StellaAthena Dec 22, 2023
e25446e
Merge branch 'main' into StellaAthena-patch-4
StellaAthena Dec 22, 2023
287f9f7
Merge pull request #1102 from EleutherAI/StellaAthena-patch-4
StellaAthena Dec 22, 2023
b27e409
Lm eval 0.4.0 support (#1101)
haileyschoelkopf Dec 23, 2023
1148a0f
Update README.md
StellaAthena Dec 23, 2023
e5a7ea7
Update neox_args.py (#1107)
StellaAthena Dec 26, 2023
eca6b1a
Fix repo for CI (#1106)
yang Jan 4, 2024
98716eb
Fix install, Dockerfile, CI (#1104)
yang Jan 4, 2024
77605ca
Fused Rotary Embeddings (fixed) (#1108)
yang Jan 5, 2024
f14782a
Add pythia 14M and 31M configs (#1111)
segyges Jan 5, 2024
e6e944a
Add docker compose and change containerized setup instructions to use…
segyges Jan 9, 2024
92b1b6f
Fix openwebtext2 downloader, backport improvements to DataDownloader …
segyges Jan 11, 2024
90f70ff
Bump jinja2 from 3.1.2 to 3.1.3 in /requirements (#1120)
dependabot[bot] Jan 13, 2024
6399155
Enable passing of `--account` to `srun` / SlurmLauncher (#1126)
haileyschoelkopf Jan 19, 2024
7a8fa2f
update copyrights (#1128)
jahatef Jan 24, 2024
3d8fec0
fused layernorm (#1105)
yang Jan 26, 2024
e5602c3
Contributing Guide (#1138)
jahatef Jan 29, 2024
1c133bf
moved eval import and added to docs (#1139)
R0n12 Jan 30, 2024
032ec8c
Update lm_eval v0.4 to PyPI dependencies (#1141)
haileyschoelkopf Feb 1, 2024
91c44bc
Remove gas (beano) (#1144)
segyges Feb 5, 2024
f7373f8
Improve Conversion Utilities (#1124)
haileyschoelkopf Feb 8, 2024
412cf6e
Fixes distributed tests, and skips tests that are broken. (#1149)
jahatef Feb 21, 2024
46d179c
Memory profiling (#1153)
jahatef Feb 21, 2024
eee03b2
add profiling to readme (#1154)
jahatef Feb 23, 2024
a7638a8
Python version update (#1122)
segyges Feb 23, 2024
72d1803
Minor changes (#1125)
segyges Feb 23, 2024
f36aed7
Draft PR Adding mistral 0.1 (#1131)
AIproj Feb 23, 2024
9663802
[Bug?] Fix profiling argument names (#1155)
haileyschoelkopf Feb 26, 2024
3c03fc7
Update cpu_ci.yml (#1159)
jaimemcc-intel Feb 29, 2024
19596b0
Improve argument validation for Flash-attn + SWA (#1162)
haileyschoelkopf Mar 2, 2024
119950c
Single node Pythia 14M training on ngc pytorch 24.02 container (#1170)
tf-nv Mar 4, 2024
7b8187a
Remove unnecessary fp32/bf16 conversion (#1169)
DayOfThePenguin Mar 4, 2024
31cfe52
Ignore markdown for pre-commit (#1171)
Quentin-Anthony Mar 4, 2024
e109bf5
Make rotary freqs buffer non-persistent (#1168)
haileyschoelkopf Mar 4, 2024
df8cf24
Support Lion with Zero Optimizer (#1166)
DayOfThePenguin Mar 4, 2024
86758c3
Add MoE (#1129)
yang Mar 7, 2024
63b9fa1
remove `best_download` as dependency (#1179)
haileyschoelkopf Mar 8, 2024
90d4cb3
Fix documentation for --jsonl-keys argument of preprocess_data script…
KeitaW Mar 8, 2024
8c13642
clean up dockerfile: (#1175)
tf-nv Mar 8, 2024
c1fa994
When using kv cache and flash attention in conjunction, it's crucial …
chaochen99 Mar 8, 2024
1e7abe7
Remove gas from Pythia configs (#1181)
yang Mar 8, 2024
82ddc66
Fix moe_loss in gpt_j_residual path (#1180)
yang Mar 8, 2024
6809bbc
Add Mamba Architecture (#1157)
haileyschoelkopf Mar 10, 2024
03186de
Switch to using Cuda Flash Attn for Alibi (#1183)
haileyschoelkopf Mar 13, 2024
277141e
Mamba + Tensor Parallel Support (#1184)
haileyschoelkopf Mar 15, 2024
7267a74
[ZeRO-3] Partitioned init with `deepspeed.zero.Init()` (#1190)
R0n12 Mar 19, 2024
e6b5261
Small typo in the README
Mar 26, 2024
4085302
Merge pull request #1196 from edouardoyallon/typo_readme
StellaAthena Mar 26, 2024
1960b66
Added more papers
StellaAthena Mar 26, 2024
3616658
Update README.md
StellaAthena Mar 26, 2024
977448e
making PR triggered CPU test for changes to megatron (#1195)
jaimemcc-intel Apr 1, 2024
51a7de9
[AMD] Supporting fused kernels build using JIT (#1188)
R0n12 Apr 1, 2024
01657aa
[ZeRO-3] Ensured passing neox deepspeed_config when using partitioned…
R0n12 Apr 1, 2024
703d02f
Fix flash config for llama2/70B.yml config (#1206)
Quentin-Anthony Apr 24, 2024
838d5bf
Fixes a weird typo (#1207)
StellaAthena Apr 25, 2024
9d9d7c8
Bump transformers from 4.36.0 to 4.38.0 in /requirements (#1199)
dependabot[bot] May 4, 2024
06e5f0c
Jaimemcc intel/ci composite cpu tests (#1205)
jaimemcc-intel May 4, 2024
916c883
Add megablocks dropless MoE (#1192)
yang May 4, 2024
c814959
Fix bug in tools/ckpts/convert_neox_to_hf.py for setting intermediate…
jvendrow May 4, 2024
4bc6670
add rwkv support (#1198)
jahatef May 6, 2024
49cd41f
Bump jinja2 from 3.1.3 to 3.1.4 in /requirements (#1211)
dependabot[bot] May 13, 2024
d037756
Run document update again (#1216)
jahatef May 16, 2024
153e732
Rwkv pipeline parallelism (#1221)
jahatef May 21, 2024
2746d43
Add Torch Profiler Support (#1226)
DayOfThePenguin May 21, 2024
1d55708
fixed fused_rope naming in JIT + added readme for amd support (#1224)
R0n12 May 21, 2024
d3d59f2
Small tidying (#1222)
yang May 21, 2024
dfc6722
Fix markdown formatting error (#1217)
StellaAthena May 26, 2024
b5c0afe
add workflow_dispatch to gh actions pr so we can run on command (#1233)
jahatef Jun 4, 2024
4a34e0a
init changes to README (#1232)
jaimemcc-intel Jun 5, 2024
90a6cdb
fix summed biases not being divided by mp size (#1220)
dmahan93 Jun 7, 2024
2382bd4
Fix changed behavior of pipe_parallel (#1219)
yang Jun 7, 2024
4c426da
Conversion script bugfixes (#1218)
haileyschoelkopf Jun 7, 2024
2608972
fix python version and pytest install (#1234)
jahatef Jun 19, 2024
0e5f6db
Add a chat data preprocessing script (#1239)
dmahan93 Jun 25, 2024
1cee5b7
Fix paper reference in init_functions.py (#1241)
rasbt Jun 28, 2024
Add megablocks dropless MoE (EleutherAI#1192)
* Add megablocks dropless MoE

* pre-commit

---------

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
3 people committed May 4, 2024
commit 916c88357fdbee5107574da156585addd17b31bb
75 changes: 75 additions & 0 deletions README.md
@@ -52,6 +52,7 @@ Prior to 3/9/2023, GPT-NeoX relied on [DeeperSpeed](https://github.com/EleutherA
+ [Containerized Setup](#containerized-setup)
* [Usage](#usage)
- [Configuration](#configuration)
* [Mixture of Experts](#mixture-of-experts)
- [Datasets](#datasets)
* [Preconfigured Datasets](#preconfigured-datasets)
* [Using Custom Data](#using-custom-data)
@@ -322,6 +323,80 @@ These files are generally complete, but non-optimal. For example, depending on y

For a more detailed guide to the features available and how to configure them, see [the configuration README](configs/README.md), and for documentation of every possible argument, see [configs/neox_arguments.md](configs/neox_arguments.md).

## Mixture of Experts

GPT-NeoX includes multiple expert implementations for MoE. To select between them, set `moe_type` to `megablocks` (default) or `deepspeed`.

Both are based on the DeepSpeed MoE parallelism framework, which supports tensor-expert-data parallelism.
Both allow you to toggle between token-dropping and dropless routing (the default, and what MegaBlocks was designed for).
Sinkhorn routing is coming soon!

For a basic but complete example configuration, see configs/125M-dmoe.yml (for MegaBlocks dropless) or configs/125M-moe.yml (for DeepSpeed).

Most MoE-related configuration arguments are prefixed with `moe`. Some common configuration parameters and their defaults are as follows:

```
moe_type: megablocks
moe_num_experts: 1 # 1 disables MoE. 8 is a reasonable value.
moe_loss_coeff: 0.1
expert_interval: 2 # See details below
enable_expert_tensor_parallelism: false # See details below
moe_expert_parallel_size: 1 # See details below
moe_token_dropping: false
```

DeepSpeed can be further configured with the following:

```
moe_top_k: 1
moe_min_capacity: 4
moe_train_capacity_factor: 1.0 # Setting to 1.0 sizes each expert's capacity at the average number of tokens per expert
moe_eval_capacity_factor: 1.0 # Setting to 1.0 sizes each expert's capacity at the average number of tokens per expert
```

One MoE layer is present every `expert_interval` transformer layers including the first, so with 12 layers total:

```
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
```

Experts would be in these layers:

```
0, 2, 4, 6, 8, 10
```

By default, we use expert-data parallelism: in expert layers, the GPUs that would otherwise be used for tensor parallelism (`model_parallel_size`) are instead used for expert routing. For instance, given the following:

```
expert_parallel_size: 4
model_parallel_size: 2 # aka tensor parallelism
```

With 32 GPUs, the behavior will look like this:

- In non-expert layers:
  - Tensor parallelism is 2. (There are 32 / 2 = 16 such tensor parallel groups, each of size 2.)
  - Data parallelism implicitly becomes 32 / 2 = 16.
- In expert layers:
  - There is no tensor parallelism.
  - Expert parallelism is 4. (There are 32 / 4 = 8 expert parallel groups, each of size 4.)
  - Data parallelism implicitly becomes 32 / 4 = 8. Some cross-node token routing happens as a result of this redivision of data parallelism between 16 and 8. To avoid it, ensure that `expert_parallel_size == model_parallel_size`.

Setting `enable_expert_tensor_parallelism` enables tensor-expert-data (TED) parallelism. The way to interpret the above would then be:

- In non-expert layers: same as before.
- In expert layers:
  - Tensor parallelism is 2. (There are 32 / 2 = 16 tensor parallel groups, each of size 2.)
  - Expert parallelism is 4. (There are 32 / 4 = 8 expert parallel groups, each of size 4.)
  - Data parallelism implicitly becomes 32 / (2 * 4) = 4. Again, cross-node token routing happens. To avoid it, ensure `expert_parallel_size == 1` or `model_parallel_size == 1`.

Note that DP must be divisible by (MP * EP). For more details, see the [TED paper].
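
To make the arithmetic above concrete, here is a small sketch (hypothetical helper, not NeoX code) that reproduces the group sizes for the 32-GPU example in both modes:

```
# Hypothetical sketch reproducing the group arithmetic above (not NeoX code).
def moe_group_sizes(world_size, model_parallel, expert_parallel,
                    enable_expert_tensor_parallelism=False):
    # Non-expert layers: data parallelism is whatever is left after tensor parallelism.
    dense_dp = world_size // model_parallel
    if enable_expert_tensor_parallelism:
        # TED parallelism: expert layers keep tensor parallelism, so DP shrinks further.
        assert world_size % (model_parallel * expert_parallel) == 0
        expert_dp = world_size // (model_parallel * expert_parallel)
    else:
        # Expert-data parallelism: expert layers drop tensor parallelism entirely.
        assert world_size % expert_parallel == 0
        expert_dp = world_size // expert_parallel
    return dense_dp, expert_dp

assert moe_group_sizes(32, 2, 4, False) == (16, 8)  # expert-data parallelism
assert moe_group_sizes(32, 2, 4, True) == (16, 4)   # TED parallelism
```

Whenever the two data-parallel degrees differ, tokens are re-sharded between the dense and expert layouts, which is the cross-node routing noted above.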

Pipeline parallelism is not yet supported - coming soon!

[TED paper]: https://arxiv.org/abs/2303.06318

# Datasets

## Preconfigured Datasets
101 changes: 101 additions & 0 deletions configs/125M-dmoe.yml
@@ -0,0 +1,101 @@
# GPT-2 pretraining setup
{
  # See README for MoE config docs!
  "moe_type": "megablocks",
  "moe_token_dropping": false,
  # Have 4 experts per layer (every 2 layers by default)
  "moe_num_experts": 4,
  # parallelism settings
  "enable_expert_tensor_parallelism": true,
  "pipe_parallel_size": 1, # not yet supported for MoE
  "model_parallel_size": 1,
  "moe_expert_parallel_size": 1,

  # model settings
  "num_layers": 12,
  "hidden_size": 768,
  "num_attention_heads": 12,
  "seq_length": 2048,
  "max_position_embeddings": 2048,
  "norm": "layernorm",
  "pos_emb": "rotary",
  "no_weight_tying": true,
  "gpt_j_residual": false,
  "output_layer_parallelism": "column",

  # these should provide some speedup but take a while to build, set to true if desired
  "scaled_upper_triang_masked_softmax_fusion": false,
  "bias_gelu_fusion": false,
  "rope_fusion": false,

  # init methods
  "init_method": "small_init",
  "output_layer_init_method": "wang_init",


  # optimizer settings
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0006,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8,
    }
  },
  "min_lr": 0.00006,

  # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
  },

  # batch / data settings
  "train_micro_batch_size_per_gpu": 4,
  "data_impl": "mmap",

  # activation checkpointing
  "checkpoint_activations": true,
  "checkpoint_num_layers": 1,
  "partition_activations": true,
  "synchronize_each_layer": true,

  # regularization
  "gradient_clipping": 1.0,
  "weight_decay": 0.1,
  "hidden_dropout": 0.0,
  "attention_dropout": 0.0,

  # precision settings
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  # misc. training settings
  "train_iters": 320000,
  "lr_decay_iters": 320000,
  "distributed_backend": "nccl",
  "lr_decay_style": "cosine",
  "warmup": 0.01,
  "checkpoint_factor": 10000,
  "eval_interval": 1000,
  "eval_iters": 10,

  # logging
  "log_interval": 10,
  "steps_per_print": 10,
  "keep_last_n_checkpoints": 4,
  "wall_clock_breakdown": true,

  # networking
  "hostfile": "/mock_path"
}
16 changes: 7 additions & 9 deletions configs/125M-moe.yml
@@ -1,15 +1,13 @@
# GPT-2 pretraining setup
{
# See README for MoE config docs!
"moe_type": "deepspeed",
"moe_token_dropping": true,
# Have 4 experts per layer (every 2 layers by default)
# So with 12 layers total:
# 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
# Experts would be in layers:
# 0, 2, 4, 6, 8, 10
"num_experts": 4,

# parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
# across the node boundaries )
"pipe_parallel_size": 1,
"moe_num_experts": 4,
# parallelism settings
"enable_expert_tensor_parallelism": true,
"pipe_parallel_size": 1, # not yet supported for MoE
"model_parallel_size": 1,
"moe_expert_parallel_size": 1,

12 changes: 6 additions & 6 deletions megatron/data/helpers.cpp
@@ -428,9 +428,9 @@ py::array build_mapping_impl(const py::array_t<int64_t>& docs_,
}

} // for (auto sent_index=sent_index_first; ...
} // if (num_remain_sent > 1) {
} // for (int doc=0; doc < num_docs; ++doc) {
} // for (int epoch=0; epoch < num_epochs; ++epoch) {
} // if (num_remain_sent > 1) {
} // for (int doc=0; doc < num_docs; ++doc) {
} // for (int epoch=0; epoch < num_epochs; ++epoch) {

if (!second) {
if (verbose) {
@@ -660,9 +660,9 @@ py::array build_blocks_mapping_impl(const py::array_t<int64_t>& docs_,
num_sent = 0;
}
} // for (auto sent_index=sent_index_first; ...
} // if (num_remain_sent > 1) {
} // for (int doc=0; doc < num_docs; ++doc) {
} // for (int epoch=0; epoch < num_epochs; ++epoch) {
} // if (num_remain_sent > 1) {
} // for (int doc=0; doc < num_docs; ++doc) {
} // for (int epoch=0; epoch < num_epochs; ++epoch) {

if (!second) {
if (verbose) {
34 changes: 34 additions & 0 deletions megatron/model/megablocks_utils.py
@@ -0,0 +1,34 @@
"""Adapter to expose MegaBlocks package, if available."""

try:
import megablocks
except ImportError:
megablocks = None


def megablocks_is_available():
return megablocks is not None


def assert_megablocks_is_available():
assert (
megablocks_is_available()
), "MegaBlocks not available. Please run `pip install megablocks`."


moe = megablocks.layers.moe if megablocks_is_available() else None
dmoe = megablocks.layers.dmoe if megablocks_is_available() else None
arguments = megablocks.layers.arguments if megablocks_is_available() else None


def as_megablocks_args(neox_args):
import copy

tmp = copy.copy(neox_args)
delattr(tmp, "mlp_type")
tmp.mlp_type = "mlp"
args = arguments.from_megatron(tmp)
args.moe_lbl_in_fp32 = True
args.fp16 = neox_args.precision == "fp16"
args.moe_loss_weight = neox_args.moe_loss_coeff
return args
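
For context only (not part of this diff), a minimal usage sketch of the adapter above; `build_expert_layer` is a hypothetical name, and it assumes the MegaBlocks `MoE`/`dMoE` layers are constructed from the converted arguments object:

```
# Hypothetical usage of megablocks_utils -- not code from this PR.
from megatron.model import megablocks_utils

def build_expert_layer(neox_args):
    # Fail fast if the optional dependency is missing.
    megablocks_utils.assert_megablocks_is_available()
    mb_args = megablocks_utils.as_megablocks_args(neox_args)
    # Assumed split: dropless dMoE when token dropping is off, capacity-based MoE otherwise.
    if neox_args.moe_token_dropping:
        return megablocks_utils.moe.MoE(mb_args)
    return megablocks_utils.dmoe.dMoE(mb_args)
```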