🍼 Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

😎 This is the official implementation of the paper Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts.

🔍 TL;DR: We devise a dynamic data sampling method for maximizing the instruction tuning efficacy of MoE models.

✨ Novelty: We use the token routing preference of MoE to build dataset-level representations and dynamically adjust the sampling weight of datasets. In other words, we babysit MoE models by feeding their favorite data soup 🍼.

Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales. However, previous methods simply merge all training tasks (e.g. creative writing, coding, and mathematics) and apply fixed sampling weights, without considering the importance of different tasks as the model training state changes. In this way, the most helpful data cannot be effectively distinguished, leading to suboptimal model performance. To reduce the potential redundancies of datasets, we make the first attempt and propose a novel dynamic data mixture for MoE instruction tuning. Specifically, inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets. Finally, we propose to dynamically adjust the sampling weight of datasets by their inter-redundancies, thus maximizing global performance under a limited training budget. The experimental results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries.
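
The weighting idea above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation. It assumes each dataset is represented by its normalized per-expert routed-token counts, that redundancy is measured by the mean L2 distance to the other datasets, and that weights are updated multiplicatively and then smoothed toward uniform. The function names and the `eta` and `smoothing` parameters are hypothetical.

```python
import math


def dataset_representation(expert_token_counts):
    """Normalize per-expert routed-token counts into a distribution."""
    total = sum(expert_token_counts)
    return [count / total for count in expert_token_counts]


def mean_pairwise_distance(reps):
    """For each dataset, the mean L2 distance to every other dataset (needs >= 2 datasets)."""
    result = []
    for i, rep_i in enumerate(reps):
        dists = [math.dist(rep_i, rep_j) for j, rep_j in enumerate(reps) if j != i]
        result.append(sum(dists) / len(dists))
    return result


def update_sampling_weights(weights, distances, eta=1.0, smoothing=0.05):
    """Assumed update rule: softmax(log w + eta * distance), mixed with a uniform prior.

    All input weights must be strictly positive.
    """
    logits = [math.log(w) + eta * d for w, d in zip(weights, distances)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    norm = sum(exps)
    n = len(weights)
    return [(1 - smoothing) * (e / norm) + smoothing / n for e in exps]


# Hypothetical routed-token counts per expert for three datasets.
counts = [[50, 50, 0, 0], [50, 50, 0, 0], [0, 0, 50, 50]]
reps = [dataset_representation(c) for c in counts]
weights = update_sampling_weights([1 / 3] * 3, mean_pairwise_distance(reps))
print(weights)  # the third dataset gets the largest weight
```

In this toy example the first two datasets route tokens identically, so they are treated as redundant with each other and the third dataset is sampled more often.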

[Figures: Intuition | Algorithm]

🌴 Setup

Installation

# base env: cuda==11.8, python==3.11, torch==2.1.2+cu118, transformers==4.36.2
conda install git
conda install conda-forge::git-lfs
# install deps
pip install wandb
pip install "fschat[model_worker,webui,llm_judge]"
pip install python-dotenv
# install flash-attn
pip install flash-attn --no-build-isolation
# install vllm under cuda-11.8
pip install https://github.com/vllm-project/vllm/releases/download/v0.2.7/vllm-0.2.7+cu118-cp311-cp311-manylinux1_x86_64.whl
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118
# other deps for eval
pip install langdetect
pip install git+https://github.com/bigscience-workshop/promptsource.git
pip install immutabledict
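
After installing, it can help to confirm that the pinned versions are the ones actually in the environment. The following is a small sketch using only the standard library. The pins are copied from the install commands above, while the helper names `installed_version` and `check_pins` are made up for this example.

```python
from importlib import metadata

# Pins taken from the "base env" comment and the vllm wheel URL above.
EXPECTED = {"torch": "2.1.2", "transformers": "4.36.2", "vllm": "0.2.7"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for package, want in expected.items():
        found = lookup(package)
        # Ignore the local build tag, so "2.1.2+cu118" matches "2.1.2".
        if found is None or found.split("+")[0] != want:
            mismatches[package] = (want, found)
    return mismatches


if __name__ == "__main__":
    for package, (want, found) in check_pins(EXPECTED).items():
        print(f"{package}: expected {want}, found {found or 'not installed'}")
```

The script prints nothing when every pinned package matches.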

Training Preparation

# download training data
mkdir -p data/four_types_mix
huggingface-cli download Spico/dynamic-moe-sft-instructions --repo-type dataset --local-dir data/four_types_mix --local-dir-use-symlinks False
# download models - you may want to change the save folder at your convenience
# llama-moe
huggingface-cli download llama-moe/LLaMA-MoE-v1-3_5B-2_8 --repo-type model --local-dir /mnt/petrelfs/zhutong/llama-moe-models/LLaMA-MoE-v1-3_5B-2_8-new --local-dir-use-symlinks False
# overwrite model files
cp src/models/llama_moe/*.py /mnt/petrelfs/zhutong/llama-moe-models/LLaMA-MoE-v1-3_5B-2_8-new
# moduleformer
huggingface-cli download ibm/MoLM-700M-4B --repo-type model --local-dir /mnt/petrelfs/zhutong/llama-moe-models/MoLM-700M-4B --local-dir-use-symlinks False
# overwrite model files
cp src/models/moduleformer/*.py /mnt/petrelfs/zhutong/llama-moe-models/MoLM-700M-4B

Setup for Evaluation

# K&R evaluation except for MBPP
# commit: 89618bf8421d27c8cf28004d616b33fc5b305ceb
# 2024-01-16 16:45:29 commit: 032e879bf5ff39c08ae0db1f622a5b382a42eaa2
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 032e879bf5ff39c08ae0db1f622a5b382a42eaa2
cp ../lm_eval.patch .
git apply lm_eval.patch
pip install -e .
# return to the repository root before setting up the next harness
cd ..

# code evaluation
# commit: 9cfa52b
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
git checkout 9cfa52b
# change `pyext==0.5` in `bigcode-evaluation-harness/requirements.txt`, ref: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/181
cp ../code_eval.patch .
git apply code_eval.patch
pip install -e .

🚀 QuickStart

Training

The training files are located in scripts/llama_moe and scripts/moduleformer. We use a GPU cluster with Slurm resource scheduling. You can run the following command to start training.

sbatch scripts/llama_moe/llama_moe_dynamic.sh

Or, if your environment does not support Slurm, you can run the following command to start training.

task_name="llama_moe_dynamic"
model_type="auto"
model_name_or_path="/mnt/petrelfs/zhutong/llama-moe-models/LLaMA-MoE-v1-3_5B-2_8-new"
dataset_dir_or_path="data/four_types_mix/train"
eval_data_dir="data/four_types_mix/dev"
head_node="localhost"  # assumed value for a single-node run; on a multi-node cluster, use the rendezvous host's address

comment="llama-moe 2/8, four type mix, dynamic baseline, 4 gpus, eval_steps 100, max_eval_steps 5, w/ balance loss, w/ freeze gate, w/ gate noise"
base_dir="outputs"
output_dir="${base_dir}/${task_name}"
mkdir -p $output_dir
git diff > $output_dir/diff.patch
env > $output_dir/env
echo -e "Git commit: $(git log -1 --oneline)\n\nGit branch: $(git branch | grep "*")\n\nComment: ${comment}" > $output_dir/comment.txt

torchrun \
--nnodes 1 \
--nproc_per_node 4 \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node:29522 \
    -m src.core.train \
        --do_train \
        --do_eval \
        --freeze_gate True \
        --eval_data_dir $eval_data_dir \
        --evaluation_strategy steps \
        --eval_steps 100 \
        --max_eval_steps 5 \
        --dynamic_sampling_criterion mean \
        --run_name $task_name \
        --model_type $model_type \
        --model_name_or_path $model_name_or_path \
        --dataset_dir_or_path $dataset_dir_or_path \
        --output_dir $output_dir \
        --deepspeed conf/ds_bf16_zero1.json \
        --bf16 True \
        --tf32 True \
        --torch_dtype bfloat16 \
        --per_device_train_batch_size 4 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 8 \
        --max_steps 2000 \
        --save_strategy steps \
        --save_steps 9999999999999 \
        --save_total_limit 1 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type cosine \
        --logging_steps 1 \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --report_to wandb
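
To make the training budget in the command above concrete, the sketch below works out the effective batch size and the learning-rate curve from the listed flags. It assumes the usual linear warmup followed by cosine decay to zero, which is how `--lr_scheduler_type cosine` with `--warmup_ratio` typically behaves. The helper names are hypothetical and not part of this repository.

```python
import math


def effective_batch_size(n_gpus, per_device_batch, grad_accum):
    """Sequences consumed per optimizer step across all GPUs."""
    return n_gpus * per_device_batch * grad_accum


def lr_at(step, max_steps, base_lr, warmup_ratio):
    """Assumed schedule: linear warmup from 0, then cosine decay to 0."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values copied from the torchrun command above.
batch = effective_batch_size(n_gpus=4, per_device_batch=4, grad_accum=8)
print("sequences per step:", batch)            # 4 * 4 * 8 = 128
print("sequences in total:", batch * 2000)     # over --max_steps 2000
print("max tokens per step:", batch * 2048)    # upper bound from --model_max_length
for step in (0, 30, 60, 1000, 2000):
    print(step, lr_at(step, max_steps=2000, base_lr=2e-5, warmup_ratio=0.03))
```

If you change the number of GPUs, adjusting `--gradient_accumulation_steps` so that the product stays the same keeps the effective batch size unchanged.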

Evaluation

After training, set the model path in scripts/eval/multi.sh, then run bash scripts/eval/multi.sh to evaluate the model:

# e.g.
single_eval reasoning moduleformer_random outputs/moduleformer_random/2533914/
multi_eval moduleformer_random outputs/moduleformer_random/2533914/

After all the jobs are finished, you can check the results via:

python -m src.eval.show results/moduleformer_random

For MT-Bench, the generated responses are saved in data/mt_bench/model_answer. You can pass these files to fastchat/llm_judge to obtain the final results.

📋 Citation

@article{zhu-et-al-2024-dynamic-sft-for-moe,
  title={Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts},
  author={Zhu, Tong and Dong, Daize and Qu, Xiaoye and Ruan, Jiacheng and Chen, Wenliang and Cheng, Yu},
  journal={arXiv preprint arXiv:2406.11256},
  year={2024},
  url={https://arxiv.org/abs/2406.11256},
}

💌 This project is licensed under Apache-2.0. We hope you enjoy it ~

❤️ Wishing you good luck ❤️
