
Mixtral offloading

This project is a fork of the original Mixtral offloading project. Our main contributions are (i) benchmarking the performance of various expert-caching strategies, including layer-wise independent caches that account for the varying distribution of expert selections across layers, and (ii) upper-bounding the performance of speculative decoding by hard-coding expert activations for a selected set of prompts.

Specifically, dvmazur/mixtral-offloading accelerates token generation using (i) LRU caching of experts and (ii) speculative pre-loading, which predicts the active experts ahead of time. In this project we examine both ideas in depth. Our investigation revealed the following:

  • Performance (measured by throughput) is largely unaffected by the caching strategy: LRU and LFU caching offer only marginal improvements over a fully random eviction policy (see the sketch after this list).
  • Speculative pre-loading of experts offers no further gains for 4-bit quantized MoE inference, because throughput is bound by CPU-GPU communication overhead.
  • Reducing GPU-CPU communication by running inference on the CPU is a favourable approach for MoE inference. Consequently, developing quantized multi-precision operation kernels for CPU inference is the most promising, though challenging, direction for further optimization.
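
For illustration, the following is a minimal sketch of the kind of layer-wise expert cache with a pluggable eviction policy that such a comparison requires. It is not this repository's actual implementation; the names (`ExpertCache`, `load_fn`, the `policy` strings) are hypothetical, and real code would move quantized expert weights between CPU RAM and GPU memory rather than placeholder strings. Swapping `policy` between "lru", "lfu", and "random" reproduces the comparison above in spirit.

```python
import random
from collections import OrderedDict, defaultdict


class ExpertCache:
    """Per-layer cache of expert weights with a pluggable eviction policy.

    Each decoder layer gets its own independent cache, since the
    distribution of expert selections differs from layer to layer.
    Supported policies: "lru", "lfu", or "random".
    """

    def __init__(self, capacity_per_layer: int, policy: str = "lru"):
        assert policy in ("lru", "lfu", "random")
        self.capacity = capacity_per_layer
        self.policy = policy
        # layer_id -> OrderedDict[expert_id -> resident expert weights]
        self.layers = defaultdict(OrderedDict)
        # layer_id -> expert_id -> access count (used by LFU)
        self.counts = defaultdict(lambda: defaultdict(int))

    def get(self, layer_id: int, expert_id: int, load_fn):
        """Return the expert's weights, loading (and possibly evicting) on a miss.

        `load_fn(layer_id, expert_id)` stands in for whatever copies the
        expert from CPU RAM to GPU memory; here it is just a callback.
        """
        cache = self.layers[layer_id]
        self.counts[layer_id][expert_id] += 1

        if expert_id in cache:                      # cache hit
            if self.policy == "lru":
                cache.move_to_end(expert_id)        # mark as most recently used
            return cache[expert_id]

        if len(cache) >= self.capacity:             # cache miss: make room first
            self._evict(layer_id)

        cache[expert_id] = load_fn(layer_id, expert_id)
        return cache[expert_id]

    def _evict(self, layer_id: int):
        cache = self.layers[layer_id]
        if self.policy == "lru":
            cache.popitem(last=False)               # drop least recently used
        elif self.policy == "lfu":
            victim = min(cache, key=lambda e: self.counts[layer_id][e])
            del cache[victim]                       # drop least frequently used
        else:
            victim = random.choice(list(cache))     # drop a random resident expert
            del cache[victim]


# Toy usage: 8 experts per layer, room for 4 residents per layer.
if __name__ == "__main__":
    cache = ExpertCache(capacity_per_layer=4, policy="random")
    fake_load = lambda layer, expert: f"weights[layer={layer}, expert={expert}]"
    for expert in [0, 3, 3, 7, 1, 5, 0, 2]:         # routed experts for one layer
        cache.get(layer_id=0, expert_id=expert, load_fn=fake_load)
    print(sorted(cache.layers[0].keys()))           # experts currently resident
```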
