
Mixtral offloading

This project is a fork of the original Mixtral offloading project. Our main contributions are (i) benchmarking the performance of various expert-caching strategies, including layer-wise independent caches that account for the varying distribution of expert selections across layers, and (ii) upper-bounding the performance of speculative decoding by hard-coding expert activations for a selected set of prompts.

Specifically, dvmazur/mixtral-offloading accelerates token generation using (i) LRU caching of experts and (ii) speculative pre-loading, which predicts the active experts ahead of time. In this project we examine both ideas in depth. Our investigation revealed the following:

  • Performance (measured by throughput) is largely unaffected by the caching strategy: LRU and LFU caching offer only marginal improvements over a fully random eviction policy (see the sketch after this list).
  • Speculative pre-loading of experts offers no further gains for 4-bit quantized MoE inference, because throughput is bound by CPU-GPU communication overhead.
  • Reducing GPU-CPU communication by running inference on the CPU is a favourable approach for MoE inference. Consequently, developing quantized multi-precision operation kernels for CPU inference is the most promising, though challenging, direction for further optimization.
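
For illustration, the following is a minimal sketch of the kind of layer-wise expert cache with a pluggable eviction policy that such a comparison requires. It is not this repository's actual implementation; the names (`ExpertCache`, `load_fn`, the `policy` strings) are hypothetical, and real code would move quantized expert weights between CPU RAM and GPU memory rather than placeholder strings. Swapping `policy` between "lru", "lfu", and "random" reproduces the comparison above in spirit.

```python
import random
from collections import OrderedDict, defaultdict


class ExpertCache:
    """Per-layer cache of expert weights with a pluggable eviction policy.

    Each decoder layer gets its own independent cache, since the
    distribution of expert selections differs from layer to layer.
    Supported policies: "lru", "lfu", or "random".
    """

    def __init__(self, capacity_per_layer: int, policy: str = "lru"):
        assert policy in ("lru", "lfu", "random")
        self.capacity = capacity_per_layer
        self.policy = policy
        # layer_id -> OrderedDict[expert_id -> resident expert weights]
        self.layers = defaultdict(OrderedDict)
        # layer_id -> expert_id -> access count (used by LFU)
        self.counts = defaultdict(lambda: defaultdict(int))

    def get(self, layer_id: int, expert_id: int, load_fn):
        """Return the expert's weights, loading (and possibly evicting) on a miss.

        `load_fn(layer_id, expert_id)` stands in for whatever copies the
        expert from CPU RAM to GPU memory; here it is just a callback.
        """
        cache = self.layers[layer_id]
        self.counts[layer_id][expert_id] += 1

        if expert_id in cache:                      # cache hit
            if self.policy == "lru":
                cache.move_to_end(expert_id)        # mark as most recently used
            return cache[expert_id]

        if len(cache) >= self.capacity:             # cache miss: make room first
            self._evict(layer_id)

        cache[expert_id] = load_fn(layer_id, expert_id)
        return cache[expert_id]

    def _evict(self, layer_id: int):
        cache = self.layers[layer_id]
        if self.policy == "lru":
            cache.popitem(last=False)               # drop least recently used
        elif self.policy == "lfu":
            victim = min(cache, key=lambda e: self.counts[layer_id][e])
            del cache[victim]                       # drop least frequently used
        else:
            victim = random.choice(list(cache))     # drop a random resident expert
            del cache[victim]


# Toy usage: 8 experts per layer, room for 4 residents per layer.
if __name__ == "__main__":
    cache = ExpertCache(capacity_per_layer=4, policy="random")
    fake_load = lambda layer, expert: f"weights[layer={layer}, expert={expert}]"
    for expert in [0, 3, 3, 7, 1, 5, 0, 2]:         # routed experts for one layer
        cache.get(layer_id=0, expert_id=expert, load_fn=fake_load)
    print(sorted(cache.layers[0].keys()))           # experts currently resident
```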
