I would like to propose adding disk/NVMe offloading (I have seen this implemented in the DeepSpeed library). Suppose we have a large set of prompts and a model too large to fit in the available RAM. In that scenario, we can feed the prompts to the model in large batches, so that at any given moment only a few layers of the model need to be resident in memory. With layer prefetching, the disk reads overlap with computation, keeping latency low and throughput high.

This feature already exists in DeepSpeed (with a focus on GPU inference); a comprehensive description is available here: https://www.deepspeed.ai/2022/09/09/zero-inference.html
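To make the idea concrete, here is a minimal sketch of layer-wise offloading with a one-layer prefetch, written in PyTorch. All the names here (`OffloadedSequential`, the per-layer checkpoint files, etc.) are hypothetical and for illustration only; this is not DeepSpeed's actual API:

```python
import threading

import torch
import torch.nn as nn


class OffloadedSequential:
    """Run a stack of layers whose weights live on disk.

    At any moment only two layers are resident: the one currently
    executing and the next one, which a background thread prefetches
    from disk while the current layer computes.
    """

    def __init__(self, layer_paths, make_layer):
        self.layer_paths = layer_paths  # one state_dict file per layer
        self.make_layer = make_layer    # factory that builds an empty layer

    def _load(self, idx):
        layer = self.make_layer()
        layer.load_state_dict(torch.load(self.layer_paths[idx]))
        return layer

    @torch.no_grad()
    def forward(self, x):
        current = self._load(0)
        for i in range(len(self.layer_paths)):
            slot = {}
            prefetch = None
            if i + 1 < len(self.layer_paths):
                # Overlap the disk read for layer i+1 with the compute
                # for layer i.
                prefetch = threading.Thread(
                    target=lambda j=i + 1: slot.update(layer=self._load(j)))
                prefetch.start()
            x = current(x)
            if prefetch is not None:
                prefetch.join()
                current = slot["layer"]
        return x


if __name__ == "__main__":
    # Hypothetical setup: four linear layers, one checkpoint file each.
    paths = []
    for i in range(4):
        layer = nn.Linear(256, 256)
        path = f"layer_{i}.pt"
        torch.save(layer.state_dict(), path)
        paths.append(path)

    model = OffloadedSequential(paths, lambda: nn.Linear(256, 256))
    # A large batch amortizes the per-layer load cost over many prompts.
    out = model.forward(torch.randn(1024, 256))
    print(out.shape)
```

With large enough batches the I/O is mostly hidden behind the compute; the DeepSpeed post linked above describes further optimizations on top of this basic pattern, such as pinned buffers and asynchronous NVMe reads, which this sketch does not attempt.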