I would like to propose adding disk/NVMe offloading (I have seen this implemented in the DeepSpeed library). Suppose we have a large set of prompts and a model too large to fit in the available RAM. In that scenario, we can feed the prompts to the model in large batches, so that at any given moment only a few layers of the model need to be resident in memory. With layer prefetching, the disk reads overlap with computation, keeping latency low and throughput high.

This feature already exists in DeepSpeed (with a focus on GPU inference); a comprehensive description is available here: https://www.deepspeed.ai/2022/09/09/zero-inference.html
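To make the idea concrete, here is a minimal sketch of layer-wise offloading with a one-layer prefetch, written in PyTorch. All the names here (`OffloadedSequential`, the per-layer checkpoint files, etc.) are hypothetical and for illustration only; this is not DeepSpeed's actual API:

```python
import threading

import torch
import torch.nn as nn


class OffloadedSequential:
    """Run a stack of layers whose weights live on disk.

    At any moment only two layers are resident: the one currently
    executing and the next one, which a background thread prefetches
    from disk while the current layer computes.
    """

    def __init__(self, layer_paths, make_layer):
        self.layer_paths = layer_paths  # one state_dict file per layer
        self.make_layer = make_layer    # factory that builds an empty layer

    def _load(self, idx):
        layer = self.make_layer()
        layer.load_state_dict(torch.load(self.layer_paths[idx]))
        return layer

    @torch.no_grad()
    def forward(self, x):
        current = self._load(0)
        for i in range(len(self.layer_paths)):
            slot = {}
            prefetch = None
            if i + 1 < len(self.layer_paths):
                # Overlap the disk read for layer i+1 with the compute
                # for layer i.
                prefetch = threading.Thread(
                    target=lambda j=i + 1: slot.update(layer=self._load(j)))
                prefetch.start()
            x = current(x)
            if prefetch is not None:
                prefetch.join()
                current = slot["layer"]
        return x


if __name__ == "__main__":
    # Hypothetical setup: four linear layers, one checkpoint file each.
    paths = []
    for i in range(4):
        layer = nn.Linear(256, 256)
        path = f"layer_{i}.pt"
        torch.save(layer.state_dict(), path)
        paths.append(path)

    model = OffloadedSequential(paths, lambda: nn.Linear(256, 256))
    # A large batch amortizes the per-layer load cost over many prompts.
    out = model.forward(torch.randn(1024, 256))
    print(out.shape)
```

With large enough batches the I/O is mostly hidden behind the compute; the DeepSpeed post linked above describes further optimizations on top of this basic pattern, such as pinned buffers and asynchronous NVMe reads, which this sketch does not attempt.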