Any plan to support cascading feature of flashinfer? #495
Comments
For context, this is what @wellhowtosay is referring to:
I tried to implement cascading in sglang on my own, but I found it beyond my capability, especially maintaining a shared-prefix KV cache alongside each request's unique KV data within a batch.
Yes, it is in our plan. However, it is not the highest priority because, in most cases, the bottleneck of a long shared prefix originates from the prefill stage, which can be optimized through automatic prefix sharing with a warmup request when the prefix is not shared across batches. That said, if you have a batch of requests sharing a long prefix, you can process one request first; radix attention will then skip the shared-prefix computation for all subsequent requests.
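To illustrate the warmup trick described above, here is a minimal toy sketch (not sglang's actual implementation) of a prefix cache: the first request pays the full prefill cost and populates the cache, and later requests that share the same prefix only compute their unique suffix. The `PrefixCache` class and its methods are hypothetical names for illustration; real radix attention uses a radix tree so divergent suffixes can all be retained.

```python
# Toy illustration of warmup-based prefix caching (hypothetical, not sglang code).
class PrefixCache:
    def __init__(self):
        # Tokens whose KV entries are already "materialized" in the cache.
        self.cached = []

    def prefill(self, tokens):
        """Return (cache_hit_length, tokens_actually_computed)."""
        # Length of the longest cached prefix shared with `tokens`.
        hit = 0
        for a, b in zip(self.cached, tokens):
            if a != b:
                break
            hit += 1
        # Only the uncached suffix needs prefill computation.
        computed = len(tokens) - hit
        # Keep the longer sequence cached (a real radix tree keeps both branches).
        if len(tokens) > len(self.cached):
            self.cached = list(tokens)
        return hit, computed

cache = PrefixCache()
shared = ["sys", "prompt", "doc"]

# Cold: the first request computes everything, warming the cache.
first_hit, first_work = cache.prefill(shared + ["q1"])    # (0, 4)

# Warm: a later request sharing the prefix only computes its suffix.
second_hit, second_work = cache.prefill(shared + ["q2"])  # (3, 1)
```

This is why processing one request first helps: it converts the shared-prefix prefill from per-request work into a one-time cost.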
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.