
Any plan to support cascading feature of flashinfer? #495

Closed
wellhowtosay opened this issue Jun 3, 2024 · 4 comments

@wellhowtosay

No description provided.


Qubitium commented Jun 3, 2024

For context this is what @wellhowtosay is referring to:

https://docs.flashinfer.ai/api/python/cascade.html

@wellhowtosay
Copy link
Author

wellhowtosay commented Jun 3, 2024

> For context this is what @wellhowtosay is referring to:
>
> https://docs.flashinfer.ai/api/python/cascade.html

I tried to implement cascading in sglang on my own, but I found it is beyond my capability, especially maintaining a shared-prefix KV cache alongside each request's unique KV data within a batch.
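The bookkeeping involved can be illustrated with a minimal sketch (the class and method names here are hypothetical, not sglang's actual data structures): the shared-prefix KV is stored once for the whole batch, and each request keeps only the KV entries for its unique suffix.

```python
class SharedPrefixKVCache:
    """Hypothetical sketch of a shared-prefix KV layout: one prefix
    buffer for the batch plus per-request unique-suffix entries,
    instead of duplicating the prefix for every request."""

    def __init__(self, prefix_len):
        # The shared prefix is materialized once for the whole batch.
        self.prefix_len = prefix_len
        self.unique_kv = {}  # request_id -> list of unique-token KV entries

    def append_unique(self, request_id, kv_entry):
        # Decode steps only ever append to a request's unique suffix.
        self.unique_kv.setdefault(request_id, []).append(kv_entry)

    def total_tokens(self):
        # Prefix counted once, plus every request's unique tokens.
        return self.prefix_len + sum(len(v) for v in self.unique_kv.values())


cache = SharedPrefixKVCache(prefix_len=1024)
for rid in range(8):
    for _ in range(4):
        cache.append_unique(rid, kv_entry=("k", "v"))

# 1024 shared tokens stored once + 8 * 4 unique tokens = 1056,
# versus 8 * (1024 + 4) = 8224 with a naive per-request copy.
print(cache.total_tokens())
```

The hard part the comment alludes to is that real engines must do this with paged GPU memory and reference counting, not Python dicts; the sketch only shows why the two kinds of KV data have to be tracked separately.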

Ying1123 self-assigned this Jun 8, 2024

Ying1123 commented Jun 8, 2024

Yes, it is in our plan, but it is not the highest priority. In most cases, the bottleneck with a long shared prefix is the prefill stage, which can already be optimized by automatic prefix sharing with a warmup request, as long as the prefix is not shared across batches. That said, if you have a batch of requests sharing a long prefix, you can process one request first; radix attention will then skip the prefix computation for all subsequent requests.

However, cascade inference can offer substantial advantages in some scenarios (a long shared prefix, short decode lengths, and a bottleneck that remains in the decode stage), so we are working on it. The implementation is not trivial: it requires scheduling multiple kernel launches according to the tree structure of prefix sharing. I will let you know when I have an initial draft so you can review and help improve it.
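The decode-side identity that cascade inference builds on can be shown in a few lines of NumPy: single-query attention over the concatenated (shared + unique) KV equals a log-sum-exp-weighted merge of the two partial attention states. This is what lets the shared-prefix part be computed once per batch and merged with each request's unique part. The sketch below only illustrates the math (with the 1/sqrt(d) scaling omitted), not flashinfer's actual kernels.

```python
import numpy as np

def attention_state(q, K, V):
    # Partial attention over one KV segment: returns the softmax-weighted
    # output plus the log-sum-exp of the scores, which is all that is
    # needed to merge segments later. (Scaling omitted for brevity.)
    scores = K @ q                         # (n,)
    lse = np.log(np.sum(np.exp(scores)))
    out = (np.exp(scores) / np.exp(lse)) @ V
    return out, lse

def merge_states(out1, lse1, out2, lse2):
    # Combine two partial states; each weight is that segment's softmax
    # normalizer relative to the combined normalizer.
    lse = np.logaddexp(lse1, lse2)
    w1, w2 = np.exp(lse1 - lse), np.exp(lse2 - lse)
    return w1 * out1 + w2 * out2, lse

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)
K_shared, V_shared = rng.normal(size=(16, d)), rng.normal(size=(16, d))
K_unique, V_unique = rng.normal(size=(4, d)), rng.normal(size=(4, d))

# Attention over the full KV equals the merge of the two partial states.
full, _ = attention_state(q, np.vstack([K_shared, K_unique]),
                          np.vstack([V_shared, V_unique]))
o1, l1 = attention_state(q, K_shared, V_shared)
o2, l2 = attention_state(q, K_unique, V_unique)
merged, _ = merge_states(o1, l1, o2, l2)
assert np.allclose(full, merged)
```

Because the merge is associative, the same trick extends to a tree of shared prefixes, which is where the kernel-launch scheduling Ying1123 mentions comes in.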


github-actions bot commented Aug 8, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
