
Any plan to support cascading feature of flashinfer? #495

Closed
wellhowtosay opened this issue Jun 3, 2024 · 4 comments

@wellhowtosay

No description provided.


Qubitium commented Jun 3, 2024

For context this is what @wellhowtosay is referring to:

https://docs.flashinfer.ai/api/python/cascade.html

@wellhowtosay
Copy link
Author

wellhowtosay commented Jun 3, 2024

> For context this is what @wellhowtosay is referring to:
>
> https://docs.flashinfer.ai/api/python/cascade.html

I tried to implement cascading in sglang on my own, but I found it is beyond my capability, especially maintaining a shared-prefix KV cache alongside each request's unique KV data within a batch.
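The bookkeeping involved can be illustrated with a minimal sketch (the class and method names here are hypothetical, not sglang's actual data structures): the shared-prefix KV is stored once for the whole batch, and each request keeps only the KV entries for its unique suffix.

```python
class SharedPrefixKVCache:
    """Hypothetical sketch of a shared-prefix KV layout: one prefix
    buffer for the batch plus per-request unique-suffix entries,
    instead of duplicating the prefix for every request."""

    def __init__(self, prefix_len):
        # The shared prefix is materialized once for the whole batch.
        self.prefix_len = prefix_len
        self.unique_kv = {}  # request_id -> list of unique-token KV entries

    def append_unique(self, request_id, kv_entry):
        # Decode steps only ever append to a request's unique suffix.
        self.unique_kv.setdefault(request_id, []).append(kv_entry)

    def total_tokens(self):
        # Prefix counted once, plus every request's unique tokens.
        return self.prefix_len + sum(len(v) for v in self.unique_kv.values())


cache = SharedPrefixKVCache(prefix_len=1024)
for rid in range(8):
    for _ in range(4):
        cache.append_unique(rid, kv_entry=("k", "v"))

# 1024 shared tokens stored once + 8 * 4 unique tokens = 1056,
# versus 8 * (1024 + 4) = 8224 with a naive per-request copy.
print(cache.total_tokens())
```

The hard part the comment alludes to is that real engines must do this with paged GPU memory and reference counting, not Python dicts; the sketch only shows why the two kinds of KV data have to be tracked separately.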

Ying1123 self-assigned this Jun 8, 2024

Ying1123 commented Jun 8, 2024

Yes, it is in our plan, but it is not the highest priority. In most cases, the bottleneck with a long shared prefix is the prefill stage, which can already be optimized by automatic prefix sharing with a warmup request, as long as the prefix is not shared across batches. That said, if you have a batch of requests sharing a long prefix, you can process one request first; radix attention will then skip the prefix computation for all subsequent requests.

However, cascade inference can offer substantial advantages in some scenarios (a long shared prefix, short decode lengths, and a bottleneck that remains in the decode stage), so we are working on it. The implementation is not trivial: it requires scheduling multiple kernel launches according to the tree structure of prefix sharing. I will let you know when I have an initial draft so you can review and help improve it.
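The decode-side identity that cascade inference builds on can be shown in a few lines of NumPy: single-query attention over the concatenated (shared + unique) KV equals a log-sum-exp-weighted merge of the two partial attention states. This is what lets the shared-prefix part be computed once per batch and merged with each request's unique part. The sketch below only illustrates the math (with the 1/sqrt(d) scaling omitted), not flashinfer's actual kernels.

```python
import numpy as np

def attention_state(q, K, V):
    # Partial attention over one KV segment: returns the softmax-weighted
    # output plus the log-sum-exp of the scores, which is all that is
    # needed to merge segments later. (Scaling omitted for brevity.)
    scores = K @ q                         # (n,)
    lse = np.log(np.sum(np.exp(scores)))
    out = (np.exp(scores) / np.exp(lse)) @ V
    return out, lse

def merge_states(out1, lse1, out2, lse2):
    # Combine two partial states; each weight is that segment's softmax
    # normalizer relative to the combined normalizer.
    lse = np.logaddexp(lse1, lse2)
    w1, w2 = np.exp(lse1 - lse), np.exp(lse2 - lse)
    return w1 * out1 + w2 * out2, lse

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)
K_shared, V_shared = rng.normal(size=(16, d)), rng.normal(size=(16, d))
K_unique, V_unique = rng.normal(size=(4, d)), rng.normal(size=(4, d))

# Attention over the full KV equals the merge of the two partial states.
full, _ = attention_state(q, np.vstack([K_shared, K_unique]),
                          np.vstack([V_shared, V_unique]))
o1, l1 = attention_state(q, K_shared, V_shared)
o2, l2 = attention_state(q, K_unique, V_unique)
merged, _ = merge_states(o1, l1, o2, l2)
assert np.allclose(full, merged)
```

Because the merge is associative, the same trick extends to a tree of shared prefixes, which is where the kernel-launch scheduling Ying1123 mentions comes in.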


github-actions bot commented Aug 8, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
