Pulse · sgl-project/sglang · GitHub

August 18, 2024 – August 25, 2024

Overview

42 Active pull requests

27 Active issues

35 Pull requests merged by 15 people

Update workflow files
#1214 merged Aug 26, 2024
[Feature] Support fp8 e5m2 kv cache with flashinfer
#1204 merged Aug 26, 2024
Update CI runner docs
#1213 merged Aug 26, 2024
Update CI workflows
#1210 merged Aug 25, 2024
[CI] Fix the issue of unit test hanging
#1211 merged Aug 25, 2024
[Minor] Temporarily skip flaky test
#1209 merged Aug 25, 2024
[Minor] Improve the function organization in TokenizerManager & improve loggers
#1208 merged Aug 25, 2024
Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model
#1186 merged Aug 25, 2024
[Fix] Fixing the multi-images error for llava-onevision
#1205 merged Aug 25, 2024
Relax the assert in moe throughput test to fix the flaky CI
#1207 merged Aug 25, 2024
[Fix] the issue of random order when input is a list
#1199 merged Aug 25, 2024
[CI] Fix the problem of hf runner too slow
#1202 merged Aug 25, 2024
Update README.md
#1198 merged Aug 24, 2024
Cleanup readme, llava examples, usage examples and nccl init
#1194 merged Aug 24, 2024
feat: use gelu_tanh_and_mul
#1193 merged Aug 24, 2024
[Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server.
#1123 merged Aug 23, 2024
Fix benchmark script
#1185 merged Aug 22, 2024
Fix broken penalty
#1184 merged Aug 22, 2024
[Minor] Improve logging and rename the health check endpoint name
#1180 merged Aug 22, 2024
Improve code style of sampler
#1168 merged Aug 21, 2024
[Docs] Fix rendering of details in README
#1179 merged Aug 21, 2024
Support min-p sampling
#1167 merged Aug 21, 2024
[Feature] Add a function to convert sampling_params to kwargs
#1170 merged Aug 21, 2024
fix: custom op fallback forward native when lower sm80
#1177 merged Aug 21, 2024
Improve multi-node stability
#1171 merged Aug 21, 2024
[Feat] Support update weights without restart server
#1157 merged Aug 20, 2024
fix: resolve README render
#1166 merged Aug 20, 2024
support /v1/health using a generation 1 token
#1154 merged Aug 20, 2024
misc: add hypervisor vendor
#1165 merged Aug 20, 2024
[Feature] add disable-custom-all-reduce
#1148 merged Aug 20, 2024
Improve docs and warnings
#1164 merged Aug 20, 2024
feat: allow streaming for multi-prompt and/or parallel sampling
#1134 merged Aug 20, 2024
Optimize MLA/GQA/MQA Triton decoding
#1138 merged Aug 19, 2024
[Feat] Add support for optional start len of logprobs
#1035 merged Aug 19, 2024
[Docs] Add instruction for running on clouds and kubernetes with SkyPilot
#1144 merged Aug 19, 2024

7 Pull requests opened by 6 people

Save memory from interleaved attention
#1151 opened Aug 19, 2024
chore: bump v0.2.14
#1155 opened Aug 19, 2024
Separated control and compute loop, shorten the critical path, and enable more complicated policies
#1182 opened Aug 22, 2024
Dry sample
#1187 opened Aug 23, 2024
Move sampler into CUDA graph
#1201 opened Aug 25, 2024
minor: improve CI and dependencies
#1212 opened Aug 26, 2024
improve the threshold and ports in tests
#1215 opened Aug 26, 2024

12 Issues closed by 7 people

[Bug] Potential Logic Error in Memory Capacity Check for Distributed Setup
#1015 closed Aug 24, 2024
[Feature] support min_p sampling
#1071 closed Aug 23, 2024
[Help wanted] Does RadixAttention have anything to do with attention?
#1181 closed Aug 22, 2024
[Bug] Runtime Stuck
#1173 closed Aug 21, 2024
[Feature] SGLang using JSON as template config file needs improve
#1172 closed Aug 21, 2024
[Feature] add disable_custom_all_reduce
#1118 closed Aug 21, 2024
[Feature] The real health check API
#853 closed Aug 20, 2024
[Feature] Support W8A16 Int8 inside FusedMoE
#1161 closed Aug 20, 2024
[Feature] In Sglang, Is chunked-prefill use fused(prefill+decode) batch?
#1162 closed Aug 20, 2024
[Bug] Gemma-2-9b-it produces garbage output
#1160 closed Aug 20, 2024
In which file is constraint decoding implemented?
#1149 closed Aug 19, 2024
[Bug] --disable-flashinfer is broken
#1146 closed Aug 19, 2024

15 Issues opened by 11 people

[Feature] add option to use liger triton kernel
#1216 opened Aug 26, 2024
Accuracy degrading in concurrent scenario
#1203 opened Aug 25, 2024
[Feature] Use Embedding/Generation Model to get its Generation/Embedding
#1200 opened Aug 25, 2024
[Bug] enable-torch-compile error
#1196 opened Aug 24, 2024
[Bug] Bad outputs with fp8 quantization at high RPS
#1195 opened Aug 24, 2024
[Bug] Server crashes after loading (Mixtral 8x7b)
#1191 opened Aug 23, 2024
[Feature] Jamba 1.5 Support PLS
#1190 opened Aug 23, 2024
[Bug] schedule_batch.py: IndexError: list index out of range
#1189 opened Aug 23, 2024
[Bug] vllm updated its get_model function
#1183 opened Aug 22, 2024
[Bug] Dynamic FP8 quantization fails due to incorrect tensor shape
#1178 opened Aug 21, 2024
[Bug] Empty `top_logprobs` in LogProbs Output for Meta-Llama-3.1-8B-Instruct Model when Using OpenAI Compatible API
#1176 opened Aug 21, 2024
[Feature] Repeated generation expression
#1175 opened Aug 21, 2024
[Bug] head_dim 96 not supported
#1159 opened Aug 20, 2024
[Feature] support W8A8(FP8) and KV Cache FP8 for DeepSeek V2
#1156 opened Aug 19, 2024
[Tracker] OpenRouter LLM rankings tracking
#1152 opened Aug 19, 2024

12 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

[FEAT] JSON constrained support
#1125 commented on Aug 23, 2024 • 2 new comments
Supports the InternVL multimodal large model
#328 commented on Aug 19, 2024 • 0 new comments
[Bug] Llama3 70B A100 PCIE TP4 slow speed
#1137 commented on Aug 19, 2024 • 0 new comments
[Bug] pt_main_thread uses 100% cpu all the time
#955 commented on Aug 19, 2024 • 0 new comments
[Bug] OOM for concurrent long requests
#1030 commented on Aug 19, 2024 • 0 new comments
Add Default Timeout to urllib.request.urlopen Calls to Prevent Potential Hanging
#339 commented on Aug 21, 2024 • 0 new comments
[Feature] Allow arbitrary logit processors
#1036 commented on Aug 21, 2024 • 0 new comments
[Feature] plan to support medusa?
#859 commented on Aug 23, 2024 • 0 new comments
[Bug] when llama-3.1-70b-instruct batch inference, CUDA memory usage is unusually large
#1132 commented on Aug 25, 2024 • 0 new comments
Development Roadmap (2024 Q3)
#634 commented on Aug 25, 2024 • 0 new comments
[RFC] Add an LLM engine
#1127 commented on Aug 21, 2024 • 0 new comments
Flex scheduler
#1142 commented on Aug 20, 2024 • 0 new comments