Add sglang.bench_latency for offline benchmark #564

Merged on Jun 25, 2024 (8 commits)

@merrymercy (Contributor) commented Jun 25, 2024

  • Add sglang/bench_latency.py, a convenient utility for offline latency benchmarking and correctness testing.

Usage (latency test):

>>> python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --load-format dummy --tp 2

Prefill. latency:  5.021 ms, throughput:    203.93 token/s
Decode.  latency:  0.331 ms, throughput:      3.02 token/s
Decode.  latency:  0.010 ms, throughput:    102.95 token/s
Decode.  latency:  0.009 ms, throughput:    107.55 token/s
Decode.  latency:  0.009 ms, throughput:    107.70 token/s
Prefill. latency:  0.019 ms, throughput:  52553.25 token/s
Decode.  latency:  0.009 ms, throughput:    108.23 token/s
Decode.  latency:  0.012 ms, throughput:     84.28 token/s
Decode.  latency:  0.009 ms, throughput:    108.52 token/s
Decode.  latency:  0.009 ms, throughput:    108.31 token/s
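
A quick way to sanity-check the log above is to multiply each latency by its reported throughput, which gives the number of tokens processed in that step. The sketch below does this for a few rows copied from the output. The helper name `implied_tokens` is hypothetical and not part of sglang. Note that the products come out near 1024 tokens for prefill and near 1 token for decode only if the latency column is read as seconds, so the printed "ms" label appears to be a units slip; this is an inference from the numbers, not something the PR states.

```python
# Rows copied from the log above: (phase, latency as printed, throughput in token/s).
ROWS = [
    ("Prefill", 5.021, 203.93),
    ("Decode", 0.331, 3.02),
    ("Prefill", 0.019, 52553.25),
    ("Decode", 0.009, 108.23),
]


def implied_tokens(latency: float, throughput: float) -> float:
    """Tokens processed in one step, assuming throughput = tokens / latency.

    Hypothetical helper for illustration; the latency is treated as seconds.
    """
    return latency * throughput


if __name__ == "__main__":
    for phase, latency, throughput in ROWS:
        tokens = implied_tokens(latency, throughput)
        print(f"{phase:8s} latency={latency:.3f} throughput={throughput:10.2f} "
              f"implied tokens={tokens:8.2f}")
```

The small latencies are rounded to three decimals in the log, so the implied token counts for those rows are only approximate. The first prefill and first decode are much slower than the later ones, which is consistent with one-time warm-up cost in the first run.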

Usage (correctness test):

>>> python -m sglang.bench_latency --model-path TinyLlama/TinyLlama-1.1B-Chat-v0.4 --correct

prefill logits (first half) tensor([[-10.0312,  -9.5000,   0.8936,  ...,  -4.9414,  -3.2402,  -3.3633],
        [-10.0312,  -9.5000,   0.8936,  ...,  -4.9414,  -3.2402,  -3.3633],
        [ -9.1875, -10.2500,   2.7109,  ...,  -4.3359,  -4.0664,  -4.1328]],
       device='cuda:0', dtype=torch.float16)
prefill logits (final) tensor([[-8.3203, -7.1211,  3.3379,  ..., -4.9570, -4.1328, -3.4141],
        [-8.9062, -9.0156,  4.1445,  ..., -4.9922, -4.4961, -4.0742],
        [-9.6328, -9.0547,  4.0117,  ..., -5.3047, -4.7148, -4.4609]],
       device='cuda:0', dtype=torch.float16)
<s> The capital of France is.
The capital of the United States
<s> The capital of the United Kindom is.
The capital of the United Kingdom
<s> Today is a sunny day and I like go for a walk in the park.
  • Remove all other deprecated low-level API test files

@merrymercy changed the title from "Fix bench_llama_low_api.py" to "ADD bench_low_api.py" Jun 25, 2024
@merrymercy changed the title from "ADD bench_low_api.py" to "Add bench_low_api.py" Jun 25, 2024
@merrymercy changed the title from "Add bench_low_api.py" to "Add sglang.bench_latency" Jun 25, 2024
@merrymercy changed the title from "Add sglang.bench_latency" to "Add sglang.bench_latency for offline benchmark" Jun 25, 2024
@merrymercy merged commit eb1ae6a into main Jun 25, 2024
@merrymercy deleted the fix-micro-bench branch June 25, 2024 10:38