Issue:
Throughput numbers when hitting the TGI endpoint are far below the static benchmark throughput.
Server logs suggest there is an issue with continuous batching on Gaudi2.
# Test: sending 5 requests to the Gaudi2 TGI endpoint. Note that the queue time increases for each subsequent inference request.
Req1: total_time="3.076226394s" validation_time="449.063µs" queue_time="110.028µs" inference_time="3.075667684s" time_per_token="66.86234ms"
Req2: total_time="3.076173218s" validation_time="3.502745ms" queue_time="70.64658ms" inference_time="3.002024052s" time_per_token="65.261392ms"
Req3: total_time="3.132718439s" validation_time="786.778µs" queue_time="201.632982ms" inference_time="2.930298993s" time_per_token="63.702152ms"
Req4: total_time="3.197355097s" validation_time="1.277488ms" queue_time="331.050014ms" inference_time="2.865027991s" time_per_token="62.283217ms"
Req5: total_time="3.259123777s" validation_time="924.292µs" queue_time="459.104331ms" inference_time="2.799095535s" time_per_token="60.849902ms"
# Same test as above, this time sending 5 requests to a single Nvidia T4 card running the TGI 2.0.4 Docker image. Note that the queue time stays roughly constant after the first request, indicating effective continuous batching.
Req1: total_time="1.513475533s" validation_time="1.069695ms" queue_time="52.017µs" inference_time="1.512354236s" time_per_token="32.877266ms"
Req2: total_time="1.507096983s" validation_time="799.031µs" queue_time="54.518157ms" inference_time="1.451780025s" time_per_token="31.560435ms"
Req3: total_time="1.502753387s" validation_time="418.679µs" queue_time="50.525381ms" inference_time="1.451809782s" time_per_token="31.561082ms"
Req4: total_time="1.507244713s" validation_time="841.468µs" queue_time="54.479958ms" inference_time="1.451923498s" time_per_token="31.563554ms"
Req5: total_time="1.503086631s" validation_time="828.972µs" queue_time="50.359691ms" inference_time="1.451898309s" time_per_token="31.563006ms"
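
For reference, a minimal sketch of the kind of concurrency test described above, assuming the standard TGI /generate REST API; the endpoint URL, prompt, and token count are placeholders, and this is not the exact script used to produce the traces:

import concurrent.futures
import time

import requests

TGI_URL = "http://localhost:8080/generate"  # placeholder endpoint
PROMPT = "def fibonacci(n):"                # placeholder prompt
NUM_REQUESTS = 5

def send_request(_: int) -> float:
    # POST a single generation request and return its wall-clock latency.
    start = time.perf_counter()
    resp = requests.post(
        TGI_URL,
        json={"inputs": PROMPT, "parameters": {"max_new_tokens": 46}},
        timeout=120,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

# Fire all requests at once so the server has the chance to batch them.
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_REQUESTS) as pool:
    latencies = list(pool.map(send_request, range(NUM_REQUESTS)))

for i, latency in enumerate(latencies, 1):
    print(f"Req{i}: total_time={latency:.3f}s")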
Expected result:
Gaudi2 throughput numbers on the TGI endpoint (with continuous batching) should be on par with or better than the static benchmark throughput.
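
For scale, a rough aggregate-throughput estimate from the traces above, assuming the 5 requests run concurrently, each generates about 46 tokens (inference_time / time_per_token is roughly 46 in both runs), and wall time is taken as the slowest request:

# Back-of-the-envelope numbers derived from the traces above (assumptions in comments).
tokens_per_request = 46   # approx. generated tokens per request
num_requests = 5

gaudi2_wall_time = 3.259  # slowest Gaudi2 total_time, in seconds
t4_wall_time = 1.513      # slowest T4 total_time, in seconds

print(f"Gaudi2: ~{num_requests * tokens_per_request / gaudi2_wall_time:.0f} tokens/s aggregate")
print(f"T4:     ~{num_requests * tokens_per_request / t4_wall_time:.0f} tokens/s aggregate")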
System Info
tgi-gaudi docker container built from master branch (4fe871f)
Ubuntu 22.04.3 LTS
Gaudi2
HL-SMI Version: hl-1.15.0-fw-48.2.1.1
Driver Version: 1.15.0-a596ef0
Model : bigcode/starcoderbase-3b
Information
Tasks
Reproduction
Steps
python3 run_generation.py \
    --model_id $model \
    --server_address https://localhost:8080 \
    --max_input_length 568 \
    --max_output_length 46 \
    --total_sample_count 1280 \
    --max_concurrent_requests 128
output:
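
The queue_time values quoted above come from the TGI router request logs; a minimal sketch for pulling them out of a captured log file (the file name is a placeholder) to check whether queue time keeps growing across concurrent requests:

import re

# Matches e.g. queue_time="110.028µs", queue_time="70.64658ms", queue_time="1.2s"
QUEUE_RE = re.compile(r'queue_time="([0-9.]+)(µs|ms|s)"')
SCALE = {"µs": 1e-6, "ms": 1e-3, "s": 1.0}

with open("tgi_server.log", encoding="utf-8") as log:  # placeholder log file
    for line in log:
        match = QUEUE_RE.search(line)
        if match:
            value, unit = match.groups()
            print(f"queue_time = {float(value) * SCALE[unit]:.6f} s")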
Expected behavior
Gaudi2 throughput numbers on the TGI endpoint (with continuous batching) should be on par with or better than the static benchmark throughput.