low throughput while using TGI-Gaudi on bigcode/starcoderbase-3b on Gaudi2 #166

Open
3 of 4 tasks
vishnumadhu365 opened this issue Jun 22, 2024 · 1 comment

Comments

@vishnumadhu365

System Info

tgi-gaudi docker container built from master branch (4fe871f)
Ubuntu 22.04.3 LTS
Gaudi2
HL-SMI Version: hl-1.15.0-fw-48.2.1.1
Driver Version: 1.15.0-a596ef0
Model : bigcode/starcoderbase-3b

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps

  1. Docker run
docker run -it -p 8080:80 -v $volume:/data    --runtime=habana   \
	-e HABANA_VISIBLE_DEVICES=all  \
	-e HUGGING_FACE_HUB_TOKEN=1234  \
	-e OMPI_MCA_btl_vader_single_copy_mechanism=none   \
	-e ENABLE_HPU_GRAPH=False   -e BATCH_BUCKET_SIZE=128  \
	-e PREFILL_BATCH_BUCKET_SIZE=4  \
	-e PAD_SEQUENCE_TO_MULTIPLE_OF=128    \
	--cap-add=sys_nice  \
	--ipc=host tgi-gaudi:latest   \
	--model-id $model    \
	--max-input-tokens 568    \
	--max-batch-prefill-tokens 618  \
	--max-total-tokens 614  \
	--max-batch-total-tokens 78592
  2. Measure perf of the TGI endpoint with tgi-gaudi/examples/run_generation.py
python3 run_generation.py \
	--model_id $model \
	--server_address https://localhost:8080 \
	--max_input_length 568 \
	--max_output_length 46 \
	--total_sample_count 1280 \
	--max_concurrent_requests 128

output:

--------------------------------
----- Performance  summary -----
--------------------------------
Throughput: 98.8 tokens/s
Throughput: 2.2 queries/s
--------------------------------
First token latency:
        Median:         54734.41ms
        Average:        52755.73ms
--------------------------------
Output token latency:
        Median:         58.47ms
        Average:        69.58ms
--------------------------------
  3. Run the static benchmark from within the TGI container
text-generation-benchmark -b 128 -b 64 -b 32 -b 16 -b 8 -b 4 -b 2 -b 1 -s 567 -d 46 -w 5 -r 100 -t bigcode/starcoderbase-3b

output:
[screenshot: text-generation-benchmark results table]
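
For reference, a single request can also be sent to the endpoint directly to see the per-request router timings quoted below. This is a minimal sketch assuming the 8080:80 port mapping from the docker run above; the prompt and parameters are placeholders, not the exact payload used:

curl -s http://localhost:8080/generate \
	-X POST \
	-H 'Content-Type: application/json' \
	-d '{"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 46}}'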

Expected behavior

Issue:
Throughput numbers while hitting the TGI endpoint are far below the static benchmark throughput.
Server logs suggest there is an issue with continuous batching on Gaudi2.

# Test: sending 5 requests to the Gaudi2 TGI endpoint. Note that the queue time increases for each subsequent inference request
Req1: total_time="3.076226394s" validation_time="449.063µs" queue_time="110.028µs" inference_time="3.075667684s" time_per_token="66.86234ms"
Req2: total_time="3.076173218s" validation_time="3.502745ms" queue_time="70.64658ms" inference_time="3.002024052s" time_per_token="65.261392ms"
Req3: total_time="3.132718439s" validation_time="786.778µs" queue_time="201.632982ms" inference_time="2.930298993s" time_per_token="63.702152ms"
Req4: total_time="3.197355097s" validation_time="1.277488ms" queue_time="331.050014ms" inference_time="2.865027991s" time_per_token="62.283217ms"
Req5: total_time="3.259123777s" validation_time="924.292µs" queue_time="459.104331ms" inference_time="2.799095535s" time_per_token="60.849902ms" 
# Same test as above, this time sending 5 requests to a single NVIDIA T4 card running the TGI 2.0.4 Docker image. Note that the queue time is more or less constant after the first request, indicating effective continuous batching

Req1: total_time="1.513475533s" validation_time="1.069695ms" queue_time="52.017µs" inference_time="1.512354236s" time_per_token="32.877266ms"
Req2: total_time="1.507096983s" validation_time="799.031µs" queue_time="54.518157ms" inference_time="1.451780025s" time_per_token="31.560435ms"
Req3: total_time="1.502753387s" validation_time="418.679µs" queue_time="50.525381ms" inference_time="1.451809782s" time_per_token="31.561082ms"
Req4: total_time="1.507244713s" validation_time="841.468µs" queue_time="54.479958ms" inference_time="1.451923498s" time_per_token="31.563554ms"
Req5: total_time="1.503086631s" validation_time="828.972µs" queue_time="50.359691ms" inference_time="1.451898309s" time_per_token="31.563006ms"
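
The 5-request tests above can be approximated with a simple loop against either endpoint (sketch only; the prompt is a placeholder and the actual test may have used a different client):

for i in $(seq 1 5); do
	# fire requests in the background so they overlap and exercise continuous batching
	curl -s http://localhost:8080/generate \
		-X POST -H 'Content-Type: application/json' \
		-d '{"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 46}}' > /dev/null &
done
wait
# each request produces one total_time/queue_time/inference_time line in the TGI router logs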

Expected result:
Gaudi2 throughput on the TGI endpoint (with continuous batching) should be on par with or better than the static benchmark throughput.

@regisss
Collaborator

regisss commented Jul 12, 2024

Not sure why the queue time is increasing, any idea @kdamaszk?
