TensorRT-LLM often hangs when using both tp_size 2 and enable_context_fmha #390

Closed · 1 of 4 tasks
lkm2835 opened this issue Apr 4, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments
lkm2835 commented Apr 4, 2024

System Info

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /app/models \
    --output_dir /app/models/tensorrt \
    --dtype float16 \
    --tp_size 2
trtllm-build --checkpoint_dir /app/models/tensorrt \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --output_dir /app/models/tensorrt_llm/context_fmha \
             --paged_kv_cache disable \
             --enable_xqa disable \
             --multi_block_mode disable \
             --tp_size 2 \
             --max_batch_size 1 \
             --max_input_len 4096 \
             --max_output_len 2048
mkdir /app/models/triton_model
cp -r /app/all_models/inflight_batcher_llm/* /app/models/triton_model

python3 /app/tools/fill_template.py -i /app/models/triton_model/preprocessing/config.pbtxt tokenizer_dir:/app/models/,triton_max_batch_size:1,preprocessing_instance_count:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/postprocessing/config.pbtxt tokenizer_dir:/app/models/,triton_max_batch_size:1,postprocessing_instance_count:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 /app/tools/fill_template.py -i /app/models/triton_model/ensemble/config.pbtxt triton_max_batch_size:1
python3 /app/tools/fill_template.py -i /app/models/triton_model/tensorrt_llm/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,max_beam_width:1,engine_dir:/app/models/tensorrt_llm/context_fmha,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:v1,max_queue_delay_microseconds:0
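
For completeness, launching the server would then look something like this. The script path is an assumption based on the /app layout used above (launch_triton_server.py ships under scripts/ in the tensorrtllm_backend repo); --world_size 2 matches the tp_size 2 engine:

python3 /app/scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /app/models/triton_model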

Expected behavior

Inference completes without hanging.

Actual behavior

Both GPUs sit pinned at 100% utilization with no forward progress (nvidia-smi excerpt):

+-----------------------------------------+----------------------+----------------------+
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0              76W / 400W |  11138MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:00:07.0 Off |                    0 |
| N/A   39C    P0              82W / 400W |  11106MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

TensorRT-LLM often hangs when using both tp_size 2 and enable_context_fmha.

Additional notes

N/A

lkm2835 added the bug label on Apr 4, 2024
PerkzZheng commented:

@lkm2835 do you see this issue when running the TRT-LLM examples directly, without the Triton backend?
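
For reference, "running the examples directly" means invoking the standalone example runner against the built engine under mpirun (two ranks to match the tp_size 2 engine). A sketch reusing the engine and tokenizer paths from the reproduction above; the prompt text is illustrative:

mpirun -n 2 --allow-run-as-root \
    python3 /app/tensorrt_llm/examples/run.py \
        --engine_dir /app/models/tensorrt_llm/context_fmha \
        --tokenizer_dir /app/models \
        --max_output_len 64 \
        --input_text "Once upon a time"

If the hang reproduces here as well, the Triton backend can be ruled out as the cause.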

lkm2835 (author) commented Apr 10, 2024

@PerkzZheng I found a temporary workaround: disabling use_custom_all_reduce in trtllm-build.
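
Concretely, the workaround is to rebuild the engine with the same flags as in the reproduction, plus the custom all-reduce plugin disabled:

trtllm-build --checkpoint_dir /app/models/tensorrt \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --use_custom_all_reduce disable \
             --output_dir /app/models/tensorrt_llm/context_fmha \
             --paged_kv_cache disable \
             --enable_xqa disable \
             --multi_block_mode disable \
             --tp_size 2 \
             --max_batch_size 1 \
             --max_input_len 4096 \
             --max_output_len 2048

With use_custom_all_reduce disabled, TensorRT-LLM falls back to NCCL for the tensor-parallel all-reduce, which sidesteps the hang, typically at a small latency cost.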
