/generate request possibly hanging when CUDA out of memory is thrown #435

Closed

Gintasz opened this issue May 13, 2024 · 4 comments

Gintasz commented May 13, 2024

I ran run_batch with 1000 items and num_threads=200. The batch processing gets stuck at 98%, after which the server prints no more console logs. Checking the full log, I see several CUDA out of memory errors.

I therefore suspect that when this error is thrown, the affected /generate requests are left hanging. I added a retry on timeout to http_request (see my pull request) and the batch still gets stuck, which is why I believe those requests hang rather than fail: if they failed, the retry mechanism would have kicked in.
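For reference, a minimal sketch of the kind of batch run involved, assuming the standard sglang frontend API (the prompt wording, max_tokens, and the `questions` list are placeholders, not my actual program):

```python
import sglang as sgl

# Placeholder sketch of the batch run described above (not the actual program).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:42069"))

@sgl.function
def answer(s, question):
    s += "Question: " + question + "\n"
    s += "Answer: " + sgl.gen("answer", max_tokens=256)

questions = ["example question"] * 1000  # stand-in for the real 1000 items

# 200 client threads, as in the report; many concurrent prefill requests is
# exactly the situation in which the OOM below shows up.
states = answer.run_batch(
    [{"question": q} for q in questions],
    num_threads=200,
    progress_bar=True,
)
```

The server was launched with the command below; the traceback that follows is from the server log.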

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 42069 --host 0.0.0.0 --tp-size 1 --mem-fraction-static 0.8
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 184, in exposed_step
    self.forward_step()
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 199, in forward_step
    self.forward_fill_batch(new_batch)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 412, in forward_fill_batch
    ) = self.model_runner.forward(
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 510, in forward
    return self.forward_extend(**kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 415, in forward_extend
    return self.model.forward(input_ids, input_metadata.positions, input_metadata)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 270, in forward
    return self.logits_processor(
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/pubmed-baigiamasis/.venv/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 50, in forward
    all_logprobs = torch.log(torch.softmax(logits.float(), dim=-1) + 1e-6)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.07 GiB. GPU 0 has a total capacty of 23.68 GiB of which 1.25 GiB is free. Process 82733 has 22.42 GiB memory in use. Of the allocated memory 21.89 GiB is allocated by PyTorch, and 230.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Full server log here: log.txt

hnyls2002 (Collaborator) commented

@Gintasz Try decreasing --mem-fraction-static further. Since you are using the logprobs utilities, extra unallocated GPU memory is needed for the temporary processing tensors.
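For a rough sense of scale (my own back-of-envelope numbers, not from the issue): the failing line materializes softmax(logits.float()) over the full vocabulary in float32, and with Llama-3's 128,256-entry vocabulary a few thousand token positions in a prefill batch already approaches the 2.07 GiB allocation reported in the traceback:

```python
# Rough estimate of one temporary tensor from torch.softmax(logits.float(), dim=-1),
# assuming Llama-3's 128,256-entry vocabulary; real peak usage is higher because
# the float cast, the softmax output, and the log output each need their own buffer.
VOCAB_SIZE = 128_256
BYTES_PER_FLOAT32 = 4

def temp_tensor_gib(num_positions: int) -> float:
    return num_positions * VOCAB_SIZE * BYTES_PER_FLOAT32 / 1024**3

for positions in (1024, 2048, 4096):
    print(f"{positions:5d} positions -> ~{temp_tensor_gib(positions):.2f} GiB")
# 4096 positions -> ~1.96 GiB, on the order of the 2.07 GiB allocation above
```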

Gintasz (Author) commented May 16, 2024

@hnyls2002 I was using version 0.1.14 here. I know lowering --mem-fraction-static should reduce these errors, but my point is that when the error is thrown, processing of the other generation requests continues while, I assume, the requests that hit the error are left hanging. As a result, run_batch itself hangs: with a batch of 1000 items where 10 hit the out-of-memory error, 990 items may finish but 10 hang, so run_batch never completes.

It should not hang; it should fail with a 500 status code so that the http_request method can retry the failed request and finish the batch successfully. My pull request adds a retry on timeout, but I think it should also retry based on the status code, or perhaps on any exception.

Admittedly, I have not verified that the server-side request actually hangs; this is my assumption, based on the fact that no failure exception reaches the client for the out-of-memory generations.
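To illustrate the kind of client-side retry being suggested (a hypothetical sketch, not the code from the pull request; the /generate payload shape is assumed from sglang's HTTP API):

```python
import time
import requests

def post_with_retry(url, payload, max_retries=3, timeout=300):
    """Hypothetical retry wrapper: retry on timeouts, connection errors, and 5xx responses."""
    last_error = RuntimeError("no attempts made")
    for attempt in range(max_retries):
        try:
            resp = requests.post(url, json=payload, timeout=timeout)
            if resp.status_code < 500:
                return resp  # success, or a non-retryable client error
            last_error = RuntimeError(f"server returned {resp.status_code}")
        except (requests.Timeout, requests.ConnectionError) as exc:
            last_error = exc
        time.sleep(2 ** attempt)  # simple exponential backoff between attempts
    raise last_error

# Example usage against the server launched above:
# post_with_retry("http://localhost:42069/generate",
#                 {"text": "Hello", "sampling_params": {"max_new_tokens": 16}})
```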

AmericanPresidentJimmyCarter commented

I can get it stable by turning off the radix trie cache with --disable_radix_cache, but that's not really a good solution.

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
