add available memory check to accelerators #4508

Merged: 6 commits into master, Oct 17, 2023

Conversation

@jeffra (Contributor) commented Oct 12, 2023

There are many scenarios where we need an accurate estimate of the available memory on a device. Currently we rely on torch memory allocator stats for this information; however, there are several cases where memory may be allocated outside the view of torch. This means that torch.cuda.get_device_properties(device_index).total_memory - torch.cuda.memory_allocated(device_index) is not accurate. This is usually less of a problem on data center GPUs, but it is quite common on consumer-grade GPUs, which are often shared between torch and the operating system.
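
For reference, a minimal sketch of the torch-only estimate described above (the function name is illustrative, not from this PR):

```python
import torch

def torch_available_memory(device_index: int = 0) -> int:
    """Free-memory estimate from torch allocator stats alone (illustrative)."""
    # This only accounts for memory torch itself has allocated, so it
    # over-estimates free memory when other processes (e.g. the display
    # server on a consumer GPU) also hold device memory.
    total = torch.cuda.get_device_properties(device_index).total_memory
    return total - torch.cuda.memory_allocated(device_index)
```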

This PR introduces available_memory to the abstract accelerator interface. On CUDA devices we can rely on pynvml to get the ground truth w.r.t. available memory.

This also introduces a hard dependency on pynvml. I have tested on non-GPU systems: the package installs successfully but fails at runtime at the nvmlInit() call. In cases where pynvml is not functional, we fall back to torch memory stats.
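
A rough sketch of the approach, assuming the standard pynvml API; the actual implementation lives in accelerator/cuda_accelerator.py and may differ in detail:

```python
import torch

def available_memory(device_index: int = 0) -> int:
    """Best-effort free device memory in bytes (illustrative sketch)."""
    try:
        import pynvml
        pynvml.nvmlInit()  # raises on systems without NVIDIA drivers
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        free_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).free
        pynvml.nvmlShutdown()
        return free_bytes  # the driver's ground truth, incl. non-torch usage
    except Exception:
        # pynvml installed but non-functional (e.g. nvmlInit() failed):
        # fall back to torch allocator stats as described above.
        total = torch.cuda.get_device_properties(device_index).total_memory
        return total - torch.cuda.memory_allocated(device_index)
```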

@tjruwase (Contributor) commented:

@delock, FYI

@delock (Contributor) commented Oct 15, 2023

> @delock, FYI

Thanks for the reminder. I think the CPU part is good. We will add this to the XPU backend as well.

@tjruwase added this pull request to the merge queue Oct 16, 2023
Merged via the queue into master with commit 12aedac Oct 17, 2023
15 checks passed
baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
* add available memory check to accelerator

* catch case where nvmlInit fails

* add pynvml to reqs

* fix for cpu systems

* Update accelerator/cuda_accelerator.py

Co-authored-by: Michael Wyatt <[email protected]>

* simplify

---------

Co-authored-by: Michael Wyatt <[email protected]>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
4 participants