
Implement Argmax Inference Instead of Softmax when 'has_regions' is False #2356

Open · wants to merge 2 commits into base: master
Conversation

PengchengShi1220

Issue: I have encountered a memory consumption problem in my Docker setup during inference. Specifically, with shm-size set to 30 GB, the system successfully runs the softmax operation on an input tensor of size [24, 724, 435, 435]. Reducing shm-size to 15 GB, however, causes the Docker container to be terminated during the softmax computation due to insufficient shared memory.

Solution: Use argmax instead of softmax when the 'has_regions' flag is set to False. This change is intended to address the shared-memory (shm-size) constraints in Docker environments. After the proposed change, the container handles the same computation even with shm-size set to 0.
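To make the equivalence concrete, here is a minimal sketch (my own illustration, not the actual diff in this PR; the helper name is made up). Softmax is strictly increasing per voxel along the class dimension, so argmax over the raw logits yields the same segmentation as argmax over the probabilities, without materializing a second float32 tensor:

```python
import torch

def logits_to_segmentation(logits: torch.Tensor) -> torch.Tensor:
    # logits: float32 tensor of shape [num_classes, X, Y, Z],
    # e.g. [24, 724, 435, 435] (~12.2 GiB in float32).
    # Equivalent to torch.softmax(logits, dim=0).argmax(dim=0), but skips
    # the softmax and therefore the second ~12 GiB probability tensor;
    # the int64 output is only [X, Y, Z] (~1 GiB for the shape above).
    return logits.argmax(dim=0)
```

Note that this shortcut only applies to mutually exclusive labels; region-based setups (has_regions == True) still need actual probabilities.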

@FabianIsensee FabianIsensee self-assigned this Jul 8, 2024
@FabianIsensee
Member

Thanks for this PR. To be honest, I am not a big fan of this, because we are taking something that is well structured and abstracted (the labelmanager is in charge) and making it more complicated by moving functionality around. The original idea was that the labelmanager is the central authority deciding how predictions should be handled, so that we only need to maintain and change that logic in one location. If we now start re-adding label handling to different functions in different locations, it all becomes a complicated mess again.
So I'd much rather understand the underlying reason for this problem and address it systematically. Why does your implementation not use as much SHM as mine? Does the overall amount of RAM needed for the export change? Why not just give the Docker container more SHM?

@PengchengShi1220
Author

Hi,
Thank you for your feedback. I encountered this issue while preparing for the AortaSeg24 Grand Challenge. According to the Grand Challenge platform's requirements for algorithms (Grand Challenge Documentation), 50% of the system memory is allocated as shared memory: on a 16 GB instance, /dev/shm will be 8 GB, and this percentage is not modifiable. The provided instance is ml.g4dn.2xlarge with 1x NVIDIA T4 GPU, 8 vCPUs, 32 GB of memory, and 1x 225 GB NVMe SSD. Participants can therefore use at most 32 GB of memory and an shm-size of 16 GB.

When I set up a fresh Docker environment, I observed similar behavior. There, setting both the memory limit and shm-size to 35 GB allows the softmax operation on an input tensor of size [24, 724, 435, 435] to complete, whereas setting both to 30 GB leads to the unexpected termination of the container. This suggests that inference on machines with 30 GB of memory or less will run into the same issue.
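For reference, a back-of-the-envelope estimate (my own numbers, not measurements from this PR) of why these limits bite:

```python
# Rough peak-memory estimate for the two paths. Assumptions: float32
# logits, int64 argmax output; real peaks also include worker processes,
# resampling buffers, and framework overhead.
num_classes, x, y, z = 24, 724, 435, 435
elems = num_classes * x * y * z                     # ~3.29e9 elements
logits_gib = elems * 4 / 1024**3                    # ~12.2 GiB float32 input
softmax_peak = 2 * logits_gib                       # input + probabilities: ~24.5 GiB
argmax_peak = logits_gib + x * y * z * 8 / 1024**3  # input + label map: ~13.3 GiB
print(f"softmax ~{softmax_peak:.1f} GiB, argmax ~{argmax_peak:.1f} GiB")
```

Two ~12 GiB float32 tensors alone approach the 30 GB limit before any other allocations, while the argmax path only adds a ~1 GiB label map, which would be consistent with the 35 GB / 30 GB behavior above.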

Regarding your point about label management, I agree that handling labels in one location is cleaner and avoids redundancy. However, I am currently unsure how to add an option for this inference behavior directly within the label manager; a rough idea is sketched below. I would appreciate further suggestions on how to address this systematically.
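For discussion, one possible shape for keeping the decision inside the label manager (purely illustrative; the method names below are hypothetical, not nnU-Net's actual LabelManager API):

```python
import torch

class LabelManager:
    # Hypothetical sketch: the label manager remains the single authority
    # for turning network output into a segmentation.
    def __init__(self, has_regions: bool):
        self.has_regions = has_regions

    def convert_logits_to_segmentation(self, logits: torch.Tensor) -> torch.Tensor:
        if not self.has_regions:
            # Fast path: per voxel, argmax(logits) == argmax(softmax(logits)),
            # so the probability tensor is never allocated.
            return logits.argmax(dim=0)
        # Region-based setups still need real probabilities.
        probs = torch.sigmoid(logits)
        return self._probabilities_to_segmentation(probs)

    def _probabilities_to_segmentation(self, probs: torch.Tensor) -> torch.Tensor:
        # Placeholder for the existing region-handling logic.
        raise NotImplementedError
```

That way callers never branch on has_regions themselves, which seems to preserve the single-authority design you describe.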
