
Implement Argmax Inference Instead of Softmax when 'has_regions' is False #2356

Open · wants to merge 2 commits into base: master
Conversation

PengchengShi1220

Issue: I have encountered a memory consumption problem in my Docker setup during inference. Specifically, with shm-size set to 30 GB, the system successfully runs the softmax operation on an input tensor of size [24, 724, 435, 435]. Reducing shm-size to 15 GB, however, causes the Docker container to be terminated during the softmax computation due to insufficient shared memory.

Solution: Use argmax instead of softmax when the 'has_regions' flag is set to False. This change is intended to address the shared-memory (shm-size) constraints in Docker environments. After the proposed change, the container handles the same computation even with shm-size set to 0.
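To make the equivalence concrete, here is a minimal sketch (my own illustration, not the actual diff in this PR; the helper name is made up). Softmax is strictly increasing per voxel along the class dimension, so argmax over the raw logits yields the same segmentation as argmax over the probabilities, without materializing a second float32 tensor:

```python
import torch

def logits_to_segmentation(logits: torch.Tensor) -> torch.Tensor:
    # logits: float32 tensor of shape [num_classes, X, Y, Z],
    # e.g. [24, 724, 435, 435] (~12.2 GiB in float32).
    # Equivalent to torch.softmax(logits, dim=0).argmax(dim=0), but skips
    # the softmax and therefore the second ~12 GiB probability tensor;
    # the int64 output is only [X, Y, Z] (~1 GiB for the shape above).
    return logits.argmax(dim=0)
```

Note that this shortcut only applies to mutually exclusive labels; region-based setups (has_regions == True) still need actual probabilities.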

@FabianIsensee FabianIsensee self-assigned this Jul 8, 2024
@FabianIsensee
Member

Thanks for this PR. To be honest, I am not a big fan of this, because we are taking something that is well structured and abstracted (the labelmanager is in charge) and making it more complicated by moving functionality around. The original idea was that the labelmanager is the central authority deciding how predictions should be handled, so that we only need to maintain and change that logic in one location. If we now start re-adding label handling to different functions in different locations, it all becomes a complicated mess again.
So I'd much rather understand the underlying reason for this problem and address it systematically. Why does your implementation not use as much SHM as mine? Does the overall amount of RAM needed for the export change? Why not just give the Docker container more SHM?

@PengchengShi1220
Author

Hi,
Thank you for your feedback. I encountered this issue while preparing for the AortaSeg24 Grand Challenge. According to the Grand Challenge platform's requirements for algorithms (Grand Challenge Documentation), 50% of the system memory is allocated as shared memory: on a 16 GB instance, /dev/shm will be 8 GB, and this percentage is not modifiable. The provided instance is ml.g4dn.2xlarge with 1x NVIDIA T4 GPU, 8 vCPUs, 32 GB of memory, and 1x 225 GB NVMe SSD. Participants can therefore use at most 32 GB of memory and an shm-size of 16 GB.

When I set up a fresh Docker environment, I observed similar behavior. There, setting both the memory limit and shm-size to 35 GB allows the softmax operation on an input tensor of size [24, 724, 435, 435] to complete, whereas setting both to 30 GB leads to the unexpected termination of the container. This suggests that inference on machines with 30 GB of memory or less will run into the same issue.
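For reference, a back-of-the-envelope estimate (my own numbers, not measurements from this PR) of why these limits bite:

```python
# Rough peak-memory estimate for the two paths. Assumptions: float32
# logits, int64 argmax output; real peaks also include worker processes,
# resampling buffers, and framework overhead.
num_classes, x, y, z = 24, 724, 435, 435
elems = num_classes * x * y * z                     # ~3.29e9 elements
logits_gib = elems * 4 / 1024**3                    # ~12.2 GiB float32 input
softmax_peak = 2 * logits_gib                       # input + probabilities: ~24.5 GiB
argmax_peak = logits_gib + x * y * z * 8 / 1024**3  # input + label map: ~13.3 GiB
print(f"softmax ~{softmax_peak:.1f} GiB, argmax ~{argmax_peak:.1f} GiB")
```

Two ~12 GiB float32 tensors alone approach the 30 GB limit before any other allocations, while the argmax path only adds a ~1 GiB label map, which would be consistent with the 35 GB / 30 GB behavior above.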

Regarding your point about label management, I agree that handling labels in one location is cleaner and avoids redundancy. However, I am currently unsure how to add an option for this inference behavior directly within the label manager; a rough idea is sketched below. I would appreciate further suggestions on how to address this systematically.
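For discussion, one possible shape for keeping the decision inside the label manager (purely illustrative; the method names below are hypothetical, not nnU-Net's actual LabelManager API):

```python
import torch

class LabelManager:
    # Hypothetical sketch: the label manager remains the single authority
    # for turning network output into a segmentation.
    def __init__(self, has_regions: bool):
        self.has_regions = has_regions

    def convert_logits_to_segmentation(self, logits: torch.Tensor) -> torch.Tensor:
        if not self.has_regions:
            # Fast path: per voxel, argmax(logits) == argmax(softmax(logits)),
            # so the probability tensor is never allocated.
            return logits.argmax(dim=0)
        # Region-based setups still need real probabilities.
        probs = torch.sigmoid(logits)
        return self._probabilities_to_segmentation(probs)

    def _probabilities_to_segmentation(self, probs: torch.Tensor) -> torch.Tensor:
        # Placeholder for the existing region-handling logic.
        raise NotImplementedError
```

That way callers never branch on has_regions themselves, which seems to preserve the single-authority design you describe.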
