[RFC] Add third-party malloc library to improve pytorch memory performance on Windows #102534

Open
xuhancn opened this issue May 30, 2023 · 5 comments
Assignees
Labels
intel This tag is for PR from Intel module: cpu CPU specific problem (e.g., perf, algorithm) module: performance Issues related to performance, either of kernel code or framework glue module: windows Windows support for PyTorch triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@xuhancn
Collaborator

xuhancn commented May 30, 2023

🚀 The feature, motivation and pitch

This RFC requests comments on adding a third-party malloc library to improve PyTorch memory performance on Windows.

While debugging issue #62387, we found that the major performance gap between Windows and Linux comes from the poor memory allocation performance on Windows.

I also wrote a simple malloc benchmark project, bench_malloc, which shows that a third-party malloc (tc_malloc) can improve memory allocation performance on Windows.
image
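
As a rough illustration of what such a benchmark measures, here is a minimal sketch of an allocate/free timing loop. It is not the code from bench_malloc; the function name `time_malloc_free` and its parameters are made up for this example. Linking the same loop against a different malloc implementation is what would expose the gap.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical micro-benchmark (not taken from bench_malloc): in each round,
// allocate `count` blocks of `size` bytes, then free them all. Returns the
// elapsed wall-clock time in seconds.
double time_malloc_free(std::size_t rounds, std::size_t count, std::size_t size) {
    assert(size > 0);
    std::vector<void*> blocks(count, nullptr);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < rounds; ++r) {
        for (std::size_t i = 0; i < count; ++i) {
            blocks[i] = std::malloc(size);
            // Touch the first byte so the allocation is not optimized away.
            if (blocks[i] != nullptr) {
                static_cast<volatile char*>(blocks[i])[0] = 1;
            }
        }
        for (void* p : blocks) {
            std::free(p);
        }
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling, for example, `time_malloc_free(100, 10000, 4096)` once per allocator build gives comparable timings; the block sizes and counts that matter for PyTorch workloads may differ.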

After that, I evaluated some popular third-party malloc libraries and made a brief summary here:
image
From the summary, only two libraries are viable candidates:

Option 1: tc_malloc from gperftools.

  1. It has the best performance, improving the benchmark from the original 11.1s to 2.9s.
  2. It requires upstreaming some code changes to gperftools. I have submitted the PR needed for PyTorch, gperftools/gperftools#1396, but have not received a response so far.

Note: the gperftools repo has been inactive for roughly a year; the latest commit is gperftools/gperftools@bf8b714 from May 31, 2022.

Option 2: mimalloc

  1. It improves the benchmark from the original 11.1s to 3.9s. It is not as good as tc_malloc, but still a huge improvement.
  2. It has better compatibility with PyTorch and can be integrated into PyTorch directly.
  3. I have a PR for this option: Enable mimalloc on pytorch Windows #102595

Alternatives

Option 3: Implement a caching memory allocator for CPU in PyTorch.

  1. It adds no dependency on a third-party library.
  2. It needs a lot of effort to develop and test.
  3. Its principle is similar to tc_malloc and mimalloc, so optimizing an existing third-party library is the better choice.
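
To make the caching idea concrete, here is a minimal sketch of a size-keyed caching allocator. It is a toy under stated assumptions (single-threaded, exact-size reuse, no alignment control, no cache size limit) and is not how tc_malloc, mimalloc, or PyTorch implement it; the class name `CachingAllocator` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical single-threaded caching allocator: freed blocks are kept in
// per-size free lists and handed back on the next request of the same size,
// so repeated allocations avoid a round trip to the system malloc.
class CachingAllocator {
public:
    CachingAllocator() = default;
    CachingAllocator(const CachingAllocator&) = delete;
    CachingAllocator& operator=(const CachingAllocator&) = delete;
    ~CachingAllocator() { release_cache(); }

    void* allocate(std::size_t size) {
        assert(size > 0);
        auto it = free_blocks_.find(size);
        if (it != free_blocks_.end() && !it->second.empty()) {
            void* p = it->second.back();  // reuse a cached block
            it->second.pop_back();
            ++cache_hits_;
            return p;
        }
        return std::malloc(size);  // cache miss: fall back to system malloc
    }

    // The caller passes the original size so the block goes to the right list.
    void deallocate(void* p, std::size_t size) {
        if (p != nullptr) {
            free_blocks_[size].push_back(p);
        }
    }

    // Return every cached block to the system.
    void release_cache() {
        for (auto& entry : free_blocks_) {
            for (void* p : entry.second) {
                std::free(p);
            }
            entry.second.clear();
        }
    }

    std::size_t cache_hits() const { return cache_hits_; }

private:
    std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
    std::size_t cache_hits_ = 0;
};
```

A production version would also need thread safety, size classes instead of exact sizes, and a policy for trimming the cache, which is where most of the development and test effort mentioned above would go.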

Additional context


My proposal

  1. I'm not sure whether the gperftools PR will be accepted, and we can't wait on option 1 indefinitely.
  2. We can enable option 2 (mimalloc) first to optimize PyTorch on Windows.
  3. Design a build option to switch the malloc library between system malloc, mimalloc, and (in the future) tc_malloc.
  4. We could then select option 2, enable mimalloc, and optimize mimalloc for PyTorch afterwards.
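
One way the build-option switch from item 3 could look on the C++ side is sketched below. This is an assumption-laden sketch, not the actual PyTorch code: the wrapper names `cpu_alloc` and `cpu_free` are hypothetical, and it assumes the build option is exposed to the compiler as a `USE_MIMALLOC` macro. `mi_malloc_aligned` and `mi_free` are mimalloc's documented functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

#if defined(USE_MIMALLOC)
#include <mimalloc.h>  // assumed to be on the include path when the option is on
#elif defined(_MSC_VER)
#include <malloc.h>    // _aligned_malloc / _aligned_free
#endif

// Hypothetical wrapper: one allocation entry point whose backend is chosen
// at build time. `alignment` must be a power of two.
inline void* cpu_alloc(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#if defined(USE_MIMALLOC)
    return mi_malloc_aligned(size, alignment);
#elif defined(_MSC_VER)
    return _aligned_malloc(size, alignment);
#else
    // posix_memalign requires the alignment to be a multiple of sizeof(void*).
    if (alignment < sizeof(void*)) {
        alignment = sizeof(void*);
    }
    void* p = nullptr;
    return posix_memalign(&p, alignment, size) == 0 ? p : nullptr;
#endif
}

// Memory must be released by the same backend that allocated it.
inline void cpu_free(void* p) {
#if defined(USE_MIMALLOC)
    mi_free(p);
#elif defined(_MSC_VER)
    _aligned_free(p);
#else
    free(p);
#endif
}
```

Keeping allocation and release behind one pair of functions means a further backend such as tc_malloc could be added as another branch without touching callers.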

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @vladimir-aubrecht @iremyux @Blackhex @cristianPanaite @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ngimel

@xuhancn xuhancn added module: performance Issues related to performance, either of kernel code or framework glue module: windows Windows support for PyTorch module: cpu CPU specific problem (e.g., perf, algorithm) labels May 30, 2023
@cpuhrsch cpuhrsch added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 30, 2023
@xuhancn xuhancn linked a pull request May 31, 2023 that will close this issue
@xuhancn xuhancn removed a link to a pull request May 31, 2023
pytorchmergebot pushed a commit that referenced this issue Jun 27, 2023
This PR is the implementation of [#102534](#102534), option 2.
Major changes:
1. Add mimalloc as a submodule.
2. Add the build option "USE_MIMALLOC".
3. It is only enabled on Windows builds, and it improves PyTorch memory allocation performance.

Additional Test:
<img width="953" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4b2ec2dc-16f1-4ad9-b457-cfeb37e489d3">
This PR also builds and statically links mimalloc on Linux successfully.

Pull Request resolved: #102595
Approved by: https://github.com/jgong5, https://github.com/malfet
@mjp41

mjp41 commented Jun 29, 2023

@xuhancn, If you are interested in trying https://github.com/microsoft/snmalloc, I would be happy to help. It has CMake and Windows support.

@xuhancn
Collaborator Author

xuhancn commented Jul 2, 2023

> @xuhancn, If you are interested in trying https://github.com/microsoft/snmalloc, I would be happy to help. It has CMake and Windows support.

I will try it soon. Thanks.

@xuhancn
Collaborator Author

xuhancn commented Jul 6, 2023

@mjp41 Hi, I wrote a simple project to study snmalloc: https://github.com/xuhancn/research_embedded_snmalloc/blob/main/src/main.cpp#L4
I have included the headers and have some questions about enabling it:

  1. How do I wrap and call snmalloc's malloc/free/align_malloc?
  2. Will it override the system malloc/new functions? If so, how do I disable the override?

@xuhancn
Collaborator Author

xuhancn commented Jul 10, 2023

Option 2 merged:
#102595
#104497

@louie-tsai

@aice-support
