Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regrid memory leak #225

Merged
merged 9 commits into from
Jul 28, 2021
Merged

Conversation

ThibHlln
Copy link

@ThibHlln ThibHlln commented Jun 25, 2021

resolve #222

The special methods __del__ of ESMF.Field and ESMF.Grid objects feature calls to their respective destroy methods that release the memory allocated by ESMF in Fortran. When these objects go out of scope, the Python garbage collection ought to call these __del__ methods, yet their allocated memory does not seem to be released properly, as repeatedly regridding fields results in running out of memory after a while.

After a discussion with @davidhassell, we suspect that the ESMF.Manager may be responsible for not making the memory release possible as long as it stays in scope. However, in cf the Manager is initialised but not bound to a variable name (see e.g. https://github.com/ThibHlln/cf-python/blob/d9cde28d9abd9cfeb6575c737560956f40620942/cf/field.py#L15619), so to me it goes out of scope directly in a Python sense. The ESMF documentation says that "The Manager will be created when the first ESMPy object is created if it is not created explicitly by the user". So I suspect that the Manager stays around (as a Fortran object I suppose) even if not bound to a name in Python's namespace (which may also very well be the case for Fields/Grids objects then).

In the cf regridding, there are ESMF fields and grids created during each call to regrids/regridc. If these are explicitly releasing memory (through calls to their destroy methods), this does seem to eliminate the running out of memory when repeatedly calling the regridding methods. Note, calling the destroy methods on these objects several times does not raise an error, if memory was already released, it remains silent. So this PR does just that, i.e. it explicitly releases memory for all ESMF objects as soon as they are not required anymore. This does not provide as explanation for why the destroy calls in Python's garbage collection are not working/not enough, but it seems to be a cheap and harmless solution that solves the problem.

I am attaching these field files if you'd like to test the following script:

fld_src = cf.read('fld_src_3d.nc')[0]  # try also with 'fld_src_2d.nc'
fld_dst = cf.read('fld_dst_3d.nc')[0]  # try also with 'fld_dst_2d.nc'

regrid_op = fld_src.regrids(fld_dst, 'conservative', return_operator=True)
print('regrid weights generated')

for _ in range(5000):
    fld_src.regrids(regrid_op)

Before this PR, after the expensive generation of the weights (it may take a short while), the memory was building up very quickly in the loop until running out of memory (using 8GB of RAM for my test), while with the modifications in this PR the memory now remains stable across the loop.

Thibault Hallouin added 6 commits June 24, 2021 18:36
memory will be released when `RegridOperator` goes out of scope
if not released, memory appeared to build up, even if the objects were going out of scope, likely due to ESMF.Manager still being active
unlike regrids, regridc works on subsections of the data sections, so the memory release needed to happen for each subsection
now all create_Field and create_Grid should have a corresponding `destroy()`
if the mask does not change, the fields/grids won't be recreated each time, so if they are destroyed after the first section, the section second will have nothing to work with
@ThibHlln ThibHlln marked this pull request as ready for review June 25, 2021 16:51
@davidhassell
Copy link
Collaborator

Thanks, @ThibHlln - I look forward to looking at this next week. Have a good weekend,
David

Copy link
Collaborator

@davidhassell davidhassell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Thibault,

This all looks sound, along the lines we have discussed. I confirm the behaviour you report with this PR vs. 3.10.0.

I've noted couple of small things we should look at before we merge.

Thanks for sorting this out - I especially appreciate you taking the time ensure that everything is consistent by modifying files in cf/regrid/.

Some of these changes will need to be copied in cf.Field.regridc, and you are welcome to add those in :), but equally I'm happy to do that in a separate PR after this has been merged.

David

cf/field.py Show resolved Hide resolved
cf/field.py Outdated Show resolved Hide resolved
@ThibHlln
Copy link
Author

ThibHlln commented Jun 28, 2021

Hi David,

Thank you for your careful review.
I believe everything done in regrids is done in regridc too now, although I must say I have not run any test using regridc yet.
I could run those tests now, if you have not done it yet.

T

Copy link
Collaborator

@davidhassell davidhassell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - looks good.

@sadielbartholomew
Copy link
Member

@davidhassell - when you return from leave - is there a reason this hasn't been merged despite looking like it is ready / in a mergable state? We don't want to keep Thibault waiting if there isn't a reason. Thanks.

@davidhassell
Copy link
Collaborator

Ah - no reason! I'll merge it now. Thanks.

@davidhassell davidhassell merged commit fa62a4e into NCAS-CMS:master Jul 28, 2021
@davidhassell davidhassell added this to the 3.11.0 milestone Sep 30, 2021
ThibHlln pushed a commit to ThibHlln/unifhy that referenced this pull request Oct 20, 2021
This version includes:
- fix for memory leak due to ESMF objects NCAS-CMS/cf-python#225
- lift for size 1 restriction on grid NCAS-CMS/cf-python#249
- fix bug preventing cyclic domain (e.g. global runs) NCAS-CMS/cf-python#237
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remembering weights between consecutive regridding operations
3 participants