Fix regrid memory leak #225

ThibHlln · 2021-06-25T16:50:00Z

resolve #222

The special methods __del__ of ESMF.Field and ESMF.Grid objects feature calls to their respective destroy methods that release the memory allocated by ESMF in Fortran. When these objects go out of scope, the Python garbage collection ought to call these __del__ methods, yet their allocated memory does not seem to be released properly, as repeatedly regridding fields results in running out of memory after a while.

After a discussion with @davidhassell, we suspect that the ESMF.Manager may be responsible for not making the memory release possible as long as it stays in scope. However, in cf the Manager is initialised but not bound to a variable name (see e.g. https://github.com/ThibHlln/cf-python/blob/d9cde28d9abd9cfeb6575c737560956f40620942/cf/field.py#L15619), so to me it goes out of scope directly in a Python sense. The ESMF documentation says that "The Manager will be created when the first ESMPy object is created if it is not created explicitly by the user". So I suspect that the Manager stays around (as a Fortran object I suppose) even if not bound to a name in Python's namespace (which may also very well be the case for Fields/Grids objects then).

In the cf regridding, there are ESMF fields and grids created during each call to regrids/regridc. If these are explicitly releasing memory (through calls to their destroy methods), this does seem to eliminate the running out of memory when repeatedly calling the regridding methods. Note, calling the destroy methods on these objects several times does not raise an error, if memory was already released, it remains silent. So this PR does just that, i.e. it explicitly releases memory for all ESMF objects as soon as they are not required anymore. This does not provide as explanation for why the destroy calls in Python's garbage collection are not working/not enough, but it seems to be a cheap and harmless solution that solves the problem.

I am attaching these field files if you'd like to test the following script:

fld_src = cf.read('fld_src_3d.nc')[0]  # try also with 'fld_src_2d.nc'
fld_dst = cf.read('fld_dst_3d.nc')[0]  # try also with 'fld_dst_2d.nc'

regrid_op = fld_src.regrids(fld_dst, 'conservative', return_operator=True)
print('regrid weights generated')

for _ in range(5000):
    fld_src.regrids(regrid_op)

Before this PR, after the expensive generation of the weights (it may take a short while), the memory was building up very quickly in the loop until running out of memory (using 8GB of RAM for my test), while with the modifications in this PR the memory now remains stable across the loop.

memory will be released when `RegridOperator` goes out of scope

if not released, memory appeared to build up, even if the objects were going out of scope, likely due to ESMF.Manager still being active

unlike regrids, regridc works on subsections of the data sections, so the memory release needed to happen for each subsection

now all create_Field and create_Grid should have a corresponding `destroy()`

if the mask does not change, the fields/grids won't be recreated each time, so if they are destroyed after the first section, the section second will have nothing to work with

davidhassell · 2021-06-25T16:58:25Z

Thanks, @ThibHlln - I look forward to looking at this next week. Have a good weekend,
David

davidhassell

Hi Thibault,

This all looks sound, along the lines we have discussed. I confirm the behaviour you report with this PR vs. 3.10.0.

I've noted couple of small things we should look at before we merge.

Thanks for sorting this out - I especially appreciate you taking the time ensure that everything is consistent by modifying files in cf/regrid/.

Some of these changes will need to be copied in cf.Field.regridc, and you are welcome to add those in :), but equally I'm happy to do that in a separate PR after this has been merged.

David

cf/field.py

ThibHlln · 2021-06-28T13:25:46Z

Hi David,

Thank you for your careful review.
I believe everything done in regrids is done in regridc too now, although I must say I have not run any test using regridc yet.
I could run those tests now, if you have not done it yet.

T

davidhassell

Thanks - looks good.

sadielbartholomew · 2021-07-22T20:44:51Z

@davidhassell - when you return from leave - is there a reason this hasn't been merged despite looking like it is ready / in a mergable state? We don't want to keep Thibault waiting if there isn't a reason. Thanks.

davidhassell · 2021-07-28T12:37:16Z

Ah - no reason! I'll merge it now. Thanks.

This version includes: - fix for memory leak due to ESMF objects NCAS-CMS/cf-python#225 - lift for size 1 restriction on grid NCAS-CMS/cf-python#249 - fix bug preventing cyclic domain (e.g. global runs) NCAS-CMS/cf-python#237

Thibault Hallouin added 6 commits June 24, 2021 18:36

add explicit memory release for all objects involved in RegridOperator

70d8a30

memory will be released when `RegridOperator` goes out of scope

add memory release for internal ESMF objects

36be404

if not released, memory appeared to build up, even if the objects were going out of scope, likely due to ESMF.Manager still being active

fix some typos

5314a79

relocate memory release in subsection for regridc

cdaefb5

unlike regrids, regridc works on subsections of the data sections, so the memory release needed to happen for each subsection

add some missing memory releases

60506f4

now all create_Field and create_Grid should have a corresponding `destroy()`

fix premature destruction of source grids/fields

d123d5a

if the mask does not change, the fields/grids won't be recreated each time, so if they are destroyed after the first section, the section second will have nothing to work with

ThibHlln marked this pull request as ready for review June 25, 2021 16:51

davidhassell requested changes Jun 28, 2021

View reviewed changes

cf/field.py Show resolved Hide resolved

cf/field.py Outdated Show resolved Hide resolved

Thibault Hallouin added 3 commits June 28, 2021 15:10

remove premature memory release fogotten in d123d5a

68b0439

add missing memory releases before raising error

d79f9d2

remove unnecessary condition

2bdd324

davidhassell approved these changes Jun 28, 2021

View reviewed changes

davidhassell merged commit fa62a4e into NCAS-CMS:master Jul 28, 2021

davidhassell added this to the 3.11.0 milestone Sep 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix regrid memory leak #225

Fix regrid memory leak #225

ThibHlln commented Jun 25, 2021 •

edited

Loading

davidhassell commented Jun 25, 2021

davidhassell left a comment

ThibHlln commented Jun 28, 2021 •

edited

Loading

davidhassell left a comment

sadielbartholomew commented Jul 22, 2021

davidhassell commented Jul 28, 2021

Fix regrid memory leak #225

Fix regrid memory leak #225

Conversation

ThibHlln commented Jun 25, 2021 • edited Loading

davidhassell commented Jun 25, 2021

davidhassell left a comment

Choose a reason for hiding this comment

ThibHlln commented Jun 28, 2021 • edited Loading

davidhassell left a comment

Choose a reason for hiding this comment

sadielbartholomew commented Jul 22, 2021

davidhassell commented Jul 28, 2021

ThibHlln commented Jun 25, 2021 •

edited

Loading

ThibHlln commented Jun 28, 2021 •

edited

Loading