
dask: Reinstate netCDF lock #531

Merged

Conversation

@davidhassell (Collaborator) commented Dec 19, 2022

It turns out that removing the concurrency lock for opening netCDF files is not (yet) safe in the netCDF-C library, so we should put it back for now. I had thought that separate opens/closes would be OK, but that is not the case. This may be fixed in a future version of netCDF-C. See, e.g. Unidata/netcdf4-python#844 and pydata/xarray#4100.

Edit: The test_Field_collapse tests are still segfaulting intermittently, but less frequently with this change. I think it may be because each from_array call is creating its own lock, so it's still possible for concurrent accesses to originate from the same file, e.g. f + f.copy(). The answer could be to create a common lock object and coordinate all access to a particular file around that.
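(For illustration, a minimal sketch of that shared-lock idea - the helper names are hypothetical and this is not cf's actual internals - where every dask array built from the same file is handed the same lock via dask.array.from_array:)

```python
import threading

import dask.array as da
import netCDF4

# One lock per file path, shared by every dask array built from that file
# (hypothetical registry - cf's real implementation may differ).
_FILE_LOCKS = {}


def file_lock(filename):
    """Return the lock shared by all accesses to filename."""
    return _FILE_LOCKS.setdefault(filename, threading.Lock())


def lazy_variable(filename, varname, chunks="auto"):
    """Wrap a netCDF variable as a dask array guarded by the file's lock."""
    nc = netCDF4.Dataset(filename)
    var = nc.variables[varname]
    # Because every array from this file shares one lock, graphs such as
    # f + f.copy() cannot read the file concurrently from two threads.
    return da.from_array(var, chunks=chunks, lock=file_lock(filename))
```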

Edit: Following this change, test_Field_collapse (with the concatenate fixes from another PR) didn't segfault 18 times in a row before I stopped counting - a record!

Edit: I deleted the three functions from creation.py because they're never used; they were only there because I thought they would be useful over a year ago, when I didn't really know what I was doing.

@davidhassell davidhassell added the dask Relating to the use of Dask label Dec 19, 2022
@davidhassell davidhassell added this to the 3.14.0 milestone Dec 19, 2022
@davidhassell davidhassell marked this pull request as draft December 19, 2022 09:34
@davidhassell davidhassell changed the title Dask reinstate netcdf lock dask: Reinstate netCDF lock Dec 19, 2022
@sadielbartholomew (Member)

@davidhassell am I good to start reviewing or might you still make some (pre-feedback) tweaks?

@sadielbartholomew (Member) left a review

It's difficult for me to test the aspect of reducing the observed segmentation faults since I haven't seen any for a few weeks (the darn things, only showing up when they are not wanted and never when they are!). However, the tests we know should pass still do, with no noticeable speed compromise (no need to profile here), and the lock restoration and other changes are all well justified, so I am happy. I've made a few minor comments; please merge when ready.

Review threads (resolved): cf/data/array/netcdfarray.py (outdated), cf/data/creation.py, cf/test/test_read_write.py
Co-authored-by: Sadie L. Bartholomew <[email protected]>
@davidhassell (Collaborator, Author)

It's difficult for me to test the aspect of reducing the observed segmentation faults since I haven't seen any for a few weeks

Yes - I was testing this in #532, which was reliably segfaulting.

but the tests we know should pass still do

Good enough for me!

@davidhassell (Collaborator, Author)

Hi @sadielbartholomew - I was thinking that maybe one lock per file might not be safe enough, and that one lock for all files is what's needed. I'll run a few tests with the code that's already here that combines two different netCDF files in the same dask graph, to see if they have a propensity to segfault ...
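(For reference, the kind of two-file test being described might look roughly like the sketch below - the file and variable choices are hypothetical, and it assumes the two fields are compatible for arithmetic:)

```python
import cf

# Lazily read fields from two different netCDF files, combine them in a
# single dask graph, and compute repeatedly to see whether concurrent
# reads from the two files trigger a segfault.
f = cf.read("file1.nc")[0]
g = cf.read("file2.nc")[0]

for _ in range(100):
    result = (f + g).array  # forces the combined graph to compute
```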

@sadielbartholomew (Member) commented Dec 19, 2022

I'll run a few tests with the code that's already here that combines two different netCDF files in the same dask graph, to see if they have a propensity to segfault ...

Right-o. Makes sense (I think). If you go for that in the end, maybe you can tackle it in a separate PR so the steps stay separate, as this aspect is done and we can keep the next bit self-contained?

@davidhassell (Collaborator, Author)

Hi Sadie - if we need a global lock, this will change the code here rather than add to it, so maybe I'll run the tests before merging ...

@davidhassell (Collaborator, Author)

Hi Sadie, I've tested with multiple files and the one-lock-per-file approach, and it seems safe, i.e. no segfaults.

However, I've been looking at how xarray deals with locks (https://github.com/pydata/xarray/blob/main/xarray/backends/netCDF4_.py, https://github.com/pydata/xarray/blob/main/xarray/backends/locks.py) and they appear to do a stricter job - applying a global lock for all files, as far as I can tell after a short look.

I'd like to investigate a bit more how xarray does things before settling this.

@sadielbartholomew (Member)

I'd like to investigate a bit more how xarray does things before settling this

Sounds wise - if they do it that way there is probably a good reason for it! I'll keep an eye out for any further comments here, but give me a shout if there's anything else I can do on other aspects of this PR.

@davidhassell (Collaborator, Author)

So, xarray certainly applies a global lock across all netCDF files, and this is the right thing to do because the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file (https://stackoverflow.com/questions/34906652/does-hdf5-support-concurrent-reads-or-writes-to-different-files).

That same stack overflow article also mentions that it is possible to compile HDF5 with MPI support but no thread safety, or with thread safety but no MPI support. However, the default build (like most folk get from conda) has neither.

So, I think that we probably do need a global lock for netCDF files, and this is an easy change: 77ea4fa
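(For illustration only - this is not the code from 77ea4fa - a module-level global lock in the spirit of that change might look like the following, using dask's SerializableLock so the lock survives graph serialisation:)

```python
from dask.utils import SerializableLock

import netCDF4

# A single lock shared by *all* netCDF/HDF5 reads, regardless of file,
# because HDF5 mutates global state that is not tied to any one file.
_NETCDF_LOCK = SerializableLock()


def read_chunk(filename, varname, index):
    """Read one chunk of a netCDF variable under the global lock."""
    with _NETCDF_LOCK:
        with netCDF4.Dataset(filename) as nc:
            return nc.variables[varname][index]
```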

@sadielbartholomew (Member)

So, I think that we probably do need a global lock for netCDF files, and this is an easy change: 77ea4fa

Nice, with thanks for clarifying on the reasons! So this is good for re-review now?

@davidhassell (Collaborator, Author)

Thanks, Sadie - yes, good to review

@sadielbartholomew (Member)

That same stack overflow article...

Excellent researching, by the way 😆 (joking about the standalone phrase above, which amused me - though actually that one is well linked to canonical sources, so it's reliable enough, as I guess you thought too).

@sadielbartholomew (Member) left a review
A justified and working update to instead use a global lock covering all netCDF files. The one typo flagged in the first review has re-appeared, however; if you can blitz that, this is all good and ready to merge. (May we never see another seg fault in the wild again! 🤞)

Review thread (resolved): cf/data/array/netcdfarray.py (outdated)
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Labels: dask (Relating to the use of Dask), high priority