dask: Reinstate netCDF lock #531
Conversation
@davidhassell am I good to start reviewing, or might you still make some (pre-feedback) tweaks?
It's difficult for me to test the reduction in observed segmentation faults, since I haven't seen any for a few weeks (the darn things, only showing up when they are not wanted and never when they are!). However, the tests we know should pass still do, with no noticeable speed compromise (no need to profile here), and the lock restoration and other changes are all well-justified, so I am happy. I've made a few minor comments; please merge when ready.
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Yes - I was testing this in #532 which was reliably seg faulting
Good enough for me!
Hi @sadielbartholomew - I was thinking that maybe one lock per file might not be safe enough, and that one lock for all files is what's needed. I'll run a few tests with the code that's already here that combines two different netCDF files in the same dask graph, to see if they have a propensity to seg fault ...
Right-o. Makes sense (I think). If you go for that in the end, maybe you can attack that in a separate PR to separate out the steps here, as this aspect is done and we can keep the next bit self-contained?
Hi Sadie - If we need a global lock, this will change code here, rather than add to it, so maybe I'll run the tests before merging ...
Hi Sadie, I've tested with multiple files and the one-lock-per-file approach, and it seems safe, i.e. no seg faults. However, I've been looking at how xarray deals with locks (https://github.com/pydata/xarray/blob/main/xarray/backends/netCDF4_.py, https://github.com/pydata/xarray/blob/main/xarray/backends/locks.py) and they appear to do a stricter job - applying a global lock for all files, as far as I can tell after a short look. I'd like to investigate a bit more how xarray does things before settling this.
Sounds wise; if they do it that way there is probably a good reason to! I'll keep an eye out for any further comments here, although give me a shout if there's anything I can do RE aspects relating to this PR otherwise.
So, that same Stack Overflow article also mentions that it is possible to compile HDF5 with MPI support but no thread safety, or with thread safety but no MPI support. However, the default (like most folk get from conda) is neither. So, I think that we probably do need a global lock for netCDF files, and this is an easy change: 77ea4fa
Nice, with thanks for clarifying on the reasons! So this is good for re-review now?
Thanks, Sadie - yes, good to review |
Excellent researching, by the way 😆 (joking about the standalone phrase above, which amused me, though actually that one is well-linked to canonical sources so it's reliable enough, as I guess you thought too)
Justified and working update to instead use a global lock covering all netCDF files. However, the one typo flagged in the first review has re-appeared; if you can blitz that, this is all good and ready to merge. (May we never see another seg fault in the wild again! 🤞)
Co-authored-by: Sadie L. Bartholomew <[email protected]>
It turns out that removing the concurrency lock for opening netCDF files is not (yet) safe in the netCDF-C library, so we should put it back for now. I had thought that separate opens/closes would be OK, but not so. This may be fixed in a future version of netCDF-C. See, e.g. Unidata/netcdf4-python#844 and pydata/xarray#4100.
Edit: The `test_Field_collapse` tests are still segfaulting intermittently, but less frequently with this. I think it may be due to the fact that each `from_array` call is creating its own lock, so it's still possible for concurrent accesses that originate from the same file, e.g. `f + f.copy()`. The answer could be to create a common lock object and coordinate all access to a particular file around that.

Edit: Following this change, `test_Field_collapse` (with the concatenate fixes from another PR) didn't segfault 18 times in a row before I stopped counting - record!

Edit: I deleted the three functions from `creation.py` because they're never used, and were only there because I thought they would be useful over a year ago, when I didn't really know what I was doing.