Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid calling non-public h5fd_xxx_init() functions #967

Merged
merged 8 commits into from
Jun 18, 2022
Merged

Avoid calling non-public h5fd_xxx_init() functions #967

merged 8 commits into from
Jun 18, 2022

Conversation

david-macmahon
Copy link
Contributor

The h5fd_xxx_init() functions are not part of the HDF5 public API. A
method of retrieving the hid_t value of drivers that uses
public/supported API calls was suggested in a comment on a related
github issue and is implemented in this commit. For more details see:

HDFGroup/hdf5#1809 (comment)

The h5fd_xxx_init() functions are not part of the HDF5 public API.  A
method of retrieving the hid_t value of drivers that uses
public/supported API calls was suggested in a comment on a related
github issue and is implemented in this commit.  For more details see:

HDFGroup/hdf5#1809 (comment)
@david-macmahon
Copy link
Contributor Author

Fixes #960

@david-macmahon
Copy link
Contributor Author

OK, something weird is happening to the mmap.jl test. Even if I revert this commit and just add the two lines marked with # NEW to the existing __init__() function, then the mmap test breaks.

fapl = HDF5.FileAccessProperties(fclose_degree=:default) # NEW
close(fapl) # NEW
DRIVERS[API.h5fd_core_init()] = Core # EXISTING
DRIVERS[API.h5fd_sec2_init()] = POSIX # EXISTING

I don't understand why that would break the mmap.jl test.

@david-macmahon
Copy link
Contributor Author

Even putting the # NEW lines at the end of HDF5.Drivers.__init__() breaks the mmap.jl test:

function __init__()
    DRIVERS[API.h5fd_core_init()] = Core
    DRIVERS[API.h5fd_sec2_init()] = POSIX
    @require MPI="da04e1cc-30fd-572f-bb4f-1f8673147195" include("mpio.jl")
    fapl = HDF5.init!(HDF5.FileAccessProperties()) # NEW
    close(fapl) # NEW
end

Is there some other part of the HDF5 libraries that need to be initialized before creating a file access property list?

@mkitti
Copy link
Member

mkitti commented Jun 15, 2022

Note that there is also the MPIO driver:

DRIVERS[API.h5fd_mpio_init()] = MPIO

@mkitti
Copy link
Member

mkitti commented Jun 15, 2022

My suspicion is that you accidentally initialized the HDF5 library before we could disable file locking in HDF5.__init__().

HDF5.jl/src/HDF5.jl

Lines 559 to 561 in cfa5a24

if !haskey(ENV, "HDF5_USE_FILE_LOCKING")
ENV["HDF5_USE_FILE_LOCKING"] = "FALSE"
end

This is why I pointed out the potential problem in #959.

@david-macmahon
Copy link
Contributor Author

Thanks for the pointer to the MPIO driver. And you are exactly right about the file locking messing with the mmap test. I had just found that out, but you beat me to it!

@mkitti
Copy link
Member

mkitti commented Jun 15, 2022

I see two paths forward:

  1. Create a new method to initialize DRIVERS and call that at the end of HDF5.__init__().
  2. Make this lazy, and set the key for DRIVERS when someone invokes set_driver!.

@david-macmahon
Copy link
Contributor Author

OK, yeah, I think the lazy approach is the way to go. I tried adding similar fapl logic to the MPIO driver using this:

# Check whether the HDF5 libraries were compiled with parallel support.
try
    fapl = HDF5.init!(HDF5.FileAccessProperties())
    comm = info = reinterpret(H5MPIHandle, 0)
    API.h5p_set_fapl_mpio(fapl, comm, info)
    DRIVERS[API.h5p_get_driver(fapl)] = MPIO
    close(fapl)
    HDF5.HAS_PARALLEL[] = true
catch e
    @show e
end

but that resulted in an MPI error: "The MPI_Comm_dup() function was called before MPI_INIT was invoked." It's obviously best to let the user call MPI_INIT, so the non-lazy approach seems fatally flawed in this case.

@david-macmahon
Copy link
Contributor Author

One issue with the lazy approach is that a FileAccessProperties object can be created without calling set_driver!, but h5p_get_driver will still return a driver (i.e. SEC2/POSIX) for it. So maybe the DRIVERS entries for the sec2 and core drivers could be set non-lazily, but then set the entry for mpio lazily in set_driver!(fapl, mpio::MPIO). The problem with lazy initializing the mpio entry is that mpio.jl currently uses a non-throwing call to h5fd_mpio_init() as the way to detect that HAS_PARALLEL should be set to true (i.e. only sets it to true if the function call does not throw an exception). If we no longer call h5fd_mpio_init() then we will need some other way to determine whether the HDF5 libraries have parallel support.

@mkitti
Copy link
Member

mkitti commented Jun 15, 2022

Perhaps the only driver we need to initialize is driver is the default one. That may be platform dependent. This may need to be done in HDF5.__init__.

Unless the user has already set `HDF5_USE_FILE_LOCKING` in the
environment, set it to `FALSE` to avoid causing problems with mmap'ing.
This needs to be done before the first call to the HDF5 APIs.
Instead of detecting parallel HDF5 support by whether `h5fd_mpio_init()`
throws an exception, parallel support is detected by finding the
`H5Pset_fapl_mpio` symbol in any loaded library whose name matches
`r"hdf5"i`.  The documentation for `H5Pset_fapl_mpio()` states:

> H5Pset_fapl_mpio() is available only in the parallel HDF5 library

The detection has also been moved into `Drivers.__init__()`
and the inclusion of `mpio.jl` (upon loading of the MPI package) will
only happen if the HDF5 libraries have parallel support.
@simonbyrne
Copy link
Collaborator

I believe SEC2 is the default on all platforms.

As for checking HAS_PARALLEL, we might be able to just check if the symbol exists?

@mkitti
Copy link
Member

mkitti commented Jun 15, 2022

You probably should just do using Libdl.

Comment on lines 101 to 110
loaded_libs = filter(s->match(r"hdf5"i, s) !== nothing, Libc.Libdl.dllist())
for libname in loaded_libs
has_mpio = Libc.Libdl.dlopen(libname) do lib
Libc.Libdl.dlsym(lib, :H5Pset_fapl_mpio; throw_error=false) !== nothing
end
if has_mpio
HDF5.HAS_PARALLEL[] = true
break
end
end
Copy link
Member

@mkitti mkitti Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
loaded_libs = filter(s->match(r"hdf5"i, s) !== nothing, Libc.Libdl.dllist())
for libname in loaded_libs
has_mpio = Libc.Libdl.dlopen(libname) do lib
Libc.Libdl.dlsym(lib, :H5Pset_fapl_mpio; throw_error=false) !== nothing
end
if has_mpio
HDF5.HAS_PARALLEL[] = true
break
end
end
API.Libdl.dlopen(API.libhdf5) do lib
HDF5.HAS_PARALLEL[] = Libdl.dlsym(lib, :H5Pset_fapl_mpio; throw_error=false) !== nothing
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could also just reference API.Libdl

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work with a custom HDF5 library either

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised it to use API.libhdf5handle[] instead of HDF5_jll.libhdf5_handle.

Why are we not using the standard JLL override mechanism?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that's not going to work either. HDF5.API.libhdf5 should work though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nonetheless, I think we should make sure to use the exact same shared library that the API module loaded rather than doing a search through the loaded libraries.

@david-macmahon
Copy link
Contributor Author

I pushed that latest commit before I saw the "use HDF5.API.libhdf5" comment. I've made that change and will push once the current CI run finishes.

@david-macmahon
Copy link
Contributor Author

Not sure why 57c0e52 didn't get a green check, but CI did run after that was pushed. I think this is "done" other than possibly changing using Libdl: dlopen, dlsym to API.Libdl.dlopen etc. I'm agnostic about that.

@simonbyrne
Copy link
Collaborator

Why are we not using the standard JLL override mechanism?

The original Overrides.toml approach (https://docs.binarybuilder.org/stable/jll/#Overriding-the-artifacts-in-JLL-packages) is way too restrictive to be useful (it requires that the libraries have the same .so version as the one in the jll)

The Preferences.jl approach (https://docs.binarybuilder.org/stable/jll/#Overriding-specific-products) should work though

Comment on lines +95 to +96
API.h5p_set_fapl_core(fapl, 4096, false)
DRIVERS[API.h5p_get_driver(fapl)] = Core
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could lazily initialize Core as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way seemed closer to the original, so it's a slightly smaller change, but I don't think there would be a problem lazily initializing Core as well if that's the consensus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps all the set_driver! methods should uniformly register their hid_t in DRIVERS. From this __init__, we just call set_driver!(fapl, POSIX()).

@mkitti
Copy link
Member

mkitti commented Jun 16, 2022

I would also be willing to merge this as is and revise later.

end

fapl = HDF5.init!(HDF5.FileAccessProperties())
API.h5p_set_fapl_core(fapl, 4096, false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does 4096 come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nowhere in particular (other than typical Linux page size). TBH, I suspect this could be set to 0 here because it's only used to setup fapl, which in turn is only used to get the ID of the Core driver. This fapl is never used to actually create anything. FWIW, I think this value defaults to 8192 elsewhere in HDF5.jl, but it's not clear where that value comes from.

DRIVERS[API.h5fd_sec2_init()] = POSIX
@require MPI="da04e1cc-30fd-572f-bb4f-1f8673147195" include("mpio.jl")
# disable file locking as that can cause problems with mmap'ing
if !haskey(ENV, "HDF5_USE_FILE_LOCKING")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason we need to add this check here as well? Is it related to Drivers being in its own module?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we really need to do is address #959. Basically what happens is that the Drivers module is initialized before the HDF5 module. However, if we create a FAPL during Drivers initialization, then the environment variable no longer has the desired effect. Then memory mapping breaks before the files are now locked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks that makes sense.

@musm
Copy link
Member

musm commented Jun 17, 2022

For future work, I think we should also try to get a build with HDF5 1.13 in our CI.

@mkitti
Copy link
Member

mkitti commented Jun 18, 2022

I would like to merge this and setup another pull request for revisions.

@mkitti
Copy link
Member

mkitti commented Jun 18, 2022

The original Overrides.toml approach (https://docs.binarybuilder.org/stable/jll/#Overriding-the-artifacts-in-JLL-packages) is way too restrictive to be useful (it requires that the libraries have the same .so version as the one in the jll)

The Preferences.jl approach (https://docs.binarybuilder.org/stable/jll/#Overriding-specific-products) should work though

Perhaps we should deprecate deps.jl and recommend the Preferences.jl approach?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants