Skip to content
This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

Nc4Iosp preferred and always loaded by the dap4 controller if c library is present #1142

Closed
lesserwhirls opened this issue Sep 6, 2018 · 13 comments

Comments

@lesserwhirls
Copy link
Collaborator

So the core dumps we have been seeing on thredds-test happen while reading netCDF-4 files through the c library, even when threddsConfig.xml explicitly says not to use the netCDF-C library for reading. I've confirmed this by enabling various kinds of logging and pushing the server until it core dumps.

During TDS initialization, threddsConfig.xml is checked to see if reading via the c library is allowed prior to loading the Nc4Iosp (default is false):

if (useForReading) {
if (Nc4Iosp.isClibraryPresent()) {
try {
// Registers Nc4Iosp in front of all the other IOSPs already registered in NetcdfFile.<clinit>().
// Crucially, this means that we'll try to open a file with Nc4Iosp before we try it with H5iosp.
NetcdfFile.registerIOProvider(Nc4Iosp.class);
} catch (IllegalAccessException | InstantiationException e) {
startupLog.error("TdsInit: Unable to register IOSP: " + Nc4Iosp.class.getCanonicalName(), e);
}
} else {
startupLog.warn("TdsInit: In threddsConfig.xml, 'Netcdf4Clibrary.useForReading' is 'true' but the native C " +
"library couldn't be found on the system. Falling back to the pure-Java reader.");
}
}

The same happens in netCDF-Java at runtime:

if (useForReading) {
try {
// Registers Nc4Iosp in front of all the other IOSPs already registered in NetcdfFile.<clinit>().
// Crucially, this means that we'll try to open a file with Nc4Iosp before we try it with H5iosp.
NetcdfFile.registerIOProvider(nc4IospClassName);
} catch (IllegalAccessException | InstantiationException | ClassNotFoundException e) {
errlog.append(String.format("Could not register IOSP '%s': %s%n", nc4IospClassName, e.getMessage()));
}
}

That said, it looks like there is code in the Dap4Controller that registers the Nc4Isop without any check at all:

/**
* Initialize servlet/controller
*/
@Override
public void
initialize()
{
super.initialize();
try {
// Always prefer Nc4Iosp over HDF5
NetcdfFile.iospDeRegister(ucar.nc2.jni.netcdf.Nc4Iosp.class);
NetcdfFile.registerIOProviderPreferred(ucar.nc2.jni.netcdf.Nc4Iosp.class,
ucar.nc2.iosp.hdf5.H5iosp.class
);
} catch (Exception e) {
DapLog.warn("Cannot load ucar.nc2.jni.netcdf.Nc4Iosp");
}
}

This is why we end up reading through the netCDF-C library, even when we set the configuration to not do that.

I think what we are seeing on thredds-test is a crash due to multiple threads using the netCDf-C library to read different files simultaneously. Given the popularity of netCDF-4 products from GOES-16 and GOES-17, I am surprised we don't see this more often.

@DennisHeimbigner, @WardF - what do you think? Is reading files through the netCDF-C library necessary for DAP4? It seems like we would need to lock while reading through the C library, like we do when writing through the C library, and that seems like it would kill performance when dealing with netCDF files across the board.

@DennisHeimbigner
Copy link
Contributor

Hah! Caught me :-)
But why do you think that the library is being accessed simultaneously?
No one uses DAP4, and the netcdf library access is still serialized.
To me the question of why the HDF5 crashes are occurring is still open.

@lesserwhirls
Copy link
Collaborator Author

There is also code that does this same thing in the dap4 module (CDMDSP), but it's never used:

/**
* Make sure that NC4Iosp is registered and library loaded
* (Is this really the right place for this?)
*/
static public void
loadNc4Iosp()
throws DapException
{
if(nc4loaded) return;
nc4loaded = true;
if(!NetcdfFile.iospRegistered(NC4CLASS)) {
try {
// register before H5Iosp
NetcdfFile.registerIOProvider(NC4CLASS, false);
Nc4Iosp.setLibraryAndPath(null, null); // use defaults
} catch (Throwable e) {
throw new DapException(e.getMessage(), e.getCause());
}
}
}

Maybe we should remove it to make sure no one accidentally uses it?

@DennisHeimbigner
Copy link
Contributor

Agreed, I will take care of it. (THis is version 5, right?).

@lesserwhirls
Copy link
Collaborator Author

Ah, that's right - you synchronized the whole library. Yep - this is all version 5.

We know for sure that the core dump happens when reading through the c library. The question in my mind is why are we reading through the c library when the default setting is "do not use the c library for reading". It seems to me that the dap4 controller only needs to be initialized for that setting to be ignored across the board (once loaded, the IOSP will be used everywhere).

At the very least, we should only read through the c library if the option is explicitly set in the threddsConfig file, and that's handled by TdsInit, so we don't need to load the iosp in the dap4 controller (again, unless dap4 requires reading netCDF files through the c library).

@WardF and I will look through the core file to see if we can get an idea of where things are going wrong.

@DennisHeimbigner
Copy link
Contributor

Agreed, mostly. can you tell what function in the C library is being called
and where?

@DennisHeimbigner
Copy link
Contributor

Ward and I are supposed to meet at one tomorrow.
If you are around, join us.

@lesserwhirls
Copy link
Collaborator Author

Ok, here is some more info. Everything goes well on a clean startup of the server. We read and write netcdf-4 data, no problem, like a champ. According to the logs, things are chugging along, good times.

Then, we get this showing up in catalina.out:

dap4.core.util.DapException: Not found: satellite/goes16/GOES16/Products/P99/20180817/GOES16_P99_20180817_0800.nc4
        at thredds.server.dap4.Dap4Controller.getResourcePath(Dap4Controller.java:158)
        at dap4.servlet.DapController.doDMR(DapController.java:305)
        at dap4.servlet.DapController.handleRequest(DapController.java:230)
        at thredds.server.dap4.Dap4Controller.handleRequest(Dap4Controller.java:73)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Someone hits dap4 and it's the only dap4 request in the logs - this causes the Dap4Controller initialize() method to get called (only happens if a dap4 request is made, not on startup of the TDS), thus triggering the Nc4Iosp to get loaded with precedence over the Hdf5Iosp. Now, if we look at the server logs on either side of thedap4 requests that causes the dap4.core.util.DapException, we see that we are in the middle of a slew of requests that involve reading netCDF-4 data.

I can now predictably make this happen on thredds-test, and can only get a core dump once I've made a dap4 request, thus triggering the reading via the c library.

So now, here is my question: is the real issue due to something going wrong when switching between the Hdf5Iosp to the Nc4Isop for reading during run time, while actively reading netCDF-4 files, or is there bug in netCDF-C / HDF5 that's triggered by something within the netCDF-4 files being read?

@lesserwhirls
Copy link
Collaborator Author

Ok, I can run the TDS with netCDF-C library reading enabled from the start, and I still get core dumps. Probably not the switching between the IOSPs that cause issues.

@lesserwhirls
Copy link
Collaborator Author

...and it's not always on the same netCDF-4 file.

@DennisHeimbigner
Copy link
Contributor

I am amazed; someone is actually using DAP4?
Ok, let me do a general cleanup.

@lesserwhirls
Copy link
Collaborator Author

Here is an interesting case from the Nc4Iosp debug logs:

2018-09-06T12:56:21.288-0600 [    242815][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: open /data/ldm/pub/native/satellite/GOES/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b
-RadC-M3C02_G16_s20182491852173_e20182491854546_c20182491854586.nc
2018-09-06T12:56:21.318-0600 [    242845][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: open /data/ldm/pub/native/satellite/GOES/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b
-RadC-M3C02_G16_s20182491852173_e20182491854546_c20182491854586.nc id=9830400 format=3
2018-09-06T12:56:21.319-0600 [    242846][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension y = 6000; (0)
2018-09-06T12:56:21.319-0600 [    242846][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension x = 10000; (1)
2018-09-06T12:56:21.319-0600 [    242846][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension number_of_time_bounds = 2; (2)
2018-09-06T12:56:21.319-0600 [    242846][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension band = 1; (3)
2018-09-06T12:56:21.319-0600 [    242846][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension number_of_image_bounds = 2; (4)
2018-09-06T12:56:21.319-0600 [    242846][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension num_star_looks = 24; (5)
2018-09-06T12:56:21.319-0600 [    242846][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: open /data/ldm/irma/sfc/METAR_20170910_0800.nc
2018-09-06T12:56:21.329-0600 [    242856][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: open /data/ldm/irma/sfc/METAR_20170910_0800.nc id=9764864 format=3
2018-09-06T12:56:21.329-0600 [    242856][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension sample = UNLIMITED;   // (12964 currently) (0)
2018-09-06T12:56:21.329-0600 [    242856][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: add Dimension station = UNLIMITED;   // (4610 currently) (1)
2018-09-06T12:56:21.331-0600 [    242858][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :featureType = "timeSeries"
2018-09-06T12:56:21.331-0600 [    242858][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :Conventions = "CF-1.7"
2018-09-06T12:56:21.331-0600 [    242858][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :institution = "UCAR/Unidata"
2018-09-06T12:56:21.331-0600 [    242858][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :source = "METAR Reports"
2018-09-06T12:56:21.331-0600 [    242858][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :title = "Reformatted METAR observations"
2018-09-06T12:56:21.331-0600 [    242858][     730] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp: nvars= 17
2018-09-06T12:56:21.331-0600 [    242858][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :naming_authority = "gov.nesdis.noaa"
2018-09-06T12:56:21.331-0600 [    242858][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :Conventions = "CF-1.7"
2018-09-06T12:56:21.331-0600 [    242858][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :Metadata_Conventions = "Unidata Dataset Discovery v1.0"
2018-09-06T12:56:21.331-0600 [    242858][     777] DEBUG ucar.nc2.jni.netcdf.Nc4Iosp:  add Global Attribute :standard_name_vocabulary = "CF Standard Name Table (v25, 05 Ju
ly 2013)"

In this case, we open and start reading from a GOES 16 netCDF-4 file, then open and read from a surface data netCDF-4 file. However, based on second number in square brackets (730 or 777), it would seem that we are read from the two sources concurrently?

@DennisHeimbigner
Copy link
Contributor

You have to be careful about what that means. Remember that each call to the JNA
is serialized. But since accessing a dataset takes many separate calls to the JNA,
the two "requests" can appear to overlap when in fact their calls to the JNA are being
interleaved with each call completing before another can occur.
Detecting true failure of the serialization
is almost impossible at the resolution of the logs.

@lesserwhirls
Copy link
Collaborator Author

Ah, yes - I think that's the case here (that is, not concurrent access)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants