This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

NetCDF Java library fails intermittently when reading certain types of NCML aggregations #276

Open
clifford-harms opened this issue Nov 10, 2015 · 3 comments


@clifford-harms

clifford-harms commented Nov 10, 2015

The NetCDF Java library fails intermittently when reading certain types of NCML aggregations. As an example, consider the following NCML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <variable name="runtime" type="double">
    <attribute type="string" name="_CoordinateAxisType" value="RunTime"/>
    <attribute type="string" name="units" value="hours since 2012-10-25 00:00:00"/>
    <attribute type="string" name="time_origin" value="2012-10-25 00:00:00"/>
    <values>0 24</values>
  </variable>
  <aggregation type="joinNew" dimName="runtime">
    <netcdf coordValue="0" location="ncom-relo-mayport_u_miw-t000.nc"/>
    <netcdf coordValue="24">
      <aggregation type="joinExisting" dimName="time">
        <netcdf location="ncom-relo-mayport_26_u_miw-t001.nc"/>
        <netcdf location="ncom-relo-mayport_26_u_miw-t000.nc"/>
      </aggregation>
    </netcdf>
    <variableAgg name="salinity"/>
    <variableAgg name="surf_temp_flux"/>
    <variableAgg name="surf_wnd_stress_gridy"/>
    <variableAgg name="surf_wnd_stress_gridx"/>
    <variableAgg name="surf_roughness"/>
    <variableAgg name="surf_salt_flux"/>
    <variableAgg name="tau"/>
    <variableAgg name="surf_atm_press"/>
    <variableAgg name="water_u"/>
    <variableAgg name="water_v"/>
    <variableAgg name="water_temp"/>
    <variableAgg name="surf_solar_flux"/>
    <variableAgg name="surf_el"/>
    <variableAgg name="time"/>
  </aggregation>
</netcdf>

Given this aggregation, the Java code below will fail... sometimes:

@Test
public void testNestedNCMLReads() throws IOException {
   File ncml = new File("/tmp/data/aggregation.ncml"); // the NCML referenced above
   try (NetcdfDataset ds =
            NetcdfDataset.acquireDataset(ncml.toString(), true, null)) {
      Variable var = ds.findVariable("time");
      double[] times = (double[]) var.read().get1DJavaArray(double.class);
      System.out.println(Arrays.toString(times));
   }
}

WHEN it fails, the following stack trace is produced:

java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at ucar.ma2.Array.arraycopy(Array.java:300)
        at ucar.nc2.ncml.AggregationOuterDimension.reallyRead(AggregationOuterDimension.java:382)
        at ucar.nc2.dataset.VariableDS._read(VariableDS.java:507)
        at ucar.nc2.Variable.read(Variable.java:709)
        ....

If the code does not fail, the following output is produced:

 [112344.0, 112369.0, 112368.0, 0.0]

This output is, I believe, incorrect: there should be only three available times based on the NCML above.
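For what it's worth, the stack trace suggests the outer aggregation sizes its result array from one count of coordinates while the member reads return another. Here is a minimal, library-free sketch of that failure mode (all names are hypothetical, not the actual library internals):

```java
import java.util.Arrays;

// Sketch only: mimics an outer aggregation copying each member dataset's
// slice into a result array preallocated from the *declared* lengths.
class AggregationCopySketch {

    // declaredLengths: what the outer aggregation believes each member contributes
    // actualSlices:    what each member actually returns when read
    static double[] aggregate(int[] declaredLengths, double[][] actualSlices) {
        int total = Arrays.stream(declaredLengths).sum();
        double[] result = new double[total];
        int destPos = 0;
        for (double[] slice : actualSlices) {
            // If a slice overruns the space left, arraycopy throws
            // ArrayIndexOutOfBoundsException (the intermittent failure);
            // if the result is oversized, stale zeros remain (the bogus 4th value).
            System.arraycopy(slice, 0, result, destPos, slice.length);
            destPos += slice.length;
        }
        return result;
    }

    public static void main(String[] args) {
        // Outer joinNew believes each runtime contributes 2 steps, but the
        // first member really holds only 1: three values plus a stale zero.
        double[] r = aggregate(new int[]{2, 2},
                new double[][]{{112344.0}, {112369.0, 112368.0}});
        System.out.println(Arrays.toString(r));
        // prints [112344.0, 112369.0, 112368.0, 0.0]

        // If it instead believes each member contributes 1 step, the copy overruns.
        try {
            aggregate(new int[]{1, 1},
                    new double[][]{{112344.0}, {112369.0, 112368.0}});
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("overran result array");
        }
    }
}
```

Which of the two counts wins on any given read could plausibly depend on timing, which would match the intermittent behavior.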

If the unit test above is modified to run the open-read process multiple times in rapid succession:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import org.junit.Test;
import ucar.nc2.Variable;
import ucar.nc2.dataset.NetcdfDataset;

public class NestedAggregationTest {

   @Test
   public void testAggregatedNCMLReads() throws IOException {
      int iterations = 100, failed = 0, passed = 0;
      for (int i = 0; i < iterations; ++i) {
         File ncml = new File("/tmp/data/aggregation.ncml");
         try (NetcdfDataset ds = NetcdfDataset.acquireDataset(ncml.toString(), true, null)) {
            Variable var = ds.findVariable("time");
            double[] times = (double[]) var.read().get1DJavaArray(double.class);
            System.out.println(Arrays.toString(times));
            ++passed;
         } catch (Exception e) {
            ++failed;
            e.printStackTrace(System.err);
         }
      }
      System.out.println(failed + " of " + iterations + " iterations failed");
   }
}

Then I see a roughly 50% failure rate (the stack trace above), with the remaining reads succeeding but apparently producing incorrect results. The problem is also present in ToolsUI, and I have tried NetCDF-Java versions 4.5.5 through 4.6.x.

I will attach the data I am using to reproduce these results.

@clifford-harms
Author

tar.gz containing the aggregation referenced above and the three .nc files that it aggregates:

https://docs.google.com/uc?id=0B_bVl3gTeT9RS0QyRWFXUUIyVVE&export=download

@JohnLCaron
Collaborator

<aggregation type="joinNew" dimName="runtime">
    <netcdf  coordValue="0" location="ncom-relo-mayport_u_miw-t000.nc"/>
    <netcdf coordValue="24">
      <aggregation type="joinExisting" dimName="time">
        <netcdf location="ncom-relo-mayport_26_u_miw-t001.nc"/>
        <netcdf location="ncom-relo-mayport_26_u_miw-t000.nc"/>
      </aggregation>
    </netcdf>

ncom-relo-mayport_u_miw-t000.nc has only 1 time coordinate, but the inner aggregation has 2, so these members are not homogeneous in the sense that NcML aggregation requires.
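To illustrate (filenames hypothetical): joinNew expects each member to contribute one slice of identical shape per coordinate value, something like

```xml
<!-- sketch only: every member must have the same dimensions -->
<aggregation type="joinNew" dimName="runtime">
  <netcdf coordValue="0"  location="run-at-0.nc"/>
  <netcdf coordValue="24" location="run-at-24.nc"/> <!-- same shape as run-at-0.nc -->
</aggregation>
```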

Could you explain more about what you are trying to do?

@clifford-harms
Author

The data I attached is a test case for a scenario I am trying to handle. I have several thousand NetCDF files (some CF-compliant, some not), most of which belong to logical datasets that are broken up along a time or Z axis into groups of 30-50 files, and each group must be aggregated into a single 'logical' dataset (I believe this is a fairly common use case). These files are updated daily, but due to the amount of data involved, as well as other environmental factors, the updates arrive sporadically over a span of about 24 hours.

So what I am trying to do is this: as the files of an aggregated dataset are slowly replaced with newer versions, add those new versions to the aggregations they belong to, while ensuring that the new data can be differentiated within the aggregation by its creation time (be it a model run time, a production time, or whatever). This is where joining files along the joinNew dimension comes in (in this example, 'runtime'): the data creation time does not exist in the datasets as a coordinate variable, and in some cases is not even indicated in the global attributes.

Ultimately, once all of the files for an aggregated dataset have been updated, the aggregation contains files that all have the same data creation or run time, until the next update starts.

You seem to be indicating that I cannot perform a 'joinNew' aggregation between datasets whose coordinate variables have different sizes? If that is the case, and I missed it in the documentation somewhere, then what about aggregating the files with a joinNew first, and then aggregating those aggregations with a joinExisting along the time or Z axis?
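Something like this sketch, that is (file names hypothetical; I have not verified that the library accepts this nesting):

```xml
<aggregation type="joinExisting" dimName="time">
  <netcdf>
    <aggregation type="joinNew" dimName="runtime">
      <netcdf coordValue="0"  location="old-run-t000.nc"/>
      <netcdf coordValue="24" location="new-run-t000.nc"/>
    </aggregation>
  </netcdf>
  <netcdf>
    <aggregation type="joinNew" dimName="runtime">
      <netcdf coordValue="0"  location="old-run-t001.nc"/>
      <netcdf coordValue="24" location="new-run-t001.nc"/>
    </aggregation>
  </netcdf>
</aggregation>
```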

There is still the issue, though, of the random behavior (an exception on some reads, an array of partly valid values on others), which suggests a concurrency problem. If the read worked consistently, instead of only half the time, that would still be useful to me, as my code could easily determine which values in the returned array were valid.

At any rate, thanks for responding so quickly.
