This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

Sorting for time values before aggregation #906

Open
kthyng opened this issue Aug 22, 2017 · 11 comments

kthyng commented Aug 22, 2017

Hi all. I just figured out something that had been plaguing me for a week. We have ROMS model output files aggregated by THREDDS here, for example: http://barataria.tamu.edu:8080/thredds/dodsC/NcML/oof_archive_agg.

The aggregated output was coming out all jumbled, with some time indices working and others not. It turns out that the model output files had file modification time stamps that were out of chronological order. Touching each file (running "touch" on it) in the correct chronological order fixed the problem.
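
For reference, here is a minimal sketch of that workaround in Python; the archive directory and the roms_his_* filename pattern are taken from later in this thread, and the filenames are assumed to sort chronologically (they do for roms_his_YYYYMM.nc):

# Sketch of the workaround: re-touch the files in filename (i.e. chronological)
# order so that each file's modification time is later than the previous one's.
import os
import time
from pathlib import Path

archive = Path("/atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives")
files = sorted(archive.glob("roms_his_*.nc"))  # roms_his_YYYYMM.nc sorts chronologically

now = time.time()
for i, path in enumerate(files):
    # Oldest file gets the earliest mtime; each later file is one second newer.
    mtime = now - (len(files) - i)
    os.utime(path, (mtime, mtime))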

So my question is: would it be possible to have a "sort" step over the time dimension before the aggregation step?

Thanks.

cofinoa (Contributor) commented Aug 23, 2017

Dear @kthyng,

How are you aggregating the files?

What is your NcML file?

Regards

kthyng (Author) commented Aug 23, 2017

Roping in @skbaum since I'm a user but he set it up!

skbaum commented Aug 23, 2017

The filenames are of the form:

roms_his_201611.nc
roms_his_201612.nc
roms_his_201701.nc
roms_his_201702.nc
roms_his_201703.nc
roms_his_201704.nc
roms_his_201705.nc
roms_his_201706.nc
roms_his_201707.nc
roms_his_201708.nc

and the NcML is:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="ocean_time" type="joinExisting" recheckEvery="6 hour">
    <scan location="/atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives/" regExp="roms_his.*\.nc"/>
  </aggregation>
</netcdf>

Upon reading the aggregation page at:

https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/Aggregation.html

I find the following: "By default, the files are ordered by sorting on the filename." This makes me think that what happened shouldn't have happened, and that the time stamps shouldn't have had to be modified. Perhaps it's a subtle bug.

I also realize that the ordering can be forced by explicitly listing each filename within the NcML, but that would require editing the catalog.xml file every time a file is added.
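
As a rough sketch (not part of the current setup), that explicit list could be regenerated automatically rather than hand-edited. The <netcdf location="..."/> elements inside <aggregation> are the explicit joinExisting form from the NcML aggregation documentation; the output filename below is only a placeholder:

# Hypothetical helper: rebuild the NcML with an explicit, filename-sorted file
# list, so the aggregation order does not depend on modification times.
from pathlib import Path

archive = Path("/atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives")
files = sorted(archive.glob("roms_his_*.nc"))

lines = ['<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">',
         '  <aggregation dimName="ocean_time" type="joinExisting" recheckEvery="6 hour">']
lines += ['    <netcdf location="%s"/>' % f for f in files]
lines += ['  </aggregation>', '</netcdf>']

# Placeholder output path; point this at wherever the aggregation NcML lives.
Path("oof_archive_agg.ncml").write_text("\n".join(lines) + "\n")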

Steve

lesserwhirls (Collaborator) commented:

I wonder if this is an issue with the use of the regExp attribute or, perhaps, caching. Would it be possible to do the aggregation without regExp?

kthyng (Author) commented Aug 24, 2017

Do you mean listing out the files? If so, that would work, but it would have to be continually updated, since this is an operational system that adds new output over time. So that would not be ideal.

lesserwhirls (Collaborator) commented:

For example, would it be possible to do this:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="ocean_time" type="joinExisting" recheckEvery="6 hour">
    <scan location="/atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives/" suffix=".nc" /> 
  </aggregation>
</netcdf>

That is, is the regExp needed because there are files other than roms_his.* in the /atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives/ directory?

kthyng (Author) commented Aug 24, 2017

Oh I see. Yes, there are other *.nc files in the directory.

lesserwhirls (Collaborator) commented:

Ah, ok. Another question - how many time steps are in each file?

kthyng (Author) commented Aug 24, 2017

Output is hourly for the whole month, so about 30*24 = 720 time steps per file, depending on how long the month is.

lesserwhirls (Collaborator) commented:

So, having dug into some of our aggregation code, I can see that touching the files on disk caused a rescan of the collection (the code looks at the last-modified time on disk to determine whether a file has changed), which is probably why things started working. But the code is pretty complicated under the hood, unfortunately.

Just so I can understand a bit better here, it looks like you store data in monthly netCDF files, and those files are rechecked every 6 hours.

  • Are the netCDF files netCDF-3 or netCDF-4?
  • Are data added to the files throughout the day? If so, how is the update to the files on disk done?
  • What OS is controlling the RAID, and what filesystem are you using?
  • What OS and version of Java is the TDS running under?

Sorry for all the questions. The way the standard Java runtime library checks the last-modified time is OS dependent, and I've seen reports where certain combinations end up returning the wrong last-modified date. Also, depending on how the files are being updated (if they are being updated throughout the day), the last-modified time may not actually change (for example, if the file is held open as data are added).
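
In the meantime, a quick diagnostic sketch (assuming Python is available on the server, and using the archive path and filename pattern from earlier in the thread) to check whether the on-disk modification times are in the same order as the filenames:

# Diagnostic sketch: compare filename order with modification-time order, since
# the rescan logic keys off the last-modified time on disk.
from pathlib import Path

archive = Path("/atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives")
by_name = sorted(archive.glob("roms_his_*.nc"))
by_mtime = sorted(by_name, key=lambda p: p.stat().st_mtime)

if by_name == by_mtime:
    print("Modification times match filename (chronological) order.")
else:
    print("Modification times are out of order relative to the filenames:")
    for name_order, mtime_order in zip(by_name, by_mtime):
        if name_order != mtime_order:
            print("  by name: %s    by mtime: %s" % (name_order.name, mtime_order.name))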

kthyng (Author) commented Sep 10, 2017

Thanks for the detailed response. I'll ping @skbaum again for help on this.
