Refactor GVRS to improve metadata and API #10

Closed
gwlucastrig opened this issue Nov 24, 2021 · 20 comments

@gwlucastrig
Owner

gwlucastrig commented Nov 24, 2021

I am working on a significant revision to the GVRS API and file-format. I plan to submit changes by the end of 2021. The changes include:

  1. Better support for multi-variable data (such as wind vectors and ocean currents). Support for multiple data types within a single file.
  2. Much better support for metadata. Ability to completely support TIFF tags.
  3. Introduction of checksums for internal error detection in data.
  4. Better consistency and predictability for the API.
  5. More thorough unit tests.

Unfortunately, these changes will be incompatible with earlier versions of GVRS. In particular, older GVRS files will be inaccessible. If you have built up a collection of GVRS files, please let me know so that we can figure out the easiest way to transition to the new format.

Ordinarily, I try to avoid breaking compatibility across revisions. But since GVRS is still in pre-alpha development, it seems like the most efficient way to move forward.

@gwlucastrig gwlucastrig self-assigned this Nov 24, 2021
@gwlucastrig gwlucastrig added the enhancement New feature or request label Nov 24, 2021
@gwlucastrig
Owner Author

Tonight, I pushed a large number of changes to refactor elements of the GVRS API and file format as described in the text above.

Unfortunately, the new API and file format break compatibility with earlier versions of GVRS (again, as described above). I believe that these items are now stable and should not have to change in the foreseeable future (barring any unanticipated setbacks).

For the next couple of weeks, I plan on refining some of the internals and adding new features and JUnit tests. When these are complete, I will be treating the code as Release 1.0 of Gridfour and making the initial submission of the Gridfour API to the Maven Central Repository.

However, before I make any further code changes, I will be updating the wiki to reflect the new API.

Beyond that, there are a few areas that I would still like to address:

  1. Add a new GridfourFileSpecification constructor that automatically computes tile-size. I feel that details of choosing an optimal tile size are fairly subtle and that most developers won't want to bother with them unless they have specific requirements.
  2. Improve the internal tile-indexing API to better support data sets where the overall raster grid is not fully populated. Also support GVRS files that are larger than 32 gigabytes (the current limit).
  3. The GvrsMetadata class is currently missing method calls to support some of its data types. Finish coding these.
  4. Write a set of JUnit tests for GVRS Metadata.

Once this work is complete, the next major undertaking will be to provide documentation on the GVRS file format.

As always, I welcome any suggestions for ways to make the GVRS API more effective, efficient, or easier to understand.

Thanks.

Gary

This was referenced Dec 7, 2021
@gwlucastrig
Owner Author

One thing that may need revision is the creation of the UUID to uniquely identify a GVRS product. The idea of a UUID is that it would provide a unique identifier for each and every GVRS "product".

Currently, the UUID is established when a GvrsFileSpecification is constructed. A new GVRS file is established by calling the GvrsFile constructor and passing a file specification into it. The constructor opens up a file on disk and writes header information to it.

The problem with this approach is that the UUID is tied to the file-specification object, not the file object. And there is nothing preventing an application from creating multiple, different GVRS files using the same specification.

Therefore, I am looking at the possibility of moving the logic that establishes a UUID into the GvrsFile constructor and taking it out of the GvrsFileSpecification class.

In the current implementation, the UUID is used for coordinating a GVRS file with its associated index file. The index is a "side-car" file that can be written when a GVRS file is closed. It is used when a GVRS file is opened (for reading or writing) to load up the file positions of internal content. The availability of an index can significantly reduce the time required to open a GVRS file. Since both files are part of the same group of things (e.g. the same "product"), they share a common UUID. This feature allows us to be sure that the correct index is used when an existing GVRS file is opened.
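
The pairing check described above can be sketched in a few lines. This is only an illustration of the idea; the class and method names are hypothetical and are not taken from the Gridfour API.

```java
import java.util.UUID;

/** Hypothetical sketch of UUID-based pairing; not part of the Gridfour API. */
final class IndexPairingSketch {

  /**
   * Returns true only when the identifier read from the index file
   * matches the identifier read from the main data file.
   */
  static boolean indexMatches(UUID dataFileId, UUID indexFileId) {
    return dataFileId != null && dataFileId.equals(indexFileId);
  }

  public static void main(String[] args) {
    UUID product = UUID.randomUUID();   // assigned once, when the product is created
    UUID stale = UUID.randomUUID();     // identifier left over from some other product

    System.out.println(indexMatches(product, product)); // index belongs to this file
    System.out.println(indexMatches(product, stale));   // index must be ignored or rebuilt
  }
}
```

An application that finds a mismatched index would presumably fall back to scanning the data file, which is slower but still correct.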

@gwlucastrig
Owner Author

I added the UUID to the main GVRS file header (GvrsFile.java) and took it out of the GvrsFileSpecification. I also made a modification to clean up some confusing code in the record allocation logic.

Once again, these changes break compatibility with the earlier ISSUE 10 versions. I think I am done making changes that alter the file structure. I have a few more features to add, but I have already reserved space for them. I regret the inconvenience that these frequent changes have caused as I move toward completing the refactoring operation.

One prominent change that I am thinking about is eliminating the use of a separate "index file" and moving the index into the main GVRS file body. Fortunately, I have already planned for this change and, when implemented, it will not break compatibility.

I am still on track to complete this issue and submit Gridfour to the Maven Central Repository by year's end.

Thank you for your patience in this matter.

@gwlucastrig
Owner Author

I have integrated the content of the index file into the main GVRS file, thus eliminating the need for a separate "side car" file. The original purpose of the index file was to expedite opening a large GVRS file. These functions are now integrated into the GVRS file itself. The index file is no longer used. This change does not break backward compatibility.

I am currently working on support for very large data files. At present, the maximum file size supported by the Java implementation is 32 gigabytes. This limitation is due to an incomplete implementation of the GVRS file format specification. GVRS itself supports a 64-bit address space. The changes are relatively minor and should be done in the next couple of days (the real challenge isn't the implementation, but the testing thereof).
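
The thread does not say where the 32 gigabyte figure comes from. One arrangement that yields exactly that limit, offered here purely as an assumption, is to store file positions as unsigned 32-bit counts of 8-byte units. The sketch below shows the arithmetic; the class and method names are hypothetical.

```java
/** Hypothetical sketch: how a scaled 32-bit offset leads to a 32 GiB ceiling. */
final class ScaledOffsetSketch {

  /** Assumed alignment: every stored position is a multiple of 8 bytes. */
  static final int UNIT = 8;

  /** Packs a byte offset into an unsigned 32-bit count of 8-byte units. */
  static int pack(long byteOffset) {
    if (byteOffset < 0 || byteOffset % UNIT != 0) {
      throw new IllegalArgumentException("offset must be a non-negative multiple of 8");
    }
    long units = byteOffset / UNIT;
    if (units > 0xFFFFFFFFL) {
      throw new IllegalArgumentException("offset exceeds the scaled 32-bit range");
    }
    return (int) units; // the bit pattern is kept; the value is read back as unsigned
  }

  /** Recovers the byte offset from its packed form. */
  static long unpack(int packed) {
    return Integer.toUnsignedLong(packed) * UNIT;
  }

  public static void main(String[] args) {
    long limit = (1L << 32) * UNIT; // 2^32 units of 8 bytes each
    System.out.println("addressable bytes: " + limit);
    System.out.println("round trip: " + unpack(pack(20_000_000_000L)));
  }
}
```

Under this assumption the ceiling is 2^35 bytes, or 32 GiB. Storing positions in a Java `long` removes the ceiling at the cost of four extra bytes per position.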

Once that change is in place, the remaining work consists of the following:

  1. Filling in missing Javadoc
  2. Writing a demonstration program to show how to extract TIFF tags from TIFF files and store them as GVRS metadata elements
  3. Adding a JUnit test for metadata
  4. Performing an overall code review (volunteers would be appreciated)
  5. Posting Gridfour version 1 to the Maven Central Repository

@gwlucastrig
Owner Author

I pushed changes to support very large data files. The size of GVRS files is no longer tied to the 32 GB limit.

@gwlucastrig
Owner Author

gwlucastrig commented Dec 20, 2021

To exercise GVRS' metadata features, I am working on a demonstration application that reads elevation data from a Shuttle Radar Topography Mission (SRTM) file, creates a shaded-relief image, and stores the results as a GeoTIFF file. The TIFF tags from the source file are transcribed to GVRS metadata objects and stored with the GVRS file. They are used to compute the geographic parameters for rendering the image. Once the image is done, the demonstration application writes out a GeoTIFF file, using the metadata to format TIFF tags as appropriate.

The application depends on the Apache Commons Imaging library for reading and writing TIFF files. The algorithms for the shaded-relief technique are described in Elevation GeoTIFF Part 1 -- Shaded Relief.

The image below shows a work in progress. It's a down-sampled JPEG. The actual TIFF files are quite a bit larger (3600-by-3600 pixels). I hope to have the demonstration ready for review in a couple of weeks.

Incidentally, when storing the raw SRTM elevation information using GVRS' data compression, the output required 2.09 bits per sample value. SRTM data tends to compress rather well.
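
To put the 2.09 bits-per-sample figure in context, here is a rough size estimate. It assumes a 3600-by-3600 grid (the image dimensions quoted above) and 16-bit integer samples in the uncompressed source; both are assumptions for the sake of the arithmetic, and the class name is hypothetical.

```java
/** Hypothetical sketch: back-of-envelope size estimate from a bits-per-sample figure. */
final class CompressionEstimate {

  /** Estimated storage in bytes for the given sample count and average bits per sample. */
  static long compressedBytes(long samples, double bitsPerSample) {
    return Math.round(samples * bitsPerSample / 8.0);
  }

  public static void main(String[] args) {
    long samples = 3600L * 3600L;               // assumed grid size
    long rawBytes = samples * 2;                // assumed 16-bit samples
    long packedBytes = compressedBytes(samples, 2.09);

    System.out.printf("raw: %d bytes%n", rawBytes);
    System.out.printf("compressed: about %d bytes%n", packedBytes);
    System.out.printf("ratio: about %.2f to 1%n", (double) rawBytes / packedBytes);
  }
}
```

Under those assumptions, roughly 25.9 MB of raw samples would shrink to about 3.4 MB, a ratio of about 7.7 to 1.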

[Image: n42_w074_1arc_downsized]

@gwlucastrig
Owner Author

And here is an example of the result: a GeoTIFF created from a GVRS file. The geo-referencing metadata allows it to be placed on Google Earth or loaded into any mainstream GIS tool.

I will be polishing the code and writing documentation. When that's done, I will post it on GitHub.

[Image: n42_w074_1arc_GoogleEarth]

@gwlucastrig
Owner Author

gwlucastrig commented Dec 29, 2021

I didn't have access to my computer over the holidays, but as I was washing the dishes from Christmas dinner it occurred to me that there was a flaw in the file-space management logic. I have fixed the problem and am doing some testing before pushing out the change. I should have an update in the next few days.

When data compression is enabled, a change to even a single data cell in a grid often changes the compressed size of the tile that contains it. So if an application is adjusting the content of an existing tile (record) in the file, it may need to allocate a bigger block of file space to store it. In such a case, the formerly occupied section of the file is added to a "free list" for future use and the data is written to a new file location. The RecordManager class takes care of all of that.

The flaw occurred when the last block of free space in the file happened to be at the end of the file, but was not large enough to hold the new content. The RecordManager realized that it had to allow the file size to grow, but it didn't realize that it could re-use the space occupied by that last block. So GVRS would end up making the file larger than it actually needed to be.

The whole file-allocation process is similar to the way C/C++ programs handle malloc, realloc, and free. At some point in a future release, I am going to research algorithms for malloc and see if they can be applied to GVRS. My current implementation is pretty sturdy, but I suspect it is also a bit naïve. There may be opportunities to improve performance or attain more efficient use of file space.
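
The end-of-file case described above can be illustrated with a simplified first-fit allocator. This is a sketch written for this discussion, not the RecordManager code: the class name, the first-fit policy, and the in-memory list are all assumptions, and no actual file I/O is performed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of free-list file-space management; not the Gridfour RecordManager. */
final class FreeListSketch {

  /** A free block, described by its file offset and length in bytes. */
  private static final class Block {
    long offset;
    long size;
    Block(long offset, long size) { this.offset = offset; this.size = size; }
  }

  private final List<Block> freeList = new ArrayList<>(); // kept sorted by offset
  private long fileSize;

  long fileSize() { return fileSize; }

  /** Reserves space for a record and returns its file offset. */
  long allocate(long size) {
    // First fit: reuse any free block that is large enough.
    for (int i = 0; i < freeList.size(); i++) {
      Block b = freeList.get(i);
      if (b.size >= size) {
        long offset = b.offset;
        b.offset += size;
        b.size -= size;
        if (b.size == 0) freeList.remove(i);
        return offset;
      }
    }
    // Nothing fits. If the last free block touches the end of the file,
    // reuse it and grow the file only by the shortfall.
    if (!freeList.isEmpty()) {
      Block last = freeList.get(freeList.size() - 1);
      if (last.offset + last.size == fileSize) {
        freeList.remove(freeList.size() - 1);
        fileSize = last.offset + size;
        return last.offset;
      }
    }
    // Otherwise append the record at the current end of the file.
    long offset = fileSize;
    fileSize += size;
    return offset;
  }

  /** Returns a record's space to the free list, merging it with adjacent free blocks. */
  void free(long offset, long size) {
    int i = 0;
    while (i < freeList.size() && freeList.get(i).offset < offset) i++;
    freeList.add(i, new Block(offset, size));
    if (i + 1 < freeList.size()) {            // merge with the following block
      Block cur = freeList.get(i);
      Block next = freeList.get(i + 1);
      if (cur.offset + cur.size == next.offset) {
        cur.size += next.size;
        freeList.remove(i + 1);
      }
    }
    if (i > 0) {                              // merge with the preceding block
      Block prev = freeList.get(i - 1);
      Block cur = freeList.get(i);
      if (prev.offset + prev.size == cur.offset) {
        prev.size += cur.size;
        freeList.remove(i);
      }
    }
  }

  public static void main(String[] args) {
    FreeListSketch space = new FreeListSketch();
    space.allocate(100);                 // record A at offset 0
    long b = space.allocate(50);         // record B at offset 100
    space.free(b, 50);                   // B grows, so its old 50 bytes are released
    long moved = space.allocate(80);     // the new, larger B
    System.out.println("offset=" + moved + " fileSize=" + space.fileSize());
  }
}
```

In this example the larger record lands at offset 100 and the file grows to 180 bytes. Without the end-of-file check, the record would be appended at offset 150 and the file would grow to 230 bytes, leaving 50 bytes stranded.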

@gwlucastrig
Owner Author

I pushed a new commit to GitHub with the changes described above.

In addition to JUnit tests, I have a test procedure I run with the PackageData demo application in which I set the tile cache to a small size and enable data compression. This configuration results in the tiles being written and re-written multiple times as data is added to the file. As each row of the source data is scanned, the storage size for the tiles grows progressively larger (as empty data cells are replaced with valid data). As processing progresses, tiles are read from disk and re-written to a new location in the file. The file-space management logic reclaims their old storage locations for future use. Because the tile cache is so small, each tile is written and then re-written to disk 120 times.
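
The effect of a very small tile cache on write counts can be simulated without touching a real file. The sketch below uses a least-recently-used cache built on `LinkedHashMap`; the eviction policy, the class name, and the parameter values are assumptions chosen for illustration, and the row count of 120 is picked only to mirror the figure quoted above (the actual tile dimensions are not stated in this thread).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: counting tile writes during a row-by-row scan with a small cache. */
final class TileCacheSketch {

  /** Simulates the scan and returns how many times tile 0 is written to disk. */
  static int writesForFirstTile(int tilesAcross, int rowsPerTile, int cacheCapacity) {
    final int[] writes = new int[tilesAcross];

    // Access-ordered map: the eldest entry is the least recently used tile.
    Map<Integer, Boolean> cache = new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
        if (size() > cacheCapacity) {
          writes[eldest.getKey()]++;   // an evicted tile is dirty, so it goes to disk
          return true;
        }
        return false;
      }
    };

    // Each source row crosses every tile in one row of tiles.
    for (int row = 0; row < rowsPerTile; row++) {
      for (int tile = 0; tile < tilesAcross; tile++) {
        cache.put(tile, Boolean.TRUE); // storing a row segment marks the tile dirty
      }
    }

    // Closing the file flushes whatever is still held in the cache.
    for (int tile : cache.keySet()) {
      writes[tile]++;
    }
    return writes[0];
  }

  public static void main(String[] args) {
    System.out.println("small cache: " + writesForFirstTile(4, 120, 2));
    System.out.println("large cache: " + writesForFirstTile(4, 120, 4));
  }
}
```

With a cache too small to hold one row of tiles, every pass across the grid evicts each tile before the scan returns to it, so the write count equals the number of rows per tile. With a cache that holds the full row of tiles, each tile is written once.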

The test ran just fine. I am satisfied with the behavior of the file-space management system. The code is now very close to being ready for the release of version 1.0.

If you are interested, you can read more about how the tile cache operates in the PackageData demonstration application at The Tile Cache

@gwlucastrig
Owner Author

One of the goals of Issue-10 was to establish the final Version 1.0 of the GVRS file format before pushing out the first release. I am considering making one last change that will break compatibility with earlier files before submitting the Gridfour core library to the Maven Central Repository and making the official 1.0 release of Gridfour.

Right now, the Gridfour user base is rather small (perhaps non-existent), so I think the impact of the change would be small. However, if anyone has built up a collection of GVRS files, please let me know so I can think of an alternate approach to solving the problem I wish to address.

Thank you for your attention in this matter.

@ebocher

ebocher commented Jan 3, 2022

Such exciting changes!
What about adding to the demo module an example that reads a GeoTIFF file, processes it (tile by tile) with the GVRS mechanism, and stores the result in a new TIFF?
We can wrap the imagej library (https://imagej.nih.gov/ij/) and use their algorithms to apply a convolution or whatever.
It will be a good exercise to demonstrate the GridFour pros and capabilities, especially for processing a large file with limited memory.
My 2 cents. I would be glad to work with you on common GIS demonstrations.

@gwlucastrig
Owner Author

gwlucastrig commented Jan 4, 2022

I am glad that you like the idea. Merci!

The current test program still needs a lot of work before it's ready to distribute. I look forward to posting it to the Gridfour site sometime in the next couple of weeks.

The current version of the test program stores into the GVRS file a subset of the TIFF tags that were taken from the original TIFF file. They are then transcribed into the output GeoTIFF file. By preserving the GeoTIFF tags from the original, the process ensures that the final output file is also a valid GeoTIFF file.

The following lists the GeoTIFF tags that the demonstration application currently supports. I am researching ideas for including more TIFF tags.

 GvrsMetadata modelTiepoint;
 GvrsMetadata modelPixelScale;
 GvrsMetadata geoKey;
 GvrsMetadata geoDoubleParameters;
 GvrsMetadata geoAsciiParameters;
 GvrsMetadata gdalNoData;
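
For reference, the fields above appear to correspond to the following TIFF tag codes. The numeric codes are the standard GeoTIFF and GDAL assignments; the pairing of each code with a field name in the list is inferred from the names, and the enum itself is a hypothetical helper, not part of Gridfour.

```java
/** Hypothetical sketch: standard tag codes for the GeoTIFF-related tags listed above. */
final class GeoTiffTags {

  enum Tag {
    MODEL_PIXEL_SCALE(33550),   // ModelPixelScaleTag
    MODEL_TIEPOINT(33922),      // ModelTiepointTag
    GEO_KEY_DIRECTORY(34735),   // GeoKeyDirectoryTag
    GEO_DOUBLE_PARAMS(34736),   // GeoDoubleParamsTag
    GEO_ASCII_PARAMS(34737),    // GeoAsciiParamsTag
    GDAL_NODATA(42113);         // GDAL_NODATA (a GDAL convention, not in the GeoTIFF spec)

    final int code;

    Tag(int code) { this.code = code; }

    /** Returns the tag with the given code, or null if the code is not in this list. */
    static Tag fromCode(int code) {
      for (Tag t : values()) {
        if (t.code == code) return t;
      }
      return null;
    }
  }

  public static void main(String[] args) {
    for (Tag t : Tag.values()) {
      System.out.println(t.code + " " + t);
    }
  }
}
```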

Do you have suggestions about other TIFF tags that you think the program should be preserving?

@ebocher

ebocher commented Jan 4, 2022

Hi Gary,
Here you have the whole spec for GeoTIFF (https://docs.opengeospatial.org/is/19-008r4/19-008r4.html) but you probably already know it.
The GvrsMetadata system is flexible, so I think you can keep the existing tags for now. They can be adapted according to community requests and use cases in the future.

@gwlucastrig
Owner Author

Hi Erwan,

I used the TIFF-to-GVRS-to-TIFF test program to process that DEM that you gave to me a couple of months ago. I had to make some modifications to my test program because your DEM uses a projected coordinate system (Cartesian coordinates) rather than a geographic coordinate system. Testing with your file worked out well, because it uncovered some bugs in my coordinate transformation logic. Anyway, I used the process to create a TIFF file and then plotted it on Google Earth. I think the results would be compatible with any good quality GIS system.

[Image: Nantes_DEM]

@gwlucastrig
Owner Author

I have encountered additional delays in the refactoring effort.

This week, I started looking at what would be required to implement a Raster Pyramid feature. Although this feature will not be added until a future release, I wanted to be sure that I could do it without breaking compatibility with the current implementation. This effort revealed some significant limitations in the current file format.

I just pushed a new version of the GVRS code to GitHub. I believe that Version 1.0 is very close to being ready for release.

One other thing I will be adding to the release is support for an AffineTransform for mapping real-valued coordinates to the raster grid and vice versa. Although I have previously implemented basic scale-and-offset transforms, this feature will permit the addition of skew and rotation of coordinate systems into the GVRS file specification.
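
As a rough picture of what an affine mapping between grid and model coordinates looks like, here is a sketch built on the standard `java.awt.geom.AffineTransform` class. How GVRS will actually expose this feature is not described in this thread, so the helper names, the parameter choices, and the convention that grid x is the column and grid y is the row are all assumptions.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Point2D;

/** Hypothetical sketch of grid-to-model mapping with rotation; not the GVRS API. */
final class GridTransformSketch {

  /** Builds a transform that scales grid cells, rotates them, then shifts to the origin. */
  static AffineTransform gridToModel(double x0, double y0, double cellSize, double rotationRadians) {
    AffineTransform t = new AffineTransform();
    // Calls are concatenated so that a point is scaled first, rotated second, translated last.
    t.translate(x0, y0);
    t.rotate(rotationRadians);
    t.scale(cellSize, cellSize);
    return t;
  }

  /** Maps a grid coordinate (column, row) to a model coordinate. */
  static Point2D toModel(AffineTransform gridToModel, double column, double row) {
    return gridToModel.transform(new Point2D.Double(column, row), null);
  }

  /** Maps a model coordinate back to a (possibly fractional) grid coordinate. */
  static Point2D toGrid(AffineTransform gridToModel, double x, double y) {
    try {
      return gridToModel.inverseTransform(new Point2D.Double(x, y), null);
    } catch (NoninvertibleTransformException e) {
      throw new IllegalStateException("transform has no inverse", e);
    }
  }

  public static void main(String[] args) {
    // Origin at (1000, 2000), 30-unit cells, grid rotated 90 degrees.
    AffineTransform g2m = gridToModel(1000.0, 2000.0, 30.0, Math.toRadians(90));
    Point2D model = toModel(g2m, 2, 0);
    Point2D grid = toGrid(g2m, model.getX(), model.getY());
    System.out.println("model: " + model);
    System.out.println("grid:  " + grid);
  }
}
```

A plain scale-and-offset mapping is the special case where the rotation is zero, which is why an affine transform can replace it without losing anything.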

I still have to do testing, code review, and write some more JUnit tests. But I hope to release Version 1.0 next weekend (Jan 16th).

@ebocher

ebocher commented Jan 10, 2022

Hi @gwlucastrig ,
I hope you are well.
I'm testing the new API starting from DemoCOG.java. In a previous message, you talk about a TIFF-to-GVRS-to-TIFF test program but I'm not able to find it.

Erwan

@gwlucastrig
Owner Author

gwlucastrig commented Jan 10, 2022 via email

@gwlucastrig
Owner Author

gwlucastrig commented Jan 11, 2022 via email

@gwlucastrig
Owner Author

gwlucastrig commented Jan 12, 2022

Now that I am almost done with this issue, I am thinking about the next step after I release Version 1.0.

I think the next thing I will do is to write some more wiki pages describing "How To" use GVRS. I can only guess at what developers who are new to GVRS need to know, and any questions or suggestions that you may have will help me narrow it down.

Topics I am considering

  1. A GVRS FAQ (frequently asked questions)
  2. Using GVRS Elements
  3. Using GVRS Metadata
  4. The GVRS approach to coordinate systems and transforms

I would also welcome any other suggestions that you may have.

I've put a lot of information into the Javadoc, but in places where the use of the software is not self-evident, I think some supplemental wiki pages would help.

@gwlucastrig
Owner Author

I am pleased to announce that I have completed work on this issue. When I started it in November, I had no idea how much work it was going to be, but I am happy with pretty much every change I've made in both the code and the file format.

I am treating the current state of Gridfour as Version 1.0. I believe that the file-format is now stable and will not change in the near future. While the API could still be extended with additional methods and Javadoc, I believe it is now in a state where additional changes can wait until future releases.

I have just pushed changes up to GitHub for Version 1.0. I am getting ready to push Gridfour Jars out to Maven Central.

Thank you for your patience in this matter.
