Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor using internal wrappers #168

Merged
merged 58 commits into from
Aug 31, 2021
Merged

Refactor using internal wrappers #168

merged 58 commits into from
Aug 31, 2021

Conversation

rafaqz
Copy link
Owner

@rafaqz rafaqz commented May 15, 2021

This PR is a massive refactor of GeoData.jl internals, structuring things how I always wanted to but hadn't worked out how to do in practice initially.

In its final form this should have minimal impact on use, but a large impact on maintainability, clarity and consistency.

Notably, nearly all the code deletions are actually code, and the additions are mostly extra tests. We should have about 800 lines less code in the packege after this.

Notable changes:

  • AbstractGeoStack is now <: AbstractDimStack. This means the Tables.jl interface will work as well as a lot of base array methods applied to all layers, and the packages are consistent again (this was not easy)
  • FileArray/FileStack are used as internal wrappers instead of raw strings. This is the main thing that allowed removing so much code and standardizing everything further.
  • Reads are delayed to the absolute last minute. Indexing a stack gives you a lazy GeoArray that is still a closed file. It will only be opened when you retreive some data.
  • read(A) can be used to transfer data to memory - for GeoArray, GeoStack and GeoSeries - to manually break laziness. Using GeoArray(A) no longer does this - it may still may be lazy internally.
  • NCDarray/NCDstack/GRDarra/GDALarray etc types are all gone. We can add aliases to them to keep code working for a while, but they are no longer needed. Everything is just a GeoArray or GeoStack and file loading is specified by the wrapped object.
  • DiskArrays.jl will be supported as widely as possible.
  • Files can now be written to inside an open do block. This can use broadcast etc facilitated by DiskArrays.jl
  • Test coverage is higher, due to removing so much code.
  • crop, extend methods allow resizing multiple arrays/stacks to match dimensions
  • slice allows lazily slicing arrays/stacks to series. This is very useful for standardising accross datasets where e.g. the time dimension may be within a file (e.g. netcdf) or between multipole files (e.g. tif). It will also allow distributed out of core processing as the slices are just string pointers to the files with the metadata to open them.

TODO

  • speed up DiskArrays in areas it has made things slower
  • benchmark
  • Fix HDF5Utils.jl so that HDF5 DiskArrays code isn't required here for SMAP

Closes #136
Closes #134
Closes #64
Closes #70

@rafaqz rafaqz changed the title Internal wrappers Refactor using internal wrappers May 15, 2021
@rafaqz rafaqz mentioned this pull request May 15, 2021
4 tasks
@codecov-commenter
Copy link

codecov-commenter commented May 30, 2021

Codecov Report

Merging #168 (0c0db15) into master (879cb41) will increase coverage by 11.33%.
The diff coverage is 85.28%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #168       +/-   ##
===========================================
+ Coverage   74.45%   85.78%   +11.33%     
===========================================
  Files          21       21               
  Lines        1229     1379      +150     
===========================================
+ Hits          915     1183      +268     
+ Misses        314      196      -118     
Impacted Files Coverage Δ
src/sources/smap.jl 0.00% <0.00%> (ø)
src/dimensions.jl 68.00% <80.95%> (+39.42%) ⬆️
src/mode.jl 89.18% <83.33%> (+23.80%) ⬆️
src/filestack.jl 85.71% <85.71%> (ø)
src/series.jl 87.87% <85.71%> (-2.75%) ⬇️
src/filearray.jl 88.00% <88.00%> (ø)
src/sources/grd.jl 96.66% <90.00%> (+1.01%) ⬆️
src/methods.jl 91.75% <91.17%> (+6.04%) ⬆️
src/sources/ncdatasets.jl 85.84% <92.76%> (+5.42%) ⬆️
src/sources/rasterdatasources.jl 94.59% <92.85%> (+2.70%) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 879cb41...0c0db15. Read the comment docs.

@rafaqz
Copy link
Owner Author

rafaqz commented May 31, 2021

This is passing and pretty much finished besides some tests for extend and slice.

@mauro3, @ali-ramadhan and @vlandau if you have any time to look over the code (there's a lot so I can't ask for a full review), or could just switch to using this branch that would be a big help.

@meggart DiskArrays.jl has been hugely helpful in this PR, thanks for all the work you put into it. I mostly resolved meggart/DiskArrays.jl#28 using Flatten.jl to re-wrap an opened object with the DiskArray wrappers.
https://github.com/rafaqz/GeoData.jl/pull/168/files#diff-e9f2a6b99d3b555edce8867c798cc88f2d6416b1f170225ec159a2af7cf30c8eR137-R144

So open can be used to skip any file opening overhead but keep lazy DiskArrays properties everywhere. In regular use I think just opening/closing multiple times is fine for chunked files.

@rafaqz rafaqz marked this pull request as ready for review May 31, 2021 06:53
@rafaqz rafaqz merged commit 30f9506 into master Aug 31, 2021
@rafaqz rafaqz deleted the internal_wrappers branch August 31, 2021 20:54
@mauro3
Copy link
Collaborator

mauro3 commented Sep 1, 2021

Cool! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants