[GEOT-6900] Shapefile quadtree build performance improvements (geotools#3528)

* [GEOT-6900] ShapeFileIndexerStressTest

New `ShapeFileIndexerStressTest` to assess the performance of `ShapeFileIndexer.index()` over different shapefile sizes.

* [GEOT-6900] Shapefile quadtree build performance improvements

Use a strategy object (`BoundsReader`) to assist `ShapeFileIndexer` in speeding up the `QuadTree` optimization phase. It provides quick access to each shapefile record envelope, potentially avoiding an immense number of random disk I/O calls through `ShapefileReader` as the quad tree internal nodes get split or shrunk.

Since the `QuadTree` leaf nodes hold only the shapefile record ids, and not their bounds, the tree layout optimization phase may incur too many random disk reads on the `.shp` file. The impact grows with the size of the shapefile, and depends more on the size of the geometries than on the number of records. The `BoundsReader` strategy object is meant to avoid that to the extent possible.

Up to a point, record bounds are stored in heap memory (up to 1 MiB, accounting for 32K records, or 64K records if it's a points shapefile). For a larger number of shapefile records, the strategy is to store the bounds in a temporary file (named `GeoTools_shp_qix_bounds_<random number>.tmp` under `${java.io.tmpdir}`), which is memory mapped and deleted at `BoundsReader.close()`. This leverages the operating system's native paging. Because the bounds file is much smaller than the actual `.shp`, and because reading it avoids the parsing performed by `ShapefileReader.nextRecord()`, this results in dramatically less random I/O and computation.

Note, however, that if there is not enough temporary space in the file system where the `java.io.tmpdir` directory resides, a fallback strategy that reads directly from the `ShapefileReader` is used. This should be a rare edge case, since with a bounds record size of 32 bytes the required temporary storage is about 30.5 MiB per million features (32,000,000 bytes).
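
For illustration only, the memory-mapped bounds file described above could look roughly like the following. This is a minimal sketch, not the actual `BoundsReader` code: the class name `MappedBoundsSketch`, its methods, the temp file prefix, and the record layout (four 8-byte doubles: minX, minY, maxX, maxY, matching the 32-byte record size mentioned above) are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of a memory-mapped envelope store; not the GeoTools implementation. */
final class MappedBoundsSketch implements AutoCloseable {

    /** Assumed layout: minX, minY, maxX, maxY as 8-byte doubles = 32 bytes per record. */
    static final int RECORD_SIZE = 4 * Double.BYTES;

    private final Path file;
    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    MappedBoundsSketch(int recordCount) throws IOException {
        // The file is created under ${java.io.tmpdir}; the prefix here is made up.
        file = Files.createTempFile("bounds_sketch_", ".tmp");
        channel = FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE);
        // Mapping in READ_WRITE mode grows the empty file to the requested size.
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, storageBytes(recordCount));
    }

    /** Bytes needed to hold the envelopes of recordCount records. */
    static long storageBytes(long recordCount) {
        return recordCount * RECORD_SIZE;
    }

    /** Stores the envelope of the record at the given zero-based index. */
    void put(int recordIndex, double minX, double minY, double maxX, double maxY) {
        int offset = Math.multiplyExact(recordIndex, RECORD_SIZE);
        buffer.putDouble(offset, minX);
        buffer.putDouble(offset + 8, minY);
        buffer.putDouble(offset + 16, maxX);
        buffer.putDouble(offset + 24, maxY);
    }

    /** Returns {minX, minY, maxX, maxY} without touching the .shp file. */
    double[] get(int recordIndex) {
        int offset = Math.multiplyExact(recordIndex, RECORD_SIZE);
        return new double[] {
            buffer.getDouble(offset),
            buffer.getDouble(offset + 8),
            buffer.getDouble(offset + 16),
            buffer.getDouble(offset + 24)
        };
    }

    @Override
    public void close() throws IOException {
        channel.close();
        // On some platforms (notably Windows) a file with a live mapping may not be
        // deletable until the buffer is garbage collected; a real implementation
        // would need to handle that case.
        Files.deleteIfExists(file);
    }
}
```

A single `MappedByteBuffer` is limited to 2 GiB, so this sketch tops out at roughly 67 million records. Random lookups by record index become plain offset reads that the operating system can serve from its page cache, which is the effect the commit message attributes to the temporary bounds file.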