Add multi-threaded logic for compressing data #18
Just pushed code changes for this issue. The option for multi-threaded processing was introduced to the PackageData demonstration class. Additionally, the code base now includes a new class called TaskGroupExecutor that coordinates the use of multiple threads when partitioning processing into sub-tasks. In some cases, this class is easier to use than the Java standard API's CyclicBarrier and Phaser classes. In testing with the GEBCO global elevation and bathymetry data set (3.7 billion sample points), the time to compress the entire data set on a mid-sized laptop computer was reduced from 1920 seconds to 1052 seconds. Opportunities for future improvements are being investigated.
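The TaskGroupExecutor API itself isn't shown in this thread, so the following is only a hypothetical stand-in for the pattern it coordinates: partition a raster computation into row-band sub-tasks, run them in parallel, and block until all of them finish. It uses only the standard java.util.concurrent classes, with `ExecutorService.invokeAll` doing the wait-for-all coordination that one might otherwise build from CyclicBarrier or Phaser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch of partitioning work over a raster grid into row-band
 * sub-tasks and running them in parallel.  Hypothetical illustration
 * of the pattern, not the actual TaskGroupExecutor implementation.
 */
public class PartitionDemo {
  public static long sumGrid(int[][] grid, int nThreads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    int nRows = grid.length;
    int band = (nRows + nThreads - 1) / nThreads; // rows per sub-task
    List<Callable<Long>> tasks = new ArrayList<>();
    for (int i0 = 0; i0 < nRows; i0 += band) {
      final int begin = i0;
      final int end = Math.min(i0 + band, nRows);
      tasks.add(() -> {
        long s = 0;
        for (int i = begin; i < end; i++) {
          for (int v : grid[i]) {
            s += v;
          }
        }
        return s;
      });
    }
    long total = 0;
    // invokeAll blocks until every sub-task has completed
    for (Future<Long> f : pool.invokeAll(tasks)) {
      total += f.get();
    }
    pool.shutdown();
    return total;
  }

  public static void main(String[] args) throws Exception {
    int[][] grid = new int[1000][1000];
    for (int[] row : grid) {
      Arrays.fill(row, 1);
    }
    System.out.println(sumGrid(grid, 4)); // prints 1000000
  }
}
```

Because the sub-tasks write only to their own local accumulators, no locking is needed until the final reduction on the calling thread.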
Final changes pushed to the repository. These enhancements resulted in a nearly 50% reduction in processing time for global elevation/bathymetry data sets.
The current GVRS API is based on a single thread of execution. There are a few operations related to storing data that could be conducted in a multi-threaded manner.
For example, when I tested the PackageData application today storing ETOPO1 data, the process required 4.5 seconds. But when I turned on data compression, it required 68 seconds. When I activated the advanced LSOP option, it required 101 seconds. GVRS compression works using a generate-and-test scheme where it will try different combinations of predictors (Differencing, Triangle, LSOP, etc.) and compressors (Deflate, Huffman). Essentially, it tries a bunch of things and goes with the one that produces the best results. Since these "trials" don't share any writable memory resources, they could be conducted in parallel.
This afternoon, I did a quick hack with the CodecMaster class and made it use a ThreadPoolExecutor to process these compressors in separate threads. The 101 seconds required for the full-compression suite (including LSOP) was reduced to 65 seconds. The 68 seconds required for the standard predictors was reduced to 35.
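The essence of that hack can be sketched with the standard ExecutorService: submit each independent compression trial to the pool, wait for the results, and keep the smallest. The sketch below uses java.util.zip.Deflater strategies as stand-in "trials"; the real CodecMaster tries GVRS predictor/compressor combinations instead, so this is an illustration of the pattern rather than the actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

/**
 * Generate-and-test in parallel: run several independent compression
 * trials concurrently and keep whichever yields the smallest output.
 * Hypothetical sketch; stands in for the CodecMaster trial loop.
 */
public class ParallelTrialDemo {
  static byte[] deflate(byte[] data, int strategy) {
    Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
    d.setStrategy(strategy);
    d.setInput(data);
    d.finish();
    byte[] buffer = new byte[data.length + 64]; // ample for this demo
    int n = d.deflate(buffer);
    d.end();
    byte[] out = new byte[n];
    System.arraycopy(buffer, 0, out, 0, n);
    return out;
  }

  public static void main(String[] args) throws Exception {
    byte[] sample = new byte[4096];
    for (int i = 0; i < sample.length; i++) {
      sample[i] = (byte) (i % 17); // mildly compressible test data
    }
    int[] strategies = {
      Deflater.DEFAULT_STRATEGY, Deflater.FILTERED, Deflater.HUFFMAN_ONLY
    };
    ExecutorService pool = Executors.newFixedThreadPool(strategies.length);
    List<Future<byte[]>> futures = new ArrayList<>();
    for (int s : strategies) {
      final int strategy = s;
      // each trial shares no writable state, so no synchronization needed
      futures.add(pool.submit(() -> deflate(sample, strategy)));
    }
    byte[] best = null;
    for (Future<byte[]> f : futures) {
      byte[] candidate = f.get(); // wait for each trial to finish
      if (best == null || candidate.length < best.length) {
        best = candidate;
      }
    }
    pool.shutdown();
    System.out.println("best trial size: " + best.length + " bytes");
  }
}
```

Since the trials are independent and read-only with respect to the input, the wall-clock time approaches that of the slowest single trial, which is exactly the effect reported above.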
So I propose to investigate multi-threaded implementations as a way of expediting data compression.
Notes:
In general, if the different compressors run in parallel, the overall time for compression would simply be that of the compressor that takes the longest to run.
With multiple compressors running in parallel, the application would consume more memory, but the total CPU usage over the entire program execution would not increase significantly. The program would consume more CPU while running, but would run for a shorter time.
Further refinement may be possible, though I want to avoid the temptation to create unduly complex code to save a few seconds of run time. For example, the LSOP predictor required 12.3 seconds to compute its internal coefficients over the raster grid. If we partitioned the grid into 4 pieces, we could process each in parallel. The processing time would be reduced to about a quarter of what it was, roughly 3 seconds, for a saving of only about 9 seconds. So that wouldn't be worth the effort. However, it might be possible to integrate multiple threads into some of the other parts of the LSOP process. I think I would leave that for a future effort.