Add multi-threaded logic for compressing data #18

Closed
gwlucastrig opened this issue Mar 30, 2022 · 2 comments

Comments

@gwlucastrig
Owner

The current GVRS API is based on a single thread of execution. There are a few operations related to storing data that could be conducted in a multi-threaded manner.

For example, when I tested the PackageData application today storing ETOPO1 data, the process required 4.5 seconds. But when I turned on data compression, it required 68 seconds. When I activated the advanced LSOP option, it required 101 seconds. GVRS compression works using a generate-and-test scheme where it will try different combinations of predictors (Differencing, Triangle, LSOP, etc.) and compressors (Deflate, Huffman). Essentially, it tries a bunch of things and goes with the one that produces the best results. Since these "trials" don't share any writable memory resources, they could be conducted in parallel.

This afternoon, I did a quick hack with the CodecMaster class and made it use a ThreadPoolExecutor to process these compressors in separate threads. The 101 seconds required for the full-compression suite (including LSOP) was reduced to 65 seconds. The 68 seconds required for the standard predictors was reduced to 35 seconds.
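To illustrate the generate-and-test idea running in parallel, here is a minimal sketch. It is not the CodecMaster code: the class `ParallelTrialSketch`, its method names, and the three trial combinations are hypothetical, and `java.util.zip.Deflater` strategies stand in for the GVRS compressors. Each trial only reads the shared input, so the trials can run on separate threads and the smallest result wins.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.zip.Deflater;

// Hypothetical sketch of parallel generate-and-test; not the GVRS CodecMaster code.
class ParallelTrialSketch {

    // Deflate the input using the given java.util.zip.Deflater strategy.
    static byte[] deflate(byte[] input, int strategy) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setStrategy(strategy);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Simple differencing predictor: each byte becomes its change from the prior byte.
    static byte[] difference(byte[] input) {
        byte[] out = new byte[input.length];
        byte prior = 0;
        for (int i = 0; i < input.length; i++) {
            out[i] = (byte) (input[i] - prior);
            prior = input[i];
        }
        return out;
    }

    // Candidate predictor/compressor combinations (illustrative stand-ins only).
    static List<Function<byte[], byte[]>> trials() {
        List<Function<byte[], byte[]>> list = new ArrayList<>();
        list.add(src -> deflate(src, Deflater.DEFAULT_STRATEGY));
        list.add(src -> deflate(difference(src), Deflater.DEFAULT_STRATEGY));
        list.add(src -> deflate(difference(src), Deflater.HUFFMAN_ONLY));
        return list;
    }

    // Run every trial on its own thread and keep the smallest output.
    static byte[] compressBest(byte[] input) {
        List<Function<byte[], byte[]>> candidates = trials();
        ExecutorService pool = Executors.newFixedThreadPool(candidates.size());
        try {
            List<CompletableFuture<byte[]>> futures = new ArrayList<>();
            for (Function<byte[], byte[]> trial : candidates) {
                futures.add(CompletableFuture.supplyAsync(() -> trial.apply(input), pool));
            }
            byte[] best = null;
            for (CompletableFuture<byte[]> future : futures) {
                byte[] candidate = future.join(); // blocks until this trial finishes
                if (best == null || candidate.length < best.length) {
                    best = candidate;
                }
            }
            return best;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i / 16);
        }
        System.out.println("best size: " + compressBest(data).length + " of " + data.length);
    }
}
```

A real codec would also need to record which combination won so that the data can be decoded later; the sketch omits that bookkeeping.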

So I propose to investigate multi-threaded implementations as a way of expediting data compression.

Notes:

In general, if the different compressors run in parallel and enough cores are available, the time for compression approaches that of the slowest compressor, rather than the sum of all of them.

With multiple compressors running in parallel, the application would consume more memory, but the total CPU time over the whole run would not increase significantly. The program would use more CPU at any given moment, but would run for a shorter time.

Further refinement may be possible, though I want to avoid the temptation to create unduly complex code to save a few seconds of run time. For example, the LSOP predictor required 12.3 seconds to compute its internal coefficients over the raster grid. If we partitioned the grid into 4 pieces, we could process each in parallel. At best, the processing time would drop to about a quarter of what it was, or roughly 3.1 seconds, a saving of about 9 seconds out of 101. So that wouldn't be worth the effort. However, it might be possible to integrate multiple threads for some of the other parts of the LSOP process. I think I would leave that for a future effort.
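For reference, the grid-partitioning idea could look roughly like the sketch below. The class `GridPartitionSketch` and its methods are hypothetical, and the per-band work (a sum of squared row-wise differences) is only a stand-in for the LSOP coefficient computation, which is not shown in this thread. The grid is split into bands of rows, each band is processed on its own thread, and the partial results are combined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of processing a raster grid in parallel bands of rows.
class GridPartitionSketch {

    // Stand-in for per-band work: sum of squared differences between row neighbors.
    static double bandCost(int[][] grid, int row0, int row1) {
        double sum = 0;
        for (int r = row0; r < row1; r++) {
            for (int c = 1; c < grid[r].length; c++) {
                double d = grid[r][c] - grid[r][c - 1];
                sum += d * d;
            }
        }
        return sum;
    }

    // Split the rows into nBands pieces, process each on its own thread, then combine.
    static double parallelCost(int[][] grid, int nBands) {
        ExecutorService pool = Executors.newFixedThreadPool(nBands);
        try {
            List<CompletableFuture<Double>> parts = new ArrayList<>();
            int nRows = grid.length;
            for (int b = 0; b < nBands; b++) {
                final int row0 = (int) ((long) nRows * b / nBands);
                final int row1 = (int) ((long) nRows * (b + 1) / nBands);
                parts.add(CompletableFuture.supplyAsync(() -> bandCost(grid, row0, row1), pool));
            }
            double total = 0;
            for (CompletableFuture<Double> part : parts) {
                total += part.join(); // waits for that band to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[][] grid = new int[100][200];
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                grid[r][c] = (r * 7 + c * c) % 50;
            }
        }
        System.out.println("serial:   " + bandCost(grid, 0, grid.length));
        System.out.println("parallel: " + parallelCost(grid, 4));
    }
}
```

The bands only read the grid and each produces its own partial sum, so no locking is needed. Whether the real coefficient computation splits this cleanly is an open question.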

@gwlucastrig
Owner Author

Just pushed code changes for this issue. The option for multi-threaded processing was introduced to the PackageData demonstration class. Additionally, the code base now includes a new class called TaskGroupExecutor that coordinates the use of multiple threads when partitioning processing into sub-tasks. In some cases, this class is easier to use than the Java standard API's CyclicBarrier and Phaser classes.
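The TaskGroupExecutor API itself is not shown in this thread, so the sketch below does not reproduce it. It only illustrates the general pattern of submitting a group of sub-tasks and blocking until all of them have finished, using the standard `CountDownLatch`. The class `TaskGroupSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a "run a group of sub-tasks and wait" helper.
// This is NOT the GVRS TaskGroupExecutor class.
class TaskGroupSketch {

    // Submit every task to the pool and block until all of them have completed.
    static void runGroup(ExecutorService pool, List<Runnable> tasks) {
        CountDownLatch done = new CountDownLatch(tasks.size());
        for (Runnable task : tasks) {
            pool.execute(() -> {
                try {
                    task.run();
                } finally {
                    done.countDown(); // count down even if the task throws
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for task group", e);
        }
    }

    // Runs four trivial sub-tasks and returns how many completed.
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger completed = new AtomicInteger();
        try {
            List<Runnable> tasks = new ArrayList<>();
            for (int i = 0; i < 4; i++) {
                tasks.add(() -> completed.incrementAndGet());
            }
            runGroup(pool, tasks);
            return completed.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("completed sub-tasks: " + demo());
    }
}
```

Unlike CyclicBarrier or Phaser, a latch like this is single-use, so the sketch creates a new one for every group of tasks.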

In testing with the GEBCO global elevation and bathymetry data set (3.7 billion sample points), the time to compress the entire data set on a mid-sized laptop computer was reduced from 1920 seconds to 1052 seconds, a reduction of about 45%.

Opportunities for future improvements are being investigated.

@gwlucastrig
Owner Author

Final changes pushed to repository. These enhancements resulted in a nearly 50% reduction in processing time for global elevation/bathymetry data sets.
