
Redesign biocache process and index to allow the use of Producer/Consumer/Writer queues #131

Open
ansell opened this issue May 23, 2016 · 0 comments

ansell (Contributor) commented May 23, 2016

The current strategy for process and index operations is to split the dataset into key ranges based on a Solr query.

The main issue with this is that keys which do not yet exist in Solr cannot appear in the key range query, so new data resources may be silently left outside of a full process and a full reindex.

Another issue is that each thread operates only on the key range it was given. A thread that finishes before the others cannot take over any of the work the other threads are still doing; it simply completes, leaving hardware resources idle for the remainder of the run.

A common pattern when accessing databases is to use a Producer/Consumer/Writer pattern.

The pattern starts with one or more Producers, typically just one, though two or more may work in some circumstances. The Producer opens a long-lived read connection to the database to fetch source records and feeds them onto a blocking queue whose size is tuned to always hold enough records to keep the Consumers busy, without requiring a large block of memory to be assigned to it.
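As a rough illustration, here is a minimal sketch of the Producer side in Scala, using a bounded `java.util.concurrent.BlockingQueue`. The `Record` shape, the `Sentinel.Poison` end-of-stream marker and the class names are all hypothetical placeholders, not existing biocache types:

```scala
import java.util.concurrent.BlockingQueue

// Hypothetical record shape; the real biocache model is richer.
case class Record(rowKey: String, fields: Map[String, String])

object Sentinel {
  // Shared "poison pill" used to signal end-of-stream between threads.
  val Poison = Record("__END__", Map.empty)
}

class Producer(source: Iterator[Record],
               queue: BlockingQueue[Record]) extends Runnable {
  // put() blocks while the queue is full, so the Producer can never race
  // ahead of the Consumers and the queue's memory footprint stays bounded.
  def run(): Unit = source.foreach(queue.put)
}
```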

Each Consumer takes records from the queue and processes them, typically one record at a time, with the number of Consumer threads matched to the hardware the process runs on, rather than batching many records to each Consumer.
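Continuing the sketch above, a Consumer might look like the following: it pulls one record at a time and exits when it sees the shared poison pill (again, every name here is a placeholder):

```scala
import java.util.concurrent.BlockingQueue
import Sentinel.Poison

class Consumer(in: BlockingQueue[Record],
               out: BlockingQueue[Record],
               process: Record => Record) extends Runnable {
  def run(): Unit = {
    var rec = in.take()          // blocks until a record is available
    while (rec ne Poison) {
      out.put(process(rec))      // one record at a time keeps heap use small
      rec = in.take()            // an idle Consumer simply takes the next record
    }
  }
}
```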

Once a Consumer has processed a record, it places it on a second blocking queue for one or more Writers to pull from and write to the database or index being updated. The number of Writers and the size of this queue can likewise be tuned to the hardware the process is operating on, to ensure it is used efficiently. It may still be beneficial to have multiple physical partial indexes, but having the Writers read off a blocking queue enables other possibilities, including snapshotting/transactioning an update to a live SolrCloud to avoid manual operations afterwards.
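The Writer end, and the wiring that ties the three stages together, might be sketched as below. The queue capacities and thread counts are placeholder tuning knobs (not recommendations), and shutdown is coordinated by the driver injecting one poison pill per downstream thread once each upstream stage has drained:

```scala
import java.util.concurrent.{ArrayBlockingQueue, BlockingQueue}
import Sentinel.Poison

class Writer(in: BlockingQueue[Record], write: Record => Unit) extends Runnable {
  def run(): Unit = {
    var rec = in.take()
    while (rec ne Poison) {
      write(rec)                 // e.g. buffer into a Solr client, committing in batches
      rec = in.take()
    }
  }
}

object PipelineDriver {
  // Queue capacities and thread counts are the tuning knobs discussed above.
  def run(source: Iterator[Record],
          process: Record => Record,
          write: Record => Unit,
          consumerCount: Int = 8,
          writerCount: Int = 2): Unit = {
    val toProcess = new ArrayBlockingQueue[Record](1000)
    val toWrite   = new ArrayBlockingQueue[Record](1000)
    val producer  = new Thread(new Producer(source, toProcess))
    val consumers = (1 to consumerCount).map(_ => new Thread(new Consumer(toProcess, toWrite, process)))
    val writers   = (1 to writerCount).map(_ => new Thread(new Writer(toWrite, write)))

    (Seq(producer) ++ consumers ++ writers).foreach(_.start())

    producer.join()                                           // all source records are now queued
    (1 to consumerCount).foreach(_ => toProcess.put(Poison))  // one pill per Consumer
    consumers.foreach(_.join())
    (1 to writerCount).foreach(_ => toWrite.put(Poison))      // one pill per Writer
    writers.foreach(_.join())
  }
}
```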

This has a number of advantages over the current system. In particular, the combination of bounded blocking queues and single-record processing minimises OutOfMemory (OOM) exceptions by removing the requirement to hold large numbers of records in memory. In the current batch system, where batches are tuned to be very large and contiguous within data records, multiple large data records from a single data resource can trigger OOM exceptions under the right conditions.

Even a single OOM corrupts the index in the current pattern, as the thread that suffered the OOM owns the indexing for its key range and the other threads cannot take up its unprocessed records. Even if the Solr index created by the failed thread is somehow cleaned up consistently, the result will still be missing records, requiring another full reindex to fix.
