
Redesign biocache process and index to allow the use of Producer/Consumer/Writer queues #131

Open
ansell opened this issue May 23, 2016 · 0 comments

ansell (Contributor) commented May 23, 2016

The current strategy for process and index operations is to split the dataset into key ranges based on a Solr query.

The main issue with this is that keys which do not yet exist in Solr cannot appear in the key range query, so new data resources may be silently left outside of a full process and a full reindex.

Another issue is that each thread operates only on the key range it was given. A thread that finishes before the others cannot take over any of the work the other threads are still doing; it simply completes, leaving hardware resources idle for the remainder of the run.

A common pattern when accessing databases is to use a Producer/Consumer/Writer pattern.

The pattern starts with one or more Producers, typically just one, though two or more may work in some circumstances. The Producer opens a long-lived read connection to the database to fetch source records and feeds them onto a blocking queue whose size is tuned to always hold enough records to keep the Consumers busy, without requiring a large block of memory to be assigned to it.
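As a rough illustration, here is a minimal sketch of the Producer side in Scala, using a bounded `java.util.concurrent.BlockingQueue`. The `Record` shape, the `Sentinel.Poison` end-of-stream marker and the class names are all hypothetical placeholders, not existing biocache types:

```scala
import java.util.concurrent.BlockingQueue

// Hypothetical record shape; the real biocache model is richer.
case class Record(rowKey: String, fields: Map[String, String])

object Sentinel {
  // Shared "poison pill" used to signal end-of-stream between threads.
  val Poison = Record("__END__", Map.empty)
}

class Producer(source: Iterator[Record],
               queue: BlockingQueue[Record]) extends Runnable {
  // put() blocks while the queue is full, so the Producer can never race
  // ahead of the Consumers and the queue's memory footprint stays bounded.
  def run(): Unit = source.foreach(queue.put)
}
```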

Each Consumer takes records from the queue and processes them, typically one record at a time, with the number of Consumer threads matched to the hardware the process runs on, rather than batching many records to each Consumer.
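Continuing the sketch above, a Consumer might look like the following: it pulls one record at a time and exits when it sees the shared poison pill (again, every name here is a placeholder):

```scala
import java.util.concurrent.BlockingQueue
import Sentinel.Poison

class Consumer(in: BlockingQueue[Record],
               out: BlockingQueue[Record],
               process: Record => Record) extends Runnable {
  def run(): Unit = {
    var rec = in.take()          // blocks until a record is available
    while (rec ne Poison) {
      out.put(process(rec))      // one record at a time keeps heap use small
      rec = in.take()            // an idle Consumer simply takes the next record
    }
  }
}
```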

Once a Consumer has processed a record, it places it on a second blocking queue for one or more Writers to pull from and write to the database or index being updated. The number of Writers and the size of this queue can likewise be tuned to the hardware the process is operating on, to ensure it is used efficiently. It may still be beneficial to have multiple physical partial indexes, but having the Writers read off a blocking queue enables other possibilities, including snapshotting/transactioning an update to a live SolrCloud to avoid manual operations afterwards.
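The Writer end, and the wiring that ties the three stages together, might be sketched as below. The queue capacities and thread counts are placeholder tuning knobs (not recommendations), and shutdown is coordinated by the driver injecting one poison pill per downstream thread once each upstream stage has drained:

```scala
import java.util.concurrent.{ArrayBlockingQueue, BlockingQueue}
import Sentinel.Poison

class Writer(in: BlockingQueue[Record], write: Record => Unit) extends Runnable {
  def run(): Unit = {
    var rec = in.take()
    while (rec ne Poison) {
      write(rec)                 // e.g. buffer into a Solr client, committing in batches
      rec = in.take()
    }
  }
}

object PipelineDriver {
  // Queue capacities and thread counts are the tuning knobs discussed above.
  def run(source: Iterator[Record],
          process: Record => Record,
          write: Record => Unit,
          consumerCount: Int = 8,
          writerCount: Int = 2): Unit = {
    val toProcess = new ArrayBlockingQueue[Record](1000)
    val toWrite   = new ArrayBlockingQueue[Record](1000)
    val producer  = new Thread(new Producer(source, toProcess))
    val consumers = (1 to consumerCount).map(_ => new Thread(new Consumer(toProcess, toWrite, process)))
    val writers   = (1 to writerCount).map(_ => new Thread(new Writer(toWrite, write)))

    (Seq(producer) ++ consumers ++ writers).foreach(_.start())

    producer.join()                                           // all source records are now queued
    (1 to consumerCount).foreach(_ => toProcess.put(Poison))  // one pill per Consumer
    consumers.foreach(_.join())
    (1 to writerCount).foreach(_ => toWrite.put(Poison))      // one pill per Writer
    writers.foreach(_.join())
  }
}
```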

This has a number of advantages over the current system. In particular, the combination of bounded blocking queues and single-record processing minimises OutOfMemory (OOM) exceptions by removing the requirement to hold large numbers of records in memory. In the current batch system, where batches are tuned to be very large and contiguous within data records, multiple large data records from a single data resource can trigger OOM exceptions under the right conditions.

Even a single OOM corrupts the index in the current pattern, as the thread that suffered the OOM owns the indexing for its key range and the other threads cannot take up its unprocessed records. Even if the Solr index created by the failed thread is somehow cleaned up consistently, the result will still be missing records, requiring another full reindex to fix.
