
Bulk re-index terminates early and then can't merge #146

Open
charvolant opened this issue Jul 6, 2016 · 10 comments

@charvolant
Contributor

The bulk-indexer sometimes seems to stop indexing before all records have been processed:

2016-07-05 15:16:26,032 INFO : [BulkProcessor] - [Indexer Thread 5] 52465000 >> Last key : dr376|CANB|CANB582931.1, records per sec: 248.57074
2016-07-05 15:16:30,002 INFO : [BulkProcessor] - [Indexer Thread 5] 52466000 >> Last key : dr376|CANB|CANB583877.1, records per sec: 251.88916
2016-07-05 15:16:34,504 INFO : [BulkProcessor] - [Indexer Thread 5] 52467000 >> Last key : dr376|CANB|CANB584821.1, records per sec: 222.1235
2016-07-05 15:16:38,381 INFO : [SolrIndexDAO] - Performing index commit....
2016-07-05 15:16:39,212 INFO : [SolrIndexDAO] - Performing index commit....done
2016-07-05 15:16:39,261 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-05 15:16:39,274 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-05 15:16:39,275 INFO : [SolrIndexDAO] - Optimising the indexing...
2016-07-05 15:35:22,488 INFO : [SolrIndexDAO] - Shutting down the indexing...
2016-07-05 15:35:22,493 INFO : [SolrIndexDAO] - Finalise finished.
2016-07-05 15:35:22,493 INFO : [IndexRunner] - Total indexing time for this thread 2285.6086 minutes.
2016-07-05 15:35:22,493 INFO : [BulkProcessor] - Merging index segments
2016-07-05 15:35:22,493 INFO : [IndexMergeTool] - Merging to directory:  /data/biocache-reindex/solr/merged
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-0/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-1/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-2/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-3/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-4/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-5/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-6/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-7/data/index
2016-07-05 15:35:22,512 INFO : [IndexMergeTool] - Adding indexes...
Exception in thread "main" org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/data/biocache-reindex/solr-create/biocache-thread-2/data/index/write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    at org.apache.lucene.index.IndexWriter.acquireWriteLocks(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2526)
    at au.org.ala.biocache.index.IndexMergeTool$.merge(BulkProcessor.scala:256)
    at au.org.ala.biocache.index.BulkProcessor$.main(BulkProcessor.scala:164)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:134)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)

or

2016-07-02 17:12:51,280 INFO : [BulkProcessor] - [Indexer Thread 1] 25037000 >> Last key : dr2009|URN:CornellLabOfOrnithology:EBIRD:OBS292868593, records per sec: 224.97186
2016-07-02 17:13:00,129 INFO : [BulkProcessor] - [Indexer Thread 1] 25038000 >> Last key : dr2009|URN:CornellLabOfOrnithology:EBIRD:OBS292981770, records per sec: 113.00712
2016-07-02 17:13:04,440 INFO : [BulkProcessor] - [Indexer Thread 1] 25039000 >> Last key : dr2009|URN:CornellLabOfOrnithology:EBIRD:OBS293014172, records per sec: 231.96475
2016-07-02 17:13:06,151 INFO : [SolrIndexDAO] - Performing index commit....
2016-07-02 17:13:08,554 INFO : [SolrIndexDAO] - Performing index commit....done
2016-07-02 17:13:08,592 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-02 17:13:08,598 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-02 17:13:08,598 INFO : [SolrIndexDAO] - Optimising the indexing...
2016-07-02 17:20:15,928 INFO : [SolrIndexDAO] - Shutting down the indexing...
2016-07-02 17:20:15,953 INFO : [SolrIndexDAO] - Finalise finished.
2016-07-02 17:20:15,953 INFO : [IndexRunner] - Total indexing time for this thread 995.2349 minutes.
2016-07-02 17:39:21,885 INFO : [SolrIndexDAO] - Shutting down the indexing...
2016-07-02 17:39:21,888 INFO : [SolrIndexDAO] - Finalise finished.
2016-07-02 17:39:21,888 INFO : [IndexRunner] - Total indexing time for this thread 1014.33386 minutes.
2016-07-02 17:39:21,888 INFO : [BulkProcessor] - Merging index segments
2016-07-02 17:39:21,888 INFO : [IndexMergeTool] - Merging to directory:  /data/biocache-reindex/solr/merged
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-0/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-1/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-2/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-3/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-4/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-5/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-6/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-7/data/index
2016-07-02 17:39:21,900 INFO : [IndexMergeTool] - Adding indexes...
Exception in thread "main" org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/data/biocache-reindex/solr-create/biocache-thread-2/data/index/write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    at org.apache.lucene.index.IndexWriter.acquireWriteLocks(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2526)
    at au.org.ala.biocache.index.IndexMergeTool$.merge(BulkProcessor.scala:256)
    at au.org.ala.biocache.index.BulkProcessor$.main(BulkProcessor.scala:164)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:134)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)

At this point the re-process hangs.
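One way to avoid hitting the merge failure above would be to check the shard directories for a leftover `write.lock` before starting the merge: a leftover lock usually means an indexing thread died without closing its `IndexWriter`. This is only an illustrative sketch; `PreMergeCheck` and its method names are hypothetical, not part of biocache-store:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-merge sanity check: a leftover Lucene write.lock in a
// shard directory usually means the indexing thread died without closing
// its IndexWriter, so merging that shard would fail with the
// LockObtainFailedException seen above.
public class PreMergeCheck {
    /** Returns the shard directories that still hold a Lucene write.lock. */
    public static List<Path> findLockedShards(Path solrCreateDir) throws IOException {
        List<Path> locked = new ArrayList<>();
        try (DirectoryStream<Path> shards =
                 Files.newDirectoryStream(solrCreateDir, "biocache-thread-*")) {
            for (Path shard : shards) {
                if (Files.exists(shard.resolve("data/index/write.lock"))) {
                    locked.add(shard);
                }
            }
        }
        return locked;
    }

    public static void main(String[] args) throws IOException {
        Path base = Paths.get(args.length > 0 ? args[0] : "/data/biocache-reindex/solr-create");
        List<Path> locked = findLockedShards(base);
        if (!locked.isEmpty()) {
            System.err.println("Refusing to merge; locked shards: " + locked);
            System.exit(1);
        }
    }
}
```

Running this against `/data/biocache-reindex/solr-create` before the merge would turn the late `LockObtainFailedException` into an early, explicit failure.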

@charvolant
Contributor Author

The hang is possibly the same as the one in #144.

@ansell
Contributor

ansell commented Jul 6, 2016

In the past I have seen the Lucene lock-obtain failure when a thread crashes for any reason: the exception isn't propagated, just silently logged and ignored, even though the index is corrupted or incomplete. I filed issue #124 for a similar case to this.
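The silently-swallowed worker failure described here can be avoided by running the indexer threads as `Callable`s and joining on their `Future`s, which rethrows the first worker failure in the driver so the job can abort before attempting a merge. A minimal sketch, assuming the workers can be expressed as `Callable<Void>` (this is not the actual `IndexRunner` API):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the fix for the failure mode described above: if each indexer
// runs as a bare Thread whose run() catches-and-logs, the driver never
// learns that a shard died. Submitting the workers as Callables and calling
// Future.get() rethrows the first worker exception in the driver instead.
public class FailFastIndexing {
    public static void runAll(List<Callable<Void>> workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, workers.size()));
        try {
            List<Future<Void>> futures = pool.invokeAll(workers);
            for (Future<Void> f : futures) {
                f.get(); // rethrows ExecutionException wrapping the worker's failure
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With this shape, a crashed indexer thread fails the whole job immediately rather than leaving a locked, partial shard for the merge step to trip over.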

@charvolant
Contributor Author

So possibly an OutOfMemoryError? That would fit with it appearing when more indexing is required.

@ansell
Contributor

ansell commented Jul 6, 2016

Yes, OOM is what I originally found, but connection errors like #144 may also trigger it if they occur inside the threads doing the indexing. The log file should show the root cause, but it will be buried somewhere in the middle because the exception doesn't crash the job early, as would be preferred.

@charvolant
Contributor Author

Ah-ha,

Exception in thread "Thread-5" java.lang.RuntimeException: exception while unregistering MBean, com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
    at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:78)
    at org.scale7.cassandra.pelops.pool.PooledNode.<init>(PooledNode.java:68)
    at org.scale7.cassandra.pelops.pool.CommonsBackedPool.addNode(CommonsBackedPool.java:415)
    at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:137)
    at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:88)
    at org.scale7.cassandra.pelops.Pelops.addPool(Pelops.java:62)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.initialise(CassandraPersistenceManager.scala:55)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.getColumnsFromRowsWithRetries(CassandraPersistenceManager.scala:380)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOver(CassandraPersistenceManager.scala:346)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOverAll(CassandraPersistenceManager.scala:470)
    at au.org.ala.biocache.index.IndexRunner.run(IndexRecordMultiThreaded.scala:466)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.management.InstanceNotFoundException: com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.exclusiveUnregisterMBean(DefaultMBeanServerInterceptor.java:427)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.unregisterMBean(DefaultMBeanServerInterceptor.java:415)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.unregisterMBean(JmxMBeanServer.java:546)
    at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:75)
    ... 11 more

@ansell
Contributor

ansell commented Jul 6, 2016

That looks like it fits the behaviour I saw, as it is being thrown as a RuntimeException, which may be escaping through the checked-Exception case at:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/index/IndexRecordMultiThreaded.scala#L481

@ansell
Contributor

ansell commented Jul 6, 2016

Actually, that looks like it may have happened just before the try block, further up in CassandraPersistenceManager.

@adam-collins
Contributor

349791a adds more logging and makes retries more robust. Hopefully this helps.
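The retry-hardening could look something like the following: retry the flaky call with exponential backoff and rethrow once attempts are exhausted, rather than logging and carrying on with partial data. `retryWithBackoff` is a hypothetical helper, not the code actually added in 349791a:

```java
import java.util.concurrent.Callable;

// Sketch of a robust retry: back off exponentially between attempts, and
// fail loudly once attempts are exhausted instead of swallowing the error
// and continuing with a partial result.
public class Retries {
    public static <T> T retryWithBackoff(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff between attempts
                }
            }
        }
        throw last; // rethrow the final failure so the driver can abort
    }
}
```

The key design point is the final rethrow: a transient Cassandra hiccup is absorbed, but a persistent failure still kills the job early instead of leaving an incomplete shard.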

@djtfmartin
Member

djtfmartin commented Apr 24, 2018

Can this be closed @charvolant @ansell?

@ansell
Contributor

ansell commented Apr 25, 2018

The change may prevent the exception from escaping by suppressing it, but would that cause the creation of a partial index?
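One way to answer that question mechanically: before merging, compare each shard's indexed document count against the number of records its thread reported processing, and refuse to merge on a shortfall. A sketch, assuming both counts are available as inputs (in the real tool they would come from the Lucene index and the indexer's own tally; `ShardSanity` is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical guard against merging a partial index: even if a worker's
// exception was suppressed, a shard whose document count falls short of the
// records its thread processed is incomplete and should block the merge.
public class ShardSanity {
    /** Returns shard names whose indexed count is lower than the processed count. */
    public static List<String> incompleteShards(Map<String, Long> indexed,
                                                Map<String, Long> processed) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Long> e : processed.entrySet()) {
            long have = indexed.getOrDefault(e.getKey(), 0L);
            if (have < e.getValue()) bad.add(e.getKey());
        }
        return bad;
    }
}
```

A non-empty result would mean the suppression did leave a partial index behind, answering the question above without waiting for the merge to fail.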
