
Bulk re-index terminates early and then can't merge #146

Open
charvolant opened this issue Jul 6, 2016 · 10 comments

@charvolant
Contributor

The bulk-indexer sometimes seems to stop indexing before all records have been processed:

2016-07-05 15:16:26,032 INFO : [BulkProcessor] - [Indexer Thread 5] 52465000 >> Last key : dr376|CANB|CANB582931.1, records per sec: 248.57074
2016-07-05 15:16:30,002 INFO : [BulkProcessor] - [Indexer Thread 5] 52466000 >> Last key : dr376|CANB|CANB583877.1, records per sec: 251.88916
2016-07-05 15:16:34,504 INFO : [BulkProcessor] - [Indexer Thread 5] 52467000 >> Last key : dr376|CANB|CANB584821.1, records per sec: 222.1235
2016-07-05 15:16:38,381 INFO : [SolrIndexDAO] - Performing index commit....
2016-07-05 15:16:39,212 INFO : [SolrIndexDAO] - Performing index commit....done
2016-07-05 15:16:39,261 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-05 15:16:39,274 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-05 15:16:39,275 INFO : [SolrIndexDAO] - Optimising the indexing...
2016-07-05 15:35:22,488 INFO : [SolrIndexDAO] - Shutting down the indexing...
2016-07-05 15:35:22,493 INFO : [SolrIndexDAO] - Finalise finished.
2016-07-05 15:35:22,493 INFO : [IndexRunner] - Total indexing time for this thread 2285.6086 minutes.
2016-07-05 15:35:22,493 INFO : [BulkProcessor] - Merging index segments
2016-07-05 15:35:22,493 INFO : [IndexMergeTool] - Merging to directory:  /data/biocache-reindex/solr/merged
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-0/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-1/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-2/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-3/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-4/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-5/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-6/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-7/data/index
2016-07-05 15:35:22,512 INFO : [IndexMergeTool] - Adding indexes...
Exception in thread "main" org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/data/biocache-reindex/solr-create/biocache-thread-2/data/index/write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    at org.apache.lucene.index.IndexWriter.acquireWriteLocks(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2526)
    at au.org.ala.biocache.index.IndexMergeTool$.merge(BulkProcessor.scala:256)
    at au.org.ala.biocache.index.BulkProcessor$.main(BulkProcessor.scala:164)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:134)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)

or

2016-07-02 17:12:51,280 INFO : [BulkProcessor] - [Indexer Thread 1] 25037000 >> Last key : dr2009|URN:CornellLabOfOrnithology:EBIRD:OBS292868593, records per sec: 224.97186
2016-07-02 17:13:00,129 INFO : [BulkProcessor] - [Indexer Thread 1] 25038000 >> Last key : dr2009|URN:CornellLabOfOrnithology:EBIRD:OBS292981770, records per sec: 113.00712
2016-07-02 17:13:04,440 INFO : [BulkProcessor] - [Indexer Thread 1] 25039000 >> Last key : dr2009|URN:CornellLabOfOrnithology:EBIRD:OBS293014172, records per sec: 231.96475
2016-07-02 17:13:06,151 INFO : [SolrIndexDAO] - Performing index commit....
2016-07-02 17:13:08,554 INFO : [SolrIndexDAO] - Performing index commit....done
2016-07-02 17:13:08,592 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-02 17:13:08,598 INFO : [SolrIndexDAO] - >>>> Document count of index: 7717688
2016-07-02 17:13:08,598 INFO : [SolrIndexDAO] - Optimising the indexing...
2016-07-02 17:20:15,928 INFO : [SolrIndexDAO] - Shutting down the indexing...
2016-07-02 17:20:15,953 INFO : [SolrIndexDAO] - Finalise finished.
2016-07-02 17:20:15,953 INFO : [IndexRunner] - Total indexing time for this thread 995.2349 minutes.
2016-07-02 17:39:21,885 INFO : [SolrIndexDAO] - Shutting down the indexing...
2016-07-02 17:39:21,888 INFO : [SolrIndexDAO] - Finalise finished.
2016-07-02 17:39:21,888 INFO : [IndexRunner] - Total indexing time for this thread 1014.33386 minutes.
2016-07-02 17:39:21,888 INFO : [BulkProcessor] - Merging index segments
2016-07-02 17:39:21,888 INFO : [IndexMergeTool] - Merging to directory:  /data/biocache-reindex/solr/merged
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-0/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-1/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-2/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-3/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-4/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-5/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-6/data/index
Directory included in merge: /data/biocache-reindex/solr-create/biocache-thread-7/data/index
2016-07-02 17:39:21,900 INFO : [IndexMergeTool] - Adding indexes...
Exception in thread "main" org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/data/biocache-reindex/solr-create/biocache-thread-2/data/index/write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    at org.apache.lucene.index.IndexWriter.acquireWriteLocks(IndexWriter.java:2472)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2526)
    at au.org.ala.biocache.index.IndexMergeTool$.merge(BulkProcessor.scala:256)
    at au.org.ala.biocache.index.BulkProcessor$.main(BulkProcessor.scala:164)
    at au.org.ala.biocache.cmd.CMD2$.main(CMD2.scala:134)
    at au.org.ala.biocache.cmd.CMD2.main(CMD2.scala)

At this point the re-process hangs.
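One way to avoid hitting the merge failure above would be to check the shard directories for a leftover `write.lock` before starting the merge: a leftover lock usually means an indexing thread died without closing its `IndexWriter`. This is only an illustrative sketch; `PreMergeCheck` and its method names are hypothetical, not part of biocache-store:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-merge sanity check: a leftover Lucene write.lock in a
// shard directory usually means the indexing thread died without closing
// its IndexWriter, so merging that shard would fail with the
// LockObtainFailedException seen above.
public class PreMergeCheck {
    /** Returns the shard directories that still hold a Lucene write.lock. */
    public static List<Path> findLockedShards(Path solrCreateDir) throws IOException {
        List<Path> locked = new ArrayList<>();
        try (DirectoryStream<Path> shards =
                 Files.newDirectoryStream(solrCreateDir, "biocache-thread-*")) {
            for (Path shard : shards) {
                if (Files.exists(shard.resolve("data/index/write.lock"))) {
                    locked.add(shard);
                }
            }
        }
        return locked;
    }

    public static void main(String[] args) throws IOException {
        Path base = Paths.get(args.length > 0 ? args[0] : "/data/biocache-reindex/solr-create");
        List<Path> locked = findLockedShards(base);
        if (!locked.isEmpty()) {
            System.err.println("Refusing to merge; locked shards: " + locked);
            System.exit(1);
        }
    }
}
```

Running this against `/data/biocache-reindex/solr-create` before the merge would turn the late `LockObtainFailedException` into an early, explicit failure.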

@charvolant
Contributor Author

The hang is possibly the same as the one in #144.

@ansell
Contributor

ansell commented Jul 6, 2016

In the past I have seen the Lucene lock-obtain failure when a thread crashes for any reason: the exception isn't propagated, just silently logged and ignored, even though the index is corrupted or incomplete. I filed issue #124 for a similar case to this.
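The silently-swallowed worker failure described here can be avoided by running the indexer threads as `Callable`s and joining on their `Future`s, which rethrows the first worker failure in the driver so the job can abort before attempting a merge. A minimal sketch, assuming the workers can be expressed as `Callable<Void>` (this is not the actual `IndexRunner` API):

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the fix for the failure mode described above: if each indexer
// runs as a bare Thread whose run() catches-and-logs, the driver never
// learns that a shard died. Submitting the workers as Callables and calling
// Future.get() rethrows the first worker exception in the driver instead.
public class FailFastIndexing {
    public static void runAll(List<Callable<Void>> workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, workers.size()));
        try {
            List<Future<Void>> futures = pool.invokeAll(workers);
            for (Future<Void> f : futures) {
                f.get(); // rethrows ExecutionException wrapping the worker's failure
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With this shape, a crashed indexer thread fails the whole job immediately rather than leaving a locked, partial shard for the merge step to trip over.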

@charvolant
Contributor Author

So possibly an OutOfMemoryError? That would fit with it appearing when more indexing is required.

@ansell
Contributor

ansell commented Jul 6, 2016

Yes, OOM is what I originally found, but connection errors like #144 may also trigger it if they occur inside the threads doing the indexing. The log file should show the root cause, but it will be buried somewhere in the middle because the exception doesn't crash the job early, as would be preferred.

@charvolant
Contributor Author

Ah-ha,

Exception in thread "Thread-5" java.lang.RuntimeException: exception while unregistering MBean, com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
    at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:78)
    at org.scale7.cassandra.pelops.pool.PooledNode.<init>(PooledNode.java:68)
    at org.scale7.cassandra.pelops.pool.CommonsBackedPool.addNode(CommonsBackedPool.java:415)
    at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:137)
    at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:88)
    at org.scale7.cassandra.pelops.Pelops.addPool(Pelops.java:62)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.initialise(CassandraPersistenceManager.scala:55)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.getColumnsFromRowsWithRetries(CassandraPersistenceManager.scala:380)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOver(CassandraPersistenceManager.scala:346)
    at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOverAll(CassandraPersistenceManager.scala:470)
    at au.org.ala.biocache.index.IndexRunner.run(IndexRecordMultiThreaded.scala:466)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.management.InstanceNotFoundException: com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.exclusiveUnregisterMBean(DefaultMBeanServerInterceptor.java:427)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.unregisterMBean(DefaultMBeanServerInterceptor.java:415)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.unregisterMBean(JmxMBeanServer.java:546)
    at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:75)
    ... 11 more

@ansell
Contributor

ansell commented Jul 6, 2016

That looks like it fits the behaviour I saw, as it is being thrown as a RuntimeException, which may be escaping through the checked-Exception case at:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/index/IndexRecordMultiThreaded.scala#L481

@ansell
Contributor

ansell commented Jul 6, 2016

Actually, that looks like it may have happened just before the try block, further up in CassandraPersistenceManager.

@adam-collins
Contributor

349791a adds more logging and makes retries more robust. Hopefully this helps.
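The retry-hardening could look something like the following: retry the flaky call with exponential backoff and rethrow once attempts are exhausted, rather than logging and carrying on with partial data. `retryWithBackoff` is a hypothetical helper, not the code actually added in 349791a:

```java
import java.util.concurrent.Callable;

// Sketch of a robust retry: back off exponentially between attempts, and
// fail loudly once attempts are exhausted instead of swallowing the error
// and continuing with a partial result.
public class Retries {
    public static <T> T retryWithBackoff(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff between attempts
                }
            }
        }
        throw last; // rethrow the final failure so the driver can abort
    }
}
```

The key design point is the final rethrow: a transient Cassandra hiccup is absorbed, but a persistent failure still kills the job early instead of leaving an incomplete shard.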

@djtfmartin
Member

djtfmartin commented Apr 24, 2018

Can this be closed @charvolant @ansell?

@ansell
Contributor

ansell commented Apr 25, 2018

The change may prevent the exception from escaping by suppressing it, but would that cause the creation of a partial index?
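One way to answer that question mechanically: before merging, compare each shard's indexed document count against the number of records its thread reported processing, and refuse to merge on a shortfall. A sketch, assuming both counts are available as inputs (in the real tool they would come from the Lucene index and the indexer's own tally; `ShardSanity` is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical guard against merging a partial index: even if a worker's
// exception was suppressed, a shard whose document count falls short of the
// records its thread processed is incomplete and should block the merge.
public class ShardSanity {
    /** Returns shard names whose indexed count is lower than the processed count. */
    public static List<String> incompleteShards(Map<String, Long> indexed,
                                                Map<String, Long> processed) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Long> e : processed.entrySet()) {
            long have = indexed.getOrDefault(e.getKey(), 0L);
            if (have < e.getValue()) bad.add(e.getKey());
        }
        return bad;
    }
}
```

A non-empty result would mean the suppression did leave a partial index behind, answering the question above without waiting for the merge to fail.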
