IndexOutOfBoundsException via org.wikidata.wdtk.storage.datastructures.BitVectorImpl.assertRange(BitVectorImpl.java:169) #158

Open
mjw99 opened this issue Jul 29, 2015 · 2 comments

mjw99 commented Jul 29, 2015

Dear Devs,

I'm making use of the wikidata-toolkit in a little project at https://bitbucket.org/mjw99/wikidatachemscraper/overview . Essentially, I'm trying to harvest all chemical structure diagrams that are in the SVG format and have a Standard InChI Key associated with them.

Following the example on the splash screen of the project, I see the following exception:

2015-07-28 17:46:02 INFO  - [statistics] Namespaces: {0=, 1=Talk, 2=User, 3=User talk, 4=Wikidata, 5=Wikidata talk, 6=File, 7=File talk, 8=MediaWiki, 829=Module talk, 9=MediaWiki talk, 828=Module, 10=Template, 11=Template talk, 12=Help, 13=Help talk, 14=Category, 15=Category talk, 2600=Topic, -2=Media, -1=Special, 1199=Translations talk, 1198=Translations, 123=Query talk, 122=Query, 121=Property talk, 120=Property}
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Position 207239102 is out of bounds.
    at org.wikidata.wdtk.storage.datastructures.BitVectorImpl.assertRange(BitVectorImpl.java:169)
    at org.wikidata.wdtk.storage.datastructures.BitVectorImpl.getBit(BitVectorImpl.java:241)
    at org.wikidata.wdtk.dumpfiles.MwRevisionProcessorBroker.processRevision(MwRevisionProcessorBroker.java:138)
    at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processXmlRevision(MwRevisionDumpFileProcessor.java:427)
    at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processXmlPage(MwRevisionDumpFileProcessor.java:345)
    at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.tryProcessXmlPage(MwRevisionDumpFileProcessor.java:282)
    at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processXmlMediawiki(MwRevisionDumpFileProcessor.java:202)
    at org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor.processDumpFileContents(MwRevisionDumpFileProcessor.java:155)
    at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processDumpFile(DumpProcessingController.java:559)
    at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processDump(DumpProcessingController.java:495)
    at org.wikidata.wdtk.dumpfiles.DumpProcessingController.processMostRecentMainDump(DumpProcessingController.java:444)
    at name.mjw.wikidatachemscraper.StdInChIKeyExample.main(StdInChIKeyExample.java:30)
    ... 6 more

This worked with an older dump (early 2014), but this dump is no longer available.

Any ideas?

Thanks,

Mark

@mkroetzsch
Member

This is caused by our bitvector implementation not growing automatically in that release. Because the number of revisions has grown beyond the (hardcoded) bounds, you are seeing an out-of-bounds error.

If you only need the current content of Wikidata, you should switch to the JSON dumps if at all possible. They are smaller files, are parsed more reliably, and are faster to process. The XML dump is only needed if you want revision data (user id, revision time, etc.) or a historic record of all past revisions (this is very large and will keep growing, so it takes a long time to parse).

Nevertheless, we should also fix the problem by having the bitvector grow automatically. In fact, it might be that this is already implemented and just not part of the release you are using. Alternatively, it might be that our implementation supports bitvector growth in principle but fails to trigger it in your case. We will need to investigate this.
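
To illustrate the idea of automatic growth, here is a minimal sketch of a bit vector that enlarges its backing array on write and treats reads beyond the current size as unset instead of throwing. It is not the WDTK `BitVectorImpl`: the class name `GrowableBitVector`, its methods, and the doubling strategy are all assumptions made for this example.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch of a self-growing bit vector backed by a long[].
 * Not the WDTK BitVectorImpl; names and growth policy are assumptions.
 */
final class GrowableBitVector {
    private long[] words; // 64 bits per array element
    private long size;    // number of addressable bits

    GrowableBitVector(long initialSize) {
        if (initialSize < 0) {
            throw new IllegalArgumentException("Negative size: " + initialSize);
        }
        this.size = initialSize;
        this.words = new long[wordsFor(initialSize)];
    }

    /** Number of 64-bit words needed to hold the given number of bits. */
    private static int wordsFor(long bits) {
        return (int) ((bits + 63) >>> 6);
    }

    long size() {
        return size;
    }

    /** Reads a bit; positions at or beyond the current size read as unset. */
    boolean getBit(long position) {
        checkNotNegative(position);
        if (position >= size) {
            return false;
        }
        return (words[(int) (position >>> 6)] & (1L << (int) (position & 63))) != 0;
    }

    /** Writes a bit, growing the vector first if the position is out of range. */
    void setBit(long position, boolean value) {
        checkNotNegative(position);
        if (position >= size) {
            growTo(position + 1);
        }
        int index = (int) (position >>> 6);
        long mask = 1L << (int) (position & 63);
        if (value) {
            words[index] |= mask;
        } else {
            words[index] &= ~mask;
        }
    }

    /** Enlarges the backing array, at least doubling it to amortize copies. */
    private void growTo(long newSize) {
        int needed = wordsFor(newSize);
        if (needed > words.length) {
            words = Arrays.copyOf(words, Math.max(needed, words.length * 2));
        }
        size = newSize;
    }

    private static void checkNotNegative(long position) {
        if (position < 0) {
            throw new IndexOutOfBoundsException("Position " + position + " is out of bounds.");
        }
    }
}
```

With this design, a lookup for a revision id larger than anything seen so far simply reports "not set" rather than failing. The standard `java.util.BitSet` behaves similarly (it grows on `set` and returns `false` from `get` past its length), though it is indexed by `int` rather than `long`.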

@mkroetzsch mkroetzsch added the bug label Jul 29, 2015

mjw99 commented Aug 6, 2015

Dear Marcus,
Apologies for the delay in replying. Thank you for the explanation and I will follow your advice re switching to JSON dumps.

Thanks,

Mark
