Issue when using corenlp for processing large corpus #1169
Comments
I think it is likely that 64 GB will be enough memory, unless the documents are truly huge. Can you send us a document which causes the problem so we can reproduce it?
…On Sun, Jul 11, 2021, 8:09 PM LXCCC ***@***.***> wrote:
I divided the large English corpus into several subsets and ran multiple
CoreNLP commands simultaneously, but the following error always occurs
after a period of time:
"""
Exception in thread "main" java.lang.RuntimeException: Error making
document
at edu.stanford.nlp.coref.CorefSystem.annotate(CorefSystem.java:55)
at
edu.stanford.nlp.pipeline.CorefAnnotator.annotate(CorefAnnotator.java:160)
at
edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
at
edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:641)
at
edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:651)
at
edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1249)
at
edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1083)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1366)
at
edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1418)
Caused by: java.lang.IllegalArgumentException
at
edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:730)
at
edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:325)
at
edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1103)
at
edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.(GraphRelation.java:1084)
at
edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.(GraphRelation.java:310)
at
edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:310)
at
edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:337)
at
edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.(NodePattern.java:332)
at
edu.stanford.nlp.semgraph.semgrex.NodePattern.matcher(NodePattern.java:293)
at
edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.(CoordinationPattern.java:146)
at
edu.stanford.nlp.semgraph.semgrex.CoordinationPattern.matcher(CoordinationPattern.java:120)
at
edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:356)
at
edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:455)
at
edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:572)
at
edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
at edu.stanford.nlp.coref.data.Mention.findDependentVerb(Mention.java:1099)
at edu.stanford.nlp.coref.data.Mention.setDiscourse(Mention.java:318)
at edu.stanford.nlp.coref.data.Mention.process(Mention.java:235)
at edu.stanford.nlp.coref.data.Mention.process(Mention.java:241)
at
edu.stanford.nlp.coref.data.DocumentPreprocessor.fillMentionInfo(DocumentPreprocessor.java:341)
at
edu.stanford.nlp.coref.data.DocumentPreprocessor.initializeMentions(DocumentPreprocessor.java:169)
at
edu.stanford.nlp.coref.data.DocumentPreprocessor.preprocess(DocumentPreprocessor.java:62)
at
edu.stanford.nlp.coref.data.DocumentMaker.makeDocument(DocumentMaker.java:92)
at
edu.stanford.nlp.coref.data.DocumentMaker.makeDocument(DocumentMaker.java:64)
at edu.stanford.nlp.coref.CorefSystem.annotate(CorefSystem.java:53)
... 8 more
"""
Is this due to memory constraints?
My parameter setting is:
"java -mx64g -cp "$DATA/corenlp/stanford-corenlp-4.1.0/*"
edu.stanford.nlp.pipeline.StanfordCoreNLP $*"
and my command is:
sh ./corenlp.sh -fileList $DATA/${SPLIT}_path.txt
-outputDirectory $DATA/output -outputFormat json
-annotators tokenize,ssplit,pos,lemma,ner,depparse,parse,coref
Besides, what should I set the -mx parameter to?
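For reference, here is a minimal sketch of one way to run the splits one after another instead of simultaneously (the split names `part1`..`part3`, the `/tmp/corenlp-demo` default, and the 8 GB heap are hypothetical placeholders; the flags are the ones from the command above). Note that `-mx64g` is HotSpot's legacy shorthand for `-Xmx64g`. If you do run N JVMs at once, each heap must fit in RAM: very roughly (total RAM minus headroom) / N per process.

```shell
# Sketch: process corpus splits sequentially so the heaps never stack up.
# Split names and the DATA default are hypothetical; adjust for your setup.
DATA=${DATA:-/tmp/corenlp-demo}
mkdir -p "$DATA/output"
for SPLIT in part1 part2 part3; do
  LIST="$DATA/${SPLIT}_path.txt"
  [ -f "$LIST" ] || continue          # skip splits that have no file list
  java -Xmx8g -cp "$DATA/corenlp/stanford-corenlp-4.1.0/*" \
    edu.stanford.nlp.pipeline.StanfordCoreNLP \
    -fileList "$LIST" \
    -outputDirectory "$DATA/output" \
    -outputFormat json \
    -annotators tokenize,ssplit,pos,lemma,ner,depparse,parse,coref
done
```

If the IllegalArgumentException still appears with ample memory per process, it is likely not a memory problem at all.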
Thanks for your reply!
https://nlp.stanford.edu/software/stanford-corenlp-4.5.0b.zip might have a fix for this issue?
#1296 seems fixed, and this should be the same issue