Issue when using corenlp for processing large corpus #1169

Closed
HiXiaochen opened this issue Jul 12, 2021 · 4 comments

Comments

@HiXiaochen

I divided a large English corpus into several subsets and ran multiple CoreNLP commands simultaneously, but the following error always occurs after a while:
"""
Exception in thread "main" java.lang.RuntimeException: Error making document
at edu.stanford.nlp.coref.CorefSystem.annotate(CorefSystem.java:55)
at edu.stanford.nlp.pipeline.CorefAnnotator.annotate(CorefAnnotator.java:160)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:641)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:651)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1249)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1083)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1366)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1418)
Caused by: java.lang.IllegalArgumentException
at edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:730)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:325)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1103)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1084)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.<init>(GraphRelation.java:310)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:310)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:337)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.<init>(NodePattern.java:332)
at edu.stanford.nlp.semgraph.semgrex.NodePattern.matcher(NodePattern.java:293)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.<init>(CoordinationPattern.java:146)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern.matcher(CoordinationPattern.java:120)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:356)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:455)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:572)
at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
at edu.stanford.nlp.coref.data.Mention.findDependentVerb(Mention.java:1099)
at edu.stanford.nlp.coref.data.Mention.setDiscourse(Mention.java:318)
at edu.stanford.nlp.coref.data.Mention.process(Mention.java:235)
at edu.stanford.nlp.coref.data.Mention.process(Mention.java:241)
at edu.stanford.nlp.coref.data.DocumentPreprocessor.fillMentionInfo(DocumentPreprocessor.java:341)
at edu.stanford.nlp.coref.data.DocumentPreprocessor.initializeMentions(DocumentPreprocessor.java:169)
at edu.stanford.nlp.coref.data.DocumentPreprocessor.preprocess(DocumentPreprocessor.java:62)
at edu.stanford.nlp.coref.data.DocumentMaker.makeDocument(DocumentMaker.java:92)
at edu.stanford.nlp.coref.data.DocumentMaker.makeDocument(DocumentMaker.java:64)
at edu.stanford.nlp.coref.CorefSystem.annotate(CorefSystem.java:53)
... 8 more
"""
Is this due to memory constraints?
My parameter setting is:
"java -mx64g -cp "$DATA/corenlp/stanford-corenlp-4.1.0/" edu.stanford.nlp.pipeline.StanfordCoreNLP $"
and my command is:
sh ./corenlp.sh -fileList $DATA/${SPLIT}_path.txt
-outputDirectory $DATA/output -outputFormat json
-annotators tokenize,ssplit,pos,lemma,ner,depparse,parse,coref
Also, what should I set the -mx parameter to?
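
The trace above shows the exception propagating from processFiles up to main, so a single problematic document stops the whole -fileList run. One possible way to find that document is to process the files one at a time and record which ones throw. The following is only a minimal sketch: the class name FailingDocFinder, the method findFailures, and the per-file callback are hypothetical names made up for illustration and are not part of CoreNLP. The placeholder callback in main only reads each file; in a real run it would be replaced by a call into the pipeline (for example the StanfordCoreNLP.annotate method that appears in the trace).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper, not part of CoreNLP: runs a per-file action on each
// path and collects the files whose processing threw a RuntimeException.
class FailingDocFinder {

    static Map<Path, RuntimeException> findFailures(List<Path> files,
                                                    Consumer<Path> processor) {
        // LinkedHashMap keeps failures in the order the files were processed.
        Map<Path, RuntimeException> failures = new LinkedHashMap<>();
        for (Path file : files) {
            try {
                processor.accept(file);
            } catch (RuntimeException e) {
                // Record the failure and continue with the next file.
                failures.put(file, e);
            }
        }
        return failures;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be a file list with one path per line,
        // like the file passed to -fileList.
        List<Path> files = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            if (!line.trim().isEmpty()) {
                files.add(Paths.get(line.trim()));
            }
        }

        // Placeholder action: only reads the file. Swap in the real
        // annotation call here to reproduce the coref failure.
        Consumer<Path> processor = path -> {
            try {
                new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };

        Map<Path, RuntimeException> failures = findFailures(files, processor);
        for (Map.Entry<Path, RuntimeException> entry : failures.entrySet()) {
            System.err.println("FAILED: " + entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

The files reported as FAILED would be the ones worth attaching to a bug report, and the rest of the subset still gets processed.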

@AngledLuffa
Contributor

AngledLuffa commented Jul 12, 2021 via email

I think that is likely to be enough memory, unless the documents are truly huge. Can you send us a document which causes the problem so we can reproduce it?

@HiXiaochen
Author

Thanks for your reply!
I have sent the document as well as the result of my corenlp to [email protected].

@AngledLuffa
Contributor

https://nlp.stanford.edu/software/stanford-corenlp-4.5.0b.zip might have a fix for this issue?

@AngledLuffa
Contributor

#1296 seems fixed, and this should be the same issue
