
ecbPlus

The ECB+ corpus is an extension of the EventCorefBank (ECB; Bejan and Harabagiu, 2010). The newly added corpus component consists of 502 documents that belong to the 43 topics of the ECB but describe different seminal events than those already captured in the ECB. All corpus texts were found through Google Search and were annotated with mentions of events and their times, locations, and human and non-human participants, as well as with within- and cross-document event and entity coreference information. The 2012 annotation of the ECB corpus (Lee et al., 2012) was used as the starting point for re-annotating the ECB according to the ECB+ annotation guideline. The major differences from the 2012 annotation of the ECB are:

(a) five event components are annotated in text:

  • actions (annotation tags starting with ACTION and NEG)
  • times (annotation tags starting with TIME)
  • locations (annotation tags starting with LOC)
  • human participants (annotation tags starting with HUMAN)
  • non-human participants (annotation tags starting with NON_HUMAN)

(b) specific action classes and entity subtypes are distinguished for each of the five main event components, resulting in a total tagset of 30 annotation tags based on the ACE annotation guidelines (LDC 2008) and TimeML (Pustejovsky et al., 2003; Sauri et al., 2005)

(c) intra- and cross-document coreference relations between mentions of the five event components were established:

  • INTRA_DOC_COREF tag captures within-document coreference chains that do not participate in cross-document relations; within-document coreference was annotated by means of the CAT tool (Bartalesi et al., 2012)
  • CROSS_DOC_COREF tag indicates cross-document coreference relations created in the CROMER tool (Girardi et al., 2014); all coreference branches refer by means of relation target IDs to the so-called TAG_DESCRIPTORS, pointing to human-friendly instance names (assigned by coders) and also to instance_id-s

(d) events are annotated from an “event-centric” perspective, i.e. annotation tags are assigned depending on the role a mention plays in an event (for more information see ECB+ references).
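The tag prefixes listed under (a) can be turned into a small lookup. The following is a minimal sketch, not part of the corpus distribution: the function name `event_component` and the component labels are illustrative choices, and `NON_HUMAN_PART` is used only as an assumed example of a tag with the NON_HUMAN prefix.

```python
# Prefixes as listed under (a); the component labels are illustrative.
PREFIX_TO_COMPONENT = (
    ("ACTION", "action"),
    ("NEG", "action"),
    ("TIME", "time"),
    ("LOC", "location"),
    ("NON_HUMAN", "non-human participant"),
    ("HUMAN", "human participant"),
)


def event_component(tag_name):
    """Return the event component for an annotation tag, or None if no prefix matches."""
    for prefix, component in PREFIX_TO_COMPONENT:
        if tag_name.startswith(prefix):
            return component
    return None


print(event_component("ACTION_OCCURRENCE"))
print(event_component("NON_HUMAN_PART"))  # assumed example tag name
```

Relation tags such as CROSS_DOC_COREF match none of the prefixes, so the function returns `None` for them.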

CONTENTS

  • 982 ECB+ corpus texts in the XML format: ECB+.zip
  • ECB+ annotation guideline: NWR-2014-1.pdf
  • Index of sentences annotated with coreference: ECBplus_coreference_sentences.csv (see README.TXT)
  • LICENSEDATA.TXT
  • COPYING-CC.TXT
  • README.TXT
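As a quick sanity check after downloading, the XML documents in the archive can be listed without unpacking it. This is a sketch under the assumption that the corpus texts are stored as `.xml` members of `ECB+.zip`; the internal folder layout is not described here, so the helper (a hypothetical name, `list_xml_documents`) only filters by file extension.

```python
import zipfile


def list_xml_documents(archive):
    """Return the sorted names of all .xml members of a zip archive.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith(".xml")
        )
```

Calling `len(list_xml_documents("ECB+.zip"))` gives the number of XML files. If the archive holds nothing but the corpus texts, this should agree with the 982 texts listed above.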

The XML files with the annotated text are in the CROMER format (see references, Girardi et al., 2014). This format distinguishes mentions from instances using identifiers. Coreference relations and other semantic relations are expressed between identifiers of entities and events. Instance identifiers refer back to mentions in the text, which is represented as separate tokens. E.g.:

<ACTION_OCCURRENCE m_id="67"  >
  <token_anchor t_id="195"/>
</ACTION_OCCURRENCE>

represents a mention of an action anchored to the token with identifier "195", which is the word "plea" in the file 1_1ecbplus.xml. Cross-document coreference relations are expressed between mentions, for example between mention "67" and mention "89":

<CROSS_DOC_COREF r_id="21700" note="ACT15743050988266116" >
  <source m_id="67" />
  <target m_id="89" />
</CROSS_DOC_COREF>

The CROSS_DOC_COREF element has a note attribute that functions as the instance identifier of the action. These instance identifiers establish cross-document coreference.
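The two fragments above can be read with Python's standard `xml.etree.ElementTree`. The sketch below makes several assumptions that should be checked against the actual corpus files: the miniature document, its root element name, and the `<token>` element carrying the word text are made up for illustration, and only the source side of each relation is collected.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature document. The root name and the <token> element
# are assumptions; the mention and relation are copied from the examples.
SAMPLE = """
<Document>
  <token t_id="195">plea</token>
  <ACTION_OCCURRENCE m_id="67">
    <token_anchor t_id="195"/>
  </ACTION_OCCURRENCE>
  <CROSS_DOC_COREF r_id="21700" note="ACT15743050988266116">
    <source m_id="67"/>
    <target m_id="89"/>
  </CROSS_DOC_COREF>
</Document>
""".strip()


def mention_texts(root):
    """Map each mention id to the text of the tokens it is anchored to."""
    tokens = {t.get("t_id"): (t.text or "") for t in root.iter("token")}
    mentions = {}
    for elem in root.iter():
        anchors = elem.findall("token_anchor")
        # <source> and <target> also carry m_id but have no token anchors.
        if elem.get("m_id") is not None and anchors:
            words = [tokens.get(a.get("t_id"), "") for a in anchors]
            mentions[elem.get("m_id")] = " ".join(words)
    return mentions


def cross_doc_instances(root):
    """Group source mention ids by the relation's note attribute."""
    instances = defaultdict(list)
    for rel in root.iter("CROSS_DOC_COREF"):
        for src in rel.findall("source"):
            instances[rel.get("note")].append(src.get("m_id"))
    return dict(instances)


root = ET.fromstring(SAMPLE)
print(mention_texts(root))
print(cross_doc_instances(root))
```

Because the note value serves as the instance identifier, collecting these groups over several files and merging groups with the same key is one plausible way to assemble cross-document chains.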

When using this resource in publications please cite:

  • Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014).

The ECB+ corpus is also described in:

  • Agata Cybulska and Piek Vossen. 2014. Guidelines for ECB+ Annotation of Events and their Coreference. (https://www.newsreader-project.eu/files/2013/01/NWR-2014-1.pdf)

Other relevant references:

  • Linguistic Data Consortium. 2008. ACE (Automatic Content Extraction) English annotation guidelines for entities, version 6.6 2008.06.13. Technical report, June. https://projects.ldc.upenn.edu/ace/docs/English-Entities-Guidelines_v6.6.pdf
  • Valentina Bartalesi Lenzi, Giovanni Moretti and Rachele Sprugnoli. 2012. CAT: the CELCT Annotation Tool. In Proceedings of LREC 2012.
  • Cosmin Bejan and Sanda Harabagiu. 2010. Unsupervised Event Coreference Resolution with Rich Linguistic Features. In Proceedings of ACL 2010, pages 1412–1422.
  • Christian Girardi, Manuela Speranza, Rachele Sprugnoli and Sara Tonelli. 2014. CROMER: a Tool for Cross-Document Event and Entity Coreference. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2014).
  • Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu and Dan Jurafsky. 2012. Joint Entity and Event Coreference Resolution across Documents. In Proceedings of EMNLP 2012.
  • James Pustejovsky, Jose Castano, Bob Ingria, Roser Sauri, Rob Gaizauskas, Andrea Setzer and Graham Katz. 2003. TimeML: Robust specification of event and temporal expressions in text. In Proceedings of Computational Semantics Workshop (IWCS-5).
  • Roser Saurí, Jessica Littman, Robert Knippen, Robert Gaizauskas, Andrea Setzer and James Pustejovsky. 2005. TimeML 1.2.1 annotation guidelines, October. https://timeml.org/site/publications/timeMLdocs/annguide_1.2.1.pdf
