flairNLP/zelda

A comprehensive benchmark for entity disambiguation
ZELDA benchmark

ZELDA is a comprehensive benchmark for entity disambiguation (ED). You can use it to train and evaluate ED models.

Download ZELDA as one big zip file here. Inside, you'll find this structure:

ZELDA
├── train_data
│   ├── zelda_train.conll
│   └── zelda_train.jsonl
├── test_data
│   ├── conll
│   │   ├── test_aida-b.conll
│   │   ├── test_cweb.conll
│   │   ├── test_reddit-comments.conll
│   │   ...
│   └── jsonl
│       ├── test_aida-b.jsonl
│       ├── test_cweb.jsonl
│       ├── test_reddit-comments.jsonl
│       ...
└── other
    ├── zelda_mention_entities_counter.pickle
    └── entity_descriptions.jsonl

The train_data folder contains the training split (use either conll or jsonl version). The test_data folder contains all 9 evaluation splits in both formats. The other folder contains entity descriptions and candidate lists.
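The layout of zelda_mention_entities_counter.pickle is described in the paper; a common layout for such candidate lists is a dictionary mapping each mention string to the counts of entities it was linked to in training. The snippet below is a sketch under that assumption, using a toy in-memory dictionary in place of the real file:

```python
import pickle

# Hypothetical layout: mention string -> {entity title: link count}.
# A toy stand-in for the real zelda_mention_entities_counter.pickle.
toy_counter = {
    "JAPAN": {"Japan_national_football_team": 7, "Japan": 42},
}

# Round-trip through pickle, as you would when reading the real file
blob = pickle.dumps(toy_counter)
mention_entities_counter = pickle.loads(blob)

# Rank candidate entities for a mention by how often they were linked
candidates = sorted(mention_entities_counter["JAPAN"].items(),
                    key=lambda kv: kv[1], reverse=True)
print(candidates[0][0])  # most frequently linked candidate
```

With the real file you would replace the round-trip with `pickle.load(open('other/zelda_mention_entities_counter.pickle', 'rb'))`.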

How to Load

You should train your model using ZELDA train and evaluate with all 9 test splits. The macro-averaged accuracy over all splits is your final evaluation number.
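The final number is the unweighted (macro) mean of the nine per-split accuracies. A minimal sketch, with placeholder split names and scores:

```python
# Placeholder per-split accuracies (the real benchmark has 9 splits)
split_accuracies = {
    "aida-b": 0.85,
    "cweb": 0.70,
    "reddit-comments": 0.78,
}

# Macro average: every split contributes equally, regardless of its size
macro_average = sum(split_accuracies.values()) / len(split_accuracies)
print(f"Macro-averaged accuracy: {macro_average:.4f}")
```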

This section shows different ways to load the data for training and testing.

Load with Flair

The easiest way to load and explore the corpus is through flair. Simply use this snippet to load the corpus and iterate through some sentences and their annotations:

from flair.datasets import ZELDA

# get Zelda corpus and print statistics
corpus = ZELDA()
print(corpus)

# get a sentence from the test split
sentence = corpus.test[1]

# print this sentence with all annotations 
print(sentence)

# iterate over linked entities in this sentence and print each
for entity in sentence.get_labels('nel'):
    print(entity)

Load in CoNLL-Format

You can load the CoNLL format directly. In this format, each line is a token followed by the ID and URL annotations of this token, in BIO format:

-DOCSTART-

# 1163testb SOCCER
SOCCER	O	O
-	O	O
JAPAN	B-993546	B-Japan_national_football_team
GET	O	O
LUCKY	O	O
WIN	O	O
,	O	O
CHINA	B-887850	B-China_national_football_team
IN	O	O
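The BIO-tagged columns above can be read back into (mention, ID, title) triples with a few lines of plain Python. This is a sketch, not part of the ZELDA distribution; it assumes tab-separated token, ID, and title columns as shown above:

```python
def read_mentions(conll_lines):
    """Collect (tokens, wikipedia_id, wikipedia_title) triples from BIO-tagged lines."""
    mentions, current = [], None
    for line in conll_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:  # skip -DOCSTART-, comment and blank lines
            current = None
            continue
        token, id_tag, title_tag = parts
        if id_tag.startswith("B-"):  # a new mention starts
            current = ([token], id_tag[2:], title_tag[2:])
            mentions.append(current)
        elif id_tag.startswith("I-") and current:  # mention continues
            current[0].append(token)
        else:  # 'O' tag: outside any mention
            current = None
    return mentions

lines = [
    "SOCCER\tO\tO",
    "JAPAN\tB-993546\tB-Japan_national_football_team",
    "GET\tO\tO",
    "CHINA\tB-887850\tB-China_national_football_team",
]
for tokens, wiki_id, wiki_title in read_mentions(lines):
    print(" ".join(tokens), wiki_id, wiki_title)
```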

Load in JSONL-Format

In the jsonl files, each line holds one document as a dictionary with the keys 'id', 'text', 'index', 'wikipedia_titles' and 'wikipedia_ids'. For instance, run this snippet to load the first document:

import json

with open('test_data/jsonl/test_aida-b.jsonl', mode='r', encoding='utf-8') as input_jsonl:

    # each line represents one document
    first_line = next(input_jsonl)
    document_dictionary = json.loads(first_line)

document_text = document_dictionary['text']
mention_indices = document_dictionary['index']
mention_gold_titles = document_dictionary['wikipedia_titles']
mention_gold_ids = document_dictionary['wikipedia_ids']

for index, title, idx in zip(mention_indices, mention_gold_titles, mention_gold_ids):
    mention_start = index[0]
    mention_end = index[1]
    print(f'Mention: {document_text[mention_start:mention_end]} --- Wikipedia title: {title} --- Wikipedia id: {idx}')
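To process an entire split, iterate over all lines the same way. The sketch below tallies the gold mentions across a split; a toy in-memory line stands in for the file contents, with made-up text and indices:

```python
import json

# Toy stand-in for the lines of a jsonl split file
jsonl_lines = [
    json.dumps({"id": 0, "text": "JAPAN beat CHINA.",
                "index": [[0, 5], [11, 16]],
                "wikipedia_titles": ["Japan_national_football_team",
                                     "China_national_football_team"],
                "wikipedia_ids": [993546, 887850]}),
]

# Each document contributes one entry per gold mention
total_mentions = 0
for line in jsonl_lines:
    document = json.loads(line)
    total_mentions += len(document["index"])
print(total_mentions)
```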

How to Cite

Please refer to our paper for information on how the benchmark was constructed:

@inproceedings{milich2023zelda,
  title={{ZELDA}: A Comprehensive Benchmark for Supervised Entity Disambiguation},
  author={Milich, Marcel and Akbik, Alan},
  booktitle={{EACL} 2023, The 17th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2023}
}

More Info

In addition to our paper, more information on the scripts used to create this dataset can be found here.
