ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

This repository is the home of the ParaNames project, a massively multilingual dataset consisting of parallel names for 16.8 million named entities in over 400 languages. This README contains links to corpus releases as well as to the code used in our canonical name translation and named entity recognition experiments.

ParaNames was originally introduced in Sälevä, J. and Lignos, C., 2022. ParaNames: A Massively Multilingual Entity Name Corpus. arXiv preprint arXiv:2202.14035 and subsequently published at LREC-COLING 2024.

Please cite as:

@inproceedings{saleva-lignos-2024-paranames-1,
    title = "{P}ara{N}ames 1.0: Creating an Entity Name Corpus for 400+ Languages Using {W}ikidata",
    author = {S{\"a}lev{\"a}, Jonne  and
      Lignos, Constantine},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1103",
    pages = "12599--12610",
    abstract = "We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.",
}

While we prefer that you cite the LREC-COLING version above, the arXiv preprint can be cited as:

@misc{sälevä2024paranames,
      title={ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata},
      author={Jonne Sälevä and Constantine Lignos},
      year={2024},
      eprint={2405.09496},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

See the Releases page to download the corpus.

Experimental results from the paper

NOTE: The code for the downstream experiments does NOT live in this repository. See below for links.

Canonical name translation

Link to repository

The linked repository contains the code for running our canonical name translation experiments with fairseq.

Named entity recognition

Link to repository

The linked repository contains the code for running our named entity recognition experiments with DyNet.

Using the data release

Format

The corpus is released as a gzipped TSV file which contains the following columns:

  • wikidata_id: the Wikidata ID of the entity
  • eng: the English name of the entity
  • label: the name of the entity in the language of the row
  • language: the language of the row
  • type: the type of the entity (PER, LOC, ORG)

Some example rows are shown below:

wikidata_id     eng     label   language        type
Q181893 Fredericksburg and Spotsylvania National Military Park  Fredericksburg and Spotsylvania National Military Park  mg      LOC
Q257160 Burgheim        Burgheim        fr      LOC
Q508851 Triefenstein    Triefenstein    nl      LOC
Q303923 Ruhstorf an der Rott    Ruhstorf an der Rott    bar     LOC
Q284696 Oberelsbach     Oberelsbach     wo      LOC
Q550561 Triftern        Թրիֆթերն        hy      LOC
Q529488 Reisbach        Reisbach        fr      LOC
Q385427 Stadtlauringen  Stadtlauringen  ia      LOC
Q505327 Wildflecken     Wildflecken     id      LOC
Q505288 Ipsheim Իպսհայմ hy      LOC
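
For example, the release file can be loaded with pandas. This is a minimal sketch: the file name is a placeholder for whatever you downloaded from the Releases page, and the column names follow the list above.

import pandas as pd

# pandas infers gzip compression from the .gz extension.
df = pd.read_csv(
    "paranames.tsv.gz",         # placeholder: use your downloaded release file
    sep="\t",
    dtype=str,                  # keep Wikidata IDs and language codes as strings
    keep_default_na=False,      # don't parse names like "NA" as missing values
)
print(df[df["language"] == "hy"].head())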

Notes

Repeated entities

In current releases, any entity that is associated with multiple named entity types (PER, LOC, ORG) in the Wikidata type hierarchy will appear multiple times in the output, once with each type. This affects less than 3% of the entities in the data.

If you want a unique set of entities, you should deduplicate the data using the wikidata_id field.

If you only want to use entities that are associated with a single named entity type, you should remove any wikidata_id that appears with more than one type.
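
Both filters are straightforward to express with pandas; a sketch, assuming a DataFrame df loaded as in the example above:

# Unique set of entities: keep one row per Wikidata ID.
unique_entities = df.drop_duplicates(subset="wikidata_id")

# Single-type entities only: drop every ID that occurs with more than one type.
types_per_id = df.groupby("wikidata_id")["type"].nunique()
single_type_ids = types_per_id[types_per_id == 1].index
single_type_df = df[df["wikidata_id"].isin(single_type_ids)]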

Using the code

First, install the following non-Python dependencies:

  • MongoDB
  • xsv
  • ICU support for your computer (e.g. libicu-dev)

Next, install ParaNames and its Python dependencies by running pip install -e . from the root of the repository.

It is recommended that you use a Conda environment for package management.

Creating the ParaNames corpus

To create a corpus following our approach, follow the steps below:

  1. Download the latest Wikidata dump from the Wikimedia page and extract it. Note that this may take up several TB of disk space.
  2. Use recipes/paranames_pipeline.sh, which ingests the Wikidata JSON into MongoDB and then dumps and postprocesses it into the final TSV resource.

The call to recipes/paranames_pipeline.sh works as follows:

recipes/paranames_pipeline.sh <path_to_extracted_json_dump> <output_folder> <n_workers>

Set the number of workers based on the number of CPUs your machine has. By default, only 1 CPU is used.

The output folder will contain one subfolder per language, inside of which paranames_<language_code>.tsv can be found. The entire resource is located in <output_folder>/combined/paranames.tsv.
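
For instance, the per-language files can be iterated over as in the sketch below; the output path is an assumption standing in for whatever <output_folder> you passed to the pipeline.

from pathlib import Path
import pandas as pd

output_folder = Path("output")  # placeholder for your <output_folder>

# One subfolder per language, each containing paranames_<language_code>.tsv.
for tsv in sorted(output_folder.glob("*/paranames_*.tsv")):
    language = tsv.parent.name
    names = pd.read_csv(tsv, sep="\t", dtype=str, keep_default_na=False)
    print(language, len(names))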

Notes

ParaNames offers several options for customization:

  • If your MongoDB instance uses a non-standard port, you should change the value of mongodb_port accordingly inside paranames_pipeline.sh.

  • Setting should_collapse_languages=yes will cause Wikimedia language codes to be "collapsed" to the top-level Wikimedia language code, e.g. kk-cyrl will be converted to kk, en-ca to en, etc. (see the sketch after this list).

  • Setting should_keep_intermediate_files=yes will cause intermediate files to be kept rather than deleted. These include the raw per-type TSV dumps ({PER,LOC,ORG}.tsv) from MongoDB, as well as the outputs of postprocess.py.

  • Within recipes/dump.sh, it is also possible to define languages to be excluded and whether entity types should be disambiguated. By default, no languages are excluded and no disambiguation is done.

  • After the pipeline completes, <output_folder> will contain one folder per language, inside of which is a TSV file containing the subset of names in that language. Combined TSVs with names in all languages are available in the combined folder.
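
The collapsing behavior amounts to keeping everything before the first hyphen. The sketch below illustrates the mapping only; it is not the pipeline's actual implementation.

def collapse_language_code(code: str) -> str:
    """Map a Wikimedia language code to its top-level code, e.g. kk-cyrl -> kk."""
    return code.split("-", 1)[0]

assert collapse_language_code("kk-cyrl") == "kk"
assert collapse_language_code("en-ca") == "en"
assert collapse_language_code("fr") == "fr"  # codes without a subtag are unchanged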