yago-naga/yago3

YAGO

YAGO is a large semantic knowledge base, derived from Wikipedia, WordNet, WikiData, GeoNames, and other data sources. Currently, YAGO knows more than 17 million entities (such as persons, organizations, and cities) and contains more than 150 million facts about these entities.

YAGO is special in several ways:

  • The accuracy of YAGO has been manually evaluated, confirming an accuracy of 95% (*). Every relation is annotated with its confidence value.
  • YAGO combines the clean taxonomy of WordNet with the richness of the Wikipedia category system, assigning the entities to more than 350,000 classes.
  • YAGO is anchored in time and space. YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities.
  • In addition to taxonomy, YAGO has thematic domains such as "music" or "science" from WordNet Domains.
  • YAGO extracts and combines entities and facts from 10 Wikipedias in different languages.

YAGO is jointly developed at the DBWeb group at Télécom ParisTech University, the Databases and Information Systems group at the Max Planck Institute for Informatics, and Ambiverse.

(*) Not every version of YAGO is manually evaluated. Most notably, the version generated by this code may not be the one that we evaluated! Check the versions on the YAGO download page.

YAGO Code Repository

Target audience

If you are only interested in the YAGO data, there is no need to use the present code repository. You can download the data from the YAGO homepage.

If you are interested in using the source code of YAGO, or in contributing to it, read on. The source code of YAGO is a Java project that extracts facts from Wikipedia and the other data sources, and stores these facts in files. These files make up the YAGO knowledge base.

If you run the code yourself, you can define (a) which Wikipedia languages to cover, and (b) which specific Wikipedia, Wikidata, and Wikimedia Commons snapshots should be used during the build.

Project components

The following Java projects belong to YAGO:

  • Javatools: These classes are Java utilities. They are shared with other projects.
  • Basics: These classes are used to represent facts, TSV files, etc. The files in "data" describe the schema of YAGO.
  • YAGO: This project contains
    • all main YAGO extractors
    • some hand-crafted data
    • scripts that run YAGO

Prerequisites

To run YAGO, you need the following:

  • Java 8
  • Maven
  • for the automated downloading of data resources:
    • Python 2.7
    • the Python module requests (you can use pip install requests to install this module)
    • a Unix machine
  • a machine with at least 256 GB of RAM and 1 TB of disk space
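
The hardware requirements above can be checked before starting a long build. The following is a minimal sketch, not part of the YAGO code base: the class name `YagoPrerequisitesSketch` is hypothetical, and reading "256 GB" and "1 TB" as binary units is an assumption. It relies on the JDK-specific `com.sun.management.OperatingSystemMXBean`, which is available in Oracle JDK and OpenJDK 8.

```java
import java.io.File;
import java.lang.management.ManagementFactory;

// Hypothetical pre-flight check; not part of the YAGO code base.
class YagoPrerequisitesSketch {

    // Thresholds from the prerequisites list, read as binary units (an assumption).
    static final long MIN_RAM_BYTES = 256L * 1024 * 1024 * 1024;    // 256 GB
    static final long MIN_DISK_BYTES = 1024L * 1024 * 1024 * 1024;  // 1 TB

    // Pure check, so it can be exercised without querying the machine.
    static boolean meetsRequirements(long ramBytes, long diskBytes) {
        return ramBytes >= MIN_RAM_BYTES && diskBytes >= MIN_DISK_BYTES;
    }

    public static void main(String[] args) {
        // JDK-specific extension of the standard OperatingSystemMXBean.
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        long ram = os.getTotalPhysicalMemorySize();

        // Usable space on the partition that holds the given folder
        // (defaults to the current directory).
        long disk = new File(args.length > 0 ? args[0] : ".").getUsableSpace();

        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("RAM bytes: " + ram + ", usable disk bytes: " + disk);
        System.out.println(meetsRequirements(ram, disk)
            ? "OK" : "Below the stated minimum");
    }
}
```

Passing the intended output folder as the first argument makes the disk check apply to the partition where the facts would be written.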

The YAGO configuration file

YAGO is configured with a configuration file. Use this template to generate your own copy of that file. It should contain the following lines:

  • reuse = true|false: Specifies whether a new run of YAGO should overwrite or re-use the facts that have already been generated in a previous run.
  • yagoFolder = ...: Specifies the folder where the YAGO facts shall be stored.
  • languages = en, de, fr, nl, it, es, ro, pl, ar, fa: Specifies the Wikipedia languages from which YAGO shall extract the facts. Use ISO 639-1 language codes.
  • extractors: List of extractors to run. By default, just use the list from the template.
  • subgraphClasses: Specify a single class (e.g. <wordnet_person_100007846>), or a list of classes (e.g. <wikicat_Rock_musicians>,<wikicat_American_singers>). The final YAGO output will contain only entities of the specified classes, and entities connected to them. Additionally, the final YAGO output will contain entities specified in subgraphEntities.
  • subgraphEntities: Specify a single entity (e.g. <Jimmy_Page>), or list of entities
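
To illustrate how these settings fit together, here is a minimal sketch that reads `key = value` lines of the kind listed above and splits the comma-separated values. It is not YAGO's own configuration loader: it uses the standard `java.util.Properties` parser as a stand-in, and the class name `YagoConfigSketch`, its methods, and the folder `/data/yago-output` are hypothetical. The `extractors` entry is left out because its value should be copied from the template.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only; YAGO ships its own configuration handling.
class YagoConfigSketch {

    // Parses "key = value" lines with the standard Properties parser.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits a comma-separated value such as "en, de, fr" into trimmed items.
    // A missing key yields an empty list.
    static List<String> list(Properties props, String key) {
        List<String> items = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(",")) {
            String item = part.trim();
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String example = String.join("\n",
            "reuse = false",
            "yagoFolder = /data/yago-output",  // hypothetical path
            "languages = en, de, fr",
            "subgraphClasses = <wikicat_Rock_musicians>,<wikicat_American_singers>",
            "subgraphEntities = <Jimmy_Page>");

        Properties config = parse(example);
        System.out.println("reuse: " + Boolean.parseBoolean(config.getProperty("reuse")));
        System.out.println("languages: " + list(config, "languages"));
        System.out.println("classes: " + list(config, "subgraphClasses"));
        System.out.println("entities: " + list(config, "subgraphEntities"));
    }
}
```

In this example the build would be restricted to three Wikipedia languages, and the output would be limited to the two listed classes, entities connected to them, and the entity <Jimmy_Page>.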