Wikidata and Scribe Guide

Wikidata is a project from the Wikimedia Foundation, specifically Wikimedia Deutschland (the German chapter of Wikimedia). Like Wikimedia's flagship project Wikipedia, Wikidata is an open information platform that anyone can edit. More specifically, Wikidata is an open knowledge graph situated at the heart of the Linked Open Data infrastructure, which seeks to harness the internet to create a global database of public information that anyone can use.

Wikidata data is licensed CC0, meaning reuse is permitted without restriction for personal and commercial purposes. Even though you can use Wikidata data without giving credit, we at Scribe suggest that you actively promote your use of Wikidata and join the Linked Open Data movement so that all can benefit from the wealth of information created by its dedicated supporters.

Scribe uses Wikidata - specifically the lexicographical data - as a source of language data via Scribe-Data and Scribe-Server. All the noun genders, verb conjugations and so much more come directly from Wikidata contributors 💙

This markdown file provides important information about Wikidata, geared towards people who want to learn about it while working on Scribe applications. Edits are welcome to expand and change this document as the community sees fit!

Contents

  • First Steps into Wikidata
    • Data structure
    • SPARQL
    • First queries
    • Lexeme queries
  • Scribe-Data and Wikidata
  • Scribe-Server and Wikidata
  • Further resources

First Steps into Wikidata

An important distinction to make is that Wikidata is an instance of Wikibase - open-source software for creating collaborative knowledge bases. Wikimedia Deutschland also supports other Wikibase instances, such as the hosted ones found on Wikibase Cloud, and provides Wikibase Suite, a Dockerized version of the software for self-hosting.

Data structure

Wikidata and other Wikibase instances are not relational databases, but rather RDF (Resource Description Framework) graph databases known as triplestores. RDF data forms a directed graph composed of triple statements that include:

  1. A subject (the entity being related)
  2. A predicate (the relation between the subject and object)
  3. An object (the entity being related to)

Note that objects can be literal values (int, string, date, etc.) or other entities within the graph. In Wikidata, subjects and non-literal objects are generally stored as QIDs and predicates are stored as PIDs (see the Further resources section for documentation on Wikidata identifiers). Scribe specifically uses Lexemes, which are represented as LIDs, where each lemma (the base form of a word) is given one unique identifier.

A few examples of triples are the following:

  • Germany (subject - Q183) has the capital (predicate - P36) Berlin (object - Q64).
  • Berlin (subject - Q64) has population (predicate - P1082) 3.7 million (object - an integer).
  • The European Union (subject - Q458) has the member (predicate - P527) Germany (object - Q183).
  • Germany (subject - Q183) is a member of (predicate - P463) the European Union (object - Q458).

One of the main benefits of RDF triplestores is that there are no limits based on the current structure of the data. If a new relationship is needed, then a predicate for it can be made and the associated objects can then be linked to their subjects.

When comparing Wikidata to conventional data structures, it's important to note that its data is not stored in tables. There are regular dumps of Wikidata that also come in relational database forms (with subject, predicate and object columns) as well as JSON and other formats, but the data on Wikidata itself is stored using RDF relationships.

SPARQL

Because the structure of Wikidata data is different from traditional relational databases, we also need a different way to query it. SPARQL - the recursive acronym standing for "SPARQL Protocol and RDF Query Language" - is the standard for querying RDF-formatted data.
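
As a minimal first example using the Germany triple from the section above, the following query asks for Germany's capital. The result comes back as a QID rather than a readable name; getting labels is covered in the queries below.

SELECT
    ?capital

WHERE {
    # Germany  # Capital  # Object
    wd:Q183    wdt:P36    ?capital.
}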

Another interesting part of SPARQL is that the standard also defines a protocol for sending queries over HTTP, so federated queries can be written that access distributed resources across multiple SPARQL endpoints. In this way Wikidata can be linked to other Wikibase instances or other databases within the Linked Open Data infrastructure.
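
As a rough sketch of what federation looks like (the endpoint URL and property IRI below are hypothetical placeholders, not a real service), a SERVICE clause lets one query combine Wikidata triples with data from another SPARQL endpoint. Note that public query services typically restrict which endpoints may be federated.

SELECT
    ?country
    ?externalValue

WHERE {
    # EU member states from Wikidata.
    ?country wdt:P463 wd:Q458.

    # A hypothetical external endpoint holding additional data about the same IRIs.
    SERVICE <https://example.org/sparql> {
        ?country <https://example.org/vocab/someProperty> ?externalValue.
    }
}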

Note that there are also aggregation functions in SPARQL, as in any query language. The only usage of aggregation functions within Scribe is check_language_data.sparql in Scribe-Data. This query gets the totals for categories of words like nouns and verbs on a per-language basis. The results allow the team to check the overall coverage for a language within Wikidata lexemes and prioritize which languages to implement next.
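
As a minimal sketch of aggregation (not the actual check_language_data.sparql query, which works over lexemes as shown later), the following counts how many entities are members of the European Union:

SELECT (COUNT(DISTINCT ?member) AS ?memberCount)

WHERE {
    # Subject  # Member of  # The European Union
    ?member    wdt:P463     wd:Q458.
}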

First queries

Below we find the most common Wikidata example: Q42 - Douglas Adams, who was given this ID in homage to his book The Hitchhiker's Guide to the Galaxy, in which the answer to the "Ultimate Question of Life, the Universe, and Everything" is found to be the number 42 :)


Please go to the Wikidata Query Service and try out the following queries to get information about Douglas Adams. You can also click the section header to go directly to the query service with the query populated.

Books written by Douglas Adams

SELECT
    ?book
    ?bookLabel
    ?bookDescription

WHERE {
    # Subject  # Author  # Douglas Adams
    ?book      wdt:P50   wd:Q42.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}

Note

The Scribe team strongly suggests that VS Code developers download the Wikidata QID Labels VS Code extension, which provides in-editor tooltips for Wikidata ID labels.

It's important to note that for triples where the object is a Wikidata entity, query results contain the unique ID, not its string label. In order to get labels for our results we need to add the labeling service to our queries, which then lets us create a ?colNameLabel column for any column of IDs ?colName. We add this service via the following line at the end of the query, with English set as the default returned language:

SERVICE wikibase:label { bd:serviceParam wikibase:language
  "[AUTO_LANGUAGE], en". }

Note that ?colNameDescription functions in a similar way, returning the description of the ID.
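
For example, the following sketch returns Germany's capital together with its label and description. Because the ID column is named ?capital, the label service makes ?capitalLabel and ?capitalDescription available:

SELECT
    ?capital
    ?capitalLabel
    ?capitalDescription

WHERE {
    # Germany  # Capital  # Object
    wd:Q183    wdt:P36    ?capital.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}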

Note

We don't need to call the label service in the following query as the object isn't a Wikidata entity.

Douglas Adams' date of birth

SELECT
    ?dateOfBirth

WHERE {
    # Douglas Adams  # Date of Birth  # Object
    wd:Q42           wdt:P569         ?dateOfBirth.
}

Douglas Adams' place of birth

SELECT
    ?placeOfBirth
    ?placeOfBirthLabel

WHERE {
    # Douglas Adams  # Place of Birth  # Object
    wd:Q42           wdt:P19           ?placeOfBirth.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}

People born in the same place as Douglas Adams

SELECT DISTINCT
    ?person
    ?personLabel
    ?personDescription

WHERE {
    # Douglas Adams  # Place of Birth  # Object
    wd:Q42           wdt:P19           ?placeOfBirth.
    # Subject  # Instance of  # Human
    ?person    wdt:P31        wd:Q5;
               # Place of birth/*  # Object
               wdt:P19/wdt:P131*   ?placeOfBirth.

    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}

Here's one more query to try out on the Wikidata Query Service. Can you change it to get different results? The following are great ways to find the Wikidata IDs you're looking for when rewriting the query below:

  • Search for the main item on Wikidata (in this case the European Union)
    • Check statements on the left and navigate to their PIDs
  • Use a search engine to search for Wikidata NAME_OF_ITEM, with the first result normally being the correct one
  • Use the Wikidata Query Builder to construct your query without writing SPARQL by hand

SELECT
    ?country
    ?countryLabel

WHERE {
    # Subject  # Member of  # The European Union
    ?country   wdt:P463     wd:Q458.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}
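
As one possible modification - a sketch reusing the population property P1082 from the triple examples above - the query can also return each member state's population. The population pattern is wrapped in OPTIONAL so that members without a population statement still appear:

SELECT
    ?country
    ?countryLabel
    ?population

WHERE {
    # Subject  # Member of  # The European Union
    ?country   wdt:P463     wd:Q458.

    # Population (P1082) is optional so members without the statement are kept.
    OPTIONAL { ?country wdt:P1082 ?population. }

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}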

Lexeme queries

The focus now shifts to the kind of data that's of interest to Scribe. Wikidata lexicographical data maps out lemmas (base versions of words) as LIDs and attaches all forms of the lemma as queryable points of data. Let's start with a base query:

SELECT DISTINCT
    ?lexeme
    ?lemma

WHERE {
    # Subject            # German
    ?lexeme dct:language wd:Q188 ;
        # Predicate              # Noun
        wikibase:lexicalCategory wd:Q1084 ;
        # The following is like labels above.
        wikibase:lemma ?lemma .
}

LIMIT 10

First we start with a lexeme, then use dct:language to define which language it's from, then apply a lexical category to state that we only want nouns, and at the end we ask for the lemma via wikibase:lemma (the equivalent of labels for lexemes). Removing LIMIT 10 would give us the first query of interest to Scribe: all German nouns!
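
Tying this back to the aggregation note above, counting over the same pattern gives the total number of German nouns. This is only a sketch of the idea behind such coverage checks; the actual check_language_data.sparql query in Scribe-Data may differ:

SELECT (COUNT(DISTINCT ?lexeme) AS ?germanNounCount)

WHERE {
    # German lexemes whose lexical category is noun.
    ?lexeme dct:language wd:Q188 ;
        wikibase:lexicalCategory wd:Q1084 .
}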

From the base query above we now need to get the forms (singular, plural, gender, etc.) associated with each noun. Not every lemma will have all of these data points, as they might not have been added yet or might not be grammatically valid, so we wrap the form queries in OPTIONAL blocks.

SELECT DISTINCT
    ?lexeme
    ?lemma
    ?singular
    ?plural

WHERE {
    ?lexeme dct:language wd:Q188 ;
        wikibase:lexicalCategory wd:Q1084 ;
        wikibase:lemma ?lemma .

    OPTIONAL {
        ?lexeme ontolex:lexicalForm ?singularForm .
        ?singularForm ontolex:representation ?singular ;
            wikibase:grammaticalFeature wd:Q110786 .
    }

    OPTIONAL {
        ?lexeme ontolex:lexicalForm ?pluralForm .
        ?pluralForm ontolex:representation ?plural ;
            wikibase:grammaticalFeature wd:Q146786 .
    }
}

LIMIT 10

From here we're able to create most of the queries used by Scribe by changing the language that lexemes should be associated with, changing the category of word that we need (nouns, verbs, etc.) and editing the optional form selections to include all the information about the lemma that Scribe applications need.
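
For instance, a sketch of the base query for French verbs only swaps the language and lexical category QIDs - here assuming Q150 is French and Q24905 is the verb lexical category, both worth double checking on Wikidata:

SELECT DISTINCT
    ?lexeme
    ?lemma

WHERE {
    # French instead of German.
    ?lexeme dct:language wd:Q150 ;
        # Verb instead of noun.
        wikibase:lexicalCategory wd:Q24905 ;
        wikibase:lemma ?lemma .
}

LIMIT 10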

Scribe-Data and Wikidata

Scribe-Data is a data process that interfaces with Wikidata's lexicographical data with the following functionality:

  • Defines SPARQL queries with which data can be extracted from Wikidata
    • Sometimes queries need to be broken up as there are too many results
  • Passes these queries to Wikidata via the Python library SPARQLWrapper
  • Formats extracted data and prepares them for use within Scribe applications
  • Creates SQLite databases that form the basis of language packs that are loaded into Scribe app interfaces

Functionality not related to Wikidata includes:

  • Generating Emoji-trigger word relations for emoji autosuggestions and autocompletions using Unicode CLDR data
  • Creating autosuggest dictionaries based on the most frequent words in Wikipedia and the words that most frequently follow them

Scribe-Server and Wikidata

Scribe-Server functions as an automation step that runs Scribe-Data as a package and automatically updates Wikidata-based language packs for users to then download within Scribe applications.

Further resources

The following are other resources that the community suggests to broaden your understanding of Wikidata and its use in Scribe development. Some resources from above are repeated to ensure that this section is a comprehensive list.

Wikidata documentation

Querying Wikidata

Wikidata lexemes

Tools used by Scribe