Wikidata and Scribe Guide

Wikidata is a project from the Wikimedia Foundation, specifically Wikimedia Deutschland (the German chapter of Wikimedia). Like Wikimedia's flagship project Wikipedia, Wikidata is an open information platform that anyone can edit. More specifically, Wikidata is an open knowledge graph situated at the heart of the Linked Open Data infrastructure, which seeks to harness the internet to create a global database of public information that anyone can use.

Wikidata data is licensed CC0, meaning reuse is permitted without restriction for personal and commercial purposes. Even though you can use Wikidata data without giving credit, we at Scribe suggest that you actively promote your use of Wikidata and join the Linked Open Data movement so that all can benefit from the wealth of information created by its dedicated supporters.

Scribe uses Wikidata - specifically the lexicographical data - as a source of language data via Scribe-Data and Scribe-Server. All the noun genders, verb conjugations and so much more come directly from Wikidata contributors 💙

This markdown file provides important information about Wikidata, geared towards people who want to learn about it while working on Scribe applications. Edits are welcome to expand and change this document as the community sees fit!

Contents

  • First Steps into Wikidata
    • Data structure
    • SPARQL
    • First queries
    • Lexeme queries
  • Scribe-Data and Wikidata
  • Scribe-Server and Wikidata
  • Further resources

First Steps into Wikidata

An important distinction to make is that Wikidata is an instance of Wikibase - open-source software for creating collaborative knowledge bases. Wikimedia Deutschland also supports other Wikibase instances, such as the hosted ones found on Wikibase Cloud, and provides Wikibase Suite, a Dockerized version of the software for self-hosting.

Data structure

Wikidata and other Wikibase instances are not relational databases, but rather RDF (Resource Description Framework) graph databases known as triplestores. RDF data forms a directed graph composed of triple statements that include:

  1. A subject (the entity being related)
  2. A predicate (the relation between the subject and object)
  3. An object (the entity being related to)

Note that objects can be literal values (int, string, date, etc.) or other entities within the graph. In Wikidata, subjects and non-literal objects are generally stored as QIDs and predicates are stored as PIDs (see the Further resources section for documentation on Wikidata identifiers). Scribe specifically uses Lexemes, which are represented as LIDs, where each lemma (the base form of a word) is given one unique identifier.

A few examples of triples are the following:

  • Germany (subject - Q183) has the capital (predicate - P36) Berlin (object - Q64).
  • Berlin (subject - Q64) has population (predicate - P1082) 3.7 million (object - an integer).
  • The European Union (subject - Q458) has the member (predicate - P527) Germany (object - Q183).
  • Germany (subject - Q183) is a member of (predicate - P463) the European Union (object - Q458).

One of the main benefits of RDF triplestores is that there are no limits based on the current structure of the data. If a new relationship is needed, then a predicate for it can be made and the associated objects can then be linked to their subjects.

When comparing Wikidata to conventional data structures, it's important to note that its data is not stored in tables. There are regular dumps of Wikidata that also come in relational database forms (with subject, predicate and object columns) as well as JSON and other formats, but the data on Wikidata itself is stored using RDF relationships.

SPARQL

Because the structure of Wikidata data is different from traditional relational databases, we also need a different way to query it. SPARQL - the recursive acronym standing for "SPARQL Protocol and RDF Query Language" - is the standard for querying RDF-formatted data.
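
As a minimal first example using the Germany triple from the section above, the following query asks for Germany's capital. The result comes back as a QID rather than a readable name; getting labels is covered in the queries below.

SELECT
    ?capital

WHERE {
    # Germany  # Capital  # Object
    wd:Q183    wdt:P36    ?capital.
}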

Another interesting part of SPARQL is that the standard also defines a protocol for sending queries over HTTP, so federated queries can be written that access distributed resources across multiple SPARQL endpoints. In this way Wikidata can be linked to other Wikibase instances or other databases within the Linked Open Data infrastructure.
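
As a rough sketch of what federation looks like (the endpoint URL and property IRI below are hypothetical placeholders, not a real service), a SERVICE clause lets one query combine Wikidata triples with data from another SPARQL endpoint. Note that public query services typically restrict which endpoints may be federated.

SELECT
    ?country
    ?externalValue

WHERE {
    # EU member states from Wikidata.
    ?country wdt:P463 wd:Q458.

    # A hypothetical external endpoint holding additional data about the same IRIs.
    SERVICE <https://example.org/sparql> {
        ?country <https://example.org/vocab/someProperty> ?externalValue.
    }
}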

Note that there are also aggregation functions in SPARQL, as in any query language. The only usage of aggregation functions within Scribe is check_language_data.sparql in Scribe-Data. This query gets the totals for categories of words like nouns and verbs on a per-language basis. The results allow the team to check the overall coverage for a language within Wikidata lexemes and prioritize which languages to implement next.
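
As a minimal sketch of aggregation (not the actual check_language_data.sparql query, which works over lexemes as shown later), the following counts how many entities are members of the European Union:

SELECT (COUNT(DISTINCT ?member) AS ?memberCount)

WHERE {
    # Subject  # Member of  # The European Union
    ?member    wdt:P463     wd:Q458.
}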

First queries

Below we find the most common Wikidata example: Q42 - Douglas Adams, who was given this ID in homage to his book The Hitchhiker's Guide to the Galaxy, in which the answer to the "Ultimate Question of Life, the Universe, and Everything" is found to be the number 42 :)


Please go to the Wikidata Query Service and try out the following queries to get information about Douglas Adams. You can also click the section header to go directly to the query service with the query populated.

Books written by Douglas Adams

SELECT
    ?book
    ?bookLabel
    ?bookDescription

WHERE {
    # Subject  # Author  # Douglas Adams
    ?book      wdt:P50   wd:Q42.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}

Note

The Scribe team strongly suggests that VS Code developers download the Wikidata QID Labels VS Code extension, which provides in-editor tooltips for Wikidata ID labels.

It's important to note that for triples where the object is a Wikidata entity, query results contain the unique ID, not its string label. In order to get labels for our results we need to add the labeling service to our queries, which then lets us create a ?colNameLabel column for any column of IDs ?colName. We add this service via the following line at the end of the query, with English set as the default returned language:

SERVICE wikibase:label { bd:serviceParam wikibase:language
  "[AUTO_LANGUAGE], en". }

Note that ?colNameDescription functions in a similar way, returning the description of the ID.
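
For example, the following sketch returns Germany's capital together with its label and description. Because the ID column is named ?capital, the label service makes ?capitalLabel and ?capitalDescription available:

SELECT
    ?capital
    ?capitalLabel
    ?capitalDescription

WHERE {
    # Germany  # Capital  # Object
    wd:Q183    wdt:P36    ?capital.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}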

Note

We don't need to call the label service in the following query as the object isn't a Wikidata entity.

Douglas Adams' date of birth

SELECT
    ?dateOfBirth

WHERE {
    # Douglas Adams  # Date of Birth  # Object
    wd:Q42           wdt:P569         ?dateOfBirth.
}

Douglas Adams' place of birth

SELECT
    ?placeOfBirth
    ?placeOfBirthLabel

WHERE {
    # Douglas Adams  # Place of Birth  # Object
    wd:Q42           wdt:P19           ?placeOfBirth.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}

People born in the same place as Douglas Adams

SELECT DISTINCT
    ?person
    ?personLabel
    ?personDescription

WHERE {
    # Douglas Adams  # Place of Birth  # Object
    wd:Q42           wdt:P19           ?placeOfBirth.
    # Subject  # Instance of  # Human
    ?person    wdt:P31        wd:Q5;
               # Place of birth/*  # Object
               wdt:P19/wdt:P131*   ?placeOfBirth.

    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}

Here's one more query to try out on the Wikidata Query Service. Can you change it to get different results? The following are great ways to find the Wikidata IDs you're looking for when rewriting the query below:

  • Search for the main item on Wikidata (in this case the European Union)
    • Check statements on the left and navigate to their PIDs
  • Use a search engine to search for Wikidata NAME_OF_ITEM, with the first result normally being the correct one
  • Use the Wikidata Query Builder to construct your query without writing SPARQL by hand

SELECT
    ?country
    ?countryLabel

WHERE {
    # Subject  # Member of  # The European Union
    ?country   wdt:P463     wd:Q458.

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}
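
As one possible modification - a sketch reusing the population property P1082 from the triple examples above - the query can also return each member state's population. The population pattern is wrapped in OPTIONAL so that members without a population statement still appear:

SELECT
    ?country
    ?countryLabel
    ?population

WHERE {
    # Subject  # Member of  # The European Union
    ?country   wdt:P463     wd:Q458.

    # Population (P1082) is optional so members without the statement are kept.
    OPTIONAL { ?country wdt:P1082 ?population. }

    SERVICE wikibase:label { bd:serviceParam wikibase:language
    "[AUTO_LANGUAGE], en". }
}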

Lexeme queries

The focus now shifts to the kind of data that's of interest to Scribe. Wikidata lexicographical data maps out lemmas (base versions of words) as LIDs and attaches all forms of the lemma as queryable points of data. Let's start with a base query:

SELECT DISTINCT
    ?lexeme
    ?lemma

WHERE {
    # Subject            # German
    ?lexeme dct:language wd:Q188 ;
        # Predicate              # Noun
        wikibase:lexicalCategory wd:Q1084 ;
        # The following is like labels above.
        wikibase:lemma ?lemma .
}

LIMIT 10

First we start with a lexeme, then use dct:language to define which language it's from, then apply a lexical category to state that we only want nouns, and at the end we ask for the lemma via wikibase:lemma (the equivalent of labels for lexemes). Removing LIMIT 10 would give us the first query of interest to Scribe: all German nouns!
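
Tying this back to the aggregation note above, counting over the same pattern gives the total number of German nouns. This is only a sketch of the idea behind such coverage checks; the actual check_language_data.sparql query in Scribe-Data may differ:

SELECT (COUNT(DISTINCT ?lexeme) AS ?germanNounCount)

WHERE {
    # German lexemes whose lexical category is noun.
    ?lexeme dct:language wd:Q188 ;
        wikibase:lexicalCategory wd:Q1084 .
}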

From the base query above we now need to get the forms (singular, plural, gender, etc.) associated with each noun. Not every lemma will have all of these data points, as they might not have been added yet or might not be grammatically valid, so we wrap the form queries in OPTIONAL blocks.

SELECT DISTINCT
    ?lexeme
    ?lemma
    ?singular
    ?plural

WHERE {
    ?lexeme dct:language wd:Q188 ;
        wikibase:lexicalCategory wd:Q1084 ;
        wikibase:lemma ?lemma .

    OPTIONAL {
        ?lexeme ontolex:lexicalForm ?singularForm .
        ?singularForm ontolex:representation ?singular ;
            wikibase:grammaticalFeature wd:Q110786 .
    }

    OPTIONAL {
        ?lexeme ontolex:lexicalForm ?pluralForm .
        ?pluralForm ontolex:representation ?plural ;
            wikibase:grammaticalFeature wd:Q146786 .
    }
}

LIMIT 10

From here we're able to create most of the queries used by Scribe by changing the language that lexemes should be associated with, changing the category of word that we need (nouns, verbs, etc.) and editing the optional form selections to include all the information about the lemma that Scribe applications need.
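
For instance, a sketch of the base query for French verbs only swaps the language and lexical category QIDs - here assuming Q150 is French and Q24905 is the verb lexical category, both worth double checking on Wikidata:

SELECT DISTINCT
    ?lexeme
    ?lemma

WHERE {
    # French instead of German.
    ?lexeme dct:language wd:Q150 ;
        # Verb instead of noun.
        wikibase:lexicalCategory wd:Q24905 ;
        wikibase:lemma ?lemma .
}

LIMIT 10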

Scribe-Data and Wikidata

Scribe-Data is a data process that interfaces with Wikidata's lexicographical data with the following functionality:

  • Defines SPARQL queries with which data can be extracted from Wikidata
    • Sometimes queries need to be broken up as there are too many results
  • Passes these queries to Wikidata via the Python library SPARQLWrapper
  • Formats extracted data and prepares them for use within Scribe applications
  • Creates SQLite databases that form the basis of language packs that are loaded into Scribe app interfaces

Functionality not related to Wikidata includes:

  • Generating Emoji-trigger word relations for emoji autosuggestions and autocompletions using Unicode CLDR data
  • Creating autosuggest dictionaries based on the most frequent words in Wikipedia and the words that most frequently follow them

Scribe-Server and Wikidata

Scribe-Server functions as an automation step that runs Scribe-Data as a package and automatically updates Wikidata-based language packs for users to then download within Scribe applications.

Further resources

The following are other resources that the community suggests to broaden your understanding of Wikidata and its use in Scribe development. Some resources from above are repeated to ensure that this section is a comprehensive list.

Wikidata documentation

Querying Wikidata

Wikidata lexemes

Tools used by Scribe