
Dictionary: Overview

petermr edited this page Jul 27, 2020 · 1 revision

Dictionary Overview

purpose

Dictionaries in the openVirus project serve three purposes:

  • to identify words and phrases ("entities") in the documents (running text and images).
  • to provide (computable) links to their meaning and context ("ontologies").
  • to collect a subset of terms representing a high-level concept ("virus", "disease", "country"...).

The benefits include:

  • understanding the meanings of words.
  • background reading.
  • aggregation ("searching") for the same or related entities in the corpus (collection of documents).
  • building computable knowledge networks/graphs.
  • classifying documents.

This can be described as ontological annotations in semantic networks.

possible uses

There are many established uses of such annotations:

improved reading.

We are often put off by unfamiliar terms, e.g. "nosocomial infections". Wikipedia has an article on https://en.wikipedia.org/wiki/Hospital-acquired_infection:

A hospital-acquired infection (HAI), also known as a nosocomial infection (from the Greek "νοσοκομιακός" / "nosokomiakos", meaning "of the hospital"), is an infection that is acquired in a hospital or other health care facility.

Shown as a mouseover or footnote, such a definition can greatly improve the speed of reading.

searching and indexing.

Annotations are easily aggregated in indexes or search engines.

precision and checking.

People may confuse COVID-19 (disease) with coronavirus (a virus).

relations between entities.

As an example from Wikipedia (https://en.wikipedia.org/wiki/Coronavirus_disease_2019 )

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

This sentence links Coronavirus disease 2019 (COVID-19) to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Indeed we can write:

  • COVID-19 isA disease
  • COVID-19 isCausedBy SARS-CoV-2

Annotations in ami allow software to discover and use such relations. For example, we can find all diseases that are caused by viruses (disease isCausedBy virus).
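The two statements above can be held as subject-predicate-object triples and queried. The following is only a minimal sketch: the tuple layout, the extra statement "SARS-CoV-2 isA virus" and the function names are assumptions made for illustration, not ami's actual data model.

```python
# Illustrative triples. The first two come from the text above; the third
# (SARS-CoV-2 isA virus) is an added assumption needed for the query.
TRIPLES = [
    ("COVID-19", "isA", "disease"),
    ("COVID-19", "isCausedBy", "SARS-CoV-2"),
    ("SARS-CoV-2", "isA", "virus"),
]


def instances_of(triples, concept):
    """Return every subject stated to be an instance of `concept`."""
    return {s for s, p, o in triples if p == "isA" and o == concept}


def diseases_caused_by_viruses(triples):
    """Return sorted (disease, virus) pairs linked by isCausedBy."""
    diseases = instances_of(triples, "disease")
    viruses = instances_of(triples, "virus")
    return sorted(
        (s, o)
        for s, p, o in triples
        if p == "isCausedBy" and s in diseases and o in viruses
    )


print(diseases_caused_by_viruses(TRIPLES))
```

In practice the isA statements would more likely come from Wikidata than from a hand-written list.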

disambiguation

What's "Zika"?
https://en.wikipedia.org/wiki/Zika_(disambiguation) tells us:

Zika, or Zika fever, is an illness caused by the Zika virus.

Zika or Žika may also refer to:

  • Zika virus, a member of the Flaviviridae virus family
  • Zika Forest, a forest in Uganda
  • Zika rabbit, a breed of rabbit

People
Surname
Adolf Zika (born 1972), Czech photographer
... many more ...

We can label the different concepts by using a unique identifier system as in Wikidata.

structure of a dictionary

Dictionaries have a simple format, best supported by XML or JSON (currently mainly XML). The format defines certain elements and attributes (written as <element att1="attval1" att2="attval2" ... >). We are developing validation software. In general:

  • unknown elements are ignored
  • <desc>, <entry> and <alternative> are optional and repeatable.
  • all attributes except dictionary/@title are optional (at this stage)
  • order of elements and attributes is irrelevant (but worth making pretty and consistent)

dictionary/title

This is the root element. It carries the title, which MUST be a single word and MUST be the base of the filename, e.g. virus.xml must have the structure:

<dictionary title="virus">
...
</dictionary>

There is no XML namespace.
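The two MUST rules for the title can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree; the function name check_title and the wording of the messages are hypothetical and are not part of the validation software mentioned above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_title(xml_text, filename):
    """Return a list of problems with the dictionary title (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "dictionary":
        problems.append(f"root element is <{root.tag}>, expected <dictionary>")
    title = root.get("title")
    if title is None:
        problems.append("missing title attribute")
    else:
        # A single word contains no whitespace and is not empty.
        if len(title.split()) != 1:
            problems.append("title must be a single word")
        # The filename base is the name without directory and extension.
        if title != Path(filename).stem:
            problems.append("title does not match the filename base")
    return problems


print(check_title('<dictionary title="virus"/>', "virus.xml"))
print(check_title('<dictionary title="virus"/>', "viruses.xml"))
```

Returning a list of messages rather than raising on the first problem lets a maintainer see every issue in one pass.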

header/description

There is a header of zero or more <desc> description elements, though we may enforce mandatory elements later. These can describe metadata such as dates, maintenance, provenance, etc. They are not yet standardised but will be.

<dictionary title="virus">
    <desc date="2020-06-21" author="Peter Murray-Rust">created dictionary from Wikipedia https://en.wikipedia.org/wiki/List_of_virus_taxa after manual removal of invertebrate hosts</desc>
    <desc date="2020-06-22" author="Peter Murray-Rust">removed further non-relevant viruses (Q1234567, Q2345678 ...)</desc>
    <desc date="2020-06-23" author="Peter Murray-Rust">reassigned Wikidata IDs (Q9876543, Q9876876) for incorrect automatic assignments</desc>

</dictionary>
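Because the header is just a run of <desc> elements, its history can be read back with a few lines of code. This is a sketch only: it assumes the date and author attributes shown in the example (which are not yet standardised), the SAMPLE text is shortened from that example, and header_history is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Shortened from the example above; deliberately out of date order.
SAMPLE = """<dictionary title="virus">
    <desc date="2020-06-22" author="Peter Murray-Rust">removed further
    non-relevant viruses</desc>
    <desc date="2020-06-21" author="Peter Murray-Rust">created dictionary from Wikipedia</desc>
</dictionary>"""


def header_history(xml_text):
    """Return the dictionary-level <desc> elements as dicts, oldest first."""
    root = ET.fromstring(xml_text)
    records = [
        {
            "date": desc.get("date"),
            "author": desc.get("author"),
            # Collapse line breaks and repeated spaces inside the text.
            "text": " ".join((desc.text or "").split()),
        }
        for desc in root.findall("desc")  # direct children only
    ]
    # ISO dates (YYYY-MM-DD) sort correctly as strings; undated notes go first.
    return sorted(records, key=lambda r: r["date"] or "")


for record in header_history(SAMPLE):
    print(record["date"], record["author"], record["text"])
```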

entry/body

The main components of a dictionary are its entries, whose format is still evolving slightly. An entry is a well-defined object which can normally be mapped / linked to a Wikidata item. This gives it a unique identifier (Q-number), removing the need to maintain identifiers ourselves. A typical entry (with the new element alternative and more use of desc with new attributes):

<dictionary title="miniterpenes">
  <entry term="borneol" wikipedia="borneol" wikidata="Q27089413" name="(-)-borneol" description="chemical compound" id="CM.myterpenes.0" term.hi="बोर्निऑल" term.it="borneolo" term.zh="冰片" regex="(\([+-]\)\-)?[Bb]orneol">
    <desc date="2020-07-22">added Bornyl-alcohol synonym</desc>
    <alternative>(-)-Bornyl alcohol</alternative>
  </entry>
...
</dictionary>

entry attributes

  • the term is the unique lexical string (word) defining the entry. Terms are always lowercase and always start with a letter. The term may or may not be the linguistic entity in documents.
  • the name is the preferred name for the term. It is case-sensitive and will often occur in text. The name and term may or may not be identical.
  • term.xx can occur as language equivalents where xx is the appropriate 2- or 3-letter language code. See https://en.wikipedia.org/wiki/ISO_639-2. These can often be picked up from the links to Wikipedia pages from a Wikidata item (bottom of page). (Experimental).
  • regex is a regular expression for locating possible matches in text. This one finds (-)-borneol, (+)-borneol, and borneol.
  • description is a human-readable string describing the entry. However it is often created directly from Wikidata and may be used for grouping or disambiguation.
  • wikipedia is the name of the Wikipedia page. It is often the term (for single words). It cannot contain spaces (Wikipedia page URLs use underscores instead) and may have escaped punctuation. It resolves to (e.g. for EN) https://en.wikipedia.org/wiki/<wikipedia>.
  • wikidata is the identifier of the Wikidata item, always of the form Qddd... (occasionally Pddd...). It resolves to https://wikidata.org/wiki/<wikidata>. There is only one identifier for a Wikidata item and the relationships and graphs are language-independent.
  • id is a local autogenerated ID and is not stable.
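The regex, wikipedia and wikidata attributes can be exercised directly. The sketch below writes the borneol pattern as (\([+-]\)-)?[Bb]orneol, which matches the three forms listed above; the function names are hypothetical, and the URL templates are the ones given in the attribute descriptions.

```python
import re

# Pattern for the borneol entry: an optional "(+)-" or "(-)-" prefix,
# then "borneol" with an upper- or lower-case first letter.
BORNEOL = re.compile(r"(\([+-]\)-)?[Bb]orneol")


def find_matches(pattern, text):
    """Return every substring of `text` matched by the compiled pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


def wikipedia_url(page, lang="en"):
    """Resolve a wikipedia attribute value for the given language edition."""
    return f"https://{lang}.wikipedia.org/wiki/{page}"


def wikidata_url(item_id):
    """Resolve a wikidata attribute value, rejecting malformed identifiers."""
    if not re.fullmatch(r"[QP]\d+", item_id):
        raise ValueError(f"not a Wikidata identifier: {item_id!r}")
    return f"https://wikidata.org/wiki/{item_id}"


text = "Both (-)-borneol and (+)-borneol occur; Borneol is a terpene."
print(find_matches(BORNEOL, text))
print(wikipedia_url("borneol"))
print(wikidata_url("Q27089413"))
```

The pattern is unanchored, so it would also match inside longer words; a production matcher would probably add boundary checks.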

children of entry

We are introducing two children of entry:

  • <desc> has the same semantics as <desc> for dictionary.
  • <alternative> gives an alternative lexical form for the term. There are deliberately no semantics: alternatives may or may not be exact synonyms, and may or may not be narrower/broader terms. These ontological relations can often be obtained from Wikidata.
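Both children can be collected per entry with the standard library. This is a sketch under the structure described here; the SAMPLE text is cut down from the earlier borneol example and entry_children is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dictionary title="miniterpenes">
  <entry term="borneol" name="(-)-borneol" wikidata="Q27089413">
    <desc date="2020-07-22">added Bornyl-alcohol synonym</desc>
    <alternative>(-)-Bornyl alcohol</alternative>
  </entry>
</dictionary>"""


def entry_children(xml_text):
    """Map each entry's term to its <desc> notes and <alternative> forms."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.findall("entry"):
        result[entry.get("term")] = {
            "desc": [(d.text or "").strip() for d in entry.findall("desc")],
            "alternative": [
                (a.text or "").strip() for a in entry.findall("alternative")
            ],
        }
    return result


print(entry_children(SAMPLE))
```

Any other child elements are simply not looked at, which is consistent with the rule that unknown elements are ignored.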

using dictionaries

  • dictionaries will provide search terms (term, name, regex, alternative) for ami, Lucene/Solr or KNIME.
  • dictionaries provide a link to Wikipedia pages or Wikidata Items. Annotation software can create hyperlinks for humans to follow.
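As an illustration of both uses together, the sketch below turns an entry's term, name and alternatives into search strings and wraps each hit in a hyperlink to the Wikidata item. It is not how ami, Lucene/Solr or KNIME work internally; the ENTRIES list, the dict layout and the function names are assumptions made for the example.

```python
import re

# Hypothetical in-memory form of one dictionary entry.
ENTRIES = [
    {
        "term": "borneol",
        "name": "(-)-borneol",
        "wikidata": "Q27089413",
        "alternatives": ["(-)-Bornyl alcohol"],
    },
]


def build_lookup(entries):
    """Map each lower-cased surface form to its Wikidata identifier."""
    lookup = {}
    for entry in entries:
        forms = [entry["term"], entry["name"], *entry.get("alternatives", [])]
        for form in forms:
            lookup[form.lower()] = entry["wikidata"]
    return lookup


def annotate(text, entries):
    """Wrap every known surface form in `text` in an HTML link to Wikidata."""
    lookup = build_lookup(entries)
    if not lookup:
        return text
    # Longest forms first so "(-)-borneol" is preferred over plain "borneol".
    forms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(f) for f in forms), re.IGNORECASE)

    def link(match):
        found = match.group(0)
        qid = lookup.get(found.lower())
        if qid is None:  # case folding mismatch; leave the text unchanged
            return found
        return f'<a href="https://wikidata.org/wiki/{qid}">{found}</a>'

    return pattern.sub(link, text)


print(annotate("Contains (-)-Bornyl alcohol and borneol.", ENTRIES))
```

The sketch matches literal strings only, without word boundaries or HTML escaping; the regex attribute could be used in place of the literal forms where an entry supplies one.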

creating dictionaries

Conventional dictionaries take a lot of effort to create and maintain, particularly if they contain ontological relationships, and often only specialist maintainers can do this. ContentMine dictionaries avoid this by reducing the task to selecting relevant terms. Often this selection has already been made, in Wikipedia pages or other collections. Many dictionaries are thus "views" (subsets) of Wikidata. There are several ways of creating them.

word lists

Create a list of terms that you think are relevant.
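A word list can be turned into a skeleton dictionary in the format described above. The sketch below is one possible approach, not the project's own tooling: dictionary_from_words is a hypothetical name, the term is taken to be the lower-cased word and the name the word as written, and Wikidata identifiers are assumed to be added in a later step.

```python
import xml.etree.ElementTree as ET


def dictionary_from_words(title, words):
    """Build a <dictionary> element with one <entry> per distinct word."""
    root = ET.Element("dictionary", {"title": title})
    seen = set()
    for word in words:
        name = word.strip()
        term = name.lower()  # terms are always lowercase
        if not term or term in seen:
            continue  # skip blank lines and duplicates
        seen.add(term)
        ET.SubElement(root, "entry", {"term": term, "name": name})
    return root


root = dictionary_from_words(
    "virus", ["Zika virus", "SARS-CoV-2", "zika virus", ""]
)
print(ET.tostring(root, encoding="unicode"))
```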
