
Dictionaries: creation from Wikipedia pages

petermr edited this page Jul 27, 2020 · 1 revision

Overview

There are at least these ways of creating dictionaries (see https://github.com/petermr/tigr2ess/blob/master/dictionaries/TUTORIAL.md)

  • from a list of terms (words and phrases). AMI will then look these up in Wikidata
  • by using SPARQL on Wikidata (see WikiFactMine in tutorial)

and from Wikipedia

  • from Wikipedia Templates. These are at the bottom of the page and often have a good list of concepts.
  • from Wikipedia Categories. These are pages with lists of pages with a common theme. The quality varies widely.
  • from Wikipedia Lists. Some pages are titled "List of X" and are generally very useful. Other pages contain in-page lists, whose quality is variable.
  • from tables in Wikipedia Pages. These are like in-page lists, but may have additional links.
  • from Wikipedia Pages. This scrapes all links. There is a lot of noise.

We will use https://en.wikipedia.org/wiki/Medical_procedure . Have a close look at the page before running AMI. It consists mainly of lists, but the resulting dictionary will need editing.

AMI creates a single dictionary and looks up Wikidata QIDs. This will ALWAYS need editing to remove noise (omit rows) and also remove ambiguities.

Running ami-dictionary

ami-dictionary has several top-level subcommands:

create,
display,
help,
search,
translate,

We use create. There are several mandatory and optional flags. The command can be written on one line, but here we also show it split with \ line continuations to make it clearer. The complete command on one line:

ami-dictionary create --informat wikipage --input https://en.wikipedia.org/wiki/Medical_procedure --dictionary medproc --directory /Users/pm286/dictionaries --outformats xml

or more readably with line breaks (the shell does not allow a comment after \, so each part is explained below the command)

ami-dictionary \
        create \
        --informat wikipage \
        --input https://en.wikipedia.org/wiki/Medical_procedure \
        --dictionary medproc \
        --directory /Users/pm286/dictionaries \
        --outformats xml

  • ami-dictionary: the command
  • create: the subcommand
  • --informat wikipage: read from a Wikipedia page
  • --input: full URL of the page (note `_` and NOT space)
  • --dictionary medproc: name of the dictionary (should be unique, more later)
  • --directory: where the dictionary goes. Again, more later
  • --outformats xml: output format

Output

The command

$ ami-dictionary create --informat wikipage --input https://en.wikipedia.org/wiki/Medical_procedure --dictionary medproc --directory /Users/pm286/dictionaries --outformats xml

The input parameters are echoed

Generic values (AMIDictionaryTool)
================================
-v to see generic values
oldstyle            true

Specific values (AMIDictionaryTool)
================================
baseUrl       https://en.wikipedia.org/wiki
booleanQuery  false
descriptions  null
dataCols      null
dictionary    [[medproc]]
dictionaryTop     /Users/pm286/dictionaries
hrefCols      null
inputs        null
input         https://en.wikipedia.org/wiki/Medical_procedure
informat      wikipage
dictInformat  null
linkCol       null
log4j         null
nameCol       null
operation     create
outformats    [xml]
query         10
search        null
searchfile    null
splitCol      ,
templatea     null
termCol       null
terms         null
termfile      null
title         null
urlref        null
wikiLinks     [wikipedia, wikidata]
wptype        null

These debug lines can be ignored:
0    [main] DEBUG org.contentmine.ami.tools.AMIDictionaryTool  - extracting hyperlinks
0 [main] DEBUG org.contentmine.ami.tools.AMIDictionaryTool  - extracting hyperlinks

Diagnostics are printed while each Wikidata item is retrieved. Each character takes a second or so, so the whole run takes a few minutes.

N 171; T 171
.!........!!.........!..........!..!..........!....!!!....!!.!............!.!...!.!....!.!...!..!.!.!!!!.!.!.!.....!........!!!.....!...!!....!...!!..!............!.......++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>> medproc
>> dict medproc
writing dictionary to /Users/pm286/dictionaries/medproc.xml

I think the following output is a bug

Missing wikipedia: :

Bloodtest; Cancervaccine; Cardiopulmonaryresuscitation; Celltherapy; Contrast-enhancedultrasound; Diffusion-weightedimaging; Electrical impedancetomography; Electroconvulsivetherapy; 
Endoluminal capsulemonitoring; Enzyme replacementtherapy; Esophageal motilitystudy; Extracorporeal membraneoxygenation; Fluid replacementtherapy; Functional magneticresonance imaging; Generalsurgery; Genetherapy; 
Gynecologicultrasonography; Handsurgery; Heattherapy; Hormone replacementtherapy; Hyperbaric oxygentherapy; Immunosuppressivetherapy; Insulin potentiationtherapy; Knee cartilagereplacement therapy; 
Localanesthesia; Medicalimaging; Monoclonal antibodytherapy; Negative Pressure WoundTherapy; Nicotine replacementtherapy; Obstetricultrasonography; Opiate replacementtherapy; Oxygentherapy; 
Phagetherapy; Physical therapy/Physiotherapy; Positron emissiontomography; Protontherapy; Stooltest; Transcutaneouselectrical nerve stimulation; Unsealed sourceradiotherapy; Visiontherapy; 
diagnosing; 

Missing wikidata: :

; Cancervaccine; Cardiopulmonaryresuscitation; Contrast-enhancedultrasound; Diffusion-weightedimaging; Electrical impedancetomography; Electroconvulsivetherapy; Endoluminal capsulemonitoring; 
Enzyme replacementtherapy; Esophageal motilitystudy; Extracorporeal membraneoxygenation; Fluid replacementtherapy; Functional magneticresonance imaging; Gynecologicultrasonography; Heattherapy; Hormone replacementtherapy; 
Immunosuppressivetherapy; Insulin potentiationtherapy; Knee cartilagereplacement therapy; Localanesthesia; Medicalimaging; Monoclonal antibodytherapy; Nicotine replacementtherapy; Obstetricultrasonography; 
Opiate replacementtherapy; Phagetherapy; Phototerapy; Stooltest; Transcutaneouselectrical nerve stimulation; Unsealed sourceradiotherapy; Visiontherapy;

Generated dictionary

The output in /Users/pm286/dictionaries/medproc.xml is

<?xml version="1.0" encoding="UTF-8"?>
<dictionary title="https://en.wikipedia.org/wiki/Medical_procedure">
 <entry term="Ablation" url="/wiki/Ablation" wikidata="Q1806547" name="‎laser ablation‎" description="process that removes material from an object by heating it with a laser" id="CM.medproc.0" wikipedia="Ablation"/>
 <entry term="Acupuncture" url="/wiki/Acupuncture" wikidata="Q121713" name="‎acupuncture‎" description="an alternative medicine practice involving insertion of fine needles" id="CM.medproc.1" wikipedia="Acupuncture"/>
 <entry term="Amputation" url="/wiki/Amputation" wikidata="Q477415" name="‎amputation‎" description="removal of a body extremity by trauma, prolonged constriction, or surgery" id="CM.medproc.2" wikipedia="Amputation"/>

</dictionary>
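
Before editing, it can help to list what AMI produced. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and the entry attributes shown above. The helper name list_entries and the shortened inline sample are illustrative and not part of AMI.

```python
import xml.etree.ElementTree as ET

# Shortened inline sample in the format shown above (name/description omitted).
SAMPLE = """<dictionary title="https://en.wikipedia.org/wiki/Medical_procedure">
 <entry term="Ablation" url="/wiki/Ablation" wikidata="Q1806547" id="CM.medproc.0" wikipedia="Ablation"/>
 <entry term="Acupuncture" url="/wiki/Acupuncture" wikidata="Q121713" id="CM.medproc.1" wikipedia="Acupuncture"/>
 <entry term="Amputation" url="/wiki/Amputation" wikidata="Q477415" id="CM.medproc.2" wikipedia="Amputation"/>
</dictionary>"""


def list_entries(root):
    """Return (term, wikidata QID or None) for every <entry> under root."""
    return [(e.get("term"), e.get("wikidata")) for e in root.iter("entry")]


# For a real file, use ET.parse(path).getroot() instead of ET.fromstring.
root = ET.fromstring(SAMPLE)
for term, qid in list_entries(root):
    print(f"{term}\t{qid}")
```

Printing the term next to its QID makes it easier to spot entries where the Wikidata lookup went wrong or is missing.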

I now hand-edit the dictionary to remove noise. The following entries were removed because they are not procedures or because their terms contain punctuation.

 <entry term="blood pressure" url="/wiki/Blood_pressure" wikidata="Q82642" name="‎blood pressure‎" description="pressure exerted by circulating blood upon the walls of blood vessels" id="CM.medproc.11" wikipedia="Blood_pressure"/>
 
 
 <entry term="Physical therapy/Physiotherapy" url="/wiki/Physical_therapy" wikidata="Q186005" name="‎physiotherapy‎" description="a health profession that aims to address the illnesses or injuries that limit a person's abilities to function in everyday lives" id="CM.medproc.125" wikipedia="Physical_therapy"/>
 
 <entry term="Screening (medicine)" url="/wiki/Screening_(medicine)" wikidata="Q68422575" name="‎[Screening medicine]‎" description="scientific article published on 01 January 1988" id="CM.medproc.140" wikipedia="Screening_(medicine)"/>
 
 <entry term="Targeted therapy" url="/wiki/Targeted_therapy" wikidata="Q492646" name="‎targeted therapy‎" description="drug treatment which interacts with or blocks synthesis of specific cellular components, to impede the biochemical dysfunction involved in progression of the disease" id="CM.medproc.150" wikipedia="Targeted_therapy"/>
 
 <entry term="Vital signs" url="/wiki/Vital_signs" wikidata="Q1067560" name="‎vital signs‎" description="group of the 4-6 important medical signs that indicate the status of the body’s vital functions" id="CM.medproc.166" wikipedia="Vital_signs"/>
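
The punctuation part of this clean-up can be partly automated. This is a sketch only, under the assumption that a "clean" term contains nothing but ASCII letters, digits, spaces and hyphens; that rule, and the names has_punctuation and drop_punctuated, are my own illustration and not AMI behaviour. Entries that are not procedures (such as "Vital signs") still need a human decision.

```python
import re
import xml.etree.ElementTree as ET

# Terms taken from the entries above; other attributes omitted for brevity.
SAMPLE = """<dictionary title="medproc">
 <entry term="Ablation" url="/wiki/Ablation"/>
 <entry term="Physical therapy/Physiotherapy" url="/wiki/Physical_therapy"/>
 <entry term="Screening (medicine)" url="/wiki/Screening_(medicine)"/>
 <entry term="Vital signs" url="/wiki/Vital_signs"/>
</dictionary>"""

# Assumption: only ASCII letters, digits, spaces and hyphens count as clean.
CLEAN_TERM = re.compile(r"[A-Za-z0-9 -]+")


def has_punctuation(term):
    """True if the term contains anything outside the assumed clean set."""
    return CLEAN_TERM.fullmatch(term) is None


def drop_punctuated(root):
    """Remove <entry> children with punctuated terms; return the removed terms."""
    removed = []
    # findall returns a separate list, so removing while looping is safe.
    for entry in root.findall("entry"):
        term = entry.get("term", "")
        if has_punctuation(term):
            root.remove(entry)
            removed.append(term)
    return removed


root = ET.fromstring(SAMPLE)
removed = drop_punctuated(root)
kept = [e.get("term") for e in root.findall("entry")]
print("removed:", removed)
print("kept:", kept)
```

Reviewing the removed list before saving is advisable, since the rule would also reject legitimate terms with accented letters.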

Many entries were wrongly resolved in Wikidata to a "scientific article" item. For these, the incorrect Wikidata fields were removed but the entry itself was kept.
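
A sketch of how such false hits might be stripped automatically, assuming that the Wikidata fields are the wikidata, name and description attributes and that a false hit can be recognised by a description starting with "scientific article". Both assumptions, and the function name strip_article_hits, are mine. The sample reuses the one entry shown above with such a description, purely as an illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dictionary title="medproc">
 <entry term="Acupuncture" url="/wiki/Acupuncture" wikidata="Q121713" name="acupuncture" description="an alternative medicine practice involving insertion of fine needles"/>
 <entry term="Screening (medicine)" url="/wiki/Screening_(medicine)" wikidata="Q68422575" name="[Screening medicine]" description="scientific article published on 01 January 1988"/>
</dictionary>"""

# Assumption: these attributes are the ones filled in from Wikidata.
WIKIDATA_ATTRS = ("wikidata", "name", "description")


def strip_article_hits(root):
    """Drop Wikidata attributes from false hits but keep the entries themselves."""
    cleaned = []
    for entry in root.iter("entry"):
        if entry.get("description", "").startswith("scientific article"):
            for attr in WIKIDATA_ATTRS:
                entry.attrib.pop(attr, None)
            cleaned.append(entry.get("term"))
    return cleaned


root = ET.fromstring(SAMPLE)
cleaned = strip_article_hits(root)
print("stripped Wikidata fields from:", cleaned)
```

The term, url and any other attributes are left alone, so the entry can still be used for searching and can be given a correct QID later.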

BUGS

Wikidata is NOT properly looked up.

We should use the "Wikidata item" link on the Wikipedia page. Because AMI does not yet do this, we have a number of false hits.

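The dictionary needs a correct title

The title must be the SAME as the filename (without ".xml"). If it is not, searching fails completely, and silently. In the generated file above the title is the page URL, so it has to be changed to medproc.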
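
The title rule can be checked, and repaired, with a few lines of code. This is a hedged sketch: the helpers title_matches_filename and fix_title are hypothetical names, and the only rule encoded is the one stated above (title equals the filename without ".xml").

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Root element as generated above, with the page URL as title.
SAMPLE = """<dictionary title="https://en.wikipedia.org/wiki/Medical_procedure">
 <entry term="Ablation" url="/wiki/Ablation"/>
</dictionary>"""


def title_matches_filename(root, path):
    """True if the dictionary title equals the filename without its extension."""
    return root.get("title") == Path(path).stem


def fix_title(root, path):
    """Set the title to the filename stem and return the new title."""
    root.set("title", Path(path).stem)
    return root.get("title")


path = "/Users/pm286/dictionaries/medproc.xml"
root = ET.fromstring(SAMPLE)

before = title_matches_filename(root, path)
new_title = fix_title(root, path)
after = title_matches_filename(root, path)
print(before, new_title, after)
```

For a real dictionary, the root would come from ET.parse(path) and the tree would be written back with its write method after the fix.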
