
How ami search works

Richard Light edited this page Apr 14, 2020 · 12 revisions

Work In Progress!

Overview

AMI is a toolset for querying and analyzing a small-to-medium (up to 10,000) collection of documents, normally on local storage. It includes tools for downloading scientific papers, processing documents into sections and XML, analyzing components (text, tables, diagrams), creating dictionaries, and searching.

Getting Started

Software Installation

This document assumes you have AMI installed. If you haven't, please refer to https://github.com/petermr/ami3/blob/master/INSTALL.md or AMI installation.

Concepts In a Nutshell

CProject and CTrees

FIXME: @petermr, please review and update the below.

A CProject is a directory structure that the AMI toolset uses to gather and process data. You would usually create a new CProject for each new question you want to answer (for example, "How effective are masks at protecting against COVID-19?").

Question: do I need to create a project before running getpapers?

A CTree is a subdirectory of a CProject that deals with a single paper. It contains the original paper (for example, a PDF), and the files created by the AMI toolset to analyse and transform the paper into other formats.

Use the ami-makeproject tool to create a project. Run the tool with --help for details.

FIXME: you no longer need to do this, because getpapers creates the output directory you specify if it doesn't already exist. Remove?

WARN: A CProject can contain hundreds of CTrees and become very large. Be careful about committing a CProject to git!
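A quick way to see how large a CProject has become before committing it is to count its CTrees and total up the file sizes. The following is a minimal sketch, not part of the AMI toolset. It assumes that every child directory of the CProject is a CTree unless its name starts with "__" or "."; the function names are hypothetical.

```python
from pathlib import Path


def list_ctrees(cproject):
    """Return the CTree directories of a CProject, sorted by name.

    Assumption: any child directory whose name does not start with
    "__" (reserved) or "." (e.g. .git) is a CTree. Plain files are ignored.
    """
    return sorted(
        child for child in Path(cproject).iterdir()
        if child.is_dir() and not child.name.startswith(("__", "."))
    )


def directory_size(root):
    """Total size in bytes of all regular files below root."""
    return sum(f.stat().st_size for f in Path(root).rglob("*") if f.is_file())
```

Calling `len(list_ctrees(project))` and `directory_size(project)` on a project directory gives a rough idea of whether it is sensible to put it under version control.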

Dictionaries

FIXME: @petermr, please review and update the below.

A dictionary is a structured set of terms used by the AMI toolset to create a set of in-context snippets and occurrence counts for each term in each document in a result set. Dictionaries are stored in XML or JSON format. Here is an example.

The AMI toolset contains many built-in dictionaries, and ContentMine has dictionaries, but you may need to create custom dictionaries to do ??? because ???.

Use ami-dictionary create to create a new dictionary. See https://github.com/petermr/openVirus/wiki/Creating-Dictionaries-from-Wikipedia-pages for details.

TODO: Any pitfalls, things to be careful about?

Workflow

See the Overview page for an example that walks through the steps for "Search for N95 (masks) on EuropePMC".

It details some commonly used commands and their output.

Would it be an idea to have a diagram with common workflows?

makeproject
    |
    V
getpapers
    |
    V
dictionary create
    |
    V
 search
    |
    V
   pdf

ami-search

Given we have a valid CProject folder, with text or HTML versions of each item, ami-search can be used to analyse all the items and aggregate the results. The overall data flow is:

  1. Make sure there is a scholarly.html. If not, convert fulltext.xml to scholarly.html using ami-transform and a stylesheet (nlm2html.xsl, I think). This uses a make-like strategy, i.e. it only converts once.
  2. Set up an empty full.dataTables.html based on dataTables.js. This is 6 years old:
  <link rel="stylesheet" type="text/css" href="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/css/jquery.dataTables.css"/>
  <script src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.8.2.min.js" charset="UTF-8" type="text/javascript"> </script>
  <script src="http://ajax.aspnetcdn.com/ajax/jquery.dataTables/1.9.4/jquery.dataTables.min.js" charset="UTF-8" type="text/javascript"> </script>
  3. Extract the bibliographic metadata from each fulltext.html into (a) col1: links to sources and full text (b) col2: abbreviated bibliography for title and abstract. NEEDS: decent mouseover and readable display.
  4. Read the list of dictionaries (--dictionary option). NEEDS to check existence of dictionary.
  5. Iterate over fulltext.html with each dictionary. Capture each hit as a "word in context" snippet. Context is "pre", "exact" (match), "post". These are limited in length to 200 chars. A hit with pre and post has enough information to locate it in the document, since absolute coordinates are fragile (see W3C annotation spec). Snippets are XML and listed in files named results/search/<dictionaryName>/results.xml or empty.xml. The empty.xml means no results (to make it easier to search without reading XML; ugly).
  6. Calculate word frequencies independently of dictionaries. ami-search reads stopword files for (a) common EN words, (b) common words in scholarly discourse (e.g. "journal", "method"...) and omits these. Words are split at whitespace. Stemming is applied (not sure how) and repeat words are stored in caches (Bloom filters; I think I implemented them).
  7. Read snippets XML and generate frequencies/counts XML etc. per item. These are copied to the appropriate cell of the dataTables.html with a frequency cut-off. NEED much better display with mouseover, histograms or other icons, links back to fulltext.
  8. Read per-item XML data and generate top-level/summary XML or CSV (the latter for co-occurrence data). These data are direct children of CProject. This is messy and there should probably be a __summary directory child of CProject. Note: the only children of CProject should be either <CTree>s or __*. If we redesigned it, it would be better to have the CTrees in a __ctrees directory, but 5 years ago we didn't know where we'd be.
  9. Generate HTML versions of XML and CSV for use.
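
The "word in context" snippet step and the word-frequency step in the list above can be illustrated with a short Python sketch. This is not the Java implementation used by ami-search. It assumes plain text input, whole-word case-insensitive matching, and simple punctuation stripping, and it leaves out stemming and the Bloom-filter caches. The function names are hypothetical.

```python
import re
from collections import Counter

CONTEXT_CHARS = 200  # length limit on "pre" and "post", as described in the list


def find_snippets(text, terms, context=CONTEXT_CHARS):
    """Return one {"pre", "exact", "post"} dict per hit of any term.

    Assumption: terms start and end with a word character, so that
    whole-word matching with \\b behaves as expected.
    """
    snippets = []
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start, end = match.span()
            snippets.append({
                "pre": text[max(0, start - context):start],
                "exact": match.group(0),
                "post": text[end:end + context],
            })
    return snippets


def word_frequencies(text, stopwords):
    """Count words split at whitespace, omitting stopwords (no stemming)."""
    counts = Counter()
    for token in text.split():
        word = token.strip(".,;:!?()[]\"'").lower()
        if word and word not in stopwords:
            counts[word] += 1
    return counts
```

For example, `find_snippets(text, ["N95"])` returns a list of snippets whose "pre" and "post" strings are enough to relocate each hit in the text without storing character offsets.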

Example

See https://github.com/petermr/openVirus/tree/master/examples/n95

Implementation Details

...Rough statement of how the process is implemented, where it has scaling/performance consequences?...

See https://github.com/petermr/ami3/blob/235cf550111d8bdaa5d8e50d6f1e8c0142bd9f35/src/main/java/org/contentmine/ami/tools/AbstractAMISearchTool.java#L208-L212

Implemented in Java (AMISearchTool) but, unfortunately, it uses an old pre-picocli command line for message-passing. This needs to be rewritten, but it works. Scaling: there may be memory leaks for thousands of files (I think we create a CTreeList of the hits and this is not completely flushed). Otherwise it should be O(n) for time and O(1) for space.
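
The O(n) time and O(1) space goal depends on handling one CTree at a time and keeping only running totals, rather than holding a list of all hits. The sketch below illustrates that pattern in Python; it is not how AMISearchTool is written. It assumes each CTree holds a scholarly.html file and counts plain substring occurrences in the raw file, markup included. The function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def iter_ctree_texts(cproject, filename="scholarly.html"):
    """Yield (ctree_name, text) one CTree at a time.

    Only one document is held in memory at any moment. CTrees that
    lack the file are skipped.
    """
    for ctree in sorted(p for p in Path(cproject).iterdir() if p.is_dir()):
        source = ctree / filename
        if source.is_file():
            yield ctree.name, source.read_text(encoding="utf-8", errors="replace")


def count_term_hits(cproject, terms):
    """Accumulate per-term totals without retaining document text or hit lists."""
    totals = Counter()
    for _name, text in iter_ctree_texts(cproject):
        lowered = text.lower()
        for term in terms:
            # Case-insensitive substring count; a sketch, not dictionary matching.
            totals[term] += lowered.count(term.lower())
    return totals
```

Memory use here grows with the number of terms, not with the number of CTrees, which is the property the scaling note above is asking for.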

Interpreting Output

(more later)
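
One thing that can be read off the output without parsing any XML is whether a dictionary produced hits for a given item, because each search leaves either a results.xml or an empty.xml under results/search/<dictionaryName>/. The following Python sketch tabulates this per CTree. It assumes those paths are relative to each CTree directory and that names starting with "__" or "." are not CTrees. The function names and the "country" dictionary name in the usage note are hypothetical.

```python
from pathlib import Path


def hit_status(ctree, dictionary_name):
    """Classify one CTree for one dictionary as "hits", "empty" or "not searched"."""
    search_dir = Path(ctree) / "results" / "search" / dictionary_name
    if (search_dir / "results.xml").is_file():
        return "hits"
    if (search_dir / "empty.xml").is_file():
        return "empty"
    return "not searched"


def summarise(cproject, dictionary_name):
    """Map each CTree name to its status for the given dictionary."""
    return {
        child.name: hit_status(child, dictionary_name)
        for child in sorted(Path(cproject).iterdir())
        if child.is_dir() and not child.name.startswith(("__", "."))
    }
```

For instance, `summarise(project, "country")` would return a dict such as `{"PMC1": "hits", "PMC2": "empty"}` for a project that has been searched with a dictionary of that name.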

More Information

The documentation that resulted from Peter's Tigr3ess workshop in Delhi may be useful. For example:
