
Running ami3 with Docker

Andy Jackson edited this page Jun 10, 2020 · 3 revisions

This page is a rough outline of how to use the AMI3 tools via Docker. Docker provides a simple way to download the whole software package, but can be a bit clumsy to work with.

First, you need Docker to be installed, of course.

Then, you need to make sure you have the most recent version of the tools:

docker pull anjackson/ami3

This might take a while. Note that this image is built from the anjackson fork of ami3, so it may sometimes be a little out of date. (We should set up a Docker build of the main repo.)

Once downloaded, you can run it like this:

docker run -it anjackson/ami3 ami --help

and the current command-line help message will show.

However, to do anything useful, we need to use a Docker volume to make our own local files available inside the container. I tend to use the following pattern:

docker run -it -v $PWD:/host anjackson/ami3 ami --help

The -v $PWD:/host bit takes your current working directory and makes it available inside the container as /host.
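If you use this pattern a lot, a small shell function saves typing. This wrapper is a convenience of my own, not part of ami3:

```shell
# Hypothetical convenience wrapper (not part of ami3 itself): lets you type
# `ami ...` as if the tools were installed locally. The current directory is
# mounted at /host, so pass project paths as /host/<name>.
ami() {
  docker run -it -v "$PWD":/host anjackson/ami3 ami "$@"
}
```

With this in place, the commands below shorten to e.g. `ami -p /host/testproject makeproject --rawfiletypes pdf`.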

Setting up an AMI CProject

For example, to run the makeproject command to set up a new empty ami3 project called testproject, you can use:

mkdir testproject
docker run -it -v $PWD:/host anjackson/ami3 ami -p /host/testproject makeproject --rawfiletypes pdf

The first command just makes a directory. The second sets up the project itself, and if there were PDFs in that folder, it would arrange them into the CProject layout.
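To make that layout concrete, here is a rough plain-shell sketch of what makeproject does with PDFs. This is an approximation only — the real command also normalises file names and handles other raw file types — and the demo input files are created here just so the example runs standalone:

```shell
# Simplified sketch of `makeproject --rawfiletypes pdf` (approximation only).
# Demo input created so the example runs standalone.
mkdir -p testproject
touch testproject/paper1.pdf testproject/paper2.pdf

cd testproject
for f in *.pdf; do
  [ -e "$f" ] || continue      # nothing to do if there are no PDFs
  dir="${f%.pdf}"              # one CTree directory per input PDF
  mkdir -p "$dir"
  mv "$f" "$dir/fulltext.pdf"  # each PDF becomes <ctree>/fulltext.pdf
done
cd ..
```

Afterwards each PDF sits in its own CTree directory as fulltext.pdf, which is the layout the later ami commands expect.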

Downloading PDFs to analyse

Using Scrapy

At anjackson/scrapy-cm there is a Python Scrapy crawler that can download PDFs based on queries to Medrxiv and the EThOS API.

First, make sure you have the up-to-date container:

docker pull anjackson/scrapy-cm

Then you can run a query like this (here querying Medrxiv for 'monte-carlo AND Tom', chosen simply because the result set is fairly small and manageable):

docker run -ti -v $PWD/testproject:/CProject anjackson/scrapy-cm scrapy crawl medrxiv -a "query=monte-carlo AND Tom"

(Note that if you are behind a web proxy, you might have to do something like this:

docker run -ti -v $PWD/testproject:/CProject -e HTTPS_PROXY=${HTTPS_PROXY} -e HTTP_PROXY=${HTTP_PROXY} anjackson/scrapy-cm scrapy crawl medrxiv -a "query=monte-carlo AND Tom"

)

Or you can use the EThOS API, like this:

docker run -ti -v $PWD/testproject:/CProject anjackson/scrapy-cm scrapy crawl ethosapi -a "query=\"climate change\" AND refugees"

This might take a while to run, but should log progress and output a list of items and the PDFs into the testproject folder, in the CProject layout.
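A quick way to check what arrived is to list the fulltext.pdf files in the project. The demo CTree below is created only so the commands run standalone; on a real crawl the CTree names come from the crawled items, so just run the find lines:

```shell
# Each crawled item should end up in its own CTree containing a fulltext.pdf.
# Demo layout created so the example runs standalone (assumption: real CTree
# names come from the crawled items).
mkdir -p testproject/example_item
touch testproject/example_item/fulltext.pdf

find testproject -name 'fulltext.pdf'           # one path per downloaded item
find testproject -name 'fulltext.pdf' | wc -l   # how many items arrived
```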

Ferret

TODO Add Ferret example?

Convert PDFs to text

The tools need access to the text, so we can use this command to generate a text version of each item:

docker run -ti -v $PWD:/host anjackson/ami3 ami -p /host/testproject pdfbox --pdf2html

Note that this can be rather slow.

Search the set of documents using dictionaries

Once there are text versions available, the ami dictionary search can be run like this:

docker run -ti -v $PWD:/host anjackson/ami3 ami -p /host/testproject search --dictionary country --dictionary funders

Output will be in testproject, with summary information in HTML at the top level, and with co-occurrence data in the __cooccurrence folder.
