Project: eTheses for openVirus

petermr edited this page May 25, 2020 · 2 revisions

Name

eTheses for openVirus

Why needed

Because PhD theses are an under-utilised body of scholarly writing and research.

Similar/previous work

  • Structurally similar to the DOAJ work, in that the goal is to set up a new data source that we can feed into the openVirus tool chain, where the source is too large to process directly and requires some pre-processing.

Proposed work

Step 1: Create a full-text search API to find relevant theses.

We have taken the data from the EThOS service and re-used the tools of the UK Web Archive to build a full-text search API that can be used to find relevant theses.

This notebook illustrates how to use the API.
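As a rough illustration of what "using the API" might look like, the sketch below builds a Solr-style query URL and pulls document identifiers out of a Solr JSON response. The endpoint URL and the `text` field name are placeholders, not the real EThOS-search API, which is documented in the notebook.

```python
# Sketch of querying a Solr-style full-text search API for theses.
# API_URL and the "text" field are hypothetical placeholders.
import urllib.parse

API_URL = "https://example.org/ethos/select"  # NOT the real endpoint

def build_query(term: str, rows: int = 10) -> str:
    """Build a Solr-style select URL searching the full text for `term`."""
    params = {
        "q": f'text:"{term}"',  # assumed name of the full-text field
        "rows": rows,
        "wt": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def extract_ids(response_json: dict) -> list:
    """Pull document identifiers out of a standard Solr JSON response."""
    return [doc["id"] for doc in response_json["response"]["docs"]]

# Worked against a canned response, so no network access is needed:
sample = {"response": {"numFound": 2,
                       "docs": [{"id": "uk.bl.ethos.1"},
                                {"id": "uk.bl.ethos.2"}]}}
print(build_query("coronavirus", rows=5))
print(extract_ids(sample))
```

The `response.docs` shape is Solr's standard JSON response format; only the field names are guesses here.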

Step 2: Integrate into the openVirus toolkit

  • The openVirus tools need to be extended/supplemented to use the API and then download the PDFs of the relevant theses.
    • This would work in the same way as getpapers/quickscrape/ami download (/ferret?)
    • e.g. adding an ethos-api source to getpapers would be one implementation approach.
    • An alternative would be to write a Scrapy crawler that outputs a suitable CProject.
  • The whole workflow needs to be verified with a realistic/useful example.
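Whichever implementation is chosen, the downloader's job is essentially to map API hits onto a CProject layout (a directory of per-document "CTrees", each holding a fulltext.pdf) that the ami tools can consume. A minimal sketch, assuming hypothetical metadata fields (`id`, `pdf_url`):

```python
# Sketch of planning search-API hits into a CProject layout for ami.
# The record fields ("id", "pdf_url") are assumed, not the real schema;
# actual downloading of the PDFs is left out.
from pathlib import Path

def plan_cproject(docs: list, project_dir: str) -> list:
    """Map thesis records to (ctree fulltext path, pdf url) download tasks."""
    tasks = []
    for doc in docs:
        # CTree directory names must be filesystem-safe
        safe_id = doc["id"].replace("/", "_").replace(".", "_")
        ctree_pdf = Path(project_dir) / safe_id / "fulltext.pdf"
        tasks.append((ctree_pdf, doc["pdf_url"]))
    return tasks

docs = [
    {"id": "uk.bl.ethos.1", "pdf_url": "https://example.org/1.pdf"},
    {"id": "uk.bl.ethos.2", "pdf_url": "https://example.org/2.pdf"},
]
for path, url in plan_cproject(docs, "viral_theses"):
    print(path, "<-", url)
```

A Scrapy crawler would do the fetching and write into the same layout; the point of the sketch is only the hit-to-CTree mapping.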

Developers

Project page

???

Current state

Step 1 is complete: the API works well enough.

Step 2 needs to be implemented, but it's not clear how best to proceed. Andy Jackson is currently working on understanding ami download/getpapers/etc. well enough to work out what might work best.

Other ideas

Re-implement openVirus analysis directly on Solr

One idea I keep coming back to is that the core of the work done by ami search is very similar to the core of Apache Solr itself. The upshot is that, rather than treating this Solr index merely as a data source, the initial part of the ami search process could be performed directly in Solr.

(PMR comment). Yes, I am working to replace the engines in ami search and ami words with Solr.

Specifically, for each query term in each dictionary, we could:

  • Search for that term using the Solr API.
  • Export the full result set, including the text surrounding each hit.
  • Either: generate the snippets XML from that, and pass it to the rest of the ami search chain.
  • Or: generate the results tables, co-occurrence plots, etc. directly from Solr.

(PMR comment). Agreed.
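The "generate the snippets XML" branch could look roughly like the sketch below, which turns Solr's standard highlighting section into a small XML document. The `<snippets>`/`<snippet>` element names and the `text` highlight field are illustrative guesses, not the actual ami snippets schema.

```python
# Sketch: convert a Solr highlighting response into snippet XML for the
# downstream ami chain. Element and field names are illustrative only.
import xml.etree.ElementTree as ET

def snippets_xml(term: str, highlighting: dict) -> str:
    """Build <snippets> XML from Solr's per-document highlight fragments."""
    root = ET.Element("snippets", {"term": term})
    for doc_id, fields in highlighting.items():
        for fragment in fields.get("text", []):  # assumed highlight field
            snip = ET.SubElement(root, "snippet", {"doc": doc_id})
            snip.text = fragment
    return ET.tostring(root, encoding="unicode")

# Canned example shaped like Solr's "highlighting" response section:
hl = {"uk.bl.ethos.1": {"text": ["... the <em>coronavirus</em> genome ..."]}}
print(snippets_xml("coronavirus", hl))
```

Note that Solr marks hits with `<em>` tags inside the fragments; those arrive here as plain text and are escaped in the output, so a real implementation would decide whether to strip or preserve them.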
