
medrxiv biorxiv download

Dmitri Zaitsev edited this page May 6, 2020 · 8 revisions

medrxiv and biorxiv

These *rxivs have no API but can be queried through RESTful URLs. The download process is demonstrated in AMIDownloadTool and AMIDownloadTest. These examples should be regarded as a proof of concept (PoC); our current strategy is to use Ferret where possible.
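The search URL that appears in the log further down this page can be reproduced with a few lines of Java. This is only a sketch inferred from that log output, not the code inside AMIDownloadTool: the class and method names are hypothetical, and the double URL-encoding of the `+` separator is an assumption based on the `%252B` seen in the logged URL.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; reproduces the URL shape seen in the ami log. */
final class RxivSearchUrl {
    private static final String BASE = "https://www.medrxiv.org/search/";

    /** Joins the query tokens with '+' and URL-encodes twice ('+' -> %2B -> %252B). */
    static String encodeQuery(String... tokens) {
        String joined = String.join("+", tokens);
        String once = URLEncoder.encode(joined, StandardCharsets.UTF_8);
        return URLEncoder.encode(once, StandardCharsets.UTF_8);
    }

    /** Builds the search URL for a zero-based page index. */
    static String build(int pageSize, int pageIndex, String... tokens) {
        return BASE + "\"" + encodeQuery(tokens) + "\""
                + "%20sort%3Arelevance-rank"
                + "%20numresults%3A" + pageSize
                + "?page=" + pageIndex;
    }
}
```

Calling `RxivSearchUrl.build(20, 0, "ebola", "AND", "n95")` yields the same URL that the log reports being passed to curl. `URLEncoder.encode(String, Charset)` needs Java 10 or later.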

overview

These rxivs work in 3 or 4 steps when run by a human:

  1. search/query generates a paged hitlist (e.g. 25 hits per page).
  2. foreach hitlist link create a landingpage.
  3. foreach landingpage retrieve (a) fulltext.html (b) fulltext.pdf
  4. (optional) foreach fulltext.html retrieve supplemental files

desired operation

This should work in a similar way to getpapers:

download -q "my query" -o myproject --site medrxiv --limit 100

should generate a directory of myproject containing

  • metadata.json and a logfile
  • and 100 subdirectories (named from URLs) each containing
    1. fulltext.pdf if it exists,
    2. metadata.json if it exists
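As an illustration of "named from URLs", the directory names in the ami output later on this page look like the landing-page path with the `/content/` prefix removed and every non-alphanumeric character replaced by `_`. The sketch below reproduces that pattern; it is inferred from the logged names, and the class and method names are hypothetical.

```java
/** Hypothetical helper: maps a landing-page path to a per-article directory name. */
final class CTreeNames {
    private static final String PREFIX = "/content/";

    /** "/content/10.1101/2020.04.24.20073973v1" becomes "10_1101_2020_04_24_20073973v1". */
    static String fromLandingPath(String path) {
        String id = path.startsWith(PREFIX) ? path.substring(PREFIX.length()) : path;
        // Replace every character that is not an ASCII letter or digit with '_'.
        return id.replaceAll("[^A-Za-z0-9]", "_");
    }
}
```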

problems

  • HTML is created in all stages. It may sometimes be lazy-loaded, which requires a headless browser. getpapers uses a headless browser (phantom.js), but neither is now supported.
  • a true API would support most of steps 1-3. Many sites do not provide one.
  • there are problems of command options, directory creation and others that are somewhat independent of downloading
  • downloading can be slow, especially if each resource is loaded separately, but there may be no alternative.

ami download

ami download is run in org.contentmine.ami.tools.AMIDownloadTest.testMedrxivDownload(). The code is:

	@Test
	public void testMedrxivDownload() {
		String args;
		// project directory for this query
		String project = "target/medrxiv/ebola";
		// remove output from any previous run
		args = "-p " + "target"
				+ " clean"
				+ " medrxiv/";
		AMI.execute(args);
		// search medrxiv and download the PDF fulltext for each hit
		args =
				"-p " + project
				+ " download"
				+ " --site medrxiv"
				+ " --query \"ebola AND n95\""
				+ " --pagesize 20"
				+ " --pages 1 4"
				+ " --fulltext pdf"
				+ " --limit 2000"
			;
		AMIDownloadTool amiDownload = AMI.execute(AMIDownloadTool.class, args);
	}

on the commandline this would be:

cd ami3
ami -p target clean medrxiv/

followed by

ami -p target/medrxiv/ebola download --site medrxiv --query "ebola AND n95" \
    --pagesize 20 --pages 1 4 --fulltext pdf --limit 2000

The flags restrict the download to pages 1 to 4 of 20 hits each (at most 80 fulltexts); --limit is a safety cap in case of mistakes.
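The arithmetic behind those flags can be written down as a small sketch. It assumes that `--pages 1 4` means an inclusive page range and that `--limit` caps the total number of hits; the class and method names are hypothetical and not part of ami.

```java
/** Hypothetical helper: upper bound on hits implied by the paging flags. */
final class HitBudget {
    /** firstPage and lastPage are treated as an inclusive range. */
    static int maxHits(int firstPage, int lastPage, int pageSize, int limit) {
        int pageCount = Math.max(0, lastPage - firstPage + 1);
        // Whichever bound is smaller wins: pages * pagesize, or the explicit limit.
        return Math.min(pageCount * pageSize, limit);
    }
}
```

With the values used here, `HitBudget.maxHits(1, 4, 20, 2000)` gives 80, so the limit of 2000 never comes into play.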

Running ami download successfully carried out steps 1-3. It's possible that the earlier "hang" was due to overload on medrxiv. Here's the output (with debug output clipped) for the complete process:

summary of input

Specific values (AMICleanTool)
================================
fileGlobs     [medrxiv/]

Generic values (AMIDownloadTool)
================================
-v to see generic values
project         target/medrxiv/ebola

Specific values (AMIDownloadTool)
================================
fulltext           [pdf]
limit              40
metadata           metadata
pages              [1, 4]
pagesize           20
query              ["ebola, AND, n95"]
hitListList      []
site               medrxiv
file types          []

main output:

  1. running the query
Query: "ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
URL https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20
running curl :https://www.medrxiv.org/search/"ebola%252BAND%252Bn95"%20sort%3Arelevance-rank%20numresults%3A20?page=0 to target/medrxiv/ebola/__metadata/hitList1.html
wrote hitList: /Users/pm286/workspace/cmdev/ami3/target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
page hits (11) less than page size (20) ; assumed termination
Results 11
[target/medrxiv/ebola/__metadata/hitList1.clean.html]

gets a hitlist with 11 links to landing pages
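The "assumed termination" rule in the log (a page with fewer hits than the page size ends the search) can be sketched as a paging loop. This is an illustration under assumptions, not the ami implementation: `fetchPage` is a hypothetical stand-in for downloading and parsing one hitlist page, and the class name is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical pager illustrating the short-page termination rule. */
final class HitListPager {
    /**
     * Collects hits page by page. fetchPage stands in for downloading and
     * parsing one hitlist page; it returns the links found on that page.
     */
    static List<String> collect(IntFunction<List<String>> fetchPage,
                                int firstPage, int lastPage, int pageSize, int limit) {
        List<String> hits = new ArrayList<>();
        for (int page = firstPage; page <= lastPage && hits.size() < limit; page++) {
            List<String> pageHits = fetchPage.apply(page);
            for (String hit : pageHits) {
                if (hits.size() >= limit) {
                    break;
                }
                hits.add(hit);
            }
            // A short page is taken to mean there are no further results.
            if (pageHits.size() < pageSize) {
                break;
            }
        }
        return hits;
    }
}
```

In the run above the first page returned 11 hits against a page size of 20, so a loop like this would stop after a single request even though pages 1 to 4 were requested.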

  ========
HitList: 1
 creates hitList[1..1][.clean].html
 and <per-ctree>/scrapedMetadata.html
========

downloads landing pages

download files in hitList target/medrxiv/ebola/__metadata/hitList1.clean.html
result set: target/medrxiv/ebola/__metadata/hitList1.clean.html
metadataEntries 11
download with curl to <tree>scrapedMetadata.html[/content/10.1101/2020.04.24.20073973v1, /content/10.1101/2020.04.22.20076117v1, /content/10.1101/2020.04.23.20077230v1, /content/10.1101/2020.04.24.20078907v1, /content/10.1101/2020.03.05.20032003v1, /content/10.1101/2020.03.31.20047126v1, /content/10.1101/2020.04.06.20054197v1, /content/10.1101/2020.03.20.20039644v2, /content/10.1101/2020.04.11.20062356v1, /content/10.1101/2020.03.23.20039446v2, /content/10.1101/2020.04.15.20066480v2]
running batched up curlDownloader for 11 landingPages, takes ca 1-5 sec/page 
ran curlDownloader for 11 landingPages 
--------
+downloaded 11 files for target/medrxiv/ebola/__metadata/hitList1.clean.html
--------
========
adds LandingPages: 11
========
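The hitlist links in the log are site-relative paths (`/content/10.1101/...`), so they need to be resolved against the site before they can be fetched. A minimal sketch using `java.net.URI` follows; the base URL `https://www.medrxiv.org/` and the class name are assumptions for illustration, and the actual download step (curl in the log) is not shown.

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: turns site-relative landing-page paths into absolute URIs. */
final class LandingPageLinks {
    static List<URI> resolve(URI base, List<String> paths) {
        // URI.resolve(String) parses the path and resolves it against the base.
        return paths.stream()
                .map(path -> base.resolve(path))
                .collect(Collectors.toList());
    }
}
```

For example, resolving `/content/10.1101/2020.04.24.20073973v1` against `https://www.medrxiv.org/` gives `https://www.medrxiv.org/content/10.1101/2020.04.24.20073973v1`.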

downloads PDFs from links in landing pages

========
 CTrees 11
========
LP [10_1101_2020_04_24_20073973v1, 10_1101_2020_04_22_20076117v1, 10_1101_2020_04_23_20077230v1, 10_1101_2020_04_24_20078907v1, 10_1101_2020_03_05_20032003v1, 10_1101_2020_03_31_20047126v1, 10_1101_2020_04_06_20054197v1, 10_1101_2020_03_20_20039644v2, 10_1101_2020_04_11_20062356v1, 10_1101_2020_03_23_20039446v2, 10_1101_2020_04_15_20066480v2]
content 127132
 2020.04.24.20073973.full.pdf
content 112497
 2020.04.22.20076117.full.pdf
content 119018
 2020.04.23.20077230.full.pdf
content 176902
 2020.04.24.20078907.full.pdf
content 113807
 2020.03.05.20032003.full.pdf
content 119105
 2020.03.31.20047126.full.pdf
content 109789
 2020.04.06.20054197.full.pdf
content 125442
 2020.03.20.20039644.full.pdf
content 111965
 2020.04.11.20062356.full.pdf
content 119340
 2020.03.23.20039446.1.full.pdf
content 118888
 2020.04.15.20066480.full.pdf
========
Fulltext: finished
========

ami download output

$ tree ebola/
ebola/
├── 10_1101_2020_03_05_20032003v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_03_20_20039644v2
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_03_23_20039446v2
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_03_31_20047126v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_06_20054197v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_11_20062356v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_15_20066480v2
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_22_20076117v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_23_20077230v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_24_20073973v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
├── 10_1101_2020_04_24_20078907v1
│   ├── fulltext.pdf
│   ├── landingPage.html
│   └── scrapedMetadata.html
└── __metadata
    ├── hitList1.clean.html
    └── hitList1.html

12 directories, 35 files
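A quick way to confirm that a run produced the layout shown above is to check every per-article directory for the three expected files. The sketch below is a hypothetical check, not part of ami; it assumes that directories whose names start with `__` (such as `__metadata`) are bookkeeping and should be skipped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical check of a downloaded project directory. */
final class ProjectCheck {
    private static final List<String> EXPECTED =
            List.of("fulltext.pdf", "landingPage.html", "scrapedMetadata.html");

    /** Returns "dir/file" entries for every expected file that is absent. */
    static List<String> missingFiles(Path project) throws IOException {
        List<String> missing = new ArrayList<>();
        try (Stream<Path> children = Files.list(project)) {
            children.filter(Files::isDirectory)
                    // Skip bookkeeping directories such as __metadata.
                    .filter(dir -> !dir.getFileName().toString().startsWith("__"))
                    .sorted()
                    .forEach(dir -> {
                        for (String name : EXPECTED) {
                            if (!Files.exists(dir.resolve(name))) {
                                missing.add(dir.getFileName() + "/" + name);
                            }
                        }
                    });
        }
        return missing;
    }
}
```

An empty result from `ProjectCheck.missingFiles(Path.of("target/medrxiv/ebola"))` would indicate that every article directory is complete.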

