
Note: This repo has been archived because LINK (Library) has been decommissioned.

Open Targets Library - NLP Pipeline

NLP Analysis of MEDLINE/PubMed Running in Apache Beam

This pipeline is designed to run with Apache Beam using the Dataflow runner. It has not been tested with other Beam backends, but it should work with them as well with minimal modifications. Please see the Apache Beam SDK documentation for more info.
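For reference, the runner is just a pipeline option. Below is a minimal sketch (not part of this repo; the project and bucket names are hypothetical) of switching between the local DirectRunner and the DataflowRunner, assuming apache-beam[gcp] is installed:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical values; main.py takes the equivalent options shown
    # in the run step further down.
    options = PipelineOptions(
        runner='DirectRunner',  # swap for 'DataflowRunner' in production
        project='my-gcp-project',
        temp_location='gs://my-bucket/temp',
    )

    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['smoke test'])
         | 'Upper' >> beam.Map(lambda s: s.upper()))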

Steps to reproduce a full run

Use Python 2 with pip and virtualenv.

  • Generate a mirror of the MEDLINE FTP site in a Google Cloud Storage bucket (any other storage provider supported by the Beam Python SDK should also work), e.g. using rclone:

    • Download the pre-built rclone binaries rather than platform-packaged ones, as they tend to be more up to date.
    • Configure rclone (rclone config) with the MEDLINE FTP server ftp.ncbi.nlm.nih.gov and your target GCP project (my-gcp-project-buckets). The MEDLINE remote must use username anonymous and password anonymous.
    • Generate a full mirror: rclone sync -v medline-ftp:pubmed/baseline my-gcp-project-buckets:my-medline-bucket/baseline
    • Update new files: rclone sync -v medline-ftp:pubmed/updatefiles my-gcp-project-buckets:my-medline-bucket/updatefiles
    • Note: you can use the --dry-run flag to test first.
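    To sanity-check the mirror afterwards, here is a sketch using the google-cloud-storage client (the project and bucket names are the placeholders from the commands above):

      # pip install google-cloud-storage
      from google.cloud import storage

      client = storage.Client(project='my-gcp-project')
      bucket = client.bucket('my-medline-bucket')
      print('%d baseline files mirrored' %
            sum(1 for _ in bucket.list_blobs(prefix='baseline/')))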
  • Install tooling

    sudo apt-get install python-dev virtualenv build-essential git libxml2-dev libxslt-dev zlib1g-dev tmux
  • Download the pipeline

    git clone https://github.com/opentargets/library-beam
    cd library-beam
  • Create a virtual environment to manage dependencies in

    virtualenv venv --python=python2
    source venv/bin/activate
  • Install the pipeline into the virtual environment

    python setup.py install
    # note: this needs between 3.75 GB and 7.5 GB of RAM
    pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.0/en_core_web_lg-2.2.0.tar.gz
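    To verify the model installed correctly, an optional quick check (a sketch, not part of the pipeline):

      import spacy

      # Raises OSError if the model is not installed.
      nlp = spacy.load('en_core_web_lg')
      print(nlp(u'BRCA1 is associated with breast cancer.').ents)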
  • Grant the required permission to the compute service account:

    numberHidden [email protected] 		Cloud Build Service Agent
    
  • Update the vocabulary settings in modules/vocabulary.py.

  • Run pipeline

    python -m main \
      --project open-targets-library \
      --job_name medline201911 \
      --runner DataflowRunner \
      --temp_location gs://medline_2019_11/temp \
      --setup_file ./setup.py \
      --worker_machine_type n1-highmem-32 \
      --input_baseline gs://medline_2019_11/baseline/pubmed19n*.xml.gz \
      --input_updates gs://medline_2019_11/updatefiles/pubmed19n*.xml.gz \
      --output_enriched gs://medline_2019_11/analyzed/pubmed19 \
      --output_splitted gs://medline_2019_11/splitted/pubmed19 \
      --max_num_workers 32 \
      --region europe-west1 \
      --zone europe-west1-d

    The job can be monitored via the Google Dataflow console. Note that the "wall time" displayed is not wall-clock time in the usual sense; it is accumulated per thread and per worker.

    A full run takes approximately 4 hours in total.


Steps to load the JSON dumps into ElasticSearch

The gcp directory contains the infrastructure scripts used to create the Elasticsearch cluster.

  • Create a virtual environment to manage dependencies in

    virtualenv venv_elasticsearch --python=python2
    source venv_elasticsearch/bin/activate
    pip install -r venv_elasticsearch.txt
  • Run the job to load the JSONs into Elasticsearch

WARNING: the loading scripts currently take a long time, particularly the concept one (24h+). Run them under screen, tmux, or similar, so they keep going after a disconnect and can be recovered.

python load2es.py publication bioentity taggedtext concept --es https://es:9200
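Once a load finishes, document counts can be spot-checked from Python. A sketch using the elasticsearch client; the index names are assumptions based on the pubmed-18-concept index referenced below:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['https://es:9200'])
    # Hypothetical index names following the pubmed-18-* pattern.
    for index in ('pubmed-18-publication', 'pubmed-18-concept'):
        print('%s: %d docs' % (index, es.count(index=index)['count']))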

Note: Elasticsearch must have the International Components for Unicode (ICU) analysis plugin installed, e.g. /usr/share/elasticsearch/bin/elasticsearch-plugin -s install analysis-icu
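To confirm the plugin is present on every node, a sketch assuming the elasticsearch Python client:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['https://es:9200'])
    print(es.cat.plugins())  # should list analysis-icu for each node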

  • Increase the Elasticsearch limit for the adjacency matrix aggregation (used by the LINK tool):

    curl -XPUT 'https://myesnode1:9200/pubmed-18-concept/_settings' -H 'Content-Type: application/json' -d'
    {
        "index" : {
            "max_adjacency_matrix_filters" : 500
        }
    }'
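    The same setting can be applied with the elasticsearch Python client (a sketch, equivalent to the curl call above):

      from elasticsearch import Elasticsearch

      es = Elasticsearch(['https://myesnode1:9200'])
      es.indices.put_settings(
          index='pubmed-18-concept',
          body={'index': {'max_adjacency_matrix_filters': 500}},
      )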

Google Cloud Platform

When controlling this process from a Google Cloud machine, make sure the machine has sufficient access scopes enabled.
