This pipeline is designed to run with Apache Beam using the Dataflow runner. It has not been tested with other Beam backends, but it should work on them with minimal modifications. Please see the Apache Beam SDK documentation for more info.
Use Python 2.
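For illustration, the backend is selected purely through the `--runner` pipeline option; a minimal sketch of a trivial pipeline on the local `DirectRunner` (assuming a Beam 2.x Python SDK — this snippet is not part of the pipeline itself):

```python
from __future__ import print_function

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The runner is just another pipeline option; swapping 'DirectRunner'
# for 'DataflowRunner' (plus the GCP options) is the only change needed.
options = PipelineOptions(['--runner=DirectRunner'])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['smoke test'])
     | beam.Map(print))
```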
- Generate a mirror of the MEDLINE FTP site to a Google Storage bucket (any other storage provider supported by the Python Beam SDK should work), e.g. using rclone:
  - Configure rclone with the MEDLINE FTP server (`ftp.ncbi.nlm.nih.gov`) and your target GCP project (`my-gcp-project-buckets`):
    ```sh
    rclone config
    ```
  - Generate a full mirror:
    ```sh
    rclone sync medline-ftp:pubmed my-gcp-project-buckets:my-medline-bucket
    ```
  - Update new files:
    ```sh
    rclone sync medline-ftp:pubmed/updatefiles my-gcp-project-buckets:my-medline-bucket/updatefiles
    ```
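  Once the sync completes, you can sanity-check the mirror from Python; a sketch, assuming `pip install google-cloud-storage`, default application credentials, and the placeholder bucket/project names used in this README:

  ```python
  # Hypothetical check: count the mirrored baseline files on GCS.
  from google.cloud import storage

  client = storage.Client(project='your-project')
  bucket = client.bucket('my-medline-bucket')
  baseline = [blob.name for blob in bucket.list_blobs(prefix='baseline/')]
  print('mirrored %d baseline files' % len(baseline))
  ```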
- Install the pipeline locally:
  ```sh
  git clone https://github.com/opentargets/library-beam
  cd library-beam
  (sudo) pip install virtualenv
  virtualenv venv
  source venv/bin/activate
  pip install --upgrade setuptools pip
  python setup.py install
  pip install https://github.com/explosion/spacy-models/releases/download/en_depent_web_md-1.2.1/en_depent_web_md-1.2.1.tar.gz
  ```
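  Before launching anything on Dataflow, it can be worth checking that the pinned spaCy model loads inside the virtualenv; a minimal sketch (the sentence is just an illustrative example):

  ```python
  # Sanity check: the model package installed from the tarball above
  # should be loadable by its package name.
  import spacy

  nlp = spacy.load('en_depent_web_md')
  doc = nlp(u'BRCA1 is associated with breast cancer.')
  print([token.text for token in doc])
  ```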
- Run the NLP analytical pipeline:
  ```sh
  python -m main \
    --project your-project \
    --job_name medline-nlp \
    --runner DataflowRunner \
    --temp_location gs:https://my-tmp-bucket/temp \
    --setup_file ./setup.py \
    --worker_machine_type n1-highmem-32 \
    --input_baseline gs:https://my-medline-bucket/baseline/pubmed18n*.xml.gz \
    --input_updates gs:https://my-medline-bucket/updatefiles/pubmed18n*.xml.gz \
    --output_enriched gs:https://my-medline-bucket-output/analyzed/pubmed18 \
    --max_num_workers 32 \
    --zone europe-west1-d
  ```
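  When the job finishes, it may help to spot-check one enriched shard before moving on; a sketch, assuming a shard has been copied locally with `gsutil` and that the output is gzipped JSON with one document per line (the shard name below is hypothetical — list the real outputs with `gsutil ls gs:https://my-medline-bucket-output/analyzed/`):

  ```python
  # Inspect the keys of the first enriched record from a copied shard.
  import gzip
  import json

  with gzip.open('pubmed18-00000-of-00500_enriched.json.gz') as f:
      record = json.loads(f.readline())
  print(sorted(record.keys()))
  ```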
- Run a job to split the enriched JSONs into smaller pieces:
  ```sh
  python -m main \
    --project open-targets \
    --job_name open-targets-medline-process-split \
    --runner DataflowRunner \
    --temp_location gs:https://my-tmp-bucket/temp \
    --setup_file ./setup.py \
    --worker_machine_type n1-highmem-16 \
    --input_enriched gs:https://my-medline-bucket/analyzed/pubmed18*_enriched.json.gz \
    --output_splitted gs:https://my-medline-bucket/splitted/pubmed18 \
    --max_num_workers 32 \
    --zone europe-west1-d
  ```
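  The same kind of bucket listing used for the mirror check works here to confirm the split shards landed; a short, hypothetical sketch:

  ```python
  # Count the split shards written by the job above.
  from google.cloud import storage

  client = storage.Client(project='your-project')
  shards = list(client.bucket('my-medline-bucket').list_blobs(prefix='splitted/'))
  print('%d split shards' % len(shards))
  ```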
  NOTE: you can chain the analytical and split steps by adding the option
  ```sh
  --output_splitted gs:https://my-medline-bucket/splitted/pubmed18
  ```
  to the analytical step.

- Run a job to load the JSONs into Elasticsearch:
  ```sh
  python load2es.py publication --es http:https://myesnode1:9200 --es http:https://myesnode2:9200
  python load2es.py bioentity --es http:https://myesnode1:9200 --es http:https://myesnode2:9200
  python load2es.py taggedtext --es http:https://myesnode1:9200 --es http:https://myesnode2:9200
  python load2es.py concept --es http:https://myesnode1:9200 --es http:https://myesnode2:9200
  ```
  WARNING: the loading scripts currently take a long time, particularly the concept one (16 h on our system). It might be a good idea to run the loading inside tmux, so it keeps going while you are not watching it. E.g., after installing tmux:
  ```sh
  tmux new-session "python load2es.py publication --es http:https://myesnode1:9200 --es http:https://myesnode2:9200"
  tmux new-session "python load2es.py bioentity --es http:https://myesnode1:9200 --es http:https://myesnode2:9200"
  tmux new-session "python load2es.py taggedtext --es http:https://myesnode1:9200 --es http:https://myesnode2:9200"
  tmux new-session "python load2es.py concept --es http:https://myesnode1:9200 --es http:https://myesnode2:9200"
  ```
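  After the loaders finish, a quick way to confirm each index received documents; a sketch in which the `pubmed-18-<type>` index naming is an assumption (inferred from the settings call further below — adjust to your deployment):

  ```python
  # Hypothetical post-load check: print the document count per index.
  import json
  import urllib2  # Python 2, matching the pipeline's runtime

  for doc_type in ('publication', 'bioentity', 'taggedtext', 'concept'):
      url = 'http:https://myesnode1:9200/pubmed-18-%s/_count' % doc_type
      count = json.load(urllib2.urlopen(url))['count']
      print('%s: %d docs' % (doc_type, count))
  ```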
- OPTIONAL: if needed, create appropriate aliases in Elasticsearch:
  ```sh
  curl -XPOST 'http:https://myesnode1:9200/_aliases' -H 'Content-Type: application/json' -d '
  {
      "actions": [
          {"add": {"index": "pubmed-18", "alias": "!publication-data"}}
      ]
  }'
  ```
- OPTIONAL: increase the Elasticsearch limit for the adjacency matrix aggregation (used by the LINK tool):
  ```sh
  curl -XPUT 'http:https://myesnode1:9200/pubmed-18-concept/_settings' -H 'Content-Type: application/json' -d '
  {
      "index": {
          "max_adjacency_matrix_filters": 500
      }
  }'
  ```
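  For context, this setting caps how many filters a single `adjacency_matrix` aggregation may combine. A hypothetical example of the kind of co-occurrence query this enables (the field names here are assumptions for illustration, not the actual index mapping):

  ```python
  # Illustrative adjacency_matrix aggregation; every filter listed here
  # counts toward the max_adjacency_matrix_filters limit raised above.
  import json
  import urllib2

  query = {
      'size': 0,
      'aggs': {
          'cooccurrence': {
              'adjacency_matrix': {
                  'filters': {
                      'BRCA1': {'term': {'subject.label': 'BRCA1'}},
                      'breast_cancer': {'term': {'object.label': 'breast cancer'}},
                  },
              },
          },
      },
  }
  request = urllib2.Request(
      'http:https://myesnode1:9200/pubmed-18-concept/_search',
      json.dumps(query),
      {'Content-Type': 'application/json'})
  print(urllib2.urlopen(request).read())
  ```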