UimaOnSpark

Way to run UIMA pipelines on Apache Spark
MimicSectionSegmenter

Run the segmenter

  • NOTE: Spark only runs on Linux/macOS
  1. download Spark 2.2 or higher and unpack it into <spark_folder>
  2. clone and compile the uima-aphp project
  3. clone this project
  4. put uima-aphp/uima-segmenter/target/uima-segmenter-1.0-standalone.jar [1] under the UimaOnSpark/lib/ folder
  5. compile this project with sbt publish-local
  6. copy target/scala-2.11/uimaonspark_2.11-0.1.0-SNAPSHOT.jar [2]
  7. copy NOTEEVENTS.csv.gz, ref_doc_section.csv, [1] and [2] into a <working_folder> (steps 1-7 are sketched as commands below)
  8. run the Spark command below from <working_folder>
  9. the resulting CSV, note_nlp.csv, will be in the <working_folder>
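
A sketch of steps 1-7 as shell commands. The uima-aphp repository URL and its Maven build are assumptions; adapt the commands to your environment.

# step 1: download Spark 2.2+ from https://spark.apache.org/downloads.html and unpack it into <spark_folder>
git clone https://github.com/aphp/uima-aphp.git                    # step 2 (repository URL assumed)
(cd uima-aphp && mvn clean package)                                # build assumed to be Maven, producing the standalone jar
git clone https://github.com/aphp/UimaOnSpark.git                  # step 3
cp uima-aphp/uima-segmenter/target/uima-segmenter-1.0-standalone.jar UimaOnSpark/lib/    # step 4
(cd UimaOnSpark && sbt publish-local)                              # step 5
mkdir -p <working_folder>                                          # steps 6-7: gather everything in one place
cp UimaOnSpark/target/scala-2.11/uimaonspark_2.11-0.1.0-SNAPSHOT.jar <working_folder>/
cp UimaOnSpark/lib/uima-segmenter-1.0-standalone.jar NOTEEVENTS.csv.gz ref_doc_section.csv <working_folder>/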

Spark Command

Standalone

  • The first two commands below start a standalone Spark cluster: one master and one worker offering 4 cores, i.e. 4 single-core executors.
  • The CSV is split into 200 tasks that are consumed by the 4 executors.
  • The /tmp/ directory is used as a temporary folder to collect the 200 partial results.
  • The results are concatenated into the note_nlp.csv file in the working folder.
<spark_folder>/sbin/start-master.sh
<spark_folder>/sbin/start-slave.sh spark://0.0.0.0:7077 -c 4
<spark_folder>/bin/spark-submit \
--class fr.aphp.wind.uima.spark.MimicSectionSegmenter \
--jars uima-segmenter-1.0-SNAPSHOT-standalone.jar,uimaonspark_2.11-0.1.0-SNAPSHOT.jar \
--files ref_doc_section.csv \
--master spark://0.0.0.0:7077 \
--executor-cores 1 \
uimaonspark_2.11-0.1.0-SNAPSHOT.jar \
/tmp/ \
note_nlp.csv \
NOTEEVENTS.csv.gz \
200
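
Once the job has finished, the standalone daemons started above can be stopped with the matching scripts:

<spark_folder>/sbin/stop-slave.sh
<spark_folder>/sbin/stop-master.sh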

The sections below contain outdated information.


NotePhiAnnotator

Run UIMA pipelines over Spark

uimaFIT

Apparently no problem, thanks to the simplification and the removal of the XML machinery.

UIMA

When loading an existing pipeline from an XML descriptor into a uimaFIT pipeline, keep in mind:

  • put the descriptors in the Spark folder
  • the initializing Analysis Engine (the one providing the empty CAS) needs to be a UIMA pipeline, and it needs to aggregate the type systems of all downstream pipelines

General Notes

  • the uimaFIT pipeline needs to be packaged as a jar (as per the documentation)
  • the resulting jar needs to be put in the Spark jars folder
  • resources/uima-an-dictionary.jar needs to be in the jars folder too
  • all resources (XML, ...) need to be passed to the slaves with --files, but --files cannot recreate folder hierarchies; for this reason they all sit in the base folder of the UIMA project and in the Spark folder (see the sketch below)
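
A hedged sketch of that layout, reusing the jar and descriptor names that appear in the NoteDeid section below; adapt the names to your own pipeline.

cp NoteDeid-1.0-SNAPSHOT-standalone.jar <spark_folder>/jars/       # the packaged uimaFIT pipeline
cp resources/uima-an-dictionary.jar <spark_folder>/jars/           # the dictionary annotator jar
cp DictionaryAnnotator.xml RegExAnnotator.xml dictionary.xml dictionary2.xml dictionary.xsd <spark_folder>/   # resources stay flat because --files cannot rebuild folders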

Performance considerations

  1. config 1: classic uimaFIT, 1 core
  2. config 2: classic uimaFIT, 2 cores (two parallel runs, each over half the dataset)
  3. config 3: spark, 1 slave / 2 cores
  4. config 4: spark, 1 slave / 4 cores
  • test 1 (256 texts)
    • config 1: 3 min 20
    • config 2: 2 min 20
    • config 3: 2 min 20
    • config 4: 1 min 50

Apparently, running separate uimaFIT instances is equivalent in performance to running them inside Spark. However, although Spark adds an extra layer, it makes it possible to distribute the pipelines over multiple computers in parallel from a single command, so scaling from one machine to thousands becomes easy.

NoteDeid

How to run (standalone)

  1. Run the master: sbin/start-master.sh
  2. Run the slave: sbin/start-slave.sh spark://nps-HP-ProBook-430-G2:7077
  3. Submit the job: bin/spark-submit --files dictionary.xsd,DictionaryAnnotator.xml,RegExAnnotator.xml,dictionary.xml,dictionary2.xml --master spark://nps-HP-ProBook-430-G2:7077 natus/lib/logquery_2.11-0.1.0-SNAPSHOT.jar

How to run (yarn)

  1. push all the jars, XML and txt files to one of the cluster nodes
  2. push all the txt files to HDFS (= $input_path) (see the sketch after this list)
  3. /usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --jars NoteDeid-1.0-SNAPSHOT-standalone.jar,uima-an-dictionary.jar --files DictionaryAnnotator.xml,RegExAnnotator.xml,dictionary.xml,dictionary2.xml --master yarn-client --num-executors 8 --driver-memory 512m --executor-memory 512m --executor-cores 1 logquery_2.11-0.1.0-SNAPSHOT.jar $input_path $output_path
  4. it is crucial to use only one executor core: the CAS appears to be shared otherwise, which makes the job fail. With 1-core executors, the pipelines nevertheless appear to run independently on multiple cores (paradoxically)
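
A hedged sketch of steps 1 and 2; the destination node and $input_path are assumptions.

scp *.jar *.xml *.txt <cluster_node>:       # step 1: push jars, descriptors and texts to a cluster node (destination assumed)
hdfs dfs -mkdir -p $input_path              # step 2: create the HDFS input folder
hdfs dfs -put *.txt $input_path             # and upload the texts into it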

Run SectionSegmentation

RUN

  • STANDALONE: /bin/spark-submit --class org.apache.spark.examples.SectionSegmenter --jars jars/NoteDeid-1.0-SNAPSHOT-standalone.jar,natus/lib/logquery_2.11-0.1.0-SNAPSHOT.jar --files SectionSegmenterDescriptor.xml --executor-cores 1 --master spark://nps-HP-ProBook-430-G2:7077 natus/lib/logquery_2.11-0.1.0-SNAPSHOT.jar /tmp/tata/ /tmp/result.csv
  • YARN: /usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --jars NoteDeid-1.0-SNAPSHOT-standalone.jar,logquery_2.11-0.1.0-SNAPSHOT.jar --files SectionSegmenterDescriptor.xml --class org.apache.spark.examples.SectionSegmenter --num-executors 8 --executor-cores 1 --master yarn NoteDeid-1.0-SNAPSHOT-standalone.jar tata/ result.csv

NEEDS

  • a uima pipeline jar in the lib folder

INPUT

  • takes an AVRO file

OUTPUT

  • produces a csv file without header

HOW

  • this runs a UIMA pipeline over all the texts
  • then, this produces one csv per partition
  • the per-partition csvs are merged into one large csv (see the sketch after this list)
  • this csv is meant to be sent to PostgreSQL
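
A hedged sketch of the last two steps for a YARN run, assuming result.csv ends up as a directory of part files on HDFS and that the target table is named note_nlp:

hdfs dfs -getmerge result.csv result.csv             # merge the per-partition part files into one local csv
psql -c "\copy note_nlp FROM 'result.csv' CSV"       # load the header-less csv into PostgreSQL (table and connection assumed)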

TODO

  • AVRO READER (from sqoop)

The same job can also be submitted on YARN with 16 executors:

  • YARN: /usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --jars NoteDeid-1.0-SNAPSHOT-standalone.jar,logquery_2.11-0.1.0-SNAPSHOT.jar --class org.apache.spark.examples.SectionSegmenter --num-executors 16 --executor-cores 1 --master yarn NoteDeid-1.0-SNAPSHOT-standalone.jar tata/ result.csv
