- NOTE: spark only works on linux/MacosX
- download spark 2.2 or higher, and unpack it in
<spark_folder>
- clone and compile the uima-aphp project
- clone this project
- put the
uima-aph/uima-segmenter/target/uima-segmenter-1.0-standalone.jar
[1] under theUimaOnSpark/lib/
folder - compile this project with
sbt publish-local
- copy the
target/scala-2.11/uimaonspark_2.11-0.1.0-SNAPSHOT.jar
[2] - copy
NOTEEVENTS.csv.gz
,ref_doc_section.csv
, [1] and [2] into a<working_folder>
- run the spark command
- the resulting csv will be in t
- The two steps below run a spark environment with 4 executors.
- The csv is split into 200 tasks that will be consumed by the 4 executors.
- The /tmp/ directory will be used as a temporary folder to get the 200 results.
- The results will be concatened into the note_nlp.csv file into the working folder
<spark_folder>/sbin/start-master.sh
<spark_folder>/sbin/start-slave.sh spark:https://0.0.0.0:7077 -c 4
<spark_folder>/bin/spark-submit \
--class fr.aphp.wind.uima.spark.MimicSectionSegmenter \
--jars uima-segmenter-1.0-SNAPSHOT-standalone.jar,uimaonspark_2.11-0.1.0-SNAPSHOT.jar \
--files ref_doc_section.csv \
--master spark:https://0.0.0.0:7077 \
--executor-cores 1 \
uimaonspark_2.11-0.1.0-SNAPSHOT.jar \
/tmp/ \
note_nlp.csv \
NOTEEVENTS.csv.gz \
200
Below are outdated information
Apparently, no problem thanks to simplifiing and removing xml stuff
When loading an existing pipe from xml descriptor into uimaFIT pipeline keep in mind:
- put them on the spark folder
- the initialize Analysis Engine (the one providing the empty CAS) needs to be a UIMA pipe. Moreover, it needs to get all typeSystems from all descending pipes
- the uimaFIT pipeline needs to be packaged as jar (accordingly to documentation)
- the resulting jar needs to be put in the spark folder
jar
- the
resources/uima-an-dictionary.jar
need to be in thejar
folder too - all the resources (xml...) needs to be passed to slaves (--files ) but it cannot build folder. For this reason all of them are in the base folder of the UIMA project, and spark folder
- config 1: classic uimaFIT, 1 core
- config 2: classic uimaFIT, 2 cores (parallel run of half dataset)
- config 3: spark, 1 slave / 2 cores
- config 4: spark, 1 slave / 4 cores
- test 1 (256 texts)
- config 1: 3 min 20
- config 2: 2 min 20
- config 3: 2 min 20
- config 4: 1 min 50
Apparently, running separate instances of uimaFIT is equivalent in terms of performances to running them into spark. However, while adding a new layer with spark, this allows to distribute the pipelines over multiple computers, in parallell from one command. It is then possible to scale from one to thouthand of computers easily.
- Run the master:
sbin/start-master.sh
- Run the slave:
sbin/start-slave.sh spark:https://nps-HP-ProBook-430-G2:7077
- Submit the job:
bin/spark-submit --files dictionary.xsd,DictionaryAnnotator.xml,RegExAnnotator.xml,dictionary.xml,dictionary2.xml --master spark:https://nps-HP-ProBook-430-G2:7077 natus/lib/logquery_2.11-0.1.0-SNAPSHOT.jar
- push all jars, xml, txt files on one of the computer cluster
- push all the txt files on hdfs (=input_path)
- `/usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --jars NoteDeid-1.0-SNAPSHOT-standalone.jar,uima-an-dictionary.jar --files DictionaryAnnotator.xml,RegExAnnotator.xml,dictionary.xml,dictionary2.xml --master yarn-client --num-executors 8 --driver-memory 512m --executor-memory 512m --executor-cores 1 logquery_2.11-0.1.0-SNAPSHOT.jar $input_path $output_path
- it is crucial to put only one executor core. It looks like the CAS is shared otherwize, and this leeds job to fail. In the case of 1 core executor, the pipes looks like to be run independently on multiple cores (paradoxaly)
- STANDALONE:
/bin/spark-submit --class org.apache.spark.examples.SectionSegmenter --jars jars/NoteDeid-1.0-SNAPSHOT-standalone.jar,natus/lib/logquery_2.11-0.1.0-SNAPSHOT.jar --files SectionSegmenterDescriptor.xml --executor-cores 1 --master spark:https://nps-HP-ProBook-430-G2:7077 natus/lib/logquery_2.11-0.1.0-SNAPSHOT.jar /tmp/tata/ /tmp/result.csv
- YARN:
/usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --jars NoteDeid-1.0-SNAPSHOT-standalone.jar,logquery_2.11-0.1.0-SNAPSHOT.jar --files SectionSegmenterDescriptor.xml --class org.apache.spark.examples.SectionSegmenter --num-executors 8 --executor-cores 1 --master yarn NoteDeid-1.0-SNAPSHOT-standalone.jar tata/ result.csv
- a uima pipeline jar in the lib folder
- takes an AVRO file
- produces a csv file without header
- this runs an UIMA pipeline over all text
- then, this produces a csv per each partition
- each csv are merged into one large csv
- this csv is supposed to be sent to postgresql
- AVRO READER (from sqoop)
- YARN:
/usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --jars NoteDeid-1.0-SNAPSHOT-standalone.jar,logquery_2.11-0.1.0-SNAPSHOT.jar --class org.apache.spark.examples.SectionSegmenter --num-executors 16 --executor-cores 1 --master yarn NoteDeid-1.0-SNAPSHOT-standalone.jar tata/ result.csv
- a uima pipeline jar in the lib folder
- takes an AVRO file
- produces a csv file without header
- this runs an UIMA pipeline over all text
- then, this produces a csv per each partition
- each csv are merged into one large csv
- this csv is supposed to be sent to postgresql
- AVRO READER (from sqoop)