LAS EMR

Parallelized Finnish text pre-processing with AWS EMR and SeCo Lexical Analysis Services.

Motivation

Some text pre-processing tasks, such as proper tokenization and lemmatization, are computationally expensive. Many good tools already produce proper output but take a long time to run: pre-processing tens of gigabytes of text might take days or even weeks. Luckily, some of these tools are stateless, which means they can be parallelized easily.

This repository contains a text pre-processor utility for Finnish NLP tasks. It wraps SeCo Lexical Analysis Services as an Apache Spark job that reads raw Finnish text files from AWS S3, processes them on an AWS EMR cluster, and writes the processed (tokenized and lemmatized) files back to S3. The process is entirely stateless, so you can increase the cluster size to speed up the pre-processing.

Setup

Make sure you have java8-jdk and maven installed
Obtain your AWS credentials from the AWS console and configure your shell to use them
Create a new S3 bucket that will contain the NLP files (input data, output data and binaries)
Build the binaries: ./bin/build.sh
Deploy the binaries: ./bin/deploy.sh <your-bucket-name>
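
The setup steps above can be sketched as a small pre-flight check. This is a minimal sketch and not part of the repository: the function names `require_cmd` and `preflight` are hypothetical, and it assumes the JDK, Maven and the AWS CLI are exposed as `javac`, `mvn` and `aws` on your PATH.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; not part of las-emr itself.

# Succeeds if the given command is on PATH; otherwise prints a message and fails.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing required command: $1" >&2
  return 1
}

# Checks the tools the setup steps rely on and that AWS credentials are usable.
preflight() {
  require_cmd javac || return 1   # javac ships with the JDK
  require_cmd mvn || return 1
  require_cmd aws || return 1
  # `aws sts get-caller-identity` fails if no valid credentials are configured.
  aws sts get-caller-identity >/dev/null || return 1
}

# Example (commented out so that sourcing this file has no side effects):
# preflight \
#   && aws s3 mb "s3://my-finnlp-bucket" \
#   && ./bin/build.sh \
#   && ./bin/deploy.sh my-finnlp-bucket
```

The bucket name `my-finnlp-bucket` is only a placeholder; S3 bucket names are globally unique, so pick your own.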

Usage

Copy your input data into the S3 bucket
Process your S3 text data with the bin/process.sh script (use --help for more info)

Example usage (with aws-cli for data sync):

# "my-input-data.txt" is a folder with hadoop data format 
aws s3 sync ./my-input-data.txt s3://my-finnlp-bucket/data/input/mydata.txt

# create default roles if you've not used emr before
aws emr create-default-roles

./bin/process.sh \
  --bucket my-finnlp-bucket \
  --input s3n://my-finnlp-bucket/data/input/mydata.txt \
  --tokens s3n://my-finnlp-bucket/data/output/tokenized.txt \
  --lemmas s3n://my-finnlp-bucket/data/output/lemmatized.txt

process.sh creates a new cluster named "las-emr" that processes the given input file and terminates automatically when the job is completed. If you want to follow the progress of the processing job, you can use the Spark Web UI.
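
Besides the Spark Web UI, the cluster can also be located from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function names `find_cluster_ids` and `cluster_state` are hypothetical helpers, not something process.sh provides.

```shell
#!/bin/sh
# Hypothetical helpers for locating the cluster started by process.sh.

# Prints the IDs of active EMR clusters with the given name (default: las-emr).
find_cluster_ids() {
  name="${1:-las-emr}"
  aws emr list-clusters --active \
    --query "Clusters[?Name=='${name}'].Id" \
    --output text
}

# Prints the current state of a cluster, e.g. STARTING, RUNNING or TERMINATED.
cluster_state() {
  aws emr describe-cluster --cluster-id "$1" \
    --query "Cluster.Status.State" \
    --output text
}

# Example (requires configured AWS credentials):
# for id in $(find_cluster_ids); do
#   echo "$id: $(cluster_state "$id")"
# done
```

If you start the job with a custom cluster name, pass that name as the first argument to `find_cluster_ids`.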

Custom EMR flow configurations

The default configuration uses m1.large instances for CORE nodes because they impose no special requirements on the cluster. However, m1.* instances are pretty slow (a single m1.large instance can process approximately 2.3 MB of Finnish text per hour), so it's recommended to use c4.* instances instead.

You can customize the flow and cluster configurations by using the --config my_config.edn flag. The contents of your configuration file will be merged into the defaults.edn configuration. For the full set of configuration options, please see the amazonica documentation.

ATTENTION! When using custom instance types, remember to set the executor memory and core settings so that they utilize the maximum resources of your instances. Your instances should be big enough that each executor gets at least 3 GB of memory. The specs for different instance types can be found in the AWS EMR documentation.
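
The 3 GB rule above can be checked before starting a cluster. The following is a minimal sketch under stated assumptions: `executor_settings` is a hypothetical helper, it assumes one executor per instance, and it expects you to have already decided how much memory to hand to the executor (the headroom to leave for the OS and YARN depends on the instance type and is not derived here).

```shell
#!/bin/sh
# Hypothetical helper: prints spark-defaults properties for one executor per instance.

MIN_EXECUTOR_MB=3072  # the 3 GB per-executor minimum mentioned above

# Usage: executor_settings <executor-memory-mb> <cores>
executor_settings() {
  mem_mb="$1"
  cores="$2"
  if [ "$mem_mb" -lt "$MIN_EXECUTOR_MB" ]; then
    echo "executor memory ${mem_mb}m is below ${MIN_EXECUTOR_MB}m" >&2
    return 1
  fi
  # Output is an EDN map fragment for the "spark-defaults" classification.
  printf '{"spark.executor.memory" "%sm"\n "spark.executor.cores"  "%s"}\n' \
    "$mem_mb" "$cores"
}

# Example: the values used for c4.xlarge in this README.
# executor_settings 5120 4
```

The printed map can be pasted as the :properties value of the "spark-defaults" classification in your custom .edn file.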

Here is an example configuration with custom cluster name and c4.xlarge core nodes:

; my_custom_cluster.edn
{:name            "big-spot-las"
 :configurations  [{:classification "spark"
                    :properties     {"maximizeResourceAllocation" "true"}}
                   {:classification "spark-defaults"
                    :properties     {"spark.executor.memory" "5120m"
                                     "spark.executor.cores"  "4"}}]
 :instances       {:ec2-subnet-id 
                   "subnet-abc12345"    ; required for c4 instances
                   :instance-groups
                   [{:instance-type  "m1.medium"
                     :instance-role  "MASTER"
                     :instance-count 1}
                    {:instance-type  "c4.xlarge"
                     :instance-role  "CORE"
                     :instance-count 100
                     :market         "SPOT"
                     :bid-price      "0.06"}]}}

And starting the job:

./bin/process.sh \
  --conf my_custom_cluster.edn \
  --bucket my-finnlp-bucket \
  --input s3n://my-finnlp-bucket/data/input/bigdata.txt \
  --tokens s3n://my-finnlp-bucket/data/output/tokenized.txt \
  --lemmas s3n://my-finnlp-bucket/data/output/lemmatized.txt

Some thoughts about time and costs

One c4.xlarge instance (with the configuration shown above) can process approximately 6.2 MB of Finnish text per hour, which means that a cluster of 100 instances can process approximately 10 GB of data in 16.2 hours. If you're using the spot market with e.g. a $0.06/h price cap, the cost for this job will be approximately $185.
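
The arithmetic behind such an estimate can be sketched as follows. This is a back-of-the-envelope helper, not a billing calculator: the function names are hypothetical, 10 GB is taken as 10000 MB, throughput is assumed to scale linearly with the number of instances, and only the spot price of the CORE nodes is counted.

```shell
#!/bin/sh
# Back-of-the-envelope estimate; awk is used for floating-point arithmetic.

# Usage: estimate_hours <data-mb> <mb-per-hour-per-instance> <instances>
estimate_hours() {
  awk -v size="$1" -v rate="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", size / (rate * n) }'
}

# Usage: estimate_spot_cost <data-mb> <mb-per-hour-per-instance> <instances> <usd-per-instance-hour>
# Counts only instance-hours at the given price; EMR fees and the master node are ignored.
estimate_spot_cost() {
  awk -v size="$1" -v rate="$2" -v n="$3" -v price="$4" \
    'BEGIN { hours = size / (rate * n); printf "%.2f\n", hours * n * price }'
}

# Example: 10000 MB, 6.2 MB/hour per c4.xlarge, 100 instances, $0.06/h cap.
# estimate_hours 10000 6.2 100
# estimate_spot_cost 10000 6.2 100 0.06
```

With these inputs the sketch gives roughly 16 hours, in line with the figure above. The spot-only cost it prints (about $97 at the price cap) is lower than the $185 quoted above because it leaves out everything except the CORE instance-hours; which other charges make up the difference is not stated here, so treat the sketch as a lower bound.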

License

GNU GPLv3

(I know, and I'm sad as well, but it can't be helped since some of the transitive dependencies have that license... 😢 However, note that because this tool runs in a separate process, you can build your other NLP tools and infra without exposing them to the license.)

milankinen/las-emr

Folders and files

Latest commit

History

Repository files navigation

LAS EMR

Motivation

Setup

Usage

Custom EMR flow configurations

Some thoughts about time and costs

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages