Catla-HS

Catla for Hadoop and Spark (Catla-HS) is a self-tuning system that adjusts Hadoop parameters to improve the performance of MapReduce jobs on both Hadoop and Spark clusters. It ships with advanced tools such as machine learning support and a performance visualization tool. Catla-HS is an improved version of Catla, our previous work, which focused only on Hadoop clusters.

This redesigned project is template-driven, which makes it flexible enough to handle complicated job execution, monitoring and self-tuning of MapReduce performance, and it extends to more modern platforms such as Spark. The project also provides easy-to-use prediction and visualization tools for designing jobs and for analyzing, visualizing and predicting the performance of MapReduce jobs.

Architecture

Fig.1 Architecture of Catla-HS

Components

Task Runner: Submits a single MapReduce job to a Hadoop or Spark cluster and retrieves its analysis results and logs after the job completes.
Project Runner: Submits a group of MapReduce jobs organized in a project folder and monitors them until completion; all analysis results and logs, which record the running time of every MapReduce phase, are then downloaded to a specified location in the project folder.
Optimizer Runner: Creates a series of MapReduce jobs with different combinations of parameter values according to the parameter configuration files and, once tuning finishes, reports the parameter values with the lowest time cost. Two tuning approaches are supported: direct search and derivative-free optimization (DFO).
Predictor Runner: Provides multiple prediction models that help fit the tuning results and predict future performance changes of MapReduce jobs. (New)
Performance visualization tool: Helps users analyze and visualize the data collected from tuning jobs and make decisions based on it. (New)
Performance analysis tool: Aggregates MapReduce job profiles and summarizes the time cost of each phase in a job. (New)
Machine Learning mining tool: Supports modeling with existing machine learning techniques, using tuning data and metric data from the tuning process. (New)
CatlaUI: Provides a user-friendly GUI for the main functions of Catla-HS.
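
To make the idea of aggregating job profiles into a per-phase summary more concrete, here is a minimal sketch. It is not the Catla-HS implementation: the class `PhaseSummary`, the helper `profile`, the phase names and the millisecond values are all assumptions made for illustration, and the real tool reads its data from downloaded job logs whose format is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: sums the time spent in each phase across job profiles.
class PhaseSummary {

    // Builds one illustrative profile (phase name -> elapsed milliseconds).
    static Map<String, Long> profile(long mapMs, long shuffleMs, long reduceMs) {
        Map<String, Long> p = new LinkedHashMap<>();
        p.put("map", mapMs);
        p.put("shuffle", shuffleMs);
        p.put("reduce", reduceMs);
        return p;
    }

    // Adds up the elapsed time of every phase over all given profiles.
    static Map<String, Long> totalPerPhase(Iterable<Map<String, Long>> profiles) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map<String, Long> p : profiles) {
            for (Map.Entry<String, Long> e : p.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up numbers standing in for two measured jobs.
        List<Map<String, Long>> jobs = Arrays.asList(
                profile(12000, 4000, 9000),
                profile(15000, 5000, 8000));
        totalPerPhase(jobs).forEach((phase, ms) ->
                System.out.println(phase + ": " + ms + " ms"));
    }
}
```

A `LinkedHashMap` is used so that the phases are reported in the order in which they were first seen.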

Flowchart of tuning

Fig.2 Usage of Catla-HS, which supports both Hadoop and Spark

Advanced example?

Usage

Some typical uses of Catla-HS are listed below.

(1) Shell

Run Catla-HS.jar in a terminal:

java -jar Catla-HS.jar -tool project -dir /your-example-folder/project_wordcount -task pipeline -download true -sequence true

(2) Execute using CatlaRunner

Example 1: Submit a MapReduce job

		String[] args = new String[] {
				"-tool", "task",
				"-dir", "\\YOUR-FOLDER\\task_wordcount"
		};

		CatlaRunner.main(args);

Example 2: Submit a composite MapReduce task with multiple jobs

		String[] args=new String[] {
				"-tool","project",
				"-dir","\\YOUR-FOLDER\\project_wordcount",
				"-task","pipeline",
				"-download","true",
				"-sequence","true"
		};
		
		CatlaRunner.main(args);

Example 3: Tuning using Exhaustive Search

		String[] args = new String[] {
				"-tool", "tuning",
				"-dir", "\\YOUR-FOLDER\\tuning_similarity",
				"-clean", "true",
				"-group", "wordcount",
				"-upload", "false",
				"-uploadjar", "true"
		};

		CatlaRunner.main(args);

Example 4: Tuning using BOBYQA (a method of derivative-free optimization)

		String[] args = new String[] {
				"-tool", "optimizer",
				"-dir", "\\YOUR-FOLDER\\tuning_wordcount",
				"-clean", "true",
				"-group", "wordcount",
				"-upload", "true",
				"-uploadjar", "true",
				"-maxinter", "1000",
				"-optimizer", "BOBYQA",
				"-BOBYQA-initTRR", "20",
				"-BOBYQA-stopTRR", "1.0e-4"
		};

		CatlaRunner.main(args);
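
All four examples pass the same kind of flat array of flag/value pairs that the shell command uses. As a convenience, a small helper can assemble these pairs and render them either as an array or as a command line. The sketch below is only an illustration: `CatlaArgs` is a hypothetical class that is not part of Catla-HS, and the call to `CatlaRunner.main` is left as a comment because it requires Catla-HS on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Catla-HS): collects flag/value pairs and
// flattens them into the String[] layout used in the examples above.
class CatlaArgs {
    private final List<String> parts = new ArrayList<>();

    // Adds one "-flag value" pair; the leading dash is prepended here.
    CatlaArgs option(String flag, String value) {
        parts.add("-" + flag);
        parts.add(value);
        return this;
    }

    // Returns the pairs as a flat array: {"-flag1", "value1", "-flag2", ...}.
    String[] toArray() {
        return parts.toArray(new String[0]);
    }

    // Renders the same pairs as a shell command line for Catla-HS.jar.
    String toCommandLine() {
        return "java -jar Catla-HS.jar " + String.join(" ", parts);
    }

    public static void main(String[] ignored) {
        CatlaArgs a = new CatlaArgs()
                .option("tool", "project")
                .option("dir", "/your-example-folder/project_wordcount")
                .option("task", "pipeline")
                .option("download", "true")
                .option("sequence", "true");
        System.out.println(a.toCommandLine());
        // With Catla-HS on the classpath, the array could be passed on:
        // CatlaRunner.main(a.toArray());
    }
}
```

The helper does no quoting, so values that contain spaces would need extra care before being pasted into a shell.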

For advanced usage, please see here

Analysis results using Catla-HS

(1) Exhaustive search

Fig. 3 Three-dimensional surface plot of running time of a MapReduce job over two Hadoop configuration parameters using the exhaustive search method on Hadoop

Fig. 4 Two-dimensional plot of running time of a MapReduce job over one Hadoop configuration parameter using the exhaustive search method on Spark
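
Conceptually, an exhaustive search over two parameters evaluates every combination of candidate values and keeps the one with the lowest running time, which is what a surface plot like Fig. 3 visualizes. The following is a minimal sketch under stated assumptions, not the Catla-HS code: the class `GridSearch`, the candidate values and the synthetic cost function are invented for illustration, whereas a real tuning run would submit a job for each combination and measure its running time.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of an exhaustive (grid) search over two integer parameters.
class GridSearch {

    // Holds the best combination found and its cost.
    static final class Best {
        final int a;
        final int b;
        final double cost;

        Best(int a, int b, double cost) {
            this.a = a;
            this.b = b;
            this.cost = cost;
        }
    }

    // Evaluates every (a, b) pair and returns the cheapest one,
    // or null if either candidate array is empty.
    static Best search(int[] as, int[] bs, ToDoubleBiFunction<Integer, Integer> cost) {
        Best best = null;
        for (int a : as) {
            for (int b : bs) {
                double c = cost.applyAsDouble(a, b);
                if (best == null || c < best.cost) {
                    best = new Best(a, b, c);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative candidate values for two parameters (assumed, not from Catla-HS).
        int[] first = {50, 100, 150, 200};
        int[] second = {1, 2, 4, 8};
        // Synthetic stand-in for a measured running time in seconds.
        Best best = search(first, second,
                (p, q) -> Math.pow(p - 150, 2) / 100.0 + Math.pow(q - 4, 2) + 60.0);
        System.out.println("best = (" + best.a + ", " + best.b + "), cost = " + best.cost);
    }
}
```

The number of evaluations is the product of the candidate counts, so the cost of this approach grows quickly as more parameters or values are added.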

(2) Derivative-free optimization-based search

Fig. 5 Change of running time of a MapReduce job over number of iterations when tuning using a BOBYQA optimizer

Other DFO-based algorithms supported include:

Powell's method
CMA-ES
Simplex methods
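
To illustrate the general idea behind derivative-free optimization, the sketch below implements compass (coordinate pattern) search, one of the simplest methods of this family. It is deliberately not BOBYQA, Powell's method, CMA-ES or a simplex method, and it is not taken from Catla-HS. The class `CompassSearch`, its parameters and the synthetic objective are assumptions; in a real tuning run the objective would be the measured running time of a job.

```java
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of compass search, a basic derivative-free method.
class CompassSearch {

    // Minimizes f starting from 'start'. The step is halved whenever no
    // coordinate move improves the objective; the search stops when the step
    // falls to minStep or below, or when maxEvals evaluations have been used.
    static double[] minimize(ToDoubleFunction<double[]> f, double[] start,
                             double step, double minStep, int maxEvals) {
        double[] x = start.clone();
        double fx = f.applyAsDouble(x);
        int evals = 1;
        while (step > minStep && evals < maxEvals) {
            boolean improved = false;
            for (int i = 0; i < x.length; i++) {
                for (double dir : new double[] {1.0, -1.0}) {
                    if (evals >= maxEvals) {
                        break;
                    }
                    double[] trial = x.clone();
                    trial[i] += dir * step;
                    double ft = f.applyAsDouble(trial);
                    evals++;
                    if (ft < fx) {
                        // Accept the first improving move along this coordinate.
                        x = trial;
                        fx = ft;
                        improved = true;
                        break;
                    }
                }
            }
            if (!improved) {
                step /= 2.0;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        // Synthetic objective with its minimum at (3, -1).
        ToDoubleFunction<double[]> cost =
                p -> Math.pow(p[0] - 3.0, 2) + Math.pow(p[1] + 1.0, 2) + 10.0;
        double[] best = minimize(cost, new double[] {0.0, 0.0}, 1.0, 1.0e-4, 1000);
        System.out.println("best = (" + best[0] + ", " + best[1] + ")");
    }
}
```

The method needs only objective values, never gradients, which is why this family of techniques suits cases where each evaluation is a full job run. The initial and minimum step sizes here play a role loosely comparable to initial and stopping radii in trust-region methods, although the mechanics differ.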

Fitting model

Catla-HS includes an additional component, PredictorRunner, that supports fitting and predicting performance changes. By combining multiple fitting analyses, we can build a prediction model for evaluating MapReduce job performance.

The component currently supports:

linear fitting
multivariate linear fitting
logarithmic fitting
exponential fitting
polynomial fitting
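
As a rough illustration of the simplest entry in this list, the sketch below fits a straight line to (parameter value, running time) pairs by ordinary least squares and uses it to predict an unseen point. It is not the PredictorRunner implementation: the class `LinearFit`, its method names and the data points are assumptions made for this example.

```java
// Hypothetical sketch of linear fitting by ordinary least squares.
class LinearFit {

    // Returns {intercept, slope} of the least-squares line y = intercept + slope * x.
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double denom = n * sxx - sx * sx;
        if (denom == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = (n * sxy - sx * sy) / denom;
        double intercept = (sy - slope * sx) / n;
        return new double[] {intercept, slope};
    }

    // Evaluates the fitted line at x.
    static double predict(double[] coef, double x) {
        return coef[0] + coef[1] * x;
    }

    public static void main(String[] args) {
        // Made-up tuning results: parameter value vs. running time in seconds.
        double[] param = {1.0, 2.0, 3.0, 4.0};
        double[] seconds = {98.0, 91.0, 85.0, 77.0};
        double[] coef = fit(param, seconds);
        System.out.println("intercept = " + coef[0] + ", slope = " + coef[1]);
        System.out.println("predicted time at 5.0: " + predict(coef, 5.0));
    }
}
```

Logarithmic and exponential models can be reduced to the same computation by fitting against ln(x) or fitting ln(y), respectively, which is one common way to handle them.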

An example is below:

Credits

This project builds on Apache Hadoop, Apache Commons Math3 and Apache MINA SSHD, which are licensed under the Apache License, Version 2.0.

We also use XCharts to visualize the results.

We currently use Java-ML to implement several machine learning algorithms for Catla-HS.

Citation

Donghua Chen, "An Open-Source Project for MapReduce Performance Self-Tuning," arXiv:1912.12456 [cs.DC], Dec. 2019.

OR

@misc{chen2019opensource,
    title={An Open-Source Project for MapReduce Performance Self-Tuning},
    author={Donghua Chen},
    year={2019},
    eprint={1912.12456},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}

LICENSE

See the LICENSE file for license rights and limitations (GNU GPLv3).
