DataNode Volumes Rebalancing tool for Apache Hadoop 2.6+ HDFS

This project aims at filling the gap with HDFS-1312 & family: when a hard drive dies on a Datanode and gets replaced, there is not real way to move blocks from most used hard disks to the newly added -- and thus empty.

HDFS-1804 is a really good addition that allows the DataNode to choose the least used hard drive, but for new blocks only. Balancing existing blocks is still something missing and thus the main focus of this project.

Attention

If you use Hadoop2.6 or Hadoop2.7, this script org.apache.hadoop.hdfs.server.datanode.VolumeBalancerNew is suitable for you to rebalance volume on Datanode.

If you use Hadoop2.5 or the following version, you should use script org.apache.hadoop.hdfs.server.datanode.VolumeBalancer, or you can use volumn-balancer by bperroud. this script is forked from him. Difference between the two scripts, see HDFS-6482 for more details.

If you use Hadoop3.0, you no need this! Because it has the hdfs diskbalancer command to rebalance data across multiple disks of a Datanode.

Usage

Using hadoop jar

$ hadoop jar /opt/volume-balancer/lib/volume-balancer-<version>.jar org.apache.hadoop.hdfs.server.datanode.VolumeBalancerNew [-threshold=0.1] [-concurrency=1] [-dirs=128]

Using plain java

(But still hadoop classpath for automatic jars detection)

$ java -cp /opt/volume-balancer/lib/volume-balancer-<version>.jar:/path/to/hdfs-site.xml/parentDir:$(find $(hadoop classpath | sed -e "s/:/ /g") -maxdepth 0 2>/dev/null | paste -d ":" -s -) org.apache.hadoop.hdfs.server.datanode.VolumeBalancerNew [-threshold=0.1] [-concurrency=1] [-dirs=128]

With CDH

Cloudera is splitting the configuration files and store in /etc/hadoop/conf only strict minimum. dfs.datanode.data.dir for instance is not found in /etc/hadoop/conf/hdfs-site.xml so a workaround needs to be put in place. Moreover, once you found the proper /var/run/cloudera-scm-agent/process folder to add in your classpath, you need to skip the log4j.properties file shipped otherwise you'll start a custom proprietary logger. A helper script is given in src/main/scripts folder, which run the balancer in CDH distributions.

$ /opt/volume-balancer/bin/cdh-balance-local-volumes.sh [-threshold=0.1] [-concurrency=1] [-dirs=128]

Parameters

Threshold

Default value: 0.1 (10%)

The script stops when all the drives reach the average disks utilization +/- the threshold. The smaller the value, the more balanced the drives will be and the longer the process will take.

Concurrency

Default value: 1 (no concurrency)

Define how many threads are concurrently reading (from different disks) and writing to the least used disk(s). Advised value is 2, increasing the concurrency does not always increase the overall throughput.

dirs

Default value: 128

Define how many dirs we can rebalance, from HDFS-6482 we know the two-level directory structure has the most 256 * 256 dirs, use this we can limit the rebalanced dirs to dirs * 256 on per volume. This value should not greater than 256, oherwise it will use default value instead.

How it works

The script take a random subdir* leaf (i.e. without other subdir inside) from the most used partition and move it to a same subdir of the least used partition

The script is doing pretty good job at keeping the bandwidth of the least used hard drive maxed out using FileUtils#moveDirectory(File, File) and a dedicated j.u.c.ExecutorService for the copy. Increasing the concurrency of the thread performing the copy does not always help to improve disk utilization, more particularly at the target disk. But if you use -concurrency > 1, the script is balancing the read (if possible) amongst several disks.

Once all disks reach the disks average utilization +/- threshold (can be given as input parameter, by default 0.1) the script stops. But it can also be safely stopped at any time hitting Crtl+C: it shuts down properly ensuring ALL blocks of a subdir are moved, leaving the datadirs in a proper state.

Monitoring

Appart of the standard df -h command to monitor disks fulling in, the disk bandwidths can be easily monitored using iostat -x 1 -m

$ iostat -x 1 -m
   Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
   sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
   sde               0.00 32911.00    0.00  300.00     0.00   149.56  1020.99   138.72  469.81   3.34 100.00
   sdf               0.00    27.00  963.00   50.00   120.54     0.30   244.30     1.37    1.35   0.80  80.60
   sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
   sdh               0.00     0.00  610.00    0.00    76.25     0.00   255.99     1.45    2.37   1.44  88.10
   sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

License

See LICENSE for licensing information.

Contributions

All contributions are welcome: ideas, documentation, code, patches, bug reports, feature requests etc. And you don't need to be a programmer to speak up!

If you are new to GitHub please read Contributing to a project for how to send patches and pull requests to this project.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
doc		doc
src/main		src/main
.gitignore		.gitignore
Chinese.md		Chinese.md
README.md		README.md
diskbalancer.png		diskbalancer.png
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

doc

doc

src/main

src/main

.gitignore

.gitignore

Chinese.md

Chinese.md

README.md

README.md

diskbalancer.png

diskbalancer.png

pom.xml

pom.xml

Repository files navigation

DataNode Volumes Rebalancing tool for Apache Hadoop 2.6+ HDFS

Attention

Usage

Using hadoop jar

Using plain java

With CDH

Parameters

Threshold

Concurrency

dirs

How it works

Monitoring

License

Contributions

About

Releases

Packages

Languages

kavnsir/disk-balancer-hadoop2.6

Folders and files

Latest commit

History

Repository files navigation

DataNode Volumes Rebalancing tool for Apache Hadoop 2.6+ HDFS

Attention

Usage

Using hadoop jar

Using plain java

With CDH

Parameters

Threshold

Concurrency

dirs

How it works

Monitoring

License

Contributions

About

Topics

Resources

Stars

Watchers

Forks

Languages