tools/dirtop: Adding dirtop utility (iovisor#2819)
This tool reports I/Os per directory.
It is a clone of filetop but works in a different way:
- the user specifies a set of globs to select a list of directories to watch
- dirtop extracts the inode ids of the selected directories
- the bpf program receives the list of top directories to consider
- when vfs_{read|write} occurs, the bpf program checks
  whether one of the parents is part of the watched list
- if it matches, the I/O is accounted

On the Python side, the program reconciles I/Os per directory and prints stats.
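
A minimal sketch of that user-space flow (the helper names are hypothetical, shown for illustration only; the real implementation is tools/dirtop.py, included below):

    import os
    from collections import defaultdict
    from glob import glob

    def watched_inodes(patterns):
        # Resolve each glob and map each matching directory's inode id to
        # its path, mirroring what dirtop passes to the BPF program.
        return {os.lstat(d).st_ino: d
                for pattern in patterns for d in glob(pattern)}

    def reconcile(inode_to_path, samples):
        # Fold per-inode (inode, reads, writes) samples into
        # per-directory totals.
        totals = defaultdict(lambda: [0, 0])
        for inode, reads, writes in samples:
            totals[inode_to_path[inode]][0] += reads
            totals[inode_to_path[inode]][1] += writes
        return dict(totals)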

While filetop lists programs and filenames, dirtop only lists the directory name.

A typical usage looks like:

	[root@host]: dirtop.py -d '/hdfs/uuid/*/yarn'
	14:56:33 loadavg: 52.21 48.81 37.78 53/2721 28720

	READS  WRITES R_Kb     W_Kb     PATH
	36821  7632   238219   149183   /hdfs/uuid/d04fccd8-bc72-4ed9-bda4-c5b6893f1405/yarn
	20823  2      196290   3        /hdfs/uuid/b94cbf3f-76b1-4ced-9043-02d450b9887c/yarn
	16059  12064  109748   85778    /hdfs/uuid/250b21c8-1714-45fe-8c08-d45d0271c6bd/yarn
	14128  20360  106287   81440    /hdfs/uuid/4a833770-767e-43b3-b696-dc98901bce26/yarn
	15883  4991   86014    82075    /hdfs/uuid/0cc3683f-4800-4c73-8075-8d77dc7cf116/yarn
	11182  4485   28834    116917   /hdfs/uuid/7d512fe7-b20d-464c-a75a-dbf8b687ee1c/yarn
	11848  7810   103139   31240    /hdfs/uuid/2c6a7223-cb18-4916-a1b6-8cd02bda1d31/yarn
	10418  1272   114842   18       /hdfs/uuid/76dc0b77-e2fd-4476-818f-2b5c3c452396/yarn
	10066  6630   93969    20218    /hdfs/uuid/c11da291-28de-4a77-873e-44bb452d238b/yarn
	13648  15453  39450    53744    /hdfs/uuid/99c178d5-a209-4af2-8467-7382c7f03c1b/yarn
	9509   2049   31363    48219    /hdfs/uuid/a78f846a-58c4-4d10-a9f5-42f16a6134a0/yarn
	8112   2178   13765    63479    /hdfs/uuid/bf829d08-1455-45b8-81fa-05c3303e8c45/yarn
	4327   0      37544    0        /hdfs/uuid/fada8004-53ff-48df-9396-165d8e42925b/yarn
	2238   2742   72       50       /hdfs/uuid/b3b2a2ed-f6c1-4641-86bf-2989dd932411/yarn
	3716   0      47       0        /hdfs/uuid/8138a53b-b942-44d3-82df-51575f1a3901/yarn

Signed-off-by: Erwan Velu <[email protected]>

Co-authored-by: Erwan Velu <[email protected]>
ErwanAliasr1 committed May 20, 2020
1 parent c6342d2 commit 8c12794
Showing 4 changed files with 488 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -105,6 +105,7 @@ pair of .c and .py files, and some are directories of files.
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[deadlock](tools/deadlock.py): Detect potential deadlocks on a running process. [Examples](tools/deadlock_example.txt).
- tools/[dirtop](tools/dirtop.py): File reads and writes by directory. Top for directories. [Examples](tools/dirtop_example.txt).
- tools/[drsnoop](tools/drsnoop.py): Trace direct reclaim events with PID and latency. [Examples](tools/drsnoop_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[exitsnoop](tools/exitsnoop.py): Trace process termination (exit and fatal signals). [Examples](tools/exitsnoop_example.txt).
115 changes: 115 additions & 0 deletions man/man8/dirtop.8
@@ -0,0 +1,115 @@
.TH dirtop 8 "2020-03-16" "USER COMMANDS"
.SH NAME
dirtop \- File reads and writes by directory. Top for directories.
.SH SYNOPSIS
.B dirtop \-d directory1,directory2,... [\-h] [\-C] [\-r MAXROWS] [\-s {all,reads,writes,rbytes,wbytes}] [\-p PID] [interval] [count]
.SH DESCRIPTION
This is top for directories.

This traces file reads and writes, and prints a per-directory summary every interval
(by default, 1 second). By default the summary is sorted on the combined count of
reads, writes and Kbytes. The sort column can be changed via the -s option.

This uses in-kernel eBPF maps to store per-directory summaries for efficiency.

This script works by tracing the vfs_read() and vfs_write() functions using
kernel dynamic tracing, which instruments explicit read and write calls. If
files are read or written using another means (e.g., via mmap()), then they
will not be visible using this tool. Also, this tool will need updating to
match any code changes to those vfs functions.

This should be useful for file system workload characterization when analyzing
the performance of applications.

Note that tracing VFS level reads and writes can be a frequent activity, and
this tool can begin to cost measurable overhead at high I/O rates.

Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-d
Defines a comma-separated list of directories to observe.
Wildcards are allowed if enclosed in single quotes (to prevent shell expansion).
.TP
\-C
Don't clear the screen.
.TP
\-r MAXROWS
Maximum number of rows to print. Default is 20.
.TP
\-s {all,reads,writes,rbytes,wbytes}
Sort column. Default is all (the sum of reads, writes, rbytes and wbytes).
.TP
\-p PID
Trace this PID only.
.TP
interval
Interval between updates, seconds.
.TP
count
Number of interval summaries.

.SH EXAMPLES
.TP
Summarize file I/O by directory, 1 second screen refresh:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn'
.TP
Don't clear the screen, and top 8 rows only:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn' -Cr 8
.TP
5 second summaries, 10 times only:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn' 5 10
.TP
Report read & write I/Os generated in multiple yarn and data directories:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn,/hdfs/uuid/*/data'
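.TP
Sort the rows by write throughput instead of the default combined count:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn' -s wbytes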
.SH FIELDS
.TP
loadavg:
The contents of /proc/loadavg
.TP
READS
Count of reads during interval.
.TP
WRITES
Count of writes during interval.
.TP
R_Kb
Total read Kbytes during interval.
.TP
W_Kb
Total write Kbytes during interval.
.TP
PATH
The path where the I/Os were accounted.
.SH OVERHEAD
Depending on the frequency of application reads and writes, overhead can become
significant, in the worst case slowing applications by over 50%. Hopefully for
real world workloads the overhead is much less -- test before use. The reason
for the high overhead is that VFS reads and writes can be a frequent event, and
despite the eBPF overhead being very small per event, if you multiply this
small overhead by a million events per second, it becomes a million times
worse. Literally. You can gauge the number of reads and writes using the
vfsstat(8) tool, also from bcc.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Erwan Velu
.SH INSPIRATION
filetop(8) by Brendan Gregg
.SH SEE ALSO
vfsstat(8), vfscount(8), fileslower(8)
264 changes: 264 additions & 0 deletions tools/dirtop.py
@@ -0,0 +1,264 @@
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dirtop file reads and writes by directory.
# For Linux, uses BCC, eBPF.
#
# USAGE: dirtop.py -d 'directory1,directory2' [-h] [-C] [-r MAXROWS]
#                  [-s {all,reads,writes,rbytes,wbytes}] [-p PID] [interval] [count]
#
# This uses in-kernel eBPF maps to store per-directory summaries for efficiency.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 13-Mar-2020 Erwan Velu Created dirtop from filetop
# 06-Feb-2016 Brendan Gregg Created filetop.

from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import os
import stat
from subprocess import call

# arguments
examples = """examples:
./dirtop -d '/hdfs/uuid/*/yarn' # directory I/O top, 1 second refresh
./dirtop -d '/hdfs/uuid/*/yarn' -C # don't clear the screen
./dirtop -d '/hdfs/uuid/*/yarn' 5 # 5 second summaries
./dirtop -d '/hdfs/uuid/*/yarn' 5 10 # 5 second summaries, 10 times only
./dirtop -d '/hdfs/uuid/*/yarn,/hdfs/uuid/*/data' # Running dirtop on two sets of directories
"""
parser = argparse.ArgumentParser(
    description="File reads and writes by directory",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-C", "--noclear", action="store_true",
    help="don't clear the screen")
parser.add_argument("-r", "--maxrows", default=20,
    help="maximum rows to print, default 20")
parser.add_argument("-s", "--sort", default="all",
    choices=["all", "reads", "writes", "rbytes", "wbytes"],
    help="sort column, default all")
parser.add_argument("-p", "--pid", type=int, metavar="PID", dest="tgid",
    help="trace this PID only")
parser.add_argument("interval", nargs="?", default=1,
    help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
    help="number of outputs")
parser.add_argument("--ebpf", action="store_true",
    help=argparse.SUPPRESS)
parser.add_argument("-d", "--root-directories", type=str, required=True,
    dest="rootdirs",
    help="select the directories to observe, separated by commas")
args = parser.parse_args()
interval = int(args.interval)
countdown = int(args.count)
maxrows = int(args.maxrows)
clear = not int(args.noclear)
debug = 0

# linux stats
loadavg = "/proc/loadavg"

# define BPF program
bpf_text = """
# include <uapi/linux/ptrace.h>
# include <linux/blkdev.h>
// the key for the output summary
struct info_t {
unsigned long inode_id;
};
// the value of the output summary
struct val_t {
u64 reads;
u64 writes;
u64 rbytes;
u64 wbytes;
};
BPF_HASH(counts, struct info_t, struct val_t);
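// Common handler for the read and write kprobes: walk the dentry chain
// upwards and, when an ancestor matches one of the watched directories,
// account the I/O against that inode.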
static int do_entry(struct pt_regs *ctx, struct file *file,
    char __user *buf, size_t count, int is_read)
{
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    if (TGID_FILTER)
        return 0;
    // The directory inodes we look at
    u32 dir_ids[INODES_NUMBER] = DIRECTORY_INODES;
    struct info_t info = {.inode_id = 0};
    struct dentry *pde = file->f_path.dentry;
    for (int i = 0; i < 50; i++) {
        // If we don't have any parent, we reached the root
        if (!pde->d_parent) {
            break;
        }
        pde = pde->d_parent;
        // Is this ancestor one of the directories we look for?
        for (int dir_id = 0; dir_id < INODES_NUMBER; dir_id++) {
            if (pde->d_inode->i_ino == dir_ids[dir_id]) {
                // Yes, let's export the top directory inode
                info.inode_id = pde->d_inode->i_ino;
                break;
            }
        }
    }
    // If we didn't find any, abort
    if (info.inode_id == 0) {
        return 0;
    }
    struct val_t *valp, zero = {};
    valp = counts.lookup_or_try_init(&info, &zero);
    if (valp) {
        if (is_read) {
            valp->reads++;
            valp->rbytes += count;
        } else {
            valp->writes++;
            valp->wbytes += count;
        }
    }
    return 0;
}
int trace_read_entry(struct pt_regs *ctx, struct file *file,
    char __user *buf, size_t count)
{
    return do_entry(ctx, file, buf, count, 1);
}
int trace_write_entry(struct pt_regs *ctx, struct file *file,
    char __user *buf, size_t count)
{
    return do_entry(ctx, file, buf, count, 0);
}
"""


def get_searched_ids(root_directories):
    """Export the inode numbers of the selected directories."""
    from glob import glob
    inode_to_path = {}
    inodes = "{"
    total_dirs = 0
    for root_directory in root_directories.split(','):
        searched_dirs = glob(root_directory, recursive=True)
        if not searched_dirs:
            continue

        for mydir in searched_dirs:
            total_dirs = total_dirs + 1
            # If we pass more than 15 dirs, the eBPF program fails
            if total_dirs > 15:
                print('15 directories limit reached')
                break
            inode_id = os.lstat(mydir)[stat.ST_INO]
            if inode_id in inode_to_path:
                if inode_to_path[inode_id] == mydir:
                    print('Skipping {} as already considered'.format(mydir))
            else:
                inodes = "{},{}".format(inodes, inode_id)
                inode_to_path[inode_id] = mydir
                print('Considering {} with inode_id {}'.format(mydir, inode_id))

    inodes = inodes + '}'
    if len(inode_to_path) == 0:
        print('Cannot find any valid directory')
        exit()
    return inodes.replace('{,', '{'), inode_to_path


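# Substitute the placeholders in the BPF program text: TGID_FILTER becomes
# the PID predicate, DIRECTORY_INODES the brace-enclosed inode list, and
# INODES_NUMBER the number of watched inodes.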
if args.tgid:
    bpf_text = bpf_text.replace('TGID_FILTER', 'tgid != %d' % args.tgid)
else:
    bpf_text = bpf_text.replace('TGID_FILTER', '0')

inodes, inodes_to_path = get_searched_ids(args.rootdirs)
bpf_text = bpf_text.replace("DIRECTORY_INODES", inodes)
bpf_text = bpf_text.replace(
"INODES_NUMBER", '{}'.format(len(inodes.split(','))))

if debug or args.ebpf:
    print(bpf_text)
    if args.ebpf:
        exit()

# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="vfs_read", fn_name="trace_read_entry")
b.attach_kprobe(event="vfs_write", fn_name="trace_write_entry")

DNAME_INLINE_LEN = 32 # linux/dcache.h
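# Note: DNAME_INLINE_LEN is kept from filetop but is not used by dirtop,
# which prints full directory paths rather than dentry names.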

print('Tracing... Output every %d secs. Hit Ctrl-C to end' % interval)


def sort_fn(counts):
    """Define how to sort the columns"""
    if args.sort == "all":
        return (counts[1].rbytes + counts[1].wbytes +
                counts[1].reads + counts[1].writes)
    else:
        return getattr(counts[1], args.sort)


# output
exiting = 0
while 1:
    try:
        sleep(interval)
    except KeyboardInterrupt:
        exiting = 1

    # header
    if clear:
        call("clear")
    else:
        print()
    with open(loadavg) as stats:
        print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))

    print("%-6s %-6s %-8s %-8s %s" %
          ("READS", "WRITES", "R_Kb", "W_Kb", "PATH"))
    # by-directory output
    counts = b.get_table("counts")
    line = 0
    reads = {}
    writes = {}
    reads_Kb = {}
    writes_Kb = {}
    for k, v in reversed(sorted(counts.items(), key=sort_fn)):
        # If it's the first time we see this inode
        if k.inode_id not in reads:
            # let's create a new entry
            reads[k.inode_id] = v.reads
            writes[k.inode_id] = v.writes
            reads_Kb[k.inode_id] = v.rbytes / 1024
            writes_Kb[k.inode_id] = v.wbytes / 1024
        else:
            # otherwise, add the current performance metrics
            # to the previous ones
            reads[k.inode_id] += v.reads
            writes[k.inode_id] += v.writes
            reads_Kb[k.inode_id] += v.rbytes / 1024
            writes_Kb[k.inode_id] += v.wbytes / 1024

    for node_id in reads:
        print("%-6d %-6d %-8d %-8d %s" %
              (reads[node_id], writes[node_id], reads_Kb[node_id],
               writes_Kb[node_id], inodes_to_path[node_id]))
        line += 1
        if line >= maxrows:
            break

    counts.clear()

    countdown -= 1
    if exiting or countdown == 0:
        print("Detaching...")
        exit()