tools/dirtop: Adding dirtop utility (iovisor#2819)
This tool reports I/Os per directory.
It is a clone of filetop but works in a different way:
- the user specifies a set of globs to select a list of directories to watch
- dirtop extracts the inode ids of the selected directories
- the bpf program receives the list of top directories to consider
- when vfs_{read|write} occurs, the bpf program checks
  whether one of the parents is part of the watched list
- if it matches, the I/O is accounted

On the Python side, the program reconciles I/Os per directory and prints stats.
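
A minimal sketch of that user-space flow (the helper names are hypothetical, shown for illustration only; the real implementation is tools/dirtop.py, included below):

    import os
    from collections import defaultdict
    from glob import glob

    def watched_inodes(patterns):
        # Resolve each glob and map each matching directory's inode id to
        # its path, mirroring what dirtop passes to the BPF program.
        return {os.lstat(d).st_ino: d
                for pattern in patterns for d in glob(pattern)}

    def reconcile(inode_to_path, samples):
        # Fold per-inode (inode, reads, writes) samples into
        # per-directory totals.
        totals = defaultdict(lambda: [0, 0])
        for inode, reads, writes in samples:
            totals[inode_to_path[inode]][0] += reads
            totals[inode_to_path[inode]][1] += writes
        return dict(totals)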

While filetop lists programs and filenames, dirtop only lists the directory name.

A typical usage looks like:

	[root@host]: dirtop.py -d '/hdfs/uuid/*/yarn'
	14:56:33 loadavg: 52.21 48.81 37.78 53/2721 28720

	READS  WRITES R_Kb     W_Kb     PATH
	36821  7632   238219   149183   /hdfs/uuid/d04fccd8-bc72-4ed9-bda4-c5b6893f1405/yarn
	20823  2      196290   3        /hdfs/uuid/b94cbf3f-76b1-4ced-9043-02d450b9887c/yarn
	16059  12064  109748   85778    /hdfs/uuid/250b21c8-1714-45fe-8c08-d45d0271c6bd/yarn
	14128  20360  106287   81440    /hdfs/uuid/4a833770-767e-43b3-b696-dc98901bce26/yarn
	15883  4991   86014    82075    /hdfs/uuid/0cc3683f-4800-4c73-8075-8d77dc7cf116/yarn
	11182  4485   28834    116917   /hdfs/uuid/7d512fe7-b20d-464c-a75a-dbf8b687ee1c/yarn
	11848  7810   103139   31240    /hdfs/uuid/2c6a7223-cb18-4916-a1b6-8cd02bda1d31/yarn
	10418  1272   114842   18       /hdfs/uuid/76dc0b77-e2fd-4476-818f-2b5c3c452396/yarn
	10066  6630   93969    20218    /hdfs/uuid/c11da291-28de-4a77-873e-44bb452d238b/yarn
	13648  15453  39450    53744    /hdfs/uuid/99c178d5-a209-4af2-8467-7382c7f03c1b/yarn
	9509   2049   31363    48219    /hdfs/uuid/a78f846a-58c4-4d10-a9f5-42f16a6134a0/yarn
	8112   2178   13765    63479    /hdfs/uuid/bf829d08-1455-45b8-81fa-05c3303e8c45/yarn
	4327   0      37544    0        /hdfs/uuid/fada8004-53ff-48df-9396-165d8e42925b/yarn
	2238   2742   72       50       /hdfs/uuid/b3b2a2ed-f6c1-4641-86bf-2989dd932411/yarn
	3716   0      47       0        /hdfs/uuid/8138a53b-b942-44d3-82df-51575f1a3901/yarn

Signed-off-by: Erwan Velu <[email protected]>

Co-authored-by: Erwan Velu <[email protected]>
ErwanAliasr1 committed May 20, 2020
1 parent c6342d2 commit 8c12794
Showing 4 changed files with 488 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -105,6 +105,7 @@ pair of .c and .py files, and some are directories of files.
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[deadlock](tools/deadlock.py): Detect potential deadlocks on a running process. [Examples](tools/deadlock_example.txt).
- tools/[dirtop](tools/dirtop.py): File reads and writes by directory. Top for directories. [Examples](tools/dirtop_example.txt).
- tools/[drsnoop](tools/drsnoop.py): Trace direct reclaim events with PID and latency. [Examples](tools/drsnoop_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[exitsnoop](tools/exitsnoop.py): Trace process termination (exit and fatal signals). [Examples](tools/exitsnoop_example.txt).
115 changes: 115 additions & 0 deletions man/man8/dirtop.8
@@ -0,0 +1,115 @@
.TH dirtop 8 "2020-03-16" "USER COMMANDS"
.SH NAME
dirtop \- File reads and writes by directory. Top for directories.
.SH SYNOPSIS
.B dirtop \-d directory1,directory2,... [\-h] [\-C] [\-r MAXROWS] [\-s {all,reads,writes,rbytes,wbytes}] [\-p PID] [interval] [count]
.SH DESCRIPTION
This is top for directories.

This traces file reads and writes, and prints a per-directory summary every interval
(by default, 1 second). By default the summary is sorted on the combined count of
reads, writes and Kbytes. The sort column can be changed via the -s option.

This uses in-kernel eBPF maps to store per-directory summaries for efficiency.

This script works by tracing the vfs_read() and vfs_write() functions using
kernel dynamic tracing, which instruments explicit read and write calls. If
files are read or written using another means (e.g., via mmap()), then they
will not be visible using this tool. Also, this tool will need updating to
match any code changes to those vfs functions.

This should be useful for file system workload characterization when analyzing
the performance of applications.

Note that tracing VFS level reads and writes can be a frequent activity, and
this tool can begin to cost measurable overhead at high I/O rates.

Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-d
Defines a comma-separated list of directories to observe.
Wildcards are allowed if enclosed in single quotes (to prevent shell expansion).
.TP
\-C
Don't clear the screen.
.TP
\-r MAXROWS
Maximum number of rows to print. Default is 20.
.TP
\-s {all,reads,writes,rbytes,wbytes}
Sort column. Default is all (the sum of reads, writes, rbytes and wbytes).
.TP
\-p PID
Trace this PID only.
.TP
interval
Interval between updates, seconds.
.TP
count
Number of interval summaries.

.SH EXAMPLES
.TP
Summarize file I/O by directory, 1 second screen refresh:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn'
.TP
Don't clear the screen, and top 8 rows only:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn' -Cr 8
.TP
5 second summaries, 10 times only:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn' 5 10
.TP
Report read & write I/Os generated in multiple yarn and data directories:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn,/hdfs/uuid/*/data'
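.TP
Sort the rows by write throughput instead of the default combined count:
#
.B dirtop.py -d '/hdfs/uuid/*/yarn' -s wbytes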
.SH FIELDS
.TP
loadavg:
The contents of /proc/loadavg
.TP
READS
Count of reads during interval.
.TP
WRITES
Count of writes during interval.
.TP
R_Kb
Total read Kbytes during interval.
.TP
W_Kb
Total write Kbytes during interval.
.TP
PATH
The path where the I/Os were accounted.
.SH OVERHEAD
Depending on the frequency of application reads and writes, overhead can become
significant, in the worst case slowing applications by over 50%. Hopefully for
real world workloads the overhead is much less -- test before use. The reason
for the high overhead is that VFS reads and writes can be a frequent event, and
despite the eBPF overhead being very small per event, if you multiply this
small overhead by a million events per second, it becomes a million times
worse. Literally. You can gauge the number of reads and writes using the
vfsstat(8) tool, also from bcc.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Erwan Velu
.SH INSPIRATION
filetop(8) by Brendan Gregg
.SH SEE ALSO
vfsstat(8), vfscount(8), fileslower(8)
264 changes: 264 additions & 0 deletions tools/dirtop.py
@@ -0,0 +1,264 @@
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dirtop file reads and writes by directory.
# For Linux, uses BCC, eBPF.
#
# USAGE: dirtop.py -d 'directory1,directory2' [-h] [-C] [-r MAXROWS]
#                  [-s {all,reads,writes,rbytes,wbytes}] [-p PID] [interval] [count]
#
# This uses in-kernel eBPF maps to store per-directory summaries for efficiency.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 13-Mar-2020 Erwan Velu Created dirtop from filetop
# 06-Feb-2016 Brendan Gregg Created filetop.

from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import os
import stat
from subprocess import call

# arguments
examples = """examples:
./dirtop -d '/hdfs/uuid/*/yarn' # directory I/O top, 1 second refresh
./dirtop -d '/hdfs/uuid/*/yarn' -C # don't clear the screen
./dirtop -d '/hdfs/uuid/*/yarn' 5 # 5 second summaries
./dirtop -d '/hdfs/uuid/*/yarn' 5 10 # 5 second summaries, 10 times only
./dirtop -d '/hdfs/uuid/*/yarn,/hdfs/uuid/*/data' # Running dirtop on two sets of directories
"""
parser = argparse.ArgumentParser(
    description="File reads and writes by directory",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-C", "--noclear", action="store_true",
    help="don't clear the screen")
parser.add_argument("-r", "--maxrows", default=20,
    help="maximum rows to print, default 20")
parser.add_argument("-s", "--sort", default="all",
    choices=["all", "reads", "writes", "rbytes", "wbytes"],
    help="sort column, default all")
parser.add_argument("-p", "--pid", type=int, metavar="PID", dest="tgid",
    help="trace this PID only")
parser.add_argument("interval", nargs="?", default=1,
    help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
    help="number of outputs")
parser.add_argument("--ebpf", action="store_true",
    help=argparse.SUPPRESS)
parser.add_argument("-d", "--root-directories", type=str, required=True,
    dest="rootdirs",
    help="select the directories to observe, separated by commas")
args = parser.parse_args()
interval = int(args.interval)
countdown = int(args.count)
maxrows = int(args.maxrows)
clear = not int(args.noclear)
debug = 0

# linux stats
loadavg = "/proc/loadavg"

# define BPF program
bpf_text = """
# include <uapi/linux/ptrace.h>
# include <linux/blkdev.h>
// the key for the output summary
struct info_t {
unsigned long inode_id;
};
// the value of the output summary
struct val_t {
u64 reads;
u64 writes;
u64 rbytes;
u64 wbytes;
};
BPF_HASH(counts, struct info_t, struct val_t);
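// Common handler for the read and write kprobes: walk the dentry chain
// upwards and, when an ancestor matches one of the watched directories,
// account the I/O against that inode.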
static int do_entry(struct pt_regs *ctx, struct file *file,
    char __user *buf, size_t count, int is_read)
{
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    if (TGID_FILTER)
        return 0;
    // The directory inodes we look at
    u32 dir_ids[INODES_NUMBER] = DIRECTORY_INODES;
    struct info_t info = {.inode_id = 0};
    struct dentry *pde = file->f_path.dentry;
    for (int i = 0; i < 50; i++) {
        // If we don't have any parent, we reached the root
        if (!pde->d_parent) {
            break;
        }
        pde = pde->d_parent;
        // Is this ancestor one of the directories we look for?
        for (int dir_id = 0; dir_id < INODES_NUMBER; dir_id++) {
            if (pde->d_inode->i_ino == dir_ids[dir_id]) {
                // Yes, let's export the top directory inode
                info.inode_id = pde->d_inode->i_ino;
                break;
            }
        }
    }
    // If we didn't find any, abort
    if (info.inode_id == 0) {
        return 0;
    }
    struct val_t *valp, zero = {};
    valp = counts.lookup_or_try_init(&info, &zero);
    if (valp) {
        if (is_read) {
            valp->reads++;
            valp->rbytes += count;
        } else {
            valp->writes++;
            valp->wbytes += count;
        }
    }
    return 0;
}
int trace_read_entry(struct pt_regs *ctx, struct file *file,
    char __user *buf, size_t count)
{
    return do_entry(ctx, file, buf, count, 1);
}
int trace_write_entry(struct pt_regs *ctx, struct file *file,
    char __user *buf, size_t count)
{
    return do_entry(ctx, file, buf, count, 0);
}
"""


def get_searched_ids(root_directories):
    """Export the inode numbers of the selected directories."""
    from glob import glob
    inode_to_path = {}
    inodes = "{"
    total_dirs = 0
    for root_directory in root_directories.split(','):
        searched_dirs = glob(root_directory, recursive=True)
        if not searched_dirs:
            continue

        for mydir in searched_dirs:
            total_dirs = total_dirs + 1
            # If we pass more than 15 dirs, the eBPF program fails
            if total_dirs > 15:
                print('15 directories limit reached')
                break
            inode_id = os.lstat(mydir)[stat.ST_INO]
            if inode_id in inode_to_path:
                if inode_to_path[inode_id] == mydir:
                    print('Skipping {} as already considered'.format(mydir))
            else:
                inodes = "{},{}".format(inodes, inode_id)
                inode_to_path[inode_id] = mydir
                print('Considering {} with inode_id {}'.format(mydir, inode_id))

    inodes = inodes + '}'
    if len(inode_to_path) == 0:
        print('Cannot find any valid directory')
        exit()
    return inodes.replace('{,', '{'), inode_to_path


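# Substitute the placeholders in the BPF program text: TGID_FILTER becomes
# the PID predicate, DIRECTORY_INODES the brace-enclosed inode list, and
# INODES_NUMBER the number of watched inodes.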
if args.tgid:
    bpf_text = bpf_text.replace('TGID_FILTER', 'tgid != %d' % args.tgid)
else:
    bpf_text = bpf_text.replace('TGID_FILTER', '0')

inodes, inodes_to_path = get_searched_ids(args.rootdirs)
bpf_text = bpf_text.replace("DIRECTORY_INODES", inodes)
bpf_text = bpf_text.replace(
"INODES_NUMBER", '{}'.format(len(inodes.split(','))))

if debug or args.ebpf:
    print(bpf_text)
    if args.ebpf:
        exit()

# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="vfs_read", fn_name="trace_read_entry")
b.attach_kprobe(event="vfs_write", fn_name="trace_write_entry")

DNAME_INLINE_LEN = 32 # linux/dcache.h
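# Note: DNAME_INLINE_LEN is kept from filetop but is not used by dirtop,
# which prints full directory paths rather than dentry names.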

print('Tracing... Output every %d secs. Hit Ctrl-C to end' % interval)


def sort_fn(counts):
    """Define how to sort the columns"""
    if args.sort == "all":
        return (counts[1].rbytes + counts[1].wbytes +
                counts[1].reads + counts[1].writes)
    else:
        return getattr(counts[1], args.sort)


# output
exiting = 0
while 1:
    try:
        sleep(interval)
    except KeyboardInterrupt:
        exiting = 1

    # header
    if clear:
        call("clear")
    else:
        print()
    with open(loadavg) as stats:
        print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))

    print("%-6s %-6s %-8s %-8s %s" %
          ("READS", "WRITES", "R_Kb", "W_Kb", "PATH"))
    # by-directory output
    counts = b.get_table("counts")
    line = 0
    reads = {}
    writes = {}
    reads_Kb = {}
    writes_Kb = {}
    for k, v in reversed(sorted(counts.items(), key=sort_fn)):
        # If it's the first time we see this inode
        if k.inode_id not in reads:
            # let's create a new entry
            reads[k.inode_id] = v.reads
            writes[k.inode_id] = v.writes
            reads_Kb[k.inode_id] = v.rbytes / 1024
            writes_Kb[k.inode_id] = v.wbytes / 1024
        else:
            # otherwise, add the current performance metrics
            # to the previous ones
            reads[k.inode_id] += v.reads
            writes[k.inode_id] += v.writes
            reads_Kb[k.inode_id] += v.rbytes / 1024
            writes_Kb[k.inode_id] += v.wbytes / 1024

    for node_id in reads:
        print("%-6d %-6d %-8d %-8d %s" %
              (reads[node_id], writes[node_id], reads_Kb[node_id],
               writes_Kb[node_id], inodes_to_path[node_id]))
        line += 1
        if line >= maxrows:
            break

    counts.clear()

    countdown -= 1
    if exiting or countdown == 0:
        print("Detaching...")
        exit()