Merge pull request iovisor#353 from brendangregg/master

3 tools: fsslower, execsnoop, runqlat
web-logs2 · Feb 8, 2016 · 59d6c35
2 parents 919045a + c4fa80e
commit 59d6c35

Showing 12 changed files with 1,520 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -71,6 +71,8 @@ Tools:
 - tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
 - tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
 - tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
+- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
+- tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
 - tools/[funccount](tools/funccount.py): Count kernel function calls. [Examples](tools/funccount_example.txt).
 - tools/[funclatency](tools/funclatency.py): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
 - tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
@@ -80,6 +82,7 @@ Tools:
 - tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
 - tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
 - tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
+- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
 - tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
 - tools/[stackcount](tools/stackcount.py): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
 - tools/[stacksnoop](tools/stacksnoop.py): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).

diff --git a/man/man8/biolatency.8 b/man/man8/biolatency.8
@@ -65,7 +65,7 @@ Show a latency histogram for each disk device separately:
 usecs
 Microsecond range
 .TP
-mecs
+msecs
 Millisecond range
 .TP
 count

diff --git a/man/man8/execsnoop.8 b/man/man8/execsnoop.8
@@ -0,0 +1,88 @@
+.TH execsnoop 8 "2016-02-07" "USER COMMANDS"
+.SH NAME
+execsnoop \- Trace new processes via exec() syscalls. Uses Linux eBPF/bcc.
+.SH SYNOPSIS
+.B execsnoop [\-h] [\-t] [\-X] [\-n NAME]
+.SH DESCRIPTION
+execsnoop traces new processes, showing the filename executed, argument
+list, and return value (0 for success).
+
+It works by tracing the execve() system call (the commonly used exec() variant).
+This catches new processes that follow the fork->exec sequence, as well as
+processes that re-exec() themselves. Some applications fork() but do not
+exec(), e.g., for worker processes, which won't be included in the execsnoop
+output.
+
+This works by tracing the kernel sys_execve() function using dynamic tracing,
+and will need updating to match any changes to this function.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-t
+Include a timestamp column.
+.TP
+\-X
+Exclude failed exec()s.
+.TP
+\-n NAME
+Only print command lines matching this name (regex), matched anywhere.
+.SH EXAMPLES
+.TP
+Trace all exec() syscalls:
+#
+.B execsnoop
+.TP
+Trace all exec() syscalls, and include timestamps:
+#
+.B execsnoop \-t
+.TP
+Only trace successful exec()s:
+#
+.B execsnoop \-X
+.TP
+Only trace exec()s where the filename or arguments contain "mount":
+#
+.B execsnoop \-n mount
+.SH FIELDS
+.TP
+TIME(s)
+Time of exec() return, in seconds.
+.TP
+PCOMM
+Parent process/command name.
+.TP
+PID
+Process ID.
+.TP
+RET
+Return value of exec(). 0 == success.
+.TP
+ARGS
+Filename for the exec(), followed by up to 19 arguments. An ellipsis "..." is
+shown if the argument list is known to be truncated.
+.SH OVERHEAD
+This traces the kernel execve function and prints output for each event. As the
+rate of this is generally expected to be low (< 1000/s), the overhead is also
+expected to be negligible. If you have an application that is calling a high
+rate of exec()s, then test and understand overhead before use.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+opensnoop(8)
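
As a reading aid for the \-n and \-X options documented in execsnoop.8 above, here is a minimal Python sketch of the filtering they describe: a regex searched anywhere in the command line, and failed exec()s dropped on request. The `ExecEvent` type and `keep_event` helper are hypothetical names chosen for illustration; they are not taken from execsnoop.py, which may implement this differently.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecEvent:
    """Hypothetical record for one exec() event (not execsnoop's real type)."""
    pcomm: str        # parent process/command name
    pid: int
    ret: int          # 0 == success, non-zero on failure
    args: List[str]   # filename followed by its arguments


def keep_event(event: ExecEvent,
               name: Optional[re.Pattern] = None,
               exclude_failed: bool = False) -> bool:
    """Decide whether an event would be printed under -n NAME and -X."""
    # -X: drop exec()s that returned an error.
    if exclude_failed and event.ret != 0:
        return False
    # -n NAME: regex may match anywhere in the filename or arguments.
    if name is not None and not name.search(" ".join(event.args)):
        return False
    return True


# Made-up sample events, for illustration only.
events = [
    ExecEvent("bash", 101, 0, ["/bin/mount", "-a"]),
    ExecEvent("bash", 102, -2, ["/usr/bin/umount", "/mnt"]),
    ExecEvent("bash", 103, 0, ["/bin/ls", "-l"]),
]

# Equivalent in spirit to: execsnoop -X -n mount
shown = [e for e in events if keep_event(e, re.compile("mount"), True)]
```

Here `shown` keeps only PID 101: PID 102 matches "mount" but failed, and PID 103 succeeded but does not match.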
diff --git a/man/man8/fileslower.8 b/man/man8/fileslower.8
@@ -0,0 +1,111 @@
+.TH fileslower 8 "2016-02-07" "USER COMMANDS"
+.SH NAME
+fileslower \- Trace slow synchronous file reads and writes.
+.SH SYNOPSIS
+.B fileslower [\-h] [\-p PID] [min_ms]
+.SH DESCRIPTION
+This script uses kernel dynamic tracing of synchronous reads and writes
+at the VFS interface, to identify slow file reads and writes for any file
+system.
+
+This version traces __vfs_read() and __vfs_write(), showing only
+synchronous I/O (the path to new_sync_read() and new_sync_write()), and
+I/O with filenames. This approach provides a view of just two file
+system request types: file reads and writes. There are typically many others:
+asynchronous I/O, directory operations, file handle operations, file open()s,
+fflush(), etc.
+
+WARNING: See the OVERHEAD section.
+
+By default, a minimum millisecond threshold of 10 is used.
+
+Since this works by tracing various kernel __vfs_*() functions using dynamic
+tracing, it will need updating to match any changes to these functions. A
+future version should switch to using FS tracepoints instead.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-p PID
+Trace this PID only.
+.TP
+min_ms
+Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
+.SH EXAMPLES
+.TP
+Trace synchronous file reads and writes slower than 10 ms:
+#
+.B fileslower
+.TP
+Trace slower than 1 ms:
+#
+.B fileslower 1
+.TP
+Trace slower than 1 ms, for PID 181 only:
+#
+.B fileslower \-p 181 1
+.SH FIELDS
+.TP
+TIME(s)
+Time of I/O completion since the first I/O seen, in seconds.
+.TP
+COMM
+Process name.
+.TP
+PID
+Process ID.
+.TP
+D
+Direction of I/O. R == read, W == write.
+.TP
+BYTES
+Size of I/O, in bytes.
+.TP
+LAT(ms)
+Latency (duration) of I/O, measured from when the application issued it to VFS
+to when it completed. This time is inclusive of block device I/O, file system
+CPU cycles, file system locks, run queue latency, etc. It's a more accurate
+measure of the latency suffered by applications performing file system I/O
+than measurements taken down at the block device interface.
+.TP
+FILENAME
+A cached kernel file name (comes from dentry->d_iname).
+.SH OVERHEAD
+Depending on the frequency of application reads and writes, overhead can become
+severe, in the worst case slowing applications by 2x. In the best case, the
+overhead is negligible. Hopefully for real world workloads the overhead is
+often at the lower end of the spectrum -- test before use. The reason for
+high overhead is that this traces VFS reads and writes, which includes FS
+cache reads and writes, and can exceed one million events per second if the
+application is I/O heavy. While the instrumentation is extremely lightweight,
+and uses in-kernel eBPF maps for efficient timing and filtering, multiplying
+that per-event cost by one million events per second can add up to significant
+total overhead. You can get an idea of the possible cost by just counting the
+instrumented events using the bcc funccount tool, e.g.:
+.PP
+# ./funccount.py -i 1 -r '^__vfs_(read|write)$'
+.PP
+This also adds overhead, but somewhat less than fileslower.
+.PP
+If the overhead is prohibitive for your workload, I'd recommend moving
+down-stack a little from VFS into the file system functions (ext4, xfs, etc).
+Look for updates to bcc for specific file system tools that do this. The
+advantage of a per-file system approach is that we can trace post-cache,
+greatly reducing events and overhead. The disadvantage is needing custom
+tracing approaches for each different file system (whereas VFS is generic).
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+biosnoop(8), funccount(8)
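
The OVERHEAD section of fileslower.8 above reasons in terms of event rate multiplied by per-event cost. A rough Python sketch of that arithmetic follows. The function name is hypothetical, and the 500 ns per-event cost is a placeholder chosen only to make the numbers easy to follow; it is not a measured figure for fileslower. The event rate would come from something like the funccount command shown in the man page.

```python
def overhead_fraction(events_per_sec: float, cost_ns_per_event: float) -> float:
    """Estimate the fraction of one CPU consumed by per-event instrumentation.

    Both inputs are assumptions supplied by the caller: the rate can be
    counted on a live system, the per-event cost has to be measured.
    """
    ns_per_sec = 1e9
    return events_per_sec * cost_ns_per_event / ns_per_sec


# Placeholder numbers: one million events/s at an assumed 500 ns each.
heavy = overhead_fraction(1_000_000, 500)

# The same assumed cost at 1,000 events/s.
light = overhead_fraction(1_000, 500)
```

With these placeholder inputs, `heavy` comes to half of one CPU while `light` is 0.05% of one CPU, which is the spread between "severe" and "negligible" that the section describes.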
diff --git a/man/man8/funclatency.8 b/man/man8/funclatency.8
@@ -88,7 +88,7 @@ Nanosecond range
 usecs
 Microsecond range
 .TP
-mecs
+msecs
 Millisecond range
 .TP
 count

diff --git a/man/man8/runqlat.8 b/man/man8/runqlat.8
@@ -0,0 +1,105 @@
+.TH runqlat 8 "2016-02-07" "USER COMMANDS"
+.SH NAME
+runqlat \- Run queue (scheduler) latency as a histogram.
+.SH SYNOPSIS
+.B runqlat [\-h] [\-T] [\-m] [\-P] [\-p PID] [interval] [count]
+.SH DESCRIPTION
+This measures the time a task spends waiting on a run queue (or equivalent
+scheduler data structure) for a turn on-CPU, and shows this time as a
+histogram. This time should be small, but a task may need to wait its turn due
+to CPU load. The higher the CPU load, the longer a task will generally need to
+wait its turn.
+
+This tool measures two types of run queue latency:
+
+1. The time from a task being enqueued on a run queue to its context switch
+and execution. This traces enqueue_task_*() -> finish_task_switch(),
+and instruments the run queue latency after a voluntary context switch.
+
+2. The time from when a task was involuntarily context switched and still
+in the runnable state, to when it next executed. This is instrumented
+from finish_task_switch() alone.
+
+This tool uses in-kernel eBPF maps for storing timestamps and the histogram,
+for efficiency. Despite this, the overhead of this tool may become significant
+for some workloads: see the OVERHEAD section.
+
+This works by tracing various kernel scheduler functions using dynamic tracing,
+and will need updating to match any changes to these functions.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-T
+Include timestamps on output.
+.TP
+\-m
+Output histogram in milliseconds.
+.TP
+\-P
+Print a histogram for each PID.
+.TP
+\-p PID
+Only show this PID (filtered in kernel for efficiency).
+.TP
+interval
+Output interval, in seconds.
+.TP
+count
+Number of outputs.
+.SH EXAMPLES
+.TP
+Summarize run queue latency as a histogram:
+#
+.B runqlat
+.TP
+Print 1 second summaries, 10 times:
+#
+.B runqlat 1 10
+.TP
+Print 1 second summaries, using milliseconds as units for the histogram, and include timestamps on output:
+#
+.B runqlat \-mT 1
+.TP
+Trace PID 186 only, 1 second summaries:
+#
+.B runqlat \-p 186 1
+.SH FIELDS
+.TP
+usecs
+Microsecond range
+.TP
+msecs
+Millisecond range
+.TP
+count
+How many times a task event fell into this range
+.TP
+distribution
+An ASCII bar chart to visualize the distribution (count column)
+.SH OVERHEAD
+This traces scheduler functions, which can become very frequent. While eBPF
+has very low overhead, and this tool uses in-kernel maps for efficiency, the
+frequency of scheduler events for some workloads may be high enough that the
+overhead of this tool becomes significant. Measure in a lab environment
+to quantify the overhead before use.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+pidstat(1)
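
The FIELDS section of runqlat.8 above describes a range column (usecs or msecs), a count, and an ASCII distribution bar. The Python sketch below shows one way such a histogram can be built and rendered in user space. The power-of-two bucket boundaries, the function names, and the sample latencies are all assumptions made for illustration; the man page does not specify the bucket layout, and the real tool keeps its histogram in an in-kernel eBPF map rather than in Python.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Bucket = Tuple[int, int]


def log2_bucket(value: int) -> Bucket:
    """Return the inclusive (low, high) power-of-two range containing value."""
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)   # largest power of two <= value
    return (low, 2 * low - 1)


def log2_histogram(latencies: Iterable[int]) -> Dict[Bucket, int]:
    """Count how many latency samples fall into each power-of-two range."""
    return dict(Counter(log2_bucket(v) for v in latencies))


def render(hist: Dict[Bucket, int], unit: str = "usecs", width: int = 40) -> str:
    """Format the histogram as range, count, and a bar scaled to the peak."""
    peak = max(hist.values())
    lines = [f"{unit:>20} : {'count':<8} distribution"]
    for (low, high), count in sorted(hist.items()):
        bar = "*" * (count * width // peak)
        lines.append(f"{low:>8} -> {high:<8} : {count:<8} |{bar:<{width}}|")
    return "\n".join(lines)


# Made-up run queue latencies in microseconds, for illustration only.
samples = [0, 1, 3, 5, 6, 9]
hist = log2_histogram(samples)
print(render(hist))
```

Power-of-two ranges keep the number of buckets small even when latencies span several orders of magnitude, which is why the sketch uses them.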