Skip to content

Commit

Permalink
tool: port readahead tool to bcc
Browse files Browse the repository at this point in the history
  • Loading branch information
tuxology authored and yonghong-song committed Sep 28, 2020
1 parent e548dd5 commit 561a71e
Show file tree
Hide file tree
Showing 4 changed files with 242 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,7 @@ pair of .c and .py files, and some are directories of files.
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[profile](tools/profile.py): Profile CPU usage by sampling stack traces at a timed interval. [Examples](tools/profile_example.txt).
- tools/[readahead](tools/readahead.py): Show performance of read-ahead cache [Examples](tools/readahead_example.txt).
- tools/[reset-trace](tools/reset-trace.sh): Reset the state of tracing. Maintenance tool only. [Examples](tools/reset-trace_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[runqlen](tools/runqlen.py): Run queue length as a histogram. [Examples](tools/runqlen_example.txt).
Expand Down
59 changes: 59 additions & 0 deletions man/man8/readahead.8
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
.TH readahead 8 "2020-08-20" "USER COMMANDS"
.SH NAME
readahead \- Show performance of read-ahead cache
.SH SYNOPSIS
.B readahead [-d DURATION]
.SH DESCRIPTION
The tool shows the performance of read-ahead caching on the system under a given load to investigate any
caching issues. It shows a count of unused pages in the cache and also prints a histogram showing how
long they have remained there.

This tool traces the \fB__do_page_cache_readahead()\fR kernel function to track entry and exit in the
readahead mechanism in the kernel and then uses \fB__page_cache_alloc()\fR and \fBmark_page_accessed()\fR
functions to calculate the age of the page in the cache as well as see how many are left unaccessed.

Since this uses BPF, only the root user can use this tool.
.SS NOTE ON KPROBES USAGE
Since the tool uses Kprobes, depending on your linux kernel's compilation, these functions may be inlined
and hence not available for Kprobes. To see whether you have the functions available, check \fBvmlinux\fR
source and binary to confirm whether inlining is happening or not. You can also check \fB/proc/kallsyms\fR
on the host and verify if the target functions are present there before using this.
.SH REQUIREMENTS
CONFIG_BPF, bcc
.SH OPTIONS
\-h
Print usage message
.TP
\-d DURATION
Trace the read-ahead caching system for DURATION seconds
.SH EXAMPLES
.TP
Trace for 30 seconds and show histogram of page age (ms) in read-ahead cache along with unused page count:
#
.B readahead -d 30
.SH OVERHEAD
The kernel functions instrumented by this program could be high-frequency depending on the profile of the
application (for example sequential IO). We advise the users to measure and monitor the overhead before leaving
this turned on in production environments.
.SH SOURCE
This originated as a bpftrace tool from the book "BPF Performance Tools",
published by Addison Wesley (2019):
.IP
http:https://www.brendangregg.com/bpf-performance-tools-book.html
.PP
See the book for more documentation on this tool.
.PP
This version is in the BCC repository:
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Suchakra Sharma
.SH SEE ALSO
readahead(2), madvise(2)
122 changes: 122 additions & 0 deletions tools/readahead.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# readahead Show performance of read-ahead cache
# For Linux, uses BCC, eBPF
#
# Copyright (c) 2020 Suchakra Sharma <[email protected]>
# Licensed under the Apache License, Version 2.0 (the "License")
# This was originally created for the BPF Performance Tools book
# published by Addison Wesley. ISBN-13: 9780136554820
# When copying or porting, include this comment.
#
# 20-Aug-2020 Suchakra Sharma Ported from bpftrace to BCC

from __future__ import print_function
from bcc import BPF
from time import sleep
import ctypes as ct
import argparse

# arguments
examples = """examples:
./readahead -d 20 # monitor for 10 seconds and generate stats
"""

parser = argparse.ArgumentParser(
description="Monitor performance of read ahead cache",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-d", "--duration", type=int,
help="total duration to monitor for, in seconds")
args = parser.parse_args()
if not args.duration:
args.duration = 99999999

# BPF program
program = """
#include <uapi/linux/ptrace.h>
#include <linux/mm_types.h>
BPF_HASH(flag, u32, u8); // used to track if we are in do_page_cache_readahead()
BPF_HASH(birth, struct page*, u64); // used to track timestamps of cache alloc'ed page
BPF_ARRAY(pages); // increment/decrement readahead pages
BPF_HISTOGRAM(dist);
int entry__do_page_cache_readahead(struct pt_regs *ctx) {
u32 pid;
u8 one = 1;
pid = bpf_get_current_pid_tgid();
flag.update(&pid, &one);
return 0;
}
int exit__do_page_cache_readahead(struct pt_regs *ctx) {
u32 pid;
u8 zero = 0;
pid = bpf_get_current_pid_tgid();
flag.update(&pid, &zero);
return 0;
}
int exit__page_cache_alloc(struct pt_regs *ctx) {
u32 pid;
u64 ts;
struct page *retval = (struct page*) PT_REGS_RC(ctx);
u32 zero = 0; // static key for accessing pages[0]
pid = bpf_get_current_pid_tgid();
u8 *f = flag.lookup(&pid);
if (f != NULL && *f == 1) {
ts = bpf_ktime_get_ns();
birth.update(&retval, &ts);
u64 *count = pages.lookup(&zero);
if (count) (*count)++; // increment read ahead pages count
}
return 0;
}
int entry_mark_page_accessed(struct pt_regs *ctx) {
u64 ts, delta;
struct page *arg0 = (struct page *) PT_REGS_PARM1(ctx);
u32 zero = 0; // static key for accessing pages[0]
u64 *bts = birth.lookup(&arg0);
if (bts != NULL) {
delta = bpf_ktime_get_ns() - *bts;
dist.increment(bpf_log2l(delta/1000000));
u64 *count = pages.lookup(&zero);
if (count) (*count)--; // decrement read ahead pages count
birth.delete(&arg0); // remove the entry from hashmap
}
return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event="__do_page_cache_readahead", fn_name="entry__do_page_cache_readahead")
b.attach_kretprobe(event="__do_page_cache_readahead", fn_name="exit__do_page_cache_readahead")
b.attach_kretprobe(event="__page_cache_alloc", fn_name="exit__page_cache_alloc")
b.attach_kprobe(event="mark_page_accessed", fn_name="entry_mark_page_accessed")

# header
print("Tracing... Hit Ctrl-C to end.")

# print
def print_stats():
print()
print("Read-ahead unused pages: %d" % (b["pages"][ct.c_ulong(0)].value))
print("Histogram of read-ahead used page age (ms):")
print("")
b["dist"].print_log2_hist("age (ms)")
b["dist"].clear()
b["pages"].clear()

while True:
try:
sleep(args.duration)
print_stats()
except KeyboardInterrupt:
print_stats()
break
60 changes: 60 additions & 0 deletions tools/readahead_example.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
Demonstration of readahead, the Linux eBPF/bcc version

Read-ahead mechanism is used by operation sytems to optimize sequential operations
by reading ahead some pages to avoid more expensive filesystem operations. This tool
shows the performance of the read-ahead caching on the system under a given load to
investigate any caching issues. It shows a count for unused pages in the cache and
also prints a histogram showing how long they have remianed there.

Usage Scenario
==============

Consider that you are developing a React Native application which performs aggressive
reads while re-encoding a video in local-storage. Usually such an app would be multi-
layered and have transitional library dependencies. The actual read may be performed
by some unknown native library which may or may not be using hints to the OS, such as
madvise(p, LEN, MADV_SEQUENTIAL). If high IOPS is observed in such an app, running
readahead may pin the issue much faster in this case as the developer digs deeper
into what may be causing this.

An example where such an issue can surface is: https://github.com/boltdb/bolt/issues/691

# readahead -d 30
Tracing... Hit Ctrl-C to end.
^C
Read-ahead unused pages: 6765
Histogram of read-ahead used page age (ms):

age (ms) : count distribution
0 -> 1 : 4236 |****************************************|
2 -> 3 : 394 |*** |
4 -> 7 : 1670 |*************** |
8 -> 15 : 2132 |******************** |
16 -> 31 : 401 |*** |
32 -> 63 : 1256 |*********** |
64 -> 127 : 2352 |********************** |
128 -> 255 : 357 |*** |
256 -> 511 : 369 |*** |
512 -> 1023 : 366 |*** |
1024 -> 2047 : 181 |* |
2048 -> 4095 : 439 |**** |
4096 -> 8191 : 188 |* |

In the example above, we recorded system-wide stats for 30 seconds. We can observe that
while most of the pages stayed in the readahead cache for quite less time, after 30
seconds 6765 pages still remained in the cache, yet unaccessed.

Note on Kprobes Usage
=====================

This tool uses Kprobes on the following kernel functions:

__do_page_cache_readahead()
__page_cache_alloc()
mark_page_accessed()

Since the tool uses Kprobes, depending on your linux kernel's compilation, these
functions may be inlined and hence not available for Kprobes. To see whether you have
the functions available, check vmlinux source and binary to confirm whether inlining is
happening or not. You can also check /proc/kallsyms on the host and verify if the target
functions are present there before using this tool.

0 comments on commit 561a71e

Please sign in to comment.