From 561a71e105985ad04f5eb070b3487f5317f063b3 Mon Sep 17 00:00:00 2001 From: Suchakra Sharma Date: Thu, 24 Sep 2020 08:27:15 -0700 Subject: [PATCH] tool: port readahead tool to bcc --- README.md | 1 + man/man8/readahead.8 | 59 +++++++++++++++++ tools/readahead.py | 122 ++++++++++++++++++++++++++++++++++++ tools/readahead_example.txt | 60 ++++++++++++++++++ 4 files changed, 242 insertions(+) create mode 100644 man/man8/readahead.8 create mode 100755 tools/readahead.py create mode 100644 tools/readahead_example.txt diff --git a/README.md b/README.md index cbecf857c254..59d8e090c9b8 100644 --- a/README.md +++ b/README.md @@ -137,6 +137,7 @@ pair of .c and .py files, and some are directories of files. - tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt). - tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt). - tools/[profile](tools/profile.py): Profile CPU usage by sampling stack traces at a timed interval. [Examples](tools/profile_example.txt). +- tools/[readahead](tools/readahead.py): Show performance of read-ahead cache [Examples](tools/readahead_example.txt). - tools/[reset-trace](tools/reset-trace.sh): Reset the state of tracing. Maintenance tool only. [Examples](tools/reset-trace_example.txt). - tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt). - tools/[runqlen](tools/runqlen.py): Run queue length as a histogram. [Examples](tools/runqlen_example.txt). diff --git a/man/man8/readahead.8 b/man/man8/readahead.8 new file mode 100644 index 000000000000..a2a109149b2c --- /dev/null +++ b/man/man8/readahead.8 @@ -0,0 +1,59 @@ +.TH readahead 8 "2020-08-20" "USER COMMANDS" +.SH NAME +readahead \- Show performance of read-ahead cache +.SH SYNOPSIS +.B readahead [-d DURATION] +.SH DESCRIPTION +The tool shows the performance of read-ahead caching on the system under a given load to investigate any +caching issues. It shows a count of unused pages in the cache and also prints a histogram showing how +long they have remained there. + +This tool traces the \fB__do_page_cache_readahead()\fR kernel function to track entry and exit in the +readahead mechanism in the kernel and then uses \fB__page_cache_alloc()\fR and \fBmark_page_accessed()\fR +functions to calculate the age of the page in the cache as well as see how many are left unaccessed. + +Since this uses BPF, only the root user can use this tool. +.SS NOTE ON KPROBES USAGE +Since the tool uses Kprobes, depending on your linux kernel's compilation, these functions may be inlined +and hence not available for Kprobes. To see whether you have the functions available, check \fBvmlinux\fR +source and binary to confirm whether inlining is happening or not. You can also check \fB/proc/kallsyms\fR +on the host and verify if the target functions are present there before using this. +.SH REQUIREMENTS +CONFIG_BPF, bcc +.SH OPTIONS +\-h +Print usage message +.TP +\-d DURATION +Trace the read-ahead caching system for DURATION seconds +.SH EXAMPLES +.TP +Trace for 30 seconds and show histogram of page age (ms) in read-ahead cache along with unused page count: +# +.B readahead -d 30 +.SH OVERHEAD +The kernel functions instrumented by this program could be high-frequency depending on the profile of the +application (for example sequential IO). We advise the users to measure and monitor the overhead before leaving +this turned on in production environments. +.SH SOURCE +This originated as a bpftrace tool from the book "BPF Performance Tools", +published by Addison Wesley (2019): +.IP +http://www.brendangregg.com/bpf-performance-tools-book.html +.PP +See the book for more documentation on this tool. +.PP +This version is in the BCC repository: +.IP +https://github.com/iovisor/bcc +.PP +Also look in the bcc distribution for a companion _examples.txt file containing +example usage, output, and commentary for this tool. +.SH OS +Linux +.SH STABILITY +Unstable - in development. +.SH AUTHOR +Suchakra Sharma +.SH SEE ALSO +readahead(2), madvise(2) diff --git a/tools/readahead.py b/tools/readahead.py new file mode 100755 index 000000000000..14182d5aca04 --- /dev/null +++ b/tools/readahead.py @@ -0,0 +1,122 @@ +#!/usr/bin/python +# @lint-avoid-python-3-compatibility-imports +# +# readahead Show performance of read-ahead cache +# For Linux, uses BCC, eBPF +# +# Copyright (c) 2020 Suchakra Sharma +# Licensed under the Apache License, Version 2.0 (the "License") +# This was originally created for the BPF Performance Tools book +# published by Addison Wesley. ISBN-13: 9780136554820 +# When copying or porting, include this comment. +# +# 20-Aug-2020 Suchakra Sharma Ported from bpftrace to BCC + +from __future__ import print_function +from bcc import BPF +from time import sleep +import ctypes as ct +import argparse + +# arguments +examples = """examples: + ./readahead -d 20 # monitor for 10 seconds and generate stats +""" + +parser = argparse.ArgumentParser( + description="Monitor performance of read ahead cache", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=examples) +parser.add_argument("-d", "--duration", type=int, + help="total duration to monitor for, in seconds") +args = parser.parse_args() +if not args.duration: + args.duration = 99999999 + +# BPF program +program = """ +#include +#include + +BPF_HASH(flag, u32, u8); // used to track if we are in do_page_cache_readahead() +BPF_HASH(birth, struct page*, u64); // used to track timestamps of cache alloc'ed page +BPF_ARRAY(pages); // increment/decrement readahead pages +BPF_HISTOGRAM(dist); + +int entry__do_page_cache_readahead(struct pt_regs *ctx) { + u32 pid; + u8 one = 1; + pid = bpf_get_current_pid_tgid(); + flag.update(&pid, &one); + return 0; +} + +int exit__do_page_cache_readahead(struct pt_regs *ctx) { + u32 pid; + u8 zero = 0; + pid = bpf_get_current_pid_tgid(); + flag.update(&pid, &zero); + return 0; +} + +int exit__page_cache_alloc(struct pt_regs *ctx) { + u32 pid; + u64 ts; + struct page *retval = (struct page*) PT_REGS_RC(ctx); + u32 zero = 0; // static key for accessing pages[0] + pid = bpf_get_current_pid_tgid(); + u8 *f = flag.lookup(&pid); + if (f != NULL && *f == 1) { + ts = bpf_ktime_get_ns(); + birth.update(&retval, &ts); + + u64 *count = pages.lookup(&zero); + if (count) (*count)++; // increment read ahead pages count + } + return 0; +} + +int entry_mark_page_accessed(struct pt_regs *ctx) { + u64 ts, delta; + struct page *arg0 = (struct page *) PT_REGS_PARM1(ctx); + u32 zero = 0; // static key for accessing pages[0] + u64 *bts = birth.lookup(&arg0); + if (bts != NULL) { + delta = bpf_ktime_get_ns() - *bts; + dist.increment(bpf_log2l(delta/1000000)); + + u64 *count = pages.lookup(&zero); + if (count) (*count)--; // decrement read ahead pages count + + birth.delete(&arg0); // remove the entry from hashmap + } + return 0; +} +""" + +b = BPF(text=program) +b.attach_kprobe(event="__do_page_cache_readahead", fn_name="entry__do_page_cache_readahead") +b.attach_kretprobe(event="__do_page_cache_readahead", fn_name="exit__do_page_cache_readahead") +b.attach_kretprobe(event="__page_cache_alloc", fn_name="exit__page_cache_alloc") +b.attach_kprobe(event="mark_page_accessed", fn_name="entry_mark_page_accessed") + +# header +print("Tracing... Hit Ctrl-C to end.") + +# print +def print_stats(): + print() + print("Read-ahead unused pages: %d" % (b["pages"][ct.c_ulong(0)].value)) + print("Histogram of read-ahead used page age (ms):") + print("") + b["dist"].print_log2_hist("age (ms)") + b["dist"].clear() + b["pages"].clear() + +while True: + try: + sleep(args.duration) + print_stats() + except KeyboardInterrupt: + print_stats() + break diff --git a/tools/readahead_example.txt b/tools/readahead_example.txt new file mode 100644 index 000000000000..079dbaae7890 --- /dev/null +++ b/tools/readahead_example.txt @@ -0,0 +1,60 @@ +Demonstration of readahead, the Linux eBPF/bcc version + +Read-ahead mechanism is used by operation sytems to optimize sequential operations +by reading ahead some pages to avoid more expensive filesystem operations. This tool +shows the performance of the read-ahead caching on the system under a given load to +investigate any caching issues. It shows a count for unused pages in the cache and +also prints a histogram showing how long they have remianed there. + +Usage Scenario +============== + +Consider that you are developing a React Native application which performs aggressive +reads while re-encoding a video in local-storage. Usually such an app would be multi- +layered and have transitional library dependencies. The actual read may be performed +by some unknown native library which may or may not be using hints to the OS, such as +madvise(p, LEN, MADV_SEQUENTIAL). If high IOPS is observed in such an app, running +readahead may pin the issue much faster in this case as the developer digs deeper +into what may be causing this. + +An example where such an issue can surface is: https://github.com/boltdb/bolt/issues/691 + +# readahead -d 30 +Tracing... Hit Ctrl-C to end. +^C +Read-ahead unused pages: 6765 +Histogram of read-ahead used page age (ms): + + age (ms) : count distribution + 0 -> 1 : 4236 |****************************************| + 2 -> 3 : 394 |*** | + 4 -> 7 : 1670 |*************** | + 8 -> 15 : 2132 |******************** | + 16 -> 31 : 401 |*** | + 32 -> 63 : 1256 |*********** | + 64 -> 127 : 2352 |********************** | + 128 -> 255 : 357 |*** | + 256 -> 511 : 369 |*** | + 512 -> 1023 : 366 |*** | + 1024 -> 2047 : 181 |* | + 2048 -> 4095 : 439 |**** | + 4096 -> 8191 : 188 |* | + +In the example above, we recorded system-wide stats for 30 seconds. We can observe that +while most of the pages stayed in the readahead cache for quite less time, after 30 +seconds 6765 pages still remained in the cache, yet unaccessed. + +Note on Kprobes Usage +===================== + +This tool uses Kprobes on the following kernel functions: + +__do_page_cache_readahead() +__page_cache_alloc() +mark_page_accessed() + +Since the tool uses Kprobes, depending on your linux kernel's compilation, these +functions may be inlined and hence not available for Kprobes. To see whether you have +the functions available, check vmlinux source and binary to confirm whether inlining is +happening or not. You can also check /proc/kallsyms on the host and verify if the target +functions are present there before using this tool.