diff --git a/README.md b/README.md
index 7fa188b95b31..68c426a9310e 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
 # BPF Compiler Collection (BCC)
 
 BCC is a toolkit for creating efficient kernel tracing and manipulation
-programs, and includes several useful tools and examples. It makes use of 
+programs, and includes several useful tools and examples. It makes use of
 extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature
 that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1
 and above.
@@ -23,7 +23,7 @@ power-of-2 histogram of the I/O size. For efficiency, only the histogram
 summary is returned to user-level.
 
 ```Shell
-# ./bitehist.py 
+# ./bitehist.py
 Tracing... Hit Ctrl-C to end.
 ^C
      kbytes          : count     distribution
@@ -130,6 +130,7 @@ pair of .c and .py files, and some are directories of files.
 - tools/[reset-trace](tools/reset-trace.sh): Reset the state of tracing. Maintenance tool only. [Examples](tools/reset-trace_example.txt).
 - tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
 - tools/[runqlen](tools/runqlen.py): Run queue length as a histogram. [Examples](tools/runqlen_example.txt).
+- tools/[runqslower](tools/runqslower.py): Trace long process scheduling delays. [Examples](tools/runqslower_example.txt).
 - tools/[slabratetop](tools/slabratetop.py): Kernel SLAB/SLUB memory cache allocation rate top. [Examples](tools/slabratetop_example.txt).
 - tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
 - tools/[solisten](tools/solisten.py): Trace TCP socket listen. [Examples](tools/solisten_example.txt).
diff --git a/man/man8/runqlat.8 b/man/man8/runqlat.8
index 2986ff5bbb69..d535ebbf47b1 100644
--- a/man/man8/runqlat.8
+++ b/man/man8/runqlat.8
@@ -13,7 +13,8 @@ wait its turn.
 This tool measures two types of run queue latency:
 
 1. The time from a task being enqueued on a run queue to its context switch
-and execution. This traces enqueue_task_*() -> finish_task_switch(),
+and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
+finish_task_switch() with either raw tracepoints (if supported) or kprobes
 and instruments the run queue latency after a voluntary context switch.
 
 2. The time from when a task was involuntary context switched and still
@@ -109,4 +110,4 @@ Unstable - in development.
 .SH AUTHOR
 Brendan Gregg
 .SH SEE ALSO
-runqlen(8), pidstat(1)
+runqlen(8), runqslower(8), pidstat(1)
diff --git a/man/man8/runqlen.8 b/man/man8/runqlen.8
index 1cc2789c79c4..27a649dac8aa 100644
--- a/man/man8/runqlen.8
+++ b/man/man8/runqlen.8
@@ -83,4 +83,4 @@ Unstable - in development.
 .SH AUTHOR
 Brendan Gregg
 .SH SEE ALSO
-runqlat(8), pidstat(1)
+runqlat(8), runqslower(8), pidstat(1)
diff --git a/man/man8/runqslower.8 b/man/man8/runqslower.8
new file mode 100644
index 000000000000..0baee64aafe0
--- /dev/null
+++ b/man/man8/runqslower.8
@@ -0,0 +1,86 @@
+.TH runqslower 8 "2016-02-07" "USER COMMANDS"
+.SH NAME
+runqslower \- Trace long process scheduling delays.
+.SH SYNOPSIS
+.B runqslower [\-p PID] [min_us]
+.SH DESCRIPTION
+This measures the time a task spends waiting on a run queue (or equivalent
+scheduler data structure) for a turn on-CPU, and shows occurrences of time
+exceeding the passed threshold. This time should be small, but a task may need
+to wait its turn due to CPU load. The higher the CPU load, the longer a task
+will generally need to wait its turn.
+
+This tool measures two types of run queue latency:
+
+1. The time from a task being enqueued on a run queue to its context switch
+and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
+finish_task_switch() with either raw tracepoints (if supported) or kprobes
+and instruments the run queue latency after a voluntary context switch.
+
+2. The time from when a task was involuntary context switched and still
+in the runnable state, to when it next executed. This is instrumented
+from finish_task_switch() alone.
+
+The overhead of this tool may become significant for some workloads:
+see the OVERHEAD section.
+
+This works by tracing various kernel scheduler functions using dynamic tracing,
+and will need updating to match any changes to these functions.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-p PID
+Only show this PID (filtered in kernel for efficiency).
+.TP
+min_us
+Minimum scheduling delay in microseconds to output.
+.SH EXAMPLES
+.TP
+Show scheduling delays longer than 10ms:
+#
+.B runqslower
+.TP
+Show scheduling delays longer than 1ms for the process with PID 123:
+#
+.B runqslower -p 123 1000
+.SH FIELDS
+.TP
+TIME
+Time when the scheduling event occurred.
+.TP
+COMM
+Process name.
+.TP
+PID
+Process ID.
+.TP
+LAT(us)
+Scheduling latency from the time the task was ready to run to the time it was
+assigned to a CPU to run.
+.SH OVERHEAD
+This traces scheduler functions, which can become very frequent. While eBPF
+has very low overhead, and this tool uses in-kernel maps for efficiency, the
+frequency of scheduler events for some workloads may be high enough that the
+overhead of this tool becomes significant. Measure in a lab environment
+to quantify the overhead before use.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Ivan Babrou
+.SH SEE ALSO
+runqlen(8), runqlat(8), pidstat(1)
diff --git a/tools/runqlat.py b/tools/runqlat.py
index ebda11d09bb9..850aaafdb67b 100755
--- a/tools/runqlat.py
+++ b/tools/runqlat.py
@@ -12,7 +12,8 @@
 #
 # This measures two types of run queue latency:
 # 1. The time from a task being enqueued on a run queue to its context switch
-# and execution. This traces enqueue_task_*() -> finish_task_switch(),
+# and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
+# finish_task_switch() with either raw tracepoints (if supported) or kprobes
 # and instruments the run queue latency after a voluntary context switch.
 # 2. The time from when a task was involuntary context switched and still
 # in the runnable state, to when it next executed. This is instrumented
diff --git a/tools/runqslower.py b/tools/runqslower.py
new file mode 100755
index 000000000000..3ec898b4c2de
--- /dev/null
+++ b/tools/runqslower.py
@@ -0,0 +1,261 @@
+#!/usr/bin/python
+# @lint-avoid-python-3-compatibility-imports
+#
+# runqslower    Trace long process scheduling delays.
+#               For Linux, uses BCC, eBPF.
+#
+# This script traces high scheduling delays between tasks becoming
+# ready to run and them actually running on CPU.
+#
+# USAGE: runqslower [-p PID] [min_us]
+#
+# REQUIRES: Linux 4.9+ (BPF_PROG_TYPE_PERF_EVENT support).
+#
+# This measures the time a task spends waiting on a run queue for a turn
+# on-CPU, and shows this time as individual events. This time should be small,
+# but a task may need to wait its turn due to CPU load.
+#
+# This measures two types of run queue latency:
+# 1. The time from a task being enqueued on a run queue to its context switch
+# and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
+# finish_task_switch() with either raw tracepoints (if supported) or kprobes
+# and instruments the run queue latency after a voluntary context switch.
+# 2. The time from when a task was involuntary context switched and still
+# in the runnable state, to when it next executed. This is instrumented
+# from finish_task_switch() alone.
+#
+# Copyright 2016 Cloudflare, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 02-May-2018   Ivan Babrou   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+import argparse
+from time import strftime
+import ctypes as ct
+
+# arguments
+examples = """examples:
+    ./runqslower        # trace run queue latency higher than 10000 us (default)
+    ./runqslower 1000   # trace run queue latency higher than 1000 us
+    ./runqslower -p 123 # trace pid 123 only
+"""
+parser = argparse.ArgumentParser(
+    description="Trace high run queue latency",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-p", "--pid", type=int, metavar="PID", dest="pid",
+    help="trace this PID only")
+parser.add_argument("min_us", nargs="?", default='10000',
+    help="minimum run queue latency to trace, in us (default 10000)")
+parser.add_argument("--ebpf", action="store_true",
+    help=argparse.SUPPRESS)
+args = parser.parse_args()
+min_us = int(args.min_us)
+debug = 0
+
+# define BPF program
+#include <uapi/linux/ptrace.h>
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/pid_namespace.h>
+
+BPF_HASH(start, u32);
+
+struct rq;
+
+struct data_t {
+    u32 pid;
+    char task[TASK_COMM_LEN];
+    u64 delta_us;
+};
+
+BPF_PERF_OUTPUT(events);
+
+// record enqueue timestamp
+static int trace_enqueue(u32 tgid, u32 pid)
+{
+    if (FILTER_PID)
+        return 0;
+    u64 ts = bpf_ktime_get_ns();
+    start.update(&pid, &ts);
+    return 0;
+}
+"""
+
+bpf_text_kprobe = """
+int trace_wake_up_new_task(struct pt_regs *ctx, struct task_struct *p)
+{
+    return trace_enqueue(p->tgid, p->pid);
+}
+
+int trace_ttwu_do_wakeup(struct pt_regs *ctx, struct rq *rq, struct task_struct *p,
+    int wake_flags)
+{
+    return trace_enqueue(p->tgid, p->pid);
+}
+
+// calculate latency
+int trace_run(struct pt_regs *ctx, struct task_struct *prev)
+{
+    u32 pid, tgid;
+
+    // ivcsw: treat like an enqueue event and store timestamp
+    if (prev->state == TASK_RUNNING) {
+        tgid = prev->tgid;
+        pid = prev->pid;
+        if (!(FILTER_PID)) {
+            u64 ts = bpf_ktime_get_ns();
+            start.update(&pid, &ts);
+        }
+    }
+
+    tgid = bpf_get_current_pid_tgid() >> 32;
+    pid = bpf_get_current_pid_tgid();
+
+    u64 *tsp, delta_us;
+
+    // fetch timestamp and calculate delta
+    tsp = start.lookup(&pid);
+    if (tsp == 0) {
+        return 0;   // missed enqueue
+    }
+    delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
+
+    if (FILTER_US)
+        return 0;
+
+    struct data_t data = {};
+    data.pid = pid;
+    data.delta_us = delta_us;
+    bpf_get_current_comm(&data.task, sizeof(data.task));
+
+    // output
+    events.perf_submit(ctx, &data, sizeof(data));
+
+    start.delete(&pid);
+    return 0;
+}
+"""
+
+bpf_text_raw_tp = """
+RAW_TRACEPOINT_PROBE(sched_wakeup)
+{
+    // TP_PROTO(struct task_struct *p)
+    struct task_struct *p = (struct task_struct *)ctx->args[0];
+    u32 tgid, pid;
+
+    bpf_probe_read(&tgid, sizeof(tgid), &p->tgid);
+    bpf_probe_read(&pid, sizeof(pid), &p->pid);
+    return trace_enqueue(tgid, pid);
+}
+
+RAW_TRACEPOINT_PROBE(sched_wakeup_new)
+{
+    // TP_PROTO(struct task_struct *p)
+    struct task_struct *p = (struct task_struct *)ctx->args[0];
+    u32 tgid, pid;
+
+    bpf_probe_read(&tgid, sizeof(tgid), &p->tgid);
+    bpf_probe_read(&pid, sizeof(pid), &p->pid);
+    return trace_enqueue(tgid, pid);
+}
+
+RAW_TRACEPOINT_PROBE(sched_switch)
+{
+    // TP_PROTO(bool preempt, struct task_struct *prev, struct task_struct *next)
+    struct task_struct *prev = (struct task_struct *)ctx->args[1];
+    struct task_struct *next = (struct task_struct *)ctx->args[2];
+    u32 pid, tgid;
+    long state;
+
+    // ivcsw: treat like an enqueue event and store timestamp
+    bpf_probe_read(&state, sizeof(long), &prev->state);
+    if (state == TASK_RUNNING) {
+        bpf_probe_read(&tgid, sizeof(prev->tgid), &prev->tgid);
+        bpf_probe_read(&pid, sizeof(prev->pid), &prev->pid);
+        if (!(FILTER_PID)) {
+            u64 ts = bpf_ktime_get_ns();
+            start.update(&pid, &ts);
+        }
+    }
+
+    bpf_probe_read(&tgid, sizeof(next->tgid), &next->tgid);
+    bpf_probe_read(&pid, sizeof(next->pid), &next->pid);
+
+    u64 *tsp, delta_us;
+
+    // fetch timestamp and calculate delta
+    tsp = start.lookup(&pid);
+    if (tsp == 0) {
+        return 0;   // missed enqueue
+    }
+    delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
+
+    if (FILTER_US)
+        return 0;
+
+    struct data_t data = {};
+    data.pid = pid;
+    data.delta_us = delta_us;
+    bpf_get_current_comm(&data.task, sizeof(data.task));
+
+    // output
+    events.perf_submit(ctx, &data, sizeof(data));
+
+    start.delete(&pid);
+    return 0;
+}
+"""
+
+is_support_raw_tp = BPF.support_raw_tracepoint()
+if is_support_raw_tp:
+    bpf_text += bpf_text_raw_tp
+else:
+    bpf_text += bpf_text_kprobe
+
+# code substitutions
+if min_us == 0:
+    bpf_text = bpf_text.replace('FILTER_US', '0')
+else:
+    bpf_text = bpf_text.replace('FILTER_US', 'delta_us <= %s' % str(min_us))
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % args.pid)
+else:
+    bpf_text = bpf_text.replace('FILTER_PID', '0')
+if debug or args.ebpf:
+    print(bpf_text)
+    if args.ebpf:
+        exit()
+
+# kernel->user event data: struct data_t
+DNAME_INLINE_LEN = 32  # linux/dcache.h
+TASK_COMM_LEN = 16  # linux/sched.h
+class Data(ct.Structure):
+    _fields_ = [
+        ("pid", ct.c_uint),
+        ("task", ct.c_char * TASK_COMM_LEN),
+        ("delta_us", ct.c_ulonglong),
+    ]
+
+# process event
+def print_event(cpu, data, size):
+    event = ct.cast(data, ct.POINTER(Data)).contents
+    print("%-8s %-16s %-6s %14s" % (strftime("%H:%M:%S"), event.task, event.pid, event.delta_us))
+
+# load BPF program
+b = BPF(text=bpf_text)
+if not is_support_raw_tp:
+    b.attach_kprobe(event="ttwu_do_wakeup", fn_name="trace_ttwu_do_wakeup")
+    b.attach_kprobe(event="wake_up_new_task", fn_name="trace_wake_up_new_task")
+    b.attach_kprobe(event="finish_task_switch", fn_name="trace_run")
+
+print("Tracing run queue latency higher than %d us" % min_us)
+print("%-8s %-16s %-6s %14s" % ("TIME", "COMM", "PID", "LAT(us)"))
+
+# read events
+b["events"].open_perf_buffer(print_event, page_cnt=64)
+while 1:
+    b.perf_buffer_poll()
diff --git a/tools/runqslower_example.txt b/tools/runqslower_example.txt
new file mode 100644
index 000000000000..378887581748
--- /dev/null
+++ b/tools/runqslower_example.txt
@@ -0,0 +1,49 @@
+Demonstrations of runqslower, the Linux eBPF/bcc version.
+
+
+runqslower shows high latency scheduling times between tasks becoming
+ready to run and them actually running on CPU. For example:
+
+# runqslower
+Tracing run queue latency higher than 10000 us
+TIME     COMM             PID           LAT(us)
+04:16:32 cc1              12924          12739
+04:16:32 sh               13640          12118
+04:16:32 make             13639          12730
+04:16:32 bash             13655          12047
+04:16:32 bash             13657          12744
+04:16:32 bash             13656          12880
+04:16:32 sh               13660          10846
+04:16:32 gcc              13663          12681
+04:16:32 make             13668          10814
+04:16:32 make             13670          12988
+04:16:32 gcc              13677          11770
+04:16:32 gcc              13678          23519
+04:16:32 as               12999          20541
+[...]
+
+This shows various processes waiting for available CPU during a Linux kernel
+build. By default, the output includes delays of more than 10ms.
+
+These delays can be analyzed in more depth with the "perf sched" tool; see:
+
+* http://www.brendangregg.com/blog/2017-03-16/perf-sched.html
+
+USAGE message:
+
+# ./runqslower -h
+usage: runqslower.py [-h] [-p PID] [min_us]
+
+Trace high run queue latency
+
+positional arguments:
+  min_us             minimum run queue latency to trace, in us (default 10000)
+
+optional arguments:
+  -h, --help         show this help message and exit
+  -p PID, --pid PID  trace this PID only
+
+examples:
+    ./runqslower        # trace run queue latency higher than 10000 us (default)
+    ./runqslower 1000   # trace run queue latency higher than 1000 us
+    ./runqslower -p 123 # trace pid 123 only
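A note for readers of the patch: tools/runqslower.py builds its in-kernel filters by substituting the FILTER_US and FILTER_PID placeholders in the embedded C text before compiling it. The standalone Python sketch below (not part of the patch; the 10000 us threshold and PID 123 are hypothetical command-line values) shows what those substitutions expand to. When a generated condition is true inside the BPF program, the event is skipped rather than reported.

```Python
# Standalone sketch of the placeholder substitution done in tools/runqslower.py.
# The values below are hypothetical, as if the tool were run as: runqslower -p 123 10000
bpf_stub = "if (FILTER_US) return 0; /* ... */ if (FILTER_PID) return 0;"

min_us = 10000  # report only delays longer than this many microseconds
pid = 123       # -p PID filter; without -p the tool substitutes a constant 0 instead

# Same replace() calls as in the tool: a true condition makes the BPF code skip the event.
bpf_stub = bpf_stub.replace('FILTER_US', 'delta_us <= %s' % min_us)
bpf_stub = bpf_stub.replace('FILTER_PID', 'pid != %s' % pid)
print(bpf_stub)
# -> if (delta_us <= 10000) return 0; /* ... */ if (pid != 123) return 0;
```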