tools: add tool to detect potential deadlocks in running programs
`deadlock_detector` is a new tool to detect potential deadlocks (lock order
inversions) in a running process. The program attaches uprobes on
`pthread_mutex_lock` and `pthread_mutex_unlock` to build a mutex wait directed
graph, and then looks for a cycle in this graph. This graph has the following
properties:

- Nodes in the graph represent mutexes.
- Edge (A, B) exists if there exists some thread T where lock(A) was called
  and lock(B) was called before unlock(A) was called.

If there is a cycle in this graph, this indicates that there is a lock order
inversion (potential deadlock). If the program finds a lock order inversion, the
program will dump the cycle of mutexes, dump the stack traces where each mutex
was acquired, and then exit.
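
The graph construction and cycle search described above can be sketched in plain Python. This is a minimal illustration, not the tool's implementation: the `(thread, op, mutex)` event format and the names `build_lock_order_graph` and `find_cycle` are assumptions made for the example, and the depth-first search stands in for the `networkx` cycle detection the tool depends on.

```python
from collections import defaultdict


def build_lock_order_graph(events):
    """Return edges (A -> B): some thread locked B while still holding A.

    `events` is a hypothetical trace format: (thread_id, op, mutex) tuples
    with op being "lock" or "unlock".
    """
    held = defaultdict(list)   # thread id -> mutexes currently held
    edges = defaultdict(set)   # mutex -> mutexes locked while it was held
    for thread, op, mutex in events:
        if op == "lock":
            if mutex in held[thread]:
                continue  # already held by this thread: never add a self edge
            for other in held[thread]:
                edges[other].add(mutex)
            held[thread].append(mutex)
        elif op == "unlock" and mutex in held[thread]:
            held[thread].remove(mutex)
    return edges


def find_cycle(edges):
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)   # every node starts WHITE (unvisited)
    path = []                  # nodes on the current DFS path

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(edges.get(node, ())):
            if color[nxt] == GRAY:          # back edge: nxt is on the path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in sorted(edges):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Thread 1 takes M0 then M1; thread 2 takes M1 then M0 (lock order inversion).
events = [
    (1, "lock", "M0"), (1, "lock", "M1"), (1, "unlock", "M1"), (1, "unlock", "M0"),
    (2, "lock", "M1"), (2, "lock", "M0"), (2, "unlock", "M0"), (2, "unlock", "M1"),
]
cycle = find_cycle(build_lock_order_graph(events))
print(cycle)  # ['M0', 'M1', 'M0']
```

Note that the two threads in this trace never actually block each other; the inversion is reported from the recorded lock order alone, which is why the result is a potential deadlock rather than an observed one.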

The output format is similar to that of ThreadSanitizer (see example:
https://github.com/google/sanitizers/wiki/ThreadSanitizerDeadlockDetector).

This program can only find potential deadlocks that occur while the program is
tracing the process. It cannot find deadlocks that may have occurred before the
program was attached to the process.

If the traced process has many mutexes and threads, this program will add a
very large overhead because every mutex lock/unlock and clone call will be
traced. This tool is meant for debugging only, and you should run this tool
only on programs where the slowdown is acceptable.

Note: This tool adds a dependency on `networkx` for the graph libraries
(building a directed graph and cycle detection).

Note: This tool does not work for shared mutexes or recursive mutexes.

For shared (read-write) mutexes, a deadlock requires a cycle in the wait
graph where at least one of the mutexes in the cycle is acquired with
exclusive (write) ownership.
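
As a rough illustration of that condition, the check below filters a cycle by acquisition mode. It is a hypothetical sketch only: the tool does not record whether a lock was taken in shared or exclusive mode, and the `(mutex, mode)` representation and the name `may_deadlock` are assumptions. The condition is necessary, not sufficient.

```python
def may_deadlock(cycle):
    """Return True if a cycle of read-write lock acquisitions could deadlock.

    `cycle` is a hypothetical list of (mutex, mode) pairs, one per
    acquisition in the cycle, with mode "read" (shared) or "write"
    (exclusive). A cycle made only of shared acquisitions cannot block.
    """
    return any(mode == "write" for _, mode in cycle)


readers_only = [("A", "read"), ("B", "read")]
mixed = [("A", "read"), ("B", "write")]
print(may_deadlock(readers_only), may_deadlock(mixed))  # False True
```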

For recursive mutexes, lock() may be called multiple times on the same mutex
by the same thread. However, there is no way to determine if a mutex is a
recursive mutex after the mutex has been created. To avoid noisy reports, the
tool ignores a lock() on a mutex the thread already holds. As a result, this
tool will not find potential deadlocks that involve only one mutex.
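
A small sketch of why single-mutex deadlocks go unreported, assuming a per-thread list of held mutexes and a set of edges (the name `on_lock` and both data structures are hypothetical, chosen for the example):

```python
def on_lock(held, edges, mutex):
    """Record a lock() call; `held` lists the calling thread's held mutexes."""
    if mutex in held:
        # Already held: either a recursive mutex or a genuine self-deadlock.
        # The two cases look identical here, so no edge is recorded.
        return
    for other in held:
        edges.add((other, mutex))
    held.append(mutex)


held, edges = [], set()
on_lock(held, edges, "M")
on_lock(held, edges, "M")  # second lock() on the same mutex, same thread
print(edges)  # set(): no (M, M) self edge, so no one-node cycle exists
```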
kennyyu committed Feb 3, 2017
1 parent 4a57f4d commit 66fb4d2
Showing 8 changed files with 926 additions and 3 deletions.
6 changes: 4 additions & 2 deletions FAQ.txt
@@ -1,8 +1,10 @@
Q: while running 'make test' I'm seeing:
-'ImportError: No module named pyroute2'
-A: Install pyroute2:
+'ImportError: No module named pyroute2' or 'ImportError: No module named networkx'
+A: Install pyroute2 and networkx:
git clone https://github.com/svinota/pyroute2.git
cd pyroute2; sudo make install
+git clone https://github.com/networkx/networkx.git
+cd networkx; sudo make install

Q: hello_world.py fails with:
OSError: libbcc.so: cannot open shared object file: No such file or directory
8 changes: 8 additions & 0 deletions INSTALL.md
@@ -99,6 +99,13 @@ cd pyroute2; sudo make install
sudo python /usr/share/bcc/examples/simple_tc.py
```

(Optional) Install networkx for additional deadlock detector features
```bash
git clone git@github.com:networkx/networkx.git
cd networkx; sudo make install
sudo python /usr/share/bcc/tools/deadlock_detector.py
```

## Fedora - Binary

Install a 4.2+ kernel from
@@ -200,6 +207,7 @@ sudo dnf install -y luajit luajit-devel # for Lua support
sudo dnf install -y \
http://pkgs.repoforge.org/netperf/netperf-2.6.0-1.el6.rf.x86_64.rpm
sudo pip install pyroute2
sudo pip install networkx
```

### Install binary clang
1 change: 1 addition & 0 deletions README.md
@@ -89,6 +89,7 @@ Examples:
- tools/[cpuunclaimed](tools/cpuunclaimed.py): Sample CPU run queues and calculate unclaimed idle CPU. [Examples](tools/cpuunclaimed_example.txt)
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[deadlock_detector.py](tools/deadlock_detector.py): Detect potential deadlocks on a running process. [Examples](tools/deadlock_detector_example.txt)
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[ext4dist](tools/ext4dist.py): Summarize ext4 operation latency distribution as a histogram. [Examples](tools/ext4dist_example.txt).
- tools/[ext4slower](tools/ext4slower.py): Trace slow ext4 operations. [Examples](tools/ext4slower_example.txt).
2 changes: 1 addition & 1 deletion debian/control
@@ -3,7 +3,7 @@ Maintainer: Brenden Blanco <[email protected]>
Section: misc
Priority: optional
Standards-Version: 3.9.5
-Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8, llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev, libelf-dev, bison, flex, libedit-dev, clang-format | clang-format-3.7, python-netaddr, python-pyroute2, luajit, libluajit-5.1-dev
+Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8, llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev, libelf-dev, bison, flex, libedit-dev, clang-format | clang-format-3.7, python-netaddr, python-networkx, python-pyroute2, luajit, libluajit-5.1-dev
Homepage: https://github.com/iovisor/bcc

Package: libbcc
108 changes: 108 additions & 0 deletions man/man8/deadlock_detector.8
@@ -0,0 +1,108 @@
.TH deadlock_detector 8 "2017-02-01" "USER COMMANDS"
.SH NAME
deadlock_detector \- Find potential deadlocks (lock order inversions)
in a running program.
.SH SYNOPSIS
.B deadlock_detector [\-h] [\--dump-graph FILE]
[\--lock-symbols LOCK_SYMBOLS] [\--unlock-symbols UNLOCK_SYMBOLS] binary pid
.SH DESCRIPTION
deadlock_detector detects potential deadlocks on a running process. The program
attaches uprobes on `pthread_mutex_lock` and `pthread_mutex_unlock` by default
to build a mutex wait directed graph, and then looks for a cycle in this graph.
This graph has the following properties:

- Nodes in the graph represent mutexes.

- Edge (A, B) exists if there exists some thread T where lock(A) was called
and lock(B) was called before unlock(A) was called.

If there is a cycle in this graph, this indicates that there is a lock order
inversion (potential deadlock). If the program finds a lock order inversion, the
program will dump the cycle of mutexes, dump the stack traces where each mutex
was acquired, and then exit.

This program can only find potential deadlocks that occur while the program is
tracing the process. It cannot find deadlocks that may have occurred before the
program was attached to the process.

This tool does not work for shared mutexes or recursive mutexes.

Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF, bcc, and networkx
.SH OPTIONS
.TP
\--dump-graph DUMP_GRAPH
If set, this will dump the mutex graph to the specified file.
.TP
\--lock-symbols LOCK_SYMBOLS
Comma-separated list of lock symbols to trace. Default is pthread_mutex_lock
.TP
\--unlock-symbols UNLOCK_SYMBOLS
Comma-separated list of unlock symbols to trace. Default is pthread_mutex_unlock
.TP
binary
Absolute path to binary
.TP
pid
Pid to trace
.SH EXAMPLES
.TP
Find potential deadlocks in a process:
#
.B deadlock_detector /path/to/binary $(pidof binary)
.TP
Find potential deadlocks in a process and dump the mutex wait graph to a file:
#
.B deadlock_detector /path/to/binary $(pidof binary) --dump-graph graph.json
.TP
Find potential deadlocks in a process with custom mutexes:
#
.B deadlock_detector /path/to/binary $(pidof binary)
--lock-symbols custom_mutex1_lock,custom_mutex2_lock
--unlock-symbols custom_mutex1_unlock,custom_mutex2_unlock
.SH OUTPUT
This program does not output any fields. Rather, it will keep running until
it finds a potential deadlock, or the user hits Ctrl-C. If the program finds
a potential deadlock, it will output the stack traces and lock order inversion
in the following format and exit:
.TP
Potential Deadlock Detected!
.TP
Cycle in lock order graph: Mutex M0 => Mutex M1 => Mutex M0
.TP
Mutex M1 acquired here while holding Mutex M0 in Thread T:
.B [stack trace]
.TP
Mutex M0 previously acquired by the same Thread T here:
.B [stack trace]
.TP
Mutex M0 acquired here while holding Mutex M1 in Thread S:
.B [stack trace]
.TP
Mutex M1 previously acquired by the same Thread S here:
.B [stack trace]
.TP
Thread T created by Thread R here:
.B [stack trace]
.TP
Thread S created by Thread Q here:
.B [stack trace]
.SH OVERHEAD
This traces all mutex lock and unlock events and all thread creation events
on the traced process. The overhead of this can be high if the process has many
threads and mutexes. You should only run this on a process where the slowdown
is acceptable.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Kenny Yu
207 changes: 207 additions & 0 deletions tools/deadlock_detector.c
@@ -0,0 +1,207 @@
/*
* deadlock_detector.c Detects potential deadlocks in a running process.
* For Linux, uses BCC, eBPF. See .py file.
*
* Copyright 2017 Facebook, Inc.
* Licensed under the Apache License, Version 2.0 (the "License")
*
* 1-Feb-2017 Kenny Yu Created this.
*/

#include <linux/sched.h>
#include <uapi/linux/ptrace.h>

// Maximum number of mutexes a single thread can hold at once.
// If the number is too big, the unrolled loops will cause the stack
// to be too big, and the bpf verifier will fail.
#define MAX_HELD_MUTEXES 16

// Info about held mutexes. `mutex` will be 0 if not held.
struct held_mutex_t {
u64 mutex;
u64 stack_id;
};

// List of mutexes that a thread is holding. Whenever we loop over this array,
// we need to force the compiler to unroll the loop, otherwise the bpf verifier
// will fail because the loop will create a backwards edge.
struct thread_to_held_mutex_leaf_t {
struct held_mutex_t held_mutexes[MAX_HELD_MUTEXES];
};

// Map of thread ID -> array of (mutex addresses, stack id)
BPF_TABLE("hash", u32, struct thread_to_held_mutex_leaf_t,
thread_to_held_mutexes, 2097152);

// Key type for edges. Represents an edge from mutex1 => mutex2.
struct edges_key_t {
u64 mutex1;
u64 mutex2;
};

// Leaf type for edges. Holds information about where each mutex was acquired.
struct edges_leaf_t {
u64 mutex1_stack_id;
u64 mutex2_stack_id;
u32 thread_pid;
char comm[TASK_COMM_LEN];
};

// Represents all edges currently in the mutex wait graph.
BPF_TABLE("hash", struct edges_key_t, struct edges_leaf_t, edges, 2097152);

// Info about parent thread when a child thread is created.
struct thread_created_leaf_t {
u64 stack_id;
u32 parent_pid;
char comm[TASK_COMM_LEN];
};

// Map of child thread pid -> info about parent thread.
BPF_TABLE("hash", u32, struct thread_created_leaf_t, thread_to_parent, 10240);

// Stack traces when threads are created and when mutexes are locked/unlocked.
BPF_STACK_TRACE(stack_traces, 655360);

// The first argument to the user space function we are tracing
// is a pointer to the mutex M being acquired by thread T.
//
// For all mutexes N in mutexes_held[T]
// add edge N => M (held by T)
// mutexes_held[T].add(M)
int trace_mutex_acquire(struct pt_regs *ctx, void *mutex_addr) {
// Upper 32 bits are the process ID; the u32 truncation keeps the thread ID.
u32 pid = bpf_get_current_pid_tgid();
u64 mutex = (u64)mutex_addr;

struct thread_to_held_mutex_leaf_t empty_leaf = {};
struct thread_to_held_mutex_leaf_t *leaf =
thread_to_held_mutexes.lookup_or_init(&pid, &empty_leaf);
if (!leaf) {
bpf_trace_printk(
"could not add thread_to_held_mutex key, thread: %d, mutex: %p\n", pid,
mutex);
return 1; // Could not insert, no more memory
}

// Recursive mutexes lock the same mutex multiple times. We cannot tell if
// the mutex is recursive after the mutex is already created. To avoid noisy
// reports, disallow self edges. Do one pass to check if we are already
// holding the mutex, and if we are, do nothing.
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
if (leaf->held_mutexes[i].mutex == mutex) {
return 1; // Disallow self edges
}
}

u64 stack_id =
stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);

int added_mutex = 0;
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
// If this is a free slot, see if we can insert.
if (!leaf->held_mutexes[i].mutex) {
if (!added_mutex) {
leaf->held_mutexes[i].mutex = mutex;
leaf->held_mutexes[i].stack_id = stack_id;
added_mutex = 1;
}
continue; // Nothing to do for a free slot
}

// Add edges from held mutex => current mutex
struct edges_key_t edge_key = {};
edge_key.mutex1 = leaf->held_mutexes[i].mutex;
edge_key.mutex2 = mutex;

struct edges_leaf_t edge_leaf = {};
edge_leaf.mutex1_stack_id = leaf->held_mutexes[i].stack_id;
edge_leaf.mutex2_stack_id = stack_id;
edge_leaf.thread_pid = pid;
bpf_get_current_comm(&edge_leaf.comm, sizeof(edge_leaf.comm));

// Returns non-zero on error
int result = edges.update(&edge_key, &edge_leaf);
if (result) {
bpf_trace_printk("could not add edge key %p, %p, error: %d\n",
edge_key.mutex1, edge_key.mutex2, result);
continue; // Could not insert, no more memory
}
}

// There were no free slots for this mutex.
if (!added_mutex) {
bpf_trace_printk("could not add mutex %p, added_mutex: %d\n", mutex,
added_mutex);
return 1;
}
return 0;
}

// The first argument to the user space function we are tracing
// is a pointer to the mutex M held by thread T.
//
// mutexes_held[T].remove(M)
int trace_mutex_release(struct pt_regs *ctx, void *mutex_addr) {
// Upper 32 bits are the process ID; the u32 truncation keeps the thread ID.
u32 pid = bpf_get_current_pid_tgid();
u64 mutex = (u64)mutex_addr;

struct thread_to_held_mutex_leaf_t *leaf =
thread_to_held_mutexes.lookup(&pid);
if (!leaf) {
// If the leaf does not exist for the pid, then it means we either missed
// the acquire event, or we had no more memory and could not add it.
bpf_trace_printk(
"could not find thread_to_held_mutex, thread: %d, mutex: %p\n", pid,
mutex);
return 1;
}

// For older kernels without "Bpf: allow access into map value arrays"
// (https://lkml.org/lkml/2016/8/30/287) the bpf verifier will fail with an
// invalid memory access on `leaf->held_mutexes[i]` below. On newer kernels,
// we can avoid making this extra copy in `value` and use `leaf` directly.
struct thread_to_held_mutex_leaf_t value = {};
bpf_probe_read(&value, sizeof(struct thread_to_held_mutex_leaf_t), leaf);

#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
// Find the current mutex (if it exists), and clear it.
// Note: Can't use `leaf->` in this if condition, see comment above.
if (value.held_mutexes[i].mutex == mutex) {
leaf->held_mutexes[i].mutex = 0;
leaf->held_mutexes[i].stack_id = 0;
}
}

return 0;
}

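// Trace return from clone() syscall in the parent thread, where the return
// value (> 0) is the child's thread ID.
int trace_clone(struct pt_regs *ctx, unsigned long flags, void *child_stack,
void *ptid, void *ctid, struct pt_regs *regs) {
// Check the sign before narrowing to u32: an unsigned value is never < 0,
// so a negative error return would otherwise pass the check.
long ret = PT_REGS_RC(ctx);
if (ret <= 0) {
return 1;
}
u32 child_pid = ret;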

struct thread_created_leaf_t thread_created_leaf = {};
thread_created_leaf.parent_pid = bpf_get_current_pid_tgid();
thread_created_leaf.stack_id =
stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
bpf_get_current_comm(&thread_created_leaf.comm,
sizeof(thread_created_leaf.comm));

struct thread_created_leaf_t *insert_result =
thread_to_parent.lookup_or_init(&child_pid, &thread_created_leaf);
if (!insert_result) {
bpf_trace_printk(
"could not add thread_created_key, child: %d, parent: %d\n", child_pid,
thread_created_leaf.parent_pid);
return 1; // Could not insert, no more memory
}
return 0;
}