-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smoke tests for the tools #1032
Conversation
I'm off to bed now; if this builds, I'll rebase tomorrow to get rid of the horrible commit mess and then it will be ready for merging. In the meantime, @brendangregg @4ast @drzaeus77 can you take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great! thank you for working on it!
tests/python/test_tools_smoke.py
Outdated
def run_with_int(self, command, timeout=5, allow_early=False, kill=False): | ||
full_command = TOOLS_DIR + command | ||
signal = "KILL" if kill else "INT" | ||
rc = subprocess.call("timeout -s %s -k %ds %ds %s > /dev/null" % |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeout on centos6 doesn't have -k flag :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Do we care about CentOS 6? If we do, I could nest the timeouts:
timeout -s KILL 7s timeout -s INT 5s ...
Or just write the logic manually for killing the process if the SIGINT didn't get rid of it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think doing killing from python side and greping for expected output instead of >/dev/null is more accurate, but I'm fine with the current code.
|
||
@skipUnless(kernel_version_ge(4,4), "requires kernel >= 4.4") | ||
def test_bashreadline(self): | ||
self.run_with_int("bashreadline.py") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as the next step we still want to --dry-run internal flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if it's necessary. What we're doing here is almost identical: launch the tool, give it time to initialize, and then exit. It would be a lot of work adding this to all the tools, and I'm not sure what we'd gain.
tests/python/test_tools_smoke.py
Outdated
full_command = TOOLS_DIR + command | ||
signal = "KILL" if kill else "INT" | ||
rc = subprocess.call("timeout -s %s -k %ds %ds %s > /dev/null" % | ||
(signal, 5, timeout, full_command), shell=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't KILL and INT timeouts be different?
should INT timout be short like 1 second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 second is not enough because a lot of tools take a while to initialize, especially on a slow system. Sending a SIGINT while they are still initializing sort of defeats the purpose of the smoke test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm still missing how both timeouts can be 5 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean in this specific command? The -k timeout is the number of seconds after the first signal's timeout was breached, relative not absolute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh. that makes sense.
i think the assumption that every script should be ctrl-c-able in 5 seconds is fragile and --dry-run can make it precise and accurate, but I'm fine with current setup for now.
def test_argdist(self): | ||
self.run_with_duration("argdist.py -C 'p::SyS_open()' -n 1 -i 1") | ||
|
||
@skipUnless(kernel_version_ge(4,4), "requires kernel >= 4.4") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the kernel version check won't work for backported kernels, but should be ok for iovisor buildbots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, and it's needed there -- I've had to tune a lot of the tests for this. E.g. it turned out that mountsnoop uses bpf_get_current_task
which is only available on 4.8, and so on.
@skipUnless(kernel_version_ge(4,6), "requires kernel >= 4.6") | ||
def test_deadlock_detector(self): | ||
# TODO This tool requires a massive BPF stack traces table allocation, | ||
# which might fail the run or even trigger the oomkiller to kill some |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trigger oom just by running it? that's not expected. cc @kennyyu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. It's got some seriously massive hashes and stack trace tables. https://github.com/iovisor/bcc/blob/master/tools/deadlock_detector.c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@4ast I made it huge originally in order to be able trace very large binaries with a large number of threads. The sizes in those maps should be made configurable.
Is there an easy way to pass arguments to the bpf C code from the user space python program, besides inlining the C program into the python program? Currently, the deadlock_detector bpf code exists as a separate C file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BPF() object has an optional cflags=[]
argument, which you could pass -DMYDEFINE=xxx
to.
def test_softirqs(self): | ||
# TODO Temporary disabled as softirqs.py doesn't work on recent | ||
# kernels (can't find some of its attach targets). Need to revisit | ||
# it to use the softirq tracepoints. Tracked in bcc#1031. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at this comment it feels that this .py file will have a lot of churn.
Every new script and temporary workarounds to the broken scripts will be updating this file.
can it be split in multiple files somehow to make future PRs less conflicting with each other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't want a separate file for each tool, there are dozens of them :)
Also once this stabilizes I'd expect changes to this file only when a new tool is added. You update README.md, you also add a smoke test for your tool, done. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start with one file, we can always break it up later if it becomes annoying.
self.run_with_duration("wakeuptime.py 1") | ||
|
||
def test_xfsdist(self): | ||
# Doesn't work on build bot because xfs functions not present in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the VMs we have don't have xfs installed? that's odd. xfs is a default fs for most distros now. We should test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was only the Ubuntu bot. And I don't know if it has xfs or not, but this is the traceback:
Traceback (most recent call last):
28: File "../../tools/xfsdist.py", line 136, in <module>
28: b.attach_kprobe(event="xfs_file_read_iter", fn_name="trace_entry")
28: File "/home/iovisor/jenkins/workspace/bcc-pr/label/ubuntu1604/src/python/bcc/__init__.py", line 511, in attach_kprobe
28: raise Exception("Failed to attach BPF to kprobe")
So xfs_file_read_iter
doesn't exist there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW it is quite likely that some of the *fsdist, *fsslower tools can be rewritten using tracepoints. @brendangregg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is from one of the ubuntu machines:
$ sudo grep xfs_file_read_iter /proc/kallsyms
ffffffffc043fe70 t xfs_file_read_iter [xfs]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really need access to the bots to debug this then I'm afraid. I can't go on on much more than the log outputs ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@drzaeus77 Can that be arranged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please unicast me your ssh pubkey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, it's not there :)
root@ubuntu1604-slave-681:/home/iovisor/bcc/tests/python# grep xfs_file_read_iter /proc/kallsyms
root@ubuntu1604-slave-681:/home/iovisor/bcc/tests/python#
@4ast Oh and also note that contrary to your slightly pessimistic estimation 😄 , there was only one tool that was completely broken (softirqs) and one other tool that probably needs fixing (deadlock_detector). This is tongue-in-cheek of course and we should absolutely invest in more testing for the tools! This is at least a start. |
Thanks! Ignoring the "enum expression < 0 is always false" warnings (which should be fixed on Linux 4.11/12), I got:
because:
I'll file a ticket. I don't think it needs netaddr. And:
I don't have any python running... |
OK, so perhaps disable the solisten test for now. As for ucalls, not sure how you can not have Python running -- the test itself is a Python process. Or does it have a different comm? |
Okay, so I'm going to disable the xfs tests because I don't see the required functions on the Ubuntu buildbot. I am also going to take the uflow test out because the Python version on Ubuntu doesn't have USDT probes. And finally I'm going to switch ucalls to use -S for syscall counting instead of tracing methods, for the same reason. Hopefully this will pass on all bots. Ideally though, we'd have a newer kernel on these buildbots so that we could run more tests... @drzaeus77 |
All right; pushed these changes. If this builds, please don't merge yet -- I'd like to rebase and clean up the last handful of commits :) |
[buildbot, test this please] |
can you squash all of the commits? I don't see much value keeping all the intermediate steps especially when most of them don't pass. |
Yes, of course - once I get the whole thing to pass. |
Uh-oh: @drzaeus77 the fc25 build bot can't find git again .., |
[buildbot, test this please] |
This commit adds basic smoke tests for most tools in tools/ by running the tool with either a short duration, or interrupting it with a SIGINT after a short duration. The tests check the return value from the tool to detect any Python exceptions or other errors, but they do not read the standard error or standard output and parse the tool's result. Some tools are not covered by these smoke tests for reasons documented in the test itself: * btrfsdist and btrfsslower need btrfs * cachetop doesn't like to run without a terminal * dbslower, dbstat, and mysqld_qslower need a database engine * deadlock_detector allocates a huge amount of memory * softirqs doesn't work on new kernels and needs fixing (iovisor#1031) * ugc needs a USDT-enabled runtime with GC probes * zfsdist and zfsslower need zfs This is a good place to start, but clearly for some tools, especially those with a complex interface like trace and argdist, we need more than just basic smoke tests.
All righty -- cleaned up the history, if the build passes 🤞 let's get this merged. |
This commit adds basic smoke tests for most tools in tools/ by
running the tool with either a short duration, or interrupting it
with a SIGINT after a short duration. The tests check the return
value from the tool to detect any Python exceptions or other
errors, but they do not read the standard error or standard output
and parse the tool's result.
Some tools are not covered by these smoke tests for reasons
documented in the test itself:
This is a good place to start, but clearly for some tools,
especially those with a complex interface like trace and argdist,
we need more than just basic smoke tests.
Also included in this PR are minor fixes to some tools to enable
easy testing.
Resolves #981.