Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smoke tests for the tools #1032

Merged
merged 5 commits into from
Mar 11, 2017
Merged

Smoke tests for the tools #1032

merged 5 commits into from
Mar 11, 2017

Conversation

goldshtn
Copy link
Collaborator

@goldshtn goldshtn commented Mar 7, 2017

This commit adds basic smoke tests for most tools in tools/ by
running the tool with either a short duration, or interrupting it
with a SIGINT after a short duration. The tests check the return
value from the tool to detect any Python exceptions or other
errors, but they do not read the standard error or standard output
and parse the tool's result.

Some tools are not covered by these smoke tests for reasons
documented in the test itself:

  • btrfsdist and btrfsslower need btrfs
  • cachetop doesn't like to run without a terminal
  • dbslower, dbstat, and mysqld_qslower need a database engine
  • deadlock_detector allocates a huge amount of memory
  • softirqs doesn't work on new kernels and needs fixing (softirqs is broken on newer kernels #1031)
  • ugc needs a USDT-enabled runtime with GC probes
  • xfsdist and xfsslower don't work on the built bots because of missing probe functions
  • zfsdist and zfsslower need zfs

This is a good place to start, but clearly for some tools,
especially those with a complex interface like trace and argdist,
we need more than just basic smoke tests.

Also included in this PR are minor fixes to some tools to enable
easy testing.

Resolves #981.

@goldshtn
Copy link
Collaborator Author

goldshtn commented Mar 7, 2017

I'm off to bed now; if this builds, I'll rebase tomorrow to get rid of the horrible commit mess and then it will be ready for merging. In the meantime, @brendangregg @4ast @drzaeus77 can you take a look?

Copy link
Member

@4ast 4ast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! thank you for working on it!

def run_with_int(self, command, timeout=5, allow_early=False, kill=False):
full_command = TOOLS_DIR + command
signal = "KILL" if kill else "INT"
rc = subprocess.call("timeout -s %s -k %ds %ds %s > /dev/null" %
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout on centos6 doesn't have -k flag :(

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Do we care about CentOS 6? If we do, I could nest the timeouts:

timeout -s KILL 7s timeout -s INT 5s ...

Or just write the logic manually for killing the process if the SIGINT didn't get rid of it. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think doing killing from python side and greping for expected output instead of >/dev/null is more accurate, but I'm fine with the current code.


@skipUnless(kernel_version_ge(4,4), "requires kernel >= 4.4")
def test_bashreadline(self):
self.run_with_int("bashreadline.py")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as the next step we still want to --dry-run internal flag?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it's necessary. What we're doing here is almost identical: launch the tool, give it time to initialize, and then exit. It would be a lot of work adding this to all the tools, and I'm not sure what we'd gain.

full_command = TOOLS_DIR + command
signal = "KILL" if kill else "INT"
rc = subprocess.call("timeout -s %s -k %ds %ds %s > /dev/null" %
(signal, 5, timeout, full_command), shell=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't KILL and INT timeouts be different?
should INT timout be short like 1 second?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 second is not enough because a lot of tools take a while to initialize, especially on a slow system. Sending a SIGINT while they are still initializing sort of defeats the purpose of the smoke test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm still missing how both timeouts can be 5 seconds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean in this specific command? The -k timeout is the number of seconds after the first signal's timeout was breached, relative not absolute.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh. that makes sense.
i think the assumption that every script should be ctrl-c-able in 5 seconds is fragile and --dry-run can make it precise and accurate, but I'm fine with current setup for now.

def test_argdist(self):
self.run_with_duration("argdist.py -C 'p::SyS_open()' -n 1 -i 1")

@skipUnless(kernel_version_ge(4,4), "requires kernel >= 4.4")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the kernel version check won't work for backported kernels, but should be ok for iovisor buildbots.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and it's needed there -- I've had to tune a lot of the tests for this. E.g. it turned out that mountsnoop uses bpf_get_current_task which is only available on 4.8, and so on.

@skipUnless(kernel_version_ge(4,6), "requires kernel >= 4.6")
def test_deadlock_detector(self):
# TODO This tool requires a massive BPF stack traces table allocation,
# which might fail the run or even trigger the oomkiller to kill some
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trigger oom just by running it? that's not expected. cc @kennyyu

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. It's got some seriously massive hashes and stack trace tables. https://github.com/iovisor/bcc/blob/master/tools/deadlock_detector.c

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@4ast I made it huge originally in order to be able trace very large binaries with a large number of threads. The sizes in those maps should be made configurable.

Is there an easy way to pass arguments to the bpf C code from the user space python program, besides inlining the C program into the python program? Currently, the deadlock_detector bpf code exists as a separate C file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BPF() object has an optional cflags=[] argument, which you could pass -DMYDEFINE=xxx to.

def test_softirqs(self):
# TODO Temporary disabled as softirqs.py doesn't work on recent
# kernels (can't find some of its attach targets). Need to revisit
# it to use the softirq tracepoints. Tracked in bcc#1031.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at this comment it feels that this .py file will have a lot of churn.
Every new script and temporary workarounds to the broken scripts will be updating this file.
can it be split in multiple files somehow to make future PRs less conflicting with each other?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't want a separate file for each tool, there are dozens of them :)
Also once this stabilizes I'd expect changes to this file only when a new tool is added. You update README.md, you also add a smoke test for your tool, done. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's start with one file, we can always break it up later if it becomes annoying.

self.run_with_duration("wakeuptime.py 1")

def test_xfsdist(self):
# Doesn't work on build bot because xfs functions not present in the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the VMs we have don't have xfs installed? that's odd. xfs is a default fs for most distros now. We should test it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was only the Ubuntu bot. And I don't know if it has xfs or not, but this is the traceback:

Traceback (most recent call last):
28:   File "../../tools/xfsdist.py", line 136, in <module>
28:     b.attach_kprobe(event="xfs_file_read_iter", fn_name="trace_entry")
28:   File "/home/iovisor/jenkins/workspace/bcc-pr/label/ubuntu1604/src/python/bcc/__init__.py", line 511, in attach_kprobe
28:     raise Exception("Failed to attach BPF to kprobe")

So xfs_file_read_iter doesn't exist there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW it is quite likely that some of the *fsdist, *fsslower tools can be rewritten using tracepoints. @brendangregg

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is from one of the ubuntu machines:

$ sudo grep xfs_file_read_iter /proc/kallsyms
ffffffffc043fe70 t xfs_file_read_iter   [xfs]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really need access to the bots to debug this then I'm afraid. I can't go on on much more than the log outputs ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drzaeus77 Can that be arranged?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please unicast me your ssh pubkey.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, it's not there :)

root@ubuntu1604-slave-681:/home/iovisor/bcc/tests/python# grep xfs_file_read_iter /proc/kallsyms
root@ubuntu1604-slave-681:/home/iovisor/bcc/tests/python# 

@drzaeus77

@goldshtn
Copy link
Collaborator Author

goldshtn commented Mar 8, 2017

@4ast Oh and also note that contrary to your slightly pessimistic estimation 😄 , there was only one tool that was completely broken (softirqs) and one other tool that probably needs fixing (deadlock_detector).

This is tongue-in-cheek of course and we should absolutely invest in more testing for the tools! This is at least a start.

@brendangregg
Copy link
Member

Thanks!

Ignoring the "enum expression < 0 is always false" warnings (which should be fixed on Linux 4.11/12), I got:

======================================================================
FAIL: test_solisten (__main__.SmokeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./test_tools_smoke.py", line 246, in test_solisten
    self.run_with_int("solisten.py")
  File "./test_tools_smoke.py", line 50, in run_with_int
    or (rc == 137 and kill), "rc was %d" % rc)
AssertionError: rc was 1

======================================================================
FAIL: test_ucalls (__main__.SmokeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./test_tools_smoke.py", line 312, in test_ucalls
    self.run_with_int("ucalls.py -S $(pgrep -n python)")
  File "./test_tools_smoke.py", line 50, in run_with_int
    or (rc == 137 and kill), "rc was %d" % rc)
AssertionError: rc was 137

----------------------------------------------------------------------

because:

# ./solisten.py 
Traceback (most recent call last):
  File "./solisten.py", line 22, in <module>
    import netaddr
ImportError: No module named netaddr

I'll file a ticket. I don't think it needs netaddr.

And:

# grep ucalls test_tools_smoke.py 
    def test_ucalls(self):
        self.run_with_int("ucalls.py -S $(pgrep -n python)")

I don't have any python running...

@goldshtn
Copy link
Collaborator Author

goldshtn commented Mar 9, 2017

OK, so perhaps disable the solisten test for now. As for ucalls, not sure how you can not have Python running -- the test itself is a Python process. Or does it have a different comm?

@goldshtn
Copy link
Collaborator Author

goldshtn commented Mar 10, 2017

Okay, so I'm going to disable the xfs tests because I don't see the required functions on the Ubuntu buildbot. I am also going to take the uflow test out because the Python version on Ubuntu doesn't have USDT probes. And finally I'm going to switch ucalls to use -S for syscall counting instead of tracing methods, for the same reason. Hopefully this will pass on all bots. Ideally though, we'd have a newer kernel on these buildbots so that we could run more tests... @drzaeus77

@goldshtn
Copy link
Collaborator Author

All right; pushed these changes. If this builds, please don't merge yet -- I'd like to rebase and clean up the last handful of commits :)

@goldshtn
Copy link
Collaborator Author

[buildbot, test this please]

@4ast
Copy link
Member

4ast commented Mar 11, 2017

can you squash all of the commits? I don't see much value keeping all the intermediate steps especially when most of them don't pass.

@goldshtn
Copy link
Collaborator Author

Yes, of course - once I get the whole thing to pass.

@goldshtn
Copy link
Collaborator Author

Uh-oh: @drzaeus77 the fc25 build bot can't find git again ..,

@drzaeus77
Copy link
Collaborator

[buildbot, test this please]

This commit adds basic smoke tests for most tools in tools/ by
running the tool with either a short duration, or interrupting it
with a SIGINT after a short duration. The tests check the return
value from the tool to detect any Python exceptions or other
errors, but they do not read the standard error or standard output
and parse the tool's result.

Some tools are not covered by these smoke tests for reasons
documented in the test itself:

* btrfsdist and btrfsslower need btrfs
* cachetop doesn't like to run without a terminal
* dbslower, dbstat, and mysqld_qslower need a database engine
* deadlock_detector allocates a huge amount of memory
* softirqs doesn't work on new kernels and needs fixing (iovisor#1031)
* ugc needs a USDT-enabled runtime with GC probes
* zfsdist and zfsslower need zfs

This is a good place to start, but clearly for some tools,
especially those with a complex interface like trace and argdist,
we need more than just basic smoke tests.
@goldshtn
Copy link
Collaborator Author

All righty -- cleaned up the history, if the build passes 🤞 let's get this merged.

@drzaeus77 drzaeus77 merged commit dd3867d into iovisor:master Mar 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatic smoke tests for the tools
5 participants