add tool: compactsnoop #2582
ae5739a to 1eb9d86
@yonghong-song @brendangregg Sorry to bother you. I ran into a memory compaction issue that caused high sys CPU and latency for two of our customers, so I developed this tool to diagnose the problem (even though their kernel doesn't support bpf; there I diagnosed it with sar -B + /proc/vmstat's compact_* counters + an on-CPU flame graph), and I hope this tool can help others. Would you please take a look at this when you have time? Also, I don't know how to alleviate memory fragmentation once it has already caused a lot of compaction. Would you please give me some suggestions? Thanks a lot. |
Hi @ethercflow, I will take a detailed look in the next couple of days. |
[buildbot, test this please] |
[buildbot, ok to test] |
@ethercflow For your below note
bpf_get_stackid_raw_tp is a kernel internal function. In bpf programs, the Could you try to use raw tracepoint implementation only? Looks like |
Thank you @yonghong-song, I'll use it with raw_tp.
Sure, I'll remove kprobe version. |
Looks like a good tool. Can you please make the default output 80 chars wide (fits better in books, articles, tickets, etc)? It's currently 81, so you need to trim a column by 1 character; I'd trim the LAT(ms) column, and reduce the number of decimal places to 2 or 3 (it doesn't need more). Does the example output make sense? Was zsh really taking 113 millisecond hits due to zone compaction? Just sanity checking. |
Thank you @brendangregg, I'll remove the kprobe version and keep the output 80 chars wide (sorry about the output width).
Yes, I tested this on ubuntu 18.04 with its latest 5.0 kernel. (5.0.0-34-generic) I verified with
It seems to match the compactsnoop tool output and, unbelievably, most of the delay falls in the 64-127 ms range. Below is the fragmentation info
|
Ok, thanks; I'm happy to include the tool (once it is updated). |
1eb9d86 to 4183526
@yonghong-song @brendangregg update, PTAL. |
maybe you also want to add an entry in snapcraft/snapcraft.yaml file. |
4183526 to 15fbd7c
The change looks good, although, probably because there is not much compaction going on on my host, I only got a stack like this,
@brendangregg any further comments on the tool? |
Hi, @brendangregg any further comments on the tool? |
@brendangregg Earlier you mentioned this is a good tool. Could you check again whether it has addressed all the issues and is good enough to be included in bcc/tools directory? |
Looks good, thanks! |
compactsnoop traces compact zone events system-wide and prints various details.
Example output:
While tracing, processes allocated pages while memory fragmentation was too
severe to satisfy contiguous memory requirements, so compact zone events
occurred, which increased the waiting delay of those processes.
compactsnoop can be useful when compact_stall (in /proc/vmstat) continues to
increase, for discovering whether it is caused by some critical processes.
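To check whether compact_stall is in fact still increasing before reaching for
the tool, the compact_* counters in /proc/vmstat can be sampled twice and
compared. The following is only a minimal sketch and is not part of
compactsnoop: the helper names read_compact_counters and counter_deltas are
hypothetical, and it assumes the usual "name value" line layout of
/proc/vmstat.

```python
def read_compact_counters(vmstat_text):
    """Return the compact_* counters found in /proc/vmstat-style text.

    Assumes each line has the form "<name> <integer value>".
    """
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("compact_"):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before.get(name, 0) for name in after}


def sample(path="/proc/vmstat"):
    """Read one sample of the compact_* counters from the given file."""
    with open(path) as f:
        return read_compact_counters(f.read())
```

Calling sample() twice a few seconds apart and passing both results to
counter_deltas() gives the growth per counter; a compact_stall delta that
stays above zero suggests processes are still stalling on compaction.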
With extended fields enabled, compactsnoop also shows the fragmentation index for each node's zone at each order, and the zone's watermark:
The -e option prints out extra columns
The -K option prints out kernel stack