Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417
Comments
Happened again in a different PR, so it seems to occur on a certain VM type or something: https://github.com/DynamoRIO/dynamorio/actions/runs/6780067178/job/18428137845?pr=6414
Xref #6416 as another sudden flaky failure.
I did a quick check on an AMD machine and I think these are related to AMD as well, like #6411 and #2267. For the signal.* failures on SIGSEGV with the assert: the SIGSEGV generated by SYS_kill seems to be reported at the vsyscall PC, so the assert's check for it being in the fcache fails.
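As a rough illustration of the kind of signal discussed above, the following sketch sends SIGSEGV to the current process with kill() and inspects the siginfo. On Linux, a signal sent this way carries si_code SI_USER rather than a fault code such as SEGV_MAPERR, so the interrupted PC is wherever the kill system call returned, not a faulting instruction. The function name kill_sigsegv_is_user_sent is hypothetical and exists only for this example; none of this is DynamoRIO code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Written by the handler, read by the caller. */
static volatile sig_atomic_t seen_code;
static volatile sig_atomic_t handled;

static void
on_segv(int sig, siginfo_t *info, void *ucxt)
{
    (void)sig;
    (void)ucxt;
    seen_code = info->si_code; /* SI_USER when sent via kill() */
    handled = 1;
}

/* Hypothetical helper for this sketch: returns 1 if a SIGSEGV sent with
 * kill() is reported as user-sent, 0 if not, -1 on setup failure or if
 * the handler did not run before kill() returned. */
static int
kill_sigsegv_is_user_sent(void)
{
    struct sigaction act, old;
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = on_segv;
    act.sa_flags = SA_SIGINFO;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGSEGV, &act, &old) != 0)
        return -1;
    handled = 0;
    if (kill(getpid(), SIGSEGV) != 0) {
        sigaction(SIGSEGV, &old, NULL);
        return -1;
    }
    /* Restore the previous disposition before reporting. */
    sigaction(SIGSEGV, &old, NULL);
    if (!handled)
        return -1;
    return seen_code == SI_USER;
}
```

A tool that asserts every SIGSEGV arrives at a PC inside its own code cache would need to treat this user-sent case separately, which is roughly the situation the comment above describes.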
I know little about x86, but the vdso manpage says the sigreturn trampoline for i386 is in the vDSO, which is the same as AArch64 / RISC-V.
Handle AMD 32-bit syscall instruction.

The fix is to check for AMD 32-bit SYSCALL in is_at_do_syscall(). Here are more details:

CI x86-32 signalNNNN tests started failing recently. After checking the log, the failures happen on AMD 32-bit systems. Based on the debug logs, AMD:

    0xf7f90583 89 cd mov %ecx -> %ebp
    0xf7f90585 0f 05 syscall -> %ecx
    interp: syscall @ 0xf7f90585
    instr_get_opcode(instr): 95
    change_prot(0xf7f90000, 0x2000, rwx) => mprotect(0xf7f90000, 0x2000, 7)==2 pages
    change_prot(0xf7f90000, 0x2000, r-x) => mprotect(0xf7f90000, 0x2000, 5)==2 pages
    set_syscall_method to 3make_writable: pc 0x441fc000 -> 0x441fc000-0x441fe000 0
    Just updated syscall routine:
    0x441fd240 0f 05 syscall -> %ecx
    0x441fd242 a3 5c 29 18 44 mov %eax -> 0x4418295c[4byte]

whereas Intel uses sysenter and sets the syscall_method to SYSCALL_METHOD_SYSENTER:

    0xf7f71583 89 e5 mov %esp -> %ebp
    0xf7f71585 0f 34 sysenter -> %esp
    interp: syscall @ 0xf7f71585
    change_prot(0xf7f71000, 0x2000, rwx) => mprotect(0xf7f71000, 0x2000, 7)==2 pages
    change_prot(0xf7f71000, 0x2000, r-x) => mprotect(0xf7f71000, 0x2000, 5)==2 pages
    set_syscall_method to 2make_writable: pc 0x4845a000 -> 0x4845a000-0x4845c000 0
    Just updated syscall routine:
    0x4845b240 0f 34 sysenter -> %esp
    0x4845b242 a3 5c 09 3e 48 mov %eax -> 0x483e095c[4byte]

Issue: #6417
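The difference between the two logs comes down to the two opcode bytes at the vsyscall entry: 0f 05 (syscall) on AMD versus 0f 34 (sysenter) on Intel. The following is a minimal sketch of a byte-level check that accepts both, plus the legacy int $0x80 encoding (cd 80). The names gate_kind_t, classify_gate, and is_at_gate are hypothetical and chosen for this example; this is not the actual is_at_do_syscall() implementation or DynamoRIO's own enum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for this sketch; not DynamoRIO's syscall-method enum. */
typedef enum {
    GATE_UNKNOWN = 0,
    GATE_INT80,    /* cd 80: int $0x80 */
    GATE_SYSENTER, /* 0f 34: sysenter, as in the Intel log */
    GATE_SYSCALL   /* 0f 05: syscall, as in the AMD log */
} gate_kind_t;

/* Classify the two opcode bytes at pc; len is the number of readable bytes. */
static gate_kind_t
classify_gate(const uint8_t *pc, size_t len)
{
    if (pc == NULL || len < 2)
        return GATE_UNKNOWN;
    if (pc[0] == 0x0f && pc[1] == 0x05)
        return GATE_SYSCALL;
    if (pc[0] == 0x0f && pc[1] == 0x34)
        return GATE_SYSENTER;
    if (pc[0] == 0xcd && pc[1] == 0x80)
        return GATE_INT80;
    return GATE_UNKNOWN;
}

/* Returns nonzero if pc points at any recognized system call instruction.
 * A check that only recognized sysenter would return 0 for the AMD bytes. */
static int
is_at_gate(const uint8_t *pc, size_t len)
{
    return classify_gate(pc, len) != GATE_UNKNOWN;
}
```

For example, calling classify_gate on the bytes 0f 05 yields GATE_SYSCALL, while the preceding mov (89 cd) yields GATE_UNKNOWN.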
For the "Internal Error: DynamoRIO debug check failure: dynamorio/core/vmareas.c:925 start < end" error, it was caused by the instrumented code not restoring ecx in the AMD case:

    +283 m4 @0x4de2eef0 64 a1 00 00 00 00 mov %fs:0x00[4byte] -> %eax

vs the sysenter case for Intel:

    +276 m4 @0x49dfc5d8 64 a1 00 00 00 00 mov %fs:0x00[4byte] -> %eax

The related debug message from drreg_reserve_reg_internal:

    drreg_reserve_reg_internal @3.0xf7f2058e: no need to spill ecx to slot 0
These failures include a lot of timeouts, which make this job take ~22 minutes now.
…ixed We have lived with our x86-32 GA CI job being red from massive test failures from the AMD switch for long enough. The fix here may be too drastic, but it's simple: we scale back to the tiny set of tests run on ubuntu22. I considered listing the failures on the ignore list, but with timeouts included this job was just taking too long. The plan is to have another Fixit and try to fix enough AMD failures that we can re-enable the full job. Issue: #6417
A run on GA CI suddenly saw 81 failures. Did the virtualization change, or was something else updated on the VMs? These all seem to be new flaky failures, with maybe one seen before. Xref #5725.
https://github.com/DynamoRIO/dynamorio/actions/runs/6779945230/job/18427829003?pr=6415
linux.signal*:
drcachesim.threads:
This may be #6152
client.flush:
linux.eintr: