Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417

Open
derekbruening opened this issue Nov 7, 2023 · 8 comments
Open

Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417

derekbruening opened this issue Nov 7, 2023 · 8 comments
Assignees

Comments

@derekbruening
Copy link
Contributor

A run on GA CI suddently saw 81 failures. The virtualization changed, or something else was updated on the VM's? These all seem to be new flaky failures with maybe one seen before. Xref #5725.

https://github.com/DynamoRIO/dynamorio/actions/runs/6779945230/job/18427829003?pr=6415

debug-internal-32: 333 tests passed, **** 82 tests failed, of which 1 were flaky: ****
	code_api|common.floatpc_xl8all 
	code_api|linux.signal0000 
	code_api|linux.signal0001 
	code_api|linux.signal0010 
	code_api|linux.signal0011 
	code_api|linux.signal0100 
	code_api|linux.signal0101 
	code_api|linux.signal0110 
	code_api|linux.signal0111 
	code_api|linux.signalfd 
	code_api|common.decode-stress 
	code_api|linux.eintr-noinline 
	code_api|client.flush 
	code_api|client.strace 
	code_api|tool.drcachesim.phys-threads_SUDO 
	code_api|tool.drcov.eintr 
	code_api|linux.eintr 
	code_api|tool.drcachesim.delay-global 
	code_api|tool.drcachesim.threads 
	code_api|tool.drcachesim.threads-with-config-file 
	code_api|tool.drcachesim.coherence 
	code_api|api.static_sideline_FLAKY 

linux.signal*:

52: Generating SIGSEGV
52: <Application /home/runner/work/dynamorio/dynamorio/build_debug-internal-32/suite/tests/bin/linux.signal0000 (21169).  Internal Error: DynamoRIO debug check failure: /home/runner/work/dynamorio/dynamorio/core/unix/signal.c:6008 syscall_signal || safe_is_in_fcache(dcontext, pc, (byte *)sc->SC_XSP)
(Error occurred @2170 frags in tid 21169)

drcachesim.threads:

292:      Started iteration 1 of the computation...
292: <Application /home/runner/work/dynamorio/dynamorio/build_debug-internal-32/suite/tests/bin/client.annotation-concurrency (22375).  Internal Error: DynamoRIO debug check failure: /home/runner/work/dynamorio/dynamorio/core/vmareas.c:925 start < end
292: (Error occurred @3233 frags in tid 22375)

This may be #6152

client.flush:

182: count = 402
182: ERROR removing 0xf7b93549
182: ERROR: 0xf7b9354e undeleted
182: 1 undeleted fragments
182: constructed BB 5-15 times
180/415 Test #182: code_api|client.flush ............................................***Failed  Required regular expression not found. 

linux.eintr:

90: <Application /home/runner/work/dynamorio/dynamorio/build_debug-internal-32/suite/tests/bin/linux.eintr (21322).  Internal Error: DynamoRIO debug check failure: /home/runner/work/dynamorio/dynamorio/core/heap.c:1966 IF_WINDOWS(doing_detach ||) vmh->num_free_blocks == vmh->num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() >= WINDOWS_VERSION_8_1)) && vmh->num_free_blocks 
@derekbruening
Copy link
Contributor Author

Happened again in a different PR so it seems to occur on a certain VM type or something: https://github.com/DynamoRIO/dynamorio/actions/runs/6780067178/job/18428137845?pr=6414

@derekbruening
Copy link
Contributor Author

Xref #6416 as another sudden flaky failure

@derekbruening
Copy link
Contributor Author

#6411 and #2267 also seem to be increasing in frequency

@derekbruening
Copy link
Contributor Author

I did a quick check on an AMD machine and I think these are related to AMD as well, like #6411 and #2267.

For the signal.* failures on SIGSEGV with the assert: it seems to be that the SIGSEGV generated by SYS_kill is reported at the vsyscall PC and so the assert's check for it being in the fcache fails.

@ksco
Copy link
Contributor

ksco commented Nov 17, 2023

I did a quick check on an AMD machine and I think these are related to AMD as well, like #6411 and #2267.

For the signal.* failures on SIGSEGV with the assert: it seems to be that the SIGSEGV generated by SYS_kill is reported at the vsyscall PC and so the assert's check for it being in the fcache fails.

I know little about x86, but the vdso manpage said the sigreturn trampoline for i386 is in vDSO, which is the same as AArch64 / RISC-V.

derekbruening pushed a commit that referenced this issue Nov 20, 2023
Handle AMD 32-bit syscall instruction.

The  fix is to check for AMD 32-bit SYSCALL in is_at_do_syscall().

Here's more details:

CI x86-32 signalNNNN tests started failing recently. After checking the
log, the failures happen on AMD 32-bit system.
Based on the debug logs 

AMD:

  0xf7f90583  89 cd                mov    %ecx -> %ebp
  0xf7f90585  0f 05                syscall  -> %ecx
interp: syscall @ 0xf7f90585
instr_get_opcode(instr): 95
change_prot(0xf7f90000, 0x2000, rwx) => mprotect(0xf7f90000, 0x2000,
7)==2 pages
change_prot(0xf7f90000, 0x2000, r-x) => mprotect(0xf7f90000, 0x2000,
5)==2 pages
set_syscall_method to 3make_writable: pc 0x441fc000 ->
0x441fc000-0x441fe000 0
Just updated syscall routine:      
  0x441fd240  0f 05                syscall  -> %ecx
  0x441fd242  a3 5c 29 18 44       mov    %eax -> 0x4418295c[4byte]

whereas Intel uses sysenter, and set the syscall_method to
SYSCALL_METHOD_SYSENTER:

  0xf7f71583  89 e5                mov    %esp -> %ebp
  0xf7f71585  0f 34                sysenter  -> %esp
interp: syscall @ 0xf7f71585
change_prot(0xf7f71000, 0x2000, rwx) => mprotect(0xf7f71000, 0x2000,
7)==2 pages
change_prot(0xf7f71000, 0x2000, r-x) => mprotect(0xf7f71000, 0x2000,
5)==2 pages
set_syscall_method to 2make_writable: pc 0x4845a000 ->
0x4845a000-0x4845c000 0
Just updated syscall routine:
  0x4845b240  0f 34                sysenter  -> %esp
  0x4845b242  a3 5c 09 3e 48       mov    %eax -> 0x483e095c[4byte]


Issue: #6417
@ivankyluk
Copy link
Contributor

For the "Internal Error: DynamoRIO debug check failure: dynamorio/core/vmareas.c:925 start < end" error, it was caused by the instrumented code not restoring ecx in AMD case:

+283 m4 @0x4de2eef0 64 a1 00 00 00 00 mov %fs:0x00[4byte] -> %eax
+289 m4 @0x4de30758
+289 m4 @0x4deaeab8
+289 m4 @0x4de2dbe0
+289 m4 @0x4de285c4
+289 m4 @0x4deb4c6c
+289 m4 @0x4deb211c
+289 m4 @0x4ddd6530 64 a1 94 00 00 00 mov %fs:0x00000094[4byte] -> %eax
+295 L3 @0x4deb5884 0f 05 syscall -> %ecx
END 0xf7f20580

vs the sysenter case for Intel:

+276 m4 @0x49dfc5d8 64 a1 00 00 00 00 mov %fs:0x00[4byte] -> %eax
+282 m4 @0x49dea07c
+282 m4 @0x49df2418
+282 m4 @0x49df6ce4
+282 m4 @0x49df1248
+282 m4 @0x49d727e0
+282 m4 @0x49dfa750
+282 m4 @0x49d76874 64 a1 94 00 00 00 mov %fs:0x00000094[4byte] -> %eax
+288 m4 @0x49d22e60 64 8b 0d 90 00 00 00 mov %fs:0x00000090[4byte] -> %ecx
+295 L3 @0x49df753c 0f 34 sysenter -> %esp
END 0xf7f4b580

The related debug message from drreg_reserve_reg_internal:

drreg_reserve_reg_internal @3.0xf7f2058e: no need to spill ecx to slot 0

@derekbruening
Copy link
Contributor Author

These failures include a lot of timeouts, which make this job take ~22 minutes now:

$ grep '120.0. sec' x86-32/7_Run\ Suite.txt 
2024-01-25T23:29:47.8688605Z       Test #290: code_api|tool.drcachesim.filter-no-d ......................................................***Timeout 120.00 sec
2024-01-25T23:29:47.8702222Z       Test #291: code_api|tool.drcachesim.filter-d .........................................................***Timeout 120.00 sec
2024-01-25T23:31:47.8528142Z       Test #290: code_api|tool.drcachesim.filter-no-d ......................................................***Timeout 120.01 sec
2024-01-25T23:31:47.8541097Z       Test #291: code_api|tool.drcachesim.filter-d .........................................................***Timeout 120.00 sec
2024-01-25T23:33:27.5201374Z       Test #309: code_api|tool.drcachesim.scattergather-x86 ................................................***Timeout 120.00 sec
2024-01-25T23:35:27.1030157Z       Test #309: code_api|tool.drcachesim.scattergather-x86 ................................................***Timeout 120.01 sec

$ grep '180.0. sec' x86-32/7_Run\ Suite.txt 
2024-01-25T23:33:15.5197961Z       Test #308: code_api|tool.drcachesim.invariants .......................................................***Timeout 180.00 sec
2024-01-25T23:36:16.0945937Z       Test #308: code_api|tool.drcachesim.invariants .......................................................***Timeout 180.00 sec

derekbruening added a commit that referenced this issue Jan 26, 2024
…ixed

We have lived with our x86-32 GA CI job being red from massive test
failures from the AMD switch for long enough.  The fix here may be too
drastic, but it's simple: we scale back to the tiny set of tests run
on ubuntu22.  I considered listing the failures on the ignore list,
but with timeouts included this job was just taking too long.

The plan is to have another Fixit and try to fix enough AMD failures
we can re-enable the full job.

Issue: #6417
derekbruening added a commit that referenced this issue Jan 26, 2024
…6590)

We have lived with our x86-32 GA CI job being red from massive test
failures from the AMD switch for long enough. The fix here may be too
drastic, but it's simple: we scale back to the tiny set of tests run on
ubuntu22. I considered listing the failures on the ignore list, but with
timeouts included this job was just taking too long.

The plan is to have another Fixit and try to fix enough AMD failures we
can re-enable the full job.

Issue: #6417
ivankyluk added a commit that referenced this issue Mar 29, 2024
Create a denylist for CI x86-32 tests, and move failing tests to it.

The change re-enables passing tests removed by  #6590.

Issue: #6417
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants