Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417

derekbruening · 2023-11-07T05:40:13Z

A run on GA CI suddently saw 81 failures. The virtualization changed, or something else was updated on the VM's? These all seem to be new flaky failures with maybe one seen before. Xref #5725.

https://github.com/DynamoRIO/dynamorio/actions/runs/6779945230/job/18427829003?pr=6415

debug-internal-32: 333 tests passed, **** 82 tests failed, of which 1 were flaky: ****
	code_api|common.floatpc_xl8all 
	code_api|linux.signal0000 
	code_api|linux.signal0001 
	code_api|linux.signal0010 
	code_api|linux.signal0011 
	code_api|linux.signal0100 
	code_api|linux.signal0101 
	code_api|linux.signal0110 
	code_api|linux.signal0111 
	code_api|linux.signalfd 
	code_api|common.decode-stress 
	code_api|linux.eintr-noinline 
	code_api|client.flush 
	code_api|client.strace 
	code_api|tool.drcachesim.phys-threads_SUDO 
	code_api|tool.drcov.eintr 
	code_api|linux.eintr 
	code_api|tool.drcachesim.delay-global 
	code_api|tool.drcachesim.threads 
	code_api|tool.drcachesim.threads-with-config-file 
	code_api|tool.drcachesim.coherence 
	code_api|api.static_sideline_FLAKY

linux.signal*:

52: Generating SIGSEGV
52: <Application /home/runner/work/dynamorio/dynamorio/build_debug-internal-32/suite/tests/bin/linux.signal0000 (21169).  Internal Error: DynamoRIO debug check failure: /home/runner/work/dynamorio/dynamorio/core/unix/signal.c:6008 syscall_signal || safe_is_in_fcache(dcontext, pc, (byte *)sc->SC_XSP)
(Error occurred @2170 frags in tid 21169)

drcachesim.threads:

292:      Started iteration 1 of the computation...
292: <Application /home/runner/work/dynamorio/dynamorio/build_debug-internal-32/suite/tests/bin/client.annotation-concurrency (22375).  Internal Error: DynamoRIO debug check failure: /home/runner/work/dynamorio/dynamorio/core/vmareas.c:925 start < end
292: (Error occurred @3233 frags in tid 22375)

This may be #6152

client.flush:

182: count = 402
182: ERROR removing 0xf7b93549
182: ERROR: 0xf7b9354e undeleted
182: 1 undeleted fragments
182: constructed BB 5-15 times
180/415 Test #182: code_api|client.flush ............................................***Failed  Required regular expression not found.

linux.eintr:

90: <Application /home/runner/work/dynamorio/dynamorio/build_debug-internal-32/suite/tests/bin/linux.eintr (21322).  Internal Error: DynamoRIO debug check failure: /home/runner/work/dynamorio/dynamorio/core/heap.c:1966 IF_WINDOWS(doing_detach ||) vmh->num_free_blocks == vmh->num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() >= WINDOWS_VERSION_8_1)) && vmh->num_free_blocks

The text was updated successfully, but these errors were encountered:

derekbruening · 2023-11-07T05:43:45Z

Happened again in a different PR so it seems to occur on a certain VM type or something: https://github.com/DynamoRIO/dynamorio/actions/runs/6780067178/job/18428137845?pr=6414

derekbruening · 2023-11-08T17:14:38Z

Xref #6416 as another sudden flaky failure

derekbruening · 2023-11-08T17:15:25Z

#6411 and #2267 also seem to be increasing in frequency

derekbruening · 2023-11-13T15:35:30Z

I did a quick check on an AMD machine and I think these are related to AMD as well, like #6411 and #2267.

For the signal.* failures on SIGSEGV with the assert: it seems to be that the SIGSEGV generated by SYS_kill is reported at the vsyscall PC and so the assert's check for it being in the fcache fails.

ksco · 2023-11-17T19:26:09Z

I did a quick check on an AMD machine and I think these are related to AMD as well, like #6411 and #2267.

For the signal.* failures on SIGSEGV with the assert: it seems to be that the SIGSEGV generated by SYS_kill is reported at the vsyscall PC and so the assert's check for it being in the fcache fails.

I know little about x86, but the vdso manpage said the sigreturn trampoline for i386 is in vDSO, which is the same as AArch64 / RISC-V.

Handle AMD 32-bit syscall instruction. The fix is to check for AMD 32-bit SYSCALL in is_at_do_syscall(). Here's more details: CI x86-32 signalNNNN tests started failing recently. After checking the log, the failures happen on AMD 32-bit system. Based on the debug logs AMD: 0xf7f90583 89 cd mov %ecx -> %ebp 0xf7f90585 0f 05 syscall -> %ecx interp: syscall @ 0xf7f90585 instr_get_opcode(instr): 95 change_prot(0xf7f90000, 0x2000, rwx) => mprotect(0xf7f90000, 0x2000, 7)==2 pages change_prot(0xf7f90000, 0x2000, r-x) => mprotect(0xf7f90000, 0x2000, 5)==2 pages set_syscall_method to 3make_writable: pc 0x441fc000 -> 0x441fc000-0x441fe000 0 Just updated syscall routine: 0x441fd240 0f 05 syscall -> %ecx 0x441fd242 a3 5c 29 18 44 mov %eax -> 0x4418295c[4byte] whereas Intel uses sysenter, and set the syscall_method to SYSCALL_METHOD_SYSENTER: 0xf7f71583 89 e5 mov %esp -> %ebp 0xf7f71585 0f 34 sysenter -> %esp interp: syscall @ 0xf7f71585 change_prot(0xf7f71000, 0x2000, rwx) => mprotect(0xf7f71000, 0x2000, 7)==2 pages change_prot(0xf7f71000, 0x2000, r-x) => mprotect(0xf7f71000, 0x2000, 5)==2 pages set_syscall_method to 2make_writable: pc 0x4845a000 -> 0x4845a000-0x4845c000 0 Just updated syscall routine: 0x4845b240 0f 34 sysenter -> %esp 0x4845b242 a3 5c 09 3e 48 mov %eax -> 0x483e095c[4byte] Issue: #6417

ivankyluk · 2023-11-21T03:55:49Z

For the "Internal Error: DynamoRIO debug check failure: dynamorio/core/vmareas.c:925 start < end" error, it was caused by the instrumented code not restoring ecx in AMD case:

+283 m4 @0x4de2eef0 64 a1 00 00 00 00 mov %fs:0x00[4byte] -> %eax
+289 m4 @0x4de30758
+289 m4 @0x4deaeab8
+289 m4 @0x4de2dbe0
+289 m4 @0x4de285c4
+289 m4 @0x4deb4c6c
+289 m4 @0x4deb211c
+289 m4 @0x4ddd6530 64 a1 94 00 00 00 mov %fs:0x00000094[4byte] -> %eax
+295 L3 @0x4deb5884 0f 05 syscall -> %ecx
END 0xf7f20580

vs the sysenter case for Intel:

+276 m4 @0x49dfc5d8 64 a1 00 00 00 00 mov %fs:0x00[4byte] -> %eax
+282 m4 @0x49dea07c
+282 m4 @0x49df2418
+282 m4 @0x49df6ce4
+282 m4 @0x49df1248
+282 m4 @0x49d727e0
+282 m4 @0x49dfa750
+282 m4 @0x49d76874 64 a1 94 00 00 00 mov %fs:0x00000094[4byte] -> %eax
+288 m4 @0x49d22e60 64 8b 0d 90 00 00 00 mov %fs:0x00000090[4byte] -> %ecx
+295 L3 @0x49df753c 0f 34 sysenter -> %esp
END 0xf7f4b580

The related debug message from drreg_reserve_reg_internal:

drreg_reserve_reg_internal @3.0xf7f2058e: no need to spill ecx to slot 0

derekbruening · 2024-01-26T00:40:56Z

These failures include a lot of timeouts, which make this job take ~22 minutes now:

$ grep '120.0. sec' x86-32/7_Run\ Suite.txt 
2024-01-25T23:29:47.8688605Z       Test #290: code_api|tool.drcachesim.filter-no-d ......................................................***Timeout 120.00 sec
2024-01-25T23:29:47.8702222Z       Test #291: code_api|tool.drcachesim.filter-d .........................................................***Timeout 120.00 sec
2024-01-25T23:31:47.8528142Z       Test #290: code_api|tool.drcachesim.filter-no-d ......................................................***Timeout 120.01 sec
2024-01-25T23:31:47.8541097Z       Test #291: code_api|tool.drcachesim.filter-d .........................................................***Timeout 120.00 sec
2024-01-25T23:33:27.5201374Z       Test #309: code_api|tool.drcachesim.scattergather-x86 ................................................***Timeout 120.00 sec
2024-01-25T23:35:27.1030157Z       Test #309: code_api|tool.drcachesim.scattergather-x86 ................................................***Timeout 120.01 sec

$ grep '180.0. sec' x86-32/7_Run\ Suite.txt 
2024-01-25T23:33:15.5197961Z       Test #308: code_api|tool.drcachesim.invariants .......................................................***Timeout 180.00 sec
2024-01-25T23:36:16.0945937Z       Test #308: code_api|tool.drcachesim.invariants .......................................................***Timeout 180.00 sec

…ixed We have lived with our x86-32 GA CI job being red from massive test failures from the AMD switch for long enough. The fix here may be too drastic, but it's simple: we scale back to the tiny set of tests run on ubuntu22. I considered listing the failures on the ignore list, but with timeouts included this job was just taking too long. The plan is to have another Fixit and try to fix enough AMD failures we can re-enable the full job. Issue: #6417

…6590) We have lived with our x86-32 GA CI job being red from massive test failures from the AMD switch for long enough. The fix here may be too drastic, but it's simple: we scale back to the tiny set of tests run on ubuntu22. I considered listing the failures on the ignore list, but with timeouts included this job was just taking too long. The plan is to have another Fixit and try to fix enough AMD failures we can re-enable the full job. Issue: #6417

Create a denylist for CI x86-32 tests, and move failing tests to it. The change re-enables passing tests removed by #6590. Issue: #6417

derekbruening added the Component-Tests label Nov 7, 2023

derekbruening mentioned this issue Nov 7, 2023

i#5694 core-sharded: Add TRACE_MARKER_TYPE_CORE_WAIT #6415

Merged

abhinav92003 mentioned this issue Nov 7, 2023

i#6412: Separate raw2trace impl for sharing with tracer #6408

Merged

AssadHashmi mentioned this issue Nov 7, 2023

i#5365 Update aarch64 workflow #6293

Closed

ivankyluk self-assigned this Nov 9, 2023

ksco mentioned this issue Nov 10, 2023

i#3544 RV64: Fill in the missing pieces to run actual programs #6437

Merged

This was referenced Nov 10, 2023

Add better zipfile error diagnostics #6443

Merged

i#6444 no exit: Dump rseq buffer on mid-region thread end #6446

Merged

i#6444 no exit: Add invariant check for thread exit #6445

Merged

i#6426: Add schedule_stats tool #6442

Merged

derekbruening mentioned this issue Nov 14, 2023

i#3348 sym conflicts: add DR_PARAM_ to IN, OUT, INOUT #6455

Merged

ksco mentioned this issue Nov 16, 2023

i#2267 utils.cmake: Use <unknown> for cpu_vendor as fallback #6457

Merged

abhinav92003 mentioned this issue Nov 16, 2023

Avoid failing in syscall_mix for traces without syscall markers #6440

Merged

This was referenced Nov 16, 2023

tool.drcacheoff.simple fails on some x86-32-ubuntu22 VMs #6416

Closed

i#5843 scheduler: Only consider long-latency syscalls blocking #6458

Merged

i#5843 scheduler: Fix assert to allow 0-latency syscalls #6460

Merged

ksco mentioned this issue Nov 17, 2023

i#3544 RV64: Refine fault translation and signal handling #6461

Merged

derekbruening mentioned this issue Nov 17, 2023

i#6416: Replace vdso raw bytes with per-block encodings #6462

Merged

ivankyluk mentioned this issue Nov 17, 2023

i#6417: Handle AMD 32-bit syscall instruction. #6464

Merged

abhinav92003 mentioned this issue Nov 19, 2023

i#6466: Fix raw2trace missing encoding error for filtered traces #6465

Merged

jackgallagher-arm mentioned this issue Nov 21, 2023

i#5036 A64 scatter/gather, part 6: Expand multi-register instrs #6403

Merged

AssadHashmi mentioned this issue Nov 21, 2023

i#4111: Update HREF link to ArmIE on home page #6459

Merged

derekbruening mentioned this issue Nov 22, 2023

i#6471 sched idle: Add idle time #6472

Merged

derekbruening mentioned this issue Dec 12, 2023

i#5843 scheduler: Accumulate quanta across runs #6502

Merged

This was referenced Dec 12, 2023

i#6495 syscall templates: Fix format specifier in log #6503

Merged

i#6417 x86-32 client.flush: Fix tag for client events #6480

Open

jackgallagher-arm mentioned this issue Dec 14, 2023

i#5036 Refactor AArch64 scatter/gather expansion #6507

Merged

This was referenced Dec 14, 2023

i#6471 idle: Add better idle time modeling #6505

Merged

i#6508: Add kernel context switch sequence injection #6509

Merged

AssadHashmi mentioned this issue Dec 19, 2023

i#3044 AArch64 SVE codec: Add SVE2 SPLICE variant #6517

Merged

derekbruening mentioned this issue Dec 19, 2023

Remove "pre" tags from bug template #6516

Merged

jackgallagher-arm mentioned this issue Dec 20, 2023

i#5036 A64 scatter/gather, part 8: Expand vector+scalar #6520

Merged

AssadHashmi mentioned this issue Dec 20, 2023

i#3044 AArch64 SVE codec: Add SVE2 TBL variant #6521

Merged

This was referenced Jan 2, 2024

i#6530: Support alternate libc name in drsyms-test #6532

Merged

i#5926 libdw: Add new interface dr_running_under_dynamorio() #6533

Merged

AssadHashmi mentioned this issue Jan 5, 2024

i#6499: AArch64 codec: mark instructions with governing predicate as predicated #6535

Merged

This was referenced Jan 5, 2024

i#5926 libdw: Switch drsyms to use libdw from elfutils #6534

Merged

Fix release notes typo in dr_allow_unsafe_static_behavior #6546

Merged

This was referenced Jan 12, 2024

i#6558, i#6127: Set CTest timeouts for all tests #6563

Merged

i#5926 elfutils: Remove elftc lib{dwarf,elf}.a from Linux targets #6564

Merged

ASSERT "vsyscall incorrect assumption" on AMD 32-bit DynamoRIO/drmemory#2491

Open

AssadHashmi mentioned this issue Jan 15, 2024

i#6548: drmemtrace does not create the required pipe file if not present. #6562

Merged

derekbruening mentioned this issue Jan 19, 2024

i#6570: Ignore bcrypt.dll init failure #6571

Merged

AssadHashmi mentioned this issue Jan 23, 2024

i#5365 AArch64: Change scatter/gather instructions to per-element size #6574

Merged

derekbruening mentioned this issue Jan 26, 2024

i#6417 AMD 32-bit: Temporarily, drastically reduce x86-32 tests run #6590

Merged

This was referenced Jan 26, 2024

Invariant failure "Syscall marker not preceded by timestamp + cpuid" on win32 #6599

Closed

i#3544 RV64: Implements more functions and enable more tests #6617

Merged

This was referenced Mar 25, 2024

i#6417 AMD 32-bit: Create a denylist for CI x86-32 tests. #6719

Closed

i#6417: create a denylist for CI x86-32 tests. #6720

Merged

ivankyluk added a commit that referenced this issue Mar 29, 2024

i#6417 AMD 32-bit: create a denylist for CI x86-32 tests. (#6720)

b9a0ba0

Create a denylist for CI x86-32 tests, and move failing tests to it. The change re-enables passing tests removed by #6590. Issue: #6417

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417

Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417

derekbruening commented Nov 7, 2023

derekbruening commented Nov 7, 2023

derekbruening commented Nov 8, 2023

derekbruening commented Nov 8, 2023

derekbruening commented Nov 13, 2023

ksco commented Nov 17, 2023

ivankyluk commented Nov 21, 2023

derekbruening commented Jan 26, 2024

Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417

Sudden flaky failures of 81 tests on x86 32-bit GA CI #6417

Comments

derekbruening commented Nov 7, 2023

derekbruening commented Nov 7, 2023

derekbruening commented Nov 8, 2023

derekbruening commented Nov 8, 2023

derekbruening commented Nov 13, 2023

ksco commented Nov 17, 2023

ivankyluk commented Nov 21, 2023

derekbruening commented Jan 26, 2024