Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hercules crash after resume from suspend #458

Closed
Fish-Git opened this issue Dec 26, 2021 · 5 comments
Closed

Hercules crash after resume from suspend #458

Fish-Git opened this issue Dec 26, 2021 · 5 comments
Assignees
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. Related This issue is closely related to another issue. Consider this issue a "sub-issue" of the other.

Comments

@Fish-Git
Copy link
Member

Fish-Git commented Dec 26, 2021


NOTE:  GitHub Issue #489 (Hercules 4.4.1 crashes after OSA failure) is also closely related to this issue.

NOTE:  GitHub Issue #449 (Periodic "HHC00801I Processor CP00: Fixed-point-divide exception running VSE/ESA) is also closely related to this issue.


Problem reported by Dave Wade:

Dear Developers,

Hercules occasionally crashes after my laptop exits windows sleep mode. Is this something
that could be investigated . It reports its version 4.3.0.10296.

As I seldom run anything other than VM/370 R6 I haven’t yet upgraded to the latest version.

Dave

He provided a crash dump:

@Fish-Git Fish-Git self-assigned this Dec 26, 2021
@Fish-Git Fish-Git added BUG The issue describes likely incorrect product functionality that likely needs corrected. Researching... The issue is being looked into or additional information is being gathered/located. labels Dec 26, 2021
@Fish-Git Fish-Git added the Related This issue is closely related to another issue. Consider this issue a "sub-issue" of the other. label Dec 26, 2021
@Fish-Git
Copy link
Member Author

Fish-Git commented Dec 26, 2021

Analysis of the crash dump reveals:

Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS\Hercules.dmp]
User Mini Dump File: Only registers, stack and portions of memory are available

WARNING: Minidump contains unknown stream type 0x15
WARNING: Minidump contains unknown stream type 0x16
Comment: 'HHC01413I Hercules version 4.3.0.10296-SDL-ga4db8213 (4.3.0.10296)'
Comment: 'HHC01414I (C) Copyright 1999-2020 by Roger Bowler, Jan Jaeger, and others'
Comment: 'HHC01417I ** The SoftDevLabs version of Hercules **'
Comment: 'HHC01415I Build date: Nov 27 2020 at 10:51:04'
Comment: 'HHC01417I Built with: Microsoft Visual Studio 2008 (MSVC 150030729 1)'
Comment: 'HHC01417I Build type: Windows MSVC AMD64 host architecture build'
Comment: 'HHC01417I Modes: S/370 ESA/390 z/Arch'
Comment: 'HHC01417I Max CPU Engines: 64'
Comment: 'HHC01417I Using   shared libraries'
Comment: 'HHC01417I Using   Fish threads Threading Model'
Comment: 'HHC01417I Using   Error-Checking Mutex Locking Model'
Comment: 'HHC01417I With    Shared Devices support'
Comment: 'HHC01417I With    Dynamic loading support'
Comment: 'HHC01417I With    External GUI support'
Comment: 'HHC01417I With    Partial TCP keepalive support'
Comment: 'HHC01417I With    IPV6 support'
Comment: 'HHC01417I With    HTTP Server support'
Comment: 'HHC01417I With    sqrtl support'
Comment: 'HHC01417I With    Signal handling'
Comment: 'HHC01417I With    Watchdog monitoring'
Comment: 'HHC01417I With    CCKD BZIP2 support'
Comment: 'HHC01417I With    HET BZIP2 support'
Comment: 'HHC01417I With    ZLIB support'
Comment: 'HHC01417I With    Regular Expressions support'
Comment: 'HHC01417I With    Object REXX support'
Comment: 'HHC01417I Without Regina REXX support'
Comment: 'HHC01417I With    Automatic Operator support'
Comment: 'HHC01417I Without National Language Support'
Comment: 'HHC01417I With    CCKD64 Support'
Comment: 'HHC01417I With    Transactional-Execution Facility support'
Comment: 'HHC01417I With    "Optimized" instructions'
Comment: 'HHC01417I Machine dependent assists: cmpxchg1 cmpxchg4 cmpxchg8 cmpxchg16 hatomics=msvcIntrinsics'
Comment: 'HHC01417I Running on: DESKTOP-OSERMAE (Windows-6.2.9200 Intel(R) x64) LP=4, Cores=2, CPUs=1'
Comment: 'HHC01417I Built with crypto external package version 1.0.0.31-gbf46d03'
Comment: 'HHC01417I Built with decNumber external package version 3.68.0.84-g94144c9'
Comment: 'HHC01417I Built with SoftFloat external package version 3.5.0.87-gb706677'
Comment: 'HHC01417I Built with telnet external package version 1.0.0.48-gb90a708'
Comment: 'HHC00018W Hercules is NOT running in elevated mode'
Comment: '10:12:32.299449 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]'
Comment: '10:12:32.299596 0:0141 cckd_dasd_end(512): end i/o bufcur 2390 cache[27] waiters 0'
Comment: '10:12:32.299841 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6020 cache[89]'
Comment: '10:12:32.300034 0:034F cckd_dasd_end(512): end i/o bufcur 6020 cache[89] waiters 0'
Comment: '10:12:32.300264 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]'
Comment: '10:12:32.300423 0:0141 cckd_dasd_end(512): end i/o bufcur 2390 cache[27] waiters 0'
Comment: '10:12:32.300578 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6020 cache[89]'
Comment: '10:12:32.300693 0:034F cckd_read_track(866): read  trk   6021 (asynchronous)'
Comment: '10:12:32.300702 0:034F cckd_read_trk(1194): 0 rdtrk     6021'
Comment: '10:12:32.301348 0:034F cckd_read_trk(1242): 0 rdtrk[150] 6021 cache hit buf 0000000004CA67A0:0000c80015'
Comment: '10:12:32.301391 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.301674 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2390 cache[27]'
Comment: '10:12:32.301786 0:0141 cckd_read_track(866): read  trk   2391 (asynchronous)'
Comment: '10:12:32.301792 0:0141 cckd_read_trk(1194): 0 rdtrk     2391'
Comment: '10:12:32.302280 0:0141 cckd_read_trk(1242): 0 rdtrk[28] 2391 cache hit buf 0000000004CE6480:00004f0015'
Comment: '10:12:32.302685 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.313596 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.313759 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.313940 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]'
Comment: '10:12:32.314142 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.314339 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.314486 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.314649 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]'
Comment: '10:12:32.315347 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.315475 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.315744 0:034F cckd_dasd_end(512): end i/o bufcur 6021 cache[150] waiters 0'
Comment: '10:12:32.315972 0:0141 cckd_dasd_start(434): start i/o file[1] bufcur 2391 cache[28]'
Comment: '10:12:32.316161 0:0141 cckd_dasd_end(512): end i/o bufcur 2391 cache[28] waiters 0'
Comment: '10:12:32.316323 0:034F cckd_dasd_start(434): start i/o file[1] bufcur 6021 cache[150]'
Comment: '10:12:32.316434 0:034F cckd_read_track(866): read  trk   6022 (asynchronous)'
Comment: '10:12:32.316439 0:034F cckd_read_trk(1194): 0 rdtrk     6022'
Comment: '10:12:32.316961 0:034F cckd_read_trk(1242): 0 rdtrk[120] 6022 cache hit buf 0000000004A9B8C0:0000c80016'
Comment: '10:12:32.317002 0:034F cckd_dasd_end(512): end i/o bufcur 6022 cache[120] waiters 0'
Comment: '10:12:32.322752 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]'
Comment: '10:12:32.323013 0:07F0 cckd_dasd_end(512): end i/o bufcur 11490 cache[180] waiters 0'
Comment: '10:12:32.328969 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]'
Comment: '10:12:32.329318 0:07F0 cckd_dasd_end(512): end i/o bufcur 11490 cache[180] waiters 0'
Comment: '10:12:32.351690 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11490 cache[180]'
Comment: '10:12:32.351845 0:07F0 cckd_read_track(866): read  trk   11491 (asynchronous)'
Comment: '10:12:32.351850 0:07F0 cckd_read_trk(1194): 0 rdtrk     11491'
Comment: '10:12:32.351859 0:07F0 cckd_read_trk(1242): 0 rdtrk[181] 11491 cache hit buf 0000000004B25690:00017f0001'
Comment: '10:12:32.351972 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:32.352374 0:07F0 cckd_read_trk(1194): 1 rdtrk     11492'
Comment: '10:12:32.352385 0:07F0 cckd_read_trk(1273): 1 rdtrk[182] 11492 cache miss'
Comment: '10:12:32.352391 0:07F0 cckd_read_trk(1302): 1 rdtrk[182] 11492 dropping 0241:30 from cache'
Comment: '10:12:32.352396 0:07F0 cckd_read_trk(1325): 1 rdtrk[182] 11492 buf 0000000004A60C20 len 19456'
Comment: '10:12:32.352401 0:07F0 cckd_read_trkimg(3333): trk[11492] read_trkimg'
Comment: '10:12:32.352817 0:07F0 cckd_read_l2ent(3250): file[1] l2[44,228] trk[11492] read_l2ent 0x13d08'
Comment: '10:12:32.353219 0:07F0 cckd_read_l2(3009): file[1] read_l2 44 active 1 44 144'
Comment: '10:12:32.353640 0:07F0 cckd_read_l2ent(3268): file[1] l2[44,228] trk[11492] read_l2ent 0x185c57 189 189'
Comment: '10:12:32.353647 0:07F0 cckd_read(624): file[1] fd[20] read, off 0x0000000000185c57 len 189'
Comment: '10:12:32.354346 0:07F0 cckd_read_trk(1366): 1 rdtrk[182] 11492 complete buf 0000000004A60C20:01017f0002'
Comment: '10:12:36.017050 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.017270 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:36.023011 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.023265 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:36.024658 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.024907 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:36.027040 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:36.027313 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:43.774786 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:43.774950 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: '10:12:43.775210 0:07F0 cckd_dasd_start(434): start i/o file[1] bufcur 11491 cache[181]'
Comment: '10:12:43.775390 0:07F0 cckd_dasd_end(512): end i/o bufcur 11491 cache[181] waiters 0'
Comment: ''
Symbol search path is: C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS;srv*C:\Windows\Symbols*http:https://msdl.microsoft.com/download/symbols
Executable search path is: C:\Users\Fish\Documents\Visual Studio 2008\Projects\Hercules\_DUMPS
Windows 7 Version 19043 MP (4 procs) Free x64
Product: WinNt, suite: SingleUserTS
Machine Name:
Debug session time: Sat Dec 25 02:13:03.000 2021 (UTC - 8:00)
System Uptime: not available
Process Uptime: 2 days 0:49:02.000
.........................................................
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(8440.7bc8): Access violation - code c0000005 (first/second chance not available)
ntdll!ZwGetContextThread+0x14:
00007ffa`3c24eba4 c3              ret
0:003> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

Failed calling InternetOpenUrl, GLE=12029

FAULTING_IP: 
hengine!watchdog_thread+486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586]
00000001`005db506 c6450000        mov     byte ptr [rbp],0

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 00000001005db506 (hengine!watchdog_thread+0x0000000000000486)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000001
   Parameter[1]: 0000000000000000
Attempt to write to address 0000000000000000

DEFAULT_BUCKET_ID:  NULL_POINTER_WRITE

PROCESS_NAME:  hercules.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1:  0000000000000001

EXCEPTION_PARAMETER2:  0000000000000000

WRITE_ADDRESS:  0000000000000000 

FOLLOWUP_IP: 
hengine!watchdog_thread+486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586]
00000001`005db506 c6450000        mov     byte ptr [rbp],0

MOD_LIST: 

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

FAULTING_THREAD:  0000000000007bc8

PRIMARY_PROBLEM_CLASS:  NULL_POINTER_WRITE

BUGCHECK_STR:  APPLICATION_FAULT_NULL_POINTER_WRITE

LAST_CONTROL_TRANSFER:  from 0000000000c60450 to 00000001005db506

STACK_TEXT:  
00000000`02dff500 00000000`00c60450 : 00000000`00dbf130 00000000`00000000 00000000`00000003 00000000`7055334a : hengine!watchdog_thread+0x486 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\impl.c @ 586]
00000000`02dff790 00000000`00c53c04 : 00000000`00dc03f0 00000000`00000000 00000000`00000000 00000000`00000000 : hutil!hthread_func+0x60 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\hthreads.c @ 1051]
00000000`02dff7c0 00000000`70552fdf : 00000000`00dc1950 00000000`00000000 00000000`00dc1950 00000000`00000000 : hutil!FTWin32ThreadFunc+0x44 [c:\users\fish\documents\visual studio 2008\projects\hercules\_git\_fish\hyperion-0\fthreads.c @ 809]
00000000`02dff810 00000000`70553080 : 00000000`705dcad0 00000000`00dc1950 00000000`00000000 00000000`00000000 : msvcr90!_callthreadstartex+0x17 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 348]
00000000`02dff840 00007ffa`3bbc7034 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : msvcr90!_threadstartex+0x84 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 326]
00000000`02dff870 00007ffa`3c202651 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
00000000`02dff8a0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21


STACK_COMMAND:  ~3s; .ecxr ; kb

FAULTING_SOURCE_CODE:  
   582:             /* Give logger thread time to log messages */
   583:             SLEEP(1);
   584: 
   585:             /* Create the crash dump for offline analysis */
>  586:             CRASH();
   587:         }
   588:     }
   589:     while (!sysblk.shutdown);
   590: 
   591:     return NULL;


SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  hengine!watchdog_thread+486

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: hengine

IMAGE_NAME:  hengine.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  5fc14ab7

FAILURE_BUCKET_ID:  NULL_POINTER_WRITE_c0000005_hengine.dll!watchdog_thread

BUCKET_ID:  X64_APPLICATION_FAULT_NULL_POINTER_WRITE_hengine!watchdog_thread+486

WATSON_STAGEONE_URL:  http:https://watson.microsoft.com/StageOne/hercules_exe/4_3_0_10296/5fc14ad2/hengine_dll/4_3_0_10296/5fc14ab7/c0000005/001db506.htm?Retriage=1

Followup: MachineOwner
---------

Which makes perfect sense!

The WatchDog thread checks for malfunctioning CPUs by checking to see if they've executed any instructions in the past 20 seconds, and when the host is later resumed after having been suspended, much longer than 20 seconds has passed! (thereby misleading the watchdog thread into believing the CPU is malfunctioning when in fact is likely wasn't!)

New code needs to be added to Hercules (at least for Windows anyway; not sure how to do it for non-Windows) to detect when a resume after a suspend has occurred, and compensate for that fact to prevent erroneous malfunctioning CPUs from being detected.

@Fish-Git Fish-Git added IN PROGRESS... I'm working on it! (Or someone else is!) and removed Researching... The issue is being looked into or additional information is being gathered/located. labels Dec 26, 2021
@dasdman
Copy link
Contributor

dasdman commented Dec 26, 2021 via email

@Fish-Git
Copy link
Member Author

Thanks anyway Mark, but I'm already on it. Having been a Windows programmer for 25+ years now, I DO know about these things after all.   :)

(Now the Linux side of things I could use some help with!)

@Fish-Git
Copy link
Member Author

QUESTION FOR MY FELLOW HERCULES DEVELOPERS:

Before I begin testing, which fix technique do you prefer? Technique #1 or #2?

Technique #1 is host operating system dependent and would obviously require adding additional code to deal with the non-Windows case (only the Windows case has been coded), whereas technique #2 has the advantage of not being specific to any host os (i.e. should work equally well on both Windows and non-Windows systems) and is much simpler and more straightforward as well.

I realize it's obvious the second technique is clearly superior to the first as far as resolving the specific issue at hand goes, but I guess what I'm asking is, how important is it for us to be able to detect host system suspend/resume events in general?

The first technique has the advantage(?) of directly detecting such events, thereby allowing us the ability react differently to such events elsewhere in Hercules code too (and not just in the watchdog thread), whereas the second technique only fixes the one specific suspend/resume problem that we've discovered thus far (in the watchdog thread). There might well be other places in Hercules code where we need to compensate for suspend/resume that we just haven't discovered yet, and by using technique #1, such code would simply need to check the sysblk flags. The groundwork of actually directly detecting suspend/resume events would already have been laid.

Personally, I vote for technique #2.

@Fish-Git Fish-Git added the QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. label Dec 27, 2021
Fish-Git added a commit that referenced this issue Jan 1, 2022
HOPEFULLY closes #458.

Not having a laptop, I was unable to test. Attempts to reproduce problem in a VMware virtual machine (so I could test whether my fix corrected the problem or not) failed. But I'm nonetheless confident this likely corrects the problem. <me: fingers crossed>
@Fish-Git
Copy link
Member Author

Fish-Git commented Jan 1, 2022

Fixed (hopefully!) by commit 4b439ae.

Closing.

@Fish-Git Fish-Git closed this as completed Jan 1, 2022
@Fish-Git Fish-Git removed QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. IN PROGRESS... I'm working on it! (Or someone else is!) labels Jan 1, 2022
Fish-Git added a commit that referenced this issue May 22, 2022
This is a followup fix to commit 4b439ae (GitHub #458) to try and prevent a possible watchdog thread hung processor false positive crash that can sometimes occur on Windows upon resuming from suspension (awaking from sleep).

This should hopefully fix GitHub issue #489.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. Related This issue is closely related to another issue. Consider this issue a "sub-issue" of the other.
Projects
None yet
Development

No branches or pull requests

2 participants