Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hercules 4.4.1 crashes after OSA failure #489

Closed
cfdonatucci opened this issue May 6, 2022 · 80 comments
Closed

Hercules 4.4.1 crashes after OSA failure #489

cfdonatucci opened this issue May 6, 2022 · 80 comments
Assignees
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. (Invalid/PEBKAC) Likely user error. The described problem does not exist or was otherwise determined to be bogus. Related This issue is closely related to another issue. Consider this issue a "sub-issue" of the other.

Comments

@cfdonatucci
Copy link

cfdonatucci commented May 6, 2022


NOTE:  GitHub Issue #458 (Hercules crash after resume from suspend) is also closely related to this issue.


Hi guys,

I had to reinstall my Windows 10 after a problem. After reinstalling Hercules, I'm now having an odd problem not happening before, regarding my external connection to use other Windows applications on the same PC. I reinstalled all Hercules software.

I also ran TTTest64.exe successfully. I have DHCP default conf in my PC. I always used CTCI connection but fails with no error message. So I tried with OSA, and now I get a dump.

z/OS 2.4 starts ok and the OSA is installed and activated ok.

I can start TSO sessions from any PC in my net, I can use CICSPLEX very well using cicsexpl.
When I attempt to use a zosexpl session with RSED, the connection crashed and Hercules as well.

Dump available:

I'd appreciate any help!

Regards,
Carlos

OSA Adapter

0400.3    QETH  chpid F0 iface D8-5E-D3-81-FE-1D  ipaddr 192.168.1.115  netmask 255.255.0.0

; -------------------------------------------------------------------
; Device to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  DEVICE PORTA MPCIPA
  LINK OSA1 IPAQENET  PORTA
  HOME 192.168.1.115  OSA1
; -------------------------------------------------------------------
; Routes to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  BEGINRoutes
  ; Destination        SubnetMask    FirstHop      LinkName    Size
  ROUTE 192.168.1.0    255.255.255.0    =          OSA1   MTU 1500
  ; Destination                      First Hop     LinkName    Size
  ROUTE DEFAULT                      192.168.1.1   OSA1   MTU 1500
  ENDRoutes
; -------------------------------------------------------------------
; Start  to support zPDT External   to z/OS Connections
; -------------------------------------------------------------------
  START PORTA
 BROWSE    ADCD.Z24B.VTAMLST(OSATRL2) - 01.01       Line 0000000000
 Command ===>                                                  Scrol
********************************* Top of Data **********************
OSATRL1 VBUILD TYPE=TRL
OSATRL1E TRLE LNCTL=MPC,READ=(0400),WRITE=(0401),DATAPATH=(0402),
               PORTNAME=PORTA,
               MPCLEVEL=QDIO
OSATRL2E TRLE LNCTL=MPC,READ=(0404),WRITE=(0405),DATAPATH=(0406),
               PORTNAME=PORTB,
               MPCLEVEL=QDIO

Hercules log

12:32:36.422 00000788 HHC03800I 0:0401 QETH: Adapter mode set to Layer 3
12:32:36.422 00000788 HHC04100I TunTap64.dll version ** UNPAID TRIAL COPY **  3.7.0.5409 initiated
12:32:36.423 00000788 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using MAC address 02:00:5e:a3:be:84
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.1.115
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.0.0
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using MTU 1500
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using drive MAC address 96:7a:59:e5:d2:bf
12:32:36.492 00000788 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::967a:59ff:fee5:d2bf
12:32:36.499 00000788 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.1.115
----
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using MAC address 02:00:5e:a3:be:84
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using IP address 192.168.1.115
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using subnet mask 255.255.0.0
12:52:19.818 00000788 HHC03997I 0:0401 QETH: tun0: not using MTU 1500
--
12:52:49.462 00000788 HHC00822S PROCESSOR CP01 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR CP03 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR IP06 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00822S PROCESSOR IP07 APPEARS TO BE HUNG!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(536)
12:52:49.462 00000788 HHC00823S You have 45 seconds to attach a debugger before crash dump will be taken!
12:52:49.462 00000788 HHC00007I Previous message from function 'watchdog_thread' at impl.c(554)
---
12:53:35.492 00000788                       ***************
12:53:35.492 00000788                       *    OOPS!    *
12:53:35.492 00000788                       ***************
12:53:35.492 00000788                     Hercules has crashed!
12:53:35.492 00000788 (you may or may not need to press ENTER if no 'oops!' dialog-box appears)
12:53:40.404 00000788 Creating crash dump "C:\hercules4.4.1\Hercules.dmp"...
12:53:40.404 00000788 Please wait; this may take a few minutes...
12:53:40.404 00000788 (another message will appear when the dump is complete)
12:53:44.470 00000788 Dump "C:\hercules4.4.1\Hercules.dmp" created.
12:53:44.470 00000788 Please forward the dump to the Hercules team for analysis.
@Fish-Git
Copy link
Member

Fish-Git commented May 6, 2022

A couple of things right off the bat:

  1. I would prefer seeing a complete Hercules configuration file rather than just a single device statement. There could be other valuable information in there that might help explain things.

  2. I would prefer to see a complete Hercules log file (from beginning to end) rather than the small extract your provided. There could be valuable information in the log that might help explain what's going on (or what happened).

  3. You said "I also ran TTTest64.exe successfully", but simply starting TTTest64 and then not doing anything is not enough. After starting TTTest64 you need to actually perform a test! More specifically, a "Multi-directional Ping Test", as explained in the CTCI-WIN documentation (Help file). You also failed to show us your TTTest64 test output too. It contains valuable information about your host networking setup.

  4. On your Hercules device statement, you specified netmask 255.255.0.0, but in your z/OS TCP config you specified 255.255.255.0.

Most of these things are mentioned in our "SUBMITTING PROBLEM REPORTS" document. Please review it and then submit all needed additional information. Thanks.

In the mean time I'll take a peek at your dump to see whether anything jumps out at me.

@Fish-Git Fish-Git self-assigned this May 6, 2022
@Fish-Git Fish-Git added BUG The issue describes likely incorrect product functionality that likely needs corrected. QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. IN PROGRESS... I'm working on it! (Or someone else is!) Researching... The issue is being looked into or additional information is being gathered/located. labels May 6, 2022
@Fish-Git
Copy link
Member

Fish-Git commented May 6, 2022

Dump was worthless.

I need to see the complete Hercules log file, configuration file, and TTTest64 output.

Explaining in more detail what you were doing (or trying to do) when the problem occurred might help too.

p.s. Did you disable Unconstrained Transactions (i.e. run TXOFF) before doing whatever it was you were attempting to do, like you did previously?

@cfdonatucci
Copy link
Author

cfdonatucci commented May 6, 2022

Hi, thank you for your answer. The netmask was fixed and everything redone. It failed again.

I've attached the entire log, second dump, TTTest64 report and entire config file:

I also ran TXOFF before doing whatever I do. The only thing not clear to me is when TXON should be executed.

I'm testing with OSA because it failed with CTCI as well. In general Hercules works pretty fine. The issue only occurs when I want to access PC applications to exploit mainframe facilities.

Activities:

  • TXOFF is executed after IPL.
  • Logon using IP 192.168.1.115 port 23 works.
  • CICS explorer using CMCI connection works completely.
  • RSED against 192.168.1.115 port 4035 establishes session, but then something crashes and Hercules abends.
  • Connection from zosexplorer to zCEE server can be established. I can see services and APIs.
  • Postman REST services against 192.168.1.115 using tcpservice/ipcon on CICS works.
  • Services/APIs deployment crashes the same as RSED connection.

I'm using all products included in Hercules, but for example I've seen that WinPCap is not supported and Npcap is recommended. Don't know if that may be related.

Let me know anything else you need.

Regards
Carlos

@SDL-Hercules-390 SDL-Hercules-390 deleted a comment from cfdonatucci May 6, 2022
@Fish-Git Fish-Git removed IN PROGRESS... I'm working on it! (Or someone else is!) QUESTION... A question was asked but has not been answered yet, -OR- additional feedback is requested. labels May 9, 2022
@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

I'm going to need some help reproducing this. I have almost zero z/OS skills.

I do have z/OS 2.4B, and it seems to IPL and run just fine, but for all of my z/OS Hercules IPL tests, I always use loadparm WSM (WS = CLPA and Warm start of JES2. Base z/OS system functions i.e. no CICS, DB2, IMS, etc. M=Verbose IPL Messages), not CI.

When I just now tried IPLing my system (which I believe is an ADCD system) using loadparm CIM (CI = CLPA and Warm start of JES2. Loads CICS 5.3 and 5.2 libraries. Starts CICS 5.3, z/OSMF, and IBM Developer for z Systems. M=Verbose IPL Messages), it's asking me:

    IGGN505A SPECIFY UNIT FOR DFH540.CICS.SDFHLPA ON B4C541 OR CANCEL

which I don't know how to respond to.   :(

I also don't know what cicsexpl is, nor what a zosexpl session with RSED is either.   :(

My system is configured to use OSA and is configured almost identically as yours.

Can you explain to me (in simple terms please! I'm not a z/OS person!) what I need to do to reproduce your problem?

Thanks.


p.s. I suspect something might be wrong with your local network configuration. Your TTTest64 report does not look right. Your first ping of www.linux.org using a tun, reported a ping response time of "time=16ms", which is quite reasonable. But then you closed your tun interface and tried the same thing again using a tap interface instead, and this time all of your ping responses were all "time=<1ms"!! I am seriously doubting you can ping www.linux.org from your Windows system in less than 1ms!!

Can you provide some more details regarding your local networking? Thanks.

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

Can you provide some more details regarding your local networking? Thanks.

And it might be a good idea to try your TTTest64 Ping Test again, but this time without using a tun interface beforehand. Do the test right away using only a tap defined interface.

It might not hurt to clear your ARP cache beforehand too, or even IPL your Windows system.

You said you had to reinstall Windows 10, and as I recall, Windows 10 did have some type of bug in its networking handling at some point in the distant past that was fixed with an update. Did you (re-)install all of your Windows/Security updates after you reinstalled Windows? (and before doing your Hercules test?)

@cfdonatucci
Copy link
Author

cfdonatucci commented May 9, 2022

Hi, Regarding the message, it is just because the CICS 54 DFHLPA maybe in the lpalist, but the lib doesn't exist. In such cases you can reply 0,cancel. All consoles messages at that stage must be replied with 0,something. Sorry if this it obvious.

zosexplorer is an IBM eclipse application working as a framework for other applications included as plugins, like cics explorer, git interfaces, zos connect enterprise edition, zosexplorer, dbb for zDevOps and others. I start my ADCD z/OS 2.4 with CI as well.

To reproduce my problem you could use zosexplorer. You need two started tasks: jmon and rsed, which I believe are started by default and define a connection to the default eclipse zos explorer port, port 4035.

With my previous Windows I used to have a fixed IP address 192.168.1.110 with a CTCI definition. Unfortunately I didn't keep record of that configuration. So I tried to do the same and now I have this problem. This worked in my other Windows. So I switched to an OSA definition, hoping it would help, but it didn't.

My local networking has DHCP, no other configuration. I erased the ARP cache and re-ran TTTest64 Ping Test:

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

Regarding the message, it is just because the CICS 54 DFHLPA maybe in the lpalist, but the lib doesn't exist. In such cases you can reply 0,cancel.

Thanks. That seems to have worked. I had to reply to it twice though. After the first reply, the same(?) message appeared again a minute or so later. After I replied to the second message, the system finally finished IPLing and is now running normally(?).

To reproduce my problem you could use zosexplorer.

Which I know nothing about.  :(

You need two started tasks: jmon and rsed...

Which I don't know how to do.  :(

...which I believe are started by default...

Good!!   :)

...and define a connection to the default eclipse zos explorer port, port 4035.

  1. How can I tell whether zosexplorer (i.e. jmon and rsed??) is finished and ready for work? After IPLing my system, all processors are running at nearly 100%. How do I know when everything is ready for me to begin my testing? How long (approximately) do I have to wait before trying my test?

  2. What do I do with port 4035? Do I connect a terminal to it? What type of terminal? 3270? Or simple command line?

  3. Once connected(?), THEN what do I do? Do I simply enter some type of command? What command do I enter? What do I need to try doing to reproduce the crash?

Sorry for the stupid questions, but as I explained, I know almost nothing about z/OS!  :(

Thanks.

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

  1. How can I tell whether zosexplorer (i.e. jmon and rsed??) is finished and ready for work?

FYI: When I press PF10 on my master console (to issue the D A,L command), I do see both JMON and RSED in the list of active tasks. Does that mean they're both ready? Does that mean I can do my test? (whatever that test is! I don't even know what I'm supposed to be doing!). Thanks.

@cfdonatucci
Copy link
Author

one question, do you have a TSO session or just the console?

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

one question, do you have a TSO session or just the console?

TSO session too. I should tell you I DO know how to do a little teensy tiny bit. I know how to logon. I know how to use ISPF 6 to issue 'ping' and other commands. I know how to browse dataset/edit dataset members, submit jobs (although I do NOT know JCL!), look at printouts, etc. And I know how to cleanly shut the system down. But nothing beyond that,

@cfdonatucci
Copy link
Author

ok, I can guide you to download zos explorer from IBM site, how to configure a connection to Mainframe and how it's invoked if you are willing to do it. I'll have to take a series of screenshots and send them to you. Please confirm if you want. additionally, do you have TXOFF/ON in your zos?

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

...if you are willing to do it.

Yes, certainly!

I'll have to take a series of screenshots and send them to you.

That's fine. My email address is fish at either softdevlabs.com or infidels.org.

additionally, do you have TXOFF/ON in your zos?

Not yet, no. But I do have a copy of Jürgen's TXONOFF job stream, so as far as I know all I have to do is run it, and then it'll be on my system. Yes? And then all I need to do is enter the command 'txoff' or 'txon' from ISPF 6 to disable or enable unconstrained transactions, yes?

@cfdonatucci
Copy link
Author

yes to both... i'm writing the instructions. I'll send them as soon as i can. Thank you very much.

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

CORRECTION:   I just found my notes regarding TXONOFF:

--------------------------------------------------------------------------------

                    Disable/Enable UNconstrained transactions


Upload Jürgen's TXONOFF to IBMUSER adcd.lib.jcl and run it.

(Be careful during upload! The job stream is CASE SENSITIVE!)

Then to activate, either re-IPL, or issue console command "F LLA,REFRESH"

Then to switch unconstrained transactions OFF/ON
(FOR ALL SUBSEQUENTLY STARTED jobs!), simply do:

  "S TXOFF" at the console   (or "TXOFF" from a shell prompt)
  "S TXON"  at the console   (or "TXON"  from a shell prompt)

Example:

  alias cls=clear

  export PATH=$PATH:$JAVA_HOME/bin

  TXOFF

  time java com.ibm.jvm.format.TraceFormat test.trc
  time java com.ibm.jvm.format.TraceFormat test.trc > test.trc.fmt 2>&1

--------------------------------------------------------------------------------

Is that correct?

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

i'm writing the instructions. I'll send them as soon as i can.

Thanks! Standing by...

@Fish-Git
Copy link
Member

Fish-Git commented May 9, 2022

FYI: TXONOFF is now on my system, and entering s txoff from the master console resulted in:

   - 17.24.51           s txoff
   - 17.24.51           IRR812I PROFILE * (G) IN THE STARTED CLASS WAS USED
   -         TO START TXOFF WITH JOBNAME TXOFF.
   - 17.24.51 STC00442  $HASP373 TXOFF    STARTED
   - 17.24.52 STC00442  IEF404I TXOFF - ENDED - TIME=17.24.52

So I think we're good to go.

Standing by for further instructions via email...

@cfdonatucci
Copy link
Author

mail sent, let me know if you got it.

@Fish-Git
Copy link
Member

Nope. Not yet.  :(

What email address did you send it to?

@cfdonatucci
Copy link
Author

cfdonatucci commented May 10, 2022

fish@softdevlabs.com

@cfdonatucci
Copy link
Author

cfdonatucci commented May 10, 2022

Hi, I'll attach the file here. Regards.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 10, 2022

Hi I did two things:

  1. Started with CTCI support with debug. I reproduced the error, but saw nothing related to CTCI.
  2. Started with OSA support with debug. When I tried to connect to RSED, I got this:
12:07:23.003 0000302C HHC03991D 0:0401 QETH: RRH_TYPE_ULP: PUK_TYPE_DISABLE (ULP_DISABLE): Request
12:07:23.003 0000302C HHC03981D 0:0401 QETH: TH : +0000< 00E00000 0000001B 00000014 00000055  ...............U  .\..............
12:07:23.003 0000302C HHC03981D 0:0401 QETH: TH : +0010< 10000001                             ....              ....
12:07:23.003 0000302C HHC03981D 0:0401 QETH: RRH: +0000< 00000000 417E0001 00000004 00000003  ....A~..........  .....=..........
12:07:23.003 0000302C HHC03981D 0:0401 QETH: RRH: +0010< 00240015 00001505 D8C5E3F3 00000000  .$..............  ........QET3....
12:07:23.003 0000302C HHC03981D 0:0401 QETH: RRH: +0020< 00000000                             ....              ....
12:07:23.003 0000302C HHC03981D 0:0401 QETH: PH : +0000< 01000015 00000040                    .......@          .......
12:07:23.003 0000302C HHC03981D 0:0401 QETH: PUK: +0000< 000C4103 00090000 00000000           ..A.........      ............
12:07:23.003 0000302C HHC03981D 0:0401 QETH: PUS: +0000< 00090403 05000101 16                 .........         .........
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using MAC address 02:00:5e:a3:be:84
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using IP address 192.168.1.115
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using subnet mask 255.255.255.0
12:07:23.003 0000302C HHC03997I 0:0401 QETH: tun0: not using MTU 1500
12:07:23.012 0000302C HHC03991D 0:0402 QETH: Halting data device

Entire log attached.

I hope this helps.

@Fish-Git
Copy link
Member

Fish-Git commented May 11, 2022

What email address did you send it to?
fish@softdevlabs.com

Weird. I never receive it.

Hi, I'll attach the file here. Regards.

Thanks. I downloaded it and tried it it yesterday after making the following changes:

  1. I ran IBM Explorer for z/OS from another system on my local network: a Windows 7 x64 VMware virtual machine, but the fact that it was a virtual machine shouldn't make any difference. As far as my Windows 7 host was concerned, it was another system on the local network).

  2. I added a new Inbound rule to my Windows Firewall to let just TCP port 4035 through. That didn't seem to work, so I changed it to Protocol = Any instead, which of course removed the port number restriction, effectively disabling my firewall entirely (i.e. letting anyone connect to anything from anywhere), and that did work:

firewall1
firewall2
firewall3

As you can see below, I was able to connect and things worked just fine (although I didn't know what the heck I was doing! I'm not familiar with IBM Explorer for z/OS!):

zosexplorer

I was however able to view files and printouts! Pretty cool!

One thing I did notice was that sometimes the connection would fail on the first attempt. But if I tried again it would work the second time around. I'm not sure what that means, if anything.

I also do not use DHCP. I have all of my systems are hard coded with their own uniquely assigned IP addresses, so it wasn't exactly a fair test. My initial goal wasn't to try and exactly reproduce your problem, but rather just to see if I could get it to work, and I succeeded in that endeavor.

I also have Checksum Offloading overridden in CTCI-WIN too:

Checksum offloading = OVERRIDDEN

(Refer to "Disable CTCI-WIN's default Checksum Offload behavior" in the "Common Problems" section of the CTCI-WIN Help file)

And finally, I do not have IPv6 enabled on my adapter either (whereas you do). Despite all the hype, I've personally never found much use for IPv6. If you have a lot of internet devices maybe you have a need, I don't know. But for me, living without IPv6 is not a problem.

I would also note that at no time during my initial attempts (when my connection attempts would fail and RSED would crash due to a TXF restricted instruction failure), at no time did my system crash. Hercules remained up and running just fine.

If I get time I will MAYBE try to configure my system to use DHCP and also try running IBM Explorer for z/OS on the same system that Hercules is running on, just to see whether that makes any difference or not. I'm doubting it will, but it might be worth a shot.

Personally I think your local Windows network is borked. Your second TTTest64 report is still showing "time=<1ms" for your pings to www.linux.org, which is virtually impossible.

Some things to try/check:

  • Check your router to make sure 192.168.1.115 (i.e. your z/OS guest's IP address) is not within its DHCP range. If it is, try using a different IP address for your z/OS guest to see if that works any better.

  • Try defining your QETH (OSA) device in your Hercules configuration as a normal non-DHCP device:

    0400.3 OSA chpid F0 iface 192.168.1.37 ipaddr 192.168.1.115 netmask 255.255.255.0

  • You may or may not want to disable IPv6 on your adapter.

  • Try uninstalling and reinstalling (and rebooting in between and afterwards?) CTCI-WIN. The fact that your pings to linux.org are still getting <1ms responses tells me there is still something very wrong with your networking setup. I would personally NOT PROCEED with any Hercules testing until you can first get CTCI-WIN to get valid responses to its pings to linux.org. (Does the same phenomenon occur when you try pinging someone else? Such as www.softdevlabs.com?)

That's all for now.

I'll continue trying to reproduce your crash but so far I haven't had any luck.

@Fish-Git
Copy link
Member

Carlos (@cfdonatucci),

Would it be possible for you to do a favor for me please? I am having trouble determining where in Hercules it is hanging and crashing, because I am unable to reproduce your problem on my own system. I would therefore very much appreciate if you would run a test for me please.

You will need to temporarily break things on your system by temporarily re-enabling "Large Send Offload" and "Jumbo Frame" by setting them to "Enable".

Once you have set "Large Send Offload" and "Jumbo Frame" to "Enabled", then please start Hercules and perform the same test you did before, but with the following small change:

  • Before starting RSED, enter the following Hercules commands:
ptt io qeth 50000
t+400
t+401
t+402

Then connect from zosexplorer like before. Hercules will hopefully crash just like it did before, but this time some additional information should be displayed in the Hercules log.

It is very important to save the Hercules log file! That is the only thing I need!

I do not need the crash dump. When Hercules crashes and it asks you if you want to create a crash dump, you can answer "No". I do not need the crash dump. I only need the Hercules log file.

Afterwards, you can fix things by re-disabling "Large Send Offload" and "Jumbo Frame" by setting them back to "Disable" again.

Thanks.

@cfdonatucci
Copy link
Author

Hi sorry i didn't aswer. I was at the office today. Just a question, to enable jumbo frame I have 4088 or 9014 bytes. I don't remember the default value... should I set 9014?

@Fish-Git
Copy link
Member

... to enable jumbo frame I have 4088 or 9014 bytes. I don't remember the default value... should I set 9014?

I don't know. It probably doesn't matter. But based solely on the following information, I would probably set it to 9014:

@cfdonatucci
Copy link
Author

cfdonatucci commented May 16, 2022

I set jumbo to 9014 and disabled Large Send Offload, ran txoff and started RSED. To my surprise it didn't fail. I think it should be greater than 1500

@cfdonatucci
Copy link
Author

But i cannot set it higher than 9014
image

@Fish-Git
Copy link
Member

... and disabled Large Send Offload, ran txoff and started RSED. To my surprise it didn't fail.

You were supposed to ENABLE Large Send Offload for my test.

Disabling it fixes the problem (prevents the crash).

Enabling Large Send Offload (which is the normal default setting) is what causes the crash.

I need you to ENABLE it, so the problem/crash occurs.

Once you run my test, then you can disable it again so it doesn't crash.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 17, 2022

That's what I did, I enabled it... when I ran the test, I had Large Send Offload enabled and Jumbo set to 9014. The connection succeeded and everything worked. After a while the connection crashed. I see errors in the Hercules console related to QETH adapter, but Hercules didn't crash like before. I'll send the log to you.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 17, 2022

Log attached:

@Fish-Git
Copy link
Member

After a while the connection crashed.

What does that mean? "The connection crashed"?

... but Hercules didn't crash like before.

Then the test was no good. We will have to try again.

I am trying to determine the cause of the original Hercules crashing, which you claimed earlier "could be consistently reproduced":

Very simple test:

  • Firewall completely disabled.
  • netsh interface ip delete arpcache executed
  • Just OSA device pointing to 192.168.1.112
  • zOS 2.4 IPLed. Keep in mind that RSED doesn't start with the IPL.
  • OSA started
  • ping 192.168.1.112 ok.
  • Once IPL is finished TXOFF executed.
  • RSED started.
  • connection from zosexplorer -> "remote" system, started
  • in explorer at the bottom line I see message "IZE0100I Connected to 192.168.1.112:4035, fetching children of My Home 100%", meaning the connection was established and it was working.
  • OSA adapter fails.
  • Hercules crashes.

This test can be consistently reproduced.

  • I don't have duplicate MACs
  • I don't have firewall active
  • I used IPs in device definition
  • I cleaned the ARP cache

Documents attached:

  • text capture
  • shark format capture
  • Hercules Log
  • hercules dump

Have a nice weekend.

If Hercules does not crash, then the test is no good. I want to recreate the original conditions that was causing your Hercules to crash, but with my debugging commands active (ptt io qeth 50000, t+400, etc...).

Thanks.

@Fish-Git
Copy link
Member

Fish-Git commented May 17, 2022

Log attached:

Not helpful.   :(

I need you to recreate the original Hercules crash problem you were experiencing. That is what I am trying to resolve. The "ptt io qeth 50000" and other commands will provide additional information that will help me determine where (and thus hopefully why) Hercules was hanging/crashing.

Thanks.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 17, 2022

Okay, now, I set all parameters for this to work, meaning the parameters we believed resolved the problem. The connection is established. So I did nothing for a while. At some point the adapter started giving "22 Invalid argument" error, and the connection failed. I think we should go back to the network trace.

22:03:22.827 00000AF0 HHC00901I 0:0401 QETH: Interface tun0, type TUN opened
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MAC address 02:00:5e:a3:be:84
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using IP address 192.168.1.112
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using subnet mask 255.255.255.0
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using MTU 1500
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive MAC address 96:7a:59:e5:d2:bf
22:03:22.917 00000AF0 HHC03997I 0:0401 QETH: tun0: using drive IP address fe80::967a:59ff:fee5:d2bf
22:03:22.924 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 192.168.1.112
22:03:22.925 00000AF0 HHC03805I 0:0401 QETH: tun0: Register guest IP address 10.1.10.1
22:03:34.693 00000AF0 HHC00801I Processor CP00: Operation exception code 0001 ilc 4
22:03:34.694 00000AF0 HHC02324I CP00: PSW=0704100080000000 000000001FB179D6 INST=B2AF0000     ????? ,                      
.....
23:38:53.653 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:53.653 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:53.663 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:53.663 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:54.127 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:54.127 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:54.923 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument
23:38:54.923 00000AF0 HHC00007I Previous message from function 'write_packet' at qeth.c(2796)
23:38:55.922 00000AF0 HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument

@Fish-Git
Copy link
Member

Fish-Git commented May 17, 2022

I would prefer that we recreate the Hercules crash that you originally reported. That is what I am most interested in: preventing Hercules from crashing.

Since we are having trouble accomplishing that however, we might be able to determine a likely cause using your current testing procedure. So let's do this:

Do the same thing you did above (that caused the 22 Invalid argument errors), and then enter the Hercules ptt command whenever it occurs. That will display the ptt trace table which might provide a clue to what's going on:

  • Start Hercules
  • enter ptt io qeth 50000
  • IPL
  • s txoff
  • s rsed
  • connect from zosexplorer
  • wait for the HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument errors to begin occurring
  • as quickly as possible, enter ptt to dump the trace table (i.e. just "ptt" by itself, without any arguments)

That should display all of the trace entries in the ptt trace table. THAT is what I need to see.

Save the Hercules logfile and post it here.

Thanks.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 18, 2022

Hi. Test done. Hercules didn't crashed and didn't fail, This is the summary:

  • adapter parameters changed as before.
  • arp cache deleted
  • Start Hercules
  • enter ptt io qeth 50000
  • IPL
  • s txoff
  • s rsed
  • connect from zosexplorer

At this point everything worked perfectly. I couldn't reproduce the failure. I waited for a while, and then went to lunch. When I returned the PC was sleeping. When it was awoken, the connection with zosexplorer failed and the invalid argument error message was issued. I then executed ptt, and after a little while I stopped Hercules.

So now it seems I cannot reproduce what was previously a consistently reproducible error.

Regards.

@Fish-Git
Copy link
Member

Fish-Git commented May 18, 2022

Hi. Test done. Hercules didn't crashed and didn't fail, This is the summary:

<snip>

Regards.

Thanks. Unfortunately however, the test is no good. You didn't do it correctly. You entered the ptt command too soon: you entered it before the 22 Invalid argument error messages were issued, so the ptt results did not show anything unusual.

I need you to wait until the 22 Invalid argument error(s) is/are issued, and then issue the ptt command.

I've also been thinking: if you previously added a Windows Firewall rule, maybe you need to disable it to reproduce the original problem?

Please try again, being careful to issue the ptt command after the HHC00911E 0:0402 QETH: Error writing to device tun0: 22 Invalid argument error(s) begin occurring.

Thanks.

@cfdonatucci
Copy link
Author

I'm going to do the test now. When I enabled the firewall rules the problem was not resolved. The problem was resolved (hold the press, remember?) when I disabled jumbo and Large Send Offload. I disabled the firewall rules. Let's see what happens.

@Fish-Git
Copy link
Member

Fish-Git commented May 19, 2022

When I enabled the firewall rules the problem was not resolved. The problem was resolved (hold the press, remember?) when I disabled jumbo and Large Send Offload.

What I remember is, originally, you did not have any special Firewall rules defined. Your Windows Firewall was enabled (which is the default for Windows).

I then recommended that you define a new Firewall rule that would allow any/all packets through (effectively disabling your firewall). I don't recall whether you actually added (defined) such a rule to your Windows Firewall or not, but I do remember that you disabled the Windows Firewall completely, since that was the easiest way to eliminate it as a culprit.

It was sometime after that test (which still failed) that Ian Shorter (@mcisho) noticed your Wireshark trace showed packets trying to be sent that were too large for CTCI-WIN (WinPCap) / Hercules to handle. It was then I recommended disabling Large Send Offload, which, yes, did indeed resolve your problem.

So what I am trying to do is to go back to the way you originally had your system configured that was causing Hercules to crash. If we can recreate your original crash again, then we can see where things might be going wrong. That is what the ptt io qeth 50000 command should do: immediately before Hercules crashes, Hercules will automatically issue the ptt command automatically, so we can see what Hercules was doing.

If we cannot recreate the original crash, then we should try to recreate the "Error writing to device tun0: 22 Invalid argument" error instead (with ptt io qeth 50000 active), and then manually enter the ptt command after the "22 Invalid argument" error(s) start to occur.

Thanks.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 19, 2022

I really don't think the firewall is the issue. I ran tests with the firewall completely disabled and they failed. So far I couldn't make it fail. I'm going to run a last test. I even changed RSED mode from 64 bits to 31 and go back, and nothing.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 19, 2022

I was unable to recreate the original error. To recreate "22 invalid argument", I IPLed Hercules as you said. Nothing happened. So I sent the PC to sleep. After a few minutes I recalled it and the error occurred immediately. After few "22" messages, I executed ptt. I hope this helps. Let me know.

@Fish-Git
Copy link
Member

I really don't think the firewall is the issue.

I do.

When you reported your original problem, the Firewall was enabled, and because it was, it was not allowing any incoming connections. That was part of the original problem. Your zosexplorer client was trying to connect to RSED, but Windows wasn't allowing it because it wasn't allowing any outside packets through the firweall. It was only when we disabled the firewall that we then made some progress. Think. How was your Windows system configured when you first reported your original problem? That is the same configuration I would like to try to recreate.

But forget about that right now. Let's move on.

So I sent the PC to sleep. After a few minutes I recalled it and the error occurred immediately. After few "22" messages, I executed ptt. I hope this helps. Let me know.

Yes, that helped. Thank you.

I would now like you to perform the same test, but this time with the following Hercules commands active:

  • ptt io qeth 5000
  • tt32 debug

This should provide additional (hopefully more detailed) debugging information when the "22" errors occur.

What I suspect is happening is, your network adapter (NIC) is being turned off when your PC goes to sleep, but is not being fully powered on after it wakes up. This is a known problem in Windows:

"When deploying Windows 7 or Windows Server 2008 R2, you may want to disable the following network adapter power management setting on some computers:"

Allow the computer to turn off this device to save power

(emphasis mine)

And:

"Windows never turns off the network card due to inactivity. When this setting is checked (enabled), Windows puts the network card to sleep, and when it resumes, it puts it back to D0. When this setting isn't checked (disabled), Windows completely halts the device, and on resume reinitializes it."

(emphasis mine)

I also found the following web page as well:

"By trial and error and experimentation, I have determined that if I uncheck that box, the issue disappears."

(and, at the very bottom of the following page):

Configuring Power Management (Windows Operating System Only)

"For image streaming applications that run for an extended period of time, we recommend that you change the Windows power plan settings to ensure the computer and NIC never go to sleep. If you do not change the power plan settings, the computer or NIC may go to sleep when there is no activity from the keyboard or mouse for certain amount of time and the image stream may be interrupted."

(and):

To configure the power settings for the NIC

  1. In the Windows Control Panel, click Network and Internet.
  2. Click Network and Sharing Center.
  3. In the left-hand panel, click Change adapter settings.
  4. Right-click the NIC and then click Properties.
  5. Click Configure.
  6. Configure the power management settings for the NIC to ensure the computer does not turn off the device to save power.
  7. Click OK.
    Repeat steps 4-7 for each NIC on your computer.

(emphasis mine)

Bottom line: everywhere I look, disabling (unchecking) the "Allow the computer to turn off this device to save power" option fixes the problem.

But I would like to see another test like you just did, but with tt32 debug enabled as well. I'm hoping maybe it might confirm that the reason for the "22 invalid argument" errors is because the adapter isn't fully powered on after waking up from a sleep.

And if that is true, there is nothing Hercules nor CTCI-WIN can do to fix or workaround the problem. It is just another "User Error" (i.e. not having your system configured properly).

Thanks.


MORE INFO:

nic-properties

nic-configure

nic-power

@cfdonatucci
Copy link
Author

cfdonatucci commented May 21, 2022

Hi I ran the test and got your trace:

Afterwards, I unchecked that box and restarted Hercules and z/OS. After a while, I put the PC to sleep. When it restarted, I didn't get any 22 errors, but the adapter stopped working. Apparently the suspend introduces some problem. Anyway, I set my PC to never sleeps, so I think we are okay unless you want another test.

@Fish-Git
Copy link
Member

Hi I ran the test and got your trace:

Unfortunately, your .zip file is completely empty. It contains .... nothing. No files at all.

@cfdonatucci
Copy link
Author

cfdonatucci commented May 21, 2022

Try this one:

@SDL-Hercules-390 SDL-Hercules-390 deleted a comment from cfdonatucci May 21, 2022
Fish-Git added a commit that referenced this issue May 22, 2022
This is a followup fix to commit 4b439ae (GitHub #458) to try and prevent a possible watchdog thread hung processor false positive crash that can sometimes occur on Windows upon resuming from suspension (awaking from sleep).

This should hopefully fix GitHub issue #489.
@Fish-Git
Copy link
Member

Try this one:

Wow. I wasn't expecting that. Not very helpful at all, but not your fault of course.

It looks like Windows's re-initialization of the network adapter upon resume after suspension ends up killing Hercules's guest's virtual network adapter. I'm guessing that CTCI-WIN's WinPCap sniffing/injection hook on Windows's real physical adapter ends up being invalidated/removed/rendered inaccessible resulting in a permanent loss of connectivity between Hercules (CTCI-WIN/WinPCap) and the adapter. Most unfortunate.

So it looks like setting one's PC to never sleep (which you have already done) is the only solution available to us. There doesn't appear to be any way for us (Hercules/CTCI-WIN/WinPCap) to recover from this.

Nevertheless, since my ultimate goal for this issue was not to try to overcome that problem, but rather to try to prevent Hercules from crashing as a result of the situation, I have added some additional code to Hercules (commit 96b606a) to try and better detect when a resume from suspend (awake from sleep or re-power-on after hibernation) has occurred in order to prevent our watchdog thread from erroneously detecting what it believes is a malfunctioning CPU.

Hopefully, this new code, along with our previous attempt to address the same issue (commit 4b439ae), will, together, prevent Hercules from crashing upon resume from suspend.

<me: fingers crossed>

As I believe we have done all that we can with this issue, I am going to close it at this time. I would like to thank you, Carlos, for your incredible patience. Take care.

p.s. to Mark G. (@dasdman): I apologize for not having implemented your earlier suggestion sooner. At the time, I didn't think it was necessary!

@Fish-Git Fish-Git removed the IN PROGRESS... I'm working on it! (Or someone else is!) label May 22, 2022
@Fish-Git Fish-Git added the Related This issue is closely related to another issue. Consider this issue a "sub-issue" of the other. label May 22, 2022
@cfdonatucci
Copy link
Author

I'm glad this was useful. It was great for me to contribute with you guys, I really admire what you do. Let me know any way I can do anything in the future, tests or whatever. I'm not at your level but I'll try to do my best.
See you!
Carlos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUG The issue describes likely incorrect product functionality that likely needs corrected. (Invalid/PEBKAC) Likely user error. The described problem does not exist or was otherwise determined to be bogus. Related This issue is closely related to another issue. Consider this issue a "sub-issue" of the other.
Projects
None yet
Development

No branches or pull requests

3 participants