Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Windows minion communication unreliable with multi-master failover #60010

Open
WayneSherman opened this issue Apr 12, 2021 · 1 comment
Assignees
Labels
Bug broken, incorrect, or confusing behavior P3 Priority 3 severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around Windows
Milestone

Comments

@WayneSherman
Copy link

WayneSherman commented Apr 12, 2021

Tested Systems Info
Salt master v3003 running on Ubuntu 20.04.2 LTS (kernel 5.4.0-70-generic)
Salt minion v3003 running on Win10 Pro 20H2 (build 19042.870)

Description
The minion keeps attempting to switch master IP addresses even though the first master in the list does not have a communication problem. I think during these change attempts, the minion does not respond properly to requests from the master. Communication with the minion has intermittent and periodic failures.
When master_failback is false, the minion still switches back to the first master in the list.

Setup
A master with two IP address and a firewall to control which IP address the minion can communicate with. (only one IP is enabled at a time)

# Windows Minion Config
master:
  - xxx.xxx.xxx.1
  - xxx.xxx.xxx.2
master_port: xxxxx
master_type: failover
master_alive_interval: 30
master_failback: false

Steps to Reproduce the behavior

  1. Run the master

  2. Set the firewall to permit minion to connect to master IP address 1 and block IP address 2.

  3. Run the minion

  4. Start a ping test on the master
    while sleep 2; do salt 'minion-name' test.ping; done

  5. test.ping has intermittent failures with minion replying to master IP 1. Failures occur every 30 seconds when the alive interval triggers and the minion attempts to switch to master IP 2 (which is blocked).

  6. Set the firewall to permit minion to connect to master IP address 2 and block IP address 1.

  7. test.ping has intermittent failures with minion replying to master IP 2. Failures occur every 30 seconds when the alive interval triggers and the minion attempts to switch to master IP 1 (which is blocked).

  8. Set the firewall to permit minion to connect to master IP address 1 and block IP address 2.

  9. Even though master_failback is false, the minion switches back to the first master in the list (IP 1). test.ping continues to have intermittent failures with minion replying to master IP 1.

Expected behavior
In master_type: failover mode, the minion should continue to communicate with the first master in the list and should not attempt to change masters unless the first master has a communication problem. test.ping from the master should not fail every 30 seconds. The minion should failback to the first master only when master_failback is true.

intermittent and periodic master ping failures

DESKTOP-XXXXX:
    Minion did not return. [No response]

minion log
Notes:

  1. During this section of logging, IP 1 was open and IP 2 was blocked by the firewall.
  2. These come in groups of 3. The minion reports that it changed to a non functioning IP address and then has to change back. There is lots of related logged activity surrounding these entries that has been left out for brevity.
2021-04-11 21:53:47,849 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:53:49,897 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:53:54,992 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.2 to x.x.x.1

2021-04-11 21:54:31,256 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:54:33,301 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:54:38,400 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.2 to x.x.x.1

2021-04-11 21:55:14,302 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:55:16,334 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:55:21,413 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.2 to x.x.x.1

2021-04-11 21:55:57,318 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:55:59,381 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.1 to x.x.x.2
2021-04-11 21:56:04,475 [salt.minion  :193 ][WARNING ][6308] Master ip address changed from x.x.x.2 to x.x.x.1

[INFO    ] Running scheduled job: __master_alive_xx.xx.xx.1 with jid 20210412045344115850
[DEBUG   ] Subprocess SignalHandlingProcess-4-Schedule-__master_alive_xx.xx.xx.1 added
[DEBUG   ] Multiprocessing queue logging configured for the process running under PID: 2228 at log level 10
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] LazyLoaded status.master
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'id': 'DESKTOP-XXXXXX', 'fun': 'status.master', 'fun_args': 
[{'master': 'xx.xx.xx.1', 'connected': True}], 'schedule': '__master_alive_xx.xx.xx.1', 'jid': '20210412045344115850', 'pid': 2228}
[DEBUG   ] LazyLoaded config.get
[DEBUG   ] SaltEvent PUB socket URI: 4510
[DEBUG   ] SaltEvent PULL socket URI: 4511
[DEBUG   ] Sending event: tag = __master_disconnected; data = {'master': 'xx.xx.xx.1', '_stamp': '2021-04-12T04:53:45.819179'}
[DEBUG   ] Closing IPCMessageClient instance
[DEBUG   ] schedule.handle_func: Removing c:\salt\var\cache\salt\minion\proc\20210412045344115850
[DEBUG   ] Minion of 'xx.xx.xx.1' is handling event tag '__master_disconnected'
[INFO    ] Connection to master xx.xx.xx.1 lost
[DEBUG   ] SaltEvent PUB socket URI: 4510
[DEBUG   ] SaltEvent PULL socket URI: 4511
[DEBUG   ] Sending event: tag = /salt/minion/minion_schedule_delete_complete; data = {'complete': True, 'schedule': {'__mine_interval': {'function': 
'mine.update', 'minutes': 60, 'jid_include': True, 'maxrunning': 2, 'run_on_start': True, 'return_job': False, 'enabled': True, 'name': '__mine_interval', 
'_next_fire_time': datetime.datetime(2021, 4, 11, 22, 53, 14, 98015), '_splay': None, '_run_on_start': False, '_seconds': 3600, 
'_next_scheduled_fire_time': datetime.datetime(2021, 4, 11, 22, 53, 14, 98015), 'splay': None, 'run': True, '_last_run': datetime.datetime(2021, 4, 11, 
21, 53, 14, 98015)}, '__master_alive_xx.xx.xx.1': {'function': 'status.master', 'seconds': 30, 'jid_include': True, 'maxrunning': 1, 'return_job': 
False, 'kwargs': {'master': 'xx.xx.xx.1', 'connected': True}, 'enabled': True, 'name': '__master_alive_xx.xx.xx.1', '_next_fire_time': 
datetime.datetime(2021, 4, 11, 21, 54, 14, 114930), '_splay': None, '_seconds': 30, '_next_scheduled_fire_time': datetime.datetime(2021, 4, 11, 21, 53, 
44, 98015), 'splay': None, 'run': True, '_last_run': datetime.datetime(2021, 4, 11, 21, 53, 44, 114930)}}, '_stamp': '2021-04-12T04:53:45.819179'}
[DEBUG   ] Closing IPCMessageClient instance
[DEBUG   ] Persisting schedule
[INFO    ] Trying to tune in to next master from master-list
[INFO    ] Moving possibly failed master xx.xx.xx.1 to the end of the list of masters
[WARNING ] Master ip address changed from xx.xx.xx.1 to xx.xx.xx.2
[DEBUG   ] Master URI: tcp:https://xx.xx.xx.2:xxxxx
[DEBUG   ] Master URI: tcp:https://xx.xx.xx.1:xxxxx
[DEBUG   ] Connecting to master. Attempt 1 (infinite attempts)
[WARNING ] Master ip address changed from xx.xx.xx.1 to xx.xx.xx.2
[DEBUG   ] Master URI: tcp:https://xx.xx.xx.2:xxxxx
[DEBUG   ] Initializing new AsyncAuth for ('c:\\salt\\conf\\pki\\minion', 'DESKTOP-XXXXXX', 'tcp:https://xx.xx.xx.2:xxxxx')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (3810)
[DEBUG   ] Setting zmq_reconnect_ivl to '3810ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'DESKTOP-XXXXXX', 'tcp:https://xx.xx.xx.2:xxxxx', 'clear')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp:https://xx.xx.xx.2:xxxxx
[DEBUG   ] Trying to connect to: tcp:https://xx.xx.xx.2:xxxxx
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] schedule: Job __mine_interval was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: Job __mine_interval was scheduled with a max number of 2
[INFO    ] Running scheduled job: __mine_interval with jid 20210412045349928407
[DEBUG   ] Subprocess SignalHandlingProcess-5-Schedule-__mine_interval added
[DEBUG   ] Subprocess SignalHandlingProcess-4-Schedule-__master_alive_xx.xx.xx.1 cleaned up
[DEBUG   ] Minion of 'xx.xx.xx.2' is handling event tag '/salt/minion/minion_schedule_delete_complete'
[DEBUG   ] Multiprocessing queue logging configured for the process running under PID: 5172 at log level 10
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] Using pkg_resources to load entry points
[DEBUG   ] LazyLoaded mine.update
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'id': 'DESKTOP-XXXXXX', 'fun': 'mine.update', 'fun_args': [], 
'schedule': '__mine_interval', 'jid': '20210412045349928407', 'pid': 5172}
[DEBUG   ] LazyLoaded config.merge
[DEBUG   ] schedule.handle_func: Removing c:\salt\var\cache\salt\minion\proc\20210412045349928407
[DEBUG   ] Subprocess SignalHandlingProcess-5-Schedule-__mine_interval cleaned up
[DEBUG   ] Re-init ZMQ socket: Message timed out
[DEBUG   ] Trying to connect to: tcp:https://xx.xx.xx.2:xxxxx
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[INFO    ] Master xx.xx.xx.2 could not be reached, trying next master (if any)
[WARNING ] Master ip address changed from xx.xx.xx.2 to xx.xx.xx.1
[DEBUG   ] Master URI: tcp:https://xx.xx.xx.1:xxxxx
[DEBUG   ] Initializing new AsyncAuth for ('c:\\salt\\conf\\pki\\minion', 'DESKTOP-XXXXXX', 'tcp:https://xx.xx.xx.1:xxxxx')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (3128)
[DEBUG   ] Setting zmq_reconnect_ivl to '3128ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'DESKTOP-XXXXXX', 'tcp:https://xx.xx.xx.1:xxxxx', 'clear')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp:https://xx.xx.xx.1:xxxxx
[DEBUG   ] Trying to connect to: tcp:https://xx.xx.xx.1:xxxxx
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: c:\salt\conf\pki\minion\minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] Connecting the Minion to the Master publish port, using the URI: tcp:https://xx.xx.xx.1:PUBPORT
@WayneSherman WayneSherman added Bug broken, incorrect, or confusing behavior needs-triage labels Apr 12, 2021
@xeacott
Copy link
Contributor

xeacott commented Apr 14, 2021

Thanks for submitting this issue... read through it a few times and I'll have to work on reproducing this... have you seen this occur on other versions of Windows? Such as like, 18.04 or 19.04 or something? I personally don't have much experience with the firewall and allowing/blocking connections to flip like that... I'll ask @twangboy for his thoughts on this.

@xeacott xeacott added severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around and removed needs-triage labels Apr 16, 2021
@xeacott xeacott added this to the Silicon milestone Apr 16, 2021
@sagetherage sagetherage added the P3 Priority 3 label May 27, 2021
@sagetherage sagetherage modified the milestones: Silicon, Approved Aug 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior P3 Priority 3 severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around Windows
Projects
None yet
Development

No branches or pull requests

4 participants