add log for master svc check #1160

Merged
Changes from 2 commits
16 changes: 13 additions & 3 deletions dlrover/python/master/scaler/pod_scaler.py
@@ -120,7 +120,7 @@
 # the service is not avalilable.
 logger.info(
     f"The service {master_addr} is not available and "
-    "use the IP of master Pod."
+    f"use the IP of master Pod."
 )
 master_ip = os.getenv("POD_IP", "")
 if not master_ip:
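The change in this hunk only adds an `f` prefix to a literal that has no placeholders, so the logged message stays the same: Python joins adjacent string literals into one string whether or not each piece carries the prefix. A minimal sketch of that equivalence follows; the `master_addr` value is made up for illustration and does not come from the PR.

```python
# Hypothetical address, used only to make the snippet runnable.
master_addr = "demo-master:50001"

# Message as written before the change.
before = (
    f"The service {master_addr} is not available and "
    "use the IP of master Pod."
)

# Message as written after the change.
after = (
    f"The service {master_addr} is not available and "
    f"use the IP of master Pod."
)

# Both spellings build the same string.
print(before == after)
```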
@@ -496,14 +496,24 @@

     def _check_master_service_avaliable(self, host, port, timeout=15):
         """Verify that the master grpc servicer is available."""
-        for _ in range(timeout):
+        for i in range(timeout):
             try:
                 telnetlib.Telnet(host=host, port=port, timeout=3)
                 return True
             except socket.gaierror:
+                logger.warning(
+                    f"Attempt {i}: Encountered gaierror while "
+                    f"performing master service check."
+                )
                 return False
-            except Exception:
+            except Exception as e:
+                logger.warning(
+                    f"Attempt {i}: Encountered {str(e)} while "
+                    f"performing master service check."
+                )
                 time.sleep(1)
+
+        logger.warning(f"Master service check failed after {timeout} retries.")
         return False

     def _patch_tf_config_into_env(self, pod, node: Node, pod_stats, ps_addrs):

Review thread on the added gaierror logger.warning call: BalaBalaYi marked this conversation as resolved.
Check warning on line 510 in dlrover/python/master/scaler/pod_scaler.py (Codecov / codecov/patch): added lines #L509 - L510 were not covered by tests.
Check warning on line 516 in dlrover/python/master/scaler/pod_scaler.py (Codecov / codecov/patch): added line #L516 was not covered by tests.
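Codecov flags the new warning calls in the generic-exception branch and after the loop as untested. One way such branches could be exercised without a network or real sleeping is to patch the connection call and `time.sleep` with `unittest.mock`. The sketch below is an assumption-laden illustration, not DLRover test code: `check_port` is a stand-in that mirrors the loop in `_check_master_service_avaliable`, it uses `socket.create_connection` in place of `telnetlib.Telnet`, and the logger name is hypothetical.

```python
import logging
import socket
import time
import unittest
from unittest import mock

logger = logging.getLogger("check_port_sketch")  # hypothetical logger name


def check_port(host, port, retries=3):
    """Stand-in that mirrors the loop in the diff; not DLRover code."""
    for i in range(retries):
        try:
            socket.create_connection((host, port), timeout=3).close()
            return True
        except socket.gaierror:
            logger.warning(f"Attempt {i}: Encountered gaierror.")
            return False
        except Exception as e:
            logger.warning(f"Attempt {i}: Encountered {e}.")
            time.sleep(1)
    logger.warning(f"Check failed after {retries} retries.")
    return False


class CheckPortTest(unittest.TestCase):
    @mock.patch("time.sleep")
    @mock.patch("socket.create_connection")
    def test_generic_error_retries_then_fails(self, connect, sleep):
        # Every attempt fails with a non-gaierror exception.
        connect.side_effect = ConnectionRefusedError("refused")
        with self.assertLogs(logger, level="WARNING") as captured:
            self.assertFalse(check_port("localhost", 1, retries=3))
        # Three per-attempt warnings plus the final summary warning.
        self.assertEqual(len(captured.records), 4)
        self.assertEqual(connect.call_count, 3)
        self.assertEqual(sleep.call_count, 3)

    @mock.patch("time.sleep")
    @mock.patch("socket.create_connection")
    def test_gaierror_returns_immediately(self, connect, sleep):
        # Name resolution failure ends the check on the first attempt.
        connect.side_effect = socket.gaierror("name resolution failed")
        with self.assertLogs(logger, level="WARNING") as captured:
            self.assertFalse(check_port("no-such-host", 1, retries=3))
        self.assertEqual(len(captured.records), 1)
        self.assertEqual(connect.call_count, 1)
        sleep.assert_not_called()
```

Saved as a module, the two cases can be run with `python -m unittest`. Against the real method, the patch targets would need to name the module where `telnetlib` and `time` are looked up.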
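Two properties of the check in `_check_master_service_avaliable` are worth spelling out. First, the parameter named `timeout` is used as an attempt count: each attempt can wait up to 3 seconds for the connection and then sleeps 1 second, so the default of 15 bounds the total wait at roughly a minute rather than 15 seconds. Second, `telnetlib` was deprecated in Python 3.11 and removed in Python 3.13 (PEP 594), so the import fails on newer interpreters. The following is a minimal sketch, not the DLRover implementation, of the same retry-and-log loop built on `socket.create_connection`; the function name `check_tcp_available`, the `retries` and `interval` parameters, and the logger name are assumptions made for illustration.

```python
import logging
import socket
import time

logger = logging.getLogger("master_check_sketch")  # hypothetical logger name


def check_tcp_available(host, port, retries=15, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds."""
    for i in range(retries):
        try:
            # Open and immediately close a connection; 3 s per attempt,
            # matching the timeout passed to telnetlib.Telnet in the diff.
            with socket.create_connection((host, port), timeout=3):
                return True
        except socket.gaierror:
            # Name resolution failed: give up at once, as the diff does.
            logger.warning(
                "Attempt %d: gaierror during master service check.", i
            )
            return False
        except OSError as e:
            # Narrower than the diff's `except Exception`; OSError covers
            # refused connections and connection timeouts.
            logger.warning(
                "Attempt %d: %s during master service check.", i, e
            )
            time.sleep(interval)
    logger.warning("Master service check failed after %d retries.", retries)
    return False
```

A call such as `check_tcp_available("127.0.0.1", 50001, retries=3)` (the port is a placeholder) returns `True` as soon as something accepts the connection and `False` after three failed attempts.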