Improve error messages when nodes can't communicate with each other. #223

Merged
merged 3 commits into from
Jan 22, 2017
Print more information when starting the head node.
robertnishihara committed Jan 22, 2017
commit 642423b313ebbe0950e358fa606719f7da65c326
5 changes: 2 additions & 3 deletions python/ray/services.py
@@ -145,7 +145,7 @@ def get_node_ip_address(address="8.8.8.8:53"):
s.connect((host, int(port)))
return s.getsockname()[0]

-def wait_for_redis_to_start(redis_host, redis_port, num_retries=5):
+def wait_for_redis_to_start(redis_host, redis_port, num_retries=2):
"""Wait for a Redis server to be available.

This is accomplished by creating a Redis client and sending a random command
@@ -161,13 +161,12 @@ def wait_for_redis_to_start(redis_host, redis_port, num_retries=5):
Exception: An exception is raised if we could not connect with Redis.
"""
redis_client = redis.StrictRedis(host=redis_host, port=redis_port)
-print("Redis Client Started!")
# Wait for the Redis server to start.
counter = 0
while counter < num_retries:
try:
# Run some random command and see if it worked.
-print("Waiting for %s to respond..." % redis_host)
+print("Waiting for redis server at {}:{} to respond...".format(redis_host, redis_port))
redis_client.client_list()
except redis.ConnectionError as e:
# Wait a little bit.
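The retry pattern that `wait_for_redis_to_start` describes in its docstring (issue a cheap command, and on a connection error wait and try again up to `num_retries` times) can be sketched roughly as follows. This is a minimal illustration, not the code in `services.py`: `wait_for_service`, `probe`, `delay_seconds`, and `sleep` are hypothetical names, and the built-in `ConnectionError` stands in for `redis.ConnectionError` so the sketch runs without a Redis server.

```python
import time


def wait_for_service(probe, num_retries=2, delay_seconds=1.0, sleep=time.sleep):
    """Call `probe` until it stops raising ConnectionError.

    All names here are hypothetical and chosen for illustration only.
    Returns the number of failed attempts that preceded the success.
    """
    for attempt in range(num_retries):
        try:
            # Any cheap command works as a probe; it only has to raise
            # ConnectionError while the server is unreachable.
            probe()
            return attempt
        except ConnectionError:
            # The server may still be starting, so wait before retrying.
            sleep(delay_seconds)
    raise Exception("Unable to connect after {} attempts.".format(num_retries))
```

With a real client, the probe would presumably be something like `redis_client.client_list`, the call used in the hunk above.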
4 changes: 3 additions & 1 deletion python/ray/worker.py
@@ -674,7 +674,9 @@ def get_address_info_from_redis(redis_address, node_ip_address, num_retries=5):
if counter == num_retries:
raise
# Some of the information may not be in Redis yet, so wait a little bit.
-print("Some processes that the driver needs to connect to have not registered with Redis, so retrying.")
+print("Some processes that the driver needs to connect to have not "
+"registered with Redis, so retrying. Have you run "
Contributor Author

Should this be placed outside the for loop?

Contributor Author

Specifically, the last part: "have you tried..."

Collaborator

I addressed this comment, but now I'm concerned that the user won't see the message telling them to run scripts/start_ray.sh. Seeing the message multiple times is annoying, but this way might be worse.

+"./scripts/start_ray.sh on this node?")
time.sleep(1)
counter += 1

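The review thread above weighs two options: repeating the "Have you run ./scripts/start_ray.sh" hint on every retry, or moving it out of the loop where the user might miss it. One possible middle ground, which is not what this PR does, is to log the short retry notice every time but append the hint only on the first failure. The sketch below assumes hypothetical names throughout (`retry_with_hint`, `fetch`, `hint`, `log`, `sleep`).

```python
import time


def retry_with_hint(fetch, num_retries=5, hint="", sleep=time.sleep, log=print):
    """Retry `fetch`, appending `hint` to the log message only once.

    All names are hypothetical. The last failure is re-raised unchanged,
    mirroring the bare `raise` in the hunk above.
    """
    for attempt in range(num_retries):
        try:
            return fetch()
        except Exception:
            if attempt == num_retries - 1:
                # Out of retries: let the caller see the original error.
                raise
            message = "Some processes have not registered yet, so retrying."
            if attempt == 0 and hint:
                # Show the actionable hint on the first failure only.
                message += " " + hint
            log(message)
            sleep(1)
```

The user then sees the hint early, while later retries stay short.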
23 changes: 22 additions & 1 deletion scripts/start_ray.py
@@ -51,6 +51,22 @@ def check_no_existing_redis_clients(node_ip_address, redis_address):
num_workers=args.num_workers,
cleanup=False,
redirect_output=True)
+print(address_info)
+print("\nStarted Ray with {} workers on this node. A different number of "
+"workers can be set with the --num-workers flag (but you have to "
+"first terminate the existing cluster). You can add additional nodes "
+"to the cluster by calling\n\n"
+" ./scripts/start_ray.sh --redis-address {}\n\n"
+"from the node you wish to add. You can connect a driver to the "
+"cluster from Python by running\n\n"
+" import ray\n"
+" ray.init(redis_address=\"{}\")\n\n"
+"If you have trouble connecting from a different machine, check that "
+"your firewall is configured properly. If you wish to terminate the "
+"processes that have been started, run\n\n"
+" ./scripts/stop_ray.sh".format(args.num_workers,
+address_info["redis_address"],
+address_info["redis_address"]))
else:
# Start Ray on a non-head node.
if args.redis_address is None:
@@ -74,4 +90,9 @@ def check_no_existing_redis_clients(node_ip_address, redis_address):
num_workers=args.num_workers,
cleanup=False,
redirect_output=True)
-print(address_info)
+print(address_info)
+print("\nStarted {} workers on this node. A different number of workers "
+"can be set with the --num-workers flag (but you have to first "
+"terminate the existing cluster). If you wish to terminate the "
+"processes that have been started, run\n\n"
+" ./scripts/stop_ray.sh".format(args.num_workers))