[Core] Filter out dead nodes when getting address info from redis #14440

Merged (1 commit) on Mar 8, 2021

@kfstorm (Member) commented Mar 2, 2021

Why are these changes needed?

If the node where the driver sits is dead, the driver still tries to connect to it. Expected behavior: the driver prints a warning log saying that the Raylet was not found.
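As a rough illustration of the idea in the title (skip dead nodes when looking up address info), here is a minimal sketch. It is not the code from this PR: the function name `find_local_raylet`, the sample node entries, and the socket paths are hypothetical, and the dictionary keys (`Alive`, `NodeManagerAddress`, `RayletSocketName`, `NodeManagerPort`) are assumed to match the shape of the node table entries.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


def find_local_raylet(nodes: List[Dict], local_ip: str) -> Optional[Dict]:
    """Return address info for a live raylet on local_ip, or None.

    Hypothetical helper for illustration only; not Ray's actual API.
    """
    for node in nodes:
        # Skip entries that are marked dead so the driver never
        # receives the socket of a raylet that no longer exists.
        if not node.get("Alive", False):
            continue
        if node.get("NodeManagerAddress") == local_ip:
            return {
                "raylet_socket_name": node["RayletSocketName"],
                "node_manager_port": node["NodeManagerPort"],
            }
    # No live node matched: warn instead of handing back a stale address.
    logger.warning("No live Raylet found on node %s.", local_ip)
    return None


# Made-up node table: the first node is dead, the second is alive.
nodes = [
    {
        "Alive": False,
        "NodeManagerAddress": "100.88.111.11",
        "RayletSocketName": "/tmp/ray/example_session_a/sockets/raylet",
        "NodeManagerPort": 1111,
    },
    {
        "Alive": True,
        "NodeManagerAddress": "100.88.111.12",
        "RayletSocketName": "/tmp/ray/example_session_b/sockets/raylet",
        "NodeManagerPort": 2222,
    },
]

dead_lookup = find_local_raylet(nodes, "100.88.111.11")
live_lookup = find_local_raylet(nodes, "100.88.111.12")
```

With a filter like this, the lookup for the dead node yields nothing and a warning is logged, rather than the driver retrying against a socket that nobody is listening on.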

Repro script:

$ ray start --head
Local node IP: 100.88.111.11
2021-03-04 08:11:52,461	INFO services.py:1228 -- View the Ray dashboard at http://127.0.0.1:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='100.88.111.11:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop

$ pkill raylet

$ python
Python 3.7.2 (default, Jun 17 2019, 15:33:44)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init(address='100.88.111.11:6379', _redis_password='5241590000000000')
2021-03-04 08:13:29,382	INFO worker.py:665 -- Connecting to existing Ray cluster at address: 100.88.111.11:6379
Aborted

$ cat /tmp/ray/session_latest/logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_30594.log
[2021-03-04 08:13:29,396 I 30594 30594] core_worker.cc:136: Constructing CoreWorkerProcess. pid: 30594
[2021-03-04 08:13:29,405 I 30594 30594] core_worker.cc:310: Constructing CoreWorker, worker_id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2021-03-04 08:13:30,405 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 1, num_retries = 10)
[2021-03-04 08:13:31,405 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 2, num_retries = 10)
[2021-03-04 08:13:32,406 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 3, num_retries = 10)
[2021-03-04 08:13:33,406 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 4, num_retries = 10)
[2021-03-04 08:13:34,406 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 5, num_retries = 10)
[2021-03-04 08:13:35,406 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 6, num_retries = 10)
[2021-03-04 08:13:36,406 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 7, num_retries = 10)
[2021-03-04 08:13:37,407 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 8, num_retries = 10)
[2021-03-04 08:13:38,407 I 30594 30594] client_connection.cc:53: Retrying to connect to socket for endpoint /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet (num_attempts = 9, num_retries = 10)
[2021-03-04 08:13:39,408 C 30594 30594] raylet_client.cc:57: Could not connect to socket /tmp/ray/session_2021-03-04_08-11-50_034501_29511/sockets/raylet
[2021-03-04 08:13:39,408 E 30594 30594] logging.cc:435: *** Aborted at 1614816819 (unix time) try "date -d @1614816819" if you are using GNU date ***
[2021-03-04 08:13:39,409 E 30594 30594] logging.cc:435: PC: @                0x0 (unknown)
[2021-03-04 08:13:39,409 E 30594 30594] logging.cc:435: *** SIGABRT (@0x70900007782) received by PID 30594 (TID 0x7fe284d4d700) from PID 30594; stack trace: ***
[2021-03-04 08:13:39,410 E 30594 30594] logging.cc:435:     @     0x7fe2841c65d0 (unknown)
[2021-03-04 08:13:39,411 E 30594 30594] logging.cc:435:     @     0x7fe2837148af __GI_raise
[2021-03-04 08:13:39,412 E 30594 30594] logging.cc:435:     @     0x7fe2837164aa __GI_abort
[2021-03-04 08:13:39,413 E 30594 30594] logging.cc:435:     @     0x7fe27ae5a121 ray::SpdLogMessage::Flush()
[2021-03-04 08:13:39,414 E 30594 30594] logging.cc:435:     @     0x7fe27ae5a37c ray::RayLog::~RayLog()
[2021-03-04 08:13:39,415 E 30594 30594] logging.cc:435:     @     0x7fe27ab6f509 ray::raylet::RayletConnection::RayletConnection()
[2021-03-04 08:13:39,417 E 30594 30594] logging.cc:435:     @     0x7fe27ab6fea3 ray::raylet::RayletClient::RayletClient()
[2021-03-04 08:13:39,418 E 30594 30594] logging.cc:435:     @     0x7fe27ab06590 ray::CoreWorker::CoreWorker()
[2021-03-04 08:13:39,419 E 30594 30594] logging.cc:435:     @     0x7fe27ab0adc6 ray::CoreWorkerProcess::CreateWorker()
[2021-03-04 08:13:39,421 E 30594 30594] logging.cc:435:     @     0x7fe27ab0bb62 ray::CoreWorkerProcess::CoreWorkerProcess()
[2021-03-04 08:13:39,422 E 30594 30594] logging.cc:435:     @     0x7fe27ab0c595 ray::CoreWorkerProcess::Initialize()
[2021-03-04 08:13:39,423 E 30594 30594] logging.cc:435:     @     0x7fe27aa1f7c4 __pyx_pf_3ray_7_raylet_10CoreWorker___cinit__()
[2021-03-04 08:13:39,424 E 30594 30594] logging.cc:435:     @     0x7fe27aa207f6 __pyx_tp_new_3ray_7_raylet_CoreWorker()
[2021-03-04 08:13:39,426 E 30594 30594] logging.cc:435:     @     0x7fe2844c3ca3 type_call
[2021-03-04 08:13:39,427 E 30594 30594] logging.cc:435:     @     0x7fe28446acd4 _PyObject_FastCallKeywords
[2021-03-04 08:13:39,428 E 30594 30594] logging.cc:435:     @     0x7fe2844437d2 _PyEval_EvalFrameDefault
[2021-03-04 08:13:39,429 E 30594 30594] logging.cc:435:     @     0x7fe284551488 _PyEval_EvalCodeWithName
[2021-03-04 08:13:39,430 E 30594 30594] logging.cc:435:     @     0x7fe28446a5f8 _PyFunction_FastCallKeywords
[2021-03-04 08:13:39,431 E 30594 30594] logging.cc:435:     @     0x7fe284444dcc _PyEval_EvalFrameDefault
[2021-03-04 08:13:39,431 E 30594 30594] logging.cc:435:     @     0x7fe284551488 _PyEval_EvalCodeWithName
[2021-03-04 08:13:39,432 E 30594 30594] logging.cc:435:     @     0x7fe28446a3d7 _PyFunction_FastCallDict
[2021-03-04 08:13:39,433 E 30594 30594] logging.cc:435:     @     0x7fe284441086 _PyEval_EvalFrameDefault
[2021-03-04 08:13:39,434 E 30594 30594] logging.cc:435:     @     0x7fe284551488 _PyEval_EvalCodeWithName
[2021-03-04 08:13:39,435 E 30594 30594] logging.cc:435:     @     0x7fe28446a5f8 _PyFunction_FastCallKeywords
[2021-03-04 08:13:39,436 E 30594 30594] logging.cc:435:     @     0x7fe284444dcc _PyEval_EvalFrameDefault
[2021-03-04 08:13:39,437 E 30594 30594] logging.cc:435:     @     0x7fe284551488 _PyEval_EvalCodeWithName
[2021-03-04 08:13:39,438 E 30594 30594] logging.cc:435:     @     0x7fe2845515dd PyEval_EvalCodeEx
[2021-03-04 08:13:39,439 E 30594 30594] logging.cc:435:     @     0x7fe28455162b PyEval_EvalCode
[2021-03-04 08:13:39,440 E 30594 30594] logging.cc:435:     @     0x7fe28458c1e3 PyRun_InteractiveOneObjectEx
[2021-03-04 08:13:39,441 E 30594 30594] logging.cc:435:     @     0x7fe28458c486 PyRun_InteractiveLoopFlags
[2021-03-04 08:13:39,442 E 30594 30594] logging.cc:435:     @     0x7fe28458cd3e PyRun_AnyFileExFlags
[2021-03-04 08:13:39,443 E 30594 30594] logging.cc:435:     @     0x7fe2845af411 pymain_main

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jovany-wang (Contributor) left a comment

The code LGTM, but I'm not sure whether this is an API change.

@kfstorm (Member, Author) commented Mar 5, 2021

@jovany-wang Why do you think it might be an API change?

@jovany-wang (Contributor)

Please disregard my comment. It's not an API change, just a Ray internal method.

@kfstorm kfstorm merged commit 7977474 into ray-project:master Mar 8, 2021
@kfstorm kfstorm deleted the get_address_info branch March 8, 2021 07:48