
Revert "Revert "Revert "use an agent-id rather than the process PID (#24968)"…" #25669

Merged
merged 1 commit into master on Jun 13, 2022

Conversation

simon-mo
Contributor

Reverts #25376

This broke the macOS "Ray C++, Java and Libraries" build.

[screenshot of the failing Buildkite build]

@simon-mo simon-mo marked this pull request as ready for review June 10, 2022 18:10
@simon-mo
Contributor Author

The reason for the failure is confusing, but I have seen this one the most often; it seems related:



+ ray start --head --port=6379 --redis-password=123456 --node-ip-address=169.254.18.219
Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]:
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 169.254.18.219
2022-06-02 22:55:35,589	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-017132b5bfb676cd7-1/ray-project/ray-builders-branch/python/ray/node.py", line 318, in __init__
    self.redis_password,
  File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-017132b5bfb676cd7-1/ray-project/ray-builders-branch/python/ray/_private/services.py", line 397, in wait_for_node
    raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/opt/miniconda/bin/ray", line 33, in <module>
    sys.exit(load_entry_point('ray', 'console_scripts', 'ray')())
  File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-017132b5bfb676cd7-1/ray-project/ray-builders-branch/python/ray/scripts/scripts.py", line 2341, in main
    return cli()
  File "/usr/local/opt/miniconda/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/opt/miniconda/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/opt/miniconda/lib/python3.6/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/opt/miniconda/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/opt/miniconda/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-017132b5bfb676cd7-1/ray-project/ray-builders-branch/python/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
    return f(*args, **kwargs)
  File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-017132b5bfb676cd7-1/ray-project/ray-builders-branch/python/ray/scripts/scripts.py", line 738, in start
    ray_params, head=True, shutdown_at_exit=block, spawn_reaper=block
  File "/Users/ec2-user/.buildkite-agent/builds/bk-mac1-branch-queue-i-017132b5bfb676cd7-1/ray-project/ray-builders-branch/python/ray/node.py", line 322, in __init__
    "The current node has not been updated within 30 "
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.


@simon-mo
Contributor Author

Merge criteria: buildkite/ray-builders-branch/mac-apple-ray-c-plus-plus-java-and-libraries should pass:
https://buildkite.com/ray-project/ray-builders-branch/builds/8081#01814ecf-1cfe-4593-a410-0bed7dc26f11

@rkooo567
Contributor

Can you ping the original author?

@simon-mo simon-mo merged commit feb8c29 into master Jun 13, 2022
@simon-mo simon-mo deleted the revert-25376-re-revert-agent-id-issue branch June 13, 2022 16:22
@rkooo567
Contributor

@mattip we reverted this again because it broke the mac build. Can you follow up and tag me on the new PR?

@mattip
Contributor

mattip commented Jun 15, 2022

How can I get the logs of the failing test?

@mattip
Contributor

mattip commented Jun 15, 2022

The error I see is a re-use of already-used port numbers, not anything to do with the dashboard agent ID.

ValueError: Ray component metrics_export is trying to use a port number 65535 that is used by other components.
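The ValueError above is Ray's own port bookkeeping, but the underlying failure mode — one component asking for a port another one already holds — can be sketched in isolation with plain sockets (not Ray code):

```python
import socket

def try_bind(port: int) -> bool:
    """Return True if the port can be bound, False if it is already taken."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return True
    except OSError:  # EADDRINUSE on both Linux and macOS
        return False
    finally:
        s.close()

# Hold a port open, then show that a second bind fails until it is released.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
port = holder.getsockname()[1]
print(try_bind(port))                # False: the port is in use
holder.close()
print(try_bind(port))                # True: the port is free again
```

On a CI machine that recycles port numbers across test runs, the first `try_bind`-style check failing is exactly the symptom in the ValueError, independent of the agent-id change.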

@mattip
Contributor

mattip commented Jun 15, 2022

If the agent id were indeed the cause of the raylet dying, there would be lines like these in the raylet.out log file. Is there a way to obtain the log files of the failed tests?

[2022-04-30 15:37:51,541 I 5028 15284] (raylet.exe) agent_manager.cc:88: Monitor agent process with pid 15680, register timeout 30000ms.
...
thirty seconds later
...
[2022-04-30 15:38:21,555 W 5028 5416] (raylet.exe) agent_manager.cc:94: Agent process with pid 15680 has not registered. ip 127.0.0.1, pid 12332
[2022-04-30 15:38:21,556 W 5028 15284] (raylet.exe) agent_manager.cc:104: Agent process with pid 15680 exit, return value 1067. ip 127.0.0.1. pid 12332
[2022-04-30 15:38:21,556 E 5028 15284] (raylet.exe) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

@mattip
Contributor

mattip commented Jun 15, 2022

PR #24968, which this reverts, was merged May 27, reverted June 1 in #25342, and restored in #25376 on June 2. Did the failures of the test in question track that pattern: start failing May 27, stop failing for the short period June 1 to June 2, and then resume failing June 2 until this PR was merged?

@mattip
Contributor

mattip commented Jun 15, 2022

Answering my own question: yes, the failures track the ups and downs of reverting #24968; it seems the agent id is connected with the cluster test failures on mac/apple.

@mattip
Contributor

mattip commented Jun 15, 2022

The problem appears to be the use of RAY_BACKEND_LOG_LEVEL=debug ray start ..., where the debug logging writes out the command line. It works on Linux but seems to crash on macOS.

@mattip
Contributor

mattip commented Jun 15, 2022

xref #25806

@simon-mo
Contributor Author

ah

  argv.push_back(NULL);

  if (RAY_LOG_ENABLED(DEBUG)) {
    std::stringstream stream;
    stream << "Starting agent process with command:";
    for (const auto &arg : argv) {

the argv.push_back(NULL); looks suspicious — does the const auto & loop even work with NULL as part of argv?

@mattip
Contributor

mattip commented Jun 16, 2022

the argv.push_back(NULL); looks suspicious?

Thanks, removed it in #25806
