[Jobs] [Logs] Stream runtime env log to job log file #44405

Merged

@architkulkarni architkulkarni commented Apr 2, 2024

Why are these changes needed?

When a job is submitted, it stays in a "PENDING" state before becoming "RUNNING". In practice, this is nearly always because the runtime_env is being installed. However, it's not obvious where to see the progress of the runtime_env installation.

For better observability, this PR streams the runtime_env log to the existing job-driver-<submission_id>.log file.

Note that currently the Job log API itself won't return the runtime_env setup log until the job starts, because the job_head dashboard module needs to know which node the runtime env is being installed on before the job starts, and we don't have a way of surfacing that information yet. I'll file an issue and link it here. The log file itself will still have the runtime_env setup logs written to it in real time.

Currently this feature is gated behind an environment variable (defaults to OFF), in case the logs are too spammy.
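As a rough illustration of the idea described above (not the code in this PR), the sketch below attaches an extra file handler to a logger only when an opt-in environment variable is set, so that setup messages also land in `job-driver-<submission_id>.log`. The variable name `EXAMPLE_STREAM_RUNTIME_ENV_LOG`, the helper `attach_job_log`, and the `log_dir` argument are all hypothetical; the real flag name and wiring in Ray may differ.

```python
import logging
import os
from typing import Mapping, Optional

# Hypothetical flag name, used only for this sketch (not taken from the PR).
STREAM_FLAG = "EXAMPLE_STREAM_RUNTIME_ENV_LOG"


def attach_job_log(
    logger: logging.Logger,
    log_dir: str,
    submission_id: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[logging.FileHandler]:
    """Also send `logger` output to the job driver log file, if enabled.

    Returns the handler that was attached, or None when the flag is off
    (the default), in which case the logger is left untouched.
    """
    environ = os.environ if environ is None else environ
    if environ.get(STREAM_FLAG, "0") != "1":
        return None

    # File name pattern from the PR description.
    path = os.path.join(log_dir, f"job-driver-{submission_id}.log")
    handler = logging.FileHandler(path)  # opens in append mode by default
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return handler
```

Keeping the default at OFF means existing deployments see no change in their job logs unless they opt in.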

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@architkulkarni architkulkarni changed the title [WIP] Stream runtime env log to job log file [Jobs] [Logs] Stream runtime env log to job log file Apr 4, 2024
@architkulkarni architkulkarni marked this pull request as ready for review April 4, 2024 20:14

jjyao commented Apr 8, 2024

@architkulkarni this fixes #44303 right?

@architkulkarni

@jjyao No, this PR is just about adding a new data path for where the runtime_env logs end up. The issue you linked is about making the runtime env logs stream within a single step. Currently, the logs for each step (for example, pip install is one step) only appear after the entire step is complete. So that issue will still need to be fixed for this PR to be more useful.
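To make the distinction concrete, here is a minimal sketch of what streaming within a single step could look like in general: each line of a subprocess's output is written to the log file as soon as it is produced, rather than after the process exits. This is only an illustration using the standard library; `run_step_streaming` is a hypothetical helper and is not how Ray's runtime env agent is implemented.

```python
import subprocess
from typing import List


def run_step_streaming(cmd: List[str], log_path: str) -> int:
    """Run one setup step and append its output to `log_path` line by line.

    Returns the exit code of the step.
    """
    # buffering=1 makes the log file line-buffered, so each line is
    # flushed to disk as soon as it is written.
    with open(log_path, "a", buffering=1) as log_file:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
        )
        assert proc.stdout is not None
        # Iterating over the pipe yields lines as the child emits them.
        for line in proc.stdout:
            log_file.write(line)
        return proc.wait()
```

Note that the child process may still buffer its own output, so how promptly lines arrive also depends on the tool being run.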

@architkulkarni

@rynewang maybe you can help review this one?

@rynewang rynewang left a comment

LGTM. Note that this way the runtime env logs are more exposed to the user, so we may want to audit whether some INFO-level logs are too verbose. I remember that every time we increase or decrease the ref count (that is, every time we spawn or kill a worker) there's a log line. Maybe we want to reduce those to DEBUG.

@architkulkarni architkulkarni merged commit 01d975e into ray-project:master Apr 10, 2024
5 checks passed
architkulkarni added a commit that referenced this pull request Apr 17, 2024
…44742)

Followup to #44405

runtime_env fields can be reset to None at runtime, which causes the following error:

2024-04-15 13:04:56,871	WARNING job_manager.py:1009 -- Failed to start supervisor actor for job raysubmit_UTb99vaR1DmJ9rkw: ''NoneType' object does not support item assignment'. Full traceback:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 989, in submit_job
    runtime_env=self._get_supervisor_runtime_env(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 818, in _get_supervisor_runtime_env
    config["log_files"] = [self._log_client.get_log_file_path(submission_id)]
TypeError: 'NoneType' object does not support item assignment

This PR explicitly checks for None before using config as a dict, fixing the above error.

It also includes the full traceback in the error log to make this kind of error easier to debug in the future.
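A minimal sketch of the kind of guard described above, assuming `config` is the `"config"` field of a runtime_env dict (as the traceback suggests). The helper names `with_log_file` and `describe_failure` are hypothetical and are not the functions in `job_manager.py`.

```python
import copy
import traceback
from typing import Any, Dict, Optional


def with_log_file(
    runtime_env: Optional[Dict[str, Any]], log_file_path: str
) -> Dict[str, Any]:
    """Return a copy of `runtime_env` with `config["log_files"]` set.

    Handles both a missing runtime_env and a `"config"` field that is
    present but set to None, which is the case that raised the TypeError.
    """
    env = copy.deepcopy(runtime_env) if runtime_env is not None else {}
    config = env.get("config")
    if config is None:
        # Explicit None check before treating config as a dict.
        config = {}
        env["config"] = config
    config["log_files"] = [log_file_path]
    return env


def describe_failure(exc: BaseException) -> str:
    """Format an error message that carries the full traceback."""
    tb = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return f"Failed to start supervisor actor: {exc!r}. Full traceback:\n{tb}"
```

Copying the input first avoids mutating the caller's runtime_env, which is a design choice of this sketch rather than something stated in the PR.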

---------

Signed-off-by: Archit Kulkarni <[email protected]>
harborn pushed a commit to harborn/ray that referenced this pull request Apr 18, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
6 participants