[Jobs] [Logs] Stream runtime env log to job log file #44405
Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni this fixes #44303, right?
@jjyao No, this PR is just about adding a new data path for where the runtime_env logs end up. The issue you linked is about making the runtime env logs stream within a single step. Currently, the logs for each step (for example, …
@rynewang maybe you can help review this one?
LGTM. Note that this makes the runtime env logs more exposed to the user, so we may want to audit whether some INFO-level logs are too verbose. I remember that every time we increase or decrease the ref count (that is, every time we spawn or kill a worker) there's a log. Maybe we want to reduce those to DEBUG.
…44742)
Followup to #44405.

runtime_env fields can be reset to None at runtime, which causes the following error:

    2024-04-15 13:04:56,871 WARNING job_manager.py:1009 -- Failed to start supervisor actor for job raysubmit_UTb99vaR1DmJ9rkw: ''NoneType' object does not support item assignment'. Full traceback:
    Traceback (most recent call last):
      File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 989, in submit_job
        runtime_env=self._get_supervisor_runtime_env(
      File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/dashboard/modules/job/job_manager.py", line 818, in _get_supervisor_runtime_env
        config["log_files"] = [self._log_client.get_log_file_path(submission_id)]
    TypeError: 'NoneType' object does not support item assignment

This PR explicitly checks for None before using config as a dict, fixing the above error. It also includes the full traceback in the error log to make this kind of error easier to debug in the future.

---------
Signed-off-by: Archit Kulkarni <[email protected]>
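The None check described in the followup can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `with_log_files` and the exact dictionary layout are assumptions made for illustration, based only on the `config["log_files"] = [...]` assignment visible in the traceback.

```python
import copy
from typing import Any, Dict, Optional


def with_log_files(
    runtime_env: Optional[Dict[str, Any]], log_path: str
) -> Dict[str, Any]:
    """Return a copy of runtime_env whose config lists log_path under "log_files".

    Hypothetical helper: it only demonstrates guarding against None before
    item assignment, which is the failure shown in the traceback.
    """
    # Work on a copy so the caller's dict is not mutated.
    env = copy.deepcopy(runtime_env) if runtime_env is not None else {}

    # The "config" key may be missing OR explicitly set to None, so
    # dict.get with a default is not enough; check for None explicitly.
    config = env.get("config")
    if config is None:
        config = {}

    config["log_files"] = [log_path]
    env["config"] = config
    return env


# A config that was reset to None no longer raises TypeError.
result = with_log_files({"config": None}, "/tmp/job-driver-example.log")
```

Note that `env.get("config", {})` alone would not help here, because the key exists and its value is None.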
Why are these changes needed?
When a job is submitted, it stays in a "PENDING" state before becoming "RUNNING". In practice, this is nearly always because the runtime_env is being installed. However, it's not obvious where to see the progress of the runtime_env installation.
For better observability, this PR streams the runtime_env log to the existing job-driver-<submission_id>.log file.
Note that currently the Job log API itself won't return the runtime_env setup log until the job starts, because the job_head dashboard module needs to know which node the runtime env is being installed on before the job starts, and we don't have a way of surfacing that information yet. I'll file an issue and link it here. The log file itself will still have the runtime_env setup logs written to it in real time.
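As a rough illustration of the idea of writing setup logs to the job log file in real time, the following sketch attaches a standard-library `logging.FileHandler` to a logger. It does not reflect how Ray wires this up internally; the logger name, the helper `attach_job_log_handler`, the temporary directory, and the submission ID are all assumptions made for the example.

```python
import logging
import os
import tempfile


def attach_job_log_handler(logger: logging.Logger, log_dir: str, submission_id: str):
    """Attach a file handler that appends to job-driver-<submission_id>.log."""
    path = os.path.join(log_dir, f"job-driver-{submission_id}.log")
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return path, handler


# Hypothetical setup logger and log directory, for illustration only.
log_dir = tempfile.mkdtemp()
setup_logger = logging.getLogger("example_runtime_env_setup")
setup_logger.setLevel(logging.INFO)

path, handler = attach_job_log_handler(setup_logger, log_dir, "raysubmit_example")

# Each record is written to the file as soon as it is emitted, so a reader
# tailing the file sees installation progress before the job starts.
setup_logger.info("Installing pip packages")

with open(path) as f:
    contents = f.read()

# Detach and close the handler once setup is finished.
setup_logger.removeHandler(handler)
handler.close()
```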
Currently this feature is gated behind an environment variable (defaults to OFF), in case the logs are too spammy.
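A feature gate of this kind is commonly read once from the environment and treated as off unless explicitly enabled. The sketch below shows one way to do that; the variable name `EXAMPLE_STREAM_RUNTIME_ENV_LOG` and the helper `env_flag` are hypothetical and are not the names used by the PR.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean, defaulting to OFF."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


# Hypothetical variable name; unset means the feature stays disabled.
FLAG_NAME = "EXAMPLE_STREAM_RUNTIME_ENV_LOG"
stream_enabled = env_flag(FLAG_NAME)

if stream_enabled:
    print("Streaming runtime_env setup logs to the job log file")
```

Defaulting to off keeps existing log output unchanged for users who have not opted in.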
Related issue number
Checks
… git commit -s) in this PR.
… scripts/format.sh to lint the changes in this PR.
… method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.