-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Job] add driver exit code to job info #39675
Conversation
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lg
Signed-off-by: Jonathan Nitisastro <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, just minor comments/questions
@@ -98,3 +98,7 @@ class JobDetails(BaseModel): | |||
None, | |||
description="The node ID of the node the job entrypoint command is running on.", | |||
) | |||
driver_exit_code: Optional[str] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity why Optional[str] and not Optional[int]? Also, in the description we should also define when it would return None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, the actual output is int so shld update that
dashboard/modules/job/common.py
Outdated
@@ -95,6 +95,8 @@ class JobInfo: | |||
#: The node id that driver running on. It will be None only when the job status | |||
# is PENDING, and this field will not be deleted or modified even if the driver dies | |||
driver_node_id: Optional[str] = None | |||
#: The driver process exit code after driver terminates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please define in the comment when this would be None. I think this is what shows up in the auto-generated Ray docs, but it would be good to confirm this and see if there's anywhere else in the docs that we need to update
@@ -908,6 +922,8 @@ async def test_failed_job(self, job_manager): | |||
await async_wait_for_condition_async_predicate( | |||
check_job_failed, job_manager=job_manager, job_id=job_id | |||
) | |||
data = await job_manager.get_job_info(job_id) | |||
assert data.driver_exit_code == -9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment about the meaning of -9 and why it's expected?
src/ray/protobuf/gcs.proto
Outdated
@@ -380,6 +380,8 @@ message JobsAPIInfo { | |||
// The node id that driver running on. It will be None only when the job status | |||
// is PENDING, and this field will not be deleted or modified even if the driver dies | |||
optional string driver_node_id = 13; | |||
// The driver process exit code after driver terminates | |||
optional string driver_exit_code = 14; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, why string instead of int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some test should have failed here, right? Or can a Python int silently be loaded into a string protobuf without any error? (Or is this line of code not currently run in any test?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't fail any tests before hmm
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
Signed-off-by: Jonathan Nitisastro <[email protected]>
Btw, we should follow up with this. Adding the exit code itself is not very useful (we should integrate to prod or Ray dashboard) |
Adding it to the dashboard is tracked here #38142 |
Signed-off-by: Jonathan Nitisastro <[email protected]> Signed-off-by: Victor <[email protected]>
Why are these changes needed?
Add driver exit code to
JobInfo
class which handled byJobManager
Related issue number
Closes #31183
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.