-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][state] Record file offsets instead of logging magic token to track task log #35572
Conversation
Signed-off-by: Ricky Xu <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: Ricky Xu <[email protected]>
Signed-off-by: Ricky Xu <[email protected]>
# look for the file offset with the line count | ||
start_offset = find_start_offset_last_n_lines_from_offset( | ||
f, offset=end_offset, n=lines | ||
start_offset = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need to read file to find the task offsets.
Signed-off-by: Ricky Xu <[email protected]>
Signed-off-by: Ricky Xu <[email protected]>
Signed-off-by: Ricky Xu <[email protected]>
I remember this causes the regression in the past too. Any guess why it doesn't introduce regression anymore?
Also, this behavior seems very confusing. I'd like to discuss with @scottsun94 for the best behavior in this case. I think we should always print the warning or something like that since it is a common to see logs for async actors (i.e., serve) |
Also there's a merge conflict! |
Can you guys help me understand how this will affect the current log ui? cc: @alanwguo Task log: this should only affect the concurrent actor method? Instead of showing the logs for that specific task, the actor logs will be shown in stead. Actor log: this shouldn't affect the log of any actor. Is my understanding correct? |
5 similar comments
Can you guys help me understand how this will affect the current log ui? cc: @alanwguo Task log: this should only affect the concurrent actor method? Instead of showing the logs for that specific task, the actor logs will be shown in stead. Actor log: this shouldn't affect the log of any actor. Is my understanding correct? |
Can you guys help me understand how this will affect the current log ui? cc: @alanwguo Task log: this should only affect the concurrent actor method? Instead of showing the logs for that specific task, the actor logs will be shown in stead. Actor log: this shouldn't affect the log of any actor. Is my understanding correct? |
Can you guys help me understand how this will affect the current log ui? cc: @alanwguo Task log: this should only affect the concurrent actor method? Instead of showing the logs for that specific task, the actor logs will be shown in stead. Actor log: this shouldn't affect the log of any actor. Is my understanding correct? |
Can you guys help me understand how this will affect the current log ui? cc: @alanwguo Task log: this should only affect the concurrent actor method? Instead of showing the logs for that specific task, the actor logs will be shown in stead. Actor log: this shouldn't affect the log of any actor. Is my understanding correct? |
Can you guys help me understand how this will affect the current log ui? cc: @alanwguo Task log: this should only affect the concurrent actor method? Instead of showing the logs for that specific task, the actor logs will be shown in stead. Actor log: this shouldn't affect the log of any actor. Is my understanding correct? |
Signed-off-by: Ricky Xu <[email protected]>
Signed-off-by: Ricky Xu <[email protected]>
Added the dashboard front end change as well. @alanwguo @scottsun94 lmk what you think: |
Synced offline: this is now the behaviour:
|
we will merge this PR first for the revert to resolve conflicts: #35687 |
@rickyyx I was expecting something like this in the task detail page. |
I see - this makes sense. Took a brief look at the code, I am a bit uncertain how I should do it (Guess will need to pass a special case variable into I will probably remove the front-end change from this PR - and we could iterate and cherry pick in another one. |
Actually - if we also return what you write as the log content in state API, could that work?
|
Not sure.. Need @alanwguo to confirm. This is how @alanwguo implemented the "driver logs" message https://github.com/ray-project/ray/pull/35328/files#diff-8f6d8abcea0a9396076b5e65e3724ed386e45e636f1425d145dde2d045b9aa86R49-R62 for your reference. |
Signed-off-by: Ricky Xu <[email protected]>
Signed-off-by: Ricky Xu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments
Signed-off-by: Ricky Xu <[email protected]>
@@ -186,6 +204,151 @@ async def _resolve_worker_file( | |||
return filename | |||
return None | |||
|
|||
async def _resolve_actor_filename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from the original resolve_filename
routine
) | ||
return node_id, log_filename | ||
|
||
async def _resolve_task_filename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from the original resolve_filename
routine
Since we changed the mechanism in the very last minute, we should do dogfooding |
also don't forget to cherry pick! |
…rack task log (ray-project#35572) We have seen regression from 1_1_actor_calls_concurrent and 1_1_actor_calls_async due to the additional overheads of writing magic token to worker files when starting and finishing execution of a task. It's extremely bad for concurrent actors due to the contention on the same worker file. This PR: Skips recording the task start/end for concurrent actor tasks. And augment the task log querying behaviour to be: Normal task log: we will still be able record the offsets and stream task log from dashboard + CLI Normal actor task log: same as above ^ Async actor task + threaded actor task (max_concurrency>1) : On CLI: we will raise error and let users know to use ray logs actor --id for actor log. Revert to the old way of recording file offsets rather than writing magic tokens so that the overhead is actually lower and prevents the pollution of actor logs.
…rack task log (ray-project#35572) We have seen regression from 1_1_actor_calls_concurrent and 1_1_actor_calls_async due to the additional overheads of writing magic token to worker files when starting and finishing execution of a task. It's extremely bad for concurrent actors due to the contention on the same worker file. This PR: Skips recording the task start/end for concurrent actor tasks. And augment the task log querying behaviour to be: Normal task log: we will still be able record the offsets and stream task log from dashboard + CLI Normal actor task log: same as above ^ Async actor task + threaded actor task (max_concurrency>1) : On CLI: we will raise error and let users know to use ray logs actor --id for actor log. Revert to the old way of recording file offsets rather than writing magic tokens so that the overhead is actually lower and prevents the pollution of actor logs.
…rack task log (ray-project#35572) We have seen regression from 1_1_actor_calls_concurrent and 1_1_actor_calls_async due to the additional overheads of writing magic token to worker files when starting and finishing execution of a task. It's extremely bad for concurrent actors due to the contention on the same worker file. This PR: Skips recording the task start/end for concurrent actor tasks. And augment the task log querying behaviour to be: Normal task log: we will still be able record the offsets and stream task log from dashboard + CLI Normal actor task log: same as above ^ Async actor task + threaded actor task (max_concurrency>1) : On CLI: we will raise error and let users know to use ray logs actor --id for actor log. Revert to the old way of recording file offsets rather than writing magic tokens so that the overhead is actually lower and prevents the pollution of actor logs. Signed-off-by: e428265 <[email protected]>
Why are these changes needed?
We have seen regression from
1_1_actor_calls_concurrent
and1_1_actor_calls_async
due to the additional overheads of writing magic token to worker files when starting and finishing execution of a task.It's extremely bad for concurrent actors due to the contention on the same worker file.
This PR:
Skips recording the task start/end for concurrent actor tasks. And augment the task log querying behaviour to be:
Revert to the old way of recording file offsets rather than writing magic tokens so that the overhead is actually lower and prevents the pollution of actor logs.
Microbenchmark results:
The ~8% regression on
1_1_actor_calls_async
: I would argue for tolerating this regression sinceIn a follow-up, I will delete the dead code.#35599
Related issue number
Closes #35598
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.