
[core][state] Record file offsets instead of logging magic token to track task log #35572

Merged (13 commits) May 26, 2023

Conversation


@rickyyx rickyyx commented May 20, 2023

Why are these changes needed?

We have seen regressions in 1_1_actor_calls_concurrent and 1_1_actor_calls_async due to the additional overhead of writing a magic token to worker log files when starting and finishing execution of a task.
The contention on the shared worker file makes this especially bad for concurrent actors.

This PR:

  • Skips recording task start/end offsets for concurrent actor tasks, and changes the task log querying behaviour as follows:

    • Normal task log: we can still record the offsets and stream the task log from the dashboard and the CLI.
    • Normal actor task log: same as above.
    • Async actor tasks and threaded actor tasks (max_concurrency > 1):
      • On the CLI: we raise an error telling users to use ray logs actor --id <id> for the actor log.
  • Reverts to the old way of recording file offsets rather than writing magic tokens, which lowers the overhead and prevents pollution of actor logs.

Microbenchmark results:

Version      1_1_actor_calls_sync  1_1_actor_calls_async  1_1_actor_calls_concurrent
2.4 Release  2527.69               8374.56                5256.40
This PR      2404.83               7875.52                5134.41
Master       2359.93               6542.37                2760.99

Regarding the ~8% regression on 1_1_actor_calls_async: I would argue for tolerating it, since

  1. The 1_n and n_n microbenchmarks of actor_calls_async show no regression at all.
  2. This would be less of a problem when the actual actor task is not a no-op.

In a follow-up (#35599), I will delete the dead code.

Related issue number

Closes #35598

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rickyyx and others added 4 commits May 20, 2023 13:29
@rickyyx changed the title from "init" to "[core][state] Record file offsets instead of logging magic token to track task log" May 22, 2023
@rickyyx marked this pull request as ready for review May 22, 2023 12:39
Signed-off-by: Ricky Xu <[email protected]>
# look for the file offset with the line count
start_offset = find_start_offset_last_n_lines_from_offset(
    f, offset=end_offset, n=lines
)

We no longer need to read the file to find the task offsets.

@rkooo567 (Contributor) commented:

> Revert to the old way of recording file offsets rather than writing magic tokens so that the overhead is actually lower and prevents the pollution of actor logs.

I remember this caused the regression in the past too. Any guess why it doesn't introduce a regression anymore?

Skips recording the task start/end for concurrent actor tasks. When users query logs from a concurrent actor task, the entire actor log content will be returned. This is probably a reasonable (and not an incorrect) behavior because ray cannot control the interleaving, and would require parsing the file content to exclude logs from other concurrent tasks. It's probably not an issue anymore if we support structured logging in the future.

Also, this behavior seems very confusing. I'd like to discuss the best behavior in this case with @scottsun94. I think we should always print a warning or something like that, since it is common to see logs for async actors (i.e., Serve).

@rkooo567 (Contributor) commented:

Also there's a merge conflict!

@scottsun94 (Contributor) commented:

Can you guys help me understand how this will affect the current log UI? cc: @alanwguo

Task log: this should only affect concurrent actor methods? The actor logs will be shown instead of the logs for that specific task.

Actor log: this shouldn't affect the log of any actor.

Is my understanding correct?


@rickyyx (Contributor, Author) commented May 24, 2023

Added the dashboard front end change as well. @alanwguo @scottsun94 lmk what you think:

[screenshot: dashboard front-end change]

@rickyyx (Contributor, Author) commented May 24, 2023

> Can you guys help me understand how this will affect the current log UI? cc: @alanwguo
>
> Task log: this should only affect concurrent actor methods? The actor logs will be shown instead of the logs for that specific task.
>
> Actor log: this shouldn't affect the log of any actor.
>
> Is my understanding correct?

Synced offline; this is now the behaviour:

  • Normal task log: we can still record the offsets and stream the task log from the dashboard and the CLI.
  • Normal actor task log: same as above.
  • Async actor tasks and threaded actor tasks (max_concurrency > 1):
    • On the dashboard: we show "Actor Log" for those tasks and direct users to the actor log.
    • On the CLI: we raise an error telling users to use ray logs actor --id <id> for the actor log.
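The CLI-side branching described above could be sketched roughly as follows. The TaskLogInfo structure, its field names, and the error message text are illustrative assumptions for this sketch, not Ray's actual state API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TaskLogInfo:
    """Illustrative record of what the state API might know about a task's log."""
    stdout_start: Optional[int]   # byte offset where the task's output begins
    stdout_end: Optional[int]     # byte offset where it ends (None if still running)
    actor_id: Optional[str] = None
    is_concurrent_actor_task: bool = False


def resolve_task_log(info: TaskLogInfo) -> Tuple[int, Optional[int]]:
    """Return the (start, end) offsets to stream, or raise for concurrent actor tasks."""
    if info.is_concurrent_actor_task:
        # Offsets were never recorded for async/threaded actor tasks, so point
        # the user at the whole actor log instead of returning interleaved output.
        raise ValueError(
            "Logs of concurrent actor tasks are interleaved in the actor log; "
            f"use `ray logs actor --id {info.actor_id}` to view it."
        )
    return info.stdout_start or 0, info.stdout_end
```

For normal tasks the recorded offsets are returned and streamed; only the concurrent-actor branch fails fast with a pointer to the actor log.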

@rickyyx (Contributor, Author) commented May 24, 2023

we will merge this PR first for the revert to resolve conflicts: #35687

@scottsun94 (Contributor) commented May 24, 2023

@rickyyx I was expecting something like this in the task detail page.
I feel like we don't need to change "Log" to "Actor log" only for those tasks. This might make the UI behavior more complicated and harder for users to reason about.

[Screenshot: proposed task detail page, 2023-05-23]

@rickyyx (Contributor, Author) commented May 24, 2023

> @rickyyx I was expecting something like this in the task detail page. I feel like we don't need to change "Log" to "Actor log" only for those tasks. This might make the UI behavior more complicated and harder for users to reason about.
>
> [Screenshot: proposed task detail page, 2023-05-23]

I see, this makes sense. I took a brief look at the code; I am a bit uncertain how I should do it (I guess we will need to pass a special-case variable into TaskLogViewer). cc @alanwguo

I will probably remove the front-end change from this PR, and we can iterate and cherry-pick in another one.

@rickyyx (Contributor, Author) commented May 24, 2023

> Can you guys help me understand how this will affect the current log UI? cc: @alanwguo
>
> Task log: this should only affect concurrent actor methods? The actor logs will be shown instead of the logs for that specific task.
>
> Actor log: this shouldn't affect the log of any actor.
>
> Is my understanding correct?

> @rickyyx I was expecting something like this in the task detail page. I feel like we don't need to change "Log" to "Actor log" only for those tasks. This might make the UI behavior more complicated and harder for users to reason about.
>
> [Screenshot: proposed task detail page, 2023-05-23]

I see, this makes sense. I took a brief look at the code; I am a bit uncertain how I should do it (I guess we will need to pass a special-case variable into TaskLogViewer). cc @alanwguo

I will probably remove the front-end change from this PR, and we can iterate and cherry-pick in another one.

Actually, if we also return what you wrote as the log content in the state API, could that work?

ray logs task --id <concurrent task id>

>>> Logs of async actors or threaded actors, .....

@scottsun94 (Contributor) commented:

> Actually, if we also return what you wrote as the log content in the state API, could that work?

Not sure; we need @alanwguo to confirm.

This is how @alanwguo implemented the "driver logs" message https://github.com/ray-project/ray/pull/35328/files#diff-8f6d8abcea0a9396076b5e65e3724ed386e45e636f1425d145dde2d045b9aa86R49-R62 for your reference.

@rickyyx rickyyx added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label May 25, 2023
@rkooo567 (Contributor) left a comment:

minor comments

python/ray/_raylet.pyx (review thread, resolved)
dashboard/modules/log/log_manager.py (review thread, resolved)
dashboard/modules/log/log_manager.py (review thread, resolved)
@@ -186,6 +204,151 @@ async def _resolve_worker_file(
return filename
return None

async def _resolve_actor_filename(

Moved from the original resolve_filename routine

)
return node_id, log_filename

async def _resolve_task_filename(

Moved from the original resolve_filename routine

@rkooo567 rkooo567 merged commit 2bffbf8 into ray-project:master May 26, 2023
@rkooo567 (Contributor) commented:

Since we changed the mechanism at the very last minute, we should do some dogfooding.

@rkooo567 (Contributor) commented:

Also, don't forget to cherry-pick!

rickyyx added a commit to rickyyx/ray that referenced this pull request May 26, 2023
scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
@jjyao mentioned this pull request Apr 8, 2024
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[core][ci] Perf regression in microbenchmark actor calls