
[core][state] Record file offsets instead of logging magic token to track task log #35572

Merged (13 commits) May 26, 2023

Conversation


@rickyyx rickyyx commented May 20, 2023

Why are these changes needed?

We have seen regressions in 1_1_actor_calls_concurrent and 1_1_actor_calls_async due to the additional overhead of writing a magic token to worker log files when starting and finishing execution of a task.
The contention on the shared worker file makes this especially bad for concurrent actors.

This PR:

  • Skips recording task start/end offsets for concurrent actor tasks, and changes the task log querying behaviour as follows:

    • Normal task log: we can still record the offsets and stream the task log from the dashboard and the CLI.
    • Normal actor task log: same as above.
    • Async actor tasks and threaded actor tasks (max_concurrency > 1):
      • On the CLI: we raise an error telling users to use ray logs actor --id <id> for the actor log.
  • Reverts to the old way of recording file offsets rather than writing magic tokens, which lowers the overhead and prevents pollution of actor logs.

Microbenchmark results:

Version      1_1_actor_calls_sync  1_1_actor_calls_async  1_1_actor_calls_concurrent
2.4 Release  2527.69               8374.56                5256.40
This PR      2404.83               7875.52                5134.41
Master       2359.93               6542.37                2760.99

Regarding the ~8% regression on 1_1_actor_calls_async: I would argue for tolerating it, since

  1. The 1_n and n_n microbenchmarks of actor_calls_async show no regression at all.
  2. This would be less of a problem when the actual actor task is not a no-op.

In a follow-up (#35599), I will delete the dead code.

Related issue number

Closes #35598

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rickyyx and others added 4 commits May 20, 2023 13:29
@rickyyx changed the title from "init" to "[core][state] Record file offsets instead of logging magic token to track task log" May 22, 2023
@rickyyx marked this pull request as ready for review May 22, 2023 12:39
Signed-off-by: Ricky Xu <[email protected]>
# look for the file offset with the line count
start_offset = find_start_offset_last_n_lines_from_offset(
    f, offset=end_offset, n=lines
)

We no longer need to read the file to find the task offsets.

@rkooo567 (Contributor) commented:

> Revert to the old way of recording file offsets rather than writing magic tokens so that the overhead is actually lower and prevents the pollution of actor logs.

I remember this caused the regression in the past too. Any guess why it doesn't introduce a regression anymore?

Skips recording the task start/end for concurrent actor tasks. When users query logs from a concurrent actor task, the entire actor log content will be returned. This is probably a reasonable (and not an incorrect) behavior because ray cannot control the interleaving, and would require parsing the file content to exclude logs from other concurrent tasks. It's probably not an issue anymore if we support structured logging in the future.

Also, this behavior seems very confusing. I'd like to discuss the best behavior in this case with @scottsun94. I think we should always print a warning or something like that, since it is common to see logs for async actors (i.e., Serve).

@rkooo567 (Contributor) commented:

Also there's a merge conflict!

@scottsun94 (Contributor) commented:

Can you guys help me understand how this will affect the current log UI? cc: @alanwguo

Task log: this should only affect concurrent actor methods? The actor logs will be shown instead of the logs for that specific task.

Actor log: this shouldn't affect the log of any actor.

Is my understanding correct?


@rickyyx (Contributor, Author) commented May 24, 2023

Added the dashboard front end change as well. @alanwguo @scottsun94 lmk what you think:

[screenshot: dashboard front-end change]

@rickyyx (Contributor, Author) commented May 24, 2023

> Can you guys help me understand how this will affect the current log UI? cc: @alanwguo
>
> Task log: this should only affect concurrent actor methods? The actor logs will be shown instead of the logs for that specific task.
>
> Actor log: this shouldn't affect the log of any actor.
>
> Is my understanding correct?

Synced offline; this is now the behaviour:

  • Normal task log: we can still record the offsets and stream the task log from the dashboard and the CLI.
  • Normal actor task log: same as above.
  • Async actor tasks and threaded actor tasks (max_concurrency > 1):
    • On the dashboard: we show "Actor Log" for those tasks and direct users to the actor log.
    • On the CLI: we raise an error telling users to use ray logs actor --id <id> for the actor log.
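The CLI-side branching described above could be sketched roughly as follows. The TaskLogInfo structure, its field names, and the error message text are illustrative assumptions for this sketch, not Ray's actual state API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TaskLogInfo:
    """Illustrative record of what the state API might know about a task's log."""
    stdout_start: Optional[int]   # byte offset where the task's output begins
    stdout_end: Optional[int]     # byte offset where it ends (None if still running)
    actor_id: Optional[str] = None
    is_concurrent_actor_task: bool = False


def resolve_task_log(info: TaskLogInfo) -> Tuple[int, Optional[int]]:
    """Return the (start, end) offsets to stream, or raise for concurrent actor tasks."""
    if info.is_concurrent_actor_task:
        # Offsets were never recorded for async/threaded actor tasks, so point
        # the user at the whole actor log instead of returning interleaved output.
        raise ValueError(
            "Logs of concurrent actor tasks are interleaved in the actor log; "
            f"use `ray logs actor --id {info.actor_id}` to view it."
        )
    return info.stdout_start or 0, info.stdout_end
```

For normal tasks the recorded offsets are returned and streamed; only the concurrent-actor branch fails fast with a pointer to the actor log.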

@rickyyx (Contributor, Author) commented May 24, 2023

we will merge this PR first for the revert to resolve conflicts: #35687

@scottsun94 (Contributor) commented May 24, 2023

@rickyyx I was expecting something like this in the task detail page.
I feel like we don't need to change "Log" to "Actor log" only for those tasks. This might make the UI behavior more complicated and harder for users to reason about.

[Screenshot: proposed task detail page, 2023-05-23]

@rickyyx (Contributor, Author) commented May 24, 2023

> @rickyyx I was expecting something like this in the task detail page. I feel like we don't need to change "Log" to "Actor log" only for those tasks. This might make the UI behavior more complicated and harder for users to reason about.
>
> [Screenshot: proposed task detail page, 2023-05-23]

I see, this makes sense. I took a brief look at the code; I am a bit uncertain how I should do it (I guess we will need to pass a special-case variable into TaskLogViewer). cc @alanwguo

I will probably remove the front-end change from this PR, and we can iterate and cherry-pick in another one.

@rickyyx (Contributor, Author) commented May 24, 2023

> Can you guys help me understand how this will affect the current log UI? cc: @alanwguo
>
> Task log: this should only affect concurrent actor methods? The actor logs will be shown instead of the logs for that specific task.
>
> Actor log: this shouldn't affect the log of any actor.
>
> Is my understanding correct?

> @rickyyx I was expecting something like this in the task detail page. I feel like we don't need to change "Log" to "Actor log" only for those tasks. This might make the UI behavior more complicated and harder for users to reason about.
>
> [Screenshot: proposed task detail page, 2023-05-23]

I see, this makes sense. I took a brief look at the code; I am a bit uncertain how I should do it (I guess we will need to pass a special-case variable into TaskLogViewer). cc @alanwguo

I will probably remove the front-end change from this PR, and we can iterate and cherry-pick in another one.

Actually, if we also return what you wrote as the log content in the state API, could that work?

ray logs task --id <concurrent task id>

>>> Logs of async actors or threaded actors, .....

@scottsun94 (Contributor) commented:

> Actually, if we also return what you wrote as the log content in the state API, could that work?

Not sure; we need @alanwguo to confirm.

This is how @alanwguo implemented the "driver logs" message https://github.com/ray-project/ray/pull/35328/files#diff-8f6d8abcea0a9396076b5e65e3724ed386e45e636f1425d145dde2d045b9aa86R49-R62 for your reference.

@rickyyx rickyyx added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label May 25, 2023
@rkooo567 (Contributor) left a comment:

minor comments

python/ray/_raylet.pyx (review thread, resolved)
dashboard/modules/log/log_manager.py (review thread, resolved)
dashboard/modules/log/log_manager.py (review thread, resolved)
@@ -186,6 +204,151 @@ async def _resolve_worker_file(
return filename
return None

async def _resolve_actor_filename(

Moved from the original resolve_filename routine

)
return node_id, log_filename

async def _resolve_task_filename(

Moved from the original resolve_filename routine

@rkooo567 rkooo567 merged commit 2bffbf8 into ray-project:master May 26, 2023
@rkooo567 (Contributor) commented:

Since we changed the mechanism at the very last minute, we should do some dogfooding.

@rkooo567 (Contributor) commented:

Also, don't forget to cherry-pick!

rickyyx added a commit to rickyyx/ray that referenced this pull request May 26, 2023
scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
@jjyao mentioned this pull request Apr 8, 2024
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[core][ci] Perf regression in microbenchmark actor calls