
[Core/data] use wait based prefetcher by default #34871

Merged 1 commit into ray-project:master on May 1, 2023

Conversation

@scv119 (Contributor) commented Apr 28, 2023

Why are these changes needed?

Turn on the wait-based prefetcher by default to address the issues with the actor-based prefetcher.

  • [x] Benchmark the before/after performance.
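For context, the general idea behind a "wait-based" prefetcher is to keep a bounded window of block fetches in flight and poll for whichever completes first, rather than routing fetch requests through a dedicated prefetch actor. The sketch below illustrates that pattern only; it is not Ray's implementation. It uses `concurrent.futures` as a stand-in for Ray object refs, and the names `fetch_block`, `prefetch_iter`, and `window` are illustrative, not Ray APIs.

```python
# Illustrative sketch of wait-based prefetching (NOT Ray's actual code):
# keep up to `window` fetches in flight, wait for the first to finish,
# yield it, and immediately schedule the next block.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def fetch_block(i):
    # Placeholder for fetching one data block (e.g. reading from object store).
    return i * i

def prefetch_iter(num_blocks, window=4):
    with ThreadPoolExecutor(max_workers=window) as pool:
        # Prime the prefetch window.
        pending = {pool.submit(fetch_block, i): i
                   for i in range(min(window, num_blocks))}
        next_i = len(pending)
        while pending:
            # Block only until *some* fetch completes, then refill the window.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                del pending[fut]
                if next_i < num_blocks:
                    pending[pool.submit(fetch_block, next_i)] = next_i
                    next_i += 1
                yield fut.result()
```

Because completion order is not guaranteed, a consumer that needs ordered blocks would reorder them; the point of the pattern is that no long-lived helper actor sits between the consumer and the fetches.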

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 marked this pull request as ready for review April 28, 2023 20:12
@scv119 scv119 added the do-not-merge Do not merge this PR! label Apr 28, 2023
@scv119 (Contributor, Author) commented Apr 28, 2023

kicking off benchmark tests

@scv119 (Contributor, Author) commented Apr 30, 2023

Release test results:

  1. streaming_data_ingest_benchmark_1tb
     with this PR: success! total time 64.92455434799194
     baseline: success! total time 64.79763126373291

  2. pipelined_data_ingest_benchmark_1tb.aws
     with this PR:

(ConsumingActor pid=186, ip=10.0.58.86) ##### Overall Pipeline Time Breakdown #####                                            
(ConsumingActor pid=186, ip=10.0.58.86) * Time stalled waiting for next datastream: 7.6ms min, 1.2s max, 161.99ms mean, 4.21s total
(PipelineSplitExecutorCoordinator pid=5130) 2023-04-28 14:53:36,146     INFO streaming_executor.py:147 -- Shutting down <StreamingExecutor(Thread-30, stopped daemon 139750702831360)>.
success! total time 45.71649169921875                                                                                                   
(ConsumingActor pid=186, ip=10.0.49.88) Time to read all data 42.747634274999996 seconds [repeated 19x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ConsumingActor pid=186, ip=10.0.49.88) P50/P95/Max batch delay (s) 0.009423945000008871 0.13930677815000647 1.6963853930000141 [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Num epochs read 1 [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Num batches read 1280 [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.58.86)  [repeated 5x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Mean throughput 1197.73 MiB/s [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Num bytes read 51200.0 MiB [repeated 18x across cluster]

baseline:

##### Overall Pipeline Time Breakdown #####
(ConsumingActor pid=962, ip=10.0.44.103) * Time stalled waiting for next datastream: 7.95ms min, 1.21s max, 122.83ms mean, 3.19s total
(ConsumingActor pid=962, ip=10.0.44.103)                                                                                       
success! total time 46.14367604255676                                                                                                   
(ConsumingActor pid=4712) Time to read all data 43.25214273300003 seconds [repeated 19x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ConsumingActor pid=4712) P50/P95/Max batch delay (s) 0.009616004500003328 0.14022980685003572 1.838354100999993 [repeated 19x across cluster]
(ConsumingActor pid=4712) Num epochs read 1 [repeated 19x across cluster]
(ConsumingActor pid=4712) Num batches read 1280 [repeated 19x across cluster]
(ConsumingActor pid=4712) Num bytes read 51200.0 MiB [repeated 19x across cluster]
(ConsumingActor pid=4712) Mean throughput 1183.76 MiB/s [repeated 19x across cluster]
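The reported mean throughputs are consistent with the other logged figures: a quick sanity check, assuming throughput is simply total bytes read divided by the time to read all data (an approximation of how the benchmark reports it, not necessarily Ray's exact formula).

```python
# Cross-check the logged mean throughput against bytes read / read time,
# using the numbers from the logs above.
bytes_read_mib = 51200.0

pr_throughput = bytes_read_mib / 42.747634274999996       # with this PR
baseline_throughput = bytes_read_mib / 43.25214273300003  # baseline

print(f"with this PR: {pr_throughput:.2f} MiB/s")        # matches the logged 1197.73
print(f"baseline:     {baseline_throughput:.2f} MiB/s")  # matches the logged 1183.76
```

So the two prefetchers are within about 1% of each other on this benchmark, with the wait-based prefetcher slightly ahead in both throughput and total time.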

@scv119 scv119 removed the do-not-merge Do not merge this PR! label Apr 30, 2023
@scv119 scv119 assigned ericl and c21 Apr 30, 2023
@ericl ericl merged commit b294bfd into ray-project:master May 1, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
amogkam added a commit that referenced this pull request May 18, 2023
Successfully merging this pull request may close these issues.

[Datasets] Should remove actor-based prefetcher after #30375 is addressed
4 participants