[REP] Execution Optimizer for Ray Datasets #19

c21 · 2022-12-15T09:06:28Z

This REP introduces (1) lazy execution, (2) an optimizer, and (3) vectorized execution over data batches, to improve the user experience and performance of Ray Datasets.

Signed-off-by: Cheng Su <[email protected]>

ericl · 2022-12-15T23:49:15Z

reps/2022-12-15-optimizer-data.md

+
+Architecture after REP:
+
+<img width="945" alt="new-architecture" src="https://user-images.githubusercontent.com/4629931/207807703-bb65db63-41a0-41d9-8e7b-154e1a0ed565.png">


Since we are considering re-optimization outside the scope of this REP, can we also remove that from the diagram?

yeah, removed.

ericl · 2022-12-15T23:50:24Z

reps/2022-12-15-optimizer-data.md

+
+#### 3.2.1. Interfaces
+
+NOTE: `OneToOneOperator` used here is the same as `OneToOneOperator` in "Native pipelining support in Ray Datasets" REP.


This section needs to be updated, since the other REP now only proposes PhysicalOperator.

@ericl - yeah, updated. I need to think more about how to hook up `BatchedOperator.process_batches` with `PhysicalOperator.add_input`/`inputs_done`/`has_next`/`get_next`, but I think that is an implementation detail we can figure out later.
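For readers following the thread, here is one way the hookup mentioned above could look. This is only a rough sketch under stated assumptions, not the design the REP settles on: the method names `process_batches`, `add_input`, `inputs_done`, `has_next`, and `get_next` come from the discussion, but their signatures, the `Batch` type, and the `BatchedToPhysicalAdapter` class are hypothetical. It also assumes `process_batches` keeps no state between calls, so each incoming batch can be processed on arrival.

```python
from collections import deque
from typing import Callable, Deque, Iterator, List

Batch = List[int]  # hypothetical batch type, for illustration only


class BatchedToPhysicalAdapter:
    """Hypothetical adapter: exposes a push/pull interface on top of a
    batch-iterator function. Signatures are assumed, not taken from the REP."""

    def __init__(
        self, process_batches: Callable[[Iterator[Batch]], Iterator[Batch]]
    ) -> None:
        self._process_batches = process_batches
        self._outputs: Deque[Batch] = deque()  # results waiting to be pulled
        self._inputs_done = False

    def add_input(self, batch: Batch) -> None:
        if self._inputs_done:
            raise RuntimeError("add_input called after inputs_done")
        # Assumes process_batches is stateless, so one batch at a time is fine.
        self._outputs.extend(self._process_batches(iter([batch])))

    def inputs_done(self) -> None:
        # Nothing is buffered on the input side, so only record the signal.
        self._inputs_done = True

    def has_next(self) -> bool:
        return bool(self._outputs)

    def get_next(self) -> Batch:
        return self._outputs.popleft()
```

A stateful `process_batches` (for example, one that re-batches rows across inputs) would need a different bridge, such as buffering inputs until `inputs_done` is called.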


stephanie-wang · 2022-12-15T23:56:58Z

reps/2022-12-15-optimizer-data.md

+
+## Summary
+
+Build the breakthrough foundation to tackle a series of fundamental issues around Ray Data. The foundation is (1) **lazy execution**, (2) **optimizer**, and (3) **vectorized execution with data batch**.


The summary is a bit low-level right now and solution-heavy. It might be good to focus more on the problems (expensive and unnecessary materialization; the current design lacks an optimizer, which makes materialization impossible to elide).

Moved this under General Motivation to make it more coherent, as the top-level summary seems not strictly needed (not seen in other REPs).


This PR enables lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:

* Change the `Dataset` constructor: `Dataset.__init__(lazy: bool = True)`. Also remove the `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset` while still computing the first input block.
* Add `ds.fully_executed()` calls to the unit tests that require them, to make sure they pass.

TODO:

- [x] Fix all unit tests
- [x] #31459
- [x] #31460
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] #31417
- [ ] Update documentation

Signed-off-by: Cheng Su <[email protected]>
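To make the lazy-by-default behavior described above concrete, here is a toy sketch. It is not Ray's implementation: `ToyDataset` is a hypothetical class, and only the `lazy` flag and the `fully_executed()` name mirror the description. With `lazy=True`, transforms are recorded and nothing runs until `fully_executed()` (or a consuming call) is made.

```python
from typing import Callable, List, Optional

Block = List[int]  # hypothetical block type, for illustration only


class ToyDataset:
    """Hypothetical stand-in for a dataset; not Ray's Dataset class."""

    def __init__(self, blocks: List[Block], lazy: bool = True) -> None:
        self._input_blocks = [list(b) for b in blocks]
        self._stages: List[Callable[[Block], Block]] = []  # recorded, not yet run
        self._output: Optional[List[Block]] = None
        self._lazy = lazy

    def map_batches(self, fn: Callable[[Block], Block]) -> "ToyDataset":
        if self._output is not None:
            # Already materialized: start the new plan from the computed blocks.
            new = ToyDataset(self._output, lazy=self._lazy)
            new._stages = [fn]
        else:
            new = ToyDataset(self._input_blocks, lazy=self._lazy)
            new._stages = self._stages + [fn]
        if not new._lazy:
            new.fully_executed()  # eager mode runs the stage immediately
        return new

    def fully_executed(self) -> "ToyDataset":
        # Run all recorded stages once and cache the result.
        if self._output is None:
            blocks = self._input_blocks
            for fn in self._stages:
                blocks = [fn(b) for b in blocks]
            self._output = blocks
        return self

    def take_all(self) -> List[int]:
        self.fully_executed()
        return [row for block in self._output for row in block]
```

In this sketch, a test that relied on eager side effects would need an explicit `fully_executed()` call after building the dataset, which is the kind of change the third bullet describes for the unit tests.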


This PR enables lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:

* Change the `Dataset` constructor: `Dataset.__init__(lazy: bool = True)`. Also remove the `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset` while still computing the first input block.
* Add `ds.fully_executed()` calls to the unit tests that require them, to make sure they pass.

TODO:

- [x] Fix all unit tests
- [x] ray-project#31459
- [x] ray-project#31460
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] ray-project#31417
- [ ] Update documentation

Signed-off-by: tmynn <[email protected]>

Execution Optimizer for Ray Datasets

366e584

Signed-off-by: Cheng Su <[email protected]>

c21 assigned ericl, stephanie-wang, clarkzinzow, jianoaix and zhe-thoughts Dec 15, 2022

c21 added 2 commits December 15, 2022 01:10

Remove extra white spaces

c03f39d

Signed-off-by: Cheng Su <[email protected]>

minor tweak

333f0ff

Signed-off-by: Cheng Su <[email protected]>

c21 added the shepherding label Dec 15, 2022

ericl reviewed Dec 15, 2022


stephanie-wang requested changes Dec 16, 2022


c21 mentioned this pull request Dec 22, 2022

[Datasets] Enable lazy execution by default ray-project/ray#31286

Merged

13 tasks

Address all comments

2045c55

Signed-off-by: Cheng Su <[email protected]>

stephanie-wang approved these changes Jan 9, 2023


zhe-thoughts reviewed Jan 10, 2023


ericl added pending-committer-vote and removed shepherding labels Jan 10, 2023

Address comment of diagrams

bce6ec3

Signed-off-by: Cheng Su <[email protected]>

ericl added vote-approved and removed pending-committer-vote labels Jan 13, 2023

zhe-thoughts merged commit 68b472b into main Jan 13, 2023

c21 deleted the optimizer branch January 13, 2023 22:26



