[Datasets] Streaming executor fixes #5 #32951

jianoaix · 2023-03-01T21:47:59Z

Why are these changes needed?

Fix [Datasets] LazyBlocklist split fails to split heterogeneous list #32950
Invoke ExecutionPlan.execute() directly (instead of ds.take()) when the purpose is to test ExecutionPlan.execute(). ds.take() may run the new execution backend, which doesn't invoke `ExecutionPlan.execute()`.

Related issue number

#32132

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

clarkzinzow · 2023-03-01T22:44:49Z

python/ray/data/_internal/lazy_block_list.py

+ cached_metadata = [
+ self._cached_metadata[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)]
+ for i in range(len(self._cached_metadata))
+ ]


Hmm it seems like this cached_metadata should always be homogeneous, i.e. it should always contain a BlockMetadata for each element. Do you know how this heterogeneity is happening?

We set only the first block when figuring out the schema, so e.g. with this:

import ray

inputs = ["example://iris.csv"] * 100
ds = ray.data.read_csv(inputs, parallelism=10)
print("before:", ds._plan._in_blocks._cached_metadata)
ds.schema()
print("after:", ds._plan._in_blocks._cached_metadata)
ds._plan._in_blocks.split(2)

It's producing:

before: [None, None, None, None, None, None, None, None, None, None]
after: [[BlockMetadata(num_rows=1500, size_bytes=66500, schema=sepal.length: double sepal.width: double petal.length: double petal.width: double variety: string, input_files=array(['/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv', '/home/ubuntu/ray/python/ray/data/examples/data/iris.csv'], dtype='<U55'), exec_stats={'wall_time_s': 0.36291642487049103, 'cpu_time_s': 0.3410033880000001, 'node_id': 'f3e389087180baf4bcde82efe3873d1139be957718e69465786af17d'})], None, None, None, None, None, None, None, None, None]

Could we pull this out into an array split util and use it for the above splits too?

Otherwise it's not clear this is doing the same thing as array split.
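
For illustration, a standalone list-split helper along the lines requested here might look like the sketch below. The name `split_list` is hypothetical and is not taken from the PR. It reuses the slicing arithmetic from the diff above, iterates over the requested number of splits `n`, and aims to reproduce the section sizes of `np.array_split` (the first `len(items) % n` sections get one extra element) without converting the list to a NumPy array, so heterogeneous entries such as `None` mixed with lists pass through untouched.

```python
from typing import Any, List


def split_list(items: List[Any], n: int) -> List[List[Any]]:
    """Split `items` into `n` contiguous sublists (hypothetical helper).

    With k, m = divmod(len(items), n), the first m sublists hold k + 1
    elements and the remaining n - m hold k elements.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k, m = divmod(len(items), n)
    # Section i starts after i full sections of size k plus one extra
    # element for each of the first min(i, m) sections.
    return [
        items[i * k + min(i, m) : (i + 1) * k + min(i + 1, m)]
        for i in range(n)
    ]


# Elements are never inspected, so a mix of lists and None is fine.
mixed = [["metadata"], None, None, None, None]
print(split_list(mixed, 2))
```

Because the helper only slices, it works on any list, which would let the same utility serve the other splits in `split()` as well as `_cached_metadata`.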

jianoaix · 2023-03-02T00:19:59Z

python/ray/data/tests/test_dataset.py

@@ -4579,7 +4579,7 @@ def test_warning_execute_with_no_cpu(ray_start_cluster):
 try:
 ds = ray.data.range(10)
 ds = ds.map_batches(lambda x: x)
- ds.take()
+ ds._plan.execute()


The purpose of this test is testing ExecutionPlan.execute() so we run it directly. Running ds.take() may invoke the new execution backend when the flag is on.
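
The reasoning can be shown with a toy model. Everything in this sketch (`ToyPlan`, `ToyDataset`, the `use_new_backend` flag) is made up for illustration and is not Ray's API. It only shows why a test that goes through a dispatching entry point may never reach the method it is meant to cover.

```python
class ToyPlan:
    """Stand-in for an execution plan; counts how often execute() runs."""

    def __init__(self) -> None:
        self.execute_calls = 0

    def execute(self) -> list:
        self.execute_calls += 1
        return list(range(10))


class ToyDataset:
    """Stand-in for a dataset whose take() picks a backend via a flag."""

    def __init__(self, use_new_backend: bool) -> None:
        self._plan = ToyPlan()
        self._use_new_backend = use_new_backend

    def take(self) -> list:
        if self._use_new_backend:
            # The new backend produces the rows itself and skips the plan.
            return list(range(10))
        return self._plan.execute()


ds = ToyDataset(use_new_backend=True)
ds.take()
print(ds._plan.execute_calls)  # execute() was never reached via take()
ds._plan.execute()
print(ds._plan.execute_calls)  # the direct call reaches it regardless of the flag
```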

jianoaix · 2023-03-03T19:32:29Z

Tests passing (the failures are unrelated to this PR).

jianoaix requested review from ericl, scv119, clarkzinzow, jjyao and c21 as code owners March 1, 2023 21:47

clarkzinzow reviewed Mar 1, 2023

jianoaix changed the title ~~[Datasets] Split the list mannually for heterogeneous array~~ [Datasets] Streaming executor fixes #5 Mar 2, 2023

jianoaix commented Mar 2, 2023

jianoaix assigned ericl and clarkzinzow Mar 2, 2023

ericl approved these changes Mar 2, 2023

jianoaix force-pushed the streamingexecfix5 branch from 11c5af8 to 90954c5 Compare March 2, 2023 18:06

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 2, 2023

jianoaix added 6 commits March 3, 2023 00:44

Split the list mannually for heterogeneous array

1ad7324

test ExecutionPlan.execute()

115320f

feedback: list split util

490a1c7

typo

ac6f524

add test

d0c6a1f

test

6b7130b

jianoaix force-pushed the streamingexecfix5 branch from 90954c5 to 6b7130b Compare March 3, 2023 00:45

jianoaix added 2 commits March 3, 2023 17:44

fix test

fe3ca5b

Merge branch 'master' of https://github.com/ray-project/ray into stre…

93469cc

…amingexecfix5

jianoaix removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 3, 2023

ericl merged commit 303ac3b into ray-project:master Mar 5, 2023

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023

[Datasets] Streaming executor fixes ray-project#5 (ray-project#32951)

8feacbf

Signed-off-by: Jack He <[email protected]>

cadedaniel pushed a commit to cadedaniel/ray that referenced this pull request Mar 22, 2023

[Datasets] Streaming executor fixes #5 (ray-project#32951)

2e3ec6a

edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023

[Datasets] Streaming executor fixes #5 (ray-project#32951)

bce298c

Signed-off-by: Edward Oakes <[email protected]>

peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023

[Datasets] Streaming executor fixes #5 (ray-project#32951)

b052fc3

scottsun94 pushed a commit to scottsun94/ray that referenced this pull request Mar 28, 2023

[Datasets] Streaming executor fixes ray-project#5 (ray-project#32951)

18602e2

cassidylaidlaw pushed a commit to cassidylaidlaw/ray that referenced this pull request Mar 28, 2023

[Datasets] Streaming executor fixes ray-project#5 (ray-project#32951)

f3d634a

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

[Datasets] Streaming executor fixes ray-project#5 (ray-project#32951)

b4f82f4

Signed-off-by: elliottower <[email protected]>

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023

[Datasets] Streaming executor fixes ray-project#5 (ray-project#32951)

a45ae36

Signed-off-by: Jack He <[email protected]>
