[Data][Docs] Revise "Loading data" #36144

bveeramani · 2023-06-07T02:28:45Z

Why are these changes needed?

The "Loading data" guide contains verbose examples and disorganized subsections. This PR abridges the guide and restructures the content.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>

amogkam · 2023-06-07T03:37:53Z

Can you add a PR description?

raulchen · 2023-06-07T17:42:16Z

doc/source/data/loading-data.rst


- Create a dataset from a range of integers, packing this integer range into
- ndarrays of the provided shape.
+ ds = ray.data.read_images("/tmp/batoidea/JPEGImages")


Just curious, does this directory exist in the CI env?

Not yet, but I'm planning on adding it in a follow-up PR. I'm thinking of downloading the s3://ray-example-data bucket to CI.

raulchen · 2023-06-07T19:32:52Z

doc/source/data/loading-data.rst

@@ -618,7 +542,7 @@ Call :func:`~ray.data.read_sql` to read data from a database that provides a

 pip install mysql-connector-python

- Then, define your connection login and query the database.
+ Then, define your connection logic and query the database.


IIUC, the following example creates a new connection for each ray.data.read_sql call. This is an anti-pattern in practice. Do we want to update it to reuse one connection?

It's necessary to create a new connection because connections aren't thread-safe or process-safe. So, we can't share a connection across read tasks.

Those read_sql calls are sequential. So multi-threading isn't a problem?

I'm not sure if I'm misunderstanding? The read_sql calls are sequential, but read_sql creates connections in read tasks. We create a connection in each read task so that we can read the database in parallel.

ericl

Edits look good, but we also have to somehow retain the performance section. The most common pain point from users is how to obtain good performance from Ray Data, so we have to be deliberate in addressing this need.

amogkam · 2023-06-07T22:31:52Z

For performance, we can just link to the appropriate section in the performance guide in the intro? https://docs.ray.io/en/master/data/performance-tips.html. And make sure that information is up to date.

Don't think we need to have it in both places

ericl · 2023-06-07T22:33:17Z

> For performance, we can just link to the appropriate section in the performance guide in the intro? https://docs.ray.io/en/master/data/performance-tips.html. And make sure that information is up to date.

Currently, the performance page is a mishmash of random / advanced tips. It certainly needs to be updated to focus on the basics for loading / transforming.

bveeramani · 2023-06-07T22:53:57Z

> For performance, we can just link to the appropriate section in the performance guide in the intro? https://docs.ray.io/en/master/data/performance-tips.html. And make sure that information is up to date.

> Currently, the performance page is a mishmash of random / advanced tips. It certainly needs to be updated to focus on the basics for loading / transforming.

We're planning on revising the performance page in the near future. I'll leave the performance considerations in for now, and remove them once we've revised the performance page.

amogkam

This is great!!

Just one last thing: per @ericl's comment, let's add a "Creating synthetic datasets" section at the bottom that shows range, range_table, and range_tensor (just like what we have currently) and specifies that this is useful for performance benchmarking.

ericl · 2023-06-08T17:53:41Z

As Amog's comment above suggested, could we add range() and range_tensor() back as recommendations for creating large synthetic datasets for performance testing?

bveeramani · 2023-06-08T17:55:05Z

> As Amog's comment above suggested, could we add range() and range_tensor() back as recommendations for creating large synthetic datasets for performance testing?

@ericl yeah, just pushed the changes.

Initial commit

835b210

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani requested review from ericl, scv119, c21, amogkam, scottjlee, raulchen, maxpumperla and a team as code owners June 7, 2023 02:28

bveeramani changed the title ~~Initial commit~~ [Data][Docs] Revise "Loading data" Jun 7, 2023

bveeramani assigned ericl, amogkam and angelinalg Jun 7, 2023

amogkam assigned raulchen Jun 7, 2023

amogkam reviewed Jun 7, 2023

raulchen reviewed Jun 7, 2023

Update doc/source/data/loading-data.rst

403206e

Co-authored-by: Hao Chen <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]>

ericl reviewed Jun 7, 2023

ericl requested changes Jun 7, 2023

ericl added the @author-action-required label Jun 7, 2023

Address review comments

1bf4de1

Signed-off-by: Balaji Veeramani <[email protected]>

Address review comments

2ca1672

Signed-off-by: Balaji Veeramani <[email protected]>

ericl reviewed Jun 7, 2023

bveeramani mentioned this pull request Jun 7, 2023

[Data][Docs] Revise user guides #35947

Closed

9 tasks

Update loading-data.rst

dc93b0c

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani added 2 commits June 7, 2023 17:57

Merge remote-tracking branch 'upstream/master' into loading-data

8f488dd

Signed-off-by: Balaji Veeramani <[email protected]>

Fix tests

fa79646

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani removed the @author-action-required label Jun 8, 2023

amogkam approved these changes Jun 8, 2023

Merge remote-tracking branch 'upstream/master' into loading-data

1f93100

Signed-off-by: Balaji Veeramani <[email protected]>

ericl added the @author-action-required label Jun 8, 2023

Add "Creating synthetic data"

d04be72

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani removed the @author-action-required label Jun 8, 2023

ericl approved these changes Jun 8, 2023

ericl added the @author-action-required label Jun 8, 2023

amogkam merged commit 369f68e into ray-project:master Jun 8, 2023
35 of 48 checks passed

bveeramani removed the @author-action-required label Jun 8, 2023

bveeramani deleted the loading-data branch June 8, 2023 18:55

amogkam mentioned this pull request Jun 8, 2023

[Data] [Docs] Ray Data doc changes for 2.5 #36224

Merged

8 tasks
