Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data][Docs] Revise "Loading data" #36144

Merged
merged 9 commits into from
Jun 8, 2023
Merged

Conversation

bveeramani
Copy link
Member

@bveeramani bveeramani commented Jun 7, 2023

Why are these changes needed?

The "Loading data" guide contains verbose examples and disorganized subsections. This PR abridges the guide and restructures the content.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani changed the title Initial commit [Data][Docs] Revise "Loading data" Jun 7, 2023
@amogkam
Copy link
Contributor

amogkam commented Jun 7, 2023

Can you add a PR description?

doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Show resolved Hide resolved
doc/source/data/loading-data.rst Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved

Create a dataset from a range of integers, packing this integer range into
ndarrays of the provided shape.
ds = ray.data.read_images("/tmp/batoidea/JPEGImages")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, does these directory exist in the CI env?

Copy link
Member Author

@bveeramani bveeramani Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, but I'm planning on adding it in a follow-up PR. Thinking of download the s3:https://ray-example-data bucket to CI.

doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
doc/source/data/loading-data.rst Show resolved Hide resolved
doc/source/data/loading-data.rst Outdated Show resolved Hide resolved
@@ -618,7 +542,7 @@ Call :func:`~ray.data.read_sql` to read data from a database that provides a

pip install mysql-connector-python

Then, define your connection login and query the database.
Then, define your connection logic and query the database.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the following example will create a new connection for each ray.data.read_sql. This is an anti-pattern in practice. Do we want to update it to re-use one connection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's necessary to create a new connection because connections aren't thread or process safe. So, we can't share a connection across read tasks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those read_sql calls are sequential. So multi-threading isn't a problem?

Copy link
Member Author

@bveeramani bveeramani Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I'm misunderstanding? The read_sql calls are sequential, but read_sql creates connections in read tasks. We create a connection in each read task so that we can read the database in parallel.

Co-authored-by: Hao Chen <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edits look good, but we also have to somehow retain the performance section. The most common pain point from users is how to obtain good performance from Ray Data, so we have to be deliberate in addressing this need.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 7, 2023
@amogkam
Copy link
Contributor

amogkam commented Jun 7, 2023

For performance, we can just link to the appropriate section in the performance guide in the intro? https://docs.ray.io/en/master/data/performance-tips.html. And make sure that information is up to date.

Don't think we need to have it in both places

@ericl
Copy link
Contributor

ericl commented Jun 7, 2023

For performance, we can just link to the appropriate section in the performance guide in the intro? https://docs.ray.io/en/master/data/performance-tips.html. And make sure that information is up to date.

Currently, the performance page is a mis-mash of random / advanced tips. It certainly needs to be updated to focus on the basics for loading / transforming.

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani
Copy link
Member Author

For performance, we can just link to the appropriate section in the performance guide in the intro? https://docs.ray.io/en/master/data/performance-tips.html. And make sure that information is up to date.

Currently, the performance page is a mis-mash of random / advanced tips. It certainly needs to be updated to focus on the basics for loading / transforming.

We're planning on revising the performance page in the near future. I'll leave the performance considerations in for now, and remove it once we've revised the performance page.

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani mentioned this pull request Jun 7, 2023
9 tasks
Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 8, 2023
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!!

Just last thing, per @ericl's comment let's add a "Creating synthetic datasets" section at the bottom that shows range, range_table, and range_tensor (just like what we have currently), and specifies this is useful for performance benchmarking.

@ericl
Copy link
Contributor

ericl commented Jun 8, 2023

As Amog's comment above suggested could we add range() and range_tensor() back as recommendations for creating large synthetic datasets for performance testing?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 8, 2023
Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani
Copy link
Member Author

As Amog's comment above suggested could we add range() and range_tensor() back as recommendations for creating large synthetic datasets for performance testing?

@ericl yeah, just pushed the changes.

@bveeramani bveeramani removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 8, 2023
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 8, 2023
@amogkam amogkam merged commit 369f68e into ray-project:master Jun 8, 2023
35 of 48 checks passed
@bveeramani bveeramani removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 8, 2023
@bveeramani bveeramani deleted the loading-data branch June 8, 2023 18:55
amogkam pushed a commit to amogkam/ray that referenced this pull request Jun 8, 2023
The "Loading data" guide contains verbose examples and disorganized subsections. This PR abridges the guide and restructures the content.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
Signed-off-by: amogkam <[email protected]>
amogkam added a commit that referenced this pull request Jun 9, 2023
#35749
#35751
#35753
#35755
#35757
#36018
#36105
#36121
#36144
#36145
#36162
#36124

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
amogkam added a commit to amogkam/ray that referenced this pull request Jun 9, 2023
amogkam added a commit that referenced this pull request Jun 9, 2023
* [Data] [Docs] Ray Data doc changes for 2.5 (#36224)

#35749
#35751
#35753
#35755
#35757
#36018
#36105
#36121
#36144
#36145
#36162
#36124

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
maxpumperla added a commit that referenced this pull request Jun 10, 2023
* [Data] [Docs] Ray Data doc changes for 2.5 (#36224)

#35749
#35751
#35753
#35755
#35757
#36018
#36105
#36121
#36144
#36145
#36162
#36124

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: Hao Chen <[email protected]>

* [docs] relax kapa loading scheme (#36201)

Signed-off-by: Max Pumperla <[email protected]>

* Revert "[Data] [Docs] Ray Data doc changes for 2.5 (#36224)"

This reverts commit 48a6c26.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: amogkam <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: Max Pumperla <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
Co-authored-by: Max Pumperla <[email protected]>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
The "Loading data" guide contains verbose examples and disorganized subsections. This PR abridges the guide and restructures the content.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
Signed-off-by: e428265 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants