-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data][Docs] Revise "Loading data" #36144
Conversation
Signed-off-by: Balaji Veeramani <[email protected]>
Can you add a PR description? |
doc/source/data/loading-data.rst
Outdated
|
||
Create a dataset from a range of integers, packing this integer range into | ||
ndarrays of the provided shape. | ||
ds = ray.data.read_images("/tmp/batoidea/JPEGImages") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious, does these directory exist in the CI env?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but I'm planning on adding it in a follow-up PR. Thinking of download the s3:https://ray-example-data
bucket to CI.
@@ -618,7 +542,7 @@ Call :func:`~ray.data.read_sql` to read data from a database that provides a | |||
|
|||
pip install mysql-connector-python | |||
|
|||
Then, define your connection login and query the database. | |||
Then, define your connection logic and query the database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the following example will create a new connection for each ray.data.read_sql
. This is an anti-pattern in practice. Do we want to update it to re-use one connection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's necessary to create a new connection because connections aren't thread or process safe. So, we can't share a connection across read tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those read_sql calls are sequential. So multi-threading isn't a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I'm misunderstanding? The read_sql
calls are sequential, but read_sql
creates connections in read tasks. We create a connection in each read task so that we can read the database in parallel.
Co-authored-by: Hao Chen <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edits look good, but we also have to somehow retain the performance section. The most common pain point from users is how to obtain good performance from Ray Data, so we have to be deliberate in addressing this need.
For performance, we can just link to the appropriate section in the performance guide in the intro? https://docs.ray.io/en/master/data/performance-tips.html. And make sure that information is up to date. Don't think we need to have it in both places |
Currently, the performance page is a mis-mash of random / advanced tips. It certainly needs to be updated to focus on the basics for loading / transforming. |
Signed-off-by: Balaji Veeramani <[email protected]>
We're planning on revising the performance page in the near future. I'll leave the performance considerations in for now, and remove it once we've revised the performance page. |
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great!!
Just last thing, per @ericl's comment let's add a "Creating synthetic datasets" section at the bottom that shows range, range_table, and range_tensor (just like what we have currently), and specifies this is useful for performance benchmarking.
Signed-off-by: Balaji Veeramani <[email protected]>
As Amog's comment above suggested could we add range() and range_tensor() back as recommendations for creating large synthetic datasets for performance testing? |
Signed-off-by: Balaji Veeramani <[email protected]>
@ericl yeah, just pushed the changes. |
The "Loading data" guide contains verbose examples and disorganized subsections. This PR abridges the guide and restructures the content. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: Hao Chen <[email protected]> Signed-off-by: amogkam <[email protected]>
#35749 #35751 #35753 #35755 #35757 #36018 #36105 #36121 #36144 #36145 #36162 #36124 --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Co-authored-by: angelinalg <[email protected]> Co-authored-by: Hao Chen <[email protected]>
ray-project#35749 ray-project#35751 ray-project#35753 ray-project#35755 ray-project#35757 ray-project#36018 ray-project#36105 ray-project#36121 ray-project#36144 ray-project#36145 ray-project#36162 ray-project#36124 --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Co-authored-by: angelinalg <[email protected]> Co-authored-by: Hao Chen <[email protected]>
* [Data] [Docs] Ray Data doc changes for 2.5 (#36224) #35749 #35751 #35753 #35755 #35757 #36018 #36105 #36121 #36144 #36145 #36162 #36124 --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Co-authored-by: angelinalg <[email protected]> Co-authored-by: Hao Chen <[email protected]>
* [Data] [Docs] Ray Data doc changes for 2.5 (#36224) #35749 #35751 #35753 #35755 #35757 #36018 #36105 #36121 #36144 #36145 #36162 #36124 --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Co-authored-by: angelinalg <[email protected]> Co-authored-by: Hao Chen <[email protected]> * [docs] relax kapa loading scheme (#36201) Signed-off-by: Max Pumperla <[email protected]> * Revert "[Data] [Docs] Ray Data doc changes for 2.5 (#36224)" This reverts commit 48a6c26. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: amogkam <[email protected]> Signed-off-by: Amog Kamsetty <[email protected]> Signed-off-by: Max Pumperla <[email protected]> Co-authored-by: Amog Kamsetty <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]> Co-authored-by: Hao Chen <[email protected]> Co-authored-by: Max Pumperla <[email protected]>
The "Loading data" guide contains verbose examples and disorganized subsections. This PR abridges the guide and restructures the content. --------- Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: Hao Chen <[email protected]> Signed-off-by: e428265 <[email protected]>
Why are these changes needed?
The "Loading data" guide contains verbose examples and disorganized subsections. This PR abridges the guide and restructures the content.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.