[Data] Add Dataset.write_sql
#38544
Conversation
Signed-off-by: Balaji Veeramani <[email protected]>
Looks solid to me as a starting point. Can you publish and trigger the wheel build so we can test it out on a real workload?
Signed-off-by: Balaji Veeramani <[email protected]>
Merge …/ray into bveeramani/write-sql. Signed-off-by: Balaji Veeramani <[email protected]>
Use ray-project/ray#38544 to write out the embeddings. This has some advantages: the user doesn't need to specify the parallelism, and they don't need to trigger the computation with `.count()`, since a write is an action -- so this is less error-prone. It is also less code. We should wait to merge this PR until Ray 2.7 is released with ray-project/ray#38544 merged. Signed-off-by: Goku Mohandas <[email protected]> Co-authored-by: Goku Mohandas <[email protected]>
So this is now working for me, thanks a lot for fixing it! One thing we should improve before merging is the error handling. I noticed that if the table does not exist, the write silently fails and writes nothing, which is not good and can lead to data loss. To fix it, I think we should do two things: improve the error handling and catch exceptions better, and also potentially return some info about how many rows have been written (e.g. in the form of a status/info dict that we could add more stuff to in the future). The former we should do for sure, the latter only if it doesn't involve too many changes for now :) Here is a repro of the silent failure in sqlite3 -- let's also add a test for it:

In [1]: import sqlite3

In [2]: import ray

In [3]: connection = sqlite3.connect("/tmp/db.sqlite")

In [4]: dataset = ray.data.from_items(
   ...:     [{"string": "spam", "number": 0}, {"string": "ham", "number": 1}]
   ...: )

In [5]: dataset.write_sql(
   ...:     "INSERT INTO test VALUES(?, ?)", lambda: sqlite3.connect("/tmp/db.sqlite")
   ...: )
Add chunking Signed-off-by: Balaji Veeramani <[email protected]>
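The chunking commit above can be sketched as batching rows before each `executemany` call, so a single statement never carries the whole block. A minimal standalone sketch (the helper name and batch-size parameter here are hypothetical, not Ray's actual API):

```python
import sqlite3


def write_in_chunks(connection, sql, rows, max_rows_per_write=2):
    # Hypothetical helper: insert rows in fixed-size batches so each
    # executemany call handles at most max_rows_per_write rows.
    for start in range(0, len(rows), max_rows_per_write):
        connection.executemany(sql, rows[start : start + max_rows_per_write])
    connection.commit()


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE test(string TEXT, number INTEGER)")
write_in_chunks(
    connection,
    "INSERT INTO test VALUES(?, ?)",
    [("spam", 0), ("ham", 1), ("eggs", 2)],
)
count = connection.execute("SELECT COUNT(*) FROM test").fetchone()[0]
connection.close()
```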
Remove extra file Signed-off-by: Balaji Veeramani <[email protected]>
@pcmoritz fixed the issue with error messages. Here's what the error looks like now:

import sqlite3
import ray

connection = sqlite3.connect("/tmp/eggs.db")
dataset = ray.data.range(1)
dataset.write_sql(
    "INSERT INTO test VALUES(?)", lambda: sqlite3.connect("/tmp/eggs.db")
)
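On the rows-written idea from the review discussion: a minimal standard-library sketch of how a write task could report how many rows it wrote (the info dict shape is hypothetical, not Ray's API). `sqlite3` cursors track the rows affected by an `executemany` call via `rowcount`, which a status dict could aggregate:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE test(value INTEGER)")
cursor = connection.executemany(
    "INSERT INTO test VALUES(?)", [(0,), (1,), (2,)]
)
connection.commit()
# Hypothetical status payload a write task could return; more fields
# could be added to this dict in the future.
info = {"num_rows_written": cursor.rowcount}
connection.close()
```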
Add docstring Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Address review comments Signed-off-by: Balaji Veeramani <[email protected]>
Remove max_rows_per_write parameter Signed-off-by: Balaji Veeramani <[email protected]>
Dataset.write_sql was added a while ago in #38544, but it wasn't documented in the API reference. This PR adds it. Signed-off-by: Balaji Veeramani <[email protected]>
Why are these changes needed?
Writing data back to databases is common in many applications, such as LLM workloads. For example, you might want to write vector indices back to a database like https://github.com/pgvector/pgvector. To support this use case, this PR adds an API for writing Datasets to SQL databases.
Related issue number
Closes #38242
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.