Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] [strict mode] Allow returning lists instead of arrays for numpy batches #34734

Merged
merged 5 commits into from
Apr 25, 2023

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Apr 25, 2023

Why are these changes needed?

Allow map_batches UDFs to return {"foo": [1, 2, 3]} in addition to {"foo": np.array([1, 2, 3])} by implicitly casting lists to arrays.

Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
@amogkam
Copy link
Contributor

amogkam commented Apr 25, 2023

awesome, this is great! are the internal numpy arrays converted back into python lists when iterating over the outputs via iter_batches? Otherwise, I can imagine this may be confusing

@pcmoritz
Copy link
Contributor

Wow, this is genius, maybe it allows us to have the best of all worlds :D

@ericl
Copy link
Contributor Author

ericl commented Apr 25, 2023

awesome, this is great! are the internal numpy arrays converted back into python lists when iterating over the outputs via iter_batches? Otherwise, I can imagine this may be confusing

You still get np.arrays as output, but those are also Iterable (so if you try to use it as a list, it might be 95% the same?). Not sure if we can do better here without something like type extensions.

@pcmoritz
Copy link
Contributor

One drastic possibility: Always convert np arrays of objects to lists. I.e. if map_batches returns a list, convert to np array of objects under the hood (this PR), if dataset has np arrays of objects, expose them to map_batches as a python list on the reading path. It is unconventional. We have to be careful. But it might be very ergonomic. If we do this, we would need to make it very clear in the docs, and tell people how to keep using numpy arrays of objects if they want to. If we do this, it needs careful user testing :)

@pcmoritz
Copy link
Contributor

Alternatively, we document well how to use .tolist() on the reading path.

@ericl
Copy link
Contributor Author

ericl commented Apr 25, 2023

Yeah, I'm kind of skeptical of that because of the overhead. I think converting a dense numpy array into a python list may be zero copy if you use slice views, but if the user tries to concatenate that into a numpy array again that would almost certainly induce a copy of all the data.

If we have to choose one format, the numpy one is the one that allows the best performance in all cases.

@pcmoritz
Copy link
Contributor

Sounds good to me. When I was playing around with the hugging face code, the return codepath was the most annoying thing to deal with, which is 100% addressed by this PR. Great job.

The .tolist() stuff for the reading path is very bearable I think -- if we document it well I think it is a good solution :)

@pcmoritz
Copy link
Contributor

pcmoritz commented Apr 25, 2023

On the efficiency, one thought though: I'm pretty sure if you convert to a list, it will just do shallow copies of the objects and numpy arrays of objects are already not-so-performant (because the values are boxed, there is no vectorization and little cache locality). Just wanted to throw this out. So I would mostly focus on user experience here and less on efficiency (if we still want to experiment with this option).

@pcmoritz
Copy link
Contributor

The main argument against having special handling for np arrays of objects for me would be it will be inconsistent between numpy arrays of int / float / etc and numpy arrays of objects, so might be confusing :)

@pcmoritz
Copy link
Contributor

pcmoritz commented Apr 25, 2023

Ah I think I get your argument better now. If we do the conversion by default / standardize too much on this, all our users' data will be boxed and nobody will use numpy arrays of primitive types.

@@ -51,17 +51,29 @@ def validate_batch(batch: Block) -> None:
)

if isinstance(batch, collections.abc.Mapping):
for key, value in batch.items():
if not isinstance(value, np.ndarray):
for key, value in list(batch.items()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why you need the list( here -- shouldn't .items() always return an object that can be iterated on with the for key, value in batch.items() pattern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to avoid mutating the dict we are iterating over. Though, maybe in this case it is ok, since it doesn't change the keys.

)
if not isinstance(value, np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isinstance(value, list) would be a little clearer here (or is there a difference?)

@ericl
Copy link
Contributor Author

ericl commented Apr 25, 2023

Yup the issue precisely is how to "unbox" the array data without too much overheads.

If you default to arrays, then:

  • you can call arr.tolist() if you need a list (cheap)
  • otherwise keep it an array (cheap)

If you default to list, then:

  • you need to call np.stack(arr) if you need an array (expensive)
  • otherwise keep it a list (cheap)

@ericl ericl merged commit 74ddaaa into ray-project:master Apr 25, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…y batches (ray-project#34734)

Allow map_batches UDFs to return {"foo": [1, 2, 3]} in addition to {"foo": np.array([1, 2, 3])} by implicitly casting lists to arrays.

Signed-off-by: Jack He <[email protected]>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…y batches (ray-project#34734)

Allow map_batches UDFs to return {"foo": [1, 2, 3]} in addition to {"foo": np.array([1, 2, 3])} by implicitly casting lists to arrays.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants