[Python] Pickling a sliced array serializes all the buffers #26685
Comments
Maarten Breddels / @maartenbreddels: Two workarounds I came up with:
%%timeit
s = pa.serialize(ar.slice(10, 1))
ar2 = pa.deserialize(s.to_buffer())
790 µs ± 578 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import vaex.arrow.convert
----
def trim_buffers(ar):
    '''
    >>> ar = pa.array([1, 2, 3, 4], pa.int8())
    >>> ar.nbytes
    4
    >>> ar.slice(2, 2)  # doctest: +ELLIPSIS
    <pyarrow.lib.Int8Array object at 0x...>
    [
      3,
      4
    ]
    >>> ar.slice(2, 2).nbytes
    4
    >>> trim_buffers(ar.slice(2, 2)).nbytes
    2
    >>> trim_buffers(ar.slice(2, 2))  # doctest: +ELLIPSIS
    <pyarrow.lib.Int8Array object at 0x...>
    [
      3,
      4
    ]
    '''
    schema = pa.schema({'x': ar.type})
    with pa.BufferOutputStream() as sink:
        with pa.ipc.new_stream(sink, schema) as writer:
            writer.write_table(pa.table({'x': ar}))
    with pa.BufferReader(sink.getvalue()) as source:
        with pa.ipc.open_stream(source) as reader:
            table = reader.read_all()
            assert table.num_columns == 1
            assert table.num_rows == len(ar)
            trimmed_ar = table.column(0)
    if isinstance(trimmed_ar, pa.ChunkedArray):
        assert len(trimmed_ar.chunks) == 1
        trimmed_ar = trimmed_ar.chunks[0]
    return trimmed_ar
----
%%timeit
vaex.arrow.convert.trim_buffers(ar.slice(10, 1))
202 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Maarten Breddels / @maartenbreddels: I cannot reproduce the previous timings (I guess I had a debug install without optimization), but this one seems fastest:
%%timeit
pa.concat_arrays([ar.slice(10, 1)])
2.16 µs ± 9.22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) (vs 8 and 125 µs using IPC and (de)serialize respectively)
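For context, a minimal end-to-end sketch of this workaround on a pyarrow version affected by the issue (exact sizes are illustrative and will vary by version):
----
import pickle

import pyarrow as pa

ar = pa.array(range(1_000_000), pa.int64())
sliced = ar.slice(10, 1)

# On affected versions, pickling the slice drags the whole ~8 MB
# data buffer along with it.
print(len(pickle.dumps(sliced)))

# Concatenating first copies just the sliced-out values, so the
# pickle payload shrinks to a few hundred bytes.
print(len(pickle.dumps(pa.concat_arrays([sliced]))))
----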
Joris Van den Bossche / @jorisvandenbossche: For a moment, I naively thought this would just be adding a … (see arrow/python/pyarrow/array.pxi, line 597 at c43fab3).
In the IPC code, the truncation is handled in the Visit methods of RecordBatchSerializer (e.g. for primitive arrays: arrow/cpp/src/arrow/ipc/writer.cc, line 331 at c43fab3).
Are there other utilities in C++ that can be reused to do this truncation? Or could we "just" use the IPC serialization under the hood for pickling in Python?
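As a rough sketch of the "IPC under the hood" idea from the comment above (the helper names `_reduce_array` and `_reconstruct_array` are hypothetical, not pyarrow API):
----
import pyarrow as pa

def _reconstruct_array(ipc_buffer):
    # Read the single-batch IPC stream back and return its only column.
    with pa.ipc.open_stream(ipc_buffer) as reader:
        return reader.read_next_batch().column(0)

def _reduce_array(ar):
    # Write the array through the IPC stream format; the IPC writer
    # truncates the buffers of sliced arrays on the way out.
    batch = pa.record_batch([ar], names=['x'])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return _reconstruct_array, (sink.getvalue(),)
----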
Clark Zinzow: First, is there any in-progress work on this? If not, I could take this on in the next month or so; the two implementation routes that I've thought of when looking at the IPC serialization code (these are basically the same routes that @jorisvandenbossche pointed out a year ago) are: (1) factoring the IPC buffer-truncation logic out so the pickle path can reuse it, and (2) using the IPC serialization directly under the hood for pickling.
Joris Van den Bossche / @jorisvandenbossche: (I am removing the "In progress" label, just to make it a bit easier to keep track of this as an open issue in JIRA, but can change that again once there is an open PR.)
Clark Zinzow: I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger Arrow code impact, and we'll continue having two serialization paths to maintain, but it shouldn't result in any serialized payload bloat.
Clark Zinzow: @jorisvandenbossche Do you know what the 11.0.0 release timeline is?
@h-vetinari: @raulcd @kou @pitrou @jorisvandenbossche That means Arrow 6 does not get migrated against new library releases anymore (e.g. abseil, grpc, protobuf, re2, etc.), and so we'd be making Ray essentially impossible to install with a current set of dependencies. By extension, this would become very painful for Ray users on Windows, as their environments become hard or impossible to resolve. For these reasons, I'm opposed to adding that cap in conda-forge, but of course we don't want to ship broken packages either.
@h-vetinari This is definitely an issue that deserves fixing; it just needs someone to work on it.
I am not sure this issue prevents Ray from updating. First, this issue is not causing a "broken" package; AFAIK it's only a performance issue (that can be worked around with a copy). Further, looking at the Ray PRs linked above, it seems they actually implemented a workaround on their side by using IPC to support Arrow 7+ (ray-project/ray#29055). And the comment in Ray's setup.py mentions "Serialization workaround for pyarrow 7.0.0+ doesn't work for Windows.", so it mentions Windows, while this issue is not Windows-specific. Of course we should still fix this issue, because it is a seriously annoying and unexpected behaviour of pickling Arrow data, but I'm not fully sure this is the blocker for updating Ray's pyarrow dependency on conda-forge's side.
Looking more closely at the Ray PR I linked, it indeed mentions (after being merged) that this was failing Windows CI; it was then reverted, and later added back in steps, with the restriction on the pyarrow version applied just for Windows.
@clarkzinzow apologies for the very slow reply on your earlier message about having a working implementation. If you would still have the time to create a PR for this, that's certainly welcome! (or push the branch somewhere, in case someone else might be able to finish it)
Hi @jorisvandenbossche, apologies for the delays on my end! The application-level workaround for Ray ended up sufficing, so I never got back to submitting a PR, and I must have missed these notifications. I have an out-of-date branch that works e2e for option (1), where the buffer traversal + truncation is shared by both the IPC serialization and pickle paths. This has a few key pros over option (2) that I believe make it a compelling choice: …
But this still has a few TODOs: …
I don't think that I'll have the bandwidth to get this across the line in the near term, but I think that most of the difficult work is done, if someone else has the time! cc @anjakefala
Branch: main...clarkzinzow:arrow:arrow-10739-pickle-buffer-truncation-fix
Existing pickling serialises the whole buffer, even if the Array is sliced. Now we use Arrow's buffer truncation implemented for IPC serialization for pickling. Relies on a RecordBatch wrapper, adding ~230 bytes to the pickled payload per Array chunk. Closes apache#26685
I had an already-started branch for Option (2) (directly using IPC serialization for pickling), and confirmed that it does not support Pickle Protocol 5. That made that approach untenable. @pitrou, would you be able to take a look at @clarkzinzow's branch? It aims to be an implementation of Option (1).
Is it a decent starting point? I would be happy to adopt it, and break it up into smaller PRs, if it looks like a promising approach.
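For background, Pickle Protocol 5 (PEP 574) lets large buffers travel out-of-band instead of being copied into the pickle stream; a minimal, Arrow-independent illustration:
----
import pickle

# Buffers wrapped in PickleBuffer are handed to buffer_callback rather
# than being copied into the pickle stream, enabling zero-copy handoff
# of large payloads.
big = bytearray(b"x" * 1_000_000)
out_of_band = []
data = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                    buffer_callback=out_of_band.append)
print(len(data))  # tiny: only a reference is stored in-band
restored = pickle.loads(data, buffers=out_of_band)
print(bytes(memoryview(restored)[:3]))  # b'xxx'
----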
Hi @anjakefala, while I agree that Option (1) is a good approach, I'm a bit skeptical about the proposed implementation. Two general comments: …
I might be missing something, though. Are there some concrete constraints mandating this particular implementation?
Just to be fair, @clarkzinzow noted that as one of the remaining TODOs of his branch.
That is also not really clear to me. Above, the following brief explanation was given (#26685 (comment)): …
@clarkzinzow can you explain your rationale for going with this payload class?
@jorisvandenbossche @pitrou IIRC I started with … One option would be maintaining an explicit shared-pointer stack of in-progress …
With that change, I believe that …
I have no idea why you would need that. I would expect this API to be a single function call and the underlying implementation is free to keep temporary state while moving recursively along the tree.
@pitrou The API would be a single function call. I'm referring to internal implementation details of the aggregator that's passed to the array visitor. The branch I linked refactors the IPC serialization code into a generic array visitor that takes a … The API exposed to language frontends would just be a single function call returning that payload (…).
I don't understand why you want a visitor API? Typically, slicing the buffers should be a cheap operation, so you can just do it all at once.
The visitor API isn't exposed to array serialization users; it's an (existing) implementation detail of the serialization. All that this PR does is factor out the IPC buffer aggregation so we can inject a different aggregator for frontend language serialization (pickle).
I am not talking about …
Oh, that makes more sense! So you're saying that an API should be exposed on … There's still array-type-specific buffer truncation logic that would require an array-type-based visitor, and it's less clear to me how to marry that array visitor with the IPC serialization code in order to share that buffer truncation logic without making the IPC serialization code less efficient. Right now the IPC serialization truncates buffers and aggregates them into a flat list in a single recursive pass over the array tree, while factoring that buffer truncation logic out into an …
I think we can definitely accept two passes over the array tree. There are existing benchmarks that can validate this, so we can refine later.
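To make the per-type truncation being discussed concrete, here is a rough Python sketch (not the C++ visitor) for the simplest case of a fixed-width primitive array without nulls; real code needs per-type handling of validity bitmaps, offsets buffers, and child arrays:
----
import pyarrow as pa

def truncated_data_buffer(ar):
    # Slice the data buffer down to the window that (offset, length)
    # actually references. Validity bitmaps can't be byte-sliced like
    # this when the offset isn't a multiple of 8, which is part of what
    # the type-specific visitor logic has to handle.
    byte_width = ar.type.bit_width // 8
    data = ar.buffers()[1]
    return data.slice(ar.offset * byte_width, len(ar) * byte_width)

sliced = pa.array(range(100), pa.int64()).slice(90, 5)
print(sliced.buffers()[1].size)             # full parent buffer: 800 bytes
print(truncated_data_buffer(sliced).size)   # trimmed: 40 bytes
----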
Thank you @clarkzinzow and @pitrou! I will put forth a design proposal. =)
Toward fixing #38300. PR #29993 added a local Ray fix for issue apache/arrow#26685, but at the time Windows failed the tests with pyarrow 7. In issue #38300 the suggested fix was to release the pin. Signed-off-by: mattip <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
Hello @anjakefala! May I know the status of the proposal for this issue?
If a large array is sliced and pickled, it seems the full buffer is serialized; this leads to excessive memory usage and data transfer when using multiprocessing or dask.
I think this makes sense if you know Arrow, but it's kind of unexpected as a user.
Is there a workaround for this? For instance, copying an Arrow array to get rid of the offset and trimming the buffers?
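A short sketch of why this happens: slicing is zero-copy, so the slice keeps the parent's buffers and only records an offset, and naively pickling those buffers captures everything:
----
import pyarrow as pa

ar = pa.array(range(1000), pa.int64())
sliced = ar.slice(500, 10)

# The slice shares the parent's data buffer; only the offset differs.
print(sliced.offset)                                           # 500
print(sliced.buffers()[1].address == ar.buffers()[1].address)  # True
----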
Reporter: Maarten Breddels / @maartenbreddels
Assignee: Clark Zinzow
Related issues:
Note: This issue was originally created as ARROW-10739. Please see the migration documentation for further details.