
[Python] Pickling a sliced array serializes all the buffers #26685

Open
asfimport opened this issue Nov 25, 2020 · 37 comments

Comments

@asfimport
Collaborator

asfimport commented Nov 25, 2020

If a large array is sliced and pickled, it seems the full buffer is serialized. This leads to excessive memory usage and data transfer when using multiprocessing or dask.

>>> import pyarrow as pa
>>> ar = pa.array(['foo'] * 100_000)
>>> ar.nbytes
700004
>>> import pickle
>>> len(pickle.dumps(ar.slice(10, 1)))
700165

NumPy, for instance, only serializes the sliced data:
>>> import numpy as np
>>> ar_np = np.array(ar)
>>> ar_np
array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
>>> import pickle
>>> len(pickle.dumps(ar_np[10:11]))
165

I think this makes sense if you know Arrow, but it is kind of unexpected as a user.

Is there a workaround for this? For instance, copying an Arrow array to get rid of the offset and trimming the buffers?

Reporter: Maarten Breddels / @maartenbreddels
Assignee: Clark Zinzow

Related issues:

Note: This issue was originally created as ARROW-10739. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
We truncate the buffers on sliced arrays when writing record batches to the IPC protocol, so the buffers should be similarly truncated in the case of pickling. 
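
For illustration, the truncation Wes describes can be observed from Python by round-tripping the slice through the IPC stream format. A minimal sketch, assuming only pyarrow's public IPC API; exact byte counts depend on the pyarrow version:

import pickle
import pyarrow as pa

ar = pa.array(['foo'] * 100_000)
sliced = ar.slice(10, 1)

# Writing the slice as a record batch to the IPC stream format truncates
# the buffers to the sliced range.
batch = pa.RecordBatch.from_arrays([sliced], names=['x'])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

print(sink.getvalue().size)       # small: the sliced data plus IPC metadata
print(len(pickle.dumps(sliced)))  # large: pickle still carries the full parent buffers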

@asfimport
Collaborator Author

Maarten Breddels / @maartenbreddels:
Ok, good to know.

Two workarounds I came up with:

1) pyarrow (de)serialize:

%%timeit
s = pa.serialize(ar.slice(10, 1))
ar2 = pa.deserialize(s.to_buffer())
790 µs ± 578 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2) an IPC round trip (vaex.arrow.convert.trim_buffers):

import vaex.arrow.convert

def trim_buffers(ar):
    '''
    >>> ar = pa.array([1, 2, 3, 4], pa.int8())
    >>> ar.nbytes
    4
    >>> ar.slice(2, 2)  # doctest: +ELLIPSIS
    <pyarrow.lib.Int8Array object at 0x...>
    [
      3,
      4
    ]
    >>> ar.slice(2, 2).nbytes
    4
    >>> trim_buffers(ar.slice(2, 2)).nbytes
    2
    >>> trim_buffers(ar.slice(2, 2))  # doctest: +ELLIPSIS
    <pyarrow.lib.Int8Array object at 0x...>
    [
      3,
      4
    ]
    '''
    schema = pa.schema({'x': ar.type})
    with pa.BufferOutputStream() as sink:
        with pa.ipc.new_stream(sink, schema) as writer:
            writer.write_table(pa.table({'x': ar}))
    with pa.BufferReader(sink.getvalue()) as source:
        with pa.ipc.open_stream(source) as reader:
            table = reader.read_all()
            assert table.num_columns == 1
            assert table.num_rows == len(ar)
            trimmed_ar = table.column(0)
    if isinstance(trimmed_ar, pa.ChunkedArray):
        assert len(trimmed_ar.chunks) == 1
        trimmed_ar = trimmed_ar.chunks[0]
    return trimmed_ar

%%timeit
vaex.arrow.convert.trim_buffers(ar.slice(10, 1))
202 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Note that pyarrow.serialize is deprecated, so best not to use that as a workaround.

@asfimport
Collaborator Author

Maarten Breddels / @maartenbreddels:
Thanks Joris!

I cannot reproduce the previous timings (I guess I had a debug install without optimization), but this one seems fastest:

%%timeit
pa.concat_arrays([ar.slice(10, 1)])
2.16 µs ± 9.22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

(vs 8 µs and 125 µs using IPC and (de)serialize, respectively)
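
To connect this back to the original report, forcing the copy before pickling also shrinks the pickle payload. A minimal sketch; the exact byte counts will vary:

import pickle
import pyarrow as pa

ar = pa.array(['foo'] * 100_000)

# Pickling the raw slice carries the full parent buffers.
print(len(pickle.dumps(ar.slice(10, 1))))   # roughly 700 kB

# concat_arrays copies only the sliced range into fresh, trimmed buffers.
trimmed = pa.concat_arrays([ar.slice(10, 1)])
print(len(pickle.dumps(trimmed)))           # a few hundred bytes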

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Seems like this would be good to fix?

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
That would indeed be good.

For a moment, I naively thought this would just be adding a SliceBuffer call when wrapping the buffer in _reduce_array_data (buffers.append(pyarrow_wrap_buffer(buf))). But of course, the offset and length to slice the buffer with depend on the array type and bit width, or whether it's a bitmap, etc.
In the IPC code, the truncation is handled in the Visit methods of RecordBatchSerializer (e.g. for primitive arrays: data = SliceBuffer(data, byte_offset, buffer_length);), and this is quite a lot of code to get right for all the different data types. Something we shouldn't start replicating in Cython, I think.

Are there other utilities in C++ that can be reused to do this truncation? Or could we "just" use the IPC serialization under the hood for pickling in Python?

@asfimport
Collaborator Author

Jim Crist-Harif:
We're running into this in Dask right now when attempting to integrate Pandas string[pyarrow], since pickling pyarrow string arrays ends up pickling all the data even if the result only includes a small slice. I'm willing to hack on this if no one else has the bandwidth, but upon initial inspection this looks a bit more complicated than I'd like to bite off as a new-ish Arrow contributor. With some guidance on the best path forward, though, I could possibly get something working. @jorisvandenbossche any further thoughts on a solution here?

@asfimport
Collaborator Author

Krisztian Szucs / @kszucs:
Postponing to 10.0 since there is no PR available at the moment.

@asfimport
Collaborator Author

Philipp Moritz / @pcmoritz:
In Ray we are also planning to work around this ray-project/ray#22891 – it would be wonderful to see this fixed in Arrow :)

@asfimport
Collaborator Author

Clark Zinzow:
Hey folks, I'm the author of the Ray PR that @pcmoritz linked to, which essentially ports Arrow's buffer truncation in the IPC serialization path to Python as a custom pickle serializer. I'd be happy to help push on getting this fixed upstream for Arrow 10.0.0.

First, is there any in-progress work by [~jcrist] or others?

If not, I could take this on in the next month or so; the two implementation routes that I've thought of when looking at the IPC serialization code (these are basically the same routes that @jorisvandenbossche pointed out a year ago) are:

  1. Refactor the IPC writer's per-type buffer truncation logic into utilities that can be shared by the IPC serialization path and the pickle serialization path.

  2. Directly use the Arrow IPC format in its pickle serialization, where the pickle reducer is a light wrapper around the IPC serialization and deserialization hooks.

Do either of these routes sound appealing? (2) has the added benefits of consolidating the serialization schemes on the IPC format and pushing all expensive serialization code into C++ land, but it is a larger change and would involve otherwise-unnecessary wrapping of plain Arrow (chunked) arrays in record batches to match the IPC format, so maybe (1) is the better option. (A rough sketch of what (2) could look like follows below.)
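
A rough sketch of what route (2) could look like, using only pyarrow's public IPC API; the helper names are hypothetical, not existing pyarrow internals:

import pyarrow as pa

def _reconstruct_array(ipc_bytes):
    # Hypothetical unpickling helper: read the single-column record batch back.
    with pa.ipc.open_stream(pa.BufferReader(ipc_bytes)) as reader:
        return reader.read_next_batch().column(0)

def _reduce_array(arr):
    # Hypothetical __reduce__ replacement: wrapping the (possibly sliced) array
    # in a RecordBatch lets the IPC writer truncate the buffers for us.
    batch = pa.RecordBatch.from_arrays([arr], names=['0'])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return _reconstruct_array, (sink.getvalue().to_pybytes(),)

# A custom pickler (e.g. via pickle.Pickler.reducer_override) could route
# pa.Array instances through _reduce_array.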

@asfimport
Collaborator Author

Clark Zinzow:
Ping on this, @amol- [~jcrist] are either of y'all actively working on this for Arrow 10.0.0? And if not, does option (1) that I gave in my previous comment sound like a good path forward?

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
[~clarkzinzow] sorry for the slow reply, several of us were on holidays. As far as I know, nobody is actively working on this, so a PR is certainly welcome! I think option (1) is a good path forward.

@asfimport
Collaborator Author

Clark Zinzow:
@jorisvandenbossche No worries! And sounds good. I should be able to start working on this in a few weeks; I will update this thread once I've started working on it and again once I have a PR out.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
That sounds good!

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
[~clarkzinzow] were you able to make some progress on this?

(I am removing the "In progress" label, just to make it a bit easier to keep track of this as open issue in JIRA, but can change that again once there is an open PR)

@asfimport
Collaborator Author

asfimport commented Sep 28, 2022

Clark Zinzow:
@jorisvandenbossche I did a quick implementation of (2), where the Arrow IPC format is used under the hood for pickle serialization, and confirmed that the buffer truncation works as expected. Although this is a far simpler solution than (1), the RecordBatch wrapper adds ~230 extra bytes to the pickled payload (per Array chunk) compared to current Arrow master, which can be pretty bad for the many-chunk and/or many-column case (order-of-magnitude larger serialized payloads). We could sidestep this issue by having Table, RecordBatch, and ChunkedArray port their __reduce__ to the Arrow IPC serialization as well, which should avoid this many-column and many-chunk blow-up, but there would still be the baseline ~230-byte bloat for ChunkedArray and Array that we might find untenable.

I can try to get a PR up for (2) either today or tomorrow while I start working on (1) in the background. (1) is going to have a much larger Arrow code impact + we'll continue having two serialization paths to maintain, but it shouldn't result in any serialized payload bloat.
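
The wrapping overhead mentioned above is easy to eyeball. An illustrative sketch; the exact figures depend on the pyarrow version:

import pickle
import pyarrow as pa

arr = pa.array([1], type=pa.int64())

# Plain pickle of a tiny array.
plain = len(pickle.dumps(arr))

# The same array wrapped in a RecordBatch and written to the IPC stream
# format, which is roughly what option (2) would pickle.
batch = pa.RecordBatch.from_arrays([arr], names=['x'])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
wrapped = sink.getvalue().size

print(plain, wrapped)  # the difference is the per-chunk schema/metadata overhead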

@asfimport
Collaborator Author

Apache Arrow JIRA Bot:
This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

@asfimport
Collaborator Author

Clark Zinzow:
Apologies for the delay on this one! I have an implementation of (1) working e2e locally now, where the buffer traversal + truncation is shared by both the IPC serialization and pickle paths, but I haven't had time to clean up the handling of dictionary arrays, which is currently a bit ugly. I'm hoping to get back to it this week and push up a PR.

@jorisvandenbossche Do you know what the 11.0.0 release timeline is?

@h-vetinari
Contributor

h-vetinari commented Jun 19, 2023

@raulcd @kou @pitrou @jorisvandenbossche
Do you have any insights into the situation here? This bug caused ray to pin arrow <7 on Windows, which puts us in a difficult spot in conda-forge, as we only support the last 4 major versions (already highly unusual in conda-forge, but owing to how quickly arrow progresses, to give projects a chance to catch up).

That means arrow 6 does not get migrated against new library releases anymore (e.g. abseil, grpc, protobuf, re2, etc.), and so we'd be making ray essentially impossible to install with a current set of dependencies. By extension, this would become very painful for ray users on Windows, as their environments become hard/impossible to resolve. For these reasons, I'm opposed to adding that cap in conda-forge, but of course we don't want to ship broken packages either.

CC @mattip @krfricke @clarkzinzow

@pitrou
Member

pitrou commented Jun 19, 2023

@h-vetinari This is definitely an issue that deserves fixing, it just needs someone to work on it.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 23, 2023

I am not sure this issue prevents Ray from updating. First, this issue does not cause a "broken" package; AFAIK it's only a performance issue (that can be worked around with a copy). Further, looking at the Ray PRs linked above, it seems they actually implemented a workaround on their side by using IPC to support Arrow 7+ (ray-project/ray#29055). And the comment in Ray's setup.py mentions "Serialization workaround for pyarrow 7.0.0+ doesn't work for Windows.", so it singles out Windows, while this issue is not Windows-specific.

Of course we should still fix this issue, because it is a seriously annoying and unexpected behaviour of pickling arrow data, but I'm not fully sure this is the blocker for updating Ray's pyarrow dependency on conda-forge's side.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 23, 2023

Looking more closely at the Ray PR I linked, it indeed mentions that it was failing Windows CI after being merged; it was then reverted and later added back in steps, with the pyarrow version restriction applied just for Windows.
But if the workaround to use IPC serialization instead of pickle doesn't work on Windows, it might be interesting to know why that is or which bug exactly causes it (it might be easier to fix).

@danepitkin danepitkin added this to the 14.0.0 milestone Jul 12, 2023
@anjakefala anjakefala self-assigned this Jul 12, 2023
@jorisvandenbossche
Member

> Apologies for the delay on this one! I have an implementation of (1) working e2e locally now, where the buffer traversal + truncation is shared by both the IPC serialization and pickle paths, but I haven't had time to clean up the handling of dictionary arrays, which is currently a bit ugly. I'm hoping to get back to it this week and push up a PR.

@clarkzinzow apologies for the very slow reply on your earlier message about having a working implementation. If you would still have the time to create a PR for this, that's certainly welcome! (or push the branch somewhere, in case someone else might be able to finish it)
In the meantime, @anjakefala is planning to make a PR for the simpler option (2), which shouldn't require too much custom code but at least already allows exploring a fix.

@clarkzinzow

clarkzinzow commented Sep 1, 2023

Hi @jorisvandenbossche, apologies for the delays on my end! The application-level workaround for Ray ended up sufficing, so I never got back to submitting a PR, and I must have missed these notifications.

I have an out-of-date branch that works e2e for option (1), where the buffer traversal + truncation is shared by both the IPC serialization and pickle paths: a BufferAggregator interface is introduced with an IPC implementation and an implementation that accumulates buffers into a payload (similar to the C data interface) that can be serialized with language-specific schemes by Arrow front-ends (e.g. pickle via Python).

This has a few key pros over option (2) that I believe make it a compelling choice:

  1. This does not break pickle protocol 5's out-of-band data support for Arrow data, which I actually think is table stakes for users sending Arrow data over the wire using pickle. Option (1) will allow the array buffers to be treated as out-of-band data, eliding 2 extra data copies in the e2e pickling flow compared to option (2).
  2. Option (1) does not require wrapping all Arrow data in a RecordBatch, which elides 230 redundant bytes per payload compared to option (2). For systems sending small chunks of Arrow data over the wire, this could be non-negligible.

But this still has a few TODOs:

  1. Rebase onto latest master and resolve any conflicts that pop up.
  2. Move BufferAggregator, ArrayBufferPayload, and ArraySerializerBufferAggregator definitions/implementations out of the IPC code; these could be top-level, under arrow/util, under a new directory, etc.
  3. Misc. cleanup (e.g. removing debug logging).

I don't think that I'll have the bandwidth to get this across the line in the near-term, but I think that most of the difficult work is done if someone else has the time! cc @anjakefala

main...clarkzinzow:arrow:arrow-10739-pickle-buffer-truncation-fix
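
For context on the protocol 5 point: out-of-band pickling lets the consumer collect the underlying buffers via buffer_callback instead of copying them into the pickle byte stream. A minimal sketch of that flow, independent of the branch above; whether a given buffer is actually emitted out-of-band depends on the pyarrow version:

import pickle
import pyarrow as pa

arr = pa.array(range(1_000_000), type=pa.int64())

buffers = []
# Protocol 5: objects that expose PickleBuffers hand them to buffer_callback
# rather than embedding them in the returned byte string.
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

restored = pickle.loads(payload, buffers=buffers)
assert restored.equals(arr)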

anjakefala added a commit to anjakefala/arrow that referenced this issue Sep 12, 2023
Existing pickling serialises the whole buffer, even if the Array is
sliced.

Now we use Arrow's buffer truncation implemented for IPC serialization
for pickling.

Relies on a RecordBatch wrapper, adding ~230 bytes to the
pickled payload per Array chunk.

Closes apache#26685
anjakefala added a commit to anjakefala/arrow that referenced this issue Oct 3, 2023
Existing pickling serialises the whole buffer, even if the Array is
sliced.

Now we use Arrow's buffer truncation implemented for IPC serialization
for pickling.

Relies on a RecordBatch wrapper, adding ~230 bytes to the
pickled payload per Array chunk.

Closes apache#26685
@jorisvandenbossche jorisvandenbossche modified the milestones: 14.0.0, 15.0.0 Oct 10, 2023
@anjakefala
Collaborator

I had an already-started branch for Option (2) (directly using IPC serialization for pickling), and confirmed that it does not support Pickle Protocol 5. That made that approach untenable.

@pitrou Would you be able to take a look at @clarkzinzow's branch? It aims to be an implementation for Option (1):

> Refactor the IPC writer's per-type buffer truncation logic into utilities that can be shared by the IPC serialization path and the pickle serialization path.

Is it a decent starting point? I would be happy to adopt it, and break it up into smaller PRs if it looks like a promising approach.

@pitrou
Member

pitrou commented Oct 17, 2023

Hi @anjakefala , while I agree the Option (1) is a good approach, I'm a bit skeptical about the proposed implementation. Two general comments:

  1. This should be a general utility function, not something in ipc
  2. I'm not sure what the point is of introducing an ArrayBufferPayload class that looks a lot like ArrayData. I would instead expect a function that takes an ArrayData (or Array) and returns a new ArrayData (or Array, respectively, depending on which one is more convenient) with its buffers adjusted.

I might be missing something, though. Are there some concrete constraints mandating this particular implementation?
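
To make the proposed shape concrete from the Python side: a purely hypothetical stand-in for such a buffer-adjusting function, using a concatenation copy only to demonstrate the intended semantics:

import pyarrow as pa

def with_adjusted_buffers(arr: pa.Array) -> pa.Array:
    # Hypothetical illustration of the utility described above: return an
    # equivalent array whose buffers cover only the sliced range. concat_arrays
    # is used here just to show the intended behaviour.
    return pa.concat_arrays([arr])

ar = pa.array([1, 2, 3, 4], pa.int8())
trimmed = with_adjusted_buffers(ar.slice(2, 2))
assert trimmed.to_pylist() == [3, 4]
# The values buffer of `trimmed` now covers only the two sliced elements,
# matching the trim_buffers() result shown earlier in the thread.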

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 17, 2023

> 1. This should be a general utility function, not something in ipc

Just to be fair, @clarkzinzow noted that as one of the remaining TODOs of his branch.

> 2. I'm not sure what the point is of introducing an ArrayBufferPayload class that looks a lot like ArrayData

That is also not really clear to me. Above, the following brief explanation was given (#26685 (comment)):

> a BufferAggregator interface is introduced with an IPC implementation and an implementation that accumulates buffers into a payload (similar to the C data interface) that can be serialized with language-specific schemes by Arrow front-ends (e.g. pickle via Python).

@clarkzinzow can you explain a bit your rationale of going with this payload class?

@clarkzinzow

> @clarkzinzow can you explain a bit your rationale of going with this payload class?

@jorisvandenbossche @pitrou IIRC I started with ArrayData as the payload aggregation class, but ended up wanting a parent pointer to make the visitor-based aggregation clean, so we could hop back to the parent payload at Seal() time. I thought that the payload class would end up deviating in other ways, but it ended up being very close to ArrayData, so I do think that we should find a way to make ArrayData work.

One option would be maintaining an explicit shared-pointer stack of in-progress ArrayDatas in the aggregator, which would obviate the need for the parent pointer:

  • on WithArray, we create a new ArrayData, link it as the child of the current_, push the new one onto the stack and set it to be current_,
  • on Seal(), we pop the current_ ArrayData off the stack, and set current_ to be the new top of the stack (the parent).

With that change, I believe that ArrayData would suffice as the buffer aggregation payload.
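
An illustration of the stack bookkeeping described above, written as Python pseudocode; the real aggregator would live in C++ and all names here are illustrative only:

class ArrayDataAggregator:
    """Accumulate a tree of ArrayData-like nodes without a parent pointer."""

    def __init__(self):
        self._stack = []     # in-progress nodes, innermost last
        self.current = None  # node currently receiving buffers
        self.root = None

    def with_array(self, array_type):
        node = {"type": array_type, "buffers": [], "children": []}
        if self.current is not None:
            self.current["children"].append(node)
        else:
            self.root = node
        self._stack.append(node)
        self.current = node

    def add_buffer(self, buf):
        self.current["buffers"].append(buf)

    def seal(self):
        # Pop the finished node and hop back to its parent.
        self._stack.pop()
        self.current = self._stack[-1] if self._stack else None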

@pitrou
Member

pitrou commented Oct 17, 2023

> One option would be maintaining an explicit shared-pointer stack of in-progress ArrayDatas in the aggregator, which would obviate the need for the parent pointer:

I have no idea why you would need that. I would expect this API to be a single function call and the underlying implementation is free to keep temporary state while moving recursively along the tree.

@clarkzinzow

clarkzinzow commented Oct 17, 2023

@pitrou The API would be a single function call. I'm referring to internal implementation details of the aggregator that's passed to the array visitor.

The branch I linked refactors the IPC serialization code into a generic array visitor that takes a BufferAggregator, of which there are two implementations: one for IPC and one for serializing to some payload that can be used by language frontends for their own serialization schemes. I'm saying that the latter can explicitly maintain a stack internally, which would allow us to use ArrayData as the payload.

The API exposed to language frontends would just be a single function call returning that payload (ArrayData).

@pitrou
Member

pitrou commented Oct 17, 2023

I don't understand why you want a visitor API? Typically, slicing the buffers should be a cheap operation, so you can just do it all at once.

@clarkzinzow

clarkzinzow commented Oct 17, 2023

The visitor API isn't exposed to array serialization users, it's an (existing) implementation detail of the serialization. All that this PR does is factor out the IPC buffer aggregation so we can inject a different aggregator for frontend language serialization (pickle).

@pitrou
Member

pitrou commented Oct 17, 2023

I am not talking about RecordBatchSerializer, but about a new API to adjust buffers inside an ArrayData.

@clarkzinzow

clarkzinzow commented Oct 17, 2023

Oh, that makes more sense! So you're saying that an API should be exposed on Array/ArrayData to recursively truncate buffers, rather than piggyback off of IPC's array visitor + "serialize to a payload" logic? That would probably be cleaner from a frontend language's perspective.

There's still array-type-specific buffer truncation logic that would require an array-type-based visitor, and it's less clear to me how to marry that visitor with the IPC serialization code in order to share the truncation logic without making the IPC serialization less efficient. Right now the IPC serialization truncates buffers and aggregates them into a flat list in a single recursive pass over the array tree, whereas factoring the truncation logic out into an ArrayData.truncate_buffers() API would seem to necessitate two passes over the array tree: one to truncate the buffers and one to aggregate the flat buffer list.

@pitrou
Member

pitrou commented Oct 17, 2023

I think we can definitely accept two passes on the array tree. There are existing benchmarks that can validate this, so we can refine later.

@anjakefala
Collaborator

Thank you @clarkzinzow and @pitrou! I will put forth a design proposal. =)

@raulcd raulcd modified the milestones: 15.0.0, 16.0.0 Jan 8, 2024
edoakes added a commit to ray-project/ray that referenced this issue Jan 25, 2024
Toward fixing #38300

PR #29993 added a local ray fix for issue apache/arrow#26685, but at the time windows failed the tests with pyarrow7. In issue #38300 the suggested fix was to release the pin.

Signed-off-by: mattip <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
khluu pushed a commit to ray-project/ray that referenced this issue Jan 27, 2024
Toward fixing #38300

PR #29993 added a local ray fix for issue apache/arrow#26685, but at the time windows failed the tests with pyarrow7. In issue #38300 the suggested fix was to release the pin.

Signed-off-by: mattip <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: khluu <[email protected]>
@raulcd raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024
@anmyachev
Contributor

Hello @anjakefala! May I know the status of the proposal for this issue?

@jorisvandenbossche jorisvandenbossche modified the milestones: 17.0.0, 18.0.0 Jun 19, 2024