
[Python] Add function/method to deepcopy a pa.Table #38806

Open

hendrikmakait opened this issue Nov 20, 2023 · 2 comments

Comments

@hendrikmakait
Contributor

Describe the enhancement requested

Problem

In Dask, we need to force deep copies of pa.Tables to ensure that views/slices sever references to the original buffers, allowing us to free memory. From what I understand, there are a few ways to force a copy, but all of them come with downsides and/or a clumsy API (see Alternatives below).

Proposal

To give better control over copying a pa.Table, I propose adding a pa.Table.copy() method that creates a deep copy of the table. Ideally, this copy() method would take a boolean combine keyword that combines chunks if True and maintains the existing chunking scheme otherwise (the default).
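
A hypothetical usage sketch of the proposed API (neither copy() nor the combine keyword exists in pyarrow today):

import pyarrow as pa

table = pa.table({"n_legs": [2, 4, 5, 100]})

# Proposed (hypothetical): deep copy preserving the existing chunking scheme.
copied = table.copy()

# Proposed (hypothetical): deep copy that also combines chunks into one per column.
combined = table.copy(combine=True)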

Alternatives

  • pa.Table.take() and pa.Table.filter() could be used, but they have the additional overhead of evaluating some criterion before copying. This is also a fairly clumsy API and prone to someone later optimizing it so that zero-copies are performed "if possible".
  • We could manually copy the individual columns using pa.concat_arrays and compose a new Table from those (see the sketch after this list). However, pa.concat_arrays has to acquire the GIL when creating the returned Python object, which causes us to run into GIL contention due to the convoy effect (https://bugs.python.org/issue7946): something else hogs the GIL, and our loop over the columns is slowed down because every time we try to acquire the GIL, we have to wait.
  • pa.Table.combine_chunks() copies a column only if that column has more than a single chunk. Once again, we would have to jump through some hoops to ensure that this is the case or fall back to another solution that forces a copy.
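
For illustration, a minimal sketch of the concat-based workaround; deepcopy_table is a hypothetical helper, not pyarrow API, and it assumes pa.concat_arrays allocates fresh buffers even for a single input chunk:

import pyarrow as pa

def deepcopy_table(table: pa.Table) -> pa.Table:
    # Concatenating each column's chunks forces the data into newly
    # allocated buffers, severing references to the original ones.
    # NOTE: assumes concat_arrays copies even when a column has one chunk.
    columns = [
        pa.concat_arrays(column.chunks) if column.num_chunks > 0
        else pa.array([], type=column.type)
        for column in table.columns
    ]
    return pa.Table.from_arrays(columns, schema=table.schema)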

Side Comments

Intuitively, I would have thought that copy.deepcopy(table) as well as pickle.loads(pickle.dumps(table)) would serve my purpose. From what I can see, though, for views/slices both copy the entire underlying buffer. This may be by design to ensure that offsets are maintained, but it makes it even more important to have the ability to truncate the underlying buffers of views/slices to avoid having to pickle all the data. Am I doing something wrong here?

import copy
import pickle

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
                   'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
table = pa.Table.from_pandas(df)
# buffers()[1] is the offsets buffer of the 'animals' string column.
print(f"Original size: {table.column('animals').chunks[0].buffers()[1].size}")
sliced = table.slice(0, 1)
# Concatenating the chunks of the slice truncates the underlying buffers.
truncated = pa.concat_arrays(sliced.column("animals").chunks)
print(f"Truncated size: {truncated.buffers()[1].size}")
# Neither deepcopy nor a pickle round-trip truncates the buffers.
deep_copied = copy.deepcopy(sliced)
print(f"Deepcopied size: {deep_copied.column('animals').chunks[0].buffers()[1].size}")
pickled = pickle.loads(pickle.dumps(sliced))
print(f"Pickled size: {pickled.column('animals').chunks[0].buffers()[1].size}")

results in

Original size: 20
Truncated size: 8
Deepcopied size: 20
Pickled size: 20

Component(s)

Python

@jorisvandenbossche
Member

@hendrikmakait thanks for raising the issue!
I agree it would be good to have more explicit methods to copy pyarrow objects (Table, RecordBatch, Array) instead of people relying on the "concat trick".

Some related issues about adding deep copy functionality for arrays: #37878, #30503

Intuitively, I would have thought that copy.deepcopy(table) as well as pickle.loads(pickle.dumps(table)) would serve my purpose.

Yes, but unfortunately pickle has the problem that it saves the full buffer instead of only the sliced part. That's a long-standing issue with our implementation of pickling; see #26685 (and copy.deepcopy relies on pickle).
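
For reference, a minimal check of that behavior, mirroring the snippet above (both round-trips keep the full 20-byte offsets buffer):

import copy
import pickle

import pyarrow as pa

table = pa.table({"animals": ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
sliced = table.slice(0, 1)

# copy.deepcopy delegates to the pickle machinery, so both round-trips
# retain the parent table's full offsets buffer instead of truncating it.
for copied in (copy.deepcopy(sliced), pickle.loads(pickle.dumps(sliced))):
    print(copied.column("animals").chunks[0].buffers()[1].size)  # 20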

@hendrikmakait
Contributor Author

hendrikmakait commented Dec 6, 2023

Hi @jorisvandenbossche, thanks for the additional context! I'd be happy to contribute a PR for this, though I might need some assistance finding my way around Arrow.
