[Python] Add function/method to deepcopy a pa.Table
#38806
Comments
@hendrikmakait thanks for raising the issue! Some related issues about adding deep copy functionality for arrays: #37878, #30503
Yes, but unfortunately pickle has the problem that it saves the full buffer instead of only the sliced part. That's a long-standing issue with our implementation of pickling, see #26685.
Hi @jorisvandenbossche, thanks for the additional context! I'd be happy to contribute a PR for this, though I might need some assistance finding my way around Arrow.
Describe the enhancement requested
Problem
In Dask, we need to force deep copies of pa.Table objects to ensure that views/slices sever references to the original buffers and allow us to free memory. From what I understand, there are a few ways to force a copy, but all of them come with downsides and/or have a clumsy API (see Alternatives).

Proposal
To give better control over copying a pa.Table, I propose adding a pa.Table.copy() method that creates a deep copy of the table. Ideally, this copy() method would have a boolean combine keyword that combines chunks if True and maintains the existing chunking scheme otherwise (the default).
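For illustration, a minimal sketch of how the proposed method might be used. Note that pa.Table.copy() and the combine keyword do not exist in PyArrow today; they are part of this proposal, so the calls are shown as comments:

```python
import pyarrow as pa

table = pa.table({"x": list(range(1_000_000))})
view = table.slice(0, 10)  # zero-copy view that still references the full buffers

# Hypothetical API from this proposal (not part of current PyArrow):
# copied = view.copy()              # deep copy, existing chunking preserved
# copied = view.copy(combine=True)  # deep copy with chunks combined
```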
Alternatives

- pa.Table.take() and pa.Table.filter() could be used, but they have the additional overhead of evaluating some criterion before copying. Also, this is a fairly clumsy API and prone to someone optimizing it so that zero-copies are performed "if possible".
- We could copy each column with pa.concat_arrays and compose a new Table from those copies (see the sketch after this list). However, pa.concat_arrays has to acquire the GIL when creating the returned Python object, which causes us to run into GIL contention due to the convoy effect (https://bugs.python.org/issue7946). Basically, something else hogs the GIL, and our loop over the columns gets slowed down because every time we try to acquire the GIL, we have to wait.
- pa.Table.combine_chunks() only copies a column if that column has more than a single chunk. Once again, we would have to jump through some hoops to ensure that this is the case or fall back to another solution that forces a copy.
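As referenced in the second alternative, here is a minimal sketch of the concat_arrays-based workaround, assuming pa.concat_arrays allocates fresh buffers (columns with zero chunks are not handled here):

```python
import pyarrow as pa


def deepcopy_table(table: pa.Table) -> pa.Table:
    """Force a deep copy by concatenating each column's chunks into new buffers."""
    copied_columns = [pa.concat_arrays(column.chunks) for column in table.columns]
    return pa.Table.from_arrays(copied_columns, schema=table.schema)


original = pa.table({"x": list(range(1_000_000))})
view = original.slice(0, 10)   # zero-copy view of the first 10 rows
copied = deepcopy_table(view)  # new buffers holding only those 10 rows
```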
Side Comments

Intuitively, I would have thought that copy.deepcopy(table) as well as pickle.loads(pickle.dumps(table)) would serve my purpose. From what I can see, though, pickling views/slices copies the entire underlying buffer. This may be by design to ensure that offsets are maintained, but it makes it even more important to have the ability to truncate underlying buffers for views/slices, to avoid having to pickle all the data. Am I doing something wrong here?
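A small reproduction of the behavior described above, under the assumption (consistent with #26685 and the comment earlier in this thread) that pickling a sliced table serializes the full parent buffers; exact byte counts depend on the PyArrow version:

```python
import pickle

import pyarrow as pa

table = pa.table({"x": list(range(1_000_000))})
view = table.slice(0, 10)  # zero-copy view: 10 rows backed by the full buffers

print(table.get_total_buffer_size())  # bytes held by the full table's buffers
print(view.get_total_buffer_size())   # same order of magnitude: the view still
                                      # references the parent buffers
print(len(pickle.dumps(view)))        # expected to be close to the full buffer
                                      # size rather than ~10 rows' worth of data
```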
Component(s)
Python