
ARROW-1744: [Plasma] Provide TensorFlow operator to transfer Tensors between Plasma and TensorFlow #2104

Closed
wants to merge 68 commits into from

Conversation

pschafhalter
Contributor

@pschafhalter pschafhalter commented Jun 5, 2018

This is based off of #2046 from @pcmoritz and @concretevitamin, who wrote the original TensorFlow Op with help from @ericl.

In addition to the previous PR, this supports serialization of tensors of arbitrary type.

To get it working, Arrow has to be built as follows:

cmake -DARROW_PLASMA=on -DARROW_TENSORFLOW=on -DARROW_PYTHON=on ..
make -j8
sudo make install

And pyarrow like this:

PYARROW_WITH_PLASMA=1 PYARROW_WITH_TENSORFLOW=1 python setup.py develop

The PR also includes utilities that should be generally useful for converting between TensorFlow and Arrow. It is currently intended for transferring weights in and out of TensorFlow, but hopefully in the future there will also be support for data tensors.
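For readers wondering what using the op might look like, here is a minimal sketch of the intended round trip (a tensor written from TensorFlow into Plasma and read back). The names used below (load_plasma_tensorflow_op, tf_plasma_op.tensor_to_plasma, tf_plasma_op.plasma_to_tensor, plasma_store_socket_name) are assumptions based on this PR rather than a settled API, and a Plasma store is assumed to be running at /tmp/plasma:

```python
# Sketch only: the op/function names below (load_plasma_tensorflow_op,
# tf_plasma_op.tensor_to_plasma, tf_plasma_op.plasma_to_tensor) and the
# socket argument are assumptions based on this PR, not a settled API.
import numpy as np
import tensorflow as tf
import pyarrow.plasma as plasma

plasma.load_plasma_tensorflow_op()            # load the compiled plasma op library
object_id = np.random.bytes(20)               # Plasma object IDs are 20 bytes

weights = tf.constant(np.random.rand(4, 4), dtype=tf.float64)

# Write the tensor into the Plasma store, then read it back under a
# control dependency so the write happens first.
write = plasma.tf_plasma_op.tensor_to_plasma(
    [weights], object_id, plasma_store_socket_name="/tmp/plasma")
with tf.control_dependencies([write]):
    restored = plasma.tf_plasma_op.plasma_to_tensor(
        object_id, dtype=tf.float64, plasma_store_socket_name="/tmp/plasma")

with tf.Session() as sess:
    print(sess.run(restored))                 # same values as `weights`
```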


mutex mu_;
bool connected_ = false;
plasma::PlasmaClient client_ GUARDED_BY(mu_);
Contributor

This is one plasma client per custom op? The challenge with that is you could easily need hundreds or thousands of custom ops in a single TF graph, which would be a lot of plasma clients, potentially running out of file descriptors somewhere.

We can certainly try it this way and see what happens.

Contributor

There is one plasma client per op, and in practice one op per graph (all the weights of one graph are written by a single op), so it's fine to have one client per op.

Contributor

What do you mean by one op per graph? Couldn't there be an arbitrary number of ops per graph?

Contributor

In principle yes, but one per graph is how we are using it. I think it's fine to have one plasma client per op (and it's cleaner code-wise, since we don't need global variables in this case).
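To make the "one op, and hence one client, per graph" usage concrete, here is a hedged sketch of that pattern: every weight tensor in the graph is handed to a single tensor_to_plasma call, so the graph holds only one Plasma client connection. As in the sketch above, the op and argument names are assumptions based on this PR, and a Plasma store is assumed to be running at /tmp/plasma:

```python
# Sketch only: one tensor_to_plasma op for the whole graph, fed with every
# weight tensor, so the graph ends up with a single Plasma client connection.
import numpy as np
import tensorflow as tf
import pyarrow.plasma as plasma

plasma.load_plasma_tensorflow_op()

w = tf.get_variable("w", shape=[4, 4], dtype=tf.float64)
b = tf.get_variable("b", shape=[4], dtype=tf.float64)
all_weights = [w, b]                          # in a real model: tf.trainable_variables()

weights_oid = np.random.bytes(20)
save_weights = plasma.tf_plasma_op.tensor_to_plasma(
    all_weights, weights_oid, plasma_store_socket_name="/tmp/plasma")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(save_weights)                    # one op, one client, all of the weights
```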

@robertnishihara
Contributor

Great work, this looks good to me! I'd hold off on introducing a C API and focus on keeping the code clean/simple for now. If we run into substantial build or binary issues, then that may be the right thing to do.

@pcmoritz
Contributor

pcmoritz commented Jun 25, 2018

Yeah, sounds good to me, this is ready to merge now. Let's see what our experience is deploying it with Ray and then we can iterate on it.

Here is the JIRA for packaging: https://issues.apache.org/jira/browse/ARROW-2737

tf::mutex_lock lock(d2h_stream_mu);
if (d2h_stream != nullptr) {
delete d2h_stream;
d2h_stream = nullptr;


(Communicated offline as well, copy pasting here)

I don’t understand this fix. Why would the destructor of the same object be called twice?
Can you (1) revert this commit, and (2) change the Python test from sess = tf.Session(...); ... to with tf.Session(...) as sess:? Just want to see if this is some default-graph semantics or something like that (likely not).
Another independent change to try: (1) revert this commit, (2) add .SetIsStateful() to the op’s registration, i.e. the two REGISTER_OP() calls in memcpy_plasma_op.cc.
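For reference, the session change suggested in (2) amounts to scoping the session with a context manager so it is closed deterministically. A minimal sketch, where the constant tensor is just a stand-in for the Plasma op's output:

```python
import tensorflow as tf

# Stand-in for the output of the Plasma op under test.
output = tf.constant([1.0, 2.0, 3.0])

# Before: the test created a session manually and never explicitly closed it.
# sess = tf.Session()
# result = sess.run(output)

# Suggested form: the session is closed (and its resources released) when the
# block exits, which helps rule out lifetime/default-graph effects.
with tf.Session() as sess:
    result = sess.run(output)
    print(result)
```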

@pitrou
Member

pitrou commented Jul 10, 2018

Regardless of the review comments, I want to ask a higher-level question: does this have to be part of the Arrow codebase? We seem to be piling up adapters for third-party libraries without regard for maintenance concerns. I think even the fact that Plasma is part of the Arrow codebase should be questioned. It clearly is a separate piece of code (to the point that it has distinct build files) with its own quality standards.

@wesm
Member

wesm commented Jul 10, 2018

I think it's a good idea to have all the code in the same place, precisely to ease maintenance but also to have consistent governance standards. Though there are Apache projects with multiple repos, I would really like to avoid that. Even having the Parquet adapter code in a separate codebase has been more trouble than it's worth IMHO.

The burden of maintenance should clearly be shared. I find it unfortunate that the Ray developers are not participating much in the packaging and release maintenance for Arrow, and instead ship this code on their own inside Ray.

@pitrou
Member

pitrou commented Jul 10, 2018

In Parquet's case I understand, because there's a weird bidirectional dependency between the Parquet and Arrow repositories that makes everything complicated. In Plasma's case, the dependency would only be in a single direction (from Plasma to Arrow). Also, Plasma is more of an application than a library that would sit at the same level as Arrow.

@wesm
Member

wesm commented Jul 10, 2018

Also, Plasma is more of an application than a library that would sit at the same level as Arrow.

This is true for now. We haven't built anything inside Arrow that depends on Plasma yet, but I think it's entirely possible in the future (spill-to-disk joins, etc.). In the context of a database system, Plasma serves as a "subsystem for managing shared memory pages and out-of-core data", so it fits with the objective of building a deconstructed database.

@pcmoritz
Contributor

I fully agree with @wesm on Plasma. Let me know, @pitrou, if you have things in mind that we can improve. The codebases originally started out separate, with different styles etc., but things are converging more and more, and we are happy to put in work to make them converge even further in the future.

Regarding TensorFlow, my perspective is that machine learning is a super important part of modern data processing (increasingly so) and TensorFlow is the de facto standard for ML, so we need to spend the time to make arrow and TensorFlow work together very well, even if that's going to be painful (TensorFlow has its quirks). This is going to be far easier if the work happens in the arrow repo.

@wesm Thanks for the feedback and please let us know how we can improve the cooperation on the packaging and release maintenance.

@wesm
Member

wesm commented Jul 10, 2018

Here is a thread from March 23 about problems with Arrow's packaging and release management:

https://lists.apache.org/thread.html/46b7251192568258e9fc01e819fd96d103d524647c8977f1fd9046de@<dev.arrow.apache.org>

No one who develops both Ray and Arrow replied on the thread, presumably (?) because you are vendoring Arrow and dealing with packaging on your own, so the woes that the Arrow release managers have experienced haven't impacted you very much. There's nothing inherently wrong or immoral with vendoring an upstream open source project.

In principle, you could base Ray releases on released versions of Arrow (Ray has released about as many times as Arrow has). To make this practical, Arrow would need to release more often. BUT, Arrow has not released more often because the release management and packaging process has been so historically onerous (I have borderline PTSD from having been the release manager too many times, only half joking).

Meanwhile, our 0.10.0 release is blocked as we've decided not to release again unless we can include the binary packages in the release vote, to prevent the kind of packaging disaster that befell the 0.9.0 release. So this is where we stand right now.

Since we are all ultimately volunteers, there's nothing forcing you to help with packaging and releases for Arrow, but the community would be stronger and healthier if you did.

@wesm
Member

wesm commented Jul 10, 2018

BTW, Ray's vendored pyarrow conflicts with the non-vendored one if they are both installed:

>>> import ray
>>> import pyarrow
>>> pyarrow
<module 'pyarrow' from '/home/wesm/miniconda/envs/ray-test/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/__init__.py'>

I took a look at the PyPI download stats on BigQuery -- pyarrow is being installed ~10-15x as often as Ray, so this could spell trouble for users.
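For users who hit this, a quick way to check which copy of pyarrow an environment resolves to is to inspect the imported module's location. A small diagnostic sketch; the ray/pyarrow_files path is taken from the interpreter session above:

```python
# Diagnostic sketch: report whether the imported pyarrow is the standalone
# package or the copy vendored inside Ray. Import order matters here:
# importing ray first appears to put its vendored copy ahead on sys.path,
# as the interpreter session above shows.
import pyarrow

print(pyarrow.__file__)
if "ray/pyarrow_files" in pyarrow.__file__:
    print("Using the pyarrow vendored inside Ray")
else:
    print("Using the standalone pyarrow (e.g. from site-packages)")
```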

@pcmoritz
Contributor

We do want to switch over to including pyarrow in Ray via pip; we can try looking into that. The requirement from the development side is to have arrow nightlies or per-commit packages that we can use for the development of Ray (since there is a lot of development on both sides that needs to stay in sync). For releases we would need to synchronize with arrow releases in some way.

The advantage of vendoring arrow with Ray like we do now is that there is just one version we need to support, and from the user's perspective "it just works".

@wesm
Member

wesm commented Jul 11, 2018

I don't think we are far from having nightly or per-commit wheels; this work would go faster if more parties were invested and helping. I think if we collaborate on developer tooling we can make reasonably short work of things.

What you use for development and what you ship seem like separate concerns to me.

@@ -208,7 +208,10 @@ def import_tensorflow_extension():
ext = os.path.join(path, "libtensorflow_framework.so")
if os.path.exists(ext):
import ctypes
ctypes.CDLL(ext)
try:
ctypes.CDLL(ext)
Contributor

In the case where this fails, why not try import tensorflow?

Contributor

It happens inside of the manylinux1 build (the OSError occurs if the version of GLIBC is too old for TensorFlow); in this case import tensorflow also fails.
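For context, the change in the diff above amounts to attempting the preload and swallowing the loader error on platforms where the library cannot be loaded. A standalone sketch of that idea, with a placeholder path:

```python
# Sketch of the guarded preload: try to dlopen TensorFlow's framework library,
# and fall back gracefully if the dynamic loader rejects it (e.g. GLIBC too old).
import ctypes
import os

ext = os.path.join("/hypothetical/path", "libtensorflow_framework.so")  # placeholder path
if os.path.exists(ext):
    try:
        ctypes.CDLL(ext)
    except OSError:
        # Loading failed (for example, GLIBC is too old for this TensorFlow
        # build); `import tensorflow` would fail too, so just skip the preload.
        pass
```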

Contributor

Ok, I see.

@wesm
Member

wesm commented Jul 17, 2018

Thanks all! If any of you have a moment, maybe you can jot down some ideas for where to go with TF integration from here? https://cwiki.apache.org/confluence/display/ARROW#ApacheArrowHome-MachineLearningFrameworkIntegrations

@yupbank

yupbank commented Jan 21, 2019

@xhochy
Member

xhochy commented Jan 22, 2019

I would be really happy if this code could move to tensorflow/io. It has caused quite some headaches in our packaging scripts.
