
ARROW-1744: [Plasma] Provide TensorFlow operator to transfer Tensors between Plasma and TensorFlow #2104

Closed
wants to merge 68 commits into from

Conversation

pschafhalter
Contributor

@pschafhalter pschafhalter commented Jun 5, 2018

This is based off of #2046 from @pcmoritz and @concretevitamin, who wrote the original TensorFlow Op with help from @ericl.

In addition to the previous PR, this supports serialization of tensors of arbitrary type.

To get it working, Arrow has to be built as follows:

cmake -DARROW_PLASMA=on -DARROW_TENSORFLOW=on -DARROW_PYTHON=on ..
make -j8
sudo make install

And pyarrow like this:

PYARROW_WITH_PLASMA=1 PYARROW_WITH_TENSORFLOW=1 python setup.py develop

The PR also includes utilities that should be generally useful for converting between TensorFlow and Arrow. It is currently intended for transferring weights in and out of TensorFlow, but hopefully in the future there will also be support for data tensors.
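For readers wondering what using the op might look like, here is a minimal sketch of the intended round trip (a tensor written from TensorFlow into Plasma and read back). The names used below (load_plasma_tensorflow_op, tf_plasma_op.tensor_to_plasma, tf_plasma_op.plasma_to_tensor, plasma_store_socket_name) are assumptions based on this PR rather than a settled API, and a Plasma store is assumed to be running at /tmp/plasma:

```python
# Sketch only: the op/function names below (load_plasma_tensorflow_op,
# tf_plasma_op.tensor_to_plasma, tf_plasma_op.plasma_to_tensor) and the
# socket argument are assumptions based on this PR, not a settled API.
import numpy as np
import tensorflow as tf
import pyarrow.plasma as plasma

plasma.load_plasma_tensorflow_op()            # load the compiled plasma op library
object_id = np.random.bytes(20)               # Plasma object IDs are 20 bytes

weights = tf.constant(np.random.rand(4, 4), dtype=tf.float64)

# Write the tensor into the Plasma store, then read it back under a
# control dependency so the write happens first.
write = plasma.tf_plasma_op.tensor_to_plasma(
    [weights], object_id, plasma_store_socket_name="/tmp/plasma")
with tf.control_dependencies([write]):
    restored = plasma.tf_plasma_op.plasma_to_tensor(
        object_id, dtype=tf.float64, plasma_store_socket_name="/tmp/plasma")

with tf.Session() as sess:
    print(sess.run(restored))                 # same values as `weights`
```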


mutex mu_;
bool connected_ = false;
plasma::PlasmaClient client_ GUARDED_BY(mu_);
Contributor

This is one plasma client per custom op? The challenge with that is you could easily need hundreds or thousands of custom ops in a single TF graph, which would be a lot of plasma clients, potentially running out of file descriptors somewhere.

We can certainly try it this way and see what happens.

Contributor

There is one plasma client per op, and in practice one op per graph (all the weights of one graph are written by a single op), so it's fine to have one client per op.

Contributor

What do you mean by one op per graph? Couldn't there be an arbitrary number of ops per graph?

Contributor

In principle yes, but one per graph is how we are using it. I think it's fine to have one plasma client per op (and it's cleaner code-wise, since we don't need global variables in this case).
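To make the "one op, and hence one client, per graph" usage concrete, here is a hedged sketch of that pattern: every weight tensor in the graph is handed to a single tensor_to_plasma call, so the graph holds only one Plasma client connection. As in the sketch above, the op and argument names are assumptions based on this PR, and a Plasma store is assumed to be running at /tmp/plasma:

```python
# Sketch only: one tensor_to_plasma op for the whole graph, fed with every
# weight tensor, so the graph ends up with a single Plasma client connection.
import numpy as np
import tensorflow as tf
import pyarrow.plasma as plasma

plasma.load_plasma_tensorflow_op()

w = tf.get_variable("w", shape=[4, 4], dtype=tf.float64)
b = tf.get_variable("b", shape=[4], dtype=tf.float64)
all_weights = [w, b]                          # in a real model: tf.trainable_variables()

weights_oid = np.random.bytes(20)
save_weights = plasma.tf_plasma_op.tensor_to_plasma(
    all_weights, weights_oid, plasma_store_socket_name="/tmp/plasma")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(save_weights)                    # one op, one client, all of the weights
```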

@robertnishihara
Contributor

Great work, this looks good to me! I'd hold off on introducing a C API and focus on keeping the code clean/simple for now. If we run into substantial build or binary issues, then that may be the right thing to do.

@pcmoritz
Contributor

pcmoritz commented Jun 25, 2018

Yeah, sounds good to me, this is ready to merge now. Let's see what our experience is deploying it with Ray and then we can iterate on it.

Here is the JIRA for packaging: https://issues.apache.org/jira/browse/ARROW-2737

tf::mutex_lock lock(d2h_stream_mu);
if (d2h_stream != nullptr) {
delete d2h_stream;
d2h_stream = nullptr;


(Communicated offline as well, copy pasting here)

I don’t understand this fix. Why would the destructor of the same object be called twice?
Can you (1) revert this commit, and (2) change the Python test from sess = tf.Session(...); ... to with tf.Session(...) as sess:? Just want to see if this is some default-graph semantics or something like that (likely not).
Another independent change to try: (1) revert this commit, (2) add .SetIsStateful() to the op’s registration, i.e. the two REGISTER_OP() calls in memcpy_plasma_op.cc.
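For reference, the session change suggested in (2) amounts to scoping the session with a context manager so it is closed deterministically. A minimal sketch, where the constant tensor is just a stand-in for the Plasma op's output:

```python
import tensorflow as tf

# Stand-in for the output of the Plasma op under test.
output = tf.constant([1.0, 2.0, 3.0])

# Before: the test created a session manually and never explicitly closed it.
# sess = tf.Session()
# result = sess.run(output)

# Suggested form: the session is closed (and its resources released) when the
# block exits, which helps rule out lifetime/default-graph effects.
with tf.Session() as sess:
    result = sess.run(output)
    print(result)
```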

@pitrou
Member

pitrou commented Jul 10, 2018

Regardless of the review comments, I want to ask a higher-level question: does this have to be part of the Arrow codebase? We seem to be piling up adapters for third-party libraries without regard for maintenance concerns. I think even the fact that Plasma is part of the Arrow codebase should be questioned. It clearly is a separate piece of code (to the point that it has distinct build files) with its own quality standards.

@wesm
Member

wesm commented Jul 10, 2018

I think it's a good idea to have all the code in the same place, precisely to ease maintenance but also to have consistent governance standards. Though there are Apache projects with multiple repos, I would really like to avoid that. Even having the Parquet adapter code in a separate codebase has been more trouble than it's worth IMHO.

The burden of maintenance should clearly be shared. I find it unfortunate that the Ray developers are not participating much in the packaging and release maintenance for Arrow, and instead ship this code on their own inside Ray.

@pitrou
Member

pitrou commented Jul 10, 2018

In Parquet's case I understand, because there's a weird bidirectional dependency between the Parquet and Arrow repositories that makes everything complicated. In Plasma's case, the dependency would only be in a single direction (from Plasma to Arrow). Also, Plasma is more of an application than a library that would sit at the same level as Arrow.

@wesm
Member

wesm commented Jul 10, 2018

Also, Plasma is more of an application than a library that would sit at the same level as Arrow.

This is true for now. We haven't built anything inside Arrow that depends on Plasma yet, but I think it's entirely possible in the future (spill-to-disk joins, etc.). In the context of a database system, Plasma serves as a "subsystem for managing shared memory pages and out-of-core data", so it fits with the objective of building a deconstructed database.

@pcmoritz
Contributor

I fully agree with @wesm on Plasma. Let me know, @pitrou, if you have things in mind that we can improve. The codebases originally started out separate, with different styles etc., but things are converging more and more, and we are happy to put in work to make them converge even further in the future.

Regarding TensorFlow, my perspective is that machine learning is a super important part of modern data processing (increasingly so) and TensorFlow is the de facto standard for ML, so we need to spend the time to make arrow and TensorFlow work together very well, even if that's going to be painful (TensorFlow has its quirks). This is going to be far easier if the work happens in the arrow repo.

@wesm Thanks for the feedback and please let us know how we can improve the cooperation on the packaging and release maintenance.

@wesm
Member

wesm commented Jul 10, 2018

Here is a thread from March 23 about problems with Arrow's packaging and release management:

https://lists.apache.org/thread.html/46b7251192568258e9fc01e819fd96d103d524647c8977f1fd9046de@<dev.arrow.apache.org>

No one who develops both Ray and Arrow replied on the thread, presumably (?) because you are vendoring Arrow and dealing with packaging on your own, so the woes that the Arrow release managers have experienced haven't impacted you very much. There's nothing inherently wrong or immoral with vendoring an upstream open source project.

In principle, you could base Ray releases on released versions of Arrow (Ray has released about as many times as Arrow has). To make this practical, Arrow would need to release more often. BUT, Arrow has not released more often because the release management and packaging process has been so historically onerous (I have borderline PTSD from having been the release manager too many times, only half joking).

Meanwhile, our 0.10.0 release is blocked as we've decided not to release again unless we can include the binary packages in the release vote, to prevent the kind of packaging disaster that befell the 0.9.0 release. So this is where we stand right now.

Since we are all ultimately volunteers, there's nothing forcing you to help with packaging and releases for Arrow, but the community would be stronger and healthier if you did.

@wesm
Member

wesm commented Jul 10, 2018

BTW, Ray's vendored pyarrow conflicts with the non-vendored one if they are both installed:

>>> import ray
>>> import pyarrow
>>> pyarrow
<module 'pyarrow' from '/home/wesm/miniconda/envs/ray-test/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/__init__.py'>

I took a look at the PyPI download stats on BigQuery -- pyarrow is being installed ~10-15x as often as Ray, so this could spell trouble for users.
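For users who hit this, a quick way to check which copy of pyarrow an environment resolves to is to inspect the imported module's location. A small diagnostic sketch; the ray/pyarrow_files path is taken from the interpreter session above:

```python
# Diagnostic sketch: report whether the imported pyarrow is the standalone
# package or the copy vendored inside Ray. Import order matters here:
# importing ray first appears to put its vendored copy ahead on sys.path,
# as the interpreter session above shows.
import pyarrow

print(pyarrow.__file__)
if "ray/pyarrow_files" in pyarrow.__file__:
    print("Using the pyarrow vendored inside Ray")
else:
    print("Using the standalone pyarrow (e.g. from site-packages)")
```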

@pcmoritz
Contributor

We do want to switch over to including pyarrow in Ray via pip; we can try looking into that. The requirement from the development side is to have arrow nightlies or per-commit packages that we can use for the development of Ray (since there is a lot of development on both sides that needs to stay in sync). For releases we would need to synchronize with arrow releases in some way.

The advantage of vendoring arrow with Ray like we do now is that there is just one version we need to support, and from the user's perspective "it just works".

@wesm
Member

wesm commented Jul 11, 2018

I don't think we are far from having nightly or per-commit wheels; this work would go faster if more parties were invested and helping. I think if we collaborate on developer tooling we can make reasonably short work of things.

What you use for development and what you ship seem like separate concerns to me.

@@ -208,7 +208,10 @@ def import_tensorflow_extension():
ext = os.path.join(path, "libtensorflow_framework.so")
if os.path.exists(ext):
import ctypes
ctypes.CDLL(ext)
try:
ctypes.CDLL(ext)
Contributor

In the case where this fails, why not try import tensorflow?

Contributor

It happens inside of the manylinux1 build (the OSError occurs if the version of GLIBC is too old for TensorFlow); in this case import tensorflow also fails.
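For context, the change in the diff above amounts to attempting the preload and swallowing the loader error on platforms where the library cannot be loaded. A standalone sketch of that idea, with a placeholder path:

```python
# Sketch of the guarded preload: try to dlopen TensorFlow's framework library,
# and fall back gracefully if the dynamic loader rejects it (e.g. GLIBC too old).
import ctypes
import os

ext = os.path.join("/hypothetical/path", "libtensorflow_framework.so")  # placeholder path
if os.path.exists(ext):
    try:
        ctypes.CDLL(ext)
    except OSError:
        # Loading failed (for example, GLIBC is too old for this TensorFlow
        # build); `import tensorflow` would fail too, so just skip the preload.
        pass
```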

Contributor

Ok, I see.

@wesm
Member

wesm commented Jul 17, 2018

Thanks all! If any of you have a moment, maybe you can jot down some ideas for where to go with TF integration from here? https://cwiki.apache.org/confluence/display/ARROW#ApacheArrowHome-MachineLearningFrameworkIntegrations

@yupbank

yupbank commented Jan 21, 2019

@xhochy
Member

xhochy commented Jan 22, 2019

I would be really happy if this code could move to tensorflow/io. It has caused quite some headaches in our packaging scripts.
