ARROW-1744: [Plasma] Provide TensorFlow operator to transfer Tensors between Plasma and TensorFlow #2104
Force-pushed e56b0fd to 8be3625
Force-pushed 8be3625 to f461b38
Force-pushed a85a22a to 077266b
Force-pushed 35faa7c to be83a20
    mutex mu_;
    bool connected_ = false;
    plasma::PlasmaClient client_ GUARDED_BY(mu_);
This is one plasma client per custom op? The challenge with that is you could easily need hundreds or thousands of custom ops in a single TF graph, which would be a lot of plasma clients, potentially running out of file descriptors somewhere.
We can certainly try it this way and see what happens.
There is one plasma client per op and, in our usage, one op per graph (all the weights of one graph are written by a single op), so it's fine to have one client per op.
What do you mean, there is one op per graph? Couldn't there be an arbitrary number of ops per graph?
In principle yes, but one per graph is how we are using it. I think it's fine to have one plasma client per op (and it's cleaner code-wise, since we don't need global variables in this case).
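The trade-off being debated above can be sketched in a few lines. This is a hypothetical illustration, not the real plasma API: `FakePlasmaClient` stands in for `plasma.PlasmaClient`, and the file-descriptor count is simulated by a class counter.

```python
# Hypothetical sketch of the design trade-off: one plasma client per
# custom op vs. one shared client per process. FakePlasmaClient is a
# stand-in for plasma.PlasmaClient; each instance models one open fd.

import threading

class FakePlasmaClient:
    """Stand-in for plasma.PlasmaClient; counts live connections."""
    live = 0

    def __init__(self):
        FakePlasmaClient.live += 1  # each client holds a socket fd

class PerOpClient:
    """One client per op -- the approach taken in this PR."""
    def __init__(self):
        self.client = FakePlasmaClient()

class SharedClient:
    """One process-wide client behind a lock -- the alternative."""
    _lock = threading.Lock()
    _client = None

    @classmethod
    def get(cls):
        with cls._lock:
            if cls._client is None:
                cls._client = FakePlasmaClient()
            return cls._client

# 1000 per-op clients consume 1000 "fds"; the shared client uses 1.
ops = [PerOpClient() for _ in range(1000)]
per_op_fds = FakePlasmaClient.live

FakePlasmaClient.live = 0
shared = [SharedClient.get() for _ in range(1000)]
shared_fds = FakePlasmaClient.live
print(per_op_fds, shared_fds)  # 1000 1
```

As the discussion concludes, with roughly one such op per graph the per-op cost stays small, and it avoids process-global state.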
Great work, this looks good to me! I'd hold off on introducing a C API and focus on keeping the code clean/simple for now. If we run into substantial build or binary issues, then that may be the right thing to do.
6dd14f6
to
b4b4134
Compare
Yeah, sounds good to me, this is ready to merge now. Let's see what our experience is deploying it with Ray and then we can iterate on it. Here is the JIRA for packaging: https://issues.apache.org/jira/browse/ARROW-2737
    tf::mutex_lock lock(d2h_stream_mu);
    if (d2h_stream != nullptr) {
      delete d2h_stream;
      d2h_stream = nullptr;
    }
(Communicated offline as well, copy-pasting here.)
I don't understand this fix. Why would the destructor of the same object be called twice?
Can you (1) revert this commit, and (2) change the Python test from `sess = tf.Session(...); ...` to `with tf.Session(...) as sess:`? Just want to see if this is some default-graph semantics or something like that (likely not).
Another independent change to try: (1) revert this commit, (2) add `.SetIsStateful()` to the op's registration, i.e. the two `REGISTER_OP()` calls in memcpy_plasma_op.cc.
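The C++ guard under discussion can be illustrated with a Python analogue. This is an illustrative sketch only: null out the resource after freeing it, so that a second teardown call (however it happens) becomes a harmless no-op instead of a double free.

```python
# Python analogue of the guarded teardown in the C++ snippet above:
# take the lock, free the resource only if it is still live, then
# null it out so a repeated call does nothing.

import threading

class StreamHolder:
    def __init__(self):
        self._mu = threading.Lock()          # d2h_stream_mu
        self._stream = object()              # stand-in for d2h_stream
        self.deletes = 0                     # how many real frees ran

    def close(self):
        with self._mu:                       # tf::mutex_lock lock(...)
            if self._stream is not None:     # if (d2h_stream != nullptr)
                self._stream = None          # delete + set to nullptr
                self.deletes += 1

h = StreamHolder()
h.close()
h.close()   # second call is a no-op; no double free
print(h.deletes)  # 1
```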
Regardless of the review comments, I want to ask a higher-level question: does this have to be part of the Arrow codebase? We seem to be piling up adapters for third-party libraries without regard for maintenance concerns. I think even the fact that Plasma is part of the Arrow codebase should be questioned. It clearly is a separate piece of code (to the point that it has distinct build files) with its own quality standards.
I think it's a good idea to have all the code in the same place, precisely to ease maintenance but also to have consistent governance standards. Though there are Apache projects with multiple repos, I would really like to avoid that. Even having the Parquet adapter code in a separate codebase has been more trouble than it's worth IMHO. The burden of maintenance should clearly be shared. I find it unfortunate that the Ray developers are not participating much in the packaging and release maintenance for Arrow, and instead ship this code on their own inside Ray.
In Parquet's case I understand, because there's a weird bidirectional dependency between the Parquet and Arrow repositories that makes everything complicated. In Plasma's case, the dependency would only be in a single direction (from Plasma to Arrow). Also, Plasma is more of an application than a library that would sit at the same level as Arrow.

This is true for now. We haven't built anything inside Arrow that depends on Plasma yet, but I think it's entirely possible in the future (spill-to-disk joins, etc.). In the context of a database system, Plasma serves as a "subsystem for managing shared memory pages and out-of-core data", so it fits with the objective of building a deconstructed database.

I fully agree with @wesm on Plasma. Let me know @pitrou if you have things in mind we can improve. The codebases originally started out as separate with different styles etc., but things are converging more and more and we are happy to put in work to make it even more so in the future.

Regarding TensorFlow, my perspective on this is that machine learning is a super important part of modern data processing (increasingly so), and TensorFlow is the de facto standard for ML, so we need to spend the time to make Arrow and TensorFlow work together very well, even if that's going to be painful (TensorFlow has its quirks). This is going to be far easier if the work happens in the arrow repo. @wesm Thanks for the feedback, and please let us know how we can improve the cooperation on the packaging and release maintenance.
Here is a thread from March 23 about problems with Arrow's packaging and release management: https://lists.apache.org/thread.html/46b7251192568258e9fc01e819fd96d103d524647c8977f1fd9046de@<dev.arrow.apache.org>

No one who develops both Ray and Arrow replied on the thread, presumably (?) because you are vendoring Arrow and dealing with packaging on your own, so the woes that the Arrow release managers have experienced haven't impacted you very much. There's nothing inherently wrong or immoral with vendoring an upstream open source project. In principle, you could base Ray releases on released versions of Arrow (Ray has released about as many times as Arrow has). To make this practical, Arrow would need to release more often. BUT, Arrow has not released more often because the release management and packaging process has been so historically onerous (I have borderline PTSD from having been the release manager too many times, only half joking).

Meanwhile, our 0.10.0 release is blocked, as we've decided not to release again unless we can include the binary packages in the release vote, to prevent the kind of packaging disaster that befell the 0.9.0 release. So this is where we stand right now. Since we are all ultimately volunteers, there's nothing forcing you to help with packaging and releases for Arrow, but the community would be stronger and healthier if you did.
BTW, Ray's vendored pyarrow conflicts with the non-vendored one if they are both installed:
I took a look at the PyPI download stats on BigQuery -- pyarrow is being installed ~10-15x as often as Ray, so this could spell trouble for users.

We do want to switch over to including pyarrow via pip into Ray; we can try looking into that. The requirement from the development side is to have Arrow nightlies or per-commit packages that we can use for the development of Ray (since there is a lot of development on both sides that needs to be in sync). For releases we would need to synchronize with Arrow releases in some way. The advantage of vendoring Arrow with Ray like we do now is that there is just one version we need to support, and from the user's perspective "it just works".

I don't think we are far from having nightly or per-commit wheels; this work would go faster if more parties were invested and helping. I think if we collaborate on developer tooling we can make reasonably short work of things. What you use for development and what you ship seem like separate concerns to me.
@@ -208,7 +208,10 @@ def import_tensorflow_extension():
         ext = os.path.join(path, "libtensorflow_framework.so")
         if os.path.exists(ext):
             import ctypes
-            ctypes.CDLL(ext)
+            try:
+                ctypes.CDLL(ext)
In the case where this fails, why not try `import tensorflow`?
It happens inside of the manylinux1 build (the OSError occurs if the version of GLIBC is too old for TensorFlow); in this case `import tensorflow` also fails.
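The fallback being discussed can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the library path and the `"loaded"`/`"imported"`/`"unavailable"` return values are made up for the example, and as noted above, inside the manylinux1 build the `import tensorflow` fallback fails for the same GLIBC reason.

```python
# Sketch of the fallback discussed above: if dlopen-ing
# libtensorflow_framework.so raises OSError (e.g. GLIBC too old),
# try importing tensorflow itself; if that also fails, give up
# gracefully instead of crashing pyarrow's import.

import ctypes

def load_tf_framework(path):
    try:
        ctypes.CDLL(path)          # raises OSError if unloadable
        return "loaded"
    except OSError:
        try:
            import tensorflow      # noqa: F401 -- often fails too
            return "imported"
        except ImportError:
            return "unavailable"

# A nonexistent path exercises the OSError branch.
result = load_tf_framework("/nonexistent/libtensorflow_framework.so")
```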
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see.
Thanks all! If any of you have a moment, maybe you can jot down some ideas for where to go with TF integration from here? https://cwiki.apache.org/confluence/display/ARROW#ApacheArrowHome-MachineLearningFrameworkIntegrations |
I would be really happy if this code could move to tensorflow/io. It has caused quite a few headaches in our packaging scripts.
This is based off of #2046 from @pcmoritz and @concretevitamin, who wrote the original TensorFlow Op with help from @ericl.
In addition to the previous PR, this supports serialization of arbitrary type tensors.
To get it working, Arrow has to be compiled in the following way:
And pyarrow like this:
The PR also includes utilities that should be generally useful for converting between TensorFlow and Arrow. It is currently intended for transferring weights in and out of TensorFlow, but hopefully in the future there will also be support for data tensors.
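The core idea behind the op — writing a tensor's raw contents into a flat shared buffer and reconstructing a tensor view on the other side — can be shown with a stdlib-only sketch. This is illustrative only: the stdlib `array` type stands in for a TF tensor, and the bytes object stands in for a Plasma object's payload; the real op works with Plasma buffers and device memory.

```python
# Illustrative sketch of the transfer idea: serialize a tensor's raw
# contents into a flat buffer (as Plasma would hold it), then map the
# buffer back into a tensor of the same dtype on the reading side.
# Stand-ins: array("f", ...) for a float32 TF tensor, bytes for the
# Plasma object payload.

from array import array

def tensor_to_buffer(tensor: array) -> bytes:
    """Write the tensor's raw contents into a flat buffer."""
    return tensor.tobytes()

def buffer_to_tensor(buf: bytes, typecode: str) -> array:
    """Reconstruct a tensor of the given dtype from the buffer."""
    out = array(typecode)
    out.frombytes(buf)
    return out

weights = array("f", [0.5, -1.25, 3.0])     # values exact in float32
restored = buffer_to_tensor(tensor_to_buffer(weights), "f")
print(restored.tolist())  # [0.5, -1.25, 3.0]
```

The roundtrip preserves the values exactly because the dtype is carried alongside the buffer; the real op similarly needs the tensor's type and shape as metadata next to the Plasma object.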