
[FEA] Produce and Consume ArrowDeviceArray struct from cudf::table / cudf::column #14926

Closed
zeroshade opened this issue Jan 29, 2024 · 44 comments
Labels
feature request New feature or request

Comments

@zeroshade
Contributor

Is your feature request related to a problem? Please describe.
I would like to generate Arrow IPC payloads from a cudf::table without copying the data off of the GPU. The existing to_arrow and from_arrow functions explicitly copy data to and from the device, so there is currently no efficient way to produce Arrow IPC payloads from libcudf without first copying everything off the device.

Describe the solution you'd like
In addition to the existing to_arrow and from_arrow functions, we could have a to_arrow_device_arr function that populates an ArrowDeviceArray struct from a cudf::table or cudf::column. We'd also create a from_arrow_device_arr function that could construct a cudf::table / cudf::column from an ArrowDeviceArray that describes Arrow data which is already on the device. Once you have the ArrowDeviceArray struct, the Arrow C++ library itself can be used to generate the IPC payloads without needing to copy the data off the device. This would also increase the interoperability options that libcudf has, as anything which produces or consumes ArrowDeviceArray structs could hand data off to libcudf and vice versa.
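A rough sketch of what those signatures might look like (names and ownership semantics are placeholders, not a settled design; the first mirrors the POC linked below):

arrow::Status to_arrow_device_arr(std::shared_ptr<cudf::table> input,
                                  struct ArrowDeviceArray* out,
                                  rmm::cuda_stream_view stream);

arrow::Status from_arrow_device_arr(struct ArrowDeviceArray* input,
                                    std::unique_ptr<cudf::table>* out,
                                    rmm::cuda_stream_view stream);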

Describe alternatives you've considered
An alternative would be to implement Arrow IPC creation inside the libcudf library itself, but that was explicitly removed from libcudf because it required linking against libarrow_cuda.so (#10994). Implementing conversions to and from ArrowDeviceArray wouldn't require linking against libarrow_cuda.so at all, and would give any consumer an easy way to create Arrow IPC payloads or do whatever else they want with the resulting Arrow data, such as leveraging CUDA IPC.

Additional context
When designing the ArrowDeviceArray struct, I created https://github.com/zeroshade/arrow-non-cpu as a POC that used Python numba to generate and operate on some GPU data before handing it off to libcudf and then getting it back, all without copying off the device, using ArrowDeviceArray as the hand-off mechanism.

More recently I've been working on creating a protocol for sending Arrow IPC data that is located on GPUs across high-performance transports like UCX. To this end, I created a POC using libcudf to pass the data. As a result I have a partial implementation of the to_arrow_device_arr which can be found here. There are likely better ways to do what I'm doing in there, but it worked for my POC.

The contribution guidelines say I should file this issue first for discussion rather than just submitting a PR, so that's where I'm at. I plan on trying to create a full implementation that I can contribute but wanted to have this discussion and get feedback here first.

Thanks for hearing me out everyone!

@zeroshade zeroshade added Needs Triage Need team to review and classify feature request New feature or request labels Jan 29, 2024
@davidwendt
Contributor

I don't think this requires new APIs to libcudf.

The cudf::column and cudf::table are data-owning structs in libcudf.
For zero-copy you should be able to wrap the arrow data in device memory with a cudf::column_view (or cudf::table_view), which are non-owning data structures.

All libcudf APIs accept cudf::column_view objects and so do not require an owning object; there should be no need to copy the arrow data in order to call a libcudf function.

Generally, libcudf APIs will return a new cudf::column object since they are modifying or creating new column/table data. You can take ownership of this data (which should be in arrow format in device memory) using the cudf::column::release() method and then place the data in an appropriate arrow structure.
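A minimal sketch of that pattern, assuming a plain INT32 column whose buffers are already in device memory (the pointer names and some_libcudf_api are hypothetical):

// 1. Wrap existing Arrow device buffers in a non-owning view:
cudf::column_view view{cudf::data_type{cudf::type_id::INT32},
                       num_rows,          // hypothetical: row count of the Arrow array
                       int32_device_ptr,  // hypothetical: Arrow data buffer on device
                       validity_ptr,      // Arrow validity bitmap, or nullptr
                       null_count};

// 2. Call a libcudf API with the view, then take ownership of the result:
std::unique_ptr<cudf::column> result = some_libcudf_api(view);  // hypothetical API
cudf::column::contents parts = result->release();  // data/null_mask as rmm::device_buffers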

@zeroshade
Contributor Author

For zero-copy you should be able to wrap the arrow data in device memory with a cudf::column_view (or cudf::table_view), which are non-owning data structures.

The issue is that it's not clear-cut how to perform that wrapping, since libcudf's memory representation still differs from Arrow's in some cases, in addition to differences in how the buffers are handled (such as string columns, where cudf uses children for the offsets/data while Arrow keeps the offsets and data as plain buffers, not children). There's a significant amount of code required to correctly wrap cudf::column_views around Arrow data in device memory (note the significant amount of code in https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/interop/from_arrow.cu and https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/interop/to_arrow.cu), which makes it worthwhile to have functions in libcudf itself to encapsulate this logic rather than have consumers replicate it in their own libraries.
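To make the strings example concrete, the mapping at issue looks roughly like this (simplified; cudf layout as of branch-24.04):

cudf STRING column                       Arrow string array
  parent: null mask + size/offset         buffers[0]: validity bitmap
  child 0: INT32 offsets column           buffers[1]: int32 offsets
  child 1: INT8 chars column              buffers[2]: character data
                                          children: none

What cudf models as child columns, Arrow models as a flat buffer list on a single array, so the wrapping code must re-shuffle pointers rather than copy a struct field-for-field.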

You can take ownership of this data (which should be in arrow format in device memory) using the cudf::column::release() method and then place the data in an appropriate arrow structure.

Sure, but like I mentioned above, it's not necessarily as simple as placing it in the appropriate arrow structure. It requires a significant amount of code to do correctly and properly, and it makes sense for that code to live within libcudf, particularly because it can then be kept up to date as libcudf adds support for more Arrow data types.

@davidwendt
Contributor

Ok, that makes sense. This appears to be just a wrapper around the cudf::column_view constructor, but resolving the data type and the other components needed for the parameters would involve significant code, and the reverse as well.

@zeroshade
Contributor Author

Before I go further working on it, could you take a look at my partial implementation in https://github.com/zeroshade/cudf-flight-ucx/blob/main/to_arrow.cc and let me know if you think that's a good direction to go in as opposed to a different approach?

If it's a good approach then I'll work on creating a PR for this.

@davidwendt
Contributor

davidwendt commented Jan 30, 2024

Who owns the data after this call?

I expected that this signature

arrow::Status to_arrow_device_arr(std::shared_ptr<cudf::table> input,
                                  struct ArrowDeviceArray* out,
                                  rmm::cuda_stream_view stream)

would be more like

std::unique_ptr<struct ArrowDeviceArray> to_arrow_device_array(cudf::table_view const& input,
                                                               rmm::cuda_stream_view stream)

And throw an exception with an error message instead of returning a status.

Also, I don't understand what the cudaEventRecord objects are for and why they should be created here.
It seems a fragile piece of logic subject to fleeting changes in your arrow struct implementation.
I wonder if Arrow could have an API to build an ArrowDeviceArray from simple native elements like pointers and integers, similar to the parameters for the cudf::column_view and cudf::table_view constructors.

@zeroshade
Contributor Author

Who owns the data after this call?

Technically the data has shared ownership. The ArrowDeviceArray maintains a reference to the passed in std::shared_ptr<cudf::table> to keep it alive until the release callback in the struct is called. Because the ArrowDeviceArray is intended to be a C ABI, it uses a release callback to control the lifetime of the underlying data.

The problem with this

std::unique_ptr<struct ArrowDeviceArray> to_arrow_device_array(cudf::table_view const& input,
                                                                      rmm::cuda_stream_view stream)

is that, because you're only passing in a cudf::table_view, which doesn't own its data, there's no way to guarantee that the data stays alive until the release callback on the ArrowDeviceArray is called. We need to ensure that the underlying data stays valid and alive until the release callback is invoked.

If we don't like the idea of the shared ownership, then the interface could take a cudf::table and give the ownership to the ArrowDeviceArray entirely rather than sharing ownership.
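A minimal sketch of that shared-ownership arrangement, using the POC signature above (the private-data struct is hypothetical; the struct members are from the C data interface):

// keep the table alive by stashing a reference in private_data
struct private_data_t {
  std::shared_ptr<cudf::table> table;
};

static void release_fn(struct ArrowArray* arr) {
  delete static_cast<private_data_t*>(arr->private_data);  // drops the shared reference
  arr->release = nullptr;  // mark the array released, per the spec
}

// ... at creation time:
out->array.private_data = new private_data_t{input};
out->array.release      = release_fn;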

Also, I don't understand what the cudaEventRecord objects are for and why they should be created here.
It seems a fragile piece of logic subject to fleeting changes in your arrow struct implementation.

We don't manually synchronize on the stream during creation of the ArrowDeviceArray; instead we create an event and record it on the provided stream. The event is then part of the ArrowDeviceArray struct, so that a consumer can have their own stream wait on that event before attempting to access the data. This lets the consumer of the struct choose when they synchronize and on whatever stream they want. It also allows the struct to be passed across C boundaries to different libraries and/or runtimes (such as Python, Go, Rust, etc.) while letting the consumer synchronize the GPU however they want. @kkraus14 might be better able to explain the reasoning for the event in the struct than I can.
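A sketch of the consumer side, assuming a CUDA device array (per the C Device Data Interface, sync_event points at a cudaEvent_t for CUDA devices; consumer_stream is hypothetical):

// make the consumer's stream wait for the producer's pending work
if (arr->sync_event != nullptr) {
  auto event = *static_cast<cudaEvent_t*>(arr->sync_event);
  cudaStreamWaitEvent(consumer_stream, event, 0);
}
// work enqueued on consumer_stream after this point can safely read the buffers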

That said, the ArrowDeviceArray struct is intended to be ABI stable and will not change. You can find the full definition, documentation and reasoning behind the structure of the ArrowDeviceArray here: https://arrow.apache.org/docs/format/CDeviceDataInterface.html

I wonder if Arrow could have an API to build an ArrowDeviceArray from simple native elements like pointers and integers, similar to the parameters for the cudf::column_view and cudf::table_view constructors.

That's exactly what the ArrowDeviceArray struct is. It's the collection of "simple native elements like pointers and integers" that describe the array and its children. Arrow provides APIs which take the struct and construct Arrow Arrays or RecordBatches from the struct (https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.h#L232)

In this scenario, arrow::Array and arrow::RecordBatch are equivalent to cudf::column and cudf::table. ArrowDeviceArray is a struct that encapsulates all of the pointers/lengths/null counts/etc. needed to send the data zero-copy across a C ABI boundary.

@kkraus14
Collaborator

Hey @davidwendt 😃

The CUDA event in the struct that @zeroshade mentioned is the responsibility of the producer of the struct. The producer should capture all relevant work related to allocating and populating the memory being handed or shared to the struct, so that a downstream user of the struct can wait on the event and guarantee that whatever stream they're working on doesn't race with the allocations/kernels that produced the memory.

The reason behind using a CUDA event as opposed to a CUDA stream is that often frameworks don't have a mechanism to share or extend the lifetime of their streams outside of their framework.
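On the producer side that amounts to something like the following sketch (error handling omitted; out is a hypothetical ArrowDeviceArray* being populated, stream is the stream on which the data was produced):

// record all pending work for this data, then publish the event in the struct
auto event = std::make_unique<cudaEvent_t>();
cudaEventCreate(event.get());
cudaEventRecord(*event, stream.value());
out->sync_event = event.release();  // torn down later by the release callback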

As far as lifetime management of the actual memory goes, the struct's release callback is designed to be flexible enough to accommodate both owning and non-owning situations. I.e. if someone had a cudf::column_view, we could basically just have an empty release callback so it functions as a view without any form of ownership. In the case of cudf having unique ownership in something like a std::unique_ptr<cudf::column>, it would likely make sense to transfer ownership to the struct.

@davidwendt
Contributor

Hey Keith. Thanks, but it seems this kind of Arrow-specific logic for an Arrow-specific struct does not belong in libcudf. It seems a bit fragile in that changing how Arrow manages objects would require changes in a non-Arrow repo (like cudf). For example, if in the future Arrow decided that cudaEventCreate was not sufficient and relied on cudaEventCreateWithFlags instead, a new cudf PR would be required to make this compliant again.

I was picturing more of an arrow::ArrowArray::Make() factory function that would handle these kinds of details.
Something like this perhaps (likely needs tweaking):

std::unique_ptr<ArrowArray> Make(int length, int null_count, int offset, void* buffer, std::vector<ArrowArray> children);

(And similar one for ArrowDeviceArray)
And this function would handle all the Arrow-specific things, including the release mechanism and whatever CUDA objects it needs. It also gives the Arrow code complete control over how it is created and destroyed.

Then the libcudf function could simply call this with the appropriate counts and device pointers.

@davidwendt
Contributor

Because you're only passing in a cudf::table_view, which doesn't own its data, there's no way to guarantee that the data stays alive until the release callback on the ArrowDeviceArray is called. We need to ensure that the underlying data stays valid and alive until the release callback is invoked.

Generally, libcudf is based on accepting views that are non-owning as per our developer guidelines.
https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#views-and-ownership
The caller must ensure proper ownership scope and lifetime. This also provides a great deal of flexibility since there is no guarantee the original data is owned by a cudf::table in the first place.

@kkraus14
Collaborator

Thanks, but it seems this kind of Arrow-specific logic for an Arrow-specific struct does not belong in libcudf. It seems a bit fragile in that changing how Arrow manages objects would require changes in a non-Arrow repo (like cudf).

This is Arrow format specific, but not Arrow library specific. Libcudf already has to_arrow and from_arrow functions to go from device memory in libcudf containers to host memory in arrow containers, so it's already been exposed to any fragility in Arrow, but the memory layout and ABI have been stable for years.

What is proposed here doesn't use Arrow containers and is designed to be a vendorable single header with a stable ABI so there really isn't additional exposure to Arrow that isn't already there.

I was picturing more of an arrow::ArrowArray::Make() factory function that would handle these kinds of details. Something like this perhaps (likely needs tweaking):

std::unique_ptr<ArrowArray> Make(int length, int null_count, int offset, void* buffer, std::vector<ArrowArray> children);

(And similar one for ArrowDeviceArray) And this function would handle all the Arrow-specific things, including the release mechanism and whatever CUDA objects it needs. It also gives the Arrow code complete control over how it is created and destroyed.

Then the libcudf function could simply call this with the appropriate counts and device pointers.

In theory something like this could be added as a free function in the vendorable header, but you'd need to handle all the nesting structure that columns can have, where you'd likely end up recreating a healthy chunk of what this struct already describes. No matter what, there's some translation that needs to happen from how libcudf organizes its device pointers into some type of interface, and that's basically what this struct is.

@kkraus14
Collaborator

Also, this interface could be used to replace the existing to_arrow and from_arrow functions and remove the need to actually depend on the arrow library for this functionality in the future. You could return host memory via this interface, and there would be functions in the arrow library that could be called against the returned struct to get arrow containers, similar to what the current to_arrow / from_arrow functions do.
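A sketch of what that could look like: the same struct describes host data by changing the device type, with no synchronization event needed (out is a hypothetical ArrowDeviceArray* being populated; buffer filling elided):

out->device_type = ARROW_DEVICE_CPU;
out->sync_event  = nullptr;  // host memory needs no GPU synchronization
// out->array.* is filled with host buffer pointers exactly as in the device case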

@davidwendt
Contributor

davidwendt commented Jan 31, 2024

In theory something like this could be added as a free function in the vendorable header, but you'd need to handle all the nesting structure that columns can have, where you'd likely end up recreating a healthy chunk of what this struct already describes. No matter what, there's some translation that needs to happen from how libcudf organizes its device pointers into some type of interface, and that's basically what this struct is.

No, I would not expect Arrow to unwind libcudf data structures. My suggestion leaves most of the proposed logic intact (type-dispatch, etc) but just replaces the pieces that create the ArrowArray and ArrowDeviceArray with factory functions implemented in the Arrow source.

I will work on a counter-proposal.

@kkraus14
Collaborator

I will work on a counter-proposal.

Thank you! We'll more than happily review and iterate on it with you! 😃

@davidwendt
Contributor

Ok, this is what I'm proposing for the two Make functions to go in the Arrow source.
We can name them whatever makes sense, and we can change the return type to std::shared_ptr if that helps too.

namespace arrow {

/// raw pointer and a function to free it
using OwningBuffer = std::pair<const void*, std::function<void()>>;

namespace {
// generic object deleter functor for ArrowArray instances
struct DeleterFn {
  std::vector<ArrowArray*> children;
  std::vector<OwningBuffer> owners;
  ~DeleterFn()
  {
    for (auto& c : children)
      ArrowArrayRelease(c);
    for (auto& o : owners)
      std::invoke(o.second);
  }
};
}  // namespace

std::unique_ptr<ArrowArray> MakeArrowArray(int64_t length,
                                           int64_t null_count,
                                           int64_t offset,
                                           std::vector<OwningBuffer>&& data    = {},
                                           std::vector<ArrowArray*>&& children = {},
                                           ArrowArray&& dictionary             = {0})
{
  auto result = new ArrowArray{};
  std::memset(result, 0, sizeof(ArrowArray));

  const void** buffers = (const void**)(malloc(sizeof(void*) * data.size()));
  std::transform(data.begin(), data.end(), buffers, [](auto& buffer) { return buffer.first; });

  result->length     = length;
  result->null_count = null_count;
  result->offset     = offset;
  result->n_buffers  = static_cast<int64_t>(data.size());
  result->n_children = static_cast<int64_t>(children.size());
  result->buffers    = buffers;
  result->children   = children.data();
  result->dictionary = dictionary.length == 0 ? nullptr : new ArrowArray(std::move(dictionary));
  result->release    = [](struct ArrowArray* arr) {
    free(arr->buffers);
    auto d = static_cast<DeleterFn*>(arr->private_data);
    delete d;
    if (arr->dictionary) ArrowArrayRelease(arr->dictionary);
    ArrowArrayMarkReleased(arr);
  };
  result->private_data = new DeleterFn{std::move(children), std::move(data)};
  return std::unique_ptr<ArrowArray>(result);
}

std::unique_ptr<ArrowDeviceArray> MakeDeviceArray(ArrowArray&& array)
{
  auto result = new ArrowDeviceArray{std::move(array)};
  cudaEventCreate(reinterpret_cast<cudaEvent_t*>(&(result->sync_event)));
  int dev_id = 0;
  cudaGetDevice(&dev_id);
  result->device_id   = dev_id;
  result->device_type = ARROW_DEVICE_CUDA;
  return std::unique_ptr<ArrowDeviceArray>(result);
}
}  // namespace arrow

The DeleterFn can certainly go in a .cpp file along with the 2 function definitions.
No need for these to be declared and defined in a header file.
I believe this should work and that all the appropriate objects are managed correctly, but I've not tested it.

@davidwendt
Contributor

davidwendt commented Feb 1, 2024

I spent some time recoding each of the dispatch functions to use these Make factories.
Here are a few of them for reference.

  // handles most of the fixed-width types
  std::unique_ptr<ArrowArray> operator()(cudf::column_view input, rmm::cuda_stream_view)
  {
    std::vector<arrow::OwningBuffer> data{{input.null_mask(), empty_fn}, {input.head(), empty_fn}};
    return arrow::MakeArrowArray(input.size(), input.null_count(), input.offset(), std::move(data));
  }

// the bool specialization shows passing in a custom 'delete' function for freeing the device_buffer
template <>
std::unique_ptr<ArrowArray> dispatch_to_arrow::operator()<bool>(cudf::column_view input,
                                                                rmm::cuda_stream_view stream)
{
  cudf::column_view view_without_offset =
    input.offset() == 0
      ? input
      : cudf::column_view{input.type(), input.size() + input.offset(),
                          input.head(), input.null_mask(), input.null_count()};
  auto bitmask = std::get<0>(cudf::detail::bools_to_mask(
    view_without_offset, stream, rmm::mr::get_current_device_resource()));

  std::vector<arrow::OwningBuffer> data{{input.null_mask(), empty_fn}};
  data.emplace_back(device_buffer_to_arrow(std::move(*bitmask.release())));
  return arrow::MakeArrowArray(input.size(), input.null_count(), input.offset(), std::move(data));
}
...
which uses this utility (to be included in the libcudf source along with these dispatch functions):
...
// utility to transfer a device_buffer to an OwningBuffer
arrow::OwningBuffer device_buffer_to_arrow(rmm::device_buffer&& buffer)
{
  auto dbuf    = new rmm::device_buffer(std::move(buffer));
  auto deleter = [dbuf]() { delete dbuf; };
  return arrow::OwningBuffer{dbuf->data(), deleter};
}

// the main public function that returns the new ArrowDeviceArray
std::unique_ptr<arrow::ArrowDeviceArray> to_arrow_device_array(table_view input_view,
                                                               rmm::cuda_stream_view stream)
{
  std::vector<ArrowArray*> children;
  for (auto& c : input_view) {
    auto col = c.type().id() != cudf::type_id::EMPTY
                 ? cudf::type_dispatcher(c.type(), detail::dispatch_to_arrow{}, c, stream)
                 : detail::create_null_array(c.size());
    children.emplace_back(col.release());
  }
  std::vector<arrow::OwningBuffer> data{{nullptr, detail::empty_fn}};
  auto array = arrow::MakeArrowArray(input_view.num_rows(), 0, 0, std::move(data), std::move(children));
  return arrow::MakeDeviceArray(std::move(*array.release()));
}
...
The create_null_array() was copied from the original get_null_arr()
...
std::unique_ptr<ArrowArray> create_null_array(int size)
{
  auto arr = std::make_shared<arrow::NullArray>(size);
  auto out = new ArrowArray{};
  ARROW_UNUSED(arrow::ExportArray(*arr, out));
  return std::unique_ptr<ArrowArray>(out);
}

Let me know if you want to see any of the other ones.
I didn't realize how differently each type is built into an ArrowArray, but the Make function seems to handle them all.
I was a bit surprised the type-id is not included in the structure.

@vyasr
Contributor

vyasr commented Feb 2, 2024

I think we'd also want to look into nanoarrow (#13678) before we design any new structs ourselves. If I'm reading this discussion right it seems like there should be significant overlap given that nanoarrow has a device-side extension.

@kkraus14
Collaborator

kkraus14 commented Feb 4, 2024

The DeleterFn can certainly go in a .cpp file along with the 2 function definitions.
No need for these to be declared and defined in a header file.
I believe this should work and that all the appropriate objects are managed correctly, but I've not tested it.

This is unfortunately a C API as opposed to a C++ API. I imagine we could make this work regardless, but the bigger question is where we would expect this to live. If this lived in the main Arrow library it would defeat the goal of being dependency free and require linking libarrow, which has a somewhat non-trivial dependency tree of its own. One of the goals of the interface is explicitly to avoid an explicit dependency on Arrow: https://arrow.apache.org/docs/format/CDeviceDataInterface.html#goals.

We could potentially implement something like this in nanoarrow as @vyasr mentioned above, but we'd probably need to take in the CUDA event somewhere rather than having the make function create and record it, since the incoming buffers could potentially be on different streams or the like, and I don't think there's a nice general way for something like nanoarrow to introspect and handle things properly. Additionally, the device support and subsequent CUDA device extension in nanoarrow are quite new: there aren't interfaces yet for things like stream-ordered memory management or stream-ordered copying, so I'm not sure how helpful it would be in the actual implementation here beyond providing the relevant header definitions for the Arrow C Device Data Interface.

@davidwendt
Contributor

davidwendt commented Feb 6, 2024

Ok. The link was helpful background.

If this lived in the main Arrow library it would defeat the goal of being dependency free and require linking libarrow, which has a somewhat non-trivial dependency tree of its own.

Since libcudf is already linking to libarrow.so, I'd like to consider Arrow providing these functions as an alternative to hand building the struct elements as illustrated in the original proposal.

@kkraus14
Collaborator

kkraus14 commented Feb 6, 2024

Since libcudf is already linking to libarrow.so

My understanding is that there is a desire for libcudf to no longer link against libarrow.so; this proposal would enable a path to removing one of the key places it's used (to_arrow and from_arrow), as well as enabling handing GPU memory to other libraries that don't link against libcudf.

I believe from some conversations with @beckernick that Arrow increasing its major version ~quarterly, with libcudf tied to a specific major version, has caused some compatibility pain in working with other packages across the ecosystem.

I'd like to consider Arrow providing these functions as an alternative to hand building the struct elements as illustrated in the original proposal.

I don't think this is particularly feasible. There are different ownership models/semantics that Arrow would need to capture and support here, e.g. shared ownership, where someone would want to more or less stuff some shared_ptrs into the private_data struct member and handle them appropriately in the release callback.

Additionally, in your proposal above you'd still need to organize your buffers and child columns into a flattened structure, pass the device type, and create + record the CUDA event for synchronization yourself. It seems like the main difference would be moving the handling of ownership semantics into Arrow as opposed to handling it in libcudf?

@bkietz

bkietz commented Feb 6, 2024

I'd like to consider Arrow providing these functions as an alternative to hand building the struct elements as illustrated in the original proposal.

I'm not sure what the functions would look like if they lived in libarrow, since libarrow can't use definitions of libcudf classes like table_view. Could you sketch the signatures you were thinking of?

@davidwendt
Contributor

I'd like to consider Arrow providing these functions as an alternative to hand building the struct elements as illustrated in the original proposal.

I'm not sure what the functions would look like if they lived in libarrow, since libarrow can't use definitions of libcudf classes like table_view. Could you sketch the signatures you were thinking of?

#14926 (comment)

@davidwendt
Contributor

What version of Arrow includes ArrowDeviceArray? I don't see it in the version used by libcudf so upgrading may be a prerequisite for this work.

I'm still puzzled by the lack of a type-id in these structures.
What is your proposal for from_arrow_device_array? I believe it should be possible to build a cudf::table_view/cudf::column_views but only if the type-ids are available.
Building a cudf::table does not look possible since cudf::column objects expect to own their data and for it to be stored in an rmm::device_buffer which is managed by RMM. There is no mechanism for RMM to manage device memory that it has not allocated.

@vyasr
Contributor

vyasr commented Feb 6, 2024

My understanding is that there is a desire for libcudf to no longer link against libarrow.so

@kkraus14 you're right that we would eventually like to stop linking against libarrow if possible.

We could potentially implement something like this in nanoarrow as @vyasr mentioned above

My understanding is that nanoarrow was intended to provide essentially what we would need to decouple the existing Arrow interop functionality in libcudf from linkage to libarrow itself: a small, easily vendored library that provides an implementation of readers/writers of the Arrow C data interface so that various libraries could produce ABI-equivalent versions of Arrow data structures without linking. Do I have that right?

but [...] I don't think there's a nice general way for something like nanoarrow to introspect and handle things properly.

Assuming my understanding of the goals of nanoarrow above is correct, is the main concern here leaking too much CUDA-specific information into the nanoarrow implementation, which would be a long-term issue? Or are you mostly concerned with the more short-term issue that

the device support and subsequent CUDA device extension in nanoarrow are quite new: there aren't interfaces yet for things like stream-ordered memory management or stream-ordered copying

If it's the latter, then could it make sense to implement this kind of functionality in libcudf (or a separate but associated library) for now but eventually upstream it to nanoarrow?

If it's the former, then I'd like to better understand the inherent limitations you see in nanoarrow and see if we can find a path to upstream this. At a high level I think I understand your concerns but I would like to dig into the details a bit since IMHO something like this really ought to be within the long-term scope of nanoarrow if I've understood its intent properly. I think I agree that we'll always need some functionality in cudf, but in an ideal world I would hope that we'd have something close to as simple as (very rough, not trying to be precise with types etc since I imagine all that could change in nanoarrow):

ArrowDeviceArrayView to_arrow(column_view col) {
    // Note that I'm constructing an ArrowDeviceArrayView, not an ArrowDeviceArray,
    // because I assumed those were the intended semantics of that object.
    // Since it's a view and not a copy stream-ordering concerns seem like they'd be obviated.
    return ArrowDeviceArrayViewInit(col.data(), col.size());
}

but I really haven't looked into nanoarrow enough yet to understand where/why this would be problematic.

@zeroshade
Contributor Author

The ArrowDeviceArray was introduced with arrow v13 or v14. You'll find it in the header file <arrow/c/abi.h>

I'm still puzzled by the lack of a type-id in these structures.

The type IDs are managed by a corresponding ArrowSchema object, also in the same header. They are separate so that a stream of batches of data with the same schema doesn't have to duplicate the type information for every batch of records. So libcudf would need two methods: one to fill in an ArrowSchema and its children based on the type of a column/column_view or a table/table_view (a table is treated as a struct column whose fields are the columns of the table, so everything is seamless), and one to populate the ArrowDeviceArray itself.
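For illustration only, resolving cudf type ids from an ArrowSchema's format strings might look like this sketch (only a few of the format codes from the C data interface shown; the helper name is hypothetical):

cudf::data_type arrow_format_to_cudf(char const* format)
{
  // format strings per the Arrow C data interface spec
  if (std::strcmp(format, "i") == 0) return cudf::data_type{cudf::type_id::INT32};
  if (std::strcmp(format, "l") == 0) return cudf::data_type{cudf::type_id::INT64};
  if (std::strcmp(format, "g") == 0) return cudf::data_type{cudf::type_id::FLOAT64};
  if (std::strcmp(format, "u") == 0) return cudf::data_type{cudf::type_id::STRING};
  CUDF_FAIL("unsupported Arrow format string");
}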

The other issue I see with pushing this upstream is that the ArrowDeviceArray on the arrow side is supposed to be device agnostic. Any init function we provide would also need to have the device type passed in (we can't assume CUDA) which would also require the caller to pass in any synchronization event (if required).

Essentially, the only thing a helper function like what you're asking for could do is be a wrapper around populating a C struct, which seems a little redundant and unnecessary, at least to me.

Nanoarrow could certainly be used to simplify the creation of the ArrowArray and ArrowSchema objects though. @paleolimbot could comment further on nanoarrow for this

@paleolimbot

Very interesting read! No pressure to use nanoarrow's implementation for any of this...if it can't help, its source might be useful to review and/or it might give you another endpoint to test against.

The nanoarrow C library (without any CUDA integration) can definitely populate the ArrowSchema for you. It might look like:

#include "nanoarrow.hpp"

int export_column_schema(const cudf::column& col, ArrowSchema* out) {
  nanoarrow::UniqueSchema tmp;
  ArrowSchemaInit(tmp.get());
  // I imagine there is already a mapping to an Arrow type id somewhere in cudf, but for example...
  NANOARROW_RETURN_NOT_OK(ArrowSchemaSetType(tmp.get(), NANOARROW_TYPE_INT32));

  ArrowSchemaMove(tmp.get(), out);
  return NANOARROW_OK;
}

The nanoarrow C library (also without any CUDA integration) can also populate an ArrowArray for you. If you want to export an ArrowArray that's actually non-owning (just pretending to be owning), you could do:

int export_column_view_array(const cudf::column_view& col, ArrowArray* out) {
  nanoarrow::UniqueArray tmp;
  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(tmp.get(), NANOARROW_TYPE_INT32));
  tmp->length = col.length;
  // offset, null_count
  tmp->buffers[1] = col.data_buffer_start_addr;
  // If validity bitmaps are a thing in cudf: tmp->buffers[0] = col.validity_buffer;

  ArrowArrayMove(tmp.get(), out);
  return NANOARROW_OK;
}

If you want to export an ArrowArray that fully conforms to the spec (i.e., it is safe to access buffer content until the consumer calls the release callback), you could also use nanoarrow but you would have to explode ownership to the buffer level, which it sounds like might involve some shared pointers or reference counting of some kind. Hypothetically:

static void finalize_buffer(ArrowBufferAllocator* allocator, uint8_t* ptr, int64_t size) {
  auto* shared_col = reinterpret_cast<std::shared_ptr<cudf::column>*>(allocator->private_data);
  delete shared_col;
}

int export_column_view_array(const std::shared_ptr<cudf::column> col, ArrowArray* out) {
  nanoarrow::UniqueArray tmp;
  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(tmp.get(), NANOARROW_TYPE_INT32));
  tmp->length = col->length;

  ArrowBuffer* data_buffer = ArrowArrayBuffer(tmp.get(), 1);
  ArrowBufferSetAllocator(
    data_buffer,
    ArrowBufferDeallocator(&finalize_buffer, new std::shared_ptr<cudf::column>(col)));
  data_buffer->data = col->data_buffer_start_addr;

  NANOARROW_RETURN_NOT_OK(
    ArrowArrayFinishBuilding(tmp.get(), NANOARROW_VALIDATION_LEVEL_MINIMAL, nullptr));
  ArrowArrayMove(tmp.get(), out);
  return NANOARROW_OK;
}

The only CUDA-specific part would be ensuring that the cudaEvent_t pointer in the ArrowDeviceArray struct is cleaned up when the outermost ArrowArray's release callback is called.
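That cleanup could ride along in the release callback, roughly like this sketch (the private-data layout is hypothetical):

static void release_with_event_cleanup(struct ArrowArray* arr) {
  auto* pdata = static_cast<my_private_data*>(arr->private_data);  // hypothetical struct
  cudaEventDestroy(pdata->event);  // tear down the event published in sync_event
  // ... release child arrays / drop buffer ownership references as before ...
  delete pdata;
  arr->release = nullptr;
}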

@vyasr is correct that there is an ArrowArrayView (and, in the work-in-progress device helpers, an ArrowDeviceArrayView). It sounds like this is the equivalent of your cudf::column_view; however, it's not ABI stable and so I'm not sure it will be all that useful to use as an interface.

@jrhemstad
Contributor

I haven't read the full thread in detail, but here's my $0.02.

As far as I'm concerned, the whole reason for Arrow's existence and why RAPIDS built on it in the first place was to enable zero-copy data sharing with a common vocabulary for in-memory, columnar data.

My memory is hazy, but I believe the only reason the original cudf::to/from_arrow perform deep copies is because there wasn't yet a way to describe GPU memory with the Arrow data structures, so we had to always assume we had to make copies to/from host memory.

Now it seems we have a zero-copy way to describe GPU memory with Arrow, so libcudf should definitely enable that.

In my mind, this is equivalent to a C++ library that has its own custom string type: you had better provide a std::string_view conversion operator.

@vyasr
Contributor

vyasr commented Feb 9, 2024

I agree with that, I just want to make sure that we're leveraging newer Arrow tools and concepts (the C Data Interface, nanoarrow, etc) to the maximum extent possible, which also means making sure that we understand exactly what those tools have to offer and whether there is missing functionality that we should be helping to implement. The questions I'm asking are focused on filling the gaps in my understanding.

Ownership

The questions around proper ownership are ultimately quite similar to, for instance, how cuDF Python works. All objects allocated by libcudf algorithms are immediately forced to relinquish their ownership to Python objects that maintain their lifetime, and downstream algorithms then operate on views anyway so it doesn't matter that libcudf no longer owns the memory. It seems to me then that the proper signature would be ArrowDeviceArray to_arrow_device_array(unique_ptr<column> column) in this context because an ownership transfer would indeed be the only way to get proper interop with other Arrow consumers that are expecting a shared ownership model like Matt indicated above. On the flip side, for consuming arrow objects it seems like we'd want cudf::column_view from_arrow_device_array(ArrowDeviceArray) because we can only ever make a view since we cannot claim sole ownership of the data.

Am I missing anything here? It seems like these are the only ways to provide semantics that are consistent with the goal of minimizing data copies while also producing objects that are consistent with the Arrow spec. There is a fundamental difference between the existing implementations in libcudf and the new ones we're proposing here because the host versions always require copies, whereas we want the device ones to never make copies (or perhaps only sometimes, in which case we'd need different versions of the APIs).

Object Creation

This is where I was hoping that nanoarrow could help, and thanks to @paleolimbot we have some good examples. The example @zeroshade linked above looks like it's on the right track, and it seems like it could be written to use nanoarrow instead of arrow APIs based on the examples @paleolimbot showed above. If not, is there missing functionality that we should be helping to add? Creating those structures with nanoarrow seems like exactly what it's intended for and would allow the resulting library to have no direct dependency on libarrow, which would be nice and probably be a template for something we'd try to do with our existing Arrow host-data interop APIs eventually.

Where does the code live

Based on the above I certainly think it makes sense for libcudf to own the logic for mapping our internal representation of Arrow data into Arrow's structs. What I would hope is that it would be possible to use nanoarrow to allocate the necessary Arrow structs, and then ideally to use nanoarrow APIs to populate those structs, within a libcudf-specific function that knows how to translate our types and our groupings of (Arrow-compliant) data buffers into Arrow's types and Arrow's structs. But Matt brings up a few points regarding that:

The other issue I see with pushing this upstream is that the ArrowDeviceArray on the arrow side is supposed to be device agnostic. Any init function we provide would also need to have the device type passed in (we can't assume CUDA) which would also require the caller to pass in any synchronization event (if required).

I seem to recall the discussions around Arrow device data also covering this and designing for the need to pass around synchronization (CUDA) events. @kkraus14 can probably say more, but isn't Arrow already designing for this in some places? Are you thinking that it's just overkill in this context?

Essentially, the only thing a helper function like what you're asking for could do is be a wrapper around populating a C struct, which seems a little redundant and unnecessary, at least to me.
Nanoarrow could certainly be used to simplify the creation of the ArrowArray and ArrowSchema objects though. @paleolimbot could comment further on nanoarrow for this

It seems like using nanoarrow here would at least be helpful, and not redundant, as a way to protect against future non-ABI-breaking changes in the spec, e.g. if arrow arrays added fields at the end of the struct (that didn't change the alignment). That also isn't all that different from what's outlined in the example above. Maybe I'm exaggerating the likelihood of meaningful changes like this, though, and reaching for an external tool rather than adding this code to libcudf is unnecessarily complex.

@zeroshade
Contributor Author

Hey everyone, I've put up a draft PR (linked above) which provides an initial pass at a to_arrow_schema function to generate an ArrowSchema struct, and a first pass at creating an ArrowDeviceArray using nanoarrow which takes ownership of the data (it takes in a cudf::table& and calls release() on it, embedding ownership of the individual buffers into the ArrowArray objects themselves).

I haven't added any tests yet, and still need to implement string/list/struct/dictionary, but I figured it would be good to at least get eyes on what I have so far for initial opinions and to make sure I'm going in a good direction before I get too deep.

@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
@vyasr
Contributor

vyasr commented Mar 21, 2024

I've created #15370 as a sample of how we'd wrap this C++ API up for Python consumption. It's based on the ideas in apache/arrow#38325 although no spec has been officially accepted there yet. It does reveal an issue that we knew we were going to hit anyway elsewhere, and that was discussed to some extent earlier in this issue, namely the concerns about shared ownership. @zeroshade started discussing that a bit more in #14926 (comment).

The current API in #15047 is fine for C++ users who want to transfer ownership of a cudf table to a (set of) arrow arrays. With that API the arrow array becomes the primary owner. However, we also have (at least) a couple of important use cases where we cannot transfer ownership in that way:

  1. The legacy to_arrow implementation copies data from a cudf table to an arrow table, and from_arrow does the opposite. This is the standard method by which we have handled non-I/O interchange of host data in cudf. We would like to replace that implementation with a version that uses the C Data Interface because the current version uses the Arrow library directly and we would like to not have libcudf be so tightly coupled to libarrow (see [FEA] Reduce arrow library dependencies in cudf #15193). To maintain the current interface, though, we need a way to call to_arrow_device such that the original owning cudf table does not lose ownership because the arrow device array created will just be an intermediate state from which we cudaMemcpy the data to host arrow buffers.
  2. In cudf Python (Add Python interface for to_arrow_device #15370) we do not use any of libcudf's owning types at all. All unique_ptrs produced by libcudf are immediately released and ownership is transferred to Python objects. The reason for this is that we need cudf to work seamlessly with other Python types (arrays exposing __cuda_array_interface__ or __dlpack_device__ for example) that allow us to view their data without copying it. Data from libcudf and data from other types need to be interchangeable once they have been ingested by cudf's initial translation layers. If we then want to take a column of data from the Python layer and convert it to an arrow device array for consumption by other APIs, we cannot have downstream consumers of that array try to release data that we could never really provide ownership of.

The way I see it, there are a few paths forward here but only one that really seems realistic. The options I see are:

  1. Exposing a second version of to_arrow_device that takes a table_view and creates an ArrowDeviceArray whose release callback is a no-op. This is the most flexible solution I can think of. It allows the array to then be passed around and used normally. The caveat is that ownership must be maintained by the creator of the original data. However, this is already consistent with libcudf's lifetime semantics (everything produces unique_ptrs and it is up to the caller to make sure that table_views point to valid data while in use). This solution solves both problems 1 and 2 above. In the Python case, we can ensure proper lifetimes are handled automatically by adding references to the original owning objects to the capsules we produce. A slightly more extensive version of this would be to define a function template<typename Deleter> to_arrow_device(table_view, Deleter) that accepts a deleter as a callable. Then the caller would be responsible for deciding how the data is deallocated, effectively passing the buck upstream one step. In practice we'd probably end up making the deleter to do nothing for the above two examples, but this approach would allow the user more control if there were smarter things that could be done or use cases beyond these two that I'm not considering.
  2. Exposing a second version of to_arrow_device that takes a std::shared_ptr<table> and creates an ArrowDeviceArray whose ownership is truly shared. This approach has the benefit of creating a truly shared ownership situation. At the C++ layer the cost of this approach is minimal from the viewpoint of libcudf because we would simply have to release a unique_ptr into a shared_ptr, and this would completely solve problem 1 above because the release callback would then just delete the shared_ptr it owned and rely on shared_ptr ref counting to ensure that the data is deleted when all references are gone. Unfortunately, I see no way for this approach to solve problem 2 above since at the Python layer we never even have a table to begin with.
  3. Wait to use this from Python until all the libraries we might want to interact with also support the arrow data interface. I don't see this as a good solution though since we'd like to be able to leverage this feature now given the downstream benefits we see. This solution would resolve problem 2 but still leaves the need for a solution to problem 1.
  4. Stop supporting as inputs Python types that don't expose the arrow capsule interface. This option isn't really viable. If we did it, it would also only solve problem 2.

Of the above solutions 1 seems like the only really viable solution to me due to the need to solve both the Python and C++ use cases.

We could also completely throw in the towel on unifying the host and device code paths, and at that point the host code paths would become simpler to handle because those always require copies and there is no question of an ownership transfer so we could handle that separately. That essentially skips over any handling of proper 0-copy device data transfers in Python, though.
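To make option 1 concrete, a rough sketch of the view-semantics overload (hypothetical shape, not necessarily the final API): the release callback is a no-op because lifetime stays with the caller:

void to_arrow_device(cudf::table_view input,
                     struct ArrowDeviceArray* out,
                     rmm::cuda_stream_view stream)
{
  // ... populate out->array buffers/children from the view's device pointers ...
  out->array.release = [](struct ArrowArray* arr) {
    arr->release = nullptr;  // nothing to free; ownership belongs to the caller
  };
  out->device_type = ARROW_DEVICE_CUDA;
}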

rapids-bot bot pushed a commit that referenced this issue Apr 1, 2024
Introduce new `to_arrow_device` and `to_arrow_schema` functions to utilize the `ArrowDeviceArray` structure for zero-copy passing of libcudf::table.

Add nanoarrow as a vendored lib and a script to update it.

Initial step towards addressing #14926

Authors:
  - Matt Topol (https://github.com/zeroshade)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - David Wendt (https://github.com/davidwendt)

URL: #15047
rapids-bot bot pushed a commit that referenced this issue Apr 23, 2024

Adding a corresponding `from_arrow_device` function following up from #15047. This continues the work towards addressing #14926.

Authors:
  - Matt Topol (https://github.com/zeroshade)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - David Wendt (https://github.com/davidwendt)

URL: #15458
@davidwendt
Contributor

From a next steps perspective I think we have the APIs we need for zero-copy (or near zero-copy) Arrow device memory to and from cuDF. The cudf::to_arrow_device() APIs support both cudf::column/table #15047 (ownership is transferred) and cudf::column_view/table_view #15465 (data is wrapped). The cudf::from_arrow_device() APIs support wrapping Arrow device memory with cudf::column_view/table_view (#15458) objects as well. We don't have a need to also transfer ownership from Arrow device memory into a cudf::column/table since all libcudf APIs accept only views.

This leaves the existing cudf::to_arrow() and cudf::from_arrow() APIs which handle host Arrow to device cudf and device cudf to host Arrow. Therefore, I think the only remaining task is to update these APIs to use the appropriate host Arrow structures/functions.

@zeroshade
Contributor Author

I'm currently working on the host Arrow to device cudf function and should have a PR up early next week.

@davidwendt
Contributor

Just wanted to add this link here #15645 (comment)

rapids-bot bot pushed a commit that referenced this issue May 29, 2024
Following up from #15458 and continuing the work to address #14926, this adds a host memory version of `from_arrow_device` which will perform the copies from host memory to create cudf objects.

Authors:
  - Matt Topol (https://github.com/zeroshade)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - David Wendt (https://github.com/davidwendt)

URL: #15645
@vyasr
Contributor

vyasr commented May 30, 2024

Now that #15645 is merged, the last piece is a to_arrow_host function (as well as potentially addressing my comment that David linked above, which would be a very nice simplification of the code if it works as expected). @zeroshade do you see anything else left on here?

@vyasr
Contributor

vyasr commented May 30, 2024

I'm going to take a stab at adding Python bindings for #15645. Once I get something working there we should get a lot more information about what debugging remains to be done since our Python test suite is far more extensive and almost all host data ingestion that isn't I/O eventually goes down this code path.

@vyasr
Contributor

vyasr commented Jun 4, 2024

I have the Python implementation of from_arrow working in #15904.

rapids-bot bot pushed a commit that referenced this issue Jul 2, 2024
This PR replaces the internals of `from_arrow` in pylibcudf with an implementation that uses the [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) using the [Python Capsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html). This allows us to decouple our Python builds from using pyarrow Cython (partially, we haven't replaced the `to_arrow` conversion yet) and it will also allow us to support any other Python package that is a producer of the data interface.

To support the above functionality, the following additional changes were needed in this PR:
- Added the ability to produce cudf tables from `ArrowArrayStream` objects since that is what `pyarrow.Table` produces. This function is a simple wrapper around the existing `from_arrow(ArrowArray)` API.
- Added support for the large strings type, for which support has improved throughout cudf since the `from_arrow_host` API was added and for which we now require a basic overload for tests to pass. I did not add corresponding support for `from_arrow_device` to avoid ballooning the scope of this PR, so that work can be done in a follow-up.
- Proper handling of `type_id::EMPTY` in concatenate because the most natural implementation of the ArrowArrayStream processing is to run `from_arrow` on each chunk and then concatenate the outputs, and from the Python side we can produce chunks of all null arrays from arrow.

Contributes to #14926

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Robert Maynard (https://github.com/robertmaynard)
  - David Wendt (https://github.com/davidwendt)

URL: #15904
lithomas1 added a commit to lithomas1/cudf that referenced this issue Jul 2, 2024
commit 1a4c2aa
Author: Thomas Li <[email protected]>
Date:   Tue Jul 2 07:38:18 2024 -0700

    Start migrating I/O writers to pylibcudf (starting with JSON) (rapidsai#15952)

    Switches the JSON writer to use pylibcudf.
    xref rapidsai#15162

    Authors:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: rapidsai#15952

commit a1447c7
Author: Robert Maynard <[email protected]>
Date:   Tue Jul 2 09:34:29 2024 -0400

    Promote has_nested_columns to cudf public API (rapidsai#16131)

    The `has_nested_columns` functionality is used in numerous tests. It looks like it should be part of our stable public API.

    Authors:
      - Robert Maynard (https://github.com/robertmaynard)

    Approvers:
      - Muhammad Haseeb (https://github.com/mhaseeb123)
      - Yunsong Wang (https://github.com/PointKernel)

    URL: rapidsai#16131

commit a4be7bd
Author: Vyas Ramasubramani <[email protected]>
Date:   Tue Jul 2 00:50:42 2024 -0700

    Use Arrow C Data Interface functions for Python interop (rapidsai#15904)

    This PR replaces the internals of `from_arrow` in pylibcudf with an implementation that uses the [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) using the [Python Capsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html). This allows us to decouple our Python builds from using pyarrow Cython (partially, we haven't replaced the `to_arrow` conversion yet) and it will also allow us to support any other Python package that is a producer of the data interface.

    To support the above functionality, the following additional changes were needed in this PR:
    - Added the ability to produce cudf tables from `ArrowArrayStream` objects since that is what `pyarrow.Table` produces. This function is a simple wrapper around the existing `from_arrrow(ArrowArray)` API.
    - Added support for the large strings type, for which support has improved throughout cudf since the `from_arrow_host` API was added and for which we now require a basic overload for tests to pass. I did not add corresponding support for `from_arrow_device` to avoid ballooning the scope of this PR, so that work can be done in a follow-up.
    - Proper handling of `type_id::EMPTY` in concatenate because the most natural implementation of the ArrowArrayStream processing is to run `from_arrow` on each chunk and then concatenate the outputs, and from the Python side we can produce chunks of all null arrays from arrow.

    Contributes to rapidsai#14926

    Authors:
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Matthew Roeschke (https://github.com/mroeschke)
      - Robert Maynard (https://github.com/robertmaynard)
      - David Wendt (https://github.com/davidwendt)

    URL: rapidsai#15904

commit 08552f8
Author: Lawrence Mitchell <[email protected]>
Date:   Tue Jul 2 03:12:50 2024 +0100

    Update cudf-polars for v1 release of polars (rapidsai#16149)

    Minor changes to the IR, which we adapt to, and request `polars>=1.0` in dependencies.

    Authors:
      - Lawrence Mitchell (https://github.com/wence-)
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: rapidsai#16149

commit 760c15c
Author: Kyle Edwards <[email protected]>
Date:   Mon Jul 1 14:27:30 2024 -0400

    Use verify-alpha-spec hook (rapidsai#16144)

    With the deployment of rapids-build-backend, we need to make sure our dependencies have alpha specs.

    Authors:
      - Kyle Edwards (https://github.com/KyleFromNVIDIA)

    Approvers:
      - Bradley Dice (https://github.com/bdice)

    URL: rapidsai#16144

commit b691b1c
Author: David Wendt <[email protected]>
Date:   Mon Jul 1 14:25:11 2024 -0400

    Add stream parameter to cudf::io::text::multibyte_split (rapidsai#16034)

    Adds stream support to the `cudf::io::text::multibyte_split` API.
    Also adds a stream test and deprecates an overloaded API.
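
    A hedged sketch of the stream-aware overload (argument order assumed from the description):

    ```cpp
    #include <cudf/io/text/data_chunk_source_factories.hpp>
    #include <cudf/io/text/multibyte_split.hpp>

    #include <rmm/cuda_stream.hpp>

    // Split host text on newlines, issuing all work on a caller-owned stream.
    std::string input = "hello\nworld\n";
    auto source = cudf::io::text::make_source(input);
    rmm::cuda_stream stream;
    auto lines = cudf::io::text::multibyte_split(
      *source, "\n", cudf::io::text::parse_options{}, stream.view());
    ```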

    Authors:
      - David Wendt (https://github.com/davidwendt)

    Approvers:
      - Mark Harris (https://github.com/harrism)
      - Karthikeyan (https://github.com/karthikeyann)

    URL: rapidsai#16034

commit 5efd72f
Author: Matthew Roeschke <[email protected]>
Date:   Mon Jul 1 07:37:12 2024 -1000

    Ensure cudf objects can astype to any type when empty (rapidsai#16106)

    pandas allows objects to `astype` to any other type if the object is empty. The PR mirrors that behavior for cudf.

    This PR also more consistently uses `astype` instead of `as_*_column` and fixes a bug in `IntervalDtype.__eq__` discovered when writing a unit test for this bug.

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: rapidsai#16106

commit 51fb873
Merge: 599ce95 e932fbd
Author: gpuCI <[email protected]>
Date:   Mon Jul 1 12:17:38 2024 -0400

    Merge pull request rapidsai#16145 from rapidsai/branch-24.06

    Forward-merge branch-24.06 into branch-24.08

commit e932fbd
Author: Vyas Ramasubramani <[email protected]>
Date:   Mon Jul 1 09:17:32 2024 -0700

    Add patch for incorrect cuco noexcept clauses (rapidsai#16077)

    [cuco previously marked a number of methods as noexcept that can in fact
    throw exceptions](NVIDIA/cuCollections#510).
    This causes problems for cudf functions that call these methods. The
    issue [was fixed in cuco
    upstream](NVIDIA/cuCollections#511), but we
    cannot easily update to the latest commit of cuco, especially in a patch
    fix for 24.06. This PR instead adds a rapids-cmake patch for the cuco
    clone to address this issue. The patch may be removed once we update to
    a commit of cuco that contains the necessary fix.

    Resolves rapidsai#16059

commit 599ce95
Author: Lawrence Mitchell <[email protected]>
Date:   Mon Jul 1 09:35:35 2024 +0100

    Implement handlers for series literal in cudf-polars (rapidsai#16113)

    A query plan can contain a "literal" polars Series. Often, for example, when calling a contains-like function. To translate these, introduce a new `LiteralColumn` node to capture the concept and add an evaluation rule (converting from arrow).

    Since list-dtype Series need the same casting treatment as in the dataframe scan case, factor the casting out into a utility, and take the opportunity to handle casting of nested lists correctly.

    Authors:
      - Lawrence Mitchell (https://github.com/wence-)

    Approvers:
      - Thomas Li (https://github.com/lithomas1)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: rapidsai#16113

commit 3c3edfe
Author: Yunsong Wang <[email protected]>
Date:   Fri Jun 28 13:58:22 2024 -0700

    Update implementations to build with the latest cuco (rapidsai#15938)

    This PR updates existing libcudf code to accommodate a cuco breaking change introduced in NVIDIA/cuCollections#479. It helps avoid breaking cudf when bumping the cuco version in `rapids-cmake`.

    Redundant equal/hash overloads will be removed once the version bump is done on the `rapids-cmake` end.

    Authors:
      - Yunsong Wang (https://github.com/PointKernel)

    Approvers:
      - David Wendt (https://github.com/davidwendt)
      - Nghia Truong (https://github.com/ttnghia)

    URL: rapidsai#15938

commit df88cf5
Author: Bradley Dice <[email protected]>
Date:   Fri Jun 28 15:40:52 2024 -0500

    Use size_t to allow large conditional joins (rapidsai#16127)

    The conditional join kernels were using `cudf::size_type` where `std::size_t` was needed. This PR fixes that bug, which caused `cudaErrorIllegalAddress` as shown in rapidsai#16115. This closes rapidsai#16115.
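
    To make the failure mode concrete (illustrative arithmetic only, not the actual kernel code):

    ```cpp
    #include <cudf/types.hpp>

    #include <cstddef>

    // Two modest inputs can still produce more candidate pairs than a 32-bit
    // cudf::size_type can count: 100'000 x 50'000 = 5'000'000'000 pairs.
    cudf::size_type left_rows  = 100'000;
    cudf::size_type right_rows = 50'000;
    std::size_t pairs = static_cast<std::size_t>(left_rows) * right_rows;  // OK
    // cudf::size_type pairs_bad = left_rows * right_rows;  // would overflow
    ```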

    I did not add tests because we typically do not test very large workloads. However, I committed the test and reverted it in this PR, so there is a record of my validation code.

    Authors:
      - Bradley Dice (https://github.com/bdice)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - https://github.com/nvdbaranec
      - Yunsong Wang (https://github.com/PointKernel)

    URL: rapidsai#16127

commit fb12d98
Author: Robert Maynard <[email protected]>
Date:   Fri Jun 28 12:14:58 2024 -0400

    Installed cudf header use cudf::allocate_like (rapidsai#16087)

    Remove usage of the non-public `cudf::allocate_like` from implementations in headers we install.

    Authors:
      - Robert Maynard (https://github.com/robertmaynard)

    Approvers:
      - Yunsong Wang (https://github.com/PointKernel)
      - Nghia Truong (https://github.com/ttnghia)

    URL: rapidsai#16087

commit 78f4a8a
Author: Robert Maynard <[email protected]>
Date:   Fri Jun 28 11:26:27 2024 -0400

    Move common string utilities to public api (rapidsai#16070)

    As part of rapidsai#15982, a subset of the strings utility functions has been identified as worth exposing in the cudf public API.

    The `create_string_vector_from_column`, `get_offset64_threshold`, and `is_large_strings_enabled` functions are now part of the public `cudf::strings` API.
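
    For example (header path assumed; the threshold semantics are inferred from the function names):

    ```cpp
    #include <cudf/strings/utilities.hpp>

    // Query large-strings (64-bit offsets) support at runtime.
    if (cudf::strings::is_large_strings_enabled()) {
      // Size threshold above which strings columns switch to 64-bit offsets.
      auto const threshold = cudf::strings::get_offset64_threshold();
    }
    ```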

    Authors:
      - Robert Maynard (https://github.com/robertmaynard)

    Approvers:
      - MithunR (https://github.com/mythrocks)
      - David Wendt (https://github.com/davidwendt)
      - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)
      - Lawrence Mitchell (https://github.com/wence-)

    URL: rapidsai#16070

commit a4b951a
Author: nvdbaranec <[email protected]>
Date:   Fri Jun 28 10:20:42 2024 -0500

    Templatization of fixed-width parquet decoding kernels. (rapidsai#15911)

    This PR merges all of the fixed-width parquet decoding kernels into a single templatized kernel that can be selectively instantiated with desired features (dictionary/no-dictionary, nested/non-nested, etc). It also adds support for (non-list) nested columns in this path, so structs no longer have to use the much slower general decode kernel.

    A new benchmark was added specific to structs containing only fixed width columns.  I added this because the performance improvement is fairly high (+20%) but we don't see it in the normal struct benchmarks because they include (and are dominated by) string decode times.  The new benchmark shows:

    Before this PR:
    ```
    | data_type |    io_type    | cardinality | run_length | bytes_per_second | peak_memory_usage | encoded_file_size |
    |-----------|---------------|-------------|------------|------------------|-------------------|-------------------|
    |    STRUCT | DEVICE_BUFFER |           0 |          1 |      21071216823 |         1.047 GiB |       511.675 MiB |
    |    STRUCT | DEVICE_BUFFER |        1000 |          1 |      18974392387 |       821.312 MiB |       128.884 MiB |
    |    STRUCT | DEVICE_BUFFER |           0 |         32 |      20429356824 |       621.787 MiB |        28.141 MiB |
    |    STRUCT | DEVICE_BUFFER |        1000 |         32 |      20572327813 |       598.421 MiB |        16.475 MiB |
    ```

    After this PR:

    ```
    | data_type |    io_type    | cardinality | run_length | bytes_per_second | peak_memory_usage | encoded_file_size |
    |-----------|---------------|-------------|------------|------------------|-------------------|-------------------|
    |    STRUCT | DEVICE_BUFFER |           0 |          1 |      25805996399 |         1.047 GiB |       511.675 MiB |
    |    STRUCT | DEVICE_BUFFER |        1000 |          1 |      22422306660 |       821.312 MiB |       128.884 MiB |
    |    STRUCT | DEVICE_BUFFER |           0 |         32 |      24460694014 |       621.787 MiB |        28.141 MiB |
    |    STRUCT | DEVICE_BUFFER |        1000 |         32 |      24674861214 |       598.421 MiB |        16.475 MiB |
    ```

    Split-page decoding for fixed-width types + structs also goes through this new path. A new test was added.

    This brings us closer to eliminating the "general" kernel.  The only things left that run through it are lists and booleans.

    This is PR 1 of 2, with the follow-up moving a lot of code around. At this point, I think it makes sense to start consolidating our files a bit.

    I also left some breadcrumbs (a few small commented-out code blocks) in the core kernel `gpuDecodePageDataGeneric` for the next step of adding list support. They can be removed if people don't like them.

    Authors:
      - https://github.com/nvdbaranec

    Approvers:
      - Mike Wilson (https://github.com/hyperbolic2346)
      - Vukasin Milovanovic (https://github.com/vuule)
      - Muhammad Haseeb (https://github.com/mhaseeb123)

    URL: rapidsai#15911

commit e434fdb
Author: David Wendt <[email protected]>
Date:   Fri Jun 28 10:57:01 2024 -0400

    Update libcudf compiler requirements in contributing doc (rapidsai#16103)

    Updates the compiler requirements in the contributing document.

    Authors:
      - David Wendt (https://github.com/davidwendt)

    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Karthikeyan (https://github.com/karthikeyann)

    URL: rapidsai#16103

commit 565c0d1
Author: Matthew Murray <[email protected]>
Date:   Fri Jun 28 10:16:55 2024 -0400

    Migrate lists/contains to pylibcudf (rapidsai#15981)

    Part of rapidsai#15162.

    Authors:
      - Matthew Murray (https://github.com/Matt711)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: rapidsai#15981

commit c40e0cc
Author: Matthew Murray <[email protected]>
Date:   Fri Jun 28 10:10:31 2024 -0400

    Add support for proxy `np.flatiter` objects (rapidsai#16107)

    Closes rapidsai#15388

    Authors:
      - Matthew Murray (https://github.com/Matt711)

    Approvers:
      - Matthew Roeschke (https://github.com/mroeschke)

    URL: rapidsai#16107

commit 673d766
Author: Paul Mattione <[email protected]>
Date:   Fri Jun 28 09:38:57 2024 -0400

    Make binary operators work between fixed-point and floating args (rapidsai#16116)

    Some of the binary operators in cuDF don't work between fixed_point and floating-point numbers after [this earlier PR](rapidsai#15438) removed the ability to construct and implicitly cast fixed_point numbers from floating point numbers. This PR restores that functionality by detecting and performing the necessary explicit casts, and adds tests for the supported operators.
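
    Conceptually, the added casts amount to something like the following (decimal scale chosen arbitrarily for illustration; this is not the literal internal code, and `decimal_col`/`float_col` are placeholder columns):

    ```cpp
    #include <cudf/binaryop.hpp>
    #include <cudf/unary.hpp>

    // Cast the floating-point operand to a common decimal type, then run the
    // usual fixed-point binary operation.
    auto const common = cudf::data_type{cudf::type_id::DECIMAL64, -2};
    auto rhs_decimal  = cudf::cast(float_col, common);
    auto sum          = cudf::binary_operation(
      decimal_col, rhs_decimal->view(), cudf::binary_operator::ADD, common);
    ```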

    Note that the `binary_op_has_common_type` code is modeled after `has_common_type` found in traits.hpp.

    This closes [issue 16090](rapidsai#16090)

    Authors:
      - Paul Mattione (https://github.com/pmattione-nvidia)

    Approvers:
      - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub)
      - Karthikeyan (https://github.com/karthikeyann)

    URL: rapidsai#16116

commit 224ac5b
Author: David Wendt <[email protected]>
Date:   Fri Jun 28 09:26:37 2024 -0400

    Add libcudf public/detail API pattern to developer guide (rapidsai#16086)

    Adds a specific description of the public-API-to-detail-API function pattern to the libcudf developer guide.
    Also fixes some formatting issues and a broken link.

    Authors:
      - David Wendt (https://github.com/davidwendt)

    Approvers:
      - Shruti Shivakumar (https://github.com/shrshi)
      - Karthikeyan (https://github.com/karthikeyann)

    URL: rapidsai#16086

commit 2b547dc
Author: Matthew Roeschke <[email protected]>
Date:   Fri Jun 28 03:11:01 2024 -1000

    Add ensure_index to not unnecessarily shallow copy cudf.Index (rapidsai#16117)

    The `cudf.Index` constructor will shallow-copy a `cudf.Index` input. Sometimes we just need to make sure an input is a `cudf.Index`, so this adds `ensure_index` (pandas has something similar) to avoid shallow-copying those inputs unnecessarily.

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    URL: rapidsai#16117

commit 57862a3
Author: Robert Maynard <[email protected]>
Date:   Fri Jun 28 08:43:12 2024 -0400

    stable_distinct public api now has a stream parameter (rapidsai#16068)

    As part of rapidsai#15982 we determined that the cudf `stable_distinct` public API needs to be updated so that a user-provided stream can be passed.
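
    A minimal sketch of the stream-aware call (argument order assumed; `tbl` is a `cudf::table_view` and `stream` an `rmm::cuda_stream_view`):

    ```cpp
    #include <cudf/stream_compaction.hpp>

    // Keep the first occurrence of each key in column 0, preserving input
    // order, with all work issued on the caller's stream.
    auto deduped = cudf::stable_distinct(
      tbl,
      {0},  // key column indices
      cudf::duplicate_keep_option::KEEP_FIRST,
      cudf::null_equality::EQUAL,
      cudf::nan_equality::ALL_EQUAL,
      stream);
    ```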

    Authors:
      - Robert Maynard (https://github.com/robertmaynard)

    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Srinivas Yadav (https://github.com/srinivasyadav18)
      - Bradley Dice (https://github.com/bdice)

    URL: rapidsai#16068

commit 6b04fd3
Author: Mads R. B. Kristensen <[email protected]>
Date:   Fri Jun 28 12:31:18 2024 +0200

    Memory Profiling (rapidsai#15866)

    Use [RMM's new memory profiler](rapidsai/rmm#1563) to profile all functions already decorated with `_cudf_nvtx_annotate`.

    Example
    ```python
    import cudf
    from cudf.utils.performance_tracking import print_memory_report

    cudf.set_option("memory_profiling", True)

    df1 = cudf.DataFrame({"a": [1, 2, 3]})
    df2 = cudf.DataFrame({"a": [2, 2, 3]})
    df3 = df1.merge(df2)

    print_memory_report()
    ```

    Output:
    ```
    Memory Profiling
    ================

    Ordered by: memory_peak

    ncalls     memory_peak    memory_total  filename:lineno(function)
         1             272             688  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:4072(DataFrame.merge)
         2              32              64  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:1043(DataFrame._init_from_dict_like)
         2              32              64  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:690(DataFrame.__init__)
         2               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:1131(DataFrame._align_input_series_indices)
         7               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:214(RangeIndex.__init__)
         6               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:424(RangeIndex.__len__)
         4               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/frame.py:271(Frame.__len__)
         2               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/dataframe.py:3195(DataFrame._insert)
         2               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:270(RangeIndex.name)
         2               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/index.py:369(RangeIndex.copy)
         5               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/frame.py:134(Frame._from_data)
         2               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/frame.py:1039(Frame._copy_type_metadata)
         2               0               0  /home/mkristensen/apps/miniforge3/envs/rmm-cudf-0527/lib/python3.11/site-packages/cudf/core/indexed_frame.py:315(IndexedFrame._from_columns_like_self)
    ```

    Authors:
      - Mads R. B. Kristensen (https://github.com/madsbk)

    Approvers:
      - Mark Harris (https://github.com/harrism)
      - Lawrence Mitchell (https://github.com/wence-)
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: rapidsai#15866

commit e35da6b
Author: Lawrence Mitchell <[email protected]>
Date:   Fri Jun 28 09:54:03 2024 +0100

    Implement Ternary copy_if_else (rapidsai#16114)

    A straightforward evaluation using `copy_if_else`.
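
    Roughly (assuming the operands have already been broadcast to a common length; `when_true`, `when_false`, and `mask` are placeholder columns):

    ```cpp
    #include <cudf/copying.hpp>

    // Element-wise select: take from `when_true` where `mask` is true,
    // otherwise from `when_false`.
    auto result = cudf::copy_if_else(when_true, when_false, mask);
    ```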

    Authors:
      - Lawrence Mitchell (https://github.com/wence-)

    Approvers:
      - https://github.com/brandon-b-miller

    URL: rapidsai#16114

commit c847b98
Author: Lawrence Mitchell <[email protected]>
Date:   Thu Jun 27 21:33:29 2024 +0100

    Finish implementation of cudf-polars boolean function handlers (rapidsai#16098)

    The missing nodes were `is_in`, `not` (both easy), `is_finite` and `is_infinite` (obtained by translating to `contains` calls).

    While here, remove the implementation of `IsBetween` and just translate to an expression with binary operations. This removes the need for special-casing scalar arguments to `IsBetween` and reproducing the code for binop evaluation.

    Authors:
      - Lawrence Mitchell (https://github.com/wence-)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: rapidsai#16098

commit 2ed69c9
Author: Matthew Roeschke <[email protected]>
Date:   Thu Jun 27 10:11:09 2024 -1000

    Ensure MultiIndex.to_frame deep copies columns (rapidsai#16110)

    Additionally, this allows a simplification in `MultiIndex.__repr__` that avoids a shallow copy, and it also caught a bug where `NaT` was being quoted when it should not be.

    Authors:
      - Matthew Roeschke (https://github.com/mroeschke)

    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)

    URL: rapidsai#16110

commit a71c249
Author: GALI PREM SAGAR <[email protected]>
Date:   Thu Jun 27 14:29:31 2024 -0500

    Fix dtype errors in `StringArrays` (rapidsai#16111)

    This PR adds proxy classes for `ArrowStringArray` and `ArrowStringArrayNumpySemantics` that will increase the pandas test pass rate by 1%.

    Authors:
      - GALI PREM SAGAR (https://github.com/galipremsagar)

    Approvers:
      - Matthew Roeschke (https://github.com/mroeschke)

    URL: rapidsai#16111
@vyasr
Contributor

vyasr commented Jul 9, 2024

@zeroshade good news, #15904 has been in for a week and no issues have been reported, so it looks like the host arrow conversion API covered all our bases. The Python cudf tests are far more extensive than libcudf's, so it's great to see that even with the increased coverage we didn't really uncover anything. Great job with the implementation!

Are you interested in tackling the last piece of the puzzle here, the to_arrow host implementation? Once that's completed I think we can close this issue. I'd like to address #16104, but since that is mostly orthogonal I'm OK waiting to do that until after we finalize the to_arrow implementation.

@zeroshade
Contributor Author

Great to hear @vyasr! I'll start working on the to_arrow host implementation later this week.

@vyasr
Contributor

vyasr commented Jul 11, 2024

Awesome, thanks Matt!

@jakirkham
Member

Are you interested in tackling the last piece of the puzzle here, the to_arrow host implementation? Once that's completed I think we can close this issue.

It looks like the host implementation of to_arrow was integrated.

Should we close or is there something still left here?

@vyasr
Contributor

vyasr commented Aug 13, 2024

Resolved by the merge of #16297 (along with prior PRs).

@vyasr vyasr closed this as completed Aug 13, 2024
This issue was closed.