Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cast kernel support for StringViewArray and BinaryViewArray <--> DictionaryArray` #5861

Closed
Tracked by #5374
alamb opened this issue Jun 10, 2024 · 5 comments · Fixed by #5872
Closed
Tracked by #5374

cast kernel support for StringViewArray and BinaryViewArray <--> DictionaryArray` #5861

alamb opened this issue Jun 10, 2024 · 5 comments · Fixed by #5872
Assignees
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jun 10, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement StringViewArray -- see #5374

In #5508, @RinChanNOWWW tracked adding casting to/from StringArray 🙏 ❤️

This ticket tracks adding additional data type support for StringViewArray and ByteViewArray in the cast kernel: https://docs.rs/arrow/latest/arrow/compute/kernels/cast/index.html

Many systems (e.g InfluxDB 3.0, Apache DataFusion Comet, and I think Coralogix) use DictionaryArrays. Thus supporting casting to/from DictionaryArray will be important to permit easy integration into downstream consumers

Describe the solution you'd like

Specifically the following conversions should be supported in the cast kernels:

  • StringViewArray <--> DictionaryArray<IndexType, Utf8>
  • StringViewArray <--> DictionaryArray<IndexType, LargeUtf8>

And similarly for Binary:

  • BinaryViewArray <--> DictionaryArray<IndexType, Binary>
  • BinaryViewArray <--> DictionaryArray<IndexType, LargeBinary>

Notes:

  1. Good test coverage is the most important part of this ticket
  2. I recommend smaller PRs if possible
  3. I think DictionaryArray<IndexType, LargeUtf8> --> StringViewArray can be implemented without copying strings
  4. I think StringViewArray --> DictionaryArray<IndexType, LargeUtf8> will likely require copying the strings

Describe alternatives you've considered
I think casting from Dictionary

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jun 10, 2024
@XiangpengHao
Copy link
Contributor

I'm working on this, can you assign me @alamb ?

@alamb
Copy link
Contributor Author

alamb commented Jun 11, 2024

I'm working on this, can you assign me @alamb ?

Done!

@XiangpengHao
Copy link
Contributor

XiangpengHao commented Jun 11, 2024

StringViewArray --> DictionaryArray<IndexType, LargeUtf8> will copy strings twice.
The current implementation is handled in https://github.com/XiangpengHao/arrow-rs/blob/view-to-dict/arrow-cast/src/cast/dictionary.rs#L259-L283
Which first cast the StringViewArray to StringArray then insert to dictionary one by one. A better approach is probably to directly insert items in StringViewArray to dictionary array.

@alamb
Copy link
Contributor Author

alamb commented Jun 11, 2024

A better approach is probably to directly insert items in StringViewArray to dictionary array.

Yes I agree this is likely -- I suggest using https://docs.rs/arrow/latest/arrow/array/builder/struct.GenericByteDictionaryBuilder.html directly

That will also deduplicate the values and result in the smallest dictionary possible. The downside is that inserting each value will require hashing on a string

I could imagine a faster implementation that keeps a map of u128 (aka the view) to the in-process dictionary values that would be faster (hashing u128 rather than &[u8]). The downside is that this wouldn't catch values where the views pointed to the different bytes that happend to have the same value (e.g. "foofoo" if there were two views that pointed to "foo" and the second "foo")

@alamb
Copy link
Contributor Author

alamb commented Jul 2, 2024

label_issue.py automatically added labels {'arrow'} from #5872

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants