Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve RoundTrip types in RowConverter even if preserve_dictionaries=false #4813

Closed
alamb opened this issue Sep 13, 2023 · 6 comments
Closed
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Sep 13, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We struggle with the memory used by the RowConverter when interning values from DictionaryArrays. We are even proposing a special CardinalityAware wrapper on top of the RowConverter in DataFusion (see apache/datafusion#7401)

At the moment, round tripping data from Array to Rows and then back to Array works like this:

DictionaryArray -- (preserve_dictionaries = false) --> Rows --> Primtive/StringArray

In DataFusion we must maintain the same input / output types, so in our proposed improvement we needed to add a call to cast, which @tustvold notes is likely very expensive: https://github.com/apache/arrow-datafusion/pull/7401/files#r1324281222

Describe the solution you'd like
I would like the RowConverter to produce the same output type as the input type on SortField, even if preserve_dictionaries is set to false

This would avoid a copy of the String data and likely perform much better.

Describe alternatives you've considered
We could potentially simply remove stateful row encoding: #4811

Additional context

@alamb alamb added enhancement Any new improvement worthy of a entry in the changelog arrow Changes to the arrow crate labels Sep 13, 2023
@tustvold
Copy link
Contributor

The reason this is currently the case is to avoid taking a dependency on arrow-cast. It isn't a particularly good reason, but it is the reason

@alamb
Copy link
Contributor Author

alamb commented Sep 13, 2023

I was thinking in order to make this perform well, we would probably want to avoid using cast and instead directly creating the output type. I realize this might be complex codewise

@tustvold
Copy link
Contributor

So I'm not really sure how to achieve this in a performant manner, taking a step-back though DataFusion uses this for hydrating group keys. Perhaps we could simply alter group by to always materialize dictionaries in the group keys 🤔 ? Certainly in the case of a single column grouping, the dictionary encoding is pure overhead, as each value will only appear once.

@alamb
Copy link
Contributor Author

alamb commented Sep 15, 2023

Certainly in the case of a single column grouping, the dictionary encoding is pure overhead, as each value will only appear once.

I am not sure about this -- the group keys for each incoming batch are converted to Row format first, to compare them with existing group keys.

However, we could potentially add a special case GroupValues for single column Dictionary grouping that knew how to avoid that step. That would likely be close to optimal in terms of performance as we would skip doing any hydration at all

@tustvold
Copy link
Contributor

I was referring to the reverse transformation, which is what this ticket concerns, i.e. going from unique rows back to arrays

@tustvold
Copy link
Contributor

tustvold commented Jan 1, 2024

Following apache/datafusion#8291 the motivating use-case for this functionality I think has been removed, and so I am closing this for now

@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Jan 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants