ENH: Add groupby(...).agg_index #56992

rhshadrach · 2024-01-21T10:42:35Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

With groupby(...).grouper deprecated, we can expose .agg_index on the groupby object so users can still determine the index that will be on aggregations (when as_index is True).

If we do go forward with this, I'd like it in 2.2.1 because of the aforementioned deprecation.

cc @jorisvandenbossche @jbrockmendel @mroeschke

mroeschke · 2024-01-21T23:38:01Z

An alternative is to perform a reduction and take the index from the result

It was a bit opaque to me in the seaborn example why the index was needed outside a groupby aggregation function being called. A MultiIndex composed of groupby.groups is not sufficient?

jbrockmendel · 2024-01-22T22:39:58Z

No objection, though itd be nice if matt's idea is viable and we can obviate the need for this.

rhshadrach · 2024-01-22T23:29:04Z

Converting the groups to a MultiIndex is not performant, and at least at current state does not arrive at the correct dtypes for e.g. EAs.

size = 1_000_000
df = pd.DataFrame(
    {
        "a": np.random.randint(0, 500, size),
        "b": np.random.randint(0, 500, size),
        "c": np.random.random(size),
    }
)
%timeit pd.Index(df.groupby(["a", "b"]).groups)
# 2.19 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby(["a", "b"]).grouper.result_index
# 53.4 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}, dtype="Int64")
gb = df.groupby("a")
result = pd.Index(gb.groups)
expected = gb.sum().index
tm.assert_index_equal(result, expected)
# Attribute "dtype" are different
# [left]:  int64
# [right]: Int64

mroeschke · 2024-01-23T01:07:17Z

In the Seaborn context appears type preservation isn't important as values are cast to float in the end: https://github.com/mwaskom/seaborn/blob/a3cb0f1b33551eb25fed1907d0abdd0237fac215/seaborn/categorical.py#L640

As of now, I would lean -0 on adding .agg_index in favor of just doing groupby(as_index=True|False).size().index as you suggested in #56521 (comment).

rhshadrach · 2024-01-23T22:00:07Z

As of now, I would lean -0 on adding .agg_index in favor of just doing groupby(as_index=True|False).size().index as you suggested in #56521 (comment).

No opposition here, will close if no one is in favor of this.

jorisvandenbossche · 2024-01-24T10:40:44Z

Seaborn is of course only just one example that triggered my question (and they indeed convert to floats, because for plotting that's all you need in the end).
They updated their code now to call unique() on the original df's column for the group key (mwaskom/seaborn#3620).

It's always difficult to estimate how much used this would be in practice. But personally, it feels somewhat useful to me that you can get the unique group keys for a groupby object. We also already expose the indices of the groups (and my feeling is that the groups keys itself might be more often useful than the indices).

If we add it, I would name maybe name it differently, though. agg_index is very much tied to "it's the index of what you would get when doing an aggregation", which is certainly correct, but for me it's also more general: it are the group keys (also what you would get when iterating over the groups instead of calling agg). So I would suggest something like group_keys.

rhshadrach · 2024-01-24T22:44:52Z

But personally, it feels somewhat useful to me that you can get the unique group keys for a groupby object.

I don't think there is any disagreement there. The question is whether .groupby(...).size().index is sufficient to where we don't need to expand the API.

So I would suggest something like group_keys.

We already have groups, it seems to me it's not straight forward for a user to differentiate these two:

x = gb.groups
y = gb.group_keys

That's why I think index should be part of the name.

ENH: Add groupby(...).agg_index

1794772

rhshadrach added Enhancement Groupby Needs Discussion Requires discussion from core team before further action labels Jan 21, 2024

rhshadrach added this to the 2.2.1 milestone Jan 21, 2024

lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024

rhshadrach closed this Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add groupby(...).agg_index #56992

ENH: Add groupby(...).agg_index #56992

rhshadrach commented Jan 21, 2024

mroeschke commented Jan 21, 2024 •

edited

Loading

jbrockmendel commented Jan 22, 2024

rhshadrach commented Jan 22, 2024 •

edited

Loading

mroeschke commented Jan 23, 2024

rhshadrach commented Jan 23, 2024

jorisvandenbossche commented Jan 24, 2024

rhshadrach commented Jan 24, 2024

ENH: Add groupby(...).agg_index #56992

ENH: Add groupby(...).agg_index #56992

Conversation

rhshadrach commented Jan 21, 2024

mroeschke commented Jan 21, 2024 • edited Loading

jbrockmendel commented Jan 22, 2024

rhshadrach commented Jan 22, 2024 • edited Loading

mroeschke commented Jan 23, 2024

rhshadrach commented Jan 23, 2024

jorisvandenbossche commented Jan 24, 2024

rhshadrach commented Jan 24, 2024

mroeschke commented Jan 21, 2024 •

edited

Loading

rhshadrach commented Jan 22, 2024 •

edited

Loading