
[WIP] Feature/online ordinal encoder #1271

Merged: 2 commits into online-ml:main on Jul 4, 2023

Conversation

@ColdTeapot273K (Contributor) commented on Jun 22, 2023:

The idea:

Ordinal encoder for categoricals that robustly accommodates:

  • unfamiliar category values
  • empty (missing) category values

Achieved by:

  • using only positive integers for encoding (no custom mappings like in the sklearn implementation)
  • reserving the first two values for unknown and empty category values
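
A minimal sketch of that scheme for a single feature (hypothetical code, not this PR's actual implementation; it matches the learn_many asserts further down):

UNKNOWN_VALUE = 1  # reserved code for not-yet-seen categories

def new_mapping() -> dict:
    # None is registered up front, so missing values always encode to 0.
    return {None: 0}

def learn_value(mapping: dict, value) -> None:
    if value not in mapping:
        # Codes 0 (missing) and 1 (unknown) are reserved, so the first
        # learned category gets code 2, the next 3, and so on.
        mapping[value] = len(mapping) + 1

def transform_value(mapping: dict, value) -> int:
    return mapping.get(value, UNKNOWN_VALUE)

mapping = new_mapping()
for v in ["France", "Sweden", "Russia"]:
    learn_value(mapping, v)
assert mapping == {None: 0, "France": 2, "Sweden": 3, "Russia": 4}
assert transform_value(mapping, "Spain") == 1  # unknown
assert transform_value(mapping, None) == 0     # missing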

Note:

  • positive-only indexing is useful when passing the output to tensor-based modeling frameworks, which require indexing into tensors and don't have flexible indexing structures like VectorDict (I'm mixing river with a JAX-based PPL for online Bayesian modeling, so a proper OrdinalEncoder is very valuable); see the sketch after this list
  • naturally, I have included a mini-batch implementation as well
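
For instance, positive-only codes can index an embedding table directly, with rows 0 and 1 serving as dedicated "missing" and "unknown" embeddings (a hypothetical sketch; the table and sizes are made up):

import jax.numpy as jnp

# Hypothetical embedding table: row 0 = "missing", row 1 = "unknown",
# rows 2+ = learned categories.
n_categories, dim = 3, 4
table = jnp.ones((2 + n_categories, dim))

codes = jnp.array([0, 1, 2, 4])  # missing, unknown, and two learned codes
vectors = table[codes]           # plain positive indexing, no remapping needed
print(vectors.shape)             # (4, 4)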

Example tests:

import pandas as pd

from river.preprocessing import OrdinalEncoder  # as added by this PR
X = [
    {"country": "France", "place": "Taco Bell"},
    {"country": None, "place": None},
    {"country": "Sweden", "place": "Burger King"},
    {"country": "France", "place": "Burger King"},
    {"country": "Russia", "place": "Starbucks"},
    {"country": "Russia", "place": "Burger King"},
    {"country": "Russia", "place": "Taco Bell"},
    {"country": "Sweden", "place": "Taco Bell"},
    {"country": "Sweden", "place": "Burger King"},
    {"country": "France", "place": "Taco Bell"},
    {"country": "France", "place": "Burger King"},
    {"country": "Sweden", "place": "Starbucks"},
    {"country": "Sweden", "place": "Starbucks"},
    {"country": None, "place": None},
]

transformer = OrdinalEncoder()
for x in X:
    print(transformer.transform_one(x))
    transformer = transformer.learn_one(x)
# {'country': 1, 'place': 1}
# {'country': 1, 'place': 1}
# {'country': 1, 'place': 1}
# {'country': 2, 'place': 4}
# {'country': 1, 'place': 1}
# {'country': 5, 'place': 4}
# {'country': 5, 'place': 2}
# {'country': 4, 'place': 2}
# {'country': 4, 'place': 4}
# {'country': 2, 'place': 2}
# {'country': 2, 'place': 4}
# {'country': 4, 'place': 5}
# {'country': 4, 'place': 5}
# {'country': 0, 'place': 0}


transformer = OrdinalEncoder()
transformer.learn_many(pd.DataFrame(X))

assert transformer.categories_['country'] == {None: 0, 'France': 2, 'Sweden': 3, 'Russia': 4}
assert transformer.categories_['place'] == {None: 0, 'Taco Bell': 2, 'Burger King': 3, 'Starbucks': 4}

transformer.transform_many(pd.DataFrame(X))
#     country  place
# 0         2      2
# 1         0      0
# 2         3      3
# 3         2      3
# 4         4      4
# 5         4      3
# 6         4      2
# 7         3      2
# 8         3      3
# 9         2      2
# 10        2      3
# 11        3      4
# 12        3      4
# 13        0      0

@MaxHalford (Member) left a comment:

I'm generally in favor of this pull request! I have a few questions, though.



class OrdinalEncoder(base.MiniBatchTransformer):
    # INFO: makes sense to add None initially? so it'll be easier to spot
@MaxHalford (Member):
Could you add a docstring? :)


class OrdinalEncoder(base.MiniBatchTransformer):
    # INFO: makes sense to add None initially? so it'll be easier to spot
    def __init__(self, handle_unknown="use_reserved_value", unknown_value=1, encoded_missing_value=0):
@MaxHalford (Member):
How many possible values are there for handle_unknown? If it's just a boolean, I'd rather have a use_reserved_value_for_unknowns: bool parameter.

@MaxHalford (Member):
Also, wouldn't -1 make more sense for the unknown_value? Isn't it surprising for the first value encountered, as well as unknown values, to both be assigned 1?

@MaxHalford (Member):
Which is also making me think: because this method is unsupervised, why don't you do the update logic in transform_one? That way there won't be any unknown values, because you update before transforming. I believe we do that in some other places in River.

@ColdTeapot273K (Author) commented on Jun 24, 2023:

> Also, wouldn't -1 make more sense for the unknown_value?

Please see the notes.
I've rejected this idea because -1 is not a dedicated index in a Python array/tensor: it wraps around to the last element.
This might seem unconventional, since we're used to the -1 sentinel value, but positive-only indexing makes the module more reusable, as I've mentioned.
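
For example (illustrative, using NumPy), -1 silently reads a real row rather than acting as a dedicated "unknown" slot:

import numpy as np

table = np.arange(12).reshape(4, 3)  # rows 0-3 are all real embeddings
print(table[-1])  # [ 9 10 11] -- the last real row, not an "unknown" slot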

> Isn't it surprising for the first value encountered, as well as unknown values, to both be assigned 1?

The idea was to keep two options available:

  1. the same code for unknown and missing values
  2. different codes for unknown and missing values

But both codes should be provisioned from the start, since both should be expected.
I have no principled idea whether it should be

  • unknown = 1
  • missing = 0

or vice versa.

> Which is also making me think: because this method is unsupervised, why don't you do the update logic in transform_one? That way there won't be any unknown values, because you update before transforming. I believe we do that in some other places in River.

Please refer to #1269. Specifically, the current strictness was relevant in my use case.

I can, however, add the update in transform_* as a mode for 'stateful' inference, to make it consistent with other transformers (I shall check).
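
Roughly (a hypothetical sketch, not part of this PR), such a mode could update before transforming:

# Hypothetical update-before-transform ("stateful") mode, assuming
# OrdinalEncoder is the class from this PR, with learn_one returning self.
class StatefulOrdinalEncoder(OrdinalEncoder):
    def transform_one(self, x):
        self.learn_one(x)  # update the mappings first...
        return super().transform_one(x)  # ...so no value is ever "unknown"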

@MaxHalford (Member):

OK, it's fine; I'm convinced. I'm not sure the +2 works if you specify different values for the missing and unknown parameters, though.

Also, just another question: in your examples you explicitly set missing values to None. But couldn't you instead preserve the sparse nature of the dict and implicitly assume that missing keys are None? Not sure I'm being clear.
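
In other words (an illustration of the suggestion, not current behavior), an absent key would be treated the same as an explicit None:

x_sparse = {"country": "France"}                   # "place" key omitted
x_explicit = {"country": "France", "place": None}  # style used in the examples
# Under the suggestion, x_sparse's missing "place" would also encode to 0.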

new_unique_mask = np.isin(current_uniqs, known_uniqs, assume_unique=True, invert=True)

new_uniqs = current_uniqs[new_unique_mask]
print(f"{known_uniqs=}")
@MaxHalford (Member):
You can't leave print calls in.

@ColdTeapot273K (Author):
Of course!
Those were intended for debugging. I was going to remove them once we'd settled on the implementation, hence the "WIP" tag (sorry, I should have mentioned that).


def learn_one(self, x):
    for i, xi in x.items():
        if self.handle_unknown == "use_reserved_value":
@MaxHalford (Member):
So what happens in the other case? Nothing?

@ColdTeapot273K (Author) commented on Jun 24, 2023:

Open to suggestions!

At the moment I just implemented what was relevant in my use case.

I'm mimicking the sklearn API here a bit, though raising an error for unknowns like sklearn does would be weird in an online setting. Unless there's an option to supply a predefined mapping of values -> encodings; then unknown values would start to make sense in an online setting.
But I'm not sure we want to support that in OrdinalEncoder - the user can just use FunctionTransformer if they want a fixed mapping.

The only other option I see is to have the unknowns and the missing values map to the same code, as mentioned in the comments above. I can implement that.

@MaxHalford (Member):
If it's OK, I'll merge this PR and tweak it a tiny bit. I think it's valuable to have.

The logic is clearer to me now that we've made transform_one stateless.

@MaxHalford merged commit 63823f8 into online-ml:main on Jul 4, 2023. 8 of 11 checks passed.