
[WIP] Feature/online ordinal encoder #1271

Merged: 2 commits into online-ml:main on Jul 4, 2023

Conversation

@ColdTeapot273K (Contributor) commented on Jun 22, 2023:

The idea:

Ordinal encoder for categoricals that robustly accommodates:

  • unfamiliar category values
  • empty (missing) category values

Achieved by:

  • using only positive integers for encoding (no custom mappings like in the sklearn implementation)
  • reserving the first two values for unknown and empty category values
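
A minimal sketch of that scheme for a single feature (hypothetical code, not this PR's actual implementation; it matches the learn_many asserts further down):

UNKNOWN_VALUE = 1  # reserved code for not-yet-seen categories

def new_mapping() -> dict:
    # None is registered up front, so missing values always encode to 0.
    return {None: 0}

def learn_value(mapping: dict, value) -> None:
    if value not in mapping:
        # Codes 0 (missing) and 1 (unknown) are reserved, so the first
        # learned category gets code 2, the next 3, and so on.
        mapping[value] = len(mapping) + 1

def transform_value(mapping: dict, value) -> int:
    return mapping.get(value, UNKNOWN_VALUE)

mapping = new_mapping()
for v in ["France", "Sweden", "Russia"]:
    learn_value(mapping, v)
assert mapping == {None: 0, "France": 2, "Sweden": 3, "Russia": 4}
assert transform_value(mapping, "Spain") == 1  # unknown
assert transform_value(mapping, None) == 0     # missing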

Note:

  • positive-only indexing is useful when passing the output to tensor-based modeling frameworks, which require indexing into tensors and don't have flexible indexing structures like VectorDict (I'm mixing river with a JAX-based PPL for online Bayesian modeling, so a proper OrdinalEncoder is very valuable); see the sketch after this list
  • naturally, I have included a mini-batch implementation as well
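
For instance, positive-only codes can index an embedding table directly, with rows 0 and 1 serving as dedicated "missing" and "unknown" embeddings (a hypothetical sketch; the table and sizes are made up):

import jax.numpy as jnp

# Hypothetical embedding table: row 0 = "missing", row 1 = "unknown",
# rows 2+ = learned categories.
n_categories, dim = 3, 4
table = jnp.ones((2 + n_categories, dim))

codes = jnp.array([0, 1, 2, 4])  # missing, unknown, and two learned codes
vectors = table[codes]           # plain positive indexing, no remapping needed
print(vectors.shape)             # (4, 4)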

Example tests:

import pandas as pd

from river.preprocessing import OrdinalEncoder  # as added by this PR
X = [
    {"country": "France", "place": "Taco Bell"},
    {"country": None, "place": None},
    {"country": "Sweden", "place": "Burger King"},
    {"country": "France", "place": "Burger King"},
    {"country": "Russia", "place": "Starbucks"},
    {"country": "Russia", "place": "Burger King"},
    {"country": "Russia", "place": "Taco Bell"},
    {"country": "Sweden", "place": "Taco Bell"},
    {"country": "Sweden", "place": "Burger King"},
    {"country": "France", "place": "Taco Bell"},
    {"country": "France", "place": "Burger King"},
    {"country": "Sweden", "place": "Starbucks"},
    {"country": "Sweden", "place": "Starbucks"},
    {"country": None, "place": None},
]

transformer = OrdinalEncoder()
for x in X:
    print(transformer.transform_one(x))
    transformer = transformer.learn_one(x)
# {'country': 1, 'place': 1}
# {'country': 1, 'place': 1}
# {'country': 1, 'place': 1}
# {'country': 2, 'place': 4}
# {'country': 1, 'place': 1}
# {'country': 5, 'place': 4}
# {'country': 5, 'place': 2}
# {'country': 4, 'place': 2}
# {'country': 4, 'place': 4}
# {'country': 2, 'place': 2}
# {'country': 2, 'place': 4}
# {'country': 4, 'place': 5}
# {'country': 4, 'place': 5}
# {'country': 0, 'place': 0}


transformer = OrdinalEncoder()
transformer.learn_many(pd.DataFrame(X))

assert transformer.categories_['country'] == {None: 0, 'France': 2, 'Sweden': 3, 'Russia': 4}
assert transformer.categories_['place'] == {None: 0, 'Taco Bell': 2, 'Burger King': 3, 'Starbucks': 4}

transformer.transform_many(pd.DataFrame(X))
#     country  place
# 0         2      2
# 1         0      0
# 2         3      3
# 3         2      3
# 4         4      4
# 5         4      3
# 6         4      2
# 7         3      2
# 8         3      3
# 9         2      2
# 10        2      3
# 11        3      4
# 12        3      4
# 13        0      0

@MaxHalford (Member) left a comment:

I'm generally in favor of this pull request! I have a few questions, though.



class OrdinalEncoder(base.MiniBatchTransformer):
    # INFO: makes sense to add None initially? so it'll be easier to spot
@MaxHalford (Member):
Could you add a docstring? :)


class OrdinalEncoder(base.MiniBatchTransformer):
    # INFO: makes sense to add None initially? so it'll be easier to spot
    def __init__(self, handle_unknown="use_reserved_value", unknown_value=1, encoded_missing_value=0):
@MaxHalford (Member):
How many possible values are there for handle_unknown? If it's just a boolean, I'd rather have a use_reserved_value_for_unknowns: bool parameter.

@MaxHalford (Member):
Also, wouldn't -1 make more sense for the unknown_value? Isn't it surprising for the first value encountered, as well as unknown values, to both be assigned 1?

@MaxHalford (Member):
Which is also making me think: because this method is unsupervised, why don't you do the update logic in transform_one? That way there won't be any unknown values, because you update before transforming. I believe we do that in some other places in River.

@ColdTeapot273K (Author) commented on Jun 24, 2023:

> Also, wouldn't -1 make more sense for the unknown_value?

Please see the notes.
I've rejected this idea because -1 is not a dedicated index in a Python array/tensor: it wraps around to the last element.
This might seem unconventional, since we're used to the -1 sentinel value, but positive-only indexing makes the module more reusable, as I've mentioned.
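
For example (illustrative, using NumPy), -1 silently reads a real row rather than acting as a dedicated "unknown" slot:

import numpy as np

table = np.arange(12).reshape(4, 3)  # rows 0-3 are all real embeddings
print(table[-1])  # [ 9 10 11] -- the last real row, not an "unknown" slot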

> Isn't it surprising for the first value encountered, as well as unknown values, to both be assigned 1?

The idea was to keep two options available:

  1. the same code for unknown and missing values
  2. different codes for unknown and missing values

But both codes should be provisioned from the start, since both should be expected.
I have no principled idea whether it should be

  • unknown = 1
  • missing = 0

or vice versa.

> Which is also making me think: because this method is unsupervised, why don't you do the update logic in transform_one? That way there won't be any unknown values, because you update before transforming. I believe we do that in some other places in River.

Please refer to #1269. Specifically, the current strictness was relevant in my use case.

I can, however, add the update in transform_* as a mode for 'stateful' inference, to make it consistent with other transformers (I shall check).
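
Roughly (a hypothetical sketch, not part of this PR), such a mode could update before transforming:

# Hypothetical update-before-transform ("stateful") mode, assuming
# OrdinalEncoder is the class from this PR, with learn_one returning self.
class StatefulOrdinalEncoder(OrdinalEncoder):
    def transform_one(self, x):
        self.learn_one(x)  # update the mappings first...
        return super().transform_one(x)  # ...so no value is ever "unknown"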

@MaxHalford (Member):

OK, it's fine; I'm convinced. I'm not sure the +2 works if you specify different values for the missing and unknown parameters, though.

Also, just another question: in your examples you explicitly set missing values to None. But couldn't you instead preserve the sparse nature of the dict and implicitly assume that missing keys are None? Not sure I'm being clear.
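
In other words (an illustration of the suggestion, not current behavior), an absent key would be treated the same as an explicit None:

x_sparse = {"country": "France"}                   # "place" key omitted
x_explicit = {"country": "France", "place": None}  # style used in the examples
# Under the suggestion, x_sparse's missing "place" would also encode to 0.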

new_unique_mask = np.isin(current_uniqs, known_uniqs, assume_unique=True, invert=True)

new_uniqs = current_uniqs[new_unique_mask]
print(f"{known_uniqs=}")
@MaxHalford (Member):
You can't leave print calls in.

@ColdTeapot273K (Author):
Of course!
Those were intended for debugging. I was going to remove them once we'd settled on the implementation, hence the "WIP" tag (sorry, I should have mentioned that).


def learn_one(self, x):
    for i, xi in x.items():
        if self.handle_unknown == "use_reserved_value":
@MaxHalford (Member):
So what happens in the other case? Nothing?

@ColdTeapot273K (Author) commented on Jun 24, 2023:

Open to suggestions!

At the moment I just implemented what was relevant in my use case.

I'm mimicking the sklearn API here a bit, though raising an error for unknowns like sklearn does would be weird in an online setting. Unless there's an option to supply a predefined mapping of values -> encodings; then unknown values would start to make sense in an online setting.
But I'm not sure we want to support that in OrdinalEncoder - the user can just use FunctionTransformer if they want a fixed mapping.

The only other option I see is to have the unknowns and the missing values map to the same code, as mentioned in the comments above. I can implement that.

@MaxHalford (Member):
If it's OK, I'll merge this PR and tweak it a tiny bit. I think it's valuable to have.

The logic is clearer to me now that we've made transform_one stateless.

@MaxHalford merged commit 63823f8 into online-ml:main on Jul 4, 2023. 8 of 11 checks passed.