Hacky Idea: ifelse in a pipeline #908
With the upcoming "Recipe" (or "PipeBuilder" or whatever its name will be), it could look like this:

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.base import BaseEstimator
>>> from skrub._pipe_builder import PipeBuilder
>>> from skrub import selectors as s
>>> from skrub import TableVectorizer
>>> class DatetimeSplines(BaseEstimator):
...     "dummy placeholder"
...     def fit_transform(self, X, y=None):
...         return self.transform(X)
...
...     def transform(self, X):
...         print(f"\ntransform: {X.columns.tolist()}\n")
...         values = np.ones(X.shape[0])
...         return pd.DataFrame({"spline_0": values, "spline_1": values})
>>> pipe = (
... PipeBuilder()
... .apply(DatetimeSplines(), cols=s.all() & "date")
... .apply(TableVectorizer())
... ).get_pipeline()
>>> df = pd.DataFrame({
... "date": ["2020-01-02", "2021-04-03"],
... "temp": [10.1, 17.5]
... })

The column "date" gets transformed by the spline transformer:

>>> pipe.fit_transform(df)
transform: ['date']
   temp  spline_0  spline_1
0  10.1       1.0       1.0
1  17.5       1.0       1.0

When there is no column matching the selector, the spline transformer is not applied:

>>> df = pd.DataFrame({
... "not_date": ["2020-01-02", "2021-04-03"],
... "temp": [10.1, 17.5]
... })
>>> pipe.fit_transform(df)
   not_date_year  not_date_month  not_date_day  not_date_total_seconds  temp
0         2020.0             1.0           2.0            1577923200.0  10.1
1         2021.0             4.0           3.0            1617408000.0  17.5

Does that more or less address the problem you are facing?
Having a conditional transformer might be useful when something more general than selecting columns is needed, though, such as "apply a PCA if there are more than 200 columns".
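For illustration, here is a rough sketch of what such a conditional transformer could look like (the IfElse name and its API are made up for this sketch, not something skrub provides):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class IfElse(TransformerMixin, BaseEstimator):
    "apply `then` when `condition(X)` holds, otherwise pass through (or apply `otherwise`)"

    def __init__(self, condition, then, otherwise=None):
        self.condition = condition
        self.then = then
        self.otherwise = otherwise

    def fit(self, X, y=None):
        # the branch is chosen once, at fit time, for a given dataset
        self.use_then_ = self.condition(X)
        chosen = self.then if self.use_then_ else self.otherwise
        if chosen is not None:
            chosen.fit(X, y)
        return self

    def transform(self, X):
        chosen = self.then if self.use_then_ else self.otherwise
        return X if chosen is None else chosen.transform(X)

# "apply a PCA if there are more than 200 columns":
maybe_pca = IfElse(lambda X: X.shape[1] > 200, PCA(n_components=100))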
However, if the important part is not really the name "date" but rather applying the spline transformer to all datetime columns, you can pass it to the TableVectorizer directly (note: the snippet below does not run on the main branch but it does on that of PR #902):

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator
from skrub import TableVectorizer
class DatetimeSplines(BaseEstimator):
    "dummy placeholder"

    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        print(f"\ntransform: {X.columns.tolist()}\n")
        values = np.ones(X.shape[0])
        return pd.DataFrame({"spline_0": values, "spline_1": values})

>>> vectorizer = TableVectorizer(datetime_transformer=DatetimeSplines())
>>> df = pd.DataFrame({
... "date": ["2020-01-02", "2021-04-03"],
... "temp": [10.1, 17.5]
... })
>>> vectorizer.fit_transform(df)
transform: ['date']
   spline_0  spline_1  temp
0       1.0       1.0  10.1
1       1.0       1.0  17.5
>>> df = pd.DataFrame({
... "not_date": ["blue", "red"],
... "temp": [10.1, 17.5]
... })
>>> vectorizer.fit_transform(df)
   not_date_red  temp
0           0.0  10.1
1           1.0  17.5
I think it does, just one thing. How would the …
Do we want to assume that the user ran their dataframe code or do we want our library to infer that on their behalf? I am partially asking because polars/pandas handle the date stuff slightly differently. But I am also wondering about categorical types. Do we only one-hot encode columns that are categorical?
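As a small illustration of the first point, parsing the same strings gives different dtypes in the two libraries (nothing skrub-specific here):

import pandas as pd
import polars as pl

s_pd = pd.to_datetime(pd.Series(["2020-01-02", "2021-04-03"]))
print(s_pd.dtype)  # datetime64[ns]

s_pl = pl.Series(["2020-01-02", "2021-04-03"]).str.to_date()
print(s_pl.dtype)  # Date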
For selecting all datetime columns you could use the any_date() selector.
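For example (assuming the expand() helper that the selectors expose to list matching column names):

>>> import pandas as pd
>>> from skrub import selectors as s
>>> df = pd.DataFrame({
... "when": pd.to_datetime(["2020-01-02", "2021-04-03"]),
... "temp": [10.1, 17.5],
... })
>>> s.any_date().expand(df)
['when']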
I think we will have that covered by the defaults: the TableVectorizer will one-hot encode anything that is strings or Categorical with a low cardinality. It will also try to parse strings as datetimes and apply the datetime transformer to them.
If you wanted to manually control your pipeline you could do something like:

import pandas as pd
import numpy as np
from skrub import ToDatetime
from skrub import selectors as s
from skrub._pipe_builder import PipeBuilder
from skrub._on_each_column import SingleColumnTransformer
class DatetimeSplines(SingleColumnTransformer):
    "dummy placeholder"

    def fit_transform(self, col, y=None):
        return self.transform(col)

    def transform(self, col):
        name = col.name
        print(f" ==> transform: {name}")
        values = np.ones(len(col))
        return pd.DataFrame({f"{name}_spline_0": values, f"{name}_spline_1": values})

pipe = (
    PipeBuilder()
    .apply(ToDatetime(), allow_reject=True)
    .apply(DatetimeSplines(), cols=s.any_date())
).get_pipeline()

>>> df = pd.DataFrame({
... "A": ["2020-01-02", "2021-04-03"],
... "B": [10.1, 17.5],
... "C": ["2020-01-02T00:01:02", "2021-04-03T10:11:12"],
... "D": ["red", "blue"],
... })
>>> df
            A     B                    C     D
0  2020-01-02  10.1  2020-01-02T00:01:02   red
1  2021-04-03  17.5  2021-04-03T10:11:12  blue
>>> pipe.fit_transform(df)
==> transform: A
==> transform: C
   A_spline_0  A_spline_1     B  C_spline_0  C_spline_1     D
0         1.0         1.0  10.1         1.0         1.0   red
1         1.0         1.0  17.5         1.0         1.0  blue
But if you want something completely automatic, e.g. if you are running on many datasets that you don't inspect manually, then you're probably better off using the TableVectorizer and letting it do the preprocessing and those choices for you.
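For example, the fully automatic version can be as short as this (the choice of regressor here is arbitrary):

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# one model that can be fit unchanged on any of the benchmark datasets:
# dates, categories and numbers are detected and encoded by the vectorizer
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())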
Problem Description
I am running benchmarks on many datasets. When a dataset contains a column called "date", I am interested in running a different pipeline.
At the moment I fixed this by doing something like this:
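Roughly along these lines (a simplified sketch; the build_pipeline helper, the regressor choice and the hard-coded column check are illustrative, and DatetimeSplines stands in for my custom datetime feature extraction, like the dummy placeholder class in the comments above):

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

def build_pipeline(df: pd.DataFrame):
    "pick a different pipeline when the dataset has a 'date' column"
    if "date" in df.columns:
        # datasets with a "date" column get extra datetime feature engineering
        return make_pipeline(
            DatetimeSplines(), TableVectorizer(), HistGradientBoostingRegressor()
        )
    return make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())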
I wonder, could skrub maybe offer a nicer way to do stuff like this?
Feature Description
I don't know if we want this, but for large-scale model search across multiple datasets it could be useful. I also don't know if this is easy to generalise, but I figured it was at least worth mentioning in an issue here.
Alternative Solutions
The custom estimator also works, but it can get hacky quite quickly once I want to repeat this pattern for other types of column features.