Task/gsk 308 churn prediction model #1

Merged
merged 17 commits into from
Oct 20, 2022

Conversation

rabah-khalek
Member

@rabah-khalek rabah-khalek commented Oct 7, 2022

In this notebook, we explore how to predict customer churn, which telecommunication companies need to anticipate in order to retain customers effectively.

Two notebooks are being pushed in this branch:

  • Churn_Telco_Kaggle_with_transformers.ipynb

    • Implemented
    • Uploadable to Giskard
    • Inspectable by Giskard
  • Churn_Telco_Kaggle_without_transformers.ipynb

    • Implemented
    • Uploadable to Giskard
    • Inspectable by Giskard

@rabah-khalek
Member Author

rabah-khalek commented Oct 7, 2022

The issue I'm currently having with Churn_Telco_Kaggle_without_transformers.ipynb is the following:

ValueError: X has 19 features, but RandomForestClassifier is expecting 40 features as input.

Apparently, there is currently an issue with writing a predict function that contains transformers that expand the columns of the dataset (such as a one-hot encoder). Conjecture: there is an if statement that checks the shape of the input data frame and compares it to the shape expected by the model before the predict function is called.
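To illustrate the kind of column mismatch described above, here is a minimal sketch. It is not the notebook's actual predict function: the toy data, the `pipeline` object and the `predict` wrapper are assumptions made up for the example. It shows that `pd.get_dummies` derives its output columns from whatever values are present in the frame it receives, so a small batch can produce fewer columns than the training set did, whereas an encoder fitted inside an sklearn `Pipeline` keeps the column layout learned at fit time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Telco data (hypothetical values, two columns only).
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 48, 24],
})
y = [1, 0, 0, 1]

# get_dummies builds columns from the values it sees in *this* frame.
fit_columns = pd.get_dummies(train, columns=["Contract"]).columns
one_row = train.iloc[[0]]
predict_columns = pd.get_dummies(one_row, columns=["Contract"]).columns
print(len(fit_columns), len(predict_columns))  # the two counts differ

# One possible remedy: fit the encoder once, inside the model pipeline,
# so the raw columns can be passed to predict unchanged.
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Contract"])],
        remainder="passthrough",
    )),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(train, y)


def predict(df: pd.DataFrame):
    """Hypothetical prediction function that accepts the raw columns."""
    return pipeline.predict_proba(df)


proba = predict(one_row)
```

With the encoder fitted as part of the pipeline, the prediction function receives the same raw columns at fit and predict time, so a shape check on the input frame and the model's internal feature count no longer need to agree.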

Full log from ml-worker

2022-10-07 14:58:09,530 pid:60247 giskard.ml_worker.utils.logging INFO giskard.ml_worker.utils.grpc_mapper.deserialize_dataset executed in 0:00:00.023816
2022-10-07 14:58:09,546 pid:60247 giskard.ml_worker.core.model INFO Casting dataframe columns from {'gender': 'object', 'SeniorCitizen': 'int64', 'Partner': 'object', 'Dependents': 'object', 'tenure': 'int64', 'PhoneService': 'object', 'MultipleLines': 'object', 'InternetService': 'object', 'OnlineSecurity': 'object', 'OnlineBackup': 'object', 'DeviceProtection': 'object', 'TechSupport': 'object', 'StreamingTV': 'object', 'StreamingMovies': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'PaymentMethod': 'object', 'MonthlyCharges': 'float64', 'TotalCharges': 'float64'} to {'InternetService': 'object', 'TotalCharges': 'float64', 'MonthlyCharges': 'float64', 'StreamingTV': 'object', 'DeviceProtection': 'object', 'Dependents': 'object', 'TechSupport': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'MultipleLines': 'object', 'SeniorCitizen': 'int64', 'PhoneService': 'object', 'gender': 'object', 'StreamingMovies': 'object', 'OnlineBackup': 'object', 'OnlineSecurity': 'object', 'tenure': 'int64', 'PaymentMethod': 'object', 'Partner': 'object'}
2022-10-07 14:58:09,553 pid:60247 giskard.ml_worker.core.model INFO Casting dataframe columns from {'gender': 'object', 'SeniorCitizen': 'object', 'Partner': 'object', 'Dependents': 'object', 'tenure': 'object', 'PhoneService': 'object', 'MultipleLines': 'object', 'InternetService': 'object', 'OnlineSecurity': 'object', 'OnlineBackup': 'object', 'DeviceProtection': 'object', 'TechSupport': 'object', 'StreamingTV': 'object', 'StreamingMovies': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'PaymentMethod': 'object', 'MonthlyCharges': 'object', 'TotalCharges': 'object'} to {'InternetService': 'object', 'TotalCharges': 'float64', 'MonthlyCharges': 'float64', 'StreamingTV': 'object', 'DeviceProtection': 'object', 'Dependents': 'object', 'TechSupport': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'MultipleLines': 'object', 'SeniorCitizen': 'int64', 'PhoneService': 'object', 'gender': 'object', 'StreamingMovies': 'object', 'OnlineBackup': 'object', 'OnlineSecurity': 'object', 'tenure': 'int64', 'PaymentMethod': 'object', 'Partner': 'object'}
Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names seen at fit time, yet now missing:
- Contract_One year
- Contract_Two year
- DeviceProtection_No internet service
- DeviceProtection_Yes
- InternetService_DSL
- ...

Provided model function fails when applied to the provided data set.
2022-10-07 14:58:09,623 pid:60247 grpc._cython.cygrpc ERROR Unexpected [ValueError] raised by servicer method [/worker.MLWorker/explain]
Traceback (most recent call last):
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 796, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 411, in _finish_handler_with_unary_response
File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/rak/Documents/giskard-client/giskard/ml_worker/server/ml_worker_service.py", line 133, in explain
explanations = explain(model, dataset, request.columns)
File "/Users/rak/Documents/giskard-client/giskard/ml_worker/core/model_explanation.py", line 39, in explain
kernel_shap.fit(example)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/alibi/explainers/shap_wrappers.py", line 765, in fit
self._explainer = KernelExplainerWrapper(*explainer_args, **explainer_kwargs)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/alibi/explainers/shap_wrappers.py", line 250, in __init__
super().__init__(*args, **kwargs)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/shap/explainers/_kernel.py", line 69, in __init__
model_null = match_model_to_data(self.model, self.data)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/shap/utils/_legacy.py", line 112, in match_model_to_data
out_val = model.f(data.data)
File "/Users/rak/Documents/giskard-client/giskard/ml_worker/core/model_explanation.py", line 30, in predict_array
return model.prediction_function(pd.DataFrame(array, columns=list(df.columns)))
File "/Users/rak/Documents/giskard-client/giskard/client/project.py", line 359, in <lambda>
return lambda df: prediction_function(df[feature_names])
File "/var/folders/jp/b7681vg128nf8s2hw47sl6380000gn/T/ipykernel_59876/2265545978.py", line 33, in predict
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 850, in predict_proba
X = self._validate_X_predict(X)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 579, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/base.py", line 585, in _validate_data
self._check_n_features(X, reset=reset)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/base.py", line 400, in _check_n_features
raise ValueError(
ValueError: X has 19 features, but RandomForestClassifier is expecting 40 features as input.

@jmsquare
Member

jmsquare commented Oct 7, 2022

  • In the initial description, there is no need to say where the notebook comes from (Medium, Kaggle and Giskard)
  • clf_classification should be renamed to clf
  • Could you also upload a new model? Preferably an XGBoost or a LightGBM model. These are good models for testing the externalized ML worker version
  • About the following error: ValueError: X has 19 features, but RandomForestClassifier is expecting 40 features as input. When does it appear? When you click on the inspection button?

@rabah-khalek
Member Author

rabah-khalek commented Oct 7, 2022

Thanks @jmsquare, have a go now.

  • I adjusted the description
  • That is taken care of
  • Now all the following models are uploaded:
    • dummy_classifier, Accuracy: 0.734
    • k_nearest_neighbors, Accuracy: 0.768
    • logistic_regression, Accuracy: 0.804
    • random_forest, Accuracy: 0.793
    • gradient_boosting, Accuracy: 0.802
    • LGBM, Accuracy: 0.8
  • Exactly! I'll have a look at this next week. I think it's an important point if we want to make users' lives easier and not require them to write custom transformers that can be fed to sklearn pipelines.
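For reference, an accuracy comparison like the one listed above can be sketched as follows. This is only an outline under stated assumptions: it uses synthetic data from `make_classification` rather than the Telco dataset, default-ish hyperparameters that are not taken from the notebooks, and it leaves out LGBM so that the example depends on scikit-learn alone. The scores it prints are therefore unrelated to the figures reported in this thread.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption: stands in for the churn data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model names follow the list in the comment; hyperparameters are illustrative.
models = {
    "dummy_classifier": DummyClassifier(strategy="most_frequent"),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}, Accuracy: {accuracies[name]:.3f}")
```

Keeping the dummy classifier in the loop gives a baseline against which the other scores can be read.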

@rabah-khalek
Member Author

This PR is ready for a final review. Once it is merged, we can close Giskard-AI/giskard-client#27 with it.

"source": [
"# Declare the type of each column in the dataset(example: category, numeric, text)\n",
"column_types = {'gender': \"category\",\n",
" 'SeniorCitizen': \"numeric\", \n",
Contributor

Although SeniorCitizen only takes the values 0 and 1, and the first instinct is to declare it as numeric, it is actually a categorical value. Please feel free to suggest how we could help avoid this confusion. Did the documentation fall short?

Member Author

Thanks @princyiakov for this catch!
Yeah definitely, good question. I think it's generally the user's responsibility to check this, but this mistake can easily happen, in the worst case as a typo.

One option I can think of is a warning check based on df.nunique(), which for this notebook gives the following output:

gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                72
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1584
TotalCharges        6530
Churn                  2

Then we compare these counts to the column_types given by the user. If the user declares a column as numeric when its unique count is below 10, for instance, we issue a warning. What do you think?
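A rough sketch of the check proposed above could look like this. The function name `find_suspicious_numeric_columns`, the `max_unique` parameter and the toy data frame are hypothetical and only serve the illustration; nothing here is an existing Giskard API.

```python
import warnings

import pandas as pd


def find_suspicious_numeric_columns(df, column_types, max_unique=10):
    """Return columns declared "numeric" that have fewer than max_unique distinct values.

    Hypothetical helper: a warning is emitted for each flagged column.
    """
    unique_counts = df.nunique()
    suspicious = [
        column
        for column, declared in column_types.items()
        if declared == "numeric"
        and column in unique_counts.index
        and unique_counts[column] < max_unique
    ]
    for column in suspicious:
        warnings.warn(
            f"Column '{column}' is declared as numeric but has only "
            f"{unique_counts[column]} unique values; should it be a category?"
        )
    return suspicious


# Toy frame (made-up values) with a binary flag, a wide numeric column and a category.
df = pd.DataFrame({
    "SeniorCitizen": [0, 1] * 6,
    "tenure": list(range(1, 13)),
    "gender": ["Female", "Male"] * 6,
})
column_types = {
    "SeniorCitizen": "numeric",
    "tenure": "numeric",
    "gender": "category",
}

flagged = find_suspicious_numeric_columns(df, column_types)
print(flagged)
```

Emitting a warning rather than raising keeps the decision with the user, since a genuinely numeric column can also have few distinct values.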

Development

Successfully merging this pull request may close these issues.

GSK-317: Conflict between pd.get_dummies and _validate_model_execution in giskard-client
3 participants