Task/gsk 308 churn prediction model #1

rabah-khalek · 2022-10-07T12:55:00Z

In this notebook, we explore how to predict customer churn, a critical capability for telecommunication companies that want to retain customers effectively.

Two notebooks are being pushed in this branch:

Churn_Telco_Kaggle_with_transformers.ipynb
- Implemented
- Uploadable to Giskard
- Inspectable by Giskard
Churn_Telco_Kaggle_without_transformers.ipynb
- Implemented
- Uploadable to Giskard
- Inspectable by Giskard

rabah-khalek · 2022-10-07T13:01:19Z

The issue I'm currently having with Churn_Telco_Kaggle_without_transformers.ipynb is the following:

ValueError: X has 19 features, but RandomForestClassifier is expecting 40 features as input.

Apparently, there's currently an issue with writing a predict function that contains transformers that add columns to the dataset (such as a one-hot encoder). Conjecture: an if statement checks the shape of the input data frame and compares it to the shape expected by the model before calling the predict function.
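For illustration, here is a minimal sketch of one common way to keep the encoded width stable when pd.get_dummies is called inside the predict function: remember the columns seen at fit time and reindex against them. The toy data, `train_columns`, and `predict` are assumptions made for this example, not the notebook's actual code, and this does not confirm the conjecture above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the Telco data (hypothetical values, not the real dataset).
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 24, 60, 12],
})
y = [1, 0, 0, 1]

# One-hot encode at fit time and remember the resulting column layout.
X_train = pd.get_dummies(train)
train_columns = list(X_train.columns)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y)


def predict(df: pd.DataFrame):
    """Encode raw rows and align them with the columns seen at fit time."""
    encoded = pd.get_dummies(df)
    # Add dummy columns missing from this batch as 0, drop unseen ones,
    # and restore the training column order.
    encoded = encoded.reindex(columns=train_columns, fill_value=0)
    return clf.predict_proba(encoded)


# A single row only produces one Contract dummy, yet predict still works.
single_row = train.iloc[[0]]
proba = predict(single_row)
```

Without the reindex step, a batch that does not contain every category yields fewer columns than the model was fitted on, which is one way to end up with a feature-count mismatch.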

Full log from ml-worker

2022-10-07 14:58:09,530 pid:60247 giskard.ml_worker.utils.logging INFO giskard.ml_worker.utils.grpc_mapper.deserialize_dataset executed in 0:00:00.023816
2022-10-07 14:58:09,546 pid:60247 giskard.ml_worker.core.model INFO Casting dataframe columns from {'gender': 'object', 'SeniorCitizen': 'int64', 'Partner': 'object', 'Dependents': 'object', 'tenure': 'int64', 'PhoneService': 'object', 'MultipleLines': 'object', 'InternetService': 'object', 'OnlineSecurity': 'object', 'OnlineBackup': 'object', 'DeviceProtection': 'object', 'TechSupport': 'object', 'StreamingTV': 'object', 'StreamingMovies': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'PaymentMethod': 'object', 'MonthlyCharges': 'float64', 'TotalCharges': 'float64'} to {'InternetService': 'object', 'TotalCharges': 'float64', 'MonthlyCharges': 'float64', 'StreamingTV': 'object', 'DeviceProtection': 'object', 'Dependents': 'object', 'TechSupport': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'MultipleLines': 'object', 'SeniorCitizen': 'int64', 'PhoneService': 'object', 'gender': 'object', 'StreamingMovies': 'object', 'OnlineBackup': 'object', 'OnlineSecurity': 'object', 'tenure': 'int64', 'PaymentMethod': 'object', 'Partner': 'object'}
2022-10-07 14:58:09,553 pid:60247 giskard.ml_worker.core.model INFO Casting dataframe columns from {'gender': 'object', 'SeniorCitizen': 'object', 'Partner': 'object', 'Dependents': 'object', 'tenure': 'object', 'PhoneService': 'object', 'MultipleLines': 'object', 'InternetService': 'object', 'OnlineSecurity': 'object', 'OnlineBackup': 'object', 'DeviceProtection': 'object', 'TechSupport': 'object', 'StreamingTV': 'object', 'StreamingMovies': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'PaymentMethod': 'object', 'MonthlyCharges': 'object', 'TotalCharges': 'object'} to {'InternetService': 'object', 'TotalCharges': 'float64', 'MonthlyCharges': 'float64', 'StreamingTV': 'object', 'DeviceProtection': 'object', 'Dependents': 'object', 'TechSupport': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'MultipleLines': 'object', 'SeniorCitizen': 'int64', 'PhoneService': 'object', 'gender': 'object', 'StreamingMovies': 'object', 'OnlineBackup': 'object', 'OnlineSecurity': 'object', 'tenure': 'int64', 'PaymentMethod': 'object', 'Partner': 'object'}
Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names seen at fit time, yet now missing:
- Contract_One year
- Contract_Two year
- DeviceProtection_No internet service
- DeviceProtection_Yes
- InternetService_DSL
- ...

Provided model function fails when applied to the provided data set.
2022-10-07 14:58:09,623 pid:60247 grpc._cython.cygrpc ERROR Unexpected [ValueError] raised by servicer method [/worker.MLWorker/explain]
Traceback (most recent call last):
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 796, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 411, in _finish_handler_with_unary_response
File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/rak/Documents/giskard-client/giskard/ml_worker/server/ml_worker_service.py", line 133, in explain
explanations = explain(model, dataset, request.columns)
File "/Users/rak/Documents/giskard-client/giskard/ml_worker/core/model_explanation.py", line 39, in explain
kernel_shap.fit(example)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/alibi/explainers/shap_wrappers.py", line 765, in fit
self._explainer = KernelExplainerWrapper(*explainer_args, **explainer_kwargs)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/alibi/explainers/shap_wrappers.py", line 250, in __init__
super().__init__(*args, **kwargs)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/shap/explainers/_kernel.py", line 69, in __init__
model_null = match_model_to_data(self.model, self.data)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/shap/utils/_legacy.py", line 112, in match_model_to_data
out_val = model.f(data.data)
File "/Users/rak/Documents/giskard-client/giskard/ml_worker/core/model_explanation.py", line 30, in predict_array
return model.prediction_function(pd.DataFrame(array, columns=list(df.columns)))
File "/Users/rak/Documents/giskard-client/giskard/client/project.py", line 359, in <lambda>
return lambda df: prediction_function(df[feature_names])
File "/var/folders/jp/b7681vg128nf8s2hw47sl6380000gn/T/ipykernel_59876/2265545978.py", line 33, in predict
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 850, in predict_proba
X = self._validate_X_predict(X)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 579, in _validate_X_predict
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/base.py", line 585, in _validate_data
self._check_n_features(X, reset=reset)
File "/Users/rak/Documents/giskard-client/.venv/lib/python3.8/site-packages/sklearn/base.py", line 400, in _check_n_features
raise ValueError(
ValueError: X has 19 features, but RandomForestClassifier is expecting 40 features as input.

jmsquare · 2022-10-07T13:53:19Z

In the initial description, there is no need to say where the notebook comes from (Medium, Kaggle and Giskard).
clf_classification should be renamed to clf.
Could you also upload a new model, preferably an XGBoost or LightGBM model? These are good models for testing the externalized ML worker version.
About the error "ValueError: X has 19 features, but RandomForestClassifier is expecting 40 features as input": when does it appear? When you click on the inspection button?

rabah-khalek · 2022-10-07T14:34:26Z

Thanks @jmsquare, have a go now.

I adjusted the description
That is taken care of
Now all the following models are uploaded:
- dummy_classifier, Accuracy: 0.734
- k_nearest_neighbors, Accuracy: 0.768
- logistic_regression, Accuracy: 0.804
- random_forest, Accuracy: 0.793
- gradient_boosting, Accuracy: 0.802
- LGBM, Accuracy: 0.8
Exactly! I'll have a look at this next week. I think that's an important point if we want to make the user's life easier and not force them to write custom transformers that can be fed to sklearn pipelines.
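For comparison, this is roughly what the pipeline route looks like: a minimal sketch assuming a toy two-column frame and a logistic regression, not the contents of Churn_Telco_Kaggle_with_transformers.ipynb. Because the encoder lives inside the fitted pipeline, the prediction function can take the raw columns directly.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Telco data (hypothetical values, not the real dataset).
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 24, 60, 12],
})
y = [1, 0, 0, 1]

# The encoder learns its categories at fit time, so every later batch is
# expanded to the same set of columns; unseen categories are ignored.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ("num", StandardScaler(), ["tenure"]),
])
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("clf", LogisticRegression()),
])
pipeline.fit(train, y)

# The raw, un-encoded row goes straight into the pipeline.
proba = pipeline.predict_proba(train.iloc[[0]])
```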

rabah-khalek · 2022-10-12T12:31:00Z

This PR is ready for a final review. Once it is merged, we can also close Giskard-AI/giskard-client#27.

Churn_Telco_Kaggle_without_transformers.ipynb

princyiakov · 2022-10-14T10:03:17Z

Churn_Telco_Kaggle_with_transformers.ipynb

+ "source": [
+ "# Declare the type of each column in the dataset(example: category, numeric, text)\n",
+ "column_types = {'gender': \"category\",\n",
+ " 'SeniorCitizen': \"numeric\", \n",


Although SeniorCitizen only takes the values [0, 1], so the first impression is to declare it as numeric, it is actually a categorical value. Please feel free to suggest how we can help avoid this confusion. Was the documentation falling short?

Thanks @princyiakov for this catch!
Yeah definitely, good question. I think generally it's the responsibility of the user to check this, but yeah this mistake could happen, worst case as a typo.

One way I could think of is to implement a warning check using df.nunique(), which gives the following output for this notebook for example:

gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                72
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1584
TotalCharges        6530
Churn                  2

Then we compare these counts to the column_types given by the user. If the user declares a column as numeric while it has fewer than 10 unique values, for instance, we issue a warning. What do you think?
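A rough sketch of what such a check could look like. The function name `check_numeric_declarations` and the `max_unique` parameter are hypothetical and only illustrate the proposal; nothing like this exists in giskard-client as far as this thread shows.

```python
import warnings

import pandas as pd


def check_numeric_declarations(df, column_types, max_unique=10):
    """Warn about columns declared "numeric" that have few distinct values.

    Returns the list of flagged column names.
    """
    n_unique = df.nunique()
    flagged = []
    for column, declared in column_types.items():
        if declared != "numeric" or column not in n_unique.index:
            continue
        if n_unique[column] < max_unique:
            warnings.warn(
                f"Column '{column}' is declared as numeric but has only "
                f"{n_unique[column]} unique values; should it be a category?"
            )
            flagged.append(column)
    return flagged


# Small illustrative frame (made-up rows, same column names as the notebook).
df = pd.DataFrame({
    "gender": ["Female", "Male"] * 6,
    "SeniorCitizen": [0, 1] * 6,
    "tenure": list(range(12)),
})
column_types = {
    "gender": "category",
    "SeniorCitizen": "numeric",
    "tenure": "numeric",
}
flagged = check_numeric_declarations(df, column_types)
```

The threshold is a heuristic: a genuinely numeric column with few distinct values would also be flagged, which is why a warning seems safer than an error here.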

Rabah Abdul Khalek and others added 5 commits October 7, 2022 14:15

added Kaggle Churn dataset

40d10f9

added Churn_Telco_Kaggle_with_transformers.ipynb

37f76a2

added Churn_Telco_Kaggle_without_transformers.ipynb

d556ad7

updated Churn_Telco_Kaggle_with_transformers.ipynb

084935c

updated Churn_Telco_Kaggle_without_transformers.ipynb

e7631d0

rabah-khalek requested review from jmsquare and princyiakov October 7, 2022 13:04

updated both notebooks with a new cell to add the training data

5e5888b

rabah-khalek added 2 commits October 7, 2022 15:57

updated intro in both notebooks

31d57f5

updated pipeline name

b547d99

rabah-khalek added 3 commits October 7, 2022 18:16

added more models

81c41e7

updated Churn_Telco_Kaggle_with_transformers.ipynb

d2cc285

updated Churn_Telco_Kaggle_without_transformers.ipynb

e010ba5

rabah-khalek mentioned this pull request Oct 12, 2022

GSK-317: Conflict between pd.get_dummies and _validate_model_execution in giskard-client Giskard-AI/giskard-client#27

Closed

rabah-khalek linked an issue Oct 12, 2022 that may be closed by this pull request

GSK-317: Conflict between pd.get_dummies and _validate_model_execution in giskard-client Giskard-AI/giskard-client#27

Closed

final update to Churn_Telco_Kaggle notebooks, solving the issue of get_dummies

d657e13