Task/gsk 308 churn prediction model #1
The issue I'm currently having with Churn_Telco_Kaggle_without_transformers.ipynb is the following:
Apparently, there's currently an issue with writing a Full log from
2022-10-07 14:58:09,530 pid:60247 giskard.ml_worker.utils.logging INFO giskard.ml_worker.utils.grpc_mapper.deserialize_dataset executed in 0:00:00.023816
2022-10-07 14:58:09,546 pid:60247 giskard.ml_worker.core.model INFO Casting dataframe columns from {'gender': 'object', 'SeniorCitizen': 'int64', 'Partner': 'object', 'Dependents': 'object', 'tenure': 'int64', 'PhoneService': 'object', 'MultipleLines': 'object', 'InternetService': 'object', 'OnlineSecurity': 'object', 'OnlineBackup': 'object', 'DeviceProtection': 'object', 'TechSupport': 'object', 'StreamingTV': 'object', 'StreamingMovies': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'PaymentMethod': 'object', 'MonthlyCharges': 'float64', 'TotalCharges': 'float64'} to {'InternetService': 'object', 'TotalCharges': 'float64', 'MonthlyCharges': 'float64', 'StreamingTV': 'object', 'DeviceProtection': 'object', 'Dependents': 'object', 'TechSupport': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'MultipleLines': 'object', 'SeniorCitizen': 'int64', 'PhoneService': 'object', 'gender': 'object', 'StreamingMovies': 'object', 'OnlineBackup': 'object', 'OnlineSecurity': 'object', 'tenure': 'int64', 'PaymentMethod': 'object', 'Partner': 'object'}
2022-10-07 14:58:09,553 pid:60247 giskard.ml_worker.core.model INFO Casting dataframe columns from {'gender': 'object', 'SeniorCitizen': 'object', 'Partner': 'object', 'Dependents': 'object', 'tenure': 'object', 'PhoneService': 'object', 'MultipleLines': 'object', 'InternetService': 'object', 'OnlineSecurity': 'object', 'OnlineBackup': 'object', 'DeviceProtection': 'object', 'TechSupport': 'object', 'StreamingTV': 'object', 'StreamingMovies': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'PaymentMethod': 'object', 'MonthlyCharges': 'object', 'TotalCharges': 'object'} to {'InternetService': 'object', 'TotalCharges': 'float64', 'MonthlyCharges': 'float64', 'StreamingTV': 'object', 'DeviceProtection': 'object', 'Dependents': 'object', 'TechSupport': 'object', 'Contract': 'object', 'PaperlessBilling': 'object', 'MultipleLines': 'object', 'SeniorCitizen': 'int64', 'PhoneService': 'object', 'gender': 'object', 'StreamingMovies': 'object', 'OnlineBackup': 'object', 'OnlineSecurity': 'object', 'tenure': 'int64', 'PaymentMethod': 'object', 'Partner': 'object'}
Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names seen at fit time, yet now missing:
- Contract_One year
- Contract_Two year
- DeviceProtection_No internet service
- DeviceProtection_Yes
- InternetService_DSL
- ...
Provided model function fails when applied to the provided data set.
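One possible reading of the log above is that the model was fit on one-hot-encoded columns (such as `Contract_One year`), while the prediction function receives the raw columns. The following is a minimal sketch of one way to keep the two consistent, assuming the encoding was done with `pd.get_dummies`. The toy data and the helper name `encode_like_training` are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Toy training data; only the column name 'Contract' mirrors the log above.
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 12, 24],
})
train_encoded = pd.get_dummies(train, columns=["Contract"])

# Remember the exact feature names (and order) the model would be fit on.
fit_columns = list(train_encoded.columns)


def encode_like_training(raw_df):
    """Hypothetical helper: one-hot encode raw rows, then align to fit-time columns."""
    encoded = pd.get_dummies(raw_df, columns=["Contract"])
    # Categories absent from raw_df become all-zero columns instead of going missing.
    return encoded.reindex(columns=fit_columns, fill_value=0)


new_rows = pd.DataFrame({"Contract": ["One year"], "tenure": [5]})
aligned = encode_like_training(new_rows)
print(list(aligned.columns))
```

Wrapping the encoding step inside the prediction function in this way lets the function accept the raw dataframe while the model still sees the feature names it was fit with.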
Thanks @jmsquare, have a go now.
This PR is ready for a final review. Once it is merged, we can also close Giskard-AI/giskard-client#27.
"source": [
"# Declare the type of each column in the dataset (example: category, numeric, text)\n",
"column_types = {'gender': \"category\",\n",
" 'SeniorCitizen': \"numeric\", \n",
Although the values in SeniorCitizen are [0, 1] and the first instinct is to declare the column as numeric, it is actually a categorical value. Please feel free to suggest how we can help avoid this confusion. Was the documentation falling short?
Thanks @princyiakov for this catch!
Yeah, definitely, good question. In general I think it is the user's responsibility to check this, but the mistake can easily happen, in the worst case as a typo.
One way I can think of is to implement a warning check using df.nunique(), which gives the following output for this notebook, for example:
gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 72
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
MonthlyCharges 1584
TotalCharges 6530
Churn 2
We then compare these counts to the column_types given by the user: if the user declares a column as numeric while its unique count is below 10, for instance, we issue a warning. What do you think?
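The check proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `find_suspicious_numeric_columns`, the `max_unique` parameter, and the toy dataframe are hypothetical, and none of this is an existing Giskard API.

```python
import warnings

import pandas as pd


def find_suspicious_numeric_columns(df, column_types, max_unique=10):
    """Warn about columns declared 'numeric' that have fewer than max_unique distinct values."""
    unique_counts = df.nunique()
    flagged = []
    for column, declared_type in column_types.items():
        if column not in unique_counts.index:
            continue  # declared column is not present in the dataframe
        if declared_type == "numeric" and unique_counts[column] < max_unique:
            warnings.warn(
                f"Column '{column}' is declared as numeric but has only "
                f"{unique_counts[column]} unique values; should it be a category?"
            )
            flagged.append(column)
    return flagged


# Toy data: SeniorCitizen has 2 distinct values, tenure has 12.
df = pd.DataFrame({
    "gender": ["Male", "Female"] * 6,
    "SeniorCitizen": [0, 1] * 6,
    "tenure": list(range(1, 13)),
})
column_types = {"gender": "category", "SeniorCitizen": "numeric", "tenure": "numeric"}

flagged = find_suspicious_numeric_columns(df, column_types)
print(flagged)  # ['SeniorCitizen']
```

The threshold of 10 is the value floated in the comment. It is a heuristic, so a genuinely numeric column with few distinct values would also be flagged, which is why the sketch only warns rather than raising an error.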
In this notebook, we explore how to predict customer churn, which is critical for telecommunication companies that want to retain customers effectively.
Two notebooks are being pushed in this branch:
Churn_Telco_Kaggle_with_transformers.ipynb
Churn_Telco_Kaggle_without_transformers.ipynb