Deliver ML products, better & faster. \n", "\n", "* Collaborate faster with feedback from business stakeholders.\n", "* Deploy automated tests to eliminate regressions, errors & biases.\n", "\n", "🏡 [Website](https://giskard.ai/)\n", "\n", "📗 [Documentation](https://docs.giskard.ai/)" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "f35c8e8d3fbf4c0f9c01a69673c318a1", "deepnote_app_coordinates": { "h": 5, "w": 12, "x": 0, "y": 6 }, "deepnote_cell_height": 110, "deepnote_cell_type": "markdown", "id": "mJTqM-W_7xbW", "owner_user_id": "41ec0844-b5b7-49c2-9460-710a452f98de", "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "# Telco customer churn data\n", "\n", "\n", "In this notebook we explore how to predict customer churn, a key step for telecommunication companies that want to retain customers effectively. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing `giskard` and `lightgbm`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install giskard lightgbm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect the external worker in daemon mode" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!giskard worker start -d" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "e8d609f32d5243dd917cc3104599b8d8", "deepnote_app_coordinates": { "h": 5, "w": 12, "x": 0, "y": 12 }, "deepnote_cell_height": 230, "deepnote_cell_type": "markdown", "id": "WNI85koE7xbX", "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "## 1. 
Data Reading" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.dummy import DummyClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.metrics import accuracy_score\n", "import lightgbm as lbt\n", "\n", "random_seed=123" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# import telecom dataset into a pandas data frame\n", "\n", "dataset_url=\"https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv\"\n", "\n", "df_telco=pd.read_csv(dataset_url)\n", "\n", "# check unique values of each column\n", "#for column in df_telco.columns:\n", "# print('Column: {} - Unique Values: {}'.format(column, df_telco[column].unique()))\n", "\n", "# summary of the data frame\n", "#df_telco.info()\n", "\n", "# transform the column TotalCharges into a numeric data type\n", "df_telco['TotalCharges'] = pd.to_numeric(df_telco['TotalCharges'], errors='coerce')\n", "\n", "# drop observations with null values\n", "df_telco.dropna(inplace=True)\n", "\n", "# drop the customerID column from the dataset\n", "df_telco.drop(columns='customerID', inplace=True)\n", "\n", "# remove (automatic) from payment method names\n", "df_telco['PaymentMethod'] = df_telco['PaymentMethod'].str.replace(' (automatic)', '', regex=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Initialising feature names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Declare the type of each column in the dataset(example: category, numeric, text)\n", "column_types = {'gender': \"category\",\n", " 'SeniorCitizen': \"category\", \n", " 'Partner': \"category\", \n", " 'Dependents': \"category\", \n", " 'tenure': \"numeric\",\n", " 'PhoneService': \"category\", \n", " 'MultipleLines': \"category\", \n", " 'InternetService': \"category\", \n", " 'OnlineSecurity': \"category\",\n", " 'OnlineBackup': \"category\", \n", " 'DeviceProtection': \"category\", \n", " 'TechSupport': \"category\", \n", " 'StreamingTV': \"category\",\n", " 'StreamingMovies': \"category\", \n", " 'Contract': \"category\", \n", " 'PaperlessBilling': \"category\", \n", " 'PaymentMethod': \"category\",\n", " 'MonthlyCharges': \"numeric\", \n", " 'TotalCharges': \"numeric\", \n", " 'Churn': \"category\"}\n", "\n", "# feature_types is used to declare the features the model is trained on\n", "feature_types = {i:column_types[i] for i in column_types if i!='Churn'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Setting up Feature Engineering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn import model_selection\n", "\n", "\n", "columns_to_scale = [key for key in feature_types.keys() if feature_types[key]==\"numeric\"]\n", "\n", "columns_to_encode = [key for key in feature_types.keys() if feature_types[key]==\"category\"]\n", "\n", "# Perform preprocessing of the columns with the above pipelines\n", "preprocessor = ColumnTransformer(\n", " transformers=[\n", " ('num', StandardScaler(), columns_to_scale),\n", " ('cat', OneHotEncoder(handle_unknown='ignore',drop='first'), columns_to_encode)\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Data splitting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select independent variables\n", "X = df_telco.drop(columns='Churn')\n", "\n", "# select dependent variables\n", "y = df_telco.loc[:, 'Churn']\n", "\n", "# split the data in training and testing sets\n", "X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.25, random_state=random_seed, shuffle=True)\n", "# Prepare data to upload on Giskard\n", "train_data = pd.concat([X_train, Y_train], axis=1)\n", "test_data = pd.concat([X_test, Y_test ], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Pipelines and Models Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models = {}\n", "models['dummy_classifier']= {\"model\": DummyClassifier(random_state=random_seed, strategy='most_frequent'), \"accuracy\":0} \n", "models['k_nearest_neighbors']= {\"model\": KNeighborsClassifier(), \"accuracy\":0} \n", "models['logistic_regression']= {\"model\": LogisticRegression(random_state=random_seed,max_iter=150), \"accuracy\":0} \n", "models['random_forest']= {\"model\": RandomForestClassifier(random_state=random_seed), \"accuracy\":0} \n", "models['gradient_boosting']= {\"model\": GradientBoostingClassifier(random_state=random_seed), \"accuracy\":0} \n", "models['LGBM']= {\"model\": lbt.LGBMClassifier(random_state=random_seed), \"accuracy\":0} \n", " \n", "\n", "# test the accuracy of each model using default hyperparameters\n", "scoring = 'accuracy'\n", "for name in models.keys():\n", " # wrap the classifier in a pipeline that applies the preprocessing first\n", " models[name]['model']= Pipeline(steps=[('preprocessor', preprocessor), ('classifier', models[name]['model'])])\n", " \n", " # fit the model with the training data\n", " models[name]['model'].fit(X_train, Y_train)\n", " # make predictions with the testing data\n", " predictions = models[name]['model'].predict(X_test)\n", " # calculate accuracy \n", " accuracy = accuracy_score(Y_test, predictions)\n", " # store the accuracy alongside the fitted model\n", " models[name]['accuracy']=accuracy\n", " # print classifier accuracy\n", " print('Classifier: {}, Accuracy: {}'.format(name, accuracy))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Upload the models to Giskard 🚀🚀🚀" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Initiate a project" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from giskard import GiskardClient\n", "\n", 
"url = \"http://localhost:19000\" #if Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)\n", "#url = \"http://app.giskard.ai\" # If you want to upload on giskard URL\n", "token = \"YOUR GENERATED TOKEN\"\n", "client = GiskardClient(url, token)\n", "\n", "# your_project = client.create_project(\"project_key\", \"PROJECT_NAME\", \"DESCRIPTION\")\n", "# Choose the arguments you want. But \"project_key\" should be unique and in lower case\n", "churn_analysis_with_tfs = client.create_project(\"churn_analysis_with_tfs\", \"Telco Kaggle Churn Analysis\", \"Project to predict if a customer quits\")\n", "\n", "# If you've already created a project with the key \"churn-analysis\" use\n", "#churn_analysis = client.get_project(\"churn_analysis\")\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Upload a specific model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" }, "scrolled": true }, "outputs": [], "source": [ "churn_analysis_with_tfs.upload_model_and_df(\n", " prediction_function=models['dummy_classifier']['model'].predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model\n", " model_type='classification', # \"classification\" for classification model OR \"regression\" for regression model\n", " df=test_data, # the dataset you want to use to inspect your model\n", " column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values\n", " target='Churn', # The column name in df corresponding to the actual target variable (ground truth).\n", " feature_names=list(feature_types.keys()), # List of the feature names of prediction_function\n", " 
classification_labels=[\"No\",\"Yes\"] , # List of the classification labels of your prediction #TODO: Check their order!!!!!\n", " model_name='dummy_classifier', # Name of the model\n", " dataset_name='test_data' # Name of the dataset\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload more models" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for name in models.keys():\n", " if name=='dummy_classifier': continue\n", " churn_analysis_with_tfs.upload_model(\n", " prediction_function=models[name]['model'].predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model\n", " model_type='classification', # \"classification\" for classification model OR \"regression\" for regression model\n", " feature_names=list(feature_types.keys()), # List of the feature names of prediction_function\n", " name=name, # Name of the model\n", " target=\"Churn\", # Optional. target sshould be a column of validate_df. 