{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "-pvKxJjsNSoE", "pycharm": { "name": "#%% md\n" } }, "source": [ "![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "a9b17467105f4031a3f9eae70ef4138f", "deepnote_cell_height": 134, "deepnote_cell_type": "markdown", "id": "PKcOi3D37xbW", "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "# About Giskard\n", "\n", "Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. \n", "\n", "* Collaborate faster with feedback from business stakeholders.\n", "* Deploy automated tests to eliminate regressions, errors & biases.\n", "\n", "🏡 [Website](https://giskard.ai/)\n", "\n", "📗 [Documentation](https://docs.giskard.ai/)" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "f35c8e8d3fbf4c0f9c01a69673c318a1", "deepnote_app_coordinates": { "h": 5, "w": 12, "x": 0, "y": 6 }, "deepnote_cell_height": 110, "deepnote_cell_type": "markdown", "id": "mJTqM-W_7xbW", "owner_user_id": "41ec0844-b5b7-49c2-9460-710a452f98de", "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "# Telco customer churn data\n", "\n", "\n", "In this notebook we explore how to predict customer churn, a critical factor for telecommunication companies to be able to effectively retain customers. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing `giskard` and `lightgbm`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install giskard lightgbm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect the external worker in daemon mode" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!giskard worker start -d" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "e8d609f32d5243dd917cc3104599b8d8", "deepnote_app_coordinates": { "h": 5, "w": 12, "x": 0, "y": 12 }, "deepnote_cell_height": 230, "deepnote_cell_type": "markdown", "id": "WNI85koE7xbX", "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "## 1. Data Reading" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.dummy import DummyClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.metrics import accuracy_score\n", "import lightgbm as lbt\n", "\n", "random_seed=123" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# import telecom dataset into a pandas data frame\n", "\n", "dataset_url=\"https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/WA_Fn-UseC_-Telco-Customer-Churn.csv\"\n", "\n", "df_telco=pd.read_csv(dataset_url)\n", "\n", "# check unique values of each column\n", "#for column in df_telco.columns:\n", "# print('Column: {} - Unique Values: {}'.format(column, df_telco[column].unique()))\n", "\n", "# summary of the data frame\n", "#df_telco.info()\n", "\n", "# transform the column 
TotalCharges into a numeric data type\n", "df_telco['TotalCharges'] = pd.to_numeric(df_telco['TotalCharges'], errors='coerce')\n", "\n", "# drop observations with null values\n", "df_telco.dropna(inplace=True)\n", "\n", "# drop the customerID column from the dataset\n", "df_telco.drop(columns='customerID', inplace=True)\n", "\n", "# remove (automatic) from payment method names\n", "df_telco['PaymentMethod'] = df_telco['PaymentMethod'].str.replace(' (automatic)', '', regex=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Initialising feature names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Declare the type of each column in the dataset(example: category, numeric, text)\n", "column_types = {'gender': \"category\",\n", " 'SeniorCitizen': \"category\", \n", " 'Partner': \"category\", \n", " 'Dependents': \"category\", \n", " 'tenure': \"numeric\",\n", " 'PhoneService': \"category\", \n", " 'MultipleLines': \"category\", \n", " 'InternetService': \"category\", \n", " 'OnlineSecurity': \"category\",\n", " 'OnlineBackup': \"category\", \n", " 'DeviceProtection': \"category\", \n", " 'TechSupport': \"category\", \n", " 'StreamingTV': \"category\",\n", " 'StreamingMovies': \"category\", \n", " 'Contract': \"category\", \n", " 'PaperlessBilling': \"category\", \n", " 'PaymentMethod': \"category\",\n", " 'MonthlyCharges': \"numeric\", \n", " 'TotalCharges': \"numeric\", \n", " 'Churn': \"category\"}\n", "\n", "# feature_types is used to declare the features the model is trained on\n", "feature_types = {i:column_types[i] for i in column_types if i!='Churn'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Setting up Feature Engineering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn import model_selection\n", "\n", "\n", "columns_to_scale = [key for key in feature_types.keys() if feature_types[key]==\"numeric\"]\n", "\n", "columns_to_encode = [key for key in feature_types.keys() if feature_types[key]==\"category\"]\n", "\n", "# Perform preprocessing of the columns with the above pipelines\n", "preprocessor = ColumnTransformer(\n", " transformers=[\n", " ('num', StandardScaler(), columns_to_scale),\n", " ('cat', OneHotEncoder(handle_unknown='ignore',drop='first'), columns_to_encode)\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Data splitting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select independent variables\n", "X = df_telco.drop(columns='Churn')\n", "\n", "# select dependent variables\n", "y = df_telco.loc[:, 'Churn']\n", "\n", "# split the data in training and testing sets\n", "X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.25, random_state=random_seed, shuffle=True)\n", "# Prepare data to upload on Giskard\n", "train_data = pd.concat([X_train, Y_train], axis=1)\n", "test_data = pd.concat([X_test, Y_test ], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Pipelines and Models Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models = {}\n", "models['dummy_classifier']= {\"model\": DummyClassifier(random_state=random_seed, strategy='most_frequent'), \"accuracy\":0} \n", "models['k_nearest_neighbors']= {\"model\": KNeighborsClassifier(), \"accuracy\":0} \n", "models['logistic_regression']= {\"model\": LogisticRegression(random_state=random_seed,max_iter=150), \"accuracy\":0} \n", "models['random_forest']= {\"model\": RandomForestClassifier(random_state=random_seed), \"accuracy\":0} \n", "models['gradient_boosting']= {\"model\": GradientBoostingClassifier(random_state=random_seed), \"accuracy\":0} \n", "models['LGBM']= {\"model\": lbt.LGBMClassifier(random_state=random_seed), \"accuracy\":0} \n", " \n", "\n", "# test the accuracy of each model using default hyperparameters\n", "scoring = 'accuracy'\n", "for name in models.keys():\n", " models[name]['model']= Pipeline(steps=[('preprocessor', preprocessor), ('classifier', models[name]['model'])])\n", " \n", " # fit the model with the training data\n", " models[name]['model'].fit(X_train, Y_train)\n", " # make predictions with the testing data\n", " predictions = models[name]['model'].predict(X_test)\n", " # calculate accuracy \n", " accuracy = accuracy_score(Y_test, predictions)\n", " # append the model name and the accuracy to the lists\n", " models[name]['accuracy']=accuracy\n", " # print classifier accuracy\n", " print('Classifier: {}, Accuracy: {}'.format(name, accuracy))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Upload the models in Giskard 🚀🚀🚀" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Initiate a project" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from giskard import GiskardClient\n", "\n", 
"url = \"http://localhost:19000\" #if Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)\n", "#url = \"http://app.giskard.ai\" # If you want to upload on giskard URL\n", "token = \"YOUR GENERATED TOKEN\"\n", "client = GiskardClient(url, token)\n", "\n", "# your_project = client.create_project(\"project_key\", \"PROJECT_NAME\", \"DESCRIPTION\")\n", "# Choose the arguments you want. But \"project_key\" should be unique and in lower case\n", "churn_analysis_with_tfs = client.create_project(\"churn_analysis_with_tfs\", \"Telco Kaggle Churn Analysis\", \"Project to predict if a customer quits\")\n", "\n", "# If you've already created a project with the key \"churn-analysis\" use\n", "#churn_analysis = client.get_project(\"churn_analysis\")\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Upload a specific model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" }, "scrolled": true }, "outputs": [], "source": [ "churn_analysis_with_tfs.upload_model_and_df(\n", " prediction_function=models['dummy_classifier']['model'].predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model\n", " model_type='classification', # \"classification\" for classification model OR \"regression\" for regression model\n", " df=test_data, # the dataset you want to use to inspect your model\n", " column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values\n", " target='Churn', # The column name in df corresponding to the actual target variable (ground truth).\n", " feature_names=list(feature_types.keys()), # List of the feature names of prediction_function\n", " 
classification_labels=[\"No\",\"Yes\"] , # List of the classification labels of your prediction #TODO: Check their order!!!!!\n", " model_name='dummy_classifier', # Name of the model\n", " dataset_name='test_data' # Name of the dataset\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload more models" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for name in models.keys():\n", " if name=='dummy_classifier': continue\n", " churn_analysis_with_tfs.upload_model(\n", " prediction_function=models[name]['model'].predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model\n", " model_type='classification', # \"classification\" for classification model OR \"regression\" for regression model\n", " feature_names=list(feature_types.keys()), # List of the feature names of prediction_function\n", " name=name, # Name of the model\n", " target=\"Churn\", # Optional. target should be a column of validate_df. 
Pass this parameter only if validate_df is being passed\n", " classification_labels=[\"No\",\"Yes\"] # List of the classification labels of your prediction\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload more datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "churn_analysis_with_tfs.upload_df(\n", " df=train_data, # The dataset you want to upload\n", " column_types=column_types, # All the column types of df\n", " target=\"Churn\", # Do not pass this parameter if dataset doesn't contain target column\n", " name=\"train_data\" # Name of the dataset\n", ")" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "German_credit_scoring_giskard (2).ipynb", "provenance": [] }, "deepnote": { "is_reactive": false }, "deepnote_app_layout": "article", "deepnote_execution_queue": [], "deepnote_notebook_id": "6e7ea85d-f19e-4d05-90a4-44b7668fd037", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" } }, "nbformat": 4, "nbformat_minor": 1 }