Commit

Add files via upload
Santosh2611 committed Jan 11, 2024
1 parent e3b18b6 commit 5b0545c
Showing 1 changed file with 371 additions and 0 deletions.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "5r5cGN7gxaIk"
},
"source": [
"This code is a Python script that performs various tasks related to model training, evaluation, and visualization using machine learning libraries.\n",
"\n",
"Dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3zS-7fR-xlwx"
},
"source": [
"**Importing Libraries**: This section imports various libraries such as pandas for data manipulation, scikit-learn for machine learning models and model evaluation, matplotlib for plotting, and datetime for handling date and time data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ciHCxQGiOB-K",
"outputId": "ced0a434-4805-4fa5-d27a-4568922b0274"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
]
}
],
"source": [
"# Import required libraries\n",
"from datetime import datetime\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.svm import SVC\n",
"from sklearn.model_selection import train_test_split, cross_val_score\n",
"from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, roc_curve, auc\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YJkOrrcOxzSO"
},
"source": [
"**load_and_process_data Function**: This function loads a dataset from a file path, drops any duplicate entries, and returns the processed DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "T4N5xa6vx72y"
},
"outputs": [],
"source": [
"# Load the training dataset and drop duplicate entries\n",
"def load_and_process_data(file_path):\n",
" df = pd.read_csv(file_path).drop_duplicates()\n",
" return df"
]
},
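{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal usage sketch (the path below is a placeholder, not the dataset's actual location): after loading, it helps to confirm the row count and columns before feature engineering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Usage sketch: 'fraudTrain.csv' is a placeholder path, not the real location\n",
"df = load_and_process_data('fraudTrain.csv')\n",
"print(df.shape)\n",
"print(df.columns.tolist())"
]
},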
{
"cell_type": "markdown",
"metadata": {
"id": "46POvOXLx_mx"
},
"source": [
"**add_time_related_features Function**: This function calculates the age based on the date of birth, and extracts additional time-related features such as hour, day of the week, and month from the transaction date and time."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "sbj0VfwLx-vp"
},
"outputs": [],
"source": [
"# Feature Engineering\n",
"def add_time_related_features(df):\n",
" # Calculate age based on date of birth\n",
" df['age'] = datetime.now().year - pd.to_datetime(df['dob']).dt.year\n",
" # Extract hour, day of the week, and month from transaction date and time\n",
" df['hour'] = pd.to_datetime(df['trans_date_trans_time']).dt.hour\n",
" df['day'] = pd.to_datetime(df['trans_date_trans_time']).dt.dayofweek\n",
" df['month'] = pd.to_datetime(df['trans_date_trans_time']).dt.month\n",
" return df"
]
},
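{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick self-contained check of `add_time_related_features` on a hand-made two-row frame (the dates are illustrative, not from the dataset). Note the age is a plain year difference, so it can be off by one for anyone whose birthday has not yet occurred this year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example with made-up rows to illustrate the derived columns\n",
"demo = pd.DataFrame({\n",
" 'dob': ['1988-03-15', '1975-11-02'],\n",
" 'trans_date_trans_time': ['2019-06-21 14:35:00', '2019-06-22 03:10:00']\n",
"})\n",
"print(add_time_related_features(demo)[['age', 'hour', 'day', 'month']])"
]
},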
{
"cell_type": "markdown",
"metadata": {
"id": "mH3WdjEvyFUf"
},
"source": [
"**prepare_train_test_data Function**: This function prepares the features and target variable for model training, converts categorical 'category' column to dummy variables, and splits the data into training and testing sets using the `train_test_split` function from scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "DPtrXpcryIZB"
},
"outputs": [],
"source": [
"# Model Training and Testing\n",
"def prepare_train_test_data(df, target_column):\n",
" # Prepare features and target variable for model training\n",
" train = df[['category', 'amt', 'zip', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'age', 'hour', 'day', 'month', target_column]]\n",
" # Convert categorical 'category' column to dummy variables\n",
" train = pd.get_dummies(train, columns=['category'], drop_first=True)\n",
" # Split the data into training and testing sets\n",
" y_train = train[target_column].values\n",
" X_train = train.drop(target_column, axis=1).values\n",
" X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)\n",
" return X_train, X_test, y_train, y_test"
]
},
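{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fraud is a rare class here (in the outputs below, 1,520 of 259,335 test rows are positive, roughly 0.6%), so a stratified split keeps that ratio consistent across the two sets. A minimal variant sketch, not the original code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Variant sketch: identical preparation, but stratified on the target so\n",
"# both splits keep the ~0.6% fraud ratio (an alternative, not the original)\n",
"def prepare_train_test_data_stratified(df, target_column):\n",
" train = df[['category', 'amt', 'zip', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'age', 'hour', 'day', 'month', target_column]]\n",
" train = pd.get_dummies(train, columns=['category'], drop_first=True)\n",
" y = train[target_column].values\n",
" X = train.drop(target_column, axis=1).values\n",
" return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)"
]
},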
{
"cell_type": "markdown",
"metadata": {
"id": "1JTCqGp7yLLk"
},
"source": [
"**train_and_evaluate_model Function**: This function trains a given model, makes predictions, prints the classification report and confusion matrix, calculates precision, recall, and F1-score, and plots the Receiver Operating Characteristic (ROC) curve with the Area Under the Curve (AUC) value."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "MKXbZ7zxyOjl"
},
"outputs": [],
"source": [
"# Function for model training and testing\n",
"def train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test):\n",
" # Train the model and make predictions\n",
" model.fit(X_train, y_train)\n",
" predicted = model.predict(X_test)\n",
"\n",
" # Print classification report and confusion matrix\n",
" print('\\n\\n' + model_name + ' - Classification Report:\\n', classification_report(y_test, predicted))\n",
" conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)\n",
" print(model_name + ' - Confusion Matrix:\\n', conf_mat)\n",
"\n",
" # Perform cross-validation and calculate precision, recall, and F1-score\n",
" cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')\n",
" print(\"\\nCross-Validation Scores for \" + model_name + \":\", cv_scores)\n",
"\n",
" precision, recall, fscore, _ = precision_recall_fscore_support(y_test, predicted, average='binary')\n",
" print(\"\\nPrecision:\", precision)\n",
" print(\"Recall:\", recall)\n",
" print(\"F1-Score:\", fscore)\n",
"\n",
" # Plot ROC curve and calculate AUC\n",
" fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:,1])\n",
" roc_auc = auc(fpr, tpr)\n",
" plt.plot(fpr, tpr, label = model_name + ' (AUC = %0.2f)' % roc_auc)"
]
},
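{
"cell_type": "markdown",
"metadata": {},
"source": [
"`train_and_evaluate_model` draws onto whichever matplotlib figure is current, so it can also be run standalone for a single model. A sketch, assuming the split produced in the main cell below is available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standalone run for one model (assumes X_train, X_test, y_train, y_test\n",
"# already exist from prepare_train_test_data in the main cell below)\n",
"plt.figure(figsize=(8, 6))\n",
"train_and_evaluate_model(RandomForestClassifier(random_state=5), 'Random Forest', X_train, X_test, y_train, y_test)\n",
"plt.legend(loc='lower right')\n",
"plt.show()"
]
},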
{
"cell_type": "markdown",
"metadata": {
"id": "971rjOQiyRXN"
},
"source": [
"**evaluate_models Function**: This function creates a plot for the ROC curve and iterates through each model, calling the `train_and_evaluate_model` function to train, evaluate, and visualize the performance of each model."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "L3e0aBb4yT3s"
},
"outputs": [],
"source": [
"# Create models and evaluate them\n",
"def evaluate_models(models, X_train, X_test, y_train, y_test):\n",
" # Initialize a plot for ROC curve\n",
" plt.figure(figsize=(10, 8))\n",
" # Iterate through each model, train and evaluate it\n",
" for model_name, model in models.items():\n",
" train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test)\n",
"\n",
" # Plot the ROC curve and display the graph\n",
" plt.plot([0, 1], [0, 1],'r--')\n",
" plt.xlim([0, 1])\n",
" plt.ylim([0, 1])\n",
" plt.xlabel('False Positive Rate')\n",
" plt.ylabel('True Positive Rate')\n",
" plt.title('Receiver Operating Characteristic')\n",
" plt.legend(loc=\"lower right\")\n",
" plt.show()"
]
},
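{
"cell_type": "markdown",
"metadata": {},
"source": [
"On a dataset of this size, `SVC` in particular can be very slow to fit; one way to sanity-check the pipeline before the full run is a quick pass over the cheaper models only (a sketch, again assuming the split from the main cell below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional quick pass with fast models only; the full dictionary of seven\n",
"# models is evaluated in the main cell below\n",
"quick_models = {\n",
" 'Logistic Regression': LogisticRegression(max_iter=1000),\n",
" 'Naive Bayes': GaussianNB()\n",
"}\n",
"evaluate_models(quick_models, X_train, X_test, y_train, y_test)"
]
},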
{
"cell_type": "markdown",
"metadata": {
"id": "1oyvMTI0yW1L"
},
"source": [
"**Main Code Execution**: In the main part of the code, it loads and processes the dataset, adds time-related features, prepares the train and test data, defines various machine learning models, and evaluates them using the previously prepared data by calling the `evaluate_models` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TooV6p3nyYoz",
"outputId": "30fd22e5-69a0-4a72-8d72-9994daf8eb2c"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"\n",
"Logistic Regression - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 0.99 1.00 1.00 257815\n",
" 1 0.00 0.00 0.00 1520\n",
"\n",
" accuracy 0.99 259335\n",
" macro avg 0.50 0.50 0.50 259335\n",
"weighted avg 0.99 0.99 0.99 259335\n",
"\n",
"Logistic Regression - Confusion Matrix:\n",
" [[257671 144]\n",
" [ 1520 0]]\n",
"\n",
"Cross-Validation Scores for Logistic Regression: [0.99357973 0.99363757 0.99355563 0.99381109 0.99376771]\n",
"\n",
"Precision: 0.0\n",
"Recall: 0.0\n",
"F1-Score: 0.0\n",
"\n",
"\n",
"Random Forest - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 257815\n",
" 1 0.97 0.74 0.84 1520\n",
"\n",
" accuracy 1.00 259335\n",
" macro avg 0.99 0.87 0.92 259335\n",
"weighted avg 1.00 1.00 1.00 259335\n",
"\n",
"Random Forest - Confusion Matrix:\n",
" [[257782 33]\n",
" [ 388 1132]]\n",
"\n",
"Cross-Validation Scores for Random Forest: [0.99846241 0.99830817 0.99812019 0.99820213 0.99835155]\n",
"\n",
"Precision: 0.9716738197424892\n",
"Recall: 0.7447368421052631\n",
"F1-Score: 0.8432029795158287\n",
"\n",
"\n",
"K-Nearest Neighbors - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 257815\n",
" 1 0.62 0.35 0.45 1520\n",
"\n",
" accuracy 0.99 259335\n",
" macro avg 0.81 0.67 0.72 259335\n",
"weighted avg 0.99 0.99 0.99 259335\n",
"\n",
"K-Nearest Neighbors - Confusion Matrix:\n",
" [[257498 317]\n",
" [ 993 527]]\n",
"\n",
"Cross-Validation Scores for K-Nearest Neighbors: [0.99486186 0.99467388 0.99491488 0.99489078 0.99480884]\n",
"\n",
"Precision: 0.6244075829383886\n",
"Recall: 0.34671052631578947\n",
"F1-Score: 0.4458544839255499\n",
"\n",
"\n",
"Naive Bayes - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 0.99 0.99 257815\n",
" 1 0.24 0.49 0.32 1520\n",
"\n",
" accuracy 0.99 259335\n",
" macro avg 0.62 0.74 0.66 259335\n",
"weighted avg 0.99 0.99 0.99 259335\n",
"\n",
"Naive Bayes - Confusion Matrix:\n",
" [[255472 2343]\n",
" [ 778 742]]\n"
]
}
],
"source": [
"# Main code execution\n",
"file_path = '/content/drive/MyDrive/Colab Notebooks/Neuronexus Innovations/Neuronexus Innovations - Machine Learning/Credit Card Fraud Detection/fraudTrain.csv'\n",
"df = load_and_process_data(file_path)\n",
"df = add_time_related_features(df)\n",
"\n",
"target_column = 'is_fraud'\n",
"X_train, X_test, y_train, y_test = prepare_train_test_data(df, target_column)\n",
"\n",
"# Define models and evaluate them using the prepared data\n",
"models = {\n",
" 'Logistic Regression': LogisticRegression(max_iter=1000),\n",
" 'Random Forest': RandomForestClassifier(random_state=5),\n",
" 'K-Nearest Neighbors': KNeighborsClassifier(),\n",
" 'Naive Bayes': GaussianNB(),\n",
" 'Decision Tree (Gini)': DecisionTreeClassifier(criterion='gini', random_state=5),\n",
" 'Decision Tree (Entropy)': DecisionTreeClassifier(criterion='entropy', random_state=5),\n",
" 'Support Vector Machine': SVC(probability=True, random_state=5)\n",
"}\n",
"\n",
"evaluate_models(models, X_train, X_test, y_train, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2EzYVulnybvD"
},
"source": [
"Overall, this code demonstrates a pipeline for model training, evaluation, and visualization for a fraud detection task using various machine learning algorithms."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
