-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
e3b18b6
commit 5b0545c
Showing
1 changed file
with
371 additions
and
0 deletions.
There are no files selected for viewing
371 changes: 371 additions & 0 deletions
371
...ovations - Machine Learning/Credit Card Fraud Detection/Credit_Card_Fraud_Detection.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,371 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "5r5cGN7gxaIk" | ||
}, | ||
"source": [ | ||
"This code is a Python script that performs various tasks related to model training, evaluation, and visualization using machine learning libraries.\n", | ||
"\n", | ||
"Dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "3zS-7fR-xlwx" | ||
}, | ||
"source": [ | ||
"**Importing Libraries**: This section imports various libraries such as pandas for data manipulation, scikit-learn for machine learning models and model evaluation, matplotlib for plotting, and datetime for handling date and time data." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"colab": { | ||
"base_uri": "https://localhost:8080/" | ||
}, | ||
"id": "ciHCxQGiOB-K", | ||
"outputId": "ced0a434-4805-4fa5-d27a-4568922b0274" | ||
}, | ||
"outputs": [ | ||
{ | ||
"output_type": "stream", | ||
"name": "stdout", | ||
"text": [ | ||
"Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Import required libraries\n", | ||
"from datetime import datetime\n", | ||
"import matplotlib.pyplot as plt\n", | ||
"import pandas as pd\n", | ||
"from sklearn.linear_model import LogisticRegression\n", | ||
"from sklearn.ensemble import RandomForestClassifier\n", | ||
"from sklearn.neighbors import KNeighborsClassifier\n", | ||
"from sklearn.naive_bayes import GaussianNB\n", | ||
"from sklearn.tree import DecisionTreeClassifier\n", | ||
"from sklearn.svm import SVC\n", | ||
"from sklearn.model_selection import train_test_split, cross_val_score\n", | ||
"from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, roc_curve, auc\n", | ||
"from google.colab import drive\n", | ||
"drive.mount('/content/drive')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "YJkOrrcOxzSO" | ||
}, | ||
"source": [ | ||
"**load_and_process_data Function**: This function loads a dataset from a file path, drops any duplicate entries, and returns the processed DataFrame." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": { | ||
"id": "T4N5xa6vx72y" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Load the training dataset and drop duplicate entries\n", | ||
"def load_and_process_data(file_path):\n", | ||
" df = pd.read_csv(file_path).drop_duplicates()\n", | ||
" return df" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "46POvOXLx_mx" | ||
}, | ||
"source": [ | ||
"**add_time_related_features Function**: This function calculates the age based on the date of birth, and extracts additional time-related features such as hour, day of the week, and month from the transaction date and time." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": { | ||
"id": "sbj0VfwLx-vp" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Feature Engineering\n", | ||
"def add_time_related_features(df):\n", | ||
" # Calculate age based on date of birth\n", | ||
" df['age'] = datetime.now().year - pd.to_datetime(df['dob']).dt.year\n", | ||
" # Extract hour, day of the week, and month from transaction date and time\n", | ||
" df['hour'] = pd.to_datetime(df['trans_date_trans_time']).dt.hour\n", | ||
" df['day'] = pd.to_datetime(df['trans_date_trans_time']).dt.dayofweek\n", | ||
" df['month'] = pd.to_datetime(df['trans_date_trans_time']).dt.month\n", | ||
" return df" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "mH3WdjEvyFUf" | ||
}, | ||
"source": [ | ||
"**prepare_train_test_data Function**: This function prepares the features and target variable for model training, converts categorical 'category' column to dummy variables, and splits the data into training and testing sets using the `train_test_split` function from scikit-learn." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": { | ||
"id": "DPtrXpcryIZB" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Model Training and Testing\n", | ||
"def prepare_train_test_data(df, target_column):\n", | ||
" # Prepare features and target variable for model training\n", | ||
" train = df[['category', 'amt', 'zip', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'age', 'hour', 'day', 'month', target_column]]\n", | ||
" # Convert categorical 'category' column to dummy variables\n", | ||
" train = pd.get_dummies(train, columns=['category'], drop_first=True)\n", | ||
" # Split the data into training and testing sets\n", | ||
" y_train = train[target_column].values\n", | ||
" X_train = train.drop(target_column, axis=1).values\n", | ||
" X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)\n", | ||
" return X_train, X_test, y_train, y_test" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "1JTCqGp7yLLk" | ||
}, | ||
"source": [ | ||
"**train_and_evaluate_model Function**: This function trains a given model, makes predictions, prints the classification report and confusion matrix, calculates precision, recall, and F1-score, and plots the Receiver Operating Characteristic (ROC) curve with the Area Under the Curve (AUC) value." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": { | ||
"id": "MKXbZ7zxyOjl" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Function for model training and testing\n", | ||
"def train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test):\n", | ||
" # Train the model and make predictions\n", | ||
" model.fit(X_train, y_train)\n", | ||
" predicted = model.predict(X_test)\n", | ||
"\n", | ||
" # Print classification report and confusion matrix\n", | ||
" print('\\n\\n' + model_name + ' - Classification Report:\\n', classification_report(y_test, predicted))\n", | ||
" conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)\n", | ||
" print(model_name + ' - Confusion Matrix:\\n', conf_mat)\n", | ||
"\n", | ||
" # Perform cross-validation and calculate precision, recall, and F1-score\n", | ||
" cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')\n", | ||
" print(\"\\nCross-Validation Scores for \" + model_name + \":\", cv_scores)\n", | ||
"\n", | ||
" precision, recall, fscore, _ = precision_recall_fscore_support(y_test, predicted, average='binary')\n", | ||
" print(\"\\nPrecision:\", precision)\n", | ||
" print(\"Recall:\", recall)\n", | ||
" print(\"F1-Score:\", fscore)\n", | ||
"\n", | ||
" # Plot ROC curve and calculate AUC\n", | ||
" fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:,1])\n", | ||
" roc_auc = auc(fpr, tpr)\n", | ||
" plt.plot(fpr, tpr, label = model_name + ' (AUC = %0.2f)' % roc_auc)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "971rjOQiyRXN" | ||
}, | ||
"source": [ | ||
"**evaluate_models Function**: This function creates a plot for the ROC curve and iterates through each model, calling the `train_and_evaluate_model` function to train, evaluate, and visualize the performance of each model." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": { | ||
"id": "L3e0aBb4yT3s" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Create models and evaluate them\n", | ||
"def evaluate_models(models, X_train, X_test, y_train, y_test):\n", | ||
" # Initialize a plot for ROC curve\n", | ||
" plt.figure(figsize=(10, 8))\n", | ||
" # Iterate through each model, train and evaluate it\n", | ||
" for model_name, model in models.items():\n", | ||
" train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test)\n", | ||
"\n", | ||
" # Plot the ROC curve and display the graph\n", | ||
" plt.plot([0, 1], [0, 1],'r--')\n", | ||
" plt.xlim([0, 1])\n", | ||
" plt.ylim([0, 1])\n", | ||
" plt.xlabel('False Positive Rate')\n", | ||
" plt.ylabel('True Positive Rate')\n", | ||
" plt.title('Receiver Operating Characteristic')\n", | ||
" plt.legend(loc=\"lower right\")\n", | ||
" plt.show()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "1oyvMTI0yW1L" | ||
}, | ||
"source": [ | ||
"**Main Code Execution**: In the main part of the code, it loads and processes the dataset, adds time-related features, prepares the train and test data, defines various machine learning models, and evaluates them using the previously prepared data by calling the `evaluate_models` function." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"colab": { | ||
"base_uri": "https://localhost:8080/" | ||
}, | ||
"id": "TooV6p3nyYoz", | ||
"outputId": "30fd22e5-69a0-4a72-8d72-9994daf8eb2c" | ||
}, | ||
"outputs": [ | ||
{ | ||
"output_type": "stream", | ||
"name": "stdout", | ||
"text": [ | ||
"\n", | ||
"\n", | ||
"Logistic Regression - Classification Report:\n", | ||
" precision recall f1-score support\n", | ||
"\n", | ||
" 0 0.99 1.00 1.00 257815\n", | ||
" 1 0.00 0.00 0.00 1520\n", | ||
"\n", | ||
" accuracy 0.99 259335\n", | ||
" macro avg 0.50 0.50 0.50 259335\n", | ||
"weighted avg 0.99 0.99 0.99 259335\n", | ||
"\n", | ||
"Logistic Regression - Confusion Matrix:\n", | ||
" [[257671 144]\n", | ||
" [ 1520 0]]\n", | ||
"\n", | ||
"Cross-Validation Scores for Logistic Regression: [0.99357973 0.99363757 0.99355563 0.99381109 0.99376771]\n", | ||
"\n", | ||
"Precision: 0.0\n", | ||
"Recall: 0.0\n", | ||
"F1-Score: 0.0\n", | ||
"\n", | ||
"\n", | ||
"Random Forest - Classification Report:\n", | ||
" precision recall f1-score support\n", | ||
"\n", | ||
" 0 1.00 1.00 1.00 257815\n", | ||
" 1 0.97 0.74 0.84 1520\n", | ||
"\n", | ||
" accuracy 1.00 259335\n", | ||
" macro avg 0.99 0.87 0.92 259335\n", | ||
"weighted avg 1.00 1.00 1.00 259335\n", | ||
"\n", | ||
"Random Forest - Confusion Matrix:\n", | ||
" [[257782 33]\n", | ||
" [ 388 1132]]\n", | ||
"\n", | ||
"Cross-Validation Scores for Random Forest: [0.99846241 0.99830817 0.99812019 0.99820213 0.99835155]\n", | ||
"\n", | ||
"Precision: 0.9716738197424892\n", | ||
"Recall: 0.7447368421052631\n", | ||
"F1-Score: 0.8432029795158287\n", | ||
"\n", | ||
"\n", | ||
"K-Nearest Neighbors - Classification Report:\n", | ||
" precision recall f1-score support\n", | ||
"\n", | ||
" 0 1.00 1.00 1.00 257815\n", | ||
" 1 0.62 0.35 0.45 1520\n", | ||
"\n", | ||
" accuracy 0.99 259335\n", | ||
" macro avg 0.81 0.67 0.72 259335\n", | ||
"weighted avg 0.99 0.99 0.99 259335\n", | ||
"\n", | ||
"K-Nearest Neighbors - Confusion Matrix:\n", | ||
" [[257498 317]\n", | ||
" [ 993 527]]\n", | ||
"\n", | ||
"Cross-Validation Scores for K-Nearest Neighbors: [0.99486186 0.99467388 0.99491488 0.99489078 0.99480884]\n", | ||
"\n", | ||
"Precision: 0.6244075829383886\n", | ||
"Recall: 0.34671052631578947\n", | ||
"F1-Score: 0.4458544839255499\n", | ||
"\n", | ||
"\n", | ||
"Naive Bayes - Classification Report:\n", | ||
" precision recall f1-score support\n", | ||
"\n", | ||
" 0 1.00 0.99 0.99 257815\n", | ||
" 1 0.24 0.49 0.32 1520\n", | ||
"\n", | ||
" accuracy 0.99 259335\n", | ||
" macro avg 0.62 0.74 0.66 259335\n", | ||
"weighted avg 0.99 0.99 0.99 259335\n", | ||
"\n", | ||
"Naive Bayes - Confusion Matrix:\n", | ||
" [[255472 2343]\n", | ||
" [ 778 742]]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Main code execution\n", | ||
"file_path = '/content/drive/MyDrive/Colab Notebooks/Neuronexus Innovations/Neuronexus Innovations - Machine Learning/Credit Card Fraud Detection/fraudTrain.csv'\n", | ||
"df = load_and_process_data(file_path)\n", | ||
"df = add_time_related_features(df)\n", | ||
"\n", | ||
"target_column = 'is_fraud'\n", | ||
"X_train, X_test, y_train, y_test = prepare_train_test_data(df, target_column)\n", | ||
"\n", | ||
"# Define models and evaluate them using the prepared data\n", | ||
"models = {\n", | ||
" 'Logistic Regression': LogisticRegression(max_iter=1000),\n", | ||
" 'Random Forest': RandomForestClassifier(random_state=5),\n", | ||
" 'K-Nearest Neighbors': KNeighborsClassifier(),\n", | ||
" 'Naive Bayes': GaussianNB(),\n", | ||
" 'Decision Tree (Gini)': DecisionTreeClassifier(criterion='gini', random_state=5),\n", | ||
" 'Decision Tree (Entropy)': DecisionTreeClassifier(criterion='entropy', random_state=5),\n", | ||
" 'Support Vector Machine': SVC(probability=True, random_state=5)\n", | ||
"}\n", | ||
"\n", | ||
"evaluate_models(models, X_train, X_test, y_train, y_test)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "2EzYVulnybvD" | ||
}, | ||
"source": [ | ||
"Overall, this code demonstrates a pipeline for model training, evaluation, and visualization for a fraud detection task using various machine learning algorithms." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"provenance": [] | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |