Commit

Add files via upload
Santosh2611 committed Jan 11, 2024
1 parent e3b18b6 commit 5b0545c
Showing 1 changed file with 371 additions and 0 deletions.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "5r5cGN7gxaIk"
},
"source": [
"This code is a Python script that performs various tasks related to model training, evaluation, and visualization using machine learning libraries.\n",
"\n",
"Dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3zS-7fR-xlwx"
},
"source": [
"**Importing Libraries**: This section imports various libraries such as pandas for data manipulation, scikit-learn for machine learning models and model evaluation, matplotlib for plotting, and datetime for handling date and time data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ciHCxQGiOB-K",
"outputId": "ced0a434-4805-4fa5-d27a-4568922b0274"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
]
}
],
"source": [
"# Import required libraries\n",
"from datetime import datetime\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.svm import SVC\n",
"from sklearn.model_selection import train_test_split, cross_val_score\n",
"from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support, roc_curve, auc\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YJkOrrcOxzSO"
},
"source": [
"**load_and_process_data Function**: This function loads a dataset from a file path, drops any duplicate entries, and returns the processed DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "T4N5xa6vx72y"
},
"outputs": [],
"source": [
"# Load the training dataset and drop duplicate entries\n",
"def load_and_process_data(file_path):\n",
" df = pd.read_csv(file_path).drop_duplicates()\n",
" return df"
]
},
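{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal usage sketch (the path below is a placeholder, not the dataset's actual location): after loading, it helps to confirm the row count and columns before feature engineering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Usage sketch: 'fraudTrain.csv' is a placeholder path, not the real location\n",
"df = load_and_process_data('fraudTrain.csv')\n",
"print(df.shape)\n",
"print(df.columns.tolist())"
]
},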
{
"cell_type": "markdown",
"metadata": {
"id": "46POvOXLx_mx"
},
"source": [
"**add_time_related_features Function**: This function calculates the age based on the date of birth, and extracts additional time-related features such as hour, day of the week, and month from the transaction date and time."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "sbj0VfwLx-vp"
},
"outputs": [],
"source": [
"# Feature Engineering\n",
"def add_time_related_features(df):\n",
" # Calculate age based on date of birth\n",
" df['age'] = datetime.now().year - pd.to_datetime(df['dob']).dt.year\n",
" # Extract hour, day of the week, and month from transaction date and time\n",
" df['hour'] = pd.to_datetime(df['trans_date_trans_time']).dt.hour\n",
" df['day'] = pd.to_datetime(df['trans_date_trans_time']).dt.dayofweek\n",
" df['month'] = pd.to_datetime(df['trans_date_trans_time']).dt.month\n",
" return df"
]
},
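{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick self-contained check of `add_time_related_features` on a hand-made two-row frame (the dates are illustrative, not from the dataset). Note the age is a plain year difference, so it can be off by one for anyone whose birthday has not yet occurred this year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example with made-up rows to illustrate the derived columns\n",
"demo = pd.DataFrame({\n",
" 'dob': ['1988-03-15', '1975-11-02'],\n",
" 'trans_date_trans_time': ['2019-06-21 14:35:00', '2019-06-22 03:10:00']\n",
"})\n",
"print(add_time_related_features(demo)[['age', 'hour', 'day', 'month']])"
]
},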
{
"cell_type": "markdown",
"metadata": {
"id": "mH3WdjEvyFUf"
},
"source": [
"**prepare_train_test_data Function**: This function prepares the features and target variable for model training, converts categorical 'category' column to dummy variables, and splits the data into training and testing sets using the `train_test_split` function from scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "DPtrXpcryIZB"
},
"outputs": [],
"source": [
"# Model Training and Testing\n",
"def prepare_train_test_data(df, target_column):\n",
" # Prepare features and target variable for model training\n",
" train = df[['category', 'amt', 'zip', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'age', 'hour', 'day', 'month', target_column]]\n",
" # Convert categorical 'category' column to dummy variables\n",
" train = pd.get_dummies(train, columns=['category'], drop_first=True)\n",
" # Split the data into training and testing sets\n",
" y_train = train[target_column].values\n",
" X_train = train.drop(target_column, axis=1).values\n",
" X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)\n",
" return X_train, X_test, y_train, y_test"
]
},
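{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fraud is a rare class here (in the outputs below, 1,520 of 259,335 test rows are positive, roughly 0.6%), so a stratified split keeps that ratio consistent across the two sets. A minimal variant sketch, not the original code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Variant sketch: identical preparation, but stratified on the target so\n",
"# both splits keep the ~0.6% fraud ratio (an alternative, not the original)\n",
"def prepare_train_test_data_stratified(df, target_column):\n",
" train = df[['category', 'amt', 'zip', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'age', 'hour', 'day', 'month', target_column]]\n",
" train = pd.get_dummies(train, columns=['category'], drop_first=True)\n",
" y = train[target_column].values\n",
" X = train.drop(target_column, axis=1).values\n",
" return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)"
]
},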
{
"cell_type": "markdown",
"metadata": {
"id": "1JTCqGp7yLLk"
},
"source": [
"**train_and_evaluate_model Function**: This function trains a given model, makes predictions, prints the classification report and confusion matrix, calculates precision, recall, and F1-score, and plots the Receiver Operating Characteristic (ROC) curve with the Area Under the Curve (AUC) value."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "MKXbZ7zxyOjl"
},
"outputs": [],
"source": [
"# Function for model training and testing\n",
"def train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test):\n",
" # Train the model and make predictions\n",
" model.fit(X_train, y_train)\n",
" predicted = model.predict(X_test)\n",
"\n",
" # Print classification report and confusion matrix\n",
" print('\\n\\n' + model_name + ' - Classification Report:\\n', classification_report(y_test, predicted))\n",
" conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)\n",
" print(model_name + ' - Confusion Matrix:\\n', conf_mat)\n",
"\n",
" # Perform cross-validation and calculate precision, recall, and F1-score\n",
" cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')\n",
" print(\"\\nCross-Validation Scores for \" + model_name + \":\", cv_scores)\n",
"\n",
" precision, recall, fscore, _ = precision_recall_fscore_support(y_test, predicted, average='binary')\n",
" print(\"\\nPrecision:\", precision)\n",
" print(\"Recall:\", recall)\n",
" print(\"F1-Score:\", fscore)\n",
"\n",
" # Plot ROC curve and calculate AUC\n",
" fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:,1])\n",
" roc_auc = auc(fpr, tpr)\n",
" plt.plot(fpr, tpr, label = model_name + ' (AUC = %0.2f)' % roc_auc)"
]
},
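{
"cell_type": "markdown",
"metadata": {},
"source": [
"`train_and_evaluate_model` draws onto whichever matplotlib figure is current, so it can also be run standalone for a single model. A sketch, assuming the split produced in the main cell below is available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standalone run for one model (assumes X_train, X_test, y_train, y_test\n",
"# already exist from prepare_train_test_data in the main cell below)\n",
"plt.figure(figsize=(8, 6))\n",
"train_and_evaluate_model(RandomForestClassifier(random_state=5), 'Random Forest', X_train, X_test, y_train, y_test)\n",
"plt.legend(loc='lower right')\n",
"plt.show()"
]
},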
{
"cell_type": "markdown",
"metadata": {
"id": "971rjOQiyRXN"
},
"source": [
"**evaluate_models Function**: This function creates a plot for the ROC curve and iterates through each model, calling the `train_and_evaluate_model` function to train, evaluate, and visualize the performance of each model."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "L3e0aBb4yT3s"
},
"outputs": [],
"source": [
"# Create models and evaluate them\n",
"def evaluate_models(models, X_train, X_test, y_train, y_test):\n",
" # Initialize a plot for ROC curve\n",
" plt.figure(figsize=(10, 8))\n",
" # Iterate through each model, train and evaluate it\n",
" for model_name, model in models.items():\n",
" train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test)\n",
"\n",
" # Plot the ROC curve and display the graph\n",
" plt.plot([0, 1], [0, 1],'r--')\n",
" plt.xlim([0, 1])\n",
" plt.ylim([0, 1])\n",
" plt.xlabel('False Positive Rate')\n",
" plt.ylabel('True Positive Rate')\n",
" plt.title('Receiver Operating Characteristic')\n",
" plt.legend(loc=\"lower right\")\n",
" plt.show()"
]
},
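{
"cell_type": "markdown",
"metadata": {},
"source": [
"On a dataset of this size, `SVC` in particular can be very slow to fit; one way to sanity-check the pipeline before the full run is a quick pass over the cheaper models only (a sketch, again assuming the split from the main cell below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional quick pass with fast models only; the full dictionary of seven\n",
"# models is evaluated in the main cell below\n",
"quick_models = {\n",
" 'Logistic Regression': LogisticRegression(max_iter=1000),\n",
" 'Naive Bayes': GaussianNB()\n",
"}\n",
"evaluate_models(quick_models, X_train, X_test, y_train, y_test)"
]
},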
{
"cell_type": "markdown",
"metadata": {
"id": "1oyvMTI0yW1L"
},
"source": [
"**Main Code Execution**: In the main part of the code, it loads and processes the dataset, adds time-related features, prepares the train and test data, defines various machine learning models, and evaluates them using the previously prepared data by calling the `evaluate_models` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TooV6p3nyYoz",
"outputId": "30fd22e5-69a0-4a72-8d72-9994daf8eb2c"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"\n",
"Logistic Regression - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 0.99 1.00 1.00 257815\n",
" 1 0.00 0.00 0.00 1520\n",
"\n",
" accuracy 0.99 259335\n",
" macro avg 0.50 0.50 0.50 259335\n",
"weighted avg 0.99 0.99 0.99 259335\n",
"\n",
"Logistic Regression - Confusion Matrix:\n",
" [[257671 144]\n",
" [ 1520 0]]\n",
"\n",
"Cross-Validation Scores for Logistic Regression: [0.99357973 0.99363757 0.99355563 0.99381109 0.99376771]\n",
"\n",
"Precision: 0.0\n",
"Recall: 0.0\n",
"F1-Score: 0.0\n",
"\n",
"\n",
"Random Forest - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 257815\n",
" 1 0.97 0.74 0.84 1520\n",
"\n",
" accuracy 1.00 259335\n",
" macro avg 0.99 0.87 0.92 259335\n",
"weighted avg 1.00 1.00 1.00 259335\n",
"\n",
"Random Forest - Confusion Matrix:\n",
" [[257782 33]\n",
" [ 388 1132]]\n",
"\n",
"Cross-Validation Scores for Random Forest: [0.99846241 0.99830817 0.99812019 0.99820213 0.99835155]\n",
"\n",
"Precision: 0.9716738197424892\n",
"Recall: 0.7447368421052631\n",
"F1-Score: 0.8432029795158287\n",
"\n",
"\n",
"K-Nearest Neighbors - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 257815\n",
" 1 0.62 0.35 0.45 1520\n",
"\n",
" accuracy 0.99 259335\n",
" macro avg 0.81 0.67 0.72 259335\n",
"weighted avg 0.99 0.99 0.99 259335\n",
"\n",
"K-Nearest Neighbors - Confusion Matrix:\n",
" [[257498 317]\n",
" [ 993 527]]\n",
"\n",
"Cross-Validation Scores for K-Nearest Neighbors: [0.99486186 0.99467388 0.99491488 0.99489078 0.99480884]\n",
"\n",
"Precision: 0.6244075829383886\n",
"Recall: 0.34671052631578947\n",
"F1-Score: 0.4458544839255499\n",
"\n",
"\n",
"Naive Bayes - Classification Report:\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 0.99 0.99 257815\n",
" 1 0.24 0.49 0.32 1520\n",
"\n",
" accuracy 0.99 259335\n",
" macro avg 0.62 0.74 0.66 259335\n",
"weighted avg 0.99 0.99 0.99 259335\n",
"\n",
"Naive Bayes - Confusion Matrix:\n",
" [[255472 2343]\n",
" [ 778 742]]\n"
]
}
],
"source": [
"# Main code execution\n",
"file_path = '/content/drive/MyDrive/Colab Notebooks/Neuronexus Innovations/Neuronexus Innovations - Machine Learning/Credit Card Fraud Detection/fraudTrain.csv'\n",
"df = load_and_process_data(file_path)\n",
"df = add_time_related_features(df)\n",
"\n",
"target_column = 'is_fraud'\n",
"X_train, X_test, y_train, y_test = prepare_train_test_data(df, target_column)\n",
"\n",
"# Define models and evaluate them using the prepared data\n",
"models = {\n",
" 'Logistic Regression': LogisticRegression(max_iter=1000),\n",
" 'Random Forest': RandomForestClassifier(random_state=5),\n",
" 'K-Nearest Neighbors': KNeighborsClassifier(),\n",
" 'Naive Bayes': GaussianNB(),\n",
" 'Decision Tree (Gini)': DecisionTreeClassifier(criterion='gini', random_state=5),\n",
" 'Decision Tree (Entropy)': DecisionTreeClassifier(criterion='entropy', random_state=5),\n",
" 'Support Vector Machine': SVC(probability=True, random_state=5)\n",
"}\n",
"\n",
"evaluate_models(models, X_train, X_test, y_train, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2EzYVulnybvD"
},
"source": [
"Overall, this code demonstrates a pipeline for model training, evaluation, and visualization for a fraud detection task using various machine learning algorithms."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
