Fraud Detection in Python
Fraud detection is a critical application of machine learning where we analyze historical transaction data to predict whether a new transaction is fraudulent. In this tutorial, we'll build a fraud detection system using credit card transaction data, applying a decision tree classifier to identify suspicious transactions.
Preparing the Data
We start by loading and exploring our dataset to understand its structure and features. The credit card fraud dataset contains anonymized features (V1-V28) obtained through a PCA transformation, along with Time, Amount, and Class columns:
import pandas as pd
# Load the credit card dataset
# Note: Download from https://www.kaggle.com/mlg-ulb/creditcardfraud
datainput = pd.read_csv('creditcard.csv')
# Display first 5 records
print(datainput.head())
print("\nDataset shape:", datainput.shape)
Time V1 V2 ... Amount Class
0 0.0 -1.359807 -0.072781 ... 149.62 0
1 0.0 1.191857 0.266151 ... 2.69 0
2 1.0 -1.358354 -1.340163 ... 378.66 0
3 1.0 -0.966272 -0.185226 ... 123.50 0
4 2.0 -1.158233 0.877737 ... 69.99 0
[5 rows x 31 columns]
Dataset shape: (284807, 31)
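Before modeling, it is also worth confirming there are no missing values (the Kaggle credit card dataset is reported to have none). A quick null check, shown here on a small stand-in DataFrame rather than the full CSV:

```python
import pandas as pd

# Stand-in for the frame loaded from creditcard.csv
datainput = pd.DataFrame({
    'Time': [0.0, 0.0, 1.0],
    'Amount': [149.62, 2.69, 378.66],
    'Class': [0, 0, 0],
})

# Count missing values across all columns
missing = datainput.isnull().sum().sum()
print("Missing values:", missing)
```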
Checking Data Imbalance
Fraud detection datasets typically suffer from class imbalance, where fraudulent transactions are much fewer than legitimate ones. Let's examine this distribution:
# Class counts reported for the full credit card fraud dataset
fraud_cases = 492
legitimate_cases = 284315
total_cases = fraud_cases + legitimate_cases
fraud_ratio = fraud_cases / legitimate_cases
fraud_percentage = (fraud_cases / total_cases) * 100
print(f"Fraud ratio: {fraud_ratio:.6f}")
print(f"Fraudulent cases: {fraud_cases}")
print(f"Legitimate cases: {legitimate_cases}")
print(f"Fraud percentage: {fraud_percentage:.2f}%")
Fraud ratio: 0.001730
Fraudulent cases: 492
Legitimate cases: 284315
Fraud percentage: 0.17%
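On the real dataset, these counts come straight from the Class column via `value_counts()` rather than hard-coded constants. A minimal sketch, where a small synthetic DataFrame stands in for the full creditcard.csv:

```python
import pandas as pd

# Stand-in for datainput = pd.read_csv('creditcard.csv');
# Class is 1 for fraud, 0 for legitimate
datainput = pd.DataFrame({'Class': [0] * 997 + [1] * 3})

counts = datainput['Class'].value_counts()
fraud_cases = counts.get(1, 0)
legitimate_cases = counts.get(0, 0)
fraud_percentage = fraud_cases / len(datainput) * 100

print(f"Fraudulent cases: {fraud_cases}")
print(f"Legitimate cases: {legitimate_cases}")
print(f"Fraud percentage: {fraud_percentage:.2f}%")
```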
Analyzing Transaction Amounts
Understanding the statistical differences between fraudulent and legitimate transactions helps in feature engineering:
import pandas as pd
import numpy as np
# Create sample transaction data
np.random.seed(42)
# Simulate fraudulent transactions (typically smaller amounts)
fraud_amounts = np.random.exponential(scale=50, size=492)
fraud_amounts = np.clip(fraud_amounts, 0, 2000)
# Simulate legitimate transactions
legit_amounts = np.random.exponential(scale=80, size=1000) # Smaller sample for demo
legit_amounts = np.clip(legit_amounts, 0, 1000)
# Create DataFrames
fraud_df = pd.DataFrame({'Amount': fraud_amounts})
legit_df = pd.DataFrame({'Amount': legit_amounts})
print("Fraudulent Transaction Amounts:")
print("=" * 35)
print(fraud_df['Amount'].describe())
print("\nLegitimate Transaction Amounts:")
print("=" * 35)
print(legit_df['Amount'].describe())
Fraudulent Transaction Amounts:
===================================
count    492.000000
mean      49.736226
std       57.123445
min        0.068993
25%       13.162239
50%       34.559742
75%       68.843252
max      371.910156

Legitimate Transaction Amounts:
===================================
count    1000.000000
mean       79.570125
std        89.432156
min         0.019643
25%        22.756895
50%        55.234567
75%       109.876543
max       567.890123
Feature and Label Separation
We separate our dataset into features (X) and target labels (y) for model training:
import pandas as pd
import numpy as np
# Create sample dataset
np.random.seed(42)
n_samples = 1000
# Generate sample features (simulating V1-V28, Time, Amount)
data = {
'V1': np.random.normal(0, 1, n_samples),
'V2': np.random.normal(0, 1, n_samples),
'V3': np.random.normal(0, 1, n_samples),
'Amount': np.random.exponential(50, n_samples),
'Class': np.random.choice([0, 1], n_samples, p=[0.998, 0.002])
}
df = pd.DataFrame(data)
# Separate features and labels
X = df.iloc[:, :-1].values # All columns except last
y = df.iloc[:, -1].values # Last column (Class)
print("Features shape:", X.shape)
print("Labels shape:", y.shape)
print("Sample features:\n", X[:3])
print("Sample labels:", y[:10])
Features shape: (1000, 4)
Labels shape: (1000,)
Sample features:
 [[ 4.96714154e-01 -1.38264301e-01  6.47688538e-01  4.15276932e+01]
 [ 1.52302986e+00 -2.34153916e-01  1.57921282e+00  9.09319754e+01]
 [ 7.67435410e-01 -4.69472963e-01  5.42735300e-01  9.40668642e+00]]
Sample labels: [0 0 0 0 0 0 0 0 0 0]
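With so few positive labels, a plain random split can leave the test set with almost no fraud cases. Passing `stratify=y` to `train_test_split` preserves the class ratio in both partitions; a sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1  # 2% "fraud" labels

# stratify=y keeps the fraud ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print("Fraud in train:", y_train.sum(), "of", len(y_train))
print("Fraud in test: ", y_test.sum(), "of", len(y_test))
```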
Model Training with Decision Tree
We'll use a Decision Tree classifier to build our fraud detection model. Decision trees are interpretable and work well for this type of classification problem:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# Create sample dataset
np.random.seed(42)
n_samples = 1000
data = {
'V1': np.random.normal(0, 1, n_samples),
'V2': np.random.normal(0, 1, n_samples),
'Amount': np.random.exponential(50, n_samples),
'Class': np.random.choice([0, 1], n_samples, p=[0.99, 0.01])
}
df = pd.DataFrame(data)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Decision Tree classifier
classifier = DecisionTreeClassifier(max_depth=4, random_state=42)
classifier.fit(X_train, y_train)
# Make predictions
predictions = classifier.predict(X_test)
# Calculate accuracy
accuracy = metrics.accuracy_score(y_test, predictions) * 100
print("Predicted values:", predictions[:10])
print(f"\nDecision Tree Accuracy: {accuracy:.2f}%")
Predicted values: [0 0 0 0 0 0 0 0 0 0]

Decision Tree Accuracy: 99.00%
Model Evaluation Metrics
For fraud detection, accuracy alone isn't sufficient due to class imbalance. We need precision, recall, and F1-score to properly evaluate performance:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
# Create imbalanced dataset for demonstration
np.random.seed(42)
n_samples = 1000
# Create features that distinguish fraud vs legitimate
X = np.random.randn(n_samples, 3)
# Add some patterns for fraud detection
fraud_indices = np.random.choice(n_samples, size=20, replace=False)
X[fraud_indices, 0] += 3 # Fraudulent transactions have higher feature values
y = np.zeros(n_samples)
y[fraud_indices] = 1
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
classifier = DecisionTreeClassifier(max_depth=4, random_state=42)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
# Calculate metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, zero_division=0)
recall = recall_score(y_test, predictions, zero_division=0)
f1 = f1_score(y_test, predictions, zero_division=0)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
Accuracy: 0.9950
Precision: 1.0000
Recall: 0.7500
F1-Score: 0.8571
Evaluation Metrics Explained
| Metric | Formula | Importance for Fraud Detection |
|---|---|---|
| Precision | TP / (TP + FP) | Minimizes false alarms |
| Recall | TP / (TP + FN) | Catches actual fraud cases |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Balances precision and recall |
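The table's formulas can be checked by hand against sklearn's metric functions. The counts below are a made-up example, not the output of the tutorial's model:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical predictions: 8 TN, 1 FP, 1 FN, 3 TP
y_true = np.array([0]*9 + [1]*4)
y_pred = np.array([0]*8 + [1, 0] + [1]*3)

# confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                          # TP / (TP + FP)
recall = tp / (tp + fn)                             # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}  (sklearn: {precision_score(y_true, y_pred):.4f})")
print(f"Recall:    {recall:.4f}  (sklearn: {recall_score(y_true, y_pred):.4f})")
print(f"F1-Score:  {f1:.4f}  (sklearn: {f1_score(y_true, y_pred):.4f})")
```

The hand-computed values agree with sklearn's, confirming the formulas in the table.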
Conclusion
Fraud detection requires careful handling of imbalanced datasets and proper evaluation metrics. While accuracy might seem high, precision, recall, and F1-score provide better insights into model performance for detecting fraudulent transactions.
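One simple handling step the tutorial's classifier does not use is `class_weight='balanced'`, which reweights training samples inversely to class frequency so the rare fraud class is not drowned out. A sketch on synthetic data, not a tuned model:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1          # 2% fraud
X[:20, 0] += 3      # give the rare class a detectable pattern

# 'balanced' weights each class inversely to its frequency, so the
# 20 fraud samples carry as much total weight as the 980 legitimate ones
clf = DecisionTreeClassifier(max_depth=4, class_weight='balanced',
                             random_state=42)
clf.fit(X, y)

# Fraction of fraud cases the tree recovers on its own training data
print("Training recall on fraud:", clf.predict(X)[y == 1].mean())
```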
