Credit Card Fraud Detection as a Classification Problem

In this data science project, we will predict credit card fraud in a transactional dataset using several predictive models.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the problem
Importing required libraries and understanding their use
Importing data and learning its structure
Performing basic EDA
Scaling different variables
Outlier treatment
Building a basic classification model with Random Forest
NearMiss technique for undersampling data
SMOTE for oversampling data
Cross-validation in the context of undersampling and oversampling
Pipelining with sklearn/imblearn
Applying a linear model: Logistic Regression
Applying an ensemble technique: Random Forest
Applying nonlinear algorithms: Support Vector Machine, Decision Tree and k-Nearest Neighbour
Making predictions on test set and computing validation metrics
ROC curve and Learning curve
Comparison of results and Model Selection
Visualization with seaborn and matplotlib
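
The point about cross-validation in the context of undersampling and oversampling can be illustrated with a small sketch. It assumes the usual practice of resampling only the training portion of each fold, so that every test fold keeps the original class ratio. To stay dependency-free, it uses plain random undersampling as a stand-in for NearMiss, and synthetic labels rather than the project dataset; the helper names `stratified_folds` and `undersample` are hypothetical and not part of sklearn or imblearn.

```python
import random


def stratified_folds(labels, n_folds, seed=0):
    """Split row indices into n_folds, keeping the class ratio similar in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the rows of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def undersample(indices, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


# Synthetic labels (an assumption for illustration): 20 frauds in 1000 rows.
labels = [1] * 20 + [0] * 980
folds = stratified_folds(labels, n_folds=5)

summary = []
for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # Resample the training rows only; the test fold is left untouched.
    balanced_train = undersample(train_idx, labels, seed=k)
    # A model would be fitted on balanced_train and scored on test_idx here.
    summary.append({
        "train_rows": len(balanced_train),
        "train_frauds": sum(labels[i] for i in balanced_train),
        "test_rows": len(test_idx),
        "test_frauds": sum(labels[i] for i in test_idx),
        "overlap": len(set(balanced_train) & set(test_idx)),
    })

for k, row in enumerate(summary):
    print(k, row)
```

Resampling before splitting would let resampled or synthetic copies of the same rows influence both training and evaluation, which tends to make validation scores look better than they are. Placing the sampler inside a pipeline, as the sklearn/imblearn item above suggests, is one way to get the same per-fold behaviour.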

Project Description

It is vital that credit card companies are able to identify fraudulent credit card transactions so that customers are not charged for items that they did not purchase. The dataset used contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions. The dataset was collected and analyzed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available at http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.
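
As a quick check on the scale of the imbalance, the figures quoted above can be turned into a fraud rate and a class ratio. This is only arithmetic on the two counts given in the description; the variable names are illustrative.

```python
# Counts quoted in the project description.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

legit_transactions = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS
legit_per_fraud = legit_transactions / FRAUD_TRANSACTIONS

print(f"Legitimate transactions: {legit_transactions}")
print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.1f}")
```

In other words, there are several hundred legitimate transactions for every fraudulent one, which is why resampling and careful metric choice matter in this project.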

Because the dataset's features were produced by PCA, there is little scope for further data preprocessing in this problem. The class imbalance is compensated for using oversampling and undersampling. Logistic regression, random forest, support vector machine, and k-nearest neighbours classifiers are used within a cross-validation framework. Finally, recall and accuracy are the metrics considered when choosing the best classifier. An extra section on outlier detection is added at the end.
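
To see why recall is considered alongside accuracy, it helps to compute both from confusion-matrix counts. The sketch below compares two hypothetical classifiers on a dataset with the class counts from the description: one that labels every transaction as legitimate, and one with invented counts (400 frauds caught, 1,000 false alarms) chosen purely for illustration. Neither represents a result from this project.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all transactions that were labelled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Fraction of actual frauds that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Class counts from the description: 492 frauds, 284,315 legitimate.
# Hypothetical classifier A: predicts "legitimate" for everything.
always_legit = {"tp": 0, "fp": 0, "tn": 284_315, "fn": 492}

# Hypothetical classifier B (invented counts): catches 400 frauds
# at the cost of 1,000 false alarms.
detector = {"tp": 400, "fp": 1_000, "tn": 283_315, "fn": 92}

for name, c in [("always legitimate", always_legit), ("detector", detector)]:
    acc = accuracy(c["tp"], c["fp"], c["tn"], c["fn"])
    rec = recall(c["tp"], c["fn"])
    print(f"{name}: accuracy={acc:.4f}, recall={rec:.4f}")
```

The do-nothing classifier scores the higher accuracy of the two while catching no fraud at all, so accuracy on its own is a poor basis for model selection on data this imbalanced.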

Similar Projects

In this machine learning project, we will use binary leaf images and extracted features, including shape, margin, and texture to accurately identify plant species using different benchmark classification techniques.

In this data science project, we will look at a few examples where we can apply various time series forecasting techniques.

In this project, we will use traditional time series forecasting methods as well as modern deep learning methods for time series forecasting.

Curriculum For This Mini Project

Business Problem (04m)
Data Science Problem (10m)
Solution Workflow (10m)
Show me the Data (08m)
Exploratory Data Analysis - Part 1 (07m)
Exploratory Data Analysis - Part 2 (09m)
Data Preparation - Part 1 (06m)
Data Preparation - Part 2 (08m)
Validation Metrics (10m)
Base Model (08m)
Undersampling Models - Part 1 (12m)
Undersampling Models - Part 2 (07m)
Oversampling Models (05m)
Best Model (14m)