Predict Credit Default | Give Me Some Credit Kaggle

In this data science project, you will predict a borrower's chance of defaulting on credit loans by building a credit score prediction model.


Each project comes with 2-5 hours of micro-videos explaining the solution.


Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.


Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing I must say, the sessions are challenging and go into depth. I looked at graduate...


Shailesh Kurdekar

Solutions Architect at Capital One

I have worked for more than 15 years in Java and J2EE and have recently developed an interest in Big Data technologies and Machine Learning due to a big need at my workplace. I was referred here by a...

What will you learn

Understanding the problem statement
Understand the dataset and the behavior of the features/attributes
Performing Exploratory Data Analysis to understand how the data is distributed and how the inputs behave with respect to the target variable
Data preprocessing based on how the values are distributed: removing data entry errors, treating outliers (which is necessary for certain algorithms), and imputing missing values if there are any
Splitting the dataset into train and test sets using Stratified Sampling to maintain the event rate across the datasets, so that a model can learn behavior from the training dataset and generalize with reasonable accuracy to the unseen dataset
Feature Engineering for better decision making by a model
Scaling of the features using BoxCox transformation and Standardization
Training a model using a Neural Network as a Deep Learning architecture and analyzing the impact of training on the same dataset with different feature input values, obtained by scaling features and by increasing or decreasing the minority class
Training a model using the statistical technique Logistic Regression and analyzing why scaling features is necessary for such statistical techniques
Training a model using tree-based algorithms such as Bagging and Boosting, and analyzing why certain preprocessing steps that are quintessential for other modeling techniques are not required for such algorithms
Hyperparameter tuning of the modeling algorithms and checking its impact on model performance
Using Recursive Feature Elimination with Cross Validation (RFECV) to check whether highly correlated features are present in the model and to find the optimal number of features to use for training
Analyzing why the popular metric Accuracy will not be useful in our case
Checking model performance on the unseen dataset using metrics such as F1 score, Precision, Recall, and AUC-ROC
Model Interpretability using SHAP at a global level and LIME at a local level
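The stratified split described above can be sketched as follows; the data here is a synthetic stand-in, not the project's dataset, and the ~7% positive rate is an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: roughly 7% positive class, mimicking a default-rate setting.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.07).astype(int)

# stratify=y keeps the event rate (share of defaulters) essentially
# identical in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Without `stratify`, a random split of a rare event can leave the test set with a noticeably different default rate, which distorts metric comparisons.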

Project Description

Business Context - Banks are primarily in the money lending business. The more money they lend to people who repay on time at a good interest rate, the more revenue there is for the banks. This not only saves banks from bad loans but also improves their image with the public and the regulatory bodies.

The better banks can identify borrowers who are likely to miss their repayments, the earlier they can take purposeful action, whether reminding them in person or taking stricter measures to avoid delinquency.

When a borrower is not making monthly payments on credit issued against some asset, two terms are frequently used: delinquent and default.

Delinquent is the milder term: the borrower has missed payments and is behind by a certain number of months. Default means the borrower has been unable to pay for a long period of months and is unlikely to repay.

This case study is about identifying the borrowers who are likely to default in the next two years with serious delinquency, i.e., being delinquent for more than 3 months.

We have a general profile of the borrower, such as age, Monthly Income, and Dependents, and historical data such as the Debt Ratio, the ratio of the amount owed to the credit limit, and the number of times the borrower was past due in the past one, two, and three months.

We will be using all these features to predict whether the borrower is likely to default in the next 2 years, i.e., have a delinquency of more than 3 months.
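As an illustration, the features described above correspond to columns in the public "Give Me Some Credit" Kaggle files; the column names below are reproduced from memory and should be verified against the downloaded data:

```python
import pandas as pd

# A tiny hand-made sample whose columns mirror the public Kaggle dataset
# (names reproduced from memory; verify against the downloaded file).
df = pd.DataFrame({
    "SeriousDlqin2yrs":                     [0, 1, 0],           # target: serious delinquency in 2 years
    "RevolvingUtilizationOfUnsecuredLines": [0.12, 0.95, 0.40],  # amount owed / credit limit
    "age":                                  [45, 31, 58],
    "NumberOfTime30-59DaysPastDueNotWorse": [0, 2, 0],
    "DebtRatio":                            [0.30, 0.85, 0.20],
    "MonthlyIncome":                        [5400, 2100, None],  # missing values occur here
    "NumberOfDependents":                   [2, 0, 1],
})

# Event rate: share of borrowers flagged with serious delinquency.
print(df["SeriousDlqin2yrs"].mean())
```

In the real data, the target `SeriousDlqin2yrs` is heavily imbalanced and `MonthlyIncome` has many missing values, which is why imputation and class-imbalance handling appear in the curriculum.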

These kinds of predictions will help banks to take necessary actions.

Objective: Build a model that uses a borrower's general profile and historical records as inputs to predict whether one is likely to have serious delinquency in the next 2 years.

We will be using Python to perform all of these operations.

Main Libraries used

      Pandas for data manipulation, aggregation
      Matplotlib and Seaborn for visualization and analyzing behavior with respect to the target variable
      NumPy for computationally efficient operations
      Scikit Learn for model training, model optimization and metrics calculation
      Imblearn for tackling class imbalance problem
      Shap and LIME for model interpretability
      Keras for the Neural Network (Deep Learning architecture)
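To illustrate the class-imbalance problem Imblearn addresses, here is a minimal random-oversampling sketch in plain NumPy; the project itself uses Imblearn's SMOTE, which synthesizes new minority samples rather than duplicating existing rows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 95 majority (class 0) vs 5 minority (class 1) rows.
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Random oversampling: duplicate minority rows until classes are balanced.
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=(y == 0).sum() - minority_idx.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # both classes now have 95 rows
```

Downsampling is the mirror operation (dropping majority rows), and SMOTE goes one step further by interpolating between minority neighbors instead of copying them.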

Similar Projects

Given a partial trajectory of a taxi, you will be asked to predict its final destination using the taxi trajectory dataset.

In this data science project, you will learn to predict churn on a built-in dataset using Ensemble Methods in R.

In this data science project, you will work with the German credit dataset, using classification techniques like Decision Tree, Neural Networks, etc., to classify loan applications using R.

Curriculum For This Mini Project

Business Context
Data Understanding
Splitting training dataset into train and test for model selection and validation
Univariate Analysis
Data Cleaning - Outlier Treatment, Data Entry Errors, Imputing Missing Values
Checking Correlation
Bivariate Analysis
Feature Engineering
Tackling Class Imbalance - SMOTE, Upsampling and Downsampling techniques
Feature Scaling - BoxCox Transformation & Standardization
Modelling Overview and Metrics
Deep Learning - Neural Network Architecture
Modelling - Neural Network on scaled and non-scaled datasets
Modelling - Logistic Regression
Modelling - Tree Based : Random Forest(Bagging)
Boosting Overview : XGBoost and LightGBM
Modelling - Tree Based : XGBoost and LightGBM(Boosting)
Combined AUC-ROC plots
RFECV for Correlated Feature Elimination and Selecting Optimal Features
Hyperparameter Tuning
AUC-ROC plot on hypertuned parameters and Model Prediction on the test dataset
Model Interpretation - SHAP at a global level and LIME at a local level
Modular Code Overview
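The core modelling loop in the curriculum can be sketched end to end on synthetic data. This is a scikit-learn-only sketch (the project also covers Neural Networks, XGBoost, and LightGBM), and the dataset, imbalance ratio, and `class_weight="balanced"` choice are all illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic stand-in for the credit dataset (~7% positives).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.93], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling matters for Logistic Regression, as the curriculum notes;
# class_weight="balanced" is one simple way to handle the skewed target.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
print("f1:       ", f1_score(y_test, pred))
print("auc-roc:  ", roc_auc_score(y_test, proba))
```

Reporting Precision, Recall, F1, and AUC-ROC rather than plain Accuracy reflects the point made earlier: with ~93% non-defaulters, a model that predicts "no default" for everyone is 93% accurate and completely useless.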