Loan Prediction using Machine Learning Project Source Code


By ProjectPro

This article walks you through framing loan prediction as a data science and machine learning problem and then building your own loan prediction system as an end-to-end machine learning project.


Loan sanctioning and credit scoring form a multi-billion dollar industry in the US alone. With everyone from young students and entrepreneurs to multi-million dollar companies turning to banks for financial support, processing these applications is a complex and cumbersome task for any banking institution. As of 2022, more than 20 million people in the US have active loans, owing a collective debt of 178 billion dollars. Even so, more than 20% of all applicants were denied loans. Approval or rejection has enormous ramifications for both the applicant and the bank, and a wrong decision carries opportunity costs for both parties. In recent years, banks like Wells Fargo and Morgan Stanley have explored using AI to assess lending risk and build loan prediction systems, aiming to reduce human bias and delays in application processing.



Traditional processes determine risk by manually examining the applicant's income, credit history, and several other dynamic parameters and then building a data-driven risk model. Even with data science in the loop, a large amount of manual work remains. Researchers have recently explored using deep learning in various parts of this process. For example, credit score and credit history are essential parameters for assessing an applicant's lending risk. DL-based approaches such as the Embedding Transactional Recurrent Neural Network (E.T.-RNN) compute applicants' credit scores from the history of their credit and debit card transactions. Such an approach reduces the dependency on manual intervention and extensive domain knowledge, and with it the room for human bias, in loan approval prediction.

 


What is Loan Prediction using Machine Learning?

Generally, loan prediction involves the lender looking at various background information about the applicant and deciding whether the bank should grant the loan. Parameters like credit score, loan amount, lifestyle, career, and assets are the deciding factors in getting the loan approved. If people with parameters similar to yours have paid their dues on time in the past, it is more likely that your loan will be granted as well.

Machine learning algorithms can exploit this dependency on past experiences and comparisons with other applicants and formulate a data science problem to predict the loan status of a new applicant using similar rules.

Several collections of data from past loan applicants use different features to decide the loan status. A machine learning model can look at this data, which could be static or time-series, and give a probability estimate of whether this loan will be approved. Let's look at some of these datasets.

Top 5 Loan Prediction Datasets to Practice Loan Prediction Projects

The Univ.AI Loan Prediction dataset uses 11 parameters and maps their relation to whether the applicant defaulted on their loan. This helps flag behavior that might increase the risk of lending to that customer. The bank will reject the loan application if the predicted risk is high. The parameters include age, profession, home and car ownership, and income; there are 252,000 samples.

Using 17 features and over 80,000 samples, the Future Loan Status Prediction Dataset trains a machine learning model to predict whether this loan will be paid off based on the past behavior of other customers. 

The objective of the dataset from the Loan Prediction Competition on Kaggle is to use Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and other factors to predict the approval probability of each application. There are 614 samples in this dataset. Because the dataset is simple and beginner-friendly, we use it in our example below to demonstrate loan status prediction using machine learning.


This collection details the credit history of customers from various countries who defaulted on their credit payments, classifying them into credible and non-credible clients. This forms another parameter with which loan prediction can be made. There are 23 attributes in the dataset, most tracking past payments and bill statements. With over 30,000 instances, this is quite a comprehensive dataset. Find it on the UCI Credit Risk Dataset for Loan Eligibility Prediction.

This loan prediction dataset from actual German financial institutions contains 1000 samples and 20 attributes, a mix of categorical and numerical variables; each sample represents a customer who has taken a loan from the bank. It is available as the UCI German Credit Risk Dataset.


Algorithms used for Loan Prediction using Machine Learning

Historically, lending risk prediction has used statistical methods, including Linear Discriminant Analysis and Logistic Regression. However, with large credit datasets, ML-driven risk estimation algorithms like k-Nearest Neighbor, Random Forest, and Support Vector Machines are better at capturing complex relationships. Moreover, deep learning methods have gained a particular advantage in modeling non-linear relationships between risk and risk factors for large-scale lending risk and loan prediction datasets.

Novel frameworks like DEAL (Deep Ensemble Algorithm), or improvements over existing models of Recurrent Neural Networks (RNN) or Boosted Decision Tree or Autoencoders, give satisfactory accuracy over large datasets and generate features with domain expertise.

However, more work is available on machine learning models than deep learning architectures since the latter's performances are often specific to the dataset they were designed and tested on. In the figure shown below, a recent paper compares the performance of various machine learning algorithms on the German credit risk dataset. We can see that algorithms like SVM, Random forest, and the Logistic regression model perform better than ELM and ANN. However, decision trees and boosting also give a competitive performance on this dataset.

[Figure: Algorithms used for Loan Prediction using Machine Learning]

Top 3 Machine Learning Solution Approaches for Loan Prediction

We can now look at three machine learning approaches that suit our loan prediction project. Unlike deep learning, traditional machine learning algorithms tend to perform more consistently across datasets. We will examine this later in the article in our end-to-end implementation of a loan prediction machine learning project:

Support Vector Machine (SVM) is a supervised machine learning algorithm that generates a hyperplane (a decision boundary) to separate classes, even in a high-dimensional vector space. With a suitable kernel, it can also capture non-linear relationships between the features and the target variable. It decides the class of a sample $x$ based on the sign of $w^{T}x + b$. In this expression, $w$ is the weight vector normal to the hyperplane (it sets the hyperplane's orientation and, through its norm, the width of the margin), and $b$ is the bias. SVM is particularly useful in loan prediction because this task usually has several features that need to be considered for the final decision.

"Boosting" is an ensemble method that combines individual models sequentially to achieve higher performance. AdaBoost and Stochastic Gradient Boosting (SGB) are among the most popular boosting algorithms. AdaBoost assigns higher weights to misclassified instances during training, while SGB fits each new tree on a random subsample of the training data, making randomness an integral part of training. Extreme Gradient Boosting (XGBoost) is an optimized implementation of gradient boosting and is very popular for binary classification. XGBoost still adds trees one after another, but it parallelizes the work of building each tree, which, together with built-in regularization, gives it an edge over plain decision trees and earlier boosting implementations.

The random forest algorithm improves on the flexibility and decision-making capacity of individual trees. It is another ensemble learning algorithm: it trains many decision trees on random subsets of the samples and features and combines their results. In some loan and credit risk prediction use cases, a few features matter far more than the rest, and some features may even hurt overall performance if kept. Since decision trees choose features based on information gain, random forests inherit this built-in feature selection and often give superior performance.
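To make the SVM decision rule mentioned above concrete, here is a minimal sketch that classifies samples by the sign of w^T x + b. The weight vector, bias, and sample points are made-up values chosen for illustration; they are not parameters learned from any loan dataset.

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative values, not learned from data).
w = np.array([2.0, -1.0])   # weight vector, normal to the decision boundary
b = -1.0                    # bias term

def svm_decide(X, w, b):
    """Return +1 or -1 for each row of X based on the sign of w^T x + b."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

samples = np.array([
    [2.0, 1.0],   # score = 2*2 - 1*1 - 1 =  2  -> class +1
    [0.0, 1.0],   # score = 2*0 - 1*1 - 1 = -2  -> class -1
])
predictions = svm_decide(samples, w, b)
print(predictions)
```

In a trained SVM, w and b come out of the optimization that maximizes the margin; the prediction step itself stays this simple for a linear kernel.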

End-to-End Implementation of Loan Prediction Project using Machine Learning in Python

Python provides incredible flexibility in implementing a machine learning model and working on data preprocessing and exploratory data analysis tasks, making it the most preferred language for loan eligibility prediction projects.

Python Libraries used for Loan Prediction using Machine Learning

Since we are working on a fixed dataset to compare the performance of multiple algorithms and get started with a loan prediction project, we can use some popular libraries commonly used in Python.

Pandas

Pandas is the most straightforward and powerful package for beginners for data loading, cleaning, and processing. Modules in Pandas will help us treat null values, handle categorical variables, get an overview of the dataset, and perform exploratory data analysis if needed.

Scikit-Learn (sklearn)

Perhaps the most accessible library in Python for machine learning beginners, scikit-learn has ready-to-use modules for most machine learning-related tasks, from data preparation to model building, optimized training, and evaluation. To build our machine learning model, we use the existing modules available in sklearn. We use them through a module called RandomizedSearchCV, which computes cross-validation accuracy to find the best set of hyperparameters for every model.

XGBoost

The XGBoost package, distributed separately from sklearn, has a fast and accessible implementation of the gradient boosting algorithm. We install it separately and use its XGBClassifier class.

NumPy and Matplotlib are used for standard data processing and visualization tasks, respectively.

Let's get our hands dirty and start with the coding for our loan prediction project.

Loading Libraries for Loan Prediction using Machine Learning Project

As discussed above, we load the required modules from all the mentioned libraries. We choose to perform loan prediction using the Decision tree, Random forest, XGBoost, SVM, and Logistic regression model.
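The original code for this step is not reproduced in the text, so the following is a sketch of what the imports could look like for the libraries and models named above. The `MODEL_CLASSES` dictionary is a hypothetical convenience added for illustration, and the guarded `xgboost` import is an assumption made so the script still loads when that package is missing.

```python
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Preprocessing, splitting, hyperparameter search, and evaluation
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score

# Model families named in the text
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# XGBoost is a separate package (pip install xgboost); guard the import so
# the rest of the script still loads if it is not installed.
try:
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None

# Hypothetical lookup table of the model classes used in this project.
MODEL_CLASSES = {
    "decision_tree": DecisionTreeClassifier,
    "random_forest": RandomForestClassifier,
    "xgboost": XGBClassifier,
    "svm": SVC,
    "logistic_regression": LogisticRegression,
}
```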


Loan Prediction Dataset Example

For this project, we choose the dataset from the Loan Prediction Competition on Kaggle. It has 12 features, one target variable, and 614 samples. The features include Income, Loan Amount, Credit History, Gender, Marital Status, Education, Dependents, and others. This straightforward dataset will be a good measure of finding which ML models work the best for a beginner project.


Loading the Loan Prediction Dataset for the Machine Learning Project

Download the CSV files from the Kaggle page. Using the read_csv function from Pandas, we load the training dataset given in the competition since it is the only file with the target variable mapped.
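As a rough sketch of this loading step, the snippet below reads the data with `read_csv`. So that it runs without the download, it parses three made-up rows from an in-memory string; the column names other than those mentioned in this article, the row values, and the file name `train.csv` in the comment are assumptions.

```python
import io
import pandas as pd

# A few made-up rows in the layout assumed for this dataset; in practice
# you would pass the path of the downloaded training CSV instead.
SAMPLE_CSV = """Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP900001,Male,No,0,Graduate,No,5000,0,,360,1,Urban,Y
LP900002,Male,Yes,1,Graduate,No,4500,1500,128,360,1,Rural,N
LP900003,Female,No,0,Not Graduate,Yes,3000,0,66,360,,Semiurban,Y
"""

def load_loans(source):
    """Load the loan data from a file path or a file-like object."""
    return pd.read_csv(source)

df = load_loans(io.StringIO(SAMPLE_CSV))   # e.g. load_loans("train.csv")
print(df.shape)     # rows and columns
print(df.head())    # first few rows
print(df.dtypes)    # which columns are numeric and which are text
```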

[Figure: Loading the Loan Prediction Dataset for Loan Eligibility Prediction]

As we can see above, most variables are categorical, with most having binary categories. First, we need to one-hot encode these and then normalize the numerical variables. Finally, the Loan_ID column is not useful as a feature, since it is a unique identifier with a different value in each of the 614 rows. We will remove that column from our final set.

Data Pre-Processing for Loan Prediction using Machine Learning

Data preprocessing involves label encoding, handling missing values, selecting appropriate columns, normalization, and more. We will have to perform all of these steps on our dataset despite it being a relatively clean and structured one.

Python Pandas is particularly useful in taking care of these preprocessing steps to prepare the training dataset. In-built functions of the DataFrame class can cater to EDA, cleaning, preprocessing, sorting, and filtering as needed.

First, we should get a summary of the presence of empty or NULL fields in every column.

Treating Missing Values

The isnull() method of the DataFrame class returns a Boolean value for every cell, indicating whether or not the cell is empty. Using sum(), we can treat the Booleans as 0 and 1 and get a count of NULL values for each column.
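A minimal sketch of this count on a tiny made-up DataFrame (the values are illustrative, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Small made-up frame with a few missing cells.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
    "ApplicantIncome": [5000, 4500, 3000, 6000],
})

null_mask = df.isnull()        # True where a cell is empty, False otherwise
null_counts = null_mask.sum()  # True counts as 1, so this is a per-column total
print(null_counts)
```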

[Figure: Treating Null Values for Loan Prediction using Machine Learning]

7 columns have a non-zero number of NULL values, with Credit_History having the most (50). Given the size of our dataset, some of these columns have many empty fields, which we cannot handle by just removing the respective rows. Doing this will significantly decrease the size of the training dataset and adversely impact the model performance. Instead, we use null value treatment methods like replacing the values with the Mean or Mode of the column values. Using mode works best in our case, as most columns are binary. Moreover, Mode will simply put the most occurring instance in place of empty fields, which, under the circumstances, would be the best guess.

For LoanAmount, we use mean since it is not one of the categorical variables.

[Figure: Calculating Mean for LoanAmount]

Again, Pandas methods let us do this in a single line of code per column. The fillna() method fills empty fields with the value passed to it, while dropna() returns the column values with the NULL values removed. Calculating the mean or mode of these values and passing it to fillna() completes this step.
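The fill strategy described above could look roughly like this. The data is made up, and which columns count as categorical is an assumption for the sake of the example.

```python
import numpy as np
import pandas as pd

# Small made-up frame with missing cells.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
    "LoanAmount": [100.0, np.nan, 200.0, 150.0],
})

# Categorical / binary columns: fill with the most frequent value (mode).
for col in ["Gender", "Credit_History"]:
    most_frequent = df[col].dropna().mode()[0]
    df[col] = df[col].fillna(most_frequent)

# Continuous column: fill with the mean of the observed values.
loan_mean = df["LoanAmount"].dropna().mean()
df["LoanAmount"] = df["LoanAmount"].fillna(loan_mean)

print(df.isnull().sum())   # every count should now be zero
```

Note that mean() and mode() already skip missing values by default, so the explicit dropna() mainly makes the intent visible.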

[Figure: Handling Missing Values for Loan Prediction using Machine Learning Project]

As seen above, all the missing values have been handled. However, looking at the Dtype column shows that there are several non-numerical variables which require label encoding.

Handling Categorical Variables

Let us look at the unique values each of these non-numerical variables holds.

[Figure: Handling Categorical Variables for Loan Prediction using Machine Learning Project]

So, the next step is to map these categories to numeric values. For binary variables, "No" is mapped to 0 and "Yes" is mapped to 1. For non-binary variables, like Property_Area with options such as semiurban, rural, or urban, we use the get_dummies function of Pandas to one-hot encode them automatically. Similarly, the variable for male and female applicants is separated into two dummy columns, one per category (for example, Gender_Male and Gender_Female).

Finally, once all the columns are numeric, we fit a StandardScaler to normalize the numeric variables, rescaling each one to zero mean and unit variance.
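The encoding and scaling steps above can be sketched as follows on a small made-up frame. The column names follow the ones used in this article where available; the values and the choice of which columns to scale are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up frame with one binary, one multi-category, and two numeric columns.
df = pd.DataFrame({
    "Married": ["No", "Yes", "Yes", "No"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "ApplicantIncome": [5000.0, 4500.0, 3000.0, 6000.0],
    "LoanAmount": [128.0, 100.0, 66.0, 150.0],
})

# Binary Yes/No column -> 0/1
df["Married"] = df["Married"].map({"No": 0, "Yes": 1})

# Multi-category column -> one dummy (0/1) column per category
df = pd.get_dummies(df, columns=["Property_Area"], dtype=int)

# Standardize numeric columns to zero mean and unit variance
numeric_cols = ["ApplicantIncome", "LoanAmount"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df)
```

This mirrors the article's order of operations, where scaling happens before the train/test split. A stricter setup would fit the scaler on the training portion only and reuse it to transform the test portion.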

[Figure: Normalize Numeric Variables for Loan Prediction using Machine Learning Project]

Once we do this, we can see that there are dummy variables for each category of every categorical variable. A subset can be seen below:

[Figure: Normalized Values for Loan Prediction using Machine Learning Project]

Similarly, we can see the normalized values for the numerical columns 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', and 'Loan_Amount_Term'.

In the next step, we will remove the Loan_ID column and create train and test data.

Creating Train and Test Dataset

Using the popular train_test_split function from sklearn and a split ratio of 80:20, we create the train and test data sets as follows:
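A sketch of the split, using a synthetic stand-in with the same number of rows as the dataset (614). The feature columns, the random values, and the `random_state` are assumptions for illustration; only the 80:20 ratio and the removal of Loan_ID come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 614 rows; the values are random, not real loan data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Loan_ID": [f"LP{i:06d}" for i in range(614)],
    "ApplicantIncome": rng.normal(size=614),
    "Credit_History": rng.integers(0, 2, size=614),
    "Loan_Status": rng.integers(0, 2, size=614),
})

# Drop the identifier and separate the features from the target.
X = df.drop(columns=["Loan_ID", "Loan_Status"])
y = df["Loan_Status"]

# 80:20 split; random_state is fixed only to make the sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```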

[Figure: Train and Test Dataset for Loan Prediction using Machine Learning Project]

Our final train dataset has 491 samples (80% of the 614 rows) and 20 columns. We are ready to use it to train different machine learning models.


Training Machine Learning (ML) Models for Loan Prediction

We follow a fixed pipeline in trying out different models. All the models are implemented in sklearn except XGBoost. We initialize the model without any parameters and pass it through the randomized cross-validation search module RandomizedSearchCV with a dictionary of relevant hyperparameters. We define these hyperparameters in advance for each classification algorithm based on their availability in their sklearn implementation.

The cross-validation process trains each candidate and checks its validation accuracy for different combinations of these hyperparameter values, then returns the best-performing model. We run the search for 100 iterations (sampled combinations) with 4-fold cross-validation. We will also get to see the best choice of parameters for every model on our training dataset.
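The pipeline described above could be sketched as a small helper: a randomized search with cross-validation followed by a check on held-out data. The function name `tune_and_evaluate`, the synthetic data, and the logistic regression demo with its candidate values of C are all assumptions for illustration, and the sketch assumes 4-fold cross-validation (cv=4).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

def tune_and_evaluate(model, param_distributions, X_train, y_train,
                      X_test, y_test, n_iter=100, cv=4, seed=0):
    """Randomized hyperparameter search, then accuracy on the test split."""
    search = RandomizedSearchCV(
        model, param_distributions,
        n_iter=n_iter, cv=cv, scoring="accuracy", random_state=seed,
    )
    search.fit(X_train, y_train)
    test_predictions = search.best_estimator_.predict(X_test)
    test_accuracy = accuracy_score(y_test, test_predictions)
    return search.best_params_, search.best_score_, test_accuracy

# Synthetic binary classification data standing in for the loan dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

best_params, cv_accuracy, test_accuracy = tune_and_evaluate(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},      # illustrative candidate values
    X_train, y_train, X_test, y_test,
    n_iter=4,                           # only 4 candidates exist in this demo
)
print(best_params, round(cv_accuracy, 3), round(test_accuracy, 3))
```

The score that drives model selection is the mean accuracy on the held-out folds, not the accuracy on the data each candidate was fitted on.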

Let's get started.

XGBoost

Starting with XGBoost, we define various options for 'n_estimators', 'max_depth', 'learning_rate', and 'colsample_bytree'. Descriptions of these parameters can be found in the XGBoost documentation.
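A sketch of how such a search space might be set up. The four parameter names come from the text; the candidate values are assumptions, not the article's exact lists. The helper `build_xgb_search` is hypothetical, and the `xgboost` import is guarded because the package is installed separately.

```python
from sklearn.model_selection import RandomizedSearchCV

try:
    from xgboost import XGBClassifier
except ImportError:            # xgboost is installed separately from sklearn
    XGBClassifier = None

# Illustrative candidate values (assumed, not the article's exact lists).
xgb_param_distributions = {
    "n_estimators": list(range(1, 302, 10)),   # 1, 11, 21, ..., 301
    "max_depth": [1, 2, 3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "colsample_bytree": [0.5, 0.7, 0.9, 1.0],
}

def build_xgb_search(n_iter=100, cv=4, seed=0):
    """Return an unfitted randomized search over the XGBoost parameters."""
    if XGBClassifier is None:
        raise ImportError("xgboost is not installed; try `pip install xgboost`")
    return RandomizedSearchCV(
        XGBClassifier(), xgb_param_distributions,
        n_iter=n_iter, cv=cv, scoring="accuracy", random_state=seed,
    )
```

With the package installed, `build_xgb_search().fit(X_train, y_train)` runs the search, and the fitted object's `best_params_` attribute reports the winning combination.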

[Figure: XGBoost for Loan Prediction using Machine Learning Project]

After fitting and training the model, cross-validation suggested that the best parameters were: {'n_estimators': 111, 'max_depth': 1, 'learning_rate': 0.1, 'colsample_bytree': 0.9}.

Evaluating on the test data, we find that the final binary classification accuracy is around 78.9%.

Decision Tree

Next, we try a single decision tree with the max depth ranging from 4 to 25 and the minimum samples per leaf and per split between 10 and 100. The best max depth turns out to be 4, and the best criterion is the default Gini index.
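The decision tree search space described above might be written as follows. The ranges follow the text; the step size of 10, the inclusion of 'entropy' as the alternative criterion, the reduced number of iterations, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

tree_param_distributions = {
    "max_depth": list(range(4, 26)),                 # 4 to 25
    "min_samples_leaf": list(range(10, 101, 10)),    # 10 to 100 (step assumed)
    "min_samples_split": list(range(10, 101, 10)),   # 10 to 100 (step assumed)
    "criterion": ["gini", "entropy"],
}

# Synthetic binary classification data standing in for the loan dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

tree_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    tree_param_distributions,
    n_iter=20, cv=4, scoring="accuracy", random_state=0,
)
tree_search.fit(X, y)
tree_best_params = tree_search.best_params_
print(tree_best_params)
```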

[Figure: Impurity Criterion for Loan Prediction using Machine Learning Project]

[Figure: Decision Tree for Loan Prediction using Machine Learning Project]

Test accuracy for the decision tree is around 78%, worse than XGBoost.

Random Forest Classifier

The only hyperparameter we tune for the random forest classifier is the number of estimators (n_estimators), which is the number of decision trees in the forest. It turns out that 100 is the best number of trees for our training data. The test accuracy is around 78.8%, which is slightly better than a single decision tree.

[Figure: Random Forest Classifier for Loan Prediction using Machine Learning Project]

We observe that the random forest did not improve on the decision tree as much as expected, and its performance is very close to that of XGBoost. Thus, it seems reasonable to hypothesize that there is not much more to learn from our limited dataset.

Let's check a few more models first.

Support Vector Classifier (SVC)

The support vector classifier builds a hyperplane that separates the classes with as wide a margin as possible. Various kernels can be used to create this hyperplane. A kernel is a function that measures the similarity between two samples, implicitly mapping the inputs into a higher-dimensional space where a linear boundary can separate them. Linear, polynomial, RBF, and sigmoid are the four kernel options that we tune, along with the regularization parameter C that keeps overfitting in check.

Using these parameters, the model performed best on a linear kernel with regularization factor C as 1.0 and produced a test score of 78.86%.
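To see how the kernel choice plays out, the sketch below compares the four kernels at C = 1.0 using 4-fold cross-validated accuracy. It uses synthetic data as a stand-in, so which kernel wins here says nothing about the loan dataset; the data and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the loan dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel, C=1.0)
    fold_scores = cross_val_score(model, X, y, cv=4, scoring="accuracy")
    kernel_scores[kernel] = fold_scores.mean()

best_kernel = max(kernel_scores, key=kernel_scores.get)
print(kernel_scores)
print("best kernel on this synthetic data:", best_kernel)
```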

[Figure: SVC for Loan Prediction using Machine Learning]

Now that we have determined that SVM, XGBoost, and Random Forest are some of the best performing ML models for performing loan prediction and building a beginner's loan prediction machine learning project, let's see more details of what each model found in our loan prediction dataset.


Understanding Feature Importance for Loan Prediction

As we know, tree-based methods build a tree by computing the information gain added or increased by adding a particular feature to the decision. In this way, after the tree is built, we can recognize the features that helped the most or were the most significant in adding information or decreasing the overall entropy of the tree.

In other words, we can determine which features the model considered most important in making a decision. In our case, this shows the attributes of an applicant's profile that most affect their loan status and the loan prediction decision.

In sklearn, feature importances are given by the feature_importances_ attribute of the fitted model. For tree-based models, each value is the total impurity decrease contributed by that feature, normalized so that the values sum to 1; for a forest, it is the mean across the individual trees (the standard deviation across trees can be computed separately to gauge variability).

We access this feature using a simple function, as shown below. This will return a sorted DataFrame with a feature importance value for each feature column.
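One way such a helper could look is sketched below. The function name `feature_importance_table`, the placeholder feature names, and the synthetic data are assumptions; `feature_importances_` is the sklearn attribute named above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def feature_importance_table(fitted_model, feature_names):
    """Return a DataFrame of features sorted by importance, highest first."""
    table = pd.DataFrame({
        "feature": list(feature_names),
        "importance": fitted_model.feature_importances_,
    })
    return table.sort_values("importance", ascending=False).reset_index(drop=True)

# Synthetic data with placeholder feature names, standing in for the loan data.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(5)]

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
importance_df = feature_importance_table(tree, names)
print(importance_df)
```

The same helper works for any fitted model that exposes `feature_importances_`, which includes sklearn's random forest and the XGBoost classifier.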

[Figure: Feature Importance for Loan Prediction using Machine Learning]

To better understand, let's plot this column as a bar graph and see if we can find some patterns from it that you can apply to the real-world loan prediction machine learning problem. 
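A minimal plotting sketch with Matplotlib follows. The importance values are made up purely to show the plotting call and are not results from this project; the feature names follow the column naming used in this article.

```python
import matplotlib
matplotlib.use("Agg")            # render without needing a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up importance values for illustration only (not results from the article).
importance_df = pd.DataFrame({
    "feature": ["Credit_History", "ApplicantIncome", "LoanAmount", "Property_Area_Urban"],
    "importance": [0.70, 0.15, 0.10, 0.05],
})

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(importance_df["feature"], importance_df["importance"])
ax.invert_yaxis()                # put the most important feature on top
ax.set_xlabel("Importance")
ax.set_title("Feature importance")
fig.tight_layout()
```

In an interactive session, calling `plt.show()` (with the default backend instead of Agg) displays the chart; `fig.savefig(...)` writes it to a file.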

For Decision Tree

[Figure: Plotting Feature Importance for Loan Prediction using Machine Learning]

[Figure: Bar Graph for Feature Importance, Decision Tree]

As we saw in the theoretical discussion of loan prediction, credit history is an essential feature for a decision tree to base its final prediction on. Most other features added little to no influence over the final prediction.

This hints at a problem which resulted in most of our models performing similarly. One of the features is significantly more important than the rest, and thus, it biases the loan prediction model performance.

For Random forest

We observe a similar trend in random forests. Credit history, loan amount, and applicant income greatly influence the final decision. For instance, applicants with high applicant and co-applicant incomes and a good credit history have an excellent chance of getting loan approval.

[Figure: Bar Graph for Feature Importance, Random Forest]

The rest of the features have some but insignificant influence or importance when compared to these.

For XGBoost

XGBoost spread importance more evenly across features. Given its partial dependency on trees and the boosting algorithm, it found some importance in other features. Property area, loan amount, co-applicant income, dependents, and marital status each have small but non-zero importance in the prediction. However, credit history remains the most crucial feature in determining loan status.

[Figure: Bar Graph for Feature Importance, XGBoost]


FAQs on Loan Prediction using Machine Learning Project

1. Why is Loan Prediction Needed?

Manual processing of loan applications is a long, cumbersome, error-prone, and often biased process. It might lead to financial disaster for banks and obstruct genuine applicants from getting the needed loans. Loan prediction using machine learning tools and techniques can help financial institutions quickly process applications by rejecting high-risk customers entirely, accepting worthy customers, or assigning them to a manual review. Such processes, with machine learning-based loan prediction in place, can reduce loan processing times by nearly 40%.

2. What is Loan Prediction Analysis?

Loan prediction analysis uses specific parameters about a loan application to determine whether or not the loan should get approved. Approved loans usually have a good credit history, decent applicant income, and reliability in other factors. Banks use statistical and manual methods to verify these factors and decide about the applicant's loan status.

3. Which algorithm is best for Loan Prediction using Machine Learning?

This article showed how traditional machine learning approaches such as SVM and XGBoost perform well on a standard dataset. In practice, depending on the type of dataset, these models are likely to give competitive performance. In other cases, such as those where regular payments over time are a deciding factor, sequence models such as RNNs or LSTMs may perform better.

Takeaway

This article discussed the importance and relevance of using machine learning for loan prediction. We saw some existing approaches and datasets used to approach loan eligibility prediction and how AI might help smoothen this process. Finally, we built an end-to-end loan prediction machine learning project using a publicly available dataset from scratch. At the end of this project, one would know how different features influence the model prediction and how specific attributes affect the decision more than the other features. Only building machine learning projects from scratch, even as beginners, will naturally bring such insights to light and give a comprehensive view of a machine learning problem.

 
