Risk Analytics - Finance & Banking industries. Classification model to analyze the risk of the customer for loan delinquency. (Decision Tree, Hyperparameter tuning, Cost complexity pruning, Hypothesis testing)
DRS bank is facing challenging times. Their NPAs (Non-Performing Assets) has been on a rise recently and a large part of these are due to the loans given to individual customers(borrowers). Chief Risk Officer of the bank decides to put in a scientifically robust framework for approval of loans to individual customers to minimize the risk of loans converting into NPAs and initiates a project for the data science team at the bank. You, as a senior member of the team, are assigned this project.
To identify the criteria to approve loans for an individual customer such that the likelihood of the loan delinquency is minimized
What are the factors that drive the behavior of loan delinquency?
- ID: Customer ID
- isDelinquent : indicates whether the customer is delinquent or not (1 => Yes, 0 => No)
- term: Loan term in months
- gender: Gender of the borrower
- age: Age of the borrower
- purpose: Purpose of Loan
- home_ownership: Status of borrower's home
- FICO: FICO (i.e. the bureau score) of the borrower
Transactor – A person who pays his due amount balance full and on time. Revolver – A person who pays the minimum due amount but keeps revolving his balance and does not pay the full amount. Delinquent - Delinquency means that you are behind on payments, a person who fails to pay even the minimum due amount. Defaulter – Once you are delinquent for a certain period your lender will declare you to be in the default stage. Risk Analytics – A wide domain in the financial and banking industry, basically analyzing the risk of the customer.
The Chi-Square test is a statistical method to determine if two categorical variables have a significant correlation between them.
Null Hypothesis - There is no association between the two variables.
Alternate Hypothesis - There is an association between two variables.
- P-value for all tests < 0.01. Hence, all the differences that we see in the 3 plots are statistically significant.
- There is a correlation between FICO Score and house_ownership. People who have mortgaged their houses have higher FICO scores than people who own the house (peculiar!).
- There is a correlation between FICO Score and gender. More females have >500 FICO scores as compared to Males.
- There is a correlation between FICO Score and age. People >25 years of age have higher FICO scores as compared to people of age 20-25.
What does a bank want?
- A bank wants to minimize the loss - it can face 2 types of losses here:
- Whenever a bank lends money to a customer, they don't return that.
- A bank doesn't lend money to a customer thinking a customer will default but in reality, the customer won't - opportunity loss.
Which loss is greater ?
- Lending to a customer who wouldn't be able to pay back.
Since we want to reduce loan delinquency we should use Recall as a metric of model evaluation instead of accuracy.
- Recall - It gives the ratio of True positives to Actual positives, so high Recall implies low false negatives, i.e. low chances of predicting loan-delinquent customers as a non-loan-delinquent customer.
- FICO, term and gender (in that order) are the most important variables in determining if a borrower will get into a delinquent stage
- No borrower shall be given a loan if they are applying for a 36 month term loan and have a FICO score in the range 300-500.
- Female borrowers with a FICO score greater than 500 should be our target customers.
- Criteria to approve loan according to decision tree model should depend on three main factors - FICO score, duration of loan and gender that is - If the FICO score is less than 500 and the duration of loan is less than 60 months then the customer will not be able to repay the loans. If the customer has greater than 500 FICO score and is a female higher chances that they will repay the loans.