I've initially done this project for an interview. I then reviewed it and pushed the analysis further.
I used spark framework in order to easily perform computation on datasets. Altough better adapted for larger datasets, the use of the framework was a requirement for this exercise. I used a docker container that contains all the needed libraries (Spark, Jupyter, etc).
Based on a public dataset named PKDD'99, I first run basic analysis to understand better transactions and loan amounts.
Then I perform a credit risk prediction using several models. Finally, I do a model selection with my model suggestion.
Data come from the public dataset PKDD'99 and are separated in several files. I work on 2 of them:
- loan.csv
- trans.csv
Although my work focus on the 2 above files, I do the preprocessing on all files.
In this part I work on loading the data, casting them in the correct format and removing unusable ones. I choose to remove wrong data format (we can easily find again the wrong records).
For the dates records, I work with String only for a better display.
This section consists of looking at basic statistics such as mean, variance, etc. of the transaction and loan tables.
The objective is to build a model that classifies if a loan will be paid or not.
In this section I conducted simple analysis to understand better the data.
I focus on the target field that represents the loan status. From the data provider:
- 'A' stands for contract finished, no problems,
- 'B' stands for contract finished, loan not payed,
- 'C' stands for running contract, OK so far,
- 'D' stands for running contract, client in debt
Most clients have a running contract that is OK so far.
Among the sample, 7+5=12% of the loans are missing a payment.
In order to predict whether a loan will be paid or not in a relevant manner, I choose to focus on the following features that will be included in my model:
- date when the loan was granted
- amount of money
- duration of the loan
- type of card
Status of paying off the loan will be used as the value to predict.
In this section I focus on operation to prepare the data for machine learning algorithms, that is:
- join
- conversion: I convert status (string) to integer
- standardization: used to make sure all variables contribute equally
Grouping the data by type, I could show 3 dimensions: type of customer, loan amount and loan status.
--> most of the clients who cannot pay their loans do not have a card
--> a large proportion of contracts that finished without any issue are for lower amounts
--> no junior client are in debt for paying a contract
Although this graph implies that using the type is a relevant feature, the next will draw the opposite conclusion.
Using the PCA, we can reduce the data to 2 dimensions only (explaining 72% of the variance) and plot them:
Based on this graph we can see that the data are largely clusterizable.
I perform the prediction firstly using a multiclass approach, that is predicting A, B, C or D as the loan status. I then do a binary prediction, that is if the loan is paid (A, B, C) or not (D).
I choose to start with one of the easiest algorithm; it's also the model I know the best so I am able to better extract information of it.
From the OLS results, we can see that the relationship is highly significant globally since p-value associated with Fisher stat is very low. All variables are significant with date and duration being the most important factors. Type is actually not that important (contrary to what we expected in previous part).
Linear regression gives an accuracy score of 60%. Running a cross validation shows that the variance is high.
Using binary classification gives a strongly better score of 96%.
The decision tree is a simple algorithm for non linear relations. It gives an accuracy score of 85% in multiclass approach and roughly 60% in binary.
The tree depth is 15, which is already too high to give a good interpretability:
AdaBoost is an algorithm belonging to boosting methods; it may be well adapted with such few data as we won't be penalised by the computational time.
It gives 69% in multiclass and roughly 80% in binary after conducting a grid search.
KNN algorithm gives high score in binary and multiclass approach. As seen previously, clustering methods seem a good option for this problem.
With binary approach, those predictions can lead to 2 types of errors:
- False positive: the model wrongly predicted that the client will pay its loan
- False negative: the model wrongly predicted that the client won't pay its loan
A bank would probably want to make sure the loans are indeed paid. They will thus be in favor of an algorithm that minize the first error (low false positive).
Thus, AdaBoost seems the best algorithm.
Limits: explainability, computational time with more data.