Financial data analytics

I initially did this project for an interview. I later reviewed it and pushed the analysis further.

Context

I used the Spark framework to perform computations on the datasets. Although Spark is better suited to larger datasets, using it was a requirement for this exercise. I used a Docker container that contains all the needed libraries (Spark, Jupyter, etc.).

Based on a public dataset named PKDD'99, I first run a basic analysis to better understand transactions and loan amounts.

Then I perform a credit risk prediction using several models. Finally, I compare the models and suggest the one I would select.

Data

The data come from the public dataset PKDD'99 and are split across several files. I work on two of them:

loan.csv
trans.csv

Preprocessing

Although my work focuses on the two files above, I do the preprocessing on all files.

In this part I load the data, cast each field to the correct type and remove unusable records. I chose to remove records with a wrong data format (the rejected records can easily be retrieved afterwards).

For date fields, I keep the values as strings only, for better display.
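The load, cast and reject steps described above could look roughly like the sketch below. It uses only the Python standard library rather than Spark, and the sample rows, the column names and the `;` delimiter are assumptions made for illustration, not taken from the notebook.

```python
import csv
import io

# Made-up sample in the assumed shape of loan.csv (hypothetical values).
RAW = """loan_id;account_id;date;amount;duration;payments;status
1;10;930705;96396;12;8033.0;B
2;11;930711;not_a_number;36;4610.0;A
3;12;930728;127080;60;2118.0;A
"""

def load_loans(text):
    """Cast each row to typed fields; set aside rows that fail to parse."""
    clean, rejected = [], []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        try:
            clean.append({
                "loan_id": int(row["loan_id"]),
                "date": row["date"],              # kept as a string for display
                "amount": float(row["amount"]),
                "duration": int(row["duration"]),
                "status": row["status"],
            })
        except (ValueError, KeyError, TypeError):
            rejected.append(row)                  # kept so bad records can be found again
    return clean, rejected

clean, rejected = load_loans(RAW)
print(len(clean), "usable rows,", len(rejected), "rejected")
```

Keeping the rejected rows in a separate list is one simple way to make the removed records easy to inspect later.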

Analytics

This section consists of looking at basic statistics such as mean, variance, etc. of the transaction and loan tables.
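As a small illustration of the kind of summary meant here, the sketch below computes a few basic statistics with the Python standard library. The amounts are invented for the example and are not values from the dataset; in Spark, `DataFrame.describe()` reports comparable summaries (count, mean, stddev, min, max).

```python
import statistics

# Illustrative amounts only; these are not values from the dataset.
amounts = [1000.0, 2500.0, 4000.0, 5500.0]

summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "variance": statistics.variance(amounts),  # sample variance (divides by n - 1)
    "min": min(amounts),
    "max": max(amounts),
}
print(summary)
```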

Credit risk prediction

The objective is to build a model that classifies whether a loan will be paid or not.

Basic analysis

In this section I conduct a simple analysis to better understand the data.

I focus on the target field that represents the loan status. From the data provider:

'A' stands for contract finished, no problems,
'B' stands for contract finished, loan not payed,
'C' stands for running contract, OK so far,
'D' stands for running contract, client in debt

Most clients have a running contract that is OK so far.

In the sample, 7% + 5% = 12% of the loans have payment problems (statuses B and D).
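A share like this can be computed by counting the status labels. The sketch below uses a toy list of labels chosen for illustration, so its output is not the dataset's real distribution; `status_shares` is a hypothetical helper name.

```python
from collections import Counter

def status_shares(statuses):
    """Return the fraction of loans in each status."""
    counts = Counter(statuses)
    total = len(statuses)
    return {s: counts[s] / total for s in sorted(counts)}

# Toy labels for illustration only, not the dataset's real distribution.
toy = ["A", "A", "C", "C", "C", "C", "B", "D"]
shares = status_shares(toy)

# B and D are the two statuses that signal a payment problem.
problem_share = shares.get("B", 0.0) + shares.get("D", 0.0)
print(shares, problem_share)
```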

Feature engineering

To predict whether a loan will be paid or not, I focus on the following features, which are included in my model:

date when the loan was granted
amount of money
duration of the loan
type of card

The loan repayment status is used as the value to predict.

In this section I focus on the operations that prepare the data for machine learning algorithms, that is:

join
conversion: I convert the status (string) to an integer
standardization: used to put all variables on the same scale so that they contribute equally
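The conversion and standardization steps can be sketched as follows. The integer codes in `STATUS_TO_INT` are an assumed mapping (the notebook may use different codes), and the z-score formula shown is the usual definition of standardization.

```python
import statistics

# Assumed mapping from status letters to integers (hypothetical codes).
STATUS_TO_INT = {"A": 0, "B": 1, "C": 2, "D": 3}

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)       # population standard deviation
    if std == 0:
        return [0.0 for _ in values]      # constant column: nothing to scale
    return [(v - mean) / std for v in values]

statuses = ["A", "C", "D"]
encoded = [STATUS_TO_INT[s] for s in statuses]

amounts = [1000.0, 2000.0, 3000.0]        # illustrative values only
z = standardize(amounts)
print(encoded, z)
```

After standardization each column has mean 0 and standard deviation 1, so a feature measured in large units (such as the amount) does not dominate one measured in small units (such as the duration).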

Data visualization

Grouping the data by card type, I could show three dimensions: type of customer, loan amount and loan status.

--> most of the clients who cannot pay their loans do not have a card

--> a large proportion of the contracts that finished without any issue are for lower amounts

--> no junior client is in debt on a contract

Although this graph suggests that the type is a relevant feature, the following analysis leads to the opposite conclusion.

Using PCA, we can reduce the data to only 2 dimensions (explaining 72% of the variance) and plot them:

Based on this graph, we can see that the data can largely be grouped into clusters.
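As a reminder of what the 72% figure measures, the share of variance explained by the first k principal components is conventionally defined from the eigenvalues of the covariance matrix of the standardized features. This is the standard definition, stated here for reference rather than taken from the notebook; p denotes the number of features.

```latex
\text{explained variance ratio}(k) \;=\; \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j},
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
```

With k = 2, the ratio reported above is 0.72.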

Prediction

I first perform the prediction using a multiclass approach, that is, predicting A, B, C or D as the loan status. I then do a binary prediction, that is, whether the loan is paid (A, B, C) or not (D).
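The move from the multiclass target to the binary one is a relabelling of the status. A minimal sketch, using the grouping stated above; the function name `to_binary` and the choice of 1 for "paid" are assumptions:

```python
VALID_STATUSES = {"A", "B", "C", "D"}

def to_binary(status):
    """Return 1 if the loan counts as paid (A, B, C) and 0 otherwise (D)."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return 0 if status == "D" else 1

multiclass = ["A", "B", "C", "D", "C"]
binary = [to_binary(s) for s in multiclass]
print(binary)
```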

Linear regression

I chose to start with one of the simplest algorithms; it is also the model I know best, so I can extract more information from it.

From the OLS results, we can see that the relationship is globally highly significant, since the p-value associated with the Fisher statistic is very low. All variables are significant, with date and duration being the most important factors. Type is actually not that important (contrary to what we expected from the previous part).

Linear regression gives an accuracy score of 60%. Running a cross-validation shows that the variance is high.
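To make the cross-validation step concrete, here is a minimal k-fold sketch in plain Python. It is not the notebook's code: the labels are toy data, the folds are contiguous and unshuffled, and a majority-class baseline stands in for the real model. The variance of the fold scores is the quantity referred to above.

```python
import statistics
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n_test):
    """Stand-in model: always predict the most frequent training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n_test

# Toy labels for illustration only.
y = ["C", "C", "A", "C", "D", "C", "A", "C", "C", "B"]

scores = []
for train, test in k_fold_indices(len(y), k=5):
    preds = majority_baseline([y[i] for i in train], len(test))
    scores.append(accuracy([y[i] for i in test], preds))

print("fold scores:", scores)
print("mean:", statistics.mean(scores), "variance:", statistics.variance(scores))
```

A large spread between fold scores means the reported accuracy depends heavily on which rows end up in the test fold, which is more likely with a small dataset.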

The binary classification gives a much better score of 96%.

Decision tree

The decision tree is a simple algorithm for non-linear relations. It gives an accuracy score of 85% with the multiclass approach and roughly 60% with the binary approach.

The tree depth is 15, which is already too deep for good interpretability:

AdaBoost

AdaBoost is a boosting algorithm; it may be well suited to such a small dataset, as we won't be penalised by the computation time.

It gives 69% in multiclass and roughly 80% in binary after a grid search.

KNN

The KNN algorithm gives a high score with both the binary and multiclass approaches. As the cluster structure seen previously suggests, a method based on neighbouring points seems a good option for this problem.
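For readers unfamiliar with KNN, the idea is to label a point by a majority vote among its k nearest training points. The sketch below is a minimal standard-library version with toy, already standardized features; it is an illustration of the technique, not the notebook's implementation, and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points (Euclidean distance)."""
    dists = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy standardized features (illustrative only): [amount, duration]
train_X = [[-1.0, -1.0], [-0.8, -1.2], [-1.1, -0.9], [1.0, 1.0], [1.2, 0.8]]
train_y = [1, 1, 1, 0, 0]   # 1 = paid, 0 = not paid (assumed coding)

print(knn_predict(train_X, train_y, [-0.9, -1.0]))
print(knn_predict(train_X, train_y, [1.1, 0.9]))
```

Because the vote depends on distances, KNN benefits from the standardization step: without it, the feature with the largest numeric range would dominate the distance.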

Model selection

Conclusion

With the binary approach, the predictions can lead to 2 types of errors:

False positive: the model wrongly predicted that the client will pay their loan
False negative: the model wrongly predicted that the client won't pay their loan

A bank would probably want to make sure the loans are indeed paid. It will thus favor an algorithm that minimizes the first type of error (low false positive rate).
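The two error types can be counted directly from the true and predicted labels. The sketch below treats "paid" as the positive class (coded 1), in line with the definitions above; the labels are toy data and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary prediction."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1          # predicted "will pay", but the loan was not paid
        elif t == positive:
            fn += 1          # predicted "won't pay", but the loan was paid
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

def false_positive_rate(counts):
    """Share of unpaid loans that the model predicted as paid."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0

# Toy labels for illustration only: 1 = paid, 0 = not paid.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts, false_positive_rate(counts))
```

Comparing models on the false positive rate, rather than on accuracy alone, matches the bank's priority of avoiding loans that are predicted as paid but are not.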

Thus, AdaBoost seems to be the best algorithm.

Limits: explainability, computation time with more data.
