Hello! This write-up is based on TensorFlow and the DataHandler module that will be the precursor to the Python package “kero”. It is my exploratory attempt at solving the loan problem with a limited number of layers and short training, with the focus on pre-processing. I take this chance to survey some scientific journal articles on credit scoring.
Why is pre-processing worth so much consideration? We use machine learning algorithms to help us with complex patterns that are typically too complex to formulate easily by hand. Imagine a problem with 1000 variables where it turns out the output y depends on only 1 of the 1000; we might have wasted resources or, more regrettably, have been careless enough to miss an easy trend. In that spirit, we will dissect the features with the kero precursor, and see that correlating a few variables to the output will not be quite successful.
The full report, including annex, can be found here.
Credit scoring was introduced in the 1950s to predict the probability that a loan applicant will default. It is especially useful in enabling loans for small businesses, which in turn allows a bank or lender to diversify (Mester, 1997). With credit scoring, the approval or rejection of a loan is accelerated (Allen, 1995), and it is thus worth exploring the many techniques available (Thomas, 2000) to optimize the process further, saving time and cost while helping lenders estimate profits in the loan business. Further studies (Frame, Srinivasan, & Woosley, 2001) concluded that credit scoring reduces information costs. It should be noted, though, that credit scoring is not strictly a matter of listing items from a prospective borrower’s profile, such as income, employment history, etc. For example, relationship lending is a powerful force capable of steering the direction of further credit availability and terms. This is especially important in the light of a dynamic economic environment, where continual assessment and gathering of ‘soft information’ become crucial to assessing credit availability at any given moment (Berger & Udell, 2002).
More in-depth studies, employing statistical tools, genetic algorithms and machine learning algorithms, have been conducted recently as well. Big data approaches, in which even factors not obviously related to credit-worthiness are used, are believed both to improve credit scoring fairness and to help reach out to underserved customers (Hurley & Adebayo, 2016). Continuous models have been shown to outperform some of the traditional classification models; in the same study, profit optimization in credit scoring is shown to perform better than general error-minimization methods (Finlay, 2009). A specific study on how to deal with missing data has also been conducted (Bücker, van Kampen, & Krämer, 2013). Imbalanced data sets occur often in credit scoring, since the relative number of defaults or bad loans can be low. The effect of such imbalance has been studied experimentally, and the performance of different algorithms assessed for the most optimal scoring; in fact, algorithms unsuited to imbalanced data sets can perform badly enough to warrant caution (Brown & Mues, 2012). There is also a study on combinations of existing classifiers (Ala’raj & Abbod, 2016): individual classifier opinions are cooperatively used to reach a consensus opinion, with some studies forming a framework and modelling the interaction between the opinions.
The problem is listed here (datahack.analyticsvidhya.com, n.d.) and can be described shortly as follows: for each applicant (with or without a co-applicant), given the 12 properties shown in table 1, decide if they are eligible for the loan.
| No. | Property | Description |
| --- | --- | --- |
| 1 | Loan_ID | Unique Loan ID |
| 2 | Gender | Applicant gender (Male/ Female) |
| 3 | Married | Applicant married (Y/N) |
| 4 | Dependents | Number of dependents |
| 5 | Education | Applicant Education (Graduate/ Under Graduate) |
| 6 | Self_Employed | Self employed (Y/N) |
| 7 | ApplicantIncome | Applicant income |
| 8 | CoapplicantIncome | Coapplicant income |
| 9 | LoanAmount | Loan amount in thousands |
| 10 | Loan_Amount_Term | Term of loan in months |
| 11 | Credit_History | Credit history meets guidelines |
| 12 | Property_Area | Urban/ Semi Urban/ Rural |
Table 1. Table showing the properties of each loan applicant.
Figure 1. The train set, showing some columns and a few applicants.
We are given a data file (called the training set) containing 614 applicants, each with the 12 properties listed in the table and 1 more column for loan status that takes the value Y (yes) or N (no), i.e. a csv file with 614 rows (excluding the header) and 13 columns. Y means the applicant is eligible, N not eligible. The file is partially shown in figure 1. We are similarly given a test set with 367 applicants and the 12 properties. For each applicant in the test set, we want to predict their loan status, Y or N. Once the prediction is done for all applicants in the test set, we submit the result to the same site and find out how accurate our prediction is, reported as a fraction with 1.0 meaning 100% accurate.
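As a quick illustration of the data layout described above, here is a minimal sketch that parses two made-up rows in the same 13-column shape. The column names follow table 1; the sample values are invented, not taken from the actual file:

```python
import csv
import io

# Two invented rows mimicking the train-set layout: Loan_ID,
# 11 applicant properties, and the Loan_Status target (13 columns).
sample = """\
Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP000001,Male,No,0,Graduate,No,5000,0,130,360,1,Urban,Y
LP000002,Female,Yes,1,Not Graduate,No,3000,1500,100,360,0,Rural,N
"""

rows = list(csv.DictReader(io.StringIO(sample)))
labels = [r.pop("Loan_Status") for r in rows]  # separate target from features
```

The same parsing applies to the test set, which simply lacks the `Loan_Status` column.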
2. Methods and Discussion
2.1. Blind learning.
A neural network is employed, and the full code can be found here [Tjoa, 2018], in the loan1.1.ipynb Jupyter notebook. All 12 properties are used as features for model training. We set a small learning rate, 0.001, to obtain greater precision, and a relatively large number of epochs, 12000, to compensate for the small learning rate. We achieved an accuracy of 0.975 on the training set. Unfortunately, when compared against the real external data, the accuracy was 0.63194. For consistency we repeated the training with different settings (a smaller number of epochs and a different number of hidden neurons), obtaining 0.95625 accuracy on the training set and 0.62578 on the test set. This is a symptom of overfitting. Some features might be irrelevant or unimportant in the actual consideration, hence it is imperative that we try to select features, out of the 12, to discard. Otherwise more hidden layers may be needed.
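The notebook itself uses TensorFlow; as a self-contained sketch of the same training settings (learning rate 0.001, a large number of epochs to compensate), the loop below trains a one-hidden-layer network in plain NumPy. The toy data, hidden-layer size and mean-squared-error loss here are placeholders of mine, not the notebook's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the encoded loan data: 100 samples, 12 features,
# a binary label. The real notebook trains on the 614-row train set.
X = rng.normal(size=(100, 12))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer of 8 sigmoid units feeding a sigmoid output.
W1 = rng.normal(scale=0.5, size=(12, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, epochs = 0.001, 12000  # small step, many epochs to compensate
losses = []
for _ in range(epochs):
    h = sigmoid(X @ W1 + b1)              # forward pass
    p = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((p - y) ** 2)))
    dp = 2 * (p - y) / y.size * p * (1 - p)   # backprop of MSE loss
    dW2 = h.T @ dp; db2 = dp.sum(0)
    dh = dp @ W2.T * h * (1 - h)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

acc = float(np.mean((p > 0.5) == y.astype(bool)))  # training-set accuracy
```

The point is only to make the trade-off concrete: a smaller learning rate needs more epochs to converge, and a high training-set accuracy says nothing about the held-out test set.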
2.2. Analyzing features.
In another notebook, loan2.0.ipynb, the relation between each feature and the loan status is observed. Only data points without defects are used to plot the tables and scatter plots. Features found uncorrelated in this analysis will be dropped during the actual training. By uncorrelated, we mean features that do not seem to affect the loan status.
The majority of eligible (Y) applicants (325) have credit history that meets the guideline; only a few eligible applicants (7) do not meet it. However, the causality is, as expected, not direct. In fact, there are more applicants who meet the guideline amongst the ineligible applicants than those who do not (85 vs 63). We can observe the trend from another direction: of those who do not meet the guideline, 90% are not eligible; of those who meet it, 79.3% are eligible.
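The quoted percentages follow directly from these counts; a short check (counts copied from table 2):

```python
# Counts from table 2: credit history meets (1.0) or fails (0.0)
# the guideline, versus loan status Y/N.
counts = {
    ("meets", "Y"): 325, ("meets", "N"): 85,
    ("fails", "Y"): 7,   ("fails", "N"): 63,
}

meets_total = counts[("meets", "Y")] + counts[("meets", "N")]  # 410 applicants
fails_total = counts[("fails", "Y")] + counts[("fails", "N")]  # 70 applicants

eligible_if_meets = counts[("meets", "Y")] / meets_total    # 325/410, about 0.793
ineligible_if_fails = counts[("fails", "N")] / fails_total  # 63/70 = 0.90
```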
Table 2. Table relating loan status (Y or N) to credit history (0.0 or 1.0). The table is arranged in 3 sections. The left section (dark blue) tabulates the number of data points, e.g. 325 applicants whose credit history meets guidelines (1.0) are eligible (Y), while only 7 whose credit history does not meet guidelines (0.0) are eligible. The next section, lighter blue, is the same table except that the numbers are fractions. The last section, lightest blue, is in fractions as well, but transposed. For both fraction sections, the fractions sum to 1 down each column.
The property area describes the property that an applicant owns. The properties are distinguished from one another by location: rural, semi-urban or urban. A naïve interpretation can be drawn by relating wealth in assets with urban property and the lack thereof with rural property. However, from table 3, it appears a semi-urban property owner is more likely to be assessed as eligible. There does not seem to be a striking difference in eligibility between rural and urban property owners.
| Property area | N (count) | Y (count) | N (column fraction) | Y (column fraction) | N (row fraction) | Y (row fraction) |
| --- | --- | --- | --- | --- | --- | --- |
| Semi Urban | 42 | 149 | 0.283784 | 0.448795 | 0.219895 | 0.780105 |
Table 3. Table to relate loan status and property area.
Applicant and co-applicant income
Applicant income alone does not seem to be a good predictor of eligibility. We can see the bulk of applicants have income in the $3000-$6000 range. However, the proportion of applicants eligible for the loan does not increase with income, i.e. the company does not approve a loan just because an applicant is wealthier and apparently more able to repay. On the other hand, the company does not approve a loan just because an applicant is needy. As annex figure A1 shows, including co-applicant income does not greatly alter the trend.
Loan amount term, marriage status, self-employment, gender and graduate status
Most applicants chose a 360-month (30-year) loan amount term, and, amongst them, 71% are eligible. This feature should be dropped, since the next most popular choice, 180 months, has a sample size roughly 10% of the former (411 vs 36 applicants); the feature is too dominated by a single value to be informative.
In annex table A1, a married applicant is more likely to be eligible, although the trend is not strong either. Self-employment (annex table A2), number of dependents, gender (annex table A3) and graduate status (annex table A4) do not seem to give rise to any strong trend either, since the eligibility counts seem to drop in proportion to the sample size. For example, there are 278 eligible male applicants with 116 not eligible, and 54 eligible female applicants with 32 not eligible. By proportion alone, it seems males are more likely to be eligible; however, since the female sample is significantly smaller, this could simply be attributed to chance.
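To make the "attributed to chance" remark concrete, a standard two-proportion z-test on the quoted counts (my addition, not part of the original notebooks) gives a statistic inside the 5% significance threshold of 1.96:

```python
import math

def two_proportion_z(y1, n1, y2, n2):
    """z statistic for the null hypothesis that both groups share
    a single underlying eligibility rate."""
    p1, p2 = y1 / n1, y2 / n2
    pooled = (y1 + y2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Males: 278 eligible of 278 + 116; females: 54 eligible of 54 + 32.
z = two_proportion_z(278, 278 + 116, 54, 54 + 32)  # roughly 1.4
```

Since |z| < 1.96, the male/female gap in eligibility rates is not significant at the 5% level.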
Number of dependents
As it is, the relation between the number of dependents and loan status suffers from the same inadequacy. However, we might perform a slightly different analysis, grouping applicants into only two groups: with or without dependents. This matters, since a family with dependents may have a different mentality from one without. Their demographic distributions may also differ; for example, applicants without dependents may be younger, with some implications for credit worthiness. Unfortunately, this trend is not strong either (annex table A5).
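The regrouping amounts to collapsing the Dependents column into a binary flag; assuming the column stores strings such as "0", "1", "2" and "3+", a minimal sketch:

```python
def has_dependents(value):
    """Collapse the Dependents column into a with/without flag.

    Assumes string-encoded values such as "0", "1", "2", "3+".
    """
    return value != "0"

groups = [has_dependents(v) for v in ["0", "1", "2", "3+"]]
```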
2.3. Combinations of Factors
Analysing the features by matching one feature at a time to loan status hardly yields any strong relationship between a single feature and eligibility status, except perhaps for credit history. We therefore turn to the relations between 2 simultaneous features and eligibility status in the following section. However, we should not generalize this to many features, since doing so dilutes the number of samples per combination of features: for example, there are only 11 male applicants who are not married and not graduates. We will pick only some promising combinations, as shown in table 4.
| Combinations | Y/N fraction | No. of applicants |
| --- | --- | --- |
| Male, semi urban | 1/0.28 | 127/35 |
| Married, 0 dependents | 1/0.35 | 117/42 |
| Married, semi urban | 1/0.22 | 114/25 |
| 0 dependents, semi urban | 1/0.27 | 96/26 |
| 2 dependents, self employed | 0.312/0.087 ≈ 1/0.28 | 61/17 |
Table 4. A tabulation of two-feature combinations, their loan statuses and the number of applicants corresponding to the combination.
A relation between a two-feature combination and loan status is stronger if both the Y/N ratio and the number of approved applicants are high. By contrast, in the last row, the Y/N ratio is high but the number of approved applicants in the category is relatively small, rendering the relation less reliable. The table reveals a trade-off between the Y/N fraction and the number of approved applicants. Informally, we do not have a strong relation between any 2 features and the loan status. For completeness, we perform the three-feature vs loan status analysis in the same fashion.
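The tallies behind tables 4 and 5 can be produced by one generic routine; the sketch below runs on invented rows (only the column names follow table 1):

```python
from collections import Counter

def combo_table(rows, features):
    """Tally loan status for each combination of values of the given
    feature columns. Returns {combo: (n_yes, n_no)}."""
    tally = Counter()
    for r in rows:
        tally[(tuple(r[f] for f in features), r["Loan_Status"])] += 1
    combos = {c for c, _ in tally}
    return {c: (tally[(c, "Y")], tally[(c, "N")]) for c in combos}

# Invented rows, for illustration only.
rows = [
    {"Gender": "Male", "Property_Area": "Semiurban", "Loan_Status": "Y"},
    {"Gender": "Male", "Property_Area": "Semiurban", "Loan_Status": "Y"},
    {"Gender": "Male", "Property_Area": "Urban", "Loan_Status": "N"},
    {"Gender": "Female", "Property_Area": "Semiurban", "Loan_Status": "Y"},
]
table = combo_table(rows, ["Gender", "Property_Area"])
```

The same function applies unchanged to three features by passing a third column name, which is how the sample-dilution problem shows up: more features means fewer rows per combination.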
| Combinations | Y/N fraction | No. of applicants |
| --- | --- | --- |
| Male, married, graduate | 1/0.32 | 195/62 |
| Male, graduate, self-employed | 1/0.38 | 216/83 |
| Married, graduate, semi-urban | 1/0.19 | 95/18 |
| Male, married, semi-urban | 1/0.25 | 97/24 |
| Married, semi-urban, 0 dependents | 1/0.15 | 55/8 |
Table 5. Similar to table 4, except for 3 features.
At this point, the trend indicates that a well-performing Y/N ratio is favoured by ownership of semi-urban property, while the number of approved applicants is dominated by married male or graduate male applicants. We thus drop all indicators except credit history, semi-urban property ownership, and the applicant being a male graduate or a married male. With the same number of epochs, our training attained only 0.81785 accuracy on the training set but predicts the test outcome better, with 0.76389 accuracy.
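The reduced feature set can be sketched as a small encoder. The exact value encodings ("1", "Semiurban", "Yes", etc.) are my assumptions about the csv contents, not taken from the notebook:

```python
def reduced_features(row):
    """Encode one applicant using only the indicators kept for the
    second training run: credit history, semi-urban property, and
    the male-graduate / married-male combinations."""
    male = row["Gender"] == "Male"
    return [
        1.0 if row["Credit_History"] == "1" else 0.0,
        1.0 if row["Property_Area"] == "Semiurban" else 0.0,
        1.0 if male and row["Education"] == "Graduate" else 0.0,
        1.0 if male and row["Married"] == "Yes" else 0.0,
    ]

example = {"Gender": "Male", "Married": "Yes", "Education": "Graduate",
           "Credit_History": "1", "Property_Area": "Rural"}
```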
As seen, credit rating prediction via a few layers using all features yielded results that (1) might have suffered from overfitting or (2) were simply not fine-tuned in their hyper-parameters. Dropping some columns believed to be irrelevant does improve the prediction, though not enough to qualify as successful. Further work includes automating hyper-parameter tuning and perhaps using a different machine learning model.
Ala’raj, M., & Abbod, M. F. (2016). Classifiers consensus system approach for credit scoring. Knowledge-Based Systems.
Allen. (1995). A Promise of Approvals in Minutes, Not Hours. American Banker.
Berger, A. N., & Udell, G. F. (2002). Small Business Credit Availability and Relationship Lending: The Importance of Bank Organisational Structure. The Economic Journal.
Brown, I., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications.
datahack.analyticsvidhya.com. (n.d.). Retrieved July 2018, from https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/?utm_source=auto-email
Finlay, S. (2009). Credit Scoring for Profitability Objectives. European Journal of Operational Research.
Frame, W. S., Srinivasan, A., & Woosley, L. (2001). The Effect of Credit Scoring on Small Business Lending. Journal of Money, Credit, and Banking.
Hurley, M., & Adebayo, J. (2016). Credit Scoring in the Era of Big Data. Yale Journal of Law and Technology.
Mester, L. J. (1997). What’s the Point of Credit Scoring? Business Review.
Bücker, M., van Kampen, M., & Krämer, W. (2013). Reject inference in consumer credit scoring with nonignorable missing data. Journal of Banking & Finance.
Thomas, L. C. (2000). A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers. International Journal of Forecasting.
Tjoa, E. (2018). DNN loan problem. Retrieved from https://github.com/etjoa003/machine_learning/tree/master/DNN%20classification%2C%20loan%20problem