
Machine Learning-Based Credit Analytics for SME Finance

Dimitar Mechev

Student number: 11417765

Master of Business Administration Thesis

For obtaining the degree of Master of Business Administration with a specialization in Big Data & Business Analytics at the Amsterdam Business School


Abstract

The higher capital requirements and the new leverage rules imposed on banks have had a disproportionately negative effect on lending to small and medium-sized enterprises (SMEs). The lack of standardised, verifiable and accessible financial information about SMEs increases their perceived riskiness and has led to limited financing options. In order to overcome the information barriers and gain competitive advantage, both traditional and non-banking finance providers are increasingly using alternative information sources. The higher volume, velocity and variety of data, however, also require new techniques for credit risk assessment. This thesis presents a blueprint for a machine learning-based credit analytics framework that aims to help SME funding providers utilise alternative data sources for reliable and transparent credit decision-making.


Table of Contents

Abstract

Table of Contents

1 Introduction
1.1 Background and Context
1.2 Scope and Objectives
1.3 Thesis structure

2 Relevant Developments
2.1 Introduction to credit risk assessment
2.2 Introduction to inductive Machine Learning
2.3 Default risk modelling
2.3.1 Data
2.3.2 Models
2.3.2.1 Individual classifiers
2.3.2.2 Ensemble classifiers
2.4 Incorporating prior domain knowledge
2.4.1 Virtual examples
2.4.2 Data preparation
2.4.3 Altering the search objective
2.5 Explainability
2.5.1 Deep explanation
2.5.2 Model induction
2.5.3 Interpretable models

3 SME credit assessment framework
3.1 Data
3.1.1 Data strategy
3.1.2 Data sources
3.1.2.1 Financial statements
3.1.2.2 Transaction data
3.1.2.3 Supply-chain performance data
3.1.2.4 Other data
3.1.3 Data pre-processing and feature engineering
3.1.3.1 Standardization
3.1.3.2 Statistical transformations
3.1.3.3 Binarization
3.1.3.4 Encoding categorical features
3.1.3.5 Feature engineering
3.1.4 Ground truth
3.1.5 Example data sets
3.1.5.1 Data pre-processing
3.2 Machine Learning tools
3.2.1 Deciding on the goal and performance measures
3.2.2 Selected algorithms
3.2.2.1 Logistic regression
3.2.2.2 Random forest
3.2.2.3 XGBoost
3.2.3 Benchmarking and comparison

4 Conclusion
4.1 Directions for future research


1 Introduction

The research and the implementation work presented in this thesis were conducted in close cooperation with NIBC Bank and Beequip Equipment Financing. This thesis presents a blueprint for a machine learning-based credit analytics framework that aims to help SME funding providers utilise alternative data sources for reliable and transparent credit decision-making.

1.1 Background and Context

The economic and regulatory developments after the 2007-2008 financial crisis have created a conducive environment for the emergence and growth of non-banking financial companies. The post-crisis regulatory reform has concentrated on achieving a more resilient banking system, focusing predominantly on the banking sector and largely overlooking non-banking financial intermediaries.

The higher capital requirements and new leverage rules have led to increased risk aversion on the side of banks. They have also exerted a disproportionately negative effect on lending to small and medium-sized enterprises (SMEs), which are inherently riskier due to higher default rates, scarce collateral and limited credit information. In the process of deleveraging, banks have reduced riskier loans and left a significant part of the SME sector without adequate financing options, making room for alternative forms of financial intermediation.

Although bank financing is expected to remain the main funding option for SMEs, a more diversified set of alternatives is needed to finance their growth and enhance their resilience through the business cycle. The alternatives range from asset-based financing (e.g. factoring, leasing) and alternative debt (e.g. crowdfunding) to hybrid and equity instruments (e.g. private equity, venture capital, business angels, crowdfunding). An elaborate description of the various types of SME financing is given in [1].

In 2011, the European Economic and Social Committee adopted a plan [2] to improve SMEs' access to finance. Despite the measures taken at the European level to improve the conditions of SME financing, the results have been moderate. In 2016, the European Banking Authority (EBA) published a report [3] analysing the effect of capital regulations on SME lending and concluded that, despite positive growth, SME lending remained below its pre-crisis level.


The information barriers in the SME funding market have been identified as a major challenge for both traditional and alternative finance providers. In a staff working document published in 2017, the European Commission concludes that the lack of standardised, verifiable and accessible financial information about SMEs represents a significant barrier for alternative finance providers to lend to European SMEs, and that tackling this shortcoming is essential to broadening SME funding avenues.

The creditworthiness of a company is traditionally assessed primarily on the basis of information including a history of audited financial statements (balance sheet, income and cash flow statements), credit history and repayment behaviour. Therefore, the information generated from long-term relationships with SME clients (incl. credit and current accounts) has been seen as a comparative advantage by banks and is rarely shared with other market participants.

The upcoming revised Payment Services Directive (PSD2) [4] is expected to facilitate information sharing at the European level and to enable new and innovative players to compete with banks for digital financial services, including the provision of finance. PSD2 will oblige banks to share client data, subject to the clients' consent, with alternative funding providers (authorised under PSD2). This will present non-banking funding providers with an additional information source for creditworthiness analysis.

The increased access to financial data is expected not only to improve the pricing of funding providers but also to enable additional financing options for riskier SMEs (micro and small SMEs with limited financial information and SMEs with insufficient collateral).

In recent years, an increasing number of alternative data sources have been utilised for the purpose of credit assessment. In a 2017 report [6] on the transformation of SME finance, the Global Partnership for Financial Inclusion (GPFI) reports that SME lenders are utilizing a wide variety of data, including real-time sales, bank account money flows and balances, payments, social media, trading, logistics and business accounting, among others.

In order to benefit from the increase in SME data availability, volume and velocity, both traditional and alternative finance providers require an innovative approach to credit risk assessment. This new approach would need to combine the robustness and transparency of traditional credit modelling with the flexibility of machine learning algorithms to work with less structured and interdependent data.


1.2 Scope and Objectives

The goal of this thesis is to provide a blueprint for a machine learning-based framework for reliable and transparent credit risk assessment. This is achieved by providing clear guidelines for:

- assessing the potential value of various data sources for credit analysis;

- data transformation and feature engineering techniques relevant for the identified data and machine learning algorithms;

- selection of machine learning algorithms (incl. their application on example data sets);

- achieving higher explainability and incorporating prior domain knowledge.

Such a framework would help funding providers to strengthen their credit analysis of SMEs with limited financial information and/or insufficient collateral. The focus of the framework presented below is the assessment of the risk of default. Although in general the credit risk assessment process includes the estimation of the expected losses in the case of default, this aspect of the process is beyond the scope of the presented framework.

1.3 Thesis structure

Chapter 1 provides an overview of the developments and current challenges of the SME finance market. It outlines several important trends and regulatory changes that provide alternative financing providers with new opportunities.

Chapter 2 presents the status quo of default risk modelling and, more specifically, the application of machine learning algorithms for the assessment of default risk. It also discusses the various approaches for incorporating prior domain knowledge and enhancing the explainability of machine learning algorithms.

Chapter 3 presents a machine learning-based framework for credit risk assessment. Chapter 3.1 gives concrete guidelines on how to assess the potential value of data and outlines several data sources that need to be explored. It also presents several data transformation and feature engineering techniques that are crucial for the utilization of the identified data sources. The chapter also elaborates on the importance of a solid ground truth, data availability and class balance. Finally, it introduces an example data set which is later used for the evaluation of several machine learning algorithms. Chapter 3.2 presents a selection of machine learning algorithms, compares their performance and discusses their advantages and disadvantages with respect to their explainability.


2 Relevant Developments

2.1 Introduction to credit risk assessment

The credit risk analysis and assessment process focuses on five major borrower characteristics (the five Cs of credit): capacity, capital, collateral, conditions and character. These characteristics are traditionally measured by various financial ratios and assessed via direct communication with the borrower. With growing customer expectations and increasing competition, lenders are under pressure to streamline and automate the lending process. Both banks and alternative financing providers are looking into additional data sources that would allow for an automated assessment of the five major borrower characteristics.

Capacity

The first factor is the capacity to repay the debt from the expected cash flow of the business considering the projected debt service requirements.

Traditionally, lenders assess capacity by means of various financial ratios: short-term financial capacity is assessed via liquidity ratios (liquid assets to short-term liabilities) and coverage ratios (interest coverage ratio, debt service coverage ratio and asset coverage ratio); long-term financial capacity is assessed via leverage ratios (e.g. debt-to-equity ratio) and performance ratios (e.g. return on assets and return on equity). A list of financial ratios used to assess capacity is given in 3.1.2.1. Such an analysis is predominantly backward-looking and based on the historical trend of the various ratios available from the annual or semi-annual financial statements. While for larger SMEs this information is of adequate quality, for smaller companies it is not always available. With the expected availability of data on borrowers' payment accounts (due to PSD2), lenders would be able to base the assessment of borrowers' capacity on their payment patterns and real-time payment behaviour.

Capital

The capital factor is important for two major reasons. Firstly, the company needs to have sufficient equity to withstand a potential deterioration in its ability to generate sufficient cash flow. Secondly, the amount of capital is a good proxy for how much “skin in the game” the owner has. The debt-to-equity ratio is a typical measure used by lenders. This information is usually available even for the smallest SMEs.

Collateral

Collateral or guarantees are additional forms of security and sources of repayment in case the loan cannot be repaid under the agreed terms. If the company were unable to generate sufficient cash flow to repay the loan, the lender would liquidate the collateral and use the proceeds to pay off the loan.

In order to secure sufficient proceeds in case of an economic downturn and after the expected amortization of the collateral, lenders often require the amount of the collateral to exceed the amount of the loan. The typical asset classes considered as acceptable collateral are accounts receivable, inventory, equipment and real estate. In many cases the collateral is considered the primary credit risk mitigating factor and, therefore, more important than the company's capacity.

Traditionally, lenders discount the value of the collateral based on historical liquidation values per collateral class. This is a largely backward-looking approach that disregards asset-specific characteristics and current condition. With the increased availability of sensor and geolocation data, lenders would arguably be able to make a real-time and asset-specific assessment of the current value of the collateral.

Conditions

Conditions that may affect the borrower can be macroeconomic, political, regulatory or industry-specific. The assessment of the conditions is largely done on a portfolio level and based on readily available macroeconomic and industry-specific indicators, as well as country and political risk indices.

Character

Character is the subjective evaluation of the borrower by the lender; it is partly fact-based and partly reliant on “gut feeling”. Arguably, the importance of a borrower's character increases with the amount of the loan relative to the size of the portfolio and decreases with the availability of reliable and timely data. Data from online social networks are increasingly used to assess the character of the borrower – e.g. the number and nature of their connections, employees, followers and the relationship to their customers.

2.2 Introduction to inductive Machine Learning

In [5], Tom Mitchell provides the following widely used definition of Machine Learning: A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃, if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸.

On the basis of this definition, [7] further defines inductive machine learning as the process of constructing a model (or hypothesis) 𝑓(𝑥) in a given hypothesis space 𝐻 that best approximates any continuous or discontinuous unknown target function 𝑐(𝑥), i.e. 𝑓(𝑥) → 𝑐(𝑥), given sufficient training examples 𝑆 (input-output pairs (𝑥ᵢ, 𝑦ᵢ)).

The statistical learning theory raises three crucial issues related to any learning algorithm that impact its efficiency and effectiveness: consistency, generalization and convergence. Firstly, a learner must be consistent with the training examples or, in other words, must minimise a loss function 𝐿(𝑓) that measures the difference between the estimated values 𝑦ᵢ∗ and the observed values 𝑦ᵢ. Secondly, the function 𝑓(𝑥) must generalise to the whole population (including unseen examples) or, in other words, the learner must reduce the difference between the approximation based on the observed examples and the approximation based on the whole population. Finally, the convergence to the optimal function 𝑓(𝑥) must be efficient.
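
For illustration, the consistency requirement is commonly formalised as empirical risk minimisation: for 𝑛 training examples, the learner selects the hypothesis 𝑓 in 𝐻 that minimises the average loss 𝐿(𝑓) = (1/𝑛) Σᵢ ℓ(𝑓(𝑥ᵢ), 𝑦ᵢ), where ℓ is a per-example loss such as the logistic (cross-entropy) loss used by logistic regression; the generalization requirement then concerns how close this empirical loss is to the expected loss over the whole population.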

Inductive machine learning can be used to tackle any classification problem, including identifying high-risk customers.

2.3 Default risk modelling

In the credit assessment process, lenders employ predictive models to estimate the likelihood of undesirable behaviour by a borrower (e.g. bankruptcy, late payment or non-payment). The main challenge for the design and implementation of credit risk models is data limitations. The infrequent nature of default events and the long time horizons used in measuring credit risk lead to data scarcity and require the use of proxy data and simplifying model assumptions.

2.3.1 Data

Corporate credit models have utilised primarily information from financial statements (balance sheet, income and cash flow statements), various financial ratios and macroeconomic indicators, while retail models have traditionally been based on socio-demographic, financial, employment and repayment behaviour data.

In recent years, data on clients' behaviour with regard to different banking products are increasingly used in retail credit scoring. Behavioural data used to enhance default predictions now include data on repayment behaviour on other loans, credit card usage [8] and payment patterns [9], among others.

The data gathered and generated in retail banking usually have higher volume and velocity due to the more intensive interaction between clients and financing providers. An increasing number of data sources are becoming available in corporate banking [6] as well, and many of the techniques used on the retail side are now applicable to the risk assessment of corporate clients, including SMEs.


2.3.2 Models

The credit score and the credit rating are model-based estimates of the likelihood that the borrower will default on their obligation. The score (or rating) typically implies an absolute measure of the probability of default (PD).

The most popular methods for default risk modelling are survival analysis and classification. Survival analysis is used to estimate not only the probability of default but also its timing. It allows for estimating the probability of multiple events of interest, like default and early prepayment (see [11]).

This work focuses on the application of machine learning techniques, and more specifically classification algorithms, to default risk assessment. The classification algorithms can be divided into two groups: individual and ensemble classifiers. Each of the classification algorithms discussed below is accompanied by a concrete example of its application in the domain of credit risk assessment.

2.3.2.1 Individual classifiers

The taxonomy of classification methods defines two major classes with respect to the approach of calculating the posterior probability: discriminative and generative. [12] discusses the difference between these two classes and provides an in-depth comparison between logistic regression and naïve Bayes.

Discriminative classifiers like logistic regression (LR) model the posterior probabilities directly or construct a direct map from inputs to class labels. Generative classifiers, on the other hand, like Naïve Bayes (NB) and Linear Discriminant Analysis (LDA), estimate the class-conditional probabilities, which are then converted to posterior probabilities using Bayes' rule.

Classifiers can also be parametric or non-parametric, depending on whether the classifier assumes a functional form of the mapping function. Such an assumption can simplify the learning process; however, it can also limit what can be learned. On the one hand, parametric classifiers are simpler, more interpretable, faster to learn and require less data to learn from. On the other hand, they are not suitable for more complex problems. In contrast, non-parametric models make no assumption about the underlying function and are more flexible to fit a wider variety of functional forms. Due to their higher complexity, non-parametric classifiers require more training data and more computational resources and are prone to over-fitting.


Figure 1. Taxonomy of classification methods

One of the most comprehensive classifier comparisons for the purpose of credit scoring is the benchmarking study provided by [13]. It reflects the recent advancements in predictive learning and systematically examines the potential of novel classification algorithms for credit scoring. The study is a large-scale benchmark of 41 classification algorithms across eight credit scoring data sets, which assesses not only the statistical accuracy of the scorecard predictions but also the business value of applying more accurate classification models.

Logistic regression

Logistic regression analysis studies the association between a categorical dependent variable and a set of independent variables (or features) by estimating probabilities using a logistic function. Logistic regression is the most widely used individual classifier for credit risk assessment. It is based on three major assumptions.

Firstly, it assumes that the logit transformation of the target variable has a linear relationship with the features. Secondly, it requires little or no multicollinearity among the features. Thirdly, it requires the observations to be independent from each other. In practice, logistic regression requires a large sample size; the rule of thumb is at least 10 cases of the least frequent outcome for each independent variable.

Regularised logistic regressions are used to overcome these limitations. Ridge and Lasso logistic regressions shrink the regression coefficients by imposing a penalty on their size, which reduces variance in the presence of multicollinearity and prevents over-fitting when the number of features is large and/or the number of observations is limited. [14] demonstrates the application of Lasso logistic regression for credit scoring and reports a performance comparable to a Support Vector Machine classifier.
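
To make this concrete, the sketch below shows how such a Lasso-regularised logistic regression could be fitted with scikit-learn. The data, the penalty strength C and the variable names are illustrative placeholders (not taken from [14]); in practice the features would come from the data described in chapter 3 and C would be tuned by cross-validation.

```python
# Minimal sketch of a Lasso (L1-regularised) logistic regression for default prediction.
# X and y are synthetic stand-ins for the prepared features and default flags of chapter 3.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                                            # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 1.5).astype(int)    # synthetic default flag

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The L1 penalty shrinks uninformative coefficients to exactly zero (implicit feature
# selection); standardisation ensures the penalty treats all features comparably.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X_train, y_train)

print("coefficients:", model.named_steps["logisticregression"].coef_.round(2))
print("estimated PD for first applicants:", model.predict_proba(X_test[:5])[:, 1].round(3))
```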

Support Vector Machine

The Support Vector Machine (SVM) classifier has been used as an alternative to logistic regression and discriminant analysis, providing similar performance (see [14], [15] and [16]). SVM separates binary classified data by a hyperplane and is defined as an optimization problem with an underlying kernel function that can be linear or non-linear.

The main disadvantage of SVM is its lack of transparency, which is crucial for most credit risk applications. Since SVM does not represent the score as a tractable parametric function of the features, its predictions cannot be easily explained to the user. Another disadvantage is that choosing an appropriate kernel function and tuning the key parameters are not straightforward and require experimentation.

Neural networks

Neural Network (NN) models have also been tested for the purpose of credit scoring. [17] investigates the use of five neural network models for credit scoring and benchmarks the results against several traditional models. The research shows that neural network credit scoring models achieve only a marginal improvement in accuracy (in the range of 0.5-3%). The use of NNs requires stronger modelling skills to develop network topologies and implement training methods that are suitable for the problem at hand. Next to the complexity of these models, a major disadvantage is their lack of explanatory power and transparency.

Decision trees

Decision tree classifiers recursively partition the data to separate good from bad loans through a sequence of tests and produce a set of rules for the classification of new observations. Due to their high complexity and tendency to overfit, decision tree classifiers are not used as individual classifiers. Nevertheless, they are widely used within the ensemble classifiers described in the next section.

Linear Discriminant Analysis and Naïve Bayes

Linear Discriminant Analysis [18] and Gaussian Naïve Bayes classifiers [19] are widely used generative classifiers for the prediction of default. The main limitations of these methods stem from the assumption of Gaussian distributed features and the same dependence structure (covariance matrix) per class. The Naïve Bayes classifier goes even further in assuming independent features (diagonal covariance matrix).


2.3.2.2 Ensemble classifiers

Ensemble classifiers combine the predictions of multiple base classifiers in order to improve generalizability and increase the predictive accuracy beyond what is achieved by the base models. [20] and [21] report that ensemble classifiers perform significantly better than individual classifiers in credit risk applications.

Ensemble modelling employs a two-step approach: base classifier creation and prediction combination. The creation of the base classifiers can be independent or dependent. The independent approach is implemented via an algorithm called bootstrap aggregating, whereby the base classifiers are built independently and their predictions are then averaged. This approach is used to reduce the variance of model predictions, which leads to higher prediction stability and accuracy. Examples of independent ensembles are the Bagging classifier [22] and the Random forest classifier [23].

Boosting algorithms [24], [25], on the other hand, build the base classifiers sequentially, trying to reduce the bias of the combined estimator. The motivation is to combine several weak classifiers to produce a powerful ensemble.
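
As a minimal illustration of the two strategies, the sketch below fits a bagging-style ensemble (a random forest of independently grown trees) and a boosting ensemble (shallow trees added sequentially) with scikit-learn; the synthetic data and hyper-parameters are illustrative assumptions rather than a recommended configuration.

```python
# Minimal sketch contrasting an independent (bagging-style) and a sequential (boosting)
# ensemble on synthetic stand-in data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = ((X[:, 0] * X[:, 1] + X[:, 2]) > 0.5).astype(int)    # synthetic, non-linear default flag

# Random forest: many de-correlated trees grown independently on bootstrap samples,
# predictions averaged to reduce variance.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)

# Gradient boosting: shallow trees added one after another, each correcting the errors
# of the current ensemble, to reduce bias.
gb = GradientBoostingClassifier(n_estimators=300, max_depth=3, learning_rate=0.05,
                                random_state=0)

for name, clf in [("random forest", rf), ("gradient boosting", gb)]:
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```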

2.4 Incorporating prior domain knowledge

One of the key challenges in applying supervised learning to assess credit risk is often the insufficient size of the training set. Both established finance providers and alternative providers in a start-up phase can face this challenge when entering a niche or unknown market. Insufficient data can significantly impact the effectiveness of a learner, leading to weak generalization.

A possible way to address this problem is to exploit expert knowledge that may be available about the market. The vast majority of machine learning techniques are data driven, rely almost exclusively on sample data and ignore existing domain knowledge. [7] and [27] outline three main categories of methods for incorporating prior domain knowledge: 1) using virtual examples; 2) data preparation; and 3) altering the search objective. Within credit risk modelling, prior domain knowledge is traditionally included as part of the data preparation, via selecting available features and engineering new ones.

2.4.1 Virtual examples

The general framework presented in [27] defines prior domain information as a set of transformations that allow new examples to be obtained from the old. The assumption is that the unknown target function 𝑐(𝑥) is invariant with respect to certain transformations, through which the prior domain knowledge can be encoded. The article shows that the virtual examples approach is mathematically equivalent to incorporating prior knowledge via regularization. In many cases, encoding prior knowledge by creating virtual examples might be much easier than via a direct regularization constraint. The effectiveness of this approach in the context of credit risk assessment could not be explored due to time constraints and is left for future research.

2.4.2 Data preparation

Expert knowledge is often used to select relevant features in order to minimise noise and redundancies. It is also used to transform existing features into new ones in order to explore more complex relationships. In many domains certain features are naturally linked together and prior domain knowledge is used to combine them into a new engineered feature. For example, in robotics, the relation between state variables and torques is represented by sine and cosine transformations; in credit risk analysis, financial capacity is well represented by a set of transformations (ratios) of numeric information from financial statements. [28] presents an approach for the automatic generation of features which can be used to improve the productivity of experts in searching the space of features with forecasting power.

2.4.3 Altering the search objective

Machine learning problems are often represented as optimization problems whereby an objective function is maximised or minimised according to a set of constraints. In order to find the optimal hypothesis, the learning algorithm converges to the best approximation of the underlying model within a given hypothesis space. For example, estimating the coefficients of a logistic regression requires an optimization technique like the Maximum Likelihood method. There are several methods of altering the search objective to incorporate prior knowledge via:

1. an additional regularization within the objective function (similar to the Ridge and Lasso penalty terms);

2. introducing additional constraints within the objective function;

3. weighting the observations, whereby the weights are expert-determined parameters of the relative importance;

4. cost-sensitive learning, whereby different costs are assigned to different classification errors (see the sketch after this list);

5. augmented search, whereby new hypothesis candidates are produced in the process of searching. An example of an augmented search approach is the FOCL system developed in [29].
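
A minimal sketch of options 3 and 4 is given below, assuming scikit-learn: expert knowledge about the relative importance of observations is passed as sample weights, and asymmetric misclassification costs are passed as class weights; the concrete weights, costs and data are illustrative assumptions only.

```python
# Sketch of weighting observations (option 3) and cost-sensitive learning (option 4)
# with scikit-learn; weights, costs and data are illustrative expert assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 1.2).astype(int)     # imbalanced synthetic default flag

# Option 3: expert-determined relative importance of observations,
# e.g. recent loans counted twice as heavily as older ones.
recent = rng.random(1000) < 0.3
sample_weight = np.where(recent, 2.0, 1.0)
weighted_model = LogisticRegression().fit(X, y, sample_weight=sample_weight)

# Option 4: cost-sensitive learning, e.g. a missed default assumed five times as costly
# as rejecting a good client.
cost_sensitive_model = LogisticRegression(class_weight={0: 1.0, 1: 5.0}).fit(X, y)

print(weighted_model.coef_.round(2))
print(cost_sensitive_model.coef_.round(2))
```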


2.5 Explainability

The effectiveness of machine learning algorithms in a business context is often limited by their inability to present their reasoning to human users in a transparent and tractable way. The increased use of algorithms for autonomous decision-making on behalf of or instead of humans is raising societal and ethical questions. The growing importance of algorithmic decision-making and its impact on customers requires an adequate regulatory framework. In May 2018, the European Consumer Consultative Group published policy recommendations [30] regarding the transparency and public control of algorithmic decision-making. One of the recommendations was to require regulators and companies to develop effective means and simple processes for consumers to exercise their rights to information and to challenge automated decisions based on personal and non-personal data that produce legal effects. This would require machine learning algorithms to be able to explain and justify automated decisions, so that consumers are able to contest a decision and request a correction.

The raised awareness of and demand for explainable machine learning techniques stem mainly from the consumer world. Although not explicitly required in the SME financing context in the short term, the capability to explain an automated decision would be an advantage in case regulatory and customer demands shift to the corporate side as well.

In order to overcome the black-box problem, several techniques have been developed to provide explanatory insight into the contribution of each feature to the classification process.

2.5.1 Deep explanation

The first approach stems from deep learning applications for visual object recognition. [31] presents an image explanation framework that is able to generate textual explanations that are both image-relevant and class-relevant. Since this approach has no immediate application to the context of credit risk assessment, it is left out of the scope of the presented framework.

2.5.2 Model induction

The second approach to achieve explainability is via model induction. Model induction refers to techniques used to infer an explainable model from any black-box model. An example of this approach is given in [32] and used by the leading analytics company and credit score provider FICO.

This approach combines the transparency of traditional scorecards and the flexibility of machine learning algorithms to capture non-linear relations and feature interactions. It optimises a segmented scorecard to approximate a tree ensemble score function and achieves higher predictive power than traditional scorecards while keeping the model fully transparent.


The FICO approach was left out of the scope of the present framework due to the lack of an elaborate methodological description. Replicating and testing this methodology would require significant time and effort and is left as a direction for future research.

Another approach to explain the predictions of any classifier is called Local Interpretable Model-Agnostic Explanations (LIME) [33]. It provides insights into black-box classifiers by learning an interpretable model locally around a specific prediction. This approach was demonstrated to work well with models for text (e.g. Random Forest) and image classification (e.g. neural networks). Further research is required on the applicability of this approach for credit risk assessment.
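
As an illustration of how LIME could be applied to a tabular credit model, the sketch below explains a single prediction of a random forest classifier. It assumes the third-party lime package; the feature names, the model and the data are synthetic placeholders rather than part of the cited work.

```python
# Minimal sketch: explaining one prediction of a black-box credit model with LIME.
# Requires the third-party `lime` package; data, features and model are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(1)
feature_names = ["credit_limit", "age", "utilisation_ratio", "late_payments"]  # hypothetical
X = rng.normal(size=(500, 4))
y = (X[:, 2] + X[:, 3] + rng.normal(size=500) > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    training_data=X,
    feature_names=feature_names,
    class_names=["performing", "default"],
    mode="classification",
)

# Fit a simple interpretable model locally around one applicant and list the features
# that drive this particular prediction.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```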

2.5.3 Interpretable models

The third approach focuses on developing more interpretable models by design, as well as techniques to increase the interpretability of existing machine learning algorithms.

An example of an advanced but still highly interpretable machine learning algorithm used for credit risk assessment is Real AdaBoost [24]. Another example of a machine learning algorithm that combines interpretability with the flexibility to work with non-linear and mutually dependent features is the logistic regression tree.

A logistic regression tree is a classification model that combines logistic regression and decision tree learning. It is effectively a piecewise logistic regression model built by recursively partitioning the data and fitting a different logistic regression in each partition. A logistic regression tree can also be described as a decision tree that has a logistic regression model at each leaf. The resulting model can work well with non-linear features without explicit variable transformations. It can also achieve high interpretability by balancing tree structure complexity with node model complexity. For example, using a more complex tree structure allows simpler and directly interpretable logistic models to be fitted to the leaves. A concrete implementation of logistic regression trees is given by [34].
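
To illustrate the idea (but not the specific algorithm of [34]), the simplified sketch below partitions the data with a shallow decision tree and fits a separate logistic regression in each leaf; the data, the tree depth and the minimum leaf size are illustrative assumptions.

```python
# Simplified sketch of a piecewise ("logistic regression tree") model: a shallow decision
# tree defines interpretable segments and a logistic regression is fitted per segment.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 5))
y = ((X[:, 0] * X[:, 1] + X[:, 2]) > 0.5).astype(int)      # non-linear synthetic target

# Shallow tree: a small number of interpretable partitions (segments).
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=200, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)

# One logistic regression per leaf; pure leaves fall back to a constant probability.
leaf_models = {}
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    if len(np.unique(y[mask])) == 1:
        leaf_models[leaf] = float(y[mask][0])
    else:
        leaf_models[leaf] = LogisticRegression().fit(X[mask], y[mask])

def predict_proba_default(X_new):
    """Route each observation to its leaf and apply that leaf's logistic model."""
    leaves = tree.apply(X_new)
    probs = np.empty(len(X_new))
    for i, leaf in enumerate(leaves):
        m = leaf_models[leaf]
        probs[i] = m if isinstance(m, float) else m.predict_proba(X_new[i:i + 1])[0, 1]
    return probs

print(predict_proba_default(X[:5]).round(3))
```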


3 SME credit assessment framework

This chapter presents a blueprint for a machine learning-based credit analytics framework that aims to help traditional and alternative funding providers to utilise the increasingly diverse data sources in a transparent and explainable fashion. Figure 2 presents the main components of the machine learning-based credit risk framework, including references to the relevant chapters.

Figure 2. Blueprint for a machine learning-based credit risk assessment framework.

The business objective of the credit risk assessment is to minimise potential losses while assuring an acceptable loan origination rate. This objective can be achieved by effective differentiation between low-risk and high-risk borrowers. In order to solve this classification problem with inductive machine learning, companies first need to identify potentially valuable data sources and then transform the raw data into ready-to-use features. The classification algorithms are then evaluated on their performance, flexibility and explainability, whereby the performance measures should be aligned with the business objective and the priority given to minimizing risks versus portfolio growth.

The presented framework provides guidelines for: (1) identifying potentially valuable data sources, (2) data pre-processing, (3) transforming the raw data into features, (4) translating the credit assessment objectives into performance measures, (5) selecting and applying appropriate machine learning algorithms and (6) achieving explainability of the credit decisions.

3.1 Data

Data used to be perceived simply as a by-product of business activities with limited value outside of a concrete business process. The past few years have been marked by a rapid increase not only in the volume of data but also in the capabilities to combine them into actionable insights and create value.

Nevertheless, many organizations struggle to think strategically about data. The main challenge is that the value of data can be unlocked only if data are treated as a corporate asset and are properly captured, managed and shared within the organization. Furthermore, organizations need to commit to a data strategy before being able to assess the potential of data for competitive differentiation.

3.1.1 Data strategy

A data strategy is a common reference of methods, services, architectures, usage patterns and procedures for prioritizing, gathering, integrating, storing, analysing and operationalizing data. The ultimate goal of the data strategy is to enable the transformation of raw data into actionable information and create business value. Giving guidelines for the development of an adequate data strategy is beyond the scope of this thesis. However, this work would not be complete without highlighting the importance of a data strategy for companies that aim to be data-driven. A data strategy ensures that data are managed and used as a corporate asset that enhances decision-making and creates value. It is important to realise that the value of data can only be proven if the data have been managed with proper respect for their potential value.

3.1.2 Data sources

Building a credit assessment framework starts with the identification of data sources and assessing their potential value for credit risk modelling. The framework proposes four major types of data that have been reported to add value in the credit risk assessment process: financial statements data, transaction data, supply-chain performance data and sensor data. The list presented below is by no means exhaustive. Data availability largely depends on the SME segment, jurisdiction and the concrete business circumstances. The identification of alternative data sources, however, should be seen as an ongoing process which goes beyond the initial implementation of the credit assessment framework.

After identifying the available data sources, the next step is to make an initial assessment of their potential value. The value of data is defined here as the feasibility of translating raw data into actionable business insights. The presented framework proposes five data characteristics that are directly or indirectly related to the potential business value of the data: volume, frequency, lag, structure and integrity.

Volume refers to the total number of observations available. Although it depends on the machine learning techniques used, generally a higher number of features requires a larger data volume. Volume is the single most important data characteristic, as it directly affects the performance of the machine learning algorithms.

Frequency refers to the frequency of observations. An example of low-frequency data is financial statements data, which are updated once or twice a year. The importance of data frequency depends on the requirements of the credit assessment process. For example, continuous monitoring of credit risk would require high-frequency data, while for the purpose of credit acceptance frequency is of little importance.

Lag measures the time between the moment of measurement and the moment the data are available to the analytics framework. Manual pre-processing like data cleansing or data validation can delay the provisioning process. Similar to frequency, the importance of lag depends on the end requirements of the credit risk assessment process.

Structure refers to the level of formatting of the data, which can be structured, unstructured or semi-structured. Structured data have a well-defined format and are divided into standardised data elements that are identifiable and accessible. Structured data are generally easier to analyse as their elements are more straightforward to define as features. Unstructured data lack a pre-defined data model. Typical examples of unstructured data are text, images, audio, video and combinations of these within document files.

Semi-structured data are data that have a well-defined format but where the standardised data elements cannot be used directly as features. When dealing with semi-structured or unstructured data, feature engineering is a fundamental part of the modelling process and directly affects the performance of the machine learning algorithms. Feature engineering is the process of using domain knowledge to create features by transforming and combining available data elements (see 3.1.3.5).

Integrity refers to the trustworthiness and incorruptibility of the data source. Transactional data like GPS location and bank account transactions are generally more trustworthy as they are hard to manipulate. Financial statements data, on the other hand, need to be audited before they can be considered trustworthy.

3.1.2.1 Financial statements

Financial statements have been the main information source for the purpose of corporate credit risk assessment and, more specifically, the assessment of a company’s capacity and capital.


Financial statements are the formal records of the financial activities and position of a company. There are four basic financial statements:

1. Balance sheet or statement of financial position, which reports on the company's assets, liabilities and equity at a specific point in time;

2. Income statement or statement of comprehensive income, which reports on a company's income, expenses and profits over a period of time;

3. Equity statement, which reports on the changes in equity over a period of time;

4. Cash flow statement, which reports on the company's cash flow activities.

The information reported in the financial statements is combined into indicators and ratios which are used as credit analysis measures. Table 1 shows the set of financial indicators that are typically used for the assessment of credit risk.

Short-term capacity
  Liquidity ratios:
    Current ratio = Current Assets / Current Liabilities
    Quick ratio = (Current Assets – Inventories) / Current Liabilities
    Operating cash flow ratio = Cash flow from operations / Current Liabilities
  Coverage ratios:
    Interest Coverage Ratio = EBIT / Interest Expense
    Debt Service Coverage Ratio = Net Operating Income / Total Debt Service
    Asset Coverage Ratio = ((Assets – Intangible Assets) – (Current Liabilities – Short-term Debt)) / Total Debt

Long-term capacity
  Leverage ratios:
    Debt-to-Equity ratio = Total Debt / Total Equity
    Total Debt to Capitalization Ratio = Total Debt / (Total Debt + Total Equity)
    Funds from operations to total debt ratio = FFO / Total Debt
  Performance ratios:
    Return on Assets (ROA) = Net Income / Total Assets
    Return on Equity (ROE) = Net Income / Shareholder's Equity

Capacity
  Profitability and cash-flow indicators: EBIT, EBITDA, FFO, Free cash flow before dividends, Free cash flow after dividends

Capital
  Leverage ratios: see above under Long-term capacity

Table 1. Financial indicators used for credit risk assessment.

The availability of comprehensive financial statements depends largely on the size and level of regulation of companies. The financial statements of micro and small SMEs can lack granularity, which makes the calculation of many financial ratios unfeasible. Therefore, the volume of available data can vary per SME sub-segment. The frequency of financial statements ranges from quarterly to annually and depends on the jurisdiction, company size and regulatory requirements. The majority of SMEs prepare their financial statements annually.

Another important consideration is the comparability of financial statements. Although the data in financial statements are usually well-structured, there might be differences in the definitions of various data items between companies and industries. Such differences lead to higher uncertainty in the estimation of credit risk [35] and require data pre-processing and adjustments. Many providers of financial statements data assure comparability by standardizing the data to a common data model. The integrity of the data is assured by an internal or external audit process. However, most micro and small SMEs cannot provide audited financial statements.

3.1.2.2 Transaction data

Previously, transaction data were used predominantly in retail banking, whereby banks would use cash flow data from various payment accounts to assess the creditworthiness of their customers. Transaction data include balances, as well as cash inflows and outflows of any type of bank account (e.g. deposit accounts, credit accounts or current accounts). Trade and commercial payment data are also considered transactional.

Account balances and transaction flows provide real-time visibility into SMEs' net cash flows. For the purpose of assessing the capacity of a firm, this information is often superior to the out-of-date and often unreliable financial statements. Cash flow data incorporate information about incoming sales, outgoing expenses and debt payments. These data can also be used to assess the growth and profile of the customer base. [36] and [37] provide examples of transaction-derived measures that can be used as features (e.g. Balance to Total Cash outflow in the last X weeks, Balance to Total Cash inflow in the last X weeks, Current expense to X-week moving average expense, measures based on the cash flow distribution (e.g. volatility), and the number and frequency of particular types of transactions).
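
A sketch of how such transaction-derived features could be computed from a raw account history is given below; the column names, the 12-week window and the small example transactions are illustrative assumptions in the spirit of [36] and [37], not a prescribed schema.

```python
# Sketch of transaction-derived features; columns and window length are illustrative.
import numpy as np
import pandas as pd

tx = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-16", "2023-01-23"]),
    "amount": [1500.0, -400.0, -900.0, 2200.0],   # positive = inflow, negative = outflow
    "balance": [1500.0, 1100.0, 200.0, 2400.0],   # end-of-day account balance
}).set_index("date")

window = "84D"                                     # roughly the last 12 weeks
inflow = tx["amount"].clip(lower=0).rolling(window).sum()
outflow = (-tx["amount"].clip(upper=0)).rolling(window).sum()
expense = -tx["amount"].clip(upper=0)

features = pd.DataFrame({
    "balance_to_outflow": tx["balance"] / outflow.replace(0, np.nan),
    "balance_to_inflow": tx["balance"] / inflow.replace(0, np.nan),
    "expense_to_moving_avg_expense": expense / expense.rolling(window).mean().replace(0, np.nan),
    "cash_flow_volatility": tx["amount"].rolling(window).std(),
})
print(features)
```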

3.1.2.3 Supply-chain performance data

Emerging digital supply-chain platforms can provide insights into the trading flows between buyers and sellers. Accounts receivable and factoring financing have become important funding sources for many SMEs that operate in the B2B sector. Data on purchase orders, invoices, accounts receivable, accounts payable, bills of lading and shipping are already being used for credit risk assessment (see [6] and [38]). Table 2 provides a list of supply-chain related indicators that can be used as features.

Capacity
  Working capital components:
    Inventory days = 365 / (Cost of Goods Sold / Average inventory balance)
    Accounts receivable days = 365 / (Sales / Average accounts receivable balance)
    Accounts payable days = 365 / (Cost of Goods Sold / Average accounts payable balance)

Profitability
  Profit margin = Net Profit / Revenue

Table 2. Supply-chain performance indicators, also relevant for credit risk assessment.

3.1.2.4 Other data

The availability of alternative data sources depends on the industry SMEs are operating in. In asset-based financing (e.g. financial leasing of vehicles or equipment), GPS location data have already been used for risk management purposes, mainly for the monitoring of risks related to the current location of the asset (e.g. country risk). Asset location and movement patterns can potentially be used as a real-time proxy for the utilization of the assets and the business activity of a firm.


3.1.3 Data pre-processing and feature engineering

Data pre-processing is not only a crucial step in the application of machine learning but often also the most time-consuming task, as it is a highly manual process which requires expert judgement. Data pre-processing starts with understanding the data at hand (source, definition, meaning), their quality (missing values, noise, outliers), their format (e.g. numerical, categorical, ordinal, text) and their statistical characteristics (mean, median, mode, standard deviation, skewness, kurtosis).

The main aspects of data pre-processing are data cleansing, data transformation and feature engineering. Without undermining its importance, this thesis does not explicitly describe the data cleansing process and its methods. [39] provides a thorough guide for data cleansing, including the handling of missing data, identifying misclassifications and outlier detection. The rest of this chapter focuses on the data transformation and feature engineering techniques which are crucial for the application of the selected machine learning algorithms (in 3.2.2) and the utilization of less structured data like transaction and payment data.

3.1.3.1 Standardization

Standardization refers to the transformation of individual numerical variables so that the distribution of the transformed variable is closer to a Gaussian distribution with zero mean and unit variance. In practice, this is often done with the Z-score method, i.e. subtracting the mean value of each feature and dividing by its standard deviation (see [40]).

An alternative approach is Min-Max scaling, whereby the data are scaled to a fixed range (e.g. [0,1]). Some machine learning algorithms, like Support Vector Machines with an RBF kernel and (L1 and L2) regularised logistic regressions, assume that all features have zero mean and variance of the same order. Otherwise, the features with variances of a higher order of magnitude would dominate the objective function and lead to ineffective learning. [41] shows that scaling of the features can speed up some gradient descent-based learning algorithms. Tree-based models are in general invariant to standardization and do not explicitly require transforming the features. An important consideration is that standardization has an adverse effect on sparse data sets, as it destroys the sparsity, which can lead to a significant increase in required memory and computational resources.
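
A minimal sketch of both scaling approaches with scikit-learn is given below; the small three-column feature matrix (e.g. turnover, leverage, firm age) is an illustrative placeholder.

```python
# Sketch of Z-score standardization and Min-Max scaling with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[120_000.0, 0.35, 12.0],
              [ 80_000.0, 0.80,  3.0],
              [250_000.0, 0.10, 25.0]])   # e.g. turnover, leverage, firm age

# Z-score standardization: subtract the column mean and divide by the column standard deviation.
X_std = StandardScaler().fit_transform(X)

# Min-Max scaling: map each column to the fixed range [0, 1].
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(X_std.round(2))
print(X_minmax.round(2))
```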

3.1.3.2 Statistical transformations

Standardization transforms the data in order to adjust the location and the variance of the feature distributions. However, there are other transformations that can deal with the skewness of the feature distribution and adjust for it in order to achieve a more normal-like distribution. For example, root or logarithm transformations are used to adjust for right skewness, and square or cube transformations are used to adjust for left skewness.

Another approach for adjusting for skewness is the Box-Cox transformation. The main advantage of such transformations is that they remove the variance-on-mean relationship and stabilise the variance, making it constant relative to the mean. According to [42], the effect of such transformations on model performance is much more prominent for simpler models like linear and naïve Bayes models than for SVMs, k-NN and neural networks. [43] states that in many cases machine learning algorithms like SVMs and neural networks work better if the features have symmetric and unimodal distributions. Tree-based models are invariant to such transformations.
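
The sketch below applies a log and a Box-Cox transformation to a synthetic, right-skewed feature; the lognormal "revenue" variable and the use of scipy and scikit-learn are illustrative assumptions.

```python
# Sketch: log and Box-Cox transformations to reduce the right skewness of a feature.
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(3)
revenue = rng.lognormal(mean=10, sigma=1.0, size=1000)   # right-skewed, strictly positive feature

log_revenue = np.log(revenue)                            # simple log transform
boxcox_revenue, lam = stats.boxcox(revenue)              # Box-Cox, lambda estimated by MLE

# The same transform as a reusable pre-processing step (Box-Cox requires positive values).
pt = PowerTransformer(method="box-cox", standardize=True)
revenue_bc = pt.fit_transform(revenue.reshape(-1, 1))

print("skewness raw / log / box-cox:",
      stats.skew(revenue), stats.skew(log_revenue), stats.skew(revenue_bc.ravel()))
```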

3.1.3.3 Binarization

In many cases, the distribution of values of continuous numeric features is highly skewed, which may adversely affect the machine learning algorithm. In such cases, the features can be made discrete via a procedure called binning. The discrete values can be considered as categories to which the raw continuous numeric values are mapped.

One way to achieve this is through fixed-width binning, whereby each bin (category) has a pre-defined range of values based on prior domain knowledge. For example, rounding can be seen as a special case of fixed-width binning. While a certain bin width might be meaningful in the opinion of the expert, fixed-width binning might lead to bins that are not well represented in the concrete data set and can be sparsely populated. This effect can be avoided by an approach called adaptive binning. An example of adaptive binning is quantile-based binning, whereby the cut-off points of the bins are determined by the quantiles of the continuous valued distribution.
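
The sketch below contrasts the two binning approaches with pandas; the fixed cut-off points (loosely based on the EU turnover thresholds for micro, small and medium-sized enterprises) and the number of quantiles are illustrative choices.

```python
# Sketch of fixed-width and quantile-based (adaptive) binning with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
turnover = pd.Series(rng.lognormal(mean=13, sigma=1.2, size=1000))   # skewed feature, in EUR

# Fixed-width binning: expert-defined cut-off points.
fixed_bins = pd.cut(turnover,
                    bins=[0, 2e6, 10e6, 50e6, np.inf],
                    labels=["micro", "small", "medium", "large"])

# Adaptive (quantile-based) binning: cut-offs chosen so that each bin is equally populated.
quantile_bins = pd.qcut(turnover, q=5, labels=False)

print(fixed_bins.value_counts())
print(quantile_bins.value_counts())
```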

3.1.3.4 Encoding categorical features

Categorical variables cannot be directly used as features in machine learning algorithms. Even if represented as integers, the categories would be implicitly misinterpreted as being ordered. A common way to represent categorical variables is one-hot encoding: the categorical variable is replaced with one or more dummy features that take the values 0 or 1.
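
A minimal pandas sketch is given below; the legal-form categories are illustrative placeholders.

```python
# Sketch of one-hot encoding a categorical variable with pandas.
import pandas as pd

firms = pd.DataFrame({
    "legal_form": ["BV", "NV", "eenmanszaak", "BV"],
    "turnover": [1.2e6, 40e6, 0.3e6, 5.0e6],
})

# Each category becomes its own 0/1 dummy column; no artificial ordering is introduced.
encoded = pd.get_dummies(firms, columns=["legal_form"], prefix="legal_form")
print(encoded)
```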

3.1.3.5 Feature engineering

The presented framework defines feature engineering as creating new features from existing ones in order to improve model performance. Feature engineering goes beyond the feature transformations described above, as it relies on prior domain knowledge. Feature engineering is crucial for semi-structured data (e.g. cash flow data), whereby features need to be explicitly constructed from the existing raw data.

There are two major types of engineered features: indicator features and interaction features. Indicator features explicitly encode thresholds or special events related to some existing features. Interaction features are composed of two or more raw variables. Table 3 presents examples of engineered features in the context of credit risk assessment.

Indicator features
  Threshold indicator variable: a dummy variable indicating whether the value of an existing feature is above or below a certain threshold. Example: an indicator variable that represents micro SMEs and equals 1 IF (headcount < 10) AND (Turnover ≤ EUR 2mln OR Total Assets ≤ EUR 2mln), and 0 otherwise.
  Special event indicator variable: a dummy variable that describes a specific event defined over one or more existing features. Example: an indicator variable that represents a change in the ownership structure of a firm.
  Indicator variable for a group of classes: a dummy variable that indicates whether an observation belongs to a group defined over e.g. existing categorical features. Example: an indicator variable that groups financing transactions in a current account.

Interaction features
  Differences and ratios: differences and ratios of existing raw features. Example: financial ratios.
  Counts: the number of elements in the feature vector that satisfy a certain condition.
  Polynomials: engineered features in the form of a polynomial, e.g. 𝑦 = 3𝑥² + 𝑥 + 1.
  Rational differences and polynomials: rational combinations of differences and polynomials, e.g. Balance to Total Cash outflow in the last X weeks of a bank account.

Table 3. Engineered feature types (incl. examples).

It is important to note that some of the engineered features can be synthesised by the machine learning algorithms themselves and need not be explicitly provided. [44] compares several machine learning algorithms based on how they perform with different engineered features and which of the features are implicitly synthesised. The paper demonstrates that neural networks and SVMs can synthesise all engineered feature types except for ratio and ratio-difference features, while random forests and gradient boosting machines fail to synthesise only counts and ratio-difference features.

Feature engineering is traditionally a manual process that depends on prior domain knowledge and trial and error. There are, however, approaches that aim for automated feature engineering. [45] presents an evolutionary computation approach using symbolic regression to build free-form mathematical models. It should be noted that a fully automated generation of features is likely to lead to black-box models that lack interpretability. Therefore, such an approach should only be used to support data scientists in the feature engineering process.

3.1.4 Ground truth

The term “ground truth” refers to the direct observations of the target variable used in supervised machine learning. In the context of credit risk assessment, the target variable can be binary (e.g. “good” or “bad” client) or ordinal (e.g. credit ratings AAA to D). There are two main challenges related to the ground truth when applying classification algorithms to assess credit risk.

Firstly, the number of “ground truth” observations might be insufficient (in the case of a niche or unknown market), which is likely to lead to weak generalization and low predictive power. When entering a new market, financing providers often lack sufficient data on clients and, in particular, sufficient observations of clients failing to fulfil the legal obligations (or conditions) of a loan. There is no universal rule of thumb for the minimum number of required observations, as this depends on the particular machine learning algorithm, the number of features and characteristics like the interdependence and distribution of the features.

Secondly, the various classes of the target variable are usually not represented equally in the data. Imbalanced data sets are common in real-world machine learning applications, including credit risk assessment. It is common to have a limited number of default observations or a discrepancy between the number of observations for some credit rating classes (minority classes) and others (majority classes). Such a situation can lead to the so-called accuracy paradox, whereby models with high accuracy can have a significantly lower predictive power (see [46]). There are several approaches to adjust for class imbalance: under- or over-sampling and cost-sensitive learning.

Under-sampling refers to using a subset of the majority instances so that their number equals that of the minority classes. This is beneficial if there is a sufficiently large number of observations for the minority classes. Over-sampling refers to increasing the number of minority instances by replicating them. In highly imbalanced data sets, the replication of minority instances may introduce a bias that adversely affects the predictive power of the machine learning algorithm. The cost-sensitive learning approach assigns weights to the data set's instances that are inversely proportional to the class frequencies in the input data. The proposed framework uses cost-sensitive learning as the preferred approach.

In the cases with no or a very limited number of defaulted clients, none of the techniques above is feasible. Such a situation can arise in the early stages of building a loan portfolio in an unknown market. In this case, a proxy “ground truth” can be considered until enough data are gathered on defaulted clients. For example, an indicator variable for clients with delinquent payments can be used as a proxy for the likelihood to pay. The assumption is that clients with delinquent payments are more likely to default on their obligations and, therefore, this information can be used as ground truth in the absence of observations for defaulted clients.
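As a minimal sketch of the cost-sensitive learning preferred above, scikit-learn classifiers accept a class_weight argument; setting it to "balanced" assigns weights inversely proportional to the class frequencies in the training data. The small synthetic data set below is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Small synthetic, imbalanced example: 90 "good" (0) vs 10 "bad" (1) clients.
rng = np.random.RandomState(7)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Weights that "balanced" assigns: inversely proportional to class frequencies.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))   # approx. {0: 0.56, 1: 5.0}

# Cost-sensitive logistic regression: the minority (defaulted) class gets a larger
# weight, so errors on defaulted clients are penalised more heavily during training.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)
```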

3.1.5 Example data sets

The example data set3 used to demonstrate the practical application of the presented framework contains data on credit card clients in Taiwan provided by the University of California Irvine (UCI) and also used in [47]. Although it is not explicitly related to SMEs, the data set is a good example of combining both client-specific information and transaction information. The data set contains 30,000 observations (22.12% defaulted and 77.88% performing loans) with 24 variables, including the target variable which classifies a client as defaulted (1) or not (0). The rest of the variables are defined as follows:

Variable | Description | Values
LIMIT_BAL | Total loan amount, including the individual consumer credit and a supplementary credit given to the family of the consumer | Amount in NT dollars (integer)
GENDER | Gender | 1 = male; 2 = female
EDUCATION | Education | 1 = graduate school; 2 = university; 3 = high school; 4 = others
STATUS | Marital status | 1 = married; 2 = single; 3 = other
AGE | Age | In years (integer)
PAY_1 – PAY_6 | Payment history. Past monthly payments were tracked from April to September 2005. PAY_1 = the repayment status in September 2005; PAY_2 = the repayment status in August 2005; and so on. | -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months or above
BILL_AMT1 – BILL_AMT6 | Amount of the monthly bill statements. BILL_AMT1 = amount of the bill statement in September 2005; BILL_AMT2 = amount of the bill statement in August 2005; ...; BILL_AMT6 = amount of the bill statement in April 2005. | Amount in NT dollars (integer)
PAY_AMT1 – PAY_AMT6 | Amount of previous payments. PAY_AMT1 = amount paid in September 2005; PAY_AMT2 = amount paid in August 2005; ...; PAY_AMT6 = amount paid in April 2005. | Amount in NT dollars (integer)

Table 4. Example data set - credit card data.
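A minimal sketch of loading the data, assuming it has been exported locally to a CSV file; the filename credit_card_clients.csv and the target column name DEFAULT are hypothetical, the other columns follow Table 4.

```python
import pandas as pd

# Hypothetical local export of the UCI credit card data set; the filename and
# target column name (DEFAULT) are assumptions, the other columns follow Table 4.
df = pd.read_csv("credit_card_clients.csv")

print(df.shape)                                     # expected: (30000, 24)
print(df["DEFAULT"].value_counts(normalize=True))   # approx. 77.88% performing, 22.12% defaulted
```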

3.1.5.1 Data pre-processing

The example data set does not have any missing values. There are, however, two variables that are not formatted according to the specification. Firstly, EDUCATION includes values which are not in the range specified in the data description (between 1 and 4): 0 (15 instances), 5 (281 instances) and 6 (52 instances). In order to correct for this, a new variable, EDU_FIX, was created whereby all instances with values outside of the defined range are set to value 4 (others). Secondly, STATUS contains 55 instances with value 0, which is not specified in the data description. A new variable, STAT_FIX, was created whereby these instances are set to value 3 (other). The variables GENDER, EDU_FIX and STAT_FIX are categorical, therefore one-hot encoding was applied, transforming them into the binary variables GENDER_1 - GENDER_2, EDU_FIX_1 - EDU_FIX_4 and STAT_FIX_1 - STAT_FIX_3.
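The corrections and the one-hot encoding described above can be expressed as the following pandas sketch (continuing from the data frame loaded in the previous sketch):

```python
import numpy as np
import pandas as pd

# EDU_FIX: map undocumented EDUCATION codes (0, 5, 6) to 4 ("others").
df["EDU_FIX"] = np.where(df["EDUCATION"].isin([1, 2, 3, 4]), df["EDUCATION"], 4)

# STAT_FIX: map the undocumented STATUS code 0 to 3 ("other").
df["STAT_FIX"] = np.where(df["STATUS"].isin([1, 2, 3]), df["STATUS"], 3)

# One-hot encode the categorical variables into binary indicator columns,
# e.g. GENDER_1/GENDER_2, EDU_FIX_1..EDU_FIX_4, STAT_FIX_1..STAT_FIX_3.
df = pd.get_dummies(df, columns=["GENDER", "EDU_FIX", "STAT_FIX"])
df = df.drop(columns=["EDUCATION", "STATUS"])
```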

The algorithms below are tested on two data sets. DATASET1 contains all 28 variables as features. DATASET2 contains the same variables except for PAY_1 – PAY_6. Arguably, the information on payment delinquencies contained in the PAY variables can to a great extent be extracted from the combination of bill statements and payments made (the BILL_AMT and PAY_AMT variables). Note, however, that the PAY variables are not a simple transformation of BILL_AMT or PAY_AMT: PAY is an indicator of the number of months of payment delay, while BILL_AMT and PAY_AMT are balance and payment amounts (in dollars). The purpose of DATASET2 is to demonstrate that some algorithms are more effective at implicitly synthesizing features without the need for explicit feature engineering. Next to that, this data set is a better example of the type of transaction data that would be made available under PSD2.
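Continuing the sketch above, the two experimental data sets and a stratified train/test split can be constructed as follows; the variable names and the 25% test size are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

target = "DEFAULT"  # assumed name of the target column

# DATASET1: all features; DATASET2: the same minus PAY_1 ... PAY_6.
pay_cols = ["PAY_%d" % i for i in range(1, 7)]
X1 = df.drop(columns=[target])
X2 = df.drop(columns=[target] + pay_cols)
y = df[target]

# A stratified split keeps the ~22% default rate in both the train and test sets.
X1_train, X1_test, y_train, y_test = train_test_split(
    X1, y, test_size=0.25, stratify=y, random_state=7
)
```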

3.2 Machine Learning tools

This section presents the considerations required for the selection, implementation and evaluation of machine learning algorithms in the context of credit risk assessment. It also shows a concrete implementation and application of the framework on two example data sets. The data pre-processing and machine learning algorithms are implemented in Python 3 using scikit-learn version 0.19.1, with small exceptions for which links to the additional packages or software are provided.

3.2.1 Deciding on the goal and performance measures

The goal of the credit risk assessment is to minimise potential losses while assuring an acceptable loan origination rate. Portfolio growth and minimising risk are not necessarily contradictory objectives; however, in certain business environments (e.g. a recession or a highly competitive mature market) they can constrain each other. It is of strategic importance to any financing provider to set clear targets in terms of portfolio growth and asset quality, and to prioritise one of the two in case they act as constraints on each other.

The acceptable range of portfolio growth and asset quality can be translated into concrete performance metrics of the classification algorithms used for credit assessment. Recall is the performance metric that measures how many of the positive samples (defaulted clients) are captured by the positive predictions:

Recall (True Positive Rate) = True Positives / (True Positives + False Negatives)

The recall can be seen as the percentage of bad clients to which the framework denies a loan. A lower recall directly translates into higher potential losses and lower asset quality of the loan portfolio.

The false positive rate, on the other hand, measures the share of good clients that are incorrectly classified as bad:

False Positive Rate = False Positives / (False Positives + True Negatives)

In the context of credit risk assessment, a false positive means rejecting a loan request from a good client. A high false positive rate therefore means that a higher percentage of good clients is rejected and, consequently, a lower origination rate.
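A minimal sketch of computing both metrics from a classifier's hard predictions, with the defaulted class encoded as the positive class (1); the label vectors below are purely illustrative.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Illustrative true labels and hard predictions (1 = defaulted, 0 = performing).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / float(tp + fn)   # share of bad clients that are correctly denied a loan
fpr = fp / float(fp + tn)      # share of good clients that are wrongly rejected

print(recall, recall_score(y_true, y_pred))  # 0.75 0.75
print(fpr)                                   # 0.1666...
```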


The trade-off between optimising portfolio growth and minimising potential losses can thus be translated into the trade-off between optimising the false positive rate and optimising the recall.

In order to evaluate the performance of each algorithm below, a stratified 10-fold cross-validation with a grid search over the main model parameters is applied. The use of stratified k-fold cross-validation is a standard approach for classification problems. This approach works well only when the data contain one observation per client. When there are multiple observations per client, cross-validation with groups (group k-fold cross-validation) should be used instead: in order to accurately evaluate generalization, it must be ensured that the training set and the test set do not contain observations of the same client.
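The evaluation protocol can be sketched as follows with scikit-learn, continuing from the earlier data-preparation sketches; the parameter grid is illustrative, and the GroupKFold variant is shown in comments for the case of multiple observations per client.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, StratifiedKFold

# Stratified 10-fold CV with a grid search over the regularization parameter C.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced"),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
    cv=cv,
)
grid.fit(X1_train, y_train)
print(grid.best_params_, grid.best_score_)

# With multiple observations per client, split by client instead, so that the same
# client never appears in both the training and the validation fold:
# group_cv = GroupKFold(n_splits=10)
# grid_groups = GridSearchCV(estimator, param_grid, cv=group_cv)
# grid_groups.fit(X, y, groups=client_ids)
```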

The performance of each machine learning algorithm presented below is reported as the best mean cross-validation accuracy score and the test accuracy score. In addition, the precision-recall and ROC curves are shown, with the point corresponding to the default classification threshold of 0.5 marked on each curve. For both curves, the area under the curve (AUC) is also reported as a measure of the overall performance of the algorithm across all possible thresholds.

It is important to note that for highly imbalanced data sets the precision-recall curve and its AUC are better performance measures [48]; therefore, the precision-recall AUC is the leading metric in the algorithm evaluation on the example data sets.
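Both curves and their AUCs can be computed from the predicted default probabilities, for example as follows (continuing from the fitted grid search in the previous sketch):

```python
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score, roc_curve

# Predicted probability of the positive (default) class on the held-out test set.
y_score = grid.predict_proba(X1_test)[:, 1]

# Precision-recall curve and its AUC (preferred for imbalanced data).
precision, recall, _ = precision_recall_curve(y_test, y_score)
pr_auc = auc(recall, precision)

# ROC curve and its AUC.
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = roc_auc_score(y_test, y_score)

print("PR AUC: %.3f, ROC AUC: %.3f" % (pr_auc, roc_auc))
```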

3.2.2 Selected algorithms

The presented framework proposes a set of three machine learning algorithms: Logistic Regression, Random Forest and XGBoost. Logistic Regression is considered an industry standard to which any new approach is compared. Random Forest is an ensemble classifier that is known for its performance and ease of use. XGBoost is a novel approach that has gained popularity in the Kaggle competitions as one of the most efficient and powerful classifiers.

Each of the algorithms has its advantages and drawbacks and may be the preferred approach in particular circumstances. Two important classification algorithms are left out of scope: the SVM and neural network classifiers. SVM achieves performance similar to logistic regression while providing no significant advantage; it is also less interpretable, as it does not represent the score as a tractable parametric function of the features. The use of neural network classifiers for credit risk assessment is left for further research due to their complexity and lack of transparency.

In 3.2.3 the performance of the selected algorithms is also compared to the performance reported by [47] on the same data set.


3.2.2.1 Logistic regression

Logistic regression (LR) is a widely used classifier for credit risk assessment. The major advantage of the approach is that it is simple and provides an efficient mechanism for calculating probabilities. With few and independent features, LR is also highly interpretable. The major drawback is that it does not work well with non-linear and interdependent features. It also does not perform well when the feature space is too large or when there is a large number of categorical features. In practice, LR requires a large sample size; a common rule of thumb is at least 10 cases of the least frequent outcome for each independent variable.

Regularised LR is used to overcome the limitations regarding the large number and interdependence of features. Both unregularised and Lasso (L1-regularised) LR are applied to the example data sets. Lasso LR shrinks the regression coefficients by imposing a penalty on their size, which reduces the variance in the presence of multicollinearity. Next to that, Lasso also performs variable selection by forcing certain coefficients to zero. In many cases (especially with a large number of features), Lasso not only increases the prediction accuracy but also increases the interpretability of the model by leaving only the most important features.
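A sketch of an L1-regularised LR on the example data, with feature scaling and a grid search over the penalty strength C; the settings are illustrative and not an exact reproduction of the configurations behind Tables 5 and 6.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LogisticRegression(penalty="l1", solver="liblinear")),  # L1 (Lasso) penalty
])

# Smaller C means stronger shrinkage and more coefficients forced to exactly zero.
grid_l1 = GridSearchCV(
    pipe,
    param_grid={"lr__C": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]},
    cv=cv,
)
grid_l1.fit(X1_train, y_train)

coefs = grid_l1.best_estimator_.named_steps["lr"].coef_.ravel()
print("Best C:", grid_l1.best_params_["lr__C"])
print("Non-zero coefficients:", (coefs != 0).sum(), "of", coefs.size)
```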

Logistic regression with no regularization | DATASET1 | DATASET2
Best mean cross-validation score | 0.695 | 0.524
Test-set score | 0.693 | 0.518
Precision-Recall AUC | 0.473 | 0.302
ROC AUC | 0.722 | 0.639
Feature ranking (top 5 by coefficients × 1000) | 1. PAY_1 (289.75); 2. BILL_AMT1 (-203.56); 3. PAY_AMT2 (-106.05); 4. BILL_AMT2 (88.45); 5. PAY_AMT2 (-88.05) | 1. BILL_AMT1 (-295.17); 2. BILL_AMT2 (197.05); 3. PAY_AMT2 (-194.36); 4. PAY_AMT1 (-172.56); 5. BILL_AMT3 (73.66)

[Plots omitted: coefficients (× 1000), precision-recall curve and ROC curve for both data sets.]

Table 5. Logistic regression without regularization – model parameters and results.

LR without regularization achieves a higher accuracy (both mean cross-validation and test score) and a larger Precision-Recall AUC for DATASET1 than for DATASET2. The removal of the variables PAY_1 – PAY_6 leads to a significantly lower predictive power. Arguably, a significant amount of the information contained in PAY_1 – PAY_6 is also present in some of the other features (as discussed in 3.1.5.1). LR requires carefully engineered features, and in the case of DATASET2, features similar to PAY_1 – PAY_6 would need to be explicitly engineered.

The scatter plot of the coefficients of the LR applied on DATASET1 shows that the most important features are PAY_1 (the payment delay in the most recent month) and BILL_AMT1 (the outstanding loan amount in the most recent month). The importance of PAY_1 shows that recent payment delinquencies are a good predictor of default.

Logistic regression with L1 regularization | DATASET1 | DATASET2
Best C parameter (regularization parameter) | 0.000464 | 0.2329
Cross-validation | Stratified 10-fold with random state 7 | Stratified 10-fold with random state 7
Best mean cross-validation score | 0.781 | 0.524
Test-set score | 0.778 | 0.518
Precision-Recall AUC | 0.468 | 0.302

[Plots omitted: coefficients (× 1000), precision-recall curve and ROC curve for both data sets.]

Table 6. Logistic regression with L1 (LASSO) regularization – model parameters and results.

The results for the regularised LR applied on DATASET1, presented in Table 6, demonstrate how Lasso increases the interpretability by performing variable selection and keeping only the most important features. In this case, PAY_1 (the payment delay in the most recent month) is selected as the single most important feature used for classification. Although the accuracy score is higher in comparison to the LR without regularization, the Precision-Recall AUC is lower, which means that the overall performance of the classifier across all possible decision thresholds is worse.

Applying LR with L1 regularization to DATASET2 does not lead to any improvements over the simple LR. L1 regularization is useful when applied to data sets with a large number of potentially interdependent features. In the case of DATASET2, it fails to perform any meaningful variable selection due to the low number of features and the seemingly low explanatory value of the information contained in those features.

3.2.2.2 Random forest

The Random forest (RF) classifier is an ensemble of individual decision tree classifiers, whereby the predicted class is the mode of the predictions of the individual trees. It is one of the best performing machine learning algorithms and is reported [13] to achieve better performance than LR in the domain of credit risk assessment. Due to the randomness in building the
