
Predicting the potential acquisition targets of the healthcare industry in the United States

Author: Zhixian SONG

Supervisor: Dhr. Dr. N. P. A. (Noud) van Giersbergen

A thesis submitted in fulfillment of the requirements for the degree of Master of Science

Faculty of Economics and Business
Amsterdam School of Economics

July 8, 2016



Declaration of Authorship

I, Zhixian SONG, declare that this thesis titled, “Predicting the potential

acquisition targets of the healthcare industry in the United States” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:


UNIVERSITY OF AMSTERDAM

Abstract

Faculty of Economics and Business, Amsterdam School of Economics

Master of Science

Predicting the potential acquisition targets of the healthcare industry in the United States

by Zhixian SONG

In this thesis, various sampling techniques are applied to the Logit model, the Probit model and the Support Vector Machines (SVMs) model. Although it is well known that sampling techniques can to some extent mitigate the class imbalance issue, it remains uncertain whether this improvement yields better predictive power than no sampling in these three models. By comparison, the classification performance of the Logit model is similar to that of the Probit model. On one hand, the SVMs model has a much higher accuracy rate (around 90%) than the Logit model (around 65%) and predicts True Negatives better; on the other hand, the Logit model predicts True Positives better. Applying the sampling techniques, we conclude that the development of sampling techniques can improve the performance structure of the Logit, Probit and SVMs models in the case of an imbalanced dataset.


Acknowledgements

I would like to express my deep gratitude to my research supervisor, Dhr. Dr. N. P. A. (Noud) van Giersbergen, for his patient guidance, enthusiastic encouragement and useful critiques of this research work.

My grateful thanks are also extended to Mr. Zhu for his help in collecting data.

Finally, I wish to thank my parents and little brother for their support and encouragement throughout my study.


Contents

Declaration of Authorship

Abstract

Acknowledgements

1 Introduction

2 Literature Review
   2.1 Logit/Probit
   2.2 Support Vector Machines
   2.3 Imbalanced Data and Sampling Methods
   2.4 Performance Measures

3 Model and Method
   3.1 Logit/Probit Model
   3.2 Support Vector Machines
   3.3 Sampling Methods
   3.4 AUROC and AUPR

4 Data and Variables
   4.1 Data
   4.2 Variables

5 Analysis and Results
   5.1 Impact of The Variables
   5.2 Performance and Analysis
       5.2.1 Logit model vs. Probit model vs. SVMs model
       5.2.2 Comparison of Various Sampling Techniques on Models

6 Conclusion
   6.1 Conclusion
   6.2 Future Work

A Appendix A: Data Processing


List of Figures

1.1 Merger and Acquisition transactions in US
2.1 Normal and logistic densities
3.1 The SMOTE algorithm


List of Tables

1.1 Merger and Acquisition transactions in the United States in 2015, source: Thomson Financial
2.1 Confusion Matrix
4.1 US healthcare public companies in the United States in 2015
4.2 Financial Variables included in the Models
5.1 Descriptive Statistics and Kruskal-Wallis Test
5.2 Classification Result: AUROC and AUPR
5.3 Coefficient Estimates for the Logit Model
5.4 Coefficient Estimates for the Probit Model
5.5 Comparison of Accuracy Rate: (TP+TN)/(TP+FP+TN+FN)
5.6 Comparison of 1-Specificity Rate: FP/(FP+TN)
5.7 Comparison of Recall Rate: TP/(TP+FN)


List of Abbreviations

SVMs: Support Vector Machines
ROC: Receiver Operating Characteristics
AUROC: Area Under the Receiver Operating Characteristics
AUPR: Area Under the Precision-Recall curve
SMOTE: Synthetic Minority Over-sampling Technique
NYSE: New York Stock Exchange
NASDAQ: National Association of Securities Dealers Automated Quotations
ML: Maximum Likelihood


Chapter 1

Introduction

In recent years, advanced econometrics and machine learning algorithms have been developing at a fast pace. Researchers from other domains have gradually started to pay close attention to these new methods and models, applying them in their own fields to analyze results with high accuracy and reach more sound conclusions (Huang et al., 2005; Tay & Cao, 2002; Wu et al., 2007).

Due to the exponential increase of data, the characteristics of datasets have become an important topic. Among them, class imbalance, defined by the fact that one class (called the majority or negative class) vastly outnumbers the other (called the minority or positive class), has turned out to be a problem of particular concern (Japkowicz, 2000). In the past 40 years, various models and algorithms have been applied to identify potential acquisition targets, such as discriminant analysis (Bartley & Boardman, 1990), logit analysis (Powell, 2001), artificial neural networks (Cheh et al., 1999), rough sets (Slowinski et al., 1997) and multicriteria decision aid (MCDA) (Pasiouras et al., 2010, 2007). Support Vector Machines (SVMs) were not considered until Pasiouras, Doumpos and Gaganis compared them with other techniques and concluded that SVMs can provide robust performance. The combination of SVMs with sampling techniques in an integrated model is therefore still in its infancy. In this thesis, we use a sample of US healthcare companies to compare the relative performance of the Logit model, the Probit model and the SVMs model with an RBF kernel, and investigate the effect of various sampling techniques on the prediction of positive outcomes.

We choose the US healthcare industry for two reasons. First, few studies have emerged on prediction models designed for the healthcare industry. Second, over the last decade, merger and acquisition (M&A) transactions have enjoyed sustained growth. According to Figure 1.1 and Table 1.1 (source: Thomson Financial, 2015), from the perspective of transaction size ($mm), the healthcare industry accounts for 17.58%, ranking No. 2 in the United States; while, from the perspective of the number of transactions,


TABLE 1.1: Merger and Acquisition transactions in the United States in 2015, source: Thomson Financial

Industry                   | Size in million dollars | Number of Transactions
Energy                     | 203,314                 | 548
Materials                  | 126,653                 | 549
Industrials                | 169,939                 | 2,413
Consumer Discretionary     | 209,889                 | 2,794
Consumer Staples           | 71,518                  | 577
Healthcare                 | 331,050                 | 1,685
Financials                 | 383,821                 | 7,212
Information Technology     | 309,720                 | 2,317
Telecommunication Services | 8,719                   | 80
Utilities                  | 68,123                  | 220

it also reaches a high level, 9.16%, ranking No. 5. Furthermore, considering that people focus more and more on their health, the healthcare industry has a promising future. Hence, not surprisingly, the development of classification models that can predict acquisition targets in the healthcare sector has strong appeal to several parties. On one hand, buyers, including healthcare companies and financial companies, are interested in which kinds of companies in the healthcare industry deserve attention. On the other hand, companies should be careful, and sometimes need to take action, if they are on the list of potential targets. In addition, academics and researchers working on similar topics in finance and management may draw inspiration from the results of this study.

FIGURE 1.1: Merger and Acquisition transactions in US, source: Thomson Financial (2015)


The remainder of this thesis is organized as follows. Chapter 2 reviews the related literature. Chapter 3 presents the Logit/Probit models, the Support Vector Machines model, the sampling methods, AUROC (Area Under the Receiver Operating Characteristics) and AUPR (Area Under the Precision-Recall curve). Chapter 4 describes the empirical data and selected variables. Chapter 5 shows the results of the models. Then, Chapter 6 discusses what we have found and what it implies in practice, plus some possible directions for further research.


Chapter 2

Literature Review

Previous literature related to the present chapter can be classified into four parts: Logit/Probit, Support Vector Machines, Imbalanced Data and Sampling Methods, and Performance Measures.

2.1 Logit/Probit

In statistics, the logit function and the probit function are widely applied to binary classification. The logit function is the quantile function associated with the logistic distribution (density $f(t) = \frac{e^t}{(1+e^t)^2}$), while the probit function is that of the standard normal distribution (density $f(t) = \phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$). In general, there is little difference in output between these two models (Hardin & Hilbe, 2007). One exception is when datasets are imbalanced, in which case the tails of the distributions may cause differences between the logit model and the probit model (Heij et al., 2004).

FIGURE 2.1: Normal and logistic densities, source: Heij et al. (2004)


Better-fitted parameters lead to better results, so how the parameters are estimated matters. For these non-linear models, Maximum Likelihood (ML) is a good choice, provided the sample is large enough that the bias is small.

2.2 Support Vector Machines

As a popular machine learning technique, Support Vector Machines (SVMs) have been successfully applied to many areas, such as classification and regression analysis (Chang & Lin, 2011; Cristianini & Shawe-Taylor, 2000), since Vapnik and Chervonenkis invented them in 1963. The SVMs learning algorithm aims to find the optimal separating hyperplane that can separate the data points into two classes.

In 1992, Vapnik, Boser and Guyon developed SVMs further by applying a kernel function to the maximum-margin hyperplanes. This "kernel trick" uses a kernel function to map the non-separable space into a higher-dimensional space, where the classes can be separated with a good fit.

Later, researchers found that although SVMs deal with balanced datasets effectively, their performance on imbalanced datasets is not good (Akbani et al., 2004; Veropoulos et al., 1999; Wu & Chang, 2003). SVMs can be sensitive to class imbalance for the following possible reasons: (1) weakness of the soft-margin optimization problem: in the case of class imbalance, the separating hyperplane can be skewed toward the minority class, and this skewness can worsen the performance of SVMs (Veropoulos et al., 1999); (2) the imbalanced support-vector ratio: as the training data becomes more imbalanced, the ratio between the positive and negative support vectors also becomes more imbalanced.

2.3 Imbalanced Data and Sampling Methods

With the rapid development of information technology, the availability of raw data has been growing at an explosive rate. Consequently, the issue of an imbalanced class distribution has become more pronounced in the real world, ranging from text classification (Cohen, 1995; Dumais et al., 1998; Lewis & Catlett, 1994; Mladenic & Grobelnik, 1999), telecommunications management (Ezawa et al., 1996), fraudulent telephone calls (Kubat et al., 1996) and bioinformatics (Radivojac et al., 2004), to the detection of oil spills in satellite radar images (Kubat et al., 1998).



In a two-class dataset, one class vastly outnumbers the other (called the minority, or positive class), and the issue of an imbalanced dataset matters only when the minority class is of interest. As an example, consider a dataset for credit card fraud detection: typically, a transaction dataset might contain 99% legitimate transactions and 1% fraudulent transactions. A naive classifier that always predicts the majority class would have an accuracy of 99%, which shows how ineffective accuracy is in determining a classifier's performance. This challenge has attracted much attention from academia and industry, and has called into question the performance of most standard learning algorithms, since these algorithms assume balanced class distributions or equal misclassification costs.

From the perspective of data generation, sampling methods are a key to overcoming class imbalance. The main idea is to create a new dataset with a relatively balanced class distribution. Initially, random under-sampling and random over-sampling methods came forward. Soon, researchers argued that neither escapes serious drawbacks. In random under-sampling, the discarded majority-class instances, which can make up more than 50% of all instances, may carry useful information, and discarding them can hurt classification performance. Instead of discarding useful information, the ideal is to retain all useful information in the majority class by removing only redundant, noisy and/or borderline instances. Since the Tomek Links method can remove borderline and noisy instances (Tomek, 1976) and the Condensed Nearest Neighbor (CNN) can remove redundant instances (Hart, 1968), Kubat and Matwin (1997) combined these two methods to deal with the major drawback of random under-sampling.

In random over-sampling, over-fitting is a serious issue, especially when replicating instances at random. Inspired by predecessors (Ha & Bunke, 1997), Chawla et al. (2002) proposed a method of creating "synthetic" instances rather than replicating at random. Although this method, the Synthetic Minority Over-sampling Technique (SMOTE), has become one of the favored ways to balance class distributions, it still cannot define class groups well. Hence, SMOTE+Tomek, a combination of over-sampling and under-sampling, arose to create better-defined class groups and was first applied in bioinformatics (Batista et al., 2003).


TABLE 2.1: Confusion Matrix

                | Predicted Negative | Predicted Positive
Actual Negative | TN                 | FP
Actual Positive | FN                 | TP

In this thesis, random over-sampling, random under-sampling, Tomek Links, SMOTE and SMOTE+Tomek are respectively added to the logit/probit models and the Support Vector Machines model. Then, we discuss the effect of these sampling techniques on these models.

2.4 Performance Measures

In order to find a relatively optimal algorithm for an imbalanced dataset, it is critical to have standardized evaluation methods to properly assess the effectiveness of algorithms. All the metrics discussed in this thesis are based on the confusion matrix, as illustrated in Table 2.1. In this matrix, TN is the number of negative instances correctly classified (True Negatives), FN is the number of positive instances misclassified as negative (False Negatives), FP is the number of negative instances misclassified as positive (False Positives), and TP is the number of positive instances correctly classified (True Positives).

On the whole, researchers have identified three families of evaluation metrics (Caruana & Niculescu-Mizil, 2004; Ferri et al., 2009): the threshold metrics, e.g.

$$\text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad \text{precision} = \frac{TP}{TP+FP},$$

$$\text{G-mean} = \sqrt{\frac{TP}{TP+FN} \cdot \frac{TN}{TN+FP}}, \qquad \text{F-measure} = \frac{(1+\beta^2) \cdot \text{recall} \cdot \text{precision}}{\beta^2 \cdot \text{recall} + \text{precision}};$$

the ranking methods and metrics, e.g. Receiver Operating Characteristics (ROC) analysis; and the probabilistic metrics, e.g. root-mean-squared error. Technically speaking, the threshold metrics can be divided into two groups: a multiple-class focus and a single-class focus (Japkowicz & Shah, 2011). The first group, including accuracy, error rate, Cohen's kappa and Fleiss' kappa, focuses on the overall performance on all the classes in the dataset and does not consider the varying degrees of importance of the different classes (Ferri et al., 2009). Hence, these metrics do not behave well on an imbalanced dataset unless the class ratio is taken into consideration. On the contrary, single-class focus metrics, such as sensitivity/specificity, precision/recall, G-mean and F-measure, are better suited, because they are more sensitive to the different contributions of the different classes.
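To make these definitions concrete, the sketch below (not from the thesis; the counts are made up) computes the threshold metrics directly from confusion-matrix counts.

```python
import numpy as np

# A minimal sketch: the threshold metrics of Section 2.4 computed from
# hypothetical confusion-matrix counts TN, FP, FN, TP.
def threshold_metrics(tn, fp, fn, tp, beta=1.0):
    recall = tp / (tp + fn)                     # sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "g_mean": np.sqrt(recall * specificity),
        "f_measure": (1 + beta**2) * recall * precision
                     / (beta**2 * recall + precision),
    }

print(threshold_metrics(tn=85, fp=5, fn=6, tp=4))
```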


However, many of these metrics are sensitive to data distributions (He & Garcia, 2009). For example, suppose there are 10 positive instances and 90 negative instances in the dataset. If a model predicts all instances as negative, its accuracy reaches 90%, a high rate; yet the model is useless, since no matter what the feature values are, the predicted result is always negative. This has been shown in many representative works (Chawla et al., 2003; Fawcett & Provost, 1997; Fawcett et al., 1998; Guo & Viktor, 2004; Joshi et al., 2001; Maloof, 2003; Sun et al., 2007; Weiss, 2004). Researchers have therefore paid attention to other evaluation metrics such as precision, recall, F-measure and G-mean.

Precision is also sensitive to data distributions, while recall is not; but recall provides no insight into how many instances are mislabeled as positive. F-measure and G-mean are still imperfect for classification evaluation, even though they are a great improvement over accuracy (He & Garcia, 2009). In the past few years, many newer combinations of threshold metrics (e.g., MCWA (Cohen et al., 2006), Optimized Precision (Ranawana & Palade, 2006), Adjusted G-mean (Batuwita & Palade, 2006), Index of Balanced Accuracy (Garcia et al., 2010)) have been proposed, but none of them overcomes the important shortcoming that they assume full knowledge of the operating conditions.

In order to solve this problem, the Receiver Operating Characteristics (ROC) analysis has been considered and discussed widely (Fawcett, 2006). The ROC curve provides a good method for assessing the performance of classifiers as the discrimination threshold varies, because it represents the relative trade-offs between the benefits (sensitivity, $TPR = \frac{TP}{TP+FN}$) and costs (1-specificity, $FPR = \frac{FP}{FP+TN}$) of classification with regard to data distributions. Perfect classification would be the point (0%, 100%): all positive instances classified correctly and no negative instances misclassified. Hence, the Area Under the ROC curve (AUROC) has become the de facto standard metric for evaluating classifiers in the presence of an imbalanced dataset (Bradley, 1997).

Nowadays, AUROC has become one of the most powerful evaluation methods. However, it also has shortcomings. Firstly, one of the main critiques is that changes in class distributions often change the true and false positive rates, which can invalidate the results (Webb & Ting, 2005). Fawcett and Flach disagree with this claim and argue that, of the two general types of domains, Webb and Ting's concerns apply to only one (Fawcett & Flach, 2005).


Hence, it is recommended to consider the potential limitations of ROC analysis before fully trusting it. Secondly, in the case of highly skewed datasets, an algorithm optimizing AUROC is not guaranteed to optimize AUPR (Davis & Goadrich, 2006). Just as ROC curves are to AUROC, so Precision-Recall (PR) curves are to AUPR (Area Under the Precision-Recall curve). Therefore, much recent research prefers PR curves (Landgrebe et al., 2006; Singla & Domingos, 2005). Besides AUPR, B-ROC (Bayesian ROC), designed by Cardenas and Baras, is an alternative method. Compared to ROC analysis, B-ROC is suited for analyzing classifier performance on highly skewed datasets for the following three reasons: (1) it allows controlling for a low false positive rate; (2) it allows plotting different curves for different class distributions; (3) it bypasses the issue of misclassification cost estimation altogether (Cardenas & Baras, 2006). Therefore, AUROC and AUPR are chosen as the primary evaluation criteria. At the same time, accuracy, recall and the 1-specificity rate are reported for reference.


Chapter 3

Model and Method

3.1 Logit/Probit Model

In both the logit and probit models, a binary outcome variable follows a Bernoulli distribution, taking the value 1 (positive class) with probability $p_i$ and 0 (negative class) with probability $1 - p_i$. In the logit model, the distribution $F(x'\beta)$ is the logistic distribution with density $f(x'\beta) = \Lambda(1-\Lambda)$, where $\Lambda = \frac{e^{x'\beta}}{1 + e^{x'\beta}}$. In the probit model, $F(x'\beta)$ is the standard normal distribution with density $f(x'\beta) = \phi(x'\beta) = \frac{1}{\sqrt{2\pi}} e^{-(x'\beta)^2/2}$. Since $p_i = P[y_i = 1] = F(x_i'\beta)$, we can maximize the log-likelihood to find the estimator $\hat\beta$ as follows:

• Probability distribution:
$$p(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}, \quad y_i = 0, 1 \tag{3.1}$$

• Likelihood function:
$$L(p_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} \tag{3.2}$$

• Log-likelihood:
$$\log L(\beta) = \sum_{i=1}^{n} y_i \log(p_i) + \sum_{i=1}^{n} (1 - y_i)\log(1 - p_i) \tag{3.3}$$
$$= \sum_{i=1}^{n} y_i \log(F(x_i'\beta)) + \sum_{i=1}^{n} (1 - y_i)\log(1 - F(x_i'\beta)) \tag{3.4}$$
$$= \sum_{i: y_i = 1} \log(F(x_i'\beta)) + \sum_{i: y_i = 0} \log(1 - F(x_i'\beta)) \tag{3.5}$$


• First-order conditions:
$$g(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{n} \frac{y_i}{p_i} f_i x_i - \sum_{i=1}^{n} \frac{1 - y_i}{1 - p_i} f_i x_i = \sum_{i=1}^{n} \frac{y_i - p_i}{p_i(1 - p_i)} f_i x_i \tag{3.6}$$

• Approximate distribution of the ML estimator:
$$\hat\beta \approx N(\beta, \hat V), \quad \hat V = \left[\sum_{i=1}^{n} \frac{\partial l_i}{\partial \beta}\frac{\partial l_i}{\partial \beta'}\right]^{-1} = \left[\sum_{i=1}^{n} \frac{(y_i - \hat p_i)^2}{\hat p_i^2 (1 - \hat p_i)^2} \hat f_i^2 x_i x_i'\right]^{-1} \tag{3.7}$$

Setting the ML first-order condition equal to zero implies $\frac{1}{n}\sum_{i=1}^{n} \hat p_i = \frac{1}{n}\sum_{i=1}^{n} y_i$; that is, the average predicted probabilities of the negative and positive classes equal the observed fractions of the negative and positive classes in the sample. Since the Hessian matrix $\frac{\partial^2 L(\theta)}{\partial \beta \partial \beta'}$ in the logit model and the probit model is negative definite, the ML first-order conditions have a unique solution. The estimated models give predicted probabilities for individuals and transform them into predicted results, defined as:

$$y_i = \begin{cases} 1 & \text{if } p_i = F(x_i'\beta) > v \\ 0 & \text{if } 0 \le p_i \le v \end{cases} \tag{3.8}$$

where $v$ is the threshold value.
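As an illustration of the ML estimation in (3.3)-(3.7) and the thresholding rule (3.8), the following sketch (a minimal sketch on simulated data, not the thesis code or data) fits both models with statsmodels and converts the fitted probabilities into classes.

```python
import numpy as np
import statsmodels.api as sm

# A minimal sketch on simulated data (features and coefficients are made up).
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(576, 10)))      # intercept + 10 features
beta = rng.normal(scale=0.3, size=11)
y = (rng.uniform(size=576) < 1 / (1 + np.exp(-X @ beta))).astype(int)

logit_res = sm.Logit(y, X).fit(disp=0)     # ML estimation of the logit model
probit_res = sm.Probit(y, X).fit(disp=0)   # ML estimation of the probit model

v = y.mean()   # threshold v set to the positive-class fraction (cf. Cramer, 1999)
y_hat = (logit_res.predict(X) > v).astype(int)       # rule (3.8)
print(y_hat.mean(), y.mean())   # average prediction vs. observed fraction
```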

3.2 Support Vector Machines

SVMs maximize the soft margin (the distance from the data points to the separating hyperplane) between the support vectors (the data points nearest to the separating hyperplane) and the hyperplane. First, a nonlinear mapping function $\phi$ transforms the data points into a higher-dimensional feature space, where separation is more effective. A candidate separating hyperplane is then represented by $w \cdot \phi(x_i) + b = 0$, where $w$ is the weight vector normal to the hyperplane and $b$ is a bias. The soft-margin optimization problem can be formulated as follows:

$$\min_{w,\xi,b} \left\{ \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i \right\} \tag{3.9}$$
$$\text{s.t. } y_i(w \cdot \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n \tag{3.10}$$

The slack variables $\xi_i \ge 0$ account for misclassified instances, and the penalty parameter $C$ controls the trade-off between the margin width and the classification errors. The optimization problem can be represented as a Lagrangian (dual) optimization problem:

$$\max_{\alpha_i} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j) \right\} \tag{3.11}$$
$$\text{s.t. } \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, n \tag{3.12}$$

Based on $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, the problem is transformed as below:

$$\max_{\alpha_i} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\} \tag{3.13}$$
$$\text{s.t. } \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, n \tag{3.14}$$

It remains to choose the optimal kernel function. Four basic kernels are cited frequently:

• Linear:
$$K(x_i, x_j) = x_i^T x_j \tag{3.15}$$

• Polynomial:
$$K(x_i, x_j) = (\gamma x_i^T x_j + c)^d, \quad \gamma > 0 \tag{3.16}$$

• Radial Basis Function (RBF):
$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0, \text{ parametrized using } \gamma = \frac{1}{2\sigma^2} \tag{3.17}$$

• Sigmoid:
$$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + c) \tag{3.18}$$

where $c$, $d$ and $\gamma$ are kernel parameters. Of these four kernel functions, the RBF kernel is a preferable choice for the following reasons (Hsu et al., 2010): (1) it still works when the relation between classes and features is nonlinear; (2) it involves fewer hyperparameters, so its model selection is less complex than that of, for example, the polynomial kernel; (3) it


poses fewer numerical difficulties, while the sigmoid kernel is only conditionally positive definite and the values of the polynomial kernel can go to infinity or zero when $|d|$ is large (Vapnik, 1995).
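A minimal sketch of an RBF-kernel SVM classifier follows (using scikit-learn, which is an assumption on our part; the thesis does not state its software). The gamma parameter plays the role of $\frac{1}{2\sigma^2}$ in (3.17).

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the thesis data: 576 firms, 10 features, ~6.25% positives.
X, y = make_classification(n_samples=576, n_features=10,
                           weights=[0.9375], random_state=0)

# RBF-kernel SVM of eq. (3.9)-(3.14) with K as in (3.17); C is the soft-margin penalty.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```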

3.3 Sampling Methods

Below is an introduction to the main sampling methods mentioned in Chapter 2 (a code sketch applying them follows the list):

• Random Over-sampling

Randomly copy and repeat samples from the minority group until reaching the size of the majority group.

• Random Under-sampling

Randomly remove samples from the majority group until reaching the size of the minority group.

• Tomek Links

Let $d(a, b)$ be the distance between instance $a$ from one class and instance $b$ from the other class. A pair $(a, b)$ is called a Tomek link if there is no instance $c$ such that $d(a, c) < d(a, b)$ or $d(b, c) < d(a, b)$; in other words, the two instances from different classes are nearest to each other. In a Tomek link, one of the two instances is noise, or both are borderline. Finding all such noisy and borderline instances and deleting them lets the imbalanced dataset be separated better.

• SMOTE

To create a synthetic instance: randomly select $a$ from the minority class and find its k nearest neighbors within the minority class. Then randomly choose $b$ from those k nearest neighbors. Finally, select a random point along the line segment between $a$ and $b$; this created point is called a synthetic instance. Synthetic instances are added until the dataset is balanced. The SMOTE algorithm is shown in Figure 3.1.


FIGURE 3.1: The SMOTE algorithm, source: Chawla et al. (2002)

• SMOTE+Tomek

As a cleaning step, Tomek links are applied to the over-sampled training set. Let $N$ be the difference between the size of the negative class and that of the positive class. First, Tomek links remove $m\% \cdot N$ instances from the positive class; second, SMOTE creates $(1 - m\%) \cdot N$ synthetic instances based on the remaining instances. Here $m \in [0, 100]$, and $m\%$ depends on the size and relative imbalance of the dataset.
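For concreteness, the sketch below applies the five resampling schemes listed above using the imbalanced-learn package (an assumption; the thesis does not name its implementation), on synthetic data with roughly the thesis's 420:28 training-class ratio.

```python
from collections import Counter

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from sklearn.datasets import make_classification

# Synthetic stand-in for the training set: about 420 negatives and 28 positives.
X, y = make_classification(n_samples=448, weights=[0.9375], random_state=0)

samplers = [RandomOverSampler(random_state=0), RandomUnderSampler(random_state=0),
            TomekLinks(), SMOTE(random_state=0), SMOTETomek(random_state=0)]
for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)       # resampled training set
    print(type(sampler).__name__, Counter(y_res))   # class counts after resampling
```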

As previously mentioned, we explore the behavior of the Logit, Probit and Support Vector Machines models under these various sampling methods.

3.4 AUROC and AUPR

As discussed in Chapter 2, AUROC and AUPR measure the performance of the models. Here, we choose one method to compute AUROC (Hand & Till, 2001). Given $n_0$ points of the negative class and $n_1$ points of the positive class, let $r_i$ be the rank of the $i$-th negative-class point in the combined ranking and $S_0 = \sum_{i=1}^{n_0} r_i$ the sum of the ranks of the negative-class instances:


$$\sum_{i=1}^{n_0} (r_i - i) = \sum_{i=1}^{n_0} r_i - \sum_{i=1}^{n_0} i = S_0 - \frac{1}{2} n_0(n_0+1) \tag{3.19}$$

$$\widehat{AUROC} = \frac{S_0 - \frac{1}{2} n_0(n_0+1)}{n_0 n_1} \tag{3.20}$$

Similarly, AUPR can be computed using the Precision-Recall curve instead of the ROC curve:

$$\widehat{AUPR} = \sum_{i=1}^{n} p_i \, \triangle r_i = \frac{1}{R} \sum_{i=1}^{n} y_i \frac{R_i}{i} \tag{3.21}$$

where $p_i = \frac{R_i}{i}$ is the proportion of the negative class among the first $i$ elements, $y_i = 0, 1$, $R$ is the total number of negative-class instances, and $\triangle r_i$ is the corresponding increment in recall.
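The rank-based estimator (3.20) is easy to verify numerically; the sketch below (made-up scores, not thesis output) computes it from the ranks of the negative-class instances and checks it against scikit-learn.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

# A minimal sketch: eq. (3.19)-(3.20) from the ranks of the negative class.
# Scores are made up and oriented as "probability of being negative".
rng = np.random.default_rng(1)
y = np.repeat([0, 1], [150, 50])                    # 0 = negative, 1 = positive
score_neg = rng.uniform(size=200) + 0.3 * (1 - y)   # negatives tend to rank higher

r = rankdata(score_neg)                             # ranks over all n0 + n1 instances
n0, n1 = (y == 0).sum(), (y == 1).sum()
S0 = r[y == 0].sum()                                # rank sum of negative instances
auroc_hat = (S0 - 0.5 * n0 * (n0 + 1)) / (n0 * n1)
print(auroc_hat, roc_auc_score(1 - y, score_neg))   # the two values agree
```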


Chapter 4

Data and Variables

4.1 Data

The dataset we employ consists of 36 acquired US companies in the healthcare sector and 540 non-acquired ones. Table 4.1 presents the percentages of acquired and non-acquired companies. This gives a skewed two-class dataset.

The acquired companies selected into the sample have to meet the following criteria: (1) they were acquired between January 1, 2015 and December 31, 2015. Although a long time span provides a large sample, it also introduces large time-series distortion into acquisition likelihood models, because the models are not robust over time (Powell, 1997) unless the economic environment, the motives for acquisition and other conditions do not change over time. Hence, the time span is chosen to offer an adequate number of acquired healthcare companies without sacrificing stability; (2) the acquisition represents the purchase of 100% of the ownership of the acquired healthcare company; (3) all are classified as healthcare public companies in the S&P Capital IQ Database and belong to three stock exchanges: OTCPK, NYSE and NASDAQ; (4) all were founded before 2013. Companies failing this last requirement are more likely to exhibit large rises and falls, since they are in their initial phase; furthermore, their relevant growth rates cannot be calculated since no prior year is available.

TABLE 4.1: US healthcare public companies in the United States in 2015

Class                  | Number | Percentage
Acquired Companies     | 36     | 6.25%
Non-acquired Companies | 540    | 93.75%


In order to be included in the sample, non-acquired companies have to: (1) be classified as healthcare public companies, and (2) have been founded before 2013. It should be noted that for both acquired and non-acquired companies the end of the fiscal year was set between December 27, 2014 and January 31, 2015. If the definitions of "end of the fiscal year" differ too much, the financial data cannot be compared. For example, A's annual revenue (fiscal year ending August 31) may be much less than B's (ending December 31) in the case of strong economic growth from August to December.

In this thesis, a training set and a test set are drawn from the whole dataset 20 times, and the results are averaged over the 20 test sets; this makes the results less dependent on which observations end up in the training and test sets. The training set, including 420 non-acquired and 28 acquired companies, is used to estimate the models or train the algorithms. Each training example has several features and one target variable. The target variable is known in the training set, so the algorithms can learn the relations between the features and the target variable. The test set, including the remaining 120 non-acquired and 8 acquired companies, is used to measure performance; its target variable is treated as unknown and is predicted by the algorithms. Comparing the predicted values to the real values, we can examine how the algorithms perform.
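A minimal sketch of this repeated splitting (assuming scikit-learn; the thesis does not specify how the splits were drawn beyond the 28/8 and 420/120 counts):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.default_rng(2).normal(size=(576, 10))   # placeholder features
y = np.array([1] * 36 + [0] * 540)                    # 36 acquired, 540 non-acquired

# 20 stratified splits: each test set gets 120 non-acquired and 8 acquired firms.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=128, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    assert y[train_idx].sum() == 28 and y[test_idx].sum() == 8
    # estimate the models on the training split, evaluate on the test split,
    # then average the 20 test-set results as described above
```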

4.2 Variables

Table 4.2 lists the variables included in the models. In the following, we explain the meaning of these financial variables and their influence.

EPS, diluted earnings per share, as used in this thesis has been adjusted for historical stock splits. Generally speaking, a higher EPS means that a company is highly profitable. Note that if two firms have the same diluted EPS, all other things being equal, the firm with less equity is more efficient at using its capital to generate income. ROA, return on assets, tells us what earnings were generated from invested capital (assets). It is the maximum growth rate that a firm can achieve without resorting to external financing. For public companies it is highly industry-dependent; in this thesis all comparable companies are chosen from the healthcare industry, so this index is usable.

TABLE 4.2: Financial Variables included in the Models

Variable | Definition | Description
EPS  | Diluted Earnings per Share | Earnings divided by Diluted Weighted Average Shares Outstanding.
ROA  | Return on Assets | It determines a company's internal growth rate.
LFCF | Levered Free Cash Flow | An indicator of how much free cash flow is left over after a company has paid its obligations on its debt.
DFCF | LFCF - Unlevered FCF | The difference is expenses, such as operating expense and interest payment.
SIZE | Total Assets | Anything that a business owns, has value, and can be converted to cash.
LQD  | ST Ratio | (Total Cash + Short-term Investments) divided by Total Assets.
CR   | Current Ratio | Current Assets / Current Liabilities.
D/E  | Debt / Equity | A debt ratio used to measure a company's financial leverage.
D/A  | Total Liabilities / Total Assets | The proportion of a company's assets that are financed through debt.
CS   | Cash and Short-term Investments | Indicates how much the company can pay directly in the short term.

LFCF, the levered free cash flow, represents the amount of cash that remains for stockholders and for investment after all obligations are covered. Free cash flow is a factor investors consider when scrutinizing the health of a business. When the cash from operations is not enough to cover obligations, the levered cash flow can be negative even while the operating cash flow is positive.

DFCF, the difference between levered and unlevered free cash flow, comes from the financial obligations that are paid out of levered free cash flow. It shows how many financial obligations the business has, that is, whether the business is operating with a healthy or an unhealthy (overextended) amount of debt.

SIZE, total assets, can influence both synergy (i.e. economies of scale and scope) and acquisition motivations, which makes large firms attractive. However, a large company may also be more difficult to acquire, because its big shareholders and bondholders may have more powerful resources to fight a hostile bid. In an empirical study, Ambrose and Megginson (1992) found that the probability of receiving a takeover bid is negatively related to firm size.

LQD, the ST ratio, measures the proportion of cash and short-term investments in total assets. Cash and short-term investments (CS) are considered highly liquid assets.

CR, the current ratio, is a liquidity and efficiency ratio indicating how much in current assets a firm holds to pay off its short-term liabilities. Also known as the working capital ratio, the current ratio is an important measure of liquidity and short-term financial health, because short-term liabilities are due within the next year. As one of the common liquidity ratios, the current ratio is frequently used by bankruptcy analysts and mortgage originators to determine whether a company will be able to continue as a going concern. If the current ratio is not larger than 1, the company may run into trouble paying back creditors in the short term; the worst-case scenario is bankruptcy. But a high current ratio is not always a good sign: it can indicate that the company is not investing its excess cash and short-term assets. In a booming economy this amounts to waste, and a buyer can profit from these liquid assets by acquiring the company.

The D/E ratio measures the extent to which a company finances its assets with debt relative to shareholders' equity. A higher D/E ratio typically shows that a company has financed its growth aggressively with debt, and there may be greater potential for financial distress if earnings do not exceed the cost of borrowed funds. Since some highly capital-intensive industries, such as services, utilities and the industrial goods sector, tend to have higher D/E ratios, it is not a good index for cross-industry comparisons. However, we compare within a single industry (the healthcare sector), so the D/E ratio works here. Moreover, a higher D/E ratio makes it more difficult for a company to borrow money in the future, which is also a credit risk.

D/A, total liabilities over total assets, is a leverage ratio that measures total debt relative to assets. If the ratio is less than 0.5, most of the company's assets are financed through equity; if it is greater than 0.5, most are financed through debt.


Chapter 5

Analysis and Results

5.1 Impact of The Variables

Table 5.1 shows descriptive statistics (mean and standard deviation) and the results of the Kruskal-Wallis test (H0: the data in each categorical group come from the same distribution) for differences in the variables between acquired and non-acquired healthcare firms, based on the training sample. These results indicate that acquired healthcare firms were on average more profitable (EPS and ROA) with lower liquidity (CR and LQD). In real life, higher profitability means that in the long run the company could make more money for shareholders, while lower liquidity implies worse short-term financial health. Such promising firms suffering short-term financial distress would give buyers a good acquisition opportunity; hence, companies with short-term financial distress and higher profit are attractive targets. However, according to the Kruskal-Wallis test, only DFCF and D/E appear significantly different between the two groups of healthcare companies, with p-values below 1%. Translated into financial language: if a business is operating with an unhealthy amount of debt, it becomes a potential target with high probability.

5.2 Performance and Analysis

In this section, AUROC and AUPR are used to evaluate the predictive power of the Logit, Probit and SVMs models under different sampling techniques. At the same time, the accuracy rate, specificity and recall are reported for reference, to understand performance from different perspectives. Table 5.2 presents the classification performance in terms of AUROC and AUPR. Some interesting findings follow.



TABLE 5.1: Descriptive Statistics and Kruskal-Wallis Test

Features | Acquired Mean (Std.Dev) | Non-acquired Mean (Std.Dev) | Kruskal-Wallis Chi-square
EPS      | -0.436 (1.301) | -0.561 (1.259) |  0.24 (0.624)
CS       |  4.049 (2.599) |  3.745 (1.929) |  0.41 (0.520)
SIZE     |  5.316 (2.757) |  4.704 (2.288) |  1.21 (0.271)
LFCF     |  0.527 (4.178) | -0.341 (3.368) |  0.41 (0.524)
ROA      | -0.171 (0.339) | -0.274 (0.505) |  1.71 (0.191)
CR       |  1.458 (0.824) |  1.576 (0.828) |  0.59 (0.442)
D/A      | -1.040 (0.828) | -1.136 (0.949) |  0.47 (0.495)
LQD      |  0.474 (0.351) |  0.540 (0.343) |  1.16 (0.281)
DFCF     |  1.456 (1.977) |  0.796 (1.425) |  6.58* (0.010)
D/E      |  0.422 (0.370) |  0.247 (0.500) | 18.47* (0.000)

Notes: p-values in parentheses in the last column. * Statistically significant at the 1% level.

TABLE 5.2: Classification Result: AUROC and AUPR

Scenario              | AUROC: Logit  Probit  SVMs | AUPR: Logit  Probit  SVMs
Without Sampling      | 0.6578  0.6593  0.4566     | 0.1419  0.1436  0.0570
Random Over-sampling  | 0.6589  0.6584  0.4489     | 0.1413  0.1406  0.0534
Random Under-sampling | 0.6538  0.6532  0.4149     | 0.1358  0.1358  0.0502
SMOTE                 | 0.6522  0.6517  0.4450     | 0.1341  0.1333  0.0526
SMOTE+Tomek           | 0.6478  0.6474  0.4489     | 0.1331  0.1330  0.0533

TABLE 5.3: Coefficient Estimates for the Logit Model

Variable | Without Sampling | Random Over-sampling | Random Under-sampling | SMOTE  | SMOTE+Tomek
EPS      | -3.162 |  0.731 |  0.703 |  1.197 |  0.641
CS       | -0.085 | -0.088 | -0.063 | -0.053 | -0.049
SIZE     |  0.197 |  0.190 |  0.143 |  0.257 |  0.326
LFCF     | -0.303 | -0.310 | -0.265 | -0.404 | -0.463
ROA      | -0.013 | -0.016 | -0.008 | -0.030 | -0.031
CR       |  1.078 |  1.292 |  1.147 |  1.536 |  1.530
D/A      | -0.619 | -0.718 | -0.710 | -0.933 | -0.992
LQD      | -0.531 | -0.567 | -0.542 | -0.670 | -0.744
DFCF     | -0.100 |  0.002 |  0.109 |  0.055 | -0.022
D/E      |  0.388 |  0.391 |  0.358 |  0.423 |  0.444

AUROC and AUPR measure the overall performance of a model across threshold values. Hence, although the AUROC values of the Logit and SVMs models without sampling are 0.6578 and 0.4566 respectively in Table 5.2, this does not by itself mean that the Logit model outperforms the SVMs model. In the case of imbalanced data, the threshold value cannot simply be set to 0.5; we take the fraction of positive instances (0.0625) as the threshold value (Cramer, 1999) and consider accuracy, recall and the 1-specificity rate together before drawing conclusions.
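A sketch of this thresholding step (illustrative; labels and predicted probabilities are simulated, not the thesis data), producing the rates reported in Tables 5.5-5.7:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(3)
y_true = np.zeros(128, dtype=int)
y_true[:8] = 1                       # 8 acquired firms in a 128-firm test set
rng.shuffle(y_true)
p_hat = np.clip(rng.beta(1, 12, size=128) + 0.05 * y_true, 0, 1)  # made-up probabilities

y_pred = (p_hat > 0.0625).astype(int)         # threshold = positive-class fraction
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:     ", (tp + tn) / (tp + tn + fp + fn))
print("recall:       ", tp / (tp + fn))
print("1-specificity:", fp / (fp + tn))
```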

5.2.1 Logit model vs. Probit model vs. SVMs model

In almost all scenarios, the Logit and Probit models perform similarly, whichever performance measure is used: accuracy, 1-specificity or recall rate. On the same sample dataset, comparing the coefficient estimates of the Logit model with those of the Probit model shows that the power of each feature is similar in the two models. For example, without a sampling technique, the coefficient estimates of D/E in the Logit and Probit models are 0.388 and 0.372 respectively (Tables 5.3 and 5.4); they are similar, and checking all the other features shows the same holds for almost all of them. Similar coefficient estimates lead to similar performance; hence the Logit and Probit models share similar results. Tables 5.5-5.7 show that the SVMs model performs better than the Logit model from the standpoint of the total dataset, but worse from the standpoint of the positive class alone, in all scenarios except Random Under-sampling (discussed in the next part). In the other four cases, the accuracy rates of the SVMs model (around 90%) are much higher than those of the Logit model (around 65%). Moreover, the Logit model has a higher 1-specificity rate than the SVMs model (31.63% > 0.02% in the Without Sampling case), which means that the SVMs model predicts the negative class correctly with a higher probability than the Logit model.


TABLE 5.4: Coefficient Estimates for the Probit Model

Variable | Without Sampling | Random Over-sampling | Random Under-sampling | SMOTE  | SMOTE+Tomek
EPS      | -3.444 |  0.737 |  0.698 |  1.222 |  0.647
CS       | -0.080 | -0.089 | -0.062 | -0.052 | -0.048
SIZE     |  0.188 |  0.197 |  0.147 |  0.268 |  0.338
LFCF     | -0.295 | -0.315 | -0.265 | -0.414 | -0.473
ROA      | -0.011 | -0.017 | -0.009 | -0.031 | -0.032
CR       |  1.076 |  1.281 |  1.131 |  1.537 |  1.524
D/A      | -0.606 | -0.741 | -0.716 | -0.971 | -1.026
LQD      | -0.518 | -0.583 | -0.547 | -0.695 | -0.766
DFCF     | -0.063 | -0.023 |  0.087 |  0.031 | -0.042
D/E      |  0.372 |  0.400 |  0.356 |  0.435 |  0.455

TABLE 5.5: Comparison of Accuracy Rate: (TP+TN)/(TP+FP+TN+FN)

Model  | Without Sampling | Random Over-sampling | Random Under-sampling | SMOTE  | SMOTE+Tomek
Logit  | 67.50% | 66.54% | 64.37% | 63.15% | 62.48%
Probit | 66.30% | 66.98% | 64.56% | 63.59% | 62.53%
SVMs   | 93.99% | 93.70% | 60.18% | 90.88% | 88.22%

However, the Logit model has a higher recall rate than the SVMs model (53.99% > 0.58% in the Without Sampling case); that is, if an instance belongs to the positive class, the Logit model predicts it better.

5.2.2 Comparison of Various Sampling Techniques on Models

A comparison of the effectiveness of the various sampling techniques on the models shows only slight differences in the accuracy rate. Since the accuracy rate is sensitive to the data distribution, these small differences cannot tell which sampling technique improves the prediction power of the binary models. Hence, the focus shifts from accuracy to recall and 1-specificity. The recall rate indicates how many acquired firms we predict correctly, while the 1-specificity rate indicates how many non-acquired firms we predict wrongly; it is therefore better to have a higher recall rate and a lower 1-specificity rate (Tables 5.6 and 5.7).

TABLE 5.6: Comparison of 1-Specificity Rate: FP/(FP+TN)

Model  | Without Sampling | Random Over-sampling | Random Under-sampling | SMOTE  | SMOTE+Tomek
Logit  | 31.63% | 32.72% | 35.14% | 36.50% | 37.27%
Probit | 33.04% | 32.24% | 34.90% | 35.99% | 37.20%
SVMs   |  0.02% |  0.34% | 39.28% |  3.66% |  6.69%

TABLE 5.7: Comparison of Recall Rate: TP/(TP+FN)

Model  | Without Sampling | Random Over-sampling | Random Under-sampling | SMOTE  | SMOTE+Tomek
Logit  | 53.99% | 55.00% | 56.73% | 57.69% | 58.56%
Probit | 55.91% | 54.71% | 56.11% | 57.16% | 58.37%
SVMs   |  0.58% |  0.58% | 51.73% |  5.67% |  8.65%


First, random over-sampling and random under-sampling do not, on average, improve correct classification in the Logit, Probit and SVMs models. The recall rate in the Random Over-sampling case (Logit: 55.00%, SVMs: 0.58%) is approximately equal to that in the Without Sampling case (Logit: 53.99%, SVMs: 0.58%), and the 1-specificity rate in the Random Over-sampling case (Logit: 32.72%, SVMs: 0.34%) is close to that in the Without Sampling case (Logit: 31.63%, SVMs: 0.02%). From the perspective of individual tests, Random Over-sampling sometimes gives a better result than Without Sampling, but sometimes not; in other words, the result of random over-sampling is unstable. Random over-sampling does not significantly improve minority-class (positive-class) recognition because of drastic overfitting in the case of a highly skewed dataset. The Random Under-sampling results show no significant difference for the Logit model, but a large one for the SVMs model: on average, the increase of the SVMs recall rate from 0.58% to 51.73% comes with an increase of the 1-specificity rate from 0.02% to 39.28%. The accuracy rate of the SVMs model falls by around 30 percentage points, because random under-sampling, by deleting too many instances, loses the attributes of the negative class when the dataset is too skewed. Hence, the random sampling techniques are not good choices for addressing class imbalance.

Second, as an updated version of random over-sampling, SMOTE yields a higher recall rate and a higher 1-specificity rate. In other words, SMOTE adjusts the performance structure by sacrificing a slight percentage of accuracy on the negative class to gain accuracy on the positive class in all three binary models. Compared to SMOTE, SMOTE+Tomek adjusts the structure further: improved prediction power on the positive class at the cost of prediction power on the negative class. Of these four sampling techniques, the best is SMOTE+Tomek, followed by SMOTE.


Therefore, the development of sampling techniques can improve the performance structure of binary classification on imbalanced datasets. However, the amount of misclassification on the positive class exceeds 30% in all cases. Although acquisition targets have clearly inferior or abnormal performance relative to non-acquired companies, potential acquisition targets remain much harder to identify, since targets' characteristics may not be consistent across companies and across time (Barnes, 1999).


Chapter 6

Conclusion

6.1 Conclusion

In this thesis, we examined the relative efficiency of the Logit, Probit and SVMs models with various sampling techniques in the development of classification models for predicting US healthcare acquisition targets on an imbalanced dataset.

We selected ten financial variables to reflect the financial characteristics of the healthcare industry: liquidity, size, profitability and capital structure in both the long and short term. Our main findings are: (1) Based on the analysis of the healthcare financial data, DFCF and D/E are two important features to focus on. (2) Based on AUROC and AUPR, all sampling-technique cases perform similarly, since AUROC and AUPR measure the performance of a model over all threshold values; for an imbalanced dataset the threshold is not 0.5 but related to the ratio of the positive class to the negative class. (3) Whatever sampling technique is applied, the performance of the Logit model is always similar to that of the Probit model. (4) The accuracy rate of the SVMs model (about 90%) is much higher than that of the Logit model (about 65%); the Logit model predicts True Positives better, while the SVMs model predicts True Negatives better. (5) SMOTE+Tomek has the best performance structure. Random over-sampling and random under-sampling do not, on average, have a positive effect on the prediction power of the models, since the random methods confuse the models through overfitting or loss of data. SMOTE adjusts the performance structure by sacrificing a slight percentage of accuracy on the negative class to gain accuracy on the positive class. As an updated version, SMOTE+Tomek works better than SMOTE, since it combines an under-sampling algorithm with SMOTE. Therefore, the development of sampling techniques can improve the performance structure of the Logit, Probit and SVMs models in the case of an imbalanced dataset.


6.2 Future Work

Future research could go in at least three directions. The first is to add nonfinancial variables to the models, such as patents and star products of the firms. The second is research on potential buyers: pair buyers with sellers and then estimate the success rate of an acquisition, because the synergy effects for both parties are a major benefit of acquisition that cannot be ignored. The third could involve the inclusion of additional nominal features, such as company structure type.


Appendix A: Data Processing

To keep the coefficients from being too small (which is bad for comparison), the variables are transformed by the following functions:

• EPS, ROA, LFCF, DFCF, SIZE, LQD, D/A and CS:
$$f(x_i) = \operatorname{sign}(x_i) \cdot \log(|x_i + 1|) \tag{A.1}$$

• D/E:
$$f(x_i) = \operatorname{sign}(x_i) \cdot \log(|x_i + 1|) \tag{A.2}$$
$$g(x_i) = \operatorname{sign}(f(x_i)) \cdot \log(|f(x_i)|) \tag{A.3}$$

All results of calculation, estimation and prediction in this thesis are based on the data after this processing.
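A minimal sketch of the transforms (A.1)-(A.3) (the sample values are made up):

```python
import numpy as np

def f(x):
    # Signed log transform of eq. (A.1)/(A.2); note it is undefined at x = -1.
    return np.sign(x) * np.log(np.abs(x + 1))

def g(x):
    # Second pass of eq. (A.3), applied to D/E only; requires f(x) != 0.
    fx = f(x)
    return np.sign(fx) * np.log(np.abs(fx))

print(f(np.array([-2.5, 0.0, 0.4, 3.0])))
print(g(np.array([0.4, 3.0])))   # D/E values are transformed twice
```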


Bibliography

1. Akbani, R., Kwek, S. & Japkowicz, N. Applying support vector machines to imbalanced datasets in Proceedings of the 15th International Conference on Machine Learning (Springer, Pisa, Italy, 2004), 816–823.

2. Barnes, P. Predicting UK takeover targets: Some methodological issues and an empirical study. Review of Quantitative Finance and Accounting 12, 283–301 (3 1999).

3. Bartley, J. W. & Boardman, C. M. The relevance of inflation adjusted accounting data to the prediction of corporate takeovers. Journal of Business Finance and Accounting 17, 53–72 (1 1990).

4. Batista, G. E. A. P. A., Bazzan, A. L. C. & Monard, M. C. Balancing Training Data for Automated Annotation of Keywords: a Case Study in WOB (2003), 35–43.

5. Batuwita, R. & Palade, V. A new performance measure for class imbalance learning: Application to bioinformatics problems in ICMLA (IEEE, Miami Beach, FL, USA, 2006), 545–550.

6. Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 1145–1159 (7 1997).

7. Cardenas, A. A. & Baras, J. S. B-ROC Curves for the Assessment of Classifiers over Imbalanced Data Sets in American Association for Artificial Intelligence (ACM, 2006), 1581–1584.

8. Caruana, R. & Niculescu-Mizil, A. Data mining in metric space: An empirical analysis of supervised learning performance criteria in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, 2004), 69–78.

9. Chang, C.-C. & Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 27, 27:1–27:27 (3 Apr. 2011).

10. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002).

11. Chawla, N., Lazarevic, A., Hall, L. & Bowyer, K. SMOTEBoost: Improving Prediction of the Minority Class in Boosting. Knowledge Discovery in Databases: PKDD 2838, 107–119 (3 2003).


12. Cheh, J. J., Weinberg, R. S. & Yook, K. C. An Application of An Artificial Neural Network Investment System To Predict Takeover Targets. Journal of Applied Business Research 15, 33–45 (4 1999).

13. Cohen, G., Hilario, M., Sax, H., Hugonnet, S. & Geissbuhler, A. Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine 37, 7–18 (1 2006).

14. Cohen, W. W. Fast Effective Rule Induction in Proceedings of the 12th International Conference on Machine Learning (Morgan Kaufmann, Lake Tahoe, CA, 1995), 115–123.

15. Cramer, J. Predictive performance of the binary logit model in unbalanced samples. The Statistician 48, 85–94 (1 1999).

16. Cristianini, N. & Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (Cambridge University Press, UK, 2000).

17. Davis, J. & Goadrich, M. The relationship between Precision-Recall and ROC curves in Proceedings of the 23rd International Conference on Machine Learning (ICML) (ACM, New York, NY, USA, 2006), 233–240.

18. Dumais, S., Platt, J. & Heckerman, D. Inductive Learning Algorithms and Representations for Text Categorization in Proceedings of the 7th International Conference on Information and Knowledge Management (1998), 148–155.

19. Ezawa, K. J., Singh, M. & Norton, S. W. Learning Goal Oriented Bayesian Networks for Telecommunications Risk Management in Proceedings of the International Conference on Machine Learning, ICML-96 (Morgan Kaufmann, Bari, Italy, 1996), 139–147.

20. Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters 27, 861–874 (8 2006).

21. Fawcett, T. & Flach, P. A. A Response to Webb and Ting's On the Application of ROC Analysis to Predict Classification Performance under Varying Class Distributions. Machine Learning 58, 33–38 (1 2005).

22. Fawcett, T. & Provost, F. Adaptive Fraud Detection. Data Mining and Knowledge Discovery 1, 291–316 (3 1997).

23. Fawcett, T., Provost, F. & Kohavi, R. The case against accuracy estimation for comparing induction algorithms in Proceedings of the 15th International Conference on Machine Learning (ICML-98) (Madison, WI, 1998).

24. Ferri, C., Hernandez-Orallo, J. & Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognition Letters 30, 27–38 (2009).

25. Garcia, V., Mollineda, R. A. & Sanchez, J. S. Theoretical analysis of a performance measure for imbalanced data in Proceedings of the 20th International Conference on Pattern Recognition (ICPR) (IEEE, Istanbul, Turkey, 2010), 617–620.

26. Guo, H. & Viktor, H. L. Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach. ACM SIGKDD Explorations Newsletter 6, 30–39 (1 2004).

27. Ha, T. M. & Bunke, H. Off-Line, Handwritten Numeral Recognition by Perturbation Method. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 535–539 (5 1997).

28. Hand, D. J. & Till, R. J. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning 45, 171–186 (2001).

29. Hardin, J. W. & Hilbe, J. M. Generalized Linear Models and Extensions 2nd ed., 141–159 (Stata Press, Texas, US, 2007).

30. Hart, P. E. The condensed nearest neighbor rule. IEEE Transactions on Information Theory 14, 515–516 (3 1968).

31. He, H. & Garcia, E. A. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21, 1263–1284 (9 2009).

32. Heij, C., de Boer, P., Franses, P. H., Kloek, T. & van Dijk, H. K. Econometric Methods with Applications in Business and Economics 438–462. ISBN: 0-19-926801-0 (Oxford University Press, NY, 2004).

33. Hsu, C.-W., Chang, C.-C. & Lin, C.-J. A Practical Guide to Support Vector Classification. <http://www.csie.ntu.edu.tw/~cjlin> (2010).

34. Huang, W., Nakamori, Y. & Wang, S.-Y. Forecasting stock market movement direction with support vector machine. Computers and Operations Research 32, 2513–2522 (10 2005).

35. Japkowicz, N. & Shah, M. Evaluating Learning Algorithms: A Classification Perspective (Cambridge University Press, New York, USA, 2011).

36. Japkowicz, N. The Class Imbalance Problem: Significance and Strategies in Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI) (CiteSeer, 2000). <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.1693>.


37. Joshi, M. V., Kumar, V. & Agarwal, R. C. Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements in Proceedings of the IEEE International Conference on Data Mining (San Jose, CA, 2001), 257–264. ISBN: 0-7695-1119-8.

38. Kubat, M. & Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection in Machine Learning: International Workshop then Conference (Morgan Kaufmann, Nashville, TN, USA, 1997), 179–186.

39. Kubat, M., Holte, R. C. & Matwin, S. Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning 30, 195–215 (2 Feb. 1998).

40. Landgrebe, T. C. W., Paclik, P. & Duin, R. P. W. Precision-recall operating characteristic (P-ROC) curves in imprecise environments in Proceedings of the 18th International Conference on Pattern Recognition (ICPR) 4 (IEEE, Hong Kong, 2006), 123–127.

41. Lewis, D. D. & Catlett, J. Heterogeneous Uncertainty Sampling for Supervised Learning in Proceedings of the Eleventh International Conference on Machine Learning (Morgan Kaufmann, San Francisco, CA, 1994), 148–156.

42. Maloof, M. A. Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown in Workshop on Learning from Imbalanced Data Sets II, ICML (Washington, DC, 2003).

43. Kubat, M., Holte, R. C. & Matwin, S. Machine Learning for the Detection of Oil Spills in Satellite Radar Images in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (AAAI, Portland, OR, 1996), 8–13.

44. Mladenic, D. & Grobelnik, M. Feature Selection for Unbalanced Class Distribution and Naive Bayes in Proceedings of the 16th International Conference on Machine Learning (Morgan Kaufmann, 1999), 258–267.

45. Pasiouras, F., Gaganis, C. & Zopounidis, C. Multicriteria classification models for the identification of targets and acquirers in the Asian banking sector. European Journal of Operational Research 204, 328–335 (2 2010).

46. Pasiouras, F., Tanna, S. & Zopounidis, C. The identification of acquisition targets in the EU banking industry: An application of multicriteria approaches. International Review of Financial Analysis 16, 262–281 (3 2007).

47. Powell, R. G. Modelling takeover likelihood. Journal of Business Finance and Accounting 24, 1009–1030 (7-8 1997).

48. Powell, R. G. Takeover Prediction and Portfolio Performance: A Note. Journal of Business Finance and Accounting 28, 993–1011 (7-8 2001).

49. Radivojac, P., Chawla, N. V., Dunker, A. K. & Obradovic, Z. Classification and Knowledge Discovery in Protein Databases. Journal of Biomedical Informatics 37, 224–239 (4 Aug. 2004).

50. Ranawana, R. & Palade, V. Optimized precision: A new measure for classifier performance evaluation in Proceedings of the IEEE Congress on Evolutionary Computation (IEEE, Vancouver, BC, 2006), 2254–2261.

51. Singla, P. & Domingos, P. Discriminative Training of Markov Logic Networks in American Association for Artificial Intelligence (ACM, 2005), 868–873.

52. Slowinski, R., Zopounidis, C. & Dimitras, A. I. Prediction of company acquisition in Greece by means of the rough set approach. European Journal of Operational Research 100, 1–15 (1 1997).

53. Sun, Y., Kamel, M., Wong, A. & Wang, Y. Cost-Sensitive Boosting for Classification of Imbalanced Data. Pattern Recognition 40, 3358–3378 (12 2007).

54. Tay, F. E. H. & Cao, L. J. Modified support vector machines in financial time series forecasting. Neurocomputing 48, 847–861 (2002).

55. Tomek, I. An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics 6, 448–452 (6 1976).

56. Vapnik, V. N. The Nature of Statistical Learning Theory (Springer, New York, USA, 1995).

57. Veropoulos, K., Campbell, C. & Cristianini, N. Controlling the sensitivity of support vector machines in Proceedings of the International Joint Conference on Artificial Intelligence (Stockholm, Sweden, 1999), 55–60.

58. Webb, G. I. & Ting, K. M. On the Application of ROC Analysis to Predict Classification Performance Under Varying Class Distributions. Machine Learning 58, 25–32 (1 2005).

59. Weiss, G. M. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6, 7–19 (1 2004).

60. Wu, C.-H., Tzeng, G.-H., Goo, Y.-J. & Fang, W.-C. A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy. Expert Systems With Applications 32, 397–408 (2 2007).

61. Wu, G. & Chang, E. Adaptive feature-space conformal transformation for imbalanced-data learning in Proceedings of the 20th International Conference on Machine Learning (IEEE, Washington, DC, 2003), 816–823.
