PREDICT THE PROBABILITY OF A DEAL USING A

MULTI-SOURCE DATA-DRIVEN QUALIFYING PROCESS

RESEARCH THESIS

The Amsterdam University of Applied Sciences and the economic environment,

MSc in Business Studies

Supervisor:

dr. Ed Peelen

Second reader:

dr. Umut Konus

University of Amsterdam

by

Serge da Fonseca Pereira de Sousa

Student Number: 10475397


Amsterdam, March 2015

Statement of originality

This document is written by Serge da Fonseca Pereira de Sousa, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Contents

Abstract ... 5

1. Introduction ... 6

2. Theoretical Background ... 10

Sales improvements, lead qualification and forecasting techniques ... 10

Overview of recent research on data mining and knowledge discovery ... 13

Machine learning algorithms ... 16

How to benefit from gathered data ... 18

How data-driven companies perform ... 18

3. Methods ... 20

Data set ... 21

Procedure ... 24

4. Analysis and Results ... 26

Pre-processing steps ... 26

Analysis ... 27

Correlation Matrix ... 27

Logistic regression using SPSS ... 28

Classification Model ... 31

J48 Decision Tree using Weka ... 32

Alternating Decision Tree using Weka ... 34

Comparison ... 37

5. Discussion ... 39

Theoretical and practical implications ... 39

Limitations and Future Research ... 41

6. Conclusion ... 44

References ... 46

List of tables and figures

Tables

Table 1: Overview of data mining algorithms ... 17

Table 2: Deal vs No Deal per variable (Weka) ... 23

Table 3: Classification Table ... 29

Table 4: Log-2 Likelihood ... 29

Table 5: Omnibus Tests of Model Coefficients ... 29

Table 6: Collinearity Statistics ... 30

Table 7: R Square ... 30

Table 8: Accuracy comparison between J48 and ADTree ... 37

Table 9: Overall comparison among the three methods used ... 37

Figures

Figure 1: A basic understanding of knowledge discovery (Sahay & Ranjan 2008) ... 13

Figure 2: Design of this research ... 20

Figure 3: Process overview of Knowledge discovery process (Usama, Gregory, Padhraic 1996) ... 24

Figure 4: Method procedure ... 25

Figure 5: Classified instances per data source in Weka using J48 Decision Tree ... 32

Figure 6: Decision tree when combining all data sources ... 32

Figure 7: Feature selection classification result ... 33

Figure 8: J48 Decision Tree after using Feature Selection ... 34

Figure 9: Classified instances per data source in Weka using ADTree ... 35

Abstract

For a salesperson it is difficult to predict, early in the sales funnel, which leads are worth the time. This paper explores how to predict which type of clients will lead to a deal. The relation between the characteristics of clients and the probability of a deal is analysed within a midsized sales organisation that specializes in expatriate rental housing. In this study, multiple algorithms (i.e. Logistic Regression, J48 Decision Tree and ADTree) and data sources (i.e. CRM database, Google Analytics and LinkedIn data) are tested. The chosen data sources are generally available but not often used in combination. By combining them, a data-driven lead qualifier model is proposed that unlocks these data sources by exploiting them with the ADTree algorithm. By using actual data from a mid-sized organization and only low-cost resources such as open source tools (i.e. Weka) and freely available algorithms, a case is made which could be implemented by similar sales-focused organizations. This may increase the efficiency of their marketing and sales departments by correctly prioritizing leads and speeding up lead follow-up. An added value of this research can be found in the findings on which variables contribute most when qualifying a possible lead with a high probability of a deal. Variables found to be highly significant predictors of a deal are contact time (follow-up time), budget, language, website sessions and B2B or B2C.

Subject:

Create a lead classification model for a mid-sized organization, specialized in expatriate rental housing.

Keywords:

Data-Driven Decisions, Conversion Rate, Data Mining, Knowledge Discovery in Databases, Machine Learning, Merging Data, Predictive Analytics, Predicting Behaviours, Qualifying Model, Small Data, Unstructured Data.

1. Introduction

“Half the money I spend on advertising is wasted; the trouble is I don’t know which half” is a famous saying credited to both John Wanamaker (1838-1922) and Lord Leverhulme (1851-1925). However, for many sales-related companies it might just as well be: “Half my time is wasted; the trouble is I don’t know which half”. As a salesperson, it is hard to predict, early in the sales funnel, which leads are worthwhile in terms of time and correspond to past successful customers. Nevertheless, this knowledge becomes more and more important as consumers get smarter and digital developments put competitors one click away (Ooi & Pundurasi 2013). Especially in intermediary companies where the key activity is to match a potential candidate to a potential product – bringing supply and demand together in order to complete a ‘deal’ – this service can be a time-intensive activity where the main cost is often salary; in other words, the cost of time spent. Therefore, correctly predicting which lead is, for example, worth calling first is of crucial (economic) use, since a speedy follow-up of a lead is very important, as shown by research done by Velocify (2010). By predicting which lead should have priority in follow-up, time management can be improved and sales processes can be made more efficient (Ooi & Pundurasi 2013). Many organizations struggle with time-management problems when dealing with potential clients. An overview on the subject of time management by Claessens, Van Eerde, Rutte and Roe (2007) shows that focus on high-priority tasks increases performance. Managing leads correctly is crucial in increasing conversion rates. One way to manage leads is to prioritize the time spent on a lead or to choose wisely which lead to act on, as one lead can have a much higher potential than another. Often this selection process is managed by the subjective judgment of a sales agent, as the function of sales is often considered an art based on the intuition of the sales representative (Oechsli 2010). However, with more data gathered from successful and unsuccessful clients, using real data and analytical tools, an insight might be developed on which to base the process of selecting which client to prioritize, thereby making a data-driven decision and achieving a higher conversion rate with less sensitivity to human error (Court, Perrey, McGuire, Gordon & Spillecke 2013). “Data-driven decisions are better decisions…to decide on the basis of evidence rather than intuition” (McAfee & Brynjolfsson 2012).

Brian Carroll, chief executive officer of InTouch, asserts that "leads ignored by sales reps make up about 77% of potential sales lost by the firm" (Sabnis, Chatterjee, Grewal & Lilien 2013), and the Institute for the Study of Business Markets has recently held two conferences on ways to encourage marketing and sales staff to communicate better. These examples suggest that while sales and marketing practitioners are concerned about the issue studied in this research, the academic community has largely ignored it. Sabnis, Chatterjee, Grewal and Lilien (2013) took a first step at addressing this issue and concluded that "Firms can motivate sales reps to follow up on marketing leads by requiring marketing departments to prequalify leads".

This research studies which indicators, variables and methods can predict the quality of a client, using multiple sources to classify the right customer and fit early in the sales funnel and so improve sales effectiveness. To achieve this, it is important to identify which information or attributes about a lead are needed to accurately classify a lead with a high probability of a deal. This is done by using and testing multiple available data sources, such as a company's Customer Relationship Management (CRM) database, Google Analytics and LinkedIn data. The data used do not have great volume, which is one of the three characteristics that differentiate Big Data from traditional analytics (McAfee & Brynjolfsson 2012). However, they do have the other two characteristics: a lot of variety in terms of the number of variables, and velocity in terms of being generated in real time. This research will test whether merging these sources increases the accuracy of prediction. Theory suggests that analysis is more effective when data are available from multiple sources (Kohavi, Rothleder & Simoudis 2002), although no empirical evidence is given.

The midsized organization that is the subject of this study is a rental agent, Perfect Housing, which specializes in expatriate housing in the three largest Dutch cities: Amsterdam, Rotterdam and The Hague. Data from this company are used to create a case in order to find evidence that a qualifying system can be of economic use by predicting the quality of a client. One of the primary activities of a rental agent is to match a client's needs with the variety of houses offered by homeowners, at the right time, in the right place, to the right person and with the right message (Barton & Court 2012). The management and control of the rental process, with its staged, funnel-like structure, has thus far been neglected in empirical research even though an effective qualifying process is crucial for economic success (Kohavi, Rothleder & Simoudis 2002). In addition to profitability, the main goal of a rental agency is to match a rental client to a house fitting their needs and budget, while maintaining a pipeline process that ensures the pipeline is always filled with opportunities and results in a constant stream of rental deals over time. The rental deals for a rental agent focusing on expatriates – foreign business professionals moving abroad – are characterized by complexity, uniqueness, and discontinuity. This report assesses whether a lead qualifier with a data-driven component is applicable and might be utilized by sales management to monitor and increase sales, and it also evaluates whether a sophisticated data-driven qualifying or predicting model contributes to enhanced economic success and enables the rental agent to prioritize his or her lead stream.

Analytics must be considered a requirement for all organizations, especially those active on the internet, gathering data in a market with an increasing focus on cost-effectiveness. The subject organization has access to data from numerous sources, such as website analytics, CRM data (lead streams, client data and property data) and social media data like LinkedIn. However, these data are relatively small in volume, especially in comparison to the data of large companies, which have access to exabytes of data (McAfee & Brynjolfsson 2012). Some data, like CRM and website analytics, have a high velocity since they are updated in real time. Furthermore, they exhibit a high variety because they account for many different variables, which enables the data to have a practical use. One of the high-volume data sources to which the company has access is data provided by Google, such as online search volumes. This data source is already used to anticipate which keywords the specific target groups use and to match these to the relevant marketing message in order to reach a higher return on investment (ROI) in advertising. The company in this study uses a mix of data sources, some of which may not be as large in quantity as those some large organizations have stored in their databases, but due to the high variety and velocity, the data sources can still be used in the way Analytics 2.0 is intended (Nichols 2013), especially in combination with access to large databases from third-party sources like Google and LinkedIn.

This research attempts to contribute to the sales theory and data mining practice by creating a model that qualifies leads and marks them as qualified to get the product or service offered. Instead of conducting this qualifying process subjectively and depending on the experience of the sales agent, this research focuses on creating a data-driven model by comparing patterns of past customers with characteristics of potential new customers. The goal is to find a framework that is not only of use for the rental market but also for sales-focused organizations, by creating a method for developing a qualifying model based on past data.

First, an attempt is made to identify which information or attributes are needed from a lead to be able to accurately predict the quality of a client with a potential deal, and which variables correlate most with a deal and therefore contribute to the knowledge of a sales agent. Next, as the company has access to different data sources, the author was also interested in which sources or combination of sources contribute most to predicting a successful client. Since Perfect Housing is a mid-sized company with data sources most similarly sized companies also have access to, it can be of practical use to know what a similar company can use to improve its sales performance. Finally, several methods were tested in order to determine which makes the best use of a relatively small data set. As this research made use of real data containing 26 variables from 3 different sources and 17,000 records, it also had to deal with the accompanying challenges: the methods used should be able to handle missing data and unbalanced data sets and prevent inaccurate classifications.

First, using the Statistical Package for the Social Sciences (SPSS), different statistical methods were tested in order to find the most informative attributes. A Pearson correlation matrix was created in order to discover which variables were positively correlated, after which a regression analysis was used to create a first predictive model. Then this research tested a different method using a data mining tool called Weka, in order to build decision trees that can be used to quickly and efficiently exploit different data sets and sources (Drazin & Montag 2012) and thereby predict a qualitative client and decide which lead to prioritise. Both tools are low-cost or even free to use, which makes them accessible to small and mid-sized companies. This research compares the different methods and discusses the benefits and disadvantages of each.

2. Theoretical Background

This research examines the influence of data mining usage in medium-sized organizations to qualify sales leads and predict the chance of a deal. The most relevant findings from the current literature about lead qualification, current qualifying tactics, knowledge discovery in databases, data mining and its techniques will be discussed below.

Sales improvements, lead qualification and forecasting techniques

Rick Page, founder of the organization The Complex Sale, sponsored a survey of over 250 companies on the initiatives they were planning to undertake to improve sales effectiveness. Improving sales processes was the most frequent response (40.9%) (Page 2008), meaning that sales processes can still benefit from improvement. An example would be using forecasting techniques, which is Bosworth's primary advice, to enable sales representatives and managers to 'predict the future' by qualifying and forecasting accurately in order to boost sales agents' efficiency and speed up sales processes.

Salespeople mainly prefer sales leads that are receptive and have an overt desire and preference for the seller's products, and which are therefore easy to close (Jolson 1988). Salespeople often argue that many marketing-generated leads lack potential to become a deal, so they focus first on what they perceive, from experience, to be a qualified lead. Firms, however, can motivate their sales teams to follow up on marketing leads by requiring marketing departments to prequalify leads (Sabnis, Chatterjee, Grewal & Lilien 2013). This could reduce the lack of follow-up on the leads that the marketing staff generates (Churchill & Walker 2003; Zoltners, Sinha & Lorimer 2009), which is a commonly occurring problem between sales and marketing teams. Fast follow-up is very important, as shown by research done by Velocify (2010), which used data from millions of calls and claimed that following up on leads within minutes increases the conversion rate exponentially. This importance is also supported by data and research from Software Advice.

A lead or sales lead is defined in the dictionary as 'the identification of a person or entity that has the interest and authority to purchase a product or service'. A qualified lead, however, is more than a record containing contact details showing the person's or entity's interest. It should include information on the person or the company they represent and information that qualifies them for a certain product. In the following section the current practices on how to qualify a lead are stated. Marketers and sales agents primarily classify well-qualified leads as "tight leads" and less-qualified leads as "loose leads" (Jolson 1988). Salespeople most often prefer the first, as these have higher closure rates and a higher chance of conversion.

Kirby (2012) mentions a few leading sales process-selling methodologies, one of which is the BANT approach, which IBM has been using for many years to qualify opportunities. BANT refers to budget, authority, need and timeframe (Kirby 2012). BANT asks the following questions in order to classify a lead: What are the needs or conditions that have to exist before your product or service would be valuable to a potential customer? How much money must be available in the customer's budget to buy your product or service? Does this person have the authority, or can they get approval, to make a buying decision? Do they have a specific time when they wish to make their purchase? Depending on the (sales) criteria, if answers to the above-mentioned questions are negative, the lead is qualified as uninteresting and the sales representatives will have little incentive to pursue the lead. There are other process-based lead qualifying techniques that use similar qualification frameworks; examples are Target Account Selling developed by the TAS group and STRATEGY-Battleplan™. They use questions like 'Why buy? What to buy? When to buy? Who buys? How to buy? Can we compete? Can we win? Is it worth winning?' (Kirby 2012). This requires intense internal and consumer analysis, as it requires the organization to know exactly who its clients are, what their requirements and needs should be and how the clients behave before converting. This analysis is often performed by managers within the organization with the help of customer surveys, and it can take a lot of person-hours to create a complete picture; moreover, subjectivity is not necessarily excluded from its end result.
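For illustration only, the sketch below shows how such a rule-based BANT filter might be expressed in code. It is not part of the thesis or of any vendor tool; the thresholds and field names (MIN_BUDGET, weeksToPurchase and so on) are hypothetical assumptions.

```java
// Illustrative only: a hypothetical rule-based BANT check, not taken from the thesis.
// Thresholds and parameter names are assumptions.
public class BantQualifier {

    static final double MIN_BUDGET = 1000.0;   // assumed minimum monthly budget
    static final int MAX_WEEKS = 12;           // assumed acceptable timeframe

    static boolean isQualified(double budget, boolean hasAuthority,
                               boolean hasNeed, int weeksToPurchase) {
        // A lead is qualified only if all four BANT criteria are met.
        return budget >= MIN_BUDGET
                && hasAuthority
                && hasNeed
                && weeksToPurchase <= MAX_WEEKS;
    }

    public static void main(String[] args) {
        // Example lead: sufficient budget, decision maker, concrete need, moving within 8 weeks.
        System.out.println(isQualified(1500.0, true, true, 8));   // true
        System.out.println(isQualified(800.0, true, true, 8));    // false: budget too low
    }
}
```

The contrast with the data-driven approach explored later is that such thresholds are set by hand rather than learned from past deals.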

Multiple tools can be found online using BANT, TAS group and STRATEGY-Battleplan™ frameworks; however, they do not give empirical evidence on their websites or in peer-reviewed journals. One example of an online tool is Marketo, which claims the secret to a high-performance revenue engine is the effective use of the sales development team. The team has one focus: to review, contact and qualify marketing-generated leads and deliver them to sales account executives, however no scientific proof is given of its qualifying ability. Other examples which use both sales intelligence and data mining to qualify leads include CallidusCloud Marketing Automation (http://www.leadformix.com), which claims to be using a lead scoring model derived from browsing behaviour in order to qualify its leads, but again no research data is given.


Leadformix uses a technique by which they claim to create a data-driven lead-scoring model. They have based their model on BANT but have called it PIEDIO, which, they quote, goes one step further than BANT. PIEDIO uses data variables grouped by "Purchase Readiness", "Degree of Intent of Prospects", "Engagement" (e.g. time spent, recency), "Demographic Factors" and any purchase barriers defined by sales (e.g. revenue and other demographic data such as title), "Identity of individual contact or prospect established", which is able to attribute behaviour to an individual decision maker, and "Off-site & Off-line Activities". In the Leadformix scoring algorithm, its relationship to BANT is explained. "Budget" from BANT translates as "Revenue" of the prospect (a demographic parameter defined by sales) that indicates the scope of budget allocation in the immediate or near future. "Authority" is implicitly addressed by decision makers' titles and seniority, a necessary inclusion in the demographic parameters defined by sales. "Need" can be established based on the level of "Engagement" and degree of "Intent" revealed by the prospect. "Timeline" is purported to be established based on "Purchase Readiness" or decision stage. The name "PIEDIO" in itself does not have a theoretical foundation except that it was created with the BANT technique in mind, after which Leadformix is said to have further developed its model.

A recent survey done by J.J. Beentjes (2015) concludes that 84.9% of the participants who are active in sales mention that data is important or very important to their decision-making. Other than using these questions to qualify, these participants do not deploy data mining techniques but rather use them as training methods to streamline sales managers and teams.

Overview of recent research on data mining and knowledge discovery

In another non-peer-reviewed study (Vernon, n.d.), prioritisation is given as the main reason for qualifying leads; the study claims that prioritizing will boost performance, improve dealer satisfaction with the leads they receive, increase closing rates by 24% and gain greater ROI on lead management efforts. Vernon says they gained this insight using their customers' data. Their customers are mainly sales organizations. They propose using data mining and statistical modelling as a way forward for qualifying, instead of relying on modeller intuition.

Figure 1: A basic understanding of knowledge discovery (Sahay & Ranjan 2008).

Data mining (also known as Knowledge Discovery in Databases) is the process of analysing data from different perspectives and summarizing it into useful information; it enables data exploration, data analysis, and data visualization of sometimes huge databases at a high level of abstraction, without a specific hypothesis in mind (Sahay & Ranjan 2008). For a basic understanding see Figure 1, created by Sahay & Ranjan (2008), which explains the necessary steps: data collection; storage of these data; insight, such as asking the right queries or using the correct algorithms; and the use of tools to efficiently take advantage of the insights. The information can be used to understand customer behaviour and patterns in order to increase revenue, cut costs, and preferably both, for the gain of profit and other types of value. Data mining techniques are the results of a long process of research and include decision trees, artificial neural networks and genetic algorithms (Lakshmi & Raghunandhan 2011). These data sets often contain unstructured data, which is defined as "information that either does not have a pre-defined data model and/or is not organized in a pre-defined manner" (Lakshmi & Raghunandhan 2011).

Rapidly and within a short period of time, a large variety of articles about new forms of data mining and knowledge discovery have appeared in both academic and management publications, including Davenport and Patil (2012); LaValle, Lesser, Shockley, Hopkins and Kruschwitz (2011); McAfee and Brynjolfsson (2012); and Nichols (2013). The term 'Big Data' is often described in another form but is similarly used to describe the process of analysing large, unstructured sets of data to discover information (Davenport, 2012). This is one of the biggest buzzwords in the technology industry (Business Insider, 2014), and a marketing forecast conducted by ABI Research has shown that global spending by organisations on Big Data in 2013 exceeded $31 billion. It further predicts that spending on Big Data and Data Mining will grow at a compound annual rate of 29.6% over the next five years, reaching $114 billion in 2018. The forecast includes the money spent on internal salaries, professional services, technology services, internal hardware and internal software. An innovative example is Airbnb, a company whose prime business strategy is using unstructured data to match their clients (tourists) and products (holiday houses) to each other; Airbnb is reported by the Financial Times and the Wall Street Journal to be worth over EUR 13 billion after only six years of existence.

Analytics have been part of companies for quite a few years. Davenport in 2006 claimed that companies who collect, analyse and act on data perform better than companies that do not. With the right people, culture, focus and technology, a company can make the most out of its data by acting on knowledge gained through the data. The Analytics 1.0 measurement approaches look back a few times each year to correlate sales with several variables. Technology has evolved and has increased the volume of data, the number of variables, the way data is used and how fast it is collected (often in real time). This next step is referred to as unstructured data or Big Data.

This new data movement, like analytics before it, seeks to glean intelligence from a large variety of data and convert this into a business advantage. Nevertheless, there are three key differences in this new form of data storage: velocity, variety and, hence the name, volume. The velocity of the data should be real time: not historical data that is a year old, but data that is current. The second difference from traditional analytics, according to McAfee and Brynjolfsson (2012), is variety; the number of measured components and factors has increased to obtain a better understanding of the entire picture. Lastly, the volume of data is increasingly becoming larger; some companies now have access to exabytes of data, which is why it is referred to as 'Big Data'. Gordon, Perrey, and Spillecke (2012) note that 'Bigger data is not always better data'. As the famous statistician Nate Silver put it: "Every day, three times per second, we produce the equivalent of the amount of data that the Library of Congress has in its entire print collection. Most of it is irrelevant noise. So unless you have good techniques for filtering and processing the information, you're going to get into trouble".

Smaller sets of data obtained by smaller sized companies can also have a high velocity and variety (Sahay & Ranjan, 2008). Small and medium-sized enterprises (SMEs) make a great contribution to the employment rate and GDP in every economy. With the internet and emerging information technologies removing the barriers for medium and small enterprises, involvement of SMEs in the information economy grows and is now no longer exclusively for large enterprises (Steyerberg, Eijkemans, Harrell & Habbema, 2000).

Historically, those with money (to build the technology and make use of the right techniques) could produce a different type of research than those without the means or access to the way (raw) data is generated. Researchers without access can neither reproduce nor evaluate the methodological claims of those who have privileged access, and those without the means do not have ways to translate the data into the correct conclusions. However, the right means are becoming more accessible, as free and open source software such as SAS University Edition and Weka is available for general use (Hall 2009). Data is easier to generate with free tools offering both real-time and historical data, such as Google Analytics and Social Media Insights. These tools make it easier for smaller companies to use and benefit from data-driven decision making.

Machine learning algorithms

There are many machine learning algorithms, each having its own unique combination of abilities, attributes and methods. Different algorithms can often answer the same question with different prediction rates, error margins and precision rates (precision is also called positive predictive value; high precision means that an algorithm returned substantially more relevant results than irrelevant ones). However, there is no single algorithm that is better than all others for all problems. One of the most-used methods of selecting the best algorithm is a "spot-checking method", which tests multiple algorithms and parameter settings on a dataset and evaluates their prediction rates, error margins and precision rates on a similar evaluation basis (Brownlee, 2014). Three key benefits of spot-checking algorithms on machine learning problems are speed, objectivity and results (Brownlee, 2014). Speed is achieved by systematically checking a number of algorithms using the same standards. By using the same standards, an objective evaluation of the algorithms is made. The spot-checking approach focuses on the goal you want to achieve, so results can immediately show whether you can move forward and optimize a given model or whether you need to revisit the presentation of the problem or dataset.
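As a minimal sketch of what such a spot-check could look like with the tools used later in this thesis, the Java snippet below evaluates a few Weka classifiers under identical ten-fold cross-validation. The file name leads.arff and the assumption that the class attribute is the last column are illustrative, not taken from the thesis.

```java
// Minimal spot-checking sketch in Weka (Java), assuming the pre-processed lead data has
// been exported to leads.arff with the class ("deal vs no deal") as the last attribute.
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class SpotCheck {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("leads.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Candidate algorithms evaluated under identical conditions (10-fold CV, same seed).
        Classifier[] candidates = { new J48(), new NaiveBayes(), new Logistic() };
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s accuracy: %.2f%%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```

Because every candidate is scored on the same folds and the same seed, the comparison stays objective, which is exactly the point Brownlee (2014) makes.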

The Institute of Electrical and Electronics Engineers (IEEE) International Conference on Data Mining (ICDM, http://www.cs.uvm.edu/~icdm/) identified the top 10 algorithms in data mining regarding bagging and boosting, association analysis, classification, integrated mining, statistical learning, sequential patterns, clustering, rough sets and graph mining. The top 10 algorithms were C4.5, AdaBoost, Naive Bayes, k-Means, EM, Apriori, SVM, kNN, PageRank, and CART. A short description of these algorithms is provided by Wu et al. (2008), and another literature overview with more in-depth sources is given in Table 1.

Only the algorithms in this overview that are relevant to this research will be further explored. The algorithms should be able to classify leads and predict a deal, as well as handle missing data and different types of variables (i.e. numeric, nominal and binary). The algorithm variants of decision trees are well equipped to handle these criteria. If one performs a decision tree analysis, the result can be both a classification and a prediction. Piatetsky-Shapiro (1996) argues:

“The decision tree is a classification model, applied to existing data. If you apply it to new data, for which the class is unknown, you also get a prediction of the class. The assumption is that the new data comes from the similar distribution as the data you used to build your decision tree. In many cases this is a correct assumption and that is why you can use the decision tree for building a predictive model.”

Classification means classifying data by putting instances in a single group when they belong to a common class. Applying this definition to the current research means classifying leads in order to create a lead qualification model that classifies a client as having deal or no-deal potential, which can be defined as a two-way problem: either a deal or no deal. One logical algorithm for making the decision whether a lead will be valid or not – deal or no deal – is a decision tree such as the C4.5 algorithm. This is used to help identify a strategy most likely to reach a goal. A decision tree creates a tree-like model that indicates when a certain outcome is likely, showing a probability distribution per leaf, and is therefore easy to understand for the end user. It does not require a great amount of effort to achieve high performance and it can handle a variety of input data: nominal, numeric and textual. It can handle erroneous datasets and missing data (Bhargava, Sharma, Bhargava & Mathuria 2013). The Java variant J48 can be used in Weka and is essentially the algorithm translated to the Weka software, which runs on the programming language Java (Hall, Frank, Holmes, Pfahringer, Reutemann & Witten 2009).
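A minimal sketch of training and inspecting a J48 tree through the Weka Java API is shown below; the file name and the default pruning options are assumptions for illustration, not the settings reported later in this thesis.

```java
// Sketch of training a J48 (C4.5) tree in Weka, assuming a nominal class attribute
// ("deal vs no deal") stored as the last column of a hypothetical leads.arff export.
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class J48Example {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("leads.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setOptions(new String[]{"-C", "0.25", "-M", "2"});  // Weka's default pruning settings
        tree.buildClassifier(data);               // J48 handles missing values internally
        System.out.println(tree);                 // prints the human-readable tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toMatrixString("Confusion matrix (deal vs no deal):"));
    }
}
```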

Experimental results by Freund and Mason (1999) show that the ADTree algorithm is competitive with C4.5 decision tree algorithms and is therefore a logical follow-up test. ADTree is a hybrid method that combines different supervised learning methods in order to increase prediction accuracy, using weak classifiers in a combined manner to achieve one strong predicting model, used for research questions such as predicting the chance of getting a certain disease and many other two-way problems. The ADTree algorithm is based on the AdaBoost algorithm proposed by Freund, Mason and Schapire (1999) and is one of the most important ensemble methods. Ensemble learning deals with methods that employ multiple learners to solve a problem.
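A comparable sketch for ADTree is given below. It assumes the same hypothetical leads.arff export; note that in recent Weka releases ADTree is distributed in the separate alternatingDecisionTrees package, so the class may need to be installed before this compiles.

```java
// Sketch of fitting an Alternating Decision Tree (ADTree) in Weka on the same two-class
// lead data. The file name and the number of boosting iterations are assumptions.
import weka.classifiers.Evaluation;
import weka.classifiers.trees.ADTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class ADTreeExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("leads.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        ADTree adt = new ADTree();
        adt.setOptions(new String[]{"-B", "10"});  // -B: number of boosting rounds (weak learners)
        adt.buildClassifier(data);
        System.out.println(adt);                   // prints the alternating tree with node scores

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new ADTree(), data, 10, new Random(1));
        System.out.printf("ADTree accuracy: %.2f%%%n", eval.pctCorrect());
    }
}
```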


Table 1: Overview of data mining algorithms

C4.5 (J48)
Sort: classification. Output: tree.
Attributes handled: empty nominal, nominal, unary, binary, date and numeric attributes; missing values.
Use: create a prediction; a decision is made for a given record.
Source: Kotsiantis, S. (2013). Decision trees: a recent overview. Artificial Intelligence Review.

Naive Bayes
Sort: classification or regression. Output: tree or table.
Attributes handled: empty nominal, nominal, unary, binary and numeric attributes; missing values.
Use: discover relationships between input columns and predictable columns.
Source: Keogh, E. J., & Pazzani, M. J. (2002). Learning the structure of augmented Bayesian classifiers. International Journal on Artificial Intelligence Tools.

k-Means
Sort: clustering. Output: scatterplot.
Attributes handled: binary, nominal, empty nominal, unary and numeric attributes; missing values.
Use: use the inherent structures in the data to organize the data into groups of maximum commonality.
Source: Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall, Englewood Cliffs.

Expectation-Maximization (EM)
Sort: clustering. Output: scatterplot.
Attributes handled: binary, nominal, empty nominal, unary and numeric attributes; missing values.
Use: maximum likelihood estimates of parameters in statistical models.
Source: McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions.

Apriori
Sort: association rule learning. Output: table.
Class handling: missing class values, nominal class, binary class, no class.
Use: highlight general trends in the database; this has applications in domains such as market basket analysis.
Source: Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases.

Support Vector Machine (SVM)
Sort: regression. Output: table.
Attributes handled: binary, nominal, empty nominal, date, unary and numeric attributes; missing values.
Use: analyse data and recognize patterns; used for classification and regression analysis.
Source: Vapnik, V. (1995). The nature of statistical learning theory. Springer, New York.

ADTree (AdaBoost)
Sort: classification. Output: tree.
Attributes handled: binary, nominal, empty nominal, date, unary and numeric attributes; missing values.
Use: two-class problems, by boosting weak learners combined into a stronger prediction model.
Source: Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm.

CART
Sort: classification and regression. Output: tree.
Attributes handled: empty nominal, nominal, unary, binary and numeric attributes; missing values.
Use: predict in which group to classify a record and give a real number.
Source: Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole.

How to benefit from gathered data

LaValle, Lesser, Shockley, Hopkins and Kruschwitz (2011) name three stages of analytics adoption by a company: Aspirational, Experienced and Transformed. These three stages indicate how advanced a company is in terms of its use of analytics and the value it obtains from such use.

Aspirational stage: organizations in this stage are focusing on efficiency or automation of existing processes and are searching for ways to cut costs. Aspirational organizations currently have few of the necessary building blocks – people, processes or tools – to collect, understand, incorporate or act on data-generated insights.

Experienced stage: these organizations are looking to go beyond cost management. Experienced organizations are developing better ways to collect, incorporate and act on analytics effectively so they can begin to optimize their organizations and increase their revenue streams by using these data to their advantage.

Transformed stage: these organizations have substantial experience using analytics across a broad range of functions. They use analytics as a competitive differentiator and are already adept at organizing people, processes and tools to optimize and differentiate. Transformed organizations are less focused on cutting costs than Aspirational and Experienced organizations, possibly having already automated their operations through effective use of insights. They are most focused on driving customer profitability and making targeted investments in niche analytics as they keep pushing the organizational envelope. Only organizations that are more advanced in terms of analytical adoption could expect to reap the full benefits of, for example, a data-driven lead qualification model used by sales functions. Effective data collection, data storage and data usage are necessary to create an accurate model, and these three factors are a requirement for companies with a higher analytical adoption (LaValle, Lesser, Shockley, Hopkins and Kruschwitz 2011). They would therefore see the importance of creating a data-driven lead qualifier and use it effectively. Sales organizations that are further along in terms of analytical implementation understand more about their clients and leads and are therefore better able to implement a data-driven sales solution for their sales department.

How data-driven companies perform

The increase in spending on Big Data components, as reported by ABI Research, is very large, certainly considering the economic downturn. This increases pressure on margins and at the same time prioritizes efficiency on all levels, including marketing (Nichols 2013). This last trend could well be the very reason Big Data has become this big, as it is considered to be the path to optimal efficiency by reallocating cost to boost margins and being able to beat competitors with knowledge.

“Data-driven decisions are better decisions. Using Big Data enables managers to decide on the basis of evidence rather than intuition”; this quote by McAfee and Brynjolfsson (2012) gets right to the core of the use for Big Data. To find evidence for this statement, they conducted structured interviews with 330 executives at public North American companies, asking about the executives’ organizational and technology management practices and gathering performance data from annual reports and independent sources. Across all the analyses conducted, McAfee and Brynjolfsson found that the more a company characterized itself as data-driven, the better it performed on objective measures of financial and operational results. The analyses show that companies in the top third of their industry in the use of data-driven decision making were on average 5% more productive and 6% more profitable than their competitors. They found their results statistically significant and economically important, as the effect was reflected in measurable increases in stock market valuations. When looking at sales, instead of a selection process conducted using the subjective judgment of a sales representative (sales is often still considered an art based on the intuition of the salesperson; Oechsli 2010), data can help turn this subjective nature into a data-driven decision based on evidence, which decreases human error in judgements and could well increase productivity and profitability.

McKinsey analysis (Court, Perrey, McGuire, Gordon & Spillecke, 2013), which covers more than 250 engagements over five years, has shown that putting data at the center of the marketing and sales decisions of companies improves the marketing return on investment (MROI) by 15%-20%. That adds up to $150 billion to $200 billion of additional value based on an estimated $1 trillion of global annual marketing spending. The studies by McAfee and Brynjolfsson, and Court et al., confirm that data-driven companies perform better on productivity, MROI and overall profitability compared to their competitors. Lead qualification by itself is shown to increase efficiency, as it prioritizes sales leads and could therefore increase speed of follow-up and increase close rates (Vernon, n.d.). Data-driven lead qualification could lead to even more efficiency, and thereby profitability, since it combines classic sales process optimisation with an evidence-based technique such as data mining and decreases intuition error margins made by sales staff (McAfee & Brynjolfsson 2012).

3. Methods

This paper outlines which indicators or attributes can, with the use of analytics, predict the quality of a client and make it possible to match that client with a past successful one and thus identify the client as likely to convert. Also addressed is which cluster of variables matters most in qualifying a lead and can determine whether the lead classifies as a deal: a successful match between client and service. In order to make this classification it is first necessary to determine the information needed to make an accurate prediction. This research uses data merged from several data sources in order to concisely describe the indicators of a quality client with the best fit, by analysing the similarities of converted clients. Secondly, it will show which source attributes contribute most to predicting the quality of a client. Thirdly, this analysis is used to determine which algorithm or model variation produces the most accurate prediction of which clients have the highest chance of converting to a successful deal. In Figure 2 below, an overview is given of the research design and the variables and sources included. This research provides practical and crucial recommendations for sales and marketing representatives. With the goal of making an attractiveness model, the steps outlined below will assist in determining which client should be prioritized first. This model could also be implemented in other domains where the specificity of services is similar.

Figure 2: Design of this research

Outcome: Successful deal or No deal

Customer Profile (CRM data):
1. Nationality (improved with Google Analytics)
2. Language (improved with Google Analytics)
3. Gender
4. Budget
5. Preferred # of bedrooms
6. Preferred furnishing
7. Preferred city
8. Preferred neighbourhood
9. Response
10. Contact time (min)

Employer information (LinkedIn data):
11. B2B or B2C client (improved with CRM)
12. Industry
13. Company size

Website behaviour (Google Analytics data):
14. Source (merged with CRM data)
15. Source medium
16. Used device to register
17. Browser
18. Operating system
19. Page depth
20. Session count
21. Session duration
22. Sessions to transaction
23. Days to transaction
24. Day of the week
25. Hour of day

Data set

The data set has been created by extracting data from the Customer Relationship Management (CRM) system of the subject rental agent.

Sample from CRM: 20,425 opportunities (January 2013 until November 2014), of which 1,610 were converted into successful customers/deals; 6,402 opportunities and 342 deals have been enhanced using Google Analytics data.

The variables have been chosen from the historical data based on availability and business insights, gained from years of experience and through conversations with rental consultants; the author of this paper has been practicing marketing and database building in the rental market for more than five years. In addition to these reasons, existing literature such as BANT and practices such as Leadformix include many similar variables.

The variables are sorted into three categories defined by data source: customer profile data extracted from the CRM system of the agent, website data taken from Google Analytics, and company data from LinkedIn. All these data sources are freely and easily usable by average mid-sized organizations, thereby creating a useful case for a lead qualification model.

Customer Profile (CRM) Data

With a CRM system it is possible to manage the data of an organisation's interactions with its customers, clients and future prospects. A CRM system combines several modules to organize data and automate and synchronise communication processes within an organisation (Ooi & Pundurasi 2013). The available data from customers are: nationality, which is a nominal variable; gender as a dichotomous variable, either man or woman; available rental budget according to the rental profile (ratio); number of bedrooms (ratio); number of tenants (ratio); and their location of interest (nominal), often described as city or neighbourhood. The available data may also include information on whether a tenant's employer is paying, i.e. a business-to-business (B2B) client, where the contract is made with the company instead of the occupant, or whether the individual in question pays for the rental agent service, i.e. business-to-consumer (B2C), where the contract is made with the individual tenant. This variable is defined as B2B vs B2C (dichotomous). Time to first contact is registered in minutes as a continuous scale variable and referred to as "contact_time". Data on whether a client responded or not (dichotomous) were also collected, which is an important factor in client interactions.
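As an illustration of how such a customer profile could be represented before being exported to analysis tools, a hypothetical plain data class is sketched below; the field names mirror the variables described above but are not the agent's actual CRM schema.

```java
// Hypothetical sketch of the CRM part of a lead record; illustrative only.
public class CrmLead {
    String nationality;            // nominal
    String language;               // nominal (improved with Google Analytics)
    boolean male;                  // dichotomous: gender
    double budget;                 // ratio: available rental budget
    int preferredBedrooms;         // ratio
    String preferredCity;          // nominal
    String preferredNeighbourhood; // nominal
    boolean b2b;                   // dichotomous: employer pays (B2B) vs individual pays (B2C)
    double contactTimeMinutes;     // scale: time to first contact, in minutes
    boolean responded;             // dichotomous
    Boolean deal;                  // class label: deal vs no deal (null while still open)
}
```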


LinkedIn data

Comparing B2B to B2C is further enhanced by using LinkedIn to find the current job industry of the client and the size of the company for which they work.

Website behaviour data using Google Analytics data

Website behaviour is also used to see whether there are similarities in the website journey made by the customer, especially in terms of the time and method used to find and contact the company in question; these data are included in the available dataset by making use of Google Analytics.

Source data, which was already partially available in the CRM system, is added to the data set by using Google Analytics for clients using an online method of contact. Their source of entering the agent's website (nominal) was added. Furthermore, the following nominal variables, describing what the client used at the time of registering on the website, were added: "used_device" (mobile, desktop or tablet), "browser" (Chrome, Firefox or Internet Explorer) and "operating_system" (Windows or Macintosh).

Then the following time variables (scale) are used to obtain an accurate view of the time lapses between the different stages in the process: "page_depth" (number of pages seen before registering); "sessions_to_transaction" (number of sessions on the website before registering); "session_count" (the total number of sessions, even after registering); "session_duration" (time of their session before registering); "days_to_transaction" (which accounts for the number of days from first session to session of registration); and finally the "day_of_the_week" and "hour_of_day" of registering. Table 2 below provides a complete overview of variables.

Compared to existing methods like BANT, many similar questions are answered using these variables, as mentioned earlier. Budget has been defined; authority can be seen by client type (B2B vs B2C): either business to business, which shows the company is the final decision maker, or business to consumer, where the individual client is the final decision maker. Needs are seen by property interests. A specific timeframe could be added by asking the client the date by which they need to move into a property; however, this data item was not properly recorded. A more indirect approach was taken to obtain an alternative timeframe measure: variables such as website sessions, sessions to transaction and days to transaction were taken into account, since one could argue that if a property, service or product is needed more urgently, it will be more often 'top of mind' for the client and therefore the client will be more likely to check the website more frequently. This correlation will be further researched. There are already companies that use similar data sets to create a lead scoring model. One example previously mentioned is Leadformix, which uses a technique they call PIEDIO and essentially uses similar variables. The name "PIEDIO" stands for Purchase Readiness, Intent Revealed, Engagement, Demographics & Purchase Barriers, Identity of Lead (or Contact) Established, Off-site (& Off-line) Activities.

Table 2: Deal vs No Deal per variable (Weka), no filter applied

Variable | Level of measurement | Deal | No deal | All | Missing values
deal vs no deal | dichotomous | 1515 | 15725 | 17240 | 0.0%
nationality | nominal | 1261 | 5209 | 6470 | 62.5%
language | nominal | 315 | 5886 | 6201 | 64.0%
gender | dichotomous | 1515 | 15725 | 17240 | 0.0%
budget | ordinal | 1495 | 9797 | 11292 | 34.5%
preferred bedrooms | ordinal | 1478 | 10033 | 11511 | 33.2%
furnishing | nominal | 810 | 5240 | 6050 | 64.9%
preferred city | nominal | 1515 | 15725 | 17240 | 0.0%
preferred neighborhood | nominal | 1443 | 898 | 2341 | 86.4%
b2b vs b2c | dichotomous | 1515 | 15725 | 17240 | 0.0%
industry | nominal | 1150 | 4647 | 5797 | 66.4%
company size | nominal | 459 | 1089 | 1548 | 91.0%
source | nominal | 867 | 11206 | 12073 | 30.0%
medium | nominal | 318 | 5931 | 6249 | 63.8%
used device | nominal | 315 | 5890 | 6205 | 64.0%
browser | nominal | 315 | 5886 | 6201 | 64.0%
operating system | nominal | 315 | 5886 | 6201 | 64.0%
page depth | scale | 315 | 5886 | 6201 | 64.0%
session count | scale | 315 | 5886 | 6201 | 64.0%
session duration | scale | 315 | 5886 | 6201 | 64.0%
sessions to transaction | scale | 315 | 5886 | 6201 | 64.0%
days to transaction | scale | 315 | 5886 | 6201 | 64.0%
day of week | scale | 315 | 5886 | 6201 | 64.0%
hour of the day | scale | 315 | 5886 | 6201 | 64.0%
contact time (min) | scale | 1515 | 15725 | 17240 | 0.0%
responded | dichotomous | 1515 | 15725 | 17240 | 0.0%

Procedure

Figure 3: Process overview of Knowledge discovery process (Usama, Gregory, Padhraic 1996)

In this study, the forecast to predict whether a lead classifies as a deal is produced using the five steps that are often used in the knowledge discovery process (Usama, Gregory, Padhraic 1996). In the first step, the data is selected and variables are chosen (see the sub-chapter "Data set").

In order to be able to predict whether a lead will become an actual customer in the rental market branch and to create an attractiveness model, data obtained from the rental agent is used. The data is extracted from the rental agent's CRM, which is then enhanced with data from Google Analytics and LinkedIn. See the sub-chapter "Data set" for a more in-depth explanation of which variables were chosen and why.

The data is then pre-processed, meaning that the target set is checked and cleaned. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes (Brachman & Anand 1996). After the data has been cleaned, relationships are identified using a correlation matrix. The effect of the variables is tested by using a multi-variable logistic regression analysis. Regression attempts to find a function that models the data with the least error. This is one way of creating a prediction model; another way is using a classification model. The data is first split per source in order to define which source or combination of sources forecasts a deal best, and one additional split is made using a feature selection filter in Weka. These data sets are tested using a classification method: J48, a Java version of the C4.5 algorithm (Bhargava, Sharma, Bhargava & Mathuria 2013). Classification is the task of generalizing a known structure to apply to new data. One method of classification is a decision tree, which is primarily used for predictive modelling of categorical class labels (Apte & Hong 1996; Fayyad, Djorgovski & Weir 1996) and which gives a clear and understandable model for the actions to take when a client possesses certain characteristics.
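The feature selection filter mentioned above can be applied programmatically as well as through the Weka interface. The sketch below uses the CfsSubsetEval evaluator with a BestFirst search as one common combination; the thesis does not state which evaluator was configured, so this setup is an assumption.

```java
// Sketch of attribute (feature) selection in Weka on the merged lead data;
// the evaluator/search pair and the file name are illustrative assumptions.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureSelectionExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("leads.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());   // scores attribute subsets
        selector.setSearch(new BestFirst());          // searches the subset space
        selector.SelectAttributes(data);              // note: capital S is Weka's actual method name

        // Print the names of the retained attributes (includes the class attribute).
        for (int index : selector.selectedAttributes()) {
            System.out.println(data.attribute(index).name());
        }
    }
}
```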

However, when dealing with multiple variables, some of which have a small impact and are so-called weak learners (i.e. have a weak correlation and are only slightly better than guessing), a third method is used. This is a variation of the C4.5 algorithm based on AdaBoost, which enables weak learners to become, together, very accurate predictors of two-way problems with great simplicity; this method is called Alternating Decision Tree or ADTree (Freund & Mason 1999). The two reported algorithms were found using a preliminary algorithm spot-checking method on the given dataset. For an overview of the steps taken, see Figure 4. Below, this research will go more in-depth into each step that was taken to achieve the final goal: a data-driven method of qualifying leads.

Figure 4: Method procedure

Pre-process the data
Statistical test: (Pearson) correlation matrix
(1) Logistic regression with recoded variables
(2) Logistic regression without nominal predictors
Split the dataset per source
(3) Create a Decision Tree per data source
(4) Create an ADTree per data source

4. Analysis and Results

Pre-processing steps

The following Pre-processing steps were performed on the dataset obtained from the CRM system of the subject rental agent. Upon completion, the data was merged with Website data gained from Google Analytics using the technique “ecommerce tracking”, which enables the system to award a unique ID to each record. This unique ID is similar to the ID provided by the CRM system of the agent, and by matching these IDs, merging of these two data sources was made possible. The CRM system also records the company for which the individual works; using this data, LinkedIn was used to find “industry” and “company_size” and add these variables to the data set. By combining these three data sources, a more complete picture could be made of a client profile and client behaviour leading up to the registration.
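A simplified sketch of this merge step is shown below: Google Analytics rows are keyed by the shared tracking ID and LinkedIn-derived attributes by employer name, and a single CRM record is enriched by looking up both. The IDs, employer name and values are made up for illustration.

```java
// Simplified sketch of the merge: join CRM records to Google Analytics rows on a shared
// tracking ID, and look up LinkedIn attributes by employer name. All values are illustrative.
import java.util.HashMap;
import java.util.Map;

public class MergeSources {
    public static void main(String[] args) {
        // Google Analytics data keyed by the e-commerce tracking ID (same ID as in the CRM).
        Map<String, String> gaById = new HashMap<>();
        gaById.put("crm-1001", "source: google/cpc, device: mobile, page depth: 7");

        // LinkedIn-derived company data keyed by employer name.
        Map<String, String> linkedInByEmployer = new HashMap<>();
        linkedInByEmployer.put("acme bv", "industry: it services, company size: 51-200");

        // Enrich one hypothetical CRM record with both sources.
        String crmId = "crm-1001";
        String employer = "acme bv";
        String ga = gaById.get(crmId);                       // null if the lead left no web trail
        String company = linkedInByEmployer.get(employer);   // null if no LinkedIn match

        System.out.println(crmId + " | " + ga + " | " + company);
    }
}
```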

The data was then cleaned by removing items such as non-tenants and duplicates; only historic clients that either reached the deal stage or were lost have been used. All non-alphanumeric symbols, such as commas and spaces, were replaced with underscores, and all values were converted to lowercase to avoid case confusion.
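The value normalisation described above is a simple string transformation; a minimal sketch is shown below (the regular expression and examples are assumptions, not the exact routine used in the thesis).

```java
// Minimal sketch of the value-normalisation step: replace non-alphanumeric characters
// with underscores and lower-case everything. Illustrative only.
public class CleanValues {

    static String normalize(String raw) {
        // e.g. "The Hague, Centre" -> "the_hague__centre"
        return raw.trim()
                  .toLowerCase()
                  .replaceAll("[^a-z0-9]", "_");
    }

    public static void main(String[] args) {
        System.out.println(normalize("The Hague, Centre"));
        System.out.println(normalize("Mr. John"));
    }
}
```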

Data were removed when there was a suspicion of incorrectly entered or measured values (two removed examples are 24 and 12 "preferred_bedrooms", which were likely meant to be 2 to 4 and 1 to 2 bedrooms); this suspicion arose after conversation with the Sales Manager. The system sometimes automatically entered 0 instead of "missing", mainly for the variables "budget" and "preferred_bedrooms"; of these records, only those expressing an interest in studio apartments (which consist of 0 bedrooms) were kept. Variables with similar meanings, such as title, were recoded; for example, Ms. or Mrs. was recoded as female and Mr. as male.

Missing values for the language and nationality variables in the CRM were decreased by merging with Google Analytics data. Furthermore, the variable "source" had multiple duplicate mentions of the same source; these synonyms were merged. Outliers were removed using stem-and-leaf and box plots to increase normality.

The next step was merging data from LinkedIn, using the client's employer name to find "industry" and "company_size". B2B or B2C was defined both by data filled in from the CRM system and by using a list of companies known to have signed agreements with the rental agent; these companies were matched to the employer name listed by the client on LinkedIn.


Finally, after cleaning and merging, the data was converted to a comma separated value (CSV) file so that it could be loaded into well-known data tools such as Weka and the Statistical Package for the Social Sciences (SPSS). Upon completion of this step, 17,240 opportunities remained after pre-processing, only 1,515 of which were considered deals.

Analysis

First, a preliminary analysis of the data was conducted by checking the data for abnormalities using descriptive statistics, skewness and kurtosis, and normality tests. The Shapiro-Wilk test, the Kolmogorov-Smirnov test and Q-Q plots were used. The normality check indicated that all the variables were roughly normally distributed.
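As an illustration, such checks could be reproduced with SciPy as sketched below; the file and column names are assumptions and the test choices mirror, rather than reproduce, the SPSS output used in the study.

```python
# A minimal sketch of the normality checks mentioned above.
import pandas as pd
from scipy import stats

df = pd.read_csv("leads_preprocessed.csv")          # hypothetical export
budget = df["budget"].dropna()

print("Skewness:", stats.skew(budget))
print("Kurtosis:", stats.kurtosis(budget))
print("Shapiro-Wilk:", stats.shapiro(budget))
print("Kolmogorov-Smirnov:", stats.kstest(stats.zscore(budget), "norm"))
```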

Correlation Matrix

Since the goal was to determine which variables were important predictors of a successful client, a Pearson correlation matrix was created in order to find correlation between scalar variables and “deal or no deal”. All non-scalar variables were recoded as binary variables: “responded” was recoded using “did not respond” = 0 and “responded” = 1; “male_vs_female”, where “male” = 1 and “female” = 0; “b2c_vs_b2b”, where “b2c” = 0 and “b2b” = 1; and “deal_vs_no_deal”, which was recoded as “no_deal” = 0 and “deal” = 1. “Days_of_the_week” was recoded as a multilevel nominal variable, with 1 for “Monday” through 7 for “Sunday”. All of the variables were recoded into different numerical "bins" (i.e. left = 1, right = 2, with 0.000 assigned if no data was provided). This procedure is described in Zhang, K., & Jin, H. (2011):

“Algorithms working only on numerical attributes, the categorical attribute values have to be recoded as, e.g., 0, 1, 2, ···, and mixed them with the numerical attribute values of the original dataset. Meanwhile, for some algorithms designed for categorical datasets, numerical values must be discretized into several bins, and treat them as a set of categorical values.”
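The recoding and correlation step could be scripted as sketched below. The binary mappings mirror those described in the text, but the source column names ("gender", "client_type", "status") are assumptions for illustration.

```python
# Hedged sketch: recode non-scalar variables to binaries, then compute Pearson
# correlations of all numeric variables with deal_vs_no_deal.
import pandas as pd

df = pd.read_csv("leads_preprocessed.csv")          # hypothetical export

df["responded"] = df["responded"].map({"did_not_respond": 0, "responded": 1})
df["male_vs_female"] = df["gender"].map({"female": 0, "male": 1})
df["b2c_vs_b2b"] = df["client_type"].map({"b2c": 0, "b2b": 1})
df["deal_vs_no_deal"] = df["status"].map({"no_deal": 0, "deal": 1})

numeric = df.select_dtypes(include="number")
print(numeric.corr(method="pearson")["deal_vs_no_deal"].sort_values())
```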

It is important to note that all variables except "days_to_transaction" were significant predictors of "deal_vs_no_deal". The most important were "responded", which was significantly positively correlated with "deal_vs_no_deal", and B2B vs B2C, which was significantly negatively correlated with "deal_vs_no_deal". In addition, budget, session_duration, preferred_bedrooms, page_depth, session_count, and sessions_to_transaction were all significantly positively correlated with the likelihood of a deal. Hour_of_the_day and contact_time_min were both significantly negatively correlated with "deal_vs_no_deal".


Logistic regression using SPSS

Since many of the predictors were significantly correlated with the response (deal or no deal), a logistic regression model was planned to determine: (1) the relative importance of the multiple predictors, and (2) whether or not the levels of categorical predictors (e.g. internet browser) were significant predictors of a deal.

A logistic regression analysis was used to determine the type of relationship between the variables and to measure the influence of the independent variables on the dependent variable. The first logistic regression (“Regression 1”) included deal (1) or no deal (0) as the dependent variable and, as scalar or binary categorical predictors: budget, preferred bedrooms, B2B vs B2C, page depth, session count, session duration, sessions to transaction, days to transaction, hour of the day, contact time, and responded; all other categorical variables were included as nominal predictors.

In Regression 1, the nominal categorical predictors did not significantly predict the response (deal vs no deal) based on the parameters provided: the number of observations was less than the number of fitted parameters, causing overfitting, most likely due to too many missing values. Logistic regression in SPSS normally deletes the entire record when a variable is missing and thus performs list-wise deletion. Despite this, some variables appeared to be highly significant predictors of a deal, including budget, B2B or B2C, responded (the most significant predictor), hour of the day = 18, and company size = 11-50 employees.

A second logistic regression (“Regression 2”) was therefore constructed, which included deal (1) or no deal (0) as the dependent variable and, as scalar or binary categorical predictors: budget, preferred bedrooms, B2B vs B2C, page depth, session count, session duration, sessions to transaction, days to transaction, hour of the day, contact time, gender and responded. A Chi-square goodness-of-fit test (to test absolute fit) was performed to determine whether the model fits the data.
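For readers without SPSS, a comparable model could be fitted as sketched below with statsmodels. This is an illustration, not the exact SPSS specification; it assumes the binary recoding described earlier and a hypothetical pre-processed export file.

```python
# Hedged sketch of a Regression 2-style logistic regression.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("leads_preprocessed.csv")           # hypothetical export

predictors = ["budget", "preferred_bedrooms", "b2c_vs_b2b", "page_depth",
              "session_count", "session_duration", "sessions_to_transaction",
              "days_to_transaction", "hour_of_the_day", "contact_time_min",
              "male_vs_female", "responded"]

X = sm.add_constant(df[predictors].astype(float))
y = df["deal_vs_no_deal"]                             # 1 = deal, 0 = no deal

model = sm.Logit(y, X, missing="drop")                # list-wise deletion, as in SPSS
result = model.fit()
print(result.summary())                               # coefficients, p-values, pseudo R-squared
```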


The Regression 2 model included deal or no deal as the dependent variable and budget, preferred bedrooms, B2B vs B2C, gender, page depth, session count, session duration, sessions to transaction, days to transaction, hour of the day, contact time, and responded as scalar or binary categorical predictors. The predictors combined were able to predict deal or no deal with 88.6% accuracy, and the model was highly statistically significant: -2 Log Likelihood = 1034.96, Chi-square = 243.42, df = 13, p < .001.

Observed                          Predicted: no deal (0)   Predicted: deal (1)   Percentage correct
Step 1   No deal (0)              1595                     8                     99.5
         Deal (1)                 198                      7                     3.4
         Overall percentage                                                      88.6

Table 3: Classification Table

-2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1034.956            0.126                  0.248

Table 4: Log-2 Likelihood

                 Chi-square   df   Sig.
Step 1   Step    243.424      13   .000
         Block   243.424      13   .000
         Model   243.424      13   .000

Table 5: Omnibus Tests of Model Coefficients

This research also calculated variance inflation factors (VIF) to assess multi-collinearity. The VIF within the model was always below 2.100, which is well below the rule-of-thumb cut-off of 10.

Model                     B            Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)                .142         .046                 3.091    .002
budget                    4.856E-05    .000         .090    3.512    .000   .811        1.232
preferred_bedrooms        -.016        .009         -.044   -1.739   .082   .816        1.226
B2B0B2C1                  -.122        .025         -.114   -4.916   .000   .995        1.005
page_depth                .001         .000         .044    1.322    .186   .485        2.061
session_count             .002         .002         .037    1.328    .184   .677        1.478
session_duration          1.467E-05    .000         .056    1.698    .090   .488        2.050
sessions_to_transaction   .007         .004         .048    1.644    .100   .614        1.629
days_to_transaction       .000         .001         -.011   -.448    .654   .879        1.138
hour_of_the_day           -.003        .001         -.064   -2.751   .006   .987        1.013
contact_time_min          -1.264E-05   .000         -.041   -1.744   .081   .990        1.010

(B and Std. Error: unstandardized coefficients; Beta: standardized coefficient)
a. Dependent Variable: dealnodeal

Table 6: Collinearity Statistics
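A comparable collinearity check can be scripted outside SPSS; the sketch below is an illustration using statsmodels, with file and column names assumed rather than taken from the study.

```python
# Hedged sketch: variance inflation factors (VIF) for the scalar/binary predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("leads_preprocessed.csv")            # hypothetical export
cols = ["budget", "preferred_bedrooms", "b2c_vs_b2b", "page_depth",
        "session_count", "session_duration", "sessions_to_transaction",
        "days_to_transaction", "hour_of_the_day", "contact_time_min"]

X = sm.add_constant(df[cols].astype(float).dropna())
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.3f}")
```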

Analysis of the parameter estimates of the Regression 2 coefficients indicated that the significant predictors of a deal were: budget (+), B2B0B2C1 (+), session duration (+), and hour of the day (-). Thus, according to Regression 2, a B2B client with a larger budget and a long session duration early in the day is most likely to reach a deal. The other predictors were not significantly associated with a different likelihood of deal or no deal.

The major limitation of the Regression 2 model is that, while its overall predictive validity is good for predicting when a deal will not occur (accurately predicting 1,595 of 1,603 "no deal" results), it is far less able to predict when a deal will occur (accurately predicting only 7 of 205 "deal" results). This means that a significant proportion of the variation in the response is not accounted for by the variables included in Regression 2, as can be seen below in Table 7. Besides the fact that only scalar or binary categorical predictors could be used to create a valid logistic regression, this imbalance was the main reason to make use of different methods, such as decision trees, in order to test whether other methods could achieve the results necessary to create a lead qualifier model with accurate prediction.

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .202a   .041       .035                .312

a. Predictors: (Constant), contact_time_min, budget, B2B0B2C1, session_count, session_duration, hour_of_the_day, days_to_transaction, furnishing, preferred_bedrooms, sessions_to_transaction, page_depth

Table 7: R Square


Classification Model

Two decision tree methods were used to determine which sources had the most predictive value, by testing each data source variation. There was, however, a significantly greater occurrence of "no deal" samples (91.3%), making the data set unbalanced. For this reason, only 3,030 samples, equally divided between the deal and no deal labels, were considered for the deal/no deal prediction in the decision tree experiments. Using the under-sampling method of Stratified Random Sampling (Cochran, 1946; Stehman, 1996), implemented with Weka's SpreadSubSample filter and a distribution spread of 1.0, a random sample of "no deals" was selected to equal the number of deals, thereby balancing the data. The balanced data set was then randomized using Weka.
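The same random under-sampling idea can be sketched outside Weka; the example below is a minimal illustration with pandas, assuming the hypothetical file and column names used earlier.

```python
# Hedged sketch: random under-sampling of the majority class, analogous to
# Weka's SpreadSubSample filter with a distribution spread of 1.0.
import pandas as pd

df = pd.read_csv("leads_preprocessed.csv")            # hypothetical export
deals = df[df["deal_vs_no_deal"] == 1]
no_deals = df[df["deal_vs_no_deal"] == 0]

# Randomly draw as many 'no deal' records as there are 'deal' records.
no_deals_sample = no_deals.sample(n=len(deals), random_state=42)

balanced = pd.concat([deals, no_deals_sample]).sample(frac=1, random_state=42)
print(balanced["deal_vs_no_deal"].value_counts())      # both classes now equal in size
```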

The data was sliced into different parts, namely: (1) CRM, (2) LinkedIn, (3) Google Analytics, (4) CRM and LinkedIn data combined, (5) Google Analytics and CRM data combined, (6) LinkedIn and Google Analytics combined and (7) all data. Each data set was converted to .CSV format and uploaded to Weka. From the all-data set, an 8th slice was made by running the (8) Feature Selection filter in Weka. Feature Selection, or attribute selection, is a process that automatically searches for the best subset of attributes in the dataset, typically the features with the highest predictive value, which reduces overfitting, improves accuracy and reduces training time. The attribute with the highest information gain is chosen as the test attribute for the current node; this attribute minimizes the information needed to classify the samples in the resulting partitions (Hall, Frank, Holmes, Pfahringer, Reutemann, & Witten, 2009). After the data splits were created, both a J48 Decision Tree (a Java implementation of the C4.5 algorithm) and an Alternating Decision Tree were built for each split data set, and both were tested in order to find the best model.
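The attribute selection step was performed with Weka's built-in filter; the sketch below is only an analogous illustration, ranking attributes by mutual information (closely related to information gain) with scikit-learn. File and column names are assumptions.

```python
# Hedged, analogous sketch of attribute selection by information-gain-style ranking.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("balanced_all_sources.csv")                 # hypothetical balanced set
X = pd.get_dummies(df.drop(columns=["deal_vs_no_deal"]))     # one-hot encode categoricals
y = df["deal_vs_no_deal"]

scores = mutual_info_classif(X.fillna(0), y, random_state=42)
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking.head(10))                                      # the ten most informative attributes
```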

In order to evaluate the algorithms, cross-validation was used. Cross-validation first involves separating the dataset into a number of equally sized groups of instances, called folds. The model is then trained on all folds except one, which is omitted, and the prepared model is tested on that omitted fold. The process is repeated so that each fold has the opportunity to be omitted and act as the test dataset. Finally, the performance measures are averaged across all folds to estimate the capability of the algorithm to address the problem. Commonly used numbers of folds are 3, 5 or 10.
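As an illustration, the sketch below runs 10-fold cross-validation with a decision tree; scikit-learn's DecisionTreeClassifier stands in for Weka's J48, and the file and column names are assumptions.

```python
# Hedged sketch: 10-fold cross-validation of a decision tree classifier.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("balanced_all_sources.csv")                          # hypothetical balanced set
X = pd.get_dummies(df.drop(columns=["deal_vs_no_deal"])).fillna(0)
y = df["deal_vs_no_deal"]

scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print(f"Mean accuracy over 10 folds: {scores.mean():.2%}")
```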


J48 Decision Tree using Weka

Thus a second method was tried: a J48 Decision Tree was formed for each data source and combination of data sources after balancing the complete data set. The results are shown in Figure 5, classified instances per data source. As the data is balanced, incorrectly classified instances can be seen as Type I and Type II errors, or false positives and false negatives, combined. The exact numbers of false positives and false negatives can be found in the confusion matrix in the Appendix.

Figure 5: Classified instances per data source in Weka using J48 Decision Tree

As shown, single data sources are less predictive, with LinkedIn alone giving a predictability of only 60.92%; by combining all data sources (CRM, Google Analytics and LinkedIn), this figure increases to 73.39%. This results in the decision tree shown in Figure 6.
