
Master Thesis

Modeling Company Success on Crunchbase

with Doc2Vec Features

Jan Hlousek

Student number: 11374403

Date of final version: August 15, 2017

Master program: Econometrics

Specialisation: Free Track

Supervisor: Prof. Dr. Marcel Worring

Second reader: Dr. Noud van Giersbergen

Faculty of Economics and Business

Amsterdam School of Economics


Contents

1 Abstract
2 Statement of Originality
3 Introduction
4 Literature Review
  4.1 Measuring Company Success
  4.2 Studies Using The Crunchbase Dataset
  4.3 Text Features in Econometrics
5 Data
  5.1 Crunchbase Dataset
  5.2 Numeric Features
  5.3 Text Features
6 Methods
  6.1 Natural Language Processing
    6.1.1 Vector Representations of Words
    6.1.2 Doc2Vec
  6.2 Estimation
    6.2.1 Logistic Regression
    6.2.2 Support Vector Machines
  6.3 Evaluation Measures
  6.4 Parameter Optimization
7 Results
  7.1 Effects of the Textual Variables
  7.2 Performance on the Classification Task
    7.2.1 Logistic Regression Classifier
    7.2.2 Support Vector Machines Classifier
8 Conclusion
9 Appendix
  9.1 Distances in the Doc2Vec Vector Space
  9.2 Descriptive Statistics
  9.3 Alternative Evaluation Measures

1 Abstract

In this thesis we incorporate textual information from the Crunchbase database into predictive models of company success by constructing two variables via the Doc2Vec method: team's diversity and solution-education overlap. We hypothesise that both team's diversity and solution-education overlap have a positive impact on a company's performance. The new variables are used in two predictive models of company success: logistic regression and support vector machines. Both team's diversity and solution-education overlap show the expected positive impact on a company's performance. However, when controlling for other factors, only team's diversity remains statistically significant. Inclusion of the new predictors increased the predictive performance, as measured by the area under the receiver operating characteristic curve, for both logistic regression and support vector machines, but the increase was not statistically significant. Logistic regression performs similarly to support vector machines on the classification task, with highest AUC values of 0.6739 and 0.6778, respectively.

2 Statement of Originality

This document is written by Jan Hlousek who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

3 Introduction

Much of the digital data on the internet comes in unstructured form: contents of websites, blogs, and microblogging platforms are a few examples. While structured data have been extensively analyzed in econometrics, textual information has rarely been used. In practice, decision makers rely not only on analyses of numeric and categorical data, but also, to a large extent, on textual information about their objects of interest. One of the fields in which decoding text can be particularly helpful is the prediction of company performance. Much of the successful research in this area has covered sentiment analysis related to publicly traded companies. Two factors have contributed to this success: the large dimensionality reduction achieved when processing news, since the information contained within news articles is typically reduced to a positive or negative sentiment, and the availability of key performance indicators for publicly traded companies. Small technology companies have been studied in econometric research to a much lesser extent, because their financial, network and operations data have not been available to the public. Recently, new startup databases, such as Crunchbase, Angellist or Pitchbook, have started to fill this gap by providing previously unavailable information on startup companies. The growing availability of startup information, as well as new legal developments that allow the general public greater involvement in startup investment, contribute to the increasing demand for quantitative strategies based on publicly available data. For example, the Crowdfund Act of 2012 in the United States has given small technology companies the right to raise capital from non-accredited investors, who do not have the same access to private information as venture capital funds and whose selection process therefore has to be based on open data. The challenging characteristic of public information is that it often comes in textual form.

In order to utilize the textual information, we first need to impose some structure with the help of natural language processing. In this thesis we apply a natural language processing technique, Doc2Vec, to the textual information in the Crunchbase database. This technique allows us to create fixed-length feature vectors related to startup companies and use them in predictive modeling. To our knowledge this algorithm has not been used in this context before. Two specific features about each company are constructed using the Doc2Vec feature vectors: diversity of the founding team and solution-education overlap. Higher diversity should provide a company with a broader range of skills, which is particularly important in the early stages of a company's life. Similarly, a higher overlap between the business solution and the education of the founders should have a positive impact on a company's performance. We combine the textual features with numeric and categorical variables, such as the dollar amount of the first investment, success rates of involved investors, or previous success of the founding team. These variables have previously been found to play a major role in predictions of companies' success.

To obtain reasonable vector representations from the texts on Crunchbase, we need to address three challenges. First, the information that we choose to represent a business solution should be specific enough to distinguish the nuances among diverse areas of operation. As for Crunchbase, one straightforward way to define business solutions would be to work with the Crunchbase categorization of companies. For our purposes, however, this categorization is rather broad, so we choose to establish a more niche grouping based on company keywords. Second, close synonyms, multiple entries per user and irrelevant information need to be accounted for. Finally, the training corpus for Doc2Vec needs to be sufficiently large. To address these challenges, we consider a few very specific pieces of textual information about every company and use a large amount of data for training. In particular, we link the company keywords and education majors to Wikipedia articles, so that each keyword and education major is represented by a long text. This approach is inspired by the work of Funderbeam (2016). As a result, words of similar meanings are mapped into close positions in the vector space and we can construct the variables that represent our attributes of interest.

We make the following contributions: First, we show that it is possible to incorporate the Doc2Vec feature vectors into existing classification frameworks that require numeric input. Second, we show that using the company keywords instead of the Crunchbase categorization leads to more niche grouping of companies. This finding might be useful for further applications, such as solution-wise comparison of performance. Third, we compare the logistic regression and support vector machines on a classification task and analyze the impact of their parameters on the classification performance.

The next chapters are outlined as follows: In Chapter 4 we provide our literature review, which includes studies that have previously analyzed the success factors of small technology companies, studies dealing with textual and network features from Crunchbase, and also general econometric research that uses textual data. In Chapter 5 we provide statistics on the Crunchbase dataset and define explicitly all the features we use in our predictions. In Chapter 6 we introduce the Doc2Vec technique, the two classification models – logistic regression and support vector machines – their parameters, and our evaluation measures. In Chapter 7 we present our classification results and compare them using the DeLong test. Chapter 8 provides the conclusion and our closing remarks. Finally, complementary tables are appended in Chapter 9.

4 Literature Review

Our literature review is organized into three parts. First, an overview of research that focuses on company success is presented. This provides guidance on the construction of features that have previously been shown to be predictive of company performance. Second, studies that have previously dealt with the Crunchbase dataset are summarized. These are studies with similar goals that use different natural language processing techniques such as Latent Dirichlet Allocation, studies that incorporate network features from Crunchbase into modeling, and studies on venture capital portfolio optimization. In the third part, the general use of textual information in econometric research is reviewed. This includes studies in financial econometrics that use features from news articles as predictors of stock prices. While the third part is not directly linked to our application, we include it to provide an example of an area in which textual econometrics has been successful and to outline the underlying reasons.

4.1 Measuring Company Success

Success of new technology companies has typically been defined as an exit through acquisition or initial public offering (Hochberg et al. (2007), Da Rin et al. (2011), Eesley et al. (2013)). Acquisition takes place when a large corporation, such as Amazon, Google, or Microsoft, purchases a small firm that develops an innovative technology. The new technology or team then becomes part of the large corporation's internal structures. For large corporations, incorporating a new R&D unit is often more beneficial than outsourcing it. Knott (2016) finds evidence that, in contrast to internal R&D development, outsourced R&D is not productive, and its output elasticity is essentially zero. The potential gains of acquiring new innovative technologies are offset by the difficulty of assessing their true potential. Recently, emerging startup databases have started to provide valuable information on investor and team characteristics that might foster the assessment of new venture performance. This information often appears in the form of a textual field.

Investor Characteristics

The ability to raise capital is one of the most critical factors contributing to the survival and success of startup companies. Sufficient resources in the early stage are necessary for developing a new business solution, building a network of connections in the competitive landscape, and searching for the right business model (Cheng et al. (2016)). Venture capital funds or other accredited investors invest in small technology companies with a high growth potential in exchange for equity. Multiple studies have suggested that companies backed by venture capital firms are more likely to be successful. The positive impact of venture capital backing is further driven by the experience of venture capital managers from previous deals, access to other firms in the venture capital portfolio and close due diligence processes. This is especially true for more reputable venture capital funds, whose performance has been found to be persistent (Puri and Zarutskie (2012)). Several measures have been proposed to define the reputation of investors: total number of investments, share of IPOs, share of acquisitions, or cumulative aggregate investment. The different measures of reputation have been compared by Nahata (2008), who concludes that the share of IPOs and cumulative dollar capitalization are consistent predictors of venture capital performance. Venture capital firms often invest in syndications in order to share the risk and increase their portfolio diversity. Having multiple investors may be beneficial, as the company gains access to a broader portfolio of potential collaborators and the due diligence process is more thorough. Lindsey (2008), for instance, shows that strategic alliances are often formed within common investor networks and that they correlate with a higher performance. Lerner (1994) shows that having more investors involved reduces the information asymmetry between the investors and the targets. The dollar value of raised capital, the reputation of investors and the number of involved parties are therefore the investor-related variables for which we control in our models.

Team Characteristics

Besides the investor characteristics, company success largely depends on the characteristics of the management team. Previous experience of the founders is particularly important, as it provides the team with relevant skills, relationships, and information that is not codified and can only be gained through practice. Delmar (2006) tests the effects of previous experience on the survival and sales of new ventures and finds evidence that it enhances both. A common way to define previous success of the founding team is to track their previous involvement in acquisitions or IPOs (Hunter and Zaman (2017)). Another characteristic that has been shown to be a factor of company performance is the team's diversity. Higher diversity should have a positive effect, as the team is able to cover a broader range of skills. Beckman and Burton (2008) find that both the experience and the diversity of the founding team are predictive both over time and at a particular point in time. Lastly, Eesley et al. (2013) include higher-education indicators as control variables in their models, and Hunter and Zaman (2017) introduce a new feature based on textual information that measures the solution-education overlap. In Table 4.1 we summarize the above-mentioned variables and outline their definitions based on Crunchbase data.

Variable                       | Suggested by                                                        | Definition using Crunchbase data
Exit (target)                  | Hochberg et al. (2007), Da Rin et al. (2011), Eesley et al. (2013)  | Acquired or IPO
Investor Characteristics
Ability to Raise Capital       | Cheng et al. (2016)                                                 | Raised amount at first round; count of involved investors at first round
Investor Reputation            | Nahata (2008)                                                       | Share of IPOs & acquisitions
Venture Funding                | Beckman (2006)                                                      | Venture funding or other type
Team Characteristics
Education of Founders          | Hunter and Zaman (2017)                                             | Highest degree (none, bachelor, master, doctoral)
Experience of Founders         | Eesley et al. (2013)                                                | Previous exit (IPO or acquisition)
Solution-Education Overlap *   | Hunter and Zaman (2017)                                             | Similarity between solution and education keywords
Diversity of Team *            | Beckman (2006)                                                      | Count of education clusters covered

* Asterisked variables are constructed with natural language processing.

Table 4.1: Predictors of Company Success – overview. Various studies have analyzed investor and team characteristics in predictive models.

4.2 Studies Using The Crunchbase Dataset

The Crunchbase database was launched in 2011 in order to track startups mentioned on TechCrunch and quickly became the industry standard for market research on small technology companies. Since then, various academic studies have analyzed the startup information using diverse approaches. Some have dealt only with the textual information and tried to apply supervised learning on company descriptions to infer the Crunchbase categorization of the companies. This is a challenging task, because the company descriptions on Crunchbase are short and diverse and the number of distinct classes is high. As we shall see in our analysis, another reason is that companies which provide similar business solutions may be found across various categories. For example, companies that provide videochat solutions may fall into the categories other, web, messaging, news, or games video, whereas their actual descriptions clearly describe videochat companies (see Appendix 9.1). To deal with these challenges, Batista and Carvalho (2013) introduce two fuzzy-fingerprint-based text-classification techniques. These provide a 10–20% improvement over standard classifiers (Naive Bayes, Support Vector Machines), with the highest accuracy rate of 44.3%. In spite of the improvement, the accuracy rates remain rather low and demonstrate the difficulty of dealing with textual information on Crunchbase. Other studies went beyond text classification and combined text features with numeric and categorical data in order to predict the probability of success. Xiang et al. (2012) link Crunchbase data to TechCrunch articles and apply Latent Dirichlet Allocation to extract numeric representations. The features generated from the TechCrunch articles are used along with numeric features from the Crunchbase database in the prediction of company acquisitions. The performance of the predictions is compared category-wise, with an improvement of true-positive rates for only a few categories; in the majority of cases, the performance remains the same. Most recently, Hunter and Zaman (2017) have developed a framework for venture investment by combining Crunchbase, Pitchbook and LinkedIn data. Instead of solving a classification problem and calculating probabilities of exit for each company, Hunter and Zaman (2017) consider a portfolio selection problem. Their goal is to choose a portfolio of fixed size such that the probability that at least one company in the portfolio achieves an exit is maximized. Such a case they refer to as winning. They develop a novel model based on Brownian motion first passage times, fit it to data amalgamated from Crunchbase, Pitchbook and LinkedIn, and achieve exit rates as high as 60%.

4.3 Text Features in Econometrics

The final section of the literature review covers sentiment analysis in financial econometrics. The text sentiment in corporate disclosures, daily financial news or internet messages represents the author's opinion on the firm of interest, so once extracted, it can be tested against the performance or behavior of a company. This is achieved with linear regression models (Antweiler and Frank (2004)), vector autoregression models (Tetlock (2007)) or volatility models (Antweiler and Frank (2004)). Sentiment analysis begins with predicting the polarity of a given text at the word, sentence, or document level. This is usually a binary (sentiment +/−) or 3-class (sentiment +/0/−) classification task. The suitability of each set-up depends on the characteristics of the data. Corporate disclosures are often used for the extraction of sentiment, because the releases come from insiders who have more information about the firm than others (Kearney and Liu (2014)). The sentiment encoded in daily financial news reflects the opinion of journalists and can be relevant for studying performance indicators, such as stock prices, sales or returns. Antweiler and Frank (2004) perform sentiment analysis of internet stock message boards to predict market volatility and returns. They begin with a manual classification of a subset of messages and then use Naive Bayes to classify an unlabeled corpus of 1.5 million messages. They find that the sentiment within the internet stock message boards is predictive of market volatility but does not have any predictive power for returns. On the other hand, Tetlock (2007) develops a sentiment indicator which is found to have predictive power for both earnings and equity returns. The corpus includes news from the Wall Street Journal and Dow Jones News, and the sentiment is measured as the fraction of negative words. Besides the firm level, financial news can be aggregated for multiple entities in order to study performance or behavior on an industry or market level. Finally, internet messages, such as postings on corporate Twitter, LinkedIn or Facebook accounts, contain the sentiment of users and provide additional space for analysis. This information is often noisy, as it comes from different sources at the same time (Das and Chen (2007)).

5 Data

We collect our data on startup companies from the Crunchbase pre-2014 dump. For each company, we take the performance information, the company keywords that are relevant to the business solution it provides, the education majors of the founders, and additional variables on team and investor characteristics that have previously been used in studies predicting company success. For the training of the Doc2Vec model we use an additional source of data – a Wikipedia corpus of linked articles. The overall scheme with both data sources is presented in Figure 5.1. From the Crunchbase dump we use two kinds of data: structured information (numeric or categorical) and unstructured information (company keywords and education majors). The structured information from Crunchbase is used to construct features related to company and investor characteristics, while the unstructured information is pre-processed using the Doc2Vec method to obtain the new textual variables.

Figure 5.1: Overall Scheme. The two sources of information are represented by the layered objects: the Crunchbase dump and the Wikipedia corpus. Company keywords and education majors are pre-processed using the Doc2Vec model. Doc2Vec takes as input the 36,509 linked Wikipedia articles collected via the Wikimedia API. The outputs of Doc2Vec are fixed-length vector representations for each company keyword and education major. These are then used for constructing our features of interest – team's diversity and solution-education overlap. The constructed variables are used in two classification models: logistic regression and support vector machines. These models are compared to one another using the DeLong test.

5.1 Crunchbase Dataset

The Crunchbase dataset contains information on companies, founding team members and their education, investors, funding rounds, company acquisitions and initial public offerings. Besides the current affiliations, we are able to backtrack past affiliations of team members and past involvements of investors. We work with the pre-2014 dump and use data on companies that:

• Received first funding in the categories venture, angel, series-a, crowdfunding or other. This restriction eliminates companies whose first recorded funding concerns later rounds, e.g. series-b, series-c+, post-ipo and private-equity investments, as we aim to capture information about new technology companies available at the point of first investment.

• Have information on the founding team and their education, so that we can construct the team's diversity indicator and the solution-education overlap

• Have keywords or a company description, so that we can construct the solution-education overlap

• Received first funding in 2003 or later. Crunchbase was launched in 2011; therefore some companies that were closed before this year may not be included in the database. To mitigate this challenge, we select an 8-year cutoff before 2011, which represents the upper quartile of time-to-IPO in our sample.
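To make these selection criteria concrete, a minimal pandas sketch is given below. The column names (first_round_type, founder_education, keywords, description, first_funding_at) are hypothetical stand-ins, since the pre-2014 dump has its own schema.

import pandas as pd

# Hypothetical column names; the actual pre-2014 dump uses its own schema.
companies = pd.read_csv("crunchbase_companies.csv", parse_dates=["first_funding_at"])

early_rounds = {"venture", "angel", "series-a", "crowdfunding", "other"}
mask = (
    companies["first_round_type"].isin(early_rounds)       # first funding in an early round
    & companies["founder_education"].notna()               # founding team and education known
    & (companies["keywords"].notna() | companies["description"].notna())
    & (companies["first_funding_at"].dt.year >= 2003)      # funded in 2003 or later
)
sample = companies[mask]  # in the thesis this selection yields 6,982 companies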

The final dataset contains information on 6,982 companies. A company is considered successful if it was acquired or went public through an IPO. In our Crunchbase sample of companies, the proportion of companies that exited via acquisition or IPO is 7.8%. Figure 5.2 shows the distribution of successful companies over time by the date funded.


We see that the proportion of successful companies in Figure 5.2 varies over time. This points at likely censoring of the data: companies that were funded towards the end of the period may not have had enough time to develop. To correct this limitation, we scrape the status information on the respective companies to determine which ones went public or were acquired after 2013. With the updated information, the fraction of successful companies changed from 7.8% to 16.5%, as presented in Figure 5.3 by the increase in the number of successful companies from 545 to 1,152.

Figure 5.3: Successful vs. Unsuccessful companies before and after update. The proportion of companies treated as successful doubled.

We distinguish two sets of independent variables: numeric features, taken directly from the Crunchbase database, are defined in Section 5.2; text features, which require pre-processing, are defined in Section 5.3.

5.2 Numeric Features

The numeric features are based on the previous research summarized in Section 4.1. Six numeric variables related to investor and team characteristics are constructed from the Crunchbase database. Two of them are expected to play a major role: the previous exit of the founding team and the previous exit rate of the involved investors. To capture the companies at similar points of development, we only consider the information available at the point of their first investment rounds.

• Previous Success of Founding Team: Binary indicator, whether any member of the founding team, excl. advisors and investors, has been previously involved in an exited company. Special care was taken to ensure that the causality is not violated and only prior exits are considered. Expected Sign (+).

• Previous Success of Involved Investors: Highest previous exit rate of the involved investors – we consider all investors involved in the company's first round and take the highest exit rate. Ideally, we would measure the performance of an investor directly, e.g. by the internal rate of return. However, this information is not available. Therefore, we rely on the indirect measurement under the assumption that exit rates are proportional to capital gains. As before, we only consider prior performance and do not take into account any events happening afterwards. Expected Sign (+).

• Highest Degree: Highest degree achieved by the founding team. This is a categorical variable that is split into two binary indicators in the predictions – highest degree Master's (h_degree2) and highest degree Doctoral (h_degree3). Expected Sign of both binary indicators (+).

• Participants: Number of involved investors during the first financing. Expected Sign (+).

• Raised Amount: Natural logarithm of the amount raised during the first financing. The logarithm is used for scaling and interpretation purposes. Expected Sign (+).

• Startup Hub: Technology companies tend to be located in a small number of geographic regions. Startup hubs such as San Francisco, London and Berlin have a substantially higher concentration of young technology companies than other locations. Modern growth theories suggest that firms located in such centers benefit from better infrastructure, interconnectedness and a developed network of financial services, and therefore being based in a startup hub may have a positive effect on a company's performance. Expected Sign (+).

5.3 Text Features

We work with two textual fields for each company: keyword lists and education majors of founders. First, we link the company keywords, such as videochat or chatrooms to Wikipedia articles and create vector representations of each keyword. In order to define team’s diversity and solution-education overlap, we need to understand in which area the business operates and in which areas the founders are educated. Therefore, each company obtains two sets of vector representations: one that relates to its area of operation (keyword representations) and another that relates to the education majors of team members (education major representations).

1. Keyword Representations – 36,509 keywords have been linked to articles on Wikipedia. To ensure that the keywords are representative, only keywords occurring more than 3 times across the database are considered. 595 keywords are eliminated because they represent locations or given names. This leaves us with a final set of 35,914 keywords. Note that 23% of companies had an empty keyword list. To include these companies in the analysis, we also extract keywords from company descriptions to complete the missing information.


2. Education-Major Representations – Education majors on Crunchbase are less heterogeneous than company keywords, but still require careful inspection. First, some observations have to be removed; e.g. user inputs of the form M.Sc., B.S., or Degree are not indicative of any specific field of study. Second, some people list multiple fields of study, e.g. Computer Science and Economics. These are considered as two distinct areas and both enter the analysis.
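As an illustration of the linking step, the sketch below retrieves article plain text through the public MediaWiki API (the endpoint and parameters are real). The one-to-one mapping from a keyword to an article title is a simplifying assumption; in practice disambiguation requires manual care.

import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_article_text(title: str) -> str:
    """Fetch the plain-text body of a Wikipedia article via the MediaWiki API."""
    params = {
        "action": "query", "format": "json",
        "prop": "extracts", "explaintext": 1,
        "redirects": 1,  # e.g. 'videochat' may redirect to 'Videotelephony'
        "titles": title,
    }
    pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

# Each linked keyword or education major becomes one long training document.
corpus = {kw: fetch_article_text(kw) for kw in ["Videotelephony", "Cloud computing", "Economics"]}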

Having obtained keyword and education-major representations allows us to define new variables: team’s diversity and solution-education overlap.

Team’s Diversity

Team diversity measures how many different areas of expertise the founding team covers. Some studies distinguish among technical, non-technical, and mixed teams. Eesley et al. (2013) set up a questionnaire survey with five groups: technology, finance, sales, marketing, or other, and count how many of them are covered by each company. In this thesis, we use k-means clustering and optimize k with respect to its predictive ability in the final model. Inspecting the clusters in Table 5.1 yields a tech–nontech split for k = 2, a natural-science–social-science–business split for k = 3, and more niche groupings for higher k. Notice that business-related fields, such as Investment Management, Financial Accounting or Corporate Law, are grouped together on levels 2, 3, and 7, while more niche grouping separates them into distinct categories. Similarly, highly technical fields, such as Computer Science, Chemical Engineering, or Mathematics, are clustered together on levels 2, 3, 7, and 12, while the more niche grouping separates them on level 15. Expected Sign (+).

Subject                          k=15   k=12   k=7   k=3   k=2
Investment Management              0      5     5     2     1
Financial Accounting               1      1     5     2     1
Computer Science                   2      7     1     0     0
Biology                            3      2     2     0     0
Digital Marketing                  4     10     4     2     1
Telecommunications Engineering     5      7     1     0     0
Linguistics                        6      0     3     1     1
Chemical Engineering               7      7     1     0     0
Mathematics                        8      7     1     0     0
Executive Education                9     11     6     2     1
Astrophysics                      10      8     2     0     0
Systems Engineering               11      7     1     0     0
Corporate Law                     12      4     5     2     1
Electrical Engineering            13      7     1     0     0
Political Science                 14      0     3     1     1

Table 5.1: Cluster assignments of selected education majors for different numbers of clusters k.
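A minimal sketch of how the diversity indicator can be constructed, assuming the education-major vectors from the trained Doc2Vec model are available (random stand-ins are used here so the snippet runs on its own):

import numpy as np
from sklearn.cluster import KMeans

# Stand-in vectors; in the thesis each major is represented by its 300-dim Doc2Vec vector.
rng = np.random.default_rng(0)
majors = ["Computer Science", "Economics", "Biology", "Corporate Law",
          "Mathematics", "Linguistics", "Digital Marketing", "Astrophysics"]
major_vectors = {m: rng.normal(size=300) for m in majors}

k = 6  # optimized between 2 and 10 in the thesis
X = np.array([major_vectors[m] for m in majors])
cluster_of = dict(zip(majors, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)))

def team_diversity(team_majors):
    """div_k: count how many distinct education clusters the founding team covers."""
    return len({cluster_of[m] for m in team_majors if m in cluster_of})

print(team_diversity(["Computer Science", "Economics"]))  # 2 if the majors fall in different clusters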


Solution-Education Overlap

For each education major–business solution pair we calculate the cosine similarity and then take the maximum value. We hypothesise that having at least one founding member who is educated in a field close to the area of operation increases the chances of being successful. Expected Sign (+).
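The overlap computation itself reduces to a maximum over pairwise cosine similarities; a small self-contained sketch (the vectors below are random stand-ins for the Doc2Vec representations):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solution_education_overlap(keyword_vecs, major_vecs):
    """Maximum cosine similarity over all (business solution, education major) pairs."""
    return max(cosine(k, m) for k in keyword_vecs for m in major_vecs)

rng = np.random.default_rng(1)
keyword_vecs = [rng.normal(size=300) for _ in range(3)]  # company keyword representations
major_vecs = [rng.normal(size=300) for _ in range(2)]    # founders' education-major representations
print(solution_education_overlap(keyword_vecs, major_vecs))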

To demonstrate the structure of this vector space visually, we average the keyword representations for each company and project the resulting business-solution vectors down to two dimensions with principal component analysis. Four randomly chosen clusters from the k-means clustering (k = 50) are plotted in Figure 5.4.

Figure 5.4: Two principal components of 4 clusters from the k-means algorithm with k = 50. The first group (black) coincides with education-related startups (common keywords include: college, university, gaming, learning, education). The second group (red) coincides with cloud services (common keywords include: hosted-it-systems, hosted-systems, it-support, cloud-computing). The third group (green) coincides with shopping-related startups (common keywords include: ecommerce, shopping, online-shopping, retail-shopping). The fourth group (blue) coincides with business-intelligence-related startups (common keywords include: business-intelligence, analytic-applications, competitive-intelligence, market-research). For a sample company and its nearest neighbors, see Appendix 9.1.
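A sketch of the projection behind Figure 5.4, with random stand-ins in place of the averaged Doc2Vec keyword representations:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-ins for the per-company averaged keyword vectors (one 300-dim vector per company).
rng = np.random.default_rng(2)
solution_vectors = rng.normal(size=(500, 300))

labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(solution_vectors)
coords = PCA(n_components=2).fit_transform(solution_vectors)

# Plot four of the fifty clusters, as in Figure 5.4.
for c, color in zip([0, 1, 2, 3], ["black", "red", "green", "blue"]):
    pts = coords[labels == c]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, color=color, label=f"cluster {c}")
plt.legend()
plt.show()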

6 Methods

We use one natural language processing method, Doc2Vec, for processing the textual information, and two classification techniques for predicting success: logistic regression and support vector machines. The Doc2Vec method was developed as an extension of the framework for vector representations of words. Therefore, we first summarize the underlying theory of vector representations of words, outline the extension to document representations, and finally provide an overview of both classification techniques.

6.1 Natural Language Processing

6.1.1 Vector Representations of Words

A framework for vector representations of words has been proposed by the research group around Tomas Mikolov (Mikolov et al. (2013)). Given a sequence of training words $w_1, w_2, \ldots, w_T$, let $w_t$ be the target word and $h = \{w_{t-k}, \ldots, w_{t+k}\}$ its context of window length $k$. The objective is to maximize the average log probability:

$$\frac{1}{T} \sum_{t=k}^{T-k} \log P(w_t \mid h) \tag{6.1}$$

The above specification, a continuous bag of words, predicts a target word given context words. The alternative is a skip-gram specification, which works the opposite way, i.e. predicts context words given a target word. The probability term in the objective function can be expressed with a multiclass classifier, such as the softmax function. In the continuous bag of words specification the softmax reads:

$$P(w_t \mid h) = \frac{\exp(w_t \cdot h)}{\sum_{w' \in W} \exp(w' \cdot h)} \tag{6.2}$$

Vectors are randomly initialized and the objective function is optimized using stochastic gradient descent. The calculation of $\nabla \log P(w_t \mid h)$ can be costly in larger corpora, as it is proportional to the vocabulary size $W$. Therefore, in practice, hierarchical softmax or negative sampling are used for faster training. Training of word vectors results in words with similar meanings being mapped into similar positions in the vector space.
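These training choices map directly onto common toolkits. A minimal gensim sketch (the thesis does not name its implementation, so gensim is our assumption) showing the CBOW/skip-gram switch and the hierarchical-softmax versus negative-sampling options:

from gensim.models import Word2Vec

# Toy corpus; the thesis trains on linked Wikipedia articles instead.
sentences = [["startup", "raises", "venture", "capital"],
             ["founders", "study", "computer", "science"]]

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=10,         # context window length k
    sg=0,              # 0 = continuous bag of words, 1 = skip-gram
    hs=0, negative=5,  # negative sampling instead of hierarchical softmax
    min_count=1,
)
print(model.wv.most_similar("venture"))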


Several techniques have been proposed for extending the word-level framework to sentence representations, e.g. combining word vectors in an order given by the parse tree of a sentence (Socher et al. (2011)) or, more recently, applying convolutional neural networks trained on top of pre-trained word vectors (Kim (2014)). In these studies, input texts are restricted to short sentences, and the approaches would not work well for texts that vary in size. Document representations (Doc2Vec) were proposed by Le and Mikolov (2014) and provide a state-of-the-art solution for representing longer documents.

6.1.2 Doc2Vec

Given a sequence of training words from a document, the objective is to maximize the average log probability given in equation (6.1). The only difference is that the context is a concatenation or summation of a new document vector with several word vectors from the document. Contexts are generated from the document using a sliding window over the document. The document vector is shared by all contexts within the document but not across the corpus; the word vectors, on the other hand, are shared across the corpus. Doc2Vec comes in two variants: Distributed Bag of Words (DBOW) and Distributed Memory (DM). DBOW disregards syntax and treats each document as a collection of words. The more complex DM model predicts a context word given the concatenation of the document vector and word vectors. The Doc2Vec method has performed strongly in comparison to other state-of-the-art methods (Lau and Baldwin (2016)), who also suggest that Doc2Vec should perform particularly well on longer documents. Following up on their results, we choose to collect an external corpus of documents from Wikipedia and then apply the Doc2Vec technique to generate vector representations.
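A minimal gensim sketch of this setup (again assuming gensim as the tooling; the article texts are stand-ins), where each linked Wikipedia article is one tagged document:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each linked Wikipedia article becomes one tagged document; the texts here are stand-ins.
docs = [
    TaggedDocument(words="video telephony is real time audio visual communication".split(),
                   tags=["videochat"]),
    TaggedDocument(words="economics studies the production and consumption of goods".split(),
                   tags=["economics"]),
]

model = Doc2Vec(
    docs,
    dm=1,             # 1 = Distributed Memory, 0 = Distributed Bag of Words
    vector_size=300,  # fixed at 300 throughout the thesis
    window=10, negative=5, sample=1e-5, min_count=1,
    epochs=20,
)
vec = model.dv["videochat"]  # fixed-length representation of the keyword's article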

6.2 Estimation

Two estimation techniques are used for classification: logistic regression and Support Vector Machines. Having two methods ensures that the results are robust with respect to the classifier choice. In order to choose a proper threshold in both classification algorithms, an analysis of the receiver operating characteristic is performed. The difference between logistic regression and Support Vector Machines is tested using the DeLong test.

6.2.1 Logistic Regression

Given the binary response variable, the residuals cannot be normally distributed and therefore the standard linear model is not applicable. Instead we use logistic regression, which assumes the conditional distribution $P(y \mid X)$ to be binomial. In this model, the predicted values are probabilities, restricted between 0 and 1 via the logistic function $\sigma$. The model takes the form

$$P[y = 1 \mid X = x] = \frac{\exp(x'\beta)}{\exp(x'\beta) + 1} = \frac{1}{1 + \exp(-x'\beta)} \tag{6.3}$$

Company success is rather rare. In our Crunchbase sample of companies that received some form of funding, the proportion of companies that exited via acquisition or IPO is 16.5%. The logistic regression may underestimate the probability of success and therefore we add a penalizer for misclassifications. Penalized regression puts a constraint on the coefficients to balance the bias-variance tradeoff. The optimization problems with lasso (6.4) and ridge (6.5) penalizers read:

$$\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{6.4}$$

$$\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{6.5}$$

where the second term is the penalty function. Note that setting $\lambda = 0$ recovers the original model without penalty.
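A sketch of penalized logistic regression with scikit-learn (our assumed tooling), which parameterizes the penalty strength as C = 1/λ; the data are random stand-ins:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # stand-in features (diversity, overlap, controls)
y = (rng.random(500) < 0.165).astype(int)  # roughly 16.5% successes, as in our sample

lam = 5e-3                                 # penalty strength lambda
lasso = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear").fit(X, y)
ridge = LogisticRegression(penalty="l2", C=1.0 / lam).fit(X, y)
print(lasso.coef_.round(3))                # the L1 penalty shrinks some coefficients exactly to zero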

6.2.2 Support Vector Machines

As an alternative to logistic regression we use Support Vector Machines, a machine learning algorithm that has previously been applied to a wide range of classification tasks. Some studies have reported more accurate results with Support Vector Machines, while others have achieved better results with alternative methods. Support Vector Machines have a few favorable attributes: through the use of kernel functions they are able to deal with nonlinearities, as opposed to linear regression models; they result in a unique minimum; and they create sparse solutions. Given the set of companies that belong to class 1 or 0, the goal is to find a separating hyperplane with the highest possible margin. In the linear case, let $w$ be a weight vector, $b$ a bias, and $\phi(x)$ a fixed feature-space transformation. Then the hyperplane takes the form:

$$y(x) = w^T \phi(x) + b \tag{6.6}$$

A Lagrangian function is used for the optimization of the parameters $w$ and $b$. Support Vector Machines are suitable for non-linear problems – using the kernel trick they are able to map input features into a high-dimensional feature space and create a non-linear separating hyperplane. One of the following kernel functions is typically used: linear, polynomial, radial or sigmoidal. However, any transformation that fulfils the properties of a valid kernel may be used in practice.
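A scikit-learn sketch of an RBF-kernel SVM on synthetic data; mapping scikit-learn's gamma onto the thesis's kernel parameter σ is our assumption:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data, roughly mirroring the 83.5% unsuccessful share.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.835], random_state=0)

clf = SVC(kernel="rbf", gamma=1e-1, C=10, probability=True).fit(X, y)
print(clf.predict_proba(X[:5])[:, 1])  # predicted success probabilities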

6.3 Evaluation Measures

For the classification task, we are interested in how well the classifiers discriminate the successful companies from the unsuccessful ones in terms of predicted probabilities. Logistic regression and Support Vector Machines are evaluated and compared to one another using the area under the receiver operating characteristic curve (AUC). The AUC is the most common evaluation metric for assessing the performance of generalized linear regression models on a binary classification task (Steyerberg et al. (2010)). To define the AUC explicitly, we begin with the confusion matrix in Table 6.1. The confusion matrix is a 2×2 table that splits the classification results into four categories: true positives (TP) are successful companies that were correctly classified as successful, and true negatives (TN) are unsuccessful companies that were correctly classified as unsuccessful. False positives (FP) and false negatives (FN) are their misclassified counterparts.

                        Actual Success        Actual No Success
Predicted Success       True Positive (TP)    False Positive (FP)
Predicted No Success    False Negative (FN)   True Negative (TN)

Table 6.1: Confusion Matrix

From the confusion matrix we can obtain the true positive rate (TPR) and the false positive rate (FPR), which form the basis of the AUC.

• TPR = TP / (TP + FN) is the proportion of successful companies that are correctly classified as successful, i.e. the sensitivity.

• FPR = FP / (FP + TN) is the proportion of unsuccessful companies that are incorrectly classified as successful, i.e. 1 − specificity.

Whether we classify a company as successful or unsuccessful depends on a threshold value t. For instance, the logistic regression might produce a predicted probability p for a given company. If the threshold value t is greater than p, we classify this company as unsuccessful. Conversely, if the threshold value t is lower than p, we classify it as successful. The receiver operating characteristic curve is a commonly used plot that shows the tradeoff between the sensitivity and the specificity as we vary the parameters of a classification rule. The AUC is defined as

$$AUC = \int_{-\infty}^{\infty} \mathrm{TPR}(t) \cdot \mathrm{FPR}'(t) \, dt \tag{6.7}$$

where t represents the threshold parameter. It can be shown that the AUC is equivalent to the Mann-Whitney U statistic for the median difference between the prediction scores of the two groups (Hanley and McNeil (1982)). Therefore, the AUC can be interpreted as the probability that a randomly chosen successful company will be ranked higher than a randomly chosen unsuccessful company. AUC values typically fall into the interval [1/2, 1]. The ideal value of the AUC would be 1, representing perfect classification. On the other hand, an AUC equal to 1/2 would mean that our models perform only as well as random classifiers.

Accuracy Measures

Standard accuracy is the second measure for evaluation of binary classifiers, defined as

$$\text{standard accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6.8}$$

It measures the proportion of correctly classified instances across the whole set. For imbalanced data, the standard accuracy measure may lead to far too optimistic estimates by disregarding the smaller class. For instance, consider the case of classifying all companies in our dataset as unsuccessful – this would lead to 83.5% accuracy, driven by the true negatives, while the number of true positives would be zero. To take the imbalance into account, we consider the balanced accuracy measure (Brodersen et al. (2010)), defined as a weighted average of the accuracy on the two classes:

$$\text{balanced accuracy} = c \cdot \frac{TP}{TP + FN} + (1 - c) \cdot \frac{TN}{FP + TN} \tag{6.9}$$

Instead of treating all instances equally, the balanced accuracy computes a convex combination of sensitivity and specificity. The misclassification cost $c \in [0, 1]$ can give a preference to sensitivity or specificity. In this work we consider the simple case and set c = 1/2.
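These measures can be computed directly from predicted probabilities; a small sketch with toy data, using scikit-learn for the AUC and the confusion matrix:

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])                   # toy labels
p_hat = np.array([0.2, 0.4, 0.3, 0.5, 0.6, 0.1, 0.45, 0.35])  # toy predicted probabilities

print(roc_auc_score(y_true, p_hat))                 # area under the ROC curve

t = 0.40                                            # classification threshold
tn, fp, fn, tp = confusion_matrix(y_true, (p_hat >= t).astype(int)).ravel()
standard_accuracy = (tp + tn) / (tp + tn + fp + fn)              # equation (6.8)
balanced_accuracy = 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp)  # equation (6.9), c = 1/2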

6.4 Parameter Optimization

Accuracy rates depend to a large extent on the parameter choices of our classifiers. We first separate 20% of the data as a test set, and then perform 5-fold cross-validation on the remaining 80%. In logistic regression, two penalizers control for overfitting: ridge and lasso. In both cases, the parameter λ controls the overall strength of the penalty. In the SVM there are two parameters to optimize: the kernel parameter σ and the cost parameter C. The kernel parameter defines the influence of a single training-set observation: the higher the σ, the lower the influence. The penalty parameter C represents the tradeoff between misclassification rates and the complexity of the decision surface. Additionally, we optimize the parameters of Doc2Vec and k-means.


For the Doc2Vec representations, we set up a small grid around the hyper-parameter settings suggested for semantic textual similarity applications by Lau and Baldwin (2016). Three parameters are optimized: window size – the width of the left and right context, sub-sampling – the threshold for down-sampling frequent words, and negative sample – the number of negative word samples. The parameter vector size is fixed at 300 dimensions throughout the analysis, and min. count – the minimum frequency threshold for word types – is fixed at 1. Education clusters are generated in order to define the new variable team's diversity. We choose a grid of values between 2 and 10 to determine which level performs best in the final predictions. An overview of the hyper-parameters is given in Table 6.2.

          Hyperparameter                 Grid                           Optimum
Doc2Vec   Window size                    10, 15, 20                     10
          Vector size                    300                            300
          Min. count                     1                              1
          Sub-sampling                   10^-4, 10^-5, 10^-6            10^-5
          Negative sample                3, 5, 7                        5
K-Means   Number of Education Clusters   [2, 10]                        6
LR        λ – Penalizer                  {10^-3, 2·10^-3, ..., 10^-1}   2·10^-2 (lasso), 5·10^-3 (ridge)
SVM       σ – Kernel parameter           {10^-5, 10^-4, ..., 10^3}      10^-1
          C – Penalty parameter          {10^-5, 10^-4, ..., 10^3}      10

Table 6.2: Hyper-parameter grids and optima.
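The overall tuning procedure can be sketched with scikit-learn's GridSearchCV (our assumed tooling), holding out 20% as a test set and cross-validating on the rest:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.835], random_state=0)  # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)  # 20% test set

grid = {"gamma": [10.0 ** e for e in range(-5, 4)],  # kernel parameter grid
        "C": [10.0 ** e for e in range(-5, 4)]}      # penalty parameter grid
search = GridSearchCV(SVC(), grid, cv=5, scoring="balanced_accuracy")
search.fit(X_tr, y_tr)
print(search.best_params_)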

7 Results

In this chapter we discuss the results of the analysis. First, we inspect the textual variables to understand the direction, magnitude and significance of their effects. Second, we report the performance of our models on the classification task and check whether the inclusion of the textual variables led to any improvement. Finally, we show to what extent the performance of logistic regression and support vector machines depends on their parameters.

7.1 Effects of the Textual Variables

To interpret the effects of the textual variables we use the logistic regression. Let us start with only the textual variables. In the case of the diversity indicator, we report multiple specifications for levels 4, 6, 8, and 10, as defined in Section 5.3. For the solution-education overlap we have only one specification, which measures the semantic cosine similarity between the business solution and the education majors. In Table 7.1 we see that both diversity and solution-education overlap have the expected positive signs and are significant in this basic setup.

Table 7.1: Logistic Regression without Control Variables.

                       Dependent variable: status_bool
                    (1)           (2)           (3)           (4)
div_4             0.390***
                 (0.033)
div_6                           0.319***
                               (0.027)
div_8                                         0.264***
                                             (0.021)
div_10                                                      0.256***
                                                           (0.021)
overlap           0.666**       0.612**       0.476*        0.463*
                 (0.260)       (0.262)       (0.265)       (0.266)
Constant         −2.159***     −2.051***     −1.951***     −1.910***
                 (0.135)       (0.135)       (0.135)       (0.135)
Log Likelihood   −3,114.734    −3,114.415    −3,102.251    −3,107.658
Akaike Inf. Crit. 6,235.467     6,234.830     6,210.501     6,221.315

*p<0.1; **p<0.05; ***p<0.01

The effect of a covariate on the chance of company success is multiplicative on the odds scale. The div_4 feature, which counts how many of the four areas of expertise are covered by the founding team, has an odds ratio of exp(0.390) = 1.47, so for every unit increase in the diversity indicator div_4, the odds that the company will be successful are multiplied by 1.47. The effect of diversity is fairly consistent across the different specifications: the directions, coefficients and standard errors of the div_6, div_8 and div_10 indicators are aligned with those of the div_4 indicator. For every unit increase in the solution-education overlap, the odds that the company will be successful are multiplied by exp(0.666) = 1.94. We observe that the effect of the overlap indicator is larger in magnitude than that of diversity; however, the standard errors are as much as ten-fold higher. The effect of diversity retains significance after adding control variables in Table 7.2. On the other hand, the solution-education overlap indicator loses significance when controlling for other factors.

Table 7.2: Logistic Regression with Control Variables.

                                Dependent variable: status_bool
                              (1)          (2)          (3)          (4)
div_4                       0.111***
                           (0.039)
div_6                                    0.100***
                                        (0.031)
div_8                                                 0.086***
                                                     (0.024)
div_10                                                             0.081***
                                                                  (0.024)
overlap                     0.226        0.185        0.145        0.140
                           (0.279)      (0.281)      (0.282)      (0.284)
log(raised_amount_usd)      0.151***     0.151***     0.149***     0.148***
                           (0.021)      (0.021)      (0.021)      (0.021)
difference                  0.045***     0.044***     0.044***     0.044***
                           (0.005)      (0.005)      (0.005)      (0.005)
prev_success                0.456***     0.454***     0.442***     0.451***
                           (0.078)      (0.078)      (0.078)      (0.078)
prev_inv_success            1.411***     1.414***     1.414***     1.419***
                           (0.158)      (0.158)      (0.158)      (0.158)
h_degree2                   0.089        0.085        0.080        0.088
                           (0.101)      (0.101)      (0.101)      (0.101)
h_degree3                   0.159        0.151        0.138        0.149
                           (0.103)      (0.103)      (0.103)      (0.103)
hub                         0.403***     0.406***     0.403***     0.407***
                           (0.072)      (0.072)      (0.072)      (0.072)
Constant                   −4.507***    −4.473***    −4.407***    −4.392***
                           (0.306)      (0.307)      (0.309)      (0.311)
Log Likelihood             −2,888.249   −2,887.271   −2,886.119   −2,886.704
Akaike Inf. Crit.           5,796.499    5,794.541    5,792.239    5,793.408

*p<0.1; **p<0.05; ***p<0.01

Given that the diversity indicator is consistent across the various specifications, we suspect it could provide a positive contribution in the classification task. On the other hand, the solution-education overlap indicator does not seem to exhibit large predictive power. The strongest predictors in the full specification are the previous success of the team (prev_success), the previous success of the investors (prev_inv_success) and whether or not the company is based in a startup hub. The full specification yields a McFadden's pseudo R2 of 0.10, which is on the low side. This can be explained by the fact that we are dealing with information available in companies' young age, when multiple milestones are yet to come. Nevertheless, we still have a model with some predictive power, considering that, in comparison to linear regression, even lower values of McFadden's pseudo R2 still represent a good fit (McFadden and Domencich (1975)).

7.2 Performance on the Classification Task

Associations with a higher chance of success, as presented in the previous section, do not necessarily imply a high value-added to the classification performance of the textual variables. In order to understand whether the textual variables constitute an improvement in the classification, we compare the classification results for the logistic regression with and without the textual variables. As we are interested in the predictive performance, we also compare the classification results of the logistic regression with penalizers and the support vector machines.

7.2.1 Logistic Regression Classifier

In Figure 7.1 we show two boxplots with the classification results of the logistic regression on the test set. On the left-hand side we exclude the textual variables and on the right-hand side we include them. No major change is observable, except for a small change in the discrimination slope (0.09 → 0.10) and upward movements of the upper quartile and the upper whisker for the successful companies. In the model including textual variables, the predictions for the unsuccessful companies have a mean of 0.33 ± 0.15, while the predictions for the successful companies have a mean of 0.43 ± 0.15.

Figure 7.1: Boxplots of the Logistic Regression classification results. Two groups are compared on the test set: unsuccessful companies (actual value 0) and successful companies (actual value 1). The discrimination slope measures the difference between the two means.

A classifier based on the logistic regression requires setting a threshold value. Decreasing the threshold results in a higher TPR, but also comes at the cost of a higher FPR. From an investor's point of view, some degree of FPR may be acceptable, because having one successful company in a portfolio offsets multiple failed investments. Consider, for example, Peter Thiel's 2004 investment of $500,000 in Facebook for 10% of the company, which yielded a 2,000-fold return when the stake was sold for $1 billion in 2012 (Pepitone and Cowley (2012)). As outlined in Section 6.3, for imbalanced data the standard accuracy tends to be driven by the larger class while disregarding the smaller one. In Table 7.3 we present the trade-off between TPR and FPR for the full specification of the logistic regression model. Setting the threshold value to t = 0.50 results in a high specificity of 0.865 and a lower sensitivity of 0.360. Above this threshold, the standard accuracy measure approaches the underlying proportion of unsuccessful companies, 0.835. In fact, the standard accuracy measure is maximized at the extreme case t = 1.00. In contrast, the balanced accuracy is maximized at the threshold value t = 0.40 with a best score of 0.651.

Table 7.3: Evaluation Measures: Standard Logistic Regression (full specification)

threshold   specificity (1−FPR)   sensitivity (TPR)   balanced accuracy   standard accuracy
t=0.00            0.000                 1.000               0.500               0.165
t=0.25            0.308                 0.895               0.601               0.404
t=0.30            0.449                 0.820               0.634               0.509
t=0.35            0.577                 0.702               0.639               0.597
t=0.40            0.709                 0.592               0.651               0.690
t=0.45            0.803                 0.447               0.625               0.745
t=0.50            0.865                 0.360               0.612               0.782
t=1.00            1.000                 0.000               0.500               0.835

Next we consider the whole range of thresholds in an analysis of ROC curves. In Table 7.4 we compare the two cases – with and without the textual variables. We see a slight improvement in the AUC criterion after adding the new variables; however, the DeLong test did not indicate a significant difference between the two cases for the standard logistic regression. After adding penalizers, we observe an improvement in the AUC criterion as well, compared to the standard logistic regression. In the models with textual variables, the AUC is equal to 0.6739 for the non-penalized version, 0.6748 for the lasso regression and 0.6747 for the ridge regression, so the probability that the logistic regression will rank a randomly chosen successful company higher than a randomly chosen unsuccessful company is 67.39%, 67.48%, and 67.47%, respectively. The DeLong test again did not indicate a significant difference between the AUC values for any of the pairs: LR–Lasso, LR–Ridge, Lasso–Ridge.

Table 7.4: AUC with and without the textual variables – Logistic Regression (95% confidence intervals in brackets)

Evaluation Measure   excluding textual vars     including textual vars
LR – AUC             0.6723 [0.6429 – 0.7018]   0.6739 [0.6445 – 0.7034]
Lasso – AUC          0.6728 [0.6432 – 0.7022]   0.6748 [0.6452 – 0.7041]
Ridge – AUC          0.6728 [0.6431 – 0.7021]   0.6747 [0.6454 – 0.7043]


The AUC criteria are calculated at the optimal values of the parameter λ:

• optimum for the Lasso regression: λ = 5 · 10^-3
• optimum for the Ridge regression: λ = 3 · 10^-2

To understand the impact of the penalizers we consider Table 7.5. The ridge penalizer shrinks the coefficients of correlated variables towards each other, whereas the lasso penalizer picks one of the correlated variables and disregards the others. The coefficients of the variables h_degree2 and participants were shrunk to zero in the case of the lasso regression.

Table 7.5: Coefficients in the penalized logistic regression.

                        Lasso     Ridge     Unpenalized
(Intercept)            -0.2434   -0.2553     -0.2721
div 6                   0.0262    0.0261      0.0263
overlap                 0.0153    0.0349      0.0310
lraised amount usd      0.0224    0.0223      0.0225
participants            0.0000    0.0009      0.0006
difference              0.0082    0.0082      0.0085
prev success            0.0695    0.0701      0.0711
prev inv success        0.2765    0.2765      0.2873
h degree2               0.0000   -0.0009      0.0113
h degree3               0.0071    0.0100      0.0219
hub                     0.0688    0.0723      0.0756
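The qualitative lasso-versus-ridge difference can be reproduced with scikit-learn; note that its LogisticRegression parameterizes regularization strength as C (roughly 1/λ), so the mapping to the λ values above is implementation-dependent and the data below are placeholders:

```python
# Qualitative lasso-vs-ridge comparison on placeholder data. scikit-learn's
# LogisticRegression uses C (roughly 1/lambda) as the regularization knob,
# so the mapping to the lambda values above is implementation-dependent.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))    # placeholder standardized features
y = rng.integers(0, 2, 500)       # placeholder success indicator

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
ridge = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

# L1 sets some coefficients exactly to zero; L2 only shrinks them.
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```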

To assess the agreement between the predicted and actual values, we examine the calibration plot in Figure 7.2. The graphical assessment suggests that we overestimate the predicted probabilities for the high-value predictions, as demonstrated by the deviation of the calibration curve from the 45-degree line.

Figure 7.2: Calibration Plot. For instance, among companies with predicted success probabilities around 50%, approximately one out of two should indeed be successful.
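A calibration curve of this kind can be produced, for instance, with scikit-learn; the labels and probabilities below are placeholders for the held-out test set and the model's predictions:

```python
# Sketch of the calibration assessment behind Figure 7.2, with placeholder
# labels and predicted probabilities standing in for the held-out test set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

np.random.seed(0)
p_hat = np.random.rand(1000)                          # placeholder predictions
y_test = (np.random.rand(1000) < p_hat).astype(int)   # placeholder outcomes

frac_pos, mean_pred = calibration_curve(y_test, p_hat, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("mean predicted probability")
plt.ylabel("observed fraction of successes")
plt.legend()
plt.show()
```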



7.2.2 Support Vector Machines Classifier

The parameters of the SVM were optimized with 5-fold cross-validation on 80% of the data with respect to the balanced accuracy measure. The optimized values are presented in Table 7.6. The SVM outperforms the logistic regression on the classification task as measured by the AUC, which is equal to 0.6778 for the support vector machines and 0.6739 for the logistic regression, although the logistic regression achieves slightly better results for lower TPR rates. Inclusion of the textual variables increased the AUC of the SVM from 0.6613 to 0.6778. The balanced accuracy at the optimal parameter values is equal to 0.6260, which is again comparable to that of the logistic regression. A sketch of the tuning procedure follows Table 7.6.

Evaluation Measure    excluding textual vars      including textual vars
AUC                   0.6613 [0.6296 – 0.6931]    0.6778 [0.6475 – 0.7083]
Balanced Accuracy     0.6305                      0.6260
Standard Accuracy     0.6059                      0.6557
TPR                   0.5660                      0.6819
FPR                   0.5790                      0.6870

Table 7.6: Evaluation Measures – Support Vector Machines
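A minimal sketch of the tuning procedure, under the assumption that scikit-learn's gamma plays the role of σ in an RBF kernel of the form exp(−σ‖x − x′‖²); the data and grids below are illustrative placeholders:

```python
# Sketch of the SVM tuning: RBF kernel, 5-fold CV, balanced accuracy as the
# selection criterion. gamma stands in for the thesis's sigma parameter;
# this correspondence and all data below are assumptions for illustration.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # placeholder features
y = rng.integers(0, 2, 500)             # placeholder labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": 10.0 ** np.arange(-3, 2),   # sigma grid
                "C": 10.0 ** np.arange(-2, 3)},      # cost grid
    scoring="balanced_accuracy",
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.best_score_, 4))
```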

In Figure 7.3. we show the dependence of the accuracy measures on the parameters of the SVM. On the left-hand side we plot both the standard and balanced accuracy measures against the parameter σ; the balanced accuracy achieves a unique maximum at σ = 10⁻¹. On the right-hand side, we plot the accuracy measures against the parameter C; the balanced accuracy has a unique maximum at C = 10⁰ = 1.

Figure 7.3: Accuracy measures for different parameter values. Values are averaged over the 5 validation folds. On the left-hand side, the accuracy measures are plotted at C = 1; on the right-hand side, at σ = 10⁻¹.

To evaluate the contribution of the textual predictors we plot the ROC curves for the full model, as well as for the model excluding the textual variables (Figure 7.4.). The DeLong test in Table 7.7 provides only marginal evidence that the models including the textual variables dominate the models without them.



Figure 7.4: ROC curves for the SVM and logistic regression. Classifiers without the textual variables are asterisked. Movement of a curve towards the northwest corner represents a better classifier. In this case, the SVM classifiers dominate the logistic regressions for lower TPR values but perform worse for higher values of TPR. Inclusion of the textual variables has little impact on the logistic regression but improves the SVM classifications for higher TPR values.
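A sketch of how such ROC curves can be drawn; the score vectors below are placeholders for the classifiers' predicted probabilities or decision values:

```python
# Sketch of drawing the ROC curves in Figure 7.4; the score vectors are
# placeholders, not the actual SVM / logistic regression outputs.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

np.random.seed(0)
y = np.random.randint(0, 2, 1000)
scores = {"SVM": np.random.rand(1000), "SVM*": np.random.rand(1000)}

for name, s in scores.items():
    fpr, tpr, _ = roc_curve(y, s)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y, s):.4f})")
plt.plot([0, 1], [0, 1], "k--")          # chance level
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.show()
```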

The results of the DeLong test imply that we do not reject the null hypothesis of equal AUC for any of the following pairs: (i) SVM before and after including the textual variables, (ii) logistic regression before and after including the textual variables, (iii) LR and SVM in their full specifications, (iv) LR and SVM excluding the textual variables – with the exception that the first pair has a borderline p-value of 0.0411 at the 5% significance level, which corresponds to the deviation of the ROC curve for the full SVM specification (black curve in Figure 7.4.) from the ROC curve excluding the textual variables (blue).

Table 7.7: DeLong Test. Asterisked models exclude the textual variables.

Model   AUC      95% CI             Comparison    z-statistic   Significance level
SVM     0.6778   0.6475 to 0.7083   SVM ∼ SVM*    z = 2.0424    p-value = 0.0411
SVM*    0.6613   0.6296 to 0.6931   LR ∼ LR*      z = 0.6601    p-value = 0.5092
LR      0.6739   0.6445 to 0.7034   LR ∼ SVM      z = 0.4063    p-value = 0.6845
LR*     0.6723   0.6429 to 0.7018   LR* ∼ SVM*    z = 0.9147    p-value = 0.3603


8 Conclusion

We have applied the natural language processing technique Doc2Vec to the unstructured textual information contained in the Crunchbase database and transformed the document vectors into two new numeric variables: team's diversity and solution-education overlap. These variables were used on a classification task in a simple logistic regression and in a non-parametric learning model, support vector machines. The results of this thesis are threefold.

(1) Both team's diversity and solution-education overlap show the expected positive impact on company's performance. However, only team's diversity is statistically significant when controlling for other factors.

(2) Adding the new variables to the classification models increased the predictive performance as measured by the area under the receiver operating characteristic curve, but the increase was not statistically significant. The factors that appear to be most important in our setting were categorical and numerical: previous success of the founding team and success rates of the involved investors. In spite of the insignificant improvement over the models with numerical data only, we demonstrate that it is possible to transform the unstructured information on Crunchbase into reasonable representations and use them in predictive modelling. Furthermore, clustering business solutions based on Doc2Vec representations leads to a grouping that is more niche than the original Crunchbase categories and might be beneficial in further research.

(3) Penalized logistic regression, non-penalized logistic regression and support vector machines perform similarly on the classification task, with the best area under the receiver operating characteristic curve of 0.6778 for support vector machines.

We have made several choices for our classification models and accuracy measures. Alternative techniques may perform better on this particular dataset and it might be reasonable to include additional algorithms for comparison, e.g. random forests, neural networks, or relevance vector machines, which lead to even sparser solutions than SVM. In the textual part, it may also be of interest to address ways of linking keywords that were not matched with target articles (for instance, the keyword search-optimization was not matched with the search-engine-optimization article on Wikipedia but would be beneficial).

Estimating the success of companies is a challenging task, and given the available information we are able to discriminate between the two groups only to a limited extent. The setup outlined in this thesis may serve as a basis for further research incorporating one or more of the following: (i) textual information from news, (ii) investor data from later investment rounds, (iii) interim financial results. These may have a positive impact on the performance of the models. Finally, to validate our results, an independent application of the models with and without the new textual variables on a new dataset would be appropriate.


9 Appendix

9.1 Distances in the Doc2Vec Vector Space

In this section, we give an example of companies that are mapped close to each other in the Doc2Vec vector space, while their original Crunchbase categorizations are very diverse. This demonstrates that using Crunchbase keywords can lead to a more niche grouping of companies than the original Crunchbase categories. Let us begin with the company Paltalk. Its keywords are each represented by a 300-dimensional Doc2Vec vector; taking the mean of these keyword vectors, we obtain a unique vector for Paltalk. The short description of the closest company explicitly lists Paltalk as a competitor, so we indeed find a very close company nearby. The other close companies also operate in the area of video platforms.
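A minimal sketch of this construction, assuming a trained gensim (4.x) Doc2Vec model whose document tags are the keyword articles; the model path and the second keyword set are illustrative assumptions:

```python
# Company vectors as means of per-keyword Doc2Vec vectors, compared by
# cosine similarity. Model path and second keyword set are hypothetical.
import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec_keywords.model")   # hypothetical path

def company_vector(keywords, model):
    """Mean of the 300-dimensional Doc2Vec vectors of a company's keywords."""
    vecs = [model.dv[kw] for kw in keywords if kw in model.dv]
    return np.mean(vecs, axis=0)

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

paltalk = company_vector(
    ["videochat", "chatrooms", "video", "chat", "social-networking"], model)
other = company_vector(
    ["webcam", "video", "videoconference", "news", "media"], model)
print(f"cosine similarity: {cos_sim(paltalk, other):.3f}")
```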

cos.sim 94.1%, category: other, keywords: videochat, chatrooms, video, chat
    The SeeToo service enables viewers in multiple locations to watch pre-recorded videos together. SeeToo is synchronized, as if everybody is watching the same screen, and immediate, without the need for prior uploading or any preparations. SeeToo's competitors include PalTalk, and the Meebo Rooms product by Meebo.

cos.sim 89.1%, category: web, keywords: chat, videochat, meet, viral-video, video-platform, video, chat, social-video, social-discovery-platform, webcam, broadcast, stream, live
    Streamup is a platform for sharing video chat rooms. It's a place to hang out together online. Discuss what's on your mind in real-time conversations with dedicated communities from around the world, visit friends, collaborate, and much more. See what's happening right now.

cos.sim 89.1%, category: news, keywords: webcam, video, videoconference, news, media
    Opinews is the first information website that enables users to debate around a topic by webcam. It's the evolution of the communication between the bloggers, the journalists, the fans because it offers an ultimate interactive tool. Each user can create a topic about politics, sport, entertainment, society, International or High Tech and invite the community to have a debate around it at the time he decides. Therefore, the debate is announced in the homepage so every visitor can participate. The Pros can also use Opinews as a videoconferencing tool since it offers them the possibility of doing it in a private room.

cos.sim 88.7%, category: games video, keywords: socialnetworking, video, instructions, howto
    Graspr is a social network where users can share, annotate, and discuss instructional videos. The website premiered in September 2007 at the DEMOfall conference. Competitors include 5min, eHow, Sclipo, SuTree, Expert Village, Instructables, and VideoJug.

cos.sim 88.2%, category: messaging, keywords: video, social-networking
    Fipeo harnesses the power of video communication to help users find and connect with new people locally and throughout the world.

Table 9.1: Closest companies to Paltalk, which comes with the following keywords: {videochat, chatrooms, video, chat, social networking}.



9.2 Descriptive Statistics

Statistic N Mean St. Dev. Min Max

log(raised amount usd) 6,982 15.970 17.593 6.907 21.598

participants 6,982 1.692 2.426 0 32

difference 4,112 4.621 6.761 0 105

status bool 6,982 0.262 0.440 0 1

prev success 6,982 0.600 0.490 0 1

prev inv success 6,982 0.135 0.215 0 1

hub                        6,781   0.410   0.492   0       1

solution-education overlap
overlap                    6,982   0.530   0.124   0.212   0.982

diversity indicators
div 2                      6,982   1.451   0.498   1       2
div 3                      6,982   1.742   0.785   1       3
div 4                      6,982   1.900   0.968   1       4
div 5                      6,982   1.998   1.100   1       5
div 6                      6,982   2.079   1.191   1       6
div 7                      6,982   2.102   1.202   1       7
div 8                      6,982   2.390   1.548   1       8
div 9                      6,982   2.366   1.523   1       9
div 10                     6,982   2.351   1.560   1       10

Participants is the number of investors involved during the first round of funding. Difference denotes the gap between Date Founded and Date Funded; negative instances are set to zero, although it may be reasonable to verify this information against other sources. Log of the raised amount denotes the amount in US dollars raised during the first round. Status bool takes the value 0 if the company has not been acquired and has not gone public through an initial public offering; note that only companies founded before 2013 are considered, and the status information is updated as of July 2017. For the highest degree, we have two binary indicators: doctoral degrees and master degrees. Previous success is generated as follows: for every founding team member, except for advisors and investors, we check their previous affiliations; if at least one of the affiliated companies was acquired or went public before the founding of the new company, we assign 1, otherwise 0. Previous investor success indicates the success rate of the best performing investor from the first investing round; only the portfolio of companies available before the founding date of the company is considered. The overlap indicator measures the semantic cosine similarity of the closest education-business solution pair: for every founding team member we measure the cosine similarity between their field of study and the business solution keywords available for the company, and then take the maximum value. Diversity indicators count the number of areas covered by the founding team; to generate groups of education fields, k-means clustering for k ∈ [2, 10] has been performed. A sketch of these constructions follows.
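The overlap and diversity constructions can be sketched as follows, assuming the field-of-study and solution-keyword vectors come from the same Doc2Vec space; all names and data below are illustrative placeholders:

```python
# Sketch of the overlap (max pairwise cosine similarity) and diversity
# (number of education clusters covered) variables. All data are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def overlap(member_edu_vecs, solution_vecs):
    """Max cosine similarity over all (field of study, solution keyword) pairs."""
    return max(cos_sim(e, s) for e in member_edu_vecs for s in solution_vecs)

def diversity(member_edu_vecs, all_edu_vecs, k=6, seed=0):
    """div_k: number of education clusters covered by the founding team."""
    km = KMeans(n_clusters=k, random_state=seed).fit(all_edu_vecs)
    return len(set(km.predict(np.asarray(member_edu_vecs))))

rng = np.random.default_rng(0)
edu_space = rng.normal(size=(200, 300))   # all education-field vectors
team = rng.normal(size=(3, 300))          # founding team's field-of-study vectors
solutions = rng.normal(size=(4, 300))     # company's solution-keyword vectors
print(overlap(team, solutions), diversity(team, edu_space, k=6))
```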



9.3 Alternative Evaluation Measures

In this section we list alternative evaluation measures to the TPR, FPR, AUC and balanced accuracy:

• Positive Predictive Value (PPV) measures the proportion of positive results that are true positives: PPV = TP / (TP + FP). Its maximal value corresponds to the absence of Type I errors, in contrast to the maximal value of the true positive rate (TPR), which corresponds to the absence of Type II errors. While high values of both PPV and TPR are desirable, they typically trade off against each other.

• F-Score. To balance the trade-off between PPV and TPR, the harmonic mean F-Score is often used. It can give equal weight to both goals or prefer one over the other. Two common variants are the F1 and F2 Scores, where the first gives equal weights and the second weighs TPR more heavily than PPV, as is visible from the formulas (see also the sketch below):

  – F1 Score: F1 = 2 · PPV · TPR / (PPV + TPR)

  – F2 Score: F2 = 5 · PPV · TPR / (4 · PPV + TPR)
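These measures correspond to precision_score and fbeta_score in scikit-learn; a toy example with illustrative labels:

```python
# PPV, F1 and F2 on toy labels; equivalent to precision_score and
# fbeta_score in scikit-learn (labels below are illustrative only).
from sklearn.metrics import fbeta_score, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

ppv = precision_score(y_true, y_pred)         # TP / (TP + FP)
f1 = fbeta_score(y_true, y_pred, beta=1.0)    # equal weight on PPV and TPR
f2 = fbeta_score(y_true, y_pred, beta=2.0)    # weighs TPR (recall) more
print(round(ppv, 3), round(f1, 3), round(f2, 3))
```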

In Figure 9.1 we show the alternative evaluation measures with respect to a varying threshold (cutoff) in the logistic regression. The alternative evaluation measures have practical interpretations: from an investor's point of view, a lower PPV could be acceptable in exchange for a higher TPR.

Figure 9.1: Trade-off between TPR and PPV (logistic regression). The F2 Score has a unique maximum.


Bibliography

Antweiler, W. and Frank, M. Z. Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance, 59(3):1259–1294, 2004. ISSN 00221082, 15406261. URL http://www.jstor.org/stable/3694736.

Batista, F. and Carvalho, J. P. Text based classification of companies in Crunchbase. 2013.

Beckman, C. M. and Burton, M. Founding the future: Path dependence in the evolution of top management teams from founding to IPO. Organization Science, 19(1):3–24, 2008.

Beckman, C. M. The influence of founding team company affiliations on firm behavior. Academy of Management Journal, 49(4):741–758, 2006.

Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. The balanced accuracy and its posterior distribution. 2010 International Conference on Pattern Recognition, 2010.

Cheng, M., Sriramulu, A., Muralidhar, S., Loo, B. T., Huang, L., and Loh, P.-L. Collection, exploration and analysis of crowdfunding social networks. In Proceedings of the Third International Workshop on Exploratory Search in Databases and the Web, ExploreDB '16, pages 25–30, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4312-1. doi: 10.1145/2948674.2948679. URL http://doi.acm.org/10.1145/2948674.2948679.

Da Rin, M., Hellmann, T. F., and Puri, M. A survey of venture capital research. NBER Working Paper Series, 2011.

Delmar, F. Does experience matter? The effect of founding team experience on the survival and sales of newly founded ventures. Organization Science, 2006.

Eesley, C., Hsu, D. H., and Roberts, E. B. Focus or diversify? Aligning founding teams with strategy and environment. Strategic Management Journal, 2013.

Funderbeam. How do you organise 160 000 startups in meaningful clusters. 2016. URL https://wire.funderbeam.com/how-do-you-organise-160-000-startups-in-meaningful-clusters-cb720291d1f0.

Hanley, J. A. and McNeil, B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiological Society of North America, 1982.
