The effect of technical skill on startup success


Master Thesis

Michael Fresco

Supervisor:

. W. (Wietze) van der Aa


Statement of Originality

This document is written by Michael Fresco (5999162), who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used.


GitHub Thesis: The effect of technical skill on startup success.

Michael Fresco

June 23, 2017

Abstract

Through combining two large data sets, this study investigates the effects of technical skill on startup success. In total, 957 Y Combinator startups and their founders (1,262) were analyzed on GitHub, a social code-sharing platform. This study provides evidence that the companies which get acquired tend to have technical founder teams. This was found to be the case in two distinct ways: first, they have the largest amounts of technical skill in their founding teams; second, they tend to have teams in which all founders actively contribute technical expertise. It was also found that the relationship between technical skill and startup success is not linear. The logit models show that, beyond a certain point, adding more ‘technical talent’ to the team did not increase the chance of success. In analyzing the data, it also became apparent that there are two very distinct groups in the data set: very technically skilled founders and far less technically skilled founders. The models show that having a baseline level of technical talent strongly enhances the chances of success. Thus, founders and managers are advised to find talent that meets that baseline requirement.

1 Introduction

Previous research has shown that human capital is an important factor for startup success (Cooper et al., 1994). This research analyzes a specific example of such human capital, namely the ability to ‘program well’. The question raised is whether having ‘capable coders’ is a critical factor for success in tech startups. To answer it, a sample of 957 companies was gathered from the Y Combinator website.


Y Combinator is an American seed accelerator from Silicon Valley. Selecting Y Combinator as the main sample for this research has two important benefits. First, all companies received the same start, namely seed money, connections and advice in exchange for a 7% equity stake. Second, the companies generally have a high-tech orientation, since this is the area in which Y Combinator has specialized.

Human capital can be expressed in multiple ways. One could measure the level of education, or give someone an IQ test. However, Unger et al. (2011) found that measuring relevant skills, and relevant knowledge is the best way to predict success in startups. In the context of high tech startups, which often have a strong software component, focusing on the ability to program well seems highly relevant. Measuring this ability however is quite difficult, but this research proposes that it is possible to use GitHub profile data as an indicator of programming ability. GitHub is an important code sharing platform, which allows individuals to work together on projects. The website also has elements that can be quantified, most notably the GitHub Stars and the GitHub Repos. These statistics will be used to build a proxy for technical skill, which is henceforth used to link ‘technical skill’ to startup success. It should be noted that in this paper, ‘technical skill’ is defined as any skill that relates to programming and the development of software. Previous authors have noted that there are synergies to be gained from having both technical and commercial talent. This topic will be addressed as well.

This study uses a data mining strategy. All data was collected and processed automatically using scripts specifically designed for the purpose. In essence, this research combines two large existing data sets: CrunchBase and GitHub. The list of companies provided by the Y Combinator website was just the starting point for the research. With the company name, it is possible to query the CrunchBase website and gather information about the founders, the status of the company and the amount of money it raised. The CrunchBase data is thus used as an indicator of success. GitHub, on the other hand, provided the information about the technical skill level of the founders. With the combined data set, it was investigated whether having ‘capable coders’ correlates with success in startups. In addition, the relative distribution of talent in the founder team was analyzed.

The automated study design has a few important benefits. First, by using GitHub user profiles as a proxy for technical skill, this research does not rely on self-reporting. As such, the gathered data is unbiased, and arguably an objective measure of technical skill. In addition, there is a certain homogeneity in the sample since all companies were part of Y Combinator. Second, the results should also be relatively valid for other tech startups that were not part of Y Combinator, as long as they have a strong software orientation.

1.1 Previous work

The positive relationship between human capital and startup performance has been well established in the literature (Reuber & Fischer, 1994; Cooper, Gimeno-Gascon, & Woo, 1994; Jo & Lee, 1996; Bosma, Praag, Thurik, & Wit, 2004; Shrader & Siegel, 2007). In a longitudinal study, Cooper et al. (1994) found that general human capital, which focuses on measures such as years of education, positively influenced the survival and growth of firms. Addressing the same relationship, but on a much larger scale, Unger et al. (2011) found additional evidence that human capital matters a great deal when it comes to creating successful companies. The work of Unger et al. is particularly interesting since they conducted a meta-review of the previous academic work on the topic. The authors provide a good framework for understanding the effects of human capital. They distinguish between investments in human capital, e.g. going to school and getting an education, and the outcomes of human capital: skills, knowledge and relevant experience.

This separation between ‘investments’ and ‘outcomes’ has important repercussions for designing a study. Unger et al. hypothesize and show that the outcomes (r=+.204) have a much stronger correlation with success than the investments (r=+.090) in human capital. One way to understand this result is that not everyone who goes to school gets the same knowledge and skills out of it. Given the result of Unger et al., one could argue that it is better to study the outcomes of human capital, since these tend to be more closely correlated with actual success. As Unger et al. put it: “human capital investments are indirect indicators of human capital and are, therefore, one step removed, while knowledge and skills are direct indicators of human capital” (2011).

Many authors, however, take a comprehensive approach when it comes to studying human capital: they study both the investments and the outcomes. Bosma et al., for instance, find that education correlates positively with profitability but not with survivability (2004).


holds for both profitability and survivability. Hence, it is a better indicator for performance. The results between industry experience and general success are replicated by many researchers (Cassar, 2014; Song, Podoynitsyna, Van Der Bij, & Halman, 2008; Baptista, Karaöz, & Mendonça, 2014; Koch, Späth, & Strotmann, 2013; Colombo & Grilli, 2005). The results for business experience vary more. Bosma finds that it enhances the chances of survival, but it has no impact on profitability. In terms of getting funding the results are also mixed. Some authors (Gimmon & Levie, 2010; Colombo & Grilli, 2010; Delmar, 2006) find a positive effect on funding, but there are contradicting findings too (Zhang, 2011). In conclusion, industry experience seems to be a good direct measure for success, but the effects of business experience are not uniform.

A possible explanation is that some of these studies failed to capture the synergetic effects of having a combination of technical and business experience. Many authors (Ganotakis, 2012; Colombo & Grilli, 2005; Crowne, 2002; Marvel & Lumpkin, 2007) find that when technical and commercial talent are combined, the outcome is much better. In the context of high tech firms, Colombo & Grilli (2005) found that business experience alone does not add value in startups, but only when combined with technical expertise. Importantly, they also found that technical experience and business experience were the first and foremost predictors of startup success: “As to professional experience, [high tech startups] established by individuals who have greater work experience in technical functions in the same industry of the new firm and have been involved in prior entrepreneurial ventures exhibit superior growth, with all else equal” (Colombo & Grilli, 2005, p. 812). The positive effect of technical experience on startup success was also noted by Roberts (1991) in the context of MIT startups. Thus, summarizing the literature mentioned so far, it is to be expected that the outcomes of human capital, i.e. technical skill and relevant business know-how, are more likely to predict success than general measures. Furthermore, industry experience plays an important role for startups, particularly when combined with business experience.

1.2 PageRank and GitHub

This study introduces two new measurements: PageRank as an indicator of success, and GitHub profiles to estimate the level of technical prowess of the founders. PageRank was invented by Sergey Brin and Larry Page, and formed the foundation for the Google search engine. The basic mathematics of PageRank are relatively simple. Websites which get ‘highly linked’ are considered more important than pages with few links (Page et al., 1999). The ingenuity of PageRank, however, is that a page's rank is also determined by the quality of the sites that link to it. A ‘backlink’ from a site with a high PageRank is far more valuable than a ‘backlink’ from an average site.
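The recursive idea described above, where a page's rank depends on the rank of the pages linking to it, can be sketched with a simple power iteration. This is only an illustration of the general technique: the damping factor of 0.85 and the toy link graph are assumptions, and nothing here reproduces how the PageRank values in this study were obtained.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate PageRank for a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share regardless of its backlinks.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "a", "b" and "c" all link to "hub",
# while "hub" links back to "a" and "b" only.
toy_web = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = pagerank(toy_web)
```

In this toy graph, "a" ends up ranked above "c" although both have a single outgoing link, because "a" receives a backlink from the highly ranked "hub" and "c" receives none.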

The use of GitHub profiles to make inferences is a relatively novel concept. Kalliamvakou et al. conducted a series of interviews with GitHub users to provide a better understanding of how the users themselves perceive the profiles (2014). For the purposes of this research, two parts will be highlighted. First, Kalliamvakou et al. put forward that the count of the artifacts, i.e. the number of people watching a project, is considered by the users a sign of importance: “The number of people watching a project or people interested in the project, obviously it’s a better project than versus something that has no one else interested in it.” (2014, p.17). Similarly, the volume of activity is considered to be an indicator of interest and commitment: “this guy on Mongoid is just – a machine, he just keeps cranking out code.” (Kalliamvakou et al., 2014, p.23). Taken together with the interviews conducted for this research itself, there seems to be a basis for using GitHub profiles to make inferences about technical skill level, since GitHub users make similar judgements on the same information.

1.3 Using GitHub projects as proxy for technical ability

Public repositories, or ‘Repos’ for short, are projects which are shared with the Internet; generally, these are open source projects. This research maintains that these are the product of a skilled programmer. By analogy with a painter, who produces paintings, the programmer produces software projects. In this sense, a project on GitHub (which is stored in a GitHub repo) is an artifact of a skilled programmer at work. Arguably, quantity does not equal quality, and this research design does not make it possible to differentiate between published projects by quality. However, in discussing this issue with a local programmer who actively uses GitHub, he noted that “[sic] when I publish something on GitHub, I want it to be right. I only publish the very best things on there, and make sure it is bug free”. Therefore, this research maintains that the quantity of GitHub Repos can be seen as an indicator of the ability to program, and hence as an indicator of technical skill.


‘Starred projects’ are projects which the user follows on GitHub. When a user follows many projects, that user will have a high ‘GitHub Stars count’ in the data set. Following a project means that the user is actively interested in getting updates of that project. It is also quite likely that the GitHub user finds the project useful for his or her own software. Many projects are in fact frameworks that form ‘building blocks’ which can be used for various tasks. For instance, there are frameworks for doing text analysis, or processing images. As a consequence, following many projects also means that the user is aware of many different frameworks. This in itself is an indicator of knowledge. This research proposes that the number of starred projects closely aligns with technical skill and expertise.

1.4 Hypotheses

Next, this paper’s main hypotheses will be discussed. It is important for the reader to note that the combined data set exists on two levels. First, there is the company level, which has data on the size of the founding team, the amount of funding received and the status of the company (active, acquired or closed). Second, there is the founder level, where the GitHub profiles are contained. The two most important variables for the founders are the ‘GitHub Repos’ and the ‘GitHub Stars’, which are seen as indicators of technical skill. When analyzing companies with multiple founders, the sum of all GitHub Repos and Stars is calculated for the company. This sum of GitHub Repos and Stars is seen as an indicator of the available technical expertise in the company. As such, companies with higher sums of repos are ranked higher than ones with fewer repos. The same applies to the GitHub Stars. Now, let’s review the hypotheses for this research.

Hypothesis 1a The acquired group has a significantly higher number of Total GitHub Repos and Total GitHub Stars than the active group and closed group.

Here it is argued that companies which fall in the acquired category tend to exhibit a higher number of Total GitHub Repos and Total GitHub Stars than the other groups. The rationale is that the most technically skilled programmers make unique IP (Intellectual Property), which leads to a higher chance of getting acquired. This proposed mechanism is not original to the author, but comes from a discussion with a prominent venture capitalist. He noted that “good programmers do not make good companies, but they make better IP”. Having a special product (i.e. not a derived product) enhances the chances of getting acquired. It should be added that being ‘acquired’ is not the only form of success. Obviously, some of the most successful companies never get acquired (e.g. Google), but just continue to exist independently. In addition, this hypothesis tries to replicate the findings of Roberts (1991) and Colombo & Grilli (2005), who found that technical experience was an important indicator for success in startups. This hypothesis will be tested with the Kruskal-Wallis test, which is a non-parametric method for comparing samples and does not assume that the data are normally distributed.

Hypothesis 1b Having more GitHub Repos statistically increases the likelihood of a company falling under the ‘acquired’ category in the data set.

Hypothesis 1c Having more GitHub Stars statistically increases the likelihood of a company falling under the ‘acquired’ category in the data set.

As a continuation of the first hypothesis, H1b and H1c ask whether the chance of ‘falling under the acquired category’ can be modeled statistically from the number of GitHub Repos or GitHub Stars. Since there is no established basis for using GitHub to measure technical skill, it was decided to run tests for both variables. Equation (1) shows how this logistic regression is set up for hypothesis H1b.

\[
y =
\begin{cases}
1 & \text{if } \beta_0 + \beta_1 \ln(\mathrm{SumRepos}) + \varepsilon > 0 \\
0 & \text{else}
\end{cases}
\tag{1}
\]

The logistic regression model can be understood as finding the parameters such that the error term, ε, gets minimized. Note that the proposed model uses a log-transformed independent variable, ln(SumRepos), as it was found that the variable GitHub Repos relatively closely follows a log-normal distribution.
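To make the role of the log-transformed predictor concrete, the sketch below evaluates the logistic form of such a model for a few repo counts. The slope is the log odds reported for ln(SumRepos) in Table 3; the intercept is not reported in this section, so the value used here is a purely hypothetical placeholder, and the function names are illustrative only.

```python
import math


def acquisition_probability(sum_repos, beta0, beta1):
    """P(acquired) under a logit model with ln(SumRepos) as the predictor."""
    if sum_repos <= 0:
        raise ValueError("ln(SumRepos) is undefined for zero repos")
    linear = beta0 + beta1 * math.log(sum_repos)
    return 1.0 / (1.0 + math.exp(-linear))


def odds_multiplier(factor, beta1):
    """Factor by which the odds change when SumRepos is multiplied by `factor`."""
    return factor ** beta1


BETA0 = -2.0   # hypothetical intercept, chosen only for illustration
BETA1 = 0.191  # log odds for ln(SumRepos) as reported in Table 3

for repos in (1, 10, 100, 400):
    print(repos, round(acquisition_probability(repos, BETA0, BETA1), 3))
```

Because the predictor enters as a logarithm, the odds scale with a power of the repo count: under these assumptions, doubling SumRepos multiplies the odds by 2 raised to the slope, whatever the starting count. Each additional repo therefore adds less than the previous one.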

Hypothesis 1d The founder count, and the ’being present on GitHub’ can together predict the chances that a company falls under the ’closed’ category in the data set.


the closed category, based on the founder count and one of the founders being present on GitHub. Previously, authors have shown the positive effect of human capital on firm growth (Cooper et al., 1994) and overall success (Baptista et al., 2014). It is argued that teams perform better because of the additional talents and skills (Baptista et al., 2014). Being present on GitHub is a dummy variable, which provides a rough indicator of ‘baseline’ technical skill. It is expected that being present on GitHub enhances the chances of not ‘closing down’, since there is at least some technical talent available in the company to make a product.

\[
y =
\begin{cases}
1 & \text{if } \beta_0 + \beta_1 \,\mathrm{FounderCount} + \beta_2 \,\mathrm{PresentOnGithub} + \varepsilon > 0 \\
0 & \text{else}
\end{cases}
\tag{2}
\]

Hypothesis 2a Companies that are part of the ’Top 10% PageRank’ group have a higher number of GitHub Repos and Stars than the overall population.

Success can take many forms. In the context of tech startups, having a high PageRank is often considered to be quite important. A high PageRank indicates receiving a lot of traffic, and traffic can be monetized (Rayport, 1999). As of writing, the current top 5 websites in the world are 1: Google.com, 2: Youtube.com, 3: Facebook.com, 4: Baidu.com and 5: Wikipedia. ‘Reddit.com’, which is a Y Combinator startup, is ranked 7 (“Alexa Top 500 Global Site”, 2017). PageRank is not a normally distributed variable; only a select few are at the top. The hypothesis states a positive relationship: companies with a higher PageRank are also expected to score higher in technical skill.

Hypothesis 2b Companies that are part of the ’Top 10% Dollars Raised’ group have a higher number of GitHub Repos and Stars than the overall population.

This hypothesis investigates if there is a relationship between attracting funding and having technically skilled founders. Previous work has found that this is not necessarily the case. The results of Colombo & Grilli show that technical experience did not increase the chances of obtaining VC funding (Colombo & Grilli, 2010, p. 621). In this research, all companies already received funding, so here the differences in the additional funding are examined. Some companies are quite successful and manage to attract millions. The question is: did technical skill contribute in getting these millions? A positive finding would indicate that technical skill matters nevertheless. If this hypothesis is rejected, then other factors likely play a major role.

Hypothesis 3a Companies which get acquired have a different distribution of talent compared to the other groups (running/closed).

Hypothesis 3b Companies that are part of the ‘Top 10% Dollars Raised’ or ‘Top 10% PageRank’ have a different distribution of talent compared to the average.

Even though this research focuses on high-tech startups, the founders do not all have to be ’tech guys’. Business talent is still important for success, and it is hypothesized that some of the most successful firms use a combination of technical and business talent. It would be interesting, if one could measure the mix of talent in the company. To do this, this research introduces two new ‘ratio’ variables: ‘TopRepoOverSumRepo’ (3) and ‘TopStarsOverSumStars’ (4).

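\[
\mathrm{TopReposOverSumRepos} = \frac{\operatorname{Max}(\mathrm{Repos})}{\operatorname{Sum}(\mathrm{ReposCompany})} \tag{3}
\]

\[
\mathrm{TopStarsOverSumStars} = \frac{\operatorname{Max}(\mathrm{Stars})}{\operatorname{Sum}(\mathrm{StarsCompany})} \tag{4}
\]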

To understand how these ratios work, let’s consider a quick example. In the case of a team with 4 founders, one founder could contribute 400 repos, while the others contribute only 33.33 on average. The sum would be 500, and the top founder would be responsible for 400/500 = 80% of the total repositories. Thus the ‘TopRepoOverSumRepo’ ratio would equal 0.8. Defined in this way, the TopRepoOverSumRepo ratio gets smaller when more founders are actively contributing on GitHub, and larger when only one person is really active on GitHub.
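The worked example above can be reproduced in a few lines of Python. The function name is illustrative, and returning `None` for a team without any repos or stars is an assumption made here, since the text does not state how such teams were handled.

```python
def top_over_sum(counts):
    """Share of a team's total count contributed by its most active founder."""
    total = sum(counts)
    if total == 0:
        return None  # assumption: ratio left undefined when the team total is zero
    return max(counts) / total


# Four founders: one with 400 repos, three with 33.33 repos on average.
team = [400, 100 / 3, 100 / 3, 100 / 3]
ratio = top_over_sum(team)
```

Applied to the four-founder team, the ratio comes out at 0.8. A team of four equally active founders would score 0.25, and a team where a single founder holds all repos would score 1.0.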

Obviously, these are imperfect measures. It could very well be the case that the person who contributes the most on GitHub is the designated team member for publishing work online. However, the ratios still make it possible to draw inferences about the extreme cases: teams where the Repos can all be traced back to one person (the sole programmer), and teams where the founders all work actively on GitHub. The point of these hypotheses is to separate the extremes. Take the example of a team with a total repo count of 25 and a ratio of 1.0. This points to a less technical team (because of the low repo count), and to one founder who does all the work (hence the ratio of 1.0). Contrast this with the previous example (500 repos in total, and a ratio of 0.8) and the differences become apparent.

Based on the work of Colombo & Grilli (2005), it is expected that the most successful companies (i.e. the acquired, top 10% funding or top 10% PageRank groups) score higher than average on the star/repo count, and higher than average on the TopRepoOverSumRepo (and TopStarsOverSumStars) ratio. Thus, they are expected to be high in total technical skill, but with a diverse team: some members focus on the technical work while the others work on the business side, hence a high ratio for the previously mentioned variables.

2 Method

2.1 Sample

The sample consists of 957 Y Combinator startups and their founders. The final data set on the companies and founders was created by combining information from CrunchBase and GitHub. A typical company in the data set has a mode of 2 founders and received 20k in funding. The average company has a sum of 31.63 GitHub Repos and 91.78 GitHub Stars and raised 8.22 million dollars. The Stars and Repo variables are non-normally distributed. The average founder has 69.6 starred projects and 23.9 public repositories. In total, there are 1,930 founders in the CrunchBase data set. Of these, 1,262 (65.4%) were matched on GitHub. The sampled companies span the period 2006 to 2017. Over time, Y Combinator has started to fund more companies each year and the amount of funding has been raised as well. In the beginning, Y Combinator used to give a standard deal of $20,000 for a 7% equity stake. Later, in 2014, this became $120,000. Y Combinator organizes a bootcamp, which provides the founders with intensive guidance and training for a period of 3 months. After this, the companies are expected to become independent.

2.2 Protocol for collecting data

The data for this research was gathered automatically using a variety of data scraping techniques. The starting point was the ‘company name’ from the list of 957 companies, as provided by the Y Combinator website. This company name was then looked up on CrunchBase. With Python 2.6, a customized ‘spider script’ was built. This spider automatically processed the most important information from the CrunchBase website (such as the founder team, the status and the money received). A large part of the programming time was spent on perfecting the crawler script. Initially, the script would halt on ‘Unicode errors’, and could not process names like “João”. In total, 5,166 unique lines of code were written for the purposes of this research.

Table 1: Descriptives Company Level

Variable                 N     Mean       StDev      Min    Max        Mode   Skew
avg stars                957   39.15      75.63      0      529        0      2.93
sum stars                957   91.78      181.61     0      1346       0      3.06
avg public repos         957   13.851     22.743     0      221        0      3.64
sum public repos         957   31.63      51.4       0      423        0      3.26
dollars (millions USD)   957   8.22       39.1       0.01   674.8      0.02   11.4
sum public repos log     686   3.0347     1.3787     0      6.0474     0      -0.45
sum stars log            625   3.7578     1.8055     0      7.2049     0      -0.37
pagerank                 957   15815636   10414178   1      24904702   1      -0.68

Having retrieved the founder names, another script was designed to query these founder names on GitHub through the official GitHub API (Application Programming Interface). This also allowed for automatic matching of the founder names, since the GitHub website could be searched directly. When multiple results came back, the result which ranked highest in ‘match score’ was chosen. As with all automatic processing, mistakes unfortunately cannot be totally eliminated. In total, 1,262 founders were processed this way.
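The matching step can be sketched as follows. This is not the script used for this research: the endpoint and the `score` field reflect the public GitHub user-search API as understood here, the sample response is invented, and the function names are hypothetical. No network request is made in the sketch itself.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the GitHub REST API for searching users.
SEARCH_URL = "https://api.github.com/search/users"


def build_search_url(founder_name):
    """Build a user-search URL; urlencode percent-encodes names such as 'João'."""
    return SEARCH_URL + "?" + urlencode({"q": founder_name})


def best_match(response_text):
    """Return the login of the highest-scoring search hit, or None if no hits."""
    items = json.loads(response_text).get("items", [])
    if not items:
        return None
    return max(items, key=lambda item: item.get("score", 0.0))["login"]


# Invented response in the shape assumed above, used instead of a live request.
sample_response = json.dumps({"total_count": 2, "items": [
    {"login": "jdoe-example", "score": 12.5},
    {"login": "jane-doe-example", "score": 48.0},
]})

url = build_search_url("João Silva")
login = best_match(sample_response)
```

Encoding the query explicitly as UTF-8 is one way to avoid the kind of Unicode failure on names with accented characters that the crawler initially ran into.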

Since performing calculations on the data set by hand would be tedious, the processing and calculations were performed directly in Python. The data has been processed and stored in a way similar to a relational database, so the data exists in separate layers: one for the companies and one for the founders. This also made it easy to implement the calculations at the correct level of analysis. Minitab 17 was used to run the various regression models and tests for this research.
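A minimal sketch of this two-level layout is given below, rolling founder-level counts up into the company-level variables of Table 1. The rows, company names and function name are hypothetical. Averaging over the founders present in the founder table is an assumption, since the exact denominator used for the averages is not stated.

```python
from collections import defaultdict

# Hypothetical founder-level rows: (company, founder, public_repos, stars).
founders = [
    ("ExampleCo", "founder_a", 40, 120),
    ("ExampleCo", "founder_b", 2, 0),
    ("OtherCo", "founder_c", 0, 0),
]


def company_totals(rows):
    """Aggregate founder-level counts into company-level sums and averages."""
    grouped = defaultdict(list)
    for company, _founder, repos, stars in rows:
        grouped[company].append((repos, stars))
    totals = {}
    for company, counts in grouped.items():
        sum_repos = sum(r for r, _ in counts)
        sum_stars = sum(s for _, s in counts)
        totals[company] = {
            "founder_count": len(counts),
            "sum_public_repos": sum_repos,
            "sum_stars": sum_stars,
            # Assumption: averages are taken over the founders listed here.
            "avg_public_repos": sum_repos / len(counts),
            "avg_stars": sum_stars / len(counts),
        }
    return totals


totals = company_totals(founders)
```

Keeping the founder rows separate and deriving the company rows from them means that every company-level figure can be traced back to the profiles it was computed from.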


Figure 1: Data Collection Process

2.3 Statistical procedures

Here we discuss how the data was prepared for the different hypotheses. Hypothesis 1a compares the medians of the Sum of GitHub Repos and Stars for the three different groups (acquired/active/closed). A Kruskal-Wallis test is used for this comparison, since this test is robust and can handle non-normal distributions. The Repos and Stars variables exhibit strong positive skews (see Table 1). This means that there are a lot of ‘low counts’ in the data set. Please refer to Appendix 1 for a histogram of Sum Repos and Sum Stars. When running a regression analysis, this can lead to problems in fitting the model. Therefore, it was deemed important to use a log transformation to account for this. Even after transforming the data, the Log variables have a lot of low counts. Consider the probability plots, figure 2 and 4 in the Appendix. Excluding these would allow for a better fit, but it would also destroy information. Therefore it was decided to keep them in.
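For readers unfamiliar with the test, the sketch below computes the Kruskal-Wallis H statistic, with and without the adjustment for ties, from raw group samples. It is a plain textbook implementation written for illustration (the tests in this research were run in Minitab), and the function names are hypothetical. Converting H into a p-value additionally requires the chi-squared distribution with (number of groups - 1) degrees of freedom, which is left out here.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, H adjusted for ties) for a list of samples."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N) over tie groups.
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    if correction == 0:
        return h, float("nan")  # all observations identical
    return h, h / correction
```

With many companies sharing the same low counts, ties are common in this kind of data, which is why both the unadjusted and the tie-adjusted H are usually reported side by side.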

The implications of running the binary logistic models of H1b and H1c with a log transformation are further discussed in the results section, but they err on the conservative side. Hence, the transformation was deemed suitable for the purposes of this research. The same logic applied to the GitHub Stars variable, which was also log transformed. Dropping the companies with 0 repos, which was necessary because one cannot take the log of 0, reduces the data set to 686 companies for the Log Sum Repo variable, and 624 companies for the Log Sum Stars variable.
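The transformation step can be sketched as below. The company names are hypothetical; the use of the natural logarithm is inferred from Table 1, where the maximum of 423 summed repos corresponds to a log maximum of 6.0474.

```python
import math


def log_transform_nonzero(company_sums):
    """Map {company: count} to {company: ln(count)}, dropping zero counts."""
    return {name: math.log(count)
            for name, count in company_sums.items() if count > 0}


# Hypothetical companies; 423 is the maximum Sum Repos value listed in Table 1.
sum_repos = {"ExampleCo": 423, "OtherCo": 1, "ThirdCo": 0}
logged = log_transform_nonzero(sum_repos)
```

A company with exactly one repo stays in the data with a log value of 0, which matches the minimum of 0 listed for the log variables in Table 1, while companies with zero repos drop out.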

3 Results

3.1 The 3 groups for the companies: acquired, active, and closed

The first hypothesis compares the medians for the different groups of companies, and states that the acquired group is expected to have the highest median. A Kruskal-Wallis test was run to compare the differences in the medians. The ‘Sum Repos’ medians for the three groups (acquired/active/closed) are 32, 10 and 4.5. The average rank shows that the acquired group differs most from the average rank for all observations and that this group is higher than the overall median. The ‘Sum Stars’ medians for the three groups (acquired/active/closed) are 61.50, 8 and 2. Here too, the average rank shows that the acquired group differs most from the average rank for all observations and that this group is higher than the overall median. Both p-values are less than the significance level of 0.05, which indicates that the median differs for at least one group. This holds for both the Kruskal-Wallis tests on Sum Repos and Sum Stars. Hypothesis 1a is thus accepted.

In addition, it should be noted that the median of the ‘closed’ group is below the overall median for both tests. For the Sum Repos test, the median is 4.5 with a z-value of -4.21. For the Sum Stars test, the median is 2.0 with a z-value of -3.29.

3.2 Logit models

Let’s quickly repeat the hypotheses relating to the logit models again:

1. H1b: Having more GitHub Repos statistically increases the likelihood of a company falling under the ‘acquired’ category in the data set.

2. H1c: Having more GitHub Stars statistically increases the likelihood of a company falling under the ‘acquired’ category in the data set.

3. H1d: The founder count, and the ‘being present on GitHub’ can together predict the chances that a company falls under the ‘closed’ category in the data set.


Table 2: Kruskal-Wallis Tests

Kruskal-Wallis Test - Sum Repos

status     N     Median   Ave Rank   Z
acquired   140   32       623.5      6.69
active     565   10       471.3      -1.04
closed     252   4.5      416.1      -4.21
Overall    957            479

H = 51.76   DF = 2   P = 0.000
H = 52.96   DF = 2   P = 0.000 (adjusted for ties)

Kruskal-Wallis Test - Sum Stars

status     N     Median   Ave Rank   Z
acquired   140   61.5     619.4      6.5
active     565   8        466.2      -1.73
closed     252   2        429.8      -3.29
Overall    957            479

H = 45.31   DF = 2   P = 0.000
H = 47.29   DF = 2   P = 0.000 (adjusted for ties)

A binary logistic regression model was used to check if it is possible to model the chances that a company gets acquired, using the independent variable ‘sum github repos’. The Hosmer-Lemeshow (HL) test showed that the model fitted the data reasonably well, p = .120. The log transformed Sum Repos (p < .001) significantly predicted the chances of getting acquired (Table 3). Hypothesis 1b is accepted. Then, H1c tested the same relationship for the log transformed GitHub Stars. The Hosmer-Lemeshow test showed that the model fitted the data well, p = .66. The log transformed Sum Stars (p < .01) significantly predicted the chances of getting acquired. Therefore, hypothesis 1c is accepted.

Hypothesis H1d tests the chances of failure. This test does *not* use log transformed variables. The model of H1d, shown in table 3, takes two inputs: ’founder count’ and ’presence on github’. The model was a reasonably good fit for the data: the HL came out at p = 0.114. Founder count was shown to be highly significant (p <0.0001). The dummy variable that measures ’Presence on GitHub’ came out at p <0.01. Therefore, hypothesis 1d is accepted.


H1b and H1c. These patterns were also present for the model of H1d, but to a lesser extent than in the first two models. This can be verified by contrasting figure 5 and 6 in the appendix.

Table 3: Results Binary Logistic Regression

Probability Acquired

H     Variable               Log Odds   Odds Ratio   CI                 N     H-L     Sig.
H1b   sum public repos log   0.191      1.2109       (1.0471, 1.4002)   687   0.12    ***
H1c   sum stars log          0.139      1.1490       (0.9895, 1.3344)   625   0.66    **

Probability Closed

H     Variable               Log Odds   Odds Ratio   CI                 N     H-L     Sig.
H1d   founder count          -0.383     0.6817       (0.5640, 0.8239)   957   0.114   ****
H1d   present on git         -0.485     0.6158       (0.3494, 1.0854)                 *
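The ‘Log Odds’ and ‘Odds Ratio’ columns of Table 3 are linked by the exponential function, which the short sketch below reproduces from the reported coefficients. The dictionary keys are illustrative names; small differences in the last digit arise because the log odds in the table are rounded to three decimals.

```python
import math

# Log odds as reported in Table 3.
log_odds = {
    "sum_public_repos_log": 0.191,
    "sum_stars_log": 0.139,
    "founder_count": -0.383,
    "present_on_git": -0.485,
}

# An odds ratio is the exponential of the corresponding log odds.
odds_ratios = {name: math.exp(beta) for name, beta in log_odds.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 4))
```

Read this way, the founder count coefficient means that each additional founder multiplies the odds of falling in the ‘closed’ category by roughly 0.68, with the other variable held constant.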

3.3 PageRank and Top Dollars

1. H2a: Companies that are part of the ’Top 10% PageRank’ group have a higher number of GitHub Repos and Stars than the overall population.

2. H2b: Companies that are part of the ’Top 10% Dollars Raised’ group have a higher number of GitHub Repos and Stars than the overall population.

Table 4 shows the results for the different success measures (Top PageRank, Top Dollars). The companies which raised the most funds did not have significantly higher amounts of technical skill. Top PageRank companies did, however, with p = .004 for GitHub Stars and p = .018 for GitHub Repos (adjusted for ties). Hypothesis 2 is therefore partially accepted: H2a (the Top PageRank group) is supported, whereas H2b (the Top Dollars Raised group) is not.

3.4 Distribution of talent

1. H3a: Companies which get acquired have a different distribution of talent compared to the other groups (running/closed).

2. H3b: Companies that are part of the ’Top 10% Dollars Raised’ or ’Top 10% PageRank’ have a different distribution of talent compared to the average.


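Table 4: Top 10%

Sum Stars

| Grouping | Group | N | Median | AvRank | Z |
|---|---|---|---|---|---|
| Dollars Top 10% (dollars top10p) | 0 | 853 | 11 | 477.3 | -0.55 |
| Dollars Top 10% (dollars top10p) | 1 | 104 | 12.5 | 493 | 0.55 |
| Dollars Top 10% (dollars top10p) | Overall | 957 | | 479 | |
| PageRank Top 10% (pagerank top10p) | 0 | 862 | 9 | 470.6 | -2.82 |
| PageRank Top 10% (pagerank top10p) | 1 | 95 | 46 | 554.8 | 2.82 |
| PageRank Top 10% (pagerank top10p) | Overall | 957 | | 479 | |

| Grouping | H | DF | P | H (adjusted for ties) | P (adjusted for ties) | Sig. |
|---|---|---|---|---|---|---|
| Dollars Top 10% | 0.30 | 1 | 0.585 | 0.31 | 0.577 | |
| PageRank Top 10% | 7.94 | 1 | 0.005 | 8.29 | 0.004 | ** |

Sum Repos

| Grouping | Group | N | Median | AvRank | Z |
|---|---|---|---|---|---|
| Dollars Top 10% (dollars top10p) | 0 | 853 | 12 | 477.6 | -0.44 |
| Dollars Top 10% (dollars top10p) | 1 | 104 | 9.5 | 490.2 | 0.44 |
| Dollars Top 10% (dollars top10p) | Overall | 957 | | 479 | |
| PageRank Top 10% (pagerank top10p) | 0 | 862 | 10 | 472.1 | -2.33 |
| PageRank Top 10% (pagerank top10p) | 1 | 95 | 20 | 541.8 | 2.33 |
| PageRank Top 10% (pagerank top10p) | Overall | 957 | | 479 | |

| Grouping | H | DF | P | H (adjusted for ties) | P (adjusted for ties) | Sig. |
|---|---|---|---|---|---|---|
| Dollars Top 10% | 0.19 | 1 | 0.662 | 0.20 | 0.658 | |
| PageRank Top 10% | 5.44 | 1 | 0.020 | 5.56 | 0.018 | * |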
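The H, DF and "adjusted for ties" figures in Table 4 have the layout of a Kruskal-Wallis rank test. Assuming that this is the test used, the unadjusted H can be recomputed from the group sizes and average ranks alone, via H = 12 / (N(N + 1)) · Σ nᵢ (R̄ᵢ − (N + 1)/2)². With one degree of freedom, the p-value is the upper tail of a chi-square(1) distribution, which equals erfc(√(H/2)). The function names below are illustrative, and the tie adjustment is not reproduced because it needs the raw data.

```python
import math


def kruskal_wallis_h(groups):
    """Unadjusted Kruskal-Wallis H from (group size, average rank) pairs."""
    n_total = sum(n for n, _ in groups)
    overall_mean_rank = (n_total + 1) / 2.0
    between = sum(n * (avg_rank - overall_mean_rank) ** 2 for n, avg_rank in groups)
    return 12.0 / (n_total * (n_total + 1)) * between


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# (N, AvRank) for group 0 and group 1, copied from Table 4.
TABLE4 = {
    ("Sum Stars", "Dollars Top 10%"): [(853, 477.3), (104, 493.0)],
    ("Sum Stars", "PageRank Top 10%"): [(862, 470.6), (95, 554.8)],
    ("Sum Repos", "Dollars Top 10%"): [(853, 477.6), (104, 490.2)],
    ("Sum Repos", "PageRank Top 10%"): [(862, 472.1), (95, 541.8)],
}

RESULTS = {}
for key, groups in TABLE4.items():
    h = kruskal_wallis_h(groups)
    RESULTS[key] = (h, chi2_upper_tail_df1(h))

if __name__ == "__main__":
    for (measure, grouping), (h, p) in RESULTS.items():
        print(f"{measure} by {grouping}: H = {h:.2f}, P = {p:.3f}")
```

The recomputed values can be compared with the unadjusted H and P columns of Table 4.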

The results for hypotheses H3a and H3b are presented in Table 5. ANOVA tests were performed in which the group means were tested for being ’all equal’ against the alternative that at least one is statistically different. Hypothesis 3a is supported by the data. Note the low p-value between ’status’ and ’TopRepoOverSumRepo’ in the first row (p=0.011**). The relationship between ’status’ and ’TopStarsOverSumStars’ is also significant (p=0.093*), though only at an alpha level of 0.10. Hence, there is a chance that this is a false positive. Thus, hypothesis H3a is partially accepted. The residuals were analyzed and were found to be relatively normally distributed.

The low F-values of the ANOVA tests (and the correspondingly high p-values) suggest that there is no relationship between having a high PageRank and the TopStarsOverSumStars and TopRepoOverSumRepo variables. The same applies to the group which scored highest in getting funding (the TopDollar group). With p-values > .10 for these groups, H3b is rejected.
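The variable names suggest that TopRepoOverSumRepo is the repo count of the founder with the most repos divided by the team total, and likewise for stars. Under that assumption (the definition is not restated in this section), a value of 1 means that a single founder accounts for all of the team's repos, while lower values mean that contributions are spread more evenly across founders. The function name and the handling of teams without any repos are illustrative choices.

```python
from typing import Optional, Sequence


def top_over_sum(counts: Sequence[int]) -> Optional[float]:
    """Share of a team's total held by its top founder.

    `counts` holds one value per founder (for example repos or stars).
    Returns None when the team total is zero, because the ratio is undefined.
    This is an assumed reading of the TopRepoOverSumRepo variable name.
    """
    total = sum(counts)
    if total == 0:
        return None
    return max(counts) / total


if __name__ == "__main__":
    print(top_over_sum([10, 0]))   # one founder holds everything
    print(top_over_sum([6, 4]))    # contributions are more evenly spread
```

On this reading, the lower mean for acquired companies in Table 5 corresponds to teams in which more than one founder contributes.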


Table 5: Relative distribution of talent

TopRepoOverSumRepo

| Grouping | Group | N | Mean | CI | F | p-value |
|---|---|---|---|---|---|---|
| Status | Acquired | 132 | 0.8258 | (0.7945, 0.8571) | 4.51 | 0.011** |
| Status | Active | 407 | 0.86826 | (0.85045, 0.88608) | | |
| Status | Closed | 148 | 0.8904 | (0.8608, 0.9199) | | |
| TopPageRank | 0 | 610 | 0.86518 | (0.85055, 0.87981) | 0.02 | 0.902 |
| TopPageRank | 1 | 77 | 0.8624 | (0.8213, 0.9036) | | |
| TopDollar | 0 | 608 | 0.86596 | (0.85130, 0.88061) | 0.18 | 0.669 |
| TopDollar | 1 | 79 | 0.8565 | (0.8159, 0.8972) | | |

TopStarsOverSumStars

| Grouping | Group | N | Mean | CI | F | p-value |
|---|---|---|---|---|---|---|
| Status | Acquired | 124 | 0.8723 | (0.8445, 0.9001) | 2.38 | 0.093* |
| Status | Active | 364 | 0.90561 | (0.88937, 0.92185) | | |
| Status | Closed | 137 | 0.9094 | (0.8829, 0.9359) | | |
| TopPageRank | 0 | 553 | 0.89997 | (0.88675, 0.91318) | 0.00 | 0.954 |
| TopPageRank | 1 | 72 | 0.8988 | (0.8622, 0.9354) | | |
| TopDollar | 0 | 554 | 0.90159 | (0.88839, 0.91479) | 0.6 | 0.438 |


Figure 2: Bubble Plot

4 Graphical Summary

The results so far are summarized in Figure 2. Each bubble represents a company in the data set. The red bubbles are the companies in the top 10% in terms of PageRank; the blue ones make up the remaining 90%. The size of a bubble reflects the amount of funding the company has attracted, and its location (x, y) is determined by its number of GitHub Repos and GitHub Stars. Repos and Stars are shown on a log scale, which compresses the outliers in the data set. The interpretation remains the same: companies farther toward the upper right have more GitHub Repos and GitHub Stars. Two things become apparent. First, the log-scaled Repos and Stars correlate with each other: higher numbers of repos tend to go together with higher numbers of stars (note the approximate 45-degree angle). Second, a few of the top companies in terms of PageRank (the red bubbles) are also outliers in terms of log Stars and Repos. These are the companies that score highest on all measures reviewed so far: dollars raised, technical skill, and PageRank.


5 Results

This research investigated whether having ’capable coders’ correlates with success for the Y Combinator companies. The results are discussed in more depth below. First and foremost, this study provides evidence that human capital theory is generally right. Technical skill, specifically programming skill, is found to correlate strongly with success. This was tested in two important ways. First, companies that got acquired had significantly higher levels of technical skill than the average population (hypothesis 1a is supported). Getting acquired is a successful outcome for the founder, since it generally means having a nice ’pay day’. Second, companies which rank highest in PageRank (and thus draw a lot of traffic) also had above-average levels of technical skill, measured in GitHub Repos and Stars. Therefore, hypothesis 2a was also accepted. These two results confirm the findings of Unger et al. (2011), who noted that direct measures of human capital (i.e. relevant skills) tend to be the best indicators of startup success. This study, however, has focused primarily on the technical skill set of the programmers, since this is what the data could provide.

Hypothesis 1b addressed the question whether it is possible to predict the chances that a company gets acquired as a function of the Sum of GitHub Repositories. This hypothesis uses a logit regression model, which takes the log-transformed Sum Repos as its input. This model was found to be supported by the data. The Hosmer-Lemeshow test comes out at p > .10, and the regression coefficient itself is significant at a level of < 0.001. Therefore, it seems appropriate to assume that higher numbers of GitHub repos in the data set are in fact associated with higher chances of getting acquired. Note, however, that the logit regression model uses a log-transformed variable as input (the IV). The Sum GitHub Repos variable has a strong positive skew (which can be observed in Table 1, Descriptive, and visually in Appendix, Figure 1). A lognormal distribution, however, fits the variable reasonably well (Appendix, Figure 2). The consequences of this non-linear relationship are also reflected in the final modeled relationship between the chances of success (P Acquired = 1) and the number of GitHub Repos, based on the outcome of the regression model (H1b). This is shown in Figure 3. It takes an increasing number of Repos to raise the chances of success by the same marginal amount. A direct interpretation of Figure 3 is that there is a decreasing rate of return on having more technical skill in the founding team.

Figure 3: Chances of getting Acquired
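The diminishing return follows directly from the log-transformed predictor: multiplying the number of repos by a factor k multiplies the odds of acquisition by k raised to the power of the coefficient, regardless of the starting level, so ever larger absolute increases are needed for the same effect. The sketch below illustrates this with the H1b coefficient from Table 3. The intercept is not reported in the text, so the value used here is purely hypothetical and only affects the absolute probabilities, not the shape of the curve.

```python
import math

BETA_LOG_REPOS = 0.191          # log odds for log(sum repos), Table 3 (H1b)
HYPOTHETICAL_INTERCEPT = -2.0   # NOT reported in the text; assumed for illustration


def odds_multiplier(fold_increase, beta=BETA_LOG_REPOS):
    """Factor by which the odds change when repos are multiplied by fold_increase."""
    return fold_increase ** beta


def p_acquired(repos, intercept=HYPOTHETICAL_INTERCEPT, beta=BETA_LOG_REPOS):
    """Modeled probability of acquisition for a given (positive) repo count."""
    z = intercept + beta * math.log(repos)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    print(f"Doubling the repos multiplies the odds by {odds_multiplier(2):.3f}")
    for repos in (1, 2, 10, 100, 1000):
        print(f"repos = {repos:>4}: P(acquired) = {p_acquired(repos):.3f}")
```

Under these assumptions, going from 1 to 2 repos raises the modeled probability by more than going from 100 to 101, which is the decreasing rate of return described above.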

However, in analyzing the residuals for model H1b, it was found that the regression residuals point to a bimodal distribution in the independent variable, the GitHub Repos. This can be observed in Figure 5 in the appendix. The residuals are not normally distributed because GitHub Repos cannot be perfectly log-normally distributed. In the original data, there are a lot of values of ’1’ for GitHub repos, that is, companies with just 1 repo. This is many more than a lognormal distribution would generally predict. The problem is also highlighted in Appendix Figure 4. It persists even after the 0 values were removed (which was necessary, since one cannot take the log of 0). Does this mean that this model is inherently bad? One could easily enhance the model and capture these outliers with an additional dummy variable. In fact, in a way this has already been done in the model of Hypothesis 1d, which introduces the dummy variable ’present on GitHub’. Note that this logit model is accepted as a reasonable fit (Table 3, H1d). Most importantly, being present on GitHub has an odds ratio of .6158, which indicates that, for this data set, the odds of falling in the ’closed’ category are reduced by about 40% on average. This is a strong effect size, though the coefficient was only significant at the 0.10 alpha level. Thus, caution is necessary when interpreting such results directly.
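The odds ratio of .6158 for GitHub presence is a statement about odds, not probabilities, so the change in the probability of closing depends on the baseline. The sketch below makes this concrete. The baseline probability of 0.20 is a hypothetical value chosen for illustration, and the significance check assumes that the interval in Table 3 is a 95% Wald interval, which the text does not state.

```python
import math

ODDS_RATIO_PRESENT = 0.6158      # Table 3, H1d, present on GitHub
CI_PRESENT = (0.3494, 1.0854)    # Table 3, H1d, present on GitHub
Z_95 = 1.959964                  # assumption: the CI is a 95% Wald interval


def odds_reduction_pct(odds_ratio):
    """Percentage reduction in the odds implied by an odds ratio below 1."""
    return (1.0 - odds_ratio) * 100.0


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability after multiplying the baseline odds by the odds ratio."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


def implied_wald_p(odds_ratio, ci):
    """Two-sided p-value implied by an odds ratio and its Wald interval."""
    lower, upper = ci
    std_error = (math.log(upper) - math.log(lower)) / (2.0 * Z_95)
    z = math.log(odds_ratio) / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    print(f"Reduction in odds: {odds_reduction_pct(ODDS_RATIO_PRESENT):.1f}%")
    baseline = 0.20  # hypothetical baseline probability of closing
    print(f"P(closed): {baseline:.3f} -> {apply_odds_ratio(baseline, ODDS_RATIO_PRESENT):.3f}")
    print(f"Implied two-sided p-value: {implied_wald_p(ODDS_RATIO_PRESENT, CI_PRESENT):.3f}")
```

Under these assumptions the implied p-value lies between 0.05 and 0.10, which agrees with the caution expressed above about the 0.10 alpha level.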

This ultimately raises the question: what does this all mean? The results gathered so far can be summarized as follows:

1. There is a positive relationship between technical skill and startup success, in the form of getting acquired (H1a) and in the form of reaching the Top 10% PageRank status (H2a).

2. There is a non-linear relationship between additional GitHub Repos (H1b) and increasing chances of getting acquired.

3. It seems that there is a strong argument to be made that technical skill is a baseline requirement for success in technical startups. In this data set (H1d), the absence of technical skill points to higher chances of failure. However, increasing amounts of technical skill are not necessarily associated with higher levels of success (see Figure 3, which shows the non-linear relationship).

5.1 Strengths and limitations of the study

A major limitation of this study is that it uses databases to establish relationships. Even though the sampled population is quite large, one cannot establish causal relationships based on correlational research. In addition, given the relatively young nature of the various measurements, readers might question the validity of the established measures. Unfortunately, one cannot eliminate spurious correlations. Still, given the work of Kalliamvakou et al. and the interviews conducted for this research, there seems to be a basis for using the new measures. It should be noted that GitHub repos and stars are proxy measures and not direct tests. As such, the findings should be considered unconfirmed until further research validates these new constructs. A more significant problem in this research, however, is that there is no consistent timeframe for the measured results. The data was gathered in June 2017, and the GitHub repos and stars are a snapshot of the founders’ profiles at that time. As such, it is not possible to find causal relationships. However, there are elements of the GitHub profiles that can be used to build in the element of time, which also applies to the funding rounds. Therefore, future research could address these problems relatively easily, and validate (or invalidate) the proposed measurements. This research maintains, however, that the difference between GitHub users with very few repos and users with many repos is relatively constant over time. This is particularly true for users with low repo counts (e.g. a count of 1 or 2), which indicate a low level of interest. (After all, a user who has a low count today by implication also had a low count in previous periods.)

5.2 Implications for Managers and Founders

There are several implications of this research for founders and managers. First, non-technical founders are well advised to team up with someone who can contribute technically, as the results of this study indicate that this can significantly enhance both the success of the company (H1*, H2a) and the survivability of the company (H1d). Second, it seems logical for individuals with an interest in ’technology’ to keep refining their skill set, allowing them to reach the critical mass that is required to compete on a level playing field with other startups. Finally, founders and managers could use GitHub profiles to get a relatively objective indicator of someone’s technical ability. Particularly for managers involved in VC funds, the absence of any GitHub profile for a technical startup should be considered a possible problem.

5.3 Conclusion

This study has established strong evidence that the companies that get acquired also have the most technical founder teams. This was found to be the case in two distinctive ways. First, they have the largest amounts of technical skill in their founding teams. Second, they tend to have teams in which all founders actively contribute in terms of technical expertise. It was also found that the relationship between technical skill and startup success is not linear. The logit models show that, after a certain point, adding more ’technical talent’ to the team did not increase the chance of success. In analyzing the data, it also became apparent that there are two very distinctive groups in the data set: very technically skilled founders, and far less technically skilled founders. The models show that having a baseline of technical talent strongly enhances the chances of success. Thus, founders and managers are well advised to find talent which meets that baseline requirement.


References

Baptista, R., Karaöz, M., & Mendonça, J. (2014, April). The impact of human capital on the early success of necessity versus opportunity-based entrepreneurs. Small Business Economics, 42 (4), 831–847. Retrieved 2017-02-07, from http://link.springer.com/article/10.1007/s11187-013-9502-z doi: 10.1007/s11187-013-9502-z

Bosma, N., Praag, M. v., Thurik, R., & Wit, G. d. (2004, October). The Value of Human and Social Capital Investments for the Business Performance of Startups. Small Business Economics, 23 (3), 227–236. Retrieved 2017-02-09, from http://insights.ovid.com/small-business-economics/sbeco/2004/23/030/value-human-social-capital-investments-business/6/00023027

Cassar, G. (2014, January). Industry and startup experience on entrepreneur forecast performance in new firms. Journal of Business Venturing, 29 (1), 137–151. Retrieved 2017-02-08, from https://www.sciencedirect.com/science/article/pii/S0883902612000948 doi: 10.1016/j.jbusvent.2012.10.002

Colombo, M. G., & Grilli, L. (2005, August). Founders’ human capital and the growth of new technology-based firms: A competence-based view. Research Policy, 34 (6), 795–816. Retrieved 2017-02-08, from https://www.sciencedirect.com/science/article/pii/S0048733305000776 doi: 10.1016/j.respol.2005.03.010

Colombo, M. G., & Grilli, L. (2010, November). On growth drivers of high-tech start-ups: Exploring the role of founders’ human capital and venture capital. Journal of Business Venturing, 25 (6), 610–626. Retrieved 2017-02-08. doi: 10.1016/j.jbusvent.2009.01.005

Cooper, A. C., Gimeno-Gascon, F. J., & Woo, C. Y. (1994, September). Journal of Business Venturing, 9 (5), 371–395. Retrieved 2017-02-08, from http://www.sciencedirect.com/science/article/pii/0883902694900132 doi: 10.1016/0883-9026(94)90013-2

Crowne, M. (2002). Why software product startups fail and what to do about it. Evolution of software product development in startup companies. In IEEE International Engineering Management Conference (Vol. 1, pp. 338–343 vol.1). doi: 10.1109/IEMC.2002.1038454

Delmar, F. (2006, August). Does experience matter? The effect of founding team experience on the survival and sales of newly founded ventures. Strategic Organization, 4 (3), 215–247. Retrieved 2017-02-03, from http://soq.sagepub.com/cgi/doi/10.1177/1476127006066596 doi: 10.1177/1476127006066596

Ganotakis, P. (2012, September). Founders’ human capital and the performance of UK new technology based firms. Small Business Economics, 39 (2), 495–515. Retrieved 2017-02-08, from http://link.springer.com/article/10.1007/s11187-010-9309-0 doi: 10.1007/s11187-010-9309-0

Gimmon, E., & Levie, J. (2010, November). Founder’s human capital, external investment, and the survival of new high-technology ventures. Research Policy, 39 (9), 1214–1226. Retrieved 2017-02-08, from https://www.sciencedirect.com/science/article/pii/S0048733310001411 doi: 10.1016/j.respol.2010.05.017

Jo, H., & Lee, J. (1996, April). The relationship between an entrepreneur’s background and performance in a new venture. Technovation, 16 (4), 161–211. Retrieved 2017-01-23, from http://www.sciencedirect.com/science/article/pii/0166497296891243 doi: 10.1016/0166-4972(96)89124-3

Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D. M., & Damian, D. (2014). The promises and perils of mining GitHub. In Proceedings of the 11th working conference on mining software repositories (pp. 92–101). ACM. Retrieved 2017-01-23.

Koch, A., Späth, J., & Strotmann, H. (2013, October). The role of employees for post-entry firm growth. Small Business Economics, 41 (3), 733–755. Retrieved 2017-02-08, from http://link.springer.com/article/10.1007/s11187-012-9456-6 doi: 10.1007/s11187-012-9456-6

Marvel, M. R., & Lumpkin, G. (2007, November). Technology Entrepreneurs’ Human Capital and Its Effects on Innovation Radicalness. Entrepreneurship Theory and Practice, 31 (6), 807–828. Retrieved 2017-02-08, from http://onlinelibrary.wiley.com/doi/10.1111/j.1540-6520.2007.00209.x/abstract doi: 10.1111/j.1540-6520.2007.00209.x

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. (Tech. Rep.). Stanford InfoLab. Retrieved from http://ilpubs.stanford.edu:8090/422

Rayport, J. F. (1999). The truth about Internet business models. Strategy and Business, 5–7. Retrieved from http://www.ddwei.info/pdf/WebBusinesses/0.pdf

Reuber, A. R., & Fischer, E. M. (1994). Entrepreneurs’ experience, expertise, and the performance of technology-based firms. IEEE Transactions on Engineering Management, 41 (4), 365–374. Retrieved from http://ieeexplore.ieee.org/abstract/document/364560/

Roberts, E. B. (1991). Entrepreneurs in high technology: Lessons from MIT and beyond. MIT, 3.

Shrader, R., & Siegel, D. S. (2007, November). Assessing the Relationship between Human Capital and Firm Performance: Evidence from Technology-Based New Ventures. Entrepreneurship Theory and Practice, 31 (6), 893–908. Retrieved 2017-02-08, from http://onlinelibrary.wiley.com/doi/10.1111/j.1540-6520.2007.00206.x/abstract doi: 10.1111/j.1540-6520.2007.00206.x

Song, M., Podoynitsyna, K., Van Der Bij, H., & Halman, J. I. (2008). Success factors in new ventures: A meta-analysis. Journal of product innovation management, 25 (1), 7–27. Retrieved 2017-02-03, from http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5885.2007.00280.x/full


Unger, J. M., Rauch, A., Frese, M., & Rosenbusch, N. (2011, May). Human capital and entrepreneurial success: A meta-analytical review. Journal of Business Venturing, 26 (3), 341–358. Retrieved 2017-02-08. doi: 10.1016/j.jbusvent.2009.09.004

Zhang, J. (2011, February). The advantage of experienced start-up founders in venture capital acquisition: evidence from serial entrepreneurs. Small Business Economics, 36 (2), 187–208. Retrieved 2017-02-08, from http://link.springer.com/article/10.1007/s11187-009-9216-4 doi: 10.1007/s11187-009-9216-4


Figure 1: Histogram Sums Repos

Figure 2: Probability Plot Sums Repos Lognormal

Figure 3: Histogram Sums Stars

Figure 4: Probability Plot Sums Stars Lognormal

Figure 5: Residuals for H1b Logit Model - Acquired, Ln Sum Repos

Figure 6: Residuals for H1d Logit Model - Closed, Founder Count, Present on GitHub

