
Researching the influence of information

technology companies on scientific publications

Joost Verkaik

10998284

August 30, 2017

Bachelor thesis Information Science

Supervisor: dr. Maarten J. Marx
Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Abstract

Information technology is becoming increasingly important in our daily lives. Many big technology companies influence us through their products and services. Earlier work examined this influence, but did not explore the influence of these companies on scientific research. The central question in this thesis asks whether information technology companies are becoming more influential in scientific research.

The dataset contained almost 4 million papers, of which 146,167 were used. Papers were linked to richer datasets about the related conferences, which allowed restricting the papers to high-quality conferences.

A classifier was built to distinguish academic from industry affiliations. This resulted in a total of 169,941 usable affiliations, with 44,468 industrial affiliations and 125,473 academic affiliations. Points were allocated to all affiliations to weight them and to analyze the share of industrial affiliations.

As organizations grow bigger and increase their revenues, their scientific output increases. The overall share of industry affiliations has declined since 2000, but individual companies like Microsoft, Google and Intel have increased their share of total publications over the years. While the increases for these companies are not yet significant, they may become so in the future.


Contents

1 Introduction 5

2 Related work 6

3 Methodology 6

3.1 Data sources . . . 7

3.2 Data acquisition and description . . . 9

3.3 Data analysis . . . 11

4 Results 16

4.1 Academy/industry distribution per conference . . . 16

4.2 Affiliations over the years . . . 18

4.3 Company rankings and scientific activities . . . 19

5 Conclusions 24

5.1 Discussion . . . 25

5.2 Future work . . . 26

Appendices 27

A List of A* conferences 27

B Dataset SQL structure 28

C Academic/industry classifier 29

D Grouping companies function 32


List of Figures

1 Information available on Google Scholar . . . 7

2 Information available on ACM . . . 8

3 Information available on DBLP . . . 9

4 Amount of publications per year . . . 10

5 Schematic representation of name matching parsing phase . . . . 15

6 Percentual distribution of points for A* conferences . . . 16

7 Percentages and absolute numbers of publications per conference 18

8 Percentual distribution of points for all papers . . . 19

9 Rankings and publications of major technology companies . . . . 20

9 Rankings and publications of major technology companies (cont.) 21

10 Percentual share of points per year of major tech companies . . . 22

10 Percentual share of points per year of major tech companies (cont.) 23

11 Machine learning techniques - artificial neural networks and SVM's 25

List of Tables

1 Fields of Research, with number of publications . . . 11

2 Example of a publication with the distribution of points . . . 14

3 List of A* conferences . . . 27

4 general - table containing publications . . . 28

5 authors - table containing authors . . . 28

6 affiliation - table containing affiliations of authors . . . 28

7 affiliation coord - table containing all affiliations per author including geographical information . . . 29

1 Introduction

Life without information technology is hard to imagine. Humans find themselves increasingly intertwined with technology every day. This resonates in our daily lives, independent of one's age, education or profession. It is interesting to see how companies handle the increasing amount of technology in their own operational activities and in their surroundings. Organizations like Google, Facebook, Twitter and others are impacting our lives more and more each day, and these companies conduct their own research to invent new technologies.

While conducting this research, an article appeared in The Wall Street Journal on July 14th, 2017 about Google paying professors to conduct research in Google's best interest (Mullins & Nicas, 2017). Professors send their research to Google for suggestions, but they do not disclose their ties to Google when publishing their research. Even though the researchers are not always aware of refraining from naming Google, they receive large sums of money to conduct the research. Google maintains a list of subjects that it finds interesting and searches for willing authors. Google pays for almost all expenses without getting the name Google published anywhere. According to Robin Feldman, a professor at California Hastings College of Law, academics run the risk of "being seen as lobbyists instead of scholars". Google of course denies the allegations that it pays professors to change or control their positions. Even though the events from the newspaper article cannot be found in the data that will be analyzed, this newspaper article raised the necessity for the research conducted in this thesis.

Most studies that research the influence of companies did not focus on academic publications. In section 2, more insight will be given into what related research has been done in the past. Because of the absence of previous research and the growing influence of information technology companies on multiple aspects of life, the main research question of this thesis is: Are information technology companies becoming more influential in scientific research?

To answer that question, the main question has been split up into the following subquestions:

RQ1: Has the share of overall industrial affiliations in relation to all publications increased?

RQ2: Have scientific activities grown with the size and importance of technology companies?

RQ3: Has the share of individual technology companies in relation to all publications increased?

The goal of this research is to find out whether the influence of information technology companies has increased over the years. This will be done by analyzing the affiliations of the authors of scientific publications, as described in section 3. In section 4, the results of this thesis will be discussed. Finally, the conclusions section gives a brief summary and critique of the findings. Also, areas of further research are identified.

2 Related work

The subject that will be researched in this thesis is quite new. Earlier work mainly focuses on the influence of information technology itself on society, business, et cetera (Baloh & Trkman, 2003). Previous research has examined the role of public research in industrial R&D (Cohen, Nelson, & Walsh, 2002), stating that it is very important for generating new ideas. No publications have been found that turn the question about influence around to focus on the influence of a certain party, or collection of parties, in the area of information technology on scientific research.

In 2006, an article was published in Research Policy about the influence of Research & Development and innovation strategies between companies and universities. From a business perspective, the alliances were examined and comparisons were made between companies and universities on the one hand and companies and other external partners on the other hand (Bercovitz & Feldman, 2007). While this article focused on the relation itself, this thesis will focus on the broader picture of the influence of companies on research.

In an article written by Critchley & Nicol, support for research is assessed depending on the source of funding. The impact of commercialization is noticeable, with private funding resulting in lower support for the research (Critchley & Nicol, 2011). Research with funding from, or conducted by, large technology companies could count on less support from the public.

Research conducted by Cohoon, Nigai, and Kaye (2011) investigated the number of publications by women and their productivity. A recent study by Agarwal, Mittal, Katyal, Sureka, and Correa (2016) examined the gender imbalance in computer science research to explore the scientific output of women. Other bibliometric studies include a study by Tijssen and Van Leeuwen (2006), where the impact of academic science on industrial research is examined, and a study by Teixeira and Mota (2012), in which the influence of literature on university-industry links is investigated.

Given these interesting bibliometric studies and the fact that no other study focuses on the influence of companies on scientific research, it is interesting to explore this subject. The hypothesis that will be tested is that the influence of information technology companies has increased over the years, especially since the year 2000.

3 Methodology

In this section, the methodology of this thesis will be discussed. To perform the analyses required to answer the research question, data needs to be acquired. Fortunately, there are many sources the data can come from. However, many of these data sources do not offer downloadable datasets. An alternative to using a downloadable dataset is scraping the contents of a website. In an ideal situation, the data needed for this thesis should consist of records of scientific publications, their corresponding authors and the affiliations of said authors.

3.1 Data sources

Prior to commencing the study, the different possible data sources were explored. The following parties possess data that could potentially be useful:

1. Google Scholar (Google, 2017)

2. Association for Computing Machinery (ACM) (ACM, 2017)

3. DBLP.org (University of Trier, 2017)

To assess whether a data source is fit for use, an analysis per candidate has been made. Also, the available information per party for a certain publication will be discussed to determine whether there is enough data to use.

3.1.1 Google Scholar

Google Scholar is a search engine for scientific publications, books, abstracts and more. Just like its big brother Google Search, it indexes all kinds of scholarly literature for easy searching. Unfortunately, Google Scholar does not offer a structured downloadable dataset. Scraping Google Scholar would not be an option since its search is not advanced enough to filter on certain fields of research, conferences or other criteria meaningful to this thesis. Also, Google Scholar contains other types of content, like patents, which go beyond the scope of this thesis. Furthermore, Google Scholar only links to the relevant provider like JSTOR, Elsevier and other parties, which would make building a scraper a time-consuming and error-prone task. The information that is available on Google Scholar about the paper "Algebraic model counting" by Kimmig, Van den Broeck, and De Raedt (2017) is shown in figure 1.

Figure 1: Information available on Google Scholar

3.1.2 Association for Computing Machinery

Association for Computing Machinery (ACM) is a society for computing professionals like researchers, professors and other educators. ACM brings those people together in several ways, through conferences and Special Interest Groups (SIGs), representing "the world's leading thinkers in computing and information technology". Relevant conferences and SIGs are SIGMOD (Management of Data), PODS (Principles of Database Systems) and SIGIR (Information Retrieval), to name a few. Annual conferences and workshops are organized to support and boost scientific research in the computing field. For this thesis, the research that falls under the ACM umbrella is the most relevant (Association for Computing Machinery, 2017).

ACM has a Digital Library which contains all the scientific research that falls under ACM. Unlike Google Scholar, the search engine of the Digital Library links to internal pages containing full-text articles, detailed information about the authors, et cetera. The Digital Library also contains a network of connections between authors, universities and other relevant parties.

Unfortunately, ACM also does not offer a downloadable dataset. The option of scraping the ACM Digital Library was also explored, but unfortunately ACM is heavily opposed to scrapers. This would mean that the scraper could only be used at a very slow pace to prevent its discovery. There is research available about scraping the Digital Library (Bergmark, 2001), but those researchers received a local digital copy of the library with the sole goal of exploring the possibilities of analyzing the data, not the retrieval of the data itself. That means that using the ACM Digital Library is also not an option. The information that is available on ACM is shown in figure 2.

Basic information like the journal the paper was published in, the abstract, the authors and their affiliations is immediately available on this page. As said earlier, the information about affiliations of authors is crucial for answering the research question.

Figure 2: Information available on ACM

3.1.3 DBLP

A third option that was explored is DBLP.org. DBLP is a digital library containing nearly four million publications, from over 5,000 conferences and 1,500 journals. DBLP is a service from the University of Trier and its goal is to offer a large open bibliographic database of scientific publications. DBLP only displays some basic information about papers in journals and conferences. It offers an advanced search engine with which you can search by author name, paper title, venue, year published, and more. In figure 3, the search results for a query on "Guy Van den Broeck" are shown, with the information that is available on the aforementioned publication.


DBLP shows authors' names, where a paper was published, page numbers and year of publication, and furthermore only links to a web page where the publication can be downloaded. In this particular case, DBLP links to an electronic edition of this publication that is available on DOI.org, but this depends on where the paper was published. While DBLP does not display all the information that is contained in the database, an XML dump file is offered for download.

Figure 3: Information available on DBLP

3.1.4 Conclusion

The requirements for the data needed to answer the research question are:

• Either a structured dataset or a scrapable data source, containing:

• records of scientific publications

• information about the author(s)

• the affiliations of said authors

Unfortunately, none of the three possible data sources met the necessary requirements. There was either no downloadable version available or the available information had its shortcomings, e.g. missing information about affiliations. Further exploration of possible data sources was necessary. Previous bibliographical studies used large datasets, but only one recent paper used a (portion of a) dataset of almost 4 million records to research the distribution of female authors in computer science publications (Agarwal et al., 2016), which according to the authors is the biggest dataset used to date (Agarwal, 2016). The dataset used in that research is available for free from the Data section on the website of Mendeley, a powerful reference manager for students and educators (Elsevier, 2017). This dataset will be used in this thesis. In the next sections, the dataset will be described and analyzed.

3.2 Data acquisition and description

In this section, the structure and other details of the data will be discussed. The dataset is based on an XML dump file from DBLP, but the parser that the authors have used has not been published. This means this thesis uses the dataset as built from a DBLP snapshot from September 17, 2015. It contains 146,167 records of scientific publications from 81 Computer Science Research (CSR) conferences and consists of seven Structured Query Language (SQL) files, of which the schemas of the relevant tables can be found in appendix B. The most important elements of this dataset are the papers, authors and their affiliations.

3.2.1 Papers

Table 4 in appendix B shows the schema of the papers SQL table. This table contains 146,167 papers from 81 conferences and 5 publishers. Additional information includes the year of publication, the domain a paper belongs to, and a URL to the publisher's page for that paper. A paper can have multiple authors; on average, a paper has at least two authors. The papers' publication dates range from 1960 until 2015, with the number of publications per year displayed in figure 4.

Figure 4: Amount of publications per year

3.2.2 Authors

Every publication has one or more authors. These authors are stored in the author SQL table, which can be found in table 5 in appendix B. The total number of records in this table is 410,568. These records are non-unique, because this data has a many-to-many relationship with the publications table. The total number of unique authors in the dataset is 126,094.

3.2.3 Affiliations

The schema of the original affiliations data is shown in table 6 in appendix B. This SQL table contains 306,644 records. The first record in this table originates from 1960, but no affiliation is stored for this record. The first affiliation that is stored originates from 1969. In total, there are 63,313 unique, readable and usable affiliations. In section 3.3.1, the classifier for determining academic or industry affiliation will be discussed.

The dataset also includes a separate table containing author-affiliation pairs with geographical information about the affiliation. The structure of this table can be found in table 7 in appendix B.

3.2.4 Linking publications to conferences

As can be seen in tables 4 and 6 in appendix B, both tables contained a column (k) with a unique identifier in the format:

conf/*conference name*/*unique slug*

The second of the three parts of that identifier refers to an acronym of a conference. Dr. Maarten Marx provided the data and code to link these conference acronyms to data from the CORE Conference Ranking. CORE (Computing Research & Education) is an association of university departments of computer science in Australia and New Zealand (CORE, 2016). The CORE Executive Committee manages a ranking of conferences ranging from A* to C ratings, Australasian and Unranked. In this thesis, only the A* conferences are taken into account, as these are the leading and thus most important conferences globally. In total, there are 31 relevant A* conferences. A full list of all conferences can be found in appendix A.
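To make the linking step concrete, the acronym can be pulled out of such an identifier with a few lines of Python. The helper below is an illustrative sketch, not part of the thesis code, and the example key is hypothetical:

```python
def conference_acronym(key):
    """Extract the conference acronym from a DBLP-style key.

    Keys have the form 'conf/<conference name>/<unique slug>'; the middle
    part is the acronym that can be joined against the CORE ranking data.
    """
    parts = key.split("/")
    if len(parts) != 3 or parts[0] != "conf":
        return None  # not a conference record
    return parts[1]

acronym = conference_acronym("conf/sigmod/ExampleSlug17")  # "sigmod"
```

Non-conference keys (e.g. journal records) yield `None` and can be filtered out before joining.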

As the full names of the conferences were not available in the initial dataset, this CORE dataset was used to retrieve those full names including the Field of Research (FoR) for that conference. The Fields of Research include:

Code    Name                                          # Publications
801.0   Artificial Intelligence and Image Processing  6,768
802.0   Computation Theory and Mathematics            7,166
803.0   Computer Software                             8,637
804.0   Data Format                                   11,719
805.0   Distributed Computing                         7,225
806.0   Information Systems                           14,928
1006.0  Computer Hardware                             2,071
Total                                                 58,514

Table 1: Fields of Research, with number of publications

This list enables the filtering of high-quality publications, analysis per conference and providing some background information about those conferences.

After filtering for the A* conferences, a total of 58,514 unique publications remain, with 66,652 unique authors linked to these publications. The total number of usable affiliations for data analysis is 171,302. What these numbers represent will be analyzed in the next sections.

3.3 Data analysis

The following tools were used to analyze the data:

• MySQL (any distribution, v5.6.35)

The dataset comes in Structured Query Language files and needs to be imported in a MySQL database to work with the data.

• Anaconda 3 (v4.3.14, Python 3.5.3 Scientific Distribution)

The programming language Python will be used to analyze the data. Python is a high-level, efficient and fast programming language which makes it very useful for handling large amounts of data. The Anaconda distribution is tailored for data science.

• SQLAlchemy (v1.0.12)

SQLAlchemy is a Python package for connecting to databases such as MySQL and interacting with them.

• Pandas (v0.19.2)

Pandas is a Python library for high-performance data analysis. Due to the way it is built, it is capable of handling large amounts of data efficiently.

• Matplotlib (v1.5.1)

Matplotlib is a Python library for visualizing data. The implementation of Matplotlib in Pandas makes it very easy to plot charts and graphs.

A great benefit of using the tools above is that there is a lot of freely available online documentation. Since all of these tools are free to use, the community is big, and next to the documentation, a lot of Question & Answer websites contain free-to-use code samples.

The full code that will be used to analyze the data will be available on https://github.com/joostverkaik/influence-tech-companies.
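To illustrate how these tools fit together, the sketch below loads a table into a Pandas DataFrame. An in-memory SQLite database stands in for the MySQL server, and the table and column names are assumptions based on appendix B:

```python
import sqlite3

import pandas as pd

# SQLite stands in for the MySQL database here; in the actual setup,
# SQLAlchemy's create_engine("mysql://...") would provide the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE general (k TEXT, year INTEGER)")  # schema assumed
conn.executemany(
    "INSERT INTO general VALUES (?, ?)",
    [("conf/sigmod/Example05", 2005), ("conf/icde/Example07", 2007)],
)

# Pandas can read directly from any DB-API connection.
papers = pd.read_sql("SELECT * FROM general", conn)
```

From here, the DataFrame can be grouped, filtered and plotted with the libraries listed above.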

3.3.1 Academic/industry classifier

At this point, there was no way to distinguish academic affiliations from industry affiliations. To perform this partitioning task, a classifier has been built; the code for this classifier can be found in appendix C. The task at hand was analyzing the different affiliations to construct a set of rules that identify the type of each affiliation.

One of the tools used was a list of all universities in the world, used to quickly filter out many of the academic affiliations. To accomplish this, a scraper was used that collects all information from the univ.cc website (endSly, 2015). This resulted in a comma-separated values (CSV) file containing 9,374 universities from around the world. An issue of the dataset was the ambiguity of the affiliations. One of the universities with many affiliated authors is the Massachusetts Institute of Technology, or MIT for short. The dataset contained multiple variants of this name, like MIT, M.I.T., et cetera. If a university name was matched against the aforementioned list of universities, an academic label was attached to the affiliation.

A caveat of the dataset's affiliation records was the plurality of affiliations in a single record. Possible formats were two affiliations split by " and " or split by a forward slash (/). The classifier checks the affiliation for these two possible delimiters and splits the affiliation string on that delimiter when needed. Since splitting a string in Python 3 results in a list, a for loop is used to check each part. In addition, an affiliation can be split by commas, e.g. 'Department of Applied Mathematics, The Johns Hopkins University'. The classifier splits this string into multiple parts and proceeds to check every part of this affiliation.

The classifier then tries to match parts of the affiliation on the following (sub)strings:

• univ*

• ku (for Katholieke Universiteit Leuven)

• tu (for technical universities)

• dept, department

• faculteit, faculté, faculty

• institut*

• academy

• academia

• college

• école/ecole/ens (for e.g. École polytechnique, École Nationale Supérieure)

• MIT, M.I.T.

• names like UC Santa Barbara, UC Berkeley, UCLA, et cetera

The asterisk (*) denotes a wildcard, matching university, universiteit, universidad, universität, et cetera. These (sub)strings catch the most common cases that would otherwise be missed when determining whether an affiliation is academic or industrial. Together with the list of universities, they already match a lot of the affiliations.

If one of the two methods above (list of universities, substrings) gives a match, the match is saved to a Python list object. When all parts of the affiliation have been checked, the length of this list is inspected. If there are no academic matches, the affiliation is industrial. If there is at least one match, the affiliation is considered academic.
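The rules above can be condensed into a small sketch. The full classifier is in appendix C; the university list and marker set below are heavily abbreviated assumptions, so this is only an illustration of the approach:

```python
# Abbreviated stand-ins for the scraped univ.cc list and the substring rules.
UNIVERSITIES = {"massachusetts institute of technology", "university of amsterdam"}
ACADEMIC_MARKERS = ("univ", "dept", "department", "faculteit", "faculty",
                    "institut", "academy", "academia", "college", "ecole")

def classify_affiliation(affiliation):
    """Return 'academic' or 'industry' for one affiliation record."""
    # A record may hold several affiliations, split by " and " or "/".
    parts = [affiliation]
    for delim in (" and ", "/"):
        if delim in affiliation:
            parts = affiliation.split(delim)
            break
    academic_matches = []
    for part in parts:
        # A single affiliation may itself be comma-separated.
        for piece in part.split(","):
            s = piece.strip().lower()
            tokens = s.replace(".", "").split()
            if (s in UNIVERSITIES
                    or "mit" in tokens  # MIT / M.I.T. as a whole token only
                    or any(marker in s for marker in ACADEMIC_MARKERS)):
                academic_matches.append(piece)
    # No academic match anywhere means the record counts as industry.
    return "academic" if academic_matches else "industry"
```

Note that MIT is checked as a whole token here; a plain substring test would wrongly label an affiliation like "Mitsubishi Electric" as academic.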

After removing the empty affiliations, classifying the industrial and academic affiliations and splitting up multiple affiliations into separate records, 169,941 affiliations were left, of which 44,468 were industrial and 125,473 academic. These numbers are not unique affiliations (e.g. a list of companies and universities like "Google" or "New York University"), but numbers of records in author-affiliation pairs ("John Doe" is linked to "New York University" for paper X).

3.3.2 Assigning points to publications

One of the methods to analyze the data is to assign points to publications. Each publication gets one full point to distribute among its affiliations. Table 2 shows an example publication (Feris, Raskar, Longbin Chen, Kar-Han Tan, & Turk, 2005) with 5 authors and the distributed points. The Name column contains the name of the author, the Affiliation column shows the place of employment for that author. In the Type column, A stands for academic and I stands for industry. The paper is published in the International Conference on Computer Vision (ICCV). FoR stands for Field of Research, which refers to information that was retrieved and discussed in section 3.2.4. Out of the possible one point, 0.6 is allocated to academics and 0.4 to industry. The affiliation data contained either academic affiliations or industry affiliations, not combinations for one author. Therefore, points allocated to a certain author did not have to be split up for that person.

Name           Affiliation          Type  Acronym  FoR    Points
Longbin Chen   UC Santa Barbara     A     ICCV     801.0  0.2
Rogerio Feris  UC Santa Barbara     A     ICCV     801.0  0.2
Matthew Turk   UC Santa Barbara     A     ICCV     801.0  0.2
Ramesh Raskar  Mitsubishi Electric  I     ICCV     801.0  0.2
Kar-Han Tan    Epson Palo Alto Lab  I     ICCV     801.0  0.2
Total points                                              1.0

Table 2: Example of a publication with the distribution of points
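The point allocation can be sketched as follows; the function below is an illustrative reconstruction, not the thesis code:

```python
def distribute_points(affiliation_types):
    """Split one publication point equally over its affiliations and
    sum the shares per type ('A' for academic, 'I' for industry)."""
    share = 1.0 / len(affiliation_types)
    totals = {}
    for aff_type in affiliation_types:
        totals[aff_type] = totals.get(aff_type, 0.0) + share
    return totals

# The five affiliations of Table 2: three academic, two industrial.
points = distribute_points(["A", "A", "A", "I", "I"])
```

For the Table 2 example this works out to roughly 0.6 for academia and 0.4 for industry, matching the table.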

3.3.3 Company name disambiguation

The ambiguity in academic affiliations also occurred in industry affiliations. Multiple versions of company names exist in the dataset, so this needs to be resolved in order to analyze the affiliations by company. To do this, a unique, alphabetically sorted list of all company names was created and a for loop is used to go through all names. A key-value Python dictionary is created with the main company name as key and all subsidiary company names in a Python list. At every n-th position in the list, that value is compared with the value at position n+1. If the value at position n+1 starts with the value at position n, the name is added to the dictionary under the right key.
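A minimal sketch of this sorted-prefix grouping might look as follows (the function and example names are illustrative, not the thesis code):

```python
def group_company_names(names):
    """Group name variants under a main name using a sorted prefix scan.

    After alphabetical sorting, a variant such as 'Microsoft Inc.' directly
    follows 'Microsoft', so comparing each name with its successors collects
    subsidiary names under the right dictionary key.
    """
    groups = {}
    current_key = None
    for name in sorted(set(names)):
        if current_key is not None and name.startswith(current_key):
            groups[current_key].append(name)  # variant of the current key
        else:
            current_key = name                # new main company name
            groups[current_key] = []
    return groups

groups = group_company_names(
    ["Microsoft", "Microsoft Inc.", "Microsoft Inc", "Google", "Google LLC"])
# groups == {"Google": ["Google LLC"],
#            "Microsoft": ["Microsoft Inc", "Microsoft Inc."]}
```

This only catches variants that share a literal prefix, which is why a similarity measure is needed as a fallback.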

If this method does not give a match, the two company names will be compared using the Jaro-Winkler distance (or similarity) (Winkler, 1999). This method computes a similarity score between two strings based on the number of matching characters and transpositions, with a bonus for a shared prefix. The formula for the Jaro-Winkler distance is shown in (1).

\[
d_j =
\begin{cases}
0 & \text{if } m = 0\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m-t}{m}\right) & \text{otherwise}
\end{cases}
\qquad
d_w = d_j + \ell\,p\,(1 - d_j)
\tag{1}
\]
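Spelled out in code, formula (1) can be sketched in pure Python, where m is the number of matching characters, t the number of transpositions, ℓ the length of the common prefix (capped at 4) and p the prefix scaling factor, usually 0.1. The thesis itself uses a library implementation, so this is only an illustration:

```python
def jaro(s1, s2):
    """Jaro similarity: matching characters m and transpositions t."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    # Characters match if equal and within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    used = [False] * len2
    matched1 = []
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched1.append(c)
                break
    matched2 = [s2[j] for j in range(len2) if used[j]]
    m = len(matched1)
    if m == 0:
        return 0.0
    t = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    dj = jaro(s1, s2)
    ell = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        ell += 1
    return dj + ell * p * (1 - dj)
```

The classic textbook pair "martha"/"marhta" scores about 0.961, well above the 0.95 threshold used below, so such transposed spellings would be merged.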

To increase the chances of matching the right name, filtering needs to be done first. This idea comes from a paper by Raffo and Lhuillery (2009), where name matching is split up into three phases: parsing, matching and filtering.

Parsing is the use of sanitation methods and is done by 1) removing legal terms like co. and ltd., 2) removing all punctuation, 3) tokenization and 4) removing stopwords. A Python library called Jellyfish contains a Jaro-Winkler distance function to calculate the similarity and match the company names (Turk, 2017). A threshold of 0.95 is used to limit the number of false positives, especially since a large number of the company names are already matched by using an alphabetically sorted list. Filtering is the last phase, in which several techniques are used to reject false positives from the previous stages. Since the complexity of this list is low and the confidence in the previous methods is quite high, filtering is not needed in this case. A schematic representation of the method used is shown in figure 5.
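The four parsing steps can be sketched as follows. The legal-term and stopword lists below are abbreviated assumptions, since the thesis does not publish the exact lists:

```python
import re
import string

LEGAL_TERMS = {"inc", "co", "ltd", "corp", "llc"}   # assumed, abbreviated
STOPWORDS = {"the", "of", "and", "for"}             # assumed, abbreviated

def parse_company_name(raw):
    """Apply the four sanitation steps to one company name."""
    # 1) remove legal terms like co. and ltd. (dot-insensitive)
    tokens = [t for t in raw.split()
              if t.strip(string.punctuation).lower() not in LEGAL_TERMS]
    # 2) remove all punctuation
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]
    # 3) tokenization: drop tokens emptied by the previous step
    tokens = [t for t in tokens if t]
    # 4) lowercase and remove stopwords
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]
```

Applied to the example from figure 5, `parse_company_name("IDELIX Software: Inc.")` yields the tokens 'idelix' and 'software'.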

Figure 5: Schematic representation of the name matching parsing phase. [The figure shows the example "IDELIX Software: Inc." flowing through the parsing steps: start parsing phase → remove legal terms → remove all punctuation → "IDELIX Software" → tokenization → 'IDELIX', 'Software' → lowercase tokens, remove stopwords → 'idelix', 'software'.]

Almost all company names were matched, but there were a few false positives. The script had problems matching facebook and Facebook, because it also matched Face Therapie. These names were matched manually. Overall, the matching script worked well. In total, 336 different versions of IBM's name and 122 different versions of Microsoft's name were found. To illustrate, some of the matched versions of the names of Microsoft and IBM are:

• Microsoft Reasearch Asia

• Microsoft China R&D; Group

• Microsoft Inc

• Microsoft Inc.

• Microsoft Research — New England Campus

• Microsoft Reserach SVC

• Microsoft-INRIA joint center

• Microsoft Research Silocon Valley

• IBM (U. K.) Ltd.

• IBM — Heidelberg Scientific Center

• IBM — Los Angeles Scientific Center

• IBM-Research Almaden

• IBM Res. Div.

• IBM T.J. Watson Research Center Yorktown Heights

• IBMT. J. Watson Research Center

After matching all the names, each row with an industry affiliation is matched against this list of disambiguated company names and given a general company name like IBM, Google, Microsoft, et cetera.


3.3.4 Company rankings

American magazine Fortune publishes lists of the biggest companies each year by looking at their total revenues for a given year, like the Fortune US 500 and the Fortune Global 500. Unfortunately, there is no usable archive of the Fortune Global 500 list, which means that this thesis will only focus on the Fortune US 500 list to explore the connection between the ranking of a company and its scientific activities.

A scraper was used to scrape all editions of the Fortune US 500 list from 1955 until 2005 (MiguelR90, 2016), supplemented with a third-party archive of the years 2006-2014 (TopForeignStocks, 2017) and manual retrieval of the year 2015. The full Comma-Separated Values (CSV) file will be published on this thesis' GitHub project. The CSV file was loaded into a Pandas DataFrame for comparing the rankings with the scientific activities.
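A hypothetical three-row sample shows the kind of DataFrame this yields; the column layout and the rank values below are made-up assumptions, as the actual CSV may differ:

```python
import io

import pandas as pd

# Hypothetical sample of the scraped rankings; the real file spans 1955-2015.
sample = io.StringIO(
    "year,rank,company\n"
    "2014,34,Microsoft\n"
    "2014,46,Google\n"
    "2014,23,IBM\n"
)
fortune = pd.read_csv(sample)

# A company's rank can then be looked up by name and year.
row = fortune[(fortune["company"] == "Microsoft") & (fortune["year"] == 2014)]
rank = int(row["rank"].iloc[0])
```

With the company names disambiguated as in section 3.3.3, these per-year ranks can be joined against the publication counts.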

4 Results

4.1 Academy/industry distribution per conference

First, the distribution of points per A* conference has been calculated and visualized using a chart. More information on which conferences belong to this category can be found in section 3.2.4. The calculation has been done by grouping data on conference acronym and affiliation type. The points for the industry and the academic categories have been summed and divided by the total number of points. This is shown in figure 6.
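In Pandas terms, this grouping can be sketched on a toy set of point records; the values below are made up purely for illustration:

```python
import pandas as pd

# Toy author-affiliation point records in the shape described above.
df = pd.DataFrame({
    "acronym": ["ICCV", "ICCV", "ICCV", "SIGMOD", "SIGMOD"],
    "type":    ["A",    "A",    "I",    "A",      "I"],
    "points":  [0.4,    0.2,    0.4,    0.5,      0.5],
})

# Share of each affiliation type within a conference: sum the points per
# (conference, type) and divide by the conference total.
totals = df.groupby("acronym")["points"].transform("sum")
df["share"] = df["points"] / totals
per_conference = df.groupby(["acronym", "type"])["share"].sum()
```

For this toy data, industry holds 40% of the ICCV points and 50% of the SIGMOD points; the real figure 6 is the same computation over all A* records.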


The conferences with the highest share of industry affiliations are S&P, SIGMOD and WWW. In section 4.1.1, we will focus on a few conferences to see whether the data can tell us more.

4.1.1 Closer look at conferences

In figure 7, individual charts for the relevant conferences are shown. As these conferences illustrate the data best and give us the most information, the figures for the other conferences have been placed on GitHub (https://github.com/joostverkaik/influence-tech-companies).

On the x-axis, the years are shown. The left y-axis contains the percentage of industry affiliations, while the right y-axis contains the absolute number of points for that conference in a given year. The blue line represents the percentage of industry affiliations, the red line represents the absolute number of industry points and the green line represents the absolute number of academic points.

In addition to the conferences S&P (Security & Privacy, 7g), SIGMOD (Management of Data, 7d) and WWW (World Wide Web, 7h), the following conferences were also observed:

• SIGCOMM - Data Communications (7a)
• ICDE - Data Engineering (7b)
• KDD - Knowledge & Data Mining (7c)
• PODC - Principles of Distributed Computing (7e)
• SOSP - Operating System Principles (7f)

A slight increase in industry affiliations can be seen at SIGCOMM, but overall the distribution of points stays roughly the same over the years, apart from a few spikes. The SOSP conference shows a moment where there were more industry affiliations than academic affiliations. The S&P conference even has a period in which only industry affiliations occurred. All other charts (available on GitHub) do not show any increase in industry affiliations.


Figure 7: Percentages and absolute numbers of publications per conference ((a) SIGCOMM, (b) ICDE, (c) SIGKDD, (d) SIGMOD, (e) PODC, (f) SOSP, (g) S&P, (h) WWW)

4.2 Affiliations over the years

The next visualization that gives insight into the proportion of academic versus industry affiliations is the percentage distribution of points between these two categories. In figure 8, the percentage distribution of points for all papers over the years is displayed.


Figure 8: Percentage distribution of points for all papers

In the 1970s, several spikes are noticeable; in 1974, not a single industrial affiliation appears in the dataset. In the 1980s and 1990s, the share was steady and even increased slightly. Since then, however, a downward trend can be seen, with the percentage dropping below 25% after 2000.

4.3 Company rankings and scientific activities

Since information technology companies are the focus of this thesis, a list of the biggest technology companies is used to zoom in on the data. As explained in section 3.3.4, the ranking data comes from Fortune Magazine's Fortune US 500 list. Together with the disambiguation of company names from section 3.3.3, it is straightforward to match the names from the Fortune list against the names in the DBLP dataset. Affiliations like "Microsoft Research - Inria Joint Centre" are matched to the name Microsoft, and "IBM T.J. Watson Research Center. P.O. Box 704" is matched to the name IBM.
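A minimal sketch of this matching step, assuming a simple case-insensitive substring test (the actual implementation may differ); the two example affiliations are taken from the text above:

```python
# Hypothetical list of disambiguated Fortune company names
fortune_names = ["Microsoft", "IBM", "Intel"]

affiliations = [
    "Microsoft Research - Inria Joint Centre",
    "IBM T.J. Watson Research Center. P.O. Box 704",
    "Stanford University",
]

def match_company(affil, companies):
    """Return the first Fortune company whose name occurs in the affiliation."""
    for company in companies:
        if company.lower() in affil.lower():
            return company
    return None

matches = [match_company(a, fortune_names) for a in affiliations]
print(matches)  # ['Microsoft', 'IBM', None]
```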

Fortune magazine's website provides a filtering mechanism to select attributes like Sector, Industry, location of headquarters and Female CEO, et cetera (Fortune magazine, 2017). By setting the Sector filter to 'Technology', the following list is displayed (the rank is shown behind the company name):

1. Apple (3)
2. Amazon.com (12)
3. Alphabet (27)
4. Microsoft (28)
5. IBM (32)
6. Dell Technologies (41)
7. Intel (47)
8. Cisco Systems (59)
9. Oracle (81)
10. Facebook (98)

First, we can eliminate Dell from this list, as this company only published two papers between 1990 and 2015; an explanation for this low number could not be found. Second, we add Yahoo to this list, as Yahoo used to be a big technology company until around 2011/2012. It will not come as a surprise that those years also marked a decline in both ranking and publications.

In figure 9, all visualizations for the aforementioned companies are shown. On the x-axis, the years are shown, while the left y-axis contains the Fortune 500 ranking and the right y-axis contains the amount of publications. The green line represents the Fortune 500 rank, the blue line represents the amount of publications per year.

While Apple (9b) was doing research in the 1990s, this declined in the new millennium while its ranking kept improving. That is easy to explain: Apple released its first iPod in 2001, its first MacBook in 2006 and its first iPhone in 2007. Afterwards, its rank as a company kept rising.

Amazon (9c) is a relatively new technology company, especially given the growth it has experienced. The expectation is that Amazon will be doing more research in the coming years, while its rank will probably remain stable for the time being.

Alphabet (parent company of Google) (9d) is also relatively new, but an immense player. It has been on the Fortune 500 list since 2006 as Google, with increasing numbers of research activities. Alphabet seems to devote more and more resources to (public) research as it grows bigger. The same can be said for Facebook (9f).

Figure 9: Rankings and publications of major technology companies ((a) Yahoo, (b) Apple, (c) Amazon, (d) Alphabet (formerly Google), (e) Cisco, (f) Facebook, (g) IBM, (h) Intel, (i) Microsoft, (j) Oracle)

IBM (9g) used to be a pioneer in technology (and still is in certain areas of expertise), but it was surpassed by Microsoft and others in the consumer market. Windows won over OS/2 and so Microsoft gained and served a greater audience, which can be seen in its ranking. In 2015, it held the 23rd position on the Fortune 500 list. However, IBM is still an important name in research because of its other major activities, although drops can be noticed in the last 10 years.


Microsoft (9i) shows both a climbing rank and an increased amount of publications. As Microsoft only started appearing in the ranking in 1995, we cannot say with complete certainty that its business grew immensely when it started selling its Windows operating system, but that is very plausible. As of 1993, it started devoting more of its available resources to conducting research.

Both Oracle (9j) and Cisco (9e) have been in the technology sector for quite some time, with both companies doing a steady amount of research and climbing the Fortune 500 list, but their charts do not reveal any clear trend.

Now that we have examined the absolute figures with respect to the companies' ranks, we have to put them in context. For each of these companies, the amount of points given to the company is compared to the total amount of points given to all publications in that year.

The charts in figure 10 show the percentage of points allocated to a company in relation to the total amount of points given to all papers. The x-axis contains the years and the y-axis contains the percentage. This visualizes how each company's share of the points handed out to all papers changes over time. For Yahoo (10a), Alphabet (10d), Facebook (10f), partly Intel (10h) and above all Microsoft (10i), this share increases over the years.

Figure 10: Percentage share of points per year of major tech companies ((a) Yahoo, (b) Apple, (c) Amazon, (d) Alphabet (formerly Google), (e) Cisco, (f) Facebook, (g) IBM, (h) Intel, (i) Microsoft, (j) Oracle)

Although the percentages are small, which is to be expected given the amount of papers in the dataset, this shows that some companies have an increasing presence in scientific research. Companies like IBM (10g) and Apple (10b) see their share in scientific research decline, for the same reasons as described with figure 9.


5 Conclusions

The goal of this research was to find out whether the influence of information technology companies on scientific research has increased. The main research question is: Are information technology companies becoming more influential in scientific research? The hypothesis was that there was indeed an increase in the influence of technology companies, especially since the year 2000.

To answer the main research question, the following subquestions were formulated:

RQ1: Has the share of overall industrial affiliations in relation to all publications increased?

RQ2: Have scientific activities grown with the size and importance of technology companies?

RQ3: Has the share of individual technology companies in relation to all publications increased?

Research question 1 can be answered by looking at the A* conferences that this thesis focuses on, as well as the amount of points distributed to industrial affiliations in relation to the total amount of points over the years. Figure 8 shows that there was in fact a decline in industrial affiliations in the past decades: whereas this percentage was mostly above 25% before 2000, it dropped below 25% after 2000.

The charts in figure 7 show the distribution of affiliation points per conference. No significant increase in industry affiliations could be noticed, except for the S&P (Security & Privacy), SIGCOMM (Data Communications) and SOSP (Operating System Principles) conferences, which do show a positive change in the distribution of points for the industry category.

The answer to research question 2 has been given in section 4.3. By using Fortune magazine's list of the major technology companies, the data has been examined closely to see whether major technology companies invest in research as they grow bigger. We can see that the amount of publications has increased over the past couple of decades, and for some companies this rise goes hand in hand with their Fortune 500 rank, meaning that those companies invest time and money in public research and not only in in-house Research & Development.

These numbers have been put in context to answer research question 3, by calculating the share of those major technology companies in terms of their points with respect to the total amount of points allocated to all papers. This shows that of the major technology companies, Yahoo, Alphabet (Google), Facebook, Intel and Microsoft have been able to increase their share in public research over the years. It has to be noted that the percentages are very small (increases of between 0.10% and 4%), but that is to be expected since these companies are still small fish in a large pond of global research.

To conclude, even though the changes are small, technology companies are gaining some weight in scientific research. The hypothesis formulated at the start of this research turned out to be somewhat exaggerated: the overall influence of industry affiliations in research dropped slightly after 2000. However, a few individual companies have increased their presence in public research. Big changes have not been found, but remain a possibility for the coming years.


5.1 Discussion

Although the DBLP dataset was comprehensive, one of its shortcomings is that it is a snapshot of the DBLP database on September 17, 2015. It would be better to have a more recent snapshot covering both 2015 and 2016. Another shortcoming was the absence of publication titles and/or tags to enable, for example, a keyword analysis.

While the classifier for the affiliations filters out most of the affiliations correctly, building a classifier with machine learning techniques would enhance its performance and results. However, building and utilizing such a solution would have been too time-consuming for this research. Multiple types of machine learning techniques exist, like artificial neural networks, which mimic the working of the biological brain.

Artificial neural networks consist of layers of connections between synapses, with each synapse performing its own function, so that the input is classified by its characteristics. The field of artificial neural networks grew out of "learning by doing": it is grounded heuristically and by experimentation. Typically, a dataset is split into training and test sets, so the neural network can learn from the training data while its accuracy is measured on the test data. A schematic representation of a neural network can be found in figure 11a.

Another possible machine learning technique is the Support Vector Machine. Support Vector Machines are theoretically grounded, where artificial neural networks are not. They analyze data in order to classify new cases: the process is based on creating a numeric model and placing the data in a vector space based on its characteristics, making it possible to place new data in the same space and divide it into two categories. By fitting the best possible hyperplane to this data, the margin for error is kept as low as possible. This is illustrated in figure 11b.

Figure 11: Machine learning techniques: (a) artificial neural network, (b) Support Vector Machine

The steps for using machine learning for classification tasks mainly consist of letting the technique learn from training data, refining it with test data, and repeating this process when new data comes in. By using machine learning techniques, the risk of false positives and false negatives is reduced more than by manually defining rules. Nowadays, one can start experimenting with machine learning quite easily by using one of the following libraries, among others:

• scikit-learn, a Python package

• Weka, by the University of Waikato, New Zealand

• MLlib or Cloudera Oryx, extensions for Hadoop (open-source big data analysis software)
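As a minimal illustration of this approach, the sketch below trains a character n-gram TF-IDF model with a linear Support Vector Machine in scikit-learn on a handful of hand-labelled affiliation strings. This is not the classifier used in this thesis, and the sample data is invented; a real experiment would use a large hand-labelled subset of the DBLP affiliations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented, hand-labelled sample: 'A' = academic, 'I' = industry
texts = [
    "Department of Computer Science, University of Amsterdam",
    "Faculty of Science, KU Leuven",
    "School of Engineering, Stanford University",
    "MIT Computer Science and Artificial Intelligence Laboratory",
    "Microsoft Research, Redmond",
    "IBM T.J. Watson Research Center",
    "Google Inc., Mountain View",
    "Intel Labs, Santa Clara",
]
labels = ["A", "A", "A", "A", "I", "I", "I", "I"]

# Split the labelled data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Character n-grams cope well with abbreviations like "Univ." or "Dept."
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC())
clf.fit(X_train, y_train)

# Every prediction is one of the two affiliation classes
preds = clf.predict(texts)
print(clf.score(X_test, y_test))
```

Refining such a model as new labelled affiliations come in follows the same fit/score loop described above.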

Also, company name disambiguation could be improved, either with or without machine learning, to further eliminate false positives and false negatives.

Unfortunately, the internal research output of organizations cannot be taken into account, which might give a distorted image of the influence of companies. Also, in combination with a global company ranking archive, the geographical information in the dataset could be used to observe this subject from a geographical perspective, e.g. by continent.

While conducting this research, the desire for a more recent dataset instead of a slightly older one grew. For example, it would perhaps be better to contact the researchers who made the data available and use their tools to create a more recent dataset. Analyzing more companies, like the top 50 technology companies in the world, would also benefit this research. Not all data was available to complete that task, but it could be done if the research were conducted again.

5.2 Future work

As mentioned in section 5.1, building a neural network for classifying affiliations would be an idea for future research. Another idea is building a scraper for the ACM Digital Library to enrich the data with publication titles, tags, abstracts and other information, allowing other analyses to be performed, such as keyword analysis and grouping keywords.

Also, retrieving an archive of the Fortune Global 500 list would enlarge the possibilities for company-specific research. The publications come from research institutes and companies from all over the world, while the ranked companies only originate from the US; a global archive could therefore help in gaining more insights into region-specific data.

While the different Fields of Research were not examined in depth in this thesis, they might give more insights in future research.


Appendices

A List of A* conferences

Name | Acronym | FoR
ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication | SIGCOMM | 803.0
ACM International Conference on Knowledge Discovery and Data Mining | SIGKDD | 804.0
ACM International Conference on Mobile Computing and Networking | MOBICOM | 1006.0
ACM International Conference on Research and Development in Information Retrieval | SIGIR | 806.0
ACM SIG on Computer and Communications Metrics and Performance | SIGMETRICS | 1006.0
ACM SIGMOD-SIGACT-SIGART Conference on Principles of Database Systems | PODS | 804.0
ACM SIGOPS Symposium on Operating Systems Principles | SOSP | 803.0
ACM Special Interest Group on Management of Data Conference | SIGMOD | 804.0
ACM Symposium on Principles of Distributed Computing | PODC | 805.0
ACM Symposium on Theory of Computing | STOC | 802.0
ACM-SIGACT Symposium on Principles of Programming Languages | POPL | 803.0
ACM-SIGPLAN Conference on Programming Language Design and Implementation | PLDI | 803.0
ACM/SIAM Symposium on Discrete Algorithms | SODA | 802.0
Advances in Cryptology | CRYPTO | 804.0
Annual Conference on Computational Learning Theory | COLT | 801.0
IEEE International Conference on Computer Communications | IEEE INFOCOM | 805.0
IEEE International Conference on Computer Vision | ICCV | 801.0
IEEE Symposium on Foundations of Computer Science | FOCS | 802.0
IEEE Symposium on Logic in Computer Science | LICS | 802.0
IEEE Symposium on Security and Privacy | S&P | 802.0
International Conference on Data Engineering | ICDE | 804.0
International Conference on Human Factors in Computing Systems | CHI | 806.0
International Conference on Machine Learning | ICML | 801.0
International Conference on Software Engineering | ICSE | 803.0
International Conference on Very Large Databases | VLDB | 804.0
International Conference on the Theory and Application of Cryptographic Techniques | EuroCrypt | 804.0
International Joint Conference on Artificial Intelligence | IJCAI | 801.0
International Joint Conference on Automated Reasoning | IJCAR | 801.0
International Symposium on Symbolic and Algebraic Computation | ISSAC | 802.0
International World Wide Web Conference | WWW | 805.0
National Conference of the American Association for Artificial Intelligence | AAAI | 801.0


B Dataset SQL structure

Name | Type | Description
k | varchar | Unique ID of publication
year | int | Year of publication
conf | varchar | Conference name
crossref | varchar | String containing conference, year and auto-incremented number for publication in conference
cs | tinyint | Computer Science boolean
de | tinyint | Data Engineering boolean
se | tinyint | Software Engineering boolean
th | tinyint | Theory boolean
publisher | varchar | Publisher of conference
link | text | DOI link

Table 4: general - table containing publications

Name | Type | Description
id | bigint | Unique ID for author
k | varchar | Foreign key for crossref of paper
pos | int | Position of author in paper
name | varchar | Name of the author
gender | enum | Gender of the author (m, f, -)
prob | float | Probability of gender for name

Table 5: authors - table containing authors

Name | Type | Description
sno | bigint | Unique ID for affiliation
k | varchar | Foreign key for crossref of publication
name | varchar | Name of the author
affil | text | Name of the affiliation
year | int | Year of publication

Table 6: affiliation - table containing all affiliations per author

Name | Type | Description
sno | bigint | Unique ID for affiliation
k | varchar | Foreign key for crossref of publication
name | varchar | Name of the author
affil | text | Name of the affiliation
year | int | Year of publication
country | varchar | Name of country for affiliation
country_code | varchar | Code for country for affiliation
lat | decimal | Latitude for affiliation
lng | decimal | Longitude for affiliation

Table 7: affiliation_coord - table containing all affiliations per author including geographical information

C Academic/industry classifier

Listing 1: Python code for classifying academic/industry affiliations

import re

# Keywords and abbreviations that indicate an academic affiliation
ACADEMIC_KEYWORDS = ["univ", "ku ", "tu ", "dept", "department", "faculteit",
                     "faculté", "faculty", "institut", "academy", "academia",
                     "college", "ecole", "école", "ens ", "school"]
ACADEMIC_ABBREVIATIONS = ["MIT", "M.I.T.", "UCLA", "UCR", "UCSB", "UCSC",
                          "UCSD", "UCSF"]


def is_academic(part):
    """Check one affiliation component against the global universities list
    (the `univs` DataFrame loaded elsewhere), the academic keywords and the
    academic abbreviations."""
    if part in univs.university.values or part in univs.normalized.values:
        return True
    lowered = part.lower()
    if any(keyword in lowered for keyword in ACADEMIC_KEYWORDS):
        return True
    if any(abbrev in part for abbrev in ACADEMIC_ABBREVIATIONS):
        return True
    return part.startswith("UC ") or part.startswith("UC-")


def search_university(row):
    # Strip all HTML tags and non-breaking spaces from the affiliation
    val = re.sub(r'<[^>]+>', '', row.affil).replace('\xa0', ' ')

    res = {'I': [], 'A': []}

    # " and ", "/" and " & " usually separate multiple affiliations
    if " and " in val:
        parts = val.split(" and ")
    elif "/" in val:
        parts = val.split("/")
    elif " & " in val:
        parts = val.split(" & ")
    else:
        parts = [val]

    for part in parts:
        # A comma usually separates department, university and city
        if any(is_academic(component) for component in part.split(", ")):
            res['A'].append(part)

    # An affiliation without any academic component is labelled industry
    if len(res['A']) == 0:
        res['I'].append(val)

    # Store a unique list of the resulting affiliations on the row
    row['A'] = list(set(res['A']))
    row['I'] = list(set(res['I']))
    return row

D Grouping companies function

Listing 2: Python code for grouping company names

import string
from collections import OrderedDict

import nltk
from cleanco import cleanco
from jellyfish import jaro_winkler
from nltk.corpus import stopwords

# Grab all industry affiliations from the Pandas DataFrame
industry_affils = merged_df[merged_df['affiliation_type'] == 'I']
industry_names = industry_affils['I'].str.split(', ', n=1, expand=True)[0]

# Drop too-short names and names starting with a digit, then sort the list
sorted_industry = sorted([x.strip() for x in set(industry_names.dropna())
                          if len(x) > 1 and not x[0].isdigit()])


def sanitize(name):
    """Drop legal suffixes, punctuation and English stop words from a name."""
    translator = str.maketrans('', '', string.punctuation)
    clean = cleanco(name).clean_name().translate(translator)
    tokens = [t.lower() for t in nltk.word_tokenize(clean)]
    tokens = [t for t in tokens if t not in stopwords.words('english')]
    return ' '.join(tokens)


current_company = sorted_industry[1]
company_names = OrderedDict()
for i, company in enumerate(sorted_industry):
    if len(company) == 0:
        continue

    # Match the company name by looking at the start of the name
    if company.lower().startswith('facebook'):
        company_names.setdefault('Facebook', []).append(company)
    elif i + 1 < len(sorted_industry) and \
            (current_company.lower() in company.lower() or
             (len(current_company.split(' ')[0]) > 2 and
              current_company.split(' ')[0] in company)):
        company_names.setdefault(current_company, []).append(company)
    # Otherwise compare the sanitized names with the Jaro-Winkler similarity
    elif jaro_winkler(sanitize(current_company), sanitize(company)) > 0.95:
        company_names.setdefault(current_company, []).append(company)
    else:
        current_company = company
        company_names[current_company] = [company]

References

ACM. (2017). Association for Computing Machinery. Retrieved from http:// www.acm.org/

Agarwal, S. (2016, 1). DBLP Records and Entries for Key Computer Science Conferences (Vol. 1). Mendeley. Retrieved from https://data.mendeley .com/datasets/3p9w84t5mr/1 doi: 10.17632/3P9W84T5MR.1

Agarwal, S., Mittal, N., Katyal, R., Sureka, A., & Correa, D. (2016, 3).

Women in computer science research: what is the bibliography data telling

us? ACM SIGCAS Computers and Society , 46 (1), 7–19. Retrieved

from http://dl.acm.org/citation.cfm?doid=2908216.2908218 doi:

10.1145/2908216.2908218

Association for Computing Machinery. (2017). ABOUT SIGS.

Baloh, P., & Trkman, P. (2003). Influence of Internet and

Informa-tion Technology on Work and Human Resource Management. In

In-site. Retrieved from http://www.proceedings.informingscience.org/ IS2003Proceedings/docs/071Baloh.pdf

Bercovitz, J. E. L., & Feldman, M. P. (2007). Fishing upstream: Firm innovation strategy and university research alliances. Research Policy, 36 , 930–948. doi: 10.1016/j.respol.2007.03.002

Bergmark, D. (2001). Scraping the ACM Digital Library. ACM SIGIR Fo-rum, 35 (2), 1–7. Retrieved from http://portal.acm.org/citation.cfm ?doid=374308.374363 doi: 10.1145/374308.374363

(34)

Cohen, W. M., Nelson, R. R., & Walsh, J. P. (2002). Links and Impacts: The Influence of Public Research on Industrial R&D. Management Science, 48 (1), 1–23. doi: 10.1287/mnsc.48.1.1.14273

Cohoon, J. M., Nigai, S., & Kaye, J. J. (2011, 8). Gender and computing con-ference papers. Communications of the ACM , 54 (8), 72. Retrieved from

http://portal.acm.org/citation.cfm?doid=1978542.1978561 doi:

10.1145/1978542.1978561

CORE. (2016). About Us - Computing Research & Education. Retrieved from http://www.core.edu.au/team

Critchley, C. R., & Nicol, D. (2011). Understanding the impact of commercial-ization on public support for scientific research: is it about the funding source or the organization conducting the research. Public understanding of science, 20 (3), 347–366. doi: 10.1177/0963662509346910

Elsevier. (2017). Mendeley — Free reference manager and academic

so-cial network — Elsevier. Retrieved from https://www.elsevier.com/ solutions/mendeley

endSly. (2015). world-universities-csv. GitHub Repository. Retrieved from https://github.com/endSly/world-universities-csv

Feris, R., Raskar, R., Longbin Chen, Kar-Han Tan, & Turk, M. (2005). Discon-tinuity preserving stereo with small baseline multi-flash illumination. In Tenth ieee international conference on computer vision (iccv’05) volume 1 (pp. 412–419). IEEE. Retrieved from http://ieeexplore.ieee.org/ document/1541285/ doi: 10.1109/ICCV.2005.76

Fortune magazine. (2017). Fortune 500 Companies 2017. Retrieved from

http://fortune.com/fortune500/

Google. (2017). About Google Scholar. Retrieved from https://scholar

.google.com/intl/en/scholar/about.html

Kimmig, A., Van den Broeck, G., & De Raedt, L. (2017, 7). Algebraic model counting. Journal of Applied Logic, 22 (C), 46–62. Retrieved from http:// linkinghub.elsevier.com/retrieve/pii/S157086831630088X doi: 10 .1016/j.jal.2016.11.031

MiguelR90. (2016). Fortune 500 webscraper.

Mullins, B., & Nicas, J. (2017, 7). Paying Professors: Inside Google’s Academic Influence Campaign. Retrieved from https://www.wsj.com/articles/ paying-professors-inside-googles-academic-influence-campaign -1499785286

Raffo, J., & Lhuillery, S. (2009, 12). How to play the “Names Game”: Patent retrieval comparing different heuristics. Research Policy, 38 (10), 1617–

1627. Retrieved from http://linkinghub.elsevier.com/retrieve/

pii/S0048733309001528 doi: 10.1016/j.respol.2009.08.001

Teixeira, A. A. C., & Mota, L. (2012, 12). A bibliometric portrait of the evolution, scientific roots and influence of the literature on university–industry links. Scientometrics, 93 (3), 719–743. Retrieved from http://link.springer.com/10.1007/s11192-012-0823-5 doi: 10.1007/s11192-012-0823-5

Tijssen, R. J. W., & Van Leeuwen, T. N. (2006, 1). Measuring impacts of academic science on industrial research: A citation-based approach. Scientometrics, 66 (1), 55–69. Retrieved from http://link.springer.com/10.1007/s11192-006-0005-4 doi: 10.1007/s11192-006-0005-4

TopForeignStocks. (2017). Fortune 500 Companies Lists Downloads. Retrieved from http://topforeignstocks.com/downloads/

Turk, J. (2017). jellyfish 0.5.6 documentation. Retrieved from http://jellyfish.readthedocs.io/en/latest/

University of Trier. (2017). dblp computer science bibliography.

Winkler, W. E. (1999). The State of Record Linkage and Current Research Problems. Statistical Research Division, US Census Bureau, 1–15. Retrieved from https://www.census.gov/srd/papers/pdf/rr99-04.pdf
