
Student: Ksenia Iakovleva
Student number: S3461769
Telephone number: +31 6 25542933
E-mail: k.iakovleva@student.rug.nl
Address: Marmerstraat 44, 9743 XH Groningen

MA Digital Humanities

Placement report

Title of the Placement: Data analyst / intern

Starting Date: 12.02.2019

Ending Date: 31.06.2019

Company: Innovatiespotter (Groningen, Netherlands)

Placement Supervisor: Full Stack Engineer Jeronimo Calace Montu

Abstract

The internship was conducted at the Dutch IT company Innovatiespotter and involved developing software for data collection, analysis and statistics. The intern was responsible for creating a universal scraper in Python for collecting contact information from Dutch companies’ websites, as well as a statistics project for calculating the accuracy of matches between the scraped data and the data in the company’s database.

Keywords

Data analysis, Software development, Statistics, Python, SQL, JSON

Introduction to the Company

Innovatiespotter is an IT company in Groningen that manages online data about 1.8 million companies in the Netherlands and updates it daily. A team of data scientists analyzes this data with the help of Big Data and Machine Learning and clusters the results. The company focuses specifically on start-ups and innovative companies, creating networking, matchmaking and socially relevant projects. For example, some of Innovatiespotter’s clients are Dutch municipalities, which need to know the most innovative companies in their regions in order to assign grants.


Tasks and Responsibilities of the Intern

The intern’s work was part of a larger project whose goal was to automatically label the companies in Innovatiespotter’s database. As input, this project received a table with Dutch companies’ associations and their websites. These websites were automatically scraped by another employee’s script to obtain the websites of the companies belonging to each association. This data was stored in another table together with the corresponding association’s label, so everyone could see to which association a company belongs. The intern’s scraper then collected contact information from each of these companies’ websites (in particular the company name, KVK number from the Chamber of Commerce, postcode, city, street, house number, email and phone) and updated the table with it. These data were subsequently used to find the corresponding companies in Innovatiespotter’s database and to label them with the name of the association to which they belong (this last part was not done by the intern).
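As an illustration of this scraping step, a minimal sketch is given below: it fetches a single page and extracts two of the listed attributes. The KVK pattern and the returned field names are assumptions made for illustration, not the project’s actual code.

    import re
    import requests

    # Illustrative sketch of the contact-extraction step; the KVK pattern
    # and the field names are assumptions, not the intern's actual code.

    def scrape_contacts(url):
        """Fetch a company website and extract two example contact fields."""
        html = requests.get(url, timeout=10).text
        kvk = re.search(r"KvK\D{0,10}(\d{8})", html, re.IGNORECASE)  # 8-digit KVK number
        email = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
        return {
            "kvk_number": kvk.group(1) if kvk else None,
            "email": email.group(0) if email else None,
        }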

To understand which types of contact information provide the highest match accuracy, the intern also created a statistics project in Python, which automatically calculated the accuracy of matches for each attribute (e.g. the company’s name) and each combination of attributes (e.g. the company’s name combined with the street name), and applied a chi-square test to check whether the results were statistically significant. This part of the project was important because some attributes resulted in false matches in the database: for instance, using only a street name to find matches was risky, because many companies can be located on the same street.
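A simplified sketch of this accuracy calculation is shown below. The match records are invented placeholders (in the project the results came from the database); each record states whether matching on a given attribute pointed to the correct company.

    from itertools import combinations

    # Made-up match records: True means matching on that attribute found
    # the correct company. Real records came from the database.
    matches = [
        {"name": True,  "postcode": True,  "street": False, "city": True},
        {"name": True,  "postcode": False, "street": True,  "city": True},
        {"name": False, "postcode": True,  "street": False, "city": False},
    ]

    def accuracy(attrs):
        """Share of records where every attribute in the combination matched."""
        hits = sum(all(row[a] for a in attrs) for row in matches)
        return hits / len(matches)

    attributes = ["name", "postcode", "street", "city"]
    for combo in list(combinations(attributes, 1)) + list(combinations(attributes, 2)):
        print("+".join(combo), round(accuracy(combo), 2))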

The intern mostly used Python in her work. She also used PostgreSQL for working with the database, executing SQL queries inside the Python scripts via the library Psycopg (the most popular PostgreSQL adapter for the Python programming language).
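A minimal sketch of this pattern, assuming placeholder connection details, table and column names, could look as follows.

    import psycopg2

    # Placeholder DSN, table and column names; not the company's schema.
    conn = psycopg2.connect("dbname=placeholder user=placeholder")
    with conn, conn.cursor() as cur:
        # Parameterised queries let Psycopg escape the values safely.
        cur.execute("SELECT id, website FROM companies WHERE email IS NULL")
        for company_id, website in cur.fetchall():
            scraped_email = "info@example.nl"  # stand-in for a scraped value
            cur.execute(
                "UPDATE companies SET email = %s WHERE id = %s",
                (scraped_email, company_id),
            )
    # leaving the `with` block commits the transaction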

The project has a complex structure and includes several Python scripts. First, the scraper retrieves the HTML of the website’s main page and searches it for contact information. If it fails to find the information, or finds it incomplete, it collects all the contact links from the website and scrapes them as well until it has found all the contacts present (if it finds nothing, it simply does not update the final table). The data for every attribute is scraped in a separate process. Each attribute (company name, postcode, street, etc.) has a specific Python function containing a regular expression that extracts the correct data from the HTML page.
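The sketch below imitates this structure with two assumed attribute functions and a simple fallback to contact pages; the regular expressions and the link heuristic are illustrative, not the actual patterns used in the project.

    import re
    import requests
    from urllib.parse import urljoin

    def find_postcode(html):
        m = re.search(r"\b\d{4}\s?[A-Z]{2}\b", html)  # Dutch postcode, e.g. "9743 XH"
        return m.group(0) if m else None

    def find_phone(html):
        m = re.search(r"\+?31[\s\d-]{8,12}", html)  # rough Dutch phone pattern
        return m.group(0) if m else None

    EXTRACTORS = {"postcode": find_postcode, "phone": find_phone}

    def scrape(url):
        html = requests.get(url, timeout=10).text
        result = {attr: fn(html) for attr, fn in EXTRACTORS.items()}
        # If anything is missing, follow links whose URL mentions "contact".
        for link in re.findall(r'href="([^"]*contact[^"]*)"', html, re.IGNORECASE):
            if all(result.values()):
                break
            page = requests.get(urljoin(url, link), timeout=10).text
            for attr, fn in EXTRACTORS.items():
                result[attr] = result[attr] or fn(page)
        return result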

After the data is collected, cleaning scripts are applied: HTML tags, extra spaces, tabs and newline characters are removed, the case is standardised, and so on. If the scraper finds more than one data point for an attribute, it stores the extra information in a JSON array, which goes into a separate column of the table. When the scraper is applied to more than one URL (the main page of the website and one or more contact pages), functions for removing duplicates are executed automatically. They compare both the main information and the extra information (in case many addresses or contacts were found on one page) and remove duplicated data. At the end of scraping, all the scraped data for each company is stored in the table as a single row.
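A minimal sketch of the cleaning and de-duplication step, with illustrative function names, might look like this.

    import json
    import re

    def clean(value):
        value = re.sub(r"<[^>]+>", " ", value)  # strip leftover HTML tags
        value = re.sub(r"\s+", " ", value)      # collapse spaces, tabs, newlines
        return value.strip().lower()            # standardise the case

    def extras_as_json(values):
        """Cleaned, de-duplicated extra values for the separate JSON column."""
        unique = dict.fromkeys(clean(v) for v in values)  # keeps first-seen order
        return json.dumps(list(unique))

    print(extras_as_json(["  Info@Example.nl ", "<b>info@example.nl</b>"]))
    # -> ["info@example.nl"]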


Calculating statistics was needed to understand which attributes give better accuracy for the matches between the scraped data and Innovatiespotter’s database. First, a database dump (sample) of 300 companies that had already been labeled with associations’ names manually by Innovatiespotter’s employees was created. Then another dump with around 16,000 companies, including those 300, was created as well (so that false matches could occur while merging with the scraped data). Python scripts for finding matches between the scraped data and the second dump were applied, and the results were compared to the first dump with the labeled companies in order to see which companies had been labeled mistakenly. After that, other Python scripts automatically calculated the number of right and wrong matches per attribute, and finally a chi-square test was applied to confirm that the results were statistically significant (e.g. that the combination of company name and postcode gave better accuracy than the combination of website and city).
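Such a comparison can be illustrated with SciPy’s chi-square test on a small contingency table of right and wrong matches; the counts below are invented placeholders, not the project’s actual results.

    from scipy.stats import chi2_contingency

    table = [
        [280, 20],  # company name + postcode: right vs. wrong matches
        [230, 70],  # website + city: right vs. wrong matches
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # p < 0.05 suggests a real difference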

Conclusion

Although the data used during the internship was not connected with the humanities, all the methods I applied are relevant for working with humanities data as well. Moreover, the academic knowledge gained at the University of Groningen was definitely an integral part of the placement. In particular, I used and developed the skills I acquired in the courses Coding for Humanities, Database Design, Collecting Data and Analyzing Data.

Throughout this internship, I learnt plenty of software development practices that were new to me. One of the most important parts was understanding and applying Object-Oriented Programming (OOP). Moreover, before the internship I did not know how SQL queries could be embedded in a Python script and executed from it. I also learnt a lot about structuring a whole Python project (e.g. using functions from one Python file in another through imports). One of the most important benefits of this internship was the opportunity to communicate with software engineers, who not only gave specific advice about developing the project but also explained general programming concepts. Overall, my internship experience was close to perfect, since I managed to develop the skills I gained in my Master’s programme and to finalise both the universal scraper and the statistics project.
