
Internship at Solid Professionals

Sustainability Benchmarking

L.R.N. (Leon) Graumans

Master Placement Report
MA Information Science, University of Groningen

Academic supervisor: prof. dr. G.J.M. (Gertjan) van Noord
Placement supervisor: Joanneke Meijer

L.R.N. (Leon) Graumans s2548798


P R E F A C E

For me, this paper is the final chapter of the Information Science Master's degree and it represents the end of my studies. It started five and a half years ago in Groningen, and came to an end during my internship here, in Utrecht. Despite missing Groningen every now and then, mainly because of its pubs and coziness, I really like living here in Utrecht, mostly because of the nearby presence of friends and family.

First, I want to thank Solid Professionals for allowing me to do my internship at their company. I have really enjoyed my time at the Maliebaan in Utrecht. All colleagues were friendly, helpful and genuinely interested, whether it was work-related or personal. At Solid Professionals it felt like a family, which is something to cherish.

Second, I want to thank Joanneke Meijer, my supervisor at the company. I remember the first time we had a call over the phone, months prior to my internship. During this call we discussed the purpose of the internship, and continued brainstorming on ideas and things we should try out in order to succeed. From that point on, I really valued the face-to-face meetings we had every week. Besides the helpful feedback I received, we could also discuss project and personal matters. These moments gave me new insights and helped me develop during my internship. Last, thanks to all my colleagues as well. I enjoyed working with each of you, as well as drinking a beer together on Fridays and improving my skills on the pool table. Unfortunately, my internship was too short to succeed in the latter.

Thanks for reading. Leon


C O N T E N T S

Preface 1

1 introduction 1

2 task and implementation 3

2.1 Goal . . . 3

2.2 Data(base) . . . 3

2.3 Implementation and Results . . . 5

2.3.1 Classification . . . 5

2.3.2 Explainability . . . 5

2.3.3 Neighbour classes . . . 6

2.4 Future Work . . . 7

2.5 Internal and External Communication . . . 7

2.5.1 Communication and Collaboration . . . 7

2.5.2 Open Source Code . . . 8

2.5.3 Web Application . . . 8

3 evaluation 9

4 conclusion 10

5 appendix 12

5.1 Logbook . . . 12


1

I N T R O D U C T I O N

Solid Professionals [1] was founded in 2007 by Marcel van Wersch and Wim Meijer, who both have a solid background in financial services. Initially, Solid Professionals focused on the recruitment and selection of financial professionals for both the financial and business sectors. Later, the recruitment of Consultancy Professionals and an interim branch were added. Solid Professionals has now become a combination of the following services:

• Advisory (SPA) - a team of consultants, for the design, organization and improvement of sustainable, effective and efficient Finance & Risk functions
• Interim (SPI) - finance professionals, for solving temporary knowledge and capacity problems
• Young Professional Program (SYP) - finance, risk and actuarial traineeship

Over the years, other companies have emerged from Solid Professionals, which are:

• Hermes | Partners [2]
• DNFS | De Nederlandse Farmaceutische Support [3]
• IVVO | Instituut Voor Vitaal Ondernemen [4]

These companies respectively focus on Business & IT professionals, pharmaceutical professionals and vitality within companies. These companies are all part of The Hup [5], a home and base for professionals. The Hup can be described as the umbrella under which these companies operate; see Figure 1 for a visual explanation.

Within the Advisory branch (SPA), the AI Competence Centre emerged. This team of AI consultants has expertise in Data Science, Text Mining and NLP, and is specialized in the Finance & Risk sector. During my placement, the members of this team were my go-to colleagues if I had questions regarding my project. Joanneke, my placement supervisor, was also part of the AI Competence Centre.

As of 2020, this AI Competence Centre has merged with Amsterdam Data Collective (ADC) [6], a team working on quantitatively demanding projects in finance, public, healthcare and retail. Thus, ADC is now also part of The Hup, something to keep in mind when looking at the (slightly outdated) Figure 1.

Before moving to Utrecht I had never heard of The Hup or its companies. However, I was already certain that I wanted to do my internship in the Randstad. A little before the summer break I was talking to a friend of mine, Tim Marsman, who works as a recruitment consultant for Solid Professionals. We talked about the subjects I was looking for in an internship, and not long after that he gave me a call, saying that Solid Professionals was in need of a student with expertise in Natural Language Processing (NLP).

Shortly thereafter I had a phone conversation with Joanneke. We discussed the goal of the project as well as the subjects we would address, which were mainly data scraping and machine learning. We immediately started brainstorming about ideas and techniques we could apply. From that point, I already knew I was going to enjoy my internship.

[1] https://www.solidprofessionals.nl
[2] https://www.hermespartners.nl
[3] https://www.dnfs.nl
[4] https://ivvo.nl
[5] https://www.thehup.nl
[6] https://amsterdamdatacollective.com/


Figure 1: Hierarchy of The Hup and its companies. Slightly outdated, since Amsterdam Data Collective is not yet listed in this figure. The Hup, being the umbrella, is placed at the bottom of this figure, emphasizing the flat organizational structure.

In the next chapter, the task and its implementation are discussed to give a better understanding of the work I did during my placement. There, I describe the data we used, motivate the methodology, and discuss some results, communication and future work. Next, I evaluate the placement on subjects like newly acquired knowledge and skills, supervision and career goals. Last, I state my conclusion.


2

T A S K A N D I M P L E M E N T A T I O N

2.1 goal

The modern fast-fashion industry produces most of our clothes. The industry, however, is dominated by the pressure to produce as fast and as cheaply as possible, at the cost of working conditions, the quality of materials and our environment.

To give a push in the right direction, Joanneke came up with the idea of creating a database of clothing brands with a corresponding sustainability rating, generated by using web scraping and artificial intelligence (AI). There, one can find a reliable answer to the degree of sustainability of a clothing brand before purchasing a product of that specific brand.

Creating such a database is perhaps not what you would expect from a company which is mainly focused on the financial sector. However, Solid Professionals (and also The Hup and its companies) follows a Corporate Social Responsibility (CSR) policy [1]. This means that, in the ordinary course of business, the company operates in ways that enhance society and the environment, instead of contributing negatively to them. For example, Solid Professionals donates to the TIKA Foundation [2] every time they recruit new people, and also provides vitality programs, plenty of fruit, fair-trade coffee and bicycles for everyone at their office. This project was thus set up with the CSR policy in mind.

The main focus of my placement was to create a minimum viable product, i.e. a version of a product with just enough features to satisfy early adopters and provide feedback for future product development. Roughly, the idea of this project is to scrape data for a large number of clothing brands, provide a subset of them with golden classification labels and predict labels for the remaining brands. This results in a database with a couple of thousand clothing brands, rated based on their level of sustainability. The database will be shown on a responsive web application, for the public to use. Also, the code will be published as open source, in order to encourage the community to add features and improve it in other ways.

2.2 data(base)

In order to train an AI model for predicting sustainability labels for clothing brands, we needed a big collection of brands and data for each of these brands. We started by creating a list of brands, using the brand directories of Zalando [3] and Aboutyou [4]. In addition, we used a couple of websites selling or listing sustainable clothing brands, such as Rank a Brand [5] and Project Cece [6], who shared their database of clothing brands with us. These brands were added to our database and provided with their corresponding label. Most of these websites solely listed sustainable brands, which were thus added to our database with the sustainable label. Rank a Brand, however, (manually) rated brands on a scale from A to E. As the proportion of E-rated brands (i.e. the lowest rating) was fairly high, we added these brands to our database with a not-sustainable label. This resulted in a collection of 2,674 clothing brands, of which 773 brands were provided with a golden label (i.e. correct examples). The distribution of our database can be found in Table 1.

[1] https://www.solidprofessionals.nl/over-ons/mvo
[2] https://www.tikafoundation.com
[3] https://www.zalando.nl/merken
[4] https://www.aboutyou.nl/merk
[5] https://rankabrand.nl
[6] https://www.projectcece.nl
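As a minimal sketch of this label-assignment step, the merging of the sources could look roughly like the snippet below; the brand names and list variables are placeholders, not the actual shared databases.

```python
# Sketch of the labelling rules described above; brand names are placeholders.
SUS, NOT = "SUS", "NOT"

sustainable_only_brands = ["example eco brand"]             # e.g. brands listed by Project Cece
rank_a_brand_ratings = {"example fast-fashion brand": "E"}  # manual A-E ratings from Rank a Brand

labels = {}

# Directories that solely list sustainable brands contribute the sustainable label.
for brand in sustainable_only_brands:
    labels[brand] = SUS

# Rank a Brand rates brands from A to E; the lowest rating maps to not-sustainable.
for brand, rating in rank_a_brand_ratings.items():
    if rating == "E":
        labels[brand] = NOT
```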


              SUS    NOT    None     Total
Brands        494    279    1,620    2,393
No homepage     -      -        -      281
Total                                2,674

Table 1: Distribution of our database. SUS: sustainable brands, NOT: not-sustainable brands, None: brands without any label.

For collecting data about each brand, we started by searching homepages using the Google Search API [7]. As shown in Table 1, a couple of hundred brands did not have a homepage or were not found using our method. We manually investigated a few of these brands, and concluded that most of them used a social media page, such as Facebook, as their homepage. We ignored social media pages on purpose, thus resulting in 2,393 brands we could work with. Once the homepage was found, we used the BeautifulSoup4 [8] scraper to search for all hyperlinks on these pages. Subsequently, we used the scraper to open all of the found hyperlinks to search again for all hyperlinks on that specific page.

This method led to a collection of tens, hundreds or even thousands of hyperlinks for each clothing brand. Using all of these webpages would take too long to scrape and would result in a database far too big to use. To keep our process efficient, we looked at a trade-off between speed and incorporating useful information. We crawled through several levels of each clothing brand's website and selected the homepage plus 10 relevant pages, based on a dictionary of manually picked keywords. These relevant pages included pages like about-us, history and blogs, and excluded product pages, shopping carts and store locators. If fewer than ten relevant pages were found, the remaining pages were picked randomly from all found pages for that specific brand. Once this subset was selected, we used BeautifulSoup4 to scrape all content. As we were mainly interested in the written content of these pages, we only kept text which was found in headers and paragraphs. Also, since our database consists of a lot of non-English clothing brands, we translated all non-English content with a library using the Google Translator API [9].
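A minimal sketch of this crawling and page-selection step, using requests and BeautifulSoup4, is shown below; the keyword lists, helper names and request settings are illustrative assumptions rather than the project's exact code.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Illustrative keyword dictionaries; the real lists were picked manually.
RELEVANT_KEYWORDS = ["about", "history", "blog", "sustainab", "responsib"]
EXCLUDED_KEYWORDS = ["cart", "product", "store-locator"]
HEADERS = {"User-Agent": "Mozilla/5.0"}

def get_links(url):
    """Collect all hyperlinks found on a single page."""
    html = requests.get(url, headers=HEADERS, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def select_pages(homepage, max_pages=10):
    """Return the homepage plus up to ten pages whose URL matches a relevant keyword."""
    links = get_links(homepage)
    relevant = [l for l in links
                if any(k in l.lower() for k in RELEVANT_KEYWORDS)
                and not any(k in l.lower() for k in EXCLUDED_KEYWORDS)]
    # If fewer than max_pages relevant pages were found, fill up with other found pages.
    filler = [l for l in links if l not in relevant]
    return [homepage] + (relevant + filler)[:max_pages]

def scrape_text(url):
    """Keep only the written content found in headers and paragraphs."""
    html = requests.get(url, headers=HEADERS, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(t.get_text(" ", strip=True) for t in soup.find_all(["h1", "h2", "h3", "p"]))
```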

Using scraped material from each of these clothing brands' websites means that our model relies solely on information created by the clothing brands themselves. We were aware of this caveat. However, in the EDA (Exploratory Data Analysis) stage, we already saw some clear differences between data from sustainable and not-sustainable brands. The two wordclouds in Figure 2 contain the most frequent words per class. We ignored words which occurred often in both classes, and we applied a little preprocessing, such as lowercasing words and removing noise like punctuation and stopwords using the NLTK library (Bird and Loper, 2004). As can be seen in the wordclouds, there is a clear difference between the two classes. Interestingly, we mainly see brand names in the not-sustainable wordcloud. This could be due to the fact that we removed content which occurred often in both classes, indicating that we removed content which is often used on webshops. This left sustainability-related keywords for the sustainable brands and brand names for the not-sustainable brands.

[7] https://pypi.org/project/google
[8] https://pypi.org/project/beautifulsoup4
[9] https://pypi.org/project/translators
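The preprocessing mentioned above could look roughly like the sketch below with NLTK; the exact noise filters used in the project may have differed.

```python
import string

import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
PUNCTUATION = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Lowercase, strip punctuation and remove English stopwords."""
    tokens = nltk.word_tokenize(text.lower().translate(PUNCTUATION))
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]
```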


Figure 2: Wordclouds for the not-sustainable (left) and sustainable (right) classes.

2.3 implementation and results

2.3.1 Classification

For assigning sustainability labels to clothing brands, we make use of supervised machine learning techniques. We applied text classification using the golden labels created by initiatives like Rank a Brand and Project Cece, as described in the Data section. During the placement, we omitted the use of deep learning techniques, as the purpose of this project was to create a minimum viable product with a strong focus on an explainable AI model.

In order to use a machine learning algorithm, we needed a training set and a test set. We split our data into two sets: the training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. During this project, the test set is referred to as the hold-out set.

We used cross-validation on the training set to train, test and tune our models. With cross-validation (k-fold cross-validation), the training set is randomly split up into k groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set. This way, the model has the opportunity to train on multiple train-test splits, leading to a better indication of how well our model will perform on unseen data.
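A sketch of this split and 5-fold cross-validation with scikit-learn is shown below; texts and labels stand for the scraped documents and golden labels, and the split ratio is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# texts: scraped content per brand, labels: golden labels (assumed to be available)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

model = make_pipeline(TfidfVectorizer(), LinearSVC())

# 5-fold cross-validation on the training set only; the hold-out set stays untouched.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"Mean cross-validation accuracy: {scores.mean():.3f}")
```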

Before running our machine learning algorithm, we converted our textual data into numerical feature vectors using a bag-of-words method. To avoid giving more weight to longer documents and to reduce the weight of more common words, we used the TF-IDF vectorizer from the scikit-learn library (Pedregosa et al., 2011). There are various algorithms which can be used for text classification; in Table 2 you can see the results of these algorithms on our data. We continued with the LinearSVC [10] (Support Vector Machine) classifier, as this model performed best during cross-validation. As its input features, we used TF-IDF vectors of our data and additional indicator features for finding any sustainability certifications within the content of a brand (such as the Global Organic Textile Standard, FairTrade, etc.).
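One way to combine the TF-IDF vectors with simple certification indicator features is a FeatureUnion, sketched below; the certification list and the transformer are assumptions based on the description above, and X_train / y_train reuse the split from the earlier sketch.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Illustrative subset of certifications to look for in a brand's scraped content.
CERTIFICATIONS = ["global organic textile standard", "gots", "fairtrade"]

class CertificationIndicator(BaseEstimator, TransformerMixin):
    """One binary feature per certification: does the brand's text mention it?"""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[int(cert in doc.lower()) for cert in CERTIFICATIONS]
                         for doc in X])

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("certifications", CertificationIndicator()),
])

clf = make_pipeline(features, LinearSVC())
clf.fit(X_train, y_train)
print("Hold-out accuracy:", clf.score(X_holdout, y_holdout))
```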

2.3.2 Explainability

As shown in Table 2, the accuracy is sufficiently high. However, only looking at these scores does not provide the best overview of our model's performance. To manually check this performance, we used the Explain Like I'm 5 (ELI5) tool [11], which highlights the most indicative features for choosing a specific class.

[10] https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC
[11]


Model                  Training set accuracy (5-fold)   Hold-out set accuracy
LinearSVC              0.834                            0.819
SVC (kernel=linear)    0.817                            0.819
LightGBM               0.799                            0.836
Logistic Regression    0.790                            0.819
RandomForest           0.744                            0.784
Naive Bayes            0.616                            0.655

Table 2: Performance of multiple supervised machine learning models on the training set (using k-fold cross-validation) and the hold-out set, measured in accuracy.

Figure 3: Most important features, highlighted with the ELI5 tool.

Looking at the most important features for Patagonia (a clothing brand) in Figure 3, the sustainable label is predicted correctly, but the prediction heavily relies on noise like javascript and cart. We used this input to refine our preprocessing, by removing sentences containing such keywords and by lemmatizing the data. Lemmatization converts words to their base form, for example 'walks' and 'walked' to 'walk'. This led to more sensible term weights, at the cost of a slight decrease in model performance on the training set (0.814), while accuracy on the hold-out set remained the same.
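A sketch of this inspection step with ELI5 is shown below, assuming a fitted LinearSVC (svm), a separately fitted TF-IDF vectorizer (tfidf) and one brand's scraped text (brand_text); it mirrors the kind of output shown in Figure 3 rather than reproducing it.

```python
import eli5

# svm: fitted LinearSVC, tfidf: fitted TfidfVectorizer, brand_text: one brand's scraped content.

# Global view: which terms push a brand towards each class.
weights = eli5.explain_weights(svm, vec=tfidf, top=20)
print(eli5.format_as_text(weights))

# Local view: explain the prediction for a single brand, e.g. Patagonia's scraped text.
prediction = eli5.explain_prediction(svm, brand_text, vec=tfidf)
print(eli5.format_as_text(prediction))
```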

2.3.3 Neighbour classes

Building a model that is 100% correct is (almost) impossible. In addition to using the output (probability) of the model, we came up with the idea of using Word2Vec embeddings (Mikolov et al., 2013) to support our model's prediction. For each brand, we collect the 300-dimensional embeddings of the 100 most frequent words, resulting in a matrix of shape (100, 300), trying to capture semantic and syntactic patterns for that specific brand. Using these matrices, we can make two calculations: 1) the distance between the matrix of a brand and the average matrix of a class, and 2) the class of the brand that is closest, i.e. which brands are the neighbours of a given brand.
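These two calculations could be sketched as below, assuming a pretrained 300-dimensional Word2Vec model loaded with gensim and token lists per labelled brand; the file path, variable names and the use of the Frobenius norm as distance are assumptions, since the report does not specify the exact metric.

```python
from collections import Counter

import numpy as np
from gensim.models import KeyedVectors

# Placeholder path to any 300-dimensional pretrained Word2Vec model.
wv = KeyedVectors.load_word2vec_format("word2vec-300d.bin", binary=True)

def brand_matrix(tokens, top_n=100, dim=300):
    """Embeddings of a brand's 100 most frequent in-vocabulary words, shape (100, 300)."""
    frequent = [w for w, _ in Counter(tokens).most_common() if w in wv][:top_n]
    matrix = np.zeros((top_n, dim))
    for i, word in enumerate(frequent):
        matrix[i] = wv[word]
    return matrix

# sus_token_lists / not_token_lists: token lists of the labelled brands (assumed available).
class_average = {
    "SUS": np.mean([brand_matrix(t) for t in sus_token_lists], axis=0),
    "NOT": np.mean([brand_matrix(t) for t in not_token_lists], axis=0),
}

def closest_average_class(tokens):
    """Calculation 1: the class whose average matrix is nearest to this brand."""
    m = brand_matrix(tokens)
    return min(class_average, key=lambda c: np.linalg.norm(m - class_average[c]))

def nearest_neighbour_class(tokens, labelled_matrices):
    """Calculation 2: the class of the single closest labelled brand.

    labelled_matrices is a list of (matrix, label) pairs for the labelled brands.
    """
    m = brand_matrix(tokens)
    _, label = min(labelled_matrices, key=lambda pair: np.linalg.norm(m - pair[0]))
    return label
```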

Based on our model's probability for a given brand, along with its closest average class and the class of its nearest neighbour, we decided whether to trust our model's output or to ignore it. For example, if our model's probability is above 90 percent, we immediately trust its prediction. If its probability is between 55 and 90 percent, and its prediction agrees with the closest average class or with the class of its nearest neighbour, we trust the prediction.

When these requirements are not met, the new brand differs quite a bit from anything the model has seen before. Therefore, we omit the prediction and assign the unknown class. This resulted in a collection of 1,521 rated clothing brands (SUS: 773, NOT: 748).
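These acceptance rules can be captured in a small function like the one below; the exact handling of the boundaries is an assumption.

```python
def final_label(probability, predicted, closest_avg_class, neighbour_class):
    """Decide whether to trust the classifier's prediction for a new brand."""
    if probability > 0.90:
        return predicted
    if 0.55 <= probability <= 0.90 and predicted in (closest_avg_class, neighbour_class):
        return predicted
    return "unknown"

print(final_label(0.72, "SUS", closest_avg_class="SUS", neighbour_class="NOT"))  # "SUS"
```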

2.4 future work

As this project is the minimum viable product of something that gradually becomes bigger and better, there is a lot of future work to be done. For example, one could improve the handling of languages other than English. Currently, we translate non-English content with the Google Translate API; handling multiple languages natively (e.g. Dutch, German and French) would be an interesting next step. Also, since we already scraped a fair amount of content for each clothing brand, one could extract more information from this already available collection of data, such as price categories, country of origin and links to external websites. In addition to scraping page content, we think scraping PDF files could provide us with useful information, as these files possibly contain information regarding the level of sustainability of a given brand. Instead of relying only on information provided by the brands themselves, one could include more data sources, like news websites and external databases, in order to collect more objective data. Since we show our database with brands, homepages and their predicted labels on a website, it would be interesting to provide more information, like a logo or a summary generated with text summarization. Last, one could employ other (unsupervised) machine learning models, like advanced neural networks or BERT embeddings (Devlin et al., 2018).

2.5 internal and external communication

2.5.1 Communication and Collaboration

To keep all colleagues and people outside the company up to date regarding the progress of this project, we decided to publish two blogs. In the first blog [12] we introduced the idea and possible outcomes, whereas the second blog [13] was meant more as an explanation of how we approached this project and what choices we made along the way.

In addition to the blogs, I gave a presentation about this project for everyone within the company who was interested in NLP. The audience consisted of a mix of data scientists, as well as some marketing and recruitment colleagues. All colleagues were very enthusiastic and interested in scraping and machine learning, which for example raised questions like: "Could we partially automate the selection of candidates for recruitment, using NLP?".

Another aspect worth mentioning is the possible collaboration with a large company which has a broad international reach within the branch of sustainable clothing. They were very interested in our project. Plans on how they could apply our model to their business are currently on the drawing board and will be continued in 2020.

[12] https://www.solidprofessionals.nl/blog-could-artificial-intelligence-help-you-buy-sustainable-clothing
[13]

2.5.2 Open Source Code

The goal of this project was to publish all code as open source. The advantage of open-source code is collaboration with the community. We noticed there is a lot of interest within the community (i.e. developers and other people who attach great importance to sustainability) in this type of project. By publishing the code as open source, the community can help improve and build on our model. The open-source repository can be found at https://gitlab.com/thehup/duurzame-benchmark.

2.5.3 Web Application

In order to present our database with generated sustainability labels to the public, I made an interactive website using Python and VueJS. VueJS is a Javascript framework which lets you easily create and maintain web applications, as well as load data from a static JSON file. The website can be found at https://goodbase.ai. However, as this project might be used for a possible collaboration, we chose to publish a temporary placeholder for now.
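A minimal sketch of how a Python back-end could expose the brand database to the VueJS front-end as JSON is shown below; the file name and route are illustrative, not the deployed application.

```python
import json

from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative path: the brands with homepage and predicted label exported as static JSON.
with open("data/brands.json") as f:
    BRANDS = json.load(f)

@app.route("/api/brands")
def list_brands():
    """Return all brands with their predicted sustainability label."""
    return jsonify(BRANDS)

if __name__ == "__main__":
    app.run(debug=True)
```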


3

E V A L U A T I O N

Sometimes it felt like I was working on a second Master's thesis project, as I worked on this project independently for the most part. The supervision, however, was good. Every Friday I had a meeting with Joanneke to give an update about the past week's progress and to determine the course for the next week. Joanneke always provided me with helpful feedback and challenged me to always perform better. In addition, as Joanneke was often working on a project at Rabobank, we also called one or two times a week. If I had questions or was stuck somewhere, I could give her a call, or ask colleagues with programming and/or NLP expertise. Overall, I enjoyed the combination of working independently and the various calls and meetings for feedback and new insights.

The main goal of this project was to set up a product for generating sustainability ratings for clothing brands. I started from scratch, with some expertise in natural language processing, machine learning and web scraping gained during my Information Science study. During the first weeks of my placement I mainly focused on scraping. As I had once used BeautifulSoup4 during my Master's thesis for crawling a single static website, I knew how to get started. During this project, however, that one single website was replaced with millions of pages from thousands of clothing brands, each with a different website structure. I also learned about request headers, cookies and proxy handling. The amount of data was massive compared to what I was used to working with during my study. The scraping part is just one example, but these bumps in the road led to making all kinds of choices using business rules, in order to keep the whole process scalable and efficient. Gradually I found it easier to make reasonable choices.

During this placement there were also aspects that I had no experience with, like full-stack production, where you have to think of solutions, build code and deliver a final product. Of course, during my study I did make some repositories with a small README file. However, creating and maintaining a repository of this size, with a large amount of commenting and documentation, providing it with an MIT licence and buying a website domain based on Google SEO value were aspects I had no experience with. In addition, when connecting a machine learning model in Python to an interactive web application, I noticed the joy I got from connecting the massive back-end with a clean, simple and sleek design. This was something I really wanted to learn during this internship. Besides looking at jobs in the data science branch, I will definitely broaden my career perspective and also search for vacancies in software development.

Also, attending meetings, having weekly Skype calls with the AI department, giving a company-wide presentation and having contact (mail and Skype) with different (international) companies regarding a possible collaboration were relatively new to me too. I did enjoy these moments, as they gave me new insight into what working at such a company could look like. I did have some working experience prior to this placement, but never within the field of my study. I must say, I now have a much clearer perspective of how I can use my expertise in Information Science in professional practice.


4

C O N C L U S I O N

From the moment I walked through the front door at Solid Professionals, I immediately felt the warm and family-like vibe within this company. To get started during the first week, I got to use their onboarding tool. This tool consisted of some programming tutorials, information regarding the company, as well as assignments like: "Drink a cup of coffee with...", helping me to get to know everyone as quickly as possible.

As I had the whole summer break to think about the project I would do at Solid Professionals, I already had some ideas about techniques we could apply and how we could succeed. These ideas were gradually replaced by an actual product that became bigger and better. I enjoyed working on all sub-parts of this project, and learned to keep a clear picture of the whole process, from idea to final product. I am also proud of setting up a scraping system and AI model of this size from scratch. The intention of this project was to create a minimum viable product, in which we succeeded. As there are possible collaborations with other companies on the drawing board, I am very curious where this project will be a year from now.


B I B L I O G R A P H Y

Bird, S. and E. Loper (2004). NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, p. 31. Association for Computational Linguistics.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mikolov, T., K. Chen, G. Corrado, and J. Dean (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830.


5

A P P E N D I X

5.1 logbook

WEEK  LOG
38    Onboarding
      Attending Big Data Expo
      Giving presentation about project idea
      First meeting with Joanneke at Rabobank
      Creating planning
39    Creating lists of clothing brands using Zalando, Aboutyou etc.
      Removing duplicate brands, as we imported brands from multiple sources
      Creating SQL database
      Retrieving homepages using Google Search API
40    Fixing homepage failures using request headers and proxies
      Scraping homepages, looking for hrefs
      Improving scraping speed
      Using VPN and request headers to retrieve English webpages
41    Using keywords for selecting important pages, in order to increase scraping speed
      Data exploration, pre-processing
      Data analysis using wordclouds
      Creating simple machine learning pipeline
42    Selecting English pages using language codes in hyperlinks
      Contact with Project Cece about sharing their dataset
      Creating multiple feature FunctionTransformers, like looking for certifications, brand names, number of found hyperlinks etc.
43    Scraping all brands over the weekend
      Creating repository, including readme, requirements.txt etc.
      Providing code with comments
      Cleaning up code, combining functions etc.
      Contact with Marketing about blogpost
      Replacing numpy, dicts etc. with Pandas
44    Update project to stakeholders
      Creating business rules for determining classification label based on ratings from 5+ websites
      Identifying language of content using Google Langdetect
      Sending survey to all colleagues within The Hup
45    Using different metrics for measuring model performance
      Running scraper with VPN if content is not English
      Translating remaining non-English content to English using Google Translate API
      Created function for extracting price categories from scraped material
46    First contact with company for collaboration
      Interpretability of classifier using Lime package
      Using ACE framework for explainable machine learning
47    Employing multiple supervised machine learning techniques, including LightGBM
      Normalisation of data
      Employing multiclass classification
      Second contact with company for collaboration
48    Providing repository with more documentation, licence, usage, etc.
      Word2Vec implementation
      Publishing blog 1
      First steps website, purchasing goodbase.ai
49    Better explainability using ELI5
      Better preprocessing using output of ELI5
      Creating final presentation
      Building website using Flask and VueJS
50    Update project to stakeholders
      Also scraping homepage together with 10 important pages
      Final presentation at company
51    Using Word2Vec for (100, 300) matrices, calculating distances between brands, finding neighbours etc.
      Deciding whether to post prediction, using Word2Vec neighbours and business rules
      Publishing blog 2
      Final talk with Joanneke and Casper (manager of AI department)
52    Christmas
      Fixing last bugs
      Adding extra comments
      Last push of code to repo
1     Last meeting at Amsterdam Data Collective
2     Writing placement report
