
University of Amsterdam – Amsterdam Business School

KPI extraction from CSR reports

MBA Big Data and Business Analytics - Master Thesis

Cyril Peillon, MSc

September 2018

Supervised by

Dr. Evangelos Kanoulas


Table of Contents

1 Context
1.1 CSR reporting
1.2 Content extraction
1.3 Personal context
2 Technical solution
2.1 Literature review
2.1.1 Table detection
2.1.2 Table formatting detection
2.2 Methodology
2.2.1 Baseline
2.2.2 Table detection
2.2.3 Deep learning approach
2.3 Results
2.3.1 Metrics
2.3.2 Results
2.4 Future improvements
2.4.1 Deep learning
2.4.2 Improve other steps
3 Business plan
3.1 Environment
3.2 Value proposition
3.2.1 Customer segments
3.2.2 Value proposition
3.2.3 Customer relations and acquisition channel
3.3 Business model
3.3.1 Market
3.3.2 Competition and competitive advantage
3.3.3 Implementation
3.3.4 Financial structure
4 Conclusion
5 References


1 Context

1.1 CSR reporting

Corporate Social Responsibility (CSR) reporting refers to corporate reporting on environmental and social impact. This discipline appeared in the 1970s and started becoming a standard for multinational corporations in the late 1990s, with the rise of public concern regarding the environmental and social impact of globalization. From this period onwards, several organizations such as GRI, CDP or SASB started developing frameworks to bring consistency to reporting practices. This standardization would help prevent misleading reporting and provide more transparency on corporations' practices.

It is commonly agreed that CSR reporting is a complex discipline: it covers a much wider spectrum than traditional financial reporting, focusing on a variety of topics such as effluent pollution, gender equality or anti-corruption practices. The Global Reporting Initiative (GRI) has documented more than 150 topics to report on, divided into 4 categories (universal, economic, environmental and social) and spread over 33 sub-categories. Since 2017, about 6,000 of the largest companies operating in the EU have had to deliver CSR reporting. But even when those companies decide to comply with GRI standards, they report all this information in PDF format, which makes it very difficult and costly to extract, compare and challenge this information.

As a point of comparison, in the financial reporting field, compliance with IFRS standards is mandatory in most countries in the world. International standards have been well accepted since the 1970s, and financial reporting data is quickly becoming open data, thanks to initiatives such as EDGAR, XBRL or LEI. This provides an interesting perspective on the current state of CSR reporting and draws a possible path for future improvements.

Overall, CSR reporting has now reached an acceptable level of standardization. However, the information generated by the reporting companies still needs to be made easily available, so that various stakeholders can use it and act upon it.

1.2 Content extraction

An obvious path to make CSR reporting data easily available would be to incorporate it into a standardized digital format such as XBRL. GRI tried to develop such an initiative in 2014 [1]. Unfortunately, this initiative did not receive enough support and has since been discontinued. To my knowledge, no similar initiative is currently in development, which means that the only way to retrieve CSR reporting data in the foreseeable future is to extract it from PDF documents.

CSR reporting is composed of both qualitative and quantitative information. Qualitative information is available as free text, while quantitative information is most often presented in tables. I decided to focus first on quantitative data, as it provides a convenient base for benchmarking companies against each other. I therefore decided to develop a tool to extract quantitative KPIs from tables presented in CSR reports.

1.3 Personal context

I started working on providing sustainability insights about corporations during my tenure as a Data Science Manager at Eccentrade. I left the company while preparing the present thesis subject. As a consequence, I decided to focus my effort on an innovative approach using deep learning, rather than taking a more conservative approach that might have delivered tangible results more quickly. I also took this opportunity to evaluate how commercially viable a product based on this work could be.


2 Technical solution

2.1 Literature review

Table detection and content extraction have been active research topics since the boom of OCR software and the development of digitized document formats such as PDF in the 1990s. The central challenge is that tables are complex and versatile structures: they can represent multiple dimensions of a data point in a concise manner. This versatility makes it challenging, first to define and recognize a table, then to interpret and extract the information from it in an automated way.

2.1.1 Table detection

Many approaches have been tested to detect tables from a PDF file. Some of the key challenges to tackle are:

• Tables can be fully or partially delimited by separating lines, or their formatting can only be suggested, without any actual lines.

• Pages of content can be laid out in several columns, leading to false positives if this case is not carefully handled.

• Graphics such as bar charts can present content in a similar fashion to tables.

• Complex headers or stubs can overlap several columns.

Most approaches use heuristics, statistical information, machine learning or a mix of the three. Some of the key work in this field has been:

• The definition of a tabular abstraction by X Wang [2], which suits the digitization of most table formats. Figure 1 represents the terminology formalized by Wang.

• The development of an evaluation framework and a test dataset by M Göbel et al. [3], which enables the formal comparison of proposed solutions.

Figure 1: X Wang table terminology. Source: [4]

Notable work in the recent years includes:

• A Nurminen [5], which reaches state-of-the-art results using heuristics

• TEXUS solution [4], which puts together a complete working pipeline, from table detection up to the identification of headers and stubs

• DeepDeSRT solution [6], which presents the first fully deep learning-based solution for table detection and cell content extraction.

2.1.2 Table formatting detection

Even though some file formats such as HTML or Microsoft Excel natively support the storage of tables, they do not carry the means to interpret this data: how does one find precisely the contextual data linked to a value? Research has been done on this topic in recent years. I identified the following work as relevant for our topic:

• Z. Chen and M. Cafarella [7] extracted relational tuples out of 400K Excel files scraped from the internet, using Conditional Random Fields and some heuristics.

• X. Chen et al. [8] focused on extracting relational tuples out of financial statements, which are among the most complex table structures.

2.2 Methodology

2.2.1 Baseline

2.2.1.1 Template

I started by developing a simple baseline pipeline which takes a CSR annual report in PDF format as input and returns the CO2 footprint of the company in a normalized unit as output. There were multiple purposes for developing this baseline:

• Clearly defining every step that needs to be undertaken to produce the output

• Proving that this approach is viable

• Assessing the weakest points in the pipeline, in order to focus our later efforts on those points

• Creating a reference point against which we will later compare our progress

After reviewing research work in the fields of document analysis, content extraction and information retrieval, I designed the generic pipeline template as shown in Figure 2 to solve our problem.

Figure 2: KPI extraction process

This pipeline is composed of the following components:

1. PDF parsing: takes a PDF as input and returns a list of words, lines and images with their coordinates in the document and their meta-properties (size, font, colors, etc.).

2. Table detection: this process understands the organization of the document and identifies the areas of text, tables and images.

3. Cell content extraction: this step focuses on table areas. It separates cells from each other and returns the content of each cell separately.

4. Frame finder: it focuses on understanding the logical structure of the table: what region represents the values of the table and what region is contextual information for those values?


5. Hierarchy extractor: this step focuses on understanding the contextual information linked to a value: is the header or the side stub composed of simple one-level information, or is it complex, with multiple levels of hierarchical information?

6. Semantic tuple builder: this step simply links an individual cell with the contextual information related to it.

7. KPI extractor: this process identifies which tuples are related to the KPI that we are trying to retrieve and extracts the contextual information around it, e.g. year, geographical region, unit.

8. KPI normalization: this final step converts the extracted values into the targeted unit (e.g. tCO2e/year for CO2 emissions) and stores the identified metadata in a relational database.

2.2.1.2 Baseline implementation

The intent of this first pipeline was to use the simplest possible tools and heuristics, in order to get results as quickly as possible. The baseline was developed as follows:

1. PDF parsing: handled by the PDFMiner Python library, as a subcomponent of the PDFPlumber library.

2. Table detection: handled by the PDFPlumber library. Its developers claim to use principles developed in A. Nurminen's thesis [5], though PDFPlumber does not reach the quality of Nurminen's results, as we will see in the following section.

3. Cell content extraction: handled by the PDFPlumber library, as in the previous step.

4. Frame finder: here, I simply defined the top row as the header and the left column as the stub.

5. Hierarchy extractor: I did not implement this step, as my header and stub are supposed to be one-level only.

6. Semantic tuple builder: I simply linked each data cell to the corresponding header and stub cells.

7. KPI extractor: starting from a key term (e.g. emission), I automatically built a dictionary thanks to online platforms showcasing word embedding techniques; for example, one such platform uses the English Wikipedia as its corpus [9]. This approach can also help to easily build dictionaries in other languages. I used this dictionary to find terms related to the targeted KPI in the semantic tuples.

8. KPI normalization: here I again used word embeddings to build extensive synonym lists of units. I then linked each unit list to a conversion factor into the normalized unit. To perform the normalization, I simply search for a unit in the identified tuples and apply the conversion factor to the tuple value (a simplified sketch of these steps is shown below).
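The following minimal sketch illustrates how such a baseline can be wired together with PDFPlumber. The dictionaries and helper names are hypothetical placeholders; in the actual baseline, the synonym lists were generated with online word-embedding services rather than written by hand.

```python
import pdfplumber

# Hypothetical dictionaries; in the baseline they were generated with online
# word-embedding platforms rather than written by hand.
KPI_TERMS = {"emission", "emissions", "co2", "carbon", "ghg"}
UNIT_FACTORS = {"tco2e": 1.0, "ktco2e": 1_000.0, "mtco2e": 1_000_000.0}

def extract_semantic_tuples(pdf_path):
    """Parse the PDF, detect tables, and build (header, stub, value) tuples."""
    tuples = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                if not table or len(table) < 2:
                    continue
                header = table[0]            # top row used as the header
                for row in table[1:]:
                    stub = row[0]            # left column used as the stub
                    for col, cell in enumerate(row[1:], start=1):
                        head = header[col] if col < len(header) else None
                        tuples.append((head, stub, cell))
    return tuples

def extract_kpi(tuples):
    """Keep only tuples whose context mentions an emission-related term."""
    def context(t):
        return " ".join(filter(None, t[:2])).lower()
    return [t for t in tuples if any(term in context(t) for term in KPI_TERMS)]

def normalize(value, unit):
    """Convert a raw value into tCO2e using a unit conversion table."""
    factor = UNIT_FACTORS.get(unit.lower().replace(" ", ""))
    return float(str(value).replace(",", "")) * factor if factor else None
```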

2.2.1.3 Results

As a first assessment, I ran the PDFPlumber table extraction tool against the ICDAR 2013 Table Competition dataset [10]. On the combined exercise of locating tables and extracting their cell content, PDFPlumber yields an F1-score of 0.49, where Nurminen's solution reaches an F1-score of 0.84 and commercial solutions like FineReader or OmniPage do even better (see Figure 3).

Figure 3: Sample of results on the ICDAR 2013 table recognition competition

Participant   Recall   Precision   F1-measure
FineReader    0.88     0.87        0.88
OmniPage      0.84     0.85        0.84
Nurminen      0.81     0.87        0.84

Then I ran the baseline described previously on 100 PDFs randomly selected among the CSR annual reports listed by GRI. As expected, the results are not outstanding. So, in order to determine which area we should focus our improvement efforts on, I performed a shallow quantitative analysis of the results. This analysis consists of checking the number of documents still producing results after each step of the pipeline. The outcome can be found in Figure 4; the "# of results" column represents the sum of true positive and false positive results. A more rigorous approach would have been to calculate a confusion matrix for each step, but this simple analysis already tells us that most steps still need drastic improvement.

Figure 4: Results churn along the baseline pipeline

Step output   Step name                 # of results   % decrease
Input         Input                     100            -
3             Cell content extraction   52             -48%
6             Semantic tuple builder    44             -15%
7             KPI extractor             20             -55%
Output        KPI normalization         5              -75%

Following the analysis of this baseline, I decided to focus on the first problematic step: the table detection step.

2.2.2 Table detection

As discussed in the literature review section, table detection is a complex topic. Researchers have tried various techniques in the past decades, but none of them yet provides outstanding results. For the purpose of the present project, I evaluated various solutions from both research and commercial organizations. Figure 5 presents the outcome of my investigation.

I was looking for a solution meeting the following constraints:

• Good accuracy

• Free

• Preferably with open-source code

Figure 5: Solutions assessment for table detection and cell content extraction

ABBYY FineReader: commercial solution
Oro 2009 (PDF-TREX): commercial solution
Rastan 2015 (TEXUS): commercial solution
Tesseract: difficult to configure
pdfplumber: low accuracy
Y. Liu 2007 (TableSeer): low accuracy
Breuel 2002: ready to implement (viable solution)
Jing Fang 2011: ready to implement (viable solution)
Nurminen 2015: taking over source code (viable solution)
DeepDeSRT: ready to implement (viable solution)

Four solutions appeared as viable alternatives. I excluded Breuel 2002 and Jing Fang 2011 solutions, as they have not been benchmarked against other solutions. Even though both articles are influential in the research community, I was not sure that they would deliver state-of-the-art results.

On the other hand, both the Nurminen and DeepDeSRT solutions demonstrated great results against the ICDAR 2013 dataset. I exchanged with Anssi Nurminen, who warned me that taking over his source code might be time consuming. Hence, I decided to implement my own table detection solution, inspired by the DeepDeSRT approach.

2.2.3 Deep learning approach

I took an approach similar to DeepDeSRT [6], the first table detection project based on deep learning technologies. In 2017, the German team put together a solution based on the Faster R-CNN architecture. In the present project, I decided to use the more recent RetinaNet architecture instead, as it is reported by the community to deliver better results.

2.2.3.1 RetinaNet architecture

RetinaNet

RetinaNet is a neural network architecture for object detection proposed by Facebook researchers in 2018 [11]. Since the development of R-CNN in 2014, the best performing object detection networks have had two stages:

• A first regression stage to propose a sparse list of possible bounding boxes around objects

• A second classification stage to classify the object within those bounding boxes

In parallel, so-called one-stage neural networks have been developed. They are typically very fast at prediction time, though their accuracy was lagging behind that of two-stage networks. RetinaNet changed the situation by reaching state-of-the-art accuracy with a one-stage architecture. The RetinaNet architecture is sketched in Figure 6.

Figure 6: Retinanet network architecture. Source: [11]

The Facebook team achieved this performance by introducing a new loss function called the Focal Loss (FL), illustrated in Figure 7 for various values of gamma. The FL function is used at the output of the classification subnets (c). Compared to the cross entropy (CE) used by two-stage networks, the FL function deals with class imbalance: for easy, well-classified examples, the modulating factor of the FL function shrinks towards zero and strongly down-weights their contribution to the loss, while for hard or under-represented examples the factor stays close to one, leaving a loss close to the CE function. This prevents the overwhelming number of easy examples from dominating the training.
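For reference, the focal loss defined in [11] is:

\[
\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)
\]

where p_t is the predicted probability of the ground-truth class, (1 - p_t)^gamma is the modulating factor and alpha_t is an optional class-balancing weight; with gamma = 0 the expression reduces to the (weighted) cross entropy.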

Figure 7 : Focal loss function with various gamma values. Source: [11]

They also included recently developed features such as Feature Pyramid Network (FPN) and Anchors that I briefly describe in the following section.

ResNet

The backbone of RetinaNet is composed of a Feature Pyramid Network (FPN) built on top of a ResNet-50 network.

The Residual Network (ResNet) is arguably the architecture that enabled deep neural network development in recent years [12]. In plain neural networks, researchers found empirically that the error on the training set starts going up at some point as the number of layers increases, whereas in theory it should keep going down. The by-passing mechanism of ResNet, sketched in Figure 8, allows the loss to keep decreasing as the number of layers increases.

Figure 8: ResNet building block principle. Source: [12]
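As an illustration of this by-passing principle, a minimal identity residual block can be sketched in Keras as follows (a simplified sketch using the tf.keras API, not the exact blocks of ResNet-50):

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """Minimal identity residual block: two convolutions plus a skip connection."""
    shortcut = x                                    # the by-pass branch
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                 # F(x) + x
    return layers.ReLU()(y)

# Example usage: the number of filters must match the input channels (here 32)
inputs = keras.Input(shape=(64, 64, 32))
outputs = residual_block(inputs, filters=32)
model = keras.Model(inputs, outputs)
```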

The 50 in ResNet-50 represents the depth of the network.

Feature Pyramid Network (FPN)

The FPN is a computationally efficient way to deal with the detection of objects at various scales in an image. The feature pyramid is a simple principle developed in 1984 [13]: subsample an image in order to enable its interpretation at various scales. The Facebook and Cornell University research teams managed to leverage the convolutional architecture to make feature pyramid analysis computationally appealing [14]. An FPN is coupled with a convolutional network, e.g. a ResNet, and is capable of proposing bounding boxes at various scales. In the explanatory Figure 9, the left side represents the convolutional network with its bottom-up pathway. Every output of a convolutional level is sent to the FPN, which then produces proposed bounding boxes of various scales and proportions.

Figure 9: Architecture and building block of FPN architecture [14]

Finally, the last components of RetinaNet are a classification subnet and a box regression subnet. Both are Fully Convolutional Networks (FCN) and are trained in parallel, without interacting with each other.

2.2.3.2 Transfer learning

Principle

Due to its millions of parameters, training a deep neural network from scratch requires a huge dataset and substantial computing power. That is why, in practice, it is quite rare to train a deep neural network from scratch with randomized weight initialization. Instead, one uses a method called transfer learning. The main idea is to start from a network that has been pretrained on a huge generic dataset. The last layers of the network are then modified to accommodate the new task at hand, and the weights of the first layers are frozen to prevent their modification. The logic behind this approach is that the first layers of a neural network, the convolutional layers in an image recognition application, perform generic feature extraction, so they can most likely be reused as-is for the new task.
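A minimal, generic Keras sketch of this procedure (illustrative only; it uses a plain image classifier rather than the actual keras-retinanet detection model):

```python
from tensorflow import keras

# Start from a backbone pretrained on a large generic dataset (here ImageNet).
base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   pooling="avg", input_shape=(512, 512, 3))
base.trainable = False                          # freeze the pretrained layers

# Replace the last layers with a new head adapted to the task at hand.
outputs = keras.layers.Dense(2, activation="softmax")(base.output)
model = keras.Model(base.input, outputs)
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy")

# Later, the deepest convolutional layers can be unfrozen and the model
# recompiled with a lower learning rate.
for layer in base.layers[-30:]:
    layer.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy")
```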

Possible strategies

Several strategies can be adopted with transfer learning, depending on the volume of training data available and how distant the domain of prediction is from the purpose of the initial network.

• With a small dataset and a close application domain, retraining only the newly added dense layers can be sufficient.


• With a medium size dataset or a distant application domain, it is advised to unfreeze the convolution layers closest to the new dense layers.

• With a large dataset, it is even possible to unfreeze the whole network.

• With a large dataset and a different application domain, it is possible to train only the new dense layers for a few epochs, and then unfreeze the whole network for the rest of the training.

• A last possibility is to apply the so-called differential learning rate method: different learning rates are used in different parts of the network, with high learning rates in the dense layers, medium learning rates at the end of the convolutional network and low rates for its base. Unfortunately, this method is not yet implemented in all frameworks, Keras included.

Our approach

In our case, we are starting from a RetinaNet that has been trained for us on the MS COCO dataset. MS COCO is a generic labeled dataset of over 200k images with 1.5 million object instances of 80 categories. MS COCO includes categories such as cats, trains or chairs.

From this pretrained network, we modify the last classification layer so that it can predict two classes: Table and TableBody. TableBody is the table itself, while Table also includes any caption and title of the table. I will later use these two classes for different purposes:

• The TableBody class will be used to benchmark our table detection algorithm against other solutions. Indeed, most table detection solutions only consider the extraction of the table body as a success.

• The Table class will be included in the final pipeline, as titles and captions often contain critical information for understanding the context of a numerical value.

We then use the Marmot dataset to perform the transfer learning. As part of the hyperparameter tuning, I performed several test trainings to decide how many layers to unfreeze: I started by unfreezing the last layers, then kept unfreezing contiguous layers until the performance of the network stopped improving significantly. We will discuss the hyperparameter tuning in more detail in the related section.

2.2.3.3 Datasets

ICDAR 2013

ICDAR (International Conference on Document Analysis and Recognition) is a biennial conference. In the 2013 edition, one of the competitions focused on table location and table structure recognition in PDF documents. The organizers published a ground-truth dataset of 200 documents with the corresponding labels, as well as the methodology to assess the quality of the results. We will come back to the quality assessment details in the Benchmarking section.

This dataset contains extracts of corporate annual reports, which makes it an ideal candidate for our use case. Unfortunately, the size of the dataset is too limited to perform proper transfer learning, so we only use this dataset as a test set to evaluate the performance of our solution.

Marmot

The Marmot dataset was published by Peking University [15]. It contains 2,000 pages with the corresponding labels. The documents come from the scientific literature and are evenly split between Chinese and English, as well as between pages with and without tables. As described by the DeepDeSRT team [6], the labeling of the Marmot dataset contains mistakes and inconsistencies. Using the online labeling platform LabelBox [16], I manually adjusted or corrected 218 documents, and I used this corrected dataset to perform the transfer learning.

2.2.3.4 Hyperparameters tuning

In the context of deep neural networks, even more than in traditional machine learning, it is impossible to tune individual parameters manually, as there are millions of them. Instead, one focuses on tuning hyperparameters that influence the training behavior.

Advanced methods for exploring the hyperparameter space are available for neural networks, such as Bayesian optimization and software packages like Hyperopt. However, they require a large number of iterations. Given that a typical iteration cost around 4€, due to the high price of GPU rental, I took a much rougher approach and tuned the following hyperparameters by hand.

I left most of the hyperparameters untouched, keeping the default settings recommended by either the RetinaNet team or the DeepDeSRT team. I focused instead on tuning the hyperparameters that are reputed to have an important influence: the learning rate, the depth of frozen layers, the number of epochs and the batch size.

Initial learning rate

As explained previously, Keras only allows setting one learning rate for the whole network at a given time. It is wise to set a high learning rate for new areas of the network, so they can make quick progress, then decrease it gradually to reach an optimal state. On the other hand, previously trained areas should only use a low learning rate, to avoid unlearning what has been previously acquired.

To accommodate the need for a progressive decrease of the learning rate, I use a callback called ReduceLROnPlateau. This function watches a given metric after each epoch (the total loss on the validation set in our case) and decreases the learning rate by a factor if this metric stops improving for a given number of epochs. I kept its parameters fixed throughout the experiments, apart from the initial learning rate: a decreasing factor of 0.1 and a patience of 2 epochs.
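A minimal sketch of this callback with the settings described above (the model and data objects are placeholders):

```python
from tensorflow import keras

# Divide the learning rate by 10 when the validation loss has not improved
# for 2 consecutive epochs.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                              factor=0.1,
                                              patience=2,
                                              verbose=1)

# model.fit(train_data, validation_data=val_data,
#           epochs=20, batch_size=4, callbacks=[reduce_lr])
```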

Depth of layers trained

I tried 5 different setups in which I progressively increased the number of layers to be trained. The details are as follows:

1. Only unfreeze the newly created classification and regression sub-models – 4.8M parameters to train (13% of total)

2. Additionally, unfreeze FPN layers directly connected to the classification and regression sub-models – 7.2M parameters to train (20% of total)

3. Unfreeze all the FPN layers – 12.8M parameters to train (35% of total)

4. Unfreeze all the Batch Normalization nodes directly connected to the FPN – 12.9M parameters to train (35% of total)

5. Unfreeze all the Convolutional nodes connected to the previous Batch Normalization – 15.8M parameters to train (43% of total)

Retrospectively, after reading more on how transfer learning is usually performed, I would change how I performed steps 4 and 5: instead of unfreezing layers connected to the FPN, I would progressively unfreeze the higher levels of the ResNet, as illustrated in Figure 10 and Figure 11 on a simplified ResNet representation. The idea is to retrain the deeper parts of the network, which specialize in the current task, while leaving the shallow, generic parts untouched.

Figure 10: Implemented ResNet training strategy

Figure 11: Proposed ResNet training strategy

Epochs, batch size

Finally, I tuned the batch size and the number of epochs. The batch size is the number of images processed at once before performing back-propagation. It is recommended to use the biggest batch size possible, as it makes the learning process more stable. However, the batch size is usually limited by the memory available on the GPU; in my case, I could not increase the batch size above 4.

An epoch represents one pass over all images in the training dataset. I ran all my trainings for 10 or 20 epochs, though the improvement was usually not significant after the 10th epoch.

2.2.3.5 Software and hardware

All the code for this project was written in Python 3.6. I used Keras as the deep learning framework, with TensorFlow as the backend. In particular, I used the RetinaNet implementation in Keras [17] and its model pretrained on the MS COCO dataset, using a ResNet-50 backbone architecture.

In order to train the neural networks, I deployed a preconfigured Linux server on Google Cloud Platform with tools such as CUDA, TensorFlow and Jupyter already installed, together with one NVIDIA Tesla K80 GPU. I linked this server to persistent storage buckets, in order to store my training data at a lower cost. Altogether, this installation costs about 1€ per hour of operation and cost less than 100€ for the whole project. I considered purchasing an NVIDIA GPU or deploying multiple GPUs in the cloud, but it did not make economic sense since we are only performing transfer learning. Purchasing GPUs might make more economic sense when training a neural network from scratch on a whole new subject.

2.3 Results

2.3.1 Metrics

2.3.1.1 Dataset split

In order to perform the transfer learning on the RetinaNet, I randomly split the cleaned Marmot dataset into a training set (70%), a validation set (15%) and a test set (15%).
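A minimal sketch of this split, assuming the cleaned Marmot pages sit in a hypothetical marmot_clean/ folder:

```python
import glob
from sklearn.model_selection import train_test_split

pages = sorted(glob.glob("marmot_clean/*.png"))   # hypothetical location

# 70% training, 15% validation, 15% test (random split, fixed seed)
train_pages, rest = train_test_split(pages, test_size=0.30, random_state=42)
val_pages, test_pages = train_test_split(rest, test_size=0.50, random_state=42)
```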

Different metrics are monitored during the training and evaluation phases. Among other things, this enables the modeler to assess separately the performance of the regression sub-model and of the classification sub-model.

2.3.1.2 Training metrics

During the training, we monitor 6 different losses at the end of each epoch:

• The regression loss, on the training and on the validation set

• The classification loss, on the training and on the validation set

• The sum of the two previous losses, on the training and on the validation set

The key metric is the sum of the losses on the validation set. Next, we detail the various losses.

Regression loss

A smooth L1 loss is used to evaluate the regression loss. It is defined as follows:

\[
\mathrm{smooth}_{L1}(\hat{y}, y) =
\begin{cases}
0.5\,(\hat{y} - y)^2 & \text{if } |\hat{y} - y| \le 1 \\
|\hat{y} - y| - 0.5 & \text{otherwise}
\end{cases}
\]

Figure 12: smooth L1 loss function

Like the traditional L1 loss, the smooth L1 loss is robust to outliers. Additionally, the smooth L1 loss is differentiable, which is a prerequisite for applying back-propagation.

Classification loss

The focal loss described in section 2.2.3.1 is used as the classification loss.

2.3.1.3 Evaluation metric

Once we estimate that the network has reached a steady state, we calculate the average precision on the training set and the test set. The average precision is calculated as follows:

1. The Jaccard coefficient is calculated for the bounding boxes. It should be above 0.5 to be accepted. I explain below how the Jaccard coefficient is calculated.

2. The confidence score is calculated for the classification. All confidence scores below 0.05 are excluded.

3. Using the remaining confidence scores, the precision-recall curve is calculated.

4. The average precision is then computed as the area under this precision-recall curve, as recalled below.
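For reference, the average precision corresponds to the area under the precision-recall curve, approximated as a finite sum over the retained confidence thresholds:

\[
\mathrm{AP} = \int_{0}^{1} p(r)\,dr \;\approx\; \sum_{k} \left(r_k - r_{k-1}\right) p(r_k)
\]

where p(r) is the precision at recall r.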

The Jaccard coefficient, also called Intersection over Union (IoU), is the ratio between the overlapping area of a predicted bounding box and the ground-truth bounding box and the area of their union, as illustrated in Figure 13.

Figure 13: Jaccard coefficient calculation
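A minimal implementation of this coefficient for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def jaccard(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted table is accepted when jaccard(predicted, ground_truth) > 0.5
```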

2.3.1.4 Benchmarking

The ICDAR 2013 Table Competition [10] defines two sets of quality metrics: one for the Table Detection competition and one for the Cell Content Extraction competition. We will use both: the Table Detection metric to benchmark our RetinaNet solution and the Cell Content Extraction metric to benchmark PDFPlumber.

The Table Detection metric consists of counting the number of characters in the ground-truth bounding box, the predicted bounding box and their intersection, and then calculating the precision, recall and F1-score based on those counts on a page. Average scores are calculated by document. The final scores are the average of those document scores.
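A simplified sketch of this character-based scoring for a single table region on a page, assuming each character is represented by its centre point (the official evaluation tool works on the competition's ground-truth files):

```python
def inside(point, box):
    """True if a character centre (x, y) falls inside a box (x1, y1, x2, y2)."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

def table_detection_scores(char_centres, gt_box, pred_box):
    """Character-count precision, recall and F1 for one table region."""
    in_gt = [c for c in char_centres if inside(c, gt_box)]
    in_pred = [c for c in char_centres if inside(c, pred_box)]
    in_both = [c for c in in_gt if inside(c, pred_box)]
    precision = len(in_both) / len(in_pred) if in_pred else 0.0
    recall = len(in_both) / len(in_gt) if in_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```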

The Cell Content Extraction metric uses the principle of proto-links. A proto-link is the relationship between two adjacent cells, vertically or horizontally. In this definition, relationships with blank cells are not considered, as shown in Figure 14.


Figure 14: Illustration of proto-links. Source: [3]

Two proto-links are considered to match when the contents of the two cells and the direction match. Precision, recall and F1-score can be calculated on this basis.

2.3.2 Results

2.3.2.1 Evolution during training

For illustration purposes, I show here some insights gained during the training phase. The data comes from the training of the best model. Figure 15 shows the evolution of the training and validation losses throughout the epochs, as well as the evolution of the learning rate. We see clearly that the learning rate is decreased each time the validation loss fails to improve for 2 epochs. We also note that the validation loss is in general about 2 times higher than the training loss, meaning that we are not yet overfitting the training set, though the peaks at epochs 5 and 18 could be signs of overfitting.

Figure 15: Evolution of losses during training

Next, we look at the evolution of the classification and regression losses. As explained in the previous section, the model loss is calculated as the sum of the losses of the two sub-models, so it is important that those two losses are comparable measures. We verify in Figure 16 that this is indeed the case. We also see that the classification loss is much lower and reaches a steady state much faster, which can be explained by the fact that we are only trying to predict two classes; in comparison, the regression problem is much more complex to tackle.

Figure 16: Evolution of sub-models’ losses during training

2.3.2.2 Hyperparameters optimization

As described in the Hyperparameters tuning section, I focused my optimization efforts on a few hyperparameters: the number of layers to unfreeze for training, the learning rate, the number of epochs and the batch size.

Regarding the batch size, as explained previously, bigger is usually better. I picked a batch size of 4, given the memory constraint on my GPU. Regarding the number of epochs, I usually saw little improvement after 10 epochs. The model showcased in the previous section is one of the slowest to train, given its high initial learning rate (1e-3) and the large number of parameters to train (15.8M); yet we see little improvement after 10 epochs.

I then worked on tuning the learning rate. I started trainings with an initial learning rate of 1e-5, as recommended in the DeepDeSRT paper, and then increased this learning rate by factors of 10. For the same model, this led to Figure 17; 1e-4 turned out to be the best initial learning rate.


Figure 17: Average precision evolution against initial learning rate

Finally, I focused on progressively unfreezing layers of the network, training after training. This led to Figure 18. We note that the average precision keeps increasing even after step 5, which means that the optimum might not have been reached yet. As discussed in the Hyperparameters tuning section, I believe it would be interesting to try an alternative unfreezing approach.

Figure 18: Average precision evolution against number of layers unfrozen

In conclusion, the best model has been trained with 20 epochs, a batch size of 4, an initial learning rate of 1e-4 and 15.8 million parameters to be trained, corresponding to all layers following the last convolutional layer of each level of the ResNet (i.e. Step 5). I benchmark this precise model in the following section.

2.3.2.3 Benchmarking

I benchmarked the best model on the ICDAR 2013 table detection sub-competition. The testing protocol is described in the Benchmarking section. At this point, the model had only seen the MS COCO and the cleaned Marmot datasets. In our problem, we want to avoid losing information that will not be accessible later in the pipeline; hence, recall is the most important measure for us. The model achieves decent results, comparable to recent submissions and commercial solutions, with a recall of 0.92 and an F1-measure of 0.93. Also, as underlined by the DeepDeSRT team, we use images as input rather than PDF documents. In practice, this makes the solution more robust, as we can process a greater variety of documents, but we are also penalized compared to other solutions since we do not have access to various metadata. Figure 19 puts my results in perspective against other solutions; the solutions in italics are commercial products.

Figure 19: Table detection performance of my solution against state-of-the-art methods

Input type   Participant   Recall    Precision   F1-score
PDF          FineReader    0.9971    0.9729      0.9848
Image        DeepDeSRT     0.9615    0.9521      0.9578
PDF          Nurminen      0.9077    0.9210      0.9143
PDF          Acrobat       0.8738    0.9365      0.9040
Image        Peillon       0.9190    0.7778      0.8290
PDF          Yildiz        0.8530    0.6399      0.7313

2.4 Future improvements

2.4.1 Deep learning

As we found out, the table detection solution that I created shows promising results, though it does not yet reach the state of the art. I foresee several axes of improvement.

Hyperparameters optimization

On top of the improvement points already mentioned, it would be good to take a more systematic approach to hyperparameter optimization. One possible approach is to use Tree-structured Parzen Estimators, in order to avoid performing a grid search over the hyperparameter space.

Try other architectures

Another approach would be to try different neural network architectures. I could try other versions of the ResNet backbone, such as ResNet-152 or ResNet-18, to verify whether the depth of the backbone is really necessary for our problem.

I also would like to train a Faster-RCNN network to try reproducing the DeepDeSRT results and understand if my tuning strategy is sufficient to reach the results of this paper.

Own dataset

Finally, it would be very beneficial to develop my own dataset, based on actual CSR annual reports. Even though the ICDAR dataset is close to our targeted universe, having our own dataset should improve the end-result.

2.4.2 Improve other steps

I did incorporate the RetinaNet neural network into the baseline, but it did not improve the performance of the pipeline: I could not bypass the table detection performed internally by PDFPlumber, which limits the overall results.

As noted in the Results section, there are still a number of steps which need improvement before the pipeline reaches a satisfactory level of reliability.

Cell content extraction

The DeepDeSRT team demonstrated that a deep learning approach can also yield state-of-the-art results in terms of cell content extraction. I would like to implement a similar FCN-based segmentation model.

Table formatting understanding


The Frame Finder and Hierarchy Extractor steps have received less attention from the research community. However, the approach developed by X. Chen et al. [8] to deal with complex tables from financial reports seems promising for our particular use case.

Semantic understanding

Many improvements can be envisioned in the understanding of the semantic tuples, especially using named-entity recognition techniques. A more immediate improvement would be to use all the CSR reports that I gathered as the corpus for word embedding, instead of a generic corpus.

Industrialization

As we will see in more detail in the following section, the industrialization of this pipeline will require some investments, such as on-premise GPUs to decrease training costs, crowdsourcing to develop a proprietary labeled dataset, and additional hosting capabilities to crawl corporate websites for the latest CSR reports and to deliver our data to customers via a web platform.


3 Business plan

3.1 Environment

3.2 Value proposition

The focus of this business is to increase the value delivered by sustainability analysts, by automating the extraction and the benchmarking of corporate data.

3.2.1 Customer segments

According to LinkedIn, there are globally about 15 thousand sustainability analysts and consultants. In interviews, all of the professionals I spoke with mentioned that acquiring data was a very time-consuming part of their job. In fact, one analyst from a leading ESG consulting firm estimated that analysts in his company spend up to 50% of their time on this single task. My goal is to help these people deliver more value by taking care of the data acquisition and letting them focus on higher added-value tasks.

3.2.2 Value proposition

The idea is to provide CSR KPIs about specific companies, such as the ones described in GRI Standards [18], to our customers. This data will be benchmarked against:

• Competitors

• Historical values and corporate commitments

• Global objectives such as the Paris Agreement

The automation of our pipeline will enable us to deliver updates about companies only a few hours after publication, compared to the current industry norm of between one day and a few weeks.

3.2.3 Customer relations and acquisition channel

The head of ESG consulting services at each customer will sign a yearly subscription with us, entitling their department to consult and monitor a given number of companies.

The analysts will have access to our data through a web platform. Once they start monitoring a company, they will be informed about any new information available for this firm. The analyst will be able to request information about a company which is not currently in our database. It will trigger an investigation process on our side.

Our marketing and sales approach will be focused on CEOs and Directors of ESG consulting entities. We will publish content and advertise on ESG specialized websites and will physically attend sustainability related forums. We will become an official GRI partner, to gain visibility. Finally, our Account Manager will approach warm leads generated through those channels.

3.3 Business model

3.3.1 Market

The global market for sustainability consulting is about $1 billion per year according to Verdantix [19]. This overall market is steady, due among other things to uncertainties about international commitments regarding sustainability goals, but it hides important differences between areas of focus. Investors are moving fast in the ESG field: in 2016, 26% of professionally managed assets had to respond to ESG criteria according to McKinsey [20], representing a yearly growth of 17%. This makes the ESG consultancy sector very dynamic, as highlighted by another Verdantix study [21]. Based on the Verdantix studies, we estimate that the ESG consultancy market represents $400 million in 2018. Also, fund managers identify the lack of data and the cost of research in ESG as one of the main challenges to address [22].

Several other key factors are expected to have positive impacts on this sector:

• The EU directive on mandatory non-financial reporting for large corporations [23] is coming into place for the financial year 2018, which will lead to an important increase in the amount of data available.

• International standardization and reporting bodies in the sustainability field, such as GRI [18], CDP (Carbon Disclosure Project) [24] or SASB (Sustainability Accounting Standards Board) [25], are increasing the maturity of their standards and are being promoted directly by the business community.

• As in many other financial fields, the main actors in the ESG investment scene believe that AI will have an important impact on their business and are investing in those technologies accordingly [26].

3.3.2 Competition and competitive advantage

Various companies and organizations are delivering CSR and ESG related information today. Some notable players are:

• GRI [27], which lists about 50,000 CSR annual reports. As an NGO with limited financial power, GRI has struggled in past years to extract key data from those reports.

• Sustainalytics [28] is a rating company focused on ESG practices of listed companies. It is regarded as one of the leading providers in the field. It heavily relies on its 200 sustainability analysts to deliver quality insights.

• CSRHub [29] is a data provider founded in 2007. It claims to consolidate 550 data sources to provide insights about 18,000 companies. Its team uses web scraping tools and purchases other databases to acquire the input for its rankings.

• Wikirate [30] is a crowdsourcing platform. Members of the community are asked to upload documents and sources about corporations; other members then manually extract KPIs from those documents. The platform hosts 380,000 data points about 30,000 companies. However, the data is very sparse, which makes it difficult to use as a consistent analysis tool.

• Additionally, many ESG consulting companies perform the data analysis themselves, reading the available documents and manually extracting the relevant KPIs.

Our approach differs from those players in that we will focus on automating the data acquisition, keeping our operating costs low.

3.3.3 Implementation

The implementation phase should start with the development of an MVP (Minimum Viable Product) over a period of 3 months. This phase should test the viability of our approach, both in terms of the quality of the product's output and in terms of demand for our solution. This phase will include:

• The automation of PDF retrieval, using web scraping and web crawling techniques. This system should enable us to detect the publication of new reports in less than one hour.

• The extraction of KPIs, by developing the points mentioned in the Future improvements section. The MVP should provide a clearer view of the amount of manual work needed to extract a new KPI or to operate the system, so that the viability of the project can be validated.


• The start of commercial development, by connecting with professionals and validating their interest in the approach. Several prospects should have confirmed their interest in writing to validate the MVP.

The MVP team will consist only of a partner acting as Account Manager and myself in a Data Scientist role.

Once we have validated our hypotheses, we will start an implementation phase. This phase will consist of:

• Developing a web platform to host and deliver our data to customers. This platform should enable us to track the behavior of our customers, so we can understand their way of working and anticipate their needs.

• Performing quality control on our data and enriching it with other contextual data. This includes providing customer support related to the data. This task is key for the evolution of the product, as it will guide its continuous development: for example, we foresee the need to integrate additional data sources, such as news articles or NGO reports, to provide contradictory information that will challenge the reporting delivered by a company. The insights on which data to integrate first should come from this channel.

• Developing an online marketing funnel. This should help us decrease the cost of acquisition of new customers. It will also go hand-in-hand with the behavior tracking of our users.

For this new phase, we will need to recruit a Web Developer, a Data Quality Analyst and a part-time Marketeer.

3.3.4 Financial structure

3.3.4.1 Cost structure

The estimated costs for the 3-month MVP phase are the following. Neither the Account Manager nor myself, as partners, will receive any salary for this period.

Figure 20: 3-month MVP cost structure

Expense                             Amount
GRI listing                         800 €
Purchase server with GPU            3 000 €
Events marketing and prospecting    2 000 €
Total                               5 800 €

For the 6-month implementation phase, the cost structure will be the following: 151,000€ will be spent over the period, after which the operating cost of the structure will be 22,500€ per month.

Figure 21: 6-month implementation phase cost structure

Type    Expense                         Amount       Monthly
Fix     Laptops (x5)                    10 000 €
        Web developer                   36 000 €     6 000 €
        Part-time marketeer             12 000 €     2 000 €
        Data Quality Analyst            24 000 €     4 000 €
        Partners salaries (x2)          48 000 €     8 000 €
        Web hosting                     3 000 €      500 €
        Office rent                     12 000 €     2 000 €
        Marketing campaigns & events    6 000 €      1 000 €
        Total                           151 000 €    23 500 €


3.3.4.2 Revenue stream

At the macro level, we estimate that the ESG consulting market represents $400M, based on the Verdantix study [19]. By automating 10% of the workload of those analysts, we reach a Total Addressable Market of $40M globally.

At the micro level, we will sell our data at an average of 0.6€ per KPI category (e.g. CO2 footprint), per company and per reporting year. This price is in line with providers such as CDP or Engaged Tracking [31]. If we are capable of saving 10% of an analyst's working time, this will represent an average saving of 6k€ per year for their company.

The GRI CSR reports listing registers 50,000 reports and adds 6,000 entries to the list every year. We can then break even under the following hypotheses:

• We are able to retrieve CO2 footprint KPIs for 40% of the GRI-listed CSR reports.

• We sign 2 new customers for our whole dataset every month.

• Customers do not churn and keep buying new data in the following years.

Figure 22: Forecasted results for the CO2-footprint product line

End of year   Total customers   Cumulated revenue   Cumulated cost   Net result
1             24                322 560 €           282 000 €        40 560 €
2             48                357 120 €           282 000 €        75 120 €
3             72                391 680 €           282 000 €        109 680 €

If we manage to validate those hypotheses, this positive result will free up budget to further develop our offering. An investment round should also be considered as early as the end of the MVP phase, to speed up the development of the product.


4 Conclusion

We have validated in this study that automating, even partially, the extraction of CSR reporting data is feasible, though it will require significant additional effort to bring it to a fully automated and reliable state. Given that the CSR reporting market is still very young and immature, there is room today for a solution solving even part of the data acquisition problem.

I really liked working on this thesis project, as it pushed me to deliver results in a limited time on a difficult topic. I took this opportunity to run a first deep learning project and I learned a lot on the journey.

I want to thank Dr. Evangelos Kanoulas for his constant support and sound guidance throughout the project. I would also like to thank Dr. Michelle Westermann-Behaylo and Dr. Efstratios Gavves for their great help, even when requested on a very tight schedule. I was also happily surprised by the availability of Dr. Anssi Nurminen and Dr. Helen Paik to provide me with information about their past research.


5 References

[1] "XBRL Reports Program," Global Reporting Initiative, [Online]. Available:

https://www.globalreporting.org/services/Analysis/XBRL_Reports/Pages/default.aspx. [2] X. Wang, "Tabular Abstraction, Editing and Formatting," 1996.

[3] M. Göbel, T. Hassan, E. Oro and G. Orsi, "A methodology for evaluating algorithms for table understanding in PDF documents," 2012.

[4] R. Rastan, H.-Y. Paik and J. Shepherd, "TEXUS: a task-based approach for table extraction and understanding," 2015.

[5] A. Nurminen, "Algorithmic extraction of data in tables in PDF documents," 2015. [6] S. Schreiber, S. Agne, I. Wolf, A. Dengel and S. Ahmed, "DeepDeSRT: Deep Learning for

Detection and Structure Recognition of Tables in Document Images," 2017. [7] Z. Chen and M. Cafarella, "Automatic Web Spreadsheet Data Extraction," 2013.

[8] X. Chen, L. Chiticariu, M. Danilevsky, A. Evfimievski and P. Sen, "A Rectangle Mining Method for Understanding the Semantics of Financial Tables," 2017.

[9] "WebVectors: word embeddings online," [Online]. Available: http://vectors.nlpl.eu/explore/embeddings/en/#.

[10] M. Göbel, T. Hassan, E. Oro and G. Orsi, "ICDAR 2013 Table Competition," 2013.

[11] T.-Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollar, "Focal Loss for Dense Object Detection," 2018.

[12] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2015. [13] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt and J. M. Ogden, "Pyramid methods in

image processing," 1984.

[14] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan and S. Belongie, "Feature Pyramid Networks for Object Detection," 2017.

[15] "Marmot dataset," Peking University, [Online]. Available:

http://www.icst.pku.edu.cn/cpdp/data/marmot_data.htm. [Accessed 24 09 2018]. [16] "LabelBox," LabelBox, [Online]. Available: https://www.labelbox.com/.

[17] "Keras RetinaNet repository," [Online]. Available: https://github.com/fizyr/keras-retinanet. [18] "GRI Standards," GRI, [Online]. Available: https://www.globalreporting.org/standards/.

[19] "Verdantix Forecasts The Global Sustainability Consulting Market Will Exceed $1 Billion In 2019, Far Below Expectations Of The Consulting Industry," Verdantix, 2015. [Online]. Available:

(27)

http://www.verdantix.com/newsroom/press-releases/verdantix-forecasts-the-global-26 | 27

sustainability-consulting-market-will-exceed-1-billion-in-2019-far-below-expectations-of-the-consulting-industry.

[20] "From ‘why’ to ‘why not’: Sustainable investing as the new normal," McKinsey, [Online]. Available: https://www.mckinsey.com/industries/private-equity-and-principal-investors/our-insights/from-why-to-why-not-sustainable-investing-as-the-new-normal.

[21] "Verdantix Says Heads of Sustainability Will Spend Less on Consulting Engagements Over the Next 5 Years," Verdantix, 2016. [Online]. Available:

http://www.verdantix.com/newsroom/press-releases/verdantix-says-heads-of-sustainability-will-spend-less-on-consulting-engagements-over-the-next-5-years.

[22] "From Niche to Mainstream: Responsible Investment and Hedge Funds," Alternative Investment Management Association, 2018.

[23] "Non-financial reporting," European Commission, [Online]. Available:

https://ec.europa.eu/info/business-economy-euro/company-reporting-and-auditing/company-reporting/non-financial-reporting_en.

[24] "CDP homepage," CDP, [Online]. Available: https://www.cdp.net/en.

[25] "Sustainability Accounting Standards Board - Home," SASB, [Online]. Available: https://www.sasb.org/.

[26] S. Basar, "AI Used To Analyse ESG Data," Markets Media, 2018. [Online]. Available: https://www.marketsmedia.com/ai-used-to-analyse-esg-data/.

[27] Global Reporting Initiative, [Online]. Available: https://www.globalreporting.org.

[28] "Sustainalytics: home," Sustainalytics, [Online]. Available: https://www.sustainalytics.com/. [29] "CSRHub - Sustainability management tools," CSRHub, [Online]. Available:

https://www.csrhub.com/.

[30] "Home - Wikirate," Wikirate, [Online]. Available: https://wikirate.org/. [31] Engaged Tracking, [Online]. Available: https://www.engagedtracking.com/. [32] "Wikipedia," [Online]. Available: https://www.wikipedia.org/.


6 Glossary

(Source: [32])

CDP - Carbon Disclosure Project: organization based in the United Kingdom which supports companies and cities in disclosing their environmental impact [24].

CSR - Corporate Social Responsibility: international private business self-regulation, focused on environmental, social and economic concerns.

Epoch: one pass of the whole training dataset, forward and backward, through the neural network.

ESG - Environmental, Social and Governance: three central factors in measuring the sustainability and ethical impact of an investment in a company or business.

FCN - Fully Convolutional Network: neural network solely composed of convolutional layers. It is very popular in the image recognition field.

FPN - Feature Pyramid Network: neural network architecture developed for image recognition to enable the recognition of objects at different scales [14].

GPU - Graphics Processing Unit: Electronic circuit designed to process massively parallel computation. For this reason, it is very popular to train neural networks.

GRI - Global Reporting Initiative: NGO in charge of defining and diffusing GRI’s sustainability reporting framework.

ICDAR - International Conference on Document Analysis and Recognition: biennial international academic conference.

ResNet – Residual Network: architectural feature in neural network which enabled the fast development of deep neural networks [12].

SASB - Sustainability Accounting Standards Board: American organization which focuses on developing and disseminating sustainability accounting standards [25].
