
Placement Report

Klippa

Roy David (s2764989)
MA Information Science, University of Groningen

Supervisor University: dr. A. Toral Ruiz
Supervisor Klippa: M. Doggen

January 17, 2020


PREFACE

Since the internship is one of my final courses before completing the MA Information Science at the University of Groningen, I want to dedicate this preface to all the staff and teachers of the University of Groningen, who have helped me complete my master's degree over the past year. In particular I would like to thank my supervisors, dr. A. Toral Ruiz and Mark Doggen, for their help and guidance during my internship. I would also like to thank everyone at Klippa for the fun times and for their input and help where needed during the internship.


CONTENTS

Preface

1 Introduction
2 Task and Implementation
  2.1 Task
  2.2 Data
  2.3 Implementation
  2.4 Results
  2.5 Conclusion
  2.6 Future work
3 Evaluation
4 Conclusion
5 Appendix
  5.1 Log


1 INTRODUCTION

Klippa came up once before, during my bachelor Information Science. What I remembered was that some of the founders had studied Information Science and that their focus was on being innovative in automating paper processes. A couple of years later they are still innovative and have several automated products on the market, aimed at getting rid of paper documents. Klippa focuses a lot on machine-learning and automation, which fits nicely with the focus of my study. The assignment was also interesting: developing a machine-learning system (vision- and text-based) to detect total amounts and address blocks on invoices. Although I had little to no experience with invoices and object detection methods, the assignment interested me since it was something new, and therefore challenging, as well as machine-learning related. Getting the placement went pretty straightforwardly: I simply sent an e-mail, had a call and scheduled an interview. Everything went smoothly and I could start a few weeks later. Klippa consists roughly of three teams: app developers, OCR (Optical Character Recognition) developers and marketing. During the internship I joined their OCR team, which focuses, among other things, on data extraction from invoices.

Klippa (https://www.klippa.com) was founded in 2014 by five Dutch IT specialists, some of them with an Information Science background, with the goal of digitizing paper processes with modern technologies. They have offices in Groningen and Amsterdam. During the internship I stayed at the office in Groningen.


2 TASK AND IMPLEMENTATION

2.1 Task

The task for this internship was to develop a machine-learning system (vision- and text-based) to detect total amounts and address blocks on invoices. This task consisted of researching possible machine-learning algorithms suited for the task, analyzing the best suited options and exploring the chosen one to get the best results.

Klippa currently uses a mix of machine-learning algorithms and a rule-based parser to extract data from invoices. They parse the text obtained from the invoice via Google Vision (http://bit.ly/2sCHxl6) and then, via certain rules and algorithms, extract the required data from that text. They were interested in knowing whether there were any other machine-learning based options that could compete with or outperform their current approach. This task specifically targets address blocks and total amounts. Their current approach is not limited to detecting only total amounts and address blocks; it extracts more data, such as line items, VAT numbers, etc. The new model needs to be competitive in terms of performance and speed, since the data extraction from the invoices needs to be done in real time.

2.2 Data

The data for the total amounts detection task consisted of 4055 invoices from different companies. The total amounts data set was divided into 74.0% train, 24.7% validation and 1.3% test (Table 1). The test set was used as a sanity check, by manually checking the bounding boxes on the images. For the address block model a different data set was used; this data set consisted of 2535 invoices from different companies. It was divided into 78.9% train, 19.7% validation and 1.4% test (for the full data distribution per class see Table 2). Again the test set was used as a sanity check.

label\stats    amount   %
total_amount   4055     100
train          3000     74.0
val            1000     24.7
test           55       1.3

Table 1: Data distribution of the total amounts dataset.

For the object detection models, the invoices were annotated by hand using prodigy (https://prodi.gy), by drawing bounding boxes on the invoices. For the total amounts data, only one total amount label per invoice was annotated. For the address blocks detection task six labels were annotated: address block consumer (billing address), address block merchant (main merchant), address block consumer other (e.g. shipping address), address block merchant other (e.g. addresses of other offices of the same merchant), address block consumer line (address displayed on a single line) and address block merchant line (address displayed on a single line).


label\stats      amount   %
ab_other_merch   308      12.1
ab_merch         1076     42.4
ab_cons          426      16.8
ab_line_merch    673      26.6
ab_other_cons    48       1.9
ab_line_cons     4        0.2
total            2535     100
train            2000     78.9
val              500      19.7
test             35       1.4

Table 2: Data distribution of the labels in the address block dataset.

These labels were later merged into address block consumer, address block merchant and just address block. This was done to obtain several models for address block detection; also, having fewer labels often leads to an increase in performance. However, detecting the more specific classes is preferred over detecting just the generic address block. After annotating, the data was exported as a JSON file containing the label, the image and the coordinates of the bounding boxes. From that, the data was converted into the Pascal VOC (http://host.robots.ox.ac.uk/pascal/VOC/) format.

The textual data, for the text-based model, was extracted from the invoices using Google's Vision API. Then, using the coordinates of the bounding boxes from the prodigy JSON file, the values of the labels were extracted.
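
As an illustration of the prodigy-to-VOC conversion step described above, here is a minimal sketch; the export field names ("image", "spans", "label", x/y/width/height) and file names are assumptions about the export schema, not the exact format used.

    # Hedged sketch: convert a prodigy-style bounding-box export (JSONL) to
    # Pascal VOC XML. Field names and paths are illustrative assumptions.
    import json
    import os
    import xml.etree.ElementTree as ET

    def annotation_to_voc(example: dict) -> ET.ElementTree:
        """Build a Pascal VOC annotation tree for one annotated invoice."""
        root = ET.Element("annotation")
        ET.SubElement(root, "filename").text = example["image"]
        for span in example["spans"]:
            obj = ET.SubElement(root, "object")
            ET.SubElement(obj, "name").text = span["label"]
            box = ET.SubElement(obj, "bndbox")
            # Pascal VOC stores absolute corner coordinates (xmin, ymin, xmax, ymax).
            ET.SubElement(box, "xmin").text = str(int(span["x"]))
            ET.SubElement(box, "ymin").text = str(int(span["y"]))
            ET.SubElement(box, "xmax").text = str(int(span["x"] + span["width"]))
            ET.SubElement(box, "ymax").text = str(int(span["y"] + span["height"]))
        return ET.ElementTree(root)

    os.makedirs("voc", exist_ok=True)
    with open("annotations.jsonl") as f:  # placeholder for the prodigy export
        for i, line in enumerate(f):
            annotation_to_voc(json.loads(line)).write(f"voc/{i:05d}.xml")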

2.3 Implementation

For this task the plan was to experiment with different implementations: embeddings (text-based) and object detection (image-based). For the embeddings implementation, the plan was to experiment with Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016). However, since there was not enough data to create reliable embeddings, the embeddings implementation could not work. Therefore a different approach was tried, in the form of custom Named Entity Recognition using spaCy (https://spacy.io). However, because of the variation in address blocks and especially in total amounts, this approach was also not successful. This meant the focus for this task was on the object detection approach.

There were several object detection methods that could be used for this task, such as RetinaNet (Lin et al., 2017), Single Shot MultiBox Detector (SSD) (Liu et al., 2016), Fast R-CNN (Girshick, 2015) and You Only Look Once (YOLO) (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018). To determine which method to use, we first have to determine the priorities for this task. Ideally the method is fast, both in training a model and especially at real-time inference, highly accurate, and easy to work with and adapt to the needs of the task. Since the goal is to run the detection live, as in an app, real-time speed is the most important. Ideally, to know exactly which of these methods fits the needs of the task best, we would test a base model for each of them. However, due to the training time required to create such a model for each method and the lack of resources (i.e. no multiple GPUs available), this was not feasible. Therefore I resorted to choosing the method based on the papers. From the papers, YOLO seems to be the fastest one in real time (Figure 1). Therefore, I decided to use YOLO as the object detection method.



Figure 1: Redmon and Farhadi (2018) adapted this figure from Lin et al. (2017). It shows YOLOv3 running significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X.

You Only Look Once (YOLO, https://pjreddie.com/darknet/yolo/) is a state-of-the-art, real-time object detection system. YOLO (Redmon and Farhadi, 2018) is fast, accurate and allows an easy trade-off between speed and accuracy simply by changing the size of the model, without retraining. Prior detection systems repurpose classifiers or localizers to perform detection: they apply the model to an image at multiple locations and scales, and high-scoring regions of the image are considered detections. Redmon and Farhadi (2018) use a different approach, applying a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region; these bounding boxes are weighted by the predicted probabilities. Their model has several advantages over classifier-based systems: it looks at the whole image at test time, so its predictions are informed by global context in the image, and it makes predictions with a single network evaluation, unlike systems such as R-CNN (https://github.com/rbgirshick/rcnn), which require thousands of evaluations for a single image. This makes YOLO significantly faster than both R-CNN and Fast R-CNN (https://github.com/rbgirshick/fast-rcnn) (Girshick, 2015).
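
As a concrete illustration of the real-time usage, below is a minimal sketch that runs a trained darknet YOLOv3 model through OpenCV's dnn module; the file names, the 608x608 input size and the 0.5 confidence cut-off are illustrative assumptions, not Klippa's production pipeline.

    # Hedged sketch: single-image YOLOv3 inference via OpenCV's dnn module.
    import cv2
    import numpy as np

    net = cv2.dnn.readNetFromDarknet("yolov3-invoices.cfg", "yolov3-invoices.weights")
    layer_names = net.getUnconnectedOutLayersNames()  # the three YOLO output layers

    image = cv2.imread("invoice.jpg")
    h, w = image.shape[:2]

    # YOLO expects a square, normalized blob; 608x608 matches the trained network size.
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (608, 608), swapRB=True, crop=False)
    net.setInput(blob)

    boxes, confidences = [], []
    for output in net.forward(layer_names):
        for detection in output:
            confidence = float(detection[5:].max())
            if confidence > 0.5:
                # Predictions are center-relative; convert to top-left pixel coordinates.
                cx, cy, bw, bh = detection[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)

    # Non-maximum suppression removes overlapping duplicates of the same object.
    keep = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
    for i in np.array(keep).flatten():
        print("detection:", boxes[i], "confidence:", confidences[i])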

2.4 Results

Since YOLO was selected as the object detection algorithm, the next step is to determine the model. This involves determining which YOLO version to use, as well as several parameters within the model.
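
All the experiment scores below are mAP at 0.50 IoU: a predicted box counts as a true positive only if it overlaps an annotated box by at least 50%. A minimal sketch of that overlap test (boxes and values are illustrative):

    # Minimal sketch of the IoU check behind mAP@0.50 IoU.
    def iou(box_a, box_b):
        """Boxes are (xmin, ymin, xmax, ymax) in pixels."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # Example: a detected total_amount box vs. the annotated one (made-up numbers).
    print(iou((100, 200, 300, 240), (110, 205, 310, 245)) >= 0.5)  # True positive?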

The first experiments were done to determine which YOLO version to use. Table 3 gives an overview of the performance of the four YOLO versions considered for this task. From that table we can see that the yolov3 model performs best based on mAP; therefore YOLOv3 was used for this task. As a side note, the results suggest something is wrong with the yolov2-tiny model, but the cause could not be determined, since the preparatory steps for using this model had been carried out. Most likely this would not make a difference, since a tiny model is a smaller, and therefore less accurate, version of the full model and can be expected to perform worse. The second experiments were done to determine the image size (width, height) parameter; this parameter affects training time, performance and batch size (the latter mostly determined by GPU VRAM). Table 4 shows the results of the second experiment.



model\stats        mAP     time (hours)   size (MiB)
yolov3-5200        0.75*   6              2487
yolov2-5200        0.68    6              1273
yolov3-tiny-5200   0.54    2              331
yolov2-tiny-5200   0.05    2              337

Table 3: YOLO model overview, using the out-of-the-box parameters, changing only max_batches and steps and, according to the number of classes, the filters and classes parameters. The highest score, based on mAP (mean Average Precision) @0.50 IoU (Intersection over Union), is marked with *. Times from an RTX 2070 Super.

model\stats   mAP     time (hours)   time per batch (seconds)
yolov3-608    0.77*   11.5           7.9
yolov3-448    0.73    6.5            4.6
yolov3-416    0.70    6              4
yolov3-256    0.56    3              1.9
yolov3-128    0.20    2.5            1

Table 4: YOLOv3 model overview, using different image sizes, with random set to 0 to disable variation in image size during training. The highest score, based on mAP @0.50 IoU, is marked with *. Times from an RTX 2070 Super.

From these results we can see that the yolov3-608 model performs best based on mAP, although it takes the longest to train. Because training time was less of an issue and a larger input size has no negative effect on real-time detection, the image size (width, height) parameter for the final model was set to 608. Next, the random parameter was determined. This parameter affects the image size during training: random=0 means no variation in width and height, while random=1 means the image size is varied randomly, either one size down from the width and height parameter or one size up (depending on which YOLO fork is used). Table 5 shows the results of the random parameter experiment.

model\stats   mAP %
yolov3-0      76.81*
yolov3-1      74.49

Table 5: YOLOv3 model overview, using different random values, setting width and height to 608 following the results from Table 4 and the threshold to 0.7 following the results from Table 7. The highest score, based on mAP @0.50 IoU, is marked with *.

Based on these results the random parameter was set to 0, since it was the better performing value of the two, based on mAP. Then the max_batches parameter (the equivalent of the epochs parameter in other deep-learning algorithms) was experimented with. The max_batches parameter affects the training time. Table 6 shows the results of the max_batches experiment.

model\stats    time (hours)   mAP %
yolov3-5200    11.5           76.81
yolov3-15200   30.5           77.83*
yolov3-25200   51             76.06
yolov3-55200   112            77.81

Table 6: YOLOv3 model overview, using different max_batches values, setting width and height to 608 following the results from Table 4 and the threshold to 0.7 following the results from Table 7. The highest score, based on mAP @0.50 IoU, is marked with *. Times from an RTX 2070 Super.

From these results we can conclude that for the total amounts model a higher max_batches value does not mean better performance; 15200 seems to be the sweet spot. The last experiment for the total amounts model was to determine the ignore_thresh parameter. Table 7 shows the results of this experiment.

model\stats   mAP %
yolov3-.7     77.83
yolov3-.6     78.23
yolov3-.5     80.00*

Table 7: YOLOv3 model overview, using different ignore_thresh values, setting width and height to 608 following the results from Table 4. The highest score, based on mAP @0.50 IoU, is marked with *.

Based on these results we can see that the .5 model performs best, based on mAP; therefore .5 was used for the ignore_thresh parameter. The final model for total amount detection used: yolov3; width, height = 608; random = 0; max_batches = 15200 with steps = 12000, 14000 (steps was always set to rounded values of roughly 80% and 90% of max_batches, respectively); and ignore_thresh = .5.
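
For reference, these final parameters map onto the darknet configuration file roughly as in the excerpt below; this is an illustrative sketch, not the exact production .cfg, and every omitted line keeps the darknet default.

    # Hypothetical excerpt of the final total-amount .cfg (illustrative only).
    [net]
    width=608
    height=608
    max_batches=15200
    policy=steps
    steps=12000,14000      # roughly 80% and 90% of max_batches

    # before each of the three [yolo] layers:
    [convolutional]
    filters=18             # (classes + 5) * 3, with classes=1

    [yolo]
    classes=1              # single label: total_amount
    ignore_thresh=.5
    random=0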

For the address blocks model, the approach was a little different from the total amounts approach, because the address block labels have more variation. The address block labels consist of: consumer (billing address), merchant (main seller), consumer other (shipping address), merchant other (other sellers or other address blocks related to the merchant, i.e. address blocks for different offices), consumer line (an address block on a single line instead of a classic block) and merchant line (likewise, on a single line). Therefore we decided to split these into different groups, each targeting a different goal: 6 classes, meaning all the classes are distinct, aimed at detecting not only the address blocks but also the right class (consumer or merchant) and subclass (main, other, line); 2 classes, where only consumer and merchant are distinct, aimed at detecting address blocks and distinguishing consumer from merchant; and 1 class, where all address blocks are treated the same, aimed at just recognizing address blocks (see the grouping sketch below). Table 8 shows the results of the different groups experiment.
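
A hypothetical sketch of this grouping, mapping the six annotated labels (named as in Table 2) onto the 2-class and 1-class targets:

    # Hypothetical grouping of the six annotated labels into the 2-class and
    # 1-class targets; label strings mirror Table 2, the mapping itself is ours.
    TWO_CLASS = {
        "ab_cons": "consumer",
        "ab_other_cons": "consumer",
        "ab_line_cons": "consumer",
        "ab_merch": "merchant",
        "ab_other_merch": "merchant",
        "ab_line_merch": "merchant",
    }
    # For the 1-class model every label collapses to a single address_block class.
    ONE_CLASS = {label: "address_block" for label in TWO_CLASS}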

model\stats       mAP %
yolov3-6classes   42.13
yolov3-2classes   59.25
yolov3-1class     70.25

Table 8: YOLOv3 address blocks model overview, using different numbers of classes, setting width and height to 608, the threshold to 0.7, random to 0 and max_batches to 15200.

Based on these results we can see that more classes have a negative effect on performance, as expected.

The decrease in performance for the 6 classes model can be explained by the number of classes. Because of the number of classes and the data not being evenly distributed per class, certain classes perform a lot worse than others. To examine the decrease in performance, we look at the performance on the consumer and merchant classes separately. Table 9 shows the differences between the consumer and merchant classes for the address blocks model.

class\stats   ap %    TP    FP   total
consumer      47.46   44    24   68
merchant      71.04   298   61   359
total         n/a     342   85   427

Table 9: YOLOv3 address blocks model overview, determining the differences between the consumer and merchant classes.



Based on these results we can see that the performance on the consumer class is considerably lower, which can be explained by the consumer class being a lot less present in the data than the merchant class. To try to improve on this, we can duplicate consumer class data to make the data set more balanced and then train the model again. However, due to time constraints this could not be tested during the internship; it will be done afterwards. A separate model was trained on only the address block merchant class, to get the best performing model. This model scored 73.25% mAP.
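
A minimal sketch of that proposed rebalancing, assuming a JSONL training manifest and the Table 2 label names (both illustrative):

    # Hedged sketch: oversample the under-represented consumer class by
    # duplicating examples until it roughly matches the merchant count.
    import json
    import random

    with open("address_blocks_train.jsonl") as f:  # placeholder file name
        examples = [json.loads(line) for line in f]

    def has_label(example: dict, label: str) -> bool:
        return any(span["label"] == label for span in example["spans"])

    consumer = [ex for ex in examples if has_label(ex, "ab_cons")]
    merchant = [ex for ex in examples if has_label(ex, "ab_merch")]

    # Duplicate consumer examples (sampling with replacement), then shuffle.
    balanced = examples + random.choices(consumer, k=max(0, len(merchant) - len(consumer)))
    random.shuffle(balanced)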

2.5 Conclusion

The goal of this internship was to develop a machine-learning system (vision- and text-based) to detect total amounts and address blocks on invoices. The text-based model unfortunately was not successful; the object detection models, however, did show promising results.

As for the object detection results, the total amounts model performs well at 80% mAP. Training the model takes about 12 hours. As Klippa gets more data, this model can be improved upon. This does mean that, as they get more and different invoices, the data set grows and a new model needs to be trained every once in a while; however, this could be automated, and since they have the hardware for it, I do not expect this to be a problem. The address block detection task currently stands at 59.25% mAP for the 2 classes (consumer, merchant) model, 70.25% mAP for the 1 class (address block) model and 73.25% mAP for the 1 class (merchant) model. Due to time constraints these models could not be improved further during the internship. As stated earlier, this will be worked on further after the internship.

At run time, once the model is loaded, the models take about 350 milliseconds per image to perform the object detection, which is fast enough to be eligible for implementation.

2.6 Future work

For future work, different object detection methods such as Mask R-CNN (He et al., 2017), RetinaNet (Lin et al., 2017), SSD (Liu et al., 2016) and Fast R-CNN (Girshick, 2015) could be experimented with, to compare results. Also, once there is enough data available, embeddings could be experimented with, in order to have diverse machine-learning approaches.


3 EVALUATION

All in all I liked my time at Klippa and felt like a full member of the staff. As expected, at the beginning of the internship there were some things I needed to get used to: the rhythm of a nine-to-five job, and the work space. All employees work together in one open work space, so at times it could get noisy, which could be distracting; noise-cancelling headphones are recommended in this case. After some time, however, I got used to the open work space and found the noise less distracting.

The supervision at Klippa was good. I worked on the task independently for the most part and did not need much supervision, but if needed, my supervisor and other employees were always available to help me. The weekly meetings with the OCR team, where we discussed what each of us was working on and gave input or helped each other where needed, helped me settle in at the beginning. My supervisor also checked on me regularly during the internship in case I needed help.

As learning outcomes of the internship, I most of all wanted to know how the skills learned during my studies would transfer to professional practice. Furthermore, I wanted to gain experience in using machine-learning algorithms in a professional setting. I also learned about object detection methods, since I had no experience with them and was curious what was possible with these deep-learning methods. Working with a different sort of data, in the form of invoices, was another learning outcome, as was gaining experience working in a business environment.

The internship was complementary to the Information Science master. Like the master, the task during the internship focused mostly on machine-learning, the only difference being that the machine-learning was done not on textual data but on images.


4 CONCLUSION

The internship was very educative and gave me a good understanding of my possibilities in professional practice. I learned a lot and came across new challenges. I experienced working in a business setting and the work flow of an IT company. Regarding the task, developing a machine-learning system (vision- and text-based) to detect total amounts and address blocks on invoices: two models for object detection were created, but no text-based models. Both object detection models will need a reiteration in the future, since Klippa is continuously retrieving more, and more diverse, data. To conclude, I liked working at Klippa and ended up with a part-time job offer from them. Furthermore, I recommend Klippa as a master placement.


BIBLIOGRAPHY

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448.

He, K., G. Gkioxari, P. Dollár, and R. Girshick (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.

Lin, T.-Y., P. Goyal, R. Girshick, K. He, and P. Dollár (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.

Liu, W., D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg (2016). SSD: Single shot multibox detector. In European conference on computer vision, pp. 21–37. Springer.

Mikolov, T., W.-t. Yih, and G. Zweig (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751.

Pennington, J., R. Socher, and C. D. Manning (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.

Redmon, J., S. Divvala, R. Girshick, and A. Farhadi (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.

Redmon, J. and A. Farhadi (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271.

Redmon, J. and A. Farhadi (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.


5 APPENDIX

5.1 Log

Week 37: Introduction week; getting to know the team
Week 38: Researching possible machine-learning algorithms; getting to know the team
Week 39: Exploring embeddings; exploring custom Named Entity Recognition; getting to know the team
Week 40: Exploring YOLOv3; getting to know the team; annotating invoices using prodigy
Week 41: Setting up YOLOv3 model for total amount detection; annotating invoices using prodigy; 1% October
Week 42: Exploring different YOLOv3 parameters: width, height; 1% October
Week 43: Exploring different YOLOv3 parameters: max batches; 1% October
Week 44: Exploring different YOLOv3 parameters: ignore threshold; 1% October: created an agenda to schedule server time for the GPU
Week 45: Setting up the final YOLOv3 model for total amount detection; 1% November
Week 46: Annotating invoices for the address block detection model using prodigy; 1% November
Week 47: Exploring YOLOv3 address block model: 6 classes; 1% November
Week 48: Exploring YOLOv3 address block model: 2 classes; 1% November: created a guide to set up YOLOv3 for co-workers
Week 49: Exploring YOLOv3 address block model: 1 class; 1% December
Week 50: Setting up final YOLOv3 model for address block detection; 1% December
Week 51: Setting up final YOLOv3 model for address block detection; 1% December; writing placement report
Week 52: Writing placement report; 1% December: created a data distribution script for prodigy data using apexcharts
Week 1: Writing placement report; helping with training another YOLOv3 model for the detection of line items
Week 2: Writing final placement report; helping with training another YOLOv3 model for the detection of line items
