
University of Groningen

Information Science

Master’s placement Klippa App B.V.

Author:

T.A. Horstman s2569205

Supervising lecturer:

prof. dr. G.J.M. van Noord

Supervisor placement provider:

K.J. Boskma

March 29, 2019


Contents

1 Introduction

2 Klippa App B.V.

3 Assignment & Goal

4 Description of activities
  4.1 Prodigy
  4.2 Output and pre-processing
  4.3 Inter-annotator agreement & Baseline
  4.4 Google Vision OCR
  4.5 Machine Learning
  4.6 Gensim
  4.7 Results

5 Evaluation
  5.1 Placement evaluation
  5.2 Acquired knowledge
  5.3 Conclusions

Appendices
  A Prodigy annotation interface


1 Introduction

While looking for a place to fulfill my master's placement, I came in contact with Klippa. At that time, working on projects that involved machine learning had my specific interest. Getting in touch with Klippa was easy, as some co-founders are former Information Science students. Together, we put together an interesting assignment that is in line with the contents of the RUG Information Science master's degree.

2 Klippa App B.V.

Klippa was founded in 2014 and aims at digitizing paper processes with modern technologies. With Klippa [1], you are able to keep track of your expenses and you'll never lose a receipt again. Organizations can use the Optical Character Recognition (OCR) software and the Klippa Expense Manager, which provides a digital authorization flow and integration with accounting systems. The Klippa environment is available via a web app and mobile apps for Android & iOS.

At the time of writing, Klippa has 14 employees, a few of whom work only part time; the resulting FTE is around 8 or 9. Some other cool projects the Klippa team works on are accommodated in the subsidiary companies Digibon and RCPTS. Digibon provides retailers with a hardware device that is placed between the cash register and the receipt printer. A QR code is injected into every receipt so customers can download a digital copy, and shopkeepers get additional insights into product sales on a dashboard. RCPTS works on receipt-based customer profiling for products and brands. For example, a current project involves a collaboration with Coca-Cola where customers can scan a receipt, upload it to a Facebook bot and receive an automatic Tikkie cashback if an eligible product is found on the receipt.

[1] https://www.klippa.com/

3 Assignment & Goal

A part of what the Klippa software is able to do consists of extracting and automatically recognizing information from receipts. Using OCR software, text is extracted from images. Merchant information, date of purchase, product information, prices and VAT information are important items that need to be recognized, so declarations and invoices can be processed in accounting systems. Currently, this is implemented mostly rule-based. However, no receipt is the same, so recognition isn't perfect. My task is to implement a machine learning model that classifies each line on a receipt, so the extraction rules can be applied to more specific sections of a receipt.

4 Description of activities

4.1 Prodigy

To obtain annotated data, I used Prodigy. Prodigy is a web-based application by Explosion AI [2] that makes annotation easier and less time-consuming. It supports both textual and image data. For textual data (such as Named Entity Recognition, POS tagging and text classification), active learning is implemented: the model learns from every annotation and gets better on the go. Active learning is not supported for images, however, so we can only use Prodigy's annotation capabilities. On top of that, we build our own machine learning program that can read Prodigy's annotations.

[2] https://prodi.gy/

Together, we decided on how a receipt could be classified in a useful way. It would be interesting to determine which part of a receipt belongs to the header, which part is the body and which part can be seen as the footer. More specifically, we defined five labels:

1. merchant info: name and/or logo of the merchant, plus address information, including phone and website.

2. line items: products and corresponding prices. Table headers or (dashed) lines are included. If a subtotal or discount is part of the line items 'block', we also include it.

3. total amount: the line(s) that state the total amount of the receipt. Sometimes, a total amount occurs multiple times on a receipt.

4. payment info: all payment information. Includes payment method, change, terminal number, transaction id, card number etc.

5. vat: VAT information. Mostly some kind of table, but it could also be a single line or completely absent.

Lines that do not belong to any of these classes are assigned a sixth label, no label.

An example of what the Prodigy interface looks like and what annotators were able to do is attached in Appendix A.

A small part of the receipts customers uploaded in October 2018 was loaded into Prodigy. A local web server was set up so everybody could help annotate some receipts. Rules on how labels are defined and how receipts should be annotated were explained beforehand.

Figure 1: Prodigy annotated receipt example.

After a while, the Klippa team had annotated around 750 receipts, of which approx. 550 were usable (i.e. readable, no invoice and no non-western text). Later, another set from January 2019 was added to create a more extensive dataset. See Figure 1 for an example of an annotated receipt.

4.2 Output and pre-processing

Annotations in Prodigy can be exported as a single JSON Lines (JSONL) file. The most important contents of the JSONL file are the image info, metadata, annotated region coordinates and labels. We use these coordinates to assign each line one of the earlier mentioned labels. Lines that fall outside any annotated region are assigned no label.
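A minimal sketch of this mapping step in Python (the `spans`, `label` and `points` fields follow Prodigy's image output; the helper names and the bounding-box simplification are our own):

```python
import json

def region_contains(points, x, y):
    # Bounding-box test: the annotated regions are rectangles given as
    # a list of [x, y] corner points.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)

def label_for_line(annotation, line_center):
    # A line gets the label of the first region containing its center,
    # or "no label" if it falls outside every annotated region.
    x, y = line_center
    for span in annotation.get("spans", []):
        if region_contains(span["points"], x, y):
            return span["label"]
    return "no label"

# One simplified JSONL record as exported by Prodigy.
record = json.loads(
    '{"image": "receipt_001.jpg", "spans": [{"label": "merchant info", '
    '"points": [[0, 0], [400, 0], [400, 80], [0, 80]]}]}'
)
print(label_for_line(record, (200, 40)))   # inside the region
print(label_for_line(record, (200, 500)))  # outside every region
```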

Inspection of the data showed that not all annotators followed the very same guidelines. For example, some marked only card and transaction details as payment info, where others marked every line that has something to do with transactions. There also appeared to be some disagreement on whether 'subtotal' amounts should be labeled line items, total amount or no label.

All annotated receipts until now were checked, and we declined some receipts due to low-quality images or wrong annotations. We noticed the guidelines could use a minor revision to make the annotations more usable and consistent. Therefore, we decided to annotate the extra dataset of January 2019. This set was annotated by just two people, and uncertainties were discussed beforehand. In total, a varied dataset of 981 receipts was created. As we train on lines rather than on receipts, the resulting dataset consists of 30239 training instances.

4.3 Inter-annotator agreement & Baseline

To assess the quality of the annotations, inter-annotator agreement is calculated using Cohen's kappa (κ) coefficient [Cohen, 1968]. The coefficient results in a number between -1 and 1 and is a robust statistic for measuring the degree of agreement between annotators. According to Manning et al. [2010], a κ value of 1 indicates perfect agreement, a value of 0 means no agreement at all, and -1 indicates perfect opposite agreement.

Cohen's kappa coefficient formula reads as follows:

κ = (P_A − P_E) / (1 − P_E)    (1)

Here, P_A is the proportion of annotations on which the annotators agree, and P_E is the proportion on which they would be expected to agree by chance.

For the first set from October we don't calculate agreement, as no receipts were annotated twice; moreover, agreement would not be as high as on the second set. However, we went through all these annotations and rejected those that did not meet our strict quality requirements.

We calculated the inter-annotator agreement on all six labels on our second dataset. Of this set, around 50 receipts were annotated by both annotators. From Table 1, we calculate a raw agreement of 92.0% and κ = 0.898. This means the upper baseline for our model is 0.898 accuracy. The system uses these data for training, so achieving an accuracy greater than this number is unlikely: it would mean the model can do better than humans.

From Table 1 it can already be deduced that the labels are not evenly distributed; e.g. payment info and no label occur more often than total amount. Our final dataset consists of 981 receipts, which results in 30239 lines. The label distribution is 13.24%, 17.58%, 3.39%, 24.41%, 6.65% and 34.74% for merchant info, line items, total amount, payment info, vat


Table 1: Inter-annotator agreement between two annotators (line-based). Rows: annotator #2; columns: annotator #1.

                merchant info  line items  total amount  payment info  vat  no label  Total
merchant info             133           0             0             0    0         9    142
line items                  0         173             0             0    0         4    177
total amount                0           1            27             2    1         0     31
payment info                0           0             1           208    0         7    216
vat                         0           0             0             3   68         0     71
no label                    7           6             3            29    1       242    288
Total                     140         180            31           242   70       262    925
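The reported agreement figures can be reproduced directly from the counts in Table 1 with formula (1); a minimal sketch in plain Python:

```python
# Confusion counts from Table 1 (rows: annotator #2, columns: annotator #1),
# label order: merchant info, line items, total amount, payment info, vat, no label.
table = [
    [133,   0,  0,   0,  0,   9],
    [  0, 173,  0,   0,  0,   4],
    [  0,   1, 27,   2,  1,   0],
    [  0,   0,  1, 208,  0,   7],
    [  0,   0,  0,   3, 68,   0],
    [  7,   6,  3,  29,  1, 242],
]

n = sum(sum(row) for row in table)                     # 925 doubly annotated lines
p_a = sum(table[i][i] for i in range(6)) / n           # observed agreement
row = [sum(r) for r in table]                          # totals for annotator #2
col = [sum(r[j] for r in table) for j in range(6)]     # totals for annotator #1
p_e = sum(row[i] * col[i] for i in range(6)) / n ** 2  # chance agreement
kappa = (p_a - p_e) / (1 - p_e)

print(f"raw agreement {p_a:.3f}, kappa {kappa:.3f}")   # 0.920 and 0.898
```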

and no label, respectively. The majority baseline therefore is 34.74%. The model should outperform this baseline.

4.4 Google Vision OCR

At Klippa, the Google Cloud Vision API is used for optical character recognition. Images customers upload in the Klippa app are sent through the Vision API before being processed further. For every receipt image in our training data, we use the corresponding Google Vision output. A big JSON file contains, among other things, all recognized characters, their coordinates on the image and the confidence score. From these character coordinates we create words, and from the words we create lines. Using the Prodigy output, every line is assigned one of the six labels.
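The characters-to-lines step can be sketched as follows (a simplification with illustrative field names and tolerance; the real system also copes with skewed images):

```python
def group_into_lines(chars, y_tol=10):
    # Group recognized characters into lines by vertical position.
    # Each char is a dict with 'text', 'x' and 'y' (top-left corner);
    # characters whose y differs by less than y_tol end up on one line.
    lines = []
    for ch in sorted(chars, key=lambda c: (c["y"], c["x"])):
        if lines and abs(ch["y"] - lines[-1][-1]["y"]) < y_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, order characters left to right and join the text.
    return ["".join(c["text"] for c in sorted(line, key=lambda c: c["x"]))
            for line in lines]

chars = [
    {"text": "H", "x": 0,  "y": 2},
    {"text": "i", "x": 12, "y": 1},
    {"text": "3", "x": 0,  "y": 40},
]
print(group_into_lines(chars))  # ['Hi', '3']
```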

4.5 Machine Learning

While mapping the right label to every line, we create a new JSON file with useful information from which features can be extracted. We do so by parsing all the text on the receipts. The Klippa system is able to link words that belong to the same line together, even if there is a lot of white space between words or when the picture of the receipt is skewed. After parsing, some words in the line text are replaced by tags that are found by the system. For example, the line "2 Ice Tea Green 1,80 3,60" is replaced with "number Ice Tea Green amount amount". Besides this, we save the receipt width and height in pixels, the line vertices, and the normalized line width and height to the JSON.
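The tag substitution can be imitated with two regular expressions (a sketch; the actual Klippa parser recognizes far more patterns than plain integers and amounts):

```python
import re

def tag_line(text):
    # Replace monetary amounts (e.g. '1,80') with 'amount' first, then
    # remaining bare integers (e.g. the quantity '2') with 'number'.
    text = re.sub(r"\b\d+[.,]\d{2}\b", "amount", text)
    text = re.sub(r"\b\d+\b", "number", text)
    return text

print(tag_line("2 Ice Tea Green 1,80 3,60"))
# number Ice Tea Green amount amount
```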

The machine learning script reads the JSON and extracts or computes features. The features that made it into the final system include:

• Full text: the text on a line, where certain words are replaced by tags. All text is converted to lowercase.

• Text length: the length of the original line in characters.

• Height: the line's vertical position on the receipt. The height is a float between 0 and 1 and is computed from the receipt size and the line vertices.

• Width: the line's horizontal position on the receipt. This is computed in the same way as the line height.

• Height difference: the normalized amount of vertical white space between the line and the previous line. A big difference might indicate a new section on the receipt.

• Normalized line width: the length of a line compared to other lines on the receipt. Longer lines typically indicate e.g. line items, while shorter ones could indicate total amounts.

• Normalized line height: the height (font size) of a line compared to other lines on the receipt. Sometimes, the total amount is bigger in size than other lines.
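The positional features above can be sketched from a line's vertices and the receipt size in pixels (function names are ours; vertices are (x, y) pixel coordinates):

```python
def height_feature(vertices, receipt_height):
    # Vertical position of a line as a float in [0, 1]: the top of the
    # line's bounding box divided by the receipt height in pixels.
    top = min(y for _, y in vertices)
    return top / receipt_height

def height_difference(prev_vertices, vertices, receipt_height):
    # Normalized vertical white space between a line and the previous one.
    prev_bottom = max(y for _, y in prev_vertices)
    top = min(y for _, y in vertices)
    return max(0.0, (top - prev_bottom) / receipt_height)

# A line whose bounding box starts 150 px down on a 1000 px tall receipt.
verts = [(10, 150), (300, 150), (300, 180), (10, 180)]
print(height_feature(verts, 1000))  # 0.15
```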

All features and corresponding labels are loaded into a Pandas data frame. In a pipeline, we apply a TF-IDF vectorizer at character level with an n-gram range 1 <= n <= 3 to the full text feature. Up to now, lines are treated as more or less isolated, while context is also important. To take the line order into account, the features of the previous and next line are added as features for the current line.

We ran a grid search on a powerful server to find the most appropriate classifier and tune hyper-parameters. Logistic Regression and SGD classifiers did a good job, but an SVM with a non-linear kernel performed even better (although training took much longer with this amount of data).
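A minimal scikit-learn version of this setup, with made-up example lines and a much smaller grid than the real search, could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up tagged line texts with their labels (two examples per class so
# that 2-fold stratified cross-validation is possible on this toy data).
texts = [
    "albert heijn groningen", "kerkstraat number groningen",
    "number ice tea amount amount", "number cola amount amount",
    "total amount", "totaal amount",
    "pin payment terminal number", "cash change amount",
    "vat number% amount amount", "btw number% amount",
    "thank you, see you soon", "opening hours number-number",
]
labels = [
    "merchant info", "merchant info", "line items", "line items",
    "total amount", "total amount", "payment info", "payment info",
    "vat", "vat", "no label", "no label",
]

pipeline = Pipeline([
    # Character n-grams with 1 <= n <= 3 on the tagged line text.
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("clf", SVC()),
])

# Tiny grid over kernel and regularization strength.
grid = GridSearchCV(
    pipeline,
    {"clf__kernel": ["linear", "rbf"], "clf__C": [1, 10]},
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_)
print(grid.predict(["number fanta amount amount"]))
```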

4.6 Gensim

An attempt was made to replace the TF-IDF vectorizer with the topic modelling framework Gensim [Řehůřek and Sojka, 2010]. Gensim is designed to handle large text corpora and implements neural network based models. With Doc2Vec, we trained both a DBOW (Distributed Bag Of Words) and a DM (Distributed Memory) model with the full line text as input. With Gensim, we couldn't score higher than around 65% accuracy, while the TF-IDF vectorizer alone could go up to almost 70%. Therefore, we decided to drop Gensim and continue with our TF-IDF vectorizer.

4.7 Results

During development, we continually evaluated the model performance on a held-out part (25%) of the annotated data. Accuracy is used as evaluation metric, while learning curves, confusion matrices and classification reports show us where the model could still be improved. Final performance is reported in terms of accuracy using 5-fold cross-validation, so the full dataset could be used for creating the model.
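The 5-fold score can be obtained in a single call; a sketch with stand-in data (the real feature pipeline and the 30239 annotated lines take the place of the toy lists):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-ins for the real data; each class appears five times so that
# 5-fold stratified cross-validation is possible.
texts = ["total amount", "number cola amount", "pin payment"] * 5
labels = ["total amount", "line items", "payment info"] * 5

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)

# Mean accuracy over 5 folds: every line is used for evaluation once.
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
print(scores.mean())
```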

Figure 2: Confusion matrix. True labels are shown on the y-axis, while the model predictions are on the x-axis.

The confusion matrix (Figure 2) shows that most lines are predicted correctly. When they are predicted wrongly, it mostly makes sense, as the confused lines are somehow related to each other.


Table 2: Classification report showing precision, recall and f1-score for each label.

               precision  recall  f1-score
merchant info       0.90    0.90      0.90
line items          0.94    0.92      0.93
total amount        0.82    0.82      0.82
payment info        0.87    0.93      0.90
vat                 0.90    0.87      0.89
no label            0.88    0.85      0.86
weighted avg        0.89    0.89      0.89

For example, merchant info is predicted correctly 902 times and wrongly 10 times as line items (which most of the time come right after the merchant information). The system seems to predict relatively many lines as no label. No label is the hardest to predict, because any line that does not fit in our predefined classes belongs to it.

Another label that is difficult to predict is total amount, as can be seen in Table 2. Probably because of the lower occurrence of total amount (only 3.39%) in the training data and the possibility of multiple total amounts on a receipt, this label is harder to predict than others. The system is very good at predicting line items: many are caught and almost all are correct.

The final accuracy score is 87.8%, using 5-fold cross-validation on all available annotated data. The majority baseline of 34.74% is amply beaten, and the score comes very close to the upper baseline and theoretical maximum of 89.8%.

Adding more annotated data would increase the scores even further, as there is still an upward trend in the cross-validation score and the curves have not yet converged (see Figure 3). However, the cost, both in terms of annotating more data and computing power, is high. The model would benefit more from high-quality annotation data and clear annotation guidelines.

Figure 3: Learning curve showing increasing accuracy scores with more training examples.

The model is integrated into the Klippa OCR software as a line classifier step. In the near future, every line on a receipt that comes through the Klippa app will first be assigned a few labels. The model returns the top three predicted labels and corresponding probabilities, e.g. [(line items, 0.91), (no label, 0.07), (total amount, 0.01)]. The rules and regexes for finding, for example, VAT information or total amounts can then be applied more precisely by the software.
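The top-three step can be sketched as a small post-processing function on the classifier's probability output (names are illustrative):

```python
def top_three(class_names, probabilities):
    # Pair labels with their predicted probabilities and keep the
    # three most confident predictions, best first.
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:3]

classes = ["merchant info", "line items", "total amount",
           "payment info", "vat", "no label"]
probs = [0.00, 0.91, 0.01, 0.00, 0.01, 0.07]
print(top_three(classes, probs))
# most confident first: ('line items', 0.91), then ('no label', 0.07), ...
```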

As a visual example of what the model is able to do, see Figure 4. Here, the most confident prediction for each line is drawn back on the original picture.


Figure 4: Output of the trained model: predicted labels of a receipt are drawn back on the original image.

5 Evaluation

5.1 Placement evaluation

I enjoyed my time at Klippa. The atmosphere in the office was always good and because all employees are still in their twenties or thirties, I felt at home. In the beginning it took some time to get into the working rhythm and to get used to sitting behind a computer all day, but luckily it quickly became normal.

Supervision was thorough in the first days, as some time was needed before I knew where to start and which existing Klippa code to use. Later, I worked mostly independently, and the supervisor was always available to answer questions. The starting point of the placement was that it would be mutually beneficial, thus useful for both parties.

I also joined a colleague a few times to help install and configure multiple Digibon hardware devices between cash registers and receipt printers. We drove to different customer locations throughout the country, and it was fun to explore this part of the business activities as well.

5.2 Acquired knowledge

Before the start of the placement, I set some learning outcomes. I wanted to apply skills in practice while also learning new techniques. Since machine learning had my main interest, I wanted to develop further in this area. The master's courses Learning from Data and Language Technology Project proved to be particularly useful: concepts of machine learning, different classifiers and techniques were explained there. For my machine learning model, I worked with Python and the libraries scikit-learn, Keras and Gensim.


In addition, I worked on a small side project for a short time with the goal of learning more about the Go programming language (Golang). In this project, I built a straightforward KvK API that serves the address details for a given KvK number.

5.3 Conclusions

I can conclude that the period at Klippa was useful and educational. I have seen how things are organized in a young company, applied and improved my skills and learned new things. The machine learning model I created performs well and is a useful addition to Klippa's OCR software. It still has some limitations: it only recognizes lines on receipts (not invoices), and foreign languages are handled less well. In a possible future project, the model could be extended to become even more versatile. I would like to thank my colleagues at Klippa and hope to see them again in the future.

References

Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213, 1968.

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Natural Language Engineering, 16(1):100–103, 2010.

Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.


Appendices

A Prodigy annotation interface

Figure 5: Prodigy web interface for annotating receipts. Bounding boxes can be drawn around appropriate lines. Labels merchant info, line items, total amount, payment info and vat are available.
