University of Groningen
Information Science
Master’s placement Klippa App B.V.
Author:
T.A. Horstman s2569205
Supervising lecturer:
prof. dr. G.J.M. van Noord
Supervisor placement provider:
K.J. Boskma
March 29, 2019
Contents
1 Introduction
2 Klippa App B.V.
3 Assignment & Goal
4 Description of activities
4.1 Prodigy
4.2 Output and pre-processing
4.3 Inter-annotator agreement & Baseline
4.4 Google Vision OCR
4.5 Machine Learning
4.6 Gensim
4.7 Results
5 Evaluation
5.1 Placement evaluation
5.2 Acquired knowledge
5.3 Conclusions
Appendices
A Prodigy annotation interface
1 Introduction
While looking for a company at which to do my master’s placement, I came into contact with Klippa. At that time, I was particularly interested in projects involving machine learning. Getting in touch with Klippa was easy, as some of its co-founders are former Information Science students. Together, we formulated an interesting assignment that is in line with the contents of the RUG Information Science master’s degree.
2 Klippa App B.V.
Klippa was founded in 2014 and aims to digitize paper processes with modern technologies. With Klippa¹, you can keep track of your expenses and you’ll never lose a receipt again. Organizations can use the Optical Character Recognition (OCR) software and the Klippa Expense Manager, which provides a digital authorization flow and integration with accounting systems. The Klippa environment is available via a web app and mobile apps for Android & iOS.
At the time of writing, Klippa has 14 employees, a few of whom work part time only. The resulting FTE is around 8 or 9. Some other cool projects the Klippa team works on are accommodated in the subsidiary companies Digibon and RCPTS. Digibon provides retailers with a hardware device that is placed between the cash register and the receipt printer. A QR code is inserted into every receipt so that customers can download a digital copy. Shopkeepers get additional insights into product sales on a dashboard. RCPTS works on receipt-based customer profiling for products and brands. For example, a current project involves a collaboration with Coca-Cola in which customers can scan a receipt, upload it to a Facebook bot and receive an automatic Tikkie cashback if an eligible product is found on the receipt.
¹ https://www.klippa.com/
3 Assignment & Goal
Part of what the Klippa software does is extracting and automatically recognizing information from receipts. Using OCR software, text is extracted from images. Merchant information, date of purchase, product information, prices and VAT information are important items that need to be recognized, so that declarations and invoices can be processed in accounting systems. Currently, this is implemented mostly with rules. However, no two receipts are the same, so recognition isn’t perfect. My task is to implement a machine learning model that classifies each line on a receipt, so that the extraction rules can be applied to more specific sections of a receipt.
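To illustrate why line labels help, the following is a minimal sketch of rules that are only applied to lines carrying a matching label. The label names, the regular expressions and the sample receipt are assumptions made for this illustration; they are not Klippa's actual classes or extraction rules, and the labels are written by hand here where a trained classifier would normally supply them.

```python
import re

# Hypothetical rules: a trailing price such as "0,99" and a dd-mm-yyyy date.
PRICE_RE = re.compile(r"(\d+[.,]\d{2})\s*$")
DATE_RE = re.compile(r"\b(\d{2})[-/](\d{2})[-/](\d{4})\b")


def parse_price(text):
    """Return (start index, price as float) for a trailing price, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return match.start(), float(match.group(1).replace(",", "."))


def extract_fields(labeled_lines):
    """Apply each rule only to the lines that carry the matching label."""
    result = {"date": None, "products": [], "total": None}
    for text, label in labeled_lines:
        if label == "date" and result["date"] is None:
            match = DATE_RE.search(text)
            if match:
                result["date"] = "-".join(match.groups())
        elif label == "product":
            parsed = parse_price(text)
            if parsed:
                start, price = parsed
                result["products"].append((text[:start].strip(), price))
        elif label == "total":
            parsed = parse_price(text)
            if parsed:
                result["total"] = parsed[1]
    return result


# Hand-labeled sample receipt (labels are hypothetical).
receipt = [
    ("SUPERMARKT VOORBEELD", "merchant"),
    ("29-03-2019 14:05", "date"),
    ("Melk 1L            0,99", "product"),
    ("Brood volkoren     2,15", "product"),
    ("TOTAAL             3,14", "total"),
    ("BTW 9%             0,26", "vat"),
]
fields = extract_fields(receipt)
```

Without the labels, the price rule would also match the total and VAT lines and report them as products. Restricting each rule to its own section avoids that kind of confusion.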
4 Description of activities
4.1 Prodigy
To obtain annotated data, I used Prodigy. Prodigy is a web-based application by Explosion AI² that makes annotation easier and less time-consuming. It supports both textual and image data. For textual tasks (such as Named Entity Recognition, POS tagging and text classification), active learning is implemented: the model learns from every annotation and improves along the way. Active learning support for images is lacking, however. Therefore, we can only use the annotation capabilities of Prodigy. On top of that, we
2