Context-based Information Classification on Hungarian Invoices

Gábor Szegedi, Diána Bajdikné Veres, Imre Lendák, and Tomáš Horváth

Department of Data Science and Engineering
ELTE – Eötvös Loránd University, Faculty of Informatics
Budapest, H-1117 Budapest, Pázmány Péter sétány 1/C., Hungary
vszm@inf.elte.hu, veresdia91@gmail.com, {lendak, tomas.horvath}@inf.elte.hu

Abstract: The field of Artificial Intelligence has always striven to solve, with computers, problems that were previously solvable only by humans. An interesting current challenge is extracting information from printed documents. In this publication we focus on the sub-domain of classifying pieces of information on printed invoices. Our goal was to create a solution capable of finding information on scanned invoices without knowing the template of the invoice. The template-less design is important as invoices can have many different structures depending on the issuer. First we feed the invoice image to a commercially available Optical Character Recognition (OCR) engine, which returns the extracted texts with their bounding boxes. This information alone would not be enough, so we enrich it with feature engineering. The engineered features give information about the content of the text and the context surrounding the bounding box, as well as meta information about the entire document. The main novelty of our work comes from how we store contextual information. We then classify the text fragments into 8 categories: Invoice Number, Issue Date, Transaction Date, Due Date, Seller tax number, Customer tax number, Total gross and Other. In this paper we measure the performance of several classification algorithms on our data. We achieved a 0.93 macro average F1 measure with our best model.

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Information extraction can be a very easy or an impossibly hard problem depending on the kind of data we have. In this paper we are dealing with invoices from various sources. We have received thousands of files from our business partner OTP bank (https://www.otpbank.hu/portal/en/home) to create a solution for their problem of parsing incoming invoices. The files fall into 3 main formats:

1. Photographed image of a printed invoice
2. Scanned image of a printed invoice
3. Text based PDF invoice

Figure 1: A sample printed invoice that was photographed using a mobile phone.

The problem with the dataset is that we can't use a finite set of templates. The law does not constrain the format of invoices and thus each issuer has its own template for generating invoices. Even if we could manage to create a template for each of the issuers, the number of issuers is growing.

Because of these conditions we have developed a solution that is generic and free of any assumptions about the structure of the input document.

1.1 Related Work

The main approaches to invoice recognition are template based, graph Convolutional Neural Network (CNN) based and direction based.

Information Extraction (IE) from structured documents has been a studied problem for decades now. One of the first works was The generic information extraction system by Jerry R. Hobbs [1]. In his work he laid out an architecture for creating systems capable of extracting information from text using a rule based, layer-by-layer approach.

Parsing invoice data is a sub-domain of IE. When the input data consists of a finite set of known layouts, we can simply extract information based on the known positions of the key pieces. Such a problem does not require sophisticated Machine Learning (ML) solutions, but rather a simple algorithmic approach.

The problem gets significantly harder if we do not know the document templates at training time. In such a case we need to create a generic solution that is smart enough to extract information from unknown document layouts. This field has been studied as well in recent years.

A very recent study entitled An Invoice Reading System Using a Graph Convolutional Network was published in 2019, in which the authors turned the output of an Optical Character Recognition (OCR) system into a graph [2]. The graph's nodes contained the text outputs from the OCR and the edges pointed to nearby nodes based on the location information given by the OCR output. The nodes were then fed to a graph based CNN to classify them into entities of interest.

CloudScan – A configuration-free invoice analysis system using recurrent neural networks is another article, from 2017, in which the authors created a fully fledged end-to-end pipeline for parsing invoices [3]. They get the text from OCR and create N-grams of words in the same line up to a length of 4. They enrich their entities with features based on the properties of the piece of text in the N-gram. They add contextual features based on the closest 4 entities above, below, left and right of the current text. The classifier they use is a Long Short-Term Memory (LSTM) Recurrent Neural Network [4].

2 Scope

Our goal was to find the most relevant information needed by our industry partner. The scope of our work was to extract the following information from the given invoices:

• Invoice Number: The unique identifier of the invoice;
• Issue Date: The date the invoice was issued;
• Transaction Date: The date of reconciliation;
• Due Date: The date by which the invoice has to be reconciled;
• Seller tax number: The tax number of the seller party;
• Customer tax number: The tax number of the customer party (this is only included if the customer is a company, and not an individual);
• Total gross: The total amount to be paid by the customer party.

The invoices considered are all printed invoices; no handwritten invoices were used as those are being phased out. The invoices can be image based or text based, e.g. in Portable Document Format (PDF).

3 Our approach

In any ML problem the biggest dividing factor is how the data is represented. In our case the data, given as images of invoices, has a very high level representation. We didn't find it feasible to train a complex Neural Network that operates directly on the image data for the following reasons:

1. The dimensionality of the images is large and we can't shrink them without losing the small but essential textual information.
2. Training times without dimensionality reduction would be too long even on specialized hardware.
3. There are invoices that come in textual PDF format.

For these 3 reasons we chose to simplify the high level representation of the invoice data. We began by breaking down the invoice into a list of Fragments. A Fragment is a piece of text with its location (X, Y, page) and dimensions (width, height).

Creating the list of Fragments is a rather simple task. For text PDF invoices the representation is basically the same as the one we desire. We just need to use a Python API to get the text with the bounding boxes from each page of the PDF (we used the pdfminer.six API, https://pypi.org/project/pdfminer.six/). Converting image based invoices to Fragments is also straightforward as we can use an OCR API which returns a list of texts with their bounding boxes. We have integrated our pipeline with Tesseract (https://github.com/tesseract-ocr/tesseract), Azure OCR (https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text) and OCR Space (https://ocr.space/). We found Tesseract to be quite inaccurate compared to the other two, possibly because our invoices are in Hungarian and not in English. In the end we used OCR Space.

Once we have the list of Fragments, the task is finding the relevant ones and extracting the information out of the text. This can be formulated as a classification task, where each Fragment is a separate input to the classifier and the model assigns a label denoting what kind of information is in the Fragment. For each label type we select the Fragment with the highest probability as the Fragment containing the given piece of information.

3.1 Feature Engineering

It should be possible to build a model to which we feed the entire Fragment list as input and which returns the most probable Fragment for each label. We could use Recurrent Neural Networks and consider the list of Fragments a sequence of inputs. However, this would make the problem very hard.

Instead, we have chosen to do extensive feature engineering to add relevant information to each Fragment and to classify each Fragment individually. In total we have added 104 features to the existing 6 (text, X, Y, width, height, page), which is the main added value of our work. We can group these engineered features into 4 groups, which we describe in detail below.

Positional features
We have added several features that describe the position of the Fragment relative to the document. We intuitively found this to be very important. For example, the invoice number is usually at the top of the first page of a document. To describe this we first needed a feature storing which page the Fragment is part of. Then we need features describing the X and Y coordinates of the Fragment on the given page. Lastly, we have added features describing the dimensions of the Fragment.

Document features
Another set of features describes the entire document. Knowing the positional attributes of a Fragment is not meaningful without knowing the dimensions of the document. Thus we have added features for the document width and height, their ratio and the number of pages. We also added features describing where the outermost Fragments are on the current page and in the entire document.

Text features
In ML, when dealing with text, the problem arises from the fact that our models are only capable of working with numbers. The text always has to be encoded into a finite set of features, so that is what we did for our Fragments as well.

One obvious feature that we can always use to quantify text is its length, which we added to our features.

Another set of features we used were counters of special characters in the Fragment. For example, the count of whitespaces is an important feature telling us how many words there are in the Fragment. Counters of digits and alphabetical characters are also good indicators of what kind of text we are dealing with. The count of slashes and dashes could imply that the current Fragment contains a bank account number or an invoice number. We could add many more special characters depending on the data we have.

Lastly, we have added a few Boolean features derived from the text based on whether the text matches a given regular expression or not. We created regular expression features for matching date, tax number, currency and decimal point.

Contextual features
The feature categories we discussed so far are standard features we have seen in other related works as well [3] [5] [2]. These features in themselves wouldn't be enough to do labeling effectively; we need the context of the Fragments as well. We have also seen the idea of adding context information to the Fragments in other works, but there is no consensus on the best way to do this.

In [3], the authors give context information by adding the features of the nearest neighboring Fragments in the 4 general directions (Up, Down, Left, Right). In [2], a very similar method is used as the graph nodes connect to the 4 general directions as well. In [6], the authors took a different approach by storing the polar coordinates of each fragment relative to the current fragment.

We take a new approach in capturing the context of the Fragment. We have identified 38 keywords in invoices that are indicative of important information in the vicinity. For each Fragment we search for the nearest occurrence of each of the keywords. We calculate the distance from the Fragment to the keyword and store the normalized x and y values as features for each of the keywords. After doing this for each keyword we get a complex web over our invoice document. We can see this mechanism in Figure 2 for a single Fragment of a sample invoice.

Figure 2: A sample invoice where the Fragment in the green bounding box is the tax number of the seller and the closest keywords are highlighted with blue bounding boxes. Note that some of the keywords occur multiple times on the invoice but we chose the closest one for the current Fragment.

Using such keywords comes naturally to us humans when we interpret an invoice. For example, if we take a look at an invoice and see 2 tax numbers, we can decide which one belongs to the seller by checking which one is closer to the keyword Seller or Provider (we use Hungarian keywords only, as the dataset is Hungarian).

There are 2 special cases to note when pointing to the nearest keyword. The first one is when the current Fragment contains one of the keywords. In that case we chose to set the value of the x and y features to zero. The second special case is when a certain keyword is not found on the invoice. In this case we assign −1 to the x and y features of the keyword. Our approach is different from the related works because we neither restrict the context to the 4 nearest Fragments in the general directions, nor do we include all of their features. The contextual information is relevant because of the keywords these vectors are pointing towards.

4 Classification

After preparing the data we had to train a classifier for labeling the Fragments. During training and evaluation we had to be cautious as the dataset is very unbalanced: on each document there are hundreds of text Fragments but we are only interested in 7 of them. See Figure 3 for the distribution of the 8 class labels.

Figure 3: Class distribution of Fragments on one of our test sets.

As 93% of this data was labeled as Other, a classifier that blindly assigns the Other label to any input would achieve an overall accuracy of 93%. This is of course not good, as our focus is on the classes with low cardinality. To overcome this we use the macro average of the F1 scores of the classes to compare the classifiers.

We have chosen 3 classifiers for comparison: Decision Tree Classifier, Random Forest Classifier and Extreme Gradient Boosting Classifier [7]. For the Decision Tree and Random Forest models we used the implementations from scikit-learn.

5 Experiments

The data we received consisted of thousands of invoices from different issuers in various file formats. About half of them were electronic invoices in textual PDFs, while the other half were images, either photographs or scans of printed invoices.

First we had to manually annotate the ground truth about the invoices. For each invoice we stored the relevant information pieces that were in scope.

After that we ran the OCR engine on all of the image based invoices. We then filtered out those image based invoices where the OCR output did not contain at least the invoice number and the seller tax number. We did this because optimizing the OCR engine is out of our scope, and this way we eliminated any errors due to image quality or bugs in the OCR engine used. We had 2000 invoices that met this criterion.

Lastly, we ran the algorithm to convert the filtered invoices into Fragments. We manually labeled the Fragments to eliminate any OCR issues or data format differences there may be. This was the data that we could build our models upon.

We split the Fragments into train, validation and test sets, as is usual in the field. The ratios used were 60% for training, 20% for validation (tuning the hyper-parameters) and 20% for comparing the trained and optimized models. The results are shown in Table 1. Unsurprisingly, the XGBoost Classifier performs the best out of these 3, but even the Decision Tree and Random Forest models are very close to the Gradient Boosting model. XGBoost outperforms the other 2 on all of the labels and in the overall macro average as well. The Decision Tree and Random Forest models fall behind most noticeably on the Total gross label.

Label                  Decision Tree   Random Forest   XGBoost
Other                  0.99            0.99            1.00
Invoice number         0.81            0.86            0.89
Issue date             0.88            0.89            0.93
Transaction date       0.87            0.89            0.94
Due date               0.88            0.93            0.96
Seller tax number      0.89            0.91            0.92
Customer tax number    0.92            0.93            0.94
Total gross            0.71            0.75            0.87
Macro Average          0.8685          0.8952          0.9297

Table 1: Comparison of models after hyperparameter optimization. The numbers are F1 scores.

6 Conclusion and future work

In this work we presented a method for identifying key information on invoices. We rely on an external Optical Character Recognition engine, which is used to break down the image based invoice into text pieces with positional information, called Fragments. We did extensive feature engineering so that the Fragments contain features regarding the properties of the document, the contained text and the context of the text as well.

The main added value of our work comes from how we add contextual information to the Fragments. It works by pointing vectors from the Fragment to the nearest occurrence of each keyword. This gives our method the power to generalize so that, unlike some of the previous works, it can perform well on previously unseen invoices.

However, our work does not stop here. We plan on creating an end-to-end solution for extracting the information from invoices. We have merely identified the Fragments correctly, but extracting the information from the selected Fragments still holds challenges.

As our main focus was on feature engineering, there can still be room for improvement on the modeling side. We are planning on experimenting with Neural Network models in the future to see if we can improve our accuracy.

We also plan on embedding the system in a service, where the user sends in the invoice and our system returns either the extracted information or a new version of the invoice image where the key information points are highlighted, as in Figure 4.

Figure 4: The key information pieces our system can identify from the Fragments.

Acknowledgements

The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

Supported by Telekom Innovation Laboratories (T-Labs), the Research and Development unit of Deutsche Telekom and EIT Digital.

References

[1] J. R. Hobbs, “The generic information extraction system,” in Fifth Message Understanding Conference (MUC-5): Proceedings of a Conference Held in Baltimore, Maryland, August 25-27, 1993, 1993.
[2] D. Lohani, A. Belaïd, and Y. Belaïd, “An invoice reading system using a graph convolutional network,” in Asian Conference on Computer Vision. Springer, 2018, pp. 144–158.
[3] R. B. Palm, O. Winther, and F. Laws, “CloudScan - a configuration-free invoice analysis system using recurrent neural networks,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 406–413.
[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] H. T. Ha, Z. Nevěřilová, A. Horák et al., “Recognition of OCR invoice metadata block types,” in International Conference on Text, Speech, and Dialogue. Springer, 2018, pp. 304–312.
[6] M. Rusiñol, T. Benkhelfallah, and V. Poulain d'Andecy, “Field extraction from administrative documents by incremental structural templates,” in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1100–1104.
[7] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
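As an illustration of the text features described in Section 3.1 (length, character counters and regular expression flags), the following Python sketch may help. It is not the implementation used in the paper: the feature names and all four regular expressions are hypothetical stand-ins, since the paper does not list the patterns it used. The tax number pattern assumes the common Hungarian layout of 8 digits, 1 digit and 2 digits separated by dashes.

```python
import re
from typing import Dict

# Hypothetical patterns; the paper only states that regular expressions were
# used for date, tax number, currency and decimal point.
DATE_RE = re.compile(r"\b\d{4}[.\-/]\s?\d{1,2}[.\-/]\s?\d{1,2}\.?")
TAX_NUMBER_RE = re.compile(r"\b\d{8}-\d-\d{2}\b")  # assumed Hungarian layout
CURRENCY_RE = re.compile(r"\b(?:Ft|HUF|EUR)\b", re.IGNORECASE)
DECIMAL_RE = re.compile(r"\d[.,]\d")


def text_features(text: str) -> Dict[str, int]:
    """Encode the text of a single Fragment as a fixed set of numbers."""
    return {
        # Length of the text.
        "length": len(text),
        # Counters of character classes and special characters.
        "whitespace_count": sum(ch.isspace() for ch in text),
        "digit_count": sum(ch.isdigit() for ch in text),
        "alpha_count": sum(ch.isalpha() for ch in text),
        "slash_count": text.count("/"),
        "dash_count": text.count("-"),
        # Boolean flags (stored as 0/1) from regular expression matches.
        "is_date": int(bool(DATE_RE.search(text))),
        "is_tax_number": int(bool(TAX_NUMBER_RE.search(text))),
        "has_currency": int(bool(CURRENCY_RE.search(text))),
        "has_decimal_point": int(bool(DECIMAL_RE.search(text))),
    }
```

For example, `text_features("12345678-2-42")` sets `is_tax_number` to 1 and `dash_count` to 2, while `is_date` stays 0. In a real pipeline the patterns would need to be tuned to the date and amount formats that actually occur in the data.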
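The keyword-based contextual features of Section 3.1 can be sketched in Python as follows. This is a minimal illustration under several assumptions that the paper does not confirm: distances are measured between bounding box centers, the offsets are stored as absolute values normalized by the page width and height (so that the −1 marker for a missing keyword stays unambiguous), the search is restricted to the page of the current Fragment, and keyword matching is a case-insensitive substring test. The `Fragment` class, the function names and the three Hungarian keywords used in the usage note are hypothetical; the 38 keywords of the paper are not listed in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

MISSING = -1.0  # value assigned when a keyword does not occur (see Section 3.1)


@dataclass(frozen=True)
class Fragment:
    text: str
    x: float
    y: float
    width: float
    height: float
    page: int


def center(fragment: Fragment) -> Tuple[float, float]:
    """Center point of the bounding box of a Fragment."""
    return (fragment.x + fragment.width / 2.0,
            fragment.y + fragment.height / 2.0)


def keyword_features(target: Fragment,
                     fragments: List[Fragment],
                     keywords: List[str],
                     page_width: float,
                     page_height: float) -> Dict[str, float]:
    """Normalized x and y offsets from `target` to the nearest occurrence
    of each keyword. Keywords are expected in lower case."""
    tx, ty = center(target)
    features: Dict[str, float] = {}
    for keyword in keywords:
        if keyword in target.text.lower():
            # Special case 1: the Fragment itself contains the keyword.
            dx, dy = 0.0, 0.0
        else:
            candidates = [f for f in fragments
                          if f.page == target.page
                          and keyword in f.text.lower()]
            if not candidates:
                # Special case 2: the keyword is not found.
                dx, dy = MISSING, MISSING
            else:
                def squared_distance(f: Fragment) -> float:
                    cx, cy = center(f)
                    return (cx - tx) ** 2 + (cy - ty) ** 2

                # Keep only the closest occurrence of the keyword.
                nx, ny = center(min(candidates, key=squared_distance))
                dx = abs(nx - tx) / page_width
                dy = abs(ny - ty) / page_height
        features[f"{keyword}_dx"] = dx
        features[f"{keyword}_dy"] = dy
    return features
```

Calling `keyword_features(tax_number_fragment, fragments, ["eladó", "vevő", "fizetendő"], 600.0, 800.0)` yields two features per keyword, so 38 keywords would contribute 76 columns to the feature vector of each Fragment. Signed offsets would be an alternative that preserves direction, at the cost of a different marker for missing keywords.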
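The argument of Section 4 for preferring the macro average F1 score over accuracy on unbalanced data can be checked with a small worked example. The sketch below uses plain Python and a toy label set of 93 Other Fragments plus one Fragment for each of the 7 classes of interest; these toy data are an assumption for illustration and are not the paper's dataset. The F1 score of a class with no true positives is taken to be 0, which is a common convention.

```python
from typing import List, Sequence

LABELS: List[str] = [
    "Other", "Invoice Number", "Issue Date", "Transaction Date",
    "Due Date", "Seller tax number", "Customer tax number", "Total gross",
]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of Fragments whose predicted label is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str],
                 label: str) -> float:
    """F1 score of a single class, treated as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # convention: no true positives means F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Toy data: 93% Other, and a classifier that blindly predicts Other.
y_true = ["Other"] * 93 + LABELS[1:]
y_pred = ["Other"] * len(y_true)

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, LABELS):.2f}")
```

On this toy data the blind classifier reaches an accuracy of 0.93 but a macro average F1 of only about 0.12, because each of the 7 small classes contributes an F1 of 0 with the same weight as the Other class.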