Context-based Information Classification on Hungarian Invoices

                             Gábor Szegedi, Diána Bajdikné Veres, Imre Lendák, and Tomáš Horváth

                                          Department of Data Science and Engineering
                                   ELTE – Eötvös Loránd University, Faculty of Informatics
                                Budapest, H-1117 Budapest, Pázmány Péter sétány 1/C., Hungary
                     vszm@inf.elte.hu, veresdia91@gmail.com, {lendak, tomas.horvath}@inf.elte.hu

Abstract: The field of Artificial Intelligence has always
strove towards solving problems with computers that were
previously only solvable by humans. An interesting chal-
lenge we have these years is extracting information from
printed documents. In this publication we focus on the
sub-domain of classifying pieces of information on printed
invoices. Our goal here was to create a solution capable of
finding information on scanned invoices without knowing
the template of the invoice. The template-less design is im-
portant as invoices can have many different structure based
on the issuer. First we feed the invoice image to a commer-
cially available Optical Character Recognition (OCR) en-
gine which returns the extracted texts with their bounding
boxes. This information itself wouldn’t be enough so we
enrich it with feature engineering. The engineered features
give information about the content of the text and the con-
text of the surrounding of the bounding box as well as meta
information about the entire document. The main novelty
of our work comes from how we store contextual informa-
tion. We then classify the text fragments into 8 categories:
Invoice Number, Issue Date, Transaction Date, Due Date,
Seller tax number, Customer tax number, Total gross and
Other. In this paper we measure the performance of sev-                    Figure 1: A sample printed invoice that was photographed
eral classification algorithms for our data. We achieved                   using a mobile phone.
0.93 macro average F1 measure with our best model.

                                                                           of invoices and thus each issuer has its own template for
1     Introduction                                                         generating invoices. Even if we could manage to create a
                                                                           template for each of the issuers, the number of issuers is
Information extraction can be a very easy or an impossibly                 growing.
hard problem depending on the kind of data we have. In                        Because of these given conditions we have developed a
this paper we are dealing with invoices of various sources.                solution that is generic and free of any premises about the
We have received thousands of files from our business                      structure of the input document.
partner OTP1 bank to create a solution for their problem
of parsing the incoming invoices. The format of the files
fall in 3 main categories:                                                 1.1   Related Work

                                                                           The main approaches to invoice recognition are Template
    1. Photographed image of a printed invoice
                                                                           based, Graph Convolutional Neural Network (CNN) based
    2. Scanned image of a printed invoice                                  and direction based.
                                                                              Information Extraction (IE) from structured documents
    3. Text based PDF invoice                                              has been a studied problem for decades now. One of the
                                                                           first works was The generic information extraction system
   The problem with the dataset is that we can’t use a finite              by Jerry R. Hobbs [1]. In his work he laid out an architec-
set of templates. The law does not constrain the format                    ture for creating systems capable of extracting information
      Copyright ©2020 for this paper by its authors. Use permitted under
                                                                           from text using a rule based layer-by-layer approach.
Creative Commons License Attribution 4.0 International (CC BY 4.0).           Parsing invoice data is a sub-domain of IE. In the case
    1 https://www.otpbank.hu/portal/en/home                                when the input data consists of a finite set of known
layouts we can simply extract information based on the          3     Our approach
known positions of the key pieces. Such a problem does
not require sophisticated Machine Learning (ML) solu-           In any ML problem the biggest dividing factor is how the
tions, but rather a simple algorithmic approach.                data is represented. In our case we have a very high level
   The problem gets significantly harder if we do not know      representation of the data given images of invoices. We
the document templates at training time. In such a case         didn’t find it feasible to train a complex Neural Network
we need to create a generic solution that is smart enough       that operates directly on the image data for the following
to extract information from unknown document layouts.           reasons:
This field has been studied as well in recent years.
   A very recent study entitled An Invoice Reading System           1. The dimensionality of the images is large and we
Using a Graph Convolutional Network was published in                   can’t shrink them without loosing the small but es-
2019, in which the authors turned the output of an Op-                 sential textual information.
tical Character Recognition (OCR) system into a Graph
                                                                    2. Training times without dimensionality reduction
[2]. The graph’s nodes contained the text outputs from the
                                                                       would be too big even on a special hardware.
OCR and edges were pointing to nearby nodes based on
the location information given by the OCR output. The               3. There are invoices that come in textual PDF format.
nodes were then fed to a Graph based CNN to classify for
entities of interest.                                              For these 3 reasons we chose to simplify the high level
   CloudScan – A configuration-free invoice analysis sys-       representation of the invoice data. We began by breaking
tem using recurrent neural networks is another article from     down the invoice into a list of Fragments. A Fragment is a
2017 in which the authors have created a fully fledged end-     piece of text with its location (X, Y, page) and dimension-
to-end pipeline for parsing invoices [3]. They get the text     ality (width, height) information.
from OCR and create N-grams of words in the same line              Creating the list of Fragments is a rather simple task.
up to a length of 4. They enrich their entities with features   For text PDF invoices the representation is basically the
based on the properties of the piece of text in the N-gram.     same as we desire. We just need to use a Python API 2
They add contextual features based on the closest 4 enti-       to get the text with the bounding boxes from each page of
ties above, below, left and right of the current text. The      the PDF. Converting image based invoices to Fragments is
classifier they use is a Long Short-Term Memory (LSTM)          also trivial as we can use an OCR API which returns a list
Recurrent Neural Network [4].                                   of texts with their bounding boxes. We have integrated our
                                                                pipeline with Tesseract3 , Azure OCR4 and OCR Space5 .
                                                                We found Tesseract to be quite inaccurate compared to the
2    Scope                                                      other two, maybe because our invoices are in Hungarian
                                                                and not in English. In the end we used OCR Space.
Our goal was to find the most relevant information needed          Once we have the list of Fragments the task is finding
by our industry partner. The scope of our work was to           the relevant ones and extracting the information out of the
extract the following information from the given invoices:      text. This can be formulated as a classification task, where
                                                                each Fragment is a separate input to the classifier and the
    • Invoice Number: The unique identifier of the invoice;     model assigns a label denoting what kind of information
                                                                is in the Fragment. For each label type we select the Frag-
    • Issue Date: The date the invoice was issued;
                                                                ment of the highest probability as the Fragment containing
    • Transaction Date: The date of reconciliation;             the given piece of information.

    • Due Date: The date until the invoice has to be recon-
                                                                3.1    Feature Engineering
      ciled;
                                                                It should be possible to build a model that we can feed
    • Seller tax number: The tax number of the seller party;
                                                                the entire fragment list as input and ask the model to re-
    • Customer tax number: The tax number of the cus-           turn what is the most probable fragment for each label is.
      tomer party (This is only included if the customer is     We could use Recurrent Neural Networks and consider the
      a company, and not an individual);                        list of fragments a sequence of input. However this would
                                                                make the problem very hard.
    • Total gross: The total amount to be paid by the cus-
      tomer party.                                                 2 We used pdfminer.six API https://pypi.org/project/

                                                                pdfminer.six/
                                                                   3 https://github.com/tesseract-ocr/tesseract
  The invoices considered are all printed invoices, thus           4 https://docs.microsoft.com/en-us/
no handwritten invoices were used as those invoices are         azure/cognitive-services/computer-vision/
getting ousted. The invoices can be image based or “text”       concept-recognizing-text
based such that, e.g. a portable document format (PDF).            5 https://ocr.space/
   Instead of this we have chosen to do extensive feature        wouldn’t be enough to do labeling effectively, we need
engineering to add relevant information to each fragment         the context of the Fragments as well. We have also seen
and do classification on each fragment individually. In to-      the idea of adding context information to the Fragments in
tal we have added 104 features to the existing 6, such that      other works but there is no consensus on what is the best
(text, X, Y, width, height, page), what is the main added        way to do this.
value of our work. We can group these engineered fea-               In [3], the authors give context information by adding
tures into 4 groups which we go into detail below.               the features of the nearest neighboring Fragments in the 4
                                                                 general directions (Up, Down, Left, Right).In [2], a very
                                                                 similar method is used as the graph nodes connect to the
Positional features We have added a lot of features that
                                                                 4 general directions as well. In [6], the authors took a
are describing the position of the Fragment relative to the
                                                                 different approach by storing the polar coordinates of each
document. We intuitively found this to be very important.
                                                                 fragment relative to the current fragment.
For example the invoice number is usually on the top of the
                                                                    We take a new approach in capturing the context of the
first page of a document. To describe this first we needed
                                                                 Fragment. We have identified 38 keywords in invoices that
to add a feature for storing which page the Fragment is
                                                                 are indicative of important information in the vicinity. For
part of. Then we need the features to describe the X and Y
                                                                 each Fragment we search for the nearest occurrence of all
coordinates of the Fragment on the given page. Lastly we
                                                                 the keywords. We calculate the distance from the Frag-
have added features to describe the dimensionality of the
                                                                 ment to the keyword and store the normalized x and y val-
Fragment.
                                                                 ues as features for each of the keywords. After doing this
                                                                 for each keyword we get a complex web over our invoice
Document features Another set of features added were             document. We can see this mechanism in Figure 2 for a
to describe the entire document. Knowing the positional          single Fragment of a sample invoice.
attributes of a Fragment is not meaningful without know-            Using such keywords comes naturally to us humans
ing the dimensionality attributes of the document. Thus          when we interpret an invoice. For example if we take a
we have added features for document width and height,            look at an invoice and we see 2 tax numbers we can decide
their ratios, the number of pages. We also added features        which belongs to the seller by deciding which one is closer
describing where are the outermost Fragments for the cur-        to the keyword Seller or Provider 6 .
rent page and for the entire document.


Text features In ML, when dealing with text, the problem
arises from the fact that our models are only capable of
working with numbers. We need to encode the text into a
finite set of features all the time, so that is what we did as
well for our Fragments.
   One obvious feature that we can always use to quantify
text is the length of the text we have. We added this of
course to our features.
   Another set of features we used were counters of spe-
cial characters in the Fragment. For example the count of
whitespaces is an important feature telling us how many
words are there in the fragment. Counters of digits and
alphabetical characters are also good indicators of what
kind of text are we dealing with. The count of slashes and
dashes could imply that the current Fragment contains a
bank account number or an invoice number. We could add
many more special characters depending on the data we
have.
   Lastly we have added a few Boolean features derived
from the text based on whether the text matches a given          Figure 2: A sample invoice where the Fragment in the
regular expression or not. We created regular expression         green bounding box is the tax number of the seller and
features for matching date, tax number, currency and dec-        the closest keywords are highlighted with blue bounding
imal point.                                                      boxes. Note that some of the keywords occur multiple
                                                                 times on the invoice but we chose the closest one for the
                                                                 current Fragment.
Contextual features The feature categories we discussed
so far are standard features we have seen in other related
works as well [3] [5] [2]. These features in themselves             6 We use Hungarian keywords only as the dataset is Hungarian
                                                                       Gradient Boosting Classifier [7].


                                                                       5   Experiments
                                                                       The data we received consisted of thousands of Invoices
                                                                       from different issuers in various file formats. About half
                                                                       of them were electronic invoices in textual PDFs, while
                                                                       other half was images of either photographs or scans of
                                                                       printed invoices.
                                                                          First we had to manually annotate the ground truth
                                                                       about the invoices. For each invoice we stored the rele-
                                                                       vant information pieces that were in scope.
                                                                          After that we had run the OCR engine on all of the im-
                                                                       age based invoices. We then filtered out those image based
                                                                       invoices where the OCR output did not contain at least the
                                                                       invoice number and the seller tax number. We did this be-
                                                                       cause the OCR engine optimization is out of our scope and
                                                                       this way we eliminated any errors due to image quality or
                                                                       bugs in the OCR engine used. We had 2000 invoices that
                                                                       met this criteria.
                                                                          Lastly we ran the algorithm to convert the filtered in-
Figure 3: Class distribution of Fragments on one of our                voices into Fragments. We manually labeled the Frag-
test sets.                                                             ments to eliminate any OCR issues, or data format differ-
                                                                       ences there may be. This was the data that we could build
                                                                       our models upon.
   There are 2 special cases to note when pointing the to                 We have split the Fragments into train, test and valida-
the nearest keyword. The first one is when the current frag-           tion sets as usual in the field. The ratios used were 60% for
ment contains one of the keywords. In that case we chose               training, 20% for validation (tuning the hyper-parameters)
to set the value of the x and y features to be zero. The               and 20% for comparing the trained and optimized mod-
second special case is when a certain keyword is not to be             els. The results are visible in Table 5. Without surprise
found on the invoice. In this case we assign −1 to the x               the XGBoost Classifier performs the best out of these 3,
and y features of the keyword. Our approach is different               but even the Decision Tree and Random Forest models are
from the related works because we neither restrict the con-            very close to the Gradient Boosting model. XGBoost out-
text to the 4 nearest Fragments in the general directions,             performs the other 2 on all of the labels and in the overall
nor do we include all the features. The contextual infor-              macro average as well. The tree based models fall behind
mation is relevant because of the keywords these vectors               in the Total Gross label.
are pointing towards.

                                                                       6   Conclusion and future work
4    Classification
                                                                       In this work we presented a method of identifying key in-
After preparing the data we had to train a classifier for la-          formation on Invoices. We have built upon external depen-
beling the Fragments. During the training and evaluation               dencies via Optical Character Recognition, which is used
we had to be cautious as the dataset is very unbalanced be-            to break down the image based invoice into text pieces
cause on each document there are hundreds of text Frag-                with positional information called Fragments. We did an
ments but we are only interested in 7 of them. See Figure              extensive feature engineering so that the Fragments con-
3 for the distribution of the 8 class labels.                          tain features regarding the properties document, the con-
   As 93% of this data was labeled as Other, a classifier              tained text and the context of the text as well.
that assigns the Other label blindly to any input would                   The main added value of our work comes from how we
achieve an overall accuracy of 93%. This is of course not              add contextual information to the Fragments. It works by
good as our focus is on the classes with low cardinality.              pointing vectors from the Fragment to each of the nearest
To overcome this we are using the macro average of the                 keywords. This gives the power of our method to gen-
F1 Score of the classes to compare the classifiers.                    eralize so that unlike some of the previous works it can
   We have chosen 3 classifiers for comparison: Decision               perform well on previously unseen Invoices.
Tree Classifier, Random Forest Classifier 7 and Extreme                   However our work does not stop here. We plan on cre-
    7 For the Decision Tree and Random Forest models we used the im-   ating an end-to-end solution for extracting the information
plementations from scikit-learn                                        from invoices. We have merely identified the Fragments
                 Decision       Random          XGBoost
                 Tree           Forest
 Other           0.99           0.99            1.00
 Invoice         0.81           0.86            0.89
 number
 Issue date      0.88           0.89            0.93
 Transaction     0.87           0.89            0.94
 date
 Due date        0.88           0.93            0.96
 Seller tax      0.89           0.91            0.92
 number
 Customer        0.92           0.93            0.94
 tax number
 Total gross     0.71           0.75            0.87
 Macro Av-       0.8685         0.8952          0.9297
 erage

Table 1: Comparison of models after hyperparameter op-
timization. The numbers are F1 scores.


correctly, but extracting the information from the selected
Fragments still holds challenges.                                Figure 4: The key information pieces our system can iden-
   As our main focus was on feature engineering there can        tify from the Fragments.
still be room for improvement in the modeling side. We
are planning on experimenting with Neural Network mod-
els in the future to see if we can improve our accuracy.         [4] S. Hochreiter and J. Schmidhuber, “Long short-term mem-
                                                                     ory,” Neural computation, vol. 9, no. 8, pp. 1735–1780,
   We also plan on embedding the system in a service,
                                                                     1997.
where the user sends in the invoice and our system either
                                                                 [5] H. T. Ha, Z. Nevěřilová, A. Horák et al., “Recognition of ocr
returns the extracted information, or a new version of the
                                                                     invoice metadata block types,” in International Conference
invoice image where the key information points are high-             on Text, Speech, and Dialogue. Springer, 2018, pp. 304–
lighted like in Figure 4.                                            312.
                                                                 [6] M. Rusinol, T. Benkhelfallah, and V. Poulain dAndecy,
Acknowledgements The research has been supported by                  “Field extraction from administrative documents by incre-
                                                                     mental structural templates,” in 2013 12th International
the European Union, co-financed by the European Social
                                                                     Conference on Document Analysis and Recognition. IEEE,
Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamen-                   2013, pp. 1100–1104.
tal Research Collaborations Grounding Innovation in In-
                                                                 [7] T. Chen and C. Guestrin, “Xgboost: A scalable tree boost-
formatics and Infocommunications).                                   ing system,” in Proceedings of the 22nd acm sigkdd interna-
   Supported by Telekom Innovation Laboratories (T-                  tional conference on knowledge discovery and data mining,
Labs), the Research and Development unit of Deutsche                 2016, pp. 785–794.
Telekom and EIT Digital.


References

[1] J. R. Hobbs, “The generic information extraction system,”
    in Fifth Message Understanding Conference (MUC-5): Pro-
    ceedings of a Conference Held in Baltimore, Maryland, Au-
    gust 25-27, 1993, 1993.
[2] D. Lohani, A. Belaïd, and Y. Belaïd, “An invoice reading
    system using a graph convolutional network,” in Asian Con-
    ference on Computer Vision. Springer, 2018, pp. 144–158.
[3] R. B. Palm, O. Winther, and F. Laws, “Cloudscan-a
    configuration-free invoice analysis system using recurrent
    neural networks,” in 2017 14th IAPR International Con-
    ference on Document Analysis and Recognition (ICDAR),
    vol. 1. IEEE, 2017, pp. 406–413.