<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>A Deep Learning Based Methodology for Information Extraction from Documents in Robotic Process Automation</article-title>
      </title-group>
      <contrib-group>
        <aff>Noovle S.p.A., Milan, Italy, <ext-link ext-link-type="uri" xlink:href="https://www.noovle.com/en/">https://www.noovle.com/en/</ext-link></aff>
      </contrib-group>
      <abstract>
        <p>In recent years, thanks to Optical Character Recognition techniques and to technologies that deal with low scan quality and complex document structure, there has been a continuous evolution and automation of digitization processes to enable Robotic Process Automation. In this paper we propose a methodology, based both on deep learning algorithms (such as Generative Adversarial Networks) and on statistical tools (such as the Hough transform), for the creation of a digitization system capable of managing critical issues like the low scan quality and the complex structure of documents. The methodology is composed of 5 modules that manage the poor quality of scanned documents, identify the template and detect tables in documents, extract and organize the text into an easy-to-query schema, and perform queries on it through search patterns. For each module, different state-of-the-art algorithms are compared and analyzed, with the aim of identifying the best solution to be adopted in an industrial environment. The implemented methodology is measured with respect to the business needs over real data, by comparing the extracted information with the target values, and shows a performance of 90% in terms of the Gestalt Pattern Matching measure.</p>
      </abstract>
      <kwd-group>
        <kwd>Robotic Process Automation</kwd>
        <kwd>Optical Character Recognition</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Image Denoising</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the spread of cameras on mobile devices, more and more images of scanned
documents are collected in order to be digitized for different uses. Most of the
digitization processes are still done manually today, however, thanks to recent
advances in machine learning, it is possible to further automate these processes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        When dealing with information extraction from documents, Optical Character
Recognition (OCR) techniques are the key technology, however these alone are
not enough to extract all the visual and structural information from scanned
documents. Moreover their power is limited when dealing with poor-scan
quality documents or with documents with complex structure. In this research we
de ne a methodology for extracting information from scanned or editable
documents, trying to manage and limit the noise coming from the low quality of the
scans and taking into account document structure. The methodology is designed
using real data coming from two di erent companies and is tested by
considering the companies' real business needs. The methodology that we propose rstly
uses Generative Adversarial Network (GAN) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to clean the scanned
documents, then identifies the document template by using a Siamese Neural Network
(SNN) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and then, by using a method based on a computer vision technique
called Hough Transform (HT) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and on the Google Cloud Vision API [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] for
OCR, identifies tables. Then an information mapping process is defined that delegates the
personalization of the content extraction to the drafting of a set of queries,
thus making the information retrieval simpler and more immediate. The rest of
this paper is organized as follows. Firstly, in Section 2, the analysis of the state
of the art for information extraction is presented; in Section 3 the actual use case
and the real-world dataset are described; in Section 4 the defined methodology is
shown in detail; in Section 5 the experimental results obtained are summarized; in
Section 6 some conclusions and future directions are mentioned.
      </p>
    </sec>
    <sec id="sec-2">
      <title>State of the art</title>
      <p>
        Information retrieval from documents has been an important research area for
several decades. With the advent of deep learning, OCR systems have become
extremely powerful and usable, thanks to open source systems such as
Tesseract [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and cloud API based solutions such as Google Cloud Vision API [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ].
Today, interpreting documents with a simple text layout and good scan quality
has become a trivial problem thanks to these recent developments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], especially
if the PDF is software-generated and editable, as described by H. Chao and J.
Fan in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In the case of non-editable PDFs (scanned documents), the only applicable
solution for text extraction is represented by OCR techniques. OCR techniques
generally consist of 5 phases: pre-processing, segmentation, normalization,
feature extraction and post-processing. In Table 1 the main algorithms of each
phase are presented.
      </p>
      <p>
        The pre-processing step, which aims at eliminating noise in an image without losing any significant information, was traditionally performed with statistical and computer vision techniques, but recently some deep learning approaches have also been successfully proposed. One of the most used in the field of image denoising is represented by GANs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which have been applied in different image-to-image translation contexts, showing very good performance. Various techniques also exist for the feature extraction phase; today the main ones are based on the use of neural networks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] an overview of the state of the art of neural-network-based algorithms for OCR is presented, showing how these algorithms are able to achieve the best performance in the context of feature extraction. The study also highlights the impact of the features extracted by these algorithms on the classification task.
      </p>
      <table-wrap id="tab1">
        <label>Table 1.</label>
        <caption>
          <p>Main algorithms and approaches for each OCR phase.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Phase</th>
              <th>Algorithms and Approaches</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Pre-processing</td>
              <td>Binarization, skew correction, filtering, thresholding, compression, thinning [<xref ref-type="bibr" rid="ref5">5</xref>], GAN [<xref ref-type="bibr" rid="ref12">12</xref>]</td>
            </tr>
            <tr>
              <td>Segmentation</td>
              <td>Top-down methods, bottom-up methods, hybrid methods [<xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>]</td>
            </tr>
            <tr>
              <td>Normalization</td>
              <td>Standard approaches [<xref ref-type="bibr" rid="ref8">8</xref>]</td>
            </tr>
            <tr>
              <td>Feature extraction</td>
              <td>Neural network based approaches [<xref ref-type="bibr" rid="ref9 ref10">9, 10</xref>]</td>
            </tr>
            <tr>
              <td>Post-processing</td>
              <td>Rule-based methods [<xref ref-type="bibr" rid="ref5">5</xref>]</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        As described by Suen in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
the main classes of features are two: statistical and structural. Statistical features (like
moments, zoning, crossings, Fourier transform and histogram projections) are
also known as global features, while structural features (like convexity or concavity in
the characters, number of holes in the characters or number of endpoints) are
known as local features. Nowadays, there are many tools that automatically
perform these steps with high accuracy, such as Tesseract [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or Google Cloud Vision
API [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]; however, the steps of pre-processing and post-processing are extremely
difficult to generalize, since they strictly depend on the input data
and the expected output. Moreover, in the implementation of a Robotic Process
Automation (RPA) system, text extraction is just one of many issues, and the
main challenges are understanding the structure of the document and extracting
visual entities like tables. In 2013 M. Gobel et al. proposed a first meticulous
comparison of the performance of various table identification techniques over the
ICDAR 2013 dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. From 2013 to date, in addition to the enhancement of
these deterministic approaches, several new techniques based on machine
learning algorithms have been introduced, like a method based on the identification
of the horizontal and vertical lines classified through Support Vector Machine
(SVM) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or methods based on Fast-RCNN (FRCNN) [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] trained on the
Marmot dataset for table recognition [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. Even though these techniques are
extremely powerful, they need a lot of data to be trained and to generalize. For this
reason, besides machine learning techniques, many computer vision techniques
based on the Hough Transform (HT) [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ], which is universally used today,
especially thanks to the discovery of its generalized form by Dana H. Ballard [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], have been
proposed. Its power relies on the fact that this transform does not need training
data, since it can be applied as a mathematical function. Overall,
table extraction techniques are even more effective when combined with document
structure identification techniques. Indeed, being able to identify the document
template gives a priori knowledge of the structure of the text, and
this knowledge can be exploited to facilitate the identification of objects within
the document. With respect to this, there are several related studies [<xref ref-type="bibr" rid="ref16 ref17 ref18">16-18</xref>]. In
particular, some deep learning techniques like Siamese Neural Networks (SNN)
have been recently proposed. These models consist of two parallel groups of CNN
layers that extract features from two distinct inputs: the input document and a
document from a knowledge base, i.e. the set of possible templates
to which the document can be mapped. These algorithms are extremely
powerful because they achieve very high performance even with little training
data, thanks to a learning technique called one-shot learning [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Related Works</title>
        <p>
          Extracting information from documents is an active research field, and in recent
years several works on the topic have been published. For instance, in [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]
Vishwanath D. et al. present an end-to-end framework that maps some visual entities
(such as tables, printed and handwritten text, boxes and lines) into a relational
schema so that relevant relationships between entities can be established. The
framework performs image denoising by means of a GAN and horizontal
clustering to localize page lines. In [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], instead, the authors build an invoice analysis
system that does not rely on templates of invoice layout, but learns a single
global model of invoices that naturally generalizes to unseen invoice layouts.
In [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], a framework which makes use of an attention mechanism to transfer
the layout information between document images is proposed. The authors also
applied conditional random fields on the transferred layout information for the
refinement of field labeling.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Problem Setting</title>
      <p>The goal of this research is to identify a methodology that allows the creation of
an RPA system for extracting specific information from documents. The
information to be extracted is driven by business needs and varies according to the
use cases of the two pilot companies that provided the data. The goal of these two
companies is to digitize their information in order to activate business
processes of notarization and supplier management. In order to understand the
approach to be pursued, the datasets provided to implement the
solution are described in the following.</p>
      <sec id="sec-3-1">
        <title>Dataset Description</title>
        <p>The datasets used to define and test the methodology are composed of a
collection of documents from two different companies. In one case, the documents
are editable PDFs of production sheets (in the following we will refer to this data
as dataset A); in the other case, the documents are scanned PDFs of technical
product sheets, invoices and transport documents (we will refer to this data as
dataset B). In the first case, the production sheet is composed of different pages,
each composed of a body in the form of a table that contains several
production sub-sheets, each representing an order to be placed with specific suppliers.
Production sub-sheets are identifiable as a set of contiguous populated lines,
separated from other production sub-sheets by a white line. The goal is to extract all
the lines corresponding to each production sub-sheet in order to automatically activate
the process of supplier selection. In the second case, the dataset is composed of
three kinds of scanned documents: invoices, with different templates; transport
documents, with different templates, signed and dated by hand; and technical
product sheets in the form of a table that contains specific information about the
products (EAN, ingredients, nutrition values, sterility requirements etc.). Each
document has its own set of information to be retrieved in order to be notarized
via blockchain.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>By considering the business needs expressed by the partner companies, a
methodology that consists of five main steps has been identified. Starting from the
document, a first phase of pre-processing with GANs is envisaged, with the aim of
reducing the noise of scanned documents. Subsequently, a template identification
module based on deep learning models (CNN or SNN) is implemented. Then a
module to detect and extract tables is defined, looking for the vertical and
horizontal lines within the document (using the HT) to trace the number of columns
and rows that may constitute them. The OCR phase (with the Google Cloud
Vision API) follows, to extract the textual content, together with the content mapping
module, to encapsulate the information within a matrix schema. Finally, the last module
deals with extracting the required information, looking for patterns defined in
advance and depending on the type of document. The goal is to ensure that the
methodology can be applied in different industrial environments with respect
to the business needs expressed. The individual steps of the methodology are
described in detail below.</p>
      <sec id="sec-4-1">
        <title>Pre-processing: image denoising</title>
        <p>
          The first step of the methodology is represented by the pre-processing phase.
Specifically, at this stage the goal is to reduce the noise present in the images
of the scanned documents, since the quality of the documents influences all the
subsequent steps of the methodology. As described in Section 2, one of the most recent
and effective methods to perform image denoising is represented by GANs [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
GANs are neural networks made up of two parts: a convolutional network called
generator, trained to generate synthetic samples from the input data, and a
convolutional network called discriminator, trained to tell whether an image is real
or generated. Formally, consider a generative network G that captures the
distribution of the data and a discriminative network D that estimates the probability
that an example comes from the training dataset rather than from G. To learn the
generator distribution p<sub>g</sub> over the data x, the generator builds a non-linear mapping
function G(z; θ<sub>g</sub>) of the a-priori noise distribution p<sub>z</sub>(z). The
discriminator D(x; θ<sub>d</sub>) produces as output a value that represents the probability
that x comes from the training set rather than from p<sub>g</sub>. G and D are trained
simultaneously: G's parameters are adjusted to minimize log(1 − D(G(z))) and
D's parameters to maximize log D(x), as if they were playing a two-player min-max game
with value function
        </p>
        <p>
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math><![CDATA[ V(G,D):\; \min_G \max_D \, \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] ]]></tex-math>
          </disp-formula>
According to [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], there are different existing GAN architectures, and we propose
to compare the conditional GAN (cGAN) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and the cycle GAN [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], since they
represent the most recent developments in the field of image-to-image translation.
        </p>
        <p>
          <bold>Conditional GAN.</bold> GANs can be extended to a conditional model if both
the generator and the discriminator are conditioned on extra information y.
The conditioning can be done by inserting y both in the generator and in the
discriminator as an additional input layer. In the generator, the a-priori input
noise p<sub>z</sub>(z) and y are combined in a joint hidden representation [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. In this case
the value function of the two-player min-max game is
        </p>
        <sec id="sec-4-1-1">
          <title>V (G; D) : minGmaxDEx pdata(x)[logD(xjy)]+Ez pz(z)[log(1</title>
          <p>
            D(G(zjy)))] (2)
Using a model of this type it is possible, by providing the model with images of
noisy documents and their respective noise-free, to train the model to produce,
given in input an image of a document, the image of the same document but
with a reduction of the disorder. To perform the training, it is necessary to have
for each input also the target image, i.e. the dataset must be composed of pairs
formed by noisy images and their respective images without noise.
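        <p>
          As a concrete illustration of this training objective, the following minimal PyTorch sketch implements the adversarial losses of Eqs. (1)-(2) for paired document denoising. The tiny generator and discriminator are illustrative placeholders, not the architectures used in our experiments, and the L1 weight is an assumed value.
        </p>
        <preformat><![CDATA[
# Minimal sketch of the paired (conditional) GAN denoising objective.
# The tiny networks below are illustrative placeholders.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())   # noisy -> clean
D = nn.Sequential(nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
                  nn.Flatten(), nn.LazyLinear(1))                 # (noisy, page) -> logit

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(noisy, clean):
    """One min-max step on a batch of (noisy, clean) page crops."""
    fake = G(noisy)
    # D step: push D(x|y) towards 1 and D(G(z|y)) towards 0, as in Eq. (2).
    d_real = D(torch.cat([noisy, clean], dim=1))
    d_fake = D(torch.cat([noisy, fake.detach()], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # G step: fool D, plus an L1 term exploiting the paired supervision.
    d_fake = D(torch.cat([noisy, fake], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + 100.0 * l1(fake, clean)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
]]></preformat>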
        <p>
          <bold>Cycle GAN.</bold> Another extension of GANs, called cycleGAN [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], lets them learn a mapping function between two domains X and Y, given
some training examples {x<sub>i</sub>}<sub>i=1</sub><sup>N</sup> where x<sub>i</sub> ∈ X and {y<sub>j</sub>}<sub>j=1</sub><sup>M</sup> where y<sub>j</sub> ∈ Y. The
cycleGAN model includes two mapping functions G : X → Y and F : Y → X;
moreover, two adversarial discriminators D<sub>X</sub> and D<sub>Y</sub> are introduced, where the
goal of D<sub>X</sub> is to distinguish between the images {x} and their corresponding translations
{F(y)}, and likewise for D<sub>Y</sub> with respect to the images {y} and the corresponding
translations {G(x)}. The goal is twofold: to reduce the adversarial losses, so as to align
the distributions of the generated images with the distributions of the target
images, and to reduce the cycle consistency loss, so as to prevent the mappings learned
by G and F from being contradictory.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Document template identification: image classification</title>
        <p>
          The second step of the methodology is represented by the document template
identification module. This module aims at determining the template of the input
document, and this can be accomplished by any image classification algorithm.
In our research we compare two kinds of algorithms: a more consolidated one
based on CNNs [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and a more recent one based on SNNs [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which have shown
high performance even with small datasets thanks to what is called "one-shot learning".
        </p>
        <p>
          <bold>Siamese Neural Network.</bold> SNNs are models that aim at recognizing, starting
from two distinct inputs, whether they belong to the same class or not. The
network works in two main phases: in the first phase, the two input images are
passed through a series of convolutional layers in order to obtain an embedding
of each; in the second phase, a distance measure is calculated between the two
output vectors. The output is a value that indicates the distance between the two
inputs, which is used to classify the images. This learning approach is also called
"one-shot learning" since it is not necessary to have thousands of documents to carry
out the training: the model works on the features extracted by the CNNs that make up the
SNN, and therefore even a single image per class, constituting the comparison sample,
is sufficient.
        </p>
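        <p>
          The following minimal PyTorch sketch shows the comparison logic just described; the embedding tower, the L1 distance and the template handling are illustrative assumptions, not the exact configuration used in the experiments.
        </p>
        <preformat><![CDATA[
# Sketch of a Siamese comparison: both inputs pass through the same
# convolutional tower and the distance between embeddings drives the decision.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.ReLU(),
                      nn.Conv2d(8, 16, 3, stride=2), nn.ReLU(),
                      nn.Flatten(), nn.LazyLinear(64))

def template_distance(doc, template):
    """L1 distance between the embeddings of a document and a template image."""
    return torch.sum(torch.abs(embed(doc) - embed(template)), dim=1)

def classify(doc, templates):
    """One-shot classification: pick the template at minimum embedding distance."""
    dists = [template_distance(doc, t).item() for t in templates]
    return min(range(len(dists)), key=dists.__getitem__)
]]></preformat>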
      </sec>
      <sec id="sec-4-3">
        <title>Table identification and text extraction</title>
        <p>
          Once the images have been denoised and the templates identified, the next
module aims at locating the structural elements of the document: not only
applying OCR to extract the text, but also identifying and mapping structural
elements such as tables. This module therefore consists of two
blocks: an OCR block for extracting the textual content, based on a pre-trained OCR
tool, and a table identification block for organizing the content, based on
computer vision techniques.
        </p>
        <p>
          <bold>Google Cloud Vision API.</bold> As an OCR tool we decided to use the Google
Cloud Vision API, whose documentation is reported in [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]. This
API performs an analysis of the image layout to segment the areas where text
is present. After this general localization phase, the OCR
module recognizes the text in the specified areas and, consequently, extracts
it. Finally, the result is corrected through post-processing techniques based on
language models and dictionaries. Most of these steps are carried out through
the use of CNNs. The extraction performed by the Google Cloud Vision API
OCR module returns the textual content together with the organization and position
of the content within the image. More precisely, the output produced consists
of a dictionary containing a structure divided into a hierarchy of blocks and
paragraphs that, at the lowest level, contains the individual words extracted,
and even single symbols, with the coordinates of their position within the image,
a confidence score and the detected language.
        </p>
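        <p>
          As an illustration of this block, the following sketch calls the Vision API client library on a single page image (the file name is a placeholder) and walks the block-paragraph-word-symbol hierarchy described above.
        </p>
        <preformat><![CDATA[
# Minimal OCR call with the Google Cloud Vision client library.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("page.png", "rb") as f:
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)

# Keep each word's text, bounding box and confidence for the mapping step.
for page in response.full_text_annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                text = "".join(s.text for s in word.symbols)
                box = [(v.x, v.y) for v in word.bounding_box.vertices]
                print(text, box, word.confidence)
]]></preformat>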
        <p>
          <bold>Hough Transform.</bold> By analyzing the state of the art for table extraction, it
emerges that one of the latest trends is based on deep learning techniques.
However, to train such models a lot of data must be available, and it is often difficult
to generalize well with public datasets. To overcome this limitation, we follow an
unsupervised approach based on the HT. Before applying the HT to detect lines,
however, we propose to pre-process the image to highlight the lines contained
in it, using a computer vision method called Edge Detection [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] that aims at
drastically reducing the amount of data to be processed while preserving the
structural information about the contours of the objects. This method uses a
convolution mask and its gradients together with two threshold values (upper and
lower) that define whether a pixel is accepted as an edge or not. More
precisely, a pixel is kept if its gradient value is greater than the upper
threshold and discarded if it is below the lower threshold; if it falls
in between, it is kept only if at least one neighboring pixel is above the
upper threshold. Once the image has been cleaned through the Edge
Detector, it is passed to the module that identifies lines using an
approach based on the HT. As described in [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], the Hough Line Detector is a
computer vision technique that extracts all the lines from the image,
considering as a line any run of aligned pixels that exceeds a certain length
and has at most a maximum number of missing pixels.
        </p>
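        <p>
          A minimal OpenCV sketch of this block is shown below; the hysteresis thresholds, minimum line length and maximum gap are illustrative values to be tuned on the actual documents.
        </p>
        <preformat><![CDATA[
# Edge detection followed by the probabilistic Hough transform.
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)  # lower/upper hysteresis thresholds

lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=200,  # minimum run of pixels for a line
                        maxLineGap=5)       # maximum number of missing pixels

horizontal, vertical = [], []
if lines is not None:
    for x1, y1, x2, y2 in lines.reshape(-1, 4):
        (horizontal if abs(y2 - y1) <= abs(x2 - x1) else vertical).append((x1, y1, x2, y2))
]]></preformat>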
      </sec>
      <sec id="sec-4-3">
        <title>Content mapping</title>
        <p>The fourth module of the methodology takes care of mapping the extracted
information into a schema that is easily searchable. In order to quickly access
the content of the text, we decided to organize it within a dynamic matrix
structure. Indeed, if each content is assigned to a cell whose position is known,
it is easy to search for the other contents associated with it in the adjacent cells,
since their positions can be used in the text search. As mentioned, thanks to
the Google Cloud Vision API, not only the textual content of the document
is available, but also the organization and position of the content within the
image. Thus, to create the matrix structure we used an approach which assumes
that a document may contain different separate tables and that each table
necessarily has at least one vertical line. The approach has six steps (steps 1-2
are sketched in code after the list):</p>
        <list list-type="order">
          <list-item><p>Scan all the vertical lines and group them by considering their ordinates: if lines have overlapping ordinates, they are grouped together.</p></list-item>
          <list-item><p>For each group of vertical lines, add to the group all the horizontal lines that fall within the group's range of ordinates.</p></list-item>
          <list-item><p>Divide each group into subgroups of directly and indirectly connected lines, where directly connected lines have a pixel in common, and indirectly connected lines are not directly connected with each other but are both directly connected with the same line or with a set of indirectly connected lines. Each subgroup thus identified represents a table.</p></list-item>
          <list-item><p>For each table, detect the number of rows and columns by counting the intersections of, respectively, the vertical and the horizontal line with the highest number of intersection points.</p></list-item>
          <list-item><p>Create a matrix with the identified number of rows and columns.</p></list-item>
          <list-item><p>Fill in the matrix with the text extracted by the OCR tool, selecting the proper cell using the returned coordinates and the coordinates of the identified table.</p></list-item>
        </list>
        <p>This approach generally creates a matrix with more cells than the actual table,
since it also considers tables with complex structures (such as rows or columns
with different numbers of cells). For this reason, in such cases, the text is
replicated in the different cells of the matrix that correspond to the same cell
of the table.</p>
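        <p>
          The sketch below illustrates steps 1-2 under the simplifying assumption that lines are represented as (x1, y1, x2, y2) tuples with y1 ≤ y2; it is a simplified rendition of the grouping, not the production implementation.
        </p>
        <preformat><![CDATA[
# Steps 1-2: group vertical lines by overlapping ordinate ranges, then attach
# the horizontal lines falling in each group's range.
def group_tables(vertical, horizontal):
    groups = []  # each group: {"y0", "y1", "v": vertical lines, "h": horizontal lines}
    for line in sorted(vertical, key=lambda l: l[1]):
        x1, y1, x2, y2 = line
        for g in groups:
            if y1 <= g["y1"] and y2 >= g["y0"]:          # overlapping ordinates
                g["v"].append(line)
                g["y0"], g["y1"] = min(g["y0"], y1), max(g["y1"], y2)
                break
        else:
            groups.append({"y0": y1, "y1": y2, "v": [line], "h": []})
    for line in horizontal:
        for g in groups:
            if g["y0"] <= line[1] <= g["y1"]:            # in the group's range
                g["h"].append(line)
                break
    return groups
]]></preformat>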
      </sec>
      <sec id="sec-4-5">
        <title>Information retrieval: search patterns</title>
        <p>The last step of the methodology deals with retrieving the information of
interest starting from the generated content matrix. The goal is therefore to define
search patterns for each value to be extracted. We defined two types of patterns:
anchor and text. The anchor pattern first searches for an anchor, i.e. one or more
terms from whose position it is then possible to trace back to the actual content;
an anchor pattern is formed by the fields described in Table 2. The second type of
pattern does not assume the existence of an anchor, but searches directly within
the cells for a specific content that satisfies a certain condition; its fields are
similar to those of the previous case, but the search starts from the first cell of
the matrix instead of from the anchor. With these two types of patterns it is
possible to cover all searches within the matrix, and the task of retrieval is
therefore enormously simplified. Indeed, the reorganization of the content into
matrices and the two search patterns defined allow for more complex and flexible
searches with respect to simple rule-based systems.</p>
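        <p>
          The following sketch illustrates the anchor pattern on a toy content matrix; the field names, the offset convention and the sample values are hypothetical stand-ins for the fields of Table 2.
        </p>
        <preformat><![CDATA[
# Illustrative anchor-pattern search over the content matrix.
def find_by_anchor(matrix, anchor_terms, row_offset=0, col_offset=1):
    """Locate a cell containing an anchor term, then return the content of
    the cell at the given offset from the anchor."""
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            if cell and any(t.lower() in cell.lower() for t in anchor_terms):
                r, c = i + row_offset, j + col_offset
                if 0 <= r < len(matrix) and 0 <= c < len(matrix[r]):
                    return matrix[r][c]
    return None

matrix = [["Invoice no.", "2021/042"],
          ["Total", "1,250.00 EUR"]]
print(find_by_anchor(matrix, ["total", "totale"]))  # -> "1,250.00 EUR"
]]></preformat>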
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>In the following, the experimental results of the methodology are presented. For
each step of the methodology, the results are distinguished per dataset A
(composed of good-quality editable PDFs of the same type) and dataset B (composed of
scanned PDFs of three different types), with the aim of verifying that the
methodology can be applied to both kinds of datasets.</p>
      <sec id="sec-5-1">
        <title>Image denoising</title>
        <p>
          To train the GAN algorithms, a public Kaggle dataset has been used [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The
dataset contains pairs of images with and without noise. The images have been
reduced to 256x256 crops and divided into training and test sets with an 80-20 split. The
total number of crops used is 436, equally divided between the class with noise and
the class without noise. As far as the cGAN is concerned, the network architecture
is composed of 3 convolutional layers with ReLU activation function, followed
by max-pooling and dropout. As far as the cycleGAN is concerned, the convolutional
network has been created using a U-Net architecture [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ]. This kind of network
is formed by a series of convolution layers followed by a series of transpose
convolution layers, with skip connections, that bring the image back to its original size.
Both the models have been trained using Google Cloud Vertex AI training jobs
that leverage a hyperparameter tuning tool based on Google Vizier [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ]. In both
cases, learning rate, batch size and number of epochs have been automatically
selected by the optimization algorithm. To measure the performance of these
algorithms, a metric called peak signal-to-noise ratio (PSNR) is used. This
measures the quality of a compressed image compared to the original one and is
de ned as the ratio between the maximum power of a signal and the power
of noise that can invalidate the delity of its compressed representation. Since
many signals have a very wide dynamic range, PSNR is usually expressed in
terms of the logarithmic decibel scale. The PSNR is de ned as
The PSNR obtained for the cGAN model is 8,93db while the one obtained for
cycleGAN is 23,245db. Thus, the model based on cycleGAN outperforms the
one based on cGAN.
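          <p>
            For reference, Eqs. (3)-(4) translate directly into the following NumPy sketch for 8-bit grayscale images (MAX<sub>I</sub> = 255).
          </p>
          <preformat><![CDATA[
import numpy as np

def psnr(clean, denoised, max_i=255.0):
    """PSNR in dB between a reference image and its denoised version."""
    mse = np.mean((clean.astype(np.float64) - denoised.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 20 * np.log10(max_i / np.sqrt(mse))
]]></preformat>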
      </sec>
      <sec id="sec-5-2">
        <title>Image classification</title>
        <p>To train the image classification module we used datasets A and B together.
Since dataset A is composed of just one kind of document and dataset B
is composed of three kinds of documents, the outcome of the model is composed
of four classes. The images have been divided into training and test sets with an
80-20 split. The models compared to perform template identification are SNNs
and CNNs. As far as the SNN is concerned, the two embedding towers placed in
parallel consist of 2 convolutional layers with ReLU activation function followed
by a fully-connected layer. In order to train the algorithm, training samples have
been paired randomly to obtain 500 paired samples of the same class and 500 of
different classes. To test the algorithm, all the training images have been used
as comparison samples against the test images and, to select the final class,
majority voting has been performed. As far as the CNN is concerned, the network
is composed of 3 different levels of convolution with ReLU activation function,
each followed by a max-pooling layer and dropout to avoid overfitting, with
a final fully-connected layer. The model has been tested using the same test
images as the SNN model. Both models have been trained using Google Cloud
Vertex AI training jobs with automatic hyperparameter optimization. In both
cases, learning rate, batch size and number of epochs have been automatically
selected by the optimization algorithm. As a comparison metric we decided to
rely on overall accuracy (OA). On the four classes, the CNN yielded 93.71%
OA while the SNN model yielded 94.33% OA. Thus, the SNN model yields slightly
better performance; its great advantage, however, is that, in an
industrial environment, this kind of model needs less training data than
the CNN, making it easier to train and use.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Table detection and information extraction</title>
        <p>To evaluate the performance of the table detection algorithm based on the HT,
we decided to count the number of lines (vertical and horizontal) correctly
extracted, undetected, partially identified and in excess. Furthermore, to
assess the actual usefulness of the pre-processing phase based on
cycleGAN, this computation was carried out both with and without the application of
the cycleGAN model. The results are reported in Table 3.</p>
        <p>From the results we can see that the application of GANs does not actually
impact the performance of line detection in the case of dataset A, since these
data are software-generated and thus perfectly clean. With respect
to dataset B, instead, their application enhances the results, showing that the algorithm
is able to improve the quality of some scanned documents.</p>
        <p>Finally, in order to evaluate the overall performance of the methodology, since the
final objective is to retrieve specific information from documents, we decided to
compare each single extracted field with the target of the extraction. To evaluate
the performance of our system we use a metric called Gestalt Pattern
Matching (GPM), which assigns a similarity value to two strings S1 and
S2 based on their lengths and on the number Km of matching
characters between them, where the matching characters are defined as
the longest common substring (LCS) plus, recursively, the numbers of matching
characters in the non-matching regions on both sides of the LCS:</p>
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math><![CDATA[ \mathrm{GPM} = \frac{2 K_m}{|S_1| + |S_2|} ]]></tex-math>
        </disp-formula>
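        <p>
          In practice, this is the Ratcliff/Obershelp similarity implemented by Python's standard difflib module, so each extracted field can be scored against its target as in the following sketch (the sample strings are hypothetical).
        </p>
        <preformat><![CDATA[
from difflib import SequenceMatcher

def gpm(extracted: str, target: str) -> float:
    """Gestalt Pattern Matching score of Eq. (5)."""
    return SequenceMatcher(None, extracted, target).ratio()

print(gpm("Neapolitan pizza", "Neapolitan pizzas"))  # ~0.97
]]></preformat>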
        <p>As can be seen from Table 4, in the case of dataset A the extractor achieves an exact
match both with and without the use of the GAN. This is likely because the
PDFs are perfectly clean, but also because the search task for these documents is
simpler and seems to be less influenced by a precise identification of the tabular
scheme (on which, however, the performance is good). As for dataset B,
on the other hand, since the search patterns are more complex and the quality
of the documents lower, the matching obtained is 0.81 (better in the case
of application of the GAN, confirming the effectiveness of the pre-processing layer).
Overall, our methodology achieves a 90.5% performance for the task of
information extraction in terms of GPM.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future works</title>
      <p>This study presents a methodology based on machine learning for the realization
of a general and customizable RPA system, adaptable to the type of documents and
information to be extracted. After an extensive analysis of the state of the art,
we developed a modular methodology that can adapt to different documents in
terms of template and content. For each module we tested different options. The
identified methodology consists of 5 modules: an image denoising module based
on cycleGAN; a document template identification module based on SNN; an
information extraction module based on table identification via HT and on text
extraction via the Google Cloud Vision API; a custom information mapping module
to organize the content into a matrix structure; and a query module that extracts
the necessary information through search patterns. The methodology has been
tested using the GPM score. Overall, the methodology performs well, with a score
of 0.905 (corresponding to a matching of about 90%), and also proved the effectiveness of
the image denoising algorithm. Finally, the implemented methodology has been
deployed in different industrial environments, with different document formats,
just by fine-tuning the template identification model and by defining the search
patterns. Some enhancements can be applied for better generalization, such as
using a deep learning approach to detect tables in the document, thus reducing
the error in line identification and, consequently, in information retrieval, or
designing some optimization algorithms to keep the complexity of the SNN model
low (due to the necessary scan of the comparison images at prediction time), which
could otherwise slow down the extraction of information in an industrial context.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Peanho</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stagni</surname>
          </string-name>
          , H., da
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>F.S.C.</given-names>
          </string-name>
          :
          <article-title>Semantic information extraction from images of complex documents</article-title>
          .
          <source>Applied Intelligence</source>
          <volume>37</volume>
          ,
          <fpage>543</fpage>
          -
          <lpage>557</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>An Overview of the Tesseract OCR Engine</article-title>
          .
          <source>In: Proceedings of the Ninth International Conference on Document Analysis and Recognition -</source>
          Volume
          <volume>02</volume>
          (
          <issue>ICDAR</issue>
          '07) pp.
          <fpage>629</fpage>
          -
          <lpage>633</lpage>
          . IEEE Computer Society, USA (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Adobe Systems Incorporated, PDF Reference, https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf.
          <source>Last accessed 11 Aug 2021</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chao</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            <given-names>J</given-names>
          </string-name>
          .:
          <article-title>Layout and Content Extraction for PDF Documents</article-title>
          . In Marinai S.,
          <string-name>
            <surname>Dengel</surname>
            <given-names>A</given-names>
          </string-name>
          .R. (eds.)
          <source>Document Analysis Systems VI. DAS 2004. LNCS</source>
          , vol
          <volume>3163</volume>
          , pp
          <fpage>213</fpage>
          -
          <lpage>224</lpage>
          . Springer, Berlin, Heidelberg (
          <year>2004</year>
          ). https://doi.org/10.1007/978-3-540-28640-0_20
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hamad</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaya</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A Detailed Analysis of Optical Character Recognition Technology</article-title>
          .
          <source>International Journal of Applied Mathematics, Electronics and Computers</source>
          <volume>4</volume>
          .
          <fpage>244</fpage>
          -
          <lpage>244</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kaur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khurana</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Page Segmentation in OCR System-A Review</article-title>
          .
          <source>International Journal of Computer Science and Information Technologies</source>
          <volume>4</volume>
          ,
          <fpage>420</fpage>
          -
          <lpage>422</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Shinde</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chougule</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Text Pre-processing and Text Segmentation for OCR</article-title>
          .
          <source>International Journal of Computer Science Engineering and Technology</source>
          .
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <fpage>810</fpage>
          -
          <lpage>812</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Trier, Ø.D.,
          <string-name>
            <surname>Jain</surname>
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taxt</surname>
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Feature extraction methods for character recognition - a survey</article-title>
          .
          <source>Pattern recognition 29 (4)</source>
          ,
          <fpage>641</fpage>
          -
          <lpage>662</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Shah</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karamchandani</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nadkar</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulechha</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koli</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lad</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>OCR-based chassis-number recognition using artificial neural networks</article-title>
          .
          <source>In: Proceedings of the 2009 IEEE International Conference on Vehicular Electronics and Safety (ICVES)</source>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>34</lpage>
          , IEEE Computer Society, India (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Rehman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saba</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Neural networks for document image preprocessing: state of the art</article-title>
          .
          <source>Artificial Intelligence Review</source>
          <volume>42</volume>
          ,
          <fpage>253</fpage>
          -
          <lpage>273</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Suen</surname>
            <given-names>CY</given-names>
          </string-name>
          .:
          <article-title>Character recognition by computer and applications, Handbook of pattern recognition and image processing</article-title>
          ,
          <fpage>569</fpage>
          -
          <lpage>586</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouget-Abadie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warde-Farley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ozair</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Generative Adversarial Networks</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          <volume>3</volume>
          (
          <issue>11</issue>
          ), (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sindagi</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>V.M.</given-names>
          </string-name>
          :
          <article-title>Image De-Raining Using a Conditional Generative Adversarial Network</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>30</volume>
          ,
          <fpage>3943</fpage>
          -
          <lpage>3956</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Gobel,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Oro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Orsi</surname>
          </string-name>
          , G.:
          <article-title>ICDAR 2013 Table Competition</article-title>
          .
          <source>In: 2013 12th International Conference on Document Analysis and Recognition</source>
          , pp.
          <fpage>1449</fpage>
          -
          <lpage>1453</lpage>
          . IEEE Computer Society, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Thong</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khuong</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trinh</surname>
            ,
            <given-names>L.B.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hyung-Jeong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuan</surname>
            ,
            <given-names>A.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SooHyung</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Learning to detect tables in document images using line and text information</article-title>
          .
          <source>In: Proceedings of the 2nd International Conference on Machine Learning and Soft Computing (ICMLSC '18)</source>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>155</lpage>
          ,
          Association for Computing Machinery, New York, USA (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Breuel</surname>
            ,
            <given-names>T. M.:</given-names>
          </string-name>
          <article-title>High performance document layout analysis</article-title>
          <source>In Proceedings of the Symposium on Document Image Understanding Technology</source>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>208</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Hamza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Belaïd, Y., Belaïd, A.:
          <article-title>A case-based reasoning approach for invoice structure extraction</article-title>
          .
          <source>In Proceedings of the Ninth International Conference on Document Analysis and Recognition</source>
          <year>2007</year>
          , pp.
          <fpage>327</fpage>
          -
          <lpage>331</lpage>
          . IEEE Computer Society (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Schulz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ebbecke</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gillmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adrian</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agne</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Seizing the treasure: Transferring knowledge in invoice analysis</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Document Analysis and Recognition</source>
          <year>2009</year>
          , pp.
          <fpage>848</fpage>
          -
          <lpage>852</lpage>
          , IEEE Computer Society (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Siamese neural networks for one-shot image recognition, ICML Deep Learning Workshop</article-title>
          . vol.
          <volume>2</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efros</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks</article-title>
          .
          <source>In Proceedings of the 2017 IEEE International Conference on Computer Vision</source>
          (ICCV), Venice, Italy, pp.
          <fpage>2242</fpage>
          -
          <lpage>2251</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osindero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Conditional generative adversarial nets</article-title>
          .
          <source>arXiv preprint arXiv:1411.1784</source>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Kaggle</surname>
          </string-name>
          , Denoising Dirty Documents, Dataset Competition, https://www.kaggle.com/c/denoising-dirty-documents/data,
          <source>Last accessed 11 Aug 2021</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shahroudy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shuai</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Recent advances in convolutional neural networks</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>77</volume>
          ,
          <fpage>354</fpage>
          -
          <lpage>377</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vig</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Learning to Clean: A GAN Perspective</article-title>
          . In: Carneiro, G., You, S. (eds.) Computer Vision - ACCV 2018 Workshops, LNCS, vol.
          <volume>11367</volume>
          . Springer, Cham (
          <year>2018</year>
          ). https://doi.org/10.1007/978-3-030-21074-8
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ronneberger</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brox</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>U-Net: Convolutional Networks for Biomedical Image Segmentation</article-title>
          . In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds.) Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, LNCS, vol.
          <volume>9351</volume>
          . Springer, Cham (
          <year>2015</year>
          ). https://doi.org/10.1007/978-3-319-24574-4_28
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Canny</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A Computational Approach to Edge Detection</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol. PAMI-
          <volume>8</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>679</fpage>
          -
          <lpage>698</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Kälviäinen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirvonen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oja</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Comparisons of Probabilistic and Non-probabilistic Hough Transforms</article-title>
          . In: Eklundh, J.O. (eds.) Computer Vision - ECCV '94, LNCS, vol.
          <volume>801</volume>
          . Springer, Berlin, Heidelberg (
          <year>1994</year>
          ). https://doi.org/10.1007/BFb0028367
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Ballard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Generalizing the Hough transform to detect arbitrary shapes</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>13</volume>
          ,
          <fpage>111</fpage>
          -
          <lpage>122</lpage>
          (
          <year>1981</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Golovin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solnik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moitra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kochanski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karro</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sculley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Google Vizier: A Service for Black-Box Optimization</article-title>
          .
          <source>In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17)</source>
          , pp.
          <fpage>1487</fpage>
          -
          <lpage>1495</lpage>
          . Association for Computing Machinery, New York, USA (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Téllez-Valero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A Machine Learning Approach to Information Extraction</article-title>
          . In: Gelbukh, A. (eds.) Computational Linguistics and Intelligent Text Processing,
          <source>CICLing</source>
          <year>2005</year>
          , LNCS, vol.
          <volume>3406</volume>
          . Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_58
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Vishwanath</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Deep Reader: Information Extraction from Document Images via Relation Extraction and Natural Language</article-title>
          . In: Carneiro, G., You, S. (eds.) Computer Vision - ACCV 2018 Workshops, LNCS, vol.
          <volume>11367</volume>
          . Springer, Cham (
          <year>2018</year>
          ). https://doi.org/10.1007/978-3-030-21074-8_15
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Palm</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>End-to-end information extraction from business documents</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>One-shot Text Field labeling using Attention and Belief Propagation for Structure Information Extraction</article-title>
          .
          <source>In: Proceedings of the 28th ACM International Conference on Multimedia</source>
          . Association for Computing Machinery, New York, USA (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Schreiber</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agne</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images</article-title>
          .
          <source>In: Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>
          , pp.
          <fpage>1162</fpage>
          -
          <lpage>1167</lpage>
          . IEEE, Japan
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Dataset, Ground-Truth and Performance Metrics for Table Detection Evaluation</article-title>
          .
          <source>In: Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS)</source>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>449</lpage>
          . IEEE Computer Society (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Hussein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdullah</surname>
          </string-name>
          , H.:
          <article-title>A new Approach for Detection and Extraction Tables in Scanned Document Image using Improved Hough Transform</article-title>
          .
          <source>Engineering and Technology Journal</source>
          ,
          <volume>34</volume>
          ,
          <fpage>738</fpage>
          -
          <lpage>753</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>37. Google Cloud Vision API Documentation, https://cloud.google.com/vision/docs, Last accessed 11 Aug 2021</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>