<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Deep Learning Based Methodology for Information Extraction from Documents in Robotic Process Automation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nicola</forename><surname>Massarenti</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Giorgio</forename><surname>Lazzarinetti</surname></persName>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">Noovle S.p.A.</orgName>
								<address><settlement>Milan</settlement><country>Italy</country></address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Italian &quot;Ministero dello Sviluppo Economico&quot;</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">Fondo per la Crescita Sostenibile</orgName>
								<orgName type="institution" key="instit2">Bando &quot;Agenda Digitale&quot;</orgName>
								<address>
									<postBox>D.M. Oct. 15th</postBox>
									<postCode>2014</postCode>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Deep Learning Based Methodology for Information Extraction from Documents in Robotic Process Automation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5255EA3BAA515D8FE70B10A9051EAC5E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:01+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Robotic Process Automation</term>
					<term>Optical Character Recognition</term>
					<term>Information Extraction</term>
					<term>Deep Learning</term>
					<term>Image Denoising</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, thanks to Optical Character Recognition techniques and to technologies for dealing with low scan quality and complex document structure, there has been a continuous evolution and automation of digitization processes to enable Robotic Process Automation.</p><p>In this paper we propose a methodology based both on deep learning algorithms (such as generative adversarial networks) and on statistical tools (such as the Hough transform) for the creation of a digitization system capable of managing critical issues such as low scan quality and complex document structure. The methodology is composed of five modules that manage the poor quality of scanned documents, identify the template, detect tables in documents, extract and organize the text into an easy-to-query schema, and perform queries on it through search patterns. For each module, different state-of-the-art algorithms are compared and analyzed, with the aim of identifying the best solution to be adopted in an industrial environment. The implemented methodology is evaluated against the business needs on real data by comparing the extracted information with the target values, and achieves a performance of 90% in terms of the Gestalt Pattern Matching measure.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Overview</head><p>With the spread of cameras on mobile devices, more and more images of scanned documents are collected in order to be digitized for different uses. Most digitization processes are still carried out manually today; however, thanks to recent advances in machine learning, it is possible to further automate them <ref type="bibr" target="#b0">[1]</ref>.</p><p>When dealing with information extraction from documents, Optical Character Recognition (OCR) techniques are the key technology; however, these alone are not enough to extract all the visual and structural information from scanned documents. Moreover, their power is limited when dealing with poor-quality scans or with documents with a complex structure. In this research we define a methodology for extracting information from scanned or editable documents, trying to manage and limit the noise coming from the low quality of the scans while taking the document structure into account. The methodology is designed using real data coming from two different companies and is tested against the companies' real business needs. The methodology we propose first uses a Generative Adversarial Network (GAN) <ref type="bibr" target="#b11">[12]</ref> to clean the scanned documents, then identifies the document template using a Siamese Neural Network (SNN) <ref type="bibr" target="#b18">[19]</ref> and then, using a method based on a computer vision technique called the Hough Transform (HT) <ref type="bibr" target="#b26">[27]</ref> and on the Google Cloud Vision API <ref type="bibr" target="#b36">[37]</ref> for OCR, identifies tables. Finally, an information mapping process is defined that delegates the personalization of content extraction to the drafting of a set of queries, thus making information retrieval simpler and more immediate. The rest of this paper is organized as follows: in chapter 2 the state of the art for information extraction is analyzed; in chapter 3 the actual use case and the real-world dataset are described; in chapter 4 the defined methodology is presented in detail; in chapter 5 the experimental results are summarized; in chapter 6 conclusions and future directions are discussed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">State of the art</head><p>Information retrieval from documents has been an important research area for several decades. With the advent of deep learning, OCR systems have become extremely powerful and usable, thanks to open source systems such as Tesseract <ref type="bibr" target="#b1">[2]</ref> and cloud API based solutions such as the Google Cloud Vision API <ref type="bibr" target="#b36">[37]</ref>. Today, interpreting documents with a simple text layout and good scan quality has become a trivial problem thanks to these recent developments <ref type="bibr" target="#b2">[3]</ref>, especially if the PDF is software-generated and editable, as described by H. Chao and J. Fan in <ref type="bibr" target="#b3">[4]</ref>. In the case of non-editable PDFs (scanned documents), the only applicable solution for text extraction is represented by OCR techniques. OCR techniques generally consist of five phases: pre-processing, segmentation, normalization, feature extraction and post-processing. In Table <ref type="table" target="#tab_0">1</ref> the main algorithms of each phase are presented.</p><p>The pre-processing step, which aims at eliminating noise in an image without losing any significant information, was traditionally performed with statistical and computer vision techniques, but recently some deep learning approaches have also been successfully proposed. One of the most used in the field of image denoising is represented by GANs <ref type="bibr" target="#b11">[12]</ref>, which have been employed in different image-to-image translation contexts, showing very good performance. Also for the feature extraction phase there are various techniques; today the main ones are based on the use of neural networks <ref type="bibr" target="#b8">[9]</ref>. 
In <ref type="bibr" target="#b9">[10]</ref> an overview of the state of the art of algorithms based on neural networks for OCR is presented, showing how these algorithms are able to achieve the best performance in the context of feature extraction. The study also highlights the impact of the features extracted by these algorithms on the classification task. As described by Suen in <ref type="bibr" target="#b10">[11]</ref>, there are two main classes of features: statistical and structural. Statistical features (such as moments, zoning, crossings, the Fourier transform and histogram projections) are also known as global features, while structural features (such as convexity or concavity of the characters, the number of holes or the number of endpoints) are known as local features. Nowadays, many tools, such as Tesseract <ref type="bibr" target="#b1">[2]</ref> or the Google Cloud Vision API <ref type="bibr" target="#b36">[37]</ref>, automatically perform these steps with high accuracy; however, the pre-processing and post-processing steps are extremely difficult to generalize, since they strongly depend on the input data and the expected output. Moreover, in the implementation of Robotic Process Automation (RPA) systems, text extraction is just one of many issues, and the main challenges are understanding the structure of the document and extracting visual entities like tables. In 2013 M. Göbel et al. proposed a first meticulous comparison of the performance of various table identification techniques over the ICDAR 2013 dataset <ref type="bibr" target="#b13">[14]</ref>. 
From 2013 to date, in addition to the enhancement of these deterministic approaches, several new techniques based on machine learning algorithms have been introduced, such as a method based on the identification of horizontal and vertical lines classified through a Support Vector Machine (SVM) <ref type="bibr" target="#b14">[15]</ref>, or methods based on Fast-RCNN (FRCNN) <ref type="bibr" target="#b33">[34]</ref> trained on the Marmot dataset for table recognition <ref type="bibr" target="#b34">[35]</ref>. Even though these techniques are extremely powerful, they need a lot of data to be trained and to generalize. For this reason, besides machine learning techniques, many computer vision techniques based on the Hough Transform (HT) <ref type="bibr" target="#b35">[36]</ref>, which is used universally today especially thanks to the discovery of its generalized form by Dana H. Ballard <ref type="bibr" target="#b27">[28]</ref>, have been proposed. Its power relies on the fact that this transform does not need training data, since it can be applied as a mathematical function. Overall, table extraction techniques are even more effective when combined with document structure identification techniques. Indeed, being able to identify the document template gives a priori knowledge of the structure of the text, and this knowledge can be exploited to facilitate the identification of objects within the document. With respect to this, there are several related studies <ref type="bibr" target="#b15">[16]</ref><ref type="bibr" target="#b16">[17]</ref><ref type="bibr" target="#b17">[18]</ref>.</p><p>In particular, some deep learning techniques such as Siamese neural networks (SNN) have been recently proposed. These models consist of two parallel groups of CNN layers that extract features from two distinct inputs: the document to classify and a document from the knowledge base, i.e. the set of possible templates onto which the document can be mapped. These algorithms are extremely powerful because they achieve very high performance even with little training data, thanks to a learning technique called one-shot learning <ref type="bibr" target="#b18">[19]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Related Works</head><p>Extracting information from documents is an active research field, and in recent years several works on the topic have been published. For instance, in <ref type="bibr" target="#b30">[31]</ref> Vishwanath D. et al. present an end-to-end framework that maps visual entities (such as tables, printed and handwritten text, boxes and lines) into a relational schema so that relevant relationships between entities can be established. The framework performs image denoising by means of a GAN and horizontal clustering to localize page lines. In <ref type="bibr" target="#b31">[32]</ref>, instead, the authors build an invoice analysis system that does not rely on templates of the invoice layout, but learns a single global model of invoices that naturally generalizes to unseen invoice layouts.</p><p>In <ref type="bibr" target="#b32">[33]</ref>, a framework that makes use of an attention mechanism to transfer layout information between document images is proposed. The authors also applied conditional random fields to the transferred layout information to refine the field labeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Setting</head><p>The goal of this research is to identify a methodology that allows the creation of an RPA system for extracting specific information from documents. The information to be extracted is driven by business needs and varies according to the use cases of the two pilot companies that provided the data. The goal of these two companies is to digitize the information in order to activate business processes of notarization and supplier management. In order to clarify the approach pursued, the datasets provided to implement the solution are described in the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dataset Description</head><p>The datasets used to define and test the methodology are composed of a collection of documents from two different companies. In one case, the documents are editable PDFs of production sheets (in the following we will refer to these data as dataset A); in the other case, the documents are scanned PDFs of technical product sheets, invoices and transport documents (we will refer to these data as dataset B). In the first case, the production sheet is composed of different pages, each with a body in the form of a table that contains several production sub-sheets, each representing an order to be placed with specific suppliers.</p><p>Production sub-sheets are identifiable as sets of contiguous populated lines, separated from other production sub-sheets by a blank line. The goal is to extract all the lines corresponding to each production sub-sheet to automatically activate the process of supplier selection. In the second case, the dataset is composed of three kinds of scanned documents: invoices with different templates; transport documents with different templates, signed and dated by hand; and technical product sheets in the form of a table that contains specific information about the products (EAN, ingredients, nutrition values, sterility requirements, etc.). Each document has its own set of information to be retrieved in order to be notarized via blockchain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Methodology</head><p>Based on the business needs expressed by the partner companies, a methodology consisting of five main steps has been identified. Starting from the document, a first pre-processing phase based on GANs reduces the noise of scanned documents. Subsequently, a template identification module based on deep learning models (CNN or SNN) is applied. Then a module detects and extracts tables, looking for the vertical and horizontal lines within the document (using the HT) to trace the number of columns and rows that may constitute them. The OCR phase (with the Google Cloud Vision API) follows, extracting the textual content, and the content mapping module then encapsulates the information within a matrix schema. Finally, the last module extracts the required information, looking for patterns defined in advance that depend on the type of document. The goal is to ensure that the methodology can be applied in different industrial environments according to the business needs expressed. The individual steps of the methodology are described in detail below.</p></div>
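The five modules above can be read as a single pipeline. The following Python skeleton is purely illustrative: every helper is a hypothetical placeholder standing in for the module of the corresponding subsection, not the actual implementation.

```python
def denoise_with_gan(image):          # 4.1 - stub for the GAN denoiser
    return image

def identify_template(image):         # 4.2 - stub for the CNN/SNN classifier
    return "invoice"

def detect_tables(image):             # 4.3 - stub for HT lines + Cloud Vision OCR
    return [], []

def map_content(tables, words):       # 4.4 - stub for the matrix mapping
    return [[]]

def parse_content(matrix, template):  # 4.5 - stub for the search patterns
    return {"template": template, "fields": {}}

def process_document(image):
    # Hypothetical skeleton of the five-step methodology; each call below
    # stands for one module of the pipeline described in this section.
    denoised = denoise_with_gan(image)
    template = identify_template(denoised)
    tables, words = detect_tables(denoised)
    matrix = map_content(tables, words)
    return parse_content(matrix, template)

print(process_document(object())["template"])
```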
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Pre-processing: image denoising</head><p>The first step of the methodology is represented by the pre-processing phase. Specifically, at this stage the goal is to reduce the noise present in the images of the scanned documents, since the quality of the documents influences all the subsequent steps of the methodology. As described in chapter 2, one of the most recent and effective methods to perform image denoising is represented by GANs <ref type="bibr" target="#b11">[12]</ref>. GANs are neural networks made up of two parts: a convolutional network called the generator, trained to generate synthetic samples from input data, and a convolutional network called the discriminator, trained to understand whether an image is real or generated. Formally, consider a generative network G that captures the data distribution and a discriminative network D that estimates the probability that an example derives from the training dataset rather than from G. To learn the generator distribution p g over data x, the generator builds a non-linear mapping from the a-priori noise distribution p z (z) to a space G(z; θ g ). The discriminator D(x; θ d ) produces as output a value that represents the probability that x derives from the training set rather than from p g . 
G and D are trained simultaneously: G's parameters are adjusted to minimize log(1 − D(G(z))) and D's parameters to maximize log D(x), as if they were playing a two-player min-max game with value function</p><formula xml:id="formula_0">V (G, D) : min G max D E x∼p data (x) [log D(x)] + E z∼p z (z) [log(1 − D(G(z)))]<label>(1)</label></formula><p>According to <ref type="bibr" target="#b23">[24]</ref>, there are different existing GAN architectures; we propose to compare the conditional GAN (cGAN) <ref type="bibr" target="#b12">[13]</ref> and the cycleGAN <ref type="bibr" target="#b20">[21]</ref>, since they represent the most recent developments in the field of image-to-image translation.</p><p>Conditional GAN GANs can be extended to a conditional model if both the generator and the discriminator are conditioned on extra information y.</p><p>The conditioning can be done by inserting y into both the generator and the discriminator as an additional input layer. In the generator, the a-priori input noise p z (z) and y are combined in a joint hidden representation <ref type="bibr" target="#b20">[21]</ref>. In this case the value function of the two-player min-max game is</p><formula xml:id="formula_1">V (G, D) : min G max D E x∼p data (x) [log D(x|y)] + E z∼p z (z) [log(1 − D(G(z|y)))] (2)</formula><p>Using a model of this type it is possible, by providing the model with images of noisy documents and their respective noise-free versions, to train it to produce, given an input document image, the image of the same document with reduced noise. To perform the training, it is necessary to have for each input also the target image, i.e. 
the dataset must be composed of pairs of noisy images and their respective noise-free images.</p><p>Cycle GAN Another extension of GANs, called cycleGAN <ref type="bibr" target="#b19">[20]</ref>, lets them learn a mapping function between two domains X and Y , given some training examples {x i } N i=1 where x i ∈ X and {y j } M j=1 where y j ∈ Y . The cycleGAN model includes two mapping functions G : X → Y and F : Y → X. Moreover, two adversarial discriminators D x and D y are introduced, where the goal of D x is to distinguish between images {x} and their corresponding translations {F (y)}; the same holds for D y with respect to images {y} and the corresponding translations {G(x)}. The goal is twofold: to reduce the adversarial losses so as to align the distributions of the generated images with those of the target images, and to reduce the cycle consistency loss to prevent the mappings learned by G and F from being contradictory.</p></div>
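As an illustration of the value function in Eq. (1), the two adversarial losses can be sketched in a few lines of numpy; the probability arrays below stand in for the outputs of a hypothetical discriminator and are not the actual cGAN or cycleGAN objectives used in this work.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes log D(x) + log(1 - D(G(z))); minimizing the negation is equivalent
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def generator_loss(d_fake):
    # G minimizes log(1 - D(G(z)))
    return np.log(1.0 - d_fake).mean()

# toy discriminator scores: close to 1 on real images, close to 0 on fakes
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

As expected from the min-max game, the discriminator loss shrinks as D separates real from fake, while the generator loss shrinks as G fools D.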
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Document template identification: image classification</head><p>The second step of the methodology is represented by the document template identification module. This module aims at determining the template of the input document, which can be accomplished by any image classification algorithm. In our research we compare two kinds of algorithms: a more consolidated one based on CNNs <ref type="bibr" target="#b22">[23]</ref> and a more recent one based on SNNs <ref type="bibr" target="#b18">[19]</ref>, which have shown high performance even with small datasets thanks to what is called "one-shot learning".</p><p>Siamese Neural Network SNNs are models that aim at recognizing, starting from two distinct inputs, whether they belong to the same class or not. The network operates in two main phases: in the first phase, the two input images are passed through a series of convolutional layers in order to obtain an embedding of each; in the second phase, a distance measure is calculated between the two output vectors. The output is a value indicating the distance between the two inputs, which is used to classify the images. This learning approach is also called "one-shot learning" since it is not necessary to have thousands of documents to carry out training: the network works on the features extracted by the CNNs that make up the SNN, and therefore even a single image per class, constituting the comparison sample, is sufficient.</p></div>
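A minimal sketch of the comparison step, assuming the twin branches have already produced embeddings; the vectors, distance choice (L1) and class labels below are invented for illustration and do not reproduce the trained SNN.

```python
import numpy as np

def snn_distance(emb_a, emb_b):
    # L1 distance between the embeddings produced by the two parallel CNN branches
    return np.abs(emb_a - emb_b).sum()

def classify_one_shot(query_emb, support):
    # support: {class_label: embedding of its single comparison sample};
    # the query is assigned to the class of the nearest support embedding
    return min(support, key=lambda c: snn_distance(query_emb, support[c]))

# toy support set: one embedding per template class, as in one-shot learning
support = {
    "invoice":   np.array([1.0, 0.0, 0.0]),
    "transport": np.array([0.0, 1.0, 0.0]),
    "product":   np.array([0.0, 0.0, 1.0]),
    "sheet_A":   np.array([1.0, 1.0, 0.0]),
}
print(classify_one_shot(np.array([0.9, 0.1, 0.0]), support))  # → invoice
```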
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Tables identification</head><p>Once the images have been denoised and the templates identified, the next module aims at locating the structural elements of the document, not only applying OCR to extract the text, but also identifying and mapping structural elements such as tables. This module therefore consists of two blocks: an OCR block for extracting the textual content, based on a pre-trained OCR tool, and a block for identifying the tables and organizing the content, based on computer vision techniques.</p><p>Google Cloud Vision API As an OCR tool we decided to use the Google Cloud Vision API, whose documentation is reported in <ref type="bibr" target="#b36">[37]</ref>. This API performs an analysis of the image layout to segment the areas where text is present. After this general localization phase, the OCR module recognizes the text in the detected areas and extracts it. Finally, the result is corrected through post-processing techniques based on language models and dictionaries. Most of these steps are carried out through the use of CNNs. The extraction performed by the Google Cloud Vision API OCR module returns the textual content together with the organization and position of the content within the image. More precisely, the output produced consists of a dictionary containing a structure divided into a hierarchy of blocks and paragraphs that, at the lowest level, contains the individual extracted words and even single symbols, with the coordinates of their position within the image, a confidence score and the detected language.</p><p>Hough Transform By analyzing the state of the art for table extraction, it emerges that one of the latest trends is based on deep learning techniques. However, to train such models a lot of data must be available, and it is often difficult to generalize well with public datasets. 
To overcome this limitation, we follow an unsupervised approach based on the HT. Before applying the HT to detect lines, however, we propose to pre-process the image to highlight the lines it contains using a computer vision method called Edge Detection <ref type="bibr" target="#b25">[26]</ref>, which aims at drastically reducing the amount of data to be processed while preserving the structural information about the contours of the objects. This method uses a convolution mask and its gradients together with two threshold values (upper and lower) that define whether a pixel is accepted as an edge or not. More precisely, a pixel is kept if its gradient value is greater than the upper threshold and discarded if it is below the lower threshold; if it lies in between, it is kept only if at least one neighboring pixel is above the upper threshold. Once the image has been cleaned through the Edge Detector, it is passed to the module that identifies lines using an approach based on the HT. As described in <ref type="bibr" target="#b26">[27]</ref>, the Hough Line Detector is a computer vision technique that extracts all the lines from the image, considering as lines all series of aligned pixels that exceed a certain length, with a maximum allowed number of missing pixels.</p></div>
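The line criterion just described (runs of pixels exceeding a minimum length, with a bounded number of missing pixels) can be sketched directly in numpy. This is a simplified stand-in for the HT-based detector, restricted to horizontal runs; thresholds and the toy image are invented.

```python
import numpy as np

def detect_horizontal_lines(edges, min_len=5, max_gap=1):
    # edges: binary image (1 = edge pixel). Returns (row, start_col, end_col)
    # runs at least min_len long, tolerating up to max_gap missing pixels.
    lines = []
    for r, row in enumerate(edges):
        start, gap = None, 0
        for c, v in enumerate(row):
            if v:
                if start is None:
                    start = c
                gap, end = 0, c
            elif start is not None:
                gap += 1
                if gap > max_gap:
                    if end - start + 1 >= min_len:
                        lines.append((r, start, end))
                    start, gap = None, 0
        if start is not None and end - start + 1 >= min_len:
            lines.append((r, start, end))
    return lines

img = np.zeros((4, 12), dtype=int)
img[1, 2:10] = 1   # a horizontal line ...
img[1, 5] = 0      # ... with one missing pixel, tolerated by max_gap=1
print(detect_horizontal_lines(img))  # → [(1, 2, 9)]
```

Detecting vertical lines reduces to the same scan on the transposed image.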
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Content mapping</head><p>The fourth module of the methodology takes care of mapping the extracted information into a schema that is easily searchable. In order to quickly access the content of the text, we decided to organize it within a dynamic matrix structure. Indeed, if each content is assigned to a cell whose position is known, it is easy to search for the other contents associated with it in the adjacent cells, since their positions can be used in the text search. As mentioned, thanks to the Google Cloud Vision API, not only the textual content of the document is available, but also the organization and position of the content within the image. Thus, to create the matrix structure we used an approach that assumes that a document may contain several separate tables and that each table necessarily has at least one vertical line. The approach has six steps: 1) Scan all the vertical lines and group them by considering their ordinates: lines with overlapping ordinates are grouped together. 2) For each group of vertical lines, add to the group all the horizontal lines that fall within the group's range of ordinates. 3) Divide each group into subgroups of directly and indirectly connected lines, where directly connected lines are lines that have a pixel in common, and indirectly connected lines are lines that are not directly connected with each other but are both directly connected with the same line or with a set of indirectly connected lines. Each subgroup thus identified represents a table. 4) For each table, detect the number of rows and columns by taking the number of intersections of, respectively, the vertical and the horizontal line with the highest number of intersection points. 5) Create a matrix with the identified number of rows and columns. 
6) Fill in the matrix with the text extracted by the OCR tool, selecting the proper cell using the returned coordinates and the coordinates of the identified table. This approach generally creates a matrix with more cells than the actual table, since it also handles tables with complex structures (such as rows or columns with different numbers of cells). For this reason, in such cases, text is replicated in the different cells of the matrix that correspond to the same cell of the table.</p><p>Table <ref type="table">2</ref>. Fields of the anchor search pattern.</p></div>
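Step 1 of the procedure above (grouping vertical lines with overlapping ordinates) can be sketched as follows; the greedy merge is a simplification of the full direct/indirect connectivity analysis of step 3, and the segment coordinates are invented.

```python
def overlaps(a, b):
    # True if two ordinate ranges (y0, y1) overlap
    return a[0] <= b[1] and b[0] <= a[1]

def group_vertical_lines(v_lines):
    # v_lines: (y0, y1) ordinate ranges of vertical segments.
    # Greedily merge segments whose ordinate ranges overlap into table groups.
    groups = []
    for seg in sorted(v_lines):
        for g in groups:
            if overlaps(g["range"], seg):
                g["segments"].append(seg)
                g["range"] = (min(g["range"][0], seg[0]),
                              max(g["range"][1], seg[1]))
                break
        else:
            groups.append({"range": seg, "segments": [seg]})
    return groups

# two segments sharing ordinates form one group; a distant one forms another
segs = [(0, 10), (5, 12), (40, 50)]
print([g["range"] for g in group_vertical_lines(segs)])  # → [(0, 12), (40, 50)]
```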
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Field / Description</head><p>anchor regex: Regular expression to search for in order to find the anchor.</p><p>content position rows: Relative rows (starting from the anchor position) within which to search for the content.</p><p>content position cols: Relative columns (starting from the anchor position) within which to search for the content.</p><p>content regex: Regular expression applied to the candidate cells within which to search for the content; each cell either satisfies the expression or not.</p><p>content lambda: Function to apply to the cell(s) that matched the regular expression in content regex.</p><p>content dim: Number of cells to keep among those returned.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Content parsing</head><p>The last step of the methodology deals with retrieving the information of interest from the generated content matrix. The goal is therefore to define search patterns for each value to be extracted. We defined two types of patterns: anchor and text. The anchor pattern first searches for an anchor, i.e. one or more terms from whose position it is then possible to reach the actual content. An anchor pattern is formed by the fields described in Table <ref type="table">2</ref>. The second type of pattern does not rely on the existence of an anchor, but searches directly within the cells for a specific content that satisfies a certain condition. Its fields are similar to those of the previous case, but instead of starting the search from the anchor it starts from the first cell of the matrix. With these two types of patterns it is possible to cover all searches within the matrix, and the retrieval task is therefore greatly simplified. Indeed, the reorganization of the content into matrices and the two search patterns defined allow for more complex and flexible searches than simple rule-based systems.</p></div>
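A simplified sketch of the anchor pattern applied to a content matrix: the field names follow Table 2, but content lambda and content dim are omitted, and the matrix values are invented for illustration.

```python
import re

def anchor_search(matrix, anchor_regex, rows, cols, content_regex):
    # matrix: content matrix produced by the mapping module (list of rows).
    # Find the anchor cell, then scan cells at the given relative row/column
    # offsets for content matching content_regex.
    hits = []
    for r, row in enumerate(matrix):
        for c, cell in enumerate(row):
            if not re.search(anchor_regex, cell):
                continue
            for dr in rows:
                for dc in cols:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < len(matrix) and 0 <= cc < len(matrix[rr]):
                        m = re.search(content_regex, matrix[rr][cc])
                        if m:
                            hits.append(m.group())
    return hits

matrix = [["Invoice no.", "2021/042"],
          ["Total", "1.250,00 EUR"]]
# anchor "Total", content one column to the right, numeric content
print(anchor_search(matrix, r"Total", rows=[0], cols=[1],
                    content_regex=r"[\d.,]+"))  # → ['1.250,00']
```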
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Results</head><p>In the following, the experimental results of the methodology are presented. For each step of the methodology, the results are distinguished per dataset A (composed of good-quality editable PDFs of the same type) and dataset B (composed of scanned PDFs of three different types), with the aim of verifying that the methodology can be applied to both kinds of dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Image denoising</head><p>To train the GAN models, a public Kaggle dataset has been used <ref type="bibr" target="#b21">[22]</ref>. The dataset contains pairs of images with and without noise. The images have been reduced to 256x256 crops and divided into training and test sets with an 80-20 split. The total number of crops used is 436, equally divided between the noisy class and the noise-free class. As far as the cGAN is concerned, the network architecture is composed of 3 convolution layers with ReLU activation, followed by max-pooling and dropout. As far as the cycleGAN is concerned, the convolutional network has been created using a U-Net architecture <ref type="bibr" target="#b24">[25]</ref>. This kind of network is formed by a series of convolution layers followed by a series of transposed convolution layers, with skip connections, to bring the image back to its original size. Both models have been trained using Google Cloud Vertex AI training jobs, which leverage a hyperparameter tuning tool based on Google Vizier <ref type="bibr" target="#b28">[29]</ref>. In both cases, learning rate, batch size and number of epochs have been automatically selected by the optimization algorithm. To measure the performance of these algorithms, a metric called peak signal-to-noise ratio (PSNR) is used. It measures the quality of a compressed image compared to the original one and is defined as the ratio between the maximum power of a signal and the power of the noise that can invalidate the fidelity of its compressed representation. Since many signals have a very wide dynamic range, PSNR is usually expressed on the logarithmic decibel scale. 
The PSNR is defined as</p><formula xml:id="formula_2">PSNR = 20 log 10 (MAX I / √MSE)<label>(3)</label></formula><p>where the Mean Square Error (MSE) between two M×N images I and K is defined as</p><formula xml:id="formula_3">MSE = (1/MN) Σ j=0..M−1 Σ i=0..N−1 ||I(i, j) − K(i, j)|| 2<label>(4)</label></formula><p>The PSNR obtained for the cGAN model is 8.93 dB, while the one obtained for the cycleGAN is 23.245 dB. Thus, the model based on cycleGAN outperforms the one based on cGAN.</p></div>
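Equations (3) and (4) can be checked numerically with a few lines of numpy; the toy images below are illustrative, not drawn from the experiments.

```python
import numpy as np

def mse(I, K):
    # Eq. (4): mean squared error between two images of equal shape (M, N)
    return np.mean((I.astype(float) - K.astype(float)) ** 2)

def psnr(I, K, max_val=255.0):
    # Eq. (3): 20 * log10(MAX_I / sqrt(MSE))
    return 20 * np.log10(max_val / np.sqrt(mse(I, K)))

clean = np.zeros((8, 8))
noisy = clean + 10.0   # uniform offset of amplitude 10 → MSE = 100
print(round(psnr(clean, noisy), 2))  # → 28.13
```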
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Image classification</head><p>To train the image classification module, we used datasets A and B together. Since dataset A is composed of a single kind of document and dataset B of three kinds of documents, the model output consists of four classes. The images have been divided into training and test sets with an 80-20 split. The models compared for template identification are SNNs and CNNs. As far as the SNN is concerned, the two embedding branches placed in parallel consist of 2 convolutional layers with ReLU activation function followed by a fully-connected layer. To train the algorithm, training samples have been paired randomly so as to obtain 500 pairs of the same class and 500 pairs of different classes. To test the algorithm, all the training images have been used as comparison samples against the test images, and the final class has been selected by majority voting. As far as the CNN is concerned, the network is composed of 3 convolutional levels with ReLU activation function, each followed by a max-pooling layer and dropout to avoid overfitting, with a final fully-connected layer. The model has been tested on the same test images as the SNN model. Both models have been trained using Google Cloud Vertex AI training jobs with automatic hyperparameter optimization; in both cases, the learning rate, batch size and number of epochs have been selected automatically by the optimization algorithm. As a comparison metric we relied on overall accuracy (OA). On the four classes, the CNN yielded 93.71% OA and the SNN model 94.33% OA. The SNN model thus yields slightly better performance; moreover, its main advantage is that, in an industrial environment, this kind of model needs less training data than the CNN, making it easier to train and deploy.</p></div>
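The majority-voting step described above can be sketched as follows. This is a hypothetical reconstruction: the paper does not specify how ties or below-threshold comparisons are handled, so the `threshold` parameter and the nearest-reference fallback are illustrative assumptions.

```python
from collections import Counter

def majority_vote(similarities, labels, threshold=0.5):
    """Classify a test image from its SNN similarity to labelled references.

    similarities[i] is the SNN similarity score between the test image and
    reference image i; labels[i] is the class of reference i. References the
    SNN judges as "same" (score above threshold) each cast one vote; if no
    reference passes the threshold, fall back to the single nearest one.
    """
    votes = [lab for sim, lab in zip(similarities, labels) if sim > threshold]
    if not votes:
        # no reference passed the threshold: use the most similar reference
        return labels[max(range(len(similarities)), key=similarities.__getitem__)]
    return Counter(votes).most_common(1)[0][0]
```

For instance, with scores [0.9, 0.8, 0.2, 0.7] against references labelled ["a", "a", "b", "b"], three references vote ("a", "a", "b") and the test image is assigned class "a".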
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Table identification</head><p>To evaluate the performance of the table detection algorithm based on the HT, we measured the number of lines (vertical and horizontal) correctly extracted, undetected, partially identified and exceeding. Furthermore, to assess the actual usefulness of the pre-processing phase based on cycleGAN, this evaluation was carried out both with and without the application of the cycleGAN model. The results are reported in Table <ref type="table" target="#tab_1">3</ref>. The results show that the application of GANs does not actually impact line detection performance in the case of dataset A, since these data are software-generated and, thus, perfectly clean. With respect to dataset B, instead, their application enhances the results, showing that the algorithm is able to improve the quality of some scanned documents.</p></div>
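The four-way line classification used in this evaluation can be sketched as below. The 1-D segment representation (one axis per orientation), the coverage thresholds `full_frac` and `partial_frac`, and the function names are illustrative assumptions, since the paper does not specify its matching criteria.

```python
def segment_overlap(a, b):
    """Length of the overlap between 1-D segments a = (lo, hi) and b = (lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def score_lines(detected, truth, full_frac=0.9, partial_frac=0.1):
    """Classify detected lines against ground-truth lines of one orientation.

    Each line is (row, start, end). A detection is matched to an unmatched
    truth line on the same row; it counts as 'correct' if it covers at least
    full_frac of that line, 'partial' if at least partial_frac, and
    'exceeding' (spurious) otherwise. Truth lines left unmatched are
    'undetected'.
    """
    correct = partial = exceeding = 0
    matched = set()
    for row, lo, hi in detected:
        best, best_cov = None, 0.0
        for i, (trow, tlo, thi) in enumerate(truth):
            if trow != row or i in matched:
                continue
            cov = segment_overlap((lo, hi), (tlo, thi)) / (thi - tlo)
            if cov > best_cov:
                best, best_cov = i, cov
        if best is not None and best_cov >= full_frac:
            correct += 1
            matched.add(best)
        elif best is not None and best_cov >= partial_frac:
            partial += 1
            matched.add(best)
        else:
            exceeding += 1
    undetected = len(truth) - len(matched)
    return {"correct": correct, "partial": partial,
            "exceeding": exceeding, "undetected": undetected}
```

Dividing each count by the total number of ground-truth lines gives percentages directly comparable to those in Table 3.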
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Information extraction: overall results</head><p>Finally, in order to evaluate the overall performance of the methodology, since the final objective is to retrieve specific information from documents, we compared each extracted field with the target of the extraction. To evaluate the performance of our system we used a metric called Gestalt Pattern Matching (GPM), which assigns a similarity value between two strings S1 and S2 based on the sizes of the strings and on the number K_m of matching characters between them, where the matching characters are defined as those in the longest common substring (LCS) plus, recursively, the matching characters in the non-matching regions on both sides of the LCS:</p><formula xml:id="formula_4">GPM = 2 K_m / (|S1| + |S2|)<label>(5)</label></formula><p>As can be seen from Table <ref type="table" target="#tab_2">4</ref>, in the case of dataset A, the extractor achieves an exact match both with and without the use of the GAN. This is likely because the PDFs are perfectly clean, but also because the search task for these documents is simpler and appears to be less influenced by a precise identification of the tabular scheme (on which, however, performance is good). For dataset B, on the other hand, since the search patterns are more complex and the quality of the documents lower, the GPM score obtained is 0.81, better in the case of application of the GAN, confirming the effectiveness of the pre-processing layer. Overall, our methodology reaches a GPM score of 90.5% for the task of information extraction. </p></div>
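The GPM metric of equation (5) is the Ratcliff/Obershelp similarity, which Python's standard library exposes through difflib; the wrapper function name below is our own.

```python
from difflib import SequenceMatcher

def gpm(s1: str, s2: str) -> float:
    """Gestalt Pattern Matching (Ratcliff/Obershelp) similarity.

    SequenceMatcher.ratio() returns 2 * K_m / (|s1| + |s2|), where K_m counts
    the characters of the longest matching block plus, recursively, the
    matching characters in the regions on either side of it.
    """
    return SequenceMatcher(None, s1, s2).ratio()
```

For example, gpm("abcd", "bcde") is 0.75: the longest matching block "bcd" gives K_m = 3, so 2 * 3 / (4 + 4) = 0.75.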
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and future works</head><p>This study presents a machine learning based methodology for the realization of a general and customizable RPA system, adaptable to the type of documents and information to be extracted. After an extensive analysis of the state of the art, we developed a modular methodology that can adapt to different documents in terms of template and content, testing different options for each module. The identified methodology consists of 5 modules: an image denoising module based on cycleGAN; a document template identification module based on SNN; an information extraction module based on table identification via HT and on text extraction via the Google Cloud Vision API; a custom information mapping module that organizes the content into a matrix structure; and a query module that extracts the necessary information through search patterns. The methodology has been evaluated using the GPM score. Overall, the methodology performs well, with a score of 0.905 (corresponding to a 90.5% match), and also proved the effectiveness of the image denoising algorithm. Finally, the implemented methodology has been deployed in different industrial environments, with different document formats, just by fine-tuning the template identification model and by defining the search patterns. 
Some enhancements could be applied for better generalization, such as using deep learning approaches to detect tables in the document, thus reducing errors in line identification and, consequently, in information retrieval, or designing optimization strategies to keep the complexity of the SNN model low, since the scan of the reference images required at prediction time could slow down the extraction of information in an industrial context.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Main algorithms and approaches for each OCR phase</figDesc><table><row><cell>OCR Phase</cell><cell>Algorithms and Approaches</cell></row><row><cell>Pre-Processing</cell><cell>Binarization, skew correction, filtering, thresholding, compres-</cell></row><row><cell></cell><cell>sion, thinning [5], GAN [12]</cell></row><row><cell>Segmentation</cell><cell>Top-down methods, bottom-up methods, hybrid methods [6, 7]</cell></row><row><cell>Normalization</cell><cell>Standard approaches [8]</cell></row><row><cell cols="2">Feature Extraction Neural Network based approach [9, 10]</cell></row><row><cell>Post-Processing</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3 .</head><label>3</label><figDesc>Overall results of table extraction algorithm</figDesc><table><row><cell>Metric</cell><cell cols="6">A-noGAN A-GAN B-noGAN B-GAN All-noGAN All-GAN</cell></row><row><cell>% correct lines</cell><cell>88.5%</cell><cell>88.5%</cell><cell>87%</cell><cell>90%</cell><cell>88%</cell><cell>89%</cell></row><row><cell cols="2">% exceeding lines 3%</cell><cell>3.5%</cell><cell>0.5%</cell><cell>0%</cell><cell>2%</cell><cell>2%</cell></row><row><cell cols="2">% undetected lines 5.5%</cell><cell>5%</cell><cell>12%</cell><cell>10%</cell><cell>8%</cell><cell>7.5%</cell></row><row><cell>% partial lines</cell><cell>3%</cell><cell>3%</cell><cell>0.5%</cell><cell>0%</cell><cell>2%</cell><cell>1.5%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4 .</head><label>4</label><figDesc>Overall results</figDesc><table><row><cell>Metric</cell><cell cols="6">A-noGAN A-GAN B-noGAN B-GAN All-noGAN All-GAN</cell></row><row><cell>GPM score</cell><cell>1</cell><cell>1</cell><cell>0.765</cell><cell>0.81</cell><cell>0.88</cell><cell>0.905</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Semantic information extraction from images of complex documents</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Peanho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Stagni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S C</forename><surname>Da Silva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="543" to="557" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">An Overview of the Tesseract OCR Engine</title>
		<author>
			<persName><forename type="first">R</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Document Analysis and Recognition -Volume 02 (ICDAR &apos;07)</title>
				<meeting>the Ninth International Conference on Document Analysis and Recognition -Volume 02 (ICDAR &apos;07)<address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="629" to="633" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<ptr target="https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf\reference\archives/PDFReference.pdf" />
		<title level="m">Adobe system Incorporated</title>
				<imprint>
			<date type="published" when="2021-08-11">11 Aug 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Layout and Content Extraction for PDF Documents</title>
		<author>
			<persName><forename type="first">H</forename><surname>Chao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fan</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-540-28640-0_20</idno>
		<ptr target="https://doi.org/10.1007/978-3-540-28640-0_20" />
	</analytic>
	<monogr>
		<title level="m">Document Analysis Systems VI. DAS 2004. LNCS</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Marinai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Dengel</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">3163</biblScope>
			<biblScope unit="page" from="213" to="224" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A Detailed Analysis of Optical Character Recognition Technology</title>
		<author>
			<persName><forename type="first">K</forename><surname>Hamad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kaya</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Applied Mathematics, Electronics and Computers</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="244" to="244" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Page Segmentation in OCR System-A Review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kaur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khurana</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Science and Information Technologies</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="420" to="422" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Text Pre-processing and Text Segmentation for OCR</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Shinde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Chougule</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Science Engineering and Technology</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="810" to="812" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Feature extraction methods for character recognition -a survey</title>
		<author>
			<persName><forename type="first">Ø</forename><forename type="middle">D</forename><surname>Trier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Taxt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern recognition</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="641" to="662" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">OCR-based chassis-number recognition using artificial neural networks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Karamchandani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nadkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gulechha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Koli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 IEEE International Conference on Vehicular Electronics and Safety (ICVES)</title>
				<meeting>the 2009 IEEE International Conference on Vehicular Electronics and Safety (ICVES)<address><addrLine>India</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="31" to="34" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Neural networks for document image preprocessing: state of the art</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Saba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence Review</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="253" to="273" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Character recognition by computer and applications, Handbook of pattern recognition and image processing</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">Y</forename><surname>Suen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1986">1986</date>
			<biblScope unit="page" from="569" to="586" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Generative Adversarial Networks</title>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pouget-Abadie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warde-Farley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ozair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">11</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Image De-Raining Using a Conditional Generative Adversarial Network</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sindagi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">M</forename><surname>Patel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Circuits and Systems for Video Technology</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="3943" to="3956" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Göbel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Oro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Orsi</surname></persName>
		</author>
		<title level="m">2013 12th International Conference on Document Analysis and Recognition</title>
				<meeting><address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1449" to="1453" />
		</imprint>
	</monogr>
	<note>ICDAR 2013 Table Competition</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Learning to detect tables in document images using line and text information</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">V</forename><surname>Thong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Khuong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B K</forename><surname>Trinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hyung-Jeong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Tuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Soo-Hyung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd International Conference on Machine Learning and Soft Computing (ICMLSC &apos;18)</title>
				<meeting>the 2nd International Conference on Machine Learning and Soft Computing (ICMLSC &apos;18)<address><addrLine>New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="151" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">High performance document layout analysis</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Breuel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Symposium on Document Image Understanding Technology</title>
				<meeting>the Symposium on Document Image Understanding Technology</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="209" to="208" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A case-based reasoning approach for invoice structure extraction</title>
		<author>
			<persName><forename type="first">H</forename><surname>Hamza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belaïd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Belaïd</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Document Analysis and Recognition</title>
				<meeting>the Ninth International Conference on Document Analysis and Recognition</meeting>
		<imprint>
			<publisher>IEEE Computer society</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="327" to="331" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Seizing the treasure: Transferring knowledge in invoice analysis</title>
		<author>
			<persName><forename type="first">F</forename><surname>Schulz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ebbecke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gillmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Adrian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dengel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th International Conference on Document Analysis and Recognition</title>
				<meeting>the 10th International Conference on Document Analysis and Recognition</meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="848" to="852" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Siamese neural networks for one-shot image recognition</title>
		<author>
			<persName><forename type="first">G</forename><surname>Koch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zemel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML Deep Learning Workshop</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Isola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Efros</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV)</title>
				<meeting>the 2017 IEEE International Conference on Computer Vision (ICCV)<address><addrLine>Venice, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2242" to="2251" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Conditional generative adversarial nets</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Osindero</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1411.1784</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<ptr target="https://www.kaggle.com/c/denoising-dirty-documents/data" />
		<title level="m">Kaggle, Denoising Dirty Documents, Dataset Competition</title>
				<imprint>
			<date type="published" when="2021-08-11">11 Aug 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Recent advances in convolutional neural networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kuen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shahroudy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shuai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">77</biblScope>
			<biblScope unit="page" from="354" to="377" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Learning to Clean: A GAN Perspective</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Abhishek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vig</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-21074-8</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-21074-8" />
	</analytic>
	<monogr>
		<title level="m">Computer Vision -ACCV 2018 Workshops</title>
		<title level="s">LNCS</title>
		<editor>
			<persName><forename type="first">G</forename><surname>Carneiro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>You</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">11367</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">U-Net: Convolutional Networks for Biomedical Image Segmentation</title>
		<author>
			<persName><forename type="first">O</forename><surname>Ronneberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brox</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-24574-4_28</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-24574-4_28" />
	</analytic>
	<monogr>
		<title level="m">Medical Image Computing and Computer-Assisted Intervention -MICCAI 2015</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Navab</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Hornegger</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Wells</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Frangi</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">9351</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">A Computational Approach to Edge Detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Canny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="679" to="698" />
			<date type="published" when="1986">1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Comparisons of Probabilistic and Non-probabilistic Hough Transforms</title>
		<author>
			<persName><forename type="first">K</forename><surname>Heikki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Petri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Erkki</surname></persName>
		</author>
		<idno type="DOI">10.1007/BFb0028367</idno>
		<ptr target="https://doi.org/10.1007/BFb0028367" />
	</analytic>
	<monogr>
		<title level="m">Computer Vision - ECCV &apos;94. ECCV 1994. LNCS</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">O</forename><surname>Eklundh</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">801</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Generalizing the Hough transform to detect arbitrary shapes</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ballard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="111" to="122" />
			<date type="published" when="1981">1981</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Google Vizier: A Service for Black-Box Optimization</title>
		<author>
			<persName><forename type="first">D</forename><surname>Golovin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Solnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Moitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kochanski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Karro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sculley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD &apos;17)</title>
				<meeting>the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD &apos;17)<address><addrLine>New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1487" to="1495" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">A Machine Learning Approach to Information Extraction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Téllez-Valero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-Y-Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Villaseñor-Pineda</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-540-30586-6_58</idno>
		<ptr target="https://doi.org/10.1007/978-3-540-30586-6_58" />
	</analytic>
	<monogr>
		<title level="m">Computational Linguistics and Intelligent Text Processing</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Gelbukh</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<biblScope unit="volume">3406</biblScope>
		</imprint>
	</monogr>
	<note>CICLing 2005. LNCS</note>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Deep Reader: Information Extraction from Document Images via Relation Extraction and Natural Language</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vishwanath</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-21074-8_15</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-21074-8_15" />
	</analytic>
	<monogr>
		<title level="m">Computer Vision - ACCV 2018 Workshops. ACCV 2018</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Carneiro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>You</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">11367</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Palm</surname></persName>
		</author>
		<title level="m">End-to-end information extraction from business documents</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">One-shot Text Field labeling using Attention and Belief Propagation for Structure Information Extraction</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM International Conference on Multimedia</title>
				<meeting>the 28th ACM International Conference on Multimedia<address><addrLine>New York, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images</title>
		<author>
			<persName><forename type="first">S</forename><surname>Schreiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dengel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmed</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</title>
				<meeting>the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)<address><addrLine>Kyoto, Japan</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1162" to="1167" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Dataset, Ground-Truth and Performance Metrics for Table Detection Evaluation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS)</title>
				<meeting>the 10th IAPR International Workshop on Document Analysis Systems (DAS)</meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="445" to="449" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">A new Approach for Detection and Extraction Tables in Scanned Document Image using Improved Hough Transform</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hussein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Abdullah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Engineering and Technology Journal</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="738" to="753" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<ptr target="https://cloud.google.com/vision/docs" />
		<title level="m">Google Cloud Vision API Documentation</title>
				<imprint>
			<date type="published" when="2021-08-11">11 Aug 2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
