<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">KGSAR: A Knowledge Graph-Based Tool for Managing Spanish Colonial Notary Records</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Shivika</forename><surname>Prasanna</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Missouri-Columbia</orgName>
								<address>
									<settlement>Columbia</settlement>
									<region>Missouri</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nouf</forename><surname>Alrasheed</surname></persName>
							<email>nalrasheed@mail.umkc.edu</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Missouri-Kansas City</orgName>
								<address>
									<settlement>Kansas City</settlement>
									<region>Missouri</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Parshad</forename><surname>Suthar</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Missouri-Columbia</orgName>
								<address>
									<settlement>Columbia</settlement>
									<region>Missouri</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pooja</forename><surname>Purushatma</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Missouri-Columbia</orgName>
								<address>
									<settlement>Columbia</settlement>
									<region>Missouri</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Praveen</forename><surname>Rao</surname></persName>
							<email>praveen.rao@missouri.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Missouri-Columbia</orgName>
								<address>
									<settlement>Columbia</settlement>
									<region>Missouri</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Viviana</forename><surname>Grieco</surname></persName>
							<email>griecov@umkc.edu</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Missouri-Kansas City</orgName>
								<address>
									<settlement>Kansas City</settlement>
									<region>Missouri</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">KGSAR: A Knowledge Graph-Based Tool for Managing Spanish Colonial Notary Records</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">661E02670D8E09D1699F4F027139F2D2</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T03:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Knowledge graphs</term>
					<term>information retrieval</term>
					<term>optical character recognition</term>
					<term>historical manuscripts</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Notary records contain abundant information relevant to historical inquiry, but they exist only in physical form; hence, searching these documents for information can be painstaking. In this demo paper, we present a document retrieval system that allows users to search for a keyword in digitized copies of physical records. The system searches cleaned and denoised images for a keyword using optical character recognition (OCR) models retrained on labeled data provided by experts. The word predictions and bounding boxes are stored as a knowledge graph (KG), and a keyword query is mapped to a graph query on the KG. The results are ranked based on text matching. An intuitive user interface (UI) allows a user to search for keywords and to correct, delete, or draw additional annotations, which are used to retrain the OCR models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Historical manuscripts, such as 17th-century Spanish-American notarial scripts, carry a wealth of information that helps historians understand the social, economic, cultural, and political developments of different time periods. Manually searching through the scripts is time-consuming and limits the scope of a historian's research findings. With advances in deep learning, OCR models have become more accurate and efficient. However, the lack of high-quality training data for specific handwritten collections restricts the applicability of pretrained OCR models. Furthermore, efficient and accurate document retrieval is required because such collections can contain millions of handwritten words.</p><p>In this paper, we present a new document retrieval system called KGSAR (KG for Spanish American Notary Records) for a set of handwritten Spanish notary documents from the National Archives of Argentina. KGSAR synergistically combines retrained OCR models with a knowledge graph (KG) to address the challenges of accessing, reading, and searching within the documents. A KG can provide numerous benefits for information/document retrieval <ref type="bibr" target="#b7">[8]</ref>. It enables semantic search and a better understanding of users' queries and documents, and it can provide explanations for matched entities and their relationships <ref type="bibr" target="#b7">[8]</ref>. In KGSAR, the Resource Description Framework (RDF) and SPARQL are used for efficient representation, indexing, and query processing of the data extracted from the documents via OCR (e.g., predicted words). The KG also contains additional facts about the notaries and is stored and queried using a fast graph database. The UI allows a user to provide additional training data for retraining the OCR models. The design of KGSAR is generic and can be easily adapted to other historical scripts.</p></div>
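As a minimal sketch of the RDF representation described above, each OCR prediction can be serialized as N-Triples for bulk loading into a graph database. The namespace, predicate names, and ID scheme below are hypothetical illustrations, not the actual KGSAR vocabulary.

```python
# Serialize one predicted word (with its bounding box and source model)
# as N-Triples. The http://kgsar.example/ namespace and predicate names
# are invented for illustration.

KG = "http://kgsar.example/"  # hypothetical namespace

def prediction_to_ntriples(image_id, word, bbox, model):
    """Return N-Triples lines for a single OCR word prediction."""
    x, y, w, h = bbox
    subj = f"<{KG}pred/{image_id}/{x}_{y}>"
    return "\n".join([
        f'{subj} <{KG}ont#word> "{word}" .',
        f'{subj} <{KG}ont#inImage> <{KG}image/{image_id}> .',
        f'{subj} <{KG}ont#bbox> "{x},{y},{w},{h}" .',
        f'{subj} <{KG}ont#model> "{model}" .',
    ])

print(prediction_to_ntriples("img_0001", "poder", (10, 20, 80, 40), "keras-ocr"))
```

A file of such triples can then be loaded into the graph database in one bulk transaction and queried with SPARQL.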
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Deep learning techniques achieve high accuracy when large, labeled datasets are available <ref type="bibr" target="#b2">[3]</ref>. They have also enabled high-quality OCR on handwritten documents. However, applying existing OCR models to specialized collections requires high-quality labeled data from experts.</p><p>Alrasheed et al. <ref type="bibr" target="#b1">[2]</ref> showed that, after retraining on the Spanish-American notary records, Keras-OCR and YOLO-OCR outperformed Kraken, Tesseract, and Calamari-OCR <ref type="bibr">[12,</ref><ref type="bibr" target="#b11">13,</ref><ref type="bibr" target="#b12">14]</ref>. When tested on our collection, the latter systems (which are based on models pretrained for the English language) could only detect lines over words and could not recognize any of the characters in those lines. For an image containing 670 manually annotated words, Keras-OCR <ref type="bibr" target="#b10">[11]</ref> and YOLO-OCR <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref> recognized 306 and 146 words, respectively, while Kraken, Tesseract, and Calamari-OCR could not recognize any words in the detected lines.</p><p>Shaw et al. <ref type="bibr" target="#b4">[5]</ref> proposed a system for digitizing handwritten medical prescriptions using an electronic writing pad, applying OCR techniques to recognize individual characters, rather than whole words, in the digital prescriptions. Sugawara et al. <ref type="bibr" target="#b6">[7]</ref> proposed a method for retrieving Japanese keywords from a text query: they first generated an image of the query text using a generative semi-supervised model and then retrieved document regions similar to the generated image by feature matching. Earlier work by Kim et al. <ref type="bibr" target="#b5">[6]</ref> presented an end-to-end system that combined segmentation-based word recognition with a matching technique designed to handle the high-dimensional feature vectors representing the shapes of characters in a word.</p><p>Unlike most prior work, which focuses on text recognition, KGSAR synergistically combines OCR and knowledge management techniques to facilitate efficient and accurate retrieval from 17th-century Spanish-American notarial scripts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Architecture of KGSAR</head><p>Seventeenth-century Spanish American notarial scripts include multiple handwritings due to a high turnover rate in the notary office. Interim notaries did not receive extensive training; thus, the handwriting in the documents is highly irregular. The current implementation of KGSAR stores 20,000 of the 200,000 images that comprise the entire digitized collection.</p><p>KGSAR's architecture is illustrated in Figure <ref type="figure" target="#fig_0">1</ref>. Component (A) transforms the document scans into grayscale, applies a median filter to soften backgrounds and remove background noise, and applies image binarization to convert the images to black and white, since the scanned document images contained noise that affected feature extraction and classification <ref type="bibr" target="#b0">[1]</ref>. Component (B) contains 83 cleaned images (166 manuscript pages) that were labeled by Spanish-proficient labelers, yielding a dataset of 26,482 words for retraining the OCR models. This dataset is from the hand of Baldibia y Brisuela, who, by 1650, acted as an interim notary in Buenos Aires, Argentina.</p><p>Pretrained Keras-OCR and YOLO-OCR models failed to identify the handwritten text because they had been trained on printed English characters. Component (C) represents OCR model training: the Keras-OCR recognizer was trained on 21,185 labeled words from 77 images, while the pretrained detector was retained because it could accurately draw bounding boxes around the words. YOLO-OCR was trained in a novel way: YOLO was trained as a word localizer to predict only the bounding-box coordinates, and a convolutional recurrent neural network (CRNN) was trained as a recognizer to identify the text within the bounding boxes.</p><p>The retrained models were used to generate predictions on about 20,000 unlabeled images. Component (D) denotes a KG representation built from these predictions. Entities such as the predicted words, bounding-box coordinates, the image containing the predictions, and the OCR model used were stored as nodes in the KG. These nodes were connected by their respective relations and serialized into the N-Triples format. The KG was stored in Blazegraph <ref type="bibr" target="#b3">[4]</ref>, a popular graph database, as denoted by Component (E). A bulk data loader was used to load all the N-Triples files as a single atomic transaction.</p><p>Component (F) denotes an intuitive Web UI through which a user can pose a keyword query. The word and its n-grams (for word length &gt; 3) are used to construct a SPARQL query, which is executed by Blazegraph. We utilized Blazegraph's FullTextSearch feature to perform exact and partial word matching. Each search result was scored by its cosine distance to the query, so that words with exact matches and higher match probabilities rank higher. The matching scans were then ranked to show the most relevant results. Component (G) denotes the annotation feature: after a query, a user can correct the results, delete annotations, or annotate more words, so that the OCR models can be retrained with better labeled data.</p><p>The UI was developed using HTML5 and AngularJS, and the backend was developed in Python 3.8. We packaged the entire tool, the Blazegraph journal, and the JAR file into a Docker image to facilitate quick testing and experimentation.</p></div>
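The n-gram expansion and similarity ranking used by Component (F) can be sketched as follows. The use of character trigrams and this particular cosine-similarity computation are illustrative assumptions about how exact and partial matches might be scored, not the exact KGSAR implementation.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    # Character n-grams, used to broaden matching for words longer than 3 letters.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def cosine_similarity(a, b):
    # Cosine similarity between character-trigram count vectors of two words.
    # An exact match scores 1.0; partial overlaps score between 0 and 1.
    va, vb = Counter(char_ngrams(a)), Counter(char_ngrams(b))
    dot = sum(va[g] * vb[g] for g in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_results(query, predicted_words):
    # Rank candidate OCR predictions so exact matches come first,
    # followed by near matches (e.g., OCR transpositions like "podre").
    return sorted(predicted_words,
                  key=lambda w: cosine_similarity(query, w),
                  reverse=True)

ranked = rank_results("poder", ["pedro", "poder", "podre"])
```

Under this scoring, an exact match like "poder" ranks above a near match like "podre", which in turn ranks above an unrelated word, mirroring the exact-then-partial ordering described above.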
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Demonstration Scenarios</head><p>During the demo, a user can interact with KGSAR by posing queries, correcting bounding boxes, and labeling new words. We highlight the primary features of KGSAR. Figure <ref type="figure" target="#fig_1">2</ref> shows a screenshot of KGSAR after searching for the word poder (a power of attorney, a document that, to be valid, required notarial endorsement). The user will see the bounding boxes of the matched words in the images and can navigate through the images.</p><p>Figure <ref type="figure" target="#fig_2">3</ref> shows the annotation feature for the same word. Here, the user can see edit and delete options for the word, as well as the predicted value for the bounding box. The code for KGSAR is available on GitHub at https://github.com/MU-Data-Science/KGSAR. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: System Architecture</figDesc><graphic coords="3,132.04,84.19,331.21,146.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Search UI: results for keyword 'poder'</figDesc><graphic coords="4,171.64,147.54,252.00,247.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Annotate UI: user can edit, delete or view the prediction value</figDesc><graphic coords="5,153.64,84.19,287.99,140.49" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments: This work was supported by a National Endowment for the Humanities (NEH) Digital Humanities Advancement Grant (HAA-271747-20) and a Research and Creative Works Strategic Investment Tier 3 Award from the University of Missouri System. We would like to thank Ryan Rowland and Adam Sisk for labeling a subset of the notary records.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Character Recognition of Seventeenth-Century Spanish American Notary Records Using Deep Learning. Digital Humanities Quarterly</title>
		<author>
			<persName><forename type="first">N</forename><surname>Alrasheed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Grieco</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">15</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Evaluation of Deep Learning Techniques for Content Extraction in Spanish Colonial Notary Records</title>
		<author>
			<persName><forename type="first">N</forename><surname>Alrasheed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Prasanna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rowland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Grieco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wasserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Structuring and Understanding of Multimedia heritAge Contents</title>
				<meeting>the 3rd Workshop on Structuring and Understanding of Multimedia heritAge Contents</meeting>
		<imprint>
			<date type="published" when="2021-10">October 2021</date>
			<biblScope unit="page" from="23" to="30" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">ImageNet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE conference on computer vision and pattern recognition</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title/>
		<author>
			<persName><surname>Blazegraph</surname></persName>
		</author>
		<ptr target="https://blazegraph.com" />
		<imprint>
			<date type="published" when="2022-06">June 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Medical Handwritten Prescription Recognition and Information Retrieval using Neural Network</title>
		<author>
			<persName><forename type="first">U</forename><surname>Shaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mamgai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Malhotra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">6th International Conference on Signal Processing, Computing and Control (ISPCC)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="46" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An architecture for handwritten text recognition systems</title>
		<author>
			<persName><forename type="first">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Govindaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Srihari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Document Analysis and Recognition</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="37" to="44" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Text Retrieval for Japanese Historical Documents by Image Generation</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sugawara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Miyazaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sugaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Omachi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th International Workshop on Historical Document Imaging and Processing</title>
				<meeting>the 4th International Workshop on Historical Document Imaging and Processing</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="19" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Knowledge graphs: An Information Retrieval Perspective</title>
		<author>
			<persName><forename type="first">R</forename><surname>Reinanda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Meij</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Rijke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Foundations and Trends in Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="289" to="444" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">YOLO9000: better, faster, stronger</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="7263" to="7271" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.02767</idno>
		<title level="m">Yolov3: An incremental improvement</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title/>
		<author>
			<persName><surname>Keras</surname></persName>
		</author>
		<ptr target="https://keras.io" />
		<imprint>
			<date type="published" when="2021-03">March 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title/>
		<author>
			<persName><surname>Kraken</surname></persName>
		</author>
		<ptr target="http://kraken.re" />
		<imprint>
			<date type="published" when="2021-07">July 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title/>
		<author>
			<persName><surname>Calamari-OCR</surname></persName>
		</author>
		<ptr target="https://calamari-ocr.readthedocs.io/en/latest/" />
		<imprint>
			<date type="published" when="2021-07">July 2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
