<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KGSAR: A Knowledge Graph-Based Tool for Managing Spanish Colonial Notary Records</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shivika Prasanna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nouf Alrasheed</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parshad Suthar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pooja Purushatma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Praveen Rao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Grieco</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Missouri-Columbia</institution>
          ,
          <addr-line>Columbia, Missouri</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Missouri-Kansas City</institution>
          ,
          <addr-line>Kansas City, Missouri</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Notary records contain abundant information relevant to historical inquiry, but because they exist only in physical form, searching them for information can be painstaking. In this demo paper, we present a document retrieval system that allows users to search for a keyword in digitized copies of physical records. The system searches for keywords in cleaned and denoised images using optical character recognition (OCR) models retrained on labeled data provided by experts. The word predictions and bounding boxes are stored as a knowledge graph (KG). A keyword query is then mapped to a graph query on the KG, and the results are ranked based on text matching. An intuitive user interface (UI) allows a user to search, and to correct, delete, or draw additional annotations that are used to retrain the OCR models.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graphs</kwd>
        <kwd>information retrieval</kwd>
        <kwd>optical character recognition</kwd>
        <kwd>historical manuscripts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Knowledge graphs (KGs) have been widely used in information retrieval to return results for matched entities and their relationships [8]. In KGSAR, the Resource Description Framework (RDF) and SPARQL are used for efficient representation, indexing, and query processing of data extracted from the documents (e.g., predicted words) via OCR. The KG contains additional facts about the notaries, and is stored and queried using a fast graph database. The UI allows a user to provide additional training data for retraining the OCR models. The design of KGSAR is generic and can be easily adapted to other historical scripts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Deep learning techniques achieve high accuracy when large, labeled datasets are available [3].
They have also enabled high quality OCR on handwritten documents. To use existing OCR
models for specialized collections, we require high quality labeled data from experts.</p>
      <p>Alrasheed et al. [2] showed that after retraining on the Spanish-American notary records, Keras-OCR and YOLO-OCR achieved better performance than Kraken, Tesseract, and Calamari-OCR. When tested on our collection, the latter systems (which are based on pretrained models for the English language) were only able to detect lines rather than individual words and could not recognize any of the characters present in those lines [12, 13, 14]. For an image containing 670 manually annotated words, Keras-OCR [11] and YOLO-OCR [9, 10] were able to recognize 306 and 146 words, respectively, while Kraken, Tesseract, and Calamari-OCR were not able to recognize any words in the detected lines.</p>
      <p>Shaw et al. [5] proposed a system for digitizing handwritten medical prescriptions using an electronic writing pad, applying OCR techniques to recognize individual characters in the digital prescriptions rather than whole words. Sugawara et al. [7] proposed a method for retrieving Japanese keywords from a text query: they first generated an image of the query text using a generative semi-supervised model, and then retrieved regions in documents similar to the generated image by feature matching. Earlier work by Kim et al. [6] presented an end-to-end system that combined segmentation-based word recognition with a matching technique designed to handle the high-dimensional feature vectors representing the shape description of the characters in a word.</p>
      <p>Unlike most prior work, which focuses on text recognition, KGSAR aims to synergistically combine OCR and knowledge management techniques to facilitate efficient and accurate retrieval of 17th-century Spanish-American notarial scripts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Architecture of KGSAR</title>
      <p>Seventeenth-century Spanish American notarial scripts include multiple handwritings due to a high turnover rate in the notary office. Interim notaries did not receive extensive training; thus, the handwriting in the documents consists of highly irregular scripts. The current implementation of KGSAR stores 20,000 of the 200,000 images that comprise the entire digitized collection.</p>
      <p>KGSAR’s architecture is illustrated in Figure 1. Component (A) transforms the document scans into grayscale, applies a median filter to soften backgrounds and remove background noise, and applies image binarization to convert the images to black and white, as the scanned document images contained noise that affected feature extraction and classification [1]. Component (B) contains 83 cleaned images (166 manuscript pages) that were labeled by Spanish-proficient labelers. This yielded a dataset containing 26,482 words for retraining the OCR models. This dataset is from the hand of Baldibia y Brisuela, who, by 1650, acted as an interim notary in Buenos Aires, Argentina.</p>
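      <p>As a rough illustration of Component (A), the following is a minimal sketch of the grayscale conversion, median filtering, and binarization steps using OpenCV; the filter size, the Otsu thresholding choice, and the file names are assumptions rather than the exact KGSAR pipeline.</p>
      <preformat>
# Minimal sketch of the cleaning step: grayscale -> median filter -> binarization.
# Filter size, thresholding method, and file names are illustrative assumptions.
import cv2

def clean_scan(path_in, path_out):
    img = cv2.imread(path_in)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # grayscale conversion
    denoised = cv2.medianBlur(gray, 5)                   # median filter softens background noise
    _, bw = cv2.threshold(denoised, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # black-and-white binarization
    cv2.imwrite(path_out, bw)

clean_scan('raw_scan.png', 'cleaned_scan.png')           # hypothetical file names
      </preformat>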
      <p>Pretrained Keras-OCR and YOLO-OCR models failed to identify handwritten text, as they had been trained on printed English characters. Component (C) represents OCR model training, where the Keras-OCR recognizer was trained on 21,185 labeled words from 77 images and the pretrained detector was retained, as it was able to accurately draw bounding boxes around the words. YOLO-OCR was trained in a novel way: YOLO was trained as a word localizer to predict only the bounding box coordinates, and a convolutional recurrent neural network (CRNN) was trained as a recognizer to identify the text in the bounding boxes.</p>
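      <p>A minimal sketch of the Keras-OCR path of Component (C) at prediction time is shown below: the pretrained detector is paired with a recognizer whose weights have been fine-tuned on the labeled notarial words. The weights file and image path are hypothetical; the YOLO-OCR path (YOLO word localizer plus CRNN recognizer) follows the same two-stage detect-then-recognize pattern.</p>
      <preformat>
# Sketch: pretrained Keras-OCR detector + retrained recognizer (assumed file names).
import keras_ocr

detector = keras_ocr.detection.Detector()                 # pretrained word detector
recognizer = keras_ocr.recognition.Recognizer()           # CRNN-style word recognizer
recognizer.model.load_weights('retrained_recognizer.h5')  # hypothetical fine-tuned weights

pipeline = keras_ocr.pipeline.Pipeline(detector=detector, recognizer=recognizer)
image = keras_ocr.tools.read('cleaned_scan.png')          # hypothetical cleaned scan
predictions = pipeline.recognize([image])[0]              # list of (word, bounding box) pairs
for word, box in predictions:
    print(word, box.tolist())
      </preformat>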
      <p>The retrained models were then used to make predictions on about 20,000 unlabeled images. Component (D) denotes a KG representation built from the predictions. Entities such as the predicted words, bounding box coordinates, the image containing the predictions, and the OCR model type that produced them were stored as nodes in the KG. These nodes were connected using their respective relations and serialized into N-Triples format. The KG was stored in Blazegraph [4], a popular graph database, as denoted by Component (E). Blazegraph’s bulk data loader was used to load all the N-Triples files as an atomic transaction.</p>
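      <p>As an illustration of Components (D) and (E), the sketch below turns a single OCR prediction into triples and serializes them to N-Triples with rdflib; the namespace and property names are assumptions, not the exact KGSAR vocabulary, and the resulting .nt files would then be bulk-loaded into Blazegraph.</p>
      <preformat>
# Sketch: one OCR prediction as RDF triples (hypothetical namespace and properties).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

KG = Namespace('http://example.org/kgsar/')                 # hypothetical namespace
g = Graph()

pred = URIRef(KG['prediction/img0001_w42'])                 # hypothetical prediction node
g.add((pred, RDF.type, KG.PredictedWord))
g.add((pred, KG.text, Literal('poder')))                    # predicted word
g.add((pred, KG.boundingBox, Literal('112,385,210,421')))   # x1,y1,x2,y2 coordinates
g.add((pred, KG.fromImage, URIRef(KG['image/img0001'])))
g.add((pred, KG.predictedBy, Literal('Keras-OCR')))         # OCR model type

g.serialize(destination='predictions.nt', format='nt')      # N-Triples file for bulk loading
      </preformat>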
      <p>Component (F) denotes an intuitive Web UI for a user to pose a keyword query. The word and its n-grams (for word length &gt; 3) are used to construct a SPARQL query, which is executed by Blazegraph. We utilized Blazegraph’s FullTextSearch feature to perform exact and partial word matching. Each search result was scored by its cosine distance to the query text, so that exact matches and closer partial matches are ranked higher. The matching scans were then ranked to show the most relevant results. Component (G) denotes the annotation feature, where after a query a user can correct results, delete annotations, or annotate additional words, so that the OCR models can be retrained with better labeled data.</p>
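      <p>The sketch below shows how a keyword can be mapped to a SPARQL query that uses Blazegraph’s full-text search service and is posted to the SPARQL endpoint; the endpoint URL and the kg: property names are assumptions, and the word’s n-grams would be handled with similar patterns.</p>
      <preformat>
# Sketch: keyword -> SPARQL query using Blazegraph full-text search (assumed endpoint/schema).
import requests

ENDPOINT = 'http://localhost:9999/blazegraph/namespace/kb/sparql'  # hypothetical endpoint

def search(keyword):
    query = f'''
    PREFIX bds: &lt;http://www.bigdata.com/rdf/search#&gt;
    PREFIX kg:  &lt;http://example.org/kgsar/&gt;
    SELECT ?image ?text ?box ?score WHERE {{
      ?text bds:search "{keyword}" .     # exact and partial full-text matches
      ?text bds:relevance ?score .       # Blazegraph match score
      ?pred kg:text ?text ;
            kg:boundingBox ?box ;
            kg:fromImage ?image .
    }} ORDER BY DESC(?score)
    '''
    resp = requests.post(ENDPOINT, data={'query': query},
                         headers={'Accept': 'application/sparql-results+json'})
    return resp.json()['results']['bindings']

hits = search('poder')                    # matching scans can then be ranked for display
      </preformat>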
      <p>The UI was developed using HTML5 and AngularJS, and the backend code was developed in Python 3.8. We packaged the entire tool, the Blazegraph journal, and the JAR file into a Docker image to facilitate quick testing and experimentation.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Demonstration Scenarios</title>
      <p>During the demo, a user can interact with KGSAR by posing queries and correcting the bounding boxes as well as labeling new words. We highlight the primary features of KGSAR.</p>
      <p>Acknowledgments: This work was supported by a National Endowment for the Humanities (NEH) Digital Humanities Advancement Grant (HAA-271747-20) and a Research and Creative Works Strategic Investment Tier 3 Award from the University of Missouri System. We would like to thank Ryan Rowland and Adam Sisk for labeling a subset of the notary records.</p>
      <p>Note: <italic>poder</italic> refers to a power of attorney, a document that, to be valid, required notarial endorsement.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>[1]</label>
        <mixed-citation>N. Alrasheed, P. Rao, and V. Grieco, Character Recognition of Seventeenth-Century Spanish American Notary Records Using Deep Learning, Digital Humanities Quarterly, 15(4) (2021).</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>[2]</label>
        <mixed-citation>N. Alrasheed, S. Prasanna, R. Rowland, P. Rao, V. Grieco, and M. Wasserman, [...] Notary Records, In Proceedings of the 3rd Workshop on Structuring and Understanding of Multimedia heritAge Contents, 23-30 (October 2021).</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>[3]</label>
        <mixed-citation>J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, In 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 248-255 (2009).</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[4]</label>
        <mixed-citation>Blazegraph, https://blazegraph.com. Last accessed June 2022.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[5]</label>
        <mixed-citation>U. Shaw, R. Mamgai, and I. Malhotra, Medical Handwritten Prescription Recognition and Information Retrieval using Neural Network, In 2021 6th International Conference on Signal Processing, Computing and Control (ISPCC), 46-50 (2021).</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[6]</label>
        <mixed-citation>G. Kim, V. Govindaraju, and S. N. Srihari, An architecture for handwritten text recognition systems, International Journal on Document Analysis and Recognition, 2(1), 37-44 (1999).</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[7]</label>
        <mixed-citation>C. Sugawara, T. Miyazaki, Y. Sugaya, and S. Omachi, Text Retrieval for Japanese Historical Documents by Image Generation, In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, 19-24 (2017).</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[8]</label>
        <mixed-citation>R. Reinanda, E. Meij, and M. de Rijke, Knowledge graphs: An Information Retrieval Perspective, Foundations and Trends in Information Retrieval, 14(4), 289-444 (2020).</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[9]</label>
        <mixed-citation>J. Redmon and A. Farhadi, YOLO9000: better, faster, stronger, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7263-7271 (2017).</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[10]</label>
        <mixed-citation>J. Redmon and A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).</mixed-citation>
      </ref>
      <ref id="ref11">
        <label>[11]</label>
        <mixed-citation>Keras, https://keras.io. Last accessed March 2021.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>[12]</label>
        <mixed-citation>Tesseract, https://en.wikipedia.org/wiki/Tesseract_(software). Last accessed July 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <label>[13]</label>
        <mixed-citation>Kraken, http://kraken.re. Last accessed July 2021.</mixed-citation>
      </ref>
      <ref id="ref14">
        <label>[14]</label>
        <mixed-citation>Calamari OCR, https://calamari-ocr.readthedocs.io/en/latest/. Last accessed July 2021.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>