<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>AIxIA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Word Spotting in Handwritten Historical Documents by N-gram Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe De Gregorio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Marcelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information and Electrical Engineering and Applied Mathematics - DIEM, University of Salerno</institution>
          ,
          <addr-line>Via Giovanni Paolo II 132, Fisciano (SA), 84084</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>28</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>We address the problem of handwritten word retrieval in documents belonging to small collections of historical interest. Word retrieval consists of finding the images of a given query word, and one of the techniques commonly used to solve this problem is Keyword Spotting (KWS), which promises to retrieve images of words without requiring explicit handwriting recognition. KWS systems, however, are limited by the problem of so called out-of-vocabulary (OOV) words, i.e. words not included in the training set, that cannot be retrieved. To overcome this limitation, we propose a KWS system that focuses the search on character sequences, referred to as N-grams, instead of whole words, thus aiming to make OOV words searchable. The system is based on a Siamese Network that searches for all the N-gram images of the training set that corresponds to the N-gram of the query word and outputs a ranked list of possible images of the searched word extracted from the collection. The results show that the system is able to retrieve words indiscriminately from the set of In-Vocabulary and Out-Of-Vocabulary words, showing similar performance in both cases, suggesting that focusing the search on N-grams may provide a valid solution to the OOV word search problem.</p>
      </abstract>
      <kwd-group>
        <kwd>Keyword Spotting</kwd>
        <kwd>Word Retrieval</kwd>
        <kwd>Historical Document</kwd>
        <kwd>Handwritten</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The digitization of paper documents has shown several advantages, and for this reason, it
constantly captures the attention of researchers. Digitising early written and printed documents
can make it possible to preserve a digital version even if the original document is destroyed
or damaged. Moreover, historical documents are typically stored in libraries and archives,
which may limit access to them, so digitisation makes it easier for researchers to access archival
collections by publishing images of the collections online and enabling new ways of access
and interaction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The digital format also allows the application of data mining, information
retrieval and document analysis techniques to handwritten paper documents using modern
tools based on computer vision, document analysis and machine learning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In this paper, we focus on handwritten word retrieval, which consists of finding images of a
given query word from a collection of documents. We concentrate on small collections,
comprising 50 pages or less of handwritten documents of historical interest, as they present
typical and unique features. Collections with these characteristics can make word retrieval
demanding, as handwriting recognition techniques can produce unsatisfactory results. To
get around the problem, the KeyWord Spotting (KWS) technique promises to retrieve words
without the need for explicit recognition [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. KWS based on a Query by Example (QbE) paradigm
requires the construction of a reference dictionary consisting of transcription/image pairs of
words that can be searched. This effectively limits the search to the words in the dictionary and
creates the problem of Out of Vocabulary (OOV) words for which no search can be performed.
One solution followed by OCR-oriented systems is to concentrate the search on individual
characters and to assemble the searched words only at a later stage. It is indeed possible
to create an almost complete dictionary of examples of characters, even starting from a few
reference pages. However, applying this idea to cursive writing is difficult. The main issue is
that segmentation at the character level of the handwritten script is a hard-to-solve problem
because of the spatial and stylistic variability typical of cursive handwriting. On the other hand,
segmentation into sequences of a few characters, which we will henceforth call N-grams, is
certainly a simpler process than character segmentation, and the dictionary of N-grams that can
be derived from a few pages of reference would cover a larger percentage of the text present in
the whole collection compared to the word-based dictionary, thus reducing the limit of OOV
word recovery.
      </p>
      <p>
        Focusing our attention on short sequences rather than whole words or individual characters
also finds a motivation in the fine motor skills acquired by an individual while learning to
write. Studies on motor behaviour have shown that writing is the
result of precise motor actions that can be automated [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. In general, during the learning
phase, an individual tends to develop motor programs associated with motor actions with a
high frequency of execution. This leads to the development of motor primitives associated
with motor actions that are frequently performed and to the definition of motor programs that
encode the execution of the movement itself and that are stored in the brain and activated each
time that movement is performed. In the field of cursive handwriting, it is plausible to assume
that the motor programs are developed not in terms of single characters or whole words, the
first being too short and the latter too long and complex, but in sequences of a few characters.
This would mean that every time a subject writes an N-gram to which a motor program is
assigned, they produce an ink trace that is always compatible with and similar to all the others.
The repeated similarity in the execution of the same movements for the N-grams could make
the N-grams recognisable, thus making them ideal candidates for the recognition of cursive
handwriting. With this work, we propose a KWS model capable of recognising words in a small
collection of handwritten documents, using N-grams extracted from a subset of the collection
as recognition primitives. The KWS follows the Query by String (QbS) paradigm, as it receives
as input the string of characters representing the query word, but the core of the system is an
N-grams spotting system based on the Query by Example paradigm.
      </p>
      <p>Below, a brief overview of the state of the art is given in Section 2, the method is then
presented in Section 3, and the experimental results are reported in Section 4. Finally,
Section 5 presents the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>
        Recognition-free retrieval, also known in the literature as word spotting or keyword spotting
(KWS), was developed as an alternative to recognition-based information retrieval. Its purpose
is to find all instances of a query in a set by evaluating the similarity between elements and
returning as output the results that appear most similar with respect to a common representation
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. KWS was first applied to the field of handwritten documents of historical interest by
Manmatha et al. in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Since then, various solutions have been proposed and different paradigms
have been defined. The first important distinction for KWS systems is defined by the query
mode of the system. The system can be queried by providing an image of a word to search
for similar words in an entire document. This approach is called Query by Example (QbE)
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Alternatively, it is possible to query the system with a text string and expect images that
contain the word, an approach called Query by String (QbS) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] approaches. Another important
classification distinguishes between segmentation-based and segmentation-free approaches.
The former start from the word segmentation of the data collection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and the latter [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
apply directly to the entire page or lines of the document. These approaches are less used,
although they avoid the problems of poor segmentation. An important distinction concerns
lexicon-based and lexicon-free approaches. The difference lies in the presence of a reference
lexicon that collects all the words that the system can recognise and thus retrieve. Lexicon-based
approaches prove to be more effective than lexicon-free methods [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The main limitation
of lexicon-based systems is the problem of OOV (Out Of Vocabulary) words, i.e. words for
which the system knows no representation and which therefore cannot be recovered.
      </p>
      <p>
        Some solutions have been presented in the literature to alleviate the problem of OOV. A
naive solution to the OOV problem is to expand the reference dictionary. However, expanding
the dictionary comes at a cost in execution time. Rabiner et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] try to solve the problem
by introducing different systems with small and complementary dictionaries. However, the
application of solutions aimed at increasing the size of the lexicon is only feasible if it is possible
to expand it. If the initial data of the dictionary is limited, such approaches are not feasible.
Puigcerver et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose a solution that does not involve expanding the dictionary by
introducing a similarity metric between words that can also be applied to OOV. Brakensiek
et al. in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] present a recognition system that uses Markov models together with language
models based on N-grams. In all these solutions, the performance of OOVs improves slightly
but remains highly dependent on the size of the dictionary and the ability to train a language
model. For these reasons, we believe that the problem of OOV words remains an open problem.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>The general idea behind the system is to search for a query word within the image of a document
page by decomposing the word into the N-grams for which the system can perform a search.
Figure 1 shows the general system workflow. Once the set of query N-grams has been determined,
the N-gram spotting phase allows the identification of the positions of the different N-grams
within the document, which are finally analysed to determine the position of the entire query
word. In the next sections we will describe in detail the different phases of the process workflow,
starting with the deconstruction of the query word and the definition of the set of query
N-grams, then the analysis of the N-gram spotting phase, and finally the process of combining the
identified N-grams to recover the original query word.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Word Deconstruction</title>
        <p>The entire system receives as input an image of a page of a document and a string of the word
to search. Since the examination scheme is to search for the N-grams and not the whole word, it
is crucial to define the N-grams query set. This set is composed of all the N-grams of the query
word, defining the maximum depth N, where N represents the maximum number of characters
to be considered. When the next step of searching for the N-grams is performed with a system
based on a dictionary, it is necessary to purge the set of N-grams of the query of all possible
N-grams that do not fit into the reference dictionary.</p>
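<p>As an illustration, the word deconstruction step can be sketched in a few lines of Python; the function name and the optional dictionary filter are our own illustration, not the system's actual code:</p>

```python
def ngram_query_set(word, max_n=3, dictionary=None):
    """Collect all N-grams of the query word, from 2 characters up to max_n.

    If a reference dictionary is given, N-grams absent from it are purged,
    as required when the downstream search is dictionary-based.
    """
    grams = {word[i:i + n]
             for n in range(2, max_n + 1)
             for i in range(len(word) - n + 1)}
    if dictionary is not None:
        grams &= set(dictionary)
    return grams
```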
      </sec>
      <sec id="sec-3-3">
        <title>3.3. N-gram Spotting</title>
        <p>
          The problem reduces to searching the image of the page for the areas containing the N-grams
of the query set, performing the N-gram spotting operation. For this purpose, we have defined
an N-gram spotting system based on the QbE paradigm. The system, therefore, requires the
definition of a reference dictionary consisting of images of N-grams to be searched. The search
thus consists of identifying the areas of the image that appear most similar to the images of the
N-grams of the reference dictionary. For this to be possible, a measure of similarity between the
images must be defined. The measure of image similarity can be learned using neural networks
structured according to the paradigm of Siamese architectures. The network allows obtaining a
measure of similarity between at least two inputs, which in our case may consist of two N-gram
images [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The architecture of one branch of the network is kept quite simple and consists of
a convolutional backbone for extracting features from the image, followed by a fully connected
layer for refining the encoding of the final embedding. Figure 2 shows the network architecture
in the top right; images of two N-grams are fed into the network and a measure of the similarity
between them is provided, which is smaller the more similar the input images are to each other.
The two arms of the network share the weights, as per the Siamese network paradigm. The
convolutional network used to extract the features from the images is a PHOCNet network that
has shown good performance in identifying handwritten words [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
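<p>The branch-plus-distance idea can be sketched as follows; here the callable embed stands in for the shared branch (convolutional backbone plus fully connected layer), which is an assumption of this sketch rather than the paper's code:</p>

```python
import numpy as np

def siamese_similarity(embed, img_a, img_b):
    """Similarity between two N-gram images as the distance between the
    embeddings produced by the shared-weight branch (lower = more similar)."""
    e_a, e_b = embed(img_a), embed(img_b)
    return float(np.linalg.norm(e_a - e_b))
```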
        <p>The system then assumes that the document page is segmented into lines of text. The search
process is performed on the text lines by sliding a window that crops out part of the image
to be compared with one of the N-grams of the query set, and a similarity score between the
two images is provided by the network. In the end, the scoring trend of the similarity on the
entire line is computed, as shown in Figure 2. Analyzing the trend, the minimum peaks should
correspond to an instance of the N-gram we are looking for. The N-gram spotting phase then is
repeated for all the N-gram classes contained in the N-grams query set.</p>
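<p>The sliding-window search over a text line can be sketched as follows; the stride value and the similarity callable (standing in for the Siamese network) are illustrative assumptions:</p>

```python
import numpy as np

def spot_ngram(line_img, ngram_img, similarity, stride=4):
    """Slide a window the width of the N-gram image along a text line,
    scoring each crop (lower = more similar), and return the score trend
    together with its local minima (candidate N-gram positions)."""
    w = ngram_img.shape[1]
    xs = range(0, line_img.shape[1] - w + 1, stride)
    scores = np.array([similarity(line_img[:, x:x + w], ngram_img) for x in xs])
    minima = [i for i in range(1, len(scores) - 1)
              if scores[i] < scores[i - 1] and scores[i] <= scores[i + 1]]
    return scores, minima
```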
        <p>The system provides the option to choose the number of samples of each class of N-grams
to use for the search. If this search cardinality is greater than 1, we could find multiple peaks
in the same region from searches with different items of the same class of N-grams. To obtain a
single similarity trend for each class of N-gram, any two overlapping solutions S1 and S2,
i.e. solutions with peaks in the same region of the line obtained with two different instances
of N-grams of the same class, are merged by reshaping the score according to the following
expression:</p>
        <p>S = min(S1, S2) − R</p>
        <p>where R is the reward attributed to the new solution and is equal to:</p>
        <p>R = (1 − |S1 − S2|) · (3/4)^min(S1, S2)</p>
        <p>In this way, pairs of similarity scores that have a small difference between them and
values close to zero are rewarded.</p>
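<p>Numerically, the merging can be sketched as follows; note that the exact form of the reward term is our reconstruction of the expression above, so treat it as an assumption:</p>

```python
def merge_scores(s1, s2):
    """Merge two overlapping similarity scores (lower = more similar).

    The reward is large when the two scores agree (small |s1 - s2|) and
    when they are close to zero; it is subtracted from the better score.
    """
    reward = (1 - abs(s1 - s2)) * (3 / 4) ** min(s1, s2)
    return min(s1, s2) - reward
```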
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Word Reconstruction</title>
        <p>The ultimate goal is to determine the position of a whole query word, starting from the analysis
of the results of the N-gram spotting phase. To this end, we look for "high-density areas", i.e.
areas of the text line where there are overlaps of searched N-grams. If the N-gram spotting phase
was successful for all classes of N-grams, a density area could correspond to an area where the
searched word occurs. However, it is important to note that the detection of density areas alone
is not sufficient to confirm that these zones correspond to the position of the searched words.
This is because the density area does not take into account the position of the N-grams and
could therefore contain different anagrams of the query word. Moreover, the N-gram spotting
phase is not error-free and could yield density zones that do not contain all the N-grams of the
word. Therefore, a density area evaluation phase is required in which a confidence measure is
assigned to each area. A confidence measure has been defined with a value in the range (0, 100),
where 100 represents maximum confidence estimated considering three criteria: 1) number of
retrieved N-grams; 2) mean similarity score of the retrieved N-grams; 3) position of the retrieved
N-grams.</p>
        <p>As for the first criterion, the more N-grams are detected in the density area, the greater the
confidence measure. For a maximum confidence measure, the density area must have exactly
the number of expected N-grams. More generally, the confidence level varies linearly with the
number of detected N-grams. For example, if the number of detected N-grams equals half of the
expected number, the confidence measure is halved.</p>
        <p>In applying the second criterion, note that each N-gram was recognised with a similarity score.
The lower the similarity score, the more reliable the prediction. The confidence measure can
then be reshaped by subtracting from it the average value of the similarity scores of all N-grams
belonging to the density area. In the optimal case, each detected N-gram has a similarity value
of zero, i.e. each detection is maximally reliable; in this case, the confidence does not change
because every N-gram in the area can be trusted.</p>
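<p>The first two criteria can be combined as in the following sketch, where the similarity scores are assumed to be normalised to [0, 1] and the function name is our own:</p>

```python
def base_confidence(n_found, n_expected, scores):
    """Criteria 1 and 2: confidence scales linearly with the fraction of
    expected N-grams found, then the mean similarity score (assumed to be
    normalised to [0, 1]) is subtracted from it."""
    conf = 100.0 * n_found / n_expected
    if scores:
        conf -= 100.0 * sum(scores) / len(scores)
    return max(conf, 0.0)
```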
        <p>To evaluate the position of the N-grams in the density area, we can calculate the pyramidal
decomposition of the query word and consider the different sets of N-grams at the different
levels of the representation. To calculate the representation, the word must, at each level,
be divided into a number of parts corresponding to the depth of the level. In other words, level
two of the representation consists of the query word divided into two parts, the third level
consists of the word divided into three parts, and so on. We can assign the set of N-grams
that make up every single sequence of the representation, building the pyramidal N-gram
sets representation of the query word. Similarly, the pyramidal representation of the ordered
N-grams of the density area can be calculated. The density area is divided into an increasing
number of contiguous zones, level by level, and the sets of N-grams belonging to the different
zones are constructed. If the density area is consistent with the query word, consistency must be
maintained between both the pyramidal representations at all levels. To assess this consistency,
the number of N-grams of the pyramidal representation of the density area that do not match
the pyramidal representation of the word query is counted. An N-gram of the density area
is inconsistent with the word query representation if it is present in a set of N-grams at a
particular level of the density zone representation but is not present in the relative set of the
word query representation at the same level. The confidence value can then be reshaped based
on the ratio between the inconsistent and consistent N-grams. If all N-grams from level 2 of the
representation are inconsistent, the confidence value is reduced by 100%. If, on the other hand,
all N-grams are consistent, the reshaping does not affect the confidence value.</p>
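<p>The pyramidal N-gram sets of a query word can be built as in the following sketch, which splits the word into an increasing number of contiguous parts (the ceil-division split is an assumption about how unequal parts are handled):</p>

```python
def pyramidal_ngram_sets(word, levels=3):
    """Level L splits the word into L contiguous parts; each part is mapped
    to the set of its 2- and 3-grams."""
    def ngrams(s):
        return {s[i:i + n] for n in (2, 3) for i in range(len(s) - n + 1)}
    pyramid = []
    for level in range(1, levels + 1):
        step = -(-len(word) // level)  # ceil(len(word) / level)
        parts = [word[i:i + step] for i in range(0, len(word), step)]
        pyramid.append([ngrams(p) for p in parts])
    return pyramid
```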
        <p>At this point, each detected density area is assigned a confidence measure, the higher the
more likely it is to contain an instance of the searched word. In this way, once the system
receives a query word, it can return a list of all the areas that may contain the word, each with
its confidence measure.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          The KWS system was tested on a selection of 20 pages from the Bentham collection [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The
pages of the dataset are binarized using the Sauvola method [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and then segmented into text
lines. The set is divided into a subset of 5 pages, from which the reference alphabet of N-grams
is created, and a test subset consisting of the remaining 15 pages. From the first 5 pages, together
with their transcription, all N-grams with N equal to 2 and 3 are extracted. The decision to limit
N to 3 is because considering character sequences consisting of more than 3 characters would,
in our opinion, run counter to the premises of this work. We want to use N-grams as recognition
primitives, keeping the sequences large enough to facilitate segmentation, but small enough to
remain effective recognition primitives. Following the process of extracting the N-grams from
the training set, we obtain a set of N-grams consisting of 1044 distinct classes. However, the
dataset is highly imbalanced as the cardinality of each class ranges from a minimum of 1 to a
maximum of 117, with 615 classes consisting of fewer than 3 items. To reduce the imbalance, the
minimum cardinality of the classes was raised to 3 by simple image transformations and noise
addition. In this way, the training set consists of 1044 classes totalling 5440 samples.
        </p>
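<p>The balancing step can be sketched as follows; additive noise is used here as a simple stand-in for the paper's image transformations, so the noise model and function name are assumptions:</p>

```python
import numpy as np

def augment_to_min(images, min_count=3, seed=0):
    """Pad an N-gram class with perturbed copies of its existing samples
    until it reaches min_count items."""
    rng = np.random.default_rng(seed)
    out = list(images)
    while len(out) < min_count:
        base = out[int(rng.integers(len(out)))]
        noisy = np.clip(base + rng.normal(0.0, 0.05, base.shape), 0.0, 1.0)
        out.append(noisy)
    return out
```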
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training of the Siamese Network</title>
        <p>
          The core of the N-gram spotting system is a Siamese Neural Network that uses a PHOCnet
convolutional framework to extract features from images. For the experiments, the PHOCNet
backbone was pre-trained with the IAM handwritten dataset [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The whole branch of the
Siamese network, i.e. the PHOCNet backbone and the downstream fully connected layers, was
fine-tuned according to a triplet loss [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] using the images of the N-grams. For this purpose,
a training set and a validation set were created starting from the set of all N-grams extracted
from the first 5 pages of the collection. For the training set, at most 10 elements were selected
for each class and a training triplet (a, p, n) was defined for each of these elements. For each
anchor element a, the element belonging to the same class whose PHOCNet embedding is most
distant from that of the anchor is selected as the positive element p. To define the negative
element n, the 10 classes closest to the anchor class are selected based on the Levenshtein
distance calculated on the N-gram labels. The element n is then randomly selected from this
set with a probability of 80%; otherwise, it is randomly selected from the entire data set. In
this way, 'hard' triplets are obtained by maximising the distance between a and p and
minimising the distance between a and n. Any images that are not in the training set at the
end of the process are added to the validation set.
        </p>
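<p>The negative-mining step described above can be sketched as follows; the helper names are our own, and a standard dynamic-programming edit distance stands in for whatever Levenshtein implementation the authors used:</p>

```python
import random

def levenshtein(a, b):
    """Edit distance between two N-gram labels."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pick_negative_class(anchor, classes, rng=None):
    """With probability 0.8 draw the negative's class from the 10 classes
    whose labels are closest to the anchor label; otherwise draw it from
    the whole class set."""
    rng = rng or random.Random(0)
    others = [c for c in classes if c != anchor]
    closest = sorted(others, key=lambda c: levenshtein(anchor, c))[:10]
    pool = closest if rng.random() < 0.8 else others
    return rng.choice(pool)
```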
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Results</title>
        <p>By definition, the system returns in response to a query the confidence-ordered list of the k images
that supposedly contain instances of the searched word. As the number of instances of the
words in the collection is unknown, setting the value of k may lead to underestimating the
performance in terms of both Recall and Precision: given the value of k, it will lead to
underestimating the Recall whenever there are more than k instances of the query word in the
collection, while it will lead to underestimating the Precision in the case of words with fewer
than k instances. Figure 3 reports the results in terms of recall for different values of k (r@k) for two
groups of words, the first of which consists only of in-vocabulary (IV) words and the second of
OOV words that can be constructed from the reference dictionary of N-grams.</p>
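<p>The r@k measure used above can be made concrete with a small sketch (the function name and id-based bookkeeping are our own illustration):</p>

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """r@k: fraction of the relevant word instances that appear among the
    top-k returned areas."""
    hits = sum(1 for r in ranked_ids[:k] if r in relevant_ids)
    return hits / len(relevant_ids)
```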
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>The results show that the system can recover words from the set of IV words as well as from
the set of OOV words with a similar recall rate, independently of the value of k. Indeed, the
N-gram spotting phase takes place in the N-gram search space, and as long as all N-grams are
available to search for the query word, it makes no difference to the system whether it is an IV
word or an OOV word.</p>
      <p>An important feature of the proposed system is that it is segmentation-free, i.e. it avoids both
character and word-level segmentation, which are anything but easy in cursive handwriting.
The price to pay is that the system also retrieves parts of words that are similar to the searched
word. In other words, if the query word is part of a longer word, the system can retrieve those
occurrences: for instance, in the case of the query word "perform", the system can recover part
of the word "performed", as shown in Figure 4, with a fairly high confidence value, since the
transcription of the selected crop is actually the same as, or very similar to, the query word. Similarly, there
are also cases when not all the necessary N-grams of the query word can be spotted, but most
of them are, as in the case of the query word "appellate" and the instance of the word "Appeal"
shown in Figure 4.</p>
      <p>The experimental results, however, also show that the Siamese Network does not always perform
satisfactorily, as shown by the case at the top of the figure, where the image of the N-gram "mo"
is matched with instances of the N-gram "na" because of their similarity, as well as the case at
the bottom, where the image of the N-gram "lla" is matched with instances of the N-gram "int"
even though they are not similar. This may depend mostly on the small size of the training set, which
is unavoidable due to the size of the collections we are dealing with, as already mentioned. A
possible way to overcome this limitation is to apply data augmentation techniques to a larger
extent than we have done in this study.</p>
      <p>In this paper, we have presented a KWS system that makes it possible to retrieve OOV words,
thus circumventing the limitation of the dictionary-based approaches, by searching for N-gram
images within the text line, thus avoiding both character and word-level segmentation. The
experimental results have shown that the system can spot OOV words with similar performance
to the case of IV words. They also show that the overall performance is very encouraging,
considering the small size of the training set, with many classes and a few samples per class.
This certainly hampers the performance of the Siamese Network, and thus we are currently
investigating data augmentation techniques as a possible solution to achieve better performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sulaiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Omar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Nasrudin</surname>
          </string-name>
          ,
          <article-title>Degraded historical document binarization: A review on issues, challenges, techniques, and future directions</article-title>
          ,
          <source>Journal of Imaging</source>
          <volume>5</volume>
          (
          <year>2019</year>
          )
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Philips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tabrizi</surname>
          </string-name>
          ,
          <article-title>Historical document processing: a survey of techniques, tools, and trends</article-title>
          , arXiv preprint arXiv:2002.06300 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Giotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sfikas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nikou</surname>
          </string-name>
          ,
          <article-title>A survey of document image word spotting techniques</article-title>
          ,
          <source>Pattern recognition 68</source>
          (
          <year>2017</year>
          )
          <fpage>310</fpage>
          -
          <lpage>332</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parziale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Senatore</surname>
          </string-name>
          ,
          <article-title>Some observations on handwriting from a motor learning perspective</article-title>
          ,
          <source>in: AFHA</source>
          , volume
          <volume>1022</volume>
          ,
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <year>2013</year>
          , pp.
          <fpage>6</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Wing</surname>
          </string-name>
          ,
          <article-title>Motor control: Mechanisms of motor equivalence in handwriting</article-title>
          ,
          <source>Current Biology 10</source>
          (
          <year>2000</year>
          )
          <fpage>R245</fpage>
          -
          <lpage>R248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Riseman</surname>
          </string-name>
          ,
          <article-title>Word spotting: A new approach to indexing handwriting</article-title>
          ,
          <source>in: Proceedings CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , IEEE,
          <year>1996</year>
          , pp.
          <fpage>631</fpage>
          -
          <lpage>637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Rath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          ,
          <article-title>Word spotting for historical documents</article-title>
          ,
          <source>International Journal of Document Analysis and Recognition (IJDAR) 9</source>
          (
          <year>2007</year>
          )
          <fpage>139</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Konidaris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Kesidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gatos</surname>
          </string-name>
          ,
          <article-title>A segmentation-free word spotting method for historical printed documents</article-title>
          ,
          <source>Pattern Analysis and Applications</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Almazán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gordo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Valveny</surname>
          </string-name>
          ,
          <article-title>Word spotting and recognition with embedded attributes</article-title>
          ,
          <source>IEEE TPAMI</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Puigcerver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Toselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>Querying out-of-vocabulary words in lexicon-based keyword spotting</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>28</volume>
          (
          <year>2017</year>
          )
          <fpage>2373</fpage>
          -
          <lpage>2382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Toselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Frinken</surname>
          </string-name>
          ,
          <article-title>HMM word graph based keyword spotting in handwritten document images</article-title>
          ,
          <source>Information Sciences 370</source>
          (
          <year>2016</year>
          )
          <fpage>497</fpage>
          -
          <lpage>518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Rabiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-H.</given-names>
            <surname>Juang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Wilpon</surname>
          </string-name>
          ,
          <article-title>HMM clustering for connected word recognition</article-title>
          ,
          <source>in: International Conference on Acoustics, Speech, and Signal Processing</source>
          , IEEE,
          <year>1989</year>
          , pp.
          <fpage>405</fpage>
          -
          <lpage>408</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Brakensiek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rottland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rigoll</surname>
          </string-name>
          ,
          <article-title>Handwritten address recognition with open vocabulary using character n-grams</article-title>
          ,
          <source>in: Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition</source>
          , IEEE,
          <year>2002</year>
          , pp.
          <fpage>357</fpage>
          -
          <lpage>362</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. I.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lladós</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>SigNet: Convolutional siamese network for writer independent offline signature verification</article-title>
          ,
          <source>arXiv preprint arXiv:1707.02131</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sudholt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <article-title>PHOCNet: A deep convolutional neural network for word spotting in handwritten documents</article-title>
          ,
          <source>in: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Toselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>ICDAR 2015 competition HTRtS: Handwritten text recognition on the tranScriptorium dataset</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sauvola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietikäinen</surname>
          </string-name>
          ,
          <article-title>Adaptive document image binarization</article-title>
          ,
          <source>Pattern Recognition 33</source>
          (
          <year>2000</year>
          )
          <fpage>225</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>U.-V.</given-names>
            <surname>Marti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bunke</surname>
          </string-name>
          ,
          <article-title>The IAM-database: An English sentence database for offline handwriting recognition</article-title>
          ,
          <source>International Journal on Document Analysis and Recognition</source>
          <volume>5</volume>
          (
          <year>2002</year>
          )
          <fpage>39</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          ,
          <article-title>FaceNet: A unified embedding for face recognition and clustering</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>