<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mauricio Villegas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joan Puigcerver</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro H. Toselli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joan-Andreu Sanchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Vidal</string-name>
          <email>evidalg@prhlt.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PRHLT, Universitat Politècnica de València</institution>
          <addr-line>Camí de Vera s/n, 46022 València</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ImageCLEF 2016 Handwritten Scanned Document Retrieval Task was the first edition of a challenge aimed at developing retrieval systems for handwritten documents. Several novelties were introduced in comparison to other recent related evaluations, specifically: multiple-word queries, finding local blocks of text, results in the transition between consecutive pages, handling words broken between lines, words unseen in training, and queries with zero relevant results. To evaluate the systems, a dataset of manuscripts written by Jeremy Bentham was used, which has been left publicly available after the evaluation. Participation was lower than expected, with results received from four groups. Despite the low participation, the results were very interesting. One group obtained very good performance, handling relatively well the cases of queries with words not observed in the training data and locating words broken between two lines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years there has been an increasing interest in digitizing the vast amounts
of pre-digital-age books and documents that exist throughout the world. Many
of the emerging digitization initiatives are aimed at dealing with huge collections
of handwritten documents, for which automatic recognition is not yet as mature
as Optical Character Recognition (OCR) for printed text. Thus, there is a need
to develop reliable and scalable indexing techniques for manuscripts, targeting
their particular challenges. Users of this technology could be libraries with fragile
historical books, which are being scanned for preservation so that they can be made
available to the public without the risk of further deterioration. Apart from making
the scanned pages available, there is also interest in providing search facilities, so
that the people consulting these collections have the information access tools
they are already accustomed to in other applications. Another use for this
technology is historical birth, marriage and death records: the common
interest of people in finding out about their ancestors and family roots
results in a need for tools that ease searching these collections of records.</p>
      <p>The archaic solution to handwritten document indexing is to transcribe
manually and then use standard text retrieval technologies. However, this becomes
too expensive for large collections, and more so for historical documents, which
generally require expert paleographers to transcribe them. Alternatively,
handwritten text recognition (HTR) techniques can be used for automatic indexing,
which requires transcribing only a small part of the document to learn the
models, or reusing models obtained from similar manuscripts, thus requiring the
least human effort. The use of HTR for indexing manuscripts does not have to
be limited to recognizing the text before applying standard text retrieval
techniques: the uncertainty of the recognized text can be taken into account
during retrieval, so HTR can potentially provide better performance
than standard text retrieval techniques alone would allow.</p>
      <p>This paper presents an overview of the Handwritten Scanned Document
Retrieval challenge, one of the three evaluation tasks organized by ImageCLEF in
2016 under the CLEF initiative labs1. The main aspects that distinguish this
challenge from other evaluations or competitions related to
searching in handwritten documents are: multiple-word queries, finding local blocks
of text, results in the transition between consecutive pages, handling words broken
between lines, words unseen in training, and queries with zero relevant results.
The remainder of this paper is organized as follows. The next section gives a short
overview of work related to this challenge. Then, Section 3 describes the task in
detail, including the objective of the challenge, the reasons why it was organized,
the participation rules, and the provided data and resources. Following this, Section 4
presents and discusses the results of the systems submitted by the participants.
Section 5 gives recommendations for using the dataset from this evaluation in
future work. Finally, Section 6 concludes the paper with final remarks and
future outlooks. For a global view of the ImageCLEF activities in 2016, the reader
is invited to look at the general overview paper [20].</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Traditionally, the process of automatically searching for text in handwritten
documents has been known as Keyword Spotting (KWS), which can actually be seen
as a particular case of image retrieval. The goal of KWS is to find all instances
of a query word in a given document. Among the noteworthy KWS paradigms
aiming to fulfill this goal, two main kinds are distinguished: Query-by-Example
(QbE) [8,7,1,23] and Query-by-String (QbS) [4,5,15,19]. While in QbE the query
is specified by providing a cropped image of the word or words to be searched
for, in QbS queries are directly specified as character strings, i.e., typed on a
keyboard. Other distinctions considered are: training-based versus
training-free [4,23], i.e., whether or not the KWS system needs to be trained on
appropriate (annotated) images; and segmentation-based versus segmentation-free [23,7],
i.e., whether KWS is applied to full document (page) images or to images of
individual words (previously segmented from the original full images). Finally, it
is also worth mentioning that, like most state-of-the-art handwritten text
recognition systems, there are KWS approaches which are line-oriented [4,5,19].</p>
      <sec id="sec-2-1">
        <title>1 http://www.clef-initiative.eu</title>
        <p>This means that their basic input search/indexing units are whole text-line
images, without any kind of segmentation into words or characters, which are
analyzed to determine the degree of confidence that a given keyword appears in
the image.</p>
        <p>Both paradigms, QbE and QbS, have their particular usage
scenarios, strengths and weaknesses. In the case of QbE, the techniques are generally
training-free, so they alleviate the need to transcribe part of the document.
However, they do have the limitation that the query is an image, which
depending on the case might not be easy to obtain. In contrast, QbS techniques
can potentially allow searching for any word, although they generally require
training.</p>
        <p>In recent years, several KWS contests on handwritten documents have been
organized, mainly in conjunction with conferences like the International Conference
on Frontiers in Handwriting Recognition (ICFHR) and the International
Conference on Document Analysis and Recognition (ICDAR). These contests focused
first on evaluating QbE approaches [11]2,3, although lately QbS approaches have
also been considered, in ICDAR'15 [12]4 and in ICFHR'165.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Overview of the Task</title>
      <sec id="sec-3-1">
        <title>Motivations and Objectives</title>
        <p>The general motivations for this handwritten retrieval task have already been
outlined in the introduction of the paper, so there is no need to repeat them.
However, in addition to these there are other, more specific motivations related to the
current state of research in this area and to the other evaluation initiatives that
have been organized in recent years.</p>
        <p>Keyword spotting is mostly concerned with detecting all instances of a given
word in a set of images, and with good reason, since any system for searching
text in images containing handwriting must be based on a technique of this sort.
However, since this is a problem related to searching, it must also be analyzed
from the perspective of the field of information retrieval. There are other
difficulties beyond the detection of words that are important for the development
of handwritten retrieval systems, so one objective of this evaluation was to
stimulate research in other aspects of the problem that are also important. One
aspect considered was the type of query the user issues. In order to
favor ease of use in general applications, a good approach is to allow queries
written in natural language, in contrast to, for example, allowing only a single
word for searching, or multiple words as a boolean expression. With the idea of
targeting this goal in the future, for this evaluation the queries were composed
of multiple words and the order of the words had to be taken into account.
2 http://users.iit.demokritos.gr/~bgat/H-WSCO2013</p>
        <sec id="sec-3-1-1">
          <title>3 http://vc.ee.duth.gr/H-KWS2014</title>
          <p>4 http://transcriptorium.eu/~icdar15kws</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>5 https://www.prhlt.upv.es/contests/icfhr2016-kws</title>
          <p>Another aspect considered was what the objects to retrieve should be. Previous
research on keyword spotting has mostly targeted finding a word in a line of text
or in a full page. A single line is generally too small for multiple-word queries. On
the other hand, providing a full page as a result might not be convenient for the
user, who wants to have a quick look (i.e., read part of the retrieved text) and judge whether the
result fulfills the current information need. For some applications a middle
ground between a line and a page might be more appropriate. Furthermore,
handwritten documents are not composed of isolated pages, since the content
traverses multiple pages. The result of a query could lie in the transition between
two consecutive pages, so retrieval systems should consider this possibility. A
related topic is that, commonly, when someone writes and reaches the end of
a line with little space to fit the next word completely, the word is split or broken
between the current line and the following one. In many cases these broken words
are marked with a symbol, such as the hyphen (short horizontal line) commonly
used now, although in past times other symbols have been employed. There are also
documents in which there is no special mark for the broken words, either because
it was forgotten or because it was not a usual practice. These broken words are
also candidates to be found, so developing techniques to handle them is also a
requirement.</p>
          <p>As mentioned before, these kinds of evaluations related to handwriting
recognition are normally organized in conjunction with conferences such as ICDAR
and ICFHR, which are not common venues for the information retrieval
community. This was one of the reasons for organizing the task at CLEF: to present these
problems to experts in information retrieval and foster collaborations that will surely
benefit this area of research.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Challenge Description</title>
        <p>The challenge aimed at evaluating the performance of retrieval systems for
scanned handwritten documents6. Specifically, it targeted the scenario of free-text
search in a document, in which the user wants to find locations of a given
multiple-word query provided as a string (QbS). The results of the search are
local regions (such as a paragraph), which could even include the end of one page
and the start of the next. However, since layout analysis and detection of paragraphs
is in itself a difficult problem, a couple of simplifications were introduced. First,
there was no requirement to detect the text lines in the page images or determine
their reading order; this was provided as part of the data. Second, the segments
to retrieve were defined as a concatenation of 6 consecutive lines (from top to
bottom and left to right if there are columns), ignoring the type of line
(e.g., title, inserted word, etc.). More precisely, the segments were defined by a
sliding window that moves one line at a time (thus neighboring segments overlap
by 5 lines) traversing all the pages in the document, so there are segments that
include lines at the end of a page and at the beginning of the next, and the total</p>
        <sec id="sec-3-2-1">
          <title>6 Challenge website at http://imageclef.org/2016/handwritten</title>
          <p>number of segments is five less than the total number of lines in the document.
An example of a segment is presented in Figure 1.</p>
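The sliding-window definition above can be sketched in a few lines of Python (a minimal illustration, assuming lines are represented by their identifiers):

```python
def make_segments(line_ids, length=6):
    """Sliding window of `length` consecutive lines, moving one line at a
    time, so neighboring segments overlap by length - 1 lines."""
    return [line_ids[i:i + length] for i in range(len(line_ids) - length + 1)]

# A document with 10 lines yields 10 - 5 = 5 six-line segments.
segments = make_segments(list(range(10)))
assert len(segments) == 5
assert segments[0] == [0, 1, 2, 3, 4, 5]
```

Note that the last window of one page and the first of the next share lines, which is exactly how results in page transitions arise.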
          <p>The queries selected for evaluation consisted of between one and five words
that had to be searched for in the collection of pages, and a segment was
considered relevant if all the query words appeared in the given order. Due to the
overlap of the segments, for a given query several consecutive segments are
relevant. In a real application these consecutive segments would be considered a
single result; however, to simplify the evaluation, the relevancy of all segments
had to be provided by the systems. The participants were expected to submit,
for each query and only for the segments considered relevant, a relevancy score and
the bounding boxes of all appearances of the query words within the segment,
irrespective of whether or not it was an instance of the word that made the segment
relevant. To clarify this last point, taking the example from Figure 1, the word
necessary appears twice; however, the segment is relevant only because of the
second appearance, since the order of the words in the query has to be the same.
Nevertheless, the bounding box of the first appearance of necessary also had to be
provided.</p>
          <p>The queries were selected such that some key challenges were included, so
the developed systems were expected to consider them. These challenges are:
1. Queries with zero relevant results. In these cases the systems are expected
to assign low relevancy scores. Due to the uncertainty of the recognized text,
this would make it possible to apply a threshold to filter low-confidence recognitions
and prevent showing false positives to the users.
2. Queries whose results are relevant due to words broken between two lines. See
Figure 1 for an example. Both parts of the broken word have to be within the
six lines of the segment. Not all broken words in the dataset include a
hyphenation symbol. Also, the two parts of a broken word are not necessarily
in consecutive lines, since there can be inserted text between these two lines.
3. Queries including words not seen in the data used for learning the recognition
models, also known as out-of-vocabulary (OOV) words.
4. Queries with a repeated word, so this word had to appear as many times in
the segment as it did in the query, in addition to the restriction of the word
order.</p>
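The order and repetition constraints above amount to a subsequence test over the segment transcript. A minimal sketch (assuming the transcript is already tokenized into words; the function name is hypothetical):

```python
def is_relevant(segment_words, query_words):
    """True if all query words appear in the segment in the given order
    (other words may occur in between); a repeated query word must match
    a distinct occurrence each time, as the task requires."""
    it = iter(segment_words)          # consuming the iterator enforces order
    return all(q in it for q in query_words)

seg = "it is necessary that the law be made necessary".split()
assert is_relevant(seg, ["necessary", "law"])
assert is_relevant(seg, ["necessary", "necessary"])  # two distinct occurrences
assert not is_relevant(seg, ["law", "it"])           # wrong order
```

Because `q in it` consumes the iterator, a repeated query word cannot match the same segment word twice, which handles challenge 4 without extra counting logic.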
          <p>Since the evaluation measured the impact of words not seen in the training
set, the use of external data for learning a language model was prohibited. The
use of external data for learning the optical models was allowed, but with the
condition that results were also submitted using the same techniques and only
the provided data for learning the optical models. Each registered group was
allowed to submit a maximum of 10 system runs. If a group chose to submit
results using external training data, an additional 10 system runs were allowed,
to cover the runs with and without external training data.</p>
          <p>A development set with accompanying ground truth was provided so that
participants could tune their systems. However, it was prohibited to use this
development data as training data for the final submissions, since the participants
were required to submit retrieval results for the concatenation of the development
and test sets, as if it were a single large document.</p>
          <p>The task was designed to allow easy participation from different research
communities by providing prepared data for each (see Subsection 3.3), with the
aim of creating synergies between these communities and obtaining different ideas
and solutions to the problems being addressed. Not only groups that work on
handwritten text recognition could participate. For researchers specialized in
text processing and retrieval, recognition results in plain text using a baseline
system were provided. Researchers who work on query-by-example could also
participate, since for the training set bounding boxes for the words were given,
thus allowing them to obtain example images for a given query string. However, this
kind of participation had a twist: the training words were segmented
automatically and therefore could be incorrectly segmented, so a technique to select
among the available example words would be required.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Dataset</title>
        <p>The dataset used in this task was a subset of pages from unpublished manuscripts
written by the philosopher and reformer Jeremy Bentham, which have been
digitised and transcribed under the Transcribe Bentham project [3]. Refer to Table 1
for some of the dataset statistics. The data was divided into three sets: training
with 363 pages, development with 433 pages and test with 200 pages. The
participants had to submit results for the concatenation of development and test as
if it were a single 633-page document. For the three sets, the original scanned
color page images were made available so that all parts of the challenge could be
addressed, including extraction of lines, pre-processing of the images, training of
recognition models, decoding, indexing and retrieval. To ease participation,
pre-extracted line images were also provided.
(Table 1 reports, for each set: Pages, Segments, Lines, Running words a, Total queries,
Unique words in queries, Queries with OOV words, Queried broken words, Relevant
segments, Rel. segm. for OOV queries c, Rel. segm. with broken words d.
a Without tokenizing.
b Unknown since 101 pages had not been transcribed at the time.
c Relevant segments only for the queries with OOV words.
d Relevant segments with a broken word as its only appearance.)
These lines had in their meta-data the
crop-window offset required to compute bounding boxes in the original page
image coordinates. The identifiers of the lines were set to be incrementing integers,
such that the identifier of the first line in a 6-line segment coincided with the
number of the corresponding segment. For the training and development sets,
the transcripts were also provided, in the first case to allow training recognition
models and in the second so that participants could also define new queries to
evaluate on the development set.</p>
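Mapping word bounding boxes from line-image coordinates back to page coordinates is a simple translation by the crop-window offset. A sketch (assuming boxes are (x, y, width, height) tuples and the offset is the (x, y) position of the line crop within the page; the exact meta-data layout is not prescribed here):

```python
def line_box_to_page(box, offset):
    """Translate a word bounding box given in line-image coordinates into
    original page-image coordinates using the line's crop-window offset."""
    x, y, w, h = box
    ox, oy = offset
    return (x + ox, y + oy, w, h)

# A word at (10, 5) inside a line cropped at (100, 400) on the page:
assert line_box_to_page((10, 5, 40, 20), (100, 400)) == (110, 405, 40, 20)
```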
        <p>For each line of the development and test sets, the 100-best recognitions using
the baseline system (see Subsection 3.4) were given, including log-likelihoods and
word bounding boxes. These n-best recognitions were intended for researchers
who work in related fields (such as text retrieval) but do not normally work
with images or on handwriting recognition, so that they could easily participate,
concentrating on other parts of the challenge.</p>
        <p>For each page image there was also an accompanying XML file in PAGE 2013
format7. The training-set XMLs included manually checked polygons surrounding every line,
and also word polygons, although the latter were obtained automatically.
The automatic word polygons were intended for groups working on
query-by-example keyword spotting, although the corresponding word
bounding boxes were provided separately for convenience. In contrast to training, the
development-set XMLs had manually checked polygons for both lines and words,
while the test set only had manually checked baselines instead of polygons
surrounding the lines. Thus, for the test set, the participants additionally needed to</p>
        <sec id="sec-3-3-1">
          <title>7 http://www.primaresearch.org/tools/PAGELibraries</title>
          <p>develop a technique to crop the lines for a given baseline, or use the pre-extracted
line images.</p>
          <p>For the development set, a list of 510 queries and the respective retrieval
ground truth (obtained using the word polygons) was supplied. The list of queries
for the test set contained the same ones as the development set plus 490 additional ones,
thus 1,000 in total. These 1,000 queries included 1,643 unique words.</p>
          <p>The 200 test-set pages were somewhat special, since the procedure to select
them and obtain their ground truth was significantly different from that of the
development and training sets. The development set was actually the first batch
of images processed in the tranScriptorium project [17], and was chosen to be a
consecutive set of very homogeneous pages without many evident difficulties.
The training set was the second batch considered in tranScriptorium, which was
also consecutive and homogeneous but somewhat more challenging than the first
one. For the test set, the pages were selected among ones that had been transcribed
by volunteers [2], so there was no guarantee that the pages were consecutive. The
actual number of transcribed pages in the test set was not the full 200,
but only 99. Many of the others were included because they were pages between
the transcribed ones. To measure the performance on the test set, the segments
from untranscribed pages were removed from the submissions. However, since
the retrieval results are available, the ground truth could be produced for these
missing pages so that the performance can be obtained for the full set.</p>
          <p>All of the provided data for the task, the scripts for computing the evaluation
measures, and the baseline system (see Subsections 3.4 and 3.5) are now publicly
available and citable [21].</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Baseline System</title>
        <p>A very simple baseline system was provided to serve as a starting point,
for initial comparisons, and for understanding the submission format. Since
the evaluation was designed to allow participation from researchers working on
text retrieval, the baseline system was chosen so that it could serve as a basis for
this kind of participation. Thus, the baseline system can be divided into two
disjoint modules: 1) the first module receives the line images and produces a list
of the 100 most probable recognitions according to a pre-trained HTR model; 2)
the second module then takes this list of recognitions as input and estimates the
probability that the query words appear in the same order in any given 6-line
segment. Only for the second module was software provided to
the participants, so that they could reproduce the results for the development
set using the 100-best recognitions.</p>
        <p>The first module of the system was based on the classical architecture
composed of three main processes: document image pre-processing, line-image
feature extraction, and Hidden Markov Model (HMM) and language model
training/decoding [18]. The pre-processing included: extraction of line images,
correction of line slope and slant, noise removal and image enhancement [22].
Feature vectors were extracted using the technique from [9]. Training of HMMs
and decoding were done using HTK8, and a bi-gram language model was trained using
SRILM9. The HMM character models had 6 states and Gaussian Mixture Models
(GMM) of 64 components, and were trained for 5 Expectation Maximization
(EM) iterations. The HVite decoder was used to generate the 100-best
recognition lists, using an input degree of 15, a Grammar Scale Factor (GSF) of 15 and
a Word Insertion Penalty (WIP) of -5. These parameters were selected based on
previous work on the Bentham data.</p>
        <p>Once the 100-best recognition lists are built for each line image, an approach
similar to the one described in [13] is followed to compute the probability
that a given segment, comprised of the image lines x_i, x_{i+1}, ..., x_{i+L-1}, is relevant
for a particular query q. Following the previous work, this probability can be
computed by the following expression:</p>
        <p>
          P(R = 1 | q, x_i, x_{i+1}, ..., x_{i+L-1})
          = Σ_{w ∈ Ω} P(R = 1, w | q, x_i, x_{i+1}, ..., x_{i+L-1})
          (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
          = Σ_{w ∈ Ω} P(R = 1 | w, q) P(w | x_i, x_{i+1}, ..., x_{i+L-1})
          (
          <xref ref-type="bibr" rid="ref2">2</xref>
          )
          = Σ_{w' ∈ Ω : R(w', q) = 1} P(w' | x_i, x_{i+1}, ..., x_{i+L-1})
          (
          <xref ref-type="bibr" rid="ref3">3</xref>
          )
R is a binary random variable that denotes whether or not the segment is
relevant for the query q, and Ω is the set of all possible segment transcripts. The first
probability distribution is marginalized across all possible transcripts, w, of the
segment. Since relevance is determined by the transcript and the query, this is
equivalent to the total probability of all segment transcripts, w', that are relevant
for the query q, as denoted by the binary function R(w', q). Character lattices
were used in [13] to compute the previous probabilities at the line level. Here,
the 100-best line-level transcripts were used instead to approximate
P(w | x_i, x_{i+1}, ..., x_{i+L-1}).
        </p>
        <p>Observe that using a 100-best list of line-level transcription hypotheses yields
a much larger set of segment-level hypotheses. More precisely, with segments
comprised of L lines, if the N-best line-level hypotheses are used, this gives a total
of N^L segment-level hypotheses. In our setting, with L = 6 and N = 100,
that would result in 10^12 hypotheses for each segment. Instead, 100 segment-level
hypotheses were formed by joining first the 1-best hypotheses from all lines, then
the 2-best hypotheses, and so on. This strategy is clearly suboptimal, but it enables
working with a much smaller set of segment-level hypotheses and is enough for
a reasonable baseline system. The log-likelihood of a segment-level transcript
is computed as the sum of the log-likelihoods of the line-level transcripts.</p>
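The suboptimal joining strategy can be sketched as follows (assuming each line's N-best list is a list of (log-likelihood, word-list) pairs sorted best-first; the names are illustrative, not taken from the released baseline code):

```python
def segment_hypotheses(line_nbests):
    """Form N segment-level hypotheses from per-line N-best lists by
    joining the i-th best hypothesis of every line, instead of the full
    N**L cross-product.  The segment log-likelihood is the sum of the
    line log-likelihoods."""
    n = min(len(nb) for nb in line_nbests)
    hyps = []
    for i in range(n):
        log_lik = sum(nb[i][0] for nb in line_nbests)
        words = [w for nb in line_nbests for w in nb[i][1]]
        hyps.append((log_lik, words))
    return hyps

lines = [[(-1.0, ["the", "law"]), (-2.0, ["the", "low"])],
         [(-0.5, ["is", "necessary"]), (-1.5, ["in", "necessity"])]]
hyps = segment_hypotheses(lines)
assert hyps[0] == (-1.5, ["the", "law", "is", "necessary"])
assert len(hyps) == 2
```

Only N hypotheses per segment are produced (2 in the toy example, 100 in the baseline setting), which is what makes the computation tractable.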
        <p>
          In order to determine which transcript hypotheses are relevant for a particular
query, a regular expression containing all the words in the query, optionally
separated by other words, is created; the likelihoods of all segment-level
hypotheses matching this regular expression are then added, and the result is divided
by the total likelihood of the segment to obtain the probability described in
Eq. (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ). Regarding the word-level bounding boxes, they are extracted from the
bounding boxes of the transcript hypothesis containing the query that has the
highest log-likelihood.
        </p>
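A regular expression of this kind can be built as in the following sketch (an illustration of the idea, not the exact expression used by the baseline):

```python
import re

def query_regex(query_words):
    """Build a regular expression matching a transcript in which all query
    words appear as whole words, in the given order, optionally separated
    by any number of other words."""
    sep = r"\W+(?:\w+\W+)*"   # separators plus intervening words
    body = sep.join(re.escape(w) for w in query_words)
    return re.compile(r"\b" + body + r"\b")

rx = query_regex(["law", "necessary"])
assert rx.search("the law is always necessary")   # words in order
assert not rx.search("necessary law")             # wrong order
assert not rx.search("lawful is necessary")       # partial word must not match
```

The likelihoods of the segment-level hypotheses whose transcript matches this expression would then be summed and normalized by the total segment likelihood, as described above.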
        <sec id="sec-3-4-1">
          <title>8 http://htk.eng.cam.ac.uk</title>
        </sec>
        <sec id="sec-3-4-2">
          <title>9 http://www.speech.sri.com/projects/srilm/</title>
          <p>The baseline system has two main shortcomings that were left for the
participants to tackle and improve: the presence of OOV words in the queries, and
broken words split between two lines. Python source code10 was made available so that
the participants had all the details of the baseline system. A script that
computes the performance measures described next was also made available11, along
with the development-set baseline retrieval results and ground truth, so that
it could be tested and the measures compared with the ones published on the
challenge web page.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>Performance Measures</title>
        <p>Two well-known information retrieval metrics, Average Precision (AP) and
Normalized Discounted Cumulative Gain (NDCG), were taken as the basis for
benchmarking the systems, but applied in several ways so that the different challenges
considered in the evaluation were measured. These metrics were computed in
two ways: 1) globally, and 2) for each query individually, then taking the
mean. To distinguish these two ways of measuring performance, we prepend to
the acronyms the letter g for global and m for mean. Thus, we have gAP
as the AP measured globally (in the literature commonly referred to as just
AP [14]) and mAP as the mean of the APs obtained for each of the queries.
Likewise, we have gNDCG as the NDCG applied globally and mNDCG as the
mean over all queries (commonly referred to as just NDCG in the literature).</p>
        <p>The gAP and gNDCG are highly affected by the queries with zero relevant
results, due to the fact that all queries are considered at once, thus comparing
the relevancy scores across queries. Systems that give low relevancy scores
for queries without relevant results are rewarded, because true relevants from other
queries will be ranked better than the results for the zero-relevant queries.</p>
        <p>For two other novel challenges we decided to measure performance
directly: broken words and OOV words. In the case of OOV words, we simply
measured the performance only for the queries that included at least one word
not observed in the training data. To measure the performance for
broken words, we further restricted the requirement for a segment to be relevant,
such that it contains at least one broken word and the broken word is the
only appearance of this word in the segment.</p>
        <p>Since the objects to retrieve are parts of images containing (possibly difficult
to read) handwriting, it is important for the systems to provide the locations of
the words to serve as a visual aid for the users. This is why participants were
required to provide bounding boxes for the spotted words. Even though the
word locations are important, the measure of retrieval performance should be
independent of the detection of words, so the four metrics were computed both
at the segment and at the word bounding-box levels.</p>
        <p>All performance metrics aim to measure how good the results given in a
ranked list of N matches are, given that there are R relevant elements expected to
be retrieved according to the ground truth. The metrics can be defined in terms
of two fundamental concepts: true positives (correct matches) and false positives
(also known as type I errors), which are used to classify a particular retrieved
match. Let TP(k) and FP(k) be indicator functions that evaluate to 1 if the k-th
match is a true positive or a false positive, respectively, and that otherwise evaluate
to 0. Then, the precision p(k) among the first k elements of the list is defined
as:</p>
        <p>
          p(k) = Σ_{j=1}^{k} TP(j) / Σ_{j=1}^{k} (TP(j) + FP(j))
          (
          <xref ref-type="bibr" rid="ref4">4</xref>
          )
        </p>
10 https://zenodo.org/record/52994/files/nbest-baseline.py
11 https://zenodo.org/record/52994/files/assessment.py
        <p>The usual de nition of AP only considers the cases when N &gt; 0 (there
is at least one match in the retrieved list) and R &gt; 0 (there is at least one
relevant element in the ground-truth). For this evaluation, the de nition of AP
was extended to address these ill-de ned cases as follows:</p>
        <p>AP =</p>
        <p>k=1
8 N
&gt; 1 X p(k) TP(k)
&gt;
&gt;&gt;&lt; R
&gt;1:0
&gt;
&gt;:&gt;0:0</p>
        <p>
          N &gt; 0 ^ R &gt; 0
N = 0 ^ R = 0
otherwise
(
          <xref ref-type="bibr" rid="ref4">4</xref>
          )
(
          <xref ref-type="bibr" rid="ref5">5</xref>
          )
(6)
(7)
(8)
where NDCG(q) corresponds to Eq. (7) computed only for query q, and Q is the
set of all queries being evaluated.
        </p>
        <p>
          To measure the previous performance measures at a word level, the TP
and FP were generalized to a continuous case by using the intersection over
The gAP is computed using Eq. (
          <xref ref-type="bibr" rid="ref5">5</xref>
          ) for all queries combined, and the mAP is
de ned as:
mAP =
jQj q2Q
1 X AP(q)
where AP(q) corresponds to Eq. (
          <xref ref-type="bibr" rid="ref5">5</xref>
          ) computed only for query q, and Q is the set
of all queries being evaluated.
        </p>
        <p>Similar to the AP, the NDCG was extended to address the ill-de ned cases
as follows:</p>
        <p>NDCG =
8 N
&gt; 1 X
&gt;
&gt;&gt;&lt; Z</p>
        <p>k=1
&gt;1:0
&gt;
&gt;:&gt;0:0
2TP(k)</p>
        <p>1
log2(k + 1)</p>
        <p>N &gt; 0 ^ R &gt; 0
N = 0 ^ R = 0
otherwise
A</p>
        <p>A \ B
union (IoU) metric as an overlapping degree between the reference and the
retrieved bounding boxes. Let A be a detected bounding box in the retrieved list
of matches, and B a reference bounding box corresponding to the same word,
then the continuous de nitions of TP and FP are:</p>
        <p>TP(A; B) = jA \ Bj = IoU
jA [ Bj</p>
        <p>A
FP(A; B) = j j
jA \ Bj = 1
A
j j
jA \ Bj</p>
        <p>A
j j
(9)
(10)
These continuous versions can be more easily understood by the visual example
depicted in Fig. 2.</p>
        <p>Given these extensions, the bounding box of the k-th result in the retrieved
list is matched against the previously-unmatched reference bounding box (of the
same word) with the greatest IoU area. That is, a reference bounding box which
was not matched against any of the previous k 1 results. This prevents that the
same reference bounding box is matched multiple times against di erent detected
bounding boxes. Then, the continuous versions of TP and FP are applied in the
de nitions of the four metrics described above.
4
4.1</p>
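As an illustration of the definitions above, the following Python sketch implements the extended AP (Eqs. 4 and 5), the extended NDCG (Eq. 7) and the IoU-based continuous TP and FP (Eqs. 9 and 10). It is a simplified illustration, not the official assessment script; the function names and the box representation are assumptions.

```python
# Illustrative sketch of the evaluation metrics (not the official
# assessment script). tp/fp are indicator lists over the ranked matches;
# boxes are (x1, y1, x2, y2) tuples (an assumed representation).
import math

def average_precision(tp, fp, n_relevant):
    """Extended AP (Eq. 5), handling the ill-defined N = 0 / R = 0 cases."""
    n = len(tp)
    if n == 0 and n_relevant == 0:
        return 1.0
    if n == 0 or n_relevant == 0:
        return 0.0
    hits, seen, ap = 0.0, 0.0, 0.0
    for k in range(n):
        hits += tp[k]
        seen += tp[k] + fp[k]
        if seen > 0:
            ap += (hits / seen) * tp[k]  # p(k) * TP(k), with p(k) from Eq. (4)
    return ap / n_relevant

def ndcg(tp, n_relevant):
    """Extended NDCG (Eq. 7): gain 2^TP(k) - 1, normalized by the ideal
    DCG (Z), in which the R relevant items would be ranked first."""
    n = len(tp)
    if n == 0 and n_relevant == 0:
        return 1.0
    if n == 0 or n_relevant == 0:
        return 0.0
    # k is 0-based here, so log2(k + 2) corresponds to log2(k + 1) in Eq. (7)
    dcg = sum((2.0 ** tp[k] - 1.0) / math.log2(k + 2) for k in range(n))
    z = sum(1.0 / math.log2(k + 2) for k in range(n_relevant))
    return dcg / z

def intersection_area(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return ix * iy

def tp_continuous(a, b):
    """Continuous TP (Eq. 9): intersection over union of the two boxes."""
    inter = intersection_area(a, b)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def fp_continuous(a, b):
    """Continuous FP (Eq. 10): the fraction of A not covered by B."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return 1.0 - intersection_area(a, b) / area_a
```

The mAP and mNDCG (Eqs. 6 and 8) are then simply the means of these per-query values over all queries, while the g-variants are obtained by evaluating the same functions on the result lists of all queries combined.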
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Outcome</title>
      <sec id="sec-4-1">
        <title>Participation</title>
        <p>There was considerable interest in the task. In total, 48 groups registered, and
24 of these signed the End User Agreement (a requirement to access the data
during the evaluation). Based on the data server logs, the test queries (only
useful if there was some intention of submitting results) were downloaded from 9
countries: Germany, India, Israel, Japan, Mexico, Morocco, Tunisia, Turkey and
the USA. In the end, only four groups submitted results, and three of them
submitted a working notes paper describing their system. Among the four groups,
37 system runs were submitted, 7 of which had mistakes and were thus invalidated.</p>
        <p>The final participation was not as good as hoped for, though considering that
it was a novel challenge and that it was organized in conjunction with an unusual
venue for its topic, the participation was not bad.</p>
        <p>The following is a list of short descriptions, one for each of the
participating groups (ordered alphabetically). For the three groups that submitted one,
the reference to the paper describing their system is included, so that the reader
can refer to it for further details of the techniques used.
– CITlab: [16] The team from the Computational Intelligence Technology
Laboratory of the Institute of Mathematics, University of Rostock, in Germany.
It was represented by Tobias Strauß, Tobias Gruning, Gundram Leifert and
Roger Labahn. They were the only team that participated in the full challenge,
training their own recognition models and dealing with broken words and
out-of-vocabulary words. The technique is based on multi-dimensional recurrent
neural networks (MDRNN) trained using connectionist temporal classification
(CTC), followed by a regular-expression-based decoder aided by a word
uni-gram language model. To handle broken words, when lines are recognized
with a hyphenation symbol at the end or the start, the respective parts are
concatenated to check whether they match a word. The CITlab group submitted
18 runs in total, among which 8 corresponded to systems that were trained
with external training data. After the submission, they realized that the
external training data included pages from the development set, which was
prohibited in the evaluation. They therefore decided to exclude these results
from their working notes paper, and for the same reason they are also excluded
from this paper.
– MayoBMI: [10] The team from the Section of Biomedical Informatics of the
Mayo Clinic, in Rochester, United States of America. It was represented by
Sijia Liu, Yanshan Wang, Saeed Mehrabi, Dingcheng Li and Hongfang Liu.
They focused their research on the detection of hyphenation symbols followed
by a spell correction to detect broken words, but in the end did not submit
results using this development. The text retrieval of their system was based on
the n-best recognition results provided by the organizers, though only considering
the 1-best. To index the words they used the Porter stemmer, and the retrieval
was based on Term Frequency - Inverse Document Frequency (TF-IDF). The
MayoBMI group submitted only one system run.
– IIIT: The team from the Center for Visual Information Technology,
International Institute of Information Technology, in Hyderabad, India. It was
represented by Kartik Dutta. This was the only participant that followed
the query-by-example route. Interestingly, their submission handled
out-of-vocabulary words, but unfortunately they did not submit a paper describing
the approach and decided not to disclose any details about it. The IIIT group
submitted two system runs.
– UAEMex: [6] The team from the Universidad Autonoma del Estado de
Mexico (UAEM), Mexico. It was represented by Miguel Angel García Calderon
and Rene Arnulfo García Hernandez. Their method was based on the n-best
recognition results provided by the organizers, though only considering the
1-best. The technique was based on the Longest Common Subsequence
algorithm, which made it possible to retrieve words not seen in training. The
UAEMex group submitted 8 system runs.</p>
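The hyphenation-based detection of broken words described for CITlab and MayoBMI can be sketched roughly as follows; the hyphen symbols, the tokenization and the vocabulary check are illustrative assumptions, not the teams' actual implementations.

```python
# Rough sketch of joining words broken between two lines: if a line ends
# with a hyphenation-like symbol, concatenate its last token with the
# first token of the next line, and keep the join when it forms a known
# word. Symbols and vocabulary are illustrative assumptions.
HYPHEN_SYMBOLS = ("-", "=")

def join_broken_words(lines, vocabulary):
    joined = []
    for prev, nxt in zip(lines, lines[1:]):
        prev_tokens, next_tokens = prev.split(), nxt.split()
        if prev_tokens and next_tokens and prev_tokens[-1].endswith(HYPHEN_SYMBOLS):
            candidate = prev_tokens[-1].rstrip("-=") + next_tokens[0]
            if candidate.lower() in vocabulary:
                joined.append(candidate)
    return joined
```

Note that, as discussed in the task description, broken words that are not marked with any hyphenation symbol are missed by this kind of approach.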
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>The complete results of the evaluation are presented in six tables. The first two
tables, 2 and 3, correspond to the complete retrieval results. The following two,
4 and 5, are for the additional requirement of the segments containing broken
words. And the final two tables, 6 and 7, use only the queries that include
at least one OOV word.</p>
        <p>All of these tables include the results for both the development and the 99-page
test sets, and for the four performance measures described in subsection 3.5.
There is no special order for the results within the tables. The groups appear
in alphabetical order, and the runs for each group correspond to the order in
which they were submitted. The best result from each group is highlighted in
bold font. This selection is based on considering the eight computed values
(development, test and 4 performance measures) as equally important. The cells in
the tables with a dash as the result are the cases in which the system run
did not include even a single retrieval result. This indicates that the system run
had some problem or did not consider that specific challenge, for example the
case of broken and OOV words for the baseline.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Discussion</title>
        <p>Each group followed quite a different path. The IIIT team participated as
query-by-example, thus their results are not directly comparable with those of
any of the other participants. Hopefully in the future other research groups will
use this dataset for the query-by-example scenario and be able to compare with
what is presented in this paper. Two teams, MayoBMI and UAEMex, based their
work on the provided recognition results, although they considered only the
1-best hypothesis, so they are handicapped in comparison to the baseline system.
Furthermore, the test set was considerably more difficult than the development
set and the baseline system performed very poorly, so their results were also
affected by this. Two groups, CITlab and MayoBMI, dealt with the broken words,
though both based it on the detection of hyphenation symbols, so the cases in
which there is no hyphenation are ignored. CITlab and UAEMex proposed
solutions that handled OOV words.</p>
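The Longest Common Subsequence matching used by UAEMex to handle OOV words can be illustrated with a short sketch; the character-level similarity and its normalization here are assumptions for illustration, not their exact formulation.

```python
# Sketch of LCS-based fuzzy matching between a query word and a word from
# the 1-best transcript. Because it tolerates missing or extra characters,
# it can retrieve misrecognized or out-of-vocabulary words. The
# normalization used here is an illustrative assumption.
def lcs_length(a, b):
    # classic O(len(a)*len(b)) dynamic programming over characters
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            cur.append(prev[j] + 1 if ca == cb else max(cur[j], prev[j + 1]))
        prev = cur
    return prev[len(b)]

def lcs_similarity(query, word):
    # normalized to [0, 1] by the longer of the two strings
    return lcs_length(query, word) / max(len(query), len(word), 1)
```

A retrieval system can then rank transcript words by this similarity and accept matches above some threshold, rather than requiring the query word to appear verbatim in the recognition vocabulary.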
        <p>The performance obtained by CITlab is very impressive. They used
recurrent neural networks, which are the current state of the art in handwriting
recognition, so the good performance was somewhat expected. Many groups in
this area are working on similar techniques, so the low participation was
unfortunate since it prevented a much more fruitful comparison of systems. The results
for the OOV queries are not affected much in comparison to the complete set of
queries. This suggests that the problem of OOV words can be handled rather well
with current techniques. In the case of the broken words, the drop in performance
is more noticeable, though it can be said that it is a more challenging problem
in which little work has been done. Nevertheless, the CITlab performance for
the broken words is quite good, which makes it the most interesting outcome of
this evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Post-evaluation Dataset Usage</title>
      <p>As observed in subsection 4.2, the selected test set was significantly different
from the training and development sets, something that was a surprise for both the
participants and us, the organizers. Among the differences observed are: scanning
and image quality, providing only the baselines, new writers or styles, and differing
paper formats. At the moment it is not known exactly which differences affect
the system performance the most, so a more in-depth analysis should be
conducted to understand it.</p>
      <p>In future usage of this test set, most of the improvements in performance
are expected to be due to better handling of these data differences, not the
specific challenges proposed in this evaluation. We therefore recommend not
using it for future comparative works. Since the development set is relatively
large and its retrieval ground truth is now publicly available, for subsequent
works it is recommended to use only the development set for measuring the
performance of the systems. The evaluation protocol and performance measures
should be the same as the ones proposed in this paper. So that future results
are comparable with the ones presented in this paper, a list of 60 pages selected
from the development set is being provided, to be used as validation for
adjusting parameters and hyper-parameters. This list is exactly the same as
the one used by the CITlab group in this evaluation.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper presented an overview of the ImageCLEF 2016 Handwritten Scanned
Document Retrieval Task, the first edition of a challenge aimed at developing
retrieval systems for handwritten documents. Several novelties were introduced
in comparison to other recent related evaluations, specifically: multiple-word
queries, finding local blocks of text, results in transitions between consecutive
pages, handling words broken between two lines, words unseen in training and
queries with zero relevant results. The evaluation was organized so that groups
from several research communities could participate: handwritten text
recognition, query-by-string and query-by-example keyword spotting, and information
retrieval.</p>
      <p>The interest in the task was considerable: 48 groups registered, and 24 signed
an agreement to get access to the data. In the end, the participation was low;
submissions were received from only four groups. The results presented in this
paper are for the 21 valid system runs received, left after removing 7 submissions
that were invalid and 9 that were withdrawn because part of the development set
was used for training, something that was prohibited by the rules of participation.</p>
      <p>The best performance was obtained by CITlab. Their system used
multi-dimensional recurrent neural networks (MDRNN) trained using connectionist
temporal classification (CTC), followed by a regular-expression-based decoder
aided by a word uni-gram language model. Very good performance was obtained
for words not observed in the training data and for words broken between lines.
Thus, even though the low participation prevented a much more fruitful
comparison of systems, the results ended up being very interesting.</p>
      <p>This evaluation should serve to give a push to this area of research, in particular
the novel challenges proposed. The two groups that addressed the broken-words
problem based it on locating the hyphenation symbols that generally mark them.
However, broken words are not always marked with a special symbol, so future
work could target handling this more general case. Possibly there will be a need
to produce new datasets to better assess the challenges proposed here: not only
to have more data with broken-word examples, but also to measure performance
with queries written in a more natural language, including words that may or
may not appear in a relevant document segment, or, for example, appearing as
synonyms.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>
        This work has been partially supported by the European Union (EU) Horizon
2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref:
674943), EU project HIMANIS (JPICH programme, Spanish grant Ref:
PCIN2015-068), MINECO/FEDER, UE under project TIN2015-70924-C2-1-R and
the Spanish MEC under FPU grant FPU13/06281.
6. García Calderon, M.A., García Hernandez, R.A.: UAEMex at ImageCLEF 2016:
Handwritten Retrieval. In: CLEF2016 Working Notes. CEUR Workshop
Proceedings, vol. 1609. CEUR-WS.org, Evora, Portugal (September 5-8, 2016)
7. Gatos, B., Pratikakis, I.: Segmentation-free word spotting in historical printed
documents. In: Document Analysis and Recognition, 2009. ICDAR '09. 10th
International Conference on. pp. 271–275 (July 2009)
8. Giotis, A., Gerogiannis, D., Nikou, C.: Word Spotting in Handwritten Text Using
Contour-Based Models. In: Frontiers in Handwriting Recog. (ICFHR), 2014 14th
Int. Conf. on. pp. 399–404 (Sept 2014), doi:10.1109/ICFHR.2014.73
9. Kozielski, M., Forster, J., Ney, H.: Moment-based image normalization
for handwritten text recognition. In: Frontiers in Handwriting
Recognition (ICFHR), 2012 International Conference on. pp. 256–261 (Sept 2012),
doi:10.1109/ICFHR.2012.236
10. Liu, S., Wang, Y., Mehrabi, S., Li, D., Liu, H.: MayoBMI at ImageCLEF 2016
Handwritten Document Retrieval Challenge. In: CLEF2016 Working Notes. CEUR
Workshop Proceedings, vol. 1609. CEUR-WS.org, Evora, Portugal (September 5-8,
2016)
11. Pratikakis, I., Zagoris, K., Gatos, B., Louloudis, G., Stamatopoulos, N.: ICFHR
2014 Competition on Handwritten Keyword Spotting (H-KWS 2014). In: Frontiers
in Handwriting Recognition (ICFHR), 2014 14th International Conference on. pp.
814–819 (Sept 2014), doi:10.1109/ICFHR.2014.142
12. Puigcerver, J., Toselli, A.H., Vidal, E.: ICDAR2015 competition on keyword
spotting for handwritten documents. In: Document Analysis and Recognition
(ICDAR), 2015 13th International Conference on. pp. 1176–1180 (Aug 2015),
doi:10.1109/ICDAR.2015.7333946
13. Puigcerver, J., Toselli, A.H., Vidal, E.: Probabilistic interpretation and
improvements to the HMM-filler for handwritten keyword spotting. In: Document Analysis
and Recognition (ICDAR), 2015 13th International Conference on. pp. 731–735
(Aug 2015), doi:10.1109/ICDAR.2015.7333858
14. Robertson, S.: A new interpretation of average precision. In: Proc. of the
International ACM SIGIR conference on Research and development in
information retrieval (SIGIR '08). pp. 689–690. ACM, New York, NY, USA (2008),
doi:10.1145/1390334.1390453
15. Rodríguez-Serrano, J.A., Perronnin, F.: Handwritten word-spotting using hidden
Markov models and universal vocabularies. Pattern Recognition 42, 2106–2116
(September 2009), doi:10.1016/j.patcog.2009.02.005
16. Strauß, T., Gruning, T., Leifert, G., Labahn, R.: CITlab ARGUS for Keyword
Search in Historical Handwritten Documents - Description of CITlab's System
for the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task. In:
CLEF2016 Working Notes. CEUR Workshop Proceedings, vol. 1609.
CEUR-WS.org, Evora, Portugal (September 5-8, 2016)
17. Sanchez, J.A., Toselli, A.H., Romero, V., Vidal, E.: ICDAR 2015 competition
HTRtS: Handwritten text recognition on the tranScriptorium dataset. In: Document
Analysis and Recognition (ICDAR), 2015 13th International Conference on. pp.
1166–1170 (Aug 2015), doi:10.1109/ICDAR.2015.7333944
18. Toselli, A.H., Juan, A., Keysers, D., Gonzalez, J., Salvador, I., Ney, H., Vidal,
E., Casacuberta, F.: Integrated Handwriting Recognition and Interpretation using
Finite-State Models. Int. Journal of Pattern Recognition and Artificial Intelligence
18(4), 519–539 (June 2004), doi:10.1142/S0218001404003344
19. Toselli, A.H., Puigcerver, J., Vidal, E.: Context-aware lattice-based filler approach
for keyword spotting in handwritten documents. In: Document Analysis and
Recognition (ICDAR), 2015 13th International Conference on. pp. 736–740 (Aug
2015), doi:10.1109/ICDAR.2015.7333859
20. Villegas, M., Muller, H., García Seco de Herrera, A., Schaer, R., Bromuri, S.,
Gilbert, A., Piras, L., Wang, J., Yan, F., Ramisa, A., Dellandrea, E., Gaizauskas,
R., Mikolajczyk, K., Puigcerver, J., Toselli, A.H., Sanchez, J.A., Vidal, E.: General
Overview of ImageCLEF at the CLEF 2016 Labs. In: Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Lecture Notes in Computer Science,
Springer International Publishing (2016)
21. Villegas, M., Puigcerver, J., Toselli, A.H.: ImageCLEF 2016 Bentham Handwritten
Retrieval Dataset (2016), doi:10.5281/zenodo.52994
22. Villegas, M., Romero, V., Sanchez, J.A.: On the Modification of Binarization
Algorithms to Retain Grayscale Information for Handwritten Text Recognition.
In: 7th Iberian Conference on Pattern Recognition and Image Analysis, LNCS,
vol. 9117, pp. 208–215. Springer, Santiago de Compostela (Spain) (Jun 2015),
doi:10.1007/978-3-319-19390-8_24
23. Zagoris, K., Ergina, K., Papamarkos, N.: Image retrieval systems based on compact
shape descriptor and relevance feedback information. Journal of Visual
Communication and Image Representation 22(5), 378–390 (2011)
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aldavert</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusinol</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toledo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Llados</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Integrating Visual and Textual Cues for Query-by-String Word Spotting</article-title>
          .
          <source>In: Document Analysis and Recognition (ICDAR)</source>
          ,
          <year>2013</year>
          12th International Conference on. pp.
          <volume>511</volume>
–
          <issue>515</issue>
          (Aug
          <year>2013</year>
), doi:10.1109/ICDAR.2013.108
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Causer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Terras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>'Many hands make light work. Many hands together make merry work': Transcribe Bentham and crowdsourcing manuscript collections</article-title>
          .
          <source>Crowdsourcing Our Cultural Heritage</source>
          pp.
          <volume>57</volume>
–
          <issue>88</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Causer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Building a volunteer community: results and ndings from Transcribe Bentham</article-title>
          .
          <source>Digital Humanities Quarterly</source>
          <volume>6</volume>
          (
          <year>2012</year>
          ), http://www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Keller, A.,
          <string-name>
            <surname>Frinken</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
          </string-name>
          , H.:
          <article-title>Lexicon-free handwritten word spotting using character HMMs</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>33</volume>
          (
          <issue>7</issue>
          ),
          <volume>934</volume>
–
          <fpage>942</fpage>
          (
          <year>2012</year>
          ),
          <article-title>Special Issue on Awards from ICPR 2010</article-title>
          . doi:10.1016/j.patrec.2011.09.009
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Frinken</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
          </string-name>
          , H.:
          <article-title>A Novel Word Spotting Algorithm Using Bidirectional Long Short-Term Memory Neural Networks</article-title>
          . In: Schwenker,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>El Gayar</surname>
          </string-name>
          , N. (eds.)
          <source>Artificial Neural Networks in Pattern Recognition, Lecture Notes in Computer Science</source>
          , vol.
          <volume>5998</volume>
          , pp.
          <volume>185</volume>
–
          <fpage>196</fpage>
          . Springer Berlin / Heidelberg (
          <year>2010</year>
          ),
          <source>doi:10.1007/978-3-642-12159-3_17</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>