<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MayoBMI at ImageCLEF 2016 Handwritten Document Retrieval Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sijia Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanshan Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saeed Mehrabi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dingcheng Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongfang Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Section of Biomedical Informatics, Mayo Clinic</institution>
          ,
          <addr-line>Rochester, MN 55905</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this working note, we introduce our participation in the ImageCLEF 2016 Handwritten Document Retrieval Task. We mainly focused on hyphenation detection using line images and information retrieval using n-best results. The hyphenation detection step utilizes image features extracted from the beginning and end of a line and a binary classifier to determine whether a line contains hyphenation. Then the spell correction step is used to eliminate spelling errors from the concatenation of a broken word from the end of a line and the beginning of the next line. The final text retrieval step employs a suffix stripping algorithm to normalize word tense and form and the TF-IDF scheme to rank the retrieved relevant segment results of our submission.</p>
      </abstract>
      <kwd-group>
        <kwd>handwriting recognition</kwd>
        <kwd>hyphenation detection</kwd>
        <kwd>text retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        For the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ],
our aim is to develop a document retrieval system that retrieves the relevant
segments and word bounding boxes for given strings of test queries. An intuitive
solution for this task generally includes three components: handwritten text
recognition, keyword spotting and text retrieval. To obtain relatively accurate
transcripts from document images, image preprocessing methods such as
image binarization and text line extraction are generally used [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ]. Based on
whether they try to generate transcripts from the handwritten text images as
an intermediate step, there are mainly two categories of solutions: recognition-based
approaches and keyword spotting-based approaches. For the recognition-based
approaches, there are two kinds of models commonly used in
state-of-the-art handwritten recognition systems for historical documents: Recurrent Neural
Networks (RNN) with Connectionist Temporal Classification (CTC) [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ] and
Hidden Markov Models (HMM) [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7,8,9</xref>
        ]. Both of these models can achieve high
recognition accuracy, as measured by word and character error rates. For
keyword spotting-based approaches, depending on whether the query is a word image
or a string, systems can either query the keyword by comparing
the features of query images with the indexed features of existing images in the
dataset, or transcribe the document before retrieving the query string. Language
models such as n-gram and HMM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] can also be used to improve the overall
keyword spotting performance.
      </p>
      <p>Fig. 1. Example line images with ground truth vs. 1-best transcripts: (a) "6. The evidence of the engagement, consigned to a portable" vs. "6 . The evidence of the engagement consigned to a /*missing*/"; (b) "instrument, instead of a fixed Book. Taken from Exche" vs. "instrument , instead of a by Each Silver from The"; (c) "quer Bills,- Differs from Stock Annuities- agrees with" vs. "our Bills differs from Stock Annuities degrees with".</p>
      <p>Although handwritten document recognition has attracted much
attention in the areas of image processing and machine learning, the impact of
hyphenated words is not well studied. Hyphenated words, also known as broken
words, are words split at the end of a text line and
continued at the beginning of the next line because of the manuscript writer's
intention to save space. Hyphens mark such broken words at the
end of a line by "-" or "=" and at the beginning of a line by ":". Hyphenations may
cause recognition errors if they are not processed properly. Example line images,
recognition errors in 1-best transcripts and the corresponding ground truth from
the task dataset are shown in Fig. 1. In this working note, we
describe a solution to detect words with hyphenation in segmented line images,
which can further be used to improve the retrieval results for given query
strings.</p>
      <p>The rest of this working note is organized as follows. First, we introduce
our proposed methods in Section 2, including the steps of image preprocessing of
segmented line images, hyphenation detection, spell correction using the training
transcript, and text retrieval. Second, the evaluation of the effectiveness of the
hyphenation detection methods and the official task evaluation of our submitted
run are discussed in Section 3. Finally, we conclude our work and propose
future improvements.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Methods</title>
      <p>In our proposed solution to the ImageCLEF 2016 Handwritten Document Retrieval
Task, we mainly focused on how to detect hyphenated words, how to perform
spell correction for detected hyphenated words, and how to retrieve the relevant
segments based on the 1-best results. In this section, we elaborate the details
of each step. We first describe the preprocessing step, in which the segmented
grayscale line images are normalized into fixed-height binary images (Section 2.1).
Then, using the normalized line images, a hyphenation detection method is proposed
(Section 2.2) to detect lines with hyphenation, followed by the spell correction
(Section 2.3) and text retrieval (Section 2.4) steps.</p>
      <p>Preprocessing
In this step, our goal is to obtain noise-free and slant-free
binary line images. In related work, skew correction may also be necessary
before the recognition or word spotting step. However, in the task dataset, lines
are well segmented with only negligible slope, which spares us from skew
correction. The slant of written lines is removed by applying a two-dimensional
affine transformation to the original line images. The transformation matrix is
chosen empirically based on observations from randomly selected line images in
the training set. Afterwards, a global threshold is applied to the slant-corrected
grayscale line images to generate the binary line images. We also resize the line
images to a fixed height of 30 pixels, with the width scaled proportionally.</p>
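      <p>As a concrete sketch of this preprocessing step, the following Python code applies a horizontal shear standing in for the two-dimensional affine slant correction, a global threshold, and a proportional resize to a height of 30 pixels. The shear and threshold values here are illustrative assumptions, not the ones used in our system.</p>

```python
import numpy as np

def preprocess_line(gray, shear=-0.25, height=30, thresh=128):
    """Slant-correct, binarize, and height-normalize one grayscale line image.

    gray: 2-D uint8 array, background near 255, ink near 0. The shear and
    threshold values are illustrative assumptions.
    """
    h, w = gray.shape
    # Horizontal shear approximating the 2-D affine slant correction:
    # each row is shifted proportionally to its distance from the baseline.
    out = np.full_like(gray, 255)
    for y in range(h):
        dx = int(round(shear * (h - 1 - y)))
        xs = np.arange(w) - dx                      # source column indices
        valid = np.logical_and(xs >= 0, w > xs)
        out[y, valid] = gray[y, xs[valid]]
    # Global threshold: 1 = ink, 0 = background.
    binary = (thresh > out).astype(np.uint8)
    # Resize to a fixed height, width scaled proportionally
    # (nearest-neighbour sampling keeps the sketch dependency-free).
    new_w = max(1, int(round(w * height / h)))
    rows = np.arange(height) * h // height
    cols = np.arange(new_w) * w // new_w
    return binary[np.ix_(rows, cols)]
```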
      <p>Hyphenation Detection
To detect lines with beginning and ending hyphenations, image windows
at both the beginning and end of each line image are obtained. Several
image features are then extracted from the image windows, and various binary
classifiers are used to detect hyphenations. As a binary classification problem,
and according to the writing style of the document writer, lines containing both an
end hyphen and a beginning hyphen in the next line are considered lines with
hyphenation. Lines with such hyphenations are labeled as positive, while the
others are labeled as negative. This strict labeling rule helps eliminate
false positives in the prediction results.</p>
      <p>Several binary classification methods are applied to the image features in the
training set. To evaluate the performance of these methods, precision, recall and
F-score are used as metrics on the development set. In this task, the ground
truth transcripts of both the training and development sets are provided, so the
ground truth labels can be obtained and compared with the hyphenation
detection results.</p>
      <p>
        To represent the binary line image windows as feature vectors, a set of local
features is extracted from the beginning and ending windows of
each line. These features have been used in previous work on keyword spotting
approaches [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. There are 8 features of the image window describing whether
it contains hyphens: the horizontal and vertical locations of the first
non-background pixel in the window, the horizontal and vertical locations of the
last non-background pixel in the window, the average intensity, the second-order
moment, and the coordinates of the window centroid. Further, the window is
cropped to a tight rectangular bounding box by removing all rows with
no non-background pixel on the boundary. Then the average and second-order
moment of intensity and the window centroid are recalculated and combined with
the previous features. In addition, the numbers of non-background pixels are summed
per row and per column, which generates horizontal and vertical pixel
histograms. In this work, we use a window with a width of 20 and a height of 30.
Therefore, a feature vector with 8 + 4 + 20 + 30 = 62 dimensions is extracted to
represent the window. For each line, both the ending window of the current line
and the beginning window of the next line are used to determine whether the
line contains hyphenation or not. Thus, the dimension of the feature vector of each
line is doubled, resulting in a feature vector of 2 × 62 = 124 dimensions per
line. The feature vector is then used as the input of various machine learning
methods.
      </p>
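      <p>The feature extraction above can be sketched as follows; the exact ordering of the 62 features is an assumption, and the window is taken as a binary array with 30 rows and 20 columns.</p>

```python
import numpy as np

def window_features(win):
    """62-dim feature vector for one binary window (30 rows x 20 cols, 1 = ink),
    following the features listed above; the exact ordering is an assumption."""
    ys, xs = np.nonzero(win)
    if len(ys) == 0:
        base = [0.0] * 8                          # no ink: zero placeholders
        crop = win
    else:
        base = [xs.min(), ys.min(),               # first ink pixel (x, y)
                xs.max(), ys.max(),               # last ink pixel (x, y)
                win.mean(),                       # average intensity
                (win.astype(float) ** 2).mean(),  # second-order moment
                xs.mean(), ys.mean()]             # window centroid (x, y)
        # Tight crop to the bounding box of the ink.
        crop = win[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    if crop.any():
        cy, cx = np.nonzero(crop)
        recrop = [crop.mean(), (crop.astype(float) ** 2).mean(),
                  cx.mean(), cy.mean()]           # recomputed on the crop
    else:
        recrop = [0.0] * 4
    col_hist = win.sum(axis=0)                    # 20 per-column ink counts
    row_hist = win.sum(axis=1)                    # 30 per-row ink counts
    return np.concatenate([base, recrop, col_hist, row_hist]).astype(float)
```

For each line, the vector of its ending window is concatenated with that of the beginning window of the next line, giving the 124-dimensional classifier input.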
      <p>
        The classifiers investigated are implemented in Scikit-Learn [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
classifiers tested on the development set are: 5-Nearest Neighbors, Decision
Tree, Random Forest, AdaBoost, and Naive Bayes with a Gaussian kernel. Among
these classifiers, our experiments on the development set suggested that AdaBoost
provides the best prediction.
      </p>
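      <p>A minimal comparison of the five classifiers can be sketched with Scikit-Learn as follows; the synthetic feature vectors and labels are placeholders for the real 124-dimensional line features, with a class balance roughly mimicking the 8% positive rate of the task data.</p>

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 124-dim line features; roughly 8% positives.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 124))
y_train = (X_train[:, 0] + X_train[:, 1] > 2.0).astype(int)
X_dev = rng.normal(size=(200, 124))
y_dev = (X_dev[:, 0] + X_dev[:, 1] > 2.0).astype(int)

classifiers = {
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "GaussianNB": GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    p, r, f, _ = precision_recall_fscore_support(
        y_dev, clf.predict(X_dev), average="binary", zero_division=0)
    print(f"{name}: P={p:.2f} R={r:.2f} F={f:.2f}")
```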
      <p>Spell Correction
Once the hyphenation of a line is detected, the last word of the current
line and the first word of the next line are concatenated. The text of the two
word windows is extracted from the 1-best result, and all special characters
are removed before concatenation.</p>
      <p>In the 1-best documents of both the training and development datasets,
the recognition accuracy for words without hyphens is relatively high. However, for
hyphenated words, word spotting algorithms tend to predict the given word
image as the most similar complete word, instead of considering it as only the
front part or the back part of a word. For example, if the word "testimony" is
broken as "testi-mony" across the end and beginning of two lines, the latter part
is more likely to be predicted as "many", which is a complete word, instead
of "mony", which is the suffix in the ground truth. Fig. 1 also shows some such
examples.</p>
      <p>To correct the recognition errors caused by hyphenation, a spell correction
step is applied after the concatenation of the two broken words. The ground truth
transcript from the training set is used to generate a dictionary of the dataset. All
the words in the training and development sets are converted to lower case before
being put into the dictionary. A spell corrector is then used. The spell corrector considers
corrections at an edit distance of no more than 2 from the original word. The
correction with the maximum likelihood given the current prediction in the dictionary
is chosen. If no matching correction under these criteria is found, the original
word remains unchanged. The implementation of the spell corrector can be
found in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>[Figure: system architecture showing the query, transcripts, preprocessing, document indexing, the segment retrieval model, the retrieval request, index ranking, and the result.]</p>
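      <p>A minimal sketch of such a Norvig-style spell corrector follows; the toy vocabulary stands in for the dictionary built from the training transcript.</p>

```python
from collections import Counter

# Toy stand-in for the dictionary built from the training transcript.
vocab = Counter("the evidence of the engagement testimony instrument".split())

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings at edit distance 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings at edit distance 2 from `word`."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def correct(word):
    """Most likely in-vocabulary correction within edit distance 2,
    or the word itself if none is found."""
    word = word.lower()
    for candidates in ([word], edits1(word), edits2(word)):
        known = [w for w in candidates if w in vocab]
        if known:
            return max(known, key=vocab.get)
    return word
```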
      <p>Each annotation term is represented as
L T W H X Y; (Format 1)
where L is the line number, T represents the word, W and H are the width and
height of the bounding box, and X and Y are the top-left coordinates of the box,
respectively.</p>
      <p>We used the software package Elasticsearch (https://www.elastic.co/products/elasticsearch) to index the segments. A
total of 16939 segments were indexed into three fields: "ID", "contents" and
"annotations". In the "ID" field, the ID of the first line in each segment was
used to distinguish the segments. In the "contents" field, we only indexed the
words, i.e., T in Format 1, in the segments. We applied the following hyphen rule
to take the hyphenated words into consideration:</p>
      <p>
        If line i ends with any of =, -, or :, and line i + 1 starts with either - or :,
then the last term in line i and the first term in line i + 1 are
considered a hyphenated word. (Rule 1)
      </p>
      <p>
        The broken words are concatenated together after removing the hyphens. The
spell corrector described in the previous section is then applied. Before indexing, we
also applied the Porter stemmer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to the "contents" field. By doing so,
words like "abuses", "abusing" and "abused" are also retrieved given the
query "abuse". In the "annotations" field, if Rule 1 fired, the corresponding
annotation terms represented in Format 1 were connected by "=:". Therefore,
for each segment the number of words in the "contents" field is equal to the
number of annotation terms in the "annotations" field.
      </p>
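      <p>Rule 1 and the concatenation of broken words can be sketched as follows; the helper names are illustrative, not from our implementation.</p>

```python
def is_hyphenated(line_i, line_next):
    """Rule 1: line i ends with '=', '-' or ':' and line i+1 starts
    with '-' or ':'."""
    line_i, line_next = line_i.rstrip(), line_next.lstrip()
    return bool(line_i) and bool(line_next) and (
        line_i[-1] in "=-:") and (line_next[0] in "-:")

def join_broken(last_word, first_word):
    """Concatenate the two halves of a broken word after removing hyphens."""
    return last_word.rstrip("=-:") + first_word.lstrip("-:")
```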
      <p>
        In the retrieval component, we utilize the Term Frequency-Inverse
Document Frequency (TF-IDF) scheme [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to rank the segments by the
"contents" field and retrieve the top 30 segments. According to the positions of
the matched words in the "contents" field, the corresponding annotation terms are
retrieved from the "annotations" field. Using the annotation terms, the bounding
boxes as well as the queries form the final submission.
      </p>
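      <p>The ranking step can be illustrated with a TF-IDF model from Scikit-Learn in place of Elasticsearch; the toy segments stand in for the indexed "contents" field.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the 16939 indexed segment "contents" fields.
segments = [
    "the evidence of the engagement consigned to a portable instrument",
    "differs from stock annuities agrees with",
    "building a volunteer community",
]
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(segments)

def rank(query, top_k=30):
    """Return segment indices ranked by TF-IDF cosine similarity,
    keeping only segments with a nonzero score."""
    scores = cosine_similarity(vec.transform([query]), doc_matrix).ravel()
    order = scores.argsort()[::-1][:top_k]
    return [i for i in order if scores[i] > 0]

# rank("stock annuities") -> [1]
```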
    </sec>
    <sec id="sec-3">
      <title>Experiments and Evaluation</title>
      <p>In this section, we elaborate the details of the experiments, the performance of
hyphenation detection, and the official ImageCLEF 2016 Handwritten Scanned Document
Retrieval Task evaluation.</p>
      <p>
        The task dataset is a subset of scanned and manually transcribed manuscripts
written by Jeremy Bentham under the Transcribe Bentham project [
        <xref ref-type="bibr" rid="ref15 ref2">2,15</xref>
        ]. The
task dataset consists of three subsets: the training, development and test sets. The
training and development sets contain 9645 and 10589 manually segmented
line images, respectively. The test set contains the 10589 line images of the
development set and 6355 line images exclusive to the test set, for a total
of 16944 lines.
      </p>
      <p>For the hyphenation detection step, the ground truth is extracted from the
transcripts of the training and development sets. As expected, the dataset is
significantly unbalanced. In the training set, there are only 810 positive samples
in 9645 lines, a positive-sample proportion of 8.4%. In the development set,
there are only 853 positive samples in 10589 total samples; the percentage of
positive samples is 8.0%. The models trained on the training set are used for the
classification of the development set. The precision, recall and F-score metrics
on the development set are shown in Table 1. From the experimental results we
observe that AdaBoost is the best-performing classifier for hyphenation detection
on the proposed feature set. The detection algorithm does not perform well on the
development set, thus we do not include it in the final submission. The spell
corrector is still used in the text retrieval step to handle the hyphens already
present in the 1-best results.</p>
      <p>We submitted one run as the task submission, and the official evaluation
results are shown in Table 2. We noticed a significant drop from the
results on the development set to those on the test set. Similar performance
decreases can also be found in the results of the baseline system, which uses
exact string matching and should be robust across different datasets if these
datasets are of similar characteristics and quality. The reason for this significant
performance difference is that the test set is considerably more difficult than
the development set: the bounding boxes are much less accurate and the
image quality is lower.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We discussed our participation in the ImageCLEF 2016 Handwritten Scanned
Document Retrieval Task. Our submitted run is based on a three-step
process: hyphenation detection, spell correction and information retrieval. The
hyphenation detection step utilizes image features extracted from the beginnings and
ends of lines and a binary classifier to determine whether a line contains hyphenation
or not. Then the spell correction step is used to eliminate spelling errors from the
concatenation of a broken word from a line ending and the beginning of the
next line. The spell correction step uses only the vocabulary from the transcript
of the training set. A final information retrieval step employs a suffix stripping
algorithm to normalize word tense and form and the TF-IDF scheme to rank
the retrieved 30 segments as the output of our system.</p>
      <p>There are several future directions that can be investigated to improve the
performance of our system. First, more line image features can be considered
as the input of the supervised binary classification methods. Second, a larger lexicon
can be utilized in both the hyphenation detection step and the spell correction step.
Due to the task restrictions, only the vocabulary in the training set can be used,
and the use of external data for learning a language model is prohibited. A larger
lexicon built from the whole dataset or from external data, rather than only the training set,
would improve the effectiveness of the spell corrector.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>The authors gratefully acknowledge the support from the National Library of
Medicine (NLM) grant R01LM11934.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H., Garc a Seco de Herrera,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Schaer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Bromuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Gilbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Piras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Ramisa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Gaizauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Mikolajczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Puigcerver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Toselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.H.</given-names>
            ,
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.A.</given-names>
            ,
            <surname>Vidal</surname>
          </string-name>
          , E.:
          <article-title>General Overview of ImageCLEF at the CLEF 2016 Labs</article-title>
          . Lecture Notes in Computer Science. Springer International Publishing (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puigcerver</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toselli</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          , E.:
          <article-title>Overview of the ImageCLEF 2016 Handwritten Scanned Document Retrieval Task</article-title>
          .
          <source>In: CLEF2016 Working Notes. CEUR Workshop Proceedings</source>
          , Évora, Portugal, CEUR-WS.org (Sep
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Saabni</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Sana</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Text line extraction for historical document images</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>35</volume>
          (
          <year>2014</year>
          )
          <volume>23</volume>
          - 33. Frontiers in Handwriting Processing.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Setlur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Govindaraju</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Text extraction from gray scale historical document images using adaptive local connectivity map</article-title>
          .
          <source>In: Proc. Eighth Int. Conf. Document Analysis and Recognition. ICDAR 05</source>
          , Washington, DC, USA, IEEE Computer Society (
          <year>August 2005</year>
          )
          <volume>794</volume>
          -798, Vol. 2
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          <volume>21</volume>
          , NIPS'
          <volume>21</volume>
          (
          <year>2008</year>
          )
          <volume>545</volume>
          -
          <fpage>552</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Strauß</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Grüning, T.,
          <string-name>
            <surname>Leifert</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labahn</surname>
          </string-name>
          , R.:
          <article-title>Citlab ARGUS for historical handwritten documents</article-title>
          .
          <source>CoRR abs/1412.3949</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Keller, A.,
          <string-name>
            <surname>Frinken</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
          </string-name>
          , H.:
          <article-title>Lexicon-free handwritten word spotting using character HMMs</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>33</volume>
          (
          <issue>7</issue>
          ) (
          <year>2012</year>
          )
          <volume>934</volume>
          -
          <fpage>942</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Almazan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gordo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fornes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valveny</surname>
          </string-name>
          , E.:
          <article-title>Efficient Exemplar Word Spotting</article-title>
          .
          <source>Procedings of the British Machine Vision Conference</source>
          <year>2012</year>
          (
          <year>2012</year>
          )
          <volume>67</volume>
          .1-
          <fpage>67</fpage>
          .
          <fpage>11</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rodríguez-Serrano</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Handwritten word-spotting using hidden Markov models and universal vocabularies</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>42</volume>
          (
          <issue>9</issue>
          ) (
          <year>2009</year>
          )
          <volume>2106</volume>
          -
          <fpage>2116</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Marti</surname>
            ,
            <given-names>U.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
          </string-name>
          , H.:
          <article-title>Using a statistical language model to improve the performance of an hmm-based cursive handwriting recognition system</article-title>
          .
          <source>International Journal of Pattern Recognition and Artificial Intelligence</source>
          <volume>15</volume>
          (
          <issue>01</issue>
          ) (
          <year>2001</year>
          )
          <volume>65</volume>
          -
          <fpage>90</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <volume>2825</volume>
          -
          <fpage>2830</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Norvig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>How to write a spelling corrector</article-title>
          . http://norvig.com/spell-correct.html
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          <volume>14</volume>
          (
          <issue>3</issue>
          ) (
          <year>1980</year>
          )
          <volume>130</volume>
          -
          <fpage>137</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGill</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Introduction to modern information retrieval</article-title>
          .
          <source>McGraw-Hill, Inc.</source>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Causer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Wallace, V.:
          <article-title>Building a volunteer community: Results and findings from Transcribe Bentham</article-title>
          .
          <source>Digital Humanities Quarterly</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>