<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of WSCG 26 (2018) 31-40. doi:10.24132/JWSCG.2018.26.1.4.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.24132/JWSCG.2018.26.1.4</article-id>
      <article-id pub-id-type="urn">.fi/urn:nbn:fi:</article-id>
      <title-group>
        <article-title>Exploring Handwritten Document Collections: An EPSC-Based Approach for Feature Extraction and Similarity Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anders Hast</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Örjan Simonsson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Uppsala County Archive on Popular Movement</institution>
          ,
          <addr-line>S:t Olofsgatan 15, Uppsala, 753 21</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Uppsala University</institution>
          ,
          <addr-line>Lägerhyddsvägen 1, Uppsala, 751 05</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1988</year>
      </pub-date>
      <volume>2</volume>
      <fpage>147</fpage>
      <lpage>151</lpage>
      <abstract>
        <p>This work-in-progress paper presents a novel approach for the classification and analysis of handwritten documents using a combination of Embedded Prototype Subspace Classification (EPSC) and advanced clustering techniques. We focus on facilitating the examination of document collections by enabling efficient comparisons between documents written by different hands. Our methodology extracts features from keypoints detected in the handwritten text, which are then processed using t-SNE and modified K-Means clustering to identify clusters of similar features. The novelty lies in a similarity score that quantifies the likeness between document pairs, enabling the identification of stylistic similarities even in the absence of ground truth. An interactive visual application is developed to assist users in exploring the collection, providing insights into the nature of each document, including the differentiation between typewritten and handwritten texts. Our preliminary experiments demonstrate promising results, indicating that documents written by the same hand tend to cluster together while documents in different writing styles are kept apart. However, we acknowledge that there is room for improvement, particularly in optimising the keypoint detection, feature extraction, and background removal processes, as well as in determining optimal thresholds. Future work will address these limitations, enhancing the robustness of our method and expanding its applicability to a wider range of documents.</p>
      </abstract>
      <kwd-group>
        <kwd>Document Collections</kwd>
        <kwd>Writing style</kwd>
        <kwd>Visualisation</kwd>
        <kwd>Exploration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>We present a work in progress aimed at facilitating a quick and efficient overview of documents
with varying handwriting styles, which is crucial for many applications in historical document
analysis and archival research. Handwriting analysis has traditionally been a challenging
task. However, with the increasing volume of digitised handwritten documents, particularly in
historical collections, there is a growing need for automated methods to assist researchers in
managing and exploring these collections more efficiently.</p>
      <p>
        Deep learning methods have demonstrated effectiveness in identifying writing hands but
require substantial amounts of training data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In contrast, computer vision-based methods
leverage keypoints and local features, which are either clustered to generate a supervector
representing each document [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or processed through a deep learning network [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this paper,
we adopt the keypoint-based approach, incorporating several notable modifications as detailed
below.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>
        Each document, as shown in Figure 1a, is processed as follows: First, the background is removed
to enhance the visibility of the text [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], minimising the impact of background noise and ensuring
that keypoints are computed only on the text strokes. However, binarisation is not applied
here, as it would introduce unwanted jaggedness and distort the text strokes. Instead, the
background is removed while preserving the text in grayscale, allowing for accurate keypoint
detection without interference from binarisation artifacts. Keypoints are computed using the
Harris detector [5], which identifies descriptive points in the script, as shown in Figure 1b.
      </p>
      <p>
        Next, simple yet descriptive feature vectors are computed using the Radial Line Fourier
Descriptor for Historical Handwritten Text Representation [6]. While more advanced descriptors,
such as the SIFT descriptor [7], could have been employed alongside sophisticated keypoints,
as shown in [
        <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
        ], these methods have both strengths and limitations. For instance, SIFT offers
scale invariance, but its inherent rotation invariance is disadvantageous in this context, where
stroke direction is crucial. Additionally, SIFT detects blobs using the difference-of-Gaussians
approach, in contrast to the Harris detector, which focuses on corners. Future work will aim to
refine scale invariance to better accommodate documents with varying script sizes.
      </p>
      <p>For each image, the features obtained at each keypoint are processed using t-SNE [8], followed
by a modified K-Means clustering algorithm to identify groups of similar features [9], as
illustrated in Figure 2. Unlike traditional K-Means, where initial cluster centroids are chosen
randomly, the starting values in this method are strategically placed at the cluster centers on a
predefined scale. This modification improves the clustering by providing a more informed
starting point, leading to more accurate grouping of similar features.</p>
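      <p>The idea of replacing random initialisation with informed starting points can be sketched as follows. This is only an illustration under stated assumptions: the text does not spell out how the starting values are placed, so the hypothetical helper grid_seeds stands in for that step by taking the centres of locally densest cells in a coarse histogram of the 2-D embedding, with the bin count playing the role of the predefined scale. The function names, parameters, and toy data are not taken from the method itself.</p>
      <preformat>
```python
import numpy as np

def grid_seeds(points, bins=8, min_count=2):
    """Hypothetical seeding: centres of histogram cells that are local density maxima."""
    hist, x_edges, y_edges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    padded = np.pad(hist, 1, constant_values=-1.0)
    seeds = []
    for i in range(bins):
        for j in range(bins):
            neighbourhood = padded[i:i + 3, j:j + 3]  # the cell and its 8 neighbours
            if hist[i, j] >= min_count and hist[i, j] == neighbourhood.max():
                seeds.append([(x_edges[i] + x_edges[i + 1]) / 2.0,
                              (y_edges[j] + y_edges[j + 1]) / 2.0])
    return np.array(seeds)

def kmeans_from_seeds(points, seeds, iterations=20):
    """Plain Lloyd iterations started from the given seeds instead of random picks."""
    centroids = np.array(seeds, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep an empty cluster's centroid where it is
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Toy 2-D embedding with two well-separated groups (illustrative data only).
embedding = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                      [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1]])
labels, centroids = kmeans_from_seeds(embedding, grid_seeds(embedding))
print(labels)
```
      </preformat>
      <p>Because the seeds are derived from the data rather than drawn at random, repeated runs on the same embedding give the same clusters, and the number of clusters follows from the chosen scale instead of being fixed in advance.</p>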
      <p>While this approach shares similarities with previously published methods, such as keypoint
clustering, its novelty lies in using clusters as subspaces that capture variations in stroke
execution with similar appearances, as well as in the novel computation of supervectors. The
following section outlines how these clusters are employed to develop a classifier.</p>
      <p>Recently, Embedded Prototype Subspace Classification (EPSC) [9, 10, 11, 12] has been
successfully applied to classify diverse datasets. This classifier builds on concepts developed by
Kohonen and others [13, 14, 15, 16, 17] and can be viewed as a two-layer neural network
[11, 17, 18], where the weights are mathematically derived using Principal Component Analysis
(PCA) [18]. Prototypes are identified using t-SNE and the modified K-Means clustering, as
previously described, where each feature within a cluster serves as a prototype. This approach
makes the process easy to interpret through visualisations.</p>
      <p>Figure 1: (a) Original document. (b) The background has been removed and keypoints appear on the text strokes.</p>
      <p>To compare two documents, A and B, we follow this procedure: The first eigenvector from
the PCA of each cluster in document A is projected into the subspaces obtained from the
clusters of document B. Notably, using only the first eigenvector in the subspace appears to
yield reasonable results. Therefore, the comparison primarily relies on the dot product of the
first eigenvectors, as the projection depth is effectively 1. Future work will explore the impact
of varying projection depths to further refine this approach.</p>
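      <p>With a projection depth of 1, the comparison above reduces to dot products between unit vectors. The following is a minimal sketch of that step, not the authors' implementation. It assumes uncentred PCA computed through a singular value decomposition, takes the absolute value of the dot product because the sign of an eigenvector is arbitrary, and keeps, for each cluster in A, the strongest response over all clusters in B. These choices, and the names first_eigenvector and activations, are assumptions made for illustration.</p>
      <preformat>
```python
import numpy as np

def first_eigenvector(features):
    """Unit-length first principal direction of one cluster (rows are feature vectors)."""
    x = np.asarray(features, dtype=float)
    # Assumption: uncentred PCA; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

def activations(clusters_a, clusters_b):
    """For every cluster in A, the strongest response over all clusters in B."""
    basis_b = np.stack([first_eigenvector(c) for c in clusters_b])
    result = []
    for cluster in clusters_a:
        u = first_eigenvector(cluster)
        # Projecting onto a one-dimensional subspace is a dot product;
        # abs() removes the arbitrary sign of the eigenvectors.
        result.append(float(np.max(np.abs(basis_b @ u))))
    return result

# Toy clusters: A's only cluster points along the first axis (illustrative data only).
cluster_a = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
cluster_b1 = [[0.0, 1.0], [0.0, 2.0]]   # orthogonal direction
cluster_b2 = [[2.0, 0.0], [4.0, 0.0]]   # same direction as cluster_a
print(activations([cluster_a], [cluster_b1, cluster_b2]))
```
      </preformat>
      <p>An activation close to 1 means that a cluster in B spans nearly the same stroke direction pattern as the cluster in A, while a value close to 0 means the two are nearly orthogonal.</p>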
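      <p>To track the similarity between clusters in documents A and B, we compute two lists. The
first list contains the identities of the clusters in A, while the second list, denoted as <inline-formula><tex-math>L</tex-math></inline-formula>, holds
the corresponding activation values. Given that two documents typically have differing numbers
of clusters, and that multiple clusters in B can correspond to a single cluster in A, we define the
similarity between the documents as the ratio of activated clusters in A that exceed a certain
activation threshold <inline-formula><tex-math>\tau</tex-math></inline-formula> to the total number of activated clusters in A.</p>
      <p>The similarity ratio <inline-formula><tex-math>S</tex-math></inline-formula> is calculated as follows:</p>
      <disp-formula><tex-math>S = \frac{|\{\, l \in L : l &gt; \tau \,\}|}{|L|}</tex-math></disp-formula>
      <p>Here, <inline-formula><tex-math>|L|</tex-math></inline-formula> represents the number of elements in the list <inline-formula><tex-math>L</tex-math></inline-formula>. In our experiments, we set <inline-formula><tex-math>\tau = 0.92</tex-math></inline-formula>.
However, further experimentation is necessary to determine the optimal threshold, as it may
vary depending on the specific methods used for feature extraction and clustering.</p>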
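      <p>As a worked illustration of the ratio described above, the sketch below counts how many activation values exceed the threshold and divides by the length of the list. The function name similarity_ratio, the handling of an empty list, and the example values are assumptions; only the threshold of 0.92 comes from the experiments.</p>
      <preformat>
```python
def similarity_ratio(activation_values, threshold=0.92):
    """Fraction of activation values that exceed the activation threshold."""
    if not activation_values:
        # Assumption: with no activated clusters there is no evidence of similarity.
        return 0.0
    above = sum(1 for value in activation_values if value > threshold)
    return above / len(activation_values)

# Illustrative activations for four clusters; two of them exceed 0.92.
example = [0.97, 0.95, 0.90, 0.60]
print(similarity_ratio(example))  # 2 of 4 values exceed the threshold: 0.5
```
      </preformat>
      <p>Raising the threshold makes the score stricter, since fewer clusters count as matched, which is one reason the value is expected to depend on the feature extraction and clustering that precede it.</p>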
      <p>One advantage of the approach presented above is that each document is uniquely
characterised by its own description based on EPSC. However, the supervector is derived from a row
in the correlation matrix, where each document page is compared against all other pages. The
key idea is that pages written by the same hand are expected to show higher similarity, reflected
in a greater similarity value, whereas pages written by different hands should exhibit lower similarity,
resulting in a smaller similarity value.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The Labour’s Memory Project</title>
      <p>While the primary objective of the approach outlined here is not writer identification but
rather to facilitate the browsing of document collections, it can still be effectively utilised for
that purpose. Both tasks are conducted using data from the Labour’s Memory infrastructure
project, launched in 2021 to digitise and present annual reports and financial records from
blue-collar labour organisations spanning the period from 1880 to 2020. The project includes
materials from Swedish unions at various levels (local, district, and national) and international
labour organisations, spanning repositories such as Folkrörelsearkivet för Uppsala län (FAC) and
Arbetarrörelsens arkiv och bibliotek (ARAB). The corpus, primarily in Swedish with some English,
German, French, and Spanish, is estimated to consist of 1–1.5 million pages, with 300,000 pages
digitised by 2024.</p>
      <p>The local organisations’ annual reports, housed at FAC, consist of approximately 35,000
pages, often handwritten or typewritten and rarely professionally published. These texts,
along with their manually transcribed counterparts, are essential for developing handwritten
text recognition (HTR) models. In contrast, national organisations produced professionally
printed annual reports for broader audiences. The handwritten reports, often created by
secretaries, chairs, auditors, or cashiers, exhibit diverse styles, reflecting varying levels of skill
and consistency.</p>
      <sec id="sec-3-1">
        <title>3.1. Writer Identification</title>
        <p>An experiment was conducted in which three pages from ten different documents, each written
by a distinct author, were selected. The EPSC for each document was computed, and the
similarity score was calculated as described in Section 2. To make the experiment
more challenging, the last page of each document was also included; these pages often
contain written text on only half of the page.</p>
        <p>Figure 3 shows one page from each of the ten documents, together with a dendrogram
computed from a correlation matrix built from all similarity scores,
as previously explained. Since there is no ground truth available, the documents had to be
selected manually, and therefore we chose to use only a limited number. Although the documents
were grouped perfectly in this experiment, such a result is not guaranteed in general. Nevertheless, the writing
styles of some documents are quite similar, which provides an indication of the performance of
the proposed algorithm.</p>
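        <p>The step from pairwise similarity scores to a dendrogram can be sketched with standard hierarchical clustering. The example below is an illustration under assumptions rather than the procedure used for Figure 3: it treats each row of the score matrix as the supervector of one page, compares rows with Euclidean distance, and uses average linkage from SciPy. The function name group_pages and the toy scores are hypothetical.</p>
        <preformat>
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_pages(similarity, n_groups):
    """Cluster pages hierarchically, using each matrix row as that page's supervector."""
    rows = np.asarray(similarity, dtype=float)
    # Assumption: Euclidean distance between rows and average linkage.
    tree = linkage(rows, method="average", metric="euclidean")
    # Cut the tree so that at most n_groups flat clusters remain.
    groups = fcluster(tree, t=n_groups, criterion="maxclust")
    return tree, groups

# Illustrative scores for four pages: pages 0 and 1 resemble each other, as do 2 and 3.
scores = [[1.0, 0.9, 0.1, 0.2],
          [0.9, 1.0, 0.2, 0.1],
          [0.1, 0.2, 1.0, 0.8],
          [0.2, 0.1, 0.8, 1.0]]
tree, groups = group_pages(scores, n_groups=2)
print(groups)
```
        </preformat>
        <p>The returned tree can be passed to scipy.cluster.hierarchy.dendrogram to draw the hierarchy. Working on whole rows rather than on single scores means that two pages are grouped when they relate to all other pages in a similar way.</p>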
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Browsing the collection</title>
        <p>An interactive application was developed to enable users to browse a subsection of the entire
document collection, specifically consisting of the already transcribed documents. Each
document is represented as a blue dot in the t-SNE visualisation in Figure 4. The document selected
by the user is displayed in the lower left corner, while a close-up view is shown in the lower
right corner, enabling a more detailed examination of the writing style. The identifying number
and name of the chosen document are shown at the top.</p>
        <p>Interestingly, it was discovered that there were numerous typewritten documents in the
collection. The user was able to identify them within just a few minutes, as they are located
to the left of the red curve, which was added to highlight their position. Browsing through all
1,700 documents manually, on the other hand, would have been rather cumbersome.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future work</title>
      <p>The two small experiments demonstrate promising results, suggesting that the proposed
approach is effective. However, there is significant room for improvement. As previously
mentioned, both the keypoint detector and the feature extractor can be optimised, and the activation
threshold should be further investigated. The clustering shown in Figure 4 indicates that improvements
ought to be possible, as typewritten documents should ideally occupy a distinct region separate
from handwritten documents. However, it is important to note that some documents contain
areas with both typewritten and handwritten text. Furthermore, the background removal process
could be optimised to effectively handle documents with varying levels of background noise.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially funded through the Labour’s Memory project, which received support
from Riksbankens Jubileumsfond under Grant Agreement No. IN20-0040.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stutzmann</surname>
          </string-name>
          ,
          <article-title>Artificial paleography: Computational approaches to identifying script types in medieval manuscripts</article-title>
          ,
          <source>Speculum</source>
          <volume>92</volume>
          (
          <year>2017</year>
          )
          <fpage>S86</fpage>
          -
          <lpage>S109</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1086/694112</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sablatnig</surname>
          </string-name>
          ,
          <article-title>Writer identification and writer retrieval using the fisher vector on visual vocabularies</article-title>
          ,
          in:
          <source>2013 12th International Conference on Document Analysis and Recognition</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>545</fpage>
          -
          <lpage>549</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICDAR.2013.114</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gropp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <article-title>Unsupervised Feature Learning for Writer Identification and Writer Retrieval</article-title>
          , in: IEEE (Ed.),
          <source>2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>991</fpage>
          -
          <lpage>997</lpage>
          . URL: https://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2017/Christlein17-UFL.pdf. doi:
          <pub-id pub-id-type="doi">10.1109/ICDAR.2017.165</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Vats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Automatic document image binarization using bayesian optimization</article-title>
          ,
          in:
          <source>Proceedings of the 4th International Workshop on Historical Document Imaging and Processing</source>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>