Exploring Handwritten Document Collections: An EPSC-Based Approach for Feature Extraction and Similarity Analysis

Anders Hast1,*,†, Örjan Simonsson2,†
1 Uppsala University, Lägerhyddsvägen 1, Uppsala, 751 05, Sweden
2 Uppsala County Archive on Popular Movement, S:t Olofsgatan 15, Uppsala, 753 21, Sweden

Abstract
This work-in-progress paper presents a novel approach for the classification and analysis of handwritten documents using a combination of Embedded Prototype Subspace Classification (EPSC) and advanced clustering techniques. We focus on facilitating the examination of document collections by enabling efficient comparisons between documents written by different hands. Our methodology involves the extraction of features from keypoints detected in the handwritten text, which are then processed using t-SNE and modified K-Means clustering to identify clusters of similar features. The novelty lies in a similarity score that is computed to quantify the likeness between document pairs, enabling the identification of stylistic similarities even in the absence of ground truth. An interactive visual application is developed to assist users in exploring the collection, providing insights into the nature of each document, including the differentiation between typewritten and handwritten texts. Our preliminary experiments demonstrate promising results, indicating that documents written by the same hand tend to cluster together while documents in differing writing styles are kept apart. However, we acknowledge that there is room for improvement, particularly in optimising the keypoint detection, feature extraction, and background removal processes, as well as in determining optimal thresholds. Future work will address these limitations, enhancing the robustness of our method and expanding its applicability to a wider range of documents.

Keywords
Document Collections, Writing style, Visualisation, Exploration

IRCDL 2025: 21st Conference on Information and Research Science Connecting to Digital and Library Science, February 20-21 2025, Udine, Italy
* Corresponding author.
† These authors contributed equally.
anders.hast@it.uu.se (A. Hast); orjan.simonson@fauppsala.se (Ö. Simonsson)
https://andershast.com/ (A. Hast); https://www.fauppsala.se/ (Ö. Simonsson)
ORCID: 0000-0003-1054-2754 (A. Hast); 0000-0002-6347-4102 (Ö. Simonsson)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

We present a work in progress aimed at facilitating a quick and efficient overview of documents with varying handwriting styles, which is crucial for many applications in historical document analysis and archival research. Handwriting analysis has traditionally been a challenging task. However, with the increasing volume of digitised handwritten documents, particularly in historical collections, there is a growing need for automated methods to assist researchers in managing and exploring these collections more efficiently.

Deep learning methods have demonstrated effectiveness in identifying writing hands but require substantial amounts of training data [1]. In contrast, computer vision-based methods leverage keypoints and local features, which are either clustered to generate a supervector representing each document [2] or processed through a deep learning network [3]. In this paper, we adopt the keypoint-based approach, incorporating several notable modifications as detailed below.

2. Method

Each document, as shown in Figure 1a, is processed as follows. First, the background is removed to enhance the visibility of the text [4], minimising the impact of background noise and ensuring that keypoints are computed only on the text strokes. However, binarisation is not applied here, as it would introduce unwanted jaggedness and distort the text strokes.
Instead, the background is removed while preserving the text in grayscale, allowing for accurate keypoint detection without interference from binarisation artifacts. Keypoints are computed using the Harris detector [5], which identifies descriptive points in the script, as shown in Figure 1b. Next, simple yet descriptive feature vectors are computed using the Radial Line Fourier Descriptor (RLFD) [6], designed for historical handwritten text representation. While more advanced descriptors, such as the SIFT descriptor [7], could have been employed alongside sophisticated keypoints, as shown in [3, 2], these methods have both strengths and limitations. For instance, SIFT offers scale invariance, but its inherent rotation invariance is disadvantageous in this context, where stroke direction is crucial. Additionally, SIFT detects blobs using the difference-of-Gaussians approach, in contrast to the Harris detector, which focuses on corners. Future work will aim to refine scale invariance to better accommodate documents with varying script sizes.

For each image, the features obtained at each keypoint are processed using t-SNE [8], followed by a modified K-Means clustering algorithm to identify groups of similar features [9], as illustrated in Figure 2. Unlike traditional K-Means, where initial cluster centroids are chosen randomly, the initial centroids in this method are strategically placed at the cluster centers on a predefined scale. This modification improves the clustering by providing a more informed starting point, leading to more accurate grouping of similar features.

While this approach shares similarities with previously published methods, such as keypoint clustering, its novelty lies in using clusters as subspaces that capture variations in stroke execution with similar appearances, as well as in the novel computation of supervectors. The following section outlines how these clusters are employed to develop a classifier.
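As an illustration, the embedding-and-clustering step can be sketched as follows using scikit-learn; the grid-based placement of the initial centroids is a simplified stand-in for the placement "at the cluster centers on a predefined scale", which is not specified in detail here:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def cluster_features(features, n_clusters=5, grid=4, random_state=0):
    """Embed keypoint feature vectors with t-SNE, then cluster the 2-D
    embedding with K-Means whose initial centroids are spread over a
    predefined grid instead of being chosen randomly (a sketch of the
    'informed start' idea)."""
    emb = TSNE(n_components=2, random_state=random_state,
               perplexity=min(30, len(features) - 1)).fit_transform(features)
    # Place the initial centroids on a regular grid covering the embedding.
    lo, hi = emb.min(axis=0), emb.max(axis=0)
    xs = np.linspace(lo[0], hi[0], grid)
    ys = np.linspace(lo[1], hi[1], grid)
    centers = np.array([(x, y) for x in xs for y in ys])[:n_clusters]
    labels = KMeans(n_clusters=n_clusters, init=centers, n_init=1).fit_predict(emb)
    return emb, labels
```

Passing an explicit `init` array to `KMeans` (with `n_init=1`) is what replaces the random initialisation of traditional K-Means.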
Recently, Embedded Prototype Subspace Classification (EPSC) [10, 9, 11, 12] has been successfully applied to classify diverse datasets. This classifier builds on concepts developed by Kohonen and others [13, 14, 15, 16, 17] and can be viewed as a two-layer neural network [11, 17, 18], where the weights are mathematically derived using Principal Component Analysis (PCA) [18]. Prototypes are identified using t-SNE and the modified K-Means clustering, as previously described, where each feature within a cluster serves as a prototype. This approach makes the process easy to interpret through visualisations.

Figure 1: The documents (a) are processed so that the background is removed and keypoints are computed (b). (a) Original document. (b) The background has been removed and keypoints appear on the text strokes.

To compare two documents, A and B, we follow this procedure: the first eigenvector from the PCA of each cluster in document A is projected into the subspaces obtained from the clusters of document B. Notably, using only the first eigenvector in the subspace appears to yield reasonable results. Therefore, the comparison primarily relies on the dot product of the first eigenvectors, as the projection depth is effectively 1. Future work will explore the impact of varying projection depths to further refine this approach.

To track the similarity between clusters in documents A and B, we compute two lists. The first list contains the identities of the clusters in A, while the second list, denoted as Act, holds the corresponding activation values. Given that two documents typically have differing numbers of clusters, and that multiple clusters in B can correspond to a single cluster in A, we define the similarity between the documents as the ratio of the number of activated clusters in A that exceed a certain activation threshold θ to the total number of activated clusters in A.
The similarity ratio R is calculated as

R = ( Σ_{i=1}^{n} [Act[i] > θ] ) / ( Σ_{i=1}^{n} [Act[i] ≠ 0] )    (1)

where [·] denotes the Iverson bracket, equal to 1 when the condition holds and 0 otherwise, and n represents the number of elements in the list Act. In our experiments, we set θ = 0.92. However, further experimentation is necessary to determine the optimal threshold, as it may vary depending on the specific methods used for feature extraction and clustering.

Figure 2: The features extracted at each keypoint from a document image are processed using t-SNE, followed by modified K-Means clustering to identify clusters of similar features.

One advantage of the approach presented above is that each document is uniquely characterised by its own EPSC-based description. The supervector, in turn, is derived from a row in the correlation matrix, in which each document page is compared against all other pages. The key idea is that pages written by the same hand are expected to show higher similarity, reflected in a greater R value, whereas pages written by different hands should exhibit lower similarity, resulting in a smaller R value.

3. The Labour’s Memory Project

While the primary objective of the approach outlined here is not writer identification but rather to facilitate the browsing of document collections, it can still be effectively utilised for that purpose. Both tasks are conducted using data from the Labour’s Memory infrastructure project, launched in 2021 to digitise and present annual reports and financial records from blue-collar labour organisations spanning the period from 1880 to 2020. The project includes materials from Swedish unions at various levels (local, district, and national) and international labour organisations, held in repositories such as Folkrörelsearkivet för Uppsala län (FAC) and Arbetarrörelsens arkiv och bibliotek (ARAB). The corpus, primarily in Swedish with some English, German, French, and Spanish, is estimated to consist of 1–1.5 million pages, of which 300,000 had been digitised by 2024.
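A minimal sketch of the comparison described in Section 2, assuming each cluster subspace is reduced to its first eigenvector and that each cluster in A is activated by its best match in B (the matching rule and the use of the absolute dot product are assumptions, not stated in the paper):

```python
import numpy as np

def first_eigenvectors(features, labels):
    """First PCA eigenvector (unit norm) of each feature cluster; each
    cluster is treated as a one-dimensional subspace, i.e. projection
    depth 1 as in the text."""
    vecs = {}
    for c in np.unique(labels):
        X = features[labels == c]
        X = X - X.mean(axis=0)
        # First right singular vector of the centered cluster = first
        # PCA eigenvector.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        vecs[c] = Vt[0]
    return vecs

def similarity_ratio(vecs_a, vecs_b, theta=0.92):
    """Similarity R of Eq. (1): the fraction of activated clusters in A
    whose activation exceeds theta, where the activation of a cluster in
    A is its largest |dot product| with the eigenvectors of B."""
    act = np.array([max(abs(np.dot(va, vb)) for vb in vecs_b.values())
                    for va in vecs_a.values()])
    activated = np.count_nonzero(act)
    return float((act > theta).sum() / activated) if activated else 0.0
```

Comparing a document's eigenvectors with themselves yields R = 1, since every cluster activates perfectly against its own eigenvector.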
The local organisations’ annual reports, housed at FAC, consist of approximately 35,000 pages, often handwritten or typewritten and rarely professionally published. These texts, along with their manually transcribed counterparts, are essential for developing handwritten text recognition (HTR) models. In contrast, national organisations produced professionally printed annual reports for broader audiences. The handwritten reports, often created by secretaries, chairs, auditors, or cashiers, exhibit diverse styles, reflecting varying levels of skill and consistency.

Figure 3: Figures (a)–(j) show one example page from each of the ten document groups (numbered 2, 5, 8, 11, 14, 17, 20, 23, 26, and 29), the numbers corresponding to the leaf labels in the dendrogram (k), which shows how the 10 documents with three pages each are perfectly classified.

3.1. Writer Identification

An experiment was conducted in which three pages from ten different documents, each written by a distinct author, were selected. The EPSC for each document was computed, and the similarity score R was calculated as described herein. To make the experiment more challenging, the last page of each document was also included, even though often only half of such a page contained written text. In Figure 3, one page from each of the ten documents is shown, together with a dendrogram computed from the correlation matrix of all similarity scores R, as previously explained. Since no ground truth is available, the documents had to be selected manually, and we therefore chose to use only a limited number. Although the grouping of documents achieved perfect results, this is not always guaranteed. Nevertheless, the writing styles of some documents are quite similar, which provides an indication of the performance of the proposed algorithm.
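The grouping behind the dendrogram in Figure 3k can be sketched from the pairwise similarity scores as follows; the use of average linkage is an assumption, since the linkage criterion is not stated here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def group_documents(R, n_groups):
    """Hierarchically cluster documents from a pairwise similarity
    matrix R of scores in [0, 1], as used for the dendrogram."""
    R = np.asarray(R, dtype=float)
    D = 1.0 - (R + R.T) / 2.0          # symmetrise and turn into distances
    np.fill_diagonal(D, 0.0)
    iu = np.triu_indices_from(D, k=1)  # condensed form expected by scipy
    Z = linkage(D[iu], method='average')
    return fcluster(Z, t=n_groups, criterion='maxclust')
```

The dendrogram itself can then be drawn from `Z` with `scipy.cluster.hierarchy.dendrogram`.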
3.2. Browsing the collection

An interactive application was developed to enable users to browse a subset of the entire document collection, specifically consisting of the already transcribed documents. Each document is represented as a blue dot in the t-SNE visualisation in Figure 4. The document selected by the user is displayed in the lower left corner, while a close-up view is shown in the lower right corner, enabling a more detailed examination of the writing style. The identifying number and name of the chosen document are shown at the top.

Interestingly, it was discovered that the collection contains numerous typewritten documents. The user was able to identify them within just a few minutes, as they are located to the left of the red curve, which was added to highlight their position. Browsing through all 1,700 documents manually, on the other hand, would have been rather cumbersome.

Figure 4: Thanks to the visual interactive application, it was possible to quickly identify which documents were typewritten and which were not, all within a few minutes. The typewritten documents are located to the left of the red line, which was added later.

4. Conclusion and Future work

The two small experiments demonstrate promising results, suggesting that the proposed approach is effective. However, there is significant room for improvement. As previously mentioned, both the keypoint detector and the feature extractor can be optimised, and the threshold θ should be further investigated. The clustering shown in Figure 4 indicates that improvements ought to be possible, as typewritten documents should ideally occupy a distinct region separate from handwritten documents. However, it is important to note that some documents contain areas with both typewritten and handwritten text. Furthermore, the background removal process could be optimised to effectively handle documents with varying levels of background noise.
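The "left of the red curve" selection can also be expressed programmatically; the helper below is hypothetical (in the application the curve was drawn by hand and judged visually), flagging the embedded documents that lie left of a boundary polyline given as points (curve_x[i], curve_y[i]) with increasing y:

```python
import numpy as np

def left_of_curve(embedding, curve_y, curve_x):
    """Flag points in a 2-D embedding lying left of a boundary polyline.
    The boundary's x-coordinate at each point's y is obtained by linear
    interpolation, so the curve need not be vertical."""
    pts = np.asarray(embedding, dtype=float)
    boundary_x = np.interp(pts[:, 1], curve_y, curve_x)
    return pts[:, 0] < boundary_x
```

With a vertical boundary at x = 0, for instance, every document embedded at negative x is flagged as a typewritten candidate.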
Acknowledgments

This work was partially funded through the Labour’s Memory project, which received support from Riksbankens Jubileumsfond under Grant Agreement No. IN20-0040.

References

[1] M. Kestemont, V. Christlein, D. Stutzmann, Artificial paleography: Computational approaches to identifying script types in medieval manuscripts, Speculum 92 (2017) S86–S109. doi:10.1086/694112.
[2] S. Fiel, R. Sablatnig, Writer identification and writer retrieval using the Fisher vector on visual vocabularies, in: 2013 12th International Conference on Document Analysis and Recognition, 2013, pp. 545–549. doi:10.1109/ICDAR.2013.114.
[3] V. Christlein, M. Gropp, S. Fiel, A. Maier, Unsupervised feature learning for writer identification and writer retrieval, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017, pp. 991–997. URL: https://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2017/Christlein17-UFL.pdf. doi:10.1109/ICDAR.2017.165.
[4] E. Vats, A. Hast, P. Singh, Automatic document image binarization using Bayesian optimization, in: Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, ACM, 2017, pp. 89–94.
[5] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of the Fourth Alvey Vision Conference, Manchester, UK, 1988, pp. 147–151.
[6] A. Hast, E. Vats, Radial line Fourier descriptor for historical handwritten text representation, Journal of WSCG 26 (2018) 31–40. doi:10.24132/JWSCG.2018.26.1.4.
[7] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004) 91–110.
[8] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[9] A. Hast, E. Vats, Word recognition using embedded prototype subspace classifiers on a new imbalanced dataset, Journal of WSCG 29 (2021) 39–47. URL: http://wscg.zcu.cz/WSCG2021/2021-J-WSCG-1-2.pdf.
[10] A. Hast, Magnitude of semicircle tiles in Fourier-space: A handcrafted feature descriptor for word recognition using embedded prototype subspace classifiers, Journal of WSCG 30 (2022) 82–90. doi:10.24132/JWSCG.2022.10.
[11] A. Hast, M. Lind, E. Vats, Embedded prototype subspace classification: A subspace learning framework, in: The 18th International Conference on Computer Analysis of Images and Patterns (CAIP), Lecture Notes in Computer Science, 2019, pp. 581–592.
[12] A. Hast, M. Lind, Ensembles and cascading of embedded prototype subspace classifiers, Journal of WSCG 28 (2020) 89–95. doi:10.24132/JWSCG.2020.28.11.
[13] S. Watanabe, P. F. Lambert, C. A. Kulikowski, J. L. Buxton, R. Walker, Evaluation and selection of variables in pattern recognition, in: J. Tou (Ed.), Computer and Information Sciences, volume 2, Academic Press, New York, 1967, pp. 91–122.
[14] T. Kohonen, P. Lehtiö, J. Rovamo, J. Hyvärinen, K. Bry, L. Vainio, A principle of neural associative memory, Neuroscience 2 (1977) 1065–1076. doi:10.1016/0306-4522(77)90129-4.
[15] T. Kohonen, E. Oja, Fast adaptive formation of orthogonalizing filters and associative memory in recurrent networks of neuron-like elements, Biological Cybernetics 21 (1976) 85–95. doi:10.1007/BF01259390.
[16] T. Kohonen, E. Reuhkala, K. Mäkisara, L. Vainio, Associative recall of images, Biological Cybernetics 22 (1976) 159–168. doi:10.1007/BF00365526.
[17] E. Oja, T. Kohonen, The subspace learning algorithm as a formalism for pattern recognition and neural networks, in: IEEE 1988 International Conference on Neural Networks, volume 1, 1988, pp. 277–284. doi:10.1109/ICNN.1988.23858.
[18] J. Laaksonen, Subspace classifiers in recognition of handwritten digits, Doctoral dissertation (monograph), Helsinki University of Technology, 1997. URL: http://urn.fi/urn:nbn:fi:tkk-001249.