An Unsupervised Learning Approach to Text Line Detection in Complex Illuminated Medieval Manuscripts

Lizeth Gonzalez-Carabarin, Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands (l.gonzalez.carabarin@tue.com) and Lisandra S. Costiner, Merton College, University of Oxford (lia.costiner@merton.ox.ac.uk)

Abstract. This paper outlines a simple and effective clustering and filtering approach for text line detection in challenging illuminated medieval manuscripts from Western Europe. Because illuminated medieval manuscripts were copied and illustrated by hand and do not follow regular formats, they pose particular difficulties for traditional methods of text line extraction, which were designed for printed books. This paper introduces an unsupervised learning approach to text line detection in challenging manuscripts based on clustering, using a k-means algorithm with a combination of three salient features well suited to image processing: the gradient in the y-direction, the mean values of rows, and grayscale values. The strength of the method lies in its small number of features, its computational lightness, low memory use, transparency at every step of the process, and versatility. Being per-page based, it can also be used on a single image, unlike supervised learning approaches, which require large training datasets. It thus stands as an alternative to computationally heavy algorithms such as neural networks, which have been increasingly used in recent years to solve such tasks.

Keywords: Text Line Extraction, Document Layout Analysis, Medieval Manuscripts, Unsupervised Learning, Artificial Intelligence, Digital Humanities.

1 Introduction

Digitization initiatives undertaken by libraries, museums and collections around the globe are rapidly increasing the number of manuscript images available online.
Given the large volume of such data, it is important to devise new ways to automatically process and extract relevant information from these images, thus saving valuable human time. Digitized documents pose a number of challenges for the extraction of relevant information, the key ones being the location of areas of text and illustration and the detection of text lines.

Supported by Merton College, Oxford. The two authors contributed equally to this article. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Medieval manuscripts are especially challenging for automatic text line detection. Each surviving book was produced by hand, so its page layout and illustrations vary widely. Furthermore, unlike in printed books, text and decoration in medieval manuscripts do not typically conform to uniform rectangular registers, which is the assumed layout for image segmentation [1], [2]. Instead, text lines in manuscripts can be irregular and can be broken up by unframed images. Such documents therefore pose particular difficulties for traditional methods of text line detection designed for printed text, requiring instead the development of customized algorithms.

Text line detection is an essential component of automatic document analysis, being the basis of text extraction and ultimately of optical character recognition (OCR). It is typically categorized under image segmentation algorithms [3]. A number of competitions have been dedicated specifically to the challenges of extracting text lines from challenging documents [4], [5]. Older approaches to document segmentation and text line extraction were adapted to specific types of records [8], [9], [10], so there is a need for a global or generic approach that can adapt to different types of documents.
Recent developments have tended to focus on the use of neural networks [2], [5], [6], [7], [12], [13], [14], which have proved successful, as have hybrid approaches [1]. Although effective, neural networks (NNs) require manually annotated data for training, expending large amounts of human time. They are also computationally heavy and are black boxes, meaning that their inner workings are not understood. New approaches with increased versatility, stability, generality, the ability to perform multi-scale analysis, and the ability to handle color remain a desideratum [3]. There is a need, therefore, for a generic tool that is flexible in dealing with a range of documents, is low on processing power, and is white-box, allowing every step to be queried and understood.

This paper proposes such a technique for the automatic identification and extraction of lines of text from Western medieval manuscripts with complex layouts that include text and image. It introduces an unsupervised learning approach to text line detection based on clustering, using a k-means algorithm [17] with a combination of three salient features well suited to image processing. Although k-means has been applied to document segmentation previously, the number of features used in those approaches was large, increasing the computational cost [1]. The current methodology differentiates itself by relying on only three features, namely the gradient in the y-direction, the mean values of rows, and grayscale values. It is also able to detect useful clusters based on statistical information about cluster centroid positions. Moreover, compared to previous approaches to line extraction and segmentation, which work only on pictures of single pages, this approach can be applied to images of open books showing two facing pages, a more difficult arrangement for image processing.
2 Dataset

Although a number of annotated datasets have been created for text line detection in medieval documents with challenging layouts [11], [15], no such datasets exist for illuminated medieval manuscripts. For the current study, a new challenging dataset was specifically created to contain images that might cause difficulties in the automatic detection of text lines. These include books with a variety of layouts and decorations, different types of texts (devotional and medical), written in various scripts, and produced in a number of geographical regions in different time periods. The layouts of these manuscripts include single or double columns of text. The text is often interrupted by images, such as historiated initials, borders, marginal decoration, and unframed drawings. The images were selected from digitised manuscripts downloaded from the freely available digital collection of the Bodleian Library in Oxford [16]. The 120 images used in this study derive from the following manuscripts in Oxford's Bodleian Library: MS Canon. Misc 476 (a fourteenth-century Italian Life of the Virgin and of Christ), MS Add. A. 185 (a fifteenth-century French Book of Hours), MS Ashmole 1462 (a twelfth-century English manuscript containing medical and herbal texts), MS Auct. 2.2 (a fourteenth-century English choir psalter), and MS Buchanan e 7 (a fifteenth-century Italian Book of Hours).

3 Clustering Based on k-means

The algorithm used in this approach is the k-means clustering algorithm [17].
The k-means algorithm partitions a set P of N observations, P = \{x_1, x_2, \dots, x_N\}, into k clusters (C_1, C_2, \dots, C_k), where C_1 \cup C_2 \cup \dots \cup C_k = P, so as to minimise:

\min_{C_1, C_2, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j \right\|^2    (1)

In our case, x_1, x_2, \dots, x_N are vectors in R^p, where p is the number of features, the distance between points is the Euclidean distance \|x_i - x_j\|_2, and the centroid m_i of the cluster C_i is the mean:

m_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j    (2)

To find the clusters and centroids in equations (1) and (2), the k-means algorithm starts by selecting k centroids m_i, which are randomly placed among the data points. The method then iterates two steps: (a) assign each data point to its nearest mean/centroid, and (b) recalculate each mean/centroid as the average of its cluster (equation 2). Other studies have proposed similar approaches to layout segmentation based on k-means; however, they use a large number of features (tens of them) [1]. This paper proposes the use of three features computed after a pre-processing stage, which has the advantage of increasing efficiency in terms of processing time and resources.

4 Pre-Processing Stage

The input image is in RGB format and is represented as a multidimensional matrix of size m × n × d, where m is the number of rows, n is the number of columns, and d is the depth (number of channels). The values range from 0 to 255. The first step is to convert the image to grayscale:

I_G = 0.2989 \cdot d_1 + 0.5870 \cdot d_2 + 0.1140 \cdot d_3    (3)

where d = [d_1, d_2, d_3] represents the R, G and B channels respectively. Here, I_G is a matrix of shape m × n. After converting the image to grayscale, a uniform filter with a kernel size of 13 (selected through trials) is applied to obtain a smoothed version, I_{Gsmooth}.

Fig. 1: (a) Original image, (b) smoothed image, (c) gradient, and (d) mean values of rows.

5 Feature Extraction

Fig. 2: Five classes obtained using k-means for one manuscript image. Each class represents one cluster. The points in the clusters are represented in black.

After pre-processing, three features are proposed for clustering: the gradient in the vertical y-direction, the mean values of rows, and the grayscale values of I_{Gsmooth}. The gradient of the smoothed image is used because it marks the upper and lower edges of the text lines (Fig. 1c). The second feature, the mean value of rows, also provides information about the text lines (Fig. 1d). Zero values in I_{Gsmooth} correspond to black pixels, while values of 1 correspond to white pixels. Therefore, minima of the mean row values (close to 0) are likely to correspond to black text lines, whereas maxima (close to 1) are likely to correspond to the white space between text lines or the spaces at the top and bottom of the page. Finally, the third feature, the gray values of I_{Gsmooth}, may provide information about empty spaces and image locations (Fig. 1b).

For an N-dimensional array, the gradient is based on central differences in the interior and first differences at the boundaries of the matrix. The central differences are given by:

\delta_h[f](x) = f(x + h/2) - f(x - h/2)    (4)

and at the boundaries:

\delta_h[f](x) = f(x + h) - f(x)    (5)

Based on these equations, the gradient was calculated and a matrix I_\Delta of size m × n was obtained. Additionally, the mean value of row i is calculated according to:

\mu_{smooth,i} = \frac{1}{n} \sum_{j=1}^{n} x_{i,j}    (6)

The m × n matrix of row means, I_\mu, associates to each pixel within a row i the mean value \mu_{smooth,i}. After the computation of the three matrices I_{Gsmooth}, I_\Delta and I_\mu, which contain the feature values, standardisation was performed on the concatenated matrices:

I_z = \frac{I_X - \mu_I}{\sigma_I}    (7)

where \mu_I and \sigma_I are the mean and the standard deviation of I_X, which contains the concatenation of \{I_{Gsmooth}, I_\Delta, I_\mu\}.
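The pre-processing and feature-extraction stages (equations 3 to 7) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function name `extract_features` is hypothetical, and only the kernel size of 13 is taken from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def extract_features(rgb):
    """rgb: (m, n, 3) uint8 image. Returns an (m*n, 3) standardised feature matrix."""
    # Eq. (3): weighted grayscale conversion, rescaled to [0, 1]
    gray = (0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1]
            + 0.1140 * rgb[..., 2]) / 255.0
    # Uniform (box) filter with kernel size 13, as reported in the paper
    smooth = uniform_filter(gray, size=13)
    # Feature 1: gradient in the y-direction; np.gradient uses central
    # differences in the interior and first differences at the boundaries,
    # matching eqs. (4)-(5)
    grad_y = np.gradient(smooth, axis=0)
    # Feature 2: eq. (6), the mean of each row, broadcast to every pixel of that row
    row_means = np.repeat(smooth.mean(axis=1, keepdims=True), smooth.shape[1], axis=1)
    # Feature 3: the smoothed grayscale values themselves.
    # Stack the three features per pixel, then standardise (eq. 7).
    X = np.stack([smooth.ravel(), grad_y.ravel(), row_means.ravel()], axis=1)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Each pixel thus becomes one observation x_i in R^3, ready for clustering.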
To better understand the full pre-processing stage, Algorithm 1 shows the pseudocode.

6 Classification Based on k-means

Once the three features are computed and standardised, a k-means algorithm is used to compute five clusters. An example of the clusters produced from one manuscript image is shown in Fig. 2. Each class represents one cluster, and the points in each cluster are represented in black. As can be observed, Class 1 and Class 5 provide useful information regarding the location of text lines.

These five clusters were computed across the entire corpus of 120 images. Fig. 3 plots these 120 × 5 clusters; the observations made for Fig. 2 hold across all 120 images. The centroid values at the extremes of 2 and -2 correspond to gradients at the upper and lower boundaries of text lines.

Fig. 3: Centroids of each of the 120 images in our dataset. Each image has 5 centroids. The top and bottom clusters show that the algorithm has clearly detected the upper and lower boundaries of text lines.

In Fig. 3, the well-separated top and bottom clouds of centroids, with values of 2 and -2 respectively, show that the algorithm has clearly detected the upper and lower boundaries of text lines. Further post-processing can be performed to extract unique horizontal lines of text. Using the mean value of rows, text lines can be differentiated from the white horizontal space separating them (black having a value of 0, white a value of 1). Other results of k-means can be used to improve the extraction of text lines. In this process, it is important to note that digitised manuscripts come in a range of formats: some capture a single page, others capture facing pages, while certain others include space that falls outside the page. The regions outside of the page, which typically appear black in a photograph or scan, need to be removed (see Fig. 1a). These areas are detected by the algorithm, as seen in Fig. 2, class 4.
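The clustering step can be sketched with scikit-learn's KMeans. This is a hypothetical reconstruction under stated assumptions, not the authors' implementation: the column holding the y-gradient feature and the heuristic of selecting the clusters with the most extreme gradient centroids (the top and bottom clouds of Fig. 3) are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pixels(Z, shape, k=5, grad_col=1):
    """Z: (m*n, p) standardised per-pixel features; shape: (m, n) of the image.

    Returns the per-pixel cluster labels plus the indices of the two clusters
    whose y-gradient centroids are most extreme, which -- per Fig. 3 -- mark
    the upper and lower boundaries of text lines.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    labels = km.labels_.reshape(shape)
    # Sort clusters by their gradient centroid; the extremes correspond to
    # the edges of text lines (which is "upper" vs "lower" depends on the
    # sign convention of the gradient).
    order = np.argsort(km.cluster_centers_[:, grad_col])
    return labels, order[-1], order[0]
```

The remaining clusters (page background, regions outside the book) can then be used for the margin- and border-removal steps described above.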
The algorithm also detects the white pixels of the page, corresponding to unwritten and unillustrated areas (represented in black in Fig. 2, class 2). These observations can be used to further eliminate the upper and lower margins and to improve the accuracy of the text line extraction.

The pseudocode for the procedures of clustering, post-processing and line detection is shown in Algorithms 1 and 2. After obtaining the class which contains the lines (class_lines), a post-processing stage is applied to eliminate borders (class_borders). This is followed by an enhancing stage, in which a uniform filter is applied and the gradient is computed, enhancing the horizontal lines of class_lines. Finally, a Hough lines algorithm is applied [20].

Algorithm 1: Pseudocode for pre-processing
procedure PRE-PROCESSING
  img_gray = RGB_TO_GRAY(original_image)
  img_bw = THRESHOLD(img_gray > BW_THRESHOLD)
  img_bw_smooth_1 = UNIFORM_FILTER(img_bw)
  mean_rows_gray_v = MEAN(img_bw_smooth_1, rows)
  img_bw_smooth_2 = img_bw_smooth_1
  for i < smoothness_level do
    img_bw_smooth_2 = UNIFORM_FILTER(img_bw_smooth_2)
    img_bw_smooth_2 = GRAD_Y(img_bw_smooth_2)
    i ← i + 1
  img_grad = img_bw_smooth_2
  X_data = {img_bw_smooth_1, img_grad, mean_rows_gray_v}

Algorithm 2: Algorithm for text line extraction
procedure TEXT_EXTRACTION
  clusters = K_MEANS(X_data)
  class_lines[cluster == cluster_lines] = 0
  for each cluster_borders do
    class_borders[cluster == cluster_borders] = 0
    class_borders = UNIFORM_FILTER(class_borders)
    class_lines[class_borders == 1] = 0
  class_lines_smooth = UNIFORM_FILTER(class_lines)
  for i < smoothness_level_2 do
    class_lines_smooth = UNIFORM_FILTER(class_lines_smooth)
    class_lines_smooth = GRAD_Y(class_lines_smooth)
    i ← i + 1
  lines = HOUGH_LINES(class_lines_smooth)

7 Results

Results are shown in Fig. 4 for different manuscripts with various layouts and types of illumination. The green marks represent the upper and lower boundaries of text lines found by the algorithm. The algorithm appears to accurately identify lines of text in a number of challenging cases. These include images with single and double columns of text, text lines broken by unframed illustrations inserted within the text column, images of facing pages, and images that include extraneous material, such as the space outside of a book, which appears black in photographs.

Fig. 4: Text and image extraction for different samples of manuscripts ((a) Sample 1, (b) Sample 2, (c) Sample 3). This shows that the algorithm can handle difficult cases such as: single and double columns of text, text lines which are broken by illustrations inserted within the text column, images including facing pages of the manuscript, and images including extraneous material, such as the space outside of a book.

The accuracy of the algorithm was evaluated by comparing its performance against the same dataset, which had been manually labelled in red. The following measures were used to evaluate the performance of the current model: TP (True Positive), correctly detected lines; FP (False Positive), regions of the page erroneously marked as lines; and FN (False Negative), lines which were not marked. This was done across the entire image dataset, and the final accuracy was computed by averaging the results for each image. Evaluation results were obtained based on pixel accuracy. For the proposed database, a final accuracy of 87.2% was obtained, calculated as the percentage of pixels categorized as 'text lines' by our algorithm that matched the manually labelled lines.
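The pixel-level evaluation described above can be sketched as follows. This is a minimal illustrative version: the TP/FP/FN definitions follow the paper, but the exact formula the authors used to aggregate them into the 87.2% figure is not spelled out, so the matched-pixel fraction computed here is an assumption.

```python
import numpy as np

def pixel_scores(pred, truth):
    """pred, truth: boolean (m, n) masks of pixels marked as 'text line'."""
    tp = np.logical_and(pred, truth).sum()    # correctly detected line pixels
    fp = np.logical_and(pred, ~truth).sum()   # marked as line, but not one
    fn = np.logical_and(~pred, truth).sum()   # line pixels that were missed
    # Assumed accuracy: fraction of pixels the algorithm marked as 'text line'
    # that match the manual labels; averaged per image over the corpus.
    matched = tp / (tp + fp) if (tp + fp) else 1.0
    return tp, fp, fn, matched
```

Averaging `matched` over all 120 images would then give the corpus-level accuracy reported in the paper.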
Although this accuracy is below that of state-of-the-art algorithms [19], the method is useful because of its simplicity and because it is computationally light, running without the assistance of GPUs or external servers.

8 Discussion

We have presented an efficient algorithm which relies on unsupervised learning to successfully detect lines of text. The dataset was curated to include challenging cases, and indeed some proved too difficult for the algorithm. The fifteenth-century French Book of Hours, Oxford, Bodleian Library, MS. Add. A. 185, was one such case (Fig. 5). For this particular manuscript, which includes a large border with foliate patterns and an illustration, the algorithm classified some areas of the decoration as text lines, mainly vines sketched in pen. Although increasing the number of clusters did not improve the performance of our model, adding more features may. Another avenue for improvement is that of hybrid techniques (supervised and unsupervised learning), which would aim for model shrinkage (e.g. leveraging pruning and quantization for deep learning approaches) to further improve metrics while providing efficient inference models.

Fig. 5: Example of a challenging manuscript (Oxford, Bodleian Library, MS. Add. A. 185, fol. 106v). Red circles bring attention to areas of erroneous line detection.

9 Conclusion

This paper outlines a simple and effective approach for line detection in challenging illuminated medieval manuscripts. Traditional approaches to line detection assume that text regions are enclosed in rectangular registers, which is not true of many medieval books, especially those that contain illuminations. This approach is suited to such complexity. Although k-means and filtering have previously been used for similar page segmentation tasks, the uniqueness of this approach is its reliance on only three features.
The strength of the method further lies in its transparency at every step of the process, low memory use, potential to produce refined results, and versatility. It stands as an alternative to algorithms such as neural networks, which have been increasingly used in recent years to solve such tasks: algorithms which are black boxes, do not allow their decision-making process to be queried, and are computationally intensive. Also, in contrast to supervised learning approaches, such as neural networks, which require substantial human time for the manual annotation of training data, this algorithm is unsupervised and demands no such investment. Furthermore, this algorithm works on a single image, whereas others require large datasets on which to train. This approach therefore not only provides a solution for line detection in challenging images of documents with mixed textual and visual content, but more importantly leads towards algorithms with improved robustness, stability and versatility. Such results are important for scholars in the humanities as a pre-processing step for textual extraction and ultimately optical character recognition (OCR).

References

1. Yang, Y., Pintus, R., Gobbetti, E. and Rushmeier, H.: Automatic Single Page-Based Algorithms for Medieval Manuscript Analysis. Journal on Computing and Cultural Heritage 10(2). DOI: https://doi.org/10.1145/2996469 (2017).
2. Ares Oliveira, S., Seguin, B. and Kaplan, F.: dhSegment: A Generic Deep-Learning Approach for Document Segmentation. 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018).
3. Eskenazi, S., Gomez-Krämer, P. and Ogier, J.: A Comprehensive Survey of Mostly Textual Document Segmentation Algorithms since 2008. Pattern Recognition, 64, pp. 1–14 (2017).
4. Stamatopoulos, N., Gatos, B., Louloudis, G., Pal, U. and Alaei, A.: ICDAR 2013 Handwriting Segmentation Contest.
Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE, pp. 1402–1406 (2013).
5. Grüning, T., Labahn, R., Diem, M., Kleber, F. and Fiel, S.: READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents. 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 351–356 (2018).
6. Fink, M., Layer, T., Mackenbrock, G. and Sprinzl, M.: Baseline Detection in Historical Documents Using Convolutional U-Nets. 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, pp. 37–42. DOI: 10.1109/DAS.2018.34 (2018).
7. Alberti, M., Vögtlin, L., Pondenkandath, V., Seuret, M., Ingold, R. and Liwicki, M.: Labeling, Cutting, Grouping: An Efficient Text Line Segmentation Method for Medieval Manuscripts. arXiv:1906.11894 (2019).
8. Shafait, F., Keysers, D. and Breuel, T.: Performance Evaluation and Benchmarking of Six Page Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6), pp. 941–954 (2008).
9. Likforman-Sulem, L., Zahour, A. and Taconet, B.: Text Line Segmentation of Historical Documents: A Survey. IJDAR 9, pp. 123–138. DOI: https://doi.org/10.1007/s10032-006-0023-z (2007).
10. Pintus, R., Yang, Y., Gobbetti, E., et al.: A TaLISMAN: Automatic Text and Line Segmentation of Historical MANuscripts. EUROGRAPHICS Workshop on Graphics and Cultural Heritage (2014).
11. Diem, M., Kleber, F., Fiel, S., Grüning, T. and Gatos, B.: ScriptNet: ICDAR 2017 Competition on Baseline Detection in Archival Documents (cBAD) [Data set]. Zenodo. DOI: http://doi.org/10.5281/zenodo.257972 (2017).
12. Renton, G., Chatelain, C., Adam, S., Kermorvant, C. and Paquet, T.: Handwritten Text Line Segmentation Using Fully Convolutional Network. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, pp. 5–9. DOI: 10.1109/ICDAR.2017.321 (2017).
13.
Chen, K., Seuret, M., Liwicki, M., Hennebert, J. and Ingold, R.: Page Segmentation of Historical Document Images with Convolutional Autoencoders. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1011–1015 (2015).
14. Barakat, B., Droby, A., Kassis, M. and El-Sana, J.: Text Line Segmentation for Challenging Handwritten Document Images Using Fully Convolutional Network. 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 374–379 (2018).
15. Simistira, F., Seuret, M., Eichenberger, N., Garz, A., Liwicki, M. and Ingold, R.: DIVA-HisDB: A Precisely Annotated Large Dataset of Challenging Medieval Manuscripts. 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 471–476 (2016).
16. Digital Bodleian. https://digital.bodleian.ox.ac.uk/, last accessed 2021/3/20.
17. Jin, X. and Han, J.: K-Means Clustering. In: Encyclopedia of Machine Learning. Springer US, pp. 563–564 (2010).
18. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y.: Generative Adversarial Nets. Advances in Neural Information Processing Systems, 27 (2014).
19. Barakat, B. K., Droby, A., Alasam, R., Madi, B., Rabaev, I., Shammes, R. and El-Sana, J.: Unsupervised Deep Learning for Text Line Segmentation. arXiv:2003.08632 (2020).
20. Likforman-Sulem, L., Hanimyan, A. and Faure, C.: A Hough Based Algorithm for Extracting Text Lines in Handwritten Documents. International Conference on Document Analysis and Recognition (ICDAR), pp. 774–777 (1995).