An Unsupervised Learning Approach to Text Line Detection in Complex Illuminated Medieval Manuscripts

Lizeth Gonzalez-Carabarin, Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands (l.gonzalez.carabarin@tue.com) and Lisandra S. Costiner, Merton College, University of Oxford (lia.costiner@merton.ox.ac.uk)

Abstract. This paper outlines a simple and effective clustering and filtering approach for text line detection in challenging illuminated medieval manuscripts from Western Europe. Because illuminated medieval manuscripts were copied and illustrated by hand and do not follow regular formats, they pose particular difficulties for traditional methods of text line extraction, which were designed for printed books. This paper introduces an unsupervised learning approach to text line detection in challenging manuscripts based on clustering, using a k-means algorithm with a combination of three salient features well suited to image processing: the gradient in the y-direction, the mean values of rows, and grayscale values. The strength of the method lies in its small number of features, its computational lightness, low memory use, transparency at every step of the process, and versatility. Being per-page based, it can also be used on a single image, unlike supervised learning approaches, which require large training datasets. It thus stands as an alternative to computationally heavy algorithms such as neural networks, which have been increasingly used in recent years to solve such tasks.

Keywords: Text Line Extraction, Document Layout Analysis, Medieval Manuscripts, Unsupervised Learning, Artificial Intelligence, Digital Humanities.

1 Introduction

Digitization initiatives undertaken by libraries, museums and collections around the globe are rapidly increasing the number of manuscript images available online.
Given the large volume of such data, it is important to devise new ways to automatically process and extract relevant information from these images, thus saving valuable human time. Digitized documents pose a number of challenges for the extraction of relevant information, the key ones being the location of areas of text and illustration and the detection of text lines.

Supported by Merton College, Oxford. The two authors contributed equally to this article. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Medieval manuscripts are especially challenging for automatic text line detection. Each surviving book was produced by hand, so its page layout and illustrations vary widely. Furthermore, unlike in printed books, text and decoration in medieval manuscripts do not typically conform to uniform rectangular registers, which is the assumed layout for image segmentation [1], [2]. Instead, text lines in manuscripts can be irregular and can be broken up by unframed images. Such documents therefore pose particular difficulties for traditional methods of text line detection designed for printed text, requiring instead the development of customized algorithms.

Text line detection is an essential component of automatic document analysis, being the basis of text extraction and ultimately of optical character recognition (OCR). It is typically categorized under image segmentation algorithms [3]. A number of competitions have been dedicated specifically to the challenges of extracting text lines from challenging documents [4], [5]. Older approaches to document segmentation and text line extraction were adapted to specific types of records [8], [9], [10], so there is a need for a global or generic approach that can adapt to different types of documents.
Recent developments have tended to focus on the use of neural networks [2], [5], [6], [7], [12], [13], [14], which have proved successful, as have hybrid approaches [1]. Although effective, neural networks (NNs) require manually annotated data for training, expending large amounts of human time. They are also computationally heavy and are black boxes, meaning that their inner workings are not understood. New approaches with increased versatility, stability, generality, the ability to perform multi-scale analysis, and the ability to handle color remain a desideratum [3]. There is a need, therefore, for a generic tool that is flexible in dealing with a range of documents, is low on processing power, and is white-box, allowing every step to be queried and understood.

This paper proposes such a technique for the automatic identification and extraction of lines of text from Western medieval manuscripts with complex layouts that include text and image. It introduces an unsupervised learning approach to text line detection based on clustering, using a k-means algorithm [17] with a combination of three salient features well suited to image processing. Although k-means has been applied to document segmentation previously, the number of features used in those approaches was large, increasing the computational cost [1]. The current methodology differentiates itself by relying on only three features, namely the gradient in the y-direction, the mean values of rows, and grayscale values. It is also able to detect useful clusters based on statistical information about cluster centroid positions. Moreover, compared to previous approaches to line extraction and segmentation, which work only on pictures of single pages, this approach can be applied to images of open books showing two facing pages, a more difficult arrangement for image processing.
2 Dataset

Although a number of annotated datasets have been created for text line detection in medieval documents with challenging layouts [11], [15], no such datasets exist for illuminated medieval manuscripts. For the current study, a new challenging dataset was specifically created to contain images that might cause difficulties in the automatic detection of text lines. These include books with a variety of layouts and decorations, different types of texts (devotional and medical), written in various scripts, and produced in a number of geographical regions in different time periods. The layouts of these manuscripts include single or double columns of text. The text is often interrupted by images, such as historiated initials, borders, marginal decoration, and unframed drawings. The images were selected from digitised manuscripts downloaded from the freely available digital collection of the Bodleian Library in Oxford [16]. The 120 images used in this study derive from the following manuscripts in Oxford's Bodleian Library: MS Canon. Misc 476 (a fourteenth-century Italian Life of the Virgin and of Christ), MS Add. A. 185 (a fifteenth-century French Book of Hours), MS Ashmole 1462 (a twelfth-century English manuscript containing medical and herbal texts), MS Auct. 2.2 (a fourteenth-century English choir psalter), and MS Buchanan e 7 (a fifteenth-century Italian Book of Hours).

3 Clustering Based on k-means

The algorithm used in this approach is the k-means clustering algorithm [17].
The k-means algorithm partitions a set P of N observations, P = \{x_1, x_2, \dots, x_N\}, into k clusters (C_1, C_2, \dots, C_k), where C_1 \cup C_2 \cup \dots \cup C_k = P, so as to minimise:

\min_{C_1, C_2, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j \right\|^2    (1)

In our case, x_1, x_2, \dots, x_N are vectors in R^p, where p is the number of features, the distance between points is the Euclidean distance \|x_i - x_j\|_2, and the centroid m_i of the cluster C_i is the mean:

m_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j    (2)

To find the clusters and centroids in equations (1) and (2), the k-means algorithm starts by selecting k centroids m_i, which are randomly placed among the data points. The method then iterates two steps: (a) assign each data point to its nearest mean/centroid, and (b) recalculate each mean/centroid as the average of its cluster (equation 2). Other studies have proposed similar approaches to layout segmentation based on k-means; however, they use a large number of features (tens of them) [1]. This paper proposes the use of three features computed after a pre-processing stage, which has the advantage of increasing efficiency in terms of processing time and resources.

4 Pre-Processing Stage

The input image is in RGB format and is represented as a multidimensional matrix of size m × n × d, where m is the number of rows, n is the number of columns, and d is the depth (number of channels). The values range from 0 to 255. The first step is to convert the image to grayscale:

I_G = 0.2989 \cdot d_1 + 0.5870 \cdot d_2 + 0.1140 \cdot d_3    (3)

where d = [d_1, d_2, d_3] represents the R, G and B channels respectively. Here, I_G is a matrix of shape m × n. After converting the image to grayscale, a uniform filter with a kernel size of 13 (selected through trials) is applied to obtain a smoothed version, I_{Gsmooth}.

Fig. 1: (a) Original image, (b) smoothed image, (c) gradient, and (d) mean values of rows.

5 Feature Extraction

Fig. 2: Five classes obtained using k-means for one manuscript image. Each class represents one cluster. The points in the clusters are represented in black.

After pre-processing, three features are proposed for clustering: the gradient in the vertical y-direction, the mean values of rows, and the grayscale values of I_{Gsmooth}. The gradient of the smoothed image is used because it marks the upper and lower edges of the text lines (Fig. 1c). The second feature, the mean value of rows, also provides information about the text lines (Fig. 1d). Zero values in I_{Gsmooth} correspond to black pixels, while values of 1 correspond to white pixels. Therefore, minima of the mean row values (close to 0) are likely to correspond to black text lines, whereas maxima (close to 1) are likely to correspond to the white space between text lines or the spaces at the top and bottom of the page. Finally, the third feature, the gray values of I_{Gsmooth}, may provide information about empty spaces and image locations (Fig. 1b).

For an N-dimensional array, the gradient is based on central differences in the interior and first differences at the boundaries of the matrix. The central differences are given by:

\delta_h[f](x) = f(x + h/2) - f(x - h/2)    (4)

and at the boundaries:

\delta_h[f](x) = f(x + h) - f(x)    (5)

Based on these equations, the gradient was calculated and a matrix I_\Delta of size m × n was obtained. Additionally, the mean value of row i is calculated according to:

\mu_{smooth,i} = \frac{1}{n} \sum_{j=1}^{n} x_{i,j}    (6)

The m × n matrix of row means, I_\mu, associates to each pixel within a row i the mean value \mu_{smooth,i}. After the computation of the three matrices I_{Gsmooth}, I_\Delta and I_\mu, which contain the feature values, standardisation was performed on the concatenated matrices:

I_z = \frac{I_X - \mu_I}{\sigma_I}    (7)

where \mu_I and \sigma_I are the mean and the standard deviation of I_X, which contains the concatenation of \{I_{Gsmooth}, I_\Delta, I_\mu\}.
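The pre-processing and feature-extraction stages (equations 3 to 7) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function name `extract_features` is hypothetical, and only the kernel size of 13 is taken from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def extract_features(rgb):
    """rgb: (m, n, 3) uint8 image. Returns an (m*n, 3) standardised feature matrix."""
    # Eq. (3): weighted grayscale conversion, rescaled to [0, 1]
    gray = (0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1]
            + 0.1140 * rgb[..., 2]) / 255.0
    # Uniform (box) filter with kernel size 13, as reported in the paper
    smooth = uniform_filter(gray, size=13)
    # Feature 1: gradient in the y-direction; np.gradient uses central
    # differences in the interior and first differences at the boundaries,
    # matching eqs. (4)-(5)
    grad_y = np.gradient(smooth, axis=0)
    # Feature 2: eq. (6), the mean of each row, broadcast to every pixel of that row
    row_means = np.repeat(smooth.mean(axis=1, keepdims=True), smooth.shape[1], axis=1)
    # Feature 3: the smoothed grayscale values themselves.
    # Stack the three features per pixel, then standardise (eq. 7).
    X = np.stack([smooth.ravel(), grad_y.ravel(), row_means.ravel()], axis=1)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Each pixel thus becomes one observation x_i in R^3, ready for clustering.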
To better understand the full pre-processing stage, Algorithm 1 shows the pseudocode.

6 Classification Based on k-means

Once the three features are computed and standardised, a k-means algorithm is used to compute five clusters. An example of the clusters produced from one manuscript image is shown in Fig. 2. Each class represents one cluster, and the points in each cluster are represented in black. As can be observed, Class 1 and Class 5 provide useful information regarding the location of text lines.

These five clusters were computed across the entire corpus of 120 images. Fig. 3 plots these 120 × 5 clusters; the observations made for Fig. 2 hold across all 120 images. The centroid values at the extremes of 2 and -2 correspond to gradients at the upper and lower boundaries of text lines.

Fig. 3: Centroids of each of the 120 images in our dataset. Each image has 5 centroids. The top and bottom clusters show that the algorithm has clearly detected the upper and lower boundaries of text lines.

In Fig. 3, the well-separated top and bottom clouds of centroids, with values of 2 and -2 respectively, show that the algorithm has clearly detected the upper and lower boundaries of text lines. Further post-processing can be performed to extract unique horizontal lines of text. Using the mean value of rows, text lines can be differentiated from the white horizontal space separating them (black having a value of 0, white a value of 1). Other results of k-means can be used to improve the extraction of text lines. In this process, it is important to note that digitised manuscripts come in a range of formats: some capture a single page, others capture facing pages, while certain others include space that falls outside the page. The regions outside of the page, which typically appear black in a photograph or scan, need to be removed (see Fig. 1a). These areas are detected by the algorithm, as seen in Fig. 2, class 4.
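The clustering step can be sketched with scikit-learn's KMeans. This is a hypothetical reconstruction under stated assumptions, not the authors' implementation: the column holding the y-gradient feature and the heuristic of selecting the clusters with the most extreme gradient centroids (the top and bottom clouds of Fig. 3) are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pixels(Z, shape, k=5, grad_col=1):
    """Z: (m*n, p) standardised per-pixel features; shape: (m, n) of the image.

    Returns the per-pixel cluster labels plus the indices of the two clusters
    whose y-gradient centroids are most extreme, which -- per Fig. 3 -- mark
    the upper and lower boundaries of text lines.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    labels = km.labels_.reshape(shape)
    # Sort clusters by their gradient centroid; the extremes correspond to
    # the edges of text lines (which is "upper" vs "lower" depends on the
    # sign convention of the gradient).
    order = np.argsort(km.cluster_centers_[:, grad_col])
    return labels, order[-1], order[0]
```

The remaining clusters (page background, regions outside the book) can then be used for the margin- and border-removal steps described above.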
The algorithm also detects the white pixels of the page, corresponding to unwritten and unillustrated areas (represented in black in Fig. 2, class 2). These observations can be used to further eliminate the upper and lower margins and to improve the accuracy of the text line extraction.

The pseudocode for the procedures of clustering, post-processing and line detection is shown in Algorithms 1 and 2. After obtaining the class which contains the lines (class_lines), a post-processing stage is applied to eliminate borders (class_borders). This is followed by an enhancing stage, in which a uniform filter is applied and the gradient is computed, enhancing the horizontal lines of class_lines. Finally, a Hough lines algorithm is applied [20].

Algorithm 1: Pseudocode for pre-processing
procedure PRE-PROCESSING
  img_gray = RGB_TO_GRAY(original_image)
  img_bw = THRESHOLD(img_gray > BW_THRESHOLD)
  img_bw_smooth_1 = UNIFORM_FILTER(img_bw)
  mean_rows_gray_v = MEAN(img_bw_smooth_1, rows)
  img_bw_smooth_2 = img_bw_smooth_1
  for i < smoothness_level do
    img_bw_smooth_2 = UNIFORM_FILTER(img_bw_smooth_2)
    img_bw_smooth_2 = GRAD_Y(img_bw_smooth_2)
    i ← i + 1
  img_grad = img_bw_smooth_2
  X_data = {img_bw_smooth_1, img_grad, mean_rows_gray_v}

Algorithm 2: Algorithm for text line extraction
procedure TEXT_EXTRACTION
  clusters = K_MEANS(X_data)
  class_lines[cluster == cluster_lines] = 0
  for each cluster_borders do
    class_borders[cluster == cluster_borders] = 0
    class_borders = UNIFORM_FILTER(class_borders)
    class_lines[class_borders == 1] = 0
  class_lines_smooth = UNIFORM_FILTER(class_lines)
  for i < smoothness_level_2 do
    class_lines_smooth = UNIFORM_FILTER(class_lines_smooth)
    class_lines_smooth = GRAD_Y(class_lines_smooth)
    i ← i + 1
  lines = HOUGH_LINES(class_lines_smooth)

7 Results

Results are shown in Fig. 4 for different manuscripts with various layouts and types of illumination. The green marks represent the upper and lower boundaries of text lines found by the algorithm. The algorithm appears to accurately identify lines of text in a number of challenging cases. These include images with single and double columns of text, text lines broken by unframed illustrations inserted within the text column, images of facing pages, and images that include extraneous material, such as the space outside of a book, which appears black in photographs.

Fig. 4: Text and image extraction for different samples of manuscripts ((a) Sample 1, (b) Sample 2, (c) Sample 3). This shows that the algorithm can handle difficult cases such as: single and double columns of text, text lines which are broken by illustrations inserted within the text column, images including facing pages of the manuscript, and images including extraneous material, such as the space outside of a book.

The accuracy of the algorithm was evaluated by comparing its performance against the same dataset, which had been manually labelled in red. The following measures were used to evaluate the performance of the current model: TP (True Positive), correctly detected lines; FP (False Positive), regions of the page erroneously marked as lines; and FN (False Negative), lines which were not marked. This was done across the entire image dataset, and the final accuracy was computed by averaging the results for each image. Evaluation results were obtained based on pixel accuracy. For the proposed database, a final accuracy of 87.2% was obtained, calculated as the percentage of pixels categorized as 'text lines' by our algorithm that matched the manually labelled lines.
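The pixel-level evaluation described above can be sketched as follows. This is a minimal illustrative version: the TP/FP/FN definitions follow the paper, but the exact formula the authors used to aggregate them into the 87.2% figure is not spelled out, so the matched-pixel fraction computed here is an assumption.

```python
import numpy as np

def pixel_scores(pred, truth):
    """pred, truth: boolean (m, n) masks of pixels marked as 'text line'."""
    tp = np.logical_and(pred, truth).sum()    # correctly detected line pixels
    fp = np.logical_and(pred, ~truth).sum()   # marked as line, but not one
    fn = np.logical_and(~pred, truth).sum()   # line pixels that were missed
    # Assumed accuracy: fraction of pixels the algorithm marked as 'text line'
    # that match the manual labels; averaged per image over the corpus.
    matched = tp / (tp + fp) if (tp + fp) else 1.0
    return tp, fp, fn, matched
```

Averaging `matched` over all 120 images would then give the corpus-level accuracy reported in the paper.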
Although this accuracy is below that of state-of-the-art algorithms [19], the method is useful because of its simplicity and because it is computationally light, running without the assistance of GPUs or external servers.

8 Discussion

We have presented an efficient algorithm which relies on unsupervised learning to successfully detect lines of text. The dataset was curated to include challenging cases, and indeed some proved too difficult for the algorithm. The fifteenth-century French Book of Hours, Oxford, Bodleian Library, MS. Add. A. 185, was one such case (Fig. 5). For this particular manuscript, which includes a large border with foliate patterns and an illustration, the algorithm classified some areas of the decoration as text lines, mainly vines sketched in pen. Although increasing the number of clusters did not improve the performance of our model, adding more features may. Another avenue for improvement is that of hybrid techniques (supervised and unsupervised learning), which would aim for model shrinkage (e.g. leveraging pruning and quantization for deep learning approaches) to further improve metrics while providing efficient inference models.

Fig. 5: Example of a challenging manuscript (Oxford, Bodleian Library, MS. Add. A. 185, fol. 106v). Red circles bring attention to areas of erroneous line detection.

9 Conclusion

This paper outlines a simple and effective approach for line detection in challenging illuminated medieval manuscripts. Traditional approaches to line detection assume that text regions are enclosed in rectangular registers, which is not true of many medieval books, especially those that contain illuminations. This approach is suited to such complexity. Although k-means and filtering have previously been used for similar page segmentation tasks, the uniqueness of this approach is its reliance on only three features.
The strength of the method further lies in its transparency at every step of the process, low memory use, potential to produce refined results, and versatility. It stands as an alternative to algorithms such as neural networks, which have been increasingly used in recent years to solve such tasks: algorithms which are black boxes, do not allow their decision-making process to be queried, and are computationally intensive. Also, in contrast to supervised learning approaches, such as neural networks, which require substantial human time for the manual annotation of training data, this algorithm is unsupervised and demands no such investment. Furthermore, this algorithm works on a single image, whereas others require large datasets on which to train. This approach therefore not only provides a solution for line detection in challenging images of documents with mixed textual and visual content, but more importantly leads towards algorithms with improved robustness, stability and versatility. Such results are important for scholars in the humanities as a pre-processing step for textual extraction and ultimately optical character recognition (OCR).

References

1. Yang, Y., Pintus, R., Gobbetti, E. and Rushmeier, H.: Automatic Single Page-Based Algorithms for Medieval Manuscript Analysis. Journal on Computing and Cultural Heritage 10(2). DOI: https://doi.org/10.1145/2996469 (2017).
2. Ares Oliveira, S., Seguin, B. and Kaplan, F.: dhSegment: A Generic Deep-Learning Approach for Document Segmentation. 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2018).
3. Eskenazi, S., Gomez-Krämer, P. and Ogier, J.: A Comprehensive Survey of Mostly Textual Document Segmentation Algorithms since 2008. Pattern Recognition, 64, pp. 1–14 (2017).
4. Stamatopoulos, N., Gatos, B., Louloudis, G., Pal, U. and Alaei, A.: ICDAR 2013 Handwriting Segmentation Contest.
Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE, pp. 1402–1406 (2013).
5. Grüning, T., Labahn, R., Diem, M., Kleber, F. and Fiel, S.: READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents. 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 351–356 (2018).
6. Fink, M., Layer, T., Mackenbrock, G. and Sprinzl, M.: Baseline Detection in Historical Documents Using Convolutional U-Nets. 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, pp. 37–42. DOI: 10.1109/DAS.2018.34 (2018).
7. Alberti, M., Vögtlin, L., Pondenkandath, V., Seuret, M., Ingold, R. and Liwicki, M.: Labeling, Cutting, Grouping: An Efficient Text Line Segmentation Method for Medieval Manuscripts. arXiv:1906.11894 (2019).
8. Shafait, F., Keysers, D. and Breuel, T.: Performance Evaluation and Benchmarking of Six Page Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6), pp. 941–954 (2008).
9. Likforman-Sulem, L., Zahour, A. and Taconet, B.: Text Line Segmentation of Historical Documents: A Survey. IJDAR 9, pp. 123–138. DOI: https://doi.org/10.1007/s10032-006-0023-z (2007).
10. Pintus, R., Yang, Y., Gobbetti, E., et al.: A TaLISMAN: Automatic Text and Line Segmentation of Historical MANuscripts. EUROGRAPHICS Workshop on Graphics and Cultural Heritage (2014).
11. Diem, M., Kleber, F., Fiel, S., Grüning, T. and Gatos, B.: ScriptNet: ICDAR 2017 Competition on Baseline Detection in Archival Documents (cBAD) [Data set]. Zenodo. DOI: http://doi.org/10.5281/zenodo.257972 (2017).
12. Renton, G., Chatelain, C., Adam, S., Kermorvant, C. and Paquet, T.: Handwritten Text Line Segmentation Using Fully Convolutional Network. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, pp. 5–9. DOI: 10.1109/ICDAR.2017.321 (2017).
13.
Chen, K., Seuret, M., Liwicki, M., Hennebert, J. and Ingold, R.: Page Segmentation of Historical Document Images with Convolutional Autoencoders. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1011–1015 (2015).
14. Barakat, B., Droby, A., Kassis, M. and El-Sana, J.: Text Line Segmentation for Challenging Handwritten Document Images Using Fully Convolutional Network. 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 374–379 (2018).
15. Simistira, F., Seuret, M., Eichenberger, N., Garz, A., Liwicki, M. and Ingold, R.: DIVA-HisDB: A Precisely Annotated Large Dataset of Challenging Medieval Manuscripts. 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 471–476 (2016).
16. Digital Bodleian. https://digital.bodleian.ox.ac.uk/, last accessed 2021/3/20.
17. Jin, X. and Han, J.: K-Means Clustering. In: Encyclopedia of Machine Learning. Springer US, pp. 563–564 (2010).
18. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y.: Generative Adversarial Nets. Advances in Neural Information Processing Systems, 27 (2014).
19. Barakat, B. K., Droby, A., Alasam, R., Madi, B., Rabaev, I., Shammes, R. and El-Sana, J.: Unsupervised Deep Learning for Text Line Segmentation. arXiv:2003.08632 (2020).
20. Likforman-Sulem, L., Hanimyan, A. and Faure, C.: A Hough Based Algorithm for Extracting Text Lines in Handwritten Documents. International Conference on Document Analysis and Recognition (ICDAR), pp. 774–777 (1995).