The 2nd Workshop SIDOP - March 23-24 - Hammamet, Tunisia

Segmentation of Handwritten and Printed Arabic Documents

Ghazouani Fethi, IFN (1), ENIT, Tunis, Tunisia. Email: gfethi@yahoo.fr
Maddouri Mondher, FST, Tunis, Tunisia. Email: mondher.maddouri@fst.rnu.tn
Maddouri Snoussi Samia, ENIT, Tunis, Tunisia. Email: samia_maddouri@yahoo.f
El Abed Haikel (1). Email: elabed@tu-bs.de
Volker Margner (1). Email: Maergner@ifn.ing.tu-bs.de
(1) Institute for Communications Technology, Braunschweig, Germany

Abstract: In this paper, we propose a new text line segmentation method for handwritten and printed Arabic document images that uses the Outer Isothetic Cover (OIC) algorithm of a digital object. In the first step, we use this method to segment a composite document into text blocks. In the second step, we extract the text lines from each text block. Finally, each text line is segmented into words or into pieces of Arabic words (PAWs). The first results obtained at the current stage of the proposed method, over a dozen texts, are encouraging. We have also tested this method on documents written in Latin script.

Keywords: handwritten and modern document; text line segmentation; document image; pieces of Arabic words

I. INTRODUCTION

The first step in automatic document recognition is the segmentation of the text image into text lines. The objective of this step is to assign each component of the text to the appropriate line, so that the data can be prepared for further processing such as normalization, word segmentation and feature extraction.

The segmentation of handwritten text is complicated by the variation of the interline distance and by the undulation of baselines, which produce different orientations of the text. The characters of two adjacent text lines may touch or overlap, which considerably complicates line segmentation. In Arabic script, these situations are frequent because of the presence of ascending and descending characters. The massive presence of diacritical symbols also often generates false lines.

Most work on the segmentation of a page into lines is based on a decomposition of the image into connected components. After the separation of lines, the words of each line are separated, and each word is then segmented into pieces (parts) of words.

In this article, we focus essentially on the segmentation of Arabic document images into text blocks and lines. We then apply our method to the segmentation of Latin documents.

In the following, we present some techniques applied to the segmentation of documents into text lines. Then we present our approach to the segmentation of handwritten and printed Arabic documents into blocks, text lines, and words or parts of words. The results of our segmentation method are then shown through tests on historical and modern documents. Finally, we end the article with a conclusion and perspectives that show a possible extension of the proposed approach.

II. RELATED WORKS

Several works have been proposed for the segmentation of documents. For example, Bennasri et al. proposed a method to extract text lines of Arabic script using projection [2]. First, the document is divided into multiple columns to correct the problem of sinuosity. Then, the starting points of all lines are detected using the minima of the partial projection profile. Finally, each line is followed by contour tracking: first in the direction of writing, then in the opposite direction.

Nicolaou et al. [3] proposed a technique to segment the lines of Latin manuscripts using tracers (axes) through minima. These minima are estimated using the vertical projection histogram. This method was tested on a subset of the ICDAR 2007 database consisting of 20 documents containing 476 lines, and on 80 documents containing 1771 lines. The technique achieved an extraction rate of 98.6%.

Another approach to line extraction from ancient Arabic manuscripts was proposed by Zahour et al. in [4]. Initially, the document is divided into columns of equal size. Then, each column is segmented into three types of text blocks: small blocks, which generally represent diacritical symbols; medium blocks, which correspond to body text; and large blocks, which reflect the overlap of words between adjacent lines.

There are also segmentation methods that use the Hough transform. This technique is widely used for the extraction of text lines [5]. For example, in [6] the Hough transform is used together with a method for grouping connected components. For this, the connected components are extracted, and the contours and edges of these components are then detected.

Louloudis et al. proposed in [7] a technique for the extraction of lines and words from ancient Greek manuscripts. The Hough transform is applied to the connected components, using the centroids of the rectangles that enclose their points as voters. The size of these rectangles is estimated from the average character size in the document. The proposed system was tested on the ICDAR 2007 database, with 80 documents containing 1773 lines, and on 40 ancient manuscripts containing 1095 lines. The line extraction rate is 97%.

Another method used for segmentation is the snake, or active contour. Using this technique, Bukhari et al. proposed a method for extracting the lines of handwritten documents with parameterized snakes [8]. The proposed system was tested on the ICDAR 2007 database, with 80 documents containing 1770 lines. The line extraction rate is 96.3%.

Du et al. used the Mumford-Shah model [10] for the extraction of lines from handwritten documents [9]. The proposed system was tested on 100 Chinese documents, 96 Indian documents and 100 Korean documents. The line extraction rate is 98% for Chinese documents, 98% for Indian documents and 96% for Korean documents.

Manohar et al. proposed in [11] a method that combines, in an undirected graph, the text lines produced by a set of handwritten text line segmentation methods. The graph nodes correspond to connected components, and the edges connect pairs of connected components.

III. PROPOSED WORK

The proposed method segments handwritten and/or printed documents into text lines, words and pieces of words. It is based on the algorithm for the construction of the isothetic covers of a digital object [1]. Our starting point is the construction of the Outer Isothetic Cover (OIC) of a document. We modified this algorithm so that a document is first segmented into blocks of text, if it is composed of several text blocks; each block is then segmented into a set of lines, and each line is segmented into words and/or pieces of words.

A. Construction of the Outer Isothetic Cover (OIC)

In order to segment a document, we construct the outer isothetic covers of the corresponding document image after its binarization. To do this, we impose an isothetic grid of size g on the binarized image.

Let Q1, Q2, Q3 and Q4 be the four quadrants incident at a grid point p(i, j), as shown in Fig. 1. Whether the grid point p is a vertex depends on how many of the quadrants contain part of the object. There are 2^4 = 16 different arrangements of object containment in these four quadrants, which can be reduced to five cases. Let Cq (q = 0, 1, ..., 4) denote the case of all the arrangements for which q out of the 4 quadrants are occupied by the object. If p belongs to case C1, it is a 90° vertex of the isothetic polygon; if it belongs to case C3, it is a 270° vertex. For case C2, if the two diagonal quadrants are occupied, p is considered a 270° vertex; otherwise, p is a non-vertex grid point lying on some edge of the polygon. For case C0, p is just an ordinary grid point lying outside the polygon, whereas for case C4, p is a grid point lying inside the polygon [1].

Figure 1. Five combinatorial cases (16 subcases) [12].

To draw the polygons that correspond to text blocks, text lines and words, we modified the TIPS algorithm [1]. With a proper grid size g, each polygon is constructed as follows. First, the document is binarized. Then the grid points are traversed in row-major order until a 90° vertex (the 'start vertex') is found. Subsequent grid points are classified and marked as 'visited', and the direction of traversal is determined from each such grid point until the start vertex is reached again. This completes the outer isothetic cover corresponding to one object (a text block, a text line or a word). The procedure is iterated over the remaining unvisited grid points until the next 90° vertex is found, which yields the polygon corresponding to another object. Finally, once all the grid points have been visited, the algorithm reports the vertex sequences of all the isothetic polygons corresponding to the text blocks, text lines or words in the input document.

Setting the grid size g: For each isothetic polygon to correspond to exactly one object, and hence to one sequence of vertices, an appropriate grid size must be specified. For each level of segmentation, the grid size g is therefore chosen after a series of tests on a set of document images.

B. Segmentation of the document in text blocks

A document can be composed of one or more paragraphs (or blocks of text). These text blocks can be arranged side by side, horizontally or vertically (Fig. 2). In this case, we divide the document into text parts in order to simplify further processing, i.e. the segmentation into text lines.

The text blocks are extracted by varying the grid size g: the larger the grid size, the better the block segmentation. Fig. 2 shows the results of segmenting a handwritten Arabic document into blocks of text for two grid sizes.

Figure 2. Segmentation of handwritten Arabic document into text block (a) g = 16 (b) g = 13.

C. Segmentation of (block) text in lines

A text or text block can be segmented into text lines. By changing the grid size g, we applied our approach to segment a text into text lines. Unlike the segmentation of a document into text blocks, the same algorithm is here combined with a mathematical morphology operator (closing) in order to obtain a polygon that covers the entire line. The results are shown in Fig. 4.

D. Segmentation of line into words and/or into pieces of word

We then used the algorithm to extract words from a line. A line of handwritten or printed Arabic text can be segmented into words and/or parts of words. This is because Arabic writing is cursive: a word can be composed of several parts (Pieces of Arabic Words, PAWs), and sometimes there is enough space between them to separate them. In Latin script, by contrast, a line can be segmented into words or into characters.

IV. EXPERIMENTS AND RESULTS

In order to evaluate our approach, we tested the algorithm on a variety of documents in two types of script: handwritten and printed. Handwritten Arabic documents composed of several text blocks were segmented into blocks; the result is shown in Fig. 2. As with the handwritten script, we also applied our method to printed Arabic and Latin documents.

The first database is a collection of 200 forms written by 200 different native writers. The writers were asked to write a paragraph of Arabic text of up to 10 sentences. There were no restrictions on the writing. This database is an extension of the standard benchmarking IfN/ENIT database. The second collection of Arabic handwritten documents includes scans of historical documents collected during a research project at the IfN. The printed Latin text is from Google Books (version 7, August 2007).

We then segmented the text blocks into text lines. The result for handwritten Arabic text is shown in Fig. 4. The same segmentation was applied to printed Arabic and Latin script.

The last step of our method is to extract the connected components from each line. The result of segmenting a handwritten Arabic text line into words or parts of words is shown in Fig. 3. In the same step, we applied our method to printed Arabic and Latin text lines.

Figure 3. Segmentation of handwritten text line Arabic into word or into pieces of words: (a) g = 2 (b) g = 4

Figure 4. Text lines segmentation (a) g = 1 (b) g = 2

V. CONCLUSION

In this paper, we reviewed some segmentation techniques and then proposed a new segmentation method for images of handwritten and printed documents. The idea of this method is inspired by the algorithm for the construction of isothetic covers of a digital object [1]. We have shown that the results of the segmentation depend on the grid size g. To segment composite documents into text blocks, we used a large value of g. To extract the text lines from the blocks, and the words or pieces of words from the text lines, we reduced the grid size g.

The method performs well on clean document images, especially on printed text. This is because, in printed script, the characters of two adjacent lines neither touch nor overlap. In handwritten text, by contrast, these situations occur frequently and sometimes lead to incorrect results.

REFERENCES

[1] A. Biswas, P. Bhowmick, B.B. Bhattacharya. Construction of isothetic covers of a digital object: A combinatorial approach. 2010.
[2] Bennasri, A., Zahour, A. and Taconet, B. (1999). Extraction des lignes d’un texte manuscrit arabe. Vision Interface’99.
[3] Nicolaou, A. and Gatos, B. (2009). Handwritten text line segmentation by shredding text into its lines. International Conference on Document Analysis and Recognition.
[4] Zahour, A., Likforman-Sulem, L., Boussellaa, W. and Taconet, B. (2007). Text line segmentation of historical Arabic documents. In 9th Int. Conf. on Document Analysis and Recognition.
[5] Duda, R. O. and Hart, P. E. (1972). Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM.
[6] Malleron, V., Eglin, V., Emptoz, H., Dord-Crouslé, S. and Régnier, P. (2009). Text lines and snippets extraction for 19th century handwriting documents layout analysis. International Conference on Document Analysis and Recognition.
[7] Malleron, V., Eglin, V., Emptoz, H., Dord-Crouslé, S. and Régnier, P. (2009). Text lines and snippets extraction for 19th century handwriting documents layout analysis. International Conference on Document Analysis and Recognition.
[8] Bukhari, S. S., Shafait, F. and Breuel, T. M. (2009). Script-independent handwritten textlines segmentation using active contours. In ICDAR09.
[9] Du, X., Pan, W. and Bui, T. D. (2009). Text line segmentation in handwritten documents using Mumford-Shah model. Pattern Recognition.
[10] Mumford, D. and Shah, J. (1989). Optimal approximations by piecewise smooth functions and associated variational problems. Commun. Pure Appl. Math.
[11] Vasant Manohar, Shiv N. Vitaladevuni, Huaigu Cao, Rohit Prasad and Prem Natarajan. Graph Clustering-based Ensemble Method for Handwritten Text Line Segmentation. ICDAR 2011.
[12] Aisharjya, Sakar et al. Word Segmentation and Baseline Detection in Handwritten Documents Using Isothetic Covers. International Conference on Frontiers in Handwriting Recognition 2010.
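
The five-case rule used in Section III-A to decide whether a grid point is a vertex can be illustrated with a short Python sketch. This is only a minimal illustration under our own assumptions, not the modified TIPS algorithm itself: the binarized image is assumed to be a list of rows with 1 for object (ink) pixels, grid point p(i, j) is assumed to sit at pixel (i*g, j*g), a cell counts as occupied if it contains at least one object pixel, and the function names `cell_occupied` and `classify_grid_point` as well as the returned labels are hypothetical.

```python
def cell_occupied(img, top, left, g):
    """Return True if any object pixel (value 1) lies in the g x g cell
    whose top-left pixel is (top, left). Pixels outside the image are
    treated as background."""
    rows, cols = len(img), len(img[0])
    for r in range(max(top, 0), min(top + g, rows)):
        for c in range(max(left, 0), min(left + g, cols)):
            if img[r][c]:
                return True
    return False


def classify_grid_point(img, i, j, g):
    """Classify grid point p(i, j) by the number q of occupied cells
    among the four cells that meet at p (cases C0 to C4)."""
    y, x = i * g, j * g
    # Occupancy of the four incident cells. The ordering (top-left,
    # top-right, bottom-left, bottom-right) is our own convention.
    tl = cell_occupied(img, y - g, x - g, g)
    tr = cell_occupied(img, y - g, x, g)
    bl = cell_occupied(img, y, x - g, g)
    br = cell_occupied(img, y, x, g)
    q = tl + tr + bl + br
    if q == 0:
        return "outside"        # C0: ordinary point outside the polygon
    if q == 1:
        return "vertex_90"      # C1: 90 degree vertex
    if q == 2:
        # C2: a vertex only when the two occupied cells are diagonal.
        diagonal = (tl and br) or (tr and bl)
        return "vertex_270" if diagonal else "edge"
    if q == 3:
        return "vertex_270"     # C3: 270 degree vertex
    return "inside"             # C4: point inside the polygon


# Example: one occupied 2 x 2 cell in the top-left corner of a 4 x 4 image.
img = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
print(classify_grid_point(img, 1, 1, 2))  # vertex_90
```

In this sketch a single occupied cell yields four 90° vertices, one at each of its corners, which is the smallest isothetic polygon the grid can produce. Increasing g merges nearby components into the same cell, which is consistent with the use of a large g for text blocks and a small g for words.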