The 2nd Workshop SIDOP - March 23-24 - Hammamet, Tunisia


Segmentation of Handwritten and Printed Arabic Documents

Ghazouani Fethi, IFN (1), ENIT, Tunis, Tunisia. Email: gfethi@yahoo.fr
Maddouri Mondher, FST, Tunis, Tunisia. Email: mondher.maddouri@fst.rnu.tn
Maddouri Snoussi Samia, ENIT, Tunis, Tunisia. Email: samia_maddouri@yahoo.f
El Abed Haikel (1). Email: elabed@tu-bs.de
Volker Margner (1). Email: Maergner@ifn.ing.tu-bs.de

(1) Institute for Communications Technology, Braunschweig, Germany


Abstract: In this paper, we propose a new text line segmentation method for
handwritten and printed Arabic document images that uses the Outer Isothetic
Cover (OIC) algorithm of a digital object. In the first step, we use this
method to segment a composite document into text blocks. In the second step,
we extract the text lines of each text block. Finally, each text line is
segmented into words or into pieces of Arabic words (PAWs).

The first results obtained at the current stage of the proposed method, on
roughly a dozen texts, are encouraging. We have also tested this method on
documents written in Latin script.

Keywords: handwritten and modern documents; text line segmentation; document
image; pieces of Arabic words

I.    INTRODUCTION

The first step in automatic document recognition is the segmentation of the
text image into text lines. The objective of this step is to assign each
component of the text to the appropriate line, so that the data can be
prepared for further processing such as normalization, word segmentation and
feature extraction.

The segmentation of handwritten text is complicated by variation in the
interline distance and by undulating baselines, which produce different text
orientations. The characters of two adjacent text lines may touch or overlap,
which considerably complicates line segmentation. In Arabic script, these
situations are frequent because of the presence of ascending and descending
characters. The large number of diacritical symbols also often generates
false lines.

Most work on the segmentation of a page into lines is based on a
decomposition of the image into connected components. After the lines have
been separated, the words of each line are separated, and each word is then
segmented into pieces (parts) of words.

In this article, we focus essentially on the segmentation of Arabic document
images into text blocks and lines. We then apply our method to the
segmentation of Latin documents.

In the following, we present some techniques applied to the segmentation of
documents into text lines. We then present our approach to the segmentation
of handwritten and printed Arabic documents into blocks, text lines, and
words or parts of words.

The results of our segmentation method are then shown through tests on
historical and modern documents.

Finally, we end the article with a conclusion and perspectives that indicate
a possible extension of the proposed approach.

II.    RELATED WORKS

Several methods have been proposed for the segmentation of documents. For
example, Bennasri et al. proposed a projection-based method to extract the
text lines of Arabic script [2]. First, the document is divided into multiple
columns to correct the problem of sinuosity. Then, the starting points of all
lines are detected from the minima of the partial projection profile.
Finally, contour tracking is carried out along each line, first in the
direction of writing and then in the opposite direction.

Nicolaou et al. [3] proposed a technique to segment the lines of Latin
manuscripts using minima tracers (axes). These minima are estimated from the
vertical projection histogram. The method was tested on a subset of the
ICDAR 2007 database consisting of 20 documents containing 476 lines and 80
documents containing 1771 lines, and achieved an extraction rate of 98.6%.

Another approach to line extraction in ancient Arabic manuscripts was
proposed by Zahour et al. [4]. Initially, the document is divided into
columns of equal size. Then, each column is segmented into three types of
text blocks: small blocks, which generally represent the diacritical symbols,
medium blocks that correspond to the body text, and large




blocks that reflect the overlap of words between adjacent lines.

There are also segmentation methods based on the Hough transform, a
technique that is widely used for the extraction of text lines [5]. For
example, in [6] the Hough transform is combined with a method for grouping
connected components: the connected components are extracted, and then the
contours and edges of these components are detected.

Louloudis et al. proposed in [7] a technique for the extraction of lines and
words from ancient Greek manuscripts. The Hough transform is applied to the
connected components, with the centroids of the rectangles enclosing their
points acting as voters. These rectangles are estimated from the average
character size in the document. The system was tested on the ICDAR 2007
document set, which is divided into 80 documents containing 1773 lines and 40
ancient manuscripts containing 1095 lines. The line extraction rate is 97%.

Another family of methods relies on snakes (active contours). Using this
technique, Bukhari et al. proposed a method based on parameterized snakes for
extracting the lines of handwritten documents [8]. The system was tested on
the ICDAR 2007 document set of 80 documents containing 1770 lines. The line
extraction rate is 96.3%.

Du et al. used the Mumford-Shah model [10] for the extraction of lines from
handwritten documents [9]. The system was tested on Chinese, Indian and
Korean documents (about 100 documents per script). The line extraction rate
is 98% for Chinese documents, 98% for Indian documents and 96% for Korean
documents.

A more recent method, proposed by Manohar et al. [11], combines the text
lines produced by a set of handwritten text line segmentation methods in an
undirected graph. The graph nodes correspond to connected components, and the
edges connect pairs of connected components.

III.    PROPOSED WORK

The proposed method segments handwritten and/or printed documents into text
lines, words and pieces of words. It is based on the algorithm for the
construction of the isothetic covers of a digital object [1].

Our aim was to find a new segmentation method for document images. We started
by constructing the Outer Isothetic Cover (OIC) of documents. We then
modified this algorithm in order to segment a document into text blocks, if
the document is composed of text blocks. Each block is then segmented into a
set of lines, and each line is segmented into words and/or pieces of words.

A. Construction of the Outer Isothetic Cover (OIC)

In order to segment a document, we construct the outer isothetic covers of
the corresponding document image after its binarization. To do this, we
impose an isothetic grid of size g on the binarized image.

Let Q1, Q2, Q3, and Q4 be the four quadrants incident at a grid point
p(i, j), as shown in Fig. 1. Whether the grid point p is a vertex depends on
how many of the quadrants contain part of the object. There are 2^4 = 16
different arrangements of object containment in these four quadrants, which
can be reduced to five cases. Let C_q (q = 0, 1, ..., 4) denote the case
covering all the arrangements in which q of the 4 squares are occupied by the
object. If p belongs to case C1, it is a 90° vertex of the isothetic polygon;
if p belongs to case C3, it is a 270° vertex. For case C2, if the two
occupied quadrants are diagonally opposite, p is considered a 270° vertex;
otherwise, p is a non-vertex grid point lying on some edge of the polygon.

For case C0, p is just an ordinary grid point lying outside the polygon,
whereas for case C4, p is a grid point lying inside the polygon [1].

Figure 1. Five combinatorial cases (16 subcases) [12].

To draw the polygons corresponding to text blocks, text lines and words, we
have modified the TIPS algorithm [1]. With a proper grid size g, each polygon
is constructed as follows.

First, the document is binarized. Then the grid points are traversed in
row-major order until a 90° vertex (the 'start vertex') is found. Subsequent
grid points are classified and marked as 'visited', and the direction of
traversal is determined at each such grid point until the start vertex is
reached again. This completes the outer isothetic cover corresponding to one
object (text block, text line or word). The procedure is iterated over the
remaining unvisited grid points until the next 90° vertex is found, which
yields the polygon corresponding to another object. Finally, when all the
grid points have been visited, the algorithm reports the vertex sequences of
all the isothetic polygons corresponding to the text blocks, text lines or
words in the input document.

Setting the grid size g: For each isothetic polygon to correspond to exactly
one object, and hence to one sequence of vertices, an appropriate grid size
must be specified. For each type of segmentation, the grid size g is
therefore chosen after a series of tests on a set of document images.

B. Segmentation of the document in text blocks

A document can be composed of one or more paragraphs (or blocks of text).
These text blocks can be
arranged in parallel, horizontally or vertically (Figure 2). In this case, we
divide the document into parts of text in order to simplify further
processing, i.e. the segmentation of the text lines.

These text blocks are extracted by varying the grid size g: the larger g is,
the better the block-level results. Figure 2 shows the results of segmenting
handwritten Arabic documents into text blocks for different grid sizes g.

Figure 2. Segmentation of a handwritten Arabic document into text blocks:
(a) g = 16, (b) g = 13.

C. Segmentation of (block) text in lines

A text or a text block can be segmented into text lines. We apply the same
approach with a different grid size g to segment a text into text lines.
Unlike the segmentation of a document into text blocks, the algorithm is here
combined with a mathematical morphology operator (closing) in order to obtain
a polygon that covers the entire line. The results are shown in Fig. 4.

D. Segmentation of line into words and/or into pieces of word

We then use the algorithm to extract words from a line. A line of Arabic
text, whether handwritten or printed, can be segmented into words and/or
parts of words. This is because Arabic writing is cursive: a word can be
composed of several parts (pieces of Arabic words, PAWs), and there is
sometimes enough space between them to separate them. In Latin script, by
contrast, a line can be segmented into words or into characters.

IV.    EXPERIMENTS AND RESULTS

To evaluate our approach, we tested the algorithm on a variety of documents
in two types of script: handwritten and printed. Handwritten Arabic documents
composed of text blocks were segmented into blocks; the result is shown in
Figure 2. As with the handwritten script, we applied our method to printed
Arabic and Latin documents.

The first database is a collection of 200 forms written by 200 different
native writers. The writers were asked to write a paragraph of Arabic text of
up to 10 sentences. There were no restrictions on the writing. This database
is an extension of the standard IfN/ENIT benchmark database. The second
collection of handwritten Arabic documents consists of scans of historical
documents collected during a research project at the IfN. The printed Latin
text is taken from Google Books (version 7, August 2007).

We then segmented the text blocks into text lines. The result for handwritten
Arabic text is shown in Fig. 4. The same segmentation was applied to printed
Arabic and Latin script.

The last step of our method is to extract the connected components from each
line. The result of segmenting a handwritten Arabic text line into words or
parts of words is shown in Fig. 3. In the same step, we applied our method to
printed Arabic and Latin text lines.
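
As an illustration of the vertex classification described in Section III.A,
the following minimal Python sketch labels a single grid point from the
occupancy of its four incident cells. It is not the modified TIPS
implementation used in these experiments: the function names, the
list-of-lists image representation and the treatment of cells outside the
image as empty are assumptions made for this example, and the polygon tracing
step is omitted.

```python
def cell_occupied(img, r0, c0, g):
    """Return True if the g x g cell whose top-left pixel is (r0, c0)
    contains at least one object (non-zero) pixel. Parts of the cell that
    fall outside the image are treated as empty (an assumption)."""
    rows, cols = len(img), len(img[0])
    for r in range(max(r0, 0), min(r0 + g, rows)):
        for c in range(max(c0, 0), min(c0 + g, cols)):
            if img[r][c]:
                return True
    return False


def classify_grid_point(img, i, j, g):
    """Classify grid point p(i, j), located at pixel (i * g, j * g), from
    the number q of occupied cells among its four incident cells."""
    y, x = i * g, j * g
    upper_left = cell_occupied(img, y - g, x - g, g)
    upper_right = cell_occupied(img, y - g, x, g)
    lower_left = cell_occupied(img, y, x - g, g)
    lower_right = cell_occupied(img, y, x, g)
    q = upper_left + upper_right + lower_left + lower_right

    if q == 0:
        return "outside"      # case C0
    if q == 1:
        return "vertex90"     # case C1
    if q == 3:
        return "vertex270"    # case C3
    if q == 4:
        return "inside"       # case C4
    # Case C2: diagonal occupancy gives a 270 degree vertex,
    # adjacent occupancy gives a non-vertex point on an edge.
    diagonal = (upper_left and lower_right) or (upper_right and lower_left)
    return "vertex270" if diagonal else "edge"


# Toy binary image: one occupied 2 x 2 cell in the top-left corner.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
labels = {(i, j): classify_grid_point(image, i, j, 2)
          for i in range(3) for j in range(3)}
print(labels)
```

In this toy image, the four grid points around the single occupied cell are
90° vertices, which is the square cover one would expect for an isolated
object at this grid size.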


                                                                   3


Figure 3. Segmentation of a handwritten Arabic text line into words or into
pieces of words: (a) g = 2, (b) g = 4.
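
The effect of the grid size on word versus PAW segmentation can be
illustrated with a simplified stand-in for the cover construction. Instead of
tracing polygons, the sketch below marks every grid cell of size g that
contains ink and counts the groups of touching cells; each group roughly
corresponds to the region one isothetic polygon would enclose. The function
names, the toy text line and the use of 8-connectivity between cells are
assumptions made for this example, not details taken from the method itself.

```python
def occupied_cells(img, g):
    """Reduce a binary image to a coarse grid: a cell is True if any pixel
    inside the corresponding g x g block is an object pixel."""
    rows, cols = len(img), len(img[0])
    n_rows = (rows + g - 1) // g
    n_cols = (cols + g - 1) // g
    grid = [[False] * n_cols for _ in range(n_rows)]
    for r in range(rows):
        for c in range(cols):
            if img[r][c]:
                grid[r // g][c // g] = True
    return grid


def count_groups(grid):
    """Count groups of occupied cells that touch each other (8-connectivity),
    using an iterative flood fill."""
    n_rows, n_cols = len(grid), len(grid[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    groups = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if not grid[r][c] or seen[r][c]:
                continue
            groups += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < n_rows and 0 <= nx < n_cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return groups


# Toy text line: two strokes separated by a 1-pixel gap (two PAWs of one
# word), then a 5-pixel gap, then a third stroke (a second word).
row = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
line = [row[:], row[:]]

for g in (1, 2, 8):
    print(g, count_groups(occupied_cells(line, g)))
```

On this toy line, g = 1 separates the three strokes, g = 2 merges the two
strokes that are one pixel apart while keeping the distant stroke separate,
and g = 8 merges everything. This mirrors the qualitative behaviour in
Figure 3, where a smaller grid isolates pieces of words and a larger grid
groups them.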


Figure 4. Text line segmentation: (a) g = 1, (b) g = 2.
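
The closing step of Section III.C, which is applied before the cover is
built so that one polygon spans a whole line, can be sketched as follows.
The shape of the structuring element is not specified in this paper; the
horizontal element of width 2 * half + 1 used here, the zero padding at the
image borders and the function names are assumptions made for this example.

```python
def _dilate_h(img, half):
    """Horizontal dilation: a pixel is set if any pixel within `half`
    columns of it (in the same row) is set."""
    cols = len(img[0])
    return [[int(any(row[max(0, c - half):c + half + 1]))
             for c in range(cols)] for row in img]


def _erode_h(img, half):
    """Horizontal erosion: a pixel is set only if every pixel within `half`
    columns of it is set; positions outside the image count as background."""
    cols = len(img[0])
    return [[int(c - half >= 0 and c + half < cols
                 and all(row[c - half:c + half + 1]))
             for c in range(cols)] for row in img]


def close_horizontal(img, half):
    """Morphological closing (dilation followed by erosion) with a horizontal
    structuring element of width 2 * half + 1. The image is zero-padded by
    `half` columns on each side so that borders behave like open space."""
    padded = [[0] * half + list(row) + [0] * half for row in img]
    closed = _erode_h(_dilate_h(padded, half), half)
    return [row[half:len(row) - half] for row in closed]


# Toy text line: two words separated by a 2-pixel gap.
line = [[0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]]
print(close_horizontal(line, 1))
```

With half = 1, gaps of up to two pixels are bridged, so the two words form a
single blob, while wider gaps are left open. The size of the element would
have to be tuned together with the grid size g on real documents.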




V.    CONCLUSION

In this paper, we reviewed some segmentation techniques and then proposed a
new segmentation method for handwritten and printed document images. The
method is inspired by the algorithm for the construction of isothetic covers
of a digital object [1]. We have shown that the results of the segmentation
depend on the grid size g. To segment composite documents into text blocks,
we used a large value of g. To extract the text lines from the blocks, and
the words or pieces of words from each text line, we reduced the grid size g.

The method performs well on clean document images, especially printed text
documents, because in printed script the characters of two lines neither
touch nor overlap. In handwritten documents, by contrast, these situations
occur frequently and sometimes lead to incorrect results.

REFERENCES

[1]   A. Biswas, P. Bhowmick, B.B. Bhattacharya, Construction of isothetic
      covers of a digital object: A combinatorial approach, 2010.

[2]   Bennasri, A., Zahour, A. and Taconet, B. (1999). Extraction des lignes
      d'un texte manuscrit arabe. Vision Interface'99.

[3]   Nicolaou, A. and Gatos, B. (2009). Handwritten text line segmentation
      by shredding text into its lines. International Conference on Document
      Analysis and Recognition.

[4]   Zahour, A., Likforman-Sulem, L., Boussellaa, W. and Taconet, B. (2007).
      Text line segmentation of historical Arabic documents. In 9th Int.
      Conf. on Document Analysis and Recognition.

[5]   Duda, R. O. and Hart, P. E. (1972). Use of the Hough transformation to
      detect lines and curves in pictures. Commun. ACM.

[6]   Malleron, V., Eglin, V., Emptoz, H., Dord-Crouslé, S. and Régnier, P.
      (2009). Text lines and snippets extraction for 19th century handwriting
      documents layout analysis. International Conference on Document
      Analysis and Recognition.

[7]   Malleron, V., Eglin, V., Emptoz, H., Dord-Crouslé, S. and Régnier, P.
      (2009). Text lines and snippets extraction for 19th century handwriting
      documents layout analysis. International Conference on Document
      Analysis and Recognition.

[8]   Bukhari, S. S., Shafait, F. and Breuel, T. M. (2009).
      Script-independent handwritten textlines segmentation using active
      contours. In ICDAR09.

[9]   Du, X., Pan, W. and Bui, T. D. (2009). Text line segmentation in
      handwritten documents using Mumford-Shah model. Pattern Recognition.

[10]  Mumford, D. and Shah, J. (1989). Optimal approximations by piecewise
      smooth functions and associated variational problems. Commun. Pure
      Appl. Math.

[11]  Vasant Manohar, Shiv N. Vitaladevuni, Huaigu Cao, Rohit Prasad, and
      Prem Natarajan. Graph Clustering-based Ensemble Method for Handwritten
      Text Line Segmentation. ICDAR 2011.

[12]  Aisharjya, Sakar et al. Word Segmentation and Baseline Detection in
      Handwritten Documents Using Isothetic Covers. International Conference
      on Frontiers in Handwriting Recognition 2010.

