Extraction and Separation of Words from Bilingual Printed Document

Rabeb Ben Abdelbaki and Sofiene Haboubi
SIDOP Research Group, Signal, Image and Information Technologies
National Engineering School of Tunis
BP 37 Belvedere, Tunis, TN-1002, Tunisia
benabdelbakira@yahoo.com, sofiene.haboubi@istmt.rnu.tn

SIDOP'12: 2nd Workshop on Signal and Document Processing

Abstract: In this paper, we present our work on the extraction and separation of words from bilingual printed documents. The approach is based on the structuring element of the morphological dilation. We report results for Arabic, Latin and bilingual Arabic-Latin scripts, show the limitations of the approach and present possible improvements.

Keywords: Script; Arabic; Latin; Discrimination; Separation; Mathematical morphology; Dilation; Structuring element

I. INTRODUCTION

A character recognition system, called OCR (Optical Character Recognition), finds the characters forming a text, recognizes them individually and then validates them by lexical recognition of the words that contain them. In other words, OCR is the process that turns a scanned paper document into digital text.

Because a given OCR system handles a single script, an important problem appears when the document is no longer monolingual. If the document is multilingual, the OCR loses its ability to read it, because the features it relies on depend on the structural properties of the characters and on the style and type of writing, which generally differ from one script to another. It is therefore imperative to identify the languages present in the document in order to redirect each part to the appropriate character recognizer.

In practice, we cannot speak of discrimination between scripts without involving document segmentation. The segmentation of documents into words is an important step in the document recognition process, and it becomes crucial for multilingual documents. It is the foundation of all the following steps and also increases the efficiency of a recognition system.

Segmentation of text into words takes place at the script discrimination step for the local approach, which studies words as components and therefore requires prior knowledge of the individual words constituting the document. It is also a necessary step for discriminating between printed and handwritten documents and for any system that automatically processes multilingual documents. This task is usually delicate and complex in the multilingual context, given the large differences between scripts in the form of letters, spacing, etc. Our work aims to develop a new method for the separation and extraction of words in a bilingual printed document. In the following, we first state the characteristics of the Arabic and Latin scripts and then mention some related works. After that, we present our method for separating and extracting words from bilingual printed documents. Finally, we interpret the results of this work.

II. MORPHOLOGICAL CHARACTERISTICS OF ARABIC AND LATIN SCRIPTS

Script is the graphical representation of a language through signs drawn on a support. Since its appearance around the third millennium BC, it has continued to evolve with the languages it represents. In this paper, we focus on the Latin and Arabic scripts.

A. Arabic script

The Arabic script is a consonantal script composed of 28 letters, excluding the "hamza", which behaves either as a full letter or as a diacritic, and the symbol "~", which is written only on the support of the character "ا".

An Arabic character can have up to four different forms, since its design changes with its position in the word or pseudo-word: initial, medial, final or isolated. The Arabic script is a semi-cursive writing.
Letters are generally linked to each other, and Arabic words can be composed of one or more pseudo-words written from right to left, in both printed and handwritten form.

Several Arabic letters have the same body and differ only in the number and location of diacritical marks. These diacritics can be above or below the baseline, in different places depending on the character, but never above and below simultaneously (Figure 1).

In Arabic, 15 of the 28 letters of the alphabet, presented in Table 1, have diacritical points.

Figure 1: Characters and common body

Table 1: Letters with diacritical points

The Arabic script has no capital letters, and Arabic characters include a loop that can have different forms.

Figure 2: Different forms of loops [Touj et al., 04]

In addition, the Arabic script varies vertically and horizontally because of the presence of horizontal and vertical ligatures between characters of the same word.

An Arabic word does not have a fixed length. It may include one or more pseudo-words called PAWs (Piece of Arabic Word), each including a different number of characters. The presence of pseudo-words in the Arabic script increases the complexity of its segmentation. These PAWs mislead segmentation algorithms because they introduce large and variable intra-word spaces compared to the intra-word spaces in Latin.

Figure 3: Examples of words containing different PAWs

B. Latin script

The Latin script is bicameral: each character has two forms, one called lowercase, the other called uppercase or capital. In general, each grapheme possesses these two forms, with a few exceptions that change from one script to another. The Latin alphabet has 26 basic letters. In uppercase form, letters change shapes and sizes.

Unlike the Arabic alphabet, the Latin alphabet consists of two types of graphemes, vowels and consonants. Latin characters are composed of 5 vowels, presented in Table 2, and 21 consonants, listed in Table 3.

Table 2: Latin vowels

Table 3: Latin consonants

The Latin alphabet is one of the richest alphabets in national variations because of its geographic and temporal spread. Each Latin script is based on the fundamental letters of the Latin alphabet, but it may have some specific letters, either considered as variants of the basic ones or considered as new letters. Table 4 shows the different forms of a Latin grapheme.

Table 4: The different forms of a Latin grapheme

As in the Arabic script, several variants of Latin characters have diacritical signs, such as points above the body of the character, accents (acute accent, grave accent, circumflex), tilde, etc. However, only two basic characters have diacritical points.

The Latin script is written from left to right. It is non-cursive: in its printed form, its letters are isolated from one another and separated by intra-word spaces. The Latin alphabet also has many loops that can have different forms.

Table 5: Latin loops

III. RELATED WORKS

Word segmentation is an important phase in the document recognition process. This phase is crucial in the case of multilingual documents, where it becomes obligatory to segment the document and identify its words individually.

This step is the foundation of all the following steps: the segmented words become the inputs of the other steps of the recognition process. Despite the diversity of segmentation approaches and methods, document segmentation, especially into words, is still an open field and an active line of research that interests many scientists. Our literature review led to a list of works in this area, of which we mention the most interesting.

• [Ma and Doermann, 03] used the Docstrum algorithm of O'Gorman for the segmentation of bilingual documents, applying it to Arabic-English, Chinese-English, English-Hindi and Korean-English dictionaries. This algorithm is a bottom-up approach based on the calculation of the k nearest neighbors of each connected component of the document. After the removal of noise, the connected components are separated into two groups according to a factor selected from the proportion of character sizes. One group consists of the most dominant characters and the other consists of the characters of titles and section headers. Then, for each connected component, they seek the k nearest neighbors; each pair of neighbors has an angle and an associated distance. By grouping the components through these features, the geometric areas of the physical structures of the document can be determined. The proposed method is independent of changes in the orientation of the document and in the inter-word spacing. However, the value of k depends on the structure of the document.

• [Dhandra et al., 07] segmented bilingual documents containing one of India's regional languages (Hindi, Kannada and Tamil) and English numerals. Their method segments the text into lines, then projects each line vertically and segments it into words based on the analysis of the valleys present in the vertical projection, which delimit the different words.

• [Chanda et al., 07] present a segmentation method for bilingual documents containing Thai and English words. Their method encodes the different lines of text depending on the position of black pixels in each line. After segmenting the document into lines, their method goes through the histogram of each vertical line and produces a 0 if it encounters two black pixels or fewer; otherwise the scan is valued at 1. The chain produced is then analyzed: if there is a run of 0s with a minimum length equal to 2 * k1, its midpoint is considered as the boundary for word segmentation. The value of k1 is an estimate of the white gap between two consecutive characters of a line of the document.

• [Rezaee et al., 09] proposed a word segmentation method for bilingual documents containing English and Farsi scripts. Their method is based on directional projections of the image and on the analysis of some attributes, such as the gaps between words and a threshold derived from the peak distribution, to segment the text lines into words.

• [Da Silva et al., 11] proposed a method of word segmentation for Latin documents containing both handwritten and printed forms. This method segments the text into connected components and extracts them by cropping a bounding area. After extracting the components, the method merges near neighbors that lie in the same line and whose bounding boxes are closer than a threshold "th" calculated by a formula [formula missing in the source], where k is the number of frames and Li the widths of all the frames of the image.

• [Haboubi et al., 11] proposed a segmentation method for bilingual documents containing Arabic and Latin scripts, based on the use of mathematical morphology to delimit the different words in the text. This method uses the morphological dilation with a line structuring element. They apply sequential dilations, increasing the size of the structuring element each time, in order to determine a threshold that separates the spaces between words from the intra-word spaces. This threshold corresponds to the dilation order where the number of connected components has a zero standard deviation.

IV. PROPOSED APPROACH

The approach developed for the word segmentation of printed bilingual documents includes several steps (Figure 4). Starting from a document image, we begin with a pre-processing that prepares the scanned document for segmentation, and then we move to the detection and extraction of text lines. After that, we analyze each line separately and extract the different words present in the document.

Figure 4: The process of word segmentation from a bilingual printed document

A. Pre-processing

The document image is the result of an acquisition step using a scanner. Our approach does not dwell on pre-processing, because the proposed system is meant to work with images preprocessed in advance by a dedicated document pre-processing system that handles, for example, the elimination of noise sometimes introduced during scanning, skew correction, deletion of diacritics, etc.

However, our approach preserves the amount of information present in the text, because we work with document images containing diacritical signs, given their important role in the understanding of the text. Their presence may nevertheless increase the complexity of the segmentation task because of problems encountered in the detection and extraction of lines. Indeed, in some writing styles, diacritics may exceed the upper or the lower limit of the line, which can cause errors in the line detection step.

In our case, the pre-processing step is limited to the binarization of the document image and the inversion of the black and white pixel values, in order to prepare the document for the line detection step.

B. Lines detection

This phase is rather difficult in the case of bilingual documents because of the large variability between Arabic and Latin printed scripts, and it becomes more complex with the presence of diacritical marks. The morphological study of the Latin and Arabic scripts shows the presence of a significant number of diacritical signs.

We have chosen the projection method to delimit the horizontal lines. This method meets the needs of our segmentation because we handle text documents with a simple structure. Our approach goes through the image horizontally and calculates the number of black pixels in each row of the matrix representing the image. Next, we analyze the projection histogram: if the number of black pixels changes from 0 to a positive value, this position is the lower limit of a line. Each time, we keep the positions of the white areas, which are used for cropping the image into different lines.

C. Words detection and extraction

Document segmentation can operate at different levels: characters, pseudo-words or words. The word level is the most difficult of these, given that the segmentation has to differentiate between different types of spaces (between characters, between pseudo-words and between words), which is not always obvious to a word extraction system.

The objective of the proposed approach is to segment the document image in order to separate and extract the words of a bilingual printed document. Segmentation methods split documents into connected components: characters in the case of a Latin printed document, or pseudo-words in the case of an Arabic printed document, because printed Latin writing is non-cursive and printed Arabic writing is semi-cursive.

Our approach uses mathematical morphology to eliminate intra-word spacing and to build connected components formed by the different words of the bilingual printed document. We use the morphological dilation to enlarge the image by filling the holes, which in our case correspond to the intra-word spaces.

To achieve this goal, we must determine the best structuring element, one able to stick together the different characters of a word without sticking words together. At this level, two major problems appear. The first one is the size of the structuring element and the second one is its shape. The determination of these two characteristic features of the structuring element is the foundation of our work.

Choosing a single size of the structuring element for a whole document cannot give a good segmentation, because intra-word spaces differ from one font to another and depend on the font size. Similarly, the spaces between words depend on the text alignment, especially in the case of justified text. Moreover, a document can contain different fonts and sizes.

The shape of the structuring element addresses the problem of diacritics being extracted as separate words, which arises because our method does not eliminate these signs during pre-processing. A solution to this problem is to stick diacritics to their words. We therefore opted for shapes that have a height, and we searched for the height that solves the problem with minimal changes to the line.

The approach proceeds line by line to find, each time, the size and shape of the structuring element of the dilation that can separate and extract the different words of a document line correctly and with minimal changes. To this end, we proposed three methods to calculate the size of the structuring element and selected four specific shapes for it.

Figure 5: Proposed methods for the document segmentation

Our approach consists in testing all the combinations of size calculation methods and shapes of the structuring element. Each time, we fix the calculation method and vary the shape. We begin by applying a combination to Latin and Arabic printed documents. If we get good results, we continue testing on mixed documents. Otherwise, we consider it unnecessary to apply the combination to bilingual documents.

• Calculation of the structuring element's size

To calculate the size of the structuring element, we started by determining the list of spaces in each line, and then developed three different methods to derive the size from it.

  - Identification of document spaces

Our approach proceeds by analyzing the extracted lines. This analysis is based on the calculation of the vertical projection histogram to determine the values of the different spaces in the document. We scan each line vertically and calculate the number of black pixels present in each column of the line's image.

The next step is to analyze the vertical histogram obtained and to determine the positions and values of the inter-word and intra-word spaces: if the number of black pixels becomes zero after a sequence of non-zero values, this change corresponds to the presence of a space in the line. We store its position and calculate the length of this area. The value of the space, or its length, corresponds to the distance between the position where the space appears and the position of the first non-zero number of black pixels encountered afterwards in the vertical projection histogram. At the end, we obtain a list of the space values present in the considered line and a list of their positions.

  - Methods for calculating the size of the structuring element

After the detection of spaces in each line of the document, we determine the size of the structuring element according to the three proposed methods.

    o Method based on the median calculation

This method eliminates the redundant space values present in the considered line and then sorts the new list of spaces in ascending order to allow the interpretation of these values. This list reflects the nature of the spaces contained in the processed image. It begins with the relatively small values, which represent the spaces between characters in a word, and reaches the largest gap present in the line.

This method is based on the fact that the threshold space, able to stick together the characters of a word without sticking words together, has an intermediate value between the lower and upper bounds of the new list of spaces. The median value of this list is taken as the size of the structuring element of the dilation.

    o Method based on the average calculation

This second method is similar to the previous one in the determination of the distinct space values. It is based on the fact that the threshold value of the structuring element of the dilation is proportional to the number of spaces present in the given image and to their lengths. This method sets the size of the structuring element to the average length of the different spaces in the given line.

    o Method based on the calculation of the difference between the values of spaces

This method is based on the detection of the largest jump in length between spaces. It works on the entire list of spaces of the line to be processed. From this list, it calculates the lengths of the jumps between spaces. Then, it goes through the new list to determine the greatest difference between spaces. The threshold size of the structuring element is necessarily located between the two values that generated the biggest jump. The difference method assigns to the structuring element the average of the two spaces that bound the largest jump.

In summary, this method generates a list of jumps between the different space lengths, looks for the biggest jump, finds the two spaces related to this jump and calculates their average. The size of the structuring element corresponds to this average.

• Determination of the structuring element's shape

After calculating the size of the structuring element comes the choice of a suitable shape, one that allows the bilingual printed document to be segmented correctly.

The structuring element can have several shapes, such as square, diamond, polygon, Euclidean disc, line, point pairs, rectangle, etc.

In this approach, we were interested in four specific shapes of the structuring element: the first is the diamond, the second is the square, and the third and fourth are variants of the rectangle with some differences in the input parameters.

We used these shapes because of the presence of diacritical signs in the documents to be segmented. These shapes have in common a height proportional to the size of the structuring element, an important feature in our approach because it sticks the different diacritical marks to their words.

  - The diamond shape

For each line, the diameter of the diamond is equal to the size of the structuring element determined by one of the three calculation methods described above.

The following table shows the results found by combining the diamond shape with the three methods for calculating the size of the structuring element.

Table 6: Results of the diamond shape

Shape of the structuring element   Method of calculation   Script   Good extraction rate
Diamond                            Median                  Arabic   26.23%
                                                           Latin    59.87%
                                   Average                 Arabic   49.72%
                                                           Latin    82.82%
                                   Difference (jump)       Arabic   63.85%
                                                           Latin    91.81%

  - The square shape

For each line, the side of the square is equal to the size of the structuring element determined by one of the three calculation methods described above.

The following table shows the results found by combining the square shape with the three methods for calculating the size of the structuring element.

Table 7: Results of the square shape

Shape of the structuring element   Method of calculation   Script   Good extraction rate
Square                             Median                  Arabic   28.62%
                                                           Latin    53.69%
                                   Average                 Arabic   62.57%
                                                           Latin    85.23%
                                   Difference (jump)       Arabic   71.93%
                                                           Latin    93.56%

  - The rectangle shape with height 2 times the size

For each line, the width of the rectangle is equal to the size of the structuring element determined by one of the three calculation methods described above, and its height is equal to two times this value.

The following table shows the results found by combining the rectangle of height 2 times the size with the three methods for calculating the size of the structuring element.

Table 8: Results of the rectangle shape with height 2 times the size

Shape of the structuring element          Method of calculation   Script   Good extraction rate
Rectangle with height 2 times the size    Median                  Arabic   61.10%
                                                                  Latin    67.25%
                                          Average                 Arabic   88.07%
                                                                  Latin    93.29%
                                          Difference (jump)       Arabic   94.13%
                                                                  Latin    95.84%

  - The rectangle shape with height 3 times the size

For each line, the width of the rectangle is equal to the size of the structuring element determined by one of the three calculation methods described above, and its height is equal to 3 times this value.

The following table shows the results found by combining the rectangle of height 3 times the size with the three methods for calculating the size of the structuring element.

Table 9: Results of the rectangle shape with height 3 times the size

Shape of the structuring element          Method of calculation   Script   Good extraction rate
Rectangle with height 3 times the size    Median                  Arabic   72.11%
                                                                  Latin    72.35%
                                          Average                 Arabic   92.11%
                                                                  Latin    95.03%
                                          Difference (jump)       Arabic   94.86%
                                                                  Latin    97.05%

V. INTERPRETATIONS

The following table presents the best rates achieved for each shape of the structuring element.

Table 10: Best rates achieved for each form of the structuring element

Shape of the structuring element          Method of calculation   Script   Good extraction rate
Diamond                                   Difference (jump)       Arabic   63.85%
                                                                  Latin    91.81%
Square                                    Difference (jump)       Arabic   71.93%
                                                                  Latin    93.56%
Rectangle with height 2 times the size    Difference (jump)       Arabic   94.13%
                                                                  Latin    95.84%
Rectangle with height 3 times the size    Difference (jump)       Arabic   94.86%
                                                                  Latin    97.05%

We note that the best good extraction rates are 94.86% for Arabic and 97.05% for Latin. These rates are achieved by combining the size calculation method based on the difference between the space values with the rectangle shape whose height is equal to 3 times the size of the structuring element.

Applying this combination to the sample of printed bilingual documents gave a good extraction rate equal to 94.85%. This result is explained by how well the size calculation adapts to changes in the lengths of spaces between lines and documents, and by a height, proportional to the size, that covers the varying distances between diacritics and their words.

Figure 6 shows a sample run on a line from a printed bilingual document with diacritics.

Figure 6: Example of word segmentation from a bilingual printed document

The word segmentation of the printed bilingual document gave 14 words, which correctly corresponds to the words found in the given line of the document.

VI. CONCLUSION AND PERSPECTIVES

The separation and extraction of words in a printed bilingual document constituted the main contribution of our [...] recognition area, its different stages, and the various available methods for segmenting documents into words.

After studying the Arabic and Latin scripts, we proceeded to the implementation of our approach. We developed different methods for calculating the size of the structuring element of the morphological dilation, combined them with different shapes and tested them on samples of printed Arabic and Latin documents. After that, we compared the results, and the best performing combination was chosen for testing on printed bilingual documents, the subject of our study.

Although the result obtained by this word separation method is compelling, it has some limitations:

• The sample size is 945 words for the printed Latin documents, 545 words for the printed Arabic documents and 564 words for the printed bilingual documents;

• This approach only deals with printed documents;

• This approach is limited to bilingual Arabic and Latin documents.

As perspectives, we expect to enlarge the sample size to better test the performance of the proposed method. We also plan to extend our method to handwritten bilingual documents, to mixed bilingual documents (both handwritten and printed forms in the same document), to bilingual documents of any kind and even to multilingual documents.

REFERENCES

[Touj et al., 04] Sofiene Touj, Najoua Essoukri Ben Amara, Hamid Amiri, "Reconnaissance de l'Ecriture Arabe Imprimée par Transformée de Hough Généralisée", Conférence Internationale Francophone sur l'Ecrit et le Document (CIFED 04), 2004.

[Ma and Doermann, 03] Huanfeng Ma, David Doermann, "Gabor Filter Based Multi-class Classifier for Scanned Document Images", Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR'03), IEEE, 2003.

[Dhandra et al., 07] B.V. Dhandra, Mallikarjun Hangarge, Ravindra Hegadi and V.S. Malemath, "Word Level Script Identification in Bilingual Documents through Discriminating Features", International Conference on Signal Processing, Communications and Networking (ICSCN '07), 2007.

[Chanda et al., 07] S. Chanda, Oriol Ramos Terrades and U. Pal, "SVM Based Scheme for Thai and English Script Identification", Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), IEEE, 2007.

[Rezaee et al., 09] Hamideh Rezaee, Masoud Geravanchizadeh, Farbod Razzazi, "Automatic Language Identification of Bilingual English and Farsi Scripts", IEEE International Conference on Application of Information and Communication Technologies (AICT), 2009.

[Da Silva et al., 11] Lincoln Faria da Silva, Aura Conci, Angel Sanchez, "Word-Level Segmentation in Printed and Handwritten Documents", IEEE 18th International Conference on Systems, Signals and Image Processing (IWSSIP), Sarajevo, 2011.

[Haboubi et al., 11] Sofiene Haboubi, Samia Snoussi Maddouri, Hamid Amiri, "Discrimination between Arabic and Latin from bilingual documents", IEEE International Conference on Communications, Computing and Control Applications (CCCA), 2011.
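
APPENDIX: ILLUSTRATIVE SKETCH OF THE SIZE CALCULATION

The size calculation of Section IV (identification of spaces in a line, followed by the median, average and difference methods) can be illustrated by the minimal Python sketch below. It is not the authors' implementation. The function names are hypothetical, the line is reduced to its vertical projection (one black-pixel count per column), and two points are assumptions: the left and right margins are not counted as spaces, and the difference method is applied to the same sorted list of distinct space values as the other two methods.

```python
from statistics import median


def find_gaps(column_counts):
    """Return (position, length) of each run of empty columns between two inked columns."""
    gaps = []
    start = None          # start of the blank run currently being scanned
    seen_ink = False      # becomes True once the first inked column is met
    for x, count in enumerate(column_counts):
        if count == 0:
            if seen_ink and start is None:
                start = x
        else:
            if start is not None:
                gaps.append((start, x - start))
                start = None
            seen_ink = True
    return gaps           # a trailing blank run (right margin) is ignored


def distinct_sorted(lengths):
    """Drop redundant space values and sort the rest in ascending order."""
    if not lengths:
        raise ValueError("the line contains no space")
    return sorted(set(lengths))


def size_by_median(lengths):
    return median(distinct_sorted(lengths))


def size_by_average(lengths):
    values = distinct_sorted(lengths)
    return sum(values) / len(values)


def size_by_jump(lengths):
    """Average of the two space values that bound the largest jump."""
    values = distinct_sorted(lengths)
    if len(values) < 2:
        return float(values[0])
    _, low, high = max((b - a, a, b) for a, b in zip(values, values[1:]))
    return (low + high) / 2


# Toy line: '#' marks a column containing black pixels, '.' an empty column.
row = ".." + "##" + "." + "#" + ".." + "#" + "." * 9 + "##" + "." + "#" + "..." + "#" + "." * 10 + "#" + "."
column_counts = [1 if ch == "#" else 0 for ch in row]

gaps = find_gaps(column_counts)
lengths = [length for _, length in gaps]   # [1, 2, 9, 1, 3, 10]
print(size_by_median(lengths), size_by_average(lengths), size_by_jump(lengths))  # 3 5.0 6.0
```

On this toy line the small spaces (1 to 3 columns) play the role of intra-word spaces and the large ones (9 and 10 columns) the role of inter-word spaces. Only the difference method returns a value that lies strictly between the two groups, which is consistent with the ranking of the three methods reported in Tables 6 to 9.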