Extraction and Separation of Words
                       From bilingual printed document
                                          Rabeb Ben Abdelbaki and Sofiene Haboubi
                                                       SIDOP Research Group
                                            Signal, Image and Information Technologies
                           National Engineering School of Tunis BP 37 Belvedere Tunis, TN-1002, Tunisia
                                                     benabdelbakira@yahoo.com
                                                    sofiene.haboubi@istmt.rnu.tn


Abstract—In this paper, we present our work on the extraction and separation of words from bilingual printed documents. The approach is based on the structuring element of the morphological dilation. We report results for Arabic, Latin and bilingual Arabic-Latin scripts, show the limitations of the approach and present possible improvements.

Keywords: Script; Arabic; Latin; Discrimination; Separation; mathematical morphology; Dilation; Structural element

                      I.     INTRODUCTION

    Character recognition systems, known as OCR (Optical Character Recognition), locate the characters forming a text, recognize them individually and then validate them by lexical recognition of the words that contain them. In other words, OCR is the process that turns a scanned paper document into digital text.

    Because a given OCR system handles a single script, an important problem appears when the document is no longer monolingual. If the document is multilingual, the OCR system loses its ability to read it, because the features it relies on depend on the structural properties of the characters, the style and the type of writing, which generally differ from one script to another. It is therefore imperative to identify the languages present in the document in order to redirect each part to the appropriate character recognizer.

    In practice, we cannot speak of discrimination between scripts without involving document segmentation. The segmentation of documents into words is an important step in the document recognition process, and it becomes crucial in the case of multilingual documents. It is the foundation of all the following steps and also increases the efficiency of a recognition system.

    In the local approach, which is based on words as the components to be studied, the segmentation of text into words takes place at the script-discrimination step; this approach requires prior knowledge of the individual words constituting the document. Word segmentation is also a necessary step for discriminating between printed and handwritten documents and for any system that automatically processes multilingual documents. This task is usually delicate and complex in the multilingual context, given the large differences between scripts in letter shapes, spacing, etc. Our work aims to develop a new method for the separation and extraction of words in a bilingual printed document. In the following, we first state the characteristics of Arabic and Latin scripts and then mention some related works. After that, we present our method for the separation and extraction of words from bilingual printed documents. Finally, we interpret the results of this work.

       II.   MORPHOLOGICAL CHARACTERISTICS OF ARABIC AND LATIN SCRIPTS

    Script is the graphical representation of a language through signs drawn on a support. Since its appearance around the third millennium BC, it has continued to evolve with the languages it represents. In this paper, we focus on the Latin and Arabic scripts.

A. Arabic script

    The Arabic script is a consonantal script composed of 28 letters, excluding the "hamza", which behaves either as a full letter or as a diacritic, and the symbol "~", which is written only on the support of the character "ا".

    An Arabic character can have up to four different forms depending on its position in the word or in the pseudo-word, as
                                           SIDOP’12 : 2nd Workshop on Signal and Document Processing
it changes its shape depending on its position: initial, medial, final or isolated.

    The Arabic script is semi-cursive. Letters are generally linked to each other, and Arabic words can be composed of one or more pseudo-words written from right to left, in both printed and handwritten form.

    Several Arabic letters share the same body and differ only in the number and location of diacritical marks. These diacritics can be above or below the baseline, in different places depending on the character, but never above and below simultaneously (Figure 1).

    In Arabic, 15 of the 28 letters of the alphabet, presented in Table 1, have diacritical points.

               Figure 1: Characters and common body

                Table 1: Letters with diacritical points

    The Arabic script has no capital letters, and Arabic characters include a loop that can have different forms.

        Figure 2: Different forms of loops [Touj et al., 04]

    In addition, Arabic script varies vertically and horizontally because of the presence of horizontal and vertical ligatures between characters of the same word.

    An Arabic word does not have a fixed length; it may include one or more pseudo-words called PAWs (Pieces of Arabic Word), each containing a different number of characters.

    The presence of pseudo-words in Arabic script increases the complexity of its segmentation. These PAWs mislead segmentation algorithms because they introduce large intra-word spaces of variable length, compared to the intra-word spaces in Latin.

        Figure 3: Examples of words containing different PAWs

B. Latin script

    The Latin script is bicameral: each character has two forms, one called lowercase, the other called uppercase or capital. In general, each grapheme possesses these two forms, with a few exceptions that vary from one script to another.

    The Latin alphabet has 26 basic letters. In uppercase form, letters change shape and size.

    Unlike the Arabic alphabet, the Latin alphabet consists of two types of graphemes, vowels and consonants: 5 vowels, presented in Table 2, and 21 consonants, listed in Table 3.

                Table 2: Latin's vowels

                Table 3: Latin's consonants

    The Latin alphabet is one of the richest alphabets in national variations because of its geographic and temporal spread. Each Latin script is based on the fundamental letters of the Latin alphabet, but it may have some specific letters, considered either as variants of the basic ones or as new letters. Table 4 shows the different forms of a Latin grapheme.

                Table 4: The different forms of a Latin grapheme


    As in the Arabic script, several variants of Latin characters have diacritical signs, such as points above the body of the character, accents (acute, grave, circumflex), tilde, etc. However, only two basic characters have diacritical points.

    The Latin script is written from left to right and is non-cursive in its printed form: its letters are isolated from one another, separated by intra-word spaces. The Latin alphabet also has many loops that can have different forms.

                     Table 5: Latin's loops

                     III.   RELATED WORKS

    Word segmentation is an important phase of the document recognition process. It is crucial in the case of multilingual documents, where it becomes necessary to segment the document and identify its words individually.

    This step is the foundation of all the following steps: the segmented words become the inputs of the other steps of the recognition process. Despite the diversity of segmentation approaches and methods, document segmentation, especially into words, is still an open field and an active line of research on which many researchers have focused. Our literature review led to a list of works in this area, of which we mention the most interesting.

  •    [Ma and Doermann, 03] used O'Gorman's Docstrum algorithm for the segmentation of bilingual documents, applying it to Arabic-English, Chinese-English, English-Hindi and Korean-English dictionaries. This algorithm is a bottom-up approach based on the calculation of the k nearest neighbors of each connected component of the document. After noise removal, the connected components are separated into two groups according to a factor selected from the proportion of character sizes. One group consists of the dominant characters and the other consists of the characters of titles and section headers. Then, for each connected component, they seek the k nearest neighbors; each pair of neighbors has an associated angle and distance. By grouping the components through these features, the geometric areas of the physical structures of the document can be determined. The proposed method is independent of changes in document orientation and of inter-word spacing. However, the value of k depends on the structure of the document.

  •    [Dhandra and al., 07] segmented bilingual documents containing one of India's regional languages (Hindi, Kannada and Tamil) and English numbers. Their method segments the text into lines; each line is then projected vertically and segmented into words based on the analysis of the valleys present in the vertical projection, which delimit the different words.

  •    [Chanda and al, 07] present a segmentation method for bilingual documents containing Thai and English words. Their method encodes the lines of text depending on the position of black pixels in each line. After segmenting the document into lines, the method goes through the histogram of each vertical line and produces a 0 if it encounters two black pixels or fewer, and a 1 otherwise. The chain produced is then analyzed: if there is a run of 0s with a minimum length of 2 * k1, its midpoint is considered as the boundary for word segmentation. The value of k1 is an estimate of the white gap between two consecutive characters of a document line.

  •    [Rezaee and al., 09] proposed a word segmentation method for bilingual documents containing English and Farsi scripts. Their method is based on directional projections of the image and on the analysis of attributes such as the gaps between words, with thresholding from the peak distribution, in order to segment the text lines into words.

  •    [Da Silva and al., 11] proposed a word segmentation method for Latin documents containing both handwritten and printed text. This method is based on the segmentation of the text into connected components and their extraction by cropping a bounding area. After extracting the components, the method merges near neighbors that lie in the same line and whose bounding boxes are separated by a distance less than a threshold "th" calculated by formula XX, where k is the number of frames and Li are the widths of all the frames of the image.

  •    [Haboubi and al., 11] proposed a segmentation method for bilingual documents containing Arabic and Latin script, based on the use of mathematical morphology to delimit the different words in the text. This method uses morphological dilation with a line structuring element. They apply sequential dilations, increasing the size of the structuring element each time, in order to determine a threshold that separates the spaces between words from the intra-word spaces. This threshold corresponds to the dilation order where the number of connected components has a zero standard deviation.

                     IV.   PROPOSED APPROACH

    The approach developed for the word segmentation of printed bilingual documents includes several steps (Figure 4). From a document image, we begin with a pre-processing step to prepare the scanned document for segmentation, and then we move to the detection and extraction of text lines. After that
we analyze each line separately and extract the different words present in the document.

 Figure 4: The process of word segmentation from a bilingual printed document

A. Pre-processing

    The document image is the result of an acquisition step using a scanner. Our approach does not put much emphasis on pre-processing, because the proposed system is meant to work with images preprocessed in advance by a dedicated document pre-processing system that handles tasks such as eliminating the noise sometimes introduced during scanning, skew correction, deleting diacritics, etc.

    However, our approach preserves the information present in the text: we work with document images containing diacritical signs, given their important role in the understanding of the text, although their presence may increase the complexity of the segmentation task because of problems encountered in the detection and extraction of lines. Indeed, in some writing styles, diacritics may exceed the upper or the lower limit of the line, which can cause errors in the line-detection step.

    In our case, the pre-processing step is limited to the binarization of the document image and the inversion of the black and white pixel values, in order to prepare the document for the line-detection step.

B. Lines detection

    This phase is rather difficult in the case of bilingual documents because of the large variability between Arabic and Latin printed scripts, and it becomes more complex with the presence of diacritical marks. The morphological study of Latin and Arabic scripts shows the presence of a significant number of diacritical signs.

    We have chosen the projection method to delimit the horizontal lines. This method meets our segmentation needs because we handle text documents with a simple structure. Our approach goes through the image horizontally and calculates the number of black pixels in each row of the matrix representing the image. Next, we analyze the projection histogram: if the number of black pixels changes from 0 to a positive value, then this position is the lower limit of a line. Each time, we keep the positions of the white areas, which are used for cropping the image into different lines.

C. Words detection and extraction

    Document segmentation can be carried out at different levels: characters, pseudo-words or words. The word level is the most difficult of these, given that the segmentation has to differentiate between different types of spaces (between characters, between pseudo-words and between words), which is not always obvious to a word extraction system.

    The objective of the proposed approach is to segment the document image in order to separate and extract the words of a bilingual printed document. Segmentation methods segment documents into connected components: characters in the case of a Latin printed document, or pseudo-words in the case of an Arabic printed document, because printed Latin writing is non-cursive and printed Arabic writing is semi-cursive.

    Our approach uses mathematical morphology to eliminate intra-word spacing and to build connected components formed by the different words in the bilingual printed document. We use morphological dilation to enlarge the image by filling the holes, which in our case correspond to the intra-word spaces.

    To achieve this goal, we must determine the best structuring element, able to join the different characters of a word without joining words together. At this level, two major problems appear. The first one is the size of the structuring element and the second one is its shape. The determination of these two characteristic features of the structuring element is the foundation of our work.

    Choosing only one size of the structuring element for each document cannot give a good segmentation, because the intra-word spaces differ from one font to another and depend on the size of the font. Similarly, the spaces between words depend on the text alignment, especially in the case of justified text. Moreover, a document can contain different fonts and sizes.

    The shape of the structuring element solves the problem of diacritics being extracted as separate words, which arises because our method does not eliminate these signs during pre-processing. A solution to this problem is to attach diacritics to their words. We therefore opted for shapes that have a height, and we searched for the height that solves the problem with minimal changes to the line.

    The approach proceeds line by line to find, each time, the size and shape of the structuring element of the dilation that can separate and extract the different words in a document line correctly and with minimal changes. To this end, we propose three methods to calculate the size of the structuring element and select four specific shapes for it.
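As an illustration of the pre-processing and line-detection steps of this section, the following minimal Python sketch binarizes and inverts a grayscale image and then delimits text lines with the horizontal projection. It is only a sketch under stated assumptions, not the original implementation: the function names, the fixed threshold of 128, the list-of-lists image representation and the reading of a 0-to-positive transition as the first row of a line are all choices made here for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink (text) pixel, 0 = background


def binarize_and_invert(gray: List[List[int]], threshold: int = 128) -> Image:
    """Map dark pixels (< threshold) to 1 and light pixels to 0.

    The threshold value is an assumption; the paper does not specify one.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def horizontal_projection(image: Image) -> List[int]:
    """Number of ink pixels in each row of the image."""
    return [sum(row) for row in image]


def detect_lines(image: Image) -> List[Tuple[int, int]]:
    """Return (first_row, end_row_exclusive) for every text line."""
    lines = []
    start = None
    for y, count in enumerate(horizontal_projection(image)):
        if count > 0 and start is None:
            start = y                      # 0 -> positive: a line begins
        elif count == 0 and start is not None:
            lines.append((start, y))       # positive -> 0: the line ends
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(image)))
    return lines


def crop_lines(image: Image) -> List[Image]:
    """Cut the image into one sub-image per detected text line."""
    return [image[top:bottom] for top, bottom in detect_lines(image)]


# Tiny example: two "lines" of ink separated by a white row.
gray = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0, 255, 255],
    [255, 255, 255, 255],
    [  0, 255, 255,   0],
    [255, 255, 255, 255],
]
binary = binarize_and_invert(gray)
print(detect_lines(binary))  # [(1, 3), (4, 5)]
```

Because diacritics are kept, a real page may produce thin extra "lines" when a diacritic is separated from its line by a white row; the sketch does not try to merge such cases.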
  Figure 5: Proposed methods for the document segmentation

    Our approach consists in testing all the combinations of the methods for calculating the size of the structuring element with its shapes. Each time, we fix the calculation method and vary the shape. We begin by applying a combination to Latin and Arabic printed documents. If we get good results, we continue testing on mixed documents. Otherwise, we consider it unnecessary to apply the combination to bilingual documents.

    • Size calculation of the structuring element

    To calculate the size of the structuring element, we start by determining the list of spaces in each line; we then developed three different methods to derive the size from this list.

         - Identification of document spaces

    Our approach proceeds by analyzing the extracted lines. This analysis is based on the vertical projection histogram, which is used to determine the values of the different spaces in the document. We scan each line column by column and calculate the number of black pixels present in each column of the line image.

    The next step is to analyze the vertical histogram obtained and to determine the positions and lengths of the inter-word and intra-word spaces: if the number of black pixels becomes zero after a sequence of non-zero values, this change corresponds to the presence of a space in the line. We store its position and calculate its length. The length of the space corresponds to the distance between the position where the space begins and the position of the first non-zero value subsequently encountered in the vertical projection histogram. At the end, we obtain a list of the space lengths present in the considered line and a list of their positions.

         - Methods for calculating the size of the structuring element

    After the detection of spaces in each line of the document, we determine the size of the structuring element according to the three proposed methods.

              o    Method based on median calculation

    This method eliminates the redundant space values present in the considered line and then sorts the new list of spaces in ascending order to permit the interpretation of these values.

    This list reflects the nature of the spaces contained in the processed image. It begins with the relatively small values, which represent the spaces between characters in a word, and reaches the largest gap present in the line. The method is based on the fact that the threshold space, able to join the characters of a word without joining the words together, has an intermediate value between the lower and upper bounds of the new list of spaces. The median value of this list is taken as the size of the structuring element of the dilation.

              o    Method based on the average calculation

    This second method is similar to the previous one in the determination of the distinct values of spaces. It is based on the fact that the threshold value of the structuring element of the dilation is proportional to the number of spaces present in the given image and to their lengths. This method sets the size of the structuring element to the average length of the different spaces in the given line.

              o    Method based on the calculation of the difference between the values of spaces

    This method is based on the detection of the largest jump in length between spaces. It works on the entire list of spaces in the line to be processed. From this list, it calculates the different lengths of jumps between spaces. Then, it goes through the new list to determine the greatest difference between spaces. The threshold size of the structuring element is necessarily located between the two spaces that generated the biggest jump. The difference method assigns to the structuring element the average of the two spaces associated with the largest jump.

    In summary, this method generates the list of jumps between the different space lengths, looks for the biggest jump, finds the two spaces related to this jump and calculates their average. The size of the structuring element corresponds to this average.

    • Determination of the structuring element's shape

    After calculating the size of the structuring element comes the choice of a suitable shape that allows the bilingual printed document to be segmented correctly.

    The structuring element can have several shapes, such as square, diamond, polygon, Euclidean disc, line, point pairs, rectangle, etc.
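The identification of spaces from the vertical projection and the three size calculations (median, average, largest jump) can be sketched in Python as follows. This is a minimal illustration under assumptions, not the original code: the function names and the list-of-lists image are hypothetical, the left and right margins of the line are ignored, the jump method sorts the space lengths before taking differences, and each line is assumed to contain at least one gap.

```python
from statistics import mean, median
from typing import List, Tuple

Image = List[List[int]]  # binary line image: 1 = ink pixel, 0 = background


def vertical_projection(line: Image) -> List[int]:
    """Number of ink pixels in each column of a text-line image."""
    return [sum(column) for column in zip(*line)]


def find_spaces(line: Image) -> Tuple[List[int], List[int]]:
    """Return (positions, lengths) of the gaps lying between ink columns."""
    positions, lengths = [], []
    start = None
    seen_ink = False
    for x, count in enumerate(vertical_projection(line)):
        if count > 0:
            if start is not None:          # first ink column after a gap
                positions.append(start)
                lengths.append(x - start)
                start = None
            seen_ink = True
        elif seen_ink and start is None:   # ink -> zero: a gap begins
            start = x
    return positions, lengths              # leading/trailing margins ignored


def size_by_median(lengths: List[int]) -> float:
    """Median of the distinct space lengths, sorted in ascending order."""
    return median(sorted(set(lengths)))


def size_by_average(lengths: List[int]) -> float:
    """Average of the distinct space lengths."""
    return mean(set(lengths))


def size_by_difference(lengths: List[int]) -> float:
    """Average of the two spaces that produce the largest jump.

    Assumption: jumps are taken between consecutive values of the sorted list.
    """
    ordered = sorted(lengths)
    if len(ordered) < 2:
        return float(ordered[0])
    jumps = [b - a for a, b in zip(ordered, ordered[1:])]
    i = jumps.index(max(jumps))
    return (ordered[i] + ordered[i + 1]) / 2


# One-row example: gaps of length 1 and 3 between ink columns.
line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
print(find_spaces(line))                  # ([2, 4], [1, 3])

# Space lengths of a hypothetical line: small intra-word gaps, one wide gap.
gaps = [1, 1, 2, 3, 10]
print(size_by_median(gaps))               # 2.5
print(size_by_average(gaps))              # 4
print(size_by_difference(gaps))           # 6.5
```

On this hypothetical list the three methods give clearly different sizes: the median stays close to the intra-word gaps, while the jump method lands between the widest intra-word gap and the inter-word gap.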
    In this approach, we focused on four specific shapes of the structuring element: the first is the diamond, the second is the square, and the third and fourth are variants of the rectangle with different input parameters.

    We used these shapes because of the presence of diacritical signs in the documents to be segmented. They have in common a height proportional to the size of the structuring element, an important feature in our approach for attaching the diacritical marks to their words.

  Table 8: Results of the rectangle shape with height 2 times the size

  Shape of the structural element    Method of calculation    Script    Good extraction rate
  Rectangle with height 2 times      Median                   Arabic    61.10%
  the size                                                    Latin     67.25%
                                     Average                  Arabic    88.07%
                                                              Latin     93.29%
                                     Difference (jump)        Arabic    94.13%
                                                              Latin     95.84%

         - The diamond shape
    For each line, the diameter of the diamond is equal to the
size of the structuring element determined by one of the three
calculation methods proposed later.                                                     The rectangle shape with height 3 times the size
    The following table shows the results found by combining               For each line, the width of the rectangle is equal to the size
the diamond shape with the three methods for calculating the           of the structuring element determined by one of the three
structural element.                                                    calculation methods proposed later and height equal to 3 times
                                                                       this value.
                Table 6: Results of the diamond shape
                                                                          The following table shows the results found by combining
 Shape of the
  structural           Method of           Script    Good extraction   the shape rectangle of height three times the size calculated
   element             calculation                       rates         with the three methods of calculating the structural element.
                                           Arabic        26,23%
                        Median             Latin         59,87%
                                                                       Table 9: Results of the rectangle shape with height 3 times the
   Diamond                                 Arabic        49,72%                                      size
                        Average            Latin         82,82%         Shape of the
                                           Arabic        63,85%          structural            Method of        Script      Good extraction
                    Difference (jump)      Latin         91 ,81%                               calculation                      rates
                                                                          element
                                                                                                                Arabic         72,11%
                The square shape                                                               Median          Latin          72,35%
                                                                        Rectangle with                          Arabic         92,11%
    For each line, the length of the square is equal to the size of     height 3 times          Average         Latin          95 ,03%
structuring element determined by one of the three calculation             the size                             Arabic         94,86%
methods proposed later.                                                                    Difference (jump)    Latin          97,05%

    The following table shows the results found by combining
the square shape with the three methods of calculating the
structural element.                                                                            V.     INTERPRETATIONS
                 Table 7: Results of the square shape                     The following table represents the best rates achieved for
                                                                       each form of the structuring element.
 Shape of the
  structural           Method of           Script    Good extraction    Table 10: Best rates achieved for each form of the structuring
                       calculation                       rates
   element                                                                                         element
                                           Arabic        28,62%
                        Median             Latin         53,69%            Shape of the
                                                                        structural element        Method of      Script          Good
    Square                                 Arabic        62,57%
                        Average                                                                   calculation                  extraction
                                           Latin         85,23%                                                                  rates
                                           Arabic        71,93%                                   Difference     Arabic         63,85%
                    Difference (jump)      Latin         93,56%              Diamond               (jump)        Latin          91 ,81%
                                                                                                  Difference     Arabic         71,93%
                                                                              Square               (jump)        Latin          93,56%
                Rectangle shape with height 2 times the size          Rectangle with height      Difference     Arabic         94,13%
                                                                         2 times the size          (jump)        Latin          95,84%
    For each line, the width of the rectangle is equal to the size
of the structuring element determined by one of the three              Rectangle with height      Difference     Arabic         94,86%
                                                                         3 times the size          (jump)        Latin          97,05%
calculation methods proposed later and height is equal to two
times this value.
    The following table shows the results found by combining               We note that the best good extraction rates are obtained for
the shape rectangle of height 2 times the size calculated with         94.86% and 97.05% Arabic to Latin. These rates are achieved
the three methods of calculating the structural element.               by the combination of the method of calculating the structuring
                                                                       element’s size based on the difference between the spaces
                                                                       values and the rectangle shape with a height equal to 3 times
                                                                       the size of the structuring element.
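
    The role of the height of the structuring element can be illustrated
with a small Python sketch. It is not the implementation used in our
experiments: the helper names (`rectangle`, `diamond`, `dilate`,
`count_components`), the centred origin of the structuring element, the use
of 8-connectivity to count the blobs, the size value 3 and the toy line are
all assumptions made for the example. The toy line contains two words of two
letters each and a diacritical dot placed three rows above the first word.

```python
def rectangle(width, height):
    """Full rectangle of ones (a square when height == width)."""
    return [[1] * width for _ in range(height)]

def diamond(diameter):
    """Diamond inscribed in a diameter x diameter box (city-block distance)."""
    r = (diameter - 1) / 2
    return [[1 if abs(y - r) + abs(x - r) <= r + 0.5 else 0
             for x in range(diameter)] for y in range(diameter)]

def dilate(img, se):
    """Binary dilation: stamp the structuring element on every black pixel.
    The origin is the element's centre (rounded down for even dimensions)."""
    h, w = len(img), len(img[0])
    sh, sw = len(se), len(se[0])
    oy, ox = sh // 2, sw // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            for dy in range(sh):
                for dx in range(sw):
                    yy, xx = y + dy - oy, x + dx - ox
                    if se[dy][dx] and 0 <= yy < h and 0 <= xx < w:
                        out[yy][xx] = 1
    return out

def count_components(img):
    """Count 8-connected blobs; after dilation each blob is a candidate word."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    total = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                total += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    py, px = stack.pop()
                    for ny in range(py - 1, py + 2):
                        for nx in range(px - 1, px + 2):
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return total

# Toy line (made up): a diacritical dot, three empty rows, then two words.
dot   = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
blank = [0] * 14
body  = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1]
line = [dot, blank, blank, blank, body, body]

size = 3  # hypothetical structuring element size for this line
print(count_components(dilate(line, rectangle(size, size))))      # 3
print(count_components(dilate(line, rectangle(size, 2 * size))))  # 2
```

    With the square, the letters of each word are merged but the dot stays
a separate component, so three blobs are found instead of two. With the
rectangle of height 2 times the size, the dot is attached to its word and
the two words are extracted, which is consistent with the higher rates
reported for the rectangle shapes.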
    Applying this combination to the sample of printed bilingual documents
gave a good extraction rate of 94.85%. This result is explained by two
factors: the method of calculating the size of the structuring element
adapts to the changes in the lengths of spaces between lines and between
documents, and the height, proportional to the size, accommodates the
different distances between the diacritics and their words.
    Figure 6 shows a sample run on a line from a printed bilingual document
with diacritics.

   Figure 6: Example of word segmentation from a bilingual printed document

    The word segmentation of this line of the printed bilingual document
gave 14 words, which corresponds exactly to the words found in the input
line.

             VI.   CONCLUSION AND PERSPECTIVES

    The separation and extraction of words in a printed bilingual document
constituted the main contribution of our [...] recognition's area, its
different stages, and the various available methods for segmenting
documents into words.

    After studying the Arabic and Latin scripts, we proceeded to the
implementation of our approach. We developed different methods for
calculating the size of the structuring element of the morphological
dilation, combined them with different shapes, and tested them on samples
of printed Arabic and Latin documents. We then compared the results, and
the best performing combination was chosen for testing on printed bilingual
documents, the subject of our study.

    Although the result obtained by this word separation method is
compelling, it has some limitations:

   • The sample size is limited to 945 words for the printed Latin
documents, 545 words for the printed Arabic documents and 564 words for
the printed bilingual documents;
   • This approach only deals with printed documents;
   • This approach is limited to bilingual Arabic and Latin documents.

    As future work, we plan to enlarge the sample size to better test the
performance of the proposed method. We also plan to extend our method to
handwritten bilingual documents, to mixed bilingual documents (both
handwritten and printed forms in the same document), to bilingual documents
of any kind, and even to multilingual documents.