<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A survey on text-line segmentation process in historical Arab manuscripts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soumia Djaghbellou</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Attia Abdelouahab</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bouziane Abderraouf</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of computer science, university of Bordj Bou Arreridj</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MSE Laboratory, Department of computer science, university of Bordj Bou Arreridj</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The segmentation process entails dividing or decomposing the entire document image into segments or lines. This technique serves as a fundamental step in developing any writing or optical character recognition system. However, numerous existing segmentation schemes encounter challenges when dealing with specific script styles, such as the historical Arabic writing found in ancient manuscripts, which possesses unique characteristics. These characteristics include inclined text lines, overlapping letters, diacritic marks, decorative elements, variable letter forms, and ligatures (combinations of two or more letters merged to form a single connected shape). Thus, in this paper we present a thorough survey of the field. The survey is composed of two parts. The first part provides a concise overview of historical Arabic documents. The second, which serves as the primary part, focuses on the crucial step of handwritten document recognition, specifically segmentation. A detailed and systematic overview of the various approaches to segmentation, including different levels, employed for extracting handwritten Arabic text-lines, is outlined. Subsequently, a literature study is conducted to review and analyze proposed works in this area.</p>
      </abstract>
      <kwd-group>
        <kwd>Text-lines</kwd>
        <kwd>segmentation</kwd>
        <kwd>pattern recognition</kwd>
        <kwd>Arabic handwritten</kwd>
        <kwd>historical Arabic documents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Historical documents often symbolize the identity of diverse civilizations worldwide. Analyzing and comprehending their contents holds paramount importance, especially for researchers. Manually extracting data from these historical documents proves to be a laborious and expensive endeavor. In recent years, there has been a surge in research dedicated to the automated processing of historical documents. Despite notable advancements, automating the processing and analysis of historical Arabic documents remains a challenging task. Text line segmentation stands as an initial stage in the text recognition system process. This critical preprocessing step in document analysis poses particular challenges, especially with handwritten texts. While segmenting text lines from machine-printed documents is commonly considered resolved, freestyle handwritten text lines remain notably challenging. This complexity arises due to curved lines, inconsistent spacing, and overlapping spatial boundaries. Additionally, irregular layouts, varied character sizes reflecting different writing styles, intersecting lines, and the absence of a clear baseline all contribute to the intricacy and difficulty in handwritten document analysis [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>This paper focuses on the complex problem of the text-line segmentation process and provides a comprehensive survey of existing research works. Firstly, it presents the historical Arabic manuscripts in general, including their main features and the structure and types of documents. Secondly, in the main section of the paper, the focus shifts to the segmentation process as a critical phase in recognition, specifically emphasizing the commonly utilized and adopted techniques for Arabic scripts. The remainder of this paper is structured as follows: Section 2 offers a comprehensive overview of ancient Arabic manuscripts, encompassing their content structure and various applications. Moving on to Section 3, we delve into the image segmentation process, exploring its different levels and focusing on widely adopted approaches specifically designed for handwritten Arabic texts. In Section 4, we concentrate on a comparative study, presenting notable existing works related to the segmentation of handwritten Arabic documents. This analysis considers the method of experimental analysis and the dataset used. Section 5 is dedicated to showcasing a compilation of famous Arabic databases that have been utilized in various studies. Section 6 addresses open issues, motivations, and potential directions for future research. Finally, Section 7 concludes the paper, summarizing the findings and insights presented throughout the article.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Structures and applications of the historical Arabic documents</title>
      <sec id="sec-2-1">
        <p>Throughout history, manuscripts were the official way of writing down knowledge and science. There is a huge number of historical Arabic manuscripts in archives and national libraries around the world, which have been scanning their collections to make them publicly available and to preserve this valuable cultural heritage. Arabic manuscripts, like manuscripts in other languages, have some common characteristics.</p>
        <p>Manuscripts have their specificities and various distinct elements that can be useful to identify them when creating an electronic format of description. Some of the elements we distinguish are:</p>
        <list list-type="bullet">
          <list-item><p>Mention of the responsible party.</p></list-item>
          <list-item><p>Names of owners, which are also an important clue for researchers wishing to follow the historical development of the manuscript.</p></list-item>
          <list-item><p>The title: the main identifying element, which in most cases is presented on the title page.</p></list-item>
          <list-item><p>Physical description/codicology.</p></list-item>
        </list>
        <p>Handwritten documents, irrespective of the language, are typically categorized based on their physical appearance into four classes, as outlined in [<xref ref-type="bibr" rid="ref2">2</xref>]:</p>
        <list list-type="bullet">
          <list-item><p>Mono-oriented documents: lines in this class are oriented in one direction. Figure 1a shows a handwritten Arabic document with a horizontal orientation.</p></list-item>
          <list-item><p>Multi-oriented documents: lines in these documents are arranged in blocks of different orientations. Figure 1b gives an example of this class of documents.</p></list-item>
          <list-item><p>Multi-script documents: these comprise texts authored by multiple individuals, resulting in various scripts. This occurrence was frequent in the past, when individuals succeeded one another to finalize a document or collaborated on the same written piece. Figure 1c depicts a multi-script handwritten document (Arabic and Latin).</p></list-item>
          <list-item><p>Heterogeneous documents: this category encompasses content that includes both textual information and images or illustrations. Handwritten documents of this nature might feature diverse orientations, such as dimension lines or illustrative drawings, as illustrated in Figure 1d.</p></list-item>
        </list>
        <p>Figure 1: Examples of the four categories a,b,d [<xref ref-type="bibr" rid="ref3">3</xref>], d [<xref ref-type="bibr" rid="ref4">4</xref>] of handwritten documents.</p>
        <p>Currently, numerous major libraries globally are digitizing handwritten historical documents. These scanned images are uploaded onto their websites, complemented by metadata facilitating specific document searches within extensive databases. Accessing document content is not feasible without its digital presence in a textual format. Therefore, to harness these documents, diverse methods and applications are employed. In this paper, we succinctly outline three applications, depicted in Figure 2 [<xref ref-type="bibr" rid="ref5">5</xref>]. In dealing with aged document images, segmentation poses a significant challenge. However, delineating distinct blocks within their physical structure simplifies this task. By focusing on block forms and their spatial relationships, we ascertain:</p>
        <list list-type="bullet">
          <list-item><p>Baseline (connecting the lower part of character bodies)</p></list-item>
          <list-item><p>Median line (tracing the upper part of character bodies)</p></list-item>
          <list-item><p>Upper line (linking the top of ascenders)</p></list-item>
          <list-item><p>Lower line (uniting the bottom of descenders).</p></list-item>
        </list>
        <p>Effectively interpreting this manuscript type demands a robust segmentation process backed by efficient methods and techniques. This paper aims to extensively discuss the pivotal phase of segmenting textual documents, concentrating on the methodologies that have proven most effective, particularly for handwritten Arabic script.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Segmentation phase and its methods</title>
        <p>Segmentation of documents into text lines, also known as text line extraction, stands as a fundamental step in document content recognition. It typically serves as a preprocessing phase, as illustrated in Figure 4. However, segmenting ancient and historical handwritten Arabic documents into text lines poses considerable challenges, especially when dealing with documents of poor quality. This complexity hampers content extraction due to the diverse nature of Arabic writing. Characters and words present varying shapes, contributing to an extensive vocabulary. Moreover, these documents frequently feature additional disruptive elements like stains, ornamentation, seals, and holes [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
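        <p>As a minimal illustration of what text line extraction involves, the sketch below locates candidate text lines in a small binarized page using a horizontal projection profile, one of the techniques surveyed in this paper. It is a sketch under stated assumptions, not a method taken from any of the cited works: the page is assumed to be already binarized (1 for ink, 0 for background), the lines are assumed to be straight and separated by blank rows, and the names <monospace>horizontal_projection</monospace>, <monospace>extract_line_bands</monospace>, and <monospace>min_ink</monospace> are hypothetical. Skewed, touching, or overlapping lines, which are common in historical Arabic manuscripts, would defeat this simple scheme.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

# Assumption: a binarized page stored as rows of 0/1 values (1 = ink, 0 = background).
Image = List[List[int]]


def horizontal_projection(image: Image) -> List[int]:
    """Count the ink pixels in every row (one value per y coordinate)."""
    return [sum(row) for row in image]


def extract_line_bands(image: Image, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (top, bottom) row indices of each candidate text line.

    A row is treated as part of a line when its ink count reaches min_ink
    (a hypothetical threshold); consecutive such rows form one band.
    """
    bands: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(horizontal_projection(image)):
        if ink >= min_ink and start is None:
            start = y  # a new band begins on this row
        elif ink < min_ink and start is not None:
            bands.append((start, y - 1))  # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(image) - 1))  # band running to the last row
    return bands


# Toy page with two text lines separated by two blank rows (made-up data).
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0],
]

print(horizontal_projection(page))
print(extract_line_bands(page))
```
]]></preformat>
        <p>Each returned band marks the rows occupied by one candidate line. Raising <monospace>min_ink</monospace> makes the split less sensitive to isolated specks such as stray diacritics or stains, at the cost of possibly cutting through faint strokes.</p>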
        <p>The objective of segmentation is to simplify and alter the representation of an image into something more meaningful and easier to analyze and recognize, such as lines, sub-words, words, or characters. In general, there exist four levels of segmentation, as illustrated in Figure 5:</p>
        <p>Page segmentation: This initial step involves identifying information areas on each page based on their visual attributes. It often includes logically labeling these areas according to the content they represent, such as text, graphics, or images. A comprehensive analysis of the techniques employed in document analysis has been presented in prior studies [<xref ref-type="bibr" rid="ref6">6</xref>] [7] [8].</p>
        <p>Text segmentation into lines: This stage focuses on separating text lines to extract individual words and, subsequently, the characters within those words. Numerous studies in this domain employ image decomposition into connected components [9].</p>
        <p>Line segmentation into words: This phase uses histograms of the vertical projections of lines to detect the spaces between words for separation. However, this method might not be as effective when dealing with Arabic script.</p>
        <p>Word segmentation into characters: This process involves breaking down words into their constituent individual symbols. It is a pivotal decision-making step in optical character recognition systems, determining the accuracy of isolated patterns within an image [10].</p>
        <sec id="sec-2-2-1">
          <title>3.1. The adopted methods for Arabic text-lines segmentation</title>
          <p>In the context of Arabic manuscripts, localizing text lines for extraction or segmentation is challenging due to the morphological peculiarities associated with the Arabic script. The script is naturally cursive, unconstrained, and horizontally oriented, which adds to the difficulties, especially when dealing with historical documents. Evaluation of Arabic handwritten text line extraction algorithms has either been lacking or has shown higher error rates compared to algorithms used for other languages. This is because, as previously mentioned, Arabic script exhibits greater cursive characteristics compared to other scripts [11].</p>
          <p>To streamline and simplify the handling of such documents, extensive research has been directed toward image binarization. Image binarization, illustrated in Figure 6, involves dividing the image into two categories: the background and the object (text lines). Segmenting a color image can be exceptionally resource-intensive, often necessitating consideration of multiple factors (e.g., histogram, color, etc.) to determine a point's class or type. Various techniques have been devised for image binarization, including global threshold, local threshold, Markov random field model [12], [13], water flow model [14], and Gatos et al.'s method [15], [16].</p>
          <p>Figure 6: Original colored document and binarized document.</p>
          <p>In general, methods and approaches for segmenting handwritten text lines can be categorized based on their operational mode or the strategies they employ. These strategies cover projection-based methods, smearing methods, Hough-based approaches, and clustering or grouping methods.</p>
          <p>Projection-based approach: Within this approach, we differentiate between two profiles: the vertical projection and the horizontal projection. These profiles are obtained by summing the pixel values along the vertical and horizontal axes, respectively, that is, for each x value and each y value [15]. In our text line segmentation study, we specifically focus on the horizontal projection. This projection is applied to different lines in order to obtain an initial position for the separated lines and their corresponding baselines. In Figure 7, we present the reference lines and interfering lines, while Figure 8 provides an illustrative example of an Arabic text and its horizontal projection.</p>
          <p>Smearing methods: These methods spread black pixels horizontally while measuring the white gaps between them. If a gap is within a predetermined threshold, it is filled with black pixels. This process leads to the formation of interconnected black pixel shapes around the text lines [18], as depicted in Figure 9. However, this method poses a challenge of line overlap: during smearing, two separate lines might merge, causing two lines to be segmented as one [19].</p>
          <p>Figure 9: An example demonstrating the application of the smearing method on Arabic text [18].</p>
          <p>Hough transform: This is a frequency-based technique employed to detect and pinpoint straight lines within text document images. Employing the Hough transform on the centroid allows us to determine the orientations of these lines [20]. Handwritten document images often contain annotations, erasures, and lines oriented in various directions alongside the primary lines. Consequently, this method depends on contextual cues, like direction continuity and proximity criteria, to eliminate erroneous alignments among the components.</p>
          <p>Grouping approach: This method involves aggregating and combining various units (such as blocks, pixels, connected components) in a bottom-up fashion to create alignments using distinct perceptual criteria such as proximity, continuity, and similarity. It also employs geometric feature details, including the size, position, shape, and orientation of the connected components, thereby grouping them into rows [20].</p>
          <p>Most studies on segmenting handwritten Arabic text into lines rely on schemes that identify overlapping, touching, or connected components, such as the grouping approach. However, projection-based methods remain effective primarily in cases where handwritten documents exhibit minimal overlap, and there is no white-space between lines, a common feature in many Arabic manuscripts. Conversely, alternative methods (refer to Figure 10) are guided by criteria aimed at preventing the intersection of black pixels, as proposed and implemented in [21] for segmenting both modern and historical Chinese documents, often characterized by overlapping lines.</p>
          <p>Nevertheless, this method proves efficient only when fewer black pixels are present at the contact points, potentially failing when these points contain a significant amount of such pixels.</p>
          <p>For detecting connected components between lines, several rules and criteria need consideration. These include identifying the label of the component (touching/overlapping), managing ambiguous component sizes, assessing the density of black pixels within each alignment region, considering alignment proximity, and evaluating contextual information (such as the positions of alignments surrounding the component) [22]. Separating these components involves analyzing the vertical projection profile of the component to determine the location of the horizontal frontier segment that will be used to separate the touching elements. If the projection profile exhibits two peaks, the separation occurs midway between them; otherwise, the component is divided into two equal parts [22].</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4. Literature on Handwritten Arabic Documents Segmentation works</title>
        <p>In this section, we showcase prominent existing research in the realm of segmenting handwritten Arabic documents, along with the experimental analysis methods and datasets employed. Until recently, there has been a scarcity of studies focusing on segmenting and recognizing handwritten Arabic text. Owing to the script's distinct characteristics, conventional methods often fall short in terms of efficacy. The review encompasses publications from the last 12 years, evaluating a total of 10 articles from journals and conferences. These articles primarily centered
on text segmentation within their core methodologies. A summary of these ten articles is provided in Table 1. For a better understanding of the contents of Table 1, let us examine each column separately. The first column in the table displays the author and the publication year of the article. Next, a concise description of the segmentation technique outlined in each respective study is provided. Afterward, details regarding the database utilized to test the proposed model are presented. Lastly, the final column showcases the evaluation results of each experiment. It is important to highlight that the performance of the different proposed methods was assessed by calculating various metrics, as detailed below:</p>
        <p>Accuracy = (TP + TN) / (TP + TN + FP + FN)</p>
        <p>Precision = TP / (TP + FP)</p>
        <p>Recall = TP / (TP + FN)</p>
        <p>F1-Score = 2 * ((Precision * Recall) / (Precision + Recall))</p>
        <p>Where TP and TN (True Positives and True Negatives) indicate the correct predictions for the positive and negative classes, respectively. FP and FN (False Positives and False Negatives) indicate the incorrect predictions: FP counts the incorrect predictions of the positive class, and FN counts the incorrect predictions of the negative class.</p>
        <p>Boussellaa, Wafa, et al. (2010) [23] introduced a segmentation method centered on block covering analysis employing an unsupervised method. Initially, they calculated the optimal document decomposition into vertical strips to achieve fuzzy baseline detection using the fuzzy C-means algorithm. Afterward, blocks were assigned to the respective lines in Arabic historical handwritten documents containing various scripts, including characters that overlapped or were multi-touching. The algorithm presented in their study demonstrated strong performance, achieving an accuracy rate of 95%.</p>
        <p>Kumar, Jayant, et al. (2010) [24] introduced a method for extracting handwritten text lines from monochromatic Arabic document images. Their approach relied on a unique graph framework involving two key steps. Initially, the scheme estimates local orientation at each primary component, constructing a sparse similarity graph. Subsequently, it employs a shortest path algorithm to assess similarities between non-adjacent components. The model underwent testing on a dataset comprising 125 images, resulting in a final accuracy of 96%.</p>
        <p>Kumar, Jayant, et al. (2011) [25] developed a handwritten text-line extraction model utilizing a graph-based technique to detect touching and proximity errors, with a refinement step using Expectation-Maximization (EM) to iteratively split the error segments and obtain correct text lines. On an experimental dataset of 125 Arabic document images, the approach achieved a very high score of 98.76%.</p>
        <p>Khayyat, Muna, et al. (2012) [26] introduced a technique for extracting handwritten text lines. Their method relies on morphological dilation with a dynamically adaptive mask. They employed a smearing technique to generate large connected components, or blobs, which were subsequently analyzed for applying appropriate smearing to the document. This approach underwent testing using the Arabic dataset CENPARMI, encompassing multi-skewed and touching lines. The experimental outcomes demonstrated the efficacy of the proposed algorithm, achieving a precision rate of 96.3%.</p>
        <p>Al-Dmour and Fares Fraij (2014) [27] introduced a text-line segmentation model that relies on the established horizontal projection profile (HPP) method. Initially, the approach involves generating a histogram of black pixels along the preprocessed image's horizontal scan lines. Subsequently, the self-similarity is improved through autocorrelation. The implemented system demonstrated highly encouraging outcomes, achieving an extraction accuracy rate of 84.8%.</p>
        <p>Suresha, M., and Amani Ali Ahmed Ali (2018) [28] developed a segmentation process utilizing the Hough transform method. This technique was followed by a skeletonization operation in the post-processing phase, aimed at rectifying potential false alarms. The ultimate objective was to proficiently segment vertically connected characters. The effectiveness of the proposed system was demonstrated through experimentation on two datasets: IFN/ENIT and the AHDB Arabic Handwriting Database. Their results showcased accuracies of 97.4% and 98.9%, respectively.</p>
        <p>Neche, Chemseddine, et al. (2019) [29] introduced a text-line segmentation approach employing a deep learning architecture. Specifically, they employed an RU-Net enabling pixel-wise classification to differentiate text-line pixels from the background. The experimental assessment was conducted on the KHATT standard Arabic benchmark, and the results obtained validate the successful segmentation process, achieving a rate of 96.7%.</p>
        <p>Gader, Takwa, et al. (2020) [15] developed a system for extracting text lines from images containing unconstrained handwritten Arabic texts sourced from the public Arabic dataset, BADAM. The approach relies on a deep neural network named AR2U-Net, incorporating a Recurrent Residual convolutional neural network in conjunction with the U-Net model and an Attention mechanism. The model demonstrated its performance, achieving a precision rate of 93.2%.</p>
        <p>Mechi, Olfa, et al. (2021) [30] introduced a hybrid method that merges a U-Net deep network with traditional document image analysis techniques, including connected component analysis and modified RLSA, to localize text lines in various contemporary datasets of handwritten Arabic documents, both public and private.</p>
        <p>The outcomes reported by Mechi et al. [30] demonstrated the efficacy of their approach, achieving a high precision rate of approximately 90%.</p>
        <p>Meziani, Fariza, et al. (2021) [31] implemented their proposed technique on document images sourced from the standard KHATT database, which exhibited various inclinations, overlapping, and intersecting lines. Prior to employing the segmentation method, they executed a sequence of preprocessing operations. These steps encompassed the conversion of grayscale images into binary ones, employing the Hough transform for skew detection, and rectifying inclinations to improve image quality. To execute the segmentation, they combined three methodologies reliant on the horizontal projection profile (HPP), connected components (CC), and skeleton analysis. Their outcomes showed promise, achieving notable metrics such as an F-measure of 85%.</p>
        <p>Gader, Takwa Ben Aïcha, and Afef Kacem Echi (2022) [32] proposed an effective technique for accurately segmenting overlapping and touching handwritten Arabic text lines. Their approach relies on a modified U-Net called AR2U-net, which is a deep learning-based method trained on the LTP (Local Touching Patches) database. This model performs pixel-wise classification to segment touching characters. Additionally, they introduced a post-processing step to segment consecutive touching text lines, resulting in an accuracy of 94.6%.</p>
        <p>Abdo, Hakim A., et al. (2022) [33], in their analysis of Arabic text documents, introduced a comprehensive four-step methodology: preprocessing, text line segmentation, word segmentation, and character segmentation. The technique leverages horizontal projection methods to detect and extract text lines. In the word segmentation phase, space thresholds are computed to differentiate within-word and between-word spaces, effectively isolating individual words. Following this, a thinning method is employed to identify ligatures and characters. The proposed methodology was tested on a dataset of 115 text images, inclusive of samples from the King Fahd University of Petroleum and Minerals (KFUPM) handwritten Arabic text (KHATT) database, along with additional images generated by the researchers. The experimental outcomes yielded success rates of 98.6% for line segmentation, 96% for word segmentation, and 87.1% for character segmentation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Datasets</title>
      <sec id="sec-3-1">
        <p>Most of the experimental studies and research in the realm of automatic segmentation and offline Arabic handwriting recognition rely on diverse Arabic databases. These databases encompass collections of images featuring various content types such as characters, words, numbers, and complete texts. For instance, the AI-ISRA database [14] incorporates Arabic sentences, words, digits, and signatures from 500 individuals. In contrast, the AHDB database [34] encompasses 10,000 words authored by 100 writers, while the IFN/ENIT dataset [18] includes Arabic words and Tunisian town names. The CENPARMI database [35] comprises 3,000 digits (legal and courtesy amounts, and numerals). The IFHCDB database [36] focuses on isolated offline handwritten Farsi/Arabic numbers and characters, showcasing grayscale images of 52,380 characters and 17,740 numerals. AHD-Base [37] comprises 60,000 training digits and 10,000 testing digits written by 700 individuals of diverse ages and educational backgrounds. Meanwhile, the AHD/AMSH database [38] features 12,300 Arabic handwritten words produced by 82 writers. Additionally, the Alamri Database [<xref ref-type="bibr" rid="ref6">6</xref>] is a collection of 46,800 digits, 13,439 numerical strings, 21,426 letters, 11,375 words, and 1,640 special symbols, written by 328 contributors. The APTI (Arabic Printed Text Images) [39] dataset is artificially created using a lexicon consisting of 113,284 words, employing 10 Arabic fonts with various sizes and styles. This database encompasses 45,313,600 individual word images, amounting to over 250 million characters. Conversely, the SUST-ALT database [40] comprises numerals, letters, and Arabic names. The KHATT database [41] comprises 1000 forms and 2000 paragraphs authored by 1000 writers. The HACDB dataset [16] provides a collection of Arabic character images designed to encompass various shapes, including overlapping characters. It encompasses 6600 character shapes created by 50 writers. The AIA9K [42] is a database of the Arabic alphabet, featuring 8737 letters distributed across 28 classes, while the AHCD [8] consists of 16800 isolated Arabic characters. The KU-database [7] is composed of words extracted from renowned Arabic proverbs, encompassing a total of 3024 word images, 14616 PAWs (Part of Arabic Words), and 30744 characters. The recently introduced DBAHD [43] is a proposed database focusing on Arabic handwritten diacritics, covering various forms of diacritical marks. It comprises 500 diacritics distributed across 5 folders, including 100 examples of single-point, double-point, triple-point, Hamza, and madda.</p>
        <p>A recent and noteworthy addition is the HAMCDB [44], presented as the inaugural database of handwritten Maghrebi characters, featuring 1560 images. Table 2 provides an overview and summary of these datasets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Open Issues, Motivation and Future Research Directions</title>
      <sec id="sec-5-1">
        <title>The ancient manuscript heritage represents a big part of the individual and collective memory of the country; it</title>
        <p>Author and Year
(Boussellaa, Wafa, et
al., 2010)[23]
(Kumar, Jayant, et al.,
2010)[24]
(Kumar, Jayant, et
al.,2011)[25]
(Khayyat, Muna, et
al., 2012)[26]
(Al-Dmour, Ayman,
and Fares Fraij.,
2014)[27]
(Suresha, M., and</p>
        <sec id="sec-5-1-1">
          <title>Amani Ali Ahmed</title>
          <p>Ali., 2018)[28]
(Neche,
Chemseddine, et al., 2019)[29]
(Gader, Takwa, et al.,
2020)[15]
(Mechi, Olfa, et al.,
2021)[30]
(Meziani, Fariza, et
al., 2021)[31]
(Gader, Takwa Ben</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Aïcha, and Afef</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>Kacem Echi, 2022) [32] (Abdo, Hakim A., et al. 2022) [33]</title>
          <p>The segmentation technique
The experiment dataset
Evaluation Metrics</p>
        </sec>
        <sec id="sec-5-1-4">
          <title>Block covering analysis using unsupervised technique</title>
        </sec>
        <sec id="sec-5-1-5">
          <title>A graph-based approach (by building a</title>
          <p>sparse similarity graph and the use of
a shortest path algorithm to compute
similarities between non-neighboring
components.</p>
        </sec>
        <sec id="sec-5-1-6">
          <title>A graph-based technique to detect</title>
          <p>touching and proximity errors + a
refinement operation using Expectation</p>
        </sec>
        <sec id="sec-5-1-7">
          <title>Maximization (EM)</title>
        </sec>
        <sec id="sec-5-1-8">
          <title>A smearing technique based on morphological dilation with a dynamic adaptive mask the horizontal projection profile (HPP)</title>
        </sec>
        <sec id="sec-5-1-9">
          <title>Hough transform approach preceded</title>
          <p>by a novel method based on
skeletonization in the post-processing stage</p>
        </sec>
        <sec id="sec-5-1-10">
          <title>A deep learning Architecture (RU-net).</title>
        </sec>
        <sec id="sec-5-1-11">
          <title>A deep neural network called AR2U</title>
        </sec>
        <sec id="sec-5-1-12">
          <title>Net based on the U-Net model</title>
        </sec>
        <sec id="sec-5-1-13">
          <title>Hybrid method (a deep network (U-Net architecture) with classical image analysis techniques).</title>
        </sec>
        <sec id="sec-5-1-14">
          <title>Combination of: Horizontal projection</title>
          <p>profile (HPP), on connected
components (CC) and on skeleton.</p>
        </sec>
        <sec id="sec-5-1-15">
          <title>Deep learning-based method based on a modified U-Net named AR2U-net (Attention-based Recurrent Residual Unet model).</title>
          <p>The horizontal projection technique
is utilized for detecting text lines,
calculating a threshold to determine
the spacing between isolated words,
and subsequently applying a thinning
method to detect characters.
100 old handwritten
document images from the
National Library of Tunisia
125 Arabic document images
Accuracy=95%
Accuracy=96%</p>
        </sec>
        <sec id="sec-5-1-16">
          <title>Private dataset of 125 Arabic document images with 1974 text-lines</title>
        </sec>
        <sec id="sec-5-1-17">
          <title>CENPARMI Arabic handwritten documents (Section IV-A)</title>
        </sec>
        <sec id="sec-5-1-18">
          <title>Benchmarking datasets of the</title>
        </sec>
        <sec id="sec-5-1-19">
          <title>AHDB</title>
          <p>F1=98.76%
Precision =96.3%
extraction rate=84.8%</p>
        </sec>
        <sec id="sec-5-1-20">
          <title>IFN/ENIT and Arabic Handwriting Database: AHDB Accuracies=97.4% and 98.9%</title>
          <p>Accuracy=96.7%
Precision=93.2%</p>
        </sec>
        <sec id="sec-5-1-21">
          <title>High Precision</title>
          <p>Accuracy=94.6%
Success rate=98.6% (line segmentation), 96% (word segmentation), and 87.1% for character segmentation.</p>
        </sec>
        <sec id="sec-5-1-22">
          <title>KHATT</title>
        </sec>
        <sec id="sec-5-1-23">
          <title>BADAM</title>
        </sec>
        <sec id="sec-5-1-24">
          <title>ANT database 100 text images from KHATT F1=85%</title>
        </sec>
        <sec id="sec-5-1-25">
          <title>LTP (Local Touching Patches) database</title>
        </sec>
        <sec id="sec-5-1-26">
          <title>A set of 115 text images from (KFUPM) (KHATT) database + Some images produced by the authors</title>
          <p>has played a main role in the preservation of cultural identity and the construction of the contemporary Arabian states. Despite its importance, it faces difficult situations, some of which result from its transfer from its places of origin, which has caused the destruction and loss of certain rare manuscripts. To safeguard this heritage of ancient texts and manuscripts, custodians and preservationists tasked with its care turn to digital tools and techniques. They organize digitization projects to convert these materials into digital formats, enabling further research through automated procedures. These operations encompass the analysis, access, and comprehension of the manuscript content. Firstly, an artistic card is proposed for each manuscript, containing details ranging from the manuscript’s content to its formal aspects. Such information aids in the creation of diverse structured databases across various domains of these manuscripts and facilitates the development of corresponding segmentation and recognition systems.</p>
          <p>As a form of motivation and for future research in preserving this significant heritage, we are currently involved in establishing an initial database. This database will feature scanned and photographed images of ancient Algerian manuscripts collected from diverse centers and regions across the country. It is intended to serve as a foundational resource for various automatic processing experiments, particularly in the domain of text line segmentation.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>This paper offers an extensive examination of established
approaches for offline Arabic text line segmentation and
extraction in handwriting. It begins with a concise
depiction of the origins and key attributes of ancient
handwritten Arabic manuscripts. Subsequently, it explores the various
techniques employed in segmenting handwritten Arabic
documents and the challenges encountered
with this script style. Additionally, it provides a
comprehensive and comparative analysis of existing works
and their experimental datasets, serving as a valuable
resource for computer vision and machine learning researchers,
practitioners, and engineers. Furthermore, it presents an
overview of datasets and concludes by addressing
unresolved issues, challenges, and future research directions,
which should benefit emerging researchers and engineers.
</p>
      <p>[7] Hafiz A M, et al. Ku-database of handwritten Arabic words. 2016.</p>
      <p>[8] El-Sawy A, Loey M, El-Bakry H. Arabic handwritten characters recognition using convolutional neural network. In WSEAS Transactions on Computer Research, pages 11–19, 2017.</p>
      <p>[9] Belaid A. Analyse de documents: de l’image à la représentation par les normes de codage. In Cours de l’INRIA. Document numérique, pages 21–38, 1997.</p>
      <p>[10] Bennasri A, Zahour A, Taconet B. Extraction des lignes d’un texte manuscrit arabe. In Vision Interface, pages 42–48, 1999.</p>
      <p>[11] Kaileh H. L’accès à distance aux manuscrits arabes numérisés en mode image. Doctoral dissertation, Lyon 2, 2004.</p>
      <p>[12] Van den Boogert N. Some notes on Maghribi script. 1989.</p>
      <p>[13] Wolf C, Doermann D. Binarization of low quality text using a Markov random field model. In International Conference on Pattern Recognition, IEEE, pages 160–163, 2002.</p>
      <p>[14] Kharma N, Ahmed M, Ward R. A new comprehensive database of handwritten Arabic words, numbers, and signatures used for OCR testing. In IEEE Canadian Conference on Electrical and Computer Engineering (Cat. No.99TH8411), pages 766–768, 1999.</p>
      <p>[15] Gader Takwa B A and Echi Afef K. Unconstrained handwritten Arabic text-lines segmentation based on AR2U-Net. In 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 349–354, 2020.</p>
      <p>[16] Lawgali A, Angelova M, Bouridane A. HACDB: Handwritten Arabic characters database for automatic character recognition. In European Workshop on Visual Information Processing (EUVIP), pages 255–259, 2013.</p>
      <p>[17] Papavassiliou, Vassilis, et al. Handwritten document image segmentation into text lines and words. In Pattern Recognition, pages 369–377, 2010.</p>
      <p>[18] Pechwitz M, Maddouri S S, Märgner V, Ellouze N, Amiri H. IFN/ENIT-database of handwritten Arabic words. In Proc. of CIFED, pages 127–136, 2002.</p>
      <p>[19] Al-Barhamtoshy, Hassanin M., et al. Typewritten and handwritten using optical character recognition (OCR) system. In Arabic calligraphy.</p>
      <p>[20] Ali A, Suresha M. Survey on segmentation and recognition of handwritten Arabic script. In SN Computer Science, pages 1–31, 2020.</p>
      <p>[21] Tseng Y H, Lee H. Recognition-based handwritten Chinese character segmentation using a probabilistic Viterbi algorithm. In Pattern Recognition Letters, pages 791–806, 1999.</p>
      <p>[22] Likforman-Sulem L, Zahour A, Taconet B. Text line segmentation of historical documents: a survey. In International Journal of Document Analysis and Recognition (IJDAR), pages 123–138, 2007.</p>
      <p>[23] Boussellaa W, Zahour A, Elabed H, Benabdelhafid A, Alimi A M. Unsupervised block covering analysis for text-line segmentation of Arabic ancient handwritten document images. In 20th International Conference on Pattern Recognition, pages 1929–1932, 2010.</p>
      <p>[24] Jayant Kumar, Wael Abd-Almageed, Le Kang, and David Doermann. Handwritten Arabic text line segmentation using affinity propagation. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pages 135–142, 2010.</p>
      <p>[25] Kumar J, Kang L, Doermann D, Abd-Almageed W. Segmentation of handwritten textlines in presence of touching components. In International Conference on Document Analysis and Recognition, pages 109–113, 2011.</p>
      <p>[26] Khayyat M, Lam L, Suen C Y, Yin F, Liu C. Arabic handwritten text line extraction by applying an adaptive mask to morphological dilation. In 10th IAPR International Workshop on Document Analysis Systems, pages 100–104, 2012.</p>
      <p>[27] Al-Dmour A, Fraij F. Segmenting Arabic handwritten documents into text lines and words. In International Journal of Advancements in Computing Technology, 2015.</p>
      <p>[28] Suresha M, Ali A. Segmentation of handwritten text lines with touching of line. In International Journal of Computer Engineering and Applications, pages 1–12, 2018.</p>
      <p>[29] Neche C, Belaid A, Kacem-Echi A. Arabic handwritten documents segmentation into text-lines and words using deep learning. In International Conference on Document Analysis and Recognition Workshops (ICDARW), pages 19–24, 2019.</p>
      <p>[30] Mechi O, Mehri M, Ingold R, Amara N E. Combining deep and ad-hoc solutions to localize text lines in ancient Arabic document images. In 25th International Conference on Pattern Recognition (ICPR), pages 7759–7766, 2021.</p>
      <p>[31] Meziani F, Bouchakour L, Ghribi K, Yahiaoui M, Latrache H, Abbas M. Arabic handwritten text to line segmentation. In International Conference on Information Systems and Advanced Technologies (ICISAT), pages 1–5, 2021.</p>
      <p>[32] Takwa Ben Aïcha Gader and Afef Kacem Echi. Deep learning-based segmentation of connected components in Arabic handwritten documents. In International Conference on Intelligent Systems and Pattern Recognition, pages 93–106. Springer, 2022.</p>
      <p>[33] Abdo, Hakim A., et al. An approach to analysis of Arabic text documents into text lines, words, and characters. In Indones. J. Electr. Eng. Comput, pages 754–763, 2022.</p>
      <p>[34] Al-Ma’adeed S, Elliman D, Higgins C A. A database for Arabic handwritten text recognition research. In Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition, pages 485–489, 2002.</p>
      <p>[35] Al-Ohali Y, Cheriet M, Suen C. Databases for recognition of handwritten Arabic cheques. In Pattern Recognition, pages 111–121, 2003.</p>
      <p>[36] Mozafari S, Faez K, Faradji F, Ziaratban M, Golzan S. A comprehensive isolated Farsi/Arabic character database for handwritten OCR research. In Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.</p>
      <p>[37] El-Sherif E A, Abdelazeem S. A two-stage system for Arabic handwritten digit recognition tested on a new large database. In Artificial Intelligence and Pattern Recognition, pages 237–242, 2007.</p>
      <p>[38] Al-Nassiri A, Abdulla S. A new Arabic (AHD/AMSH) handwritten database. In ACIT, Lattakia, Syria, 2007.</p>
      <p>[39] Slimane F, Ingold R, Kanoun S, Alimi A M, Hennebert J. A new Arabic printed text image database and evaluation protocols. In 10th International Conference on Document Analysis and Recognition, pages 946–950, 2009.</p>
      <p>[40] Musa M E. Arabic handwritten datasets for pattern recognition and machine learning. In 5th International Conference on Application of Information and Communication Technologies (AICT), pages 1–3, 2011.</p>
      <p>[41] Mahmoud S, Ahmad I, Al-Khatib W G, Alshayeb M, Parvez M T, Märgner V, et al. KHATT: An open Arabic offline handwritten text database. In Pattern Recognition, pages 1096–1112, 2014.</p>
      <p>[42] Torki M, Hussein M E, Elsallamy A, Fayyaz M, Yaser S. Window-based descriptors for Arabic handwritten alphabet recognition: a comparative study on a novel dataset. arXiv preprint arXiv:1411.3519, 2014.</p>
      <p>[43] Lamghari N, Raghay S. Recognition of Arabic handwritten diacritics using the new database DBAHD. In Journal of Physics, IOP Publishing, page 012023, 2021.</p>
      <p>[44] Djaghbellou S, Attia A, Bouziane A, &amp; Akhtar Z. Local features enhancement using deep autoencoder scheme for the recognition of the proposed handwritten Arabic-Maghrebi characters database. In Multimedia Tools and Applications, pages 1–19, 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Orsatti</surname>
            <given-names>P.</given-names>
          </string-name>
          <article-title>Le manuscrit islamique: caractéristiques matérielles et typologie</article-title>
          . pages
          <fpage>269</fpage>
          -
          <lpage>331</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Al-Dmour</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fraij</surname>
            <given-names>F.</given-names>
          </string-name>
          <article-title>Segmenting Arabic handwritten documents into text lines and words</article-title>
          . In
          <source>International Journal of Advancements in Computing Technology</source>
          , page
          <fpage>109</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>Islamic medical manuscripts at the National Library of Medicine</article-title>
          . https://www.nlm.nih.gov/hmd/arabic/arabichome.html. Accessed: 2023-03-10.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Bibliothèque nationale de Tunisie. http://www.bibliotheque.nat.tn. Accessed: 2023-03-10.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Lebore</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Segmentation d'image application aux documents anciens</article-title>
          . In Mémoire de Master de recherche, Université de Nantes, France,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Alamri</surname>
            <given-names>Huda</given-names>
          </string-name>
          , et al.
          <article-title>A novel comprehensive database for Arabic off-line handwriting recognition</article-title>
          . In
          <source>Proceedings of 11th International Conference on Frontiers in Handwriting Recognition, ICFHR</source>
          , pages
          <fpage>664</fpage>
          -
          <lpage>669</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>