
A survey on text-line segmentation process in historical Arab manuscripts

Soumia Djaghbellou

Attia Abdelouahab

Bouziane Abderraouf

Department of Computer Science, University of Bordj Bou Arreridj, Algeria
MSE Laboratory, Department of Computer Science, University of Bordj Bou Arreridj, Algeria

The segmentation process entails dividing or decomposing the entire document image into segments or lines. This technique serves as a fundamental step in developing any writing or optical character recognition system. However, numerous existing segmentation schemes encounter challenges when dealing with specific script styles, such as the ancient or historical Arabic writing found in ancient manuscripts, which possesses unique characteristics. These characteristics include inclined text lines, overlapping letters, diacritic marks, decorative elements, variable letter forms, and ligatures (combinations of two or more letters merged to form a single connected shape). Thus, in this paper we present a thorough survey of the field. The survey is composed of two segments. The first segment provides a concise overview of historical Arabic documents. The second, which serves as the primary segment, focuses on the crucial step of handwritten document recognition, specifically segmentation. A detailed and systematic overview of the various approaches to segmentation, including different levels, employed for extracting handwritten Arabic text-lines, is outlined. Subsequently, a literature study is conducted to review and analyze proposed works in this area.

Keywords: Text-lines segmentation, pattern recognition, Arabic handwritten, historical Arabic documents

1. Introduction

Historical documents often symbolize the identity of diverse civilizations worldwide. Analyzing and comprehending their contents holds paramount importance, especially for researchers. Manually extracting data from these historical documents proves to be a laborious and expensive endeavor. In recent years, there has been a surge in research dedicated to the automated processing of historical documents. Despite notable advancements, automating the processing and analysis of historical Arabic documents remains a challenging task. Text line segmentation stands as an initial stage in the text recognition system process. This critical preprocessing step in document analysis poses particular challenges, especially with handwritten texts. While segmenting text lines from machine-printed documents is commonly considered resolved, freestyle handwritten text lines remain notably challenging. This complexity arises due to curved lines, inconsistent spacing, and overlapping spatial boundaries. Additionally, irregular layouts, varied character sizes reflecting different writing styles, intersecting lines, and the absence of a clear baseline all contribute to the intricacy and difficulty in handwritten document analysis [1].

This paper focuses on the complex problem of the text-line segmentation process and provides a comprehensive survey of existing research works. Firstly, it presents the historical Arabic manuscripts in general, including main features and structure/type of documents. Secondly, in the main section of the paper, the focus shifts to the segmentation process as a critical phase in recognition, specifically emphasizing the commonly utilized and adopted techniques for Arabic scripts.

The remainder of this paper is structured as follows: Section 2 offers a comprehensive overview of ancient Arabic manuscripts, encompassing their content structure and various applications. Moving on to Section 3, we delve into the image segmentation process, exploring its different levels and focusing on widely adopted approaches specifically designed for handwritten Arabic texts. In Section 4, we concentrate on a comparative study, presenting notable existing works related to the segmentation of handwritten Arabic documents. This analysis will consider the method of experimental analysis and the data-set used. Section 5 is dedicated to showcasing a compilation of famous Arabic databases that have been utilized in various studies. Section 6 addresses open issues, motivations, and potential directions for future research. Finally, Section 7 concludes the paper, summarizing the findings and insights presented throughout the article.

2. Structures and applications of the historical Arabic documents

Throughout history, manuscripts were the official way of writing down knowledge and science. There is a huge amount of historical Arabic manuscripts in the archives and national libraries around the world, which have been scanning their collections to make them publicly available and to preserve this valuable cultural heritage. The Arabic manuscripts, like manuscripts of the other languages, have some common characteristics.

Manuscripts have their specificities and various distinct elements that can be useful to identify the manuscripts for the creation of an electronic format of description. Among these elements, we distinguish:

• Statement of responsibility
• Names of owners. It is also an important clue for researchers wishing to follow the historical development of the manuscript
• The title: it is the main identifying element, which in most cases is presented on the title page
• Physical description/codicology.

Handwritten documents, irrespective of the language, are typically categorized based on their physical appearance into four classes, as outlined in [2]:

• Mono-oriented documents: lines in this class are oriented in one direction. Figure 1a shows a handwritten Arabic document with a horizontal orientation
• Multi-oriented documents: lines in these documents are arranged in blocks of different orientations. Figure 1b gives an example of this class of documents
• Multi-script documents: These comprise texts authored by multiple individuals, resulting in various scripts. This occurrence was frequent in the past when individuals succeeded one another to finalize a document or collaborated on the same written piece. Figure 1c depicts a multi-script handwritten document (Arabic and Latin)
• Heterogeneous documents: This category encompasses content that includes both textual information and images or illustrations. Handwritten documents of this nature might feature diverse orientations, such as dimension lines or illustrative drawings, as illustrated in Figure 1d.

Figure 1: Examples of the four categories (a, b, c [3]; d [4]) of handwritten documents.

Currently, numerous major libraries globally are digitizing handwritten historical documents. These scanned images are uploaded onto their websites, complemented by metadata facilitating specific document searches within extensive databases. Accessing document content isn't feasible without its digital presence in a textual format. Therefore, to harness these documents, diverse methods and applications are employed. In this paper, we succinctly outline three applications, depicted in Figure 2 [5]. In dealing with aged document images, segmentation poses a significant challenge. However, delineating distinct blocks within their physical structure simplifies this task. By focusing on block forms and their spatial relationships, we ascertain:

• Baseline (connecting the lower part of character bodies)
• Median line (tracing the upper part of character bodies)
• Upper line (linking the top of ascenders)
• Lower line (uniting the bottom of descenders).

Effectively interpreting this manuscript type demands a robust segmentation process backed by efficient methods and techniques. This paper aims to extensively discuss the pivotal phase of segmenting textual documents, concentrating on the most effective methodologies that have exhibited performance, particularly concerning handwritten Arabic script.

3. Segmentation phase and its methods

Segmentation of documents into text lines, also known as text line extraction, stands as a fundamental step in document content recognition. It typically serves as a preprocessing phase, as illustrated in Figure 4.

However, segmenting ancient and historical handwritten Arabic documents into text lines poses considerable challenges, especially when dealing with documents of poor quality. This complexity hampers content extraction due to the diverse nature of Arabic writing. Characters and words present varying shapes, contributing to an extensive vocabulary. Moreover, these documents frequently feature additional disruptive elements like stains, ornamentation, seals, and holes [5].

The objective of segmentation is to simplify and alter the representation of an image into something more meaningful and easier to analyze and recognize, such as lines, sub-words, words, or characters. In general, there exist four levels of segmentation, as illustrated in Figure 5:

Page segmentation: This initial step involves identifying information areas on each page based on their visual attributes. It often includes logically labeling these areas according to the content they represent, such as text, graphics, or images. A comprehensive analysis of the techniques employed in document analysis has been presented in prior studies [6] [7] [8].

Text segmentation into lines: This stage focuses on separating text lines to extract individual words and, subsequently, the characters within those words. Numerous studies in this domain employ image decomposition into connected components [9].

Line segmentation into words: This phase utilizes the vertical projection histograms of lines to detect spaces between words for separation. However, this method might not be as effective when dealing with Arabic script, where gaps also occur inside words, between sub-words.

Word segmentation into characters: This process involves breaking down words into their constituent individual symbols. It is a pivotal decision-making step in optical character recognition systems, determining the accuracy of isolated patterns within an image [10].

3.1. The adopted methods for Arabic text-lines segmentation

In the context of Arabic manuscripts, localizing text lines for extraction or segmentation is challenging due to the morphological peculiarities associated with the Arabic script. The script is naturally cursive, unconstrained, and horizontally oriented, which adds to the difficulties, especially when dealing with historical documents. Evaluation of Arabic handwritten text line extraction algorithms has either been lacking or has shown higher error rates compared to algorithms used for other languages. This is because, as previously mentioned, Arabic script exhibits greater cursive characteristics compared to other scripts [11].

To streamline and simplify the handling of such documents, extensive research has been directed toward image binarization. Image binarization, illustrated in Figure 6, involves dividing the image into two categories: the background and the object (text lines). Segmenting a color image can be exceptionally resource-intensive, often necessitating consideration of multiple factors (e.g., histogram, color, etc.) to determine a point's class or type. Various techniques have been devised for image binarization, including global threshold, local threshold, Markov random field model [12], [13], water flow model [14], and Gatos et al.'s method [15], [16].

Figure 6: Original colored document and binarized document.

In general, methods and approaches for segmenting handwritten text lines can be categorized based on their operational mode or the strategies they employ. These strategies cover projection-based methods, smearing methods, Hough-based approaches, and clustering or grouping methods.

Projection-based approach: Within this approach, we differentiate between two profiles: the vertical projection and the horizontal projection. These profiles are obtained by summing the pixel values along the vertical and horizontal axes, respectively, that is, for each x and each y value [15]. In our text line segmentation study, we specifically focus on the horizontal projection. This projection is applied to different lines in order to obtain an initial position for the separated lines and their corresponding baselines. In Figure 7, we present the reference lines and interfering lines, while Figure 8 provides an illustrative example of an Arabic text and its horizontal projection.

Smearing methods: These methods spread black pixels horizontally. The white gap between consecutive black pixels is measured, and if it does not exceed a predetermined threshold, the gap is filled with black pixels. This process leads to the formation of interconnected black pixel shapes around the text lines [18], as depicted in Figure 9. However, this method poses a challenge of line overlap: during smearing, two separate lines might merge, causing the segmentation of two lines into one [19].
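The global-threshold binarization, horizontal projection, and horizontal smearing steps described above can be sketched as follows. This is a minimal illustration and not the implementation of any surveyed work: the function names, the fixed threshold of 128, the `min_ink` parameter, and the rule that a white gap is filled when it is at most `max_gap` pixels wide are all assumptions made for the example. It assumes dark ink on a light background.

```python
import numpy as np

def binarize(gray, threshold=128):
    """Global threshold (assumed value): darker pixels become ink (1), others background (0)."""
    return (np.asarray(gray) < threshold).astype(np.uint8)

def horizontal_projection(binary):
    """Number of ink pixels in each row, i.e. one value per y."""
    return binary.sum(axis=1)

def lines_from_projection(binary, min_ink=1):
    """Return (top, bottom) row ranges whose projection is >= min_ink; bottom is exclusive."""
    is_text = horizontal_projection(binary) >= min_ink
    lines, start = [], None
    for y, flag in enumerate(is_text):
        if flag and start is None:
            start = y                      # a text band begins
        elif not flag and start is not None:
            lines.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # band touching the bottom edge
        lines.append((start, len(is_text)))
    return lines

def smear_horizontal(binary, max_gap):
    """Fill white runs of at most max_gap pixels lying between two ink pixels of a row."""
    out = binary.copy()
    for row in out:                        # each row is a view into `out`
        ink = np.flatnonzero(row)
        for left, right in zip(ink[:-1], ink[1:]):
            if 1 < right - left <= max_gap + 1:
                row[left + 1:right] = 1
    return out
```

On a clean page with white space between lines, `lines_from_projection(binarize(gray))` yields one row range per text line. When lines touch or overlap, the profile has no empty rows between them and the bands merge, which is the limitation discussed in this section.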
Figure 9: An example demonstrating the application of the smearing method on Arabic text [18].

Hough transform: This is a voting-based technique employed to detect and locate straight lines within text document images. Applying the Hough transform to the component centroids allows us to determine the orientations of these lines [20]. Handwritten document images often contain annotations, erasures, and lines oriented in various directions alongside the primary lines. Consequently, this method depends on contextual cues, like direction continuity and proximity criteria, to eliminate erroneous alignments among the components.

Grouping approach: This method involves aggregating and combining various units (such as blocks, pixels, connected components) in a bottom-up fashion to create alignments using distinct perceptual criteria such as proximity, continuity, and similarity. It also employs geometric feature details, including the size, position, shape, and orientation of the connected components, thereby grouping them into rows [20].

Most studies on segmenting handwritten Arabic text into lines rely on schemes that identify overlapping, touching, or connected components, such as the grouping approach. Projection-based methods, however, remain effective only when handwritten documents exhibit minimal overlap; they struggle when there is no white space between lines, a common feature in many Arabic manuscripts. Conversely, alternative methods (refer to Figure 10) are guided by criteria aimed at preventing the intersection of black pixels, as proposed and implemented in [21] for segmenting both modern and historical Chinese documents, often characterized by overlapping lines.

Nevertheless, this method proves efficient only when fewer black pixels are present at the contact points, potentially failing when these points contain a significant amount of such pixels.

For detecting connected components between lines, several rules and criteria need consideration. These include identifying the label of the component (touching/overlapping), managing ambiguous component sizes, assessing the density of black pixels within each alignment region, considering alignment proximity, and evaluating contextual information (such as the positions of alignments surrounding the component) [22]. Separating these components involves analyzing the vertical projection profile of the component to determine the location of the horizontal frontier segment that will be used to separate the touching elements. If the projection profile exhibits two peaks, the separation occurs midway between them; otherwise, the component is divided into two equal parts [22].

4. Literature on Handwritten Arabic Documents Segmentation works

In this section, we showcase prominent existing research in the realm of segmenting handwritten Arabic documents, employing experimental analysis methods and datasets. Until recently, there has been a scarcity of studies focusing on segmenting and recognizing handwritten Arabic text. Owing to the script's distinct characteristics, conventional methods often fall short in terms of efficacy. The review encompasses publications from the last 12 years, evaluating a total of 10 articles from journals and conferences. These articles primarily centered on text segmentation within their core methodologies.

A summary of these ten articles is provided in Table 1. For a better understanding of the Table 1 contents, let's examine each column separately. The first column in the table displays the author and the publication year of the article. Next, a concise description of the segmentation technique outlined in each respective study is provided. Afterward, details regarding the database utilized to test the proposed model are presented. Lastly, the final column showcases the evaluation results of each experiment. It's important to highlight that the assessment of the performance of different proposed methods relied on calculating various metrics as detailed below:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Where:
TP, TN (True Positives and True Negatives): indicate the correct predictions for the positive and negative class.
FP, FN (False Positives and False Negatives): FP indicates the incorrect predictions of the positive class, and FN indicates the incorrect predictions of the negative class.

Boussellaa, Wafa, et al. (2010) [23] introduced a segmentation method centered on block covering analysis employing an unsupervised method. Initially, they calculated the optimal document decomposition into vertical strips to achieve fuzzy baseline detection using the fuzzy C-means algorithm. Afterward, blocks were assigned to the respective lines in Arabic historical handwritten documents containing various scripts, including characters that overlapped or were multi-touching. The algorithm presented in their study demonstrated strong performance, achieving an accuracy rate of 95%.

Kumar, Jayant, et al. (2010) [24] introduced a method for extracting handwritten text lines from monochromatic Arabic document images. Their approach relied on a unique graph framework involving two key steps. Initially, the scheme estimates local orientation at each primary component, constructing a sparse similarity graph. Subsequently, it employs a shortest path algorithm to assess similarities between non-adjacent components. The model underwent testing on a dataset comprising 125 images, resulting in a final accuracy of 96%.

Kumar, Jayant, et al. (2011) [25] developed a handwritten text-lines extraction model utilizing a graph-based technique to detect touching and proximity errors, with a refinement step using Expectation-Maximization (EM) to iteratively split the error segments and obtain correct text-lines. Evaluated on a dataset of 125 Arabic document images, the approach achieved a very high F1 score of 98.76%.

Khayyat, Muna, et al. (2012) [26] introduced a technique for extracting handwritten text lines. Their method relies on morphological dilation with a dynamically adaptive mask. They employed a smearing technique to generate large connected components, or blobs, which were subsequently analyzed for applying appropriate smearing to the document. This approach underwent testing using the Arabic dataset CENPARMI, encompassing multi-skewed and touching lines. The experimental outcomes demonstrated the efficacy of the proposed algorithm, achieving a precision rate of 96.3%.

Al-Dmour, Ayman, and Fares Fraij (2014) [27] introduced a text-line segmentation model that relies on the established horizontal projection profile (HPP) method. Initially, the approach involves generating a histogram of black pixels along the preprocessed image's horizontal scan lines. Subsequently, the self-similarity is improved through autocorrelation. The implemented system demonstrated highly encouraging outcomes, achieving an extraction accuracy rate of 84.8%.

Suresha, M., and Amani Ali Ahmed Ali (2018) [28] developed a segmentation process utilizing the Hough transform method. This technique was followed by a skeletonization operation in the post-processing phase, aimed at rectifying potential false alarms. The ultimate objective was to proficiently segment vertically connected characters. The effectiveness of the proposed system was demonstrated through experimentation on two datasets: IFN/ENIT and the AHDB Arabic Handwriting Database. Their results showcased accuracies of 97.4% and 98.9%, respectively.

Neche, Chemseddine, et al. (2019) [29] introduced a text-line segmentation approach employing a deep learning architecture. Specifically, they employed an RU-Net enabling pixel-wise classification to differentiate text-line pixels from the background. The experimental assessment was conducted on the KHATT standard Arabic benchmark, and the results obtained validate the successful segmentation process, achieving an accuracy of 96.7%.

Gader, Takwa, et al. (2020) [15] developed a system for extracting text lines from images containing unconstrained handwritten Arabic texts sourced from the public Arabic dataset, BADAM. The approach relies on a deep neural network named AR2U-Net, incorporating a Recurrent Residual convolutional neural network in conjunction with the U-Net model and an Attention mechanism. The model demonstrated its performance, achieving a precision rate of 93.2%.

Mechi, Olfa, et al. (2021) [30] introduced a hybrid method that merges a U-Net deep network with traditional document image analysis techniques, including connected component analysis and modified RLSA, to localize text lines in various contemporary datasets of handwritten Arabic documents, both public and private.
The outcomes demonstrated the efficacy of the proposed approach, achieving a high precision rate of approximately 90%.

Meziani, Fariza, et al. (2021) [31] implemented their proposed technique on document images sourced from the standard KHATT database, which exhibited various inclinations, overlapping, and intersecting lines. Prior to employing the segmentation method, they executed a sequence of preprocessing operations. These steps encompassed the conversion of gray-scale images into binary ones, employing the Hough transform for skew detection, and rectifying inclinations to ameliorate image quality. To execute the segmentation, they amalgamated three methodologies reliant on horizontal projection profile (HPP), connected components (CC), and skeleton analysis. Their outcomes showcased promise, achieving notable metrics such as an F-measure of 85%.

Gader, Takwa Ben Aïcha, and Afef Kacem Echi (2022) [32] proposed an effective technique for accurately segmenting overlapping and touching handwritten Arabic text lines. Their approach relies on a modified U-Net called AR2U-net, which is a deep learning-based method trained on the LTP (Local Touching Patches) database. This model performs pixel-wise classification to segment touching characters. Additionally, they introduced a post-treatment step to segment consecutive touching text lines, resulting in an impressive accuracy of 94.6%.

Abdo, Hakim A., et al. (2022) [33], in their analysis of Arabic text documents, introduced a comprehensive four-step methodology: preprocessing, text line segmentation, word segmentation, and character segmentation. The technique leverages horizontal projection methods to detect and extract text lines. In the word segmentation phase, space thresholds are computed to differentiate within-word and between-words spaces, effectively isolating individual words. Following this, a thinning method is employed to identify ligatures and characters. The proposed methodology underwent rigorous testing on a dataset of 115 text images, inclusive of samples from the King Fahd University of Petroleum and Minerals (KFUPM) handwritten Arabic text (KHATT) database, along with additional images generated by the researchers. The experimental outcomes displayed exceptional performance, yielding success rates of 98.6% for line segmentation, 96% for word segmentation, and 87.1% for character segmentation.

5. Datasets

Most of the experimental studies and research in the realm of automatic segmentation and offline Arabic handwriting recognition rely on diverse Arabic databases. These databases encompass collections of images featuring various content types such as characters, words, numbers, and complete texts. For instance, the Al-ISRA database [14] incorporates Arabic sentences, words, digits, and signatures from 500 individuals. In contrast, the AHDB database [34] encompasses 10,000 words authored by 100 writers, while the IFN/ENIT dataset [18] includes Arabic words and Tunisian town names. The CENPARMI database [35] comprises 3,000 digits (legal and courtesy amounts, and numerals). The IFHCDB database [36] focuses on isolated offline handwritten Farsi/Arabic numbers and characters, showcasing grayscale images of 52,380 characters and 17,740 numerals. AHD-Base [37] comprises 60,000 training digits and 10,000 testing digits scribed by 700 individuals of diverse ages and educational backgrounds. Meanwhile, the AHD/AMSH database [38] features 12,300 Arabic handwritten words produced by 82 writers. Additionally, the Alamri Database [6] is a collection of 46,800 digits, 13,439 numerical strings, 21,426 letters, 11,375 words, and 1,640 special symbols, written by 328 contributors. The APTI (Arabic Printed Text Images) [39] dataset is artificially created using a lexicon consisting of 113,284 words, employing 10 Arabic fonts with various sizes and styles. This database encompasses 45,313,600 individual word images, amounting to over 250 million characters. Conversely, the SUST-ALT database [40] comprises numerals, letters, and Arabic names. The KHATT database [41] comprises 1000 forms and 2000 paragraphs authored by 1000 writers. The HACDB dataset [16] provides a collection of Arabic character images designed to encompass various shapes, including overlapping characters. It encompasses 6600 character shapes created by 50 writers. The AIA9K [42] is a database of the Arabic alphabet, featuring 8737 letters distributed across 28 classes, while the AHCD [8] consists of 16800 isolated Arabic characters. The KU-database [7] is composed of words extracted from renowned Arabic proverbs, encompassing a total of 3024 word images, 14616 PAWs (Part of Arabic Words), and 30744 characters. The recently introduced DBAHD [43] is a proposed database focusing on Arabic handwritten diacritics, covering various forms of diacritical marks. It comprises 500 diacritics distributed across 5 folders, including 100 examples of single-point, double-point, triple-point, Hamza, and madda. A recent and noteworthy addition is the HAMCDB [44], presented as the inaugural database of handwritten Maghrebi characters, featuring 1560 images.

Table 2 provides an overview and summary of these aforementioned datasets.

Table 1: Summary of the reviewed articles.

Author and Year | The segmentation technique | The experiment dataset | Evaluation Metrics
(Boussellaa, Wafa, et al., 2010) [23] | Block covering analysis using an unsupervised technique | 100 old handwritten document images from the National Library of Tunisia | Accuracy = 95%
(Kumar, Jayant, et al., 2010) [24] | A graph-based approach (building a sparse similarity graph and using a shortest path algorithm to compute similarities between non-neighboring components) | 125 Arabic document images | Accuracy = 96%
(Kumar, Jayant, et al., 2011) [25] | A graph-based technique to detect touching and proximity errors + a refinement operation using Expectation Maximization (EM) | Private dataset of 125 Arabic document images with 1974 text-lines | F1 = 98.76%
(Khayyat, Muna, et al., 2012) [26] | A smearing technique based on morphological dilation with a dynamic adaptive mask | CENPARMI Arabic handwritten documents (Section IV-A) | Precision = 96.3%
(Al-Dmour, Ayman, and Fares Fraij, 2014) [27] | The horizontal projection profile (HPP) | Benchmarking datasets of the AHDB | Extraction rate = 84.8%
(Suresha, M., and Amani Ali Ahmed Ali, 2018) [28] | Hough transform approach followed by a novel method based on skeletonization in the post-processing stage | IFN/ENIT and Arabic Handwriting Database (AHDB) | Accuracies = 97.4% and 98.9%
(Neche, Chemseddine, et al., 2019) [29] | A deep learning architecture (RU-net) | KHATT | Accuracy = 96.7%
(Gader, Takwa, et al., 2020) [15] | A deep neural network called AR2U-Net based on the U-Net model | BADAM | Precision = 93.2%
(Mechi, Olfa, et al., 2021) [30] | Hybrid method (a deep network (U-Net architecture) with classical image analysis techniques) | ANT database | High precision
(Meziani, Fariza, et al., 2021) [31] | Combination of horizontal projection profile (HPP), connected components (CC), and skeleton | 100 text images from KHATT | F1 = 85%
(Gader, Takwa Ben Aïcha, and Afef Kacem Echi, 2022) [32] | Deep learning-based method based on a modified U-Net named AR2U-net (Attention-based Recurrent Residual U-Net model) | LTP (Local Touching Patches) database | Accuracy = 94.6%
(Abdo, Hakim A., et al., 2022) [33] | The horizontal projection technique is utilized for detecting text lines, calculating a threshold to determine the spacing between isolated words, and subsequently applying a thinning method to detect characters | A set of 115 text images from the (KFUPM) (KHATT) database + some images produced by the authors | Success rate = 98.6% (line segmentation), 96% (word segmentation), and 87.1% (character segmentation)

6. Open Issues, motivation and Future Research Directions

The ancient manuscript heritage represents a big part of the individual and collective memory of the country; it
has played a main role in the preservation of cultural identity and the construction of the contemporary Arabian states. Despite its importance, it faces difficult situations, some of which result from its transfer from its places of origin, which has caused the destruction and loss of certain rare manuscripts. To safeguard this heritage of ancient texts and manuscripts, custodians and preservationists tasked with its care turn to digital tools and techniques. They organize digitization projects to convert these materials into digital formats, enabling further research through automated procedures. These operations encompass the analysis, access, and comprehension of the manuscript content. Firstly, an artistic card is proposed for each manuscript, containing details ranging from the manuscript's content to its formal aspects. Such information aids in the creation of diverse structured databases across various domains of these manuscripts and facilitates the development of corresponding segmentation and recognition systems.

As a form of motivation and for future research in preserving this significant heritage, we are currently involved in establishing an initial database. This database will feature scanned and photographed images of ancient Algerian manuscripts collected from diverse centers and regions across the country. It is intended to serve as a foundational resource for various automatic processing experiments, particularly in the domain of text line segmentation.

7. Conclusion

This paper ofers an extensive examination of established approaches for ofline Arabic text line segmentation and extraction in handwriting. It begins with a concise depiction of the origins and key attributes of ancient handwritten Arabic manuscripts. Subsequently, it explores various techniques employed in segmenting handwritten Arabic documents and delves into the encountered challenges with this script style. Additionally, it provides a comprehensive and comparative analysis of existing works using experimental datasets, serving as a valuable resource for computer vision, machine learning researchers, practitioners, and engineers. Furthermore, it presents an overview of datasets and concludes by addressing unresolved issues, challenges, and future research directions, beneficial for emerging researchers and engineers. [7] et al Hafiz, A. M. Ku ±database of handwritten ara- Recognition (IJDAR), pages 123–138, 2007.
