<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Deep Learning-Based Framework for Text Detection and Recognition in Natural Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Djouher Akrour</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Akram Khelili</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Imene Aloui</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Azeddine Aissaoui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre de Recherche Scientifiques et Techniques sur les Régions Arides</institution>
          ,
          <addr-line>Campus Universitaire</addr-line>
          ,
          <institution>Université Mohamed Khider</institution>
          ,
          <addr-line>Biskra</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LESIA laboratory / Department of Computer Science, Biskra University</institution>
          ,
          <addr-line>PB 145 RP, 07000 Biskra</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LINFI laboratory / Department of Computer Science, Biskra University</institution>
          ,
          <addr-line>City communal 197 Biskra</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <fpage>79</fpage>
      <lpage>88</lpage>
      <abstract>
<p>Detecting and recognizing text in natural images is a critical task for extracting meaningful information, yet it remains highly challenging due to the variability and complexity of unstructured text in real-world scenarios. Traditional image processing techniques often rely on handcrafted features, which struggle to adapt to the diverse and unpredictable nature of text in the wild. To address these limitations, this paper leverages advancements in deep learning to develop a robust framework capable of adaptive feature learning, text extraction, and digitization. The proposed method utilizes YOLOv5 for precise localization of text-rich regions, followed by an LSTM-based module to segment text into individual characters. These characters are subsequently processed by a Capsule Network-based recognition module, ensuring accurate text recognition. A semantic post-processing step is incorporated to further enhance the system's overall performance. Experimental evaluations conducted on popular benchmark datasets demonstrate that the proposed framework significantly outperforms existing state-of-the-art methods, achieving superior accuracy and efficiency in both text detection and recognition tasks.</p>
      </abstract>
      <kwd-group>
<kwd>Capsule Network</kwd>
        <kwd>Yolov5</kwd>
        <kwd>LSTM</kwd>
        <kwd>Text detection</kwd>
        <kwd>Text recognition</kwd>
        <kwd>Semantic recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        performance, making it suitable for practical deployment such as Faster R-CNN [29], rely on regional proposals
in robotics and mobile applications. In the second stage, a and have inspired advanced models like Connectionist
Latin text recognition module is introduced, which com- Text Proposal Network (CTPN) [30], R2CNN [31], and
bines character segmentation via an LSTM network and RRPN [
        <xref ref-type="bibr" rid="ref21">32, 33</xref>
        ]. For example, TextFuseNet [
        <xref ref-type="bibr" rid="ref22 ref23">34, 35</xref>
        ] uses
text recognition using a Capsule Network (CapsNet) to multi-level feature representations and multi-path fusion
capture complex spatial relationships between charac- to enhance text detection, achieving high accuracy but
ters and words. The system is further enhanced by a with significant computational overhead. On the other
semantic post-processing step that applies grammatical hand, one-stage approaches eliminate the region
procorrections and evaluates word similarity using metrics posal phase and directly estimate candidate text regions
such as Levenshtein distance and cosine similarity. from feature maps. Networks such as YOLO [
        <xref ref-type="bibr" rid="ref24 ref25 ref26">36, 37, 38</xref>
        ],
      </p>
      <p>
        The primary contributions of this work are as follows: SSD [
        <xref ref-type="bibr" rid="ref27">39</xref>
        ], and their derivatives have demonstrated
exFirst, we present a robust end-to-end system for scene ceptional eficiency. For instance, Gupta et al. [
        <xref ref-type="bibr" rid="ref28">40</xref>
        ]
intetext detection and recognition tailored for Latin scripts. grated YOLO with a random-forest classifier to reduce
Second, we propose an eficient one-stage text detector false positives, while He et al. [
        <xref ref-type="bibr" rid="ref29">41</xref>
        ] incorporated an
atbased on a Fully Convolutional Network (FCN), which tention mechanism in SSD to suppress background noise.
handles multi-scale text detection without introducing Similarly, TextBoxes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and its extension, TextBoxes++
excessive computational overhead. Third, we introduce [
        <xref ref-type="bibr" rid="ref30">42</xref>
        ], addressed varying text aspect ratios and arbitrary
an innovative recognition module that integrates LSTM orientations, respectively, while SegLink [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] used SSD
and CapsNet, achieving comparable performance to state- to segment text into smaller components linked into
comof-the-art systems in text recognition tasks. plete instances. EAST [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] directly employed a fully
      </p>
      <p>
        The remainder of this paper is organized as follows: convolutional network (FCN) for eficient text region
Section 2 provides an overview of related work, high- detection without unnecessary intermediate steps,
follighting significant advances in scene text detection, EEG lowed by thresholding and non-maximum suppression
classification, and robotic applications. Section 3 presents for refinement.
the details of the proposed framework. Section 4 outlines Text recognition methods are generally classified into
the experimental setups and performance evaluations, sequence-based, word-based, and character-based
apwhile Section 5 concludes with a summary and future proaches. Sequence-based approaches represent text as
directions. a sequence of characters. For example, CRNN [
        <xref ref-type="bibr" rid="ref31">43</xref>
        ]
combines convolutional and recurrent neural networks to
extract feature sequences and model contextual
infor2. Related work mation. Similarly, Shi et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] integrated a spatial
transformer network with a sequence recognition
network to robustly recognize irregular text. Word-based
approaches, such as Jaderberg et al.’s method [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], focus
on recognizing entire words by training convolutional
neural networks on synthetic word datasets. While these
methods have achieved state-of-the-art performance,
they are often constrained by a predefined vocabulary.
      </p>
      <p>
        Character-based approaches, on the other hand, detect
and recognize individual characters before assembling
them into words. For instance, Minetto et al. [
        <xref ref-type="bibr" rid="ref32">44</xref>
        ]
utilized histograms of oriented gradients for character
description and recognition, while Yao et al. [45] proposed
Strokelets, a robust multi-scale representation capturing
character structures at diferent levels. This approach
ofers greater flexibility and is not limited by text length,
making it suitable for complex scenarios.
      </p>
      <p>These advancements in both text detection and
recognition have significantly contributed to the development
of more robust and eficient systems, laying a strong
foundation for further research in this domain.</p>
      <sec id="sec-1-1">
        <title>The detection and recognition of scene text have garnered</title>
        <p>
          substantial attention in the computer vision domain due
to their significance in numerous real-world applications.
Over the years, various methods have been proposed to
tackle the challenges associated with scene text detection
and recognition, which have been thoroughly reviewed
in several comprehensive surveys and analyses [
          <xref ref-type="bibr" rid="ref20">20, 21</xref>
          ].
These methods can be broadly classified into two main
categories: text detection and text recognition.
        </p>
        <p>
          Scene text detection approaches can be divided into
traditional machine learning-based methods and modern
deep learning-based methods. Traditional approaches
rely heavily on handcrafted features and techniques such
as sliding windows and connected components to detect
text in natural scene images [22, 23, 24, 25, 26]. Although
these methods have shown promising results, they
often sufer from a high rate of false positives when
applied to complex and diverse real-world scenarios. In
contrast, deep learning-based methods have emerged
as the dominant approach, ofering improved accuracy
and robustness [
          <xref ref-type="bibr" rid="ref11 ref14">11, 27, 14, 28</xref>
          ]. Deep learning-based text
detection methods can be further categorized into
twostage and one-stage strategies. Two-stage approaches,
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Proposed model</title>
      <sec id="sec-2-1">
        <title>The proposed model, as illustrated in Figure 1, consists</title>
        <p>of two imperative component including text detector and
text recognizer. Firstly, candidate text region is localized
from input image using one-stage text detector based on
YOLOv5. Following that, text image is segmented into
set of individual character patches using BILSTM-based
segmentation technique. Then, these patches pass
oneby-one to the capsule network which help to accurately
recognize each character. The Set of recognized
characters form complete word which pass by Post-Processing
module to apply semantic correction in order to enhance
the accuracy and efectiveness of recognizer component.
More details about each component are described below.</p>
        <sec id="sec-2-1-1">
          <title>3.1. One-stage text detector</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Yolov5 was chosen as our scene text detector for several</title>
        <p>key reasons. First, it integrates the Cross Stage Partial
Network (CSPNet) [46] with Darknet, forming
CSPDarknet as its backbone. This design enhances inference speed
and accuracy while reducing computational complexity
by merging feature maps from diferent network stages.
Second, Yolov5 employs the Path Aggregation Network
(PANet) [47] to improve information flow. PANet uses
an enhanced Feature Pyramid Network (FPN) structure
with a shorter bottom-up path to better propagate
lowlevel features, aiding the model’s performance on unseen
data and improving text scaling. Additionally, adaptive
feature pooling ensures valuable information is passed
through each feature level, enhancing localization
accuracy for text detection. Finally, Yolov5’s detection heads
generate three diferent feature map sizes, enabling
multiscale predictions and enabling the detector to handle text
of varying sizes under challenging real-world conditions.</p>
        <sec id="sec-2-2-1">
          <title>3.2. Text Recognition System</title>
        </sec>
      </sec>
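        <p>As a minimal sketch of this stage, the following PyTorch snippet loads a YOLOv5 model through the public ultralytics/yolov5 hub entry and runs it on one image. The weights file text_det.pt and the confidence threshold are illustrative assumptions, not artifacts released with this paper.</p>
        <preformat>
import torch

# Load a YOLOv5 model from the official ultralytics/yolov5 hub entry,
# using hypothetical custom weights fine-tuned on scene-text boxes.
model = torch.hub.load("ultralytics/yolov5", "custom", path="text_det.pt")
model.conf = 0.4  # assumed confidence threshold for text regions

results = model("scene.jpg")           # run inference on one image
boxes = results.xyxy[0].cpu().numpy()  # rows: (x1, y1, x2, y2, conf, class)
for x1, y1, x2, y2, conf, cls in boxes:
    print(f"text region ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}) conf={conf:.2f}")
        </preformat>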
      <sec id="sec-2-3">
        <title>In this section, we introduce the second stage of our framework which consists of three modules:</title>
          <p>After detecting the text using YOLOv5, two LSTM layers with 256 units each are used to learn long-range temporal dependencies. The LSTM architecture consists of three gates, called the input, forget, and output gates, connected to memory cells that allow the LSTM to store previous context for a long time. The input gate encodes information by applying the hyperbolic tangent function (tanh) to the current input (x<sub>t</sub>) and the previous cell output (h<sub>t−1</sub>) in order to generate a vector of values between −1 and +1. Meanwhile, the forget gate multiplies (x<sub>t</sub>) and (h<sub>t−1</sub>) by weight matrices, adds a bias, and passes the result through the sigmoid activation function, which yields near-binary values: 0 means the cell information will be cleared, whereas 1 means the cell information will be stored for future use. The output gate applies the sigmoid and tanh functions to the current input (x<sub>t</sub>) and the previous cell output (h<sub>t−1</sub>), then multiplies the result with the vector of values generated by the input gate to produce an output that is passed to the next cell.</p>
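          <p>For clarity, the gate computations described above correspond to the standard LSTM formulation below (a conventional rendering, with W, U, and b denoting learned weights and biases, σ the sigmoid function, and ⊙ elementwise multiplication; the exact parameterization of our implementation may differ slightly):</p>
          <disp-formula><tex-math><![CDATA[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
]]></tex-math></disp-formula>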
          <p>In our work, we use a bidirectional LSTM (BiLSTM), as shown in Figure 2, to capture context information from each vector of the detected words by applying a forward and a backward LSTM. The forward LSTM analyzes the sequence of forward hidden states, which at each time step t depend only on the left neighbors, while the backward LSTM analyzes the sequence of backward hidden states, which depend only on the right neighbors. In the last step, the forward and backward results are concatenated to represent the character segment at each time step:</p>
          <disp-formula><tex-math><![CDATA[
h_t = [\,\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\,], \qquad
\overrightarrow{h} = \{\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_t\}, \quad
\overleftarrow{h} = \{\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_t\}
]]></tex-math></disp-formula>
          <p>The output of the segmentation is a sequence of character images, which are fed to the CapsNet after being converted to binary images.</p>
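          <p>A BiLSTM segmenter matching this description could be sketched in PyTorch as follows. The input feature size, the two output labels, and the class name are our assumptions for illustration, not the paper's exact configuration.</p>
          <preformat>
import torch
import torch.nn as nn

# Minimal sketch: two stacked bidirectional LSTM layers with 256 units
# per direction, followed by a per-timestep segmentation head.
class BiLSTMSegmenter(nn.Module):
    def __init__(self, feat_dim=64, hidden=256, n_labels=2):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated: 2 * hidden.
        self.head = nn.Linear(2 * hidden, n_labels)  # e.g. boundary / not-boundary

    def forward(self, x):         # x: (batch, time, feat_dim)
        h, _ = self.bilstm(x)     # h_t = [forward h_t ; backward h_t]
        return self.head(h)       # per-timestep segmentation logits

logits = BiLSTMSegmenter()(torch.randn(1, 100, 64))  # shape (1, 100, 2)
          </preformat>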
      <sec id="sec-2-5">
        <title>Here, we tend to apply the same CapsNet structure employed previously in [6] and modifying it according to our purpose. Figure 3 depicts the overall CapsNet structure used for scene text recognition.</title>
        <p>The CapsNet structure is composed of an encoder and
a decoder, former of which comprises of:
•  Convolutional Layers: The layer has 256
kernels each with a bias term, stride of 1, size of In this module, English lexicon, Levenshtein distance
9× 9× 1 followed by the rectified linear activation [48] and cosine similarity [49] metrics are adopted to
( ). This layer used as lower-level feature grammatically check the resulted word from CapsNet.
extractors and outputs 20 × 20 × 256 tensor. The main purpose of use such metrics is to determine the
• PrimaryCaps layers: The 8 capsule layer applies required number of changes (inserting, deleting or
replac9 × 9 × 256 convolutional kernels, with stride ing a character in word) and enhancing recognizer
com2, to the 20 × 20 × 256 input tensor. This layer putational eficiency by reducing the number of words
produce combination of the above feature outputs that will be treated by cosine metric. Figure 4 depicts the
and generates 6 × 6 × 8 × 8 tensor. overall architecture of post-processing module.
• CharCaps Layers: These 70 capsule layers are The word generated by CapsNet pass firstly to the
used for the generation of the loss function and lexicon for selecting the set of words that have the same
transformational weight matrix. Stem of the input word. Then, this set of words will be
handled one-by-one by the two metrics mentioned before.</p>
        <p>Whereas, Decoder consists of three Fully Connected Finally, the word with the highest cosine similarity is
layers (FC). chosen as the correct word.</p>
        <p>The loss function is calculated for correct and incorrect Levenshtein [48] is based on calculating the distance
CharCaps, primarily defined as 1 if the correct label corre- matrix between the components of two words. The first
sponds with the character of this particular CharCap and step is to create matrix of shape ( + 2,  + 2) where 
0 otherwise. A zero-loss event is initiated either when a and  are the size of the two words. The first two lines
probability of right or wrong prediction is greater than represent the first word and indices respectively, and the
+ or less than − , respectively. For each CharCaps first two columns represent the second word and indices
capsule, , the incurred loss is as follows: respectively. Then, the matrix should be completed with
where:
with  coupling coeficients measuring the probability
of primary capsule  probabilistically triggering capsule
.  representing the weighted sum shrinked by the
squashing function.
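          <p>The two equations above translate directly into code. The following PyTorch sketch (tensor shapes and function names are assumptions for illustration) implements the squashing nonlinearity and the margin loss of Eq. (1):</p>
          <preformat>
import torch

def squash(s, dim=-1, eps=1e-8):
    # v = (|s|^2 / (1 + |s|^2)) * (s / |s|): short vectors shrink toward 0,
    # long vectors saturate toward unit length.
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    # v: (batch, n_classes, caps_dim) CharCaps outputs;
    # targets: one-hot (batch, n_classes). Implements Eq. (1).
    lengths = v.norm(dim=-1)
    loss = targets * torch.clamp(m_pos - lengths, min=0) ** 2 \
         + lam * (1 - targets) * torch.clamp(lengths - m_neg, min=0) ** 2
    return loss.sum(dim=1).mean()
          </preformat>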
        </sec>
        <sec id="sec-2-5">
          <title>3.2.3. Post-processing module</title>
          <p>In this module, an English lexicon, the Levenshtein distance [48], and the cosine similarity metric [49] are adopted to grammatically check the word produced by the CapsNet. The main purpose of using such metrics is to determine the required number of changes (inserting, deleting, or replacing a character in a word) and to enhance the recognizer’s computational efficiency by reducing the number of words that must be treated by the cosine metric. Figure 4 depicts the overall architecture of the post-processing module. The word generated by the CapsNet first passes to the lexicon, which selects the set of words sharing the stem of the input word; this set of words is then handled one by one by the two metrics mentioned above. Finally, the word with the highest cosine similarity is chosen as the correct word.</p>
          <p>The Levenshtein distance [48] is based on calculating a distance matrix between the components of two words. The first step is to create a matrix of shape (n + 2, m + 2), where n and m are the lengths of the two words: the first two rows hold the first word and the character indices, the first two columns hold the second word and its indices, and the remaining cells are initialized to 0, as in the matrix built for the words “beter” and “better”. After that, the characters of the two words are compared character by character along each row and each column: the value at position (i, j) is the minimum of the three values [D(i − 1, j) + 1], [D(i − 1, j − 1)] (incremented by 1 when the two characters differ), and [D(i, j − 1) + 1].</p>
      <sec id="sec-2-6">
        <title>As we see in the resulting matrix, the positions (5, 4)</title>
        <p>and (6, 5) have the value 1 which are incorrect because
the letter “e” in the position (5, 0) is equal to the letter “e”
in the position (0, 4). In addition to that, the Levenshtein
distance between the two words is 1 which means that
there is missing character in the second word. Using
Levenshtein distance allows recognizer to select three
most identical words from the set of words who will be
next treated by the cosine metric.</p>
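          <p>A minimal Python implementation of this dynamic program behaves as described (it uses the standard (n + 1) × (m + 1) table; the two extra header rows and columns in the matrices above hold the words and indices and are presentational only):</p>
          <preformat>
def levenshtein(a: str, b: str) -> int:
    # Standard edit-distance dynamic program.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(a)][len(b)]

assert levenshtein("beter", "better") == 1  # one missing character
          </preformat>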
          <p>Cosine similarity is based on calculating the cosine of the angle between the words’ vectors [49]. After constructing the vectors of the two words (w<sub>1</sub>, w<sub>2</sub>), the cosine similarity is calculated as follows:</p>
          <disp-formula><tex-math><![CDATA[
\cos(w_1, w_2) = \frac{w_1 \cdot w_2}{\lVert w_1 \rVert \times \lVert w_2 \rVert}
= \frac{\sum_{i=1}^{n} w_{1i}\, w_{2i}}{\sqrt{\sum_{i=1}^{n} w_{1i}^{2}} \times \sqrt{\sum_{i=1}^{n} w_{2i}^{2}}} \qquad (5)
]]></tex-math></disp-formula>
          <p>For the two example words, cos(w<sub>1</sub>, w<sub>2</sub>) = 0.89. The values of the cosine similarity range between 0 and 1, where values closer to 1 indicate that the words are more similar.</p>
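          <p>For illustration, Eq. (5) can be computed over character-count vectors. Note that the paper does not specify how its word vectors are built, so the encoding below is an assumption and yields a slightly different value than 0.89 for the example pair:</p>
          <preformat>
import math
from collections import Counter

def cosine_similarity(w1: str, w2: str) -> float:
    # Assumed vectorization: character-count vectors over the union of
    # characters appearing in the two words.
    c1, c2 = Counter(w1), Counter(w2)
    chars = set(c1) | set(c2)
    dot = sum(c1[ch] * c2[ch] for ch in chars)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

print(round(cosine_similarity("beter", "better"), 2))  # ~0.96 under this encoding
          </preformat>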
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments and results</title>
      <sec id="sec-3-1">
        <title>4.1. Datasets</title>
        <sec id="sec-3-1-1">
          <title>To evaluate the performance and versatility of our pro</title>
          <p>
            posed text detection and recognition framework, we
conduct experiments using four challenging benchmark
datasets: ICDAR2013 [50], ICDAR2015 [51],
MSRATD500 [52], and ICDAR2017-MLT [53]. The ICDAR2013
dataset is widely recognized as the standard benchmark
for horizontal text detection. It includes 229 training
images and 233 testing images, with word-level
annotations provided for each image. Similarly, the ICDAR2015
dataset comprises 1000 training images and 500
testing images, featuring various accidental scene text
instances annotated with quadrangular bounding boxes.
The MSRA-TD500 dataset contains 300 training images
and 200 test images, incorporating both English and
Chinese text. The text areas in this dataset are arbitrarily
oriented, and annotations are provided at the sentence
level, making it particularly challenging for text detection
models. The ICDAR2017-MLT dataset is a more complex
and diverse collection, consisting of 7200 training images,
1800 validation images, and 9000 testing images. This
dataset includes multi-oriented, multi-script, and
multilingual scene text instances with line-level and word-level
annotations, significantly increasing the dificulty of the
detection task. For the evaluation of text recognition, we
use a modified version of the EnglishFnt dataset from the
Chars74K collection [54], which has also been used in
previous works [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. This dataset is employed for
training the Long Short-Term Memory (LSTM) network for
word segmentation. To assess the efectiveness of our
text detection and recognition system, we adopt the
standard evaluation metrics, including precision (P), recall (R),
and F-measure (F), to quantify detection and recognition
performance.
          </p>
        </sec>
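        <p>For reference, these metrics are computed in the standard way, with the F-measure as the harmonic mean of precision and recall over true positives (TP), false positives (FP), and false negatives (FN):</p>
        <disp-formula><tex-math><![CDATA[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2\,P\,R}{P + R}
]]></tex-math></disp-formula>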
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Evaluation</title>
        <sec id="sec-3-2-1">
          <title>4.2.1. Text detection</title>
          <p>To assess the effectiveness of our framework in detecting horizontal and long text, we compare its performance with state-of-the-art text detection methods on the ICDAR2013 and MSRA-TD500 datasets. On the ICDAR2013 benchmark, our detector outperforms the other methods by at least 1%, except for TextFuseNet [<xref ref-type="bibr" rid="ref22">34</xref>]. On the MSRA-TD500 dataset, our detector achieves a precision of 89.5%, improving upon the SRPN+VGGDet [55] method, which has a precision of 87.3%. This improvement demonstrates the superiority of our framework in detecting long scene text using a single fully convolutional network. We also validate the performance of our detector on multilingual text detection using the ICDAR2017-MLT dataset. Except for DB-ResNet-50 [56], our detector delivers the highest precision, confirming that our YOLOv5-based framework effectively handles the diverse text shapes across different languages. For multi-oriented text detection on the ICDAR2015 dataset, our method achieves an F-measure of 55.7% and a precision of 76%. Compared to the one-stage methods EAST [<xref ref-type="bibr" rid="ref14">14</xref>], TextBoxes++ [<xref ref-type="bibr" rid="ref30">42</xref>], and RRD [<xref ref-type="bibr" rid="ref34">58</xref>], our precision is 7.3%, 11.2%, and 9.6% lower, respectively, but it is 2.9% higher than that of SegLink [<xref ref-type="bibr" rid="ref33">57</xref>]. This indicates that while our detector does not surpass the others in precision for multi-oriented text, it still performs competitively. Additionally, the use of multi-branch detection improves detection accuracy: by generating feature maps at three different sizes (18 × 18, 36 × 36, 72 × 72) and fusing them, our detector effectively utilizes both shallow and deep features. This enables it to capture rich details and semantic information, enhancing its ability to handle text of varying sizes. Overall, our experimental results demonstrate that the proposed text detector achieves performance comparable to state-of-the-art methods. It effectively detects horizontal, long, multilingual, and multi-oriented text in natural images, as illustrated in Figure 5. Despite the varying styles of the images, the results highlight the detector’s ability to accurately identify text with diverse shapes, orientations, sizes, and languages.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.2.2. Text recognition</title>
          <p>The segmentation results demonstrate an impressive 94% accuracy when training our LSTM model on the Chars74K dataset. This result highlights the ability of the LSTM to learn long-range temporal dependencies by utilizing both the forward and backward passes of the BiLSTM architecture. The model effectively captures the features of both previous and future characters within the image, enhancing the segmentation of the box image into sub-images, which are then passed to the CapsNet model.</p>
          <p>Experimental results show that our CapsNet model, trained on Chars74K images, achieves a recognition rate of 92%. As presented in Table 1, our character recognition model outperforms most state-of-the-art methods and achieves performance comparable to the best of them.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>Recognition rate comparison of state-of-the-art methods on the Chars74K dataset</p></caption>
            <table>
              <thead>
                <tr><th>State-of-the-art methods</th><th>Recognition rate [%]</th></tr>
              </thead>
              <tbody>
                <tr><td>AlexNet [<xref ref-type="bibr" rid="ref35">59</xref>]</td><td>77.77</td></tr>
                <tr><td>GoogleNet [<xref ref-type="bibr" rid="ref35">59</xref>]</td><td>88.89</td></tr>
                <tr><td>Multiscale HoG Features [<xref ref-type="bibr" rid="ref36">60</xref>]</td><td>80</td></tr>
                <tr><td>ConvNet [<xref ref-type="bibr" rid="ref36">60</xref>]</td><td>71.69</td></tr>
                <tr><td>DCNN [<xref ref-type="bibr" rid="ref37">61</xref>]</td><td>90.32</td></tr>
                <tr><td>Proposed CapsNet architecture</td><td>92</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Our results also demonstrate CapsNet’s ability to handle a wide variety of character shapes and its robustness when dealing with datasets containing a larger number of classes (70 classes). Table 2 presents the accuracy, recall, and F1-score for a selection of characters from the Chars74K dataset. This improvement in performance is attributed to the complexity of the PrimaryCaps layers, which, by utilizing vectors during training, increase the model’s capacity to represent character information and effectively capture various character attributes.</p>
          <table-wrap id="tab2">
            <label>Table 2</label>
            <caption><p>Accuracy (Acc), Recall (Rec), and F1-score (F1) of character recognition (CapsNet)</p></caption>
            <table>
              <thead>
                <tr><th>Metric</th><th>0</th><th>2</th><th>9</th><th>I</th><th>P</th><th>Y</th><th>x</th><th>y</th><th>?</th></tr>
              </thead>
              <tbody>
                <tr><td>Acc [%]</td><td>78</td><td>99</td><td>99</td><td>89</td><td>91</td><td>92</td><td>80</td><td>96</td><td>97</td></tr>
                <tr><td>Rec [%]</td><td>83</td><td>99</td><td>98</td><td>78</td><td>91</td><td>96</td><td>83</td><td>91</td><td>100</td></tr>
                <tr><td>F1 [%]</td><td>81</td><td>99</td><td>98</td><td>83</td><td>91</td><td>94</td><td>81</td><td>93</td><td>99</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this paper, we have presented a novel end-to-end system for extracting text from natural scene images. We introduced a robust detector that can suitably localize and extract the regions where text exists, which leads to an appreciable increase in accuracy when recognizing the text. The proposed detector is resistant to background complexity and is insensitive to noise, scale changes, and variations in font and language. Moreover, a modular Latin text recognition method is proposed to accurately recognize text in different situations. In this work, we additionally employed CapsNet with dynamic routing for the recognition of detected text. After dividing the detected text into sub-images of individual characters using a specific segmentation method based on a BiLSTM network, CapsNet is leveraged to classify the diverse characters into tens of categories. Furthermore, we proposed a semantic method as a post-processing step to improve the performance and accuracy of the system in full-word recognition.</p>
      <p>Experimental results on different popular text spotting benchmarks, including both regular and irregular datasets, show that our proposed model can significantly outperform state-of-the-art methods in detection and recognition, with high efficiency and accuracy. In future work, this system will be tested on Chinese and other languages. Future work will also look at improving our model to deal with the problems of false positives and partially detected text lines, especially those belonging to arbitrarily oriented and curved textual regions.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Declaration on Generative AI</title>
      <sec id="sec-5-1">
        <title>During the preparation of this work, the authors used</title>
        <p>ChatGPT, Grammarly in order to: Grammar and spelling
check, Paraphrase and reword. After using this
tool/service, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s
content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Guettala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sayah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kahloul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <article-title>Real time human detection by unmanned aerial vehicles</article-title>
          , in: 2022
          <source>International Symposium on iNnovative Informatics of Biskra (ISNIB)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Djedi</surname>
          </string-name>
          ,
          <article-title>Gene regulatory network to control and simulate virtual creature's locomotion (</article-title>
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Boutarfaia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <article-title>Deep learning for eeg-based motor imagery classification: Towards enhanced human-machine interaction and assistive robotics</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Brandizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gallotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Unsupervised pose estimation by means of an innovative vision transformer</article-title>
          ,
          <source>in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , volume
          <volume>13589</volume>
          LNAI,
          <year>2023</year>
          , p.
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          . doi:10.1007/978-3-031-23480-4_1.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <surname>W. GUETTALA</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <article-title>Efficient one-stage deep learning for text detection in scene images</article-title>
          , Electrotehnica, Electronica,
          <source>Automatica (EEA) 72</source>
          (
          <year>2024</year>
          )
          <fpage>65</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Amine</surname>
          </string-name>
          ,
          <article-title>An end-to-end trainable capsule network for image-based character recognition and its application to video subtitle recognition</article-title>
          .,
          <source>ICTACT Journal on Image &amp; Video Processing</source>
          <volume>11</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gagliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <article-title>Optimal management of various renewable energy sources by a new forecasting method</article-title>
          ,
          <source>in: SPEEDAM 2012 - 21st International Symposium on Power Electronics, Electrical Drives, Automation and Motion</source>
          ,
          <year>2012</year>
          , p.
          <fpage>934</fpage>
          -
          <lpage>940</lpage>
          . doi:10.1109/SPEEDAM.2012.6264603.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <article-title>A cascade neural network architecture investigating surface plasmon polaritons propagation for thin metals in openmp</article-title>
          ,
          <source>in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , vol
          <article-title>- tection and recognition</article-title>
          ,
          <source>Archives of computational ume 8467 LNAI</source>
          ,
          <year>2014</year>
          , p.
          <fpage>22</fpage>
          -
          <lpage>33</lpage>
          . doi:
          <volume>10</volume>
          .1007/ methods in engineering 27 (
          <year>2020</year>
          )
          <fpage>433</fpage>
          -
          <lpage>454</lpage>
          .
          <fpage>978</fpage>
          -3-
          <fpage>319</fpage>
          -07173-
          <issue>2</issue>
          _
          <fpage>3</fpage>
          . [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Brisinello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grbić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vranješ</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vranješ</surname>
          </string-name>
          , Re-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hou</surname>
          </string-name>
          , C.-L. Liu,
          <article-title>Text localization in view on text detection methods on scene images, natural scene images based on conditional random in: international symposium ELMAR</article-title>
          , IEEE,
          <year>2019</year>
          , ifeld, in: 10th international conference on docu- pp.
          <fpage>51</fpage>
          -
          <lpage>56</lpage>
          .
          <article-title>ment analysis and recognition</article-title>
          , IEEE,
          <year>2009</year>
          , pp.
          <fpage>6</fpage>
          -
          <lpage>10</lpage>
          . [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Atoussi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Akrour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tiber- C. Napoli</surname>
          </string-name>
          ,
          <article-title>Real-time synchronisation of multiple macine, A. Rabehi, Comparative analysis of svm fractional-order chaotic systems: an application and cnn classifiers for eeg signal classification in study in secure communication, Fractal and Fracresponse to diferent auditory stimuli</article-title>
          ,
          <source>in: 2024 tional 8</source>
          (
          <year>2024</year>
          ) 104. International Conference on Telecommunications [23]
          <string-name>
            <given-names>K. I.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Texture-based approach and Intelligent Systems (ICTIS)</article-title>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
          <article-title>for text detection in images using support vector</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Liu,
          <article-title>Textboxes: machines and continuously adaptive mean shift A fast text detector with a single deep neural net- algorithm, IEEE Transactions on Pattern Analysis work</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on and Machine Intelligence</source>
          <volume>25</volume>
          (
          <year>2003</year>
          )
          <fpage>1631</fpage>
          -
          <lpage>1639</lpage>
          . artificial intelligence, volume
          <volume>31</volume>
          ,
          <year>2017</year>
          . [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <article-title>Scene text localization and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <article-title>Detect- recognition with oriented stroke detection, in: Proing text in natural image with connectionist text ceedings of the ieee international conference on proposal network</article-title>
          ,
          <source>in: European conference on computer vision</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          . computer vision, Springer,
          <year>2016</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>72</lpage>
          . [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Djaidir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          , C. Napoli,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Haidour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abdelaziz</surname>
          </string-name>
          ,
          <article-title>Gas turbine vibration Textsnake: A flexible representation for detecting monitoring based on real data and neuro-fuzzy systext of arbitrary shapes</article-title>
          ,
          <source>in: Proceedings of the tem, Diagnostyka</source>
          <volume>25</volume>
          (
          <year>2024</year>
          ).
          <article-title>European conference on computer vision</article-title>
          (ECCV), [26]
          <string-name>
            <surname>X.-C. Yin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
          </string-name>
          , H.-W. Hao, Robust text
          <year>2018</year>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>36</lpage>
          .
          <article-title>detection in natural scene images</article-title>
          , IEEE transac-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , W. He,
          <source>tions on pattern analysis and machine intelligence J. Liang, East: an eficient and accurate scene text 36</source>
          (
          <year>2013</year>
          )
          <fpage>970</fpage>
          -
          <lpage>983</lpage>
          . detector, in: Proceedings of the IEEE conference [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Tibermacine, on
          <source>Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chebana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nahili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewscki</surname>
          </string-name>
          , C. Napoli, pp.
          <fpage>5551</fpage>
          -
          <lpage>5560</lpage>
          .
          <article-title>Analyzing eeg patterns in young adults exposed to</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Detecting oriented text diferent acrophobia levels: a vr study, Frontiers in natural images by linking segments</article-title>
          , in: Proceed- in
          <source>Human Neuroscience</source>
          <volume>18</volume>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .3389/ ings of the IEEE conference
          <article-title>on computer vision</article-title>
          and fnhum.
          <year>2024</year>
          .
          <volume>1348154</volume>
          . pattern recognition,
          <year>2017</year>
          , pp.
          <fpage>2550</fpage>
          -
          <lpage>2558</lpage>
          . [28]
          <string-name>
            <given-names>I.</given-names>
            <surname>Naidji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guettala</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. E</surname>
          </string-name>
          . Tiber-
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Robust scene macine</article-title>
          , et al.,
          <article-title>Semi-mind controlled robots based text recognition with automatic rectification, in: on reinforcement learning for indoor application</article-title>
          .,
          <source>Proceedings of the IEEE conference on computer in: ICYRIME</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>59</lpage>
          . vision and pattern recognition,
          <year>2016</year>
          , pp.
          <fpage>4168</fpage>
          -
          <lpage>4176</lpage>
          . [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Faster</surname>
          </string-name>
          r-cnn:
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaderberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Zisser- Towards real-time object detection with region proman, Synthetic data and artificial neural networks posal networks, IEEE transactions on pattern analfor natural scene text recognition</article-title>
          ,
          <source>arXiv preprint ysis and machine intelligence</source>
          <volume>39</volume>
          (
          <year>2016</year>
          )
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          . arXiv:
          <volume>1406</volume>
          .2227 (
          <year>2014</year>
          ). [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bouchelaghem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balsi</surname>
          </string-name>
          , M. Mo-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          , M. Zouai, roni, C. Napoli,
          <article-title>Cross-domain machine learning A. Rabehi, Eeg classification using contrastive learn- approaches using hyperspectral imaging for plasing and riemannian tangent space representations, tics litter detection</article-title>
          , in: 2024 IEEE Mediterranean in: 2024 International Conference on Telecommuni- and
          <string-name>
            <surname>Middle-East Geoscience</surname>
          </string-name>
          and
          <article-title>Remote Sensing cations and Intelligent Systems (ICTIS)</article-title>
          , IEEE,
          <year>2024</year>
          , Symposium (M2GARSS), IEEE,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>40</lpage>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . [31]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Djedi</surname>
          </string-name>
          , Neat neural networks to P. Fu,
          <string-name>
            <surname>Z. Luo,</surname>
          </string-name>
          <article-title>R 2 cnn: Rotational region cnn for control and simulate virtual creature's locomotion, arbitrarily-oriented scene text detection</article-title>
          ,
          <source>in: 24th in: 2014 International Conference on Multimedia International conference on pattern recognition Computing and Systems (ICMCS)</source>
          , IEEE,
          <year>2014</year>
          , pp.
          <source>(ICPR)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>3610</fpage>
          -
          <lpage>3615</lpage>
          .
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          . [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          , W. Shao,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Review of scene text de- X. Xue,
          <article-title>Arbitrary-oriented scene text detection via rotation proposals</article-title>
          , IEEE transactions on multime- [45]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          , W. Liu, Strokelets: A learned dia
          <volume>20</volume>
          (
          <year>2018</year>
          )
          <fpage>3111</fpage>
          -
          <lpage>3122</lpage>
          . multi
          <article-title>-scale representation for scene text recogni-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ladjal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bechouat</surname>
          </string-name>
          , M. Se- tion, in: Proceedings of the IEEE conference on draoui, C. Napoli,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabehi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lalmi</surname>
          </string-name>
          ,
          <article-title>Hybrid mod- computer vision</article-title>
          and pattern recognition,
          <year>2014</year>
          , pp.
          <article-title>els for direct normal irradiance forecasting: A case 4042-4049. study of ghardaia zone (algeria</article-title>
          ),
          <source>Natural Hazards</source>
          [46]
          <string-name>
            <surname>C.-Y. Wang</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-Y. M. Liao</surname>
          </string-name>
          , Y.
          <string-name>
            <surname>-H. Wu</surname>
          </string-name>
          , P.-Y. Chen, J.-W.
          <volume>120</volume>
          (
          <year>2024</year>
          )
          <fpage>14703</fpage>
          -
          <lpage>14725</lpage>
          . Hsieh,
          <string-name>
            <given-names>I.-H.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <article-title>Cspnet: A new backbone that can</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Du</surname>
          </string-name>
          , Textfusenet:
          <article-title>Scene enhance learning capability of cnn, in: Proceedings text detection with richer fused features</article-title>
          .,
          <source>in: IJCAI, of the IEEE/CVF conference on computer vision</source>
          and volume
          <volume>20</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>516</fpage>
          -
          <lpage>522</lpage>
          . pattern recognition workshops,
          <year>2020</year>
          , pp.
          <fpage>390</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>eddine Boukredine</surname>
          </string-name>
          , E. Mehallel,
          <string-name>
            <given-names>A.</given-names>
            <surname>Boualleg</surname>
          </string-name>
          , [47]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          , Path aggregation
          <string-name>
            <given-names>O.</given-names>
            <surname>Baitiche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabehi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guermoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Douara</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. E.</surname>
          </string-name>
          <article-title>network for instance segmentation</article-title>
          , in: Proceedings Tibermacine,
          <article-title>Enhanced performance of microstrip of the IEEE conference on computer vision and antenna arrays through concave modifications and pattern recognition</article-title>
          ,
          <year>2018</year>
          , pp.
          <fpage>8759</fpage>
          -
          <lpage>8768</lpage>
          .
          <article-title>cut-corner techniques</article-title>
          ,
          <source>ITEGAM-JETIA</source>
          <volume>11</volume>
          (
          <year>2025</year>
          ) [48]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lcvenshtcin</surname>
          </string-name>
          ,
          <article-title>Binary coors capable or 'correct65-71. ing deletions, insertions, and reversals</article-title>
          , in: Soviet
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Redmon,</surname>
          </string-name>
          <article-title>Yolov3: An incremental Physics-Doklady</article-title>
          , volume
          <volume>10</volume>
          ,
          <year>1966</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          . improvement, in: Computer vision and pattern [49]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          , L. Han,
          <article-title>Distance weighted cosine similarrecognition</article-title>
          , volume
          <volume>1804</volume>
          , Springer Berlin/Heidel- ity
          <article-title>measure for text classification</article-title>
          ,
          <source>in: Intelligent berg, Germany</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
          <string-name>
            <given-names>Data</given-names>
            <surname>Engineering</surname>
          </string-name>
          and Automated Learning-IDEAL,
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puglisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , I. Tiber- Springer,
          <year>2013</year>
          , pp.
          <fpage>611</fpage>
          -
          <lpage>618</lpage>
          . macine, et al., Exploiting robots as healthcare re- [50]
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shafait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Uchida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwamura</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. G.</surname>
          </string-name>
          <article-title>sources for epidemics management and support i Bigorda</article-title>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Mestre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Mota</surname>
          </string-name>
          , J. A. caregivers,
          <source>in: CEUR Workshop Proceedings</source>
          , vol- Almazan,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>De Las Heras</surname>
          </string-name>
          ,
          <source>Icdar 2013 robust ume 3686</source>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . reading competition, in: 2013 12th international
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. E.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , En- conference
          <article-title>on document analysis and recognition, hancing eeg signal reconstruction in cross-</article-title>
          <string-name>
            <surname>domain</surname>
            <given-names>IEEE</given-names>
          </string-name>
          ,
          <year>2013</year>
          , pp.
          <fpage>1484</fpage>
          -
          <lpage>1493</lpage>
          .
          <article-title>adaptation using cyclegan</article-title>
          , in: 2024 International [51]
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gomez-Bigorda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nicolaou</surname>
          </string-name>
          , Conference on Telecommunications and
          <string-name>
            <surname>Intelligent S. Ghosh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bagdanov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Iwamura</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Matas</surname>
          </string-name>
          ,
          <source>Systems (ICTIS)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . L.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>V. R.</given-names>
          </string-name>
          <string-name>
            <surname>Chandrasekhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
          </string-name>
          , et al.,
          <source>Icdar</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <year>2015</year>
          competition on robust reading, in: 2015 13th C.
          <article-title>-</article-title>
          <string-name>
            <surname>Y. Fu</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          <string-name>
            <surname>Berg</surname>
          </string-name>
          , Ssd: Single shot multibox de- international
          <source>conference on document analysis and tector</source>
          , in: Computer Vision-ECCV
          <year>2016</year>
          :
          <article-title>14th Eu- recognition (ICDAR)</article-title>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>1156</fpage>
          -
          <lpage>1160</lpage>
          . ropean Conference, Amsterdam, The Netherlands, [52]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          , Detecting texts Springer,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
          <article-title>of arbitrary orientations in natural images</article-title>
          ,
          <source>in: 2012</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Synthetic data IEEE conference on computer vision and pattern for text localisation in natural images</article-title>
          , in: Proceed- recognition, IEEE,
          <year>2012</year>
          , pp.
          <fpage>1083</fpage>
          -
          <lpage>1090</lpage>
          .
          <article-title>ings of the IEEE conference on computer vision</article-title>
          and [53]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nayef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yin</surname>
          </string-name>
          , I. Bizid,
          <string-name>
            <given-names>H.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Y. Feng, pattern recognition,
          <year>2016</year>
          , pp.
          <fpage>2315</fpage>
          -
          <lpage>2324</lpage>
          . D.
          <string-name>
            <surname>Karatzas</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Rigaud</surname>
          </string-name>
          , J. Chazalon,
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , Sin- et al.,
          <article-title>Icdar2017 robust reading challenge on multigle shot text detector with regional attention, in: lingual scene text detection and script identificationProceedings of the IEEE international conference rrc-mlt</article-title>
          ,
          <source>in: 14th IAPR international conference on on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3047</fpage>
          -
          <lpage>3055</lpage>
          .
          <article-title>document analysis and recognition (ICDAR)</article-title>
          , vol-
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          , Textboxes++
          <article-title>: A single-shot ume 1</article-title>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1454</fpage>
          -
          <lpage>1459</lpage>
          .
          <article-title>oriented scene text detector</article-title>
          , IEEE transactions on [54]
          <string-name>
            <surname>T. E. de Campos</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          <string-name>
            <surname>Babu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Varma</surname>
          </string-name>
          ,
          <source>Character image processing 27</source>
          (
          <year>2018</year>
          )
          <fpage>3676</fpage>
          -
          <lpage>3690</lpage>
          .
          <article-title>recognition in natural images</article-title>
          , in: International con-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>An end-to-end trainable neu- ference on computer vision theory and applications, ral network for image-based sequence recognition volume 1</article-title>
          , SCITEPRESS,
          <year>2009</year>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>280</lpage>
          .
          <article-title>and its application to scene text recognition</article-title>
          , IEEE [55]
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Ogier</surname>
          </string-name>
          , C.
          <article-title>- transactions on pattern analysis and machine intel- L. Liu, Realtime multi-scale scene text detection ligence 39 (</article-title>
          <year>2016</year>
          )
          <fpage>2298</fpage>
          -
          <lpage>2304</lpage>
          .
          <article-title>with scale-based region proposal network</article-title>
          , Pattern
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>R.</given-names>
            <surname>Minetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thome</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stolfi</surname>
          </string-name>
          ,
          <source>T- Recognition</source>
          <volume>98</volume>
          (
          <year>2020</year>
          )
          <article-title>107026. hog: An efective gradient-based descriptor for sin-</article-title>
          [56]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Real-time gle line text regions</article-title>
          ,
          <source>Pattern recognition 46</source>
          (
          <year>2013</year>
          )
          <article-title>scene text detection with diferentiable binarization</article-title>
          ,
          <fpage>1078</fpage>
          -
          <lpage>1090</lpage>
          . in
          <source>: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>11474</fpage>
          -
          <lpage>11481</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <article-title>Detecting and reading text in natural scenes</article-title>
          ,
          <source>in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , volume
          <volume>2</volume>
          , IEEE,
          <year>2004</year>
          , pp.
          <article-title>II-II.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          , G.-s. Xia,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Rotation-sensitive regression for oriented scene text detection</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>5909</fpage>
          -
          <lpage>5918</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soomro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Farooq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <article-title>Performance evaluation of advanced deep learning architectures for offline handwritten character recognition</article-title>
          ,
          <source>in: 2017 International Conference on Frontiers of Information Technology (FIT)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>362</fpage>
          -
          <lpage>367</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Newell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Griffin</surname>
          </string-name>
          ,
          <article-title>Multiscale histogram of oriented gradient descriptors for robust character recognition</article-title>
          ,
          <source>in: International conference on document analysis and recognition</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>1085</fpage>
          -
          <lpage>1089</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>S.</given-names>
            <surname>Arivazhagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rathina</surname>
          </string-name>
          ,
          <article-title>Recognition of handwritten characters using deep convolution neural network</article-title>
          .,
          <source>Journal of the National Science Foundation of Sri Lanka</source>
          <volume>49</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>