=Paper=
{{Paper
|id=Vol-3349/paper2
|storemode=property
|title=Text Detection Forgot About Document OCR
|pdfUrl=https://ceur-ws.org/Vol-3349/paper2.pdf
|volume=Vol-3349
|authors=Krzysztof Olejniczak,Milan Sulc
|dblpUrl=https://dblp.org/rec/conf/cvww/OlejniczakS23
}}
==Text Detection Forgot About Document OCR==
Krzysztof Olejniczak (1, 2), Milan Šulc (2)

(1) University of Oxford, United Kingdom; (2) Rossum.ai, Czech Republic

''26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023. The work was done while Krzysztof Olejniczak was an intern at Rossum. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).''

===Abstract===
Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not achieve 100% accuracy, requiring human corrections in applications where a correct readout is essential. Advances in machine learning have enabled even more challenging scenarios of text detection and recognition "in-the-wild", such as detecting text on objects in photographs of complex scenes. While the state-of-the-art methods for in-the-wild text recognition are typically evaluated on complex scenes, their performance in the domain of documents is usually not published, and a comprehensive comparison with methods for document OCR is missing. This paper compares several methods designed for in-the-wild text recognition and for document text recognition, and evaluates them on the domain of structured documents. The results suggest that state-of-the-art methods originally proposed for in-the-wild text detection also achieve competitive results on document text detection, outperforming the available OCR methods. We argue that the application to document OCR should not be omitted from the evaluation of text detection and recognition methods.

'''Keywords:''' Text Detection, Text Recognition, OCR, Optical Character Recognition, Text In The Wild

===1. Introduction===
Optical Character Recognition (OCR) is a classic problem in machine learning and computer vision with standard methods [1, 2] and surveys [3, 4, 5, 6] available. Recent advances in machine learning and its applications, such as autonomous driving, scene understanding or large-scale image retrieval, have shifted the attention of text recognition research towards the more challenging in-the-wild scenarios, with arbitrarily shaped and oriented instances of text appearing in complex scenes. Spotting text in-the-wild poses challenges such as extreme aspect ratios, curved or otherwise irregular text, complex backgrounds and clutter in the scenes. Recent methods [7, 8] achieve impressive results on challenging in-the-wild datasets like TotalText [9] or CTW-1500 [10], with F1-scores reaching 90% and 87%, respectively.

Although automated document processing remains one of the major applications of OCR, to the best of our knowledge, in-the-wild text detection models were never comprehensively evaluated on the domain of documents and compared with methods developed for document OCR. This paper reviews several recent text detection methods developed for the in-the-wild scenario [11, 12, 13, 7, 14, 8], evaluates their performance (out of the box and fine-tuned) on benchmark document datasets [15, 16, 17], and compares their scores against popular document OCR engines [18, 19, 2]. Additionally, we adopt publicly available text recognition models [20, 21] and combine them with the text detectors to perform two-stage end-to-end text recognition for a complete evaluation of text extraction.

===2. Related Work===

====2.1. Document OCR====
OCR engines designed for the "standard" application domain of documents range from open-source projects such as TesseractOCR [2] and PP-OCR [1] to commercial services, including AWS Textract [18] or Google Document AI [19]. Despite document OCR being a classic problem with many practical applications, studied for decades [22, 23], it still cannot be considered "solved" – even the best engines struggle to achieve perfect accuracy. The methodology behind the commercial cloud services is typically not disclosed. The most popular open-source OCR engine at the time of publication (based on the GitHub repository [24] statistics), Tesseract [2] (v4 and v5), uses a Long Short-Term Memory (LSTM) neural network as the default recognition engine.
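For readers unfamiliar with how such an engine is typically driven from code, the snippet below is a minimal sketch of running Tesseract on a document image through the pytesseract wrapper. The input file name and the output handling are illustrative assumptions and are not part of the evaluation setup of this paper.

```python
# Minimal sketch: running the open-source Tesseract engine on a document scan.
# Assumes the Tesseract binary plus the pytesseract and Pillow packages are
# installed; "invoice.png" is a hypothetical input file.
from PIL import Image
import pytesseract

image = Image.open("invoice.png")

# Plain-text readout of the whole page.
text = pytesseract.image_to_string(image)

# Word-level boxes and confidences, useful for detection-style inspection.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, conf, x, y, w, h in zip(data["text"], data["conf"],
                                  data["left"], data["top"],
                                  data["width"], data["height"]):
    if word.strip():
        print(f"{word!r} conf={conf} box=({x}, {y}, {w}, {h})")

print(text)
```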
''Figure 1: Comparison of detection results performed on documents from the FUNSD dataset (cropped).''

====2.2. In-the-wild Text Detection====

=====2.2.1. Regression-based Methods=====
Regression-based methods follow the object classification approach, reduced to a single-class problem. TextBoxes [25] and TextBoxes++ [26] locate text instances of various lengths by using sets of anchors with different aspect ratios. Various regression-based methods utilize an iterative refinement strategy, iteratively enhancing the quality of the detected boundaries. LOMO [27] uses an Iterative Refinement Module, which in every step regresses the coordinates of each corner of the predicted boundary with an attention mechanism. PCR [28] proposes a top-down approach, starting with predictions of the centres and sizes of text instances and iteratively improving the bounding boxes using its Contour Localization Mechanism. TextBPN++ [8] introduces an Iterative Boundary Deformation Module, utilizing a Transformer encoder with multi-head attention [29] and a multi-layer perceptron decoder to iteratively adjust the vertices of the detected instances. Instead of considering the vertices of the bounding boxes, DCLNet [12] predicts quadrilateral boundaries by locating the four lines restricting the corresponding area, representing them in a polar coordinate system.

To address the problem of arbitrarily-shaped text detection and to accurately model the boundaries of irregular text regions, more sophisticated bounding-box representations have been developed. ABCNet [30] adapts cubic Bezier curves to parametrize curved text instances, gaining the possibility of fitting non-polygonal shapes. FCENet [31] proposes the Fourier Contour Embedding method, predicting the Fourier signature vectors corresponding to the representation of the boundary in the Fourier domain, and uses them to generate the shape of the instance with the Inverse Fourier Transformation.

=====2.2.2. Segmentation-based Methods=====
Segmentation-based methods aim to classify each pixel as either text or non-text, and generate bounding boxes by post-processing the obtained pixel maps. The binary, deterministic nature of such a pixel classification problem may cause learning confusion on the borders of text instances. Numerous methods address this issue by predicting text kernels (central regions of instances) and appropriately gathering pixels around them. PSENet [32] predicts kernels of different sizes and forms bounding boxes by iteratively expanding their regions. PAN [14] generates pixel classification and kernel maps, linking each classified text pixel to the nearest kernel. CentripetalText [33] produces centripetal shift vectors that map pixels to the correct text centres. KPN [34] creates pixel embedding vectors, locates the central pixel of each instance, and retrieves the whole shape by measuring the similarity of the embedding vectors with a scalar product.

The vast majority of segmentation-based methods generate probability maps, representing how likely pixels are to be contained in some text region, and convert them into binary pixel maps using a certain binarization mechanism (e.g. thresholding). However, the thresholds are often determined empirically, and a careless choice may lead to a drastic decrease in performance. To solve this problem, DBNet [13] proposes a Differentiable Binarization equation, making the step between the probability and classification maps end-to-end trainable and therefore letting the network learn how to accurately binarize its predictions. DBNet++ [7] further improves on this baseline by extending the backbone network with an Adaptive Scale Fusion attention module, enhancing the upscaling process and obtaining deeper features. TextFuseNet [35] generates features on three different levels: global-, word- and character-level, and fuses them to gain relevant context and a deeper insight into the image structure. Instead of detecting words, CRAFT [11] locates text at the character level, predicting the areas covered by single letters, and links the characters of each instance with respect to the generated affinity map.
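As a concrete illustration of the differentiable binarization idea used by DBNet [13] and described above, the snippet below sketches the smooth approximation of thresholding in plain NumPy. The toy map values are invented for illustration, and the amplification factor k follows the value commonly reported for the method (k = 50); both should be treated as assumptions here, not as the configuration of the evaluated models.

```python
# Minimal sketch of DBNet-style differentiable binarization (see [13]).
# P is the predicted text probability map, T the learned threshold map;
# the toy values below and k = 50 are illustrative assumptions.
import numpy as np


def differentiable_binarization(P: np.ndarray, T: np.ndarray, k: float = 50.0) -> np.ndarray:
    """Approximate binary map B_hat = 1 / (1 + exp(-k * (P - T))).

    Unlike hard thresholding (P > T), this sigmoid-shaped step has a
    non-zero gradient, so the threshold map T can be learned end-to-end.
    """
    return 1.0 / (1.0 + np.exp(-k * (P - T)))


# Toy 2x3 probability map and a constant threshold map.
P = np.array([[0.9, 0.6, 0.1],
              [0.8, 0.4, 0.2]])
T = np.full_like(P, 0.5)

print(differentiable_binarization(P, T))   # values pushed close to 0 or 1
print((P > T).astype(float))               # hard, non-differentiable counterpart
```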
===3. Methods===

====3.1. Text Detection====
To cover a wide range of text detectors, we selected methods from Section 2.2 with different approaches: from the regression-based methods, we included TextBPN++ as a vertex-focused algorithm and DCLNet as an edge-focused approach. From the segmentation-based methods, we selected DBNet and DBNet++ as pure segmentation approaches and PAN as an approach linking text pixels to the corresponding kernels. Finally, CRAFT was chosen as a character-level method.

====3.2. Text Recognition====
The ultimate goal of text detection, especially in the case of document processing, is to recognize the text within the detected instances. Therefore, to evaluate the suitability of popular in-the-wild detectors for document OCR, we perform end-to-end measurements with the following text recognition engines: SAR [20], MASTER [36] and CRNN [21]. The open-source engines were combined with the detection methods in a two-stage manner: the input image was initially processed by a detector, which returned bounding boxes; afterwards, the corresponding cropped instances were passed to the recognition models. As a point of reference, we compare both the detection and the end-to-end recognition results of the selected methods with the predictions of three common engines for end-to-end document OCR: Tesseract [2], Google Document AI [19] and AWS Textract [18].
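The following sketch illustrates the two-stage evaluation pipeline described in Section 3.2: detect boxes first, then recognize each cropped instance. The Detector and Recognizer interfaces and the axis-aligned crop logic are hypothetical stand-ins for illustration and do not reproduce the actual MMOCR or docTR APIs.

```python
# Minimal sketch of the two-stage end-to-end pipeline from Section 3.2:
# a detector proposes boxes, each crop is passed to a recognition model.
# The Detector/Recognizer protocols are illustrative stand-ins only.
from typing import List, Protocol, Tuple

from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), axis-aligned for simplicity


class Detector(Protocol):
    def predict(self, image: Image.Image) -> List[Box]:
        """Return bounding boxes of detected text instances."""


class Recognizer(Protocol):
    def recognize(self, crop: Image.Image) -> str:
        """Return the transcription of a single cropped text instance."""


def end_to_end(image: Image.Image, detector: Detector, recognizer: Recognizer) -> List[Tuple[Box, str]]:
    """Run detection, crop each detected instance, and recognize its text."""
    results = []
    for box in detector.predict(image):
        crop = image.crop(box)            # cut out the detected instance
        text = recognizer.recognize(crop) # second stage: recognition on the crop
        results.append((box, text))
    return results
```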
====3.3. Metric====
To measure both the detection and the end-to-end performance, we used the CLEval [37] metric. Contrary to metrics such as Intersection over Union (IoU), which perceive text at the word level, CLEval measures precision and recall at the character level. As a consequence, it slightly reduces the penalty for splitting or merging problematic instances (e.g. dates), providing a reliable and intuitive comparison of the quality of detection and recognition. Additionally, the Recognition Score evaluated by CLEval, approximately corresponding to the precision of character recognition, informs about the quality of the recognition engine specifically on the detected bounding boxes.

===4. Experiments===

====4.1. Training Strategies====
DBNet [13], DBNet++ [7] and PAN [14] were fine-tuned for 100 epochs (600 epochs in the case of FUNSD) with a batch size of 8 and an initial learning rate of 0.0001, decreasing by a factor of 10 at the 60th and 80th epoch (200th and 400th for FUNSD). The baselines, pre-trained on SynthText [38] (DBNet, DBNet++) or ImageNet [39] (PAN), were downloaded from the MMOCR 0.6.2 Model Zoo [40]. DCLNet [12] was fine-tuned from a pre-trained model [41] on each dataset for 150 epochs with a batch size of 4 and an initial learning rate of 0.001, decaying to 0.0001. For each dataset, TextBPN++ [8] was fine-tuned from a pre-trained model [42] for 50 epochs with a batch size of 4, a learning rate of 0.0001 and data augmentations consisting of flipping, cropping and rotations. Since no training scripts are publicly available for CRAFT, we used the MLT model from the GitHub repository [43] without fine-tuning. All experiments were performed using the Adam optimizer with momentum 0.9, on a single GPU with 11 GB of VRAM (GeForce GTX 1080 Ti).
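For illustration, the snippet below shows how the learning-rate schedule stated above for DBNet, DBNet++ and PAN (initial rate 0.0001, decayed by a factor of 10 at epochs 60 and 80, Adam with momentum 0.9 as its first beta) could be expressed in PyTorch. The placeholder model and the omitted training step are assumptions, not the authors' actual MMOCR configuration.

```python
# Illustrative PyTorch setup matching the schedule in Section 4.1 for
# DBNet/DBNet++/PAN (non-FUNSD case): lr = 1e-4, decayed x0.1 at epochs 60 and 80.
# `model` and the commented-out training step are hypothetical placeholders.
import torch
from torch import nn
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 2)  # placeholder for the actual detection network

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

for epoch in range(100):
    # train_one_epoch(model, optimizer, data_loader)  # hypothetical training step
    scheduler.step()  # lr becomes 1e-5 after epoch 60 and 1e-6 after epoch 80
```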
====4.2. Detection Results====

{| class="wikitable"
|+ Table 1: Document datasets used in the experiments for text detection and recognition.
! Dataset !! Training images !! Test images !! Document types !! Language
|-
| FUNSD [15] || 149 || 50 || Distorted forms, surveys, reports || English
|-
| CORD [16] || 900 || 100 || Photos of Indonesian receipts || English
|-
| XFUND [17] || 1043 || 350 || Clean scanned forms || Multilingual
|}

{| class="wikitable"
|+ Table 2: Comparison of the detection performance of the chosen methods on the benchmark datasets, with respect to the CLEval metric. "P", "R" and "F1" denote precision, recall and F1-score, respectively.
! rowspan="2" | Method !! colspan="3" | FUNSD !! colspan="3" | CORD !! colspan="3" | XFUND
|-
! P !! R !! F1 !! P !! R !! F1 !! P !! R !! F1
|-
| PAN [14] || 96.25 || 70.57 || 81.44 || 98.92 || 97.33 || 98.12 || 96.96 || 77.90 || 86.39
|-
| DBNet [13] || 96.02 || 96.11 || 96.07 || 97.94 || 99.17 || 98.55 || 97.04 || 95.58 || 96.30
|-
| DBNet++ [7] || 97.45 || 97.40 || 97.42 || 97.58 || 99.60 || 98.58 || 97.87 || 97.93 || 97.90
|-
| TextBPN++ [8] || 96.63 || 95.59 || 96.11 || 98.65 || 99.74 || 99.19 || 97.88 || 94.29 || 96.05
|-
| DCLNet [12] || 94.16 || 95.35 || 94.75 || 98.67 || 97.91 || 98.29 || 98.22 || 98.17 || 98.20
|-
| CRAFT [11] || 97.84 || 95.72 || 96.77 || 94.25 || 88.46 || 91.26 || 89.75 || 93.02 || 91.36
|-
| Tesseract [2] || 80.13 || 73.80 || 76.83 || 76.46 || 47.38 || 58.51 || 85.84 || 87.47 || 86.65
|-
| Document AI [19] || 95.56 || 89.77 || 92.57 || 92.90 || 93.71 || 93.30 || 89.49 || 90.68 || 90.08
|-
| AWS Textract [18] || 97.50 || 95.89 || 96.69 || 80.60 || 84.79 || 82.64 || 97.64 || 88.14 || 92.65
|}

The results of the text detection methods selected in Section 3.1 on the datasets from Table 1 are presented in Table 2. On the FUNSD dataset, DBNet++ achieves both the highest detection recall (97.40%) and F1-score (97.42%). The highest precision, 97.84%, was scored by CRAFT. PAN performed the weakest of all considered in-the-wild algorithms, scoring just an 81.44% F1-score. Despite having achieved better results on FUNSD, the segmentation-based approaches were outperformed by the regression-based methods on CORD and XFUND. TextBPN++ proved to be the best-performing algorithm on CORD in terms of recall and F1-score, scoring 99.74% and 99.19%, respectively. DCLNet, for which the best precision on CORD (98.67%) was recorded, achieved superior results on XFUND, outperforming the remaining methods with respect to all three measures: precision 98.22%, recall 98.17% and F1-score 98.20%. Out of the considered popular engines for end-to-end document OCR, AWS Textract presented the best performance on the domain of scans of structured documents (FUNSD and XFUND), scoring 96.69% and 92.65% F1-score, respectively. Google Document AI generalized remarkably better to the distorted photos of receipts from the CORD dataset, achieving a 93.30% F1-score and surpassing the scores of AWS Textract and Tesseract.

The results show that in-the-wild detectors fine-tuned on document datasets can outperform popular OCR engines on the domain of structured documents in terms of the CLEval detection metric. However, the results for the predictions of pre-trained detectors may not be fully representative due to differences in splitting rules. For example, Document AI creates separate instances for special symbols such as brackets, leading to undesired splitting of words like "name(s)" into several fragments, lowering precision and recall. On all experimented datasets, all fine-tuned in-the-wild text detection models reached high prediction scores, proving themselves capable of handling text in structured documents. Qualitative analysis of the detectors' predictions revealed that the major sources of error were the incorrect splitting of long text fragments (e.g. e-mail addresses), the merging of instances in dense text regions, and missing short stand-alone text, such as single-digit numbers.

====4.3. Recognition Results====
The end-to-end text recognition results combining the fine-tuned in-the-wild detectors with the SAR [20] and MASTER [36] models from the MMOCR 0.6.2 Model Zoo [46], and CRNN [21] from docTR [45], are listed in Table 3. The XFUND dataset was skipped for this experiment since it contains Chinese and Japanese characters, for which the recognition models were not trained.
{| class="wikitable"
|+ Table 3: Comparison of the recognition performance of the chosen text detection methods combined with MMOCR's [44] SAR and MASTER default models, fine-tuned SAR, and docTR's [45] CRNN default model, on FUNSD and CORD, with respect to the CLEval metric. "P", "R", "F1" and "S" denote the end-to-end precision, recall, F1-score and Recognition Score, respectively.
! rowspan="2" | Recognition !! rowspan="2" | Detection !! colspan="4" | FUNSD !! colspan="4" | CORD
|-
! P !! R !! F1 !! S !! P !! R !! F1 !! S
|-
| rowspan="6" | SAR [20] (baseline) || PAN [14] || 76.14 || 74.17 || 75.14 || 79.79 || 82.04 || 84.27 || 83.14 || 84.76
|-
| DBNet [13] || 79.10 || 82.51 || 80.77 || 83.33 || 82.76 || 85.79 || 84.25 || 85.49
|-
| CRAFT [11] || 83.75 || 85.16 || 84.45 || 85.92 || 79.62 || 76.93 || 78.25 || 86.37
|-
| TextBPN++ [8] || 84.90 || 87.87 || 86.36 || 88.86 || 83.56 || 87.00 || 85.25 || 86.58
|-
| DBNet++ [7] || 80.04 || 83.53 || 81.75 || 82.85 || 82.95 || 86.66 || 84.76 || 85.89
|-
| DCLNet [12] || 77.67 || 82.27 || 79.91 || 81.80 || 82.75 || 85.53 || 84.11 || 86.16
|-
| rowspan="6" | SAR [20] (fine-tuned) || PAN [14] || 86.37 || 76.61 || 81.20 || 90.23 || 87.73 || 88.95 || 88.34 || 90.59
|-
| DBNet [13] || 87.48 || 88.07 || 87.77 || 91.90 || 91.12 || 94.00 || 92.54 || 94.02
|-
| CRAFT [11] || 88.14 || 86.48 || 87.30 || 90.39 || 84.98 || 79.19 || 81.99 || 91.53
|-
| TextBPN++ [8] || 88.12 || 88.32 || 88.22 || 92.16 || 91.46 || 96.21 || 93.77 || 94.77
|-
| DBNet++ [7] || 89.15 || 89.83 || 89.49 || 92.13 || 90.40 || 93.83 || 92.09 || 93.54
|-
| DCLNet [12] || 86.10 || 87.30 || 86.70 || 90.46 || 87.69 || 90.02 || 88.84 || 91.58
|-
| rowspan="6" | MASTER [36] || PAN [14] || 77.50 || 74.58 || 76.01 || 81.10 || 90.25 || 92.12 || 91.17 || 93.16
|-
| DBNet [13] || 80.30 || 83.11 || 81.68 || 84.55 || 91.94 || 94.31 || 93.11 || 94.62
|-
| CRAFT [11] || 82.06 || 82.90 || 82.48 || 84.22 || 85.81 || 81.86 || 83.79 || 92.93
|-
| TextBPN++ [8] || 82.10 || 83.93 || 83.00 || 85.96 || 91.77 || 94.79 || 93.26 || 94.78
|-
| DBNet++ [7] || 81.33 || 83.99 || 82.64 || 84.13 || 91.39 || 94.63 || 92.98 || 94.48
|-
| DCLNet [12] || 79.55 || 82.85 || 81.17 || 83.31 || 90.01 || 92.28 || 91.13 || 93.71
|-
| rowspan="6" | CRNN [21] || PAN [14] || 90.31 || 87.14 || 88.70 || 94.00 || 95.70 || 96.52 || 96.10 || 98.65
|-
| DBNet [13] || 89.07 || 91.56 || 90.30 || 93.24 || 96.00 || 97.51 || 96.75 || 98.67
|-
| CRAFT [11] || 91.20 || 91.67 || 91.43 || 93.40 || 93.12 || 87.25 || 90.09 || 98.73
|-
| TextBPN++ [8] || 89.94 || 91.80 || 90.86 || 93.86 || 95.35 || 97.71 || 96.52 || 98.48
|-
| DBNet++ [7] || 90.97 || 93.52 || 92.23 || 93.71 || 95.43 || 97.85 || 96.62 || 98.51
|-
| DCLNet [12] || 89.84 || 92.95 || 91.37 || 93.16 || 95.04 || 96.34 || 95.69 || 98.52
|-
| colspan="2" | Tesseract [2] || 73.84 || 73.84 || 69.09 || 88.48 || 73.96 || 44.33 || 55.43 || 93.55
|-
| colspan="2" | Google Document AI [19] || 90.83 || 92.03 || 91.42 || 94.80 || 88.06 || 90.97 || 89.49 || 98.61
|-
| colspan="2" | AWS Textract [18] || 93.61 || 95.46 || 94.53 || 95.78 || 84.53 || 82.13 || 83.32 || 96.63
|}

On FUNSD, the end-to-end measurement outcomes followed the patterns from detection: equipped with CRNN as the recognition engine, DBNet++ proved to be the best fine-tuned model in terms of CLEval end-to-end recall (93.52%) and F1-score (92.23%), losing only to CRAFT in terms of precision. A much higher F1-score (+2%) was measured for AWS Textract, whose end-to-end results outperformed all of the considered algorithms. It is important to note that the Recognition Score for AWS Textract reached almost 96%, surpassing CRNN's scores by about 2%. This suggests that the recognition engine used in AWS Textract, performing much more accurately on FUNSD than the CRNN model, may have been a crucial reason for the good results. When evaluated on CORD, the models with Differentiable Binarization scored the highest marks in all end-to-end measures: recall (DBNet++), precision and F1-score (DBNet), significantly surpassing the remaining methods. Interestingly, despite obtaining the best recall, DBNet++ did not beat the simpler DBNet in terms of end-to-end F1-score. The predictions of the regression-based approaches, better than the segmentation-based ones when pure detection scores were measured, appeared to combine slightly worse with CRNN. TextBPN++, however, remained competitive, achieving results similar to DBNet and DBNet++. The Recognition Scores of CRNN, regardless of the choice of in-the-wild detector, exceeded 93% on FUNSD and 98.5% on CORD, once again demonstrating the suitability of applying these algorithms to document text recognition.

The SAR model, not specifically trained on documents, presented poorer performance: the highest measured F1-scores on FUNSD and CORD were 86.36% and 85.25%, respectively, both obtained in combination with TextBPN++. Fine-tuned SAR models achieved slightly higher F1-scores, reaching 89.49% on FUNSD (with DBNet++ as the detector) and 93.77% on CORD (combined with TextBPN++ detections). Despite gaining a noticeable advantage over the baseline, the fine-tuned SAR models did not surpass the performance of the pre-trained CRNN. Similarly to SAR, the pre-trained MASTER model [46] worked best in combination with TextBPN++, achieving an F1-score of 83.00% on FUNSD and 93.26% on CORD.

===5. Conclusions===
Text detection research has witnessed great progress in recent years, thanks to advancements in deep machine learning. The recently introduced methods have widened the range of possible applications of text detectors, making them viable for in-the-wild text spotting. This shifted the attention towards more complex scenarios, including arbitrarily-shaped text or instances with non-orthogonal orientations. With automated document processing remaining one of the most relevant commercial OCR applications, we stress the importance of determining whether the state-of-the-art methods for scene text spotting can also improve document OCR. Our experiments prove that detectors designed for in-the-wild text spotting can indeed be applied to documents with great success.
In particular, fine-tuning models such as DBNet++ or TextBPN++ yielded over 96% detection F1-score on FUNSD, over 98% on CORD and over 96% on XFUND with respect to the CLEval metric, outperforming Google Document AI and AWS Textract. Moreover, combining these detectors with a publicly available CRNN recognition model in a two-stage manner consistently achieves over 90% CLEval end-to-end F1-score, even without explicit fine-tuning of CRNN. We hope the results will bring more attention to evaluating future text detection methods not only in the text-in-the-wild scenario, but also on the domain of documents.

===Acknowledgement===
We thank Bohumír Zámečník, an expert on OCR systems, for helping to supervise Krzysztof's internship project.

===References===
[1] Y. Du, C. Li, R. Guo, X. Yin, W. Liu, J. Zhou, Y. Bai, Z. Yu, Y. Yang, Q. Dang, H. Wang, PP-OCR: A practical ultra lightweight OCR system, CoRR abs/2009.09941 (2020). https://arxiv.org/abs/2009.09941.

[2] A. Kay, Tesseract: An open-source optical character recognition engine, Linux Journal 2007 (2007) 2.

[3] K. Hamad, K. Mehmet, A detailed analysis of optical character recognition technology, International Journal of Applied Mathematics Electronics and Computers (2016) 244–249.

[4] T. Hegghammer, OCR with Tesseract, Amazon Textract, and Google Document AI: A benchmarking experiment, 2021. osf.io/preprints/socarxiv/6zfvs, doi:10.31235/osf.io/6zfvs.

[5] N. Islam, Z. Islam, N. Noor, A survey on optical character recognition system, arXiv preprint arXiv:1710.05703 (2017).

[6] J. Memon, M. Sami, R. A. Khan, M. Uddin, Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR), IEEE Access 8 (2020) 142642–142668.

[7] M. Liao, Z. Zou, Z. Wan, C. Yao, X. Bai, Real-time scene text detection with differentiable binarization and adaptive scale fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).

[8] S. Zhang, X. Zhu, C. Yang, H. Wang, X. Yin, Adaptive boundary proposal network for arbitrary shape text detection, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, October 10-17, 2021, IEEE, 2021, pp. 1285–1294.

[9] C. K. Ch'ng, C. S. Chan, C. Liu, Total-Text: Towards orientation robustness in scene text detection, International Journal on Document Analysis and Recognition (IJDAR) 23 (2020) 31–52. doi:10.1007/s10032-019-00334-z.

[10] Y. Liu, L. Jin, S. Zhang, C. Luo, S. Zhang, Curved scene text detection via transverse and longitudinal sequence connection, Pattern Recognition 90 (2019) 337–345. doi:10.1016/j.patcog.2019.02.002.

[11] Y. Baek, B. Lee, D. Han, S. Yun, H. Lee, Character region awareness for text detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9365–9374.

[12] Y. Bi, Z. Hu, Disentangled contour learning for quadrilateral text detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 909–918.

[13] M. Liao, Z. Wan, C. Yao, K. Chen, X. Bai, Real-time scene text detection with differentiable binarization, in: Proc. AAAI, 2020.

[14] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, C. Shen, Efficient and accurate arbitrary-shaped text detection with pixel aggregation network, CoRR abs/1908.05900 (2019). http://arxiv.org/abs/1908.05900.

[15] G. Jaume, H. K. Ekenel, J.-P. Thiran, FUNSD: A dataset for form understanding in noisy scanned documents, in: ICDAR-OST, 2019.

[16] S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, H. Lee, CORD: A consolidated receipt dataset for post-OCR parsing, 2019.

[17] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, F. Wei, XFUND: A benchmark dataset for multilingual visually rich form understanding, in: Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3214–3224. https://aclanthology.org/2022.findings-acl.253, doi:10.18653/v1/2022.findings-acl.253.

[18] Amazon, Amazon Textract, https://aws.amazon.com/textract, 2022. Accessed: 2022-09-25.
[19] Google, Google Cloud Document AI, https://cloud.google.com/document-ai, 2022. Accessed: 2022-09-25.

[20] H. Li, P. Wang, C. Shen, G. Zhang, Show, attend and read: A simple and strong baseline for irregular text recognition, CoRR abs/1811.00751 (2018). http://arxiv.org/abs/1811.00751.

[21] B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, CoRR abs/1507.05717 (2015). http://arxiv.org/abs/1507.05717.

[22] S. Mori, H. Nishida, H. Yamada, Optical character recognition, John Wiley & Sons, Inc., 1999.

[23] H. F. Schantz, The history of OCR, optical character recognition, Manchester Center, VT: Recognition Technologies Users Association, 1982.

[24] S. W. et al., Tesseract open source OCR engine (main repository), https://github.com/tesseract-ocr/tesseract, 2022. Accessed: 2022-10-14.

[25] M. Liao, B. Shi, X. Bai, X. Wang, W. Liu, TextBoxes: A fast text detector with a single deep neural network, in: AAAI, 2017.

[26] M. Liao, B. Shi, X. Bai, TextBoxes++: A single-shot oriented scene text detector, IEEE Transactions on Image Processing 27 (2018) 3676–3690. doi:10.1109/TIP.2018.2825107.

[27] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, X. Ding, Look more than once: An accurate detector for text of arbitrary shapes, CoRR abs/1904.06535 (2019). http://arxiv.org/abs/1904.06535.

[28] P. Dai, S. Zhang, H. Zhang, X. Cao, Progressive contour regression for arbitrary-shape scene text detection, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 7389–7398. doi:10.1109/CVPR46437.2021.00731.

[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). http://arxiv.org/abs/1706.03762.

[30] Y. Liu, H. Chen, C. Shen, T. He, L. Jin, L. Wang, ABCNet: Real-time scene text spotting with adaptive Bezier-curve network, CoRR abs/2002.10200 (2020). https://arxiv.org/abs/2002.10200.

[31] Y. Zhu, J. Chen, L. Liang, Z. Kuang, L. Jin, W. Zhang, Fourier contour embedding for arbitrary-shaped text detection, in: CVPR, 2021.

[32] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, S. Shao, Shape robust text detection with progressive scale expansion network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9336–9345.

[33] T. Sheng, J. Chen, Z. Lian, CentripetalText: An efficient text instance representation for scene text detection, in: Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

[34] S.-X. Zhang, X. Zhu, J.-B. Hou, C. Yang, X.-C. Yin, Kernel proposal network for arbitrary shape text detection, 2022. https://arxiv.org/abs/2203.06410, doi:10.48550/ARXIV.2203.06410.

[35] J. Ye, Z. Chen, J. Liu, B. Du, TextFuseNet: Scene text detection with richer fused features, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), 2020, pp. 516–522.

[36] N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, MASTER: Multi-aspect non-local network for scene text recognition, CoRR abs/1910.02562 (2019). http://arxiv.org/abs/1910.02562.

[37] Y. Baek, D. Nam, S. Park, J. Lee, S. Shin, J. Baek, C. Y. Lee, H. Lee, CLEval: Character-level evaluation for text detection and recognition tasks, CoRR abs/2006.06244 (2020). https://arxiv.org/abs/2006.06244.

[38] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.

[40] Z. Kuang, H. Sun, Z. Li, X. Yue, T. H. Lin, J. Chen, H. Wei, Y. Zhu, T. Gao, W. Zhang, K. Chen, W. Zhang, D. Lin, Text detection models - MMOCR 0.6.2 documentation, https://mmocr.readthedocs.io/en/latest/textdet_models.html, 2022. Accessed: 2022-10-14.

[41] Y. Bi, Z. Hu, PyTorch implementation of DCLNet "Disentangled contour learning for quadrilateral text detection", https://github.com/SakuraRiven/DCLNet, 2021. Accessed: 2022-10-13.

[42] S. Zhang, X. Zhu, C. Yang, H. Wang, X. Yin, Arbitrary shape text detection via boundary transformer, https://github.com/GXYM/TextBPN-Plus-Plus, 2022. Accessed: 2022-09-29.

[43] Y. Baek, B. Lee, D. Han, S. Yun, H. Lee, Official implementation of Character Region Awareness for Text Detection (CRAFT), https://github.com/clovaai/CRAFT-pytorch, 2019. Accessed: 2022-10-13.

[44] Z. Kuang, H. Sun, Z. Li, X. Yue, T. H. Lin, J. Chen, H. Wei, Y. Zhu, T. Gao, W. Zhang, K. Chen, W. Zhang, D. Lin, MMOCR: A comprehensive toolbox for text detection, recognition and understanding, arXiv preprint arXiv:2108.06543 (2021).

[45] Mindee, docTR: Document text recognition, https://github.com/mindee/doctr, 2021.

[46] Z. Kuang, H. Sun, Z. Li, X. Yue, T. H. Lin, J. Chen, H. Wei, Y. Zhu, T. Gao, W. Zhang, K. Chen, W. Zhang, D. Lin, Text recognition models - MMOCR 0.6.2 documentation, https://mmocr.readthedocs.io/en/latest/textrecog_models.html, 2021. Accessed: 2022-10-14.