Text Detection Forgot About Document OCR
Krzysztof Olejniczak¹,², Milan Šulc²
¹ University of Oxford, United Kingdom
² Rossum.ai, Czech Republic


Abstract

Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not achieve 100% accuracy, requiring human corrections in applications where a correct readout is essential. Advances in machine learning have enabled even more challenging scenarios of text detection and recognition "in-the-wild" – such as detecting text on objects from photographs of complex scenes. While the state-of-the-art methods for in-the-wild text recognition are typically evaluated on complex scenes, their performance in the domain of documents is usually not published, and a comprehensive comparison with methods for document OCR is missing. This paper compares several methods designed for in-the-wild text recognition and for document text recognition, and provides their evaluation on the domain of structured documents. The results suggest that state-of-the-art methods originally proposed for in-the-wild text detection also achieve competitive results on document text detection, outperforming available OCR methods. We argue that the application of document OCR should not be omitted in the evaluation of text detection and recognition methods.

Keywords
Text Detection, Text Recognition, OCR, Optical Character Recognition, Text In The Wild



26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023.
The work was done while Krzysztof Olejniczak was an intern at Rossum.

1. Introduction

Optical Character Recognition (OCR) is a classic problem in machine learning and computer vision with standard methods [1, 2] and surveys [3, 4, 5, 6] available. Recent advances in machine learning and its applications, such as autonomous driving, scene understanding or large-scale image retrieval, have shifted the attention of Text Recognition research towards the more challenging in-the-wild text scenarios, with arbitrarily shaped and oriented instances of text appearing in complex scenes. Spotting text in-the-wild poses challenges such as extreme aspect ratios, curved or otherwise irregular text, complex backgrounds and clutter in the scenes. Recent methods [7, 8] achieve impressive results on challenging text-in-the-wild datasets like TotalText [9] or CTW-1500 [10], with F1-scores reaching 90% and 87%, respectively. Although automated document processing remains one of the major applications of OCR, to the best of our knowledge, the results of in-the-wild text detection models have never been comprehensively evaluated on the domain of documents and compared with methods developed for document OCR. This paper reviews several recent Text Detection methods developed for the in-the-wild scenario [11, 12, 13, 7, 14, 8], evaluates their performance (out of the box and fine-tuned) on benchmark document datasets [15, 16, 17], and compares their scores against popular Document OCR engines [18, 19, 2]. Additionally, we adopt publicly available Text Recognition models [20, 21] and combine them with Text Detectors to perform two-stage end-to-end text recognition for a complete evaluation of text extraction.


2. Related Work

2.1. Document OCR

OCR engines designed for the "standard" application domain of documents range from open-source projects such as TesseractOCR [2] and PP-OCR [1] to commercial services, including AWS Textract [18] or Google Document AI [19]. Despite Document OCR being a classic problem with many practical applications, studied for decades [22, 23], it still cannot be considered 'solved' – even the best engines struggle to achieve perfect accuracy. The methodology behind the commercial cloud services is typically not disclosed. The most popular open-source OCR engine at the time of publication (based on GitHub repository [24] statistics), Tesseract [2] (v4 and v5), uses a Long Short-Term Memory (LSTM) neural network as the default recognition engine.







Figure 1: Comparison of detection results on documents from the FUNSD dataset (cropped).



2.2. In-the-wild Text Detection

2.2.1. Regression-based Methods

Regression-based methods follow the object classification approach, reduced to a single-class problem. TextBoxes [25] and TextBoxes++ [26] locate text instances of various lengths by using sets of anchors with different aspect ratios. Several regression-based methods utilize an iterative refinement strategy, iteratively enhancing the quality of the detected boundaries. LOMO [27] uses an Iterative Refinement Module, which in every step regresses the coordinates of each corner of the predicted boundary with an attention mechanism. PCR [28] proposes a top-down approach, starting with predictions of the centres and sizes of text instances and iteratively improving the bounding boxes using its Contour Localisation Mechanism. TextBPN++ [8] introduces an Iterative Boundary Deformation Module, utilizing a Transformer-block encoder with multi-head attention [29] and a multi-layer perceptron decoder, to iteratively adjust the vertices of detected instances. Instead of considering the vertices of the bounding boxes, DCLNet [12] predicts quadrilateral boundaries by locating the four lines restricting the corresponding area, representing them in a polar coordinate system. To address the problem of arbitrarily-shaped text detection and accurately model the boundaries of irregular text regions, more sophisticated bounding box representations have been developed. ABCNet [30] adapts cubic Bezier curves to parametrize curved text instances, gaining the ability to fit non-polygonal shapes. FCENet [31] proposes the Fourier Contour Embedding method, predicting the Fourier signature vectors corresponding to the representation of the boundary in the Fourier domain, and uses them to generate the shape of the instance with the Inverse Fourier Transform.

2.2.2. Segmentation-based Methods

Segmentation-based methods aim to classify each pixel as either text or non-text, and generate bounding boxes by post-processing the resulting pixel maps. The binary, deterministic nature of such a pixel classification problem may cause learning confusion at the borders of text instances. Numerous methods address this issue by predicting text kernels (central regions of instances) and appropriately gathering pixels around them. PSENet [32] predicts kernels of different sizes and forms bounding boxes by iteratively expanding their regions. PAN [14] generates pixel classification and kernel maps, linking each classified text pixel to the nearest kernel. CentripetalText [33] produces centripetal shift vectors that map pixels to the correct text centres. KPN [34] creates pixel embedding vectors, locates the central pixel of each instance, and retrieves the whole shape by measuring the similarity of embedding vectors with a scalar product. The vast majority of segmentation-based methods generate probability maps, representing how likely pixels are to be contained in some text region, and convert them into binary pixel maps with a binarization mechanism (e.g. thresholding). However, the thresholds are often determined empirically, and a careless choice may lead to a drastic decrease in performance. To solve this problem, DBNet [13] proposes a Differentiable Binarization equation, making the step between probability and classification maps end-to-end trainable and therefore letting the network learn how to accurately binarise predictions. DBNet++ [7] further improves on this baseline by extending the backbone network with an Adaptive Scale Fusion attention module, enhancing the upscaling process and obtaining deeper features. TextFuseNet [35] generates features on three different levels: global-, word- and character-level, and fuses them to gain relevant context and deeper insight into the image structure. Instead of detecting words, CRAFT [11] locates text at the character level, predicting the areas covered by single letters, and links the characters of each instance with respect to the generated affinity map.
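To make the differentiable binarization idea concrete, the following minimal sketch (in PyTorch; our illustration, not the authors' implementation) shows the approximate binarization used by DBNet, where a steep sigmoid with steepness factor k replaces hard thresholding so that both the probability map P and the threshold map T remain trainable:

    import torch

    def differentiable_binarization(prob_map: torch.Tensor,
                                    thresh_map: torch.Tensor,
                                    k: float = 50.0) -> torch.Tensor:
        """Approximate binary map B = 1 / (1 + exp(-k * (P - T))).

        The steep sigmoid behaves almost like hard thresholding, but keeps
        gradients flowing through both the probability and threshold maps.
        """
        return torch.sigmoid(k * (prob_map - thresh_map))

    # Toy usage: pixels well above the predicted threshold approach 1,
    # pixels below it approach 0, yet the mapping stays differentiable.
    P = torch.tensor([[0.9, 0.4], [0.2, 0.7]])
    T = torch.full_like(P, 0.5)
    print(differentiable_binarization(P, T))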






3. Methods

3.1. Text Detection

To cover a wide range of text detectors, we selected methods from Section 2.2 with different approaches: from regression-based methods, we included TextBPN++ as a vertex-focused algorithm and DCLNet as an edge-focused approach. From segmentation-based methods, we selected DBNet and DBNet++ as pure segmentation approaches and PAN as an approach linking text pixels to the corresponding kernels. Finally, CRAFT was chosen as a character-level method.

3.2. Text Recognition

The ultimate goal of text detection, especially in the case of document processing, is to recognize the text within the detected instances. Therefore, to evaluate the suitability of popular in-the-wild detectors for document OCR, we perform end-to-end measurements with the following text recognition engines: SAR [20], MASTER [36] and CRNN [21]. The open-source engines were combined with the detection methods in a two-stage manner: the input image was initially processed by a detector, which returned bounding boxes; afterwards, the corresponding cropped instances were passed to the recognition models (a minimal sketch of this pipeline is shown below). As a point of reference, we compare both the detection and end-to-end recognition results of the selected methods with the predictions of three common engines for end-to-end document OCR: Tesseract [2], Google Document AI [19] and AWS Textract [18].
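The sketch below illustrates this two-stage setup. It is only schematic: detect_boxes and recognize_crop are hypothetical stand-ins for the detector and recognizer calls (e.g. MMOCR or docTR models), not their actual APIs.

    from typing import Callable, List, Tuple
    from PIL import Image

    Box = Tuple[int, int, int, int]  # (left, top, right, bottom); axis-aligned for simplicity

    def two_stage_ocr(image: Image.Image,
                      detect_boxes: Callable[[Image.Image], List[Box]],
                      recognize_crop: Callable[[Image.Image], str]) -> List[Tuple[Box, str]]:
        """Detect text instances, crop them, and transcribe each crop."""
        results = []
        for box in detect_boxes(image):      # stage 1: text detection
            crop = image.crop(box)           # cut out the predicted instance
            text = recognize_crop(crop)      # stage 2: text recognition
            results.append((box, text))
        return results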
3.3. Metric

To measure both detection and end-to-end performance, we used the CLEval [37] metric. Contrary to metrics such as Intersection over Union (IoU), which perceive text at the word level, CLEval measures precision and recall at the character level. As a consequence, it slightly reduces the penalty for splitting or merging problematic instances (e.g. dates), providing a reliable and intuitive comparison of the quality of detection and recognition. Additionally, the Recognition Score evaluated by CLEval, approximately corresponding to the precision of character recognition, indicates the quality of the recognition engine specifically on the detected bounding boxes.
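As a toy illustration (not the actual CLEval algorithm, which matches pseudo-character centres), the snippet below shows why character-level counting is forgiving when one ground-truth instance, e.g. a date, is split across several predicted boxes:

    def char_level_scores(gt_text: str, predicted_fragments: list) -> tuple:
        """Toy character-level precision/recall via greedy in-order matching."""
        matched, gt_pos = 0, 0
        for fragment in predicted_fragments:
            for ch in fragment:
                idx = gt_text.find(ch, gt_pos)
                if idx != -1:
                    matched += 1
                    gt_pos = idx + 1
        total_pred = sum(len(f) for f in predicted_fragments)
        precision = matched / total_pred if total_pred else 0.0
        recall = matched / len(gt_text) if gt_text else 0.0
        return precision, recall

    # A date split into two detections still scores perfectly at character level,
    # whereas strict word-level matching could penalize it as two wrong boxes.
    print(char_level_scores("01/02/2023", ["01/02/", "2023"]))  # -> (1.0, 1.0)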
4. Experiments

4.1. Training Strategies

DBNet [13], DBNet++ [7] and PAN [14] were fine-tuned for 100 epochs (600 epochs in the case of FUNSD) with a batch size of 8 and an initial learning rate of 0.0001, decreased by a factor of 10 at the 60th and 80th epoch (200th and 400th for FUNSD). Baselines, pre-trained on SynthText [38] (DBNet, DBNet++) or ImageNet [39] (PAN), were downloaded from the MMOCR 0.6.2 Model Zoo [40]. DCLNet [12] was fine-tuned from a pre-trained model [41] on each dataset for 150 epochs with a batch size of 4 and an initial learning rate of 0.001, decaying to 0.0001. For each dataset, TextBPN++ [8] was fine-tuned from a pre-trained model [42] for 50 epochs with a batch size of 4, a learning rate of 0.0001 and data augmentations consisting of flipping, cropping and rotations. Since no training scripts are publicly available for CRAFT, we used the MLT model from the GitHub repository [43] without fine-tuning in our experiments. All experiments were performed using the Adam optimizer with momentum 0.9, on a single GPU with 11 GB of VRAM (GeForce GTX-1080Ti). A sketch of the main fine-tuning schedule is shown below.
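For illustration, the schedule described above maps onto standard PyTorch components as in the sketch below; the model is a placeholder and the training loop is elided, so this is a configuration sketch under our assumptions, not the actual training script.

    import torch
    from torch.optim import Adam
    from torch.optim.lr_scheduler import MultiStepLR

    model = torch.nn.Linear(8, 2)  # placeholder for a DBNet / DBNet++ / PAN detector

    # Initial LR 1e-4 (Adam, beta1 = 0.9), decayed by 10x at epochs 60 and 80
    # (200 and 400 for the 600-epoch FUNSD runs), batch size 8.
    optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scheduler = MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

    for epoch in range(100):
        # ... one pass over the training set would go here ...
        optimizer.step()    # placeholder parameter update
        scheduler.step()    # decays the learning rate at the milestone epochs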
4.2. Detection Results

Results of the text detection methods selected in Section 3.1 on the datasets from Table 1 are presented in Table 2. On the FUNSD dataset, DBNet++ achieves both the highest detection recall (97.40%) and F1-score (97.42%). The highest precision, 97.84%, was scored by CRAFT. PAN performed the weakest of all the considered in-the-wild algorithms, scoring just an 81.44% F1-score. Despite having achieved better results on FUNSD, segmentation-based approaches were outperformed by regression-based methods on CORD and XFUND. TextBPN++ proved to be the best-performing algorithm on CORD in terms of recall and F1-score, scoring 99.74% and 99.19%, respectively. DCLNet, which recorded the best precision on CORD (98.67%), achieved superior results on XFUND, outperforming the remaining methods with respect to all three measures: precision - 98.22%, recall - 98.17% and F1-score - 98.20%. Of the considered popular engines for end-to-end document OCR, AWS Textract presented the best performance on the domain of scans of structured documents – FUNSD and XFUND – scoring 96.69% and 92.65% F1-score, respectively. Google Document AI generalized remarkably better to the distorted photos of receipts from the CORD dataset, achieving a 93.30% F1-score and surpassing the scores of AWS Textract and Tesseract. The results show that in-the-wild detectors fine-tuned on document datasets can outperform popular OCR engines on the domain of structured documents in terms of the CLEval detection metric. However, the results for the predictions of pre-trained detectors may not be fully representative due to differences in splitting rules.







Table 1
Document datasets used in the experiments for text detection and recognition.
               Dataset      Training images Test images                    Document Types              Language
             FUNSD [15]            149               50              Distorted forms, surveys, reports  English
             CORD [16]             900               100              Photos of Indonesian receipts     English
             XFUND [17]            1043              350                   Clean scanned forms         Multilingual


Table 2
Comparison of the detection performance of the chosen methods on benchmark datasets, with respect to the CLEval metric.
"P", "R" and "F1" represent the precision, recall and F1-score, respectively.

                                          FUNSD                            CORD                       XFUND
               Method
                                   P        R        F1              P        R       F1        P        R       F1
               PAN [14]          96.25    70.57     81.44    98.92         97.33     98.12    96.96    77.90    86.39
              DBNet [13]         96.02    96.11     96.07    97.94         99.17     98.55    97.04    95.58    96.30
             DBNet++ [7]         97.45    97.40     97.42    97.58         99.60     98.58    97.87    97.93    97.90
            TextBPN++ [8]        96.63    95.59     96.11    98.65         99.74     99.19    97.88    94.29    96.05
             DCLNet [12]         94.16    95.35     94.75    98.67         97.91     98.29    98.22    98.17    98.20
             CRAFT [11]          97.84    95.72     96.77    94.25         88.46     91.26    89.75    93.02    91.36
             Tesseract [2]       80.13    73.80     76.83    76.46         47.38     58.51    85.84    87.47    86.65
           Document AI [19]      95.56    89.77     92.57    92.90         93.71     93.30    89.49    90.68    90.08
           AWS Textract [18]     97.50    95.89     96.69    80.60         84.79     82.64    97.64    88.14    92.65


For example, Document AI creates separate instances for special symbols, e.g. brackets, leading to undesired splitting of words like "name(s)" into several fragments, lowering precision and recall. On all of the datasets used in our experiments, all fine-tuned in-the-wild text detection models reached high prediction scores, proving themselves capable of handling text in structured documents. Qualitative analysis of the detectors' predictions revealed that the major sources of error were incorrect splitting of long text fragments (e.g. e-mail addresses), merging of instances in dense text regions, and missing short stand-alone text, such as single-digit numbers.

4.3. Recognition Results

End-to-end text recognition results combining fine-tuned in-the-wild detectors with the SAR [20] and MASTER [36] models from the MMOCR 0.6.2 Model Zoo [46], and CRNN [21] from docTR [45], are listed in Table 3. The XFUND dataset was skipped for this experiment since it contains Chinese and Japanese characters, for which the recognition models were not trained. On FUNSD, the end-to-end measurement outcomes followed the patterns from detection: equipped with CRNN as the recognition engine, DBNet++ proved to be the best fine-tuned model in terms of CLEval end-to-end recall (93.52%) and F1-score (92.23%), losing only to CRAFT in terms of precision. A much higher F1-score (+2%) was measured for AWS Textract, whose end-to-end results outperformed all of the considered algorithms. It is important to note that the Recognition Score for AWS Textract reached almost 96%, surpassing CRNN's scores by about 2%. This suggests that the recognition engine used in AWS Textract, performing much more accurately on FUNSD than the CRNN model, may have been a crucial reason for the good results. When evaluated on CORD, the models with Differentiable Binarization scored the highest marks in all end-to-end measures: recall (DBNet++), precision and F1-score (DBNet), significantly surpassing the remaining methods. Interestingly, despite obtaining the best recall, DBNet++ did not beat the simpler DBNet in terms of end-to-end F1-score. The predictions of regression-based approaches, better than segmentation-based ones when pure detection scores were measured, appeared to combine slightly worse with CRNN. TextBPN++, however, remained competitive, achieving results similar to DBNet and DBNet++. The Recognition Scores of CRNN, regardless of the choice of in-the-wild detector, exceeded 93% on FUNSD and 98.5% on CORD, once again demonstrating the suitability of applying these algorithms to document text recognition. The SAR model, not specifically trained on documents, presented poorer performance: the highest measured F1-scores on FUNSD and CORD were 86.36% and 85.25%, respectively, both obtained in combination with TextBPN++. Fine-tuned SAR models achieved slightly higher F1-scores, reaching 89.49% on FUNSD (with DBNet++ as the detector) and 93.77% on CORD (combined with TextBPN++ detections). Despite gaining a noticeable advantage over the baseline, the fine-tuned SAR models did not surpass the performance of the pre-trained CRNN. Similarly to SAR, the pre-trained MASTER model [46] worked best in combination with TextBPN++, achieving an F1-score of 83.00% on FUNSD and 93.26% on CORD.






Table 3
Comparison of the recognition performance of the chosen text detection methods combined with MMOCR’s [44] SAR and
MASTER default models, fine-tuned SAR, and docTR’s [45] CRNN default model, on FUNSD and CORD, with respect to the
CLEval metric. "P", "R", "F1" and "S" represent the end-to-end precision, recall, F1-score and Recognition Score, respectively.

                                                          FUNSD                                    CORD
        Recognition         Detection
                                               P         R         F1       S        P         R        F1        S
                             PAN [14]        76.14    74.17       75.14   79.79    82.04    84.27     83.14    84.76
                           DBNet [13]        79.10    82.51       80.77   83.33    82.76    85.79     84.25    85.49
           SAR [20]        CRAFT [11]        83.75    85.16       84.45   85.92    79.62    76.93     78.25    86.37
          (baseline)     TextBPN++ [8]       84.90    87.87       86.36   88.86    83.56    87.00     85.25    86.58
                          DBNet++ [7]        80.04    83.53       81.75   82.85    82.95    86.66     84.76    85.89
                          DCLNet [12]        77.67    82.27       79.91   81.80    82.75    85.53     84.11    86.16
                             PAN [14]        86.37    76.61       81.20   90.23    87.73    88.95     88.34    90.59
                           DBNet [13]        87.48    88.07       87.77   91.90    91.12    94.00     92.54    94.02
           SAR [20]        CRAFT [11]        88.14    86.48       87.30   90.39    84.98    79.19     81.99    91.53
        (fine-tuned)     TextBPN++ [8]       88.12    88.32       88.22   92.16    91.46    96.21     93.77    94.77
                          DBNet++ [7]        89.15    89.83       89.49   92.13    90.40    93.83     92.09    93.54
                          DCLNet [12]        86.10    87.30       86.70   90.46    87.69    90.02     88.84    91.58
                             PAN [14]        77.50    74.58       76.01   81.10    90.25    92.12     91.17    93.16
                           DBNet [13]        80.30    83.11       81.68   84.55    91.94    94.31     93.11    94.62
                           CRAFT [11]        82.06    82.90       82.48   84.22    85.81    81.86     83.79    92.93
        MASTER [36]
                         TextBPN++ [8]       82.10    83.93       83.00   85.96    91.77    94.79     93.26    94.78
                          DBNet++ [7]        81.33    83.99       82.64   84.13    91.39    94.63     92.98    94.48
                          DCLNet [12]        79.55    82.85       81.17   83.31    90.01    92.28     91.13    93.71
                             PAN [14]        90.31    87.14       88.70   94.00    95.70    96.52     96.10    98.65
                           DBNet [13]        89.07    91.56       90.30   93.24    96.00    97.51     96.75    98.67
                           CRAFT [11]        91.20    91.67       91.43   93.40    93.12    87.25     90.09    98.73
        CRNN [21]
                         TextBPN++ [8]       89.94    91.80       90.86   93.86    95.35    97.71     96.52    98.48
                          DBNet++ [7]        90.97    93.52       92.23   93.71    95.43    97.85     96.62    98.51
                          DCLNet [12]        89.84    92.95       91.37   93.16    95.04    96.34     95.69    98.52
                  Tesseract [2]              73.84    73.84       69.09   88.48    73.96    44.33     55.43    93.55
            Google Document AI [19]          90.83    92.03       91.42   94.80    88.06    90.97     89.49    98.61
               AWS Textract [18]             93.61    95.46       94.53   95.78    84.53    82.13     83.32    96.63


5. Conclusions

Text detection research has witnessed great progress in recent years, thanks to advances in deep machine learning. The recently introduced methods have widened the range of possible applications of text detectors, making them viable for in-the-wild text spotting. This shifted the attention towards more complex scenarios, including arbitrarily-shaped text or instances with non-orthogonal orientations. With automated document processing remaining one of the most relevant commercial OCR applications, we stress the importance of determining whether the state-of-the-art methods for scene text spotting can also improve document OCR. Our experiments show that detectors designed for in-the-wild text spotting can indeed be applied to documents with great success. In particular, fine-tuned models such as DBNet++ or TextBPN++ yielded detection F1-scores of over 96% on FUNSD, over 98% on CORD and over 96% on XFUND with respect to the CLEval metric, outperforming Google Document AI and AWS Textract. Moreover, combining these detectors with a publicly available CRNN recognition model in a two-stage manner consistently achieves over 90% CLEval end-to-end F1-score, even without explicit fine-tuning of CRNN. We hope these results will bring more attention to evaluating future Text Detection methods not only in the text-in-the-wild scenario, but also on the domain of documents.


Acknowledgement

We acknowledge the help of Bohumír Zámečník, an expert on OCR systems, who helped with the supervision of Krzysztof's internship project.







References

[1] Y. Du, C. Li, R. Guo, X. Yin, W. Liu, J. Zhou, Y. Bai, Z. Yu, Y. Yang, Q. Dang, H. Wang, PP-OCR: A practical ultra lightweight OCR system, CoRR abs/2009.09941 (2020). URL: https://arxiv.org/abs/2009.09941. arXiv:2009.09941.
[2] A. Kay, Tesseract: An open-source optical character recognition engine, Linux J. 2007 (2007) 2.
[3] K. Hamad, K. Mehmet, A detailed analysis of optical character recognition technology, International Journal of Applied Mathematics Electronics and Computers (2016) 244–249.
[4] T. Hegghammer, OCR with Tesseract, Amazon Textract, and Google Document AI: A benchmarking experiment, 2021. URL: osf.io/preprints/socarxiv/6zfvs. doi:10.31235/osf.io/6zfvs.
[5] N. Islam, Z. Islam, N. Noor, A survey on optical character recognition system, arXiv preprint arXiv:1710.05703 (2017).
[6] J. Memon, M. Sami, R. A. Khan, M. Uddin, Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR), IEEE Access 8 (2020) 142642–142668.
[7] M. Liao, Z. Zou, Z. Wan, C. Yao, X. Bai, Real-time scene text detection with differentiable binarization and adaptive scale fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[8] S. Zhang, X. Zhu, C. Yang, H. Wang, X. Yin, Adaptive boundary proposal network for arbitrary shape text detection, in: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, IEEE, 2021, pp. 1285–1294.
[9] C. K. Ch'ng, C. S. Chan, C. Liu, Total-Text: Towards orientation robustness in scene text detection, International Journal on Document Analysis and Recognition (IJDAR) 23 (2020) 31–52. doi:10.1007/s10032-019-00334-z.
[10] Y. Liu, L. Jin, S. Zhang, C. Luo, S. Zhang, Curved scene text detection via transverse and longitudinal sequence connection, Pattern Recognition 90 (2019) 337–345. URL: https://www.sciencedirect.com/science/article/pii/S0031320319300664. doi:10.1016/j.patcog.2019.02.002.
[11] Y. Baek, B. Lee, D. Han, S. Yun, H. Lee, Character region awareness for text detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9365–9374.
[12] Y. Bi, Z. Hu, Disentangled contour learning for quadrilateral text detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 909–918.
[13] M. Liao, Z. Wan, C. Yao, K. Chen, X. Bai, Real-time scene text detection with differentiable binarization, in: Proc. AAAI, 2020.
[14] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, C. Shen, Efficient and accurate arbitrary-shaped text detection with pixel aggregation network, CoRR abs/1908.05900 (2019). URL: http://arxiv.org/abs/1908.05900. arXiv:1908.05900.
[15] G. Jaume, H. K. Ekenel, J.-P. Thiran, FUNSD: A dataset for form understanding in noisy scanned documents, in: Accepted to ICDAR-OST, 2019.
[16] S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, H. Lee, CORD: A consolidated receipt dataset for post-OCR parsing (2019).
[17] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, F. Wei, XFUND: A benchmark dataset for multilingual visually rich form understanding, in: Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3214–3224. URL: https://aclanthology.org/2022.findings-acl.253. doi:10.18653/v1/2022.findings-acl.253.
[18] Amazon, Amazon Textract, https://aws.amazon.com/textract, 2022. Accessed: 2022-09-25.
[19] Google, Google Cloud Document AI, https://cloud.google.com/document-ai, 2022. Accessed: 2022-09-25.
[20] H. Li, P. Wang, C. Shen, G. Zhang, Show, attend and read: A simple and strong baseline for irregular text recognition, CoRR abs/1811.00751 (2018). URL: http://arxiv.org/abs/1811.00751. arXiv:1811.00751.
[21] B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, CoRR abs/1507.05717 (2015). URL: http://arxiv.org/abs/1507.05717. arXiv:1507.05717.
[22] S. Mori, H. Nishida, H. Yamada, Optical character recognition, John Wiley & Sons, Inc., 1999.
[23] H. F. Schantz, The history of OCR, optical character recognition, Manchester Center, VT: Recognition Technologies Users Association (1982).
[24] S. W. et al., Tesseract open source OCR engine (main repository), https://github.com/tesseract-ocr/tesseract, 2022. Accessed: 2022-10-14.
[25] M. Liao, B. Shi, X. Bai, X. Wang, W. Liu, TextBoxes: A fast text detector with a single deep neural network, in: AAAI, 2017.
[26] M. Liao, B. Shi, X. Bai, TextBoxes++: A single-shot oriented scene text detector, IEEE Transactions on Image Processing 27 (2018) 3676–3690. URL: https://doi.org/10.1109/TIP.2018.2825107. doi:10.1109/TIP.2018.2825107.
[27] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, X. Ding, Look more than once: An accurate detector for text of arbitrary shapes, CoRR abs/1904.06535 (2019). URL: http://arxiv.org/abs/1904.06535. arXiv:1904.06535.
[28] P. Dai, S. Zhang, H. Zhang, X. Cao, Progressive contour regression for arbitrary-shape scene text detection, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 7389–7398. doi:10.1109/CVPR46437.2021.00731.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). URL: http://arxiv.org/abs/1706.03762. arXiv:1706.03762.
[30] Y. Liu, H. Chen, C. Shen, T. He, L. Jin, L. Wang, ABCNet: Real-time scene text spotting with adaptive Bezier-curve network, CoRR abs/2002.10200 (2020). URL: https://arxiv.org/abs/2002.10200. arXiv:2002.10200.
[31] Y. Zhu, J. Chen, L. Liang, Z. Kuang, L. Jin, W. Zhang, Fourier contour embedding for arbitrary-shaped text detection, in: CVPR, 2021.
[32] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, S. Shao, Shape robust text detection with progressive scale expansion network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9336–9345.
[33] T. Sheng, J. Chen, Z. Lian, CentripetalText: An efficient text instance representation for scene text detection, in: Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
[34] S.-X. Zhang, X. Zhu, J.-B. Hou, C. Yang, X.-C. Yin, Kernel proposal network for arbitrary shape text detection, 2022. URL: https://arxiv.org/abs/2203.06410. doi:10.48550/ARXIV.2203.06410.
[35] J. Ye, Z. Chen, J. Liu, B. Du, TextFuseNet: Scene text detection with richer fused features, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, 2020, pp. 516–522.
[36] N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, MASTER: Multi-aspect non-local network for scene text recognition, CoRR abs/1910.02562 (2019). URL: http://arxiv.org/abs/1910.02562. arXiv:1910.02562.
[37] Y. Baek, D. Nam, S. Park, J. Lee, S. Shin, J. Baek, C. Y. Lee, H. Lee, CLEval: Character-level evaluation for text detection and recognition tasks, CoRR abs/2006.06244 (2020). URL: https://arxiv.org/abs/2006.06244. arXiv:2006.06244.
[38] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[40] Z. Kuang, H. Sun, Z. Li, X. Yue, T. H. Lin, J. Chen, H. Wei, Y. Zhu, T. Gao, W. Zhang, K. Chen, W. Zhang, D. Lin, Text detection models - MMOCR 0.6.2 documentation, https://mmocr.readthedocs.io/en/latest/textdet_models.html, 2022. Accessed: 2022-10-14.
[41] Y. Bi, Z. Hu, PyTorch implementation of DCLNet "Disentangled contour learning for quadrilateral text detection", https://github.com/SakuraRiven/DCLNet, 2021. Accessed: 2022-10-13.
[42] S. Zhang, X. Zhu, C. Yang, H. Wang, X. Yin, Arbitrary shape text detection via boundary transformer, https://github.com/GXYM/TextBPN-Plus-Plus, 2022. Accessed: 2022-09-29.
[43] Y. Baek, B. Lee, D. Han, S. Yun, H. Lee, Official implementation of Character Region Awareness for Text detection (CRAFT), https://github.com/clovaai/CRAFT-pytorch, 2019. Accessed: 2022-10-13.
[44] Z. Kuang, H. Sun, Z. Li, X. Yue, T. H. Lin, J. Chen, H. Wei, Y. Zhu, T. Gao, W. Zhang, K. Chen, W. Zhang, D. Lin, MMOCR: A comprehensive toolbox for text detection, recognition and understanding, arXiv preprint arXiv:2108.06543 (2021).
[45] Mindee, docTR: Document text recognition, https://github.com/mindee/doctr, 2021.
[46] Z. Kuang, H. Sun, Z. Li, X. Yue, T. H. Lin, J. Chen, H. Wei, Y. Zhu, T. Gao, W. Zhang, K. Chen, W. Zhang, D. Lin, Text recognition models - MMOCR 0.6.2 documentation, https://mmocr.readthedocs.io/en/latest/textrecog_models.html, 2021. Accessed: 2022-10-14.