<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Detection Forgot About Document OCR</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krzysztof Olejniczak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan Šulc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>26th Computer Vision Winter Workshop</institution>
          ,
          <addr-line>Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rossum.ai</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The work was done when Krzysztof Olejniczak was an intern at Rossum</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Oxford</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not achieve 100% accuracy, requiring human corrections in applications where correct readout is essential. Advances in machine learning enabled even more challenging scenarios of text detection and recognition "in-the-wild" - such as detecting text on objects from photographs of complex scenes. While the state-of-the-art methods for in-the-wild text recognition are typically evaluated on complex scenes, their performance in the domain of documents is typically not published, and a comprehensive comparison with methods for document OCR is missing. This paper compares several methods designed for in-the-wild text recognition and for document text recognition, and provides their evaluation on the domain of structured documents. The results suggest that state-of-the-art methods originally proposed for in-the-wild text detection also achieve competitive results on document text detection, outperforming available OCR methods. We argue that the application of document OCR should not be omitted in evaluation of text detection and recognition methods.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Text Detection</kwd>
        <kwd>Text Recognition</kwd>
        <kwd>OCR</kwd>
        <kwd>Optical Character Recognition</kwd>
        <kwd>Text In The Wild</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Optical Character Recognition (OCR) is a classic problem
in machine learning and computer vision with standard
methods [1, 2] and surveys [
        <xref ref-type="bibr" rid="ref10">3, 4, 5, 6</xref>
        ] available. Recent
advances in machine learning and its applications, such as
autonomous driving, scene understanding or large-scale
image retrieval, shifted the attention of Text
Recognition research towards the more challenging in-the-wild
text scenarios, with arbitrarily shaped and oriented
instances of text appearing in complex scenes. Spotting
text in-the-wild poses challenges such as extreme aspect
ratios, curved or otherwise irregular text, complex
backgrounds and clutter in the scenes. Recent methods [
        <xref ref-type="bibr" rid="ref14">7, 8</xref>
        ]
achieve impressive results on challenging text in-the-wild
datasets like TotalText [9] or CTW-1500 [
        <xref ref-type="bibr" rid="ref56">10</xref>
        ], with F1
reaching 90% and 87% respectively. Although automated
document processing remains one of the major
applications of OCR, to the best of our knowledge, the results of
in-the-wild text detection models were never
comprehensively evaluated on the domain of documents and
compared with methods developed for document OCR. This
paper reviews several recent Text Detection methods
developed for the in-the-wild scenario [
        <xref ref-type="bibr" rid="ref14">11, 12, 13, 7, 14, 8</xref>
        ],
evaluates their performance (out of the box and
finetuned) on benchmark document datasets [15, 16, 17], and
compares their scores against popular Document OCR
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Document OCR</title>
        <sec id="sec-2-1-1">
          <title>OCR engines designed for the "standard" application do</title>
          <p>main of documents range from open-source projects such
as TesseractOCR [2] and PP-OCR [1] to commercial
services, including AWS Textract [18] or Google Document
AI [19]. Despite Document OCR being a classic problem
with many practical applications, studied for decades
[22, 23], it still cannot be considered ’solved’ – even the
best engines struggle to achieve perfect accuracy. The
methodology behind the commercial cloud services is
typically not disclosed. The most popular1 open-source
OCR engine at the time of publication, Tesseract [2] (v4
and v5), uses a Long Short-Term Memory (LSTM) neural
network as the default recognition engine.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. In-the-wild Text Detection</title>
        <p>2.2.1. Regression-based Methods
Regression-based Methods follow the object classification
approach, reduced to a single-class problem. TextBoxes
[25] and TextBoxes++ [26] locate text instances with
various lengths by using sets of anchors with diferent
aspect ratios. Various regression-based methods utilize</p>
        <sec id="sec-2-2-1">
          <title>1Based on the GitHub repository [24] statistics.</title>
          <p>
            an iterative refinement strategy, iteratively enhancing using post-processing on so obtained pixel maps. The
the quality of detected boundaries. LOMO [27] uses an binary, deterministic nature of such pixel classification
Iterative Refinement Module, which in every step re- problem may cause learning confusion on the borders
gresses coordinates of each corner of the predicted bound- of text instances. Numerous methods address this issue
ary, with an attention mechanism. PCR [28] proposes a by predicting text kernels (central regions of instances)
top-down approach, starting with predictions of centres and appropriately gathering pixels around them. PSENet
and sizes of text instances, and iteratively improving the [32] predicts kernels of diferent sizes and forms
boundbounding boxes using its Contour Localisation Mecha- ing boxes by iteratively expanding their regions. PAN
nism. TextBPN++ [
            <xref ref-type="bibr" rid="ref14">8</xref>
            ] introduces an Iterative Boundary [14] generates pixel classification and kernel maps,
linkDeformation Module, utilizing Transformer Blocks with ing each classified text pixel to the nearest kernel.
Cenmulti-head attention [29] encoder and a multi-layer per- tripetalText [33] produces centripetal shift vectors that
ceptron decoder, to iteratively adjust vertices of detected map pixels to correct text centres. KPN [34] creates pixel
instances. Instead of considering vertices of the bound- embedding vectors, for each instance locates the central
ing boxes, DCLNet [12] predicts quadrilateral boundaries pixel and retrieves the whole shapes by measuring the
by locating four lines restricting the corresponding area, similarities in embedding vectors with scalar product.
representing them in polar coordinates system. To ad- Vast majority of segmentation-based methods generate
dress the problem of arbitrary-shaped text detection and probability maps, representing how likely pixels are to be
accurately model the boundaries of irregular text regions, contained in some text region, and using certain
binarizamore sophisticated bounding boxes representation ideas tion mechanism (e.g. by applying thresholding) convert
have been developed. ABCNet [30] adapts cubic Bezier them into binary pixel maps. However, the thresholds
curves to parametrize curved text instances, gaining the are often determined empirically, and uncareful choice
possibility of fitting non-polygon shapes. FCENet [ 31] of them may lead to drastic decrease in performance. To
proposes Fourier Contour Embedding method, predict- solve this problem, DBNet [13] proposes a Diferentiable
ing the Fourier signature vectors corresponding to the Binarization Equation, making the step between
probarepresentation of the boundary in Fourier domain, and bility and classification maps end-to-end trainable and
uses them to generate the shape of the instance with therefore letting the network learn how to accurately
Inverse Fourier Transformation. binarise predictions. DBNet++ [7] further improves on
the baseline by extending the backbone network with an
2.2.2. Segmentation-based Methods Adaptive Scale Fusion attention module, enhancing the
upscaling process and obtaining deeper features.
TextSegmentation-based Methods aim to classify each pixel FuseNet [35] generates features on three diferent levels:
as either text or non-text, and generate bounding boxes global-, word- and character-level, and fuses them to gain
relevant context and deeper insight into the image
structure. Instead of detecting words, CRAFT [11] locates text
on character-level, predicting the areas covered by single
letters, and links characters of each instance with respect
to the generated afinity map.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Text Detection</title>
        <sec id="sec-3-1-1">
          <title>To cover a wide range of text detectors, we selected</title>
          <p>methods from Section 2.2 with diferent approaches: for
regression-based methods, we included TextBPN++ as a
vertex-focused algorithm and DCLNet as an edge-focused
approach. From segmentation-based methods, we
selected DBNet and DBNet++ as pure segmentation and
PAN as an approach linking text pixels to corresponding
kernels. Finally, CRAFT was chosen as a character-level
method.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Text Recognition</title>
        <p>1–7</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Training Strategies</title>
        <p>
          DBNet [13], DBNet++ [7] and PAN [14] were fine-tuned
for 100 epochs (600 epochs in case of FUNSD) with batch
size of 8 and initial learning rate set to 0.0001 and
decreasing by a factor of 10 at the 60th and 80th epoch (200th
and 400th for FUNSD). Baselines, pre-trained on
SynthText [38] (DBNet, DBNet++) or ImageNet [39] (PAN),
were downloaded from the MMOCR 0.6.2 Model Zoo [40].
DCLNet [12] was fine-tuned from a pre-trained model
[41] on each dataset for 150 epochs with batch size of 4,
initial learning rate of 0.001, decaying to 0.0001. For each
dataset, TextBPN++ [
          <xref ref-type="bibr" rid="ref14">8</xref>
          ] was fine-tuned from a pre-trained
model [42] for 50 epochs with batch size of 4, learning
rate of 0.0001 and data augmentations consisting of
flipping, cropping and rotations. Given no publicly-available
training scripts for CRAFT, during the experiments, we
used the MLT model from the github repository [43]
without fine-tuning. All experiments were performed
using Adam optimizer with momentum 0.9, on a single
GPU with 11 GB of VRAM (GeForce GTX-1080Ti).
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Detection Results</title>
        <sec id="sec-4-2-1">
          <title>The ultimate goal of text detection, especially in the case</title>
          <p>of document processing, is to recognize the text within
the detected instances. Therefore, to evaluate the
suitability of popular in-the-wild detectors for document OCR,
we perform end-to-end measurements with the following
text recognition engines: SAR [20], MASTER [36] and
CRNN [21]. The open-source engines were combined
with the detection methods in a two-stage manner: the
input image was initially processed by a detector, which
returned bounding boxes. Afterwards, the corresponding
cropped instances were passed to recognition models. As
a point of reference, we compare both the detection and
end-to-end recognition results of the selected methods
with predictions of three common engines for end-to-end
document OCR: Tesseract [2], Google Document AI [19]
and AWS Textract [18].</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Results of the text detection methods selected in Section</title>
          <p>3.1 on the datasets from Table 1 are presented in Table 2.</p>
          <p>On FUNSD dataset, DBNet++ achieves both the highest
detection recall (97.40%) and F1-score (97.42%). The
highest precision rate, 97.84% was scored by CRAFT. PAN
performed the weakest out of all considered in-the-wild
algorithms, scoring just 81.44% F1-score. Despite having
achieved better results on FUNSD, segmentation-based
approaches were outperformed by regression-based
methods on CORD and XFUND. TextBPN++ proved to
be the best performing algorithm on CORD in terms of
recall and F1-score, scoring 99.74% and 99.19%,
respectively. DCLNet, for which the best precision rate on
CORD (98.67%) was recorded, achieved superior results
3.3. Metric on XFUND, outperforming the remaining methods with
respect to all three measures: precision - 98.22%, recall
To measure both detection and end-to-end performance, - 98.17% and F1-score - 98.20%. Out of the considered
we used the CLEval [37] metric. Contrary to metrics such popular engines for end-to-end document OCR, AWS
as Intersection over Union (IoU) perceiving text on word- Textract presented the best performance on the domain
level, CLEval measures precision and recall on character of scans of structured documents – FUNSD and XFUND –
level. As a consequence, it slightly reduces the punish- scoring 96.69% and 92.65% F1-score, respectively. Google
ment for splitting or merging problematic instances (e.g Document AI generalized remarkably better to distorted
dates), providing reliable and intuitive comparison of the photos of receipts from the CORD dataset, achieving
quality of detection and recognition. Additionally, the 93.30% F1-score, surpassing the scores of AWS Textract
Recognition Score evaluated by CLEval, approximately and Tesseract. The results show that in-the-wild detectors
corresponding to the precision of character recognition, fine-tuned on document datasets can outperform popular
informs about the quality of the recognition engine
specifically on the detected bounding boxes.</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>OCR engines on the domain of structured documents in terms of the CLEval detection metric. However, the results for the predictions of pre-trained detectors may not</title>
          <p>be fully representative due to diferences in splitting rules. Recognition Score for AWS Textract reached almost 96%,
E.g. Document AI creates separate instances for special surpassing CRNN’s scores by c.a. 2%. This suggests that
symbols, e.g. brackets, leading to undesired splitting the recognition engine used in AWS Textract,
performof words like "name(s)" into several fragments, lower- ing much more accurately on FUNSD than the CRNN
ing precision and recall. On all experimented datasets, model, may have been a crucial reason for the good
all fine-tuned in-the-wild text detection models reached results. When evaluated on CORD, models with
Difhigh prediction scores, proving themselves capable of ferentiable Binarization scored the highest marks in all
handling text in structured documents. Qualitative anal- end-to-end measures: recall (DBNet++), precision and
ysis of detectors’ predictions revealed that the major F1-score (DBNet); significantly surpassing the remaining
sources of error were incorrect splitting of long text frag- methods. Interestingly, despite obtaining the best recall
ments (e.g e-mail addresses), merging instances in dense rate, DBNet++ did not beat the simpler DBNet in terms
text regions and missing short stand-alone text, such as of end-to-end F1-score. The predictions of
regressionsingle-digit numbers. based approaches, better than segmentation-based ones
when pure detection scores were measured, appeared to
4.3. Recognition Results combine slightly worse with CRNN. TextBPN++,
however, remained competitive, achieving similar results
End-to-end text recognition results combining fine-tuned to DBNet and DBNet++. Recognition Scores of CRNN,
in-the-wild detectors with SAR [20] and MASTER [36] regardless the choice of in-the-wild detector, exceeded
models from MMOCR 0.6.2 Model Zoo [46], and CRNN 93% on FUNSD and 98.5% on CORD, once again
demon[21] from docTR [45] are listed in Table 3. The XFUND strating the suitability of applying these algorithms to
dataset was skipped for this experiment since it contains document text recognition. SAR model, not specifically
Chinese and Japanese characters, for which the recog- trained on documents, presented poorer performance:
nition models were not trained. On FUNSD, the end-to- the highest measured F1-scores on FUNSD and CORD
end measurement outcomes followed the patterns from were 86.36% and 85.25%, respectively, both obtained by
detection: equipped with CRNN as the recognition en- the combination with TextBPN++. Fine-tuned SAR
modgine, DBNet++ proved to be the best tuned model in els achieved slightly higher F1-scores reaching 89.49%
terms of CLEval end-to-end Recall (93.52%) and F1-score on FUNSD (equipped with DBNet++ as the detector) and
(92.23%), losing only to CRAFT in terms of precision. 93.77% on CORD (combined with TextBPN++ detections).
Much higher F1-score (+2%) was measured for AWS Tex- Despite gaining a noticeable advantage over the
basetract, whose end-to-end results outperformed all of the line, fine-tuned SAR models did not surpass the
perforconsidered algorithms. It is important to note that the mance of the pre-trained CRNN. Similarly to SAR, the
pre-trained MASTER model [46] worked the best in com- cess. In particular, fine-tuning models such as DBNet++
bination with TextBPN++, achieving F1 score of 83.00% or TextBPN++ yielded over 96% detection F1-score on
on FUNSD and 93.26% on CORD. FUNSD, over 98% detection F1-score on CORD and over
96% detection F1-score on XFUND, with respect to the
CLEval metric, outperforming Google Document AI and
5. Conclusions AWS Textract. Moreover, combining these detectors with
a publicly-available CRNN recognition model in a
twostage manner consistently achieves over 90% CLEval
end-to-end F1-score, even without explicit fine-tuning
of CRNN. We hope the results will bring more attention
to evaluating future Text Detection methods not only in
the text-in-the-wild scenario, but also on the domain of
documents.</p>
          <p>Text detection research has witnessed great progress in
recent years, thanks to advancements in deep machine
learning. The recently introduced methods widened the
range of possible applications of text detectors, making
them viable for in-the-wild text spotting. This shifted
the attention towards more complex scenarios, including
arbitrarily-shaped text or instances with non-orthogonal
orientations. With automated document processing
remaining one of the most relevant commercial OCR Acknowledgement
applications, we stress the importance of determining
whether the state-of-the-art methods for scene text spot- We acknowledge the help of Bohumír Zámečník, an
exting can also improve document OCR. Our experiments pert on OCR systems, who helped with the supervision
prove that detectors designed for in-the-wild text spot- of Krzysztof’s internship project.
ting can indeed be applied to documents with great
suc</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>in: Proc. AAAI</source>
          ,
          <year>2020</year>
          . [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          , [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , T. Lu,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          abs/
          <year>2009</year>
          .09941 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/ work, CoRR abs/
          <year>1908</year>
          .05900 (
          <year>2019</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <year>2009</year>
          .09941. arXiv:
          <year>2009</year>
          .09941. //arxiv.org/abs/
          <year>1908</year>
          .05900. arXiv:
          <year>1908</year>
          .
          <volume>05900</volume>
          . [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tesseract:</surname>
          </string-name>
          <article-title>An open-source optical character</article-title>
          [15]
          <string-name>
            <surname>J.-P. T. Guillaume</surname>
            <given-names>Jaume</given-names>
          </string-name>
          , Hazim Kemal Ekenel,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>recognition engine</article-title>
          ,
          <source>Linux J</source>
          .
          <year>2007</year>
          (
          <year>2007</year>
          )
          <article-title>2</article-title>
          .
          <string-name>
            <surname>Funsd</surname>
          </string-name>
          :
          <article-title>A dataset for form understanding in noisy</article-title>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hamad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mehmet</surname>
          </string-name>
          ,
          <article-title>A detailed analysis of optical scanned documents, in: Accepted to ICDAR-OST,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>character recognition technology</article-title>
          ,
          <year>International 2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Journal of Applied Mathematics Electronics</source>
          <volume>and</volume>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Surh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Computers</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <fpage>244</fpage>
          -
          <lpage>249</lpage>
          . Cord:
          <article-title>A consolidated receipt dataset for post-</article-title>
          ocr [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hegghammer</surname>
          </string-name>
          ,
          <article-title>Ocr with tesseract, amazon tex- parsing (</article-title>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>tract</surname>
            , and google document ai: A benchmarking [17]
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
          </string-name>
          , D. Flo-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>experiment</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: osf.io/preprints/socarxiv/ rencio, C. Zhang,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , XFUND: A bench-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          6zfvs. doi:
          <volume>10</volume>
          .31235/osf.io/6zfvs.
          <article-title>mark dataset for multilingual visually rich form [5</article-title>
          ]
          <string-name>
            <given-names>N.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Noor</surname>
          </string-name>
          ,
          <article-title>A survey on opti- understanding</article-title>
          , in: Findings of the Asso-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>arXiv:1710.05703</source>
          (
          <year>2017</year>
          ).
          <year>2022</year>
          , Association for Computational Linguistics, [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Memon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uddin</surname>
          </string-name>
          , Hand- Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>3214</fpage>
          -
          <lpage>3224</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>written optical character recognition (ocr): A com-</article-title>
          //aclanthology.org/
          <year>2022</year>
          .findings-acl.
          <volume>253</volume>
          . doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>prehensive systematic literature review (slr)</article-title>
          ,
          <source>IEEE</source>
          <volume>18653</volume>
          /v1/
          <year>2022</year>
          .findings-acl.
          <volume>253</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <issue>Access 8</issue>
          (
          <year>2020</year>
          )
          <fpage>142642</fpage>
          -
          <lpage>142668</lpage>
          . [18]
          <string-name>
            <surname>Amazon</surname>
            , Amazon textract, https://aws.amazon. [7]
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Wan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Real-time com/textract, 2022</article-title>
          . Accessed:
          <fpage>2022</fpage>
          -09-25.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>scene text detection with diferentiable binarization [19] Google, Google cloud document ai</article-title>
          , https://cloud.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>and adaptive scale fusion</article-title>
          ,
          <source>IEEE Transactions on google.com/document-ai</source>
          ,
          <year>2022</year>
          . Accessed:
          <fpage>2022</fpage>
          -09-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>Pattern Analysis and Machine Intelligence</source>
          (
          <year>2022</year>
          ).
          <fpage>25</fpage>
          . [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          , Adap- [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          , G. Zhang, Show, attend and
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>text detection</article-title>
          , in: 2021 IEEE/CVF International recognition, CoRR abs/
          <year>1811</year>
          .00751 (
          <year>2018</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>Conference on Computer Vision</source>
          , ICCV 2021, Mon- //arxiv.org/abs/
          <year>1811</year>
          .00751. arXiv:
          <year>1811</year>
          .00751.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>treal</surname>
          </string-name>
          , QC, Canada,
          <source>October 10-17</source>
          ,
          <year>2021</year>
          , IEEE,
          <year>2021</year>
          , [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>An end-to-end trainable</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          pp.
          <fpage>1285</fpage>
          -
          <lpage>1294</lpage>
          .
          <article-title>neural network for image-based sequence recogni</article-title>
          [9]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Ch'ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Chan</surname>
          </string-name>
          , C. Liu,
          <article-title>Total-text: To- tion and its application to scene text recognition,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>wards orientation robustness in scene text detec</article-title>
          -
          <source>CoRR abs/1507</source>
          .05717 (
          <year>2015</year>
          ). URL: http://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <source>International Journal on Document Analysis abs/1507</source>
          .05717. arXiv:
          <volume>1507</volume>
          .
          <fpage>05717</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>and Recognition (IJDAR) 23</source>
          (
          <year>2020</year>
          )
          <fpage>31</fpage>
          -
          <lpage>52</lpage>
          . doi:10. [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nishida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yamada</surname>
          </string-name>
          , Optical character
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <volume>1007</volume>
          /s10032-019-00334-z. recognition, John Wiley &amp; Sons, Inc.,
          <year>1999</year>
          . [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Curved [23]
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Schantz</surname>
          </string-name>
          ,
          <article-title>The history of ocr, optical character</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>sequence connection</article-title>
          ,
          <source>Pattern Recognition 90 Technologies Users Association</source>
          (
          <year>1982</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          (
          <year>2019</year>
          )
          <fpage>337</fpage>
          -
          <lpage>345</lpage>
          . URL: https://www.sciencedirect. [24]
          <string-name>
            <surname>S. W.</surname>
          </string-name>
          et al.,
          <article-title>Tesseract open source ocr engine</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>com/science/article/pii/S0031320319300664. (main repository), https://github.com/tesseract-ocr/</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          doi:https://doi.org/10.1016/j.patcog. tesseract,
          <year>2022</year>
          . Accessed:
          <fpage>2022</fpage>
          -10-14.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <year>2019</year>
          .
          <volume>02</volume>
          .002. [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Liu, Textboxes: [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          , D. Han,
          <string-name>
            <surname>S</surname>
          </string-name>
          . Yun,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Character A fast text detector with a single deep neural net-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <article-title>region awareness for text detection</article-title>
          , in: Proceedings work, in: AAAI,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>of the IEEE Conference on Computer Vision</source>
          and [26]
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Minghui Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          , TextBoxes++: A single-
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Pattern</given-names>
            <surname>Recognition</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>9365</fpage>
          -
          <lpage>9374</lpage>
          .
          <article-title>shot oriented scene text detector</article-title>
          , IEEE Transactions [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <source>Disentangled contour learning for on Image Processing</source>
          <volume>27</volume>
          (
          <year>2018</year>
          )
          <fpage>3676</fpage>
          -
          <lpage>3690</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>quadrilateral text detection</article-title>
          , in: Proceedings of the https://doi.org/10.1109/TIP.
          <year>2018</year>
          .
          <volume>2825107</volume>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>IEEE/CVF Winter Conference on Applications of 1109/TIP</source>
          .
          <year>2018</year>
          .
          <volume>2825107</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>Computer</given-names>
            <surname>Vision</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>909</fpage>
          -
          <lpage>918</lpage>
          . [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>En</surname>
          </string-name>
          , J. Han, [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <surname>Real-time</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Look more than once: An accu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          abs/
          <year>1904</year>
          .06535 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/ database, in: 2009 IEEE conference on computer
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <year>1904</year>
          .06535. arXiv:
          <year>1904</year>
          .
          <article-title>06535. vision and pattern recognition</article-title>
          , Ieee,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>[</lpage>
          28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cao</surname>
          </string-name>
          , Progressive
          <volume>255</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <article-title>contour regression for arbitrary-shape scene text</article-title>
          [40]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          detection, in: 2021 IEEE/CVF Conference on Com
          <string-name>
            <surname>- H. Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , K. Chen,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <source>puter Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          , Text detection models - mmocr
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          pp.
          <fpage>7389</fpage>
          -
          <lpage>7398</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR46437.
          <year>2021</year>
          .
          <volume>0</volume>
          .
          <issue>6</issue>
          .2 documentation, https://mmocr.readthedocs.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          00731. io/en/latest/textdet_models.html,
          <year>2022</year>
          . Accessed: [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkor- 2022
          <string-name>
            <surname>-</surname>
          </string-name>
          10-14.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>eit</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kaiser</surname>
            , I. Polo- [41]
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Bi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Pytorch implementation of dclnet "dis-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <source>abs/1706</source>
          .03762 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/ tection", https://github.com/SakuraRiven/DCLNet,
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          1706.03762. arXiv:
          <volume>1706</volume>
          .
          <fpage>03762</fpage>
          .
          <year>2021</year>
          . Accessed:
          <fpage>2022</fpage>
          -10-13. [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          , L. Jin, [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          abs/
          <year>2002</year>
          .10200 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/ TextBPN-Plus-Plus,
          <year>2022</year>
          . Accessed:
          <fpage>2022</fpage>
          -09-29.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <year>2002</year>
          .10200. arXiv:
          <year>2002</year>
          .
          <volume>10200</volume>
          . [43]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          , D. Han,
          <string-name>
            <surname>S</surname>
          </string-name>
          . Yun,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Oficial [31]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          , W. Zhang, implementation
          <article-title>of character region awareness for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <article-title>text detection</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2021</year>
          . CRAFT-pytorch,
          <year>2019</year>
          . Accessed:
          <fpage>2022</fpage>
          -10-13. [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          , S. Shao, [44]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>nition</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>9336</fpage>
          -
          <lpage>9345</lpage>
          . arXiv preprint arXiv:
          <volume>2108</volume>
          .06543 (
          <year>2021</year>
          ). [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          , Centripetaltext: An [45]
          <article-title>Mindee, doctr: Document text recognition, https:</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <article-title>eficient text instance representation for scene text //github</article-title>
          .com/mindee/doctr,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          detection, in: Thirty-Fifth Conference on Neural [46]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <source>Information Processing Systems</source>
          ,
          <year>2021</year>
          . H.
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , K. Chen, [34]
            <given-names>S.-X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Hou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>X.-C.</given-names>
          </string-name>
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Text recognition models - mmocr</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <article-title>Kernel proposal network for arbitrary shape text de- 0.6.2 documentation</article-title>
          , https://mmocr.readthedocs.io/
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <string-name>
            <surname>tection</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2203.06410. en/latest/textrecog_models.html,
          <year>2021</year>
          . Accessed:
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>doi:10.48550/ARXIV.2203.06410</source>
          . 2022-
          <volume>10</volume>
          -14. [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Du</surname>
          </string-name>
          , Textfusenet: Scene
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          <source>Conference on Artificial Intelligence, IJCAI-20</source>
          , In-
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <source>gence Organization</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>516</fpage>
          -
          <lpage>522</lpage>
          . [36]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , MAS-
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          recognition, CoRR abs/
          <year>1910</year>
          .02562 (
          <year>2019</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          //arxiv.org/abs/
          <year>1910</year>
          .02562. arXiv:
          <year>1910</year>
          .
          <volume>02562</volume>
          . [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          , J. Baek,
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          abs/
          <year>2006</year>
          .06244 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <year>2006</year>
          .06244. arXiv:
          <year>2006</year>
          .
          <volume>06244</volume>
          . [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , Synthetic data
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <string-name>
            <surname>nition</surname>
          </string-name>
          ,
          <year>2016</year>
          . [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>