<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extended Overview of DocILE 2023: Document Information Localization and Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Štěpán Šimsa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Uřičář</string-name>
          <email>michal.uricar@rossum.ai</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan Šulc</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yash Patel</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmed Hamdi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matěj Kocián</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matyáš Skalický</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiří Matas</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antoine Doucet</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mickaël Coustaty</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimosthenis Karatzas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Center, Universitat Autònoma de Barcelona</institution>
          ,
          <addr-line>08193 Cerdanyola del Vallès, Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rossum</institution>
          ,
          <addr-line>Křižíkova 148/34, 186 00 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Second Foundation</institution>
          ,
          <addr-line>Na Florenci 15, 110 00 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of La Rochelle</institution>
          ,
          <addr-line>23 Avenue Albert Einstein, 17031 La Rochelle</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Visual Recognition Group, CTU in Prague</institution>
          ,
          <addr-line>Karlovo náměstí 13, 121 35 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper provides an overview of the DocILE 2023 Competition, its tasks, participant submissions, the competition results and possible future research directions. This first edition of the competition focused on two Information Extraction tasks, Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR). Both of these tasks require the detection of pre-defined categories of information in business documents. The second task additionally requires correctly grouping the information into tuples, capturing the structure laid out in the document. The competition used the recently published DocILE dataset and benchmark, which stays open to new submissions. The diversity of the participant solutions indicates the potential of the dataset: the submissions included pure Computer Vision, pure Natural Language Processing, as well as multi-modal solutions, and utilized all parts of the dataset, including the annotated, synthetic and unlabeled subsets. This is an extended version of the condensed overview paper [1].</p>
      </abstract>
      <kwd-group>
        <kwd>Information Extraction</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Optical Character Recognition</kwd>
        <kwd>Document Understanding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Documents, such as invoices, purchase orders, contracts, and financial statements, are a major
form of communication between businesses. Extraction of the key information from such
documents is an essential task, as they contain a wealth of valuable information critical for
day-to-day decision-making, compliance, and operational efficiency.</p>
      <p>
        Machine learning techniques, particularly those based on deep learning, natural language
processing, and computer vision, have shown great promise in a number of document
understanding tasks [
        <xref ref-type="bibr" rid="ref10 ref11 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">2, 3, 4, 5, 6, 7, 8, 9, 10, 11</xref>
        ], such as understanding of forms [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ],
receipts [
        <xref ref-type="bibr" rid="ref15 ref7">7, 15</xref>
        ], tables [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
        ], or invoices [
        <xref ref-type="bibr" rid="ref19 ref20 ref21">19, 20, 21</xref>
        ]. Another approach to document
understanding is question answering [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ].
      </p>
      <p>
        The DocILE competition and lab at CLEF 2023 called for contributions to the DocILE
benchmark [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], which focuses on the practically oriented tasks of Key Information Localization and
Extraction (KILE) and Line Item Recognition (LIR), as defined in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        This paper provides an overview of the first run of the DocILE competition, summarizing
the participants' solutions and their final results, as well as a breakdown of the results along
several axes, e.g., by the zero-shot/few-shot/many-shot availability of the document layout in
the training data, or by text extractions, which are not otherwise checked in the main
evaluation metric. This is an extended version of the condensed overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The paper is structured as follows: Section 2 describes the DocILE dataset, its acquisition
and division into individual subsets; Section 3 summarizes the DocILE competition tasks and
their respective evaluation process; all competing methods submitted to the competition are
briefly described in Section 4; results from the competition, their breakdown and discussion are
provided in Section 5; finally, Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>
        The competition was based on the DocILE [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] dataset of business documents, which consists
of three distinct subsets: annotated, unlabeled, and synthetic. The annotated set comprises 6,680
real business documents sourced from publicly available platforms, which have been carefully
annotated. The unlabeled set consists of a massive collection of 932,467 real business documents,
also obtained from publicly available sources, intended for unsupervised pre-training purposes.
The dataset draws its documents from two public data sources: UCSF Industry Documents
Library [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and Public Inspection Files (PIF) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. The UCSF Industry Documents Library is a
digitized archive of documents created by industries that impact public health, while PIF
consists of public files of American broadcast stations, specifically focusing on political campaign
ads. The documents were retrieved in PDF format, and various selection criteria were applied to
ensure the quality and relevance of the dataset. The synthetic set comprises 100,000 documents
generated using a proprietary document generator. These synthetic documents are designed
to mimic the layout and structure of 100 fully annotated real business documents from the
annotated set.
      </p>
      <p>
        Participants were allowed to use the 5,180 training samples, the 500 validation samples and the
full synthetic and unlabeled sets. The remaining 1,000 documents form the test set. Usage
of external document datasets or models pre-trained on such datasets was forbidden in the
competition, while datasets and pre-trained models from other domains — such as images from
ImageNet [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] or texts from BooksCorpus [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] — were allowed.
      </p>
      <p>
        For each document, the dataset contains the original PDF file and OCR pre-computed using
the DocTR [
        <xref ref-type="bibr" rid="ref30">30</xref>
          ] library, which achieved excellent recognition scores in [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. Annotations are provided
for documents in the annotated and synthetic sets and include field annotations for the two
competition tasks, KILE and LIR, as well as additional metadata: original source of the document,
layout cluster ID¹, table grid annotation, document type, currency, page count and page image
sizes. Annotations for the test set are not publicly available.
      </p>
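      <p>For illustration, a single field annotation can be thought of as a record with a field type, page, bounding box and text, with LIR fields additionally carrying a line item identifier. The following sketch uses assumed attribute names (it is not the official DocILE data model) to show how LIR fields group into Line Items:</p>

```python
# Illustrative only: the attribute names below are assumptions, not the
# official DocILE annotation schema. Bounding boxes are page-relative.
from collections import defaultdict

fields = [
    {"fieldtype": "line_item_description", "page": 0,
     "bbox": [0.10, 0.42, 0.55, 0.45], "text": "Widget A", "line_item_id": 1},
    {"fieldtype": "line_item_quantity", "page": 0,
     "bbox": [0.60, 0.42, 0.65, 0.45], "text": "3", "line_item_id": 1},
    {"fieldtype": "line_item_description", "page": 0,
     "bbox": [0.10, 0.47, 0.55, 0.50], "text": "Widget B", "line_item_id": 2},
]

def group_line_items(fields):
    """Group LIR fields into Line Items keyed by their line item id."""
    items = defaultdict(list)
    for field in fields:
        items[field["line_item_id"]].append(field["fieldtype"])
    return dict(items)
```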
    </sec>
    <sec id="sec-3">
      <title>3. Tasks and Evaluation</title>
      <p>
        The competition had two tracks, one for each of the two tasks, KILE and LIR.
The goal of both tasks is to detect semantic fields in the document, i.e., for each
category (field type), to localize all the text boxes that have this semantic meaning and extract the
corresponding text. For LIR, the fields additionally have to be grouped into Line Items, i.e., tuples
representing a single item. For a more formal definition, refer to [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], where the tasks were first
defined. An example document with annotations for KILE and LIR is illustrated in Figure 1.
      </p>
      <p>The DocILE benchmark is hosted on the Robust Reading Challenge portal². As the test set
annotations remain private, the only way to compare solutions on the test set is to make a
submission to the benchmark. During the competition, participants did not see the results, or
even their own score, so they had to select the final solution without gathering any information
about the test set.</p>
      <p>To focus the competition on the most important part of the two tasks, which is the semantic
understanding of the values in the documents, only the localization part was evaluated. This
means the tasks can be framed as object detection tasks, with LIR additionally requiring the
grouping of the detected objects into Line Items. Therefore, standard object detection metrics
are employed, with Average Precision (AP) as the main metric for KILE and F1 as the main
metric for LIR. A predicted and a ground truth field match if they have the same field
type and if they cover the same text in the document, as explained in detail in Figure 2. For LIR,
the fields also need to belong to corresponding Line Items, where this correspondence is found
with a matching that maximizes the total number of matched fields, as shown in Figure 3.</p>
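      <p>The localization match can be sketched as follows: each pre-computed OCR word box is split uniformly into one pseudo-character box per character, and a predicted field matches a ground truth field of the same type if both boxes cover the same non-empty set of Pseudo-Character Centers (PCCs). The helper names below are ours, and axis-aligned boxes are assumed:</p>

```python
# A sketch of the matching rule; helper names are our own, not the
# benchmark's implementation.

def pccs(word_box, text):
    """Pseudo-Character Centers: centers of a uniform split of the word
    box into one box per character of the recognized text."""
    x0, y0, x1, y1 = word_box
    width = (x1 - x0) / len(text)
    y_center = (y0 + y1) / 2
    return [(x0 + (i + 0.5) * width, y_center) for i in range(len(text))]

def covered(box, centers):
    """Set of PCCs lying inside an axis-aligned box."""
    x0, y0, x1, y1 = box
    return {c for c in centers if x0 <= c[0] <= x1 and y0 <= c[1] <= y1}

def fields_match(pred_box, gt_box, all_pccs):
    """Same-type fields match when both boxes cover the same non-empty
    set of Pseudo-Character Centers."""
    return covered(pred_box, all_pccs) == covered(gt_box, all_pccs) != set()
```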
      <p>Extracting the text of the localized fields is an obvious extension of the two tasks whose
precision is also important. Therefore, both tracks in the benchmark have a separate leaderboard,
where the extracted text is compared with the annotated text for each matched field pair and an
exact match is required to count the pair as a true positive pair.</p>
      <p>The benchmark also contains additional leaderboards for zero-shot, few-shot and many-shot
evaluation. This is the same evaluation as in the main leaderboard but evaluated only on a
subset of the test documents. Specifically, it is evaluated on documents from layout clusters that
have zero (zero-shot), one to three (few-shot) or four and more (many-shot) samples available
for training (i.e., in the training or validation set). These test subsets contain roughly 250, 250
and 500 documents, respectively. This enables a more detailed analysis of the methods and
helps to understand which methods generalize better to new document layouts and which can
better overfit to clusters with many examples available for training.</p>
      <p>¹ Clusters are formed by documents that have a similar visual layout and placement of semantic information in
this layout.</p>
      <p>² https://rrc.cvc.uab.es/?ch=26</p>
      <p>[Figure 1: an example document with Line Item and field annotations; field types include amount_due,
customer_id, line_item_amount_gross, line_item_unit_price_gross, amount_total_gross, date_due, line_item_code,
payment_reference, customer_billing_address, date_issue, line_item_description, payment_terms,
customer_billing_name, document_id, line_item_quantity and vendor_name.]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Submissions</title>
      <p>The competition received contributions from 5 teams for the KILE task and 4 teams for the
LIR task. See Figure 4 to compare this with the number of dataset downloads and competition
registrations. We briefly present all the submitted methods in alphabetical order.</p>
      <sec id="sec-4-1">
        <title>4.1. GraphDoc — USTC-iFLYTEK, China</title>
        <p>
          The team from the University of Science and Technology of China and iFLYTEK AI Research,
China submitted a method [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], which jointly solves both KILE and LIR tasks. Their approach
is based on an ensemble of a modified GraphDoc [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] tailored for the purpose of the DocILE
competition, pre-trained on the DocILE unlabeled set and consequently fine-tuned on the
training set. Both competition tasks are handled like Named Entity Recognition (NER), followed
by a special Merger module, which operates on the attention layer from the GraphDoc model
and the merging strategy is therefore learned, unlike in the baseline method. The authors
noticed the inherent hierarchical nature of the KILE and LIR tasks and exploited it naturally: word
tokens are merged into instances by the first-level Merger module, and a second Merger
module then operates on these instances for the line item classes and merges them into final line
items. The proposed method still uses some rule-based post-processing, which is
based on observations of the data: 1) some field annotations contain only part of the detected
text boxes from DocTR and need to be manually split (such as currency_code_amount_due
fields, which usually contain only the symbol '$'); 2) some symbols are frequently detected as
part of the OCR word box, but excluded from the annotations (such as the symbol '#'); 3) text
boxes that are far apart rarely belong to the same instance, or to the same line item.
        </p>
        <p>[Figure 2: (a) each pre-computed OCR word is split uniformly into pseudo-character boxes based on the number
of characters; Pseudo-Character Centers are the centers of these boxes; (b) correct extraction examples; (c) incorrect
extraction examples.]</p>
        <p>
          Besides the contribution on the model side, the authors also devoted some effort to improving
the provided OCR detections, by removing detections with low confidence and by running
DocTR [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] on scaled-up images (1.25×, 1.5×, and 1.75×) and aggregating the found text
boxes to improve the recall of the OCR detections. The OCR detections are also re-ordered,
similarly to the baseline methods, into top-down left-right reading order.
        </p>
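        <p>The top-down left-right re-ordering can be sketched as follows; the line-grouping tolerance is our assumption, not the exact rule used by the team or the baselines:</p>

```python
# A minimal reading-order sketch; the line tolerance is an assumed value.

def reading_order(word_boxes, line_tol=0.5):
    """Sort (x0, y0, x1, y1) word boxes top-down, then left-right per line.

    Boxes whose vertical centers differ by less than line_tol times the
    previous box height are treated as lying on the same line.
    """
    if not word_boxes:
        return []
    boxes = sorted(word_boxes, key=lambda b: ((b[1] + b[3]) / 2, b[0]))
    lines, current = [], [boxes[0]]
    for box in boxes[1:]:
        prev = current[-1]
        same_line = abs((box[1] + box[3]) / 2 - (prev[1] + prev[3]) / 2) \
            < line_tol * (prev[3] - prev[1])
        if same_line:
            current.append(box)
        else:
            lines.append(current)
            current = [box]
    lines.append(current)
    return [b for line in lines for b in sorted(line, key=lambda b: b[0])]
```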
        <p>Since the proposed method uses multi-modal input (text, layout, vision), we categorize it as a
combination of NLP and CV.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. LiLT — University of Information Technology, Vietnam</title>
        <p>
          The team from University of Information Technology, Vietnam submitted a method based on
the baselines with a layout-aware backbone LiLT [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. The authors decided to re-split the
provided dataset into 80% for training and 20% for validation (the original ratio was 90% and 10%,
respectively), arguing that the original split led to poor generalization. Another
contribution was filtering out low-confidence OCR detections. The manuscript does not mention usage
of either the synthetic or the unlabeled sets of the DocILE dataset.
        </p>
        <p>Unfortunately, despite competing in both KILE and LIR tasks, the authors submitted a
manuscript describing only the solution for the LIR task. Since the backbone LiLT uses a
combination of text and layout input, we categorize it as a pure NLP solution.</p>
        <p>
          The review process of the authors' manuscript uncovered a violation of the benchmark rules:
the usage of a prohibited pre-trained checkpoint for the LiLT backbone. The authors
used the checkpoint from training on the IIT-CDIP [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] dataset, which is a document dataset.
Therefore, we had to remove this method from the official leaderboard of the competition and
the benchmark.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Union-RoBERTa — University of Information Technology, Vietnam</title>
        <p>
          The team from University of Information Technology, Vietnam submitted a method [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] which
is heavily based on the provided baselines. Their method, coined Union-RoBERTa, is an
ensemble of two provided baselines [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] with a plain RoBERTa trained from scratch on the
synthetic and training data using the Fast Gradient Method. They use the affirmative strategy for
the ensemble (hence the Union in the name) and follow it with an additional merging of fields
based on distance, with a threshold tuned on the validation set. This ensemble is then used
to generate pseudo-labels for 10,000 samples from the unlabeled set, which are then used for
additional pre-training of the three models, followed by additional training on the training
set. Although there is not much novelty in the proposed method, it is a nice example of how
well-established practices can yield significant improvements.
        </p>
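        <p>A minimal sketch of the union ("affirmative") ensembling followed by distance-based merging might look as follows; the gap threshold and the horizontal-only merge rule are simplifying assumptions standing in for the values the authors tuned on the validation set:</p>

```python
# Sketch only: the gap value and merge rule are assumptions, not the
# authors' tuned settings.

def union_ensemble(*model_predictions):
    """Affirmative strategy: keep every prediction made by any model."""
    return [p for preds in model_predictions for p in preds]

def merge_close(predictions, gap=0.02):
    """Merge same-type fields whose boxes are horizontally closer than
    `gap` (page-relative coordinates assumed)."""
    merged = []
    for pred in sorted(predictions, key=lambda p: (p["fieldtype"], p["bbox"][0])):
        if merged and merged[-1]["fieldtype"] == pred["fieldtype"] \
                and pred["bbox"][0] - merged[-1]["bbox"][2] < gap:
            last = merged[-1]
            last["bbox"] = [last["bbox"][0],
                            min(last["bbox"][1], pred["bbox"][1]),
                            max(last["bbox"][2], pred["bbox"][2]),
                            max(last["bbox"][3], pred["bbox"][3])]
        else:
            merged.append(dict(pred))
    return merged
```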
        <p>The proposed method participated in the KILE task only. Since the method is based on
RoBERTa models, we put it in the pure NLP category.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. ViBERTGrid — Ricoh Software Research Center, China</title>
        <p>
          The team from Ricoh Software Research Center, China submitted a method based on token
classification with ViBERTGrid [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], followed by a distance-based merging procedure. The
team participated in both the KILE and LIR tasks. However, the results were below the baselines for
both tasks, and the authors decided not to submit a manuscript with further details. We can only
guess, based on the mention of ViBERTGrid in the provided description, that the method was a
combination of NLP and CV.
        </p>
        <p>We noticed that the method probably suffers from not assigning adequate scores (all detections
used the same score of 1.0), which could explain why its AP is significantly lower compared to
the other methods, while its F1 measure on the KILE task is in the middle of the ranking, as seen
in Figure 5a and discussed further in Section 5.5.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. YOLOv8 — University of West Bohemia, Czech Republic</title>
        <p>
          The team from University of West Bohemia, Czech Republic submitted a method [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] based on
the combination of YOLOv8 [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] and CharGrid [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ] with modifications, such as splitting the
word boxes into pseudo-characters, encoding each character with three numbers instead of a
single number, and concatenating the image with the CharGrid
representation. The authors leveraged neither the synthetic nor the unlabeled parts of the dataset, but they
used augmentations during training. To keep the training procedure fast, they decided to use
just random translation for augmentation, even though the best results in the ablation study were
observed when mosaicking was applied. The method works quite well on the KILE task (where
it even achieves the highest F1) but falls behind on the LIR task. The latter is attributed to an
increased number of false positive detections.
        </p>
        <p>This contribution is purely based on computer vision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>
        The results for the KILE and LIR tasks, including the baselines from the DocILE dataset paper [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
are displayed in Figures 5a and 5b, respectively. We can see that, while on the KILE task the
participants' approaches clearly outperform the provided baselines by a large margin on the main
evaluation metric (AP), on the LIR task there is no such big improvement, except for the
GraphDoc-based approach. The baseline methods are marked with the ⊟ symbol.
      </p>
      <p>Interestingly, for the KILE task, the secondary metric (F1) does not seem to be correlated with
the primary metric (AP) and several of the methods, including the baselines, are comparatively
much better on F1 than on AP. In fact, the YOLOv8-based approach outperforms the otherwise
winning GraphDoc in the F1 metric. This might be related to the fact that AP takes into account
the score assigned to individual predictions, while F1 does not, and that some teams focused on
assigning good scores to predictions more than others, as discussed in Section 5.5.</p>
      <p>In the LIR task, there is some correlation between the primary metric, which in this task is
F1, and the secondary metric (AP), with a slight violation for the GraphDoc-based method.</p>
      <p>Considering the achieved metric values, we can say that the DocILE benchmark poses very
challenging tasks, because the best results on both KILE and LIR tasks are below 80% of the
respective quality metric.</p>
      <sec id="sec-5-1">
        <title>5.1. Text Extraction Evaluation</title>
        <p>Figure 6 summarizes the results when text extractions are checked in the evaluation. Note that
this was intentionally not done in the main evaluation, which focuses on the localization
part, so that participants did not have to optimize the OCR solution for text read-out.
However, in a real-world system, this would likely be the main evaluation metric, and
therefore we present the results of all competing methods when this strict text comparison is
employed. By definition, all methods perform worse on both the KILE and LIR tasks compared
to the main localization-only evaluation. Also, both the AP and F1 metrics show less variance across the
competing methods. Unfortunately, the YOLOv8-based method did not provide text outputs
(which were not required for the competition), so we cannot evaluate this method properly.</p>
        <p>The KILE task, summarized in Figure 6a, shows that the GraphDoc still outperforms all the
other competitors. However, the margin is not as big as in the final evaluation.</p>
        <p>The LIR task is summarized in Figure 6b. Surprisingly, the GraphDoc-based method, which
won the main evaluation and kept its position on the KILE task, is now
lagging behind quite significantly. We believe this might be attributed to the lack of effort
invested in the text read-out after merging.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation on zero/few/many-shot layouts</title>
        <p>In this section, we present a breakdown of the evaluation with respect to the document layouts
seen/unseen during training, providing hints about how well each method generalizes.
We have three distinct categories for this evaluation: 1) zero-shot, formed by document layouts
that were not in the training nor validation sets; 2) few-shot, which is formed by document
layouts that have 1–3 samples in the training and validation subset of the DocILE dataset; 3)
many-shot, with 4 or more samples in the training and validation subset.</p>
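        <p>The split described above can be sketched as a simple bucketing of test documents by how often their layout cluster occurs in the training and validation sets:</p>

```python
# Sketch of the zero/few/many-shot split by layout cluster frequency.
from collections import Counter

def shot_buckets(test_docs, trainval_clusters):
    """test_docs: (doc_id, layout_cluster) pairs; trainval_clusters: layout
    clusters of all training + validation documents.

    Buckets: zero-shot (0 samples), few-shot (1-3), many-shot (4 or more).
    """
    counts = Counter(trainval_clusters)
    buckets = {"zero": [], "few": [], "many": []}
    for doc_id, cluster in test_docs:
        n = counts[cluster]
        key = "zero" if n == 0 else ("few" if n <= 3 else "many")
        buckets[key].append(doc_id)
    return buckets
```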
        <sec id="sec-5-2-1">
          <title>Zero-shot, few-shot and many-shot results</title>
          <p>In Figure 7, we show the results of the first category, zero-shot. For the KILE task (Figure
7a), we can see that GraphDoc is still a clear winner, with a relatively high margin. Interestingly,
however, YOLOv8 performs much worse compared to the overall results. This might be
attributed to the fact that this method did not leverage the unlabeled part of the DocILE dataset
and is therefore more prone to overfitting. The RoBERTa baseline performs better than RoBERTa
with supervised pre-training on synthetic data, which might be caused by the fact that synthetic
documents are based on selected layouts from the training set and these layouts are not present
in the zero-shot test subset, although we do not see the same effect in the case of LayoutLMv3
or in the LIR task. Union-RoBERTa takes second place; considering it is basically an ensemble
of the baselines, this might be an indicator that ensembling can also improve generalization
properties. It is also worth mentioning that ViBERTGrid generalizes very well as far as
the F1 measure is concerned.</p>
          <p>The LIR task (Figure 7b) shows similar results: GraphDoc remains in first place, LiLT lost
its second position to RoBERTa with supervised pre-training on synthetic data, and the LayoutLMv3
baseline pre-trained on synthetic data swapped its position with the RoBERTa baseline. Note
that for both tasks, the results are significantly worse in the zero-shot setup compared to the
overall results, showing room for improvement in the generalization of all competing
methods.</p>
          <p>The results of the few-shot evaluation are in Figure 8. The KILE task (Figure 8a) shows
that seeing even a few similar layouts during training can help significantly. We see that YOLOv8
returns to second place and the RoBERTa+synth baseline improves significantly. It is also worth
mentioning that most methods improve both the AP and F1 metrics by roughly 10% compared to
the zero-shot setup, some by even more, with the exception of ViBERTGrid,
which shows a smaller improvement.</p>
          <p>In the LIR task (Figure 8b), we can see that all methods get closer to each other, similarly to
the overall evaluation. However, what is really surprising is that the results in the zero-shot
setup were actually slightly better than the results in the few-shot setup. Also, LiLT benefits from
seeing at least a few similar layouts during training much more than GraphDoc and takes over
its first position. The RoBERTa baseline is also slightly better than RoBERTa+synth.</p>
          <p>In Figure 9, we show the results for the many-shot scenario. For the KILE task (Figure 9a), it
can be seen that the order of the competing methods converges to the same one as for the overall
results, with the only exception of the LayoutLMv3 and LayoutLMv3+synth baselines, which are
swapped. We can also see that the results are roughly 10% better than in the overall case,
which is not surprising, since the overall case also contains unseen layout examples. For
the LIR task (Figure 9b), we see a similar trend, but the improvement is not as significant.
Interestingly, the LayoutLMv3+synth baseline gets to second place, outperforming both the LiLT
and RoBERTa+synth baselines. However, we should point out that the results of these methods
are very close.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Breakdown based on document source</title>
        <p>The number of documents in each layout cluster in the training, validation and unlabeled subsets
is depicted in Figure 13 and Figure 14 for the UCSF and PIF document source subsets,
respectively.</p>
        <p>The distribution of the number of document pages in the training, validation and unlabeled
sets is depicted in Figure 11 and Figure 12 for the UCSF and PIF document source subsets,
respectively.</p>
        <p>
          Figure 10 depicts the breakdown of results based on the document source type, also
in combination with the zero/few/many-shot layout analysis, for both the KILE and LIR tasks of the
winning solution [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. From the graphs, we can see that documents from the PIF source
pose a bigger problem for the method, which is interesting since UCSF has a bigger variance in
the number of different layouts. This noticeable difference might be attributed to the fact that
PIF documents are more frequently multi-paged. There might be non-trivial changes in
document layout in the transition from one page to another, especially where tables are
concerned.
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Using synthetic and unlabeled data</title>
        <p>According to the submitted participant papers, only GraphDoc and partly also Union-RoBERTa
(which used only 10,000 samples) leveraged the unlabeled part of the DocILE dataset. We believe
that the main reason for not using the unlabeled data was the relatively tight time constraints. It
is visible that the GraphDoc-based method wins almost all comparisons, with the exception of
the few-shot (Figure 8b) and text extraction (Figure 6b) LIR tasks. However, it is hard to judge whether
this can be attributed to the usage of the unlabeled data.</p>
        <p>Only the authors of Union-RoBERTa report using the synthetic part of the DocILE
dataset. Again, the reason for not using the provided synthetic data might be time constraints.
From the baselines' point of view, we see that using the synthetic data helps in most situations,
with a few exceptions such as the zero-shot KILE task (Figure 7a) and the few-shot LIR task
(Figure 8b), where RoBERTa performs better than RoBERTa+synth while, simultaneously,
LayoutLMv3+synth outperforms LayoutLMv3. We should point out, however, that in these cases
the differences are not very big.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Importance of score for Average Precision</title>
        <p>While the F1 metric ignores the score assigned to individual predictions, the score plays an
important role for the AP metric. In AP, predicted fields are first sorted by their score, then
precision-recall pairs are computed iteratively, and finally the metric itself is the average of the
precision achieved at different recall thresholds. Therefore, if we can ensure that there are
more true positives among the predictions with higher scores than among the predictions with
lower scores, the precision will increase for lower recall thresholds and remain similar for higher
recall thresholds, compared to the case when scores are random.</p>
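        <p>The ranked-retrieval AP described above can be sketched as follows (a minimal illustration, not the official DocILE evaluation code; the field-matching step is abstracted into a boolean per prediction):</p>

```python
def average_precision(predictions, n_ground_truth):
    """Toy AP computation over (score, is_true_positive) pairs.

    Sort by descending score, sweep the ranked list, and average the
    precision observed at each rank where a true positive appears
    (the standard ranked-retrieval definition of AP).
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            ap += true_positives / rank  # precision at this recall step
    return ap / n_ground_truth if n_ground_truth else 0.0
```

        <p>With such a computation, ranking true positives ahead of false positives raises the precision at low recall and thus the AP; a constant score leaves the ordering effectively arbitrary, which matches the poor AP results of methods that do not output informative scores.</p>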
        <p>To illustrate this point, we can look at two examples. The ViBERTGrid method used the same
score for all predictions and achieves very poor results on AP compared to its results on F1,
as can be seen in Figure 5. On the other hand, in the participant paper of the GraphDoc method,
the authors argue that the prediction score is important for the AP metric and show that by
using a carefully selected score they achieve a 13.6% higher result on AP on the validation set
compared to using the same score for all predictions. We can see in Figure 5 that for the KILE
task GraphDoc has the smallest difference between the AP and F1 metrics of all the methods.</p>
        <p>
          (a) GraphDoc [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] breakdown of results based on the document source and
zero/few/many-shot layout for the KILE task.
        </p>
        <p>
          (b) GraphDoc [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] breakdown of results based on the document source and
zero/few/many-shot layout for the LIR task.
        </p>
        <p>This is not the case for LIR, maybe because here AP was not the main evaluation metric, and so
less focus might have been given to assigning a correct score to the predictions in this case.</p>
        <p>Since there is a noticeable difference between the behaviour of the AP and F1 metrics,
benchmark submissions are allowed to mark some fields with the flag use_only_for_ap, which includes
them only in the AP computation while excluding them from the F1 computation. Unfortunately, no
submission has utilized this feature, so we cannot evaluate its effect.</p>
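        <p>The intended semantics of the flag can be sketched as follows (the dictionary-based field representation here is hypothetical; the benchmark uses its own submission format):</p>

```python
def split_for_metrics(fields):
    """Sketch of the use_only_for_ap flag semantics.

    `fields` is a list of dicts describing predicted fields. A field
    flagged with use_only_for_ap is scored by the AP metric but is
    excluded from the F1 computation.
    """
    ap_fields = fields  # the AP metric sees every submitted field
    f1_fields = [f for f in fields if not f.get("use_only_for_ap", False)]
    return ap_fields, f1_fields
```

        <p>This lets a method submit low-confidence, low-scored extra predictions that can only help AP (via the tail of the ranking) without hurting its F1 precision.</p>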
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Potential gain from hand-crafted post-processing rules</title>
        <p>
          In the GraphDoc [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] submission, the authors use multiple post-processing rules to mitigate
some common errors. Two of these rules deal with the problem that the granularity of the input
OCR words is not always fine enough for some specific field types:
($) When a word box has the predicted field type currency_code_amount_due and contains
the symbol '$', return just the value '$' and split the bounding box.
(#) For field types with the suffix _id, when the predicted text starts with the symbol '#',
remove this symbol and split the bounding box.
        </p>
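        <p>As an illustration, the text side of these two rules could be implemented along the following lines (a sketch with hypothetical function names, not GraphDoc's actual code; the bounding-box splitting itself is omitted):</p>

```python
def apply_split_rules(field_type: str, text: str) -> str:
    """Sketch of the ($) and (#) splitting heuristics on the text level.

    Returns the text to keep for the field; the caller would then shrink
    the bounding box to cover only the kept characters.
    """
    # Rule ($): keep only the '$' symbol for the currency-code field type.
    if field_type == "currency_code_amount_due" and "$" in text and text != "$":
        return "$"
    # Rule (#): strip a leading '#' from field types with the _id suffix.
    if field_type.endswith("_id") and text.startswith("#"):
        return text[1:]
    return text
```

        <p>Both rules only fire when the OCR word is coarser than the annotated field, which is exactly the split-field situation analyzed below.</p>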
        <p>
          In this section, we explore the potential impact of such heuristic rules and verify
whether we see a gap between GraphDoc [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] and the other methods on the affected field
types. In the following analysis, we consider a method that uses the OCR provided with the
dataset and that creates final fields by taking a union of the bounding boxes of several of the input
OCR words (their snapped versions). Although this is not true for YOLOv8 [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], which predicts
bounding boxes directly, the other methods more or less follow these criteria.
        </p>
        <p>First, let us generalize the two rules above. We say a field is a split field if there is no word
token that matches the field location, i.e., that covers the same set of Pseudo-Character-Centers
(PCCs) as defined in Figure 2, and if there exists a word token that covers a superset of the PCCs
covered by the field. The numbers of split fields for each field type in the training and validation
sets are listed in Tables 1 and 2. For KILE, a total of 7.5% of fields are split fields, while for LIR it
is 3.3% of all fields. Since handling these cases usually affects both precision and recall, it has
the potential to improve the final metrics by several percentage points.</p>
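        <p>The split-field condition can be expressed directly on sets of PCC indices (a minimal sketch; the PCC indices and the word tokenization are assumed to come from the dataset's OCR):</p>

```python
def is_split_field(field_pccs, word_pcc_sets):
    """Check the split-field condition on Pseudo-Character-Center sets.

    field_pccs: set of PCC indices covered by the annotated field.
    word_pcc_sets: list of PCC index sets, one per OCR word token.
    A field is a "split field" if no word token covers exactly the same
    PCCs, but some word token covers a strict superset of them.
    """
    field_pccs = set(field_pccs)
    exact_match = any(set(w) == field_pccs for w in word_pcc_sets)
    strict_superset = any(set(w) > field_pccs for w in word_pcc_sets)
    return (not exact_match) and strict_superset
```

        <p>For example, a field "A-100" inside an OCR word "#A-100" covers a strict subset of the word's PCCs and is therefore a split field; no OCR-word-union method can output it exactly without a splitting heuristic.</p>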
        <p>Now let us focus specifically on the rules ($) and (#) listed above. We say a split field
follows the rule ($) if its text is equal to just the symbol '$' and it is covered by a word that
contains this symbol in its text. In the training and validation sets, the affected field types are
only currency_code_amount_due and line_item_currency, as shown in Table 3. This
represents 4.9% of all KILE fields and less than 0.1% of all LIR fields.</p>
        <p>
          We say a split field follows the rule (#) if there is a word covering this field that has exactly the
same text with an additional symbol '#' prepended at the beginning. The number of split fields
satisfying the rule (#) for each affected field type is listed in Table 4. In total, this represents
0.5% of all KILE fields and less than 0.1% of all LIR fields, and, as noticed by GraphDoc [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ],
most of these fields have a type with the suffix _id.
        </p>
        <p>
          Let us now verify whether we see the impact of the GraphDoc [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] post-processing rules on
the test set predictions. In Figure 15 we compare all of the methods on the whole test subset,
on the currency_code_amount_due field type (only for KILE) and on field types with the
suffix _id. As expected, we see that rule ($) gives GraphDoc a big edge over most of the other
methods. Exceptions are YOLOv8 [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], which does not have the same limitations connected to
the OCR input, and ViBERTGrid, which also demonstrates decent performance on this field
type; the reasons behind its success are unknown to us, as we have not received a paper
describing this method in more detail. For (#) we do not see GraphDoc outperforming the other
methods (compared to the results on all field types) on either of the two tasks, which
matches the observations from the analysis on the training and validation sets above.
        </p>
        <p>
          From these results it is apparent that the small trick of extracting just the symbol '$' out
of the word boxes predicted to have the class currency_code_amount_due pushed the
GraphDoc [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] results on KILE several percentage points up compared to most of the other
methods.
        </p>
        <p>(a) KILE evaluated with AP on different subsets of field types. AP: all field types, AP ($): field
type currency_code_amount_due, AP (#): all KILE field types with suffix _id.</p>
        <p>(b) LIR evaluated with F1 on different subsets of field types. F1: all field types, F1 (#): field
type line_item_order_id.</p>
        <p>Figure 15: Evaluation of the methods on field types affected by the GraphDoc splitting heuristics for
Task 1: KILE (15a) and Task 2: LIR (15b).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We presented the first edition of the DocILE 2023 competition, which consisted of two tracks:
KILE and LIR. Both tasks consist of the detection of pre-defined categories of information in business
documents. The latter task additionally requires grouping the information into tuples. In the
end, we obtained 5 submissions for KILE and 4 submissions for LIR. The diversity of the chosen
approaches shows the potential of the DocILE dataset and benchmark, which spans the domains
of computer vision, layout analysis, and natural language processing. Unsurprisingly, some
of the submissions used a multi-modal approach. The values of the respective error metrics
indicate that the benchmark is non-trivial and the problems are far from being solved.</p>
      <p>The benchmark remains open to new submissions, leaving it as a springboard for future
research and for the document understanding community. To point out just a few possible
research questions for this benchmark: 1) How to best use the unlabeled and synthetic datasets
(as most of the solutions did not focus on these parts of the dataset)? 2) Is it possible to better
utilize the fact that many documents share the same layout and push the performance on the
few-shot subset closer to the performance on the many-shot subset? 3) Which parts of the
tasks are better solved by pure NLP solutions (such as the baselines), which are better solved
by pure CV solutions (such as YOLOv8) and do the multi-modal solutions (such as GraphDoc)
already utilize both of the modalities to their full potential or is one of the modalities still
under-utilized?</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Š.</given-names>
            <surname>Šimsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uřičář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kocián</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skalický</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <article-title>Overview of DocILE 2023: Document Information Localization and Extraction</article-title>
          , in:
          <string-name><given-names>A.</given-names><surname>Arampatzis</surname></string-name>
          ,
          <string-name><given-names>E.</given-names><surname>Kanoulas</surname></string-name>
          ,
          <string-name><given-names>T.</given-names><surname>Tsikrika</surname></string-name>
          ,
          <string-name><given-names>S.</given-names><surname>Vrochidis</surname></string-name>
          ,
          <string-name><given-names>A.</given-names><surname>Giachanou</surname></string-name>
          ,
          <string-name><given-names>D.</given-names><surname>Li</surname></string-name>
          ,
          <string-name><given-names>M.</given-names><surname>Aliannejadi</surname></string-name>
          ,
          <string-name><given-names>M.</given-names><surname>Vlachos</surname></string-name>
          ,
          <string-name><given-names>G.</given-names><surname>Faggioli</surname></string-name>
          ,
          <string-name><given-names>N.</given-names><surname>Ferro</surname></string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), LNCS</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <article-title>ViBERTGrid: a jointly trained multi-modal 2d document representation for key information extraction from documents</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Katti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reisswig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brarda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Höhne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Faddoul</surname>
          </string-name>
          ,
          <article-title>Chargrid: Towards understanding 2d documents</article-title>
          , in: EMNLP,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>LayoutLM: Pre-training of text and layout for document image understanding</article-title>
          ,
          <source>in: KDD</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Florencio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Che</surname>
          </string-name>
          , et al.,
          <article-title>LayoutLMv2: Multi-modal pre-training for visually-rich document understanding</article-title>
          ,
          <source>ACL</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>LayoutLMv3: Pre-training for document ai with unified text and image masking</article-title>
          , in: ACM-MM,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>BROS: A pre-trained language model focusing on text and layout for better key information extraction from documents</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nishida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoshida</surname>
          </string-name>
          ,
          <article-title>VisualMRC: Machine reading comprehension on document images</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Powalski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Borchmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dwojak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietruszka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pałka</surname>
          </string-name>
          ,
          <article-title>Going full-tilt boogie on document understanding with text-image-layout transformer</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>Unifying Vision, Text, and Layout for Universal Document Processing</article-title>
          , arXiv (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Multimodal pre-training based on graph attention network for document understanding</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tensmeyer</surname>
          </string-name>
          ,
          <article-title>Deep visual template-free form parsing</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hammami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Héroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>d'Andecy</surname>
          </string-name>
          ,
          <article-title>One-shot field spotting on colored forms using subgraph isomorphism</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          , L. Jiang,
          <article-title>irmp: From printed forms to relational data model</article-title>
          ,
          <source>in: HPCC</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <article-title>ICDAR2019 competition on scanned receipt OCR and information extraction</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccinno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Eisenschlos</surname>
          </string-name>
          ,
          <article-title>TAPAS: Weakly supervised table parsing via pre-training</article-title>
          , arXiv (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schreiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dengel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>DeepDeSRT: Deep learning for detection and structure recognition of tables in document images</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jimeno-Yepes</surname>
          </string-name>
          ,
          <article-title>PubLayNet: Largest dataset ever for document layout analysis</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lohani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Belaïd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belaïd</surname>
          </string-name>
          ,
          <article-title>An invoice reading system using a graph convolutional network</article-title>
          ,
          <source>in: ACCV Workshops</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Potti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Wendt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Representation learning for information extraction from form-like documents</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Riba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goldmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lladós</surname>
          </string-name>
          ,
          <article-title>Table detection in invoice documents by graph neural networks</article-title>
          ,
          <source>in: ICDAR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <article-title>DocVQA: A dataset for VQA on document images</article-title>
          ,
          <source>in: WACV</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bagal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Valveny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <article-title>InfographicVQA</article-title>
          ,
          <source>in: WACV</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Š.</given-names>
            <surname>Šimsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uřičář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kocián</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skalický</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <article-title>DocILE Benchmark for Document Information Localization and Extraction</article-title>
          ,
          <source>in: 17th International Conference on Document Analysis and Recognition, ICDAR 2023, San José, California, USA, August 21-26, 2023</source>
          , Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Skalický</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Š.</given-names>
            <surname>Šimsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uřičář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <article-title>Business document information extraction: Towards practical benchmarks</article-title>
          ,
          <source>in: CLEF</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <source>Industry Documents Library</source>
          , https://www.industrydocuments.ucsf.edu/. Accessed: 2022-10-20.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <source>Public Inspection Files</source>
          , https://publicfiles.fcc.gov/. Accessed: 2022-10-20.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , et al.,
          <article-title>ImageNet large scale visual recognition challenge</article-title>
          ,
          <source>IJCV</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
          ,
          <source>in: ICCV</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Mindee</surname>
          </string-name>
          ,
          <source>docTR: Document Text Recognition</source>
          , https://github.com/mindee/doctr,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>K.</given-names>
            <surname>Olejniczak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          ,
          <article-title>Text Detection Forgot About Document OCR</article-title>
          ,
          <source>in: CVWW</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Š.</given-names>
            <surname>Šimsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šulc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skalický</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <article-title>DocILE 2023 Teaser: Document Information Localization and Extraction</article-title>
          , in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.),
          <source>Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part III</source>
          , volume
          <volume>13982</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>600</fpage>
          -
          <lpage>608</lpage>
          . doi:10.1007/978-3-031-28241-6_69.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>USTC-iFLYTEK at DocILE: a Multi-modal approach using Domain-specific GraphDoc</article-title>
          , in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.),
          <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 18th to 21st</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>LiLT: A simple yet effective language-independent layout transformer for structured document understanding</article-title>
          ,
          <source>ACL</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Agam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heard</surname>
          </string-name>
          ,
          <article-title>Building a test collection for complex document information processing</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-N. M.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Bui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>UnionRoBERTa: RoBERTas Ensemble Technique for Competition on Document Information Localization and Extraction</article-title>
          , in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.),
          <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 18th to 21st</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <article-title>ViBERTgrid: A Jointly Trained Multi-modal 2D Document Representation for Key Information Extraction from Documents</article-title>
          , in: J. Lladós, D. Lopresti, S. Uchida (Eds.),
          <source>16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I</source>
          , volume
          <volume>12821</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2021</year>
          , pp.
          <fpage>548</fpage>
          -
          <lpage>563</lpage>
          . doi:10.1007/978-3-030-86549-8_35.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gruber</surname>
          </string-name>
          ,
          <article-title>Object Detection Pipeline Using YOLOv8 for Document Information Extraction</article-title>
          , in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.),
          <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 18th to 21st</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jocher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaurasia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <source>YOLO by Ultralytics</source>
          ,
          <year>2023</year>
          . URL: https://github.com/ultralytics/ultralytics.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Katti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reisswig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brarda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Höhne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Faddoul</surname>
          </string-name>
          ,
          <article-title>Chargrid: Towards Understanding 2D Documents</article-title>
          , in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>