<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Data Augmentations for Document Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>AP-figure AP-heading AP-listitem AP-table AP-text</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Clova AI, NAVER Corp</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Korea University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <volume>07685</volume>
      <fpage>1192</fpage>
      <lpage>1200</lpage>
      <abstract>
        <p>Data augmentation has the potential to significantly improve the generalization capability of deep neural networks. In image recognition especially, recent augmentation techniques such as Mixup, Cutout, CutMix, and RandAugment have brought large performance improvements. These techniques have also proven effective in semi-supervised and self-supervised learning. Despite their usefulness, they cannot be applied directly to document image analysis, which requires preserving the semantic features of text. To tackle this problem, we propose novel augmentation methods, DocCutout and DocCutMix, that are better suited to document images: they apply the transform to each word unit and thus preserve text semantic features during augmentation. We conduct extensive experiments to find the most effective data augmentation techniques among various approaches for document object detection and show that our proposed augmentation methods outperform state-of-the-art methods, with a gain of +1.77 AP on the PubMed dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>DocCutout</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In modern machine learning with deep neural networks,
data augmentation is a de facto standard solution for enlarging
limited training data and improving the generalization
capability of models, given that most
state-of-the-art models require data at massive scale.</p>
      <p>
        In general, data augmentation contributes greatly to
improving the accuracy of machine learning models,
in line with the Vicinal Risk Minimization (VRM)
principle (Zhang et al. 2017). In computer vision, there
have been numerous successful approaches, each
built on its own strategy. For instance, Mixup (Zhang
et al. 2017) uses a linear interpolation between two
different training instances, and CutMix (Yun et al. 2019)
cuts a patch from one image and pastes it onto another.
        <xref ref-type="bibr" rid="ref6">(Cubuk et al. 2019;
2020)</xref>
        also presented automated augmentation techniques
that search for the best combination among several
transformations.
        <xref ref-type="bibr" rid="ref12">(DeVries and Taylor 2017)</xref>
        explained that
data augmentation by randomly masking part of an image, as in
Cutout, works as a regularizer.
      </p>
      <p>*Work done during an internship at Clova AI, NAVER Corp.
Copyright © 2021, for this paper by its authors. Use permitted
under Creative Commons License Attribution 4.0 International (CC
BY 4.0).</p>
      <p>Figure panels: (a) Original; (b) DocCutout; (c) Figure source image; (d) DocCutMix.</p>
      <sec id="sec-1-1">
        <p>
Data augmentation is used not only to improve the
generalization capability of a model but also to solve
other problems. For instance, (Zhang et al. 2017) showed that
Mixup can increase robustness to adversarial examples.
          <xref ref-type="bibr" rid="ref15">(Hendrycks et al. 2019)</xref>
          improved robustness to
perturbation and uncertainty by combining automatically augmented
images with the original image through Mixup. For self-supervised
and semi-supervised learning, data augmentation
plays an important role in teaching models
representations from unlabeled data. For instance, consistency
regularization
          <xref ref-type="bibr" rid="ref28 ref4">(Miyato et al. 2018; Berthelot et al. 2019;
Sohn et al. 2020)</xref>
          , which trains a model to classify differently augmented
versions of the same source data identically, is one of the
most representative recent semi-supervised learning
methods.
        </p>
        <p>Although data augmentation methods have been
widely proposed and used, there have been few attempts to
augment document images. Naturally, when a
document image is augmented, the created image must not lose the semantic
information of the text area. However, the data
augmentation methods described above do not take this requirement into
account.</p>
        <p>In this paper, we propose, for the first time, data
augmentation methods for document image analysis. To account
for the fact that a document image consists of image and text
regions with different formats, we propose an
augmentation technique that applies the transform independently to
each word unit. In particular, we present two methods,
DocCutout and DocCutMix, that strengthen the regularization of
the model and greatly enhance performance.</p>
        <p>Our contributions can be summarized as follows:</p>
        <list list-type="bullet">
          <list-item><p>We argue that recently studied data augmentations, such as Mixup, Cutout, and CutMix, lose word-level semantic information in document images.</p></list-item>
          <list-item><p>We propose two data augmentation methods, namely DocCutout and DocCutMix, that operate on word-level images.</p></list-item>
          <list-item><p>Through various experiments, we show that the proposed DocCutout and DocCutMix are effective and generalize well.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Data Augmentation</title>
        <p>In computer vision, data augmentation is a classic and
standard way to improve neural networks, and new
augmentation methods continue to be published.</p>
        <p>
          Cutout
          <xref ref-type="bibr" rid="ref12">(DeVries and Taylor 2017; Zhong et al. 2020)</xref>
          is an
augmentation technique that masks a part of an image.
          <xref ref-type="bibr" rid="ref12">(DeVries and Taylor 2017)</xref>
          explained that Cutout acts as a
regularizer of the model, like dropout. Mixup, proposed
by (Zhang et al. 2017), is a linear interpolation between two
data points. According to the authors, Mixup has the effect of Vicinal
Risk Minimization, which improves not only the
accuracy of the model but also its robustness.
        </p>
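        <p>The linear interpolation behind Mixup can be illustrated with a minimal sketch. This is not the original authors' code: the function name <monospace>mixup</monospace>, the flat-list data layout, and the default <monospace>alpha=1.0</monospace> are assumptions made for illustration. Drawing the mixing weight from a Beta(alpha, alpha) distribution follows the Mixup paper.</p>
        <preformat>
```python
import random


def mixup(x_a, x_b, y_a, y_b, alpha=1.0, rng=random):
    """Blend two flattened samples and their one-hot labels with one weight.

    alpha is a hypothetical default; the weight lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Two toy "images" with two pixels each, and their one-hot labels.
x, y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  rng=random.Random(0))
```
        </preformat>
        <p>Because every pixel of the result blends both sources, no region of the mixed image belongs to a single class, which is the localization problem discussed next.</p>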
        <p>Mixup is not suitable for localization tasks because
features of different classes are mixed throughout the
created image. To overcome this, (Yun et al. 2019) proposed
CutMix. CutMix replaces some patches of an image with
patches of other images, and the target label is linearly
interpolated according to the area ratio of the two images.</p>
        <p>
          Data augmentation is also being studied in the field of
NLP
          <xref ref-type="bibr" rid="ref19 ref2 ref9">(Kobayashi 2018; Wei and Zou 2019; Bari,
Mohiuddin, and Joty 2020)</xref>
          .
          <xref ref-type="bibr" rid="ref9">(Wei and Zou 2019)</xref>
          improved accuracy
in NLP tasks including text classification by using methods
such as Synonym Replacement, Random Insertion, Random
Swap, and Random Deletion together. BERT
          <xref ref-type="bibr" rid="ref10">(Devlin et al.
2019)</xref>
          , a pre-trained language model that has driven
remarkable progress in various NLP tasks, also
performs a kind of data manipulation. BERT performs
self-supervised learning by masking some tokens in the input and
training the model to predict the masked parts.
        </p>
        <p>
          There are a few studies on data augmentation in the
field of text image data. In natural scene text image cases,
          <xref ref-type="bibr" rid="ref13 ref17 ref23">(Gupta, Vedaldi, and Zisserman 2016; Liao et al. 2020;
Jaderberg et al. 2014; Wu et al. 2019a)</xref>
          create text images by
synthesizing arbitrary text on a natural scene image. These
studies help improve the performance of
text detection, but they are difficult to extend to tasks that require
recognizing text semantics in images, such as document
layout analysis. When data augmentation is performed for
document layout analysis, both the visual features of images and
the semantic features of text should remain realistic. Our
proposed data augmentation methods satisfy this condition
through word-based masking or mixing between two data instances.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Document Layout Analysis</title>
        <p>Document layout analysis is a task of identifying the regions
of interest in a document image to extract necessary
information from the document. There are two approaches to this
task, utilizing visual or textual information in the document.</p>
        <p>
          One is to employ the object detection model in the
computer vision field.
          <xref ref-type="bibr" rid="ref14">(Hao et al. 2016)</xref>
          and
          <xref ref-type="bibr" rid="ref30">(Schreiber et al.
2017)</xref>
          proposed table detection models for document images
based on CNN and Faster R-CNN, respectively.
          <xref ref-type="bibr" rid="ref9">(Soto and
Yoo 2019)</xref>
          also utilized Faster R-CNN for object detection,
but classified nine classes, including tables, in document images.
        </p>
        <p>
          The other is to perform entity extraction in the natural
language field.
          <xref ref-type="bibr" rid="ref18">(Katti et al. 2018)</xref>
          encoded a document image
as a 2D grid of characters and applied a fully convolutional
encoder-decoder network for information extraction.
          <xref ref-type="bibr" rid="ref16 ref9">(Denk
and Reisswig 2019; Hwang et al. 2019)</xref>
          proposed models
based on BERT in order to utilize its rich,
contextualized word representations.
        </p>
        <p>
          The above approaches used only visual or only textual
information in the document. However, in real documents,
visual and textual information are strongly related in order
to represent the contents effectively.
Given this characteristic, it is desirable to
perform document layout analysis using both visual and textual
information. Few studies consider both kinds of
information so far, but the topic is
being actively studied.
          <xref ref-type="bibr" rid="ref27">(Liu et al. 2019)</xref>
          and (Yu et al. 2020)
utilized graph convolution to obtain visual text embeddings
and combined them with token embeddings to feed the combined
representation into a BiLSTM-CRF model.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>In this section, we first introduce previous data augmentation
methods tailored to image understanding tasks, i.e., Cutout
and CutMix, and their limitations when directly applied to
document image analysis, and then present our data
augmentation techniques, called DocCutout and DocCutMix.</p>
      <sec id="sec-3-1">
        <title>Motivation</title>
        <p>
          Cutout. Cutout introduced by
          <xref ref-type="bibr" rid="ref12">(DeVries and Taylor 2017)</xref>
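          is a powerful regularization technique that makes deep
neural networks generalize better by randomly dropping a region of the
input image, which extends dropout (Srivastava et
al. 2014) by working on the input itself. It encourages
the network to focus on less discriminative regions of the
input, thereby improving generalization capability.
Specifically, let <inline-formula><tex-math>$X \in \mathbb{R}^{W \times H \times C}$</tex-math></inline-formula> and <inline-formula><tex-math>$M \in \{0, 1\}^{W \times H}$</tex-math></inline-formula> denote an input image and a binary mask indicating
where to drop out, respectively. In Cutout, a new augmented
image <inline-formula><tex-math>$\tilde{X}$</tex-math></inline-formula> is sampled such that
          <disp-formula><label>(1)</label><tex-math>\tilde{X} = M \odot X,</tex-math></disp-formula>
where <inline-formula><tex-math>$\odot$</tex-math></inline-formula> indicates the element-wise multiplication operator. To
act as a regularizer, the binary mask M is randomly
sampled in the form of bounding box coordinates <inline-formula><tex-math>$B = (r_x, r_y, r_w, r_h)$</tex-math></inline-formula> such that
          <disp-formula><label>(2)</label><tex-math>\begin{gathered} r_x \sim \mathrm{Unif}(0, W), \quad r_w = W\sqrt{\lambda}, \\ r_y \sim \mathrm{Unif}(0, H), \quad r_h = H\sqrt{\lambda}, \end{gathered}</tex-math></disp-formula>
where <inline-formula><tex-math>$\lambda$</tex-math></inline-formula> denotes the drop ratio. The box position is sampled from
the uniform distribution Unif, and the box size makes
the cropped area ratio <inline-formula><tex-math>$r_w r_h / (W H) = \lambda$</tex-math></inline-formula>. The binary mask M
is set to 0 within the bounding box B and to 1
otherwise.</p>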
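        <p>The Cutout sampling rule can be sketched as follows. This is an illustrative sketch, not the reference implementation: the names <monospace>sample_cutout_box</monospace> and <monospace>cutout</monospace>, the list-of-rows image layout, and the reading of (rx, ry) as the top-left corner of the box are assumptions.</p>
        <preformat>
```python
import math
import random


def sample_cutout_box(width, height, drop_ratio, rng=random):
    """Sample (rx, ry, rw, rh) so that rw * rh / (width * height) equals drop_ratio."""
    rw = width * math.sqrt(drop_ratio)
    rh = height * math.sqrt(drop_ratio)
    rx = rng.uniform(0, width)
    ry = rng.uniform(0, height)
    return rx, ry, rw, rh


def cutout(image, box):
    """Return a copy of image (a list of pixel rows) with the box set to 0.

    Assumption: (rx, ry) is the top-left corner; the box is clipped to the image.
    """
    rx, ry, rw, rh = box
    height, width = len(image), len(image[0])
    x0, y0 = int(rx), int(ry)
    x1, y1 = min(width, int(rx + rw)), min(height, int(ry + rh))
    out = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = 0  # mask value 0 inside the box, pixel unchanged elsewhere
    return out


image = [[1] * 4 for _ in range(4)]
masked = cutout(image, (1, 1, 2, 2))  # zeroes the central 2 x 2 block
```
        </preformat>
        <p>Because the box may extend past the image border and is clipped, the area actually masked can be smaller than the nominal drop ratio.</p>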
        <p>CutMix. Even though Mixup (Zhang et al. 2017), based on
linear interpolation of two different training instances, can
greatly improve a model’s performance on general
classification tasks, it has limited localization ability (Yun et al.
2019), which is a bottleneck when applying it to tasks such as
object detection. To overcome this limitation, a new
augmentation method called CutMix was proposed, in which a patch
of one instance is replaced with a patch cut from another instance
and the model is trained with ground-truth labels mixed
in proportion to the mixed areas. CutMix has been shown
to combine the advantages of Cutout and Mixup and to
outperform them, especially in the weakly supervised object
localization task.</p>
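        <p>The cut-and-paste step and the area-proportional label mixing just described can be sketched as follows. The function name <monospace>cutmix</monospace>, the integer top-left box convention, and the list-based image and label layout are assumptions for illustration only.</p>
        <preformat>
```python
def cutmix(image_a, image_b, label_a, label_b, box):
    """Paste the box region of image_b into a copy of image_a and mix the labels.

    box = (rx, ry, rw, rh) in integer pixels, with (rx, ry) the top-left corner
    (an assumed convention). Labels are one-hot lists of equal length.
    """
    rx, ry, rw, rh = box
    height, width = len(image_a), len(image_a[0])
    x1, y1 = min(width, rx + rw), min(height, ry + rh)
    mixed = [row[:] for row in image_a]
    for y in range(ry, y1):
        mixed[y][rx:x1] = image_b[y][rx:x1]
    # Label weight of image_b is the fraction of the image area it now covers.
    lam = (x1 - rx) * (y1 - ry) / (width * height)
    label = [(1.0 - lam) * a + lam * b for a, b in zip(label_a, label_b)]
    return mixed, label


zeros = [[0] * 4 for _ in range(4)]
ones = [[1] * 4 for _ in range(4)]
mixed, label = cutmix(zeros, ones, [1.0, 0.0], [0.0, 1.0], (0, 0, 2, 2))
```
        </preformat>
        <p>In this toy call a 2 x 2 patch covers a quarter of the 4 x 4 image, so the second class receives a label weight of 0.25.</p>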
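        <p>Specifically, CutMix generates a new augmented image
<inline-formula><tex-math>$\tilde{X}$</tex-math></inline-formula> following the rule
          <disp-formula><label>(3)</label><tex-math>\tilde{X} = M \odot X_A + (\mathbf{1} - M) \odot X_B,</tex-math></disp-formula>
where <inline-formula><tex-math>$X_A$</tex-math></inline-formula> is one instance and <inline-formula><tex-math>$X_B$</tex-math></inline-formula> is another instance. The
new label is determined by taking into account the mixing ratio as
          <disp-formula><label>(4)</label><tex-math>\tilde{y} = (1 - \lambda)\, y_A + \lambda\, y_B,</tex-math></disp-formula>
where <inline-formula><tex-math>$y_A$</tex-math></inline-formula> and <inline-formula><tex-math>$y_B$</tex-math></inline-formula> are the ground-truth labels for <inline-formula><tex-math>$X_A$</tex-math></inline-formula> and <inline-formula><tex-math>$X_B$</tex-math></inline-formula>,
respectively. The cropping variables are determined in the same way
as in Cutout:
          <disp-formula><label>(5)</label><tex-math>\begin{gathered} r_x \sim \mathrm{Unif}(0, W), \quad r_w = W\sqrt{\lambda}, \\ r_y \sim \mathrm{Unif}(0, H), \quad r_h = H\sqrt{\lambda}. \end{gathered}</tex-math></disp-formula>
        </p>
        <p>Limitations. Although effective at improving the
generalization capability of models, Cutout and
CutMix cannot be directly deployed for tasks that require localizing
text units, such as document layout analysis, document
table detection, and document text detection. For object
detection in an image, a model is generally able to localize
an object even when cropping or occlusion occurs, by
focusing on the textures and shapes of the remaining parts.
When localizing and recognizing text, however, the
shape of the text is far more important than its texture,
so words partially occluded by Cutout or
CutMix may not be recoverable and may be recognized as entirely
different from the original letters. Figure 2 illustrates this
phenomenon. To overcome this limitation, a technique that
handles images and text in a document image separately is
needed, which is the topic of this paper.</p>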
        <p>Figure 2: (a) Original natural scene image; (b) natural scene image with Cutout applied; (c) OCR results for the image with Cutout applied.</p>
        <p>
          First, we present an augmentation method, DocCutout, that
maintains the shape of the text in words during augmentation, thus
overcoming the limitations of the original Cutout. Note
that, due to the nature of document images,
bounding box annotations of word units are relatively easy to
obtain. Most document image datasets were
created from LaTeX or XML metadata
          <xref ref-type="bibr" rid="ref20 ref22 ref23 ref7 ref9">(Zhong, Tang,
and Yepes 2019; Li et al. 2020b)</xref>
          , and thus such bounding box annotations are readily
accessible. When they are not available, it is
relatively easy to extract the characters and their positions
through optical character recognition (OCR) methods
          <xref ref-type="bibr" rid="ref1">(Baek
et al. 2019)</xref>
          .
        </p>
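        <p>DocCutout basically follows the rule of Cutout in
Equation 1, but the value 0 is replaced with a fill value matrix F,
which holds the value to be filled into the masked region:
          <disp-formula><label>(6)</label><tex-math>\tilde{X} = M \odot X + (\mathbf{1} - M) \odot F.</tex-math></disp-formula>
We experimented with F set to 0, meaning black, and to 1, meaning
white.
DocCutout differs from Cutout in that we cut regions
from image boxes and text boxes independently. Let <inline-formula><tex-math>$B = (b_x, b_y, b_w, b_h)$</tex-math></inline-formula> be the bounding box
where masking is attempted, and let words(B) be the set of
word boxes in B. The masking box coordinates are sampled
as follows. If label(B) is “figure”:
          <disp-formula><tex-math>\begin{gathered} r_x \sim \mathrm{Unif}(b_x, b_x + b_w), \quad r_w = b_w\sqrt{\lambda}, \\ r_y \sim \mathrm{Unif}(b_y, b_y + b_h), \quad r_h = b_h\sqrt{\lambda}; \end{gathered}</tex-math></disp-formula>
otherwise:
          <disp-formula><label>(7)</label><tex-math>\begin{gathered} i \in \mathrm{sample}(\mathrm{len}(\mathrm{words}(B)), r), \quad t^i \in \mathrm{words}(B), \\ r_x^i = t_x^i, \quad r_w^i = t_w^i, \\ r_y^i = t_y^i, \quad r_h^i = t_h^i, \end{gathered}</tex-math></disp-formula>
where sample(len(words(B)), r) samples indices of words(B)
with probability ratio r.</p>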
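        <p>A minimal sketch of this sampling rule is given below. It is not the authors' implementation: the names <monospace>doccutout_boxes</monospace> and <monospace>fill_boxes</monospace>, the (x, y, w, h) box layout, the default ratios, and the reading of the word sampling as an independent draw per word are all assumptions for illustration.</p>
        <preformat>
```python
import math
import random


def doccutout_boxes(region, label, words, drop_ratio=0.4, word_ratio=0.33, rng=random):
    """Return the (x, y, w, h) boxes to mask inside one labelled region.

    drop_ratio and word_ratio are hypothetical defaults, not values from the paper.
    """
    bx, by, bw, bh = region
    if label == "figure":
        # Figure region: one Cutout-style box whose area is drop_ratio of the region.
        rw, rh = bw * math.sqrt(drop_ratio), bh * math.sqrt(drop_ratio)
        return [(rng.uniform(bx, bx + bw), rng.uniform(by, by + bh), rw, rh)]
    # Text-like region: mask whole word boxes so that no word is partially occluded.
    # Assumption: each word is selected independently with probability word_ratio.
    return [w for w in words if word_ratio > rng.random()]


def fill_boxes(image, boxes, fill=255):
    """Return a copy of image (a list of pixel rows) with every box set to fill."""
    out = [row[:] for row in image]
    height, width = len(out), len(out[0])
    for x, y, w, h in boxes:
        for yy in range(max(0, int(y)), min(height, int(y + h))):
            for xx in range(max(0, int(x)), min(width, int(x + w))):
                out[yy][xx] = fill
    return out


words = [(0, 0, 2, 1), (2, 0, 2, 1)]  # two word boxes on the first text line
page = [[0] * 4, [0] * 4]
chosen = doccutout_boxes((0, 0, 4, 1), "text", words, word_ratio=1.0)
whitened = fill_boxes(page, chosen)
```
        </preformat>
        <p>Masking complete word boxes keeps every remaining word intact, which is the property that plain Cutout cannot guarantee on text.</p>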
        <p>
          Such word-by-word masking is not only similar to Cutout,
but also similar to the masking method used in BERT
          <xref ref-type="bibr" rid="ref10">(Devlin et al. 2019)</xref>
          , which has been proven to train
NLP models effectively in a self-supervised fashion. The
difference is that, while BERT masks word semantic feature vectors,
we mask the visual features of text images,
which allows the model to learn text styles such as font and color.
Note that some studies have attempted
document layout analysis through natural language processing
of post-OCR data
          <xref ref-type="bibr" rid="ref22">(Xu et al. 2020)</xref>
          or detection modules in
the field of Computer Vision
          <xref ref-type="bibr" rid="ref20 ref22 ref5">(Li et al. 2020a)</xref>
          , but they did
not utilize visual features and word units at the same time.
To make further progress in document layout
analysis, we need to build a unified model that utilizes both the visual
features and the semantic features of text. Unified models such
as
          <xref ref-type="bibr" rid="ref27">(Liu et al. 2019)</xref>
          and (Yu et al. 2020) have been studied
recently. DocCutout has considerable room
for application in such unified models.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>DocCutMix</title>
        <p>We also present DocCutMix, which, inspired by CutMix, replaces part of an image
with part of another image. The main
difference is that it preserves the meaning of word units
by replacing some words, or patches in figures, in one
image with those of other images. Moreover, to preserve the
plausibility of the augmented image, the labeled class of the
sampled target patch must be the same as that of the original patch,
because text styles vary
greatly with the class. For instance, the letters of
the heading class are usually bold or colorful, while most
general text is not.</p>
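        <p>DocCutMix can be formulated as follows:
          <disp-formula><label>(8)</label><tex-math>\tilde{x} = \Big(\mathbf{1} - \sum_{i=1}^{\|S\|} M_i\Big) \odot x + \sum_{i=1}^{\|S\|} M_i \odot s_p(\mathrm{label}(s_i)),</tex-math></disp-formula>
where S is the set of original patches sampled with a
certain probability from the original image, <inline-formula><tex-math>$s_i$</tex-math></inline-formula> is the i-th
element of S, and <inline-formula><tex-math>$M_i$</tex-math></inline-formula> is the binary mask covering the region of <inline-formula><tex-math>$s_i$</tex-math></inline-formula>. The sampling probability is described in
detail in Section 4 as a hyper-parameter. The function <inline-formula><tex-math>$s_p$</tex-math></inline-formula>,
short for sample patch, returns one patch selected
from all patches in the mini-batch that have the same class
as the original patch. Further details of the DocCutMix
algorithm are described in Algorithm 1.</p>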
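        <p>The class-matched patch replacement can be sketched in plain Python as below. This is a hedged illustration rather than the authors' code: the names <monospace>doccutmix</monospace> and <monospace>resize_nearest</monospace>, the (x0, y0, x1, y1) box layout, nearest-neighbour resizing, and the default <monospace>mix_portion</monospace> are assumptions, and boxes are assumed to be non-empty.</p>
        <preformat>
```python
import random


def resize_nearest(patch, height, width):
    """Nearest-neighbour resize of a patch stored as a list of pixel rows."""
    src_h, src_w = len(patch), len(patch[0])
    return [[patch[y * src_h // height][x * src_w // width] for x in range(width)]
            for y in range(height)]


def doccutmix(batch, mix_portion=0.5, rng=random):
    """Replace word and figure patches with same-class patches from the mini-batch.

    batch: list of (image, instances); image is a list of pixel rows and
    instances is a list of ((x0, y0, x1, y1), label, is_word) tuples.
    """
    # Pass 1: pool every word or figure patch of the mini-batch by class.
    pool = {}
    for image, instances in batch:
        for (x0, y0, x1, y1), label, is_word in instances:
            if is_word or label == "figure":
                pool.setdefault(label, []).append([row[x0:x1] for row in image[y0:y1]])
    # Pass 2: overwrite selected boxes with a resized patch of the same class.
    mixed_batch = []
    for image, instances in batch:
        out = [row[:] for row in image]
        for (x0, y0, x1, y1), label, is_word in instances:
            if (is_word or label == "figure") and mix_portion > rng.random():
                patch = resize_nearest(rng.choice(pool[label]), y1 - y0, x1 - x0)
                for dy, row in enumerate(patch):
                    out[y0 + dy][x0:x1] = row
        mixed_batch.append((out, instances))
    return mixed_batch


page_a = [[1, 1], [1, 1]]
page_b = [[2, 2], [2, 2]]
word = [((0, 0, 2, 2), "text", True)]
mixed = doccutmix([(page_a, word), (page_b, word)], mix_portion=1.0)
```
        </preformat>
        <p>Pooling patches by class before any replacement means that a heading word is only ever replaced by another heading word, which keeps the augmented page visually plausible.</p>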
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In this section, we report an extensive evaluation to
assess the effectiveness of the proposed augmentation methods,
DocCutout and DocCutMix, through two
main experiments: 1) an ablation study on our
augmentation methods and 2) a comparison with previous
methods.</p>
      <sec id="sec-4-1">
        <title>Experimental Protocol</title>
        <p>
          Dataset. To evaluate the proposed methods, we consider
a standard benchmark, the PubMed dataset
          <xref ref-type="bibr" rid="ref20 ref22 ref5">(Li et al. 2020a)</xref>
          .
PubMed is a subset of PubLayNet
          <xref ref-type="bibr" rid="ref9">(Zhong, Tang, and Yepes
2019)</xref>
          , which is one of the large-scale datasets for document
object detection, sampled specifically from medical journal
articles. PubMed consists of 12,871 document images and
257,830 bounding boxes in 5 classes: text, title,
list, figure, and table. We train the model on the first 9,653
images and evaluate on the remaining 3,218 images. To extract
the word bounding boxes, we used an in-house OCR engine<sup>1</sup>
and estimated the label of each word based on its overlap
with the area of each class in PubMed.
        </p>
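        <p>The overlap-based labelling of OCR word boxes can be sketched as follows. The exact rule is not spelled out in the text, so this sketch assumes that each word takes the class of the annotated region with the largest intersection area; the function names and the (x0, y0, x1, y1) box layout are hypothetical.</p>
        <preformat>
```python
def intersection_area(a, b):
    """Overlap area of two (x0, y0, x1, y1) boxes; 0 when they do not intersect."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)


def label_word(word_box, regions):
    """Return the label of the region that overlaps the word most, or None.

    regions: list of (box, label) pairs taken from the dataset annotations.
    """
    best_label, best_area = None, 0
    for box, label in regions:
        area = intersection_area(word_box, box)
        if area > best_area:
            best_label, best_area = label, area
    return best_label


regions = [((0, 0, 10, 10), "text"), ((10, 0, 20, 10), "table")]
straddling = label_word((8, 2, 14, 4), regions)   # overlaps both regions
outside = label_word((30, 30, 31, 31), regions)   # overlaps neither region
```
        </preformat>
        <p>Words that overlap no annotated region receive no label in this sketch and could simply be excluded from word-level augmentation.</p>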
      </sec>
      <sec id="sec-4-2">
        <title>Baseline models for document object detection</title>
        <p>
          Following the most recent literature
          <xref ref-type="bibr" rid="ref20 ref22 ref5">(Li et al. 2020a)</xref>
          , we chose
Feature Pyramid Networks (FPN)
          <xref ref-type="bibr" rid="ref25">(Lin et al. 2017a)</xref>
          as the
baseline model for document object detection, as it is one of the
most effective methods. In particular, FPN exploits the
pyramidal feature hierarchy of CNNs and builds a feature
pyramid with high-level semantics at all levels, extracting a
mixture of high-level and low-level visual features. It is thus
suitable for document image analysis, since document
images often contain both large-scale objects, some taking up
most of the image, and small-scale objects, such as a very
small list item with a single word. For FPN, we followed the
most common practice and used ResNet-50 as the backbone.
We trained the networks with a momentum SGD optimizer
and an initial learning rate of 0.01, which is divided by 10
after 60,000 of the total 80,000 iterations.
        </p>
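        <p>The step learning-rate schedule described above amounts to the following small function. The name <monospace>learning_rate</monospace> and its parameter names are hypothetical, and any warm-up phase the training framework may add is not modelled.</p>
        <preformat>
```python
def learning_rate(iteration, base_lr=0.01, decay_step=60000, gamma=0.1):
    """Step schedule: base_lr before decay_step, base_lr * gamma from then on."""
    return base_lr * gamma if iteration >= decay_step else base_lr


total_iterations = 80000
schedule = [learning_rate(i) for i in (0, 59999, 60000, total_iterations - 1)]
```
        </preformat>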
        <sec id="sec-4-2-1">
          <title>Footnote 1: https://clova.ai/ocr</title>
          <p>Algorithm 1: Pseudo-code of DocCutMix</p>
          <preformat>
for each training iteration do
    data_batch = get_minibatch(dataset)
    # img is a C × H × W tensor; instances is a list of (bbox, class, isword)
    # bbox = (x0, y0, x1, y1); isword: whether the bounding box instance is a word
    for each (img, instances) in data_batch do
        for each (bbox, class, isword) in instances do
            if isword or class == 'figure' then
                patchimage = img[:, bbox[1]:bbox[3], bbox[0]:bbox[2]]
                patchlist[class].append(patchimage)      # one patchlist per class
    for each (img, instances) in data_batch do
        for each (bbox, class, isword) in instances do
            if Random(0, 1) &lt; mixportion then
                if isword or class == 'figure' then
                    r_i = Unif(0, len(patchlist[class]))
                    ph, pw = (bbox[3] - bbox[1]), (bbox[2] - bbox[0])
                    # resize patch with interpolation
                    resized_patch = resize(patchlist[class][r_i], (ph, pw))
                    # DocCutMix
                    img[:, bbox[1]:bbox[3], bbox[0]:bbox[2]] = resized_patch
        # optional, clear word annotations
        instances = [(bbox, class) for (bbox, class, isword) in instances if not isword]
          </preformat>
          <p>
            We further consider two recent methods, the DC5 model
proposed in
            <xref ref-type="bibr" rid="ref8">(Dai et al. 2017)</xref>
            and the RetinaNet model
proposed in
            <xref ref-type="bibr" rid="ref25">(Lin et al. 2017b)</xref>
            . We used the same experimental
settings as for FPN, namely the same learning rate, optimizer,
and ResNet-50 backbone. All models are implemented on
top of Detectron2
            <xref ref-type="bibr" rid="ref1 ref6">(Wu et al. 2019b)</xref>
            .
          </p>
          <p>
            Parameters for DocCutout and DocCutMix. In all
experiments, we set the probability of applying augmentation
to 0.5. DocCutout has hyper-parameters called fill value,
p, and patch ratio. Since fill value determines the value
used to fill the cut-out regions, we considered two cases, namely
white (255, 255, 255) and black (0, 0, 0). p is the
proportion of the figure bounding box that is cut out, and
is sampled as p ~ Unif(0.3, 0.5) for each
transform. patch ratio is the ratio of elements to be cut out
among the figures or words in the document, and is also defined for
DocCutMix. All data augmentations are implemented on top of
Albumentations
            <xref ref-type="bibr" rid="ref5">(Buslaev et al. 2020)</xref>
            .
          </p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Comparison against baseline augmentations</title>
        <p>
          We conducted experiments on various augmentation
methods to determine which are effective for the document
object detection task. The baseline augmentations are as follows.
        </p>
        <list list-type="bullet">
          <list-item><p>Colorjitter: brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2 (standard setting in Albumentations
          <xref ref-type="bibr" rid="ref5">(Buslaev et al.
2020)</xref>
          )</p></list-item>
          <list-item><p>Gaussnoise: var_limit=(10.0, 50.0), mean=0 (standard setting in Albumentations
          <xref ref-type="bibr" rid="ref5">(Buslaev et al. 2020)</xref>
          )</p></list-item>
          <list-item><p>Affine: shift=0.0625, scale=0.01, rotate=2</p></list-item>
        </list>
        <p>
Colorjitter and Gaussnoise are pixel-level augmentations, so
they can be applied directly to document images. For the
Affine transformation, we used only very small parameters
to preserve the semantics of the text.
        </p>
        <p>
          Results are given in Table 1. The evaluation metric
follows the COCO object detection standard
          <xref ref-type="bibr" rid="ref24">(Lin et al.
2014)</xref>
          . We observe that DocCutout achieves the best result,
86.84 AP, outperforming the non-augmented baseline
by +1.77 AP. DocCutMix also surpassed
the other methods in the comparison. DocCutMix shows
the best results in AP-table and AP-text, so it appears well
suited to table understanding tasks such as
document table detection. Moreover, Figure 3 shows that
DocCutout and DocCutMix help the model converge stably during
training. Interestingly, Gaussnoise showed the most unstable
convergence curve in training but achieved superior
performance on the listitem class, which is the most difficult class for
all methods. Affine augmentation also showed
competitive results, but there is a risk that affine transformation
alters the semantics of the text.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Hyper-parameter search for our methods</title>
        <p>As previously explained, both DocCutout and DocCutMix
have a hyper-parameter called patch ratio. We varied
the patch ratio of the two
methods over 0.2, 0.33, and 0.5. As a result, 0.33 for
DocCutout and 0.5 for DocCutMix showed the best
performance.</p>
        <p>DocCutout has another hyper-parameter, fill value. Since
most of the document images in the dataset have a white
background, the white fill value creates more realistic data
and yielded a better AP, by +0.21. Table 2 reports the
results of the hyper-parameter experiments.</p>
        <p>Table columns: AP-figure, AP-heading, AP-listitem, AP-table, AP-text.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Comparison by changing the baseline model</title>
        <p>Table 3: AP of the Baseline and DocCutout augmentations for the FPN, FRCNN DC, and RetinaNet models.</p>
        <p>We tested whether DocCutout, which showed the
highest AP among the tested augmentation methods, can be
applied to various models in general. Table 3 shows the results.
The higher the baseline performance of a model, the greater the
improvement when DocCutout was used. The FPN model,
which has the highest baseline performance, improved
by +1.77 AP, while RetinaNet, which
has the lowest baseline performance, improved
by +0.09 AP.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Combination of data augmentations</title>
        <sec id="sec-4-6-2">
          <p>Table 4: AP for combinations of augmentations: DocCutout, DocCutout + Affine, and DocCutout + DocCutMix.</p>
          <p>
            Table 4 shows an experiment in which DocCutMix
and Affine augmentation were each combined with DocCutout. Affine, which was
inferior to DocCutMix as a single augmentation, performs better
in combination with DocCutout. Finding the
most appropriate combination of data
augmentations is quite a complicated problem.
Although it is beyond the scope of our research, it seems
possible to find the optimal combination of document image
augmentations based on the data augmentation methods that we
have proposed and experimented with. We expect that
recent studies, such as AutoAugment
            <xref ref-type="bibr" rid="ref6">(Cubuk et al. 2019)</xref>
            and
RandAugment
            <xref ref-type="bibr" rid="ref7">(Cubuk et al. 2020)</xref>
            , can be applied to solve
the problem.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Data augmentation plays a variety of roles and contributes
greatly to improving model performance. However,
there has been a lack of studies on data augmentation
for document image understanding, which requires
understanding both natural language and visual features. In this
paper, we have shown that recent data augmentation
techniques such as Cutout and CutMix have limitations and thus
cannot be directly applied to document images, although
they are highly effective on natural images. To tackle
this problem, we proposed two data augmentation methods,
DocCutout and DocCutMix. Our proposed methods show
not only a performance improvement on the PubMed dataset but
also generality across various models.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the Clova AI OCR team, especially
Bado Lee, Daehyun Nam, and Yoonsik Kim, for their
helpful feedback and discussion.</p>
      <p>Soto, C., and Yoo, S. 2019. Visual detection with context for
document layout analysis. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), 3464–3470.
Hong Kong, China: Association for Computational
Linguistics.</p>
      <p>Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and
Salakhutdinov, R. 2014. Dropout: a simple way to prevent
neural networks from overfitting. The journal of machine
learning research 15(1):1929–1958.</p>
      <p>Wei, J., and Zou, K. 2019. EDA: Easy data augmentation
techniques for boosting performance on text classification
tasks. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), 6383–6389.</p>
      <p>Yu, W.; Lu, N.; Qi, X.; Gong, P.; and Xiao, R. 2020. PICK:
Processing key information extraction from documents
using improved graph learning-convolutional networks. In
Proceedings of the 25th International Conference on Pattern
Recognition (ICPR).</p>
      <p>Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2020.
Random erasing data augmentation. In Proceedings of the
AAAI Conference on Artificial Intelligence (AAAI).</p>
      <p>Zhong, X.; Tang, J.; and Yepes, A. J. 2019. PubLayNet:
largest dataset ever for document layout analysis. In 2019
International Conference on Document Analysis and
Recognition (ICDAR), 1015–1022. IEEE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baek</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Character region awareness for text detection</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>9365</fpage>
          -
          <lpage>9374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bari</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mohiuddin</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Joty</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Multimix: A robust data augmentation strategy for cross-lingual nlp</article-title>
          .
          <source>arXiv preprint arXiv:2004.13240</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Berthelot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Carlini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Papernot</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Raffel</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Mixmatch: A holistic approach to semi-supervised learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <fpage>5049</fpage>
          -
          <lpage>5059</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Buslaev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Iglovikov</surname>
            ,
            <given-names>V. I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Khvedchenya</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Parinov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Druzhinin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kalinin</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Albumentations: Fast and flexible image augmentations</article-title>
          .
          <source>Information</source>
          <volume>11</volume>
          (
          <issue>2</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Cubuk</surname>
            ,
            <given-names>E. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mane</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Autoaugment: Learning augmentation strategies from data</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>113</fpage>
          -
          <lpage>123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Cubuk</surname>
            ,
            <given-names>E. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Randaugment: Practical automated data augmentation with a reduced search space</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</source>
          ,
          <fpage>702</fpage>
          -
          <lpage>703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
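          ;
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>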
          <year>2017</year>
          .
          <article-title>Deformable convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <fpage>764</fpage>
          -
          <lpage>773</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Denk</surname>
            ,
            <given-names>T. I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Reisswig</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>BERTgrid: Contextualized embedding for 2d document representation and understanding</article-title>
          .
          <source>In Workshop on Document Intelligence at NeurIPS 2019</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.-W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>DeVries</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>G. W.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Improved regularization of convolutional neural networks with cutout</article-title>
          .
          <source>arXiv preprint arXiv:1708.04552</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vedaldi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Synthetic data for text localisation in natural images</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>A table detection method for pdf documents based on convolutional neural networks</article-title>
          .
          <source>In 2016 12th IAPR Workshop on Document Analysis Systems (DAS)</source>
          ,
          <fpage>287</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Hendrycks</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cubuk</surname>
            ,
            <given-names>E. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gilmer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Augmix: A simple data processing method to improve robustness and uncertainty</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Hwang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Post-OCR parsing: building simple and robust parser via bio tagging</article-title>
          .
          <source>In Workshop on Document Intelligence at NeurIPS 2019</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Jaderberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vedaldi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Synthetic data and artificial neural networks for natural scene text recognition</article-title>
          .
          <source>In Workshop on Deep Learning</source>
          , NIPS.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Katti</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Reisswig</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guder</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brarda</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bickel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Höhne</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Faddoul</surname>
            ,
            <given-names>J. B.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Chargrid: Towards understanding 2d documents</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <fpage>4459</fpage>
          -
          <lpage>4469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Contextual augmentation: Data augmentation by words with paradigmatic relations</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (Short Papers),
          <fpage>452</fpage>
          -
          <lpage>457</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wigington</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tensmeyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Barmpalios</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Morariu</surname>
            ,
            <given-names>V. I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Manjunatha</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2020a</year>
          .
          <article-title>Cross-domain document object detection: Benchmark suite and method</article-title>
          .
          <source>In Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <fpage>12915</fpage>
          -
          <lpage>12924</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2020b</year>
          .
          <article-title>Docbank: A benchmark dataset for document layout analysis</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Synthtext3d: synthesizing scene text images from 3d virtual worlds</article-title>
          .
          <source>Science China Information Sciences</source>
          <volume>63</volume>
          (
          <issue>2</issue>
          ):
          <fpage>120105</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.-Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Microsoft coco: Common objects in context</article-title>
          .
          <source>In Proceedings of the European Conference on Computer Vision (ECCV)</source>
          ,
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.-Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hariharan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017a</year>
          .
          <article-title>Feature pyramid networks for object detection</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>2117</fpage>
          -
          <lpage>2125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          2017b.
          <article-title>Focal loss for dense object detection</article-title>
          .
          <source>In Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Graph convolution for multimodal information extraction from visually rich documents</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)</source>
          , Volume
          <volume>2</volume>
          (Industry Papers),
          <fpage>32</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Miyato</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maeda</surname>
            ,
            <given-names>S.-i.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Koyama</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ishii</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Virtual adversarial training: a regularization method for supervised and semi-supervised learning</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>41</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1979</fpage>
          -
          <lpage>1993</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Schreiber</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Agne</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Deepdesrt: Deep learning for detection and structure recognition of tables in document images</article-title>
          .
          <source>In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>
          , volume
          <volume>01</volume>
          ,
          <fpage>1162</fpage>
          -
          <lpage>1167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          2017.
          <article-title>mixup: Beyond empirical risk minimization</article-title>
          .
          <source>arXiv preprint arXiv:1710.09412</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>