<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Technical Drawing Figures in US Patents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Md Reshad Ul Hoque</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Wei</string-name>
          <email>xwei001@odu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muntabir Hasan Choudhury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kehinde Ajayi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Gryder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diane Oyen</string-name>
          <email>doyen@lanl.gov</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Old Dominion University</institution>
          ,
          <addr-line>Norfolk, Virginia</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical and Computer Engineering, Old Dominion University</institution>
          ,
          <addr-line>Norfolk, Virginia</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Los Alamos National Laboratory</institution>
          ,
<addr-line>Los Alamos, New Mexico</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Segmentation, US Patents</institution>
          ,
          <addr-line>Sketched images, CNN, Point-shooting, HR-Net, U-Net, Transformer</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>called MedT, fine-tuned on a small set of training samples</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>models, including MedT and DETR. The method we pro-</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Image segmentation is the core computer vision problem of identifying objects within a scene. Segmentation is a challenging task because predicting each pixel's label requires contextual information. Most recent research deals with the segmentation of natural images rather than drawings, and there is very little research on sketched image segmentation. In this study, we introduce a heuristic method (point-shooting) and deep learning-based methods (U-Net, HR-Net, MedT, DETR) to segment technical drawings in US patent documents. Our proposed methods achieve over 90% accuracy on the US Patent dataset, with the transformer-based model performing best at 97% segmentation accuracy while remaining computationally efficient. Our source code and datasets are available at https://github.com/GoFigure-LANL/figure-segmentation.</p>
      </abstract>
      <kwd-group>
        <kwd>Segmentation</kwd>
        <kwd>US Patents</kwd>
        <kwd>Sketched images</kwd>
        <kwd>CNN</kwd>
        <kwd>Point-shooting</kwd>
        <kwd>HR-Net</kwd>
        <kwd>U-Net</kwd>
        <kwd>Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Combining information contained in text and images is
an important aspect of understanding scientific
documents. However, patents and scientific documents often
contain compound figures composed of multiple subfigures. To
associate individual subfigures with the appropriate caption
and reference text, we must first segment the full figure
into its individual subfigures. Although much research
has been done on figure understanding and extraction
for scientific documents, existing methods rely on either
(1) manually designed rules and human-crafted features or
(2) machine learning approaches, most of which were trained
on natural images [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We demonstrate that approaches developed for other
datasets cannot simply be applied to novel datasets.
      </p>
      <p>
        Image segmentation has been extensively studied with
rule-based methods, such as the watershed transform, and with
machine learning methods applied to natural images [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. In
patent drawings, there is usually white space between
individual drawings, so a simple sweeping-line method can
detect the boundaries of subfigures by counting the
maximum number of black pixels along a horizontal (or
vertical) line.
      </p>
      <p>Our transformer-based model is also computationally efficient compared
with other methods. We release the benchmark dataset
so that it can be used for future work on the task of
segmenting technical drawings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <sec id="sec-2-1">
        <title>The data for this project is obtained from the United</title>
        <p>States Patent and Trademark Ofice (USPTO). The ground
truth dataset is developed on a corpus of 500 randomly
selected figures from the design category of patent. The
dataset consists of 20 figure files with single drawings
and 480 figure files, which containing at least two
subifgures. We preprocess each figure to remove text labels.</p>
      <p>The number of subfigures in each figure file is inferred from
the number of text labels detected identifying subfigures,
so segmentation is only necessary for figures containing
multiple subfigures.</p>
      <p>We use the VGG Image Annotator (VIA) to annotate our
dataset. VIA is an open-source tool for manually
annotating images, videos, and audio. We draw
rectangular bounding boxes around the subfigures. Each
figure consists of 2–12 subfigures. We also performed an
independent human verification to ensure the bounding
boxes were drawn correctly. VIA allows exporting
annotation results including the filename, file size, region count
(i.e., the number of bounding boxes for each figure),
region id, and the coordinates of the bounding boxes.</p>
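      <p>As an illustration, the following minimal Python sketch reads bounding boxes from a VIA 2.x JSON export; the file name annotations.json and the exact export keys are assumptions based on VIA's documented rectangle format, not details from the paper.</p>
      <preformat>
import json

# Load a VIA 2.x annotation export (assumed file name).
with open("annotations.json") as f:
    via = json.load(f)

boxes = {}  # filename -> list of (x, y, w, h) bounding boxes
for entry in via.values():
    rects = []
    for region in entry["regions"]:
        shape = region["shape_attributes"]
        if shape["name"] == "rect":  # rectangles drawn around subfigures
            rects.append((shape["x"], shape["y"],
                          shape["width"], shape["height"]))
    boxes[entry["filename"]] = rects
      </preformat>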
    </sec>
    <sec id="sec-3">
      <title>3. Segmentation Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Point Shooting Method</title>
        <p>We propose a heuristic method for segmenting figures
containing technical drawings. We call it point-shooting
because it mimics the shooting of darts onto a dartboard.
The goal is to draw bounding boxes around individual
subfigures on a figure containing multiple technical
drawings. Figure 1 illustrates the procedure. After
removing the figure labels, we randomly pick a pixel
in the figure and draw an open dot of radius r. For our
experiment, we chose an empirical value r = 2. If a black
pixel in the original figure was detected inside the open
dot, the dot was retained; otherwise, the dot was removed.
We constrain the circle centers so they do not fall outside
the figure boundary. We then fill all retained circles and
draw contours. Using the contour information, we draw
rectangular bounding boxes to segment a figure.</p>
        <p>The point-shooting method is easy to implement and
successfully segments most figures containing multiple
technical drawings. However, the method does not
generalize well for certain figures in our dataset. One example
is shown in Figure 3. One can see that in Row 2, the
point-shooting method creates wrong bounding boxes
(Column 1), while U-Net, a deep learning model, produces
the correct bounding box (Column 2).</p>
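        <p>The following is a minimal OpenCV sketch of the point-shooting heuristic described in this section; the sampling density and the helper name segment_figure are our own illustrative choices, not from the paper.</p>
        <preformat>
import cv2
import numpy as np

def segment_figure(path, r=2, n_points=20000):
    """Point-shooting: sample random dots, keep those that hit ink,
    fill the kept dots, then box the resulting contours."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = gray.shape
    ink = gray < 128                      # black pixels of the drawing

    canvas = np.zeros((h, w), np.uint8)
    rng = np.random.default_rng(0)
    # constrain centers so dots stay inside the figure boundary
    xs = rng.integers(r, w - r, n_points)
    ys = rng.integers(r, h - r, n_points)
    for x, y in zip(xs, ys):
        # retain the dot only if a black pixel falls inside it
        if ink[y - r:y + r + 1, x - r:x + r + 1].any():
            cv2.circle(canvas, (int(x), int(y)), r, 255, -1)  # filled dot

    # contours of the merged dots approximate the subfigure regions
    contours, _ = cv2.findContours(canvas, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) boxes
        </preformat>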
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Deep Learning Methods</title>
        <p>
          Therefore, we consider applying deep learning-based
methods including U-Net [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], HR-Net [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and
transformer models (MedT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], DETR [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]). One challenge
is that the ground truth only contains bounding boxes,
while these models produce pixel-level masks. Therefore,
the ground truth cannot be directly used for training
these deep learning models. To overcome this challenge,
we first use the point-shooting method to generate masks
for the training figures and use them as input to train U-Net
and HR-Net, and to fine-tune MedT and DETR. It is worth
mentioning that the point-shooting method achieves an
accuracy of 92.5%. Although the output of the
point-shooting method is not 100% accurate, we hope that the
neural networks can still encode and capture the right
features and achieve better generalization for reasonably
good performance. Figure 2 shows the deep learning
segmentation pipeline. All of our deep learning-based
models except DETR [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] are semantic segmentation
models, where the models produce foreground-background
masks for the input image. DETR is a transformer-based,
end-to-end object detection model that directly predicts
the bounding boxes on the input image.
        </p>
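        <p>As a sketch of this step, the snippet below converts point-shooting bounding boxes into binary training masks; the 128 × 128 training size comes from the pipeline description below, while the function name boxes_to_mask and everything else are illustrative assumptions.</p>
        <preformat>
import numpy as np
import cv2

def boxes_to_mask(boxes, height, width, out_size=128):
    """Rasterize (x, y, w, h) boxes into a foreground-background mask,
    rescaled to the low training resolution."""
    mask = np.zeros((height, width), np.uint8)
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = 1          # subfigure region = foreground
    # nearest-neighbor interpolation keeps the mask binary after rescaling
    return cv2.resize(mask, (out_size, out_size),
                      interpolation=cv2.INTER_NEAREST)
        </preformat>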
        <p>
          Most existing SoTA deep learning-based models were
pre-trained on natural or medical images, which usually
contain rich color and/or gradient information compared
with technical drawings, which are mostly sketched
images containing black/grey-scale pixels. Therefore, these
pre-trained models usually result in poor performance
when tested on the technical drawings in our dataset. To
overcome this limitation, we fine-tuned pre-trained
models or trained them from scratch. To reduce unnecessary
computational cost, we rescale the resolution of the
input figures to 128 × 128 and use them to train the deep
learning models. The model produces a segmentation
mask with a dimension of 128 × 128 × 3, which we use to
draw contours and then bounding boxes around the contours.
After obtaining bounding boxes from the low-resolution
image, we linearly scale up the predicted bounding boxes
to fit the original figure.
        </p>
        <sec id="sec-3-2-1">
          <title>3.2.1. U-Net</title>
          <p>The architecture of U-Net consists of a contracting path
and an expanding path. The contracting path is a typical
convolutional network containing a series of
convolutional layers, each followed by a rectified linear unit
(ReLU) and a max pooling layer with stride 2 for
downsampling. At each downsampling step, the number of
feature channels is doubled. In the expanding path, each step
consists of an upsampling of the feature map followed by
an “up-convolution”, a concatenation with the cropped
feature map from the contracting path, and two convolutions,
each followed by a ReLU.</p>
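          <p>To make the architecture concrete, here is a minimal PyTorch sketch of one contracting step and one expanding step as described above; the class names and channel widths are illustrative assumptions, not the exact configuration used in the paper.</p>
          <preformat>
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class Down(nn.Module):
    """Contracting step: max pool with stride 2, then double the channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.block = nn.Sequential(nn.MaxPool2d(2),
                                   double_conv(in_ch, in_ch * 2))

    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Expanding step: up-convolution, concatenate the skip feature map,
    then two convolutions."""
    def __init__(self, in_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, 2, stride=2)
        self.conv = double_conv(in_ch, in_ch // 2)

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([skip, x], dim=1))
          </preformat>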
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. HR-Net</title>
          <p>In the contracting path of U-Net, feature maps are
down-sampled to a lower resolution using pooling and later
up-sampled in the decoder part. In this process,
high-resolution information is lost. Although skip
connections are used to copy the high-resolution information
to the expansive path, they cannot fully recover the
high-resolution information. To overcome this drawback, we
apply the HR-Net model, which retains both high- and
low-resolution information throughout the training
process. The preserved information may be useful to
reconstruct the segmentation mask. We simplified the original
HR-Net to three resolution channels, each capturing
high-, mid-, and low-resolution information,
respectively. The three channels contain five, three, and
two convolutional blocks, respectively. Each
convolutional block contains two convolutional layers followed
by a batch normalization layer and a ReLU activation
layer. The resolution gap between two channels is 2.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Transformer</title>
          <p>
            Although the CNN-based models have shown
impressive performance on segmentation tasks [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], they
cannot capture the long-range dependencies between
pixels due to inherent inductive biases [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
Transformers have significantly improved many fundamental
natural language processing tasks. The novel idea behind
this success is “self-attention” [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. This mechanism
automatically puts more weight on more important features
and can capture long-range dependencies. The
computer vision domain has borrowed this idea to improve
vision-related tasks. We consider two transformer-based
models, MedT and DETR.
          </p>
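          <p>To illustrate the mechanism, below is a minimal scaled dot-product self-attention in PyTorch; every pixel (token) attends to every other one, which is how long-range dependencies are captured. This is a generic sketch of self-attention [13], not the exact layers used by MedT or DETR.</p>
          <preformat>
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (n_tokens, dim). Each token attends to all tokens,
    so dependencies are not limited to a local neighborhood."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise relevance
    weights = scores.softmax(dim=-1)         # more weight on important features
    return weights @ v

# Usage: a 16x16 feature map flattened into 256 tokens of dimension 64.
dim = 64
x = torch.randn(256, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # (256, 64)
          </preformat>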
      </sec>
      <sec id="sec-2-2">
        <title>These mechanisms control the influence of the relative</title>
        <p>
          module is used to train a transformer on a small dataset. tain cases, the point-shooting method produced the
correct segmentation map but deep learning-based methods
positional encoding on non-local context. This architec- failed, such as the segmentation results in Row 1 and
models.
3.2.4. MedT
The core component of MedT is a gated position-sensitive
axial attention mechanism designed for small size datasets
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] . Gated control axial attention which introduces
an additional control mechanism in the self-attention
an empirical threshold of 0.7. To verify consistency, we
also perform qualitative evaluation by visually inspecting
predicted and ground truth segmentations. The manual
inspection is consistent with automatic inspection with
an agreement rate of 98%.
        </p>
        <p>In general, deep learning-based methods perform
better than point-shooting methods, such as the
segmentation results in Row 2 of Figure 3. However, in
certure contains two branches, including a global branch
that captures the dependencies between pixels and the
entire image and a local branch that captures finer
dependencies among neighbouring pixels.</p>
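          <p>The gating idea can be sketched as follows: learnable scalar gates scale the relative positional terms before they enter the attention logits, so the network can mute unreliable positional information early in training. This toy 1-D version is our own illustration of the concept and differs from MedT's actual axial-attention layers.</p>
          <preformat>
import torch
import torch.nn as nn

class ToyGatedAttention(nn.Module):
    """Self-attention over one axis with a gated relative positional bias."""
    def __init__(self, dim, length):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.rel_bias = nn.Parameter(torch.randn(length, length))  # positional term
        self.gate = nn.Parameter(torch.tensor(0.1))  # learnable gate, starts small
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (batch, length, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale  # content-based term
        logits = logits + self.gate * self.rel_bias    # gated positional term
        return logits.softmax(dim=-1) @ v
          </preformat>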
          <p>The training figures are passed through a convolution
block before passing through the global branch. The
same figure is broken down into patches, which are sent
through a similar convolution block and then sequentially
through the local branch. A re-sampler aggregates the
outputs from the local branch based on the position of
each patch and generates output feature maps. The outputs
from both branches are added together, and a 1 × 1
convolutional layer pools these output feature maps
into a segmentation mask.</p>
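          <p>A compact sketch of this two-branch fusion is given below; the class name, channel width, and patch size are illustrative assumptions based on the description above, not MedT's actual implementation.</p>
          <preformat>
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Global branch + patch-wise local branch, added and pooled by 1x1 conv."""
    def __init__(self, ch=32, patch=32):
        super().__init__()
        self.patch = patch
        self.global_branch = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                           nn.ReLU())
        self.local_branch = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                          nn.ReLU())
        self.head = nn.Conv2d(ch, 1, kernel_size=1)  # 1x1 conv -> mask

    def forward(self, x):            # x: (B, 1, H, W), H and W divisible by patch
        g = self.global_branch(x)
        B, _, H, W = x.shape
        p = self.patch
        # break the figure into patches, run each through the local branch
        patches = x.unfold(2, p, p).unfold(3, p, p)   # (B, 1, H/p, W/p, p, p)
        local = torch.zeros_like(g)
        for i in range(H // p):
            for j in range(W // p):
                # re-sampler: place each patch output back by its position
                local[:, :, i*p:(i+1)*p, j*p:(j+1)*p] = \
                    self.local_branch(patches[:, :, i, j])
        return self.head(g + local)  # add branches, pool with 1x1 conv
          </preformat>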
        </sec>
        <sec id="sec-3-2-5">
          <title>3.2.5. DETR</title>
          <p>
            DETR is an end-to-end object detection transformer model [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. The architecture is simple and does not require a
specialized layer or a custom function (such as
non-maximum suppression) for predicting the
bounding boxes. The original DETR model predicts 80 classes
of bounding boxes. We fine-tuned this model to
directly predict the bounding boxes of subfigures given a
compound figure.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <p>The segmentation task can be seen as a classification
problem, in which individual subfigures are foreground
objects, and the blank area between subfigures is the
background. Although we use a training corpus with noisy
labels, the deep learning models successfully capture
latent representations and correctly segmented individual
drawings. The evaluation results are shown in Table 1.</p>
      <sec id="sec-3-1">
        <title>Visual comparisons of segmentation results of diferent</title>
        <p>models of challeging cases are shown in Figure 3.</p>
      </sec>
      <sec id="sec-3-2">
        <title>The performance of each model is measured using the accuracy, which is calculated as the fraction of subfigures</title>
      </sec>
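      <p>For reference, a minimal implementation of the IOU computation between two (x, y, w, h) boxes follows; the function name iou is our own.</p>
      <preformat>
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) bounding boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # overlap rectangle
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0, min(ax + aw, bx + bw) - ix)
    ih = max(0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# A predicted box is counted correct if iou(pred, truth) > 0.7.
      </preformat>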
      <sec id="sec-3-3">
        <title>Row 3 in Figure 3. There are a few challenging cases, in</title>
        <p>which all methods failed (Figure 3). This occurred when a
subfigure contains relatively isolated fragments without
prominent connections, which were treated as individual
objects.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In conclusion, we compared heuristic and deep learning
methods on the task of segmenting technical drawings
in US patents. Both heuristic and deep learning-based
models achieve over 90% accuracy. Interestingly, though
we trained using data containing noisy labels, generated
using the point shooting method, the deep learning
models still captured the right features and outperformed the
point shooting method. The CNN-based model (e.g.,
HR</p>
      <sec id="sec-4-1">
        <title>Net) under-performs the transformer model by a small margin. We attribute this to the gated attention mechanism in the transformer model, which captured the longrange relations between pixels.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taschwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marques</surname>
          </string-name>
          ,
          <article-title>Automatic separation of compound figures in scientific articles</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>77</volume>
          (
          <year>2018</year>
          )
          <fpage>519</fpage>
          -
          <lpage>548</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsutsui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Crandall</surname>
          </string-name>
          ,
          <article-title>A data driven approach for compound figure separation using convolutional neural networks</article-title>
          ,
          <source>in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <article-title>Deep watershed transform for instance segmentation</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>End-to-end instance segmentation with recurrent attention</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6656</fpage>
          -
          <lpage>6664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Boykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Porikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kehtarnavaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Terzopoulos</surname>
          </string-name>
          ,
          <article-title>Image segmentation using deep learning: A survey</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Subramanya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Endluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <article-title>ChartReader: Automatic parsing of bar-plots</article-title>
          ,
          <source>in: 22nd International Conference on Information Reuse and Integration for Data Science, IRI</source>
          <year>2021</year>
          ,
          Virtual
          , IEEE,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <article-title>Pdfigures 2.0: Mining figures from research papers</article-title>
          ,
          <source>in: 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.-s.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <article-title>Viziometrics: Analyzing visual information in the scientific literature</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          <volume>4</volume>
          (
          <year>2017</year>
          )
          <fpage>117</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>
          ,
          <source>in: International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Cheng</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Jiang</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Deng</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Zhao</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Mu</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Tan</surname></string-name>
          ,
          <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>
          , et al.,
          <article-title>Deep high-resolution representation learning for visual recognition</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Heigold</surname></string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Carion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Synnaeve</surname></string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zagoruyko</surname>
          </string-name>
          ,
          <article-title>End-to-end object detection with transformers</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name><given-names>Ł.</given-names> <surname>Kaiser</surname></string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>J. M. J.</given-names> <surname>Valanarasu</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Oza</surname></string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Hacihaliloglu</surname></string-name>
          ,
          <string-name><given-names>V. M.</given-names> <surname>Patel</surname></string-name>
          ,
          <article-title>Medical transformer: Gated axial-attention for medical image segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2102.10662</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>