<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Improving web user interface element detection using Faster R-CNN</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiří Vyskočil</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukáš Picek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>1</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>Several challenges may arise when designing new user interfaces (UIs), e.g., due to the communication between designers and developers, and the automatic detection of UI elements can help to address them. The ImageCLEF DrawnUI 2021 challenge builds on the detection of such elements in two contest tasks: a Screenshot task that contains website screenshot images with a lot of noisy data, and a Wireframe task for detecting UI elements in hand-drawn proposals. This paper describes a simple algorithm based on edge detection to filter noisy data from the website screenshots, and a machine learning method that scored first place in both tasks with 0.628 and 0.900 mAP at 0.5 IoU in the Screenshot and Wireframe tasks, respectively. The method is based on Faster R-CNN with a Feature Pyramid Network (FPN) that uses aspect ratios of anchor boxes selected according to their occurrences in the datasets. The code is available at https://github.com/vyskocj/ImageCLEFdrawnUI2021.</p>
      </abstract>
      <kwd-group>
        <kwd>Object Detection</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Edge Detection</kwd>
        <kwd>Faster R-CNN</kwd>
        <kwd>FPN</kwd>
        <kwd>CNN</kwd>
        <kwd>User Interface</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ImageCLEF DrawnUI challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was organized as part of the ImageCLEF 2021 workshop [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] at the CLEF conference. The main goal of the two proposed tasks - Screenshots and
Wireframes - was to create a system capable of automatically detecting and recognizing individual
user interface (UI) elements in given images. The Screenshot task focused on website screenshot
images, and the Wireframe task targeted hand-drawn UI sketches. The motivation for both tasks
is to simplify and speed up the web development process by giving designers a tool that can
visualize a website immediately based on their hand-drawn sketches.
      </p>
      <p>
        Machine learning techniques have already been applied to hand-drawn UI element detection
in recent years. Gupta et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used Mask R-CNN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and a Multi-Pass Inference technique that boosts the viability of the model by passing the
input image (with the already detected objects removed) to the model several times. Narayanan
et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] explored the Cascade R-CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and YOLOv4 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] architectures, and Zita et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used the regular Faster R-CNN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] architecture with advanced regularization techniques for training the model. In this work,
we utilize Faster R-CNN extended by the Feature Pyramid Network (FPN) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which builds high-level semantic feature maps at all selected scales and makes the
predictions more accurate. The models were implemented and fine-tuned using the Detectron2
API [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] from publicly available checkpoints pre-trained on the COCO dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Additionally, we improved the performance by using various augmentations, i.e., Random
Relative Resize, Cutout [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], brightness and contrast adjustment, and by selecting bounding box proposals. In the case
of the Screenshot task, we utilized a data filtering algorithm based on edge detection [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. The improvements of our method, which won both contest tasks, are shown by comparison
with the other entries in the benchmark of the DrawnUI challenge.
      </p>
      <p>
        In addition, we experimented with novel methods [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ] based on the Detection Transformer.
These approaches remove the need for hand-designed components, e.g., non-maximum
suppression (NMS), but require much more training time to converge than previous detectors.
Given this training issue of the Detection Transformer, we decided to keep the NMS in our
model. Using the Transformers on the data provided in the contest tasks led to significantly
worse detection performance, even with 7.5× more training steps.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Challenge datasets</title>
      <p>Wireframe task. The provided dataset consists of 4,291 hand-drawn high-resolution
image templates. The data is divided into 3,218 images for development and 1,073
images for testing. For each image in the development set, manual annotations with
bounding boxes and their corresponding labels from 21 pre-defined classes are available. The development
set includes all images from last year’s challenge and additional images that re-balance the class
distribution. As no official training/validation split is provided, we made a random 85%/15%
split. Detailed statistics covering the class distribution, dataset split, and absolute/relative
box counts are presented in Table 1.</p>
      <p>Screenshot task. In the Screenshot task, the provided dataset includes 9,630 full-page
screenshots of websites in several languages. The data comes with labeled bounding boxes of the UI
elements. A total of 6 classes is defined; the distribution of ground-truth boxes can be found
in Table 2. The development set contains 6,840 training images and 930 manually annotated
validation images. The training set includes noisy data: blank images and bounding boxes with
shifted positions. The testing set contains a total of 1,860 samples.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        In this section, we cover the noisy-data filtering algorithm and the training of the Faster R-CNN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
detection network based on the ResNet-50 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] backbone. We also use the FPN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] extractor, which
combines semantically strong features thanks to a top-down pathway and lateral connections
between feature maps of the same spatial size. We use the SGD optimizer with a momentum of 0.9 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]; the learning rate
is warmed up over the first epoch to a value of 0.0025, and the smooth L1 loss [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is applied for
box regression. The detector is implemented and fine-tuned in the Detectron2 API [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] from
publicly available weights pre-trained on the COCO dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For more details about the
hyperparameter settings, see Table 3; the advanced augmentations are listed in Table 4. All
experiments are evaluated with mean average precision (mAP) and mean average recall (mAR)
over Intersection over Union (IoU) thresholds ranging from 0.5 to 0.95 in increments of 0.05, and with
mean average precision at IoU greater than 0.5 (denoted as mAP0.5).
      </p>
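      <p>For illustration, the following is a minimal Detectron2 configuration sketch corresponding to the training setup described above; option values not stated in the text (e.g., the number of warm-up iterations) are assumptions.</p>
      <preformat>
from detectron2 import model_zoo
from detectron2.config import get_cfg

# Sketch of the Faster R-CNN + FPN (ResNet-50) training setup described above.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")  # COCO pre-trained
cfg.SOLVER.BASE_LR = 0.0025            # learning rate reached after warm-up
cfg.SOLVER.MOMENTUM = 0.9              # SGD momentum
cfg.SOLVER.WARMUP_ITERS = 1000         # assumption: warm-up spanning roughly the first epoch
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 21   # Wireframe task (6 classes in the Screenshot task)
</preformat>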
      <p>[Table 1 categories (Wireframe task): button, label, paragraph, image, link, linebreak, container, header, textinput, checkbox, radiobutton, toggle, slider, datepicker, textarea, rating, dropdown, video, list, stepperinput, table.]</p>
      <p>[Table 2 categories (Screenshot task): link, text, image, heading, input, button.]</p>
      <sec id="sec-3-1">
        <title>3.1. Baseline experiment</title>
        <p>For the baseline experiment, the Random Relative Resize augmentation is applied to resize an
image to 70-90% of its size and crop it to a maximum of 1,400 px to limit memory usage.
The resize augmentations are examined in depth in Section 3.3.1. The hyperparameter settings
are described in Table 3 and the advanced augmentations in Table 4. In the Screenshot task, the
baseline model reached 0.592 mAP0.5, 0.404 mAP, and 0.603 mAR, while in the Wireframe
task, a model with the same settings achieved 0.969 mAP0.5, 0.703 mAP, and 0.763 mAR.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Filtering noisy data in the Screenshot task</title>
        <p>
          Even though noisy data can be effective for training [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we decided to analyze the filtering
of blank images and wrongly annotated bounding boxes from the Screenshot task dataset.
The aim is to remove images or ground-truth boxes that contain a constant color intensity. For
this reason, the data filtering (shown in Algorithm 1) is based on an edge detector [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ] so that it is
independent of the intensity of the pixels in the input image.
        </p>
        <p>Algorithm 1: Filtering homogeneous image elements from a dataset.</p>
        <preformat>
Define threshold values t_img (for images) and t_box (for bounding boxes)
for each image do
    Apply the edge detector to the image and compute the mean value m of the output
    if m ≤ t_img then
        Discard this image from the set and continue with the next one
    else
        for each bounding box of the image do
            Apply the t_box threshold in the same way as t_img is applied to the image
        end for
        if all bounding boxes are discarded from the annotations of the image then
            Discard this image from the set and continue with the next one
        end if
    end if
end for
</preformat>
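        <p>A minimal Python sketch of this filtering step is shown below; it assumes a Laplacian edge detector from OpenCV [14], and the threshold values and helper names are hypothetical.</p>
        <preformat>
import cv2
import numpy as np

def edge_mean(gray):
    """Mean magnitude of the Laplacian edge response of a greyscale image."""
    return float(np.mean(np.abs(cv2.Laplacian(gray, cv2.CV_64F))))

def filter_image(image_path, boxes, t_img=1.0, t_box=0.5):
    """Return the kept boxes, or None if the whole image should be discarded.

    boxes are (x, y, w, h) tuples; t_img and t_box are hypothetical thresholds.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if edge_mean(gray) > t_img:                       # image contains enough edges to keep
        kept = [(x, y, w, h) for (x, y, w, h) in boxes
                if edge_mean(gray[y:y + h, x:x + w]) > t_box]
        if kept:
            return kept
    return None                                       # homogeneous image, or all boxes homogeneous
</preformat>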
        <p>
          To verify the efficiency of data filtering, we manually selected appropriate thresholds for
images (see Table 5) and a set of fixed thresholds from 0.2 to 1.8 for bounding boxes. Then
we trained the network with the filtered annotations of the Screenshot dataset using the same
settings as in Section 3.1. For the results of this experiment, see Table 6. One can observe that
filtering the homogeneous images increases mAP0.5 by 0.012, mAP by 0.008, and mAR by
0.009 compared to the original set. Filtering both homogeneous images and bounding
boxes also increases the detection performance, but for all tested bounding-box thresholds it
detects less precisely than filtering only the images. This behaviour can be caused by
eliminating training data that are in fact objects, not noise. Therefore, we determined
a new baseline for the Screenshot task by filtering only the images from the training set
(this model was submitted as the baseline for the Screenshot task of the DrawnUI challenge [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]).
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Augmentations</title>
        <p>
          Image resizing, the Cutout [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] augmentation, and color spaces are tested to improve detection
performance. The improvement is evaluated by comparison with the baseline models defined in
the previous sections, i.e., Section 3.1 for the Wireframe task and the new baseline established in
Section 3.2 for the Screenshot task.
        </p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Image resize</title>
          <p>The basic approach to dealing with various sizes of the input images is to resize them to a
desired constant value so that the original aspect ratio is kept. However, various sizes can help
the learning algorithm to detect objects at different stages of the network. For example, imagine
that only small boxes are available for the category button in the training set. At test time,
such a network will not expect a large button at the input and will most likely fail to detect
it. In order to use different input image sizes, the backbone network must not contain any fully
connected layer.</p>
          <p>
            Two types of image resizing are compared in Table 7. The first one, Resize Shortest Edge
(default in the Detectron2 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] software), has a defined set of shortest-edge lengths of the image from
640 to 800 px in increments of 32, which are selected randomly during training. If the
longer edge exceeds 1,333 px, the shorter edge is downscaled so that the longer edge does
not exceed this maximum size. We propose the second type of resizing, Random Relative
Resize. It defines an interval within which the image is randomly resized, and a maximum edge length
for cropping the image during training due to memory requirements. The particular
aim of this augmentation is to keep the small boxes so that they do not disappear when the image
size is reduced and the network is still able to detect them. At test time, the image is resized
only by the middle value of the specified interval and no image cropping is applied. This
augmentation proved most suitable when resizing the image by a random factor in the
interval [0.6, 1.0] for both tasks, where the mAP0.5 and mAP metrics are roughly 0.034 to
0.045 higher than when using Resize Shortest Edge.
          </p>
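          <p>The following is an illustrative sketch of the Random Relative Resize augmentation as described above; the helper name and the crop policy details are our own assumptions.</p>
          <preformat>
import random
import cv2

def random_relative_resize(image, scale_range=(0.6, 1.0), max_edge=1400, train=True):
    """Rescale the image relative to its own size; crop long edges at training time.

    scale_range and max_edge follow the values reported in the text; the helper itself
    is illustrative. Bounding boxes must be rescaled by the same factor and clipped
    to the cropped region.
    """
    scale = random.uniform(*scale_range) if train else sum(scale_range) / 2.0
    h, w = image.shape[:2]
    resized = cv2.resize(image, (int(w * scale), int(h * scale)))
    if train:
        # crop to at most max_edge px per side to limit memory usage
        resized = resized[:max_edge, :max_edge]
    return resized
</preformat>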
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Cutout augmentation</title>
          <p>
            To increase the performance of the network, in addition to the brightness and resize augmentations,
we also used Cutout [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] from the Albumentations library [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], which randomly cuts boxes (also denoted
as holes) out of the image. This augmentation expects the maximum number of holes and their
maximum spatial size as input. In our experiments, we define the maximum size of the holes as
a percentage of the image. The results (see Table 8) show that it can increase mAP0.5 by 0.008
and mAP by 0.004 for the Screenshot task when using 4 holes with a maximum size of 5% of the image,
while in the Wireframe task, the detection performance was slightly reduced in all settings of
the Cutout augmentation. Even so, we applied this augmentation in our further experiments (see
Section 3.4 and Section 4) to keep the experiments for both contest tasks comparable, and in
the Screenshot task the augmentation shows meaningful improvements.
          </p>
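          <p>For illustration, a possible way to set up such a Cutout transform with the Albumentations library is sketched below; deriving the hole size from the image dimensions is our own illustrative choice.</p>
          <preformat>
import albumentations as A

def make_cutout(image_height, image_width, num_holes=4, max_size_frac=0.05):
    """Cutout with the hole size given as a fraction of the image (illustrative helper)."""
    return A.Cutout(
        num_holes=num_holes,
        max_h_size=int(image_height * max_size_frac),
        max_w_size=int(image_width * max_size_frac),
        fill_value=0,
        p=0.5,
    )
</preformat>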
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Color space</title>
          <p>
            In the next step, converting images to greyscale, as in previous works of this
challenge [
            <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
            ], is applied. It yields no improvement over RGB images (see Table 9) in
the Screenshot task. On the other hand, for the Wireframe task, converting the data to greyscale
yields up to approximately 0.005 greater mAP0.5, mAP, and mAR. Therefore, both RGB and
greyscale images are used for the remaining experiments.
          </p>
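          <p>A minimal sketch of converting an image to greyscale while keeping the three-channel input expected by a COCO pre-trained backbone is shown below; the channel replication is our own assumption about the pipeline.</p>
          <preformat>
import cv2

def to_greyscale_3ch(image_bgr):
    """Convert a BGR image to greyscale, replicated back to 3 channels (illustrative)."""
    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(grey, cv2.COLOR_GRAY2BGR)  # keeps the 3-channel shape
</preformat>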
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Anchor box proposals</title>
        <p>
          We followed up on the previous experiments examining augmentations (see Section 3.3) and
trained new models (the parameters of the new augmentations are summarized in Table 10). After that,
we analyzed which aspect ratios of ground-truth boxes occur in the datasets. The occurrences
of these aspect ratios are visualized in Figure 1 for both the Screenshot and the Wireframe tasks.
One can observe that horizontal boxes are far more frequent than vertical ones. As
a result, appropriate aspect ratios were selected for generating the box proposals. We added one
horizontal aspect ratio of 0.2 to the default ones (i.e., to the set of 0.5, 1, and 2). Then we selected
aspect ratios of 0.1, 0.5, 1, and 1.5 according to the distribution in Figure 1. Eventually,
we also reduced the anchor sizes 2× for each output layer of the Feature Pyramid
Network [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] of the ResNet [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] backbone, i.e., for the semantic feature maps from levels 2 to
6; see Table 11 for a summary of these settings.
        </p>
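        <p>A sketch of how these anchor settings could be expressed in a Detectron2 configuration is shown below; the concrete values of the smaller-sizes variant are assumptions obtained by halving the Detectron2 defaults.</p>
        <preformat>
from detectron2.config import get_cfg

cfg = get_cfg()
# "statistical" aspect ratios selected from the dataset distribution (Figure 1)
cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[0.1, 0.5, 1.0, 1.5]]
# "smaller sizes" variant: anchors halved per FPN level (Detectron2 default is [[32], [64], [128], [256], [512]])
cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[16], [32], [64], [128], [256]]
</preformat>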
        <p>An experiment examining the use of different aspect ratios for anchor box proposals (see
Table 12) shows that selecting aspect ratios by their frequency in the dataset increases detection
performance in most cases. Only for the Wireframe task with RGB images did the default aspect
ratios achieve slightly greater mAP and mAR than the ones selected from the statistics. The value
of mAP0.5 is greater for aspect ratios selected according to the statistics with smaller anchor sizes,
and this setting proved to perform better for greyscale images, by roughly 0.002 to 0.003 on all
measured metrics. Therefore, we selected this setting for comparison with the other backbone
models in the Wireframe task. For the Screenshot task, the same aspect ratios were selected but
with the default anchor sizes, because this setting performed better, with mAP0.5 and mAP up to
0.006 higher than with the default anchor settings.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Backbones comparison</title>
      <p>As a final step, several backbone architectures were compared for UI element detection on
greyscale and RGB images. We used the base parameters and augmentations from Table 3 and
Table 4, and the additional augmentations described in Table 10. In the Wireframe task, the
statistical + smaller sizes variant of the anchor generator was used, and the statistical variant was used
for the Screenshot task (for these anchor box proposal settings, see Table 11).</p>
      <p>In the comparison of the backbone architectures (see Table 13), one can see
that only in the Wireframe task did the most complex architecture achieve better
performance on the measured metrics for both color spaces. Although we expected
better performance from the more complex ResNeXt-101 backbone, superior results were
achieved with ResNet-50 in the Screenshot task. The model with the complex backbone
converges more slowly than ResNet-50, hence more epochs should be run for better results.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Submissions</title>
      <p>
        In the DrawnUI challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we created up to 9 submissions using the configurations
listed below. The configuration is the same for both the Screenshot and the Wireframe tasks;
any additional configuration relevant to only one of the tasks is specified explicitly. Results on the test
set can be found in Table 14:
#1: ResNet-50 (baseline, RGB) - model trained according to Table 3 and Table 4 w/ the
Random Relative Resize augmentation using image resize interval [0.7, 0.9]. In the Screenshot task,
only the images were filtered using thresholds described in Table 5.
#2: ResNet-50 (augmentations, RGB) - baseline trained w/ augmentations from Table 10.
#3: ResNet-50 (anchor settings, RGB) - same as submission #2 w/ anchor settings from
Table 11: statistical for the Screenshot task, and statistical + smaller sizes for the Wireframe task.
#4: ResNet-50 (anchor settings, greyscale) - same as submission #3 but w/ greyscale images.
#5: ResNet-50 (train+val, RGB) - same as submission #3 but trained on the whole
development set (w/o any validation data).
#6: ResNeXt-101 (RGB) - trained w/ the same settings as submission #3.
#7: ResNet-50 (train+val, RGB, 2× epochs) - submission #5 trained for 2× more epochs.
#8: ResNet-50 (train+val, greyscale) - same as submission #5 but trained w/ greyscale images.
#9: ResNeXt-101 (RGB, train+val, +5 epochs) - submission #6 fine-tuned w/ 5 more epochs
on whole development set (w/o any validation data).
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Our method, including data filtering, the Cutout augmentation, and statistical aspect ratios for anchor
box proposals, took first place in both contest tasks of the DrawnUI challenge: in the Screenshot
task, a ResNet-50 backbone trained on the whole development set achieved 0.628 mAP at 0.5 IoU on
the test set, and in the Wireframe task, a ResNeXt-101 backbone trained with the development set split
into training and validation parts achieved 0.900 mAP at 0.5 IoU on the test set. Besides this,
we explored state-of-the-art object detectors based on transformers, such as DETR.
DETR did not achieve satisfactory results even after 300 epochs, compared with the Faster
R-CNN trained for up to 40 epochs. Due to time constraints, we leave further use of transformers
to upcoming research projects.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work has been supported by the grant of the University of West Bohemia, project No.
SGS-2019-027. Computational resources were supplied by the project "e-Infrastruktura CZ"
(e-INFRA LM2018140) provided within the program Projects of Large Research, Development
and Innovations Infrastructures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Berari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tauteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fichou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <source>Overview of ImageCLEFdrawnUI</source>
          <year>2021</year>
          :
          <article-title>The Detection and Recognition of Hand Drawn and Digital Website UIs Task</article-title>
          , in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          , Bucharest, Romania,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sarrouti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kozlovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liauchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dicente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Jacutprakart</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Berari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Tauteanu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fichou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>L. D.</given-names>
          </string-name>
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>T. A.</given-names>
          </string-name>
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Moustahfid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Deshayes-Chossart</surname>
          </string-name>
          ,
          <article-title>Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and
          <string-name>
            <surname>Interaction</surname>
          </string-name>
          ,
          <source>Proceedings of the 12th International Conference of the CLEF Association (CLEF</source>
          <year>2021</year>
          ),
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Bucharest, Romania,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohapatra</surname>
          </string-name>
          ,
          <article-title>Html atomic ui elements extraction from hand-drawn website images using mask-rcnn and novel multi-pass inference technique</article-title>
          ,
          <source>in: CLEF2020 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          , Thessaloniki, Greece,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          , G. Gkioxari,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <surname>Mask</surname>
          </string-name>
          r-cnn,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2961</fpage>
          -
          <lpage>2969</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N. A.</given-names>
            <surname>Balaji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jaganathan</surname>
          </string-name>
          ,
          <article-title>Deep learning for ui element detection</article-title>
          :
          <source>Drawnui</source>
          <year>2020</year>
          , in: CLEF2020 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          , Thessaloniki, Greece,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          ,
          <article-title>Cascade r-cnn: High quality object detection and instance segmentation</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>43</volume>
          (
          <year>2021</year>
          )
          <fpage>1483</fpage>
          -
          <lpage>1498</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2019</year>
          .
          <volume>2956516</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bochkovskiy</surname>
          </string-name>
          , C.-Y. Wang, H.
          <string-name>
            <surname>-Y. M. Liao</surname>
          </string-name>
          ,
          <article-title>Yolov4: Optimal speed and accuracy of object detection</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>10934</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Picek</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Říha,</surname>
          </string-name>
          <article-title>Sketch2code: Automatic hand-drawn ui elements detection with faster r-cnn</article-title>
          ,
          <source>in: CLEF2020 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          , Thessaloniki, Greece,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Faster</surname>
          </string-name>
          r-cnn:
          <article-title>Towards real-time object detection with region proposal networks</article-title>
          , in: C.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>N. D.</given-names>
          </string-name>
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sugiyama</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2015</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Hariharan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
          </string-name>
          ,
          <article-title>Feature pyramid networks for object detection</article-title>
          ,
          <source>in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>936</fpage>
          -
          <lpage>944</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2017</year>
          .
          <volume>106</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          , W.-Y. Lo,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          , Detectron2, https://github.com/ facebookresearch/detectron2,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>T.-Y. Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco: Common objects in context</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>T. DeVries</surname>
          </string-name>
          , G. W. Taylor,
          <article-title>Improved regularization of convolutional neural networks with cutout</article-title>
          ,
          <source>arXiv preprint arXiv:1708.04552</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Laplacian operator-based edge detectors</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>29</volume>
          (
          <year>2007</year>
          )
          <fpage>886</fpage>
          -
          <lpage>890</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2007</year>
          .
          <volume>1027</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tabbone</surname>
          </string-name>
          , et al.,
          <article-title>Edge detection techniques-an overview, Pattern Recognition and Image Analysis C/C of Raspoznavaniye Obrazov I Analiz Izobrazhenii 8 (</article-title>
          <year>1998</year>
          )
          <fpage>537</fpage>
          -
          <lpage>559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Carion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zagoruyko</surname>
          </string-name>
          ,
          <article-title>End-to-end object detection with transformers</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Deformable detr: Deformable transformers for end-to-end object detection</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .
          <volume>04159</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2016</year>
          .
          <volume>90</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <article-title>On the momentum term in gradient descent learning algorithms</article-title>
          ,
          <source>Neural networks 12</source>
          (
          <year>1999</year>
          )
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <surname>Fast</surname>
          </string-name>
          r-cnn,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2015</year>
          .
          <volume>169</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duerig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>The unreasonable efectiveness of noisy data for fine-grained recognition</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Buslaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Iglovikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Khvedchenya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Druzhinin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Kalinin</surname>
          </string-name>
          ,
          <article-title>Albumentations: Fast and flexible image augmentations</article-title>
          ,
          <source>Information</source>
          <volume>11</volume>
          (
          <year>2020</year>
          ). URL: https://www.mdpi.com/2078-2489/11/2/125. doi:
          <volume>10</volume>
          .3390/info11020125.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>