Improving web user interface element detection
using Faster R-CNN
Jiří Vyskočil1 , Lukáš Picek1
1
    Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia


                                         Abstract
Several challenges may arise when designing new user interfaces (UIs), e.g., in the communication
between designers and developers, and automatic detection of UI elements can help address them.
The ImageCLEF DrawnUI 2021 challenge builds on the detection of such elements in two contest
tasks: a Screenshot task containing website screenshot images with a lot of noisy data, and a
Wireframe task for detecting UI elements in hand-drawn proposals. This paper describes a simple
algorithm based on edge detection for filtering noisy data from the website screenshots, and a
machine learning method that took first place in both tasks, reaching 0.628 and 0.900 mAP at
0.5 IoU in the Screenshot and Wireframe tasks, respectively. The method is based on Faster R-CNN
with a Feature Pyramid Network (FPN) and uses aspect ratios of anchor boxes selected according
to their occurrences in the datasets. The code is available at
https://github.com/vyskocj/ImageCLEFdrawnUI2021.

                                         Keywords
                                         Object Detection, Machine Learning, Edge Detection, Faster R-CNN, FPN, CNN, User Interface




1. Introduction
The ImageCLEF DrawnUI challenge [1] was organized as part of the ImageCLEF 2021 work-
shop [2] at the CLEF conference. The main goal for the two proposed tasks - Screenshots &
Wireframes - was to create a system capable of automatic detection and recognition of individual
user interface (UI) elements on given images. The Screenshot task focused on website screenshot
images, and the Wireframe task targeted hand-drawn UI drawings. The motivation for
both tasks is to simplify and speed up the Web development process by giving the designers
a tool that can visualize the website immediately based on their hand-drawn sketches.
Machine learning techniques have already been applied to hand-drawn UI element
detection in recent years. Gupta et al. [3] used Mask R-CNN [4] with a Multi-Pass Inference
technique that boosts the performance of the model by passing the input image (without the already
detected objects) through the model several times. Narayanan et al. [5] explored Cascade R-CNN [6]
and YOLOv4 [7] architectures, and Zita et al. [8] used regular Faster R-CNN [9] architecture and
advanced regularization techniques for training the model. In this work, we utilize the Faster
R-CNN extended by the Feature Pyramid Network (FPN) [10] that builds high-level semantic
feature maps at all selected scales and makes the predictions more accurate. The models were
implemented and fine-tuned using the Detectron2 API [11] from publicly available checkpoints
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" vyskocj@kky.zcu.cz (J. Vyskočil); picekl@ntis.zcu.cz (L. Picek)
 0000-0002-6443-2051 (J. Vyskočil); 0000-0002-6041-9722 (L. Picek)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
pre-trained on the COCO dataset [12]. Additionally, we improved the performance by using
various augmentations, i.e., Random Relative Resize, Cutout [13], and brightness and contrast
adjustment, and by adjusting the anchor box proposals. In the case of the Screenshot task, we
also used a data filtering algorithm based on edge detection [14, 15]. The improvements of
our method, which won both contest tasks, are demonstrated by comparison with the other
entries in the DrawnUI challenge benchmark.
   Besides, we experimented with novel methods [16, 17] based on the Detection Transformer.
These approaches remove the need for hand-designed components, e.g., non-maximum suppres-
sion (NMS), but require much more training time to converge than previous detectors.
Given this training issue of the Detection Transformer, we decided to keep the NMS in our
model. Using the Transformers on the provided data in the contest tasks led to significantly
worse detection performance even with 7.5× more training steps.


2. Challenge datasets
Wireframe task. The provided dataset consists of 4,291 hand-drawn high-resolution
image templates. The data is divided into 3,218 images for development and 1,073 im-
ages for testing. For each image in the development set, manual annotations with bounding
boxes and their corresponding labels from 21 pre-defined classes are available. The development
set includes all images from last year’s challenge and additional images to re-balance the class
distribution. As there is no official training/validation split provided, we made a random 85%/15%
split. The detailed statistics covering the class distribution, dataset split, and absolute/relative
box numbers are presented in Table 1.
Screenshot task. In the Screenshot task, the provided dataset includes 9,630 full-page screen-
shots of websites in several languages. The data comes with labeled bounding boxes of the UI
elements. A total of 6 classes is defined; the distribution of ground truth boxes can be found
in Table 2. The development set contains 6,840 training images and 930 manually annotated
validation images. The training set includes noisy data: blank images and bounding boxes with
shifted positions. The testing set contains a total of 1,860 samples.


3. Methodology
In this section, we cover the noisy data filtering algorithm and the training of the Faster R-CNN [9]
detection network with a ResNet-50 [18] backbone. We also use the FPN [10] feature extractor to
combine semantically strong features via a top-down pathway and lateral connections between
feature maps of the same spatial size. We use the SGD optimizer with momentum of 0.9 [19];
the learning rate is warmed up during the first epoch to a value of 0.0025, and a smooth L1
loss [20] is applied for box regression. The detector is implemented and fine-tuned with the
Detectron2 API [11] from publicly available weights pre-trained on the COCO dataset [12]. The
hyperparameter settings are detailed in Table 3, and the advanced augmentations are listed in
Table 4. All experiments are evaluated using mean average precision (mAP) and mean average
recall (mAR) averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95 with a step
of 0.05, and mean average precision at an IoU threshold of 0.5 (denoted as mAP0.5).
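As an illustration of this setup, the following is a minimal Detectron2 configuration sketch,
assuming the standard Faster R-CNN R50-FPN model-zoo config as the starting point; the exact
training-script wiring (gradient accumulation, warm-up and decay schedule) is simplified here and
not taken from our code.

    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    # Faster R-CNN with a ResNet-50 + FPN backbone, starting from COCO pre-trained weights
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")

    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 21   # 21 classes in the Wireframe task (6 in the Screenshot task)
    cfg.SOLVER.IMS_PER_BATCH = 1           # batch size 1; gradients are accumulated over 4 steps
    cfg.SOLVER.BASE_LR = 0.0025            # reached after a linear warm-up over the first epoch
    cfg.SOLVER.MOMENTUM = 0.9              # SGD with momentum
    cfg.SOLVER.GAMMA = 0.5                 # learning-rate decay factor
    # cfg.SOLVER.WARMUP_ITERS, STEPS and MAX_ITER depend on the dataset size (see Table 3)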
Table 1
Distribution of training and validation set categories of the Wireframe task.
                                                  Wireframe task
                                             Training set                  Validation set
                      Category
                                      # of boxes    fraction [%]     # of boxes    fraction [%]
                      button            21,787          19.33           3,657          17.80
                      label             17,348          15.39           3,462          16.85
                      paragraph         13,884          12.32           2,474          12.04
                      image             10,328           9.16           1,752           8.53
                      link               6,359           5.64           1,133           5.52
                      linebreak          6,208           5.51           1,118           5.44
                      container          5,425           4.81            953            4.64
                      header             4,356           3.86            739            3.60
                      textinput          4,192           3.72            825            4.02
                      checkbox           3,426           3.04            677            3.30
                      radiobutton        3,302           2.93            719            3.50
                      toggle             2,785           2.47            574            2.79
                      slider             2,668           2.37            524            2.55
                      datepicker         2,606           2.31            473            2.30
                      textarea           2,449           2.17            452            2.20
                      rating             2,372           2.10            437            2.13
                      dropdown           1,453           1.29            250            1.22
                      video               810            0.72            155            0.75
                      list                788            0.70            125            0.61
                      stepperinput        137            0.12              22           0.11
                      table                 55           0.05              19           0.09


Table 2
Distribution of training and validation set categories of the Screenshot task.
                                                  Screenshot task
                                            Training set                  Validation set
                        Category
                                     # of boxes    fraction [%]     # of boxes    fraction [%]
                        link          106,457         36.72          10,910          30.46
                        text           82,642         28.51           9,400          26.24
                        image          52,672         18.17           7,462          20.83
                        heading        39,330         13.57           5,620          15.69
                        input           4,448          1.53             605           1.69
                        button          4,353          1.50           1,823           5.09



3.1. Baseline experiment
For the baseline experiment, the Random Relative Resize augmentation is applied to resize an
image to 70-90% of its size and crop it to a maximum of 1,400 px to limit the memory usage.
The resize augmentations are examined in more detail in Section 3.3.1. The hyperparameter settings
are described in Table 3 and the advanced augmentations in Table 4. In the Screenshot task, the
baseline model reached 0.592 mAP0.5, 0.404 mAP, and 0.603 mAR, while in the Wireframe
task, a model with the same settings achieved 0.969 mAP0.5, 0.703 mAP, and 0.763 mAR.
Table 3
Base parameters for training the models.
 Parameter           Value                      Parameter             Value
 Checkpoint          COCO                       Batch size            1
 Optimizer           SGD w/ moment. of 0.9      Accumulated grad.     4
 Loss                smooth L1                  Epochs                20 (Screenshot) / 40 (Wireframe)
 Base and min lr     0.0025 - 0.000625          Decay in ... epoch    [10, 15] (Screenshot) / [20, 30] (Wireframe)
 Decay factor        0.5
 Warm up             1 epoch


Table 4
Base augmentations for training the models.
 Augmentation                                               Intensity          Probability [%]
 Random Brightness                                          0.5 - 1.5                50
 Random Contrast                                            0.5 - 1.5                50
 Random Saturation (RGB only)                               0.5 - 1.5                50


3.2. Filtering noisy data in the Screenshot task
Even though the noisy data can be effective for the training [21], we decided to analyze filtering
of blank images and wrongly annotated bounding boxes from the Screenshot task dataset.
The aim is to remove images or ground truth boxes that contain constant color intensity. For
this reason, the data filtering (shown in Algorithm 1) is based on an edge detector [14, 15] to be
independent of the intensity of the pixels in the input image.

Algorithm 1 Filtering homogeneous image elements from a dataset.
   Define threshold values 𝑇𝑖𝑚𝑔 and 𝑇𝑏𝑜𝑥
   for each image do
       Apply edge detector to the image and compute a mean value 𝜇𝑖𝑚𝑔 from the output
       if 𝜇𝑖𝑚𝑔 ≤ 𝑇𝑖𝑚𝑔 then
           Discard this image from the set and continue with the next one
       else
           for each bounding box of the image do
               Apply the 𝑇𝑏𝑜𝑥 threshold in the same way as 𝑇𝑖𝑚𝑔 in the image
           end for
           if all bounding boxes are discarded from the annotations of the image then
               Discard this image from the set and continue with the next one
           end if
       end if
   end for
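The following is one possible implementation sketch of Algorithm 1, assuming OpenCV's Laplacian
operator as the edge detector and boxes given as (x1, y1, x2, y2) pixel coordinates; the function
names are illustrative and the exact edge detector used in our pipeline may differ.

    import cv2
    import numpy as np

    def edge_response(gray: np.ndarray) -> float:
        """Mean magnitude of the Laplacian edge map; low values indicate a homogeneous region."""
        return float(np.abs(cv2.Laplacian(gray, cv2.CV_64F)).mean())

    def filter_annotations(image_path, boxes, t_img, t_box):
        """Return the boxes to keep, or None if the whole image should be discarded (Algorithm 1)."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        if edge_response(gray) <= t_img:
            return None                        # blank / homogeneous image
        kept = []
        for (x1, y1, x2, y2) in boxes:
            crop = gray[int(y1):int(y2), int(x1):int(x2)]
            if crop.size > 0 and edge_response(crop) > t_box:
                kept.append((x1, y1, x2, y2))  # box contains some structure
        return kept if kept else None          # discard image if all boxes were removed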

  To verify the efficiency of the data filtering, we manually selected appropriate thresholds for
images (see Table 5) and a set of fixed thresholds from 0.2 to 1.8 for bounding boxes. Then
we trained the network with the filtered annotations of the Screenshot dataset using the same
settings as in Section 3.1. For the results of this experiment, see Table 6. One can observe that
filtering the homogeneous images increases mAP0.5 by 0.012, mAP by 0.008, and mAR by
0.009 compared to the original set. Filtering homogeneous images and bounding boxes also
increases the detection performance, but for all tested bounding box thresholds the detector is
less precise than when filtering only the images. This behaviour may be caused by eliminating
training data that in fact contains objects rather than noise. Therefore, we established
a new baseline for the Screenshot task by filtering only the images from the training set
(this model was submitted as the baseline for the Screenshot task of the DrawnUI challenge [1]).

Table 5
Manually designed thresholds for discarding images from a training set of the Screenshot task.
Img size          ≤ 500 × 500       ≤ 600 × 600          ≤ 900 × 900   ≤ 1200×1200      > 1200×1200
Threshold              2.5                3.5                2.8            2.5               0.8



Table 6
Results on the validation set while discarding noisy images (for threshold values see Table 5) and bound-
ing boxes (BBoxes) from the training set of the Screenshot task.
                          Threshold
                                                mAP0.5        mAP         mAR
                       Image    BBox
                         -           -          0.592         0.404      0.603
                       Tab. 5        -          0.604         0.412      0.612
                       Tab. 5       0.2         0.599         0.409      0.610
                       Tab. 5       0.6         0.599         0.409      0.610
                       Tab. 5       1.0         0.601         0.408      0.610
                       Tab. 5       1.4         0.601         0.411      0.612
                       Tab. 5       1.8         0.593         0.397      0.606



3.3. Augmentations
Image resizing, Cutout [13] augmentation, and color spaces are tested to improve detection
performance. The improvement is evaluated as a comparison with baseline models defined in
the previous sections, i.e., Section 3.1 is relevant for the Wireframe task and for the Screenshot
task, a new baseline was established in the Section 3.2.

3.3.1. Image resize
The basic approach to dealing with various sizes of input images is to resize them to a
desired constant value while keeping the original aspect ratio. However, varying sizes can help
the learning algorithm detect objects at different stages of the network. For example, imagine
that only small boxes are available for the category button in the training set. At test
time, such a network will not expect a large button at the input and will most likely fail to detect
it. In order to use different input image sizes, the backbone network must not contain any fully
connected layer.
    Two types of image resizing are compared in Table 7. The first one, Resize Shortest Edge
(the default in the Detectron2 [11] software), uses a defined set of shortest-edge lengths from
640 to 800 px with an increment of 32, selected randomly during training. If the
longer edge exceeds 1,333 px, the shorter edge is scaled down further so that the longer edge does
not exceed this maximum size. We propose the second type of resizing, Random Relative
Resize. It defines an interval from which a relative resizing factor is drawn randomly, and a maximum
edge length to which the image is cropped during training due to memory requirements. The particular
aim of this augmentation is to preserve small boxes so that they do not disappear when the image
size is reduced and the network remains able to detect them. At test time, the image is resized
by the middle value of the specified interval and no cropping is applied. This
augmentation proved most suitable when resizing the image by a random factor from the
interval [0.6, 1.0] for both tasks, where the mAP0.5 and mAP metrics are roughly
0.034 to 0.045 higher than when using Resize Shortest Edge.
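A minimal sketch of how such a Random Relative Resize could be expressed as a Detectron2
augmentation; the class name and the handling of the 1,400 px crop are illustrative assumptions,
not our exact implementation.

    import numpy as np
    from detectron2.data import transforms as T

    class RandomRelativeResize(T.Augmentation):
        """Resize the image by a random factor from [low, high]; at test time the mid-point is used."""

        def __init__(self, low=0.6, high=1.0, is_train=True):
            super().__init__()
            self.low, self.high, self.is_train = low, high, is_train

        def get_transform(self, image):
            h, w = image.shape[:2]
            scale = np.random.uniform(self.low, self.high) if self.is_train \
                else 0.5 * (self.low + self.high)
            return T.ResizeTransform(h, w, int(h * scale), int(w * scale))

    # During training, a crop to at most 1,400 px (e.g., via T.RandomCrop) would follow for images
    # that still exceed that size; no crop is applied at test time.
    train_augs = [RandomRelativeResize(low=0.6, high=1.0, is_train=True)]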

Table 7
Comparison of two types of resizing augmentation on the validation set for the Screenshot and Wire-
frame tasks. Resize Shortest Edge (RSE) selects sizes from 640 to 800 px with a step of 32. Random
Relative Resize (RRR) defines an interval of [min, max] relative sizes as factor for resizing the image.
                                   Screenshot task                         Wireframe task
 Resize type              mAP0.5         mAP          mAR         mAP0.5         mAP          mAR
 RSE                       0.563        0.372         0.549        0.929         0.672        0.732
 RRR [0.7, 0.9]            0.604        0.412         0.612        0.969         0.703        0.763
 RRR [0.6, 1.0]            0.608        0.417         0.619        0.972         0.706        0.765
 RRR [0.4, 1.0]            0.608        0.414         0.608        0.954         0.689        0.751



3.3.2. Cutout augmentation
To increase the performance of the network, in addition to the brightness and resize augmentations,
we also used Cutout [13] from the Albumentations library [22] to randomly cut boxes (also denoted
as holes) out of the image. This augmentation expects the maximum number of holes and their
maximum spatial size as input. In our experiments, we define the maximum size of the holes as
a percentage of the image size. The results (see Table 8) show that it can increase mAP0.5 by 0.008
and mAP by 0.004 for the Screenshot task when using 4 holes with a maximum size of 5% of the image,
while in the Wireframe task, the detection performance was slightly reduced in all settings of
the Cutout augmentation. Even so, we applied this augmentation in our further experiments (see
Section 3.4 and Section 4) to keep the experiments for both contest tasks comparable, and in
the Screenshot task, the augmentation shows meaningful improvements.
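A small sketch of how the Cutout settings above could be applied with Albumentations; the helper
function is illustrative, and converting the hole size from a fraction of the image to pixels is
one possible way to realize the percentage-based setting.

    import albumentations as A

    def apply_cutout(image, max_holes=4, max_size=0.05, p=0.5):
        """Cut up to `max_holes` rectangles of at most `max_size` (fraction of the image) out of `image`."""
        h, w = image.shape[:2]
        aug = A.Cutout(num_holes=max_holes,
                       max_h_size=int(max_size * h),   # hole height limited to 5% of the image height
                       max_w_size=int(max_size * w),   # hole width limited to 5% of the image width
                       fill_value=0,
                       p=p)
        return aug(image=image)["image"]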
Table 8
Comparison of using Cutout augmentation on the validation set for the Screenshot and Wireframe
tasks. The max size is given as a percentage of the image size.
                                      Screenshot task                        Wireframe task
 max holes     max size      mAP0.5        mAP          mAR         mAP0.5        mAP          mAR
      -            -          0.604        0.412        0.612        0.969        0.703        0.763
      4          5.0%         0.612        0.416        0.610        0.967        0.699        0.759
      8          5.0%         0.603        0.411        0.606        0.968        0.701        0.760
     16          5.0%         0.596        0.405        0.608        0.967        0.698        0.759
     16          2.5%         0.603        0.410        0.607        0.964        0.697        0.758


3.3.3. Color space
In the next step, we convert the images to greyscale, as in previous works on this
challenge [8, 3]. This yields no improvement over RGB images (see Table 9) in
the Screenshot task. On the other hand, for the Wireframe task, converting the data to greyscale
yields up to approximately 0.005 higher mAP0.5, mAP, and mAR. Therefore, both RGB and
greyscale images are used for the remaining experiments.
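A brief sketch of one way to feed greyscale data to a COCO-pretrained three-channel backbone,
namely by replicating the single greyscale channel; this is an assumption about the conversion,
not a description of our exact pipeline.

    import cv2

    def to_greyscale_3ch(image_bgr):
        """Convert a BGR image to greyscale and replicate it back to three channels,
        so that COCO-pretrained weights expecting 3-channel input can still be used."""
        grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        return cv2.cvtColor(grey, cv2.COLOR_GRAY2BGR)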

Table 9
Comparison of using RGB and greyscale images on the validation set for the Screenshot and Wireframe
tasks.
                                      Screenshot task                        Wireframe task
 Color space                 mAP0.5        mAP          mAR         mAP0.5        mAP          mAR
 RGB                          0.604        0.412        0.612        0.969        0.703        0.763
 Greyscale                    0.593        0.403        0.604        0.974        0.707        0.767



3.4. Anchor box proposals
We followed up on the previous experiments that examined augmentations (see Section 3.3) and
trained new models (the parameters of the new augmentations are summarized in Table 10). After that,
we analyzed which aspect ratios of ground truth boxes are present in the datasets. The occurrences
of these aspect ratios are visualized in Figure 1 for both the Screenshot and the Wireframe tasks.
One can observe that horizontal boxes are far more frequent than vertical ones. As
a result, appropriate aspect ratios were selected to generate the box proposals. We added one
horizontal aspect ratio of 0.2 to the default ones (i.e., to the set of 0.5, 1, and 2). Then we selected
aspect ratios of 0.1, 0.5, 1, and 1.5 according to the distribution in Figure 1. Eventually,
we also reduced the anchor sizes 2× for each output layer of the Feature Pyramid
Network [10] on the ResNet [18] backbone, i.e., for the semantic feature maps 𝑃2 to
𝑃6; see Table 11 for a summary of these settings.
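In Detectron2, these anchor settings correspond to the following configuration values (aspect
ratios follow Detectron2's height/width convention); a sketch of settings #3 and #4 from Table 11.

    from detectron2.config import get_cfg

    cfg = get_cfg()
    # Setting #3 ("statistical"): aspect ratios selected from the dataset distribution,
    # with the default anchor sizes (one entry per FPN level P2-P6).
    cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[0.1, 0.5, 1.0, 1.5]]
    cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[32], [64], [128], [256], [512]]

    # Setting #4 ("statistical + smaller sizes"): the same aspect ratios with all anchor
    # sizes reduced 2x, used for the Wireframe task.
    # cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[16], [32], [64], [128], [256]]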
   An experiment examining the use of different aspect ratios for anchor box proposals (see
Table 12) shows that selecting aspect ratios by frequency in the dataset increases detection
Table 10
Additional augmentations to the training the models (see Table 3 and Table 4 for base parameters and
common augmentations used for training).
 Augmentation                                                                                           Parameters
 Random Relative Resize                                             resize interval = [0.6, 1.0]                  crop = 1400 px
 Cutout                                                             max holes = 4                                 max size = 5% of image


performance in most cases. Only for the Wireframe task with RGB images did the default aspect
ratios achieve slightly higher mAP and mAR than the ones selected from the statistics. The mAP0.5
value is higher for aspect ratios selected according to the statistics combined with smaller anchor
sizes, and this setting performed better for greyscale images, by roughly 0.002 to 0.003 for all
measured metrics. Therefore, we selected this setting for the comparison with the other backbone
models in the Wireframe task. For the Screenshot task, the same aspect ratios were selected but
with the default anchor box sizes, because this performed up to 0.006 better in mAP0.5 and mAP
than the default anchor settings.
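For reference, the distribution in Figure 1 can be reproduced with a short script of the following
kind, assuming COCO-style annotations with [x, y, width, height] boxes and the height/width
aspect-ratio convention; this is an illustrative sketch, not our analysis code.

    import json
    import numpy as np

    def aspect_ratio_histogram(annotation_json, bins=np.arange(0.0, 2.05, 0.05)):
        """Relative occurrences of bounding-box aspect ratios (height / width), as in Figure 1."""
        with open(annotation_json) as f:
            anns = json.load(f)["annotations"]
        ratios = [a["bbox"][3] / a["bbox"][2] for a in anns if a["bbox"][2] > 0]
        counts, edges = np.histogram(ratios, bins=bins)
        return counts / counts.sum(), edges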


[Figure 1: two histograms showing the relative occurrences of bounding box aspect ratios (from 0.0 to 2.0) for the Screenshot task (top) and the Wireframe task (bottom).]

Figure 1: Occurrence of the aspect ratios of the bounding boxes in the Screenshot and Wireframe task
datasets.
Table 11
Settings of the selected aspect ratios and sizes for anchor box proposals.
  #    Anchor generator settings                 Aspect ratios                  Anchor sizes
  1    default                                      [0.5, 1.0, 2.0]         [32, 64, 128, 256, 512]
  2    default + horizontal                    [0.2, 0.5, 1.0, 2.0]         [32, 64, 128, 256, 512]
  3    statistical                             [0.1, 0.5, 1.0, 1.5]         [32, 64, 128, 256, 512]
  4    statistical + smaller sizes             [0.1, 0.5, 1.0, 1.5]      [16, 32, 64, 128, 256]


Table 12
Comparison of using different settings of aspect ratios and sizes for anchor box proposals on the valida-
tion set. Comparison is performed on the RGB and greyscale images for the Screenshot and Wireframe
tasks. Anchor settings are specified in Table 11.
                                             Screenshot task                    Wireframe task
 Anchor setting Color space          mAP0.5       mAP          mAR     mAP0.5       mAP          mAR
       #1             RGB            0.617       0.421         0.615    0.973       0.705        0.765
       #2             RGB            0.621       0.424         0.625    0.974       0.704        0.762
       #3             RGB            0.623       0.426         0.627    0.970       0.704        0.763
       #4             RGB            0.611       0.414         0.627    0.975       0.700        0.762
       #1          Greyscale         0.612       0.417         0.613    0.972       0.702        0.762
       #2          Greyscale         0.610       0.416         0.617    0.974       0.703        0.762
       #3          Greyscale         0.612       0.419         0.620    0.970       0.703        0.762
       #4          Greyscale         0.606       0.410         0.622    0.975       0.704        0.764


4. Backbones comparison
As a last step, several backbone architectures were compared for UI element detection on
greyscale and RGB images. We used the base parameters and augmentations from Table 3 and
Table 4, and the additional augmentations described in Table 10. In the Wireframe task, the
statistical + smaller sizes variant of the anchor generator was used, and the statistical variant
was used for the Screenshot task (for these anchor box proposal settings, see Table 11).
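The compared backbones correspond to the following Detectron2 model-zoo configurations; a sketch
under the assumption that the standard 3x-schedule COCO configs are used as starting points.

    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    # Model-zoo configs for the compared backbones (all Faster R-CNN + FPN, COCO pre-trained).
    BACKBONE_CONFIGS = {
        "ResNet-50":   "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml",
        "ResNet-101":  "COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml",
        "ResNeXt-101": "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml",
    }

    def build_backbone_cfg(name):
        cfg = get_cfg()
        cfg.merge_from_file(model_zoo.get_config_file(BACKBONE_CONFIGS[name]))
        cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(BACKBONE_CONFIGS[name])
        return cfg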
  In the comparison of the backbone architectures (see Table 13), one can see that only in the
Wireframe task did the most complex architecture achieve better performance across the measured
metrics for both color spaces. Although we expected better performance with the more complex
ResNeXt-101 backbone, superior results were achieved with ResNet-50 in the Screenshot task.
The model with the more complex backbone converges more slowly than ResNet-50, hence more
epochs would be needed for better results.


5. Submissions
In the DrawnUI challenge [1], we created up to 9 submissions using the configurations
listed below. The configuration is the same for both the Screenshot and the Wireframe tasks;
any additional configuration relevant to only one of the tasks is also specified. Results on the test
Table 13
Comparison of different backbone architectures on the validation set. Comparison is performed on
the RGB and greyscale images for the Screenshot and Wireframe tasks.
                                               Screenshot task                     Wireframe task
 Backbone            Color space      mAP0.5        mAP          mAR       mAP0.5      mAP          mAR
 ResNet-50              RGB            0.623       0.426         0.627     0.975       0.700        0.762
 ResNet-101             RGB            0.603       0.410         0.614     0.970       0.699        0.758
 ResNeXt-101            RGB            0.601       0.408         0.608     0.977       0.705        0.765
 ResNet-50           Greyscale         0.612       0.419         0.620     0.975       0.704        0.764
 ResNet-101          Greyscale         0.604       0.413         0.612     0.970       0.699        0.760
 ResNeXt-101         Greyscale         0.598       0.408         0.605     0.976       0.706        0.766


set can be found in Table 14:
#1: ResNet-50 (baseline, RGB) - model trained according to Table 3 and Table 4 w/ the Ran-
dom Relative Resize augmentation using image resize interval [0.7, 0.9]. In the Screenshot task,
only the images were filtered using thresholds described in Table 5.
#2: ResNet-50 (augmentations, RGB) - baseline trained w/ augmentations from Table 10.
#3: ResNet-50 (anchor settings, RGB) - same as submission #2 w/ anchor settings from Ta-
ble 11: statistical for the Screenshot task, and statistical + smaller sizes for the Wireframe task.
#4: ResNet-50 (anchor settings, greyscale) - same as submission #3 but w/ greyscale images.
#5: ResNet-50 (train+val, RGB) - same as submission #3 but trained on the whole develop-
ment set (w/o any validation data).
#6: ResNeXt-101 (RGB) - trained w/ the same settings as submission #3.
#7: ResNet-50 (train+val, RGB, 2× epochs) - submission #5 trained for 2× more epochs.
#8: ResNet-50 (train+val, greyscale) - same as submission #5 but trained w/ greyscale images.
#9: ResNeXt-101 (RGB, train+val, +5 epochs) - submission #6 fine-tuned w/ 5 more epochs
on whole development set (w/o any validation data).

Table 14
Test results obtained from the submissions.
                         Screenshot task                                    Wireframe task
  #        Run ID             mAP0.5            mAR0.5            Run ID        mAP0.5          mAR0.5
  1         134207            0.594              0.815            134095           0.794            0.832
  2         134214            0.602              0.822            134175           0.830            0.863
  3         134215            0.609              0.834            134180           0.882            0.918
  4         134217            0.601              0.827            134181           0.889            0.923
  5         134224            0.628              0.830            134225           0.888            0.925
  6         134603            0.590              0.807            134548           0.900            0.934
  7         134716            0.621              0.821            134723           0.894            0.928
  8            -                -                  -              134728           0.895            0.927
  9            -                -                  -              134829           0.900            0.933
6. Conclusion
Our method, including data filtering, Cutout augmentation, and statistical aspect ratios for anchor
box proposals, took first place in both contest tasks of the DrawnUI challenge: in the Screenshot
task, a ResNet-50 backbone trained on the whole development set reached 0.628 mAP at 0.5 IoU on
the test set, and in the Wireframe task, a ResNeXt-101 backbone trained on the development set
split into training and validation parts achieved 0.900 mAP at 0.5 IoU on the test set. Besides,
we explored state-of-the-art object detectors based on transformers, such as DETR.
DETR did not achieve satisfactory results even after 300 epochs, compared with the Faster
R-CNN trained for up to 40 epochs. Due to time constraints, we leave the use of transformers
for upcoming research projects.


Acknowledgments
The work has been supported by the grant of the University of West Bohemia, project No.
SGS-2019-027. Computational resources were supplied by the project "e-Infrastruktura CZ"
(e-INFRA LM2018140) provided within the program Projects of Large Research, Development
and Innovations Infrastructures.


References
 [1] R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin,
     B. Ionescu, Overview of ImageCLEFdrawnUI 2021: The Detection and Recognition of
     Hand Drawn and Digital Website UIs Task, in: CLEF2021 Working Notes, CEUR Workshop
     Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.
 [2] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.
     Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera,
     J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.
     Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,
     A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval
     in medical, nature, internet and social media applications, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International
     Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer
     Science, Springer, Bucharest, Romania, 2021.
 [3] P. Gupta, S. Mohapatra, Html atomic ui elements extraction from hand-drawn website
     images using mask-rcnn and novel multi-pass inference technique, in: CLEF2020 Working
     Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2020.
 [4] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE
     international conference on computer vision, 2017, pp. 2961–2969.
 [5] N. Narayanan, N. N. A. Balaji, K. Jaganathan, Deep learning for ui element detection:
     Drawnui 2020, in: CLEF2020 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org,
     Thessaloniki, Greece, 2020.
 [6] Z. Cai, N. Vasconcelos, Cascade r-cnn: High quality object detection and instance seg-
     mentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2021)
     1483–1498. doi:10.1109/TPAMI.2019.2956516.
 [7] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, Yolov4: Optimal speed and accuracy of object
     detection, arXiv preprint arXiv:2004.10934 (2020).
 [8] A. Zita, L. Picek, A. Říha, Sketch2code: Automatic hand-drawn ui elements detection with
     faster r-cnn, in: CLEF2020 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org,
     Thessaloniki, Greece, 2020.
 [9] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with
     region proposal networks, in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett
     (Eds.), Advances in Neural Information Processing Systems 28, Curran Associates, Inc.,
     2015, pp. 91–99.
[10] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks
     for object detection, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition
     (CVPR), 2017, pp. 936–944. doi:10.1109/CVPR.2017.106.
[11] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, https://github.com/
     facebookresearch/detectron2, 2019.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick,
     Microsoft coco: Common objects in context, in: European conference on computer vision,
     Springer, 2014, pp. 740–755.
[13] T. DeVries, G. W. Taylor, Improved regularization of convolutional neural networks with
     cutout, arXiv preprint arXiv:1708.04552 (2017).
[14] X. Wang, Laplacian operator-based edge detectors, IEEE transactions on pattern analysis
     and machine intelligence 29 (2007) 886–890. doi:10.1109/TPAMI.2007.1027.
[15] D. Ziou, S. Tabbone, et al., Edge detection techniques-an overview, Pattern Recognition and
     Image Analysis C/C of Raspoznavaniye Obrazov I Analiz Izobrazhenii 8 (1998) 537–559.
[16] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object
     detection with transformers, in: European Conference on Computer Vision, Springer,
     2020, pp. 213–229.
[17] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for
     end-to-end object detection, arXiv preprint arXiv:2010.04159 (2020).
[18] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: The
     IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
     doi:10.1109/CVPR.2016.90.
[19] N. Qian, On the momentum term in gradient descent learning algorithms, Neural networks
     12 (1999) 145–151.
[20] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer
     vision, 2015, pp. 1440–1448. doi:10.1109/ICCV.2015.169.
[21] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, L. Fei-Fei, The
     unreasonable effectiveness of noisy data for fine-grained recognition, in: European
     Conference on Computer Vision, Springer, 2016, pp. 301–320.
[22] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, A. A. Kalinin,
     Albumentations: Fast and flexible image augmentations, Information 11 (2020). URL:
     https://www.mdpi.com/2078-2489/11/2/125. doi:10.3390/info11020125.