UI element detection from wireframe drawings of websites

Prasang Gupta¹, Vishakha Bansal¹
¹PwC US Advisory, BG House, Lake Boulevard Road, Hiranandani Gardens, Powai, Mumbai, India

Abstract
User Interface (UI) wireframes are a crucial part of designing the front-end of websites and mobile applications. Detecting UI elements such as paragraphs, buttons and images in wireframes with advanced Artificial Intelligence (AI) algorithms paves the way to automating the conversion of wireframes to Hypertext Markup Language (HTML) code. In this paper, we explore different variants of the 5th generation of the You Only Look Once (YOLOv5) algorithm, together with post-processing that tunes a confidence cut-off variable for the detected UI elements. Our final approach comprises data pre-processing using contrast normalization and conversion to black and white (BW), detection and localization of UI elements using the YOLOv5x variant, and a confidence cutoff for selecting the final bounding boxes. This approach resulted in a Mean Average Precision (mAP) of 0.836 on the test data.

Keywords
Website UI elements, UI element extraction, Image Processing, OpenCV, Object Detection, YOLOv5, Confidence Cutoff Variation

1. Introduction
In recent times, building an online presence through websites and mobile applications has become a necessity for businesses seeking global outreach and better customer service. Designing such applications is a time-consuming and iterative process, and wireframing serves as its starting point. Various tools exist for creating such wireframes and converting them automatically to code; however, these tools can be expensive and require tool-specific training.
The wireframe task of ImageCLEFdrawnUI 2021 [1], part of ImageCLEF 2021 [2], is the second edition in this area. It aims at reducing the dependency on such tools and automating the code-conversion process by using machine learning to detect and localize UI elements in wireframes. The dataset provided as part of this task has been enhanced over the previous edition in both volume and class distribution. In this study, we focus on creating a model-driven approach that identifies and localizes the bounding boxes of all UI elements present in a wireframe. In the next section, we briefly describe the dataset used for training and validation of the models. In Section 3, we cover the methodology used: data pre-processing, modelling and post-processing. In Section 4, we present the results from the final approach. The paper finishes with the conclusion and future work.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
prasang.gupta@pwc.com (P. Gupta); vishakha.bansal@pwc.com (V. Bansal)
https://prasang-gupta.github.io/ (P. Gupta)
ORCID: 0000-0002-3623-8227 (P. Gupta); 0000-0002-5582-9145 (V. Bansal)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Dataset
The dataset for the ImageCLEFdrawnUI 2021 competition [1] included snapshots of hand-drawn wireframe images of website layouts. The images covered a total of 21 classes of atomic UI elements, including images, paragraphs, headers, links etc. The provided dataset contained 4291 such images, divided into a development and a test set.

Table 1: Class distribution in the train and validation sets.
Label          Train Freq   Val Freq   Train Dist   Val Dist
paragraph      2727         141        0.89         0.88
label          1004         42         0.33         0.26
header         2059         121        0.67         0.76
button         2991         153        0.98         0.96
image          2462         121        0.81         0.76
linebreak      2370         128        0.78         0.80
container      2923         153        0.96         0.96
link           1031         56         0.34         0.35
textinput      1453         69         0.48         0.43
dropdown       688          34         0.22         0.21
checkbox       663          25         0.22         0.16
radiobutton    478          14         0.16         0.09
rating         434          11         0.14         0.07
toggle         452          13         0.15         0.08
textarea       418          9          0.14         0.06
datepicker     468          12         0.15         0.08
stepperinput   91           3          0.03         0.02
slider         491          16         0.16         0.10
video          448          22         0.15         0.14
table          56           1          0.02         0.01
list           180          6          0.06         0.04

The development set contained 3218 labelled images, while the test set contained 1073 unlabelled images. The development set was further divided into a train and a validation set: of its 3218 images, 3058 formed the train set and the remaining 160 formed the validation set (about 5% of the development set). This division was made so that the class distributions of the two sets remain as close as possible; the exact distribution is shown in Table 1.

The images were all RGB images of many different sizes. The size distribution can be seen in the height and width histograms in Figure 1. All images were later resized to a constant 512 x 512 for training purposes.

Figure 1: Distribution of heights and widths of the images in the development set (20 bins).

Figure 2: Distribution of the classes within the development dataset. The plot on the left shows the total number of occurrences of a class in the dataset. The plot on the right shows the spread of the classes, defined as the total number of unique images that have at least 1 occurrence of that class. Both plots show that some classes are abundantly present while others are less represented.

The 21 classes had different amounts of representation in the dataset.
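The frequency and spread statistics plotted in Figure 2 can be computed directly from the annotations. A minimal sketch in Python, where the image-to-labels mapping is a hypothetical stand-in for the real annotation files:

```python
from collections import Counter

# Hypothetical mapping image -> list of UI element labels found in that image.
annotations = {
    "img_001": ["paragraph", "button", "button", "image"],
    "img_002": ["paragraph", "header"],
    "img_003": ["button", "table"],
}

# Frequency: total occurrences of each class across all images.
freq = Counter(label for labels in annotations.values() for label in labels)

# Spread: number of unique images containing the class at least once
# (set() collapses repeats within a single image).
spread = Counter(label for labels in annotations.values() for label in set(labels))

print(freq["button"], spread["button"])   # → 3 2
```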
Some classes commonly found on websites, such as paragraph, button, link and image, were abundantly present and dominated most of the images in the dataset. Classes that are less common on websites, such as table, video, stepperinput and list, were present in comparatively smaller numbers in the dataset as well. The distribution of the classes across the dataset can be seen in Figure 2.

3. Methodology

3.1. Data Pre-Processing
We employed several pre-processing techniques to improve the viability and performance of our model. As we are dealing with image data, most of these steps use OpenCV [3]; we opted for its C++ API because of the added speed it provides when dealing with large images. The techniques we used are described in the subsequent sections.

3.1.1. Contrast improvement
The images in the dataset are snapshots of wireframe drawings of websites, made either on paper with a pen or on a whiteboard with a marker. As the final images are snapshots, they depend heavily on the quality of the camera used. Since most cameras introduce noise or a brightness overlay, it was expected, and verified, that different images had different tints, brightness and contrast.

To counter this issue, histogram equalisation techniques change the range of the pixels in the image from a very confined span to a much wider distribution. This generally results in a much clearer image with better separation, as visualised in Figure 3.

Figure 3: Histogram equalisation changing the narrow histogram of the pixels in the input image to a much wider distribution.

Histogram equalisation is powerful, but it has a shortcoming: in addition to enhancing the contrast of the image, it also enhances the noise.
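The remapping behind histogram equalisation can be sketched in a few lines. The following is a minimal NumPy illustration of the idea only; the study itself used OpenCV's C++ API, where cv2.equalizeHist and cv2.createCLAHE provide the plain and contrast-limited versions:

```python
import numpy as np

def equalize_hist(img: np.ndarray) -> np.ndarray:
    """Plain histogram equalisation: remap pixel values so that their
    cumulative distribution becomes approximately uniform over 0-255."""
    hist, _ = np.histogram(img.flatten(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf_m = np.ma.masked_equal(cdf, 0)                       # ignore empty bins
    cdf_scaled = (cdf_m - cdf_m.min()) * 255 / (cdf_m.max() - cdf_m.min())
    lut = np.ma.filled(cdf_scaled, 0).astype(np.uint8)       # lookup table
    return lut[img]

# A low-contrast image whose pixels are confined to the range 100-130 ...
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 131, size=(64, 64), dtype=np.uint8)
stretched = equalize_hist(low_contrast)
# ... is remapped to span the full 0-255 range.
print(int(stretched.min()), int(stretched.max()))   # → 0 255
```

CLAHE adds one step before this remapping: histogram bins above the clip limit are truncated and their mass redistributed, which keeps noise from being amplified.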
Since our dataset has a lot of noise due to the way the images were collected, we chose to employ the Contrast Limited Adaptive Histogram Equalisation (CLAHE) implementation from the OpenCV library, which overcomes this shortcoming.

CLAHE is a modification of the more general histogram equalisation techniques: it applies contrast limiting before histogram equalisation. This is achieved by clipping the histogram bins that are above a specified contrast limit (we used the default value of 40 in this study) and then uniformly redistributing the clipped pixels to the other bins, as visualised in Figure 4. Using this technique provided us with much cleaner images. Its effect on a sample image from the dataset can be seen in Figure 5.

Figure 4: Histogram equalisation after contrast limiting (CLAHE).

Figure 5: A depiction of the effect of CLAHE on images. The image on the left is the original and the one on the right is after applying CLAHE. The processed image is much clearer and more legible than the original.

3.1.2. Conversion to Black and White
After obtaining the contrast-normalised images, the next step was to convert them to black and white. This removes residual noise and focuses the model on the only thing that matters, i.e. the wireframe drawings. Gupta et al. [4] previously performed many iterations to find the most effective way to convert grayscale wireframe images to black and white; we directly used the techniques discussed in that paper.

3.2. Modelling
This problem is, at its heart, an object detection problem, so object detection models such as Mask-RCNN [5], YOLO [6] and EfficientDet [7] come to mind. This problem can also be modelled as a segmentation problem.
Hence, this adds other well-proven models such as U-Net [8], LinkNet [9], FPN [10] and PSPNet [11] to the list of possible modelling techniques.

For this problem, we started by exploring U-Net and Mask-RCNN. U-Net is proven to work well on such problems, but given that the dataset is not huge by deep learning (DL) standards, training a full U-Net would take a long time and, owing to the sheer number of parameters involved, would likely run into overfitting issues. Mask-RCNN was another strong option: it is much easier to train and has been known to perform on par with, if not better than, U-Net across different use cases [12] [13]. However, it had already been explored and found to run into problems detecting smaller UI elements, so it was discarded [4].

We chose to go with the latest version of YOLOv5 [15] to ensure speedy inference for real-life use cases as well as flexibility in choosing the right number of parameters through its different size variants. The general architecture of YOLOv5 is shown in Figure 6. YOLOv5 is available in 4 size variants, detailed in Table 2, most of which we explored in this study.

Figure 6: YOLOv5 architecture [14]

Table 2: Comparison of the different variants of YOLOv5 [15] on the COCO dataset.

Model     mAP val 0.5:0.95   mAP test 0.5:0.95   mAP val 0.5   Speed on V100 (ms)   Parameters (millions)
YOLOv5s   43.3               43.3                61.9          4.3                  12.7
YOLOv5m   50.5               50.5                68.7          8.4                  35.9
YOLOv5l   53.4               53.4                71.1          12.3                 77.2
YOLOv5x   54.4               54.4                72.0          22.4                 41.8

Before diving into the details of the individual submissions, let us go over the elements common to all runs. After the aforementioned pre-processing steps, all images were resized to 512 x 512 using OpenCV's linear interpolation method. The dataset split was chosen to be about 95% for train and 5% for validation.
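Resizing does not disturb the ground-truth boxes when labels are kept in the normalised format YOLOv5 expects, since the coordinates are stored relative to the image size. A sketch of converting a pixel-space box to that format (the helper name and numbers are illustrative, not the study's actual tooling):

```python
def to_yolo(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    # YOLO labels store centre coordinates and box sizes normalised to
    # [0, 1], so they survive any resize (e.g. to 512 x 512) unchanged.
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return cls_id, xc, yc, w, h

# A 200 x 200 px "button" box inside a hypothetical 1000 x 800 image:
print(to_yolo(3, 100, 200, 300, 400, 1000, 800))
# → (3, 0.2, 0.375, 0.2, 0.25)
```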
A batch size of 32 was used throughout, except for the extra-large YOLO variant, where a batch size of 16 was used. The models were trained on Google Colab in a GPU environment with a single NVIDIA Tesla K80 GPU.

3.2.1. Run 1: YOLOv5s Baseline
To establish a baseline for our runs, we trained a basic YOLOv5s architecture with no pre-trained weights for 100 epochs from scratch. The training metrics can be seen in Figure 7. We obtained a precision score of 0.876 and a recall score of 0.835 on our validation dataset: decent numbers for a small model with no starting weights. To explore how pre-trained weights would affect this, we incorporated them in the next run.

Figure 7: Train results for the YOLOv5s model with no pre-trained weights.

3.2.2. Run 2: YOLOv5s with pre-trained weights
This run used pre-trained weights on the same YOLOv5s architecture. These weights were originally generated by training the model on the COCO dataset [16], a very general dataset with 91 classes and 123,287 images. Having been trained on such a large gamut of images, the weights already encode a lot of information, so comparatively little training suffices for a decently performing model. We loaded the pre-trained weights and trained for 100 epochs with all layers unfrozen. The training metrics can be seen in Figure 8. We got a significant bump in our validation metrics, with a precision score of 0.951 and a recall score of 0.82.

Figure 8: Train results for the YOLOv5s model with pre-trained weights.

3.2.3. Run 3: YOLOv5l
Having tried the small variant of YOLOv5, we moved on to the large variant. This time we also employed a learning rate scheduler, using PyTorch's [17] implementation of ReduceLROnPlateau, and we implemented early stopping in this run.
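The interaction of these two settings can be sketched without any framework code. The toy controller below mirrors the semantics of ReduceLROnPlateau (cut the learning rate when the monitored metric stops improving) combined with early stopping; the class name and threshold values are illustrative, not the exact training configuration used:

```python
class PlateauControl:
    """Toy mirror of ReduceLROnPlateau + early stopping: if the monitored
    metric fails to improve for `patience` epochs, cut the learning rate
    by `factor`; stop training after `stop_patience` stale epochs."""
    def __init__(self, lr, factor=0.1, patience=3, stop_patience=10):
        self.lr, self.factor = lr, factor
        self.patience, self.stop_patience = patience, stop_patience
        self.best, self.stale = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.stale = val_loss, 0   # improvement: reset counter
        else:
            self.stale += 1
            if self.stale % self.patience == 0:
                self.lr *= self.factor            # plateau -> reduce LR
        return self.stale < self.stop_patience    # False -> early stop

ctl = PlateauControl(lr=0.01, patience=2, stop_patience=4)
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8]           # loss plateaus after epoch 2
keep = [ctl.step(l) for l in losses]
print(keep)   # → [True, True, True, True, True, False]
```

In the real runs the scheduler monitors the validation metric each epoch; the toy version shows why the two settings complement each other: the LR is reduced first, and training is abandoned only if the plateau persists.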
Both of these settings were carried forward to the rest of the runs as well.

We trained the large model with the pre-trained weights for 200 epochs. The training metrics can be seen in Figure 9. We obtained a precision score of 0.964 and a recall score of 0.944 on our validation dataset, an improvement over the small model that was expected given the larger number of parameters in the large variant.

Figure 9: Train results for the YOLOv5l model with pre-trained weights.

3.2.4. Run 4: YOLOv5x
To extract the most performance from the YOLO models, we next tried YOLOv5x, the largest variant. Because of the huge number of model parameters, we reduced the batch size to 16. We trained this model with pre-trained weights for 200 epochs and obtained very similar validation performance: a precision score of 0.961 and a recall score of 0.943. The training metrics can be seen in Figure 10. However, the difference was notable on the test set: the large model achieved an mAP of 0.81, while the extra-large model achieved 0.82. Even though the larger model brings diminishing returns in speed, speed was not a constraint here, so we went ahead with the extra-large model for further investigation.

Figure 10: Train results for the YOLOv5x model with pre-trained weights.

3.2.5. Run 5: YOLOv5x with frozen layers
Up until now, we had been loading the pre-trained weights into the model and re-training all the layers. In this run, we froze the early layers and trained only the head of the model. This should give much faster training, as the number of parameters to be updated is otherwise huge.
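With PyTorch, freezing the early layers amounts to turning off gradients for the matching parameters. A hedged sketch of the selection logic (the `model.<index>.` naming follows YOLOv5's module convention as an assumption; no model is actually loaded here):

```python
# Freeze modules whose names start with "model.0." through "model.9.",
# leaving only the head's parameters trainable. The trailing dot matters:
# it prevents the prefix "model.2." from accidentally matching "model.24.".
frozen_prefixes = tuple(f"model.{i}." for i in range(10))

def trainable(name: str) -> bool:
    return not name.startswith(frozen_prefixes)

# With a real model this would be applied as:
#   for name, p in model.named_parameters():
#       p.requires_grad = trainable(name)
names = ["model.0.conv.weight", "model.9.cv1.weight", "model.24.m.0.weight"]
print([trainable(n) for n in names])   # → [False, False, True]
```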
We trained only the head (with layers 0-9 of the model frozen) for 100 epochs. Training was much faster; however, there was a dip in performance. The training metrics can be seen in Figure 11. We obtained a precision score of 0.927 and a recall score of 0.863 on the validation dataset. Since this was a sizeable dip, we decided to proceed with the model from Run 4.

Figure 11: Train results for the YOLOv5x model with pre-trained weights and layers 0-9 frozen (only the head trained).

3.3. Post-Processing
After obtaining the final trained model in Run 4, we employed two post-processing methods: multi-pass inference [4] and confidence cutoff variation. We discuss both in detail in the following sections.

3.3.1. Multi-Pass Inference
This method is predominantly used to increase the recall score. It works by sending the image through the model multiple times, each time removing the objects detected earlier, and then combining all the outputs based on confidence scores and pass numbers. We applied this technique in our study, but it did not work well: our model was already performing very well on recall, detecting almost all the UI elements, so there was little to no scope for improvement. We therefore scrapped it and moved on to the next post-processing method.

3.3.2. Confidence cutoff variation
Another variable found to be important for the performance of our model was the confidence cutoff. This cutoff is a model hyperparameter responsible for selecting the bounding boxes that appear in the final results; boxes below the cutoff are discarded.

Table 3: Cutoff values tried, with the corresponding number of labels and test mAP score.
Confidence Cutoff (%)   Total Dataset Labels   % Increase in Labels   Test mAP
25                      101251                 0.0                    0.820
20                      109848                 8.5                    0.824
15                      119881                 9.1                    NA
10                      130765                 9.0                    0.829
5                       142963                 9.3                    0.832
1                       165013                 15.4                   0.836

We tried several cutoff values, each yielding a different total number of detected labels, and observed a marginal increase in model performance as the cutoff was lowered from 25% to 1%. The results are summarised in Table 3.

4. Results and Discussion
A total of 10 submissions were made. The predictions on the test set images were collated in a CSV file: for each test image, the bounding boxes corresponding to each instance of a detected class and the confidence scores were submitted. The mAP and recall scores obtained on the test dataset are shown in Table 4.

Table 4: Summary of the runs submitted for the challenge.

Run      ID       Model Description                                                mAP     Recall
Run 1    132552   YOLOv5s baseline                                                 0.649   0.675
Run 2    132567   YOLOv5s with pre-trained weights                                 0.649   0.675
Run 3    132575   YOLOv5l with pre-trained weights                                 0.810   0.826
Run 4    132583   YOLOv5x with pre-trained weights, LR scheduler, early stopping   0.820   0.840
Run 5    132592   YOLOv5x with pre-trained weights and only the head trained       0.701   0.731
Run 6    134090   Run 4 with 0.2 confidence cutoff                                 0.824   0.844
Run 7    134099   Run 4 with 0.15 confidence cutoff                                0.824   0.844
Run 8    134113   Run 4 with 0.1 confidence cutoff                                 0.829   0.852
Run 9    134133   Run 4 with 0.05 confidence cutoff                                0.832   0.858
Run 10   134133   Run 4 with 0.01 confidence cutoff                                0.836   0.865

It can be seen that YOLOv5x performed the best of all the YOLOv5 variants, and that the confidence cutoff used in post-processing is an important factor, as it contributed a further increase in model performance.

5. Conclusion
In this paper, we built a YOLOv5-based model to detect and localize UI elements in wireframes.
We also performed contrast normalization to improve the quality of the input images and introduced tuning of the confidence cut-off variable to improve the output performance of the model. This approach attained an mAP score of 0.836 on the test data, which consisted of wireframes containing a range of UI elements, and helped us gain 2nd position on the leaderboard of the wireframe task of ImageCLEFdrawnUI 2021 [1]. The approach could be integrated into a pipeline for automating the conversion of wireframes to front-end code, while ensuring speedy inference for real-life use cases. There is also scope for experimenting with an ensemble of two models: one for wireframes with more compactly placed UI elements and another for wireframes with less compactly placed UI elements. This would allow the confidence cutoff variable to be tuned correctly for each case, yielding a reasonable number of selected bounding boxes in both.

Acknowledgments
Thanks to the developers of Google Colab for providing a free GPU environment for model training.

References
[1] R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, B. Ionescu, Overview of ImageCLEFdrawnUI 2021: The detection and recognition of hand drawn and digital website UIs task, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.
[2] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J.
Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.
[3] G. Bradski, The OpenCV Library, Dr. Dobb's Journal of Software Tools (2000).
[4] P. Gupta, S. Mohapatra, HTML atomic UI elements extraction from hand-drawn website images using Mask-RCNN and novel multi-pass inference technique (2020).
[5] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988. doi:10.1109/ICCV.2017.322.
[6] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788. doi:10.1109/CVPR.2016.91.
[7] M. Tan, R. Pang, Q. V. Le, EfficientDet: Scalable and efficient object detection, 2020. arXiv:1911.09070.
[8] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: N. Navab, J. Hornegger, W. M. Wells, A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Springer International Publishing, Cham, 2015, pp. 234–241.
[9] A. Chaurasia, E. Culurciello, LinkNet: Exploiting encoder representations for efficient semantic segmentation, 2017 IEEE Visual Communications and Image Processing (VCIP) (2017). URL: http://dx.doi.org/10.1109/VCIP.2017.8305148. doi:10.1109/vcip.2017.8305148.
[10] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 936–944. doi:10.1109/CVPR.2017.106.
[11] H. Zhao, J. Shi, X. Qi, X. Wang, J.
Jia, Pyramid scene parsing network, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6230–6239. doi:10.1109/CVPR.2017.660.
[12] M. S. Durkee, R. Abraham, J. Ai, J. D. Fuhrman, M. R. Clark, M. L. Giger, Comparing Mask R-CNN and U-Net architectures for robust automatic segmentation of immune cells in immunofluorescence images of Lupus Nephritis biopsies, in: I. Georgakoudi, A. Tarnok (Eds.), Imaging, Manipulation, and Analysis of Biomolecules, Cells, and Tissues XIX, volume 11647, International Society for Optics and Photonics, SPIE, 2021, pp. 109–115. URL: https://doi.org/10.1117/12.2577785. doi:10.1117/12.2577785.
[13] T. T. P. Quoc, T. T. Linh, T. N. T. Minh, Comparing U-Net convolutional network with Mask R-CNN in agricultural area segmentation on satellite images, in: 2020 7th NAFOSTED Conference on Information and Computer Science (NICS), 2020, pp. 124–129. doi:10.1109/NICS51282.2020.9335856.
[14] R. Xu, H. Lin, K. Lu, L. Cao, Y. Liu, A forest fire detection system based on ensemble learning, Forests 12 (2021) 217. doi:10.3390/f12020217.
[15] G. Jocher, A. Stoken, J. Borovec, NanoCode012, A. Chaurasia, TaoXie, L. Changyu, A. V, Laughing, tkianai, yxNONG, A. Hogan, lorenzomammana, AlexWang1900, J. Hajek, L. Diaconu, Marc, Y. Kwon, oleg, wanghaoyang0106, Y. Defretin, A. Lohia, ml5ah, B. Milanko, B. Fineran, D. Khromov, D. Yiwei, Doug, Durgesh, F. Ingham, ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations, 2021. URL: https://doi.org/10.5281/zenodo.4679653. doi:10.5281/zenodo.4679653.
[16] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, P. Dollár, Microsoft COCO: Common objects in context, 2015. arXiv:1405.0312.
[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L.
Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.