<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">UI element detection from wireframe drawings of websites</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Prasang</forename><surname>Gupta</surname></persName>
							<email>prasang.gupta@pwc.com</email>
							<affiliation key="aff0">
								<orgName type="institution">PwC US Advisory</orgName>
								<address>
									<addrLine>BG House, Lake Boulevard Road, Hiranandani Gardens</addrLine>
									<settlement>Powai</settlement>
									<region>Mumbai</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Vishakha</forename><surname>Bansal</surname></persName>
							<email>vishakha.bansal@pwc.com</email>
							<affiliation key="aff0">
								<orgName type="institution">PwC US Advisory</orgName>
								<address>
									<addrLine>BG House, Lake Boulevard Road, Hiranandani Gardens</addrLine>
									<settlement>Powai</settlement>
									<region>Mumbai</region>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">UI element detection from wireframe drawings of websites</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9ADA2AEADDEE74012F4E63FA56074F9D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Website UI elements</term>
					<term>UI element extraction</term>
					<term>Image Processing</term>
					<term>OpenCV</term>
					<term>Object Detection</term>
					<term>YOLOv5</term>
					<term>Confidence Cutoff Variation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>User Interfaces (UIs) wireframe is a crucial part of designing front-end of websites and mobile applications. Detection of UI elements such as paragraphs, buttons, images etc. from the wireframes using advanced Artificial Intelligence (AI) algorithms pave the way to automate the process of conversion of wireframes to Hypertext Mark-up Language (HTML) code. In this paper, we have explored different variants of 5th generation of You Only Look Once (YOLOv5) algorithm and post-processing techniques involving tuning of confidence cut-off variable for detection of UI elements. Our final approach comprises of data pre-processing using contrast normalization and conversion to black and white (BW), detection and localization of UI elements using YOLOv5x variant followed by confidence cutoff for selecting final bounding boxes. This approach resulted in Mean Average Precision (mAP) of 0.836 on the test data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent times, building an online presence through websites and mobile applications has become a necessity for businesses to create global outreach and provide better customer service. Designing such applications is a time-consuming and iterative process. Wireframing serves as a starting point for this process. There are various tools that could be used for creating such wireframes and converting them automatically to code. However, the tools could be expensive and require learning for specific usage.</p><p>The wireframe task from ImageCLEFdrawnUI 2021 Task <ref type="bibr" target="#b0">[1]</ref> which is part of ImageCLEF 2021 <ref type="bibr" target="#b1">[2]</ref> is the second edition in this area and aims at reducing the dependency on the tools and automating the code conversion process by using machine learning for detection and localization of UI elements in the wireframes. The dataset provided as part of this task has been enhanced from its previous edition in terms of volume and class distribution.</p><p>In this study, we focus on creating model-driven approach which is able to identify and localize the bounding boxes of all UI elements present in a wireframe. In the next section, we briefly describe the dataset used for training and validation of the models. In Section 3, we cover the methodology used âĂŞ data pre-processing, modelling and post-processing. In Section 4, we present the results from the final approach. The paper finishes with the conclusion and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Dataset</head><p>The dataset for the ImageCLEFdrawnUI 2021 competition <ref type="bibr" target="#b0">[1]</ref> included snapshots of hand drawn wireframe images of website layouts. These images included a total of 21 classes of atomic UI elements including images, paragraphs, headers, links etc. The provided dataset included 4291 of such images. These images were divided into a development and a test set. The development set contained 3218 labelled images while the test set contained 1073 unlabelled images. The development set was further divided into a train and a validation set. Out of the 3218 images in the development set, 3058 images were included in the train set and the rest 160 images formed the validation set (about 5% of the development set). This division was done keeping in mind that the distributions of the classes remain as close as possible. The exact distribution is shown in Table <ref type="table" target="#tab_0">1</ref>.</p><p>The images were all RGB images having a myriad of different sizes. The size distribution of the images can be seen in the histograms for height and width distribution in Figure <ref type="figure" target="#fig_0">1</ref>. All the images were later resized to a constant size of 512 x 512 for training purposes. The 21 classes present had different amount of representation in the dataset. Some classes which are commonly found in websites were abundantly present and dominated most of the images in the dataset as well. These classes are paragraph, button, link, image etc. However, some classes which are not as abundantly present in websites such as table, video, stepperinput, list etc. were present in comparatively lesser number in the dataset as well. The distribution of the classes among the dataset can be seen in Figure <ref type="figure" target="#fig_1">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data Pre-Processing</head><p>There are different pre-processing techniques that we have employed to improve the viability and performance of our model. As we are dealing with image data, most of our pre-processing steps use OpenCV. We have opted to use OpenCV <ref type="bibr" target="#b2">[3]</ref> on C++ because of the added speed it provides when dealing with large images. Some of the techniques we have used are described in the subsequent sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Contrast improvement</head><p>The images present in the dataset are made by taking snapshots of wireframe drawings of websites drawn by users. These drawings were either made on paper using pen or on a whiteboard using a marker. As the final images are snapshots, they very much depend on the quality of the camera used. Since most of the cameras introduce noise or a brightness overlay on the image, it was expected and was verified that different images had different tints, brightness and contrast. To counter this issue, there are histogram equalisation techniques which change the range of the pixels present in the image from a very confined space to a much larger distribution. This generally results in a much clearer image with better separation. This can be visualised in Figure <ref type="figure" target="#fig_2">3</ref>.</p><p>The histogram equalisation technique is powerful, but it also has a shortcoming. In addition to enhancing the contrast of the image, it also enhances the noise. Since our dataset has a lot of noise due to the nature of how they are collected, we chose to employ the Contrast Limited Adaptive Histogram Equalisation(CLAHE) technique from the OpenCV library which overcomes this shortcoming.  above a specified contrast limit (we have used the default value 40 in this study) and then uniformly distributing the clipped pixels to the other bins. This can be visualised in Figure <ref type="figure" target="#fig_3">4</ref>. Using this technique provided us with much more cleaner images for our dataset. The effect of this technique can be visualised on a sample image from the dataset in Figure <ref type="figure" target="#fig_4">5</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">Conversion to Black and White</head><p>After getting the contrast normalised images, the next step was to convert them into black and white. This would remove any noise present in the image and focus the model to identify only the things that matter i.e. the wireframe drawings.</p><p>There were a lot of iterations performed before by Gupta et. al. <ref type="bibr" target="#b3">[4]</ref> to get the most effective way to convert wireframe grayscale images to black and white. We have directly used the techniques discussed in that paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Modelling</head><p>This problem is an object detection problem at its heart. Hence, a lot of object detection models like Mask-RCNN <ref type="bibr" target="#b4">[5]</ref>, YOLO <ref type="bibr" target="#b5">[6]</ref> and EfficientDet <ref type="bibr" target="#b6">[7]</ref> come to mind. This problem can also be modelled as a segmentation problem. Hence, this adds other famous and proven models like U-Net <ref type="bibr" target="#b7">[8]</ref>, LinkNet <ref type="bibr" target="#b8">[9]</ref>, FPN <ref type="bibr" target="#b9">[10]</ref> and PSPNet <ref type="bibr" target="#b10">[11]</ref> to the list of possible modelling techniques.</p><p>For the purpose of this problem, we started with exploring U-Net and Mask-RCNN. The U-Net model is proven to work well on object detection problems, but considering that the size of the dataset is not huge as compared to deep learning (DL) standards, training a full U-Net would firstly, take a lot of time and secondly, would run into overfitting issues due to the sheer number of parameters involved.</p><p>Mask-RCNN was another great option to go ahead with. It is much easier to train and has been known to perform at par, if not better than U-Net on different use cases <ref type="bibr" target="#b11">[12]</ref>  <ref type="bibr" target="#b12">[13]</ref>. However, as it has already been explored and found to run into problems in detecting smaller UI elements, it was discarded <ref type="bibr" target="#b3">[4]</ref>.  We chose to go with the latest version of YOLOv5 <ref type="bibr" target="#b14">[15]</ref> to ensure speedy inference for real-life use cases as well as flexibility in choosing the right number of parameters based on its different flavours. The general architecture of YOLOv5 is shown in Figure <ref type="figure" target="#fig_5">6</ref>. Also, YOLOv5 is available in 4 different size variants, the details of which are present in Table <ref type="table" target="#tab_1">2</ref> most of which has been explored in this study.</p><p>Before diving into the details of all the different submissions made, let us go over the common elements of all the runs. The images were all resized to size 512 x 512 using OpenCV's linear interpolation method after performing the aforementioned pre-processing steps. Also, the dataset split was chosen to be about 95% for train and 5% for validation. A batch size of 32 was chosen throughout to train except for the extra large YOLO variants where a batch size of 16 was used. The models were trained on Google Colab using a GPU environment with a single NVIDIA Tesla K80 GPU. To establish a baseline for our runs, we used a basic YOLOv5s architecture with no pre-trained weights and trained the model for 100 epochs from scratch. The training metrics for the model can be seen in Figure <ref type="figure" target="#fig_6">7</ref>. We got a precision score of 0.876 and a recall score of 0.835 on our validation dataset. These numbers were decent for a small model with no starting weights. To explore further on how pre-trained weights would affect this, we incorporated that in the next run.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Run 1 : YOLOv5s Baseline</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Run 2 : YOLOv5s with pre-trained weights</head><p>This run included pre-trained weights on the same YOLOv5s architecture, These pre-trained weights were originally generated by training the model on COCO dataset <ref type="bibr" target="#b15">[16]</ref>. COCO dataset is very general in nature with 91 classes and contains 123,287 images. Having been trained on such a large gamut of images, these weights have a lot of information already encoded in them. Hence, very little training suffices for a decent performing model.</p><p>We loaded the model with the pre-trained weights and trained it for 100 epochs keeping all the layers unfrozen. The training metrics for the model can be seen in Figure <ref type="figure" target="#fig_7">8</ref>. We got a significant bump in our validation dataset metrics with a precision score of 0.951 and a recall score of 0.82 on the same. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3.">Run 3 : YOLOv5l</head><p>Having tried out the small variant of YOLOv5, we moved on to the large variant. However, this time we also employed a learning rate scheduler. We used pytorch's <ref type="bibr" target="#b16">[17]</ref> implementation of ReduceLROnPlateau. We also implemented early stopping in this run. Both of these settings were carried forward to the rest of the runs as well. We trained the large model with the pre-trained weights for 200 epochs. The training metrics for the model can be seen in Figure <ref type="figure" target="#fig_8">9</ref>. We got a precision score of 0.964 and a recall score of 0.944 on our validation dataset. This is an improvement over the small model which was expected because of the larger parameters present in the large variant. To extract the highest amount of performance from the YOLO models, we next tried the YOLOv5x variant which is the largest variant present. Because of the huge size of the model parameters, we reduced our batch size to 16. We trained this model with pre-trained weights for 200 epochs and got very similar performance on our validation dataset. The training metrics for the model can be seen in Figure <ref type="figure" target="#fig_9">10</ref>. We got a precision score of 0.961 and a recall score of 0.943 on the same.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.4.">Run 4 : YOLOv5x</head><p>However, the difference was notable in the test performance of the extra large and the large model. The large model got an mAP value of 0.81 while the extra large model got an mAP score of 0.82 on the test set. Even though there are diminishing returns with respect to the speed of the model, as speed was not an issue, we decided to go ahead with the extra large model for further investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.5.">Run 5 : YOLOv5x with frozen layers</head><p>Up until now, we were loading the pre-trained weights into the model, and re-training all the layers. In this run, we decided to freeze the early layers and train only the head of the model. This would ideally ensure much faster training time as the number of parameters to be updated currently are huge. We trained only the head (with 0-9 layers frozen of the model) for 100 epochs. The training was much faster, however their was a dip in performance. The training metrics for the model can be seen in Figure <ref type="figure" target="#fig_10">11</ref>. We got a precision score of 0.927 and a recall score of 0.863 on the validation dataset. Since this was a sizeable dip, we planned to go ahead with the model we got in Run 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Post-Processing</head><p>After getting the final trained model in Run 4, their were several post-processing methods that were employed. The first method that was employed was the multi-pass inference <ref type="bibr" target="#b3">[4]</ref> and the second method was model confidence variation. We will discuss both of these methods in detail in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">Multi-Pass Inference</head><p>This method is predominantly used for increasing the recall score. This technique works by sending the image through the model multiple times, each time removing the objects that were detected earlier and then smartly appending all the outputs together based on different confidence scores and pass numbers.</p><p>We employed this technique to our study, but this did not work well as we found that our model was performing very well in the recall department detecting almost all the UI elements. Hence, there was little to no scope for improvement using this technique. Hence, we scrapped this and went on to our next post processing method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2.">Confidence cutoff variation</head><p>Another variable that was found to be important for the performance of our model was the confidence cutoff variable. This cutoff is a hyperparameter for the model. and is responsible for selecting all the bounding boxes that appear in the final results of the model while the others are discarded. There were several different cutoff values we tried with different number of total labels detected in the model. We observed a marginal increase in the performance of the model. The cutoff variable was changed from 25% to 1%. The summary of the results is shown in Table <ref type="table" target="#tab_2">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head><p>A total of 10 submissions were made. The predictions on the test set images were collated in a csv file. For each image on the test set, the bounding boxes corresponding to each instance of a detected class and the confidence scores were submitted. The mAP and recall scores obtained on the test dataset are shown in Table <ref type="table" target="#tab_3">4</ref>. It has be seen that YOLOv5x performed best of all the YOLOv5 variants and confidence cutoff variable used for post processing is an important factor as it contributed to an increase in the performance of the model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this paper, we have built YOLOv5 based model to detect and localize UI elements in wireframes. We also performed contrast normalization for improvement in the quality of the input images to the model and introduced tuning of confidence cut-off variable for improving the output performance of the model. An mAP score north of 0.8 was attained using This approach on the test data which consisted of wireframes containing a range of UI elements helping us gain 2nd position on the leaderboard of wireframe task of ImageCLEFdrawnUI 2021 <ref type="bibr" target="#b0">[1]</ref>. This approach could be integrated further into the pipeline of automating the conversion to front-end code and ensure speedy inference for real-life use cases. There is also a scope of experimenting with the ensemble of two modelling approaches: one for wireframes with more compactly placed UI elements and another for wireframes with less compactly placed UI elements. This would ensure that confidence cutoff variable is correctly tuned and would result in getting the reasonable number of selected bounding boxes for these two different cases.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Distribution of heights and widths of the images in the development set with 20 bins.</figDesc><graphic coords="3,108.88,84.19,187.51,125.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of the classes within the development dataset. The plot on the left shows the total number of occurrences of a class in the dataset. The plot on the right shows the spread of the classes defined as the total number of unique images which have atleast 1 occurence of that class. Both of these plots show that some classes are abundantly present while some are a little less represented.</figDesc><graphic coords="3,108.88,301.60,187.51,125.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Histogram equalisation technique changing the narrow histogram of the pixels in the input image to a much more wide distribution.</figDesc><graphic coords="4,203.88,256.27,187.52,84.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Histogram equalisation after contrast limiting (CLAHE)</figDesc><graphic coords="4,203.88,532.55,187.51,68.17" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: A depiction of the effect of CLAHE on images. The image on the left is the original image and the one on the right is after applying CLAHE. It can be observed that the processed image is much more clear and legible than the original.</figDesc><graphic coords="5,108.88,84.19,187.50,192.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: YOLOv5 architecture [14]</figDesc><graphic coords="6,130.96,153.69,333.37,248.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Train results for the YOLOv5s model with no pre-trained weights.</figDesc><graphic coords="7,89.29,183.75,416.69,208.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Train results for the YOLOv5s model with pre-trained weights.</figDesc><graphic coords="8,89.29,84.19,416.69,208.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Train results for the YOLOv5l model with pre-trained weights.</figDesc><graphic coords="8,89.29,424.18,416.69,208.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 10 :</head><label>10</label><figDesc>Figure 10: Train results for the YOLOv5x model with pre-trained weights.</figDesc><graphic coords="9,89.29,183.75,416.69,208.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: Train results for the YOLOv5x model with pre-trained weights and layers 0-9 frozen (Trained only on the head layer)</figDesc><graphic coords="10,89.29,84.19,416.69,208.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Class distribution in the train and validation sets.</figDesc><table><row><cell>Label</cell><cell cols="4">Train Freq Val Freq Train Dist Val Dist</cell></row><row><cell>paragraph</cell><cell>2727</cell><cell>141</cell><cell>0.89</cell><cell>0.88</cell></row><row><cell>label</cell><cell>1004</cell><cell>42</cell><cell>0.33</cell><cell>0.26</cell></row><row><cell>header</cell><cell>2059</cell><cell>121</cell><cell>0.67</cell><cell>0.76</cell></row><row><cell>button</cell><cell>2991</cell><cell>153</cell><cell>0.98</cell><cell>0.96</cell></row><row><cell>image</cell><cell>2462</cell><cell>121</cell><cell>0.81</cell><cell>0.76</cell></row><row><cell>linebreak</cell><cell>2370</cell><cell>128</cell><cell>0.78</cell><cell>0.8</cell></row><row><cell>container</cell><cell>2923</cell><cell>153</cell><cell>0.96</cell><cell>0.96</cell></row><row><cell>link</cell><cell>1031</cell><cell>56</cell><cell>0.34</cell><cell>0.35</cell></row><row><cell>textinput</cell><cell>1453</cell><cell>69</cell><cell>0.48</cell><cell>0.43</cell></row><row><cell>dropdown</cell><cell>688</cell><cell>34</cell><cell>0.22</cell><cell>0.21</cell></row><row><cell>checkbox</cell><cell>663</cell><cell>25</cell><cell>0.22</cell><cell>0.16</cell></row><row><cell>radiobutton</cell><cell>478</cell><cell>14</cell><cell>0.16</cell><cell>0.09</cell></row><row><cell>rating</cell><cell>434</cell><cell>11</cell><cell>0.14</cell><cell>0.07</cell></row><row><cell>toggle</cell><cell>452</cell><cell>13</cell><cell>0.15</cell><cell>0.08</cell></row><row><cell>textarea</cell><cell>418</cell><cell>9</cell><cell>0.14</cell><cell>0.06</cell></row><row><cell>datepicker</cell><cell>468</cell><cell>12</cell><cell>0.15</cell><cell>0.08</cell></row><row><cell>stepperinput</cell><cell>91</cell><cell>3</cell><cell>0.03</cell><cell>0.02</cell></row><row><cell>slider</cell><cell>491</cell><cell>16</cell><cell>0.16</cell><cell>0.1</cell></row><row><cell>video</cell><cell>448</cell><cell>22</cell><cell>0.15</cell><cell>0.14</cell></row><row><cell>table</cell><cell>56</cell><cell>1</cell><cell>0.02</cell><cell>0.01</cell></row><row><cell>list</cell><cell>180</cell><cell>6</cell><cell>0.06</cell><cell>0.04</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Comparison of different variants of YOLOv5 using COCO dataset<ref type="bibr" target="#b14">[15]</ref> </figDesc><table><row><cell>YOLOv5s</cell><cell>43.3</cell><cell>43.3</cell><cell>61.9</cell><cell>4.3</cell><cell>12.7</cell></row><row><cell>YOLOv5m</cell><cell>50.5</cell><cell>50.5</cell><cell>68.7</cell><cell>8.4</cell><cell>35.9</cell></row><row><cell>YOLOv5l</cell><cell>53.4</cell><cell>53.4</cell><cell>71.1</cell><cell>12.3</cell><cell>77.2</cell></row><row><cell>YOLOv5x</cell><cell>54.4</cell><cell>54.4</cell><cell>72.0</cell><cell>22.4</cell><cell>41.8</cell></row></table><note>Model mAP val 0.5:0.95 mAP Test 0.5:0.95 mAP val 0.5 Speed on V100 (ms) Parameters (millions)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Different cutoff values tried and corresponding number of labels and test mAP score.</figDesc><table><row><cell cols="4">Confidence Cutoff Total Dataset labels % increase in labels Test mAP scores</cell></row><row><cell>25</cell><cell>101251</cell><cell>0.0</cell><cell>0.820</cell></row><row><cell>20</cell><cell>109848</cell><cell>8.5</cell><cell>0.824</cell></row><row><cell>15</cell><cell>119881</cell><cell>9.1</cell><cell>NA</cell></row><row><cell>10</cell><cell>130765</cell><cell>9.0</cell><cell>0.829</cell></row><row><cell>5</cell><cell>142963</cell><cell>9.3</cell><cell>0.832</cell></row><row><cell>1</cell><cell>165013</cell><cell>15.4</cell><cell>0.836</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Table summarising the runs submitted for the challenge.</figDesc><table><row><cell></cell><cell>Run ID</cell><cell>Model Description</cell><cell>mAP Recall</cell></row><row><cell>Run 1</cell><cell>132552</cell><cell>YOLOv5s baseline</cell><cell>0.649 0.675</cell></row><row><cell>Run 2</cell><cell>132567</cell><cell>YOLOv5s with pre-trained weights</cell><cell>0.649 0.675</cell></row><row><cell>Run 3</cell><cell>132575</cell><cell>YOLOv5l with pre-trained weights</cell><cell>0.810 0.826</cell></row><row><cell>Run 4</cell><cell>132583</cell><cell>YOLOv5x with pre-trained weights , LR, Early Stopping</cell><cell>0.820 0.840</cell></row><row><cell>Run 5</cell><cell cols="3">132592 YOLOv5x with pre-trained weights and only heads trained 0.701 0.731</cell></row><row><cell>Run 6</cell><cell>134090</cell><cell>Run 4 with 0.2 confidence cutoff</cell><cell>0.824 0.844</cell></row><row><cell>Run 7</cell><cell>134099</cell><cell>Run 4 with 0.15 confidence cutoff</cell><cell>0.824 0.844</cell></row><row><cell>Run 8</cell><cell>134113</cell><cell>Run 4 with 0.1 confidence cutoff</cell><cell>0.829 0.852</cell></row><row><cell>Run 9</cell><cell>134133</cell><cell>Run 4 with 0.05 confidence cutoff</cell><cell>0.832 0.858</cell></row><row><cell cols="2">Run 10 134133</cell><cell>Run 4 with 0.01 confidence cutoff</cell><cell>0.836 0.865</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Thanks to the developers of Google Colab for providing a free GPU environment for model training.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEFdrawnUI 2021: The detection and recognition of hand drawn and digital website uis task</title>
		<author>
			<persName><forename type="first">R</forename><surname>Berari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tauteanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fichou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Brie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dogariu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">D</forename><surname>Ştefan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<ptr target="WS.org&lt;http://ceur-ws.org&gt;" />
	</analytic>
	<monogr>
		<title level="m">CLEF2021 Working Notes, CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Bucharest, Romania</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Péteri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sarrouti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Demner-Fushman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kozlovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Liauchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dicente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jacutprakart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Berari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tauteanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fichou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Brie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dogariu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">D</forename><surname>Ştefan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chamberlain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Oliver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Moustahfid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Popescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deshayes-Chossart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021)</title>
		<title level="s">LNCS Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Bucharest, Romania</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The OpenCV Library</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bradski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Dr. Dobb&apos;s Journal of Software Tools</title>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">and novel multi-pass inference technique</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mohapatra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Html atomic ui elements extraction from hand-drawn website images using mask-rcnn</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Mask r-cnn</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gkioxari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollãąr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCV.2017.322</idno>
	</analytic>
	<monogr>
		<title level="m">2017 IEEE International Conference on Computer Vision (ICCV)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2980" to="2988" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">You only look once: Unified, real-time object detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Divvala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2016.91</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="779" to="788" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Efficientdet: Scalable and efficient object detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.09070</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">U-net: Convolutional networks for biomedical image segmentation</title>
		<author>
			<persName><forename type="first">O</forename><surname>Ronneberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brox</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Medical Image Computing and Computer-Assisted Intervention -MICCAI 2015</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Navab</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Hornegger</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><forename type="middle">M</forename><surname>Wells</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Frangi</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="234" to="241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Linknet: Exploiting encoder representations for efficient semantic segmentation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chaurasia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Culurciello</surname></persName>
		</author>
		<idno type="DOI">10.1109/vcip.2017.8305148</idno>
		<ptr target="http://dx.doi.org/10.1109/VCIP.2017.8305148.doi:10.1109/vcip.2017.8305148" />
	</analytic>
	<monogr>
		<title level="m">IEEE Visual Communications and Image Processing</title>
				<imprint>
			<publisher>VCIP</publisher>
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Feature pyramid networks for object detection</title>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollãąr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hariharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.106</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="936" to="944" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Pyramid scene parsing network</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jia</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.660</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="6230" to="6239" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Comparing Mask R-CNN and U-Net architectures for robust automatic segmentation of immune cells in immunofluorescence images of Lupus Nephritis biopsies</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Durkee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Abraham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Fuhrman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Giger</surname></persName>
		</author>
		<idno type="DOI">10.1117/12.2577785</idno>
		<ptr target="https://doi.org/10.1117/12.2577785.doi:10.1117/12.2577785" />
	</analytic>
	<monogr>
		<title level="m">Imaging, Manipulation, and Analysis of Biomolecules, Cells, and Tissues XIX</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Georgakoudi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tarnok</surname></persName>
		</editor>
		<imprint>
			<publisher>SPIE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">11647</biblScope>
			<biblScope unit="page" from="109" to="115" />
		</imprint>
	</monogr>
	<note>International Society for Optics and Photonics</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Comparing u-net convolutional network with mask r-cnn in agricultural area segmentation on satellite images</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T P</forename><surname>Quoc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T</forename><surname>Linh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N T</forename><surname>Minh</surname></persName>
		</author>
		<idno type="DOI">10.1109/NICS51282.2020.9335856</idno>
	</analytic>
	<monogr>
		<title level="m">2020 7th NAFOSTED Conference on Information and Computer Science (NICS)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="124" to="129" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A forest fire detection system based on ensemble learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.3390/f12020217</idno>
	</analytic>
	<monogr>
		<title level="j">Forests</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page">217</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Jocher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stoken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Borovec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nanocode012</surname></persName>
		</author>
		<author>
			<persName><surname>Chaurasia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Taoxie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Changyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Laughing</surname></persName>
		</author>
		<author>
			<persName><surname>Hogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alexwang1900</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hajek</surname></persName>
		</author>
		<author>
			<persName><surname>Diaconu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Marc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kwon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Defretin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lohia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Milanko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fineran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Khromov</surname></persName>
		</author>
		<author>
			<persName><surname>Yiwei</surname></persName>
		</author>
		<author>
			<persName><surname>Doug</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Durgesh</surname></persName>
		</author>
		<author>
			<persName><surname>Ingham</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.4679653</idno>
		<ptr target="https://doi.org/10.5281/zenodo.4679653.doi:10.5281/zenodo.4679653" />
		<title level="m">ultralytics/yolov5: v5.0 -YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bourdev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollãąr</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1405.0312</idno>
		<title level="m">Microsoft coco: Common objects in context</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Pytorch: An imperative style, high-performance deep learning library</title>
		<author>
			<persName><forename type="first">A</forename><surname>Paszke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Massa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lerer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bradbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Killeen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gimelshein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Antiga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Desmaison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Devito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Raison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tejani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chilamkurthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Steiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chintala</surname></persName>
		</author>
		<ptr target="http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 32</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Alché-Buc</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Fox</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8024" to="8035" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
