Deep Learning for UI Element Detection: DrawnUI 2020

Naveen Narayanan*, Nitin Nikamanth Appiah Balaji*, and Kavya Jaganathan
Sri Sivasubramaniya Nadar College Of Engineering
{naveen17097, nitinnikamanth17099, kavya17074}@cse.ssn.edu.in

Abstract. As the field of software development grows rapidly, steady progress in the application development sector is needed to keep pace. Constantly tweaking the front end of an application is cumbersome for developers. Detection tools that help design UIs from hand-drawn sketches can help both developers and non-developers build applications with ease. Recent advances in deep learning techniques and the availability of computational power make it feasible to train efficient models for object detection. Two such models, YOLOv4 and Cascade RCNN, are implemented for detecting UI elements from hand-drawn sketches and are detailed in this paper. These models achieve an average mAP@IoU 0.5 improvement of 31.85% over the baseline Faster RCNN.

Keywords: DrawnUI · YOLOv4 · Cascade RCNN · Object Detection.

1 Introduction

With the software development industry's recent need for tools that increase the efficiency of application development, aids such as recommendation systems and generative models help reduce the time taken to build quality systems. In iterative software engineering setups, it is cumbersome for UI designers and front-end developers to make changes and fix issues. An interactive tool that lets designers generate previews, or even a completely finished front-end application, therefore becomes convenient.

As computer vision and deep learning have attained maturity, various network architectures have been built for numerous downstream applications such as object detection, instance segmentation, semantic segmentation and panoptic segmentation. The rise of CNNs and their huge success in image classification tasks have encouraged further research on CNN-based models for object detection.

* These authors contributed equally
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

In this paper we discuss various object detection models and compare their performance for UI element detection. The rest of the paper is divided into five sections. Section 2 traces the evolution of object detection models and the motivation behind the chosen networks. Section 3 gives an overview of the dataset used. Section 4 explains the chosen architectures, Section 5 compares the performance of the proposed methods, and Section 6 draws the conclusions.

2 Background

Object detection using deep neural networks (DNNs) has been an active area of research for over a decade. R-CNN, proposed by Ross Girshick et al. [5], had a two-stage architecture combining a proposal detector and a region-wise classifier, which became predominant in the recent past due to its success. Subsequently, SPP-net, proposed by Kaiming He et al. [6], improved over R-CNN in estimation accuracy and detection efficiency at test time, as analysed by [13]. Regardless of the region proposal generation used in these networks, the training of all network layers can be processed in a single stage with a multi-task loss.
This was implemented in Fast RCNN [13][4], which saved the additional expense of storage space and improved both accuracy and efficiency with more reasonable training schemes. It was later found that Faster RCNN [11] could overcome the region proposal computational cost by introducing an RPN module, becoming a significant improvement over Fast RCNN. Cascade RCNN [2], a multi-stage extension of Faster RCNN, achieves high-quality object detection by effectively rejecting close false positives. This is achieved by combining cascaded bounding box regression and cascaded detection, which simultaneously improves both the quality of the hypotheses and the detector.

YOLO, developed by Joseph Redmon et al. [8], took a different approach from R-CNN: a single network predicts both the bounding boxes and the class probabilities for those boxes. YOLOv1 divided the input image into an S x S grid of cells, with a fixed number of bounding boxes predicted in each cell. Bounding boxes are then kept if their class probabilities exceed a particular threshold value. However, YOLOv1 had difficulty detecting small objects that appear in groups and objects with unusual aspect ratios. To combat these issues, YOLOv2, as described by [9], used batch normalisation, a high-resolution classifier, anchor boxes and dimension clusters to improve performance. It also bounds the predicted location using a logistic activation, overcoming the instability of YOLOv1 in early iterations. YOLOv2 also uses Darknet-19, obtaining a good balance between accuracy and model complexity. YOLOv3 added a few incremental improvements over YOLOv2: a better feature extractor, Darknet-53 with shortcut connections, as well as a better object detector with feature map upsampling and concatenation, as mentioned in [10]. The latest YOLOv4 has built-in data augmentation for more robust training as part of its 'Bag of Freebies', and it improves the accuracy of object detection using its 'Bag of Specials' applied to the backbone and detector, as explained in [1].

In our work, we therefore implement Cascade RCNN and YOLOv4 for detecting UI elements because of their significant advantages and state-of-the-art performance.

3 Dataset Description

The dataset provided by [3][7] consists of hand-drawn images of internet websites and mobile application interfaces from 1000 different templates. The dataset contains 21 classes, ranging from small elements such as check boxes and buttons to large elements such as images and containers. To increase the number of sample images and to rectify the class imbalance, data augmentation techniques such as random scaling and flipping were applied (a minimal sketch of this augmentation follows Table 1).

Table 1: Development and testing distributions.

  Data Type     Number of samples
  Development   2,363
  Testing       587
  Total         2,950
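To illustrate this step, the following is a minimal, dependency-light sketch of scaling-and-flipping augmentation, assuming axis-aligned boxes in (x_min, y_min, x_max, y_max) pixel coordinates; the function name random_scale_and_flip and the scale range are illustrative assumptions, not the exact pipeline used in our experiments.

    import numpy as np

    def random_scale_and_flip(image, boxes, rng, scale_range=(0.8, 1.2)):
        """Randomly rescale an image and horizontally flip it with
        probability 0.5, adjusting the bounding boxes to match."""
        h, w = image.shape[:2]
        # Draw a uniform scale factor; nearest-neighbour resampling via
        # integer indexing keeps the sketch free of external libraries.
        s = rng.uniform(*scale_range)
        new_h, new_w = int(round(h * s)), int(round(w * s))
        rows = np.clip((np.arange(new_h) / s).astype(int), 0, h - 1)
        cols = np.clip((np.arange(new_w) / s).astype(int), 0, w - 1)
        image = image[rows][:, cols]
        boxes = boxes * s                   # box coordinates scale with the image
        if rng.random() < 0.5:
            image = image[:, ::-1]          # horizontal flip
            boxes[:, [0, 2]] = new_w - boxes[:, [2, 0]]  # mirror x coordinates
        return image, boxes

    rng = np.random.default_rng(0)
    img = np.zeros((480, 640, 3), dtype=np.uint8)
    box = np.array([[100.0, 50.0, 200.0, 150.0]])
    aug_img, aug_boxes = random_scale_and_flip(img, box, rng)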
4 Methodology

The two architectures employed are YOLOv4 and Cascade RCNN with a ResNeSt backbone.

4.1 Cascade RCNN

Cascade RCNN [2] is a multi-stage extension of the two-stage Faster RCNN architecture. The first stage of Faster RCNN is a proposal sub-network, in which the entire image is processed by a backbone network, after which preliminary detection hypotheses, also called object proposals, are produced by applying a proposal head ("H0"). The second stage processes the hypotheses with a region-of-interest detection sub-network ("H1"), denoted as a detection head. Every hypothesis is then assigned a classification score ("C") and a bounding box ("B"). This architecture also combines cascaded bounding box regression and cascaded detection, which simultaneously increases both the quality of the hypotheses and the detector.

The backbone used is ResNeSt rather than the traditional ResNet. This is because, according to [12], architectures originally designed for image classification, like ResNet, might not be suitable for downstream applications like object detection owing to their limited receptive-field size and lack of cross-channel interaction. ResNeSt is made of multiple Split-Attention blocks stacked in ResNet style. Each Split-Attention block divides the feature maps into several groups and finer-grained subgroups, or splits, where the feature representation of each group is determined via a weighted combination of the representations of its splits. A Feature Pyramid Network (FPN) is used for feature extraction (referred to as "pool" in Fig. 1), while a Region Proposal Network (RPN) is used as the proposal head.

[Fig. 1: The architectures of the different frameworks: (a) Faster RCNN, (b) Cascade RCNN. "I" is the input image, "conv" the backbone convolutions, "pool" region-wise feature extraction, "H" a network head, "B" a bounding box, and "C" a classification. "B0" denotes proposals in all architectures.]

4.2 YOLO

Unlike algorithms that re-purpose classifiers for object detection, YOLO takes a different approach. Instead of repeated forward passes, as in sliding window or region proposal models, the image is divided into an S x S grid, and a single convolutional network pass outputs the results for all S x S grid cells. The output of the network has shape S x S x (B * 5 + C), where B is the number of anchor boxes and C is the number of object classes [8]. The network generates B possible detections for each grid cell, so instead of separately regressing for region proposals, YOLO treats detection as an end-to-end regression task. A minimal decoding sketch of this grid output is given at the end of this subsection.

The YOLOv4 architecture, as described in [1], consists of a CSPDarknet53 backbone, an SPP additional module, a PANet path-aggregation neck, and a YOLOv3 head. YOLOv4 also uses its 'Bag of Freebies' and 'Bag of Specials' techniques to introduce data augmentation, the Mish activation, cross-stage partial connections and multi-input weighted residual connections in the network.

With YOLOv4, augmentation techniques such as CutMix, Mosaic, class label smoothing and self-adversarial training were trialled to improve performance. In CutMix, parts of an image are replaced by another image, with annotations taken from both image parts. Similarly, in Mosaic augmentation, four different images are combined in various ratios, which improves the recognition of objects at different scales. To reduce model overfitting on the classification output, the confidence scores are intentionally softened by label smoothing. Finally, in Self-Adversarial Training (SAT), images are passed through the forward step normally, but instead of back-propagating, the loss is used to distort the image in a way that harms the model; these perturbed images are then used to train the model in the usual fashion. With these additional improvements, YOLOv4 performs more accurately than its earlier versions.
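To make the S x S x (B * 5 + C) output concrete, here is a minimal decoding sketch in the YOLOv1 style outlined above, under the simplifying assumption that offsets and probabilities are already in [0, 1]; decode_yolo_grid and the threshold are illustrative, and the actual YOLOv4 head additionally applies anchor-box transforms, logistic activations and non-maximum suppression.

    import numpy as np

    def decode_yolo_grid(output, num_boxes, num_classes, conf_thresh=0.5):
        """Decode an S x S x (B*5 + C) grid: each cell predicts B boxes of
        (x, y, w, h, objectness) followed by C shared class probabilities."""
        S = output.shape[0]
        detections = []
        for row in range(S):
            for col in range(S):
                cell = output[row, col]
                class_probs = cell[num_boxes * 5:num_boxes * 5 + num_classes]
                cls = int(np.argmax(class_probs))
                for b in range(num_boxes):
                    x, y, w, h, obj = cell[b * 5:(b + 1) * 5]
                    score = obj * class_probs[cls]
                    if score >= conf_thresh:
                        # (x, y) are offsets within the cell; convert to
                        # image-relative centre coordinates in [0, 1].
                        cx, cy = (col + x) / S, (row + y) / S
                        detections.append((cx, cy, w, h, cls, float(score)))
        return detections

    S, B, C = 7, 2, 21                       # e.g. the 21 UI element classes
    out = np.random.rand(S, S, B * 5 + C)    # stand-in for a network output
    dets = decode_yolo_grid(out, B, C)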
5 Results

In the ImageCLEF DrawnUI 2020 challenge, our team CudaMemError1 achieved a top submission rank of 2, with an Overall Precision of 0.9504. Two additional metrics, mAP@IoU 0.5 and recall@IoU 0.5, were used to evaluate the models further.

Table 2: Scores on test images.

  Run ID  Model                  Overall precision  mAP@IoU 0.5  recall@IoU 0.5
  67413   baseline Faster RCNN   0.94789            57.2         40.3
  67833   Cascade RCNN           0.95035            68.16        53.3
  67710   Cascade RCNN           0.94909            64.92        50.5
  67722   Cascade RCNN           0.93463            72.33        58.5
  67829   YOLOv4                 0.93300            73.82        55.6
  67707   YOLOv4                 0.93125            79.24        59.4
  67831   YOLOv4                 0.92987            79.11        60.0
  67972   Cascade RCNN           0.95044            71.53        55.6
  67706   YOLOv4                 0.93437            79.36        59.8

  Run 67972: 10,000 iterations; run 67706: 7,000 iterations.

We can observe that Cascade RCNN (run 67972) scored better than YOLOv4 (run 67706) by 1.7% in terms of Overall Precision, but YOLOv4 outperformed Cascade RCNN by 10.9% and 7.5% in mAP@IoU 0.5 and recall@IoU 0.5 respectively. Compared with the baseline Faster RCNN model, YOLOv4 achieved a 38.7% improvement in mAP@IoU 0.5 and a 48.3% improvement in recall@IoU 0.5, while Cascade RCNN achieved a 25% improvement in mAP@IoU 0.5 and a 37.9% improvement in recall@IoU 0.5.
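All quoted improvements are relative percentages over the baseline run 67413. For example, the 38.7% mAP figure for YOLOv4 (run 67706) follows from

\[
\Delta_{\mathrm{rel}} = \frac{\mathrm{mAP}_{\mathrm{YOLOv4}} - \mathrm{mAP}_{\mathrm{baseline}}}{\mathrm{mAP}_{\mathrm{baseline}}} = \frac{79.36 - 57.2}{57.2} \approx 0.387,
\]

i.e. 38.7%, and the 31.85% average improvement quoted in the abstract is the mean of the YOLOv4 and Cascade RCNN mAP improvements, (38.7% + 25%) / 2 = 31.85%.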
6 Conclusion

Two architectures, YOLOv4 and Cascade RCNN, were implemented for detecting UI elements from hand-drawn sketches. YOLOv4 performed significantly better than both the baseline and Cascade RCNN in terms of mAP@IoU 0.5 and recall@IoU 0.5, with scores of 79.36 and 59.8 respectively, while Cascade RCNN achieved the best Overall Precision score of 0.9504.

Front-end development can be streamlined and simplified with such detection systems. Automatic code generation can further be added to this pipeline to make the whole process of front-end development simple and convenient. These detection systems will be an asset for developers who want to keep up with the ever-growing software engineering world, and for non-developers who want to enter the digital space without learning how to code.

References

1. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: Optimal speed and accuracy of object detection (2020)
2. Cai, Z., Vasconcelos, N.: Cascade R-CNN: High quality object detection and instance segmentation (2019)
3. Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G., Ionescu, B.: Overview of ImageCLEFdrawnUI 2020: The detection and recognition of hand drawn website UIs task. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
4. Girshick, R.: Fast R-CNN (2015)
5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2013)
6. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition (2014)
7. Ionescu, B., Müller, H., Péteri, R., Abacha, A.B., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), vol. 12260, LNCS Lecture Notes in Computer Science, Springer, Thessaloniki, Greece (September 22-25 2020)
8. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection (2015)
9. Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger (2016)
10. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement (2018)
11. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91-99 (2015)
12. Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Mueller, J., Manmatha, R., Li, M., Smola, A.: ResNeSt: Split-attention networks (2020)
13. Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X.: Object detection with deep learning: A review (2018)