Assessing Completeness in Training Data for Image-Based Analysis of Web User Interfaces

Sebastian Heil* (Technische Universität Chemnitz, Chemnitz, Germany, sebastian.heil@informatik.tu-chemnitz.de, ORCID 0000-0003-2761-9009)
Maxim Bakaev* (Novosibirsk State Technical University, Novosibirsk, Russia, bakaev@corp.nstu.ru, ORCID 0000-0002-1889-0692)
Martin Gaedke (Technische Universität Chemnitz, Chemnitz, Germany, martin.gaedke@informatik.tu-chemnitz.de, ORCID 0000-0002-6729-2912)
* Both authors contributed equally to the work.

Abstract

Analysis of user interfaces (UIs) based on their visual representation (screenshots) is gaining increasing popularity, institutionalizing the HCI vision field. Witnessing the same visual appearance of a UI as a human user provides the advantage of taking into account layouts, whitespace, graphical content, etc., independent of the concrete platform and framework used. However, visual analysis requires significant amounts of training data, particularly for the classifiers that identify UI elements and their types. In our paper we demonstrate how data completeness can be assessed in training datasets produced by crowdworkers, without the need to duplicate the extensive work. In the experimental session, 11 annotators labeled more than 42000 UI elements in nearly 500 web UI screenshots using the LabelImg tool with a pre-defined set of classes corresponding to visually identifiable web page element types. We identify metrics that can be automatically extracted from UI screenshots and construct regression models predicting the expected number of labeled elements in a screenshot. The results can be used in outlier analysis of crowdworkers on any existing microtasking platform.

Keywords: Machine Learning, Crowdworking, Image Recognition, Human-Computer Vision

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: S. Hölldobler, A. Malikov (eds.): Proceedings of the YSIP-3 Workshop, Stavropol and Arkhyz, Russian Federation, 17-09-2019–20-09-2019, published at http://ceur-ws.org

1 Introduction

Though futurologists have long been fearing that the development of AI is going to take jobs away from humans, it has actually caused a boom in the demand for microworking services. These involve the use of general human intelligence to complete tasks for which no algorithm is known or satisfactorily efficient. The outcome is employed in solving some practical problem, providing an online service, or improving an AI model based on machine learning. Popular examples include labeling images, audio and video, moderating online content, sentiment analysis, translating short texts into other languages, etc. These tasks by and large involve unskilled and tedious work on data gathering and processing, so requestors of microworking services can rarely get enough motivated volunteers and tend to rely on low-paid microworkers [1]. For instance, there are already disturbing reports about involuntary microservitude: prisoners in Finland have been assigned data tagging jobs [2]. Aside from concerns about what an AI taught by such socially dubious teachers will be like, the consequence is the growing need for checking the quality of the outcome produced by such uninterested workers. Quality is also the current focus for crowdwork done via the Internet (see the review in [3]), and the number of specialized platforms has been growing lately: MTurk (2005), microworkers.com (2009), Yandex.Toloka (2014), Google's AutoML (2018), etc.
Controlling the completeness and accuracy of data used for training is particularly important, since these quality dimensions are linked to recall and precision in the resulting AI models. Today's trend is not just the implementation of output data quality assessment tools within the platforms themselves: e.g., Yandex.Toloka allows the specification of control rules for performance time, accuracy against the ground truth, majority consensus, etc. There is also a growing number of related meta-tools: CDAS [4], CrowdTruth [5], iCrowd [6], DOCS for MTurk [7], and more. Generally, they are concerned with online quality control and optimal assignment of tasks, and mostly rely on performance in completed work, which is evaluated based on ground truth and majority consensus approaches. Nearly universally, these require that the same or very similar work is done by several workers, so that an accuracy measure can be calculated. This work duplication is undesirable for some domains, where the tasks are labor-intensive and have no strictly correct outcome. Particularly, in our work we focus on user interface (UI) labeling: the specification of UI element positions and types in a UI's visual representation. We propose assessing the output data completeness based on the expected number of objects, which we predict for a UI from certain metrics that can be automatically calculated for the UI image. Most existing research, even in general image labeling, focuses on image complexity, of which the number of objects is only one dimension. Particularly, image compression metrics, such as the popular JPEG or PNG, are known to be well correlated with image complexity, but their application to UI assessment is specific and relatively novel (see [8]). The potential advantage of the approach is increasing the efficiency of producing training data by removing the necessity to perform redundant work to ensure data quality.

The remainder of the paper is organized as follows. In Section 2, we detail the UI image-based (visual) analysis approach and describe the related software tool that we previously built. Further, we describe the experimental UI labeling session with 11 workers who processed about 500 screenshots of university website homepages. In Section 3, we analyze the collected data, present the characteristics of the dataset, and construct regression models for predicting the expected number of UI elements from JPEG, PNG, and entropy metrics combined with edge detection-based recognition. In the final section, we discuss the results, provide conclusions and outline directions for further research. So far, the greatest limitation of our work is the lack of testing to see whether crowdworkers undermining data completeness can be identified in real conditions.

2 Methods

2.1 Web UI Visual Analysis

Image-based analysis of UIs is gaining in popularity, as it allows witnessing the same interface as the user, which is particularly important for web UIs. The drawback of this approach is that a considerable amount of training data is needed (particularly for the classifiers that identify UI elements and their types), which is mostly produced through human UI labeling. It is already widely noted that when data is annotated through crowdworking platforms, controlling its quality is of foremost importance [3]. Particularly for UI labeling tasks, data completeness can suffer if unfaithful crowdworkers optimize their task performance for a better revenue/effort ratio.
We believe that such outlier workers can be identified without resorting to duplicate labeling, on which the ground truth and majority consensus approaches are essentially founded. Objective characteristics of the material (that is, of the UIs being labeled) can provide meaningful clues about the regularity of a worker's performance.

Previously, we developed a prototype visual analyzer tool capable of extracting several metrics from UIs' visual representation [9]. It exhibited rather acceptable recognition of UI elements, which is mostly based on edge detection and identification of vertical and horizontal lines, rectangular forms, etc. (see Fig. 1). However, the performance of its trained classifiers responsible for detecting web UI element types was found to be inadequate. As we are developing the enhanced visual analyzer, we are concerned with efficient collection of training data via web UI labeling and with the assessment of its quality.

Figure 1: Recognition of UI elements in the visual analyzer tool.

2.2 UI Labeling and the Data Quality

Data completeness is an important attribute of overall data quality, which indicates the comprehensiveness of the available data with respect to a specific informational requirement. General crowdworking is arguably more often concerned with data accuracy, since tasks are rarely compound enough to be completed only partially. In UI labeling, completeness can be undermined if too few UI elements are identified, which would further lead to decreased recall of the automated visual analysis tool.

To assess data completeness in this domain, the ground truth and majority consensus approaches could well be used. That is, if a crowd worker repeatedly under-identifies UI elements, his or her output could be considered invalid, and no further tasks would be assigned. However, web UIs are currently very diverse and the full number of UI elements can vary dramatically. Processing a single UI takes considerable time, so each worker would only label a couple dozen of them, ruling out a statistically meaningful comparison of values averaged per worker over different UIs. It means that in order for the ground truth and majority consensus approaches to be effective, several workers would have to process the same UI, i.e. the extensive labeling effort would have to be duplicated, without contributing much to the training data.

Instead, the expected number of elements in a web UI could be predicted without the involvement of human workers, based on metrics extracted from its image (screenshot), as demonstrated e.g. in [9]. Subsequently, a kind of outlier analysis [3] could be employed to identify workers whose performance consistently does not correspond to the expected values. To explore whether the prediction-based approach holds true for a trusted dataset and to identify the significant metrics, we collected labeling data in an experimental session.
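To make the idea of image-derived metrics concrete, the following is a minimal sketch of how compression- and entropy-based metrics of the kind defined in Section 2.3.3 could be computed for a screenshot. It assumes the Pillow and NumPy libraries; the function name and the in-memory re-encoding are illustrative (in our study the stored PNG-24 and JPEG-100 file sizes were used, and the entropy value was obtained with MATLAB's entropy(I), which the histogram-based Shannon entropy below approximates).

```python
# Sketch: extracting simple image metrics from a UI screenshot.
# Assumes Pillow and NumPy; names and re-encoding settings are illustrative.
import io
import numpy as np
from PIL import Image

def screenshot_metrics(path):
    img = Image.open(path).convert("RGB")

    # File sizes (in MB) after re-encoding to PNG and to maximum-quality JPEG.
    sizes = {}
    for fmt, kwargs in (("PNG", {}), ("JPEG", {"quality": 100})):
        buf = io.BytesIO()
        img.save(buf, format=fmt, **kwargs)
        sizes[fmt] = buf.tell() / (1024 * 1024)

    # Shannon entropy of the 256-bin grayscale histogram,
    # analogous to MATLAB's entropy(I) for a grayscale image.
    gray = np.asarray(img.convert("L"))
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-(p * np.log2(p)).sum())

    return {"png_mb": sizes["PNG"], "jpeg_mb": sizes["JPEG"], "entropy": entropy}

if __name__ == "__main__":
    print(screenshot_metrics("homepage_screenshot.png"))
```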
2.3 The Experiment Description

2.3.1 Participants

The workers in our study were student members of the Novosibirsk State Technical University (Russia) crowd-intelligence lab, who volunteered to work on the project. In total, there were 11 of them (6 male, 5 female), with age ranging from 20 to 24 (mean = 20.5, SD = 0.74), all Bachelor students of the Applied Informatics major. All the workers had normal or corrected-to-normal vision and reasonable experience with web UIs and IT.

2.3.2 Material

The material was screenshots of the homepages of higher education organizations' websites (UIs). Initially, 10639 screenshots were collected automatically by a dedicated Python script crawling through URLs we took from various catalogues (DBPedia, etc.). The screenshots were made of full web pages as they were rendered, not just of the part above the fold or of a fixed size. Then we hand-picked 497 screenshots from this population, using the following criteria:

1. University or college corporate website with reasonably robust functionality;
2. Not an overly famous university;
3. Website content in English and reasonably diverse (i.e. no photos-only websites);
4. Reasonable diversity in website designs (colors, page layouts, etc.).

2.3.3 Design

The experiment used a between-subjects design: each UI screenshot was processed by only one worker, so there was no duplication of work. The independent and derived independent variables were:

1. The size of the UI screenshot file in PNG-24 format, in MB: PNG filesize;
2. The file size of the same screenshot in JPEG-100 format, in MB: JPEG filesize;
3. The number of elements metric automatically produced by the visual analyzer for a UI screenshot: VA Elements;
4. The entropy value obtained for the .png file through MATLAB's entropy(I) function: M Entropy.

The dependent variable in our study was the number of UI elements labeled in a UI screenshot by a worker: N Elements.

2.3.4 Procedure

For labeling the UIs, the workers used the LabelImg tool. It allows drawing a bounding rectangle around an image element, specifying a label for it (choosing from the set of pre-defined classes or adding a custom class), and saving the results as XML files in PASCAL VOC format. The workers were provided with instructions on using the tool and were given the set of pre-defined classes specific to web UIs (see Table 1). The 497 UI screenshots were distributed between the student workers nearly equally, based on their alphabetical order (no random assignment). Each worker used his or her judgment in deciding which UI elements to label, but they were asked to achieve maximum completeness in each UI. In total, it took the workers 6 days to complete their assignment.
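Since LabelImg stores each screenshot's annotations as a PASCAL VOC XML file with one <object> entry (containing a <name> tag) per bounding box, the number of labeled elements and the class frequencies can be read off these files directly. The sketch below illustrates this; the directory layout and function names are our assumptions, not the actual scripts used in the study.

```python
# Sketch: counting labeled UI elements (N Elements) and class frequencies
# from LabelImg's PASCAL VOC XML output. Paths and names are illustrative.
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

def count_labels(annotation_dir):
    per_screenshot = {}            # screenshot name -> number of labeled elements
    class_frequencies = Counter()  # class name -> total frequency

    for xml_file in Path(annotation_dir).glob("*.xml"):
        root = ET.parse(xml_file).getroot()
        # Each bounding box is an <object> element with its class in <name>.
        names = [obj.findtext("name", default="unknown") for obj in root.iter("object")]
        per_screenshot[xml_file.stem] = len(names)
        class_frequencies.update(names)

    return per_screenshot, class_frequencies

if __name__ == "__main__":
    n_elements, freqs = count_labels("labels/worker_AA")  # hypothetical directory
    print(sum(n_elements.values()), "elements,", len(freqs), "classes used")
```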
3 Results

3.1 Descriptive Statistics

The workers in total labeled 495 UI screenshots (2 erroneous ones were removed). This resulted in 42716 labeled UI elements, of which 39803 (93.2%) belonged to the pre-defined classes (shown in Table 1). An example of a screenshot being labeled with the LabelImg tool is provided in Fig. 2.

Figure 2: Example of a screenshot being labeled by a worker in the LabelImg tool (PASCAL VOC format).

Table 1: The pre-defined classes and their frequencies in the student workers' results.

Class name | Class description | Frequency
Graphical content elements:
image | foreground images that the web page displays | 4046
background image | images that are used as background, i.e. other UI elements are placed on top of them and they have no semantic meaning | 673
panel | an area that is visually separated from its surroundings by borders, shadows, and/or background color and contains at least one other UI element | 573
Textual content elements:
list | any list (numbered or unnumbered) that uses bullet points, numberings, borders, background color etc. to display a set of similar items | 305
table | any visually recognizable table (using alignment, lines or background color to represent rows and columns) | 24
paragraph | a portion of text consisting of one or more lines of text that are not visually separated by white space and/or indentation from other text | 1964
textblock | two or more subsequent paragraphs of text | 728
text | any other portion of text that is neither a label nor a paragraph or textblock | 7949
symbol | any graphical symbol, can appear on buttons, tabs, links, in texts etc. or separately | 1803
Interface elements:
checkbox | must be labeled one-by-one, without the accompanying text (which must be marked as label) | 10
radiobutton | must be labeled one-by-one, without the accompanying text (which must be marked as label) | 199
selectbox | a listbox that expands when clicked, displaying several options which can be selected or multi-selected | 863
textinput | single line (including password field, date/calendar, etc.) | 375
textarea | multi line | 69
button | if the button displays text on it, please additionally label the text of the button as type "label" (see below) | 2571
label | a small portion of text, typically one word or only a few words, used together with another UI control like a radiobutton | 3288
tabs | intra-page tabs created using HTML/CSS/JS, not browser tabs; please place the rectangle around the tab handle | 427
scrollbar | both intra-page (e.g. inside textareas) and the main scrollbar of the entire page if displayed | 59
pagination | should span the entire pagination controls area, typically the next and previous buttons and page links | 92
link | can be inside text (hyperlink), in navigation, etc. | 13785

Between the UIs, N Elements per UI ranged from 12 to 230, mean = 86.3, SD = 38.3, RSD = 44.3%. The Kolmogorov-Smirnov normality test suggested that the normality hypothesis had to be rejected for N Elements (D(495) = 0.047, p = 0.01). Between the workers, the average N Elements per UI ranged from 44.5 to 121.6, mean = 86.4, SD = 22.3. Detailed statistics are provided in Table 2 (workers' names are shortened to initials). When counting the classes, obviously erroneous ones (e.g. "butto") were removed from consideration. Notably, the relative standard deviations (RSD), bar one outlier worker (SMl with RSD = 86.26%), ranged within a rather narrow interval of 24.24-49.05%. The Shapiro-Wilk test suggested that the normality hypothesis could not be rejected (W(11) = 0.972, p = 0.903).

Table 2: Descriptive statistics for the labeled UI elements per worker.

Worker name (gender) | UIs labeled | UI elements | Classes used | Mean (SD) | RSD
AA (male) | 56 | 4896 | 35 | 87.43 (38.79) | 44.37%
GD (male) | 44 | 3520 | 18 | 80.00 (19.39) | 24.24%
KK (female) | 44 | 3927 | 16 | 89.25 (25.90) | 29.02%
MA (female) | 44 | 5349 | 18 | 121.57 (34.16) | 28.10%
NE (female) | 44 | 4994 | 17 | 113.50 (31.37) | 27.64%
PV (male) | 44 | 4659 | 19 | 105.89 (37.67) | 35.58%
PE (female) | 43 | 2649 | 19 | 61.60 (30.22) | 49.05%
SV (male) | 44 | 3929 | 29 | 89.30 (34.30) | 38.40%
SMr (male) | 45 | 1781 | 17 | 75.95 (27.42) | 36.11%
SMl (male) | 43 | 3266 | 16 | 39.58 (34.14) | 86.26%
VY (female) | 44 | 3746 | 18 | 85.14 (28.11) | 33.01%
All set | 495 | 42716 | 43 | 86.29 (38.26) | 44.33%
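The per-worker figures in Table 2 are plain descriptive statistics over the N Elements counts. A minimal sketch of how they could be reproduced, and how an unusually high RSD (such as SMl's 86.26%) could be flagged, is given below; the input structure, the threshold and the demo values are illustrative assumptions.

```python
# Sketch: per-worker descriptive statistics (mean, SD, RSD) over the number
# of elements labeled per screenshot; input structure is hypothetical.
import statistics

def worker_stats(counts_per_worker, rsd_threshold=0.6):
    """counts_per_worker: dict mapping worker id -> list of N Elements per UI."""
    report, flagged = {}, []
    for worker, counts in counts_per_worker.items():
        mean = statistics.mean(counts)
        sd = statistics.stdev(counts)
        rsd = sd / mean
        report[worker] = (round(mean, 2), round(sd, 2), round(100 * rsd, 2))
        if rsd > rsd_threshold:           # assumed cut-off, cf. worker SMl
            flagged.append(worker)
    return report, flagged

if __name__ == "__main__":
    demo = {"AA": [90, 120, 52], "SMl": [12, 130, 20]}  # synthetic example values
    print(worker_stats(demo))  # flags "SMl" as a high-RSD candidate outlier
```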
3.2 Analyzing and Predicting the Number of UI Elements in Screenshots

Running the 495 UI screenshots processed by the workers through our visual analyzer software, we were able to obtain the number of UI elements metric (VA Elements) for 440 of them (for the remaining 55, or 11.1%, the tool encountered technical problems). The resulting VA Elements ranged from 4 to 278, mean = 65.4, SD = 32.7, RSD = 50.1%. Hence, on average human workers recognized 1.32 times more UI elements than the automation tool. The Kolmogorov-Smirnov test suggested that the normality hypothesis had to be rejected for VA Elements (D(440) = 0.105, p < 0.001).

We found that the Pearson correlation between N Elements and VA Elements per UI was highly significant (r(440) = 0.381, p < 0.001). The correlations for JPEG filesize (r(495) = 0.278, p < 0.001), PNG filesize (r(495) = 0.174, p < 0.001), and M Entropy (r(492) = -0.125, p = 0.006) were also significant, but somewhat weaker.

Further, we constructed a regression model for N Elements with the 4 factors, which was found to be highly significant (F(4,432) = 26.0, p < 0.001), although it had a rather mediocre R² = 0.194. Its Akaike Information Criterion (AIC) value was 3100.

N Elements = 70.5 + 24.9 × JPEG filesize − 12.0 × PNG filesize + 0.253 × VA Elements − 5.4 × M Entropy    (1)

Since in some cases the visual analyzer failed to produce the metrics, we tested whether the model could be constructed without the VA Elements factor. The regression was found to be highly significant too (F(3,488) = 36.1, p < 0.001), although it had a somewhat lower R² = 0.182 and a poorer AIC = 3498.

N Elements = 87.4 + 38.5 × JPEG filesize − 24.1 × PNG filesize − 6.2 × M Entropy    (2)

We used model (2) to obtain the predicted numbers of elements for the 55 screenshots that the visual analyzer failed to process. The Pearson correlation between the predicted values and the actual number of labeled elements (N Elements) was found to be highly significant, r(55) = 0.427, p = 0.001.
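As an illustration of how model (2) could feed into outlier analysis on a microtasking platform, the sketch below computes the expected N Elements from the image metrics and flags a screenshot whose labeled count falls far below it. The coefficients come from Eq. (2); the tolerance ratio and the sample values are assumptions, and in practice such flags would be aggregated over all screenshots processed by a worker rather than applied to a single UI.

```python
# Sketch: flagging potentially incomplete labeling with regression model (2).
# Coefficients are taken from Eq. (2); the tolerance ratio is an assumption.
def predict_n_elements(jpeg_filesize_mb, png_filesize_mb, m_entropy):
    """Expected number of UI elements from image metrics, per model (2)."""
    return 87.4 + 38.5 * jpeg_filesize_mb - 24.1 * png_filesize_mb - 6.2 * m_entropy

def is_suspect(labeled_count, jpeg_mb, png_mb, entropy, tolerance=0.5):
    """True if the worker labeled far fewer elements than expected."""
    expected = predict_n_elements(jpeg_mb, png_mb, entropy)
    return labeled_count < tolerance * expected

if __name__ == "__main__":
    # Hypothetical screenshot metrics (MB, MB, bits) and a worker's output.
    print(predict_n_elements(2.5, 3.0, 7.0))  # expected count of about 68 elements
    print(is_suspect(25, 2.5, 3.0, 7.0))      # True: far fewer labels than expected
```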
4 Conclusion

Image-based analysis of user interfaces has recognized advantages, as it allows taking into account layouts, whitespace, graphical content, etc., independent of the concrete platform and framework used. The visual analysis (HCI vision) software tools generally require lots of training data, particularly for detecting the types of elements in today's manifold web UIs. Relying on internet-based crowdworkers who perform UI labeling is a popular approach for collecting such training data, but controlling the output quality currently involves considerable work overhead.

In our work, we proposed to assess data completeness, which in UI labeling equates to the number of identified UI elements, by predicting this expected number with metrics automatically calculated for the input image. For that, we constructed two regression models: (1) relies on the number of UI elements assessed by our dedicated visual analysis tool based on edge detection, while (2) only uses the JPEG, PNG and entropy metrics as factors. The quality of (1) was somewhat better, as its R² = 0.194 was 6.59% higher, albeit for a 1.13 times smaller sample. However, with (2) one is capable of predicting the expected number of UI elements in a UI image without the need to rely on external tools.

We see another contribution of the work in the set of pre-defined classes that we devised for web UI labeling and which covered 93.2% of all labeled UI elements in our study. The classes are presented and described in Table 1, and can be used by researchers working on similar problems.

Undoubtedly, the main limitation of our study is the lack of testing of the models in real crowdworking to identify workers who undermine completeness in UI labeling tasks. Our future research prospects include collecting the training data for the enhanced visual analyzer through a crowdworking platform and using the models together with outlier analysis to identify negligent performers. Another limitation is the relatively low R² coefficients in the models, even though (2) allowed us to predict values that had a reasonably strong correlation of r = 0.427 with the actual number of labeled UI elements. We plan to work on refining the set of factors, probably drawing from the metrics of visual complexity, which is currently extensively studied in HCI. Our further research prospects also include assessing other dimensions of training data quality, particularly its accuracy, also based on the characteristics of the input and output datasets, without the need for extra work effort. To that end, we plan to study the distribution of UI elements' classes and produce the characteristics of the trusted dataset, so that each worker's output could be related to them.

Acknowledgements

The reported study was funded by the Russian Ministry of Education and Science, according to the research project No. 2.2327.2017/4.6.

References

1. Semuels, A.: The Internet Is Enabling a New Kind of Poorly Paid Hell. The Atlantic, Next Economy, 23 Jan 2018. Accessed 20 May 2019 at https://www.theatlantic.com/business/archive/2018/01/amazon-mechanical-turk/551192/
2. Chen, A.: Inmates in Finland are training AI as part of prison labor. The Verge, 28 Mar 2019. Accessed 20 May 2019 at https://www.theverge.com/2019/3/28/18285572/prison-labor-finland-artificial-intelligence-data-tagging-vainu
3. Daniel, F. et al.: Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Computing Surveys (CSUR), 51(1), article 7 (2018).
4. Liu, X. et al.: CDAS: a crowdsourcing data analytics system. Proceedings of the VLDB Endowment, 5(10), pp. 1040-1051 (2012).
5. Inel, O. et al.: CrowdTruth: Machine-human computation framework for harnessing disagreement in gathering annotated data. In: Proc. International Semantic Web Conference, pp. 486-504 (2014).
6. Fan, J. et al.: iCrowd: An adaptive crowdsourcing framework. In: ACM SIGMOD International Conference on Management of Data, pp. 1015-1030 (2015).
7. Zheng, Y. et al.: QASCA: A quality-aware task assignment system for crowdsourcing applications. In: ACM SIGMOD International Conference on Management of Data, pp. 1031-1046 (2015).
8. Boychuk, E., Bakaev, M.: Entropy and Compression Based Analysis of Web User Interfaces. Lecture Notes in Computer Science (International Conference on Web Engineering), 11496, pp. 253-261 (2019).
9. Bakaev, M. et al.: Auto-extraction and integration of metrics for web user interfaces. Journal of Web Engineering, 17(6&7), pp. 561-590 (2018).