
Assessing Completeness in Training Data for Image-Based Analysis of Web User Interfaces

Martin Gaedke

martin.gaedke@informatik.tu-chemnitz.de

Maxim Bakaev

Novosibirsk State Technical University, Novosibirsk, Russia, 0000-0002-1889-0692

Sebastian Heil

Technische Universität Chemnitz, Chemnitz, Germany, 0000-0002-6729-2912

Analysis of user interfaces (UIs) based on their visual representation (screenshots) is gaining popularity, institutionalizing the HCI vision field. Seeing the same visual appearance of a UI as a human user does has the advantage of taking into account layouts, whitespace, graphical content, etc., independently of the concrete platform and framework used. However, visual analysis requires significant amounts of training data, particularly for the classifiers that identify UI elements and their types. In this paper we demonstrate how data completeness can be assessed in training datasets produced by crowdworkers, without the need to duplicate the extensive work. In the experimental session, 11 annotators labeled more than 42000 UI elements in nearly 500 web UI screenshots using the LabelImg tool with a pre-defined set of classes corresponding to visually identifiable web page element types. We identify metrics that can be automatically extracted from UI screenshots and construct regression models predicting the expected number of labeled elements in a screenshot. The results can be used in outlier analysis of crowdworkers on any existing microtasking platform.

Keywords: Machine Learning, Crowdworking, Image Recognition, Human-Computer Vision

1 Introduction

Though futurologists have long feared that the development of AI would take jobs away from humans, it has actually caused a boom in the demand for microworking services. These involve the use of general human intelligence to complete tasks for which no algorithm is known or satisfactorily efficient. The outcome is employed in solving some practical problem, providing an online service, or improving an AI model based on machine learning. Popular examples include labeling images, audio and video, moderating online content, sentiment analysis, translating short texts into other languages, etc. These tasks by and large involve unskilled and tedious work on data gathering and processing, so requestors of microworking services can rarely find enough motivated volunteers and tend to rely on low-paid microworkers [ 1 ]. For instance, there are already disturbing reports of involuntary microservitude: prisoners in Finland have been assigned data tagging jobs [ 2 ]. Aside from concerns about what an AI taught by socially dubious teachers will be like, the consequence is a growing need to check the quality of the outcome produced by such uninterested workers.

Quality is also the current focus for crowdwork done via the Internet (see the review in [ 3 ]), and the number of specialized platforms has been growing lately: MTurk (2005), microworkers.com (2009), Yandex.Toloka (2014), Google's AutoML (2018), etc. Controlling the completeness and accuracy of data used for training is particularly important, since these quality dimensions are linked to recall and precision in the resulting AI models. Today's trend is not only the implementation of output data quality assessment tools within the platforms (e.g. Yandex.Toloka allows specification of control rules for performance time, accuracy vs. the ground truth, majority consensus, etc.), but also a growing number of related meta-tools: CDAS [ 4 ], Crowd Truth [ 5 ], iCrowd [ 6 ], DOCS for MTurk [ 7 ], and more. Generally, they are concerned with online quality control and optimal assignment of tasks, and mostly rely on performance in completed work, which is evaluated using ground truth and majority consensus approaches. Nearly universally, these require that the same or very similar work is done by several workers, so that an accuracy measure can be calculated.

This work duplication is undesirable in some domains, where the tasks are labor-intensive and have no strictly correct outcome. In our work we focus on user interface (UI) labeling: the specification of UI elements' positions and types in a UI's visual representation. We propose assessing the completeness of the output data based on the expected number of objects, which we predict for a UI from certain metrics that can be automatically calculated for the UI image. Most existing research, even in general image labeling, focuses on image complexity, of which the number of objects is only one dimension. In particular, metrics based on image compression with popular formats such as JPEG or PNG are known to correlate well with image complexity, but their application to UI assessment is specific and relatively novel (see [ 8 ]). The potential advantage of the approach is more efficient production of training data, since it removes the need to perform redundant work to ensure data quality.

The remainder of the paper is organized as follows. In Section 2, we detail the UI image-based (visual) analysis approach and describe the related software tool that we previously built. We then describe an experimental UI labeling session in which 11 workers processed about 500 screenshots of university website homepages. In Section 3, we analyze the collected data, present the characteristics of the dataset, and construct regression models for predicting the expected number of UI elements from JPEG, PNG, and entropy metrics combined with edge detection-based recognition. In the final section, we discuss the results, provide conclusions and outline directions for further research. So far, the greatest limitation of our work is that it has not been tested in real conditions to see whether crowdworkers who undermine data completeness can be identified.

2 Methods

2.1 Web UI Visual Analysis

Image-based analysis of UIs is gaining in popularity, as it allows seeing the same interface as the user does, which is particularly important for web UIs. The drawback of this approach is that a considerable amount of training data is needed (particularly for the classifiers that identify UI elements and their types), and this data is mostly produced through human UI labeling.

It is already widely noted that when data is annotated through crowdworking platforms, controlling its quality is of foremost importance [ 3 ]. Particularly in UI labeling tasks, data completeness can suffer if unfaithful crowdworkers optimize their task performance for a better revenue/effort ratio. We believe that such outlier workers can be identified without resorting to duplicate labeling, on which the ground truth and majority consensus approaches are essentially founded. Objective characteristics of the material (that is, the UIs being labeled) can provide meaningful clues about how regular a worker's performance is.

Previously, we developed a prototype visual analyzer tool capable of extracting several metrics from a UI's visual representation [ 9 ]. It exhibited acceptable recognition of UI elements, based mostly on edge detection and identification of vertical and horizontal lines, rectangular forms, etc. (see Fig. 1). However, the performance of its trained classifiers responsible for detecting the types of web UI elements was found to be inadequate. As we develop the enhanced visual analyzer, we are concerned with efficient collection of training data via web UI labeling and with assessment of its quality. Data completeness is an important attribute of overall data quality, indicating the comprehensiveness of available data with respect to a specific informational requirement. General crowdworking is arguably more often concerned with data accuracy, since tasks are rarely compound enough to be completed only partially. In UI labeling, completeness can be undermined if too few UI elements are identified, which would in turn decrease the recall of the automated visual analysis tool.

To assess data completeness in this domain, the ground truth and majority consensus approaches could well be used. That is, if a crowdworker repeatedly under-identifies UI elements, his or her output could be considered invalid, and no further tasks would be assigned. However, web UIs are currently very diverse and the total number of UI elements can vary dramatically. Processing a single UI takes considerable time, so each worker would only label a few dozen of them, ruling out a statistically meaningful comparison of per-worker averages when the workers label different UIs. This means that for the ground truth and majority consensus approaches to be effective, several workers would have to process the same UI, i.e. the extensive labeling effort would have to be duplicated without contributing much to the training data.

Instead, the expected number of elements in a web UI could be predicted without the involvement of human workers, based on metrics extracted from its image (screenshot), as demonstrated e.g. in [ 9 ]. Subsequently, a kind of outlier analysis [ 3 ] could be employed to identify workers whose performance consistently deviates from the expected values. To explore whether the prediction-based approach holds for a trusted dataset and to identify the significant metrics, we collected labeling data in an experimental session.

2.3 The Experiment Description

2.3.1 Participants

The workers in our study were student members of the Novosibirsk State Technical University (Russia) crowdintelligence lab, who volunteered to work on the project. In total, there were 11 of them (6 male, 5 female), aged 20 to 24 (mean = 20.5, SD = 0.74), all Bachelor students majoring in Applied Informatics. All the workers had normal or corrected-to-normal vision and reasonable experience with web UIs and IT.

2.3.2 Material

The material consisted of screenshots of the homepages (UIs) of higher education organizations' websites. Initially, 10639 screenshots were collected automatically by a dedicated Python script crawling through URLs that we took from various catalogues (DBPedia, etc.). The screenshots captured full web pages as rendered, not just the part above the fold or an area of fixed size. We then hand-picked 497 screenshots from this population, using the following criteria:

1. University or college corporate website with reasonably robust functionality;

2. Not overly famous university;

3. Website content in English and reasonably diverse (i.e. no photos-only websites);

4. Reasonable diversity in website designs (colors, page layouts, etc.).

2.3.3 Design

The experiment used a between-subjects design: each UI screenshot was processed by only one worker, so there was no duplication of work. The independent and derived independent variables were:

1. The size of the UI screenshot file in PNG-24 format, in MB: PNG filesize;

2. File size for the same screenshot in JPEG-100 format, in MB: JPEG filesize;

3. The number of elements metric automatically produced by the visual analyzer for a UI screenshot: VA Elements;

4. Entropy value obtained for the .png file through MATLAB's entropy(I) function: M Entropy.
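The file-size and entropy variables listed above can be approximated with a few lines of standard-library Python. The sketch below is illustrative only and is not the tooling used in the study: the function names are hypothetical, the entropy follows the usual histogram-based Shannon definition over 8-bit grayscale intensities (which is our understanding of what MATLAB's entropy(I) computes), and the megabyte is assumed to be 1024 x 1024 bytes.

```python
import math
import os
from collections import Counter


def shannon_entropy(gray_pixels):
    """Shannon entropy (in bits) of a sequence of grayscale intensities.

    Assumption: pixels are 8-bit integers (0..255) and the entropy is
    computed from their histogram as -sum(p * log2(p)).
    """
    counts = Counter(gray_pixels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one pixel is required")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def filesize_mb(path):
    """File size in MB (assumption: 1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)
```

Decoding a screenshot into a flat list of grayscale values is left to an image library of choice; the two functions above cover only the metric computation itself.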

The dependent variable in our study was the number of UI elements labeled in a UI screenshot by a worker: N Elements.

2.3.4 Procedure

For labeling the UIs, the workers used the LabelImg tool. It allows drawing a bounding rectangle around an image element, specifying a label for it (chosen from the set of pre-defined classes or added as a custom class), and saving the results as XML files in PASCAL VOC format. The workers were provided with instructions on using the tool and were given the set of pre-defined classes specific to web UIs (see Table 1).

The 497 UI screenshots were distributed among the student workers nearly equally, based on their alphabetical order (no random assignment). Each worker used his or her own judgment in deciding which UI elements to label, but they were asked to achieve maximum completeness in each UI. In total, it took the workers 6 days to complete their assignment. The workers labeled 495 UI screenshots in total (2 erroneous ones were removed). This resulted in 42716 labeled UI elements, of which 39803 (93.2%) belonged to the pre-defined classes (shown in Table 1). An example of a screenshot being labeled with the LabelImg tool is provided in Fig. 2.
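Counting the labeled elements in one LabelImg output file, and the share of them that fall into the pre-defined classes, could be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes the usual PASCAL VOC layout (one `object` element per labeled rectangle, with the class in its `name` child), the function names are hypothetical, and the class names are transcribed from Table 1 and may differ from the exact strings entered in the tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: class names as transcribed from Table 1.
PREDEFINED_CLASSES = {
    "image", "background image", "panel",
    "list", "table", "paragraph", "textblock", "text", "symbol",
    "checkbox", "radiobutton", "selectbox", "textinput", "textarea",
    "button", "label", "tabs", "scrollbar", "pagination", "link",
}


def count_labels(xml_text):
    """Return a Counter of class names found in one PASCAL VOC annotation."""
    root = ET.fromstring(xml_text)
    return Counter(
        obj.findtext("name", default="").strip() for obj in root.iter("object")
    )


def completeness_summary(xml_text, predefined=PREDEFINED_CLASSES):
    """Return (all labeled elements, elements in pre-defined classes)."""
    counts = count_labels(xml_text)
    n_elements = sum(counts.values())
    n_predefined = sum(n for name, n in counts.items() if name in predefined)
    return n_elements, n_predefined
```

The first value returned by `completeness_summary` corresponds to N Elements for a screenshot; misspelled custom classes such as "butto" are counted in it but not in the second value.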

Table 1

Class name: Class description

Graphical content elements:

image: foreground images that the web page displays

background image: images that are used as background, i.e. other UI elements are placed on top of them and they have no semantic meaning

panel: an area that is visually separated from its surroundings by borders, shadows, and/or background color and contains at least one other UI element

Textual content elements:

list: any list (numbered or unnumbered) that uses bullet points, numberings, borders, background color etc. to display a set of similar items

table: any visually recognizable table (using alignment, lines or background color to represent rows and columns)

paragraph: a portion of text consisting of one or more lines of text that are not visually separated by white space and/or indentation from other text

textblock: two or more subsequent paragraphs of text

text: any other portion of text that is neither a label nor a paragraph or textblock

symbol: any graphical symbol, can appear on buttons, tabs, links, in texts etc. or separately

Interface elements:

checkbox: must be labeled one-by-one, without the accompanying text (which must be marked as label)

radiobutton: must be labeled one-by-one, without the accompanying text (which must be marked as label)

selectbox: a list box that would expand when clicked, displaying several options which can be selected or multi-selected

textinput: single line (including password field, data/calendar, etc.)

textarea: multi line

button: if the button displays text on it, please additionally label the text of the button as type "label" (see below)

label: a small portion of text, typically one word or only a few words, that are used together with another UI control like a radiobutton

tabs: intra-page tabs created using HTML/CSS/JS, not browser tabs, please place the rectangle around the tab handle

scrollbar: both intra-page, e.g. inside textareas, and the main scrollbar of the entire page if displayed

pagination: should span the entire pagination controls area, typically the next and previous buttons and page links

link: can be inside text (hyperlink), in navigation, etc.

[...] (D495 = 0.047, p = 0.01).

Between the workers, the average N Elements per UI ranged from 44.5 to 121.6, mean = 86.4, SD = 22.3. Detailed statistics are provided in Table 2 (workers' names are shortened to initials). When counting the classes, obviously erroneous ones (e.g. butto) were removed from consideration. Notably, the relative standard deviations (RSD), bar one outlier worker (SMl with RSD = 86.26%), fell within a rather narrow interval of 24.24-49.05%. The Shapiro-Wilk test suggested that the normality hypothesis could not be rejected (W11 = 0.972, p = 0.903).

3.2 Analyzing and Predicting the Number of UI Elements in Screenshots

Running the 495 UI screenshots processed by the workers through our visual analyzer software, we were able to obtain the number of UI elements metric (VA Elements) for 440 of them (the other 55, or 11.1%, encountered technical problems). The resulting VA Elements ranged from 4 to 278, mean = 65.4, SD = 32.7, RSD = 50.1%. Hence, on average the human workers recognized 1.32 times as many UI elements as the automation tool. The Kolmogorov-Smirnov test suggested that the normality hypothesis had to be rejected for VA Elements (D440 = 0.105, p < 0.001).

We found that the Pearson correlation between N Elements and VA Elements per UI was highly significant (r440 = 0.381, p < 0.001). The correlations for JPEG filesize (r495 = 0.278, p < 0.001), PNG filesize (r495 = 0.174, p < 0.001), and M Entropy (r492 = -0.125, p = 0.006) were also significant, but somewhat weaker.
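For readers who wish to reproduce this kind of check on their own labeling data, the Pearson coefficient can be computed with the standard library alone. The sketch below uses a hypothetical function name and implements only the coefficient, not the significance test reported above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

In the setting of this study, `xs` would hold a metric such as VA Elements and `ys` the corresponding N Elements, one pair per screenshot.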

Further, we constructed a regression model for N Elements with the 4 factors, which was found to be highly significant (F(4, 432) = 26.0, p < 0.001), although it had a rather mediocre R² = 0.194. Its Akaike Information Criterion (AIC) value was 3100.

$$\text{N Elements} = 70.5 + 24.9 \cdot \text{JPEG filesize} - 12.0 \cdot \text{PNG filesize} + 0.253 \cdot \text{VA Elements} - 5.4 \cdot \text{M Entropy} \qquad (1)$$

Since in some cases the visual analyzer failed to produce the metrics, we tested whether the model could be constructed without the VA Elements factor. This regression was found to be highly significant too (F(3, 488) = 36.1, p < 0.001), although it had a somewhat lower R² = 0.182 and a poorer AIC = 3498.

$$\text{N Elements} = 87.4 + 38.5 \cdot \text{JPEG filesize} - 24.1 \cdot \text{PNG filesize} - 6.2 \cdot \text{M Entropy} \qquad (2)$$

We used model (2) to obtain the predicted numbers of elements for the 55 screenshots that the visual analyzer failed to process. The Pearson correlation between the predicted values and the actual number of labeled elements (N Elements) was found to be highly significant, r55 = 0.427, p = 0.001.
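The two regression models can be written as a single prediction function that falls back to model (2) when the visual analyzer produces no VA Elements value. This is a sketch under stated assumptions: the function and parameter names are hypothetical, file sizes are in MB as defined in the Design subsection, and the negative signs of the PNG filesize and M Entropy terms are our reading of the typeset equations.

```python
def predict_n_elements(jpeg_mb, png_mb, m_entropy, va_elements=None):
    """Predict the expected number of labeled UI elements in a screenshot.

    Uses model (1) when va_elements is given, otherwise model (2).
    Assumption: the PNG filesize and M Entropy coefficients are negative.
    """
    if va_elements is None:
        # Model (2): JPEG filesize, PNG filesize and entropy only
        return 87.4 + 38.5 * jpeg_mb - 24.1 * png_mb - 6.2 * m_entropy
    # Model (1): additionally uses the visual analyzer's element count
    return (70.5 + 24.9 * jpeg_mb - 12.0 * png_mb
            + 0.253 * va_elements - 5.4 * m_entropy)
```

Given the modest R² values of both models, the predictions are better suited to comparing workers over many screenshots than to judging any single screenshot.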

4 Conclusion

Image-based analysis of user interfaces has recognized advantages, as it allows taking into account layouts, whitespace, graphical content, etc., independently of the concrete platform and framework used. Visual analysis (HCI vision) software tools generally require large amounts of training data, particularly for detecting the types of elements in today's manifold web UIs. Relying on internet-based crowdworkers who perform UI labeling is a popular approach to collecting such training data, but controlling the output quality currently involves considerable work overhead. In our work, we proposed to assess data completeness, which in UI labeling equates to the number of identified UI elements, by predicting this expected number from metrics automatically calculated for the input image.

To that end, we constructed two regression models: (1) relies on the number of UI elements assessed by our dedicated visual analysis tool based on edge detection, while (2) uses only JPEG, PNG and entropy metrics as the factors. The quality of (1) was somewhat better, as its R² = 0.194 was 6.59% higher, for a sample 1.13 times smaller. However, with (2) one can predict the expected number of UI elements in a UI image without the need to rely on external tools.
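The two ratios quoted in this comparison can be traced back to the reported statistics. The short check below assumes ordinary least squares with an intercept, so that the sample size equals the residual degrees of freedom of the F statistic plus the number of predictors plus one; the sample sizes of 437 and 492 are derived under that assumption and are not stated explicitly in the text.

```python
def sample_size(df_residual, n_predictors):
    """Sample size implied by an F test for OLS with an intercept."""
    return df_residual + n_predictors + 1


n_model_1 = sample_size(432, 4)  # model (1): F(4, 432)
n_model_2 = sample_size(488, 3)  # model (2): F(3, 488)

# Relative advantage of model (1) in R^2, and how much larger sample (2) is
r2_gain_percent = (0.194 / 0.182 - 1) * 100
sample_ratio = n_model_2 / n_model_1

print(n_model_1, n_model_2)
print(round(r2_gain_percent, 2), round(sample_ratio, 2))
```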

We see another contribution of the work in the set of pre-defined classes that we devised for web UI labeling, which covered 93.2% of all labeled UI elements in our study. The classes are presented and described in Table 1 and can be used by researchers working on similar problems.

Undoubtedly, the main limitation of our study is that the models have not been tested in real crowdworking to identify workers who undermine completeness in UI labeling tasks. Our future research prospects include collecting the training data for the enhanced visual analyzer through a crowdwork platform and using the models together with outlier analysis to identify negligent performers.
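One simple form that such outlier analysis could take is sketched below. It is not a procedure evaluated in this paper: the function name, the record layout and the z-score threshold are all hypothetical choices. Each worker's mean residual (labeled minus predicted elements) is compared against the other workers, and those falling far below the group are flagged.

```python
from statistics import mean, stdev


def flag_under_labelers(records, z_threshold=-1.5):
    """Flag workers whose mean residual is unusually low.

    records: iterable of (worker_id, n_labeled, n_predicted) tuples,
             one per processed screenshot.
    z_threshold: hypothetical cut-off on the z-score of a worker's mean
                 residual relative to all workers.
    """
    residuals = {}
    for worker, n_labeled, n_predicted in records:
        residuals.setdefault(worker, []).append(n_labeled - n_predicted)
    per_worker = {w: mean(values) for w, values in residuals.items()}
    if len(per_worker) < 2:
        return []  # nothing to compare against
    group_mean = mean(list(per_worker.values()))
    group_sd = stdev(list(per_worker.values()))
    if group_sd == 0:
        return []  # all workers behave identically
    return sorted(
        w for w, m in per_worker.items()
        if (m - group_mean) / group_sd < z_threshold
    )
```

With only a handful of workers the attainable z-scores are bounded, so the threshold would need tuning to the size of the worker pool.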

Another limitation is the relatively low R² coefficients of the models, even though (2) made it possible to predict values that had a reasonably strong correlation of r = 0.427 with the actual number of labeled UI elements. We plan to work on refining the set of factors, probably drawing on metrics of visual complexity, which is currently studied extensively in HCI.

Our further research prospects also include assessing other dimensions of training data quality, particularly accuracy, likewise based on the characteristics of the input and output datasets and without the need for extra work effort. To that end, we plan to study the distribution of UI element classes and to produce the characteristics of the trusted dataset, so that each worker's output can be related to them.

Acknowledgements

The reported study was funded by the Russian Ministry of Education and Science, according to the research project No. 2.2327.2017/4.6.

References

1. Semuels, A.: The Internet Is Enabling a New Kind of Poorly Paid Hell. The Atlantic, Next Economy, 23 Jan 2018. Accessed 20 May 2019 at https://www.theatlantic.com/business/archive/2018/01/amazon-mechanical-turk/551192/

2. Chen, A.: Inmates in Finland are training AI as part of prison labor. The Verge, 28 Mar 2019. Accessed 20 May 2019 at https://www.theverge.com/2019/3/28/18285572/prison-labor-finland-artificial-intelligence-data-tagging-vainu

3. Daniel, F. et al.: Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Computing Surveys (CSUR), 51(1), article 7 (2018).

4. Liu, X. et al.: CDAS: a crowdsourcing data analytics system. In Proc. of the VLDB Endowment, 5(10), pp. 1040-1051 (2012).

5. Inel, O. et al.: Crowdtruth: Machine-human computation framework for harnessing disagreement in gathering annotated data. In Proc. International Semantic Web Conference, pp. 486-504 (2014).

6. Fan, J. et al.: iCrowd: An adaptive crowdsourcing framework. In ACM SIGMOD International Conference on Management of Data, pp. 1015-1030 (2015).

7. Zheng, Y. et al.: QASCA: quality-aware task assignment system for crowdsourcing applications. In ACM SIGMOD International Conference on Management of Data, pp. 1031-1046 (2015).

8. Boychuk, E., Bakaev, M.: Entropy and Compression Based Analysis of Web User Interfaces. Lecture Notes in Computer Science (International Conference on Web Engineering), 11496, pp. 253-261 (2019).

9. Bakaev, M. et al.: Auto-extraction and integration of metrics for web user interfaces. Journal of Web Engineering, 17(6&7), pp. 561-590 (2018).