=Paper=
{{Paper
|id=Vol-1131/mindthegap14_9
|storemode=property
|title=User Performance Indicators In Task-Based Data Collection Systems
|pdfUrl=https://ceur-ws.org/Vol-1131/mindthegap14_9.pdf
|volume=Vol-1131
|dblpUrl=https://dblp.org/rec/conf/iconference/ChamberlainO14
}}
==User Performance Indicators In Task-Based Data Collection Systems==
Jon Chamberlain, Cliff O'Reilly
School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, CO4 3SQ, England
{jchamb,coreila}@essex.ac.uk

===Abstract===
When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and discusses its application for interface design.

''Copyright © 2014 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap'14 Workshop, Berlin, Germany, 4-March-2014, published at http://ceur-ws.org''

===1 Introduction===
When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. The metric of performance depends on the context of the task and what are considered the most important outputs by the system owners; for example, one system may desire high quality output from users, whereas another might want fast output from users [RC10].

When quality is the performance measure it is essential to have a trusted gold standard with which to judge the user's responses. A common problem for natural language processing applications, such as co-reference resolution, is that there are not sufficient resources available and creating them is both time-consuming and costly [PCK+13].

Using user response time as a performance indicator presents a different set of problems, and it may not necessarily be assumed that speed correlates to quality. A fast response may indicate a highly trained user responding to a simple task and conversely a slow response might indicate a difficult task that requires more thought.

It is therefore important to understand what is happening to the user during the response and whether there is anything that can be done to the system to improve performance.

This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and attempts to isolate the effect of each stage. Furthermore we discuss the implications and its application for interface design.

===2 Related Work===
The analysis of timed decision making has been a key experimental model in Cognitive Psychology. Studies in Reaction (or Response) Time (RT) show that the human interaction with a system can be divided into discrete stages: incoming stimulus; mental response; and behavioural response [Ste69]. Although traditional psychological theories follow this model of progression from perception to action, recent studies are moving more towards models of increasing complexity [HMU08].

''Figure 1: Stages of processing in human cognition.''

For our investigation we distinguish between 3 stages of processing required from the user to elicit an output response from input stimuli (see also Figure 1):
# input processing (sensory processing), where the user reads the text and comprehends it;
# decision making (cognitive processing), where the user makes a choice about how to complete the task;
# taking action (motor response) to enter the response into the system interface (typically using a keyboard or mouse).

This model demonstrates how a user responds to a task and can be seen in many examples of user interaction in task-based data collection systems. In crowdsourcing systems, such as Amazon's Mechanical Turk (http://www.mturk.com) or data collection games, a user is given an input (typically a section of text or an image) and asked to complete a task using that input, such as to identify a linguistic feature in the text or to categorise objects in an image [KCS08]. The model can also be seen in security applications such as reCAPTCHA, where the response of the user proves they are human and not an automated machine [vAMM+08]. As a final example, the model can be seen in users' responses to a search results page, with the list of results being the input and the click to the target document being the response [MTO12].

The relationship between accuracy in completing a task and the time taken is known as the Speed-Accuracy Trade-off. Evidence from studies in ecological decision-making shows clear indications that difficult tasks can be guessed where the costs of error are low; this results in lower accuracy but faster completion time [CSR09, KBM06]. Whilst studies using RT as a measure of performance are common, it has yet to be incorporated into more sophisticated models predicting data quality from user behaviour [RYZ+10, WRfW+09, KHH12, MRZ05].

===3 Data Collection===
Phrase Detectives is a game-with-a-purpose designed to collect data on anaphoric co-reference in English documents [CPKs08, PCK+13]. (Anaphoric co-reference is a type of linguistic reference where one expression depends on another referential element; an example would be the relation between the entity 'Jon' and the pronoun 'his' in the text 'Jon rode his bike to school'.)

''Figure 2: A task presented in Annotation Mode.''

The game uses 2 modes for players to complete a linguistic task. Initially text is presented in Annotation Mode (called Name the Culprit in the game, see Figure 2) where the player makes an annotation decision about a highlighted markable (section of text). If different players enter different interpretations for a markable then each interpretation is presented to more players in Validation Mode (called Detectives Conference in the game, see Figure 3). The players in Validation Mode have to agree or disagree with the interpretation.

''Figure 3: A task presented in Validation Mode.''

The game was released as 2 interfaces: in 2008 as an independent website system (PD, http://www.phrasedetectives.com) and in 2011 as an embedded game within the social network Facebook (PDFB, https://apps.facebook.com/phrasedetectives). Both versions of the Phrase Detectives game were built primarily in PHP, HTML, CSS and JavaScript, employ the same overall game architecture and run simultaneously on the same corpus of documents.

One of the differences between Phrase Detectives and other data collection games is that it uses pre-processing to offer the players a restricted choice of options. In Annotation Mode the text has embedded code that shows all selectable markables; in Validation Mode the player is offered a binary choice of agreeing or disagreeing with an interpretation. This makes the interface more game-like and allows the data to be analysed in a more straightforward way, as all responses are clicks rather than keyboard typing. In this sense it makes the findings more comparable to search result tasks than reCAPTCHA typing tasks.
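To make the analyses in the next section easier to follow, the sketch below shows one plausible shape for a per-response log record covering the three stages above (interface, mode, text position, clicks and response time). The field names and values are our own illustration and are not the actual Phrase Detectives schema.

<pre>
# Illustrative only: a plausible per-response log record for the analyses below.
# Field names are assumptions, not the actual Phrase Detectives schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    markable_id: int         # the highlighted section of text being annotated
    interface: str           # "PD" or "PDFB"
    mode: str                # "annotation", "validation_agree" or "validation_disagree"
    start_char: int          # character offset of the markable (input processing load)
    clicks: int              # minimum clicks needed to enter the response (motor load)
    rt_seconds: float        # time from task display to submission
    correct: Optional[bool]  # judged against a gold standard, where one exists

# A single toy record; real records would be read from the game logs.
example = Response(markable_id=42, interface="PDFB", mode="validation_agree",
                   start_char=1530, clicks=1, rt_seconds=5.0, correct=True)
</pre>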
===4 Analysis===
In order to investigate the human data processing in the Phrase Detectives game the RT was analysed in different ways. All data analysed in this paper is from the first 2 years of data collection from each interface and does not include data from markables that are flagged as ignored (system administrators manually correct pre-processing errors by tagging redundant markables to be ignored). Responses of 0 seconds were not included because they were more likely to indicate a problem with the system rather than a sub 0.5 second response. Responses over 512 seconds (8:32 minutes) were also not included; the upper time limit is set at 512 seconds because the data is part of a larger investigation that used RT grouped by a power function and it is assumed no task would take longer than this. Outliers do not represent more than 0.5% of the total responses.

{| class="wikitable"
|+ Table 1: Total responses for the 2 modes in the 2 interfaces of Phrase Detectives.
! !! PD !! PDFB
|-
| Total Annotations || 1,096,575 || 520,434
|-
| Total Validations (Agree) || 123,224 || 115,280
|-
| Total Validations (Disagree) || 278,896 || 199,197
|}

An overview of the total responses from each interface shows the PDFB interface had proportionately fewer annotations to validations than the PD interface, indicating that the players in the latter disagreed with each other more (see Table 1). A random sample of 50,000 responses per response type (annotation, agreeing validation, and disagreeing validation) shows that users respond differently between the 2 interfaces (see Table 2). The data was also plotted as a proportional frequency of RT, with a focus on the first 15 seconds (see Figure 4).

{| class="wikitable"
|+ Table 2: Minimum, median and mean RT from a random sample of 50,000 responses of each response type from PD and PDFB.
! !! PD !! PDFB
|-
| Annotation RT (min) || 1.0s || 2.0s
|-
| Annotation RT (med) || 3.0s || 6.0s
|-
| Annotation RT (mean) || 7.2s || 10.2s
|-
| Validation (Agr) RT (min) || 1.0s || 1.0s
|-
| Validation (Agr) RT (med) || 5.0s || 6.0s
|-
| Validation (Agr) RT (mean) || 10.0s || 10.5s
|-
| Validation (Dis) RT (min) || 1.0s || 2.0s
|-
| Validation (Dis) RT (med) || 3.0s || 6.0s
|-
| Validation (Dis) RT (mean) || 8.4s || 9.9s
|}

''Figure 4: Proportional frequency of RT in the 2 modes of the 2 interfaces of Phrase Detectives.''

There is a significant difference in the RT between interfaces (p<0.05, unpaired t-test). This may indicate a higher level of cheating and spam in PD; however, PDFB may be slower because it had to load the Facebook wrapper in addition to the interface. This is supported by the minimum RT for PDFB being 2.0s in Annotation and Validation (Disagree) Modes, where it could be assumed that this is the system's maximum speed. The 2 interfaces differ in the proportion of responses of 2 seconds or less (almost a third of all responses in PD but a negligible amount in PDFB). One of the motivations for this research is to understand the threshold where responses can be excluded based on predicted RT rather than comparison to a gold standard.

The RT for validations was slower than for annotations in the PD interface. This is counter-intuitive as Annotation Mode has more options for the user to choose from and requires a more complex motor response. One of the assumptions in the original game design was that a Validation Mode would be faster than an Annotation Mode and it would make data collection more efficient.
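The sketch below shows how the filtering and the between-interface comparison described above could be reproduced from such a log. The toy data, field names and the use of SciPy are our own assumptions; only the exclusion thresholds (0 seconds and 512 seconds) and the unpaired t-test come from the paper.

<pre>
# Sketch of the Section 4 filtering and interface comparison; toy data only.
import random
from scipy import stats

def keep(rt_seconds: float) -> bool:
    # Exclude 0-second responses (likely system problems) and responses over
    # 512 seconds, the upper limit used in the paper.
    return 0 < rt_seconds <= 512

random.seed(0)
# Toy (interface, RT) pairs standing in for the real game logs.
responses = [("PD", random.expovariate(1 / 7.0)) for _ in range(1000)] + \
            [("PDFB", 2.0 + random.expovariate(1 / 8.0)) for _ in range(1000)]

pd_rt = [rt for iface, rt in responses if iface == "PD" and keep(rt)]
pdfb_rt = [rt for iface, rt in responses if iface == "PDFB" and keep(rt)]

# Unpaired t-test between interfaces, mirroring the reported p<0.05 difference.
t_stat, p_value = stats.ttest_ind(pd_rt, pdfb_rt, equal_var=False)
print(f"PD vs PDFB: t = {t_stat:.2f}, p = {p_value:.3g}")
</pre>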
The data was further analysed to investigate the 3 stages of user processing. Different data models were used to isolate the effect of the stage in question and negate the influence of the 2 other stages.

====4.1 Input processing====
A random sample of 100,000 validation (agree and disagree) responses was taken from the PDFB corpus. The RT and the character distance to the start of the markable were tested for a linear correlation, the hypothesis being that more input data (i.e., a long text) will require more time for the player to read and comprehend. Validation Mode was used because it always displays the same number of choices to the player no matter what the length of the text (i.e., 2), so the action and decision making stages should be constant and any difference observed in RT would be due to input processing.

There was a significant correlation between RT and the amount of text displayed on the screen (p<0.05, Pearson's Correlation), which supports the hypothesis that processing a larger input takes more time.
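A minimal sketch of this correlation test follows, assuming the validation responses are available as (start offset, RT) pairs. The synthetic data and loading step are illustrative; only the Pearson correlation itself is from the paper.

<pre>
# Sketch of the Section 4.1 test: does RT grow with the amount of preceding text?
# Synthetic data; in the study the pairs come from 100,000 PDFB validation responses.
import random
from scipy.stats import pearsonr

random.seed(1)
start_chars, rts = [], []
for _ in range(10_000):
    offset = random.randint(0, 5000)                    # markable start position
    rt = 2.0 + 0.001 * offset + random.gauss(0.0, 2.0)  # toy reading-time effect
    start_chars.append(offset)
    rts.append(max(rt, 0.5))                            # keep RTs positive

r, p_value = pearsonr(start_chars, rts)
print(f"Pearson r = {r:.3f}, p = {p_value:.3g}")
</pre>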
====4.2 Decision making====
The decision making stage was investigated using an analysis of 5 documents in the PD corpus that had a double gold standard (i.e., had been marked by 2 language experts), excluding markables that were ambiguous (i.e., the 2 experts did not agree on the best answer) or where there was complete consensus. The comparison of paired responses on individual markables minimises the effect of processing time, and the action time is assumed to be evenly distributed.

{| class="wikitable"
|+ Table 3: Mean RT for aggregated correct and incorrect responses in the 2 modes from 122 gold standard markable observations (80 in the case of Validation Disagree). * indicates p<0.05.
! !! Correct !! Incorrect
|-
| Annotation* || 10.1s || 12.8s
|-
| Validation (Agree)* || 13.5s || 17.7s
|-
| Validation (Disagree) || 14.5s || 15.0s
|}

The analysis shows that an incorrect response takes longer, significantly so in the case of making an annotation or agreeing with an annotation (p<0.05, paired t-test); see Table 3. Given that this dataset is from PD, where there are a high number of fast spam responses, it is feasible that the true incorrect RT is higher. Taking longer to make an incorrect response is indicative of a user who does not have a good understanding of the task or that the task is more difficult than usual. Mean RT is slower than in the general dataset (Table 2). One explanation is that the gold standard was created from some of the first documents to be completed and the user base at that time would mostly have been interested early adopters, beta testers and colleagues of the developers, rather than the more general crowd that developed over time, including spammers making fast responses.

====4.3 Taking action====
A random sample of 100,000 markables and associated annotations was taken from completed documents from both interfaces where the markable starting character was greater than 1,000 characters. Annotations were grouped on the minimum number of clicks that would be required to make the response (any markables that had no responses in any group were excluded). Thus the effect of input processing speed was minimised in the selected markables and decision making time is assumed to be evenly distributed. The groups were:

* 1 click response, including Discourse-New (DN) and Non-Referring (NR);
* 2 click response, including Discourse-Old (DO) where 1 antecedent was chosen;
* 3 click response, including DO where 2 antecedents were chosen and Property (PR) where 1 antecedent was chosen.

{| class="wikitable"
|+ Table 4: Minimum, median and maximum RT for clicking actions in Annotation Mode from 6,176 markables (p<0.01).
! !! Min !! Med !! Max
|-
| 1 click (DN, NR) || 1.0s || 5.0s || 123.3s
|-
| 2 clicks (DO1) || 1.0s || 9.8s || 293.0s
|-
| 3 clicks (DO2, PR1) || 2.0s || 12.0s || 509.0s
|}

There is a significant difference between each group (p<0.01, paired t-test), implying that the motor response per click is between 2 and 4 seconds, although for some tasks it is clearly faster, as can be seen in the minimum RT. This makes the filtering of responses below a threshold RT important, as in some cases the user would not have enough time to process the input, make a decision and take action. This will be dependent on how difficult the task is to respond to.

Here the actions require the user to click on a link or button, but this methodology can be extended to cover different styles of input, for example freetext entry. Freetext is a more complicated response because the same decision can be expressed in different ways, and automatic text processing and normalisation would be required. However, when a complex answer might be advantageous, it is useful to have an unrestricted way of collecting data allowing novel answers to be recorded. To this end the Phrase Detectives game allowed freetext comments to be added to markables.
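The grouping in Section 4.3 can be sketched as follows: per markable, responses are bucketed by the minimum number of clicks the chosen interpretation requires, markables missing any bucket are dropped, and the buckets are then compared pairwise over the same markables. The data structure and toy values are our own assumptions; the click groups and the paired test follow the description above.

<pre>
# Sketch of the Section 4.3 click-group comparison; toy data only.
from statistics import median
from scipy.stats import ttest_rel

# responses_by_markable[markable_id][clicks] -> list of RTs for that click group
# (1 click = DN/NR, 2 clicks = DO with one antecedent, 3 clicks = DO2/PR1).
responses_by_markable = {
    1: {1: [4.0, 5.5], 2: [9.0, 10.5], 3: [11.0, 13.0]},
    2: {1: [3.0, 4.5], 2: [8.5, 10.0], 3: [12.5, 14.0]},
    3: {1: [6.0, 5.0], 2: [9.5, 11.0], 3: [14.0, 12.0]},
    4: {1: [4.5, 6.5], 2: [10.0, 9.0], 3: [13.5, 15.0]},
}

# Keep only markables with responses in every click group, as in the paper.
complete = {m: g for m, g in responses_by_markable.items() if set(g) == {1, 2, 3}}

def group_medians(clicks: int) -> list:
    """Median RT of the given click group, one value per markable."""
    return [median(groups[clicks]) for groups in complete.values()]

one, two, three = group_medians(1), group_medians(2), group_medians(3)

# Paired comparisons between click groups over the same markables.
print("1 vs 2 clicks:", ttest_rel(one, two))
print("2 vs 3 clicks:", ttest_rel(two, three))
</pre>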
===5 Discussion===
By understanding the way users interact with a system, each task response time can be predicted. In the case of the Phrase Detectives game we can use a prediction of what the user should do for a given size of input to process, task difficulty and data entry mode. The same could be applied to any task-driven system such as search, where the system returns a set of results from a query of known complexity with a set of actionable areas that allow a response to be predicted even when the user is unknown.

When the system is able to predict a response time for a given input, task and interface combination, user performance can be measured, with users that perform as predicted being used as a pseudo-gold standard so the system can learn from new data. Outlier data can be filtered: a response that is too fast may indicate the user is clicking randomly or that it is an automated or spam response; a response that is too slow may indicate the user is distracted, fatigued or does not understand the task, and therefore the quality of their judgement is likely to be poor.

The significant results uncovered by the analysis of the Phrase Detectives data should be treated with some caution. Independent analysis of each processing stage is not entirely possible from log data because users are capable of performing each stage simultaneously, i.e., by making decisions and following the text with the mouse cursor whilst reading the text. A more precise model could be achieved with eye-tracking and GOMS (Goals, Operators, Methods, and Selection rules) modelling [CNM83], using a test group to establish baselines for comparison to the log data, or by using implicit user feedback from more detailed logs [ABD06]. Without using more precise measures of response time this method is most usefully employed as a way to detect and filter spam and very poor responses, rather than as a way to evaluate and predict user performance.

Modelling the system and measuring user performance allows designers to benchmark proposed changes to see if they have the desired effect, either an improvement in user performance or a negligible detriment when, for example, monetising an interface by adding more advertising. Sensory and motor actions in the system can be improved by changes to the interface; for example, in the case of search results, ensuring the results list page contains enough data so the user is likely to find their target but not so much that it slows the user down with input processing. Even simple changes such as increasing the contrast or size of the text might allow faster processing of the input text and hence improve user performance. Decision making can be improved through user training, either explicitly with instructions and training examples or implicitly by following interface design conventions so the user is pre-trained in how the system will work.

Predicting a user response is an imprecise science and other human factors should be considered as potentially overriding factors in any analysis. A user's expectations of how an interface should operate, combined with factors beyond measurement, may negate careful design efforts.

===6 Conclusion===
Our investigation has shown that all three stages of user interaction within task-based data collection systems (processing the input; making a decision; and taking action) have a significant effect on the response time of users, and this has an impact on how interface design elements should be applied. Using response time to evaluate users from log data may only be accurate enough to filter outliers rather than predict performance; however, this is the subject of future research.

====Acknowledgements====
The authors would like to thank the reviewers and Dr Udo Kruschwitz for their comments and suggestions. The creation of the original game was funded by EPSRC project AnaWiki, EP/F00575X/1.
===References===
* [ABD06] Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 19–26, New York, NY, USA, 2006. ACM.
* [CNM83] Stuart K. Card, Allen Newell, and Thomas P. Moran. The Psychology of Human-Computer Interaction. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1983.
* [CPKs08] Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. Phrase Detectives: A web-based collaborative annotation game. In Proceedings of the International Conference on Semantic Systems (I-Semantics'08), 2008.
* [CSR09] Lars Chittka, Peter Skorupski, and Nigel E. Raine. Speed–accuracy tradeoffs in animal decision making. Trends in Ecology & Evolution, 24(7):400–407, 2009.
* [HMU08] Hauke R. Heekeren, Sean Marrett, and Leslie G. Ungerleider. The neural systems that mediate human perceptual decision making. Nature Reviews Neuroscience, 9(6):467–479, June 2008.
* [KBM06] Leslie M. Kay, Jennifer Beshel, and Claire Martin. When good enough is best. Neuron, 51(3):277–278, 2006.
* [KCS08] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, pages 453–456, New York, NY, USA, 2008. ACM.
* [KHH12] Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '12, pages 467–474, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.
* [MRZ05] Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The Peer-Prediction method. Management Science, 51(9):1359–1373, September 2005.
* [MTO12] Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. Learning to predict response times for online query scheduling. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 621–630, New York, NY, USA, 2012. ACM.
* [PCK+13] Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo, and Luca Ducceschi. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems, 2013.
* [RC10] Filip Radlinski and Nick Craswell. Comparing the sensitivity of information retrieval metrics. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 667–674. ACM, 2010.
* [RYZ+10] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, August 2010.
* [Ste69] Saul Sternberg. The discovery of processing stages: Extensions of Donders' method. Acta Psychologica, 30:276–315, 1969.
* [vAMM+08] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.
* [WRfW+09] Jacob Whitehill, Paul Ruvolo, Ting-fan Wu, Jacob Bergsma, and Javier Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2035–2043, December 2009.