=Paper=
{{Paper
|id=Vol-1131/mindthegap14_9
|storemode=property
|title=User Performance Indicators In Task-Based Data Collection Systems
|pdfUrl=https://ceur-ws.org/Vol-1131/mindthegap14_9.pdf
|volume=Vol-1131
|dblpUrl=https://dblp.org/rec/conf/iconference/ChamberlainO14
}}
==User Performance Indicators In Task-Based Data Collection Systems==
Jon Chamberlain, Cliff O'Reilly
School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, CO4 3SQ, England
{jchamb,coreila}@essex.ac.uk

===Abstract===
When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and discusses its application for interface design.

''Copyright © 2014 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap'14 Workshop, Berlin, Germany, 4-March-2014, published at http://ceur-ws.org''

===1 Introduction===
When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. The metric of performance depends on the context of the task and what are considered the most important outputs by the system owners; for example, one system may desire high quality output from users, whereas another might want fast output from users [RC10].

When quality is the performance measure it is essential to have a trusted gold standard with which to judge the user's responses. A common problem for natural language processing applications, such as co-reference resolution, is that there are not sufficient resources available and creating them is both time-consuming and costly [PCK+13].

Using user response time as a performance indicator presents a different set of problems, and it may not necessarily be assumed that speed correlates to quality. A fast response may indicate a highly trained user responding to a simple task and conversely a slow response might indicate a difficult task that requires more thought.

It is therefore important to understand what is happening to the user during the response and whether there is anything that can be done to the system to improve performance.

This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and attempts to isolate the effect of each stage. Furthermore we discuss the implications and its application for interface design.

===2 Related Work===
The analysis of timed decision making has been a key experimental model in Cognitive Psychology. Studies in Reaction (or Response) Time (RT) show that the human interaction with a system can be divided into discrete stages: incoming stimulus; mental response; and behavioural response [Ste69]. Although traditional psychological theories follow this model of progression from perception to action, recent studies are moving more towards models of increasing complexity [HMU08].

''Figure 1: Stages of processing in human cognition.''

For our investigation we distinguish between 3 stages of processing required from the user to elicit an output response from input stimuli (see also Figure 1):
# input processing (sensory processing), where the user reads the text and comprehends it;
# decision making (cognitive processing), where the user makes a choice about how to complete the task;
# taking action (motor response) to enter the response into the system interface (typically using a keyboard or mouse).

This model demonstrates how a user responds to a task and can be seen in many examples of user interaction in task-based data collection systems. In crowdsourcing systems, such as Amazon's Mechanical Turk (http://www.mturk.com) or data collection games, a user is given an input (typically a section of text or an image) and asked to complete a task using that input, such as to identify a linguistic feature in the text or to categorise objects in an image [KCS08]. The model can also be seen in security applications such as reCAPTCHA, where the response of the user proves they are human and not an automated machine [vAMM+08]. As a final example, the model can be seen in users' responses to a search results page, with the list of results being the input and the click to the target document being the response [MTO12].

The relationship between accuracy in completing a task and the time taken is known as the Speed-Accuracy Trade-off. Evidence from studies in ecological decision-making shows clear indications that difficult tasks can be guessed where the costs of error are low; this results in lower accuracy but faster completion time [CSR09, KBM06]. Whilst studies using RT as a measure of performance are common, it has yet to be incorporated into more sophisticated models predicting data quality from user behaviour [RYZ+10, WRfW+09, KHH12, MRZ05].

===3 Data Collection===
Phrase Detectives is a game-with-a-purpose designed to collect data on anaphoric co-reference in English documents [CPKs08, PCK+13]. (Anaphoric co-reference is a type of linguistic reference where one expression depends on another referential element; an example would be the relation between the entity 'Jon' and the pronoun 'his' in the text 'Jon rode his bike to school'.)

''Figure 2: A task presented in Annotation Mode.''

The game uses 2 modes for players to complete a linguistic task. Initially text is presented in Annotation Mode (called Name the Culprit in the game, see Figure 2) where the player makes an annotation decision about a highlighted markable (section of text). If different players enter different interpretations for a markable then each interpretation is presented to more players in Validation Mode (called Detectives Conference in the game, see Figure 3). The players in Validation Mode have to agree or disagree with the interpretation.

''Figure 3: A task presented in Validation Mode.''

The game was released as 2 interfaces: in 2008 as an independent website system (PD, http://www.phrasedetectives.com) and in 2011 as an embedded game within the social network Facebook (PDFB, https://apps.facebook.com/phrasedetectives). Both versions of the Phrase Detectives game were built primarily in PHP, HTML, CSS and JavaScript, employ the same overall game architecture and run simultaneously on the same corpus of documents.

One of the differences between Phrase Detectives and other data collection games is that it uses pre-processing to offer the players a restricted choice of options. In Annotation Mode the text has embedded code that shows all selectable markables; in Validation Mode the player is offered a binary choice of agreeing or disagreeing with an interpretation. This makes the interface more game-like and allows the data to be analysed in a more straightforward way, as all responses are clicks rather than keyboard typing. In this sense it makes the findings more comparable to search result tasks than reCAPTCHA typing tasks.
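To make the analyses in the next section easier to follow, the sketch below shows one plausible shape for a per-response log record covering the three stages above (interface, mode, text position, clicks and response time). The field names and values are our own illustration and are not the actual Phrase Detectives schema.

<pre>
# Illustrative only: a plausible per-response log record for the analyses below.
# Field names are assumptions, not the actual Phrase Detectives schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    markable_id: int         # the highlighted section of text being annotated
    interface: str           # "PD" or "PDFB"
    mode: str                # "annotation", "validation_agree" or "validation_disagree"
    start_char: int          # character offset of the markable (input processing load)
    clicks: int              # minimum clicks needed to enter the response (motor load)
    rt_seconds: float        # time from task display to submission
    correct: Optional[bool]  # judged against a gold standard, where one exists

# A single toy record; real records would be read from the game logs.
example = Response(markable_id=42, interface="PDFB", mode="validation_agree",
                   start_char=1530, clicks=1, rt_seconds=5.0, correct=True)
</pre>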
===4 Analysis===
In order to investigate the human data processing in the Phrase Detectives game the RT was analysed in different ways. All data analysed in this paper is from the first 2 years of data collection from each interface and does not include data from markables that are flagged as ignored (system administrators manually correct pre-processing errors by tagging redundant markables to be ignored). Responses of 0 seconds were not included because they were more likely to indicate a problem with the system rather than a sub 0.5 second response. Responses over 512 seconds (8:32 minutes) were also not included; the upper time limit is set at 512 seconds because the data is part of a larger investigation that used RT grouped by a power function and it is assumed no task would take longer than this. Outliers do not represent more than 0.5% of the total responses.

{| class="wikitable"
|+ Table 1: Total responses for the 2 modes in the 2 interfaces of Phrase Detectives.
! !! PD !! PDFB
|-
| Total Annotations || 1,096,575 || 520,434
|-
| Total Validations (Agree) || 123,224 || 115,280
|-
| Total Validations (Disagree) || 278,896 || 199,197
|}

An overview of the total responses from each interface shows the PDFB interface had proportionately fewer annotations to validations than the PD interface, indicating that the players in the latter disagreed with each other more (see Table 1). A random sample of 50,000 responses per response type (annotation, agreeing validation, and disagreeing validation) shows that users respond differently between the 2 interfaces (see Table 2). The data was also plotted as a proportional frequency of RT, with a focus on the first 15 seconds (see Figure 4).

{| class="wikitable"
|+ Table 2: Minimum, median and mean RT from a random sample of 50,000 responses of each response type from PD and PDFB.
! !! PD !! PDFB
|-
| Annotation RT (min) || 1.0s || 2.0s
|-
| Annotation RT (med) || 3.0s || 6.0s
|-
| Annotation RT (mean) || 7.2s || 10.2s
|-
| Validation (Agr) RT (min) || 1.0s || 1.0s
|-
| Validation (Agr) RT (med) || 5.0s || 6.0s
|-
| Validation (Agr) RT (mean) || 10.0s || 10.5s
|-
| Validation (Dis) RT (min) || 1.0s || 2.0s
|-
| Validation (Dis) RT (med) || 3.0s || 6.0s
|-
| Validation (Dis) RT (mean) || 8.4s || 9.9s
|}

''Figure 4: Proportional frequency of RT in the 2 modes of the 2 interfaces of Phrase Detectives.''

There is a significant difference in the RT between interfaces (p<0.05, unpaired t-test). This may indicate a higher level of cheating and spam in PD; however, PDFB may be slower because it had to load the Facebook wrapper in addition to the interface. This is supported by the minimum RT for PDFB being 2.0s in Annotation and Validation (Disagree) Modes, where it could be assumed that this is the system's maximum speed. The 2 interfaces differ in the proportion of responses of 2 seconds or less (almost a third of all responses in PD but a negligible amount in PDFB). One of the motivations for this research is to understand the threshold where responses can be excluded based on predicted RT rather than comparison to a gold standard.

The RT for validations was slower than for annotations in the PD interface. This is counter-intuitive as Annotation Mode has more options for the user to choose from and requires a more complex motor response. One of the assumptions in the original game design was that a Validation Mode would be faster than an Annotation Mode and it would make data collection more efficient.
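The sketch below shows how the filtering and the between-interface comparison described above could be reproduced from such a log. The toy data, field names and the use of SciPy are our own assumptions; only the exclusion thresholds (0 seconds and 512 seconds) and the unpaired t-test come from the paper.

<pre>
# Sketch of the Section 4 filtering and interface comparison; toy data only.
import random
from scipy import stats

def keep(rt_seconds: float) -> bool:
    # Exclude 0-second responses (likely system problems) and responses over
    # 512 seconds, the upper limit used in the paper.
    return 0 < rt_seconds <= 512

random.seed(0)
# Toy (interface, RT) pairs standing in for the real game logs.
responses = [("PD", random.expovariate(1 / 7.0)) for _ in range(1000)] + \
            [("PDFB", 2.0 + random.expovariate(1 / 8.0)) for _ in range(1000)]

pd_rt = [rt for iface, rt in responses if iface == "PD" and keep(rt)]
pdfb_rt = [rt for iface, rt in responses if iface == "PDFB" and keep(rt)]

# Unpaired t-test between interfaces, mirroring the reported p<0.05 difference.
t_stat, p_value = stats.ttest_ind(pd_rt, pdfb_rt, equal_var=False)
print(f"PD vs PDFB: t = {t_stat:.2f}, p = {p_value:.3g}")
</pre>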
The data was further analysed to investigate the 3 stages of user processing. Different data models were used to isolate the effect of the stage in question and negate the influence of the 2 other stages.

====4.1 Input processing====
A random sample of 100,000 validation (agree and disagree) responses was taken from the PDFB corpus. The RT and the character distance to the start of the markable were tested for a linear correlation, the hypothesis being that more input data (i.e., a long text) will require more time for the player to read and comprehend. Validation Mode was used because it always displays the same number of choices to the player no matter what the length of the text (i.e., 2), so the action and decision making stages should be constant and any difference observed in RT would be due to input processing.

There was a significant correlation between RT and the amount of text displayed on the screen (p<0.05, Pearson's Correlation), which supports the hypothesis that processing a larger input takes more time.
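A minimal sketch of this correlation test follows, assuming the validation responses are available as (start offset, RT) pairs. The synthetic data and loading step are illustrative; only the Pearson correlation itself is from the paper.

<pre>
# Sketch of the Section 4.1 test: does RT grow with the amount of preceding text?
# Synthetic data; in the study the pairs come from 100,000 PDFB validation responses.
import random
from scipy.stats import pearsonr

random.seed(1)
start_chars, rts = [], []
for _ in range(10_000):
    offset = random.randint(0, 5000)                    # markable start position
    rt = 2.0 + 0.001 * offset + random.gauss(0.0, 2.0)  # toy reading-time effect
    start_chars.append(offset)
    rts.append(max(rt, 0.5))                            # keep RTs positive

r, p_value = pearsonr(start_chars, rts)
print(f"Pearson r = {r:.3f}, p = {p_value:.3g}")
</pre>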
====4.2 Decision making====
The decision making stage was investigated using an analysis of 5 documents in the PD corpus that had a double gold standard (i.e., had been marked by 2 language experts), excluding markables that were ambiguous (i.e., the 2 experts did not agree on the best answer) or where there was complete consensus. The comparison of paired responses on individual markables minimises the effect of processing time, and the action time is assumed to be evenly distributed.

{| class="wikitable"
|+ Table 3: Mean RT for aggregated correct and incorrect responses in the 2 modes from 122 gold standard markable observations (80 in the case of Validation Disagree). * indicates p<0.05.
! !! Correct !! Incorrect
|-
| Annotation* || 10.1s || 12.8s
|-
| Validation (Agree)* || 13.5s || 17.7s
|-
| Validation (Disagree) || 14.5s || 15.0s
|}

The analysis shows that an incorrect response takes longer, significantly so in the case of making an annotation or agreeing with an annotation (p<0.05, paired t-test); see Table 3. Given that this dataset is from PD, where there are a high number of fast spam responses, it is feasible that the true incorrect RT is higher. Taking longer to make an incorrect response is indicative of a user who does not have a good understanding of the task or that the task is more difficult than usual. Mean RT is slower than in the general dataset (Table 2). One explanation is that the gold standard was created from some of the first documents to be completed and the user base at that time would mostly have been interested early adopters, beta testers and colleagues of the developers, rather than the more general crowd that developed over time, including spammers making fast responses.

====4.3 Taking action====
A random sample of 100,000 markables and associated annotations was taken from completed documents from both interfaces where the markable starting character was greater than 1,000 characters. Annotations were grouped on the minimum number of clicks that would be required to make the response (any markables that had no responses in any group were excluded). Thus the effect of input processing speed was minimised in the selected markables and decision making time is assumed to be evenly distributed. The groups were:

* 1 click response, including Discourse-New (DN) and Non-Referring (NR);
* 2 click response, including Discourse-Old (DO) where 1 antecedent was chosen;
* 3 click response, including DO where 2 antecedents were chosen and Property (PR) where 1 antecedent was chosen.

{| class="wikitable"
|+ Table 4: Minimum, median and maximum RT for clicking actions in Annotation Mode from 6,176 markables (p<0.01).
! !! Min !! Med !! Max
|-
| 1 click (DN, NR) || 1.0s || 5.0s || 123.3s
|-
| 2 clicks (DO1) || 1.0s || 9.8s || 293.0s
|-
| 3 clicks (DO2, PR1) || 2.0s || 12.0s || 509.0s
|}

There is a significant difference between each group (p<0.01, paired t-test), implying that the motor response per click is between 2 and 4 seconds, although for some tasks it is clearly faster, as can be seen in the minimum RT. This makes the filtering of responses below a threshold RT important, as in some cases the user would not have enough time to process the input, make a decision and take action. This will be dependent on how difficult the task is to respond to.

Here the actions require the user to click on a link or button, but this methodology can be extended to cover different styles of input, for example freetext entry. Freetext is a more complicated response because the same decision can be expressed in different ways, and automatic text processing and normalisation would be required. However, when a complex answer might be advantageous, it is useful to have an unrestricted way of collecting data allowing novel answers to be recorded. To this end the Phrase Detectives game allowed freetext comments to be added to markables.
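The grouping in Section 4.3 can be sketched as follows: per markable, responses are bucketed by the minimum number of clicks the chosen interpretation requires, markables missing any bucket are dropped, and the buckets are then compared pairwise over the same markables. The data structure and toy values are our own assumptions; the click groups and the paired test follow the description above.

<pre>
# Sketch of the Section 4.3 click-group comparison; toy data only.
from statistics import median
from scipy.stats import ttest_rel

# responses_by_markable[markable_id][clicks] -> list of RTs for that click group
# (1 click = DN/NR, 2 clicks = DO with one antecedent, 3 clicks = DO2/PR1).
responses_by_markable = {
    1: {1: [4.0, 5.5], 2: [9.0, 10.5], 3: [11.0, 13.0]},
    2: {1: [3.0, 4.5], 2: [8.5, 10.0], 3: [12.5, 14.0]},
    3: {1: [6.0, 5.0], 2: [9.5, 11.0], 3: [14.0, 12.0]},
    4: {1: [4.5, 6.5], 2: [10.0, 9.0], 3: [13.5, 15.0]},
}

# Keep only markables with responses in every click group, as in the paper.
complete = {m: g for m, g in responses_by_markable.items() if set(g) == {1, 2, 3}}

def group_medians(clicks: int) -> list:
    """Median RT of the given click group, one value per markable."""
    return [median(groups[clicks]) for groups in complete.values()]

one, two, three = group_medians(1), group_medians(2), group_medians(3)

# Paired comparisons between click groups over the same markables.
print("1 vs 2 clicks:", ttest_rel(one, two))
print("2 vs 3 clicks:", ttest_rel(two, three))
</pre>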
===5 Discussion===
By understanding the way users interact with a system, each task response time can be predicted. In the case of the Phrase Detectives game we can use a prediction of what the user should do for a given size of input to process, task difficulty and data entry mode. The same could be applied to any task-driven system such as search, where the system returns a set of results from a query of known complexity with a set of actionable areas that allow a response to be predicted even when the user is unknown.

When the system is able to predict a response time for a given input, task and interface combination, user performance can be measured, with users that perform as predicted being used as a pseudo-gold standard so the system can learn from new data. Outlier data can be filtered: a response that is too fast may indicate the user is clicking randomly or that it is an automated or spam response; a response that is too slow may indicate the user is distracted, fatigued or does not understand the task, and therefore the quality of their judgement is likely to be poor.

The significant results uncovered by the analysis of the Phrase Detectives data should be treated with some caution. Independent analysis of each processing stage is not entirely possible from log data because users are capable of performing each stage simultaneously, i.e., by making decisions and following the text with the mouse cursor whilst reading the text. A more precise model could be achieved with eye-tracking and GOMS (Goals, Operators, Methods, and Selection rules) modelling [CNM83], using a test group to establish baselines for comparison to the log data, or by using implicit user feedback from more detailed logs [ABD06]. Without using more precise measures of response time this method is most usefully employed as a way to detect and filter spam and very poor responses, rather than as a way to evaluate and predict user performance.

Modelling the system and measuring user performance allows designers to benchmark proposed changes to see if they have the desired effect, either an improvement in user performance or a negligible detriment when, for example, monetising an interface by adding more advertising. Sensory and motor actions in the system can be improved by changes to the interface; for example, in the case of search results, ensuring the results list page contains enough data so the user is likely to find their target but not so much that it slows the user down with input processing. Even simple changes such as increasing the contrast or size of the text might allow faster processing of the input text and hence improve user performance. Decision making can be improved through user training, either explicitly with instructions and training examples or implicitly by following interface design conventions so the user is pre-trained in how the system will work.

Predicting a user response is an imprecise science and other human factors should be considered as potentially overriding factors in any analysis. A user's expectations of how an interface should operate, combined with factors beyond measurement, may negate careful design efforts.

===6 Conclusion===
Our investigation has shown that all three stages of user interaction within task-based data collection systems (processing the input; making a decision; and taking action) have a significant effect on the response time of users, and this has an impact on how interface design elements should be applied. Using response time to evaluate users from log data may only be accurate enough to filter outliers rather than predict performance; however, this is the subject of future research.

====Acknowledgements====
The authors would like to thank the reviewers and Dr Udo Kruschwitz for their comments and suggestions. The creation of the original game was funded by EPSRC project AnaWiki, EP/F00575X/1.
===References===
* [ABD06] Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 19–26, New York, NY, USA, 2006. ACM.
* [CNM83] Stuart K. Card, Allen Newell, and Thomas P. Moran. The Psychology of Human-Computer Interaction. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1983.
* [CPKs08] Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. Phrase Detectives: A web-based collaborative annotation game. In Proceedings of the International Conference on Semantic Systems (I-Semantics'08), 2008.
* [CSR09] Lars Chittka, Peter Skorupski, and Nigel E. Raine. Speed–accuracy tradeoffs in animal decision making. Trends in Ecology & Evolution, 24(7):400–407, 2009.
* [HMU08] Hauke R. Heekeren, Sean Marrett, and Leslie G. Ungerleider. The neural systems that mediate human perceptual decision making. Nature Reviews Neuroscience, 9(6):467–479, June 2008.
* [KBM06] Leslie M. Kay, Jennifer Beshel, and Claire Martin. When good enough is best. Neuron, 51(3):277–278, 2006.
* [KCS08] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, pages 453–456, New York, NY, USA, 2008. ACM.
* [KHH12] Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '12, pages 467–474, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.
* [MRZ05] Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The Peer-Prediction method. Management Science, 51(9):1359–1373, September 2005.
* [MTO12] Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. Learning to predict response times for online query scheduling. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 621–630, New York, NY, USA, 2012. ACM.
* [PCK+13] Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo, and Luca Ducceschi. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems, 2013.
* [RC10] Filip Radlinski and Nick Craswell. Comparing the sensitivity of information retrieval metrics. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 667–674. ACM, 2010.
* [RYZ+10] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, August 2010.
* [Ste69] Saul Sternberg. The discovery of processing stages: Extensions of Donders' method. Acta Psychologica, 30:276–315, 1969.
* [vAMM+08] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.
* [WRfW+09] Jacob Whitehill, Paul Ruvolo, Ting-fan Wu, Jacob Bergsma, and Javier Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2035–2043, December 2009.