<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>User Performance Indicators In Task-Based Data Collection Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jon Chamberlain</string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cliff O'Reilly</string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <aff id="aff0">
          <institution>School of Computer Science and Electronic Engineering, University of Essex</institution>
          <addr-line>Wivenhoe Park, CO4 3SQ England</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and discusses its application to interface design.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>When attempting to analyse and improve a system
interface it is often the performance of system users that
measures the success of different iterations of design.
The metric of performance depends on the context of
the task and what the system owners consider the most
important outputs; for example, one system may desire
high-quality output from users, whereas another might
want fast output from users [RC10].</p>
      <p>When quality is the performance measure it is
essential to have a trusted gold standard with which
to judge the user's responses. A common problem
for natural language processing applications, such as
co-reference resolution, is that there are not sufficient
resources available and creating them is both
time-consuming and costly [PCK+13].</p>
      <p>Using user response time as a performance
indicator presents a different set of problems and it may</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The analysis of timed decision making has been a key
experimental model in Cognitive Psychology.
Studies in Reaction (or Response) Time (RT) show that
the human interaction with a system can be divided
into discrete stages: incoming stimulus; mental
response; and behavioural response [Ste69]. Although
traditional psychological theories follow this model of
progression from perception to action, recent studies
are moving more towards models of increasing
complexity [HMU08].</p>
      <p>For our investigation we distinguish between 3
stages of processing required from the user to elicit
an output response from input stimuli (see also Figure 1):</p>
      <list list-type="order">
        <list-item><p>input processing (sensory processing), where the user reads the text and comprehends it;</p></list-item>
        <list-item><p>decision making (cognitive processing), where the user makes a choice about how to complete the task;</p></list-item>
        <list-item><p>taking action (motor response), to enter the response into the system interface (typically using a keyboard or mouse).</p></list-item>
      </list>
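      <p>Read additively, the model says that the RT a system logs is the sum of the three stage costs. A minimal sketch of that reading (the class, field names and example timings below are ours, for illustration only, and do not come from the RT literature):</p>
      <preformat>
# Sketch: an observed RT decomposed into the three processing stages.
from dataclasses import dataclass

@dataclass
class StagedResponse:
    t_input: float     # sensory processing: reading and comprehending the stimulus
    t_decision: float  # cognitive processing: choosing how to complete the task
    t_action: float    # motor response: clicks or keystrokes entering the answer

    def total_rt(self) -> float:
        return self.t_input + self.t_decision + self.t_action

# e.g. a 4s read, a 2s decision and a 1s click appear in the log as one 7s response
print(StagedResponse(4.0, 2.0, 1.0).total_rt())  # 7.0
      </preformat>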
      <p>This model demonstrates how a user responds to a
task and can be seen in many examples of user
interaction in task-based data collection systems. In
crowdsourcing systems, such as Amazon's Mechanical Turk
(http://www.mturk.com) or data collection games, a user is given an input
(typically a section of text or an image) and asked to
complete a task using that input, such as to identify a
linguistic feature in the text or to categorise objects
in an image [KCS08]. The model can also be seen in
security applications such as reCAPTCHA, where the
response of the user proves they are human and not
an automated machine [vAMM+08]. As a final
example, the model can be seen in users' responses to a
search results page, with the list of results being the
input and the click to the target document being the
response [MTO12].</p>
      <p>The relationship between accuracy in completing a
task and the time taken is known as the Speed
Accuracy Trade-off. Evidence from studies in
ecological decision-making shows clear indications that
difficult tasks can be guessed where the costs of error
are low, resulting in lower accuracy but faster
completion time [CSR09, KBM06]. Whilst studies
using RT as a measure of performance are common, it
has yet to be incorporated into more sophisticated
models predicting data quality from user behaviour
[RYZ+10, WRfW+09, KHH12, MRZ05].</p>
    </sec>
    <sec id="sec-3">
      <title>Data Collection</title>
      <p>Phrase Detectives is a game-with-a-purpose designed
to collect data on anaphoric co-reference in English
documents [CPKs08, PCK+13]. (Anaphoric co-reference
is a type of linguistic reference where one expression
depends on another referential element; an example
would be the relation between the entity `Jon' and the
pronoun `his' in the text `Jon rode his bike to school'.)</p>
      <p>The game uses 2 modes for players to complete a
linguistic task. Initially text is presented in
Annotation Mode (called Name the Culprit in the game; see
Figure 2) where the player makes an annotation
decision about a highlighted markable (section of text).
If different players enter different interpretations for a
markable then each interpretation is presented to more
players in Validation Mode (called Detectives
Conference in the game; see Figure 3). The players in
Validation Mode have to agree or disagree with the
interpretation.</p>
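      <p>A minimal sketch of this annotate-then-validate routing (a Python illustration; the function and the interpretation labels are ours and are not the game's actual PHP code):</p>
      <preformat>
# Sketch: a markable is re-routed to Validation Mode whenever players
# have submitted conflicting interpretations for it.
def route_markable(interpretations):
    """interpretations: list of annotation decisions for one markable."""
    distinct = set(interpretations)
    if len(distinct) &lt;= 1:
        return []  # players agree: no validation tasks needed
    # each competing interpretation is shown to more players to agree/disagree
    return [{"mode": "validation", "interpretation": i} for i in sorted(distinct)]

print(route_markable(["DN", "DN"]))          # [] -- consensus
print(route_markable(["DN", "DO:ante=12"]))  # two validation tasks
      </preformat>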
      <p>The game was released as 2 interfaces: in 2008 as
an independent website system (PD,
http://www.phrasedetectives.com) and in 2011 as an
embedded game within the social network Facebook
(PDFB, https://apps.facebook.com/phrasedetectives).
Both versions of the Phrase Detectives
game were built primarily in PHP, HTML, CSS and
JavaScript, employ the same overall game architecture
and run simultaneously on the same corpus of
documents.</p>
      <p>One of the differences between Phrase Detectives
and other data collection games is that it uses
pre-processing to offer the players a restricted choice of
options. In Annotation Mode the text has embedded
code that shows all selectable markables; in Validation
Mode the player is offered a binary choice of
agreeing or disagreeing with an interpretation. This makes
the interface more game-like and allows the data to
be analysed in a more straightforward way, as all
responses are clicks rather than keyboard typing. In this
sense it makes the findings more comparable to search
result tasks than reCAPTCHA typing tasks.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>Total responses collected from each interface.</p></caption>
        <table>
          <thead>
            <tr><th/><th>PD</th><th>PDFB</th></tr>
          </thead>
          <tbody>
            <tr><td>Total Annotations</td><td>1,096,575</td><td>520,434</td></tr>
            <tr><td>Total Validations (Agree)</td><td>123,224</td><td>115,280</td></tr>
            <tr><td>Total Validations (Disagree)</td><td>278,896</td><td>199,197</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>Analysis</title>
      <p>In order to investigate the human data processing in
the Phrase Detectives game the RT was analysed in
different ways. All data analysed in this paper is from
the first 2 years of data collection from each interface
and does not include data from markables that are
flagged as ignored (system administrators manually
correct pre-processing errors by tagging redundant
markables to be ignored). Responses of 0 seconds were
not included because they were more likely to indicate a
problem with the system than a genuine sub-0.5-second
response. Responses over 512 seconds (8:32 minutes)
were also not included; the upper time limit is set at 512
seconds because the data is part of a larger investigation
that used RT grouped by a power function, and it is
assumed no task would take longer than this. These
outliers represent no more than 0.5% of the total
responses.</p>
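      <p>As a sketch, the exclusion rule is a one-line filter over the logged RTs (variable names and the toy values are ours):</p>
      <preformat>
# Keep only responses with 0 &lt; rt &lt;= 512 seconds: 0s rows look like
# logging faults rather than sub-0.5s responses, and the 512s cap comes
# from the power-function RT grouping used in the wider investigation.
MAX_RT = 512  # seconds (8:32)

def keep_response(rt_seconds: float) -> bool:
    return 0 &lt; rt_seconds &lt;= MAX_RT

responses = [0, 1, 3, 600, 7]  # toy RTs in seconds
clean = [rt for rt in responses if keep_response(rt)]
print(clean)  # [1, 3, 7]
      </preformat>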
      <p>An overview of the total responses from each
interface shows the PDFB interface had proportionately
fewer annotations to validations than the PD interface,
indicating that the players in the latter disagreed with
each other more (see Table 1). A random sample of
50,000 responses per response type (annotation,
agreeing validation, and disagreeing validation) shows that
users respond differently between the 2 interfaces (see
Table 2). The data was also plotted as a proportional
frequency of RT, with a focus on the first 15 seconds
(see Figure 4).</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption><p>RT for a random sample of 50,000 responses per response type from each interface.</p></caption>
        <table>
          <thead>
            <tr><th/><th>PD</th><th>PDFB</th></tr>
          </thead>
          <tbody>
            <tr><td>Annotation RT (min)</td><td>1.0s</td><td>2.0s</td></tr>
            <tr><td>Annotation RT (med)</td><td>3.0s</td><td>6.0s</td></tr>
            <tr><td>Annotation RT (mean)</td><td>7.2s</td><td>10.2s</td></tr>
            <tr><td>Validation (Agr) RT (min)</td><td>1.0s</td><td>1.0s</td></tr>
            <tr><td>Validation (Agr) RT (med)</td><td>5.0s</td><td>6.0s</td></tr>
            <tr><td>Validation (Agr) RT (mean)</td><td>10.0s</td><td>10.5s</td></tr>
            <tr><td>Validation (Dis) RT (min)</td><td>1.0s</td><td>2.0s</td></tr>
            <tr><td>Validation (Dis) RT (med)</td><td>3.0s</td><td>6.0s</td></tr>
            <tr><td>Validation (Dis) RT (mean)</td><td>8.4s</td><td>9.9s</td></tr>
          </tbody>
        </table>
      </table-wrap>
        <p>There is a significant difference in the RT between
interfaces (p&lt;0.05, unpaired t-test). This may
indicate a higher level of cheating and spam in PD;
however, PDFB may be slower because it had to load the
Facebook wrapper in addition to the interface. This is
supported by the minimum RT for PDFB being 2.0s in
Annotation and Validation (Disagree) Modes, where
it could be assumed that this is the system's
maximum speed. The 2 interfaces differ in the proportion
of responses of 2 seconds or less (almost a third of all
responses in PD but a negligible amount in PDFB).
One of the motivations for this research is to
understand the threshold where responses can be excluded
based on predicted RT rather than comparison to a
gold standard.</p>
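        <p>The interface comparison above can be reproduced with a standard two-sample test; a sketch using SciPy, with synthetic lognormal RTs standing in for the logged samples (the distribution choice and its parameters are ours):</p>
        <preformat>
# Unpaired t-test (Welch's variant) between PD and PDFB annotation RTs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rt_pd = rng.lognormal(mean=1.1, sigma=0.8, size=50_000)    # median ~3s
rt_pdfb = rng.lognormal(mean=1.8, sigma=0.7, size=50_000)  # median ~6s

t, p = stats.ttest_ind(rt_pd, rt_pdfb, equal_var=False)
print(f"t={t:.1f}, p={p:.3g}")  # p &lt; 0.05: a significant difference
        </preformat>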
        <p>The RT for validations was slower than for
annotations in the PD interface. This is counter-intuitive,
as Annotation Mode has more options for the user to
choose from and requires a more complex motor
response. One of the assumptions in the original game
design was that a Validation Mode would be faster
than an Annotation Mode and would make data
collection more efficient.</p>
        <p>The data was further analysed to investigate the 3
stages of user processing. Different data models were
used to isolate the effect of the stage in question and
negate the influence of the 2 other stages.</p>
        <sec id="sec-4-1-1">
          <title>Input processing</title>
          <p>A random sample of 100,000 validation (agree and
disagree) responses was taken from the PDFB corpus.
The RT and character distance at the start of the
markable were tested for a linear correlation, the
hypothesis being that more input data (i.e., a long text)
will require more time for the player to read and
comprehend. Validation Mode was used because it always
displays the same number of choices to the player (i.e., 2)
no matter what the length of the text, so the
action and decision making stages should be constant
and any difference observed in RT would be due to
input processing.</p>
          <p>There was a significant correlation between RT and
the amount of text displayed on the screen (p&lt;0.05,
Pearson's Correlation), which supports the hypothesis
that processing a larger input takes longer.</p>
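          <p>A sketch of the correlation test, with synthetic data standing in for the 100,000 sampled responses (the character offset of the markable's start serves as the proxy for how much text is displayed; the toy relationship below is ours):</p>
          <preformat>
# Pearson correlation between RT and the amount of text before the markable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
char_offset = rng.integers(0, 2_000, size=100_000)  # markable start position
rt = 3.0 + 0.002 * char_offset + rng.exponential(2.0, size=100_000)  # toy RTs

r, p = stats.pearsonr(char_offset, rt)
print(f"r={r:.2f}, p={p:.3g}")  # positive r: more text, longer RT
          </preformat>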
        </sec>
        <sec id="sec-4-1-2">
          <title>Decision making</title>
          <p>The decision making stage was investigated using an
analysis of 5 documents in the PD corpus that had
a double gold standard (i.e., had been marked by 2
language experts), excluding markables that were
ambiguous (i.e., the 2 experts did not agree on the best
answer) or where there was complete consensus. The
comparison of paired responses of individual markables
minimises the effect of processing time and the action
time is assumed to be evenly distributed.</p>
          <p>The analysis shows that an incorrect response takes
longer, significantly so in the case of making an
annotation or agreeing with an annotation (p&lt;0.05, paired
t-test; see Table 3). Given that this dataset is from PD,
where there are a high number of fast spam responses,
it is feasible that the true incorrect RT is higher.
Taking longer to make an incorrect response is indicative
of a user who does not have a good understanding of
the task, or of a task that is more difficult than usual.</p>
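          <p>A sketch of the paired comparison (per-markable RTs for correct and incorrect responses are paired before testing; the numbers below are toy values, not the study's data):</p>
          <preformat>
# Paired t-test: for each unambiguous gold-standard markable, compare the
# RT of correct responses against the RT of incorrect ones.
from scipy import stats

# one (correct, incorrect) RT pair per markable, in seconds (toy values)
correct_rt = [6.1, 7.4, 5.8, 9.0, 6.6]
incorrect_rt = [8.3, 9.9, 7.2, 11.4, 8.1]

t, p = stats.ttest_rel(correct_rt, incorrect_rt)
print(f"t={t:.2f}, p={p:.3g}")  # incorrect responses take longer
          </preformat>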
          <p>Mean RT is slower than in the general dataset (Table 2).
One explanation is that the gold standard was
created from some of the first documents to be completed,
and the user base at that time would mostly have been
interested early adopters, beta testers and colleagues
of the developers rather than the more general crowd
that developed over time, including spammers making
fast responses.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>Taking action</title>
          <p>A random sample of 100,000 markables and
associated annotations was taken from completed documents
from both interfaces where the markable starting
character was greater than 1,000 characters. Annotations
were grouped on the minimum number of clicks that
would be required to make the response (any
markables that had no responses in any group were
excluded). Thus the effect of input processing speed was
minimised in selected markables and decision making
time is assumed to be evenly distributed. The groups
were as follows (a sketch of the grouping appears after
the list):</p>
          <list list-type="bullet">
            <list-item><p>1-click response, including Discourse-New (DN) and Non-Referring (NR);</p></list-item>
            <list-item><p>2-click response, including Discourse-Old (DO) where 1 antecedent was chosen;</p></list-item>
            <list-item><p>3-click response, including DO where 2 antecedents were chosen and Property (PR) where 1 antecedent was chosen.</p></list-item>
          </list>
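          <p>A sketch of the grouping (the type-to-clicks mapping follows the list above; the function and field names are ours):</p>
          <preformat>
# Group annotations by the minimum number of clicks needed to enter them,
# then compare RT per group.
from collections import defaultdict
from statistics import mean

def min_clicks(ann_type: str, n_antecedents: int = 0) -> int:
    if ann_type in ("DN", "NR"):  # Discourse-New / Non-Referring
        return 1
    if ann_type == "DO" and n_antecedents == 1:  # Discourse-Old, 1 antecedent
        return 2
    return 3  # DO with 2 antecedents, or Property (PR) with 1 antecedent

responses = [("DN", 0, 3.1), ("DO", 1, 5.2), ("PR", 1, 8.0), ("NR", 0, 2.8)]
by_clicks = defaultdict(list)
for ann_type, n_ante, rt in responses:
    by_clicks[min_clicks(ann_type, n_ante)].append(rt)

for clicks in sorted(by_clicks):
    print(clicks, round(mean(by_clicks[clicks]), 1))  # mean RT per click group
          </preformat>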
          <p>There is a significant difference between each group
(p&lt;0.01, paired t-test), implying that the motor
response per click takes between 2 to 4 seconds, although
for some tasks it is clearly faster, as can be seen in the
minimum RT. This makes the filtering of responses
below a threshold RT important, as in some cases the
user would not have enough time to process the
input, make a decision and take action. This will be
dependent on how difficult the task is to respond to.</p>
          <p>Here the actions require the user to click on a link or
button, but this methodology can be extended to cover
different styles of input, for example freetext entry.
Freetext is a more complicated response because the
same decision can be expressed in different ways, and
automatic text processing and normalisation would be
required. However, when a complex answer might be
advantageous, it is useful to have an unrestricted way of
collecting data that allows novel answers to be recorded.
To this end the Phrase Detectives game allowed
freetext comments to be added to markables.</p>
        </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>By understanding the way users interact with a system,
each task's response time can be predicted. In the case
of the Phrase Detectives game we can use a prediction
of what the user should do for a given size of input to
process, task difficulty and data entry mode. The same
could be applied to any task-driven system such as
search, where the system returns a set of results from
a query of known complexity with a set of actionable
areas, allowing a response to be predicted even when
the user is unknown.</p>
      <p>When the system is able to predict a response time
for a given input, task and interface combination, user
performance can be measured, with users who perform
as predicted being used as a pseudo-gold standard so
the system can learn from new data. Outlier data can
be filtered: a response that is too fast may indicate the
user is clicking randomly or that it is an automated
or spam response; a response that is too slow may
indicate the user is distracted, fatigued or does not
understand the task, and therefore the quality of their
judgement is likely to be poor.</p>
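      <p>A sketch of such a filter, under an assumed linear RT model (every coefficient and the tolerance band below are illustrative placeholders, not values fitted to the Phrase Detectives data):</p>
      <preformat>
# Flag responses whose RT falls far outside the time predicted from the
# input size and the clicks required (all coefficients are illustrative).
def predicted_rt(chars_displayed: int, n_clicks: int) -> float:
    read = 0.002 * chars_displayed  # input processing
    decide = 2.0                    # decision making
    act = 3.0 * n_clicks            # motor response per click
    return read + decide + act

def judge(rt: float, chars: int, clicks: int, band: float = 3.0) -> str:
    expected = predicted_rt(chars, clicks)
    if rt * band &lt; expected:
        return "too fast: possible spam or random clicking"
    if rt > expected * band:
        return "too slow: possible distraction, fatigue or confusion"
    return "ok"

print(judge(rt=1.0, chars=1500, clicks=2))  # too fast
print(judge(rt=9.0, chars=1500, clicks=2))  # ok
      </preformat>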
      <p>The significant results uncovered by the analysis
of the Phrase Detectives data should be treated with
some caution. Independent analysis of each
processing stage is not entirely possible for log data, because
users are capable of performing the stages
simultaneously, e.g., by making decisions and following the
text with the mouse cursor whilst reading the text.
A more precise model could be achieved with
eye-tracking and GOMS (Goals, Operators, Methods, and
Selection) rule modelling [CNM83], using a test group
to establish baselines for comparison to the log data,
or by using implicit user feedback from more detailed
logs [ABD06]. Without more precise measures of
response time this method is most usefully employed
as a way to detect and filter spam and very poor
responses, rather than as a way to evaluate and predict
user performance.</p>
      <p>Modelling the system and measuring user
performance allows designers to benchmark proposed
changes to see if they have the desired effect, either
an improvement in user performance or a negligible
detriment when, for example, monetising an interface
by adding more advertising. Sensory and motor
actions in the system can be improved by changes to the
interface; for example, in the case of search results,
ensuring the results page contains enough data that the
user is likely to find their target but not so much that
it slows the user down with input processing. Even
simple changes such as increasing the contrast or size
of the text might allow faster processing of the input
text and hence improve user performance. Decision
making can be improved through user training, either
explicitly with instructions and training examples or
implicitly by following interface design conventions so
the user is pre-trained in how the system will work.</p>
      <p>Predicting a user response is an imprecise science
and other human factors should be considered as
potentially overriding factors in any analysis. A user's
expectations of how an interface should operate,
combined with factors beyond measurement, may negate
careful design efforts.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Our investigation has shown that all three stages of
user interaction within task-based data collection
systems (processing the input, making a decision and
taking action) have a significant effect on the response
time of users, and this has an impact on how
interface design elements should be applied. Using response
time to evaluate users from log data may only be
accurate enough to filter outliers rather than predict
performance; however, this is the subject of future research.</p>
      <sec id="sec-6-1">
        <title>Acknowledgements</title>
        <p>The authors would like to thank the reviewers and Dr
Udo Kruschwitz for their comments and suggestions.
The creation of the original game was funded by
EPSRC project AnaWiki, EP/F00575X/1.
</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [CSR09]
          <string-name>
            <given-names>Nolan</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul</given-names>
            <surname>Resnick</surname>
          </string-name>
          , and Richard Zeckhauser.
          <article-title>Eliciting informative feedback: The Peer-Prediction method</article-title>
          .
          <source>Management Science</source>
          ,
          <volume>51</volume>
          (
          <issue>9</issue>
          ):
          <volume>1359</volume>
          {
          <fpage>1373</fpage>
          ,
          <year>September 2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>