        User Performance Indicators In Task-Based Data
                     Collection Systems

                                      Jon Chamberlain, Cliff O’Reilly
                          School of Computer Science and Electronic Engineering
                                           University of Essex
                                    Wivenhoe Park, CO4 3SQ England
                                     {jchamb,coreila}@essex.ac.uk




                       Abstract

When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and discusses its application for interface design.

Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap'14 Workshop, Berlin, Germany, 4-March-2014, published at http://ceur-ws.org

1    Introduction

When attempting to analyse and improve a system interface it is often the performance of system users that measures the success of different iterations of design. The metric of performance depends on the context of the task and on what the system owners consider to be the most important outputs; for example, one system may desire high quality output from users, whereas another might want fast output from users [RC10].

When quality is the performance measure it is essential to have a trusted gold standard with which to judge the user's responses. A common problem for natural language processing applications, such as co-reference resolution, is that there are not sufficient resources available and creating them is both time-consuming and costly [PCK+13].

Using user response time as a performance indicator presents a different set of problems, and it may not necessarily be assumed that speed correlates to quality. A fast response may indicate a highly trained user responding to a simple task and, conversely, a slow response might indicate a difficult task that requires more thought.

It is therefore important to understand what is happening to the user during the response and whether there is anything that can be done to the system to improve performance.

This paper investigates the importance of sensory and cognitive stages in human data processing, using data collected from Phrase Detectives, a text-based game for collecting language data, and attempts to isolate the effect of each stage. Furthermore we discuss the implications and its application for interface design.

2    Related Work

The analysis of timed decision making has been a key experimental model in Cognitive Psychology. Studies in Reaction (or Response) Time (RT) show that the human interaction with a system can be divided into discrete stages: incoming stimulus; mental response; and behavioural response [Ste69]. Although traditional psychological theories follow this model of progression from perception to action, recent studies are moving more towards models of increasing complexity [HMU08].

[Figure 1: Stages of processing in human cognition.]

For our investigation we distinguish between 3 stages of processing required from the user to elicit an output response from input stimuli (see also Figure 1):
1. input processing (sensory processing), where the user reads the text and comprehends it;

2. decision making (cognitive processing), where the user makes a choice about how to complete the task;

3. taking action (motor response) to enter the response into the system interface (typically using a keyboard or mouse).
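To make this decomposition concrete, the sketch below treats an expected RT as the sum of a reading stage proportional to input length, a fixed decision stage, and a per-click motor stage. This is our illustration only: the stage parameters are hypothetical placeholders, not values measured in this paper.

    from dataclasses import dataclass

    @dataclass
    class ThreeStageModel:
        # Hypothetical stage parameters, for illustration only.
        read_rate: float = 0.05     # seconds per character (input processing)
        decision_time: float = 3.0  # seconds (decision making)
        click_time: float = 1.0     # seconds per click (taking action)

        def expected_rt(self, n_chars: int, n_clicks: int) -> float:
            # RT = input processing + decision making + motor response
            return (self.read_rate * n_chars
                    + self.decision_time
                    + self.click_time * n_clicks)

    model = ThreeStageModel()
    print(model.expected_rt(n_chars=200, n_clicks=2))  # 15.0 seconds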
This model demonstrates how a user responds to a task and can be seen in many examples of user interaction in task-based data collection systems. In crowdsourcing systems, such as Amazon's Mechanical Turk [1] or data collection games, a user is given an input (typically a section of text or an image) and asked to complete a task using that input, such as to identify a linguistic feature in the text or to categorise objects in an image [KCS08]. The model can also be seen in security applications such as reCAPTCHA, where the response of the user proves they are human and not an automated machine [vAMM+08]. As a final example, the model can be seen in users' responses to a search results page, with the list of results being the input and the click to the target document being the response [MTO12].

The relationship between accuracy in completing a task and the time taken is known as the Speed-Accuracy Trade-off. Evidence from studies in ecological decision-making shows clear indications that difficult tasks can be guessed where the costs of error are low. This results in lower accuracy but faster completion time [CSR09, KBM06]. Whilst studies using RT as a measure of performance are common, it has yet to be incorporated into more sophisticated models predicting data quality from user behaviour [RYZ+10, WRfW+09, KHH12, MRZ05].

[1] http://www.mturk.com

3    Data Collection

Phrase Detectives is a game-with-a-purpose designed to collect data on anaphoric co-reference [2] in English documents [CPKs08, PCK+13].

[2] Anaphoric co-reference is a type of linguistic reference where one expression depends on another referential element. An example would be the relation between the entity 'Jon' and the pronoun 'his' in the text 'Jon rode his bike to school'.

The game uses 2 modes for players to complete a linguistic task. Initially text is presented in Annotation Mode (called Name the Culprit in the game - see Figure 2) where the player makes an annotation decision about a highlighted markable (section of text). If different players enter different interpretations for a markable then each interpretation is presented to more players in Validation Mode (called Detectives Conference in the game - see Figure 3). The players in Validation Mode have to agree or disagree with the interpretation.

[Figure 2: A task presented in Annotation Mode.]

[Figure 3: A task presented in Validation Mode.]

The game was released as 2 interfaces: in 2008 as an independent website system (PD) [3] and in 2011 as an embedded game within the social network Facebook (PDFB) [4]. Both versions of the Phrase Detectives game were built primarily in PHP, HTML, CSS and JavaScript, employ the same overall game architecture and run simultaneously on the same corpus of documents.

[3] http://www.phrasedetectives.com
[4] https://apps.facebook.com/phrasedetectives
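The routing between the two modes described above can be sketched as follows. This is a simplified illustration, not the game's actual PHP implementation; in particular, the single-interpretation consensus rule shown here is our assumption.

    from collections import Counter

    def route_markable(interpretations):
        """Decide what happens to a markable after Annotation Mode.

        interpretations: annotation decisions collected from players.
        If players disagree, each distinct interpretation is queued
        for agree/disagree judgements in Validation Mode.
        """
        distinct = Counter(interpretations)
        if len(distinct) <= 1:
            return []  # players agreed: no validation needed
        return [(interpretation, "Validation Mode")
                for interpretation in distinct]

    # Two players chose one antecedent, a third chose another:
    print(route_markable(["his -> Jon", "his -> Jon", "his -> bike"]))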
One of the differences between Phrase Detectives and other data collection games is that it uses pre-processing to offer the players a restricted choice of options. In Annotation Mode the text has embedded code that shows all selectable markables; in Validation Mode the player is offered a binary choice of agreeing or disagreeing with an interpretation. This makes the interface more game-like and allows the data to be analysed in a more straightforward way, as all responses are clicks rather than keyboard typing. In this sense it makes the findings more comparable to search result tasks than reCAPTCHA typing tasks.

[Figure 4: Proportional frequency of RT in the 2 modes of the 2 interfaces of Phrase Detectives.]

Table 1: Total responses for the 2 modes in the 2 interfaces of Phrase Detectives.

                                       PD      PDFB
  Total Annotations             1,096,575   520,434
  Total Validations (Agree)       123,224   115,280
  Total Validations (Disagree)    278,896   199,197

Table 2: Minimum, median and mean RT from a random sample of 50,000 responses of each response type from PD and PDFB.

                                  PD     PDFB
  Annotation RT (min)           1.0s     2.0s
  Annotation RT (med)           3.0s     6.0s
  Annotation RT (mean)          7.2s    10.2s
  Validation (Agr) RT (min)     1.0s     1.0s
  Validation (Agr) RT (med)     5.0s     6.0s
  Validation (Agr) RT (mean)   10.0s    10.5s
  Validation (Dis) RT (min)     1.0s     2.0s
  Validation (Dis) RT (med)     3.0s     6.0s
  Validation (Dis) RT (mean)    8.4s     9.9s

4    Analysis

In order to investigate the human data processing in the Phrase Detectives game the RT was analysed in different ways. All data analysed in this paper is from the first 2 years of data collection from each interface and does not include data from markables that are flagged as ignored [5]. Responses of 0 seconds were not included because they were more likely to indicate a problem with the system rather than a sub-0.5 second response. Responses over 512 seconds (8:32 minutes) [6] were also not included; such outliers do not represent more than 0.5% of the total responses.

[5] System administrators manually correct pre-processing errors by tagging redundant markables to be ignored.
[6] The upper time limit is set at 512 seconds because the data is part of a larger investigation that used RT grouped by a power function and it is assumed no task would take longer than this.
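These exclusion rules amount to a simple filter over the logged response times. A minimal sketch (ours; the paper does not publish its analysis code):

    def filter_responses(rts_seconds):
        """Keep responses above 0s and at most 512s, as described above.

        Responses of 0 seconds are treated as system problems and
        responses over 512 seconds as implausibly long for any task.
        """
        return [rt for rt in rts_seconds if 0 < rt <= 512]

    print(filter_responses([0, 1, 3, 7, 512, 600]))  # [1, 3, 7, 512]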
An overview of the total responses from each interface shows the PDFB interface had a proportionately lower ratio of annotations to validations than the PD interface, indicating that the players of PDFB disagreed with each other more (see Table 1). A random sample of 50,000 responses per response type (annotation, agreeing validation, and disagreeing validation) shows that users respond differently between the 2 interfaces (see Table 2). The data was also plotted as a proportional frequency of RT, with a focus on the first 15 seconds (see Figure 4).

There is a significant difference in the RT between interfaces (p<0.05, unpaired t-test). This may indicate a higher level of cheating and spam in PD; however, PDFB may be slower because it had to load the Facebook wrapper in addition to the interface. This is supported by the minimum RT for PDFB being 2.0s in Annotation and Validation (Disagree) Modes, where it could be assumed that this is the fastest response the system allows. The 2 interfaces also differ in the proportion of responses of 2 seconds or less (almost a third of all responses in PD but a negligible amount in PDFB).
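The between-interface comparison is a standard unpaired t-test; a sketch with stand-in data (the actual 50,000-response samples are not published):

    from scipy import stats

    # Stand-in RT samples (seconds); the paper's samples are much larger.
    pd_rts = [1.0, 2.0, 2.0, 3.0, 5.0, 7.0, 12.0]
    pdfb_rts = [2.0, 4.0, 6.0, 6.0, 8.0, 11.0, 15.0]

    # Unpaired (independent samples) t-test, as used in the paper.
    t_stat, p_value = stats.ttest_ind(pd_rts, pdfb_rts)
    print(f"t={t_stat:.2f}, p={p_value:.3f}")
    if p_value < 0.05:
        print("RT differs significantly between the interfaces")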
One of the motivations for this research is to understand the threshold where responses can be excluded based on predicted RT rather than comparison to a gold standard.

The RT for validations was slower than for annotations in the PD interface. This is counter-intuitive, as Annotation Mode has more options for the user to choose from and requires a more complex motor response. One of the assumptions in the original game design was that a Validation Mode would be faster than an Annotation Mode and that it would make data collection more efficient.

The data was further analysed to investigate the 3 stages of user processing. Different data models were used to isolate the effect of the stage in question and negate the influence of the 2 other stages.
4.1   Input processing

A random sample of 100,000 validation (agree and disagree) responses was taken from the PDFB corpus. The RT and character distance at the start of the markable were tested for a linear correlation, the hypothesis being that more input data (i.e., a long text) will require more time for the player to read and comprehend. Validation Mode was used because it always displays the same number of choices to the player no matter what the length of the text (i.e., 2), so the action and decision making stages should be constant and any difference observed in RT would be due to input processing.

There was a significant correlation between RT and the amount of text displayed on the screen (p<0.05, Pearson's Correlation), which supports the hypothesis that processing a larger input takes more time.
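The test pairs each response's RT with the character offset of its markable; a sketch with stand-in values (the sampled data is not published):

    from scipy import stats

    # Stand-in pairs: markable start offset (characters) and RT (seconds).
    char_offsets = [120, 450, 800, 1500, 2200, 3100]
    rts = [3.0, 4.0, 5.0, 6.0, 8.0, 9.0]

    # Pearson's correlation between input length and response time.
    r, p_value = stats.pearsonr(char_offsets, rts)
    if p_value < 0.05:
        print(f"RT correlates with input length (r={r:.2f})")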
4.2   Decision making

The decision making stage was investigated using an analysis of 5 documents in the PD corpus that had a double gold standard (i.e., had been marked by 2 language experts), excluding markables that were ambiguous (i.e., the 2 experts did not agree on the best answer) or where there was complete consensus. The comparison of paired responses to individual markables minimises the effect of processing time, and the action time is assumed to be evenly distributed.
The analysis shows that an incorrect response takes longer, significantly so in the case of making an annotation or agreeing with an annotation (p<0.05, paired t-test) - see Table 3. Given that this dataset is from PD, where there are a high number of fast spam responses, it is feasible that the true incorrect RT is higher. Taking longer to make an incorrect response is indicative of a user who does not have a good understanding of the task or that the task is more difficult than usual.

Table 3: Mean RT for aggregated correct and incorrect responses in the 2 modes from 122 gold standard markable observations (80 in the case of Validation Disagree). * indicates p<0.05.

                           Correct   Incorrect
  Annotation*                10.1s       12.8s
  Validation (Agree)*        13.5s       17.7s
  Validation (Disagree)      14.5s       15.0s

Mean RT is slower than in the general dataset (Table 2). One explanation is that the gold standard was created from some of the first documents to be completed and the user base at that time would mostly have been interested early adopters, beta testers and colleagues of the developers, rather than the more general crowd that developed over time, including spammers making fast responses.
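Because correct and incorrect responses are matched per gold-standard markable, the comparison behind Table 3 is a paired test; a sketch with hypothetical per-markable means:

    from scipy import stats

    # Hypothetical per-markable mean RTs (seconds): each index is one
    # gold-standard markable, compared across correct and incorrect
    # responses to that same markable.
    correct_rts = [8.0, 9.5, 11.0, 10.5, 12.0]
    incorrect_rts = [10.0, 12.5, 13.0, 13.5, 15.0]

    # Pairing by markable removes per-markable difficulty from the test.
    t_stat, p_value = stats.ttest_rel(correct_rts, incorrect_rts)
    if p_value < 0.05:
        print("incorrect responses are significantly slower")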
4.3   Taking action

A random sample of 100,000 markables and associated annotations was taken from completed documents from both interfaces where the markable starting character was greater than 1,000 characters. Annotations were grouped by the minimum number of clicks that would be required to make the response (any markables that had no responses in any group were excluded). Thus the effect of input processing speed was minimised in the selected markables and decision making time is assumed to be evenly distributed.

  • 1 click response, including Discourse-New (DN) and Non-Referring (NR);

  • 2 click response, including Discourse-Old (DO) where 1 antecedent was chosen;

  • 3 click response, including DO where 2 antecedents were chosen and Property (PR) where 1 antecedent was chosen.

Table 4: Minimum, median and maximum RT for clicking actions in Annotation Mode from 6,176 markables (p<0.01).

                          Min     Med      Max
  1 click (DN, NR)       1.0s    5.0s   123.3s
  2 clicks (DO1)         1.0s    9.8s   293.0s
  3 clicks (DO2, PR1)    2.0s   12.0s   509.0s

There is a significant difference between each group (p<0.01, paired t-test), implying that the motor response per click is between 2 and 4 seconds, although for some tasks it is clearly faster, as can be seen in the minimum RT. This makes the filtering of responses below a threshold RT important, as in some cases the user would not have enough time to process the input, make a decision and take action. This will be dependent on how difficult the task is to respond to.

Here the actions require the user to click on a link or button, but this methodology can be extended to cover different styles of input, for example freetext entry. Freetext is a more complicated response because the same decision can be expressed in different ways, and automatic text processing and normalisation would be required. However, when a complex answer might be advantageous, it is useful to have an unrestricted way of collecting data allowing novel answers to be recorded. To this end the Phrase Detectives game allowed freetext comments to be added to markables.
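The grouping behind Table 4 can be sketched as follows; the response-type-to-click mapping follows the bullet list above, while the group labels (DN, NR, DO1, DO2, PR1) are taken from Table 4:

    import statistics
    from collections import defaultdict

    # Minimum clicks per response type, per the grouping above.
    CLICKS_REQUIRED = {"DN": 1, "NR": 1, "DO1": 2, "DO2": 3, "PR1": 3}

    def rt_by_click_group(annotations):
        """annotations: (response_type, rt_seconds) pairs."""
        groups = defaultdict(list)
        for response_type, rt in annotations:
            groups[CLICKS_REQUIRED[response_type]].append(rt)
        return {clicks: (min(rts), statistics.median(rts), max(rts))
                for clicks, rts in sorted(groups.items())}

    sample = [("DN", 1.0), ("NR", 5.0), ("DO1", 9.8), ("DO2", 12.0)]
    print(rt_by_click_group(sample))
    # {1: (1.0, 3.0, 5.0), 2: (9.8, 9.8, 9.8), 3: (12.0, 12.0, 12.0)}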
5    Discussion

By understanding the way users interact with a system, each task response time can be predicted. In the case of the Phrase Detectives game we can use a prediction of what the user should do for a given size of input to process, task difficulty and data entry mode. The same could be applied to any task driven system such as search, where the system returns a set of results from a query of known complexity with a set of actionable areas, allowing a response to be predicted even when the user is unknown.

When the system is able to predict a response time for a given input, task and interface combination, user performance can be measured, with users that perform as predicted being used as a pseudo-gold standard so the system can learn from new data. Outlier data can be filtered: a response that is too fast may indicate the user is clicking randomly or that it is an automated or spam response; a response that is too slow may indicate the user is distracted, fatigued or does not understand the task, and therefore the quality of their judgement is likely to be poor.
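A minimal sketch of the filtering this paragraph proposes, assuming a predicted RT is available for the task; the tolerance band is an arbitrary placeholder, since the paper does not fix thresholds:

    def classify_response(rt, predicted_rt, tolerance=0.5):
        """Flag responses that deviate too far from the predicted RT."""
        if rt < predicted_rt * (1 - tolerance):
            return "too fast: possible random clicking or spam"
        if rt > predicted_rt * (1 + tolerance):
            return "too slow: possible distraction, fatigue or confusion"
        return "as predicted: candidate pseudo-gold standard"

    print(classify_response(rt=2.0, predicted_rt=10.0))
    # too fast: possible random clicking or spam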
The significant results uncovered by the analysis of the Phrase Detectives data should be treated with some caution. Independent analysis of each processing stage is not entirely possible for log data because users are capable of performing stages simultaneously, e.g., by making decisions and following the text with the mouse cursor whilst reading the text. A more precise model could be achieved with eye-tracking and GOMS (Goals, Operators, Methods, and Selection) rule modelling [CNM83], using a test group to establish baselines for comparison to the log data, or by using implicit user feedback from more detailed logs [ABD06]. Without more precise measures of response time this method is most usefully employed as a way to detect and filter spam and very poor responses, rather than as a way to evaluate and predict user performance.

Modelling the system and measuring user performance allows designers to benchmark proposed changes to see if they have the desired effect, either an improvement in user performance or a negligible detriment when, for example, monetising an interface by adding more advertising. Sensory and motor actions in the system can be improved by changes to the interface, for example, in the case of search results, ensuring the results list page contains enough data that the user is likely to find their target but not so much that it slows the user down with input processing. Even simple changes such as increasing the contrast or size of the text might allow faster processing of the input text and hence improve user performance. Decision making can be improved through user training, either explicitly with instructions and training examples or implicitly by following interface design conventions so the user is pre-trained in how the system will work.

Predicting a user response is an imprecise science and other human factors should be considered as potentially overriding factors in any analysis. A user's expectations of how an interface should operate, combined with factors beyond measurement, may negate careful design efforts.

6    Conclusion

Our investigation has shown that all three stages of user interaction within task-based data collection systems (processing the input; making a decision; and taking action) have a significant effect on the response time of users and that this has an impact on how interface design elements should be applied. Using response time to evaluate users from log data may only be accurate enough to filter outliers rather than predict performance; however, this is the subject of future research.

Acknowledgements

The authors would like to thank the reviewers and Dr Udo Kruschwitz for their comments and suggestions. The creation of the original game was funded by EPSRC project AnaWiki, EP/F00575X/1.

References

[ABD06]    Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 19–26, New York, NY, USA, 2006. ACM.
[CNM83]    Stuart K. Card, Allen Newell, and Thomas P. Moran. The Psychology of Human-Computer Interaction. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1983.

[CPKs08]   Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. Phrase Detectives: A web-based collaborative annotation game. In Proceedings of the International Conference on Semantic Systems (I-Semantics'08), 2008.

[CSR09]    Lars Chittka, Peter Skorupski, and Nigel E. Raine. Speed–accuracy tradeoffs in animal decision making. Trends in Ecology & Evolution, 24(7):400–407, 2009.

[HMU08]    Hauke R. Heekeren, Sean Marrett, and Leslie G. Ungerleider. The neural systems that mediate human perceptual decision making. Nature Reviews Neuroscience, 9(6):467–479, June 2008.

[KBM06]    Leslie M. Kay, Jennifer Beshel, and Claire Martin. When good enough is best. Neuron, 51(3):277–278, 2006.

[KCS08]    Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, pages 453–456, New York, NY, USA, 2008. ACM.

[KHH12]    Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '12, pages 467–474, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems.

[MRZ05]    Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The Peer-Prediction method. Management Science, 51(9):1359–1373, September 2005.

[MTO12]    Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. Learning to predict response times for online query scheduling. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 621–630, New York, NY, USA, 2012. ACM.

[PCK+13]   Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo, and Luca Ducceschi. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems, 2013.

[RC10]     Filip Radlinski and Nick Craswell. Comparing the sensitivity of information retrieval metrics. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 667–674. ACM, 2010.

[RYZ+10]   Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, August 2010.

[Ste69]    Saul Sternberg. The discovery of processing stages: Extensions of Donders' method. Acta Psychologica, 30:276–315, 1969.

[vAMM+08]  Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.

[WRfW+09]  Jacob Whitehill, Paul Ruvolo, Ting-fan Wu, Jacob Bergsma, and Javier Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2035–2043, December 2009.