=Paper= {{Paper |id=None |storemode=property |title=Towards Detecting Wikipedia Task Contexts |pdfUrl=https://ceur-ws.org/Vol-909/poster4.pdf |volume=Vol-909 |dblpUrl=https://dblp.org/rec/conf/eurohcir/KnauslEL12 }} ==Towards Detecting Wikipedia Task Contexts== https://ceur-ws.org/Vol-909/poster4.pdf
                   Towards Detecting Wikipedia Task Contexts

                  Hanna Knaeusl                               David Elsweiler                     Bernd Ludwig
            Chair for Information Science              Chair for Information Science       Chair for Information Science
              University Regensburg                      University Regensburg               University Regensburg
                       Germany                                    Germany                             Germany
             hanna.knaeusl@ur.de                         david.elsweiler@ur.de                bernd.ludwig@ur.de


ABSTRACT
Wikipedia is a resource used by many people for many different purposes. We posit that it might be beneficial to alter the content, or the way content is presented, depending on the task context. Here we describe a small pilot lab study to investigate features of interaction that might help to infer the contextual situation surrounding Wikipedia search tasks. We describe our effort to collect data and analyse relationships between the features and the assigned task context.

Categories and Subject Descriptors
H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems

General Terms
Preference Elicitation, Info Seeking Behaviour

Keywords
Eyetracking, Wikipedia

Presented at EuroHCIR2012. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

1. INTRODUCTION
Information portals such as Wikipedia represent rich sources of information covering an incredibly broad range of topics. Many Wikipedia entries are also long and can cover aspects ranging from overviews and introductions to more detailed descriptions of advanced aspects that are perhaps only suitable for topic experts. Single pages can also contain not only text, but images, info-graphics, lists and navigational information. Previous research suggests that these resources will have several different contexts of use. For example, Marchionini [11] identifies three main types of search tasks, all of which are applicable to Wikipedia: lookup tasks include finding answers to specific questions, known-item searches or navigating to specific pages. These tasks are contrasted with exploratory search tasks, which include learn tasks, where the aim is to acquire larger amounts of knowledge and achieve an enhanced understanding of a given topic, and investigate tasks, where the user makes use of found information and continues to contribute to or generate knowledge in some way. Elsweiler et al. [4] provide an additional task dimension, distinguishing between work-oriented tasks, where information is required to complete some job, and casual-leisure tasks, where the aim is more pleasure-focused, e.g. to pass time, to relax, or to be entertained.

Wikipedia contributors are encouraged to create pages in a way that meets the needs of as many users as possible, by including information on a topic with sufficient quantity, quality and completeness and by structuring the content in a way that makes sense generally. Nevertheless, one could imagine that different content, or different presentations of the same content, might be more suitable in specific contexts. For example, lookup tasks may be best supported when facts in an article are presented as a list that can be scanned easily. In such scenarios, content such as images may be less helpful and perhaps even distracting. In contrast, in casual-leisure situations, users may want to focus on multimedia content or have information presented in a way that encourages browsing and information discovery.

We believe examples like this suggest there may be benefit in moving away from static pages, which try to cater for all usage situations, towards dynamic pages that are generated appropriately based on the context of use. As a first step towards exploring this hypothesis, in this paper we investigate how the context of use – the task type being performed – might be detected automatically from user interactions with the system. We want to establish whether the way the user interacts with the system, e.g. his mouse and keyboard interactions, eye movements, and click behaviour, can provide implicit feedback regarding the usage scenario and user goals.

With this aim in mind, we present a small pilot study that allows us to evaluate a methodology for detecting the features of interaction that might help us infer the contextual situation surrounding a user's search task. We collect interaction data in the context of a controlled laboratory study and analyse relationships between the features of interaction and the assigned task context. The data show that, for the small number of users in our study, the behaviour exhibited when completing tasks of different types is very different; users interact with different types of content in different ways. Further, we provide evidence that it is possible, at least for some users, to predict these behaviours based purely on mouse and keyboard interactions.

2. RELATED WORK
In the IR community a large amount of work has been performed to establish whether interaction data can be used as a surrogate for explicit relevance judgements. This is known as implicit relevance feedback. Early research in this area demonstrated a correlation between the time spent reading a
action     label   description
Read       RE      User is reading text
Scan       SC      User scans content, e.g. headlines, lists or the whole page
Examine    EX      User examines an element
Navigate   NV      User navigates

element                label     element               label
Headline               HD        Text passage          TX
List                   LI        Introduction          IN
Picture                PI        Info Box              IB
Charts, tables etc.    IG        Links in Wikipedia    WI
Other navigation       ON

Figure 1: Annotation labels for the user actions during Wikipedia search and for the gazed elements

lookup
action    TX    ON    PI    IN    IB    IG    WI    LI    HD
EX         0     0     0     0     0    42     0     0     0
NV         0     0     0     0     0     0    46     0     0
RE         0     0     0    23     0    23     0    27     0
SC        53     0     0    24    18     0     0    59    12

learn
action    TX    ON    PI    IN    IB    IG    WI    LI    HD
EX         0     0    89     0     0    93     0     0     0
NV         0     2     0     0     0     0    52     0     0
RE      1872     0     0    72     0     0     0    93     0
SC       172     0     6     2     0     0     0    62   285

casual-leisure
action    TX    ON    PI    IN    IB    IG    WI    LI    HD
EX         0     0   137     0     2    85     0     0     0
NV         0    11     0     0     0     0   105     0     0
RE      1876     0     6   274     1     0     0    90    32
SC       177     0     2     8     6     0     0    60   134

Table 1: Absolute frequencies of content elements for actions for the investigated task types

document and explicit relevance judgements [12]. Although this has been disputed in naturalistic situations [10], White and Kelly show that when task type is taken into account, clear signals can be found [16]. Other studies have shown that the amount of scrolling on a Web page [3], click-through for documents in a browser [9], bookmarking behaviour [7] and eye movements during the search [2] can all be used as implicit feedback to improve retrieval performance.

Interaction data can also be used as a means to predict user emotions. For example, Fox et al. show that query-log features can be used to predict searcher satisfaction [6], and Feild et al. [5] used interaction data and physical sensors to predict levels of user frustration with high accuracy.

A third group of studies shows correlations between different styles of interaction, e.g. for some users visual attention on the screen can be predicted via mouse coordinates [15]. We believe that the interaction style, the emotional state of the user and the motivating task context are intrinsically related, and that the work done previously suggests it may be possible to predict the task based on interaction data. We explore this in a small pilot study below.

3. DATA COLLECTION
In this section we provide details of the data collected and explain the motivation behind recording the data.

3.1 Study Design
Data was collected via a laboratory-based user study with 4 users. The participants were information science students (1 male, 3 female) aged between 20 and 30. All of the participants were experienced Wikipedia users and were comfortable using the Wikipedia search facilities. Although this user population is not large or diverse enough to provide generalisable results, it is sufficient for our aims, which were to evaluate and improve the methodology and to get a sense of the feasibility of our ideas.

Each participant performed 6 Wikipedia search tasks (2 of each of the 3 types of interest: lookup, learn and casual-leisure). The tasks were presented in the form of a simulated scenario and were ordered randomly to minimise learning effects. Example tasks for each type are shown in Figure 2.

After initially greeting the participant, the experimental procedure was explained in person. Then, to prevent biases, the participant was led automatically through the experiment on screen, with task descriptions, questionnaires and a web-browser window appearing when appropriate. The experimenters observed the tasks remotely in an adjoining room, where the participant's screen was mirrored.

3.2 Data Collected
We collected a large amount of data from each participant before, during and after the study.

Questionnaires: A pre-study questionnaire collected demographics, search experience, and the participants' experience with Wikipedia. Pre- and post-task questionnaires elicited perceptions of the task and domain knowledge, of success, and of the experience, including emotional aspects. Finally, a post-study questionnaire provided general impressions of the experiment.

Eyetracking Data: We recorded participant gaze patterns using an SMI RED eye-tracker. The associated BeGaze software recorded video files of screen interactions with an additional layer indicating the area of the screen where the user is focusing his gaze. We manually annotated these complete overlaid video sequences with two labels. The first describes what the user is doing ("action"). This is a simple coding scheme but aligns with reading psychology research [14, 13]. The annotator decided which action to code at which moment by following the focus displayed in the layer on top of the recorded screen. The second label describes the content ("element") being focused on and is derived from the elements available in Wikipedia pages. The label was assigned when the user focused on an area of the screen long enough that the annotator could assume the element in that area was perceived. The full set of labels for actions and elements is presented in Figure 1. The intuition behind the labels was that the style of reading and the content elements used will be very different for different task types. By labelling videos in this way we could test this intuition empirically.

Browser Logs: We instrumented the Firefox web browser to log all user interactions during the search process.

Timestamp information was used to align interaction data from the different sensors.
Lookup: Last night you watched a documentary about the sinking of the Titanic. Suddenly you wonder how many passengers were on board when the catastrophe happened. Search in Wikipedia for this information.

Learn: Friends from abroad are visiting Germany and you plan to travel together to visit the small but beautiful city of Regensburg. As preparation for the trip you want to know more about the city and its history. Use Wikipedia to do this.

Casual-leisure: You have a few minutes before your class starts but you are already sitting in the lecture hall. Kill this time by using Wikipedia for the next six minutes to look at whatever topic(s) take your fancy.


Figure 2: Examples of the kinds of tasks assigned to study participants.


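The analysis in the following section divides the aligned interaction streams into 500ms frames, discretises feature values into "low" and "high" around their mean, and compares label distributions with chi-squared tests. The sketch below illustrates that style of analysis; it is our own minimal illustration, not code from the study: the function names and example timestamps are invented, and the chi-squared helper is a generic Pearson statistic rather than a reproduction of the paper's exact test procedure. The example contingency table reuses the Scan-action counts for lookup and learn tasks from Table 1.

```python
# Sketch of a frame-based interaction analysis (illustrative only;
# names and example data are ours, not artefacts of the study).

FRAME_MS = 500  # frame length used for normalising counts

def to_frames(event_times_ms, duration_ms):
    """Count events per 500ms frame; `event_times_ms` are timestamps in ms."""
    n_frames = duration_ms // FRAME_MS + 1
    counts = [0] * n_frames
    for t in event_times_ms:
        counts[t // FRAME_MS] += 1
    return counts

def discretise(values):
    """Binarise each frame value into 'low'/'high' around the mean."""
    mean = sum(values) / len(values)
    return ["high" if v > mean else "low" for v in values]

def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:  # all-zero columns contribute nothing
                stat += (observed - expected) ** 2 / expected
    return stat

# Example: mousemove timestamps (ms) over a 3-second interaction.
frames = to_frames([120, 180, 900, 950, 1020, 2600], 3000)
levels = discretise(frames)

# Element counts for the Scan (SC) action in lookup vs. learn tasks (Table 1,
# columns TX, ON, PI, IN, IB, IG, WI, LI, HD).
lookup_sc = [53, 0, 0, 24, 18, 0, 0, 59, 12]
learn_sc = [172, 0, 6, 2, 0, 0, 0, 62, 285]
print(levels, chi_squared([lookup_sc, learn_sc]))
```

A large statistic for the lookup/learn comparison indicates that the two element distributions differ, which is the kind of pair-wise comparison summarised in Table 2.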
4. EVALUATION OF THE DATA
We analyse the data in two stages. First, in Section 4.1, we examine the distribution of video labels for different types of task, to determine whether users behave differently or focus their attention on different kinds of content when completing different task types. Second, in Section 4.2, we show how these labels can, in turn, be predicted using interaction data from the eyetracker and browser. The first stage provides evidence that the user's preferences for content elements depend on the search task, endorsing our suggestion to customise web pages at run time. The second stage provides some evidence for our hypothesis that the interactions a user performs in a browser may be used to predict which actions he is trying to complete and which content elements he prefers at that moment.

          LO vs RE          LO vs CA          RE vs CA
action    χ²      p-value   χ²     p-value   χ²     p-value
EX         9      0.011      9     0.029     18     0.006
NV         9      0.011      9     0.011     18     0.001
RE        13      0.043      6     0.301     27     0.079
SC        36.563  0.064     45     0.039     45     0.039

Table 2: χ² tests for different distributions of content elements per task type (LO: lookup, RE: learn, CA: casual-leisure)

4.1 Reading Style and Content for Task Types
Technical difficulties meant we were only able to work with data for 6 casual-leisure, 4 lookup and 4 learn tasks. We first divided the data into 500ms frames, allowing us to normalise the counts by task length, and counted the relative frequencies of frames in which label combinations occur for each task type (see Table 1). Visually inspecting the distribution of content for actions suggests that the reading style and the content elements interacted with were very different in different task contexts. This is confirmed by pair-wise comparisons, using chi-squared tests, of the distributions of content elements for each possible pair of task types (see Table 2).

Examining the results in Table 2, we observe that all but one combination of action type shows highly significant differences in the distribution of content elements examined. The exception is the distribution of elements for lookup and casual-leisure tasks, which initially seems counterintuitive, as one would expect these two tasks to be very different. Below we summarise the main similarities and differences between the task types and attempt to explain what these mean in the context of our work.

When completing lookup tasks, the participants do not typically read content, the exception being page introductions. Instead they scan large portions of the page very quickly, looking for the snippets of information that will satisfy their specific information need. They tend to scan a number of different kinds of content elements during tasks. This can be seen from Table 1, with counts being spread over text passages, introductions, info boxes, lists and headers. Images are noticeably missing from lookup tasks. It seems as if the participants decided that, for the tasks assigned, images would not be useful, and were able to avoid them.

Learn and casual-leisure tasks differ from lookup tasks in that they both tend to take longer and involve more interactions. They also both involve reading actions, which were rare for lookup. By this we mean that the user focuses attention on whole passages of text and attends to the text from left to right and line by line. Another similarity between learn and casual-leisure tasks is the way that text passages are consumed, with the counts for these tasks being very similar. There are differences between learn and casual-leisure tasks, particularly in terms of the elements used other than text passages. During learn tasks the focus tended to be on headers, while for casual-leisure the focus was on elements such as introductions and info boxes, which allow the user to gain an overview of what a page is about and to judge whether it is interesting or not. We assume that headers are useful for learn tasks because here there is a concrete information need, i.e. users do not just need to find something that is interesting or not, but need specific informational content. In this sense headers help the user determine whether a paragraph is worth reading or not.

4.2 Predicting Style and Content Preferences
To determine if the manually assigned labels can be predicted from interaction data alone, we calculated statistics for counts of the synchronous occurrences of video labels and input events for the 500ms frames introduced above. As we were searching for the simplest features possible (so they could eventually be computed easily during a browser session at runtime), we used the frequencies of the most common mouse events and the average saccade distance (i.e. eye movement) per frame as features. More precisely, for each frame we discretised these features into two levels, low and high, based on the mean value over all frames.

Table 3 (left) gives an example of the information we computed from the raw log data. In order to understand whether knowledge of the mousemove frequency is relevant for predicting user actions and content elements, we performed a series of χ² tests for all six search tasks for one of the test participants, chosen at random (in total about 30 minutes of interaction). The results are reported in Table 3 (right). With the exception of the rare click events, all features are highly significant. We interpret this as a positive indication that for individual users – depending on their personal interaction style (see [1, 8]) – it is feasible that the reading behaviour label could be predicted during a browsing session.

          mousemove                    mousemove
action    high  low        element    high  low
NV        5     6          IN         30    12
RE        18    5          IB         8     10
SC        41    18         WI         5     6
                           LI         21    1

Task    scroll        click         mousemove     avg.sacc.dist
        act.  el.     act.  el.     act.  el.     act.  el.
1       ***   ***     ***   ***     ***   ***     ***   **
2       ***   ***                         *       ***   ***
3       *     **            *       ***   ***     *     **
4       *     ***     *                   **            ***
5       ***   ***                   ***   ***     *
6       ***   ***     **            ***   ***     **    ***

Table 3: Frequency counts of user actions and mousemove events, and of content elements and mousemove events, occurring simultaneously (left). The table on the right shows the significance results of the χ² tests.

The results of the χ² tests indicate that knowing at run-time whether the observed input events occur below or above average at any point in time increases the accuracy of predicting the video labels annotated for that moment, as the distribution P(action|event = low) differs significantly from the distribution P(action|event = high) for any annotated action and for any annotated element type. This observation opens the way for runtime prediction of the user action and preferred elements. From that information, the system can predict the current task type and use this information for generating content dynamically.

5. CONCLUSIONS
The preliminary data analysis we have presented provides clues that, firstly, reading behaviour and preferences for content elements depend on the surrounding task context and, secondly, both behaviour and preferences may be predicted for individual users based on their interaction style.

There are several limitations to this work. That we only have data from four participants from a relatively homogenous group means we cannot generalise. However, we claim that the presented methodology is well suited to address our long-term research questions outlined in the introduction, and the pilot has provided us with insight into how to improve a full study. In addition to resolving several technical challenges, we have learned that great care will need to be taken when simulating tasks. For example, were few images looked at in lookup tasks simply because of the tasks we chose? We also plan to look at more complicated prediction features and to account for the fact that individual differences in participants (cognitive, reading style [14]) will exist and that users interact in different ways (people who follow eye movements with their mouse, people who don't) [15]. At EuroHCIR, we look forward to engaging with the broader HCI and IR communities to discuss the ideas in this paper; we are particularly eager to receive feedback on the next steps along this research path, including brainstorming solutions to some of the empirical design challenges of running such experiments and identifying and dealing with the many factors which should be incorporated in the full study.

6. REFERENCES
 [1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In Proceedings of SIGIR, pages 3–10, 2006.
 [2] G. Buscher, A. Dengel, and L. Van Elst. Eye movements as implicit relevance feedback. In CHI '08 Extended Abstracts on Human Factors in Computing Systems, pages 2991–2996, 2008.
 [3] M. Claypool, P. Le, M. Wased, and D. Brown. Implicit interest indicators. In Proceedings of IUI, pages 33–40, 2001.
 [4] D. Elsweiler, M. L. Wilson, and B. Kirkegaard Lunn. New Directions in Information Behaviour, chapter Understanding Casual-leisure Information Behaviour. Emerald Publishing, 2011.
 [5] H. Feild, J. Allan, and R. Jones. Predicting searcher frustration. In Proceedings of SIGIR, 2010.
 [6] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White. Evaluating implicit measures to improve web search. ACM Trans. Inform. Syst., 23(2):147–168, 2005.
 [7] Q. Guo and E. Agichtein. Ready to buy or just browsing?: Detecting web searcher goals from interaction data. In Proceedings of SIGIR, pages 130–137, 2010.
 [8] J. Huang, R. White, and G. Buscher. User see, user point: Gaze and cursor alignment in web search. In Proceedings of CHI, pages 1341–1350, New York, NY, USA, 2012. ACM.
 [9] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inform. Syst., 25(2), 2007.
[10] D. Kelly and N. J. Belkin. Reading time, scrolling and interaction: Exploring implicit sources of user preferences for relevance feedback. In Proceedings of SIGIR, pages 408–409, 2001.
[11] G. Marchionini. Exploratory search: From finding to understanding. Commun. ACM, 49(4):41–46, 2006.
[12] M. Morita and Y. Shinoda. Information filtering based on user behavior analysis and best match text retrieval. In Proceedings of SIGIR, pages 272–281, 1994.
[13] J. Nielsen. Designing Web Usability. New Riders, Berkeley, Calif., 2006.
[14] K. Rayner. Eye movements in reading and information processing: 20 years of research. Psych. Bull., 124(3):372–422, 1998.
[15] K. Rodden and X. Fu. Exploring how mouse movements relate to eye movements on web search results pages. In SIGIR Workshop on Web Information Seeking and Interaction, pages 29–32, 2007.
[16] R. W. White and D. Kelly. A study on the effects of personalization and task information on implicit feedback performance. In Proceedings of CIKM, pages 297–306, 2006.