=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Detecting Wikipedia Task Contexts
|pdfUrl=https://ceur-ws.org/Vol-909/poster4.pdf
|volume=Vol-909
|dblpUrl=https://dblp.org/rec/conf/eurohcir/KnauslEL12
}}
==Towards Detecting Wikipedia Task Contexts==
Hanna Knaeusl, David Elsweiler, Bernd Ludwig
Chair for Information Science, University of Regensburg, Germany
hanna.knaeusl@ur.de, david.elsweiler@ur.de, bernd.ludwig@ur.de
ABSTRACT
Wikipedia is a resource used by many people for many different purposes. We posit that it might be beneficial to alter the content or the way content is presented depending on the task context. Here we describe a small pilot lab study to investigate features of interaction that might help to infer the contextual situation surrounding Wikipedia search tasks. We describe our effort to collect data and analyse relationships between the features and the assigned task context.

Categories and Subject Descriptors
H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems

General Terms
Preference Elicitation, Info Seeking Behaviour

Keywords
Eyetracking, Wikipedia

Presented at EuroHCIR2012. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

1. INTRODUCTION
Information portals such as Wikipedia represent rich sources of information covering an incredibly broad range of topics. Many Wikipedia entries are also long and can cover aspects ranging from overviews and introductions to more detailed descriptions of advanced aspects that are perhaps only suitable for topic experts. Single pages can also contain not only text, but images, info-graphics, lists and navigational information. Previous research suggests that these resources will have several different contexts of use. For example, Marchionini [11] identifies three main types of search tasks, all of which are applicable to Wikipedia: lookup tasks include finding answers to specific questions, known-item searches or navigating to specific pages. These tasks are contrasted with exploratory search tasks, which include learn tasks, where the aim is to acquire larger amounts of knowledge and achieve an enhanced understanding of a given topic, and investigate tasks, where the user makes use of found information and continues to contribute to or generate knowledge in some way. Elsweiler et al. [4] provide an additional task dimension, distinguishing between work-oriented tasks, where information is required to complete some job, and casual-leisure tasks, where the aim is more pleasure-focused, e.g. to pass time, to relax, to be entertained etc.
Wikipedia contributors are encouraged to create pages in a way that meets the needs of as many users as possible by including information on a topic with sufficient quantity, quality and completeness and by structuring the content in a way that makes sense generally. Nevertheless, one could imagine that different content or different presentations of the same content might be more suitable in specific contexts. For example, lookup tasks may be best supported when facts in an article are presented as a list that can be scanned easily. In such scenarios, content such as images may be less helpful and perhaps even distracting. Contrastingly, in casual-leisure situations, users may want to focus on multimedia content or have information presented in a way that encourages browsing and information discovery.
We believe examples like this suggest there may be benefit in moving away from static pages, which try to cater for all usage situations, to dynamic pages that are generated appropriately based on the context of use. As a first step towards exploring this hypothesis, in this paper we investigate how the context of use – the task type being performed – might be detected automatically from user interactions with the system. We want to establish if the way the user interacts with the system, e.g. mouse and keyboard interactions, eye movements, and click behaviour, can provide implicit feedback regarding the usage scenario and user goals.
With this aim in mind, we present a small pilot study that allows us to evaluate a methodology for detecting the features of interaction that might help us infer the contextual situation surrounding a user's search task. We collect interaction data in the context of a controlled laboratory study and analyse relationships between the features of interaction and the assigned task context. The data show that, for the small number of users in our study, the behaviour exhibited when completing tasks of different types is very different; users interact with different types of content in different ways. Further, we provide evidence that it is possible, at least for some users, to predict these behaviours based purely on mouse and keyboard interactions.

2. RELATED WORK
In the IR community a large amount of work has been performed to establish if interaction data can be used as a surrogate for explicit relevance judgements. This is known as implicit relevance feedback. Early research in this area demonstrated a correlation between the time spent reading a
document and explicit relevance judgements [12]. Although this has been disputed in naturalistic situations [10], White and Kelly show that when task type is taken into account clear signals can be found [16]. Other studies have shown that the amount of scrolling on a Web page [3], click-through for documents in a browser [9], bookmarking behaviour [7] and eye movements during the search [2] can all be used as implicit feedback to improve retrieval performance.
Interaction data can also be used as a means to predict user emotions. For example, Fox et al. show that query log features can be used to predict searcher satisfaction [6], and Feild et al. [5] used interaction data and physical sensors to predict levels of user frustration with high accuracy.
A third group of studies shows correlations between different styles of interaction, e.g. for some users visual attention on the screen can be predicted via mouse coordinates [15]. We believe that the interaction style, the emotional state of the user and the motivating task context will be intrinsically related, and that the work done previously suggests it may be possible to predict the task based on interaction data. We explore this in a small pilot study below.

Figure 1: Annotation labels for the user actions during Wikipedia search and for the gazed elements.

Actions:
  Read (RE): the user is reading text
  Scan (SC): the user scans content, e.g. headlines, lists or the whole page
  Examine (EX): the user examines an element
  Navigate (NV): the user navigates

Elements:
  Text passage (TX), Other navigation (ON), Picture (PI), Introduction (IN), Info Box (IB), Charts/tables etc. (IG), Links in Wikipedia (WI), List (LI), Headline (HD)

Table 1: Absolute frequencies of content elements for actions for the investigated task types.

lookup
action      TX    ON    PI    IN    IB    IG    WI    LI    HD
EX           0     0     0     0     0    42     0     0     0
NV           0     0     0     0     0     0    46     0     0
RE           0     0     0    23     0    23     0    27     0
SC          53     0     0    24    18     0     0    59    12

learn
action      TX    ON    PI    IN    IB    IG    WI    LI    HD
EX           0     0    89     0     0    93     0     0     0
NV           0     2     0     0     0     0    52     0     0
RE        1872     0     0    72     0     0     0    93     0
SC         172     0     6     2     0     0     0    62   285

casual-leisure
action      TX    ON    PI    IN    IB    IG    WI    LI    HD
EX           0     0   137     0     2    85     0     0     0
NV           0    11     0     0     0     0   105     0     0
RE        1876     0     6   274     1     0     0    90    32
SC         177     0     2     8     6     0     0    60   134

3. DATA COLLECTION
In this section we provide details of the data collected and explain the motivation behind recording the data.

3.1 Study Design
Data was collected via a laboratory-based user study with 4 users. The participants were information science students (1 male, 3 female) aged between 20 and 30. All of the participants were experienced Wikipedia users and were comfortable using the Wikipedia search facilities. Although this user population is not large or diverse enough to provide generalisable results, it is sufficient for our aims, which were to evaluate and improve the methodology and get a sense for the feasibility of our ideas.
Each participant performed 6 Wikipedia search tasks (2 of each of the 3 types of interest: lookup, learn and casual-leisure). The tasks were presented in the form of a simulated scenario and were ordered randomly to minimise learning effects. Example tasks for each type are shown in Figure 2.
After initially greeting the participant, the experimental procedure was explained in person. Then, to prevent biases, the participant was led automatically through the experiment on screen, with task descriptions, questionnaires and a web-browser window appearing when appropriate. The experimenters observed the tasks remotely in an adjoining room, where the participant's screen was mirrored.

3.2 Data Collected
We collected a large amount of data from each participant before, during and after the study.

Questionnaires: A pre-study questionnaire collected demographics, search experience, and the participants' experience with Wikipedia. Pre- and post-task questionnaires elicited perceptions of the task, domain knowledge, success and the experience, including emotional aspects; finally, a post-study questionnaire gathered general impressions of the experiment.

Eyetracking Data: We recorded participant gaze patterns using an SMI RED eye-tracker. The associated BeGaze software recorded video files of screen interactions with an additional layer indicating the area of the screen where the user is focusing his gaze. We manually annotated these complete overlaid video sequences with two labels. The first describes what the user is doing ("action"). This is a simple coding scheme but aligns with reading psychology research [14, 13]. It was the annotator who decided which action to code at what moment, by following the focus displayed in the layer on top of the recorded screen. The second label describes the content ("element") being focused on and is derived from the elements available in Wikipedia pages. The label was assigned when the user focused on an area of the screen for long enough that the annotator could assume the element in the area was perceived. The full set of labels for actions and elements is presented in Fig. 1. The intuition behind the labels was that the style of reading for different task types and the content elements used will be very different. By labelling videos in this way we could test this intuition empirically.

Browser Logs: We instrumented the Firefox web-browser to log all user interactions during the search process. Timestamp information was used to align interaction data from different sensors.
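The paper does not publish its alignment code; as a minimal sketch, timestamped events from the two sensor streams could be bucketed into shared frames like this. The 500 ms frame length is the one used later in the analysis; all function names, event payloads and timestamps here are our own illustrative assumptions.

```python
from collections import defaultdict

FRAME_MS = 500  # frame length later used in the analysis

def to_frame(timestamp_ms, start_ms):
    """Map an absolute timestamp to a 0-based frame index."""
    return (timestamp_ms - start_ms) // FRAME_MS

def align(eye_events, browser_events, start_ms):
    """Bucket two timestamped event streams into shared frames.

    Each event is a (timestamp_ms, payload) tuple; the result maps
    frame index -> list of payloads from both sensors.
    """
    frames = defaultdict(list)
    for ts, payload in list(eye_events) + list(browser_events):
        frames[to_frame(ts, start_ms)].append(payload)
    return dict(frames)

# Toy streams from two sensors sharing one clock.
eye = [(0, "gaze:TX"), (620, "gaze:HD")]
browser = [(100, "mousemove"), (1300, "scroll")]
aligned = align(eye, browser, start_ms=0)
# Frame 0 covers 0-499 ms, frame 1 covers 500-999 ms, and so on.
```

Once both streams share a frame index, per-frame features (mousemove counts, saccade distances) and the manually annotated video labels can be joined directly.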
Lookup: Last night you watched a documentary about the sinking of the Titanic. Suddenly you wonder how many passengers were on board when the catastrophe happened. Search in Wikipedia for this information.
Learn: Friends from abroad are visiting Germany and you plan to travel together to visit the small but beautiful city of Regensburg. As preparation for the trip you want to know more about the city and its history. Use Wikipedia to do this.
Casual-leisure: You have a few minutes before your class starts but you are already sitting in the lecture hall. Kill this time using Wikipedia, spending the next six minutes looking at whatever topic(s) take your fancy.
Figure 2: Examples of the kinds of tasks assigned to study participants.
4. EVALUATION OF THE DATA
We analyse the data in two stages. First, in Section 4.1, we examine the distribution of video labels for different types of task to determine if users behave differently or focus their attention on different kinds of topics when completing different task types. Second, in Section 4.2, we show how these labels can, in turn, be predicted using interaction data from the eyetracker and browser. The first stage provides evidence that the user's preferences for content elements depend on the search task, endorsing our suggestion to customise web pages at run time. The second stage provides some evidence for our hypothesis that the interactions a user performs in a browser may be used to predict which actions he is trying to complete and which content elements he is preferring at that moment.

Table 2: χ² tests for different distributions of content elements per task type (LO: lookup, RE: learn, CA: casual-leisure).

            LO vs RE          LO vs CA          RE vs CA
action   χ²       p-value   χ²      p-value   χ²      p-value
EX       9        0.011     9       0.029     18      0.006
NV       9        0.011     9       0.011     18      0.001
RE       13       0.043     6       0.301     27      0.079
SC       36.563   0.064     45      0.039     45      0.039

4.1 Reading Style and Content for Task-types
Technical difficulties meant we were only able to work with data for 6 casual-leisure, 4 lookup and 4 learn tasks. We first divided the data into 500 ms frames, allowing us to normalise the counts by task length, and counted relative frequencies of frames for which label combinations occur for each task type (see Table 1). Visually inspecting the distribution of content for actions suggests the reading style and the elements of content interacted with were very different in different task contexts. This is confirmed by pair-wise comparisons using chi-squared tests for the distributions of content elements for each possible pair of task types (see Table 2).
Examining the results in Table 2, we observe that all but one combination of action type shows highly significant differences in the distribution of content elements examined. The exception is the distribution of elements for lookup and casual-leisure tasks, which initially seems counterintuitive, as one would expect these two tasks to be very different. Below we summarise the main similarities and differences between the task-types and attempt to explain what these mean in the context of our work.
When completing lookup tasks, the participants do not typically read content, the exception being page introductions. Instead they scan large portions of the page very quickly, looking for the snippets of information that will satisfy their specific information need. They tend to scan a number of different kinds of content elements during tasks. This can be seen from Table 1, with counts being spread over text passages, introduction, info boxes, lists and headers. Images are noticeably missing from lookup tasks. It seems as if the participants have decided that for the tasks assigned, images will not be useful and are able to avoid them.
Learn and casual-leisure tasks differ from lookup tasks in that they both tend to be longer in time and have more interactions. They also both involve reading actions, which were rare for lookup. By this we mean that the user focuses attention on whole passages of text and attends the text from left to right and line by line. Another similarity between learn and casual-leisure tasks is the way that text passages are consumed, with the counts for these tasks being very similar. There are differences between learn and casual-leisure tasks, particularly in terms of the elements used other than text passages. During learn tasks the focus tended to be on headers, while for casual-leisure, the focus was on elements such as introductions and info boxes, which allow the user to gain an overview of what a page is about and allow them to judge whether it is interesting or not. We assume that headers are useful for learn tasks because here there is a concrete information need, i.e. users do not just need to find something that is interesting or not, but need specific informational content. In this sense headers will help the user determine whether a paragraph is worth reading or not.

4.2 Predicting Style and Content Preferences
To determine if the manually assigned labels can be predicted from interaction data alone, we calculated statistics for counts of the synchronous occurrences of video labels and input events for the 500 ms frames introduced above. As we were searching for the simplest features possible (so that they could eventually be computed easily during a browser session at runtime), we used the frequencies of the most common mouse events and the average saccade distance (i.e. eye movement) per frame as features. More precisely, for each frame we discretised these features into two levels, low and high, based on the mean value over all frames.
Table 3 (left) gives an example of the information we computed from the raw log data. In order to understand whether knowledge of the mousemove frequency is relevant for predicting user actions and content elements, we performed a series of χ² tests for all six search tasks for one of the test participants chosen at random (in total about 30 minutes of interaction). The results are reported in Table 3 (right). With the exception of the rare click events, all features are highly significant. We interpret this as a positive indication that for individual users – depending on their personal interaction style (see [1, 8]) – it is feasible that the reading behaviour label could be predicted during a browsing session.
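The pair-wise comparisons reported in Table 2 rest on Pearson's χ² statistic over such count distributions. As an illustration of the computation only, here is a self-contained sketch applied to the scan-row counts from Table 1 (our own recomputation; the paper's exact test procedure and any corrections are not specified).

```python
def chi2_stat(table):
    """Pearson's chi-squared statistic for a contingency table given
    as a list of rows (here: task types x content elements)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            if expected:  # skip structurally empty columns
                stat += (observed - expected) ** 2 / expected
    return stat

# Scan (SC) counts for the elements TX, PI, IN, IB, LI, HD, taken from
# Table 1 for the lookup and learn tasks (all-zero columns dropped).
lookup_sc = [53, 0, 24, 18, 59, 12]
learn_sc = [172, 6, 2, 0, 62, 285]
stat = chi2_stat([lookup_sc, learn_sc])
# With (2-1)*(6-1) = 5 degrees of freedom, the 5% critical value is
# about 11.07; a statistic far above it signals a real difference.
```

A library routine such as `scipy.stats.chi2_contingency` would also return the p-value directly; the hand-rolled version above just keeps the sketch dependency-free.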
Table 3: Frequency counts of user actions and mousemove events, and of content elements and mousemove events, occurring simultaneously (left). The table on the right shows the significance results for the χ² tests.

Left (counts per mousemove level):
action   mousemove high   mousemove low      element   mousemove high   mousemove low
NV       5                6                  IN        30               12
RE       18               5                  IB        8                10
SC       41               18                 WI        5                6
                                             LI        21               1

Right (significance codes per task; the features are scroll, click, mousemove and avg. saccade distance, each tested against the action (act.) and element (el.) labels):
Task 1: *** *** *** *** *** *** *** **
Task 2: *** *** * *** ***
Task 3: * ** * *** *** * **
Task 4: * *** * ** ***
Task 5: *** *** *** *** *
Task 6: *** *** ** *** *** ** ***
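The discretisation and co-occurrence counting behind the left-hand side of Table 3 can be sketched as follows. The per-frame values here are invented toy data; the paper's feature-extraction code is not published.

```python
def discretise(values):
    """Binarise per-frame feature values into 'low'/'high' around the
    mean over all frames, as done for mousemove frequency and average
    saccade distance."""
    mean = sum(values) / len(values)
    return ["high" if v > mean else "low" for v in values]

def cooccurrence(labels, levels):
    """Count how often each annotated label coincides with each
    feature level within the same frame."""
    counts = {}
    for label, level in zip(labels, levels):
        counts[(label, level)] = counts.get((label, level), 0) + 1
    return counts

# Toy per-frame data: annotated action label and mousemove count.
actions = ["SC", "SC", "RE", "RE", "NV", "SC"]
moves = [9, 7, 1, 0, 2, 8]        # mousemove events per 500 ms frame
levels = discretise(moves)        # mean is 4.5
table = cooccurrence(actions, levels)
```

The resulting (label, level) counts form exactly the kind of contingency table on which the per-task χ² tests of Table 3 (right) can then be run.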
The results of the χ² tests indicate that knowing at run-time whether the observed input events occur below or above average at any point of time increases the accuracy of predicting the video labels as annotated for that moment, as the distribution P(action | event = low) differs significantly from the distribution P(action | event = high) for any annotated action and for any annotated element type. This observation opens the way for runtime prediction of the user action and preferred elements. From that information, the system can predict the current task type and use this information for generating content dynamically.

5. CONCLUSIONS
The preliminary data analysis we have presented provides clues that, firstly, reading behaviour and preferences for content elements depend on the surrounding task context and, secondly, both behaviour and preferences may be predicted for individual users based on their interaction style.
There are several limitations to this work. That we only have data from four participants from a relatively homogenous group means we cannot generalise. However, we claim that the presented methodology is well suited to address our long-term research questions outlined in the introduction, and the pilot has provided us with insight into how to improve a full study. In addition to resolving several technical challenges, we have learned that great care will need to be taken when simulating tasks. For example, were few images looked at in lookup tasks simply because of the tasks we chose? We also plan to look at more complicated prediction features and to account for the fact that individual differences in participants (cognitive, reading style [14]) will exist and that users interact in different ways (people who follow eye movements with their mouse, people who don't) [15]. At EuroHCIR, we look forward to engaging with the broader HCI and IR communities to discuss the ideas in this paper; we are particularly eager to receive feedback on the next steps along this research path, including brainstorming solutions to some of the empirical design challenges of running such experiments and identifying and dealing with the many factors which should be incorporated in the full study.

6. REFERENCES
[1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In Proceedings of SIGIR, pages 3–10, 2006.
[2] G. Buscher, A. Dengel, and L. van Elst. Eye movements as implicit relevance feedback. In CHI '08 Extended Abstracts on Human Factors in Computing Systems, pages 2991–2996, 2008.
[3] M. Claypool, P. Le, M. Wased, and D. Brown. Implicit interest indicators. In Proceedings of IUI, pages 33–40, 2001.
[4] D. Elsweiler, M. L. Wilson, and B. Kirkegaard Lunn. New Directions in Information Behaviour, chapter Understanding Casual-leisure Information Behaviour. Emerald Publishing, 2011.
[5] H. Feild, J. Allan, and R. Jones. Predicting searcher frustration. In Proceedings of SIGIR, 2010.
[6] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White. Evaluating implicit measures to improve web search. ACM Trans. Inform. Syst., 23(2):147–168, 2005.
[7] Q. Guo and E. Agichtein. Ready to buy or just browsing? Detecting web searcher goals from interaction data. In Proceedings of SIGIR, pages 130–137, 2010.
[8] J. Huang, R. White, and G. Buscher. User see, user point: Gaze and cursor alignment in web search. In Proceedings of CHI, pages 1341–1350, New York, NY, USA, 2012. ACM.
[9] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inform. Syst., 25(2), 2007.
[10] D. Kelly and N. J. Belkin. Reading time, scrolling and interaction: Exploring implicit sources of user preferences for relevance feedback. In Proceedings of SIGIR, pages 408–409, 2001.
[11] G. Marchionini. Exploratory search: From finding to understanding. Commun. ACM, 49(4):41–46, 2006.
[12] M. Morita and Y. Shinoda. Information filtering based on user behavior analysis and best match text retrieval. In Proceedings of SIGIR, pages 272–281, 1994.
[13] J. Nielsen. Designing Web Usability. New Riders, Berkeley, Calif., 2006.
[14] K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372–422, 1998.
[15] K. Rodden and X. Fu. Exploring how mouse movements relate to eye movements on web search results pages. In SIGIR Workshop on Web Information Seeking and Interaction, pages 29–32, 2007.
[16] R. W. White and D. Kelly. A study on the effects of personalization and task information on implicit feedback performance. In Proceedings of CIKM, pages 297–306, 2006.