=Paper=
{{Paper
|id=Vol-2337/paper4
|storemode=property
|title=INEX iTrack Revisited: Exploring the Potential for Re-use
|pdfUrl=https://ceur-ws.org/Vol-2337/paper4.pdf
|volume=Vol-2337
|authors=Nils Pharo
|dblpUrl=https://dblp.org/rec/conf/chiir/Pharo19
}}
==INEX iTrack Revisited: Exploring the Potential for Re-use==
INEX iTrack Revisited: Exploring the Potential for Re-use
Nils Pharo
Oslo Metropolitan University
Oslo, Norway
nilsp@oslomet.no
ABSTRACT
This paper presents the experiences from the INEX iTrack experiments conducted over a period of seven years. The purpose is to present the infrastructure of the experiments with the aim of identifying its potential for re-use in new experiments. The paper discusses the terminology, research design, methodology, resources and reporting from the INEX iTrack in light of this.

CCS CONCEPTS
• Information systems → Users and interactive retrieval; • Human-centered computing → User studies; Empirical studies in HCI.

KEYWORDS
Interactive information retrieval, methodology, open science

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK, 2019. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
The Initiative for Evaluation of XML retrieval (INEX) started in 2002 as a set of experiments following the Cranfield model. The purpose of INEX was initially to test the potential of XML elements as items for retrieval, as an alternative to full text documents, document parts and document passages. The INEX interactive track (iTrack) was run as a subtrack from 2004 to 2010 [3, 5, 8, 10–12], with the goal to study how end-users query, interact with, and evaluate documents and document parts. The iTrack was organized in a distributed way: participating groups from universities and other research institutions across the world collected data following a standardised procedure in an experimental setting. In this way, it was possible to collect rather large data sets of user-system interaction.
In this paper we shall investigate the methodological approach used in the INEX iTrack. The intention is to explore its potential for re-use and the experience that can be of value for establishing a common methodology for interactive information retrieval (IR) experiments. The paper is structured in the following way: first we present the method, then the iTrack infrastructure, i.e. the terminology, research design, methodology, resources and reporting used. Thereafter follows a discussion of challenges, before the final part with summary and conclusions.

2 METHOD
In order to identify the infrastructure of the INEX iTrack we investigate the reports published in the proceedings from 2004 to 2010. The structure of the iTrack reports was kept fairly consistent across the years. The experimental set-up included presentation of the tasks, the search system, the document corpus, and the procedure for data collection. To a varying degree, results were presented in the proceedings reports; some years the experiments had not ended at the time of the proceedings report deadlines.
We do not report any of the findings; these can be found in the proceedings reports and in a summary of the seven years of iTrack experiments [6].

3 THE INEX ITRACK INFRASTRUCTURE

3.1 Terminology
During the iTrack years, the terminology used went through some changes. In particular, the first year (2004) stands out with an idiosyncratic terminology. Table 1 shows the distribution of central terms used over the period, compared according to their intended use, i.e. the concept (infrastructure element) they represent. This means, e.g., that from 2005 to 2010 the term "document corpus" was used consistently to refer to the collection of documents used in the experiments, whereas the term "Tasks" was used consistently from 2008 to 2010.

Table 1: Consistent INEX iTrack terminology over time

Year/period | Common terms
2005-2010 | Document corpus, relevance assessments, experimental procedure
2006-2010 | Search system, logging
2008-2010 | Tasks, participating groups

Table 2 provides an overview of central concepts, definitions, and the terminology where term use has changed over time. This is not an exhaustive overview; only concepts used over several years of experiments are included.

Table 2: INEX iTrack terminological changes over time

Concept | Definition | Distribution
Task | The task(s) assigned to participants, what they are asked to find information about and its context | Topics (2004), tasks/topics (2005), search tasks (2006), tasks (2008-2010)
Search system | The system(s) designed to be used in the experiments | System (2004, 2005), search system (2006-2010)
Document corpus | The documents searchable in the search system | Document corpus (2005-2010)
Experimental procedure | The procedure used for performing the experiment | Experimental protocol (2004), experimental procedure (2005-2010)

Although term use has changed over time, it is easy to identify the common infrastructure elements from the proceedings reports. Most confusing is the different uses of the term "Task", which was used to refer to different experimental tasks in 2005 and 2006. In 2006, e.g., three different tasks were described as "Task A - Common Baseline System with IEEE Collection", "Task B - Participation with Own Element Retrieval System" and "Task C - Searching the Lonely Planet Collection", respectively.

3.2 Research design
The research design used in the iTrack experiments has been stable. A generic representation of the experimental procedure can be described in the following way:
(1) General questionnaire. The participant fills out a questionnaire on background knowledge, demographic data etc. Questionnaires were on paper (2004-2006) or online (2008-2010).
(2) Training task. The participant is given a training task to introduce them to the system's design and functionalities.
(3) Task 1
    (a) Task specific questionnaire. The participant fills out a questionnaire on task specific knowledge.
    (b) Search session. The participant interacts with the system in order to perform the task.
    (c) Post task questionnaire. The participant fills out questionnaires related to the experience with the system, difficulty in solving the task etc.
(4) Additional tasks performed as described in step 3.
(5) Post experiment questionnaire. The participant fills out a questionnaire to provide feedback about the search system.
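To make the procedure easier to re-use, the following minimal sketch shows one way the record produced by such a session could be modelled in code. It is purely illustrative: the class and field names are assumptions made here and are not taken from the original iTrack systems or data formats.

 # Illustrative sketch of one iTrack-style session record, mirroring steps (1)-(5) above.
 # All names are assumptions made for this example, not the original iTrack schema.
 from dataclasses import dataclass, field
 from typing import Dict, List

 @dataclass
 class TaskRound:
     task_id: str                        # e.g. a 2006 "fact-finding, hierarchical" task
     pre_task_answers: Dict[str, str]    # task-specific questionnaire (step 3a)
     log_events: List[Dict[str, str]]    # transaction log of the search session (step 3b)
     post_task_answers: Dict[str, str]   # post-task questionnaire (step 3c)

 @dataclass
 class Session:
     participant_id: str
     background_answers: Dict[str, str]                           # general questionnaire (step 1)
     training_completed: bool = False                             # training task (step 2)
     task_rounds: List[TaskRound] = field(default_factory=list)   # steps 3 and 4
     exit_answers: Dict[str, str] = field(default_factory=dict)   # post-experiment questionnaire (step 5)

A record of this kind, stored together with a code book describing each questionnaire item, is the type of artefact a data repository as discussed in Section 5 would need to preserve.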
In addition to the common experimental procedure, the participating groups had the opportunity to perform their own experiments. In 2005 and 2006 this was explicitly organized so that research groups could use their own systems and compare their results to the system developed for the experiments as a baseline.
Very little analysis was performed as part of the iTrack work itself. Studies performed on iTrack data and reported in journal articles and conference proceedings papers have used transaction log analysis, statistical analysis of questionnaire data, screen capturing and eye-tracking. The studies have, e.g., investigated users' preferences with respect to element granularity [2, 4, 7] and the effect of task types on preferred elements [9].

3.3 Methodology
The initial purpose of the iTrack was twofold: "to investigate the behaviour of users when interacting with components of XML documents, and secondly to investigate and develop approaches for XML retrieval which are effective in user-based environments". In the first two years, the iTrack was closely connected with the INEX ad hoc-track, using the ad hoc-track's document corpus and topics/tasks. Throughout the whole period, the tasks have been formulated as simulated work task situations [1]. Over the years, changes in methodology include changes in document corpus, search systems, task types, relevance scales and analysis. The overall research questions have also changed. Some examples of iTrack research questions are:
• What element types / levels of granularity do searchers choose to see? In what sequence?
• How do users make use of document structure
  – in making relevance judgements?
  – in choosing the level of granularity to view?
• What level of element granularity constitutes the basis of a relevance decision? With what degree of certainty?
• How do factors such as topic knowledge influence
  – choice of element granularity?
  – number of elements viewed / amount read?
  – relevance judgements?

3.3.1 Document corpus. In 2004 and 2005 the corpus was a collection of journal articles published by IEEE (also used in other INEX tracks); in addition, a collection of Lonely Planet travel guides was used in 2005. In 2006 and 2008 the Wikipedia collection, consisting of more than 650 000 XML-formatted encyclopaedic articles, was used in the iTrack as well as in other INEX tracks. In 2009 and 2010 a collection of Amazon and LibraryThing book reviews was specifically collected for the iTrack. This collection has later been adopted by CLEF's Social Book Search Lab.

3.3.2 Search system. Several search systems were developed by the iTrack organizers. In 2004 and 2005 the HyREX retrieval engine (downloadable from http://www.is.informatik.uni-duisburg.de/projects/hyrex/) was used as backend in the baseline system. In 2006 two different backends were used to test the difference between passage and element retrieval: CSIRO's Panoptic/Funnelback platform as the passage retrieval backend and TopX (only the TopX backend is available for download: http://topx.sourceforge.net/) from the Max Planck Institute for Informatics as the element retrieval backend. In 2008 and 2009 a retrieval system built within the Daffodil framework, developed at the University of Duisburg-Essen (more details are available at http://www.is.informatik.uni-duisburg.de/projects/daffodil/index.html), was used. In 2010 Daffodil was replaced with a system based on the ezDL framework (more information on ezDL can be found at http://www.is.informatik.uni-duisburg.de/projects/ezdl/). The system interface design, built within the Daffodil framework, was quite consistent throughout the whole period. In 2009-2010 the design consisted of three main components (see Figure 1): a query panel, a result list, and a window showing the details of the item retrieved from the result list. In previous years the document was shown in a separate interface.

Figure 1: INEX iTrack 2009-2010 interface

3.3.3 Task types. Table 3 contains an overview of iTrack task categories. The task categories have typically changed from year to year, with categories differing in complexity.

Table 3: iTrack task categories

Year | Task categories | Description
2004 | Background; Comparison | "Find background information about...", "Find differences between..."
2005 | General/Challenging | The "general" tasks were designed to be simpler than the more complex "challenging" tasks.
2006 | Types: Decision making; Fact finding; Information gathering. Structure: Hierarchical; Parallel | The tasks were combined on two dimensions: type and structure.
2008 | Fact finding/Research | The tasks were designed to represent information needs typical for Wikipedia users: finding facts, such as the "biggest airport", or performing research to write a paper.
2009 | Broad/Narrow/Semi self-selected | Broad tasks represented needs that lead to thematic exploration. Narrow tasks represented relatively narrow topical information needs.
2010 | Explorative/Data gathering/Semi self-selected | The tasks were designed to represent different stages in information seeking processes.

In particular the 2006 tasks should be noted, where tasks were two-dimensional, combining type and structure. This is an example of a 2006 fact-finding, hierarchical task:
"A friend has just sent an email from an Internet café in the southern USA where she is on a hiking trip. She tells you that she has just stepped into an anthill of small red ants and has a large number of painful bites on her leg. She wants to know what species of ants they are likely to be, how dangerous they are and what she can do about the bites. What will you tell her?"
The task types used in the 2010 iTrack were designed to simulate searchers at different stages of the search process, as defined by Kuhlthau. Below is an example of a 2010 explorative task:
"You are at an early stage of working on an assignment, and have decided to start exploring the literature of your topic. Your initial idea has led to one of the following three research needs:
(1) Find trustworthy books discussing the conspiracy theories which developed after the 9/11 terrorist attacks in New York.
(2) Find controversial books discussing the climate change and whether it is man-made or not.
(3) Find highly acclaimed novels that treat issues related to racial discrimination."
Semi self-selected tasks were used in 2009 and 2010. The participants were asked to "[t]ry to find books about a specific topic or of a certain type, but do not look for a specific title you already know."

3.3.4 Relevance scales. A variety of relevance scales have been used in the iTrack, and their complexity has varied a lot. In 2005, 2009 and 2010 a simple trinary relevance scale was used: the searchers were asked to assess elements as "relevant", "partially relevant" or "not relevant". In 2004 a ten point relevance scale was used:
A Very useful and Very specific
B Very useful and Fairly specific
C Very useful and Marginally specific
D Fairly useful and Very specific
E Fairly useful and Fairly specific
F Fairly useful and Marginally specific
G Marginally useful and Very specific
H Marginally useful and Fairly specific
I Marginally useful and Marginally specific
J Contains no relevant information
In 2005 the author noted concerns that the 2004 scale "was far too complex for the test persons to comprehend", thus choosing the simple scale in 2005. In 2006 and 2008 a two-dimensional scale with five possible scores was used, with the following definitions:
• Relevant, but too broad: contains relevant information, but also a substantial amount of other information.
• Relevant: contains highly relevant information, and is just the right size to be understandable.
• Relevant, but too narrow: contains relevant information, but needs more context to be understood.
• Partially relevant: has enough context to be understandable, but contains only partially relevant information.
• Not relevant: does not contain any relevant information that is useful for solving the task.
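For re-analysis across years, the graded scales above can be recoded programmatically. The sketch below decomposes the 2004 ten-point scale into its two stated dimensions and collapses it to the trinary scale used in 2005, 2009 and 2010; the mapping structure and the collapsing rule are assumptions made for this illustration, not part of the original iTrack tooling.

 # Illustrative only: the 2004 ten-point grades decomposed into the two dimensions
 # named in the scale (usefulness, specificity). Grade "J" carries no relevant information.
 SCALE_2004 = {
     "A": ("very useful", "very specific"),
     "B": ("very useful", "fairly specific"),
     "C": ("very useful", "marginally specific"),
     "D": ("fairly useful", "very specific"),
     "E": ("fairly useful", "fairly specific"),
     "F": ("fairly useful", "marginally specific"),
     "G": ("marginally useful", "very specific"),
     "H": ("marginally useful", "fairly specific"),
     "I": ("marginally useful", "marginally specific"),
     "J": ("not relevant", None),
 }

 def to_trinary(grade: str) -> str:
     """Collapse a 2004 grade to the simple scale used in 2005, 2009 and 2010.
     The cut-off point is an assumption chosen for illustration only."""
     if grade == "J":
         return "not relevant"
     usefulness, _ = SCALE_2004[grade]
     return "relevant" if usefulness == "very useful" else "partially relevant"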
3.3.5 Analysis methods. iTrack data analysis has been performed using a combination of transaction logs and questionnaire data. Studies have been performed investigating the types of transactions taking place, typical transaction patterns, and factors influencing transaction patterns.
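As a hedged illustration of this kind of transaction-pattern analysis, the sketch below counts consecutive pairs of transaction types in one session log. The event vocabulary is hypothetical, since the original iTrack log schema is not described in this paper.

 # Illustrative only: counting transaction-type bigrams (e.g. query -> view result)
 # in a session log. The event names are made up for this example.
 from collections import Counter
 from typing import Iterable

 def transaction_bigrams(event_types: Iterable[str]) -> Counter:
     """Return counts of consecutive transaction-type pairs in one session."""
     events = list(event_types)
     return Counter(zip(events, events[1:]))

 # Example with hypothetical event types:
 session = ["query", "view_result", "view_element", "assess", "query", "view_result"]
 print(transaction_bigrams(session).most_common(3))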
3.4 Resources
The INEX iTrack evolved from 2004 to 2010. In the first years, it complemented the research goals of the ad hoc-track, re-using topics, with some modifications, from the ad hoc-track, with the intention to identify how end-users react to element-based IR systems. The software used for the search system, which was developed at the University of Duisburg-Essen, was gradually developed further, and the interface design was kept consistent. Questionnaires were also kept fairly consistent, addressing the same background factors from year to year.

3.5 Reporting
The iTrack proceedings reports document the study design. The software is documented on the web sites, but the questionnaires are not well documented. The biggest issue is the availability of transaction logs and questionnaire data, which are not openly available at the time of writing. The intention of the iTrack was that the data should be available only to the research groups for a limited period and then become available to others upon request. Unfortunately, the iTrack web sites are no longer available, which leaves us with the track reports as the main official documentation.

4 DISCUSSION
The experiences from the INEX iTrack have been manifold. With the collaborative effort of several research groups collecting data in a standardized manner, the iTrack resulted in large interactive IR datasets. The maximum number of participating research groups was 11 (in 2004 and 2005), with 119 searchers taking part in the 2005 experiment. The data can be compared across countries and, to a certain degree, across different user groups (although the majority of participants have been students in computer science and library and information science). In addition, rich background data on many searchers have been collected.
The major challenge of the experiments is the design of tasks. These should be relevant for the participants and tailored following Borlund's simulated work task situation method [1]. This can be done either by agreeing upon a very specific user group to recruit participants from or by making very generic tasks. To design realistic experiments we should also take into account that today's information searchers search all the time, in a fragmented way and on various platforms.
Other challenges include the identification of factors that influence interaction. We need to be able to identify the degree to which we can make valid analyses based on the data.
Specific challenges related to re-use and data sharing in interactive IR include establishing standardized ways of documenting experiments, which is what the BIIRRR workshop addresses. It is also necessary to establish a forum for discussion and coordination of IIR experiment efforts.

5 SUMMARY AND FUTURE WORK
The INEX interactive track organized collaborative interactive information retrieval experiments from 2004 to 2010. In all, the iTrack initiated six rounds of experiments with changes in tasks, collections and search systems. The experiments resulted in data in the form of transaction logs and questionnaires. All experiments were documented in the INEX proceedings.
Although the experiments evolved throughout the period, with significant impact on elements such as task types and relevance scales, the documentation is fairly consistent. The data are, however, at present not publicly available, and the systems that were used are only partially available. This raises the following questions and challenges for securing re-use of the INEX iTrack experiments, which will also be of value for re-use of interactive IR experiments in general:
• the need for a data repository for the preservation of research designs, including transaction logs and questionnaires along with code books and the necessary documentation for re-use
• a common repository for document corpora and search systems
• a discussion on the need for standardized questions in questionnaires in order to compare across experiments

6 ACKNOWLEDGEMENTS
I would like to thank Norbert Fuhr and Thomas Beckers for valuable information about the current status of the iTrack systems and data.

REFERENCES
[1] Pia Borlund. 2003. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research 8, 3 (2003). http://informationr.net/ir/8-3/paper152.html
[2] Barbara Hammer-Aebi, Kirstine Wilfred Christensen, Haakon Lund, and Birger Larsen. 2006. Users, structured documents and overlap: interactive searching of elements and the influence of context on search behaviour. In Proceedings of the 1st international conference on Information interaction in context (IIiX). ACM, New York, NY, USA, 46–55. https://doi.org/10.1145/1164820.1164833
[3] Birger Larsen, Saadia Malik, and Anastasios Tombros. 2006. The interactive track at INEX 2005. In Advances in XML Information Retrieval and Evaluation, Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Gabriella Kazai (Eds.). Springer, Berlin, 398–410. http://dx.doi.org/10.1007/978-3-540-34963-1_30
[4] Birger Larsen, Anastasios Tombros, and Saadia Malik. 2006. Is XML retrieval meaningful to users?: searcher preferences for full documents vs. elements. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, New York, NY, USA, 663–664. https://doi.org/10.1145/1148170.1148306
[5] Saadia Malik, Anastasios Tombros, and Birger Larsen. 2007. The Interactive Track at INEX 2006. In Comparative Evaluation of XML Information Retrieval Systems, Norbert Fuhr, Mounia Lalmas, and Andrew Trotman (Eds.). Vol. 4518. Springer, Berlin, 387–399. http://www.springerlink.com/content/d4rv145135659g38/
[6] Ragnar Nordlie and Nils Pharo. 2012. Seven Years of INEX Interactive Retrieval Experiments – Lessons and Challenges. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics (Lecture Notes in Computer Science), Tiziana Catarci, Pamela Forner, Djoerd Hiemstra, Anselmo Peñas, and Giuseppe Santucci (Eds.). Springer Berlin Heidelberg, 13–23.
[7] Nils Pharo. 2008. The effect of granularity and order in XML element retrieval. Information Processing and Management 44, 5 (Sept. 2008), 1732–1740. https://doi.org/10.1016/j.ipm.2008.05.004
[8] Nils Pharo, Thomas Beckers, Ragnar Nordlie, and Norbert Fuhr. 2011. Overview of the INEX 2010 Interactive Track. In Comparative Evaluation of Focused Retrieval, Shlomo Geva, Jaap Kamps, Ralf Schenkel, and Andrew Trotman (Eds.). Vol. 6932. Springer, Berlin, 227–235.
[9] Nils Pharo and Astrid Krahn. 2011. The effect of task type on preferred element types in an XML-based retrieval system. Journal of the American Society for Information Science and Technology 62, 9 (Sept. 2011), 1717–1726. https://doi.org/10.1002/asi.21587
[10] Nils Pharo, Ragnar Nordlie, and Khairun Nisa Fachry. 2009. Overview of the INEX 2008 Interactive Track. In Advances in Focused Retrieval, David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Vol. 5631. Springer, Berlin, 300–313.
[11] Nils Pharo, Ragnar Nordlie, Norbert Fuhr, Thomas Beckers, and Khairun Nisa Fachry. 2010. Overview of the INEX 2009 Interactive Track. In Focused Retrieval and Evaluation, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Vol. 6203. Springer, Berlin, 303–311.
[12] Anastasios Tombros, Birger Larsen, and Saadia Malik. 2005. The interactive track at INEX 2004. In Advances in XML Information Retrieval, Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Zoltán Szlávik (Eds.). Springer, Berlin, 410–423.