INEX iTrack Revisited: Exploring the Potential for Re-use

Nils Pharo
Oslo Metropolitan University, Oslo, Norway
nilsp@oslomet.no

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
This paper presents the experiences from the INEX iTrack experiments conducted over a period of seven years. The purpose is to present the infrastructure of the experiments with the aim of identifying its potential for re-use in new experiments. In light of this, the paper discusses the terminology, research design, methodology, resources and reporting of the INEX iTrack.

CCS CONCEPTS
• Information systems → Users and interactive retrieval; • Human-centered computing → User studies; Empirical studies in HCI.

KEYWORDS
Interactive information retrieval, methodology, open science

1 INTRODUCTION
The Initiative for the Evaluation of XML Retrieval (INEX) started in 2002 as a set of experiments following the Cranfield model. The purpose of INEX was initially to test the potential of XML elements as items for retrieval, as an alternative to full-text documents, document parts and document passages. The INEX interactive track (iTrack) was run as a subtrack from 2004 to 2010 [3, 5, 8, 10–12], with the goal of studying how end-users query, interact with, and evaluate documents and document parts. The iTrack was organized in a distributed way: participating groups from universities and other research institutions across the world collected data in an experimental setting, following a standardised data collection procedure. In this way, it was possible to collect rather large data sets of user-system interaction.

In this paper we investigate the methodological approach used in the INEX iTrack. The intention is to explore its potential for re-use and to identify experience that can be of value for establishing a common methodology for interactive information retrieval (IR) experiments. The paper is structured in the following way: the first part describes the method; we then present the iTrack infrastructure, i.e. the terminology, research design, methodology, resources and reporting used. Thereafter follows a discussion of challenges, before the final part with summary and conclusions.

2 METHOD
In order to identify the infrastructure of the INEX iTrack we investigate the reports published in the proceedings from 2004 to 2010. The structure of the iTrack reports was kept fairly consistent across the years. The experimental set-up included a presentation of the tasks, the search system, the document corpus, and the procedure for data collection. Results were presented in the proceedings reports to varying degrees; some years the experiments had not ended by the proceedings report deadline. We do not report any of the findings; these can be found in the proceedings reports and in a summary of the seven years of iTrack experiments [6].

3 THE INEX ITRACK INFRASTRUCTURE

3.1 Terminology
During the iTrack years, the terminology used went through some changes. In particular, the first year (2004) stands out with an idiosyncratic terminology. Table 1 shows the distribution of central terms used over the period, compared according to their intended use, i.e. the concept (infrastructure element) they represent. This means, e.g., that from 2005 to 2010 the term "document corpus" was used consistently to refer to the collection of documents used in the experiments, whereas the term "Tasks" was used consistently from 2008 to 2010.

Table 1: Consistent INEX iTrack terminology over time

Year/period | Common terms
2005-2010 | Document corpus, relevance assessments, experimental procedure
2006-2010 | Search system, logging
2008-2010 | Tasks, participating groups

Table 2 provides an overview of central concepts, their definitions, and the terminology where term use has changed over time. This is not an exhaustive overview; only concepts used over several years of experiments are included. Although term use has changed over time, it is easy to identify the common infrastructure elements from the proceedings reports. Most confusing is the use of the term "Task", which in 2005 and 2006 also referred to the different experimental set-ups. In 2006, e.g., three different tasks were described as "Task A - Common Baseline System with IEEE Collection", "Task B - Participation with Own Element Retrieval System" and "Task C - Searching the Lonely Planet Collection", respectively.

Table 2: INEX iTrack terminological changes over time

Concept | Definition | Distribution
Task | The task(s) assigned to participants, what they are asked to find information about and its context | Topics (2004), tasks/topics (2005), search tasks (2006), tasks (2008-2010)
Search system | The system(s) designed to be used in the experiments | System (2004, 2005), search system (2006-2010)
Document corpus | The documents searchable in the search system | Document corpus (2005-2010)
Experimental procedure | The procedure used for performing the experiment | Experimental protocol (2004), experimental procedure (2005-2010)
3.2 Research design
The research design used in the iTrack experiments has been stable. A generic representation of the experimental procedure can be described in the following way:

(1) General questionnaire. The participant fills out a questionnaire on background knowledge, demographic data, etc. Questionnaires were on paper (2004-2006) or online (2008-2010).
(2) Training task. The participant is given a training task to introduce them to the system's design and functionality.
(3) Task 1
    (a) Task-specific questionnaire. The participant fills out a questionnaire on task-specific knowledge.
    (b) Search session. The participant interacts with the system in order to perform the task.
    (c) Post-task questionnaire. The participant fills out questionnaires related to the experience with the system, the difficulty of solving the task, etc.
(4) Additional tasks, performed as described in step 3.
(5) Post-experiment questionnaire. The participant fills out a questionnaire to provide feedback about the search system.

In addition to the common experimental procedure, the participating groups had the opportunity to perform their own experiments. In 2005 and 2006 this was explicitly organized so that research groups could use their own systems and compare their results against the system developed for the experiments as a baseline.

Very little analysis was performed as part of the iTrack work itself. Studies performed on iTrack data and reported in journal articles and conference proceedings papers have used transaction log analysis, statistical analysis of questionnaire data, screen capturing and eye-tracking. The studies have, e.g., investigated users' preferences with respect to element granularity [2, 4, 7] and the effect of task types on preferred elements [9].
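Because the same generic procedure was run at many sites, a re-usable version of it would benefit from being expressed in machine-readable form rather than only as prose. The following is a minimal sketch of such an encoding in Python; the step names, questionnaire identifiers and fields are illustrative assumptions and are not taken from the original iTrack materials.

```python
# A minimal, machine-readable version of the generic iTrack session procedure,
# so the same protocol can be generated, run and archived at every site.
# Step names, questionnaire identifiers and fields are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str             # e.g. "general_questionnaire"
    description: str      # human-readable instruction for the experimenter
    instrument: str = ""  # questionnaire identifier, empty if no instrument


def session_protocol(n_tasks: int = 2) -> List[Step]:
    """Build the generic iTrack procedure for a session with n_tasks search tasks."""
    steps = [
        Step("general_questionnaire", "Background knowledge and demographics", "Q-general"),
        Step("training_task", "Introduce the participant to the system"),
    ]
    for i in range(1, n_tasks + 1):
        steps += [
            Step(f"task_{i}_pre_questionnaire", "Task-specific knowledge", f"Q-pre-{i}"),
            Step(f"task_{i}_search_session", "Interaction with the search system (logged)"),
            Step(f"task_{i}_post_questionnaire", "System experience and task difficulty", f"Q-post-{i}"),
        ]
    steps.append(Step("post_experiment_questionnaire", "Feedback on the search system", "Q-final"))
    return steps


if __name__ == "__main__":
    for step in session_protocol(n_tasks=2):
        print(f"{step.name}: {step.description}")
```

Encoding the protocol as data would let each participating group generate identical session scripts and archive them alongside the resulting logs and questionnaires.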
3.3 Methodology
The initial purpose of the iTrack was twofold: "to investigate the behaviour of users when interacting with components of XML documents, and secondly to investigate and develop approaches for XML retrieval which are effective in user-based environments". In the first two years, the iTrack was closely connected with the INEX ad hoc track, using the ad hoc track's document corpus and topics/tasks. Throughout the whole period, the tasks were formulated as simulated work task situations [1]. Over the years, the methodology changed with respect to document corpus, search systems, task types, relevance scales and analysis. The overall research questions have also changed. Some examples of iTrack research questions are:

• What element types / levels of granularity do searchers choose to see? In what sequence?
• How do users make use of document structure
  – in making relevance judgements?
  – in choosing the level of granularity to view?
• What level of element granularity constitutes the basis of a relevance decision? With what degree of certainty?
• How do factors such as topic knowledge influence
  – the choice of element granularity?
  – the number of elements viewed / the amount read?
  – relevance judgements?

3.3.1 Document corpus. In 2004 and 2005 the corpus was a collection of journal articles published by IEEE (also used in other INEX tracks); in addition, a collection of Lonely Planet travel guides was used in 2005. In 2006 and 2008 the Wikipedia collection, consisting of more than 650 000 XML-formatted encyclopaedic articles, was used in the iTrack as well as in other INEX tracks. In 2009 and 2010 a collection of Amazon and LibraryThing book reviews was collected specifically for the iTrack. This collection has later been adopted by CLEF's Social Book Search Lab.
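The corpora were XML-formatted, so the retrievable items were not only whole documents but elements at different levels of granularity (e.g. articles, sections, paragraphs). As a small illustration of what this means in practice, the sketch below enumerates candidate elements from a toy article; the tag names and article structure are hypothetical and do not reproduce the actual INEX markup.

```python
# Illustration of element granularity in an XML corpus: every article, section
# and paragraph is a retrievable (and assessable) unit of its own.
# The tag names and the toy article below are hypothetical and do not
# reproduce the actual INEX Wikipedia or IEEE markup.
import xml.etree.ElementTree as ET

ARTICLE = """
<article id="example">
  <title>Red imported fire ant</title>
  <section>
    <title>Bites</title>
    <p>Stings cause a burning sensation and may develop into pustules.</p>
    <p>Severe allergic reactions require medical attention.</p>
  </section>
</article>
"""

def candidate_elements(xml_text, granularities=("article", "section", "p")):
    """Yield (granularity, text) pairs for every retrievable element."""
    root = ET.fromstring(xml_text)
    for tag in granularities:
        for element in root.iter(tag):
            text = " ".join(" ".join(element.itertext()).split())
            yield tag, text

for granularity, text in candidate_elements(ARTICLE):
    print(f"[{granularity}] {text[:60]}")
```

In an element retrieval system, each of these elements is an independently rankable and assessable unit, which is what the research questions on granularity above refer to.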
3.3.2 Search system. Several search systems were developed by the iTrack organizers. In 2004 and 2005 the HyREX retrieval engine (available from http://www.is.informatik.uni-duisburg.de/projects/hyrex/) was used as the backend of the baseline system. In 2006 two different backends were used to test the difference between passage and element retrieval: CSIRO's Panoptic/Funnelback platform as the passage retrieval backend and TopX (only the TopX backend is available for download: http://topx.sourceforge.net/) from the Max Planck Institute for Informatics as the element retrieval backend. In 2008 and 2009 a retrieval system built within the Daffodil framework, developed at the University of Duisburg-Essen (more details are available at http://www.is.informatik.uni-duisburg.de/projects/daffodil/index.html), was used. In 2010 Daffodil was replaced with a system based on the ezDL framework (more information on ezDL can be found at http://www.is.informatik.uni-duisburg.de/projects/ezdl/). The system interface design, built within the Daffodil framework, was kept quite consistent throughout the whole period. In 2009-2010 the design consisted of three main components (see Figure 1): a query panel, a result list, and a window showing the details of the item retrieved from the result list. In previous years the document was shown in a separate interface.

Figure 1: INEX iTrack 2009-2010 interface

3.3.3 Task types. Table 3 contains an overview of the iTrack task categories. The task categories typically changed from year to year, with categories differing in complexity. The 2006 tasks should be noted in particular: they were two-dimensional, combining type and structure. This is an example of a 2006 fact-finding hierarchical task:

"A friend has just sent an email from an Internet café in the southern USA where she is on a hiking trip. She tells you that she has just stepped into an anthill of small red ants and has a large number of painful bites on her leg. She wants to know what species of ants they are likely to be, how dangerous they are and what she can do about the bites. What will you tell her?"

Table 3: iTrack task categories

Year | Task categories | Description
2004 | Background; Comparison | "Find background information about..."; "Find differences between..."
2005 | General; Challenging | The "general" tasks were designed to be simpler than the "more complex" challenging tasks.
2006 | Types: Decision making; Fact finding; Information gathering. Structure: Hierarchical; Parallel | The tasks were combined on two dimensions: type and structure.
2008 | Fact finding; Research | The tasks were designed to represent information needs typical for Wikipedia users: finding facts, such as the "biggest airport", or performing research to write a paper.
2009 | Broad; Narrow; Semi self-selected | Broad tasks represented needs that lead to thematic exploration. Narrow tasks represented relatively narrow topical information needs.
2010 | Explorative; Data gathering; Semi self-selected | The tasks were designed to represent different stages in information seeking processes.

The task types used in the 2010 iTrack were designed to simulate searchers at different stages of the search process, as defined by Kuhlthau. Below is an example of a 2010 explorative task:

"You are at an early stage of working on an assignment, and have decided to start exploring the literature of your topic. Your initial idea has led to one of the following three research needs:
(1) Find trustworthy books discussing the conspiracy theories which developed after the 9/11 terrorist attacks in New York.
(2) Find controversial books discussing the climate change and whether it is man-made or not.
(3) Find highly acclaimed novels that treat issues related to racial discrimination."

Semi self-selected tasks were used in 2009 and 2010. The participants were asked to "[t]ry to find books about a specific topic or of a certain type, but do not look for a specific title you already know."

3.3.4 Relevance scales. A variety of relevance scales have been used in the iTrack, and their complexity has varied considerably. In 2005, 2009 and 2010 a simple trinary relevance scale was used: the searchers were asked to assess elements as "relevant", "partially relevant" or "not relevant". In 2004 a ten-point relevance scale was used:

A: Very useful and Very specific
B: Very useful and Fairly specific
C: Very useful and Marginally specific
D: Fairly useful and Very specific
E: Fairly useful and Fairly specific
F: Fairly useful and Marginally specific
G: Marginally useful and Very specific
H: Marginally useful and Fairly specific
I: Marginally useful and Marginally specific
J: Contains no relevant information

In 2005 the track report noted concerns that the 2004 scale "was far too complex for the test persons to comprehend", which is why the simple scale was chosen in 2005. In 2006 and 2008 a two-dimensional scale with five possible scores was used, with the following definitions: Relevant, but too broad: contains relevant information, but also a substantial amount of other information. Relevant: contains highly relevant information, and is just the right size to be understandable. Relevant, but too narrow: contains relevant information, but needs more context to be understood. Partially relevant: has enough context to be understandable, but contains only partially relevant information. Not relevant: does not contain any relevant information that is useful for solving the task.
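The 2004 scale is effectively the cross product of two three-level dimensions (usefulness and specificity), plus a grade for no relevant information. A minimal sketch of how such a scale could be encoded for analysis is given below; the variable and function names are illustrative and not part of any iTrack tooling.

```python
# Minimal encoding of the 2004 iTrack relevance scale: grades A-I are the
# cross product of three usefulness levels and three specificity levels,
# and grade J marks "contains no relevant information".
# The names below are illustrative; this is not an official iTrack artifact.

USEFULNESS = ("Very useful", "Fairly useful", "Marginally useful")
SPECIFICITY = ("Very specific", "Fairly specific", "Marginally specific")

SCALE_2004 = {
    chr(ord("A") + i * 3 + j): (usefulness, specificity)
    for i, usefulness in enumerate(USEFULNESS)
    for j, specificity in enumerate(SPECIFICITY)
}

def describe(grade: str) -> str:
    """Return the verbal definition of a 2004 relevance grade."""
    grade = grade.upper()
    if grade == "J":
        return "J: Contains no relevant information"
    usefulness, specificity = SCALE_2004[grade]
    return f"{grade}: {usefulness} and {specificity}"

print(describe("A"))  # A: Very useful and Very specific
print(describe("F"))  # F: Fairly useful and Marginally specific
print(describe("J"))  # J: Contains no relevant information
```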
3.3.5 Analysis methods. iTrack data analysis has been performed using a combination of transaction logs and questionnaire data. Studies have investigated the types of transactions taking place, typical transaction patterns, and factors influencing transaction patterns.
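As an illustration of the kind of transaction-pattern analysis these studies rely on, the sketch below counts event types and adjacent event transitions within each session. The log format and event names are hypothetical, since the original iTrack logs are not publicly available.

```python
# Sketch of a simple transaction-pattern analysis over interaction logs.
# The log format (participant, task, event) and the event names are
# hypothetical, since the original iTrack logs are not publicly available.
from collections import Counter

LOG = [
    ("p01", "task1", "query"),
    ("p01", "task1", "view_result_list"),
    ("p01", "task1", "open_element"),
    ("p01", "task1", "assess_relevant"),
    ("p01", "task1", "query"),
    ("p01", "task1", "open_element"),
    ("p01", "task1", "assess_not_relevant"),
]

def event_frequencies(log):
    """Count how often each transaction type occurs."""
    return Counter(event for _, _, event in log)

def transition_frequencies(log):
    """Count adjacent event pairs within each participant/task session."""
    sessions = {}
    for participant, task, event in log:
        sessions.setdefault((participant, task), []).append(event)
    transitions = Counter()
    for events in sessions.values():
        transitions.update(zip(events, events[1:]))
    return transitions

print(event_frequencies(LOG).most_common())
print(transition_frequencies(LOG).most_common(3))
```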
3.4 Resources
The INEX iTrack evolved from 2004 to 2010. In the first years it complemented the research goals of the ad hoc track, re-using its topics with some modifications, with the intention of identifying how end-users react to element-based IR systems. The search system software, developed at the University of Duisburg-Essen, was gradually developed further, while the interface design was kept consistent. Questionnaires were also kept fairly consistent, addressing the same background factors from year to year.

3.5 Reporting
The iTrack proceedings reports document the study design. The software is documented on the project web sites, but the questionnaires are not well documented. The biggest issue is the availability of the transaction logs and questionnaire data, which are not openly available at the time of writing. The intention of the iTrack was that the data should be available only to the participating research groups for a limited period and then become available to others upon request. Unfortunately, the iTrack web sites are no longer available, which leaves the track reports as the main official documentation.

4 DISCUSSION
The experiences from the INEX iTrack have been manifold. With the collaborative effort of several research groups collecting data in a standardized manner, the iTrack resulted in large interactive IR datasets. The maximum number of participating research groups was 11 (in 2004 and 2005), with 119 searchers taking part in the 2005 experiment. The data can be compared across countries and, to a certain degree, across different user groups (although the majority of participants have been students in computer science and in library and information science). In addition, rich background data on many searchers have been collected.

The major challenge of the experiments is the design of tasks. These should be relevant for the participants and tailored following Borlund's simulated work task situation method [1]. This can be done either by agreeing upon a very specific user group to recruit participants from or by making very generic tasks. To design realistic experiments we should also take into account that today's information searchers search all the time, in a fragmented way and on various platforms.

Other challenges include the identification of factors that influence interaction. We need to be able to identify the degree to which we can make valid analyses based on the data.

Specific challenges related to re-use and data sharing in interactive IR include establishing standardized ways of documenting experiments, which is what the BIIRRR workshop addresses. It is also necessary to establish a forum for discussion and coordination of IIR experiment efforts.

5 SUMMARY AND FUTURE WORK
The INEX interactive track organized collaborative interactive information retrieval experiments from 2004 to 2010. In all, the iTrack initiated six rounds of experiments with changes in tasks, collections and search systems. The experiments resulted in data in the form of transaction logs and questionnaires. All experiments were documented in the INEX proceedings. Although the experiments evolved throughout the period, with significant impact on elements such as task types and relevance scales, the documentation is fairly consistent. The data are, however, at present not publicly available, and the systems that were used are only partially available. This raises the following questions and challenges for securing re-use of the INEX iTrack experiments, which will also be of value for re-use of interactive IR experiments in general:

• the need for a data repository for the preservation of research designs, including transaction logs and questionnaires along with code books and the documentation necessary for re-use
• a common repository for document corpora and search systems
• a discussion on the need for standardized questions in questionnaires in order to compare across experiments
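As a starting point for the first of these challenges, the sketch below shows the kind of machine-readable record a shared repository could hold for one experiment round, here filled in with details of the 2009 round as reported above. The field names and file names are illustrative assumptions, not an existing standard.

```python
# Sketch of a machine-readable record that a shared repository could hold
# for one experiment round, here filled in with details of the 2009 round.
# Field names and file names are illustrative assumptions, not a standard.
import json

experiment_record = {
    "experiment": "INEX iTrack 2009",
    "document_corpus": "Amazon and LibraryThing book reviews",
    "search_system": {"framework": "Daffodil",
                      "interface": ["query panel", "result list", "detail view"]},
    "task_types": ["broad", "narrow", "semi self-selected"],
    "relevance_scale": ["relevant", "partially relevant", "not relevant"],
    "instruments": ["background questionnaire", "task-specific questionnaires",
                    "post-experiment questionnaire"],
    "data_files": ["transaction_logs.csv", "questionnaire_responses.csv", "codebook.pdf"],
    "access": "on request",  # ideally an open licence and a persistent identifier
}

print(json.dumps(experiment_record, indent=2))
```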
6 ACKNOWLEDGEMENTS
I would like to thank Norbert Fuhr and Thomas Beckers for valuable information about the current status of the iTrack systems and data.

REFERENCES
[1] Pia Borlund. 2003. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research 8, 3 (2003). http://informationr.net/ir/8-3/paper152.html
[2] Barbara Hammer-Aebi, Kirstine Wilfred Christensen, Haakon Lund, and Birger Larsen. 2006. Users, structured documents and overlap: interactive searching of elements and the influence of context on search behaviour. In Proceedings of the 1st International Conference on Information Interaction in Context (IIiX). ACM, New York, NY, USA, 46–55. https://doi.org/10.1145/1164820.1164833
[3] Birger Larsen, Saadia Malik, and Anastasios Tombros. 2006. The Interactive Track at INEX 2005. In Advances in XML Information Retrieval and Evaluation, Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Gabriella Kazai (Eds.). Springer, Berlin, 398–410. https://doi.org/10.1007/978-3-540-34963-1_30
[4] Birger Larsen, Anastasios Tombros, and Saadia Malik. 2006. Is XML retrieval meaningful to users? Searcher preferences for full documents vs. elements. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06). ACM, New York, NY, USA, 663–664. https://doi.org/10.1145/1148170.1148306
[5] Saadia Malik, Anastasios Tombros, and Birger Larsen. 2007. The Interactive Track at INEX 2006. In Comparative Evaluation of XML Information Retrieval Systems, Norbert Fuhr, Mounia Lalmas, and Andrew Trotman (Eds.). Vol. 4518. Springer, Berlin, 387–399. http://www.springerlink.com/content/d4rv145135659g38/
[6] Ragnar Nordlie and Nils Pharo. 2012. Seven Years of INEX Interactive Retrieval Experiments – Lessons and Challenges. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics (Lecture Notes in Computer Science), Tiziana Catarci, Pamela Forner, Djoerd Hiemstra, Anselmo Peñas, and Giuseppe Santucci (Eds.). Springer, Berlin, 13–23.
[7] Nils Pharo. 2008. The effect of granularity and order in XML element retrieval. Information Processing and Management 44, 5 (Sept. 2008), 1732–1740. https://doi.org/10.1016/j.ipm.2008.05.004
[8] Nils Pharo, Thomas Beckers, Ragnar Nordlie, and Norbert Fuhr. 2011. Overview of the INEX 2010 Interactive Track. In Comparative Evaluation of Focused Retrieval, Shlomo Geva, Jaap Kamps, Ralf Schenkel, and Andrew Trotman (Eds.). Vol. 6932. Springer, Berlin, 227–235.
[9] Nils Pharo and Astrid Krahn. 2011. The effect of task type on preferred element types in an XML-based retrieval system. Journal of the American Society for Information Science and Technology 62, 9 (Sept. 2011), 1717–1726. https://doi.org/10.1002/asi.21587
[10] Nils Pharo, Ragnar Nordlie, and Khairun Nisa Fachry. 2009. Overview of the INEX 2008 Interactive Track. In Advances in Focused Retrieval, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Vol. 5631. Springer, Berlin, 300–313.
[11] Nils Pharo, Ragnar Nordlie, Norbert Fuhr, Thomas Beckers, and Khairun Nisa Fachry. 2010. Overview of the INEX 2009 Interactive Track. In Focused Retrieval and Evaluation, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Vol. 6203. Springer, Berlin, 303–311.
[12] Anastasios Tombros, Birger Larsen, and Saadia Malik. 2005. The Interactive Track at INEX 2004. In Advances in XML Information Retrieval, Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Zoltán Szlávik (Eds.). Springer, Berlin, 410–423.