INEX iTrack Revisited: Exploring the Potential for Re-use

Nils Pharo
Oslo Metropolitan University, Oslo, Norway
nilsp@oslomet.no

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
This paper presents the experiences from the INEX iTrack experiments conducted over a period of seven years. The purpose is to present the infrastructure of the experiments with the aim of identifying its potential for re-use in new experiments. In light of this, the paper discusses the terminology, research design, methodology, resources and reporting of the INEX iTrack.

CCS CONCEPTS
• Information systems → Users and interactive retrieval; • Human-centered computing → User studies; Empirical studies in HCI.

KEYWORDS
Interactive information retrieval, methodology, open science

1 INTRODUCTION
The Initiative for the Evaluation of XML Retrieval (INEX) started in 2002 as a set of experiments following the Cranfield model. The purpose of INEX was initially to test the potential of XML elements as items for retrieval, as an alternative to full-text documents, document parts and document passages. The INEX interactive track (iTrack) was run as a subtrack from 2004 to 2010 [3, 5, 8, 10–12], with the goal of studying how end-users query, interact with, and evaluate documents and document parts. The iTrack was organized in a distributed way: participating groups from universities and other research institutions across the world collected data in an experimental setting, following a standardised data collection procedure. In this way, it was possible to collect rather large data sets of user-system interaction.

In this paper we investigate the methodological approach used in the INEX iTrack. The intention is to explore its potential for re-use and to identify experience that can be of value for establishing a common methodology for interactive information retrieval (IR) experiments. The paper is structured in the following way: the first part describes the method; we then present the iTrack infrastructure, i.e. the terminology, research design, methodology, resources and reporting used. Thereafter follows a discussion of challenges, before the final part with summary and conclusions.

2 METHOD
In order to identify the infrastructure of the INEX iTrack we investigate the reports published in the proceedings from 2004 to 2010. The structure of the iTrack reports was kept fairly consistent across the years. The experimental set-up included a presentation of the tasks, the search system, the document corpus, and the procedure for data collection. Results were presented in the proceedings reports to varying degrees; some years the experiments had not ended by the proceedings report deadline. We do not report any of the findings; these can be found in the proceedings reports and in a summary of the seven years of iTrack experiments [6].

3 THE INEX ITRACK INFRASTRUCTURE

3.1 Terminology
During the iTrack years, the terminology used went through some changes. In particular, the first year (2004) stands out with an idiosyncratic terminology. Table 1 shows the distribution of central terms used over the period, compared according to their intended use, i.e. the concept (infrastructure element) they represent. This means, e.g., that from 2005 to 2010 the term "document corpus" was used consistently to refer to the collection of documents used in the experiments, whereas the term "Tasks" was used consistently from 2008 to 2010.

Table 1: Consistent INEX iTrack terminology over time

Year/period | Common terms
2005-2010 | Document corpus, relevance assessments, experimental procedure
2006-2010 | Search system, logging
2008-2010 | Tasks, participating groups

Table 2 provides an overview of central concepts, their definitions, and the terminology where term use has changed over time. This is not an exhaustive overview; only concepts used over several years of experiments are included. Although term use has changed over time, it is easy to identify the common infrastructure elements from the proceedings reports. Most confusing is the use of the term "Task", which in 2005 and 2006 also referred to the different experimental set-ups. In 2006, e.g., three different tasks were described as "Task A - Common Baseline System with IEEE Collection", "Task B - Participation with Own Element Retrieval System" and "Task C - Searching the Lonely Planet Collection", respectively.

Table 2: INEX iTrack terminological changes over time

Concept | Definition | Distribution
Task | The task(s) assigned to participants, what they are asked to find information about and its context | Topics (2004), tasks/topics (2005), search tasks (2006), tasks (2008-2010)
Search system | The system(s) designed to be used in the experiments | System (2004, 2005), search system (2006-2010)
Document corpus | The documents searchable in the search system | Document corpus (2005-2010)
Experimental procedure | The procedure used for performing the experiment | Experimental protocol (2004), experimental procedure (2005-2010)
3.2 Research design
The research design used in the iTrack experiments has been stable. A generic representation of the experimental procedure can be described in the following way:

(1) General questionnaire. The participant fills out a questionnaire on background knowledge, demographic data, etc. Questionnaires were on paper (2004-2006) or online (2008-2010).
(2) Training task. The participant is given a training task to introduce them to the system's design and functionality.
(3) Task 1
    (a) Task-specific questionnaire. The participant fills out a questionnaire on task-specific knowledge.
    (b) Search session. The participant interacts with the system in order to perform the task.
    (c) Post-task questionnaire. The participant fills out questionnaires related to the experience with the system, the difficulty of solving the task, etc.
(4) Additional tasks, performed as described in step 3.
(5) Post-experiment questionnaire. The participant fills out a questionnaire to provide feedback about the search system.

In addition to the common experimental procedure, the participating groups had the opportunity to perform their own experiments. In 2005 and 2006 this was explicitly organized so that research groups could use their own systems and compare their results against the system developed for the experiments as a baseline.

Very little analysis was performed as part of the iTrack work itself. Studies performed on iTrack data and reported in journal articles and conference proceedings papers have used transaction log analysis, statistical analysis of questionnaire data, screen capturing and eye-tracking. The studies have, e.g., investigated users' preferences with respect to element granularity [2, 4, 7] and the effect of task types on preferred elements [9].
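Because the same generic procedure was run at many sites, a re-usable version of it would benefit from being expressed in machine-readable form rather than only as prose. The following is a minimal sketch of such an encoding in Python; the step names, questionnaire identifiers and fields are illustrative assumptions and are not taken from the original iTrack materials.

```python
# A minimal, machine-readable version of the generic iTrack session procedure,
# so the same protocol can be generated, run and archived at every site.
# Step names, questionnaire identifiers and fields are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str             # e.g. "general_questionnaire"
    description: str      # human-readable instruction for the experimenter
    instrument: str = ""  # questionnaire identifier, empty if no instrument


def session_protocol(n_tasks: int = 2) -> List[Step]:
    """Build the generic iTrack procedure for a session with n_tasks search tasks."""
    steps = [
        Step("general_questionnaire", "Background knowledge and demographics", "Q-general"),
        Step("training_task", "Introduce the participant to the system"),
    ]
    for i in range(1, n_tasks + 1):
        steps += [
            Step(f"task_{i}_pre_questionnaire", "Task-specific knowledge", f"Q-pre-{i}"),
            Step(f"task_{i}_search_session", "Interaction with the search system (logged)"),
            Step(f"task_{i}_post_questionnaire", "System experience and task difficulty", f"Q-post-{i}"),
        ]
    steps.append(Step("post_experiment_questionnaire", "Feedback on the search system", "Q-final"))
    return steps


if __name__ == "__main__":
    for step in session_protocol(n_tasks=2):
        print(f"{step.name}: {step.description}")
```

Encoding the protocol as data would let each participating group generate identical session scripts and archive them alongside the resulting logs and questionnaires.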
3.3 Methodology
The initial purpose of the iTrack was twofold: "to investigate the behaviour of users when interacting with components of XML documents, and secondly to investigate and develop approaches for XML retrieval which are effective in user-based environments". In the first two years, the iTrack was closely connected with the INEX ad hoc track, using the ad hoc track's document corpus and topics/tasks. Throughout the whole period, the tasks were formulated as simulated work task situations [1]. Over the years, the methodology changed with respect to document corpus, search systems, task types, relevance scales and analysis. The overall research questions have also changed. Some examples of iTrack research questions are:

• What element types / levels of granularity do searchers choose to see? In what sequence?
• How do users make use of document structure
  – in making relevance judgements?
  – in choosing the level of granularity to view?
• What level of element granularity constitutes the basis of a relevance decision? With what degree of certainty?
• How do factors such as topic knowledge influence
  – the choice of element granularity?
  – the number of elements viewed / the amount read?
  – relevance judgements?

3.3.1 Document corpus. In 2004 and 2005 the corpus was a collection of journal articles published by IEEE (also used in other INEX tracks); in addition, a collection of Lonely Planet travel guides was used in 2005. In 2006 and 2008 the Wikipedia collection, consisting of more than 650 000 XML-formatted encyclopaedic articles, was used in the iTrack as well as in other INEX tracks. In 2009 and 2010 a collection of Amazon and LibraryThing book reviews was collected specifically for the iTrack. This collection has later been adopted by CLEF's Social Book Search Lab.
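The corpora were XML-formatted, so the retrievable items were not only whole documents but elements at different levels of granularity (e.g. articles, sections, paragraphs). As a small illustration of what this means in practice, the sketch below enumerates candidate elements from a toy article; the tag names and article structure are hypothetical and do not reproduce the actual INEX markup.

```python
# Illustration of element granularity in an XML corpus: every article, section
# and paragraph is a retrievable (and assessable) unit of its own.
# The tag names and the toy article below are hypothetical and do not
# reproduce the actual INEX Wikipedia or IEEE markup.
import xml.etree.ElementTree as ET

ARTICLE = """
<article id="example">
  <title>Red imported fire ant</title>
  <section>
    <title>Bites</title>
    <p>Stings cause a burning sensation and may develop into pustules.</p>
    <p>Severe allergic reactions require medical attention.</p>
  </section>
</article>
"""

def candidate_elements(xml_text, granularities=("article", "section", "p")):
    """Yield (granularity, text) pairs for every retrievable element."""
    root = ET.fromstring(xml_text)
    for tag in granularities:
        for element in root.iter(tag):
            text = " ".join(" ".join(element.itertext()).split())
            yield tag, text

for granularity, text in candidate_elements(ARTICLE):
    print(f"[{granularity}] {text[:60]}")
```

In an element retrieval system, each of these elements is an independently rankable and assessable unit, which is what the research questions on granularity above refer to.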
3.3.2 Search system. Several search systems were developed by the iTrack organizers. In 2004 and 2005 the HyREX retrieval engine (available from http://www.is.informatik.uni-duisburg.de/projects/hyrex/) was used as the backend of the baseline system. In 2006 two different backends were used to test the difference between passage and element retrieval: CSIRO's Panoptic/Funnelback platform as the passage retrieval backend and TopX (only the TopX backend is available for download: http://topx.sourceforge.net/) from the Max Planck Institute for Informatics as the element retrieval backend. In 2008 and 2009 a retrieval system built within the Daffodil framework, developed at the University of Duisburg-Essen (more details are available at http://www.is.informatik.uni-duisburg.de/projects/daffodil/index.html), was used. In 2010 Daffodil was replaced with a system based on the ezDL framework (more information on ezDL can be found at http://www.is.informatik.uni-duisburg.de/projects/ezdl/). The system interface design, built within the Daffodil framework, was kept quite consistent throughout the whole period. In 2009-2010 the design consisted of three main components (see Figure 1): a query panel, a result list, and a window showing the details of the item retrieved from the result list. In previous years the document was shown in a separate interface.

Figure 1: INEX iTrack 2009-2010 interface

3.3.3 Task types. Table 3 contains an overview of the iTrack task categories. The task categories typically changed from year to year, with categories differing in complexity. The 2006 tasks should be noted in particular: they were two-dimensional, combining type and structure. This is an example of a 2006 fact-finding hierarchical task:

"A friend has just sent an email from an Internet café in the southern USA where she is on a hiking trip. She tells you that she has just stepped into an anthill of small red ants and has a large number of painful bites on her leg. She wants to know what species of ants they are likely to be, how dangerous they are and what she can do about the bites. What will you tell her?"

Table 3: iTrack task categories

Year | Task categories | Description
2004 | Background; Comparison | "Find background information about..."; "Find differences between..."
2005 | General; Challenging | The "general" tasks were designed to be simpler than the "more complex" challenging tasks.
2006 | Types: Decision making; Fact finding; Information gathering. Structure: Hierarchical; Parallel | The tasks were combined on two dimensions: type and structure.
2008 | Fact finding; Research | The tasks were designed to represent information needs typical for Wikipedia users: finding facts, such as the "biggest airport", or performing research to write a paper.
2009 | Broad; Narrow; Semi self-selected | Broad tasks represented needs that lead to thematic exploration. Narrow tasks represented relatively narrow topical information needs.
2010 | Explorative; Data gathering; Semi self-selected | The tasks were designed to represent different stages in information seeking processes.

The task types used in the 2010 iTrack were designed to simulate searchers at different stages of the search process, as defined by Kuhlthau. Below is an example of a 2010 explorative task:

"You are at an early stage of working on an assignment, and have decided to start exploring the literature of your topic. Your initial idea has led to one of the following three research needs:
(1) Find trustworthy books discussing the conspiracy theories which developed after the 9/11 terrorist attacks in New York.
(2) Find controversial books discussing the climate change and whether it is man-made or not.
(3) Find highly acclaimed novels that treat issues related to racial discrimination."

Semi self-selected tasks were used in 2009 and 2010. The participants were asked to "[t]ry to find books about a specific topic or of a certain type, but do not look for a specific title you already know."

3.3.4 Relevance scales. A variety of relevance scales have been used in the iTrack, and their complexity has varied considerably. In 2005, 2009 and 2010 a simple trinary relevance scale was used: the searchers were asked to assess elements as "relevant", "partially relevant" or "not relevant". In 2004 a ten-point relevance scale was used:

A: Very useful and Very specific
B: Very useful and Fairly specific
C: Very useful and Marginally specific
D: Fairly useful and Very specific
E: Fairly useful and Fairly specific
F: Fairly useful and Marginally specific
G: Marginally useful and Very specific
H: Marginally useful and Fairly specific
I: Marginally useful and Marginally specific
J: Contains no relevant information

In 2005 the track report noted concerns that the 2004 scale "was far too complex for the test persons to comprehend", which is why the simple scale was chosen in 2005. In 2006 and 2008 a two-dimensional scale with five possible scores was used, with the following definitions: Relevant, but too broad: contains relevant information, but also a substantial amount of other information. Relevant: contains highly relevant information, and is just the right size to be understandable. Relevant, but too narrow: contains relevant information, but needs more context to be understood. Partially relevant: has enough context to be understandable, but contains only partially relevant information. Not relevant: does not contain any relevant information that is useful for solving the task.
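The 2004 scale is effectively the cross product of two three-level dimensions (usefulness and specificity), plus a grade for no relevant information. A minimal sketch of how such a scale could be encoded for analysis is given below; the variable and function names are illustrative and not part of any iTrack tooling.

```python
# Minimal encoding of the 2004 iTrack relevance scale: grades A-I are the
# cross product of three usefulness levels and three specificity levels,
# and grade J marks "contains no relevant information".
# The names below are illustrative; this is not an official iTrack artifact.

USEFULNESS = ("Very useful", "Fairly useful", "Marginally useful")
SPECIFICITY = ("Very specific", "Fairly specific", "Marginally specific")

SCALE_2004 = {
    chr(ord("A") + i * 3 + j): (usefulness, specificity)
    for i, usefulness in enumerate(USEFULNESS)
    for j, specificity in enumerate(SPECIFICITY)
}

def describe(grade: str) -> str:
    """Return the verbal definition of a 2004 relevance grade."""
    grade = grade.upper()
    if grade == "J":
        return "J: Contains no relevant information"
    usefulness, specificity = SCALE_2004[grade]
    return f"{grade}: {usefulness} and {specificity}"

print(describe("A"))  # A: Very useful and Very specific
print(describe("F"))  # F: Fairly useful and Marginally specific
print(describe("J"))  # J: Contains no relevant information
```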
3.3.5 Analysis methods. iTrack data analysis has been performed using a combination of transaction logs and questionnaire data. Studies have investigated the types of transactions taking place, typical transaction patterns, and factors influencing transaction patterns.
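As an illustration of the kind of transaction-pattern analysis these studies rely on, the sketch below counts event types and adjacent event transitions within each session. The log format and event names are hypothetical, since the original iTrack logs are not publicly available.

```python
# Sketch of a simple transaction-pattern analysis over interaction logs.
# The log format (participant, task, event) and the event names are
# hypothetical, since the original iTrack logs are not publicly available.
from collections import Counter

LOG = [
    ("p01", "task1", "query"),
    ("p01", "task1", "view_result_list"),
    ("p01", "task1", "open_element"),
    ("p01", "task1", "assess_relevant"),
    ("p01", "task1", "query"),
    ("p01", "task1", "open_element"),
    ("p01", "task1", "assess_not_relevant"),
]

def event_frequencies(log):
    """Count how often each transaction type occurs."""
    return Counter(event for _, _, event in log)

def transition_frequencies(log):
    """Count adjacent event pairs within each participant/task session."""
    sessions = {}
    for participant, task, event in log:
        sessions.setdefault((participant, task), []).append(event)
    transitions = Counter()
    for events in sessions.values():
        transitions.update(zip(events, events[1:]))
    return transitions

print(event_frequencies(LOG).most_common())
print(transition_frequencies(LOG).most_common(3))
```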
3.4 Resources
The INEX iTrack evolved from 2004 to 2010. In the first years it complemented the research goals of the ad hoc track, re-using its topics with some modifications, with the intention of identifying how end-users react to element-based IR systems. The search system software, developed at the University of Duisburg-Essen, was gradually developed further, while the interface design was kept consistent. Questionnaires were also kept fairly consistent, addressing the same background factors from year to year.

3.5 Reporting
The iTrack proceedings reports document the study design. The software is documented on the project web sites, but the questionnaires are not well documented. The biggest issue is the availability of the transaction logs and questionnaire data, which are not openly available at the time of writing. The intention of the iTrack was that the data should be available only to the participating research groups for a limited period and then become available to others upon request. Unfortunately, the iTrack web sites are no longer available, which leaves the track reports as the main official documentation.

4 DISCUSSION
The experiences from the INEX iTrack have been manifold. With the collaborative effort of several research groups collecting data in a standardized manner, the iTrack resulted in large interactive IR datasets. The maximum number of participating research groups was 11 (in 2004 and 2005), with 119 searchers taking part in the 2005 experiment. The data can be compared across countries and, to a certain degree, across different user groups (although the majority of participants have been students in computer science and in library and information science). In addition, rich background data on many searchers have been collected.

The major challenge of the experiments is the design of tasks. These should be relevant for the participants and tailored following Borlund's simulated work task situation method [1]. This can be done either by agreeing upon a very specific user group to recruit participants from or by making very generic tasks. To design realistic experiments we should also take into account that today's information searchers search all the time, in a fragmented way and on various platforms.

Other challenges include the identification of factors that influence interaction. We need to be able to identify the degree to which we can make valid analyses based on the data.

Specific challenges related to re-use and data sharing in interactive IR include establishing standardized ways of documenting experiments, which is what the BIIRRR workshop addresses. It is also necessary to establish a forum for discussion and coordination of IIR experiment efforts.

5 SUMMARY AND FUTURE WORK
The INEX interactive track organized collaborative interactive information retrieval experiments from 2004 to 2010. In all, the iTrack initiated six rounds of experiments with changes in tasks, collections and search systems. The experiments resulted in data in the form of transaction logs and questionnaires. All experiments were documented in the INEX proceedings. Although the experiments evolved throughout the period, with significant impact on elements such as task types and relevance scales, the documentation is fairly consistent. The data are, however, at present not publicly available, and the systems that were used are only partially available. This raises the following questions and challenges for securing re-use of the INEX iTrack experiments, which will also be of value for re-use of interactive IR experiments in general:

• the need for a data repository for the preservation of research designs, including transaction logs and questionnaires along with code books and the documentation necessary for re-use
• a common repository for document corpora and search systems
• a discussion on the need for standardized questions in questionnaires in order to compare across experiments
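As a starting point for the first of these challenges, the sketch below shows the kind of machine-readable record a shared repository could hold for one experiment round, here filled in with details of the 2009 round as reported above. The field names and file names are illustrative assumptions, not an existing standard.

```python
# Sketch of a machine-readable record that a shared repository could hold
# for one experiment round, here filled in with details of the 2009 round.
# Field names and file names are illustrative assumptions, not a standard.
import json

experiment_record = {
    "experiment": "INEX iTrack 2009",
    "document_corpus": "Amazon and LibraryThing book reviews",
    "search_system": {"framework": "Daffodil",
                      "interface": ["query panel", "result list", "detail view"]},
    "task_types": ["broad", "narrow", "semi self-selected"],
    "relevance_scale": ["relevant", "partially relevant", "not relevant"],
    "instruments": ["background questionnaire", "task-specific questionnaires",
                    "post-experiment questionnaire"],
    "data_files": ["transaction_logs.csv", "questionnaire_responses.csv", "codebook.pdf"],
    "access": "on request",  # ideally an open licence and a persistent identifier
}

print(json.dumps(experiment_record, indent=2))
```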
6 ACKNOWLEDGEMENTS
I would like to thank Norbert Fuhr and Thomas Beckers for valuable information about the current status of the iTrack systems and data.

REFERENCES
[1] Pia Borlund. 2003. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research 8, 3 (2003). http://informationr.net/ir/8-3/paper152.html
[2] Barbara Hammer-Aebi, Kirstine Wilfred Christensen, Haakon Lund, and Birger Larsen. 2006. Users, structured documents and overlap: interactive searching of elements and the influence of context on search behaviour. In Proceedings of the 1st International Conference on Information Interaction in Context (IIiX). ACM, New York, NY, USA, 46–55. https://doi.org/10.1145/1164820.1164833
[3] Birger Larsen, Saadia Malik, and Anastasios Tombros. 2006. The Interactive Track at INEX 2005. In Advances in XML Information Retrieval and Evaluation, Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Gabriella Kazai (Eds.). Springer, Berlin, 398–410. https://doi.org/10.1007/978-3-540-34963-1_30
[4] Birger Larsen, Anastasios Tombros, and Saadia Malik. 2006. Is XML retrieval meaningful to users? Searcher preferences for full documents vs. elements. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06). ACM, New York, NY, USA, 663–664. https://doi.org/10.1145/1148170.1148306
[5] Saadia Malik, Anastasios Tombros, and Birger Larsen. 2007. The Interactive Track at INEX 2006. In Comparative Evaluation of XML Information Retrieval Systems, Norbert Fuhr, Mounia Lalmas, and Andrew Trotman (Eds.). Vol. 4518. Springer, Berlin, 387–399. http://www.springerlink.com/content/d4rv145135659g38/
[6] Ragnar Nordlie and Nils Pharo. 2012. Seven Years of INEX Interactive Retrieval Experiments – Lessons and Challenges. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics (Lecture Notes in Computer Science), Tiziana Catarci, Pamela Forner, Djoerd Hiemstra, Anselmo Peñas, and Giuseppe Santucci (Eds.). Springer, Berlin, 13–23.
[7] Nils Pharo. 2008. The effect of granularity and order in XML element retrieval. Information Processing and Management 44, 5 (Sept. 2008), 1732–1740. https://doi.org/10.1016/j.ipm.2008.05.004
[8] Nils Pharo, Thomas Beckers, Ragnar Nordlie, and Norbert Fuhr. 2011. Overview of the INEX 2010 Interactive Track. In Comparative Evaluation of Focused Retrieval, Shlomo Geva, Jaap Kamps, Ralf Schenkel, and Andrew Trotman (Eds.). Vol. 6932. Springer, Berlin, 227–235.
[9] Nils Pharo and Astrid Krahn. 2011. The effect of task type on preferred element types in an XML-based retrieval system. Journal of the American Society for Information Science and Technology 62, 9 (Sept. 2011), 1717–1726. https://doi.org/10.1002/asi.21587
[10] Nils Pharo, Ragnar Nordlie, and Khairun Nisa Fachry. 2009. Overview of the INEX 2008 Interactive Track. In Advances in Focused Retrieval, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Vol. 5631. Springer, Berlin, 300–313.
[11] Nils Pharo, Ragnar Nordlie, Norbert Fuhr, Thomas Beckers, and Khairun Nisa Fachry. 2010. Overview of the INEX 2009 Interactive Track. In Focused Retrieval and Evaluation, Shlomo Geva, Jaap Kamps, and Andrew Trotman (Eds.). Vol. 6203. Springer, Berlin, 303–311.
[12] Anastasios Tombros, Birger Larsen, and Saadia Malik. 2005. The Interactive Track at INEX 2004. In Advances in XML Information Retrieval, Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Zoltán Szlávik (Eds.). Springer, Berlin, 410–423.