=Paper=
{{Paper
|id=Vol-2337/paper5
|storemode=property
|title=Experiences with the 2013-2016 CLEF Interactive Information Retrieval Tracks
|pdfUrl=https://ceur-ws.org/Vol-2337/paper5.pdf
|volume=Vol-2337
|authors=Vivien Petras,Marijn Koolen,Maria Gäde,Toine Bogers
|dblpUrl=https://dblp.org/rec/conf/chiir/PetrasKGB19
}}
==Experiences with the 2013-2016 CLEF Interactive Information Retrieval Tracks==
Vivien Petras (Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Berlin; vivien.petras@ibi.hu-berlin.de)
Marijn Koolen (Humanities Cluster, Royal Netherlands Academy of Arts and Sciences, Amsterdam; marijn.koolen@di.huc.knaw.nl)
Maria Gäde (Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Berlin; maria.gaede@ibi.hu-berlin.de)
Toine Bogers (Science and Information Studies, Department of Communication & Psychology, Aalborg University Copenhagen, Copenhagen; toine@hum.aau.dk)

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT

This paper describes our experiences with the interactive IR tracks organized at CLEF from 2013-2016 and aggregates the lessons learned with each consecutive instance of the lab. We end with a summary of practical insights and lessons for future collaborative interactive IR evaluation exercises and for potential re-use scenarios.

KEYWORDS

interactive information retrieval, evaluation, CHiC, SBS, CLEF, book search, information seeking

1 INTRODUCTION

After the INEX (Initiative for Evaluation of XML Retrieval) Interactive Track ended in 2010 [23], there was a gap in interactive information retrieval (IIR) experimentation at the large-scale evaluation initiatives. The interactive track at the Cultural Heritage in CLEF (Conference and Labs of the Evaluation Forum) lab (iCHiC) revived this in 2013 and merged with the INEX Social Book Search track to form the Social Book Search (SBS) lab at CLEF, running an interactive track in 2014-2016.

This paper provides a chronological overview of the development and history of these two IIR initiatives and their outcomes. We focus on the lessons learned for future collaborative IIR evaluation exercises and for potential re-use scenarios. We start by chronicling the timeline of the different interactive labs that were organized in Sections 2-6. We then highlight the most important lessons learned for the configuration of IIR evaluation experiments. We conclude by discussing consequent activities and insights for the re-use of IIR resources.

2 CULTURAL HERITAGE IN CLEF @ CLEF 2011-2012

2.1 Setup

The EU-funded PROMISE project (Participative Research labOratory for Multimedia and Multilingual Information Systems Evaluation, http://www.promise-noe.eu) ran from 2010-2013 with the goal of providing a virtual and open laboratory for research and experimentation with complex multimodal and multilingual information systems [7]. In order to evaluate its concepts and prototypes, three use cases were defined to guide real-world requirements analysis and contextual testing: ‘Unlocking Cultural Heritage’ (information access to cultural heritage material), ‘Searching for Innovation’ (patent search) and ‘Visual Clinical Decision Support’ (radiology image retrieval).

For the ‘Unlocking Cultural Heritage’ (CH) use case, a workshop at the 2011 CLEF conference was organized in order to review existing information access use cases in the CH domain and then develop retrieval scenarios that could be used for evaluating CH information access systems [11]. In addition to qualitative usability tests of user interfaces, transaction log analyses and Cranfield-style text retrieval evaluation, other forms of user studies were also considered as viable evaluation approaches. The study and analysis of different interaction patterns with CH materials was the main interest of the workshop's participants (http://www.promise-noe.eu/chic-2011/).

At the 2012 CLEF conference, a pilot evaluation exercise was organized for the CH domain, progressing work from the workshop format to an evaluation lab [26]. It was based on a real-life collection of CH material: the complete index of the Europeana digital library (www.europeana.eu), which encompassed ca. 23 million metadata records in 30 different languages at that time. The information needs were based on 50 queries (harvested from Europeana logfiles), translated into English, French and German. The tasks in this pilot exercise comprised both a conventional system-oriented scenario (i.e., ad-hoc retrieval) as well as more specialized retrieval scenarios for the CH domain: the semantic enrichment and variability tasks (http://www.promise-noe.eu/chic-2012/home). The evaluation followed the Cranfield paradigm by pooling the retrieval results and assessing their relevance using human assessors.
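For readers less familiar with pooling, the minimal sketch below illustrates how such an assessment pool is typically formed from submitted runs; it is an illustrative example only, not the actual CHiC evaluation tooling, and the run/topic identifiers and pool depth are assumptions.

```python
from collections import defaultdict

def build_pool(runs, depth=100):
    """Cranfield-style pooling: for each topic, the union of the top-`depth`
    documents from every submitted run forms the set judged by assessors."""
    pool = defaultdict(set)
    for run in runs:  # each run maps a topic id to a ranked list of document ids
        for topic_id, ranked_docs in run.items():
            pool[topic_id].update(ranked_docs[:depth])
    return pool

# Hypothetical example with two system runs for one topic.
run_a = {"CHIC-042": ["doc_17", "doc_03", "doc_99"]}
run_b = {"CHIC-042": ["doc_03", "doc_58", "doc_17"]}
print(build_pool([run_a, run_b], depth=2)["CHIC-042"])  # {'doc_17', 'doc_03', 'doc_58'}
```

Only the pooled documents receive human relevance judgments; documents outside the pool are typically left unjudged (and usually treated as non-relevant) in the subsequent system evaluation.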
2.2 Lessons learned

Although the 2011 CHiC workshop had already emphasized that a focus on user interaction patterns was an important evaluation aspect for the CH domain, this first CHiC lab in 2012 had no interactive tasks. Instead, it utilized a document collection based on Europeana and used queries harvested from Europeana logs to construct information needs. The vision was to extend the ad-hoc style retrieval evaluation with interactive and other evaluation scenarios (particularly result presentations and alternative methods for relevance assessments) in the next phases.

The Europeana document collection, albeit a real-world collection, turned out to be very challenging. While an effort was made to normalize the provided metadata by wrapping it in a special XML format and removing certain metadata fields, the content in the metadata had very different descriptive qualities, depending on the original content provider. Both the data sparseness and multilinguality of the content posed serious challenges for the participants. Image data, such as thumbnails of graphical material in Europeana, could not be provided due to copyright reasons.

Some of the provided topics were not suitable for relevance assessment, because information needs could not always be unambiguously inferred from the provided queries. The topics mostly contained short queries of 1-3 words and only half of them had short descriptions added, which did not help much when the topic was vague. For the CH use case, IIR studies focusing on interaction patterns were needed, so an additional interactive task was proposed for the next round.

3 INTERACTIVE CHIC TRACK @ CLEF 2013

3.1 Setup

The Interactive Track (http://www.promise-noe.eu/chic-2013/tasks/interactive-task) at the CHiC 2013 lab at CLEF (iCHiC) aimed at building a bridge for IIR and behavior researchers to work in a TREC-style evaluation environment. The idea was to develop a data collection of IIR evaluation data, which could be re-used and built upon. This task intentionally used a subset of the document collection used in the other CHiC ad-hoc retrieval experimental tasks to allow for later triangulation of results. Based on approximately 1 million metadata records from the English Europeana collection and representing a broad range of CH objects, a simple search interface was envisioned that would allow for browse and search interactions with the metadata records for the IIR experiments [25]. One non-goal oriented task (based on Borlund's simulated work tasks [4, 5]), which simulated "casual" use of the system ("spend 20 minutes on the system and explore"), was provided to all experiment participants.

The same experimental infrastructure, which hosted the web-based interfaces and documents and logged the interactions [19], was provided to all participating research groups. All groups had to recruit at least 30 participants: at least 10 of them had to be observed in the lab, while at least 20 could use the system remotely. Apart from the logged interactions on the systems, participants also filled out pre- and post-task questionnaires, assessed their experience on the User Engagement Scale [24] and evaluated the usefulness of found objects (relevance assessment) and the interface (usability).
3.2 Lessons learned

The iCHiC track ended up collecting data on 208 experiment participants and their interactions from four participating research groups. As a pilot experiment for collaborative data gathering, this first interactive task was successful overall.

The most important lesson learned from iCHiC, and the reason why it was merged with the INEX Social Book Search lab (see Section 4), was that the provided metadata records were not "rich" enough in content to provide an interesting case study for casual browsing and search. The sparseness of the document collection had already been a problem for the ad-hoc retrieval tests, and real users did not like the records any better. The actual purpose of iCHiC—to study users' interactions with the content—was hampered by the lack of interesting content.

The experimental set-up and questionnaire instruments represented a significant effort for the participants to complete. However, the collected data was deemed necessary for further analysis.

An original plan for the set-up of this task was to provide the metadata collection, simulated work tasks, and the experimental setup (questionnaires, logging protocol) to the participating research groups and have them provide their own infrastructure for data gathering. After discussions, the organizers concluded that having different groups each building infrastructures would add too much variability and also pose a large barrier to entry, especially for groups that did not have software or GUI design specialists.

The data gathering at the University of Sheffield's servers had the additional advantage of having a central place where all the data was stored. This also posed a problem in later years, however, when researchers affiliated with the University of Sheffield moved to different institutions and neither the preservation and maintenance of the infrastructure and data nor its legal ownership were established.

Four teams participated in the track, but not all of them were able to recruit the 30 required participants. The uneven contribution led to some discussion about the fairness of all groups then being able to use the same data in later analyses. Initial discussions on who would get to analyze the data with which research questions in which priority (important for later publications) were never successfully resolved as the organizers moved on to new tasks. Some of the organizers published follow-up analyses of the data [16], while other participating research groups did not.

The participating groups all adhered to research ethics requirements set forth by the University of Sheffield, which hosted the platform and the collected data. Different ethical requirements (e.g., based on national law) were not considered. The experimental participants were asked to consent to their responses being shared not just with the organizers, but with the wider research community, which allows for re-use of the data. However, processes for enabling the data sharing at a later time were not considered.

The proposal for the interactive task had planned for a two-year period, where the data gathering (user interaction logging) and preliminary data analysis would happen in the first year. In year two, an aggregated data set of all logged interactions was to be released to the research community in order to inform an improved system design for data gathering, which would start again in year two. While the organizers provided an initial analysis of the data [32], a planned follow-up analysis of the data did not take place.
4 FIRST INEX iSBS TRACK @ CLEF 2014

Social Book Search (SBS, http://marijnkoolen.com/Social-Book-Search/) started as a system-centered evaluation campaign at INEX in 2011 [21], focusing on retrieval and ranking of book metadata and associated user-generated metadata, such as user reviews, ratings and tags from Amazon and LibraryThing [1]. The main research question behind the track was how to exploit the different types of curated and user-generated metadata for realistic and complex book search requests as expressed on the book discussion forums of LibraryThing. After its third year, the organizers discussed changes to the SBS lab, specifically the nature of book search tasks and how they are evaluated. At the same time, the iCHiC organizers were looking for a different collection than the Europeana cultural heritage objects, because they struggled to come up with a meaningful task that engaged users, as the cultural heritage metadata descriptions got little interest from participating users. Initial discussions between the SBS and iCHiC organizers suggested books and associated social media data might be a more natural domain for participating users. By tying an interactive track to a system-centred track around the same collection and tasks, lessons learned in one track could feed into the other. Thus the interactive SBS (iSBS) track was launched.

Another important initiative was to study the different stages of the search process and how they could be supported by different interfaces [?]. We considered models of the information search process [10, 22, 33] in combination with models of how readers select books to read [15, 28–31]. The book selection models distinguish between book internal features (e.g., subject, treatment, characters, ending) and external features (e.g., author, title, cover, genre) [29], but all are based on interaction in physical libraries and book shops, so they had to be adapted to online environments, where the users have no access to the full-text, but to additional data in the form of user-generated content. Thus, selection is based only on external features.

This led to a three-stage model of browsing, searching and selection, each with separate interfaces that carry over user choices when switching between interfaces, based on Goodall [15]. These stages correspond to the three stages in Vakkari's model of pre-focus, focus and post-focus [33]. There was a lengthy discussion on what functionalities to include in each stage and how to label the different interfaces, to ensure that they made sense to users while retaining a close connection to the three search stages and selection stages from the literature. It took many iterations of UI choices to adapt the system to the data that was available and deemed most useful to the searcher based on book search studies [15, 28, 30]. Such extensive tailoring of the search UI to the data collection naturally makes reuse of UI components problematic.

We were interested in the difference between goal-oriented and non-goal oriented tasks, also to compare the non-goal oriented task in the book domain to the same non-goal task in CH as used in iCHiC [16]. In choosing a simulated work task, we considered tasks that could be connected to specific stages in the search process, similar to Pharo and Nordlie [27].
4.1 Setup

The 2014 iSBS Track did not run as a full evaluation campaign, because most of the year was used to prepare and set up the multi-stage search system, tasks and protocol [17]. However, each of these components improved on the iCHiC set-up: a more interesting collection, more focus on the user interfaces and more varied tasks. The track organizers recruited a small number of participants (41) but decided to open up the experiment to other groups only in the second year. The multi-stage system was compared against a baseline system that had mostly the same features but all in a single view. The experiment included a training task, a goal-oriented task and a non-goal oriented task. Pre- and post-experiment questionnaires asked for demographic and cultural information, and the overall experience and engagement with the interface. Post-task questionnaires asked about the usefulness of different interface features. Most of the questions were constructed specifically for this domain and system, but the engagement questions were reused from the iCHiC Track. The underlying experimental system of the iCHiC experiments was also reused, but had to be modified somewhat to fit the iSBS Track.

4.2 Lessons learned

Although the long preparation phase left little time for gathering data, it resulted in a consensus among the large group of organizers about the set of generic research questions that the experimental setup and search systems should be able to address.

The setup did not lead to enough complex interactions to identify stage transitions in the search process and to test the value of multi-stage interfaces. We considered multiple causes: (1) the tasks were relatively simple and did not require complex interactions; (2) the instructions and training task were not sufficient to get users familiar with such an interface; and (3) the interface was not self-explanatory enough for users to interact with meaningfully. The questionnaire data suggested the tasks could be completed with little effort. We subsequently discussed whether we should use more complex yet still realistic book search tasks.

There was a conflict between the goal of studying social book search with realistic tasks and the goal of studying the value of interfaces for different stages in the search process. The models of Kuhlthau [22] and Vakkari [33] are based on researchers and students searching information to write a report or essay and are perhaps less relevant to casual leisure search for books. Or perhaps the users lack a felt need with the simulated tasks, but would display more complex interactions if they really were searching for one or more books to buy.
5 SECOND iSBS TRACK @ CLEF 2015

5.1 Changes from previous edition

The second year of the iSBS track was open to other research groups and had a longer data gathering period with many more participants (192 in total) [14]. Most of the setup was kept the same to allow comparison with the results of the previous year. However, the goal-oriented task was redesigned to have five different sub-tasks, to make users interact more and for longer periods of time.

5.2 Lessons learned

We found that the fact that metadata in the book collection was exclusively available in English was a hurdle for several non-native English speaking users. As some participating groups contributed many more users than other groups, with more non-native English speakers, the balance was very different than the year before, which makes comparison of cohorts difficult.

Users also spent a lot of time on the goal-oriented task with sub-tasks, causing some of them to abandon the experiment after the first of the two tasks. In their feedback, others indicated that the overall experiment took too long. This could mean that the gathered data is biased towards more persistent participants.
6 THIRD iSBS TRACK @ CLEF 2016

6.1 Changes from previous edition

In the third edition of the iSBS track we made more significant changes to the experimental setup. Some modifications were made to the experiment structure to avoid participants abandoning the experiment. The main change was that users only had one mandatory task, but could continue with other tasks as long as they were willing to continue. We added eight tasks based on book search requests from the LibraryThing discussion forums to provide as realistic tasks as possible [13]. Another big change was that we focused only on the multi-stage interface to have fewer variables in the gathered data. Finally, a third change was that each participating institution had their own instance of the experiment to ensure participant allocation was balanced for each institution, not only for the overall experiment. This was mainly because some institutions had specific cohorts, which they could not analyse across the variables when balancing was only done overall.

6.2 Lessons learned

A comparison of the 2015 and 2016 cohorts showed very few differences in terms of time spent on goal-oriented and non-goal tasks (the 2015 cohort showed no ordering effect between doing goal-oriented first and doing non-goal-oriented first), giving a strong indication that the experiment structure and tasks are producing reliable results. This also suggests that the two cohorts could be combined to reduce the impact of individual differences. One of the hardest struggles in IIR evaluation campaigns is getting a large and diverse enough set of users. Running such campaigns for long periods requires continuity. The same experimental systems need to remain available with at most small changes.

The additional tasks based on requests from the LibraryThing discussion forums resulted in different search behaviour from the simulated goal-oriented and non-goal oriented tasks, but also showed large differences between the LibraryThing tasks themselves, with more subjective, fiction-oriented tasks leading to less interaction than concrete, non-fiction-oriented tasks. This suggests that IIR findings may be very sensitive to the specifics of the simulated work tasks used. It may also signal that in order to study information search for reading for one's own enjoyment, it is important that users have 'skin in the game' and feel a personal connection to leisure-focused work tasks.

A problem encountered since running the 2016 iSBS Track is that organizers move between institutions, which causes problems for maintaining experimental systems, websites and repositories when they lose institutional access to the servers where the infrastructure is hosted. This in turn endangers the continuous availability of research data and experiments. A natural solution to this recurring problem could be an independent or inter-institutional platform and repository for these systems and materials.

7 OUTCOMES: WHAT DID WE LEARN?

7.1 Document Collections

One important lesson learned from the iCHiC and iSBS tracks is the importance of a suitable document collection that is realistic in both size and content variety. The document collection used for iCHiC was based on metadata from Europeana. Even though it represented a broad range of different topics, the individual items in the dataset were often sparse in their information content. In the iSBS tracks, the document collection based on Amazon and LibraryThing data offered richer information that is more suitable for an interesting task for users, but over the course of the different iSBS editions the collection grew increasingly out-of-date. We found this negatively affected search behavior as well as user engagement, especially during the open search task. Users were looking for recent book titles and got frustrated that they could only find books that were at least six years old.

While re-use of IIR resources is important for replicability and reproducibility, oftentimes older document collections are simply not interesting anymore for participants—something system-based evaluation suffers from to a lesser degree. How to obtain realistic, engaging, and up-to-date document collections, while at the same time maintaining comparability across evaluation iterations, remains an open question.

Using a live document collection from a production system would not allow for the same number of interactions to be studied and poses difficulty for logging. It is not a simple alternative. Arguably, what matters is not the stability of the set of documents that are searchable, but the extent to which that set is up-to-date. Book search interactions gathered in 2014 can be compared with those gathered in 2019 if in both cases users could search books published in the last five years, despite there being no overlap between the two collections, as long as the type and amount of information about books remains the same. To improve re-usability, it may be more valuable to investigate and describe relevant aspects of document collections, so that IIR studies with different document collections can be compared based on their overlapping relevance aspects, e.g., recency, structure, type, and amount of information.
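As one possible way to make such collection descriptions explicit, the sketch below records the aspects named above in a small profile structure and lists the aspects two collections share; it is our own hypothetical illustration, and the field names and example values are assumptions rather than artifacts of the tracks.

```python
from dataclasses import dataclass, asdict

@dataclass
class CollectionProfile:
    """Hypothetical description of an IIR document collection along the
    relevance aspects discussed above (recency, structure, type, amount)."""
    domain: str       # e.g. "books" or "cultural heritage"
    recency: str      # e.g. "includes titles from the last five years"
    structure: str    # e.g. "curated metadata plus user reviews and tags"
    doc_type: str     # e.g. "book records"
    info_amount: str  # e.g. "roughly a dozen metadata fields per record"

def shared_aspects(a: CollectionProfile, b: CollectionProfile):
    """Aspects on which two collections (and thus two IIR studies) are comparable."""
    b_dict = asdict(b)
    return [aspect for aspect, value in asdict(a).items() if b_dict[aspect] == value]

# Two hypothetical book collections with no overlapping documents but the same
# described aspects: interaction data gathered on them could still be compared.
sbs_2014 = CollectionProfile("books", "last five years", "metadata + reviews", "book records", "~12 fields")
sbs_2019 = CollectionProfile("books", "last five years", "metadata + reviews", "book records", "~12 fields")
print(shared_aspects(sbs_2014, sbs_2019))
```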
Unfortunately, realistic document collections tend to exhibit a larger degree of variety and complexity. This may make them more engaging and interesting to participants, but it also increases the complexity of the analysis of their behavior. One could argue that to achieve a more detailed and thorough analysis, perhaps simpler document collections would be more suitable, thereby setting up a trade-off between complexity at the experimental and the analysis stages.

7.2 Information Needs

In order to have meaningful impact, IIR studies should be representative of the real-life variety in domains, system designs, and user types and needs. One way in which iCHiC and iSBS attempted to do this was by using a varied and realistic set of simulated work tasks [6] and cover stories that include extra context about the background task to support the search behavior of participants. How best to generate such realistic information needs is an open question. One potentially fruitful approach in the 2016 iSBS track involved taking real-world examples of complex information needs from the LibraryThing forums and using them as optional additional work tasks. These tasks were judged as being rich in variety and detail by our participants, so this could be an interesting avenue for future work. However, as the difference between fiction and non-fiction tasks showed, personal interest does play an important role in user engagement, so using real-world requests as simulated work tasks is not a catch-all solution.

Despite the proven usefulness of simulated work tasks, they are still not the same as a user's own information needs. We therefore also included work tasks in iCHiC and iSBS that focused on the participants' own information needs. Non-restrictive tasks, in which users can search whatever and however they want for as long or short as they want, offer more realistic aspects of information behavior, but they make comparison more difficult. Differences between users can be due to them having wildly different 'tasks' in mind. Although we experimented with different types of tasks, we feel that we have only scratched the surface here. True information needs can be multilingual and multicultural, making assessment even more challenging.

In addition, by focusing only on single information needs, we believe that we are ignoring valuable aspects of the entire information seeking process, both individual and collaborative [20]. Information search is only one aspect of information behavior and is commonly combined with exploration, browsing, or interaction with a recommender system. Moreover, information behavior often takes place across and between different devices (desktop vs. smartphone), information systems (e.g. Amazon, LibraryThing, Google, but also social media channels like Facebook and Twitter [9]) and modalities (digital vs. paper). On the other hand, a large number of varied information needs and task contexts leads to a wide distribution of experimental data points, which—if not enough users can be persuaded to participate—may result in insufficiently significant analyses.
7.3 Study Participants

Ideally, an IIR evaluation campaign recruits participants that are a realistic representation of the general target population to avoid the introduction of biases [8, p. 241]. However, in most IIR tracks—including our own—researchers have often relied on recruiting students from participating universities or research groups as participants. Due to the short-term preparations and research cycles, this is often the only way to include enough participants in an IIR experiment. However, students are only one of several user groups that need to be taken into account when dealing with complex search tasks. It needs to be assured that users are selected based on the specific system, feature or task to be tested, as ignoring these relationships and dependencies is likely to lead to invalid results. Longer preparation time or access to user databases with potential participants could help overcome such biases in participant recruitment.

One of our findings in iSBS was that the cultural background makes a significant difference. This is something that is rarely reported in studies, but that appears to be an important aspect to include. This also challenges the assumption that by providing the same infrastructure and tasks but using different user group distributions over the years or across national boundaries, measured user interactions can be aggregated across these groups. There were some analyses that clustered users based on certain aspects, but the question remains which users can be viewed in aggregation. Since academic IIR studies often rely on students, perhaps studies can explicitly describe criteria of representativeness of the target user group and add questions to the questionnaire that capture aspects of users that allow mapping them to these aspects of representativeness.

7.4 Search User Interface

The search user interface is perhaps the most important aspect to get right for the IIR system used in the experiments, as our experience with the iCHiC and iSBS tracks has taught us. The ubiquity and popularity of modern-day search engines means that any search user interface has certain minimum expectations to meet in terms of layout and/or functionality. Not meeting these expectations means risking distracting users and has a deleterious effect on their search behavior. It would be beneficial if the IIR system offered the flexibility of choosing different search interfaces to study the effects of the GUI on information seeking behavior. This was used to great effect in the iSBS tracks to examine how different interfaces can support the different search stages.

This flexibility came at a price, however, as the software components needed for the infrastructure became increasingly complex. Both iCHiC and iSBS used a customized infrastructure developed by one of the organizers, which made this possible [18]. Maintaining customized software for future experiments is a hard problem. Making infrastructure publicly available with appropriate documentation is one way to alleviate this.

Another difficulty is that the design of interfaces can be informed by different theoretical models of information interaction. In setting up the iSBS track and designing the multi-stage interface, we discussed the appropriateness of numerous information seeking/search models as well as book selection models and strategies, how they are related to each other and how they correspond to or are supported by aspects of the interface. A further complication is that our choices were also steered by the research questions we wanted to address. These issues add another set of variables to take into account when considering comparison and reuse, and should be described in studies.
7.5 Experimental Setup

IIR research usually includes several complex components that can affect the quality and success of each experiment. While the importance of some elements such as task development has been extensively discussed, other aspects remain less considered. Only a few studies report on or discuss measures used to analyze or interpret results from IIR experiments. So far, IIR measures are highly contextual, varying from experiment to experiment. Measures used span from data on interactions, such as session duration or clicks, to qualitative data derived from questionnaires or interviews. Often several data points are complemented or correlated.

A collaborative IIR study requires that participating research groups pool their gathered data, and aggregating this data generates substantial overhead. If institutions gather their own data, aggregation may involve harmonizing inconsistencies. In the iCHiC and iSBS tracks, a single system was used to gather all experimental data, but this system had to be developed and adapted with each iteration. Comprehensive documentation and accurate descriptions of the data gathering tools are crucial for the evaluation and re-use of these aspects in future studies.
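As a minimal sketch of the kind of harmonization overhead this involves (a hypothetical example of ours, not the tracks' actual tooling; the log formats and field names are assumptions), two groups' differently structured interaction logs can be mapped onto one shared record format before they are pooled:

```python
from datetime import datetime, timezone

# Hypothetical common schema for pooled interaction logs.
COMMON_FIELDS = ["group", "participant", "task", "action", "timestamp_utc"]

def harmonize_group_a(row):
    """Group A is assumed to log fields 'user_id', 'task_id', 'event', 'unix_time'."""
    return {
        "group": "A",
        "participant": row["user_id"],
        "task": row["task_id"],
        "action": row["event"].strip().lower(),
        "timestamp_utc": datetime.fromtimestamp(int(row["unix_time"]), tz=timezone.utc).isoformat(),
    }

def harmonize_group_b(row):
    """Group B is assumed to log fields 'participant', 'task', 'interaction', 'time_iso'."""
    return {
        "group": "B",
        "participant": row["participant"],
        "task": row["task"],
        "action": row["interaction"].strip().lower(),
        "timestamp_utc": datetime.fromisoformat(row["time_iso"]).astimezone(timezone.utc).isoformat(),
    }

def pool_logs(rows_a, rows_b):
    """Merge both groups' logs into one chronologically sorted list of common-schema records."""
    pooled = [harmonize_group_a(r) for r in rows_a] + [harmonize_group_b(r) for r in rows_b]
    return sorted(pooled, key=lambda r: r["timestamp_utc"])

# Hypothetical usage with one record per group:
logs_a = [{"user_id": "p07", "task_id": "goal1", "event": "QUERY", "unix_time": "1462285800"}]
logs_b = [{"participant": "p31", "task": "goal1", "interaction": "click ", "time_iso": "2016-05-03T15:12:30+00:00"}]
print(pool_logs(logs_a, logs_b))
```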
Different research groups and individuals often want to study slightly different aspects of the problem domain or setup, requiring different questions in the questionnaire, different tasks or users, or different search system components. With every change, new users need to be recruited, and comparisons to previously collected data become harder. The long preparatory discussions among the iSBS organizers regarding research questions, theoretical frameworks and research designs suggest that it is possible to some extent to incorporate a broad set of research questions in the overall research design to allow a range of studies with the same setup. But often research questions change or new questions are prompted during and following the experiments, calling for an iterative development of the research design. We are not aware of any guidelines on how to best update designs to allow some backwards comparability. While there is large variability in research questions and research designs, the group would have benefited from re-using other researchers' research design components, as was done with the User Engagement Scale [24] in both iCHiC and iSBS. Apart from documenting the broad aspects of the experimental set-up in the track overview papers, a thorough documentation and subsequent publication of questionnaire items, scales and other measures would not only help other researchers in not having to re-invent standard items (e.g., demographic questions), but also support the standardization of IIR research.

7.6 Data Storage, Infrastructure Maintenance & Intellectual Property Rights

From 2011 until 2016, the various interactive tracks generated a wealth of data, but also went through numerous organizational changes, both in terms of the individuals involved and the institutions that provided infrastructure. iSBS started as part of INEX with some data stored on servers dedicated to INEX activities, other data stored on servers maintained by one of the organizers' institutions, and the search indexes on another set of servers of another organizing institution.

Recurring questions are (1) what happens if organizers leave and own crucial pieces of the data or infrastructure, and (2) what happens when organizers move between institutions, thereby losing access to data or infrastructure? For research data management purposes, it is important that organizers of IIR studies make explicit who is responsible for which part of the data and systems, who owns the data or infrastructure, and what happens when organizers move to other institutions or leave the project, or when new organizers join.

While always intended, the organizers of iCHiC and iSBS could find hardly any re-use of the gathered data for IIR studies or triangulation studies with the related ad-hoc retrieval experiments in CHiC or SBS. One reason may have been the insufficient availability of the research data along with a proper rights clearance.

There are generic platforms for storing and sharing scientific data, such as the Open Science Framework (https://osf.io/) and several Dataverse (https://dataverse.org/) instances. These options solve some of the institutional issues, but they lack the flexibility to run experimental systems or to add domain-specific search and access features to datasets that make a repository like RepAST useful to the IIR community. Publicly available repositories for software and software infrastructures also exist (e.g., GitHub), but present similar problems to the research data repositories.

Next to problems of storage and access of IIR research data, there are issues of copyright, privacy and ethics. The questionnaire informs users which institutions are involved, but how should organizers deal with new researchers and institutions joining? One option is for organizers to agree on ethical guidelines for data gathering, informed consent and data representation. For further data re-use, it is crucial that users also give their informed consent for additional analyses of their data. To create a trustworthy environment, IIR researchers must provide concrete statements on by whom and for what future purposes the data will be used. These statements should be available in addition to the research data, as part of an archived and documented research design (see Section 7.5).

7.7 Coordinating Collaborative Research

IIR research is a highly interdisciplinary field bridging areas of information seeking, interactive and system-centered (ranking, evaluation) IR, and user interface design. Accordingly, researchers from different disciplines need to collaborate on complex questions and experimental setups. Entering the field of IIR research is still a challenge due to inconsistent or incompatible practices. Even for those that work on IIR problems, no collaboration on systems, tasks, data, participants or research questions can be observed. This might be the case due to time and resource constraints caused by traditional one-year research cycles as well as unawareness of other projects.
In assessing the interest in an interactive track in the SBS Lab during a joint iCHiC and SBS discussion session at CLEF 2013, everyone who stated their interest was involved in the initial discussions in setting up the track, to get an overview of what aspects they wanted to investigate, thereby shaping the track around a broad set of interests. This community input is valuable both in attracting groups to actively participate and in creating a setup with potential for long-term community support and interest. A challenge of the desired community input and larger organizer numbers is the required additional overhead for the decision processes. Once again, good documentation and communication are vital, as are well-understood guidelines or practices about the consequences of researchers joining or leaving the initiative. Collaborative research also entails a joint understanding of how research results will be presented (e.g. rules of authorship and priority). This is especially important in large collaborations.

Collaborative research, by its very nature, tries to study aspects which require a large-scale infrastructure, a large number of users or other aspects that need strong community input. This will necessarily prolong the design and implementation phases of any study, which is a detriment in a fast-paced scholarly context such as IIR research, especially within the large evaluation campaigns or research conferences, which run on annual cycles. This type of work would be best supported by a multi-year project or by moving to a slower research output model.

8 OUTCOMES: WHERE TO GO FROM HERE?

Based on previous experiences from the CLEF/INEX Interactive Social Book Search tracks, the two Supporting Complex Search Tasks (SCST) community workshops (2015 and 2017) [2, 12] were organized to discuss IIR challenges and future directions in the area of complex search scenarios, since cooperation between the different tracks was rarely seen. The invited researchers from various fields concluded that collaborative IIR campaigns have great potential, but lack standardization and sustainability. Since previous efforts such as the Systematic Review of Assigned Search Tasks (RepAST) [34] have only been partly noticed or used, it remains an open question how to secure the persistence of IIR research designs and results.

The 2018 workshop on Barriers to IIR Resources Re-use (BIIRRR) switched the focus to the analysis and preparation of requirements for effective re-use of IIR resources or experiments [3]. The development of quality standards for the curation and re-use of research designs has been identified as one of the main tasks in this initiative, along with the appropriate documentation and publication of research data and the requisite software. Research designs were named as a priority, because they appear to have the highest potential for standardization and re-use in other IIR studies. This requires a proper analysis of previously used research design elements as well as motivation for or against potential re-use of these elements.

One idea is to develop a platform that would allow researchers from interdisciplinary fields to search for IIR research designs once they have been identified as re-usable and are stored and documented. Building such a repository requires an analysis and implementation of user requirements both for accessing and contributing research designs, the development of and agreement on a standardized data infrastructure, as well as a maintenance plan coordinated by a stable team of researchers.

Apart from a proper documentation and archiving strategy, this retrospective also pointed towards pre-study aspects which are instrumental for re-using experimental research data and designs. This includes the establishment of guidelines for cross-national and cross-institutional data collection, informed consent and data distribution. As was declared several times in this paper, the reusability of research designs and other IIR study components strongly depends on the community's willingness to develop and maintain proper documentation, curation and publication guidelines. While this may not be as rewarding as creating new research data by implementing more IIR studies (and we need more of these as well), it is crucial for the community to standardize in order to move forward as a research discipline.

REFERENCES

[1] Thomas Beckers, Norbert Fuhr, Nils Pharo, Ragnar Nordlie, and Khairun Nisa Fachry. 2010. Overview and Results of the INEX 2009 Interactive Track. In ECDL (Lecture Notes in Computer Science), Mounia Lalmas, Joemon M. Jose, Andreas Rauber, Fabrizio Sebastiani, and Ingo Frommholz (Eds.), Vol. 6273. Springer, 409–412.
[2] Nicholas Belkin, Toine Bogers, Jaap Kamps, Diane Kelly, Marijn Koolen, and Emine Yilmaz. 2017. Second Workshop on Supporting Complex Search Tasks. In Proc. CHIIR 2017. ACM, New York, NY, 433–435.
[3] Toine Bogers, Maria Gäde, Mark Hall, Luanne Freund, Marijn Koolen, Vivien Petras, and Mette Skov. 2018. Report on the Workshop on Barriers to Interactive IR Resources Re-use (BIIRRR 2018). SIGIR Forum 52, 1 (Aug. 2018), 119–128.
[4] Pia Borlund. 2003. The IIR Evaluation Model: A Framework for Evaluation of Interactive Information Retrieval Systems. Information Research 8, 3 (2003).
[5] Pia Borlund. 2016. Interactive Information Retrieval: An Evaluation Perspective. In CHIIR '16: Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval. ACM, New York, NY, USA, 151–151.
[6] Pia Borlund and Peter Ingwersen. 1997. The Development of a Method for the Evaluation of Interactive Information Retrieval Systems. Journal of Documentation 53, 3 (1997), 225–250.
[7] Martin Braschler, Khalid Choukri, Nicola Ferro, Allan Hanbury, Jussi Karlgren, Henning Müller, Vivien Petras, Emanuele Pianta, Maarten de Rijke, and Giuseppe Santucci. 2010. A PROMISE for Experimental Evaluation. In Multilingual and Multimodal Information Access Evaluation, Maristella Agosti, Nicola Ferro, Carol Peters, Maarten de Rijke, and Alan Smeaton (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 140–144.
[8] Donald O. Case and Lisa M. Given. 2016. Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior (4th ed.). Emerald Group Publishing, Bingley, UK.
[9] Otis Chandler. 2012. How Consumers Discover Books Online. In Tools of Change for Publishing. O'Reilly.
[10] David Ellis. 1989. A behavioural model for information retrieval system design. Journal of Information Science 15, 4-5 (1989), 237–247.
[11] Maria Gäde, Nicola Ferro, and Monica Lestari Paramita. 2011. CHiC 2011 - Cultural Heritage in CLEF: From Use Cases to Evaluation in Practice for Multilingual Information Access to Cultural Heritage. In CLEF Notebook Papers/Labs/Workshop.
[12] Maria Gäde, Mark M. Hall, Hugo Huurdeman, Jaap Kamps, Marijn Koolen, Mette Skov, Elaine Toms, and David Walsh. 2015. Report on the First Workshop on Supporting Complex Search Tasks. SIGIR Forum 49, 1 (June 2015), 50–56.
[13] Maria Gäde, Mark Michael Hall, Hugo C. Huurdeman, Jaap Kamps, Marijn Koolen, Mette Skov, Toine Bogers, and David Walsh. 2016. Overview of the SBS 2016 Interactive Track. In Working Notes of the CLEF 2016 Conference (CEUR Workshop Proceedings), Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald (Eds.), Vol. 1609. CEUR-WS.org, 1024–1038.
[14] Maria Gäde, Mark Michael Hall, Hugo C. Huurdeman, Jaap Kamps, Marijn Koolen, Mette Skov, Elaine Toms, and David Walsh. 2015. Overview of the SBS 2015 Interactive Track. In Working Notes of the CLEF 2015 Conference (CEUR Workshop Proceedings), Linda Cappellato, Nicola Ferro, Gareth J. F. Jones, and Eric SanJuan (Eds.), Vol. 1391. CEUR-WS.org.
[15] Deborah Goodall. 1989. Browsing in public libraries. Library and Information Statistics Unit (LISU).
[16] Mark Hall, Robert Villa, Sophie Rutter, Daniel Bell, Paul Clough, and Elaine Toms. 2013. Sheffield Submission to the CHiC Interactive Task: Exploring Digital Cultural Heritage. CLEF Working Notes.
[17] Mark Michael Hall, Hugo C. Huurdeman, Marijn Koolen, Mette Skov, and David Walsh. 2014. Overview of the INEX 2014 Interactive Social Book Search Track. In Working Notes of the CLEF 2014 Conference (CEUR Workshop Proceedings), Linda Cappellato, Nicola Ferro, Martin Halvey, and Wessel Kraaij (Eds.), Vol. 1180. CEUR-WS.org, 480–493.
[18] Mark M. Hall, Spyros Katsaris, and Elaine Toms. 2013. A Pluggable Interactive IR Evaluation Work-bench. In European Workshop on Human-Computer Interaction and Information Retrieval. 35–38. http://ceur-ws.org/Vol-1033/paper4.pdf
[19] Mark Michael Hall and Elaine Toms. 2013. Building a Common Framework for IIR Evaluation. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Pamela Forner, Henning Müller, Roberto Paredes, Paolo Rosso, and Benno Stein (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 17–28.
[20] Preben Hansen, Chirag Shah, and Claus-Peter Klas. 2015. Collaborative Information Seeking. Springer.
[21] Marijn Koolen, Gabriella Kazai, Jaap Kamps, Antoine Doucet, and Monica Landoni. 2012. Overview of the INEX 2011 Books and Social Search Track. In Focused Retrieval of Content and Structure: 10th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2011) (LNCS), Shlomo Geva, Jaap Kamps, and Ralf Schenkel (Eds.), Vol. 7424. Springer.
[22] Carol C. Kuhlthau. 1991. Inside the search process: Information seeking from the user's perspective. Journal of the American Society for Information Science 42, 5 (1991), 361–371.
[23] Ragnar Nordlie and Nils Pharo. 2012. Seven Years of INEX Interactive Retrieval Experiments – Lessons and Challenges. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, Tiziana Catarci, Pamela Forner, Djoerd Hiemstra, Anselmo Peñas, and Giuseppe Santucci (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 13–23.
[24] Heather L. O'Brien and Elaine G. Toms. 2010. The development and evaluation of a survey to measure user engagement. Journal of the American Society for Information Science and Technology 61, 1 (2010), 50–69. DOI: http://dx.doi.org/10.1002/asi.21229
[25] Vivien Petras, Toine Bogers, Elaine Toms, Mark Hall, Jacques Savoy, Piotr Malak, Adam Pawłowski, Nicola Ferro, and Ivano Masiero. 2013. Cultural Heritage in CLEF (CHiC) 2013. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Pamela Forner, Henning Müller, Roberto Paredes, Paolo Rosso, and Benno Stein (Eds.). Springer Berlin Heidelberg, 192–211.
[26] Vivien Petras, Nicola Ferro, Maria Gäde, Antoine Isaac, Michael Kleineberg, Ivano Masiero, Mattia Nicchio, and Juliane Stiller. 2012. Cultural Heritage in CLEF (CHiC) Overview 2012. In CLEF 2012 Labs and Workshops.
[27] Nils Pharo and Ragnar Nordlie. 2012. Examining the effect of task stage and topic knowledge on searcher interaction with a digital bookstore. In Proceedings of the 4th Information Interaction in Context Symposium. ACM, 4–11.
[28] Kara Reuter. 2007. Assessing aesthetic relevance: Children's book selection in a digital library. Journal of the American Society for Information Science and Technology 58, 12 (2007), 1745–1763.
[29] Catherine Sheldrick Ross. 1999. Finding without seeking: the information encounter in the context of reading for pleasure. Information Processing & Management 35, 6 (1999), 783–799. DOI: http://dx.doi.org/10.1016/S0306-4573(99)00026-6
[30] Catherine Sheldrick Ross. 2000. Making choices: What readers say about choosing books to read for pleasure. The Acquisitions Librarian 13, 25 (2000), 5–21.
[31] Katariina Saarinen and Pertti Vakkari. 2013. A sign of a good book: readers' methods of accessing fiction in the public library. Journal of Documentation 69, 5 (2013), 736–754.
[32] Elaine Toms and Mark Hall. 2013. The CHiC Interactive Task (CHiCi) at CLEF 2013. CLEF Working Notes.
[33] Pertti Vakkari. 2001. A theory of the task-based information retrieval process: a summary and generalisation of a longitudinal study. Journal of Documentation 57, 1 (2001), 44–60.
[34] Barbara M. Wildemuth and Luanne Freund. 2012. Assigning Search Tasks Designed to Elicit Exploratory Search Behaviors. In Proceedings of the Symposium on Human-Computer Interaction and Information Retrieval (HCIR '12). ACM, New York, NY, USA, Article 4, 10 pages. DOI: http://dx.doi.org/10.1145/2391224.2391228