Experiences with the 2013-2016 CLEF Interactive Information Retrieval Tracks

Vivien Petras
Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Berlin
vivien.petras@ibi.hu-berlin.de

Marijn Koolen
Humanities Cluster, Royal Netherlands Academy of Arts and Sciences, Amsterdam
marijn.koolen@di.huc.knaw.nl

Maria Gäde
Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Berlin
maria.gaede@ibi.hu-berlin.de

Toine Bogers
Science and Information Studies, Department of Communication & Psychology, Aalborg University Copenhagen, Copenhagen
toine@hum.aau.dk
ABSTRACT

This paper describes our experiences with the interactive IR tracks organized at CLEF from 2013-2016 and aggregates the lessons learned with each consecutive instance of the lab. We end with a summary of practical insights and lessons for future collaborative interactive IR evaluation exercises and for potential re-use scenarios.

KEYWORDS

interactive information retrieval, evaluation, CHiC, SBS, CLEF, book search, information seeking

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK. 2019. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION

After the INEX (Initiative for the Evaluation of XML Retrieval) Interactive Track ended in 2010 [23], there was a gap in interactive information retrieval (IIR) experimentation at the large-scale evaluation initiatives. The interactive track at the Cultural Heritage in CLEF (Conference and Labs of the Evaluation Forum) lab (iCHiC) revived this in 2013 and merged with the INEX Social Book Search track to form the Social Book Search (SBS) lab at CLEF, which ran an interactive track in 2014-2016.

This paper provides a chronological overview of the development and history of these two IIR initiatives and their outcomes. We focus on the lessons learned for future collaborative IIR evaluation exercises and for potential re-use scenarios. We start by chronicling the timeline of the different interactive labs in Sections 2-6. We then highlight the most important lessons learned for the configuration of IIR evaluation experiments. We conclude by discussing consequent activities and insights for the re-use of IIR resources.

2 CULTURAL HERITAGE IN CLEF @ CLEF 2011-2012

2.1 Setup

The EU-funded PROMISE project (Participative Research labOratory for Multimedia and Multilingual Information Systems Evaluation, http://www.promise-noe.eu) ran from 2010-2013 with the goal of providing a virtual and open laboratory for research and experimentation with complex multimodal and multilingual information systems [7]. In order to evaluate its concepts and prototypes, three use cases were defined to guide real-world requirements analysis and contextual testing: 'Unlocking Cultural Heritage' (information access to cultural heritage material), 'Searching for Innovation' (patent search) and 'Visual Clinical Decision Support' (radiology image retrieval).

For the 'Unlocking Cultural Heritage' (CH) use case, a workshop at the 2011 CLEF conference (http://www.promise-noe.eu/chic-2011/) was organized in order to review existing information access use cases in the CH domain and then develop retrieval scenarios that could be used for evaluating CH information access systems [11]. In addition to qualitative usability tests of user interfaces, transaction log analyses and Cranfield-style text retrieval evaluation, other forms of user studies were also considered as viable evaluation approaches. The study and analysis of different interaction patterns with CH materials was the main interest of the workshop's participants.

At the 2012 CLEF conference, a pilot evaluation exercise was organized for the CH domain, progressing the work from the workshop format to an evaluation lab [26]. It was based on a real-life collection of CH material: the complete index of the Europeana digital library (www.europeana.eu), which encompassed ca. 23 million metadata records in 30 different languages at that time. The information needs were based on 50 queries (harvested from Europeana logfiles), translated into English, French and German. The tasks in this pilot exercise (http://www.promise-noe.eu/chic-2012/home) comprised both a conventional system-oriented scenario (i.e., ad-hoc retrieval) as well as more specialized retrieval scenarios for the CH domain: the semantic enrichment and variability tasks. The evaluation followed the Cranfield paradigm by pooling the retrieval results and assessing their relevance using human assessors.
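To make the pooling step concrete, the following minimal Python sketch shows how top-k pooling over submitted runs is commonly implemented; the run names, topic identifier and pool depth are illustrative and are not taken from the actual CHiC data or tooling.

    from collections import defaultdict

    def build_pools(runs, depth=100):
        """runs: {run_name: {topic_id: ranked list of doc_ids}}.
        Returns {topic_id: set of doc_ids} to be judged by human assessors."""
        pools = defaultdict(set)
        for rankings in runs.values():
            for topic_id, ranked_docs in rankings.items():
                pools[topic_id].update(ranked_docs[:depth])  # each run contributes its top-k
        return pools

    # Illustrative toy data, not actual CHiC runs:
    example_runs = {
        "run_A": {"CHIC-001": ["doc3", "doc7", "doc1"]},
        "run_B": {"CHIC-001": ["doc7", "doc9", "doc2"]},
    }
    print(dict(build_pools(example_runs, depth=2)))  # e.g. {'CHIC-001': {'doc3', 'doc7', 'doc9'}}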
2.2 Lessons learned

Although the 2011 CHiC workshop had already emphasized that a focus on user interaction patterns was an important evaluation aspect for the CH domain, this first CHiC lab in 2012 had no interactive tasks. Instead, it utilized a document collection based on Europeana and used queries harvested from Europeana logs to construct information needs. The vision was to extend the ad-hoc style retrieval evaluation with interactive and other evaluation scenarios (particularly result presentations and alternative methods for relevance assessments) in the next phases.

The Europeana document collection, albeit a real-world collection, turned out to be very challenging. While an effort was made to normalize the provided metadata by wrapping it in a special XML format and removing certain metadata fields, the content in the metadata had very different descriptive qualities, depending on the original content provider. Both the data sparseness and the multilinguality of the content posed serious challenges for the participants. Image data, such as thumbnails of graphical material in Europeana, could not be provided for copyright reasons.

Some of the provided topics were not suitable for relevance assessment, because information needs could not always be unambiguously inferred from the provided queries. The topics mostly contained short queries of 1-3 words and only half of them had short descriptions added, which did not help much when the topic was vague. For the CH use case, IIR studies focusing on interaction patterns were needed, so an additional interactive task was proposed for the next round.

3 INTERACTIVE CHIC TRACK @ CLEF 2013

3.1 Setup

The Interactive Track (http://www.promise-noe.eu/chic-2013/tasks/interactive-task) at the CHiC 2013 lab at CLEF (iCHiC) aimed at building a bridge for IIR and behavior researchers to work in a TREC-style evaluation environment. The idea was to develop a collection of IIR evaluation data, which could be re-used and built upon. This task intentionally used a subset of the document collection used in the other CHiC ad-hoc retrieval experimental tasks to allow for later triangulation of results. Based on approximately 1 million metadata records from the English Europeana collection, representing a broad range of CH objects, a simple search interface was envisioned that would allow for browse and search interactions with the metadata records for the IIR experiments [25]. One non-goal oriented task (based on Borlund's simulated work tasks [4, 5]), which simulated "casual" use of the system ("spend 20 minutes on the system and explore"), was provided to all experiment participants.

The same experimental infrastructure, which hosted the web-based interfaces and documents and logged the interactions [19], was provided to all participating research groups. All groups had to recruit at least 30 participants: at least 10 of them had to be observed in the lab, while at least 20 could use the system remotely. Apart from the logged interactions on the systems, participants also filled out pre- and post-task questionnaires, assessed their experience on the User Engagement Scale [24] and evaluated the usefulness of found objects (relevance assessment) and the interface (usability).
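To illustrate the kind of data such a setup produces, the following Python sketch shows one hypothetical logged interaction event and one questionnaire response; the field names and values are invented for illustration and do not reproduce the actual iCHiC logging schema or questionnaire instruments.

    import json
    import time

    interaction_event = {
        "participant_id": "P-017",            # anonymized participant identifier
        "setting": "lab",                     # observed in the lab vs. remote use
        "task": "non-goal-exploration",       # the "spend 20 minutes and explore" task
        "timestamp": time.time(),
        "action": "query",                    # e.g. query, result_click, record_view, bookmark
        "payload": {"query": "medieval manuscripts"},
    }

    post_task_response = {
        "participant_id": "P-017",
        "instrument": "user-engagement-scale",  # cf. O'Brien and Toms [24]
        "item": "UES-01",                       # illustrative item identifier
        "rating": 4,                            # illustrative agreement rating
    }

    print(json.dumps(interaction_event, indent=2))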
3.2 Lessons learned

The iCHiC track ended up collecting data on 208 experiment participants and their interactions from four participating research groups. As a pilot experiment for collaborative data gathering, this first interactive task was successful overall.

The most important lesson learned from iCHiC, and the reason why it was merged with the INEX Social Book Search lab (see Section 4), was that the provided metadata records were not "rich" enough in content to provide an interesting case study for casual browsing and search. The sparseness of the document collection had already been a problem for the ad-hoc retrieval tests, and real users did not like the sparse records any better. The actual purpose of iCHiC, studying users' interactions with the content, was hampered by the lack of interesting content.

The experimental set-up and questionnaire instruments represented a significant effort for the participants to complete. However, the collected data was deemed necessary for further analysis.

An original plan for the set-up of this task was to provide the metadata collection, simulated work tasks, and the experimental setup (questionnaires, logging protocol) to the participating research groups and have them provide their own infrastructure for data gathering. After discussions, the organizers concluded that having different groups each build their own infrastructure would add too much variability and also pose a large barrier to entry, especially for groups that did not have software or GUI design specialists.

The data gathering at the University of Sheffield's servers had the additional advantage of providing a central place where all the data was stored. This also posed a problem in later years, however, when researchers affiliated with the University of Sheffield moved to different institutions and neither the preservation and maintenance of the infrastructure and data nor its legal ownership had been established.

Four teams participated in the track, but not all of them were able to recruit the 30 required participants. The uneven contribution led to some discussion about the fairness of all groups then being able to use the same data in later analyses. Initial discussions on who would get to analyze the data with which research questions and in which priority (important for later publications) were never successfully resolved as the organizers moved on to new tasks. Some of the organizers published follow-up analyses of the data [16], while other participating research groups did not.

The participating groups all adhered to the research ethics requirements set forth by the University of Sheffield, which hosted the platform and the collected data. Different ethical requirements (e.g., based on national law) were not considered. The experimental participants were asked to consent to their responses being shared not just with the organizers, but with the wider research community, which allows for re-use of the data. However, processes for enabling the data sharing at a later time were not considered.

The proposal for the interactive task had planned for a two-year period, where the data gathering (user interaction logging) and preliminary data analysis would happen in the first year.
In year two, an aggregated data set of all logged interactions was to be released to the research community in order to inform an improved system design for data gathering, which would start again in year two. While the organizers provided an initial analysis of the data [32], a planned follow-up analysis of the data did not take place.

4 FIRST INEX iSBS TRACK @ CLEF 2014

Social Book Search (SBS, http://marijnkoolen.com/Social-Book-Search/) started as a system-centered evaluation campaign at INEX in 2011 [21], focusing on retrieval and ranking of book metadata and associated user-generated metadata, such as user reviews, ratings and tags from Amazon and LibraryThing [1]. The main research question behind the track was how to exploit the different types of curated and user-generated metadata for realistic and complex book search requests as expressed on the book discussion forums of LibraryThing. After its third year, the organizers discussed changes to the SBS lab, specifically the nature of book search tasks and how they are evaluated. At the same time, the iCHiC organizers were looking for a different collection than the Europeana cultural heritage objects, because they struggled to come up with a meaningful task that engaged users, as the cultural heritage metadata descriptions got little interest from participating users. Initial discussions between the SBS and iCHiC organizers suggested books and associated social media data might be a more natural domain for participating users. By tying an interactive track to a system-centred track around the same collection and tasks, lessons learned in one track could feed into the other. Thus the interactive SBS (iSBS) track was launched.

Another important initiative was to study the different stages of the search process and how they could be supported by different interfaces [?]. We considered models of the information search process [10, 22, 33] in combination with models of how readers select books to read [15, 28-31]. The book selection models distinguish between book-internal features (e.g., subject, treatment, characters, ending) and external features (e.g., author, title, cover, genre) [29], but all are based on interaction in physical libraries and book shops, so they had to be adapted to online environments, where users have no access to the full text, but to additional data in the form of user-generated content. Thus, selection is based only on external features.

This led to a three-stage model of browsing, searching and selection, each with separate interfaces that carry over user choices when switching between interfaces, based on Goodall [15]. These stages correspond to the three stages in Vakkari's model of pre-focus, focus and post-focus [33]. There was a lengthy discussion on what functionalities to include in each stage and how to label the different interfaces, to ensure that they made sense to users while retaining a close connection to the three search stages and selection stages from the literature. It took many iterations of UI choices to adapt the system to the data that was available and deemed most useful to the searcher based on book search studies [15, 28, 30]. Such extensive tailoring of the search UI to the data collection naturally makes reuse of UI components problematic.
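As a rough illustration of what carrying over user choices between interfaces means in practice, the following Python sketch keeps a shortlist of books in a session object that survives switches between the browse, search and select stages; this is a hypothetical sketch under our own assumptions, not the actual iSBS implementation.

    from dataclasses import dataclass, field

    STAGES = ("browse", "search", "select")   # roughly pre-focus, focus and post-focus [33]

    @dataclass
    class MultiStageSession:
        stage: str = "browse"
        bookbag: list = field(default_factory=list)   # shortlisted books persist across stages

        def switch_stage(self, new_stage):
            if new_stage not in STAGES:
                raise ValueError("unknown stage: " + new_stage)
            self.stage = new_stage                    # only the view changes, the bookbag is kept

        def shortlist(self, book_id):
            if book_id not in self.bookbag:
                self.bookbag.append(book_id)

    session = MultiStageSession()
    session.shortlist("book-123")
    session.switch_stage("search")
    print(session.stage, session.bookbag)             # search ['book-123']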
We were interested in the difference between goal-oriented and non-goal oriented tasks, also to compare the non-goal oriented task in the book domain to the same non-goal task in CH as used in iCHiC [16]. In choosing a simulated work task, we considered tasks that could be connected to specific stages in the search process, similar to Pharo and Nordlie [27].

4.1 Setup

The 2014 iSBS Track did not run as a full evaluation campaign, because most of the year was used to prepare and set up the multi-stage search system, tasks and protocol [17]. However, each of these components improved on the iCHiC set-up: a more interesting collection, more focus on the user interfaces and more varied tasks. The track organizers recruited a small number of participants (41) but decided to open up the experiment to other groups only in the second year. The multi-stage system was compared against a baseline system that had mostly the same features, but all in a single view. The experiment included a training task, a goal-oriented task and a non-goal oriented task. Pre- and post-experiment questionnaires asked for demographic and cultural information, and the overall experience and engagement with the interface. Post-task questionnaires asked about the usefulness of different interface features. Most of the questions were constructed specifically for this domain and system, but the engagement questions were reused from the iCHiC Track. The underlying experimental system of the iCHiC experiments was also reused, but had to be modified somewhat to fit the iSBS Track.

4.2 Lessons learned

Although the long preparation phase left little time for gathering data, it resulted in a consensus among the large group of organizers about the set of generic research questions that the experimental setup and search systems should be able to address.

The setup did not lead to enough complex interactions to identify stage transitions in the search process and to test the value of multi-stage interfaces. We considered multiple causes: (1) the tasks were relatively simple and did not require complex interactions; (2) the instructions and training task were not sufficient to get users familiar with such an interface; and (3) the interface was not self-explanatory enough for users to interact with meaningfully. The questionnaire data suggested the tasks could be completed with little effort. We subsequently discussed whether we should use more complex yet still realistic book search tasks.

There was a conflict between the goal of studying social book search with realistic tasks and the goal of studying the value of interfaces for different stages in the search process. The models of Kuhlthau [22] and Vakkari [33] are based on researchers and students searching information to write a report or essay and are perhaps less relevant to casual leisure search for books. Or perhaps the users lack a felt need with the simulated tasks, but would display more complex interactions if they really were searching for one or more books to buy.

5 SECOND iSBS TRACK @ CLEF 2015

5.1 Changes from previous edition

The second year of the iSBS track was open to other research groups and had a longer data gathering period with many more participants (192 in total) [14]. Most of the setup was kept the same to allow comparison with the results of the previous year.
However, the goal-oriented task was redesigned to have five different sub-tasks, to make users interact more and for longer periods of time.

5.2 Lessons learned

We found that the exclusively English-language metadata in the book collection was a hurdle for several non-native English-speaking users. As some participating groups contributed many more users than other groups, with more non-native English speakers, the balance was very different from the year before, which makes comparison of cohorts difficult.

Users also spent a lot of time on the goal-oriented task with sub-tasks, causing some of them to abandon the experiment after the first of the two tasks. In their feedback, others indicated that the overall experiment took too long. This could mean that the gathered data is biased towards more persistent participants.

6 THIRD iSBS TRACK @ CLEF 2016

6.1 Changes from previous edition

In the third edition of the iSBS track we made more significant changes to the experimental setup. Some modifications were made to the experiment structure to avoid participants abandoning the experiment. The main change was that users only had one mandatory task, but could continue with other tasks as long as they were willing to continue. We added eight tasks based on book search requests from the LibraryThing discussion forums to provide as realistic tasks as possible [13]. Another big change was that we focused only on the multi-stage interface to have fewer variables in the gathered data. Finally, a third change was that each participating institution had their own instance of the experiment to ensure participant allocation was balanced for each institution, not only for the overall experiment. This was mainly because some institutions had specific cohorts, which they could not analyse across the variables when balancing was only done overall.
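The per-institution balancing described above could, for example, be implemented by rotating through the possible experimental conditions separately for each institution's participant pool, as in the following Python sketch; the condition names and institution identifiers are hypothetical and this is not the actual iSBS allocation procedure.

    from itertools import cycle

    # Hypothetical conditions to balance (e.g. task orders); not the actual iSBS set-up.
    CONDITIONS = ["order-A", "order-B", "order-C"]

    def assign_conditions(participants_by_institution):
        """participants_by_institution: {institution: [participant_id, ...]}."""
        assignments = {}
        for institution, participants in participants_by_institution.items():
            rotation = cycle(CONDITIONS)          # restart the rotation for every institution
            for participant in participants:
                assignments[participant] = next(rotation)
        return assignments

    print(assign_conditions({"inst-1": ["P1", "P2", "P3", "P4"], "inst-2": ["P5", "P6"]}))
    # e.g. {'P1': 'order-A', 'P2': 'order-B', 'P3': 'order-C', 'P4': 'order-A',
    #       'P5': 'order-A', 'P6': 'order-B'}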
6.2 Lessons learned

A comparison of the 2015 and 2016 cohorts showed very few differences in terms of time spent on goal-oriented and non-goal tasks (the 2015 cohort showed no ordering effect between doing the goal-oriented task first and doing the non-goal-oriented task first), giving a strong indication that the experiment structure and tasks are producing reliable results. This also suggests that the two cohorts could be combined to reduce the impact of individual differences. One of the hardest struggles in IIR evaluation campaigns is getting a large and diverse enough set of users. Running such campaigns for long periods requires continuity. The same experimental systems need to remain available with at most small changes.

The additional tasks based on requests from the LibraryThing discussion forums resulted in different search behaviour from the simulated goal-oriented and non-goal oriented tasks, but also showed large differences between the LibraryThing tasks themselves, with more subjective, fiction-oriented tasks leading to less interaction than concrete, non-fiction-oriented tasks. This suggests that IIR findings may be very sensitive to the specifics of the simulated work tasks used. It may also signal that in order to study information search for reading for one's own enjoyment, it is important that users have 'skin in the game' and feel a personal connection to leisure-focused work tasks.

A problem encountered since running the 2016 iSBS Track is that organizers move between institutions, which causes problems for maintaining experimental systems, websites and repositories when they lose institutional access to the servers where the infrastructure is hosted. This in turn endangers the continuous availability of research data and experiments. A natural solution to this recurring problem could be an independent or inter-institutional platform and repository for these systems and materials.

7 OUTCOMES: WHAT DID WE LEARN?

7.1 Document Collections

One important lesson learned from the iCHiC and iSBS tracks is the importance of a suitable document collection that is realistic in both size and content variety. The document collection used for iCHiC was based on metadata from Europeana. Even though it represented a broad range of different topics, the individual items in the dataset were often sparse in their information content. In the iSBS tracks, the document collection based on Amazon and LibraryThing data offered richer information that is more suitable for an interesting task for users, but over the course of the different iSBS editions the collection grew increasingly out-of-date. We found this negatively affected search behavior as well as user engagement, especially during the open search task. Users were looking for recent book titles and got frustrated that they could only find books that were at least six years old.

While re-use of IIR resources is important for replicability and reproducibility, oftentimes older document collections are simply not interesting anymore for participants, something system-based evaluation suffers from to a lesser degree. How to obtain realistic, engaging, and up-to-date document collections, while at the same time maintaining comparability across evaluation iterations, remains an open question.

Using a live document collection from a production system would not allow for the same number of interactions to be studied and poses difficulty for logging. It is not a simple alternative. Arguably, what matters is not the stability of the set of documents that are searchable, but the extent to which that set is up-to-date. Book search interactions gathered in 2014 can be compared with those gathered in 2019 if in both cases users could search books published in the last five years, despite there being no overlap between the two collections, as long as the type and amount of information about books remains the same. To improve re-usability, it may be more valuable to investigate and describe relevant aspects of document collections, so that IIR studies with different document collections can be compared based on their overlapping relevance aspects, e.g., recency, structure, type, and amount of information.

Unfortunately, realistic document collections tend to exhibit a larger degree of variety and complexity. This may make them more engaging and interesting to participants, but it also increases the complexity of the analysis of their behavior. One could argue that to achieve a more detailed and thorough analysis, perhaps simpler document collections would be more suitable, thereby setting up a trade-off between complexity at the experimental and the analysis stages.
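One way to make the overlapping relevance aspects mentioned above concrete would be to attach a small descriptive profile to every collection used in an IIR study, so that studies can later be matched on shared aspects; the fields and most example values in the following Python sketch are illustrative only (only the roughly 1 million records and English-only content of the iCHiC subset are taken from this paper).

    from dataclasses import dataclass

    @dataclass
    class CollectionProfile:
        name: str
        domain: str               # e.g. "cultural heritage" or "books"
        num_records: int
        newest_item_year: int     # a simple proxy for recency
        record_richness: str      # e.g. "sparse metadata" or "metadata plus reviews and tags"
        languages: tuple

    # Illustrative values; newest_item_year is made up for the example.
    ichic_profile = CollectionProfile(
        name="iCHiC Europeana subset",
        domain="cultural heritage",
        num_records=1_000_000,
        newest_item_year=2012,
        record_richness="sparse metadata",
        languages=("en",),
    )
    print(ichic_profile)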
7.2 Information Needs

In order to have meaningful impact, IIR studies should be representative of the real-life variety in domains, system designs, and user types and needs. One way in which iCHiC and iSBS attempted to do this was by using a varied and realistic set of simulated work tasks [6] and cover stories that include extra context about the background task to support the search behavior of participants. How best to generate such realistic information needs is an open question. One potentially fruitful approach in the 2016 iSBS track involved taking real-world examples of complex information needs from the LibraryThing forums and using them as optional additional work tasks. These tasks were judged as being rich in variety and detail by our participants, so this could be an interesting avenue for future work. However, as the difference between fiction and non-fiction tasks showed, personal interest does play an important role in user engagement, so using real-world requests as simulated work tasks is not a catch-all solution.

Despite the proven usefulness of simulated work tasks, they are still not the same as a user's own information needs. We therefore also included work tasks in iCHiC and iSBS that focused on the participants' own information needs. Non-restrictive tasks, in which users can search whatever and however they want for as long or short as they want, offer more realistic aspects of information behavior, but they make comparison more difficult. Differences between users can be due to them having wildly different 'tasks' in mind. Although we experimented with different types of tasks, we feel that we have only scratched the surface here. True information needs can be multilingual and multicultural, making assessment even more challenging.

In addition, by focusing only on single information needs, we believe that we are ignoring valuable aspects of the entire information seeking process, both individual and collaborative [20]. Information search is only one aspect of information behavior and is commonly combined with exploration, browsing, or interaction with a recommender system. Moreover, information behavior often takes place across and between different devices (desktop vs. smartphone), information systems (e.g., Amazon, LibraryThing, Google, but also social media channels like Facebook and Twitter [9]) and modalities (digital vs. paper). On the other hand, a large number of varied information needs and task contexts leads to a wide distribution of experimental data points, which, if not enough users can be persuaded to participate, may result in insufficiently significant analyses.

7.3 Study Participants

Ideally, an IIR evaluation campaign recruits participants that are a realistic representation of the general target population to avoid the introduction of biases [8, p. 241]. However, in most IIR tracks, including our own, researchers have often relied on recruiting students from participating universities or research groups as participants. Due to the short-term preparations and research cycles, this is often the only way to include enough participants in an IIR experiment. However, students are only one of several user groups that need to be taken into account when dealing with complex search tasks. Users need to be selected based on the specific system, feature or task to be tested, as ignoring these relationships and dependencies is likely to lead to invalid results. Longer preparation time or access to user databases with potential participants could help overcome such biases in participant recruitment.

One of our findings in iSBS was that the cultural background makes a significant difference. This is something that is rarely reported in studies, but that appears to be an important aspect to include. This also challenges the assumption that, by providing the same infrastructure and tasks but using different user group distributions over the years or across national boundaries, measured user interactions can be aggregated across these groups. There were some analyses that clustered users based on certain aspects, but the question remains which users can be viewed in aggregation. Since academic IIR studies often rely on students, perhaps studies can explicitly describe criteria of representativeness of the target user group and add questions to the questionnaire that capture aspects of users that allow mapping them to these aspects of representativeness.

7.4 Search User Interface

The search user interface is perhaps the most important aspect to get right for the IIR system used in the experiments, as our experience with the iCHiC and iSBS tracks has taught us. The ubiquity and popularity of modern-day search engines means that any search user interface has certain minimum expectations to meet in terms of layout and/or functionality. Not meeting these expectations risks distracting users and has a deleterious effect on their search behavior. It would be beneficial if the IIR system offered the flexibility of choosing different search interfaces to study the effects of the GUI on information seeking behavior. This was used to great effect in the iSBS tracks to examine how different interfaces can support the different search stages.

This flexibility came at a price, however, as the software components needed for the infrastructure became increasingly complex. Both iCHiC and iSBS used a customized infrastructure developed by one of the organizers, which made this possible [18]. Maintaining customized software for future experiments is a hard problem. Making infrastructure publicly available with appropriate documentation is one way to alleviate this.

Another difficulty is that the design of interfaces can be informed by different theoretical models of information interaction. In setting up the iSBS track and designing the multi-stage interface, we discussed the appropriateness of numerous information seeking/search models as well as book selection models and strategies, how they are related to each other and how they correspond to or are supported by aspects of the interface. A further complication is that our choices were also steered by the research questions we wanted to address. These issues add another set of variables to take into account when considering comparison and reuse, and should be described in studies.

7.5 Experimental Setup

IIR research usually includes several complex components that can affect the quality and success of each experiment. While the importance of some elements, such as task development, has been extensively discussed, other aspects remain less considered.
Only a few studies report on or discuss the measures used to analyze or interpret results from IIR experiments. So far, IIR measures are highly contextual, varying from experiment to experiment. Measures used span from data on interactions, such as session duration or clicks, to qualitative data derived from questionnaires or interviews. Often several data points are complemented or correlated.

A collaborative IIR study requires that participating research groups pool their gathered data, and aggregating this data generates substantial overhead. If institutions gather their own data, aggregation may involve harmonizing inconsistencies. In the iCHiC and iSBS tracks, a single system was used to gather all experimental data, but this system had to be developed and adapted with each iteration. Comprehensive documentation and accurate descriptions of the data gathering tools are crucial for the evaluation and re-use of these aspects in future studies.
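To give an idea of the harmonization overhead mentioned above, the following Python sketch maps interaction logs gathered by two hypothetical groups onto one shared field schema before pooling them; the group names and field names are invented and do not describe any actual iCHiC or iSBS data.

    FIELD_MAP = {
        "group_a": {"user": "participant_id", "ts": "timestamp", "event": "action"},
        "group_b": {"pid": "participant_id", "time": "timestamp", "type": "action"},
    }

    def harmonize(records, group):
        """Rename group-specific field names to the shared schema."""
        mapping = FIELD_MAP[group]
        return [{shared: record[local] for local, shared in mapping.items()} for record in records]

    pooled = (harmonize([{"user": "P1", "ts": 12, "event": "query"}], "group_a")
              + harmonize([{"pid": "P9", "time": 15, "type": "click"}], "group_b"))
    print(pooled)
    # [{'participant_id': 'P1', 'timestamp': 12, 'action': 'query'},
    #  {'participant_id': 'P9', 'timestamp': 15, 'action': 'click'}]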
Different research groups and individuals often want to study slightly different aspects of the problem domain or setup, requiring different questions in the questionnaire, different tasks or users, or different search system components. With every change, new users need to be recruited, and comparisons to previously collected data become harder. The long preparatory discussions among the iSBS organizers regarding research questions, theoretical frameworks and research designs suggest that it is possible to some extent to incorporate a broad set of research questions in the overall research design to allow a range of studies with the same setup. But often research questions change or new questions are prompted during and following the experiments, calling for an iterative development of the research design. We are not aware of any guidelines on how to best update designs to allow some backwards comparability. While there is large variability in research questions and research designs, the group would have benefited from re-using other researchers' research design components, as was done with the User Engagement Scale [24] in both iCHiC and iSBS. Apart from documenting the broad aspects of the experimental set-up in the track overview papers, a thorough documentation and subsequent publication of questionnaire items, scales and other measures would not only help other researchers in not having to re-invent standard items (e.g., demographic questions), but also support the standardization of IIR research.

7.6 Data Storage, Infrastructure Maintenance & Intellectual Property Rights

From 2011 until 2016, the various interactive tracks generated a wealth of data, but also went through numerous organizational changes, both in terms of the individuals involved and the institutions that provided infrastructure. iSBS started as part of INEX, with some data stored on servers dedicated to INEX activities, other data stored on servers maintained by one of the organizers' institutions, and the search indexes on yet another set of servers at another organizing institution.

Recurring questions are (1) what happens if organizers leave and own crucial pieces of the data or infrastructure, and (2) what happens when organizers move between institutions, thereby losing access to data or infrastructure? For research data management purposes, it is important that organizers of IIR studies make explicit who is responsible for which part of the data and systems, who owns the data or infrastructure, and what happens when organizers move to other institutions or leave the project, or when new organizers join.

While always intended, the organizers of iCHiC and iSBS could find hardly any re-use of the gathered data for IIR studies or triangulation studies with the related ad-hoc retrieval experiments in CHiC or SBS. One reason may have been the insufficient availability of the research data along with a proper rights clearance.

There are generic platforms for storing and sharing scientific data, such as the Open Science Framework (https://osf.io/) and several Dataverse (https://dataverse.org/) instances. These options solve some of the institutional issues, but they lack the flexibility to run experimental systems or to add domain-specific search and access features to datasets that make a repository like RepAST useful to the IIR community. Publicly available repositories for software and software infrastructures also exist (e.g., GitHub), but present similar problems to the research data repositories.

Next to problems of storage and access of IIR research data, there are issues of copyright, privacy and ethics. The questionnaire informs users which institutions are involved, but how should organizers deal with new researchers and institutions joining? One option is for organizers to agree on ethical guidelines for data gathering, informed consent and data representation. For further data re-use, it is crucial that users also give their informed consent for additional analyses of their data. To create a trustworthy environment, IIR researchers must provide concrete statements on who will use the data and for what future purposes. This should be available, in addition to the research data, as part of an archived and documented research design (see Section 7.5).

7.7 Coordinating Collaborative Research

IIR research is a highly interdisciplinary field bridging areas of information seeking, interactive and system-centered (ranking, evaluation) IR, and user interface design. Accordingly, researchers from different disciplines need to collaborate on complex questions and experimental setups. Entering the field of IIR research is still a challenge due to inconsistent or incompatible practices. Even for those that work on IIR problems, no collaboration on systems, tasks, data, participants or research questions can be observed. This might be the case due to time and resource constraints caused by traditional one-year research cycles as well as unawareness of other projects.

In assessing the interest in an interactive track in the SBS Lab during a joint iCHiC and SBS discussion session at CLEF 2013, everyone who stated their interest was involved in the initial discussions in setting up the track, to get an overview of what aspects they wanted to investigate, thereby shaping the track around a broad set of interests. This community input is valuable both in attracting groups to actively participate and in creating a setup with potential for long-term community support and interest. A challenge of the desired community input and larger organizer numbers is the required additional overhead for the decision processes. Once again, good documentation and communication are vital, as are well-understood guidelines or practices about the consequences of researchers joining or leaving the initiative.
Collaborative research also entails a joint understanding of how research results will be presented (e.g., rules of authorship and priority). This is especially important in large collaborations.

Collaborative research, by its very nature, tries to study aspects which require a large-scale infrastructure, a large number of users or other aspects that need strong community input. This will necessarily prolong the design and implementation phases of any study, which is a detriment in a fast-paced scholarly context such as IIR research, especially within the large evaluation campaigns or research conferences, which run on annual cycles. This type of work would be best supported by a multi-year project or by moving to a slower research output model.

8 OUTCOMES: WHERE TO GO FROM HERE?

Based on previous experiences from the CLEF/INEX Interactive Social Book Search tracks, the two Supporting Complex Search Tasks (SCST) community workshops (2015 and 2017) [2, 12] were organized to discuss IIR challenges and future directions in the area of complex search scenarios, since cooperation between the different tracks was rarely seen. The invited researchers from various fields concluded that collaborative IIR campaigns have great potential, but lack standardization and sustainability. Since previous efforts such as the Systematic Review of Assigned Search Tasks (RepAST) [34] have only been partly noticed or used, it remains an open question how to secure the persistence of IIR research designs and results.

The 2018 workshop on Barriers to IIR Resources Re-use (BIIRRR) switched the focus to the analysis and preparation of requirements for effective re-use of IIR resources or experiments [3]. The development of quality standards for the curation and re-use of research designs has been identified as one of the main tasks in this initiative, along with the appropriate documentation and publication of research data and the requisite software. Research designs were named as a priority, because they appear to have the highest potential for standardization and re-use in other IIR studies. This requires a proper analysis of previously used research design elements as well as motivation for or against potential re-use of these elements.

One idea is to develop a platform that would allow researchers from interdisciplinary fields to search for IIR research designs once they have been identified as re-usable and are stored and documented. Building such a repository requires an analysis and implementation of user requirements both for accessing and contributing research designs, the development of and agreement on a standardized data infrastructure, as well as a maintenance plan coordinated by a stable team of researchers.

Apart from a proper documentation and archiving strategy, this retrospective also pointed towards pre-study aspects, which are instrumental for re-using experimental research data and designs. This includes the establishment of guidelines for cross-national and cross-institutional data collection, informed consent and data distribution. As was stated several times in this paper, the reusability of research designs and other IIR study components strongly depends on the community's willingness to develop and maintain proper documentation, curation and publication guidelines. While this may not be as rewarding as creating new research data by implementing more IIR studies (and we need more of these as well), it is crucial for the community to standardize in order to move forward as a research discipline.
REFERENCES

[1] Thomas Beckers, Norbert Fuhr, Nils Pharo, Ragnar Nordlie, and Khairun Nisa Fachry. 2010. Overview and Results of the INEX 2009 Interactive Track. In ECDL (Lecture Notes in Computer Science), Mounia Lalmas, Joemon M. Jose, Andreas Rauber, Fabrizio Sebastiani, and Ingo Frommholz (Eds.), Vol. 6273. Springer, 409-412.
[2] Nicholas Belkin, Toine Bogers, Jaap Kamps, Diane Kelly, Marijn Koolen, and Emine Yilmaz. 2017. Second Workshop on Supporting Complex Search Tasks. In Proceedings of CHIIR 2017. ACM, New York, NY, 433-435.
[3] Toine Bogers, Maria Gäde, Mark Hall, Luanne Freund, Marijn Koolen, Vivien Petras, and Mette Skov. 2018. Report on the Workshop on Barriers to Interactive IR Resources Re-use (BIIRRR 2018). SIGIR Forum 52, 1 (Aug. 2018), 119-128.
[4] Pia Borlund. 2003. The IIR Evaluation Model: A Framework for Evaluation of Interactive Information Retrieval Systems. Information Research 8, 3 (2003).
[5] Pia Borlund. 2016. Interactive Information Retrieval: An Evaluation Perspective. In CHIIR '16: Proceedings of the 2016 ACM Conference on Human Information Interaction and Retrieval. ACM, New York, NY, USA, 151.
[6] Pia Borlund and Peter Ingwersen. 1997. The Development of a Method for the Evaluation of Interactive Information Retrieval Systems. Journal of Documentation 53, 3 (1997), 225-250.
[7] Martin Braschler, Khalid Choukri, Nicola Ferro, Allan Hanbury, Jussi Karlgren, Henning Müller, Vivien Petras, Emanuele Pianta, Maarten de Rijke, and Giuseppe Santucci. 2010. A PROMISE for Experimental Evaluation. In Multilingual and Multimodal Information Access Evaluation, Maristella Agosti, Nicola Ferro, Carol Peters, Maarten de Rijke, and Alan Smeaton (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 140-144.
[8] Donald O. Case and Lisa M. Given. 2016. Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior (4th ed.). Emerald Group Publishing, Bingley, UK.
[9] Otis Chandler. 2012. How Consumers Discover Books Online. In Tools of Change for Publishing. O'Reilly.
[10] David Ellis. 1989. A Behavioural Model for Information Retrieval System Design. Journal of Information Science 15, 4-5 (1989), 237-247.
[11] Maria Gäde, Nicola Ferro, and Monica Lestari Paramita. 2011. CHiC 2011 - Cultural Heritage in CLEF: From Use Cases to Evaluation in Practice for Multilingual Information Access to Cultural Heritage. In CLEF Notebook Papers/Labs/Workshop.
[12] Maria Gäde, Mark M. Hall, Hugo Huurdeman, Jaap Kamps, Marijn Koolen, Mette Skov, Elaine Toms, and David Walsh. 2015. Report on the First Workshop on Supporting Complex Search Tasks. SIGIR Forum 49, 1 (June 2015), 50-56.
[13] Maria Gäde, Mark Michael Hall, Hugo C. Huurdeman, Jaap Kamps, Marijn Koolen, Mette Skov, Toine Bogers, and David Walsh. 2016. Overview of the SBS 2016 Interactive Track. In Working Notes of the CLEF 2016 Conference (CEUR Workshop Proceedings), Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald (Eds.), Vol. 1609. CEUR-WS.org, 1024-1038.
[14] Maria Gäde, Mark Michael Hall, Hugo C. Huurdeman, Jaap Kamps, Marijn Koolen, Mette Skov, Elaine Toms, and David Walsh. 2015. Overview of the SBS 2015 Interactive Track. In Working Notes of the CLEF 2015 Conference (CEUR Workshop Proceedings), Linda Cappellato, Nicola Ferro, Gareth J. F. Jones, and Eric SanJuan (Eds.), Vol. 1391. CEUR-WS.org.
[15] Deborah Goodall. 1989. Browsing in Public Libraries. Library and Information Statistics Unit (LISU).
[16] Mark Hall, Robert Villa, Sophie Rutter, Daniel Bell, Paul Clough, and Elaine Toms. 2013. Sheffield Submission to the CHiC Interactive Task: Exploring Digital Cultural Heritage. CLEF Working Notes.
[17] Mark Michael Hall, Hugo C. Huurdeman, Marijn Koolen, Mette Skov, and David Walsh. 2014. Overview of the INEX 2014 Interactive Social Book Search Track. In Working Notes of the CLEF 2014 Conference (CEUR Workshop Proceedings), Linda Cappellato, Nicola Ferro, Martin Halvey, and Wessel Kraaij (Eds.), Vol. 1180. CEUR-WS.org, 480-493.
[18] Mark M. Hall, Spyros Katsaris, and Elaine Toms. 2013. A Pluggable Interactive IR Evaluation Work-bench. In European Workshop on Human-Computer Interaction and Information Retrieval. 35-38. http://ceur-ws.org/Vol-1033/paper4.pdf
[19] Mark Michael Hall and Elaine Toms. 2013. Building a Common Framework for IIR Evaluation. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Pamela Forner, Henning Müller, Roberto Paredes, Paolo Rosso, and Benno Stein (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 17-28.
[20] Preben Hansen, Chirag Shah, and Claus-Peter Klas. 2015. Collaborative Information Seeking. Springer.
[21] Marijn Koolen, Gabriella Kazai, Jaap Kamps, Antoine Doucet, and Monica Landoni. 2012. Overview of the INEX 2011 Books and Social Search Track. In Focused Retrieval of Content and Structure: 10th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2011) (LNCS), Shlomo Geva, Jaap Kamps, and Ralf Schenkel (Eds.), Vol. 7424. Springer.
[22] Carol C. Kuhlthau. 1991. Inside the Search Process: Information Seeking from the User's Perspective. Journal of the American Society for Information Science 42, 5 (1991), 361-371.
[23] Ragnar Nordlie and Nils Pharo. 2012. Seven Years of INEX Interactive Retrieval Experiments - Lessons and Challenges. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, Tiziana Catarci, Pamela Forner, Djoerd Hiemstra, Anselmo Peñas, and Giuseppe Santucci (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 13-23.
[24] Heather L. O'Brien and Elaine G. Toms. 2010. The Development and Evaluation of a Survey to Measure User Engagement. Journal of the American Society for Information Science and Technology 61, 1 (2010), 50-69. DOI: http://dx.doi.org/10.1002/asi.21229
[25] Vivien Petras, Toine Bogers, Elaine Toms, Mark Hall, Jacques Savoy, Piotr Malak, Adam Pawłowski, Nicola Ferro, and Ivano Masiero. 2013. Cultural Heritage in CLEF (CHiC) 2013. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Pamela Forner, Henning Müller, Roberto Paredes, Paolo Rosso, and Benno Stein (Eds.). Springer Berlin Heidelberg, 192-211.
[26] Vivien Petras, Nicola Ferro, Maria Gäde, Antoine Isaac, Michael Kleineberg, Ivano Masiero, Mattia Nicchio, and Juliane Stiller. 2012. Cultural Heritage in CLEF (CHiC) Overview 2012. In CLEF 2012 Labs and Workshops.
[27] Nils Pharo and Ragnar Nordlie. 2012. Examining the Effect of Task Stage and Topic Knowledge on Searcher Interaction with a Digital Bookstore. In Proceedings of the 4th Information Interaction in Context Symposium. ACM, 4-11.
[28] Kara Reuter. 2007. Assessing Aesthetic Relevance: Children's Book Selection in a Digital Library. Journal of the American Society for Information Science and Technology 58, 12 (2007), 1745-1763.
[29] Catherine Sheldrick Ross. 1999. Finding without Seeking: The Information Encounter in the Context of Reading for Pleasure. Information Processing & Management 35, 6 (1999), 783-799. DOI: http://dx.doi.org/10.1016/S0306-4573(99)00026-6
[30] Catherine Sheldrick Ross. 2000. Making Choices: What Readers Say about Choosing Books to Read for Pleasure. The Acquisitions Librarian 13, 25 (2000), 5-21.
[31] Katariina Saarinen and Pertti Vakkari. 2013. A Sign of a Good Book: Readers' Methods of Accessing Fiction in the Public Library. Journal of Documentation 69, 5 (2013), 736-754.
[32] Elaine Toms and Mark Hall. 2013. The CHiC Interactive Task (CHiCi) at CLEF 2013. CLEF Working Notes.
[33] Pertti Vakkari. 2001. A Theory of the Task-Based Information Retrieval Process: A Summary and Generalisation of a Longitudinal Study. Journal of Documentation 57, 1 (2001), 44-60.
[34] Barbara M. Wildemuth and Luanne Freund. 2012. Assigning Search Tasks Designed to Elicit Exploratory Search Behaviors. In Proceedings of the Symposium on Human-Computer Interaction and Information Retrieval (HCIR '12). ACM, New York, NY, USA, Article 4, 10 pages. DOI: http://dx.doi.org/10.1145/2391224.2391228