To Re-use is to Re-write: Experiences with Re-using IIR Experiment Software
Mark M Hall
mark.hall@informatik.uni-halle.de
Institut für Informatik, Martin-Luther-Universität Halle-Wittenberg
Halle (Saale), Germany

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on
Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK
2019. Copyright for the individual papers remains with the authors. Copying permitted
for private and academic purposes. This volume is published and copyrighted by its
editors.

ABSTRACT
Interactive Information Retrieval experiments have two main requirements. They need to
follow a workflow that takes the participant through the individual steps of the
experiment and they need to show the user an interface to interact with. Both of these
aspects look like they should lend themselves to re-use. This paper analyses the
experience of developing and re-using software for both of these aspects across a time
period of approximately five years. The main conclusion is that re-use of workflow
management software should be possible, but for software for interface creation the
question of whether re-use is possible is still open.

CCS CONCEPTS
• General and reference → Evaluation; • Software and its engineering → Software design
tradeoffs; Reusability; Software evolution;

KEYWORDS
Interactive Information Retrieval; Re-Use; Software; Evaluation

1    INTRODUCTION
Interactive Information Retrieval (IIR) experiments use a wide range of terminology,
research designs, methodologies, resources, and reporting structures. As has been
stated before, one of the issues this has led to is that re-use in IIR is, on the face
of it, harder and thus less common, a situation that the BIIRRR workshop series seeks
to address [1]. While IIR studies can be deployed via a range of devices, delivery via
the web is a common scenario and thus creating tools to ensure this process supports
as much re-use as possible is a potential starting point.
   This paper discusses the experience of creating and re-using two IIR software
systems for building IIR experiments across multiple IIR studies.

2    BACKGROUND
This analysis of the issues around re-using IIR web software components is based on
the experience of re-using two software systems across three shared tasks (Session
TREC, iCHiC, and iSBS) and a number of individual (IIR) studies.

2.1    Software
The two software components that form the focus of this analysis are the Experiment
Support System (ESS) and the Python Interactive Information Retrieval Evaluation
(PyIRE).

   2.1.1 Experiment Support System. The ESS [5] was developed to handle the challenge
of introducing and promoting a standardised, yet flexible methodology for a range of
IIR evaluation study structures, including generic, standardised measures that can be
deployed across studies and then allow for at least partial comparability of the
results. Over time, the accumulated studies should also provide a comprehensive
data-set that includes both context and process data that may be used by the IR
community to test and develop algorithms seated in human cognition and behaviour, and
additionally provide a sufficiently robust, detailed, reliable data-set that may be
used to test existing measures and develop new ones. The core aims were to
   (1) Provide a systematic way of setting up an experiment or user study that may be
       intuitively used by students and researchers;
   (2) Provide a standard set of evaluation measures to improve comparability;
   (3) Ensure that standard and consistent data formats are used to simplify the
       comparison and aggregation of studies;
   (4) Extract a standard procedure for the conduct of IIR studies from past research,
       so that studies can share a common protocol even if the system, the tasks, and
       the participant samples are different;
   (5) Reduce resource (financial, time, users, ...) commitment in the conduct of such
       studies.
To achieve this the overarching architecture in Figure 1 was developed, which consists
of the following components:
    • The Research Manager is the primary point of interaction for the researcher
      setting up an experiment. It is used to specify the workflow of the experiment,
      the tasks and interfaces to use, and all other measures to acquire. To simplify
      and standardise both the experiment process and results, the Research Manager is
      primed with a generic research protocol that specifies the basic experiment
      workflow and into which the researcher only has to add the experiment-specific
      aspects;
    • the Experiment System takes the experiment defined by the Research Manager and
      generates the UI screens that the participants interact with. It also ensures
      that the tasks and interfaces are correctly distributed and rotated between the
      participants, in accordance with the settings specified in the Research Manager.
      Finally, it loads the Task-specific UI
      and records the participants’ responses and ensures that they conform to the
      requirements specified by the researcher. To ensure the flexibility of the
      system, any web-based system can be used as the Task-specific UI;
    • the Data Extractor takes the participant data gathered by the Experiment System
      and provides them in a format that can be used by analysis packages such as SPSS
      or R. The data includes not only the participants’ responses, but also data on
      tasks / interfaces used by the participants and the order in which they
      appeared.

Figure 1: Design of the evaluation framework proposed in [5], with the three core and
the two study-specific components. In a study not situated in the IIR field, different
study-specific components would be used. In the framework, the researcher interacts
only with the Research Manager and Data Extractor, while the participant only ever
sees the Experiment System and Task-specific UI.

Figure 2: The evaluation workbench consists of the four core modules (Web Frontend,
Message Bus, Session, and Logging) into which the IIR components used in the
experiment are plugged.

Figure 3: The workbench’s main workflow starts with the generation of the initial UI
and then waits for the participant to generate a UI event. The event is processed, the
affected component’s state and UI are updated and the workbench goes back to waiting
for the next UI event. A powerful aspect of the workflow is that components, when they
receive a message, can generate their own messages.
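The event-processing loop captioned in Figure 3 could be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not the PyIRE implementation:
the classes `Component`, `SearchBox`, `SearchResults`, and `MessageBus`, the `handle`
and `dispatch` methods, the "frontend" source name, and the "submit" message are all
hypothetical names chosen for the example.

```python
from collections import defaultdict, deque


class Component:
    """Hypothetical base class for a pluggable UI component."""

    def __init__(self, name):
        self.name = name
        self.state = {}       # stands in for the per-participant session state
        self.changed = False  # set when the component's UI needs re-rendering

    def handle(self, message, payload):
        """Process a message; return a list of (message, payload) pairs to emit."""
        return []


class SearchBox(Component):
    def handle(self, message, payload):
        if message == "submit":
            self.state["text"] = payload
            self.changed = True
            # A component may generate its own message in response.
            return [("query", payload)]
        return []


class SearchResults(Component):
    def handle(self, message, payload):
        if message == "query":
            self.state["results"] = [f"result for {payload!r}"]
            self.changed = True
        return []


class MessageBus:
    """Routes messages to the components configured to listen for them."""

    def __init__(self):
        self.listeners = defaultdict(list)  # (source, message) -> [components]

    def connect(self, source, message, target):
        self.listeners[(source, message)].append(target)

    def dispatch(self, source, message, payload):
        """Deliver one UI event plus follow-up messages; return changed names."""
        queue = deque([(source, message, payload)])
        touched = []
        while queue:
            src, msg, data = queue.popleft()
            for component in self.listeners[(src, msg)]:
                for out_msg, out_data in component.handle(msg, data):
                    queue.append((component.name, out_msg, out_data))
                if component.changed and component.name not in touched:
                    touched.append(component.name)
        return touched


bus = MessageBus()
box, results = SearchBox("search_box"), SearchResults("search_results")
bus.connect("frontend", "submit", box)         # UI event -> search box
bus.connect("search_box", "query", results)    # search box -> results list
changed = bus.dispatch("frontend", "submit", "vermeer")
# changed == ["search_box", "search_results"]
```

The queue is what lets a component answer a message with further messages without the
bus needing to know anything about the components themselves; a frontend would then
re-render only the components named in `changed`.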

   To simplify the setup and further standardise IIR studies, the following two
IIR-specific components have been developed. In a study outside the IIR context, these
would be replaced with components developed for that context.
    • the Generic IIR Research Protocol aims to define a standardised and re-usable
      workflow and set of evaluation measures for IIR evaluation studies;
    • the Task Workbench provides an extensible and pluggable set of UI components for
      IIR interfaces, with the aim of simplifying the set-up of IIR evaluation
      experiments.

   The software was written in Python as a web-based application under an open-source
license. It allows the researcher to define complex experiment workflows, including
response-driven or data-driven conditional branching, loops, crowdsourcing-style
sampling of questions from a data-set, and full latin-square setups. In the case of
the data-driven and latin-square functionalities the system also automatically
balances participants across the various conditions. Researchers can also import and
export individual questions, pages, and complete experiment workflows in order to ease
re-use.

    2.1.2 PyIRE. The PyIRE system [4] implements what in Figure 1 is referred to as
the “Task-specific UI / Task Workbench”. It provides a Python-based, standardised API,
which allows the researcher to define IIR user-interface (UI) components, their layout
on screen, and the data-flows both between the interface and the components and
between components directly. To achieve this the PyIRE system uses the architecture
shown in Figure 2. To achieve maximum flexibility, the system was designed using a
message-passing architecture that consists of the following four components:
    • Web Frontend handles the interface between the participant’s browser and the
      evaluation workbench and is implemented using a combination of client-side and
      server-side functionality.
    • Message Bus handles the inter-component communication and forms the core of the
      system. It is responsible for passing messages from the Web Frontend to the IIR
      components configured to be listening for those messages and also for passing
      messages directly between the components.
    • Session handles loading and saving the components’ current state for a specific
      participant, hiding the complexities of web-application state from the
      individual components.
    • Logging provides a standardised logging interface that allows the components to
      easily attach logging information to the UI event generated by the participant.

   When the researcher sets up the workbench for their experiment, they can freely
configure which components to use, how to lay them out, and which components to
connect to which other components. Based on this configuration the Web Frontend
generates the initial user-interface that is shown to the participants. Then, when the
participant interacts with a UI element (Figure 3), the resulting UI event is handled
by the Web Frontend, which generates a message based on the UI event. This message is
passed to the Message Bus, which uses the configuration provided by the
researcher to determine which components to deliver the message to. The components
that are listening for that message update their own Session state based on the
message and then mark themselves as changed. After message processing has been
completed for all components, the Web Frontend then updates the UI for each of the
changed components.
   An example of the configuration used to set up the experiment is shown in Figure 5
(from the experiment in Figure 4), specifying the configuration of the
“search_results” component. It specifies that the component should be displayed 9
grid-cells wide (the application layout uses a 12-by-12 cell grid layout) and should
expand vertically to use as much space as is available. The component is configured to
be connected to the “search_box” component via the “query” message. It is this ability
to freely plug components together that, we believe, makes the framework sufficiently
flexible to support the wide range of IIR experiments, while remaining simple to set
up and use.
   The message-passing architecture should allow arbitrary components to work
together. This should allow the researcher to take components from other experiments,
for example a novel search result visualisation component, and combine them with other
components from their own research, such as a specific search backend.

2.2    Experiments
The two software components were developed and then further re-used in a series of
shared evaluation tasks and stand-alone IIR studies.

   2.2.1 Session TREC. The Session TREC shared task [2] ran from 2011 to 2014 as part
of the Text REtrieval Conference’s series of tracks. The aim was to provide
participating teams with multi-query search sessions in order to develop and evaluate
improved ranking algorithms that took previous queries and results into account. In
order to provide participating teams with the necessary multi-query session data, for
the 2012 iteration, the decision was made to acquire this session data through a
custom IIR experiment.
   The initial run (2012) used custom software, from which re-usable aspects were
identified. The ESS and PyIRE software packages were developed in the following year
to support both the Session TREC data acquisition and the iCHiC shared task described
below.

   2.2.2 iCHiC. The CHiC interactive (iCHiC) task was added to the longer-running
Cultural Heritage in CLEF lab in 2013 [7]. The interactive task focused on acquiring
and analysing an interactive information retrieval data-set describing undirected
exploration and browsing in a collection of approximately 1.1 million English-language
Cultural Heritage items. The task included both an online and an in-lab part. The task
UI provided three methods for the participants to explore the collection. On the left
there was a category browser that showed a hierarchical structure into which a sub-set
of the items in the collection (approximately 250,000) had been mapped automatically
[3]. The second option was to use the search box to type in and run a query. The third
method was to click on an item’s meta-data, which would run a search for other items
with the same meta-data. In all three cases, the items for the selected category,
user-provided query, or meta-data query would be shown in the central grid.
   As stated above, the exploration/browsing interface was built using the PyIRE
software.

   2.2.3 iSBS. The interactive Social Book Search task in the CLEF Social Book Search
lab ran for three years from 2014 to 2016 [6] and combined ideas from the iCHiC task
with research questions from the longer-running Social Book Search (SBS) lab. Users
looking for books online are confronted with both professional meta-data and
user-generated content. The goal of the Interactive Social Book Search Track was to
investigate how users used these two sources of information when looking for books in
a leisure context.

Figure 4: Example Interface built with the PyIRE workbench. The interface here
consists of five separate components (task, saved pages, search box, pagination, and
search results list), which are joined together via the interface configuration.

[SearchResults]
handler = application.components.SearchResults
name = search_results
layout = grid-9 vgrid-expand
connect = search_box:query

Figure 5: Example configuration for a Standard Results List component, showing how the
component’s layout (9 grid-cells wide and vertically expanding) and connections to
other components (to the “search_box” component via the query message) are specified.
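The configuration in Figure 5 resembles INI syntax. Assuming that it is INI-compatible
(the paper does not say how PyIRE itself reads it), a minimal sketch of loading such a
component definition with Python's standard `configparser` might look as follows. The
`ComponentConfig` dataclass and the `parse_components` function are hypothetical, as
is the assumption that several `source:message` connections would be separated by
whitespace.

```python
import configparser
from dataclasses import dataclass


@dataclass
class ComponentConfig:
    section: str
    handler: str
    name: str
    layout: list        # layout tokens, e.g. ["grid-9", "vgrid-expand"]
    connections: list   # (source component, message) pairs


def parse_components(text):
    """Parse INI-style component configuration into ComponentConfig objects."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    components = []
    for section in parser.sections():
        values = parser[section]
        connections = []
        # Assumption: multiple connections are whitespace-separated.
        for entry in values.get("connect", fallback="").split():
            source, _, message = entry.partition(":")
            connections.append((source, message))
        components.append(
            ComponentConfig(
                section=section,
                handler=values["handler"],
                name=values["name"],
                layout=values.get("layout", fallback="").split(),
                connections=connections,
            )
        )
    return components


# The configuration text from Figure 5.
FIGURE_5 = """
[SearchResults]
handler = application.components.SearchResults
name = search_results
layout = grid-9 vgrid-expand
connect = search_box:query
"""

config = parse_components(FIGURE_5)[0]
# config.connections == [("search_box", "query")]
```

Keeping layout and connections as plain data, separate from the component code, is
what would let the same component be re-wired for a different experiment without
touching its implementation.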
   In the first year, the PyIRE workbench was used to construct two UIs, one a
traditional faceted search interface and one a novel three-stage interface, based on
the Vakkari search stages [8]. In the second year, the three-stage interface was
modified, while in the third year only an unchanged three-stage interface was tested,
but using a wider range of tasks.

   2.2.4 WorldCat. The WorldCat experiment [unpublished] looked at known-item search
tasks within a large bibliographic data-set. The PyIRE workbench was used to construct
a replica of the WorldCat interface, but used the SBS book data-set to provide a
controlled data-set. The ESS was used to manage the experiment workflow.

   2.2.5 Spatial Language & Jokes Transcription. The Spatial Language and Jokes
Transcription experiments only used the ESS to handle the experiment workflow aspects.
Neither of these experiments was a traditional IIR study, but both re-used major parts
of the workflow developed in [5]. However, the experiment-specific UIs were custom
built for both of the experiments.

3     EXPERIENCE
The primary take-away message from the experience of developing, re-using, and
maintaining the two software packages over the course of five years is that the more
generic the software, the easier it is to re-use that component. That is not
particularly surprising, as it is in line with the re-use of other software components
in IIR. For example, few IIR experiments build a new search backend from scratch for
their IIR studies, as the generic search engines that are available are easily
adaptable to the specific data requirements.

3.1    ESS
The experience of re-using and evolving the ESS has mostly been positive, with the
majority of issues encountered being common software development issues rather than
IIR-specific issues.
   As the ESS was re-used throughout the years, the main change was the addition of
increasingly complex and powerful features. The initial version was designed to allow
the combination of standard survey-style questions with data-driven, task-specific
crowdsourcing questions (where the question is wholly or in part driven by a data-set
stored in the system). As the complexity of the experiment workflows increased, the
ESS’ functionality was increased, adding latin-square and conditional branching
support for the iSBS task. This in some cases required re-writing parts of the ESS
implementation, but it was always a matter of software evolution, rather than having
to make major conceptual or structural changes to the system.
   The generic nature of the ESS has also enabled re-use in studies that lie outside
the IIR context. This was possible because of the initial decision to host the
experiment-specific functionality outside the ESS and include it via the use of HTML
frames. Thus the map-based and transcription experiment interfaces developed for the
Spatial Language and Jokes Transcription experiments could easily be integrated into
the ESS experiment workflow.
   However, there have also been some issues with re-use of the ESS. These fall into
two categories: issues with embedding the task-specific UI into the ESS and issues
with documentation.
   The documentation issues affected the re-use of the ESS in two ways. Missing
user-focused documentation on how to use the more advanced functionalities (primarily
in the area of latin-squares, data-driven questions, and conditional branching) meant
that re-use by other academics has been limited to those who have easy access to the
ESS developer in order to get support in how to set up such experiments. In theory
these could have been addressed relatively easily: all that was needed were small
tutorials to illustrate in which order to execute the individual steps. For example,
for a data-driven crowdsourcing experiment, it is necessary to first create the
data-set that has all the different items which are sampled, then create the page to
display them on, and finally use text markup to embed the data in the page that is
displayed to the participant. However, none of this is particularly apparent from the
interface itself.
   The other documentation issue is related to the documentation of the code itself,
which is very patchy. The result of this is that the ESS has reached a point where it
is highly functional, but essentially cannot be maintained or developed any further,
as any change risks breaking existing functionality in unexpected ways.
   Both of these issues are primarily caused by the ESS being a side-project, where
what time was available was focused on improving the functionality and not
documentation. While not a particularly novel conclusion, it does re-iterate the point
that without adequate documentation, re-use is highly unlikely.
   The second category of issues concerned embedding the task-specific UI in the ESS.
The first of these is caused by a limitation of the use of frames for embedding the
task-specific UI. In order to produce an embedding that mostly hid the fact that the
task-specific UI was embedded, the researcher had to manually use a large amount of
CSS and some JavaScript to correctly adapt the size of the frame in which the UI is
embedded. This created an instant barrier to re-use, as it required some very
specialised technical skills.
   The other issue with the task UI embedding arose from the need to link the
responses in the ESS with those in the task UI. This is necessary as the ESS and the
task UI are completely separate systems, thus no automatic linking is possible. To
create a link, the unique ID of the ESS participant can be embedded in the URL that
loads the task UI. When the task UI loads, the software generating the UI can access
this ID and store it together with the other data collected by the task UI. Then, when
the data is extracted, the ID can be used to merge the survey responses collected by
the ESS with the data logged by the task UI.
   For the 2013 Session TREC experiment, a configuration error caused the same static
identifier to be sent for all participants. While the error occurred due to a mistake
made by the researcher when setting up the experiment, the brittleness of the linkage
between the two systems and the difficulty of seeing whether the linking ID data was
being transferred correctly allowed the mistake to go unnoticed. As a result, in that
year the Session TREC data-set consisted only of the session query logs, but without
any information on the participants themselves, significantly limiting the value of
the data-set as a whole.
   Partly this issue is due to the ESS trying to be both a system that requires
minimal technical skills and a system that is very powerful and flexible, allowing the
researcher to adapt the system to a large degree. Based on the experience, I would
suggest that
future systems of this kind focus either on pure ease-of-use or on technical
flexibility, but not both, and that from the beginning documentation is a core step.

3.2    PyIRE
While re-use of the ESS was fundamentally possible, re-use of the PyIRE workbench was
not as successful. While PyIRE was used across all three shared tasks and in the
WorldCat experiment, each re-use of the workbench essentially involved a major
re-write of the software.
    The re-writes revealed the difficulty of developing truly re-usable UI components
and also the difficulty in designing an architecture that allows for minimally
connected components that at the same time provide the user with a cohesive user
experience. As the PyIRE system was re-written to support more complex interface
structures, more and more data had to be explicitly passed between components,
functionality that had to be added to each component, countering the core idea of
plug-and-play reusability.
    Another issue was that while the architecture decoupled the components, there were
some highly coupled interactions between the components and the underlying PyIRE
functionality, particularly around rendering. These coupling points meant that in some
cases, components had to have complex internal structures, simply because they had to
handle the case where the functionality was needed to update the component’s display
and in other cases just to provide service functionality to other components.
    The main effect of the re-writes and the coupling issues was that it is very hard
to actually replicate the past experiments, as each one requires a very specific
version of the PyIRE system to run. While the code is available, this means that for
each experiment, a new instance of the PyIRE server would have to be run, undermining
the

are targeted at different levels of technical expertise, but which use a standardised
format for describing questions and answer options, page structures, and overall
experiment workflows. This would then allow importing and exporting these and moving
experiments between the different systems.
   The third conclusion is that how to achieve the re-usability of software that helps
with building IIR interfaces is still an open question. Considering that there have
been different approaches, including the one documented here, none of which have
caught on in the slightest, it is also unclear whether there is actually any value in
attempting this.
   From the experience I would suggest that future work should focus on defining
re-usable formats for questions, answer options, page structures, and experiment
workflows. These could then be moved between systems, enabling re-usability while
allowing for flexibility in what systems people want to use. The re-usability of task
UIs is something I am currently unconvinced is worth pursuing.

REFERENCES
[1] T. Bogers, M. Gäde, L. Freund, M. Hall, M. Koolen, V. Petras, and M. Skov. Workshop
    on Barriers to Interactive IR Resources Re-use. In Proceedings of the 2018
    Conference on Human Information Interaction & Retrieval, CHIIR ’18, pages 382–385,
    New York, NY, USA, 2018. ACM.
[2] B. Carterette, P. Clough, M. Hall, E. Kanoulas, and M. Sanderson. Evaluating
    Retrieval over Sessions: The TREC Session Track 2011-2014. In Proceedings of the
    39th International ACM SIGIR Conference on Research and Development in Information
    Retrieval, SIGIR ’16, pages 685–688, New York, NY, USA, 2016. ACM.
[3] M. M. Hall, S. Fernando, P. Clough, A. Soroa, E. Agirre, and M. Stevenson.
    Evaluating Hierarchical Organisation Structures for Exploring Digital Libraries.
    17(4):351–379, 2014.
[4] M. M. Hall, S. Katsaris, and E. Toms. A Pluggable Interactive IR Evaluation
    Workbench. In European Workshop on Human-Computer Interaction and Information
    Retrieval, pages 35–38, 2013.
[5] M. M. Hall and E. Toms. Building a Common Framework for IIR Evaluation. In
    CLEF 2013 - Information Access Evaluation. Multilinguality, Multimodality, and
point of having a workbench that allows implementing multiple                 Visualization, pages 17–28, 2013.
experiments in an easy-to-manage environment.                             [6] M. Koolen, T. Bogers, M. Gäde, M. M. Hall, I. Hendrickx, J. Kamps, M. Skov,
                                                                              S. Verberne, and D. Walsh. Overview of the CLEF 2016 Social Book Search Lab.
    The big question, which I cannot answer, is whether the prob-             2016.
lem with the PyIRE are due to specific mistakes made in how the           [7] E. Toms and M. M. Hall.                The CHiC Interactive Task (CHiCi) at
decoupled architecture was implemented, or generic issues with                CLEF2013. http://www.clef-initiative.eu/documents/71612/1713e643-27c3-4d76-
                                                                              9a6f-926cdb1db0f4, 2013.
the architecture itself.                                                  [8] P. Vakkari. A theory of the task-based information retrieval process: a summary
    In particular, the question is related to the ESS issue with the          and generalisation of a longitudinal study. 57(1):44–60, 2001.
target user groups. The way the PyIRE can be used was designed
to support both researchers wishing to develop their own compo-
nents, but also researchers who lacked the technical skills to build
their own and simply wanted to re-use existing components with
other tasks or data. Attempting to support both scenarios created
significant additional complexity, essentially making the system
hard to use for both groups.

4     CONCLUSION
The main conclusion from this analysis is that the core issue for
long-term re-use and maintainability of software for IIR experi-
ments is the availability of adequate documentation and in this
respect IIR experiment software is no different to any other soft-
ware.
   The second conclusion is that the development of a system that
supports researchers in the development, deployment, and re-use
of IIR experiment workflows is possible, as evidenced by the ESS
system. Ideally this might take the form of multiple systems that