To Re-use is to Re-write: Experiences with Re-using IIR Experiment Software

Mark M Hall
mark.hall@informatik.uni-halle.de
Institut für Informatik, Martin-Luther-Universität Halle-Wittenberg
Halle (Saale), Germany

Workshop on Barriers to Interactive IR Resources Re-use at the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2019), 14 March 2019, Glasgow, UK. 2019. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
Interactive Information Retrieval experiments have two main requirements. They need to follow a workflow that takes the participant through the individual steps of the experiment and they need to show the user an interface to interact with. Both of these aspects look like they should lend themselves to re-use. This paper analyses the experience of developing and re-using software for both of these aspects across a time period of approximately five years. The main conclusion is that re-use of workflow management software should be possible, but for software for interface creation the question of whether re-use is possible is still open.

CCS CONCEPTS
• General and reference → Evaluation; • Software and its engineering → Software design tradeoffs; Reusability; Software evolution;

KEYWORDS
Interactive Information Retrieval; Re-Use; Software; Evaluation

1 INTRODUCTION
Interactive Information Retrieval (IIR) experiments use a wide range of terminology, research designs, methodologies, resources, and reporting structures. As has been stated before, one of the issues this has led to is that re-use in IIR is, on the face of it, harder and thus less common, a situation that the BIIRRR workshop series seeks to address [1]. While IIR studies can be deployed via a range of devices, delivery via the web is a common scenario and thus creating tools to ensure this process supports as much re-use as possible is a potential starting point.

This paper discusses the experience of creating and re-using two IIR software systems for building IIR experiments across multiple IIR studies.

2 BACKGROUND
This analysis of the issues around re-using IIR web software components is based on the experience of re-using two software systems across three shared tasks (Session TREC, iCHiC, and iSBS) and a number of individual IIR studies.

2.1 Software
The two software components that form the focus of this analysis are the Experiment Support System (ESS) and the Python Interactive Information Retrieval Evaluation (PyIRE).

2.1.1 Experiment Support System. The ESS [5] was developed to handle the challenge of introducing and promoting a standardised, yet flexible methodology for a range of IIR evaluation study structures, including generic, standardised measures that can be deployed across studies and then allow for at least partial comparability of the results. Over time, the accumulated studies should also provide a comprehensive data-set that includes both context and process data that may be used by the IR community to test and develop algorithms seated in human cognition and behaviour, and additionally to provide a sufficiently robust, detailed, reliable data-set that may be used to test existing measures and develop new ones.

The core aims were to
(1) Provide a systematic way of setting up an experiment or user study that may be intuitively used by students and researchers;
(2) Provide a standard set of evaluation measures to improve comparability;
(3) Ensure that standard and consistent data formats are used to simplify the comparison and aggregation of studies;
(4) Extract a standard procedure for the conduct of IIR studies from past research, so that studies can share a common protocol even if the system, the tasks, and the participant samples are different;
(5) Reduce resource (financial, time, users, ...) commitment in the conduct of such studies.

To achieve this the overarching architecture in Figure 1 was developed, which consists of the following components:
• the Research Manager is the primary point of interaction for the researcher setting up an experiment. It is used to specify the workflow of the experiment, the tasks and interfaces to use, and all other measures to acquire. To simplify and standardise both the experiment process and results, the Research Manager is primed with a generic research protocol that specifies the basic experiment workflow and into which the researcher only has to add the experiment-specific aspects;
• the Experiment System takes the experiment defined by the Research Manager and generates the UI screens that the participants interact with. It also ensures that the tasks and interfaces are correctly distributed and rotated between the participants, in accordance with the settings specified in the Research Manager. Finally it loads the Task-specific UI, records the participants’ responses, and ensures that they conform to the requirements specified by the researcher. To ensure the flexibility of the system, any web-based system can be used as the Task-specific UI;
• the Data Extractor takes the participant data gathered by the Experiment System and provides it in a format that can be used by analysis packages such as SPSS or R. The data includes not only the participants’ responses, but also data on the tasks and interfaces used by the participants and the order in which they appeared.

Figure 1: Design of the evaluation framework proposed in [5], with the three core and the two study-specific components. In a study not situated in the IIR field, different study-specific components would be used. In the framework, the researcher interacts only with the Research Manager and Data Extractor, while the participant only ever sees the Experiment System and Task-specific UI.

To simplify the setup and further standardise IIR studies, the following two IIR-specific components have been developed. In a study outside the IIR context, these would be replaced with components developed for that context.
• the Generic IIR Research Protocol aims to define a standardised and re-usable workflow and set of evaluation measures for IIR evaluation studies;
• the Task Workbench provides an extensible and pluggable set of UI components for IIR interfaces, with the aim of simplifying the set-up of IIR evaluation experiments.

The software was written in Python as a web-based application under an open-source license. It allows the researcher to define complex experiment workflows, including response-driven or data-driven conditional branching, loops, crowdsourcing-style sampling of questions from a data-set, and full latin-square setups. In the case of the data-driven and latin-square functionalities the system also automatically balances participants across the various conditions. Researchers can also import and export individual questions, pages, and complete experiment workflows in order to ease re-use.

2.1.2 PyIRE. The PyIRE system [4] implements what in Figure 1 is referred to as the “Task-specific UI / Task Workbench”. It provides a Python-based, standardised API, which allows the researcher to define IIR user-interface (UI) components, their layout on screen, and the data-flows both between the interface and the components and between components directly. To achieve this the PyIRE system uses the architecture shown in Figure 2. To achieve maximum flexibility, the system was designed using a message-passing architecture that consists of the following four components:
• Web Frontend handles the interface between the participant’s browser and the evaluation workbench and is implemented using a combination of client-side and server-side functionality.
• Message Bus handles the inter-component communication and forms the core of the system. It is responsible for passing messages from the Web Frontend to the IIR components configured to be listening for those messages and also for passing messages directly between the components.
• Session handles loading and saving the components’ current state for a specific participant, hiding the complexities of web-application state from the individual components.
• Logging provides a standardised logging interface that allows the components to easily attach logging information to the UI event generated by the participant.

Figure 2: The evaluation workbench consists of the four core modules (Web Frontend, Message Bus, Session, and Logging) into which the IIR components used in the experiment are plugged.

When the researcher sets up the workbench for their experiment, they can freely configure which components to use, how to lay them out, and which components to connect to which other components. Based on this configuration the Web Frontend generates the initial user-interface that is shown to the participants. Then, when the participant interacts with a UI element (Figure 3), the resulting UI event is handled by the Web Frontend, which generates a message based on the UI event. This message is passed to the Message Bus, which uses the configuration provided by the researcher to determine which components to deliver the message to. The components that are listening for that message update their own Session state based on the message and then mark themselves as changed. After message processing has been completed for all components, the Web Frontend then updates the UI for each of the changed components.

Figure 3: The workbench’s main workflow starts with the generation of the initial UI and then waits for the participant to generate a UI event. The event is processed, the affected component’s state and UI are updated and the workbench goes back to waiting for the next UI event. A powerful aspect of the workflow is that components, when they receive a message, can generate their own messages.

An example of the configuration used to set up the experiment is shown in Figure 5 (from the experiment in Figure 4), specifying the configuration of the “search_results” component. It specifies that the component should be displayed 9 grid-cells wide (the application layout uses a 12-by-12 cell grid layout) and should expand vertically to use as much space as is available. The component is configured to be connected to the “search_box” component via the “query” message. It is this ability to freely plug components together that, we believe, makes the framework sufficiently flexible to support the wide range of IIR experiments, while remaining simple to set up and use.

Figure 4: Example interface built with the PyIRE workbench. The interface here consists of five separate components (task, saved pages, search box, pagination, and search results list), which are joined together via the interface configuration.

    [SearchResults]
    handler = application.components.SearchResults
    name = search_results
    layout = grid-9 vgrid-expand
    connect = search_box:query

Figure 5: Example configuration for a Standard Results List component, showing how the component’s layout (9 grid-cells wide and vertically expanding) and connections to other components (to the “search_box” component via the query message) are specified.

The message-passing architecture should allow arbitrary components to work together. This should allow the researcher to take components from other experiments, for example a novel search result visualisation component, and combine it with other components from their own research, such as a specific search backend.

2.2 Experiments
The two software components were developed and then further re-used in a series of shared evaluation tasks and stand-alone IIR studies.

2.2.1 Session TREC. The Session TREC shared task [2] ran from 2011 to 2014 as part of the Text REtrieval Conference’s series of tracks. The aim was to provide participating teams with multi-query search sessions in order to develop and evaluate improved ranking algorithms that took previous queries and results into account. In order to provide participating teams with the necessary multi-query session data, for the 2012 iteration, the decision was made to acquire this session data through a custom IIR experiment.

The initial run (2012) used custom software, from which re-usable aspects were identified. The ESS and PyIRE software packages were developed in the following year to support both the Session TREC data acquisition and the iCHiC shared task described below.

2.2.2 iCHiC. The CHiC interactive (iCHiC) task was added to the longer-running Cultural Heritage in CLEF lab in 2013 [7]. The interactive task focused on acquiring and analysing an interactive information retrieval data-set describing undirected exploration and browsing in a collection of approximately 1.1 million English-language Cultural Heritage items. The task included both an online and an in-lab part. The task UI provided three methods for the participants to explore the collection. On the left there was a category browser, which showed a hierarchical structure into which a sub-set of the items in the collection (approximately 250,000) had been mapped automatically [3]. The second option was to use the search box to type in and run a query. The third method was to click on an item’s meta-data, which would run a search for other items with the same meta-data. In all three cases, the items for the selected category, user-provided query, or meta-data query would be shown in the central grid.

As stated above, the exploration/browsing interface was built using the PyIRE software.

2.2.3 iSBS. The interactive Social Book Search task in the CLEF Social Book Search lab ran for three years from 2014 to 2016 [6] and combined ideas from the iCHiC task with research questions from the longer-running Social Book Search (SBS) lab. Users looking for books online are confronted with both professional meta-data and user-generated content. The goal of the Interactive Social Book Search Track was to investigate how users used these two sources of information when looking for books in a leisure context.

In the first year, the PyIRE workbench was used to construct two UIs, one a traditional faceted search interface and one a novel three-stage interface, based on the Vakkari search stages [8]. In the second year, the three-stage interface was modified, while in the third year only an unchanged three-stage interface was tested, but using a wider range of tasks.

2.2.4 WorldCat. The WorldCat experiment [unpublished] looked at known-item search tasks within a large bibliographic data-set. The PyIRE workbench was used to construct a replica of the WorldCat interface, but used the SBS book data-set to provide a controlled data-set. The ESS was used to manage the experiment workflow.

2.2.5 Spatial Language & Jokes Transcription. The Spatial Language and Jokes Transcription experiments only used the ESS to handle the experiment workflow aspects. Neither of these experiments was a traditional IIR study, but both re-used major parts of the workflow developed in [5]. However, the experiment-specific UIs were custom built for both of the experiments.

3 EXPERIENCE
The primary take-away message from the experience of developing, re-using, and maintaining the two software packages over the course of five years is that the more generic the software, the easier it is to re-use that component. That is not particularly surprising, as it is in line with the re-use of other software components in IIR. For example, few IIR experiments build a new search backend from scratch for their IIR studies, as the generic search engines that are available are easily adaptable to the specific data requirements.

3.1 ESS
The experience of re-using and evolving the ESS has mostly been positive, with the majority of issues encountered being common software development issues rather than IIR-specific ones.

As the ESS was re-used throughout the years, the main change was the addition of increasingly complex and powerful features. The initial version was designed to allow the combination of standard survey-style questions with data-driven, task-specific crowdsourcing questions (where the question is wholly or in part driven by a data-set stored in the system). As the complexity of the experiment workflows increased, the ESS’s functionality was extended, adding latin-square and conditional branching support for the iSBS task. This in some cases required re-writing parts of the ESS implementation, but it was always a matter of software evolution, rather than having to make major conceptual or structural changes to the system.

The generic nature of the ESS has also enabled re-use in studies that lie outside the IIR context. This was easily possible because of the initial decision to host the experiment-specific functionality outside the ESS and include it via the use of HTML frames. Thus the map-based and transcription experiment interfaces developed for the Spatial Language and Jokes Transcription experiments could easily be integrated into the ESS experiment workflow.

However, there have also been some issues with re-using the ESS. These fall into two categories: issues with embedding the task-specific UI into the ESS and issues with documentation.

The documentation issues affected the re-use of the ESS in two ways. Missing user-focused documentation on how to use the more advanced functionalities (primarily in the area of latin-squares, data-driven questions, conditional branching) meant that re-use by other academics has been limited to those who have easy access to the ESS developer in order to get support in how to set up such experiments. In theory these could have been addressed relatively easily: all that was needed were small tutorials to illustrate in which order to execute the individual steps. For example, for a data-driven crowdsourcing experiment, it is necessary to first create the data-set that has all the different items which are sampled, then create the page to display them on, and finally use text markup to embed the data in the page that is displayed to the participant. However, none of this is particularly apparent from the interface itself.

The other documentation issue is related to the documentation of the code itself, which is very patchy. The result of this is that the ESS has reached a point where it is highly functional, but essentially cannot be maintained or developed any further, as any change risks breaking existing functionality in unexpected ways.

Both of these issues are primarily caused by the ESS being a side-project, where what time was available was focused on improving the functionality and not documentation. While not a particularly novel conclusion, it does re-iterate the point that without adequate documentation, re-use is highly unlikely.

The second major issue that was encountered was with embedding the task-specific UI in the ESS, and it has two parts. The first is caused by a limitation of the use of frames for embedding the task-specific UI. In order to produce an embedding that mostly hid the fact that the task-specific UI was embedded, the researcher had to manually use a large amount of CSS and some JavaScript to correctly adapt the size of the frame in which the UI is embedded. This created an instant barrier to re-use, as it required some very specialised technical skills.

The other issue with the task UI embedding arose from the need to link the responses in the ESS with those in the task UI. This is necessary as the ESS and the task UI are completely separate systems, thus no automatic linking is possible. To create a link, the unique ID of the ESS participant can be embedded in the URL that loads the task UI. When the task UI loads, the software generating the UI can access this ID and store it together with the other data collected by the task UI. Then, when the data is extracted, the ID can be used to merge the two sources: the survey responses collected by the ESS and the data logged by the task UI.

For the 2013 Session TREC experiment, a configuration error caused the same static identifier to be sent for all participants. While the error occurred due to a mistake made by the researcher when setting up the experiment, the brittleness of the linkage between the two systems and the difficulty of seeing whether the linking ID data was being transferred correctly allowed the mistake to go unnoticed. As a result, in that year the Session TREC data-set consisted only of the session query logs, without any information on the participants themselves, significantly limiting the value of the data-set as a whole.

Partly this issue is due to the ESS trying to be both a system that requires minimal technical skills and a system that is very powerful and flexible, allowing the researcher to adapt the system to a large degree. Based on the experience, I would suggest that future systems of this kind either focus on the pure ease-of-use or the technical flexibility, but not both, and that from the beginning documentation is a core step.

3.2 PyIRE
While re-use of the ESS was fundamentally possible, re-use of the PyIRE workbench was not as successful. While PyIRE was used across all three shared tasks and in the WorldCat experiment, each re-use of the workbench essentially involved a major re-write of the software.

The re-writes revealed the difficulty of developing truly re-usable UI components and also the difficulty in designing an architecture that allows for minimally connected components that at the same time provide the user with a cohesive user experience. As the PyIRE system was re-written to support more complex interface structures, more and more data had to be explicitly passed between components, functionality that had to be added to each component, countering the core idea of plug-and-play reusability.

Another issue was that while the architecture decoupled the components, there were, particularly around rendering, some highly coupled interactions between the components and the underlying PyIRE functionality. These coupling points meant that in some cases, components had to have complex internal structures, simply because they had to handle the case where the functionality was needed to update the component’s display and in other cases just to provide service functionality to other components.

The main effect of the re-writes and the coupling issues was that it is very hard to actually replicate the past experiments, as each one requires a very specific version of the PyIRE system to run. While the code is available, this means that for each experiment, a new instance of the PyIRE server would have to be run, undermining the point of having a workbench that allows implementing multiple experiments in an easy-to-manage environment.

The big question, which I cannot answer, is whether the problems with PyIRE are due to specific mistakes made in how the decoupled architecture was implemented, or generic issues with the architecture itself.

In particular, the question is related to the ESS issue with the target user groups. PyIRE was designed to support both researchers wishing to develop their own components and researchers who lacked the technical skills to build their own and simply wanted to re-use existing components with other tasks or data. Attempting to support both scenarios created significant additional complexity, essentially making the system hard to use for both groups.

4 CONCLUSION
The main conclusion from this analysis is that the core issue for long-term re-use and maintainability of software for IIR experiments is the availability of adequate documentation and in this respect IIR experiment software is no different to any other software.

The second conclusion is that the development of a system that supports researchers in the development, deployment, and re-use of IIR experiment workflows is possible, as evidenced by the ESS. Ideally this might take the form of multiple systems that are targeted at different levels of technical expertise, but which use a standardised format for describing questions and answer options, page structures, and overall experiment workflows. This would then allow importing and exporting these and moving experiments between the different systems.

The third conclusion is that how to achieve the re-usability of software that helps with building IIR interfaces is still an open question. Considering that there have been different approaches, including the one documented here, none of which have caught on in the slightest, it is also unclear whether there is actually any value in attempting this.

From the experience I would suggest that future work should focus on defining re-usable formats for questions, answer options, page structures, and experiment workflows. These could then be moved between systems, enabling re-usability while allowing for flexibility in what systems people want to use. Re-usability of task UIs is an area where I am currently unconvinced that it is worth pursuing.

REFERENCES
[1] T. Bogers, M. Gäde, L. Freund, M. Hall, M. Koolen, V. Petras, and M. Skov. Workshop on barriers to interactive IR resources re-use. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval, CHIIR ’18, pages 382–385, New York, NY, USA, 2018. ACM.
[2] B. Carterette, P. Clough, M. Hall, E. Kanoulas, and M. Sanderson. Evaluating retrieval over sessions: The TREC session track 2011-2014. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, pages 685–688, New York, NY, USA, 2016. ACM.
[3] M. M. Hall, S. Fernando, P. Clough, A. Soroa, E. Agirre, and M. Stevenson. Evaluating hierarchical organisation structures for exploring digital libraries. 17(4):351–379, 2014.
[4] M. M. Hall, S. Katsaris, and E. Toms. A Pluggable Interactive IR Evaluation Workbench. In European Workshop on Human-Computer Interaction and Information Retrieval, pages 35–38, 2013.
[5] M. M. Hall and E. Toms. Building a Common Framework for IIR Evaluation. In CLEF 2013 - Information Access Evaluation. Multilinguality, Multimodality, and Visualization, pages 17–28, 2013.
[6] M. Koolen, T. Bogers, M. Gäde, M. M. Hall, I. Hendrickx, J. Kamps, M. Skov, S. Verberne, and D. Walsh. Overview of the CLEF 2016 Social Book Search Lab. 2016.
[7] E. Toms and M. M. Hall. The CHiC Interactive Task (CHiCi) at CLEF2013. http://www.clef-initiative.eu/documents/71612/1713e643-27c3-4d76-9a6f-926cdb1db0f4, 2013.
[8] P. Vakkari. A theory of the task-based information retrieval process: a summary and generalisation of a longitudinal study. 57(1):44–60, 2001.
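
As an illustration of the component configuration in Figure 5 and the message flow described in Section 2.1.2, the following is a minimal sketch and not the PyIRE implementation. The MessageBus class, the build_bus helper, the extra [SearchBox] section, and the reading of "connect" as whitespace-separated source:message pairs are all assumptions introduced here for illustration; only the [SearchResults] section is taken from Figure 5.

```python
import configparser
from collections import defaultdict

# The [SearchResults] section is copied from Figure 5.
# The [SearchBox] section is hypothetical and only added so the example runs.
CONFIG = """
[SearchBox]
handler = application.components.SearchBox
name = search_box
layout = grid-12

[SearchResults]
handler = application.components.SearchResults
name = search_results
layout = grid-9 vgrid-expand
connect = search_box:query
"""


class MessageBus:
    """Toy message bus (hypothetical): routes messages to listening components."""

    def __init__(self):
        # Maps (source component name, message name) -> list of listener names.
        self.routes = defaultdict(list)
        # Components whose UI would need to be re-rendered by a frontend.
        self.changed = []

    def connect(self, listener, source, message):
        self.routes[(source, message)].append(listener)

    def send(self, source, message, payload, session):
        for listener in self.routes[(source, message)]:
            # Each listener updates its own slice of the session state and is
            # marked as changed, mirroring the workflow described for Figure 3.
            session.setdefault(listener, {})[message] = payload
            if listener not in self.changed:
                self.changed.append(listener)


def build_bus(config_text):
    """Read component sections and register their 'connect' entries on a bus."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    bus = MessageBus()
    for section in parser.sections():
        name = parser[section]["name"]
        # Assumption: 'connect' is optional and holds source:message pairs.
        for entry in parser[section].get("connect", fallback="").split():
            source, message = entry.split(":", 1)
            bus.connect(name, source, message)
    return bus


bus = build_bus(CONFIG)
session = {}
bus.send("search_box", "query", "medieval maps", session)
print(bus.changed)  # ['search_results']
print(session)      # {'search_results': {'query': 'medieval maps'}}
```

In this sketch the routing table is derived entirely from the configuration, so swapping in a different results component would only require changing the "handler" and "connect" entries. The coupling problems reported in Section 3.2 arise in exactly the parts that such a toy example leaves out, such as rendering and shared service functionality.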