STELLA: Towards a Framework for the Reproducibility of Online Search Experiments

Timo Breuer, Philipp Schaer
firstname.lastname@th-koeln.de
Technische Hochschule Köln, Cologne, Germany

Narges Tavakolpoursaleh, Johann Schaible
firstname.lastname@gesis.org
GESIS, Cologne, Germany

Benjamin Wolff, Bernd Müller
{wolff,muellerb}@zbmed.de
ZB MED - Information Centre for Life Sciences, Cologne, Germany

ABSTRACT
Reproducibility is a central aspect of offline as well as online evaluations, to validate the results of different teams and in different experimental setups. However, it is often difficult or not even possible to reproduce an online evaluation, as only a few data providers give access to their systems, and if they do, access is limited in time and typically only granted during an official challenge. To alleviate this situation, we propose STELLA: a living lab infrastructure with consistent access to a data provider's system, which can be used to train and evaluate search and recommender algorithms. In this position paper, we align STELLA's architecture to the PRIMAD model and its six different components specifying reproducibility in online evaluations and illustrate two use cases with two academic search systems.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 INTRODUCTION
Reproducibility¹ is still an open issue in TREC-style IR offline evaluations. Hanbury et al. [3] named this setting the Data-to-Algorithms paradigm, where participants submit the output of their software when run on a pre-published test collection. Recently, the IR community extended this idea to Evaluation-as-a-Service (EaaS), which adopts the Algorithms-to-Data paradigm, e.g., in the form of living labs [1]. In living labs, relevance assessments are produced by actual users and reflect their satisfaction with the search system, in contrast to the explicit relevance assessments in TREC-style test collections. To obtain this satisfaction signal, relevance is measured by observing user behavior, e.g., navigation, click-through rates, and other metrics.

In theory, EaaS might enhance reproducibility by keeping the data, algorithms, and results in a central infrastructure that is accessible through a standard API and allows for sharing open-source software components. However, the live environment for evaluating experimental systems typically has the consequence that the results are not reproducible, since the users' subjective impression of relevance is highly inconsistent. This makes reproducibility of online experiments more complicated than that of their offline counterparts. To what extent online experiments in living labs can be made reproducible remains a central question. Although the user interactions and all rankings generated by the systems can be stored and used for subsequent calculations, there are no clear guidelines as to how the logged interaction data can contribute to a valid and reproducible evaluation result. The major problem is that the recorded interactions are authentic only for the particular situation in which they were recorded.

Conceptually, Ferro et al. [2] introduce the PRIMAD model, which specifies reproducibility along six components: Platform, Research goal, Implementation, Method, Actor, and Data. PRIMAD is a conceptual framework for the assessment of reproducibility along these components. Although PRIMAD explicitly discusses the application to both offline and online experiments, we see a gap when applying it to living labs that involve the interaction of real users and real-time online platforms. The suggestion of thinking of users as "data generators" undervalues their role within the evaluations.

To overcome these issues, we introduce STELLA (InfraSTrucTurEs for Living LAbs), a living lab infrastructure. STELLA allows capturing document data, algorithms, and user interactions in an online evaluation setup that is based on Docker containers. We align the different components of the infrastructure to the PRIMAD model and discuss how well they match the model. We see a particular need to pay attention to the actor and data components. These components represent the human factors, like users' interactions, that affect the outcomes of online experiments, and not only the experimenters' perspective on the experiment.

In the following, we present the design of the STELLA living lab Docker infrastructure (cf. Section 2). We align STELLA's components to the dimensions of the PRIMAD model and illustrate a use case with two academic search systems (cf. Section 3). Finally, we discuss the benefits and limitations of our work and conclude the paper (cf. Sections 4 and 5).

¹ One can differentiate between repeatability (same team, same experimental setup), replicability (different team, same setup), and reproducibility (different team, different setup). For the sake of simplicity, we use the term reproducibility to refer to all of these three types.

2 ONLINE EVALUATION WITH STELLA
The STELLA infrastructure allows researchers to evaluate search and recommender algorithms in an online environment, i.e., within a real-world system with real users. When using STELLA, the researchers' primary goal is to introduce ranking models for search results and recommendations that outperform the existing baseline. In the following, we describe STELLA's workflow as well as its technical infrastructure. We adhere to the wording of TREC OpenSearch [5] with regard to several components of the living lab infrastructure. Providers of search engines and corresponding web interfaces are referred to as sites. Research groups that contribute experimental retrieval and recommender systems are referred to as participants.








Figure 1: Living lab infrastructure based on Docker. Participants contribute their experimental retrieval and recommender systems by uploading Dockerfiles and source code. The STELLA server composes a multi-container application out of single experimental systems. This application is deployed locally at the sites. Queries are forwarded to this application, which delivers results from the experimental systems in return. User feedback is sent to the STELLA server and stored within the Docker container.
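To make Figure 1 more concrete, the following is a minimal sketch of how a participant's containerized system might answer queries forwarded by the site's multi-container application. The endpoint name, port, and response format are illustrative assumptions and not part of STELLA's published interface; the toy index merely stands in for an arbitrary retrieval model.

# Hypothetical sketch of a participant's experimental system inside its
# Docker container. Endpoint name and payload format are assumptions,
# not the official STELLA interface.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Toy in-memory index standing in for the participant's retrieval model.
INDEX = {
    "heart disease": ["doc42", "doc7", "doc105"],
    "survey data":   ["rec3", "rec18", "rec1"],
}

@app.route("/ranking", methods=["GET"])
def ranking():
    """Return a ranked list of item identifiers for a forwarded site query."""
    query = request.args.get("q", "").lower()
    hits = INDEX.get(query, [])
    return jsonify({"query": query, "itemlist": hits})

if __name__ == "__main__":
    # The site's multi-container application would address this service by
    # its container name; the port is an arbitrary choice for this sketch.
    app.run(host="0.0.0.0", port=5000)

A site-side component could then query such endpoints for each participant container and pass the returned lists on to the interleaving or A/B-testing step described below.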


STELLA's main component is the central living lab API. It connects data and content providers (sites) with researchers in the fields of information retrieval and recommender systems (participants). When linked to the API, the sites provide data that can be used by participants for implementing search and recommendation algorithms, e.g., metadata about items, users, session logs, and click paths. Participants use the API in order to obtain this data from the sites and enhance their experimental system, e.g., by delivering personalized search results. Subsequently, this experimental system is integrated into the living lab infrastructure and made accessible to site users. Within the site, the submitted system is deployed automatically. The system is used within an A/B-testing or interleaving scenario (a sketch is given below), such that the system's results are presented to the real users of the site. The users' actions, e.g., clicking or selecting a ranked item, which determine the click-through rate, are recorded and sent back to the central living lab API. There, this data is aggregated over time in order to produce reliable results that determine the system's usefulness for that specific site.

To support reproducibility, we employ Docker in order to keep as many components as possible the way they were in the first experiment. This includes reusing the utilized software, the tools used to develop the method, and, in a user-oriented study, the usage data generated by the users. Within the framework, experimental retrieval and recommender systems are distributed with the help of Docker images. Sites deploy local multi-container applications running these images. Participants contribute their experimental systems by extending prepared Dockerfiles and code templates. The underlying infrastructure will assure the synchronization and comparability of experimental retrieval and recommender systems across different sites. See Figure 1 for a schematic visualization of the framework.

Sites deploy a multi-container environment that contains the experimental systems. Queries by site users are forwarded to these systems. A scheduling mechanism assures an even distribution of queries among the participants' systems (see the scheduling sketch below). The multi-container environment logs user feedback and forwards this usage data to the experimental systems. Likewise, logged user feedback is sent to the central STELLA server, where it is filed and overall statistics can be calculated.

The main functionalities of the STELLA server are the administration and synchronization of infrastructural components. After the submission of extended template files, the STELLA server will initiate the build process of Docker images. The build process is triggered by new contributions or changes within the experimental systems. Participants and sites will be able to self-administrate configurations and to inspect the evaluation outcomes via a dashboard service.

In previous living lab campaigns [5], sites had to implement a REST-API that redirected queries to the central living lab server and an interleaving mechanism to generate the final result list. In our case, sites can rely entirely on the Docker applications. The site's original search system can – but does not have to – be integrated as an additional container. Optionally, the REST-API can be reached in the conventional way of previous living lab campaigns [1]. The Docker application can also be deployed on the STELLA server, from where it can be reached over the internet. This may be beneficial for those sites which want to participate but do not have sufficient hardware capacities for the Docker environment.
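The following sketch illustrates one common way such an interleaving step and the subsequent click attribution could work, using team-draft interleaving. It is a generic technique from the literature, not necessarily the exact mechanism STELLA will implement, and the system labels are made up for illustration.

import random

def team_draft_interleave(baseline, experimental):
    """Interleave two rankings; remember which system contributed each item."""
    interleaved, origin = [], {}
    i = j = 0
    while i < len(baseline) or j < len(experimental):
        # Randomly decide which "team" picks first in this round to avoid position bias.
        pick_baseline = random.random() < 0.5
        for system in ([baseline, experimental] if pick_baseline
                       else [experimental, baseline]):
            idx = i if system is baseline else j
            # Skip items already contributed by the other system.
            while idx < len(system) and system[idx] in origin:
                idx += 1
            if idx < len(system):
                item = system[idx]
                interleaved.append(item)
                origin[item] = "baseline" if system is baseline else "experimental"
            if system is baseline:
                i = idx + 1
            else:
                j = idx + 1
    return interleaved, origin

def credit_clicks(origin, clicked_items):
    """Aggregate click credit per system, e.g., over one logged session."""
    credit = {"baseline": 0, "experimental": 0}
    for item in clicked_items:
        if item in origin:
            credit[origin[item]] += 1
    return credit

Aggregated over many sessions, such per-system click credit approximates the comparative usefulness that the STELLA server could report back to sites and participants.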






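Similarly, the even distribution of queries across experimental systems could be realized by a simple per-session round-robin assignment. The sketch below is one possible reading of the scheduling mechanism described above, with made-up system identifiers; it is not the actual STELLA implementation.

from itertools import cycle

class RoundRobinScheduler:
    """Assign incoming sessions to experimental systems in turn,
    so that each participant receives a comparable share of traffic."""

    def __init__(self, system_ids):
        self._systems = cycle(system_ids)
        self._assignments = {}

    def assign(self, session_id):
        # Keep a session pinned to one system so its feedback stays consistent.
        if session_id not in self._assignments:
            self._assignments[session_id] = next(self._systems)
        return self._assignments[session_id]

scheduler = RoundRobinScheduler(["baseline", "participant_a", "participant_b"])
print(scheduler.assign("session-1"))  # -> baseline
print(scheduler.assign("session-2"))  # -> participant_a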
Participants develop their systems in local environments and submit them upon completion. They contribute their systems by extending Dockerfile templates and providing the necessary source code. The submission of systems can be realized with already existing infrastructures like online version control services in combination with the Docker Hub.

3 RELATION TO PRIMAD
The PRIMAD model offers orientation as to what extent reproducibility in IR experiments can be achieved. Table 1 provides an overview of the PRIMAD model and the corresponding components in the STELLA infrastructure. The platform is provided by our framework, which mainly relies on Docker and its containerization technology. Increasing retrieval effectiveness is the primary research goal. Sites, e.g., a digital library, and participants, i.e., the external researchers, benefit from cooperating with each other using STELLA. Sites, for instance, may be interested in finding adequate retrieval and recommender algorithms, whereas participants get access to data from real-world user interactions. Both the implementation and the method are chosen by the participants. In ad-hoc retrieval experiments, solely the researcher would be considered an actor. In a living lab scenario, this group is extended by the site users who affect the outcome of experiments. Finally, the data consists of logged user interactions as well as domain-specific text collections.

PRIMAD variable    Instance
Platform           Docker-based framework
Research goal      Retrieval effectiveness
Implementation     Chosen by participants
Method             Chosen by participants
Actor              Participants, site users
Data               Domain/site specific

Table 1: Alignment of STELLA components to the dimensions of the PRIMAD model

In the following, we present use cases of how the STELLA infrastructure is concretely aligned to the PRIMAD components. Two early adopters from the domain of academic search implement our framework such that they are part of the STELLA infrastructure: LIVIVO² [7] and the GESIS-wide Search³ [4]. LIVIVO is an interdisciplinary search engine and contains metadata on scientific literature for medicine, health, nutrition, and environmental and agricultural sciences. The GESIS-wide Search is a scholarly search system where one can find information about social science research data, instruments and scales, as well as open access publications.

² https://www.livivo.de, Accessed June 2019
³ https://www.gesis.org/en/home/, Accessed June 2019

Platform: In STELLA, the platform is implemented via a Docker-based framework as shown in Figure 1. It connects the sites with the participants and assures (i) the sites' flow of data for computing IR models, (ii) the deployment of the participants' IR models on the sites, and (iii) the assessment of the deployed IR models' usefulness, e.g., via accumulated click-through rates. Both sites are configured in a way that they can interact with the central STELLA server, i.e., provide data on the sites' content and users as well as their queries and interactions.

Research Goal: The research goal is the retrieval of scientific datasets and literature which satisfy a user's information need. This includes retrieval using string-based queries as well as recommending further information using item-based queries. In LIVIVO, the research goal is finding appropriate domain-specific scientific literature in medicine, health, nutrition, and environmental and agricultural sciences. Besides scientific publications in the social sciences, the GESIS-wide Search offers to search for research data, scales, and other information. The research goal also includes finding appropriate cross-item recommendations, such as recommending research data based on a currently viewed publication.

Method and Implementation: The participant chooses both the method and the implementation. With the help of the Docker-based framework, participants are free to choose which retrieval methods to use as well as how to implement them, as long as the interface guidelines between the sites and the Docker images are respected.

Actor: The actors in online evaluations are (i) the potential site users and (ii) the participants that develop an experimental system to be evaluated on a site. In both LIVIVO and the GESIS-wide Search, the site users range from students to scientists as well as librarians. Specifically, the users of both sites differ by their domain of interest (medicine and health vs. social sciences), their experience, and the granularity of their information need, which can be trivial but also very complex. In any case, the site users' interactions are captured by STELLA, which allows observing differences in their behavior. Participants submit their experimental systems using the Docker-based framework and receive the evaluation results from STELLA after some evaluation period. This way, they can adapt their systems and re-submit them whenever possible.

Data: The data comprises all information the sites are willing to provide to participants for developing their systems. It usually includes some structured data, such as database records containing metadata on research data and publications, as well as unstructured data like full texts or abstracts. LIVIVO uses the ZB MED Knowledge Environment (ZB MED KE) as a data layer that semantically enriches (by annotating the metadata with concepts from life science ontologies) the textual content of metadata from about 50 different literature resources with a total of about 55 million citations. Additionally, LIVIVO allows users to register and maintain a personalized profile, including a watch list. The GESIS-wide Search integrates metadata from different portals into a central search index that uses a specific metadata schema based on Dublin Core. Additionally, each data record is enriched with explicit links between different information items, like links between a publication and a dataset; such a link specifies that the publication uses that particular dataset. As there is no option for users to register, the GESIS-wide Search provides users' session logs and click paths but no user profiles. The entire document corpora of LIVIVO and the GESIS-wide Search are made available to participants.
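To give a concrete impression of the data component, the following sketch shows what a site-provided metadata record and a logged interaction event might look like. The field names (loosely following Dublin Core conventions) and values are illustrative assumptions, not the actual schemas of LIVIVO or the GESIS-wide Search.

# Hypothetical examples of the two kinds of data a site could expose to
# participants: structured item metadata and logged user interactions.
# Keys and values are illustrative, not the sites' actual schemas.
publication_record = {
    "id": "pub-001",
    "dc:title": "Survey on Health Behaviour",
    "dc:creator": ["Doe, Jane"],
    "dc:subject": ["social sciences", "health"],
    "dc:type": "publication",
    "links": [{"relation": "uses-dataset", "target": "data-042"}],
}

interaction_event = {
    "session_id": "session-1",
    "query": "health behaviour survey",
    "clicked_item": "pub-001",
    "rank": 2,
    "timestamp": "2019-06-15T10:23:41Z",
}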






In this living lab scenario, P, R, I, and M would be conserved within the Docker container and the STELLA infrastructure. A → A′ and D → D′ would remain as components that change with another system and another point in time. By recording the interaction and click data, we can try to preserve parts of A and D.

4 DISCUSSION
Compared to previous living lab campaigns, the Docker-based infrastructure results in several advances towards enhancing reproducibility in online evaluations. The most important advances are:

Transparency: Specifying retrieval and recommender systems, as well as their specific requirements, in a standardized way may contribute to the transparency of these systems.

No limitation to head queries: Pre-computed rankings were limited to the top-k queries. Even though this is an elegant solution, it may influence evaluation outcomes. Using locally deployed Docker applications, there is no need for this restriction anymore. Rankings and recommendations can be determined based on the complete corpus.

Avoidance of network latencies: Network latencies after the retrieval of rankings or recommendations might affect user behavior. Also, implementing workarounds like timeouts resulted in additional implementation effort for sites. By deploying local Docker images, these latencies are eliminated.

Lower entrance barrier for participation: Participants can contribute already existing systems to the STELLA infrastructure by simply dockerizing them. By specifying the required components and parameters, the deployment procedure is less error-prone. Researchers can use software and programming languages of their choice. Sites solely need to implement the REST-API and set up the local instance of the Docker application. By letting Docker deploy the application, human errors are avoided and efforts are reduced.

The benefits mentioned above come at a cost. Especially the following limitations have to be considered:

Central server: The proposed infrastructure relies on a central server. This vulnerability might be a target for malicious intents and is generally a single point of failure.

Hardware limitations: The experimental systems will be deployed at the sites. Available hardware capacities may vary across different sites. Furthermore, the hardware requirements of experimental systems should be in line with the available resources of the sites. For instance, participants contributing machine/deep learning systems should not outsource training routines to external servers. In the first place, we will focus on lightweight experiments in order to keep the entrance barrier for participation low.

User interaction data: While all interaction data within the STELLA infrastructure can be logged, a setup for its reuse has not been outlined yet and must remain future work.

Recording usage and interaction data for later reuse is not novel. An example of interaction data recorded to allow later verification and simulation was the NewsREEL lab at CLEF 2017 [6]. In NewsREEL Replay, participants had access to a dataset comprising a collection of log messages analogous to NewsREEL Live. Analogous to the online evaluation, participants had to find the configuration with the highest click-through rate. By using the recorded data, participants experimented on a reasonable trade-off between prediction accuracy and response time.

5 CONCLUSION
We present a living lab platform for the evaluation of online experiments and a Docker-based infrastructure which bridges the gap between experimental systems and real user interactions. With respect to the PRIMAD model, it is possible to assess reproducibility and the corresponding components in our infrastructure proposal. Participants are free to choose which method and implementation to use and can rely on adequate deployment and environments. User interaction data will be logged and is accessible for optimizing systems and future applications.

REFERENCES
[1] Krisztian Balog, Anne Schuth, Peter Dekker, Philipp Schaer, Narges Tavakolpoursaleh, and Po-Yu Chuang. 2016. Overview of the TREC 2016 Open Search Track. In Proceedings of the Twenty-Fifth Text REtrieval Conference (TREC 2016). NIST.
[2] Nicola Ferro, Norbert Fuhr, Kalervo Järvelin, Noriko Kando, Matthias Lippold, and Justin Zobel. 2016. Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science". SIGIR Forum 50, 1 (June 2016), 68–82. https://doi.org/10.1145/2964797.2964808
[3] Allan Hanbury, Henning Müller, Krisztian Balog, Torben Brodt, Gordon V. Cormack, Ivan Eggel, Tim Gollub, Frank Hopfgartner, Jayashree Kalpathy-Cramer, Noriko Kando, Anastasia Krithara, Jimmy Lin, Simon Mercer, and Martin Potthast. 2015. Evaluation-as-a-Service: Overview and Outlook. ArXiv e-prints (Dec. 2015). http://arxiv.org/abs/1512.07454
[4] Daniel Hienert, Dagmar Kern, Katarina Boland, Benjamin Zapilko, and Peter Mutschke. 2019. A Digital Library for Research Data and Related Information in the Social Sciences. In ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) (forthcoming).
[5] Rolf Jagerman, Krisztian Balog, and Maarten de Rijke. 2018. OpenSearch: Lessons Learned from an Online Evaluation Campaign. J. Data and Information Quality 10 (2018), 13:1–13:15.
[6] Benjamin Kille, Andreas Lommatzsch, Frank Hopfgartner, Martha Larson, and Torben Brodt. 2017. CLEF 2017 NewsREEL Overview: Offline and Online Evaluation of Stream-based News Recommender Systems. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017 (CEUR Workshop Proceedings), Linda Cappellato, Nicola Ferro, Lorraine Goeuriot, and Thomas Mandl (Eds.), Vol. 1866. CEUR-WS.org. http://ceur-ws.org/Vol-1866/invited_paper_17.pdf
[7] Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, and Thomas Gübitz. 2017. LIVIVO – the Vertical Search Engine for Life Sciences. Datenbank-Spektrum 17, 1 (2017), 29–34.



