<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>OSIRRC 2019 co-located with SIGIR 2019</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>STELLA: Towards a Framework for the Reproducibility of Online Search Experiments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timo Breuer</string-name>
          <email>firstname.lastname@th-koeln.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Narges Tavakolpoursaleh</string-name>
          <email>firstname.lastname@gesis.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Wolf</string-name>
          <email>wolf@zbmed.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernd Müller</string-name>
          <email>muellerb@zbmed.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johann Schaible</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Schaer</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ZB MED - Information Centre for Life Sciences, Cologne, Germany</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GESIS, Cologne, Germany</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technische Hochschule Köln, Cologne, Germany</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>8</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>Reproducibility is a central aspect of offline as well as online evaluations: it is needed to validate the results of different teams and of different experimental setups. However, it is often difficult or even impossible to reproduce an online evaluation, as only a few data providers give access to their systems, and if they do, access is limited in time and typically only available during an official challenge. To alleviate this situation, we propose STELLA: a living lab infrastructure with consistent access to a data provider's system, which can be used to train and evaluate search and recommender algorithms. In this position paper, we align STELLA's architecture with the PRIMAD model and its six components specifying reproducibility in online evaluations, and we illustrate two use cases with two academic search systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Reproducibility<fn id="fn1"><p>One can differentiate between repeatability (same team, same experimental setup), replicability (different team, same setup), and reproducibility (different team, different setup). For the sake of simplicity, we use the term reproducibility to refer to all three types.</p></fn> is still an open issue in TREC-style offline IR
evaluations. Hanbury et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] named this setting the Data-to-Algorithms
paradigm, in which participants submit the output of their software
when run on a pre-published test collection. Recently, the
IR community extended this idea towards
Evaluation-as-a-Service (EaaS), which adopts the Algorithms-to-Data paradigm, e.g., in
the form of living labs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In living labs, relevance assessments are
produced by actual users and reflect their satisfaction
with the search system, in contrast to the explicit relevance
assessments in TREC-style test collections. To obtain this satisfaction signal,
relevance is measured by observing user behavior, e.g., navigation,
click-through rates, and other metrics.
      </p>
      <p>In theory, EaaS might enhance reproducibility by keeping the
data, algorithms, and results in a central infrastructure that is
accessible through a standard API and allows for sharing open-source
software components. However, the live environment for
evaluating experimental systems typically means that the
results are not reproducible, since the users' subjective impression
of relevance is highly variable. This makes the reproducibility of online
experiments more complicated than that of their offline counterparts. To
what extent online experiments in living labs can be made
reproducible remains a central question. Although the user interactions
and all rankings generated by the systems can be stored and used
for subsequent calculations, there are no clear guidelines as to how
the logged interaction data can contribute to a valid and
reproducible evaluation result. The major problem is that the recorded
interactions are authentic only for the particular situation in which
they were recorded.</p>
      <p>
        Conceptually, Ferro et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced the PRIMAD model,
which specifies reproducibility along several components: Platform,
Research goal, Implementation, Method, Actor, and Data. PRIMAD
is a conceptual framework for assessing reproducibility
along these components. Although PRIMAD explicitly
discusses the application to both offline and online experiments, we
see a gap when we apply it to living labs that involve the interaction
of real users and real-time online platforms. The suggestion of
thinking of users as "data generators" undervalues their role within
the evaluations.
      </p>
      <p>To overcome these issues, we introduce STELLA
(InfraSTructurEs for Living LAbs), a living lab infrastructure. STELLA allows
capturing document data, algorithms, and user interactions in an
online evaluation setup that is based on Docker containers. We
align the different components of the infrastructure with the PRIMAD
model and discuss how well they match it. We see a particular
need to pay attention to the actor and data components: these
components represent human factors, such as users' interactions,
that affect the outcomes of online experiments, and not only the
experimenters' perspective on the experiment.</p>
      <p>In the following, we present the design of the STELLA
living lab Docker infrastructure (cf. Section 2). We align STELLA's
components with the dimensions of the PRIMAD model and illustrate
use cases with two academic search systems (cf. Section 3). Finally,
we discuss the benefits and limitations of our work and conclude
the paper (cf. Section 4).</p>
    </sec>
    <sec id="sec-2">
      <title>ONLINE EVALUATION WITH STELLA</title>
      <p>
        The STELLA infrastructure allows researchers to evaluate search
and recommender algorithms in an online environment, i.e., within
a real-world system with real users. When using STELLA, the
researchers’ primary goal is to introduce ranking models for search
results and recommendations that outperform the existing
baseline. In the following, we describe STELLA’s workflow as well as
its technical infrastructure. We adhere to the wording of TREC
OpenSearch [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with regard to several components of the living
lab infrastructure. Providers of search engines and
corresponding web interfaces are referred to as sites. Research groups that
contribute experimental retrieval and recommender systems are
referred to as participants.
      </p>
      <p>STELLA's main component is the central living lab API. It
connects data and content providers (sites) with researchers in the
fields of information retrieval and recommender systems
(participants). When linked to the API, the sites provide data that can be
used by participants for implementing search and recommendation
algorithms, e.g., metadata about items, users, session logs, and click
paths. Participants use the API in order to obtain this data from
the sites and enhance their experimental system, e.g., by delivering
personalized search results. Subsequently, this experimental system
is integrated into the living lab infrastructure and made accessible
to site users. Within the site, the submitted system is deployed
automatically. The system is used within an A/B testing or
interleaving scenario, such that the system's results are presented to the
real users of the site. The users' actions, e.g., clicking or selecting a
ranked item, which determine the click-through rate, are recorded
and sent back to the central living lab API. There, this data is
aggregated over time in order to produce reliable results and to determine
the system's usefulness for that specific site.</p>
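      <p>As an illustration of how such an aggregation could look, the sketch below computes per-system click-through rates from logged feedback records; the record fields and the aggregation routine are our own assumptions and not part of STELLA's specified API.</p>
      <preformat>
# Minimal sketch of click-through-rate aggregation on the central server.
# The record layout ("system", "impressions", "clicks") is an assumption
# for illustration, not STELLA's actual message format.
from collections import defaultdict


def aggregate_ctr(feedback_log):
    """Aggregate logged impressions and clicks into a CTR per experimental system."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for record in feedback_log:
        impressions[record["system"]] += record["impressions"]
        clicks[record["system"]] += record["clicks"]
    return {
        system: clicks[system] / impressions[system]
        for system in impressions
        if impressions[system]  # skip systems without impressions
    }


if __name__ == "__main__":
    log = [
        {"system": "participant-a", "impressions": 120, "clicks": 18},
        {"system": "participant-b", "impressions": 115, "clicks": 9},
        {"system": "participant-a", "impressions": 80, "clicks": 11},
    ]
    print(aggregate_ctr(log))
      </preformat>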
      <p>To support reproducibility, we employ Docker technology in
order to keep as many components as possible the way they were
in the first experiment. This includes reusing the utilized software,
the tools used to develop the method, and, in a user-oriented
study, the usage data generated by the users. Within the framework,
experimental retrieval and recommender systems are distributed
with the help of Docker images. Sites deploy local multi-container
applications running these images. Participants contribute their
experimental systems by extending prepared Dockerfiles and code
templates. The underlying infrastructure will assure the
synchronization and comparability of experimental retrieval and
recommender systems across different sites. See Figure 1 for a schematic
visualization of the framework.</p>
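      <p>As an example of what such a code template could look like, the following sketch wraps a participant's ranking function in a small web service that the prepared Dockerfile would package; the endpoint name, payload layout, and use of Flask are our own illustrative assumptions rather than STELLA's actual template.</p>
      <preformat>
# Hypothetical participant code template: a minimal ranking service that the
# prepared Dockerfile would package and the site would query. The endpoint and
# payload are illustrative assumptions, not STELLA's specified interface.
from flask import Flask, jsonify, request

app = Flask(__name__)


def rank(query, candidates):
    """Participants replace this baseline with their experimental ranking model."""
    # Naive baseline: order candidates by the number of query terms in the title.
    terms = query.lower().split()

    def score(doc):
        title = doc.get("title", "").lower()
        return sum(term in title for term in terms)

    return sorted(candidates, key=score, reverse=True)


@app.route("/ranking", methods=["POST"])
def ranking():
    payload = request.get_json(force=True)
    ranked = rank(payload["query"], payload["candidates"])
    return jsonify({"ranking": [doc["id"] for doc in ranked]})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
      </preformat>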
      <p>Sites deploy a multi-container environment that contains the
experimental systems. Queries by site users are forwarded to these
systems. A scheduling mechanism assures an even distribution of
queries among the participants' systems. The multi-container
environment logs user feedback and forwards this usage data to the
experimental systems. Likewise, logged user feedback is sent to the
central STELLA server, where it is stored and where overall statistics can
be calculated.</p>
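      <p>A scheduling mechanism of this kind could, for instance, be realized as a simple round-robin assignment of incoming queries to the deployed experimental systems; the sketch below is our own illustration of the idea and not STELLA's implementation.</p>
      <preformat>
# Illustrative round-robin scheduler: distributes incoming user queries evenly
# across the experimental systems running in the site's multi-container setup.
from itertools import cycle


class RoundRobinScheduler:
    def __init__(self, system_ids):
        self._systems = cycle(system_ids)

    def assign(self, query):
        """Return the experimental system that should answer this query."""
        return next(self._systems)


if __name__ == "__main__":
    scheduler = RoundRobinScheduler(["baseline", "participant-a", "participant-b"])
    for query in ["heart attack", "survey data", "crispr", "panel study"]:
        print(query, "handled by", scheduler.assign(query))
      </preformat>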
      <p>The main functionalities of the STELLA server are the
administration and synchronization of infrastructural components. After
the submission of extended template files, the STELLA server will
initiate the build process of the Docker images. The build process is
triggered by new contributions or changes within the
experimental systems. Participants and sites will be able to self-administer their
configurations and gain insights into the evaluation outcomes by
visiting a dashboard service.</p>
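      <p>To illustrate how such a build trigger could work, the following sketch rebuilds a participant's Docker image after a new submission has been pulled; the repository layout, tag naming, and webhook wiring are assumptions made for illustration.</p>
      <preformat>
# Illustrative build step on the STELLA server: after a participant's new
# submission has been pulled from version control, rebuild the Docker image.
# Tag naming and repository layout are assumptions made for illustration.
import subprocess


def build_participant_image(participant_id, checkout_path):
    """Rebuild the participant's image from the checked-out submission."""
    tag = "stella/" + participant_id + ":latest"
    subprocess.run(["docker", "build", "-t", tag, checkout_path], check=True)
    return tag


if __name__ == "__main__":
    print(build_participant_image("participant-a", "./submissions/participant-a"))
      </preformat>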
      <p>
        In previous living lab campaigns [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], sites had to implement a
REST-API that redirected queries to the central living lab server and
an interleaving mechanism to generate the final result list. In our
case, sites can entirely rely on the Docker applications. The site’s
original search system can – but does not have to – be integrated as
an additional container. Optionally, the REST-API can be reached
in the conventional way of previous living lab campaigns [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The
Docker application can also be deployed on the STELLA server, from
where it can be reached over the internet. This may be beneficial
for those sites that want to participate but do not have sufficient
hardware capacities for the Docker environment.
      </p>
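      <p>For sites that opt for interleaving rather than A/B testing, a standard choice would be team-draft interleaving. The sketch below, a generic illustration and not a STELLA component, shows how a baseline and an experimental ranking could be merged and how clicks could be credited to either system.</p>
      <preformat>
# Generic team-draft interleaving sketch: merges the baseline ranking and an
# experimental ranking and remembers which system contributed each result so
# that clicks can be credited. Not part of the STELLA codebase.
import random


def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Merge two rankings; return the interleaved list and each item's team."""
    interleaved, team_of = [], {}
    picks_a = picks_b = 0
    pool_a, pool_b = list(ranking_a), list(ranking_b)
    while pool_a and pool_b and len(interleaved) != length:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        a_turn = picks_b > picks_a or (picks_a == picks_b and random.random() > 0.5)
        pool = pool_a if a_turn else pool_b
        doc = pool.pop(0)
        if doc not in team_of:
            team_of[doc] = "A" if a_turn else "B"
            interleaved.append(doc)
            if a_turn:
                picks_a += 1
            else:
                picks_b += 1
    return interleaved, team_of


def credit_clicks(clicked_docs, team_of):
    """Count clicks per system to decide which ranking the users preferred."""
    credit = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            credit[team_of[doc]] += 1
    return credit
      </preformat>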
      <p>Participants develop their systems in local environments and
submit them upon completion. They contribute their systems
by extending Dockerfile templates and providing the necessary
source code. The submission of systems can be realized with
existing infrastructures such as online version control services
in combination with Docker Hub.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATION TO PRIMAD</title>
      <p>The PRIMAD model offers orientation regarding the extent to which
reproducibility in IR experiments can be achieved. Table 1 provides an overview
of the PRIMAD model and the corresponding components in the STELLA
infrastructure. The platform is provided by our framework, which
mainly relies on Docker and its containerization technology.
Increasing retrieval effectiveness is the primary research goal. Sites,
e.g., a digital library, and participants, i.e., the external researchers,
benefit from the cooperation with each other using STELLA. Sites,
for instance, may be interested in finding adequate retrieval and
recommender algorithms, whereas participants get access to data
from real-world user interactions. Both the implementation and
the method are chosen by the participants. In ad-hoc retrieval
experiments, solely the researcher would be considered an actor. In a
living lab scenario, this group is extended by the site users, who affect
the outcome of experiments. Finally, the data consists of logged
user interactions as well as domain-specific text collections.</p>
      <p>
        In the following, we present use cases of how the STELLA
infrastructure is concretely aligned with the PRIMAD components. Two
early adopters from the domain of academic search implement
our framework such that they are part of the STELLA
infrastructure: LIVIVO<fn id="fn2"><p>https://www.livivo.de, Accessed June 2019</p></fn> [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the GESIS-wide Search<fn id="fn3"><p>https://www.gesis.org/en/home/, Accessed June 2019</p></fn> [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. LIVIVO is an
interdisciplinary search engine and contains metadata on
scientific literature for medicine, health, nutrition, and environmental
and agricultural sciences. The GESIS-wide Search is a scholarly
search system where one can find information about social
science research data, instruments and scales, as well as open access
publications.
      </p>
      <p>Platform: In STELLA, the platform is implemented via a
Docker-based framework as shown in Figure 1. It connects the sites
with the participants and assures (i) the sites' flow of data
for computing IR models, (ii) the deployment of the participants'
IR models on the sites, and (iii) obtaining the usefulness of the
deployed IR model, e.g., via accumulated click-through rates.
Both sites are configured in a way that they can interact with
the central STELLA server, i.e., provide data on the sites'
content and users as well as their queries and interactions.</p>
      <p>Research Goal: The research goal is the retrieval of scientific
datasets and literature which satisfy a user's information
need. This includes retrieval using string-based queries as
well as recommending further information using item-based
queries. In LIVIVO, the research goal is finding appropriate
domain-specific scientific literature in medicine, health,
nutrition, and environmental and agricultural sciences. Besides
scientific publications in the social sciences, the GESIS-wide
Search offers search for research data, scales, and other
information. The research goal also includes finding
appropriate cross-item recommendations, such as recommending research
data based on a currently viewed publication.</p>
      <p>Method and Implementation: The participant chooses both
the method and the implementation. With the help of the
Docker-based framework, participants are free to choose which
retrieval methods to use as well as how to implement them,
as long as the interface guidelines between the sites
and the Docker images are respected.</p>
      <p>Actor: The actors in online evaluations are (i) the potential
site users and (ii) the participants that develop an
experimental system to be evaluated on a site. In both LIVIVO and
the GESIS-wide Search, the site users range from students to
scientists and librarians. Specifically, the users of both
sites differ in their domain of interest (medicine and health
vs. social sciences), their experience, and the granularity of
their information need, which can be trivial but also very
complex. In any case, the site users' interactions are captured
by STELLA, which allows observing differences in their
behavior. Participants submit their experimental systems using
the Docker-based framework and receive the evaluation
results from STELLA after some evaluation period. This way,
they can adapt their systems and re-submit them whenever
possible.</p>
      <p>Data: The data comprises all information the sites are willing
to provide to participants for developing their systems. It
usually includes some structured data, such as database records
containing metadata on research data and publications, as
well as unstructured data like full texts or abstracts. LIVIVO
uses the ZB MED Knowledge Environment (ZB MED KE)
as a data layer that semantically enriches (by annotating
the metadata with concepts from life sciences ontologies)
the textual content of metadata from about 50 different
literature resources with a total of about 55 million citations.
Additionally, LIVIVO allows users to register and maintain a
personalized profile, including a watch list. The GESIS-wide
Search integrates metadata from different portals into a
central search index that uses a specific metadata schema based
on Dublin Core. Additionally, each data record is enriched
with explicit links between different information items, such as
links between a publication and a dataset. Such a link specifies
that the publication uses that particular dataset. As there is
no option for users to register, the GESIS-wide Search provides
users' session logs and click paths but no user profiles. The
entire document corpora of LIVIVO and the GESIS-wide Search
are made available to participants.</p>
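      <p>To make this kind of enriched record concrete, the following sketch shows a Dublin-Core-style publication record with an explicit link to the dataset it uses; the field names are our own illustration and not the actual index layout.</p>
      <preformat>
# Illustrative Dublin-Core-style records: a publication with an explicit link
# to the research dataset it uses. Field names are assumptions, not the
# actual GESIS-wide Search index layout.
publication_record = {
    "dc:identifier": "pub-0001",
    "dc:type": "publication",
    "dc:title": "Example study on survey participation",
    "dc:subject": ["social sciences", "survey methodology"],
    "relation:uses_dataset": "dataset-0042",  # explicit cross-item link
}

dataset_record = {
    "dc:identifier": "dataset-0042",
    "dc:type": "research_data",
    "dc:title": "Example panel survey, wave 3",
}
      </preformat>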
      <p>In this living lab scenario, P, R, I, and M would be conserved
within the Docker containers and the STELLA infrastructure. A →
A′ and D → D′ would remain as components that change in
another system and at another point in time. By recording the interaction
and click data, we can try to preserve parts of A and D.</p>
    </sec>
    <sec id="sec-4">
      <title>DISCUSSION</title>
      <p>Compared to previous living lab campaigns, the Docker-based
infrastructure results in several advances towards enhancing
reproducibility in online evaluations. The most important advances are:</p>
      <p>Transparency: Specifying retrieval and recommender
systems, as well as their specific requirements, in a standardized
way may contribute to the transparency of these systems.</p>
      <sec id="sec-4-1">
        <title>No limitation to head queries: Pre-computed rankings were</title>
        <p>limited to the top-k queries. Even though this is an elegant
solution, it may influence evaluation outcomes. Using locally
deployed Docker applications, there is no need for this
restriction anymore. Rankings and recommendations can be
determined based on the complete corpus.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Avoidance of network latencies: Network latencies after the</title>
        <p>retrieval of rankings or recommendations might afect user
behavior. Also, implementing workarounds like timeouts
resulted in additional implementation efort for sites. By
deploying local Docker images, these latencies are eliminated.
Lower entrance barrier for participation: Participants can
contribute already existing systems to the STELLA
infrastructure by simply dockerizing them. By specifying the
required components and parameters, the deployment
procedure is less error-prone. Researchers can use software and
programming languages of their choice. Sites solely need to
implement the REST-API and set up the local instance of the
Docker application. Letting Docker deploy the application,
human errors are avoided, and eforts are reduced.</p>
        <p>The benefits mentioned above come at a cost. Especially the
following limitations have to be considered:</p>
      <p>Central server: The proposed infrastructure relies on a
central server. This central component might be a target for malicious
attacks and is generally a single point of failure.</p>
      <p>Hardware limitations: The experimental systems will be
deployed at the sites. Available hardware capacities may vary across
different sites. Furthermore, the hardware requirements of
experimental systems should be in line with the available
resources of the sites. For instance, participants contributing
machine/deep learning systems should not outsource training
routines to external servers. Initially, we will focus
on lightweight experiments in order to keep the entrance
barrier for participation low.</p>
      <p>User interaction data: While all interaction data within the
STELLA infrastructure can be logged, a setup for its reuse is not
outlined yet and remains future work.</p>
      <p>
        Recording usage and interaction data for later reuse is not
novel. An example of interaction data recorded to allow later
verification and simulation is the NewsREEL lab at CLEF 2017 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In NewsREEL Replay, participants had access to a dataset comprising a
collection of log messages analogous to NewsREEL Live. Analogous
to the online evaluation, participants had to find the configuration
with the highest click-through rate. By using the recorded data,
participants experimented on a reasonable trade-off between prediction
accuracy and response time.</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>We present a living lab platform for the evaluation of online
experiments and a Docker-based infrastructure which bridges the gap
between experimental systems and real user interactions.
With respect to the PRIMAD model, it is possible to assess reproducibility and
the corresponding components in our infrastructure proposal.
Participants are free to choose which method and implementation to use
and can rely on adequate deployment and environments. User
interaction data will be logged and accessible for optimizing systems
and future applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Krisztian</given-names>
            <surname>Balog</surname>
          </string-name>
          , Anne Schuth,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Dekker</surname>
          </string-name>
          , Philipp Schaer, Narges Tavakolpoursaleh, and
          <string-name>
            <surname>Po-Yu Chuang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the TREC 2016 Open Search Track</article-title>
          .
          <source>In Proceedings of the Twenty-Fifth Text REtrieval Conference (TREC 2016)</source>
          . NIST.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          , Norbert Fuhr, Kalervo Järvelin, Noriko Kando, Matthias Lippold, and
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science"</article-title>
          .
          <source>SIGIR Forum 50</source>
          ,
          <issue>1</issue>
          (
          <year>June 2016</year>
          ),
          <fpage>68</fpage>
          -
          <lpage>82</lpage>
          . https://doi.org/10.1145/2964797.2964808
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , Henning Müller, Krisztian Balog, Torben Brodt,
          <string-name>
            <given-names>Gordon V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          , Ivan Eggel, Tim Gollub, Frank Hopfgartner, Jayashree Kalpathy-Cramer, Noriko Kando, Anastasia Krithara, Jimmy Lin,
          <string-name>
            <given-names>Simon</given-names>
            <surname>Mercer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Evaluation-as-a-Service: Overview and Outlook</article-title>
          .
          <source>ArXiv e-prints</source>
          (
          <year>Dec. 2015</year>
          ). http://arxiv.org/abs/1512.07454
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Hienert</surname>
          </string-name>
          , Dagmar Kern, Katarina Boland, Benjamin Zapilko, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Mutschke</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Digital Library for Research Data and Related Information in the Social Sciences</article-title>
          .
          <source>In ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) (forthcoming).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Rolf</given-names>
            <surname>Jagerman</surname>
          </string-name>
          , Krisztian Balog, and Maarten de Rijke.
          <year>2018</year>
          .
          <article-title>OpenSearch: Lessons Learned from an Online Evaluation Campaign</article-title>
          .
          <source>J. Data and Information Quality</source>
          <volume>10</volume>
          (
          <year>2018</year>
          ),
          <fpage>13:1</fpage>
          -
          <lpage>13:15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Kille</surname>
          </string-name>
          , Andreas Lommatzsch, Frank Hopfgartner, Martha Larson, and
          <string-name>
            <given-names>Torben</given-names>
            <surname>Brodt</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>CLEF 2017 NewsREEL Overview: Offline and Online Evaluation of Stream-based News Recommender Systems</article-title>
          .
          <source>In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . (CEUR Workshop Proceedings), Linda Cappellato, Nicola Ferro, Lorraine Goeuriot, and Thomas Mandl (Eds.), Vol.
          <volume>1866</volume>
          .
          CEUR-WS.org. http://ceur-ws.org/Vol-1866/invited_paper_17.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Bernd</given-names>
            <surname>Müller</surname>
          </string-name>
          , Christoph Poley, Jana Pössel, Alexandra Hagelstein, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Gübitz</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>LIVIVO - the vertical search engine for life sciences</article-title>
          .
          <source>Datenbank-Spektrum 17</source>
          ,
          <issue>1</issue>
          (
          <year>2017</year>
          ),
          <fpage>29</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>