<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of LiLAS 2021 - Living Labs for Academic Search (Extended Overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philipp Schaer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timo Breuer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leyla Jael Castro</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Wolf</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johann Schaible</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Narges Tavakolpoursaleh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TH Köln - University of Applied Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ZB MED - Information Centre for Life Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Living Labs for Academic Search (LiLAS) lab aims to strengthen the concept of user-centric living labs for academic search. The methodological gap between real-world and lab-based evaluation should be bridged by allowing lab participants to evaluate their retrieval approaches in two real-world academic search systems from the life sciences and social sciences. This overview paper outlines the two academic search systems LIVIVO and GESIS Search, and their corresponding tasks within LiLAS, which are ad-hoc retrieval and dataset recommendation. The lab is based on a new evaluation infrastructure named STELLA that allows participants to submit results corresponding to their experimental systems in the form of pre-computed runs and Docker containers that can be integrated into production systems and generate experimental results in real time. Both submission types are interleaved with the results provided by the production systems, allowing for a seamless presentation and evaluation. The evaluation of results and a meta-analysis of the different tasks and submission types complement this overview.</p>
      </abstract>
      <kwd-group>
        <kwd>Living labs</kwd>
        <kwd>evaluation</kwd>
        <kwd>academic search</kwd>
        <kwd>dataset recommendation</kwd>
        <kwd>ad-hoc retrieval</kwd>
        <kwd>STELLA framework</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        academic research. In-vivo here describes the possibility to perform IR experiments integrated
into real-world systems and to conduct experiments where the actual interaction with these
systems takes place. It should be emphasized that these are not classic user experiments in
which the focus is on the individual interactions of users (e.g., to investigate questions of UI
design); rather, aggregated usage data is collected in large quantities in order to generate
reliable quantitative research results. The potential of living labs and real-world evaluation
techniques has been shown in previous CLEF labs such as NewsREEL [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and LL4IR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or
TREC OpenSearch [3]. In a similar vein, LiLAS is designed around the living lab evaluation
concept and introduces different use cases in the broader field of academic search. Academic
search solutions, which have to deal with the exponential growth [4] of scientific information
and knowledge, tend to fall behind real-world requirements and demands. The vast amount of
scientific information does not only include traditional journal publications, but also a
constantly growing amount of pre-prints, research datasets, code, survey data, and many other
research objects. This heterogeneity and mass of documents and datasets introduce new
challenges to the disciplines of information retrieval, recommender systems, digital libraries,
and related fields. Academic search is a conceptual umbrella that subsumes all these different
disciplines and is well-known through (mostly domain-specific) search systems and portals such
as PubMed, arXiv.org, or dblp. While those three are examples of open-science-friendly systems,
as they allow re-use of metadata, usage data, and/or access to fulltext data, other systems
such as Google Scholar or ResearchGate offer no access at all to their internal algorithms and
data and are therefore representatives of a closed-science (and commercial) mindset.
      </p>
      <p>Progress in the field of academic search and its corresponding domains is usually evaluated
by means of shared tasks that are based on the principles of Cranfield/TREC-style studies [5].
Typical shared tasks at the Conference and Labs of the Evaluation Forum (CLEF) and the Text
REtrieval Conference (TREC) are based on the offline computation of results/runs, missing a
valuable link to real-world environments [6]. Most recently, the TREC-COVID [7] evaluation
campaign run by NIST attracted a high number of participants and showed the high impact of
scientific retrieval tasks in the community. Within TREC-COVID, a wide range of systems and
retrieval approaches participated and generally showed the massive retrieval performance that
recent BERT and other transformer-based machine learning approaches are capable of. However,
classic vector-space retrieval was also highly successful using the well-known SMART system
(https://ir.nist.gov/covidSubmit/archive.html) and showed the limitations of the test
collection-based evaluation approach of TREC-COVID and the general need for innovation in the
field of academic search and IR. Meta-evaluation studies of system performances in TREC and
CLEF showed a need for innovation in IR evaluation [8, 9]. The field of academic search is no
exception to this. The central concern of academic search is finding both relevant and
high-quality documents. The question of what constitutes relevance in academic search is
multilayered [10] and an ongoing research area.</p>
      <p>In 2020, we held a first iteration of LiLAS as a so-called workshop lab. This year, we provide
participants exclusive access to real-world systems, their document base (in our case, a very
heterogeneous set of research articles and research data including, for instance, surveys), and the
actual interactions, including the query string and the corresponding click data (see the overview
of the setup in Figure 1). To foster different experimental settings, we compile a set of head
queries and candidate documents to allow pre-computed submissions. Using the STELLA infrastructure,
we allow participants to easily integrate their approaches into the real-world systems using
Docker containers and provide the possibility to compare different approaches at the same time.</p>
      <p>This extended lab overview is a longer version of the condensed LNCS lab overview [11]. It is
structured as follows: In Sections 2 and 3 we introduce the two main use cases of LiLAS, which
are bound to the sites granting us access to their retrieval systems: LIVIVO and GESIS Search. In
these two sections the systems, the provided datasets, and the tasks are described. In Section 4 we
outline the evaluation setup and STELLA, our living lab evaluation framework, as well as the two
submission types, namely pre-computed runs and Docker container submissions. Section 4 also
includes a description of the evaluation metrics used within the lab and a short overview
of the organizational structure of the lab. In Section 5 we introduce the participating groups
and approaches. We outline the results of the evaluation rounds in Section 6 and conclude in
Section 7. In addition to the condensed LNCS overview, we include some more textual details
and additional tables and figures in Appendix A.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Ad-hoc Search in LIVIVO</title>
      <sec id="sec-2-1">
        <title>2.1. LIVIVO Literature Search Portal</title>
        <sec id="sec-2-1-1">
          <title>Overview</title>
          <p>LIVIVO [12] is a literature search portal developed and supported by ZB MED – Information
Centre for Life Sciences. ZB MED is a non-profit organization providing specialized literature in
Life Sciences at a national (German) and international level and hosting one of the largest stocks
of life science literature in Europe. Since 2015, ZB MED has supported users including librarians,
students, general practitioners, and researchers with LIVIVO, a comprehensive and interdisciplinary
search portal for Life Sciences.</p>
          <p>LIVIVO integrates various literature resources from medicine, health, environment,
agriculture, and nutrition, covering a variety of scholarly publication types (e.g., conferences,
preprints, peer-reviewed journals). The LIVIVO corpus includes about 80 million documents from
more than 50 data sources in multiple languages (e.g., English, German, French). To better support
its users, LIVIVO offers an end-user interface in English and German, an automatically and
semantically enhanced search capability, and a subject-based categorization covering the different
areas it supports (e.g., environment, agriculture, nutrition, medicine). Precision of search
queries is improved by using descriptors with semantic support; in particular, LIVIVO uses three
multilingual vocabularies to this end (Medical Subject Headings MeSH, UMTHES, and AGROVOC). In
addition to its search capabilities, LIVIVO also integrates functionality supporting inter-library
loans at a national level in Germany. Since 2020, LIVIVO also offers a specialized collection on
COVID-19.</p>
          <p>Figure 2 (excerpt of the data provided for the ad-hoc search task):
# Sample head query
{ "qid": 1001, "qstr": "integrierte AND versorgung", "freq": 12 }
# Sample documents
{ "DBRECORDID": "AGRISFR2016215853",
"TITLE": ["Dissection ..."],
"AUTHOR": ["Teyssèdre, Simon"],
"SOURCE": ["Dissection ..."],
"LANGUAGE": ["fra"],
"DATABASE": ["AGRIS"] }
# Sample candidate list
{ "qid": 1001,
"qstr": "integrierte AND versorgung",
"candidates": ["C951899619", "C676171", "848078", "C765841" ... ] }</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. LIVIVO Dataset</title>
        <p>For the LiLAS challenge, we prepared training and test datasets comprising head queries
together with 100-document candidate lists. In Figure 2 we include an excerpt of the different
elements included in the data. Data was formatted in JSON and presented as JSONL files to
facilitate processing. Head queries were restricted to keyword-based searches, optionally
combined with the AND, OR, and NOT operators.</p>
        <p>Head queries were assigned an identifier, namely qid, a query string, qstr, and, as
additional information, the query frequency, freq. For each head query, a candidate list was also
provided. Candidate lists include the query identifier as well as the corresponding query string,
together with a list of 100 document identifiers (i.e., the native identifiers used in the LIVIVO
database).</p>
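        <p>The JSONL files can be processed line by line; a minimal sketch in Python using the samples
from Figure 2 (file handling omitted, records inlined as strings):
```python
import json

def load_jsonl(lines):
    """Parse JSONL records (one JSON object per line)."""
    return [json.loads(line) for line in lines if line.strip()]

# Samples as shown in Figure 2
head_queries = load_jsonl([
    '{"qid": 1001, "qstr": "integrierte AND versorgung", "freq": 12}',
])
candidate_lists = load_jsonl([
    '{"qid": 1001, "qstr": "integrierte AND versorgung",'
    ' "candidates": ["C951899619", "C676171", "848078", "C765841"]}',
])

# Map each head query to its candidate list (100 documents in the real data)
cand_by_qid = {c["qid"]: c["candidates"] for c in candidate_lists}
print(cand_by_qid[1001][0])  # -> C951899619
```
        </p>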
        <p>In addition to head queries and candidate lists, we also provided the set of documents in
LIVIVO corresponding to three of the major bibliographic scholarly databases so participants could
create their own indexes. The document set contains metadata for approx. 35 million documents
and is provided as a JSONL file. To reduce complexity and keep the data manageable, we
decided to provide only the 6 most important data fields (DBRECORDID, TITLE, AUTHOR,
SOURCE, LANGUAGE, DATABASE). Additional metadata and fulltext is mostly available from
the original database curators. The aforementioned databases correspond to Medline, the National
Library of Medicine's (NLM) bibliographic database for life sciences and biomedical information
including about 20 million abstracts; the NLM Catalog, providing access to bibliographic
data for over 1.4 million journals, books, and similar data; and the Agricultural Science and
Technology Information (AGRIS) database, a Food and Agriculture Organization of the United
Nations initiative compiling information on agricultural research with 8.9 million structured
bibliographical records on agricultural science and technology.</p>
        <sec id="sec-2-2-1">
          <title>2.3. Task</title>
          <p>Finding the most relevant publications in relation to a head query remains a challenge in
scholarly Information Retrieval systems. While most repositories and registries deal mostly with
publications in English, LIVIVO, the production system used at LiLAS, supports multilingualism,
adding an extra layer of complexity and presenting a challenge to participants.</p>
          <p>The goal of this ad-hoc search task is to support researchers in finding the most relevant
literature for a head query. Participants were asked to define and implement their ranking
approach using a multilingual candidate document list as basis. A good ranking should present
users with the most relevant documents at the top of the result set. An interesting aspect of
this task is its multilingualism, as multiple languages can be used to pose a query (e.g.,
English, German, French); however, regardless of the language used in the query, the retrieval
can include documents in other languages as part of the result set.</p>
        </sec>
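        <p>As an illustration of the re-ranking setting (not any participant's actual approach), a
self-contained BM25 scorer over candidate document titles might look as follows; the document
texts are invented for illustration:
```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank docs (id -> text) against a query with BM25 scoring."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized.values()) / n
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))  # document frequency per term
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "C951899619": "Integrierte Versorgung in der Praxis",
    "C676171": "Versorgung im Krankenhaus",
    "848078": "Dissection of genetic factors",
}
print(bm25_rank("integrierte versorgung", docs))
```
The candidate containing both query terms ranks first; the unrelated one scores zero.
        </p>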
      </sec>
      <sec id="sec-2-3">
        <title>3.1. GESIS Search Portal</title>
        <sec id="sec-2-3-1">
          <title>Overview</title>
          <p>GESIS Search is a search portal for social science research data and open access
publications developed and supported by GESIS - Leibniz Institute for the Social Sciences. GESIS
is a member of the Leibniz Association, with the purpose of promoting social science research. It
provides essential and internationally relevant research-based services for the social sciences;
as the largest European infrastructure institute for the social sciences, GESIS offers advice,
expertise, and services to scientists at all stages of their research projects.</p>
          <p>GESIS Search aims at helping its users find appropriate scholarly information on the broad
topic of social sciences [13]. To this end, it provides different types of information from the
social sciences in multiple languages, comprising literature (114.7k publications), research data
(84k), questions and variables (13.6k), as well as instruments and tools (440). A well-configured
relevance ranking together with a well-defined structure and faceting mechanism helps address the
users' information needs; however, the most interesting aspect is the integration of scientific
literature with research data. Typically, those types of information are accessible through
different portals only, posing the problem of a lack of links between these two types of
information. GESIS Search provides such integrated access to research data as well as to
publications. The information items are connected to each other based on links that are either
manually created or automatically extracted by services that find data references in full texts.
Such linking allows researchers to explore the connections between information items
interactively.</p>
          <p>Figure 3 (excerpt of the data provided for the recommendation task):
# Sample publication document
{ "id": "csa201419416",
"title": "The Changing Value...",
"abstract": "This article reviews...",
"topic":[
"Children",
"Child Mortality",
"Values"] }
# Sample research dataset document
{ "id": "DA3433",
"title": "Kindheit, Jugend und Erwachsenwerden...",
"title_en": "Childhood, Adolencence, and Becoming an Adult...",
"abstract": "Die Hauptthemen der Studie...",
"abstract_en": "The primary topics of the study...",
"topic": ["Familie und Ehe", "Kinder"],
"topic_en": ["Family life and marriage", "Children"] }
# Sample candidate list
{ "s_id": "gesis-ssoar-62031",
"candidate_docs": {
"ZA6752": 0.1856689453125,
"ZA6751": 0.183837890625,
"ZA6749": 0.181396484375,
"ZA6782": 0.1795654296875} }</p>
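        <p>When working with this bilingual metadata, a common preprocessing step is to prefer the
English field variant where available; a small sketch using the Figure 3 samples (the fallback
policy is our assumption for illustration, not the portal's logic):
```python
def english_text(record, field):
    """Prefer the English variant (e.g. title_en) of a metadata field,
    falling back to the primary (often German) value."""
    return record.get(field + "_en") or record.get(field)

dataset = {
    "id": "DA3433",
    "title": "Kindheit, Jugend und Erwachsenwerden...",
    "title_en": "Childhood, Adolencence, and Becoming an Adult...",
}
publication = {"id": "csa201419416", "title": "The Changing Value..."}

print(english_text(dataset, "title"))      # English variant is used
print(english_text(publication, "title"))  # falls back to "title"
```
        </p>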
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.2. GESIS Search Dataset</title>
        <p>For LiLAS, we focus on all publications and research data contained in GESIS Search. The
publications are mostly in English and German and are annotated with further textual metadata
including title, abstract, topic, persons, and others. Metadata on research data comprises
(among others) a title, topics, datatype, abstract, collection method, primary investigators, and
contributors in English and/or German.</p>
        <p>The data provided to participants comprises the mentioned metadata on social science
literature and research data on social science topics contained in GESIS Search. In Figure 3 we
include an excerpt of the different elements included in the data. For the dataset recommendation
task with pre-computed results (see details in Section 3.3), the participants were additionally
given the set of research data candidates that are recommended for each publication. This
candidate set is computed based on context similarity between publications and research data. It
is created by applying the TF-IDF score to vectorize the combination of title, abstract, and
topics for each document type and computing the cosine similarities across the document types. It
contains, for each publication, a list of the research data with the highest similarities to that
publication among all research data in the corpus.</p>
        <sec id="sec-2-4-1">
          <title>3.3. Task</title>
          <p>Research data is of high importance in scientific research, especially when making progress
in experimental investigations. However, finding useful research data can be difficult and
cumbersome, even when using dataset search engines such as Google Dataset Search
(https://datasetsearch.research.google.com/). Another approach is scanning scientific publications
for utilized or mentioned research data; however, this only surfaces explicitly stated research
data and not other research data relevant to the subject. To alleviate the situation, we aim at
evolving the recommendation of appropriate research data beyond explicitly mentioned or cited
research data. To this end, we propose to recommend research data based on the similarity between
a publication of interest to the user and possible research data candidates.</p>
          <p>The main task is: given a seed document, participants are asked to calculate the best-fitting
research data recommendations with regard to the seed document. This resembles the use case of
providing highly useful recommendations of research data relevant to the publication that the
user is currently viewing. For example, a user interested in the impact of religion on political
elections finds a publication regarding that topic, which has a set of research data candidates
covering the same topic.</p>
          <p>The participants were allowed to submit pre-computed and live runs (see Section 4.2 for more
details). For submitting a pre-computed run, the participants also received a first candidate
list comprising 1k publications, each having a list of recommended research data. The task here
was to re-rank this candidate list. In contrast, for submitting live runs, such a candidate list
was not needed, as the recommended candidates needed to be calculated first. To do so,
participants were provided metadata on the publications as well as on the research data contained
in GESIS Search (see Section 3.2 for more details on the provided data).</p>
        </sec>
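        <p>The candidate generation described above (TF-IDF vectors over title, abstract, and topics,
cosine similarity across document types) can be sketched with a hand-rolled TF-IDF on a toy
corpus; ids and texts below are illustrative only:
```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build TF-IDF vectors (term -> weight) for a dict id -> text."""
    tokens = {i: t.lower().split() for i, t in texts.items()}
    n = len(texts)
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))
    return {i: {w: c * math.log(n / df[w]) for w, c in Counter(toks).items()}
            for i, toks in tokens.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Combined title + abstract + topics per document (toy corpus)
pubs = {"csa201419416": "changing value children child mortality values"}
data = {"DA3433": "childhood children family life marriage",
        "ZA9999": "labour market wages economy"}

vecs = tfidf_vectors({**pubs, **data})
candidates = {p: sorted(data, key=lambda d: cosine(vecs[p], vecs[d]),
                        reverse=True)
              for p in pubs}
print(candidates["csa201419416"][0])  # -> DA3433
```
        </p>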
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation Setup</title>
      <sec id="sec-3-1">
        <title>4.1. STELLA Infrastructure</title>
        <p>The technical infrastructure and platform was provided by our evaluation service called STELLA
(as illustrated in Figure 4). It complements existing shared task platforms by allowing
experimental ranking and recommendation systems to be fully integrated into an evaluation
environment, with no interference in the interaction between the users and the system, as the
whole process is transparent for users. Besides transparency and reproducibility, one of STELLA's
main principles is the integration of experimental systems as micro-services. More specifically,
lab participants package their systems as Docker containers that are bundled in a multi-container
application (MCA). Providers of academic research infrastructures deploy the MCA in their
back-end and use the REST API either to get rankings and recommendations or to post the
corresponding user feedback that is mainly used for our evaluations. Intermediate evaluation
results are available through a public dashboard service that is hosted on a central server, also
part of the STELLA infrastructure. After authentication, participants can register experimental
systems at this central instance and access feedback data that can be used to optimize their
systems. In the following, each component of the infrastructure is briefly described to give the
reader a better idea of how STELLA serves as a proxy for user-oriented experiments with ranking
and recommendation systems.</p>
        <sec id="sec-3-1-1">
          <title>4.1.1. Micro-services</title>
          <p>As pointed out before, we request our lab participants to package their systems with Docker.
For the sake of compatibility, we provide templates for these micro-services that implement
minimal REST-based web services. Participants can adapt their systems to these templates as they
see fit, as long as the pre-defined REST endpoints deliver technically correct responses. The
templates can be retrieved from GitHub (https://github.com/stella-project/stella-micro-template),
which is fundamental to our infrastructure. Not only the templates, but also the participant
systems should be hosted in a public Git repository in order to be integrated into the MCA. As
soon as the developments are done, the participants register their Git(Hub) URL at the central
dashboard service of the infrastructure.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.1.2. Multi-container Application (MCA)</title>
          <p>Once the experimental systems pass technical unit tests and sanity checks for selected
queries and target items, they are ready to be deployed and evaluated via user interactions. To
reduce the deployment costs for the site providers, the single experimental systems are bundled
into an MCA which serves as the entry point to the infrastructure. The MCA handles the query
distribution among the experimental systems and also sends user feedback data to the central
server at regular intervals. After the REST API corresponding to the MCA is connected to the
search interface, the user traffic can be redirected to the MCA, which will actually deliver the
experimental results. We then interleave results of single experimental systems with those from
the baseline system by using a Team-Draft-Interleaving (TDI) approach. This results in two
benefits: 1) we protect users from subpar retrieval results that might also affect the site's
reputation, and 2), as shown before, interleaved results can be used to infer statistically
significant results with less user data than conventional A/B tests. The site providers rely on
their own logging tools. STELLA expects a minimal set of information when sending feedback;
however, sites are free to add any additional JSON-formatted feedback information and
interactions to the data payload, for instance logged clicks on site-specific SERP elements. The
underlying source code of the MCA is hosted in a public GitHub repository
(https://github.com/stella-project/stella-app).</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>4.1.3. Central Server</title>
          <p>The central server instance of the infrastructure fulfills four functions: 1) participants,
sites, and administrators visit the server to register user accounts and systems; 2) a dashboard
service provides visual analytics and first insights about the performance of experimental
systems; 3) likewise, feedback data in the form of user interactions is stored in a database that
can be downloaded for system optimizations and further evaluations; and 4) the server implements
an automated update job of the MCA in order to integrate newly submitted systems if suitable.</p>
          <p>Each MCA that is instantiated with legitimate credentials posts the logged user feedback to
the central infrastructure server. Even though the infrastructure would allow continuous
integration of newly submitted systems, we stuck to the official dates of rounds 1 and 2 when
updating the MCAs at the sites. Due to moderate traffic, we run the central server
(https://lilas.stella-project.org/) on a lightweight single-core virtual machine with 2GB RAM and
50GB storage capacity. More technical details about the implementations can be found in the
public GitHub repository (https://github.com/stella-project/stella-server).</p>
        </sec>
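        <p>A minimal sketch of such a REST-based micro-service using only the Python standard library
(the endpoint path, query parameter, and response shape are assumptions for illustration; the
actual contract is defined by the stella-micro-template):
```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def rank(query):
    """Toy ranking: return ids of docs whose text contains a query term."""
    index = {"d1": "living labs evaluation", "d2": "academic search"}
    hits = [d for d, text in index.items()
            if any(t in text for t in query.lower().split())]
    return {"query": query, "itemlist": hits}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path == "/ranking":  # assumed endpoint path
            query = parse_qs(url.query).get("query", [""])[0]
            body = json.dumps(rank(query)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()
```
In the real template, the retrieval logic behind rank() is where participants plug in their
experimental system.
        </p>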
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Submission Types</title>
        <p>Participants can choose between two different submission types for both tasks (i.e., ad-hoc
search and dataset recommendation). Similar to previous living labs, Type A submissions are
pre-computed runs that contain rankings and recommendations for the most frequent queries and the
most frequently viewed documents, respectively, for each task. Alternatively, it is possible to
integrate the entire experimental system as a micro-service as part of a Type B submission. Both
submission types have their own distinct merits, as described below.</p>
        <sec id="sec-3-2-1">
          <title>4.2.1. Type A - Pre-computed Runs</title>
          <p>Even though the primary goal of the STELLA framework is the integration of entire systems as
micro-services, we offer the possibility to participate in the experiments by submitting system
outputs, i.e., in the form of pre-computed rankings and recommendations. We do so for two
reasons. First, the Type A submissions resemble those of previous living labs and serve as the
baseline in order to evaluate the feasibility of our new infrastructure design. Second, we hope
to lower technical barriers for some participants that want to submit the system outputs only.
To make it easier for participants, we follow the familiar TREC run file syntax.</p>
          <p>Depending on the chosen task, for each of the selected top-k queries or target items
(identified by &lt;qid&gt;) a ranking or recommendation has to be computed in advance and then
uploaded to the dashboard service. The upload process is tightly integrated into the GitHub
ecosystem. Once the run file is uploaded, a new repository is automatically created from the
previously described micro-template, to which the uploaded run is committed. This is made
possible thanks to the GitHub API and access tokens. The run file itself is loaded as a pandas
DataFrame into the running micro-service when the indexing endpoint is called. Upon request, the
queries and target items are translated into the corresponding &lt;qid&gt; to filter the
DataFrame. Due to the manageable sizes of top-k queries and target items, the entire (compressed)
run file can be uploaded to the repository and kept in memory after it is indexed as a DataFrame.
As a technical safety check, we also integrate a dedicated verification tool
(https://github.com/stella-project/syntax_checker_CLI) in combination with GitHub Actions to
verify that the uploaded files follow the correct syntax.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>4.2.2. Type B - Docker Containers</title>
          <p>Running fully-fledged ranking and recommendation systems as micro-services overcomes the
restriction of responses that are limited to top-k queries and target items. Therefore, we offer
the possibility to integrate entire systems as Docker containers into the STELLA infrastructure
as part of Type B submissions. As pointed out earlier, participants fork the template of the
micro-services and adapt it to their experimental system. While Docker and the implementation of
pre-defined REST endpoints are hard requirements, participants have total freedom w.r.t. the
implementation and tools they use within their container, i.e., they do not even have to build on
the Python web application that is provided in the template. Only the index endpoint and,
depending on the chosen task, either the ranking or the recommendation endpoint have to deliver
technically correct results. For this purpose, we include unit tests in the template repository
that can be run in order to verify that the Docker containers can be properly integrated. If
these unit tests pass, the participants register the URL of the corresponding Git repository at
the dashboard service. Later on, the system URL is added to the build file of the MCA when an
update process is invoked. When the MCA is updated at the sites, newly submitted experimental
systems are built from the Dockerfiles in the specified repositories.</p>
        </sec>
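        <p>For illustration, a Type A run in the familiar six-column TREC syntax (qid, Q0, docid, rank,
score, run tag) can be parsed and filtered without pandas as follows; the run contents are
invented:
```python
def parse_run(lines):
    """Parse TREC-format run lines into dict rows."""
    rows = []
    for line in lines:
        qid, _, docid, rank, score, tag = line.split()
        rows.append({"qid": qid, "docid": docid, "rank": int(rank),
                     "score": float(score), "tag": tag})
    return rows

run = parse_run([
    "1001 Q0 C951899619 1 14.2 expA",
    "1001 Q0 C676171 2 11.7 expA",
    "1002 Q0 848078 1 9.3 expA",
])

# Filter by qid and restore the ranked order, as the micro-service
# does for each incoming request
ranking = sorted((r for r in run if r["qid"] == "1001"),
                 key=lambda r: r["rank"])
print([r["docid"] for r in ranking])  # -> ['C951899619', 'C676171']
```
        </p>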
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Baseline Systems</title>
        <p>The LIVIVO baseline system for ranking is built on Apache Solr and Apache Lucene. The index
contains about 80 million documents from more than 50 data sources in multiple languages, with
about 120 searchable fields ranging from basic data such as title, abstract, and authors to more
specific fields such as MeSH terms, availability, or OCR data. For ranking, LIVIVO uses the
Lucene default ranker, which is a variant of TF-IDF; on top of it, a custom boosting is added.
Newer documents are boosted, as are documents whose title or author fields contain the search
query terms. An exact match of the search phrase in the title field results in a very high boost.
Moreover, LIVIVO uses a Lucene-based plugin which executes NLP tasks like stemming,
lemmatization, and multilingual search; it also makes use of semantic technologies, mainly based
on the Medical Subject Headings (MeSH) vocabulary.</p>
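        <p>The boosting scheme above can be illustrated with hypothetical Solr eDisMax parameters; the
field names and boost factors below are assumptions, not LIVIVO's actual configuration:
```python
# Hypothetical eDisMax request parameters sketching the described boosting
params = {
    "defType": "edismax",
    "q": "integrierte versorgung",
    "qf": "TITLE^10 AUTHOR^5",  # boost query matches in title/author fields
    "pf": "TITLE^50",           # very high boost for exact phrase in title
    # recency boost via a reciprocal function of document age
    "bf": "recip(ms(NOW,PUBLICATION_DATE),3.16e-11,1,1)",
}
print(params["qf"])
```
        </p>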
        <p>The baseline system for the recommendation of research data based on publications in GESIS
Search utilizes Pyserini, a Python interface to the Anserini IR toolkit built on Lucene and
designed to support reproducible IR research. The baseline system applies the SimpleSearcher of
Pyserini, which provides the entry point for sparse retrieval (BM25 ranking using bag-of-words
representations). The Lucene-based index contains the abstracts and titles of all research data.
The publication identifier (the target item of the recommendation) is translated into the
publication title, which, in turn, is used to query the index with the BM25 algorithm.
Accordingly, the research data recommendations are based on the titles and abstracts of the
research data and on queries made from the publication titles.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Evaluation Metrics</title>
        <p>Our logging infrastructure allows us to track search sessions and the corresponding interactions
made by users. Each session comprises a specific site user, multiple queries (or target items) as
well as the corresponding results and feedback data in the form of user interactions, primarily
logged as clicks with timestamps.</p>
<p>
          Similar to previous living lab initiatives, we design our user-oriented experiments with
interleaved result lists. Given a list with interleaved results and the corresponding clicks of
users, we determine Wins, Losses, Ties, and the derived Outcomes for relative comparisons of the
experimental and baseline systems [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Following previous living lab experiments, we implement
the interleaving method with the Team-Draft Interleaving algorithm [14]. More specifically, we
reused the very same implementation11 for the highest degree of comparability.
        </p>
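        <p>For illustration, a minimal re-implementation of Team-Draft Interleaving and the click-based Win/Loss/Tie decision might look as follows; this is a sketch of the algorithm, not the refactored living-labs code itself:</p>
        <p>
```python
import random

def team_draft_interleave(run_a, run_b, length=10, seed=0):
    """Team-Draft Interleaving: both systems alternately 'draft' their
    highest-ranked unplaced document; draft order ties are broken randomly."""
    rng = random.Random(seed)
    interleaved, credit = [], {}  # credit: doc -> contributing team
    a, b = list(run_a), list(run_b)
    picks_a = picks_b = 0
    while len(interleaved) < length and (a or b):
        draft_a = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        if draft_a and a:
            pool, team = a, "A"
            picks_a += 1
        elif b:
            pool, team = b, "B"
            picks_b += 1
        else:
            pool, team = a, "A"
            picks_a += 1
        # take the team's top-ranked document that is not already placed
        doc = pool.pop(0)
        while doc in credit and pool:
            doc = pool.pop(0)
        if doc not in credit:
            interleaved.append(doc)
            credit[doc] = team

    return interleaved, credit

def outcome(credit, clicked_docs):
    """Win/Loss/Tie for system A, based on clicks credited to each team."""
    clicks_a = sum(1 for d in clicked_docs if credit.get(d) == "A")
    clicks_b = sum(1 for d in clicked_docs if credit.get(d) == "B")
    return "win" if clicks_a > clicks_b else "loss" if clicks_a < clicks_b else "tie"

serp, credit = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d5"])
print(serp, outcome(credit, clicked_docs=["d4"]))
```
</p>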
<p>Furthermore, we follow Gingstad et al.’s proposal of a weighted score based on click events
[15] and define the Reward as
Reward = Σ_{e ∈ E} w_e · c_e (1)
where E denotes the set of all elements on a search engine result page (SERP) for which clicks
are considered, w_e denotes the corresponding weight of the SERP element e that was clicked,
and c_e denotes the total number of clicks on the SERP element e. The Normalized Reward is
defined as
nReward = Reward_exp / (Reward_exp + Reward_base) (2)
that is, the sum of all weighted clicks on experimental results (Reward_exp) normalized by
the total Reward given by Reward_exp + Reward_base. Note that only those clicks from the
experimental systems are considered where rankings were interleaved with results of the two
compared systems. Figure 5 shows the SERP elements that were logged at LIVIVO (Bookmark,
Order, Fulltext, In Stock, More Links, Title, Details) and the corresponding weights for our
evaluations. We do not implement the Mean Normalized Reward proposed by Gingstad et al.
due to a different evaluation setup: our lab is organized in rounds during which the systems as
well as the underlying document collections are not modified, and we determine the Normalized
Reward over all aggregated clicks of a specific round.</p>
        <p>11https://bitbucket.org/living-labs/ll-api/src/master/ll/core/interleave.py</p>
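        <p>A minimal sketch of Equations (1) and (2) in Python; the per-element weights below are placeholders, not the actual values given in Figure 5:</p>
        <p>
```python
# Hypothetical per-element weights (the real values are those in Figure 5).
WEIGHTS = {"bookmark": 4, "order": 3, "fulltext": 2, "title": 1, "details": 1}

def reward(clicks):
    """Equation (1): weighted sum of clicks per SERP element."""
    return sum(WEIGHTS.get(element, 0) * n for element, n in clicks.items())

def n_reward(clicks_exp, clicks_base):
    """Equation (2): experimental Reward normalized by the total Reward.
    With no clicks at all, we return 0.5 (a tie) by convention."""
    r_exp, r_base = reward(clicks_exp), reward(clicks_base)
    return r_exp / (r_exp + r_base) if (r_exp + r_base) else 0.5

print(n_reward({"bookmark": 1, "title": 2}, {"title": 3, "details": 1}))  # 0.6
```
</p>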
      </sec>
      <sec id="sec-3-5">
        <title>4.5. Lab Rounds and Overall Lab Schedule</title>
<p>The lab was originally split into two separate rounds of four weeks each. Due to technical issues
at LIVIVO, round 1 was four days shorter and round 2 started one week later than planned. To
compensate for this, we decided to let round 2 last until 24 May 2021, so in total round 2 lasted
nearly six instead of four weeks. An overview of the general LiLAS 2021 schedule is given in
Table 2. Each participating group received a set of feedback data after each round; the feedback
was also made publicly available on the lab website12. Before each round, a training phase was
offered to allow the participants to build or adapt their systems to the new datasets or click
feedback data.</p>
        <p>12https://th-koeln.sciebo.de/s/OBm0NLEwz1RYl9N</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Participation</title>
      <sec id="sec-4-1">
        <title>5.1. Team lemuren</title>
<p>Team lemuren participated in both rounds with pre-computed results and dockerized systems
for the ad-hoc search task at LIVIVO [16]. For both rounds, they submitted two different
approaches.</p>
        <p>The pre-computed ranking results of lemuren elk are based on built-in functions of
Elasticsearch. This system uses a combination of the divergence from randomness (DFR)
model and the Jelinek-Mercer smoothing method for re-ranking candidate documents. The
preprocessing pipeline implements stop-word removal and stemming and considers synonyms
for medical and COVID-19-related terms. The system was tuned only for results in English.</p>
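        <p>Elasticsearch exposes both of these models as configurable similarity modules; a hypothetical index-settings fragment combining them might look as follows (the parameter values are invented for illustration, not taken from the team’s system):</p>
        <p>
```python
# Hypothetical Elasticsearch index settings defining a DFR similarity and a
# Jelinek-Mercer language-model similarity; fields can then reference either
# one via the "similarity" property in the index mapping.
settings = {
    "settings": {
        "index": {
            "similarity": {
                "my_dfr": {
                    "type": "DFR",
                    "basic_model": "g",
                    "after_effect": "l",
                    "normalization": "h2",
                    "normalization.h2.c": "3.0",
                },
                "my_lm": {"type": "LMJelinekMercer", "lambda": 0.1},
            }
        }
    }
}
print(sorted(settings["settings"]["index"]["similarity"]))  # ['my_dfr', 'my_lm']
```
</p>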
<p>save fami is another pre-computed system. It also uses Elasticsearch combined with
natural language processing (NLP) modules implemented with the Python package spaCy.
Similar to the second submission for the pre-computed round, this dockerized system is built
on top of Elasticsearch and spaCy. The indexing pipeline follows a multilingual approach
supporting English and German. For both languages the system uses full pipelines available
in spaCy, namely the models en_core_sci_lg (English biomedical texts) and de_core_news_lg
(general German texts). The system uses the Google Translator API13
for language detection and automatic translation of incoming queries (from German to English
and vice versa). For indexing and document retrieval, Elasticsearch was used with custom
boosting for MeSH and Chemical tokens. lemuren elastic only (LEO) is the second
dockerized system by this team which, different from LEPREP, relies only on Elasticsearch’s
built-in tools for indexing documents and processing queries. For indexing, a custom
ingestion pipeline is used to detect the document’s language (English or German) and to create
the corresponding language fields. Handling of basic acronyms was modeled using the
built-in word-delimiter function. Similar to the LEPREP system, LEO uses the Google Translator
API for automatic query translation. The system is complemented by fuzzy matching and fuzzy
query expansion to obtain better results for mistyped queries. Like lemuren elk in round
one, LEO also uses DFR and LMJelinekMercer to calculate a score and a similarity distance.</p>
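        <p>To illustrate the kind of boosting and fuzzy matching described above, a hypothetical Elasticsearch query body might look as follows; the field names (title, mesh_terms, text_en) and boost factors are invented for illustration and are not taken from the teams’ systems:</p>
        <p>
```python
import json

# Hypothetical query body: exact title phrases boosted highest, MeSH terms
# boosted moderately, and fuzzy matching to catch mistyped queries.
def build_query(user_query, lang_field="text_en"):
    return {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {"title": {"query": user_query, "boost": 10}}},
                    {"match": {"mesh_terms": {"query": user_query, "boost": 3}}},
                    {"match": {lang_field: {"query": user_query, "fuzziness": "AUTO"}}},
                ]
            }
        }
    }

body = build_query("covid19 vaccination")
print(json.dumps(body, indent=2))
```
</p>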
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Team tekma</title>
        <p>Team tekma contributed experiments to both rounds. In the first round, they submitted the
pre-computed results of the system tekma_s for the ad-hoc search task at LIVIVO [17]. In the
second round, they submitted pre-computed recommendations (covering the entire volume of
publications) for the corresponding task at GESIS. Both systems are described below.</p>
<p>tekma_s used Apache Solr to index the documents and used pseudo-relevance feedback to
expand the queries for the ad-hoc search task. The system only considers documents in English.
The system got few impressions and clicks in comparison to the baseline system. tekma_n
participated in the second round, producing pre-computed recommendations. It used the Apache
Solr BM25 ranking function and applied query expansion and data enrichment by adding
metadata translations and re-ranking the retrieved results using user feedback and k-nearest
neighbors (KNN). To generate the primary recommendations for a publication, they used
publication fields as a query to search the indexed dataset.</p>
        <p>13https://pypi.org/project/google-trans-new/</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Team GESIS Research</title>
<p>In addition to the baseline system, team GESIS Research contributed a fully dockerized system in
both rounds [18]. gesis_rec_pyterrier implements a naive content-based recommendation
without any advanced knowledge about user preferences and usage metrics. It uses the metadata
available in both entity types, i.e., title, abstract, and topics. They employed the classical
TF-IDF-based weighting model from the PyTerrier framework to obtain first-hand experience with the
online evaluation. The index and queries were built from the combined words of the titles,
abstracts, and topics of the research data and publications. They decided to submit the same
experimental system for both rounds to gain more user feedback for their unique system. Even
though only TF-IDF-based recommendations are implemented at the current state, it offers a good
starting point for further experimentation with PyTerrier and the declarative manner of defining
retrieval pipelines.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
<p>Our experimental evaluations are twofold. First, we evaluate overall statistics of both rounds
and sites. Second, we evaluate the performance of all participating systems based on the click
data logged during the active periods. As mentioned before, the first round ran for four
weeks from March 1st, 2021 to March 28th, 2021, and the second round ran for five weeks from April
17th, 2021 until May 24th, 2021 at LIVIVO and for six weeks from April 12th, 2021 until May
24th, 2021 at GESIS. To foster transparency and reproducibility of the evaluations, we release
the corresponding evaluation scripts in an open-source GitHub repository14.</p>
      <p>14https://github.com/stella-project/stella-evaluations</p>
      <p>6.1. Overall evaluations of both rounds and sites</p>
      <p>Table 3 provides an overview of the traffic logged in both rounds. In sum, substantially more
sessions, impressions, and clicks were logged in the second round, not only due to a longer
period but also because more systems contributed as Type B submissions. In the first round,
systems deployed at LIVIVO were mostly contributed as Type A submissions, meaning their
responses were restricted to pre-selected head queries. LIVIVO started the second round with
full systems which delivered results for arbitrary queries, and thus more session data was logged.
GESIS started both rounds with the majority of systems contributed as type B submissions. In
comparison to LIVIVO, more sessions and impressions were logged in the first round, but fewer
recommendations were clicked. Similarly, there are fewer clicks in the second round in comparison
to LIVIVO, which is also reflected by the Click-Through Rate (CTR), determined by the ratio
between Clicks and Impressions. As mentioned before, GESIS introduced the recommendations
of research datasets as a new service, and, presumably, users were not aware of this new feature.</p>
<p>Figure 15 shows the distributions of sessions and impressions over the entire time span from
the beginning of the first round until the end of the second round. Note that, after the end of
the first round, we did not log any interactions until the beginning of the second round. In
comparison, the sessions and impressions are more uniformly distributed at GESIS. This can be
explained by the deployment of type B systems from the very beginning of the first round; these
systems could provide recommendations for the entire volume of the publications.</p>
<p>During the first two weeks of the first round, the amount of logged data at LIVIVO is
comparatively low due to systems with pre-computed results for pre-selected head queries.
After that, the first type B system was deployed and increasingly more user traffic could be
redirected to our infrastructure. Figure 6 illustrates these effects: the cumulative sums of logged
sessions, impressions, and clicks rapidly increased after the first type B system went online in
mid-March.</p>
<p>The logged impressions follow a power-law distribution for both rankings and
recommendations, as shown in Figure 7: most of the impressions can be attributed to a few top-k queries
(rankings) or documents (recommendations). Tables 4 and 5 show the top ten queries and
documents for which rankings and recommendations were made. Query strings were normalized by
lower-casing and removing special characters. As can be seen from Table 4, the COVID-19
pandemic has a clear influence on the query distribution: the most frequent and the fifth most
frequent queries are “covid19” and “covid”, respectively. Three of the ten most frequent queries are
clearly German queries (“demenz”, “pflege”, “schlaganfall”). Others are either domain-specific
or can also be interpreted as English queries. In Table 6 we report statistics about the queries
logged during both rounds at LIVIVO. In both rounds, interaction data was logged for 11,822
unique queries with an average length of 2.9840 terms, and each session had 1.9340 queries on
average. Nine out of the ten most frequent target items of the recommendations at GESIS are
publications with German titles, as shown in Table 5.</p>
<p>Likewise, the distribution of the total number of clicks over queries and documents is extremely
skewed (cf. Figures 11 and 12). A large share of the clicks at LIVIVO were made for the query “polyvinyl
and nasal and packing”, and LIVIVO’s internal server logs indicate a crawling process here. All
other queries received 23 or fewer clicks. As mentioned before, fewer clicks were made at GESIS:
at most three clicks were made on recommendations for the most frequently clicked
documents.</p>
<p>Similar power-law distributions can be observed for the total number of clicks over documents
(rankings) and datasets (recommendations) in Figures 13 and 14, respectively. A few documents
and datasets receive most of the clicks. Details about the corresponding items can be found in
Tables 13 and 14.</p>
<p>Table 5 lists the ten most frequent target items of the recommendations at GESIS (counts in
parentheses): Die Nichtwähler : Politische Normalität oder wachsende Distanz zu den Parteien? (34);
Doing Gender: Soziale Praktiken der Geschlechterunterscheidung (30); ZUMA-Informationssystem.
Elektronisches Handbuch sozialwissenschaftlicher Erhebungsinstrumente (29); Situiertes Wissen :
die Wissenschaftsfrage im Feminismus und das Privileg einer partialen Perspektive (28); Party
identification, ideological preference, and the left-right dimension among western mass publics (26);
Die soziale Konstruktion von Geschlecht : Erkenntnisperspektiven und gesellschaftstheoretische
Fragen (22); Konsensfiktionen in Kleingruppen: dargestellt am Beispiel von jungen Ehen (22);
SWLS Satisfaction with Life Scale (21); Entwicklung einer Skala zur Messung von
Arbeitszufriedenheit (SAZ) (21); Gesundheitliche Ungleichheit / Health Inequalities (20).</p>
      <p>Another important aspect to be considered as part of the system evaluations is the position
bias inherent in the logged data. Click decisions are biased towards the top ranks of the
result lists, as shown in Figure 8. For both use cases, the rankings and recommendations were
displayed to users as vertical lists. Note that GESIS restricted the recommendations to the
first six recommended datasets and no pagination over the following recommended items was
possible. LIVIVO shows ten results per page to its users, and as can be seen from the logged
data, users rarely click results beyond the fifth page.</p>
<p>In addition to “simple” clicks on ranked items, we logged specific SERP elements that were
clicked at LIVIVO. Figure 5 already provided an overview of which elements were logged, and
Figure 9 shows that the CTR of these elements also follows a power-law distribution. The number
of clicks is highest for the Details button, followed by the Title and Fulltext click
options. In comparison, the other four logged elements receive substantially fewer clicks.</p>
      <sec id="sec-5-1">
        <title>6.2. System evaluations</title>
<p>An overview of all systems participating in our experiments is provided in Table 7. In the
first round, three type A systems (lemuren_elk, tekmas, save_fami) were submitted and
deployed at LIVIVO. They were also deployed in the second round, but did not receive any
updates between the two rounds. Since there were no type B submissions in the first round
for LIVIVO, we deployed the type B system livivo_rank_pyserini after two weeks in
mid-March. It provided results for the entire volume of publications, and rankings were based
on the BM25 method. It was implemented with Pyserini [19] and the corresponding default
settings15. In contrast to the other systems, it was online for the last two weeks of the first
round only. In the second round, it was online in the first days until the other type B systems
were ready to be deployed, since we wanted to distribute the user traffic among the
participants’ systems only. In the second round, two type B systems, lemuren_elastic_only and
lemuren_elastic_preprocessing, were contributed. Both systems build on
Elasticsearch and differ in their pre-processing, as outlined before. At GESIS,
gesis_rec_pyterrier, submitted as a type B system, was online in both rounds. In the first round, the only type
A submission was gesis_rec_precom, which was substituted in the second round by tekma_n.
Both baseline systems at LIVIVO (livivo_base) and GESIS (gesis_rec_pyserini) were
integrated as type B systems, remained unmodified, and could deliver results for every request.</p>
        <p>Table 8 compares the experimental systems’ outcomes and the corresponding logged
interactions and session data during the first round. Regarding the Outcome measure, none of the
experimental systems was able to outperform the baseline systems. Note that the reported
Outcomes of the baseline systems result from comparisons against all experimental systems.
The systems with pre-computed rankings (type A submissions) received a total number of 32
clicks over a period of four weeks at LIVIVO. Since interaction data was sparse in the first round,
we only received enough data for livivo_rank_pyserini to conduct significance tests. The
reported p-value results from a Wilcoxon signed-rank test and shows a significant difference
between the experimental and baseline system.</p>
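        <p>For illustration, the Wilcoxon signed-rank test can be sketched from scratch with the usual normal approximation (no continuity correction); the per-query differences below are invented toy data, not the logged click data:</p>
        <p>
```python
import math

def wilcoxon_signed_rank(diffs):
    """Wilcoxon signed-rank test with normal approximation.
    diffs: per-query differences between two systems (zeros are dropped)."""
    d = [x for x in diffs if x != 0]
    n = len(d)
    # rank the absolute differences (average ranks for ties)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # 1-based average rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return w_plus, p

# Invented per-query differences between an experimental and a baseline system.
_, p = wilcoxon_signed_rank([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(p < 0.05)  # consistently one-sided differences -> significant
```
</p>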
<p>Table 9 shows the results of the second round. tekma_n was contributed as a type A submission,
but results were pre-computed for the entire volume of publications at GESIS. It replaced
gesis_rec_precom and achieved a higher CTR compared to the other recommender systems.</p>
        <p>15https://github.com/stella-project/livivo_rank_pyserini</p>
<p>Likewise, it achieves an Outcome of 0.62, which might be an indicator that it outperforms the
baseline recommendations given by gesis_rec_pyserini. Unfortunately, we are not able
to conduct any meaningful significance tests due to the sparsity of click data. At LIVIVO, the
systems with pre-computed rankings (type A submissions) received an amount of clicks
comparable to the first round. In sum, all three systems received a total number of 35 clicks
over a period of five weeks. Even though click data is sparse and interpretations have to be
made carefully, the relative ranking order of these three systems is preserved in the second
round (e.g. in terms of the Outcome, total number of clicks, or CTR).</p>
<p>In the second round, no experimental system could outperform the baseline system at LIVIVO.
Both experimental type B systems, lemuren_elastic_only and
lemuren_elastic_preprocessing, achieve significantly lower Outcome scores than the baseline. However, the second
system has substantially lower Outcome and CTR scores. Both systems share a fair amount
of the same methodological approach and differ only in the processing of the input text. In
this case, the system performance does not seem to benefit from this specific pre-processing
step, when interpreting clicks as positive relevance signals. The third type B system at LIVIVO,
livivo_rank_pyserini, did not participate in the entire second round, since we took it offline
as soon as the other type B systems were available. Despite having participated in comparatively
fewer experiments than in the first round (1260 sessions vs. 243 sessions), the system achieves
comparable results in both rounds in terms of Outcome and CTR scores. This circumstance raises
the question of how long systems have to be online to deliver reliable performance estimates.
Figure 10 provides an overview of how the Outcome score evolves over aggregated sessions
for different systems and rounds. As the figures show, after a certain number of sessions, the
Outcome tends to stabilize. In our future work, we want to investigate how many sessions (or
how much online time) are required to deliver meaningful estimates of system performance in
terms of the Outcome and other measures derived from interleaving experiments.</p>
<p>Previous studies showed that a system is more likely to win if its documents are ranked at
higher positions [20]. As part of our experimental evaluations, we can confirm this circumstance.
We also determined the Spearman correlation between the interleaving outcome (1: win, -1: loss,
0: tie) and the highest ranked position of a document contributed by an experimental system. At
both sites, we see a weak but significant correlation (LIVIVO: ρ = −0.0883, p = 1.3535e-09;
GESIS: ρ = −0.3480, p = 4.7422e-07).</p>
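        <p>For illustration, Spearman’s ρ can be computed from scratch as the Pearson correlation of average ranks (ties share their mean rank); the outcome/rank data below is invented toy data, not the logged lab data:</p>
        <p>
```python
import math

def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy data: interleaving outcome (1 win, 0 tie, -1 loss) vs. the highest rank
# of an experimental document -- top-ranked documents should coincide with wins.
outcomes = [1, 1, 0, -1, -1, 1, -1, 0]
top_rank = [1, 2, 4, 9, 8, 1, 10, 5]
print(spearman(outcomes, top_rank))  # strongly negative correlation
```
</p>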
<p>One shortcoming of the previous measures derived from interleaving experiments is the
simplified interpretation of click interactions. As outlined in Section 4, by weighting clicks
differently, it is possible to account for the meaning of the corresponding SERP elements. Table
10 shows the total number of clicks on SERP elements for each system and the Normalized
Reward (nReward) resulting from the weighting scheme given in Figure 5. We compare the
total number of clicks of those (interleaving) experiments in which both the experimental and
baseline systems delivered results. As can be seen, comparing systems by clicks on different
SERP elements provides a more diverse analysis. For instance, some of the systems achieve
higher numbers of clicks (and CTRs) for some SERP elements in direct comparison to the
baseline systems: livivo_rank_pyserini and lemuren_elastic_only got more clicks on
the Bookmark element than the baseline system, while all systems achieve lower numbers of
total clicks.</p>
<p>None of the systems could outperform the baseline system in terms of the nReward measure,
but in comparison to the Outcome scores, there is a more balanced ratio between the nReward
scores, which also accounts for the meaning of specific clicks. Likewise, it accounts for clicks even
if the experimental system did not “win” the interleaving experiment. In Table 10 we compare
the total number of clicks over multiple sessions. While the Win, Loss, Tie, and Outcome only
measure whether there have been more clicks in a single experiment, the nReward also considers those
clicks that were made in experiments in which the experimental system did not necessarily win.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions</title>
      <p>The Living Labs for Academic Search (LiLAS) lab re-introduced the living lab paradigm with a
focus on tasks in the domain of academic search. The lab offered the possibility to participate
in two different tasks, dedicated either to ad-hoc search in the Life Sciences
or research data recommendations in the Social Sciences. Participants were provided with
datasets and access to the underlying search portals for experimentation. For both tasks,
participants could contribute their experimental systems either by pre-computed outputs for
selected queries (or target items) or as fully-fledged dockerized systems. In total, we evaluated
nine experimental systems out of which seven were contributed by three participating groups. In
sum, two groups contributed experiments that cover pre-computed rankings and fully dockerized
systems at LIVIVO and pre-computed recommendations at GESIS. The GESIS research team
contributed another completely dockerized recommendation system. Our experimental setup
is based on interleaving experiments that combine experimental results with those from the
corresponding baseline systems at LIVIVO and GESIS. In accordance with the living lab paradigm,
our evaluations are based on user interactions, i.e. in the form of click feedback.</p>
      <p>A key component of the underlying infrastructure is the integration of experimental ranking
and recommendation systems as micro-services that are implemented with the help of Docker.
The LiLAS lab was the first test-bed to use this evaluation service and it exemplified some of
the benefits resulting from the new infrastructure design. First of all, completely dockerized
systems can overcome the restrictions of results limited to filtered lists of top-k queries or target
items. Significantly more data and click interactions can be logged if the experimental systems
can deliver results on-the-fly for arbitrary requests of rankings and recommendations. As a
consequence, this allows much more data aggregation in a shorter period of time and provides
a solid basis for statistical significance tests.</p>
<p>Furthermore, the deployment effort for site providers and organizers is considerably reduced.
Once the systems are properly described with the corresponding Dockerfile, they can be rebuilt
on demand, exactly as the participants and developers intended them to be. Likewise, the
entire infrastructure service can be migrated with minimal costs due to Docker. However, we
hypothesize that one reason for the low participation might be the technical overhead for those
who were not already familiar with Docker. On the other hand, the development efforts pay off.
If the systems are properly adapted to the required interface and the source code is available
in a public repository, the (IR) research community can rely on these artifacts that make the
experiments transparent and reproducible.</p>
      <p>[Figure 10: cumulative Wins, Losses, and Ties plus the Outcome over the number of sessions
for livivo_rank_pyserini (both rounds), lemuren_elastic_only, and
lemuren_elastic_preprocessing.]</p>
<p>Thus, we address the reproducibility of these living lab experiments mostly from a
technological point of view, in the sense that we can repeat the experiments in the future with reduced
effort, since the participating systems are openly available and should be reconstructible with
the help of the corresponding Dockerfiles. Future work should investigate how feasible it is to
rely on the Dockerfiles for long-term preservation. Since experimental systems are rebuilt
each time with the help of the Dockerfile, updates of the underlying dependencies might be a
threat to reproducibility. An intuitive solution would be the integration of pre-built Docker
images that may preserve reproducibility for longer. Apart from the underlying technological aspects,
the reproducibility of the actual experimental results has to be investigated. Our experimental
setup would allow answering questions with regard to the reproducibility of the experimental
results over time and also across different domains (e.g. Life vs. Social Sciences).</p>
<p>Most of the evaluation measures are made for interleaving experiments, which also depend on
the results of the baseline system and not solely on those of an experimental system. We have
not yet investigated whether the experimental results follow a transitive relation: if the experimental
system A outperforms the baseline system B, denoted as A ≻ B, and the baseline system
B outperforms another experimental system C (B ≻ C), can we conclude that system A
would also outperform system C (A ≻ C)? As the evaluations showed, click results are
heavily biased towards the first ranks, and likewise they are context-dependent, i.e. they
depend on the entire result list; single click decisions have to be interpreted in relation
to neighboring and previously seen results, and further evaluations in these directions would
require counterfactual reasoning. Nonetheless, the second round illustrated how
our infrastructure service can be used for incremental developments and component-wise
analysis of experimental systems. The two experimental systems lemuren_elastic_only
and lemuren_elastic_preprocessing follow a similar approach and differ only in the
pre-processing component, which was shown not to be of any benefit.</p>
<p>In addition to established outcome measures of interleaving experiments (Win, Loss, Tie,
Outcome), we also account for the meaning of clicks on different SERP elements. In this context,
we implement the Reward measure, that is, the weighted sum of clicks on the different elements
corresponding to a specific result. Even though most of the experimental systems could not
outperform the baseline systems in terms of the overall scores, we see some clear differences
between the system performances, which allow us to assess a system’s merits more thoroughly
when the evaluations are based on different SERP elements.</p>
<p>Overall, we consider our lab a successful advancement of previous living lab experiments.
We were able to exemplify the benefits of fully dockerized systems delivering results for arbitrary
requests on-the-fly. Furthermore, we could confirm several previous findings, for instance the
power laws underlying the click distributions. Additionally, we were able to conduct more
diverse comparisons by differentiating between clicks on different SERP elements and accounting
for their meaning. Unfortunately, we could not attract many participants, leaving some aspects
untested, e.g. how many systems/experiments can be run simultaneously considering the
limitations of the infrastructure design, hardware requirements, server load, and user traffic.
Likewise, no experimental ranking system could outperform the baseline system. In the future,
it might be helpful to provide participants with open and more transparent baseline systems
they can build upon. Some of the pre-computed experimental rankings and recommendations
seem to deliver promising results; however, the evaluations need to be interpreted with care
due to the sparsity of the available click data. As a way out, we favor continuous evaluations
freed from the time limits of rounds, in order to re-frame the introduced living lab service
as an ongoing evaluation challenge. The corresponding source code can be retrieved from a
public GitHub project16, and we plan to release the aggregated session data as a curated research
dataset.</p>
      <p>Acknowledgments
This work was supported by DFG (project no. 407518790).</p>
      <p>E. SanJuan, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality,
Multimodality, and Interaction - 6th International Conference of the CLEF Association, CLEF
2015, Toulouse, France, September 8-11, 2015, Proceedings, volume 9283 of Lecture Notes in
Computer Science, Springer, 2015, pp. 484–496. doi:10.1007/978-3-319-24027-5_47.
[3] K. Balog, A. Schuth, P. Dekker, P. Schaer, N. Tavakolpoursaleh, P.-Y. Chuang, Overview
of the TREC 2016 Open Search track, in: Proceedings of the Twenty-Fifth Text REtrieval
Conference (TREC 2016). NIST, 2016.
[4] D. J. de Solla Price, Little Science, Big Science, Columbia University Press, New York, 1963.
[5] J. Schaible, T. Breuer, N. Tavakolpoursaleh, B. Müller, B. Wolf, P. Schaer, Evaluation
infrastructures for academic shared tasks, Datenbank-Spektrum 20 (2020) 29–36. doi:10.
1007/s13222-020-00335-x.
[6] F. Hopfgartner, K. Balog, A. Lommatzsch, L. Kelly, B. Kille, A. Schuth, M. Larson,
Continuous Evaluation of Large-Scale Information Access Systems: A Case for
Living Labs, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a
Changing World, volume 41, Springer International Publishing, Cham, 2019, pp. 511–543.
doi:10.1007/978-3-030-22948-1_21, series title: The Information Retrieval Series.
[7] E. M. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts,
I. Soboroff, L. L. Wang, TREC-COVID: constructing a pandemic information retrieval test
collection, CoRR abs/2005.04474 (2020). arXiv:2005.04474.
[8] W. Yang, K. Lu, P. Yang, J. Lin, Critically Examining the "Neural Hype": Weak
Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models, in:
Proceedings of the 42nd International ACM SIGIR Conference on Research and
Development in Information Retrieval - SIGIR’19, ACM Press, Paris, France, 2019, pp. 1129–1132.
doi:10.1145/3331184.3331340.
[9] T. G. Armstrong, A. Moffat, W. Webber, J. Zobel, Improvements that don’t add up: ad-hoc
retrieval results since 1998, in: Proceedings of the 18th ACM Conference on Information
and knowledge management, CIKM ’09, ACM, Hong Kong, China, 2009, pp. 601–610.
doi:10.1145/1645953.1646031.
[10] Z. Carevic, P. Schaer, On the connection between citation-based and topical relevance
ranking: Results of a pretest using iSearch, in: Proceedings of the First Workshop on
Bibliometric-enhanced Information Retrieval co-located with 36th European Conference
on Information Retrieval (ECIR 2014), Amsterdam, The Netherlands, April 13, 2014, volume
1143 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 37–44.
[11] P. Schaer, T. Breuer, L. J. Castro, B. Wolf, J. Schaible, N. Tavakolpoursaleh, Overview
of LiLAS 2021 - Living Labs for Academic Search, in: K. S. Candan, B. Ionescu, L. Goeuriot,
B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental
IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Twelfth
International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture
Notes in Computer Science, 2021.
[12] B. Müller, C. Poley, J. Pössel, A. Hagelstein, T. Gübitz, LIVIVO – the Vertical Search Engine
for Life Sciences, Datenbank-Spektrum 17 (2017) 29–34. URL: https://doi.org/10.1007/
s13222-016-0245-2. doi:10.1007/s13222-016-0245-2.
[13] D. Hienert, D. Kern, K. Boland, B. Zapilko, P. Mutschke, A digital library for research
data and related information in the social sciences, in: 19th ACM/IEEE Joint Conference
on Digital Libraries, JCDL 2019, Champaign, IL, USA, June 2-6, 2019, 2019, pp. 148–157.
doi:10.1109/JCDL.2019.00030.
[14] F. Radlinski, M. Kurup, T. Joachims, How does clickthrough data reflect retrieval quality?,
in: J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K. Choi,
A. Chowdhury (Eds.), Proceedings of the 17th ACM Conference on Information and
Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008,
ACM, 2008, pp. 43–52. doi:10.1145/1458082.1458092.
[15] K. Gingstad, Ø. Jekteberg, K. Balog, ArXivDigest: A living lab for personalized scientific
literature recommendation, in: M. d’Aquin, S. Dietze, C. Hauff, E. Curry, P. Cudré-Mauroux
(Eds.), CIKM ’20: The 29th ACM International Conference on Information and Knowledge
Management, Virtual Event, Ireland, October 19-23, 2020, ACM, 2020, pp. 3393–3396.
doi:10.1145/3340531.3417417.
[16] A. H. M. Tran, A. Kruf, J. Thos, C. Krah, M. Reiners, F. Ax, S. Brech, S. Gharib, V. Pawlas,
Ad-hoc retrieval of scientific documents on the LIVIVO search portal, in: G. Faggioli, N. Ferro,
A. Joly, M. Maistro, F. Piroi (Eds.), Working Notes of CLEF 2021 - Conference and Labs of
the Evaluation Forum, CEUR Workshop Proceedings, 2021.
[17] J. Keller, L. P. M. Munz, Tekma at CLEF-2021: BM25-based rankings for scientific publication
retrieval and data set recommendation, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi
(Eds.), Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR
Workshop Proceedings, 2021.
[18] N. Tavakolpoursaleh, J. Schaible, PyTerrier-based research data recommendations for
scientific articles in the social sciences, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi
(Eds.), Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR
Workshop Proceedings, 2021.
[19] J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, R. Nogueira, Pyserini: An easy-to-use Python
toolkit to support replicable IR research with sparse and dense representations, CoRR
abs/2102.10073 (2021). arXiv:2102.10073.
[20] R. Jagerman, K. Balog, M. de Rijke, OpenSearch: Lessons learned from an online evaluation
campaign, J. Data and Information Quality 10 (2018) 13:1–13:15.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <p>[Table residue: top-ranked documents and datasets with ranks and impression counts at GESIS and LIVIVO; the column alignment (rank, query string, per-title count) could not be recovered from the extraction. Document titles: SWLS Satisfaction with Life Scale; Nichtwähler der Bundestagswahl 2017; Die Nichtwähler: Politische Normalität oder wachsende Distanz zu den Parteien?; Sozialer Zusammenhalt in Deutschland 2017; Die Abwertung der Anderen: eine europäische Zustandsbeschreibung zu Intoleranz, Vorurteilen und Diskriminierung; The political participation of disabled people in Europe: rights, accessibility and activism; Trade union decline and what next: is Germany a special case?; ALLBUScompact - Kumulation 1980-2014 Variable Report; Medienkritikfähigkeit messbar machen: Analyse medienbezogener Fähigkeiten bei Eltern von 10- bis 15-Jährigen; Substanzkonsum in der Allgemeinbevölkerung in Deutschland: Ergebnisse des Epidemiologischen Suchtsurveys 2015. Dataset titles (GESIS): Nichtwähler in Deutschland 2005 &amp; 2009 Non-voters in Germany 2005 &amp; 2009; Vertrauen in Staat und Gesellschaft während der Corona-Krise (April 2020); EUSI: Datenbank zum Europäischen System Sozialer Indikatoren, 1950-2013; Landtagswahl in Bayern 2018; Satisfaction with Life Scale (CAPS-LIFESAT module); Allgemeine Bevölkerungsumfrage der Sozialwissenschaften ALLBUScompact - Kumulation 1980-2014; Mannheimer Corona-Studie; Naturbewusstsein 2015; Soziales Nachhaltigkeitsbarometer der Energiewende; Transitions and Old Age Potential: Übergänge und Alternspotenziale (TOP) - 1. und 2. Welle.]</p>
      <p>[Figure 15: Sessions and Impressions at LIVIVO (livivo_base) and … (caption truncated in extraction; chart not recoverable). The x-axis showed daily values from 2021-03-01 to 2021-05-24; tick labels omitted here.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          , L. Ramming, NewsREEL Multimedia at MediaEval 2018:
          <article-title>News recommendation with image and text content</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2018 Workshop</source>
          , CEUR-WS,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <article-title>Overview of the living labs for information retrieval evaluation (LL4IR) CLEF lab 2015</article-title>
          , in: J.
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Pinel-Sauvagnat</surname>
            ,
            <given-names>G. J. F.</given-names>
          </string-name>
          <string-name>
            <surname>Jones</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>