Overview of the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019)

Ryan Clancy,1 Nicola Ferro,2 Claudia Hauff,3 Jimmy Lin,1 Tetsuya Sakai,4 Ze Zhong Wu1
1 University of Waterloo   2 University of Padua   3 TU Delft   4 Waseda University


ABSTRACT
The Open-Source IR Replicability Challenge (OSIRRC 2019), organized as a workshop at SIGIR 2019, aims to improve the replicability of ad hoc retrieval experiments in information retrieval by gathering a community of researchers to jointly develop a common Docker specification and build Docker images that encapsulate a diversity of systems and retrieval models. We articulate the goals of this workshop and describe the "jig" that encodes the Docker specification. In total, 13 teams from around the world submitted 17 images, most of which were designed to produce retrieval runs for the TREC 2004 Robust Track test collection. This exercise demonstrates the feasibility of orchestrating large, community-based replication experiments with Docker technology. We envision OSIRRC becoming an ongoing community-wide effort to ensure experimental replicability and sustained progress on standard test collections.

1 INTRODUCTION
The importance of repeatability, replicability, and reproducibility is broadly recognized in the computational sciences, both in supporting desirable scientific methodology as well as sustaining empirical progress. The Open-Source IR Replicability Challenge (OSIRRC 2019), organized as a workshop at SIGIR 2019, aims to improve the replicability of ad hoc retrieval experiments in information retrieval by building community consensus around a common technical specification, with reference implementations. This overview paper is an extended version of an abstract that appears in the SIGIR proceedings.

In order to precisely articulate the goals of this workshop, it is first necessary to establish common terminology. We use the above terms in the same manner as recent ACM guidelines pertaining to artifact review and badging:1
• Repeatability (same team, same experimental setup): a researcher can reliably repeat her own computation.
• Replicability (different team, same experimental setup): an independent group can obtain the same result using the authors' own artifacts.
• Reproducibility (different team, different experimental setup): an independent group can obtain the same result using artifacts which they develop completely independently.
This workshop tackles the replicability challenge for ad hoc document retrieval, with three explicit goals:
(1) Develop a common Docker specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections. The solution that we have developed is known as "the jig".
(2) Build a curated library of Docker images that work with the jig to capture a diversity of systems and retrieval models.
(3) Explore the possibility of broadening our efforts to include additional tasks, diverse evaluation methodologies, and other benchmarking initiatives.
Trivially, by supporting replicability, our proposed solution enables repeatability as well (which, as a recent case study has shown [14], is not as easy as one might imagine). It is not our goal to directly address reproducibility, although we do see our efforts as an important stepping stone.

We hope that the fruits of this workshop can fuel empirical progress in ad hoc retrieval by providing competitive baselines that are easily replicable. The "prototypical" research paper of this mold proposes an innovation and demonstrates its value by comparing against one or more baselines. The often-cited meta-analysis of Armstrong et al. [2] from a decade ago showed that researchers compare against weak baselines, and a recent study by Yang et al. [13] revealed that, a decade later, the situation has not improved much—researchers are still comparing against weak baselines. Lin [9] discussed social aspects of why this persists, but there are genuine technical barriers as well. The growing complexity of modern retrieval techniques, especially neural models that are sensitive to hyperparameters and other details of the training regime, poses challenges for researchers who wish to demonstrate that their proposed innovation improves upon a particular method. Solutions that address replicability facilitate in-depth comparisons between existing and proposed approaches, potentially leading to more insightful analyses and accelerating advances.

Overall, we are pleased with progress towards the first two goals of the workshop. A total of 17 Docker images, involving 13 different teams from around the world, were submitted for evaluation, comprising the OSIRRC 2019 "image library". These images collectively generated 49 replicable runs for the TREC 2004 Robust Track test collection, 12 replicable runs for the TREC 2017 Common Core Track test collection, and 19 replicable runs for the TREC 2018 Common Core Track test collection. With respect to the third goal, this paper offers our future vision—but its broader adoption by the community at large remains to be seen.

1 https://www.acm.org/publications/policies/artifact-review-badging

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.
2 BACKGROUND
There has been much discussion about reproducibility in the sciences, with most scientists agreeing that the situation can be characterized as a crisis [3]. We lack the space to provide a comprehensive review of relevant literature in the medical, natural, and behavioral sciences. Within the computational sciences, to which at least a large portion of information retrieval research belongs, there have been many studies and proposed solutions, for example, a recent Dagstuhl seminar [7]. Here, we focus on summarizing the immediate predecessor of this workshop.

Our workshop was conceived as the next iteration of the Open-Source IR Reproducibility Challenge (OSIRRC), organized as part of the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) [1]. This event in turn traces its roots back to a series of workshops focused on open-source IR systems, widely understood as an important component of reproducibility. The Open-Source IR Reproducibility Challenge2 brought together developers of open-source search engines to provide replicable baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can replicate the submitted runs. Developers from seven different systems contributed to the evaluation, which was conducted on the GOV2 collection. The details of their experience are captured in an ECIR 2016 paper [10].
2 Note that the exercise is more accurately characterized as replicability and not reproducibility; the event predated ACM's standardization of terminology.

In OSIRRC 2019, we aim to address two shortcomings of the previous exercise as a concrete step in moving the field forward. From the technical perspective, the RIGOR 2015 participants developed scripts in a shared VM environment, and while this was sufficient to support cross-system comparisons at the time, the scripts were not sufficiently constrained, and the entire setup suffered from portability and isolation issues. Thus, it would have been difficult for others to reuse the infrastructure to replicate the results—in other words, the replicability experiments themselves were difficult to replicate. We believe that Docker, which is a popular standard for containerization, offers a potential solution to these technical challenges.

Another limitation of the previous exercise was its focus on "bag of words" baselines, and while some participants did submit systems that exploited richer models (e.g., term dependence models and pseudo-relevance feedback), there was insufficient diversity in the retrieval models examined. Primarily due to these issues, the exercise has received less follow-up and uptake than the organizers had originally hoped.

3 DOCKER AND "THE JIG"
From a technical perspective, our efforts are built around Docker, a widely-adopted Linux-centric technology for delivering software in lightweight packages called containers. The Docker Engine hosts one or more of these containers on physical machines and manages their lifecycle. One key feature of Docker is that all containers run on a single operating system kernel; isolation is handled by Linux kernel features such as cgroups and kernel namespaces. This makes containers far more lightweight than virtual machines, and hence easier to manipulate. Containers are created from images, which are typically built by importing base images (for example, capturing a specific software distribution) and then overlaying custom code. The images themselves can be manipulated, combined, and modified as first-class citizens in a broad ecosystem. For example, a group can overlay several existing images from public sources, add in its own code, and in turn publish the resulting image to be further used by others.

3.1 General Design
As defined by the Merriam-Webster dictionary, a jig is "a device used to maintain mechanically the correct positional relationship between a piece of work and the tool or between parts of work during assembly". The central activity of this workshop revolved around the co-design and co-implementation of a jig and Docker images that work with the jig for ad hoc retrieval. Of course, in our context, the relationship is computational instead of mechanical.

Shortly after the acceptance of the workshop proposal at SIGIR 2019, we issued a call for participants who were interested in contributing Docker images to our effort; the jig was designed with the input of these participants. In other words, the jig and the images co-evolved with feedback from members of the community. The code of the jig is open source and available on GitHub.3
3 https://github.com/osirrc/jig

Our central idea is that each image would expose a number of "hooks", each corresponding to a point in the prototypical lifecycle of an ad hoc retrieval experiment: for example, indexing a collection, running a batch of queries, etc. These hooks then tie into code that captures whatever retrieval model a particular researcher wishes to package in the image—for example, a search engine implemented in Java or C++. The jig is responsible for triggering the hooks in each image in a particular sequence according to a predefined lifecycle model, e.g., first index the collection, then run a batch of queries, and finally evaluate the results. We have further built tooling that applies the jig to multiple images to facilitate large-scale experiments. More details about the jig are provided in the next section, but first we overview a few design decisions.
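As a concrete illustration of this lifecycle, the following minimal sketch shows how a jig-like orchestrator could drive an image through the prepare phase with the Docker SDK for Python. It is not the actual jig code: the hook script names and locations, the example image tag, and the use of a long-running sleep command to keep the container alive are illustrative assumptions; the real contract is described in Section 3.2 and in the jig repository.

# Illustrative sketch of a jig-like orchestrator (not the actual jig implementation).
# Assumptions: hook scripts named "init" and "index" at the image root, and a
# placeholder image name; neither is part of the official OSIRRC specification.
import docker

client = docker.from_env()

# Start the container and keep it alive so hooks can be triggered in sequence.
container = client.containers.run(
    "osirrc2019/example-image:latest",  # placeholder image name
    command="sleep infinity",
    detach=True,
)

# Trigger the hooks in the order prescribed by the lifecycle model.
for hook in ["init", "index"]:
    exit_code, output = container.exec_run(f"sh /{hook}")
    print(f"{hook} hook exited with code {exit_code}")

# Snapshot the indexed state so that later retrieval runs can reuse the index.
container.commit(repository="osirrc2019/example-image", tag="indexed")
container.stop()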
Quite deliberately, the current jig does not make any demands about the transparency of a particular image. For example, the search hook can run an executable whose source code is not publicly available. Such an image, while demonstrating replicability, would not allow other researchers to inspect the inner workings of a particular retrieval method. While such images are not forbidden in our design, they are obviously less desirable than images based on open code. In practice, however, we anticipate that most images will be based on open-source code.

One technical design choice that we have grappled with is how to get data "into" and "out of" a container. To be more concrete, for ad hoc retrieval the container needs access to the document collection and the topics. The jig also needs to be able to obtain the run files generated by the image for evaluation. Generically, there are three options for feeding data to an image: first, the data can be part of the image itself; second, the data can be fetched from a remote location by the image (e.g., via curl, wget, or some other network transfer mechanism); third, the jig could mount an external data directory that the container has access to. The first two approaches are problematic for our use case: images need to be shareable, or resources need to be placed at a publicly-accessible location online. This is not permissible for document collections where researchers are required to sign license agreements before use. Furthermore, neither approach allows the possibility of testing on blind held-out data. We ultimately opted for the third approach: the jig mounts a (read-only) data directory that makes the document collection available at a known location, as part of the contract between the jig and the image (and similarly for topics). A separate directory that is writable serves as the mechanism for the jig to gather output runs from the image for evaluation. This method makes it possible for images to be tested on blind held-out documents and topics, as long as the formats have been agreed to in advance.
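The sketch below illustrates this third option with the Docker SDK for Python: the document collection is mounted read-only at a known location inside the container, and a writable directory is mounted at /output for the image to hand run files back to the host. The host-side paths and the image name are hypothetical; only the general shape of the contract is meant to be conveyed.

# Sketch of the mounting approach (hypothetical host paths and image name).
import docker

client = docker.from_env()

volumes = {
    # Document collection: visible inside the container, but read-only.
    "/store/collections/robust04": {"bind": "/input/collections/robust04", "mode": "ro"},
    # Writable directory through which the image returns run files to the host.
    "/store/output/run1": {"bind": "/output", "mode": "rw"},
}

container = client.containers.run(
    "osirrc2019/example-image:latest",  # placeholder image name
    command="sleep infinity",
    volumes=volumes,
    detach=True,
)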
Finally, any evaluation exercise needs to define the test collection. We decided to focus on newswire test collections because their smaller sizes support a shorter iteration and debug cycle (compared to, for example, larger web collections). In particular, we asked participants to focus on the TREC 2004 Robust Track test collection, in part because of its long history: a recent large-scale literature meta-analysis comprising over one hundred papers [13] provides a rich context to support historical comparisons.

Participants were also asked to accommodate the following two (more recent) test collections if time allowed:
• TREC 2017 Common Core Track, on the New York Times Annotated Corpus.
• TREC 2018 Common Core Track, on the Washington Post Corpus.
Finally, a "reach" goal was to support existing web test collections (e.g., GOV2 and ClueWeb). Although a few submitted images do support one or more of these collections, no formal evaluation was conducted on them.

3.2 Implementation Details
In this section we provide a more detailed technical description of the jig. Note, however, that the jig is continuously evolving as we gather more image contributions and learn about our design shortcomings. We invite interested readers to consult our code repository for the latest details and design revisions. To be clear, we describe v0.1.1 of the jig, which was deployed for the evaluation.

The jig is implemented in Python and communicates with the Docker Engine via the Docker SDK for Python.4 In the current specification, each hook corresponds to a script in the image that has a specific name, resides at a fixed location, and obeys a specified contract dictating its behavior. Each script can invoke its own interpreter: common implementations include bash and Python. Thus, via these scripts, the image has freedom to invoke arbitrary code. In the common case, the hooks invoke features of an existing open-source search engine packaged in the image.
4 https://docker-py.readthedocs.io/en/stable/

From the perspective of a user who is attempting to replicate results using an image, two commands are available: one for preparation (the prepare phase) and another for actually performing the ad hoc retrieval run (the search phase). The jig handles the execution lifecycle, from downloading the image to evaluating run files using trec_eval. This is shown in Figure 1 as a timeline of the canonical lifecycle, with the jig on the left and the Docker image on the right. The two phases are described in detail below.

[Figure 1: Interactions between the jig and the Docker image in the canonical evaluation flow. During the prepare phase, the jig starts the image, triggers the init and index hooks, and creates a snapshot; during the search phase, it triggers the search hook against the snapshot, collects the run files, and evaluates them with trec_eval.]

During the prepare phase, the user issues a command specifying an image's repository (i.e., name) and tag (i.e., version) along with a list of collections to index. As part of the contract between the jig and an image, the jig mounts the document collections and makes them readable by the image for indexing (see discussion in the previous section). The jig triggers two hooks in the image:
• First, the init hook is executed. This hook is meant for actions such as downloading artifacts, cloning repositories and compiling source code, or downloading external resources (e.g., a knowledge graph). Alternatively, these steps can be encoded directly in the image itself, thus making init a no-op. These two mechanisms largely lead to the same end result, and so it is mostly a matter of preference for the image developer.
• Next, the index hook is executed. The jig passes in a JSON string containing information such as the collection name, path, format, etc. required for indexing. The image manages its own index, which is not directly visible to the jig.
After indexing has completed, the jig takes a snapshot of the image via a Docker commit. This is useful because indexing generally takes longer than a retrieval run, and this design allows multiple runs (at different times) to be performed using the same index.
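To give a sense of the image side of this contract, the following sketch shows the general shape of an index hook: a short script that parses the JSON string passed by the jig and invokes whatever indexer the image packages. The JSON field names and the indexing command are hypothetical; each image defines its own internals.

#!/usr/bin/env python3
# Hypothetical index hook inside an image (illustrative only; the JSON field
# names and the indexer command line are assumptions, not the specification).
import json
import subprocess
import sys

def main():
    # The jig passes a single JSON string describing what to index.
    args = json.loads(sys.argv[1])
    collection = args["collection"]  # e.g., name, path, and format of the collection

    # Invoke whatever indexer the image packages; this command is a placeholder.
    subprocess.run(
        [
            "/opt/engine/bin/index",
            "--input", collection["path"],
            "--format", collection["format"],
            "--output", "/index/" + collection["name"],
        ],
        check=True,
    )

if __name__ == "__main__":
    main()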
During the search phase, the user issues a command specifying an image's repository (i.e., name) and tag (i.e., version), the collection to search, and a number of auxiliary parameters such as the topics file, qrels file, and output directory. This triggers the search hook, which is meant to perform the actual ad hoc retrieval runs, after which the jig evaluates the output with trec_eval. Just as in the index hook, relevant parameters are encoded in JSON. The image places run files in the /output directory, which is mapped back to the host; this allows the jig to retrieve the run files for evaluation.
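On the jig side, evaluation then amounts to running trec_eval over every run file that the image wrote to the mounted output directory. The sketch below shows one way to do this; the host-side paths are hypothetical, and the selected measures correspond to the metrics reported in Section 4.

# Sketch of the evaluation step: score every run file found in the host-side
# directory backing the /output mount (paths are hypothetical).
import pathlib
import subprocess

output_dir = pathlib.Path("/store/output/run1")
qrels = "/store/qrels/qrels.robust04.txt"

for run_file in sorted(output_dir.iterdir()):
    result = subprocess.run(
        ["trec_eval", "-m", "map", "-m", "P.30", "-m", "ndcg_cut.20",
         qrels, str(run_file)],
        capture_output=True, text=True, check=True,
    )
    print(f"== {run_file.name} ==")
    print(result.stdout)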
In addition to the two main hooks for ad hoc retrieval experiments, the jig also supports additional hooks for added functionality (not shown in Figure 1). The first of these is the interact hook, which allows a user to interactively explore an image in the state that has been captured via a snapshot, after the execution of the index hook. This allows, for example, the user to "enter" an interactive shell in the image (via standard Docker commands) and explore the inner workings of the image. The hook also allows users to interact with services that a container may choose to expose, such as an interactive search interface or even Jupyter notebooks. With the interact hook, the container is kept alive in the foreground, unlike the other hooks, which exit immediately once execution has finished.

Finally, images may also implement a train hook, enabling an image to train a retrieval model, tune hyper-parameters, etc. after the index hook has been executed. The train hook allows the user to specify training and test splits for a set of topics, along with a model directory for storing the model. This output directory is mapped back to the host and can be passed to the search hook for use during retrieval. Currently, training is limited to the CPU, although progress has been made to support GPU-based training.
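To illustrate the intended shape of the train hook, the sketch below reads a hypothetical JSON argument naming topic splits and a model directory, stands in for a hyper-parameter sweep, and persists the selected configuration to the model directory so that the search hook can pick it up later. All field names and values are assumptions for illustration, not part of the jig specification.

#!/usr/bin/env python3
# Hypothetical train hook (illustrative only; field names and values are assumptions).
import json
import sys
from pathlib import Path

def main():
    args = json.loads(sys.argv[1])
    train_topics = args["train_topics"]            # topics to train/tune on
    validation_topics = args["validation_topics"]  # topics held out for model selection
    model_dir = Path(args["model_dir"])            # mapped back to the host by the jig
    model_dir.mkdir(parents=True, exist_ok=True)

    # An image would train its retrieval model here, e.g., sweep hyper-parameters
    # on train_topics and select the best setting on validation_topics.
    selected = {"bm25_k1": 0.9, "bm25_b": 0.4}  # stand-in for the tuned values

    # Persist whatever the search hook needs to reproduce the tuned configuration.
    (model_dir / "model.json").write_text(json.dumps(selected))

if __name__ == "__main__":
    main()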
In the current design, the jig runs one image at a time, but additional tooling around the jig includes a script that further automates all interactions with an image so that experiments can be run end to end with minimal human supervision. This script creates a virtual machine in the cloud (currently, Microsoft Azure), installs the Docker engine and associated dependencies, and then runs the image using the jig. All output is then captured for archival purposes.

4 SUBMITTED IMAGES AND RESULTS
Although we envision OSIRRC to be an ongoing effort, the reality of a physical SIGIR workshop meant that it was necessary to impose an arbitrary deadline at which to "freeze" image development. This occurred at the end of June, 2019. At that point in time, we had received 17 images from 13 different teams, listed alphabetically as follows:
• Anserini (University of Waterloo)
• Anserini-bm25prf (Waseda University)
• ATIRE (University of Otago)
• Birch (University of Waterloo)
• Elastirini (University of Waterloo)
• EntityRetrieval (Ryerson University)
• Galago (University of Massachusetts)
• ielab (University of Queensland)
• Indri (TU Delft)
• IRC-CENTRE2019 (Technische Hochschule Köln)
• JASS (University of Otago)
• JASSv2 (University of Otago)
• NVSM (University of Padua)
• OldDog (Radboud University)
• PISA (New York University and RMIT University)
• Solrini (University of Waterloo)
• Terrier (TU Delft and University of Glasgow)
All except for two images were designed to replicate runs for the TREC 2004 Robust Track test collection, which was the primary target for the exercise. The EntityRetrieval image was designed to perform entity retrieval (as opposed to ad hoc retrieval). The IRC-CENTRE2019 image packages a submission to the CENTRE reproducibility effort,5 which targets a specific set of runs from the TREC 2017 Common Core Track. A number of images also support the Common Core Track test collections from TREC 2017 and 2018. Finally, a few images also provide support for the GOV2 and ClueWeb test collections, although these were not evaluated.
5 http://www.centre-eval.org/

Following the deadline for submitting images, the organizers ran all images "from scratch" with v0.1.1 of the jig and the latest release of each participant's image. Using our script (see Section 3.2), each image was executed sequentially on a virtual machine instance in the Microsoft Azure cloud. Note that it would have been possible to speed up the experiments by running the images in parallel, each on its own virtual machine instance, but this was not done. We used the instance type Standard_D64s_v3, which according to Azure documentation is based on either the 2.4 GHz Intel Xeon E5-2673 v3 (Haswell) processor or the 2.3 GHz Intel Xeon E5-2673 v4 (Broadwell) processor. Since we have no direct control over the physical hardware, it is only meaningful to compare efficiency (i.e., performance metrics such as query latency) across different images running on the same virtual machine instance. Nevertheless, our evaluations focused solely on retrieval effectiveness. This is a shortcoming, since a number of images packaged search engines that emphasize query evaluation efficiency.

The results of running the jig on the submitted images comprise the "official" OSIRRC 2019 image library and are available on GitHub.6 We have captured all log output, run files, as well as trec_eval output. These results are summarized below.
6 https://github.com/osirrc/osirrc2019-library

For the TREC 2004 Robust Track test collection, 13 images generated a total of 49 runs, the results of which are shown in Table 1; the specific version of each image is noted. Effectiveness is measured using standard ranked retrieval metrics: average precision (AP), precision at rank cutoff 30 (P30), and NDCG at rank cutoff 20 (NDCG@20). The table does not include runs from the following images: Solrini and Elastirini (which are identical to Anserini runs), EntityRetrieval (where relevance judgments are not available since it was designed for a different task), and IRC-CENTRE2019 (which was not designed to produce results for this test collection).

As the primary goal of this workshop is to build community, infrastructure, and consensus, we deliberately attempt to minimize direct comparisons of run effectiveness in the presentation: runs are grouped by image, and the images themselves are sorted alphabetically. Nevertheless, a few important caveats are necessary for proper interpretation of the results: most runs perform no parameter tuning, although at least one implicitly encodes cross-validation results (e.g., Birch). Also, runs might use different parts of the complete topic: the "title", "description", and "narrative" (as well as various combinations). For details, we invite the reader to consult the overview paper by each participating team.

We see that the submitted images generate runs that use a diverse set of retrieval models, including query expansion and pseudo-relevance feedback (Anserini, Anserini-bm25prf, Indri, Terrier), term proximity (Indri and Terrier), conjunctive query processing (OldDog), and neural ranking models (Birch and NVSM). Several images package open-source search engines that are primarily focused on efficiency (ATIRE, JASS, JASSv2, PISA). Although we concede that there is an under-representation of neural approaches, relative to the amount of interest in the community at present, there are undoubtedly replication challenges with neural ranking models, particularly with their training regimes. Nevertheless, we are pleased with the range of systems and retrieval models that are represented in these images.
Results from the TREC 2017 Common Core Track test collection are shown in Table 2. On this test collection, we have 12 runs from 6 images. Results from the TREC 2018 Common Core Track test collection are shown in Table 3: there are 19 runs from 4 images.

5 FUTURE VISION AND ONGOING WORK
Our efforts complement other concurrent activities in the community. SIGIR has established a task force to implement ACM's policy on artifact review and badging [5], and our efforts can be viewed as a technical feasibility study. This workshop also complements the recent CENTRE evaluation tasks jointly run at CLEF, NTCIR, and TREC [6, 11]. One of the goals of CENTRE is to define appropriate measures to determine whether and to what extent replicability and reproducibility have been achieved, while our efforts focus on how these properties can be demonstrated technically. Thus, the jig can provide the means to achieve CENTRE goals. Given fortuitous alignment in schedules, participants of CENTRE@CLEF2019 [4] were encouraged to participate in our workshop, and this in fact led to the contribution of the IRC-CENTRE2019 image.

From the technical perspective, we see two major shortcomings of the current jig implementation. First, the training hook is not as well-developed as we would have liked. Second, the jig lacks GPU support. Both will be remedied in a future iteration.

We have proposed and prototyped a technical solution to the replicability challenge specifically for the SIGIR community, but the changes we envision will not occur without a corresponding cultural shift. Sustained, cumulative empirical progress will only be made if researchers use our tools in their evaluations, and this will only be possible if images for the comparison conditions are available. This means that the community needs to adopt the norm of associating research papers with source code for replicating results in those papers. However, as Voorhees et al. [12] reported, having a link to a repository in a paper is far from sufficient. The jig provides the tools to package ad hoc retrieval experiments in a standard way, but these tools are useless without broad adoption. The incentive structures of academic publishing need to adapt to encourage such behavior, but unfortunately this is beyond the scope of our workshop.

Given appropriate extensions, we believe that the jig can be augmented to accommodate a range of batch retrieval tasks. One important future direction is to add support for tasks beyond batch retrieval, for example, to support interactive retrieval (with real or simulated user input) and evaluations on private and other sensitive data. Moreover, our effort represents a first systematic attempt to embody the Evaluation-as-a-Service paradigm [8] via Docker containers. We believe that there are many possible paths forward building on the ideas presented here.

Finally, we view our efforts as a stepping stone toward reproducibility, and beyond that, generalizability. While these two important desiderata are not explicit goals of our workshop, we note that the jig itself can provide the technical vehicle for delivering reproducibility and generalizability. Some researchers would want to package their own results in a Docker image. However, there is nothing that would prevent researchers from reproducing another team's results, which are then captured in a Docker image conforming to our specifications. This would demonstrate reproducibility as well as replicability of those reproducibility efforts. The jig also supports mechanisms for evaluations on document collections and information needs beyond those that an image was originally designed for. This aligns with intuitive notions of what it means for a technique to be generalizable.

Overall, we believe that our efforts have moved the field of information retrieval forward, both in terms of supporting "good science" and in terms of sustained, cumulative empirical progress. This work shows that it is indeed possible to coordinate a large, community-wide replication exercise in ad hoc retrieval, and that Docker provides a workable foundation for a common interface and lifecycle specification. We invite the broader community to join our efforts!

6 ACKNOWLEDGEMENTS
We would like to thank all the participants who contributed Docker images to the workshop. This exercise would not have been possible without their efforts. Additional thanks to Microsoft for providing credits on the Azure cloud.

REFERENCES
[1] Jaime Arguello, Matt Crane, Fernando Diaz, Jimmy Lin, and Andrew Trotman. 2015. Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR). SIGIR Forum 49, 2 (2015), 107–116.
[2] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998. In Proceedings of the 18th International Conference on Information and Knowledge Management (CIKM 2009). Hong Kong, China, 601–610.
[3] Monya Baker. 2016. Is There a Reproducibility Crisis? Nature 533 (2016), 452–454.
[4] Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. 2019. Overview of CENTRE@CLEF 2019: Sequel in the Systematic Reproducibility Realm. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Lugano, Switzerland.
[5] Nicola Ferro and Diane Kelly. 2018. SIGIR Initiative to Implement ACM Artifact Review and Badging. SIGIR Forum 52, 1 (2018), 4–10.
[6] Nicola Ferro, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. 2018. Overview of CENTRE@CLEF 2018: A First Tale in the Systematic Reproducibility Realm. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). Avignon, France, 239–246.
[7] Juliana Freire, Norbert Fuhr, and Andreas Rauber (Eds.). 2016. Report from Dagstuhl Seminar 16041: Reproducibility of Data-Oriented Experiments in e-Science. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Germany.
[8] Frank Hopfgartner, Allan Hanbury, Henning Müller, Ivan Eggel, Krisztian Balog, Torben Brodt, Gordon V. Cormack, Jimmy Lin, Jayashree Kalpathy-Cramer, Noriko Kando, Makoto P. Kato, Anastasia Krithara, Tim Gollub, Martin Potthast, Evelyne Viega, and Simon Mercer. 2018. Evaluation-as-a-Service for the Computational Sciences: Overview and Outlook. ACM Journal of Data and Information Quality (JDIQ) 10, 4 (November 2018), 15:1–15:32.
[9] Jimmy Lin. 2018. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 52, 2 (2018), 40–51.
[10] Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016). Padua, Italy, 408–420.
[11] Tetsuya Sakai, Nicola Ferro, Ian Soboroff, Zhaohao Zeng, Peng Xiao, and Maria Maistro. 2019. Overview of the NTCIR-14 CENTRE Task. In Proceedings of the 14th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo, Japan.
[12] Ellen M. Voorhees, Shahzad Rajput, and Ian Soboroff. 2016. Promoting Repeatability Through Open Runs. In Proceedings of the 7th International Workshop on Evaluating Information Access (EVIA 2016). Tokyo, Japan, 17–20.
[13] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019). Paris, France.
[14] Ruifan Yu, Yuhao Xie, and Jimmy Lin. 2018. H2oloo at TREC 2018: Cross-Collection Relevance Transfer for the Common Core Track. In Proceedings of the Twenty-Seventh Text REtrieval Conference (TREC 2018). Gaithersburg, Maryland.
Image                Version   Run                                    AP       P30      NDCG@20
Anserini             v0.1.1    bm25                                   0.2531   0.3102   0.4240
Anserini             v0.1.1    bm25.rm3                               0.2903   0.3365   0.4407
Anserini             v0.1.1    bm25.ax                                0.2895   0.3333   0.4357
Anserini             v0.1.1    ql                                     0.2467   0.3079   0.4113
Anserini             v0.1.1    ql.rm3                                 0.2747   0.3232   0.4269
Anserini             v0.1.1    ql.ax                                  0.2774   0.3229   0.4223
Anserini-bm25prf     v0.2.2    b=0.20_bm25_bm25prf                    0.2916   0.3396   0.4419
Anserini-bm25prf     v0.2.2    b=0.40_bm25_bm25prf                    0.2928   0.3438   0.4418
ATIRE                v0.1.1    ANT_r4_100_percent.BM25+.s-stem.RF     0.2184   0.3199   0.4211
Birch                v0.1.0    mb_2cv.cv.a                            0.3241   0.3756   0.4722
Birch                v0.1.0    mb_2cv.cv.ab                           0.3240   0.3756   0.4720
Birch                v0.1.0    mb_2cv.cv.abc                          0.3244   0.3767   0.4738
Birch                v0.1.0    mb_5cv.cv.a                            0.3266   0.3783   0.4769
Birch                v0.1.0    mb_5cv.cv.ab                           0.3278   0.3795   0.4817
Birch                v0.1.0    mb_5cv.cv.abc                          0.3278   0.3790   0.4831
Birch                v0.1.0    qa_2cv.cv.a                            0.3014   0.3507   0.4469
Birch                v0.1.0    qa_2cv.cv.ab                           0.3003   0.3494   0.4475
Birch                v0.1.0    qa_2cv.cv.abc                          0.3003   0.3494   0.4475
Birch                v0.1.0    qa_5cv.cv.a                            0.3102   0.3574   0.4628
Birch                v0.1.0    qa_5cv.cv.ab                           0.3090   0.3577   0.4615
Birch                v0.1.0    qa_5cv.cv.abc                          0.3090   0.3577   0.4614
Galago               v0.0.2    output_robust04                        0.1948   0.2659   0.3732
ielab                v0.0.1    robust04-1000                          0.1826   0.2605   0.3477
Indri                v0.2.1    bm25.title                             0.2338   0.2995   0.4041
Indri                v0.2.1    bm25.title.prf                         0.2563   0.3041   0.3995
Indri                v0.2.1    bm25.title+desc                        0.2702   0.3274   0.4517
Indri                v0.2.1    bm25.title+desc.prf.sd                 0.2971   0.3562   0.4448
Indri                v0.2.1    dir1000.title                          0.2499   0.3100   0.4201
Indri                v0.2.1    dir1000.title.sd                       0.2547   0.3146   0.4232
Indri                v0.2.1    dir1000.title.prf                      0.2812   0.3248   0.4276
Indri                v0.2.1    dir1000.title.prf.sd                   0.2855   0.3295   0.4298
Indri                v0.2.1    dir1000.desc                           0.2023   0.2581   0.3635
Indri                v0.2.1    jm0.5.title                            0.2242   0.2839   0.3689
JASS                 v0.1.1    JASS_r4_10_percent                     0.1984   0.2991   0.4055
JASSv2               v0.1.1    JASSv2_10                              0.1984   0.2991   0.4055
NVSM                 v0.1.0    robust04_test_topics_run               0.1415   0.2197   0.2757
OldDog               v1.0.0    bm25.robust04.con                      0.1736   0.2526   0.3619
OldDog               v1.0.0    bm25.robust04.dis                      0.2434   0.2985   0.4002
PISA                 v0.1.3    robust04-1000                          0.2534   0.3120   0.4221
Terrier              v0.1.7    bm25                                   0.2363   0.2977   0.4049
Terrier              v0.1.7    bm25_qe                                0.2762   0.3281   0.4332
Terrier              v0.1.7    bm25_prox                              0.2404   0.3033   0.4082
Terrier              v0.1.7    bm25_prox_qe                           0.2781   0.3288   0.4307
Terrier              v0.1.7    dph                                    0.2479   0.3129   0.4198
Terrier              v0.1.7    dph_qe                                 0.2821   0.3369   0.4425
Terrier              v0.1.7    dph_prox                               0.2501   0.3166   0.4206
Terrier              v0.1.7    dph_prox_qe                            0.2869   0.3376   0.4435
Terrier              v0.1.7    pl2                                    0.2241   0.2918   0.3948
Terrier              v0.1.7    pl2_qe                                 0.2538   0.3126   0.4163

                   Table 1: Results on the TREC 2004 Robust Track test collection.




Image                 Version   Run                      AP       P30      NDCG@20
Anserini              v0.1.1    bm25                     0.2087   0.4293   0.3877
Anserini              v0.1.1    bm25.rm3                 0.2823   0.5093   0.4467
Anserini              v0.1.1    bm25.ax                  0.2787   0.4980   0.4450
Anserini              v0.1.1    ql                       0.2032   0.4467   0.3958
Anserini              v0.1.1    ql.rm3                   0.2606   0.4827   0.4226
Anserini              v0.1.1    ql.ax                    0.2613   0.4953   0.4429
ATIRE                 v0.1.1    ANT_c17_100_percent      0.1436   0.4087   0.3742
IRC-CENTRE2019        v0.1.3    wcrobust04               0.2971   0.5613   0.5143
IRC-CENTRE2019        v0.1.3    wcrobust0405             0.3539   0.6347   0.5821
JASS                  v0.1.1    JASS_c17_10_percent      0.1415   0.4080   0.3711
JASSv2                v0.1.1    JASSv2_c17_10            0.1415   0.4080   0.3711
PISA                  v0.1.3    core17-1000              0.2078   0.4260   0.3898

       Table 2: Results on the TREC 2017 Common Core Track test collection.


           Image      Version   Run             AP       P30      NDCG@20
           Anserini   v0.1.1    bm25            0.2495   0.3567   0.4100
           Anserini   v0.1.1    bm25.ax         0.2920   0.4027   0.4342
           Anserini   v0.1.1    bm25.rm3        0.3136   0.4200   0.4604
           Anserini   v0.1.1    ql              0.2526   0.3653   0.4204
           Anserini   v0.1.1    ql.ax           0.2966   0.4060   0.4303
           Anserini   v0.1.1    ql.rm3          0.3073   0.4000   0.4366
           OldDog     v1.0.0    bm25.con        0.1802   0.3167   0.3650
           OldDog     v1.0.0    bm25.dis        0.2381   0.3313   0.3706
           PISA       v0.1.3    core18-1000     0.2384   0.3500   0.3927
           Terrier    v0.1.7    bm25            0.2326   0.3367   0.3800
           Terrier    v0.1.7    bm25_qe         0.2975   0.4040   0.4290
           Terrier    v0.1.7    bm25_prox       0.2369   0.3447   0.3954
           Terrier    v0.1.7    bm25_prox_qe    0.2960   0.4067   0.4318
           Terrier    v0.1.7    dph             0.2427   0.3633   0.4022
           Terrier    v0.1.7    dph_qe          0.3055   0.4153   0.4369
           Terrier    v0.1.7    dph_prox        0.2428   0.3673   0.4140
           Terrier    v0.1.7    dph_prox_qe     0.3035   0.4167   0.4462
           Terrier    v0.1.7    pl2             0.2225   0.3227   0.3636
           Terrier    v0.1.7    pl2_qe          0.2787   0.3933   0.3975

       Table 3: Results on the TREC 2018 Common Core Track test collection.



