=Paper=
{{Paper
|id=Vol-2409/invited01
|storemode=property
|title=Overview of the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019)
|pdfUrl=https://ceur-ws.org/Vol-2409/invited01.pdf
|volume=Vol-2409
|authors=Ryan Clancy,Nicola Ferro,Claudia Hauff,Jimmy Lin,Tetsuya Sakai,Ze Zhong Wu
|dblpUrl=https://dblp.org/rec/conf/sigir/Clancy0HLSW19a
}}
==Overview of the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019)==
Ryan Clancy (University of Waterloo), Nicola Ferro (University of Padua), Claudia Hauff (TU Delft), Jimmy Lin (University of Waterloo), Tetsuya Sakai (Waseda University), and Ze Zhong Wu (University of Waterloo)

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

ABSTRACT

The Open-Source IR Replicability Challenge (OSIRRC 2019), organized as a workshop at SIGIR 2019, aims to improve the replicability of ad hoc retrieval experiments in information retrieval by gathering a community of researchers to jointly develop a common Docker specification and build Docker images that encapsulate a diversity of systems and retrieval models. We articulate the goals of this workshop and describe the “jig” that encodes the Docker specification. In total, 13 teams from around the world submitted 17 images, most of which were designed to produce retrieval runs for the TREC 2004 Robust Track test collection. This exercise demonstrates the feasibility of orchestrating large, community-based replication experiments with Docker technology. We envision OSIRRC becoming an ongoing community-wide effort to ensure experimental replicability and sustained progress on standard test collections.

1 INTRODUCTION

The importance of repeatability, replicability, and reproducibility is broadly recognized in the computational sciences, both in supporting desirable scientific methodology as well as sustaining empirical progress. The Open-Source IR Replicability Challenge (OSIRRC 2019), organized as a workshop at SIGIR 2019, aims to improve the replicability of ad hoc retrieval experiments in information retrieval by building community consensus around a common technical specification, with reference implementations. This overview paper is an extended version of an abstract that appears in the SIGIR proceedings.

In order to precisely articulate the goals of this workshop, it is first necessary to establish common terminology. We use the above terms in the same manner as recent ACM guidelines pertaining to artifact review and badging (https://www.acm.org/publications/policies/artifact-review-badging):

• Repeatability (same team, same experimental setup): a researcher can reliably repeat her own computation.
• Replicability (different team, same experimental setup): an independent group can obtain the same result using the authors’ own artifacts.
• Reproducibility (different team, different experimental setup): an independent group can obtain the same result using artifacts which they develop completely independently.

This workshop tackles the replicability challenge for ad hoc document retrieval, with three explicit goals:

(1) Develop a common Docker specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections. The solution that we have developed is known as “the jig”.
(2) Build a curated library of Docker images that work with the jig to capture a diversity of systems and retrieval models.
(3) Explore the possibility of broadening our efforts to include additional tasks, diverse evaluation methodologies, and other benchmarking initiatives.
Trivially, by supporting replicability, our proposed solution enables repeatability as well (which, as a recent case study has shown [14], is not as easy as one might imagine). It is not our goal to directly address reproducibility, although we do see our efforts as an important stepping stone.

We hope that the fruits of this workshop can fuel empirical progress in ad hoc retrieval by providing competitive baselines that are easily replicable. The “prototypical” research paper of this mold proposes an innovation and demonstrates its value by comparing against one or more baselines. The often-cited meta-analysis of Armstrong et al. [2] from a decade ago showed that researchers compare against weak baselines, and a recent study by Yang et al. [13] revealed that, a decade later, the situation has not improved much—researchers are still comparing against weak baselines. Lin [9] discussed social aspects of why this persists, but there are genuine technical barriers as well. The growing complexity of modern retrieval techniques, especially neural models that are sensitive to hyperparameters and other details of the training regime, poses challenges for researchers who wish to demonstrate that their proposed innovation improves upon a particular method. Solutions that address replicability facilitate in-depth comparisons between existing and proposed approaches, potentially leading to more insightful analyses and accelerating advances.

Overall, we are pleased with progress towards the first two goals of the workshop. A total of 17 Docker images, involving 13 different teams from around the world, were submitted for evaluation, comprising the OSIRRC 2019 “image library”. These images collectively generated 49 replicable runs for the TREC 2004 Robust Track test collection, 12 replicable runs for the TREC 2017 Common Core Track test collection, and 19 replicable runs for the TREC 2018 Common Core Track test collection. With respect to the third goal, this paper offers our future vision—but its broader adoption by the community at large remains to be seen.

2 BACKGROUND

There has been much discussion about reproducibility in the sciences, with most scientists agreeing that the situation can be characterized as a crisis [3]. We lack the space to provide a comprehensive review of relevant literature in the medical, natural, and behavioral sciences. Within the computational sciences, to which at least a large portion of information retrieval research belongs, there have been many studies and proposed solutions, for example, a recent Dagstuhl seminar [7]. Here, we focus on summarizing the immediate predecessor of this workshop.

Our workshop was conceived as the next iteration of the Open-Source IR Reproducibility Challenge (OSIRRC), organized as part of the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) [1]. This event in turn traces its roots back to a series of workshops focused on open-source IR systems, which is widely understood as an important component of reproducibility.
The Open-Source IR Reproducibility Challenge (note that this exercise is more accurately characterized as replicability and not reproducibility; the event predated ACM’s standardization of terminology) brought together developers of open-source search engines to provide replicable baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can replicate the submitted runs. Developers from seven different systems contributed to the evaluation, which was conducted on the GOV2 collection. The details of their experience are captured in an ECIR 2016 paper [10].

In OSIRRC 2019, we aim to address two shortcomings of the previous exercise as a concrete step in moving the field forward. From the technical perspective, the RIGOR 2015 participants developed scripts in a shared VM environment, and while this was sufficient to support cross-system comparisons at the time, the scripts were not sufficiently constrained, and the entire setup suffered from portability and isolation issues. Thus, it would have been difficult for others to reuse the infrastructure to replicate the results—in other words, the replicability experiments themselves were difficult to replicate. We believe that Docker, which is a popular standard for containerization, offers a potential solution to these technical challenges.

Another limitation of the previous exercise was its focus on “bag of words” baselines, and while some participants did submit systems that exploited richer models (e.g., term dependence models and pseudo-relevance feedback), there was insufficient diversity in the retrieval models examined. Primarily due to these issues, the exercise has received less follow-up and uptake than the organizers had originally hoped.

3 DOCKER AND “THE JIG”

From a technical perspective, our efforts are built around Docker, a widely-adopted Linux-centric technology for delivering software in lightweight packages called containers. The Docker Engine hosts one or more of these containers on physical machines and manages their lifecycle. One key feature of Docker is that all containers run on a single operating system kernel; isolation is handled by Linux kernel features such as cgroups and kernel namespaces. This makes containers far more lightweight than virtual machines, and hence easier to manipulate. Containers are created from images, which are typically built by importing base images (for example, capturing a specific software distribution) and then overlaying custom code. The images themselves can be manipulated, combined, and modified as first-class citizens in a broad ecosystem. For example, a group can overlay several existing images from public sources, add in its own code, and in turn publish the resulting image to be further used by others.

3.1 General Design

As defined by the Merriam-Webster dictionary, a jig is “a device used to maintain mechanically the correct positional relationship between a piece of work and the tool or between parts of work during assembly”. The central activity of this workshop revolved around the co-design and co-implementation of a jig and Docker images that work with the jig for ad hoc retrieval. Of course, in our context, the relationship is computational instead of mechanical.

Shortly after the acceptance of the workshop proposal at SIGIR 2019, we issued a call for participants who were interested in contributing Docker images to our effort; the jig was designed with the input of these participants. In other words, the jig and the images co-evolved with feedback from members of the community. The code of the jig is open source and available on GitHub (https://github.com/osirrc/jig).

Our central idea is that each image exposes a number of “hooks” that correspond to points in the prototypical lifecycle of an ad hoc retrieval experiment: for example, indexing a collection, running a batch of queries, etc. These hooks then tie into code that captures whatever retrieval model a particular researcher wishes to package in the image—for example, a search engine implemented in Java or C++. The jig is responsible for triggering the hooks in each image in a particular sequence according to a predefined lifecycle model, e.g., first index the collection, then run a batch of queries, finally evaluate the results. We have further built tooling that applies the jig to multiple images to facilitate large-scale experiments. More details about the jig are provided in the next section, but first we overview a few design decisions.
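To make the hook idea concrete, the sketch below shows what an indexing hook packaged inside an image might look like. The script layout, the JSON argument format, and the my-search-engine command are illustrative assumptions only; the actual contract is the one defined by the jig specification.

<pre>
#!/usr/bin/env python3
# Illustrative sketch of an in-image indexing hook (hypothetical names and
# argument layout; the real contract is defined by the jig specification).
import json
import subprocess
import sys

def main():
    # The driver passes collection metadata as a JSON string
    # (e.g., collection name, path, and format).
    config = json.loads(sys.argv[1])
    name = config["name"]
    path = config["path"]

    # Hand off to whatever engine is packaged in the image; the CLI
    # invocation below is a stand-in for a real indexer.
    subprocess.run(
        ["my-search-engine", "index", "--input", path, "--index", "/index/" + name],
        check=True,
    )

if __name__ == "__main__":
    main()
</pre>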
Quite deliberately, the current jig does not make any demands about the transparency of a particular image. For example, the search hook can run an executable whose source code is not publicly available. Such an image, while demonstrating replicability, would not allow other researchers to inspect the inner workings of a particular retrieval method. While such images are not forbidden in our design, they are obviously less desirable than images based on open code. In practice, however, we anticipate that most images will be based on open-source code.

One technical design choice that we have grappled with is how to get data “into” and “out of” a container. To be more concrete, for ad hoc retrieval the container needs access to the document collection and the topics. The jig also needs to be able to obtain the run files generated by the image for evaluation. Generically, there are three options for feeding data to an image: first, the data can be part of the image itself; second, the data can be fetched from a remote location by the image (e.g., via curl, wget, or some other network transfer mechanism); third, the jig could mount an external data directory that the container has access to. The first two approaches are problematic for our use case: images need to be shareable, or resources need to be placed at a publicly-accessible location online. This is not permissible for document collections where researchers are required to sign license agreements before use. Furthermore, neither approach allows the possibility of testing on blind held-out data.

We ultimately opted for the third approach: the jig mounts a (read-only) data directory that makes the document collection available at a known location, as part of the contract between the jig and the image (and similarly for topics). A separate directory that is writable serves as the mechanism for the jig to gather output runs from the image for evaluation. This method makes it possible for images to be tested on blind held-out documents and topics, as long as the formats have been agreed to in advance.
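As a rough illustration of this mounting contract (and not the jig’s actual invocation), a driver built on the Docker SDK for Python could bind a read-only collection directory and a writable output directory into a container roughly as follows; the image name, host paths, and hook command are placeholders.

<pre>
# Sketch of the volume-mounting idea using the Docker SDK for Python;
# image name, paths, and the hook command are illustrative placeholders.
import docker

client = docker.from_env()
logs = client.containers.run(
    "osirrc/example-image:latest",                     # hypothetical image
    command=["/hooks/index", '{"name": "robust04"}'],  # hypothetical hook call
    volumes={
        "/data/robust04": {"bind": "/input/collection", "mode": "ro"},  # read-only collection
        "/tmp/osirrc-output": {"bind": "/output", "mode": "rw"},        # writable output directory
    },
    remove=True,
)
print(logs.decode("utf-8"))
</pre>

Mounting the collection read-only keeps licensed data out of the image itself, which is the property argued for in the paragraph above.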
Finally, any evaluation exercise needs to define the test collection. We decided to focus on newswire test collections because their smaller sizes support a shorter iteration and debug cycle (compared to, for example, larger web collections). In particular, we asked participants to focus on the TREC 2004 Robust Track test collection, in part because of its long history: a recent large-scale literature meta-analysis comprising over one hundred papers [13] provides a rich context to support historical comparisons.

Participants were also asked to accommodate the following two (more recent) test collections if time allowed:

• TREC 2017 Common Core Track, on the New York Times Annotated Corpus.
• TREC 2018 Common Core Track, on the Washington Post Corpus.

Finally, a “reach” goal was to support existing web test collections (e.g., GOV2 and ClueWeb). Although a few submitted images do support one or more of these collections, no formal evaluation was conducted on them.

[Figure 1: Interactions between the jig and the Docker image in the canonical evaluation flow.]

3.2 Implementation Details

In this section we provide a more detailed technical description of the jig. Note, however, that the jig is continuously evolving as we gather more image contributions and learn about our design shortcomings. We invite interested readers to consult our code repository for the latest details and design revisions. To be clear, we describe v0.1.1 of the jig, which was deployed for the evaluation.

The jig is implemented in Python and communicates with the Docker Engine via the Docker SDK for Python (https://docker-py.readthedocs.io/en/stable/). In the current specification, each hook corresponds to a script in the image that has a specific name, resides at a fixed location, and obeys a specified contract dictating its behavior. Each script can invoke its own interpreter: common implementations include bash and Python. Thus, via these scripts, the image has freedom to invoke arbitrary code. In the common case, the hooks invoke features of an existing open-source search engine packaged in the image.

From the perspective of a user who is attempting to replicate results using an image, two commands are available: one for preparation (the prepare phase) and another for actually performing the ad hoc retrieval run (the search phase). The jig handles the execution lifecycle, from downloading the image to evaluating run files using trec_eval. This is shown in Figure 1 as a timeline in the canonical lifecycle, with the jig on the left and the Docker image on the right. The two phases are described in detail below.
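The snippet below sketches one way a jig-like driver could sequence hook scripts (such as the initialization and indexing steps described next) inside a long-running container using the Docker SDK for Python; the keep-alive command, hook paths, and arguments are assumptions for illustration rather than the jig’s actual interface.

<pre>
# Sketch of sequencing hooks inside one container via the Docker SDK for Python.
# Hook paths, arguments, and the keep-alive command are illustrative assumptions.
import docker

client = docker.from_env()
container = client.containers.run(
    "osirrc/example-image:latest",  # hypothetical image
    command=["sleep", "infinity"],  # keep the container alive while hooks run
    detach=True,
)
try:
    for hook in (["/hooks/init"], ["/hooks/index", '{"name": "robust04"}']):
        exit_code, output = container.exec_run(hook)
        print(output.decode("utf-8"))
        if exit_code != 0:
            raise RuntimeError(f"hook {hook[0]} failed with exit code {exit_code}")
finally:
    container.remove(force=True)
</pre>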
During the prepare phase, the user issues a command specifying an image’s repository (i.e., name) and tag (i.e., version) along with a list of collections to index. As part of the contract between the jig and an image, the jig mounts the document collections and makes them readable by the image for indexing (see discussion in the previous section). The jig triggers two hooks in the image:

• First, the init hook is executed. This hook is meant for actions such as downloading artifacts, cloning repositories and compiling source code, or downloading external resources (e.g., a knowledge graph). Alternatively, these steps can be encoded directly in the image itself, thus making init a no-op. These two mechanisms largely lead to the same end result, and so it is mostly a matter of preference for the image developer.
• Next, the index hook is executed. The jig passes in a JSON string containing information such as the collection name, path, format, etc. required for indexing. The image manages its own index, which is not directly visible to the jig.

After indexing has completed, the jig takes a snapshot of the image via a Docker commit. This is useful as indexing generally takes longer than a retrieval run, and this design allows multiple runs (at different times) to be performed using the same index.
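The commit-based snapshot can be expressed in a few lines with the Docker SDK for Python; the container and image names and the hook arguments below are illustrative, and the jig itself may organize this step differently.

<pre>
# Sketch of snapshotting an indexed container via Docker commit and reusing it.
# Names and hook arguments are illustrative placeholders.
import docker

client = docker.from_env()
container = client.containers.get("example-indexed-container")  # hypothetical container in which indexing just ran

# Commit the post-indexing state as a new image so later searches can reuse the index.
snapshot = container.commit(repository="osirrc/example-image", tag="robust04-indexed")

# A later search run starts from the snapshot rather than re-indexing.
client.containers.run(
    snapshot.id,
    command=["/hooks/search", '{"collection": "robust04"}'],  # hypothetical search hook call
    volumes={"/tmp/osirrc-output": {"bind": "/output", "mode": "rw"}},
    remove=True,
)
</pre>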
During the search phase, the user issues a command specifying an image’s repository (i.e., name) and tag (i.e., version), the collection to search, and a number of auxiliary parameters such as the topics file, qrels file, and output directory. This hook is meant to perform the actual ad hoc retrieval runs, after which the jig evaluates the output with trec_eval. Just as in the index hook, relevant parameters are encoded in JSON. The image places run files in the /output directory, which is mapped back to the host; this allows the jig to retrieve the run files for evaluation.
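The evaluation step is essentially a thin wrapper around the standard trec_eval tool. A minimal sketch, assuming trec_eval is installed on the host and using illustrative file paths, might look like this:

<pre>
# Minimal sketch of evaluating a gathered run file with trec_eval.
# Paths are illustrative; trec_eval is assumed to be installed on the host.
import subprocess

result = subprocess.run(
    ["trec_eval", "-m", "map", "-m", "P.30", "-m", "ndcg_cut.20",
     "qrels.robust04.txt", "/tmp/osirrc-output/my_run.txt"],
    capture_output=True, text=True, check=True,
)
# trec_eval prints one line per measure: the measure name, the query id
# (here "all" for the aggregate), and the score.
for line in result.stdout.splitlines():
    measure, _, score = line.split()
    print(measure, score)
</pre>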
In addition to the two main hooks for ad hoc retrieval experiments, the jig also supports additional hooks for added functionality (not shown in Figure 1). The first of these is the interact hook, which allows a user to interactively explore an image in the state that has been captured via a snapshot, after the execution of the index hook. This allows, for example, the user to “enter” an interactive shell in the image (via standard Docker commands) and explore the inner workings of the image. The hook also allows users to interact with services that a container may choose to expose, such as an interactive search interface, or even Jupyter notebooks. With the interact hook, the container is kept alive in the foreground, unlike the other hooks, which exit immediately once execution has finished.

Finally, images may also implement a train hook, enabling an image to train a retrieval model, tune hyper-parameters, etc. after the index hook has been executed. The train hook allows the user to specify training and test splits for a set of topics, along with a model directory for storing the model. This output directory is mapped back to the host and can be passed to the search hook for use during retrieval. Currently, training is limited to the CPU, although progress has been made to support GPU-based training.

In the current design, the jig runs one image at a time, but additional tooling around the jig includes a script that further automates all interactions with an image so that experiments can be run end to end with minimal human supervision. This script creates a virtual machine in the cloud (currently, Microsoft Azure), installs the Docker engine and associated dependencies, and then runs the image using the jig. All output is then captured for archival purposes.
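Stripped of the cloud-provisioning steps, this batch tooling reduces to a loop of roughly the following shape; the prepare_image and search_image helpers are hypothetical stand-ins for the jig’s two phases, not functions the jig actually exports.

<pre>
# Stripped-down sketch of batch-running several images end to end and archiving logs.
# prepare_image and search_image are hypothetical stand-ins for the jig's two phases.
from pathlib import Path

def run_library(images, prepare_image, search_image, log_dir="archive"):
    """Run every (repository, tag) pair sequentially and save the captured output."""
    out = Path(log_dir)
    out.mkdir(exist_ok=True)
    for repo, tag in images:
        prepare_log = prepare_image(repo, tag, collection="robust04")
        search_log = search_image(repo, tag, topics="topics.robust04.txt")
        log_file = out / f"{repo.replace('/', '_')}-{tag}.log"
        log_file.write_text(prepare_log + "\n" + search_log)
</pre>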
4 SUBMITTED IMAGES AND RESULTS

Although we envision OSIRRC to be an ongoing effort, the reality of a physical SIGIR workshop meant that it was necessary to impose an arbitrary deadline at which to “freeze” image development. This occurred at the end of June, 2019. At that point in time, we had received 17 images from 13 different teams, listed alphabetically as follows:

• Anserini (University of Waterloo)
• Anserini-bm25prf (Waseda University)
• ATIRE (University of Otago)
• Birch (University of Waterloo)
• Elastirini (University of Waterloo)
• EntityRetrieval (Ryerson University)
• Galago (University of Massachusetts)
• ielab (University of Queensland)
• Indri (TU Delft)
• IRC-CENTRE2019 (Technische Hochschule Köln)
• JASS (University of Otago)
• JASSv2 (University of Otago)
• NVSM (University of Padua)
• OldDog (Radboud University)
• PISA (New York University and RMIT University)
• Solrini (University of Waterloo)
• Terrier (TU Delft and University of Glasgow)

All except for two images were designed to replicate runs for the TREC 2004 Robust Track test collection, which was the primary target for the exercise. The EntityRetrieval image was designed to perform entity retrieval (as opposed to ad hoc retrieval). The IRC-CENTRE2019 image packages a submission to the CENTRE reproducibility effort (http://www.centre-eval.org/), which targets a specific set of runs from the TREC 2017 Common Core Track. A number of images also support the Common Core Track test collections from TREC 2017 and 2018. Finally, a few images also provide support for the GOV2 and ClueWeb test collections, although these were not evaluated.

Following the deadline for submitting images, the organizers ran all images “from scratch” with v0.1.1 of the jig and the latest release of each participant’s image. Using our script (see Section 3.2), each image was executed sequentially on a virtual machine instance in the Microsoft Azure cloud. Note that it would have been possible to speed up the experiments by running the images in parallel, each on its own virtual machine instance, but this was not done. We used the instance type Standard_D64s_v3, which according to Azure documentation is based on either the 2.4 GHz Intel Xeon E5-2673 v3 (Haswell) processor or the 2.3 GHz Intel Xeon E5-2673 v4 (Broadwell) processor. Since we have no direct control over the physical hardware, it is only meaningful to compare efficiency (i.e., performance metrics such as query latency) across different images running on the same virtual machine instance. Nevertheless, our evaluations focused solely on retrieval effectiveness. This is a shortcoming, since a number of images package search engines that emphasize query evaluation efficiency.

The results of running the jig on the submitted images comprise the “official” OSIRRC 2019 image library, which is available on GitHub (https://github.com/osirrc/osirrc2019-library). We have captured all log output, run files, as well as trec_eval output. These results are summarized below.

For the TREC 2004 Robust Track test collection, 13 images generated a total of 49 runs, the results of which are shown in Table 1; the specific version of each image is noted. Effectiveness is measured using standard ranked retrieval metrics: average precision (AP), precision at rank cutoff 30 (P30), and NDCG at rank cutoff 20 (NDCG@20). The table does not include runs from the following images: Solrini and Elastirini (which are identical to Anserini runs), EntityRetrieval (for which relevance judgments are not available, since it was designed for a different task), and IRC-CENTRE2019 (which was not designed to produce results for this test collection).

As the primary goal of this workshop is to build community, infrastructure, and consensus, we deliberately attempt to minimize direct comparisons of run effectiveness in the presentation: runs are grouped by image, and the images themselves are sorted alphabetically. Nevertheless, a few important caveats are necessary for proper interpretation of the results: most runs perform no parameter tuning, although at least one implicitly encodes cross-validation results (e.g., Birch). Also, runs might use different parts of the complete topic: the “title”, “description”, and “narrative” (as well as various combinations). For details, we invite the reader to consult the overview paper by each participating team.

We see that the submitted images generate runs that use a diverse set of retrieval models, including query expansion and pseudo-relevance feedback (Anserini, Anserini-bm25prf, Indri, Terrier), term proximity (Indri and Terrier), conjunctive query processing (OldDog), and neural ranking models (Birch and NVSM). Several images package open-source search engines that are primarily focused on efficiency (ATIRE, JASS, JASSv2, PISA). Although we concede that there is an under-representation of neural approaches, relative to the amount of interest in the community at present, there are undoubtedly replication challenges with neural ranking models, particularly with their training regimes. Nevertheless, we are pleased with the range of systems and retrieval models represented in these images.

Results from the TREC 2017 Common Core Track test collection are shown in Table 2. On this test collection, we have 12 runs from 6 images. Results from the TREC 2018 Common Core Track test collection are shown in Table 3: there are 19 runs from 4 images.

5 FUTURE VISION AND ONGOING WORK

Our efforts complement other concurrent activities in the community. SIGIR has established a task force to implement ACM’s policy on artifact review and badging [5], and our efforts can be viewed as a technical feasibility study. This workshop also complements the recent CENTRE evaluation tasks jointly run at CLEF, NTCIR, and TREC [6, 11]. One of the goals of CENTRE is to define appropriate measures to determine whether and to what extent replicability and reproducibility have been achieved, while our efforts focus on how these properties can be demonstrated technically. Thus, the jig can provide the means to achieve CENTRE goals. Given fortuitous alignment in schedules, participants of CENTRE@CLEF2019 [4] were encouraged to participate in our workshop, and this in fact led to the contribution of the IRC-CENTRE2019 image.

From the technical perspective, we see two major shortcomings of the current jig implementation. First, the training hook is not as well-developed as we would have liked. Second, the jig lacks GPU support. Both will be remedied in a future iteration.

We have proposed and prototyped a technical solution to the replicability challenge specifically for the SIGIR community, but the changes we envision will not occur without a corresponding cultural shift. Sustained, cumulative empirical progress will only be made if researchers use our tools in their evaluations, and this will only be possible if images for the comparison conditions are available. This means that the community needs to adopt the norm of associating research papers with source code for replicating results in those papers. However, as Voorhees et al. [12] reported, having a link to a repository in a paper is far from sufficient. The jig provides the tools to package ad hoc retrieval experiments in a standard way, but these tools are useless without broad adoption. The incentive structures of academic publishing need to adapt to encourage such behavior, but unfortunately this is beyond the scope of our workshop.

Given appropriate extensions, we believe that the jig can be augmented to accommodate a range of batch retrieval tasks. One important future direction is to add support for tasks beyond batch retrieval, for example, to support interactive retrieval (with real or simulated user input) and evaluations on private and other sensitive data. Moreover, our effort represents a first systematic attempt to embody the Evaluation-as-a-Service paradigm [8] via Docker containers. We believe that there are many possible paths forward building on the ideas presented here.

Finally, we view our efforts as a stepping stone toward reproducibility, and beyond that, generalizability. While these two important desiderata are not explicit goals of our workshop, we note that the jig itself can provide the technical vehicle for delivering reproducibility and generalizability. Some researchers would want to package their own results in a Docker image. However, there is nothing that would prevent researchers from reproducing another team’s results, which are then captured in a Docker image conforming to our specifications. This would demonstrate reproducibility as well as replicability of those reproducibility efforts. The jig also supports mechanisms for evaluations on document collections and information needs beyond those that an image was originally designed for. This aligns with intuitive notions of what it means for a technique to be generalizable.

Overall, we believe that our efforts have moved the field of information retrieval forward both in terms of supporting “good science” as well as sustained, cumulative empirical progress. This work shows that it is indeed possible to coordinate a large, community-wide replication exercise in ad hoc retrieval, and that Docker provides a workable foundation for a common interface and lifecycle specification. We invite the broader community to join our efforts!

6 ACKNOWLEDGEMENTS

We would like to thank all the participants who contributed Docker images to the workshop. This exercise would not have been possible without their efforts. Additional thanks to Microsoft for providing credits on the Azure cloud.
REFERENCES

[1] Jaime Arguello, Matt Crane, Fernando Diaz, Jimmy Lin, and Andrew Trotman. 2015. Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR). SIGIR Forum 49, 2 (2015), 107–116.
[2] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements That Don’t Add Up: Ad-Hoc Retrieval Results Since 1998. In Proceedings of the 18th International Conference on Information and Knowledge Management (CIKM 2009). Hong Kong, China, 601–610.
[3] Monya Baker. 2016. Is There a Reproducibility Crisis? Nature 533 (2016), 452–454.
[4] Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. 2019. Overview of CENTRE@CLEF 2019: Sequel in the Systematic Reproducibility Realm. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Lugano, Switzerland.
[5] Nicola Ferro and Diane Kelly. 2018. SIGIR Initiative to Implement ACM Artifact Review and Badging. SIGIR Forum 52, 1 (2018), 4–10.
[6] Nicola Ferro, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. 2018. Overview of CENTRE@CLEF 2018: A First Tale in the Systematic Reproducibility Realm. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). Avignon, France, 239–246.
[7] Juliana Freire, Norbert Fuhr, and Andreas Rauber (Eds.). 2016. Report from Dagstuhl Seminar 16041: Reproducibility of Data-Oriented Experiments in e-Science. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Germany.
[8] Frank Hopfgartner, Allan Hanbury, Henning Müller, Ivan Eggel, Krisztian Balog, Torben Brodt, Gordon V. Cormack, Jimmy Lin, Jayashree Kalpathy-Cramer, Noriko Kando, Makoto P. Kato, Anastasia Krithara, Tim Gollub, Martin Potthast, Evelyne Viega, and Simon Mercer. 2018. Evaluation-as-a-Service for the Computational Sciences: Overview and Outlook. ACM Journal of Data and Information Quality (JDIQ) 10, 4 (November 2018), 15:1–15:32.
[9] Jimmy Lin. 2018. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 52, 2 (2018), 40–51.
[10] Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016). Padua, Italy, 408–420.
[11] Tetsuya Sakai, Nicola Ferro, Ian Soboroff, Zhaohao Zeng, Peng Xiao, and Maria Maistro. 2019. Overview of the NTCIR-14 CENTRE Task. In Proceedings of the 14th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo, Japan.
[12] Ellen M. Voorhees, Shahzad Rajput, and Ian Soboroff. 2016. Promoting Repeatability Through Open Runs. In Proceedings of the 7th International Workshop on Evaluating Information Access (EVIA 2016). Tokyo, Japan, 17–20.
[13] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically Examining the “Neural Hype”: Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019). Paris, France.
[14] Ruifan Yu, Yuhao Xie, and Jimmy Lin. 2018. H2oloo at TREC 2018: Cross-Collection Relevance Transfer for the Common Core Track. In Proceedings of the Twenty-Seventh Text REtrieval Conference (TREC 2018). Gaithersburg, Maryland.
Table 1: Results on the TREC 2004 Robust Track test collection.

Image | Version | Run | AP | P30 | NDCG@20
Anserini | v0.1.1 | bm25 | 0.2531 | 0.3102 | 0.4240
Anserini | v0.1.1 | bm25.rm3 | 0.2903 | 0.3365 | 0.4407
Anserini | v0.1.1 | bm25.ax | 0.2895 | 0.3333 | 0.4357
Anserini | v0.1.1 | ql | 0.2467 | 0.3079 | 0.4113
Anserini | v0.1.1 | ql.rm3 | 0.2747 | 0.3232 | 0.4269
Anserini | v0.1.1 | ql.ax | 0.2774 | 0.3229 | 0.4223
Anserini-bm25prf | v0.2.2 | b=0.20_bm25_bm25prf | 0.2916 | 0.3396 | 0.4419
Anserini-bm25prf | v0.2.2 | b=0.40_bm25_bm25prf | 0.2928 | 0.3438 | 0.4418
ATIRE | v0.1.1 | ANT_r4_100_percent.BM25+.s-stem.RF | 0.2184 | 0.3199 | 0.4211
Birch | v0.1.0 | mb_2cv.cv.a | 0.3241 | 0.3756 | 0.4722
Birch | v0.1.0 | mb_2cv.cv.ab | 0.3240 | 0.3756 | 0.4720
Birch | v0.1.0 | mb_2cv.cv.abc | 0.3244 | 0.3767 | 0.4738
Birch | v0.1.0 | mb_5cv.cv.a | 0.3266 | 0.3783 | 0.4769
Birch | v0.1.0 | mb_5cv.cv.ab | 0.3278 | 0.3795 | 0.4817
Birch | v0.1.0 | mb_5cv.cv.abc | 0.3278 | 0.3790 | 0.4831
Birch | v0.1.0 | qa_2cv.cv.a | 0.3014 | 0.3507 | 0.4469
Birch | v0.1.0 | qa_2cv.cv.ab | 0.3003 | 0.3494 | 0.4475
Birch | v0.1.0 | qa_2cv.cv.abc | 0.3003 | 0.3494 | 0.4475
Birch | v0.1.0 | qa_5cv.cv.a | 0.3102 | 0.3574 | 0.4628
Birch | v0.1.0 | qa_5cv.cv.ab | 0.3090 | 0.3577 | 0.4615
Birch | v0.1.0 | qa_5cv.cv.abc | 0.3090 | 0.3577 | 0.4614
Galago | v0.0.2 | output_robust04 | 0.1948 | 0.2659 | 0.3732
ielab | v0.0.1 | robust04-1000 | 0.1826 | 0.2605 | 0.3477
Indri | v0.2.1 | bm25.title | 0.2338 | 0.2995 | 0.4041
Indri | v0.2.1 | bm25.title.prf | 0.2563 | 0.3041 | 0.3995
Indri | v0.2.1 | bm25.title+desc | 0.2702 | 0.3274 | 0.4517
Indri | v0.2.1 | bm25.title+desc.prf.sd | 0.2971 | 0.3562 | 0.4448
Indri | v0.2.1 | dir1000.title | 0.2499 | 0.3100 | 0.4201
Indri | v0.2.1 | dir1000.title.sd | 0.2547 | 0.3146 | 0.4232
Indri | v0.2.1 | dir1000.title.prf | 0.2812 | 0.3248 | 0.4276
Indri | v0.2.1 | dir1000.title.prf.sd | 0.2855 | 0.3295 | 0.4298
Indri | v0.2.1 | dir1000.desc | 0.2023 | 0.2581 | 0.3635
Indri | v0.2.1 | jm0.5.title | 0.2242 | 0.2839 | 0.3689
JASS | v0.1.1 | JASS_r4_10_percent | 0.1984 | 0.2991 | 0.4055
JASSv2 | v0.1.1 | JASSv2_10 | 0.1984 | 0.2991 | 0.4055
NVSM | v0.1.0 | robust04_test_topics_run | 0.1415 | 0.2197 | 0.2757
OldDog | v1.0.0 | bm25.robust04.con | 0.1736 | 0.2526 | 0.3619
OldDog | v1.0.0 | bm25.robust04.dis | 0.2434 | 0.2985 | 0.4002
PISA | v0.1.3 | robust04-1000 | 0.2534 | 0.3120 | 0.4221
Terrier | v0.1.7 | bm25 | 0.2363 | 0.2977 | 0.4049
Terrier | v0.1.7 | bm25_qe | 0.2762 | 0.3281 | 0.4332
Terrier | v0.1.7 | bm25_prox | 0.2404 | 0.3033 | 0.4082
Terrier | v0.1.7 | bm25_prox_qe | 0.2781 | 0.3288 | 0.4307
Terrier | v0.1.7 | dph | 0.2479 | 0.3129 | 0.4198
Terrier | v0.1.7 | dph_qe | 0.2821 | 0.3369 | 0.4425
Terrier | v0.1.7 | dph_prox | 0.2501 | 0.3166 | 0.4206
Terrier | v0.1.7 | dph_prox_qe | 0.2869 | 0.3376 | 0.4435
Terrier | v0.1.7 | pl2 | 0.2241 | 0.2918 | 0.3948
Terrier | v0.1.7 | pl2_qe | 0.2538 | 0.3126 | 0.4163

Table 2: Results on the TREC 2017 Common Core Track test collection.

Image | Version | Run | AP | P30 | NDCG@20
Anserini | v0.1.1 | bm25 | 0.2087 | 0.4293 | 0.3877
Anserini | v0.1.1 | bm25.rm3 | 0.2823 | 0.5093 | 0.4467
Anserini | v0.1.1 | bm25.ax | 0.2787 | 0.4980 | 0.4450
Anserini | v0.1.1 | ql | 0.2032 | 0.4467 | 0.3958
Anserini | v0.1.1 | ql.rm3 | 0.2606 | 0.4827 | 0.4226
Anserini | v0.1.1 | ql.ax | 0.2613 | 0.4953 | 0.4429
ATIRE | v0.1.1 | ANT_c17_100_percent | 0.1436 | 0.4087 | 0.3742
IRC-CENTRE2019 | v0.1.3 | wcrobust04 | 0.2971 | 0.5613 | 0.5143
IRC-CENTRE2019 | v0.1.3 | wcrobust0405 | 0.3539 | 0.6347 | 0.5821
JASS | v0.1.1 | JASS_c17_10_percent | 0.1415 | 0.4080 | 0.3711
JASSv2 | v0.1.1 | JASSv2_c17_10 | 0.1415 | 0.4080 | 0.3711
PISA | v0.1.3 | core17-1000 | 0.2078 | 0.4260 | 0.3898

Table 3: Results on the TREC 2018 Common Core Track test collection.

Image | Version | Run | AP | P30 | NDCG@20
Anserini | v0.1.1 | bm25 | 0.2495 | 0.3567 | 0.4100
Anserini | v0.1.1 | bm25.ax | 0.2920 | 0.4027 | 0.4342
Anserini | v0.1.1 | bm25.rm3 | 0.3136 | 0.4200 | 0.4604
Anserini | v0.1.1 | ql | 0.2526 | 0.3653 | 0.4204
Anserini | v0.1.1 | ql.ax | 0.2966 | 0.4060 | 0.4303
Anserini | v0.1.1 | ql.rm3 | 0.3073 | 0.4000 | 0.4366
OldDog | v1.0.0 | bm25.con | 0.1802 | 0.3167 | 0.3650
OldDog | v1.0.0 | bm25.dis | 0.2381 | 0.3313 | 0.3706
PISA | v0.1.3 | core18-1000 | 0.2384 | 0.3500 | 0.3927
Terrier | v0.1.7 | bm25 | 0.2326 | 0.3367 | 0.3800
Terrier | v0.1.7 | bm25_qe | 0.2975 | 0.4040 | 0.4290
Terrier | v0.1.7 | bm25_prox | 0.2369 | 0.3447 | 0.3954
Terrier | v0.1.7 | bm25_prox_qe | 0.2960 | 0.4067 | 0.4318
Terrier | v0.1.7 | dph | 0.2427 | 0.3633 | 0.4022
Terrier | v0.1.7 | dph_qe | 0.3055 | 0.4153 | 0.4369
Terrier | v0.1.7 | dph_prox | 0.2428 | 0.3673 | 0.4140
Terrier | v0.1.7 | dph_prox_qe | 0.3035 | 0.4167 | 0.4462
Terrier | v0.1.7 | pl2 | 0.2225 | 0.3227 | 0.3636
Terrier | v0.1.7 | pl2_qe | 0.2787 | 0.3933 | 0.3975