Dockerising Terrier for The Open-Source IR Replicability Challenge (OSIRRC 2019)

Arthur Câmara
A.BarbosaCamara@tudelft.nl
Delft University of Technology
Delft, the Netherlands

Craig Macdonald
craig.macdonald@glasgow.ac.uk
University of Glasgow
Glasgow, UK

ABSTRACT
Reproducibility and replicability are key concepts in science, and it is therefore important for information retrieval (IR) platforms to aid in reproducing and replicating experiments. In this paper, we describe the creation of a Docker container for Terrier within the framework of the OSIRRC 2019 challenge, which allows typical runs to be reproduced on TREC test collections such as Robust04, GOV2 and Core2018. In doing so, it is hoped that the produced Docker image can be of aid to others in (re)producing baseline experiments on these test collections. Initiatives like OSIRRC are key to advancing these concepts in the IR area. By making available not only the source code but also the exact same environment, and by standardising inputs and outputs, it is possible to easily compare approaches and thereby improve the quality of information retrieval research.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 OVERVIEW
Terrier (Terabyte Retriever) is an information retrieval (IR) toolkit, initiated by the University of Glasgow, which has been developed since 2001 [5]. It implements a number of retrieval and indexing methods, ready to be used in both research and production.

Given its open source nature, Terrier has been used in a number of papers in IR and other fields over the years, particularly using standard IR test collections such as those from the Text REtrieval Conference (TREC). For this reason, we agreed to join the OSIRRC 2019 challenge, to create a Docker container image to allow standard baseline results to be obtained using Terrier in a manner that can be easily cross-compared with other platforms implementing the OSIRRC design. Moreover, it is hoped that the produced Docker image can be of aid to others in (re)producing baseline experiments on these test collections.

This paper describes the implementation of the Terrier Docker image. In particular: Section 2 describes the various scripts implemented, as well as the obtained retrieval performances; Section 3 describes the lessons learned in this implementation; concluding remarks and outlook follow in Section 4.

2 TECHNICAL DESIGN
The nature of the OSIRRC challenge is that implementing systems should provide a Dockerfile that can be used to create a Docker container image that can be run on any Docker installation. The container image is required to implement a number of “hooks”: simply put, executables at known locations in the image filesystem. These hooks are then called by the OSIRRC jig, which also mounts any additional files required into the container image (for instance, the corpus files for indexing, or the topic files for retrieval). In the following, we describe the Dockerfile used to build the container image, and the hooks that we implemented for Terrier within the image.

2.1 Dockerfile
The Dockerfile builds a container image with the necessary pre-requisites for Terrier. As Terrier is developed in Java, we base the Terrier container image on the OpenJDK standard image for Java 8. A number of other libraries are installed, including Python (for interacting with the OSIRRC jig); gcompat (standard C libraries for trec_eval); and the Jupyter pre-requisites. We chose not to install Terrier using the Dockerfile, to maintain a lightweight container image.

2.2 Standard Hooks
2.2.1 init. This hook is used to prepare the container. We use this hook to download and extract Terrier. We provide example code both for downloading a pre-built Terrier “tarball” from the Terrier.org website and for checking out a version from the GitHub repository. This hook is configured to use Terrier's latest stable version, 5.2. It can, however, be easily configured to fetch other Terrier versions, by changing the variable version in the init script.

The hook is also compatible with git, making it possible to fetch bleeding-edge versions of Terrier directly from the terrier-core GitHub repository¹. This can be controlled by the variable github in the same init script. Note that when using the GitHub version the init hook may take longer to run, since the code will be compiled for your system. In our experiments, this could take up to 3 extra minutes.

¹ https://github.com/terrier-org/terrier-core

2.2.2 index. This hook is used to index the corpus that has been mounted by the jig. The jig provides the name of the corpus (see Table 1 for supported corpora), as well as the format (trectext, trecweb or json). The index hook uses the corpus name to configure the indexing process. For example, for the robust04 corpus, which uses TREC Disks 4 & 5, we remove the Congressional Record from the indexing manifest (the collection.spec file that lists the files Terrier should index), as well as READMEs and other unnecessary files, and configure an additional decompression plugin. Similarly, for the core18 corpus, we configure Terrier to download² an additional indexing plugin to support parsing of the TREC Washington Post corpus.

² Indeed, inspired by Apache Spark, since version 5.0, Terrier supports downloading additional plugins from MavenCentral, based on the value of the terrier.mvn.coords configuration property.
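The manifest filtering described for robust04 can be illustrated with a short Python sketch. This is not the code shipped in the Terrier image. The function name filter_manifest, the assumption that the manifest holds one file path per line, the assumption that Congressional Record files sit under a directory called cr, and the README name check are all hypothetical choices made for illustration, as are the example paths.

```python
from pathlib import PurePosixPath


def filter_manifest(paths, excluded_dirs=("cr",), excluded_prefixes=("readme",)):
    """Return the manifest entries that should still be indexed.

    Assumptions (not taken from the Terrier image): Congressional Record
    files live under a directory named "cr", and auxiliary files have
    names starting with "readme". Matching is case-insensitive.
    """
    kept = []
    for raw in paths:
        path = PurePosixPath(raw.strip())
        if not path.name:
            continue  # skip blank lines in the manifest
        parent_dirs = [part.lower() for part in path.parts[:-1]]
        if any(d in parent_dirs for d in excluded_dirs):
            continue  # file is inside an excluded directory
        if path.name.lower().startswith(excluded_prefixes):
            continue  # README-style auxiliary file
        kept.append(str(path))
    return kept


# Hypothetical manifest entries, for illustration only.
manifest = [
    "/data/disk4/cr/cr93e1.z",
    "/data/disk4/ft/ft911/ft911_1.z",
    "/data/disk5/latimes/README.txt",
    "/data/disk5/fbis/fb396001.z",
    "",
]
print("\n".join(filter_manifest(manifest)))
```

Writing the surviving entries back, one per line, would give a reduced manifest of the kind the hook passes to the indexer. Filtering on directory and file names keeps the sketch independent of the corpus contents, at the cost of relying on a directory layout that may differ between copies of the disks.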
Table 1: Supported corpora.

Name      Description
robust04  TREC Disks 4 & 5, minus CR
gov2      TREC GOV2
core18    TREC Washington Post
cw09b     TREC ClueWeb09 part B
cw12b     TREC ClueWeb12 part B

Finally, we note that Terrier supports a variety of indexing configurations. We document here our choices and the alternatives available:
• Positions: Terrier, by default, does not record positions (also called blocks) in the direct or inverted index. Passing the optional argument block.indexing to the jig will result in positions being saved. This allows proximity weighting models to be run.
• Indexer: Terrier's “classical” indexer creates a direct index (also called a forward index) before inverting it to create the inverted index. The direct index is what allows pseudo-relevance feedback query expansion. However, the classical indexer is known to be slower than the alternative “single-pass” indexer that Terrier also provides. If the direct index is not required, the single-pass indexer could be used by passing a -j flag to Terrier during indexing.
• Fields: Including fields in the index allows the frequency of query terms within different parts of the document (e.g. TITLE tags) to be recorded separately. This allows the use of field-based weighting models such as BM25F or PL2F, or the use of fields for features within a learning-to-rank setting.
• Stemming & Stopwords: We retain Terrier's default setting of Porter stemming and standard stopword removal. Terrier's stopword list has 733 words.

2.2.3 search. This hook makes use of Terrier's batchretrieve command to execute a ‘run’, i.e. to extract 1000 results for each of a batch of information needs (topics/queries). The jig mounts the topics into the container image. In addition to learning-to-rank (discussed further below), we provide off-the-shelf support for 12 retrieval configurations. These are broken down by three orthogonal components: weighting model (BM25 [8], as well as PL2 [1] and DPH [2] from the Divergence from Randomness framework); proximity (pBiL Divergence from Randomness model [7]); and query expansion (Bo1 Divergence from Randomness model [1]).

The combination of these three components yields the following possible configurations, to be passed to the hook using the --opts config= parameter:
• BM25
  – bm25: Vanilla BM25
  – bm25_qe: BM25 with query expansion
  – bm25_prox: BM25 with proximity
  – bm25_prox_qe: BM25 with proximity and query expansion
• PL2
  – pl2: Vanilla PL2
  – pl2_qe: PL2 with query expansion
  – pl2_prox: PL2 with proximity
  – pl2_prox_qe: PL2 with proximity and query expansion
• DPH
  – dph: Vanilla DPH
  – dph_qe: DPH with query expansion
  – dph_prox: DPH with proximity
  – dph_prox_qe: DPH with proximity and query expansion

The expected results for each of these runs can be found in Table 2, together with the relative improvements of each configuration over the vanilla version. On analysing the results in the table, it is of interest to note that query expansion always improves the final result considerably, with an improvement of up to 27.90% over the vanilla versions. However, the same cannot be said about the proximity option. While yielding improvements of up to 3.61% (bm25_prox), it can sometimes decrease the performance of the vanilla version, by up to 1.70% (pl2_prox). These results are interesting, since they show that different methods can result in diverse observations across the multiple corpora and query sets.

When combining both proximity and query expansion, the results, overall, show improvements over the vanilla version and over query expansion or proximity alone, with improvements of up to 27.26% over the vanilla versions. In the cases where proximity decreased the original results, combining query expansion and proximity search does not improve the results, as expected. However, the results from scenarios like the DPH model on Core18 and GOV2 (topics 701-750) show that, even if both query expansion and proximity search are combined, the overall result may not improve over only the stronger of the two methods (usually, query expansion).

This again reinforces the knowledge that each dataset is different, and that a practitioner should be careful not to simply stack methods that provide marginal gains, but should instead test multiple combinations, and understand how each method may behave when used in combination with others.

2.3 Learning To Rank Hooks
Terrier provides support for learning-to-rank in several manners: the ability to integrate additional features during ranking, including additional query dependent features without having to re-traverse the inverted or direct index [6], as well as integration of the Jforests³ implementation of LambdaMART [9].

Learning-to-rank integration is demonstrated through two hooks, train and search.

2.3.1 train. This hook extracts features for the training and validation topics, before calling Jforests to build the learned model. To aid implementation, train calls the search hook internally to obtain results for the training and validation sets, specifying the bm25_ltr_features retrieval configuration. The retrieval features to use are configurable by specifying the features argument to the jig.

2.3.2 search. Search also supports generation of the final learning-to-rank run, using the bm25_ltr_jforest retrieval configuration. This configuration assumes that train has already been called and hence that a Jforests learned model file already exists.

³ https://github.com/yasserg/jforests/

Table 2: Expected performance per method and corpus. The best result for each corpus and query set is emphasised.
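The percentages reported alongside the scores in Table 2 are relative changes with respect to the vanilla run of the same weighting model on the same topics. A minimal sketch of that arithmetic follows. The function names relative_improvement and format_cell are hypothetical helpers introduced for illustration, not part of Terrier or the jig, and the two scores are copied from the BM25 row for Robust04.

```python
def relative_improvement(score, baseline):
    """Percentage change of `score` with respect to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline * 100.0


def format_cell(score, baseline):
    """Render a score and its signed relative change, two decimals."""
    return f"{score:.4f} ({relative_improvement(score, baseline):+.2f}%)"


# BM25 on Robust04: vanilla, +QE and +Proximity scores from Table 2.
vanilla = 0.2363
print(format_cell(0.2762, vanilla))  # prints "0.2762 (+16.89%)"
print(format_cell(0.2404, vanilla))  # prints "0.2404 (+1.74%)"
```

Because the change is always measured against the vanilla run, the figures for the combined configuration are not the sum of the query expansion and proximity figures.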
Model  Method          Robust04            Core18              GOV2 701-750        GOV2 751-800        GOV2 801-850
BM25   Vanilla         0.2363              0.2326              0.2461              0.3081              0.2629
BM25   +QE             0.2762 (+16.89%)    0.2975 (+27.90%)    0.2621 (+6.50%)     0.3506 (+13.79%)    0.3118 (+18.60%)
BM25   +Proximity      0.2404 (+1.74%)     0.2369 (+1.85%)     0.2537 (+3.09%)     0.3126 (+1.46%)     0.2724 (+3.61%)
BM25   +QE +Proximity  0.2781 (+17.69%)    0.2960 (+27.26%)    0.2715 (+10.32%)    0.3507 (+13.83%)    0.3085 (+17.34%)
PL2    Vanilla         0.2241              0.2225              0.2334              0.2884              0.2363
PL2    +QE             0.2538 (+13.25%)    0.2787 (+25.26%)    0.2478 (+6.17%)     0.3160 (+9.57%)     0.2739 (+15.91%)
PL2    +Proximity      0.2283 (+1.87%)     0.2248 (+1.03%)     0.2347 (+0.56%)     0.2835 (-1.70%)     0.2361 (-0.08%)
PL2    +QE +Proximity  0.2575 (+14.90%)    0.2821 (+26.79%)    0.2455 (+5.18%)     0.3095 (+7.32%)     0.2628 (+11.21%)
DPH    Vanilla         0.2479              0.2427              0.2804              0.3311              0.2917
DPH    +QE             0.2821 (+13.80%)    0.3055 (+25.88%)*   0.3120 (+11.27%)*   0.3754 (+13.38%)*   0.3439 (+17.90%)*
DPH    +Proximity      0.2501 (+0.89%)     0.2428 (+0.04%)     0.2834 (+1.07%)     0.3255 (-1.69%)     0.2904 (-0.45%)
DPH    +QE +Proximity  0.2869 (+15.73%)*   0.3035 (+25.05%)    0.3064 (+9.27%)     0.3095 (-6.52%)     0.3288 (+12.72%)
* Best result in the column.

2.4 Interaction
In the interact hook, we provide three HTTP-accessible methods that allow a researcher to interact with the Terrier instance. Two of these provide access to the results of the search engine, while the third allows the user to conduct further experiments within a Jupyter notebook environment, making use of Terrier-Spark [3]. Each HTTP server is made available on a separate port, as detailed below⁴.

⁴ Note that the ports on the host machine may differ, due to the way that Docker assigns ports. It is foreseen that this will be resolved in future versions of the OSIRRC jig - see https://github.com/osirrc/jig/issues/112.

2.4.1 Port 1980: Simple search interface. This provides a simple, user-friendly web presentation of the search results, allowing the user to enter queries and receive ranked search results.

2.4.2 Port 1981: REST API. This provides a REST endpoint from which Terrier serves search results. It can be used directly, or by another instance of Terrier to query the index in a running container (i.e. Terrier can be either a server or a client). Figure 1 shows an example of using Terrier from the command line of another machine to access an index hosted within a Docker container.

    #this starts the REST endpoint on port 1981
    [dockerhost]$ cd jig
    [dockerhost]$ python run.py interact --repo terrier --tag latest
    ------
    #this demonstrates access to that index from another machine
    [anotherhost]$ cd terrier
    [anotherhost]$ bin/terrier interactive -I http://dockerhost:1981/
    terrier query> information retrieval end:5
    Displaying 1-6 results
    0 FBIS4-20699 10.268754805435458
    1 FBIS4-20702 9.768490153503198
    2 FR941027-2-00046 9.491347902606723
    3 FBIS4-20701 9.456022500508775
    4 FBIS3-24510 9.31403481019499
    5 FBIS4-20700 8.792342494849281

Figure 1: Accessing an index hosted on the Terrier Docker container via the Terrier REST API.

2.4.3 Port 1982: Terrier-Spark Jupyter Notebook. Finally, port 1982 starts a Jupyter notebook with Apache Toree's Scala kernel installed. This allows use of Terrier-Spark, a Scala interface built on top of Apache Spark that allows Terrier retrieval experiments to be conducted, including in a Jupyter notebook [3, 4]. An example notebook is provided that allows the user to run more experiments on the available indices. Functionalities include querying and evaluating outcomes (as shown in Figure 2), as well as combining Terrier's learning-to-rank feature support with Apache Spark's machine learning capabilities.

Figure 2: An example of evaluating a run from Terrier-Spark

3 LESSONS LEARNED
While developing this work, a number of roadblocks appeared, prompting new insights and workarounds that ended up improving the overall reproducibility of the work. Some of these roadblocks, formulated as questions, are described in this section.

3.1 Do you really have the original version of the corpus?
We discovered, like several other research labs involved in the OSIRRC challenge, that TREC Disks 4 & 5 had originally been compressed using the archaic Unix compress utility, resulting in .z, .1z and .2z filename suffixes. Our own copies in Glasgow and Delft had at some time been recompressed using more contemporary Gzip compression (with a resulting .gz filename suffix). We made some minor adjustments in Terrier version 5.2 that allowed decompression of .z files on-the-fly, using an Apache Commons package integrated into Terrier.

3.2 How much memory is in this container?
Like any Java process, Terrier is limited to the amount of memory available to the Java Virtual Machine (JVM). We worked hard to ensure that the JVM is allowed to use as much memory as possible once a container is running. This gives Terrier potential speed improvements for both indexing and retrieval.

3.3 Can the classical indexer be more aggressive in using the available memory?
In OSIRRC, we elected to default to Terrier's classical indexer, as the direct index it creates allows more flexibility than the index built by the faster single-pass indexer. However, it was recognised that the classical indexer had seen less attention in recent years, and hence could be further optimised. In particular, in Terrier 5.2, we made changes to the classical indexer to recognise the available memory, and to be more aggressive in its use of that RAM. As a result, we have observed significant efficiency improvements when building a block index for GOV2 (an 11% reduction in indexing time for a Docker host machine with many CPU cores, with larger benefit observed for less powerful hosts).

4 CONCLUSIONS & OUTLOOK
This paper has described the implementation of the Terrier-Docker container image within the OSIRRC replicability challenge. This has been a worthwhile effort that has allowed many IR platforms and toolkits to be made available within a standardised Docker environment. We have aimed to provide a range of standard retrieval configurations that Terrier can provide for the relevant test collections. Meanwhile, participation in the challenge has allowed some improvements to the Terrier platform, which will be released in version 5.2.

On the other hand, while the Docker image is a step in the direction of allowing replication of IR experiments, we believe that it should be combined with notebook-like environments that facilitate the scripting of advanced experiments. We have provided one example Terrier-Spark notebook, which demonstrates the possible functionality of conducting an IR experiment within a notebook. However, we acknowledge the overheads of operating in a Spark environment (both in efficiency and in code complexity). In the future, we seek better integration of Terrier into a Python environment, to allow easier scripting of complex retrieval experiments.

REFERENCES
[1] Giambattista Amati. 2003. Probabilistic Models for Information Retrieval based on Divergence from Randomness. Ph.D. Dissertation. Department of Computing Science, University of Glasgow.
[2] Giambattista Amati. 2006. Frequentist and Bayesian Approach to Information Retrieval. In ECIR (Lecture Notes in Computer Science), Vol. 3936. Springer, 13–24.
[3] Craig Macdonald. 2018. Combining Terrier with Apache Spark to create Agile Experimental Information Retrieval Pipelines. In SIGIR. ACM, 1309–1312.
[4] Craig Macdonald, Richard McCreadie, and Iadh Ounis. 2018. Agile Information Retrieval Experimentation with Terrier Notebooks. In DESIRES (CEUR Workshop Proceedings), Vol. 2167. CEUR-WS.org, 54–61.
[5] Craig Macdonald, Richard McCreadie, Rodrygo L. T. Santos, and Iadh Ounis. 2012. From puppy to maturity: Experiences in developing Terrier. Proc. of OSIR at SIGIR (2012), 60–63.
[6] Craig Macdonald, Rodrygo L. T. Santos, Iadh Ounis, and Ben He. 2013. About learning models with multiple query-dependent features. ACM Trans. Inf. Syst. 31, 3 (2013), 11.
[7] Jie Peng, Craig Macdonald, Ben He, Vassilis Plachouras, and Iadh Ounis. 2007. Incorporating term dependency in the DFR framework. In SIGIR. ACM, 843–844.
[8] Stephen E. Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333–389.
[9] Qiang Wu, Chris J. C. Burges, Krysta M. Svore, and Jianfeng Gao. 2008. Ranking, Boosting, and Model Adaptation. Technical Report MSR-TR-2008-109. Microsoft.