A Docker-Based Replicability Study of a Neural Information Retrieval Model

Nicola Ferro, Stefano Marchesin, Alberto Purpura, Gianmaria Silvello
Department of Information Engineering, University of Padua, Italy
{ferro,marches1,purpuraa,silvello}@dei.unipd.it

ABSTRACT
In this work, we propose a Docker image architecture for the replicability of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) [22], an unsupervised NeuIR model. The first image we share (nvsm_cpu) can run on most machines and relies only on the CPU to perform the required computations. The second image we share (nvsm_gpu) relies instead on the Graphics Processing Unit (GPU) of the host machine, when available, to perform computationally intensive tasks, such as the training of the NVSM model. Furthermore, we discuss some insights on the engineering challenges we encountered in obtaining deterministic and consistent results from NeuIR models relying on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the use within Docker of the TensorFlow and CUDA libraries, whose inherent randomness alters, under certain circumstances, the relative order of documents in the rankings.

CCS CONCEPTS
• Information systems → Information retrieval; Retrieval models and ranking; Evaluation of retrieval results; • Computing methodologies → Unsupervised learning.

KEYWORDS
Docker, Neural Information Retrieval, Replicability, Reproducibility

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 INTRODUCTION
Following some recent efforts on reproducibility, like the CENTRE evaluations at CLEF [7, 9], NTCIR [19] and TREC [20], or the SIGIR task force to implement ACM's policy on artifact review and badging [8], the Open-Source IR Replicability Challenge at SIGIR 2019 (OSIRRC 2019) aims at addressing the replicability issue in ad hoc document retrieval [3]. OSIRRC 2019's vision is to build Docker-based (https://www.docker.com/) infrastructures to replicate results on standard ad hoc test collections. Docker is a tool that allows for the creation and deployment of applications via images containing all the required dependencies. Relying on a Docker-based infrastructure to replicate the results of existing systems helps researchers avoid all the issues related to system requirements and dependencies. Indeed, Information Retrieval (IR) platforms such as Anserini [25] and Terrier [17], or text matching libraries such as MatchZoo [5], rely on a set of software tools, developed in Java or Python and based on numerous libraries for scientific computing, which all have to be available on the host machine for the applications to run smoothly.

Therefore, OSIRRC 2019 aims to ease the use of such platforms, and of retrieval approaches in general, by providing Docker images that replicate IR models on ad hoc document collections. To maximize the impact of such an effort, OSIRRC 2019 sets three main goals:
(1) Develop a common Docker interface specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections. The proposed solution is known as the jig.
(2) Build a curated library of Docker images that work with the jig to capture a diversity of systems and retrieval models.
(3) Explore the possibility of broadening these efforts to include additional tasks, evaluation methodologies, and benchmark initiatives.

OSIRRC 2019 gives us the opportunity to investigate the replicability (and reproducibility), as described in the ACM guidelines discussed in [11], of Neural IR (NeuIR) models, for which these issues are especially relevant. Indeed, NeuIR models are highly sensitive to parameters, hyper-parameters, and pre-processing choices, which hampers researchers who want to study, evaluate, and compare NeuIR models against state-of-the-art approaches. Also, these models are usually compatible only with specific versions of the libraries they rely on (e.g., TensorFlow), because these frameworks are constantly updated. The use of Docker images is a possible solution to avoid these deployment issues on different machines, as an image already includes all the libraries required by the contained application.
For this reason, (i) we propose a Docker architecture that can be used as a framework to train, test, and evaluate NeuIR models, and that is compatible with the jig introduced by OSIRRC 2019; and (ii) we show how this architecture can be employed to build a Docker image that replicates the Neural Vector Space Model (NVSM) [22], a state-of-the-art unsupervised neural model for ad hoc retrieval. We rely on our shared TensorFlow (https://www.tensorflow.org/) implementation of NVSM [18]. The model is trained, tested, and evaluated on the TREC Robust04 collection [23]. The contributions of this work are the following:
• we present a Docker architecture for NeuIR models that is compliant with the OSIRRC 2019 jig requirements. The architecture supports three functions, index, train, and search, which are the same actions typically performed by NeuIR models;
• we share two Docker images to replicate the NVSM results on the Robust04 collection: nvsm_cpu, which relies on one or more CPUs for its computations and is compatible with most machines, and nvsm_gpu, which supports parallel computing using an NVIDIA Graphics Processing Unit (GPU);
• we perform extensive experimental evaluations to explore the replicability challenges of NeuIR models (i.e., NVSM) with Docker.

Our NVSM Docker images are part of the OSIRRC 2019 library (https://github.com/osirrc/osirrc2019-library/#NVSM). The source code, system runs, and additional required data can be found in [10]. The findings presented in this paper contributed to the definition of the "training" hook within the jig.
The rest of the paper is organized as follows: Section 2 presents an overview of previous and related initiatives for repeatability, replicability, and reproducibility. Section 3 describes the NVSM model. Section 4 presents the Docker image architecture, whereas Section 5 describes how to interact with the provided Docker images. Section 6 presents the experimental setup and Section 7 shows the obtained results. Finally, Section 8 discusses the outcomes of our experiments and provides insights on the replicability challenges and issues of NeuIR models.

2 RELATED WORK
Repeatability, replicability, and reproducibility are fundamental aspects of computational sciences, both in supporting desirable scientific methodology and in sustaining empirical progress. Recent ACM guidelines (https://www.acm.org/publications/policies/artifact-review-badging/) define these concepts as follows:
• Repeatability: a researcher can reliably repeat his/her own computation (same team, same experimental setup).
• Replicability: an independent group can obtain the same result using the author's own artifacts (different team, same experimental setup).
• Reproducibility: an independent group can obtain the same result using artifacts which they develop completely independently (different team, different experimental setup).

These guidelines have been discussed and analyzed in depth in the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science", held on 24-29 January 2016 [11], which focused on the core issues and approaches to reproducibility in several fields of computer science. One of the outcomes of the seminar was the Platform, Research goal, Implementation, Method, Actor and Data (PRIMAD) model, which tackles reproducibility from different angles. The PRIMAD model acts as a framework to distinguish the main elements describing an experiment in computer science, as there are many different terms related to various kinds of reproducibility [4]. The main aspects of PRIMAD are the following:
• Research Goal (R) characterizes the purpose of a study;
• Method (M) is the specific approach proposed or considered by the researcher;
• Implementation (I) refers to the actual implementation of the method, usually in some programming language;
• Platform (P) describes the underlying hardware and software, like the operating system and the computer used;
• Data (D) consists of two parts, namely the input data and the specific parameters chosen to carry out the method;
• Actor (A) refers to the experimenter.

Along with the main aspects of the PRIMAD model, there are two other relevant variables that need to be taken into account for reproducibility: transparency and consistency. Transparency is the ability to verify that all the necessary components of an experiment perform as they claim; consistency refers to the success or failure of a reproducibility experiment in terms of consistent outcomes.

The PRIMAD paradigm has been adopted by the IR community, where it has been adapted to the context of IR evaluation, both system-oriented and user-oriented [6]. Nevertheless, reproducibility in IR is still a critical concept, which requires, among other things, infrastructures to manage experimental data [1], off-the-shelf open-source IR systems [21], and reproducible baselines [2].

In this context, our contribution at OSIRRC 2019 lies between replicability and reproducibility. Indeed, by relying on the NVSM implementation available at [18], we replicate the results of a reproduced version of NVSM. Therefore, we do not completely adhere to the definition of replicability provided by the ACM, as we rely on an independent implementation of NVSM rather than the one proposed in [22]. Regardless, we believe that our contribution of a Docker architecture for NeuIR models, along with the two produced NVSM Docker images, can shed some light on the replicability and reproducibility issues of NeuIR models. Besides, the off-the-shelf nature of Docker images can help future researchers replicate/reproduce our results easily and consistently.

3 NEURAL VECTOR SPACE MODEL
The Neural Vector Space Model (NVSM) is a state-of-the-art unsupervised model for ad hoc retrieval. The model achieves competitive results against traditional lexical models, like Query Language Models (QLM) [26], and outperforms state-of-the-art unsupervised semantic retrieval models, like the Word2Vec-based models presented in [24]. Below, we describe the main characteristics of NVSM.

Given a document collection $D = \{d_j\}_{j=1}^{M}$ and the associated lexicon $V = \{w_i\}_{i=1}^{N}$, NVSM considers the vector representations $\vec{w}_i \in \mathbb{R}^{k_w}$ and $\vec{d}_j \in \mathbb{R}^{k_d}$, where $k_w$ and $k_d$ denote the dimensionality of word and document vectors.
Due to the different sizes of word and document embeddings, the word feature space is mapped to the document feature space through a series of transformations learned by the model.

A sequence of $n$ words (i.e., an $n$-gram) $(w_{j,i})_{i=1}^{n}$ extracted from $d_j$ is represented by the average of the embeddings of the words in it. NVSM learns word and document representations considering mini-batches $B$ of ($n$-gram, document) pairs. These representations are learned by minimizing the distance between a document embedding and the representations of the $n$-grams contained in it. During the training phase, L2-normalization, $\mathrm{norm}(\vec{x}) = \frac{\vec{x}}{\|\vec{x}\|}$, is applied to the $n$-gram representations; this process is used to obtain sparser representations. The projection of an $n$-gram into the $k_d$-dimensional document feature space can then be defined as a composition function:

$$\tilde{T}\big((w_{j,i})_{i=1}^{n}\big) = (f \circ \mathrm{norm} \circ g)\big((w_{j,i})_{i=1}^{n}\big). \quad (1)$$

Then, the standardized projection of the $n$-gram representation is obtained by estimating the per-feature sample mean and variance over batch $B$ as follows:

$$T\big((w_{j,i})_{i=1}^{n}\big) = \mathrm{hard\text{-}tanh}\left( \frac{\tilde{T}\big((w_{j,i})_{i=1}^{n}\big) - \hat{E}\big[\tilde{T}\big((w_{j,i})_{i=1}^{n}\big)\big]}{\sqrt{\hat{V}\big[\tilde{T}\big((w_{j,i})_{i=1}^{n}\big)\big]}} + \beta \right). \quad (2)$$

The composition function $g$, combined with the L2-normalization $\mathrm{norm}$, causes words to compete to contribute to the resulting $n$-gram representation. In this way, words that are more representative of the target document will contribute more to the $n$-gram representation. Moreover, the standardization operation forces $n$-gram representations to differentiate themselves only in the dimensions that matter for the matching task. Thus, word representations incorporate a notion of term specificity during the learning process.
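To make Equations (1) and (2) concrete, the following NumPy sketch computes the standardized projection of a mini-batch of n-grams. It is a minimal illustration under our own assumptions, not the shared implementation: the names (word_emb, W, beta) are hypothetical, f is realized as a single linear map, and hard-tanh is implemented as a clip to [-1, 1].

```python
# Minimal sketch of the n-gram projection of Equations (1)-(2).
# Assumptions (ours, not from the paper's code): `word_emb` is a
# |V| x k_w embedding matrix, `W` (k_w x k_d) is the learned map f,
# and `beta` is the k_d-dimensional bias of Equation (2).
import numpy as np

def project_ngrams(ngram_term_ids, word_emb, W, beta, eps=1e-8):
    """ngram_term_ids: (batch, n) integer array of term ids per n-gram."""
    # g: average the word embeddings of each n-gram -> (batch, k_w)
    g = word_emb[ngram_term_ids].mean(axis=1)
    # norm: L2-normalize each averaged representation
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
    # f: linear map from the word space to the document space -> (batch, k_d)
    t_tilde = g @ W
    # Standardize per feature over the batch, add beta, apply hard-tanh
    mean = t_tilde.mean(axis=0, keepdims=True)
    var = t_tilde.var(axis=0, keepdims=True)
    return np.clip((t_tilde - mean) / np.sqrt(var + eps) + beta, -1.0, 1.0)
```

Note how the batch statistics make every projected n-gram depend on the rest of the mini-batch, which is the standardization behavior described above.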
The similarity of two representations in the latent vector space is computed as:

$$P\big(S \mid d_j, (w_{j,i})_{i=1}^{n}\big) = \sigma\big(\vec{d}_j \cdot T\big((w_{j,i})_{i=1}^{n}\big)\big), \quad (3)$$

where $\sigma(t)$ is the sigmoid function and $S$ is a binary indicator stating whether the representation of document $d_j$ is similar to the projection of its $n$-gram $(w_{j,i})_{i=1}^{n}$. The probability of a document $d_j$ given its $n$-gram $(w_{j,i})_{i=1}^{n}$ is then approximated by uniformly sampling $z$ contrastive examples [13]:

$$\log \tilde{P}\big(d_j \mid (w_{j,i})_{i=1}^{n}\big) = \frac{z+1}{2z} \Bigg( z \log P\big(S \mid d_j, (w_{j,i})_{i=1}^{n}\big) + \sum_{\substack{k=1, \\ d_k \sim U(D)}}^{z} \log\Big(1.0 - P\big(S \mid d_k, (w_{j,i})_{i=1}^{n}\big)\Big) \Bigg), \quad (4)$$

where $U(D)$ represents the uniform distribution used to obtain contrastive examples from the documents in $D$. Finally, to optimize the model, the following loss function, averaged over the instances in batch $B$, is used:

$$L(\theta \mid B) = -\frac{1}{m} \sum_{j=1}^{m} \log \tilde{P}\big(d_j \mid (w_{j,i})_{i=1}^{n}\big) + \frac{\lambda}{2m} \left( \sum_{i=1}^{|V|} \|\vec{w}_i\|_2^2 + \sum_{j=1}^{|D|} \|\vec{d}_j\|_2^2 + \|W\|_F^2 \right), \quad (5)$$

where $\theta$ is the set of parameters $\{\vec{w}_i\}_{i=1}^{|V|}$, $\{\vec{d}_j\}_{j=1}^{|D|}$, $W$, $\beta$, and $\lambda$ is a weight regularization hyper-parameter.

After training, a query $q$ is projected into the document feature space by the composition of the functions $f$ and $g$: $h(q) = (f \circ g)(q)$. Finally, the matching score between a document $d_j$ and a query $q$ is given by the cosine similarity of their representations in the document feature space.

4 DOCKER IMAGE ARCHITECTURE

4.1 NVSM Docker Image with CPU support
The NeuIR model we share, i.e., NVSM, is written in Python and relies on TensorFlow v1.13.1. For this reason, we share a Docker image based on the official Python 3.5 runtime container, on top of which we install the Python packages required by the algorithm, such as TensorFlow, NLTK, and Whoosh. We also install a C compiler, i.e., gcc, in order to use the official trec_eval package (https://github.com/usnistgov/trec_eval) to evaluate the retrieval model during training.

Since this Docker image still relies on the host machine for some functions (e.g., random number generation), the results are very similar but not exactly the same across different computers, while they are consistent on the same machine. To enable the replication of our experimental results, we share, in a public repository (http://osirrc.dei.unipd.it/sample_trained_models_nvsm_cpu_gpu/), the model we trained with the nvsm_cpu image; it can be loaded in the shared Docker image if the user decides to skip the training step.

4.2 NVSM Docker Image with GPU support
Our implementation of NVSM is based on TensorFlow, a machine learning library that makes it possible to employ the GPU of the host machine to perform operations more efficiently. For this reason, we created and share in our repository another Docker image (i.e., nvsm_gpu) which can use the GPU of the host machine to speed up the computations of the algorithm. To do so, the host machine running this image needs an NVIDIA GPU and the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) installed. There are many advantages to employing GPUs for scientific computations, and their usage makes a sizeable difference especially when training deep learning models. The training of such models in fact requires a large number of matrix operations that can be easily parallelized and do not individually require powerful hardware. For these reasons, the architecture of GPUs, with thousands of low-power processing units, is particularly well suited to this kind of operation.

The nvsm_gpu image is based on the official TensorFlow GPU Docker image for Python 3. As in the other image that we share, we include a C compiler in order to use trec_eval for the retrieval model evaluation, along with the Python libraries required by NVSM.

In our experiments, we observed that nvsm_gpu does not produce fully consistent results even on the same machine. In fact, TensorFlow uses the Eigen library, which in turn uses CUDA atomic functions to implement reduction operations such as tf.reduce_sum. Those operations are non-deterministic, and each of them can introduce small variations, as also stated in a GitHub issue on the TensorFlow source code (https://github.com/tensorflow/tensorflow/issues/3103). Despite this problem, we still believe that the advantages brought by the usage of a GPU in terms of reduced computational time, combined with the fact that we detected only very small variations in Mean Average Precision at Rank 1000 (MAP), Normalized Discounted Cumulative Gain at Rank 100 (nDCG@100), Precision at Rank 10 (P@10), and Recall, make this implementation of the algorithm a valid alternative to the CPU-based one.
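As a practical note on this non-determinism, only the library-level pseudo-random seeds can be pinned inside the container; the CUDA/Eigen reductions discussed above stay non-deterministic regardless of seeding. A minimal sketch, assuming the TensorFlow 1.x API that NVSM relies on, of what can actually be fixed:

```python
# Minimal sketch (our assumption: TF 1.x API, as used by NVSM) of the
# seeds that can be pinned inside the container. This controls Python,
# NumPy, and TensorFlow graph-level randomness; it does NOT make
# CUDA/Eigen atomic reductions (e.g., inside tf.reduce_sum on GPU)
# deterministic, which is the residual variability we observe.
import os
import random
import numpy as np
import tensorflow as tf

def seed_everything(seed=42):
    os.environ["PYTHONHASHSEED"] = str(seed)  # Python hash randomization
    random.seed(seed)                         # Python stdlib RNG
    np.random.seed(seed)                      # NumPy RNG
    tf.set_random_seed(seed)                  # TF graph-level seed (TF 1.x)
```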
In order to assess the amount of variability due to this non-determinism in the training process, in Section 7 we perform a few tests to evaluate the difference between the run computed with nvsm_cpu and three different runs computed with nvsm_gpu. Finally, to ensure the replicability of our experiments, we share one of the models that we trained using this image in our public repository (http://osirrc.dei.unipd.it/sample_trained_models_nvsm_cpu_gpu/). This model can be loaded by the provided Docker image to perform search on the Robust04 collection, skipping the training step.

5 INTERACTION WITH THE DOCKER IMAGE
The interaction with the shared Docker images is performed via the jig (https://github.com/osirrc/jig). The jig is an interface to perform retrieval operations employing a model embedded in a Docker image. At the time of writing, our image can perform retrieval on the Robust04 collection. The actions supported by the NVSM Docker images that we share are index, train, and search.

5.1 Index
The purpose of this action is to build the collection indices required by the retrieval model. Before the hook is run, the jig mounts the selected document collection at a path passed to the script. Our image then uncompresses and indexes the collection using the Whoosh retrieval library (https://whoosh.readthedocs.io/en/latest/index.html). The index files relative to the collection are finally saved inside the Docker image in order to speed up future search operations, eliminating the time needed to mount the index files in the Docker image.
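For illustration, here is a minimal sketch of what an index hook built on Whoosh can look like. The schema and field names are our assumptions for this example, not necessarily the exact ones used in our image.

```python
# Minimal sketch of a Whoosh-based index hook. The jig mounts the
# (uncompressed) collection; `documents` stands for an iterator over
# its (docno, text) pairs. Schema and field names are illustrative.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.analysis import StemmingAnalyzer

def build_index(index_dir, documents):
    """documents: iterable of (docno, text) pairs from the collection."""
    schema = Schema(docno=ID(stored=True, unique=True),
                    body=TEXT(analyzer=StemmingAnalyzer()))
    os.makedirs(index_dir, exist_ok=True)
    ix = create_in(index_dir, schema)   # create a fresh index on disk
    writer = ix.writer()
    for docno, text in documents:
        writer.add_document(docno=docno, body=text)
    writer.commit()                     # persist the index inside the image
    return ix
```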
5.2 Train
The purpose of the train action is to train a retrieval model. We developed this hook within the jig in order to perform training with the NVSM model, which is the only NeuIR model officially supported by the OSIRRC library at the time of writing. This hook mounts the topics and relevance judgments associated with the selected experimental collection, together with two files containing the topic IDs to use for the test and validation of the model. The support for the evaluation of a model on different subsets of topics can be useful to any supervised NeuIR model which might use the jig in the future, or to learning-to-rank approaches within other Docker images. In the case of NVSM, which is an unsupervised model, we employ the validation subset of topics during training to select the best model, saved after each training epoch. The trained models are saved to a directory indicated by the user on the host machine. This is done in order to keep the Docker image as light as possible and to allow the user to easily inspect the results of the training process.

5.3 Search
The purpose of the search hook is to perform an ad hoc retrieval run; multiple runs can be performed by calling the jig multiple times with different parameters. In order to perform retrieval with the provided Docker image, the user needs to indicate as parameters the path to the directory containing the trained model computed at the previous step and the path to a text file containing the topic IDs on which to perform retrieval. Then, the NVSM trained model which performed best on the validation set of queries at the previous step is loaded by the Docker image, and retrieval is performed on the topics specified in the topic IDs file passed to the jig.

6 EXPERIMENTAL SETUP

6.1 Experimental Collection
To test our Docker images we consider the Robust04 collection, which is composed of the TIPSTER corpus [14], Disks 4 & 5 minus the Congressional Record. The collection contains 528,155 documents, with a vocabulary of 760,467 different words. The topics considered for the evaluation are topics 301-450 and 601-700 from Robust04 [23]. Only the title field of the topics is used for retrieval. The set of topics is split into validation (V) and test (T) sets, as proposed in [22] (splits can be found at https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids). Relevance judgments are restricted accordingly.

The execution times and memory occupation statistics were computed on a 2018 Alienware Area-51 with an Intel Core i9-7980XE CPU @ 2.60GHz with 36 cores, 64GB of RAM, and two GeForce GTX 1080Ti GPUs.

6.2 Evaluation Measures
We use the same measures as [22] to evaluate retrieval effectiveness: MAP, nDCG@100, and P@10. Additionally, we also employ Recall.

6.3 Training
To train the NVSM model, we set the following parameters and hyper-parameters: word representation size $k_w = 300$, number of negative examples $z = 10$, learning rate $\alpha = 0.001$, regularization lambda $\lambda = 0.01$, batch size $m = 51200$, dimensionality of the document representations $k_d = 256$, and $n$-gram size $n = 16$. We train the model for 15 iterations over the document collection and select the model iteration that performs best in terms of MAP. A single iteration consists of $\lceil \frac{1}{m} \sum_{d \in D} (|d| - n + 1) \rceil$ batches, where $d$ is a document in the experimental collection $D$ and $n$ is the $n$-gram size described in Section 3.
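As a worked example of this batch count (with made-up document lengths, not Robust04 statistics):

```python
# Worked example of the batches-per-iteration formula of Section 6.3:
# ceil((1/m) * sum over documents of (|d| - n + 1)).
# The document lengths below are invented for illustration.
import math

def batches_per_iteration(doc_lengths, m=51200, n=16):
    total_ngrams = sum(length - n + 1 for length in doc_lengths if length >= n)
    return math.ceil(total_ngrams / m)

# Three toy documents of 100, 500, and 1000 terms yield
# 85 + 485 + 985 = 1555 n-grams, i.e., a single batch of size m.
print(batches_per_iteration([100, 500, 1000]))  # -> 1
```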
6.4 Performance Differences Evaluation
In order to evaluate the differences between the rankings produced with the two NVSM Docker images, we consider the following measures (a sketch of how they can be computed follows the list).
• Root Mean Square Error (RMSE): this measure indicates how close the performance scores of two systems are [16] (the lower the better) and considers the values of a measure $M(\cdot)$ (e.g., MAP, nDCG@100) chosen for the evaluation. RMSE is defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (M_{0,i} - M_{1,i})^2}, \quad (6)$$

where $T$ is the total number of topics, $M_{0,i}$ is the chosen evaluation measure computed on the first run, and $M_{1,i}$ is the same measure computed on the second run, both relative to the $i$-th topic.
• Kendall's τ correlation coefficient: since different runs may produce the same RMSE score, we also measure how close the ranked result lists of two systems are. This is measured using Kendall's τ correlation coefficient [15] among the lists of retrieved documents for each topic, averaged across all topics. The Kendall's τ correlation coefficient on a single topic is given by:

$$\tau_i(\mathrm{run}_0, \mathrm{run}_1) = \frac{P - Q}{\sqrt{(P + Q + U)(P + Q + V)}}, \quad (7)$$

where $P$ is the total number of concordant pairs (document pairs that are ranked in the same order in both rankings), $Q$ is the total number of discordant pairs (document pairs that are ranked in opposite order in the two rankings), and $U$ and $V$ are the numbers of ties in the first and in the second ranking, respectively. To compare two runs over a set of $T$ topics, Equation (7) becomes:

$$\tau(\mathrm{run}_0, \mathrm{run}_1) = \frac{1}{T} \sum_{i=1}^{T} \tau_i(\mathrm{run}_0, \mathrm{run}_1). \quad (8)$$

The range of this measure is $[-1, 1]$, where 1 indicates a perfect correlation between the order of the documents in the considered runs and -1 indicates that the rankings associated with each topic in the two runs are one the inverse of the other.
• Jaccard index: since different runs might contain a different set of relevant documents for each topic, we consider the average of the Jaccard index of each of these sets, over all topics. We compute this value as:

$$\mathrm{sim}(\mathrm{run}_1, \mathrm{run}_2) = \frac{1}{T} \sum_{i=1}^{T} \frac{|\mathit{rd\_run1}_i \cap \mathit{rd\_run2}_i|}{|\mathit{rd\_run1}_i \cup \mathit{rd\_run2}_i|}, \quad (9)$$

where $\mathit{rd\_run1}_i$ and $\mathit{rd\_run2}_i$ are the sets of relevant documents retrieved for topic $i$ in $\mathrm{run}_1$ and $\mathrm{run}_2$, respectively.
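To make the three comparisons concrete, the sketch below shows one way to compute them. The data structures are illustrative assumptions (each run as a dict from topic ID to its ranked list of document IDs), and Kendall's τ is delegated to scipy.stats.kendalltau, whose default tau-b variant matches the tie handling of Equation (7).

```python
# Minimal sketch (with illustrative data structures) of the measures
# in Section 6.4. `per-topic scores` dicts map a topic id to the value
# of the chosen measure M (e.g., MAP) for that topic; runs map a topic
# id to its ranked list of document ids.
import numpy as np
from scipy.stats import kendalltau  # tau-b, i.e., Equation (7) with ties

def rmse(scores_run0, scores_run1, topics):
    diffs = [scores_run0[t] - scores_run1[t] for t in topics]
    return float(np.sqrt(np.mean(np.square(diffs))))

def avg_kendall_tau(run0, run1, topics, depth=100):
    taus = []
    for t in topics:
        a, b = run0[t][:depth], run1[t][:depth]
        shared = [d for d in a if d in b]  # docs retrieved by both runs
        # Compare the rank positions of the shared documents; this
        # simplification assumes at least two shared documents per topic.
        ranks_a = [a.index(d) for d in shared]
        ranks_b = [b.index(d) for d in shared]
        taus.append(kendalltau(ranks_a, ranks_b).correlation)
    return float(np.mean(taus))

def avg_jaccard(rel_run0, rel_run1, topics):
    """rel_run*: dict from topic id to the set of retrieved relevant docs."""
    sims = [len(rel_run0[t] & rel_run1[t]) / len(rel_run0[t] | rel_run1[t])
            for t in topics]
    return float(np.mean(sims))
```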
RMSE and Kendall's τ correlation coefficient have been adopted to evaluate the differences between rankings for reproducibility purposes in CENTRE@CLEF [9], whereas the Jaccard index is used here for the first time for this purpose.

We use these measures to evaluate the differences in the rankings produced by the NVSM Docker images. Since the image with GPU support is not fully deterministic, we compute three different runs on the same machine and analyze the differences between them. Also, since NVSM does not retrieve any relevant document for four topics (312, 316, 348, and 379), because none of their terms are present in the NVSM term dictionary, we remove these topics from the pool and consider, only for the comparison between different runs using RMSE, Kendall's τ, and the Jaccard index, a total of 196 out of the 200 test topics indicated in the test split file available in the OSIRRC repository (https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids).

7 EVALUATION
In Table 1, we report the statistics relative to the disk space and memory occupation of our images. We also include the time required by each image to complete one training epoch. The first thing that we observe is that the CPU Docker image takes less space on disk than the GPU one. This is because the former does not need all of the drivers and libraries required by the GPU version of TensorFlow; in fact, these libraries make the nvsm_gpu image three times larger than the other one. We also point out that the GPU memory usage reported in Table 1 is proportional to the memory available on the GPU of the host machine. In fact, TensorFlow, if no explicit limitation is imposed by the user, allocates most of the available GPU memory to speed up computations. For this reason, since the GPU used in these experiments has 11GB of memory, the space used is 10.76GB. If the GPU had less memory available, TensorFlow would adjust to it and use as little as 8GB, which is in fact the minimum GPU memory requirement to run the nvsm_gpu Docker image according to our tests.

Table 1: Analysis of the disk space, memory occupation, and execution time of the shared Docker images.

                               NVSM (CPU)   NVSM (GPU)
Disk occupation (image only)   1.1GB        3.55GB
Index disk size (Robust04)           4.96GB
Maximum RAM occupation         16GB         10GB
GPU memory usage               –            10.76GB
Execution time (1 epoch)       8h           2h30m

In Table 2, we report the retrieval results obtained with the two shared Docker images. From these results, we observe that there are small differences, always within ±0.01, between the runs obtained with nvsm_gpu on the same machine and the ones obtained with nvsm_cpu on different machines. The causes of these small differences are described in Section 4 and are related to how the optimization process of the model is managed by TensorFlow within Docker.

Table 2: Retrieval results on the Robust04 (T) collection computed with the two shared Docker images of NVSM.

              MAP     nDCG@100   P@10    Recall
CPU (run 0)   0.138   0.271      0.285   0.6082
GPU (run 0)   0.137   0.265      0.277   0.6102
GPU (run 1)   0.138   0.270      0.277   0.6066
GPU (run 2)   0.137   0.268      0.270   0.6109

The MAP, nDCG@100, P@10, and Recall values obtained with the two images are all very similar, and close to the measures reported in the original NVSM paper [22]. Indeed, the absolute difference between the MAP, nDCG@100, and P@10 values reported in [22] and our results is always less than 0.02. As a side note, the MAP values obtained by NVSM are low when compared to the other approaches on Robust04 that can be found in the OSIRRC 2019 library, even 10% lower than some methods that do not apply re-ranking.

In order to further evaluate the performance differences between the runs, we begin by computing the RMSE considering the MAP, nDCG@100, and P@10 measures. The RMSE gives us an idea of the performance difference between two runs, averaged across the considered topics. We first compute the average values of MAP, nDCG@100, and P@10 over the three nvsm_gpu runs on each topic. Then, we compare these averaged performance measures, for each topic, against the corresponding ones associated with the CPU-based NVSM run we obtained on our machine. These results are reported in Table 3. From this evaluation we observe that the average performance difference across the considered 196 topics is very low for the MAP and nDCG@100 measures, while it grows when we consider the top part of the rankings (P@10). In conclusion, the RMSE value is generally low; hence, we can confidently say that the models behave in a very similar way in terms of MAP, nDCG@100, and P@10 on all the considered topics.

Table 3: RMSE (the lower the better) between the NVSM CPU run and the average of the three runs computed with the NVSM GPU Docker image, considering MAP at rank 1000 (MAP), nDCG at rank 100 (nDCG@100), and Precision at rank 10 (P@10).

                   NVSM GPU (average)
RMSE (MAP)         0.034
RMSE (nDCG@100)    0.054
RMSE (P@10)        0.140

In Table 4, we report the Kendall's τ values associated with each pair of runs that we computed. This measure shows how similar the considered rankings are to each other. In our case, the runs appear to be quite different from each other, since the Kendall's τ values are all close to 0. In other words, when considering the top 100 results in each run, the same documents are rarely in the same positions in the selected rankings. This result, combined with the fact that the runs all achieve similar MAP, nDCG@100, P@10, and Recall values, leads to the conclusion that the relevant documents are ranked high in the rankings, but not in the same positions. In other words, NVSM performs a permutation of the documents in the runs, maintaining however the relative order between relevant and non-relevant documents.

Table 4: Kendall's τ correlation coefficient values between the runs computed with the NVSM GPU and CPU Docker images, considering the top 100 ranked documents in each run.

              GPU (run 0)   GPU (run 1)   GPU (run 2)   CPU
GPU (run 0)   1.0           0.025         0.025         0.018
GPU (run 1)   0.025         1.0           0.089         0.014
GPU (run 2)   0.025         0.089         1.0           0.009
CPU           0.018         0.014         0.009         1.0
To validate our hypothesis, we report in Figure 1, for each pair of runs, the Jaccard index between the sets of relevant documents, averaged over all topics, as described in Section 6. These values help us assess whether, and by how much on average, the sets of relevant documents retrieved for each topic differ across our runs. In this case, we observe that the runs computed with the GPU have more in common with each other than with the run computed with the CPU. However, the Jaccard index values are all very high, and this confirms our previous hypothesis about the rankings. In fact, it implies that the runs contain similar sets of relevant documents, which are however in different relative positions (because we have a low Kendall's τ correlation coefficient) but in the same portion of the rankings (because we obtain similar and relatively high nDCG@100 and P@10 values over all runs).

Figure 1: Heatmap of the average Jaccard index between the sets of relevant documents retrieved for each topic by the NVSM Docker images. The underlying values are:

            run_0_cpu   run_0_gpu   run_1_gpu   run_2_gpu
run_0_cpu   1.00        0.81        0.81        0.81
run_0_gpu   0.81        1.00        0.86        0.86
run_1_gpu   0.81        0.86        1.00        0.97
run_2_gpu   0.81        0.86        0.97        1.00

To qualitatively assess the differences between the runs from a user perspective, we select one topic (301: "International Organized Crime") and report in Table 5 the top five document IDs for each run. The results in this table confirm our previous intuition. In fact, we observe that the majority of the top-ranked documents for topic 301 are relevant in each run, but these documents differ slightly across runs. Also, we observe that most of the relevant documents retrieved by nvsm_gpu are the same across the GPU runs, while only two of the relevant documents retrieved by nvsm_cpu are also found in the other runs. For instance, we observe that document FBIS4-45469 is ranked in the top-5 positions only in the CPU run. Similarly, document FBIS4-43965 appears only in GPU run 0. These apparently small differences can have a sizeable impact on the user experience, and should be taken into consideration when choosing to employ a NeuIR model in real-world scenarios.

Table 5: Top 5 documents in the runs computed with nvsm_cpu and nvsm_gpu. Relevant documents are highlighted in bold.

CPU           GPU (run 0)   GPU (run 1)   GPU (run 2)
FBIS3-55219   FBIS3-55219   FBIS3-55219   FBIS3-55219
FBIS4-41991   FBIS4-7811    FBIS4-7811    FBIS4-7811
FBIS4-45469   FBIS4-43965   FBIS4-41991   FBIS4-41991
FBIS3-54945   FBIS3-23986   FBIS3-23986   FBIS3-23986
FBIS4-7811    FBIS4-41991   FBIS4-65446   FBIS4-65446
8 FINAL REMARKS
In this work, we performed a replicability study of the Neural Vector Space Model (NVSM) retrieval model using Docker. First, we presented the architecture and the main functions of a Docker image designed for the replicability of Neural IR (NeuIR) models. The described architecture is compatible with the jig developed in the OSIRRC 2019 workshop and supports the index (to index an experimental collection), train (to train a retrieval model), and search (to perform document retrieval) actions. Secondly, we described the image components and the engineering challenges of obtaining deterministic results with Docker using popular machine learning libraries such as TensorFlow. We also share two Docker images of the NVSM model, which are part of the OSIRRC 2019 library: the first relies only on the CPU of the host machine to perform its operations, while the second is also able to exploit the GPU of the host machine, when available, to perform more expensive computations such as the training of the NVSM model. Finally, we performed an in-depth evaluation of the differences between the runs obtained with the two images, presenting some insights which also hold for other NeuIR models relying on CUDA and TensorFlow.

In fact, we observed some differences, which are hard to spot when looking only at the average performance, between the runs computed by the nvsm_cpu Docker images on different machines and between the runs computed by the nvsm_cpu and nvsm_gpu Docker images on the same machine. The differences between nvsm_cpu images on different machines are related to the non-determinism of the results, as Docker relies on the host machine for some basic operations, which influence the model optimization process through the generation of different pseudo-random number sequences. On the other hand, the differences between nvsm_gpu images on the same machine are due to the implementation of some functions in the CUDA and TensorFlow libraries. We observed that these operations influence in a sizeable way the ordering of the same documents across different runs, but not the overall distribution of relevant and non-relevant documents in the ranking. Similar differences, even more accentuated, can be found between nvsm_cpu and nvsm_gpu images on the same machine. Therefore, even though these differences may seem marginal in offline evaluation settings, where the focus is on average performance, they are extremely relevant for user-oriented online settings, as they can have a sizeable impact on the user experience, and should thus be taken into consideration when deciding whether to use NeuIR models in real-world scenarios.

We also share, in our public repository, the models we trained on our machine with both the nvsm_cpu and nvsm_gpu Docker images, as this is fundamental to enable replicability [12]. These can be loaded by the Docker images in order to perform document retrieval with the same models we used and obtain the same runs.
Docker images on the same machine. The differences between Springer, 408–420. nvsm_cpu images on different machines are related to the non- [17] C. Macdonald, R. McCreadie, R. L. Santos, and I. Ounis. 2012. From puppy to maturity: Experiences in developing terrier. Proc. of OSIR at SIGIR (2012), 60–63. determinism of the results, as Docker relies on the host machine [18] S. Marchesin, A. Purpura, and G. Silvello. 2019. A Neural Vector Space Model for some basic operations which influence the model optimization Implementation Repository. https://github.com/giansilv/NeuralIR/ [19] T. Sakai, N. Ferro, I. Soboroff, Z. Zeng, P. Xiao, and M. Maistro. 2019. Overview process through the generation of different pseudo-random number of the NTCIR-14 CENTRE Task. In Proceedings of the 14th NTCIR Conference on sequences. On the other hand, the differences between nvsm_gpu Evaluation of Information Access Technologies. Tokyo, Japan. images on the same machine are due to the implementation of some [20] I. Soboroff, N. Ferro, and T. Sakai. 2018. Overview of the TREC 2018 CENTRE Track. In The Twenty-Seventh Text REtrieval Conference Proceedings (TREC 2018). functions in the CUDA and Tensorflow libraries. We observed that [21] A. Trotman, C. L. A. Clarke, I. Ounis, S. Culpepper, M. A. Cartright, and S. Geva. these operations influence in a sizeable way the ordering of the 2012. Open source information retrieval: a report on the SIGIR 2012 workshop. same documents across different runs, but not the overall distribu- In ACM SIGIR Forum, Vol. 46. ACM, 95–101. [22] C. Van Gysel, M. de Rijke, and E. Kanoulas. 2018. Neural Vector Spaces for tion of relevant and non-relevant documents in the ranking. Similar Unsupervised Information Retrieval. ACM Trans. Inf. Syst. 36, 4 (2018), 38:1– differences, that are even more accentuated, can be found between 38:25. [23] E. M. Voorhees. 2005. The TREC Robust Retrieval Track. ACM SIGIR Forum 39, 1 nvsm_cpu and nvsm_gpu images on the same machine. Therefore, (2005), 11–20. even though these differences may seem marginal in offline eval- [24] I. Vulić and M. F. Moens. 2015. Monolingual and cross-lingual information uation settings, where the focus is on average performance, they retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th international ACM SIGIR conference on research and development in information are extremely relevant for user-oriented online settings – as they retrieval. ACM, 363–372. can have a sizeable impact on the user experience and should thus [25] P. Yang, H. Fang, and J. Lin. 2018. Anserini: Reproducible Ranking Baselines Using be taken into consideration when deciding whether to use NeuIR Lucene. J. Data and Information Quality 10, 4, Article 16 (Oct. 2018), 20 pages. https://doi.org/10.1145/3239571 models in real-world scenarios. [26] C. Zhai and J. Lafferty. 2004. A study of smoothing methods for language models We also share the models we trained on our machine with both applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22, 2 (2004), 179–214. the nvsm_cpu and nvsm_gpu Docker images in our public repository, as it is fundamental to enable replicability [12]. These can be loaded by the docker image in order to perform document retrieval with the same models we used and obtain the same runs. REFERENCES [1] M. Agosti, N. Ferro, and C. Thanos. 2012. DESIRE 2011: workshop on data infrastructurEs for supporting information retrieval evaluation.. In SIGIR Forum, Vol. 46. Citeseer, 51–55. 43