A Docker-Based Replicability Study of a Neural Information Retrieval Model

Nicola Ferro, Stefano Marchesin, Alberto Purpura, Gianmaria Silvello
Department of Information Engineering, University of Padua, Italy
{ferro,marches1,purpuraa,silvello}@dei.unipd.it

ABSTRACT
In this work, we propose a Docker image architecture for the replicability of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) [22], an unsupervised NeuIR model. The first image we share (nvsm_cpu) can run on most machines and relies only on the CPU to perform the required computations. The second image we share (nvsm_gpu) relies instead on the Graphics Processing Unit (GPU) of the host machine, when available, to perform computationally intensive tasks, such as the training of the NVSM model. Furthermore, we discuss some insights on the engineering challenges we encountered in obtaining deterministic and consistent results from NeuIR models relying on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the use within Docker of the TensorFlow and CUDA libraries, whose inherent randomness alters, under certain circumstances, the relative order of documents in the rankings.

CCS CONCEPTS
• Information systems → Information retrieval; Retrieval models and ranking; Evaluation of retrieval results; • Computing methodologies → Unsupervised learning.

KEYWORDS
Docker, Neural Information Retrieval, Replicability, Reproducibility

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 INTRODUCTION
Following some recent efforts on reproducibility, like the CENTRE evaluations at CLEF [7, 9], NTCIR [19] and TREC [20], or the SIGIR task force to implement ACM's policy on artifact review and badging [8], the Open-Source IR Replicability Challenge at SIGIR 2019 (OSIRRC 2019) aims at addressing the replicability issue in ad hoc document retrieval [3]. OSIRRC 2019's vision is to build Docker-based (https://www.docker.com/) infrastructures to replicate results on standard ad hoc test collections. Docker is a tool that allows for the creation and deployment of applications via images containing all the required dependencies. Relying on a Docker-based infrastructure to replicate the results of existing systems helps researchers avoid all the issues related to system requirements and dependencies. Indeed, Information Retrieval (IR) platforms such as Anserini [25] and Terrier [17], or text matching libraries such as MatchZoo [5], rely on a set of software tools, developed in Java or Python and based on numerous libraries for scientific computing, which all have to be available on the host machine for the applications to run smoothly.

Therefore, OSIRRC 2019 aims to ease the use of such platforms, and of retrieval approaches in general, by providing Docker images that replicate IR models on ad hoc document collections. To maximize the impact of such an effort, OSIRRC 2019 sets three main goals:
(1) Develop a common Docker interface specification to support images that capture systems performing ad hoc retrieval experiments on standard test collections. The proposed solution is known as the jig.
(2) Build a curated library of Docker images that work with the jig to capture a diversity of systems and retrieval models.
(3) Explore the possibility of broadening these efforts to include additional tasks, evaluation methodologies, and benchmark initiatives.

OSIRRC 2019 gives us the opportunity to investigate the replicability (and reproducibility), as described in the ACM guidelines discussed in [11], of Neural IR (NeuIR) models, for which these issues are especially relevant. Indeed, NeuIR models are highly sensitive to parameters, hyper-parameters, and pre-processing choices, which hampers researchers who want to study, evaluate, and compare NeuIR models against state-of-the-art approaches. Also, these models are usually compatible only with specific versions of the libraries they rely on (e.g., TensorFlow), because these frameworks are constantly updated. The use of Docker images is a possible solution to avoid these deployment issues on different machines, as an image already includes all the libraries required by the contained application.
For this reason, (i) we propose a Docker architecture that can be used as a framework to train, test, and evaluate NeuIR models, and that is compatible with the jig introduced by OSIRRC 2019; and (ii) we show how this architecture can be employed to build a Docker image that replicates the Neural Vector Space Model (NVSM) [22], a state-of-the-art unsupervised neural model for ad hoc retrieval. We rely on our shared TensorFlow (https://www.tensorflow.org/) implementation of NVSM [18]. The model is trained, tested, and evaluated on the TREC Robust04 collection [23]. The contributions of this work are the following:
• we present a Docker architecture for NeuIR models that is compliant with the OSIRRC 2019 jig requirements. The architecture supports three functions, index, train, and search, which are the same actions typically performed by NeuIR models;
• we share two Docker images to replicate the NVSM results on the Robust04 collection: nvsm_cpu, which relies on one or more CPUs for its computations and is compatible with most machines, and nvsm_gpu, which supports parallel computing using an NVIDIA Graphics Processing Unit (GPU);
• we perform extensive experimental evaluations to explore the replicability challenges of NeuIR models (i.e., NVSM) with Docker.

Our NVSM Docker images are part of the OSIRRC 2019 library (https://github.com/osirrc/osirrc2019-library/#NVSM). The source code, system runs, and additional required data can be found in [10]. The findings presented in this paper contributed to the definition of the "training" hook within the jig.
The rest of the paper is organized as follows: Section 2 presents an overview of previous and related initiatives for repeatability, replicability, and reproducibility. Section 3 describes the NVSM model. Section 4 presents the Docker image architecture, whereas Section 5 describes how to interact with the provided Docker images. Section 6 presents the experimental setup and Section 7 shows the obtained results. Finally, Section 8 discusses the outcomes of our experiments and provides insights on the replicability challenges and issues of NeuIR models.

2 RELATED WORK
Repeatability, replicability, and reproducibility are fundamental aspects of computational sciences, both in supporting desirable scientific methodology and in sustaining empirical progress. Recent ACM guidelines (https://www.acm.org/publications/policies/artifact-review-badging/) define these concepts as follows:
• Repeatability: a researcher can reliably repeat his/her own computation (same team, same experimental setup).
• Replicability: an independent group can obtain the same result using the author's own artifacts (different team, same experimental setup).
• Reproducibility: an independent group can obtain the same result using artifacts which they develop completely independently (different team, different experimental setup).

These guidelines have been discussed and analyzed in depth in the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science", held on 24-29 January 2016 [11], which focused on the core issues and approaches to reproducibility in several fields of computer science. One of the outcomes of the seminar was the Platform, Research goal, Implementation, Method, Actor and Data (PRIMAD) model, which tackles reproducibility from different angles. The PRIMAD model acts as a framework to distinguish the main elements describing an experiment in computer science, as there are many different terms related to various kinds of reproducibility [4]. The main aspects of PRIMAD are the following:
• Research Goal (R) characterizes the purpose of a study;
• Method (M) is the specific approach proposed or considered by the researcher;
• Implementation (I) refers to the actual implementation of the method, usually in some programming language;
• Platform (P) describes the underlying hardware and software, like the operating system and the computer used;
• Data (D) consists of two parts, namely the input data and the specific parameters chosen to carry out the method;
• Actor (A) refers to the experimenter.

Along with the main aspects of the PRIMAD model, there are two other relevant variables that need to be taken into account for reproducibility: transparency and consistency. Transparency is the ability to verify that all the necessary components of an experiment perform as they claim; consistency refers to the success or failure of a reproducibility experiment in terms of consistent outcomes.

The PRIMAD paradigm has been adopted by the IR community, where it has been adapted to the context of IR evaluation, both system-oriented and user-oriented [6]. Nevertheless, reproducibility in IR is still a critical concept, which requires, among other things, infrastructures to manage experimental data [1], off-the-shelf open-source IR systems [21], and reproducible baselines [2].

In this context, our contribution at OSIRRC 2019 lies between replicability and reproducibility. Indeed, by relying on the NVSM implementation available at [18], we replicate the results of a reproduced version of NVSM. Therefore, we do not completely adhere to the definition of replicability provided by the ACM, as we rely on an independent implementation of NVSM rather than the one proposed in [22]. Regardless, we believe that our contribution of a Docker architecture for NeuIR models, along with the two produced NVSM Docker images, can shed some light on the replicability and reproducibility issues of NeuIR models. Besides, the off-the-shelf nature of Docker images can help future researchers replicate/reproduce our results easily and consistently.

3 NEURAL VECTOR SPACE MODEL
The Neural Vector Space Model (NVSM) is a state-of-the-art unsupervised model for ad hoc retrieval. The model achieves competitive results against traditional lexical models, like Query Language Models (QLM) [26], and outperforms state-of-the-art unsupervised semantic retrieval models, like the Word2Vec-based models presented in [24]. Below, we describe the main characteristics of NVSM.

Given a document collection $D = \{d_j\}_{j=1}^{M}$ and the associated lexicon $V = \{w_i\}_{i=1}^{N}$, NVSM considers the vector representations $\vec{w}_i \in \mathbb{R}^{k_w}$ and $\vec{d}_j \in \mathbb{R}^{k_d}$, where $k_w$ and $k_d$ denote the dimensionality of word and document vectors.
Due to the different sizes of word and document embeddings, the word feature space is mapped to the document feature space through a series of transformations learned by the model.

A sequence of $n$ words (i.e., an $n$-gram) $(w_{j,i})_{i=1}^{n}$ extracted from $d_j$ is represented by the average of the embeddings of the words in it. NVSM learns word and document representations considering mini-batches $B$ of ($n$-gram, document) pairs. These representations are learned by minimizing the distance between a document embedding and the representations of the $n$-grams contained in it. During the training phase, L2-normalization, $\mathrm{norm}(\vec{x}) = \frac{\vec{x}}{\|\vec{x}\|}$, is applied to the $n$-gram representations; this process is used to obtain sparser representations. The projection of an $n$-gram into the $k_d$-dimensional document feature space can then be defined as a composition function:

$$\tilde{T}\big((w_{j,i})_{i=1}^{n}\big) = (f \circ \mathrm{norm} \circ g)\big((w_{j,i})_{i=1}^{n}\big). \quad (1)$$

Then, the standardized projection of the $n$-gram representation is obtained by estimating the per-feature sample mean and variance over batch $B$ as follows:

$$T\big((w_{j,i})_{i=1}^{n}\big) = \mathrm{hard\text{-}tanh}\left( \frac{\tilde{T}\big((w_{j,i})_{i=1}^{n}\big) - \hat{E}\big[\tilde{T}\big((w_{j,i})_{i=1}^{n}\big)\big]}{\sqrt{\hat{V}\big[\tilde{T}\big((w_{j,i})_{i=1}^{n}\big)\big]}} + \beta \right). \quad (2)$$

The composition function $g$, combined with the L2-normalization $\mathrm{norm}$, causes words to compete to contribute to the resulting $n$-gram representation. In this way, words that are more representative of the target document will contribute more to the $n$-gram representation. Moreover, the standardization operation forces $n$-gram representations to differentiate themselves only in the dimensions that matter for the matching task. Thus, word representations incorporate a notion of term specificity during the learning process.
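To make Equations (1) and (2) concrete, the following NumPy sketch computes the standardized projection of a mini-batch of n-grams. It is a minimal illustration under our own assumptions, not the shared implementation: the names (word_emb, W, beta) are hypothetical, f is realized as a single linear map, and hard-tanh is implemented as a clip to [-1, 1].

```python
# Minimal sketch of the n-gram projection of Equations (1)-(2).
# Assumptions (ours, not from the paper's code): `word_emb` is a
# |V| x k_w embedding matrix, `W` (k_w x k_d) is the learned map f,
# and `beta` is the k_d-dimensional bias of Equation (2).
import numpy as np

def project_ngrams(ngram_term_ids, word_emb, W, beta, eps=1e-8):
    """ngram_term_ids: (batch, n) integer array of term ids per n-gram."""
    # g: average the word embeddings of each n-gram -> (batch, k_w)
    g = word_emb[ngram_term_ids].mean(axis=1)
    # norm: L2-normalize each averaged representation
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
    # f: linear map from the word space to the document space -> (batch, k_d)
    t_tilde = g @ W
    # Standardize per feature over the batch, add beta, apply hard-tanh
    mean = t_tilde.mean(axis=0, keepdims=True)
    var = t_tilde.var(axis=0, keepdims=True)
    return np.clip((t_tilde - mean) / np.sqrt(var + eps) + beta, -1.0, 1.0)
```

Note how the batch statistics make every projected n-gram depend on the rest of the mini-batch, which is the standardization behavior described above.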
The similarity of two representations in the latent vector space is computed as:

$$P\big(S \mid d_j, (w_{j,i})_{i=1}^{n}\big) = \sigma\big(\vec{d}_j \cdot T\big((w_{j,i})_{i=1}^{n}\big)\big), \quad (3)$$

where $\sigma(t)$ is the sigmoid function and $S$ is a binary indicator stating whether the representation of document $d_j$ is similar to the projection of its $n$-gram $(w_{j,i})_{i=1}^{n}$. The probability of a document $d_j$ given its $n$-gram $(w_{j,i})_{i=1}^{n}$ is then approximated by uniformly sampling $z$ contrastive examples [13]:

$$\log \tilde{P}\big(d_j \mid (w_{j,i})_{i=1}^{n}\big) = \frac{z+1}{2z} \Bigg( z \log P\big(S \mid d_j, (w_{j,i})_{i=1}^{n}\big) + \sum_{\substack{k=1, \\ d_k \sim U(D)}}^{z} \log\Big(1.0 - P\big(S \mid d_k, (w_{j,i})_{i=1}^{n}\big)\Big) \Bigg), \quad (4)$$

where $U(D)$ represents the uniform distribution used to obtain contrastive examples from the documents in $D$. Finally, to optimize the model, the following loss function, averaged over the instances in batch $B$, is used:

$$L(\theta \mid B) = -\frac{1}{m} \sum_{j=1}^{m} \log \tilde{P}\big(d_j \mid (w_{j,i})_{i=1}^{n}\big) + \frac{\lambda}{2m} \left( \sum_{i=1}^{|V|} \|\vec{w}_i\|_2^2 + \sum_{j=1}^{|D|} \|\vec{d}_j\|_2^2 + \|W\|_F^2 \right), \quad (5)$$

where $\theta$ is the set of parameters $\{\vec{w}_i\}_{i=1}^{|V|}$, $\{\vec{d}_j\}_{j=1}^{|D|}$, $W$, $\beta$, and $\lambda$ is a weight regularization hyper-parameter.

After training, a query $q$ is projected into the document feature space by the composition of the functions $f$ and $g$: $h(q) = (f \circ g)(q)$. Finally, the matching score between a document $d_j$ and a query $q$ is given by the cosine similarity of their representations in the document feature space.

4 DOCKER IMAGE ARCHITECTURE

4.1 NVSM Docker Image with CPU support
The NeuIR model we share, i.e., NVSM, is written in Python and relies on TensorFlow v1.13.1. For this reason, we share a Docker image based on the official Python 3.5 runtime container, on top of which we install the Python packages required by the algorithm, such as TensorFlow, NLTK, and Whoosh. We also install a C compiler, i.e., gcc, in order to use the official trec_eval package (https://github.com/usnistgov/trec_eval) to evaluate the retrieval model during training.

Since this Docker image still relies on the host machine for some functions (e.g., random number generation), the results are very similar but not exactly the same across different computers, while they are consistent on the same machine. To enable the replication of our experimental results, we share, in a public repository (http://osirrc.dei.unipd.it/sample_trained_models_nvsm_cpu_gpu/), the model we trained with the nvsm_cpu image; it can be loaded in the shared Docker image if the user decides to skip the training step.

4.2 NVSM Docker Image with GPU support
Our implementation of NVSM is based on TensorFlow, a machine learning library that makes it possible to employ the GPU of the host machine to perform operations more efficiently. For this reason, we created and share in our repository another Docker image (i.e., nvsm_gpu) which can use the GPU of the host machine to speed up the computations of the algorithm. To do so, the host machine running this image needs an NVIDIA GPU and the nvidia-docker runtime (https://github.com/NVIDIA/nvidia-docker) installed. There are many advantages to employing GPUs for scientific computations, and their usage makes a sizeable difference especially when training deep learning models. The training of such models in fact requires a large number of matrix operations that can be easily parallelized and do not individually require powerful hardware. For these reasons, the architecture of GPUs, with thousands of low-power processing units, is particularly well suited to this kind of operation.

The nvsm_gpu image is based on the official TensorFlow GPU Docker image for Python 3. As in the other image that we share, we include a C compiler in order to use trec_eval for the retrieval model evaluation, along with the Python libraries required by NVSM.

In our experiments, we observed that nvsm_gpu does not produce fully consistent results even on the same machine. In fact, TensorFlow uses the Eigen library, which in turn uses CUDA atomic functions to implement reduction operations such as tf.reduce_sum. Those operations are non-deterministic, and each of them can introduce small variations, as also stated in a GitHub issue on the TensorFlow source code (https://github.com/tensorflow/tensorflow/issues/3103). Despite this problem, we still believe that the advantages brought by the usage of a GPU in terms of reduced computational time, combined with the fact that we detected only very small variations in Mean Average Precision at Rank 1000 (MAP), Normalized Discounted Cumulative Gain at Rank 100 (nDCG@100), Precision at Rank 10 (P@10), and Recall, make this implementation of the algorithm a valid alternative to the CPU-based one.
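As a practical note on this non-determinism, only the library-level pseudo-random seeds can be pinned inside the container; the CUDA/Eigen reductions discussed above stay non-deterministic regardless of seeding. A minimal sketch, assuming the TensorFlow 1.x API that NVSM relies on, of what can actually be fixed:

```python
# Minimal sketch (our assumption: TF 1.x API, as used by NVSM) of the
# seeds that can be pinned inside the container. This controls Python,
# NumPy, and TensorFlow graph-level randomness; it does NOT make
# CUDA/Eigen atomic reductions (e.g., inside tf.reduce_sum on GPU)
# deterministic, which is the residual variability we observe.
import os
import random
import numpy as np
import tensorflow as tf

def seed_everything(seed=42):
    os.environ["PYTHONHASHSEED"] = str(seed)  # Python hash randomization
    random.seed(seed)                         # Python stdlib RNG
    np.random.seed(seed)                      # NumPy RNG
    tf.set_random_seed(seed)                  # TF graph-level seed (TF 1.x)
```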
In order to assess the amount of variability due to this non-determinism in the training process, in Section 7 we perform a few tests to evaluate the difference between the run computed with nvsm_cpu and three different runs computed with nvsm_gpu. Finally, to ensure the replicability of our experiments, we share one of the models that we trained using this image in our public repository (http://osirrc.dei.unipd.it/sample_trained_models_nvsm_cpu_gpu/). This model can be loaded by the provided Docker image to perform search on the Robust04 collection, skipping the training step.

5 INTERACTION WITH THE DOCKER IMAGE
The interaction with the shared Docker images is performed via the jig (https://github.com/osirrc/jig). The jig is an interface to perform retrieval operations employing a model embedded in a Docker image. At the time of writing, our image can perform retrieval on the Robust04 collection. The actions supported by the NVSM Docker images that we share are index, train, and search.

5.1 Index
The purpose of this action is to build the collection indices required by the retrieval model. Before the hook is run, the jig mounts the selected document collection at a path passed to the script. Our image then uncompresses and indexes the collection using the Whoosh retrieval library (https://whoosh.readthedocs.io/en/latest/index.html). The index files relative to the collection are finally saved inside the Docker image in order to speed up future search operations, eliminating the time needed to mount the index files in the Docker image.
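For illustration, here is a minimal sketch of what an index hook built on Whoosh can look like. The schema and field names are our assumptions for this example, not necessarily the exact ones used in our image.

```python
# Minimal sketch of a Whoosh-based index hook. The jig mounts the
# (uncompressed) collection; `documents` stands for an iterator over
# its (docno, text) pairs. Schema and field names are illustrative.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.analysis import StemmingAnalyzer

def build_index(index_dir, documents):
    """documents: iterable of (docno, text) pairs from the collection."""
    schema = Schema(docno=ID(stored=True, unique=True),
                    body=TEXT(analyzer=StemmingAnalyzer()))
    os.makedirs(index_dir, exist_ok=True)
    ix = create_in(index_dir, schema)   # create a fresh index on disk
    writer = ix.writer()
    for docno, text in documents:
        writer.add_document(docno=docno, body=text)
    writer.commit()                     # persist the index inside the image
    return ix
```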
5.2 Train
The purpose of the train action is to train a retrieval model. We developed this hook within the jig in order to perform training with the NVSM model, which is the only NeuIR model officially supported by the OSIRRC library at the time of writing. This hook mounts the topics and relevance judgments associated with the selected experimental collection, together with two files containing the topic IDs to use for the test and validation of the model. The support for the evaluation of a model on different subsets of topics can be useful to any supervised NeuIR model which might use the jig in the future, or to learning-to-rank approaches within other Docker images. In the case of NVSM, which is an unsupervised model, we employ the validation subset of topics during training to select the best model, saved after each training epoch. The trained models are saved to a directory indicated by the user on the host machine. This is done in order to keep the Docker image as light as possible and to allow the user to easily inspect the results of the training process.

5.3 Search
The purpose of the search hook is to perform an ad hoc retrieval run; multiple runs can be performed by calling the jig multiple times with different parameters. In order to perform retrieval with the provided Docker image, the user needs to indicate as parameters the path to the directory containing the trained model computed at the previous step and the path to a text file containing the topic IDs on which to perform retrieval. Then, the NVSM trained model which performed best on the validation set of queries at the previous step is loaded by the Docker image, and retrieval is performed on the topics specified in the topic IDs file passed to the jig.

6 EXPERIMENTAL SETUP

6.1 Experimental Collection
To test our Docker images we consider the Robust04 collection, which is composed of the TIPSTER corpus [14], Disks 4 & 5 minus the Congressional Record. The collection contains 528,155 documents, with a vocabulary of 760,467 different words. The topics considered for the evaluation are topics 301-450 and 601-700 from Robust04 [23]. Only the title field of the topics is used for retrieval. The set of topics is split into validation (V) and test (T) sets, as proposed in [22] (splits can be found at https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids). Relevance judgments are restricted accordingly.

The execution times and memory occupation statistics were computed on a 2018 Alienware Area-51 with an Intel Core i9-7980XE CPU @ 2.60GHz with 36 cores, 64GB of RAM, and two GeForce GTX 1080Ti GPUs.

6.2 Evaluation Measures
We use the same measures as [22] to evaluate retrieval effectiveness: MAP, nDCG@100, and P@10. Additionally, we also employ Recall.

6.3 Training
To train the NVSM model, we set the following parameters and hyper-parameters: word representation size $k_w = 300$, number of negative examples $z = 10$, learning rate $\alpha = 0.001$, regularization lambda $\lambda = 0.01$, batch size $m = 51200$, dimensionality of the document representations $k_d = 256$, and $n$-gram size $n = 16$. We train the model for 15 iterations over the document collection and select the model iteration that performs best in terms of MAP. A single iteration consists of $\lceil \frac{1}{m} \sum_{d \in D} (|d| - n + 1) \rceil$ batches, where $d$ is a document in the experimental collection $D$ and $n$ is the $n$-gram size described in Section 3.
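As a worked example of this batch count (with made-up document lengths, not Robust04 statistics):

```python
# Worked example of the batches-per-iteration formula of Section 6.3:
# ceil((1/m) * sum over documents of (|d| - n + 1)).
# The document lengths below are invented for illustration.
import math

def batches_per_iteration(doc_lengths, m=51200, n=16):
    total_ngrams = sum(length - n + 1 for length in doc_lengths if length >= n)
    return math.ceil(total_ngrams / m)

# Three toy documents of 100, 500, and 1000 terms yield
# 85 + 485 + 985 = 1555 n-grams, i.e., a single batch of size m.
print(batches_per_iteration([100, 500, 1000]))  # -> 1
```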
6.4 Performance Differences Evaluation
In order to evaluate the differences between the rankings produced with the two NVSM Docker images, we consider the following measures (a sketch of how they can be computed follows the list).
• Root Mean Square Error (RMSE): this measure indicates how close the performance scores of two systems are [16] (the lower the better) and considers the values of a measure $M(\cdot)$ (e.g., MAP, nDCG@100) chosen for the evaluation. RMSE is defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (M_{0,i} - M_{1,i})^2}, \quad (6)$$

where $T$ is the total number of topics, $M_{0,i}$ is the chosen evaluation measure computed on the first run, and $M_{1,i}$ is the same measure computed on the second run, both relative to the $i$-th topic.
• Kendall's τ correlation coefficient: since different runs may produce the same RMSE score, we also measure how close the ranked result lists of two systems are. This is measured using Kendall's τ correlation coefficient [15] among the lists of retrieved documents for each topic, averaged across all topics. The Kendall's τ correlation coefficient on a single topic is given by:

$$\tau_i(\mathrm{run}_0, \mathrm{run}_1) = \frac{P - Q}{\sqrt{(P + Q + U)(P + Q + V)}}, \quad (7)$$

where $P$ is the total number of concordant pairs (document pairs that are ranked in the same order in both rankings), $Q$ is the total number of discordant pairs (document pairs that are ranked in opposite order in the two rankings), and $U$ and $V$ are the numbers of ties in the first and in the second ranking, respectively. To compare two runs over a set of $T$ topics, Equation (7) becomes:

$$\tau(\mathrm{run}_0, \mathrm{run}_1) = \frac{1}{T} \sum_{i=1}^{T} \tau_i(\mathrm{run}_0, \mathrm{run}_1). \quad (8)$$

The range of this measure is $[-1, 1]$, where 1 indicates a perfect correlation between the order of the documents in the considered runs and -1 indicates that the rankings associated with each topic in the two runs are one the inverse of the other.
• Jaccard index: since different runs might contain a different set of relevant documents for each topic, we consider the average of the Jaccard index of each of these sets, over all topics. We compute this value as:

$$\mathrm{sim}(\mathrm{run}_1, \mathrm{run}_2) = \frac{1}{T} \sum_{i=1}^{T} \frac{|\mathit{rd\_run1}_i \cap \mathit{rd\_run2}_i|}{|\mathit{rd\_run1}_i \cup \mathit{rd\_run2}_i|}, \quad (9)$$

where $\mathit{rd\_run1}_i$ and $\mathit{rd\_run2}_i$ are the sets of relevant documents retrieved for topic $i$ in $\mathrm{run}_1$ and $\mathrm{run}_2$, respectively.
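To make the three comparisons concrete, the sketch below shows one way to compute them. The data structures are illustrative assumptions (each run as a dict from topic ID to its ranked list of document IDs), and Kendall's τ is delegated to scipy.stats.kendalltau, whose default tau-b variant matches the tie handling of Equation (7).

```python
# Minimal sketch (with illustrative data structures) of the measures
# in Section 6.4. `per-topic scores` dicts map a topic id to the value
# of the chosen measure M (e.g., MAP) for that topic; runs map a topic
# id to its ranked list of document ids.
import numpy as np
from scipy.stats import kendalltau  # tau-b, i.e., Equation (7) with ties

def rmse(scores_run0, scores_run1, topics):
    diffs = [scores_run0[t] - scores_run1[t] for t in topics]
    return float(np.sqrt(np.mean(np.square(diffs))))

def avg_kendall_tau(run0, run1, topics, depth=100):
    taus = []
    for t in topics:
        a, b = run0[t][:depth], run1[t][:depth]
        shared = [d for d in a if d in b]  # docs retrieved by both runs
        # Compare the rank positions of the shared documents; this
        # simplification assumes at least two shared documents per topic.
        ranks_a = [a.index(d) for d in shared]
        ranks_b = [b.index(d) for d in shared]
        taus.append(kendalltau(ranks_a, ranks_b).correlation)
    return float(np.mean(taus))

def avg_jaccard(rel_run0, rel_run1, topics):
    """rel_run*: dict from topic id to the set of retrieved relevant docs."""
    sims = [len(rel_run0[t] & rel_run1[t]) / len(rel_run0[t] | rel_run1[t])
            for t in topics]
    return float(np.mean(sims))
```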
RMSE and Kendall's τ correlation coefficient have been adopted to evaluate the differences between rankings for reproducibility purposes in CENTRE@CLEF [9], whereas the Jaccard index is used here for the first time for this purpose.

We use these measures to evaluate the differences in the rankings produced by the NVSM Docker images. Since the image with GPU support is not fully deterministic, we compute three different runs on the same machine and analyze the differences between them. Also, since NVSM does not retrieve any relevant document for four topics (312, 316, 348, and 379), because none of their terms are present in the NVSM term dictionary, we remove these topics from the pool and consider, only for the comparison between different runs using RMSE, Kendall's τ, and the Jaccard index, a total of 196 out of the 200 test topics indicated in the test split file available in the OSIRRC repository (https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids).

7 EVALUATION
In Table 1, we report the statistics relative to the disk space and memory occupation of our images. We also include the time required by each image to complete one training epoch. The first thing that we observe is that the CPU Docker image takes less space on disk than the GPU one. This is because the former does not need all of the drivers and libraries required by the GPU version of TensorFlow; in fact, these libraries make the nvsm_gpu image three times larger than the other one. We also point out that the GPU memory usage reported in Table 1 is proportional to the memory available on the GPU of the host machine. In fact, TensorFlow, if no explicit limitation is imposed by the user, allocates most of the available GPU memory to speed up computations. For this reason, since the GPU used in these experiments has 11GB of memory, the space used is 10.76GB. If the GPU had less memory available, TensorFlow would adjust to it and use as little as 8GB, which is in fact the minimum GPU memory requirement to run the nvsm_gpu Docker image according to our tests.

Table 1: Analysis of the disk space, memory occupation, and execution time of the shared Docker images.

                               NVSM (CPU)   NVSM (GPU)
Disk occupation (image only)   1.1GB        3.55GB
Index disk size (Robust04)           4.96GB
Maximum RAM occupation         16GB         10GB
GPU memory usage               –            10.76GB
Execution time (1 epoch)       8h           2h30m

In Table 2, we report the retrieval results obtained with the two shared Docker images. From these results, we observe that there are small differences, always within ±0.01, between the runs obtained with nvsm_gpu on the same machine and the ones obtained with nvsm_cpu on different machines. The causes of these small differences are described in Section 4 and are related to how the optimization process of the model is managed by TensorFlow within Docker.

Table 2: Retrieval results on the Robust04 (T) collection computed with the two shared Docker images of NVSM.

              MAP     nDCG@100   P@10    Recall
CPU (run 0)   0.138   0.271      0.285   0.6082
GPU (run 0)   0.137   0.265      0.277   0.6102
GPU (run 1)   0.138   0.270      0.277   0.6066
GPU (run 2)   0.137   0.268      0.270   0.6109

The MAP, nDCG@100, P@10, and Recall values obtained with the two images are all very similar, and close to the measures reported in the original NVSM paper [22]. Indeed, the absolute difference between the MAP, nDCG@100, and P@10 values reported in [22] and our results is always less than 0.02. As a side note, the MAP values obtained by NVSM are low when compared to the other approaches on Robust04 that can be found in the OSIRRC 2019 library, even 10% lower than some methods that do not apply re-ranking.

In order to further evaluate the performance differences between the runs, we begin by computing the RMSE considering the MAP, nDCG@100, and P@10 measures. The RMSE gives us an idea of the performance difference between two runs, averaged across the considered topics. We first compute the average values of MAP, nDCG@100, and P@10 over the three nvsm_gpu runs on each topic. Then, we compare these averaged performance measures, for each topic, against the corresponding ones associated with the CPU-based NVSM run we obtained on our machine. These results are reported in Table 3. From this evaluation we observe that the average performance difference across the considered 196 topics is very low for the MAP and nDCG@100 measures, while it grows when we consider the top part of the rankings (P@10). In conclusion, the RMSE value is generally low; hence, we can confidently say that the models behave in a very similar way in terms of MAP, nDCG@100, and P@10 on all the considered topics.

Table 3: RMSE (the lower the better) between the NVSM CPU run and the average of the three runs computed with the NVSM GPU Docker image, considering MAP at rank 1000 (MAP), nDCG at rank 100 (nDCG@100), and Precision at rank 10 (P@10).

                   NVSM GPU (average)
RMSE (MAP)         0.034
RMSE (nDCG@100)    0.054
RMSE (P@10)        0.140

In Table 4, we report the Kendall's τ values associated with each pair of runs that we computed. This measure shows how similar the considered rankings are to each other. In our case, the runs appear to be quite different from each other, since the Kendall's τ values are all close to 0. In other words, when considering the top 100 results in each run, the same documents are rarely in the same positions in the selected rankings. This result, combined with the fact that the runs all achieve similar MAP, nDCG@100, P@10, and Recall values, leads to the conclusion that the relevant documents are ranked high in the rankings, but not in the same positions. In other words, NVSM performs a permutation of the documents in the runs, maintaining however the relative order between relevant and non-relevant documents.

Table 4: Kendall's τ correlation coefficient values between the runs computed with the NVSM GPU and CPU Docker images, considering the top 100 ranked documents in each run.

              GPU (run 0)   GPU (run 1)   GPU (run 2)   CPU
GPU (run 0)   1.0           0.025         0.025         0.018
GPU (run 1)   0.025         1.0           0.089         0.014
GPU (run 2)   0.025         0.089         1.0           0.009
CPU           0.018         0.014         0.009         1.0
To validate our hypothesis, we report in Figure 1, for each pair of runs, the Jaccard index between the sets of relevant documents, averaged over all topics, as described in Section 6. These values help us assess whether, and by how much on average, the sets of relevant documents retrieved for each topic differ across our runs. In this case, we observe that the runs computed with the GPU have more in common with each other than with the run computed with the CPU. However, the Jaccard index values are all very high, and this confirms our previous hypothesis about the rankings. In fact, it implies that the runs contain similar sets of relevant documents, which are however in different relative positions (because we have a low Kendall's τ correlation coefficient) but in the same portion of the rankings (because we obtain similar and relatively high nDCG@100 and P@10 values over all runs).

Figure 1: Heatmap of the average Jaccard index between the sets of relevant documents retrieved for each topic by the NVSM Docker images. The underlying values are:

            run_0_cpu   run_0_gpu   run_1_gpu   run_2_gpu
run_0_cpu   1.00        0.81        0.81        0.81
run_0_gpu   0.81        1.00        0.86        0.86
run_1_gpu   0.81        0.86        1.00        0.97
run_2_gpu   0.81        0.86        0.97        1.00

To qualitatively assess the differences between the runs from a user perspective, we select one topic (301: "International Organized Crime") and report in Table 5 the top five document IDs for each run. The results in this table confirm our previous intuition. In fact, we observe that the majority of the top-ranked documents for topic 301 are relevant in each run, but these documents differ slightly across runs. Also, we observe that most of the relevant documents retrieved by nvsm_gpu are the same across the GPU runs, while only two of the relevant documents retrieved by nvsm_cpu are also found in the other runs. For instance, we observe that document FBIS4-45469 is ranked in the top-5 positions only in the CPU run. Similarly, document FBIS4-43965 appears only in GPU run 0. These apparently small differences can have a sizeable impact on the user experience, and should be taken into consideration when choosing to employ a NeuIR model in real-world scenarios.

Table 5: Top 5 documents in the runs computed with nvsm_cpu and nvsm_gpu. Relevant documents are highlighted in bold.

CPU           GPU (run 0)   GPU (run 1)   GPU (run 2)
FBIS3-55219   FBIS3-55219   FBIS3-55219   FBIS3-55219
FBIS4-41991   FBIS4-7811    FBIS4-7811    FBIS4-7811
FBIS4-45469   FBIS4-43965   FBIS4-41991   FBIS4-41991
FBIS3-54945   FBIS3-23986   FBIS3-23986   FBIS3-23986
FBIS4-7811    FBIS4-41991   FBIS4-65446   FBIS4-65446
8 FINAL REMARKS
In this work, we performed a replicability study of the Neural Vector Space Model (NVSM) retrieval model using Docker. First, we presented the architecture and the main functions of a Docker image designed for the replicability of Neural IR (NeuIR) models. The described architecture is compatible with the jig developed in the OSIRRC 2019 workshop and supports the index (to index an experimental collection), train (to train a retrieval model), and search (to perform document retrieval) actions. Secondly, we described the image components and the engineering challenges of obtaining deterministic results with Docker using popular machine learning libraries such as TensorFlow. We also share two Docker images of the NVSM model, which are part of the OSIRRC 2019 library: the first relies only on the CPU of the host machine to perform its operations, while the second is also able to exploit the GPU of the host machine, when available, to perform more expensive computations such as the training of the NVSM model. Finally, we performed an in-depth evaluation of the differences between the runs obtained with the two images, presenting some insights which also hold for other NeuIR models relying on CUDA and TensorFlow.

In fact, we observed some differences, which are hard to spot when looking only at the average performance, between the runs computed by the nvsm_cpu Docker images on different machines and between the runs computed by the nvsm_cpu and nvsm_gpu Docker images on the same machine. The differences between nvsm_cpu images on different machines are related to the non-determinism of the results, as Docker relies on the host machine for some basic operations, which influence the model optimization process through the generation of different pseudo-random number sequences. On the other hand, the differences between nvsm_gpu images on the same machine are due to the implementation of some functions in the CUDA and TensorFlow libraries. We observed that these operations influence in a sizeable way the ordering of the same documents across different runs, but not the overall distribution of relevant and non-relevant documents in the ranking. Similar differences, even more accentuated, can be found between nvsm_cpu and nvsm_gpu images on the same machine. Therefore, even though these differences may seem marginal in offline evaluation settings, where the focus is on average performance, they are extremely relevant for user-oriented online settings, as they can have a sizeable impact on the user experience, and should thus be taken into consideration when deciding whether to use NeuIR models in real-world scenarios.

We also share, in our public repository, the models we trained on our machine with both the nvsm_cpu and nvsm_gpu Docker images, as this is fundamental to enable replicability [12]. These can be loaded by the Docker images in order to perform document retrieval with the same models we used and obtain the same runs.
Docker images on the same machine. The differences between Springer, 408–420. nvsm_cpu images on different machines are related to the non- [17] C. Macdonald, R. McCreadie, R. L. Santos, and I. Ounis. 2012. From puppy to maturity: Experiences in developing terrier. Proc. of OSIR at SIGIR (2012), 60–63. determinism of the results, as Docker relies on the host machine [18] S. Marchesin, A. Purpura, and G. Silvello. 2019. A Neural Vector Space Model for some basic operations which influence the model optimization Implementation Repository. https://github.com/giansilv/NeuralIR/ [19] T. Sakai, N. Ferro, I. Soboroff, Z. Zeng, P. Xiao, and M. Maistro. 2019. Overview process through the generation of different pseudo-random number of the NTCIR-14 CENTRE Task. In Proceedings of the 14th NTCIR Conference on sequences. On the other hand, the differences between nvsm_gpu Evaluation of Information Access Technologies. Tokyo, Japan. images on the same machine are due to the implementation of some [20] I. Soboroff, N. Ferro, and T. Sakai. 2018. Overview of the TREC 2018 CENTRE Track. In The Twenty-Seventh Text REtrieval Conference Proceedings (TREC 2018). functions in the CUDA and Tensorflow libraries. We observed that [21] A. Trotman, C. L. A. Clarke, I. Ounis, S. Culpepper, M. A. Cartright, and S. Geva. these operations influence in a sizeable way the ordering of the 2012. Open source information retrieval: a report on the SIGIR 2012 workshop. same documents across different runs, but not the overall distribu- In ACM SIGIR Forum, Vol. 46. ACM, 95–101. [22] C. Van Gysel, M. de Rijke, and E. Kanoulas. 2018. Neural Vector Spaces for tion of relevant and non-relevant documents in the ranking. Similar Unsupervised Information Retrieval. ACM Trans. Inf. Syst. 36, 4 (2018), 38:1– differences, that are even more accentuated, can be found between 38:25. [23] E. M. Voorhees. 2005. The TREC Robust Retrieval Track. ACM SIGIR Forum 39, 1 nvsm_cpu and nvsm_gpu images on the same machine. Therefore, (2005), 11–20. even though these differences may seem marginal in offline eval- [24] I. Vulić and M. F. Moens. 2015. Monolingual and cross-lingual information uation settings, where the focus is on average performance, they retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th international ACM SIGIR conference on research and development in information are extremely relevant for user-oriented online settings – as they retrieval. ACM, 363–372. can have a sizeable impact on the user experience and should thus [25] P. Yang, H. Fang, and J. Lin. 2018. Anserini: Reproducible Ranking Baselines Using be taken into consideration when deciding whether to use NeuIR Lucene. J. Data and Information Quality 10, 4, Article 16 (Oct. 2018), 20 pages. https://doi.org/10.1145/3239571 models in real-world scenarios. [26] C. Zhai and J. Lafferty. 2004. A study of smoothing methods for language models We also share the models we trained on our machine with both applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22, 2 (2004), 179–214. the nvsm_cpu and nvsm_gpu Docker images in our public repository, as it is fundamental to enable replicability [12]. These can be loaded by the docker image in order to perform document retrieval with the same models we used and obtain the same runs. REFERENCES [1] M. Agosti, N. Ferro, and C. Thanos. 2012. DESIRE 2011: workshop on data infrastructurEs for supporting information retrieval evaluation.. In SIGIR Forum, Vol. 46. Citeseer, 51–55. 43