<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Docker-Based Replicability Study of a Neural Information Retrieval Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Marchesin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Purpura</string-name>
          <email>purpuraa@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>silvello@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Docker, Neural Information Retrieval</institution>
          ,
          <addr-line>Replicability, Reproducibility</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>25</volume>
      <issue>2019</issue>
      <fpage>37</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>In this work, we propose a Docker image architecture for the replicability of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) [22], an unsupervised NeuIR model. The first image we share (nvsm_cpu) can run on most machines and relies only on the CPU to perform the required computations. The second image we share (nvsm_gpu) relies instead on the Graphics Processing Unit (GPU) of the host machine, when available, to perform computationally intensive tasks, such as the training of the NVSM model. Furthermore, we discuss some insights on the engineering challenges we encountered to obtain deterministic and consistent results from NeuIR models, relying on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the usage within Docker of TensorFlow and CUDA libraries – whose inherent randomness alters, under certain circumstances, the relative order of documents in rankings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Information retrieval; Retrieval
models and ranking; Evaluation of retrieval results; •
Computing methodologies → Unsupervised learning.</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>
        Following some recent efforts on reproducibility, like the CENTRE
evaluations at CLEF [
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ], NTCIR [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and TREC [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], or the
SIGIR task force to implement ACM’s policy on artifact review and
badging [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the Open-Source IR Replicability Challenge at SIGIR
2019 (OSIRRC 2019) aims at addressing the replicability issue in
ad hoc document retrieval [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. OSIRRC 2019’s vision is to build
Docker-based1 infrastructures to replicate results on standard ad
hoc test collections. Docker is a tool that allows for the creation
and deployment of applications via images containing all the
required dependencies. Relying on a Docker-based infrastructure to
replicate the results of existing systems helps researchers avoid
all the issues related to system requirements and dependencies.
Indeed, Information Retrieval (IR) platforms such as Anserini [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ],
Terrier [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], or text matching libraries such as MatchZoo [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] rely
on a set of software tools, developed in Java or Python and based
on numerous libraries for scientific computing, which all have to
be available on the host machine in order for the applications to
run smoothly.
      </p>
      <sec id="sec-2-1">
        <title>1https://www.docker.com/.</title>
        <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located
with SIGIR 2019, 25 July 2019, Paris, France.</p>
        <p>Therefore, OSIRRC 2019 aims to ease the use of such platforms,
and of retrieval approaches in general, by providing Docker images
that replicate IR models on ad hoc document collections. To
maximize the impact of such an effort, OSIRRC 2019 sets three main
goals:
(1) Develop a common Docker interface specification to
support images that capture systems performing ad hoc retrieval
experiments on standard test collections. The proposed
solution is known as the jig.
(2) Build a curated library of Docker images that work with the
jig to capture a diversity of systems and retrieval models.
(3) Explore the possibility of broadening these efforts to include
additional tasks, evaluation methodologies, and benchmark
initiatives.</p>
        <p>
          OSIRRC 2019 gives us the opportunity to investigate the
replicability (and reproducibility), as described in the ACM guidelines
discussed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], of Neural IR (NeuIR) models – on which these
issues are especially relevant. Indeed, NeuIR models are highly
sensitive to parameters, hyper-parameters, and pre-processing choices
that hamper researchers who want to study, evaluate, and compare
NeuIR models against state-of-the-art approaches. Also, these
models are usually compatible only with specific versions of the libraries
that they rely on (e.g., Tensorflow) because these frameworks are
constantly updated. The use of Docker images is a possible solution
to avoid these deployment issues on different machines, as an image
already includes all the libraries required by the contained application.
        </p>
        <p>
          For this reason, (i) we propose a Docker architecture that can
be used as a framework to train, test, and evaluate NeuIR models,
be used as a framework to train, test, and evaluate NeuIR models,
and is compatible with the jig introduced by OSIRRC 2019; and, (ii)
we show how this architecture can be employed to build a Docker
image that replicates the Neural Vector Space Model (NVSM) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ],
a state-of-the-art unsupervised neural model for ad hoc retrieval.
We rely on our shared TensorFlow2 implementation of NVSM [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
The model is trained, tested, and evaluated on the TREC Robust04
collection [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. The contributions of this work are the following:
• we present a Docker architecture for NeuIR models that
is compliant with the OSIRRC 2019 jig requirements. The
architecture supports three functions: index, train and search,
which are the same actions that are typically performed by
NeuIR models;
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2https://www.tensorflow.org/.</title>
        <p>• we share two Docker images to replicate the NVSM results
on the Robust04 collection: nvsm_cpu, which relies on one
or more CPUs for its computations and is compatible with
most machines, and nvsm_gpu, which supports parallel
computing using an NVIDIA Graphics Processing Unit (GPU);
• we perform extensive experimental evaluations to explore
replicability challenges of NeuIR models (i.e., NVSM) with
Docker.</p>
        <p>
          Our NVSM Docker images are part of the OSIRRC 2019 library.3
The source code, system runs, and additional required data can be
found in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The findings presented in this paper contributed to
the definition of the “training” hook within the jig.
        </p>
        <p>The rest of the paper is organized as follows: Section 2 presents
an overview of previous and related initiatives for repeatability,
replicability and reproducibility. Section 3 describes the NVSM
model. Section 4 presents the Docker image, whereas Section 5
describes how to interact with the provided Docker image. Section 6
presents the experimental setup and Section 7 shows the obtained
results. Finally, Section 8 discusses the outcomes of our experiments
and provides insights on the replicability challenges and issues of
NeuIR models.
</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2 RELATED WORK</title>
      <p>
        Repeatability, replicability, and reproducibility are fundamental
aspects of computational sciences, both in supporting desirable
scientific methodology as well as sustaining empirical progress.
Recent ACM guidelines4 precisely define the above concepts as
follows:
• Repeatability: a researcher can reliably repeat his/her own
computation (same team, same experimental setup).
• Replicability: an independent group can obtain the same
result using the author’s own artifacts (different team, same
experimental setup).
• Reproducibility: an independent group can obtain the same
result using artifacts which they develop completely
independently (different team, different experimental setup).
These guidelines have been discussed and analyzed in depth in
the Dagstuhl Seminar on “Reproducibility of Data-Oriented
Experiments in e-Science” held on 24-29 January 2016 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which focused
on the core issues and approaches to reproducibility in several
fields of computer science. One of the outcomes of the seminar
was the Platform, Research goal, Implementation, Method, Actor and
Data (PRIMAD) model, which tackles reproducibility from different
angles.
      </p>
      <p>
        The PRIMAD model acts as a framework to distinguish the main
elements describing an experiment in computer science, as there are
many different terms related to various kinds of reproducibility [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The main aspects of PRIMAD are the following:
• Research Goal (R) characterizes the purpose of a study;
• Method (M) is the specific approach proposed or considered
by the researcher;
• Implementation (I) refers to the actual implementation of
the method, usually in some programming language;
      </p>
      <sec id="sec-3-1">
        <title>3https://github.com/osirrc/osirrc2019-library/#NVSM. 4https://www.acm.org/publications/policies/artifact-review-badging/.</title>
        <p>• Platform (P) describes the underlying hard- and software
like the operating system and the computer used;
• Data (D) consists of two parts, namely the input data and
the specific parameters chosen to carry out the method;
• Actor (A) refers to the experimenter.</p>
        <p>Along with the main aspects of the PRIMAD model, there are two
other relevant variables that need to be taken into account for
reproducibility: transparency and consistency. Transparency is the
ability to verify that all the necessary components of an experiment
perform as they claim; consistency refers to the success or failure
of a reproducibility experiment in terms of consistent outcomes.</p>
        <p>
          The PRIMAD paradigm has been adopted by the IR community,
where it has been adapted to the context of IR evaluation – both
system-oriented and user-oriented [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Nevertheless,
reproducibility in IR is still a critical concept, which requires infrastructures to
manage experimental data [
          <xref ref-type="bibr" rid="ref1">1</xref>
], off-the-shelf open source IR systems
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and reproducible baselines [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], among others.
        </p>
        <p>
          In this context, our contribution at OSIRRC 2019 lies between
replicability and reproducibility. Indeed, by relying on the NVSM
implementation available at [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], we replicate the results of a
reproduced version of NVSM. Therefore, we do not completely adhere
to the definition of replicability provided by the ACM – as we
rely on an independent implementation of NVSM rather than the
one proposed in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Regardless, we believe that our contribution
of a Docker architecture for NeuIR models, along with the two
produced NVSM Docker images, can shed some light on the
replicability and reproducibility issues of NeuIR models. Besides, the
off-the-shelf nature of Docker images can help future researchers
to replicate/reproduce our results easily and consistently.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 NEURAL VECTOR SPACE MODEL</title>
      <p>
        The Neural Vector Space Model (NVSM) is a state-of-the-art
unsupervised model for ad hoc retrieval. The model achieves competitive
results against traditional lexical models, like Query Language
Models (QLM) [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], and outperforms state-of-the-art unsupervised
semantic retrieval models, like the Word2Vec-based models presented
in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Below, we describe the main characteristics of NVSM.
      </p>
      <p>Given a document collection D = {d_j}_{j=1}^M and the relative lexicon
V = {w_i}_{i=1}^N, NVSM considers the vector representations w⃗_i ∈ R^{k_w}
and d⃗_j ∈ R^{k_d}, where k_w and k_d denote the dimensionality of
word and document vectors. Due to the different size of word and
document embeddings, the word feature space is mapped to the
document feature space through a series of transformations learned
by the model.</p>
      <p>A sequence of n words (i.e. an n-gram) (w_{j,i})_{i=1}^n extracted from
d_j is represented by the average of the embeddings of the words in it.
NVSM learns word and document representations considering
mini-batches B of (n-gram, document) pairs. These representations are
learned by minimizing the distance between a document embedding
and the representations of the n-grams contained in it. During the
training phase, L2-normalization is employed to normalize the
n-gram representations: norm(x⃗) = x⃗ / ||x⃗||. This process is used to
obtain sparser representations. The projection of an n-gram into
the k_d-dimensional document feature space can be defined as a
composition function:</p>
      <p>T̃((w_{j,i})_{i=1}^n) = (f ∘ norm ∘ g)((w_{j,i})_{i=1}^n). (1)
Then, the standardized projection of the n-gram representation is
obtained by estimating the per-feature sample mean and variance
over batch B as follows:
T((w_{j,i})_{i=1}^n) = hard-tanh( (T̃((w_{j,i})_{i=1}^n) − Ê[T̃((w_{j,i})_{i=1}^n)]) / sqrt(V̂[T̃((w_{j,i})_{i=1}^n)]) + β ). (2)</p>
      <p>The composition function g, combined with the L2-normalization
norm, causes words to compete for contributing to the resulting
n-gram representation. In this way, words that are more
representative of the target document will contribute more to the n-gram
representation. Moreover, the standardization operation forces
n-gram representations to differentiate themselves only in the
dimensions that matter for the matching task. Thus, word representations
incorporate a notion of term specificity during the learning process.</p>
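      <p>As an illustration, the projection pipeline of Equations (1)-(2) can be sketched in a few lines of NumPy. This is our own toy reconstruction, not the shared TensorFlow implementation; all names (ngram_projection, word_vecs, W, beta) are ours, and the small epsilon added to the variance is our assumption:

```python
import numpy as np

def ngram_projection(word_vecs, W, beta, batch_of_ngrams):
    """Sketch of Eq. (1)-(2): g averages the word embeddings of each
    n-gram, norm L2-normalizes them, f linearly maps them into the
    k_d-dimensional document space; the result is standardized per
    feature over the batch and passed through hard-tanh."""
    # g: average the k_w-dimensional embeddings of each n-gram
    g = np.stack([word_vecs[list(ng)].mean(axis=0) for ng in batch_of_ngrams])
    # norm: L2-normalize each averaged representation
    normed = g / np.linalg.norm(g, axis=1, keepdims=True)
    # f: learned linear transformation into the document space
    t_tilde = normed @ W
    # standardize per feature over batch B, shift by the bias beta
    mean = t_tilde.mean(axis=0)
    var = t_tilde.var(axis=0)
    z = (t_tilde - mean) / np.sqrt(var + 1e-8) + beta
    # hard-tanh clips the result to [-1, 1]
    return np.clip(z, -1.0, 1.0)
```

Here word_vecs is an N × k_w matrix, W a k_w × k_d matrix and beta a k_d-dimensional bias; the epsilon guards against zero per-feature variance, a detail the formulas above leave implicit.</p>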
      <p>The similarity of two representations in the latent vector space
is computed as:</p>
      <p>
        P(S | d_j, (w_{j,i})_{i=1}^n) = σ(d⃗_j · T((w_{j,i})_{i=1}^n)), (3)
where σ(t) is the sigmoid function and S is a binary indicator that
states whether the representation of document d_j is similar to the
projection of its n-gram (w_{j,i})_{i=1}^n. The probability of a document
d_j given its n-gram (w_{j,i})_{i=1}^n is then approximated by uniformly
sampling z contrastive examples [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]:
      </p>
      <p>
        log P̃(d_j | (w_{j,i})_{i=1}^n) = ((z + 1)/(2z)) ( z · log P(S | d_j, (w_{j,i})_{i=1}^n) + Σ_{k=1, d_k ∼ U(D)}^{z} log(1.0 − P(S | d_k, (w_{j,i})_{i=1}^n)) ), (4)
where U(D) represents the uniform distribution used to obtain
contrastive examples from documents D. Finally, to optimize the
model, the following loss function – averaged over the instances in
batch B – is used:</p>
      <p>
        L(θ | B) = − (1/m) Σ_{j=1}^{m} log P̃(d_j | (w_{j,i})_{i=1}^n) + (λ/(2m)) ( Σ_{i=1}^{|V|} ||w⃗_i||₂² + Σ_{j=1}^{|D|} ||d⃗_j||₂² + ||W||_F² ), (5)
where θ is the set of parameters {w⃗_i}_{i=1}^{|V|}, {d⃗_j}_{j=1}^{|D|}, W, β, and λ is a
weight regularization hyper-parameter.</p>
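      <p>To make Equations (4)-(5) concrete, here is a toy NumPy version of the contrastive estimate and of the regularized loss. Again, this is an illustrative sketch under our own naming, not the shared implementation:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nvsm_log_prob(d_vec, t_ngram, contrastive_docs, z=10):
    """Eq. (4): NCE-style estimate of log P~(d_j | (w_{j,i})_{i=1}^n).
    d_vec is the target document vector, t_ngram the projected n-gram
    T(.), and contrastive_docs holds the z uniformly sampled vectors."""
    pos = z * np.log(sigmoid(d_vec @ t_ngram))
    neg = sum(np.log(1.0 - sigmoid(dk @ t_ngram)) for dk in contrastive_docs)
    return (z + 1) / (2.0 * z) * (pos + neg)

def nvsm_loss(log_probs, word_vecs, doc_vecs, W, lam):
    """Eq. (5): mean negative log-probability over the batch plus
    L2 regularization of word vectors, document vectors and W."""
    m = len(log_probs)
    reg = np.sum(word_vecs ** 2) + np.sum(doc_vecs ** 2) + np.sum(W ** 2)
    return -np.mean(log_probs) + lam / (2.0 * m) * reg
```

Both log terms are negative, so the estimate of Equation (4) is always negative and the loss of Equation (5) is positive; the gradient of this loss is what the TensorFlow implementation optimizes.</p>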
      <p>After training, a query q is projected into the document feature
space by the composition of the functions f and g: (f ∘ g)(q) = h(q).
Finally, the matching score between a document d_j and a query
q is given by the cosine similarity of their representations in the
document feature space.</p>
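      <p>Query-time matching can then be sketched as follows (our toy NumPy version; the normalization details are simplified with respect to the full model, and all names are ours):

```python
import numpy as np

def rank_documents(query_word_ids, word_vecs, W, doc_vecs):
    """h(q) = (f o g)(q): average the query word embeddings and map
    them into the document space, then rank all documents by cosine
    similarity with the projected query."""
    q = word_vecs[query_word_ids].mean(axis=0) @ W   # h(q)
    q = q / np.linalg.norm(q)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                                   # cosine similarities
    return np.argsort(-scores)                       # best-matching doc first
```

A run for a topic is produced by applying this ranking to all documents and keeping the top 1000, which is exactly what the search hook described later does with the trained model.</p>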
    </sec>
    <sec id="sec-5">
      <title>4 DOCKER IMAGE ARCHITECTURE</title>
    </sec>
    <sec id="sec-6">
      <title>4.1 NVSM Docker Image with CPU support</title>
      <p>The NeuIR model we share, i.e. NVSM, is written in Python and
relies on Tensorflow v.1.13.1. For this reason, we share a Docker
image based on the official Python 3.5 runtime container, on top of
which we install the Python packages required by the algorithm –
such as Tensorflow, Python NLTK, and Whoosh. We also install a
C compiler, i.e. gcc, in order to use the official trec_eval package5
to evaluate the retrieval model during training.</p>
      <p>Since this Docker image still relies on the host machine for some
functions (i.e. random number generation), the results, despite being
very similar, are not exactly the same across different computers –
while they are consistent on the same machine. In order to enable
the replication of our experimental results, we share the model we
trained with the nvsm image – which can be loaded in the shared
Docker image if the user decides to skip the training step – in a
public repository.6</p>
    </sec>
    <sec id="sec-7">
      <title>4.2 NVSM Docker Image with GPU support</title>
      <p>Our implementation of NVSM is based on Tensorflow, a
machine learning library that makes it possible to employ the GPU of the
host machine in order to perform operations more efficiently. For this
reason, we created and share in our repository another Docker image
(i.e. nvsm_gpu) which can use the GPU of the host machine to speed
up the computations of the algorithm. To do so, the host machine
running this image needs an NVIDIA GPU and nvidia-docker7
installed on it. There are many advantages to employing
GPUs for scientific computations, but their usage makes a
sizeable difference especially when training deep learning models. In fact,
the training of such models requires a large
number of matrix operations that can be easily parallelized and do not
individually require powerful hardware. For these reasons, the architecture of
GPUs, with thousands of low-power processing units, is particularly
well suited to this kind of operation.</p>
      <p>The nvsm_gpu image is based on the official TensorFlow-gpu
Docker image for Python 3. As in the other image that we share,
we include a C compiler in order to use trec_eval for the retrieval
model evaluation, and the Python libraries required by NVSM.</p>
      <p>In our experiments, we observed that nvsm_gpu does not produce
fully consistent results on the same machine. In fact, TensorFlow
uses the Eigen library, which in turn uses CUDA atomic functions
to implement reduction operations, such as tf.reduce_sum. Such
operations are non-deterministic, and each of them can
introduce small variations, as also stated in a GitHub issue on the
TensorFlow source code.8 Despite this problem, we still believe
that the advantages brought by the usage of a GPU in terms of
reduced computational time – combined with the fact that we
detected only very small variations in the Mean Average Precision
at Rank 1000 (MAP), Normalized Discounted Cumulative Gain at
Rank 100 (nDCG@100), Precision at Rank 10 (P@10), and Recall –
make this implementation of the algorithm a valid alternative to
the CPU-based one.</p>
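      <p>The root cause is that floating-point addition is not associative, so a parallel reduction that accumulates partial sums in a different order can return a slightly different total. A minimal pure-Python illustration (unrelated to any specific TensorFlow code path):

```python
# Floating-point addition is not associative: summing the very same
# numbers in a different order can yield a different result. CUDA
# reductions (e.g. behind tf.reduce_sum) accumulate partial sums in
# a non-deterministic order, which is where run-to-run noise enters.
values = [1e16, 1.0, -1e16, 1.0] * 1000

left_to_right = 0.0
for v in values:
    left_to_right += v       # small terms are absorbed by the large ones

reordered = sum(sorted(values))  # same numbers, ascending order

print(left_to_right, reordered)
```

On IEEE-754 doubles the two totals differ even though they add exactly the same numbers; a GPU reduction behaves like an unpredictable interleaving of such orders, hence the small ranking variations reported above.</p>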
      <sec id="sec-7-1">
        <title>5https://github.com/usnistgov/trec_eval. 6http://osirrc.dei.unipd.it/sample_trained_models_nvsm_cpu_gpu/ 7https://github.com/NVIDIA/nvidia-docker. 8https://github.com/tensorflow/tensorflow/issues/3103.</title>
        <p>In order to assess the amount of variability due to this
non-determinism in the training process, we perform, in Section 7, a
few tests to evaluate the difference between the run computed with
nvsm_cpu and three different runs computed with nvsm_gpu using
the GPU. Finally, to ensure the replicability of our experiments, we
share one of the models that we trained using this image in our
public repository.9 This model can be loaded by the provided Docker
image to perform search on the Robust04 collection, skipping the
training step.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>5 INTERACTION WITH THE DOCKER IMAGE</title>
      <p>The interaction with the shared Docker image is performed via
the jig.10 The jig is an interface to perform retrieval operations
employing a model embedded in a Docker image. At the time of
writing, our image can perform retrieval on the Robust04 collection.
The actions supported by the NVSM Docker images that we share
are: index, train, and search.</p>
    </sec>
    <sec id="sec-9">
      <title>5.1 Index</title>
      <p>The purpose of this action is to build the collection indices required
by the retrieval model. Before the hook is run, the jig will mount
the selected document collection at a path passed to the script. Our
image will then uncompress and index the collection using the
Whoosh retrieval library.11 The index files relative to the collection
are finally saved inside the Docker image in order to speed up future
search operations, eliminating the time to mount the index files in
the Docker image.
the Docker image.</p>
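      <p>Conceptually, the index hook reads the mounted collection and builds an inverted index from terms to documents. The shared image delegates this to Whoosh; the toy in-memory version below (all names ours) only illustrates the underlying idea:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy stand-in for the indexing step: map each term to the set
    of IDs of the documents containing it. The real image instead
    builds a Whoosh index and persists it inside the container to
    speed up later search operations."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# A two-document toy "collection"
collection = {"d1": "neural vector space model",
              "d2": "docker image for neural retrieval"}
index = build_inverted_index(collection)
```

The real hook additionally uncompresses the mounted collection and applies the pre-processing (tokenization, stopping) required by NVSM before indexing.</p>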
    </sec>
    <sec id="sec-10">
      <title>5.2 Train</title>
      <p>The purpose of the train action is to train a retrieval model. We
developed this hook within the jig in order to perform training
with the NVSM model, which is the only NeuIR model officially
supported by the OSIRRC library at the time of writing. This hook
mounts the topics and relevance judgments associated to the
selected experimental collection and two files containing the topic
IDs to use for the test and validation of the model. The support
for the evaluation of a model on diferent subsets of topics can be
useful to any supervised NeuIR model which might use the jig in
the future or to learning-to-rank approaches within other Docker
images. In the case of NVSM, which is an unsupervised model, we
employ the validation subset of topics during training to select the
best model – saved after each training epoch. The trained models
are saved to a directory indicated by the user on the host machine.
This is done in order to keep the Docker image as light as possible,
and to allow the user to easily inspect the results of the training
process.
process.</p>
    </sec>
    <sec id="sec-11">
      <title>5.3 Search</title>
      <p>
        The purpose of the search hook is to perform an ad hoc retrieval run
– multiple runs can be performed by calling the jig multiple times with
different parameters. In order to perform retrieval with the provided
Docker image, the user needs to indicate as parameters the
path to the directory containing the trained model computed at the
previous step and the path to a text file containing the topic IDs on
which to perform retrieval. Then, the NVSM trained model – the one that
performed best on the validation set of queries at the previous step –
is loaded by the Docker image and retrieval is performed
on the topics specified in the topic IDs file passed to the jig.
9http://osirrc.dei.unipd.it/sample_trained_models_nvsm_cpu_gpu/.
10https://github.com/osirrc/jig.
11https://whoosh.readthedocs.io/en/latest/index.html.
      </p>
    </sec>
    <sec id="sec-11-1">
      <title>6 EXPERIMENTAL SETUP</title>
      <p>
        To test our Docker images we consider the Robust04 collection, which
is composed of the TIPSTER corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] Disks 4&amp;5 minus the Congressional Record. The
collection counts 528,155 documents, with a vocabulary of 760,467
different words. The topics considered for the evaluation are topics
301-450 and 601-700 from Robust04 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Only the title field of the topics
is used for retrieval. The set of topics is split into validation (V)
and test (T) sets, as proposed in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].12 Relevance judgments are
restricted accordingly.
      </p>
      <p>
        The execution times and memory occupation statistics were
computed on a 2018 Alienware Area-51 with an Intel Core
i9-7980XE CPU @ 2.60GHz with 36 cores, 64GB of RAM and two
GeForce GTX 1080Ti GPUs.
To train the NVSM model, we set the following parameters and
hyper-parameters: word representation size k_w = 300, number of
negative examples z = 10, learning rate α = 0.001, regularization
lambda λ = 0.01, batch size m = 51200, dimensionality of the
document representations k_d = 256 and n-gram size n = 16. We
train the model for 15 iterations over the document collection and
we select the model iteration that performs best in terms of MAP. A
single iteration consists of ⌈(1/m) Σ_{d∈D} (|d| − n + 1)⌉ batches, where d
is a document in the experimental collection D, and n is the n-gram
size as described in Section 3.
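With the values above, the number of batches per iteration follows directly from the document lengths; a quick sketch (toy document lengths, names ours):

```python
import math

def batches_per_epoch(doc_lengths, n=16, m=51200):
    """One NVSM iteration: ceil((1/m) * sum_d (|d| - n + 1)) batches,
    i.e. one (n-gram, document) pair per n-gram position, grouped
    into mini-batches of size m."""
    total_ngrams = sum(length - n + 1 for length in doc_lengths)
    return math.ceil(total_ngrams / m)
```

For example, 1000 toy documents of 500 terms each yield 485,000 n-grams and hence 10 batches of size 51,200 per iteration.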
To compare the runs obtained with the shared images, we rely on the
following measures:
• Root Mean Square Error (RMSE): measures the average
performance difference between two runs across the considered
topics:
RMSE(run0, run1) = sqrt( (1/T) Σ_{i=1}^{T} (M_{0,i} − M_{1,i})² ), (6)
where M_{0,i} is the chosen evaluation measure associated to
the first run to compare and M_{1,i} is the same measure
associated to the second run, both relative to the i-th topic.
12Splits can be found at: https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids.
• Kendall’s τ correlation coefficient: since different runs
may produce the same RMSE score, we also measure how
close the ranked result lists of two systems are. This is
measured using Kendall’s τ correlation coefficient [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] among
the lists of retrieved documents for each topic, averaged across
all topics. The Kendall’s τ correlation coefficient on a single
topic is given by:
      </p>
      <p>τ_i(run0, run1) = (P − Q) / sqrt((P + Q + U)(P + Q + V)), (7)
where T is the total number of topics, P is the total number
of concordant pairs (document pairs that are ranked in the
same order in both vectors), Q the total number of discordant
pairs (document pairs that are ranked in opposite order in
the two vectors), and U and V are the number of ties, respectively,
in the first and in the second ranking. To compare two runs,
Equation (7) becomes:</p>
      <p>τ(run0, run1) = (1/T) Σ_{i=1}^{T} τ_i(run0, run1). (8)</p>
      <p>
        The range of this measure is [-1, 1], where 1 indicates a
perfect correlation between the order of the documents in the
considered runs and -1 indicates that the rankings associated
to each topic in the runs are one the inverse of the other.
• Jaccard index: since different runs might contain a different
set of relevant documents for each topic, we consider the
average of the Jaccard index of each of these sets, over all
topics. We compute this value as:
sim(run1, run2) = (1/T) Σ_{i=1}^{T} ( |rd_run1_i ∩ rd_run2_i| / |rd_run1_i ∪ rd_run2_i| ), (9)
      </p>
      <p>
        where rd_run1_i and rd_run2_i are the sets of relevant
documents retrieved for topic i in run1 and run2, respectively.
RMSE and Kendall’s τ correlation coefficient have been adopted to
evaluate the differences between the rankings for reproducibility
purposes in CENTRE@CLEF [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], whereas the Jaccard index is used
here for the first time for this purpose.
      </p>
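      <p>The three comparison measures can be sketched in plain Python. This is our illustrative reconstruction (names ours); the Kendall’s τ below follows Equation (7) with a quadratic pairwise loop, which is fine for rankings of moderate length:

```python
import math
from itertools import combinations

def rmse(m0, m1):
    """Eq. (6): RMSE between the per-topic values of a measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m0, m1)) / len(m0))

def kendall_tau(rank0, rank1):
    """Eq. (7): tau over two rankings of the same documents, where
    each ranking maps a document ID to its position."""
    docs = list(rank0)
    P = Q = U = V = 0
    for a, b in combinations(docs, 2):
        d0 = rank0[a] - rank0[b]
        d1 = rank1[a] - rank1[b]
        if d0 == 0:
            U += 1            # tie in the first ranking
        if d1 == 0:
            V += 1            # tie in the second ranking
        if d0 * d1 > 0:
            P += 1            # concordant pair
        elif d0 * d1 < 0:
            Q += 1            # discordant pair
    return (P - Q) / math.sqrt((P + Q + U) * (P + Q + V))

def jaccard(set1, set2):
    """Per-topic Jaccard index between sets of relevant documents."""
    return len(set1 & set2) / len(set1 | set2)
```

Averaging kendall_tau and jaccard over all topics gives Equations (8) and (9); identical rankings yield τ = 1, fully reversed rankings yield τ = −1.</p>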
      <p>We use these measures to evaluate the differences in the rankings
produced by the NVSM Docker images. Since the image with GPU
support is not fully deterministic, we compute three different runs
on the same machine and analyze the differences between them.
Also, since NVSM does not retrieve any relevant document for
four topics (312, 316, 348 and 379), because none of their terms are
present in the NVSM term dictionary, we remove these topics from
the pool and consider – only for the comparison between different
runs using RMSE, Kendall’s τ correlation coefficient and Jaccard
index – a total of 196 out of the 200 test topics indicated in the test
split file available in the OSIRRC repository.13
13https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids.</p>
    </sec>
    <sec id="sec-12">
      <title>7 EVALUATION</title>
      <p>In Table 1, we report the statistics relative to the disk space and
memory occupation of our images. We also include the time
required by each image to complete one training epoch. The first
thing that we observe is that the CPU Docker image takes less
space on disk than the GPU one. This is because the former does
not need all of the drivers and libraries required by the GPU version
of Tensorflow. In fact, these libraries make the nvsm_gpu image
three times larger than the other one. We also point out that the
GPU memory usage reported in Table 1 is proportional to the
memory available on the GPU in the host machine. In fact, Tensorflow
– if there is no explicit limitation imposed by the user – allocates
most of the available space in the GPU memory to speed up
computations. For this reason, since the GPU used in these experiments
has 11GB of memory available, the space used is 10.76GB. If the
GPU had less memory available, then Tensorflow would be able to
adjust to it and use as little as 8GB. This is in fact the minimum GPU
memory requirement, according to our tests, to run the nvsm_gpu
Docker image.</p>
      <sec id="sec-12-1">
        <title>Table 1: Analysis of the disk space used, memory occupation and execution time of the shared Docker images.</title>
        <p>Disk occupation (image only): 1.1GB (CPU) vs. 3.55GB (GPU)</p>
        <p>Index disk size (Robust04): 4.96GB</p>
        <p>Maximum RAM occupation: 16GB (CPU) vs. 10GB (GPU)</p>
        <p>GPU memory usage: – (CPU) vs. 10.76GB (GPU)</p>
        <p>Execution time (1 epoch): 8h (CPU) vs. 2h30m (GPU)</p>
        <p>In Table 2, we report the retrieval results obtained with the
two shared Docker images. From these results, we observe that
there are small differences, always within ±0.01, between the runs
obtained with nvsm_gpu on the same machine and the ones
obtained with nvsm_cpu on different machines. The causes of these
small differences are described in Section 4 and are related to how
the optimization process of the model is managed by Tensorflow
within Docker.</p>
<table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Retrieval results obtained with the two shared Docker images.</p>
          </caption>
          <table>
            <thead>
              <tr><th/><th>MAP</th><th>nDCG@100</th></tr>
            </thead>
            <tbody>
              <tr><td>CPU (run 0)</td><td>0.138</td><td>0.271</td></tr>
              <tr><td>GPU (run 0)</td><td>0.137</td><td>0.265</td></tr>
              <tr><td>GPU (run 1)</td><td>0.138</td><td>0.270</td></tr>
              <tr><td>GPU (run 2)</td><td>0.137</td><td>0.268</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          The MAP, nDCG@100, P@10, and Recall values obtained with the
images are all very similar, and close to the measures reported
in the original NVSM paper [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Indeed, the absolute difference
between the MAP, nDCG@100, and P@10 values reported in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]
and our results is always less than 0.02. As a side note, the MAP
values obtained by NVSM are low when compared to the other
approaches on Robust04 that can be found in the OSIRRC 2019
library – even 10% lower than some methods that do not apply
re-ranking.
        </p>
<p>In order to further evaluate the performance differences between
the runs, we begin by computing the RMSE considering the MAP,
nDCG@100, and P@10 measures. The RMSE gives us an idea of
the performance difference between two runs, averaged across
the considered topics. We first compute the average values of MAP,
nDCG@100, and P@10 over the three nvsm_gpu runs on each topic.
Then, we compare these averaged performance measures, for each
topic, against the corresponding ones associated with the CPU-based
NVSM run we obtained on our machine. These results are reported
in Table 3. From the results of this evaluation, we observe that
the average performance difference across the considered 196 topics
is very low for the MAP and nDCG@100 measures, while it grows
when we consider the top part of the rankings (P@10). In conclusion,
the RMSE value is generally low, hence we can confidently say
that the models behave in a very similar way in terms of MAP,
nDCG@100, and P@10 on all the considered topics.</p>
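<p>A minimal sketch of this per-topic RMSE computation follows; the run data structures (dicts mapping topic ids to a measure value) are our own illustration, not the actual evaluation code.</p>

```python
import math

def rmse_vs_cpu(gpu_runs, cpu_run):
    """RMSE between the per-topic average of the GPU runs and the CPU run.

    gpu_runs: list of dicts, one per GPU run, mapping topic id -> measure value
    cpu_run:  dict mapping topic id -> measure value for the CPU run
    """
    squared_errors = []
    for topic, cpu_value in cpu_run.items():
        # average the measure (e.g. MAP) over the GPU runs for this topic
        gpu_avg = sum(run[topic] for run in gpu_runs) / len(gpu_runs)
        squared_errors.append((gpu_avg - cpu_value) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```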
<p>In Table 4, we report the Kendall’s τ measures associated with each
pair of runs. This measure shows us how similar the considered
rankings are to each other. In our case, the runs appear to be quite
different from each other, since the Kendall’s τ values are all close
to 0. In other words, when considering the top 100 results in each
run, the same documents are rarely in the same positions in the
selected rankings. This result, combined with the fact that the runs
all achieve similar MAP, nDCG@100, P@10, and Recall values, leads
to the conclusion that the relevant documents are ranked high in
the rankings, but not in the same positions. In other words, NVSM
performs a permutation of the documents in the runs, maintaining
however the relative order between relevant and non-relevant
documents.</p>
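<p>For illustration, Kendall’s τ between two such rankings can be sketched with the simple τ-a formulation over the documents that appear in both top-100 lists; how ties and documents retrieved by only one run are handled here is our assumption, not necessarily the exact procedure used for Table 4.</p>

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings, given as ordered lists of doc ids.

    A pair of documents appearing in both rankings is concordant when the
    two rankings order it the same way, discordant otherwise.
    """
    pos_a = {doc: i for i, doc in enumerate(ranking_a)}
    pos_b = {doc: i for i, doc in enumerate(ranking_b)}
    shared = [doc for doc in ranking_a if doc in pos_b]
    concordant = discordant = 0
    for d1, d2 in combinations(shared, 2):
        # same sign of the rank differences means the pair is ordered alike
        if (pos_a[d1] - pos_a[d2]) * (pos_b[d1] - pos_b[d2]) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```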
<table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Kendall’s τ values between each pair of runs.</p>
          </caption>
          <table>
            <thead>
              <tr><th/><th>GPU (run 0)</th><th>GPU (run 1)</th><th>GPU (run 2)</th><th>CPU</th></tr>
            </thead>
            <tbody>
              <tr><td>GPU (run 0)</td><td>1.0</td><td>0.025</td><td>0.025</td><td>0.018</td></tr>
              <tr><td>GPU (run 1)</td><td>0.025</td><td>1.0</td><td>0.089</td><td>0.014</td></tr>
              <tr><td>GPU (run 2)</td><td>0.025</td><td>0.089</td><td>1.0</td><td>0.009</td></tr>
              <tr><td>CPU</td><td>0.018</td><td>0.014</td><td>0.009</td><td>1.0</td></tr>
            </tbody>
          </table>
        </table-wrap>
<p>To validate our hypothesis, we report in Figure 1, for each pair
of runs, the Jaccard index between the sets of relevant documents
averaged over all topics, as described in Section 6. These values
help us assess whether the set of relevant documents retrieved
for each topic differs across our runs, and by how much on average.
In this case, we observe that the runs computed with the GPU have
more in common with each other than with the run computed on
the CPU. However, the Jaccard index values are all very high, and
this confirms our previous hypothesis about the rankings. In fact,
it implies that the runs contain similar sets of relevant documents,
which are however in different relative positions – because we have
a low Kendall’s τ correlation coefficient – but in the same portion
of the rankings – because we obtain similar and relatively high
nDCG@100 and P@10 values over all runs.</p>
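<p>The averaged Jaccard index can be sketched as follows; the input format (a dict from topic id to the set of relevant documents retrieved for that topic) is our own illustration.</p>

```python
def avg_jaccard(run_a, run_b):
    """Jaccard index between the sets of relevant documents retrieved by two
    runs, averaged over the topics the runs have in common."""
    topics = set(run_a) & set(run_b)
    total = 0.0
    for topic in topics:
        union = run_a[topic] | run_b[topic]
        # two empty sets are treated as identical (index 1.0)
        total += len(run_a[topic] & run_b[topic]) / len(union) if union else 1.0
    return total / len(topics)
```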
<p>[Figure 1: Jaccard index between the sets of relevant documents retrieved by each pair of runs, averaged over all topics.]</p>
<p>To qualitatively assess, from a user perspective, the differences
between the runs, we select one topic (301: “International
Organized Crime”) and report in Table 5 the top five document ids
for each of them. The results in this table confirm our previous
intuition. In fact, we observe that the majority of the high-ranked
documents for topic 301 in each run are relevant, but these
documents differ slightly across runs. We also observe that most of the
relevant documents retrieved by nvsm_gpu are the same, while only
two of the relevant documents retrieved by nvsm_cpu are also found
in the other runs. For instance, document FBIS4-45469 is ranked
in the top-5 positions only in the CPU run. Similarly, document
FBIS4-43965 appears only in GPU run 0. These apparently small
differences can have a sizeable impact on the user experience and
should be taken into consideration when choosing to employ a
NeuIR model in real-case scenarios.</p>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption>
            <p>Top 5 documents in the runs computed with nvsm_cpu and nvsm_gpu. Relevant documents are highlighted in bold.</p>
          </caption>
          <table>
            <thead>
              <tr><th>CPU</th><th>GPU (run 0)</th><th>GPU (run 1)</th><th>GPU (run 2)</th></tr>
            </thead>
            <tbody>
              <tr><td>FBIS3-55219</td><td>FBIS3-55219</td><td>FBIS3-55219</td><td>FBIS3-55219</td></tr>
              <tr><td>FBIS4-41991</td><td>FBIS4-7811</td><td>FBIS4-7811</td><td>FBIS4-7811</td></tr>
              <tr><td>FBIS4-45469</td><td>FBIS4-43965</td><td>FBIS4-41991</td><td>FBIS4-41991</td></tr>
              <tr><td>FBIS3-54945</td><td>FBIS3-23986</td><td>FBIS3-23986</td><td>FBIS3-23986</td></tr>
              <tr><td>FBIS4-7811</td><td>FBIS4-41991</td><td>FBIS4-65446</td><td>FBIS4-65446</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>FINAL REMARKS</title>
<p>In this work, we performed a replicability study of the Neural
Vector Space Model (NVSM) retrieval model using Docker. First, we
presented the architecture and the main functions of a Docker
image designed for the replicability of Neural IR (NeuIR) models.
The described architecture is compatible with the jig developed
in the OSIRRC 2019 workshop and supports the index (to index
an experimental collection), train (to train a retrieval model), and
search (to perform document retrieval) actions. Secondly, we
described the image components and the engineering challenges of
obtaining deterministic results with Docker using popular machine
learning libraries such as Tensorflow. We also share two Docker
images – which are part of the OSIRRC 2019 library – of the NVSM
model: the first relies only on the CPU of the host machine to
perform its operations, while the second can also exploit the GPU
of the host machine, when available, for more expensive
computations such as the training of the NVSM model. Finally, we
performed an in-depth evaluation of the differences between the runs
obtained with the two images, presenting some insights which also
hold for other NeuIR models relying on CUDA and Tensorflow.</p>
<p>In fact, we observed some differences – which are hard to spot
when looking only at the average performance – between the runs
computed by the nvsm_cpu Docker image on different machines
and between the runs computed by the nvsm_cpu and nvsm_gpu
Docker images on the same machine. The differences between
nvsm_cpu images on different machines are related to the
nondeterminism of the results, as Docker relies on the host machine
for some basic operations, which influence the model optimization
process through the generation of different pseudo-random number
sequences. On the other hand, the differences between nvsm_gpu
runs on the same machine are due to the implementation of some
functions in the CUDA and Tensorflow libraries. We observed that
these operations influence in a sizeable way the ordering of the
same documents across different runs, but not the overall
distribution of relevant and non-relevant documents in the ranking.
Similar differences, even more accentuated, can be found between
nvsm_cpu and nvsm_gpu images on the same machine. Therefore,
even though these differences may seem marginal in offline
evaluation settings, where the focus is on average performance, they
are extremely relevant for user-oriented online settings – as they
can have a sizeable impact on the user experience and should thus
be taken into consideration when deciding whether to use NeuIR
models in real-world scenarios.</p>
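<p>As a concrete illustration of the pseudo-random issue discussed above, fixing the seed of a private generator is the usual way to make such sequences reproducible across machines. The following is only a standard-library sketch: NVSM itself draws its random numbers through Tensorflow and NumPy, whose seeding APIs differ, and seeding alone does not remove the GPU-side nondeterminism of some CUDA operations.</p>

```python
import random

def seeded_shuffle(items, seed):
    """Shuffle a copy of items with an explicit seed, so that every machine
    draws the same pseudo-random sequence (e.g. when sampling batches)."""
    rng = random.Random(seed)  # private generator, independent of global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```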
<p>
        We also share the models we trained on our machine with both
the nvsm_cpu and nvsm_gpu Docker images in our public repository,
as this is fundamental to enable replicability [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. These models can be loaded by the Docker images in order to
perform document retrieval with the same models we used and to
obtain the same runs.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thanos</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>DESIRE 2011: workshop on data infrastructurEs for supporting information retrieval evaluation.</article-title>
          .
          <source>In SIGIR Forum</source>
          , Vol.
          <volume>46</volume>
          .
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <volume>51</volume>
          -
          <fpage>55</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arguello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Report on the SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR)</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , Vol.
          <volume>49</volume>
          . ACM,
          <volume>107</volume>
          -
          <fpage>116</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Clancy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z. Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The SIGIR 2019 Open-Source IR Replicability Challenge (OSIRRC</article-title>
          <year>2019</year>
          ).
          <article-title>(</article-title>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>D. De Roure</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The future of scholarly communications</article-title>
          .
          <source>Insights</source>
          <volume>27</volume>
          ,
          <issue>3</issue>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          , and X. Cheng.
          <year>2017</year>
          .
          <article-title>Matchzoo: A toolkit for deep text matching</article-title>
          .
          <source>arXiv preprint arXiv:1707.07270</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lippold</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science"</article-title>
          .
          <source>SIGIR Forum 50</source>
          ,
          <issue>1</issue>
          (
          <year>June 2016</year>
          ),
          <fpage>68</fpage>
          -
          <lpage>82</lpage>
          . https://doi.org/10.1145/2964797.2964808
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Overview of CENTRE@CLEF 2019: Sequel in the Systematic Reproducibility Realm. In Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SIGIR Initiative to Implement ACM Artifact Review and Badging</article-title>
          .
          <source>SIGIR Forum 52</source>
          ,
          <issue>1</issue>
          (Aug.
          <year>2018</year>
          ),
          <fpage>4</fpage>
          -
          <lpage>10</lpage>
          . https://doi.org/10.1145/ 3274784.3274786
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of CENTRE@ CLEF 2018: A first tale in the systematic reproducibility realm</article-title>
          .
          <source>In International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          . Springer,
          <fpage>239</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Docker Images of Neural Vector Space Model</article-title>
          . https://doi.org/10.5281/zenodo.3246361
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rauber</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Reproducibility of Data-Oriented Experiments in e-Science (Dagstuhl Seminar 16041)</article-title>
          .
          <source>Dagstuhl Reports</source>
          <volume>6</volume>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>108</fpage>
          -
          <lpage>159</lpage>
          . https://doi.org/10.4230/DagRep.6.1.
          <fpage>108</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Some common mistakes in IR evaluation, and how they can be avoided</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , Vol.
          <volume>51</volume>
          . ACM,
          <volume>32</volume>
          -
          <fpage>41</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gutmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hyvärinen</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Noise-contrastive estimation: A new estimation principle for unnormalized statistical models</article-title>
          .
          <source>In Proc. of the 13th International Conference on Artificial Intelligence and Statistics</source>
          .
          <volume>297</volume>
          -
          <fpage>304</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>The DARPA tipster project</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , Vol.
          <volume>26</volume>
          . ACM,
          <volume>26</volume>
          -
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kendall</surname>
          </string-name>
          .
          <year>1948</year>
          .
          <article-title>Rank correlation methods</article-title>
          . (
          <year>1948</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Chattopadhyaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Foley</surname>
          </string-name>
          , G. Ingersoll,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Toward reproducible baselines: The open-source IR reproducibility challenge</article-title>
          .
          <source>In European Conference on Information Retrieval</source>
          . Springer,
          <fpage>408</fpage>
          -
          <lpage>420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. McCreadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>From puppy to maturity: Experiences in developing terrier</article-title>
          .
          <source>Proc. of OSIR at SIGIR</source>
          (
          <year>2012</year>
          ),
          <fpage>60</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purpura</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Neural Vector Space Model Implementation Repository</article-title>
          . https://github.com/giansilv/NeuralIR/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Overview of the NTCIR-14 CENTRE Task</article-title>
          .
          <source>In Proceedings of the 14th NTCIR Conference on Evaluation of Information Access Technologies</source>
          . Tokyo, Japan.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the TREC 2018 CENTRE Track</article-title>
          .
          <source>In The Twenty-Seventh Text REtrieval Conference Proceedings (TREC</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , I. Ounis,
          <string-name>
            <given-names>S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Cartright</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Geva</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Open source information retrieval: a report on the SIGIR 2012 workshop</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , Vol.
          <volume>46</volume>
          . ACM,
          <volume>95</volume>
          -
          <fpage>101</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Gysel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Neural Vector Spaces for Unsupervised Information Retrieval</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>36</volume>
          ,
          <issue>4</issue>
          (
          <year>2018</year>
          ),
          <volume>38</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          :
          <fpage>25</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>The TREC Robust Retrieval Track</article-title>
          .
          <source>ACM SIGIR Forum 39</source>
          ,
          <issue>1</issue>
          (
          <year>2005</year>
          ),
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Moens</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings</article-title>
          .
          <source>In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM</source>
          ,
          <volume>363</volume>
          -
          <fpage>372</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Anserini: Reproducible Ranking Baselines Using Lucene</article-title>
          .
          <source>J. Data and Information Quality</source>
          <volume>10</volume>
          ,
          <issue>4</issue>
          , Article 16 (Oct.
          <year>2018</year>
          ),
          <volume>20</volume>
          pages. https://doi.org/10.1145/3239571
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>A study of smoothing methods for language models applied to information retrieval</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS) 22</source>
          ,
          <issue>2</issue>
          (
          <year>2004</year>
          ),
          <fpage>179</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>