Dockerizing Automatic Routing Runs for The Open-Source IR Replicability Challenge (OSIRRC 2019)

Timo Breuer and Philipp Schaer
firstname.lastname@th-koeln.de
Technische Hochschule Köln, Cologne, Germany

ABSTRACT

In the following, we describe our contribution to the Docker infrastructure for ad hoc retrieval experiments initiated by the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. We contribute automatic routing runs as Grossman and Cormack specified them during their participation in the TREC Common Core track 2017. Reimplementations of these runs are motivated by the CENTRE lab held at the CLEF conference in 2019. More specifically, we investigated the replicability and reproducibility of WCRobust04 and WCRobust0405. In the following, we give insights into the adaptation of our replicated CENTRE submissions and report on our experiences.

Image Source: https://github.com/osirrc/irc-centre2019-docker
Docker Hub: https://hub.docker.com/r/osirrc2019/irc-centre2019

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 INTRODUCTION

In 2018 the ACM introduced Artifact Review and Badging (https://www.acm.org/publications/policies/artifact-review-badging), concerned with procedures assuring repeatability, replicability, and reproducibility. According to the ACM definitions, reproducibility assumes the stated precision to be obtainable with a different team and a different setup. As a prerequisite, the stated precision should be replicable by a different team with the original setup of the artifacts. Thus replicability is an essential requirement towards reproducible results. Both objectives are closely coupled, and the requirements of reproducibility are related to those of replicability.

Empirical studies manifest most findings in the field of information retrieval (IR). Generally, these findings have to be replicable and reproducible in order to be of further use for future research and applications. Results may become invalid under slightly different conditions, and the reasons for non-reproducibility are manifold. Choosing weak baselines, selective reporting, or hidden and missing information are some reasons for non-reproducible findings. The well-known meta-studies by Armstrong et al. [1, 2] reveal the problem of illusory evaluation gains when comparing to weak or inappropriate baselines. More recent studies by Yang et al. [12] or Lin [9] show that this circumstance is still a huge problem, especially with regard to neural IR.

During the last years the IR community launched several Evaluation as a Service initiatives [8]. Attempts were made towards keeping test collections consistent across different systems, providing infrastructures for replicable environments, or increasing transparency by the use of open-source software.

The OSIRRC workshop (https://osirrc.github.io/osirrc2019/) located at SIGIR 2019 is devoted to the replicability of ad hoc retrieval systems. A major subject of interest is the integration of IR systems into a Docker infrastructure. Participants contribute existing IR systems by adapting them to pre-defined interfaces. The organizers put the focus on standard test collections in order to keep the underlying data consistent across different systems.

Docker facilitates the deployment of complex software systems. Dependencies and configurations can be specified in a standardized way. The resulting images are run with the help of OS-level virtualization. A growing community and efficient resource management make Docker preferable to other virtualization alternatives. Using Docker addresses some barriers to replicability. With the help of clearly specified environments, configuration errors and obscurities can be reduced to a minimum. An early attempt at making retrieval results replicable with Docker was made by Yang et al. [11]. The authors describe an online service which evaluates submitted retrieval systems run in Docker containers. Likewise, Crane [5] proposes Docker as a packaging tool for machine learning systems with multiple parameters.

The CENTRE lab at CLEF is another initiative concerned with the replicability and reproducibility of IR systems. Our participation in CENTRE@CLEF19 [6] was devoted to the replicability and reproducibility of automatic routing runs. In order to contribute our code submissions to the IR community, we integrate them into the proposed Docker infrastructure of OSIRRC.

The remainder of the paper is structured as follows. In section 2, we summarize our submission to CENTRE@CLEF19. In this context, the general concept of automatic routing runs is introduced. Section 3 gives insights into the adaptation of our reimplementations to the Docker infrastructure. The last section concludes with the resulting benefits and our experiences.
2 REIMPLEMENTATION OF AUTOMATIC ROUTING RUNS

In the context of CENTRE@CLEF19, participants were obliged to replicate, reproduce, and generalize IR systems submitted at previous conferences. Our participation in the CENTRE lab was motivated by the replication and reproduction of automatic routing runs submitted by Grossman and Cormack to the TREC Common Core Track in 2017 [4]. According to the guidelines of the CENTRE lab, participants have to reimplement the original procedures. Replicability is evaluated by applying the reimplementations to the data collections originally used. Reproducibility is evaluated by applying the reimplementations to new data collections. In the following, we describe the general concept of automatic routing runs, continue with some insights into our reimplementations, and conclude this section with the replicated results.

2.1 Automatic Routing Runs

Grossman and Cormack's contributions to the TREC Common Core 2017 track follow either a continuous active learning or a routing approach [7]. We focus on the latter in accordance with the CENTRE guidelines. As the authors point out, automatic routing runs are based on deriving ranking profiles for specific topics. These profiles are constructed with the help of relevance judgments. Opposed to other retrieval approaches, no explicit query is needed in order to derive a document ranking for a specific topic. Typically, queries stem from topics and their corresponding narratives. The proposed routing mechanisms, however, do not require such an input. Grossman and Cormack chose to implement the profile derivation with the help of a logistic regression classifier. They convert text documents into numerical representations by determining tfidf weights. Subsequently, they train the classifier with these tfidf features in combination with binary relevance judgments. The classifier is used to rank the documents of another corpus which is different from the one used for training. The likelihood of a document being relevant serves as its score. Since there is no human intervention in the described procedure, it is fully automatic. It has to be considered that this approach is limited to corpora which share relevance judgments for the same topics. The entire corpus is ranked, whereas for each topic, the 10,000 highest ranking documents are used for evaluation.

Grossman and Cormack rank the New York Times corpus with training data based on the Robust04 collection. The resulting run is titled WCRobust04. By enriching the training data with documents from the Robust05 collection, they achieve improvements in terms of MAP and P@10. The resulting run is titled WCRobust0405. Table 1 shows an overview of the run constellations as they were used in the context of the CENTRE lab.

Table 1: Run constellations: Replicated runs are made of rankings from New York Times (NYT) documents. The underlying data of Robust04 and Robust05 are the TREC Disks 4&5 (minus congressional records) and the AQUAINT corpus, respectively.

Run           Test            Training
WCRobust04    New York Times  Robust04
WCRobust0405  New York Times  Robust04+05
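To make the routing procedure more tangible, the following minimal sketch shows how a ranking for a single topic could be derived with scikit-learn. It is not the original code of Grossman and Cormack or of our submission; the variable names (train_texts, train_labels, target_texts, target_docids) are hypothetical placeholders, and the corpora are assumed to be already parsed into plain text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def routing_run(train_texts, train_labels, target_texts, target_docids, k=10000):
    """Rank a target corpus for one topic, given binary relevance judgments."""
    # Derive tfidf features; here the vectorizer is fit on the training corpus
    # only (Grossman and Cormack originally derive them from a union corpus).
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(train_texts)

    # Train the topic profile on binary labels (1 = relevant, 0 = not relevant).
    classifier = LogisticRegression(solver="liblinear")
    classifier.fit(x_train, train_labels)

    # Score every document of the target corpus by its likelihood of relevance.
    x_target = vectorizer.transform(target_texts)
    scores = classifier.predict_proba(x_target)[:, 1]

    # The k highest scoring documents form the ranking for this topic.
    ranked = sorted(zip(target_docids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```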
2.2 Reimplementation

Based on the description by Grossman and Cormack, we chose to reimplement the system from scratch. The CENTRE organizers stipulate the use of open-source software. The Python community offers a rich toolset of open and free software. We pursued a Python-only implementation, since the required components were largely available in existing packages. A detailed description of our implementation is available in our workshop report [4]. The general workflow can be split into two processing stages.

2.2.1 Data preparation. The first stage prepares the corpus data. Besides different compression formats, we also consider diverging text formatting. We adapted the preparation steps specifically to the characteristics of the corpora. After extraction, single documents are written to files which contain parsed text data. Grossman and Cormack envisage a union corpus in order to derive tfidf-features. The corpus with the training samples as well as the corpus to be ranked are supposed to be unified. This proceeding results in training features that are augmented by the vocabulary of the corpus to be ranked. In their contribution to the ECIR reproducibility workshop in 2019, Yu et al. consider this augmentation to be insignificant [13]. In our experimental setups, we compare the resulting runs of augmented and non-augmented training features and confirm the assumption made by Yu et al. It is reasonable to neglect the tfidf-derivation from a unified corpus. Features can be taken solely from the training corpus without negatively affecting the evaluation measures. Due to these findings, we deviate from the original procedure with regard to the tfidf-derivation. Our training features are exclusively derived from the Robust corpora.

2.2.2 Training & Prediction. Single document files with parsed text result from the data preparation step. The scikit-learn package [10] offers the possibility to derive tfidf-features with the TfidfVectorizer. A term-document matrix is built by providing the document files from the Robust corpora to the TfidfVectorizer. Text documents from all corpora are converted to numerical representations with regard to this matrix. Converting documents from the New York Times corpus, for instance, can result in vectors that do not cover the complete vocabulary of specific documents. This restriction is necessary for retrieving document vectors of equal length. As pointed out above, this does not affect the evaluation outcomes. Relevance judgments are converted to a binary scale and serve for selecting training documents from the Robust corpora. For a given topic, the judged documents are converted to feature vectors and are prepared as training input for the logistic regression classifier. We dump the training data in SVMlight format to keep our workflow compatible with other machine learning frameworks. In our case, we make use of the LogisticRegression model from the scikit-learn package. The model is trained topic-wise with features labeled as either relevant or not. Afterwards, each document of the new corpus is scored. We order the documents by descending score for each topic. The 10,000 highest ranking documents form the final ranking of a single topic.
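The SVMlight dump mentioned above is the only interchange format we rely on. The snippet below is a hedged sketch of this round trip; the topic number in the file name and the randomly generated feature matrix are purely illustrative stand-ins for the tfidf-features produced by the vectorizer.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins for one topic's judged documents: a sparse feature
# matrix and binary relevance labels.
x_topic = csr_matrix(np.random.rand(6, 50))
y_topic = np.array([1, 0, 1, 0, 0, 1])

# Dump the topic-wise training data so that other SVMlight-capable
# frameworks could consume it as well.
dump_svmlight_file(x_topic, y_topic, "topic-301.svmlight", zero_based=True)

# Read the data back for training; n_features keeps the vector length
# aligned with the vectorizer vocabulary.
x_loaded, y_loaded = load_svmlight_file("topic-301.svmlight", n_features=x_topic.shape[1])
classifier = LogisticRegression(solver="liblinear").fit(x_loaded, y_loaded)
```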
2.2.3 Outcomes. While the replicated P@10 values come close to those given by Grossman and Cormack, the MAP values stay below the baseline (more details in section 3.4). Especially the reproduced values (derived from another corpus) drop significantly and offer a starting point for future investigations [4].

3 DOCKER ADAPTATION

In the following, we describe the porting of our replicated CENTRE submission into Docker images. More specifically, we give insights into how the workflow is adapted to the given hooks. In this context, we refer to the procedure illustrated in the previous section. Contributors of retrieval systems have to adapt their systems with respect to pre-defined hooks. These hooks are implemented with the help of scripts for initialization, indexing, searching, training, and interaction. They should be located in the root directory of the Docker containers. A Python-based framework ("jig") calls these scripts and invokes the corresponding processing steps. Automatic routing runs slightly differ from the conventional ad hoc approach. Instead of deriving rankings based on a query, a classification model is devised based on the judged documents for a given topic. The general workflow of our submission is illustrated in figure 1.

Supported Collections: robust04, robust05, core17
Supported Hooks: init, index, search

Figure 1: Depiction of how our reimplementation of automatic routing runs is adapted to the workflow given by the "jig". The two cyan boxes include the processing steps which are conducted in the running container instances. The objects within the dashed rectangle are committed to the Docker image after indexing is done.

3.1 Dockerfile

Since our implementation is completely done with Python, the image relies on an existing Python 3 image. Upon image building, directories are created and the three scripts for initialization, indexing, and searching are copied. The required corpora are mounted as volumes when starting the container.

3.2 Hooks

3.2.1 Initialization. On initialization, the source code is downloaded from a public GitHub repository and the required Python packages are installed. Depending on the specified run, either WCRobust04 or WCRobust0405 will be replicated, and the corresponding scripts are prepared.

3.2.2 Indexing. After successful initialization, indexing is done by determining tfidf-features. Data extraction and text processing result in single documents for each corpus. A term-document matrix is constructed by using the TfidfVectorizer of the scikit-learn package. In combination with the qrel files from Robust04 and Robust05, documents are picked and transformed into tfidf-features with respect to the term-document matrix. Likewise, the entire NYT corpus is transformed into a numerical representation according to this matrix. At the end of the indexing process, a Python shelf containing the tfidf-features of all documents from the NYT corpus and the SVMlight-formatted tfidf-features remain as artifacts. They will be committed to the resulting image; all other artifacts like the vectorizer or the extracted document files have to be deleted in order to keep the image size low.
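As a rough sketch of this index-time artifact, the snippet below shows how sparse tfidf vectors of the corpus to be ranked can be persisted in a Python shelf keyed by document id. The document texts and the shelf file name are illustrative assumptions; in the actual image the texts are read from the single-document files produced during data preparation.

```python
import shelve
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative inputs: parsed documents of the training and target corpora.
robust_docs = ["parsed robust document one ...", "parsed robust document two ..."]
nyt_docs = {"nyt-0001": "first parsed article ...", "nyt-0002": "second parsed article ..."}

# The term-document matrix is derived from the Robust corpora only.
vectorizer = TfidfVectorizer()
vectorizer.fit(robust_docs)

# Persist one sparse tfidf vector per NYT document; the shelf remains in the
# committed image and is read back at search time.
with shelve.open("nyt_tfidf.shelf") as shelf:
    for doc_id, text in nyt_docs.items():
        shelf[doc_id] = vectorizer.transform([text])
```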
3.2.3 Searching. The "jig" starts a new container running the image committed in the previous step. In our case, the "searching" process consists of training a topic model and scoring the entire NYT corpus for each topic. We make use of the logistic regression classifier implemented in the scikit-learn package, although other machine learning models should be easy to integrate. The 10,000 highest scoring documents for each topic are merged into one final run file. The "jig" handles the evaluation by using trec_eval.

3.3 Modifications

Our CENTRE submission (https://bitbucket.org/centre_eval/c2019_irc/) was adaptable with little effort. The following modifications were necessary in order to prepare the code for the Docker infrastructure.

After initialization and indexing are done, the "jig" commits the container's changes, including the indexed collection. The new image is run in a second container which conducts the ranking. Due to this given workflow, we were obliged to split our CENTRE submission into two main processing steps. We decided to keep only the tfidf artifacts in order to keep the committed image as small as possible. The text artifacts resulting from the preprocessing are deleted after the determination of the term-document matrix/TfidfVectorizer and the tfidf-features. At first, we omitted the removal of the unneeded documents, resulting in large Docker image sizes that could not be handled on moderate hardware.

The data extraction and text processing steps are parallelized, speeding up the preprocessing. In this context, special attention had to be paid to compressed files with the same name but different endings (.0z, .1z, .2z), since extracting these files in parallel would otherwise result in name conflicts.
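A hedged sketch of this parallelized extraction is given below. It simplifies matters by treating all archives as gzip-readable, whereas the real scripts handle several compression formats, and the directory names are hypothetical. The essential point is that the output name keeps the original ending, so archives that differ only in .0z, .1z, or .2z do not overwrite each other when processed concurrently.

```python
import gzip
import multiprocessing as mp
from pathlib import Path

OUT_DIR = Path("extracted")


def extract(archive: str) -> str:
    src = Path(archive)
    # Keep the compression suffix in the output name ("xyz.0z" -> "xyz.0z.txt")
    # instead of stripping it, which would map all sibling archives to "xyz".
    dst = OUT_DIR / (src.name + ".txt")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())
    return str(dst)


if __name__ == "__main__":
    OUT_DIR.mkdir(exist_ok=True)
    archives = [str(p) for p in Path("corpus").glob("**/*.?z")]
    with mp.Pool() as pool:
        extracted_files = pool.map(extract, archives)
```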
3.4 Evaluation Outcomes

The evaluation outcomes of our replicated runs are given in table 2. We were not able to fully replicate the baseline given by Grossman and Cormack. The replicated evaluation outcomes are slightly worse compared to the originals but are similar to the results achieved by Yu et al. with their classification-only approach [13]. Evaluation outcomes can be improved by enriching the training data with an additional corpus (in this case combining Robust04 and Robust05). By using the OSIRRC Docker infrastructure, we can rebuild the exact environment used for our submissions to CENTRE@CLEF2019. The resulting evaluation outcomes match those which were achieved during our participation at the CENTRE lab.

Table 2: Results of the replicated runs in comparison to the baseline given by Grossman and Cormack [7]. All runs are based on 50 topics.

               Run           MAP     P@10
Baseline       WCRobust04    0.3711  0.6460
               WCRobust0405  0.4278  0.7500
Replicability  WCRobust04    0.2971  0.6820
               WCRobust0405  0.3539  0.7360

3.5 Limitations

At the current state, rankings for replicated runs are possible. This means that only the NYT and Robust corpora will be processed correctly by our Docker image. In the future, support of the Washington Post corpus can be integrated for the further investigation of reproducibility. Going a step further, the complete data preparation step could be extended towards more general compatibility with other test collections or generic text data. At the moment, the routines are highly adjusted to the workflow given by Grossman and Cormack and the underlying data.

Even though the data preparation is parallelized, it takes a while to index the corpora. In order to reduce the indexing time, the text preprocessing can be omitted, leading to a compromise between execution time and evaluation measures.

Predictions of the logistic regression classifier are used for scoring documents. Currently, the corresponding tfidf-features are stored in a Python shelf and are read out sequentially for classification. This step should be parallelized to reduce the ranking time.
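One possible direction, sketched below under the assumption that the shelf written at index time is available read-only and that a trained scikit-learn classifier is passed in, is to split the document ids into chunks and score them in separate worker processes. The function and variable names are hypothetical and not part of our submission.

```python
import shelve
from concurrent.futures import ProcessPoolExecutor

import scipy.sparse as sp

SHELF_PATH = "nyt_tfidf.shelf"  # assumed location of the index-time shelf


def score_chunk(args):
    classifier, doc_ids = args
    # Each worker opens the shelf read-only and scores its share of documents.
    with shelve.open(SHELF_PATH, flag="r") as shelf:
        features = sp.vstack([shelf[doc_id] for doc_id in doc_ids])
    scores = classifier.predict_proba(features)[:, 1]
    return list(zip(doc_ids, scores))


def parallel_scores(classifier, doc_ids, workers=4):
    # Distribute the document ids round-robin and drop empty chunks.
    chunks = [c for c in (doc_ids[i::workers] for i in range(workers)) if c]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_chunk, [(classifier, chunk) for chunk in chunks])
    return [pair for chunk in results for pair in chunk]
```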
4 CONCLUSION

We contributed our CENTRE@CLEF19 submissions to the Docker infrastructure initiated by the OSIRRC workshop. Our original code submission reimplemented automatic routing runs as they were described by Grossman and Cormack [7]. In the course of our CENTRE participation, we investigated the replicability and reproducibility of the given procedure. We focus on contributing replicated runs to the Docker infrastructure. CENTRE defines replicability by using the original test collection in combination with a different setup. Thus our reimplementations rank documents of the NYT corpus by using the Robust corpora for the training of topic models.

The adaptation to the Docker infrastructure was realizable with little effort. We adjusted the workflow with regard to the given hooks. The resulting runs exactly match those which were replicated in the context of our CENTRE participation. Due to the encapsulation into Docker images, less configuration effort is required and our experimental environment can be exactly reconstructed. Required components are taken from existing Docker images and Python packages.

Starting points for future improvements were elaborated in the previous section. Investigations on reproducibility can be made possible by integrating the Washington Post corpus into our workflow. In this context, the support of other test collections might also be interesting. Parallelizing the classification can reduce the execution time of the ranking. An archived version of our submitted Docker image is available at Zenodo [3].

REFERENCES

[1] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Has Adhoc Retrieval Improved Since 1994?. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09). ACM, New York, NY, USA, 692–693. https://doi.org/10.1145/1571941.1572081
[2] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements That Don't Add Up: Ad-hoc Retrieval Results Since 1998. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09). ACM, New York, NY, USA, 601–610. https://doi.org/10.1145/1645953.1646031
[3] Timo Breuer and Philipp Schaer. 2019. osirrc/irc-centre2019-docker: OSIRRC @ SIGIR 2019 Docker Image for IRC-CENTRE2019. https://doi.org/10.5281/zenodo.3245439
[4] Timo Breuer and Philipp Schaer. 2019. Replicability and Reproducibility of Automatic Routing Runs. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings). CEUR-WS.org. (accepted).
[5] Matt Crane. 2018. Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results. TACL 6 (2018), 241–252. https://transacl.org/ojs/index.php/tacl/article/view/1299
[6] Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. 2019. CENTRE@CLEF 2019. In Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part II (Lecture Notes in Computer Science), Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra (Eds.), Vol. 11438. Springer, 283–290. https://doi.org/10.1007/978-3-030-15719-7_38
[7] Maura R. Grossman and Gordon V. Cormack. 2017. MRG_UWaterloo and WaterlooCormack Participation in the TREC 2017 Common Core Track. In Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017 (NIST Special Publication 500-324), Ellen M. Voorhees and Angela Ellis (Eds.). National Institute of Standards and Technology (NIST). https://trec.nist.gov/pubs/trec26/papers/MRG_UWaterloo-CC.pdf
[8] Frank Hopfgartner, Allan Hanbury, Henning Müller, Ivan Eggel, Krisztian Balog, Torben Brodt, Gordon V. Cormack, Jimmy Lin, Jayashree Kalpathy-Cramer, Noriko Kando, Makoto P. Kato, Anastasia Krithara, Tim Gollub, Martin Potthast, Evelyne Viegas, and Simon Mercer. 2018. Evaluation-as-a-Service for the Computational Sciences: Overview and Outlook. J. Data and Information Quality 10, 4, Article 15 (Oct. 2018), 32 pages. https://doi.org/10.1145/3239570
[9] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 52, 2 (Jan. 2019), 40–51. https://doi.org/10.1145/3308774.3308781
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[11] Peilin Yang and Hui Fang. 2016. A Reproducibility Study of Information Retrieval Models. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16). ACM, New York, NY, USA, 77–86. https://doi.org/10.1145/2970398.2970415
[12] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. CoRR abs/1904.09171 (2019). arXiv:1904.09171 http://arxiv.org/abs/1904.09171
[13] Ruifan Yu, Yuhao Xie, and Jimmy Lin. 2019. Simple Techniques for Cross-Collection Relevance Feedback. In Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part I (Lecture Notes in Computer Science), Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra (Eds.), Vol. 11437. Springer, 397–409. https://doi.org/10.1007/978-3-030-15712-8_26