<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dockerizing Automatic Routing Runs for The Open-Source IR Replicability Challenge (OSIRRC 2019)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timo Breuer</string-name>
          <email>firstname.lastname@th-koeln.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Schaer</string-name>
          <email>firstname.lastname@th-koeln.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Hochschule Köln</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Hochschule Köln</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Code: https://github.com/osirrc/irc-centre2019-docker, Docker Hub: https://hub.docker.com/r/osirrc2019/irc-centre2019</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>31</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>In the following, we describe our contribution to the Docker infrastructure for ad hoc retrieval experiments initiated by the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. We contribute automatic routing runs as Grossman and Cormack specified them during their participation in the TREC Common Core track 2017. Reimplementations of these runs are motivated by the CENTRE lab held at the CLEF conference in 2019. More specifically, we investigated the replicability and reproducibility of WCRobust04 and WCRobust0405. Finally, we give insights into the adaptation of our replicated CENTRE submissions and report on our experiences.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In 2018 the ACM introduced Artifact Review and Badging,
concerned with procedures assuring repeatability, replicability, and
reproducibility. According to the ACM definitions, reproducibility
assumes the stated precision to be obtainable with a different team
and a different setup. As a precondition, the stated precision should be
replicable by a different team with the original setup of the artifacts.
Thus replicability is an essential requirement for reproducible
results. Both objectives are closely coupled, and the requirements of
reproducibility are related to those of replicability.</p>
      <p>
        Most findings in the field of
information retrieval (IR) rest on empirical studies. Generally, these findings have to be replicable
and reproducible in order to be of further use for future research
and applications. Results may become invalid under slightly
different conditions, and the reasons for non-reproducibility are manifold.
Choosing weak baselines, selective reporting, or hidden and missing
information are some causes of non-reproducible findings. The
well-known meta-studies by Armstrong [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] reveal the problem of
illusory evaluation gains when comparing to weak or inappropriate
baselines. More recent studies by Wang et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or Lin et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
show that this circumstance is still a huge problem, especially with
regard to neural IR.
      </p>
      <p>
        In recent years, the IR community has launched several
Evaluation as a Service initiatives [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Attempts were made towards
keeping test collections consistent across different systems,
providing infrastructures for replicable environments, and increasing
transparency by the use of open-source software.
      </p>
      <p>The OSIRRC workshop (https://osirrc.github.io/osirrc2019/) at SIGIR 2019 is devoted to
the replicability of ad hoc retrieval systems. A major subject of
interest is the integration of IR systems into a Docker infrastructure.
Participants contribute existing IR systems by adapting them to
pre-defined interfaces. The organizers put the focus on standard
test collections in order to keep the underlying data consistent across
different systems.</p>
      <p>
        Docker facilitates the deployment of complex software systems.
Dependencies and configurations can be specified in a
standardized way. The resulting images are run with the help of OS-level
virtualization. A growing community and efficient resource
management make Docker preferable to other virtualization alternatives.
Using Docker addresses some barriers to replicability. With the
help of clearly specified environments, configuration errors and
obscurities can be reduced to a minimum. An early attempt at
making retrieval results replicable with Docker was made by Yang et
al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The authors describe an online service, which evaluates
submitted retrieval systems run in Docker containers. Likewise,
Crane [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes Docker as packaging tool for machine learning
systems with multiple parameters.
      </p>
      <p>
        The CENTRE lab at CLEF is another initiative concerned with
the replicability and reproducibility of IR systems. Our
participation in CENTRE@CLEF19 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was devoted to the replicability and
reproducibility of automatic routing runs. In order to contribute
our code submissions to the IR community, we integrate them into
the proposed Docker infrastructure of OSIRRC.
      </p>
      <p>The remainder of the paper is structured as follows. In section
2, we will summarize our submission to CENTRE@CLEF19. In
this context, the general concept of automatic routing runs will
be introduced. Section 3 gives insights into the adaptation of our
reimplementations to the Docker infrastructure. The last section
concludes with the resulting benefits and experiences.</p>
    </sec>
    <sec id="sec-2">
      <title>REIMPLEMENTATION OF AUTOMATIC ROUTING RUNS</title>
      <p>
        In the context of CENTRE@CLEF19, participants were obliged
to replicate, reproduce, and generalize IR systems submitted at
previous conferences. Our participation in the CENTRE lab was
motivated by the replication and reproduction of automatic routing
runs submitted by Grossman and Cormack to the TREC Common
Core Track in 2017 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. According to the guidelines of the
CENTRE lab, participants have to reimplement the original procedures.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Run constellations used in the context of the CENTRE lab.</p></caption>
        <table>
          <thead>
            <tr><th>Run</th><th>Test</th><th>Training</th></tr>
          </thead>
          <tbody>
            <tr><td>WCRobust04</td><td>New York Times</td><td>Robust04</td></tr>
            <tr><td>WCRobust0405</td><td>New York Times</td><td>Robust04+05</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Replicability is evaluated by applying reimplementations to data
collections originally used. Reproducibility is evaluated by applying
reimplementations to new data collections. In the following, we
will describe the general concept of automatic routing runs, we
will continue with some insights into our reimplementations and
conclude this section with the replicated results.</p>
    </sec>
    <sec id="sec-4">
      <title>Automatic Routing Runs</title>
      <p>
        Grossman’s and Cormack’s contributions to the TREC Common
Core 2017 track follow either a continuous active learning or
routing approach [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We focus on the latter in accordance with the
CENTRE guidelines. As the authors point out, automatic routing
runs are based on deriving ranking profiles for specific topics. With
the help of relevance judgments, these profiles are constructed.
Opposed to other retrieval approaches, no explicit query is needed
in order to derive a document ranking for a specific topic. Typically,
queries are stemmed from topics and corresponding narratives.
The proposed routing mechanisms, however, do not require such
an input. Grossman and Cormack chose to implement the profile
derivation with the help of a logistic regression classifier. They
convert text documents into numerical representations by the
determination of tfidf weights. Subsequently, they train the classifier
with these tfidf features in combination with binary relevance
judgments. The classifier is used to rank documents of another corpus
which is different from the one used for training. The likelihood of
documents being relevant will serve as a score. Since there is no
human intervention in the described procedure, it is fully automatic. It
has to be considered that this approach is limited to corpora which
share relevance judgments for the same topics. The entire corpus
is ranked, and for each topic the 10,000 highest-ranking
documents are used for evaluation. Grossman and Cormack rank the
New York Times corpus with training data based on the Robust04
collection. The resulting run is titled WCRobust04. By enriching
training data with documents from the Robust05 collection, they
acquire improvements in terms of MAP and P@10. The resulting
run is titled WCRobust0405. Table 1 shows an overview of run
constellations as they were used in the context of the CENTRE lab.
      </p>
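      <p>The described procedure can be sketched with scikit-learn as follows. This is our own illustrative toy example (the document strings, judgments, and variable names are invented), not the code of Grossman and Cormack:</p>

```python
# Sketch of a routing run: per-topic logistic regression over tf-idf features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training corpus with binary relevance judgments for one topic.
train_docs = ["rain storm flood", "sunny sky beach",
              "storm damage flood water", "beach sand sun"]
train_qrels = [1, 0, 1, 0]  # 1 = relevant to the topic, 0 = not relevant

# Target corpus to be ranked (stands in for the New York Times corpus).
target_docs = ["flood warning after storm", "holiday on the beach",
               "storm and heavy rain"]

# Features are derived from the training corpus only (see Section 2.2.1).
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_docs)
x_target = vectorizer.transform(target_docs)

# One classifier per topic; its relevance probability serves as the score.
clf = LogisticRegression().fit(x_train, train_qrels)
scores = clf.predict_proba(x_target)[:, 1]

# Order target documents by descending score; in the real run,
# the 10,000 highest-ranking documents would form the ranking.
ranking = sorted(range(len(target_docs)), key=lambda i: scores[i], reverse=True)
```

In the actual workflow this training and scoring is repeated for every topic of the test collection.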
    </sec>
    <sec id="sec-5">
      <title>Reimplementation</title>
      <p>
        Based on the description by Grossman and Cormack, we chose to
reimplement the system from scratch. The CENTRE organizers stipulate
the use of open-source software. The Python community offers
a rich toolset of open and free software. We pursued a
Python-only implementation, since the required components were largely
available in existing packages. A detailed description of our
implementation is available in our workshop report [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The general
workflow can be split into two processing stages.
2.2.1 Data preparation. The first stage prepares the corpora data.
Besides different compression formats, we also consider
diverging text formatting. We adapted the preparation steps specifically
to the characteristics of the corpora. After extraction, single
documents are written to files which contain parsed text data.
Grossman and Cormack envisage a union corpus in order to
derive tfidf-features. The corpus with training samples as well as the
corpus to be ranked are supposed to be unified. This proceeding
results in training features that are augmented by the vocabulary
of the corpus to be ranked. In their contribution to the ECIR
reproducibility workshop in 2019 Yu et al. consider this augmentation
to be insignificant [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In our experimental setups, we compare
resulting runs of augmented and non-augmented training features
and confirm the assumptions made by Yu et al. It is reasonable
to neglect the tfidf-derivation from a unified corpus. Features can be
taken solely from the training corpus without negatively affecting
evaluation measures. Due to these findings, we deviate from the
original procedure with regard to the tfidf-derivation. Our training
features are exclusively derived from Robust corpora.
2.2.2 Training &amp; Prediction. Single document files with parsed text
result from the data preparation step. The scikit-learn package [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
offers the possibility to derive tfidf-features with its
TfidfVectorizer. A term-document matrix is built by providing
document files from the Robust corpora to the TfidfVectorizer. Text
documents from all corpora are converted to numerical
representations with regard to this matrix. Converting documents from
the New York Times corpus, for instance, can result in vectors that
do not cover the complete vocabulary of those documents. This
restriction ensures document vectors of equal length.
As pointed out above, this does not affect evaluation outcomes.
Relevance judgments are converted to a binary scale and serve for
selecting training documents from the Robust corpora. Based on
a given topic, judged documents are converted to feature vectors
and are prepared as training input for the logistic regression
classifier. We dump the training data in SVMlight format to keep our
workflow compatible with other machine learning frameworks. In
our case, we make use of the LogisticRegression model from the
scikit-learn package. The model is trained topic-wise with features
being either relevant or not. Afterwards, each document of the new
corpus is scored. We order documents by descending score for
each topic. The 10,000 highest-ranking documents form the final
ranking of a single topic.
2.2.3 Outcomes. While the replicated P@10 values come close to those
given by Grossman and Cormack, the MAP values stay below the
baseline (more details in Section 3.4). Especially the reproduced values
(derived from another corpus) drop significantly and offer a starting
point for future investigations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
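      <p>The SVMlight dump can be sketched with scikit-learn's helpers; whether the original code uses exactly these helpers is an assumption, and the toy documents and labels below are invented:</p>

```python
# Persisting per-topic training data in SVMlight format keeps the
# workflow compatible with other machine learning frameworks.
from io import BytesIO
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["storm flood rain", "sunny beach", "flood water storm"]
labels = [1, 0, 1]  # binary relevance judgments for one topic

x = TfidfVectorizer().fit_transform(docs)

buf = BytesIO()
dump_svmlight_file(x, labels, buf)  # one "label index:value ..." line per doc
x_back, y_back = load_svmlight_file(BytesIO(buf.getvalue()))
```

Any framework that reads SVMlight files can now consume the same per-topic training data.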
    </sec>
    <sec id="sec-6">
      <title>DOCKER ADAPTION</title>
      <p>In the following, we describe the porting of our replicated
CENTRE submission into Docker images. More specifically, we give
insights into how the workflow is adapted to the given hooks. In this
context, we refer to the procedure illustrated in the previous section.
Contributors of retrieval systems have to adapt their systems with
respect to pre-defined hooks. These hooks are implemented with
the help of scripts for initialization, indexing, searching, training,
and interaction. They should be located in the root directory of the
Docker containers. A Python-based framework ("jig") calls these
scripts and invokes the corresponding processing steps. Automatic
routing runs differ slightly from the conventional ad hoc approach.
Instead of deriving rankings based on a query, a classification model
is devised based on judged documents for a given topic. The general
workflow of our submission is illustrated in Figure 1.</p>
      <sec id="sec-6-1">
        <title>Supported Collections:</title>
        <p>robust04, robust05, core17</p>
      </sec>
      <sec id="sec-6-2">
        <title>Supported Hooks:</title>
        <p>init, index, search
3.1</p>
      </sec>
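      <p>A minimal sketch of such a hook entry point is given below. The argument names and the JSON parameter passing are our own assumptions for illustration; the actual protocol is defined by the OSIRRC jig:</p>

```python
# Hypothetical dispatch script for the supported hooks; the jig would
# invoke it with the hook name and optional parameters.
import argparse
import json

def init(params):    # download sources, install packages
    return "init"

def index(params):   # prepare corpora and derive tfidf-features
    return "index"

def search(params):  # train per-topic models and produce the run file
    return "search"

HOOKS = {"init": init, "index": index, "search": search}

def dispatch(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("hook", choices=sorted(HOOKS))
    parser.add_argument("--json", default="{}")  # parameters as a JSON string
    ns = parser.parse_args(argv)
    return HOOKS[ns.hook](json.loads(ns.json))
```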
    </sec>
    <sec id="sec-7">
      <title>Dockerfile</title>
      <p>Since our implementation is done entirely in Python, the
image relies on an existing Python 3 image. Upon image building,
directories are created and the three scripts for initialization,
indexing, and searching are copied. The required corpora are
mounted as volumes when starting the container.</p>
      <p>Hooks. 3.2.1 Initialization. On initialization, the source code is
downloaded from a public GitHub repository. Required Python packages
are installed. Depending on the specified run, either
WCRobust04 or WCRobust0405 is replicated, and the corresponding
scripts are prepared.
3.2.2 Indexing. After successful initialization, indexing is done by
determining tfidf-features. Data extraction and text processing
result in single documents for each corpus. A term-document matrix
is constructed by using the TfidfVectorizer of the scikit-learn
package. In combination with qrel files from Robust04 and Robust05,
documents are picked and transformed into tfidf-features with
respect to the term-document matrix. Likewise, the entire NYT
corpus is transformed into a numerical representation according
to this matrix. At the end of the indexing process, a Python shelf
containing tfidf-features of all documents from the NYT corpus
and SVMlight-formatted tfidf-features remain as artifacts. They
are committed to the resulting image; all other artifacts, like the
vectorizer or extracted document files, have to be deleted in order
to keep the image size low.
3.2.3 Searching. The "jig" will start a new container running the
committed image of the previous step. In our case, the "searching"
process consists of training a topic model and scoring the entire
NYT corpus for each topic. We make use of the logistic regression
classifier implemented in the scikit-learn package, although other
machine learning models should be easy to integrate. The 10,000
highest-scoring documents for each topic are merged into one final run
file. The "jig" handles evaluation by using trec_eval.</p>
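      <p>The final run file follows the standard TREC run format ("topic Q0 docid rank score tag") that trec_eval consumes. A minimal sketch, with invented docids, scores, and run tag:</p>

```python
# Format one topic's ranked documents as TREC run file lines.
def run_lines(topic, scored_docs, tag="irc-centre2019"):
    # scored_docs: list of (docid, score), already ordered by descending score
    lines = []
    for rank, (docid, score) in enumerate(scored_docs, start=1):
        lines.append(f"{topic} Q0 {docid} {rank} {score:.4f} {tag}")
    return lines
```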
      <sec id="sec-7-2">
        <title>3.3 Image Size</title>
        <p>We divided our submission into two main processing steps. We decided to keep the
tfidf artifacts only, in order to keep the committed image as small
as possible. The text artifacts resulting from the preprocessing will
be deleted after the determination of the term-document
matrix/TfidfVectorizer and tfidf-features. At first, we omitted the removal
of unneeded documents, resulting in large Docker image sizes that
could not be handled on moderate hardware.</p>
        <p>The data extraction and text processing steps are parallelized,
speeding up the preprocessing. In this context, special attention had
to be paid to compressed files of the same name with different
endings (.0z, .1z, .2z), since extracting these files in parallel would
result in name conflicts.</p>
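      <p>One simple way to avoid such conflicts is to derive the output name from the full member name, including its compression suffix. A sketch with invented paths (not necessarily how our scripts resolve it):</p>

```python
# Derive a unique output file name from an archive member path, so that
# "docs.0z", "docs.1z", and "docs.2z" no longer collide after extraction.
from pathlib import Path

def unique_output_name(archive_path):
    p = Path(archive_path)
    # keep the compression suffix in the output stem, e.g. "docs.0z" becomes "docs_0z.txt"
    return p.name.replace(".", "_") + ".txt"
```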
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.4 Evaluation Outcomes</title>
      <p>
        The evaluation outcomes of our replicated runs are given in Table 2.
We were not able to fully replicate the baseline given by Grossman
and Cormack. The replicated evaluation outcomes are slightly worse
compared to the originals but similar to the results achieved by
Yu et al. with their classification-only approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Evaluation
outcomes can be improved by enriching training data with an
additional corpus (in this case combining Robust04 and Robust05).
By using the OSIRRC Docker infrastructure, we can rebuild the exact
environment used for our submissions to CENTRE@CLEF2019. The
resulting evaluation outcomes match those which were achieved
during our participation at the CENTRE lab.
      </p>
    </sec>
    <sec id="sec-9">
      <title>3.5 Limitations</title>
      <p>In the current state, only rankings for replicated runs are possible. This
means that only the NYT and Robust corpora are processed correctly
by our Docker image. In the future, support of the Washington Post
corpus can be integrated for the further investigation of
reproducibility. Going a step further, the complete data preparation step
could be extended to more general compatibility with other test
collections or generic text data. At the moment, the routines are
highly adjusted to the given workflow by Grossman and Cormack
and the underlying data.</p>
      <p>Even though the data preparation is parallelized, indexing the
corpora takes a while. In order to reduce indexing time, the text
preprocessing can be omitted, at the cost of a compromise between
execution time and evaluation measures.</p>
      <p>Predictions of the logistic regression classifier are used for
scoring documents. Currently, the corresponding tfidf-features are
stored in a Python shelf and are read out sequentially for
classification. This step should be parallelized to reduce ranking
time.</p>
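      <p>A chunked scheme could parallelize this step. The sketch below is our own illustration, not part of the current implementation; threads are used for brevity, and the scoring function is a stand-in for the classifier:</p>

```python
# Split stored feature vectors into chunks and score the chunks concurrently.
from concurrent.futures import ThreadPoolExecutor

def score_chunk(chunk):
    # stand-in for clf.predict_proba on a chunk of tfidf vectors
    return [sum(vec) for vec in chunk]

def parallel_scores(vectors, chunks=2):
    size = max(1, (len(vectors) + chunks - 1) // chunks)
    parts = [vectors[i:i + size] for i in range(0, len(vectors), size)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(score_chunk, parts)  # preserves chunk order
    return [s for part in results for s in part]
```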
    </sec>
    <sec id="sec-10">
      <title>4 CONCLUSION</title>
      <p>
        We contributed our CENTRE@CLEF19 submissions to the Docker
infrastructure initiated by the OSIRRC workshop. Our original code
submission reimplemented automatic routing runs as they were
described by Grossman and Cormack [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the course of our CENTRE
participation, we investigated the replicability and reproducibility
of the given procedure. We focus on contributing replicated runs to
the Docker infrastructure. CENTRE defines replicability by using
the original test collection in combination with a diferent setup.
Thus our reimplementations rank documents of the NYT corpus
by using Robust corpora for the training of topic models.
      </p>
      <p>Adaptations to the Docker infrastructure were realizable with little
effort. We adjusted the workflow with regard to the given hooks.
The resulting runs exactly match those which were replicated in
the context of our CENTRE participation. Due to the encapsulation
into Docker images, less configuration effort is required and our
experimental environment can be exactly reconstructed. Required
components are taken from existing Docker images and Python
packages.</p>
      <p>
        Starting points for future improvements were elaborated in the
previous section. Investigations on reproducibility can be made
possible by integrating the Washington Post corpus into our workflow.
In this context, the support of other test collections might also be
interesting. Parallelizing classifications can reduce execution time
of the ranking. An archived version of our submitted Docker image
is available at Zenodo [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Has Adhoc Retrieval Improved Since 1994?. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09). ACM, New York, NY, USA, 692-693. https://doi.org/10.1145/1571941.1572081</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel. 2009. Improvements That Don't Add Up: Ad-hoc Retrieval Results Since 1998. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09). ACM, New York, NY, USA, 601-610. https://doi.org/10.1145/1645953.1646031</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Timo Breuer and Philipp Schaer. 2019. osirrc/irc-centre2019-docker: OSIRRC @ SIGIR 2019 Docker Image for IRC-CENTRE2019. https://doi.org/10.5281/zenodo.3245439</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Timo Breuer and Philipp Schaer. 2019. Replicability and Reproducibility of Automatic Routing Runs. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum (CEUR Workshop Proceedings). CEUR-WS.org. (accepted).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Matt Crane. 2018. Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results. TACL 6 (2018), 241-252. https://transacl.org/ojs/index.php/tacl/article/view/1299</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. 2019. CENTRE@CLEF 2019. In Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part II (Lecture Notes in Computer Science), Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra (Eds.), Vol. 11438. Springer, 283-290. https://doi.org/10.1007/978-3-030-15719-7_38</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Maura R. Grossman and Gordon V. Cormack. 2017. MRG_UWaterloo and WaterlooCormack Participation in the TREC 2017 Common Core Track. In Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017, Ellen M. Voorhees and Angela Ellis (Eds.), Vol. Special Publication 500-324. National Institute of Standards and Technology (NIST). https://trec.nist.gov/pubs/trec26/papers/MRG_UWaterloo-CC.pdf</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Frank Hopfgartner, Allan Hanbury, Henning Müller, Ivan Eggel, Krisztian Balog, Torben Brodt, Gordon V. Cormack, Jimmy Lin, Jayashree Kalpathy-Cramer, Noriko Kando, Makoto P. Kato, Anastasia Krithara, Tim Gollub, Martin Potthast, Evelyne Viegas, and Simon Mercer. 2018. Evaluation-as-a-Service for the Computational Sciences: Overview and Outlook. J. Data and Information Quality 10, 4, Article 15 (Oct. 2018), 32 pages. https://doi.org/10.1145/3239570</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 52, 2 (Jan. 2019), 40-51. https://doi.org/10.1145/3308774.3308781</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011.
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Peilin</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hui</given-names>
            <surname>Fang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Reproducibility Study of Information Retrieval Models</article-title>
          .
          <source>In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR '16)</source>
          . ACM, New York, NY, USA,
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          . https://doi.org/10.1145/2970398.2970415
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kuang</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peilin</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models</article-title>
          . CoRR abs/1904.09171 (
          <year>2019</year>
          ). arXiv:1904.09171 http://arxiv.org/abs/1904.09171
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ruifan</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yuhao</given-names>
            <surname>Xie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Simple Techniques for Cross-Collection Relevance Feedback</article-title>
          .
          <source>In Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part I</source>
          (Lecture Notes in Computer Science), Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra (Eds.), Vol.
          <volume>11437</volume>
          . Springer,
          <fpage>397</fpage>
          -
          <lpage>409</lpage>
          . https://doi.org/10.1007/978-3-030-15712-8_26
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>