<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Let's measure run time!</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <aff id="aff0"><institution>TU Wien</institution></aff>
        </contrib>
      </contrib-group>
      <fpage>12</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>Establishing a docker-based replicability infrastructure offers the community a great opportunity: measuring the run time of information retrieval systems. The time required to present query results to a user is paramount to the user's satisfaction. Recent advances in neural IR re-ranking models put the issue of query latency at the forefront. They bring a complex trade-off between performance and effectiveness, based on a myriad of factors: the choice of encoding model, network architecture, hardware acceleration, and many others. The best performing models (currently using the BERT transformer model) run orders of magnitude more slowly than simpler architectures. We aim to broaden the focus of the neural IR community to include performance considerations – to sustain the practical applicability of our innovations. In this position paper we supply our argument with a case study exploring the performance of different neural re-ranking models. Finally, we propose to extend the OSIRRC docker-based replicability infrastructure with two performance-focused benchmark scenarios.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The replicability and subsequent fair comparison of results in
Information Retrieval (IR) is a fundamentally important goal. Currently,
the main focus of the community is on the effectiveness results
of IR models. We argue that in the future the same infrastructure
supporting effectiveness replicability should be used to measure
performance. We use the term performance in this paper in the sense
of speed and run time – for the quality of retrieval results we use
effectiveness. In many cases, the time required to present query
results to a user is paramount to the user's satisfaction, although in
some tasks users might be willing to wait longer for better results
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Thus a discussion in the community about the existing trade-offs
between performance and effectiveness is needed.
      </p>
      <p>This is not a new insight and we don’t claim to re-invent the
wheel with this position paper; rather, we want to draw attention
to this issue as it becomes more prevalent in recent advances in
neural network methods for IR. Neural IR ranking models are
re-rankers that use the content text of a given query and a document
to assign a relevance score. Here, the choice of architecture and
encoding model offers large effectiveness gains, while at the same
time potentially impacting the speed of training and inference by
orders of magnitude.</p>
      <p>
        The recently released MS MARCO v2 re-ranking dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is
the first public test collection large enough to easily reproduce
neural IR models ("easy" in this context means: MS MARCO has enough
training samples to successfully train the neural IR models without the
need for a bag of tricks and details applied to the pre-processing and
training regime – tricks which are often not published in the accompanying
papers). The public nature of the dataset also makes it
the prime candidate for replicability efforts for neural IR models.
      </p>
      <p>
        Nogueira et al. [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] first showed the substantial effectiveness
gains for the MS MARCO passage re-ranking using BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a
large pre-trained transformer-based model. However, they note the
stark trade-off with respect to performance. MacAvaney et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
show that combining BERT's classification label with the output
of various neural models yields good results for
low-training-resource collections. They also show that this comes at a substantial
performance cost – BERT taking two orders of magnitude longer
than a simple word embedding model.
      </p>
      <p>
        On the one hand, the retrieval results achieved with BERT's
contextualized encoding are truly impressive; on the other hand, the
community should not lose focus of the practicality of its
solutions – search demands fast performance. Complementing
our argument we present a case study about the performance of
different neural IR models and embedding models (Section 2). We
show that using a FastText [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] encoding provides a small trade-off
between effectiveness and performance, whereas BERT shows a large
trade-off in both directions: BERT is more than 100 times slower
than the non-contextualized ranking models.
      </p>
      <p>
        The medical computer vision community has already recognized
the need for a focus on run time considerations. The medical image
analysis benchmark VISCERAL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] included run time
measurements of participant solutions on the same hardware. Additionally,
computer vision tasks, such as object detection and tracking, often
require real-time results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Here, iterations over neural network
architectures have been focusing on performance [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. The
object detection architectures commonly start with a pre-trained
feature extraction model. As Huang et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] show, this feature
extraction stage can easily be swapped to accommodate different
performance-effectiveness needs. We postulate that for neural IR
models the time has come to do the same. Neural IR models depend
on an encoding layer and recent works [
        <xref ref-type="bibr" rid="ref11 ref13 ref6">6, 11, 13</xref>
        ] show that the
neural IR community has at least four different encoding architectures
to choose from (basic word embedding, FastText, ELMo, BERT).
      </p>
      <p>The public comparison of results on leaderboards and
evaluation campaigns sparks interest and friendly competition among
researchers. However, they naturally incentivise a focus on the
effectiveness metrics used, while other important aspects of IR systems
– for example the latency of a response – are left aside. The
introduction of docker-based submissions of complete retrieval systems
makes the comparison of run time metrics feasible: All systems can
be compared under the same hardware conditions by a third party.</p>
      <p>Concretely, we propose to extend the docker-based replicability
infrastructure for two additional use cases (Section 3):
(1) Dynamic full system benchmark</p>
      <p>We measure the query latency and throughput over a longer
realistic period of a full search engine (possibly including
a neural re-ranking component). We envision a scripted
"interactive" mode, where the search engine returns results
for a single query at a time, giving the benchmark a lot of
ifdelity in reporting performance statistics.
(2) Static re-ranking benchmark</p>
      <p>We measure the (neural) re-ranking components in isolation,
providing them with the re-ranking candidate list. This
allows for direct comparability of models as all external factors
are fixed. This static scenario is very close to the way neural
IR re-ranking models are evaluated today, with added timing
metrics.</p>
      <p>A standardized performance evaluation helps the research
community and software engineers building on the research to better
understand the trade-offs of different models and the performance
requirements that each of them has. It is our understanding that
the replicability efforts of our community are not only for good
science; they are also geared towards the usability of our
innovations in systems that people use. We argue that performance is
a major contributor to this goal and therefore worthwhile to study
as part of a broader replicability and reproducibility push.</p>
    </sec>
    <sec id="sec-2">
      <title>NEURAL IR MODEL PERFORMANCE</title>
      <p>In the following case study we take a closer look at the training and
inference time as well as the GPU memory requirements of different
neural IR models. Additionally, we compare the time required to
re-rank a query with the model's effectiveness.</p>
    </sec>
    <sec id="sec-3">
      <title>Neural IR Models</title>
      <p>
        We conduct our experiments on five neural IR models using a basic
GloVe [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] word embedding and FastText [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Additionally, we
evaluate a BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] based ranking model. We use the MS MARCO [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
passage ranking collection to train and evaluate the models. All
models are trained end-to-end and the word representations are
fine-tuned. Now, we give a brief overview of the models used with
a focus on performance-sensitive components:
      </p>
      <p>
        KNRM [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] applies a differentiable soft-histogram (Gaussian
kernel functions) on top of the similarity matching matrix of query
and document tokens – summing the interactions by their similarity.
The model then learns to weight the different soft-histogram bins.
      </p>
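      <p>To make the kernel-pooling idea concrete, the following minimal PyTorch sketch (our own illustration with assumed shapes and kernel parameters, not the reference implementation) computes a cosine match matrix and applies Gaussian kernels:</p>
      <preformat>
import torch

def knrm_kernel_features(query_emb, doc_emb, mus, sigma=0.1):
    # query_emb: [q_len, dim], doc_emb: [d_len, dim]; mus are the kernel centers
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    d = torch.nn.functional.normalize(doc_emb, dim=-1)
    match = q @ d.t()                                      # cosine match matrix [q_len, d_len]
    # Gaussian kernels act as a differentiable soft-histogram over similarities
    k = torch.exp(-0.5 * ((match.unsqueeze(-1) - mus) / sigma) ** 2)
    per_query = torch.log(k.sum(dim=1).clamp(min=1e-10))   # sum over document terms
    return per_query.sum(dim=0)                            # sum over query terms

features = knrm_kernel_features(torch.randn(5, 300), torch.randn(40, 300),
                                mus=torch.linspace(-0.9, 1.0, 11))
score = torch.nn.Linear(11, 1)(features)                   # learned weighting of the bins
      </preformat>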
      <p>
        CONV-KNRM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] extends KNRM by adding a Convolutional
Neural Network (CNN) layer on top of the word embeddings,
enabling word-level n-gram representation learning. CONV-KNRM
cross-matches the n-grams and scores n² similarity matrices in total.
      </p>
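      <p>A sketch of the cross-match step (illustrative dimensions and an assumed shared CNN per n-gram size; not the published implementation):</p>
      <preformat>
import torch

dim, q_len, d_len = 300, 5, 40
query = torch.randn(1, dim, q_len)   # Conv1d expects [batch, channels, length]
doc = torch.randn(1, dim, d_len)

convs = [torch.nn.Conv1d(dim, 128, kernel_size=n, padding=n - 1) for n in (1, 2, 3)]

match_matrices = []
for conv_q in convs:                 # query n-gram encoder
    for conv_d in convs:             # document n-gram encoder: 3 x 3 = n^2 matrices
        q_ngrams = torch.nn.functional.normalize(conv_q(query), dim=1)
        d_ngrams = torch.nn.functional.normalize(conv_d(doc), dim=1)
        match_matrices.append(torch.einsum("bcq,bcd->bqd", q_ngrams, d_ngrams))
# each cosine match matrix is then kernel-pooled as in KNRM and all
# resulting features are concatenated before the final scoring layer
      </preformat>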
      <p>
        MatchPyramid [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a ranking model inspired by deep neural
image processing architectures. The model first computes the
similarity matching matrix, which is then processed by several stacked
CNN layers with dynamic max-pooling to ensure a fixed-size output.
      </p>
      <p>
        PACRR [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] applies different-sized CNN layers on the match
matrix, followed by max-pooling of the strongest signals. In contrast
to MatchPyramid, the CNNs are only single-layered, focusing on
different n-gram sizes, and single word-to-word interactions are
modeled without a CNN.
      </p>
      <p>
        DUET [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a hybrid model applying CNNs to local
interactions and single vector representation matching of the query and
document. The two paths are combined at the end of the model to
form the relevance score. Note: We employed v2 of the model. We
changed the local interaction input to a cosine match matrix – in line
with the other models – in contrast to the exact matching in the
published DUET model. We were not able to reproduce the original exact
match results; however, the cosine match matrix shows significantly
better results than in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        BERT[CLS] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] differs strongly from the previously described
models. It is a multi-purpose transformer-based NLP model. We
follow the approach from Nogueira et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and first concatenate
the query and document sequences with the [SEP] indicator. Then,
we apply a single linear layer on top of the first [CLS] token to
produce the relevance score.
      </p>
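      <p>A minimal sketch of this scoring approach, assuming the pytorch-pretrained-BERT package used in our experiment setup (the input handling and the scoring head are our illustration, not the exact training code):</p>
      <preformat>
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()
score_head = torch.nn.Linear(768, 1)      # 768 = hidden size of bert-base

q_tok = tokenizer.tokenize("what is kernel pooling")
d_tok = tokenizer.tokenize("kernel pooling aggregates similarity scores ...")
tokens = ["[CLS]"] + q_tok + ["[SEP]"] + d_tok + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
segments = torch.tensor([[0] * (len(q_tok) + 2) + [1] * (len(d_tok) + 1)])

layers, _ = bert(input_ids, token_type_ids=segments)
relevance = score_head(layers[-1][:, 0])  # linear layer on the [CLS] vector
      </preformat>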
    </sec>
    <sec id="sec-4">
      <title>Experiment Setup</title>
      <p>
        In our experiment setup, we largely follow Hofstätter et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
We use PyTorch [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and AllenNLP [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for the neural models and
Anserini [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] to obtain the initial BM25 rankings. The BM25
baseline reaches 0.192 MRR@10 – as all neural models are significantly
better, we omit it in the rest of the paper. We use the Adam
optimizer and pairwise margin ranking loss with a learning rate of
1e-3 for all non-BERT models; for BERT we use a rate of 3e-6 and
the "bert-base-uncased" pre-trained model2. We train the models
with a batch size of 64; for evaluation we use a batch size of 256.
We keep the defaults for the model configurations from their
respective papers, except for MatchPyramid where we follow the
5-layer configuration from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For the basic word embedding we
use a vocabulary with a minimum collection occurrence of 5. The
nature of the passage collection means we operate on fairly short
text sequences: We clip the passages at 200 and the queries at 20
tokens – this only removes a modest number of outliers.
      </p>
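      <p>For illustration, a minimal sketch of one pairwise training step as described above (the placeholder model and the margin value are assumptions):</p>
      <preformat>
import torch

# Placeholder scoring model; the margin value is an assumption.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MarginRankingLoss(margin=1.0)

pos = torch.randn(64, 10)                     # inputs for the relevant documents
neg = torch.randn(64, 10)                     # inputs for the non-relevant documents
score_pos = model(pos)                        # two forward passes per triple ...
score_neg = model(neg)
target = torch.ones_like(score_pos)           # "score_pos should rank higher"
loss = loss_fn(score_pos, score_neg, target)
loss.backward()                               # ... and a single backward pass
optimizer.step()
optimizer.zero_grad()
      </preformat>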
      <p>
        In their work, Hofstätter et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] evaluate the effectiveness of
the models along the re-ranking depth (i.e. how many documents
are re-ranked by the neural model) – they show that a shallow
re-ranking depth already saturates the effectiveness for most queries. This insight can
be employed to tune the performance of re-ranking systems further
in the future. In our case study, we keep it simple by reporting the
best validation MRR@10 (Mean Reciprocal Rank) results per model.
      </p>
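      <p>For readers unfamiliar with the metric, a minimal sketch of MRR@10 (our illustration; the run and qrels structures are assumptions):</p>
      <preformat>
def mrr_at_10(ranked_doc_ids, relevant_ids):
    # reciprocal rank of the first relevant document in the top 10, else 0
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# averaged over all queries of the validation set, e.g.:
# mrr = sum(mrr_at_10(run[q], qrels[q]) for q in qrels) / len(qrels)
      </preformat>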
      <p>We present average timings per batch assuming a batch contains
a single query with 256 re-ranking documents. We report timings
from already cached batches – excluding the pre-processing and
therefore reducing the considerable negative performance impact
of Python as much as possible. We use a benchmark server with
NVIDIA GTX 1080 TI (11GB memory) GPUs and Intel Xeon E5-2667
@ 3.20GHz CPUs. Each model is run on a single GPU.</p>
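      <p>A sketch of how such per-batch GPU timings can be collected (the placeholder model and batches are assumptions; the important detail is synchronizing around the measured region, since CUDA executes asynchronously):</p>
      <preformat>
import torch

# Placeholder model and pre-processed (cached) batches.
model = torch.nn.Linear(300, 1).cuda()
cached_batches = [torch.randn(256, 300).cuda() for _ in range(10)]

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

timings = []
for batch in cached_batches:
    start.record()
    with torch.no_grad():
        scores = model(batch)                 # model computation only
    end.record()
    torch.cuda.synchronize()                  # wait until the GPU is done
    timings.append(start.elapsed_time(end))   # milliseconds
print(sum(timings) / len(timings), "ms per batch")
      </preformat>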
      <p>We caution that the measurements do not reflect production-ready
implementations – we directly measured PyTorch research
models, and we strongly believe that overall the performance can
be improved further by employing more inference-optimized
runtimes (such as the ONNX runtime, https://github.com/microsoft/onnxruntime)
and performance-optimized support code (for example non-Python code
feeding data into the neural network). We would like to kick-start
innovation in this direction with our paper.</p>
      <p>We start our observations with the training of the models as shown
in Table 1. The main performance metric is the throughput of triples
per second. A triple is a single training sample with a query and one
relevant and one non-relevant document. The models are trained
with a pairwise ranking loss, which requires two forward and a
single backward pass per triple. The batch count is the number of
batches until the best validation result is reached. KNRM is the fastest
to train; it also saturates first. MatchPyramid and PACRR exhibit similar
performance. This is due to their similar architecture components
(CNNs applied on a 2D match matrix). In the class of CNNs applied
to higher-dimensional word representations, DUET is slightly faster
than CONV-KNRM, although CONV-KNRM is slightly more effective.
In general, FastText vectors improve all models with a modest
performance decrease. The peak GPU memory required in
training largely depends on the encoding layer (we report the peak
memory usage provided by PyTorch; however, we observed that one
requires additional GPU memory – FastText and BERT are not trainable
on 8 GB. We believe this is due to memory fragmentation; the size of the
required headroom remains an open question for future work). Fine-tuning
the BERT model is much slower than all other models. It is also more
challenging to fit on a GPU with limited available memory; we
employed gradient accumulation to update the weights every 64
samples. We did not observe big performance differences between
batch sizes.</p>
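      <p>A minimal sketch of the gradient accumulation described above (placeholder model and loss; the chunk size of 8 is an illustrative assumption – only the effective batch of 64 samples is from our setup):</p>
      <preformat>
import torch

model = torch.nn.Linear(768, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-6)
accum_steps = 8                                    # 8 chunks x 8 samples = 64 samples

chunks = [torch.randn(8, 768).cuda() for _ in range(32)]
for step, chunk in enumerate(chunks, start=1):
    loss = model(chunk).mean()                     # placeholder loss
    (loss / accum_steps).backward()                # gradients accumulate across chunks
    if step % accum_steps == 0:
        optimizer.step()                           # update weights every 64 samples
        optimizer.zero_grad()
      </preformat>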
      <p>Now we focus on the practically more important aspect: the
re-ranking performance of the neural IR models. In Table 2 we report
the time that the neural IR models spend to score the provided
query-document pairs. The reported time only includes the model
computation. This corresponds to benchmark scenario #2 (Section
3.2).</p>
      <p>The main observation from the re-ranking performance data
in Table 2 is the striking difference between BERT and non-BERT
models. Both the word vector and FastText encodings have a low
memory footprint, and their level of performance makes them suitable
for real-time re-ranking tasks. There are slight trade-offs between
the models, as depicted in Figure 1. The differences correspond to
the training speed discussed above. However, compared to BERT's
performance, those differences become marginal. BERT offers
impressive effectiveness gains at a substantial performance cost. We
only evaluate a single BERT model; however, the performance
characteristics should apply to all BERT-based models.</p>
      <p>
        We believe that the practical applicability of BERT-based
re-ranking models is currently limited to offline scoring or domains
where users are willing to accept multi-second delays in their
search workflow. Future work will likely focus on the gap between
the contextualized and non-contextualized models – both in terms
of performance and effectiveness. Another path is to speed up BERT
and other transformer-based models, for example with pruning [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
Therefore, we argue that it is necessary to provide the replicability
infrastructure with tools to take both the performance and effectiveness
dimensions into account.
      </p>
      <sec id="sec-4-1">
        <title>Query</title>
      </sec>
      <sec id="sec-4-2">
        <title>Inverted</title>
      </sec>
      <sec id="sec-4-3">
        <title>Index</title>
        <p>❶ Full system benchmark
❷ Re-ranking benchmark</p>
      </sec>
      <sec id="sec-4-4">
        <title>All matched document statistics</title>
      </sec>
      <sec id="sec-4-5">
        <title>First stage ranker BM25</title>
      </sec>
      <sec id="sec-4-6">
        <title>Query-document text pairs</title>
        <p>❷</p>
      </sec>
      <sec id="sec-4-7">
        <title>Ranked</title>
        <p>documents
(Top 1000)</p>
      </sec>
      <sec id="sec-4-8">
        <title>Full text storage</title>
      </sec>
      <sec id="sec-4-9">
        <title>Neural IR</title>
      </sec>
      <sec id="sec-4-10">
        <title>Model</title>
      </sec>
      <sec id="sec-4-11">
        <title>Second stage re-ranker</title>
      </sec>
      <sec id="sec-4-12">
        <title>Re-Ranked</title>
        <p>documents
(Top 10)
Following the observations from the case study above, we propose
to systematically measure and report performance metrics as part
of all replicability campaigns. Concretely, we propose to extend
the OSIRRC docker-based replicability infrastructure for two
additional use cases. The diferent measured components are depicted
in Figure 2.
3.1</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Full System Benchmark</title>
      <p>Currently, most IR evaluation is conducted in batched processes –
working through a set of queries at a time, as we are mostly
interested in the effectiveness results. The OSIRRC specifications also
contain an optional timing feature for batched retrieval (see
https://github.com/osirrc/jig). While we
see this as a good first step, we envision a more performance-focused
benchmark: A scripted "interactive" mode, where the system
answers one query at a time. Here, the benchmark decides the load
and is able to measure fine-grained latency and throughput.</p>
      <p>The scripted "interactive" mode should add as little overhead as
possible, for example a lightweight HTTP endpoint receiving the query string
and returning TREC-formatted results. The execution of the
benchmark needs to be conducted on the same hardware and repeated
multiple times to reduce noise.</p>
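      <p>A sketch of such a benchmark client (the endpoint URL, query file and response handling are assumptions – no such OSIRRC interface exists yet):</p>
      <preformat>
import time
import statistics

import requests

# Hypothetical endpoint and query file.
ENDPOINT = "http://localhost:8080/search"

latencies = []
with open("queries.tsv") as queries:
    for query in queries:
        t0 = time.perf_counter()
        response = requests.get(ENDPOINT, params={"q": query.strip()})
        latencies.append(time.perf_counter() - t0)  # TREC-formatted results in response.text

latencies.sort()
print("median latency:", statistics.median(latencies))
print("95th percentile:", latencies[int(0.95 * len(latencies))])
print("throughput (sequential):", len(latencies) / sum(latencies), "queries/s")
      </preformat>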
      <p>
        Although we present a neural IR model case study, we do not
limit this benchmark scenario to neural models – rather, we see it as an
opportunity to cover the full range of retrieval methods. For example,
we are able to incorporate recall-boosting measures in the first
stage retrieval, such as BERT-based document expansion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or
query expansion with IR-specific word embeddings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Measuring the latency and throughput of a full search engine
(with a neural IR re-ranking component) over a longer realistic period
touches many previously undeveloped components: storing the neural
model input values of indexed documents, generating batches on the
fly, or handling concurrency. If a neural IR model is to be deployed
in production with GPU acceleration, the issue of concurrent
processing becomes important: We observed that slower models
also have a higher GPU utilization, potentially creating a traffic jam
on the GPU, which in turn would increase the needed infrastructure
cost for the same number of users.</p>
    </sec>
    <sec id="sec-5b">
      <title>Re-Ranking Benchmark</title>
      <p>The neural IR field is receiving considerable attention and has a
growing community. In our opinion, the community is in need of a
more structured evaluation – both for performance and
effectiveness. We now propose a benchmark which aims to deliver on both
dimensions.</p>
      <p>The re-ranking benchmark focuses on the innermost component
of neural IR models: the scoring of query-document tuples. We
provide the re-ranking candidate list and the neural IR model scores
the tuples. Many of the existing neural IR models follow this pattern
and can therefore easily be swapped and compared with each other
– also on public leaderboards, such as the MS MARCO leaderboard.
This static scenario provides a coherent way of evaluating neural
IR re-ranking models. It helps to mitigate differences in the setup
of various research groups.</p>
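      <p>A minimal sketch of such a static timing harness (the scoring interface is a hypothetical placeholder, not an existing OSIRRC specification):</p>
      <preformat>
import time

def benchmark_reranker(score_fn, candidates):
    """Time only the model computation for each query's candidate list.

    score_fn: scores a list of (query, document) text pairs - a hypothetical
    interface standing in for any neural re-ranker; candidates maps each
    query to its fixed re-ranking candidate documents.
    """
    per_query_ms = {}
    for query, docs in candidates.items():
        pairs = [(query, doc) for doc in docs]
        t0 = time.perf_counter()
        score_fn(pairs)
        per_query_ms[query] = 1000 * (time.perf_counter() - t0)
    return per_query_ms
      </preformat>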
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>The OSIRRC docker-based IR replicability infrastructure presents
an opportunity to incorporate performance benchmarks. As an
example of the need for a broader view in the community, we show
in a case study the trade-off between performance and effectiveness
of neural IR models, especially for recent BERT-based models. As a
result, we propose two different performance-focused benchmarks
to be incorporated in the infrastructure going forward. We look
forward to working with the community on these issues.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Payal</given-names>
            <surname>Bajaj</surname>
          </string-name>
          , Daniel Campos, Nick Craswell, Li Deng,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          , Xiaodong Liu, Rangan Majumder,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Mcnamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bhaskar</given-names>
            <surname>Mitra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tri</given-names>
            <surname>Nguyen</surname>
          </string-name>
          .
          <year>2016</year>
          . MS MARCO :
          <article-title>A Human Generated MAchine Reading COmprehension Dataset</article-title>
          .
          <source>In Proc. of NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>Tr. of the ACL 5</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Zhuyun</given-names>
            <surname>Dai</surname>
          </string-name>
          , Chenyan Xiong,
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Callan</surname>
          </string-name>
          , and Zhiyuan Liu.
          <year>2018</year>
          .
          <article-title>Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search</article-title>
          .
          <source>In Proc. of WSDM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Matt</given-names>
            <surname>Gardner</surname>
          </string-name>
          , Joel Grus, Mark Neumann,
          <string-name>
            <given-names>Oyvind</given-names>
            <surname>Tafjord</surname>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>AllenNLP: A Deep Semantic Natural Language Processing Platform</article-title>
          . arXiv:arXiv:
          <year>1803</year>
          .07640
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          , Navid Rekabsaz, Carsten Eickhof, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>On the Efect of Low-Frequency Terms on Neural-IR Models</article-title>
          .
          <source>In Proc. of SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          , Navid Rekabsaz, Mihai Lupu, Carsten Eickhof, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Enriching Word Embeddings for Patent Retrieval with Global Context</article-title>
          .
          <source>In Proc. of ECIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Huang</surname>
          </string-name>
          , Vivek Rathod, Chen Sun,
          <string-name>
            <given-names>Menglong</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>Speed/accuracy trade-ofs for modern convolutional object detectors</article-title>
          .
          <source>In Proc. of the IEEE-CVPR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Kai</given-names>
            <surname>Hui</surname>
          </string-name>
          , Andrew Yates, Klaus Berberich, and Gerard de Melo.
          <year>2017</year>
          .
          <article-title>PACRR: A Position-Aware Neural IR Model for Relevance Matching</article-title>
          .
          <source>In Proc. of EMNLP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Jimenez-del Toro</surname>
          </string-name>
          , Henning Müller,
          <string-name>
            <given-names>Markus</given-names>
            <surname>Krenn</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Cloudbased evaluation of anatomical structure segmentation and landmark detection algorithms: VISCERAL anatomy benchmarks</article-title>
          .
          <source>IEEE trans. on Med</source>
          .
          <source>Imaging</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Sean</surname>
            <given-names>MacAvaney</given-names>
          </string-name>
          , Andrew Yates, Arman Cohan, and
          <string-name>
            <given-names>Nazli</given-names>
            <surname>Goharian</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>CEDR: Contextualized Embeddings for Document Ranking</article-title>
          .
          <source>In SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Bhaskar</given-names>
            <surname>Mitra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nick</given-names>
            <surname>Craswell</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>An Updated Duet Model for Passage Re-ranking</article-title>
          . arXiv preprint arXiv:
          <year>1903</year>
          .
          <volume>07666</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Rodrigo</given-names>
            <surname>Nogueira</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Passage Re-ranking with BERT</article-title>
          . arXiv preprint arXiv:
          <year>1901</year>
          .
          <volume>04085</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Rodrigo</surname>
            <given-names>Nogueira</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            <given-names>Yang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimmy Lin</surname>
            , and
            <given-names>Kyunghyun</given-names>
          </string-name>
          <string-name>
            <surname>Cho</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Document Expansion by Query Prediction</article-title>
          . arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>08375</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Liang</surname>
            <given-names>Pang</given-names>
          </string-name>
          , Yanyan Lan, Jiafeng Guo, Jun Xu,
          <string-name>
            <given-names>Shengxian</given-names>
            <surname>Wan</surname>
          </string-name>
          , and Xueqi Cheng.
          <year>2016</year>
          .
          <article-title>Text Matching as Image Recognition</article-title>
          .
          <source>In Proc of. AAAI.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Adam</surname>
            <given-names>Paszke</given-names>
          </string-name>
          , Sam Gross, Soumith Chintala,
          <string-name>
            <given-names>Gregory</given-names>
            <surname>Chanan</surname>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>Automatic diferentiation in PyTorch</article-title>
          . In NIPS-W.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Jefrey</surname>
            <given-names>Pennington</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proc of EMNLP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Joseph</surname>
            <given-names>Redmon</given-names>
          </string-name>
          , Santosh Divvala,
          <string-name>
            <surname>Ross Girshick</surname>
            , and
            <given-names>Ali</given-names>
          </string-name>
          <string-name>
            <surname>Farhadi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>You only look once: Unified, real-time object detection</article-title>
          .
          <source>In Proc. of the IEEE-CVPR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Shaoqing</surname>
            <given-names>Ren</given-names>
          </string-name>
          , Kaiming He,
          <string-name>
            <surname>Ross Girshick</surname>
            , and
            <given-names>Jian</given-names>
          </string-name>
          <string-name>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Faster r-cnn: Towards real-time object detection with region proposal networks</article-title>
          .
          <source>In Proc. of NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Jaime</surname>
            <given-names>Teevan</given-names>
          </string-name>
          , Kevyn Collins-Thompson,
          <string-name>
            <surname>Ryen W White</surname>
          </string-name>
          , Susan T Dumais, and
          <string-name>
            <given-names>Yubin</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Slow search: Information retrieval without time constraints</article-title>
          .
          <source>In Proc. of the Symposium on HCI and IR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Elena</surname>
            <given-names>Voita</given-names>
          </string-name>
          , David Talbot,
          <string-name>
            <given-names>Fedor</given-names>
            <surname>Moiseev</surname>
          </string-name>
          , Rico Sennrich, and
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Titov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned</article-title>
          . arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>09418</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Chenyan</surname>
            <given-names>Xiong</given-names>
          </string-name>
          , Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and
          <string-name>
            <given-names>Russell</given-names>
            <surname>Power</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>End-to-End Neural Ad-hoc Ranking with Kernel Pooling</article-title>
          .
          <source>In Proc. of SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Peilin</surname>
            <given-names>Yang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Hui</given-names>
            <surname>Fang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Anserini: Enabling the use of Lucene for information retrieval research</article-title>
          .
          <source>In Proc. of SIGIR.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>