Let’s measure run time! Extending the IR replicability infrastructure to include performance aspects

Sebastian Hofstätter (TU Wien, s.hofstaetter@tuwien.ac.at)
Allan Hanbury (TU Wien, hanbury@ifs.tuwien.ac.at)

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

ABSTRACT

Establishing a docker-based replicability infrastructure offers the community a great opportunity: measuring the run time of information retrieval systems. The time required to present query results to a user is paramount to the user’s satisfaction. Recent advances in neural IR re-ranking models put the issue of query latency at the forefront. They bring a complex trade-off between performance and effectiveness based on a myriad of factors: the choice of encoding model, network architecture, hardware acceleration and many others. The best performing models (currently using the BERT transformer model) run orders of magnitude more slowly than simpler architectures. We aim to broaden the focus of the neural IR community to include performance considerations – to sustain the practical applicability of our innovations. In this position paper we supply our argument with a case study exploring the performance of different neural re-ranking models. Finally, we propose to extend the OSIRRC docker-based replicability infrastructure with two performance-focused benchmark scenarios.

1 INTRODUCTION

The replicability and subsequent fair comparison of results in Information Retrieval (IR) is a fundamentally important goal. Currently, the main focus of the community is on the effectiveness results of IR models. We argue that in the future the same infrastructure supporting effectiveness replicability should be used to measure performance. We use the term performance in this paper in the sense of speed and run time – for the quality of retrieval results we use the term effectiveness. In many cases, the time required to present query results to a user is paramount to the user’s satisfaction, although in some tasks users might be willing to wait longer for better results [20]. Thus a discussion in the community about existing trade-offs between performance and effectiveness is needed.

This is not a new insight and we don’t claim to re-invent the wheel with this position paper; rather we want to draw attention to this issue as it becomes more prevalent in recent advances in neural network methods for IR. Neural IR ranking models are re-rankers using the content text of a given query and a document to assign a relevance score. Here, the choice of architecture and encoding model offers large effectiveness gains, while at the same time potentially impacting the speed of training and inference by orders of magnitude.

The recently released MS MARCO v2 re-ranking dataset [1] is the first public test collection large enough to easily reproduce neural IR models.¹ The public nature of the dataset also makes it the prime candidate for replicability efforts for neural IR models. Nogueira et al. [13, 14] first showed the substantial effectiveness gains for MS MARCO passage re-ranking using BERT [4], a large pre-trained transformer based model. However, they note the stark trade-off with respect to performance. MacAvaney et al. [11] show that combining BERT’s classification label with the output of various neural models exhibits good results for low-training-resource collections. They also show that this comes at a substantial performance cost – BERT taking two orders of magnitude longer than a simple word embedding.

¹ Easy in this context means: MS MARCO has enough training samples to successfully train the neural IR models without the need for a bag of tricks & details applied to the pre-processing & training regime – which are often not published in the accompanying papers.

On the one hand, the retrieval results achieved with BERT’s contextualized encoding are truly impressive; on the other hand, the community should not lose sight of the practicality of its solutions – search requires fast responses. Complementing our argument, we present a case study about the performance of different neural IR models and embedding models (Section 2). We show that using a FastText [2] encoding provides a small trade-off between effectiveness and performance, whereas BERT shows a big trade-off in both directions. BERT is more than 100 times slower than non-contextualized ranking models.

The medical computer vision community has already recognized the need for a focus on run time considerations. The medical image analysis benchmark VISCERAL [10] included run time measurements of participant solutions on the same hardware. Additionally, computer vision tasks, such as object detection and tracking, often require realtime results [8]. Here, iterations over neural network architectures have been focusing on performance [18, 19]. Object detection architectures commonly start with a pre-trained feature extraction model. As Huang et al. [8] show, this feature extraction stage can easily be swapped to accommodate different performance-effectiveness needs. We postulate that for neural IR models the time has come to do the same. Neural IR models depend on an encoding layer, and recent works [6, 11, 13] show that the neural IR community has at least 4 different encoding architectures to choose from (basic word embedding, FastText, ELMo, BERT).
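To make the analogy concrete, the sketch below shows one way a matching model can be written against a small, swappable encoder interface, so that a word-embedding, FastText-style or contextualized encoder can be exchanged without touching the matching architecture. This is purely illustrative: the class names, the interface and the toy scoring function are our assumptions, not code from any of the cited systems.

```python
# Illustrative sketch (not from the paper): a re-ranker written against a
# minimal encoder interface, so the encoding layer can be swapped out.
import torch
import torch.nn as nn


class TokenEncoder(nn.Module):
    """Interface assumption: maps token ids (batch, seq) to (batch, seq, dim)."""
    output_dim: int


class WordEmbeddingEncoder(TokenEncoder):
    def __init__(self, vocab_size: int, dim: int = 300):
        super().__init__()
        self.output_dim = dim
        self.embedding = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, token_ids):
        return self.embedding(token_ids)


class CosineMatchRanker(nn.Module):
    """Toy matching model that works with any TokenEncoder implementation."""
    def __init__(self, encoder: TokenEncoder):
        super().__init__()
        self.encoder = encoder
        self.score = nn.Linear(1, 1)

    def forward(self, query_ids, doc_ids):
        q = nn.functional.normalize(self.encoder(query_ids), dim=-1)
        d = nn.functional.normalize(self.encoder(doc_ids), dim=-1)
        match = torch.bmm(q, d.transpose(1, 2))         # (batch, q_len, d_len) cosine similarities
        pooled = match.max(dim=-1).values.mean(dim=-1)  # best match per query term, averaged
        return self.score(pooled.unsqueeze(-1)).squeeze(-1)


# Swapping in a FastText-style or contextualized encoder leaves the matching
# architecture untouched:
ranker = CosineMatchRanker(WordEmbeddingEncoder(vocab_size=50_000))
scores = ranker(torch.randint(1, 50_000, (2, 20)), torch.randint(1, 50_000, (2, 200)))
```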
The public comparison of results on leaderboards and evaluation campaigns sparks interest and friendly competition among researchers. However, they naturally incentivise a focus on the effectiveness metrics used, and other important aspects of IR systems – for example the latency of a response – are left aside. The introduction of docker-based submissions of complete retrieval systems makes the comparison of run time metrics feasible: all systems can be compared under the same hardware conditions by a third party.

Concretely, we propose to extend the docker-based replicability infrastructure for two additional use cases (Section 3):

(1) Dynamic full system benchmark. We measure the query latency and throughput over a longer realistic period of a full search engine (possibly including a neural re-ranking component). We envision a scripted "interactive" mode, where the search engine returns results for a single query at a time, giving the benchmark a lot of fidelity in reporting performance statistics.

(2) Static re-ranking benchmark. We measure the (neural) re-ranking components in isolation, providing them with the re-ranking candidate list. This allows for direct comparability of models as all external factors are fixed. This static scenario is very close to the way neural IR re-ranking models are evaluated today, with added timing metrics.

A standardized performance evaluation helps the research community and software engineers building on the research to better understand the trade-offs of different models and the performance requirements that each of them has. It is our understanding that the replicability efforts of our community are not only for good science; they are also geared towards the usability of our innovations in systems that people use. We argue that performance is a major contributor to this goal and therefore worthwhile to study as part of a broader replicability and reproducibility push.

2 NEURAL IR MODEL PERFORMANCE

In the following case study we take a closer look at the training and inference time as well as the GPU memory requirements of different neural IR models. Additionally, we compare the time required to re-rank a query with the model’s effectiveness.

2.1 Neural IR Models

We conduct our experiments on five neural IR models using a basic GloVe [17] word embedding and FastText [2]. Additionally, we evaluate a BERT [4] based ranking model. We use the MS MARCO [1] passage ranking collection to train and evaluate the models. All models are trained end-to-end and the word representations are fine-tuned. We now give a brief overview of the models used, with a focus on performance sensitive components:

KNRM [22] applies a differentiable soft-histogram (Gaussian kernel functions) on top of the similarity matching matrix of query and document tokens – summing the interactions by their similarity. The model then learns to weight the different soft-histogram bins.

CONV-KNRM [3] extends KNRM by adding a Convolutional Neural Network (CNN) layer on top of the word embeddings, enabling word-level n-gram representation learning. CONV-KNRM cross-matches n-grams and scores n² similarity matrices in total.

MatchPyramid [15] is a ranking model inspired by deep neural image processing architectures. The model first computes the similarity matching matrix, which is then passed through several stacked CNN layers with dynamic max-pooling to ensure a fixed size output.

PACRR [9] applies differently sized CNN layers on the match matrix, followed by a max pooling of the strongest signals. In contrast to MatchPyramid, the CNNs are only single layered, focusing on different n-gram sizes, and single word-to-word interactions are modeled without a CNN.

DUET [12] is a hybrid model applying CNNs to local interactions and single vector representation matching of the query and document. The two paths are combined at the end of the model to form the relevance score. Note: We employed v2 of the model. We changed the local interaction input to a cosine match matrix – in line with the other models – in contrast to the exact matching in the published DUET model. We were not able to reproduce the original exact match results, however the cosine match matrix shows significantly better results than in [12].
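Before turning to BERT, the Gaussian kernel pooling shared by KNRM and CONV-KNRM can be sketched in a few lines. This is a minimal illustration of the mechanism described above; the kernel centres, width and the log1p soft-count are assumed defaults, not the exact configuration used in our experiments.

```python
# Illustrative sketch of KNRM-style Gaussian kernel pooling over a cosine
# match matrix (not the original implementation; hyper-parameters assumed).
import torch
import torch.nn as nn


class KernelPooling(nn.Module):
    def __init__(self, n_kernels: int = 11, sigma: float = 0.1):
        super().__init__()
        # Kernel centres spread over the cosine similarity range [-1, 1].
        self.register_buffer("mu", torch.linspace(-1.0, 1.0, n_kernels).view(1, 1, 1, n_kernels))
        self.sigma = sigma
        self.scorer = nn.Linear(n_kernels, 1, bias=False)  # learned weighting of the soft-histogram bins

    def forward(self, query_emb, doc_emb):
        q = nn.functional.normalize(query_emb, dim=-1)
        d = nn.functional.normalize(doc_emb, dim=-1)
        match = torch.bmm(q, d.transpose(1, 2)).unsqueeze(-1)         # (B, Q, D, 1) cosine matrix
        kernels = torch.exp(-0.5 * ((match - self.mu) / self.sigma) ** 2)
        per_query_term = kernels.sum(dim=2)                           # soft-count over document terms
        features = torch.log1p(per_query_term).sum(dim=1)             # (B, n_kernels)
        return self.scorer(features).squeeze(-1)                      # one relevance score per pair


pooling = KernelPooling()
scores = pooling(torch.randn(4, 20, 300), torch.randn(4, 200, 300))   # 4 query-document pairs
```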
BERT[CLS] [4] differs strongly from the previously described models. It is a multi-purpose, transformer based NLP model. We follow the approach of Nogueira et al. [13] and first concatenate the query and document sequences with the [SEP] indicator. Then, we apply a single linear layer on top of the first [CLS] token to produce the relevance score.
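A minimal sketch of this [CLS]-based scoring, written against the current Hugging Face transformers API, is shown below (our experiments use the older pytorch-pretrained-BERT package; the model and tokenizer names are the ones cited, everything else is illustrative scaffolding).

```python
# Minimal sketch of a BERT [CLS] re-ranker in the spirit of Nogueira & Cho [13]
# (our illustration with the Hugging Face transformers API, not the authors' code).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class BertClsRanker(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.score = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, **encoded):
        # The tokenizer already builds [CLS] query [SEP] passage [SEP] pairs.
        cls_vector = self.bert(**encoded).last_hidden_state[:, 0]
        return self.score(cls_vector).squeeze(-1)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ranker = BertClsRanker()
batch = tokenizer(
    ["what is information retrieval"] * 2,                              # queries
    ["IR is the activity of obtaining ...", "A recipe for pasta ..."],  # candidate passages
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    scores = ranker(**batch)   # one relevance score per query-passage pair
```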
2.2 Experiment Setup

In our experiment setup, we largely follow Hofstätter et al. [6]. We use PyTorch [16] and AllenNLP [5] for the neural models and Anserini [23] to obtain the initial BM25 rankings. The BM25 baseline reaches 0.192 MRR@10 – as all neural models are significantly better, we omit it in the rest of the paper. We use the Adam optimizer and a pairwise margin ranking loss with a learning rate of 1e-3 for all non-BERT models; for BERT we use a rate of 3e-6 and the "bert-base-uncased" pre-trained model². We train the models with a batch size of 64; for evaluation we use a batch size of 256. We keep the defaults for the model configurations from their respective papers, except for MatchPyramid where we follow the 5-layer configuration from [6]. For the basic word embedding we use a vocabulary with a minimum collection occurrence of 5. The nature of the passage collection means we operate on fairly short text sequences: we clip the passages at 200 and the queries at 20 tokens – this only removes a modest number of outliers.

² From: https://github.com/huggingface/pytorch-pretrained-BERT
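To spell out what one training triple costs – the two forward passes and single backward pass counted in Table 1 – a sketch of the pairwise step is shown below. It assumes a model that scores a (query, document) batch, as in the sketches above; the margin value of 1.0 is an assumption, as only the use of a pairwise margin ranking loss and Adam is stated above.

```python
# Sketch of one pairwise training step (two forward passes, one loss &
# backward pass per triple), as counted in Table 1. Margin value is assumed.
import torch


def training_step(model, optimizer, query, relevant_doc, non_relevant_doc,
                  loss_fn=torch.nn.MarginRankingLoss(margin=1.0)):
    optimizer.zero_grad()
    pos_score = model(query, relevant_doc)        # forward pass 1
    neg_score = model(query, non_relevant_doc)    # forward pass 2
    target = torch.ones_like(pos_score)           # the relevant doc should rank higher
    loss = loss_fn(pos_score, neg_score, target)  # pairwise margin ranking loss
    loss.backward()                               # single backward pass
    optimizer.step()
    return loss.item()


# Usage assumption: `model` scores (query_ids, doc_ids) batches, e.g. the
# CosineMatchRanker sketched earlier, trained with Adam at lr=1e-3:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```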
In their work, Hofstätter et al. [6] evaluate the effectiveness of the models along the re-ranking depth (i.e. how many documents are re-ranked by the neural model) – they show that a shallow re-ranking depth already saturates most queries. This insight can be employed to tune the performance of re-ranking systems further in the future. In our case study, we keep it simple by reporting the best validation MRR@10 (Mean Reciprocal Rank) results per model.

We present average timings per batch, assuming a batch contains a single query with 256 re-ranking documents. We report timings from already cached batches – excluding the pre-processing and therefore reducing the considerable negative performance impact of Python as much as possible. We use a benchmark server with NVIDIA GTX 1080 TI (11 GB memory) GPUs and Intel Xeon E5-2667 @ 3.20GHz CPUs. Each model is run on a single GPU.

We caution that the measurements do not reflect production ready implementations, as we directly measured PyTorch research models. We strongly believe that overall the performance can be improved further by employing more inference optimized runtimes (such as the ONNX runtime³) and performance optimized support code (for example non-Python code feeding data into the neural network). We would like to kick-start innovation in this direction with our paper.

³ https://github.com/microsoft/onnxruntime

Table 1: Training performance (Triples/second includes: 2x forward, 1x loss & backward per triple), training duration (best validation result after batch count), peak GPU memory requirement as well as the effectiveness score (MRR@10)

              Model      Triples/second  Batch count  Peak memory  MRR@10
Word vectors  KNRM       5,200           44,000       2.16 GB      0.222
              C-KNRM     1,300           98,000       2.73 GB      0.261
              MatchP.    2,900           178,000      2.30 GB      0.245
              PACRR      2,900           130,000      2.21 GB      0.249
              DUET       1,900           146,000      2.47 GB      0.259
FastText      KNRM       2,300           62,000       7.34 GB      0.231
              C-KNRM     1,000           184,000      7.81 GB      0.273
              MatchP.    1,800           182,000      7.47 GB      0.254
              PACRR      1,700           100,000      7.40 GB      0.257
              DUET       1,600           182,000      7.46 GB      0.271
              BERT[CLS]  33              77,500       7.68 GB      0.347

Figure 1: A comparison of performance and effectiveness (MRR@10 versus milliseconds per query for the word vector, FastText and BERT variants of the evaluated models). Note: the break in the x-axis indicates a large time gap.

Table 2: Re-ranking speed (256 documents per query & batch), peak GPU memory requirement and MRR@10 effectiveness of our evaluated neural IR models

              Model      Docs/second  Time/query  GPU memory  MRR@10
Word vectors  KNRM       48,000       5 ms        0.84 GB     0.222
              C-KNRM     12,000       21 ms       0.93 GB     0.261
              MatchP.    28,000       9 ms        0.97 GB     0.245
              PACRR      27,000       9 ms        0.91 GB     0.249
              DUET       14,000       18 ms       1.04 GB     0.259
FastText      KNRM       36,000       7 ms        2.59 GB     0.231
              C-KNRM     11,000       23 ms       2.68 GB     0.273
              MatchP.    23,000       11 ms       2.72 GB     0.254
              PACRR      21,000       12 ms       2.67 GB     0.257
              DUET       17,000       15 ms       2.68 GB     0.271
              BERT[CLS]  130          1,970 ms    7.29 GB     0.347

2.3 Results & Discussion

We start our observations with the training of the models, as shown in Table 1. The main performance metric is the throughput of triples per second. A triple is a single training sample with a query and one relevant and one non-relevant document. The models are trained with a pairwise ranking loss, which requires two forward and a single backward pass per triple. The batch count is the number of training batches after which the best validation result is reached. KNRM is the fastest to train; it also saturates first. MatchPyramid and PACRR exhibit similar performance. This is due to their similar architecture components (CNNs applied on a 2D match matrix). In the class of CNNs applied to higher dimensional word representations, DUET is slightly faster than CONV-KNRM, although CONV-KNRM is slightly more effective. In general, FastText vectors improve all models with a modest performance decrease. The peak GPU memory required in training largely depends on the encoding layer⁴. Fine-tuning the BERT model is much slower than all other models. It is also more challenging to fit on a GPU with limited available memory; we employed gradient accumulation to update the weights every 64 samples. We did not observe big performance differences between batch sizes.

⁴ We report the peak memory usage provided by PyTorch, however we observed that additional GPU memory is required in practice: FastText & BERT are not trainable on 8 GB. We believe this is due to memory fragmentation. The size of the required headroom remains an open question for future work.

Now we focus on the practically more important aspect: the re-ranking performance of the neural IR models. In Table 2 we report the time that the neural IR models spend to score the provided query-document pairs. The reported time only includes the model computation. This corresponds to benchmark scenario #2 (Section 3.2).

The main observation from the re-ranking performance data in Table 2 is the striking difference between BERT and non-BERT models. Both the word vector and FastText encodings have a low memory footprint, and their level of performance makes them suitable for realtime re-ranking tasks. There are slight trade-offs between the models, as depicted in Figure 1. The differences correspond to the training speed discussed above. However, compared to BERT’s performance those differences become marginal. BERT offers impressive effectiveness gains at a substantial performance cost. We only evaluate a single BERT model, however the performance characteristics should apply to all BERT-based models.

We believe that the practical applicability of BERT-based re-ranking models is currently limited to offline scoring or domains where users are willing to accept multiple-second delays in their search workflow. Future work will likely focus on the gap between the contextualized and non-contextualized models – both in terms of performance and effectiveness. Another path is to speed up BERT and other transformer based models, for example with pruning [21]. Therefore, we argue that it is necessary to provide the replicability infrastructure with tools to take both the performance and effectiveness dimensions into account.
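As a concrete illustration of the measurement protocol behind Table 2 (cached batches of one query with its 256 candidates, GPU-synchronised timing of the model computation only), a minimal sketch is given below; the warm-up count and helper names are assumptions rather than our actual benchmarking code, and a CUDA device is assumed.

```python
# Sketch of the per-query timing protocol (Section 2.2 / Table 2): cached
# batches, GPU execution, model computation only. Illustrative, assumes CUDA.
import time
import torch


@torch.no_grad()
def time_reranking(model, cached_batches, warmup=10):
    """cached_batches: list of pre-tokenized (query_ids, doc_ids) tensors on the GPU,
    each holding one query with its 256 re-ranking candidates."""
    model.eval()
    for query_ids, doc_ids in cached_batches[:warmup]:    # warm-up runs, not measured
        model(query_ids, doc_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for query_ids, doc_ids in cached_batches:
        model(query_ids, doc_ids)
    torch.cuda.synchronize()                              # wait for all queued GPU work
    elapsed = time.perf_counter() - start
    per_query_ms = 1000.0 * elapsed / len(cached_batches)
    docs_per_second = len(cached_batches) * 256 / elapsed
    return per_query_ms, docs_per_second
```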
Figure 2: A simplified query workflow with re-ranking – showing the reach of our proposed performance benchmarks. ❶ The full system benchmark covers the complete pipeline: the query enters the first stage ranker, which uses the inverted index and document statistics to produce BM25-ranked documents (top 1000). ❷ The re-ranking benchmark covers only the second stage re-ranker: the neural IR model scores all matched query-document text pairs (drawn from the full text storage) and returns the re-ranked documents (top 10).

3 BENCHMARK SCENARIOS

Following the observations from the case study above, we propose to systematically measure and report performance metrics as part of all replicability campaigns. Concretely, we propose to extend the OSIRRC docker-based replicability infrastructure for two additional use cases. The different measured components are depicted in Figure 2.

3.1 Full System Benchmark

Currently, most IR evaluation is conducted in batched processes – working through a set of queries at a time, as we are mostly interested in the effectiveness results. The OSIRRC specifications also contain an optional timing feature for batched retrieval⁵. While we see this as a good first step, we envision a more performance focused benchmark: a scripted "interactive" mode, where the system answers one query at a time. Here, the benchmark decides the load and is able to measure fine-grained latency and throughput. The scripted "interactive" mode needs as little overhead as possible, for example a lightweight HTTP endpoint receiving the query string and returning TREC-formatted results. The execution of the benchmark needs to be conducted on the same hardware, multiple times, to reduce noise.

⁵ See: https://github.com/osirrc/jig

Although we present a neural IR model case study, we do not limit this benchmark scenario to neural systems – rather we see it as an opportunity to cover the full range of retrieval methods. For example, we are able to incorporate recall-boosting measures in the first stage retrieval, such as BERT-based document expansion [14] or query expansion with IR-specific word embeddings [7].

Measuring the latency and throughput of a full search engine (with a neural IR re-ranking component) over a longer realistic period touches many previously undeveloped components: storing neural model input values of indexed documents, generating batches on the fly, or handling concurrency. If a neural IR model is to be deployed in production with GPU acceleration, the issue of concurrent processing becomes important: we observed that slower models also have a higher GPU utilization, potentially creating a traffic jam on the GPU, which in turn would increase the needed infrastructure cost for the same number of users.
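A scripted "interactive" benchmark client along these lines could look like the following sketch. The endpoint URL, query parameter and reported statistics are hypothetical – they illustrate the proposed mode rather than an existing OSIRRC jig interface.

```python
# Hypothetical sketch of the scripted "interactive" full-system benchmark:
# one query at a time against a lightweight HTTP endpoint that returns
# TREC-formatted result lines. Endpoint and parameters are assumptions.
import statistics
import time
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8080/search"   # assumed search endpoint inside the container


def run_benchmark(queries, repetitions=3):
    latencies_ms = []
    for _ in range(repetitions):             # repeat runs to reduce noise
        for query_id, query_text in queries:
            url = ENDPOINT + "?" + urllib.parse.urlencode({"q": query_text})
            start = time.perf_counter()
            with urllib.request.urlopen(url) as response:
                trec_lines = response.read()  # e.g. "qid Q0 docid rank score tag" lines
            latencies_ms.append(1000.0 * (time.perf_counter() - start))
    return {
        "queries_per_second": len(latencies_ms) / (sum(latencies_ms) / 1000.0),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": statistics.quantiles(latencies_ms, n=20)[-1],
    }


# Example: run_benchmark([("1", "what is information retrieval")])
```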
3.2 Re-ranking Benchmark

The neural IR field is receiving considerable attention and has a growing community. In our opinion, the community is in need of a more structured evaluation – both for performance and effectiveness. We now propose a benchmark which aims to deliver on both dimensions.

The re-ranking benchmark focuses on the innermost component of neural IR models: the scoring of query-document tuples. We provide the re-ranking candidate list and the neural IR model scores the tuples. Many of the existing neural IR models follow this pattern and can therefore easily be swapped and compared with each other – also on public leaderboards, such as the MS MARCO leaderboard. This static scenario provides a coherent way of evaluating neural IR re-ranking models. It helps to mitigate differences in the setup of various research groups.
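One possible shape for such a static re-ranking harness is sketched below. The data layout and function names are hypothetical; the point is that the candidate list is fixed by the benchmark and only the scoring step – whose run time is recorded – differs between submitted models.

```python
# Hypothetical sketch of the static re-ranking benchmark: the candidate list
# is provided, the submitted model only scores query-document text pairs.
import time


def rerank_benchmark(score_pairs, candidates, top_k=10):
    """candidates:  dict mapping query_id -> (query_text, [(doc_id, doc_text), ...])
    score_pairs: submitted callable mapping (query_text, [doc_text, ...]) -> [float, ...]"""
    run, timings_ms = {}, {}
    for query_id, (query_text, docs) in candidates.items():
        doc_ids, doc_texts = zip(*docs)
        start = time.perf_counter()
        scores = score_pairs(query_text, list(doc_texts))       # timed model computation
        timings_ms[query_id] = 1000.0 * (time.perf_counter() - start)
        ranked = sorted(zip(doc_ids, scores), key=lambda x: x[1], reverse=True)
        run[query_id] = ranked[:top_k]
    return run, timings_ms    # run feeds the MRR@10 evaluation, timings feed the report
```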
4 CONCLUSION

The OSIRRC docker-based IR replicability infrastructure presents an opportunity to incorporate performance benchmarks. As an example of the community's need for a broader view, we show in a case study the trade-off between performance and effectiveness of neural IR models, especially for recent BERT based models. As a result, we propose two different performance-focused benchmarks to be incorporated in the infrastructure going forward. We look forward to working with the community on these issues.

REFERENCES

[1] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, and Tri Nguyen. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In Proc. of NIPS.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the ACL 5 (2017).
[3] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In Proc. of WSDM.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, et al. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv preprint arXiv:1803.07640 (2017).
[6] Sebastian Hofstätter, Navid Rekabsaz, Carsten Eickhoff, and Allan Hanbury. 2019. On the Effect of Low-Frequency Terms on Neural-IR Models. In Proc. of SIGIR.
[7] Sebastian Hofstätter, Navid Rekabsaz, Mihai Lupu, Carsten Eickhoff, and Allan Hanbury. 2019. Enriching Word Embeddings for Patent Retrieval with Global Context. In Proc. of ECIR.
[8] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, et al. 2017. Speed/accuracy trade-offs for modern convolutional object detectors. In Proc. of IEEE CVPR.
[9] Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. 2017. PACRR: A Position-Aware Neural IR Model for Relevance Matching. In Proc. of EMNLP.
[10] Oscar Jimenez-del Toro, Henning Müller, Markus Krenn, et al. 2016. Cloud-based evaluation of anatomical structure segmentation and landmark detection algorithms: VISCERAL anatomy benchmarks. IEEE Trans. on Medical Imaging (2016).
[11] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized Embeddings for Document Ranking. In Proc. of SIGIR.
[12] Bhaskar Mitra and Nick Craswell. 2019. An Updated Duet Model for Passage Re-ranking. arXiv preprint arXiv:1903.07666 (2019).
[13] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).
[14] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).
[15] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching as Image Recognition. In Proc. of AAAI.
[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, et al. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[17] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proc. of EMNLP.
[18] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. In Proc. of IEEE CVPR.
[19] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proc. of NIPS.
[20] Jaime Teevan, Kevyn Collins-Thompson, Ryen W. White, Susan T. Dumais, and Yubin Kim. 2013. Slow Search: Information Retrieval without Time Constraints. In Proc. of the Symposium on HCI and IR.
[21] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv preprint arXiv:1905.09418 (2019).
[22] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In Proc. of SIGIR.
[23] Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proc. of SIGIR.