<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Serverless BM25 Search and BERT Reranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mayank Anand</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiarui Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shane Ding</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ji Xin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jimmy Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The retrieve-rerank pipeline is a well-established architecture for search applications, typically with first-stage retrieval using keyword search followed by reranking with a transformer-based model. In deploying such an architecture in the cloud, developers must devote considerable effort to resource provisioning and management: typically, the goal is to optimize the infrastructure configuration (number and type of server instance) to achieve certain performance characteristics (latency, throughput, etc.) while reducing operating costs. In this paper, we introduce a serverless prototype of the retrieve-rerank pipeline for search using Amazon Web Services (AWS), comprised of BM25 for first-stage retrieval using Lucene followed by reranking with the monoBERT model using Hugging Face Transformers. The advantage of a serverless design is that a cloud provider shoulders the burden of operational management, for example, allocating server instances and scaling with query load. We experimentally show with the popular MS MARCO passage ranking test collection that compared to a traditional server-based deployment, our serverless implementation (1) retains the same level of effectiveness, (2) can reduce average latency by exploiting massive parallelism, and (3) incurs comparable costs if the service is expected to be idle for some fraction of the time. Our implementation is open-sourced at https://github.com/castorini/serverless-bert-reranking.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-stage ranking architectures</kwd>
        <kwd>transformers</kwd>
        <kwd>monoBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>It is a common practice today for search engines to adopt a retrieve–rerank architecture, for example, with keyword search as first-stage retrieval followed by a transformer-based model for reranking [1]. This represents a simple instantiation of a multi-stage retrieval architecture [2] that is widely used in production at scale [3, 4, 5, 6]. In terms of deployments, individual servers (today, typically virtualized instances in the cloud) form the basic building blocks for search applications. Persistent services running on a cluster cooperate to provide the various functionalities that comprise the complete application. To scale out, the standard practice is to adopt a replicated, document-partitioned architecture [7, 8, 9].</p>
      <p>This design has two important implications: First, the services must exist as always-on, long-running processes, ready to handle requests at any moment. This presents a floor on resource consumption, as costs are incurred even when the service is idle. Second, scaling up and down in response to query load must be performed at the granularity of servers, usually through replication and load balancing. Thus, a server-based design means that when the query load is low, even a single server may be over-provisioning; for robust failover, a minimal installation typically runs two servers, additionally contributing to idle (wasted) resources. As the query load increases, to maintain the same level of performance, more server instances need to be provisioned. As query load decreases, these extra instances must then be destroyed. To cope with variable load robustly, developers need to build logic to dynamically spin up and down instances, which may be complex and error prone. While these are solvable engineering challenges, we wonder if there’s a better way. It would be desirable if we could scale up and down seamlessly, without effort—and ideally, all the way down to zero. That is, if there are no incoming queries, can we not have to pay anything?</p>
      <p>Serverless architectures to the rescue! In this paper, we present a serverless prototype of the retrieve–rerank pipeline for search using Amazon Web Services (AWS), comprised of BM25 for first-stage retrieval using Lucene followed by reranking with the monoBERT model using Hugging Face Transformers. We describe our design and present experimental results with the MS MARCO passage ranking test collection. In addition to the ability to completely offload operational management, we believe that there are two scenarios where our serverless design is particularly compelling: (1) a search application that handles low query volumes, where server instances may be idle most of the time, and (2) a search application where incoming requests may be bursty, for example, a service endpoint that is periodically invoked by some other component. While our prototype exhibits a number of limitations at present, it perhaps offers a blueprint for a very different approach to how future search applications may be built.</p>
      <p>DESIRES 2021 – 2nd International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September 15–18, 2021, Padua, Italy.</p>
      <p>Figure 1: Architecture of the serverless retrieve–rerank pipeline. A search client (web browser) issues queries through the API Gateway to the FetchLambda, which calls the SearchLambda (serverless search over Lucene indexes stored in S3) and fetches raw documents from DynamoDB; the RerankLambda (serverless reranking), whose container image is hosted in the Elastic Container Registry, scores the candidates.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>The development of cloud technologies can be characterized as a continuing disaggregation of computing components. In the early days, the cloud meant dynamic, readily-available, easy-to-provision virtual machines. Today, however, there exists a myriad of services, offered by all the major cloud providers, that deliver computing capabilities in a much more fine-grained manner under a pay-as-you-go model.</p>
      <p>One particularly interesting development is the rise of so-called Function-as-a-Service (FaaS) offerings: The developer provides a block of code with well-known entry and exit points, and the cloud provider handles all other aspects of execution—provisioning resources to execute those functions, scaling up and down to match a particular load, etc., all under a per-invocation cost model. Combined with storage- and database-as-a-service offerings, it is now possible to write end-to-end serverless applications where the abstraction of a server is completely absent. To be clear, a serverless design does not mean that we can somehow compute without servers. Rather, it means that the developer no longer needs to explicitly manage server instances. Instead, the cloud provider shoulders the burden of operational management, thus freeing the developer to focus on implementing the application logic.</p>
      <p>Researchers have explored serverless architectures for a variety of applications [10, 11, 12, 13, 14], but most relevant to this work is serverless search: Crane and Lin [15] previously demonstrated a working prototype on Amazon Web Services. In their design, postings lists are stored in the DynamoDB data store and query execution is handled by Lambda (Amazon’s FaaS offering). Their work demonstrated the feasibility of serverless search, but has a number of shortcomings. In particular, their prototype required custom code, which presents barriers to broad adoption. Lin [16] addressed this shortcoming by demonstrating how the open-source Lucene search library can be packaged in a serverless design with minimal custom code to achieve query latencies capable of supporting interactive retrieval. This serverless Lucene prototype forms the starting point of our work, where we further add serverless reranking with transformer-based models to demonstrate a full serverless retrieve–rerank pipeline.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Serverless Architecture</title>
      <p>We present and evaluate a working implementation of a serverless retrieve–rerank pipeline with the architecture shown in Figure 1. In this work, Amazon Web Services (AWS) was selected as the cloud platform: in particular, Lambda, the AWS Function-as-a-Service offering, provides the core building block in our design. Nevertheless, other popular cloud providers offer comparable services that can provide alternative implementations.</p>
      <p>Our design is comprised of two distinct components, serverless search and serverless reranking, described below. The serverless search component is built on the serverless Lucene prototype presented by Lin [16], while the serverless reranking component has not been described anywhere else.</p>
      <sec id="sec-3-1">
        <title>3.1. Serverless Search</title>
        <p>An important desideratum of our work is to build serverless search on the open-source Lucene search library, which has emerged as the de facto platform for developing real-world search applications, typically via OpenSearch, Elasticsearch, Solr, or other components in the broader ecosystem. Other than a few commercial search engine companies that deploy custom infrastructure (for which the serverless design would not be of interest anyway), Lucene dominates the search landscape, with deployments at organizations ranging from Bloomberg to Twitter to Wikipedia. Furthermore, the use of Lucene for academic research has been gaining traction [17, 18, 19]. Thus, to increase the potential for broader impact and adoption, our focus is to leverage as much of the existing Lucene codebase as possible.</p>
        <p>The design of serverless architectures hinges around the decoupling of state from stateless code. In the context of search, “state” is captured by the inverted index and other related data structures, while query evaluation (i.e., postings traversal) can be considered stateless. Thus, it is only natural to package Lucene’s query evaluation code (IndexReader, IndexSearcher, etc.) into a Lambda function—the SearchLambda in Figure 1. The index structures (assumed to have been generated elsewhere) can be stored in S3, Amazon’s persistent object store.</p>
        <p>How do we “connect” Lucene code (running in a Lambda) with index structures stored in S3? Fortunately, Lucene’s Directory interface provides a low-level abstraction for reading index structures (at the level of reading bytes from streams, seeking to different byte offset positions, etc.). Thus, it suffices to provide a custom Directory implementation built with Amazon’s S3 API, and then use this implementation for reading the indexes. Critically, all other parts of the Lucene query evaluation stack remain unchanged—instead of consuming bytes from a local drive (for example), the bytes are now streamed across the datacenter network from S3.</p>
        <p>Given this design, an important issue of course is the performance of (remote network) reads from S3. This is solved by caching; that is, a custom S3Directory implementation reads data into memory and thus the overall design is no different from main-memory search engines, which are quite commonplace today both in the academic literature [20, 21, 22] as well as in production deployments [9, 23]. In order to understand how this caching mechanism interacts with Lambda execution, it is necessary to understand at a high level how Amazon handles FaaS execution.</p>
        <p>Behind the scenes, Amazon is provisioning containers to execute the Lambda; it controls how many containers are running to satisfy a particular load, automatically scales up and down the number of containers, and performs load balancing. Therefore, code execution can either occur on a “warm” instance (i.e., an already running container) or a “cold” instance. For a “warm” instance, query evaluation proceeds without overhead as the index structures have already been loaded into memory; initial execution on a “cold” instance, however, carries the additional startup costs associated with populating the cache. This is not unlike any other in-memory system, and Lambda execution incurs no performance penalty in steady state.</p>
        <p>To complete the architecture shown in Figure 1, there are a few more components to describe: Raw documents are stored in DynamoDB (organized as a simple key–value store). Another Lambda, the FetchLambda, calls the SearchLambda to generate a ranking, and then issues concurrent, batched calls to DynamoDB to retrieve the actual document text (which is needed for reranking). The FetchLambda can be triggered through a REST endpoint provided by the API Gateway. The final product is a service that takes a query and returns a list of documents (complete with their contents) accessible to a search client (e.g., in a web browser).</p>
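        <p>To make this flow concrete, the sketch below shows what a FetchLambda-style handler might look like in Python with boto3. It is a minimal illustration only: the function name, DynamoDB table name, key schema, and payload fields are assumptions for exposition, not the exact interfaces of our prototype, and the real FetchLambda issues its DynamoDB batches concurrently rather than sequentially.</p>
        <preformat>
# Minimal sketch of a FetchLambda-style handler (names and payload
# fields are illustrative assumptions, not the prototype's exact API).
import json

import boto3

lambda_client = boto3.client("lambda")
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    query = event["query"]

    # 1. Synchronously invoke the SearchLambda to obtain a BM25 ranking,
    #    assumed here to return a list of document ids.
    response = lambda_client.invoke(
        FunctionName="SearchLambda",
        Payload=json.dumps({"query": query, "k": 1000}),
    )
    doc_ids = json.loads(response["Payload"].read())["doc_ids"]

    # 2. Retrieve the raw document text from DynamoDB. BatchGetItem accepts
    #    at most 100 keys per call, matching the batching described above;
    #    the prototype issues these batch calls concurrently.
    docs = []
    for start in range(0, len(doc_ids), 100):
        batch = doc_ids[start:start + 100]
        result = dynamodb.batch_get_item(
            RequestItems={
                "msmarco-passages": {  # hypothetical table name
                    "Keys": [{"docid": {"S": d}} for d in batch]
                }
            }
        )
        for item in result["Responses"]["msmarco-passages"]:
            docs.append({"docid": item["docid"]["S"], "text": item["text"]["S"]})

    return {"query": query, "documents": docs}
</preformat>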
        <p>Although in principle the FetchLambda and the SearchLambda can be combined, we have kept the two separate to support future scale out. A partitioned architecture can be implemented by multiple SearchLambda instances, each responsible for its own index partition, in which case the FetchLambda can serve as a central broker.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Serverless Reranking</title>
        <p>In our design, BM25 results from first-stage retrieval are fed to monoBERT, a standard cross-encoder, for reranking. In monoBERT, inference is performed on all candidate documents: an input template comprised of the query and the document text is fed to a fine-tuned BERT model, which produces a relevance score. All candidate documents are then sorted by these scores. Previous studies have shown that this approach is both simple and effective [1]. In this work, we adopt a monoBERT variant called Early Exiting monoBERT [24], which increases the efficiency of the BERT backbone by adding in “early exits” that allow the inference process to terminate early if the model is confident in its decisions. Our implementation is based on Hugging Face Transformers [25] and PyTorch [26].</p>
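        <p>For reference, the Python sketch below shows the standard monoBERT scoring pattern with Hugging Face Transformers and PyTorch: the query and the passage are packed into a single input sequence, and the fine-tuned model’s probability of the “relevant” class is used as the relevance score. The checkpoint path is a placeholder, and the sketch deliberately omits the early-exit machinery of [24].</p>
        <preformat>
# Sketch of monoBERT-style cross-encoder scoring (no early exiting).
# The checkpoint path is a placeholder for a fine-tuned reranker.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "path/to/fine-tuned-monobert"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def score(query, passage):
    # Pack query and passage into one input: [CLS] query [SEP] passage [SEP]
    inputs = tokenizer(query, passage, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of the "relevant" class serves as the relevance score;
    # candidates are then sorted by this score, highest first.
    return torch.softmax(logits, dim=-1)[0, 1].item()
</preformat>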
        <p>Conceptually, serverless ranking is straightforward because the operation is stateless and embarrassingly parallel. We simply need to generate a relevance score for each document, which can proceed independently. The obvious implementation is to wrap model inference in a Lambda, and this is exactly what we do with the RerankLambda, as shown in Figure 1. There are, however, two engineering challenges, discussed below.</p>
        <p>First, neural inference typically requires GPUs to achieve latencies that are sufficiently low to support interactive applications, but AWS Lambda invocations are limited to CPUs. We mitigate this limitation with the early-exit model optimizations described above as well as by exploiting the parallelism provided by the FaaS design (more details below).</p>
        <p>The second challenge concerns the size of the Lambda deployment package. Presently, AWS places a limit of 250 MB, which is insufficient for both our model and the neural inference stack (Hugging Face Transformers and PyTorch). One straightforward solution is to download the reranker model at execution time, directly from S3 to the temporary directory attached to the Lambda instance. However, this solution is inefficient because the model must be downloaded every time a new execution environment is created. Instead, we directly incorporated our fine-tuned model and the entire execution stack into a container built on the AWS base image for Lambda, where the size limit is 10 GB. We then uploaded the image to ECR, Amazon’s fully-managed container registry service, which provides fast and highly-available access. This way, AWS is able to optimize resource provisioning, for example, caching the image closer to where the FaaS invocation occurs. Since the reranking model is already part of the container image, we have eliminated all external dependencies.</p>
        <p>As shown in Figure 1, the reranker service endpoint (RerankLambda) is accessible from the API Gateway via HTTP. It receives a JSON request structure comprising the query and (document id, content) pairs to be reranked, performs model inference, and returns a JSON structure with (document id, score) pairs.</p>
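        <p>A minimal sketch of such a handler is shown below, reusing the score() function from the previous sketch; the JSON field names (“query”, “documents”, “docid”, “text”, “score”) are illustrative and may differ from those in our prototype.</p>
        <preformat>
# Sketch of a RerankLambda-style handler behind the API Gateway.
# Assumes model, tokenizer, and score() as defined in the previous sketch;
# JSON field names are illustrative assumptions.
import json

def handler(event, context):
    # With an API Gateway proxy integration, the request body arrives as a string.
    body = json.loads(event["body"]) if "body" in event else event
    query = body["query"]
    documents = body["documents"]  # list of {"docid": ..., "text": ...}

    results = [
        {"docid": doc["docid"], "score": score(query, doc["text"])}
        for doc in documents
    ]
    results.sort(key=lambda r: r["score"], reverse=True)

    return {"statusCode": 200, "body": json.dumps({"results": results})}
</preformat>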
        <p>In our current prototype, we have completely decoupled serverless search from serverless reranking, but this design imposes some unnecessary data movement: the document contents are returned to the client (across the network) and then sent right back to the RerankLambda for reranking. It would be straightforward to more tightly couple the search and reranking components, but we have currently not done so, primarily because the savings would be modest at best. The additional costs of this extra data transfer are small compared to the costs associated with neural network inference.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>We evaluated our serverless search and reranking prototype using the popular MS MARCO passage ranking test collection [27], which comprises 8.8M documents (passages). Inverted indexes for the collection were built using the Anserini toolkit [28] and then uploaded to S3. Separately, the raw document texts were inserted into DynamoDB using a custom ingestion script.</p>
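      <p>As an illustration, an ingestion script along these lines can use boto3’s batch writer to load the passages; the table name, key schema, and input format below are assumptions rather than the exact script we used.</p>
      <preformat>
# Illustrative ingestion sketch: load MS MARCO passages into DynamoDB.
# Table name, key schema, and input format are assumptions.
import boto3

table = boto3.resource("dynamodb").Table("msmarco-passages")

# collection.tsv: one "docid TAB passage text" pair per line.
with table.batch_writer() as writer:
    with open("collection.tsv", encoding="utf-8") as f:
        for line in f:
            docid, text = line.rstrip("\n").split("\t", 1)
            writer.put_item(Item={"docid": docid, "text": text})
</preformat>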
      <p>In our experiments, we retrieved 1000 hits using BM25 for each query and reranked all of those hits. The SearchLambda returns only the document ids of the retrieval results; since BERT reranking requires the document contents as well, the FetchLambda issues concurrent queries to obtain document contents from DynamoDB in batches of 100 documents.</p>
      <p>To speed up BERT reranking, we issued parallel RerankLambda requests, each with the query and ten candidate documents. That is, each invocation processed ten documents, and therefore to rerank 1000 hits, we had to issue 100 requests in parallel. With Lambda, this amount of parallelism is easy to obtain—after all, this is exactly the point of FaaS. In principle, we could even issue 1000 parallel requests, each scoring a single document, but we did not try this configuration in our experiments.</p>
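      <p>The client-side fan-out can be as simple as the following Python sketch, which invokes the reranker function directly (bypassing the API Gateway) over ten-document chunks using a thread pool; the function name and payload fields are again illustrative assumptions.</p>
      <preformat>
# Sketch of client-side parallel fan-out to the RerankLambda:
# 100 concurrent invocations, each reranking 10 candidate documents.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def rerank_chunk(query, chunk):
    response = lambda_client.invoke(
        FunctionName="RerankLambda",  # hypothetical function name
        Payload=json.dumps({"query": query, "documents": chunk}),
    )
    payload = json.loads(response["Payload"].read())
    return json.loads(payload["body"])["results"]

def rerank(query, documents, chunk_size=10, max_workers=100):
    chunks = [documents[i:i + chunk_size]
              for i in range(0, len(documents), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = pool.map(lambda c: rerank_chunk(query, c), chunks)
    # Merge the partial rankings and sort all candidates by score.
    merged = [r for part in partial_results for r in part]
    return sorted(merged, key=lambda r: r["score"], reverse=True)
</preformat>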
      <p>For the reranking model, Early Exiting monoBERT, we followed the guidance in Xin et al. [24] and selected the configuration in the third row of Table 1 in that paper (hyperparameter values of 1.0 and 0.9). Based on the reported results, with only a 1% drop in MRR, this setting provides a 2.9× acceleration, which means that on average, each inference exits the 12-layer transformer model after around four layers.</p>
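      <p>To illustrate the idea (not our exact implementation), the sketch below runs a BERT encoder layer by layer and stops as soon as an intermediate classifier is sufficiently confident. The per-layer exit heads here are untrained placeholders, and the single confidence threshold is a simplification of the exit criteria used by Xin et al. [24].</p>
      <preformat>
# Illustrative sketch of confidence-based early exiting with BERT.
# The per-layer exit heads are untrained placeholders; a real early-exiting
# reranker would use heads fine-tuned jointly with the backbone [24].
import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

exit_heads = torch.nn.ModuleList(
    [torch.nn.Linear(model.config.hidden_size, 2)
     for _ in range(model.config.num_hidden_layers)]
)

def score_with_early_exit(query, passage, threshold=0.9):
    enc = tokenizer(query, passage, truncation=True, max_length=512,
                    return_tensors="pt")
    with torch.no_grad():
        hidden = model.embeddings(input_ids=enc["input_ids"],
                                  token_type_ids=enc["token_type_ids"])
        mask = model.get_extended_attention_mask(
            enc["attention_mask"], enc["input_ids"].shape, enc["input_ids"].device)
        for i, layer in enumerate(model.encoder.layer):
            hidden = layer(hidden, attention_mask=mask)[0]
            probs = torch.softmax(exit_heads[i](hidden[:, 0]), dim=-1)
            # Exit as soon as the intermediate classifier is confident enough.
            if probs.max().item() &gt;= threshold:
                break
    return probs[0, 1].item(), i + 1  # relevance score and exit layer
</preformat>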
      <p>To evaluate retrieval effectiveness, we ran inference on the entire development set of the MS MARCO passage ranking test collection (6980 queries), using both the serverless prototype and a comparable server-based configuration; MRR@10 scores are shown in Table 1. We encountered minor issues resulting from the encoding of special characters, which translated into very small differences in effectiveness between the two designs (third digit after the decimal point). These issues aside, we can verify that our serverless deployment retains the same level of effectiveness as a server-based design.</p>
      <p>Table 1: Effectiveness comparisons on the development set of the MS MARCO passage ranking test collection.
Configuration | MRR@10
BM25 | 0.18
BM25 + Early Exiting monoBERT | 0.34</p>
      <p>To evaluate retrieval latency and cost, we further performed search and reranking on 100 queries from the development set of the MS MARCO passage ranking test collection to obtain more detailed logging data. We measured component latency as well as end-to-end latency from the client side. Table 2 provides a breakdown in terms of mean, 50th, and 99th percentile latency. As we can observe, end-to-end latency is dominated by serverless reranking, due to the computationally intensive nature of neural inference. Before these experiments, we conducted multiple trials to warm up the SearchLambda and RerankLambda instances.</p>
      <p>Table 2: Component and end-to-end latency and cost based on 100 queries from the development set of the MS MARCO passage ranking test collection. Latency is reported per query, while cost is reported per 100 queries.
Stage | Mean (s/Q) | P50 (s/Q) | P99 (s/Q) | Cost (/100Q)
BM25 | 0.65 | 0.65 | 0.92 | $0.02
DynamoDB Fetch | 0.95 | 0.96 | 1.06 | –
BERT reranking | 11.21 | 10.64 | 17.90 | $15.90
End to end | 12.81 | 12.24 | 19.35 | $16.00
BERT reranking (V100) | 26.21 | 25.52 | 36.64 | $2.20</p>
      <p>Based on the latency measurements, we estimated operating costs (in US dollars). AWS Lambda charges based on the number of function invocations as well as the duration of the function execution. The pricing also reflects the amount of memory allocated to the function; resources beyond CPU are allocated proportionally based on memory. In our case, we allocated the maximum, 10240 MB. At present, the costs are $0.20 per 1M requests and $0.0000166667 for every GB-second of duration; in our case, the per-request charge is negligible. Thus, we estimated compute costs as duration of compute (seconds) × memory allocated (GB) × 0.0000166667. For ease of interpretation, we report costs in terms of 100 queries, shown in Table 2. DynamoDB costs are computed according to a complex set of rules that are hard to directly estimate. However, for our experiments, these costs are negligible compared to the other components.</p>
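      <p>The compute-cost estimate is simple enough to express directly; the helper below implements the formula above, and the example values are for illustration only (they are not meant to reproduce Table 2 exactly).</p>
      <preformat>
# Lambda compute-cost estimate: duration (s) x memory (GB) x $0.0000166667.
# The per-request charge ($0.20 per 1M requests) is ignored as negligible.
GB_SECOND_RATE = 0.0000166667  # USD per GB-second, pricing at time of writing

def lambda_compute_cost(duration_s, memory_mb=10240):
    return duration_s * (memory_mb / 1024) * GB_SECOND_RATE

# Example: one invocation running for 1 second at the maximum 10240 MB
# costs about $0.000167, so 100 such invocations cost about $0.0167.
print(round(lambda_compute_cost(1.0), 6))        # 0.000167
print(round(100 * lambda_compute_cost(1.0), 4))  # 0.0167
</preformat>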
      <p>To compare the cost of our serverless prototype with a standard server-based deployment, we also set up our reranking pipeline on a local server with a single NVIDIA V100 GPU. Here, we focus on BERT reranking latency only, as the contributions from the other components are negligible. Based on these latency measurements, we estimated query costs by looking up the per-hour (on-demand) prices of V100 servers from AWS and Azure, both of which provide similar pricing (we used $3.05 per hour as the basis of our calculations). These results are also reported in Table 2.</p>
      <p>What do we make of these experimental results? The latency for GPU-based reranking is admittedly longer than comparable figures reported in similar experiments [29]. We attribute this to the lack of batch inference in our implementation. With this caveat in mind, we see that serverless BERT reranking is able to achieve lower latency with only CPUs. No doubt some of this gap is due to our sub-optimal implementation, but the more interesting point is that the serverless design allows us to arbitrarily parallelize Lambda invocations. In our setup, we issued 100 parallel RerankLambda requests, each performing inference on ten documents. To reduce latency further, we could increase parallelism even more, for example, dispatching 1000 parallel requests, each performing inference on a single document. Based on the Lambda pricing model, this should have no appreciable impact on cost. Thus, the lower bound on end-to-end latency is in theory limited by CPU-based inference on a single document.</p>
      <p>Nevertheless, it is clear that on a per-query basis, our serverless design is 7–8× more expensive than a traditional server-based deployment. This is of course expected, and there are two components to this gap. First, the per-unit-time cost of serverless components must sum up to more than the cost of a comparable server; otherwise, AWS would be losing money on serverless offerings. Second, decomposing a server-based application into a serverless design introduces friction (e.g., unnecessary data movement and network communication). In our specific case, there is the additional difference between GPU vs. CPU neural inference. However, how much each of these factors contributes to the cost difference is difficult to determine.</p>
      <p>Summarizing the “bottom line” based on our experimental results: If we expect a server to be idle 85–90% of the time, a serverless deployment is more cost efficient. Beyond costs alone, a serverless design exhibits all the potential advantages we have already discussed: minimal operational burden along with seamless scalability down to zero (zero queries, zero cost) and up to arbitrarily large query loads. Whether these tradeoffs are worthwhile, of course, will depend on the exact operational scenario.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Future Work and Conclusions</title>
      <p>At a high level, a serverless design provides operational advantages and cost efficiencies for low-load applications, where in traditional server-based designs the developer must still pay for idle servers. In our experiments, this “breakeven point” is around 85–90% idle, but it is important to note that our experimental results reflect a snapshot at a specific point in time, based on a specific implementation. Below, we discuss some of the factors that may play a role in this calculus, and how they might change over time.</p>
      <p>First, there is a general downward trend in AWS costs over time as computing capabilities advance. However, the relative costs between storage, server instances, and FaaS invocation may not be stable. These differences will impact costs over time, but unfortunately, the developer has little control over pricing.</p>
      <p>Second, costs will be affected by different architecture and implementation choices, including future innovations. The reader may have noticed that in our experiments, end-to-end latency for both the serverless and server-based deployments is still outside the range of what is acceptable for an interactive application. There are various options to reduce latency: we can choose to rerank fewer BM25 results, in which case we are trading off effectiveness for efficiency. As we have already mentioned, Lambda could support greater parallelism (thus lower latency) without increasing costs. For the server-based design, we can also rerank in parallel, but this would increase costs (e.g., requiring a larger server with more GPUs). These considerations seem to be in favor of the serverless design with its per-invocation cost model and seamless scalability.</p>
      <p>Neural inference forms the biggest component of both latency and cost, and there is much research on models that support faster and more efficient inference. Examples include ALBERT [30], TinyBERT [31], and Q-BERT [32], all of which can serve as drop-in replacements for our current reranker. These improvements will benefit both the serverless and server-based designs, and to a large extent we can ride the wave of future innovations in NLP “for free”. The interesting question, however, is whether some of these innovations will differentially impact CPU-based vs. GPU-based inference, or perhaps in the future FaaS offerings might support GPUs. We do not have answers at present, but future explorations of these issues would be interesting.</p>
      <p>To conclude, in this work we built on an existing serverless Lucene prototype to demonstrate a complete retrieve–rerank search architecture with a transformer-based model. Our experiments allow us to characterize the tradeoffs between serverless and server-based designs. No doubt the costs of both approaches will change over time as the economics of cloud computing evolve and as technical innovations lead to efficiency improvements. However, the operational advantages of the serverless design will remain. Whether such an architecture will gain widespread adoption remains to be seen, but at the very least this design challenges how we think about the architecture of search applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada; computational resources were provided by Compute Ontario and Compute Canada.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] R. Nogueira, K. Cho, Passage re-ranking with BERT, arXiv:1901.04085 (2019).</p>
      <p>[2] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, arXiv:2010.06467 (2020).</p>
      <p>[3] J. Pedersen, Query understanding at Bing, in: Industry Track Keynote at the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), Geneva, Switzerland, 2010.</p>
      <p>[4] S. Liu, F. Xiao, W. Ou, L. Si, Cascade ranking for operational e-commerce search, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2017), Halifax, Nova Scotia, Canada, 2017, pp. 1557–1565.</p>
      <p>[5] J.-T. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, L. Yang, Embedding-based retrieval in Facebook search, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2020), 2020, pp. 2553–2561.</p>
      <p>[6] L. Zou, S. Zhang, H. Cai, D. Ma, S. Cheng, D. Shi, Z. Zhu, W. Su, S. Wang, Z. Cheng, D. Yin, Pre-trained language model based ranking in Baidu search, arXiv:2105.11108 (2021).</p>
      <p>[7] L. A. Barroso, J. Dean, U. Hölzle, Web search for a planet: The Google cluster architecture, IEEE Micro 23 (2003) 22–28.</p>
      <p>[8] R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, F. Silvestri, Challenges on distributed web retrieval, in: Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, 2007, pp. 6–20.</p>
      <p>[9] J. Dean, Challenges in building large-scale information retrieval systems, in: Keynote Presentation at the Second ACM International Conference on Web Search and Data Mining (WSDM 2009), Barcelona, Spain, 2009.</p>
      <p>[10] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, B. Recht, Occupy the cloud: Distributed computing for the 99%, in: Proceedings of the 2017 Symposium on Cloud Computing (SoCC 2017), Santa Clara, California, 2017, pp. 445–451.</p>
      <p>[11] Y. Kim, J. Lin, Serverless data analytics with Flint, in: Proceedings of the 2018 IEEE 11th International Conference on Cloud Computing (CLOUD 2018), San Francisco, California, 2018, pp. 451–455.</p>
      <p>[12] J. Hellerstein, J. Faleiro, J. Gonzalez, J. Schleier-Smith, V. Sreekanti, A. Tumanov, C. Wu, Serverless computing: One step forward, two steps back, arXiv:1812.03651 (2018).</p>
      <p>[13] S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, K. Winstein, From laptop to lambda: Outsourcing everyday jobs to thousands of transient functional containers, in: Proceedings of the 2019 USENIX Annual Technical Conference, Renton, Washington, 2019, pp. 475–488.</p>
      <p>[14] V. Sreekanti, C. Wu, X. C. Lin, J. Schleier-Smith, J. M. Faleiro, J. E. Gonzalez, J. M. Hellerstein, A. Tumanov, Cloudburst: Stateful functions-as-a-service, arXiv:2001.04592 (2020).</p>
      <p>[15] M. Crane, J. Lin, An exploration of serverless architectures for information retrieval, in: Proceedings of the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017), Amsterdam, The Netherlands, 2017, pp. 241–244.</p>
      <p>[16] J. Lin, A prototype of serverless Lucene, arXiv:2002.01447 (2020).</p>
      <p>[17] L. Azzopardi, M. Crane, H. Fang, G. Ingersoll, J. Lin, Y. Moshfeghi, H. Scells, P. Yang, G. Zuccon, The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017, in: Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Tokyo, Japan, 2017, pp. 1429–1430.</p>
      <p>[18] L. Azzopardi, Y. Moshfeghi, M. Halvey, R. S. Alkhawaldeh, K. Balog, E. Di Buccio, D. Ceccarelli, J. M. Fernández-Luna, C. Hull, J. Mannix, S. Palchowdhury, Lucene4IR: Developing information retrieval evaluation resources using Lucene, SIGIR Forum 50 (2017) 58–75.</p>
      <p>[19] P. Yang, H. Fang, J. Lin, Anserini: Reproducible ranking baselines using Lucene, Journal of Data and Information Quality 10 (2018) Article 16.</p>
      <p>[20] T. Strohman, W. B. Croft, Efficient document retrieval in main memory, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), Amsterdam, The Netherlands, 2007, pp. 175–182.</p>
      <p>[21] S. Büttcher, C. L. A. Clarke, Index compression is good, especially for random access, in: Proceedings of the Sixteenth International Conference on Information and Knowledge Management (CIKM 2007), Lisbon, Portugal, 2007, pp. 761–770.</p>
      <p>[22] J. Lin, A. Trotman, The role of index compression in score-at-a-time query evaluation, Information Retrieval 20 (2017) 199–220.</p>
      <p>[23] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, J. Lin, Earlybird: Real-time search at Twitter, in: Proceedings of the 28th International Conference on Data Engineering (ICDE 2012), Washington, D.C., 2012, pp. 1360–1369.</p>
      <p>[24] J. Xin, R. Nogueira, Y. Yu, J. Lin, Early exiting BERT for efficient document ranking, in: Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, 2020, pp. 83–88.</p>
      <p>[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.</p>
      <p>[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019, pp. 8024–8035.</p>
      <p>[27] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, MS MARCO: A human generated MAchine Reading COmprehension dataset, arXiv:1611.09268 (2016).</p>
      <p>[28] P. Yang, H. Fang, J. Lin, Anserini: Enabling the use of Lucene for information retrieval research, in: Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Tokyo, Japan, 2017, pp. 1253–1256.</p>
      <p>[29] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), 2020, pp. 39–48.</p>
      <p>[30] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, in: International Conference on Learning Representations, 2020.</p>
      <p>[31] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, Q. Liu, TinyBERT: Distilling BERT for natural language understanding, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4163–4174.</p>
      <p>[32] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, K. Keutzer, Q-BERT: Hessian based ultra low precision quantization of BERT, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 8815–8821.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>