<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Serverless BM25 Search and BERT Reranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mayank Anand</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiarui Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shane Ding</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ji Xin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jimmy Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The retrieve-rerank pipeline is a well-established architecture for search applications, typically with first-stage retrieval using keyword search followed by reranking with a transformer-based model. In deploying such an architecture in the cloud, developers must devote considerable effort to resource provisioning and management: typically, the goal is to optimize the infrastructure configuration (number and type of server instance) to achieve certain performance characteristics (latency, throughput, etc.) while reducing operating costs. In this paper, we introduce a serverless prototype of the retrieve-rerank pipeline for search using Amazon Web Services (AWS), comprised of BM25 for first-stage retrieval using Lucene followed by reranking with the monoBERT model using Hugging Face Transformers. The advantage of a serverless design is that a cloud provider shoulders the burden of operational management, for example, allocating server instances and scaling with query load. We experimentally show with the popular MS MARCO passage ranking test collection that compared to a traditional server-based deployment, our serverless implementation (1) retains the same level of effectiveness, (2) can reduce average latency by exploiting massive parallelism, and (3) incurs comparable costs if the service is expected to be idle for some fraction of the time. Our implementation is open-sourced at https://github.com/castorini/serverless-bert-reranking.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-stage ranking architectures</kwd>
        <kwd>transformers</kwd>
        <kwd>monoBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>It is a common practice today for search engines to adopt a retrieve–rerank architecture, for example, with keyword search as first-stage retrieval followed by a transformer-based model for reranking [1]. This represents a simple instantiation of a multi-stage retrieval architecture [2] that is widely used in production at scale [3, 4, 5, 6]. In terms of deployments, individual servers (today, typically virtualized instances in the cloud) form the basic building blocks for search applications. Persistent services running on a cluster cooperate to provide the various functionalities that comprise the complete application. To scale out, the standard practice is to adopt a replicated, document-partitioned architecture [7, 8, 9].</p>
      <p>This design has two important implications: First, the services must exist as always-on, long-running processes, ready to handle requests at any moment. This presents a floor on resource consumption, as costs are incurred even when the service is idle. Second, scaling up and down in response to query load must be performed at the granularity of servers, usually through replication and load balancing. Thus, a server-based design means that when the query load is low, even a single server may be over-provisioning; for robust failover, a minimal installation typically runs two servers, additionally contributing to idle (wasted) resources. As the query load increases, to maintain the same level of performance, more server instances need to be provisioned. As query load decreases, these extra instances must then be destroyed. To cope with variable load robustly, developers need to build logic to dynamically spin up and down instances, which may be complex and error prone. While these are solvable engineering challenges, we wonder if there’s a better way. It would be desirable if we could scale up and down seamlessly, without effort—and ideally, all the way down to zero. That is, if there are no incoming queries, can we not have to pay anything?</p>
      <p>Serverless architectures to the rescue! In this paper, we present a serverless prototype of the retrieve–rerank pipeline for search using Amazon Web Services (AWS), comprised of BM25 for first-stage retrieval using Lucene followed by reranking with the monoBERT model using Hugging Face Transformers. We describe our design and present experimental results with the MS MARCO passage ranking test collection. In addition to the ability to completely offload operational management, we believe that there are two scenarios where our serverless design is particularly compelling: (1) a search application that handles low query volumes, where server instances may be idle most of the time, and (2) a search application where incoming requests may be bursty, for example, a service endpoint that is periodically invoked by some other component. While our prototype exhibits a number of limitations at present, it perhaps offers a blueprint for a very different approach to how future search applications may be built.</p>
      <p>DESIRES 2021 – 2nd International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September 15–18, 2021, Padua, Italy.</p>
      <p>Figure 1: Architecture of the serverless retrieve–rerank pipeline. A search client (web browser) issues queries through the API Gateway to the FetchLambda, which calls the SearchLambda (serverless search over Lucene indexes stored in S3) and fetches raw documents from DynamoDB; the RerankLambda (serverless reranking), whose container image is hosted in the Elastic Container Registry, scores the candidates.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>The development of cloud technologies can be characterized as a continuing disaggregation of computing components. In the early days, the cloud meant dynamic, readily-available, easy-to-provision virtual machines. Today, however, there exists a myriad of services, offered by all the major cloud providers, that deliver computing capabilities in a much more fine-grained manner under a pay-as-you-go model.</p>
      <p>One particularly interesting development is the rise of so-called Function-as-a-Service (FaaS) offerings: The developer provides a block of code with well-known entry and exit points, and the cloud provider handles all other aspects of execution—provisioning resources to execute those functions, scaling up and down to match a particular load, etc., all under a per-invocation cost model. Combined with storage- and database-as-a-service offerings, it is now possible to write end-to-end serverless applications where the abstraction of a server is completely absent. To be clear, a serverless design does not mean that we can somehow compute without servers. Rather, it means that the developer no longer needs to explicitly manage server instances. Instead, the cloud provider shoulders the burden of operational management, thus freeing the developer to focus on implementing the application logic.</p>
      <p>Researchers have explored serverless architectures for a variety of applications [10, 11, 12, 13, 14], but most relevant to this work is serverless search: Crane and Lin [15] previously demonstrated a working prototype on Amazon Web Services. In their design, postings lists are stored in the DynamoDB data store and query execution is handled by Lambda (Amazon’s FaaS offering). Their work demonstrated the feasibility of serverless search, but has a number of shortcomings. In particular, their prototype required custom code, which presents barriers to broad adoption. Lin [16] addressed this shortcoming by demonstrating how the open-source Lucene search library can be packaged in a serverless design with minimal custom code to achieve query latencies capable of supporting interactive retrieval. This serverless Lucene prototype forms the starting point of our work, where we further add serverless reranking with transformer-based models to demonstrate a full serverless retrieve–rerank pipeline.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Serverless Architecture</title>
      <p>We present and evaluate a working implementation of a serverless retrieve–rerank pipeline with the architecture shown in Figure 1. In this work, Amazon Web Services (AWS) was selected as the cloud platform: in particular, Lambda, the AWS Function-as-a-Service offering, provides the core building block in our design. Nevertheless, other popular cloud providers offer comparable services that can provide alternative implementations.</p>
      <p>Our design is comprised of two distinct components, serverless search and serverless reranking, described below. The serverless search component is built on the serverless Lucene prototype presented by Lin [16], while the serverless reranking component has not been described anywhere else.</p>
      <sec id="sec-3-1">
        <title>3.1. Serverless Search</title>
        <p>An important desideratum of our work is to build serverless search on the open-source Lucene search library, which has emerged as the de facto platform for developing real-world search applications, typically via OpenSearch, Elasticsearch, Solr, or other components in the broader ecosystem. Other than a few commercial search engine companies that deploy custom infrastructure (for which the serverless design would not be of interest anyway), Lucene dominates the search landscape, with deployments at organizations ranging from Bloomberg to Twitter to Wikipedia. Furthermore, the use of Lucene for academic research has been gaining traction [17, 18, 19]. Thus, to increase the potential for broader impact and adoption, our focus is to leverage as much of the existing Lucene codebase as possible.</p>
        <p>The design of serverless architectures hinges around the decoupling of state from stateless code. In the context of search, “state” is captured by the inverted index and other related data structures, while query evaluation (i.e., postings traversal) can be considered stateless. Thus, it is only natural to package Lucene’s query evaluation code (IndexReader, IndexSearcher, etc.) into a Lambda function—the SearchLambda in Figure 1. The index structures (assumed to have been generated elsewhere) can be stored in S3, Amazon’s persistent object store.</p>
        <p>How do we “connect” Lucene code (running in a Lambda) with index structures stored in S3? Fortunately, Lucene’s Directory interface provides a low-level abstraction for reading index structures (at the level of reading bytes from streams, seeking to different byte offset positions, etc.). Thus, it suffices to provide a custom Directory implementation built with Amazon’s S3 API, and then use this implementation for reading the indexes. Critically, all other parts of the Lucene query evaluation stack remain unchanged—instead of consuming bytes from a local drive (for example), the bytes are now streamed across the datacenter network from S3.</p>
        <p>Given this design, an important issue of course is the performance of (remote network) reads from S3. This is solved by caching; that is, a custom S3Directory implementation reads data into memory and thus the overall design is no different from main-memory search engines, which are quite commonplace today both in the academic literature [20, 21, 22] as well as in production deployments [9, 23]. In order to understand how this caching mechanism interacts with Lambda execution, it is necessary to understand at a high level how Amazon handles FaaS execution.</p>
        <p>Behind the scenes, Amazon is provisioning containers to execute the Lambda; it controls how many containers are running to satisfy a particular load, automatically scales up and down the number of containers, and performs load balancing. Therefore, code execution can either occur on a “warm” instance (i.e., an already running container) or a “cold” instance. For a “warm” instance, query evaluation proceeds without overhead as the index structures have already been loaded into memory; initial execution on a “cold” instance, however, carries the additional startup costs associated with populating the cache. This is not unlike any other in-memory system, and Lambda execution incurs no performance penalty in steady state.</p>
        <p>To complete the architecture shown in Figure 1, there are a few more components to describe: Raw documents are stored in DynamoDB (organized as a simple key–value store). Another Lambda, the FetchLambda, calls the SearchLambda to generate a ranking, and then issues concurrent, batched calls to DynamoDB to retrieve the actual document text (which is needed for reranking). The FetchLambda can be triggered through a REST endpoint provided by the API Gateway. The final product is a service that takes a query and returns a list of documents (complete with their contents) accessible to a search client (e.g., in a web browser).</p>
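        <p>To make this flow concrete, the sketch below shows what a FetchLambda-style handler might look like in Python with boto3. It is a minimal illustration only: the function name, DynamoDB table name, key schema, and payload fields are assumptions for exposition, not the exact interfaces of our prototype, and the real FetchLambda issues its DynamoDB batches concurrently rather than sequentially.</p>
        <preformat>
# Minimal sketch of a FetchLambda-style handler (names and payload
# fields are illustrative assumptions, not the prototype's exact API).
import json

import boto3

lambda_client = boto3.client("lambda")
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    query = event["query"]

    # 1. Synchronously invoke the SearchLambda to obtain a BM25 ranking,
    #    assumed here to return a list of document ids.
    response = lambda_client.invoke(
        FunctionName="SearchLambda",
        Payload=json.dumps({"query": query, "k": 1000}),
    )
    doc_ids = json.loads(response["Payload"].read())["doc_ids"]

    # 2. Retrieve the raw document text from DynamoDB. BatchGetItem accepts
    #    at most 100 keys per call, matching the batching described above;
    #    the prototype issues these batch calls concurrently.
    docs = []
    for start in range(0, len(doc_ids), 100):
        batch = doc_ids[start:start + 100]
        result = dynamodb.batch_get_item(
            RequestItems={
                "msmarco-passages": {  # hypothetical table name
                    "Keys": [{"docid": {"S": d}} for d in batch]
                }
            }
        )
        for item in result["Responses"]["msmarco-passages"]:
            docs.append({"docid": item["docid"]["S"], "text": item["text"]["S"]})

    return {"query": query, "documents": docs}
</preformat>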
        <p>Although in principle the FetchLambda and the SearchLambda can be combined, we have kept the two separate to support future scale out. A partitioned architecture can be implemented by multiple SearchLambda instances, each responsible for its own index partition, in which case the FetchLambda can serve as a central broker.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Serverless Reranking</title>
        <p>In our design, BM25 results from first-stage retrieval are fed to monoBERT, a standard cross-encoder, for reranking. In monoBERT, inference is performed on all candidate documents: an input template comprised of the query and the document text is fed to a fine-tuned BERT model, which produces a relevance score. All candidate documents are then sorted by these scores. Previous studies have shown that this approach is both simple and effective [1]. In this work, we adopt a monoBERT variant called Early Exiting monoBERT [24], which increases the efficiency of the BERT backbone by adding in “early exits” that allow the inference process to terminate early if the model is confident in its decisions. Our implementation is based on Hugging Face Transformers [25] and PyTorch [26].</p>
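        <p>For reference, the Python sketch below shows the standard monoBERT scoring pattern with Hugging Face Transformers and PyTorch: the query and the passage are packed into a single input sequence, and the fine-tuned model’s probability of the “relevant” class is used as the relevance score. The checkpoint path is a placeholder, and the sketch deliberately omits the early-exit machinery of [24].</p>
        <preformat>
# Sketch of monoBERT-style cross-encoder scoring (no early exiting).
# The checkpoint path is a placeholder for a fine-tuned reranker.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "path/to/fine-tuned-monobert"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def score(query, passage):
    # Pack query and passage into one input: [CLS] query [SEP] passage [SEP]
    inputs = tokenizer(query, passage, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of the "relevant" class serves as the relevance score;
    # candidates are then sorted by this score, highest first.
    return torch.softmax(logits, dim=-1)[0, 1].item()
</preformat>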
        <p>Conceptually, serverless ranking is straightforward because the operation is stateless and embarrassingly parallel. We simply need to generate a relevance score for each document, which can proceed independently. The obvious implementation is to wrap model inference in a Lambda, and this is exactly what we do with the RerankLambda, as shown in Figure 1. There are, however, two engineering challenges, discussed below.</p>
        <p>First, neural inference typically requires GPUs to achieve latencies that are sufficiently low to support interactive applications, but AWS Lambda invocations are limited to CPUs. We mitigate this limitation with the early-exit model optimizations described above as well as by exploiting the parallelism provided by the FaaS design (more details below).</p>
        <p>The second challenge concerns the size of the Lambda deployment package. Presently, AWS places a limit of 250 MB, which is insufficient for both our model and the neural inference stack (Hugging Face Transformers and PyTorch). One straightforward solution is to download the reranker model at execution time, directly from S3 to the temporary directory attached to the Lambda instance. However, this solution is inefficient because the model must be downloaded every time a new execution environment is created. Instead, we directly incorporated our fine-tuned model and the entire execution stack into a container built on the AWS base image for Lambda, where the size limit is 10 GB. We then uploaded the image to ECR, Amazon’s fully-managed container registry service, which provides fast and highly-available access. This way, AWS is able to optimize resource provisioning, for example, caching the image closer to where the FaaS invocation occurs. Since the reranking model is already part of the container image, we have eliminated all external dependencies.</p>
        <p>As shown in Figure 1, the reranker service endpoint (RerankLambda) is accessible from the API Gateway via HTTP. It receives a JSON request structure comprising the query and (document id, content) pairs to be reranked, performs model inference, and returns a JSON structure with (document id, score) pairs.</p>
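        <p>A minimal sketch of such a handler is shown below, reusing the score() function from the previous sketch; the JSON field names (“query”, “documents”, “docid”, “text”, “score”) are illustrative and may differ from those in our prototype.</p>
        <preformat>
# Sketch of a RerankLambda-style handler behind the API Gateway.
# Assumes model, tokenizer, and score() as defined in the previous sketch;
# JSON field names are illustrative assumptions.
import json

def handler(event, context):
    # With an API Gateway proxy integration, the request body arrives as a string.
    body = json.loads(event["body"]) if "body" in event else event
    query = body["query"]
    documents = body["documents"]  # list of {"docid": ..., "text": ...}

    results = [
        {"docid": doc["docid"], "score": score(query, doc["text"])}
        for doc in documents
    ]
    results.sort(key=lambda r: r["score"], reverse=True)

    return {"statusCode": 200, "body": json.dumps({"results": results})}
</preformat>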
        <p>In our current prototype, we have completely decoupled serverless search from serverless reranking, but this design imposes some unnecessary data movement: the document contents are returned to the client (across the network) and then sent right back to the RerankLambda for reranking. It would be straightforward to more tightly couple the search and reranking components, but we have currently not done so, primarily because the savings would be modest at best. The additional costs of this extra data transfer are small compared to the costs associated with neural network inference.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>We evaluated our serverless search and reranking prototype using the popular MS MARCO passage ranking test collection [27], which comprises 8.8M documents (passages). Inverted indexes for the collection were built using the Anserini toolkit [28] and then uploaded to S3. Separately, the raw document texts were inserted into DynamoDB using a custom ingestion script.</p>
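      <p>As an illustration, an ingestion script along these lines can use boto3’s batch writer to load the passages; the table name, key schema, and input format below are assumptions rather than the exact script we used.</p>
      <preformat>
# Illustrative ingestion sketch: load MS MARCO passages into DynamoDB.
# Table name, key schema, and input format are assumptions.
import boto3

table = boto3.resource("dynamodb").Table("msmarco-passages")

# collection.tsv: one "docid TAB passage text" pair per line.
with table.batch_writer() as writer:
    with open("collection.tsv", encoding="utf-8") as f:
        for line in f:
            docid, text = line.rstrip("\n").split("\t", 1)
            writer.put_item(Item={"docid": docid, "text": text})
</preformat>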
      <p>In our experiments, we retrieved 1000 hits using BM25 for each query and reranked all of those hits. The SearchLambda returns only the document ids of the retrieval results; since BERT reranking requires the document contents as well, the FetchLambda issues concurrent queries to obtain document contents from DynamoDB in batches of 100 documents.</p>
      <p>To speed up BERT reranking, we issued parallel RerankLambda requests, each with the query and ten candidate documents. That is, each invocation processed ten documents, and therefore to rerank 1000 hits, we had to issue 100 requests in parallel. With Lambda, this amount of parallelism is easy to obtain—after all, this is exactly the point of FaaS. In principle, we could even issue 1000 parallel requests, each scoring a single document, but we did not try this configuration in our experiments.</p>
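      <p>The client-side fan-out can be as simple as the following Python sketch, which invokes the reranker function directly (bypassing the API Gateway) over ten-document chunks using a thread pool; the function name and payload fields are again illustrative assumptions.</p>
      <preformat>
# Sketch of client-side parallel fan-out to the RerankLambda:
# 100 concurrent invocations, each reranking 10 candidate documents.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def rerank_chunk(query, chunk):
    response = lambda_client.invoke(
        FunctionName="RerankLambda",  # hypothetical function name
        Payload=json.dumps({"query": query, "documents": chunk}),
    )
    payload = json.loads(response["Payload"].read())
    return json.loads(payload["body"])["results"]

def rerank(query, documents, chunk_size=10, max_workers=100):
    chunks = [documents[i:i + chunk_size]
              for i in range(0, len(documents), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = pool.map(lambda c: rerank_chunk(query, c), chunks)
    # Merge the partial rankings and sort all candidates by score.
    merged = [r for part in partial_results for r in part]
    return sorted(merged, key=lambda r: r["score"], reverse=True)
</preformat>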
      <p>For the reranking model, Early Exiting monoBERT, we followed the guidance in Xin et al. [24] and selected the configuration in the third row of Table 1 in that paper (hyperparameter values of 1.0 and 0.9). Based on the reported results, with only a 1% drop in MRR, this setting provides a 2.9× acceleration, which means that on average, each inference exits the 12-layer transformer model after around four layers.</p>
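      <p>To illustrate the idea (not our exact implementation), the sketch below runs a BERT encoder layer by layer and stops as soon as an intermediate classifier is sufficiently confident. The per-layer exit heads here are untrained placeholders, and the single confidence threshold is a simplification of the exit criteria used by Xin et al. [24].</p>
      <preformat>
# Illustrative sketch of confidence-based early exiting with BERT.
# The per-layer exit heads are untrained placeholders; a real early-exiting
# reranker would use heads fine-tuned jointly with the backbone [24].
import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

exit_heads = torch.nn.ModuleList(
    [torch.nn.Linear(model.config.hidden_size, 2)
     for _ in range(model.config.num_hidden_layers)]
)

def score_with_early_exit(query, passage, threshold=0.9):
    enc = tokenizer(query, passage, truncation=True, max_length=512,
                    return_tensors="pt")
    with torch.no_grad():
        hidden = model.embeddings(input_ids=enc["input_ids"],
                                  token_type_ids=enc["token_type_ids"])
        mask = model.get_extended_attention_mask(
            enc["attention_mask"], enc["input_ids"].shape, enc["input_ids"].device)
        for i, layer in enumerate(model.encoder.layer):
            hidden = layer(hidden, attention_mask=mask)[0]
            probs = torch.softmax(exit_heads[i](hidden[:, 0]), dim=-1)
            # Exit as soon as the intermediate classifier is confident enough.
            if probs.max().item() &gt;= threshold:
                break
    return probs[0, 1].item(), i + 1  # relevance score and exit layer
</preformat>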
      <p>To evaluate retrieval effectiveness, we ran inference on the entire development set of the MS MARCO passage ranking test collection (6980 queries), using both the serverless prototype and a comparable server-based configuration; MRR@10 scores are shown in Table 1. We encountered minor issues resulting from the encoding of special characters, which translated into very small differences in effectiveness between the two designs (third digit after the decimal point). These issues aside, we can verify that our serverless deployment retains the same level of effectiveness as a server-based design.</p>
      <p>Table 1: Effectiveness comparisons on the development set of the MS MARCO passage ranking test collection.
Configuration | MRR@10
BM25 | 0.18
BM25 + Early Exiting monoBERT | 0.34</p>
      <p>To evaluate retrieval latency and cost, we further performed search and reranking on 100 queries from the development set of the MS MARCO passage ranking test collection to obtain more detailed logging data. We measured component latency as well as end-to-end latency from the client side. Table 2 provides a breakdown in terms of mean, 50th, and 99th percentile latency. As we can observe, end-to-end latency is dominated by serverless reranking, due to the computationally intensive nature of neural inference. Before these experiments, we conducted multiple trials to warm up the SearchLambda and RerankLambda instances.</p>
      <p>Table 2: Component and end-to-end latency and cost based on 100 queries from the development set of the MS MARCO passage ranking test collection. Latency is reported per query, while cost is reported per 100 queries.
Stage | Mean (s/Q) | P50 (s/Q) | P99 (s/Q) | Cost (/100Q)
BM25 | 0.65 | 0.65 | 0.92 | $0.02
DynamoDB Fetch | 0.95 | 0.96 | 1.06 | –
BERT reranking | 11.21 | 10.64 | 17.90 | $15.90
End to end | 12.81 | 12.24 | 19.35 | $16.00
BERT reranking (V100) | 26.21 | 25.52 | 36.64 | $2.20</p>
      <p>Based on the latency measurements, we estimated operating costs (in US dollars). AWS Lambda charges based on the number of function invocations as well as the duration of the function execution. The pricing also reflects the amount of memory allocated to the function; resources beyond CPU are allocated proportionally based on memory. In our case, we allocated the maximum, 10240 MB. At present, the costs are $0.20 per 1M requests and $0.0000166667 for every GB-second of duration; in our case, the per-request charge is negligible. Thus, we estimated compute costs as duration of compute (seconds) × memory allocated (GB) × 0.0000166667. For ease of interpretation, we report costs in terms of 100 queries, shown in Table 2. DynamoDB costs are computed according to a complex set of rules that are hard to directly estimate. However, for our experiments, these costs are negligible compared to the other components.</p>
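      <p>The compute-cost estimate is simple enough to express directly; the helper below implements the formula above, and the example values are for illustration only (they are not meant to reproduce Table 2 exactly).</p>
      <preformat>
# Lambda compute-cost estimate: duration (s) x memory (GB) x $0.0000166667.
# The per-request charge ($0.20 per 1M requests) is ignored as negligible.
GB_SECOND_RATE = 0.0000166667  # USD per GB-second, pricing at time of writing

def lambda_compute_cost(duration_s, memory_mb=10240):
    return duration_s * (memory_mb / 1024) * GB_SECOND_RATE

# Example: one invocation running for 1 second at the maximum 10240 MB
# costs about $0.000167, so 100 such invocations cost about $0.0167.
print(round(lambda_compute_cost(1.0), 6))        # 0.000167
print(round(100 * lambda_compute_cost(1.0), 4))  # 0.0167
</preformat>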
      <p>To compare the cost of our serverless prototype with a standard server-based deployment, we also set up our reranking pipeline on a local server with a single NVIDIA V100 GPU. Here, we focus on BERT reranking latency only, as the contributions from the other components are negligible. Based on these latency measurements, we estimated query costs by looking up the per-hour (on-demand) prices of V100 servers from AWS and Azure, both of which provide similar pricing (we used $3.05 per hour as the basis of our calculations). These results are also reported in Table 2.</p>
      <p>What do we make of these experimental results? The latency for GPU-based reranking is admittedly longer than comparable figures reported in similar experiments [29]. We attribute this to the lack of batch inference in our implementation. With this caveat in mind, we see that serverless BERT reranking is able to achieve lower latency with only CPUs. No doubt some of this gap is due to our sub-optimal implementation, but the more interesting point is that the serverless design allows us to arbitrarily parallelize Lambda invocations. In our setup, we issued 100 parallel RerankLambda requests, each performing inference on ten documents. To reduce latency further, we could increase parallelism even more, for example, dispatching 1000 parallel requests, each performing inference on a single document. Based on the Lambda pricing model, this should have no appreciable impact on cost. Thus, the lower bound on end-to-end latency is in theory limited by CPU-based inference on a single document.</p>
      <p>Nevertheless, it is clear that on a per-query basis, our serverless design is 7–8× more expensive than a traditional server-based deployment. This is of course expected, and there are two components to this gap. First, the per-unit-time cost of serverless components must sum up to more than the cost of a comparable server; otherwise, AWS would be losing money on serverless offerings. Second, decomposing a server-based application into a serverless design introduces friction (e.g., unnecessary data movement and network communication). In our specific case, there is the additional difference between GPU vs. CPU neural inference. However, how much each of these factors contributes to the cost difference is difficult to determine.</p>
      <p>Summarizing the “bottom line” based on our experimental results: If we expect a server to be idle 85–90% of the time, a serverless deployment is more cost efficient. Beyond costs alone, a serverless design exhibits all the potential advantages we have already discussed: minimal operational burden along with seamless scalability down to zero (zero queries, zero cost) and up to arbitrarily large query loads. Whether these tradeoffs are worthwhile, of course, will depend on the exact operational scenario.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Future Work and Conclusions</title>
      <p>At a high level, a serverless design provides operational advantages and cost efficiencies for low-load applications, where in traditional server-based designs the developer must still pay for idle servers. In our experiments, this “breakeven point” is around 85–90% idle, but it is important to note that our experimental results reflect a snapshot at a specific point in time, based on a specific implementation. Below, we discuss some of the factors that may play a role in this calculus, and how they might change over time.</p>
      <p>First, there is a general downward trend in AWS costs over time as computing capabilities advance. However, the relative costs between storage, server instances, and FaaS invocation may not be stable. These differences will impact costs over time, but unfortunately, the developer has little control over pricing.</p>
      <p>Second, costs will be affected by different architecture and implementation choices, including future innovations. The reader may have noticed that in our experiments, end-to-end latency for both the serverless and server-based deployments is still outside the range of what is acceptable for an interactive application. There are various options to reduce latency: we can choose to rerank fewer BM25 results, in which case we are trading off effectiveness for efficiency. As we have already mentioned, Lambda could support greater parallelism (thus lower latency) without increasing costs. For the server-based design, we can also rerank in parallel, but this would increase costs (e.g., requiring a larger server with more GPUs). These considerations seem to be in favor of the serverless design with its per-invocation cost model and seamless scalability.</p>
      <p>Neural inference forms the biggest component of both latency and cost, and there is much research on models that support faster and more efficient inference. Examples include ALBERT [30], TinyBERT [31], and Q-BERT [32], all of which can serve as drop-in replacements for our current reranker. These improvements will benefit both the serverless and server-based designs, and to a large extent we can ride the wave of future innovations in NLP “for free”. The interesting question, however, is whether some of these innovations will differentially impact CPU-based vs. GPU-based inference, or perhaps in the future FaaS offerings might support GPUs. We do not have answers at present, but future explorations of these issues would be interesting.</p>
      <p>To conclude, in this work we built on an existing serverless Lucene prototype to demonstrate a complete retrieve–rerank search architecture with a transformer-based model. Our experiments allow us to characterize the tradeoffs between serverless and server-based designs. No doubt the costs of both approaches will change over time as the economics of cloud computing evolve and as technical innovations lead to efficiency improvements. However, the operational advantages of the serverless design will remain. Whether such an architecture will gain widespread adoption remains to be seen, but at the very least this design challenges how we think about the architecture of search applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada; computational resources were provided by Compute Ontario and Compute Canada.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] R. Nogueira, K. Cho, Passage re-ranking with BERT, arXiv:1901.04085 (2019).</p>
      <p>[2] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, arXiv:2010.06467 (2020).</p>
      <p>[3] J. Pedersen, Query understanding at Bing, in: Industry Track Keynote at the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), Geneva, Switzerland, 2010.</p>
      <p>[4] S. Liu, F. Xiao, W. Ou, L. Si, Cascade ranking for operational e-commerce search, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2017), Halifax, Nova Scotia, Canada, 2017, pp. 1557–1565.</p>
      <p>[5] J.-T. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, L. Yang, Embedding-based retrieval in Facebook search, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2020), 2020, pp. 2553–2561.</p>
      <p>[6] L. Zou, S. Zhang, H. Cai, D. Ma, S. Cheng, D. Shi, Z. Zhu, W. Su, S. Wang, Z. Cheng, D. Yin, Pre-trained language model based ranking in Baidu search, arXiv:2105.11108 (2021).</p>
      <p>[7] L. A. Barroso, J. Dean, U. Hölzle, Web search for a planet: The Google cluster architecture, IEEE Micro 23 (2003) 22–28.</p>
      <p>[8] R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, F. Silvestri, Challenges on distributed web retrieval, in: Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, 2007, pp. 6–20.</p>
      <p>[9] J. Dean, Challenges in building large-scale information retrieval systems, in: Keynote Presentation at the Second ACM International Conference on Web Search and Data Mining (WSDM 2009), Barcelona, Spain, 2009.</p>
      <p>[10] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, B. Recht, Occupy the cloud: Distributed computing for the 99%, in: Proceedings of the 2017 Symposium on Cloud Computing (SoCC 2017), Santa Clara, California, 2017, pp. 445–451.</p>
      <p>[11] Y. Kim, J. Lin, Serverless data analytics with Flint, in: Proceedings of the 2018 IEEE 11th International Conference on Cloud Computing (CLOUD 2018), San Francisco, California, 2018, pp. 451–455.</p>
      <p>[12] J. Hellerstein, J. Faleiro, J. Gonzalez, J. Schleier-Smith, V. Sreekanti, A. Tumanov, C. Wu, Serverless computing: One step forward, two steps back, arXiv:1812.03651 (2018).</p>
      <p>[13] S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, K. Winstein, From laptop to lambda: Outsourcing everyday jobs to thousands of transient functional containers, in: Proceedings of the 2019 USENIX Annual Technical Conference, Renton, Washington, 2019, pp. 475–488.</p>
      <p>[14] V. Sreekanti, C. Wu, X. C. Lin, J. Schleier-Smith, J. M. Faleiro, J. E. Gonzalez, J. M. Hellerstein, A. Tumanov, Cloudburst: Stateful functions-as-a-service, arXiv:2001.04592 (2020).</p>
      <p>[15] M. Crane, J. Lin, An exploration of serverless architectures for information retrieval, in: Proceedings of the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017), Amsterdam, The Netherlands, 2017, pp. 241–244.</p>
      <p>[16] J. Lin, A prototype of serverless Lucene, arXiv:2002.01447 (2020).</p>
      <p>[17] L. Azzopardi, M. Crane, H. Fang, G. Ingersoll, J. Lin, Y. Moshfeghi, H. Scells, P. Yang, G. Zuccon, The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017, in: Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Tokyo, Japan, 2017, pp. 1429–1430.</p>
      <p>[18] L. Azzopardi, Y. Moshfeghi, M. Halvey, R. S. Alkhawaldeh, K. Balog, E. Di Buccio, D. Ceccarelli, J. M. Fernández-Luna, C. Hull, J. Mannix, S. Palchowdhury, Lucene4IR: Developing information retrieval evaluation resources using Lucene, SIGIR Forum 50 (2017) 58–75.</p>
      <p>[19] P. Yang, H. Fang, J. Lin, Anserini: Reproducible ranking baselines using Lucene, Journal of Data and Information Quality 10 (2018) Article 16.</p>
      <p>[20] T. Strohman, W. B. Croft, Efficient document retrieval in main memory, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), Amsterdam, The Netherlands, 2007, pp. 175–182.</p>
      <p>[21] S. Büttcher, C. L. A. Clarke, Index compression is good, especially for random access, in: Proceedings of the Sixteenth International Conference on Information and Knowledge Management (CIKM 2007), Lisbon, Portugal, 2007, pp. 761–770.</p>
      <p>[22] J. Lin, A. Trotman, The role of index compression in score-at-a-time query evaluation, Information Retrieval 20 (2017) 199–220.</p>
      <p>[23] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, J. Lin, Earlybird: Real-time search at Twitter, in: Proceedings of the 28th International Conference on Data Engineering (ICDE 2012), Washington, D.C., 2012, pp. 1360–1369.</p>
      <p>[24] J. Xin, R. Nogueira, Y. Yu, J. Lin, Early exiting BERT for efficient document ranking, in: Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, 2020, pp. 83–88.</p>
      <p>[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.</p>
      <p>[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada, 2019, pp. 8024–8035.</p>
      <p>[27] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, MS MARCO: A human generated MAchine Reading COmprehension dataset, arXiv:1611.09268 (2016).</p>
      <p>[28] P. Yang, H. Fang, J. Lin, Anserini: Enabling the use of Lucene for information retrieval research, in: Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Tokyo, Japan, 2017, pp. 1253–1256.</p>
      <p>[29] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), 2020, pp. 39–48.</p>
      <p>[30] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, in: International Conference on Learning Representations, 2020.</p>
      <p>[31] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, Q. Liu, TinyBERT: Distilling BERT for natural language understanding, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4163–4174.</p>
      <p>[32] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, K. Keutzer, Q-BERT: Hessian based ultra low precision quantization of BERT, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 8815–8821.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>