<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Beyond algorithms: Ranking at scale at Booking.com</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Themis Mavridis∗</string-name>
          <email>themistoklis.mavridis@booking.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Mende∗</string-name>
          <email>andrew.mende@booking.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soraya Hausl∗</string-name>
          <email>soraya.hausl@booking.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Pagano∗</string-name>
          <email>roberto.pagano@booking.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Booking.com</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Booking.com is one of the world's largest online travel companies, where millions of customers find their accommodation through an extensive and diverse list of properties including hotels, apartments, guest houses, and more. Our customers rely heavily on the ranking of search results in order to find a property that satisfies their preferences and criteria. While most of the Machine Learning literature focuses on the algorithmic aspect of ranking, little has been published about the intricacies of developing a Machine Learning-powered ranking that delivers business impact in a commercial environment. In this work, we describe the various challenges we faced while developing a large-scale Machine Learning-powered ranking, including signal definition and feature representation complexity in the modelling aspect, information leakage and speed of innovation in experimentation for online evaluation, and response time minimization around serving at scale in production. We present an overview of the business gains of the step changes in the Machine Learning ranking of Booking.com. Our main conclusion is that there are multiple aspects that need to be carefully addressed for the creation of a Machine Learned ranker in a large-scale commercial setting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Machine learning; • General
and reference → Experimentation.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Booking.com is one of the world’s largest online travel companies
where millions of customers search for their accommodation every
day. Millions of accommodations of various types are listed on our
platform and available to customers all around the globe. In the most
popular destinations there are thousands of properties available for
every search, ranging from private apartments, which somebody
rents out while going on vacation, to traditional chain hotels. It is
not reasonably possible for our customers to exhaustively review
such a large list of options. Filtering, performed by the minority
of our customers, can help narrow down the options to hundreds
which is still a large amount. Therefore, our customers rely heavily
on our ranking for finding the accommodation that is a good fit for
them.</p>
      <p>This work describes many challenges, decisions, and trade-offs
that creating and utilizing a large-scale Machine Learned ranker
entails.</p>
      <p>∗All authors contributed equally to this research.</p>
    </sec>
    <sec id="sec-3">
      <title>MODELLING</title>
      <p>In order to create a ranker that satisfies the requirements of our
customers on our scale, we need to learn over billions of user
sessions. Training on billions of rows, with a complex feature space
consisting of a large number of user, context and accommodation
features is challenging and requires handling terabytes of data.</p>
      <p>In two-sided marketplaces ranking has several objectives and
typically a number of constraints. Besides surfacing
accommodations best suited to the customer, it is important to satisfy the
needs of accommodation providers by delivering a consistent level
of exposure, while also giving our partners tools to influence their
visibility on the platform according to their sales strategy. For the
purpose of this paper we will focus on optimisation for one
objective which is by far the most important: catering to customers’
preferences.
</p>
    </sec>
    <sec id="sec-4">
      <title>Signals Definition</title>
      <p>In the course of a search session our users perform a variety of
actions. The actions that are most important from the ranking
perspective are: search results impressions, clicks, reservations,
stays and reviews of a property. The proper definition of the positive
and negative signals derived from these actions is of paramount
importance.</p>
      <p>There are four aspects that can help us decide which signals to
use in our models: 1) relation to user satisfaction/dissatisfaction, 2)
amount of data points, 3) delay to observe the action, and 4) bias of the
action. Analyses of different choices of signals against these aspects
are presented in Table 1 for positive and Table 2 for negative ones.</p>
      <p>Positive signals: In our experience, utilizing reservations as
positive signals provides us with data points which have a strong
relation to user satisfaction, a good amount of observations and a
small observation delay. Utilizing property clicks, or property clicks
with at least a certain dwell time, as secondary positive signals can
help the model learn that there could be multiple options that satisfy
the requirements of a user to a certain extent, despite the user
reserving only one specific accommodation. As Table 1 notes, clicking a
property or spending time browsing does not necessarily translate
to customer satisfaction, reservations can lack information (e.g. a
potential cancellation) since they could be for a future stay, stays
lack information about the experience of the customers, and there is
a bias on which customers choose to provide us with a review.</p>
      <p>Negative signals: If a user clicked on at least one listing, we can
assume that all search results above the clicked one were examined
by the user and deliberately skipped. This creates a large and quickly
observed set of data points indicating a weak fit for this user. If a
user did not click any listing, their search result impressions are still
useful as negative signals for the model to generalize and reduce
the bias towards users who booked or clicked a property. As Table 2
notes, having a property displayed in the viewport with no user action
does not necessarily translate to a negative signal, the assumption
that users examined all the properties above the one they clicked does
not always hold, spending time browsing does not necessarily translate
to customer dissatisfaction, and reviews are biased by which customers
write them.</p>
    </sec>
    <sec id="sec-5">
      <title>Feature Representation</title>
      <p>A Machine Learned model that aims to serve as a ranker needs to
learn information about both the preferences of our customers, and
the diverse qualities of the properties providing accommodation.
The representation of the feature spaces plays an important role
for learning an optimal ranker. The major challenges around the
features are:</p>
      <p>Seasonality: The ranker needs to perform well across different
seasons. Explicit categorical features about seasonality from the
search context need to be considered (e.g. day of the week, month).
Moreover, since there are seasonal effects on the number of people
travelling, the values of the property performance features, e.g. the
amount of reservations over the last 6 months, vary over time.
We can represent these features as relative categorical variables,
e.g. computing daily N percentiles of the reservations over the
last 6 months and representing the feature by the corresponding
percentile.</p>
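<p>The relative categorical representation above can be sketched as follows; a minimal illustration, assuming the day's raw reservation counts are available (all names and values are illustrative, not the production feature pipeline):</p>

```python
# Sketch: map a property's raw 6-month reservation count to its percentile
# bucket within today's distribution, so the feature stays comparable
# across seasons (values are illustrative).
import bisect

def percentile_boundaries(values, n_buckets=10):
    """Compute bucket boundaries from today's raw values."""
    ordered = sorted(values)
    step = len(ordered) / n_buckets
    return [ordered[int(i * step)] for i in range(1, n_buckets)]

def to_bucket(value, boundaries):
    """Return the percentile bucket a raw value falls into (0-based)."""
    return bisect.bisect_right(boundaries, value)

# Today's reservation counts over the last 6 months for five properties:
counts = [3, 40, 55, 120, 800]
bounds = percentile_boundaries(counts, n_buckets=5)
print([to_bucket(c, bounds) for c in counts])  # [0, 1, 2, 3, 4]
```

<p>Recomputing the boundaries daily keeps the bucket of each property relative to the current seasonal distribution rather than its absolute count.</p>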
      <p>Non-linearity: It is important for a ranker to be able to learn
non-linear relations between the target and the features, e.g. a
reservation and the review score of a property. This can be addressed
from the perspective of algorithm choice e.g. using Gradient
Boosting. The same challenge can also be addressed from the standpoint
of feature engineering, where we can represent features as
categorical and learn non-linear relations even in linear algorithms.</p>
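<p>A minimal sketch of the feature engineering option: bucketize a numeric feature, such as the review score, into one-hot categorical bins so that even a linear model can fit a non-linear relation (the bin edges here are illustrative assumptions):</p>

```python
# Sketch: turn a 0-10 review score into a one-hot vector over bins, letting
# a linear model assign an independent weight to each range.
def review_score_onehot(score, edges=(6.0, 7.0, 8.0, 9.0)):
    """Map a review score to a one-hot vector over len(edges)+1 bins."""
    bin_idx = sum(1 for e in edges if score >= e)  # count edges passed
    onehot = [0] * (len(edges) + 1)
    onehot[bin_idx] = 1
    return onehot

print(review_score_onehot(8.6))  # [0, 0, 0, 1, 0]
print(review_score_onehot(5.0))  # [1, 0, 0, 0, 0]
```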
      <p>Complexity: The feature space of accommodations is complex.
Explicit human made features about a property do not necessarily
describe this space in the best possible way. For example, star rating
is a well-known characteristic of hotels, yet hotels are only a part
of our inventory and the definition of stars does not exist for
apartments. Projecting the properties in a latent feature space to enhance
their feature representation can be useful. For instance, this can be
achieved by training a Word2vec model over the sequence of user
actions (impressions, clicks, bookings).</p>
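<p>A minimal sketch of the data preparation this implies: grouping logged user actions into ordered "sentences" of property identifiers, the input format a Word2vec implementation (e.g. gensim's) expects. The log schema and identifiers here are illustrative assumptions:</p>

```python
# Sketch: turn an action log of (user_id, timestamp, property_id) rows into
# one chronological "sentence" of property ids per user.
from itertools import groupby
from operator import itemgetter

actions = [
    ("u1", 1, "p_901"), ("u1", 2, "p_417"), ("u1", 3, "p_901"),
    ("u2", 1, "p_233"), ("u2", 2, "p_901"),
]

def action_sentences(log):
    """One sentence per user: property ids in chronological order."""
    ordered = sorted(log, key=itemgetter(0, 1))
    return [[pid for _, _, pid in group]
            for _, group in groupby(ordered, key=itemgetter(0))]

sentences = action_sentences(actions)
print(sentences)  # [['p_901', 'p_417', 'p_901'], ['p_233', 'p_901']]
# These sentences would then be fed to a Word2vec model to learn latent
# property embeddings.
```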
      <p>
        Locality: The ranker should be able to learn user preferences
per destination. This can be achieved by either training one ranker
per destination or cluster of destinations, or alternatively by
introducing destination as a feature. However, we need to keep in mind
that if the dimensionality of the destination feature is large, the
“Curse of dimensionality” can be an issue for some algorithms [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Moreover, detailed information about the location of a property
within a destination should be considered as well. Representing
the location in neighborhoods could be a useful method, however
the number of neighborhoods around the world is very large, and
the feature is sparse since a given neighborhood appears only in
one specific destination. Transforming the feature to whether a
property belongs to the top N most popular neighborhoods in a
destination can alleviate these issues.
      </p>
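<p>The top-N neighborhood transformation described above can be sketched as follows; a simplified illustration with invented neighborhood names and booking data:</p>

```python
# Sketch: replace a sparse neighborhood id with a membership flag for the
# destination's top-N most popular neighborhoods (data is illustrative).
from collections import Counter

def in_top_n_neighbourhoods(property_neighbourhood, destination_bookings, n=2):
    """True if the property's neighborhood ranks in the destination's top n."""
    popular = [nb for nb, _ in Counter(destination_bookings).most_common(n)]
    return property_neighbourhood in popular

bookings = ["Jordaan", "Jordaan", "De Pijp", "Centrum", "De Pijp", "De Pijp"]
print(in_top_n_neighbourhoods("Jordaan", bookings, n=2))   # True
print(in_top_n_neighbourhoods("Centrum", bookings, n=2))   # False
```

<p>The resulting binary (or small categorical) feature is dense and comparable across destinations, unlike the raw neighborhood identifier.</p>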
      <p>In-session adjustment: The ranker needs to be able to adjust
while our customers search for their desired destination and
accommodation. We can capture explicit behavioral characteristics such
as clicks on certain components, and utilize them as features to infer
a user’s interest in specific properties. For example we can compute
similarity scores of a ranked property with the latest properties
clicked by the user in the current search and leverage those as a
feature.
</p>
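<p>A minimal sketch of such an in-session similarity feature, assuming latent vectors for properties already exist (e.g. from the latent representation discussed earlier); the vectors here are illustrative:</p>

```python
# Sketch: cosine similarity between a candidate property's embedding and
# the mean embedding of properties clicked in the current session.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def session_similarity(candidate, clicked):
    """Average the clicked-property vectors, then compare to the candidate."""
    mean = [sum(dims) / len(clicked) for dims in zip(*clicked)]
    return cosine(candidate, mean)

clicked_vecs = [[1.0, 0.0], [0.8, 0.2]]       # latest in-session clicks
print(round(session_similarity([0.9, 0.1], clicked_vecs), 3))  # 1.0
```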
    </sec>
    <sec id="sec-6">
      <title>Ranking Biases</title>
      <p>There are biases in the data that need to be addressed in order for
a ranker to learn the user preferences objectively. Otherwise, the
ranker may be biased towards certain types of properties which
will result in a suboptimal experience for the customers. Three
prominent biases that we aim to reduce are:</p>
      <p>Impression bias: In our logs we capture our customers
examining various properties. Our customers do not perform an exhaustive
and equally distributed examination of all the available properties
in a given destination. Therefore, the propensity of a property to
be examined by a customer is influenced by the Machine Learned
rankers historically used.</p>
      <p>Position bias: Customers tend to click and reserve properties
that are placed higher in the search results since there is an inherent
bias that items closer to the top are “better”.</p>
      <p>User bias: Different users display different behavior in terms
of the amount of property impressions, clicks and bookings. This
leads to bias towards some users that create more training data
points, while the ranker should be performing equally well for all
users.</p>
      <p>
        There have been multiple publications on the topic of bias
reduction over the years, all revolving around various methods to
perform propensity estimation and utilize it to learn
a ranker with reduced biases, such as treating the biases as
counterfactual and weighting the actions by the Inverse Propensity
Weighting [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], query level Inverse Propensity Weighting [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or
representing the position trust bias as position dependent noise [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
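<p>A minimal sketch of the Inverse Propensity Weighting idea referenced above, assuming position-examination propensities have already been estimated by a separate position-bias model (all values here are illustrative):</p>

```python
# Sketch: up-weight clicks observed at low-examination positions so the
# training loss is less distorted by position bias.
def ipw_weight(position, propensities):
    """Weight a click by the inverse of its position's examination propensity."""
    return 1.0 / propensities[position]

# Assumed probability that a user examines each position:
examine_prob = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35}

clicks = [(1, "p_a"), (3, "p_b")]   # (position, property) pairs from logs
weights = [ipw_weight(pos, examine_prob) for pos, _ in clicks]
print(weights)  # [1.0, 2.0]: a click at position 3 counts twice as much
```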
    </sec>
    <sec id="sec-7">
      <title>Model Creation</title>
      <p>The ranker needs to learn the user preferences over thousands of
destinations and millions of diverse accommodations.</p>
      <p>
        Large-scale training: Training a model on billions of rows
and terabytes of data poses some challenges. Online Learning [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
techniques can help us learn models on such sizes of training data
by enabling us to use out-of-core incremental learning [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and
benefit from a reduced memory footprint. Utilizing the Hashing
Trick [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] also enables us to have variable size feature vectors, and
limit the memory footprint. If the dataset is larger than the disk
or the training time is longer than the expected time to release
a re-trained ranker, distributed Machine Learning can be utilized
to speed up the training process. There are intricacies that come
with distributed Machine Learning e.g. node synchronization, load
balancing, fault tolerance or parameter averaging. Distributed
Machine Learning is an area which is actively developing, yet there
seems to be no consensus about which methods are the best to
use [
        <xref ref-type="bibr" rid="ref1 ref14 ref18 ref8">1, 8, 14, 18</xref>
        ].
      </p>
      <p>
        Generalization: It is really important to evaluate the ability of
the ranker to generalize. Since the goal is to train a ranker using
historical observations and produce the optimal ranking for our
customers in the future, out-of-time validation should be used. One
of the methods to achieve this is walk-forward validation, which
allows us to evaluate a ranker across diferent time periods [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Furthermore, it is essential to note that during training we
utilize the properties displayed to the users to learn the optimal
ranking, yet during testing we need to include all the properties
that were available at every point in time in order to simulate what
would happen in reality at inference time and calculate offline
evaluation metrics such as precision, recall, MRR and NDCG that can
provide us with a more accurate health check of a ranker.
      </p>
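<p>The walk-forward validation scheme mentioned above can be sketched as follows; a minimal illustration with weekly periods (the period granularity and labels are assumptions):</p>

```python
# Sketch of walk-forward validation: train on an expanding window of past
# periods, evaluate on the period that follows, then move forward in time.
def walk_forward_splits(periods, min_train=2):
    """Yield (train_periods, test_period) pairs moving forward in time."""
    return [(periods[:i], periods[i]) for i in range(min_train, len(periods))]

weeks = ["w1", "w2", "w3", "w4"]
for train, test in walk_forward_splits(weeks):
    print(train, "->", test)
# ['w1', 'w2'] -> w3
# ['w1', 'w2', 'w3'] -> w4
```

<p>Each split evaluates the ranker strictly out-of-time, mirroring how a model trained on historical observations will be used on future traffic.</p>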
      <p>Multi-scenario evaluation: Evaluating the ranker in different
scenarios is important. Because we are utilizing behavioral features
and property clicks to learn the optimal ranking, we should evaluate
a ranker not only through global ranking metrics but also pay
attention to customer subgroups. The ranker could be performing very
well for customers that have already interacted with our platform
and inadequately for customers that just landed.</p>
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTATION</title>
      <p>
        Online experimentation in the form of Randomized Controlled
Trials (RCTs) has long played a central role in the development
of products at Booking.com [
        <xref ref-type="bibr" rid="ref12 ref15">12, 15</xref>
        ]. It enables us to measure the
business value of a new product, feature or model. Ranking is a
self-learning system where a model that is used to sort the items also
influences the data that it is trained on. This poses a data leakage
challenge in our experimentation that needs to be carefully handled.
Moreover, considering that offline evaluation is only
a health check and does not necessarily correlate with business
value [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], evaluation of multiple rankers relies on efficient online
experimentation, and poses speed and sensitivity challenges which,
if successfully addressed, can accelerate product development.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Leakage</title>
      <p>In a RCT with our users as the experimental units, we randomly split
the audience into a Control and a Treatment group, which are exposed
to two different rankers. There are two data leakage challenges
in such a setup: re-training leakage and feature leakage. They can
both compromise the observations of our trial if not addressed.</p>
      <p>Re-training leakage: If the rankers are re-trained during the
trial and the training data is derived from all the logged data of our
platform regardless of the group it is generated from, there will be
a “leakage” of customer preferences between the two rankers, and
this might lead the two rankers to recommend similar properties
with similar sorting. Therefore, Isolated Feedback Loops of the data
logging of all the groups in a RCT are necessary.</p>
      <p>Feature leakage: Even if the rankers are not re-trained during
trial, in case they rely on time-dynamic features such as property
performance in the last day e.g. amount of reservations, there will
be “leakage” of the effect of one ranker on the other. In this case,
there are two options: 1) “Freezing” of the data sources to contain
only pre-experimental data points. This solution requires minimal
infrastructure investment from a data logging standpoint, yet it
assumes that the world is stationary, which is not acceptable when
a ranker relies on features that reflect very recent performance
such as amount of reservations over the last day, and 2) Isolated
Feedback Loops per group of the RCT which provide us with a clear
separation of the data, yet create complexity on how we utilize the
data collected.
</p>
    </sec>
    <sec id="sec-10">
      <title>Speed of Innovation - Experimentation via</title>
    </sec>
    <sec id="sec-11">
      <title>Interleaving</title>
      <p>Another challenge regarding online testing in Ranking is the speed
of experimentation. When experimenting with Ranking algorithms
we cannot just run multiple tests at the same time as the
independence assumption - that simultaneously running experiments do
not impact the outcome of each other - does not hold true.</p>
      <p>
        We have tried various setups in the past such as executing
multivariate experiments, splitting traffic into equal random subgroups
and running an experiment per subgroup or leveraging indirect
metrics in RCTs as an early indicator of customer satisfaction.
However, none of these are desirable as we would either have to accept
a considerable loss in power or deal with increased complexity.
Based on reports from other companies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] on promising results,
we investigated how interleaving experimentation can be used in
Search Ranking at Booking.com.
      </p>
      <p>
        With interleaving (IL) we do not randomise on the customer;
instead, we construct our search results by merging the results from
two rankers applying team draft IL [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and use each position in
the search ranking as a unit of randomisation. We then measure if
our users interact more with search results coming from ranking A
or ranking B [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
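<p>A simplified sketch of team draft interleaving as described above. Note a real implementation randomises which ranker drafts first in each round; here team A always starts, and all identifiers are illustrative:</p>

```python
# Sketch of team-draft interleaving: two rankers alternately "draft" their
# highest remaining result into the merged list, and each position remembers
# which team contributed it, so clicks can be credited to a ranker.
def team_draft(rank_a, rank_b):
    total = len(set(rank_a) | set(rank_b))
    merged, teams, used = [], [], set()
    turn_a = True
    while len(merged) != total:
        source, team = (rank_a, "A") if turn_a else (rank_b, "B")
        for item in source:
            if item not in used:
                merged.append(item)
                teams.append(team)
                used.add(item)
                break
        turn_a = not turn_a
    return merged, teams

merged, teams = team_draft(["h1", "h2", "h3"], ["h2", "h4", "h1"])
print(merged)  # ['h1', 'h2', 'h3', 'h4']
print(teams)   # ['A', 'B', 'A', 'B']
```

<p>A click on a result is then attributed as a "win" for the team that drafted it, which is what the pairwise comparison below measures.</p>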
      <p>
        Figure 1 visualises the relation between the results of the RCT
experiment and the corresponding pairwise IL trials. The RCT
compares control and treatment based on an important business
metric and the related confidence intervals. In the IL trials we
examine user preference based on comparison of relative wins (i.e.
which algorithm the user interaction is attributed to) of one ranker
to the other and we use bootstrapping to compute a corresponding
distribution [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
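<p>A minimal sketch of bootstrapping the win-rate distribution, assuming per-query credits have already been attributed to ranker A or B (the data and resample count are illustrative):</p>

```python
# Sketch: resample per-query win credits with replacement to obtain a
# distribution of A's relative win rate over B.
import random

def bootstrap_win_rate(credits, n_resamples=1000, seed=7):
    """Resample queries with replacement; return the win-rate distribution."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_resamples):
        sample = rng.choices(credits, k=len(credits))
        rates.append(sample.count("A") / len(sample))
    return rates

credits = ["A", "A", "B", "A", "B", "A", "A", "B"]  # which ranker won each query
rates = sorted(bootstrap_win_rate(credits))
print(rates[25], rates[975])  # an approximate 95% interval of the win rate
```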
      <p>The results of our IL trial have been very promising: IL tests
were very sensitive to differences in user preferences - in general
we saw potential for a 10-100x speedup due to the high sensitivity
of results. Thus we can run IL on less traffic and test more variants
simultaneously, leading to a substantial decrease in runtime
requirements. Based on statistical characteristics only, we could reduce our
experiment runtime even further to just a few hours. However, as
we want to observe a global representation of our traffic and not
make decisions based on preferences of users in only one timezone,
we run online tests for at least a day to obtain sufficient exposure.</p>
      <p>It is important to note the limitations of interleaving: the
detection of a conclusive ranker preference regarding a metric in IL does
not necessarily translate into a significant change of the metric
in the RCT experiment, given our usual expected effect, and IL also
cannot be used to estimate the effect size of an RCT test. Hence we
can only use IL to make a preselection between algorithms, and the
maximal risk is that we do not choose the most promising model,
but rather a suboptimal one, and test that one in the following RCT
test.</p>
      <p>Why do we see diverging results and why is interleaving
so much more sensitive? If we segment our customers on search
results by strength of intent, we will always have a large share of
low-intent users, who most probably will not book in this session
(think of early-stage exploration), some smaller share of high-intent
users, who will book a room no matter what (think of a business
traveller who definitely needs an accommodation in the next few
days), and a very small part of customers, who are ready to book
but only if they are convinced that they found the right offer. The
low-intent users are inevitable noise in all cases; they produce
searches and page views, but no bookings. In an RCT experiment,
the high-intent users will also create noise, because they are highly
likely to book even if we show accommodations in a suboptimal
order, effectively providing no evidence of whether the ranking is good
or bad. We only get relevant signals from the extremely sparse
group of users whose decision still has the freedom to go one way or
the other. Ranking interleaving provides an opportunity
to also gather signals about the quality of a ranker from the
high-intent users, as they will still pick the property that satisfies their
requirements best. This can substantially increase the number of
meaningful votes providing feedback on which ranking algorithm
is better.</p>
    </sec>
    <sec id="sec-12">
      <title>SERVING</title>
      <p>
        A Machine Learned ranker needs to serve millions of users looking
for accommodations every day. Each customer may perform several
searches to choose the best accommodation over a set of tens of
thousands of properties in some cases. We support searches up
to a year ahead of the search date, making the problem of fetching
the availability in real time for all property, check-in/check-out,
policy and room combinations a challenge of its own. Response time
minimization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and efficient feature retrieval are critical for the
application of real time Machine Learning based ranking.
      </p>
      <sec id="sec-12-1">
        <title>Response Time Minimization</title>
        <p>
          A way to tackle this problem is to parallelize the computation of the availability and ranking
score of each property, thus building a top N list (i.e. the first page)
to be returned in a distributed fashion by using sharding [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], where
each shard has a subset of properties to be ranked.
        </p>
        <p>Let N be the number of properties to be returned in the search
results. Each shard utilizes a reverse index that maps destinations
to properties, selects the properties that belong to the destination,
computes the availability and the ranking score on its subset of
properties and returns its (partial) top N to the coordinator. The
coordinator aggregates the partial top N lists from each shard into
the final top N which is presented in the search results. Since each
shard is responsible for a subset of properties, it becomes feasible to
pre-compute and materialize the availability information for all the
possible combinations of check-in/out date, room and policy. This
mechanism is very convenient for the deployment of a Machine
Learned model, although it comes with the limitation that each
shard does not have complete visibility to which properties are
available globally.</p>
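<p>The coordinator's aggregation step can be sketched as follows; a minimal illustration in which each shard has already produced its partial top N of (score, property) pairs (scores and identifiers are illustrative):</p>

```python
# Sketch of the coordinator step: each shard scores only its own subset of
# properties and returns a partial top N; the coordinator merges these
# partial lists and keeps the global top N.
import heapq

def coordinator_top_n(partial_results, n):
    """Merge per-shard partial (score, property) lists into the final top N."""
    candidates = [pair for shard in partial_results for pair in shard]
    return heapq.nlargest(n, candidates)  # sorted by descending score

shard_1 = [(0.9, "p1"), (0.7, "p2")]   # partial top 2 from shard 1
shard_2 = [(0.8, "p9"), (0.6, "p5")]   # partial top 2 from shard 2
print(coordinator_top_n([shard_1, shard_2], n=2))  # [(0.9, 'p1'), (0.8, 'p9')]
```

<p>Because each shard only needs to return its best N candidates, the coordinator never has to examine the full destination inventory.</p>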
        <p>Feature Deployment: Feature deployment for models which
are called in real time poses another challenge: all features a model
was trained with must be available at prediction time. Our model
relies on customer, search and property features. Customer features
can be thought of as two types: historical, such as past bookings,
and in-session, such as the latest property clicked by the user in the
current search. A solution to the productionization of the historical
features is to have a key-value storage and retrieve them as soon as
the customer comes to the website. In-session and search features
can be kept in memory with some expiration timeout.</p>
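<p>The in-memory store with an expiration timeout can be sketched as follows; a simplified illustration (a production store would also bound memory and evict proactively, and all names here are assumptions):</p>

```python
# Sketch: keep in-session features in memory with a TTL, evicting expired
# entries lazily on read.
import time

class SessionFeatureStore:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.data = {}

    def put(self, session_id, features):
        self.data[session_id] = (time.monotonic(), features)

    def get(self, session_id):
        entry = self.data.get(session_id)
        if entry is None:
            return None
        stored_at, features = entry
        if time.monotonic() - stored_at > self.ttl:
            del self.data[session_id]   # expired: evict and report a miss
            return None
        return features

store = SessionFeatureStore(ttl_seconds=1800)
store.put("s42", {"last_clicked": "p_901"})
print(store.get("s42"))  # {'last_clicked': 'p_901'}
```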
        <p>Property features, on the other hand, are either static features
such as location-related features or performance features which are
updated in regular intervals. Passing the features on the request is
ineficient since the features do not depend on the user context or
actions. They can be productionized via in-memory lookup tables.</p>
        <p>Another challenge to the feature productionization comes from
the latency and memory footprint of the feature weights look-up.
One solution is to have a key-value storage where the key is the
feature and the value is its associated weight. However, the feature
space can be large and the memory footprint of this key value data
structure can be problematic. Utilizing the hashing trick also in
production alleviates this problem.</p>
        <p>
          Hashing is fast (on the scale of nanoseconds) and adds minimal
performance overhead to the look-up process, and using a hash of a
pre-specified length provides us with the benefit of limited memory
footprint of the feature space in production, while having negligible
impact on model performance if the hash space size is chosen
correctly [
          <xref ref-type="bibr" rid="ref21 ref3">3, 21</xref>
          ].
        </p>
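<p>A minimal sketch of the hashing trick applied to the production weight look-up, assuming a linear model; the hash space size and feature names are illustrative:</p>

```python
# Sketch: feature names hash into a fixed-size weight table, bounding the
# memory footprint regardless of how many distinct features exist.
import zlib

NUM_BUCKETS = 2 ** 18                 # assumed hash space size
weights = [0.0] * NUM_BUCKETS         # fixed-size weight table

def bucket(feature):
    """Map a feature name to a table index with a fast hash."""
    return zlib.crc32(feature.encode()) % NUM_BUCKETS

def score(features):
    """Linear score over hashed feature weights."""
    return sum(weights[bucket(name)] * value for name, value in features.items())

weights[bucket("country=NL")] = 0.7    # weight learned offline
print(score({"country=NL": 2.0}))     # 1.4
```

<p>Distinct features may collide in the same bucket, but with an appropriately sized hash space the impact on model performance is negligible, as noted above.</p>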
      </sec>
    </sec>
    <sec id="sec-13">
      <title>QUANTIFYING BUSINESS IMPACT</title>
      <p>This section presents the relative improvement in business value
that the introduction of each step change in the Machine
Learning ranking yielded, which are summarized in Figure 2. Each bar
represents the improvement measured in RCTs against the control
group. The first bar represents the improvement that the initial
Machine Learned model brought. We take this improvement as a
reference value (i.e. it has value of 1) in order to indicate the scale
of the improvements of the following step changes we introduced
in ranking.</p>
      <p>[Figure 2: Relative improvement of an important business metric
for each step change: Baseline Machine Learning Model, Item-to-item
Ranking, Contextual Item-to-item Ranking, Behavioural and Contextual
Item-to-item Ranking, and Full Scale Machine Learned Ranker.]</p>
      <p>
The next step change was the utilization of item-to-item
similarity derived from historical co-occurrence data to adjust the
search results of every user within their session based on their
property pageviews and was working in conjunction with the Machine
Learned model to provide the users with an improved user
experience while they were browsing our platform. The introduction
of the user context in the item-to-item similarity ranking helped
us take a step forward and provide in-session personalisation for
different traveller profiles. Following that, utilizing the user
behavioral features (e.g. reading reviews of a property or time since last
click on a property) helped us understand the user preferences even
better while our customers browse our platform.</p>
      <p>As a result, over the years, the ranking of the search results was
an ensemble consisting of various Machine Learning components
that were incrementally added through RCTs. These components
had been continuously optimized to improve customer experience
until we experienced a plateau in the business gains. It was clear
that the next step change was the incorporation of all the learnings
from the introduction of these components to a single full-scale
Machine Learned ranker. The last bar represents the improvement
in business value brought by the introduction of the first full-scale
Machine Learned ranker which replaced all the various components
that were working in conjunction before, and compared to the
previous step changes, the gains were substantial.
</p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION</title>
      <p>In this work we shared the challenges we faced, and viable
solutions to them, while developing Machine Learned rankers in a
large-scale e-commerce setting. We covered three aspects beyond
the choice of algorithm: Modelling, Experimentation and Serving.
Several challenges arise in modelling, such as defining which
signals to utilize based on the various observed user actions, how to
represent features and treat biases, how to train Machine Learned
rankers on large datasets and how to evaluate such rankers.
Validating a ranker through RCTs poses the problem of information
leakage between control and treatment, which requires special care.
We also demonstrated the power of interleaving in reducing noise
and increasing power in the online evaluation of rankers, despite its
limitation in providing quantitative information about the business
metrics of interest. Serving a Machine Learned model on a high-traffic
platform poses challenges in terms of latency and model
deployment, which often require carefully crafted engineering solutions.
Lastly, we showed the business value of our step changes in ranking
over time. Our main conclusion is that there are multiple important
aspects beyond the choice of algorithm that need to be carefully
addressed for the introduction of a Machine Learned ranker in a
commercial setup. We hope this work will inspire practitioners
applying Machine Learning to ranking problems and serve as
guidance as they address the challenges we faced.</p>
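      <p>To make the interleaving idea above concrete, the following is a minimal sketch of team-draft interleaving, a standard variant of the technique; it is an illustration only, not Booking.com's actual implementation, and the function names, example rankings, and credit-assignment helper are all hypothetical. Two rankers take turns "drafting" their best not-yet-shown item into a single merged list, and each slot remembers which ranker contributed it so that later clicks can be credited to A or B.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Merge two rankings into one list shown to the user, tagging each
    slot with the ranker ('A' or 'B') that drafted it."""
    interleaved, teams = [], []
    seen = set()
    count = {"A": 0, "B": 0}
    pools = {"A": ranking_a, "B": ranking_b}

    def next_unseen(label):
        # Highest-ranked item from this ranker not yet in the merged list.
        for item in pools[label]:
            if item not in seen:
                return item
        return None

    while True:
        # The team with fewer picks drafts next; a coin flip breaks ties.
        if count["A"] < count["B"]:
            order = ["A", "B"]
        elif count["B"] < count["A"]:
            order = ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for label in order:
            item = next_unseen(label)
            if item is not None:
                seen.add(item)
                interleaved.append(item)
                teams.append(label)
                count[label] += 1
                break
        else:
            # Both rankers exhausted: the merged list is complete.
            return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Count clicks per team; the team with more credited clicks wins
    this impression."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins
```

Because each user sees a single merged list containing both rankers' preferences, the comparison is within-user rather than between user groups, which is what reduces variance relative to a classic RCT; the trade-off noted above is that click-based credit says which ranker wins, not by how much on business metrics.</p>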
    </sec>
    <sec id="sec-15">
      <title>ACKNOWLEDGMENTS</title>
      <p>We thank all the colleagues who have worked hard on improving
the ranking at Booking.com over the years.</p>
    </sec>
  </body>
</article>