The Architecture of eBay Search
                Andrew Trotman                                            Jon Degenhardt                              Surya Kallumadi
                University of Otago                                           eBay inc.                            Kansas State University
              andrew@cs.otago.ac.nz                                    jdegenhardt@ebay.com                           surya@ksu.edu

ABSTRACT                                                                              moment of the search (despite the bidding war). If the item sells
The architecture of a large-scale search engine is dependent on                       to one of the bidders, then a fourth user should not be able to find
the application. In the case of a web search engine, the document                     that item even if it was sold only a few milliseconds earlier. And
collection can be considered to be a constantly growing, but slowly                   it is obvious that two users cannot be permitted to buy the same
changing, archive. The task is to find and crawl good pages and                       unique item.
to index them. The rate of change of these pages can be estimated                         Although this appears to be a simple case of database search, it
and the pages re-crawled and re-indexed periodically. Users issue                     is not – it is complex Information Retrieval search. eBay buyers
queries, and the results to those queries are likely to be the same                   can, and do, issue queries along the lines of “blue Converse All
from day to day so can be cached.                                                     Stars”, price restricted to under $10, and ordered on price includ-
   eCommerce, on the other hand, is quite different. The document                     ing shipping. Such a scenario requires a free-text search (to find
collection can change very quickly, the results of queries may differ                 the shoe), a facet restriction (under $10), the use of personalized
from user to user, and crawling may not be required.                                  information about the seller (the shipping from address), the buyer
   In this contribution, we outline the architecture of Cassini, the                  (the shipping to address), and feature data such as the weight of the
eBay search engine. eBay tackles a problem quite different from                       item (necessary to compute the shipping cost). In this example, the
that of a traditional web search and consequently chose to design                     rank order of items is dependent on the seller and the buyer and
and build a search engine from scratch and to customize it to the                     consequently a different user is likely to see a different ordering
nature of the problem.                                                                making result caching near impossible.
                                                                                          Clearly there is much in common between a traditional large
CCS CONCEPTS                                                                          scale web search engine and eBay. Both require distributed search
                                                                                      in order to search the collection quickly. Both require replicated
• Information systems → Environment-specific retrieval;
                                                                                      search in order to maintain high levels of reliability. Both require
KEYWORDS                                                                              multiple data centers to mitigate network latency issues and to
                                                                                      insure against network or power problems.
Product Search, Industrial Information Retrieval.                                         In this position paper we outline the current state of the eBay
ACM Reference format:                                                                 search engine (called Cassini), and outline the reasons behind some
Andrew Trotman, Jon Degenhardt, and Surya Kallumadi. 2017. The Architec-              of the engineering decisions. We start with the search engine itself,
ture of eBay Search. In Proceedings of ACM SIGIR Workshop on eCommerce,               before moving on to briefly discuss ranking (unfortunately the
Tokyo, Japan, August 2017 (SIGIR 2017 eCom), 7 pages.
                                                                                      details of which are proprietary), we then discuss the indexing
https://doi.org/
                                                                                      pipeline. Finally we introduce one of the many difficult and unique
                                                                                      eCommerce challenges eBay faces: updates to sellers of items.
1 INTRODUCTION
Like a traditional web archive, a marketplace such as that at eBay                    2   DISTRUBUTED IR
forms a large search space. At the time of writing there were about a                 There is a practical limit to the number of queries that can be ser-
billion searchable documents (items) available on eBay. Unlike a tra-                 viced on a single machine in a given time period. Results from the
ditional web archive, a marketplace such as eBay sees rapid change                    SIGIR RIGOR .gov2 evaluation showed the top performing system,
to that document collection, with approximately 20% of the collec-                    JASS [8], resolving queries in about 28ms – or only 36 queries per
tion changing every day. Also unlike a web archive, changes in the                    second on short queries. A system like eBay must scale in three
document collection must be propagated to the users immediately.                      dimensions, to increased queries per second, to increased complex-
   An example scenario illustrates this immediacy. An item may                        ity of those queries, and to an increased number of documents
be seeing a bidding war where two or more users are pushing up                        searched as the size of the document collection grows. The natural
the price, while a third user is searching for a “good deal” listing                  solution is a distributed architecture.
results to their search ordered on price from low to high. That third                    The usual distributed solution is to break the document collection
user expects the price they see on their screen to be accurate at the                 into several equal sized chunks called shards, and to search each
Copyright © 2017 by the paper’s authors. Copying permitted for private and academic   shard on a different CPU core (or machine) and in parallel. The
purposes.                                                                             machines that resolve the query are referred to as query nodes and
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.):
Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published     results are then merged at a broker.
at http://ceur-ws.org                                                                    The number of shards that are needed is a function of the query
                                                                                      load and the number of documents. Replication is used to address
                                                                                      an increase in query rate. This results in a grid of machines where
                                                                                      columns are shards and rows are replicas. eBay takes exactly this
SIGIR 2017 eCom, August 2017, Tokyo, Japan                                         Andrew Trotman, Jon Degenhardt, and Surya Kallumadi


approach with the exact configuration changing over time (by man-         the query nodes, to merge them, and to pass the merged list back
ual re-configuration) in order better address the behavioral changes      up the protocol stack.
of the users, advances in hardware, and a growing document col-              eBay supports several different datasets that a user can search.
lection.                                                                  The most obvious is known as Active Item, those items currently for
    It is relatively easy to show that the same document will be found    sale through the site. But it is also possible to search through those
regardless of the sharding strategy. However, it is also relatively       items that have recently been sold, known as Completed Item. To
easy to show that for any ranking function using IDF it is not            allow the searching through either Active Item, Completed Item, or
possible to guarantee that the relative rank order of the documents       the two combined, a second aggregator (the Top Level Aggregator)
will remain the same if the shards are searched in isolation then         is used. This aggregator examines the query to determine which set
merged. Prior work suggests several solutions to this [3] including       of documents should be searched, sends the query to the appropriate
keeping a central vocabulary of terms and frequencies seen across         set, aggregates the results and sends that back up the protocol stack.
all the shards. In a search scenario undergoing rapid change (such as        Figure 1 Illustrates the topology of the eBay search engine.
eBay), keeping a central vocabulary is impractical. A better solution     Queries eventually arrive at the top-level aggregator which then
comes from uncooperative distributed search: probing [10]. With           distributes them to the correct lower level aggregators (based on
probing the search engine first sends a query to a query node             collection to search) which then distributes to query nodes to do
in order to ascertain term statistics, the query is annotated with        the search in a shared nothing (central nothing) architecture. Top-k
those statistics and then distributed back to the query nodes to be       results are then passed up the tree, each time being merged at an
resolved. Using this approach the global statistics are computed on       aggregator before being returned back to the user. Other elements
the fly and IDF is, consequently, correct at the moment of query          of Figure 1 are discussed in other sections.
processing – ideal for a rapidly changing collection. In practice,
probing is expensive as it substantially increases network traffic.
It also introduces an additional level of dependence between the
                                                                          3   RELIABLE IR
broker and the query nodes as the query cannot be resolved at any         Any production environment expected to remain up 24 hours a day
query node until the broker has fully annotated the query.                7 days a week is faced with possible software, network, or hardware
    eBay, like many distributed search engines, prefers the central       failure. This is often addressed through replication. In the case of
nothing approach. That is, the task of the broker is to distribute        eBay, a data center is divided into a logical grid where columns are
the query to the query nodes and to merge the results, while the          shards (query nodes) and rows are replica sets. Various software
task of the query node is to search its shard. In this way the query      components are constantly monitoring the state of the grid and
nodes are independent and there is no cross-system dependency.            reporting any failures.
Although the global term statistics necessary to correctly compute            These replicas are a normal part of the data center – they are
IDF are inaccurate, the statistics stored at each node are a good-        all resolving queries as those queries arrive. But some queries take
enough estimate of the global statistics because each shard is large      longer to resolve than others. A software load balancer is used
(typically tens of millions of documents).                                to ensure replica sets do not form bottlenecks. Each time a query
    There has been considerable research into selective search (also      arrives the software load balancer identifies the most idle replica
known as topical sharding) [6] and collection selection by using          set and forwards the query to that replica set.
source selection approaches such as ReDDE [11]. It is unclear from            Should a replica set fail, the software load balancer stops sending
the literature whether topical sharding is used in any commercial         queries to that set until it eventually returns. Should an individual
search engine, and it is not used at eBay. One consideration is failure   node in a column fail then the low-level aggregator knows how
tolerance. Random sharding appears to offer better resilience to          to fail over to a “spare” replica of that individual node. This same
query node drop-out than topical sharding, as it naturally avoids         approach was discussed by Barroso et al [1] with respect to the
dropping large numbers of relevant documents. Also importantly,           Google search engine.
query streams are topically unpredictable – especially after a world          As with any large-scale web search engine, eBay has a suffi-
event. For example, when a famous musician dies eBay can see a            ciently large number of computers in the grid that it is reasonable
peak in queries for merchandise and memorabilia of that individual.       to expect that at any moment in time at least one is in a failed state.
With random sharding that increased load is spread across all the         Various systems identify this, try and bring those nodes back up,
query nodes, but with topical sharding a single node might be             try to reboot machines, and after going through a checklist, notify
expected to resolve the increased query load on its own – leading         a hardware engineer of a fault.
to an increase in latency and a decrease in quality of service. Recall        Once a node returns to service it rejoins its replica set by no-
that caching is difficult in a system like eBay due to the high rate      tifying the broker of its reincarnation. The broker will then send
of change in the document collection (and individualized results          queries to that node. This is also illustrated in Figure 1 which shows
listings) – and so all queries pass to all query nodes to be resolved.    two lower level aggregators with exact replica sets of the query
The cache does not and cannot compensate for the increased query          nodes for the Active Item collection.
traffic as there is no cache.
    The task of the broker (referred to, at eBay, as an aggregator)
                                                                          4   STRUCTURED IR
is to simply receive a query from further up the protocol stack, to
distribute that to the query nodes, then to receive the results from      By default, the user queries on the eBay site search only the listing
                                                                          titles – users can, however, specify that a search should search not
The Architecture of eBay Search                                                                        SIGIR 2017 eCom, August 2017, Tokyo, Japan


                                                                  Software
                                                                Load Balancer


                                                                    Query
                                                                Transformation


                                                                 Top Level
                                                                 Aggregator

         Active Item                                                                                                             Completed Item
                        Lower Level                              Lower Level                                 Lower Level
                        Aggregator                               Aggregator                                  Aggregator


               Query                  Query             Query                    Query               Query                  Query
                            ∙∙∙                                      ∙∙∙                                         ∙∙∙
               Node                   Node              Node                     Node                Node                   Node


Figure 1: Two levels of aggregator (broker) talking to query nodes (search engine instances) makes it possible to search active
item or completed item or both, all in parallel. However, before any of this load ballancing and query transformation occurs.


only the title, but also the description of the items. This is an ex-
                                                                                          Collection         Position            Region
ample of structured (or fielded) information retrieval. Surprisingly,
there does not appear to be a single best index structure for struc-                      <item>                           1-4
tured search. Trotman [12] suggests building a single DOM-like                            <title>                          1-3
super-tree over all documents, numbering the nodes, and annotat-
ing postings lists with node numbers. Guo et. al. [5] stores a Dewey                      Converse              1
encoded path directly in the postings lists. Both these approaches                        All                   2
involve annotating each posting in a postings list.
    Clarke et. al. [2] introduce region algebra, and in doing so show                     Star                  3
that the structural information can be stored separate from the post-                     </title>
ings lists. If the postings lists encode word positions from the start
of the collection (rather than document id and term frequency ) and                       <description>                    4-4
if the document start and end positions are stored in a separate list                     Blue                  4
then it is possible, given any posting, to determine which document
it came from. If a document contains several different structures                         </description>
then the start and end of these structures can be stored in lists,                        </item>
one for each structure. Given a term restriction (e.g. “Converse in
                                                                                          <item>                           5-7
Description”), the search engine loads the postings for the term,
filters the list using the field’s positional information, and converts                   <title>                          5-6
the result into a document id. This approach is particularly appli-
                                                                                          Low                   5
cable if the document collection is XML and the task is focused
retrieval [13].                                                                           Tops                  6
    Figure 2 illustrates region algebra. It shows two documents
                                                                                          </title>
(items) in XML, the term positions, and the regions are shown
at the starts of each element. Given a term position and the regions                      <description>                    7-7
it is possible to determine whether or not the term lies in the region.
                                                                                          Red                   7
For example, “red” occurs at position 7, and the title regions are 1-3
and 5-6 so “red” does not occur in the title of any document.                             </description>
    eBay sometimes uses region algebra albeit encoded differently                         </item>
from Figure 2. As the task is document retrieval (not focused re-
trieval), the postings lists store document id, term frequency, and
word positions within that document. The region lists encode doc-               Figure 2: A collection of 2 structured documents, the term
ument id, field frequency, and the start and length of that field in            positions and the regions (start and end positions) used for
that document. There are several reasons for this encoding, the                 region algebra.
most obvious of which is that by storing <start, length> pairs the
SIGIR 2017 eCom, August 2017, Tokyo, Japan                                          Andrew Trotman, Jon Degenhardt, and Surya Kallumadi


difficulty of resolving self-containing regions (e.g. a <b> element
                                                                                               Term                 Postings
inside a <b> element) is diminished. Another reason is efficiency.
When a query contains a field restricted phrase it can be resolved by                          all                    <1,1>
first resolving the phrase query then resolving the field restriction.
                                                                                               blue                   <1,1>
Region algebra, although provably correct, does not scale well for
semi-structured document collections. For a field that occurs many                             converse               <1,1>
times in a document (such as <p> in an HTML document) the lists
that contain the start and end positions of that field can become                              low                    <2,1>
very long – longer than the postings lists for even the most frequent                          red                    <2,1>
terms. Processing these long lists takes time and can overwhelm
the overall processing time of a query. This does not happen within                            star                   <1,1>
eBay because documents are structured (not semi-structured) and
                                                                                               tops                   <2,1>
so structures occur at most once per document.
    An obvious and far simpler approach to structured information                              title:all              <1,1>
retrieval is to build an inverted index over each separate field in the
                                                                                               title:converse         <1,1>
collection. This approach also does not scale, this time if there are
a large number of fields within a document. If the user is able to                             title:low              <2,1>
choose any combination of fields to search over then each inverted
list for each field must be examined – which is computationally                                title:star             <1,1>
expensive.                                                                                     title:tops             <2,1>
    The eBay indexer builds separate inverted indexes over a small
number of fields as well as an index over the entire document,                                 description:blue       <1,1>
this time without word positions. A user is then able to restrict a
                                                                                               description:red        <2.1>
query term to one of these fields or to the entire collection, but
never to more than one of these (and they cannot use phrases).
This approach has the advantage of decreasing the complexity of            Figure 3: The fielded index for document collection in Fig-
a field restricted search because the number of postings in a field        ure 2, assuming each document is separated by <item> tags.
is typically smaller than in the entire collection, and smaller again      Postings are represented <docid, tf>.
because word positions are not stored.
    Figure 3 illustrates the two documents from Figure 2 each with
two fields (title and description) sharing a single vocabulary. If
                                                                           into the result retrieval process. For example, for the query “2006
the user searches for “star” anywhere in the collection then the
                                                                           Golf”, the search system should understand that the search intent
postings for “star” are examined. If the user searches for “star”
                                                                           is more likely to be cars and less likely to be sports.
in the title then the postings for “title:star” are examined. If the
                                                                              The objective of the rewrite is to encapsulate user intent so as to
user searches for “star” in the description then, since there is no
                                                                           retrieve a different set of relevant results to rank. This is especially
“description:star” in the vocabulary, there is a vocabulary mismatch
                                                                           useful when there is a long tail of items. Rewrite is achieved in two
and no documents will be found. Recall that field restricted phrase
                                                                           ways 1) Query expansion and 2) Query relaxation. Query expan-
search can be performed using region algebra.
                                                                           sion broadens the intent of the query by adding additional relevant
                                                                           tokens. Query relaxation removes tokens from the query, the ob-
5    QUERY TRANSFORMATION                                                  jective of query relaxation is to make the queries less restrictive
When a query arrives at the eBay search engine it is immediately           thus increasing recall. Of course, a query might have all its terms
sent to a number of subsystems that are responsible for re-writing         relaxed and then be expanded with a completely new set of terms.
it in several different ways. These query transformations can be              Query suggestion is another important aspect of all modern
either explicit or implicit. In explicit query transformation, such as     search engines. At eBay, the objective of query suggestion is to
spelling correction, the user sees the transformed query. In implicit      explicitly nudge the user to construct a query that is aligned with
query transformation, the user does not. Search engines use query          their intent. Query suggestions are created by mining the query logs
transformation as a way to increase the recall of queries while            and looking at reformulations. The transient nature of products at
trying to minimize any negative effect on precision. But it is an          eBay adds additional complexity. We do not want to suggest queries
especially critical aspect of eCommerce search as it affects the           for items that are no longer available because doing so results in a
purchasing behavior of the user. If the transformation results in          negative user experience.
items that are not relevant then it reflects badly on the site. Equally,
if the transformation results in retrieving too few items then it          5.1    Automatic Category Navigation (DSBE)
affects revenues and reduces the purchasing options for the user.          The behavioral data generated by users searching and buying items
    Query transformation often happens through a process of query          on eBay is a valuable source of query disambiguation data (amongst
understanding then query rewriting. The objective of query under-          other things). For example, if a user enters “bread” as a query, eBay
standing is to identify the meaning of a query and incorporate that        knows, from that behavioral data, that the user is more likely to be
The Architecture of eBay Search                                                                   SIGIR 2017 eCom, August 2017, Tokyo, Japan


interested in bakeware than music. This knowledge is mined from             boosted decision trees is learned and used at the top level of a multi-
the query logs.                                                             stage ranking (i.e. learning to rank) function.
   When a user enters a query it is logged, when they click on items,           There is no reason why these trees should be constant across
that too is logged. Eventually, when the user purchases an item             queries, users, or locations – and they are not. There are commonly
that is logged. In an eCommerce environment it is more difficult to         many different ranking functions in the system simultaneously.
determine this click chain that in a web environment. A web session         They may serve different search applications or different query
is often considered to be a continuous sequence of requests termi-          segments within the same search application. A/B tests are an
nating after a period of about 10-15 minutes of user inactivity [4].        especially important case. To perform an A/B test of a ranking
However, in an eCommerce environment a user might search for                function its necessary to “tie” the user to a particular function for
an item (for example, a car), watch that item for a period of time,         the duration of the test.
periodically returning to examine characteristics of the item (for              These capabilities are provided by the Profile Lookup Service.
example, the type of stereo), compare that item to several others,          It allows eBay search to define and house many different ranking
then eventually purchase it (or not). The determination of a session        functions. It allows them to be retrieved by search application, user
on an eCommerce site is beyond the scope of this contribution.              (for A/B tests), and other dimensions of the search environment. It
   Periodically the query and click streams are mined. One purpose          also serves to isolate front-end applications from the details of the
of the mining task is to determine, for the most frequent queries,          recall and ranking algorithms.
how to disambiguate those queries. The difficulty within eBay is
that any user can list almost any item for sale, and many of these are
                                                                            6    RANKING
rare or one-off items – so it is often not feasible to associate frequent
queries with items via a product catalog alone. However, it is often        Although we are not at liberty to release details of the eBay rank-
feasible to use associations with structured data components, such          ing function, some details are either standard practice or already
as category, brand, etc. that can be further associated with items.         released.
The eBay query disambiguation engine takes this approach.                       By default, only the titles of the documents are searched. How-
   One of the more valuable associations is the most likely cate-           ever, the queries are re-written to include two parts: first, those
gory that will satisfy the query. For the example, “bread” is more          terms useful in determining a suitable set of documents to rank
often bakeware than music so the results list is limited to just the        (the recall base); and second, those terms used in ranking. In this
bakeware section of eBay. Another is any special considerations on          way terms that are necessary to identify a relevant document (but
how to rank the query. That is, the query itself carries details of         frequent in that set of documents) can be ignored in ranking, and
the ranking function to be applied by the search engine.                    terms that are frequent in the collection but not in the recall base
                                                                            can be used for ranking. These terms are mined from user logs.
                                                                                Multi-stage raking is used. The final stage uses a forest of gradi-
5.2    Query Rewrite (SIBE)                                                 ent boosted decision trees to perform the final ordering. These trees
A more complex mining task is that of determining user intent               (and lower levels too) draw from textural features of the document
rather than simply disambiguating the query. This, too, is mined            (such as term proximity), as well as features the user has supplied
from user behavior logs and the document collection. For each               (such as price). They are deduced from the customer and seller (lo-
frequent query a set of purchases and a set of non-purchases forms          cation information used for postage computation), seller reputation,
a dataset that is analogous to a relevance feedback [9] dataset.            and over 500 features eBay stores about an item (ranging from color
   It is clear from relevance feedback experiments [14] that a new          to condition to size). Much of this feature information is stored in
query can be formed from the original and the labeled data, and             a forward index (or docdata).
that query will better fulfills the user’s information need than the            Unusual in academic search but common in industrial search,
original query. At eBay these rewritten queries are stored keyed on         eBay search avoids returning no results whenever possible. If a
the user’s original query and a quick lookup returns the replacement        query is about to return no results, or a small number of results, then
query.                                                                      the null-and-low subsystem is notified. There are several reasons
   Examination of these re-written queries shows that they tend             recall might be low; for example it might be due to the absence of
to include boosting of synonyms and stemming variants, but also             products in the repository (the query is highly specific and the item
down-ranking of ambiguous and noise terms in order to clarify               does not exist) or due to a misalignment of vocabulary between
the query. For some queries these re-writes can be large in size –          the user and the product (for example, “apple tablet” rather than
not infrequently exceeding hundreds of kilobytes. Unfortunately,            “iPad”).
these long queries take a considerable amount of the computational              The null-and-low component takes the query, re-writes it to a
resources to resolve, a necessary trade off in order to successfully        new query with similar semantics, and sends it back to the search
fulfill the user’s needs.                                                   engine. This process repeats until either a suitable number of results
                                                                            is identified, or the process gives up. We cannot release the details
                                                                            of the null-and-low subsystems, however a user can observe the
5.3    Profile Lookup Service (PLS)                                         eBay spelling checker at work if a single nonsense word is entered.
The eBay ranking functions are stored externally from the main              They can also observe the dropping of query terms if a nonsense
source code of the search engine. They are often generated through          word and a non-nonsense word are entered as the query. Section 5.1
data mining and machine learning. For example, a forest of gradient         introduced automatic category restriction for queries. Experiments
SIGIR 2017 eCom, August 2017, Tokyo, Japan                                           Andrew Trotman, Jon Degenhardt, and Surya Kallumadi


showed that this has several effects on the search engine. First, by        serving node, that node is defragmented as the new bulk index lies
reducing the size of the recall base the query latency goes down.           consecutively in memory and all old patches can be purged. Other
Second, by removing noise from the results list the precision goes          approaches are possible, for example, a periodic process could move
up.                                                                         postings lists about in memory in a de-fragmentation pass.
                                                                                Another important and pragmatic role of bulk indexes is to
                                                                            provide a vehicle for large updates affecting many documents, but
7    INDEXING
                                                                            that do not have freshness requirements offered by mini index.
Maintaining an inverted index over a rapidly changing document              Distributing large updates via bulk indexes has strong efficiency
collection has received little attention in the past. Indeed, it is often   benefits.
assumed that indexing efficiency is unimportant as indexing only                The 8-hour bulk reload also ensures that three times a day all
occurs once. This is not the case at eBay or with any other rapidly         machines in the grid are synchronised. When a new machine is
changing collection that must be kept up-to-date (for example a             added to the grid it simply needs to wait for the bulk to arrive to
news site).                                                                 be in sync.
   Lester et al. [7] offer three update strategies for maintaining              If a machine fails (for whatever reason) and comes back on-line,
an inverted index, in-place, re-merge, and re-build. The in-place           it can simply compare time stamps of which indexes are stored
approach accumulates changes to postings lists in memory then               locally against a global registry of indexes to work out what needs
merges those into a master index, keeping the postings list at the          to be done to re-sync – and that registry need not be concerned
same location if it still fits – and if it does not then it appends it to   with indexes prior to the previous bulk.
the end of the index. Re-merge simply merges the existing index                 An alternative strategy of keeping up-to-date is to build the
with a set of diffs to create a new index. Re-build discards the old        index on the query serving node in real-time as changes occur. This
index and re-indexes everything from scratch. Their experiments             approach results in a substantial increase in the amount of network
were on-disk and so the efficiency results are not applicable to an         traffic seen at the query serving nodes. It is also difficult to re-sync
in-memory index.                                                            in the case of node failure. That is, even if real-time changes are
   At eBay a combination of approaches is used. Every 8 hours               maintained at each node, a master index of that node must be build
a complete re-build of the gold-standard document collection is             and maintained so that it can be distributed for a restart.
performed. This “bulk” is then shipped to the index serving grid,               The index building process is, itself, relatively straightforward.
trickling from row to row so that, over time, the entire grid has           A large HBase cluster is used as the gold-standard document collec-
a fresh index. On a more frequent basis (minutes), all changes to           tion. That no-sql database is built and maintained through a series
the gold-standard document collection are computed and shipped              of processes that listen to the eBay internal item messaging pipeline
to the grid as a “mini” index. These minis contain newly indexed            for messages about changes to items and sellers, and updates are
documents and a list of deleted documents (document changes are             applied to HBase accordingly. Although not the topic of this contri-
deletes followed by adds). There is, clearly, a short period when both      bution, different processes in different parts of eBay are responsible
bulks and minis are being built at the same time. This is resolved at       for determining different characteristics of an item; for example, it
the query nodes and in distribution where it is ensured that indexes        is automatically classified, spam detection algorithms are applied,
arrive in the correct locations and in the correct order.                   language translation occurs, and more. Updates to a document can
   The in-place index merging approach of Lester et al. is used to          occur at any time and HBase ensures these updates are applied
combine bulks and minis, except that the index resides in memory            in the correct order even if they arrive out of order. This HBase
not on disk. One at a time each postings list that requires append-         database is not the eBay-wide gold-standard (that is elsewhere), it
ing is pulled from memory. If it contains a posting for a deleted           is the search department’s gold-standard.
document then that posting is removed from the list, then the                   HBase is built on top of Hadoop, the Apache open-source map-
new postings are added to the end. The postings list is put back            reduce platform. Periodically a map-reduce process kicks-off, the
in memory in-place if it fits and if not then elsewhere in mem-             indexer is shipped to the data, indexing occurs in a distributed
ory. The consequence of this approach is that postings lists that           manner, then the index parts are reduced to form a number of
are not added to are also not purged of deleted documents. There            indexes that matches the topology of the query serving grid. These
are several reasons for choosing this approach. One is that it is           are then picked up by the index distribution agent and moved to
not necessary to exactly re-parse the document in order to deter-           the query serving nodes.
mine which postings lists need purging. A second is that it reduces
fragmentation because fewer lists are modified.
   As a postings list might contain a posting for a document that           8    SELLERS, ITEMS, INDEXES, RANKING
has been deleted it is also necessary to keep track of deleted items.       The ranking function draws from many hundreds of different fea-
Documents that have been deleted between the construction of the            tures. Those features are derived from the item, the seller, and the
mini and the time of the search are removed from the results list           buyer. Features of the buyer are determined high in the protocol
further up the protocol stack.                                              stack and are constant for each item. Features of the item are stored
   Each merge of a mini results in a little more memory fragmenta-          in the inverted index and forward index. It is not clear where to
tion as some postings lists no-longer fit back in memory where they         store the features of the seller in this eCommerce search problem.
came from. This is one of the key reasons for an 8-hour bulk re-               If the seller features are stored in HBase along with the item
indexing. Each time a bulk index is loaded the memory on the query          then any change to a seller would require a change to all items
The Architecture of eBay Search                                                                                SIGIR 2017 eCom, August 2017, Tokyo, Japan

                                                                                                 Seller Forward Index
                                                                                                   Feature                Value

                                                                                                   feedback_score         146
                                Item Forward Index
                                                                                                   followers              1
                                  Feature     Value               Seller Lookup
                                  seller      3                     id       ptr
                                                                                                   Feature                Value
                                  price       3.00                  1
                                                                                                   feedback_score         1170
                                                                    2
                                                                                                   followers              10
                                  Feature     Value
                                                                    3
                                  seller      2
                                                                                                   Feature                Value
                                  price       9.00
                                                                                                   feedback_score         3898

                                                                                                   followers              90


               Figure 4: Seller data is stored in a forward index pointed to indirectly from the item forward index.


that seller is selling – which can be many hundreds of thousands                   REFERENCES
of items (especially for sellers of Compact Disks). If the seller data              [1] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. 2003. Web Search for a Planet:
is “crossed” with the item data at indexing time then a change to                       The Google Cluster Architecture. IEEE Micro 23, 2 (March 2003), 22–28.
                                                                                    [2] Charles L. A. Clarke, G. V. Cormack, and F. J. Burkowski. 1995. An Algebra for
the seller data would require a substantial update to the index with                    Structured Text Search and a Framework for its Implementation. Comput. J. 38,
a mini. Such a change in seller data might occur if, for example,                       1 (1995), 43. https://doi.org/10.1093/comjnl/38.1.43
                                                                                    [3] O. de Kretser, A. Moffat, T. Shimmin, and J. Zobel. 1998. Methodologies for
the seller declares that they are on holiday and so the expected                        distributed information retrieval. In Proceedings. 18th International Conference on
shipping date of all their items is to be postponed a short time.                       Distributed Computing Systems (Cat. No.98CB36183). 66–73.
    As part of the indexing process eBay builds two indexes, one for                [4] Ayse Goker and Daqing He. 2000. Analysing Web Search Logs to Determine
                                                                                        Session Boundaries for User-Oriented Learning. In AH.
the items and one for the sellers. Each item has a seller id associated             [5] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. 2003.
with it, but the reverse it not true (to find all items of a given seller,              XRANK: Ranked Keyword Search over XML Documents. In Proceedings of the
an item search for the seller’s id is performed). To extract the seller                 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD
                                                                                        ’03). ACM, New York, NY, USA, 16–27.
features during ranking a seller lookup is performed on the seller’s                [6] Yubin Kim, Jamie Callan, J. Shane Culpepper, and Alistair Moffat. 2017. Efficient
id and the seller’s doc-data is examined.                                               Distributed Selective Search. Inf. Retr. 20, 3 (June 2017), 221–252.
                                                                                    [7] Nicholas Lester, Justin Zobel, and Hugh E. Williams. 2004. In-place Versus Re-
    Figure 4 illustrates this process. The document ids from the                        build Versus Re-merge: Index Maintenance Strategies for Text Retrieval Systems.
inverted index point to the item doc-data (forward index) which                         In Proceedings of the 27th Australasian Conference on Computer Science - Volume 26
contains a seller id which is used to find the seller forward index                     (ACSC ’04). Australian Computer Society, Inc., Darlinghurst, Australia, Australia,
                                                                                        15–23.
which contains the seller’s features used in ranking.                               [8] Jimmy Lin and Andrew Trotman. 2015. Anytime Ranking for Impact-Ordered
    The seller index only contains data for sellers who have items                      Indexes. In Proceedings of the 2015 International Conference on The Theory of
for sale at the moment the index is constructed. The alternative                        Information Retrieval (ICTIR ’15). ACM, New York, NY, USA, 301–304.
                                                                                    [9] J. J. Rocchio. 1971. Relevance Feedback in Information Retrieval.
is to include all potential sellers, but since all users are potential             [10] Milad Shokouhi, Falk Scholer, and Justin Zobel. 2006. Sample Sizes for Query
sellers this table would be large and contain many unused entries                       Probing in Uncooperative Distributed Information Retrieval. In Proceedings of the
                                                                                        8th Asia-Pacific Web Conference on Frontiers of WWW Research and Development
(which is space inefficient). It is clear that the closure of the set of                (APWeb’06). Springer-Verlag, Berlin, Heidelberg, 63–75.
current sellers can be constructed from the closure of the set of                  [11] Luo Si and Jamie Callan. 2003. Relevant Document Distribution Estimation
items, and indeed that is how it is constructed.                                        Method for Resource Selection. In Proceedings of the 26th Annual International
                                                                                        ACM SIGIR Conference on Research and Development in Informaion Retrieval (SIGIR
                                                                                        ’03). ACM, New York, NY, USA, 298–305.
                                                                                   [12] Andrew Trotman. 2004. Searching Structured Documents. Inf. Process. Manage.
9    CONCLUSION                                                                         40, 4 (May 2004), 619–632.
                                                                                   [13] Andrew Trotman, Nils Pharo, and Miro Lehtonen. 2007. XML-IR Users and Use
In this position paper we outlined much of how the eBay search                          Cases. Springer Berlin Heidelberg, Berlin, Heidelberg, 400–412.
engine currently functions along with the engineering reasons for                  [14] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to
it working in this way.                                                                 BM25 and Language Models Examined. In Proceedings of the 2014 Australasian
                                                                                        Document Computing Symposium (ADCS ’14). ACM, New York, NY, USA, Article
   Although many standard techniques are used (inverted indexes,                        58, 8 pages.
learning-to-rank, etc.), it is not possible to simply take an off-the-
shelf web search engine and apply it to eCommerce as other stan-
dard techniques (caching, index-once philosophy) are simply un-
tenable. This, along with the rapid rate of change are the reasons
eBay chose to build a search engine – and this paper outlines many
of the engineering decisions taken.