=Paper= {{Paper |id=Vol-2311/paper_14 |storemode=property |title=The Architecture of eBay Search |pdfUrl=https://ceur-ws.org/Vol-2311/paper_14.pdf |volume=Vol-2311 |authors=Andrew Trotman,Jon Degenhardt,Surya Kallumadi |dblpUrl=https://dblp.org/rec/conf/sigir/TrotmanDK17 }} ==The Architecture of eBay Search== https://ceur-ws.org/Vol-2311/paper_14.pdf
                                           The Architecture of eBay Search
                Andrew Trotman                                            Jon Degenhardt                              Surya Kallumadi
                University of Otago                                           eBay inc.                            Kansas State University
              andrew@cs.otago.ac.nz                                    jdegenhardt@ebay.com                           surya@ksu.edu

ABSTRACT                                                                              moment of the search (despite the bidding war). If the item sells
The architecture of a large-scale search engine is dependent on                       to one of the bidders, then a fourth user should not be able to find
the application. In the case of a web search engine, the document                     that item even if it was sold only a few milliseconds earlier. And
collection can be considered to be a constantly growing, but slowly                   it is obvious that two users cannot be permitted to buy the same
changing, archive. The task is to find and crawl good pages and                       unique item.
to index them. The rate of change of these pages can be estimated                         Although this appears to be a simple case of database search, it
and the pages re-crawled and re-indexed periodically. Users issue                     is not – it is complex Information Retrieval search. eBay buyers
queries, and the results to those queries are likely to be the same                   can, and do, issue queries along the lines of “blue Converse All
from day to day so can be cached.                                                     Stars”, price restricted to under $10, and ordered on price includ-
   eCommerce, on the other hand, is quite different. The document                     ing shipping. Such a scenario requires a free-text search (to find
collection can change very quickly, the results of queries may differ                 the shoe), a facet restriction (under $10), the use of personalized
from user to user, and crawling may not be required.                                  information about the seller (the shipping from address), the buyer
   In this contribution, we outline the architecture of Cassini, the                  (the shipping to address), and feature data such as the weight of the
eBay search engine. eBay tackles a problem quite different from                       item (necessary to compute the shipping cost). In this example, the
that of a traditional web search and consequently chose to design                     rank order of items is dependent on the seller and the buyer and
and build a search engine from scratch and to customize it to the                     consequently a different user is likely to see a different ordering
nature of the problem.                                                                making result caching near impossible.
                                                                                          Clearly there is much in common between a traditional large
CCS CONCEPTS                                                                          scale web search engine and eBay. Both require distributed search
                                                                                      in order to search the collection quickly. Both require replicated
• Information systems → Environment-specific retrieval;
                                                                                      search in order to maintain high levels of reliability. Both require
KEYWORDS                                                                              multiple data centers to mitigate network latency issues and to
                                                                                      insure against network or power problems.
Product Search, Industrial Information Retrieval.                                         In this position paper we outline the current state of the eBay
ACM Reference format:                                                                 search engine (called Cassini), and outline the reasons behind some
Andrew Trotman, Jon Degenhardt, and Surya Kallumadi. 2017. The Architec-              of the engineering decisions. We start with the search engine itself,
ture of eBay Search. In Proceedings of ACM SIGIR Workshop on eCommerce,               before moving on to briefly discuss ranking (unfortunately the
Tokyo, Japan, August 2017 (SIGIR 2017 eCom), 7 pages.
                                                                                      details of which are proprietary), we then discuss the indexing
https://doi.org/
                                                                                      pipeline. Finally we introduce one of the many difficult and unique
                                                                                      eCommerce challenges eBay faces: updates to sellers of items.
1 INTRODUCTION
Like a traditional web archive, a marketplace such as that at eBay                    2   DISTRUBUTED IR
forms a large search space. At the time of writing there were about a                 There is a practical limit to the number of queries that can be ser-
billion searchable documents (items) available on eBay. Unlike a tra-                 viced on a single machine in a given time period. Results from the
ditional web archive, a marketplace such as eBay sees rapid change                    SIGIR RIGOR .gov2 evaluation showed the top performing system,
to that document collection, with approximately 20% of the collec-                    JASS [8], resolving queries in about 28ms – or only 36 queries per
tion changing every day. Also unlike a web archive, changes in the                    second on short queries. A system like eBay must scale in three
document collection must be propagated to the users immediately.                      dimensions, to increased queries per second, to increased complex-
   An example scenario illustrates this immediacy. An item may                        ity of those queries, and to an increased number of documents
be seeing a bidding war where two or more users are pushing up                        searched as the size of the document collection grows. The natural
the price, while a third user is searching for a “good deal” listing                  solution is a distributed architecture.
results to their search ordered on price from low to high. That third                    The usual distributed solution is to break the document collection
user expects the price they see on their screen to be accurate at the                 into several equal sized chunks called shards, and to search each
Copyright © 2017 by the paper’s authors. Copying permitted for private and academic   shard on a different CPU core (or machine) and in parallel. The
purposes.                                                                             machines that resolve the query are referred to as query nodes and
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.):
Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published     results are then merged at a broker.
at http://ceur-ws.org                                                                    The number of shards that are needed is a function of the query
                                                                                      load and the number of documents. Replication is used to address
                                                                                      an increase in query rate. This results in a grid of machines where
                                                                                      columns are shards and rows are replicas. eBay takes exactly this
SIGIR 2017 eCom, August 2017, Tokyo, Japan                                         Andrew Trotman, Jon Degenhardt, and Surya Kallumadi


approach with the exact configuration changing over time (by man-         the query nodes, to merge them, and to pass the merged list back
ual re-configuration) in order better address the behavioral changes      up the protocol stack.
of the users, advances in hardware, and a growing document col-              eBay supports several different datasets that a user can search.
lection.                                                                  The most obvious is known as Active Item, those items currently for
    It is relatively easy to show that the same document will be found    sale through the site. But it is also possible to search through those
regardless of the sharding strategy. However, it is also relatively       items that have recently been sold, known as Completed Item. To
easy to show that for any ranking function using IDF it is not            allow the searching through either Active Item, Completed Item, or
possible to guarantee that the relative rank order of the documents       the two combined, a second aggregator (the Top Level Aggregator)
will remain the same if the shards are searched in isolation then         is used. This aggregator examines the query to determine which set
merged. Prior work suggests several solutions to this [3] including       of documents should be searched, sends the query to the appropriate
keeping a central vocabulary of terms and frequencies seen across         set, aggregates the results and sends that back up the protocol stack.
all the shards. In a search scenario undergoing rapid change (such as        Figure 1 Illustrates the topology of the eBay search engine.
eBay), keeping a central vocabulary is impractical. A better solution     Queries eventually arrive at the top-level aggregator which then
comes from uncooperative distributed search: probing [10]. With           distributes them to the correct lower level aggregators (based on
probing the search engine first sends a query to a query node             collection to search) which then distributes to query nodes to do
in order to ascertain term statistics, the query is annotated with        the search in a shared nothing (central nothing) architecture. Top-k
those statistics and then distributed back to the query nodes to be       results are then passed up the tree, each time being merged at an
resolved. Using this approach the global statistics are computed on       aggregator before being returned back to the user. Other elements
the fly and IDF is, consequently, correct at the moment of query          of Figure 1 are discussed in other sections.
processing – ideal for a rapidly changing collection. In practice,
probing is expensive as it substantially increases network traffic.
It also introduces an additional level of dependence between the
                                                                          3   RELIABLE IR
broker and the query nodes as the query cannot be resolved at any         Any production environment expected to remain up 24 hours a day
query node until the broker has fully annotated the query.                7 days a week is faced with possible software, network, or hardware
    eBay, like many distributed search engines, prefers the central       failure. This is often addressed through replication. In the case of
nothing approach. That is, the task of the broker is to distribute        eBay, a data center is divided into a logical grid where columns are
the query to the query nodes and to merge the results, while the          shards (query nodes) and rows are replica sets. Various software
task of the query node is to search its shard. In this way the query      components are constantly monitoring the state of the grid and
nodes are independent and there is no cross-system dependency.            reporting any failures.
Although the global term statistics necessary to correctly compute            These replicas are a normal part of the data center – they are
IDF are inaccurate, the statistics stored at each node are a good-        all resolving queries as those queries arrive. But some queries take
enough estimate of the global statistics because each shard is large      longer to resolve than others. A software load balancer is used
(typically tens of millions of documents).                                to ensure replica sets do not form bottlenecks. Each time a query
    There has been considerable research into selective search (also      arrives the software load balancer identifies the most idle replica
known as topical sharding) [6] and collection selection by using          set and forwards the query to that replica set.
source selection approaches such as ReDDE [11]. It is unclear from            Should a replica set fail, the software load balancer stops sending
the literature whether topical sharding is used in any commercial         queries to that set until it eventually returns. Should an individual
search engine, and it is not used at eBay. One consideration is failure   node in a column fail then the low-level aggregator knows how
tolerance. Random sharding appears to offer better resilience to          to fail over to a “spare” replica of that individual node. This same
query node drop-out than topical sharding, as it naturally avoids         approach was discussed by Barroso et al [1] with respect to the
dropping large numbers of relevant documents. Also importantly,           Google search engine.
query streams are topically unpredictable – especially after a world          As with any large-scale web search engine, eBay has a suffi-
event. For example, when a famous musician dies eBay can see a            ciently large number of computers in the grid that it is reasonable
peak in queries for merchandise and memorabilia of that individual.       to expect that at any moment in time at least one is in a failed state.
With random sharding that increased load is spread across all the         Various systems identify this, try and bring those nodes back up,
query nodes, but with topical sharding a single node might be             try to reboot machines, and after going through a checklist, notify
expected to resolve the increased query load on its own – leading         a hardware engineer of a fault.
to an increase in latency and a decrease in quality of service. Recall        Once a node returns to service it rejoins its replica set by no-
that caching is difficult in a system like eBay due to the high rate      tifying the broker of its reincarnation. The broker will then send
of change in the document collection (and individualized results          queries to that node. This is also illustrated in Figure 1 which shows
listings) – and so all queries pass to all query nodes to be resolved.    two lower level aggregators with exact replica sets of the query
The cache does not and cannot compensate for the increased query          nodes for the Active Item collection.
traffic as there is no cache.
    The task of the broker (referred to, at eBay, as an aggregator)
                                                                          4   STRUCTURED IR
is to simply receive a query from further up the protocol stack, to
distribute that to the query nodes, then to receive the results from      By default, the user queries on the eBay site search only the listing
                                                                          titles – users can, however, specify that a search should search not
The Architecture of eBay Search                                                                        SIGIR 2017 eCom, August 2017, Tokyo, Japan


                                                                  Software
                                                                Load Balancer


                                                                    Query
                                                                Transformation


                                                                 Top Level
                                                                 Aggregator

         Active Item                                                                                                             Completed Item
                        Lower Level                              Lower Level                                 Lower Level
                        Aggregator                               Aggregator                                  Aggregator


               Query                  Query             Query                    Query               Query                  Query
                            ∙∙∙                                      ∙∙∙                                         ∙∙∙
               Node                   Node              Node                     Node                Node                   Node



Figure 1: Two levels of aggregator (broker) talking to query nodes (search engine instances) makes it possible to search active
item or completed item or both, all in parallel. However, before any of this load ballancing and query transformation occurs.


only the title, but also the description of the items. This is an ex-
                                                                                          Collection         Position            Region
ample of structured (or fielded) information retrieval. Surprisingly,
there does not appear to be a single best index structure for struc-                                                 1-4
tured search. Trotman [12] suggests building a single DOM-like                                                      1-3
super-tree over all documents, numbering the nodes, and annotat-
ing postings lists with node numbers. Guo et. al. [5] stores a Dewey                      Converse              1
encoded path directly in the postings lists. Both these approaches                        All                   2
involve annotating each posting in a postings list.
    Clarke et. al. [2] introduce region algebra, and in doing so show                     Star                  3
that the structural information can be stored separate from the post-                     
ings lists. If the postings lists encode word positions from the start
of the collection (rather than document id and term frequency ) and                                           4-4
if the document start and end positions are stored in a separate list                     Blue                  4
then it is possible, given any posting, to determine which document
it came from. If a document contains several different structures                         
then the start and end of these structures can be stored in lists,                        
one for each structure. Given a term restriction (e.g. “Converse in
                                                                                                                     5-7
Description”), the search engine loads the postings for the term,
filters the list using the field’s positional information, and converts                                             5-6
the result into a document id. This approach is particularly appli-
                                                                                          Low                   5
cable if the document collection is XML and the task is focused
retrieval [13].                                                                           Tops                  6
    Figure 2 illustrates region algebra. It shows two documents
                                                                                          
(items) in XML, the term positions, and the regions are shown
at the starts of each element. Given a term position and the regions                                          7-7
it is possible to determine whether or not the term lies in the region.
                                                                                          Red                   7
For example, “red” occurs at position 7, and the title regions are 1-3
and 5-6 so “red” does not occur in the title of any document.                             
    eBay sometimes uses region algebra albeit encoded differently                         
from Figure 2. As the task is document retrieval (not focused re-
trieval), the postings lists store document id, term frequency, and
word positions within that document. The region lists encode doc-               Figure 2: A collection of 2 structured documents, the term
ument id, field frequency, and the start and length of that field in            positions and the regions (start and end positions) used for
that document. There are several reasons for this encoding, the                 region algebra.
most obvious of which is that by storing  pairs the
SIGIR 2017 eCom, August 2017, Tokyo, Japan                                          Andrew Trotman, Jon Degenhardt, and Surya Kallumadi


difficulty of resolving self-containing regions (e.g. a  element
                                                                                               Term                 Postings
inside a  element) is diminished. Another reason is efficiency.
When a query contains a field restricted phrase it can be resolved by                          all                    <1,1>
first resolving the phrase query then resolving the field restriction.
                                                                                               blue                   <1,1>
Region algebra, although provably correct, does not scale well for
semi-structured document collections. For a field that occurs many                             converse               <1,1>
times in a document (such as 

in an HTML document) the lists that contain the start and end positions of that field can become low <2,1> very long – longer than the postings lists for even the most frequent red <2,1> terms. Processing these long lists takes time and can overwhelm the overall processing time of a query. This does not happen within star <1,1> eBay because documents are structured (not semi-structured) and tops <2,1> so structures occur at most once per document. An obvious and far simpler approach to structured information title:all <1,1> retrieval is to build an inverted index over each separate field in the title:converse <1,1> collection. This approach also does not scale, this time if there are a large number of fields within a document. If the user is able to title:low <2,1> choose any combination of fields to search over then each inverted list for each field must be examined – which is computationally title:star <1,1> expensive. title:tops <2,1> The eBay indexer builds separate inverted indexes over a small number of fields as well as an index over the entire document, description:blue <1,1> this time without word positions. A user is then able to restrict a description:red <2.1> query term to one of these fields or to the entire collection, but never to more than one of these (and they cannot use phrases). This approach has the advantage of decreasing the complexity of Figure 3: The fielded index for document collection in Fig- a field restricted search because the number of postings in a field ure 2, assuming each document is separated by tags. is typically smaller than in the entire collection, and smaller again Postings are represented . because word positions are not stored. Figure 3 illustrates the two documents from Figure 2 each with two fields (title and description) sharing a single vocabulary. If into the result retrieval process. For example, for the query “2006 the user searches for “star” anywhere in the collection then the Golf”, the search system should understand that the search intent postings for “star” are examined. If the user searches for “star” is more likely to be cars and less likely to be sports. in the title then the postings for “title:star” are examined. If the The objective of the rewrite is to encapsulate user intent so as to user searches for “star” in the description then, since there is no retrieve a different set of relevant results to rank. This is especially “description:star” in the vocabulary, there is a vocabulary mismatch useful when there is a long tail of items. Rewrite is achieved in two and no documents will be found. Recall that field restricted phrase ways 1) Query expansion and 2) Query relaxation. Query expan- search can be performed using region algebra. sion broadens the intent of the query by adding additional relevant tokens. Query relaxation removes tokens from the query, the ob- 5 QUERY TRANSFORMATION jective of query relaxation is to make the queries less restrictive When a query arrives at the eBay search engine it is immediately thus increasing recall. Of course, a query might have all its terms sent to a number of subsystems that are responsible for re-writing relaxed and then be expanded with a completely new set of terms. it in several different ways. These query transformations can be Query suggestion is another important aspect of all modern either explicit or implicit. In explicit query transformation, such as search engines. At eBay, the objective of query suggestion is to spelling correction, the user sees the transformed query. In implicit explicitly nudge the user to construct a query that is aligned with query transformation, the user does not. Search engines use query their intent. Query suggestions are created by mining the query logs transformation as a way to increase the recall of queries while and looking at reformulations. The transient nature of products at trying to minimize any negative effect on precision. But it is an eBay adds additional complexity. We do not want to suggest queries especially critical aspect of eCommerce search as it affects the for items that are no longer available because doing so results in a purchasing behavior of the user. If the transformation results in negative user experience. items that are not relevant then it reflects badly on the site. Equally, if the transformation results in retrieving too few items then it 5.1 Automatic Category Navigation (DSBE) affects revenues and reduces the purchasing options for the user. The behavioral data generated by users searching and buying items Query transformation often happens through a process of query on eBay is a valuable source of query disambiguation data (amongst understanding then query rewriting. The objective of query under- other things). For example, if a user enters “bread” as a query, eBay standing is to identify the meaning of a query and incorporate that knows, from that behavioral data, that the user is more likely to be The Architecture of eBay Search SIGIR 2017 eCom, August 2017, Tokyo, Japan interested in bakeware than music. This knowledge is mined from boosted decision trees is learned and used at the top level of a multi- the query logs. stage ranking (i.e. learning to rank) function. When a user enters a query it is logged, when they click on items, There is no reason why these trees should be constant across that too is logged. Eventually, when the user purchases an item queries, users, or locations – and they are not. There are commonly that is logged. In an eCommerce environment it is more difficult to many different ranking functions in the system simultaneously. determine this click chain that in a web environment. A web session They may serve different search applications or different query is often considered to be a continuous sequence of requests termi- segments within the same search application. A/B tests are an nating after a period of about 10-15 minutes of user inactivity [4]. especially important case. To perform an A/B test of a ranking However, in an eCommerce environment a user might search for function its necessary to “tie” the user to a particular function for an item (for example, a car), watch that item for a period of time, the duration of the test. periodically returning to examine characteristics of the item (for These capabilities are provided by the Profile Lookup Service. example, the type of stereo), compare that item to several others, It allows eBay search to define and house many different ranking then eventually purchase it (or not). The determination of a session functions. It allows them to be retrieved by search application, user on an eCommerce site is beyond the scope of this contribution. (for A/B tests), and other dimensions of the search environment. It Periodically the query and click streams are mined. One purpose also serves to isolate front-end applications from the details of the of the mining task is to determine, for the most frequent queries, recall and ranking algorithms. how to disambiguate those queries. The difficulty within eBay is that any user can list almost any item for sale, and many of these are 6 RANKING rare or one-off items – so it is often not feasible to associate frequent queries with items via a product catalog alone. However, it is often Although we are not at liberty to release details of the eBay rank- feasible to use associations with structured data components, such ing function, some details are either standard practice or already as category, brand, etc. that can be further associated with items. released. The eBay query disambiguation engine takes this approach. By default, only the titles of the documents are searched. How- One of the more valuable associations is the most likely cate- ever, the queries are re-written to include two parts: first, those gory that will satisfy the query. For the example, “bread” is more terms useful in determining a suitable set of documents to rank often bakeware than music so the results list is limited to just the (the recall base); and second, those terms used in ranking. In this bakeware section of eBay. Another is any special considerations on way terms that are necessary to identify a relevant document (but how to rank the query. That is, the query itself carries details of frequent in that set of documents) can be ignored in ranking, and the ranking function to be applied by the search engine. terms that are frequent in the collection but not in the recall base can be used for ranking. These terms are mined from user logs. Multi-stage raking is used. The final stage uses a forest of gradi- 5.2 Query Rewrite (SIBE) ent boosted decision trees to perform the final ordering. These trees A more complex mining task is that of determining user intent (and lower levels too) draw from textural features of the document rather than simply disambiguating the query. This, too, is mined (such as term proximity), as well as features the user has supplied from user behavior logs and the document collection. For each (such as price). They are deduced from the customer and seller (lo- frequent query a set of purchases and a set of non-purchases forms cation information used for postage computation), seller reputation, a dataset that is analogous to a relevance feedback [9] dataset. and over 500 features eBay stores about an item (ranging from color It is clear from relevance feedback experiments [14] that a new to condition to size). Much of this feature information is stored in query can be formed from the original and the labeled data, and a forward index (or docdata). that query will better fulfills the user’s information need than the Unusual in academic search but common in industrial search, original query. At eBay these rewritten queries are stored keyed on eBay search avoids returning no results whenever possible. If a the user’s original query and a quick lookup returns the replacement query is about to return no results, or a small number of results, then query. the null-and-low subsystem is notified. There are several reasons Examination of these re-written queries shows that they tend recall might be low; for example it might be due to the absence of to include boosting of synonyms and stemming variants, but also products in the repository (the query is highly specific and the item down-ranking of ambiguous and noise terms in order to clarify does not exist) or due to a misalignment of vocabulary between the query. For some queries these re-writes can be large in size – the user and the product (for example, “apple tablet” rather than not infrequently exceeding hundreds of kilobytes. Unfortunately, “iPad”). these long queries take a considerable amount of the computational The null-and-low component takes the query, re-writes it to a resources to resolve, a necessary trade off in order to successfully new query with similar semantics, and sends it back to the search fulfill the user’s needs. engine. This process repeats until either a suitable number of results is identified, or the process gives up. We cannot release the details of the null-and-low subsystems, however a user can observe the 5.3 Profile Lookup Service (PLS) eBay spelling checker at work if a single nonsense word is entered. The eBay ranking functions are stored externally from the main They can also observe the dropping of query terms if a nonsense source code of the search engine. They are often generated through word and a non-nonsense word are entered as the query. Section 5.1 data mining and machine learning. For example, a forest of gradient introduced automatic category restriction for queries. Experiments SIGIR 2017 eCom, August 2017, Tokyo, Japan Andrew Trotman, Jon Degenhardt, and Surya Kallumadi showed that this has several effects on the search engine. First, by serving node, that node is defragmented as the new bulk index lies reducing the size of the recall base the query latency goes down. consecutively in memory and all old patches can be purged. Other Second, by removing noise from the results list the precision goes approaches are possible, for example, a periodic process could move up. postings lists about in memory in a de-fragmentation pass. Another important and pragmatic role of bulk indexes is to provide a vehicle for large updates affecting many documents, but 7 INDEXING that do not have freshness requirements offered by mini index. Maintaining an inverted index over a rapidly changing document Distributing large updates via bulk indexes has strong efficiency collection has received little attention in the past. Indeed, it is often benefits. assumed that indexing efficiency is unimportant as indexing only The 8-hour bulk reload also ensures that three times a day all occurs once. This is not the case at eBay or with any other rapidly machines in the grid are synchronised. When a new machine is changing collection that must be kept up-to-date (for example a added to the grid it simply needs to wait for the bulk to arrive to news site). be in sync. Lester et al. [7] offer three update strategies for maintaining If a machine fails (for whatever reason) and comes back on-line, an inverted index, in-place, re-merge, and re-build. The in-place it can simply compare time stamps of which indexes are stored approach accumulates changes to postings lists in memory then locally against a global registry of indexes to work out what needs merges those into a master index, keeping the postings list at the to be done to re-sync – and that registry need not be concerned same location if it still fits – and if it does not then it appends it to with indexes prior to the previous bulk. the end of the index. Re-merge simply merges the existing index An alternative strategy of keeping up-to-date is to build the with a set of diffs to create a new index. Re-build discards the old index on the query serving node in real-time as changes occur. This index and re-indexes everything from scratch. Their experiments approach results in a substantial increase in the amount of network were on-disk and so the efficiency results are not applicable to an traffic seen at the query serving nodes. It is also difficult to re-sync in-memory index. in the case of node failure. That is, even if real-time changes are At eBay a combination of approaches is used. Every 8 hours maintained at each node, a master index of that node must be build a complete re-build of the gold-standard document collection is and maintained so that it can be distributed for a restart. performed. This “bulk” is then shipped to the index serving grid, The index building process is, itself, relatively straightforward. trickling from row to row so that, over time, the entire grid has A large HBase cluster is used as the gold-standard document collec- a fresh index. On a more frequent basis (minutes), all changes to tion. That no-sql database is built and maintained through a series the gold-standard document collection are computed and shipped of processes that listen to the eBay internal item messaging pipeline to the grid as a “mini” index. These minis contain newly indexed for messages about changes to items and sellers, and updates are documents and a list of deleted documents (document changes are applied to HBase accordingly. Although not the topic of this contri- deletes followed by adds). There is, clearly, a short period when both bution, different processes in different parts of eBay are responsible bulks and minis are being built at the same time. This is resolved at for determining different characteristics of an item; for example, it the query nodes and in distribution where it is ensured that indexes is automatically classified, spam detection algorithms are applied, arrive in the correct locations and in the correct order. language translation occurs, and more. Updates to a document can The in-place index merging approach of Lester et al. is used to occur at any time and HBase ensures these updates are applied combine bulks and minis, except that the index resides in memory in the correct order even if they arrive out of order. This HBase not on disk. One at a time each postings list that requires append- database is not the eBay-wide gold-standard (that is elsewhere), it ing is pulled from memory. If it contains a posting for a deleted is the search department’s gold-standard. document then that posting is removed from the list, then the HBase is built on top of Hadoop, the Apache open-source map- new postings are added to the end. The postings list is put back reduce platform. Periodically a map-reduce process kicks-off, the in memory in-place if it fits and if not then elsewhere in mem- indexer is shipped to the data, indexing occurs in a distributed ory. The consequence of this approach is that postings lists that manner, then the index parts are reduced to form a number of are not added to are also not purged of deleted documents. There indexes that matches the topology of the query serving grid. These are several reasons for choosing this approach. One is that it is are then picked up by the index distribution agent and moved to not necessary to exactly re-parse the document in order to deter- the query serving nodes. mine which postings lists need purging. A second is that it reduces fragmentation because fewer lists are modified. As a postings list might contain a posting for a document that 8 SELLERS, ITEMS, INDEXES, RANKING has been deleted it is also necessary to keep track of deleted items. The ranking function draws from many hundreds of different fea- Documents that have been deleted between the construction of the tures. Those features are derived from the item, the seller, and the mini and the time of the search are removed from the results list buyer. Features of the buyer are determined high in the protocol further up the protocol stack. stack and are constant for each item. Features of the item are stored Each merge of a mini results in a little more memory fragmenta- in the inverted index and forward index. It is not clear where to tion as some postings lists no-longer fit back in memory where they store the features of the seller in this eCommerce search problem. came from. This is one of the key reasons for an 8-hour bulk re- If the seller features are stored in HBase along with the item indexing. Each time a bulk index is loaded the memory on the query then any change to a seller would require a change to all items The Architecture of eBay Search SIGIR 2017 eCom, August 2017, Tokyo, Japan Seller Forward Index Feature Value feedback_score 146 Item Forward Index followers 1 Feature Value Seller Lookup seller 3 id ptr Feature Value price 3.00 1 feedback_score 1170 2 followers 10 Feature Value 3 seller 2 Feature Value price 9.00 feedback_score 3898 followers 90 Figure 4: Seller data is stored in a forward index pointed to indirectly from the item forward index. that seller is selling – which can be many hundreds of thousands REFERENCES of items (especially for sellers of Compact Disks). If the seller data [1] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. 2003. Web Search for a Planet: is “crossed” with the item data at indexing time then a change to The Google Cluster Architecture. IEEE Micro 23, 2 (March 2003), 22–28. [2] Charles L. A. Clarke, G. V. Cormack, and F. J. Burkowski. 1995. An Algebra for the seller data would require a substantial update to the index with Structured Text Search and a Framework for its Implementation. Comput. J. 38, a mini. Such a change in seller data might occur if, for example, 1 (1995), 43. https://doi.org/10.1093/comjnl/38.1.43 [3] O. de Kretser, A. Moffat, T. Shimmin, and J. Zobel. 1998. Methodologies for the seller declares that they are on holiday and so the expected distributed information retrieval. In Proceedings. 18th International Conference on shipping date of all their items is to be postponed a short time. Distributed Computing Systems (Cat. No.98CB36183). 66–73. As part of the indexing process eBay builds two indexes, one for [4] Ayse Goker and Daqing He. 2000. Analysing Web Search Logs to Determine Session Boundaries for User-Oriented Learning. In AH. the items and one for the sellers. Each item has a seller id associated [5] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. 2003. with it, but the reverse it not true (to find all items of a given seller, XRANK: Ranked Keyword Search over XML Documents. In Proceedings of the an item search for the seller’s id is performed). To extract the seller 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD ’03). ACM, New York, NY, USA, 16–27. features during ranking a seller lookup is performed on the seller’s [6] Yubin Kim, Jamie Callan, J. Shane Culpepper, and Alistair Moffat. 2017. Efficient id and the seller’s doc-data is examined. Distributed Selective Search. Inf. Retr. 20, 3 (June 2017), 221–252. [7] Nicholas Lester, Justin Zobel, and Hugh E. Williams. 2004. In-place Versus Re- Figure 4 illustrates this process. The document ids from the build Versus Re-merge: Index Maintenance Strategies for Text Retrieval Systems. inverted index point to the item doc-data (forward index) which In Proceedings of the 27th Australasian Conference on Computer Science - Volume 26 contains a seller id which is used to find the seller forward index (ACSC ’04). Australian Computer Society, Inc., Darlinghurst, Australia, Australia, 15–23. which contains the seller’s features used in ranking. [8] Jimmy Lin and Andrew Trotman. 2015. Anytime Ranking for Impact-Ordered The seller index only contains data for sellers who have items Indexes. In Proceedings of the 2015 International Conference on The Theory of for sale at the moment the index is constructed. The alternative Information Retrieval (ICTIR ’15). ACM, New York, NY, USA, 301–304. [9] J. J. Rocchio. 1971. Relevance Feedback in Information Retrieval. is to include all potential sellers, but since all users are potential [10] Milad Shokouhi, Falk Scholer, and Justin Zobel. 2006. Sample Sizes for Query sellers this table would be large and contain many unused entries Probing in Uncooperative Distributed Information Retrieval. In Proceedings of the 8th Asia-Pacific Web Conference on Frontiers of WWW Research and Development (which is space inefficient). It is clear that the closure of the set of (APWeb’06). Springer-Verlag, Berlin, Heidelberg, 63–75. current sellers can be constructed from the closure of the set of [11] Luo Si and Jamie Callan. 2003. Relevant Document Distribution Estimation items, and indeed that is how it is constructed. Method for Resource Selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval (SIGIR ’03). ACM, New York, NY, USA, 298–305. [12] Andrew Trotman. 2004. Searching Structured Documents. Inf. Process. Manage. 9 CONCLUSION 40, 4 (May 2004), 619–632. [13] Andrew Trotman, Nils Pharo, and Miro Lehtonen. 2007. XML-IR Users and Use In this position paper we outlined much of how the eBay search Cases. Springer Berlin Heidelberg, Berlin, Heidelberg, 400–412. engine currently functions along with the engineering reasons for [14] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to it working in this way. BM25 and Language Models Examined. In Proceedings of the 2014 Australasian Document Computing Symposium (ADCS ’14). ACM, New York, NY, USA, Article Although many standard techniques are used (inverted indexes, 58, 8 pages. learning-to-rank, etc.), it is not possible to simply take an off-the- shelf web search engine and apply it to eCommerce as other stan- dard techniques (caching, index-once philosophy) are simply un- tenable. This, along with the rapid rate of change are the reasons eBay chose to build a search engine – and this paper outlines many of the engineering decisions taken.