<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Practical Secondary Stack Optimization on Search Pages: A Lightweight Contextual Bandit Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eva C. Song</string-name>
          <email>eva.song@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun Zhao</string-name>
          <email>jun.zhao3@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junchao Zheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vivek Agrawal</string-name>
          <email>vivek.agrawal@walmart.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Walmart Global Tech</institution>
          ,
          <addr-line>Hoboken NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Walmart Global Tech</institution>
          ,
          <addr-line>Sunnyvale CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>1516</fpage>
      <lpage>1541</lpage>
      <abstract>
        <p>Secondary stacks – recommender modules displayed alongside core search results – are a key driver of engagement and conversion on E-commerce search pages. A secondary stack suitable for display should generally demonstrate relevance to the query and good potential for engagement. An E-commerce website can have many such secondary stack modules that feature different themes and are powered by different content generation algorithms. Due to space limitations, not all available stacks can be displayed, making it necessary to prioritize those most likely to perform well under the current search context, such as the query. Additionally, retrieval for many stacks is computationally non-trivial. Hence, it is often beneficial to conduct a pre-retrieval selection, when only partial context is known, to limit the number of stacks to retrieve, and then prioritize based on the full context only among the retrieved stacks. Lastly, stack performance is dynamic, varying with evolving content and user behavior. We introduce a lightweight, highly flexible solution that frames secondary stack prioritization as a contextual multi-armed bandit problem. Our approach features a low-cost offline data pipeline and a real-time selection algorithm based on Thompson sampling, enabling seamless deployment at any stage of the search flow, such as pre-retrieval, post-ranking, or anywhere contextually appropriate. The system dynamically adapts to shifting contexts without requiring complex retraining, and thrives where large models often overfit, such as with low-visibility stacks that receive sparse impressions. AB tests on Walmart.com show that our method delivers significant gains in user engagement metrics, validating a solution that is fast to implement, cost-efficient to serve, and robust in production. The algorithm has been successfully launched and continues to drive improved search experiences at scale.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-armed bandit</kwd>
        <kwd>Search Page Optimization</kwd>
        <kwd>Secondary Stack</kwd>
        <kwd>Thompson sampling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In E-commerce search experiences, Secondary Stacks – recommender modules displayed alongside or
below primary search results – play a vital role in driving user engagement, discovery, and conversion.
With limited real estate on the search page and a growing number of available stacks powered by
diverse algorithms, efficiently prioritizing which modules to display has become increasingly critical.</p>
      <p>On the Walmart.com website, a secondary stack can be shown at the bottom of the first page of
search results. At the time of writing, there are ∼ 10 types of secondary stacks live in production and
several more under development. Each of these secondary stacks features a different aspect of business
consideration, and the content within each secondary stack is generated with a separate algorithm.
For example, some secondary stacks are personalized based on users’ past shopping history, while
others are generic to all users. However, they all share a very similar front-end treatment and the
same space on the search page. As the number of different types of secondary stacks increases, several
stacks may be qualified to be shown on the same search page and consequently, a choice must be made.
It is worth noting that retrieval for many stacks is non-trivial, with some as costly as the
primary search retrieval. Therefore, the ideal framework should be invokable at both the pre-retrieval stage,
where it needs to operate on partial context, and the post-retrieval stage, to conduct the final selection
based on the full context. Prior to this work, the selection of the secondary stacks was based on a
set of business heuristic rules. Under these rules, the prioritization of the stacks is predetermined.
This rule-based prioritization is suboptimal, as each secondary stack can demonstrate different user
engagement patterns under different contexts. One such context is the query itself. An example
illustrating that different stacks should be prioritized under different contexts is shown in Figure 1. In
addition to the lack of contextual awareness, the engagement pattern of each individual stack can
also change over time. In order to address this issue, we treat it as a contextual multi-armed bandit
problem, where an explore-exploit algorithm learns the dynamics of all the secondary stacks under
various contexts by continuously collecting users’ engagement data. When multiple secondary stacks
are qualified to be displayed on the same search page, the explore-exploit algorithm decides whether
to show the most promising stack based on past users’ engagement data (exploit) or show another
stack so that we continue taking measurements on the other stacks in order to capture any change in
stacks’ behaviors over time (explore). An efficient explore-exploit algorithm allows us to achieve the
best engagement from all secondary stacks on average in the long run.</p>
      <p>Figure 1: (a) A stack featuring personalized cross-selling options; (b) a stack featuring highly rated items.</p>
      <p>The multi-armed bandit is a framework that handles the tradeoff between explore and exploit. In
this setting, at each time instance, an arm is selected and a reward (or regret) from choosing the arm is
observed. The selected arm and its corresponding reward then contribute to the historical observations and
are used to make the next decision. This area has been intensively studied in the literature since 1933
[1]. Many algorithms are well established, including $\epsilon$-greedy, Upper Confidence Bound
(UCB) [2, 3, 4], and Thompson sampling [5, 6, 7]. The corresponding theoretical bounds on expected
regret over time have also been studied [2, 8, 9, 5, 10]. The contextual multi-armed bandit further assumes
that the selection of the best arm and the corresponding reward depend on another random variable,
the context. This setting has been applied to many areas, including recommender systems [11]. In
recent years, deep contextual bandits [12] have also been intensively studied, applied to similar problems,
and demonstrated great results.</p>
      <p>Traditional complex modeling approaches often require high-quality, large-scale data and heavy
infrastructure, making them impractical for a dynamic and noisy environment where many stacks
can suffer from low visibility and sparse impressions. In this work, we present a lightweight, flexible,
and production-ready solution based on contextual multi-armed bandits that efficiently selects
high-performing secondary stacks with minimal cost.</p>
      <p>The main contributions of this paper include:
• Framing the problem of search page secondary stack selection as a contextual multi-armed bandit,
and designing the offline data pipeline to maintain contextual statistics and the online selection using
Thompson sampling.
• Designing the context tree data structure with
– operational flexibility that allows the sampling to be conducted with full or partial context;
– robustness that allows tracing back to higher levels of the tree when the confidence in
leaf-node statistics is low due to low impressions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formulation and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Contextual Multi-Armed Bandit Setting</title>
        <p>We identify the possible arms as the different types of search page secondary stack (recommender):
$$\mathcal{A} = \{\text{recommender } 1, \ldots, \text{recommender } K\}.$$</p>
        <p>We emphasize that we focus on the selection of the secondary stack, rather than the content generation
within each individual secondary stack. The content within each secondary stack is assumed to have
been generated. All these secondary stacks share many common characteristics, such as similar
frontend User Interface (UI) treatment and position on the search page where they can be displayed when
selected. We focus on making the optimal selection of a stack for a fixed position on the search page.</p>
        <p>We define the user engagement (number of item clicks) from a secondary stack as the reward $r$.
From historical observations, we can form a data set as
$$\mathcal{D}_n = \{(x_t, a_t, r_t)\}_{t=1}^{n}, \quad (1)$$
where the context $x_t$ itself can be a vector, and the corresponding selected stack (arm) $a_t$ and observed
reward $r_t$ are recorded. We then obtain an induced distribution
$$P_{a_t, r_t \mid x_t, \mathcal{D}_{t-1}} = P_{a_t \mid x_t, \mathcal{D}_{t-1}}\, P_{r_t \mid a_t, x_t, \mathcal{D}_{t-1}}, \quad t = 1, 2, \ldots$$
Note that the arm selection scheme $P_{a_t \mid x_t, \mathcal{D}_{t-1}}$ and the corresponding reward distribution $P_{r_t \mid a_t, x_t, \mathcal{D}_{t-1}}$ can be
stochastic. The goal is to obtain an optimal arm selection scheme $P_{a_t \mid x_t, \mathcal{D}_{t-1}}$ that maximizes the
expected reward over time, i.e.
$$\max_{\{P_{a_t \mid x_t, \mathcal{D}_{t-1}}\}_{t=1}^{n}} \; \sum_{t=1}^{n} \mathbb{E}\left[r_t \mid x_t, \mathcal{D}_{t-1}\right]. \quad (2)$$</p>
        <p>As for the reward, we use the user engagement from the secondary stack by assuming that the number of
engaged items $r$ from a search page secondary stack follows a binomial distribution
$$r \sim \mathcal{B}(M, \Theta), \quad (3)$$
where $M$ is the maximum number of engaged items from the secondary stack, and $\Theta$ is the success rate
of each item from the secondary stack. We let the parameter $\Theta$ itself be a random variable
$$\Theta \sim \mathrm{Beta}(\alpha, \beta), \quad (4)$$
where $\alpha$ and $\beta$ are hyperparameters. Since the beta distribution is a conjugate prior for the binomial, after $n$ observations we have a
computationally efficient way of obtaining the posterior distribution
$$\Theta \mid \mathcal{D}_n \sim \mathrm{Beta}\left(\alpha + \sum_{i=1}^{n} r_i,\; \beta + \sum_{i=1}^{n} (M - r_i)\right). \quad (5)$$</p>
        <p>In the contextual setting, we maintain one such posterior per arm and context realization:
$$\Theta(a, x) \sim \mathrm{Beta}\left(\alpha_0 + \sum_{i=1}^{n} r_i(a, x),\; \beta_0 + \sum_{i=1}^{n} \big(M - r_i(a, x)\big)\right), \quad (6)$$
$$r_t \mid (a_t = a,\, x_t = x,\, \Theta_{t-1}) \sim \mathcal{B}\big(M, \Theta_{t-1}(a, x)\big),$$
where $\alpha_0$ and $\beta_0$ are universal initial values to be determined experimentally, and $r(a, x)$ denotes the
number of clicked items from the secondary stack $a$ under context $x$.</p>
        <p>Since we choose the reward as the number of engaged items from the selected secondary stack, which
in turn is modeled according to the binomial distribution given in (3), and $M$ is a fixed parameter to be
determined later, the optimization given in (2) simplifies as follows:
$$\max_{\{P_{a_t \mid x_t, \Theta_{t-1}}\}_{t=1}^{n}} \; \sum_{t=1}^{n} \mathbb{E}\left[r_t \mid x_t, \Theta_{t-1}\right], \quad (7)$$
where the expectation is w.r.t. the joint distribution
$$P_{x, a, r, \Theta_{t-1}} = P_{x, \Theta_{t-1}}\, P_{a \mid x, \Theta_{t-1}}\, P_{r \mid a, x, \Theta_{t-1}}.$$</p>
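        <p>To make the conjugate update in (5)-(6) concrete, the following minimal Python sketch maintains per-(stack, context) beta parameters under the binomial engagement model; the dictionary layout and names are illustrative, not the paper’s implementation.</p>
        <preformat>from collections import defaultdict

M = 3  # cap on engaged items per impression (Section 3.1 sets M = 3)

# Beta hyperparameters per (stack, context) pair, starting from the
# uniform prior Beta(1, 1) discussed in Section 3.1.
alpha = defaultdict(lambda: 1.0)
beta = defaultdict(lambda: 1.0)

def update_posterior(stack, context, clicks):
    """Conjugate update from (5)/(6): one observation with r = clicks."""
    r = min(clicks, M)               # reward is capped at M successes
    alpha[(stack, context)] += r     # successes
    beta[(stack, context)] += M - r  # failures</preformat>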
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Related Work and Our Approach</title>
        <p>In our setting (7), we formulate the problem as contextual multi-armed bandit in order to capture each
secondary stack’s performance under diferent search contexts, such as queries. In addition, we adopt a
Bayesian approach to exploit the user engagement in order to update the posterior distributions of the
parameters in (6).</p>
        <p>Ref. [11] proposes a contextualized approach, LinUCB, based on the traditional UCB, where the
expected reward is assumed to take a linear form in the context vector with some unknown parameters
to be estimated, i.e. $\mathbb{E}[r \mid x, \Theta] = x^{\top}\Theta$. This result also has a natural Bayesian interpretation when
$\Theta$ follows a Gaussian distribution. Similarly, the linear expected reward treatment is also studied using
Thompson sampling [13]. However, both approaches intrinsically assume a continuous, unbounded
form of reward. These formulations are not directly applicable to our problem since our reward is
discrete.</p>
        <p>Without the intricacies of the context, the problem can be addressed in the Bernoulli Bandit
setting. Thompson sampling is particularly efficient in conducting the explore-exploit of the Bernoulli
Bandit. However, the Bernoulli Bandit is not as flexible in handling the inclusion of contexts. One way of
handling the contextualization is by quantization, such as Lipschitz contextual bandits [14, 15, 16, 17].
The general idea is that the distributions of the parameters are maintained separately for a finite set of
context realizations after the quantization of the context space. When the cardinality of the context is
small, it is possible to skip the quantization. This is the approach we adopt by making sure that only
the most prominent contexts are kept through a context preselection.</p>
        <p>To solve for (7), we use Thompson sampling [18]. The simplified version of the protocol is summarized
as follows.</p>
        <p>Algorithm 1 Thompson sampling for stack selection
Require:</p>
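        <p>As a concrete illustration of the protocol, the Python sketch below samples a success rate for each eligible stack from its beta posterior, displays the arm with the largest sample, and folds the observed engagement back into the statistics. It is an illustrative sketch under assumed names, not the production implementation.</p>
        <preformat>import random

def thompson_select(stats, eligible_stacks, context):
    """One round of Thompson sampling over the stacks (sketch only)."""
    best_stack, best_theta = None, -1.0
    for stack in eligible_stacks:
        a, b = stats[(stack, context)]    # Beta(a, b) posterior for this arm
        theta = random.betavariate(a, b)  # sample a plausible success rate
        if theta &gt; best_theta:
            best_stack, best_theta = stack, theta
    return best_stack

# After displaying best_stack and observing r engaged items (r &lt;= M),
# the posterior is updated as in (5): a += r, b += M - r.</preformat>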
        <p>Algorithm 1 only serves as a high-level overview of the flow. The actual realization of this algorithm
does a batch update with delay and is discussed in the following sections. In the rest of the paper, we
first discuss the methodology of identifying the context for our business problem and the data structure
we use to maintain the statistics under such contexts. We then present the secondary stack selection
algorithm, which contains an offline data processing part and an online stack selection part including
data logging. We also discuss the implementation details that address several practical considerations,
including an initial data collection mode, handling data delays, hashing, and intricacies in conducting
AB tests.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Context Tree</title>
      <p>From business intuition and domain knowledge, we start by identifying the following contexts that can
affect the performance (user engagement rate) of a secondary stack. Such contexts broadly include but
are not limited to:
a) search context, such as the query, the Product Type (PT) associated with the query, and how many
exact matched items are retrieved;
b) session context, such as device platform and search filters applied;
c) customer context, such as the customer’s membership status.</p>
      <p>The use of context trees is well known in the literature of data compression [19]. Here we use the
data structure to maintain each stack’s statistics on the context level.</p>
      <sec id="sec-3-1">
        <title>3.1. Building the Context Tree</title>
        <p>To keep the notation generic, assume we have the following $d$ contexts, ranked from the most to the
least important after context preselection:
$$x = (\mathrm{ctx}_1, \ldots, \mathrm{ctx}_d).$$
Each context $\mathrm{ctx}_i$ takes on a finite number of values $|C_i|$. The structure of a context tree is shown in
Figure 2. As a point of reference, our context tree is constructed with six layers.</p>
        <p>We next explain the nodes in the context tree. A node at depth $l$ contains the engagement statistics
of each secondary stack under a realization of contexts $\mathrm{ctx}_1, \ldots, \mathrm{ctx}_l$, which can be traversed from the
root node. Consequently, the statistics contained at any given node are an aggregation of the statistics of
all its direct child nodes.</p>
        <p>Within each node, the engagement statistics of each secondary stack are given by two parameters,
which correspond to the hyperparameters of the beta distribution given in (5). That is, by looking at
both parameters, we can efficiently obtain the posterior distribution of $\Theta(a, x)$ for a given secondary
stack $a$ and context realization $x$ that is indicated by the traversal trajectory that leads to the node from
the root.</p>
        <p>To construct the tree based on a historical data set, we can first construct all the leaf nodes in the
tree and then recursively compute the statistics at all the other nodes bottom up. This is presented in
Algorithm 2.</p>
        <p>In Algorithm 2, the maximum number of engaged items from a secondary stack, $M$, as mentioned in
(3), is set to 3. This is chosen based on our empirical study of clicks from secondary stacks. In the rest of the
algorithm, we first construct all the leaf nodes under the $|C_1| \times \ldots \times |C_d|$ realizations. As a result of the
simple form of the Bayesian update in (5), we can maintain only two parameters per stack and context
realization by counting the engagement directly (L.7-8 of Algorithm 2). Consequently, the count of engagement at a
given node is simply the sum of engagement from all its child nodes. This allows us to easily compute
the statistics by recursion.</p>
        <p>Note that the context tree constructed by executing Algorithm 2 cannot be directly used as the
Bayesian statistics yet, because we have not assigned any prior distribution. If we assume a uniform prior
of $\Theta$ from (3) on $[0, 1]$, we can set $\alpha = \beta = 1$ in (4). Therefore, with the uniform prior assumption, we
can use the context tree from Algorithm 2 as the Bayesian statistics by adding 1 to all the $\alpha$ and $\beta$ values at
each node.</p>
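        <p>As a concrete sketch of the construction Algorithm 2 describes, the Python below accumulates engagement counts along each leaf path (so every internal node automatically holds the sum of its descendants, equivalent to the bottom-up recursion) and then adds 1 to every parameter for the uniform prior. The record format and dictionary layout are assumptions for illustration.</p>
        <preformat>M = 3  # maximum number of engaged items per impression

def build_context_tree(records, depth):
    """records: iterable of (context_tuple, stack, clicks) observations."""
    root = {"stats": {}, "nodes": {}}
    for ctx, stack, clicks in records:
        r = min(clicks, M)
        node = root
        for level in range(depth + 1):  # root plus one node per context level
            a, b = node["stats"].get(stack, (0.0, 0.0))
            node["stats"][stack] = (a + r, b + (M - r))
            if level &lt; depth:
                node = node["nodes"].setdefault(
                    ctx[level], {"stats": {}, "nodes": {}})
    return root

def add_uniform_prior(node):
    """Turn raw counts into beta statistics by adding the Beta(1, 1) prior."""
    node["stats"] = {s: (a + 1.0, b + 1.0)
                     for s, (a, b) in node["stats"].items()}
    for child in node["nodes"].values():
        add_uniform_prior(child)</preformat>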
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Updating the Context Tree</title>
        <p>In Algorithm 2, we provide a way of constructing a context tree from one batch of data. Next, we
present a scheme to keep updating the Bayesian statistics with new observations.</p>
        <p>The frequency of updating the context tree can be application driven. We focus on a batch update
scheme, where we choose one day as the batch size or update frequency. However, the scheme is
generalizable to any update batch size. We provide an algorithm to update the context tree on a daily
basis in Algorithm 3.</p>
        <p>Observe that at any point of the traversal, oldTree and oneDayTree have exactly the same structure.
Therefore, we can traverse the two trees together based on the child node keys of either tree. In
Algorithm 3, we provide a way of efficiently updating the posterior distribution of $\Theta(a, x)$ by batch. If
it is the first batch of data, oldTree is initialized to reflect uniform distributions of $\Theta(a, x)$ for every
secondary stack $a$ and context realization $x$, i.e. $\mathrm{Beta}(1, 1)$. Exploiting the identical tree structure, each
node with the corresponding statistics can be traversed and updated using Depth First Search (DFS).</p>
        <p>In order to preserve the "Explore" capability in Thompson sampling, a discount factor $\gamma \in [0, 1]$ is
enforced to prevent the parameters $\alpha$ and $\beta$ from growing unboundedly. This ensures that L.5 in
Algorithm 1 does not become deterministic.</p>
        <p>The rationale for the parameter update in L.20-21 of Algorithm 3 is the following result in
Proposition 1.</p>
        <p>Proposition 1. Given two random variables
$$\Phi_0 \sim \mathrm{Beta}(\alpha_0, \beta_0) \quad \text{and} \quad \Phi_1 \sim \mathrm{Beta}(\alpha_1, \beta_1)$$
with the corresponding probability density functions $f_{\alpha_0, \beta_0}(\cdot)$ and $f_{\alpha_1, \beta_1}(\cdot)$, respectively, the distribution
defined by a new probability density function $f^{(\gamma)}(\cdot)$ of the tilted form
$$\log f^{(\gamma)}(\theta) \triangleq (1 - \gamma) \log f_{\alpha_0, \beta_0}(\theta) + \gamma \log f_{\alpha_1, \beta_1}(\theta) + \text{constant}$$
is again a beta distribution,
$$\mathrm{Beta}\big((1 - \gamma)\alpha_0 + \gamma\alpha_1,\; (1 - \gamma)\beta_0 + \gamma\beta_1\big).$$
This follows because $\log f_{\alpha, \beta}(\theta) = (\alpha - 1)\log\theta + (\beta - 1)\log(1 - \theta) + \text{constant}$, so the convex combination of the two log-densities yields exponents $(1 - \gamma)\alpha_0 + \gamma\alpha_1 - 1$ and $(1 - \gamma)\beta_0 + \gamma\beta_1 - 1$.</p>
        <p>Note that the constant term itself is equal to $(1 - \gamma) D_{\gamma}(\Phi_1 \| \Phi_0)$, where $D_{\gamma}(\Phi_1 \| \Phi_0)$ is known as the
Rényi divergence [20].</p>
        <p>By updating the hyperparameters as a convex combination of the ones from the oldTree and the
oneDayTree as in L.20-21, we obtain the tilted distribution.</p>
        <p>The discount factor $\gamma$ itself can be adaptive. We discuss more on the adaptation scheme in Section 5.</p>
        <p>Algorithm 3 Update context tree by batch
1: oldTree ← load existing context tree
2: oneDayTree ← construct tree with Algorithm 2 with one day of data
3: if oldTree == ∅ then ◁ first day, no previous tree
4:   oldTree ← uniform tree
5: end if
6: updatedTree ← updateTree(oldTree, oneDayTree, γ)
7: function updateTree(oldTree, oneDayTree, γ)
8:   if oldTree == ∅ then ◁ base case
9:     return ∅
10:  end if
11:  newTreeNodes, stats ← {}, {}
12:  childNodes ← oldTree["nodes"]
13:  while ctxVal ∈ childNodes do
14:    newSubTree ← updateTree(childNodes[ctxVal], oneDayTree["nodes"][ctxVal], γ) ◁ DFS
15:    newTreeNodes[ctxVal] ← newSubTree
16:  end while
17:  while a ∈ 𝒜 do
18:    α₀, β₀ ← oldTree["stats"][a]["α"], oldTree["stats"][a]["β"]
19:    α₁, β₁ ← oneDayTree["stats"][a]["α"], oneDayTree["stats"][a]["β"]
20:    α ← (1 − γ) · α₀ + γ · α₁
21:    β ← (1 − γ) · β₀ + γ · β₁
22:    stats[a] ← {"α": α, "β": β}
23:  end while
24:  return {"stats": stats, "nodes": newTreeNodes}
25: end function</p>
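        <p>For readers who prefer running code, here is a hedged Python rendering of updateTree from Algorithm 3: a depth-first traversal that replaces each node’s beta parameters with the convex combination justified by Proposition 1. The tree layout mirrors the construction sketch in Section 3.1; names are illustrative.</p>
        <preformat>def update_tree(old, one_day, gamma):
    """Discounted merge of two same-shaped context trees (Algorithm 3 sketch)."""
    if old is None:  # base case
        return None
    # DFS into the children, relying on the identical tree structures
    nodes = {ctx_val: update_tree(child, one_day["nodes"][ctx_val], gamma)
             for ctx_val, child in old["nodes"].items()}
    stats = {}
    for stack, (a0, b0) in old["stats"].items():
        a1, b1 = one_day["stats"][stack]
        # Proposition 1: the tilted distribution is again a beta distribution
        stats[stack] = ((1 - gamma) * a0 + gamma * a1,
                        (1 - gamma) * b0 + gamma * b1)
    return {"stats": stats, "nodes": nodes}</preformat>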
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Context Preselection</title>
        <p>If sample size were not an issue, including more contexts would let us learn each stack’s performance
more accurately under various situations. However, recall that in (6), we maintain each stack’s
parameters at the level of individual context vector realizations. The sample size under each context vector
realization decreases exponentially with the number of contexts, eventually leading to overfitting.
To combat the overfitting issue, we propose a context preselection method. To allow this context
preselection, we enable an initial data collection phase with a small percentage of traffic, where all the
qualifying secondary stacks are selected uniformly at random, and the corresponding contexts and
user engagement are logged. This is also known as $\epsilon$-greedy in the literature of Explore and Exploit.
However, we use this phase only for context preselection and initial context tree construction. The
details of the experiment setup for the data collection phase are presented in Section 5.</p>
        <p>In order to shortlist the contexts and avoid overfitting, we design a simple scheme using the stack
engagement data from the data collection phase. We split the data chronologically: the first time window
is used for training and the second time window is used for evaluation. We treat it as a winning-stack
prediction problem. For each context or set of contexts, we use the statistics of each secondary stack
under the given context to predict the winning stack, as sketched below. When evaluated on the testing data set, the
prediction accuracy should increase when we include one more useful context and decrease when we
include too many contexts and start overfitting.</p>
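        <p>As a hedged sketch of this evaluation (the record format and names are assumptions), the snippet below computes the winning-stack prediction accuracy for a candidate context subset: the training window picks a predicted winner per context realization, which is then compared against the test-window winner.</p>
        <preformat>from collections import Counter, defaultdict

def prediction_accuracy(train, test, ctx_fields):
    """train/test: lists of (context_dict, stack, clicks) observations."""
    def winners(data):
        clicks = defaultdict(Counter)
        for ctx, stack, c in data:
            key = tuple(ctx[f] for f in ctx_fields)  # candidate context subset
            clicks[key][stack] += c
        return {k: cnt.most_common(1)[0][0] for k, cnt in clicks.items()}

    predicted, actual = winners(train), winners(test)
    shared = set(predicted) &amp; set(actual)  # contexts seen in both windows
    if not shared:
        return 0.0
    return sum(predicted[k] == actual[k] for k in shared) / len(shared)</preformat>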
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Operational Flexibility</title>
        <p>Observe that with the context tree structure, it is possible to invoke the stack selection algorithm
at different stages of the search flow. For example, given a context tree (after context preselection)
with structure “query -&gt; platform -&gt; filter -&gt; recall size”, the stack selection algorithm can be applied
pre-retrieval with the context up to “query -&gt; platform -&gt; filter”, as well as post-ranking, i.e. after all the
stacks have already been generated, using the full context. This flexibility makes the solution applicable
to a wide range of use cases. In this example, invoking the algorithm pre-retrieval can reduce cost by
limiting the number of secondary stacks to generate, whereas invoking the algorithm post-ranking can
refine the selection based on the search result.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Secondary Stack Selection Algorithm</title>
      <sec id="sec-4-1">
        <title>4.1. Architecture</title>
        <p>In this section, we present the complete secondary stack selection logic using Thompson sampling.
Recall that our objective is to maximize the secondary stack user engagement in the long run by selecting
the best performing stack based on the context "most of the time", while continuously exploring
the potentially underperforming stacks "from time to time" to capture any non-stationarity. The overall
flow of the secondary stack selection procedure is illustrated in Figure 3. The flow contains two major
components: offline and online.</p>
        <p>In the offline component, we manage the context tree update according to the schemes in Section 3.
This is implemented through a daily automatic data pipeline which reads the backend log and the frontend
log, and joins the information to update the context tree. Recall the data format in (1): the context $x$
and selected secondary stack $a$ come from the backend log, and the engagement data, i.e. the reward $r$,
comes from the frontend log. The updated context tree effectively acts as the "Model", as it contains all the
parameters that are needed for the stack selection.</p>
        <p>In the online part, we conduct the Explore and Exploit operation using Thompson sampling.
At runtime, when a search request is triggered, the different qualifying secondary stacks are computed
and generated in the background. The resulting stacks are returned to the backend engine. At this
point, all the required contexts are available in the backend engine as well. Based on the most recently
published parameters from the data pipeline and the context, the backend engine must make a decision
on which secondary stack to select when there are multiple options. The general scheme to conduct
Thompson sampling is already given in Algorithm 1. However, there are more practical considerations
to be accounted for during implementation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Node statistics reliability</title>
        <p>In normal circumstances, we should always invoke the statistics at a leaf node by traversing down the
context tree shown in Figure 2 using the context obtained at runtime. This allows us to retrieve
the most accurate stack performance history given a context combination. However, the context
preselection in Section 3.3 only eliminates overfitting on average. There can be leaf nodes under which certain
stacks have not been seen enough times. Theoretically, we can let Thompson sampling
handle the tradeoff between explore and exploit and gradually accumulate more observations. Because
of the site-facing nature of our application, we place a further regularization based on the reliability of
the statistics at the node. When the confidence is low due to low impressions, we trace back one level
by going back to the parent node and invoke the statistics carried by the parent node. We repeat this on
the parent node if the impression count is still low, until we reach an ancestor node on the path where the
impression count exceeds a chosen threshold $T_{\text{impression}}$, or we hit the first-level context node.</p>
        <p>Figure 3: (a) Architecture. (b) Example of beta distributions of three stacks with $(\alpha, \beta)$ set to $(20, 80)$, $(27.5, 472.5)$ and $(250, 750)$,
respectively. In this example, stack 3 is confidently better than stack 2, with high concentrations. During
Thompson sampling, stack 3 is preferred to stack 2 almost deterministically (exploit). However, between stack
1 and stack 3, the average engagement rates are not too different. With the lower concentration of stack 1,
there will be more exploration between the two stacks during Thompson sampling.</p>
        <p>In addition, we enforce another guardrail to prevent the worst secondary stack from showing. The
reason for this consideration is that there is no guarantee on the recall quality from each individual stack’s
qualifying condition. Due to the random nature of Thompson sampling, there is a non-zero probability
that a bad secondary stack is selected for exploration. Theoretically, we can rely on the algorithm
itself to converge based on the engagement statistics it collects. However, since we are operating in
production, we need to mitigate the risk of harming business metrics or breaking customers’ trust
through exploration. To address this issue, we apply a regularization by setting a dynamic threshold
using the mean of the beta distribution corresponding to the second worst stack. If the winning stack
from Thompson sampling is worse than the overall performance of the second worst stack among all,
the winning stack is dropped and consequently no secondary stack is shown.</p>
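        <p>The explore-exploit behavior described in the caption of Figure 3(b) can be checked numerically. The short Monte Carlo snippet below estimates how often each of the three example posteriors wins a Thompson draw; it is an illustration, not part of the production system.</p>
        <preformat>import random

stacks = {"stack 1": (20, 80),        # mean 0.20, low concentration
          "stack 2": (27.5, 472.5),   # mean 0.055
          "stack 3": (250, 750)}      # mean 0.25, high concentration
wins = {s: 0 for s in stacks}
for _ in range(100_000):
    draws = {s: random.betavariate(a, b) for s, (a, b) in stacks.items()}
    wins[max(draws, key=draws.get)] += 1
# stack 2 almost never wins against stack 3 (near-deterministic exploit),
# while the wide posterior of stack 1 still wins a nontrivial share (explore).
print(wins)</preformat>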
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Runtime algorithm</title>
        <p>We next outline the main algorithm that conducts the stack selection per search request using Thompson
sampling, incorporating the practical considerations pointed out earlier.</p>
        <p>Algorithm 4 Runtime stack selection with Thompson sampling
1: Input: contextTree, currentContext, eligibleStacks, sessionId
2: Output: selectedStack
3: nodePath ← traverseTreeByCtx(currentContext) ◁ returns a list of nodes from root to leaf
4: θ* ← 0, impression ← 0, i ← d, node ← ∅, ctrList ← [] ◁ d is the depth of the context tree
5: while impression &lt; T_impression and node is not root do ◁ find operating node
6:   node ← nodePath[i]
7:   impression ← sumImpressionsFromNode(node)
8:   i ← i − 1
9: end while
10: while a ∈ eligibleStacks do
11:   seed ← hash(sessionId + query + a)
12:   α, β ← readStatsFromNode(node)
13:   θ ← sample from Beta(α, β) with seed ◁ generate a random number
14:   ctrList.add(α/(α + β))
15:   if θ &gt; θ* then
16:     θ* ← θ
17:     selectedStack ← a
18:   end if
19: end while
20: if θ* &lt; secondSmallest(ctrList) then
21:   selectedStack ← ∅
22: else
23:   logging(requestId, currentContext, selectedStack) ◁ requestId is used as the join key with FE log
24: end if</p>
        <p>The algorithm operates on a per search request level. For each search request, we obtain its
corresponding context and execute Algorithm 4 on all the qualifying secondary stacks to select the final one
to be displayed.</p>
        <p>Due to the random nature of Thompson sampling, without further regulation it is possible for the
same user to experience different selections of secondary stacks under the same query, e.g. when a customer
searches for the same query twice. This can lead to a confusing customer experience, especially within
a short time window. To prevent such a situation from happening, we set a constraint so that a user
always sees the same choice of secondary stack under the same query within a session. A session
is defined as a thirty-minute window uniquely assigned to each visitor. This is achieved in L.11 by
concatenating the session ID, the query, and the stack name, which are all strings, and
using the hash value of the concatenated string as the seed.</p>
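        <p>Putting the pieces together, the following hedged Python sketch of the runtime path walks up from the leaf until the impression threshold is met, draws one seeded beta sample per eligible stack so that the same (session, query, stack) triple always reproduces the same draw, and applies the guardrail from Section 4.2. Function and field names, and the threshold value, are assumptions.</p>
        <preformat>import hashlib
import random

T_IMPRESSION = 100  # illustrative threshold; tuned in production

def sum_impressions(node):
    """Illustrative proxy: total pseudo-counts across stacks at a node."""
    return sum(a + b for a, b in node["stats"].values())

def select_stack(node_path, eligible_stacks, session_id, query):
    """node_path: context-tree nodes from root to leaf for this request."""
    # 1. Walk up from the leaf until a node carries enough impressions.
    node = next((n for n in reversed(node_path)
                 if sum_impressions(n) &gt;= T_IMPRESSION), node_path[0])

    best_stack, theta_star, means = None, -1.0, []
    for stack in eligible_stacks:
        a, b = node["stats"][stack]
        # 2. Seed with session + query + stack (L.11 of Algorithm 4) so a
        #    returning user sees the same choice within a session.
        seed = hashlib.md5((session_id + query + stack).encode()).hexdigest()
        theta = random.Random(seed).betavariate(a, b)
        means.append(a / (a + b))
        if theta &gt; theta_star:
            best_stack, theta_star = stack, theta

    # 3. Guardrail: drop the winner if its sample is below the mean
    #    engagement rate of the second worst stack.
    if len(means) &gt;= 2 and theta_star &lt; sorted(means)[1]:
        return None
    return best_stack</preformat>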
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup and Results</title>
      <p>In this section, we explain how the AB tests are conducted in production on Walmart.com, which
include a data collection phase and a full explore-exploit phase.</p>
      <p>There are a total of ∼ 10 secondary stacks live in production. In the control group, the stack selection
logic is exactly the same as in production, i.e. following a manually configured priority that is universal
w.r.t. context. Consequently, one stack is always preferred over another according to the manual
priority rule, regardless of the context. As previously mentioned, we have two phases. In the data
collection phase, the goal is to collect unbiased data to initialize the context tree. It can also be interpreted
as learning the prior distributions. No significant business metrics improvement is expected in the
variant group. In the full explore-exploit phase, the goal is to show that the stack selection algorithm,
Algorithm 4, together with the context tree is effective at improving the Converted Visitor (CV) rate.</p>
      <sec id="sec-5-1">
        <title>5.1. Data collection phase</title>
        <p>In this phase, we set up an AB test with a small percentage of the production traffic, 3% in both the control
and the variant group. In the variant group, we aim to collect unbiased data to initialize the context tree.
To achieve that, we allow uniform random triggering among all the qualifying secondary stacks.
The corresponding context encountered and the actual user engagement are logged together with the
selected secondary stack. In theory, we should only need the data from the variant group. However,
we set up the control group to monitor the impact of random triggering. If the uniform random
triggering introduces so much suboptimality that it starts to show a statistically significant loss in CV, we
must stop or reduce the traffic of this experiment in order not to harm the business. This experiment is
set up on three platforms: web, iOS, and Android. The business metrics are closely monitored.
During the data collection period, no statistically significant lift or drop is detected in the business
metrics. Therefore, we keep this test running at low traffic over the course of four weeks.</p>
        <p>As pointed out in Section 3.3, the goal of data collection is to preselect the useful contexts and initialize
the context tree. We start with the following contexts based on business intuition: query or PT, platform,
whether a fulfillment-based filter is applied, and the customer’s Walmart membership status. We briefly
explain the query / PT context. Since the secondary stack is shown on the search page, we believe the
query plays an important role. However, it is not always possible to aggregate the statistics on a per-query
level, as the query itself is free text and the number of unique queries is unbounded. For frequently
searched queries, we consider the normalized query¹ as one context. For infrequently searched queries,
we use the PT as the corresponding context. A query PT can be understood as a quantization of the actual
queries where the cardinality is bounded (∼ 8000 in Walmart production). By conducting the context
preselection scheme discussed in Section 3.3, we find that only two levels of context, query / PT and platform,
contribute positively to the overall prediction accuracy. The various combinations of contexts and their
corresponding prediction accuracies are reported in Table 1. The context preselection is learned based on
ten consecutive days of data and evaluated on the next five days of data. After the context preselection, we
initialize the context tree using the four weeks of collected data by executing Algorithms 2 and 3 day by
day.</p>
        <p>Note that in the context tree initialization, in order to quickly establish the statistics, we use an
adaptation scheme for the discount factor $\gamma$ in Algorithm 3, where we replace $\gamma$ with $\gamma_1$ and $1 - \gamma$ with
$\gamma_0$. This is summarized as
$$(\gamma_0, \gamma_1) = \begin{cases} (0.99,\, 1) &amp; \text{if impression} &lt; 300 \\ (0.9,\, 0.1) &amp; \text{if impression} &gt; 3000 \\ \text{linear interpolation in } [0.99, 0.9] \text{ and } [1, 0.1] &amp; \text{otherwise,} \end{cases}$$
where impression is the total number of impressions from all observed secondary stacks at the first-level
(query or PT) node.</p>
        <p>¹A normalized query is obtained by standardizing the expression via stemming, such as merging singular/plural forms.</p>
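        <p>A hedged sketch of this adaptation is given below; the endpoint values are from the scheme above, while the exact linear form in the middle region is our reading of “linear interpolation”.</p>
        <preformat>def adaptive_gammas(impressions):
    """Return (gamma_0, gamma_1) for the initialization-phase update:
    heavy weight on fresh data while a first-level node is cold,
    standard discounting once it is warm."""
    if impressions &lt; 300:
        return 0.99, 1.0
    if impressions &gt; 3000:
        return 0.9, 0.1
    t = (impressions - 300) / (3000 - 300)  # position in [0, 1]
    return 0.99 - 0.09 * t, 1.0 - 0.9 * t   # interpolate [0.99,0.9] and [1,0.1]</preformat>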
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Explore-Exploit phase</title>
        <p>At this point, we have all the components to start the full explore-exploit AB test. In this phase, we set
up a regular AB test with higher production search traffic, 15% each in the control and the variant group. The
goal of this test is to show that the scheme can move business and search metrics up. In both the control
and the variant group, we have the same set of secondary stacks. Recall that in the control group, the
selection of the secondary stack is pre-configured by a prioritization order, regardless of any context.
Similar to the data collection phase, the experiment is set up on three platforms: web, iOS, and Android.
In the variant group, we test the explore-exploit algorithm by conducting Thompson
sampling as in Algorithm 4. The tests are run for a two-week period in production. We report the search
metrics, secondary stack specific metrics, and engagement metrics in Tables 2, 3 and 4, respectively.
The numbers in color (or with superscripts) indicate statistical significance with p-value &lt; 0.1, where
green (or superscript +) generally indicates a change in the “good" direction and red (or
superscript -) generally indicates a change in the “bad" direction.</p>
        <p>We next explain how to interpret the metrics encountered in the tables. From the search metrics
table, ATC stands for “Add To Cart". With this definition, Search
ATC / Visitor is intuitively easy to understand (the higher the better). The next two metrics, Search Direct ATC / Visitor and Search Rf
Item Page ATC / Visitor, measure the ATC from two sources: the search page directly, and the item page
after the customer clicks on an item from the search page (the higher the better). The last metric, Clicks
to ATC, is a search efficiency metric, defined as the average number of clicks it takes for a customer to
add an item to cart (the lower the better).</p>
        <p>From the secondary stack specific metrics, we report the stack surface rate and the number of secondary
stack add-to-carts per visitor. Note that the decrease in the surface rate of secondary stacks is expected
because of the guardrail we impose during Thompson sampling, as discussed in Section 4.2. The key
observation here is that the number of ATCs from secondary stacks has significantly increased across
platforms despite a lower surface rate. This shows that the algorithm is triggering the secondary stacks
less but targeting customers more accurately.</p>
        <p>Overall, the experimental results show that the algorithm is highly effective in targeting the
right secondary stack selection, based on the secondary stack metrics. The improvement in search
metrics and business metrics is mostly reflected in the test on iOS. Our conjecture is that these are more
general metrics and therefore require a longer time (more data) for the effect to propagate. Since the
traffic on iOS is relatively large, we already observe the effect there.</p>
        <p>We briefly discuss a few practical considerations in productionizing the algorithm, including
how to onboard a new secondary stack and conduct AB tests with new stacks. In order to keep the
maintenance of the context tree manageable, we keep only one version of the tree at any time. When
a new secondary stack enters the competition, an AB test needs to be conducted to evaluate the
effectiveness of the new stack. Initial statistics, such as a uniform distribution on $[0, u]$ where $u \leq 1$,
can be assigned to the new stack on Day 1 based on the average of all qualifying stacks in production.
Once the experiment has started, the engagement data and corresponding context of the new stack will
be collected. In both the control and the variant group, we conduct Thompson sampling by accessing the same
context tree. However, in the control group, the new stack is not considered a qualifying stack. The
data collected from the control and variant groups contributes equally to updating the context tree.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We optimize the search page secondary stack (recommender) selection using a low-cost, lightweight
contextual multi-armed bandit approach. The algorithm has two major components: an offline data
pipeline that maintains the statistics of each secondary stack under various contexts, and an online stack
selection using Thompson sampling. We design a context tree data structure that allows Thompson
sampling to be conducted with full or partial contexts. This flexibility makes the solution general
enough to be invoked at various stages of the search flow, depending on the use case. We achieve the
results by conducting the experiment in two phases. In the data collection phase, we collect unbiased
data on all the secondary stacks by enabling uniform selection among all qualified stacks. In the
explore-exploit phase, we use Thompson sampling to handle the tradeoff between explore and exploit.
The AB test results show positive impact on business metrics, search metrics, and secondary stack
metrics. In particular, within the secondary stack metrics, we are able to target the triggering of the
stacks more accurately, reflected in a significant increase in the secondary stack Add To Cart per Visitor
rate.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-1"><mixed-citation>[1] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933) 285–294.</mixed-citation></ref>
      <ref id="ref-2"><mixed-citation>[2] T. L. Lai, H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics 6 (1985) 4–22.</mixed-citation></ref>
    </ref-list>
  </back>
</article>