<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Practical Secondary Stack Optimization on Search Pages: A Lightweight Contextual Bandit Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eva C. Song</string-name>
          <email>eva.song@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun Zhao</string-name>
          <email>jun.zhao3@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junchao Zheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vivek Agrawal</string-name>
          <email>vivek.agrawal@walmart.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Walmart Global Tech</institution>
          ,
          <addr-line>Hoboken NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Walmart Global Tech</institution>
          ,
          <addr-line>Sunnyvale CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>1516</fpage>
      <lpage>1541</lpage>
      <abstract>
        <p>Secondary stacks – recommender modules displayed alongside core search results – are a key driver of engagement and conversion on E-commerce search pages. A secondary stack suitable for display should generally demonstrate relevance to the query and good potential for engagement. An E-commerce website can have many such secondary stack modules that feature different themes and are powered by different content generation algorithms. Due to space limitations, not all available stacks can be displayed, making it necessary to prioritize those most likely to perform well under the current search context, such as the query. Additionally, retrieval for many stacks is computationally non-trivial. Hence, it is often beneficial to conduct a pre-retrieval selection, when only partial context is known, to limit the number of stacks to retrieve, and then prioritize based on the full context only among the retrieved stacks. Lastly, stack performance is dynamic, varying with evolving content and user behavior. We introduce a lightweight, highly flexible solution that frames secondary stack prioritization as a contextual multi-armed bandit problem. Our approach features a low-cost offline data pipeline and a real-time selection algorithm based on Thompson sampling, enabling seamless deployment at any stage of the search flow, such as pre-retrieval, post-ranking, or anywhere contextually appropriate. The system dynamically adapts to shifting contexts without requiring complex retraining, and thrives where large models often overfit, such as with low-visibility stacks that receive sparse impressions. AB tests on Walmart.com show that our method delivers significant gains in user engagement metrics, validating a solution that is fast to implement, cost-efficient to serve, and robust in production. The algorithm has been successfully launched and continues to drive improved search experiences at scale.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-armed bandit</kwd>
        <kwd>Search Page Optimization</kwd>
        <kwd>Secondary Stack</kwd>
        <kwd>Thompson sampling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In E-commerce search experiences, Secondary Stacks – recommender modules displayed alongside or
below primary search results – play a vital role in driving user engagement, discovery, and conversion.
With limited real estate on the search page and a growing number of available stacks powered by
diverse algorithms, efficiently prioritizing which modules to display has become increasingly critical.</p>
      <p>On the Walmart.com website, a secondary stack can be shown at the bottom of the first page of
search results. At the time of writing, there are ∼ 10 types of secondary stacks live in production and
several more under development. Each of these secondary stacks features a different aspect of business
consideration, and the content within each secondary stack is generated with a separate algorithm.
For example, some secondary stacks are personalized based on users’ past shopping history, while
others are generic to all users. However, they all share a very similar front-end treatment and the
same space on the search page. As the number of different types of secondary stacks increases, several
stacks may be qualified to be shown on the same search page and consequently, a choice must be made.
It is worth noting that retrieval for many stacks is non-trivial, with some as costly as the
primary search retrieval. Therefore, the ideal framework should be invokable at both the pre-retrieval stage,
where it needs to operate on partial context, and the post-retrieval stage, to conduct the final selection
based on the full context. Prior to this work, the selection of the secondary stacks was based on a
set of business heuristic rules. Under these rules, the prioritization of the stacks is predetermined.
This rule-based prioritization is suboptimal, as each secondary stack can demonstrate different user
engagement patterns under different contexts. One such context is the query itself. An example
illustrating that different stacks should be prioritized under different contexts is shown in Figure 1. In
addition to the lack of contextual awareness, the engagement pattern of each individual stack can
also change over time. In order to address this issue, we treat it as a contextual multi-armed bandit
problem, where an explore-exploit algorithm learns the dynamics of all the secondary stacks under
various contexts by continuously collecting users’ engagement data. When multiple secondary stacks
are qualified to be displayed on the same search page, the explore-exploit algorithm decides whether
to show the most promising stack based on past users’ engagement data (exploit) or show another
stack so that we continue taking measurements on the other stacks in order to capture any change in
stacks’ behaviors over time (explore). An efficient explore-exploit algorithm allows us to achieve the
best engagement from all secondary stacks on average in the long run.</p>
      <p>Figure 1: (a) A stack featuring personalized cross-selling options; (b) a stack featuring highly rated items.</p>
      <p>The multi-armed bandit is a framework that handles the tradeoff between explore and exploit. In
this setting, at each time instance, an arm is selected and a reward (or regret) from choosing the arm is
observed. The selected arm and its corresponding reward then contribute to the historical observations and
are used to make the next decision. This area has been intensively studied in the literature since 1933
[1]. Many algorithms are well established, including $\epsilon$-greedy, Upper Confidence Bound
(UCB) [2, 3, 4], and Thompson sampling [5, 6, 7]. The corresponding theoretical bounds on expected
regret over time have also been studied [2, 8, 9, 5, 10]. The contextual multi-armed bandit further assumes
that the selection of the best arm and the corresponding reward depend on another random variable,
the context. This setting has been applied to many areas, including recommender systems [11]. In
recent years, deep contextual bandits [12] have also been intensively studied, applied to similar problems,
and demonstrated great results.</p>
      <p>Traditional complex modeling approaches often require high-quality, large-scale data and heavy
infrastructure, making them impractical for a dynamic and noisy environment where many stacks
can suffer from low visibility and sparse impressions. In this work, we present a lightweight, flexible,
and production-ready solution based on contextual multi-armed bandits that efficiently selects
high-performing secondary stacks with minimal cost.</p>
      <p>The main contributions of this paper include:
• Framing the problem of search page secondary stack selection as a contextual multi-armed bandit,
and designing the offline data pipeline to maintain contextual statistics and the online selection using
Thompson sampling.
• Designing the context tree data structure with
– operational flexibility that allows the sampling to be conducted with full or partial context;
– robustness that allows tracing back to higher levels of the tree when the confidence in
leaf-node statistics is low due to low impressions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formulation and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Contextual Multi-Armed Bandit Setting</title>
        <p>We identify the possible arms as the different types of search page secondary stack (recommender):
$$\mathcal{A} = \{\text{recommender } 1, \ldots, \text{recommender } K\}.$$</p>
        <p>We emphasize that we focus on the selection of the secondary stack, rather than the content generation
within each individual secondary stack. The content within each secondary stack is assumed to have
been generated. All these secondary stacks share many common characteristics, such as similar
frontend User Interface (UI) treatment and position on the search page where they can be displayed when
selected. We focus on making the optimal selection of a stack for a fixed position on the search page.</p>
        <p>We define the user engagement (number of item clicks) from a secondary stack as the reward $r$.
From historical observations, we can form a data set as
$$\mathcal{D}_n = \{(x_t, a_t, r_t)\}_{t=1}^{n}, \quad (1)$$
where the context $x_t$ itself can be a vector, and the corresponding selected stack (arm) $a_t$ and observed
reward $r_t$ are recorded. We then obtain an induced distribution
$$P_{a_t, r_t \mid x_t, \mathcal{D}_{t-1}} = P_{a_t \mid x_t, \mathcal{D}_{t-1}}\, P_{r_t \mid a_t, x_t, \mathcal{D}_{t-1}}, \quad t = 1, 2, \ldots$$
Note that the arm selection scheme $P_{a_t \mid x_t, \mathcal{D}_{t-1}}$ and the corresponding reward distribution $P_{r_t \mid a_t, x_t, \mathcal{D}_{t-1}}$ can be
stochastic. The goal is to obtain an optimal arm selection scheme $P_{a_t \mid x_t, \mathcal{D}_{t-1}}$ that maximizes the
expected reward over time, i.e.
$$\max_{\{P_{a_t \mid x_t, \mathcal{D}_{t-1}}\}_{t=1}^{n}} \; \sum_{t=1}^{n} \mathbb{E}\left[r_t \mid x_t, \mathcal{D}_{t-1}\right]. \quad (2)$$</p>
        <p>As for the reward, we use the user engagement from the secondary stack by assuming that the number of
engaged items $r$ from a search page secondary stack follows a binomial distribution
$$r \sim \mathcal{B}(M, \Theta), \quad (3)$$
where $M$ is the maximum number of engaged items from the secondary stack, and $\Theta$ is the success rate
of each item from the secondary stack. We let the parameter $\Theta$ itself be a random variable
$$\Theta \sim \mathrm{Beta}(\alpha, \beta), \quad (4)$$
where $\alpha$ and $\beta$ are hyperparameters. Since the beta distribution is a conjugate prior for the binomial, after $n$ observations we have a
computationally efficient way of obtaining the posterior distribution
$$\Theta \mid \mathcal{D}_n \sim \mathrm{Beta}\left(\alpha + \sum_{i=1}^{n} r_i,\; \beta + \sum_{i=1}^{n} (M - r_i)\right). \quad (5)$$</p>
        <p>In the contextual setting, we maintain one such posterior per arm and context realization:
$$\Theta(a, x) \sim \mathrm{Beta}\left(\alpha_0 + \sum_{i=1}^{n} r_i(a, x),\; \beta_0 + \sum_{i=1}^{n} \big(M - r_i(a, x)\big)\right), \quad (6)$$
$$r_t \mid (a_t = a,\, x_t = x,\, \Theta_{t-1}) \sim \mathcal{B}\big(M, \Theta_{t-1}(a, x)\big),$$
where $\alpha_0$ and $\beta_0$ are universal initial values to be determined experimentally, and $r(a, x)$ denotes the
number of clicked items from the secondary stack $a$ under context $x$.</p>
        <p>Since we choose the reward as the number of engaged items from the selected secondary stack, which
in turn is modeled according to the binomial distribution given in (3), and $M$ is a fixed parameter to be
determined later, the optimization given in (2) simplifies as follows:
$$\max_{\{P_{a_t \mid x_t, \Theta_{t-1}}\}_{t=1}^{n}} \; \sum_{t=1}^{n} \mathbb{E}\left[r_t \mid x_t, \Theta_{t-1}\right], \quad (7)$$
where the expectation is w.r.t. the joint distribution
$$P_{x, a, r, \Theta_{t-1}} = P_{x, \Theta_{t-1}}\, P_{a \mid x, \Theta_{t-1}}\, P_{r \mid a, x, \Theta_{t-1}}.$$</p>
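        <p>To make the conjugate update in (5)-(6) concrete, the following minimal Python sketch maintains per-(stack, context) beta parameters under the binomial engagement model; the dictionary layout and names are illustrative, not the paper’s implementation.</p>
        <preformat>from collections import defaultdict

M = 3  # cap on engaged items per impression (Section 3.1 sets M = 3)

# Beta hyperparameters per (stack, context) pair, starting from the
# uniform prior Beta(1, 1) discussed in Section 3.1.
alpha = defaultdict(lambda: 1.0)
beta = defaultdict(lambda: 1.0)

def update_posterior(stack, context, clicks):
    """Conjugate update from (5)/(6): one observation with r = clicks."""
    r = min(clicks, M)               # reward is capped at M successes
    alpha[(stack, context)] += r     # successes
    beta[(stack, context)] += M - r  # failures</preformat>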
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Related Work and Our Approach</title>
        <p>In our setting (7), we formulate the problem as contextual multi-armed bandit in order to capture each
secondary stack’s performance under diferent search contexts, such as queries. In addition, we adopt a
Bayesian approach to exploit the user engagement in order to update the posterior distributions of the
parameters in (6).</p>
        <p>Ref. [11] proposes a contextualized approach, LinUCB, based on the traditional UCB, where the
expected reward is assumed to take a linear form in the context vector with some unknown parameters
to be estimated, i.e. $\mathbb{E}[r \mid x, \Theta] = x^{\top}\Theta$. This result also has a natural Bayesian interpretation when
$\Theta$ follows a Gaussian distribution. Similarly, the linear expected reward treatment is also studied using
Thompson sampling [13]. However, both approaches intrinsically assume a continuous, unbounded
form of reward. These formulations are not directly applicable to our problem since our reward is
discrete.</p>
        <p>Without the intricacies of the context, the problem can be addressed in the Bernoulli Bandit
setting. Thompson sampling is particularly efficient in conducting the explore-exploit of the Bernoulli
Bandit. However, the Bernoulli Bandit is not as flexible in handling the inclusion of contexts. One way of
handling the contextualization is by quantization, such as Lipschitz contextual bandits [14, 15, 16, 17].
The general idea is that the distributions of the parameters are maintained separately for a finite set of
context realizations after the quantization of the context space. When the cardinality of the context is
small, it is possible to skip the quantization. This is the approach we adopt by making sure that only
the most prominent contexts are kept through a context preselection.</p>
        <p>To solve for (7), we use Thompson sampling [18]. The simplified version of the protocol is summarized
as follows.</p>
        <p>Algorithm 1 Thompson sampling for stack selection
Require:</p>
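        <p>As a concrete illustration of the protocol, the Python sketch below samples a success rate for each eligible stack from its beta posterior, displays the arm with the largest sample, and folds the observed engagement back into the statistics. It is an illustrative sketch under assumed names, not the production implementation.</p>
        <preformat>import random

def thompson_select(stats, eligible_stacks, context):
    """One round of Thompson sampling over the stacks (sketch only)."""
    best_stack, best_theta = None, -1.0
    for stack in eligible_stacks:
        a, b = stats[(stack, context)]    # Beta(a, b) posterior for this arm
        theta = random.betavariate(a, b)  # sample a plausible success rate
        if theta &gt; best_theta:
            best_stack, best_theta = stack, theta
    return best_stack

# After displaying best_stack and observing r engaged items (r &lt;= M),
# the posterior is updated as in (5): a += r, b += M - r.</preformat>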
        <p>Algorithm 1 only serves as a high-level overview of the flow. The actual realization of this algorithm
does a batch update with delay and is discussed in the following sections. In the rest of the paper, we
first discuss the methodology of identifying the context for our business problem and the data structure
we use to maintain the statistics under such contexts. We then present the secondary stack selection
algorithm, which contains an offline data processing part and an online stack selection part including
data logging. We also discuss the implementation details that address several practical considerations,
including an initial data collection mode, handling data delays, hashing, and intricacies in conducting
AB tests.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Context Tree</title>
      <p>From business intuition and domain knowledge, we start by identifying the following contexts that can
affect the performance (user engagement rate) of a secondary stack. Such contexts broadly include but
are not limited to:
a) search context, such as the query, the Product Type (PT) associated with the query, and how many
exact matched items are retrieved;
b) session context, such as device platform and search filters applied;
c) customer context, such as the customer’s membership status.</p>
      <p>The use of context trees is well known in the literature of data compression [19]. Here we use the
data structure to maintain each stack’s statistics on the context level.</p>
      <sec id="sec-3-1">
        <title>3.1. Building the Context Tree</title>
        <p>To keep the notation generic, assume we have the following $d$ contexts, ranked from the most to the
least important after context preselection:
$$x = (\mathrm{ctx}_1, \ldots, \mathrm{ctx}_d).$$
Each context $\mathrm{ctx}_i$ takes on a finite number of values $|C_i|$. The structure of a context tree is shown in
Figure 2. As a point of reference, our context tree is constructed with six layers.</p>
        <p>We next explain the nodes in the context tree. A node at depth $l$ contains the engagement statistics
of each secondary stack under a realization of contexts $\mathrm{ctx}_1, \ldots, \mathrm{ctx}_l$, which can be traversed from the
root node. Consequently, the statistics contained at any given node are an aggregation of the statistics of
all its direct child nodes.</p>
        <p>Within each node, the engagement statistics of each secondary stack are given by two parameters,
which correspond to the hyperparameters of the beta distribution given in (5). That is, by looking at
both parameters, we can efficiently obtain the posterior distribution of $\Theta(a, x)$ for a given secondary
stack $a$ and context realization $x$ that is indicated by the traversal trajectory that leads to the node from
the root.</p>
        <p>To construct the tree based on a historical data set, we can first construct all the leaf nodes in the
tree and then recursively compute the statistics at all the other nodes bottom up. This is presented in
Algorithm 2.</p>
        <p>In Algorithm 2, the maximum number of engaged items from a secondary stack, $M$, as mentioned in
(3), is set to 3. This is chosen based on our empirical study of clicks from secondary stacks. In the rest of the
algorithm, we first construct all the leaf nodes under the $|C_1| \times \ldots \times |C_d|$ realizations. As a result of the
simple form of the Bayesian update in (5), we can maintain only two parameters per stack and context
realization by counting the engagement directly (L.7-8 of Algorithm 2). Consequently, the count of engagement at a
given node is simply the sum of engagement from all its child nodes. This allows us to easily compute
the statistics by recursion.</p>
        <p>Note that the context tree constructed by executing Algorithm 2 cannot be directly used as the
Bayesian statistics yet, because we have not assigned any prior distribution. If we assume a uniform prior
of $\Theta$ from (3) on $[0, 1]$, we can set $\alpha = \beta = 1$ in (4). Therefore, with the uniform prior assumption, we
can use the context tree from Algorithm 2 as the Bayesian statistics by adding 1 to all the $\alpha$ and $\beta$ values at
each node.</p>
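        <p>As a concrete sketch of the construction Algorithm 2 describes, the Python below accumulates engagement counts along each leaf path (so every internal node automatically holds the sum of its descendants, equivalent to the bottom-up recursion) and then adds 1 to every parameter for the uniform prior. The record format and dictionary layout are assumptions for illustration.</p>
        <preformat>M = 3  # maximum number of engaged items per impression

def build_context_tree(records, depth):
    """records: iterable of (context_tuple, stack, clicks) observations."""
    root = {"stats": {}, "nodes": {}}
    for ctx, stack, clicks in records:
        r = min(clicks, M)
        node = root
        for level in range(depth + 1):  # root plus one node per context level
            a, b = node["stats"].get(stack, (0.0, 0.0))
            node["stats"][stack] = (a + r, b + (M - r))
            if level &lt; depth:
                node = node["nodes"].setdefault(
                    ctx[level], {"stats": {}, "nodes": {}})
    return root

def add_uniform_prior(node):
    """Turn raw counts into beta statistics by adding the Beta(1, 1) prior."""
    node["stats"] = {s: (a + 1.0, b + 1.0)
                     for s, (a, b) in node["stats"].items()}
    for child in node["nodes"].values():
        add_uniform_prior(child)</preformat>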
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Updating the Context Tree</title>
        <p>In Algorithm 2, we provide a way of constructing a context tree from one batch of data. Next, we
present a scheme to keep updating the Bayesian statistics with new observations.</p>
        <p>The frequency of updating the context tree can be application driven. We focus on a batch update
scheme, where we choose one day as the batch size or update frequency. However, the scheme is
generalizable to any update batch size. We provide an algorithm to update the context tree on a daily
basis in Algorithm 3.</p>
        <p>Observe that at any point of the traversal, oldTree and oneDayTree have exactly the same structure.
Therefore, we can traverse the two trees together based on the child node keys of either tree. In
Algorithm 3, we provide a way of efficiently updating the posterior distribution of $\Theta(a, x)$ by batch. If
it is the first batch of data, oldTree is initialized to reflect uniform distributions of $\Theta(a, x)$ for every
secondary stack $a$ and context realization $x$, i.e. $\mathrm{Beta}(1, 1)$. Exploiting the identical tree structure, each
node with the corresponding statistics can be traversed and updated using Depth First Search (DFS).</p>
        <p>In order to preserve the "Explore" capability in Thompson sampling, a discount factor $\gamma \in [0, 1]$ is
enforced to prevent the parameters $\alpha$ and $\beta$ from growing unboundedly. This ensures that L.5 in
Algorithm 1 does not become deterministic.</p>
        <p>The rationale for the parameter update in L.20-21 of Algorithm 3 is the following result in
Proposition 1.</p>
        <p>Proposition 1. Given two random variables
$$\Phi_0 \sim \mathrm{Beta}(\alpha_0, \beta_0) \quad \text{and} \quad \Phi_1 \sim \mathrm{Beta}(\alpha_1, \beta_1)$$
with the corresponding probability density functions $f_{\alpha_0, \beta_0}(\cdot)$ and $f_{\alpha_1, \beta_1}(\cdot)$, respectively, the distribution
defined by a new probability density function $f^{(\gamma)}(\cdot)$ of the tilted form
$$\log f^{(\gamma)}(\theta) \triangleq (1 - \gamma) \log f_{\alpha_0, \beta_0}(\theta) + \gamma \log f_{\alpha_1, \beta_1}(\theta) + \text{constant}$$
is again a beta distribution,
$$\mathrm{Beta}\big((1 - \gamma)\alpha_0 + \gamma\alpha_1,\; (1 - \gamma)\beta_0 + \gamma\beta_1\big).$$
This follows because $\log f_{\alpha, \beta}(\theta) = (\alpha - 1)\log\theta + (\beta - 1)\log(1 - \theta) + \text{constant}$, so the convex combination of the two log-densities yields exponents $(1 - \gamma)\alpha_0 + \gamma\alpha_1 - 1$ and $(1 - \gamma)\beta_0 + \gamma\beta_1 - 1$.</p>
        <p>Note that the constant term itself is equal to $(1 - \gamma) D_{\gamma}(\Phi_1 \| \Phi_0)$, where $D_{\gamma}(\Phi_1 \| \Phi_0)$ is known as the
Rényi divergence [20].</p>
        <p>By updating the hyperparameters as a convex combination of the ones from the oldTree and the
oneDayTree as in L.20-21, we obtain the tilted distribution.</p>
        <p>The discount factor $\gamma$ itself can be adaptive. We discuss more on the adaptation scheme in Section 5.</p>
        <p>Algorithm 3 Update context tree by batch
1: oldTree ← load existing context tree
2: oneDayTree ← construct tree with Algorithm 2 with one day of data
3: if oldTree == ∅ then ◁ first day, no previous tree
4:   oldTree ← uniform tree
5: end if
6: updatedTree ← updateTree(oldTree, oneDayTree, γ)
7: function updateTree(oldTree, oneDayTree, γ)
8:   if oldTree == ∅ then ◁ base case
9:     return ∅
10:  end if
11:  newTreeNodes, stats ← {}, {}
12:  childNodes ← oldTree["nodes"]
13:  while ctxVal ∈ childNodes do
14:    newSubTree ← updateTree(childNodes[ctxVal], oneDayTree["nodes"][ctxVal], γ) ◁ DFS
15:    newTreeNodes[ctxVal] ← newSubTree
16:  end while
17:  while a ∈ 𝒜 do
18:    α₀, β₀ ← oldTree["stats"][a]["α"], oldTree["stats"][a]["β"]
19:    α₁, β₁ ← oneDayTree["stats"][a]["α"], oneDayTree["stats"][a]["β"]
20:    α ← (1 − γ) · α₀ + γ · α₁
21:    β ← (1 − γ) · β₀ + γ · β₁
22:    stats[a] ← {"α": α, "β": β}
23:  end while
24:  return {"stats": stats, "nodes": newTreeNodes}
25: end function</p>
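        <p>For readers who prefer running code, here is a hedged Python rendering of updateTree from Algorithm 3: a depth-first traversal that replaces each node’s beta parameters with the convex combination justified by Proposition 1. The tree layout mirrors the construction sketch in Section 3.1; names are illustrative.</p>
        <preformat>def update_tree(old, one_day, gamma):
    """Discounted merge of two same-shaped context trees (Algorithm 3 sketch)."""
    if old is None:  # base case
        return None
    # DFS into the children, relying on the identical tree structures
    nodes = {ctx_val: update_tree(child, one_day["nodes"][ctx_val], gamma)
             for ctx_val, child in old["nodes"].items()}
    stats = {}
    for stack, (a0, b0) in old["stats"].items():
        a1, b1 = one_day["stats"][stack]
        # Proposition 1: the tilted distribution is again a beta distribution
        stats[stack] = ((1 - gamma) * a0 + gamma * a1,
                        (1 - gamma) * b0 + gamma * b1)
    return {"stats": stats, "nodes": nodes}</preformat>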
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Context Preselection</title>
        <p>If sample size were not an issue, including more contexts would let us learn each stack’s performance
more accurately under various situations. However, recall that in (6), we maintain each stack’s
parameters at the level of individual context vector realizations. The sample size under each context vector
realization decreases exponentially with the number of contexts, eventually leading to overfitting.
To combat the overfitting issue, we propose a context preselection method. To allow this context
preselection, we enable an initial data collection phase with a small percentage of traffic, where all the
qualifying secondary stacks are selected uniformly at random, and the corresponding contexts and
user engagement are logged. This is also known as $\epsilon$-greedy in the literature of Explore and Exploit.
However, we use this phase only for context preselection and initial context tree construction. The
details of the experiment setup for the data collection phase are presented in Section 5.</p>
        <p>In order to shortlist the contexts and avoid overfitting, we design a simple scheme using the stack
engagement data from the data collection phase. We split the data chronologically: the first time window
is used for training and the second time window is used for evaluation. We treat it as a winning-stack
prediction problem. For each context or set of contexts, we use the statistics of each secondary stack
under the given context to predict the winning stack, as sketched below. When evaluated on the testing data set, the
prediction accuracy should increase when we include one more useful context and decrease when we
include too many contexts and start overfitting.</p>
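        <p>As a hedged sketch of this evaluation (the record format and names are assumptions), the snippet below computes the winning-stack prediction accuracy for a candidate context subset: the training window picks a predicted winner per context realization, which is then compared against the test-window winner.</p>
        <preformat>from collections import Counter, defaultdict

def prediction_accuracy(train, test, ctx_fields):
    """train/test: lists of (context_dict, stack, clicks) observations."""
    def winners(data):
        clicks = defaultdict(Counter)
        for ctx, stack, c in data:
            key = tuple(ctx[f] for f in ctx_fields)  # candidate context subset
            clicks[key][stack] += c
        return {k: cnt.most_common(1)[0][0] for k, cnt in clicks.items()}

    predicted, actual = winners(train), winners(test)
    shared = set(predicted) &amp; set(actual)  # contexts seen in both windows
    if not shared:
        return 0.0
    return sum(predicted[k] == actual[k] for k in shared) / len(shared)</preformat>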
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Operational Flexibility</title>
        <p>Observe that with the context tree structure, it is possible to invoke the stack selection algorithm
at different stages of the search flow. For example, given a context tree (after context preselection)
with structure “query -&gt; platform -&gt; filter -&gt; recall size”, the stack selection algorithm can be applied
pre-retrieval with the context up to “query -&gt; platform -&gt; filter”, as well as post-ranking, i.e. after all the
stacks have already been generated, using the full context. This flexibility makes the solution applicable
to a wide range of use cases. In this example, invoking the algorithm pre-retrieval can reduce cost by
limiting the number of secondary stacks to generate, whereas invoking the algorithm post-ranking can
refine the selection based on the search result.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Secondary Stack Selection Algorithm</title>
      <sec id="sec-4-1">
        <title>4.1. Architecture</title>
        <p>In this section, we present the complete secondary stack selection logic using Thompson sampling.
Recall that our objective is to maximize the secondary stack user engagement in the long run by selecting
the best performing stack based on the context "most of the time", while continuously exploring
the potentially underperforming stacks "from time to time" to capture any non-stationarity. The overall
flow of the secondary stack selection procedure is illustrated in Figure 3. The flow contains two major
components: offline and online.</p>
        <p>In the offline component, we manage the context tree update according to the schemes in Section 3.
This is implemented through a daily automatic data pipeline which reads the backend log and the frontend
log, and joins the information to update the context tree. Recall the data format in (1): the context $x$
and selected secondary stack $a$ come from the backend log, and the engagement data, i.e. the reward $r$,
comes from the frontend log. The updated context tree effectively acts as the "Model", as it contains all the
parameters that are needed for the stack selection.</p>
        <p>In the online part, we conduct the Explore and Exploit operation using Thompson sampling.
At runtime, when a search request is triggered, the different qualifying secondary stacks are computed
and generated in the background. The resulting stacks are returned to the backend engine. At this
point, all the required contexts are available in the backend engine as well. Based on the most recently
published parameters from the data pipeline and the context, the backend engine must make a decision
on which secondary stack to select when there are multiple options. The general scheme to conduct
Thompson sampling is already given in Algorithm 1. However, there are more practical considerations
to be accounted for during implementation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Node statistics reliability</title>
        <p>In normal circumstances, we should always invoke the statistics at a leaf node by traversing down the
context tree shown in Figure 2 using the context obtained at runtime. This allows us to retrieve
the most accurate stack performance history given a context combination. However, the context
preselection in Section 3.3 only eliminates overfitting on average. There can be leaf nodes under which certain
stacks have not been seen enough times. Theoretically, we can let Thompson sampling
handle the tradeoff between explore and exploit and gradually accumulate more observations. Because
of the site-facing nature of our application, we place a further regularization based on the reliability of
the statistics at the node. When the confidence is low due to low impressions, we trace back one level
by going back to the parent node and invoke the statistics carried by the parent node. We repeat this on
the parent node if the impression count is still low, until we reach an ancestor node on the path where the
impression count exceeds a chosen threshold $T_{\text{impression}}$, or we hit the first-level context node.</p>
        <p>Figure 3: (a) Architecture. (b) Example of beta distributions of three stacks with $(\alpha, \beta)$ set to $(20, 80)$, $(27.5, 472.5)$ and $(250, 750)$,
respectively. In this example, stack 3 is confidently better than stack 2, with high concentrations. During
Thompson sampling, stack 3 is preferred to stack 2 almost deterministically (exploit). However, between stack
1 and stack 3, the average engagement rates are not too different. With the lower concentration of stack 1,
there will be more exploration between the two stacks during Thompson sampling.</p>
        <p>In addition, we enforce another guardrail to prevent the worst secondary stack from showing. The
reason for this consideration is that there is no guarantee on the recall quality from each individual stack’s
qualifying condition. Due to the random nature of Thompson sampling, there is a non-zero probability
that a bad secondary stack is selected for exploration. Theoretically, we can rely on the algorithm
itself to converge based on the engagement statistics it collects. However, since we are operating in
production, we need to mitigate the risk of harming business metrics or breaking customers’ trust
through exploration. To address this issue, we apply a regularization by setting a dynamic threshold
using the mean of the beta distribution corresponding to the second worst stack. If the winning stack
from Thompson sampling is worse than the overall performance of the second worst stack among all,
the winning stack is dropped and consequently no secondary stack is shown.</p>
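        <p>The explore-exploit behavior described in the caption of Figure 3(b) can be checked numerically. The short Monte Carlo snippet below estimates how often each of the three example posteriors wins a Thompson draw; it is an illustration, not part of the production system.</p>
        <preformat>import random

stacks = {"stack 1": (20, 80),        # mean 0.20, low concentration
          "stack 2": (27.5, 472.5),   # mean 0.055
          "stack 3": (250, 750)}      # mean 0.25, high concentration
wins = {s: 0 for s in stacks}
for _ in range(100_000):
    draws = {s: random.betavariate(a, b) for s, (a, b) in stacks.items()}
    wins[max(draws, key=draws.get)] += 1
# stack 2 almost never wins against stack 3 (near-deterministic exploit),
# while the wide posterior of stack 1 still wins a nontrivial share (explore).
print(wins)</preformat>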
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Runtime algorithm</title>
        <p>We next outline the main algorithm that conducts the stack selection per search request using Thompson
sampling, incorporating the practical considerations pointed out earlier.</p>
        <p>Algorithm 4 Runtime stack selection with Thompson sampling
1: Input: contextTree, currentContext, eligibleStacks, sessionId
2: Output: selectedStack
3: nodePath ← traverseTreeByCtx(currentContext) ◁ returns a list of nodes from root to leaf
4: θ* ← 0, impression ← 0, i ← d, node ← ∅, ctrList ← [] ◁ d is the depth of the context tree
5: while impression &lt; T_impression and node is not root do ◁ find operating node
6:   node ← nodePath[i]
7:   impression ← sumImpressionsFromNode(node)
8:   i ← i − 1
9: end while
10: while a ∈ eligibleStacks do
11:   seed ← hash(sessionId + query + a)
12:   α, β ← readStatsFromNode(node)
13:   θ ← sample from Beta(α, β) with seed ◁ generate a random number
14:   ctrList.add(α/(α + β))
15:   if θ &gt; θ* then
16:     θ* ← θ
17:     selectedStack ← a
18:   end if
19: end while
20: if θ* &lt; secondSmallest(ctrList) then
21:   selectedStack ← ∅
22: else
23:   logging(requestId, currentContext, selectedStack) ◁ requestId is used as the join key with FE log
24: end if</p>
        <p>The algorithm operates on a per search request level. For each search request, we obtain its
corresponding context and execute Algorithm 4 on all the qualifying secondary stacks to select the final one
to be displayed.</p>
        <p>Due to the random nature of Thompson sampling, without further regulation it is possible for the
same user to experience different selections of secondary stacks under the same query, e.g. when a customer
searches for the same query twice. This can lead to a confusing customer experience, especially within
a short time window. To prevent such a situation from happening, we set a constraint so that a user
always sees the same choice of secondary stack under the same query within a session. A session
is defined as a thirty-minute window uniquely assigned to each visitor. This is achieved in L.11 by
concatenating the session ID, the query, and the stack name, which are all strings, and
using the hash value of the concatenated string as the seed.</p>
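        <p>Putting the pieces together, the following hedged Python sketch of the runtime path walks up from the leaf until the impression threshold is met, draws one seeded beta sample per eligible stack so that the same (session, query, stack) triple always reproduces the same draw, and applies the guardrail from Section 4.2. Function and field names, and the threshold value, are assumptions.</p>
        <preformat>import hashlib
import random

T_IMPRESSION = 100  # illustrative threshold; tuned in production

def sum_impressions(node):
    """Illustrative proxy: total pseudo-counts across stacks at a node."""
    return sum(a + b for a, b in node["stats"].values())

def select_stack(node_path, eligible_stacks, session_id, query):
    """node_path: context-tree nodes from root to leaf for this request."""
    # 1. Walk up from the leaf until a node carries enough impressions.
    node = next((n for n in reversed(node_path)
                 if sum_impressions(n) &gt;= T_IMPRESSION), node_path[0])

    best_stack, theta_star, means = None, -1.0, []
    for stack in eligible_stacks:
        a, b = node["stats"][stack]
        # 2. Seed with session + query + stack (L.11 of Algorithm 4) so a
        #    returning user sees the same choice within a session.
        seed = hashlib.md5((session_id + query + stack).encode()).hexdigest()
        theta = random.Random(seed).betavariate(a, b)
        means.append(a / (a + b))
        if theta &gt; theta_star:
            best_stack, theta_star = stack, theta

    # 3. Guardrail: drop the winner if its sample is below the mean
    #    engagement rate of the second worst stack.
    if len(means) &gt;= 2 and theta_star &lt; sorted(means)[1]:
        return None
    return best_stack</preformat>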
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup and Results</title>
      <p>In this section, we explain how the AB tests are conducted in production on Walmart.com, which
include a data collection phase and a full explore-exploit phase.</p>
      <p>There are a total of ∼ 10 secondary stacks live in production. In the control group, the stack selection
logic is exactly the same as in production, i.e. following a manually configured priority that is universal
w.r.t. context. Consequently, one stack is always preferred over another according to the manual
priority rule, regardless of the context. As previously mentioned, we have two phases. In the data
collection phase, the goal is to collect unbiased data to initialize the context tree. It can also be interpreted
as learning the prior distributions. No significant business metrics improvement is expected in the
variant group. In the full explore-exploit phase, the goal is to show that the stack selection algorithm,
Algorithm 4, together with the context tree is effective at improving the Converted Visitor (CV) rate.</p>
      <sec id="sec-5-1">
        <title>5.1. Data collection phase</title>
        <p>In this phase, we set up an AB test with a small percentage of the production traffic, 3% in both the control
and the variant group. In the variant group, we aim to collect unbiased data to initialize the context tree.
To achieve that, we allow uniform random triggering among all the qualifying secondary stacks.
The corresponding context encountered and the actual user engagement are logged together with the
selected secondary stack. In theory, we should only need the data from the variant group. However,
we set up the control group to monitor the impact of random triggering. If the uniform random
triggering introduces so much suboptimality that it starts to show a statistically significant loss in CV, we
must stop or reduce the traffic of this experiment in order not to harm the business. This experiment is
set up on three platforms: web, iOS, and Android. The business metrics are closely monitored.
During the data collection period, no statistically significant lift or drop is detected in the business
metrics. Therefore, we keep this test running at low traffic over the course of four weeks.</p>
        <p>As pointed out in Section 3.3, the goal of data collection is to preselect the useful contexts and initialize
the context tree. We start with the following contexts based on business intuition: query or PT, platform,
whether a fulfillment-based filter is applied, and the customer’s Walmart membership status. We briefly
explain the query / PT context. Since the secondary stack is shown on the search page, we believe the
query plays an important role. However, it is not always possible to aggregate the statistics on a per-query
level, as the query itself is free text and the number of unique queries is unbounded. For frequently
searched queries, we consider the normalized query¹ as one context. For infrequently searched queries,
we use the PT as the corresponding context. A query PT can be understood as a quantization of the actual
queries where the cardinality is bounded (∼ 8000 in Walmart production). By conducting the context
preselection scheme discussed in Section 3.3, we find that only two levels of context, query / PT and platform,
contribute positively to the overall prediction accuracy. The various combinations of contexts and their
corresponding prediction accuracies are reported in Table 1. The context preselection is learned based on
ten consecutive days of data and evaluated on the next five days of data. After the context preselection, we
initialize the context tree using the four weeks of collected data by executing Algorithms 2 and 3 day by
day.</p>
        <p>Note that in the context tree initialization, in order to quickly establish the statistics, we use an
adaptation scheme for the discount factor $\gamma$ in Algorithm 3, where we replace $\gamma$ with $\gamma_1$ and $1 - \gamma$ with
$\gamma_0$. This is summarized as
$$(\gamma_0, \gamma_1) = \begin{cases} (0.99,\, 1) &amp; \text{if impression} &lt; 300 \\ (0.9,\, 0.1) &amp; \text{if impression} &gt; 3000 \\ \text{linear interpolation in } [0.99, 0.9] \text{ and } [1, 0.1] &amp; \text{otherwise,} \end{cases}$$
where impression is the total number of impressions from all observed secondary stacks at the first-level
(query or PT) node.</p>
        <p>¹A normalized query is obtained by standardizing the expression via stemming, such as merging singular/plural forms.</p>
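        <p>A hedged sketch of this adaptation is given below; the endpoint values are from the scheme above, while the exact linear form in the middle region is our reading of “linear interpolation”.</p>
        <preformat>def adaptive_gammas(impressions):
    """Return (gamma_0, gamma_1) for the initialization-phase update:
    heavy weight on fresh data while a first-level node is cold,
    standard discounting once it is warm."""
    if impressions &lt; 300:
        return 0.99, 1.0
    if impressions &gt; 3000:
        return 0.9, 0.1
    t = (impressions - 300) / (3000 - 300)  # position in [0, 1]
    return 0.99 - 0.09 * t, 1.0 - 0.9 * t   # interpolate [0.99,0.9] and [1,0.1]</preformat>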
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Explore-Exploit phase</title>
        <p>At this point, we have all the components to start the full explore-exploit AB test. In this phase, we set
up a regular AB test with higher production search traffic, 15% each in the control and the variant group. The
goal of this test is to show that the scheme can move business and search metrics up. In both the control
and the variant group, we have the same set of secondary stacks. Recall that in the control group, the
selection of the secondary stack is pre-configured by a prioritization order, regardless of any context.
Similar to the data collection phase, the experiment is set up on three platforms: web, iOS, and Android.
In the variant group, we test the explore-exploit algorithm by conducting Thompson
sampling as in Algorithm 4. The tests are run for a two-week period in production. We report the search
metrics, secondary stack specific metrics, and engagement metrics in Tables 2, 3 and 4, respectively.
The numbers in color (or with superscripts) indicate statistical significance with p-value &lt; 0.1, where
green (or superscript +) generally indicates a change in the “good" direction and red (or
superscript -) generally indicates a change in the “bad" direction.</p>
        <p>We next explain how to interpret the metrics encountered in the tables. From the search metrics
table, ATC stands for “Add To Cart". With this definition, Search
ATC / Visitor is intuitively easy to understand (the higher the better). The next two metrics, Search Direct ATC / Visitor and Search Rf
Item Page ATC / Visitor, measure the ATC from two sources: the search page directly, and the item page
after the customer clicks on an item from the search page (the higher the better). The last metric, Clicks
to ATC, is a search efficiency metric, defined as the average number of clicks it takes for a customer to
add an item to cart (the lower the better).</p>
        <p>From the secondary stack specific metrics, we report the stack surface rate and the number of secondary
stack add-to-carts per visitor. Note that the decrease in the surface rate of secondary stacks is expected
because of the guardrail we impose during Thompson sampling, as discussed in Section 4.2. The key
observation here is that the number of ATCs from secondary stacks has significantly increased across
platforms despite a lower surface rate. This shows that the algorithm is triggering the secondary stacks
less but targeting customers more accurately.</p>
        <p>Overall, the experimental results show that the algorithm is highly effective in targeting the
right secondary stack selection, based on the secondary stack metrics. The improvement in search
metrics and business metrics is mostly reflected in the test on iOS. Our conjecture is that these are more
general metrics and therefore require a longer time (more data) for the effect to propagate. Since the
traffic on iOS is relatively large, we already observe the effect there.</p>
        <p>We briefly discuss a few practical considerations in productionizing the algorithm, including
how to onboard a new secondary stack and conduct AB tests with new stacks. In order to keep the
maintenance of the context tree manageable, we keep only one version of the tree at any time. When
a new secondary stack enters the competition, an AB test needs to be conducted to evaluate the
effectiveness of the new stack. Initial statistics, such as a uniform distribution on $[0, u]$ where $u \leq 1$,
can be assigned to the new stack on Day 1 based on the average of all qualifying stacks in production.
Once the experiment has started, the engagement data and corresponding context of the new stack will
be collected. In both the control and the variant group, we conduct Thompson sampling by accessing the same
context tree. However, in the control group, the new stack is not considered a qualifying stack. The
data collected from the control and variant groups contributes equally to updating the context tree.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We optimize the search page secondary stack (recommender) selection using a low-cost, lightweight
contextual multi-armed bandit approach. The algorithm has two major components: an offline data
pipeline that maintains the statistics of each secondary stack under various contexts, and an online stack
selection using Thompson sampling. We design a context tree data structure that allows Thompson
sampling to be conducted with full or partial contexts. This flexibility makes the solution general
enough to be invoked at various stages of the search flow, depending on the use case. We achieve the
results by conducting the experiment in two phases. In the data collection phase, we collect unbiased
data on all the secondary stacks by enabling uniform selection among all qualified stacks. In the
explore-exploit phase, we use Thompson sampling to handle the tradeoff between explore and exploit.
The AB test results show positive impact on business metrics, search metrics, and secondary stack
metrics. In particular, within the secondary stack metrics, we are able to target the triggering of the
stacks more accurately, reflected in a significant increase in the secondary stack Add To Cart per Visitor
rate.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-1"><mixed-citation>[1] W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika 25 (1933) 285–294.</mixed-citation></ref>
      <ref id="ref-2"><mixed-citation>[2] T. L. Lai, H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics 6 (1985) 4–22.</mixed-citation></ref>
    </ref-list>
  </back>
</article>