Page-Wise Personalized Recommendations in an Industrial
e-Commerce Setting
Liying Zheng1 , Yingji Pan1 and Yuri M. Brovman1
1 eBay Inc.


Abstract
Providing personalized recommendations based on the dynamic sequential behaviors of users plays an important role in e-commerce platforms, since it can considerably improve a user's shopping experience. Previous works apply a unified model pipeline to build recommender systems, without considering the differentiated behavior patterns and intrinsic shopping tendencies on different pages of an e-commerce website. In this paper, we focus on building a personalized recommender system optimized for both the View Item Page and the Homepage by carefully designing strategies for data formulation and model structure. Our proposed model (PWPRec) consists of a causal transformer encoder together with a fusion module designed for different pages, built on the basis of the classical two-tower structure. This provides the capability to capture a balanced long-short interest or diverse multiple interests of a user during their shopping journey across multiple types of pages. We have conducted experiments on in-house as well as public datasets to validate the effectiveness of our model, all showing significant improvements on Recall@k metrics compared to commonly applied sequential models of recent years. Additionally, we built a state-of-the-art deep learning based retrieval system utilizing real-time KNN search as well as near real-time (NRT) user embedding updates to reduce the recommendation delay to a few seconds. Our online A/B test results show a significant advantage over the previous GRU-based sequential model in production, with a 38.5% increase in purchased items due to the model improvements and a 107% increase in purchased items due to the engineering innovations.

Keywords
sequential recommendation, multi-interest, attention network, transformer encoder



ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender Systems and User Modeling, jointly with the 16th ACM Conference on Recommender Systems, September 23rd, 2022, Seattle, WA, USA
liyzheng@ebay.com (L. Zheng); yingpan@ebay.com (Y. Pan); ybrovman@ebay.com (Y. M. Brovman)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
Recommender systems play a fundamental role in e-commerce marketplaces, offering personalized product recommendations based on a user's specific interests, which can largely improve the user's shopping experience. In this work, we focus on "user context based" recommender systems that generate recommendations using a user's historical interactions as the main context. There are several different landing pages which display recommendations to the user on an e-commerce platform, and in this work we focus on two of them: the View Item Page (VIP) and the Homepage (HP). On the VIP, users usually have a specific shopping mission when they navigate to a detailed item page, thus they tend to spend more time comparing similar products and trying to find the most appropriate one. Figure 1 depicts an example of a VIP with a user context based recommendations module built from a user's recent views. On the HP, usually at the beginning of a shopping session, users tend to wander through the whole page without a specific shopping mission. They could be attracted by discounted or hot-sale products, or by the diversified categories they have been consistently interested in; thus we design a new module generating multiple item sets that capture a user's multiple interests.

Figure 1: Screenshot of eBay View Item Page recommendation module with one item set of personalized items.

Incorporating the different user shopping behavior patterns on the VIP and HP mentioned above, we have developed a page-wise personalized recommendation model (PWPRec) in order to capture a user's different shopping goals and interests. Specifically, the main contributions of the paper are:

    1. We present a page-wise deep learning model that considers multiple shopping contexts in an industrial setting.
    2. We develop a novel model architecture by combining a causal transformer encoder with a long-short or multi-interest fusion module in order to generate user embedding(s).
    3. We deploy our recommender system to our production industrial setting, building a state-of-the-art deep learning based retrieval system in the process.

The paper is organized as follows. Section 2 covers approaches in the literature related to our method. The main model architecture is discussed in Section 3. The datasets and sampling strategies used for our offline experiments are then discussed in Section 4. An overview of our production engineering architecture as well as A/B tests is presented in Section 5. We conclude our work in Section 6.

2. Related Works
Adding personalization to recommender systems is a well studied problem both in academia and in industrial applications. Recently, deep neural networks have been adopted for personalized recommendations, with the ability to build a more generalized model by capturing complex content-based features, which can also serve well in cold-start or volatile situations. To generate personalized recommendations, the sequential behaviors of users are effectively exploited by applying different sequential encoder networks. Many works apply Recurrent Neural Networks (RNNs) for sequential recommendation and obtain promising results. Among those, Hidasi et al. [1] proposed a GRU-based network to model the sequential behaviors of users, adopting the last output as the user embedding (known as GRU4Rec); Hidasi and Karatzoglou [2] proposed a top-k gain ranking loss function used in RNNs for session-based recommendations; and Li et al. [3] also built on an RNN, but proposed a way to balance a user's local interest and global interest (known as NARM). Besides RNNs, the well-known self-attention mechanism [4] for sequential modeling has also been commonly applied in recommendations: Kang and McAuley [5] proposed a self-attention based network to capture the sequential behaviors of users, where the encoded value of the last item in the sequence is regarded as the ultimate user vector (known as SASRec), and Sun et al. [6] adopted Bidirectional Encoder Representations from Transformers, training a bidirectional model to predict masked items in the sequence. There are also methods based on graph neural networks proposed for sequential recommendation: Wu et al. [7] modeled session sequences as graph-structured data to take item transitions into account, and Xu et al. [8] proposed a graph contextualized self-attention model for session-based recommendation.

Most prior work generates a single embedding to represent a user, which is reasonable for recommendation pages or placements with specified target items. But in some settings, such as Homepage recommendations, we would like to provide users with a more diversified set of recommendations reflecting the multiple interests of a user. Capturing the multiple interests of a user has been a popular topic in recent years. Weston et al. [9] introduced a highly scalable method for learning a non-linear latent factorization to model the multiple interests of a user. Li et al. [10] proposed a multi-interest extractor layer based on a capsule network with the dynamic routing mechanism. Cen et al. [11] explored a self-attentive method for multi-interest extraction, and utilized an aggregation module to balance accuracy and diversity.

In terms of engineering system architecture, there are several works which describe large scale embedding based retrieval systems. Pal et al. [12] describe an industrial embedding based retrieval system which uses the HNSW model [13] for the approximate nearest neighbor (ANN) component. There are several production systems that utilize a two-tower model for search and retrieval, including in the social media space [14] as well as in the e-commerce space [15, 16]. We will now discuss the details of our model architecture.

3. The PWPRec Model
In our application scenario, we find that a user's distribution of recently viewed items differs across pages. For the VIP, users usually have a definite shopping purpose and are thus more likely to click on items related to their most recently viewed items. However, for the Homepage, users have a less focused shopping purpose and may click on different categories of items. Thus we build our model in consideration of different pages and placements, which can better capture and understand the different behavior intentions of users.

3.1. Page-Wise Sequential Behavior Analysis
Before introducing our detailed model, we first present an analysis of a user's shopping behavior as a function of time. In our sequential modeling approach, every training example is composed of a positive target clicked item, several negative items a user did not click on in the impression, and a series of the user's historical items.

We build up a histogram of the overlap between the category of the target item and the historical items for all users in the dataset. Figure 2 demonstrates the difference between the VIP and HP distributions. The horizontal axis represents the number of hours between the target clicked item and a historical item, while the vertical axis represents the category overlap between the target and historical items. It can be seen from the graph that for the View Item Page (orange in Figure 2), about 80% of users were also viewing the same category in the hour before, and 5% in the second hour before, indicating that users are focused on the category of their most recent items. For the Homepage, the curve is more gradual, with only 30% overlap in the hour before, indicating that on the Homepage users show interest in categories they interacted with over a longer period; thus the target item category may correlate with more diverse historical categories.

Figure 2: User historical category overlap histogram on different pages.

Based on the above analysis, we decided to adopt different data formulation strategies and different model structures for different pages.

    • For the View Item Page, considering users usually have specific shopping missions and less interest in other categories, we organize the training data in a "session-based way" with the most recent past behaviors over a shorter period. The ultimate output will be a single item set showing the user's most recent interest.
    • For the Homepage, users may show interest in a diverse set of categories that they interacted with, even several days before. In this case, we organize the training data in a "user-based way" incorporating more days and more past behaviors. The ultimate output will be multiple item sets showing the user's multiple interests through the long shopping journey.

3.2. Model structure
Our proposed approach for personalized recommendations is based on a two-tower deep learning model structure that generates user embedding(s) and item embeddings at the same time. The overall architecture of PWPRec is shown in Figure 3. Following our previous work [17], we keep the same structure for the item tower and focus on optimizing the user tower. The original user tower adopted a recurrent neural network as the base encoder of a user's historical events and an average fusion strategy to generate the final user embedding. Here we optimize the user encoder network with two architectural modules: 1) a sequential encoder to better capture the ordered historical events, and 2) a fusion network to better adapt to pages with different historical item distributions. In the next sections, we describe these modules in detail.

Figure 3: Two tower model architecture for user embedding(s) and item embedding. The causal transformer encoder is explained in the left part. The fusion module serving different pages will be explained in subsequent subsections, with long-short fusion generating a comprehensive user embedding or multi-interest fusion generating multiple user embeddings.

3.3. Causal Transformer Encoder
The transformer network and self-attention mechanism described in [4] are widely applied in NLP related tasks and achieve state-of-the-art performance. Here we adopt the idea of the transformer and self-attention to function as the sequential encoder, with some modifications in order to capture the order information, which is of vital importance in recommendation scenarios.

3.3.1. Relative Positional Embedding
We first tried the fixed positional embedding originating from the vanilla self-attention in [4], but it did not work well. This may be because a fixed embedding cannot capture relative positional information well, which is quite essential in an e-commerce setting. In our case, a learnable embedding over relative position values works best. The relative position value is calculated as:

\[ \mathrm{pos}(item_i) = T - i \tag{1} \]

where T is the position of the target item and i is the position of an item prior to the target item. The relative position is then encoded into an embedding P_emb, and the final input to the transformer encoder is calculated as:

\[ \mathrm{IN}_{vector} = \mathrm{ITEM}_{emb} + \mathrm{P}_{emb} \tag{2} \]

where ITEM_emb is the original item embedding, and the final input vector IN_vector is the vector addition of the item embedding and the positional embedding P_emb.
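To make the input construction concrete, below is a minimal PyTorch sketch of Equations (1) and (2); the module name, the 0-based indexing of the relative positions, and the max_len cap are our illustrative assumptions, not the production implementation.

```python
import torch
import torch.nn as nn

class RelativePositionInput(nn.Module):
    """Builds the encoder input of Eq. (2): item embedding plus a learnable
    embedding of the relative position pos(item_i) = T - i from Eq. (1)."""

    def __init__(self, num_items: int, d_emb: int, max_len: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d_emb)
        # One learnable vector per relative distance to the target item.
        self.pos_emb = nn.Embedding(max_len, d_emb)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # item_ids: (batch, L), ordered oldest -> most recent; assumes L <= max_len.
        L = item_ids.size(1)
        # Relative positions T - i: the most recent item gets the smallest value.
        rel_pos = torch.arange(L - 1, -1, -1, device=item_ids.device)
        return self.item_emb(item_ids) + self.pos_emb(rel_pos)  # (batch, L, d_emb)
```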
3.3.2. Causal Attention Mask
The vanilla transformer encoder attends to all positions in a sequence through the self-attention and multi-head mechanism, with each head output formulated as:

\[ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \quad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d}}\right) V \tag{3} \]

where Q, K, V are the packed matrices of queries, keys and values, d is the dimension of the queries and keys, and W_i^Q, W_i^K, W_i^V are the parameter matrices, as described in the self-attention mechanism [4].

For the sequential recommendation scenario, a causal mask [18] needs to be applied to guarantee that later-clicked items cannot be seen when predicting earlier items; otherwise this may lead to data leakage. Therefore, we apply a lower triangular attention mask matrix (as shown in the left part of Figure 3) to guarantee the causality between items, and in this way the self-attention can be formulated as:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\mathrm{Mask}\!\left(\frac{Q K^T}{\sqrt{d}}\right)\right) V, \quad \mathrm{Mask} = \mathrm{Tril}(\mathrm{Ones}(M \in \mathbb{R}^{L \times L})) \tag{4} \]

where L is the sequence length, Ones represents the all-ones matrix, Tril takes the lower triangular part of a matrix, and the mask operation fills future values with −inf.
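As an illustration, the following is a minimal PyTorch sketch of the masked attention in Equations (3) and (4); a single head is shown, and the variable names are ours. The built-in nn.TransformerEncoder achieves the same effect when given an upper-triangular boolean mask (True = blocked), which is the complement of the Tril formulation above.

```python
import torch
import torch.nn as nn

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with the causal mask of
    Eq. (4): future positions are filled with -inf before the softmax."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5                    # QK^T / sqrt(d)
    L = scores.size(-1)
    allowed = torch.tril(torch.ones(L, L, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~allowed, float("-inf"))           # Mask(...)
    return torch.softmax(scores, dim=-1) @ v                       # (batch, L, d)

# The same causality constraint, expressed with the stock encoder stack:
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
x = torch.randn(8, 20, 64)                                         # (batch, L, d_model)
blocked = torch.triu(torch.ones(20, 20, dtype=torch.bool), diagonal=1)
out = encoder(x, mask=blocked)                                     # attends to past only
```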

3.4. Fusion Module
In our industrial recommendation scenario, we design two fusion networks to handle the different recommendation targets: one generating a comprehensive single interest, applied on the VIP and named Long-Short Fusion; the other generating multiple interests for the HP, named Multi-Interest Fusion.

3.4.1. Long-Short Fusion Strategy
To generate one single interest, we adopt a network architecture which combines a user's short-term and long-term interests. The short-term interest takes the last position output of the transformer encoder, indicating the most recent preferences, while the long-term interest takes the outputs of all positions into consideration, indicating the user's global preferences. We use an attention mechanism to calculate a weighted average of all the outputs to form the long-term interest, which can be expressed as:

\[ U_{long} = \sum_{k=1}^{L} \mathrm{Attention}(i_k^{enc}, i_L^{enc}) \cdot i_k^{enc}, \quad \mathrm{Attention}(i_k^{enc}, i_L^{enc}) = v^T \sigma(A_1 i_k^{enc} + A_2 i_L^{enc}) \tag{5} \]

where the attention function is additive attention [19], i_L^enc is the encoding of the last position item, i_k^enc is the encoding of the item at position k, A_1 transforms i_k^enc into a latent space, A_2 plays the same role for i_L^enc, and σ is the sigmoid function.
After learning a long-term and a short-term embedding, the last important step is to integrate them appropriately. Here we choose a gated mechanism to learn the contribution coefficients of the long-term and short-term embeddings, which is illustrated in Figure 4 and calculated as:

\[ U_{emb} = (1 - gate) \cdot U_{long} + gate \cdot U_{short}, \quad gate = \sigma(G_1 U_{short} + G_2 U_{long}) \tag{6} \]

where U_short = i_L^enc, U_long is given by Equation (5), and σ is the sigmoid activation function. In the gate equation, G_1 and G_2 transform U_short and U_long into latent spaces, respectively.

Figure 4: Long-short Fusion Module.
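A minimal PyTorch sketch of the Long-Short fusion of Equations (5) and (6) is shown below; the linear-layer names mirror the matrices A_1, A_2, v, G_1, G_2 above, while the class itself is illustrative rather than the production code.

```python
import torch
import torch.nn as nn

class LongShortFusion(nn.Module):
    """Eq. (5): additive attention pools all encoder outputs into U_long;
    Eq. (6): a learned gate mixes U_long with U_short (the last output)."""

    def __init__(self, d: int):
        super().__init__()
        self.A1 = nn.Linear(d, d, bias=False)  # transforms i_k^enc
        self.A2 = nn.Linear(d, d, bias=False)  # transforms i_L^enc
        self.v = nn.Linear(d, 1, bias=False)   # the scoring vector v^T
        self.G1 = nn.Linear(d, d, bias=False)  # gate transform of U_short
        self.G2 = nn.Linear(d, d, bias=False)  # gate transform of U_long

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        # enc: (batch, L, d) outputs of the causal transformer encoder.
        u_short = enc[:, -1]                                           # (batch, d)
        weights = self.v(torch.sigmoid(self.A1(enc) + self.A2(u_short).unsqueeze(1)))
        u_long = (weights * enc).sum(dim=1)                            # (batch, d)
        gate = torch.sigmoid(self.G1(u_short) + self.G2(u_long))       # (batch, d)
        return (1 - gate) * u_long + gate * u_short                    # U_emb
```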
3.4.2. Multi-Interest Fusion Strategy
The multi-interest fusion module is utilized to capture multiple interests from a user's shopping journey. A multi-head self-attentive network is applied to transform the sequential item encodings of a user into multiple user representations. We follow the self-attentive method originating from Lin et al. [20], which was later applied in a recommendation system by Cen et al. [11] to function as the multi-interest extractor. In our work, we found that when this multi-interest fusion module was combined with the transformer sequential encoder, the model performance was significantly improved.

Figure 5: Multi-Interest Fusion Module.

The multi-interest fusion network is illustrated in Figure 5. Suppose we have a sequence of items i_1, i_2, ..., i_L, which after the causal transformer encoder are represented as I = {i_1^enc, i_2^enc, ..., i_L^enc}, with sequence length L. A multi-head self-attentive layer is adopted to calculate the attention weights A over the input item sequence, with each head representing one interest. The multiple user embeddings U for the current user can be calculated as:

\[ A = \mathrm{softmax}\big((\tanh(I W_1) W_2)^T\big), \quad U = A I \tag{7} \]

where I ∈ ℝ^(L×d_emb) is the matrix of sequential item encodings, W_1 ∈ ℝ^(d_emb×d_hidden) is a trainable parameter matrix which transforms the encoded item vectors from dimension d_emb to d_hidden (usually d_hidden is several times larger than d_emb to increase model capacity), and W_2 ∈ ℝ^(d_hidden×K) is another trainable parameter matrix which maps d_hidden to the number of embeddings K (the number of user interests to be generated). The attention weight matrix is A ∈ ℝ^(K×L), and the final multiple user embeddings are U ∈ ℝ^(K×d_emb).
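The following is a minimal PyTorch sketch of Equation (7), with num_interests playing the role of K; the initialization and naming are our assumptions.

```python
import torch
import torch.nn as nn

class MultiInterestFusion(nn.Module):
    """Self-attentive extractor of Eq. (7): L encoded items -> K interests."""

    def __init__(self, d_emb: int, d_hidden: int, num_interests: int):
        super().__init__()
        self.W1 = nn.Parameter(torch.empty(d_emb, d_hidden).normal_(std=0.02))
        self.W2 = nn.Parameter(torch.empty(d_hidden, num_interests).normal_(std=0.02))

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        # enc plays the role of I: (batch, L, d_emb) encoder outputs.
        scores = torch.tanh(enc @ self.W1) @ self.W2          # (batch, L, K)
        A = torch.softmax(scores.transpose(1, 2), dim=-1)     # (batch, K, L)
        return A @ enc                                        # U: (batch, K, d_emb)
```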
4. Offline Datasets & Experiments
In this section, we describe the datasets we utilized to train and validate our PWPRec model. We adopted different data formulation strategies for the View Item Page and the Homepage respectively. We also conducted experiments on both our eBay datasets and public datasets to validate the effectiveness of our model.

4.1. Dataset for View Item Page
For the View Item Page, users tend to click more items related to recently viewed items, thus we organize the data in a session-based way. Here we choose two session-based datasets for our experiments: one is collected from our eBay in-house data, the other is the public YooChoose dataset [21], which is also commonly adopted by research papers.

    • eBay (session-based) dataset. This dataset is derived from our real world eBay production traffic, containing View Item Page events within a session. All items are enriched with necessary metadata such as titles, aspects and categories.
    • YooChoose dataset. This dataset was provided by YooChoose for the RecSys Challenge 2015, with each session encapsulating the click events that a user performed at a retailer. In this dataset, only the item id and category are provided to generate an item embedding.

In order to better validate the effectiveness of the sequential encoders, we filter out very short sessions with a sequence length of less than 4. The data statistics of the two datasets are shown in Table 1.

Statistics                  eBay (session-based)   YooChoose
# of training sessions      18 million             1.9 million
# of validation sessions    2 million              470k
# of items                  72 million             53k
Average sequence length     15                     8

Table 1
View Item Page Data Statistics.




4.2. Dataset for Homepage
For the Homepage, we organize the data in a user-based way within a longer time window, and thus much longer user sequences are obtained. Here we choose two user-based datasets for our experiments: one is collected from our eBay in-house data, the other is the public Taobao dataset [22].

    • eBay (user-based) dataset. This dataset is also derived from our real world eBay production traffic, containing clicked items on the Homepage as the target labels, and all the items that a user viewed within the 30 days before the clicked item as the sequential historical events.
    • Taobao dataset. This dataset contains the sequential behavior of users collected from Taobao, consisting of 1 million users' shopping behaviors within 10 days. We follow the same training/validation data splitting method as in [11].

The statistics of the above datasets are shown in Table 2. We also build a histogram of the overlap between the category of the target item and the historical items for the Taobao dataset in Figure 6, which shows a gradual curve more similar to the eBay HP than to the VIP. We therefore adopted the Taobao dataset to validate the effectiveness of the multiple interests model targeted at the Homepage.

Statistics                  eBay (user-based)   Taobao
# of training users         40 million          0.8 million
# of validation users       2 million           97k
Average sequence length     102                 87

Table 2
Homepage Data Statistics.

Figure 6: User historical category overlap histogram on Taobao and eBay Homepage.

4.3. Model Training & Validation
For model training and validation, different negative sampling strategies and loss calculations are adopted for the View Item Page and the Homepage.

4.3.1. View Item Page
View Item Page data samples are grouped by session. As session lengths are usually shorter than in the user-based formulation, we adopt global negative sampling to choose negative items from a larger candidate pool. For the training phase, each training sample has one positive item and 10 negative items, while for the validation phase we select 1000 negative items to make the evaluation more generalized to the whole candidate item set. We use a cross entropy loss to train the model, whose target is to maximize the softmax probability of the positive item:

\[ P(pos|U) = \frac{e^{\gamma(v_{pos}, v_u)}}{\sum_{i \in pos \cup neg} e^{\gamma(v_i, v_u)}}, \quad Loss = -\log P(pos|U) \tag{8} \]

where v_u ∈ ℝ^d is the d-dimensional embedding of user U, v_pos ∈ ℝ^d is the d-dimensional embedding of the positive item, γ is the affinity function between user and item (we adopt the inner product as the affinity score), and pos ∪ neg is the union of the target positive and sampled negative items.
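To illustrate, here is a minimal sketch of the loss in Equation (8) with the inner product as the affinity γ, assuming the negatives (10 per positive during training, per the text above) have already been sampled; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def softmax_loss(user_emb, pos_emb, neg_emb):
    """Eq. (8): cross entropy over one positive and N sampled negatives,
    with the inner product as the user-item affinity function gamma."""
    # user_emb: (batch, d); pos_emb: (batch, d); neg_emb: (batch, N, d)
    pos_logit = (user_emb * pos_emb).sum(-1, keepdim=True)               # (batch, 1)
    neg_logits = torch.bmm(neg_emb, user_emb.unsqueeze(-1)).squeeze(-1)  # (batch, N)
    logits = torch.cat([pos_logit, neg_logits], dim=-1)                  # (batch, 1+N)
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)  # = -log P(pos | U), batch-averaged
```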
                                                            user-based way with longer sequential behaviors within a
                                                            30 days time window, batch negative sampling is adopted
to select 1000 samples both in training and validation         entropy with inverse temperature, 2) adopt a attention-
phase. Here the loss calculation logic for training and val-   based weighted sum mechanism to generate the ulti-
idation process is tackled differently for accelerating the    mate embedding. We call our baseline model GRU4Rec-
convergence of multiple user embeddings model struc-           Enhanced. For our model PWPRec proposed in this
ture.                                                          paper, we add the suffix (LS) to represent Long-Short
                                                               fusion strategy.
     • Training phase. As we have the positive item               We see from Table 3 that on both of the datasets we
       for the target label information, we can use the        have depicted in Section 4.1, our model PWPRec(LS)
       positive item embedding to choose one final user        achieved the best performance. Our model gains 10+%
       embedding from multiple embeddings as the one           increase on Recall@1 compared to the baseline model
       to calculate the training loss.                         GRU4Rec-Enhanced. We notice that on the YooChoose
                                                               dataset, the recall values are lower, possibly because of
                    v𝑢 = V𝑢 [𝑎𝑟𝑔𝑚𝑎𝑥(V𝑢 v𝑇𝑝𝑜𝑠 )]         (9)    the smaller size of training set as well as the lack of item
                                                               features, like titles or aspects. However, our model has
       where v𝑢 ∈ ℝ𝑑 is the final user embedding we
                                                               a bigger advantage on this dataset even for recall with
       select to calculate the loss in equation (8), V𝑢 is
                                                               larger Ks, which implies that even in a situation where
       the multiple embeddings genenerated for the user,
                                                               less features are available, our model can perform better
       and v𝑝𝑜𝑠 ∈ ℝ𝑑 is the positive item embedding.
                                                               and have better generalization capabilities.
     • Validation phase. Different from the training
       phase, label information like positive item can-
                                                               4.4.2. Homepage Experiments
       not be used in metrics calculation, otherwise this
       would result in label leakage. Here we applied a    The experimental results can be found in Table 4. For our
       simplified trick to fastern the procedure, which    model PWPRec proposed in this paper, we add the suffix
       is selecting one user embedding having the maxi-    (MI) to represent Multi-Interest fusion strategy. Here
       mum summarized affinity score with the candi-       we chose the model ComiRec described in [11] as the
       date item set as the final user embedding for loss  baseline, and also the well-known multi-interest model
       and metrics calculation.                            MIND [10] for comparison. We see from Table 4 that
                                                           our model PWPRec(MI), with the transformer encoder
                  v𝑢 = V𝑢 [𝑎𝑟𝑔𝑚𝑎𝑥( ∑ V𝑢 v𝑇𝑖 )]        (10) and multi-interest fusion layer, outperforms the other
                                  𝑖∈𝑖𝑡𝑒𝑚𝑠                  two models by 20+% on Recall@1 metrics, and a high
                      𝑑
       where v𝑢 ∈ ℝ is the final user embedding we se-     10+%  increase for recall with larger K. Similar to pre-
       lect to calculate model metrics, V𝑢 is the multiple vious experiments, our model gains a more significant
       embeddings genenerated for the user, and v𝑖 ∈ ℝ𝑑 improvement on the public datasets which lacks item
       is the item embedding contained in the candidate feature information.
       item set for validation.
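A small sketch of the two selection rules in Equations (9) and (10), written for a single user with K interest embeddings; the function names are ours.

```python
import torch

def select_for_training(V_u: torch.Tensor, v_pos: torch.Tensor) -> torch.Tensor:
    """Eq. (9): pick the interest with the highest affinity to the positive item."""
    # V_u: (K, d) multiple user embeddings; v_pos: (d,) positive item embedding.
    return V_u[torch.argmax(V_u @ v_pos)]

def select_for_validation(V_u: torch.Tensor, item_embs: torch.Tensor) -> torch.Tensor:
    """Eq. (10): pick the interest with the largest summed affinity over the
    candidate item set, so the positive label is never consulted."""
    # item_embs: (M, d) candidate item embeddings.
    return V_u[torch.argmax((V_u @ item_embs.T).sum(dim=1))]
```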
4.4. Offline Experimental Results
The primary evaluation metric we use is Recall@k at several values k = 1, 5, 10, 20. For P impressions, the metric is defined as:

\[ \mathrm{Recall@}k = \frac{1}{P} \sum_{i=1}^{P} \frac{\#\text{ relevant items @ }k}{\#\text{ total relevant items}} \tag{11} \]
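A direct transcription of Equation (11), included for clarity; the function and its argument layout are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Eq. (11): mean over P impressions of (# relevant items in the top-k)
    divided by (# total relevant items for that impression)."""
    # retrieved: list of ranked item-id lists; relevant: list of item-id sets.
    per_impression = [len(rel & set(ret[:k])) / len(rel)
                      for ret, rel in zip(retrieved, relevant)]
    return sum(per_impression) / len(per_impression)

# One impression with two relevant items, one of them recovered in the top 2:
print(recall_at_k([[3, 7, 9]], [{7, 11}], k=2))  # 0.5
```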
We then explain the experimental results conducted on different pages with different datasets.

4.4.1. View Item Page Experiments
The experimental results can be found in Table 3. We select three other models for comparison: GRU4Rec [1], NARM [3], and SASRec [5]. Our baseline model, described in [17], is very similar to GRU4Rec but has the following enhancements: 1) it changes the loss function to cross entropy with inverse temperature, and 2) it adopts an attention-based weighted sum mechanism to generate the ultimate embedding. We call this baseline model GRU4Rec-Enhanced. For our model PWPRec proposed in this paper, we add the suffix (LS) to denote the Long-Short fusion strategy.

We see from Table 3 that on both of the datasets described in Section 4.1, our model PWPRec(LS) achieved the best performance, gaining a 10+% increase on Recall@1 compared to the baseline model GRU4Rec-Enhanced. We notice that on the YooChoose dataset the recall values are lower, possibly because of the smaller training set as well as the lack of item features like titles or aspects. However, our model has a bigger advantage on this dataset even for recall at larger k, which implies that even in situations where fewer features are available, our model performs better and generalizes better.

4.4.2. Homepage Experiments
The experimental results can be found in Table 4. For our model PWPRec proposed in this paper, we add the suffix (MI) to denote the Multi-Interest fusion strategy. Here we chose the model ComiRec described in [11] as the baseline, and also the well-known multi-interest model MIND [10] for comparison. We see from Table 4 that our model PWPRec(MI), with the transformer encoder and multi-interest fusion layer, outperforms the other two models by 20+% on the Recall@1 metric, and by 10+% for recall at larger k. Similar to the previous experiments, our model gains a more significant improvement on the public dataset, which lacks item feature information.

5. Production Engineering Architecture
Details of the continuous improvements we have made to the engineering architecture of this system can be seen in our eBay Tech Blog post [23]. Most of the modeling innovations described in Sections 3.2 and 4 were A/B tested against the baseline version of the system described in our previous work [17]. In our previous approach, most of the model calculations were performed offline, with daily batch jobs generating the user/item embeddings and performing a KNN search for every user embedding over the space of item embeddings. There is a clear disadvantage: the delay between the offline calculation of predictions (performed daily) and displaying the recommendations to the user could lead to stale, outdated recommendations and a degraded user experience.
Table 3
Experimental results for View Item Page

Dataset                 Model               Recall@1            Recall@5            Recall@10           Recall@20
eBay (session-based)    GRU4Rec-Enhanced    0.4212              0.6341              0.7038              0.7643
                        NARM                0.4249 (+0.88%)     0.6378 (+0.58%)     0.7101 (+0.90%)     0.7743 (+1.31%)
                        SASRec              0.4745 (+12.65%)    0.6509 (+2.65%)     0.7084 (+0.65%)     0.7670 (+0.35%)
                        ours-PWPRec(LS)     0.4761 (+13.03%)    0.6611 (+4.26%)     0.7239 (+2.86%)     0.7777 (+1.75%)
YooChoose               GRU4Rec-Enhanced    0.1222              0.1270              0.1372              0.2116
                        NARM                0.1186 (-2.95%)     0.1230 (-3.15%)     0.1331 (-2.99%)     0.2079 (-1.75%)
                        SASRec              0.1366 (+11.78%)    0.1429 (+12.52%)    0.1526 (+11.22%)    0.2281 (+7.80%)
                        ours-PWPRec(LS)     0.1407 (+15.14%)    0.1459 (+14.88%)    0.1568 (+14.29%)    0.2306 (+8.98%)

Table 4
Experimental results for Homepage

Dataset              Model              Recall@1            Recall@5            Recall@10           Recall@20
eBay (user-based)    ComiRec            0.4835              0.7139              0.7522              0.7916
                     MIND               0.3832 (-20.74%)    0.5863 (-17.87%)    0.6231 (-17.16%)    0.6870 (-13.21%)
                     ours-PWPRec(MI)    0.5898 (+21.99%)    0.8221 (+15.16%)    0.8512 (+13.16%)    0.8738 (+10.38%)
Taobao               ComiRec            0.1098              0.1970              0.2461              0.3007
                     MIND               0.0862 (-21.49%)    0.1533 (-22.18%)    0.1926 (-21.74%)    0.2514 (-16.39%)
                     ours-PWPRec(MI)    0.1373 (+25.05%)    0.2419 (+22.79%)    0.2914 (+18.41%)    0.3427 (+13.97%)



To overcome this issue and reduce the delay to a few seconds, we built a state-of-the-art deep learning based retrieval system utilizing real-time KNN search as well as near real-time (NRT) user embedding updates; the details are displayed in Figure 7.

Figure 7: Production engineering architecture featuring real-time KNN search as well as NRT user embedding updates.

To enable fast real-time KNN search over vector embeddings, we built an in-house KNN microservice based on the HNSW method [13]: a user embedding is sent as input, an ANN search is performed in the item embedding space, and item recommendations are returned. To generate a user embedding in real time, we capture user click activity on the site using Apache Kafka message events and process them with an Apache Flink application. The events are enriched with metadata and processed through a deep learning model prediction microservice to generate the actual embedding vector, which is subsequently stored in Couchbase. Putting all of these together, we generate the full NRT flow for personalized recommendations:

    1. Step 1A - A user clicks on previous View Item Pages and these click events are collected using the Kafka messaging platform.
    2. Step 1B - The Flink application aggregates the last several events and generates a user embedding by calling the model prediction microservice.
    3. Step 1C - The user embedding is stored in Couchbase with {key: value} = {user id: user embedding vector}.
    4. Step 2A - As the user lands on a View Item Page, the backend recommendations application gets the user embedding from Couchbase.
    5. Step 2B - A request is made to the KNN microservice; personalized recommendations are returned and rendered back to the user.

As a result of this system architecture, the delay between generating personalized recommendations based on the user's session data and displaying them is reduced to a few seconds. This system is in production, serving high volume traffic to a diverse set of users. Next we will discuss our online A/B testing results, which support our offline model evaluations.
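As a rough sketch of the retrieval step, the snippet below uses the open-source hnswlib package in place of our in-house KNN microservice and stubs the Couchbase lookup with a dict; all parameter values and names are illustrative only.

```python
import numpy as np
import hnswlib

d, n_items = 64, 100_000
# Build the ANN index over the item embedding space (HNSW [13]); "ip" selects
# inner-product affinity, matching the affinity function of Eq. (8).
index = hnswlib.Index(space="ip", dim=d)
index.init_index(max_elements=n_items, ef_construction=200, M=16)
index.add_items(np.random.rand(n_items, d).astype(np.float32), np.arange(n_items))
index.set_ef(100)  # query-time recall/latency trade-off

# Steps 2A-2B: fetch the NRT user embedding (Couchbase stubbed as a dict),
# then retrieve the top-k recommended item ids from the KNN index.
user_store = {"user-42": np.random.rand(d).astype(np.float32)}
labels, _ = index.knn_query(user_store["user-42"], k=10)
print(labels[0])  # item ids to render on the page
```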

5.1. Online Evaluation
In order to understand how our models perform online, we deployed them to serve real world users and production traffic. We compare results for the View Item Page and the Homepage respectively, as well as for the NRT architecture.

5.1.1. View Item Page A/B Test
We performed A/B testing on the View Item Page on the desktop platform, comparing our PWPRec(LS) model to our previous baseline model [17], named GRU4Rec-Enhanced. Our model outperformed the previous baseline with a 38.53% increase in purchases. This implies that our model with the transformer encoder better captures the sequential behavior of a user, and that the Long-Short fusion mechanism is a good choice for automatically balancing the weights captured from long and short interests, better than the previous weighted sum fusion approach.

5.1.2. NRT A/B Test
It was interesting to see the impact of the reduced delay between recommendation generation and serving on the operational metrics of the system as we deployed the NRT engineering architecture to production. Purchases improved by 107% compared to the previous offline system [17]. This makes sense from a user experience perspective: as the user's shopping journey evolves in real time, the model embedding is updated in real time, the recommendation relevance quality improves, and operational metrics are better.

5.1.3. Homepage Multi-Interest User Scrapes
We are in the process of serving our multi-interest model online for an A/B test. However, we want to share some multi-interest user recommendations from the production environment to demonstrate the performance of the model. Figure 8 depicts 3 distinct sets of recommended items based on the multiple interests of a user derived from related browsing history. Based on the user's past viewed items shown on the first line, our model captures three interests for this user, which accurately reveals the intrinsically diverse set of interests of a user throughout their shopping journey.

Figure 8: A user scrape of the production environment. (a) shows the user's historical interacted items; (b)(c)(d) show the three interests captured for this user, with (b) representing the first interest in Jeans, (c) the second interest in Hot Tubs, and (d) the third interest in Rings.

6. Summary and future work
In this paper, we presented an approach for generating personalized recommendations by considering different user behavior patterns on different pages in an industrial e-commerce setting. Different strategies for data formulation and fusion layer adoption have been carefully designed to capture a user's sequential behavior on the View Item Page and the Homepage. The overall structure is based on a two-tower model aiming to learn embeddings of items and users in a shared vector space. To model the user's sequential behavior, we adopt a causal transformer encoder together with Long-Short fusion or Multi-Interest fusion, determined by the page setting, on the user tower side. This approach captures the user's long-short interests and multiple interests well. To verify the effectiveness of our model, we conducted experiments on our in-house datasets as well as commonly adopted public datasets. All experiments showed significant improvements over the compared baseline approaches. Furthermore, a personalized recommender system with the NRT engineering architecture has been launched to production and is now serving recommendations at scale to eBay buyers. This system reacts quickly to instant user interactions and generates large improvements in the buyer shopping experience. Online A/B tests have also been conducted for our proposed model as well as the NRT architecture, which show increases in downstream business metrics, such as purchases.
We are actively working to enhance the performance and extend the application scenarios of our model as well as the engineering system. One direction of future work is to incorporate richer user features (e.g. demographic features like buyer age and behavioral features like purchase quantity) as well as item features (e.g. item price and popularity). Another direction is to add a deep learning ranking model after the multiple recommendation sets have been retrieved, in order to further optimize for operational metrics like engagement or conversion. Last but not least, besides the current View Item Page and Homepage placements we are serving, we plan to extend our personalized recommender system to more scenarios, such as the infinite feed and checkout success placements, in order to give users more personalized and diverse choices with NRT experiences in their shopping journey.

7. Acknowledgements
We want to thank Sathish Veeraraghavan, Bing Zhou, Arman Uygur, Marshall Wu, Sriganesh Madhvanath, Santosh Shahane, Leonard Dahlmann, Menghan Wang, and Jeff Kahn for their generous support with the production system as well as their review comments during the manuscript preparation.
References

[1] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, arXiv:1511.06939 (2016).
[2] B. Hidasi, A. Karatzoglou, Recurrent neural networks with top-k gains for session-based recommendations, arXiv:1706.03847 (2017).
[3] J. Li, P. Ren, Z. Chen, Z. Ren, J. Ma, Neural attentive session-based recommendation, arXiv:1711.04725 (2017).
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv:1706.03762 (2017).
[5] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, arXiv:1808.09781 (2018).
[6] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, arXiv:1904.06690 (2019).
[7] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, T. Tan, Session-based recommendation with graph neural networks, arXiv:1811.00855 (2019).
[8] C. Xu, P. Zhao, Y. Liu, V. S. Sheng, J. Xu, F. Zhuang, J. Fang, X. Zhou, Graph contextualized self-attention network for session-based recommendation (2019). URL: https://www.ijcai.org/proceedings/2019/0547.pdf.
[9] J. Weston, R. J. Weiss, H. Yee, Nonlinear latent factorization by embedding multiple user interests (2013). URL: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41535.pdf.
[10] C. Li, Z. Liu, M. Wu, Y. Xu, P. Huang, H. Zhao, G. Kang, Q. Chen, W. Li, D. L. Lee, Multi-interest network with dynamic routing for recommendation at Tmall, arXiv:1904.08030 (2019).
[11] Y. Cen, J. Zhang, X. Zou, C. Zhou, H. Yang, J. Tang, Controllable multi-interest framework for recommendation, arXiv:2005.09347 (2020).
[12] A. Pal, C. Eksombatchai, Y. Zhou, B. Zhao, C. Rosenberg, J. Leskovec, PinnerSage: Multi-modal user embedding framework for recommendations at Pinterest, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2311–2320.
[13] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824–836.
[14] J.-T. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, L. Yang, Embedding-based retrieval in Facebook search, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2553–2561.
[15] H. Zhang, S. Wang, K. Zhang, Z. Tang, Y. Jiang, Y. Xiao, W. Yan, W.-Y. Yang, Towards personalized and semantic retrieval: An end-to-end solution for e-commerce search via embedding learning, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2407–2416.
[16] S. Li, F. Lv, T. Jin, G. Lin, K. Yang, X. Zeng, X.-M. Wu, Q. Ma, Embedding-based product retrieval in Taobao search, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3181–3189.
[17] T. Wang, Y. M. Brovman, S. Madhvanath, Personalized embedding-based e-commerce recommendations at eBay, arXiv:2102.06156 (2021).
[18] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, arXiv:1808.09781 (2018).
[19] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv:1409.0473 (2014).
[20] Z. Lin, M. Feng, C. Nogueira dos Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A structured self-attentive sentence embedding, arXiv:1703.03130 (2017).
[21] YooChoose, RecSys Challenge 2015, 2015. URL: https://recsys.acm.org/recsys15/challenge/.
[22] Alimama, User behavior data from Taobao for recommendation, 2018. URL: https://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1&lang=en-us.
[23] Y. M. Brovman, Building a deep learning based retrieval system for personalized recommendations, 2022. URL: https://tech.ebayinc.com/engineering/building-a-deep-learning-based-retrieval-system-for-personalized-recommendations/.