<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Quantifying Exploration Preference for E-Commerce Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amy B.Z. Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siyun Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Louca</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karl Ni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diane Hu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell University</institution>
          ,
          <addr-line>2 W Loop Rd, New York, NY 10044</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Etsy, Inc.</institution>
          ,
          <addr-line>117 Adams St, Brooklyn, NY 11201</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recommendation systems are widely used in e-commerce setting to help users find the most relevant items. However, what is relevant for a user changes dynamically. For next-item recommendation, similarity to recently interacted items is desirable in some cases but not others: consider for example the behavior during comparison shopping in contrast to after a purchase. Such shifting user preference regarding exploration is not well captured by existing concepts, much less taken into account in recommendations. In this paper, we ofer definitions to quantify user exploration preference and how it trends over time, based on spread of recently interacted items in embedding space. The soundness of the concepts are illustrated with mathematical properties as well as analysis of platform data. We further demonstrate flexibility and potential by attaching simple modules to well-known baseline algorithms in two separate use cases, comparing performances for three e-commerce datasets. Source code can be found at https://github.com/bz275/DispPred.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;sequential recommendation</kwd>
        <kwd>dynamic preference</kwd>
        <kwd>user profiling</kwd>
        <kwd>embedding model</kwd>
        <kwd>adaptive module</kwd>
        <kwd>evaluation metric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        It is well recognized that user satisfaction with next-item recommendation depends on far more
than similarity to past interactions [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. User preferences shift dynamically with context
and intent, making accounting for such changes important for relevant recommendations.
      </p>
      <p>One aspect of this preference is with regard to the preference for exploration. Consider the
example stream of user activity illustrated by Figure 1, where the user exhibited a change in
attitude towards unfamiliar items. It is intuitive to suppose a user engaging in comparison
shopping is interested in seeing similar alternatives to previous visited items, but prefers novelty
after the purchase is complete. However there lacks quantitative measure that can reflect such
trends, so the focus of this work is to quantify a user’s exploration preference in a dynamic
fashion.</p>
      <p>Specifically, we consider the context of sequential recommendation in e-commerce, and
quantify the afinity to variation a user demonstrates at a given time based on the spread of
their recent activities. With access to item embeddings that reflect item similarity/ relation,
the "variance" of the past  items quantifies the  at that timestep for look-back
window . A smaller value indicates that the vectors are more concentrated, suggesting focused
activities, while a larger value indicates greater dispersion, implying more scattered browsing.
As the look-back window slides forward with each additional interaction, the relative change,
, reflects the direction and magnitude of how the afinity is trending. A positive value
suggests a trend towards increased variation, and the magnitude against the bounded scale
indicates the strength of this trend.</p>
      <p>
        To show how the proposed concepts provide valuable insights into user preference changes
in a real-world setting, we use both publicly available e-commerce datasets and data collected
from live trafic on Etsy, a two-sided online marketplace notable for unique items. We ofer data
analysis results to illustrate conceptually, and also provide proof-of-concept experiments for
two separate tasks, next-item recommendation and purchase prediction. Specifically, we show
that incorporating  improves accuracy (Hit Rate and NDCG) in the former (with lower
 deviation), boosts both ROC-AUC and Precision-Recall AUC in the latter, even with
the module added using the most basic structure. The embeddings we use are generated using
the Universal Sentence Encoder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is more general, as well as SASRec [5], which is task
specific. All experiments are conducted with SASRec as the base engine; while later methods (e.g.
[6], [7]) have competitive performances, SASRec has shown robust SOTA performance and is
representative of the basic transformer infrastructure most commonly used in practical systems.
The adaptive module is added using the most basic structure, demonstrating the potential of
simply quantifying and accounting for changing user intent for extending existing systems. Note
also that the discussions are easily extendable to other contexts with sequential interactions,
such as music or news recommendations, in line with the central goal of adaptability.
      </p>
      <p>While we do propose a related evaluation metric through this new lens of relevance, by
comparing the counter-factual  from the recommended next item to the ground truth,
this work difers in spirit from the broader study of diversification in recommender systems
([8, 9, 10]). Rather than making a prescriptive claim on a proposed metric and proposing an
algorithm that improves performance with respect to the metric, e.g. novelty, our focus is more
descriptive and aims at dynamic user profiling.</p>
      <p>The paper is organized as follows. We discuss comparisons with existing literature in Section
2. In Section 3, we present the detailed definitions and properties of our proposed concepts,
illustrated through analysis of data collected from Etsy. Then in Section 5, we show
proof-ofconcept experimental results using three e-commerce datasets for next-item recommendation
and purchase prediction. Finally, we conclude with Section 6.</p>
      <p>To summarize, our main contributions are as follows:
• We quantify afinity to variation in sequential recommendation to capture dynamic user
exploration preferences, ofering intuitive ingredient for adaptive approaches.
• We establish soundness and usability of the proposed concepts through analysis both
mathematically and based on real-world e-commerce data.
• We demonstrate flexibility and potential by enhancing baseline algorithms with proposed
concepts for two separate use cases, showing improved performance in three datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The dynamic nature of recommender systems has long been recognized. Sequential
recommender models that make personalized next-item predictions based on interaction histories rose
rapidly in popularity, and form the context of the discussion here. Well-known models include
GRU4REC [11], Caser [12], SASRec [5], NextIt[13], BERT4Rec [6], S3-Rec [7]; see surveys
[14, 15, 16].</p>
      <p>
        The discussion of exploration preference in this work is based on the fundamental recognition
that similarity to past interactions alone does not warrant user satisfaction. Empirical evidence
supporting this notion [17] in part spurred development of the field of diversification in
recommender systems, concerned with evaluation metrics beyond accuracy. Most common ones
include diversity [
        <xref ref-type="bibr" rid="ref1">1, 18</xref>
        ], novelty [19, 20], and serendipity/ unexpectedness [21, 22, 23]; we defer
reader to surveys for details [24, 25, 26, 27, 28]. This work share similar motivations but focuses
more on describing user behavior, compared to the more prescriptive nature of studies that
centers around promoting alternative objectives that are important for recommender quality.
      </p>
      <p>Nevertheless, many recent works take a personalized approach that bear relevance to this
work or could be inspiration for future directions. Some approaches balance the tradeof between
accuracy and alternative objectives based on user behavior [8, 29, 30, 31, 32]. User demonstrated
preference is sometimes considered explicitly, in a similar spirit to our work. Kapoor et al.
[33] propose the concept of novelty preference which changes based on user and session, and
is predicted with the help of behavioral psychology insights. Li et al. [34] use multi-cluster
modeling of user interests in the latent space to personalize unexpectedness using self-attention.
Mehrotra et al. [35] log user behavioral response to divergent recommendations to predict
future receptivity. Qian et al. [36] disentangle user intent into popularity conformity and
personal preference to introduce intrinsic novelty for long-tail items.</p>
      <p>Another aspect of our proposed quantification involves the dependency on context that
changes over time. While it is mostly implicit here through user behavior, many recent works
explicitly model these temporal context and efects [ 37, 38, 39, 40, 41, 42, 43]. Of particular
interest is the concept of temporal diversity proposed by Lathia et al. [44] that similarly explores
the dissimilarity between sets of top-N recommendations temporally, though they concern
mainly with the same items repeating over time.</p>
      <p>Lastly, we highlight some studies on the particular efect of user intent on preference drift
[45, 46, 47]. Intuitively, the exploration preference of an user is linked to how strongly they
intend to purchase, and indeed we use purchase prediction as one of the example use cases.
However, the central focus of this work is not on determining what governs user intent but
rather on representing and adapting to exhibited behavior.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Afinity to Variation</title>
      <p>We now provide formal definitions of the concepts described thus far, ofering design
justifications and useful properties.</p>
      <p>Recall again that we are interested in a given user’s afinity to variation given their recent
interactions, based on how concentrated the items when embedded on a vector space. We
assume access to the sequential interaction history of each user, where each interaction consists
of an item and an action. Furthermore, we assume access to a reasonable embedding space, as
detailed later.
 For a ifxed user, consider a sequence of interactions,
{(1, 1), (2, 2), ..., ( ,  )}, where  is the item the user interested with at time 
and  is the action taken. Denote this sequence of items in the embedding space as
{1, 2, ...,  }.</p>
      <p>Let  be the size of the look-back window, where we consider the last  interactions to be a
snapshot of a user’s current attention space. To assess the concentration level of this attention
space, we measure the spread of these items on the embedding space using high-dimensional
analogue of variance, known in some contexts as the squared standard distance deviation.</p>
      <p>We define the  of the -step sub-sequence ending at time step  as

 2(; ) := 1 ∑︁ ‖ −  (; )‖2,</p>
      <p>=− 
where  (; ) = 1 (∑︀</p>
      <p>=−   ) and ‖ · ‖ is the Euclidean norm. In other words, 
measures the average squared Euclidean distance of the points in the set from the mean of the
set. We omit  when it is clear from the context.</p>
      <p>This is meant to be a simple and convenient definition, aiming towards our ambitions of
usability in practice and adaptability for future research. The simplicity of the definition, along
with its resemblance to variance, helps with intuition and interpretation. Moreover, similar to
the standard definition of variance,  can be updated incrementally as the sequence
rolls forward, without having to store every embedding in the sequence and thereby reducing
memory demands. Having a simple construct also help the concepts to be well-behaved, and
thus conducive to future adaptations. Specifically, the standard components of 
lead to well-understood properties that allow easy incorporation into deep neural networks,
while the problem-agnostic range of  makes it viable for settings such as weighting or
hypothesis testing.
 With this basic quantification of how spread out the user’s interactions are over a
ifxed window, we examine the dynamics of how the spread changes with each new interaction.</p>
      <p>With each subsequent item that the user interacts with, the window of  most recent items
is shifted forward to include this new item. Because the window size is fixed, how the relative
distance from the set to the new item compares to that to the least recent item is reflected in
the change in .</p>
      <p>While direction of change is the most important for detecting trends, we would also want to
understand the strength of the trend through magnitude of change. However, note that the scale
of  depends in general on the distribution of the items on the embedding space, and
specifically on which regions the recent set of interaction items are mapped to, so the diference
in  could vary greatly in scale. This implies concerns in interpretability not only
from simple diference but also from percentage change, since the resulting  could
be orders of magnitude larger than the preceding one, especially for embedding space that is
normalized.</p>
      <p>To capture this change while being scale invariant, we define  as the relative
diference resulting from shifting the window, comparing the diference to the average:
 2() −  2( − 1)
() := 1 ( 2() +  2( − 1)) .</p>
      <p>2
(1)</p>
      <p>A positive value of  indicates an increase in , reflective of increased
afinity to variation; the reverse is true of a negative value.</p>
      <p>Note that the risk of division by zero is reduced, but not eliminated, in which case the
definition becomes ill-posed. This only happens in the special case that  2() =  2( − 1) = 0;
there defining  to be 0 seems reasonable. However, for the purpose of our discussions,
we will in fact disregard this case. A  value of 0 implies that all items interacted in
the look-back window are at the same point on the embedding space, or identical from the view
of the vector representation. While this could be possible, it could also indicate an embedding
space without enough representation power, unintended system behavior or even malicious bot
activity. It is more informative for raising concerns than our purpose of understanding user
behavior patterns.</p>
      <p>Putting aside the above technicality, the range of  is bounded independent of the
embedding space, as stated below.</p>
      <p>Lemma 3.1.  has value bounded between -2 and 2.</p>
      <p>The proof is in the appendix. This is a theoretical range to certify the bounded behavior of the
definition, while in practice the real range could be much narrower as shown in the experiments,
depending on the characteristics of the embedding space.
 We treat  as a dimension of user intent, and evaluate a sequential
recommender system in its ability to correctly capture it. We argue that recommended items that
result in similar  as the true next-item better match the user’s intended behavior,
reflecting an alternative aspect of relevance.</p>
      <p>The time indices are less important for evaluating a recommender system, as recommender
outputs to predefined test trajectories are compared to the ground truths. Thus we instead
index trajectories to simplify notation. Specifically, let  denote the relative change to
 from the true next item following trajectory . Then replace the ground truth by
the next-item predicted by the recommender, and denote the hypothetical  as   2,
and hypothetical relative change using   2 as  . The absolute diference between the
two values represents the error in capturing the user’s true change in afinity to variation.</p>
      <p>To evaluate on the recommender level, we average the above diference over all trajectories
in the test set T, and call it Average DispAdd Deviation ():
 :=
1 ∑︁ (︀ | −  |︀) .</p>
      <p>T
| | ∈T</p>
      <p>In the ideal case that the correct  is perfectly captured and the error is 0, (such as is
the case if the predicted item is the true next item), then  2() =  2().</p>
      <p>In fact, we show this two way relation is true in general:
Lemma 3.2. | −  | = 0 if  2() =  2().</p>
      <p>Consequently, we will explicitly model  2() and  2() instead of the respective relative
diferences in the next section, for simplicity and numerical stability.</p>
      <p>Embeddings The above definitions rest on the assumption that there is an embedding that
can suficiently capture the relation among items, specifically that pairwise distances between
item embeddings have reasonable correspondence to the similarities between the items. How to
encode rich item information appropriately into embedding vectors is an actively research area
in itself, with no universally agreed upon best practice. On the other hand, many recommender
systems train their own embedding mappings in the process of learning item relevance, where
the performance of the system lends credibility to the quality of the embeddings. Our proposed
definitions work with embedding either trained separately or from a base recommender engine.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis of platform data</title>
      <p>
        We now explore our definitions of  and  on real-world data. We draw
samples from Etsy user interaction data, and calculate embedding vectors using the Universal
Sentence Encoder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We present results from moderately active users, specifically those who
made between 2 and 6 purchases in the past year, to avoid any potential idiosyncrasies of the
extremes. Results from highly active users in fact show similar patterns. We sample a percentage
from the candidate pool, roughly 3,000 users, and collect their interactions from the first six
months of 2021. There are 4 types of actions: view, favorite, cart, and buy. Users with less than
20 interactions and 3 purchases are filtered, and consecutive interactions within 4 hours of each
other that are identical are removed. The resulting dataset contains 799 users, with an average
of 164 interactions per user.
      </p>
      <p>First, we illustrate the definitions with example trajectories to provide a concrete
understanding, and to check for consistency with our intuition. We calculate  for a window of
six items, displayed on the top row in each figure, then calculate the  for each of the
four items that follow, shown on the bottom row.</p>
      <p>In Figure 2, we can see that the user starts by interacting with sewing patterns of similar
styles; this is corroborated by a small  value of 0.172. The first item that follows is
again similar, confirmed by a near-zero  that reflects little change. However, each of
the next two items belongs to a diferent category, showing an expanding afinity to variation;
indeed, we see large positive  values for both steps. The last item is less of a departure
from the item prior, showing a slow in the exploration, which the small positive  value
again confirms.</p>
      <p>The opposite pattern can be seen in Figure 3. Here the user begins with a diverse selection
including various kitchen items and notebooks, captured by a relatively large  value
of 0.507. The spread then shrinks as the user focuses in on notebooks of a particular style, during
which process we see increasingly negative  values. Then, when the user interacts
with an item from a diferent category, there is a slight expansion in the afinity to variation
again, reflected by a small positive .</p>
      <p>Now, we attempt to verify intuitive hypotheses about user behavior patterns using .
Specifically, it seems reasonable to expect that before a user makes a purchase, they have a
higher tendency to look at items that are highly similar to each other to compare. After the
purchase is complete, on the other hand, we might expect them to be more interested in items
that are meaningfully diferent, as their previous need has been met. To extract supporting
evidence, we look at each time a purchase is made, and take the last five interacted items prior to
it as well as the five following it. We compute the  resulting from each interaction, and
average them according to the number of interactions from the purchase (negative corresponds
to before purchase and positive corresponds to after); we plot the results in Figure 4 for four
diferent window sizes. We see that there is a clear distinction between items interacted with
prior to purchase, which produced negative  on average, and items interacted with
after a purchase, whose resulting  increased noticeably. The trends are consistent for
the window sizes presented. Note that smaller window sizes result in less smoothness and vice
versa, so we choose sizes which are suficiently large but not so much to cause overly muted
dynamics.</p>
      <p>We repeat a similar analysis, but divide the behaviour relative to the start of a new session,
which is an interaction that is at least 4 hours apart from the previous one. We again take five
items from the end of the previous session and five items from the start of the new session,
and average the  from each interaction. We observe a similar trend, though of a
reduced scale. In Figure 5, we again plot the average  against the distance, defined
similar to above. The plots reflect a tendency for users to look over items very similar to recent
interactions right before the end of a session, and interact with items further removed from the
average of recent interactions at the start of a new session. Compared to the behavior after a
purchase, this exploration seems to end much sooner on average, indicated by a quick drop of
the average  towards 0. This aligns with user interviews conducted by the platform,
where users report themselves typically returning via the home page, where they may engage
briefly with prompts for exploratory contents, then go back to where they left of.</p>
      <p>We now examine patterns on the taxonomy level. Etsy categorizes products with 15 top-level
taxonomies; for each interaction, we average its  into the taxonomy the item belongs
to, and present the resulting bar plot on the bottom of Figure 6. To contrast with alternative
quantification without the new concepts, we plot on the top of Figure 6 the average number
of consecutive interactions in the same taxonomy (blue solid line) and the total number of
purchases of this taxonomy(red dashed line).</p>
      <p>When a taxonomy has negative average , it suggests that users overall spend more
time in narrowing behavior when interacting with items in the category. In other words, the
plot shows listings under the categories of Jewelry and Wedding, for example, are more likely
to be visited following interactions with similar items. This largely aligns with our intuitive
understanding of user behaviors. For jewelry, one typically moves through many iterations of
products that have increasing number of desirable features, easily resulting in long trajectories
of increasingly similar items. This is consistent with the alternative statistics, with largest
number of consecutive interactions of all taxonomies. For wedding supplies, it is reasonable to
imagine that most users engaging with the category have persistent interest over a prolonged
period of planning, and have generally static style preferences. The prolonged planning is
confirmed by the low total purchases, and the short average consecutive interactions may be
due to interactions with other categories in-between.</p>
      <p>On the other hand, when a taxonomy has positive average , it reflects that users visit
the category more frequently as part of exploratory behavior rather than focused engagements.</p>
      <p>This usually mean less purchasing, as confirmed by the overall lower total purchases for
taxonomies with more positive . There is one category, Home and Living, that stands
out as having relative a high purchase number despite the positive average . One
potential interpretation might be that highly similar items are less common in the category,
or that many items serve some specific functionalities, so there is less incentive to re-visit the
same or a similar item once a user is satisfied with a discovery.</p>
      <p>In both cases, we see that alternative statistics are able to help interpret patterns uncovered
by afinity to variation, but not reflect these patterns themselves. Our proposed concepts, in
reflecting this dynamic form of intent otherwise hard to capture, enables a novel angle for
gaining insights into user behavior.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Numerical Evidence</title>
      <p>The ability to quantify a user’s afinity to variation opens up a range of application opportunities.
For proof-of-concept, we incorporate the proposed concepts in two common applications:
nextitem recommendation and purchase prediction.</p>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <p>We investigate the performance on three diferent e-commerce datasets, two publicly available
and one proprietary. In general we remove users and actions with too few interactions to ensure
a base level of density. Moreover, since we aim to capture changing user behavior patterns, we
only preserve users with a minimum number of purchases. The data cleaning procedures are
detailed below, and the resulting summary statistics are in Table 1.</p>
        <p>Retail-Rocket: An e-commerce dataset publicly available on Kaggle 1 that contains 4.5
months of user interaction data. Actions include click, add-to-cart, and transaction. We set
minimum number of interactions to 5, but no minimum purchase due to data size.</p>
        <p>Tmall:. A publicly available dataset provided by Alibaba 2 that contains around 10,000 user
records and 12 million actions of user activities on Taobao app from Nov 18, 2014 to Dec 18,
2014. Each action is one of click, add-to-favorite, add-to-cart and purchase. We set interaction
minimum to 5, and purchase minimum to 2.</p>
        <p>Etsy: A proprietary dataset collected from user activity logs between Aug 1, 2021 to Aug 31,
2021. A random sample is drawn from the most active users on the platform, containing the
user id, item id, and interaction type, which include view, cart, favorite and purchase. Minimum
number of interactions is set to 20, and minimum purchase is set to 3.</p>
        <p>For each user trajectory, the second to last item is used for validation and the last for testing.
All remaining interactions with at least  steps of history are used for training. Hyperparameters
are tuned on the validation sets, then results on the test sets are reported. For more eficient
evaluation, for each target item we sample 99 negative examples to create the candidate set,
1https://www.kaggle.com/retailrocket/ecommerce-dataset/home
2https://tianchi.aliyun.com/dataset/dataDetail?dataId=46
and evaluate on a sample of 10,000 users when applicable. When an identical action is repeated,
only the first is preserved to reduce noise.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Embeddings and Base Engine</title>
        <p>We opt to use embeddings trained with a base recommendation engine to demonstrate the
lfexibility of our construct. This is particularly convenient for recommendation tasks such as
our first example, since  can simply be incorporated in an additional module on top
of the base model.</p>
        <p>For clarity and consistency, we choose to use one base engine for both experiments, varying
the size and sparsity of the datasets instead. This of course raises the concern of generalizability.
Our ideal candidate is a base engine with verified performance to ensure the informativeness
of our defined concepts, but not overly optimized or specialized, in order that the outcomes
of the experiments can be reflective of similar engines. We examined the performance of two
popular baseline methods, SASRec ([5]) and GRU4Rec ([11]), both applicable to general settings
with no requirements on data density or context information. We found that the characteristics
of  difered more among the datasets of varying size and sparsity using the same
embedding mapping than between the embedding mappings, supporting the choice to prioritize
varying the datasets in the experiments.</p>
        <p>SASRec is what we present here since it is a representative sequential recommendation
approach. It is used as baseline in numerous studies with consistent performance (e.g., [13, 41])
and extended upon in later works (e.g., [6, 46]). In fact, the attention mechanism it uses from
the Transformer structure ([48]) is actively studied and extended upon both in and outside
of recommender systems ([49, 50]), proving to be a adaptable method with the potential to
persist as a component in a subclass of future state-of-the-art methods. This in turn hopefully
lends more relevance to the demonstrated performance boosts from incorporating our proposed
concepts.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Application 1: Next-item Recommendation</title>
        <p>As afinity to variation is defined in the context of sequential interactions, next-item
recommendation is the most natural application.</p>
        <p>From the exploratory analysis, we see that the pattern of a user’s engagement is not static
across time. For example, one might engage with items similar to the previous one for some
period of time while opting for entirely diferent categories during others. While many
sophisticated sequential models have been proposed to account for various short-term and long-term
trends (e.g. [51, 52]),  provides direct information on the change in the user’s
afinity to variation. Thus, we aim to augment existing sequential recommender systems by also
considering the  signal.</p>
        <p>Specifically, in order to choose the most relevant item to recommend next, we predict
 at the next time step and assign a score to each candidate item based on how
closely their inclusion would match this predicted level of . In particular, to
recommend a next-item at time  , we use item embeddings from the base model to compute  2()
for  ≤  , and predict ˆ2( + 1). Then, a relevance score for each of the candidate items
is computed based on the closeness to generating ˆ2( + 1). Here we do so by computing
 2( + 1) induced by each candidate item and then using the negative of the L1 distance to
ˆ2( + 1) as logits. We perform the prediction with a simple two-layer network, as illustrated
in Figure 7 (LEFT), but any prediction model (and qualifying base engine) can be swapped into
the general framework, indicated by the dashed box(es) in Figure 7 (RIGHT).</p>
        <p>In general, a base engine qualifies if it (1) is a sequential recommendation model, (2) uses
embedding vector representations for items, and (3) generates recommended next items from a
set of candidates by computing their relevance scores. These are common to some of the most
popular models for sequential recommendations, ensuring that the structure is generalizable
and expandable.</p>
        <p>Note that a possible alternative is to implement a secondary prediction algorithm for the
desirable item embedding that would generate  of value ˆ2( + 1), then compute
the relevance with inner product, which might be more eficient for large systems.</p>
        <p>With the new relevance scores based on afinity to variation, we can obtain a diferent ranking
of the candidate items by adding them to the original scores produced by the base engine. To
ensure the relative magnitude of the two is appropriate, we scale the new scores to match
the mean of the original scores, then weigh by a changeable hyperparameter  . We rank the
candidate items using this combined score.</p>
        <p>We adopt the hyperparameters and initialization strategies suggested by the author for
SASRec. We use learning rate 0.001, latent dimension 50, l2 regularization parameter 1e-6, and
maximum sequence length roughly proportional to the average action length: Retail-Rocket at
30, Tmall at 300 and Etsy at 100. We tuned the hyperparameters for the  prediction
module on the validation set. We set the learning rate to 0.001 after performing a grid search from
{0.1, 0.01, 0.001, 0.0001}, and the weight in the combined score to 0.8. We set the window/kernel
size to 6/4, 8/5, 7/3 for Retail-Rocket, Tmall, and Etsy respectively, from {6,7,8,9,10} for window
and {1,2,3,4,5,6} for kernel. The reported results are based on outputs for the test set using
the combined score between the tuned  module and SASRec. However, note that
improvements are observed consistently for other hyperparameter choices as well, without
discernible trends.</p>
        <p>To evaluate the performance, we compare the ranking resulting from the combined scores to
the ranking using only the base scores. Table 2 contains the results, reported as the average
over 5 repeated runs; results that are statistically better, with p-value less than 0.05, are in bold.
Retail-Rocket is listed as "R-R".</p>
        <p>For accuracy, we report popular measures HitRate@5, NDCG@5 and HitRate@1. Across
all datasets, we see that the combined score produced rankings with higher NDCG@5 and
HitRate@1 compared to the base model alone, showing a boost from incorporating 
information even in this simplest form. We also report the  values, which are lower for
rankings using the combined score. Since we argue a recommendation that better matches the
 of the true next-item is more relevant to the user, it suggests additional reduction in
user disatifaction not captured by the improved accuracy.</p>
        <p>Note that for comparison, we also compute  for a randomly selected item to follow
each trajectory, which is generally larger but never exceeds 0.6. It seems that the embedding
space generally produces small diferences in  for the item sets. Furthermore, this
indicates the results of the base recommender on average matches the users’ demonstrated
 much better than random, which is as we expect with the accuracy performance.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Application 2: Predicting Purchases</title>
        <p>A second example application we explore is the connection between afinity to variation
dynamics and purchase behaviors. Analysis presented before shows an overall narrowing
trend in interest sets prior to purchases, so a natural question is whether modeling 
might help forecast whether a purchase will occur.</p>
        <p>Answering this question is meaningful for e-commerce platforms, as it could help understand
user intents and better cater to real time needs. This would be especially the case if a reliable
purchase forecast could be obtained a period of time in advance (e.g. at the start of a session).
As a first step, we focus on a simple sequential setting, and aim to predict whether the next
interaction will be a purchase using information up to the current time.</p>
        <p>The most basic way to attempt this is to perform a time series forecast using actions up
to the current time as features. This is the approach we compare against as the baseline. To
investigate the value from information of afinity to variation, we simply include 
as an additional feature at each time step. In other words, now there are two covariates for the
timeseries forecast. Note that  is used as the feature instead of  because
it preserves the magnitude information, while the key signal from , the sign, can be
derived from the first diference.</p>
        <p>We use the same two-layer network as shown in Figure 7, changing only the input and
output dimensions in addition to tuneable hyperparameters. For fairness of comparison, we
tune the hyperparameters based on the baseline model using only action signals, then keep
the same model configuration when adding the  signal. For each, the model with
the best performance on the validation set during training is selected and the performance
on the test set reported. Because the number of purchases is much lower than the number
of non-purchases, we report not only the AUC (area-under-the-curve) value of the Receiver
Operating Characteristic (ROC) curve, which shows the general diagnostic ability of the binary
classifier, but also the AUC value for the Precision-Recall (PR) curve, which shows the trade-of
between precision and recall of only the positive class.</p>
        <p>The results for the three datasets are shown in Table 3, with ROC-AUC reported as "AUC"
and Precision-Recall AUC reported as "PR-A". The learning rate used for Retail-Rocket and Etsy
is 0.01, and for Tmall is 0.1, after performing a grid search from {0.1, 0.01, 0.001, 0.0001}. The
same window and kernel sizes are used, and the number of hidden units in the feedforward
network is set to 16 for Retail-Rocket and 32 for the remaining two datasets, after searching
{2, 4, 8, 16, 32}. We observe that the prediction accuracy for the purchase action is higher for
all three datasets. While the improvements seem marginal for ROC-AUCs, the percentage
improvements for PR-AUCs are much higher, reflecting stronger performance in precision and
recall for positive instances of purchase actions despite their rare occurrence. Note that the
predictive ability for Tmall is poor overall due to an extremely small number of purchase signals
in the test set, only 1.1% at 91 purchases.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Future Directions</title>
        <p>Our research highlights the importance of understanding users’ exploration preference
dynamically, and establishes a foundation for new approaches in adaptive systems. One could envision
an ensemble recommender system with specialized context dependent modules, modulated
based on the user’s current afinity to variation. This could help open up possibilities on more
creative approaches to recommendations, such as a style based exploration module. If a large
increase in afinity to variation is anticipated, the module is activated to ofer recommendations
that match recent interactions in style but not necessarily category, providing more
opportunities for serendipitous discoveries. As a further potential for interplay with diversification
work, our derived metric could help ground the appropriate balance between accuracy and
diversification objectives. By demonstrating user demonstrated preference, it takes away the
burden of arguing for diversification to be added at the "cost" of accuracy, and introduces an
additional dimension to the inherently multi-faceted concept of relevance.</p>
        <p>Other future directions include alternative constructs for user’s exploration preference, such
as using determinant of embedding matrix, or more dynamic treatment of the lookback window
size. It may also be worth examining the confounding efect of existing recommender policies
on observed user behavior regarding exploration preferences, which could yield behavioral
insights and present opportunity for further studies on how to remove such bias.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we introduced novel concepts to quantify and analyze user exploration preference
in the context of sequential interactions. We propose afinity to variation measure at each
time step,  to measure the relative change, and a derived metric to evaluate whether
results from a recommender system matches the true . These definitions with their
mathematical properties establish a foundation for building adaptive recommendation systems
that account for the dynamic nature of user preferences, demonstrated through promising
results in proof-of-concepts experiments. The insights and methodologies presented in this
paper open up simple and implementable avenues to enhance recommendation algorithms for
more relevant item suggestions to users. With potential to connect ongoing research, we believe
that our approach opens exciting new directions for progress in more adaptive and efective
recommendation systems in the future.
[5] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, in: 2018 IEEE
International Conference on Data Mining (ICDM), IEEE, 2018, pp. 197–206.
[6] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, Bert4rec: Sequential recommendation
with bidirectional encoder representations from transformer, in: Proceedings of the 28th
ACM international conference on information and knowledge management, 2019, pp.
1441–1450.
[7] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, J.-R. Wen,
S3rec: Self-supervised learning for sequential recommendation with mutual information
maximization, in: Proceedings of the 29th ACM international conference on information
&amp; knowledge management, 2020, pp. 1893–1902.
[8] M. Zhang, N. Hurley, Novel item recommendation by user profile partitioning, in: 2009
IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent
Technology, volume 1, IEEE, 2009, pp. 508–515.
[9] R. Boim, T. Milo, S. Novgorodov, Diversification and refinement in collaborative filtering
recommender, in: Proceedings of the 20th ACM international conference on Information
and knowledge management, 2011, pp. 739–744.
[10] J. Parapar, F. Radlinski, Diverse user preference elicitation with multi-armed bandits, in:
Proceedings of the 14th ACM International Conference on Web Search and Data Mining,
2021, pp. 130–138.
[11] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with
recurrent neural networks, arXiv preprint arXiv:1511.06939 (2015).
[12] J. Tang, K. Wang, Personalized top-n sequential recommendation via convolutional
sequence embedding, in: Proceedings of the eleventh ACM international conference on
web search and data mining, 2018, pp. 565–573.
[13] F. Yuan, A. Karatzoglou, I. Arapakis, J. M. Jose, X. He, A simple convolutional generative
network for next item recommendation, in: Proceedings of the Twelfth ACM International
Conference on Web Search and Data Mining, 2019, pp. 582–590.
[14] C. Rana, S. K. Jain, A study of the dynamic features of recommender systems, Artificial</p>
      <p>Intelligence Review 43 (2015) 141–153.
[15] M. Quadrana, P. Cremonesi, D. Jannach, Sequence-aware recommender systems, ACM</p>
      <p>Computing Surveys (CSUR) 51 (2018) 1–36.
[16] I. Rabiu, N. Salim, A. Da’u, A. Osman, Recommender system based on temporal models: a
systematic review, Applied Sciences 10 (2020) 2204.
[17] C.-N. Ziegler, S. M. McNee, J. A. Konstan, G. Lausen, Improving recommendation lists
through topic diversification, in: Proceedings of the 14th international conference on
World Wide Web, 2005, pp. 22–32.
[18] S. Vargas, L. Baltrunas, A. Karatzoglou, P. Castells, Coverage, redundancy and
sizeawareness in genre diversity for recommender systems, in: Proceedings of the 8th ACM
Conference on Recommender systems, 2014, pp. 209–216.
[19] N. Hurley, M. Zhang, Novelty and diversity in top-n recommendation–analysis and
evaluation, ACM Transactions on Internet Technology (TOIT) 10 (2011) 1–30.
[20] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender
systems, in: Proceedings of the fifth ACM conference on Recommender systems, 2011, pp.
109–116.
[21] Y. C. Zhang, D. Ó. Séaghdha, D. Quercia, T. Jambor, Auralist: introducing serendipity into
music recommendation, in: Proceedings of the fifth ACM international conference on
Web search and data mining, 2012, pp. 13–22.
[22] P. Adamopoulos, A. Tuzhilin, On unexpectedness in recommender systems: Or how to
better expect the unexpected, ACM Transactions on Intelligent Systems and Technology
(TIST) 5 (2014) 1–32.
[23] D. Kotkov, S. Wang, J. Veijalainen, A survey of serendipity in recommender systems,</p>
      <p>Knowledge-Based Systems 111 (2016) 180–192.
[24] S. Gollapudi, A. Sharma, An axiomatic approach for result diversification, in: Proceedings
of the 18th international conference on World wide web, 2009, pp. 381–390.
[25] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender
systems by coverage and serendipity, in: Proceedings of the fourth ACM conference on
Recommender systems, 2010, pp. 257–260.
[26] M. Kaminskas, D. Bridge, Diversity, serendipity, novelty, and coverage: a survey and
empirical analysis of beyond-accuracy objectives in recommender systems, ACM Transactions
on Interactive Intelligent Systems (TiiS) 7 (2016) 1–42.
[27] T. Silveira, M. Zhang, X. Lin, Y. Liu, S. Ma, How good your recommender system is? a
survey on evaluations in recommendation, International Journal of Machine Learning and
Cybernetics 10 (2019) 813–831.
[28] Q. Wu, Y. Liu, C. Miao, Y. Zhao, L. Guan, H. Tang, Recent advances in diversified
recommendation, arXiv preprint arXiv:1905.06589 (2019).
[29] N. Tintarev, M. Dennis, J. Masthof, Adapting recommendation diversity to openness to
experience: a study of human behaviour, in: International Conference on User Modeling,
Adaptation, and Personalization, Springer, 2013, pp. 190–202.
[30] C. H. Teo, H. Nassif, D. Hill, S. Srinivasan, M. Goodman, V. Mohan, S. Vishwanathan,
Adaptive, personalized diversity for visual discovery, in: Proceedings of the 10th ACM
conference on recommender systems, 2016, pp. 35–38.
[31] Y. Song, N. Sahoo, E. Ofek, When and how to diversify—a multicategory utility model for
personalized content recommendation, Management Science 65 (2019) 3737–3757.
[32] R. S. Fortes, D. X. de Sousa, D. G. Coelho, A. M. Lacerda, M. A. Gonçalves,
Individualized extreme dominance (inded): A new preference-based method for multi-objective
recommender systems, Information Sciences 572 (2021) 558–573.
[33] K. Kapoor, V. Kumar, L. Terveen, J. A. Konstan, P. Schrater, " i like to explore sometimes"
adapting to dynamic user novelty preferences, in: Proceedings of the 9th ACM Conference
on Recommender Systems, 2015, pp. 19–26.
[34] P. Li, M. Que, Z. Jiang, Y. Hu, A. Tuzhilin, Purs: Personalized unexpected recommender
system for improving user satisfaction, Fourteenth ACM Conference on Recommender
Systems (2020).
[35] R. Mehrotra, C. Shah, B. Carterette, Investigating listeners’ responses to divergent
recommendations, in: Proceedings of the 14th ACM Conference on Recommender Systems,
RecSys ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 692–696.</p>
      <p>URL: https://doi.org/10.1145/3383313.3418482. doi:10.1145/3383313.3418482.
[36] T. Qian, Y. Liang, Q. Li, X. Ma, K. Sun, Z. Peng, Intent disentanglement and feature
self-supervision for novel recommendation, ArXiv abs/2106.14388 (2021).
[37] P. G. Campos, F. Díez, I. Cantador, Time-aware recommender systems: a comprehensive
survey and analysis of existing evaluation protocols, User Modeling and User-Adapted
Interaction 24 (2014) 67–119.
[38] B. Veloso, B. Malheiro, J. C. Burguillo, J. Foss, Personalised fading for stream data, in:</p>
      <p>Proceedings of the Symposium on Applied Computing, 2017, pp. 870–872.
[39] Y. Wu, K. Li, G. Zhao, X. Qian, Long-and short-term preference learning for next poi
recommendation, in: Proceedings of the 28th ACM international conference on information
and knowledge management, 2019, pp. 2301–2304.
[40] Y. Gu, Z. Ding, S. Wang, D. Yin, Hierarchical user profiling for e-commerce recommender
systems, in: Proceedings of the 13th International Conference on Web Search and Data
Mining, 2020, pp. 223–231.
[41] J. Wang, R. Louca, D. Hu, C. Cellier, J. Caverlee, L. Hong, Time to shop for valentine’s day:
Shopping occasions and sequential recommendation in e-commerce, in: Proceedings of
the 13th International Conference on Web Search and Data Mining, 2020, pp. 645–653.
[42] F. A. Chowdhury, Y. Liu, K. Saha, N. Vincent, L. Neves, N. Shah, M. W. Bos, Ceam: The
efectiveness of cyclic and ephemeral attention models of user behavior on social platforms,
in: ICWSM, 2021.
[43] R. Verachtert, L. Michiels, B. Goethals, Are we forgetting something? correctly evaluate a
recommender system with an optimal training window, in: Proceedings of the Perspectives
on the Evaluation of Recommender Systems Workshop, volume 3228, 2022.
[44] N. Lathia, S. Hailes, L. Capra, X. Amatriain, Temporal diversity in recommender systems, in:
Proceedings of the 33rd international ACM SIGIR conference on Research and development
in information retrieval, 2010, pp. 210–217.
[45] M. Zaheer, A. Ahmed, A. J. Smola, Latent lstm allocation: Joint clustering and non-linear
dynamic modeling of sequence data, in: International Conference on Machine Learning,
PMLR, 2017, pp. 3967–3976.
[46] M. M. Tanjim, C. Su, E. Benjamin, D. Hu, L. Hong, J. McAuley, Attentive sequential models
of latent intent for next item recommendation, in: Proceedings of The Web Conference
2020, 2020, pp. 2528–2534.
[47] C. Wang, W. Ma, M. Zhang, C. Chen, Y. Liu, S. Ma, Toward dynamic user intention:
Temporal evolutionary efects of item relations in sequential recommendation, ACM
Transactions on Information Systems (TOIS) 39 (2020) 1–33.
[48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I.
Polosukhin, Attention is all you need, Advances in neural information processing systems 30
(2017).
[49] E. Fischer, D. Zoller, A. Hotho, Comparison of transformer-based sequential product
recommendation models for the coveo data challenge (2021).
[50] T. Lin, Y. Wang, X. Liu, X. Qiu, A survey of transformers, arXiv preprint arXiv:2106.04554
(2021).
[51] L. Li, L. Zheng, F. Yang, T. Li, Modeling and broadening temporal user interest in
personalized news recommendation, Expert Systems with Applications 41 (2014) 3168–3177.
[52] D.-T. Le, Y. Fang, H. W. Lauw, Modeling sequential preferences with dynamic user and
context factors, in: Joint European Conference on Machine Learning and Knowledge
Discovery in Databases, Springer, 2016, pp. 145–161.
Proof of Lemma 3.1 If  2() ≥  2( − 1) ≥ 0,
If instead  2() −  2( − 1) ≤ 0,
Note that  is nonnegative by construction. Then if the denominator is nonzero, the
numerator is zero only if  2() −  2() = 0 given  2( − 1) ̸= 0. □</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <article-title>Improving recommendation diversity</article-title>
          ,
          <source>in: Proceedings of the twelfth Irish conference on artificial intelligence and cognitive science, Maynooth, Ireland</source>
          , volume
          <volume>85</volume>
          ,
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          ,
          <year>2001</year>
          , pp.
          <fpage>141</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>McNee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <article-title>Being accurate is not enough: how accuracy metrics have hurt recommender systems</article-title>
          , in: CHI'
          <article-title>06 extended abstracts on Human factors in computing systems</article-title>
          ,
          <year>2006</year>
          , pp.
          <fpage>1097</fpage>
          -
          <lpage>1101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Demangeot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Broderick</surname>
          </string-name>
          ,
          <article-title>Exploration and its manifestations in the context of online shopping</article-title>
          ,
          <source>Journal of Marketing Management</source>
          <volume>26</volume>
          (
          <year>2010</year>
          )
          <fpage>1256</fpage>
          -
          <lpage>1278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , S.-y. Kong,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            R. S. John,
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          , M. GuajardoCéspedes, S. Yuan,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          , et al.,
          <source>Universal sentence encoder</source>
          , arXiv preprint arXiv:
          <year>1803</year>
          .
          <volume>11175</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>