                         TRACE: Transformer-based user Representations from
                         Attributed Clickstream Event sequences
                         William Black1 , Alexander Manlove1 , Jack Pennington1 , Andrea Marchini1 , Ercument Ilhan1
                         and Vilda Markeviciute1
                         1
                             Expedia Group, 407 St John St, London EC1V 4EX


                                       Abstract
                                        For users navigating travel e-commerce websites, the process of researching products and making a purchase
                                        often results in intricate browsing patterns that span numerous sessions over an extended period of time. The
                                        resulting clickstream data chronicle these user journeys and present valuable opportunities to derive insights
                                        that can significantly enhance personalized recommendations. We introduce TRACE, a novel transformer-
                                        based approach tailored to generate rich user embeddings from live multi-session clickstreams for real-time
                                        recommendation applications. Prior works largely focus on single-session product sequences, whereas TRACE
                                        leverages site-wide page view sequences spanning multiple user sessions to model long-term engagement.
                                        Employing a multi-task learning framework, TRACE captures comprehensive user preferences and intents
                                        distilled into low-dimensional representations. We demonstrate TRACE’s superior performance over vanilla
                                        transformer and LLM-style architectures through extensive experiments on a large-scale travel e-commerce
                                        dataset of real user journeys, where the challenges of long page-histories and sparse targets are particularly
                                        prevalent. Visualizations of the learned embeddings reveal meaningful clusters corresponding to latent user
                                        states and behaviors, highlighting TRACE’s potential to enhance recommendation systems by capturing nuanced
                                        user interactions and preferences.

                                        Keywords
                                        transformers, user embeddings, clickstream data, multi-task




                         1. Introduction
On tourism e-commerce websites, users often exhibit complex navigation patterns whilst browsing
travel and accommodation options before making a purchase. A typical user could land on the homepage,
search for a flight, then bounce, only to return a few days later to browse hotels and then purchase a
package holiday. The resulting clickstream data captures these intricate journeys and offers valuable
                         insights into users’ behaviour and intentions. By harnessing this data and better understanding users’
                         latent psychological states and preferences, we can significantly enhance personalized experiences by
                         matching them with more relevant content [1, 2, 3, 4, 5] and adapting the experience to better suit their
                         context [4]. For instance, users earlier in their search can be presented with more exploratory content,
                         as compared to users nearer the end of the purchase funnel.
                            However, achieving this level of personalization can be challenging as user journeys often span
                         multiple sessions over an extended period of time, and specific goals, such as completing a purchase,
occur infrequently within this window. This is a particularly pertinent challenge in the tourism
industry, as users will often make only one booking a year and may spend weeks searching and
planning before purchasing months in advance.
                            In this work, we present TRACE (Transformer-based Representations of Attributed Clickstream Event
                         sequences), a novel approach for generating rich user embeddings from live multi-session clickstream
                         data with sparse targets. TRACE employs a multi-task learning (MTL) framework, where a lightweight
                         transformer encoder is trained to predict multiple user engagement targets based on sequences of

                          Workshop on Recommenders in Tourism (RecTour 2024), October 18th, 2024, co-located with the 18th ACM Conference on
                          Recommender Systems, Bari, Italy
wblack@expediagroup.com (W. Black); amanlove@expediagroup.com (A. Manlove); jpennington@expediagroup.com
                          (J. Pennington); amarchini@expediagroup.com (A. Marchini); eilhan@expediagroup.com (E. Ilhan);
                          vmarkeviciute@expediagroup.com (V. Markeviciute)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


attributed clickstream events. By jointly predicting a diverse set of user future engagement signals, the
model is encouraged to learn robust versatile representations. We demonstrate its effectiveness using a
real-world travel e-commerce dataset.
   Numerous works have explored the use of statistical and machine learning techniques on clickstreams
to mine patterns [6, 7, 8] or cluster user behaviors [9, 10, 11, 12] for analytical insights or motivating
recommendations. Comparable works have also investigated neural and MTL approaches to user
modeling, but typically focus on product-level interactions or single session sequences [13, 14, 15, 16].
TRACE instead ingests live clickstream data and addresses more general sequences of site-wide page
views spanning multiple sessions in order to obtain rich user journey representations for real-time
downstream applications. PinnerFormer [17] notably uses a transformer, but relies on previously
learned embeddings and abundant pin-based interactions. TRACE learns directly and exclusively from
the sequence of attributed page views and employs a MTL approach to overcome sparse engagement
signals. Zhuang et al. [18] studied attributes at the sequence level, whereas TRACE is more granular
and addresses attributes at the event-level. Where Rahmani et al. [19] incorporated temporal signals in
sequential recommendations, TRACE instead adopts learnable positional encodings which capture both
event and session positions.
   Overall, the key distinction of TRACE is the use of a transformer-based MTL framework with event-
session position encoding to generate versatile user embeddings from enriched multi-session clickstream
sequences with event-level attributes, which has not been explored in depth by previous research nor
applied to travel e-commerce.


2. Methodology
2.1. Problem Formulation
Each time a user visits a new page it is logged in a clickstream as a page view event 𝑃 ∈ 𝒫, characterized
by a small set of contextual features including the page name and timestamp 𝑡𝑃 . These events collectively
form user sessions 𝒮, representing ordered sequences of the pages visited within defined time intervals.
Formally, a session 𝒮 = {𝑃0 , 𝑃1 , ..., 𝑃𝑁 }, where 𝑃𝑗 denotes the 𝑗th page the user visited in this session,
subject to the condition that
                                      𝑡𝑃𝑗 − 𝑡𝑃𝑗−1 ≤ 𝑇, ∀𝑗 ∈ [1, 𝑁 ].                                      (1)
   Here 𝑇 is a fixed constant, often in the order of magnitude of a few hours. If the difference in
timestamps of two sequential page view events is greater than 𝑇 , the latter is considered to be in a new
session.
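As a concrete illustration of the sessionization rule in Eq. (1), the following sketch (a hypothetical helper, not from the paper) groups a chronologically ordered clickstream into sessions whenever the gap between consecutive events exceeds 𝑇:

```python
from datetime import datetime, timedelta

# Illustrative threshold; the paper only says T is on the order of a few hours.
T = timedelta(hours=2)

def split_into_sessions(events, gap=T):
    """events: list of (page_name, timestamp) tuples, sorted chronologically.

    Starts a new session whenever the gap between consecutive page view
    events exceeds the threshold, per Eq. (1)."""
    sessions, current = [], []
    for page, ts in events:
        if current and ts - current[-1][1] > gap:
            sessions.append(current)  # gap exceeded T: latter event opens a new session
            current = []
        current.append((page, ts))
    if current:
        sessions.append(current)
    return sessions
```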
   Then for each user, we define their journey 𝐽 as the chronological sequence of their sessions, where
𝐽 = {𝒮0 , 𝒮1 , ..., 𝒮𝑘 }, with 𝒮𝑖 representing their 𝑖th session. In this way a journey 𝐽 is the sequence of
pages a user has visited across multiple sessions. We use a corpus of user journeys 𝒥 = {𝐽0 , 𝐽1 , ...}
captured on a large-scale travel e-commerce site over a few months, where |𝒥 | > 50M, and the
vocabulary exceeds 1000 page names.
   Our objective is to predict future engagement of users using their past navigation patterns on the
website. Formally, we want to learn a model 𝑓 : 𝒥 → R𝑑 for some positive integer 𝑑, which summarises
these journeys in rich low dimensional representations that can then be used for downstream machine
learning applications, such as content personalisation and product recommendations. As such the
model 𝑓 must satisfy three main requirements:

   1. Effectively capture the intricate page navigation patterns in users’ journeys which span multiple
      sessions.
   2. Meaningfully distill user journeys into embeddings that can predict engagement across diverse
      tasks and contexts.
   3. Scale efficiently to accommodate high-traffic real-time production environments.
[Figure 1 diagram: an input journey with time features is processed by feature engineering and an
event-position encoder, while future engagement feeds target extraction; the encoded sequence passes
through a transformer encoder block and a feed-forward network to yield the journey embedding,
which is fed to five task-specific dense layers combined in a multi-task loss.]

Figure 1: Overview of the TRACE multi-task transformer architecture.


   To generate our datasets, we split each journey at a random point and designate the pages before as
the input journey and those after for target generation.
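The split scheme above can be sketched as follows (hypothetical helper; the paper does not specify sampling details beyond "a random point"):

```python
import random

def split_journey(pages, rng=random.Random(0)):
    """Split a journey at a random point: the pages before the cut form the
    input journey, those after are used for target generation."""
    cut = rng.randrange(1, len(pages))  # ensure both halves are non-empty
    return pages[:cut], pages[cut:]
```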
   In our proposed approach, TRACE, we train a multi-task transformer. This model takes as input
a sequence of pages in the form of a journey 𝐽 and predicts a cohort of future user engagement
targets. We extract the output of the final layer of the shared backbone of the model as the journey
embedding 𝑒𝐽 ∈ R𝑑 . We hypothesise that if the embeddings are predictive across a cohort of diverse
user engagement tasks, they will capture a generalised understanding of a user’s diverse intents. Figure
1 illustrates the components of TRACE. We address each in more detail below.

2.2. Feature Engineering and Position Encoding
We first crop each input journey, taking up to the 𝐿 most recent page view events, where 𝐿 is chosen
to capture most users’ entire recent page view history.
   Each page view event 𝑃 has a set of categorical attributes, such as a page name and the user’s device
type, which are passed through their own learnable embedding layer to produce a dense representation
in R32 . We engineer two features from the event timestamp; the time interval between consecutive
events and the time elapsed until the most recent event, both logged and standard scaled. Additionally
we encode session ID where events in the 𝑛th most recent session are given value 𝑛. These time-based
features aim to capture planning phases and session gaps common in extended travel user journeys. All
features are standard scaled and concatenated such that each 𝑃 is represented by a vector in R𝐷 , where
𝐷 is on the order of a few hundred.
   We also enumerate the event position, where the 𝑚th most recent event in the entire journey is given
value 𝑚. Then both the event and session position indexes are independently embedded in R𝐷 via
their own learnable layers, and added onto the final feature vector, acting as an event-session position
encoding. This was designed to allow the model to learn representations specific to session and position
combinations, enabling it to capture dynamics both within and across multiple sessions more effectively.
  For input journeys of length less than 𝐿 we pad the 𝐷 features with value 0. As such each journey 𝐽
can now be encoded as some matrix 𝑀𝐽 ∈ R𝐿×𝐷 .
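A minimal numeric sketch of this event-session position encoding, with random matrices standing in for the two learnable embedding tables, toy sizes in place of the real 𝐿 and 𝐷, and assumed caps on position and session indices:

```python
import numpy as np

rng = np.random.default_rng(0)

L, D = 6, 8                  # toy sizes; in the paper D is on the order of a few hundred
MAX_POS, MAX_SESS = 50, 10   # assumed caps on event-position / session indices

# Stand-ins for the two learnable embedding tables.
event_pos_table = rng.normal(size=(MAX_POS, D))
session_pos_table = rng.normal(size=(MAX_SESS, D))

def encode_journey(features, session_ids):
    """features: (n, D) per-event feature vectors, most recent event first.
    session_ids: index of the n-th most recent session for each event."""
    n = features.shape[0]
    event_pos = np.arange(n)  # the m-th most recent event gets position m
    enc = features + event_pos_table[event_pos] + session_pos_table[session_ids]
    padded = np.zeros((L, D))  # journeys shorter than L are zero-padded
    padded[:n] = enc
    return padded
```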

2.3. Model Architecture
In TRACE, we use a transformer encoder architecture to process the input sequences of pages, and train
it in a multi-task regime across five different targets, representing a variety of future user engagement
signals.
    An encoded journey 𝑀𝐽 ∈ R𝐿×𝐷 is passed through a single transformer encoder block, consisting of
a multi-head self-attention layer with 8 heads followed by a position-wise fully connected feed-forward
network (FFN) with an intermediate dimension of 128. We employ dropout and a residual connection
around each of the two sub-layers, followed by layer normalization. Global max pooling is applied to
the output of this encoder block, before being forward passed through a FFN. For an input journey 𝐽
the output of this shared backbone is some 𝑒𝐽 ∈ R𝑑 .
    This tensor 𝑒𝐽 is then passed through five separate task-specific dense layers, each compressing
down to a scalar value, so the final output of the model is five logits ŷ ∈ R⁵. After training
we then remove the five task-specific heads, and take the output of the shared backbone 𝑒𝐽 as the
journey embedding. We deliberately restrict the heads to be simple logistic regression layers. This
approach encourages the shared backbone to capture most of the nuance, ensuring the embeddings are
information-rich and generalizable, as opposed to relying too heavily on the task-specific layers.
    Throughout the architecture we use ReLU activations, except for the final shared dense layer where
sigmoid is used for its desirable bounding property. This ensures normalization of the output embedding,
with our experiments demonstrating no performance loss. We set dimension 𝑑 = 32 for the embedding,
which is well suited for downstream applications.
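This architecture can be sketched in PyTorch. The head count (8), FFN width (128), single encoder block, global max pooling, sigmoid-bounded 𝑑 = 32 embedding, and single-layer task heads follow the text; the input feature width, dropout rate, and hidden sizes are our assumptions:

```python
import torch
import torch.nn as nn

class TRACEBackbone(nn.Module):
    def __init__(self, feat_dim=64, n_heads=8, ffn_dim=128, emb_dim=32, n_tasks=5):
        super().__init__()
        # One transformer encoder block: multi-head self-attention + position-wise
        # FFN, each with dropout, a residual connection and layer normalization.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, dim_feedforward=ffn_dim,
            dropout=0.1, batch_first=True)
        # Shared FFN after pooling; sigmoid bounds the final embedding.
        self.ffn = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, emb_dim), nn.Sigmoid())
        # Five deliberately simple task heads (logistic-regression layers).
        self.heads = nn.ModuleList(nn.Linear(emb_dim, 1) for _ in range(n_tasks))

    def forward(self, x):  # x: (batch, L, feat_dim) encoded journeys M_J
        h = self.encoder(x)
        h = h.max(dim=1).values            # global max pooling over events
        e = self.ffn(h)                    # journey embedding e_J
        logits = torch.cat([head(e) for head in self.heads], dim=-1)
        return e, logits
```

After training, only the backbone up to 𝑒𝐽 would be kept; the heads are discarded.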

2.4. Multi Task Training Regime and Objective
The motivation behind the MTL approach is that by jointly predicting a diverse set of user engagement
signals, the model is encouraged to learn comprehensive and generalizable representations that can be
effectively utilized across a variety of downstream applications, extending beyond just the tasks during
training. Furthermore, by mixing infrequent targets, such as purchases, with more common events
like product searches, the model learns from a stronger signal and, as our results show, performs better
on those sparse tasks. This is especially advantageous in the travel domain for events such as bookings,
as demonstrated in our experiments.
   The model is trained on five binary classification tasks which represent potential future actions of a
user: (PW2) Make any purchase within two weeks; (BN5) Bounce within the next five pages; and the
following, which relate to actions within the rest of the session: (SRP) Make a search for a product;
(PDP) View a product details page; and (VUO) View an upcoming order. For more details on the metrics used in
training see Table 1. Each task head has its own class-weighted binary cross-entropy loss function. The
overall objective is expressed as a linear combination of these task-specific losses. For a journey 𝐽 with
model prediction ŷ and true labels y, the loss is defined as:

                     ℒ(𝐽, y) = − ∑_{k=1}^{5} [𝑤𝑘 · 𝑦𝑘 · log(ŷ𝑘 ) + (1 − 𝑦𝑘 ) · log(1 − ŷ𝑘 )] .           (2)

Class weights 𝑤𝑘 are computed as the reciprocal of the proportion of positive samples for each task 𝑘,
in order to account for task-specific class imbalance. We weight tasks equally to encourage the model
to develop features which generalize across each task.
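Eq. (2) can be checked with a small NumPy sketch (hypothetical helper; each head's logit is mapped through a sigmoid, and 𝑤𝑘 is the reciprocal positive rate of task 𝑘):

```python
import numpy as np

def multi_task_loss(y_true, logits, pos_rate):
    """Class-weighted multi-task BCE from Eq. (2).

    y_true, logits: arrays of shape (batch, 5); pos_rate: per-task
    proportion of positive samples, so w_k = 1 / pos_rate[k]."""
    y_hat = 1.0 / (1.0 + np.exp(-logits))      # sigmoid per task head
    w = 1.0 / pos_rate
    per_task = -(w * y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
    return per_task.sum(axis=1).mean()         # tasks weighted equally
```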
3. Experimental Results
3.1. Downstream Embedding Evaluation
Supervised probing techniques have previously been developed to assess linguistic embeddings [20,
21, 22], although they are not directly suited to this scenario. We instead propose a downstream strategy
for evaluating the richness of information contained within a set of embeddings. After training, we
compute ground truth targets on an unseen test set of historical user journeys. These targets seek
to encapsulate users’ latent psychological states and future intentions. For this, we use the same five
tasks from the TRACE objective in eqn. 2, and introduce three more evaluation tasks that were not
previously seen. These include: (PWS) whether a user converts in the current session; (HOM) returns
to the homepage in the current session; and (RE7) whether they return to the site within seven days.
For more details see Table 1. This captures a broad scope of user outcomes, allowing us to characterize
how well the embeddings generalize.
   We pass the unseen test journeys through the model to obtain a corresponding set of embeddings. Next,
we train XGBoost models [23] on these test set embeddings. We fit one XGBoost model independently to
each unique evaluation task and optimize hyperparameters, such as max_depth and learning_rate, using
K-fold cross validation. The trained XGBoost models are then evaluated, and we compute performance
metrics on their predictions. These metrics serve as proxies for assessing the richness of embeddings
and exemplify downstream model performance across various use cases. Throughout this section, we
evaluate each upstream embedding model by using the same procedure on the same unseen test set.
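The probing procedure can be sketched as follows, with scikit-learn's GradientBoostingClassifier standing in for XGBoost, and synthetic embeddings and targets in place of the real test set (all names and values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))               # stand-in journey embeddings
targets = {                                    # synthetic evaluation targets
    "PW2": (emb[:, 0] > 0).astype(int),
    "HOM": (emb[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(int),
}

scores = {}
for task, y in targets.items():
    # One probe model per evaluation task, tuned with K-fold cross validation.
    search = GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        {"max_depth": [2, 3], "learning_rate": [0.1, 0.3]},
        cv=3, scoring="roc_auc")
    search.fit(emb, y)
    scores[task] = roc_auc_score(y, search.predict_proba(emb)[:, 1])
```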

Table 1
Explanation of targets and how they are used in training and downstream evaluation.
 Target Name      Description of future action of user                  Used in training   Used in Evaluation
     PW2          Make a purchase within two weeks.                     Yes                Yes
     BN5          Bounce within next five pages.                        Yes                Yes
     SRP          Make a search for a hotel or flight within session.   Yes                Yes
     PDP          View a hotel or flight details page within session.   Yes                Yes
     VUO          View an upcoming order within session.                Yes                Yes
     PWS          Make a purchase within session.                       No                 Yes
     HOM          Return to homepage within session.                    No                 Yes
      RE7         Return to site within 7 days.                         No                 Yes



3.2. Comparable Models vs TRACE
We evaluate the quality of TRACE embeddings against several comparable approaches. We express our
comparisons as the mean uplift taken over all evaluation tasks. Results are shown in Table 2.

Table 2
Mean % uplift in XGBoost metrics from myopic baseline across eval. tasks for TRACE and comparable models.
                          Model             AUROC       AUPRC           F1     Acc
                          TRACE                +7.23     +13.58     +2.73     +2.15
                          ST Cohort            +6.38     +10.75     +2.72     +2.06
                          ST Aggregated        +6.34     +10.62     +2.18     +1.73
                          MT LSTM              +1.91      −3.29     −0.29     +0.27
                          Mini-GPT             +1.86      −2.40     −1.13     −0.60


  Myopic Baseline. Our baseline predicts targets using explicit attributes from only the most recent
event. We report all results as percentage uplifts from this. TRACE significantly outperforms this,
highlighting the benefits of mining a user’s full navigation history.
  Single Task Cohort. To demonstrate the effectiveness of TRACE’s MTL approach, we trained a
dedicated single-task transformer for each of the evaluation tasks. These models each produce an
embedding. In Table 3, the TRACE score on a given task is compared to the corresponding dedicated
ST model embedding’s score. Overall results show that the TRACE embeddings outperform every
task-specific equivalent on the 5 tasks TRACE was trained on, and even wins on all but one of the
unseen targets, demonstrating the advantages of the MTL approach.
   Single Task Aggregated. Here we combine the task-specific models’ embeddings into a single
embedding of the same length by taking the mean along each dimension.
   Multi-Task LSTM. We note the demonstrated efficacy of LSTMs in related works [14, 18, 24, 25].
We train a comparable LSTM minimizing the same multi-task objective function shown in (2).
   Mini-GPT. We train a small GPT-style model [26] on the page name sequences, with a single
transformer block and causal masking in the attention layer for next event prediction. Embeddings are
computed from the mean of the transformer block outputs.

Table 3
Mean % uplift in XGBoost AUROC on the eight eval. tasks for TRACE vs dedicated single-task models.
                Model      PW2      BN5      SRP     PDP    VUO     HOM      PWS      RE7
                TRACE      +11.8   +9.32   +6.73    +7.29   +3.99   +7.70   +4.35    +6.52
                ST         +11.2   +8.14   +5.08    +6.25   +3.56   +5.59   +2.99    +8.26



3.3. Ablation Experiments
In Table 4 we list the results of our ablation studies.

3.3.1. Position Encodings
In section 2.2, we discussed our approach to position encoding which is designed to tackle event
sequences over multiple sessions. Static trigonometric position encodings are also widely popular
[13, 27]. We trained a variant including this additional encoding, but found better performance without
it.

3.3.2. Number of Encoders
Here, we vary the number of transformer encoders, ℎ. Our results suggest that ℎ = 1 encoder is
sufficient for capturing the structure of the data, likely because our sequences are relatively short
and have a small vocabulary compared to typical NLP applications [27]. We measured the time taken
for the forward pass in each variant. The experiments were conducted on a system equipped with an
Nvidia T4 Tensor Core GPU (16 GiB VRAM) and an Intel Xeon processor (32 vCPUs, 128 GiB RAM, 2.5
GHz clock speed). We repeat the model call 10,000 times and measure the mean and standard deviation
for various encoder configurations. The results are as follows:

    • ℎ = 1 encoder: 27.5 ms ± 0.1 ms
    • ℎ = 2 encoders: 40.8 ms ± 0.1 ms
    • ℎ = 3 encoders: 54.7 ms ± 0.6 ms
    • ℎ = 4 encoders: 67.7 ms ± 0.4 ms

   Our final model design used a single encoder (ℎ = 1), which is sufficiently fast, taking only 27.5
milliseconds on average for the forward pass. This is well within our self-imposed upper limit of 100 ms
latency, which we find to be practical for real-time applications.
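The latency measurement above can be reproduced with a simple repeat-call harness like the following (stdlib-only sketch; the timed callable stands in for the model's forward pass):

```python
import statistics
import time

def benchmark(forward, n_calls=10_000):
    """Call `forward` n_calls times; return (mean, stdev) latency in ms."""
    samples = []
    for _ in range(n_calls):
        t0 = time.perf_counter()
        forward()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)
```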
Table 4
Mean XGBoost performance uplift % on evaluation tasks for TRACE variants’ embeddings vs myopic baseline.
 Pos. Encodings     Num. Encoders    Chrono. Features         AUROC     AUPRC       F1     Accuracy
 Event-Session†     1†               Timestamp & Session†     +7.23     +13.58     +2.73    +2.15
 Trigonometric      1                Timestamp & Session      +6.64     +12.22     +2.47    +2.17
 Event-Session      2                Timestamp & Session      +6.87     +12.62     +2.65    +2.07
 Event-Session      3                Timestamp & Session      +6.84     +12.76     +2.53    +1.98
 Event-Session      4                Timestamp & Session      +6.84     +12.76     +2.61    +2.06
 Event-Session      1                Timestamp                +6.62     +12.24     +2.45    +1.94
 Event-Session      1                None                     +6.36     +12.00     +2.17    +1.70
                                    †
                                        Final variant used in proposed TRACE model.


3.3.3. Chronological Features
To better understand the specific performance gains from chronological features, we train variants
which omit these. The "Timestamp" variant retains event timestamps but removes session ID, thereby
eliminating explicit information about session continuity. The "None" variant excludes both session
IDs and timestamps, retaining only the sequential order of events. Results demonstrate that including
timestamp features enhances performance, but the greatest improvement arises from incorporating
TRACE’s session encoding on top of this, as used in our final variant. This highlights the effectiveness of
TRACE in exploiting the multi-session structure of the sequences, and its significance for applications
in e-commerce recommendation systems.


[Figure 2 scatter plot; legend: Trips Overview, Bounced, Product Details Page, Booking Form,
Booking Confirmation, Search Results Page, Home Page]

Figure 2: t-SNE projections of TRACE page sequence embeddings, colored according to the next page the users
visited.
3.4. Visualisation of Learned Embeddings
In Fig. 2, we present a visualization of the 32-dimensional embeddings learned by TRACE, reduced to 2
dimensions using t-SNE [28]. This subset of observations was uniformly sampled with respect to users’
next visited page, ensuring equal representation from seven common pages. We note the emergence of
clusters corresponding to the next page visited by users, despite TRACE never being explicitly exposed
to this information during training. Qualitatively, the clusters appear to loosely align with how a user
traverses a website, going from homepage at the bottom progressing through to search and product
pages, before reaching checkout and order confirmation. This underscores TRACE’s ability to identify
and encode patterns in user journeys, showcasing the effectiveness of our approach for generating
information-rich embeddings.
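A projection like Fig. 2 can be produced with scikit-learn's t-SNE; this sketch uses random vectors in place of the real TRACE embeddings and next-page labels:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb = rng.normal(size=(70, 32))           # stand-in 32-dim journey embeddings
next_page = rng.integers(0, 7, size=70)   # uniform sample over 7 common next pages

# Reduce to 2 dimensions; points are then colored by next_page, as in Fig. 2.
proj = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(emb)
```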


4. Conclusion
In this work, we have presented TRACE, a novel approach for generating user embeddings from
multi-session page view sequences through a multi-task learning (MTL) framework, which employs a
lightweight, encoder-only transformer to process real-time cross-session clickstream data. Our experi-
ments on a large-scale, real-world travel e-commerce dataset demonstrate the superior performance
of TRACE embeddings compared to traditional single-task and LSTM-based models, and highlight its
potential for enhancing tourism recommender systems. The learned embeddings exhibit strong results
on a diverse set of targets and demonstrate the ability to generalize well to unseen tasks, underscoring
their utility for applications like content personalization and user modeling. Visualizations reveal that
TRACE can effectively capture meaningful clusters corresponding to latent user intents and behaviors.
   To reinforce the performance of TRACE, we plan to publish results showing its strength on a public
e-commerce user-journey dataset produced by Coveo [29]. Although this dataset is neither multi-session
nor tourism-specific, its user journeys exhibit comparable navigation patterns, which will underscore
the robustness of the TRACE architecture. Additionally, we intend to integrate these embeddings into
our in-house recommendation systems and evaluate their effectiveness in online experiments.
   In the future, we plan to explore the integration of LLMs, as in [30, 31], and investigate hierarchical
models to further improve the model’s representational capacity.


References
 [1] H. Zhao, L. Si, X. Li, Q. Zhang, Recommending complementary products in e-commerce push
     notifications with a mixture model approach, in: Proceedings of the 40th International ACM SIGIR
     Conference on Research and Development in Information Retrieval, 2017, pp. 909–912.
 [2] X. Shen, J. Shi, S. Yoon, J. Katzur, H. Wang, J. Chan, J. Li, Learning to personalize recommendation
     based on customers’ shopping intents, 2023. arXiv:2305.05279.
 [3] I. Kangas, M. Schwoerer, L. J. Bernardi, Recommender systems for personalized user experience:
     Lessons learned at booking.com, in: Proceedings of the 15th ACM Conference on Recommender
     Systems, RecSys ’21, Association for Computing Machinery, New York, NY, USA, 2021, p. 583–586.
     URL: https://doi.org/10.1145/3460231.3474611. doi:10.1145/3460231.3474611.
 [4] W. Black, E. Ilhan, A. Marchini, V. Markeviciute, Adaptex: A self-service contextual bandit platform,
     in: Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23, ACM, 2023.
     URL: http://dx.doi.org/10.1145/3604915.3608870. doi:10.1145/3604915.3608870.
 [5] M. Grbovic, H. Cheng, Real-time personalization using embeddings for search ranking at airbnb,
     in: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery &
     data mining, 2018, pp. 311–320.
 [6] E. Olmezogullari, M. S. Aktas, Representation of click-stream datasequences for learning user
     navigational behavior by using embeddings, in: 2020 IEEE International Conference on Big Data
     (Big Data), IEEE, 2020, pp. 3173–3179.
 [7] S. D. Bernhard, C. K. Leung, V. J. Reimer, J. Westlake, Clickstream prediction using sequential
     stream mining techniques with markov chains, in: Proceedings of the 20th international database
     engineering & applications symposium, 2016, pp. 24–33.
 [8] Y. S. Kim, B.-J. Yum, Recommender system based on click stream data using association rule
     mining, Expert Systems with Applications 38 (2011) 13320–13327.
 [9] G. Wang, X. Zhang, S. Tang, H. Zheng, B. Y. Zhao, Unsupervised clickstream clustering for user
     behavior analysis, in: Proceedings of the 2016 CHI Conference on Human Factors in Computing
     Systems, 2016, pp. 225–236.
[10] Q. Su, L. Chen, A method for discovering clusters of e-commerce interest patterns using
     clickstream data, Electronic Commerce Research and Applications 14 (2015) 1–13.
[11] J. Wei, Z. Shen, N. Sundaresan, K.-L. Ma, Visual cluster exploration of web clickstream data, in:
     2012 IEEE Conference on Visual Analytics Science and Technology (VAST), IEEE, 2012, pp. 3–12.
[12] M. Zavali, E. Lacka, J. De Smedt, Shopping hard or hardly shopping: Revealing consumer segments
     using clickstream data, IEEE Transactions on Engineering Management 70 (2021) 1353–1364.
[13] H. Bai, D. Liu, T. Hirtz, A. Boulenger, Expressive user embedding from churn and recommendation
     multi-task learning, in: Companion Proceedings of the ACM Web Conference 2023, 2023, pp.
     37–40.
[14] M. Alves Gomes, R. Meyes, P. Meisen, T. Meisen, Will this online shopping session succeed?
     Predicting customer’s purchase intention using embeddings, in: Proceedings of the 31st ACM
     International Conference on Information & Knowledge Management, 2022, pp. 2873–2882.
[15] B. Requena, G. Cassani, J. Tagliabue, C. Greco, L. Lacasa, Shopper intent prediction from clickstream
     e-commerce data with minimal browsing information, Scientific Reports 10 (2020) 16983.
[16] C. H. Tan, A. Chan, M. Haldar, J. Tang, X. Liu, M. Abdool, H. Gao, L. He, S. Katariya, Optimizing
     Airbnb search journey with multi-task learning, in: Proceedings of the 29th ACM SIGKDD
     Conference on Knowledge Discovery and Data Mining, KDD ’23, ACM, 2023. URL: http://dx.doi.
     org/10.1145/3580305.3599881. doi:10.1145/3580305.3599881.
[17] N. Pancha, A. Zhai, J. Leskovec, C. Rosenberg, PinnerFormer: Sequence modeling for user
     representation at Pinterest, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge
     Discovery and Data Mining, 2022, pp. 3702–3712.
[18] Z. Zhuang, X. Kong, E. Rundensteiner, J. Zouaoui, A. Arora, Attributed sequence embedding, in:
     2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019, pp. 1723–1728.
[19] M. Rahmani, J. Caverlee, F. Wang, Incorporating time in sequential recommendation models, in:
     Proceedings of the 17th ACM Conference on Recommender Systems, 2023, pp. 784–790.
[20] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. R. Bowman,
     D. Das, et al., What do you learn from context? Probing for sentence structure in contextualized
     word representations, in: 7th International Conference on Learning Representations, ICLR 2019,
     2019.
[21] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in:
     Proceedings of the 2019 Conference of the North American Chapter of the Association for Com-
     putational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019,
     pp. 4129–4138.
[22] J. Hewitt, P. Liang, Designing and interpreting probes with control tasks, in: Proceedings of the
     2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), 2019.
[23] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM
     SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[24] D. Koehn, S. Lessmann, M. Schaal, Predicting online shopping behaviour from clickstream data
     using deep learning, Expert Systems with Applications 150 (2020) 113342.
[25] C. O. Sakar, S. O. Polat, M. Katircioglu, Y. Kastro, Real-time prediction of online shoppers’
     purchasing intention using multilayer perceptron and LSTM recurrent neural networks, Neural
     Computing and Applications 31 (2019) 6893–6908.
[26] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by
     generative pre-training (2018).
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
     Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[28] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research
     9 (2008).
[29] J. Tagliabue, C. Greco, J.-F. Roy, B. Yu, P. J. Chia, F. Bianchi, G. Cassani, SIGIR 2021 e-commerce
     workshop data challenge, 2021. URL: https://arxiv.org/abs/2104.09423. arXiv:2104.09423.
[30] K. Christakopoulou, A. Lalama, C. Adams, I. Qu, Y. Amir, S. Chucri, P. Vollucci, F. Soldo, D. Bseiso,
     S. Scodel, et al., Large language models for user interest journeys, arXiv preprint arXiv:2305.15498
     (2023).
[31] Z. Zhao, W. Fan, J. Li, Y. Liu, X. Mei, Y. Wang, Z. Wen, F. Wang, X. Zhao, J. Tang, et al., Recommender
     systems in the era of large language models (LLMs), IEEE Transactions on Knowledge and Data
     Engineering (2024).