<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Model for Features Sum marization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amazon Music</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germany</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amazon Music</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vito Bellini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhan Shi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huseyin Yurtseven</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Coviello</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Recommender Systems, Features Summarization, Model Consolidation</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Personalization is a key requirement for most customer experiences in a music streaming service, such as landing page, station song sequencing, or search. A common approach is to develop dedicated personalization ML models, one for each experience, that directly integrate with all the personalization signals alongside experiencespecific signals. However, this is not scalable as it is costly for each product team to replicate improvements and experiment with new personalization features. It can result in inconsistencies across the various experiences and slow down adoption of new features in all the models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Streaming services ofer an array of experiences to let customers engage with vast catalogs of music
content, including songs, albums, playlists, stations, etc. This can include visual experiences for
browsing and discovering content, search, automatic stations/radios, daily playlists, and voice controlled
experiences, to name a few. Whereas individual customer experiences each have unique experience
optimization challenges, they all need personalization.</p>
      <p>
        Recommender Systems (RSs) are designed to suggest content that is relevant to the customer and the
specific experience. For instance, on the home page of a streaming service, RSs are used to select and
rank the widgets to display on each slate on the page, and then to select and rank relevant content items
for each widget in a personalized fashion. Recommending relevant content according to the user’s
preferences does not refer only to ranking but it extends to other tasks such top-1 recommendation
where the RS has to select only one item to show to the user. These are two fundamentally diferent
problems that require diferent techniques for efective resolution. For example learning-to-rank
(LTR) algorithm with pair-wise objectives [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] widely adopted in ranking cannot be used for top-1
recommendation use-cases because they require pairs containing both a positive and negative samples
whereas only a single positive or negative sample is available in top-1 use-case. Other diferences
involves the way of debiasing dataset for training ranking or top-1 use-cases. The former typically
involves assumption on click-models [
        <xref ref-type="bibr" rid="ref4">4, 5, 6</xref>
        ] where the position based model is one of the most used
and studied in the literature, and the latter involves propensities estimation to account for selection bias
[7]. Finally, specific metrics to evaluate performances are used according to the use-case. Generally,
ranking metrics are optimized for engagement on the home page while on the top-1 recommender
where the user interface allows only playbacks, listening time is the gold metric to optimize. Beyond the
distinction between ranking and top-1 selection, diferent content types entail diferent recommendation
(E. Coviello)
      </p>
      <p>CEUR</p>
      <p>ceur-ws.org
requirements. For example, a playlist recommender seeks to ofer a diverse yet relevant set of playlists
to maximize aggregated utility, while a song recommender favors coherence to enable uninterrupted
listening. Moreover, recommenders may be tailored to specific customer experience designs, such as
emphasizing trending content or recent user behavior.</p>
      <p>A common architecture for RSs in online streaming services follows a two-steps approach that
consists in candidate generation and then re-ranking [8, 9]. Candidate generation determines what is
eligible for an experience (for example entities matching a search query well enough, entities displayed
in a visual recommender or considered for a voice request, a list of widgets to be arranged on a page,
related artists to be displayed on the artist page), generally eficiently selecting a high coverage set of
candidates from a large set of items (potentially millions). In the subsequent ranking step we want to
optimize the ranking of these entities (or widgets with entities) in a way that is contextually relevant
and personalized and meets the specific customer experience requirement of the particular widget.</p>
      <p>It is common to develop dedicated ranking models one for each experience, trained on trafic from
that experience and directly integrates with all the personalization signals plus experience-specific
signals. However, the issue is that it takes a lot replicated efort for diferent product teams to integrate
and experiment with the same set of personalization features. This can result in slow adoption of new
or improved personalization features, inconsistencies across experiences, and reducing resources to
solve other key challenges peculiar to individual products.</p>
      <p>The observation motivating this work is that, irrespective of the actual experience, we believe that
this ranking step using personalization and contextual signals is essentially the same ML problem
across diferent experiences, with small diferences that can be captured by contextual features and
multi-task training. We hence propose a customer-item relevance score that is derived from a model
that incorporates many personalization signals and is trained from feedback from several customer
experiences. This simplifies the adoption of personalization features and reduces experimentation
costs, since a product team can integrate in their ML ranking models the summary customer-item score
instead of all the feature directly. In this paper, we present initial results validating the approach on a
ranking use-case where we re-rank the widgets on the home page of music streaming service and on a
top-1 use-case where we pick the best content from a limited pool to show in the first widget of the
home page.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Traditional RSs require extensive manual feature engineering, with diferent personalization services
often duplicating efort to integrate the same personalization signals. Recent work has focused on
automating this process to reduce engineering overhead and improve consistency across systems.</p>
      <p>Authors in [10] introduced AutoField, a diferentiable approach for automating feature selection in
deep recommender systems that reduces embedding parameters and inference time while maintaining
performance. This work demonstrates how automated feature selection can create consistent feature
sets across diferent recommendation surfaces. The challenge of feature interaction selection has been
addressed through several automated approaches. Work presented in [11] proposes AutoFIS, a
twostage algorithm for automatic feature interaction selection that identifies important interactions while
removing redundant ones. AutoInt, introduced in [12], uses self-attention mechanisms for automatic
high-order feature interaction learning without manual design. These approaches directly support the
concept of unified feature representations by automatically determining which features and interactions
contribute most to overall relevance across multiple tasks. Our approach difers by providing a relevance
score that eliminates the need for downstream teams to focus on feature selection and engineering.</p>
      <p>While feature engineering automation reduces the complexity of feature selection and engineering,
usually a single model is used to solve a specific task (use-case). Building separate models for each
task or domain creates significant computational and manual maintenance costs as pointed out in
[13, 14]. Eforts to mitigate this problem have been made in order to consolidate or unify models across
use-cases. However, existing approaches require each team to integrate all personalization signals
directly into their models. This poses the risk of feature fragmentation across use-cases and the burden
of feature selection and engineering. Authors in [15] introduced Multi-gate Mixture-of-Experts (MMoE),
it uses multiple expert networks with task-specific gating to balance shared and task-specific learning.
However, this approach still requires each product team to build and maintain their own unified model
with all personalization signals, limiting its ability to reduce engineering overhead across teams. Our
approach extends this concept by training a single model that produces a relevance score consumable by
any downstream model, eliminating the need for teams to implement their own multi-task architectures.</p>
      <p>Multi-domain recommendation refers to a scenario in which the RS provides recommendations
in diferent domains where an overlapping between items across the domain exists. Conceptually
similar, multi-task learning aims to learn from diferent tasks where tasks could be potentially very
diferent to each other (e.g.: home page, search page, etc...) with a partial or non-existent feature overlap
across the tasks. Recent works on Multi-Task and Multi-domain have shown encouraging results in
leveraging knowledge learned from a task applied to another where no much training data is available
or transferring knowledge from one domain to another. In [16] authors proposed to train a single model
on multiple domains to be used for diferent tasks; this allows for compressing information in a unified
model that enables knowledge transfer across domains and tasks. Multi-domain recommendation
has also been studied in the setting of sequential recommendation [17], where characteristics of each
domain are projected into a common domain space leveraged to transform item representation across
domain spaces.</p>
      <p>While significant progress has been made in individual components, automated feature engineering,
unified architectures, and transfer learning, existing work fails to address the fundamental operational
challenge in production personalization systems: the repeated engineering efort required when multiple
teams integrate the same personalization signals. Our work directly addresses these gaps by proposing
a unified customer relevance score that:
• Eliminates repeated feature integration: teams consume a single score instead of integrating
many individual signals;
• Reduces engineering overhead: new personalization features are automatically incorporated into
the unified score without requiring downstream changes;
• Maintains model flexibility: teams can integrate the relevance customer-item score into any
model architecture while preserving experience-specific optimizations;</p>
      <p>Unlike previous unified approaches that require each team to build and maintain their own
multitask models, our approach centralizes the complexity in a single, reusable component while enabling
lightweight integration downstream.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Modeling customer relevance</title>
      <p>In this paper we propose a customer-item relevance score that summarizes a rich set of personalization
features and is trained using feedback from multiple widgets. In this section we describe the relevance
model architecture and training, feature selection and engineering, and ofline evaluation.</p>
      <sec id="sec-3-1">
        <title>3.1. Relevance model</title>
        <p>We want to build a model capable of representing the customer relevance towards a music entity (e.g.,
a song, playlist, album, etc.) and we want the predicted score to be interpretable where a high score
represents high relevance for the customer towards an entity. Learning-to-rank algorithms do not fit
perfectly in this scenario as they predict relative ordering instead of relevance scores. Therefore, we
use reward regression which uses a point-wise loss function.</p>
        <p>We train the relevance model with feedback collected from a variety of diferent widgets. Feedback
  () is binary with   () ∈ {0, 1} and it represents a click occurring on an item  being shown to the
customer for a request at time  . We frame the learning problem in a bandit setting since we observe
feedback only for the items that our production recommenders are actually showing to the customers.
This introduces a form of bias in logged data called selection bias that needs to be taken into account
to avoid training and evaluating a sub-optimal model. We evaluate adaptations of L2 [18] and
BinaryCross-Entropy (BCE) [19] losses that correct for such bias. We chose the former as it outperformed the
latter through ofline evaluations with extensive hyper-parameters optimization (HPO).</p>
        <p>Our model is a multi-layer perceptron (MLP) trained on click feedback to predict a relevance score
of an item for a given customer. The model optimizes the following objective for biased feedback as
presented in [18]:</p>
        <p>∑
=1 
  ()
  ()
ℒ ( ) =
∑ (|̂  )2 +
(  () − 2(|̂  ))
(1)
where  denotes the item for which to compute the score for a ranking request at time  , (|̂
a regression estimate for the item  given its feature representation   ,   () and   () are the observed
click and its expected examination probability respectively; where by examination probability we refer
to the joint probability that the ranker ranks the item  in position  and the user observes  at rank  .</p>
        <p>The relevance model optimizes the objective in Equation 1 using an online dataset collected under the
production model. The dataset contains for each request at time  a list of items to rank where each item
 is represented by a feature vector   ∈ R in the feature space ℱ = ℱ ∪ ℱ
 with () + () = 
.</p>
        <p>The features set ℱ represents temporal and contextual information features such as time, day, device
type and others, while ℱ</p>
        <p>represents customer preferences. The relevance model is trained only on
the subset ℱ which does not depend on contextual information. It is then used in real-time to infer
relevance scores defined as  = (̂ ′) where  ′ represents the item description with features in ℱ
only and (⋅̂) is a non-linear transformation function learnt from the neural network regressor which
optimizes Equation 1. The customer-item afinity score  is then used as input feature to a final ranker.
 )represents</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Features</title>
        <p>Features representing customer preferences serve as input to the customer relevance model. First and
foremost, we have features that directly capture the historical interactions of the customer with the
music items and their artists, covering aspects such as the number of listens, likes, and follows. They
also encompass non-personalized item information such as popularity or whether the item is a new
release.</p>
        <p>In order to capture user-item relationships on items that the customer had not interacted with in
the past, we also leverage a collaborative filtering model that provides user and item embeddings. The
model is trained on past user-item interactions. Denote  as the user-item interaction matrix, where
  indicates number of times user  has listened to item  . The goal is to learn a  -dimensional user
embedding matrix  and item embedding matrix  by solving the problem:</p>
        <p>min
 ∈ℝ  users× , ∈ℝ  items×</p>
        <p>,
− ∑   logsoftmax  (⟨  ,   ⟩)
where logsoftmax</p>
        <p>(⟨  ,   ⟩) ∶= log(exp(⟨  ,   ⟩)/∑ exp(⟨  ,   ⟩)). In order to control computation cost,
we employ random negative sampling to approximate the partition function ∑ exp(⟨  ,   ⟩). We then
pass the embedding dot product ⟨  ,   ⟩ as one input feature to the customer relevance model, where
  ,   are learned embeddings for the user and candidate item, respectively.</p>
        <p>We choose not to include any context features such as time-of-day in the model, this allows
customerrelevance scores to be pre-computed and stored in a low-latency database for later access.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Feature selection</title>
        <p>Feature set ℱ is partitioned into multiple subsets {ℱ  }, where each subset ℱ  represents a particular
facet, such as customer-item historical playback, customer-artist historical playback, customer-item
follows and likes and embedding dot-product. These distinct sets of features are created using separate
batch-compute pipelines, which extract and process data from diferent data warehouses. The resulting
features are then uploaded to a low-latency database to allow low latency access for real-time model
inference. It’s desirable to reduce the number of feature subsets used in production in order to limit
engineering costs, as long as it maintains model performance. For this reason, we focus our eforts on
selecting the best collection of feature subsets.</p>
        <p>To this end, we perform a backward feature elimination on these feature subsets. During each stage,
we first perform HPOs on variants models trained with diferent feature sets  − ℱ   , ℱ  ∈  , separately,
where  is the subsets of features determined by the previous step (at stage 1,  = ⋃ ℱ  ). Then we
train each model  times with random initialization, using the best-performing hyper-parameter. A
feature subset is eliminated only if the average performance of the model without the feature subset
is superior. Our findings revealed that all feature subsets, barring the entity-type indicator features,
made a positive and statistically significant contribution to the model performance. Consequently, we
integrated all feature subsets, except for the entity-type indicator features, into the model.</p>
        <p>Notably, the embedding dot product from the user-item embedding model is often used in lieu of our
proposed customer-relevance model to re-rank candidates. Nevertheless, our ablation study
demonstrates that augmenting the embedding dot product with additional feature set leads to significantly
improved ranking performance. Specifically, the playback feature, which includes user-item play counts
within diferent time windows and negative feedback such as skips and stops, notably contributed to
the model’s performance.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Online data collection and ofline evaluation methods</title>
        <p>Ofline experimentation enables us to evaluate performance and improve model architecture such as
training loss function, features and hyper-parameters, with no direct impact to customer experience.</p>
        <p>To this regard, randomized content ranking data are collected from production systems, to be used
for ofline experiments. The online ranker first ranks candidates, then apply a Fisher–Yates random
shufle to ranked candidates before outputting the first  candidates to be shown to the customer. The
random shufle step is critical to ensure that every candidate has a non-zero probability of being ranked
at any position, thus allowing an unbiased of-policy evaluation through importance-sampling.</p>
        <p>This logged data is then used for ofline experiments. Firstly, the data is split into train/validation/test
sets where we ensure that 1) validation and test sets come from the week after training period ends
2) test and validation sets are based on disjoint sets of customers, to prevent data leakage. Then we
evaluate each ofline-trained policy using the item-position model (IPM) [ 20]. Under the assumption
that the likelihood of a customer interacting with the item only depends the item itself and the ranked
position, and independent from other items being ranked, the IPM estimator is unbiased.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section we describe three A/B tests we ran on production trafic. To evaluate treatment impacts
over control we adopt a Bayesian framework [21, 22, 23, 24] which is broadly adopted in industry due
to its advantages over frequentist methods [25]. In this paper’s Bayesian analysis, we treat the result
as statistically significant when the probability of positive return (PPR) exceeds 66% (i.e., 2:1 odds of
positive return).</p>
      <p>First we ran the model consolidation experiment, followed by the feature summarization experiment,
and then evaluated the model on the top-1 selection task. During the A/B tests, we focused on the
widgets on the home page of a music streaming service, particularly on the problem of personalized
re-ranking of the items in each of the widgets. This includes item-item focused widgets (e.g., More
Like &lt;Artist&gt;, Because You Followed &lt;Artist&gt;), customer-item widgets for various content types (e.g.,
&lt;Songs&gt; For You, &lt;Stations&gt; For You), recent-preference widgets (e.g., You Might Like), as well as Top-1
recommender widget.</p>
      <p>We ran experiments in this order because we expected that consolidating the re-ranking model would
simplify the deployment and evaluation of the customer-item relevance score.</p>
      <p>For each experiment, we run a 2-week A/B test on production trafic, randomly splitting users into
diferent treatment groups. Since model consolidation and feature summarization are general design
choices that apply to all widgets on the page, we report an engagement metric that measures the success
rate of sessions on the home page1, and we do not report widget-specific metrics.</p>
      <sec id="sec-4-1">
        <title>4.1. Consolidated re-ranking model</title>
        <p>In this experiment we used a consolidated model to rank all the content of widgets on the home page
where each widget has a diferent feature space that takes into account the widget specific ranking
strategy based on the widget’s objective which can range from showing popular content or to increase
discoveries. Note that in this setting, the team owning the product has to research and specify which
features are included in the model.</p>
        <p>
          We run a 2 weeks A/B test with the following treatments:
• C – current production system where each widget is ranked by an individual specialized model;
• T1 – consolidated model based on DeepProp [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] shared across all widgets;
• T2 – consolidated model based on [18] shared across all widgets.
        </p>
        <p>We report results in Table 1. We registered a 0.44% improvement in session success rate on the home
page when using the consolidated DeepProp model compared to the control setting with specialized
re-rankers for each widget. Interestingly, we see that a consolidated model based on reward regression
and point-wise cost function (T2) not only under-performs DeepProp, but also the control setting. We
hypothesize this is due to the shared reward regression model failing to capture the diferent reward
distributions associated to the diferent widgets (e.g., widget driving more/less trafic and engagement).
On the contrary, DeepProp does not have this problem, since it learns to predict the relative ranking of
items, and not their actual relevance [26].
1A sessions is considered successful if it results in a high-value-action from the user, including playback over a minimum
threshold of time, a like on an entity, following an artist, and similar.</p>
        <p>T1
T2</p>
        <p>Engagement Metric</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature summarization via customer-item score</title>
        <p>In this experiment we trained a reward regression model as presented in [18] to regress clicks on
widgets’ content in the home page. The model is trained ofline using logged data from production
trafic. The idea is to predict whether an item inside the widget will be clicked by the user. This way, a
product team can simply set its model (e.g., a consolidated DeepProp re-ranker as in Section 4.1) to
consume this score as replacement of many personalization features.</p>
        <p>To validate the idea of features summarization, we designed an experiment in which we replaced a
subset ℱ ⊂ ℱ of personalization features where |ℱ | = 45 with a customer-item afinity score  . We
run a one-week A/B test with these treatments:
• C baseline that uses only features from ℱ0 = ℱ − ℱ ;
• T1 uses all the features in ℱ directly as input to the model;
• T2 replaces ℱ with the afinity score  , thus the features space is ℱ0 ∪  .</p>
        <p>All the treatments use a the consolidated DeepProp re-ranker from Section 4.1 (one per treatment).
The diference is that T1 uses the full feature set ℱ as input, T2 replaces ℱ with a single customer-item
score  , while C drops them altogether.</p>
        <p>T1
T2</p>
        <p>Engagement Metric
2.23%
2.12%</p>
        <p>PPR
100%
100%
A/B test results. Engagement metric reports the relative lift over the baseline in C. T2 summarizes ℱ from T1
and it retains a lift of 95% from T1.</p>
        <p>We report results in Table 2. First, by comparing T1 to C, we see that more personalization features
(i.e., adding ℱ</p>
        <p>to ℱ0) help, leading to a relative lift of 2.23% in sessions success rate. By comparing T2
to C, we also see that using the single customer-item score leads to a substantial improvement over the
baseline of 2.12%. Finally, by comparing these two lifts, we see that including the single customer-item
score retains 95% of the benefit of including the original features directly. In other words, using  instead
of ℱ as input features for a ranking model would consist in 95% of the original lift given by
ℱ .</p>
        <p />
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Content selection in Top-1 Recommender</title>
        <p>Top-1 Recommender is a widget fixed on top position on our home page. In this widget a single content
is selected from a pool of candidates and shown to the customer. Therefore, optimizing this widget is
treated as selection problem instead of a ranking problem like the rest of the home page. To evaluate
impact of a A/B test on this widget we use listening time instead of engagement metric. There are
multiple reasons for this choice, which include the particular customer experience design of this widget.</p>
        <p>In this experiment we leveraged the feature summarization model A/B tested in 4.2 in Top-1
Recommender on the home page. Thus, in this A/B test we evaluate the feature summarization model for
content selection use-case instead of ranking.</p>
        <p>Following the experiment design described in Section 4.2, we run a two-week A/B test with the
following treatments all are powered by a linear model:
• C baseline that uses only features from ℱ0 = ℱ − ℱ ;
• T1 uses all the features in ℱ directly as input to the model;
• T2 replaces ℱ with the afinity score  , thus the features space is ℱ0 ∪  .</p>
        <p>T1
T2</p>
        <p>Interestingly, results in Table 3 show that T2 using  as replacement of ℱ from T1, outperforms the
latter. This result is surprising compared to the A/B test results described in Section 4.2. We argue that
in this case since T1 is powered by a linear model is unable to learn high-order feature interactions
using all the features ℱ directly while T2 has the advantage that  is computed by a non-linear model
which indeed is able to learn high-order feature interactions in ℱ .</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper we presented an approach that can drastically simplify how product teams bring
personalization into their experiences. We proposed a customer-item relevance score that can summarize
many personalization features and is trained on feedback from diferent widgets. Product teams can
feed this score into their recommendation models in place of directly integrating with all the
personalization features. In particular, we saw that on an application on the home page of a music streaming
service that integrating with the customer-items score retains most of the benefits of integrating with
personalization feature directly. We argue this is a benefit for product teams, since it gives the option
to simplify building personalization into a product and focus resources on challenges peculiar to the
product (e.g., sequencing for automated radio stations). We showed that using a consolidated model for
re-ranking the content in all widgets on the home page is beneficial in terms of engagement. We also
experimented in a content selection use-case reporting positive impact on listening time for the top-1
recommender. Whereas in this work we tested the idea on the home page of our streaming service, in
the future we plan to expand experimentation and deployment of the customer-item relevance score in
more parts of our service that need personalization.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.
[5] Z. A. Zhu, W. Chen, T. Minka, C. Zhu, Z. Chen, A novel click model and its applications to online
advertising, in: Proceedings of the third ACM international conference on Web search and data
mining, 2010, pp. 321–330.
[6] T. Joachims, L. Granka, B. Pan, H. Hembrooke, G. Gay, Accurately interpreting clickthrough data
as implicit feedback, in: Acm Sigir Forum, volume 51, Acm New York, NY, USA, 2017, pp. 4–11.
[7] O. Jeunen, Counterfactual inference under thompson sampling, arXiv preprint arXiv:2504.08773
(2025).
[8] D. Agarwal, M. Gurevich, Fast top-k retrieval for model based recommendation, in: Proceedings
of the fith ACM international conference on Web search and data mining, 2012, pp. 483–492.
[9] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert,
B. Livingston, et al., The youtube video recommendation system, in: Proceedings of the fourth
ACM conference on Recommender systems, 2010, pp. 293–296.
[10] Y. Wang, X. Zhao, T. Xu, X. Wu, Autofield: Automating feature selection in deep recommender
systems, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 1977–1986.
[11] B. Liu, C. Zhu, G. Li, W. Zhang, J. Lai, R. Tang, X. He, Z. Li, Y. Yu, Autofis: Automatic feature
interaction selection in factorization models for click-through rate prediction, in: proceedings of
the 26th ACM SIGKDD international conference on knowledge discovery &amp; data mining, 2020, pp.
2636–2645.
[12] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, J. Tang, Autoint: Automatic feature interaction
learning via self-attentive neural networks, in: Proceedings of the 28th ACM international
conference on information and knowledge management, 2019, pp. 1161–1170.
[13] Y. Wang, H. Guo, B. Chen, W. Liu, Z. Liu, Q. Zhang, Z. He, H. Zheng, W. Yao, M. Zhang, et al.,
Causalint: Causal inspired intervention for multi-scenario recommendation, in: Proceedings of the
28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 4090–4099.
[14] Y. Wang, X. Zhao, B. Chen, Q. Liu, H. Guo, H. Liu, Y. Wang, R. Zhang, R. Tang, Plate: A
prompt-enhanced paradigm for multi-scenario recommendations, in: Proceedings of the 46th
International ACM SIGIR Conference on Research and Development in Information Retrieval,
2023, pp. 1498–1507.
[15] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, E. H. Chi, Modeling task relationships in multi-task learning
with multi-gate mixture-of-experts, in: Proceedings of the 24th ACM SIGKDD international
conference on knowledge discovery &amp; data mining, 2018, pp. 1930–1939.
[16] A. Royer, T. Blankevoort, B. Ehteshami Bejnordi, Scalarization for multi-task and multi-domain
learning at scale, Advances in Neural Information Processing Systems 36 (2023) 16917–16941.
[17] J. Hwang, H. Ju, S. Kang, S. Jang, H. Yu, Multi-domain sequential recommendation via domain
space learning, in: Proceedings of the 47th International ACM SIGIR Conference on Research and
Development in Information Retrieval, 2024, pp. 2134–2144.
[18] M. Morik, A. Singh, J. Hong, T. Joachims, Controlling fairness and bias in dynamic learning-to-rank,
in: Proceedings of the 43rd international ACM SIGIR conference on research and development in
information retrieval, 2020, pp. 429–438.
[19] Y. Saito, S. Yaginuma, Y. Nishino, H. Sakata, K. Nakata, Unbiased recommender learning from
missing-not-at-random implicit feedback, in: Proceedings of the 13th International Conference on
Web Search and Data Mining, 2020, pp. 501–509.
[20] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay, Z. Wen, Ofline evaluation of
ranking policies with click models, 2018. arXiv:1804.10488.
[21] S. Chennu, A. Maher, C. Pangerl, S. Prabanantham, J. H. Bae, J. Martin, B. Goswami, Rapid and
scalable bayesian ab testing, in: 2023 IEEE 10th International Conference on Data Science and
Advanced Analytics (DSAA), IEEE, 2023, pp. 1–10.
[22] S. Kamalbasha, M. J. Eugster, Bayesian a/b testing for business decisions, in: International Data</p>
      <p>Science Conference, Springer, 2020, pp. 50–57.
[23] Y. Zhang, C. Qian, W. Bao, Bayesian a/b testing with covariates, in: 2023 IEEE International</p>
      <p>Conference on Data Mining (ICDM), IEEE, 2023, pp. 1553–1558.
[24] A. Deng, Objective bayesian two sample hypothesis testing for online controlled experiments, in:</p>
      <p>Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 923–928.
[25] Q. F. Gronau, A. R. KN, E.-J. Wagenmakers, Informed bayesian inference for the a/b test, Journal
of Statistical Software 100 (2021) 1–39.
[26] H. Li, Learning to rank for information retrieval and natural language processing, Springer Nature,
2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shaked</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Renshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deeds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hullender</surname>
          </string-name>
          ,
          <article-title>Learning to rank using gradient descent</article-title>
          ,
          <source>in: Proceedings of the 22nd international conference on Machine learning</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Unbiased lambdamart: an unbiased pairwise learning-to-rank algorithm</article-title>
          ,
          <source>in: The World Wide Web Conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2830</fpage>
          -
          <lpage>2836</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Takatsu</surname>
          </string-name>
          , I. Zaitsev,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>A general framework for counterfactual learningto-rank</article-title>
          ,
          <source>in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zoeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , B.
          <string-name>
            <surname>Ramsey</surname>
          </string-name>
          ,
          <article-title>An experimental comparison of click position-bias models</article-title>
          ,
          <source>in: Proceedings of the 2008 international conference on web search and data mining</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>