<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yachen Yan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liubo Li</string-name>
          <email>liubo.li@creditkarma.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>CEUR Workshop Proceedings (CEUR-WS.org)</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CTR prediction, Recommendation System, Feature Interaction, Mixture of Experts</institution>
          ,
          <addr-line>Dynamic Inference, Early Exiting, AutoML</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Credit Karma</institution>
          ,
          <addr-line>760 Market Street, San Francisco, California, USA, 94012</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Learning feature interactions is crucial to the success of large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners have proposed a wide range of neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that leverages the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of feature interactions of different types and orders. To further improve prediction accuracy and inference efficiency, we incorporate a dynamic early exiting mechanism for feature interaction depth selection. AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute the prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models. We open-source the TensorFlow implementation of AdaEnsemble:</p>
      </abstract>
      <kwd-group>
        <kwd>Rate Prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>https://github.com/yanyachen/AdaEnsemble.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Click-through rate (CTR) prediction model [1] is an essential component for large-scale search ranking, online advertising, and recommendation systems [2, 3, 4, 5].</p>
      <p>Many deep learning-based models have been proposed for CTR prediction problems in the industry. They have become dominant in learning the useful feature interactions of mixed-type input in an end-to-end fashion [5]. While every existing method focuses on automatically modeling different types of feature interactions, there have been very few attempts to model different types of interactions jointly and dynamically, such that one model architecture can be directly applied to different types of datasets. We believe that an ensemble of various interaction modules that generate heterogeneous feature interactions can complement the non-overlapping knowledge learned through each interaction learning approach.</p>
      <p>With the aim of accomplishing the stated objective, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) hierarchical architecture to ensemble different interaction learning modules and dynamically select the optimal feature interaction depth. AdaEnsemble encompasses SparseMoE layers and the Depth Selecting Controller. Within each SparseMoE layer of AdaEnsemble, there is a collection of interaction learning experts, and a trainable gating network determines a sparse combination of these experts to use for each example. Within the Depth Selecting Controller, a trainable gating network chooses the feature interaction depth for each example and recursively propagates feature interaction representations through SparseMoE layers to the corresponding depth for computing the prediction. Through these conditional computation mechanisms, we enlarge the model capacity exponentially while maintaining computational efficiency.</p>
      <p>The main contributions of this paper can be summarized as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>We designed a novel model architecture called AdaEnsemble to ensemble various types of feature interaction learning modules by Sparsely-Gated Mixture-of-Experts (SparseMoE). By utilizing MoE layers recursively with residual connections and normalization, AdaEnsemble can model different types of interactions jointly and dynamically.</p>
        </list-item>
        <list-item>
          <p>We designed an efficient and effective Depth Selecting Controller to adaptively choose the optimal feature interaction depth. By utilizing this controller, AdaEnsemble can dynamically determine the layer for early exiting to improve prediction accuracy and inference efficiency.</p>
        </list-item>
        <list-item>
          <p>We applied a bi-level optimization algorithm for iteratively training the modeling network and gating network.</p>
        </list-item>
      </list>
      <p>∗Corresponding author.</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License</p>
      <sec id="sec-2-6">
        <title>2. Proposed Model: AdaEnsemble</title>
        <fig>
          <caption>
            <p>AdaEnsemble architecture: categorical and bucketized numeric features pass through an embedding layer to form the input feature map, which feeds stacked Sparse MoE layers (1st, 2nd, ..., l-th), each followed by Add &amp; Normalize; a Depth Selecting Network chooses among the 1st to l-th estimators. In this example, the depth selecting network selects the 2nd layer to exit and compute the final prediction; therefore the deeper layers are not activated and are plotted translucent in the figure.</p>
          </caption>
        </fig>
        <fig>
          <caption>
            <p>Sparse MoE layer: the Sparse Gating Network maps the input embedding to non-zero gating values and non-zero indices, which the Dispatcher uses to route the input to Expert 1, Expert 2, ..., Expert n.</p>
          </caption>
        </fig>
        <fig>
          <caption>
            <p>Noisy gating network: the input embedding passes through noise injection, a feed-forward network, and L2 normalization; the expert embeddings (e1 to e4) are L2-normalized; their cosine similarity gives the routing score, followed by re-scaling with a learnable temperature, Top-K, and softmax.</p>
          </caption>
        </fig>
      </sec>
      <sec id="sec-2-7">
        <title>2.1. Feature Interaction Experts</title>
        <p>We considered several types of feature interaction experts in our model: Dense Layer, Convolution Layer, Multi-Head Self-Attention Layer, Polynomial Interaction Layer, and Cross Layer. Essentially, any feature interaction learning layer can be included in our framework, and the residual connection and normalization will be applied to their ensembles. Now we introduce these feature interaction experts included in our framework. Note that our proposed framework is general and can use arbitrary feature interaction modules. The potential feature interaction experts that can be used are not limited to the aforementioned.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.2. Sparse Mixture-of-Experts Layer</title>
        <p>The Sparse Mixture-of-Experts layer ensembles the aforementioned heterogeneous feature interaction experts and consists of several other essential parts that allow the overall model to be trained stably.</p>
      </sec>
      <sec id="sec-2-9">
        <title>2.2.1. Noisy Gating Network</title>
        <p>The gating network essentially computes the gating value for selecting experts for each input embedding and weighting the output embedding of each expert.</p>
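        <p>The input embedding of the gating network, <inline-formula><tex-math>x_0</tex-math></inline-formula>, is first processed by a two-layer feed-forward network, i.e. a dimension reduction layer with a reduction ratio [6], a non-linear activation function, and then a dense layer projecting to the hidden state <inline-formula><tex-math>h \in \mathbb{R}^{d}</tex-math></inline-formula>. Additionally, we apply multiplicative jitter noise to introduce exploration and promote load balancing between different experts.</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[h = \mathrm{FFN}\bigl(x_0 \circ \mathrm{RandomUniform}(1.0 - \mathit{eps},\ 1.0 + \mathit{eps})\bigr)]]></tex-math>
        </disp-formula>
        <p>After projecting the input embedding to the hidden state <inline-formula><tex-math>h \in \mathbb{R}^{d}</tex-math></inline-formula>, we apply L2 normalization to both the hidden state <inline-formula><tex-math>h</tex-math></inline-formula> and the learnable expert embeddings <inline-formula><tex-math>e_i \in \mathbb{R}^{d}</tex-math></inline-formula>, where <inline-formula><tex-math>i</tex-math></inline-formula> is the index of the expert. Then, we compute the cosine similarity between the hidden state and the expert embedding as the initial routing score. Here we encourage the uniformity of representations to avoid the dominated experts issue.</p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[s_i = \frac{h \cdot e_i}{\lVert h \rVert \, \lVert e_i \rVert}]]></tex-math>
        </disp-formula>
        <p>Finally, we use a learnable temperature scalar <inline-formula><tex-math>\tau</tex-math></inline-formula> to rescale the routing scores, which the cosine similarity confines to the range [−1, +1].</p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[s_i = s_i / \tau]]></tex-math>
        </disp-formula>
        <p>For the computed routing score <inline-formula><tex-math>s</tex-math></inline-formula>, we only keep the top <inline-formula><tex-math>k</tex-math></inline-formula> values and set the rest to <inline-formula><tex-math>-\infty</tex-math></inline-formula>, so that the corresponding softmax gating values equal 0. The <inline-formula><tex-math>i</tex-math></inline-formula>-th element of the output vector of the gating network is</p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math><![CDATA[G(x_0)_i = \frac{\exp\bigl(\mathrm{TopK}(s, k)_i\bigr)}{\sum_{j=1}^{N} \exp\bigl(\mathrm{TopK}(s, k)_j\bigr)}]]></tex-math>
        </disp-formula>
        <p>where</p>
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math><![CDATA[\mathrm{TopK}(s, k)_i = \begin{cases} s_i & \text{if } s_i \text{ is in the top } k \text{ elements of } s \\ -\infty & \text{otherwise.} \end{cases}]]></tex-math>
        </disp-formula>
        <p>These gating values are used by the sparse dispatcher for routing examples to different experts. This is the essential step for achieving sparsity in our Sparse Mixture-of-Experts layer. Note that <inline-formula><tex-math>G(x_0)</tex-math></inline-formula> is differentiable regardless of the value of <inline-formula><tex-math>k</tex-math></inline-formula> [7].</p>
      </sec>
      <sec id="sec-2-10">
        <title>2.2.2. Annealing Top-K Gating</title>
        <p>We also introduce an annealing mechanism to the Top-K operation. We start with a <inline-formula><tex-math>k</tex-math></inline-formula> value equal to the number of experts, which means that we start with a fully dense gate that routes examples to all experts. Then we gradually decrease <inline-formula><tex-math>k</tex-math></inline-formula> and route examples to fewer experts, to adaptively make the structure sparser and continuously improve the computation efficiency.</p>
        <p>By annealing the <inline-formula><tex-math>k</tex-math></inline-formula> value, we start training our architecture with a dense structure, which allows us to thoroughly learn all experts and adjust the gating network in the correct direction at the beginning. Therefore, we can control the sparsity of our architecture during training, which not only accelerates the convergence of the gating network but also benefits the experts’ specialization in learning particular types of feature interactions.</p>
      </sec>
      <sec id="sec-2-11">
        <title>2.2.3. Sparse Dispatcher</title>
        <p>The sparse dispatcher [8, 7, 9] takes the examples’ gating values and the experts as input. It first dispatches the examples to the experts corresponding to the non-zero gating values, and lets the experts generate the output embeddings. The output of the Sparse Mixture-of-Experts layer is the linearly weighted combination of expert output embeddings by the non-zero gating values.</p>
        <disp-formula id="eq6">
          <label>(6)</label>
          <tex-math><![CDATA[y = \sum_{i \in T} G(x_0)_i \, E_i(x_0)]]></tex-math>
        </disp-formula>
        <p>where <inline-formula><tex-math>T</tex-math></inline-formula> denotes the selected non-zero indices. We save computation based on the sparsity of <inline-formula><tex-math>G(x_0)</tex-math></inline-formula>. Wherever <inline-formula><tex-math>G(x_0)_i = 0</tex-math></inline-formula>, we do not pass the example to the corresponding expert and do not need to compute the expert output embedding <inline-formula><tex-math>E_i(x_0)</tex-math></inline-formula>.</p>
      </sec>
      <sec id="sec-2-14">
        <title>2.2.4. Load Distribution Regularization</title>
        <p>As stated in previous research [8, 7, 9, 10], the gating network tends to select only a few experts if no regularization is applied, especially when certain experts are easier to train than others. This phenomenon is self-reinforcing, since the selected experts are trained more and will be selected more frequently by the gating network. Therefore, the load balancing loss is applied to enforce uniform expert routing.</p>
        <disp-formula id="eq7">
          <label>(7)</label>
          <tex-math><![CDATA[\mathcal{L}_{\mathrm{balance}} = \alpha \cdot N \cdot \sum_{j=1}^{N} f_j \cdot P_j]]></tex-math>
        </disp-formula>
        <disp-formula id="eq8">
          <label>(8)</label>
          <tex-math><![CDATA[f_j = \frac{1}{B} \sum_{x \in \mathcal{B}} \mathbb{1}\{\operatorname{argmax}\, p(x) = j\}]]></tex-math>
        </disp-formula>
        <disp-formula id="eq9">
          <label>(9)</label>
          <tex-math><![CDATA[P_j = \frac{1}{B} \sum_{x \in \mathcal{B}} p_j(x)]]></tex-math>
        </disp-formula>
        <p>where <inline-formula><tex-math>B</tex-math></inline-formula> is the batch size, <inline-formula><tex-math>N</tex-math></inline-formula> is the number of experts, <inline-formula><tex-math>f_j</tex-math></inline-formula> is the fraction of examples dispatched to expert <inline-formula><tex-math>j</tex-math></inline-formula>, <inline-formula><tex-math>P_j</tex-math></inline-formula> is the average of the router probability allocated for expert <inline-formula><tex-math>j</tex-math></inline-formula>, and <inline-formula><tex-math>\alpha</tex-math></inline-formula> is the coefficient for the regularization term.</p>
        <p>While the default load balancing loss is applicable and effective when experts are of the same type, AdaEnsemble uses heterogeneous feature interaction experts.</p>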
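        <p>As a rough illustration of the gating and load balancing computations described above, the following pure-Python sketch traces examples through cosine routing scores, temperature re-scaling, and Top-K softmax gating, and then evaluates the load balancing loss on the resulting batch of gate vectors. It is not taken from the released TensorFlow implementation: the function names (<monospace>jitter</monospace>, <monospace>topk_gate</monospace>, <monospace>load_balance_loss</monospace>), the toy embeddings, the omission of the feed-forward projection, and the use of the gate values as the router probabilities are all assumptions made for brevity.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def jitter(x, eps, rng):
    """Multiplicative jitter noise: x * U(1 - eps, 1 + eps), element-wise."""
    return [a * rng.uniform(1.0 - eps, 1.0 + eps) for a in x]


def l2_normalize(v):
    """Scale v to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0.0 else list(v)


def topk_gate(h, expert_embeddings, k, tau=1.0):
    """Gate values for one example: cosine score / tau, then Top-K softmax."""
    h_hat = l2_normalize(h)
    scores = [
        sum(a * b for a, b in zip(h_hat, l2_normalize(e))) / tau
        for e in expert_embeddings
    ]
    # Indices of the k largest scores; all others behave like -inf (gate = 0).
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in kept)  # subtracted for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in kept}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


def load_balance_loss(gates, alpha):
    """alpha * N * sum_j f_j * P_j over a batch of gate vectors."""
    batch_size, n_experts = len(gates), len(gates[0])
    f = [0.0] * n_experts  # fraction of examples whose top expert is j
    p = [0.0] * n_experts  # mean gate value assigned to expert j (assumption)
    for g in gates:
        f[max(range(n_experts), key=lambda j: g[j])] += 1.0 / batch_size
        for j in range(n_experts):
            p[j] += g[j] / batch_size
    return alpha * n_experts * sum(fj * pj for fj, pj in zip(f, p))


if __name__ == "__main__":
    rng = random.Random(0)
    experts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy expert embeddings
    batch = [[1.0, 0.2], [0.1, 1.0], [0.9, 0.8], [-1.0, 0.3]]
    # The FFN projection is skipped here: the jittered input is used as h.
    gates = [topk_gate(jitter(x, 0.01, rng), experts, k=2) for x in batch]
    for g in gates:
        print([round(v, 3) for v in g])
    print("balance loss:", round(load_balance_loss(gates, alpha=0.01), 5))
```
]]></preformat>
        <p>Under these assumptions, hard and perfectly uniform routing gives a loss equal to the coefficient itself, while routing every example entirely to a single expert gives the coefficient times the number of experts, which is the collapse the regularizer is meant to penalize.</p>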
        <p>With heterogeneous experts, the optimal load for each expert is not uniform. Therefore, we apply the following load distribution regularization to encourage the expected load distribution of heterogeneous experts.</p>
        <disp-formula id="eq10">
          <label>(10)</label>
          <tex-math><![CDATA[\mathcal{L}_{\mathrm{distribution}} = \alpha \cdot \sum_{j=1}^{N} \frac{f_j \cdot P_j}{T_j}]]></tex-math>
        </disp-formula>
        <p>where the hyper-parameter <inline-formula><tex-math>T_j</tex-math></inline-formula> is the expected load fraction of examples dispatched to expert <inline-formula><tex-math>j</tex-math></inline-formula>, and naturally <inline-formula><tex-math>\sum_{j=1}^{N} T_j = 1</tex-math></inline-formula>. In practice, <inline-formula><tex-math>\alpha</tex-math></inline-formula> should be sufficiently large to prevent the self-reinforcing expert selection phenomenon at the initial training stage while not overwhelming the primary LogLoss objective.</p>
      </sec>
      <sec id="sec-2-15">
        <title>2.3. Depth Selecting Controller</title>
        <sec id="sec-2-15-1">
          <title>2.3.1. Depth Selecting Network</title>
          <p>The Depth Selecting Network has essentially the same configuration as the aforementioned Noisy Gating Network for the SparseMoE layer. We denote it by <inline-formula><tex-math>G^{depth}(x_0)</tex-math></inline-formula>. The outputs of <inline-formula><tex-math>G^{depth}(x_0)</tex-math></inline-formula> are <inline-formula><tex-math>[g_1^{depth}, g_2^{depth}, \cdots, g_L^{depth}]</tex-math></inline-formula>, indicating each example’s optimal forward propagation depth. The <inline-formula><tex-math>l</tex-math></inline-formula>-th unit denotes the probability of selecting the <inline-formula><tex-math>l</tex-math></inline-formula>-th MoE layer to exit. The optimal depth is automatically selected as the one corresponding to the largest probability. In contrast to the expert selection, when choosing the optimal depth of each example for dynamic inference, we only keep the top-1 depth index from the output units of the Depth Selecting Network. Note that we can also apply the load distribution regularization to encourage the examples’ propagation depth distribution.</p>
        </sec>
        <sec id="sec-2-15-2">
          <title>2.3.2. Dynamic Propagation Mechanism</title>
          <p>With the depth gates <inline-formula><tex-math>g_l^{depth} \in [0, 1]</tex-math></inline-formula> computed by the Depth Selecting Network, we obtain the optimal depth for each example. If <inline-formula><tex-math>g_l^{depth} = 0</tex-math></inline-formula>, we recursively forward propagate examples through MoE layers and compute deeper representations until <inline-formula><tex-math>g_l^{depth} = 1</tex-math></inline-formula> or the final layer is reached. If <inline-formula><tex-math>g_l^{depth} = 1</tex-math></inline-formula>, the forward propagation is stopped and the corresponding <inline-formula><tex-math>l</tex-math></inline-formula>-th estimator computes the prediction. To efficiently process a batch of examples with different optimal propagation depths, we utilize Algorithm 1 for dynamic forward propagation.</p>
          <p>Algorithm 1: Dynamic Propagation</p>
        </sec>
      </sec>
      <sec id="sec-2-16">
        <title>2.4. Training</title>
        <sec id="sec-2-16-1">
          <title>2.4.1. Training Objective</title>
          <p>The loss function is a linearly weighted combination of the Log Loss and the auxiliary load distribution regularization,</p>
          <disp-formula id="eq11">
            <label>(11)</label>
            <tex-math><![CDATA[\mathcal{L} = \mathcal{L}_{\mathrm{LogLoss}} + \lambda_1 \mathcal{L}^{\mathrm{expert}}_{\mathrm{distribution}} + \lambda_2 \mathcal{L}^{\mathrm{depth}}_{\mathrm{distribution}}]]></tex-math>
          </disp-formula>
          <p>where <inline-formula><tex-math>\lambda_1</tex-math></inline-formula> and <inline-formula><tex-math>\lambda_2</tex-math></inline-formula> are the coefficients for weighting the load distribution regularization of experts and depth.</p>
        </sec>
        <sec id="sec-2-16-2">
          <title>2.4.2. Bi-Level Optimization</title>
          <p>The optimization task for training AdaEnsemble is to jointly optimize the parameters <inline-formula><tex-math>w</tex-math></inline-formula>, which stand for the expert layers and estimator layers, and <inline-formula><tex-math>\theta</tex-math></inline-formula>, which represent the expert gating network and depth selecting network. Inspired by DARTS [11], we apply a bi-level optimization algorithm for training our model, where <inline-formula><tex-math>\theta</tex-math></inline-formula> is the upper-level parameters and <inline-formula><tex-math>w</tex-math></inline-formula> is the lower-level parameters. We apply Algorithm 2 to optimize <inline-formula><tex-math>\theta</tex-math></inline-formula> and <inline-formula><tex-math>w</tex-math></inline-formula> alternately and iteratively.</p>
        </sec>
      </sec>
      <sec id="sec-2-17">
        <title>2.5. Discussion on AdaEnsemble</title>
        <p>The combination of sparse expert routing within each SparseMoE layer and early exiting by the depth selecting controller brings two merits to the proposed model. On one hand, the stacked SparseMoE layers allow the proposed model to leverage the exponential combinations of sparsely gated experts, which brings in more predictive power. On the other hand, both the expert routing mechanism and the depth selecting mechanism enable the proposed model to learn the instance-aware expert combination and instance-aware model depth. These two conditional computation mechanisms improve the efficiency during model serving. In the next section, we illustrate the effectiveness of the proposed model through experimental studies.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In this section, we focus on evaluating the effectiveness of our proposed models and seeking answers to the following research questions:</p>
      <list list-type="bullet">
        <list-item>
          <p>Q1: How does our proposed AdaEnsemble perform compared to each baseline in the CTR prediction problem?</p>
        </list-item>
        <list-item>
          <p>Q2: How does the SparseMoE layer perform compared to DenseMoE, which utilizes all feature interaction experts? Does the cascade of SparseMoE layers effectively capture different types of feature interactions?</p>
        </list-item>
        <list-item>
          <p>Q3: How does the depth selecting controller perform compared to a full-depth network? Does the early exiting mechanism achieve both effectiveness and efficiency?</p>
        </list-item>
      </list>
      <sec id="sec-3-1">
        <title>3.1. Experiment Setup</title>
        <p>3.1.1. Datasets</p>
        <p>We evaluate our proposed model on three public real-world datasets widely used for research.</p>
        <list list-type="order">
          <list-item>
            <p>Criteo.<sup>1</sup> The Criteo dataset is from a Kaggle competition in 2014. Criteo AI Lab officially released this dataset afterwards for academic use.</p>
          </list-item>
          <list-item>
            <p>Avazu.<sup>2</sup> The Avazu dataset is from a Kaggle competition in 2015. Avazu provided 10 days of click-through data.</p>
          </list-item>
          <list-item>
            <p>iPinYou.<sup>3</sup> The iPinYou dataset is from the iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition in 2013. We follow the data processing steps of [12].</p>
          </list-item>
        </list>
        <p>3.1.2. Competing Models</p>
        <p>We compare AdaEnsemble with the following models: LR (Logistic Regression) [13, 2], FM (Factorization Machine) [14], DNN (Multilayer Perceptron), Wide &amp; Deep [4], DeepCrossing [15], DCN (Deep &amp; Cross Network) [16], PNN (with both inner product layer and outer product layer) [17, 18], DeepFM [19], xDeepFM [20], AutoInt [21], FiBiNET [22], xDeepInt [23] and DCN V2 [24]. Some of these models are state-of-the-art models for the CTR prediction problem and are widely used in the industry.</p>
      </sec>
      <sec id="sec-3-2">
        <p><sup>1</sup> https://www.kaggle.com/c/criteo-display-ad-challenge</p>
        <p><sup>2</sup> https://www.kaggle.com/c/avazu-ctr-prediction</p>
        <p><sup>3</sup> http://contest.ipinyou.com/</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.2. Model Performance Comparison (Q1)</title>
        <p>The overall performance of different model architectures is listed in Table 1. We have the following observations in terms of model effectiveness:</p>
        <list list-type="bullet">
          <list-item>
            <p>Models with more than two feature interaction modules generally perform better than models with only a single feature interaction module, indicating the importance of jointly learned feature interaction representation.</p>
          </list-item>
          <list-item>
            <p>The optimal feature interaction depth varies by feature interaction module type and when combined with different module types, indicating the necessity of dynamically combining different feature interactions at different interaction depths.</p>
          </list-item>
          <list-item>
            <p>AdaEnsemble achieves the best prediction performance among all models. Our model’s superior performance could be attributed to the fact that AdaEnsemble jointly models various types of feature interactions by adaptively selecting the combination of feature interaction experts and determining the optimal feature interaction depth with the controller.</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-3-4">
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Performance Comparison of SparseMoE and DenseMoE on Criteo Dataset.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th></tr>
            </thead>
            <tbody>
              <tr><td>SparseMoE (k=1)</td></tr>
              <tr><td>SparseMoE (k=2)</td></tr>
              <tr><td>SparseMoE (k=3)</td></tr>
              <tr><td>SparseMoE (k=4)</td></tr>
              <tr><td>DenseMoE</td></tr>
              <tr><td>Ensemble</td></tr>
              <tr><td>Dense Expert Only</td></tr>
              <tr><td>Cross Expert Only</td></tr>
              <tr><td>Polynomial Expert Only</td></tr>
              <tr><td>CNN Expert Only</td></tr>
              <tr><td>MHSA Expert Only</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <list list-type="bullet">
          <list-item>
            <p>Utilizing different feature interaction experts results in better performance than single expert models in general. The SparseMoE layer achieves a better tradeoff between accuracy and computation efficiency.</p>
          </list-item>
          <list-item>
            <p>When utilizing more than one expert per SparseMoE layer, even though only a subset of feature interaction experts is selected, SparseMoE can still effectively capture the most significant feature interactions of different depths and maintain similar performance to the DenseMoE layer and superior performance to the ensemble, while including more experts also results in more computational cost.</p>
          </list-item>
          <list-item>
            <p>Figure 4 shows that the SparseMoE layers dynamically utilize a different combination of experts across different layers to capture the complex feature interactions effectively. That also explains why fusing different feature interactions is crucial for prediction accuracy.</p>
          </list-item>
        </list>
        <fig id="fig4">
          <label>Figure 4</label>
          <caption>
            <p>Each vertical axis represents a SparseMoE layer and the proportion of an expert being used. The horizontal flows indicate the dependency and relation of different SparseMoE layers’ expert selection. The proportion of the expert combination is represented by the width of the flows and further clustered into different colors.</p>
          </caption>
        </fig>
      </sec>
      <sec id="sec-3-5">
        <title>3.4. Depth Selection Analysis (Q3)</title>
        <p>We compare the model performance of AdaEnsemble with and without the depth selecting controller to investigate whether the model achieves harmony between prediction accuracy and inference efficiency with respect to depth selection. The performance comparison is listed in Table 3.</p>
        <p>With the incorporation of the depth selecting controller, we observe that our model significantly improves training complexity and inference efficiency (measured in FLOPs) while achieving slightly better performance than the full-depth model. We think the full-depth model overfits more easily than AdaEnsemble, thus resulting in slightly worse accuracy. AdaEnsemble with the depth selecting controller adaptively selects the feature interaction depth on a per-example basis, thus achieving better trade-offs between prediction accuracy and inference efficiency. The distribution of per-example forward propagation depth is listed in Table 4.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Performance Comparison of AdaEnsemble with and without controller on Criteo Dataset.</p>
          </caption>
          <table>
            <thead>
              <tr><th></th><th>AUC</th><th>LogLoss</th><th>FLOPs</th></tr>
            </thead>
            <tbody>
              <tr><td>w/ controller</td><td>0.8132</td><td>0.4394</td><td>6.02M</td></tr>
              <tr><td>w/o controller</td><td>0.8128</td><td>0.4396</td><td>8.58M</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>AdaEnsemble Propagation Depth on Criteo Dataset.</p>
          </caption>
          <table>
            <thead>
              <tr><th></th><th>Layer 1</th><th>Layer 2</th><th>Layer 3</th><th>Layer 4</th></tr>
            </thead>
            <tbody>
              <tr><td>Fraction</td><td>6.53%</td><td>19.36%</td><td>66.43%</td><td>7.68%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-6">
        <title>4. Conclusion</title>
        <p>In this paper, we present a novel model architecture for click-through rate (CTR) modeling by introducing the Sparsely-Gated Mixture-of-Experts (SparseMoE) hierarchical architecture for ensemble learning of heterogeneous feature interaction experts. A Depth Selecting Controller component is integrated into the model to dynamically select the optimal feature interaction depth for each instance. The utilization of these two conditional computation mechanisms results in a model architecture that can select a subset of feature interaction experts and the optimal interaction depth for each instance simultaneously, leading to an exponential increase in model capacity without incurring a corresponding increase in inference cost. Our extensive experiments demonstrate the superiority of our approach in terms of effectiveness and efficiency.</p>
        <p>Future work will be dedicated to exploring the potential for extending our method to the modeling of user behavior sequences. By learning a sparse ensemble of models, we anticipate that our approach can dynamically select the optimal expert for different behaviors in the context of user behavior sequence data.</p>
      </sec>
      <sec id="sec-3-7">
        <title>References</title>
        <p>[1] M. Richardson, E. Dominowska, R. Ragno, Predicting clicks: estimating the click-through rate for new ads, in: Proceedings of the 16th international conference on World Wide Web, ACM, 2007, pp. 521–530.</p>
        <p>[2] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al., Ad click prediction: a view from the trenches, in: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2013, pp. 1222–1230.</p>
        <p>[3] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al., Practical lessons from predicting clicks on ads at facebook, in: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ACM, 2014, pp. 1–9.</p>
        <p>[4] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide &amp; deep learning for recommender systems, in: Proceedings of the 1st workshop on deep learning for recommender systems, ACM, 2016, pp. 7–10.</p>
        <p>[5] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: A survey and new perspectives, ACM Computing Surveys (CSUR) 52 (2019) 5.</p>
        <p>[6] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.</p>
        <p>[7] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.</p>
        <p>[8] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, arXiv preprint arXiv:1701.06538 (2017).</p>
        <p>[9] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, W. Fedus, Designing effective sparse expert models, arXiv preprint arXiv:2202.08906 (2022).</p>
        <p>[10] Z. Chi, L. Dong, S. Huang, D. Dai, S. Ma, B. Patra, S. Singhal, P. Bajaj, X. Song, F. Wei, On the representation collapse of sparse mixture of experts, arXiv preprint arXiv:2204.09179 (2022).</p>
        <p>[11] H. Liu, K. Simonyan, Y. Yang, Darts: Differentiable architecture search, arXiv preprint arXiv:1806.09055 (2018).</p>
        <p>[12] W. Zhang, S. Yuan, J. Wang, X. Shen, Real-time bidding benchmarking with ipinyou dataset, arXiv preprint arXiv:1407.7073 (2014).</p>
        <p>[13] H. B. McMahan, Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization (2011).</p>
        <p>[14] S. Rendle, Factorization machines, in: 2010 IEEE International Conference on Data Mining, IEEE, 2010, pp. 995–1000.</p>
        <p>[15] Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, J. Mao, Deep crossing: Web-scale modeling without manually crafted combinatorial features, in: Proceedings</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>of the 22nd ACM SIGKDD International Confer-</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2016</year>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>262</lpage>
          . [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fu</surname>
          </string-name>
          , G. Fu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Deep &amp; cross
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>of the ADKDD'17</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2017</year>
          , p.
          <fpage>12</fpage>
          . [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>response prediction</article-title>
          ,
          <source>in: 2016 IEEE 16th Interna-</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <year>2016</year>
          , pp.
          <fpage>1149</fpage>
          -
          <lpage>1154</lpage>
          . [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niu</surname>
          </string-name>
          , H. Guo,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>tems (TOIS) 37 (</article-title>
          <year>2018</year>
          )
          <article-title>5</article-title>
          . [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Deepfm: a</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>prediction</article-title>
          ,
          <source>arXiv preprint arXiv:1703.04247</source>
          (
          <year>2017</year>
          ). [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>ceedings of the 24th ACM SIGKDD International</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>ing</source>
          , ACM,
          <year>2018</year>
          , pp.
          <fpage>1754</fpage>
          -
          <lpage>1763</lpage>
          . [21]
          <string-name>
            <given-names>W.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>preprint arXiv:1810.11921</source>
          (
          <year>2018</year>
          ). [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
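          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Fibinet: Combining</article-title>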
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>arXiv:1905.09433</source>
          (
          <year>2019</year>
          ). [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>xdeepint: a hybrid architecture for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>teractions</article-title>
          (
          <year>2020</year>
          ). [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shivanna</surname>
          </string-name>
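          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,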
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <article-title>Dcn v2: Improved deep &amp; cross</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>Conference 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1785</fpage>
          -
          <lpage>1797</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>