
Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction

Yachen Yan, Liubo Li

Credit Karma, 760 Market Street, San Francisco, California, 94012, USA

liubo.li@creditkarma.co

CEUR Workshop Proceedings (CEUR-WS.org)

Keywords: CTR prediction, Recommendation System, Feature Interaction, Mixture of Experts, Dynamic Inference, Early Exiting, AutoML

Learning feature interactions is crucial to the success of large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners have proposed a wide range of neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learn the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate a dynamic early exiting mechanism for feature interaction depth selection. AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute the prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models. We open-source the TensorFlow implementation of AdaEnsemble: https://github.com/yanyachen/AdaEnsemble.

1. Introduction

Click-through rate (CTR) prediction model [1] is an essential component for large-scale search ranking, online advertising, and recommendation systems [2, 3, 4, 5].

Many deep learning-based models have been proposed for CTR prediction problems in the industry. They have become dominant in learning the useful feature interactions of the mixed-type input in an end-to-end fashion [5]. While every existing method focuses on automatically modeling different types of feature interactions, there have been very few attempts to model different types of interactions jointly and dynamically, such that one model architecture can be directly applied to different types of datasets. We believe that an ensemble of various interaction modules that generates heterogeneous feature interactions can complement the non-overlapping knowledge learned through each interaction learning approach.

With the aim of accomplishing the stated objective, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) hierarchical architecture to ensemble different interaction learning modules and dynamically select the optimal feature interaction depth. AdaEnsemble encompasses SparseMoE layers and the Depth Selecting Controller. Within each SparseMoE layer of AdaEnsemble, there is a collection of interaction learning experts, and a trainable gating network determines a sparse combination of these experts to use for each example. Within the Depth Selecting Controller, a trainable gating network chooses the feature interaction depth for each example and recursively propagates feature interaction representations through SparseMoE layers to the corresponding depth for computing the prediction. Through these conditional computation mechanisms, we enlarge the model capacity exponentially while maintaining computational efficiency.

The main contributions of this paper can be summarized as follows:

• We designed a novel model architecture called AdaEnsemble to ensemble various types of feature interaction learning modules by Sparsely-Gated Mixture-of-Experts (SparseMoE). Through utilizing MoE layers recursively with residual connections and normalization, AdaEnsemble can model different types of interactions jointly and dynamically.
• We integrated a Depth Selecting Controller that selects the feature interaction depth for each example. Through utilizing this controller, AdaEnsemble can dynamically determine the layer for early exiting to improve prediction accuracy and inference efficiency.
• We applied a bi-level optimization algorithm for iteratively training the modeling network and gating network.

∗ Corresponding author.

[Figure: Overall architecture of AdaEnsemble. Categorical features and bucketized numerical features pass through the embedding layer to form the input feature map, which is propagated through stacked Sparse MoE layers (1st, 2nd, ..., l-th), each followed by Add & Normalize and paired with an estimator (1st, 2nd, ..., l-th); the Depth Selecting Network decides the exit layer. In this example, the depth selecting network selects the 2nd layer to exit and compute the final prediction, therefore the deeper layers are not activated and are plotted translucent in the figure.]

[Figure: Sparse MoE layer. The gating network maps the input embedding to sparse non-zero values and non-zero indices, and the dispatcher routes the input to the selected experts among Expert 1, ..., Expert n.]

[Figure: Noisy gating network. Components: input embedding, noise injection, feed-forward network, L2 normalization, expert embeddings e1 to e4, L2 normalization, cosine similarity, learnable temperature re-scaling, Top-K, softmax, routing score.]

2. Proposed Model: AdaEnsemble

2.1. Feature Interaction Experts

We considered several types of feature interaction experts in our model: Dense Layer, Convolution Layer, Multi-Head Self-Attention Layer, Polynomial Interaction Layer, and Cross Layer. Essentially, any feature interaction learning layer can be included in our framework, and the residual connection and normalization will be applied to their ensembles. Now we introduce these feature interaction experts included in our framework. Note that our proposed framework is general and can use arbitrary feature interaction modules. The feature interaction experts that can be used are not limited to the aforementioned.

2.2. Sparse Mixture-of-Experts Layer

The Sparse Mixture-of-Experts layer ensembles the aforementioned heterogeneous feature interaction experts and consists of several other essential parts that allow the overall model to be trained stably.

2.2.1. Noisy Gating Network

The gating network essentially computes the gating values used to select experts for each input embedding and to weight the output embedding of each expert.
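As a rough illustration of the gating computation in this subsection, the sketch below scores each expert by cosine similarity with a hidden state, rescales the scores with a temperature, keeps the top-k scores, and applies a softmax so that unselected experts receive a gating value of exactly zero. It is a plain-Python sketch under assumptions: the names `l2_normalize` and `top_k_gate` are illustrative and are not taken from the released TensorFlow implementation, the temperature is passed as a fixed number rather than learned, and the feed-forward projection and jitter noise are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def top_k_gate(hidden, expert_embeddings, k, temperature=1.0):
    """Return one gating value per expert; all but the top-k are exactly 0."""
    h = l2_normalize(hidden)
    scores = []
    for e in expert_embeddings:
        e = l2_normalize(e)
        cosine = sum(a * b for a, b in zip(h, e))
        scores.append(cosine / temperature)  # temperature-scaled routing score
    # Indices of the k largest routing scores.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores; dropped scores act as -inf, i.e. weight 0.
    max_kept = max(scores[i] for i in keep)
    exp_kept = {i: math.exp(scores[i] - max_kept) for i in keep}
    total = sum(exp_kept.values())
    return [exp_kept[i] / total if i in exp_kept else 0.0
            for i in range(len(scores))]


# Hypothetical toy input: one 2-d hidden state and four expert embeddings.
hidden = [1.0, 0.0]
experts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
gates = top_k_gate(hidden, experts, k=2, temperature=0.5)
```

In this toy input the first and fourth experts have the highest cosine similarity with the hidden state, so only they receive non-zero gating values and a dispatcher would skip the other two experts entirely.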

The input embedding $x_0$ is first processed by the gating network: a two-layer feed-forward network consisting of a dimension reduction layer with reduction ratio $r$ [6], a non-linear activation function, and then a dense layer projecting to the hidden state $h \in \mathbb{R}^{d}$. Additionally, we apply multiplicative jitter noise to introduce exploration and promote load balancing between different experts.

$$h = \mathrm{FFN}\big(x_0 \circ \mathrm{RandomUniform}(1.0 - \mathrm{eps},\ 1.0 + \mathrm{eps})\big) \tag{1}$$

After projecting the input embedding to the hidden state, we apply L2 normalization to both the hidden state $h \in \mathbb{R}^{d}$ and the learnable expert embeddings $e_i \in \mathbb{R}^{d}$, where $i$ is the index of the expert. Then, we compute the cosine similarity between the hidden state and each expert embedding as the initial routing score. Here we encourage the uniformity of representations to avoid the dominated experts issue.

$$c_i = \frac{h \cdot e_i}{\|h\| \, \|e_i\|} \tag{2}$$

Finally, we use a learnable temperature scalar $\tau$ to rescale the routing scores, which lie in the range $[-1, +1]$.

$$s_i = c_i / \tau \tag{3}$$

For the computed routing scores $s$, we only keep the top $k$ values and set the rest to $-\infty$, so that the corresponding softmax gating values equal $0$. The $i$-th element of the output vector of the gating network is

$$G(x_0)_i = \frac{\exp\big(\mathrm{TopK}(s, k)_i\big)}{\sum_{j=1}^{n} \exp\big(\mathrm{TopK}(s, k)_j\big)} \tag{4}$$

where

$$\mathrm{TopK}(s, k)_i = \begin{cases} s_i & \text{if } s_i \text{ is in the top } k \text{ elements of } s \\ -\infty & \text{otherwise.} \end{cases} \tag{5}$$

These gating values will be used by the sparse dispatcher for routing examples to different experts. This is the essential step for achieving sparsity of our Sparse Mixture-of-Experts layer. Note that $G(x_0)$ is differentiable regardless of the value of $k$ [7].

2.2.2. Annealing Top-K Gating

We also introduce an annealing mechanism to the Top-K operation. We start with the $k$ value equal to the number of experts, which means that we start with a fully dense gate that routes examples to all experts. Then we gradually decrease $k$ and route examples to fewer experts, to adaptively make the structure sparser and continuously improve the computation efficiency.

By annealing the $k$ value, we start to train our architecture with a dense structure, which allows us to thoroughly learn all experts and adjust the gating network in the correct direction at the beginning. Therefore, we can control the sparsity of our architecture while training, which not only accelerates the convergence of the gating network but also benefits the experts' specialization in learning particular types of feature interactions.

2.2.3. Sparse Dispatcher

The sparse dispatcher [8, 7, 9] takes the examples, gating values, and experts as input. It first dispatches the examples to the experts corresponding to the non-zero gating values, and lets the experts generate the output embeddings. The output of the Sparse Mixture-of-Experts layer is the linearly weighted combination of expert output embeddings by the non-zero gating values.

$$y = \sum_{i \in \mathcal{T}} G(x_0)_i \, E_i(x_0) \tag{6}$$

where $\mathcal{T}$ denotes the selected non-zero indices. We save computation based on the sparsity of $G(x_0)$. Wherever $G(x_0)_i = 0$, we do not pass the example to the corresponding expert and do not need to compute the expert output embedding $E_i(x_0)$.

2.2.4. Load Distribution Regularization

As stated in previous research [8, 7, 9, 10], the gating network tends to select only a few experts if no regularization is applied, especially when certain experts are easier to train than other experts. This phenomenon is self-reinforcing, since the selected experts are trained more and will be selected more frequently by the gating network. Therefore, the load balancing loss is applied to enforce uniform expert routing.

$$\mathcal{L}_{balance} = \alpha \cdot n \cdot \sum_{j=1}^{n} f_j \cdot P_j \tag{7}$$

$$f_j = \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} \mathbb{1}\{\arg\max G(x) = j\} \tag{8}$$

$$P_j = \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} G(x)_j \tag{9}$$

where $|\mathcal{B}|$ is the batch size, $n$ is the number of experts, $f_j$ is the fraction of examples dispatched to expert $j$, $P_j$ is the average of the router probability allocated for expert $j$, and $\alpha$ is the coefficient for the regularization term.

While the default load balancing loss is applicable and effective when experts are of the same type, AdaEnsemble uses heterogeneous feature interaction experts, and the optimal load for each expert is not uniform. Therefore, we apply the below load distribution regularization to encourage the expected load distribution of heterogeneous experts.

$$\mathcal{L}_{distribution} = \alpha \cdot n \cdot \sum_{j=1}^{n} (\cdots) \tag{10}$$

where the hyper-parameter $p_j$ is the expected load fraction of examples dispatched to expert $j$, and naturally $\sum_{j=1}^{n} p_j = 1$. In practice, $\alpha$ should be sufficiently large to prevent the self-reinforcing expert selection phenomenon at the initial training stage while not overwhelming the primary LogLoss objective.

2.3. Depth Selecting Controller

2.3.1. Depth Selecting Network

The Depth Selecting Network has essentially the same configuration as the aforementioned Noisy Gating Network for the SparseMoE layer. We denote it by $G^{depth}(x_0)$. The outputs of $G^{depth}(x_0)$ are $[g^{depth}_1, g^{depth}_2, \cdots, g^{depth}_l]$, indicating each example's optimal forward propagation depth. The $i$-th unit denotes the probability of selecting the $i$-th MoE layer to exit. The optimal depth is automatically selected as the one corresponding to the largest probability. In contrast to the expert selection, when choosing the optimal depth of each example for dynamic inference, we only keep the top-1 depth index from the output units of the Depth Selecting Network. Note that we can also apply the load distribution regularization to encourage the examples' propagation depth distribution.

2.3.2. Dynamic Propagation Mechanism

With the depth gates $g^{depth}_i \in [0, 1]$ computed by the Depth Selecting Network, we obtain the optimal depth for each example. If $g^{depth}_i = 0$, we recursively forward propagate examples through MoE layers and compute deeper representations until $g^{depth}_i = 1$ or the final layer is reached. If $g^{depth}_i = 1$, the forward propagation is stopped and the corresponding $i$-th estimator computes the prediction. To efficiently process a batch of examples with different optimal propagation depths, we utilize Algorithm 1 for dynamic forward propagation.

[Algorithm 1: Dynamic Propagation]

2.4. Training

2.4.1. Training Objective

The loss function we use is a linearly weighted combination of the Log Loss and the auxiliary load distribution regularization,

$$\mathcal{L} = \mathcal{L}_{LogLoss} + \lambda_1 \mathcal{L}^{expert}_{distribution} + \lambda_2 \mathcal{L}^{depth}_{distribution} \tag{11}$$

where $\lambda_1$ and $\lambda_2$ are the coefficients for weighting the load distribution regularization of experts and depth.

2.4.2. Bi-Level Optimization

The optimization task for training AdaEnsemble is to jointly optimize the parameters $w$, which stand for the expert layers and estimator layers, and $\theta$, which represent the expert gating network and depth selecting network. Inspired by DARTS [11], we apply a bi-level optimization algorithm for training our model, where $\theta$ is the upper-level parameters and $w$ is the lower-level parameters. We apply Algorithm 2 to optimize $\theta$ and $w$ alternately and iteratively.

2.5. Discussion on AdaEnsemble

The combination of sparse expert routing within each SparseMoE layer and early exiting by the depth selecting controller brings two merits to the proposed model. On one hand, the stacked SparseMoE layers allow the proposed model to leverage the exponential combinations of sparsely gated experts, which brings in more predicting power. On the other hand, both the expert routing mechanism and the depth selecting mechanism enable the proposed model to learn the instance-aware expert combination and instance-aware model depth. These two conditional computation mechanisms improve the efficiency during model serving. In the next section, we will illustrate the effectiveness of the proposed model through some experimental studies.

3. Experiments

In this section, we focus on evaluating the effectiveness of our proposed models and seeking answers to the following research questions:

• Q1: How does our proposed AdaEnsemble perform compared to each baseline in the CTR prediction problem?
• Q2: How does the SparseMoE layer perform compared to DenseMoE, which utilizes all feature interaction experts? Does the cascade of SparseMoE layers effectively capture different types of feature interactions?
• Q3: How does the depth selecting controller perform compared to a full-depth network? Does the early exiting mechanism achieve both effectiveness and efficiency?

3.1. Experiment Setup

3.1.1. Datasets

We evaluate our proposed model on three public real-world datasets widely used for research.

1. Criteo.¹ Criteo dataset is from Kaggle competition in 2014. Criteo AI Lab officially released this dataset afterwards, for academic use.
2. Avazu.² Avazu dataset is from Kaggle competition in 2015. Avazu provided 10 days of click-through data.
3. iPinYou.³ iPinYou dataset is from iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition in 2013. We follow the data processing steps of [12].

3.1.2. Competing Models

We compare AdaEnsemble with the following models: LR (Logistic Regression) [13, 2], FM (Factorization Machine) [14], DNN (Multilayer Perceptron), Wide & Deep [4], DeepCrossing [15], DCN (Deep & Cross Network) [16], PNN (with both inner product layer and outer product layer) [17, 18], DeepFM [19], xDeepFM [20], AutoInt [21], FiBiNET [22], xDeepInt [23] and DCN V2 [24]. Some of the models are state-of-the-art models for the CTR prediction problem and are widely used in the industry.

¹ https://www.kaggle.com/c/criteo-display-ad-challenge
² https://www.kaggle.com/c/avazu-ctr-prediction
³ http://contest.ipinyou.com/
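As an implementation note before the results, the batched early exiting of Section 2.3.2 (named there as Algorithm 1, Dynamic Propagation) can be sketched as follows. This is a minimal plain-Python illustration of the idea, not the authors' algorithm: the function name `dynamic_propagate`, the toy layers and estimators, and the 0-based `exit_depths` argument (standing in for the top-1 output of the Depth Selecting Network) are all assumptions made for the example.

```python
def dynamic_propagate(x0, layers, estimators, exit_depths):
    """Propagate a batch through stacked layers with per-example early exit.

    x0:          list of input representations, one per example
    layers:      list of callables; layer i maps a representation to the next
    estimators:  list of callables; estimator i maps a representation to a prediction
    exit_depths: list of 0-based exit layer indices, one per example
    """
    predictions = [None] * len(x0)
    active = list(range(len(x0)))            # examples still propagating
    states = {i: x0[i] for i in active}
    for depth, (layer, estimator) in enumerate(zip(layers, estimators)):
        if not active:
            break                            # every example has exited; skip deeper layers
        last = depth == len(layers) - 1
        still_active = []
        for i in active:
            states[i] = layer(states[i])
            if exit_depths[i] == depth or last:
                predictions[i] = estimator(states[i])   # exit at this depth
            else:
                still_active.append(i)
        active = still_active
    return predictions


# Hypothetical toy setup: each layer adds a constant and counts its invocations.
calls = [0, 0, 0]

def make_layer(idx, increment):
    def layer(x):
        calls[idx] += 1
        return x + increment
    return layer

layers = [make_layer(0, 1), make_layer(1, 10), make_layer(2, 100)]
estimators = [lambda h: ("exit1", h), lambda h: ("exit2", h), lambda h: ("exit3", h)]
preds = dynamic_propagate([0, 0, 0], layers, estimators, exit_depths=[0, 1, 1])
```

In the toy run, one example exits after the first layer and two after the second, so the third layer is never invoked. A tensor implementation would typically gather the still-active rows of the batch instead of looping over examples.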

3.2. Model Performance Comparison (Q1)

The overall performance of different model architectures is listed in Table 1. We have the following observations in terms of model effectiveness:

• Models with more than two feature interaction modules generally perform better than models with only a single feature interaction module, indicating the importance of jointly learned feature interaction representation.
• The optimal feature interaction depth varies by feature interaction module type and when combined with different module types, indicating the necessity for dynamically combining different feature interactions on different interaction depths.
• AdaEnsemble achieves the best prediction performance among all models. Our model's superior performance could be attributed to the fact that AdaEnsemble jointly models various types of feature interactions by adaptively selecting the combination of feature interaction experts and determining the optimal feature interaction depth by the controller.

[Table 2: Performance Comparison of SparseMoE and DenseMoE on Criteo Dataset. Rows: SparseMoE (k=1), SparseMoE (k=2), SparseMoE (k=3), SparseMoE (k=4), DenseMoE, Ensemble, Dense Expert Only, Cross Expert Only, Polynomial Expert Only, CNN Expert Only, MHSA Expert Only.]

• Utilizing different feature interaction experts results in better performance than single expert models in general. The SparseMoE layer achieves a better tradeoff between accuracy and computation efficiency.
• When utilizing more than one expert per SparseMoE layer, even though only a subset of feature interaction experts are selected, SparseMoE can still effectively capture the most significant feature interactions of different depths and maintain similar performance to the DenseMoE layer and superior performance to the ensemble, while including more experts can also result in more computational cost.
• Figure 4 shows that the SparseMoE layers dynamically utilize a different combination of experts across different layers to capture the complex feature interactions effectively. That also explains why fusing different feature interactions is crucial for prediction accuracy.

[Figure 4: Each vertical axis represents a SparseMoE layer and the proportion of an expert being used. The horizontal flows indicate the dependency and relation of different SparseMoE layers' expert selection. The proportion of each expert combination is represented by the width of the flows and further clustered into different colors.]

3.4. Depth Selection Analysis (Q3)

We compare the model performance between AdaEnsemble with and without the depth selecting controller to investigate whether the model achieves harmony between prediction accuracy and inference efficiency with respect to depth selection. The performance of the different types of MoE layers and ensemble result is listed in Table 3.

With the incorporation of the depth selecting controller, we can observe that our model significantly improves training complexity and inference efficiency (measured in FLOPs: 6.02M versus 8.58M, roughly 30% fewer) while achieving slightly better performance than the full-depth model. We think the full-depth model is easier to overfit compared to AdaEnsemble, thus resulting in slightly worse accuracy performance. AdaEnsemble with the depth selecting controller adaptively selects the feature interaction depth on a per-example basis, thus achieving better trade-offs between prediction accuracy and inference efficiency. The distribution of per-example forward propagation depth is listed in Table 4.

Table 3
Performance Comparison of AdaEnsemble with and without controller on Criteo Dataset.

                  AUC      LogLoss   FLOPs
w/ controller     0.8132   0.4394    6.02M
w/o controller    0.8128   0.4396    8.58M

Table 4
AdaEnsemble Propagation Depth on Criteo Dataset.

            Layer 1   Layer 2   Layer 3   Layer 4
Fraction    6.53%     19.36%    66.43%    7.68%

4. Conclusion

In this paper, we present a novel model architecture for click-through rate (CTR) modeling by introducing the Sparse-Gated Mixture-of-Experts (SparseMoE) hierarchical architecture for ensemble learning of heterogeneous feature interaction experts. A Depth Selecting Controller component was integrated into the model to dynamically select the optimal feature interaction depth for each instance. The utilization of these two conditional computation mechanisms results in a model architecture that can select a subset of feature interaction experts and the optimal interaction depth for each instance simultaneously, leading to an exponential increase in model capacity without incurring a corresponding increase in inference cost. Our extensive experiments demonstrate the superiority of our approach in terms of effectiveness and efficiency.

Future work will be dedicated to exploring the potential for extending our method to the modeling of user behavior sequences. By learning a sparse ensemble of models, we anticipate that our approach can dynamically select the optimal expert for different behaviors in the context of user behavior sequence data.

References

[1] M. Richardson, E. Dominowska, R. Ragno, Predicting clicks: estimating the click-through rate for new ads, in: Proceedings of the 16th international conference on World Wide Web, ACM, 2007, pp. 521–530.
[2] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al., Ad click prediction: a view from the trenches, in: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2013, pp. 1222–1230.
[3] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al., Practical lessons from predicting clicks on ads at facebook, in: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ACM, 2014, pp. 1–9.
[4] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st workshop on deep learning for recommender systems, ACM, 2016, pp. 7–10.
[5] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: A survey and new perspectives, ACM Computing Surveys (CSUR) 52 (2019) 5.
[6] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[7] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.
[8] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, arXiv preprint arXiv:1701.06538 (2017).
[9] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, W. Fedus, Designing effective sparse expert models, arXiv preprint arXiv:2202.08906 (2022).
[10] Z. Chi, L. Dong, S. Huang, D. Dai, S. Ma, B. Patra, S. Singhal, P. Bajaj, X. Song, F. Wei, On the representation collapse of sparse mixture of experts, arXiv preprint arXiv:2204.09179 (2022).
[11] H. Liu, K. Simonyan, Y. Yang, Darts: Differentiable architecture search, arXiv preprint arXiv:1806.09055 (2018).
[12] W. Zhang, S. Yuan, J. Wang, X. Shen, Real-time bidding benchmarking with ipinyou dataset, arXiv preprint arXiv:1407.7073 (2014).
[13] H. B. McMahan, Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization (2011).
[14] S. Rendle, Factorization machines, in: 2010 IEEE International Conference on Data Mining, IEEE, 2010, pp. 995–1000.
[15] Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, J. Mao, Deep crossing: Web-scale modeling without manually crafted combinatorial features, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 255–262.
[16] R. Wang, B. Fu, G. Fu, M. Wang, Deep & cross network for ad click predictions, in: Proceedings of the ADKDD'17, ACM, 2017, p. 12.
[17] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, J. Wang, Product-based neural networks for user response prediction, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 1149–1154.
[18] Y. Qu, B. Fang, W. Zhang, R. Tang, M. Niu, H. Guo, Y. Yu, X. He, Product-based neural networks for user response prediction over multi-field categorical data, ACM Transactions on Information Systems (TOIS) 37 (2018) 5.
[19] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: a factorization-machine based neural network for ctr prediction, arXiv preprint arXiv:1703.04247 (2017).
[20] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, G. Sun, xdeepfm: Combining explicit and implicit feature interactions for recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2018, pp. 1754–1763.
[21] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, J. Tang, Autoint: Automatic feature interaction learning via self-attentive neural networks, arXiv preprint arXiv:1810.11921 (2018).
[22] T. Huang, Z. Zhang, J. Zhang, Fibinet: Combining feature importance and bilinear feature interaction for click-through rate prediction, arXiv preprint arXiv:1905.09433 (2019).
[23] Y. Yan, L. Li, xdeepint: a hybrid architecture for modeling the vector-wise and bit-wise feature interactions (2020).
[24] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, E. Chi, Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems, in: Proceedings of the Web Conference 2021, 2021, pp. 1785–1797.