Visualizing and Understanding Deep Neural Networks in CTR Prediction

Lin Guo, Hui Ye, Wenbo Su, Henhuan Liu, Kai Sun, Hang Xiang
Alibaba Group

ABSTRACT

Although deep learning techniques have been successfully applied to many tasks, interpreting deep neural network models is still a big challenge. Recently, much work has been done on visualizing and analyzing the mechanism of deep neural networks in the areas of image processing and natural language processing. In this paper, we present our approaches to visualizing and understanding deep neural networks for a very important commercial task: CTR (click-through rate) prediction. We conduct experiments on production data from our online advertising system, whose distribution varies daily. To understand the mechanism and the performance of the model, we inspect the model's inner status at the neuron level. A probe approach is also implemented to measure the layer-wise performance of the model. Moreover, to measure the influence of the input features, we calculate saliency scores based on the back-propagated gradients. Practical applications are also discussed, for example, in understanding, monitoring, diagnosing, and refining models and algorithms.

ACM Reference format:
Lin Guo, Hui Ye, Wenbo Su, Henhuan Liu, Kai Sun, and Hang Xiang. 2018. Visualizing and Understanding Deep Neural Networks in CTR Prediction. In Proceedings of ACM SIGIR Workshop on eCommerce, Ann Arbor, Michigan, USA, July 2018 (SIGIR 2018 eCom), 7 pages.

1 INTRODUCTION

Click-through rate (CTR) prediction plays a crucial role in computational advertising. In the common cost-per-click advertising system, advertisements are ranked by the product of the bid price and the predicted CTR when bidding for impression opportunities. Therefore, the revenue of this multi-billion-dollar business relies heavily on the performance of the CTR prediction model.

Deep learning techniques have been successfully applied to CTR prediction tasks [6, 7, 23]. Deep neural networks (DNNs), composed of stacked layers of neurons, are able to extract nonlinear patterns from features and thus reduce the burden of nontrivial feature engineering. However, the working mechanisms of deep learning models are still not well understood. The lack of interpretability has become an obstacle for deep learning and raises concerns about the reliability of deep learning applications, especially for critical industrial deployments.

Much recent progress has been made in visualizing and interpreting deep learning models for image processing [15, 18, 20, 21, 26, 29] and natural language processing [3, 4, 14, 16, 27]. In this paper, we present a series of approaches to visualize and analyze a simple DNN model for CTR prediction on production data from our search advertising platform. The model's performance decay is investigated over datasets with daily varying distributions, and the distributions of the output scores are compared across training stages. We inspect the model's inner status down to the neuron level. We study the statistical properties of the neurons' statuses in the hidden layers and investigate the high-level representations learned by the model through t-SNE projection [17, 21]. A probe method [2] is applied to dissect the model's performance layer by layer on different datasets. Moreover, to measure the influence of the input features, we calculate saliency scores for the feature groups based on back-propagated gradients.

Beyond the classic model evaluation metrics [11, 12], we open up the "black box" and inspect the DNN model from the output end to the input end. Understanding the model's mechanism helps us not only design and diagnose models, but also monitor the algorithmic advertising system in daily production.

2 EXPERIMENTAL SETTING

2.1 Datasets

We perform experiments on production CTR prediction data from the search advertising platform of our company.
Starting from a typical Wednesday, our data are collected over eight consecutive days. The training set is sampled from day one. To investigate the decay of the model's performance, we evaluate the model on a daily basis from day one to day eight. The eight test sets are denoted, in turn, by test1, test2, ..., test8. Each dataset contains about 150 million instances randomly sampled from the ad impression logs of the corresponding day. Note that there is no overlap between test1 and the training set. This setup simulates the real-world environment of the CTR prediction task: the model is trained on historical data and deployed to serve future online traffic, whose distribution naturally varies and differs from that of the training data.

Our data contains 34 groups of sparse categorical features (around 100 million binary features in total), e.g., user id, user's city, user's gender, user's age level, query id, query words, shop id, ad's category, etc. Note that there are no combinatorial features in this study.

2.2 Model setting

The DNN model contains four fully connected hidden layers. From layer 1 (closest to the input) to layer 4 (right before the output layer), the layer widths are 256, 128, 64, and 32 neurons. The output vector of the k-th hidden layer, denoted by h_k, is computed as

h_k = ReLU(W_k h_{k-1} + b_k),  (1)

where W_k is the weight matrix of all the connections from the neurons of layer (k-1), b_k is the bias term, and the ReLU (rectified linear unit) function is used as the activation function. The output layer uses a sigmoid function to map the output to a number between 0 and 1 as the predicted probability of click:

P_ctr = Sigmoid(W_5 h_4 + b_5).  (2)

During training, P_ctr is compared against the ground-truth label and the cross entropy is used as the loss function. For each input instance, the sparse feature ids are embedded into 8-dimensional float vectors [6, 7, 23]. For feature groups containing multiple feature ids per instance, e.g., query words, sum pooling is applied so that each feature group produces a single 8-dimensional embedding vector. The embedding outputs are concatenated into a 272-dimensional vector, denoted by h_0, as the input to layer 1. The embedding vectors are trained jointly with the other parts of the model.

The experiments are run on distributed TensorFlow [1] released by Google. The model is trained with the Adagrad optimizer [8] with learning rate 0.005, initial accumulator value 0.0001, and mini-batch size 1000. Glorot and Bengio's method [10] is used for initialization. We visualize the model's inner status by dynamically dumping the intermediate data based on the model graph.
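To make the architecture above concrete, the following is a minimal NumPy sketch of the forward pass of Eqs. (1) and (2): sum-pooled 8-dimensional embeddings for the 34 feature groups are concatenated into the 272-dimensional h_0 and passed through the four ReLU layers and the sigmoid output. The toy embedding table, random initialization, and function names are our own illustrative assumptions, not the production implementation.

```python
import numpy as np

N_GROUPS, EMB_DIM = 34, 8                        # 34 feature groups, 8-dim embeddings
WIDTHS = [N_GROUPS * EMB_DIM, 256, 128, 64, 32]  # h_0 is 272-dimensional

rng = np.random.default_rng(0)
emb_table = rng.normal(scale=0.01, size=(100_000, EMB_DIM))  # toy embedding table
W = [rng.normal(scale=0.01, size=(WIDTHS[k + 1], WIDTHS[k])) for k in range(4)]
b = [np.zeros(WIDTHS[k + 1]) for k in range(4)]
w_out, b_out = rng.normal(scale=0.01, size=32), 0.0

def predict_ctr(group_ids):
    """group_ids: 34 lists of sparse feature ids, one list per feature group."""
    # Sum-pool each group's embeddings into one 8-dim vector, then concatenate
    # the 34 pooled vectors into the 272-dim input h_0.
    h = np.concatenate([emb_table[ids].sum(axis=0) for ids in group_ids])
    for W_k, b_k in zip(W, b):                   # four hidden layers, Eq. (1)
        h = np.maximum(W_k @ h + b_k, 0.0)       # ReLU(W_k h_{k-1} + b_k)
    return 1.0 / (1.0 + np.exp(-(w_out @ h + b_out)))   # sigmoid output, Eq. (2)

p = predict_ctr([[k] for k in range(N_GROUPS)])  # one toy instance, one id per group
```

In the production model, of course, these parameters and the embedding vectors are learned jointly by minimizing the cross-entropy loss, as described above.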
3 RESULTS

3.1 AUC and Prediction Score

To measure the performance of the model, we employ AUC (the area under the receiver operating characteristic curve) as the key metric. AUC is a widely used measure for evaluating CTR performance [12].

In Fig. 1, we present the evolution of the model's AUC as a function of the training step for the training and test sets. As training proceeds, the train AUC keeps growing, while all the test AUCs follow the same pattern: they first rise and then decrease due to overfitting. The model generalizes best at step 210000. Comparing the eight test AUCs at the same training step reveals the model's performance decay as a function of dataset. The test AUC decreases monotonically from day one to day five. As expected, this is because the distribution of the test data differs from that of the training set, and the difference grows day by day. After that, the AUC rises again for the last three days and surpasses day four. This is in accordance with a characteristic of our business scenario: although the data vary from day to day, the users' behaviors on our website follow weekly periodic patterns. This non-monotonic change of AUC is evident in the regime from underfitting to weak overfitting (before step ~400000). At larger training steps, overfitting becomes severe and the model performs equally poorly on the last five days.

Figure 1: AUC score as a function of training step for the training and test sets.

Fig. 2 provides insights into the distribution of the predicted CTR scores for the training, test1, and test5 sets. At training step 210000, the AUC decay from the training set to test1 is mainly because the CTRs of the positive (clicked) samples in test1 are more under-predicted by the model. The further decay from test1 to test5 is mainly because the negative (non-clicked) samples in test5 tend to be predicted with higher CTRs (the training and test1 curves overlap for the negative samples and can hardly be distinguished by eye). At training step 600000, the model overfits the training data such that it aggressively predicts the CTR towards zero for both clicked and non-clicked samples. This is attributed to the high skewness of the data: the proportion of clicked samples is lower than 10%, so under-predicting the CTR for all samples may still reduce the training loss. This shape of the distribution changes significantly as the data become different: the scores move rightwards and the distribution becomes blurred.

Figure 2: Distribution of the predicted CTR for the models at training step 210000 and step 600000. The x-axis denotes the predicted CTR normalized by the average click ratio of the training set.
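The daily evaluation and the score distributions above can be reproduced with standard tooling. The sketch below, under our own naming assumptions, computes the AUC of one test day with scikit-learn together with the histograms of the predicted CTR normalized by the training set's average click ratio (the x-axis of Fig. 2).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_day(labels, scores, train_click_ratio, bins=50):
    """labels: 0/1 click labels of one test day; scores: the model's predicted CTRs."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    auc = roc_auc_score(labels, scores)              # the metric plotted in Fig. 1
    normalized = scores / train_click_ratio          # x-axis of Fig. 2
    pos_hist, edges = np.histogram(normalized[labels == 1], bins=bins, density=True)
    neg_hist, _ = np.histogram(normalized[labels == 0], bins=edges, density=True)
    return auc, (edges, pos_hist, neg_hist)
```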
3.2 Neuron Status

In this subsection, we investigate the statistics of the neurons' statuses at different training stages and on different datasets. These statistical properties depict the model's representation of the input data and help us interpret the model's performance and working mechanism.

Figure 3: Mean outputs of the neurons in layer 3 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

Figure 4: Mean outputs of the neurons in layer 4 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

Figure 5: Standard deviations of the outputs of the neurons in layer 3 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

Figure 6: Standard deviations of the outputs of the neurons in layer 4 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

The mean outputs of the neurons within layers 3 and 4 are illustrated in Figs. 3 and 4, respectively. Correspondingly, the standard deviations of the neurons' outputs are plotted in Figs. 5 and 6. For steps 100000 and 210000, the results are quite close between the underfitting and well-fitting stages; about a quarter of the neurons are barely activated. Significant changes are observed in the overfitting regime (step > 300000): more neurons become activated, and the difference between the training and test sets grows with the degree of overfitting, especially in the standard deviation (Figs. 5 and 6). The higher standard deviation on the training set indicates that the neurons become overly sensitive to the input of the training data. Fig. 7 presents the standard deviation averaged over all 64 neurons of layer 3 as a function of dataset. For all three training stages, the trend of the average standard deviation correlates with the model's AUC score (Fig. 1).

Figure 7: Average standard deviation of the neurons' outputs in layer 3 as a function of dataset, at training steps 100000, 210000, and 600000. The output's standard deviation is first calculated for each neuron and then averaged over all 64 neurons of layer 3.
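These per-neuron statistics are straightforward to compute once the layer outputs are dumped from the model graph. A minimal sketch, assuming H holds one hidden layer's post-activation outputs over a dataset (our own array name and layout):

```python
import numpy as np

def neuron_statistics(H):
    """H: post-activation outputs of one hidden layer, shape (n_instances, n_neurons)."""
    mean_out = H.mean(axis=0)   # per-neuron mean output (Figs. 3 and 4)
    std_out = H.std(axis=0)     # per-neuron standard deviation (Figs. 5 and 6)
    avg_std = std_out.mean()    # averaged over the layer's neurons (Fig. 7)
    return mean_out, std_out, avg_std
```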
The output’s standard deviation is first After step 210000, the neurons’ correlation deceases monotoni- calculated for each neuron, and then averaged over all the cally with training step for all hidden layers. Recalling the enhanced 64 neurons of layer 3. neuron activation observed for this overfitting regime (Figs. 3 and 4), we can interpret that the model starts to explore more predictive The mean outputs of the neurons within layer 3 and 4 are illus- patterns from the input information. However, the deceasing test trated in Figs. 3 and 4, respectively. Correspondingly, the standard AUC (Fig. 1) reveals that the boosted representation of the input deviation of the neurons’ outputs are plotted in Figs. 5 and 6. For from training data can not be well generalized to predict the test step 100000 and 210000, the results are quite close between the data. underfitting and well-fitting stages. About a quarter of the neu- In order to inspect the spacial structure of the high-level rep- rons are barely activated. Significant changes are observed for the resentations for the input data, we project the neurons’ output overfitting regime (step > 300000). More neurons become activated. vectors to 2-dimensional space using t-SNE method [17, 21]. The Also, the difference between the training and test sets grows with t-SNE projection is able to preserve neighborhoods and clusters the degree of overfitting, especially in the standard deviation (Figs. of the data points in the original representation space. In Fig. 9, 5 and 6). The higher standard deviation on the training set indi- we illustrate the projection results for layer 2, 3 and 4 at training cates that the neurons become over sensitive to the input of the step 210000. The presented 10000 clicked and 10000 non-clicked training data. Fig. 7 presents the variation of the standard deviation instances are randomly selected from the training set. averaged over all the 64 neurons of layer 3 as a function of dataset. For layer 3 (the center plot in Fig. 9), we can clearly see the For all the three different training stages, the trend of the average regions with concentrated clicked points. We find that the training standard deviation correlates with the model’s AUC score (Fig. 1). process enhances the concentration of clicked points for the training To gain more knowledge about the collaborative patterns of set, indicating that the model learns more discriminative represen- neurons inside the model [21, 26], for each layer, we calculate the tation for the training data. For the test datasets, we observe that correlations among the neurons. Neurons’ statuses before activation the concentrated distribution disappears when overfitting happens. are used. We measure the average degree of neurons’ correlations Unlike the case of image classification in Ref. [21], no class separa- by averaging the absolute value of all the correlation coefficients tion is observed even at severely overfitting stage. This is mainly for each layer. The average strength of correlations is plotted as a due to the highly noisy and skewed data for the CTR prediction function of training step in Fig. 8. The degree of correlation climbs task. Visualizing and Understanding Deep Neural Networks in CTR Prediction SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA Figure 9: Visualization of the output vectors for layer 2, 3 and 4 using t-SNE method, at training step 210000. 
Turning to the left plot in Fig. 9, the concentration of clicked points in layer 2 is clearly worse than in layer 3. This agrees with the expectation that, for a properly trained DNN model, the discriminative quality of a hidden layer's output increases with the height of the layer [2, 5, 21]. However, as revealed in the right plot of Fig. 9, the clicked points of layer 4 show no improvement in concentration and even look slightly more scattered. Recalling the very strong correlations among the neurons in layer 4 (Fig. 8), one may doubt whether the output of layer 4 is more predictive than that of layer 3. This issue is discussed further in the following subsections.

3.3 Probe Evaluations

To investigate the effectiveness of the hidden layers, we implement Alain and Bengio's probe approach [2]. A DNN model is expected to mine predictive patterns from the input features through layers of transformations and then feed the extracted information into the simple linear classifier at the output end. For each layer, we use the layer's output vector as the input features of a logistic regression (LR) model trained to predict CTR. The LR model serves as a probe to evaluate the usefulness of the hidden layer: a higher performance of the LR probe implies that the layer's transformation makes the information more predictive, and thus benefits the performance of the whole DNN model. The LR models are trained on the training set until convergence, with the DNN model fixed, and their performance is then evaluated on the test sets.

Figure 10: Test AUC scores of the probe LR models as a function of test dataset for three training steps: 100000, 210000, and 600000.

As shown in Fig. 10, for training step 210000, the probe performance increases from layer 1 to layer 3, indicating that these layers do transform the input information into a more predictive form. The probe performance for layer 4 is the same as for layer 3, indicating that layer 4 is not as useful as the previous three layers. This is consistent with the observations in the last subsection.

The change of AUC along each curve in Fig. 10 illustrates how each hidden layer reacts to the varying data distribution. At training step 210000, where the DNN model generalizes best, the effectiveness of all the layers varies as a function of dataset in the same pattern as the DNN model. In contrast, at training step 100000, where the DNN model is underfitting, layer 1 behaves differently from the other layers. Moreover, at step 600000, the DNN model overfits the training data such that the learned information transformations begin to fail on the test data; consequently, the performance of the probes is very low and fluctuates significantly.
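A probe of this kind can be sketched with scikit-learn: with the DNN frozen, a logistic regression is fitted on one hidden layer's dumped outputs and scored by AUC on each test day. The data-handling names below are our own assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_layer(H_train, y_train, test_sets):
    """H_train: one layer's outputs on the training set; test_sets: list of (H, y) pairs."""
    probe = LogisticRegression(max_iter=1000).fit(H_train, y_train)   # the LR probe [2]
    # AUC of the probe on each test day, i.e., one curve of Fig. 10.
    return [roc_auc_score(y, probe.predict_proba(H)[:, 1]) for H, y in test_sets]
```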
3.4 Feature Group Saliency

At the input end of the DNN model, we study how the input features influence the model using the back-propagated gradient signals [16]. The embedding output of the sparse feature ids (concatenated as h_0) can be treated as the input of the following deep neural network. With the model fixed, for each input instance, we calculate the gradient of the model's output P_ctr with respect to h_0:

g_0 = ∇_{h_0} P_ctr.  (3)

The magnitude of each element of the gradient vector g_0 quantifies the sensitivity of the model's output to a change in the corresponding embedding element, i.e., how much a small change in that embedding value would affect the final output P_ctr. Given a dataset, we calculate the saliency score of each feature group by averaging the mean absolute value of the corresponding 8 gradient elements of g_0 over the whole dataset. This saliency score provides an average measure of the model's sensitivity to each feature group on the given dataset.

We illustrate the saliency scores in Fig. 11. Overall, the model becomes increasingly sensitive to all the feature groups during training. In the overfitting regime, the score of feature group 10 rises dramatically and becomes much higher than those of the other feature groups. This feature group consists of user ids, whose number of distinct ids is larger than that of any other feature group by at least two orders of magnitude [9]. At this training stage, the model is trained to memorize a vast amount of information from the user ids that is not generalizable, which significantly deteriorates the performance on the test datasets.

Figure 11: Gradient-based saliency scores of the 34 feature groups for the training and test1 sets. Each bar represents a feature group.
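Given the per-instance gradients g_0 dumped from the model, the group-level saliency reduces to a simple aggregation. A sketch under our own shape and naming assumptions (G stacks g_0 row-wise for a whole dataset):

```python
import numpy as np

def feature_group_saliency(G, n_groups=34, emb_dim=8):
    """G: gradients of P_ctr w.r.t. h_0, shape (n_instances, n_groups * emb_dim)."""
    per_dim = np.abs(G).mean(axis=0)                         # mean |gradient| per embedding dim
    return per_dim.reshape(n_groups, emb_dim).mean(axis=1)   # one saliency score per group (Fig. 11)
```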
4 DISCUSSION

4.1 Role of Layer 4

The results on layer 4 raise the question of whether this layer is necessary at all. To answer it, we modify the neural network and investigate the impact on the performance of the retrained models. We reduce or increase the width of layer 4 by a factor of two, or even remove layer 4 from the model. It turns out that these modifications do not affect the models' performance (highest test AUCs) on the different test datasets. Although not harmful, including layer 4 in the DNN model brings no benefit.

4.2 Regularization

The analysis in the previous section reveals that the model becomes overly sensitive to the input when overfitting. Also, the high correlations among the neurons of layers 3 and 4 (Fig. 8) imply that there might be severe co-adaptations [25]. One may hope to use regularization to control overfitting and obtain better performance on the test data. We have tried L1 and L2 regularization [11] and dropout [25] with a variety of hyper-parameters; however, no improvement was obtained. In the future, more work needs to be done on improving the model's generalization power.

4.3 Feature Treatment

Subsection 3.4 discloses the problem that the model is highly sensitive to the user-id feature group when overfitting. Besides regularization, it is also possible to improve the model's generalization power by optimizing the input features. User id is a highly granular feature group, and feeding it directly into the embedding-based deep neural network may not be the optimal choice. Following the idea of Wide&Deep [6], we remove the user id from the embedding layer. The bias of each user id is instead represented by a single float number b_user and added directly to the output layer:

P_ctr = Sigmoid(W_5 h_4 + b_5 + b_user).  (4)

This bias is trained jointly with the other parts of the model. We find that this approach improves the AUC on the test datasets by about 0.1%.
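A minimal sketch of this Wide&Deep-style treatment, under our own naming assumptions: the user id no longer contributes an embedding to h_0 (the deep part then uses the remaining 33 feature groups), and only a learned scalar bias enters the output logit of Eq. (4).

```python
import numpy as np

user_bias = {}   # user id -> learned scalar b_user, trained jointly with the model

def predict_ctr_with_user_bias(deep_logit, user_id):
    """deep_logit: W_5 h_4 + b_5 computed from the remaining 33 embedded feature groups."""
    z = deep_logit + user_bias.get(user_id, 0.0)   # Eq. (4); unseen users fall back to 0
    return 1.0 / (1.0 + np.exp(-z))
```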
5 APPLICATIONS

With the visualization and analysis techniques presented above, we discuss some practical applications in this section.

• The distribution of the predicted CTR score is very important for real-time bidding auctions. Understanding the score distribution can help us design better calibration methods [13, 19]. The score distribution can also help find outliers or badly fitted samples, which can in turn be used to improve the model.

• Inspections of the model's inner status and gradient signals open up the "black box" of the DNN model, helping us understand the mechanism of the model and the influence of the features. These approaches can be used to diagnose the model, including (but not limited to) underfitting/overfitting, gradient vanishing/explosion, and ineffective model structures. A deep understanding of the model's mechanism can help us design better model structures, training algorithms, and features.

• For online advertising, it is of great importance to monitor the model's online performance and the health of the data pipeline; feeding the model with problematic data can cause disasters. However, it is very difficult to describe and monitor the distribution of extremely sparse and high-dimensional data. Moreover, monitoring the model's online performance may not be sufficient: the model predicts CTRs for hundreds of candidate ads in each bidding, while only very few ads win the bidding and receive impression feedback. The classic performance metrics are mainly based on this feedback and thus cover only a limited portion of biased data. The DNN model, by nature, transforms the sparse input data into dense numerical representations. Therefore, the statistics of the neurons' outputs and the gradient signals can serve as a new kind of metric to monitor the distribution of the input data; note that no feedback labels are needed to calculate these quantities. For example, as illustrated in Fig. 7, the average standard deviation of layer 3's output changes with the naturally varying distribution of the input data, and problematic input data would cause a more significant change in these statistics.

6 CONCLUSION

In this work, we visualize and analyze a simple DNN model for CTR prediction down to the neuron level. Model training and evaluation are performed over a series of datasets, and the model is inspected from the output end to the input end. The statuses of the neurons are studied with a variety of methods, and the gradients of the feature embeddings are used to create a saliency map describing the influence of the feature groups. The analysis provides insightful knowledge of the model's mechanism, helping us monitor, diagnose, and refine the model.

Currently, we are applying these approaches to build a model-based evaluation and monitoring system for our online advertising platform. Based on our industrial scenario, future work will focus on exploring more approaches to interpreting deep learning, investigating more complex algorithms, and applying these approaches to design better models and algorithms.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016). https://www.tensorflow.org/
[2] Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016).
[3] Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2017. Explaining recurrent neural network predictions in sentiment analysis. arXiv preprint arXiv:1706.07206 (2017).
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[5] Yoshua Bengio et al. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
[6] Heng-Tze Cheng and Levent Koc. 2016. Wide & deep learning for recommender systems. In Proceedings of the ACM 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the ACM Conference on Recommender Systems. 191–198.
[8] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
[9] Tiezheng Ge, Liqin Zhao, Guorui Zhou, Keyu Chen, Shuying Liu, Huiming Yi, Zelin Hu, Bochao Liu, Peng Sun, Haoyu Liu, et al. 2017. Image Matters: Jointly Train Advertising CTR Model with Image Representation of Ad and User Behavior. arXiv preprint arXiv:1711.06505 (2017).
[10] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research 9 (2010), 249–256.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[12] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale Bayesian Click-through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In Proceedings of the 27th International Conference on Machine Learning (ICML'10). Omnipress, USA, 13–20.
[13] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 1–9.
[14] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078 (2015).
[15] Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In International Conference on Machine Learning. 1885–1894.
[16] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and Understanding Neural Models in NLP. arXiv preprint arXiv:1506.01066 (2016).
[17] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[18] Aravindh Mahendran and Andrea Vedaldi. 2016. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision 120, 3 (2016), 233–255.
[19] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1222–1230.
[20] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 1–18.
[21] Paulo E. Rauber, Samuel G. Fadel, Alexandre X. Falcao, and Alexandru C. Telea. 2017. Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 101–110.
[22] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. 2018. On the Information Bottleneck Theory of Deep Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=ry_WPG-A-
[23] Ying Shan and T. Ryan Hoens. 2016. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining.
[24] Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 (2017).
[25] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[26] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
[27] Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng, and Shiyue Zhang. 2017. Memory visualization for gated recurrent neural networks in speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[28] Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW). 1–5. https://doi.org/10.1109/ITW.2015.7133169
[29] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 818–833.