Visualizing and Understanding Deep Neural Networks in CTR Prediction

Lin Guo, Hui Ye, Wenbo Su, Henhuan Liu, Kai Sun, Hang Xiang
Alibaba Group

ABSTRACT

Although deep learning techniques have been successfully applied to many tasks, interpreting deep neural network models is still a big challenge. Recently, much work has been done on visualizing and analyzing the mechanism of deep neural networks in the areas of image processing and natural language processing. In this paper, we present our approaches to visualizing and understanding deep neural networks for a very important commercial task: CTR (click-through rate) prediction. We conduct experiments on production data from our online advertising system, whose distribution varies daily. To understand the mechanism and the performance of the model, we inspect the model's inner status at the neuron level. A probe approach is also implemented to measure the layer-wise performance of the model. Moreover, to measure the influence of the input features, we calculate saliency scores based on the back-propagated gradients. Practical applications are also discussed, for example, in understanding, monitoring, diagnosing, and refining models and algorithms.

ACM Reference format:
Lin Guo, Hui Ye, Wenbo Su, Henhuan Liu, Kai Sun, and Hang Xiang. 2018. Visualizing and Understanding Deep Neural Networks in CTR Prediction. In Proceedings of ACM SIGIR Workshop on eCommerce, Ann Arbor, Michigan, USA, July 2018 (SIGIR 2018 eCom), 7 pages.

1 INTRODUCTION

Click-through rate (CTR) prediction plays a crucial role in computational advertising. In the common cost-per-click advertising system, advertisements are ranked by the product of the bid price and the predicted CTR when bidding for impression opportunities. Therefore, the revenue of this multi-billion-dollar business relies heavily on the performance of the CTR prediction model.

Deep learning techniques have been successfully applied to CTR prediction tasks [6, 7, 23]. Deep neural networks (DNNs), composed of stacked layers of neurons, are able to extract nonlinear patterns from features and thus reduce the burden of nontrivial feature engineering. However, the working mechanisms of deep learning models are still not well understood. The lack of interpretability has become an obstacle for deep learning and raises concerns about the reliability of deep learning applications, especially for critical industrial deployments.

Much recent progress has been made in visualizing and interpreting deep learning models for image processing [15, 18, 20, 21, 26, 29] and natural language processing [3, 4, 14, 16, 27]. In this paper, we present a series of approaches to visualize and analyze a simple DNN model for CTR prediction on production data from our search advertising platform. The model's performance decay is investigated over datasets with daily varying distributions, and the distributions of the output scores are compared across training stages. We inspect the model's inner status down to the neuron level. We study the statistical properties of the neurons' statuses in the hidden layers and investigate the high-level representations learned by the model through t-SNE projection [17, 21]. A probe method [2] is applied to dissect the model's performance layer by layer on different datasets. Moreover, to measure the influence of the input features, we calculate saliency scores for the feature groups based on back-propagated gradients.

Beyond the classic model evaluation metrics [11, 12], we open up the "black box" and inspect the DNN model from the output end to the input end. Understanding the model's mechanism helps us not only design and diagnose models, but also monitor the algorithmic advertising system in daily production.

2 EXPERIMENTAL SETTING

2.1 Datasets

We perform experiments on production CTR prediction data from the search advertising platform of our company.
Starting from a typical Wednesday, our data are collected over eight consecutive days. The training set is sampled from day one. To investigate the decay of the model's performance, we evaluate the model on a daily basis from day one to day eight. The eight test sets are denoted, in turn, by test1, test2, ..., test8. Each dataset contains about 150 million instances randomly sampled from the ad impression logs of the corresponding day. Note that there is no overlap between test1 and the training set. This setup simulates the real-world environment of the CTR prediction task: the model is trained on historical data and deployed to serve future online traffic, whose distribution naturally varies and differs from that of the training data.

Our data contains 34 groups of sparse categorical features (around 100 million binary features in total), e.g., user id, user's city, user's gender, user's age level, query id, query words, shop id, ad's category, etc. Note that there are no combinatorial features in this study.

2.2 Model setting

The DNN model contains four fully connected hidden layers. From layer 1 (closest to the input) to layer 4 (right before the output layer), the layer widths are 256, 128, 64, and 32 neurons. The output vector of the k-th hidden layer, denoted by h_k, is computed as

h_k = ReLU(W_k h_{k-1} + b_k),  (1)

where W_k is the weight matrix of all the connections from the neurons of layer (k-1), b_k is the bias term, and the ReLU (rectified linear unit) function is used as the activation function. The output layer uses a sigmoid function to map the output to a number between 0 and 1 as the predicted probability of click:

P_ctr = Sigmoid(W_5 h_4 + b_5).  (2)

During training, P_ctr is compared against the ground-truth label and the cross entropy is used as the loss function. For each input instance, the sparse feature ids are embedded into 8-dimensional float vectors [6, 7, 23]. For feature groups containing multiple feature ids per instance, e.g., query words, sum pooling is applied so that each feature group produces a single 8-dimensional embedding vector. The embedding outputs are concatenated into a 272-dimensional vector, denoted by h_0, as the input to layer 1. The embedding vectors are trained jointly with the other parts of the model.

The experiments are run on distributed TensorFlow [1] released by Google. The model is trained with the Adagrad optimizer [8] with learning rate 0.005, initial accumulator value 0.0001, and mini-batch size 1000. Glorot and Bengio's method [10] is used for initialization. We visualize the model's inner status by dynamically dumping the intermediate data based on the model graph.
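To make the architecture above concrete, the following is a minimal NumPy sketch of the forward pass of Eqs. (1) and (2): sum-pooled 8-dimensional embeddings for the 34 feature groups are concatenated into the 272-dimensional h_0 and passed through the four ReLU layers and the sigmoid output. The toy embedding table, random initialization, and function names are our own illustrative assumptions, not the production implementation.

```python
import numpy as np

N_GROUPS, EMB_DIM = 34, 8                        # 34 feature groups, 8-dim embeddings
WIDTHS = [N_GROUPS * EMB_DIM, 256, 128, 64, 32]  # h_0 is 272-dimensional

rng = np.random.default_rng(0)
emb_table = rng.normal(scale=0.01, size=(100_000, EMB_DIM))  # toy embedding table
W = [rng.normal(scale=0.01, size=(WIDTHS[k + 1], WIDTHS[k])) for k in range(4)]
b = [np.zeros(WIDTHS[k + 1]) for k in range(4)]
w_out, b_out = rng.normal(scale=0.01, size=32), 0.0

def predict_ctr(group_ids):
    """group_ids: 34 lists of sparse feature ids, one list per feature group."""
    # Sum-pool each group's embeddings into one 8-dim vector, then concatenate
    # the 34 pooled vectors into the 272-dim input h_0.
    h = np.concatenate([emb_table[ids].sum(axis=0) for ids in group_ids])
    for W_k, b_k in zip(W, b):                   # four hidden layers, Eq. (1)
        h = np.maximum(W_k @ h + b_k, 0.0)       # ReLU(W_k h_{k-1} + b_k)
    return 1.0 / (1.0 + np.exp(-(w_out @ h + b_out)))   # sigmoid output, Eq. (2)

p = predict_ctr([[k] for k in range(N_GROUPS)])  # one toy instance, one id per group
```

In the production model, of course, these parameters and the embedding vectors are learned jointly by minimizing the cross-entropy loss, as described above.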
3 RESULTS

3.1 AUC and Prediction Score

To measure the performance of the model, we employ AUC (the area under the receiver operating characteristic curve) as the key metric. AUC is a widely used measure for evaluating CTR performance [12].

In Fig. 1, we present the evolution of the model's AUC as a function of the training step for the training and test sets. As training proceeds, the train AUC keeps growing, while all the test AUCs follow the same pattern: they first rise and then decrease due to overfitting. The model generalizes best at step 210000. Comparing the eight test AUCs at the same training step reveals the model's performance decay as a function of dataset. The test AUC decreases monotonically from day one to day five. As expected, this is because the distribution of the test data differs from that of the training set, and the difference grows day by day. After that, the AUC rises again for the last three days and surpasses day four. This is in accordance with a characteristic of our business scenario: although the data vary from day to day, the users' behaviors on our website follow weekly periodic patterns. This non-monotonic change of AUC is evident in the regime from underfitting to weak overfitting (before step ~400000). At larger training steps, overfitting becomes severe and the model performs equally poorly on the last five days.

Figure 1: AUC score as a function of training step for the training and test sets.

Fig. 2 provides insights into the distribution of the predicted CTR scores for the training, test1, and test5 sets. At training step 210000, the AUC decay from the training set to test1 is mainly because the CTRs of the positive (clicked) samples in test1 are more under-predicted by the model. The further decay from test1 to test5 is mainly because the negative (non-clicked) samples in test5 tend to be predicted with higher CTRs (the training and test1 curves overlap for the negative samples and can hardly be distinguished by eye). At training step 600000, the model overfits the training data such that it aggressively predicts the CTR towards zero for both clicked and non-clicked samples. This is attributed to the high skewness of the data: the proportion of clicked samples is lower than 10%, so under-predicting the CTR for all samples may still reduce the training loss. This shape of the distribution changes significantly as the data become different: the scores move rightwards and the distribution becomes blurred.

Figure 2: Distribution of the predicted CTR for the models at training step 210000 and step 600000. The x-axis denotes the predicted CTR normalized by the average click ratio of the training set.
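The daily evaluation and the score distributions above can be reproduced with standard tooling. The sketch below, under our own naming assumptions, computes the AUC of one test day with scikit-learn together with the histograms of the predicted CTR normalized by the training set's average click ratio (the x-axis of Fig. 2).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_day(labels, scores, train_click_ratio, bins=50):
    """labels: 0/1 click labels of one test day; scores: the model's predicted CTRs."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    auc = roc_auc_score(labels, scores)              # the metric plotted in Fig. 1
    normalized = scores / train_click_ratio          # x-axis of Fig. 2
    pos_hist, edges = np.histogram(normalized[labels == 1], bins=bins, density=True)
    neg_hist, _ = np.histogram(normalized[labels == 0], bins=edges, density=True)
    return auc, (edges, pos_hist, neg_hist)
```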
3.2 Neuron Status

In this subsection, we investigate the statistics of the neurons' statuses at different training stages and on different datasets. These statistical properties depict the model's representation of the input data and help us interpret the model's performance and working mechanism.

Figure 3: Mean outputs of the neurons in layer 3 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

Figure 4: Mean outputs of the neurons in layer 4 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

Figure 5: Standard deviations of the outputs of the neurons in layer 3 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

Figure 6: Standard deviations of the outputs of the neurons in layer 4 for the training and test1 sets, at training steps 100000, 210000, 300000, and 600000. Each bar represents a neuron.

The mean outputs of the neurons within layers 3 and 4 are illustrated in Figs. 3 and 4, respectively. Correspondingly, the standard deviations of the neurons' outputs are plotted in Figs. 5 and 6. For steps 100000 and 210000, the results are quite close between the underfitting and well-fitting stages; about a quarter of the neurons are barely activated. Significant changes are observed in the overfitting regime (step > 300000): more neurons become activated, and the difference between the training and test sets grows with the degree of overfitting, especially in the standard deviation (Figs. 5 and 6). The higher standard deviation on the training set indicates that the neurons become overly sensitive to the input of the training data. Fig. 7 presents the standard deviation averaged over all 64 neurons of layer 3 as a function of dataset. For all three training stages, the trend of the average standard deviation correlates with the model's AUC score (Fig. 1).

Figure 7: Average standard deviation of the neurons' outputs in layer 3 as a function of dataset, at training steps 100000, 210000, and 600000. The output's standard deviation is first calculated for each neuron and then averaged over all 64 neurons of layer 3.
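These per-neuron statistics are straightforward to compute once the layer outputs are dumped from the model graph. A minimal sketch, assuming H holds one hidden layer's post-activation outputs over a dataset (our own array name and layout):

```python
import numpy as np

def neuron_statistics(H):
    """H: post-activation outputs of one hidden layer, shape (n_instances, n_neurons)."""
    mean_out = H.mean(axis=0)   # per-neuron mean output (Figs. 3 and 4)
    std_out = H.std(axis=0)     # per-neuron standard deviation (Figs. 5 and 6)
    avg_std = std_out.mean()    # averaged over the layer's neurons (Fig. 7)
    return mean_out, std_out, avg_std
```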
The output’s standard deviation is first After step 210000, the neurons’ correlation deceases monotoni- calculated for each neuron, and then averaged over all the cally with training step for all hidden layers. Recalling the enhanced 64 neurons of layer 3. neuron activation observed for this overfitting regime (Figs. 3 and 4), we can interpret that the model starts to explore more predictive The mean outputs of the neurons within layer 3 and 4 are illus- patterns from the input information. However, the deceasing test trated in Figs. 3 and 4, respectively. Correspondingly, the standard AUC (Fig. 1) reveals that the boosted representation of the input deviation of the neurons’ outputs are plotted in Figs. 5 and 6. For from training data can not be well generalized to predict the test step 100000 and 210000, the results are quite close between the data. underfitting and well-fitting stages. About a quarter of the neu- In order to inspect the spacial structure of the high-level rep- rons are barely activated. Significant changes are observed for the resentations for the input data, we project the neurons’ output overfitting regime (step > 300000). More neurons become activated. vectors to 2-dimensional space using t-SNE method [17, 21]. The Also, the difference between the training and test sets grows with t-SNE projection is able to preserve neighborhoods and clusters the degree of overfitting, especially in the standard deviation (Figs. of the data points in the original representation space. In Fig. 9, 5 and 6). The higher standard deviation on the training set indi- we illustrate the projection results for layer 2, 3 and 4 at training cates that the neurons become over sensitive to the input of the step 210000. The presented 10000 clicked and 10000 non-clicked training data. Fig. 7 presents the variation of the standard deviation instances are randomly selected from the training set. averaged over all the 64 neurons of layer 3 as a function of dataset. For layer 3 (the center plot in Fig. 9), we can clearly see the For all the three different training stages, the trend of the average regions with concentrated clicked points. We find that the training standard deviation correlates with the model’s AUC score (Fig. 1). process enhances the concentration of clicked points for the training To gain more knowledge about the collaborative patterns of set, indicating that the model learns more discriminative represen- neurons inside the model [21, 26], for each layer, we calculate the tation for the training data. For the test datasets, we observe that correlations among the neurons. Neurons’ statuses before activation the concentrated distribution disappears when overfitting happens. are used. We measure the average degree of neurons’ correlations Unlike the case of image classification in Ref. [21], no class separa- by averaging the absolute value of all the correlation coefficients tion is observed even at severely overfitting stage. This is mainly for each layer. The average strength of correlations is plotted as a due to the highly noisy and skewed data for the CTR prediction function of training step in Fig. 8. The degree of correlation climbs task. Visualizing and Understanding Deep Neural Networks in CTR Prediction SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA Figure 9: Visualization of the output vectors for layer 2, 3 and 4 using t-SNE method, at training step 210000. 
Turning to the left plot in Fig. 9, the concentration of clicked points in layer 2 is clearly worse than in layer 3. This agrees with the expectation that, for a properly trained DNN model, the discriminative quality of a hidden layer's output increases with the height of the layer [2, 5, 21]. However, as revealed in the right plot of Fig. 9, the clicked points of layer 4 show no improvement in concentration and even look slightly more scattered. Recalling the very strong correlations among the neurons in layer 4 (Fig. 8), one may doubt whether the output of layer 4 is more predictive than that of layer 3. This issue is discussed further in the following subsections.

3.3 Probe Evaluations

To investigate the effectiveness of the hidden layers, we implement Alain and Bengio's probe approach [2]. A DNN model is expected to mine predictive patterns from the input features through layers of transformations and then feed the extracted information into the simple linear classifier at the output end. For each layer, we use the layer's output vector as the input features of a logistic regression (LR) model trained to predict CTR. The LR model serves as a probe to evaluate the usefulness of the hidden layer: a higher performance of the LR probe implies that the layer's transformation makes the information more predictive, and thus benefits the performance of the whole DNN model. The LR models are trained on the training set until convergence, with the DNN model fixed, and their performance is then evaluated on the test sets.

Figure 10: Test AUC scores of the probe LR models as a function of test dataset for three training steps: 100000, 210000, and 600000.

As shown in Fig. 10, for training step 210000, the probe performance increases from layer 1 to layer 3, indicating that these layers do transform the input information into a more predictive form. The probe performance for layer 4 is the same as for layer 3, indicating that layer 4 is not as useful as the previous three layers. This is consistent with the observations in the last subsection.

The change of AUC along each curve in Fig. 10 illustrates how each hidden layer reacts to the varying data distribution. At training step 210000, where the DNN model generalizes best, the effectiveness of all the layers varies as a function of dataset in the same pattern as the DNN model. In contrast, at training step 100000, where the DNN model is underfitting, layer 1 behaves differently from the other layers. Moreover, at step 600000, the DNN model overfits the training data such that the learned information transformations begin to fail on the test data; consequently, the performance of the probes is very low and fluctuates significantly.
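A probe of this kind can be sketched with scikit-learn: with the DNN frozen, a logistic regression is fitted on one hidden layer's dumped outputs and scored by AUC on each test day. The data-handling names below are our own assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_layer(H_train, y_train, test_sets):
    """H_train: one layer's outputs on the training set; test_sets: list of (H, y) pairs."""
    probe = LogisticRegression(max_iter=1000).fit(H_train, y_train)   # the LR probe [2]
    # AUC of the probe on each test day, i.e., one curve of Fig. 10.
    return [roc_auc_score(y, probe.predict_proba(H)[:, 1]) for H, y in test_sets]
```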
3.4 Feature Group Saliency

At the input end of the DNN model, we study how the input features influence the model using the back-propagated gradient signals [16]. The embedding output of the sparse feature ids (concatenated as h_0) can be treated as the input of the following deep neural network. With the model fixed, for each input instance, we calculate the gradient of the model's output P_ctr with respect to h_0:

g_0 = ∇_{h_0} P_ctr.  (3)

The magnitude of each element of the gradient vector g_0 quantifies the sensitivity of the model's output to a change in the corresponding embedding element, i.e., how much a small change in that embedding value would affect the final output P_ctr. Given a dataset, we calculate the saliency score of each feature group by averaging the mean absolute value of the corresponding 8 gradient elements of g_0 over the whole dataset. This saliency score provides an average measure of the model's sensitivity to each feature group on the given dataset.

We illustrate the saliency scores in Fig. 11. Overall, the model becomes increasingly sensitive to all the feature groups during training. In the overfitting regime, the score of feature group 10 rises dramatically and becomes much higher than those of the other feature groups. This feature group consists of user ids, whose number of distinct ids is larger than that of any other feature group by at least two orders of magnitude [9]. At this training stage, the model is trained to memorize a vast amount of information from the user ids that is not generalizable, which significantly deteriorates the performance on the test datasets.

Figure 11: Gradient-based saliency scores of the 34 feature groups for the training and test1 sets. Each bar represents a feature group.
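Given the per-instance gradients g_0 dumped from the model, the group-level saliency reduces to a simple aggregation. A sketch under our own shape and naming assumptions (G stacks g_0 row-wise for a whole dataset):

```python
import numpy as np

def feature_group_saliency(G, n_groups=34, emb_dim=8):
    """G: gradients of P_ctr w.r.t. h_0, shape (n_instances, n_groups * emb_dim)."""
    per_dim = np.abs(G).mean(axis=0)                         # mean |gradient| per embedding dim
    return per_dim.reshape(n_groups, emb_dim).mean(axis=1)   # one saliency score per group (Fig. 11)
```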
4 DISCUSSION

4.1 Role of Layer 4

The results on layer 4 raise the question of whether this layer is necessary at all. To answer it, we modify the neural network and investigate the impact on the performance of the retrained models. We reduce or increase the width of layer 4 by a factor of two, or even remove layer 4 from the model. It turns out that these modifications do not affect the models' performance (highest test AUCs) on the different test datasets. Although not harmful, including layer 4 in the DNN model brings no benefit.

4.2 Regularization

The analysis in the previous section reveals that the model becomes overly sensitive to the input when overfitting. Also, the high correlations among the neurons of layers 3 and 4 (Fig. 8) imply that there might be severe co-adaptations [25]. One may hope to use regularization to control overfitting and obtain better performance on the test data. We have tried L1 and L2 regularization [11] and dropout [25] with a variety of hyper-parameters; however, no improvement was obtained. In the future, more work needs to be done on improving the model's generalization power.

4.3 Feature Treatment

Subsection 3.4 discloses the problem that the model is highly sensitive to the user-id feature group when overfitting. Besides regularization, it is also possible to improve the model's generalization power by optimizing the input features. User id is a highly granular feature group, and feeding it directly into the embedding-based deep neural network may not be the optimal choice. Following the idea of Wide&Deep [6], we remove the user id from the embedding layer. The bias of each user id is instead represented by a single float number b_user and added directly to the output layer:

P_ctr = Sigmoid(W_5 h_4 + b_5 + b_user).  (4)

This bias is trained jointly with the other parts of the model. We find that this approach improves the AUC on the test datasets by about 0.1%.
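A minimal sketch of this Wide&Deep-style treatment, under our own naming assumptions: the user id no longer contributes an embedding to h_0 (the deep part then uses the remaining 33 feature groups), and only a learned scalar bias enters the output logit of Eq. (4).

```python
import numpy as np

user_bias = {}   # user id -> learned scalar b_user, trained jointly with the model

def predict_ctr_with_user_bias(deep_logit, user_id):
    """deep_logit: W_5 h_4 + b_5 computed from the remaining 33 embedded feature groups."""
    z = deep_logit + user_bias.get(user_id, 0.0)   # Eq. (4); unseen users fall back to 0
    return 1.0 / (1.0 + np.exp(-z))
```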
5 APPLICATIONS

With the visualization and analysis techniques presented above, we discuss some practical applications in this section.

• The distribution of the predicted CTR score is very important for real-time bidding auctions. Understanding the score distribution can help us design better calibration methods [13, 19]. The score distribution can also help find outliers or badly fitted samples, which can in turn be used to improve the model.

• Inspections of the model's inner status and gradient signals open up the "black box" of the DNN model, helping us understand the mechanism of the model and the influence of the features. These approaches can be used to diagnose the model, including (but not limited to) underfitting/overfitting, gradient vanishing/explosion, and ineffective model structures. A deep understanding of the model's mechanism can help us design better model structures, training algorithms, and features.

• For online advertising, it is of great importance to monitor the model's online performance and the health of the data pipeline; feeding the model with problematic data can cause disasters. However, it is very difficult to describe and monitor the distribution of extremely sparse and high-dimensional data. Moreover, monitoring the model's online performance may not be sufficient: the model predicts CTRs for hundreds of candidate ads in each bidding, while only very few ads win the bidding and receive impression feedback. The classic performance metrics are mainly based on this feedback and thus cover only a limited portion of biased data. The DNN model, by nature, transforms the sparse input data into dense numerical representations. Therefore, the statistics of the neurons' outputs and the gradient signals can serve as a new kind of metric to monitor the distribution of the input data; note that no feedback labels are needed to calculate these quantities. For example, as illustrated in Fig. 7, the average standard deviation of layer 3's output changes with the naturally varying distribution of the input data, and problematic input data would cause a more significant change in these statistics.

6 CONCLUSION

In this work, we visualize and analyze a simple DNN model for CTR prediction down to the neuron level. Model training and evaluation are performed over a series of datasets, and the model is inspected from the output end to the input end. The statuses of the neurons are studied with a variety of methods, and the gradients of the feature embeddings are used to create a saliency map describing the influence of the feature groups. The analysis provides insightful knowledge of the model's mechanism, helping us monitor, diagnose, and refine the model.

Currently, we are applying these approaches to build a model-based evaluation and monitoring system for our online advertising platform. Based on our industrial scenario, future work will focus on exploring more approaches to interpreting deep learning, investigating more complex algorithms, and applying these approaches to design better models and algorithms.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016). https://www.tensorflow.org/
[2] Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016).
[3] Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2017. Explaining recurrent neural network predictions in sentiment analysis. arXiv preprint arXiv:1706.07206 (2017).
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[5] Yoshua Bengio et al. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
[6] Heng-Tze Cheng and Levent Koc. 2016. Wide & deep learning for recommender systems. In Proceedings of the ACM 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the ACM Conference on Recommender Systems. 191–198.
[8] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
[9] Tiezheng Ge, Liqin Zhao, Guorui Zhou, Keyu Chen, Shuying Liu, Huiming Yi, Zelin Hu, Bochao Liu, Peng Sun, Haoyu Liu, et al. 2017. Image Matters: Jointly Train Advertising CTR Model with Image Representation of Ad and User Behavior. arXiv preprint arXiv:1711.06505 (2017).
[10] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research 9 (2010), 249–256.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[12] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale Bayesian Click-through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In Proceedings of the 27th International Conference on Machine Learning (ICML'10). Omnipress, USA, 13–20.
[13] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 1–9.
[14] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078 (2015).
[15] Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In International Conference on Machine Learning. 1885–1894.
[16] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and Understanding Neural Models in NLP. arXiv preprint arXiv:1506.01066 (2016).
[17] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[18] Aravindh Mahendran and Andrea Vedaldi. 2016. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision 120, 3 (2016), 233–255.
[19] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1222–1230.
[20] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 1–18.
[21] Paulo E. Rauber, Samuel G. Fadel, Alexandre X. Falcao, and Alexandru C. Telea. 2017. Visualizing the hidden activity of artificial neural networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 101–110.
[22] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. 2018. On the Information Bottleneck Theory of Deep Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=ry_WPG-A-
[23] Ying Shan and T. Ryan Hoens. 2016. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining.
[24] Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 (2017).
[25] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[26] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
[27] Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng, and Shiyue Zhang. 2017. Memory visualization for gated recurrent neural networks in speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[28] Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW). 1–5. https://doi.org/10.1109/ITW.2015.7133169
[29] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 818–833.