Comprehensive Audience Expansion based on End-to-End
                        Neural Prediction
                              Jinling Jiang                                                             Xiaoming Lin
                       MiningLamp Technology.                                                     MiningLamp Technology.
                             Beijing, China                                                            Beijing, China
                    jiangjinling@mininglamp.com                                                linxiaoming@mininglamp.com

                                Junjie Yao                                                                   Hua Lu
                    East China Normal University.                                Department of Computer Science, Aalborg University.
                           Shanghai, China                                                       Aalborg, Denmark
                     junjie.yao@sei.ecnu.edu.cn                                                  luhua@cs.aau.dk

ABSTRACT                                                                            needs of the consumer. As the development of e-commerce plat-
In current online advertising applications, look-alike methods are                  forms has introduced SMEs (Small and medium-sized enterprises)
valuable and commonly used to identify new potential users, tack-                   to enter consumers’ sight, large enterprise advertisers face the crisis
ling the difficulties of audience expansion. However, the demo-                     of slowing business growth and falling revenue. Therefore, brand
graphic information and a variety of user behavior logs are high                    advertisers have begun to pay more attention to the contribution of
dimensional,noisy, and increasingly complex, which are challenging                  advertising to sales conversion, the actual revenue brought by ad-
to extract suitable user profiles. Usually, rule-based and similarity-              vertising,  requiring advertising agencies and third-party suppliers
based approaches are proposed to profile the users’ interests and                   to provide more refined performance data of advertising effects.
expand the audience. However, they are specific and limited in                         Meanwhile, the emergence of big data technology has subverted
more complex scenarios.                                                             the  operation model of the entire advertising industry and the
    In this paper, we propose a new end-to-end solution, unifying                   traditional  way of evaluating advertising effects. By tracking and
the feature extraction and profile prediction stages. Specifically,                 obtaining user behavior data, a third-party supplier of advertising
we present a neural prediction framework and leverage it with the                   monitor can analyze the data according to the advertiser needs, not
intuitive audience feature extraction stages. We conduct extensive                  only understanding the communication effects and sales conversion
study on a real and large advertisement dataset. The results demon-                 rate generated by the advertisement in time but also predicting
strate the advantage of the proposed approach, not only in accuracy                 the user conversion probability to some extent. Through analysis
but also generality.                                                                and modeling on massive data of user behavior, advertisers can
                                                                                    accurately reach the target consumer. Therefore, how to better
CCS CONCEPTS                                                                        utilize the advertising monitor data in order to optimize ad serving
                                                                                    and improve marketing conversion rate has become an important
• Information systems → Online Advertising; • Human-centered
                                                                                    issue.
computing → User Models; • Theory of computation → Com-
                                                                                       One of the main challenges in ad serving is how to find the best
putational Advertising theory; • Computing methodologies →
                                                                                    converting prospects. A typical way is to do audience expansion,
Factorization methods;
                                                                                    that is, to identify and reach new audiences with similar interests
KEYWORDS                                                                            to the original target audience. Usually, the methodology used in
                                                                                    audience expansion problem is called look-alike modeling. Given a
Online Advertising; Audience Expansion; Lookalike Modeling                          seed user set S from a universal set U , look-alike models essentially
ACM Reference Format:                                                               find groups of audiences from U − S who look and act like the
Jinling Jiang, Xiaoming Lin, Junjie Yao, and Hua Lu. 2019. Comprehensive            audience in S.
Audience Expansion based on End-to-End Neural Prediction. In Proceedings               The data flow of audience expansion service is illustrated in
of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 8 pages.                 Figure 1. The data runs between advertisers and our universal
                                                                                    advertising monitor system across different media platforms. The
1 INTRODUCTION                                                                      original users come from the advertiser’s CRM System selecting
The remarkable growth of online advertisement enables the                           the consumers who recently exercise the purchase actions. Then
 ad-vertisers to sync up their products according to the fast-                      the users who are tracked by the universal advertising monitor will
 changing                                                                           be matched and treated as "seed" users.
Copyright © 2019 by the paper’s authors. Copying permitted for private and academic    In this paper, we build up a closed-loop data solution for brand
purposes.                                                                           advertisers and combines multiple techniques of selecting negative
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.):                      samples and extracting features, as well as machine learning looka-
Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at
http://ceur-ws.org                                                                  like models to reach the targeted audience. Which greatly enhances
                                                                                    the conversion effect of ad serving. Based on the "seed" users and
                                                                                    universal user set from advertising monitor, we build a lookalike
SIGIR 2019 eCom, July 2019, Paris, France                                                                                       J. Jiang et al.


                                                 Figure 1: Audience Expansion Dataflow


model to predict the probability to be the target audience for all           • We conduct extensive and effective experiments to extract
users. Afterward, according to the advertising budget, lookalike               negative samples from unlabeled data.
model will yield the corresponding number of expanded users to be            • We prove the effectiveness of the proposed lookalike models
reached through ad serving system. Finally, the ad serving perfor-             in an online environment.
mance is evaluated by advertiser’s site monitor system that record           The rest of the paper is organized as follows. In Section 2, we
sales conversion shortly.                                                review the related work on various kinds of look-alike models and
   But both traditional and current look-alike strategies for an ad-     illustrate different design philosophy behind them.
vertiser to look for the target audience are mainly based on user            Section 3.3 gives out the formal problem statement and specifies
demographics. There are two main problems with demographics-             the notations used in the paper. We then introduces our proposed
based audience segmentation: user demographics (age, gender, and         lookalike models and Section 3.4 reveals the sampling strategies.
geographical location) itself is not precise as it is estimated via      The evaluation of the algorithm is presented in Section 4. Finally
various statistical methods or machine learning models based on          the conclusion and future work are discussed in Section 5.
a small group of surveyed samples (10-100 thousand); the number
of users that are specified by demographics is large, more sophisti-
cated screening is required. Accordingly, the details of user behavior
                                                                         2   RELATED WORKS
data should be harnessed in machine learning models to target ac-        We briefly review the related literature of look-alike modeling. Gen-
curate audience segment. At the same time, there are two main            erally in online user-targeted advertising areas, look-alike modeling
problems that need to be solved based on user behavior data model-       which supports audience expansion system can be categorized in
ing: user-generated behavior data through the Internet is generally      three lines: rule-based, similarity-based and model-based.
high-dimensional and sparse; advertisers usually can only provide            Rule-based approaches focus on explicit positioning, where
positive samples, while negative samples need to be carefully picked     users with specific demographic tags (age, gender, geography) or
up from a substantial unlabeled sample set.                              interests are targeted directly for advertiser. The core technical
   Besides, the ecologically closed Internet tycoons (represented        support in the background is user profile mining, which means,
by Facebook, Amazon, Tencent, Alibaba and etc.) provide the ad-          the interest tags are inferred from the user behaviour [20][27]. Fur-
vertisers the capability to perform audience expansion within their      thermore, Mangalampalli et al. [17] builds a rule-based associative
own platforms. However, ad serving data of these platforms are           classifier for campaigns with less conversion; Shen et al. [24] and
not connected with the advertiser’s CRM (Customer Relationship           Liu et al. [14] present detailed in-depth analysis of multiple meth-
Management) system. Thus, it is difficult to directly track the real     ods under different considerations(such as similarity, performance,
conversion rate. In order to verify that lookalike models based          whether or not campaign-agnostic) for online social network ad-
on the user behaviour work better than traditional demographics-         vertising. The main disadvantage of rule-based look-alike modeling
based approach regarding the sales conversation rate, we need to         is that it only captures the high-level features, therefore loses so-
integrate data flow during the whole advertising life cycle.             phisticated details of user behaviour.
   The contributions of this paper can be summarized as follows.             Similarity-based approaches apply different similarity met-
                                                                         rics to solve the problem of look-alike modeling. Naive similarity-
    • We have improved the commonly used ad serving mode                 based method computes pairwise similarities between and seed
      from demographics-based crowd segmentation to a compre-            user and all the other users in the set while the locality-sensitive
      hensive audience expansion framework.                              hashing (LSH) [25] technique is often applied to decrease the com-
    • We propose a lookalike model that has better generalization        putation complexity of pairwise similarity. In addition, based on
      ability for audience expansion problem.                            Ma et al. [15][16] provide several similarity scoring methods to
Comprehensive Audience Expansion based on End-to-End Neural Prediction                            SIGIR 2019 eCom, July 2019, Paris, France


measure the potential value of the users to an specific advertiser.     Table 1: An example of data from advertising monitor sys-
However, the similarity-based approach lacks the ability to catch       tem
the implicit interaction between features indicating user behaviour.
   Model-based look-alike systems fall into two categories: un-                  CLICK     Timestamp         USER_ID      SPID
supervised and supervised learning. For instance, k-means clus-                    1      201809123278       66a7988f   107122831
tering [21] and frequent pattern mining [1] are the instances of                   0      201809123346       9e664577   107108909
unsupervised approach. Meanwhile, the supervised approach trans-                   1      201809123456       9b3fcc94   107104618
forms the look-alike model into a positive-unlabeled learning (PU                  0      201809123787       0043fbf4   107102974
learning) problem [12][10][19][13]. In PU learning, the positive                   0      201809132592       1df73293   107108909
samples are seed users while negative samples should be selected
from the non-seed users. The main challenge of PU learning prob-
lem lies in three following aspects: negative samples not easy to          Each row of the original data collected by advertising monitor
obtain; negative samples are too diverse; negative samples are dy-      system represents an ad impression. The "CLICK" column is an
namically changing. In one word, different strategies on how to         indicator that shows whether or not the advertisement is clicked
sample the negative users will definitely affect the model results.     by the corresponding user (1 represents CLICK while 0 means
For example, besides random sampling, Ma et al. [15] select the past    the opposite). As shown in Table 1, The main information of an
non-converter users as negative samples and Liu et al. [13] propose     ad impression includes timestamp, user_ id and an spid. The spid
a "spy" method to aggregate negative users. Another challenge in        refers to the specific information of an advertisement where they
model-based look-alike system is that it need have the capability       are multi-field categorical data [28] which are commonly seen in
to model in the very sparse feature space.                              CTR prediction and recommendation system.
   A key challenge in applying collaborative filtering lies also on        The user behaviour is represented by a high-dimensional sparse
the extreme sparsity of interaction between users and campaign          feature vector where each feature corresponding to the times an ad-
and the way Kanagal et al. [9] address this challenge is to utilize a   vertisement is clicked or impressed. One typical feature extraction
product taxonomy to reveal the relationships. Regarding the algo-       result is shown in Table 2, User "66a7988f " is impressed by spid1
rithms dealing with high-dimensional sparse data is an essential        and spid2 both 3 times while he only clicks spid2 once. The user
task in online advertising industry. Many models have been pro-         feature vector will be normalized afterwards. The normalization
posed to resolve this problem such as Logistic Regression (LR)          approach is as follows where f req represents the original frequency
[3][11], lowPolynomial-2 (Poly2) [2], Factorization Machine-based       and norm_f req is the frequency after normalization:
models [22][7][6] and end-to-end deep learning related models
[4][5][26].                                                                                             1
                                                                                                                          f req > 0
                                                                                              
                                                                                              
                                                                                              
                                                                                                            f r eq
                                                                               norm_f req =     1 + exp(− 10 )
                                                                                              
                                                                                              
3     THE PROPOSED APPROACH                                                                                                              (1)
                                                                                                                           f req = 0
                                                                                              
                                                                                                             0
                                                                                              
Here we first formalize the problem and then list the feature extrac-
                                                                                              
                                                                                              
tion and the prediction framework.                                         To this end, every feature value is converted to a number between
                                                                        0 and 1.
3.1    Problem Statement                                                   It is noteworthy that the data label is the purchase tag (meaning
We formalize the look-alike modeling as a prediction problem. Ad-       the corresponding user has purchase action) from CRM system of
vertisers submit a list of customers, which we call seed user set S,    a particular brand advertiser over a period of time, while features
as positive samples and there are a universal user set U existing       represent the impression and click behaviour for ads of different
in advertising monitor platform. Then the problem is transformed        brands. Unlike the high-dimensional sparse feature transformed by
into a Positive and Unlabeled learning problem: using a small num-      one-hot encoder in CTR prediction task, the original feature space
ber of labeled positive samples S and a large number of unlabeled       is already sparse and high-dimensional.
samples U − S to derive a prediction classifier. Eventually unlabeled      The intuitive idea of utilizing spid as feature is that the ads
users are scored by the classifier and the target audience set T is     are somehow correlated to the websites highly indicating user
taken out according to advertising requirements. The dataset sizes      interests. That is to say, when an internet user is impressed by
are typically configured in real business environment as follows:       an specific ad, the ad itself could describe the user interests to
∥S ∥ = 0.1−0.2M(Million), ∥T ∥ = 10−20M and ∥U ∥ = 2000−3000M.          some extend. Moreover, "CLICK" information directly connects user
Meanwhile, a user is represented by a feature vector which indi-        intention. The detailed comparison of different feature extraction
cates the user’s past behaviour collected by the advertising monitor    methodologies will be incorporated in Section 4.2.
system. The feature vector always occurs with high-dimension D
and extreme sparsity. D is usually around 100-300 thousands and         3.3    Comprehensive Modeling
only 0.1 percent of the feature vector are non-zero elements.           We continue to introduce the lookalike model techniques used in
                                                                        our audience expansion system. Multilayer Perceptron (MLP) is a
3.2    Feature Extraction and Analysis                                  feedforward neural network consisting of several layers. By adding
Here we introduce the feature extraction and analysis stages in the     non-linear activation functions, MLP can fit high-order non-linear
lookalike model.                                                        features. Figure 2 illustrates a MLP network added by a scale layer.
SIGIR 2019 eCom, July 2019, Paris, France                                                                                             J. Jiang et al.

                                                     Table 2: User behaviour Representation

                             USER_ID     spid1_click    spid1_impression    spid2_click   spid2_impression     ...   label
                             66a7988f         0                 3                1                3            ...     1
                             9b3fcc94         0                 2                1                1            ...     0
                             0043fbf4         1                 1                0                1            ...     0
                             9e664577         1                 3                1                2            ...     1
                             1df73293         0                 1                0                1            ...     0


                                                                             effect for the target, when the values of x j drifts, it will cause
                                                                             training difficulty unless the absolute value of parameters ai j , i =
                                                                             1...k are all small; on the other side, as long as the absolute value
                                                                             of the only affected parameter w j in Scale-MLP model is small,
                                                                             the influence of the feature on the target can be made smaller. To
                                                                             conclude, adding the scale layer and updating the parameters of the
                                                                             scale layer during backpropagation can directly change the final
                                                                             influence of each feature on the model.
                                                                                Generally saying, for MLP model, matrix A captures the first-
                                                                             order combinatoric features. In order to learn high-order features,
                                                                             the model need to fit the data by adjusting both the parameters
                                                                             of matrix A and the hidden layers of MLP. Due to the sparsity
                                                                             of feature space and importance of different features varies, the
                                                                             parameters of matrix A cannot be very effectively trained. Under
                                                                             such circumstances, the MLP model is easier to overfit. On the
Figure 2: Comprehensive Audience Expansion Framework                         contrast, the Scale-MLP model only needs to train the parameters
                                                                             of the scale layer properly for the same purpose. Therefore, Scale-
                                                                             MLP model is much simpler to train in our setting.
Based on it, we proceed the audience expansion with intuitive                   Another angel to look at the functionality of the new model is
feature extraction and prediction tasks.                                     that it adds randomness to the original user feature vector. In other
   The prediction equation of a standard MLP model is defined as:            words, if a user is not impressed by some ad, it doesn’t mean that
                                                                             he/she is totally not interested in that ad. Therefore, the scale layer
                      y = mlp(AX + bias), A ∈ Rk ×n                   (2)    will help to learn a model which has better generalization capability
      After adding a scale layer, the model we call Scale-MLP is updated     for this task.
as:
                                                                             3.4    Model Training
                 y = mlp(A(W ◦ X ) + bias), A ∈ Rk×n                (3)         3.4.1 The Impact of Sampling Ratio. We evaluate the impact
     The model expressibility of Equation 2 and 3 is the same so that        of sampling ratio based on different number of positive and un-
there is no difference at model prediction stage. That is to say, the        labeled samples, seeing unlabeled as negative label. The standard
theoretical optimal solution of MLP and Scale-MLP are the same.              classification algorithm we choose is Logistic Regression. The key
However, deep models don’t always converge to the same optimal               metrics need to be taken care are test recall and threshold, mean-
solution in practice, therefore, the effectiveness of actual models          ing positive sample recall on testing data set and the corresponding
obtained from Scale-MLP and MLP are often different on different             probability boundary. The number of positive and negative sam-
datasets.                                                                    ples in testing data set are 34657 and 72464. The evaluation result
     To be detailed, the essential difference lies in the way backprop-      in Table 3 shows when ratio of positive and unlabeled reaches
agation update the network parameter during model training stage.            1:2 (the number of positive and negative samples are 69331 and
Compared to a standard MLP, Equation 3 reflects that the network             134584 respectively), the threshold doesn’t change significantly
need feedforward an intermediate result w j x j after the scale layer        when more unlabeled samples are added. Considering both training
added. When MLP updates the parameter matrix A during backprop-              efficiency and effectiveness, it is practical to set the sampling ratio
agation , the partial derivative regarding ai j is x j ; for Scale-MLP,      of positive:negative as 1:2.
the partial derivative regarding ai j is w j x j while regarding w j is
x j . In another word, the value of feature x j in MLP can directly             3.4.2 Sampling Techniques. For general classification problem,
affect the parameters ai j , i = 1...k; for Scale-MLP, feature x j can       to determine where the class boundary is, at least some of the
only update w j .                                                            negative samples to be close to the positive ones are chosen. Take
     Assuming that the influence of different features on the model is       "active learning" [23] as an example, algorithms will select out those
quite different, the fluctuation of feature values will make training        samples that are most indistinguishable from the model for human
process difficult to converge. Suppose that the feature x j has little       expert to label. However, look-alike models deal with data without
Comprehensive Audience Expansion based on End-to-End Neural Prediction                                 SIGIR 2019 eCom, July 2019, Paris, France

                                                   Table 3: The Impact of Sampling Ratio

                             positive   unlabeled     train loss   test accuracy   test auc    test recall   threshold
                              69331       69331         0.4693          0.764       0.835        0.740         0.493
                              97064       97064         0.4692          0.767       0.840        0.740         0.498
                             138663      138663         0.4700          0.770       0.843        0.740         0.493
                              69331       95743         0.4650          0.769       0.837        0.740         0.518
                              69331      134584        0.4298          0.774        0.839        0.740         0.668
                              69331      197328         0.3874          0.776       0.839        0.740         0.678
                              69331      245811         0.3576          0.776       0.839        0.740         0.682


labelled negative samples, hence the goal of sampling is to pick out           A more sophisticated approach [18] is a variant of bagging: first
a reliable set of negative users.                                           of all, a subset of unlabeled samples are bootstrapped from the
   Besides randomly selecting negative samples and directly apply           unlabeled sample set U . The algorithm details are depicted in Al-
standard classifier to the PU learning problem, we compare the              gorithm 3. Here we set the number of iterations T and for each
effectiveness of three other sampling techniques: spy, pre-train and        iteration, a standard classifier responsible for predicting U is trained
bootstrap sampling. The "Spy" [13] [12] and "Pre-Train" sampling            on bootstrapped sample set U ′ and positive sample set P. The final
strategies are so-called "two-step" approach [8] where the general          predicted probability equals to the average score of T iterations.
idea is described as follows: the first step is to identify a subset of
unlabeled samples that can be reliably labelled as negative, then             Algorithm 3: Bootstrap Sampling
positive and negative samples are used to train a standard classifier          Input: Positive Sample Set P, Unlabeled Sample Set U
that will be applied to the remaining unlabeled samples. Usually               Output: Negative Sample Set N with size k
the classifier is learned iteratively till it converges or some stop-        1 for t ≤ T do
ping criterion is met. Correspondingly, the "Spy" and "Pre-Train"            2     Bootstrap a subset U ′ from U ;
sampling strategies are illustrated in Algorithm 1 and 2.                    3     Train a classifier M on P and U ′ ;
                                                                             4     Predict U − U ′ using classifier M;
 Algorithm 1: Spy Sampling                                                   5     Record the classifying scores;
  Input: Positive Sample Set P, Unlabeled Sample Set U                       6 Average the classifying scores of all iterations;
  Output: Negative Sample Set N with size k                                  7 Select a subset N of k samples with least average scores;
1 Randomly select a subset from P as the spy set P ;
                                                  ′
                                                                             8 Return N ;
2 Train a classifier M based on P − P and U + P ;
                                     ′          ′

3 Select a subset N of k samples from U with least prediction                  Table 4 shows the experimental result of different sampling ap-
   scores;                                                                  proaches. The samplinд parameter represents the percentage of
4 Return N ;                                                                unlabeled samples picked out as negative and threshold indicates
                                                                            the corresponding probability boundary. From the result table it can
                                                                            be seen that when spy and bootstrap approaches sample half size of
                                                                            the unlabeled data, it still guarantees almost the same level of recall
                                                                            on testing data while regarding pre-train sampling approach, the
 Algorithm 2: Pre-Train Sampling                                            recall on test data is much lower. On the sampling efficiency, spy ap-
  Input: Positive Samples Set P, Unlabeled Sample Set U ,                   proach can only run one iteration compared to the other two which
         Validation Set V                                                   need converge after several rounds. Therefore, it is both efficient
  Output: Negative Sample Set N with size k                                 and effective to utilize spy sampling approach in our setting.
1 Randomly select a subset N with size k from U ;                              Logistic Regression: Logistic Regression (LR) is probably the
2 while true do                                                             most widely used baseline model. Suppose there are n features
3    Randomly select a subset N ′ from N ;                                  {x 1 , x 2 , ..., x n } and x i is either 0 or 1, consider an LR model without
4    Train a classifier M based on P and N ′ , and evaluate the             a regularization term:
      model on V ;
5    if the accuracy of M doesn’t improve on V then                                                    y = bias + β T X                         (4)
6        Return N ;                                                         where β is the coefficient vector. This simple linear model misses the
7        break;                                                             crucial feature crosses, therefore, the Degree-2 Polynomial (Poly2)
                                                                            model is always provided to ease the problem.
8     Predict U using classifier M;
9     Select a subset N of k samples with least prediction                                         y = bias + β T X + XW X T                          (5)
       scores;                                                              where W is a symmetric parameter matrix with the elements on
                                                                            the diagonal are all equal to 0.
SIGIR 2019 eCom, July 2019, Paris, France                                                                                               J. Jiang et al.

                                                    Table 4: The Impact of Sampling Approach

                              approach    sampling parameter   train loss   test accuracy   test auc   test recall   threshold
                              Random              0.9            0.4250          0.775       0.847       0.766         0.633
                              Random              0.5            0.4150          0.753       0.843       0.676         0.612
                                 Spy             0.95            0.4138          0.775       0.847       0.768         0.640
                                 Spy              0.9            0.4141          0.776       0.847       0.763         0.633
                                 Spy             0.5            0.3660          0.775        0.845       0.768         0.607
                              Pre-Train          0.95            0.3677          0.775       0.845       0.771         0.624
                              Pre-Train           0.9            0.3829          0.776       0.846       0.771         0.632
                              Pre-Train           0.5            0.4382          0.702       0.839       0.632         0.628
                              Boostrap           0.95            0.4127          0.775       0.847       0.768         0.638
                              Boostrap            0.9            0.4153          0.776       0.847       0.763         0.633
                              Boostrap           0.5            0.3976          0.775        0.845       0.766         0.640


   Factorization Machine In order to extract feature crosses while             typical feature could be that one specific user is impressed by an ad
reducing the influence of high-dimensional sparse features, Rendle             of "Maybelline" 5 times in July. In general, only activities happening
[22] proposes Factorization Machines to overcome the drawbacks                 in last three months are to be extracted. Click means whether we
of LR. Regarding LR model, the number of parameters in matrix                  distinguish between click action from impression. The experimen-
                          n(n−1)
W need to be learned is 2 . When n is 100,000, the number                      tal results based on LR model (training data volume: 428484; testing
of parameters is tens of billions. At the same time, when training             data volume: 107121; positive and negative ratio is 1:2) show that if
the model using gradient descent optimization, the parameter w i j             the features are calculated by month and click action is separated
can only be trained when x i and x j are both not zero, therefore              from impression, the AUC value will reach 0.8465 in testing phrase
there is a high demand on both the number of training samples and              which is the best among all settings. Therefore, this feature engi-
memory space at training phrase. As a result, for high-dimensional             neering strategy will be applied in various model methodologies
sparse features, the parameter matrix W is almost impossible to                afterwards.
train.
                                                                                        Table 5: The Impact of Feature Engineering
   To overcome this problem, we will decompose W into VV T
where each vi in V = (v 1 , v 2 , ..., vn )T can be seen as a latent k-
dimensional factor of original feature. The Degree-2 F M model                     Feature Size   Time Slice     Click   Train AUC      Test AUC
equation is defined as:                                                              144009         None         True      0.8721        0.8447
                                                                                      94932         None         False     0.8689        0.8443
                 y = bias + β T X + XVV T X T , V ∈ Rn×k              (6)            249406       by holiday     True      0.8808        0.8445
                                                                                     196184       by month       True      0.8780        0.8465
    At this time, the number of parameters need to be estimated is
                                                                                     133605       by month       False     0.8761        0.8461
n · k and easier to train even under sparsity setting as F M model
break the independence of the interaction parameters by factorizing
them.
                                                                               4.3    Model Performance
4 EXPERIMENTS                                                                  In this section, the performance comparison of various models is in-
4.1 Setup                                                                      troduced. The hyper-parameters configured in different models are
                                                                               listed at Table 6. In this table, BN-MLP is a multi-layer perceptron
Regarding the model implementation, we use MXNet1 on a stand-                  with a batch normalization layer after each hidden layer; Scale-
alone 1080TI GPU to compare different model effects and figure                 BN-MLP adds a scale layer before BN-MLP; lr and wd represent
out model parameters. When predicting the universal user pool                  learning rate and L2 regularization parameter respectively.
consisting of nearly 2.5 billion users, we used distributed MXNet on
a 80-cores hadoop cluster to re-train the model and it took nearly 4                          Table 6: Hyper-parameter Setting
hours to finish the prediction of all users.
                                                                                                 Model                Parameters
4.2     The Impact of Feature Engineering
                                                                                                   LR              lr=1e-4, wd=1e-6
Table 5 shows the impact of different feature engineering approaches.                              FM           lr=1e-4, wd=3e-5, k=6
In this table, T ime Slice indicates the strategy of calculating the                              MLP              lr=1e-4, wd=3e-5
user behaviour by time slice (None: no time slice; day: slice by day;                           BN-MLP             lr=1e-4, wd=3e-5
holiday: slice by holiday and weekday; month: slice by month). For                             Scale-MLP           lr=1e-4,wd=3e-5
example, if we extract features of user activities by month, one                             Scale-BN-MLP          lr=1e-4,wd=3e-5
1 https://mxnet.apache.org/
Comprehensive Audience Expansion based on End-to-End Neural Prediction                                                              SIGIR 2019 eCom, July 2019, Paris, France


                                trainDataSet: Different model performance                                          testDataSet: Different model performance
                     0.94
                     0.92                                                                           0.86

                     0.90
                                                                                                    0.84
                     0.88
                     0.86                                                                           0.82
                                                                           LR                                                                               LR
               auc


                                                                                              auc
                     0.84                                                  FM                                                                               FM
                     0.82                                                  bn_mlp                   0.80                                                    bn_mlp
                     0.80
                                                                           mlp                                                                              mlp
                                                                           s_mlp                    0.78                                                    s_mlp
                     0.78                                                  s_bn_mlp                                                                         s_bn_mlp
                            0           5              10             15              20                   0             5              10             15              20
                                                    epoch                                                                            epoch

                                            (a) Train Dataset                                                                (b) Test Dataset


                                                                       Figure 3: Model Performance


                                    trainDataSet: Different lr performance                                           testDataSet: Different lr performance
                                                                                                   0.875
                     0.95
                                                                                                   0.850
                     0.90                                                                          0.825
                                                                                                   0.800
                     0.85
                                                                                                   0.775
               auc


                                                                                             auc


                     0.80
                                                                                                   0.750

                     0.75                                             lr=0.00001                   0.725                                               lr=0.00001
                                                                      lr=0.00005                   0.700
                                                                                                                                                       lr=0.00005
                     0.70                                             lr=0.0001                                                                        lr=0.0001
                                                                      lr=0.0005                    0.675                                               lr=0.0005
                            0   5      10      15     20      25     30      35   40                       0   5        10     15      20     25      30      35   40
                                                    epoch                                                                            epoch

                                            (a) Train Dataset                                                                (b) Test Dataset


                                                       Figure 4: Comparison of Different Learning Rate

                                                                   Table 7: Online A/B Testing Results

                                           Metric                   Random             F20-34         FEMALE           MALE             MODEL
                                         Impression                18,367,151         6,493,314       3,910,355       1,454,655         1,221,095
                                       Impression UV               8,578,859          3,152,614       2,052,897        912,468           594,456
                                          Purchaser                    597               123             117              29               217
                                       Purchaser Rate                0.01%              0.00%           0.01%           0.00%             0.04%
                                        Transaction                    731               155             134              32               275
                                            Sales                    69,974             14,976          14,727          3,739            23,413
                                            ATV                        96                 97             110             117                85
                                         Media Cost                 295,575            106,982          61,972          24,429           19,100
                                            CPO                       404.3             690.2           462.5           763.4              69.5
                                            CPA                       495.1             869.8           529.7           842.4              88.0
                                      Incremental ROI                  0.2               0.1              0.2             0.2              1.2


   From the experiment results in Figure 3, we can see that the                                    model. Therefore, Scale-BN-MLP outperforms other models regard-
effect of the multi-layer perceptron is better than that of LR and FM,                             ing AUC value during training phrase. Meanwhile, the convergence
and adding the batch normalization layer and the scale layer can                                   speed of Scale-BN-MLP (4 epochs) is the fastest one among all mod-
both improve the model performance and convergence speed of the                                    els, requiring early stopping to yield the optimal model in
SIGIR 2019 eCom, July 2019, Paris, France                                                                                                                          J. Jiang et al.


practice. The result confirms the derivation in section 3.3. Figure 4                       [5] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017.
shows different learning rates for Scale-BN-MLP model in training                               DeepFM: A Factorization-machine Based Neural Network for CTR Prediction. In
                                                                                                Proc. of IJCAI. 1725–1731.
and testing data set, the convergence speed performs well when                              [6] Yuchin Juan, Damien Lefortier, and Olivier Chapelle. 2017. Field-aware Factor-
learning rate equals to 0.0001(1e-4).                                                           ization Machines in a Real-world Online Advertising System. In Proc. of WWW
                                                                                                Companion. 680–688.
                                                                                            [7] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware
4.4     Online Effectiveness Evaluation                                                         Factorization Machines for CTR Prediction. In Proc. of RecSys. 43–50.
                                                                                            [8] Azam Kaboutari, Shabestar Branch, Jamshid Bagherzadeh, Iran Urmia, and Fate-
Regarding effectiveness evaluation in a real closed-loop business set-                          meh Kheradmand. 2014. An evaluation of two-step techniques for positive-
ting, we corporate with a brand advertiser and a third-party adver-                             unlabeled learning in text classification. Int. J. Comput. Appl. Technol. Res 3,
tising monitor supplier in order to conduct the online experiments.                             592–594.
                                                                                            [9] B. Kanagal, A. Ahmed, S. Pandey, V. Josifovski, L. Garcia-Pueyo, and J. Yuan. 2013.
The final experiment results are shown at Table 7. There are several                            Focused matrix factorization for audience selection in display advertising. In
important business metrics like Impression UV, Purchaser Rate,                                  Proc. of ICDE. 386–397.
ATV (Average Transaction Value), CPO (Cost Per Order), CPA (Cost                           [10] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. 2017.
                                                                                                Positive-Unlabeled Learning with Non-Negative Risk Estimator. In Proc. of NIPS,
Per Action) and Incremental ROI listed in this table. All indicators                            I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and
of our model perform far better than traditional demographic-based                              R. Garnett (Eds.). 1675–1685.
                                                                                           [11] R. Kumar, S. M. Naik, V. D. Naik, S. Shiralli, Sunil V.G, and M. Husain. 2015.
approaches.                                                                                     Predicting clicks: CTR estimation of advertisements using Logistic Regression
                                                                                                classifier. In 2015 IEEE International Advance Computing Conference (IACC). 1134–
5     CONCLUSIONS AND FUTURE WORK                                                               1138.
                                                                                           [12] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. 2003. Building text classifiers using
In this paper, we showed an data application architect to utilize ad-                           positive and unlabeled examples. In Proc. of ICDM. 179–186.
vertisement monitor data in audience expansion system for brand                            [13] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. 2002. Partially Supervised
                                                                                                Classification of Text Documents. In Proc. of ICML. 387–394.
advertisers, compared to traditional ad serving based on demograph-                        [14] Haishan Liu, David Pardoe, Kun Liu, Manoj Thakur, Frank Cao, and Chongzhe
ics, the lookalike model in our application focuses on analysing                                Li. 2016. Audience Expansion for Online Social Network Advertising. In Proc. of
user behaviour. Regarding the way of picking up the negative sam-                               KDD. 165–174.
                                                                                           [15] Q. Ma, E. Wagh, J. Wen, Z. Xia, R. Ormandi, and D. Chen. 2016. Score Look-Alike
ples from unlabeled data, we compared four sampling techniques                                  Audiences. In Proc.of workshops on ICDM. 647–654.
and the impact of different sampling ratios in order to figure out                         [16] Qiang Ma, Musen Wen, Zhen Xia, and Datong Chen. 2016. A Sub-linear, Massive-
                                                                                                scale Look-alike Audience Extension System A Massive-scale Look-alike Au-
the best setting. Meanwhile, to overcome the sparsity and high                                  dience Extension. In Workshop on Big Data, Streams and Heterogeneous Source
dimension of feature space, we proposed Scale-MLP, a modified                                   Mining: Algorithms, Systems, Programming Models and Applications.
MLP by adding a scale layer, although the training AUC is lower                            [17] Ashish Mangalampalli, Adwait Ratnaparkhi, Andrew O. Hatch, Abraham Bagher-
                                                                                                jeiran, Rajesh Parekh, and Vikram Pudi. 2011. A Feature-pair-based Associative
than other traditional learning strategies, however, it gains perfor-                           Classification Approach to Look-alike Modeling for Conversion-oriented User-
mance improvement when generalizing the model to testing data                                   targeting in Tail Campaigns. In Proc. of WWW. 85–86.
while the efficiency of Scale-MLP is comparable to other approaches.                       [18] F. Mordelet and J. P. Vert. 2014. A Bagging SVM to Learn from Positive and
                                                                                                Unlabeled Examples. Pattern Recogn. Lett. 37 (Feb. 2014), 201–209.
Lastly we prove that the lookalike model outperforms traditional                           [19] Minh Nhut Nguyen, Xiao-Li Li, and See-Kiong Ng. 2011. Positive Unlabeled
ad serving mechanisms in real business environment.                                             Learning for Time Series Classification. In Proc. of IJCAI. 1421–1426.
                                                                                           [20] Sandeep Pandey, Mohamed Aly, Abraham Bagherjeiran, Andrew Hatch, Peter
   Several directions exist for future research. The rich information                           Ciccolo, Adwait Ratnaparkhi, and Martin Zinkevich. 2011. Learning to Target:
contained in the advertisement could be harnessed to investigate                                What Works for Behavioral Targeting. In Proc. of CIKM. 1805–1814.
more sophisticated look-alike models. For example, we could in-                            [21] Archana Ramesh, Ankur Teredesai, Ashish Bindra, Sreenivasulu Pokuri, and Kr-
                                                                                                ishna Uppala. 2013. Audience Segment Expansion Using Distributed In-database
corporate advertising information including advertiser, brand and                               K-means Clustering. In Proc. of ADKDD. 5:1–5:9.
product in order to explore more detailed feature interactions. For                        [22] Steffen Rendle. 2010. Factorization Machines. In Proc. of ICDM. 995–1000.
different advertisers’ campaign, adaptive user feature representa-                         [23] Burr Settles. 2010. Active learning literature survey. Technical Report.
                                                                                           [24] Jianqiang Shen, Sahin Cem Geyik, and Ali Dasdan. 2015. Effective Audience
tion also need to be taken into consideration. Meanwhile, CTR                                   Extension in Online Advertising. In Proc. of KDD. 2099–2108.
prediction task will be a challenging and interesting problem under                        [25] M. Slaney and M. Casey. 2008. Locality-Sensitive Hashing for Finding Nearest
                                                                                                Neighbors [Lecture Notes]. IEEE Signal Processing Magazine 25, 2 (March 2008),
the setting of growing diversity in targeting users and cross-media                             128–131.
advertising platforms. CTR prediction results could be utilized for                        [26] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network
the purpose of omni-channel uniform budget allocation to effec-                                 for Ad Click Predictions. In Proc. of ADKDD. 12:1–12:7.
                                                                                           [27] Jun Yan, Ning Liu, Gang Wang, Wen Zhang, Yun Jiang, and Zheng Chen. 2009.
tively enhance ROI by matching brands/products with different                                   How Much Can Behavioral Targeting Help Online Advertising?. In Proc. of WWW.
media platforms.                                                                                261–270.
                                                                                           [28] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi-
                                                                                                field Categorical Data - - A Case Study on User Response Prediction. In Proc. of
REFERENCES                                                                                      ECIR (Lecture Notes in Computer Science), Nicola Ferro, Fabio Crestani, Marie-
 [1] A. Bindra, S. Pokuri, K. Uppala, and A. Teredesai. 2012. Distributed Big Advertiser        Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio,
     Data Mining. In Proc. of Workshops on ICDM. 914–914.                                       Claudia Hauff, and Gianmaria Silvello (Eds.), Vol. 9626. 45–57.
 [2] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-
     Jen Lin. 2010. Training and Testing Low-degree Polynomial Data Mappings via
     Linear SVM. JMLR 11 (Aug. 2010), 1471–1490.
 [3] Olivier Chapelle, Eren Manavoglu, and Romer Rosales. 2014. Simple and Scalable
     Response Prediction for Display Advertising. ACM Trans. Intell. Syst. Technol. 5,
     4 (Dec. 2014), 61:1–61:34.
 [4] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra,
     Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan
     Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah.
     2016. Wide & Deep Learning for Recommender Systems. In Proc. of the 1st
     Workshop on Deep Learning for Recommender Systems. 7–10.