Comprehensive Audience Expansion based on End-to-End Neural Prediction

Jinling Jiang, MiningLamp Technology, Beijing, China, jiangjinling@mininglamp.com
Xiaoming Lin, MiningLamp Technology, Beijing, China, linxiaoming@mininglamp.com
Junjie Yao, East China Normal University, Shanghai, China, junjie.yao@sei.ecnu.edu.cn
Hua Lu, Department of Computer Science, Aalborg University, Aalborg, Denmark, luhua@cs.aau.dk

ABSTRACT
In current online advertising applications, look-alike methods are valuable and commonly used to identify new potential users, tackling the difficulties of audience expansion. However, demographic information and the variety of user behavior logs are high-dimensional, noisy, and increasingly complex, which makes it challenging to extract suitable user profiles. Usually, rule-based and similarity-based approaches are proposed to profile users' interests and expand the audience. However, they are specific and limited in more complex scenarios.

In this paper, we propose a new end-to-end solution, unifying the feature extraction and profile prediction stages. Specifically, we present a neural prediction framework and leverage it with the intuitive audience feature extraction stages. We conduct an extensive study on a real and large advertisement dataset. The results demonstrate the advantage of the proposed approach, not only in accuracy but also in generality.

CCS CONCEPTS
• Information systems → Online Advertising; • Human-centered computing → User Models; • Theory of computation → Computational Advertising theory; • Computing methodologies → Factorization methods;

KEYWORDS
Online Advertising; Audience Expansion; Lookalike Modeling

ACM Reference Format:
Jinling Jiang, Xiaoming Lin, Junjie Yao, and Hua Lu. 2019. Comprehensive Audience Expansion based on End-to-End Neural Prediction. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 8 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION
The remarkable growth of online advertisement enables advertisers to sync up their products according to the fast-changing needs of the consumer. As the development of e-commerce platforms has brought SMEs (small and medium-sized enterprises) into consumers' sight, large enterprise advertisers face the crisis of slowing business growth and falling revenue. Therefore, brand advertisers have begun to pay more attention to the contribution of advertising to sales conversion and to the actual revenue brought by advertising, requiring advertising agencies and third-party suppliers to provide more refined performance data on advertising effects.

Meanwhile, the emergence of big data technology has subverted the operation model of the entire advertising industry and the traditional way of evaluating advertising effects. By tracking and obtaining user behavior data, a third-party supplier of advertising monitoring can analyze the data according to the advertiser's needs, not only understanding in time the communication effects and sales conversion rate generated by the advertisement but also predicting the user conversion probability to some extent. Through analysis and modeling on massive user behavior data, advertisers can accurately reach the target consumer. Therefore, how to better utilize the advertising monitor data in order to optimize ad serving and improve the marketing conversion rate has become an important issue.

One of the main challenges in ad serving is how to find the best converting prospects. A typical way is audience expansion, that is, to identify and reach new audiences with similar interests to the original target audience. The methodology usually used for the audience expansion problem is called look-alike modeling. Given a seed user set S from a universal set U, look-alike models essentially find groups of audiences from U − S who look and act like the audience in S.

The data flow of the audience expansion service is illustrated in Figure 1. The data runs between advertisers and our universal advertising monitor system across different media platforms. The original users come from the advertiser's CRM (Customer Relationship Management) system, which selects the consumers who recently made purchases. The users among them who are tracked by the universal advertising monitor are then matched and treated as "seed" users.

Figure 1: Audience Expansion Dataflow

In this paper, we build a closed-loop data solution for brand advertisers that combines multiple techniques for selecting negative samples and extracting features with machine learning lookalike models to reach the targeted audience, which greatly enhances the conversion effect of ad serving. Based on the "seed" users and the universal user set from the advertising monitor, we build a lookalike model to predict, for every user, the probability of belonging to the target audience. Afterward, according to the advertising budget, the lookalike model yields the corresponding number of expanded users to be reached through the ad serving system. Finally, the ad serving performance is evaluated by the advertiser's site monitor system, which records sales conversions shortly afterwards.

However, both traditional and current look-alike strategies for finding the target audience are mainly based on user demographics. There are two main problems with demographics-based audience segmentation. First, user demographics (age, gender, and geographical location) are not precise, as they are estimated via various statistical methods or machine learning models based on a small group of surveyed samples (10-100 thousand). Second, the number of users specified by demographics is large, so more sophisticated screening is required. Accordingly, the details of user behavior data should be harnessed in machine learning models to target an accurate audience segment. At the same time, two main problems need to be solved when modeling user behavior data: user-generated behavior data on the Internet is generally high-dimensional and sparse; and advertisers usually can only provide positive samples, while negative samples need to be carefully picked from a substantial unlabeled sample set.

Besides, the ecologically closed Internet giants (represented by Facebook, Amazon, Tencent, Alibaba, etc.) provide advertisers with the capability to perform audience expansion within their own platforms. However, the ad serving data of these platforms are not connected with the advertiser's CRM system. Thus, it is difficult to directly track the real conversion rate. In order to verify that lookalike models based on user behaviour work better than the traditional demographics-based approach with respect to the sales conversion rate, we need to integrate the data flow during the whole advertising life cycle.

The contributions of this paper can be summarized as follows.

• We have improved the commonly used ad serving mode from demographics-based crowd segmentation to a comprehensive audience expansion framework.
• We propose a lookalike model that has better generalization ability for the audience expansion problem.
• We conduct extensive and effective experiments to extract negative samples from unlabeled data.
• We demonstrate the effectiveness of the proposed lookalike models in an online environment.

The rest of the paper is organized as follows. In Section 2, we review the related work on various kinds of look-alike models and illustrate the different design philosophies behind them. Section 3.1 gives the formal problem statement and specifies the notation used in the paper. The rest of Section 3 introduces our feature extraction and proposed lookalike models, and Section 3.4 presents the sampling strategies. The evaluation of the algorithm is presented in Section 4. Finally, the conclusion and future work are discussed in Section 5.

2 RELATED WORKS
We briefly review the related literature on look-alike modeling. Generally, in online user-targeted advertising, the look-alike modeling that supports audience expansion systems can be categorized into three lines: rule-based, similarity-based and model-based.

Rule-based approaches focus on explicit positioning, where users with specific demographic tags (age, gender, geography) or interests are targeted directly for advertisers. The core technical support in the background is user profile mining, which means the interest tags are inferred from user behaviour [20][27]. Furthermore, Mangalampalli et al. [17] build a rule-based associative classifier for campaigns with few conversions; Shen et al. [24] and Liu et al. [14] present detailed in-depth analyses of multiple methods under different considerations (such as similarity, performance, and whether or not the method is campaign-agnostic) for online social network advertising. The main disadvantage of rule-based look-alike modeling is that it only captures high-level features and therefore loses the sophisticated details of user behaviour.

Similarity-based approaches apply different similarity metrics to solve the problem of look-alike modeling. The naive similarity-based method computes pairwise similarities between the seed users and all the other users in the set, while the locality-sensitive hashing (LSH) [25] technique is often applied to decrease the computational complexity of pairwise similarity. In addition, Ma et al. [15][16] provide several similarity scoring methods to measure the potential value of users to a specific advertiser. However, the similarity-based approach lacks the ability to catch the implicit interactions between the features indicating user behaviour.

Model-based look-alike systems fall into two categories: unsupervised and supervised learning. For instance, k-means clustering [21] and frequent pattern mining [1] are instances of the unsupervised approach. Meanwhile, the supervised approach transforms look-alike modeling into a positive-unlabeled learning (PU learning) problem [12][10][19][13]. In PU learning, the positive samples are seed users while negative samples should be selected from the non-seed users. The main challenge of PU learning lies in three aspects: negative samples are not easy to obtain; negative samples are too diverse; and negative samples change dynamically. In short, different strategies for sampling the negative users will affect the model results. For example, besides random sampling, Ma et al. [15] select the past non-converter users as negative samples and Liu et al. [13] propose a "spy" method to aggregate negative users. Another challenge in model-based look-alike systems is that they need the capability to model in a very sparse feature space.

A key challenge in applying collaborative filtering also lies in the extreme sparsity of the interactions between users and campaigns; Kanagal et al. [9] address this challenge by utilizing a product taxonomy to reveal the relationships. Dealing with high-dimensional sparse data is an essential task in the online advertising industry. Many models have been proposed to resolve this problem, such as Logistic Regression (LR) [3][11], Polynomial-2 (Poly2) [2], Factorization Machine-based models [22][7][6] and end-to-end deep learning models [4][5][26].

3 THE PROPOSED APPROACH
Here we first formalize the problem and then describe the feature extraction and the prediction framework.

3.1 Problem Statement
We formalize look-alike modeling as a prediction problem. Advertisers submit a list of customers, which we call the seed user set S, as positive samples, and there is a universal user set U in the advertising monitor platform. The problem is then transformed into a Positive and Unlabeled learning problem: using a small number of labeled positive samples S and a large number of unlabeled samples U − S to derive a prediction classifier. Eventually the unlabeled users are scored by the classifier and the target audience set T is taken out according to the advertising requirements. In a real business environment the dataset sizes are typically as follows: |S| = 0.1-0.2M (million), |T| = 10-20M and |U| = 2000-3000M.

Meanwhile, a user is represented by a feature vector which indicates the user's past behaviour collected by the advertising monitor system. The feature vector has a high dimension D and is extremely sparse: D is usually around 100-300 thousand and only 0.1 percent of the elements of the feature vector are non-zero.

3.2 Feature Extraction and Analysis
Here we introduce the feature extraction and analysis stages in the lookalike model.

Table 1: An example of data from advertising monitor system

CLICK  Timestamp     USER_ID   SPID
1      201809123278  66a7988f  107122831
0      201809123346  9e664577  107108909
1      201809123456  9b3fcc94  107104618
0      201809123787  0043fbf4  107102974
0      201809132592  1df73293  107108909

Each row of the original data collected by the advertising monitor system represents an ad impression. The "CLICK" column is an indicator that shows whether or not the advertisement is clicked by the corresponding user (1 represents a click while 0 means the opposite). As shown in Table 1, the main information of an ad impression includes the timestamp, the user_id and the spid. The spid refers to the specific information of an advertisement, which is multi-field categorical data [28] of the kind commonly seen in CTR prediction and recommendation systems.

The user behaviour is represented by a high-dimensional sparse feature vector where each feature corresponds to the number of times an advertisement is clicked or impressed. One typical feature extraction result is shown in Table 2: user "66a7988f" is impressed by spid1 and spid2 three times each, while he only clicks spid2 once.

Table 2: User behaviour Representation

USER_ID   spid1_click  spid1_impression  spid2_click  spid2_impression  ...  label
66a7988f  0            3                 1            3                 ...  1
9b3fcc94  0            2                 1            1                 ...  0
0043fbf4  1            1                 0            1                 ...  0
9e664577  1            3                 1            2                 ...  1
1df73293  0            1                 0            1                 ...  0

The user feature vector is normalized afterwards. The normalization is as follows, where freq represents the original frequency and norm_freq is the frequency after normalization:

$$ \mathrm{norm\_freq} = \begin{cases} \dfrac{1}{1 + \exp(-\mathrm{freq}/10)}, & \mathrm{freq} > 0 \\ 0, & \mathrm{freq} = 0 \end{cases} \tag{1} $$

In this way, every feature value is converted to a number between 0 and 1.

It is noteworthy that the data label is the purchase tag (meaning the corresponding user has a purchase action) from the CRM system of a particular brand advertiser over a period of time, while the features represent the impression and click behaviour for ads of different brands. Unlike the high-dimensional sparse features produced by one-hot encoding in the CTR prediction task, the original feature space here is already sparse and high-dimensional.

The intuitive idea of utilizing the spid as a feature is that the ads are somehow correlated with the websites, which are highly indicative of user interests. That is to say, when an internet user is impressed by a specific ad, the ad itself can describe the user's interests to some extent. Moreover, the "CLICK" information directly reflects user intention. The detailed comparison of different feature extraction methodologies is given in Section 4.2.

3.3 Comprehensive Modeling
We continue by introducing the lookalike model techniques used in our audience expansion system. A Multilayer Perceptron (MLP) is a feedforward neural network consisting of several layers. By adding non-linear activation functions, an MLP can fit high-order non-linear features. Figure 2 illustrates an MLP network with an added scale layer.

Figure 2: Comprehensive Audience Expansion Framework

Based on it, we carry out the audience expansion with intuitive feature extraction and prediction tasks.

The prediction equation of a standard MLP model is defined as:

$$ y = \mathrm{mlp}(AX + \mathrm{bias}), \quad A \in \mathbb{R}^{k \times n} \tag{2} $$

After adding a scale layer, the model, which we call Scale-MLP, becomes:

$$ y = \mathrm{mlp}(A(W \circ X) + \mathrm{bias}), \quad A \in \mathbb{R}^{k \times n} \tag{3} $$

where $\circ$ denotes the element-wise product and $W = (w_1, \dots, w_n)$ holds one scale weight $w_j$ per feature $x_j$.

The expressive power of Equations 2 and 3 is the same, so there is no difference at the model prediction stage. That is to say, the theoretical optimal solutions of MLP and Scale-MLP are the same. However, deep models don't always converge to the same optimal solution in practice; therefore, the effectiveness of the actual models obtained from Scale-MLP and MLP often differs across datasets.

In detail, the essential difference lies in the way backpropagation updates the network parameters during the model training stage. Compared to a standard MLP, Equation 3 shows that the network needs to feed forward an intermediate result $w_j x_j$ once the scale layer is added. When MLP updates the parameter matrix $A$ during backpropagation, the partial derivative of the first-layer output with respect to $a_{ij}$ is $x_j$; for Scale-MLP, the partial derivative with respect to $a_{ij}$ is $w_j x_j$, while the one with respect to $w_j$ is proportional to $x_j$ (namely $a_{ij} x_j$ for the $i$-th hidden unit). In other words, the value of feature $x_j$ in MLP directly affects the parameters $a_{ij}$, $i = 1 \dots k$; for Scale-MLP, the raw feature $x_j$ directly drives only $w_j$.

Assuming that the influence of different features on the model is quite different, the fluctuation of feature values will make the training process difficult to converge. Suppose that the feature $x_j$ has little effect on the target. When the values of $x_j$ drift, this causes training difficulty unless the absolute values of the parameters $a_{ij}$, $i = 1 \dots k$, are all small. In Scale-MLP, on the other hand, as long as the absolute value of the single affected parameter $w_j$ is small, the influence of the feature on the target can be made small. To conclude, adding the scale layer and updating its parameters during backpropagation can directly change the final influence of each feature on the model.

Generally speaking, for the MLP model, matrix $A$ captures the first-order combinatorial features. In order to learn high-order features, the model needs to fit the data by adjusting both the parameters of matrix $A$ and the hidden layers of the MLP. Because the feature space is sparse and the importance of different features varies, the parameters of matrix $A$ cannot be trained very effectively. Under such circumstances, the MLP model overfits more easily. In contrast, the Scale-MLP model only needs to train the parameters of the scale layer properly for the same purpose. Therefore, the Scale-MLP model is much simpler to train in our setting.

Another angle from which to look at the functionality of the new model is that it adds randomness to the original user feature vector. In other words, if a user is not impressed by some ad, it doesn't mean that he or she is not interested in that ad at all. Therefore, the scale layer helps to learn a model which has better generalization capability for this task.

3.4 Model Training

3.4.1 The Impact of Sampling Ratio. We evaluate the impact of the sampling ratio based on different numbers of positive and unlabeled samples, treating unlabeled samples as negative. The standard classification algorithm we choose is Logistic Regression. The key metrics to watch are test recall and threshold, meaning the positive sample recall on the testing data set and the corresponding probability boundary. The numbers of positive and negative samples in the testing data set are 34657 and 72464. The evaluation result in Table 3 shows that once the ratio of positive to unlabeled samples reaches 1:2 (69331 positive and 134584 negative samples), the threshold doesn't change significantly when more unlabeled samples are added. Considering both training efficiency and effectiveness, it is practical to set the positive:negative sampling ratio to 1:2.

3.4.2 Sampling Techniques. For a general classification problem, at least some of the chosen negative samples should be close to the positive ones in order to determine where the class boundary is. Take "active learning" [23] as an example: the algorithm selects the samples that the model finds hardest to distinguish and hands them to a human expert to label.
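Section 3.4.1 describes the threshold only as the probability boundary that corresponds to the reported test recall. One possible reading, sketched below in plain Python, is to sample unlabeled users as negatives at a chosen positive:negative ratio and then take the largest score cutoff that still recovers a target fraction of the positives. The helper names (`subsample_unlabeled`, `threshold_at_recall`, `recall_at`) and the toy scores are hypothetical and are not taken from the paper.

```python
import math
import random


def subsample_unlabeled(positives, unlabeled, ratio=2, seed=0):
    """Draw ratio * len(positives) unlabeled users to be treated as negatives."""
    rng = random.Random(seed)
    k = min(len(unlabeled), ratio * len(positives))
    return rng.sample(unlabeled, k)


def recall_at(scores, labels, threshold):
    """Fraction of positive samples whose score is at or above the threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(1 for s in pos if s >= threshold) / len(pos)


def threshold_at_recall(scores, labels, target_recall):
    """Largest cutoff whose positive recall is at least target_recall."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("no positive samples")
    # Number of positives that must score at or above the cutoff.
    needed = max(1, math.ceil(target_recall * len(pos)))
    return pos[needed - 1]


if __name__ == "__main__":
    # Toy data only: classifier scores and labels (1 = seed user).
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
    labels = [1, 1, 0, 1, 1, 0]
    cutoff = threshold_at_recall(scores, labels, 0.74)
    print(cutoff, recall_at(scores, labels, cutoff))
```

Holding the target recall fixed and reporting the resulting cutoff makes runs with different amounts of unlabeled data comparable, which is how the threshold column is used in the discussion above.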
However, look-alike models deal with data that has no labelled negative samples, hence the goal of sampling is to pick out a reliable set of negative users.

Table 3: The Impact of Sampling Ratio

positive  unlabeled  train loss  test accuracy  test auc  test recall  threshold
69331     69331      0.4693      0.764          0.835     0.740        0.493
97064     97064      0.4692      0.767          0.840     0.740        0.498
138663    138663     0.4700      0.770          0.843     0.740        0.493
69331     95743      0.4650      0.769          0.837     0.740        0.518
69331     134584     0.4298      0.774          0.839     0.740        0.668
69331     197328     0.3874      0.776          0.839     0.740        0.678
69331     245811     0.3576      0.776          0.839     0.740        0.682

Besides randomly selecting negative samples and directly applying a standard classifier to the PU learning problem, we compare the effectiveness of three other sampling techniques: spy, pre-train and bootstrap sampling. The "Spy" [13][12] and "Pre-Train" sampling strategies are so-called "two-step" approaches [8], whose general idea is as follows: the first step identifies a subset of unlabeled samples that can be reliably labelled as negative; then the positive and negative samples are used to train a standard classifier that is applied to the remaining unlabeled samples. Usually the classifier is learned iteratively until it converges or some stopping criterion is met. Correspondingly, the "Spy" and "Pre-Train" sampling strategies are illustrated in Algorithms 1 and 2.

Algorithm 1: Spy Sampling
Input: Positive Sample Set P, Unlabeled Sample Set U
Output: Negative Sample Set N with size k
  Randomly select a subset from P as the spy set P′;
  Train a classifier M based on P − P′ and U + P′;
  Select a subset N of k samples from U with the least prediction scores;
  Return N;

Algorithm 2: Pre-Train Sampling
Input: Positive Sample Set P, Unlabeled Sample Set U, Validation Set V
Output: Negative Sample Set N with size k
  Randomly select a subset N with size k from U;
  while true do
    Randomly select a subset N′ from N;
    Train a classifier M based on P and N′, and evaluate the model on V;
    if the accuracy of M doesn't improve on V then
      Return N;
    Predict U using classifier M;
    Select a subset N of k samples with the least prediction scores;

A more sophisticated approach [18] is a variant of bagging, in which subsets of unlabeled samples are repeatedly bootstrapped from the unlabeled sample set U. The algorithm details are depicted in Algorithm 3. We set the number of iterations to T; in each iteration, a standard classifier responsible for predicting U is trained on the bootstrapped sample set U′ and the positive sample set P. The final predicted probability equals the average score over the T iterations.

Algorithm 3: Bootstrap Sampling
Input: Positive Sample Set P, Unlabeled Sample Set U
Output: Negative Sample Set N with size k
  for t = 1 to T do
    Bootstrap a subset U′ from U;
    Train a classifier M on P and U′;
    Predict U − U′ using classifier M;
    Record the classifying scores;
  Average the classifying scores of all iterations;
  Select a subset N of k samples with the least average scores;
  Return N;

Table 4 shows the experimental results of the different sampling approaches. The sampling parameter represents the percentage of unlabeled samples picked out as negative, and threshold indicates the corresponding probability boundary. The table shows that when the spy and bootstrap approaches sample half of the unlabeled data, they still guarantee almost the same level of recall on the testing data, whereas for the pre-train sampling approach the recall on test data is much lower. Regarding sampling efficiency, the spy approach needs only one iteration, whereas the other two need several rounds to converge. Therefore, it is both efficient and effective to utilize the spy sampling approach in our setting.

Table 4: The Impact of Sampling Approach

approach   sampling parameter  train loss  test accuracy  test auc  test recall  threshold
Random     0.9                 0.4250      0.775          0.847     0.766        0.633
Random     0.5                 0.4150      0.753          0.843     0.676        0.612
Spy        0.95                0.4138      0.775          0.847     0.768        0.640
Spy        0.9                 0.4141      0.776          0.847     0.763        0.633
Spy        0.5                 0.3660      0.775          0.845     0.768        0.607
Pre-Train  0.95                0.3677      0.775          0.845     0.771        0.624
Pre-Train  0.9                 0.3829      0.776          0.846     0.771        0.632
Pre-Train  0.5                 0.4382      0.702          0.839     0.632        0.628
Bootstrap  0.95                0.4127      0.775          0.847     0.768        0.638
Bootstrap  0.9                 0.4153      0.776          0.847     0.763        0.633
Bootstrap  0.5                 0.3976      0.775          0.845     0.766        0.640

Logistic Regression: Logistic Regression (LR) is probably the most widely used baseline model. Suppose there are n features $\{x_1, x_2, \dots, x_n\}$ and each $x_i$ is either 0 or 1, and consider an LR model without a regularization term:

$$ y = \mathrm{bias} + \beta^T X \tag{4} $$

where $\beta$ is the coefficient vector. This simple linear model misses the crucial feature crosses; therefore, the Degree-2 Polynomial (Poly2) model is often used to ease the problem:

$$ y = \mathrm{bias} + \beta^T X + X W X^T \tag{5} $$

where $W$ is a symmetric parameter matrix whose diagonal elements are all equal to 0.

Factorization Machine: In order to extract feature crosses while reducing the influence of high-dimensional sparse features, Rendle [22] proposes Factorization Machines to overcome the drawbacks of LR. In the Poly2 model, the number of parameters in matrix $W$ that need to be learned is $\frac{n(n-1)}{2}$. When n is 100,000, this is roughly 5 billion parameters. At the same time, when training the model using gradient descent optimization, the parameter $w_{ij}$ can only be trained when $x_i$ and $x_j$ are both non-zero; therefore there is a high demand on both the number of training samples and the memory space in the training phase. As a result, for high-dimensional sparse features, the parameter matrix $W$ is almost impossible to train.

To overcome this problem, we decompose $W$ into $VV^T$, where each $v_i$ in $V = (v_1, v_2, \dots, v_n)^T$ can be seen as a latent k-dimensional factor of the original feature. The Degree-2 FM model equation is defined as:

$$ y = \mathrm{bias} + \beta^T X + X V V^T X^T, \quad V \in \mathbb{R}^{n \times k} \tag{6} $$

Now the number of parameters that need to be estimated is $n \cdot k$, and they are easier to train even under sparsity, as the FM model breaks the independence of the interaction parameters by factorizing them.

4 EXPERIMENTS

4.1 Setup
Regarding the model implementation, we use MXNet (https://mxnet.apache.org/) on a stand-alone 1080TI GPU to compare different model effects and to determine the model parameters. When predicting the universal user pool consisting of nearly 2.5 billion users, we used distributed MXNet on an 80-core Hadoop cluster to re-train the model, and it took nearly 4 hours to finish the prediction for all users.

4.2 The Impact of Feature Engineering
Table 5 shows the impact of different feature engineering approaches. In this table, Time Slice indicates the strategy of calculating the user behaviour by time slice (None: no time slice; day: slice by day; holiday: slice by holiday and weekday; month: slice by month). For example, if we extract features of user activities by month, one typical feature could be that one specific user is impressed by an ad of "Maybelline" 5 times in July. In general, only activities happening in the last three months are extracted. Click indicates whether we distinguish click actions from impressions. The experimental results based on the LR model (training data volume: 428484; testing data volume: 107121; positive to negative ratio 1:2) show that if the features are calculated by month and click actions are separated from impressions, the AUC value reaches 0.8465 in the testing phase, which is the best among all settings. Therefore, this feature engineering strategy is applied with all the models that follow.

Table 5: The Impact of Feature Engineering

Feature Size  Time Slice  Click  Train AUC  Test AUC
144009        None        True   0.8721     0.8447
94932         None        False  0.8689     0.8443
249406        by holiday  True   0.8808     0.8445
196184        by month    True   0.8780     0.8465
133605        by month    False  0.8761     0.8461

4.3 Model Performance
In this section, the performance comparison of various models is introduced. The hyper-parameters configured for the different models are listed in Table 6. In this table, BN-MLP is a multi-layer perceptron with a batch normalization layer after each hidden layer; Scale-BN-MLP adds a scale layer before BN-MLP; lr and wd represent the learning rate and the L2 regularization parameter respectively.

Table 6: Hyper-parameter Setting

Model         Parameters
LR            lr=1e-4, wd=1e-6
FM            lr=1e-4, wd=3e-5, k=6
MLP           lr=1e-4, wd=3e-5
BN-MLP        lr=1e-4, wd=3e-5
Scale-MLP     lr=1e-4, wd=3e-5
Scale-BN-MLP  lr=1e-4, wd=3e-5

Figure 3: Model Performance. (a) Train Dataset; (b) Test Dataset. [Plots of AUC against epoch (0-20) for LR, FM, mlp, bn_mlp, s_mlp and s_bn_mlp.]

Figure 4: Comparison of Different Learning Rate. (a) Train Dataset; (b) Test Dataset. [Plots of AUC against epoch (0-40) for lr = 0.00001, 0.00005, 0.0001 and 0.0005.]

Table 7: Online A/B Testing Results

Metric           Random      F20-34     FEMALE     MALE       MODEL
Impression       18,367,151  6,493,314  3,910,355  1,454,655  1,221,095
Impression UV    8,578,859   3,152,614  2,052,897  912,468    594,456
Purchaser        597         123        117        29         217
Purchaser Rate   0.01%       0.00%      0.01%      0.00%      0.04%
Transaction      731         155        134        32         275
Sales            69,974      14,976     14,727     3,739      23,413
ATV              96          97         110        117        85
Media Cost       295,575     106,982    61,972     24,429     19,100
CPO              404.3       690.2      462.5      763.4      69.5
CPA              495.1       869.8      529.7      842.4      88.0
Incremental ROI  0.2         0.1        0.2        0.2        1.2

From the experiment results in Figure 3, we can see that the multi-layer perceptron performs better than LR and FM, and that adding the batch normalization layer and the scale layer both improve the performance and the convergence speed of the model. Therefore, Scale-BN-MLP outperforms the other models in AUC during the training phase.
Meanwhile, the convergence and adding the batch normalization layer and the scale layer can speed of Scale-BN-MLP (4 epochs) is the fastest one among all mod- both improve the model performance and convergence speed of the els, requiring early stopping to yield the optimal model in SIGIR 2019 eCom, July 2019, Paris, France J. Jiang et al. practice. The result confirms the derivation in section 3.3. Figure 4 [5] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. shows different learning rates for Scale-BN-MLP model in training DeepFM: A Factorization-machine Based Neural Network for CTR Prediction. In Proc. of IJCAI. 1725–1731. and testing data set, the convergence speed performs well when [6] Yuchin Juan, Damien Lefortier, and Olivier Chapelle. 2017. Field-aware Factor- learning rate equals to 0.0001(1e-4). ization Machines in a Real-world Online Advertising System. In Proc. of WWW Companion. 680–688. [7] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware 4.4 Online Effectiveness Evaluation Factorization Machines for CTR Prediction. In Proc. of RecSys. 43–50. [8] Azam Kaboutari, Shabestar Branch, Jamshid Bagherzadeh, Iran Urmia, and Fate- Regarding effectiveness evaluation in a real closed-loop business set- meh Kheradmand. 2014. An evaluation of two-step techniques for positive- ting, we corporate with a brand advertiser and a third-party adver- unlabeled learning in text classification. Int. J. Comput. Appl. Technol. Res 3, tising monitor supplier in order to conduct the online experiments. 592–594. [9] B. Kanagal, A. Ahmed, S. Pandey, V. Josifovski, L. Garcia-Pueyo, and J. Yuan. 2013. The final experiment results are shown at Table 7. There are several Focused matrix factorization for audience selection in display advertising. In important business metrics like Impression UV, Purchaser Rate, Proc. of ICDE. 386–397. 
ATV (Average Transaction Value), CPO (Cost Per Order), CPA (Cost [10] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. 2017. Positive-Unlabeled Learning with Non-Negative Risk Estimator. In Proc. of NIPS, Per Action) and Incremental ROI listed in this table. All indicators I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and of our model perform far better than traditional demographic-based R. Garnett (Eds.). 1675–1685. [11] R. Kumar, S. M. Naik, V. D. Naik, S. Shiralli, Sunil V.G, and M. Husain. 2015. approaches. Predicting clicks: CTR estimation of advertisements using Logistic Regression classifier. In 2015 IEEE International Advance Computing Conference (IACC). 1134– 5 CONCLUSIONS AND FUTURE WORK 1138. [12] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. 2003. Building text classifiers using In this paper, we showed an data application architect to utilize ad- positive and unlabeled examples. In Proc. of ICDM. 179–186. vertisement monitor data in audience expansion system for brand [13] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. 2002. Partially Supervised Classification of Text Documents. In Proc. of ICML. 387–394. advertisers, compared to traditional ad serving based on demograph- [14] Haishan Liu, David Pardoe, Kun Liu, Manoj Thakur, Frank Cao, and Chongzhe ics, the lookalike model in our application focuses on analysing Li. 2016. Audience Expansion for Online Social Network Advertising. In Proc. of user behaviour. Regarding the way of picking up the negative sam- KDD. 165–174. [15] Q. Ma, E. Wagh, J. Wen, Z. Xia, R. Ormandi, and D. Chen. 2016. Score Look-Alike ples from unlabeled data, we compared four sampling techniques Audiences. In Proc.of workshops on ICDM. 647–654. and the impact of different sampling ratios in order to figure out [16] Qiang Ma, Musen Wen, Zhen Xia, and Datong Chen. 2016. 
A Sub-linear, Massive- scale Look-alike Audience Extension System A Massive-scale Look-alike Au- the best setting. Meanwhile, to overcome the sparsity and high dience Extension. In Workshop on Big Data, Streams and Heterogeneous Source dimension of feature space, we proposed Scale-MLP, a modified Mining: Algorithms, Systems, Programming Models and Applications. MLP by adding a scale layer, although the training AUC is lower [17] Ashish Mangalampalli, Adwait Ratnaparkhi, Andrew O. Hatch, Abraham Bagher- jeiran, Rajesh Parekh, and Vikram Pudi. 2011. A Feature-pair-based Associative than other traditional learning strategies, however, it gains perfor- Classification Approach to Look-alike Modeling for Conversion-oriented User- mance improvement when generalizing the model to testing data targeting in Tail Campaigns. In Proc. of WWW. 85–86. while the efficiency of Scale-MLP is comparable to other approaches. [18] F. Mordelet and J. P. Vert. 2014. A Bagging SVM to Learn from Positive and Unlabeled Examples. Pattern Recogn. Lett. 37 (Feb. 2014), 201–209. Lastly we prove that the lookalike model outperforms traditional [19] Minh Nhut Nguyen, Xiao-Li Li, and See-Kiong Ng. 2011. Positive Unlabeled ad serving mechanisms in real business environment. Learning for Time Series Classification. In Proc. of IJCAI. 1421–1426. [20] Sandeep Pandey, Mohamed Aly, Abraham Bagherjeiran, Andrew Hatch, Peter Several directions exist for future research. The rich information Ciccolo, Adwait Ratnaparkhi, and Martin Zinkevich. 2011. Learning to Target: contained in the advertisement could be harnessed to investigate What Works for Behavioral Targeting. In Proc. of CIKM. 1805–1814. more sophisticated look-alike models. For example, we could in- [21] Archana Ramesh, Ankur Teredesai, Ashish Bindra, Sreenivasulu Pokuri, and Kr- ishna Uppala. 2013. Audience Segment Expansion Using Distributed In-database corporate advertising information including advertiser, brand and K-means Clustering. In Proc. 
of ADKDD. 5:1–5:9. product in order to explore more detailed feature interactions. For [22] Steffen Rendle. 2010. Factorization Machines. In Proc. of ICDM. 995–1000. different advertisers’ campaign, adaptive user feature representa- [23] Burr Settles. 2010. Active learning literature survey. Technical Report. [24] Jianqiang Shen, Sahin Cem Geyik, and Ali Dasdan. 2015. Effective Audience tion also need to be taken into consideration. Meanwhile, CTR Extension in Online Advertising. In Proc. of KDD. 2099–2108. prediction task will be a challenging and interesting problem under [25] M. Slaney and M. Casey. 2008. Locality-Sensitive Hashing for Finding Nearest Neighbors [Lecture Notes]. IEEE Signal Processing Magazine 25, 2 (March 2008), the setting of growing diversity in targeting users and cross-media 128–131. advertising platforms. CTR prediction results could be utilized for [26] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network the purpose of omni-channel uniform budget allocation to effec- for Ad Click Predictions. In Proc. of ADKDD. 12:1–12:7. [27] Jun Yan, Ning Liu, Gang Wang, Wen Zhang, Yun Jiang, and Zheng Chen. 2009. tively enhance ROI by matching brands/products with different How Much Can Behavioral Targeting Help Online Advertising?. In Proc. of WWW. media platforms. 261–270. [28] Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep Learning over Multi- field Categorical Data - - A Case Study on User Response Prediction. In Proc. of REFERENCES ECIR (Lecture Notes in Computer Science), Nicola Ferro, Fabio Crestani, Marie- [1] A. Bindra, S. Pokuri, K. Uppala, and A. Teredesai. 2012. Distributed Big Advertiser Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Data Mining. In Proc. of Workshops on ICDM. 914–914. Claudia Hauff, and Gianmaria Silvello (Eds.), Vol. 9626. 45–57. [2] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih- Jen Lin. 2010. 
Training and Testing Low-degree Polynomial Data Mappings via Linear SVM. JMLR 11 (Aug. 2010), 1471–1490. [3] Olivier Chapelle, Eren Manavoglu, and Romer Rosales. 2014. Simple and Scalable Response Prediction for Display Advertising. ACM Trans. Intell. Syst. Technol. 5, 4 (Dec. 2014), 61:1–61:34. [4] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In Proc. of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
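
As a sanity check on Table 7, the derived business metrics can be recomputed from the raw counts. The sketch below is illustrative only: the paper expands the acronyms but does not state the formulas, so the definitions used here (Purchaser Rate = Purchaser / Impression UV, ATV = Sales / Transaction, CPO = Media Cost / Transaction, CPA = Media Cost / Purchaser) are assumptions inferred from the table values, and the names `GROUPS` and `derived_metrics` are hypothetical. Incremental ROI is omitted because its definition is not given.

```python
# Raw counts copied from Table 7 (one entry per experiment group).
GROUPS = {
    "Random": dict(impression_uv=8_578_859, purchasers=597,
                   transactions=731, sales=69_974, media_cost=295_575),
    "F20-34": dict(impression_uv=3_152_614, purchasers=123,
                   transactions=155, sales=14_976, media_cost=106_982),
    "FEMALE": dict(impression_uv=2_052_897, purchasers=117,
                   transactions=134, sales=14_727, media_cost=61_972),
    "MALE":   dict(impression_uv=912_468, purchasers=29,
                   transactions=32, sales=3_739, media_cost=24_429),
    "MODEL":  dict(impression_uv=594_456, purchasers=217,
                   transactions=275, sales=23_413, media_cost=19_100),
}


def derived_metrics(g):
    """Recompute derived metrics under the ASSUMED definitions above."""
    return {
        # Share of unique exposed users who purchased, in percent.
        "purchaser_rate_pct": 100.0 * g["purchasers"] / g["impression_uv"],
        # Average Transaction Value: sales per transaction.
        "atv": g["sales"] / g["transactions"],
        # Cost Per Order: media cost per transaction.
        "cpo": g["media_cost"] / g["transactions"],
        # Cost Per Action: media cost per purchaser.
        "cpa": g["media_cost"] / g["purchasers"],
    }


if __name__ == "__main__":
    for name, counts in GROUPS.items():
        m = derived_metrics(counts)
        print(f"{name:7s} rate={m['purchaser_rate_pct']:.2f}% "
              f"ATV={m['atv']:.0f} CPO={m['cpo']:.1f} CPA={m['cpa']:.1f}")
```

Under these assumed definitions, the recomputed values agree with the Purchaser Rate, ATV, CPO and CPA rows of Table 7 after rounding, which suggests the lower CPO and CPA of the MODEL group follow directly from its higher purchaser and transaction counts per unit of media cost.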