Entire Cost Enhanced Multi-Task Model for Online-to-Offline Conversion Rate Prediction

Yingyi Zhang1, Xianneng Li1,*, Yahe Yu1, Jian Tang2, Huanfang Deng2, Junya Lu2, Yeyin Zhang2, Qiancheng Jiang2, Yunsen Xian2 and Liqian Yu2

1 Dalian University of Technology, Dalian, 116024, China
2 Meituan, Beijing, 100102, China

Abstract
Predicting users' conversion rate (CVR) is essential for ranking systems in industrial Online-to-Offline (O2O) applications. Numerous efforts have been made in CVR modeling to achieve state-of-the-art performance. However, existing methods mainly focus on the Business-to-Customer (B2C) scenario, so applying them to O2O meets with mixed success. Several scenario-specific challenges explain this. For example, O2O users in different locations generally encounter different candidate sets of surrounding stores, which makes users' behavioral regularity especially prominent. Besides, an O2O conversion includes a two-stage cost, i.e., the online order cost and the offline transportation cost. This suggests that users' location sensitivity deserves more attention than in conventional scenarios. Motivated by these characteristics, we propose a novel CVR prediction method for the O2O scenario, named Entire Cost enhanced Multi-task Model (ECMM): i) users' historical behavior sequences across different locations are modeled to capture their preference of behavioral regularity; ii) both the online order cost and the offline transportation cost are modeled to predict users' aggregated preference for conversion. By designing two novel attention mechanisms, i.e., convert attention and sliding window attention, ECMM can be trained end-to-end to fit O2O characteristics. Extensive experiments have been carried out on Meituan, a real-world industrial O2O platform. Both offline experiments and rigorous online A/B tests at the billion-level data scale demonstrate the superiority of the proposed ECMM over highly optimized state-of-the-art baselines.

Keywords
Online-to-Offline, Multi-Task Learning, Conversion Rate Prediction

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
yingyizhang@mail.dlut.edu.cn (Y. Zhang); xianneng@dlut.edu.cn (X. Li); yaheyu@dlut.edu.cn (Y. Yu); tangjian13@meituan.com (J. Tang); denghuanfang@meituan.com (H. Deng); lujunya@meituan.com (J. Lu); zhangyeyin@meituan.com (Y. Zhang); jiangqiancheng@meituan.com (Q. Jiang); xianyunsen@meituan.com (Y. Xian); yuliqian@meituan.com (L. Yu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

In the Online-to-Offline (O2O) scenario, industrial platforms generally rely on commission fees from successful conversions as profit. Hence, accurately predicting users' conversion rate (CVR) is essential for ranking systems in the O2O industry. However, the O2O scenario requires converting users not only from online click to online order, but also to final offline consumption [1, 2]. In other words, O2O users' behaviors follow a sequential pattern of impression→click→online order→offline consumption, which differs from that of other online e-commerce forms [3, 4, 5], i.e., Business-to-Customer (B2C). This raises several scenario-specific characteristics that make CVR prediction in O2O challenging, and conventional methods may not be perfectly suitable.

In this paper, we focus on two critical O2O characteristics summarized from our practice: i) online behavioral regularity. As a typical form of Location-Based Service (LBS), the O2O scenario provides an online ranking list that only considers the stores surrounding a user's location. The limited candidates require CVR modeling to grasp more accurately users' preference from historical behaviors for online conversion, since users' behaviors generally appear homogeneous on the platform across different locations [6, 7, 8], such as clicking or ordering stores with similar prices or distances shown online. ii) offline transportation regularity. Different from B2C purchases, which involve only the online order cost, O2O users must spend an additional transportation cost for the offline consumption [9, 2]. Since a user's preference for distance varies in different periods, the offline cost should be taken into account dynamically to predict the user's current transportation preference. This suggests that CVR modeling should additionally consider location-sensitive factors when capturing O2O users' preferences.

Although numerous efforts have been made in CVR modeling to achieve state-of-the-art industrial performance, existing methods such as ESMM and its variants focus on addressing sample selection bias and data sparsity under the B2C scenario [10, 11, 12, 13, 14, 15], and some methods that solve domain-specific problems have been proposed [16, 17, 18, 19]. However, the intrinsic characteristics of O2O, i.e., online behavioral regularity and offline transportation regularity, are rarely considered.

One possible strategy for learning users' online behavioral regularity and offline transportation regularity is to use statistical user features, i.e., the user's average online order cost and the user's average offline distance. However, in O2O scenarios the spatiotemporal nature is inseparable, and this strategy loses time-series information when characterizing user preferences. Therefore, sequence representation techniques are also taken into account, as shown in Figure 1.

Figure 1: The online order cost and offline transportation cost in user history. Such a sequence can represent the user's online order and offline transportation preferences as a time series. (Figure labels: offline stores that the user interacted with historically; offline transportation cost; online order cost.)

Hence, in this paper, we propose a novel CVR prediction method for the O2O scenario, named Entire Cost enhanced Multi-task Model (ECMM), to model users' aggregated preference from an online-offline cost perspective. Following the formulation of state-of-the-art CVR modeling, we focus on two auxiliary tasks, i.e., predicting the click-through rate (CTR) and the click-through conversion rate (CTCVR), which can be defined as follows:

$$p(cvr = 1 \mid ctr = 1, x) = \frac{p(ctcvr = 1 \mid x)}{p(ctr = 1 \mid x)}, \tag{1}$$

where $x$ is $(u, s, t, h^c, h^o)$, $u$ is the user, $s$ denotes the store, and $t$ represents the current context, such as the current time, city, day of the week, and other information that is independent of user and store. $h^c$ and $h^o$ are the user's historical click sequence and order sequence, respectively. We note that the first three terms are widely used in conventional CVR modeling, while $h^c$ and $h^o$ are two newly considered ones that assist in modeling behavioral and transportation regularities. Moreover, the user's click and order sequences in ECMM are used from the online-offline cost perspective, i.e., online order cost and offline transportation cost, which is essentially different from conventional CTR prediction methods that model the user's multiple interests [20, 21, 22, 23, 24, 25]. As a novel CVR prediction method for the O2O scenario, the contributions of ECMM are threefold:

• ECMM extends the observation dimensions by learning users' online conversion preferences from historical behavior sequences. A new mechanism named convert attention is proposed to learn the user's behavioral regularity from the global and local perspectives of online order cost.
• To the best of our knowledge, ECMM is the first method for CVR modeling from the perspective of offline transportation cost. We propose a new mechanism named sliding window attention to dynamically learn users' preference for offline transportation.
• ECMM is validated on a real-world industrial O2O platform, where extensive experiments are carried out. Both offline experiments and rigorous online A/B tests at the billion-level data scale demonstrate the significant superiority of ECMM over state-of-the-art baselines.

2. Related Work

Our work is closely related to traditional e-commerce CVR prediction, where state-of-the-art models are trained by multi-task learning. Besides, to capture users' behavioral regularity, our model considers users' historical behavior sequences, which relates it to user behavior sequence representation. In this section, we give a brief introduction to both.

2.1. CVR Prediction

Inspired by the success of deep learning, recent CVR prediction models have evolved from traditional approaches to deep approaches. Traditional methods used logistic regression [26, 27] and GBDT [28] to model the CVR problem with feature interactions. However, nonlinear relationships between features are not considered in these models. Modern deep learning based methods transform the CVR problem into a multi-task problem [10, 11, 12].
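The multi-task view rests on the identity in Eq. (1): the CVR given a click is the ratio of the CTCVR to the CTR, so a model that predicts the two auxiliary quantities determines the third. The following is a minimal numeric sketch of that identity only; the function names and the `eps` guard are illustrative assumptions, not part of the paper.

```python
def cvr_from_aux_tasks(p_ctr: float, p_ctcvr: float, eps: float = 1e-12) -> float:
    """Eq. (1): p(cvr=1 | ctr=1, x) = p(ctcvr=1 | x) / p(ctr=1 | x)."""
    if not 0.0 <= p_ctcvr <= p_ctr <= 1.0:
        raise ValueError("expected 0 <= p_ctcvr <= p_ctr <= 1")
    # eps (an assumption of this sketch) avoids division by zero when p_ctr is 0.
    return p_ctcvr / max(p_ctr, eps)


def ctcvr_from_ctr_cvr(p_ctr: float, p_cvr: float) -> float:
    """The same identity rearranged: p(ctcvr) = p(ctr) * p(cvr | click)."""
    return p_ctr * p_cvr


p_cvr = cvr_from_aux_tasks(p_ctr=0.2, p_ctcvr=0.05)
p_ctcvr = ctcvr_from_ctr_cvr(0.2, p_cvr)
```

Because a conversion requires a click, the CTCVR can never exceed the CTR, which is why the sketch rejects inputs that violate that ordering.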
ESMM [10] makes use of users' sequential actions, "impression → click → pay", to address sample selection bias and data sparsity over the entire space by simultaneously modeling the CTR and CTCVR tasks. ESM² [11] extends users' sequential actions to a more general situation, "impression → click → D(O)Action → pay", and simultaneously models CVR with the CTR, CTAVR and CTCVR tasks. HM³ [12] takes the "impression → click → D(O)Mi → D(O)Ma → pay" perspective and models CVR with the CTR, D-Mi, D-Ma and CTCVR tasks.

However, all these methods are based on B2C e-commerce platforms, so applying them to O2O platforms meets with mixed success. Users have unique sequential actions in O2O, which can be represented as "impression→click→online order→offline consumption". This requires a CVR model to consider not only users' online behavioral regularity, but also their offline transportation regularity.

2.2. User Behavior Sequence Representation

In the past decade, user behavior sequence representation has received much attention and achieved remarkable effectiveness. Many well-designed recommender methods have been proposed and have brought large commercial revenues for companies and advertisers. In these models, users' historical behaviors are transformed into low-dimensional vectors by embedding to represent users' interests and other characteristics. DIN [20] employs an attention mechanism to activate historical behaviors locally, which captures the user's diverse interests with respect to the given target item. DIEN [21] further proposes an auxiliary loss and an attention mechanism with GRU to capture the dynamic evolution of users' interests. DFN [29] jointly considers explicit/implicit and positive/negative feedback to learn unbiased user preferences. Moreover, inspired by the success of the self-attention architecture [30], the Transformer has been introduced for session CTR prediction [31]. MIND [32] and DMIN [33] model multiple interests with multiple vectors, using a dynamic routing mechanism and a capsule network.

Although all these user behavior sequence representation methods have brought a large boost to the business from the perspective of user interest, there is still room for improvement in modeling user behavior sequences from other perspectives. Cost sensitivity [34] is an indispensable aspect of user modeling, and e-commerce users often have certain restrictions on payment costs, which makes it possible to further improve user behavior sequence modeling from the perspective of cost.

3. The Proposed Approach

In this section, we introduce the proposed ECMM model. As shown in Figure 2, it consists of three modules: the base module, which includes the online user, offline store and context features; the entire cost module, which contains both the user's click and order sequences to capture the user's historical cost preference; and the cost combination module, which combines the online and offline costs to predict CTR and CVR. With this network, the model can capture the user's online behavioral and offline transportation regularities, which are hidden in users' historical behavior sequences. The details of each module are described as follows.

3.1. Motivation

As discussed in the previous section, users' online behavioral and offline transportation regularities are indispensable for O2O recommendation [9, 2, 20, 21]. However, how to define their relationship with users' behavior sequences, and how to embody both online and offline cost in a unified framework for CVR prediction, remains unexplored.

On the one hand, we propose a novel CVR prediction method from the perspective of users' historical behavior. We propose convert attention to extract the local and global preferences of users' online-to-offline behaviors from both depth and breadth perspectives. From a local view, an order placed by a user is affected by clicks, so we design the local impact of a click on an order from the store perspective. From a global perspective, the user's overall order sequence is influenced by the click sequence in terms of id, price, and relative distance. On the other hand, to model users' transportation cost, we capture the information in the distance sequence that reflects users' preference for offline cost in the O2O scenario, to assist the model in learning users' conversion preference in the offline stage. Each store in a user's historical clicks and orders has a distance feature, which represents the offline transportation cost. We then use the sliding window attention method to calculate the user's dynamic preference for offline cost at different timestamps.

3.2. Base Module

The base module is used to aggregate the basic features. Following [10, 11, 12], the embedding and MLP (multilayer perceptron) structures are used in the base module. The user, store, and contextual features ($u \in \mathbb{R}^{n_u}$, $s \in \mathbb{R}^{n_s}$, and $t \in \mathbb{R}^{n_t}$) are the inputs of the base module and are mapped into a $d$-dimensional space via embedding operations. An MLP is used to learn the aggregated vector $b$ of basic features, with ELU [35] as the activation function:

$$b = ELU(MLP(Embedding(u, s, t))). \tag{2}$$
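The base module of Eq. (2) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the MLP is reduced to a single dense layer, the weights are random, and the table sizes, feature ids and helper names (`embed`, `dense`, `base_module`) are hypothetical. Only the structure (embedding lookup, concatenation, MLP, ELU) follows the text.

```python
import math
import random


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU activation: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def embed(table, ids):
    """Look up one d-dimensional row per feature id and concatenate them."""
    out = []
    for i in ids:
        out.extend(table[i])
    return out


def dense(weights, bias, x):
    """One fully connected layer: weights is (out_dim x in_dim), x has in_dim entries."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def base_module(tables, u_ids, s_ids, t_ids, weights, bias):
    """b = ELU(MLP(Embedding(u, s, t))), with a single-layer MLP for brevity."""
    x = (embed(tables["user"], u_ids)
         + embed(tables["store"], s_ids)
         + embed(tables["context"], t_ids))
    return [elu(h) for h in dense(weights, bias, x)]


rng = random.Random(0)
d, vocab, hidden = 4, 10, 8                      # toy sizes, not the paper's
tables = {name: [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(vocab)]
          for name in ("user", "store", "context")}
u_ids, s_ids, t_ids = [1], [2, 3], [4]           # hypothetical categorical feature ids
in_dim = d * (len(u_ids) + len(s_ids) + len(t_ids))
weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
bias = [0.0] * hidden
b = base_module(tables, u_ids, s_ids, t_ids, weights, bias)
```

The resulting vector `b` is what the later modules concatenate with the cost features.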
Figure 2: The structure of ECMM. Two auxiliary costs are introduced to model the entire cost: i) the online convert cost calculates the behavioral regularity of users when they face the price and distance shown online; ii) the offline convert cost calculates the transportation cost for offline consumption. (Figure labels: base module with user, store and context features; entire cost module with sparse attention, convert attention (local convert, global convert) and sliding window attention over an N-block Transformer; combination module with the CTR network, the CVR network and the CTR, CVR and CTCVR outputs.)

3.3. Entire Cost Module

Different from B2C purchases, the O2O scenario generally considers only the stores surrounding a user's location. The limited candidates reduce the chance of matching users' preferences, so it is critical to accurately capture the user's behavioral regularity from historical behaviors. Meanwhile, O2O users need to consider two-stage costs for decision making, i.e., the online order cost and the offline transportation cost, both of which should be modeled. The entire cost module is designed to solve these problems and is the most important part of the ECMM model. It contains two parts: the online cost feature module and the offline cost feature module.

Online Cost Feature Module. Each store in the user's click or order sequence has side-information features of id $s^{id}$, distance $s^{dis}$ and price $s^{price}$, which represent the cost the user accepted when deciding to click or order an offline store on the online platform. We then have the embedding of the $i$-th store in the user's historical behavior:

$$h^{c}_{i} = Embedding(s^{id}_{i}, s^{dis}_{i}, s^{price}_{i}), \quad h^{c}_{i} \in \mathbb{R}^{3d}, \tag{3}$$

$$h^{o}_{i} = Embedding(s^{id}_{i}, s^{dis}_{i}, s^{price}_{i}), \quad h^{o}_{i} \in \mathbb{R}^{3d}. \tag{4}$$

Thus, the user's historical click and order behavior sequences, i.e., $H^c$ and $H^o$, can be represented as follows:

$$H^{c} = concat(h^{c}_{1}, h^{c}_{2}, ..., h^{c}_{k}), \quad H^{c} \in \mathbb{R}^{k \times 3d}, \tag{5}$$

$$H^{o} = concat(h^{o}_{1}, h^{o}_{2}, ..., h^{o}_{k}), \quad H^{o} \in \mathbb{R}^{k \times 3d}, \tag{6}$$

where $k$ denotes the length of the user's click and order sequences.

After embedding, sparse attention is used to capture the user's historical preference under contextual restriction. The sparse attention takes the embeddings of the user's current context feature and of the click and order sequences as input, and retrieves the most important click and order behaviors in the current context. The sparse attention [36] is defined as follows:

$$SparseAttn(Q, K, V) = softmax\left(topn\left(\frac{QK^{T}}{\sqrt{k}}\right)\right)V, \tag{7}$$

where the $topn$ operation keeps the top $n$ pieces of historical information most relevant to the current context. Through the sparse attention, we obtain the updated embeddings of the user's click and order sequences:

$$H^{a}_{c} = SparseAttn(Q^{s}, K^{c}, V^{c}), \quad H^{a}_{c} \in \mathbb{R}^{k \times 3d}, \tag{8}$$

$$H^{a}_{o} = SparseAttn(Q^{s}, K^{o}, V^{o}), \quad H^{a}_{o} \in \mathbb{R}^{k \times 3d}, \tag{9}$$

where $Q^{s}$ is the query vector obtained from the context features, $\{K^{c}, V^{c}\}$ are the key and value vectors obtained from the user's click sequence, and $\{K^{o}, V^{o}\}$ are obtained likewise from the order sequence.

To better capture the impact of the user's click sequence $H^{a}_{c}$ on the order sequence $H^{a}_{o}$ from the retrieved click and order information, we propose a convert attention mechanism that captures these impacts from both local and global perspectives.

From a local perspective, the user's preference for converting at store $h^{a}_{o,i} \in H^{a}_{o}$ can be characterized by the clicked stores $h^{a}_{c,i} \in H^{a}_{c}$ related to where the order was placed:

$$\beta^{j}_{i} = (W^{l}_{c} \times h^{a}_{c,i}) \otimes (W^{l}_{o} \times h^{a}_{o,j})^{T}, \tag{10}$$

$$s^{l}_{o,j} = \sum_{i=1}^{k} \frac{\exp(\beta^{j}_{i})}{\sum_{o=1}^{k} \exp(\beta^{j}_{o})} \times h^{a}_{c,i} + h^{a}_{o,j}, \quad s^{l}_{o,j} \in \mathbb{R}^{3d}, \tag{11}$$

where $W^{l}_{c}, W^{l}_{o} \in \mathbb{R}^{3d \times 3d}$ are trainable parameters.
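The local convert attention of Eqs. (10) and (11) can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that the product in Eq. (10) reduces to a scalar score per (click, order) pair, computed here as a dot product of the two projected vectors, and that the weights in Eq. (11) are a softmax over the clicked stores. The function and variable names are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_convert(H_c, H_o, W_c, W_o):
    """For every order j: softmax over clicks i of beta_i^j, then a weighted
    sum of click embeddings plus the residual order embedding (Eq. 11)."""
    proj_c = [matvec(W_c, h) for h in H_c]
    out = []
    for h_o in H_o:
        p_o = matvec(W_o, h_o)
        # Assumption: beta_i^j is a scalar similarity between projected vectors.
        beta = [dot(p_c, p_o) for p_c in proj_c]
        m = max(beta)                             # subtract max for stability
        w = [math.exp(b - m) for b in beta]
        z = sum(w)
        agg = [sum(w_i * h_c[t] for w_i, h_c in zip(w, H_c)) / z
               for t in range(len(h_o))]
        out.append([a + r for a, r in zip(agg, h_o)])   # residual connection
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]               # toy projection matrices
H_c = [[1.0, 0.0], [0.0, 1.0]]                    # two clicked stores
H_o = [[0.0, 0.0], [10.0, 0.0]]                   # two ordered stores
S = local_convert(H_c, H_o, identity, identity)
```

In the toy input, the first order is equally similar to both clicks and receives their average, while the second order is dominated by the first click and keeps its own embedding through the residual term.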
In Eqs. (10) and (11), $\beta^{j}_{i}$ represents the correlation between clicked store $i$ and ordered store $j$, and $s^{l}_{o,j}$ aggregates the information of the clicked stores to obtain the local conversion preference and update the information of the ordered store. Here, we use a residual design to retain the original information of the ordered store.

From a global perspective, the user's preferences for different dimensions (i.e., the store's id, price, distance) of ordered stores are affected by the relevant information of the clicked stores. Hence, we separate the corresponding submatrices from the click and order sequences:

$$H^{a}_{id} = (s^{a,id}_{i}), \quad H^{a}_{dis} = (s^{a,dis}_{i}), \quad H^{a}_{price} = (s^{a,price}_{i}), \quad H^{a}_{ctx_j} \in \mathbb{R}^{k \times d}. \tag{12}$$

For each dimension, we calculate the impact of the user's click sequence on the user's order sequence from a global perspective:

$$\gamma^{ctx_j}_{ctx_i} = (W^{g}_{c} \times H^{a}_{c,ctx_i}) \otimes (W^{g}_{o} \times H^{a}_{o,ctx_j})^{T}, \tag{13}$$

$$H^{g}_{o,ctx_j} = \sum_{ctx_i} \frac{\exp(\gamma^{ctx_j}_{ctx_i})}{\sum_{ctx_i} \exp(\gamma^{ctx_j}_{ctx_i})} H^{a}_{c,ctx_i} + H^{a}_{o,ctx_j}, \quad H^{g}_{o,ctx_j} \in \mathbb{R}^{k \times d}, \tag{14}$$

where $ctx_i, ctx_j \in (id, dis, price)$, $W^{g}_{c}, W^{g}_{o} \in \mathbb{R}^{d \times d}$ are trainable parameters, $\gamma^{ctx_j}_{ctx_i}$ represents the correlation between the click sequence in dimension $ctx_i$ and the order sequence in dimension $ctx_j$, and $H^{g}_{o,ctx_j}$ aggregates the additional click information to obtain the global conversion preference and update the order sequence. The residual design is also used in this part.

Finally, the aggregated representations of the order sequence and the click sequence are obtained:

$$h^{o} = Meanpooling(\Vert_{j} (s^{l}_{o,j}) + \Vert_{ctx_j} (H^{a}_{o,ctx_j})), \quad h^{o} \in \mathbb{R}^{3d}, \tag{15}$$

$$h^{c} = Meanpooling(H^{a}_{c}), \quad h^{c} \in \mathbb{R}^{3d}, \tag{16}$$

where $\Vert$ denotes the concatenation of vectors.

Offline Cost Feature Module. In the O2O scenario, offline transportation costs also play an important role in the conversion rate, as users need to go to offline stores. We first construct the user's historical behavior sequences that represent the transportation costs of the user's historical clicks and orders, and take them as the input of an $N$-layer Transformer encoder:

$$T^{dis}_{c} = Transformer(H^{dis}_{c}), \quad T^{dis}_{c} \in \mathbb{R}^{k \times d}, \tag{17}$$

$$T^{dis}_{o} = Transformer(H^{dis}_{o}), \quad T^{dis}_{o} \in \mathbb{R}^{k \times d}. \tag{18}$$

We propose a sliding window attention mechanism that uses fixed-length windows to characterize the user's preference for transportation cost in different periods, because this preference varies over time. Note that the mechanism generalizes beyond O2O platform users to other scenarios that need to capture users' dynamic preferences in different periods.

The current store has a distance feature $s^{dis} \in \mathbb{R}^{d}$, which we match against the user's historical distance sequence:

$$D_{j,i} = T^{dis}_{j,i:i+ws}, \quad D_{j,i} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{19}$$

$$A_{j,i} = softmax\left(\frac{D_{j,i}\, s^{dis}}{\sqrt{ws}}\right), \quad A_{j,i} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{20}$$

$$M^{dis}_{j} = \sum_{i=1}^{k} A^{w}_{j,i} D_{j,i}, \quad M^{dis}_{j} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{21}$$

where $ws \in \mathbb{N}$ denotes the window length, $D_{j,i}$ denotes the subsequence in the $i$-th window, $M^{dis}_{j}$ denotes the matrix of the user's offline preference along the window-length dimension, and $m^{dis}_{j} = Flatten(M^{dis}_{j})$ denotes the user's offline preference vector.

3.4. Cost Combination Module

In this section, we embody the CTR and CVR prediction tasks in a multi-task framework. The input of this module is the concatenation of the outputs of the base module and the entire cost module. $r_{ctr}$ and $r_{cvr}$ are each calculated by an MLP network:

$$r_{ctr} = ELU(MLP([b, h^{c}, h^{o}, m^{dis}_{c}, m^{dis}_{o}])), \tag{22}$$

$$r_{cvr} = ELU(MLP([b, h^{c}, h^{o}, m^{dis}_{c}, m^{dis}_{o}])). \tag{23}$$

We then calculate the post-view click-through and conversion rate (CTCVR) by $r_{ctcvr} = r_{ctr} * r_{cvr}$. The loss function used here is lambda loss [37].
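The sliding window attention of Eqs. (19) to (21) can be illustrated with a small sketch. Several details are assumptions of this sketch rather than statements from the paper: windows move with stride 1 and without padding (so there are k - ws + 1 windows), each position in a window receives one scalar score (the dot product of its row with the current store's distance embedding), and the Transformer output `T` is taken as given. The names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)                               # subtract max for stability
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]


def sliding_window_attention(T, s_dis, ws):
    """T: k x d encoded distance sequence, s_dis: d-dim distance embedding of
    the current store, ws: window length. Returns Flatten(M) of size ws * d."""
    k, d = len(T), len(s_dis)
    if not 1 <= ws <= k:
        raise ValueError("window length must satisfy 1 <= ws <= k")
    M = [[0.0] * d for _ in range(ws)]
    for i in range(k - ws + 1):                   # assumption: stride 1, no padding
        D = T[i:i + ws]                           # Eq. (19): the i-th window
        scores = [sum(a * b for a, b in zip(row, s_dis)) / math.sqrt(ws)
                  for row in D]
        A = softmax(scores)                       # Eq. (20), one weight per position
        for p in range(ws):                       # Eq. (21): accumulate A * D
            for t in range(d):
                M[p][t] += A[p] * D[p][t]
    return [v for row in M for v in row]          # m = Flatten(M)


T = [[1.0], [1.0], [1.0]]                         # toy sequence, k = 3, d = 1
m_dis = sliding_window_attention(T, s_dis=[0.0], ws=2)
```

With identical rows and a zero query, every window assigns equal weights, so each of the two window positions accumulates 0.5 from each of the two windows.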
4. Experiments

In this section, we evaluate the performance of the proposed ECMM. We describe the experimental settings and experimental results as follows.

4.1. Experimental Settings

Datasets. We selected 30 days of exposure logs from August to September, obtained from the online O2O business system, to train the CVR model. We have two test sets: one is a one-day dataset in September and the other covers three days in October. User behavior evolves with time: the closer a period is to the training data, the closer its distribution of user behavior is to that of the training data, and the longer the gap, the more the distribution changes. Therefore the test sets in this experiment can effectively evaluate both the accuracy and the generalization of the model. The number of training samples is approximately 1.1 billion, while the testing sets contain 40 million and 100 million samples, respectively.

Metric. The goal of our ranking task is to provide a list that is more likely to facilitate users' conversion. The evaluation metric used in this paper is NDCG. We have two ranking strategies: sorting by CTR and sorting by CTCVR. So we have the NDCG sorted by CTR to predict the real click rate and the NDCG sorted by CTCVR to predict the real purchase rate. The calculation is as follows:

$$NDCG = \frac{DCG}{IDCG} = \frac{\sum_{j=1}^{n} (2^{r} - 1)/\log(1 + j)}{\sum_{j=1}^{|rel|} (2^{r} - 1)/\log(1 + j)}, \tag{24}$$

where $n$ represents the length of the list of stores ranked by the model, $r$ represents the label of the sample (click or order, depending on the model task), and $|rel|$ represents the number of stores whose label is not zero.

Compared Methods. Our baseline is a highly optimized ESMM model that incorporates a large number of business features and handcrafted features. The total number of features is 473. The embedding dimension $d$ is 10. We use sequence features from 180 days of users' history, and the length $k$ is 50. The number of Transformer layers $N$ is 2. Because 80% of users have a click sequence shorter than 10 and an order sequence shorter than 5, and considering the service performance, we set the $n$ of the sparse attention to 10. The dimension of the MLP used in the base module is 1024, and the dimensions of the four-layer MLP used by the CTR and CVR networks are 512, 256, 128 and 1, with the ELU activation function. All baselines take into account the statistical user features of online and offline costs for a fair comparison. We conduct comparative experiments with three categories of methods:

(1) Baselines: a) ESMM [10]. An outstanding multi-task model for learning CTR and CVR in the industry. b) ESMM+DIN [20]. Based on ESMM, users' click sequence features and the current store features are introduced by the DIN method.

(2) Ablation: a) ECMM w/o offline and convAttn. Based on ECMM, we only use the online convert cost, without convert attention. b) ECMM w/o offline. Based on ECMM, we only use the online convert cost. c) ECMM w/o online and slidWinAttn. Based on ECMM, we only use the offline convert cost, without sliding window attention. d) ECMM w/o online. Based on ECMM, we only use the offline convert cost.

(3) ECMM variants: a) ECMM+dualInfo. Based on ECMM, the convert attention passes not only click sequence information to the order sequence but also order sequence information to the click sequence. b) ECMM+sepInput. Based on ECMM, we use the click features as the input of the CTR network and the order features as the input of the CVR network.

Table 1: Offline experimental results on two testing sets.

                                 September               October                 October improvement
Models                           CTR-NDCG    CTCVR-NDCG  CTR-NDCG    CTCVR-NDCG  CTR-NDCG    CTCVR-NDCG
ESMM                             0.7560      0.8446      0.7515      0.8455      0.00%       0.00%
ESMM+DIN                         0.7577      0.8456      0.7528      0.8463      0.17%       0.09%
ECMM w/o offline and convAttn    0.7575      0.8458      0.7525      0.8464      0.13%       0.11%
ECMM w/o offline                 0.7574      0.8462      0.7532      0.8469      0.23%       0.17%
ECMM w/o online and slidWinAttn  0.7577      0.8458      0.7533      0.8467      0.24%       0.14%
ECMM w/o online                  0.7579      0.8463      0.7534      0.8471      0.25%       0.19%
ECMM+dualInfo                    0.7576      0.8462      0.7533      0.8469      0.24%       0.17%
ECMM+sepInput                    0.7581      0.8465      0.7537      0.8472      0.29%       0.20%
ECMM                             0.7585      0.8480      0.7541      0.8487      0.34%       0.38%

4.2. Offline Performance

The evaluation metrics used in this paper are CTR-NDCG and CTCVR-NDCG. Table 1 shows the experimental results of the compared methods on the two testing sets, from which we observe the following.

For the entire cost module, compared with ESMM, ECMM obtains a 0.35% gain on CTR-NDCG and a 0.38% gain on CTCVR-NDCG (see footnote 1). All other ablation methods and variants also improve the model performance after modeling users' behavior sequences.

Footnote 1: For large-scale datasets in industrial recommender systems, the improvement is considerable because of its hardness, and the testing results in Section 3.3 further verify the significant improvement of our proposal.

For the online cost feature, compared with ESMM, ESMM+DIN, which adds the click sequence, shows a certain increase in CTR-NDCG and CTCVR-NDCG. As shown in Figure 3, ECMM w/o offline and convAttn, which further adds the order sequence, slightly decreases the CTR-NDCG but greatly improves the CTCVR-NDCG. ECMM w/o offline indicates that the convert attention mechanism can learn users' order characteristics from click to order. These three methods show that utilizing historical features is effective for improving CVR prediction. The convert attention brings 0.18% and 0.19% gains in CTR-NDCG and CTCVR-NDCG.

Figure 3: Improvement in conversion rate prediction from online behavioral regularity in October.

For the offline cost feature, the ECMM w/o online and slidWinAttn model, which uses distance sequence features, brings stronger effects and improves both CTR-NDCG and CTCVR-NDCG. As shown in Figure 4, comparing ECMM w/o online and slidWinAttn with ESMM, it can be seen that the offline transportation cost is indispensable for the conversion rate prediction of an O2O platform. The ECMM w/o online model, which adds our proposed sliding window attention, brings greater gains by dynamically matching user preferences in different periods. The sliding window method brings 0.02% and 0.05% gains in CTR-NDCG and CTCVR-NDCG.

Figure 4: Improvement in conversion rate prediction from offline transportation regularity in October.

To explore whether the user's historical orders affect clicks, we further study the ECMM+dualInfo model, in which the order sequence also transmits information to the click sequence. It can be seen that the CTR-NDCG decreases by 0.05% and the CTCVR-NDCG decreases by 0.06%. We also separate the click and order features into the CTR network and the CVR network to obtain the ECMM+sepInput model, in order to verify the impact of features on different tasks, and find that separating the features reduces model performance.

To verify that our model generalizes instead of fitting users over a certain period, we further evaluate our method on a test set in October. The results are consistent with the assessment in September. The ECMM model shows that considering users' online behavioral and offline transportation regularities is helpful in predicting users' current CTR and CTCVR.

4.3. Online Evaluations

An online A/B test was conducted in the recommender system over 7 days in January 2022. For the control group, 10% of users were randomly assigned and served by a recommender system based on a highly optimized ESMM algorithm. For the experimental group, 10% of users were randomly selected to use the ECMM method. In the online experiment, we choose CTR and CTCVR as evaluation indicators, where CTCVR represents the purchase rate of each request. The result is shown in Figure 5. We can see that our proposed ECMM method improves the CTR by 0.52% (p-value=0.00<0.05) compared with the baseline model, and the CTCVR by 0.73% (p-value=0.02<0.05), with a 1.8% (p-value=0.02<0.05) increase in total revenue. Here, total revenue increases to 1.8% with a 0.45% increase in CTCVR means the model provides users with higher price list. So far, the ECMM method has been applied to the main online traffic and has served more than hundreds of millions of users, bringing a significant increase in the total revenue of Meituan.

Figure 5: Online performance. The improvements of CTCVR and CTR are significant with the significance level α=0.05.
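The NDCG metric of Eq. (24), used for both CTR-NDCG and CTCVR-NDCG above, can be sketched as follows. The choice of a base-2 logarithm is an assumption (the paper writes only "log"); it does not affect the result, because the base cancels in the ratio. The function names are illustrative.

```python
import math


def dcg(labels):
    """DCG of labels listed in ranked order: sum of (2^r - 1) / log(1 + j)."""
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(labels, start=1))


def ndcg(ranked_labels):
    """NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (sorted) order.
    Zero labels add nothing, so summing over the sorted list equals summing
    over the |rel| non-zero items as in Eq. (24)."""
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


# Labels are 1 for a click (or order) and 0 otherwise, listed in model order.
perfect = ndcg([1, 0, 0])
demoted = ndcg([0, 1])
```

For CTR-NDCG the list is sorted by predicted CTR with click labels; for CTCVR-NDCG it is sorted by predicted CTCVR with order labels.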
We separate the click and the sliding window attention in the cost module to learn users’ order features into the CTR network and CVR network to online behavioral regularity and offline transportation obtain the ECMM+sepInput model to verify the feature regularity. At the same time, offline experiments have impact of different task, and found that separate features proved the effectiveness of our proposed method to learn will reduce model performance. users’ conversion from users’ click sequence to order To verify the generalization of our model instead of sequence, and the accuracy of the ranking list is im- fitting users over a certain period, we further evaluate proved by evaluating NDCG. Online experiments show our method on a test set in October. The results are that ECMM method has a significant effect on improv- ing the total revenue of the O2O platform. For now, the 10.1145/2424321.2424348. doi:10.1145/2424321. ECMM method has been applied to the main online traf- 2424348. fic, bringing a significant increase in the total revenue of [7] J. Huang, K. Hu, Q. Tang, M. Chen, Y. Qi, J. Cheng, the enterprise. J. Lei, Deep position-wise interaction network for ctr prediction, in: Proceedings of the 44th Inter- national ACM SIGIR Conference on Research and Acknowledgments Development in Information Retrieval, 2021, pp. 1885–1889. This research was supported by the National Natural Sci- [8] Y. Ping, C. Gao, T. Liu, X. Du, H. Luo, D. Jin, Y. Li, ence Foundation of China (NSFC) under Grant 72071029, User consumption intention prediction in meituan, 71974031 and 72231010. This research was also supported in: Proceedings of the 27th ACM SIGKDD Confer- by Meituan. ence on Knowledge Discovery & Data Mining, 2021, pp. 3472–3482. References [9] Z. Fang, B. Gu, X. Luo, Y. Xu, Contemporaneous and delayed sales impact of location-based mobile pro- [1] X. Ding, J. Tang, T. Liu, C. Xu, Y. Zhang, F. Shi, motions, Information Systems Research 26 (2015) Q. Jiang, D. 
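The offline comparisons above are reported as CTR-NDCG and CTCVR-NDCG. As an illustration only, the following minimal sketch computes NDCG@k for one ranked request. It assumes binary relevance labels (1 if the item was clicked, or clicked and ordered, 0 otherwise) and the common exponential-gain, log2-discount form of DCG; the exact labeling and cutoff used in the experiments are not stated in this section, and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain of labels that are already in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so discount is log2(rank + 2)
        for rank, rel in enumerate(labels[:k])
    )


def ndcg_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """NDCG@k for one request: rank items by predicted score, compare with the ideal order."""
    # Sort item indices by descending predicted score (assumed: higher score = ranked earlier).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_labels = sorted(labels, reverse=True)
    ideal = dcg_at_k(ideal_labels, k)
    # A request with no positive label has no defined NDCG; return 0.0 by convention here.
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical request with three candidate stores; only the second one converted.
    predicted = [0.9, 0.2, 0.5]
    converted = [0, 1, 0]
    print(ndcg_at_k(predicted, converted, k=3))
```

Under these assumptions, a per-request value would be averaged over all requests in the test set, with click labels giving a CTR-NDCG-style number and click-and-order labels giving a CTCVR-NDCG-style number.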