=Paper= {{Paper |id=Vol-3317/paper4 |storemode=property |title=Entire Cost Enhanced Multi-Task Model for Online-to-Offline Conversion Rate Prediction |pdfUrl=https://ceur-ws.org/Vol-3317/Paper4.pdf |volume=Vol-3317 |authors=Yingyi Zhang,Xianneng Li,Yahe Yu,Jian Tang,Huanfang Deng,Junya Lu,Yeyin Zhang,Qiancheng Jiang,Yunsen Xian,Liqian Yu |dblpUrl=https://dblp.org/rec/conf/cikm/ZhangLYTDLZJXY22 }} ==Entire Cost Enhanced Multi-Task Model for Online-to-Offline Conversion Rate Prediction== https://ceur-ws.org/Vol-3317/Paper4.pdf
Entire Cost Enhanced Multi-Task Model for
Online-to-Offline Conversion Rate Prediction
Yingyi Zhang1, Xianneng Li1,*, Yahe Yu1, Jian Tang2, Huanfang Deng2, Junya Lu2,
Yeyin Zhang2, Qiancheng Jiang2, Yunsen Xian2 and Liqian Yu2
1 Dalian University of Technology, Dalian, 116024, China
2 Meituan, Beijing, 100102, China


Abstract
Predicting users’ conversion rate (CVR) is essentially important for ranking systems in industrial Online-to-Offline (O2O) applications. Numerous efforts have been made in CVR modeling to achieve state-of-the-art performance. However, existing methods mainly focus on the Business-to-Customer (B2C) scenario, so applying them to O2O meets with mixed success. This can be revealed via several scenario-specific challenges. For example, O2O users in different locations generally encounter different candidates of surrounding stores. This makes users’ behavioral regularity especially prominent. Besides, O2O users’ conversion includes a two-stage cost, i.e., online order cost and offline transportation cost. This suggests that users’ location sensitivity deserves additional attention compared with conventional scenarios. Motivated by these characteristics, we propose a novel CVR prediction method for the O2O scenario, named Entire Cost enhanced Multi-task Model (ECMM): i) users’ historical behavior sequences across different locations are modeled to capture the users’ preference of behavioral regularity; ii) both online order cost and offline transportation cost are modeled to predict the users’ aggregated preference for conversion. By designing two novel attention mechanisms, i.e., convert attention and sliding window attention, ECMM can be trained end-to-end to appropriately fit O2O characteristics. Extensive experiments have been carried out on a real-world industrial O2O platform, Meituan. Both offline and rigorous online A/B tests under the billion-level data scale demonstrate the superiority of the proposed ECMM over the highly optimized state-of-the-art baselines.

Keywords
Online-to-Offline, Multi-Task Learning, Conversion Rate Prediction



DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
Email: yingyizhang@mail.dlut.edu.cn (Y. Zhang); xianneng@dlut.edu.cn (X. Li); yaheyu@dlut.edu.cn (Y. Yu); tangjian13@meituan.com (J. Tang); denghuanfang@meituan.com (H. Deng); lujunya@meituan.com (J. Lu); zhangyeyin@meituan.com (Y. Zhang); jiangqiancheng@meituan.com (Q. Jiang); xianyunsen@meituan.com (Y. Xian); yuliqian@meituan.com (L. Yu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

1. Introduction

In the Online-to-Offline (O2O) scenario, industrial platforms generally rely on commission fees of successful conversion as profit. Hence, how to accurately predict users’ conversion rate (CVR) is essentially important for ranking systems in the O2O industry. However, the O2O scenario requires the conversion of users from not only online click to online order, but also to final offline consumption [1, 2]. In other words, O2O users’ behaviors follow a sequential pattern of impression→click→online order→offline consumption, which is somewhat different from that of other online e-commerce forms [3, 4, 5], i.e., Business-to-Customer (B2C). This raises several scenario-specific characteristics that make CVR prediction of O2O challenging, whereas conventional methods may not be perfectly suitable.

In this paper, we focus on two critical O2O characteristics summarized from our real practice: i) online behavioral regularity. As a typical form of Location-Based Service (LBS), the O2O scenario provides an online ranking list that only considers surrounding stores of a user’s location. The limited candidates require CVR modeling to more accurately grasp users’ preference of historical behaviors for online conversion, since users’ behaviors generally appear homogeneously on the platform in different locations [6, 7, 8], such as clicking/ordering stores with similar prices or distances shown online. ii) offline transportation regularity. Different from B2C purchases with only online order cost, O2O users should spend additional transportation cost for the offline consumption [9, 2]. Since the user’s preference for distance varies in different periods, offline cost should be counted for decision-making dynamically to predict the current transportation preference of the user. This suggests that CVR modeling should additionally consider location-sensitive factors when capturing O2O users’ preferences.

Although numerous efforts have been made in CVR modeling to achieve state-of-the-art industrial performance, existing methods such as ESMM and its variants focus on addressing the problems of sample selection bias and data sparsity under the B2C scenario
[10, 11, 12, 13, 14, 15], and some methods that solve domain-specific problems have been proposed [16, 17, 18, 19]. However, the intrinsic characteristics of O2O, i.e., online behavioral regularity and offline transportation regularity, are rarely considered.

One possible strategy to improve the learning of users’ online behavioral regularity and offline transportation regularity is to consider user statistical features, i.e., the user’s average online order cost and average offline distance. However, in O2O scenarios, the spatiotemporal nature is inseparable, and this strategy loses time-series information when characterizing user preferences. Therefore, sequence representation techniques are also taken into account, as shown in Figure 1.

Figure 1: The online order cost and offline transportation cost in user history. Such a sequence can represent the user’s online order and offline transportation preferences as a time series.
(Figure labels: Offline stores that user interacted historically; Offline transportation cost; Online order cost.)

Hence, in this paper, we propose a novel CVR prediction method for the O2O scenario, named Entire Cost enhanced Multi-task Model (ECMM), to model users’ aggregated preference from an online-offline cost perspective. Following the formulation of state-of-the-art CVR modeling, two auxiliary tasks are focused on, i.e., predicting the click-through rate (CTR) and the click-through conversion rate (CTCVR), which can be defined as follows:

$$p(cvr = 1 \mid ctr = 1, x) = \frac{p(ctcvr = 1 \mid x)}{p(ctr = 1 \mid x)}, \tag{1}$$

where $x$ is $(u, s, t, h_c, h_o)$, $u$ is the user, $s$ denotes the store, and $t$ represents the current context, such as the current time, city, day of the week, and other information that is independent of user and store. $h_c$ and $h_o$ are the user’s historical click sequence and order sequence, respectively. We note that the first three terms are the information widely used in conventional CVR modeling, while $h_c$ and $h_o$ are two newly considered ones to assist in the modeling of behavioral and transportation regularities. Moreover, the user’s click and order sequences in ECMM are used from the online-offline cost perspective, i.e., online order cost and offline transportation cost, which is essentially different from conventional CTR prediction methods that model the user’s multiple interests [20, 21, 22, 23, 24, 25]. As a novel CVR prediction method for the O2O scenario, the contributions of ECMM are threefold:

     • ECMM extends the observation dimensions by learning users’ online conversion preferences from historical behavior sequences. A new mechanism named convert attention is proposed to learn the user’s behavior regularity from the global and local perspectives of online order cost.
     • To the best of our knowledge, ECMM is the first method for CVR modeling from the perspective of offline transportation cost. We propose a new mechanism named sliding window attention to dynamically learn users’ preference of offline transportation.
     • ECMM is evaluated on a real-world industrial O2O platform, where extensive experiments are carried out. Both offline and rigorous online A/B tests under the billion-level data scale demonstrate the significant superiority of ECMM over the state-of-the-art baselines.

2. Related Work

Our work is closely related to traditional e-commerce CVR prediction, where the state-of-the-art model is trained by multi-task learning. Besides, to capture user behavior regularity, the user’s historical behavior sequence is considered in our model, which is related to user behavior sequence representation. In this section, we give a brief introduction.

2.1. CVR Prediction

Inspired by the success of deep learning, recent CVR prediction models have evolved from traditional approaches to deep approaches. Traditional methods used logistic regression [26, 27] and GBDT [28] to model the CVR problem with feature interactions. However, nonlinear relationships of features are not considered in these models. Modern deep learning based methods transform the CVR problem into a multi-task problem [10, 11, 12]. ESMM [10] makes use of users’ sequential actions, "impression → click → pay", to solve the sample selection bias and data sparsity problems over the entire space by simultaneous
modeling of CTR and CTCVR tasks. ESM² [11] extends users’ sequential actions to a more general situation, "impression → click → D(O)Action → pay", and simultaneously models CVR with CTR, CTAVR and CTCVR tasks. HM³ [12], from the "impression → click → D(O)Mi → D(O)Ma → pay" perspective, models CVR with CTR, D-Mi, D-Ma and CTCVR tasks.

However, all these methods are based on B2C e-commerce platforms, so applying them to O2O platforms meets with mixed success. Users have unique sequential actions in O2O, which can be represented as "impression→click→online order→offline consumption". Such situations require the CVR model to consider not only users’ online behavioral regularity, but also their offline transportation regularity.

2.2. User Behavior Sequence Representation

In the past decade, user behavior sequence representation has received much attention and achieved remarkable effectiveness. Many well-designed recommender methods have been proposed and have brought huge commercial revenues for companies and advertisers. In these models, users’ historical behaviors are transformed into low-dimensional vectors after embedding to represent users’ interests and other characteristics. DIN [20] employs the attention mechanism to activate historical behaviors locally, which captures the diversity of user interest with respect to the given target item. DIEN [21] further proposes an auxiliary loss and an attention mechanism with GRU to capture the dynamic evolution of users’ interest. DFN [29] jointly considers explicit/implicit and positive/negative feedback to learn users’ unbiased preferences. Moreover, inspired by the success of the self-attention architecture [30], the Transformer is introduced for session CTR prediction [31]. MIND [32] and DMIN [33] model multiple interests by multiple vectors with a dynamic routing mechanism and a capsule network.

Although all these user behavior sequence representation methods have brought a huge boost to the business from the perspective of user interest, there are still opportunities for improvement in modeling user behavior sequences from other perspectives. Cost sensitivity [34] is an indispensable aspect of user modeling, and users of e-commerce often have certain restrictions on payment costs, which makes it possible to further improve user behavior sequence modeling from the perspective of cost.

3. The Proposed Approach

In this section, we introduce the proposed ECMM model. As shown in Figure 2, it consists of three modules: the base module, which includes the online user, the offline store and context features; the entire cost module, which contains both the user’s click and order sequences to capture the user’s historical cost preference; and the cost combination module, which combines online-to-offline cost to predict CTR and CVR. With this network, the model can capture the user’s online behavioral and offline transportation regularities, which are hidden in users’ historical behavior sequences. The details of each module are described as follows.

3.1. Motivation

As discussed in the previous section, users’ online behavioral and offline transportation regularities are indispensable for O2O recommendation [9, 2, 20, 21]. However, how to define their relationship with users’ behavior sequences, as well as how to embody both online and offline cost in a unified framework for CVR prediction, remains unexplored.

For one thing, we propose a novel CVR prediction method from the perspective of user historical behavior. We propose convert attention to extract the local and global preferences of users’ online-to-offline behaviors from both depth and breadth perspectives. From a local view, an order placed by a user is affected by clicks. We design the local impact of a click on an order from the store perspective. From a global perspective, the user’s overall order sequence is influenced by the click sequence in terms of id, price, and relative distance. For another, to model users’ transportation cost, we capture the information of the distance sequence, which implies users’ preference for offline cost in the O2O scenario, to assist the model in learning users’ conversion preference in the offline stage. Each store in a user’s historical clicks and orders has distance features, which represent the offline transportation cost. Then we use the sliding window attention method to calculate the user’s dynamic preference for offline cost at different timestamps.

3.2. Base Module

The base module is used to aggregate the basic features. Following [10, 11, 12], the embedding and MLP (multilayer perceptron) structures are used in the base module. The user, store, and contextual features ($u \in \mathbb{R}^{n_u}$, $s \in \mathbb{R}^{n_s}$, and $t \in \mathbb{R}^{n_t}$) are the inputs of the base module, which are mapped into a $d$-dimensional space via embedding operations. An MLP is used to learn the aggregated vector $b$ of basic features, with ELU [35] as the activation function:

$$b = ELU(MLP(Embedding(u, s, t))). \tag{2}$$
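As a rough illustration of the base module in Eq. (2), the following minimal sketch looks up one embedding per feature field, concatenates the results, and applies a single linear layer followed by ELU. The embedding size, vocabulary sizes, output width, the one-layer MLP, and all function names are assumptions made for illustration; the paper does not specify them.

```python
import math
import random

D = 4  # embedding size d (hypothetical value, not taken from the paper)

def elu(z, alpha=1.0):
    """ELU activation: identity for z > 0, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def init_matrix(rows, cols, rng):
    """Small random matrix standing in for trainable parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed(feature_ids, tables):
    """Look up one d-dimensional vector per field, then concatenate them."""
    flat = []
    for idx, table in zip(feature_ids, tables):
        flat.extend(table[idx])
    return flat

def base_module(u, s, t, tables, weight, bias):
    """b = ELU(MLP(Embedding(u, s, t))), with a one-layer MLP as an assumption."""
    x = embed(u + s + t, tables)
    hidden = [sum(w * v for w, v in zip(row, x)) + b_k
              for row, b_k in zip(weight, bias)]
    return [elu(h) for h in hidden]

rng = random.Random(0)
# Hypothetical fields: 2 user fields, 2 store fields, 1 context field.
vocab_sizes = [10, 5, 8, 3, 7]
tables = [init_matrix(n, D, rng) for n in vocab_sizes]
weight = init_matrix(6, D * len(vocab_sizes), rng)  # hypothetical output width 6
bias = [0.0] * 6

b = base_module([3, 1], [4, 2], [5], tables, weight, bias)
print(len(b))
```

In a production model the MLP would typically have several layers and the parameters would be learned jointly with the rest of the network; the sketch only shows the data flow.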
Figure 2: The structure of ECMM. Two auxiliary costs are introduced to model the entire cost: i) the online convert cost captures the behavioral regularity of users when they face the price and distance shown online; ii) the offline convert cost captures the transportation cost for offline consumption.
(Figure labels: Base module: user, store and context features, Embedding, Concat, Flatten, MLP, base feature. Entire cost module: click and order sequences, Embedding, Sparse attention, Convert attention with local and global convert over id, dis and price, Mean pooling, online convert cost; user click and pay transportation cost, N-block Transformer, Sliding window attention with store distance, Flatten, offline convert cost. Combination module: share net feature concat, CTR network, CVR network, outputs CTR, CTCVR and CVR, loss 1, loss 2.)
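The relation in Eq. (1) and the two losses drawn in Figure 2 can be sketched as follows. This is a minimal sketch under an assumption: the losses are taken to be ESMM-style binary cross-entropies on the CTR and CTCVR outputs over all impressions, which the text here does not spell out. The function names and the example logits are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one sample, with clipping for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def multitask_loss(ctr_logit, cvr_logit, click, order):
    """Combine the CTR and CVR heads into CTCVR and sum the two losses.

    Assumed ESMM-style setup: loss 1 supervises CTR with the click label,
    loss 2 supervises CTCVR with the (click and order) label.
    """
    p_ctr = sigmoid(ctr_logit)
    p_cvr = sigmoid(cvr_logit)
    p_ctcvr = p_ctr * p_cvr  # rearranged Eq. (1): CTCVR = CTR * CVR
    loss = bce(p_ctr, click) + bce(p_ctcvr, click * order)
    return p_ctr, p_cvr, p_ctcvr, loss

# One impression that was clicked but not ordered (hypothetical logits).
p_ctr, p_cvr, p_ctcvr, loss = multitask_loss(0.0, -1.0, click=1, order=0)
print(p_ctcvr / p_ctr, p_cvr)  # the two values agree, as Eq. (1) requires
```

Because the CVR head is only supervised through the product, it can be trained on all impressions rather than on clicked samples alone, which is the motivation usually given for this entire-space formulation.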



3.3. Entire Cost Module

Different from B2C purchases, the O2O scenario generally considers only the surrounding stores of a user’s location. Limited candidates actually reduce the possibility of matching users’ preferences. Thus, it is critical to accurately capture the user’s behavioral regularity from historical behaviors. Meanwhile, O2O users need to consider two-stage costs for decision making, i.e., online order cost and offline transportation cost, both of which should be considered. The entire cost module is designed to solve the above problems and is the most important part of the ECMM model. It contains two parts: the online cost feature module and the offline cost feature module.

Online Cost Feature Module. Each store in the user’s click or order sequence has side-information features of id $s^{id}$, distance $s^{dis}$ and price $s^{price}$, which represent the cost the user faces when deciding to click/order an offline store on the online platform. Then we have the embedding of the $i$-th store in the user’s historical behavior:

$$h^c_i = Embedding(s^{id}_i, s^{dis}_i, s^{price}_i), \quad h^c_i \in \mathbb{R}^{3d}. \tag{3}$$

$$h^o_i = Embedding(s^{id}_i, s^{dis}_i, s^{price}_i), \quad h^o_i \in \mathbb{R}^{3d}. \tag{4}$$

Thus, the user’s historical click and order behavior sequences, i.e., $H^c$ and $H^o$, can be represented as follows:

$$H^c = concat(h^c_1, h^c_2, ..., h^c_k), \quad H^c \in \mathbb{R}^{k \times 3d}, \tag{5}$$

$$H^o = concat(h^o_1, h^o_2, ..., h^o_k), \quad H^o \in \mathbb{R}^{k \times 3d}, \tag{6}$$

where $k$ denotes the length of the user’s click and order sequences.

After embedding, the sparse attention is used to capture the user’s historical preference under contextual restriction. The sparse attention takes the embeddings of the user’s current context feature, click and order sequences as input, and then gets the most important user click and order behaviors in the current context. The sparse attention [36] is defined as follows:

$$SparseAttn(Q, K, V) = softmax\left(topn\left(\frac{QK^T}{\sqrt{k}}\right)\right)V, \tag{7}$$

where the $topn$ operation takes the top $n$ pieces of historical information most relevant to the current context.

Through the sparse attention, we can get the updated embeddings of the user’s click and order sequences:

$$H^a_c = SparseAttn(Q^s, K^c, V^c), \quad H^a_c \in \mathbb{R}^{k \times 3d}, \tag{8}$$

$$H^a_o = SparseAttn(Q^s, K^o, V^o), \quad H^a_o \in \mathbb{R}^{k \times 3d}, \tag{9}$$

where $Q^s$ is the query vector converted from the context features, $\{K^c, V^c\}$ are the key and value vectors converted from the user click sequence, and $\{K^o, V^o\}$ are likewise converted from the user order sequence.

In order to better capture the impact of the user click sequence $H^a_c$ on the order sequence $H^a_o$ from the retrieved click and order aggregation information, we propose a convert attention mechanism to capture these impacts from both local and global perspectives.

From a local perspective, the preference of the user’s conversion to store $h^a_{o,j} \in H^a_o$ can be characterized by the clicked store $h^a_{c,i} \in H^a_c$ related to where the order was placed:

$$\beta_i^j = (W^l_c \times h^a_{c,i}) \otimes (W^l_o \times h^a_{o,j})^T, \tag{10}$$

$$s^l_{o,j} = \sum_{i=1}^{k} \frac{\exp(\beta_i^j) \times h^a_{c,i}}{\sum_{o=1}^{k} \exp(\beta_o^j)} + h^a_{o,j}, \quad s^l_{o,j} \in \mathbb{R}^{3d}, \tag{11}$$
where $W_c^l, W_o^l \in \mathbb{R}^{3d \times 3d}$ are trainable parameters. $\beta_i^j$ represents the correlation between clicked store $i$ and order store $j$. $s_{o,j}^l$ aggregates the information of the clicked stores to obtain the local conversion preference, which is used to update the order store information. Here, we use the residual design to retain the original information of the order store.

   From a global perspective, the user's preferences for different dimensions (i.e., store's id, price, distance) of order stores are affected by the relevant information of the clicked store. Hence, we separate the submatrices from the click and order sequences:

   $$H_{id}^a = (s_i^{a,id}), \quad H_{dis}^a = (s_i^{a,dis}), \quad H_{price}^a = (s_i^{a,price}), \quad H_{ctxj}^a \in \mathbb{R}^{k \times d}. \tag{12}$$

   For each dimension, we calculate the impact of the user's click sequence on the user's order sequence from a global perspective:

   $$\gamma_{ctxi}^{ctxj} = (W_c^g \times H_{c,ctxi}^a) \otimes (W_o^g \times H_{o,ctxj}^a)^T, \tag{13}$$

   $$H_{o,ctxj}^g = \sum_{ctxi} \frac{\exp(\gamma_{ctxi}^{ctxj})}{\sum_{ctxi} \exp(\gamma_{ctxi}^{ctxj})} H_{c,ctxi}^a + H_{o,ctxj}^a, \quad H_{o,ctxj}^g \in \mathbb{R}^{k \times d}, \tag{14}$$

where $ctxi, ctxj \in (id, dis, price)$, $W_c^g, W_o^g \in \mathbb{R}^{d \times d}$ are trainable parameters, and $\gamma_{ctxi}^{ctxj}$ represents the correlation between the click sequence in dimension $ctxi$ and the order sequence in dimension $ctxj$. $H_{o,ctxj}^g$ aggregates the additional click information to obtain the global conversion preference, which is used to update the order sequence. The residual design is also used in this part.

   Finally, the aggregation of the order sequence and the click sequence can be obtained:

   $$h^o = Meanpooling(\Vert_j (s_{o,j}^l) + \Vert_{ctxj} (H_{o,ctxj}^a)), \quad h^o \in \mathbb{R}^{3d}, \tag{15}$$

   $$h^c = Meanpooling(H_c^a), \quad h^c \in \mathbb{R}^{3d}, \tag{16}$$

where $\Vert$ denotes the concatenation of vectors.

   Offline Cost Feature Module. In the O2O scenario, offline transportation costs also play an important role in the conversion rate, as users need to go to offline stores. We first construct the user's historical behavior sequences to represent the user's historical click and order transportation costs, and take them as the input of the $N$-layer Transformer encoder:

   $$T_c^{dis} = Transformer(H_c^{dis}), \quad T_c^{dis} \in \mathbb{R}^{k \times d}, \tag{17}$$

   $$T_o^{dis} = Transformer(H_o^{dis}), \quad T_o^{dis} \in \mathbb{R}^{k \times d}. \tag{18}$$

   We propose a sliding window attention mechanism that uses fixed-length windows to characterize the user's preference for transportation cost in different periods, because this preference varies over time. Note that the mechanism generalizes beyond O2O platform users to other scenarios that need to capture users' dynamic preferences over different periods.

   Each offline store has a distance feature $s^{dis} \in \mathbb{R}^{d}$ with respect to the current store. We match this feature with the user's historical distance sequence:

   $$D_{j,i} = T_{j,i:i+ws}^{dis}, \quad D_{j,i} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{19}$$

   $$A_{j,i} = softmax\left(\frac{D_{j,i} \, s^{dis}}{\sqrt{ws}}\right), \quad A_{j,i} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{20}$$

   $$M_j^{dis} = \sum_{i=1}^{k} A_{j,i}^{w} D_{j,i}, \quad M_j^{dis} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{21}$$

where $ws \in \mathbb{N}$ denotes the window length, $D_{j,i}$ denotes the subsequence in the $i$-th window, $M_j^{dis}$ denotes the matrix of the user's offline preference along the window-length dimension, and $m_j^{dis} = Flatten(M_j^{dis})$ denotes the user's offline preference vector.

3.4. Cost Combination Module

In this section, we incorporate the CTR and CVR prediction tasks into a multi-task framework. The input of this module is the concatenation of the outputs from the base module and the entire cost module. $r_{ctr}$ and $r_{cvr}$ are each calculated by an MLP network:

   $$r_{ctr} = ELU(MLP([b, h^c, h^o, m_c^{dis}, m_o^{dis}])), \tag{22}$$

   $$r_{cvr} = ELU(MLP([b, h^c, h^o, m_c^{dis}, m_o^{dis}])). \tag{23}$$

   We then calculate the post-view click-through & conversion rate (CTCVR) by $r_{ctcvr} = r_{ctr} * r_{cvr}$. The loss function used here is lambda loss [37].

4. Experiments

In this section, we evaluate the model performance of the proposed ECMM. We describe the experimental settings and experimental results as follows.

4.1. Experimental Settings

Datasets. We selected 30 days of exposure logs from August to September, obtained from the online O2O business system, to train the CVR model. We have two test sets: one covers one day in September and the other covers three days in October. Since user behavior evolves with time, the closer the time is to the training data, the closer the distribution of user behavior is to the training data,
and the longer the elapsed time, the more the user behavior distribution changes. Therefore, the test sets in this experiment can effectively evaluate the accuracy and generalization of the model. The training set contains approximately 1.1 billion samples, while the two test sets contain 40 million and 100 million samples, respectively.

   Metric. The goal of our ranking task is to provide a list that is more likely to facilitate users' conversion. The evaluation metric used in this paper is NDCG. We have two ranking strategies: sorting by CTR and sorting by CTCVR. So we have NDCG sorted by CTR to predict the real click rate and NDCG sorted by CTCVR to predict the real purchase rate. The calculation criteria are as follows:

   $$NDCG = \frac{DCG}{IDCG} = \frac{\sum_{j=1}^{n} (2^{r} - 1) / \log(1 + j)}{\sum_{j=1}^{|rel|} (2^{r} - 1) / \log(1 + j)}, \tag{24}$$

where $n$ represents the length of the list of stores ranked by the model, $r$ represents the label of the sample (click or order, depending on the model task), and $|rel|$ represents the number of stores whose label is not zero.

   Compared Methods. Our baseline is a highly optimized ESMM model that incorporates a large number of business features and handcrafted features. The total number of features is 473. The embedding dimension $d$ is 10. We use sequence features from 180 days of user history, and the sequence length $k$ is 50. The number of Transformer layers $N$ is 2. Because 80% of users have a click sequence shorter than 10 and an order sequence shorter than 5, and considering the service performance, we set the $n$ of the sparse attention to 10. The dimension of the MLP used in the base module is 1024, and the dimensions of the four-layer MLP used by the CTR and CVR networks are 512, 256, 128 and 1, with the ELU activation function. All baselines take into account the statistical user features of online and offline costs for fair comparison. We conduct comparative experiments with three categories of methods:

   (1) Baselines: a) ESMM [10]. An outstanding multi-task model for learning CTR and CVR in the industry. b) ESMM+DIN [20]. Based on ESMM, users' click sequence feature and the current store feature are introduced by the DIN method.

   (2) Ablation: a) ECMM wo offline and convAttn. Based on ECMM, we only use online convert cost without convert attention. b) ECMM wo offline. Based on ECMM, we only use online convert cost. c) ECMM wo online and slidWinAttn. Based on ECMM, we only use offline convert cost without sliding window attention. d) ECMM wo online. Based on ECMM, we only use offline convert cost.

   (3) ECMM variants: a) ECMM+dualInfo. Based on ECMM, the convert attention not only passes click sequence information to the order sequence but also passes order sequence information to the click sequence. b) ECMM+sepInput. Based on ECMM, we use the click feature as the input for the CTR network and the order feature as the input for the CVR network.

4.2. Offline Performance

The evaluation metrics used in this paper are CTR-NDCG and CTCVR-NDCG. Table 1 shows the experimental results of the comparison methods on two testing sets, from which we have:

Table 1
Offline experimental results on two testing sets.

                                  September                 October                   October improvement
  Models                          CTR-NDCG    CTCVR-NDCG    CTR-NDCG    CTCVR-NDCG    CTR-NDCG    CTCVR-NDCG
  ESMM                            0.7560      0.8446        0.7515      0.8455        0.00%       0.00%
  ESMM+DIN                        0.7577      0.8456        0.7528      0.8463        0.17%       0.09%
  ECMM wo offline and convAttn    0.7575      0.8458        0.7525      0.8464        0.13%       0.11%
  ECMM wo offline                 0.7574      0.8462        0.7532      0.8469        0.23%       0.17%
  ECMM wo online and slidWinAttn  0.7577      0.8458        0.7533      0.8467        0.24%       0.14%
  ECMM wo online                  0.7579      0.8463        0.7534      0.8471        0.25%       0.19%
  ECMM+dualInfo                   0.7576      0.8462        0.7533      0.8469        0.24%       0.17%
  ECMM+sepInput                   0.7581      0.8465        0.7537      0.8472        0.29%       0.20%
  ECMM                            0.7585      0.8480        0.7541      0.8487        0.34%       0.38%

   For the entire cost module, compared with ESMM, ECMM can obtain a 0.35% gain on CTR-NDCG and a 0.38% gain on CTCVR-NDCG¹. All other ablation methods and variants also improve the model performance after modeling users' behavior sequences.

   ¹ For large-scale datasets in industrial recommender systems, the improvement is considerable because of its hardness, and the testing results in Section 3.3 further verify the significant improvement of our proposal.

   For online cost feature, compared with ESMM, ESMM+DIN, which adds the click sequence, shows a certain increase in CTR- and CTCVR-NDCG. As shown in Figure 3, ECMM wo offline and convAttn, which further adds the order sequence, slightly decreases the CTR-NDCG but greatly improves the CTCVR-NDCG. ECMM wo offline indicates that the convert attention mechanism can learn users' order characteristics from click to order. These three methods show that it is effective to utilize historical features to improve CVR prediction. The convert attention brings 0.18% and 0.19% gains in CTR-NDCG and CTCVR-NDCG.
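
The local convert attention of Eqs. (10) and (11) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, embeddings are plain Python lists, and the operator ⊗ is assumed here to reduce to a dot product between the two projected vectors, which the text does not state explicitly.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def local_convert_attention(H_click, H_order, W_c, W_o):
    """Update each order-store embedding with a softmax-weighted sum of
    click-store embeddings plus a residual term, following Eqs. (10)-(11)."""
    projected_clicks = [matvec(W_c, h) for h in H_click]
    updated = []
    for h_o in H_order:
        q = matvec(W_o, h_o)
        # beta_i^j: correlation of click store i with order store j
        # (assumption: dot product of the projected vectors)
        scores = [sum(a * b for a, b in zip(p, q)) for p in projected_clicks]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        aggregated = [
            sum(w * h[d] for w, h in zip(weights, H_click))
            for d in range(len(h_o))
        ]
        # residual design: keep the original order-store information
        updated.append([a + b for a, b in zip(aggregated, h_o)])
    return updated

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    clicks = [[1.0, 0.0], [0.0, 1.0]]
    orders = [[0.0, 0.0]]
    print(local_convert_attention(clicks, orders, identity, identity))
```

With a zero order embedding all scores are equal, so the update is simply the mean of the click embeddings; in general the weights concentrate on the clicked stores most correlated with the order store.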

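
The "October improvement" columns of Table 1 appear to be relative gains over the ESMM baseline. The short sketch below recomputes two of them from the October NDCG values in the table. The helper name `relative_improvement` is ours, and reading the columns as relative gains is an assumption, although it agrees with the printed values.

```python
def relative_improvement(value, baseline):
    """Relative gain of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# October NDCG values copied from Table 1
esmm = {"CTR": 0.7515, "CTCVR": 0.8455}
esmm_din = {"CTR": 0.7528, "CTCVR": 0.8463}

for task in ("CTR", "CTCVR"):
    gain = relative_improvement(esmm_din[task], esmm[task])
    print(f"ESMM+DIN {task}-NDCG gain over ESMM: {gain:.2f}%")
# prints 0.17% for CTR and 0.09% for CTCVR, matching Table 1
```

Because the table reports NDCG to four decimals, gains recomputed this way can differ from the printed ones in the last digit.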

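
The NDCG metric of Eq. (24) can be sketched as follows. This is an illustrative version only: the function names are hypothetical, the logarithm base is assumed to be 2 (the base cancels in the ratio), and the ideal ordering is obtained by sorting the labels in descending order, so that zero labels contribute nothing to the ideal sum.

```python
import math

def dcg(labels):
    """Discounted cumulative gain of labels given in ranked order."""
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(labels, start=1))

def ndcg(labels_in_ranked_order):
    """NDCG = DCG / IDCG; returns 0.0 when no label is positive."""
    ideal = sorted(labels_in_ranked_order, reverse=True)
    idcg = dcg(ideal)
    if idcg == 0:
        return 0.0
    return dcg(labels_in_ranked_order) / idcg

if __name__ == "__main__":
    # labels are 1 for a clicked (or ordered) store and 0 otherwise,
    # listed in the order produced by the model
    print(ndcg([1, 0, 0]))     # positive item ranked first
    print(ndcg([0, 1, 0, 1]))  # positive items ranked lower
```

Ranking by the predicted CTR with click labels gives CTR-NDCG, and ranking by the predicted CTCVR with order labels gives CTCVR-NDCG.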
Figure 3: Improvement in conversion rate prediction from online behavioral regularity in October.

   For offline cost feature, the ECMM wo online and slidWinAttn model, which uses distance sequence features, brings stronger improvements in both CTR- and CTCVR-NDCG. As shown in Figure 4, comparing ECMM wo online and slidWinAttn with ESMM, it can be seen that the offline transportation cost is indispensable for the conversion rate prediction of the O2O platform. The ECMM wo online model, which adds our proposed sliding window attention, brings greater gains by dynamically matching user preference in different periods. The sliding window method brings 0.02% and 0.05% gains in CTR-NDCG and CTCVR-NDCG.

Figure 4: Improvement in conversion rate prediction from offline transportation regularity in October.

   In order to explore whether the user's historical orders affect clicks, we further study the ECMM+dualInfo model, in which the order sequence transmits information to the click sequence. It can be seen that the CTR-NDCG decreased by 0.05% and the CTCVR-NDCG decreased by 0.06%. We separate the click and order features into the CTR network and the CVR network to obtain the ECMM+sepInput model, in order to verify the impact of the features on the different tasks, and find that separating the features reduces model performance.

   To verify that our model generalizes rather than fitting users over a certain period, we further evaluate our method on a test set in October. The results are consistent with the assessment in September. The ECMM model shows that considering users' online behavioral and offline transportation regularities is helpful in predicting users' current CTR and CTCVR.

4.3. Online Evaluations

An online A/B test was conducted in the recommender system over 7 days in January 2022. For the control group, 10% of users were randomly assigned to a recommender system served by a highly optimized ESMM algorithm. For the experimental group, 10% of users were randomly selected to use the ECMM method. In the online experiment, we choose CTR and CTCVR as evaluation indicators, where CTCVR represents the purchase rate of each request. The result is shown in Figure 5. We can see that our proposed ECMM method improves the CTR by 0.52% (p-value=0.00<0.05) compared with the baseline model, and the CTCVR by 0.73% (p-value=0.02<0.05), which yields a 1.8% (p-value=0.02<0.05) increase in total revenue. Here, a 1.8% increase in total revenue with a 0.45% increase in CTCVR means that the model provides users with higher-priced lists. So far, the ECMM method has been applied to the main online traffic and has served hundreds of millions of users, bringing a significant increase in the total revenue of Meituan.

Figure 5: Online performance. The improvements of CTCVR and CTR are significant with the significance level 𝛼=0.05.

5. Conclusion

In this paper, inspired by users' sequential behaviors on O2O platforms, a novel model is proposed to predict conversion rate. Further, we introduce convert attention and sliding window attention in the cost module to learn users' online behavioral regularity and offline transportation regularity. Offline experiments have demonstrated the effectiveness of our proposed method in learning users' conversion from the click sequence to the order sequence, and the accuracy of the ranking list, as evaluated by NDCG, is improved. Online experiments show that the ECMM method has a significant effect on improving the total revenue of the O2O platform. For now, the ECMM method has been applied to the main online traffic, bringing a significant increase in the total revenue of the enterprise.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (NSFC) under Grant 72071029, 71974031 and 72231010. This research was also supported by Meituan.

References

 [1] X. Ding, J. Tang, T. Liu, C. Xu, Y. Zhang, F. Shi, Q. Jiang, D. Shen, Infer implicit contexts in real-time online-to-offline recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 2336–2346. URL: https://doi.org/10.1145/3292500.3330716. doi:10.1145/3292500.3330716.
 [2] H. Li, Q. Shen, Y. Bart, Local market characteristics and online-to-offline commerce: An empirical analysis of groupon, Management Science 64 (2018) 1860–1878.
 [3] S. Kawanaka, D. Moriwaki, Uplift modeling for location-based online advertising, in: Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-Based Recommendations, Geosocial Networks and Geoadvertising, LocalRec ’19, Association for Computing Machinery, New York, NY, USA, 2019.
 [4] M.-H. Park, J.-H. Hong, S.-B. Cho, Location-based recommendation system using bayesian user’s preference model in mobile devices, in: International conference on ubiquitous intelligence and computing, Springer, 2007, pp. 1130–1139.
 [5] H. Yang, T. Liu, Y. Sun, E. Bertino, Exploring the interaction effects for temporal spatial behavior prediction, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 2013–2022. URL: https://doi.org/10.1145/3357384.3357963. doi:10.1145/3357384.3357963.
 [6] J. Bao, Y. Zheng, M. F. Mokbel, Location-based and preference-aware recommendation using sparse geo-social networking data, in: Proceedings of the 20th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’12, Association for Computing Machinery, New York, NY, USA, 2012, p. 199–208. URL: https://doi.org/10.1145/2424321.2424348. doi:10.1145/2424321.2424348.
 [7] J. Huang, K. Hu, Q. Tang, M. Chen, Y. Qi, J. Cheng, J. Lei, Deep position-wise interaction network for ctr prediction, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1885–1889.
 [8] Y. Ping, C. Gao, T. Liu, X. Du, H. Luo, D. Jin, Y. Li, User consumption intention prediction in meituan, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3472–3482.
 [9] Z. Fang, B. Gu, X. Luo, Y. Xu, Contemporaneous and delayed sales impact of location-based mobile promotions, Information Systems Research 26 (2015) 552–564.
[10] X. Ma, L. Zhao, G. Huang, Z. Wang, Z. Hu, X. Zhu, K. Gai, Entire space multi-task model: An effective approach for estimating post-click conversion rate, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 1137–1140. URL: https://doi.org/10.1145/3209978.3210104. doi:10.1145/3209978.3210104.
[11] H. Wen, J. Zhang, Y. Wang, F. Lv, W. Bao, Q. Lin, K. Yang, Entire space multi-task modeling via post-click behavior decomposition for conversion rate prediction, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA, 2020, p. 2377–2386. URL: https://doi.org/10.1145/3397271.3401443.
[12] H. Wen, J. Zhang, F. Lv, W. Bao, T. Wang, Z. Chen, Hierarchically modeling micro and macro behaviors via multi-task learning for conversion rate prediction, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA, 2021, p. 2187–2191. URL: https://doi.org/10.1145/3404835.3463053.
[13] Q. Lu, S. Pan, L. Wang, J. Pan, F. Wan, H. Yang, A practical framework of conversion rate prediction for online display advertising, in: Proceedings of the ADKDD’17, ADKDD’17, Association for Computing Machinery, New York, NY, USA, 2017. URL: https://doi.org/10.1145/3124749.3124750. doi:10.1145/3124749.3124750.
[14] T. Tong, X. Xu, N. Yan, J. Xu, Impact of different platform promotions on online sales and conversion rate: The role of business model and product line length, Decision Support Systems (2022) 113746.
[15] S. Guo, L. Zou, Y. Liu, W. Ye, S. Cheng, S. Wang, H. Chen, D. Yin, Y. Chang, Enhanced Doubly Robust Learning for Debiasing Post-Click Conversion Rate Estimation, Association for Computing Machinery, New York, NY, USA, 2021, p. 275–284. URL: https://doi.org/10.1145/3404835.3462917.
[16] X. Pan, M. Li, J. Zhang, K. Yu, L. Wang, H. Wen, C. Mao, B. Cao, Conversion rate prediction via meta learning in small-scale recommendation scenarios, arXiv preprint arXiv:2112.13753 (2021).
[17] H. Wang, Z. Li, X. Liu, D. Ding, Z. Hu, P. Zhang, C. Zhou, J. Bu, Fulfillment-time-aware personalized ranking for on-demand food recommendation, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 4184–4192.
[18] D. Xi, Z. Chen, P. Yan, Y. Zhang, Y. Zhu, F. Zhuang, Y. Chen, Modeling the sequential dependence among audience multi-step conversions with multi-task learning in targeted display advertising, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3745–3755.
[19] F. Xiao, L. Li, W. Xu, J. Zhao, X. Yang, J. Lang, H. Wang, Dmbgn: Deep multi-behavior graph networks for voucher redemption rate prediction, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3786–3794.
[20] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, K. Gai, Deep interest network for click-through rate prediction, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 1059–1068. URL: https://doi.org/10.1145/3219819.3219823. doi:10.1145/3219819.3219823.
[21] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, K. Gai, Deep interest evolution network for click-through rate prediction, volume 33, 2019, pp. 5941–5948. URL: https://ojs.aaai.org/index.php/AAAI/article/view/4545. doi:10.1609/aaai.v33i01.33015941.
[22] C. Li, Z. Liu, M. Wu, Y. Xu, H. Zhao, P. Huang, G. Kang, Q. Chen, W. Li, D. L. Lee, Multi-interest network with dynamic routing for recommendation at tmall, in: Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 2615–2623.
[23] Q. Pi, W. Bian, G. Zhou, X. Zhu, K. Gai, Practice on long sequential user behavior modeling for click-through rate prediction, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2671–2679.
[24] K. Ren, J. Qin, Y. Fang, W. Zhang, L. Zheng, W. Bian, G. Zhou, J. Xu, Y. Yu, X. Zhu, et al., Lifelong sequential modeling with personalized memorization for user response prediction, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 565–574.
[25] Q. Tan, J. Zhang, J. Yao, N. Liu, J. Zhou, H. Yang, X. Hu, Sparse-interest network for sequential recommendation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 598–606.
[26] K.-c. Lee, B. Orten, A. Dasdan, W. Li, Estimating conversion rate in display advertising from past performance data, in: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012, pp. 768–776.
[27] O. Chapelle, Modeling delayed feedback in display advertising, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 1097–1105.
[28] Q. Lu, S. Pan, L. Wang, J. Pan, F. Wan, H. Yang, A practical framework of conversion rate prediction for online display advertising, in: Proceedings of the ADKDD’17, 2017, pp. 1–9.
[29] R. Xie, C. Ling, Y. Wang, R. Wang, F. Xia, L. Lin, Deep feedback network for recommendation, in: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 2519–2525.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[31] Y. Feng, F. Lv, W. Shen, M. Wang, F. Sun, Y. Zhu, K. Yang, Deep session interest network for click-through rate prediction, in: IJCAI, 2019.
[32] C. Li, Z. Liu, M. Wu, Y. Xu, P. Huang, H. Zhao, G. Kang, Q. Chen, W. Li, Lee, Multi-interest network with dynamic routing for recommendation at tmall, Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019).
[33] Z. Xiao, L. Yang, W. Jiang, Y. Wei, Y. Hu, H. Wang, Deep multi-interest network for click-through rate prediction, Proceedings of the 29th ACM International Conference on Information & Knowledge Management (2020).
[34] T. Natarajan, S. A. Balasubramanian, D. Kasilingam, Understanding the intention to use mobile shopping applications and its influence on price sensitivity, Journal of Retailing and Consumer Services 37 (2017) 8–22.
[35] D. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), in: Y. Bengio, Y. LeCun (Eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL: http://arxiv.org/abs/1511.07289.
[36] G. Zhao, J. Lin, Z. Zhang, X. Ren, X. Sun, Sparse transformer: Concentrated attention through explicit selection, 2020. URL: https://openreview.net/forum?id=Hye87grYDH.
[37] X. Wang, C. Li, N. Golbandi, M. Bendersky, M. Najork, The lambdaloss framework for ranking metric optimization, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 1313–1322.