=Paper= {{Paper |id=Vol-3317/paper4 |storemode=property |title=Entire Cost Enhanced Multi-Task Model for Online-to-Offline Conversion Rate Prediction |pdfUrl=https://ceur-ws.org/Vol-3317/Paper4.pdf |volume=Vol-3317 |authors=Yingyi Zhang,Xianneng Li,Yahe Yu,Jian Tang,Huanfang Deng,Junya Lu,Yeyin Zhang,Qiancheng Jiang,Yunsen Xian,Liqian Yu |dblpUrl=https://dblp.org/rec/conf/cikm/ZhangLYTDLZJXY22 }} ==Entire Cost Enhanced Multi-Task Model for Online-to-Offline Conversion Rate Prediction== https://ceur-ws.org/Vol-3317/Paper4.pdf

Entire Cost Enhanced Multi-Task Model for Online-to-Offline Conversion Rate Prediction
Yingyi Zhang¹, Xianneng Li¹,*, Yahe Yu¹, Jian Tang², Huanfang Deng², Junya Lu², Yeyin Zhang², Qiancheng Jiang², Yunsen Xian² and Liqian Yu²
¹ Dalian University of Technology, Dalian, 116024, China
² Meituan, Beijing, 100102, China

Abstract
Predicting users’ conversion rate (CVR) is essential for ranking systems in industrial Online-to-Offline (O2O) applications. Numerous efforts have been made in CVR modeling to achieve state-of-the-art performance. However, existing methods mainly focus on the Business-to-Customer (B2C) scenario, so applying them to O2O meets with mixed success. Several scenario-specific challenges explain this. For example, O2O users in different locations generally encounter different candidate sets of surrounding stores, which makes users’ behavioral regularity especially prominent. Besides, an O2O conversion incurs a two-stage cost, i.e., online order cost and offline transportation cost, so users’ location sensitivity deserves more attention than in conventional scenarios. Motivated by these characteristics, we propose a novel CVR prediction method for the O2O scenario, named Entire Cost enhanced Multi-task Model (ECMM): i) users’ historical behavior sequences across different locations are modeled to capture the users’ behavioral regularity; ii) both online order cost and offline transportation cost are modeled to predict the users’ aggregated preference for conversion. With two novel attention mechanisms, i.e., convert attention and sliding window attention, ECMM can be trained end-to-end to fit O2O characteristics. Extensive experiments have been carried out on Meituan, a real-world industrial O2O platform. Both offline experiments and rigorous online A/B tests at the billion-level data scale demonstrate the superiority of the proposed ECMM over highly optimized state-of-the-art baselines.

Keywords
Online-to-Offline, Multi-Task Learning, Conversion Rate Prediction

DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
yingyizhang@mail.dlut.edu.cn (Y. Zhang); xianneng@dlut.edu.cn (X. Li); yaheyu@dlut.edu.cn (Y. Yu); tangjian13@meituan.com (J. Tang); denghuanfang@meituan.com (H. Deng); lujunya@meituan.com (J. Lu); zhangyeyin@meituan.com (Y. Zhang); jiangqiancheng@meituan.com (Q. Jiang); xianyunsen@meituan.com (Y. Xian); yuliqian@meituan.com (L. Yu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

In the Online-to-Offline (O2O) scenario, industrial platforms generally rely on commission fees from successful conversions as profit. Hence, accurately predicting users’ conversion rate (CVR) is essential for ranking systems in the O2O industry. However, the O2O scenario requires the conversion of users not only from online click to online order, but also to final offline consumption [1, 2]. In other words, O2O users’ behaviors follow a sequential pattern of impression→click→online order→offline consumption, which is somewhat different from that of other online e-commerce forms [3, 4, 5], i.e., Business-to-Customer (B2C). This raises several scenario-specific characteristics that make CVR prediction of O2O challenging, whereas conventional methods may not be perfectly suitable.

In this paper, we focus on two critical O2O characteristics summarized from our real practice: i) online behavioral regularity. As a typical form of Location-Based Service (LBS), the O2O scenario provides an online ranking list that only considers stores surrounding a user’s location. The limited candidates require CVR modeling to grasp users’ historical behavior preferences for online conversion more accurately, since users’ behaviors generally appear homogeneous on the platform across different locations [6, 7, 8], such as clicking/ordering stores with similar prices or distances shown online. ii) offline transportation regularity. Different from B2C purchases with only online order cost, O2O users must spend additional transportation cost for the offline consumption [9, 2]. Since a user’s preference for distance varies in different periods, offline cost should be accounted for dynamically in decision-making to predict the current transportation preference of the user. This suggests that CVR modeling should additionally consider location-sensitive factors when capturing O2O users’ preferences.

Although numerous efforts have been made in CVR modeling to achieve state-of-the-art industrial performance, existing methods such as ESMM and its variants focus on addressing the problems of sample selection bias and data sparsity under the B2C scenario
[10, 11, 12, 13, 14, 15], and some methods solving domain-specific problems have been proposed [16, 17, 18, 19]. However, the intrinsic characteristics of O2O, i.e., online behavioral regularity and offline transportation regularity, are rarely considered.

One possible strategy to improve the learning of users’ online behavioral regularity and offline transportation regularity is to consider user statistical features, i.e., the user’s average online order cost and average offline distance features. However, in O2O scenarios, the spatiotemporal nature is inseparable, and using this strategy will lose time-series information when characterizing user preferences. Therefore, sequence representation techniques are also taken into account, as shown in Figure 1.

[Figure 1 image: offline stores that the user interacted with historically, each labeled with its online order cost and offline transportation cost]
Figure 1: The online order cost and offline transportation cost in user history. Such a sequence can represent user online order and offline transportation preferences in time-series.

Hence, in this paper, we propose a novel CVR prediction method for the O2O scenario, named Entire Cost enhanced Multi-task Model (ECMM), to model users’ aggregated preference under an online-offline cost perspective. Following the formulation of state-of-the-art CVR modeling, two auxiliary tasks are focused on, i.e., predicting the click-through rate (CTR) and click-through conversion rate (CTCVR), which can be defined as follows:

$$p(cvr = 1 \mid ctr = 1, x) = \frac{p(ctcvr = 1 \mid x)}{p(ctr = 1 \mid x)}, \tag{1}$$

where $x$ is $(u, s, t, h_c, h_o)$, $u$ is the user, $s$ denotes the store, and $t$ represents the current context, such as the current time, city, day of the week, and other information that is independent of user and store. $h_c$ and $h_o$ are the user’s historical click sequence and order sequence, respectively. We note that the first three terms are the information widely used in conventional CVR modeling, while $h_c$ and $h_o$ are two newly considered ones to assist in the modeling of behavioral and transportation regularities. Moreover, the user’s click and order sequences in ECMM are used from the online-offline cost perspective, i.e., online order cost and offline transportation cost, which is essentially different from conventional CTR prediction methods that model the user’s multiple interests [20, 21, 22, 23, 24, 25]. As a novel CVR prediction method for the O2O scenario, the contributions of ECMM are threefold:

• ECMM extends the observation dimensions by learning users’ online conversion preferences from historical behavior sequences. A new mechanism named convert attention is proposed to learn the user’s behavior regularity from the global and local perspectives of online order cost.
• To the best of our knowledge, ECMM is the first method for CVR modeling from the perspective of offline transportation cost. We propose a new mechanism named sliding window attention to dynamically learn users’ preference for offline transportation.
• ECMM is tested on a real-world industrial O2O platform, where extensive experiments are carried out. Both offline and rigorous online A/B tests under the billion-level data scale demonstrate the significant superiority of ECMM over the state-of-the-art baselines.

2. Related Work

Our work is closely related to traditional e-commerce CVR prediction, where the state-of-the-art model is trained by multi-task learning. Besides, to capture user behavior regularity, user history behavior sequences are considered in our model, which relates it to user behavior sequence representation. In this section, we give a brief introduction.

2.1. CVR Prediction

Inspired by the success of deep learning, recent CVR prediction models have evolved from traditional approaches to deep approaches. Traditional methods used logistic regression [26, 27] and GBDT [28] for modeling the CVR problem with feature interactions. However, nonlinear relationships of features are not considered in these models. Modern deep learning based methods transform the CVR problem into a multi-task problem [10, 11, 12]. ESMM [10] makes use of users’ sequential actions, "impression → click → pay", to solve the sample selection bias and data sparsity problems over the entire space by simultaneous
modeling of CTR and CTCVR tasks. The ESM$^2$ [11] method extends users’ sequential actions to a more general situation, "impression → click → D(O)Action → pay", and simultaneously models CVR with the CTR, CTAVR and CTCVR tasks. HM$^3$ [12] models CVR with the CTR, D-Mi, D-Ma and CTCVR tasks from the "impression → click → D(O)Mi → D(O)Ma → pay" perspective.

However, all these methods are based on B2C e-commerce platforms, so applying them to O2O platforms meets with mixed success. Users have unique sequential actions in O2O, which can be represented as "impression→click→online order→offline consumption". Such situations require the CVR model to consider not only users’ online behavioral regularity, but also offline transportation regularity.

2.2. User Behavior Sequence Representation

In the past decade, user behavior sequence representation has received much attention and achieved remarkable effectiveness. Many well designed recommender methods have been proposed and have brought huge commercial revenues for companies and advertisers. In these models, users’ history behaviors are transformed into low-dimensional vectors after embedding to represent users’ interests and other characteristics. DIN [20] employs the attention mechanism to activate historical behaviors locally, which captures users’ diverse interests with respect to the given target item. DIEN [21] further proposes an auxiliary loss and an attention mechanism with GRU to capture the dynamic evolution of users’ interests. DFN [29] jointly considers explicit/implicit and positive/negative feedback to learn unbiased user preferences. Moreover, inspired by the success of the self-attention architecture [30], the Transformer is introduced for session CTR prediction [31]. MIND [32] and DMIN [33] model multiple interests by multiple vectors with a dynamic routing mechanism and a capsule network.

Although all these user behavior sequence representation methods have brought a huge boost to the business from the perspective of user interest, there are still opportunities for improvement in modeling user behavior sequences from other perspectives. Cost sensitivity [34] is an indispensable aspect of user modeling, and users of e-commerce often have certain restrictions on payment costs, which makes it possible to further improve user behavior sequence modeling from the perspective of cost.

3. The Proposed Approach

In this section, we introduce the proposed ECMM model. As shown in Figure 2, it consists of three modules: the base module, which includes the online user, offline store, and context features; the entire cost module, which contains both the user’s click and order sequences to capture the user’s historical cost preference; and the cost combination module, which combines online-to-offline cost to predict CTR and CVR. With this network, the model can capture the user’s online behavioral and offline transportation regularities, which are hidden in users’ historical behavior sequences. The details of each module are described as follows.

3.1. Motivation

As discussed in the previous section, users’ online behavioral and offline transportation regularities are indispensable for O2O recommendation [9, 2, 20, 21]. However, how to define their relationship with users’ behavior sequences, as well as how to embody both online and offline cost into a unified framework for CVR prediction, remains unexplored.

On the one hand, we propose a novel CVR prediction method from the perspective of user historical behavior. We propose convert attention to extract the local and global preferences of users’ online-to-offline behaviors from both depth and breadth perspectives. From a local view, an order placed by a user is affected by clicks. We design the local impact of a click on an order from the store perspective. From a global perspective, users’ overall order sequence is influenced by the click sequence in terms of id, price, and relative distance. On the other hand, to model users’ transportation cost, we capture the information of the distance sequence implied in users’ preference for offline cost in the O2O scenario, to assist the model in learning users’ conversion preference in the offline stage. Each store in a user’s historical clicks and orders has distance features, which represent the offline transportation cost. We then use the sliding window attention method to calculate the user’s dynamic preference for offline cost at different timestamps.

3.2. Base Module

The base module is used to aggregate the basic features. Following [10, 11, 12], the embedding and MLP (multilayer perceptron) structures are used in the base module. The user, store, and contextual features ($u \in \mathbb{R}^{n_u}$, $s \in \mathbb{R}^{n_s}$, and $t \in \mathbb{R}^{n_t}$) are the inputs of the base module, which are mapped into a $d$-dimensional space via embedding operations. An MLP is used to learn the aggregated vector $b$ of basic features, with ELU [35] as the activation function:

$$b = ELU(MLP(Embedding(u, s, t))). \tag{2}$$
[Figure 2 image: architecture diagram. Panel labels: Base module (user, store, and context features; embedding; MLP); Entire cost module, with the online convert cost branch (click and order sequences with id, distance, and price; embedding; sparse attention; convert attention with local convert and global convert; mean pooling) and the offline convert cost branch (user click and user pay transportation cost; store distance; N-block Transformer; sliding window attention; flatten); Combination module (feature concat; CTR network; CVR network; outputs CTR, CVR, CTCVR; loss 1, loss 2)]

Figure 2: The structure of ECMM. Two auxiliary costs are introduced to model the entire cost: i) the online convert cost captures the behavior regularity of users when they face the price and distance shown online; ii) the offline convert cost captures the transportation cost for offline consumption.
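To make the base module of Section 3.2 (Equation (2)) concrete, the following is a minimal NumPy sketch, not the authors’ implementation. The vocabulary size, the single hidden layer of width 8, the random initialization, and the names `elu` and `base_module` are assumptions for illustration; the paper only states that user, store, and context features are embedded into a d-dimensional space and aggregated by an MLP with ELU activation.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def base_module(u_ids, s_ids, t_ids, table, weight, bias):
    """b = ELU(MLP(Embedding(u, s, t))) with a one-layer MLP (assumed depth)."""
    ids = np.concatenate([u_ids, s_ids, t_ids])   # all categorical feature ids
    emb = table[ids].reshape(-1)                  # look up and concatenate d-dim embeddings
    return elu(weight @ emb + bias)

rng = np.random.default_rng(0)
d, vocab, hidden = 10, 100, 8                     # d = 10 as in Section 4.1; the rest are toy values
u_ids, s_ids, t_ids = np.array([1, 2]), np.array([3]), np.array([4, 5])
n_feat = len(u_ids) + len(s_ids) + len(t_ids)
table = rng.normal(size=(vocab, d))               # hypothetical shared embedding table
weight = rng.normal(size=(hidden, n_feat * d)) * 0.1
bias = np.zeros(hidden)

b = base_module(u_ids, s_ids, t_ids, table, weight, bias)
print(b.shape)
```

A shared embedding table is used here only to keep the sketch short; a production model would more likely keep one table per feature field.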

3.3. Entire Cost Module

Different from B2C purchases, the O2O scenario generally considers only the stores surrounding a user’s location. The limited candidates reduce the possibility of matching users’ preferences. Thus, it is critical to accurately capture the user’s behavioral regularity from historical behaviors. Meanwhile, O2O users need to consider two-stage costs for decision making, i.e., online order cost and offline transportation cost, both of which should be considered. The entire cost module is designed to solve the above problems and is the most important part of the ECMM model. It contains two parts: the online cost feature module and the offline cost feature module.

Online Cost Feature Module. Each store in the user’s click or order sequence has side-information features of id $s^{id}$, distance $s^{dis}$ and price $s^{price}$ that represent the cost the user accepts when deciding to click/order an offline store on the online platform. The embedding of the $i$-th store in the user’s historical behavior is then

$$h^c_i = Embedding(s^{id}_i, s^{dis}_i, s^{price}_i), \quad h^c_i \in \mathbb{R}^{3d}, \tag{3}$$

$$h^o_i = Embedding(s^{id}_i, s^{dis}_i, s^{price}_i), \quad h^o_i \in \mathbb{R}^{3d}. \tag{4}$$

Thus, the user’s historical click and order behavior sequences, i.e., $H^c$ and $H^o$, can be represented as follows:

$$H^c = concat(h^c_1, h^c_2, ..., h^c_k), \quad H^c \in \mathbb{R}^{k \times 3d}, \tag{5}$$

$$H^o = concat(h^o_1, h^o_2, ..., h^o_k), \quad H^o \in \mathbb{R}^{k \times 3d}, \tag{6}$$

where $k$ denotes the length of the user’s click and order sequences.

After embedding, the sparse attention is used to capture the user’s historical preference under contextual restriction. The sparse attention takes the embeddings of the user’s current context feature and of the click and order sequences as input, and then retrieves the user’s most important click and order behaviors in the current context. The sparse attention [36] is defined as follows:

$$SparseAttn(Q, K, V) = softmax\left(topn\left(\frac{QK^T}{\sqrt{k}}\right)\right)V, \tag{7}$$

where the $topn$ operation takes the top $n$ pieces of historical information most relevant to the current context. Through the sparse attention, we can get the updated embeddings of the user’s click and order sequences:

$$H^a_c = SparseAttn(Q^s, K^c, V^c), \quad H^a_c \in \mathbb{R}^{k \times 3d}, \tag{8}$$

$$H^a_o = SparseAttn(Q^s, K^o, V^o), \quad H^a_o \in \mathbb{R}^{k \times 3d}, \tag{9}$$

where $Q^s$ is the query vector converted from the context features, $\{K^c, V^c\}$ are the key and value vectors converted from the user click sequence, and $\{K^o, V^o\}$ are obtained likewise from the order sequence.

To better capture the impact of the user click sequence $H^a_c$ on the order sequence $H^a_o$ from the retrieved click and order aggregation information, we propose a convert attention mechanism that captures these impacts from both local and global perspectives.

From a local perspective, the preference of the user’s conversion to store $h^a_{o,j} \in H^a_o$ can be characterized by the clicked stores $h^a_{c,i} \in H^a_c$ related to where the order was placed:

$$\beta^j_i = (W^l_c \times h^a_{c,i}) \otimes (W^l_o \times h^a_{o,j})^T, \tag{10}$$

$$s^l_{o,j} = \sum_{i=1}^{k} \frac{\exp(\beta^j_i)}{\sum_{o=1}^{k} \exp(\beta^j_o)} \times h^a_{c,i} + h^a_{o,j}, \quad s^l_{o,j} \in \mathbb{R}^{3d}. \tag{11}$$
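As an illustration of the sparse attention in Equation (7) and the local convert attention in Equations (10) and (11), here is a minimal NumPy sketch under stated assumptions rather than the authors’ code. Assumptions: the scores are scaled by the square root of the key width; the ⊗ operator is read as an inner product, so each click-order pair yields one scalar score; and the sparse attention returns one output row per query row (the paper reports a k×3d output, and how that shape is obtained from the context query is not specified). The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(Q, K, V, n):
    """Eq. (7): softmax restricted to the n largest scores of each query row."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (m, k); scaling choice is an assumption
    kth = np.sort(scores, axis=-1)[:, -n][:, None]  # n-th largest score per row
    masked = np.where(scores >= kth, scores, -np.inf)  # ties may keep more than n entries
    return softmax(masked, axis=-1) @ V

def local_convert_attention(Hc, Ho, Wc, Wo):
    """Eqs. (10)-(11): each order j attends over all clicks i, plus a residual."""
    beta = (Hc @ Wc.T) @ (Ho @ Wo.T).T            # beta[i, j] for click i and order j
    w = softmax(beta, axis=0)                     # normalise over clicks i for each order j
    return w.T @ Hc + Ho                          # row j is s^l_{o,j}

rng = np.random.default_rng(0)
k, D = 5, 6                                       # toy sizes; Section 4.1 uses k = 50 and d = 10
Hc, Ho = rng.normal(size=(k, D)), rng.normal(size=(k, D))
q = rng.normal(size=(1, D))                       # stand-in for the context query Q^s
Hc_a = sparse_attention(q, Hc, Hc, n=2)
S = local_convert_attention(Hc, Ho, rng.normal(size=(D, D)), rng.normal(size=(D, D)))
print(Hc_a.shape, S.shape)
```

With n equal to the sequence length the sparse attention reduces to ordinary scaled dot-product attention, and with n = 1 it simply returns the value row of the best-matching key, which is a convenient sanity check.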
In Equations (10) and (11), $W^l_c, W^l_o \in \mathbb{R}^{3d \times 3d}$ are trainable parameters, and $\beta^j_i$ represents the correlation between clicked store $i$ and order store $j$. $s^l_{o,j}$ aggregates the information of the clicked stores to obtain the local conversion preference and uses it to update the order store information. Here, we use the residual design to retain the original information of the order store.

From a global perspective, the user’s preferences for different dimensions (i.e., store’s id, price, distance) of order stores are affected by the relevant information of the clicked stores. Hence, we separate the submatrices from the click and order sequences:

$$H^a_{id} = (s^{a,id}_i), \quad H^a_{dis} = (s^{a,dis}_i), \quad H^a_{price} = (s^{a,price}_i), \quad H^a_{ctx_j} \in \mathbb{R}^{k \times d}. \tag{12}$$

For each dimension, we calculate the impact of the user’s click sequence on the user’s order sequence from a global perspective:

$$\gamma^{ctx_j}_{ctx_i} = (W^g_c \times H^a_{c,ctx_i}) \otimes (W^g_o \times H^a_{o,ctx_j})^T, \tag{13}$$

$$H^g_{o,ctx_j} = \sum_{ctx_i} \frac{\exp(\gamma^{ctx_j}_{ctx_i})}{\sum_{ctx_i} \exp(\gamma^{ctx_j}_{ctx_i})} H^a_{c,ctx_i} + H^a_{o,ctx_j}, \quad H^g_{o,ctx_j} \in \mathbb{R}^{k \times d}, \tag{14}$$

where $ctx_i, ctx_j \in (id, dis, price)$, $W^g_c, W^g_o \in \mathbb{R}^{d \times d}$ are trainable parameters, and $\gamma^{ctx_j}_{ctx_i}$ represents the correlation between the click sequence in dimension $ctx_i$ and the order sequence in dimension $ctx_j$. $H^g_{o,ctx_j}$ aggregates the additional click information to obtain the global conversion preference and uses it to update the order sequence. The residual design is also used in this part.

Finally, the aggregations of the order sequence and the click sequence can be obtained:

$$h_o = Meanpooling(\Vert_j (s^l_{o,j}) + \Vert_{ctx_j} (H^a_{o,ctx_j})), \quad h_o \in \mathbb{R}^{3d}, \tag{15}$$

$$h_c = Meanpooling(H^a_c), \quad h_c \in \mathbb{R}^{3d}, \tag{16}$$

where $\Vert$ denotes the concatenation of vectors.

Offline Cost Feature Module. In the O2O scenario, offline transportation costs also play an important role in the conversion rate, as users need to go to offline stores. We first construct the user’s historical behavior sequences to represent the user’s historical click and order transportation costs, and take them as the input of the $N$-layer Transformer encoder:

$$T^{dis}_c = Transformer(H^{dis}_c), \quad T^{dis}_c \in \mathbb{R}^{k \times d}, \tag{17}$$

$$T^{dis}_o = Transformer(H^{dis}_o), \quad T^{dis}_o \in \mathbb{R}^{k \times d}. \tag{18}$$

We propose a sliding window attention mechanism that uses fixed-length windows to characterize the user’s preference for transportation cost in different periods, because this preference varies across periods. Note that the mechanism generalizes not only to O2O platform users but also to other scenarios that need to capture users’ dynamic preferences across periods.

Each offline store has a distance feature $s^{dis} \in \mathbb{R}^d$; for the current store, we match this feature with the user’s historical distance sequence:

$$D_{j,i} = T^{dis}_{j,\,i:i+ws}, \quad D_{j,i} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{19}$$

$$A_{j,i} = softmax\left(\frac{D_{j,i}\, s^{dis}}{\sqrt{ws}}\right), \quad A_{j,i} \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{20}$$

$$M^{dis}_j = \sum_{i=1}^{k} A^{w}_{j,i} D_{j,i}, \quad M^{dis}_j \in \mathbb{R}^{ws \times d}, \quad j \in \{c, o\}, \tag{21}$$

where $ws \in \mathbb{N}$ denotes the window length, $D_{j,i}$ denotes the subsequence in the $i$-th window, $M^{dis}_j$ denotes the matrix of the user’s offline preference along the window-length dimension, and $m^{dis}_j = Flatten(M^{dis}_j)$ denotes the user offline preference vector.

3.4. Cost Combination Module

In this section, we embody the CTR and CVR prediction tasks into a multi-task framework. The input of this module is the concatenation of the outputs from the base module and the entire cost module. $r_{ctr}$ and $r_{cvr}$ are each calculated by an MLP network:

$$r_{ctr} = ELU(MLP([b, h_c, h_o, m^{dis}_c, m^{dis}_o])), \tag{22}$$

$$r_{cvr} = ELU(MLP([b, h_c, h_o, m^{dis}_c, m^{dis}_o])). \tag{23}$$

We then calculate the post-view click-through & conversion rate (CTCVR) by $r_{ctcvr} = r_{ctr} * r_{cvr}$. The loss function used here is lambda loss [37].

4. Experiments

In this section, we evaluate the model performance of the proposed ECMM. We describe the experimental settings and experimental results as follows.

4.1. Experimental Settings

Datasets. We selected 30 days of exposure logs from August to September, obtained from the online O2O business system, to train the CVR model. We have two test sets: one covers one day in September and the other covers three days in October. Since user behavior evolves with time, the closer the test period is to the training data, the closer its distribution of user behavior is to that of the training data,
Table 1
Offline experimental results on two testing sets.

| Models | September CTR-NDCG | September CTCVR-NDCG | October CTR-NDCG | October CTCVR-NDCG | October improvement CTR-NDCG | October improvement CTCVR-NDCG |
|---|---|---|---|---|---|---|
| ESMM | 0.7560 | 0.8446 | 0.7515 | 0.8455 | 0.00% | 0.00% |
| ESMM+DIN | 0.7577 | 0.8456 | 0.7528 | 0.8463 | 0.17% | 0.09% |
| ECMM wo offline and convAttn | 0.7575 | 0.8458 | 0.7525 | 0.8464 | 0.13% | 0.11% |
| ECMM wo offline | 0.7574 | 0.8462 | 0.7532 | 0.8469 | 0.23% | 0.17% |
| ECMM wo online and slidWinAttn | 0.7577 | 0.8458 | 0.7533 | 0.8467 | 0.24% | 0.14% |
| ECMM wo online | 0.7579 | 0.8463 | 0.7534 | 0.8471 | 0.25% | 0.19% |
| ECMM+dualInfo | 0.7576 | 0.8462 | 0.7533 | 0.8469 | 0.24% | 0.17% |
| ECMM+sepInput | 0.7581 | 0.8465 | 0.7537 | 0.8472 | 0.29% | 0.20% |
| ECMM | 0.7585 | 0.8480 | 0.7541 | 0.8487 | 0.34% | 0.38% |
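The quantities in Table 1 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors’ evaluation code: `ndcg` follows the NDCG definition of Equation (24), and `relative_gain_pct` reflects the observation that the "October improvement" columns are consistent with the relative gain over ESMM on the October test set. The function names are hypothetical, and because the inputs here are the rounded NDCG values from the table, the last digit can differ from the reported percentages.

```python
import math

def dcg(labels):
    """DCG of labels listed in ranked order: sum of (2^r - 1) / log(1 + j), j = 1..n."""
    return sum((2 ** r - 1) / math.log(1 + j) for j, r in enumerate(labels, start=1))

def ndcg(labels_in_model_order):
    """NDCG = DCG / IDCG, where IDCG ranks the same labels from highest to lowest.

    Zero labels contribute nothing, so summing the ideal list over all n
    positions equals summing over the |rel| non-zero ones. The logarithm
    base cancels in the ratio.
    """
    ideal = dcg(sorted(labels_in_model_order, reverse=True))
    return dcg(labels_in_model_order) / ideal if ideal > 0 else 0.0

def relative_gain_pct(model, baseline):
    """Relative improvement of a metric over the baseline, in percent."""
    return (model - baseline) / baseline * 100.0

# One request whose only converted store was ranked second by the model.
print(ndcg([0, 1, 0]))
# ECMM vs. ESMM on the October test set, using the rounded values in Table 1.
print(relative_gain_pct(0.7541, 0.7515), relative_gain_pct(0.8487, 0.8455))
```

Averaging `ndcg` over all requests in a test set, once with lists sorted by predicted CTR and click labels and once with lists sorted by predicted CTCVR and order labels, would give the two metric families reported in the table.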

and the longer the interval, the more the user behavior distribution changes. Therefore, the test sets in this experiment can effectively evaluate the accuracy and generalization of the model. The number of our training samples is approximately 1.1 billion, while the testing sets contain 40 million and 100 million samples, respectively.

Metric. The goal of our ranking task is to provide a list that is more likely to facilitate users’ conversion. The evaluation metric used in this paper is NDCG. We have two ranking strategies: sorting by CTR and sorting by CTCVR. So we have NDCG sorted by CTR to predict the real click rate and NDCG sorted by CTCVR to predict the real purchase rate. The calculation criteria are as follows:

$$NDCG = \frac{DCG}{IDCG} = \frac{\sum_{j=1}^{n} (2^{r} - 1)/\log(1 + j)}{\sum_{j=1}^{|rel|} (2^{r} - 1)/\log(1 + j)}, \tag{24}$$

where $n$ represents the length of the list of stores ranked by the model, $r$ represents the label of the sample (click or order, depending on the model task), and $|rel|$ represents the number of stores whose label is not zero.

Compared Methods. Our baseline is a highly optimized ESMM model that incorporates a large number of business features and handcrafted features. The total number of features is 473. The embedding dimension $d$ is 10. We use the sequence features from 180 days of users’ history, and the length $k$ is 50. The number of Transformer layers $N$ is 2. Because 80% of users’ click sequences are shorter than 10 and their order sequences are shorter than 5, and considering the service performance, we chose $n = 10$ for the sparse attention. The dimension of the MLP used in the base module is 1024, and the dimensions of the four-layer MLP used by the CTR and CVR networks are 512, 256, 128, and 1, with the ELU activation function. All baselines take into account the statistical user features of online and offline costs for fair comparison. We conduct comparative experiments with three categories of methods:

(1) Baselines: a) ESMM [10]. An outstanding multi-task model for learning CTR and CVR in the industry. b) ESMM+DIN [20]. Based on ESMM, users’ click sequence feature and the current store feature are introduced by the DIN method.

(2) Ablation: a) ECMM wo offline and convAttn. Based on ECMM, we only use online convert cost without convert attention. b) ECMM wo offline. Based on ECMM, we only use online convert cost. c) ECMM wo online and slidWinAttn. Based on ECMM, we only use offline convert cost without sliding window attention. d) ECMM wo online. Based on ECMM, we only use offline convert cost.

(3) ECMM variants: a) ECMM+dualInfo. Based on ECMM, the convert attention not only passes click sequence information to the order sequence but also passes order sequence information to the click sequence. b) ECMM+sepInput. Based on ECMM, we use the click feature as the input for the CTR network and the order feature as the input for the CVR network.

4.2. Offline Performance

The evaluation metrics used in this paper are CTR-NDCG and CTCVR-NDCG. Table 1 shows the experimental results of the comparison methods on two testing sets, from which we have:

For the entire cost module, compared with ESMM, ECMM can obtain a 0.35% gain on CTR-NDCG and a 0.38% gain on CTCVR-NDCG (see footnote 1). All other ablation methods and variants also improve the model performance after modeling users’ behavior sequences.

Footnote 1: For large-scale datasets in industrial recommender systems, the improvement is considerable because of its hardness, and the testing results in Section 4.3 further verify the significant improvement of our proposal.

For the online cost feature, compared with ESMM, ESMM+DIN, which adds the click sequence, shows a certain increase in CTR- and CTCVR-NDCG. As shown in Figure 3, ECMM wo offline and convAttn, which further adds the order sequence, slightly decreases in the CTR-
NDCG, but greatly improves the CTCVR-NDCG. ECMM
wo offline indicates that the convert attention mechanism
can learn users’ order characteristics from click to order.
These three methods show that it is effective to utilize his-
torical features to improve CVR prediction. The convert
attention brings 0.18% and 0.19% gains in CTR-NDCG
and CTCVR-NDCG.

Figure 5: Online performance. The improvements of CTCVR
and CTR are significant with the significance level 𝛼=0.05.

consistent with the assessment in September. The ECMM
model shows that the advantage of considering users’
Figure 3: Improvement in conversion rate prediction from
online behavioral regularity in October.
online behavioral and offline transportation regularities
is helpful in predicting users’ current CTR and CTCVR.

For offline cost feature, the ECMM wo online and 4.3. Online Evaluations
slidWinAttn model that uses distance sequence features
brings stronger effects improve both CTR- and CTCVR- Online A/B test was conducted in the recommender sys-
NDCG. As showen in Figure 4, comparing ECMM wo tem in 7 days in January 2022. For the control group,
online and slidWinAttn with ESMM, it can be seen that 10% of users were randomly assigned and presented in
the offline transportation cost is indispensable for the a recommender system presented by a highly optimized
conversion rate prediction of O2O platform. And ECMM ESMM algorithm. For the experimental group, 10% of
wo online model introduced by our proposed slide win- users were randomly selected to use the ECMM method.
dow attention brings greater gains by dynamic matching In the online experiment, we choose CTR and CTCVR as
user preference during different times. The sliding win- evaluation indicators, where CTCVR represents the pur-
dow method brings 0.02% and 0.05% gains in CTR-NDCG chase rate of each request. The result is shown in Figure
and CTCVR-NDCG. 5. We can see that our proposed ECMM method im-
proves the CTR by 0.52% (p-value=0.00<0.05) compared
with the baseline model, and the CTCVR by 0.73% (p-
value=0.02<0.05), which has a 1.8% (p-value=0.02<0.05)
increase in total revenue. Here, total revenue increases
to 1.8% with a 0.45% increase in CTCVR means the model
provides users with higher price list. So far, the ECMM
method has been applied to the main online traffic and
has served more than hundreds of millions of users, bring-
Figure 4: Improvement in conversion rate prediction from ing a significant increase in the total revenue of Meituan.
offline transportation regularity in October.

In order to explore whether the user’s historical 5. Conclusion
order will affect click, we further study with the
ECMM+dualInfo model that the order sequence trans- In this paper, inspired by the user sequential behaviors
mits information to the click sequence. It can be seen in O2O platform, a novel model is proposed to predict
that the click NDCG decreased by 0.05%, and the CTCVR- conversion rate. Further, introduce covert attention and
NDCG decreased by 0.06%. We separate the click and the sliding window attention in the cost module to learn users’
order features into the CTR network and CVR network to online behavioral regularity and offline transportation
obtain the ECMM+sepInput model to verify the feature regularity. At the same time, offline experiments have
impact of different task, and found that separate features proved the effectiveness of our proposed method to learn
will reduce model performance. users’ conversion from users’ click sequence to order
To verify the generalization of our model instead of sequence, and the accuracy of the ranking list is im-
fitting users over a certain period, we further evaluate proved by evaluating NDCG. Online experiments show
our method on a test set in October. The results are that ECMM method has a significant effect on improv-
that the ECMM method has a significant effect on improving the total revenue of the O2O platform.
For now, the ECMM method has been applied to the main online traffic, bringing a significant
increase in the total revenue of the enterprise.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (NSFC) under
Grants 72071029, 71974031 and 72231010. This research was also supported by Meituan.

References

[1] X. Ding, J. Tang, T. Liu, C. Xu, Y. Zhang, F. Shi, Q. Jiang, D. Shen, Infer implicit contexts in real-time online-to-offline recommendation, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2336–2346. URL: https://doi.org/10.1145/3292500.3330716. doi:10.1145/3292500.3330716.
[2] H. Li, Q. Shen, Y. Bart, Local market characteristics and online-to-offline commerce: An empirical analysis of groupon, Management Science 64 (2018) 1860–1878.
[3] S. Kawanaka, D. Moriwaki, Uplift modeling for location-based online advertising, in: Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-Based Recommendations, Geosocial Networks and Geoadvertising, LocalRec ’19, Association for Computing Machinery, New York, NY, USA, 2019.
[4] M.-H. Park, J.-H. Hong, S.-B. Cho, Location-based recommendation system using bayesian user’s preference model in mobile devices, in: International conference on ubiquitous intelligence and computing, Springer, 2007, pp. 1130–1139.
[5] H. Yang, T. Liu, Y. Sun, E. Bertino, Exploring the interaction effects for temporal spatial behavior prediction, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2013–2022. URL: https://doi.org/10.1145/3357384.3357963. doi:10.1145/3357384.3357963.
[6] J. Bao, Y. Zheng, M. F. Mokbel, Location-based and preference-aware recommendation using sparse geo-social networking data, in: Proceedings of the 20th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’12, Association for Computing Machinery, New York, NY, USA, 2012, pp. 199–208. URL: https://doi.org/10.1145/2424321.2424348. doi:10.1145/2424321.2424348.
[7] J. Huang, K. Hu, Q. Tang, M. Chen, Y. Qi, J. Cheng, J. Lei, Deep position-wise interaction network for ctr prediction, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1885–1889.
[8] Y. Ping, C. Gao, T. Liu, X. Du, H. Luo, D. Jin, Y. Li, User consumption intention prediction in meituan, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3472–3482.
[9] Z. Fang, B. Gu, X. Luo, Y. Xu, Contemporaneous and delayed sales impact of location-based mobile promotions, Information Systems Research 26 (2015) 552–564.
[10] X. Ma, L. Zhao, G. Huang, Z. Wang, Z. Hu, X. Zhu, K. Gai, Entire space multi-task model: An effective approach for estimating post-click conversion rate, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1137–1140. URL: https://doi.org/10.1145/3209978.3210104. doi:10.1145/3209978.3210104.
[11] H. Wen, J. Zhang, Y. Wang, F. Lv, W. Bao, Q. Lin, K. Yang, Entire space multi-task modeling via post-click behavior decomposition for conversion rate prediction, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2377–2386. URL: https://doi.org/10.1145/3397271.3401443.
[12] H. Wen, J. Zhang, F. Lv, W. Bao, T. Wang, Z. Chen, Hierarchically modeling micro and macro behaviors via multi-task learning for conversion rate prediction, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA, 2021, pp. 2187–2191. URL: https://doi.org/10.1145/3404835.3463053.
[13] Q. Lu, S. Pan, L. Wang, J. Pan, F. Wan, H. Yang, A practical framework of conversion rate prediction for online display advertising, in: Proceedings of the ADKDD’17, ADKDD’17, Association for Computing Machinery, New York, NY, USA, 2017. URL: https://doi.org/10.1145/3124749.3124750. doi:10.1145/3124749.3124750.
[14] T. Tong, X. Xu, N. Yan, J. Xu, Impact of different platform promotions on online sales and conversion rate: The role of business model and product line length, Decision Support Systems (2022) 113746.
[15] S. Guo, L. Zou, Y. Liu, W. Ye, S. Cheng, S. Wang, H. Chen, D. Yin, Y. Chang, Enhanced Doubly Robust Learning for Debiasing Post-Click Conversion Rate Estimation, Association for Computing Machinery, New York, NY, USA, 2021, pp. 275–284. URL: https://doi.org/10.1145/3404835.3462917.
[16] X. Pan, M. Li, J. Zhang, K. Yu, L. Wang, H. Wen, C. Mao, B. Cao, Conversion rate prediction via meta learning in small-scale recommendation scenarios, arXiv preprint arXiv:2112.13753 (2021).
[17] H. Wang, Z. Li, X. Liu, D. Ding, Z. Hu, P. Zhang, C. Zhou, J. Bu, Fulfillment-time-aware personalized ranking for on-demand food recommendation, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 4184–4192.
[18] D. Xi, Z. Chen, P. Yan, Y. Zhang, Y. Zhu, F. Zhuang, Y. Chen, Modeling the sequential dependence among audience multi-step conversions with multi-task learning in targeted display advertising, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3745–3755.
[19] F. Xiao, L. Li, W. Xu, J. Zhao, X. Yang, J. Lang, H. Wang, Dmbgn: Deep multi-behavior graph networks for voucher redemption rate prediction, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3786–3794.
[20] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, K. Gai, Deep interest network for click-through rate prediction, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1059–1068. URL: https://doi.org/10.1145/3219819.3219823. doi:10.1145/3219819.3219823.
[21] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, K. Gai, Deep interest evolution network for click-through rate prediction, volume 33, 2019, pp. 5941–5948. URL: https://ojs.aaai.org/index.php/AAAI/article/view/4545. doi:10.1609/aaai.v33i01.33015941.
[22] C. Li, Z. Liu, M. Wu, Y. Xu, H. Zhao, P. Huang, G. Kang, Q. Chen, W. Li, D. L. Lee, Multi-interest network with dynamic routing for recommendation at tmall, in: Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 2615–2623.
[23] Q. Pi, W. Bian, G. Zhou, X. Zhu, K. Gai, Practice on long sequential user behavior modeling for click-through rate prediction, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2671–2679.
[24] K. Ren, J. Qin, Y. Fang, W. Zhang, L. Zheng, W. Bian, G. Zhou, J. Xu, Y. Yu, X. Zhu, et al., Lifelong sequential modeling with personalized memorization for user response prediction, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 565–574.
[25] Q. Tan, J. Zhang, J. Yao, N. Liu, J. Zhou, H. Yang, X. Hu, Sparse-interest network for sequential recommendation, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 598–606.
[26] K.-c. Lee, B. Orten, A. Dasdan, W. Li, Estimating conversion rate in display advertising from past performance data, in: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012, pp. 768–776.
[27] O. Chapelle, Modeling delayed feedback in display advertising, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 1097–1105.
[28] Q. Lu, S. Pan, L. Wang, J. Pan, F. Wan, H. Yang, A practical framework of conversion rate prediction for online display advertising, in: Proceedings of the ADKDD’17, 2017, pp. 1–9.
[29] R. Xie, C. Ling, Y. Wang, R. Wang, F. Xia, L. Lin, Deep feedback network for recommendation, in: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 2519–2525.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[31] Y. Feng, F. Lv, W. Shen, M. Wang, F. Sun, Y. Zhu, K. Yang, Deep session interest network for click-through rate prediction, in: IJCAI, 2019.
[32] C. Li, Z. Liu, M. Wu, Y. Xu, P. Huang, H. Zhao, G. Kang, Q. Chen, W. Li, D. L. Lee, Multi-interest network with dynamic routing for recommendation at tmall, Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019).
[33] Z. Xiao, L. Yang, W. Jiang, Y. Wei, Y. Hu, H. Wang, Deep multi-interest network for click-through rate prediction, Proceedings of the 29th ACM International Conference on Information & Knowledge Management (2020).
[34] T. Natarajan, S. A. Balasubramanian, D. Kasilingam, Understanding the intention to use mobile shopping applications and its influence on price sensitivity, Journal of Retailing and Consumer Services 37 (2017) 8–22.
[35] D. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), in: Y. Bengio, Y. LeCun (Eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL: http://arxiv.org/abs/1511.07289.
[36] G. Zhao, J. Lin, Z. Zhang, X. Ren, X. Sun, Sparse transformer: Concentrated attention through explicit selection, 2020. URL: https://openreview.net/forum?id=Hye87grYDH.
[37] X. Wang, C. Li, N. Golbandi, M. Bendersky, M. Najork, The lambdaloss framework for ranking metric optimization, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 1313–1322.