Spatiotemporal-Enhanced Network for Click-Through Rate
Prediction in Location-based Services
Shaochuan Lin¹, Yicong Yu², Xiyu Ji², Taotao Zhou², Hengxu He²,†, Zisen Sang², Jia Jia²,†, Guodong Cao³ and Ning Hu¹
¹ Alibaba Group, Hangzhou, China
² Alibaba Group, Shanghai, China
³ Alibaba Group, Beijing, China


Abstract
In Location-Based Services (LBS), user behavior naturally has a strong dependence on spatiotemporal information, i.e., in different geographical locations and at different times, user click behavior changes significantly. Appropriate spatiotemporal enhancement modeling of user click behavior and large-scale sparse attributes is key to building an LBS model. Although most existing methods have proven effective, they are difficult to apply to takeaway scenarios due to insufficient modeling of spatiotemporal information. In this paper, we address this challenge by explicitly modeling the timing and locations of interactions and proposing a Spatiotemporal-Enhanced Network, namely StEN. In particular, StEN applies a Spatiotemporal Profile Activation module to capture common spatiotemporal preference through attribute features. A Spatiotemporal Preference Activation module is further applied to model the personalized spatiotemporal preference embodied by behaviors in detail. Moreover, a Spatiotemporal-aware Target Attention mechanism is adopted to generate different parameters for target attention at different locations and times, thereby improving the personalized spatiotemporal awareness of the model. Comprehensive experiments are conducted on three large-scale industrial datasets, and the results demonstrate the state-of-the-art performance of our method. In addition, we have also released an industrial dataset for the takeaway industry to make up for the lack of public datasets in this community.

Keywords
spatiotemporal systems, click-through rate prediction, location-based services


DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
Email: lin.lsc@alibaba-inc.com (S. Lin); yicongyu.yyc@alibaba-inc.com (Y. Yu); jixiyu.jxy@alibaba-inc.com (X. Ji); taotao.zhou@lazada.com (T. Zhou); hengxu.hhx@alibaba-inc.com (H. He); zisen.szs@koubei.com (Z. Sang); jj229618@alibaba-inc.com (J. Jia); guodong.cao@alibaba-inc.com (G. Cao); huning.hu@alibaba-inc.com (N. Hu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Location-Based Services (LBS) are mobile services that provide users with content relevant to their current location, on smartphones or through other services. Among them, the takeaway service is the most popular and convenient commercial service. Like other LBS, it also requires timely delivery, which results in a strong dependence on time and geographical location for users. As a result, recommending products suited to the user's temporal and spatial demands in LBS is a challenging problem.
   Recently, some methods [1, 2, 3] have proven effective in e-commerce by exploiting the user's historical behavior, but they are not easy to adapt to the LBS scenario. The main reason is that most of them do not pay attention to users' strong spatial and temporal demands. For instance, a user may prefer fast food in the work area on weekdays and choose fried chicken in his or her residential area on weekends. Such changes in user behavioral interests are bound to changes of location and time. Although there are some initial efforts [4, 5] to integrate spatiotemporal information into sequential recommendation, most of them consider only partial spatiotemporal information, and efforts to fully and thoroughly model such integrated spatiotemporal patterns are still lacking. Different from the above scenarios, the takeaway scenario has some common attributes that are only weakly correlated with the user's historical behavior. For example, milk tea is naturally suitable to be recommended at afternoon tea time. On the other hand, the historical behaviors of users imply their personal dietary preferences.
   To tackle the above problems, we propose a Spatiotemporal-Enhanced Network (StEN) to better meet users' temporal and spatial demands. Specifically, StEN applies a Spatiotemporal Profile Activation (StPro) module to model the user's common spatiotemporal preference by activating attribute features (user and item). For the personalized spatiotemporal preference of users, a novel Spatiotemporal Preference Activation (StPre) module and a Spatiotemporal-aware Target Attention (StTA) module are proposed. StPre breaks down in detail the spatiotemporal preference embodied by the user's historical behavior, using Temporal Evolving
Activation (TEA), Temporal Periodic Fusion (TPF) and Spatial Preference Activation (SPA). StTA employs different spatiotemporal information to generate different parameters and feeds them into target attention to improve the personalized spatiotemporal awareness of the model. In addition, we have released an industrial dataset for the takeaway industry to make up for the lack of public datasets in this community.
   Our contributions can be summarized as follows:

     • StEN applies a Spatiotemporal Profile Activation (StPro) module to model the user's common spatiotemporal preference by activating attribute features (user and item).
     • For the personalized spatiotemporal preference of users, a novel Spatiotemporal Preference Activation (StPre) module is proposed, which breaks down in detail the spatiotemporal preference embodied by the user's historical behavior and extracts preferences through three small modules: Temporal Evolving Activation (TEA), Temporal Periodic Fusion (TPF) and Spatial Preference Activation (SPA).
     • We also propose a Spatiotemporal-aware Target Attention (StTA) module, which employs different spatiotemporal information to generate different parameters and feeds them into target attention to improve the personalized spatiotemporal awareness of the model.
     • In addition, we have released an industrial dataset for the takeaway industry to make up for the lack of public datasets in this community. Experimental results demonstrate that our method achieves state-of-the-art performance on three large-scale industrial datasets, and the online A/B testing results further show its practical value.

2. Related Work

2.1. Sequence-based Model
Earlier deep CTR approaches aim to eliminate complicated feature engineering and focus more on automatically mining the correlations between features [6, 7, 8, 9, 10]. Later on, researchers [1, 2, 3] found that users' historical behavior sequences contain richer and more direct information, which brought breakthroughs to the entire recommendation community. Many studies focus on exploring potential interests in the user's historical behavior sequence. They extract sequence features by incorporating structures such as pooling, RNN, and attention into the model. YoutubeDNN[11] embeds the items in the sequence and then takes the average of the embeddings to extract historical sequence features. DIN[1] argues that the interests in the user's historical behavior sequence are diverse. Faced with a particular product, only the part of the interests associated with that product will influence the user's behavior. Based on this, DIN designs a local activation module to extract different user interests from the sequence for various target commodities. DIEN[2] further explores the interrelationships between users' historical behaviors and proposes the concept of user interest evolution. It designs an auxiliary loss and a structure based on GRU. Inspired by the success of the self-attention mechanism in sequence-to-sequence tasks, BST[12] leverages a transformer layer instead of GRU to mine information about the user's interest. DSIN[3] observes that the user's interests in a short period are concentrated, while long-term interests are scattered. It splits the sequence into different sessions and explores the information through the self-attention mechanism and a Bi-LSTM module. SIM[13] proposes an interest mining method for life-long user sequences. However, the full historical behavior sequences of users are very long, which may lead to time-consuming computation and noise problems. To overcome this, SIM provides a search-based long sequence extraction method to extract top-k behavior sequences from life-long sequences through soft and hard search technology.

2.2. Time Aware Attention Model
The above deep CTR models do not explicitly make use of the click time information in the user's historical behavior, although the click time has an impact on both the user's evolutionary behavior and the user's periodic behavior. The user's evolutionary behavior denotes that the user's interest changes over time, and the user's periodic behavior indicates the user's periodic actions. Specifically, TIEN[14] pays more attention to the user's evolutionary behavior, and holds that the closer the historical behavior is to the current time, the greater the weight should be. TLSAN[15] leverages the absolute value of the time difference and then uses its reciprocal as the time position embedding. TiSASRec[16] models items' relative time intervals by sine and cosine functions to explore the evolutionary behavior of users and then utilizes items' absolute temporal signals, such as month (M), weekday (W), date (D) and hour (H), to detect periodic behavior of users. TimelyRec[17] captures potential irregularity information in users' periodic patterns, and then integrates the information to compute the similarity between the target time and users' interactions with an attention mechanism.

2.3. Spatiotemporal Model
Spatial location is also important for some location-aware platforms, such as Facebook Places[18] and Airbnb[19]. Thus, it is natural to integrate temporal information and spatial location to optimize recommendation models. However, due to the complexity of model design, publicly available existing work is limited. CalendarGNN[20] utilizes GNN and GRU to extract the segmented time and geographic information in the user's historical behavior sequence. While effective, it is applied to article browsing of web pages without regard to the geographic location of the item, so it is not suitable for our takeaway industry. TRISAN[21] extracts the spatiotemporal information from the user's historical behavior sequence by employing two spatial activation modules and one temporal similarity activation module in the model. However, it does not detail the information contained in the user's spatiotemporal behavior, which leads to insufficient exploration of spatiotemporal information. While TRISAN is of great relevance for our purposes, unfortunately the method has not been open-sourced and the dataset used in its paper is not publicly available, so we cannot compare our method with it in Section 4.

Figure 1: Our StEN consists of three modules: Spatiotemporal Profile Activation (StPro), Spatiotemporal Preference Activation (StPre) and Spatiotemporal-aware Target Attention (StTA). [Diagram: an embedding layer over the user features (User Id, User Views in the last 30 days, ...), spatiotemporal features (Hour, User Geohash, ...), target item features (Target Id, Target Category Id, ...) and user behaviors (Item Id, City Id, ...) feeds StPro, StPre (TEA, TPF, SPA) and StTA; their outputs are concatenated and passed through a DNN tower of FC+BN+LReLU layers (1024, 512, 256) that outputs the CTR.]

Figure 2: The architecture of Spatiotemporal Profile Activation (StPro) and Spatiotemporal Preference Activation (StPre). StPre includes three modules: Temporal Evolving Activation (TEA), Temporal Periodic Fusion (TPF) and Spatial Preference Activation (SPA). Panels: (a) the architecture of StPro; (b) the architecture of SPA; (c) the architecture of TEA; (d) the architecture of TPF.

3. Spatiotemporal-Enhanced Network

3.1. Preliminary
In this paper, we denote $x = (m, u, st, b) \in \mathcal{X}$ as input data, where $m$ is the target item feature, $u$ is the user feature, $b$ is the user click behavior and $st$ is the spatiotemporal feature.
   In particular, we geocode (https://en.wikipedia.org/wiki/Geohash) the user's latitude and longitude and convert them to hexadecimal numbers to obtain
geohash-6, which is then combined with the user's Area-of-Interest (AOI)[22] to serve as the spatial feature $g$ in this paper. The temporal feature is represented by hour of day, time period of day (breakfast, lunch, afternoon tea, dinner and night snack) and day of the week. User features $u$ include user id, user gender and other features, while item features $i$ include item id, item category and other features. Before the features enter the model, we convert them into vectorized representations. For convenience of description, in the rest of this article $m$, $u$, $st$, $b$ all represent the embedding vectors of the corresponding features. Denoting $y \in \mathcal{Y}$ as the click label, our CTR prediction task can be defined as:

$$ \mathcal{P}(y = 1 \mid x) = f(x; \theta), \quad x \in \mathcal{X} \tag{1} $$

where $f(x; \theta)$ is the probability value obtained by forwarding the input data $x$ through any CTR network and then applying a sigmoid function, and $\theta$ represents the parameters of the network. Typically, each of our user history behaviors includes the item $v$, the item's location $l$, the click time $t$ and the click period of time $p$. The CTR task of Equation 1 is then mainly achieved by minimizing the following cross-entropy loss function during training,

$$ \mathcal{L}(f, x_i, y_i) = \frac{1}{N} \sum_{i=1}^{N} \left[ -y_i \log f(x_i; \theta) - (1 - y_i) \log\left(1 - f(x_i; \theta)\right) \right] \tag{2} $$

where $y_i \in \{0, 1\}$ is the ground-truth label, $N$ is the mini-batch size and $i$ is the index of the input data. We set $N$ to 1024 in this paper.

3.2. Spatiotemporal Profile Activation
This module is mainly used to capture common spatiotemporal preferences that are less correlated with user behavior. E-commerce scenarios only need to consider the personalized behavior of the user, but in the takeaway scenario, we need to consider the impact of time and location on users and items. For instance, there is a natural difference between a user's orders in the workplace and in the residential area. Therefore, we use the spatiotemporal features $st$ to extract common spatiotemporal preferences for the static item and user features. Below we take the user feature as an example,

$$ att_u = \mathrm{sigmoid}\left( \frac{FC_u(st) \cdot u^T}{\sqrt{d_u}} \right) u \tag{3} $$

where $FC_u(st) \in \mathbb{R}^{d_{st} * d_u}$ is the linear transformation of $st$, $d_u$ is the last dimension of $u$, and $d_{st}$ is the last dimension of $st$. Inspired by [1], we then concatenate $u$ and $att_u$ together with their difference and their product (their common values) to get the final activation value $h_u = \mathrm{concat}(u, att_u, u - att_u, u * att_u)$.
   Through the same activation method, we can obtain the final activation value of the item, denoted as $h_m$. Finally, we concatenate the above activation values to obtain the spatiotemporal profile activation value $h_{stpro}$. Fig. 2(a) shows the structure.

3.3. Spatiotemporal Preference Activation
We further propose a Spatiotemporal Preference Activation (StPre) module to model the personalized spatiotemporal preference embodied by user behaviors in detail.

3.3.1. Temporal Evolving Activation (TEA)
The time sequence of user clicks has a certain impact on the current behavior. For example, a user who frequently clicks on milk tea in a short period of time will be more willing to click on dessert in the next time slot. To model this temporal evolving pattern, we first calculate the time interval $t_i$ between the request time $t_r$ and each historical behavior click time $t_j$. Then we eliminate the noise by applying a nonlinear transformation to the time interval, thus obtaining the temporal evolution factor $f_{te}$,

$$ f_{te} = FC_2(\mathrm{LeakyReLU}(FC_1(e^{-t_i}))) + e^{-t_i} \tag{4} $$

where $FC_1 \in \mathbb{R}^{N * N_h}$ and $FC_2 \in \mathbb{R}^{N_h * N_l}$ denote two fully connected layers, $t_i \in \mathbb{R}^{N * N_l}$, $N_h$ is the hidden size, and $N_l$ is the sequence length we set. In this paper, we abbreviate the structure of Equation 4 as FFN. Then we normalize the temporal evolution factor $f_{te}$ through a softmax function to get the weight of temporal evolution $w_{te}$. After that, $w_{te}$ helps to obtain temporal activation features related to the behavior order,

$$ att_{tea} = w_{te} \cdot FFN(\mathrm{sigmoid}(FC_t(u)) \cdot b) \tag{5} $$

where $FC_t(u) \in \mathbb{R}^{N_u * 1}$ and $N_u$ is the last dimension of the feature $u$. Finally, our robust temporal evolution fusion feature is obtained by $h_{tea} = w_m * \mathrm{MeanPooling}(FFN(b)) + w_{tea} * att_{tea}$. The mean weight $w_m$ and the time interval weight $w_{tea}$ are two trainable weight parameters used to balance the output. The module is depicted in Fig. 2(c).

3.3.2. Temporal Periodic Fusion (TPF)
User historical behavior contains rich but scattered behavioral interests. However, when we explore user behavior from the perspective of time period, we find that users' behavioral interests are more concentrated and periodic. The model would be messy if we directly learned from mixed user behavior without slicing the behaviors.
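As a rough illustration of slicing behaviors by time period, the following Python sketch groups a user's behavior embeddings by the click period attached to each behavior (the five periods listed in Section 3.1), mean-pools each slice, and averages the per-period vectors. It is a simplified illustration only: the learned FFN is omitted, and the function names, the zero vector used for empty periods, and the toy embeddings are assumptions made here, not part of StEN.

```python
from collections import defaultdict

# The five meal periods named in Section 3.1.
PERIODS = ("breakfast", "lunch", "afternoon_tea", "dinner", "night_snack")


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def slice_by_period(behaviors):
    """Group behavior embeddings by the period label attached to each click."""
    slices = defaultdict(list)
    for period, embedding in behaviors:
        slices[period].append(embedding)
    return slices


def periodic_summary(behaviors, dim):
    """Mean-pool each period slice, then average the per-period vectors.

    Empty slices contribute a zero vector (an assumption of this sketch).
    """
    slices = slice_by_period(behaviors)
    pooled = [
        mean_pool(slices[p]) if slices[p] else [0.0] * dim
        for p in PERIODS
    ]
    return mean_pool(pooled)


# Toy 2-dimensional behavior embeddings (made-up numbers).
behaviors = [
    ("lunch", [1.0, 0.0]),
    ("lunch", [3.0, 2.0]),
    ("afternoon_tea", [0.0, 5.0]),
]
summary = periodic_summary(behaviors, dim=2)
```

On the toy data, the two lunch clicks are pooled into a single lunch vector before the final average, so a burst of clicks in one period does not dominate the representation of the other periods.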
To avoid learning from such mixed behavior, we propose a Temporal Periodic Fusion module to learn users' periodic preferences in the takeaway industry.

Based on the period of time $p$, we first divide the user historical behavior $b$ into five time slices $b = \{b_{pb}, b_{pl}, b_{pt}, b_{pd}, b_{ps}\}$. We then feed each time-slice sequence into the FFN followed by mean pooling to get the characteristics of breakfast behaviors $mean_{pb}$, lunch behaviors $mean_{pl}$, afternoon tea behaviors $mean_{pt}$, dinner behaviors $mean_{pd}$, and night snack behaviors $mean_{ps}$. Taking the breakfast behavior as an example,

$$mean_{pb} = \mathrm{MeanPooling}(\mathrm{FFN}(b_{pb})) \qquad (6)$$

Further, to obtain a more general periodic representation $h_{tpf}$, we fuse the above periodic characteristics through an average operation. Fig. 2(d) illustrates an outline of this architecture.

3.3.3. Spatial Preference Activation (SPA)

A user's geographic location affects his or her personalized dietary choices. For example, a user at the office may choose rice, while the same user at home may prefer fried chicken. We call this the user's spatial preference. To capture it, we take the spatial features $g$ and combine them with the user feature $u$. We then feed the combined values into a fully connected layer and apply a sigmoid function to get the geolocation activation value $q_{spa}$,

$$q_{spa} = \mathrm{sigmoid}(FC_q(\mathrm{concat}(g, u))) \qquad (7)$$

where $FC_q \in \mathbb{R}^{N_{gu} \times 1}$ and $N_{gu}$ is the dimension of the combined value of $g$ and $u$. Further, we use $q_{spa}$ to activate all of the user's historical behaviors and obtain the user's spatial preference $h_{spa}$ through an FFN and mean pooling. The architecture can be observed in Fig. 2(b).

Finally, we fuse the outputs of the above three small modules to obtain our final spatiotemporal preference activation value $h_{stpre} = h_{tea} + w_{tpf} * h_{tpf} + w_{spa} * h_{spa}$. The period-of-time weight $w_{tpf}$ and the spatial weight $w_{spa}$ are also two trainable weight parameters used to balance the output.

3.4. Spatiotemporal-aware Target Attention

To explore the spatiotemporal relationships between historical user behavior and the target item more effectively, we propose a Spatiotemporal-aware Target Attention (StTA) mechanism. Drawing on the ideas of CAN [23] and AdaptPGM [24], we generate different parameters for target attention from spatiotemporal information, thereby improving the personalized spatiotemporal awareness of the model. Taking $W_Q, b_Q$ as an example, we have

$$Q_{Param} = W_q \cdot st + b_q \rightarrow W_Q, b_Q \qquad (8)$$

where $W_q \in \mathbb{R}^{D \times (d_i * d_o + d_o)}$ and $b_q \in \mathbb{R}^{d_i * d_o + d_o}$ are the parameters of a fully connected layer. $D$ is the dimension of $st$, $d_i$ is the dimension of the input embedding (such as the target item embedding $m$ or the user behavior embedding $b$), and $d_o$ is the dimension of the final output embedding. We then split $Q_{Param}$ into two parts $(W_Q, b_Q)$, which serve as the parameters of the subsequent fully connected layer in target attention. Specifically, we take the first $d_i * d_o$ parameters as $W_Q$ and the last $d_o$ parameters as $b_Q$. In the same way, we obtain $K_{Param}$ $(W_K, b_K)$ and $V_{Param}$ $(W_V, b_V)$ from the spatiotemporal feature $st$. After that, we use the primitive target attention mechanism to obtain the final module output $h_{ta}$,

$$\begin{aligned} Q &= W_Q \cdot m + b_Q, \\ K &= W_K \cdot b + b_K, \\ V &= W_V \cdot b + b_V, \\ h_{ta} &= \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_K}}\right) V \end{aligned} \qquad (9)$$

where $d_K$ is the dimension of $K$. Fig. 1(a) illustrates the structure.

3.5. Dense Tower for StEN

Once we have all the feature vector representations, we fuse the above module outputs into the input of the dense tower, $dense_0 = \mathrm{concat}(h_{stpro}, h_{stpre}, h_{ta})$. A three-layer perceptron structure is then applied,

$$dense_{i+1} = \mathrm{LeakyReLU}(\mathrm{BN}(FC_{f_i}(dense_i))) \qquad (10)$$

where $i = 0, 1, 2$. We then get the click prediction via a sigmoid activation, $\mathcal{P}(y = 1 | x) = \mathrm{sigmoid}(FC_{sigmoid}(dense_3))$, where $FC_{sigmoid} \in \mathbb{R}^{N \times 1}$. Finally, we optimize the parameters of the whole model by Equation 2 defined above. The details are illustrated in Fig. 1(a).

4. EXPERIMENTS

4.1. Datasets

Due to the lack of public spatiotemporal datasets in the takeaway industry, we conducted experimental comparisons on three industrial datasets ($D_1$, $D_2$ and $D_3$) collected from Ele.me, a major LBS platform in China. Dataset $D_1$, which consists of over 5 billion samples, is mainly used for recommending stores to users. Datasets $D_2$ and $D_3$ are mainly used for recommending meals to users and contain more than
500 million and 100 million samples, respectively. For $D_3$, we collected one week of data from the server logs as the training set and one day of data as the test set. We have publicly released dataset $D_3$² to further advance the exploration of spatiotemporal patterns in the LBS community. The details of our datasets can be seen in Table 1.

² https://tianchi.aliyun.com/dataset/dataDetail?dataId=131047

Table 1
Statistics of the datasets used in this paper. ML indicates median length.
Datasets                 D1            D2           D3
Total Size               5541799773    575941170    177114244
# Features               388           218          38
# Users                  49249999      28706270     14427689
# Items                  2750505       12302502     7446116
# Clicks                 343277081     5626279      3140831
ML of User Behaviors     39.66         41.59        41.19

4.2. Experimental Settings

All models in this paper are implemented with Python 2.7 and TensorFlow 1.4. AdagradDecay [25] is chosen as the optimizer to train the models. To avoid overfitting in the early stage of training and to keep training stable, we adopt a warm-up [26] strategy for all methods: the learning rate starts at 0.001 and is gradually increased to 0.015 within 1M steps. We set the batch size $N$ to 1024. We repeated all experiments five times and averaged the metrics to obtain more reliable results. We adopt Area Under the Curve (AUC) and RelaImpr [27] as our evaluation metrics.

To show the effectiveness of our method, we select three well-known and industry-proven CTR prediction models as our baselines.

DIN: Deep Interest Network (DIN) designs a local activation module to capture the information in the user behavior sequence that affects the user's behavior when facing the target item. However, DIN does not model the interrelationships among items in a sequence of actions.

DHAN: Deep Hierarchical Attention Networks (DHAN) designs a set of attention networks with multi-dimensional and multi-level structures, which can capture users' interest expression in various dimensions. In addition, the attention network can extract features that are similar to the knowledge expression of a tree structure.

DIEN: Deep Interest Evolution Network (DIEN) accounts for the interest evolution factors in user behavior. It designs an AUGRU-based module to model the evolution process and trend of user interests.

Table 2
Overall performance (AUC) on $D_1$, $D_2$ and $D_3$. StPro: Spatiotemporal Profile Activation. StPre: Spatiotemporal Preference Activation. DIN+StPro+StPre, DHAN+StPro+StPre and DIEN+StPro+StPre are three variant models used to investigate the generalization of our modules.
Model                D1        D2        D3
DIN                  0.7209    0.7294    0.6403
DHAN                 0.7265    0.7312    0.6419
DIEN                 0.7346    0.7452    0.6531
DIN+StPro+StPre      0.7236    0.7324    0.6434
DHAN+StPro+StPre     0.7271    0.7336    0.6445
DIEN+StPro+StPre     0.7348    0.7458    0.6571
StEN                 0.7353    0.7535    0.6627

4.3. Overall Performance Comparison

Table 2 compares StEN with three well-known CTR prediction models on $D_1$, $D_2$ and $D_3$. We find that DHAN [28] performs better than DIN [1] on all three datasets due to the addition of a multi-dimensional and multi-level attention mechanism. For example, DHAN surpasses DIN by a margin of 0.56% (absolute AUC) on dataset $D_1$. Notably, an AUC improvement of 0.1% is significant for online model deployment, as it improves the actual CTR in production. Owing to the strong performance of its recurrent module in exploring user behavior sequences, DIEN [2] outperforms DHAN on all three datasets. However, it is worth noting that recurrent neural networks such as LSTM are slow to train and to run, and are prone to high response times when serving online. By comparison, StEN outperforms all of them. We achieve AUC=0.7353, AUC=0.7535 and AUC=0.6627 on $D_1$, $D_2$ and $D_3$, respectively. Our method is 0.96% higher than the current best result (DIEN) on dataset $D_3$.

At the same time, to investigate the generalization of our modules, we have conducted variation experiments by adding StPre and StPro to the above baseline models. Note that the main difference among the above three methods is the attention module, so our StTA will not
be added, to avoid interference. It can be observed from Table 2 that when we directly apply our two proposed activation modules to the three baselines mentioned above, there is a consistent improvement in performance. For example, DIN obtains a significant improvement of 0.27% on $D_1$, 0.30% on $D_2$ and 0.31% on $D_3$, while DIEN shows a weaker improvement of 0.02% on $D_1$, 0.06% on $D_2$ and 0.4% on $D_3$. All these variant models further demonstrate that our proposed modules generalize well and can be added to other existing models as plug-and-play modules.

Figure 3: Screenshots of the Eleme mobile App: (a) Eleme App homepage, (b) Eleme App recommendations page. (a) and (b) are the recommendation results of the online-serving model (red box) and StEN (green box) during afternoon tea, where the right of (a) and (b) (green box) are more suitable for afternoon tea.

4.4. Ablation Study

To investigate the effectiveness of our proposed method, we conduct ablation studies in Table 3. Our BaseModel in this paper consists of the primitive Target Attention module mentioned in Section 3.4. As Table 3 shows, each module plays a positive role after being added.

Table 3
Ablation study on $D_1$ and $D_2$. StPro: Spatiotemporal Profile Activation. TEA: Temporal Evolving Activation. TPF: Temporal Periodic Fusion. SPA: Spatial Preference Activation. StPre: Spatiotemporal Preference Activation. StTA: Spatiotemporal-aware Target Attention.
Methods       D1 AUC    D1 RelaImpr    D2 AUC    D2 RelaImpr
BaseModel     0.7332    0.00%          0.7414    0.00%
w/ StPro      0.7345    0.56%          0.7474    2.49%
w/ TEA        0.7345    0.56%          0.7500    3.56%
w/ TPF        0.7342    0.43%          0.7479    2.69%
w/ SPA        0.7348    0.69%          0.7476    2.57%
w/ StPre      0.7349    0.73%          0.7521    4.43%
w/ StTA       0.7350    0.77%          0.7499    3.52%
StEN          0.7353    0.90%          0.7535    5.01%

We first show the effect of Spatiotemporal Profile Activation (StPro) by adding it to the BaseModel. From Table 3, we can see that "w/ StPro" brings a relatively stable improvement. In particular, compared to BaseModel, the offline AUC rises from 0.7332 to 0.7345 (+0.13%) and from 0.7414 to 0.7474 (+0.6%) when tested on $D_1$ and $D_2$, respectively. The results demonstrate that Spatiotemporal Profile Activation is an effective way to model users' common spatiotemporal preference.

Next, we validate the effectiveness of Spatiotemporal Preference Activation (StPre). As reported in Table 3, "w/ StPre" improves the results of "BaseModel" by 0.17% and 1.07% on datasets $D_1$ and $D_2$, respectively. To see the effect of the three small modules (TEA, TPF and SPA) in StPre, we also report their individual ablation results in Table 3. We observe that SPA performs best of the three on dataset $D_1$, while TEA performs best on dataset $D_2$. This illustrates that users' spatiotemporal preferences emphasize different aspects in different scenarios, so the appropriate focus needs to be determined case by case.

We also evaluate the effect of the Spatiotemporal-aware Target Attention (StTA) mechanism. In Table 3, we observe a significant improvement after adding Spatiotemporal-aware Target Attention to the system. For example, "w/ StTA" achieves an offline AUC of 0.7350 on dataset $D_1$, which is 0.18% higher than "BaseModel". The improvement demonstrates that our proposed Target Attention mechanism can meet
the user's spatiotemporal demands better than the primitive target attention module. Injecting our StTA into the model can improve the effectiveness of the system in LBS. Furthermore, our "StEN(StPre+StPro+StTA)" consistently improves on the results of "w/ StPre", "w/ StPro" and "w/ StTA". This is because integrating the three modules proposed in this paper yields a more appropriate spatiotemporal enhancement.

4.5. Online A/B Testing

We have deployed our method on the Ele.me platform and conducted a one-month online A/B test in November 2021 using bucket testing. One bucket is the BaseModel defined in Section 4.4 and the other bucket is our model StEN. Compared with the online-serving BaseModel, our method has increased the CTR of the first hop by 1.6%, the CTR of the second hop by 2.4%, the order volume by 2.1%, and the order UV by 2.4%. These online gains are crucial for the recommendation systems of Ele.me. On the one hand, an efficient model can improve user click efficiency. On the other hand, the emphasis on spatiotemporal characteristics can also improve user experience and increase the user stickiness of the platform.

For better understanding, we also compare the recommendation results of the online-serving model with those of our StEN on the Ele.me platform, as shown in Figure 3. The items (red box) on the left of Figure 3(a) and Figure 3(b) are not suitable for afternoon tea, but are more appropriate for breakfast and staple food, respectively. In contrast, our StEN (green box) recommends sweets and milk tea, which are suitable for afternoon tea. Therefore, StEN does a better job of capturing users' strong spatial and temporal demands and can improve the user experience.

5. Conclusions

In this paper, we propose a novel spatiotemporal-enhanced network, StEN. In particular, StEN applies an StPro module to capture common spatiotemporal preference by activating attribute features. An StPre module is further applied to model in detail the personalized spatiotemporal preference embodied by the behaviors. Moreover, an StTA mechanism is adopted to generate different parameters for target attention at different locations and times, thereby improving the personalized spatiotemporal awareness of the model. Comprehensive experiments are conducted on three large-scale industrial datasets, and the results demonstrate the state-of-the-art performance of our method.

References

[1] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, K. Gai, Deep interest network for click-through rate prediction, in: Y. Guo, F. Farooq (Eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, ACM, 2018, pp. 1059–1068.
[2] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, K. Gai, Deep interest evolution network for click-through rate prediction, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, AAAI Press, 2019, pp. 5941–5948.
[3] Y. Feng, F. Lv, W. Shen, M. Wang, F. Sun, Y. Zhu, K. Yang, Deep session interest network for click-through rate prediction, in: S. Kraus (Ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 2301–2307.
[4] Q. Cui, C. Zhang, Y. Zhang, J. Wang, M. Cai, ST-PIL: spatial-temporal periodic interest learning for next point-of-interest recommendation, in: G. Demartini, G. Zuccon, J. S. Culpepper, Z. Huang, H. Tong (Eds.), CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 - 5, 2021, ACM, 2021, pp. 2960–2964.
[5] W. Zhang, J. Wang, Location and time aware social collaborative retrieval for new successive point-of-interest recommendation, in: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 1221–1230.
[6] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah, Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRSRecSys 2016, Boston, MA, USA, September 15, 2016, ACM, 2016, pp. 7–10.
[7] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: A factorization-machine based neural network for CTR prediction, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, ijcai.org, 2017, pp. 1725–1731.
[8] R. Wang, B. Fu, G. Fu, M. Wang, Deep & cross network for ad click predictions, in: Proceedings of the ADKDD'17, Halifax, NS, Canada, August 13 - 17, 2017, ACM, 2017, pp. 12:1–12:7.
[9] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, G. Sun, xdeepfm: Combining explicit and implicit feature interactions for recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, ACM, 2018, pp. 1754–1763.
[10] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, J. Tang, Autoint: Automatic feature interaction learning via self-attentive neural networks, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, ACM, 2019, pp. 1161–1170.
[11] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations, in: S. Sen, W. Geyer, J. Freyne, P. Castells (Eds.), Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016, ACM, 2016, pp. 191–198.
[12] Q. Chen, H. Zhao, W. Li, P. Huang, W. Ou, Behavior sequence transformer for e-commerce recommendation in alibaba, in: Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, 2019, pp. 1–4.
[13] Q. Pi, G. Zhou, Y. Zhang, Z. Wang, L. Ren, Y. Fan, X. Zhu, K. Gai, Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction, in: CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, ACM, 2020, pp. 2685–2692.
[14] L. Xiang, W. Chao, T. Bin, T. Jiwei, Z. Tao, Deep time-aware item evolution network for click-through rate prediction, in: CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, ACM, 2020, pp. 785–794.
[15] J. Zhang, D. Wang, D. Yu, Tlsan: Time-aware long- and short-term attention network for next-item recommendation, volume 441, 2021, pp. 179–191.
[16] Y. Wang, L. Zhang, Q. Dai, F. Sun, Y. Bao, Regularized adversarial sampling and deep time-aware attention for click-through rate prediction, in: CIKM '19: The 28th ACM International Conference on Information and Knowledge Management, Beijing, China, November 3-7, 2019, ACM, 2019, pp. 349–358.
[17] C. Junsu, H. Dongmin, K. Seongku, Y. Hwanjo, Learning heterogeneous temporal patterns of user preference for timely recommendation, in: WWW '21: Proceedings of the Web Conference 2021, Ljubljana, Slovenia, April 19-23, 2021, ACM, 2021, pp. 1274–1283.
[18] C. Ting, S. Yizhou, Task-guided and path-augmented heterogeneous network embedding for author identification, in: WSDM '17: The 10th ACM International Conference on Web Search and Data Mining, Cambridge, United Kingdom, February 2-6, 2017, ACM, 2017, pp. 295–304.
[19] Grbovic, Mihajlo, H. Cheng, Real-time personalization using embeddings for search ranking at airbnb, in: KDD '18: The 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, London, United Kingdom, August 19-23, 2018, ACM, 2018, pp. 311–320.
[20] D. Wang, M. Jiang, M. Syed, O. Conway, V. Juneja, S. Subramanian, N. V. Chawla, Calendar graph neural networks for modeling time structures in spatiotemporal user behaviors, in: KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, ACM, 2020, pp. 2581–2589.
[21] Y. Qi, K. Hu, B. Zhang, J. Cheng, J. Lei, Trilateral spatiotemporal attention network for user behavior modeling in location-based search, in: CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, Gold Coast, Australia, November 1-5, 2021, ACM, 2021, pp. 3373–3377.
[22] Y. Hu, S. Gao, K. Janowicz, B. Yu, W. Li, S. Prasad, Extracting and understanding urban areas of interest using geotagged photos, Comput. Environ. Urban Syst. (2015) 240–254.
[23] G. Zhou, W. Bian, K. Wu, L. Ren, Q. Pi, Y. Zhang, C. Xiao, X. Sheng, N. Mou, X. Luo, C. Zhang, X. Qiao, S. Xiang, K. Gai, X. Zhu, J. Xu, CAN: revisiting feature co-action for click-through rate prediction, CoRR abs/2011.05625 (2020).
[24] Shishi, Thinking and practice of alimama's search advertising prediction model 2021, 2022. URL: https://zhuanlan.zhihu.com/p/446993392.
[25] J. C. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (2011) 2121–2159.
[26] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770–778.
[27] L. Yan, W. Li, G. Xue, D. Han, Coupled group lasso for web-scale CTR prediction in display advertising, in: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, JMLR.org, 2014, pp. 802–810.
[28] W. Xu, H. He, M. Tan, Y. Li, J. Lang, D. Guo, Deep interest with hierarchical attention network for click-through rate prediction, in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, 2020, pp. 1905–1908.
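
A note on the RelaImpr metric reported in Table 3: the text cites it without stating the formula. The values in Table 3 are consistent with the commonly used definition, RelaImpr = ((AUC_model - 0.5) / (AUC_base - 0.5) - 1) x 100%, which discounts the 0.5 AUC of a random predictor. This definition is an assumption here, not something the paper states; the short sketch below recomputes a few $D_1$ rows from the AUC column under that assumption, and the function name `rela_impr` is hypothetical.

```python
def rela_impr(auc_model, auc_base):
    """Relative improvement over a base model, in percent (assumed definition)."""
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0


# AUC values copied from the D1 columns of Table 3.
base_d1 = 0.7332
for name, auc in [("w/ StPro", 0.7345), ("w/ StTA", 0.7350), ("StEN", 0.7353)]:
    print("{}: {:.2f}%".format(name, rela_impr(auc, base_d1)))
# w/ StPro: 0.56%
# w/ StTA: 0.77%
# StEN: 0.90%
```

This also explains why the RelaImpr figures are larger than the absolute AUC differences quoted in Section 4.4: an absolute gain of 0.0013 on $D_1$ is measured against the 0.2332 margin that the BaseModel holds over random guessing.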