Next Basket Recommendation with Neural Networks

Shengxian Wan, Yanyan Lan, Pengfei Wang, Jiafeng Guo, Jun Xu, Xueqi Cheng
wanshengxian@software.ict.ac.cn, lanyanyan@ict.ac.cn, wangpengfei@software.ict.ac.cn, {guojiafeng, junxu, cxq}@ict.ac.cn

ABSTRACT
One crucial task in recommendation is to predict what a user will buy next given her shopping history. In this paper, we propose a novel neural network for this task. The model consists of an embedding layer, a hidden layer and an output layer. First, the embedding layer obtains distributed representations of the user and the previously bought items and combines them into a feature vector. The hidden layer then transforms this feature vector to another space through a non-linear operator. Finally, a softmax operator outputs the probabilities of the next items. The model thus elegantly involves both the user's general interest and the sequential dependencies between items for prediction. Experimental results on two real datasets prove the effectiveness of our model.

Copyright is held by the author(s). RecSys 2015 Poster Proceedings, September 16-20, 2015, Vienna, Austria.

1. INTRODUCTION
Next basket recommendation (NBP) is important in applications such as online electronic business and retail shopping. Since a user's shopping history is typically represented as a sequence of baskets, it is natural to formalize this task as a sequential prediction problem: given a sequence of baskets, predict what the user will buy next. In this process, both the user's general interest (what items the user likes) and the sequential dependencies between items (the influence of previously bought items on the items in the next basket) are important for the prediction.

Existing approaches for this task are mainly based on matrix/tensor factorization and Markov chains [5, 6]. In this paper, we propose to use neural networks instead. The reason is that neural networks have been successfully applied to sequential prediction problems such as language modeling [2, 3] and click-through rate prediction [7]. Besides, neural networks can learn richer representations than matrix factorization and are more flexible and powerful in modelling complicated relationships [1].

Our model consists of three layers: an embedding layer, a hidden layer and an output layer. The embedding layer first maps the user and the items in the user's shopping history to distributed representations, and then concatenates them to obtain a feature vector. The hidden layer transforms the feature vector to another space non-linearly. Finally, the output layer gives the probabilities of the next items via the softmax operator. In this way the two crucial factors, the user's general interest and the sequential dependencies between items, are both elegantly involved in the model. Empirically, our experimental results on two real datasets show that our model significantly outperforms state-of-the-art methods.

Our work is inspired by [2], where a neural network uses the local context to predict the next word in a language model. However, applied directly to NBP, [2] only models the dependencies between the next items and the most recent items, while ignoring the user's general interest. The main novelty of our model is that a representation of the user is further introduced to model the influence of the user's general interest, so that the model fits the characteristics of NBP well. To the best of our knowledge, this is the first work to introduce neural networks to NBP.

2. OUR APPROACH
Next basket recommendation can be formalized as a sequential prediction problem; here we follow the notations of [5]. Let U = {u_1, u_2, ..., u_{|U|}} be the user set and I = {i_1, i_2, ..., i_{|I|}} be the item set. For each user u, there is an observed buying behavior sequence S^u = (B^u_1, B^u_2, ..., B^u_{t-1}), where B^u_t is the set of items bought by user u at time t. The sequential prediction problem is to predict B^u_t for each user u, given S^u.
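To make the formalization concrete, the short sketch below builds (user, last k baskets, next basket) prediction instances from basket sequences. It is a minimal illustration only; the dictionary layout and the helper name make_instances are our own assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple

Basket = List[int]  # a basket: the item ids bought at one time step

def make_instances(
    histories: Dict[int, List[Basket]], k: int
) -> List[Tuple[int, List[Basket], Basket]]:
    """Slide a window of size k over each user's basket sequence S^u and
    emit (user, last k baskets, next basket) instances."""
    instances = []
    for user, baskets in histories.items():
        for t in range(k, len(baskets)):
            instances.append((user, baskets[t - k:t], baskets[t]))
    return instances

# A user with three baskets and k = 2 yields one instance:
# predict the third basket from the first two.
print(make_instances({7: [[1, 3], [2], [3, 5]]}, k=2))
```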
[Figure 1: NN-Rec model]

Fig. 1 shows our proposed model, namely NN-Rec. It is a three-layer neural network:

(1) The first layer is the embedding layer. The inputs are the user ID and the item IDs in the user's last k baskets. We first transform the inputs to distributed representations, where each user is represented as a vector u ∈ R^{d_u} and each item as a vector v ∈ R^{d_i}. We obtain the user matrix U ∈ R^{d_u × |U|} and the item matrix V ∈ R^{d_i × |I|} by putting all user and item vectors together, respectively. Both U and V are learned during training. Each basket is then represented as the mean of the vectors of all items included in it. The output of the embedding layer concatenates the user's and the baskets' representations into a feature vector h_1 ∈ R^{d_u + k·d_i}, which can be viewed as a representation of both the user's general interest and the local context.

(2) The second layer is a non-linear hidden layer, which transforms h_1 to a hidden representation h_2 with dimension l. Here we use tanh as the activation function, which is commonly used in neural networks. h_2 is obtained as follows:

    h_2 = tanh(W_1 h_1 + b_1),

where W_1 ∈ R^{l × |h_1|} and b_1 ∈ R^{l × 1} are parameters to be learned.

(3) The output layer is a softmax layer, which outputs the probabilities of the next items:

    s = W_2 h_2 + b_2,    P(i_j ∈ B_t | u, B_{t-1}, ..., B_{t-k}) = e^{s_{i_j}} / Σ_{p=1}^{|I|} e^{s_{i_p}},

where W_2 ∈ R^{|I| × l} and b_2 ∈ R^{|I| × 1} are parameters to be learned.

For training, we use the negative log-likelihood as the loss function and stochastic gradient descent with back propagation for optimization. Weight decay is used as regularization, so the optimization problem becomes

    arg min_{U, V, W_1, W_2, b_1, b_2}  − Σ_u Σ_t Σ_{i ∈ B_t} log P(i ∈ B_t | u, B_{t-1}, ..., B_{t-k}) + λ_U ||U||_2^2 + λ_V ||V||_2^2 + λ_1 ||W_1||_2^2 + λ_2 ||W_2||_2^2.
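The forward computation described above can be summarized in a few lines. The following NumPy sketch is illustrative only: the sizes, the random initialization and the function name forward are our assumptions, and training (SGD with back propagation and weight decay, as described above) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper: d_u = d_i = d, hidden size l, window k.
n_users, n_items, d, l, k = 1000, 500, 50, 50, 2

# Parameters; in the paper they are learned, here they are only initialized.
U = rng.normal(scale=0.1, size=(d, n_users))     # user matrix, d_u x |U|
V = rng.normal(scale=0.1, size=(d, n_items))     # item matrix, d_i x |I|
W1 = rng.normal(scale=0.1, size=(l, d + k * d))  # hidden layer weights
b1 = np.zeros(l)
W2 = rng.normal(scale=0.1, size=(n_items, l))    # output (softmax) layer weights
b2 = np.zeros(n_items)

def forward(user_id, last_k_baskets):
    """Probability distribution over next items for one user.

    last_k_baskets: list of k baskets, each a list of item ids.
    """
    # Embedding layer: user vector and the mean item vector of each basket, concatenated.
    basket_vecs = [V[:, basket].mean(axis=1) for basket in last_k_baskets]
    h1 = np.concatenate([U[:, user_id]] + basket_vecs)   # shape (d_u + k*d_i,)
    # Hidden layer: non-linear transform with tanh.
    h2 = np.tanh(W1 @ h1 + b1)
    # Output layer: softmax over all items.
    s = W2 @ h2 + b2
    s -= s.max()                                          # numerical stability
    p = np.exp(s)
    return p / p.sum()

# Negative log-likelihood of the items actually bought in the next basket.
probs = forward(user_id=3, last_k_baskets=[[1, 7, 42], [7, 99]])
nll = -np.log(probs[[7, 120]]).sum()
```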
NN-Rec has several advantages compared to previous state-of-the-art methods such as FPMC. Compared to FPMC, which is based on tensor factorization, NN-Rec is neural network based and is more flexible and powerful. Firstly, the model can easily capture longer dependencies by varying the window size k of the embedding layer, while FPMC only captures the influence of the most recent basket. Secondly, the embedding layer is flexible, and other features such as user profiles and item attributes can be added to it without modifying the model's framework. Finally, the hidden layer gives the model the power to capture more complicated interactions between the user and the items.

3. EXPERIMENTS
We conduct experiments on two real retail datasets, Tafeng and Beiren, to evaluate the effectiveness of our model. Tafeng is a benchmark dataset released by RecSys (http://recsyswiki.com/wiki/Grocery_shopping_datasets), while Beiren is collected by a large retail store in China and contains users' shopping histories from 2011 to 2013. We preprocess the two datasets by removing the items bought fewer than 10 times and the users who bought fewer than 10 items or fewer than 4 baskets. The detailed statistics of the resulting datasets are shown in Table 1.

Table 1: Datasets
Dataset   # of users   # of items   # of baskets
Tafeng    9,238        7,973        77,202
Beiren    13,736       5,920        522,963

For each user, we hold out the last basket as the test set and keep the other data as the training set. The last basket of each user in the training set is used for validation; the final models are trained on the whole training set. For our model, in all experiments the dimensions of h_2 (i.e., l), the user vectors (i.e., d_u) and the item vectors (i.e., d_i) are set to the same value, denoted as d. k is set to 1 and 2, since most users only have a few baskets in the two datasets. We recommend the top c items for each user, denoted as B̂^u_t, and use the F1-measure for evaluation; c is set to 5 in Fig. 2:

    F1 = 2pr / (p + r),    p = Σ_u |B^u_t ∩ B̂^u_t| / (|U| · c),    r = Σ_u |B^u_t ∩ B̂^u_t| / Σ_u |B^u_t|.
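As a worked example of this metric, the helper below computes the F1-measure@c exactly as in the formula above; the function name f1_at_c and the dictionary-based inputs are our own assumptions for illustration.

```python
from typing import Dict, List, Set

def f1_at_c(actual: Dict[int, Set[int]], recommended: Dict[int, List[int]], c: int = 5) -> float:
    """F1 over held-out baskets: precision averages hits over |U| * c recommended slots,
    recall averages hits over all held-out items, as in the formula above."""
    hits = sum(len(actual[u] & set(recommended[u][:c])) for u in actual)
    p = hits / (len(actual) * c)
    r = hits / sum(len(b) for b in actual.values())
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Toy check: one user, 2 of the 5 recommended items appear in the held-out basket -> F1 = 0.5.
print(f1_at_c({1: {3, 7, 9}}, {1: [3, 7, 11, 12, 13]}, c=5))
```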
We compare our model with several existing methods: Top popular (TOP), which recommends items according to their global popularity, NMF, BPR [4] and FPMC [5]. Notice that when k = 1, our model uses the same information as FPMC.

[Figure 2: Experimental results on the two datasets; x-axis: feature dimension d, y-axis: F1-measure@5, for NN-Rec (k = 1, 2), FPMC, BPR, NMF and TOP on Tafeng and Beiren]

From the results (Fig. 2), we can see that the performance of our model is consistently better than all baselines on both datasets, for both k = 1 and k = 2. Therefore, we conclude that neural networks are suitable for this task. Furthermore, on Beiren the performance with k = 2 is consistently better than with k = 1. This indicates that longer dependencies between items can be captured by NN-Rec and are important for this task. However, on Tafeng, a larger k does not show benefits. This is because Tafeng is much smaller than Beiren, as shown in Table 1, so more complex models overfit the training set more easily. We therefore recommend choosing k according to the scale of the data.

4. CONCLUSIONS
In this paper, we propose to use neural networks for next basket recommendation. Our model consists of three layers and elegantly incorporates both a user's general interest and the sequential dependencies between items for recommendation. Experimental results show that our model significantly outperforms existing approaches. To the best of our knowledge, this is the first work to introduce neural networks to this task. We believe neural networks are a flexible framework for modelling complicated interactions and will further impact this area. In future work we will investigate other neural networks, such as recurrent neural networks, which can capture longer-term sequential dependencies, and try to incorporate other information such as time series into the model.

5. ACKNOWLEDGMENTS
This research work was funded by the 973 Program of China under Grants No. 2014CB340401 and No. 2012CB316303, the 863 Program of China under Grant No. 2014AA015204, and the National Natural Science Foundation of China under Grants No. 61232010, No. 61425016, No. 61173064, No. 61472401, No. 61203298, and No. 61202214.

6. REFERENCES
[1] Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1-127, Jan. 2009.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. JMLR, 3, 2003.
[3] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045-1048, 2010.
[4] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pages 452-461, 2009.
[5] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In WWW, pages 811-820. ACM, 2010.
[6] X. Wu, Q. Liu, E. Chen, L. He, J. Lv, C. Cao, and G. Hu. Personalized next-song recommendation in online karaokes. In RecSys, pages 137-140. ACM, 2013.
[7] Y. Zhang, H. Dai, C. Xu, J. Feng, T. Wang, J. Bian, B. Wang, and T. Liu. Sequential click prediction for sponsored search with recurrent neural networks. In AAAI, pages 1369-1375, 2014.