Next Basket Recommendation with Neural Networks

Shengxian Wan, Yanyan Lan, Pengfei Wang, Jiafeng Guo, Jun Xu, Xueqi Cheng
wanshengxian@software.ict.ac.cn, lanyanyan@ict.ac.cn, wangpengfei@software.ict.ac.cn, {guojiafeng, junxu, cxq}@ict.ac.cn

ABSTRACT
One crucial task in recommendation is to predict what a user will buy next given her shopping history. In this paper, we propose a novel neural network for this task. The model consists of an embedding layer, a hidden layer and an output layer. First, the embedding layer obtains distributed representations of the user and the previously bought items and combines them into a feature vector. The hidden layer then transforms this feature vector to another space through a non-linear operator. Finally, a softmax operator outputs the probabilities of the next items. The model thus elegantly involves both the user's general interest and the sequential dependencies between items for prediction. Experimental results on two real datasets prove the effectiveness of our model.

Copyright is held by the author(s). RecSys 2015 Poster Proceedings, September 16-20, 2015, Vienna, Austria.

1. INTRODUCTION
Next basket recommendation (NBP) is important in applications such as online electronic business and retail shopping. Since a user's shopping history is typically represented as a sequence of baskets, it is natural to formalize this task as a sequential prediction problem: given a sequence of baskets, predict what the user will buy next. In this process, both the user's general interest (what items the user likes) and the sequential dependencies between items (the influence of previously bought items on the items in the next basket) are important for the prediction.

Existing approaches for this task are mainly based on matrix/tensor factorization and Markov chains [5, 6]. In this paper, we propose to use neural networks instead. The reason is that neural networks have been successfully applied to sequential prediction problems such as language modeling [2, 3] and click-through rate prediction [7]. Besides, neural networks can learn richer representations than matrix factorization and are more flexible and powerful in modelling complicated relationships [1].

Our model consists of three layers: an embedding layer, a hidden layer and an output layer. The embedding layer first maps the user and the items in the user's shopping history to distributed representations, and then concatenates them to obtain a feature vector. The hidden layer transforms the feature vector to another space non-linearly. Finally, the output layer gives the probabilities of the next items via the softmax operator. In this way the two crucial factors, the user's general interest and the sequential dependencies between items, are both elegantly involved in the model. Empirically, our experimental results on two real datasets show that our model significantly outperforms state-of-the-art methods.

Our work is inspired by [2], where a neural network uses the local context to predict the next word in a language model. However, applied directly to NBP, [2] only models the dependencies between the next items and the most recent items, while ignoring the user's general interest. The main novelty of our model is that a representation of the user is further introduced to model the influence of the user's general interest, so that the model fits the characteristics of NBP well. To the best of our knowledge, this is the first work to introduce neural networks to NBP.

2. OUR APPROACH
Next basket recommendation can be formalized as a sequential prediction problem; here we follow the notations of [5]. Let U = {u_1, u_2, ..., u_{|U|}} be the user set and I = {i_1, i_2, ..., i_{|I|}} be the item set. For each user u, there is an observed buying behavior sequence S^u = (B^u_1, B^u_2, ..., B^u_{t-1}), where B^u_t is the set of items bought by user u at time t. The sequential prediction problem is to predict B^u_t for each user u, given S^u.
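To make the formalization concrete, the short sketch below builds (user, last k baskets, next basket) prediction instances from basket sequences. It is a minimal illustration only; the dictionary layout and the helper name make_instances are our own assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple

Basket = List[int]  # a basket: the item ids bought at one time step

def make_instances(
    histories: Dict[int, List[Basket]], k: int
) -> List[Tuple[int, List[Basket], Basket]]:
    """Slide a window of size k over each user's basket sequence S^u and
    emit (user, last k baskets, next basket) instances."""
    instances = []
    for user, baskets in histories.items():
        for t in range(k, len(baskets)):
            instances.append((user, baskets[t - k:t], baskets[t]))
    return instances

# A user with three baskets and k = 2 yields one instance:
# predict the third basket from the first two.
print(make_instances({7: [[1, 3], [2], [3, 5]]}, k=2))
```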
[Figure 1: NN-Rec model]

Fig. 1 shows our proposed model, namely NN-Rec. It is a three-layer neural network:

(1) The first layer is the embedding layer. The inputs are the user ID and the item IDs in the user's last k baskets. We first transform the inputs to distributed representations, where each user is represented as a vector u ∈ R^{d_u} and each item as a vector v ∈ R^{d_i}. We obtain the user matrix U ∈ R^{d_u × |U|} and the item matrix V ∈ R^{d_i × |I|} by putting all user and item vectors together, respectively. Both U and V are learned during training. Each basket is then represented as the mean of the vectors of all items included in it. The output of the embedding layer concatenates the user's and the baskets' representations into a feature vector h_1 ∈ R^{d_u + k·d_i}, which can be viewed as a representation of both the user's general interest and the local context.

(2) The second layer is a non-linear hidden layer, which transforms h_1 to a hidden representation h_2 with dimension l. Here we use tanh as the activation function, which is commonly used in neural networks. h_2 is obtained as follows:

    h_2 = tanh(W_1 h_1 + b_1),

where W_1 ∈ R^{l × |h_1|} and b_1 ∈ R^{l × 1} are parameters to be learned.

(3) The output layer is a softmax layer, which outputs the probabilities of the next items:

    s = W_2 h_2 + b_2,    P(i_j ∈ B_t | u, B_{t-1}, ..., B_{t-k}) = e^{s_{i_j}} / Σ_{p=1}^{|I|} e^{s_{i_p}},

where W_2 ∈ R^{|I| × l} and b_2 ∈ R^{|I| × 1} are parameters to be learned.

For training, we use the negative log-likelihood as the loss function and stochastic gradient descent with back propagation for optimization. Weight decay is used as regularization, so the optimization problem becomes

    arg min_{U, V, W_1, W_2, b_1, b_2}  − Σ_u Σ_t Σ_{i ∈ B_t} log P(i ∈ B_t | u, B_{t-1}, ..., B_{t-k}) + λ_U ||U||_2^2 + λ_V ||V||_2^2 + λ_1 ||W_1||_2^2 + λ_2 ||W_2||_2^2.
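The forward computation described above can be summarized in a few lines. The following NumPy sketch is illustrative only: the sizes, the random initialization and the function name forward are our assumptions, and training (SGD with back propagation and weight decay, as described above) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper: d_u = d_i = d, hidden size l, window k.
n_users, n_items, d, l, k = 1000, 500, 50, 50, 2

# Parameters; in the paper they are learned, here they are only initialized.
U = rng.normal(scale=0.1, size=(d, n_users))     # user matrix, d_u x |U|
V = rng.normal(scale=0.1, size=(d, n_items))     # item matrix, d_i x |I|
W1 = rng.normal(scale=0.1, size=(l, d + k * d))  # hidden layer weights
b1 = np.zeros(l)
W2 = rng.normal(scale=0.1, size=(n_items, l))    # output (softmax) layer weights
b2 = np.zeros(n_items)

def forward(user_id, last_k_baskets):
    """Probability distribution over next items for one user.

    last_k_baskets: list of k baskets, each a list of item ids.
    """
    # Embedding layer: user vector and the mean item vector of each basket, concatenated.
    basket_vecs = [V[:, basket].mean(axis=1) for basket in last_k_baskets]
    h1 = np.concatenate([U[:, user_id]] + basket_vecs)   # shape (d_u + k*d_i,)
    # Hidden layer: non-linear transform with tanh.
    h2 = np.tanh(W1 @ h1 + b1)
    # Output layer: softmax over all items.
    s = W2 @ h2 + b2
    s -= s.max()                                          # numerical stability
    p = np.exp(s)
    return p / p.sum()

# Negative log-likelihood of the items actually bought in the next basket.
probs = forward(user_id=3, last_k_baskets=[[1, 7, 42], [7, 99]])
nll = -np.log(probs[[7, 120]]).sum()
```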
NN-Rec has several advantages compared to previous state-of-the-art methods such as FPMC. Compared to FPMC, which is based on tensor factorization, NN-Rec is neural network based and is more flexible and powerful. Firstly, the model can easily capture longer dependencies by varying the window size k of the embedding layer, while FPMC only captures the influence of the most recent basket. Secondly, the embedding layer is flexible, and other features such as user profiles and item attributes can be added to it without modifying the model's framework. Finally, the hidden layer gives the model the power to capture more complicated interactions between the user and the items.

3. EXPERIMENTS
We conduct experiments on two real retail datasets, Tafeng and Beiren, to evaluate the effectiveness of our model. Tafeng is a benchmark dataset released by RecSys (http://recsyswiki.com/wiki/Grocery_shopping_datasets), while Beiren is collected by a large retail store in China and contains users' shopping histories from 2011 to 2013. We preprocess the two datasets by removing the items bought fewer than 10 times and the users who bought fewer than 10 items or fewer than 4 baskets. The detailed statistics of the resulting datasets are shown in Table 1.

Table 1: Datasets
Dataset   # of users   # of items   # of baskets
Tafeng    9,238        7,973        77,202
Beiren    13,736       5,920        522,963

For each user, we hold out the last basket as the test set and keep the other data as the training set. The last basket of each user in the training set is used for validation; the final models are trained on the whole training set. For our model, in all experiments the dimensions of h_2 (i.e., l), the user vectors (i.e., d_u) and the item vectors (i.e., d_i) are set to the same value, denoted as d. k is set to 1 and 2, since most users only have a few baskets in the two datasets. We recommend the top c items for each user, denoted as B̂^u_t, and use the F1-measure for evaluation; c is set to 5 in Fig. 2:

    F1 = 2pr / (p + r),    p = Σ_u |B^u_t ∩ B̂^u_t| / (|U| · c),    r = Σ_u |B^u_t ∩ B̂^u_t| / Σ_u |B^u_t|.
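As a worked example of this metric, the helper below computes the F1-measure@c exactly as in the formula above; the function name f1_at_c and the dictionary-based inputs are our own assumptions for illustration.

```python
from typing import Dict, List, Set

def f1_at_c(actual: Dict[int, Set[int]], recommended: Dict[int, List[int]], c: int = 5) -> float:
    """F1 over held-out baskets: precision averages hits over |U| * c recommended slots,
    recall averages hits over all held-out items, as in the formula above."""
    hits = sum(len(actual[u] & set(recommended[u][:c])) for u in actual)
    p = hits / (len(actual) * c)
    r = hits / sum(len(b) for b in actual.values())
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Toy check: one user, 2 of the 5 recommended items appear in the held-out basket -> F1 = 0.5.
print(f1_at_c({1: {3, 7, 9}}, {1: [3, 7, 11, 12, 13]}, c=5))
```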
We compare our model with several existing methods: Top popular (TOP), which recommends items according to their global popularity, NMF, BPR [4] and FPMC [5]. Notice that when k = 1, our model uses the same information as FPMC.

[Figure 2: Experimental results on the two datasets; x-axis: feature dimension d, y-axis: F1-measure@5, for NN-Rec (k = 1, 2), FPMC, BPR, NMF and TOP on Tafeng and Beiren]

From the results (Fig. 2), we can see that the performance of our model is consistently better than all baselines on both datasets, for both k = 1 and k = 2. Therefore, we conclude that neural networks are suitable for this task. Furthermore, on Beiren the performance with k = 2 is consistently better than with k = 1. This indicates that longer dependencies between items can be captured by NN-Rec and are important for this task. However, on Tafeng, a larger k does not show benefits. This is because Tafeng is much smaller than Beiren, as shown in Table 1, so more complex models overfit the training set more easily. We therefore recommend choosing k according to the scale of the data.

4. CONCLUSIONS
In this paper, we propose to use neural networks for next basket recommendation. Our model consists of three layers and elegantly incorporates both a user's general interest and the sequential dependencies between items for recommendation. Experimental results show that our model significantly outperforms existing approaches. To the best of our knowledge, this is the first work to introduce neural networks to this task. We believe neural networks are a flexible framework for modelling complicated interactions and will further impact this area. In future work we will investigate other neural networks, such as recurrent neural networks, which can capture longer-term sequential dependencies, and try to incorporate other information such as time series into the model.

5. ACKNOWLEDGMENTS
This research work was funded by the 973 Program of China under Grants No. 2014CB340401 and No. 2012CB316303, the 863 Program of China under Grant No. 2014AA015204, and the National Natural Science Foundation of China under Grants No. 61232010, No. 61425016, No. 61173064, No. 61472401, No. 61203298, and No. 61202214.

6. REFERENCES
[1] Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1-127, Jan. 2009.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. JMLR, 3, 2003.
[3] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045-1048, 2010.
[4] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pages 452-461, 2009.
[5] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In WWW, pages 811-820. ACM, 2010.
[6] X. Wu, Q. Liu, E. Chen, L. He, J. Lv, C. Cao, and G. Hu. Personalized next-song recommendation in online karaokes. In RecSys, pages 137-140. ACM, 2013.
[7] Y. Zhang, H. Dai, C. Xu, J. Feng, T. Wang, J. Bian, B. Wang, and T. Liu. Sequential click prediction for sponsored search with recurrent neural networks. In AAAI, pages 1369-1375, 2014.