HPT4Rec: AutoML-based Hyperparameter Self-Tuning Framework for Session-based Recommender Systems

Amir Reza Mohammadi (1), Amir Hossein Karimi (2), Mahdi Bohlouli (3), Eva Zangerle (1) and Günther Specht (1)

(1) Department of Computer Science, Universität Innsbruck, Austria
(2) Mathematics and Computer Science Department, Amirkabir University of Technology, Tehran, Iran
(3) Computer Science and Information Technology Department, IASBS, Zanjan, Iran

Abstract
Recommender systems have evolved beyond the basic user-item filtering methods in research. However, these filtering methods are still commonly used in real-world scenarios, mainly because they are easier to debug and reconfigure. Indeed, existing frameworks do not adequately support algorithmic tuning. Moreover, they are primarily focused on the reproducibility of state-of-the-art accuracy rather than on ease of algorithm development and maintenance. Rapid, iterative experimentation and debugging are therefore considerably hindered. In this work, we propose an AutoML-based framework with a modular deep session-based recommender code-base and an integrated automated HyperParameter Tuning (HPT4Rec) component. The proposed framework automates the search for the best session-based model for given data and can therefore help to consistently update the model as the type and volume of data change, which is prevalent in real-world scenarios. We demonstrate that HPT4Rec provides extensible data structures, training-service compatibility, and GPU-accelerated execution while maintaining training efficiency and recommendation accuracy. We conducted our experiments on the benchmark RecSys 2015 dataset and achieved performance on par with state-of-the-art results. Our results show the importance of continuous and iterative parameter tuning, particularly for real-world scenarios.

Keywords
AutoML, Session-based Recommender Systems, Framework, Hyperparameter Tuning

34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
amir.reza@uibk.ac.at (A. R. Mohammadi); ahkarimi@aut.ac.ir (A. H. Karimi)
ORCID: 0000-0003-3934-6941 (A. R. Mohammadi); 0009-0001-3946-6954 (A. H. Karimi)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

It is often overwhelming for an e-commerce user to see so many products available for sale. Recognizing the burden of data overload, recommender systems (RSs) substantially improve the user experience in various applications. Traditional RSs often rely on user profiles to provide personalized recommendations. Collaborative filtering approaches [1, 2, 3] may use the history of purchases to determine user similarity, or use matrix factorization to establish latent factor vectors for each user. In both cases, it is essential to identify the user when making recommendations. However, this may not always be possible: the user may not be logged in, may have deleted their tracking information, or may be a new user without a profile. Consequently, recommendation methods that require the user's history suffer from cold-start issues.

Making session-based recommendations is an alternative to using historical data [4]. In this setup, recommendations are made based only on the behavior of users in their current session, which helps tackle the cold-start problem. Session-based recommendation may become a vital component of future recommendation, especially for business and real-world applications, as there are concerns and regulations about collecting user data such as the GDPR [5].

Methods based on deep learning (DL) have shown great promise in session-based recommendation, as in other communities [6]. As stated in various literature [7, 8, 9], they outperform traditional baseline methods by around 20-30 percent. However, recent investigations have shown that many of these methods are not compelling enough [10]; moreover, their results are hard to reproduce in many cases [11], and the code is not readily available. Recent publications have addressed reproducibility by implementing several DL-based recommendation algorithms as a framework [12, 13, 14]. While these frameworks are effective and have helped to alleviate the problem, two key factors should not be overlooked. 1. Iterative algorithm optimization: if these algorithms are intended for real-world use, they should include tools for being iteratively tuned to a given dataset (not only the offline benchmark datasets). The process should be iterative and persistent, since new features may emerge and user preferences may change. 2. Modularity and ease of reproducibility: besides accuracy, several other factors must be taken into consideration when implementing literature-approved methods in production, including non-complexity, fault tolerance, real-time prediction, debuggability, resource consumption, and modularity [15, 16]. The most advanced and best-performing models are often left behind in business because they are complex and challenging to debug. As a result, businesses still opt for more straightforward methods that are less accurate but easier to manipulate and debug. In several papers [8, 10, 17, 18] (discussed in the prior-work section), various techniques were used to slightly improve performance; these may not only be of limited use for large-scale day-to-day operation, but may also cause problems in production and during debugging. It would be more practical to implement a robust and modular core structure with clear interfaces and to leave room for adding more complex mechanisms based on business demands.

Motivated by the reasons mentioned above, in this paper we present HPT4Rec, an AutoML-based framework for hyperparameter self-tuning with a modular code-base aimed at session-based recommendation. Our framework simplifies the development and manipulation of deep recommendation algorithms to meet business needs. PyTorch and Microsoft NNI (https://github.com/microsoft/nni) are used to develop the code-base; both are well known in the DL and AutoML communities and receive continuous updates. Besides being open-source, this framework can be installed easily, and all prepared data and trained models are available at https://github.com/amirreza-m95/HPT4Rec.

2. Prior Work

Background. The most commonly used deep models when dealing with sequential data are Recurrent Neural Networks (RNNs). A type of RNN known as the LSTM [19] has been shown to work particularly well; it includes additional gates regulating when to take the input into account and when to reset the hidden state. These models are not affected by the vanishing gradient problem usually associated with RNN models. A somewhat simpler alternative to the LSTM that still retains all of its properties is the Gated Recurrent Unit (GRU) [20], which we employ in this work as the core learning structure of the recommender in our experiments.

Hidasi et al. [7] suggested the RNN approach for session-based recommendation (SBR) and then proposed a parallel RNN architecture [9] to model sessions using the clicks and features of the clicked items. Further research based on RNN methods was presented to improve the accuracy of this model. The performance of the recurrent model can be boosted by taking into account temporal changes in user behavior and by data augmentation techniques [8]. By uniting the recurrent method with the neighborhood-based method, Jannach et al. [10] combined sequential patterns and co-occurrence signals to get the best of both worlds. Tuan et al. [17] fused session clicks with content features (namely, item titles and categories) to generate recommendations based on 3-dimensional Convolutional Neural Networks (CNNs). Li et al. [21] developed a neural attentive recommendation machine (NARM) using an encoder-decoder architecture; NARM can distinguish sequential behavior and the primary purposes of users using an attention mechanism on an RNN. In another study, a Short-Term Attention/Memory Priority model (STAMP) [18], which employs a simple MLP network and an attentive net, was proposed for understanding users' general interests as well as their current interests. In both NARM and STAMP, an attention mechanism emphasizes the importance of the last click.

Almost all of the aforementioned RNN-based SBR models follow the same architecture as GRU4Rec [7]; they have merely incorporated new features and mechanisms on top of the core structure to improve performance. Therefore, in HPT4Rec, a minimal code-base based on GRU4Rec was built, with all the necessary tools and modules for a methodologically simplified bottom-up approach to model development. This can remove the barrier to entry for practitioners and allow them to add other features if necessary.

Related Frameworks. In the modern RSs field, reproducibility is crucial. Recently, various researchers [10, 11, 22, 23] pointed out the need for fair evaluation of recommender models. After thorough hyperparameter tuning, their argument about the supremacy of latent-factor models over deep neural models made it necessary to develop new recommendation frameworks. Beginning in 2011, MyMediaLite [24], RankSys [25], LensKit [26], LightFM [27], and Surprise [28] established a set of integrated tools for rapid prototyping and testing of recommendation models, using standard metrics and intuitive model execution. Deep learning (DL) recommendation models achieved remarkable success and attracted growing community interest, which led to the development of new tools. The first open-source frameworks for DL-based recommenders were LibRec [29], Spotlight [30], and OpenRec [31]. Although these frameworks provided plenty of models, they lacked filtering and automated hyperparameter tuning strategies. The RecQ [32], DeepRec [33], and Cornac [34] frameworks made a significant contribution towards a more comprehensive collection of model implementations. DaisyRec [35], RecBole [36], and Elliot [12] raised the bar considerably after the reproducibility hype, making available a large number of models, data filtering and splitting operations, as well as hyperparameter tuning. Nevertheless, we observed a deficiency in two increasingly critical aspects of recommendation model development in real-world scenarios: automated hyperparameter tuning and industry-level compatibility of tools and training services. In reviewing these related frameworks, we observed the lack of an open-source recommendation framework that performs automated hyperparameter tuning while supporting various hyperparameter tuning strategies on different distributed platforms. HPT4Rec represents a step toward that goal.

Earlier studies attempted to find a universal automated solution for both architecture design [37, 38] and optimization [39, 40, 41], but that seems to be ineffective, since the problems are diverse with different characteristics and a one-size-fits-all solution is not appropriate. The goal of complete automation might be inspiring for scientific research and serve as a long-term engineering objective, but it seems likely that we will need to semi-automate the majority of these tasks and gradually reduce the human factor over time. We then expect to develop powerful tools that make machine learning, first and foremost, more systematic and, second, more efficient. Accomplishing this goal is the purpose of HPT4Rec.
3. HPT4Rec

In this section, we describe HPT4Rec's architecture and tuning pipeline. First, we describe the general architecture of the recommender. Next, we present the components and architecture of the framework. Finally, we discuss the available self-tuning methods and their best application scenarios.

3.1. Sequential Modeling with RNN

Variable-length sequence data can be modeled using RNNs. RNNs are characterized by the internal hidden state present in the units that make up the network, which sets them apart from conventional feedforward neural networks. A standard RNN updates its hidden state h according to the mechanism shown in eq. (1):

    h_t = g(W x_t + U h_{t-1})    (1)

where g is the logistic sigmoid function, a smooth bounded function, and x_t is the unit input at time t. Based on its current state h_t, an RNN provides a probability distribution over the subsequent element of the sequence.

The GRU is a form of RNN that tends to cope with vanishing gradient problems better than the vanilla RNN. In essence, GRU gates learn when to update the hidden state and by how much. GRUs have been found superior to Long Short-Term Memory (LSTM) units for session-based recommendation [7].

A linear interpolation between the prior activation and the candidate activation determines the GRU activation h_t:

    h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t    (2)

where the update gate is given by:

    z_t = \sigma(W_z x_t + U_z h_{t-1})    (3)

The candidate activation \hat{h}_t is computed in a similar manner:

    \hat{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))    (4)

and, finally, the reset gate r_t is given by:

    r_t = \sigma(W_r x_t + U_r h_{t-1})    (5)

We have presented the standard formulation of the GRU in Equations (2)-(5), but it is important to note that framework users can tweak the model using other options, such as different final activations (e.g., relu, leaky-relu, and softmax).
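To make Equations (2)-(5) concrete, the following is a minimal sketch of a single GRU update step in PyTorch, the framework's implementation language. Tensor names mirror the symbols above; biases are omitted for brevity, and a production model would normally use torch.nn.GRU rather than this hand-rolled cell.

```python
import torch

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU update mirroring eqs. (2)-(5); illustrative only, biases omitted."""
    z_t = torch.sigmoid(x_t @ Wz.T + h_prev @ Uz.T)        # update gate, eq. (3)
    r_t = torch.sigmoid(x_t @ Wr.T + h_prev @ Ur.T)        # reset gate, eq. (5)
    h_cand = torch.tanh(x_t @ W.T + (r_t * h_prev) @ U.T)  # candidate activation, eq. (4)
    return (1 - z_t) * h_prev + z_t * h_cand               # new hidden state, eq. (2)

# Example dimensions: a batch of 32 items embedded in 50-d, 100 hidden units.
d_in, d_h = 50, 100
Wz, Uz, Wr, Ur, W, U = [torch.randn(d_h, d) / d**0.5
                        for d in (d_in, d_h, d_in, d_h, d_in, d_h)]
x_t, h_prev = torch.randn(32, d_in), torch.zeros(32, d_h)
h_t = gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U)  # shape: (32, 100)
```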
[Figure 1: Overview of HPT4Rec's Session-based Recommendation Architecture. The network stacks, from bottom to top: input data, an embedding layer, one or more Gated Recurrent Unit layers, feedforward layers, and output scores on items.]

3.1.1. GRU4Rec Architecture

The network core comprises the GRU layers, and further feedforward layers may be added between the GRU layer and the output. Each item's predicted preference can be calculated to predict whether it will be the next item in the session. If more than one GRU layer is employed, the hidden state of each layer is used as input for the next layer. An option is to connect the input to higher layers of the network to improve performance [7]. We adjusted the base network to better suit the task, since recommender systems are not the principal application area of RNNs. The SBR model architecture is shown in Figure 1.

In addition, we use trainable embeddings to represent all of our inputs. With Backpropagation Through Time (BPTT), we can train our neural networks using mini-batch gradient descent, with multiple options for the loss, over a dynamic number of time steps.

Session-parallel mini-batches. Click sessions are often of varying length. It may take some users a long time to find their desired item, while others find it within seconds. The recommender system should provide accurate predictions regardless of the current session length. This problem has been addressed by different methods such as session-parallel mini-batches [9] and data augmentation [8]. Since we seek the least sophisticated approach, we have adopted the former, as sketched below.
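As an illustration of the session-parallel scheme of [9] adopted above, the sketch below forms mini-batches by running the first B sessions in parallel, one click per step, and refilling a slot with the next unused session as soon as one ends. The function names and the data layout (sessions as lists of item IDs) are our own assumptions, not HPT4Rec's exact interfaces; it assumes len(sessions) >= batch_size and that every session has at least two clicks (length-one sessions are filtered out, see Section 4.1.1).

```python
from typing import Iterator, List, Tuple

def session_parallel_batches(sessions: List[List[int]], batch_size: int
                             ) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (inputs, targets, resets) in the session-parallel style of [9].

    Each of the `batch_size` slots tracks one active session. `resets` lists
    the slots whose GRU hidden state should be zeroed because a fresh session
    was just placed there.
    """
    next_session = batch_size
    active = list(range(batch_size))   # which session each slot currently tracks
    offsets = [0] * batch_size         # current click position inside each session
    resets = list(range(batch_size))   # initially, every slot starts fresh
    while True:
        inputs, targets = [], []
        for slot in range(batch_size):
            sess = sessions[active[slot]]
            inputs.append(sess[offsets[slot]])        # current click
            targets.append(sess[offsets[slot] + 1])   # next click to predict
            offsets[slot] += 1
        yield inputs, targets, resets
        resets = []
        for slot in range(batch_size):
            if offsets[slot] + 1 >= len(sessions[active[slot]]):  # session done
                if next_session >= len(sessions):
                    return                             # no unused sessions left
                active[slot], offsets[slot] = next_session, 0
                next_session += 1
                resets.append(slot)
```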
3.2. Architecture and Data Flow

Automated tuning of hyperparameters is a key feature of HPT4Rec. We provide 11 popular self-tuning algorithms. Experiments can be run on a wide range of training platforms, including local machines, multiple servers on a distributed network, and open-source platforms such as Kubernetes and OpenPAI.

To implement a new tuning algorithm or tweak an existing one, the base tuner should be inherited. Then, by following the interface of the module (returning the experiment results, passing the new parameters, and updating the search space), the tuning module will function properly; a minimal skeleton is sketched at the end of this section.

[Figure 2: HPT4Rec's Architecture Overview, centered on the Experiment Manager.]

3.2.1. HPT4Rec's Data Flow

HPT4Rec experiments are individual attempts to apply a configuration (e.g., a set of hyperparameters) to a model. The first step in constructing an experiment is to define the search space (i.e., the parameters). The tuner samples parameters/architectures according to the search space, which is defined as a JSON file. A search space is defined by variable names, sampling strategies, and their parameters. A search space definition can be expressed as follows:

{
  "dropout_rate": {"_type": "uniform", "_value": [0.1, 0.5]},
  "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
  "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
  "lr": {"_type": "loguniform", "_value": [0.0001, 0.1]},
  "momentum": {"_type": "lognormal", "_value": [0.1, 1]}
}

There are five parameters to tune in this search space. According to this definition, the dropout rate is drawn from a uniform distribution over the range 0.1 to 0.5. The tuner uses this search space to build configurations, selecting a value from within the range for each parameter. Besides defining the search space, the only requirement is to define a configuration file containing information such as the experiment log folder, the self-tuning algorithm, the number of trials, and a duration threshold. The configuration file is in YAML format.

Table 1: Self-tuning methods performance on different proxy datasets.

              TPE                          SMAC                         Anneal
  #Samples  Recall@20  MRR@20  Time    Recall@20  MRR@20  Time    Recall@20  MRR@20  Time
  125K      0.4314     0.2069    23    0.4229     0.2114    29    0.4332     0.2030    25
  250K      0.4687     0.2250    39    0.4730     0.2235    45    0.4633     0.2311    41
  500K      0.5062     0.2426    76    0.5082     0.2442    77    0.5103     0.2487    57
  1M        0.5450     0.2559   139    0.5479     0.2636   147    0.5481     0.2619   191

3.2.2. Architecture

Experiments are instantiated by executing the experiment_runner Python script through the CLI and passing the configuration file path. The experiment manager parses the configuration file to determine the path to the search space and the target training service, and then runs the model code with the appropriate parameters from the search space. Preprocessing (e.g., one-hot encoding, embedding dropout) is performed by the experiment manager. Following the execution of the model with the first set of parameters, the self-tuner examines intermediate results (i.e., after each epoch) to determine whether results are improving. Next, it passes the model on to the evaluation module. Evaluation is conducted by the evaluator, and the results are provided to the self-tuning algorithm to update its inner state. Following the update, the self-tuning algorithm determines the next configuration to try. This iterative process is repeated until a certain time or number of experiments is reached. Figure 2 illustrates this procedure, and a trial-side sketch is given at the end of Section 3.2.3. HPT4Rec outputs results in a web UI and collects all metrics, intermediate results, best parameters, and system logs in JSON format.

3.2.3. Self-tuning

The cycle of getting hyperparameters, carrying out experiments, testing their results, and then tuning the hyperparameters again is what we call self-tuning. Recommender systems are used on various online websites with different levels of user activity, which directly affects the volume of data available for training models. Additionally, training deep models requires substantial computational resources, which is another crucial aspect since it directly impacts revenue. Thereby, different tuning strategies are needed based on the available features, the volume of data, and the available computational resources. As the review in Table 1 shows, HPT4Rec offers several tuning techniques tailored to the diverse scenarios that occur in the real world.

After a series of experiments, we have gained an early intuition about the most suitable use cases of each self-tuning algorithm. The Tree-structured Parzen Estimator (TPE) [42] is suitable when computational resources are limited and only a limited number of trials can be run; a wide range of experiments revealed that TPE outperforms random search. If the variables in the search space can be selected from a prior distribution, Anneal is useful. Likewise, naive evolution is recommended when the experiment code supports weight transfer, i.e., when a trial can inherit the converged weights of its predecessor. Training can be substantially accelerated with the right tuning method, resulting in less time and money spent, higher revenue, and better recommenders that enhance the user experience.
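To make the trial/tuner handshake of Sections 3.2.1-3.2.2 concrete, the sketch below shows a minimal NNI trial script. The NNI calls (nni.get_next_parameter, nni.report_intermediate_result, nni.report_final_result) are the library's actual trial API; the toy "training" loop is a stand-in, where a real HPT4Rec trial would train the GRU4Rec model and evaluate Recall@20 on a validation split.

```python
import math
import nni

def main():
    # One trial: receive a configuration sampled by the tuner from the JSON
    # search space, "train", and report metrics so the tuner can update
    # its inner state (the flow described in Section 3.2.2).
    params = nni.get_next_parameter() or {}
    lr = params.get("lr", 0.01)
    hidden = params.get("hidden_size", 512)
    score = 0.0
    for epoch in range(10):
        # Stand-in for one training epoch followed by validation.
        score = 1 - math.exp(-epoch * lr * hidden / 512)
        nni.report_intermediate_result(score)   # per-epoch signal for the self-tuner
    nni.report_final_result(score)              # final metric -> tuner update

if __name__ == "__main__":
    main()
```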
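As noted in Section 3.2, a new tuning algorithm is added by inheriting the base tuner. A minimal skeleton following NNI's custom-tuner interface might look as follows; the random-sampling body is only our illustration (it handles just the choice and uniform types from the example search space), not one of the eleven shipped algorithms.

```python
import random
from nni.tuner import Tuner

class RandomTuner(Tuner):
    """Skeleton of a custom self-tuning algorithm on top of NNI's base Tuner."""

    def update_search_space(self, search_space):
        # Called with the JSON search space (again if it is updated at runtime).
        self.space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # Produce the next configuration to try; naive sampling for illustration.
        config = {}
        for name, spec in self.space.items():
            if spec["_type"] == "choice":
                config[name] = random.choice(spec["_value"])
            elif spec["_type"] == "uniform":
                lo, hi = spec["_value"]
                config[name] = random.uniform(lo, hi)
        return config

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # Final metric of a finished trial; a real tuner updates its model here.
        pass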
4. Experiments

4.1. Experiment Setup

4.1.1. Dataset

We conducted our experiments on the YOOCHOOSE e-commerce dataset from the RecSys 2015 challenge (http://2015.recsyschallenge.com). This dataset contains a six-month period of click-streams from an e-commerce site; click-streams are sometimes followed by purchase events. Following preprocessing, there are 7,936,469 sessions and 31,437,691 clicks on 37,403 items left for training and testing. Each click event contains a session ID, an item ID and, if the item is a buy-item, a price tag. A shopping session can contain anywhere between 1 and 200 clicks, but most sessions contain fewer than 30 clicks. We keep only the click events from the challenge's training set, and sessions of length one are filtered out. The YOOCHOOSE dataset was chosen because, based on its features, it is the most general dataset compared to other well-known datasets in this field such as Diginetica (https://competitions.codalab.org/competitions/11161), Xing (http://2016.recsyschallenge.com/), and Last.fm (http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html). The default settings of the framework can be used for all the datasets mentioned simply by omitting some of their extra features. We deliberately employ a dataset with minimalistic data features as a means to ensure that the model generalizes robustly to diverse datasets encompassing a greater abundance of data features.

4.1.2. Evaluation Metrics

Since recommender systems can recommend only a few items at a time, the item relevant to the user should appear among the first few recommended. We therefore use Recall@20 as our main evaluation metric: the proportion of test cases in which the targeted item is among the top 20 recommended items. As long as an item is among the top N, recall does not take its rank into consideration. The second metric used in the experiments is MRR@20, determined by the reciprocal rank of the desired items; the reciprocal rank is set to zero if the rank is above 20.
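A minimal sketch of these two metrics follows; it is our own illustrative helper, not HPT4Rec's evaluator module.

```python
from typing import List, Sequence, Tuple

def recall_and_mrr_at_k(ranked: List[Sequence[int]], targets: List[int],
                        k: int = 20) -> Tuple[float, float]:
    """Compute Recall@k and MRR@k as defined in Section 4.1.2.

    `ranked[i]` is the model's ranked item list for test case i, and
    `targets[i]` is the item actually clicked next. Ranks beyond k count as 0.
    """
    hits, rr_sum = 0, 0.0
    for items, target in zip(ranked, targets):
        top_k = list(items[:k])
        if target in top_k:
            hits += 1                                  # recall ignores the rank
            rr_sum += 1.0 / (top_k.index(target) + 1)  # reciprocal rank in top-k
    n = len(targets)
    return hits / n, rr_sum / n

# Two test cases with k=3: target 2 is ranked 2nd, target 8 is missed.
# recall_and_mrr_at_k([[5, 2, 9], [1, 7, 4]], [2, 8], k=3) -> (0.5, 0.25)
```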
4.1.3. Implementation Details

For demonstration purposes, and to keep the search space quantifiable, we optimized the hidden size, batch size, learning rate, and number of GRU layers, and fixed the other hyperparameters as follows. For our model, 50-dimensional embeddings were used for the items, with a 20% embedding dropout. The optimization was conducted using Adam [43]. The GRU search space was set at 50 to 1000 hidden units for each model. At the end of a session, the GRU's hidden state is reset to zero. Models are developed in PyTorch and trained on an NVIDIA Tesla V100. The source code of the model, checkpoints, and logs are available online.
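As a rough sketch of the fixed settings above, a plausible wiring of the model might look as follows. This is our own illustration under the stated hyperparameters (50-d embeddings, 20% embedding dropout, GRU core, feedforward output), not the repository's exact code, and the layer names are hypothetical.

```python
import torch
import torch.nn as nn

class SBRModel(nn.Module):
    """Sketch of the architecture from Sections 3.1.1 and 4.1.3."""

    def __init__(self, n_items: int, hidden_size: int = 100, n_layers: int = 1):
        super().__init__()
        self.embed = nn.Embedding(n_items, 50)         # 50-d item embeddings
        self.embed_dropout = nn.Dropout(0.2)           # 20% embedding dropout
        self.gru = nn.GRU(50, hidden_size, num_layers=n_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, n_items)     # scores on items

    def forward(self, item_ids, hidden=None):
        x = self.embed_dropout(self.embed(item_ids))   # (batch, seq, 50)
        h, hidden = self.gru(x, hidden)
        return self.out(h), hidden                     # per-step item logits

model = SBRModel(n_items=37403, hidden_size=110)   # HS found by HPT4Rec (Table 2)
optimizer = torch.optim.Adam(model.parameters())   # Adam, per Section 4.1.3
```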
The comparison was made with four traditional recommendation baselines (POP, S-POP, Item-KNN and BPR-MF) and with two well-performing configurations of GRU4Rec.

- POP. In one of its simplest forms, the popularity predictor recommends the items that are most popular in the training set. Despite its simplicity, it often provides a good baseline in certain domains.
- S-POP. This baseline recommends the items that are most popular during the current session. As the session progresses, the recommendation list grows. Global popularity values are used to break ties.
- Item-KNN. This baseline measures similarity by dividing the number of times two items appear together in sessions by the square root of the product of their occurrence rates.
- BPR-MF. Matrix factorization trained with the Bayesian Personalized Ranking loss.

Table 2: Comparison of our optimized recommender against baselines (HS: hidden size).

  model / type / loss    HS     Recall@20   MRR@20
  POP                    -      0.0050      0.0012
  S-POP                  -      0.2672      0.1775
  Item-KNN               -      0.5065      0.2048
  BPR-MF                 -      0.2574      0.0618
  GRU4REC (BPR)          1000   0.6322      0.2467
  GRU4REC (TOP1)         100    0.5853      0.2305
  HPT4Rec (TOP1)         110    0.6259      0.2681

4.2. Performance and Results

4.2.1. Diverse Self-tuning Methods Effectiveness

The most likely scenario for developing a recommender system in the real world is an ongoing experiment in which different amounts of training data are collected over time; this changes as user activity increases and new users visit the website. Even on the offline RecSys 2015 dataset, training on the complete dataset yields slightly worse results than training on a recent region of the dataset, which indicates changing user behavior [8]. Thus, to make recommendations that reflect changes in user behavior over time, models must be continuously and iteratively optimized. Different approaches are possible for different quantities of data and computation when searching for the best-optimized model, as discussed in the self-tuning section (3.2.3). Our experiments were conducted using four proxy datasets that mirror the RecSys benchmark data and comprise different quantities of data. HPT4Rec's recommender model was tuned with four self-tuning methods using the proxy datasets as training data, and the evaluation metrics and tuning time were recorded to compare these methods. Table 1 shows how we found the most effective model using 30 experiments. The results do not indicate a single optimal use case per tuning method; rather, they demonstrate that each of these tuners performs well in different scenarios and that no single one outperforms the others across all proxy datasets and evaluation metrics.

4.2.2. Consistency with Published Results

A key requirement for any new tool is consistency with previously published results, since a wide range of results is possible due to varying implementation details, non-fixed seed values, and other domain-specific reasons. We therefore also used HPT4Rec's self-tuning method to optimize the base recommender model on the original RecSys dataset. Table 2 shows that HPT4Rec outperforms the baseline models by a fair margin and is almost on par with state-of-the-art models, with the advantage that it discovered parameters leading to a simpler model, which results in lower resource consumption in production mode. Such streamlined models also facilitate reproducibility, a fundamental tenet of our methodology.

5. Conclusion and Future Work

In this paper, we have released HPT4Rec, a session-based recommender system framework based on AutoML. We reviewed the recommender systems frameworks in the literature, showing their merits and shortcomings relative to HPT4Rec and emphasizing the advantages of modularity and automatic tuning. To the best of our knowledge, HPT4Rec is the first recommendation framework that provides a thorough self-tuning experimental pipeline supported by business-scale training service compatibility. We expect HPT4Rec to simplify the tuning effort of recommendation models, facilitate the development and debugging of new algorithms, and help migrate deep recommender algorithms into real-world use. Our immediate future work will emphasize automating other aspects of the recommendation pipeline, such as automated data augmentation, which has traditionally been done manually in the literature.

References

[1] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30-37.
[2] Y. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 426-434.
[3] R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann machines for collaborative filtering, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 791-798.
[4] J. B. Schafer, J. Konstan, J. Riedl, Recommender systems in e-commerce, in: Proceedings of the 1st ACM Conference on Electronic Commerce, 1999, pp. 158-166.
[5] European Commission, 2018 reform of EU data protection rules, 2018-05-25. URL: https://ec.europa.eu/commission/sites/beta-political/files/data-protection-factsheet-changes_en.pdf.
[6] A. Datar, C. Pan, M. Nazeri, X. Xiao, Toward wheeled mobility on vertically challenging terrain: Platforms, datasets, and algorithms, arXiv preprint arXiv:2303.00998 (2023).
[7] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, CoRR abs/1511.06939 (2016).
[8] Y. K. Tan, X. Xu, Y. Liu, Improved recurrent neural networks for session-based recommendations, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 17-22.
[9] B. Hidasi, M. Quadrana, A. Karatzoglou, D. Tikk, Parallel recurrent neural network architectures for feature-rich session-based recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 241-248.
[10] D. Jannach, M. Ludewig, When recurrent neural networks meet the neighborhood for session-based recommendation, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 306-310.
[11] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 101-109.
[12] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini, T. D. Noia, Elliot: a comprehensive and rigorous framework for reproducible recommender systems evaluation, 2021. arXiv:2103.02590.
[13] L. Yang, E. Bagdasaryan, J. Gruenstein, C.-K. Hsieh, D. Estrin, OpenRec: A modular framework for extensible and adaptable recommendation algorithms, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, ACM, New York, NY, USA, 2018, pp. 664-672. URL: https://doi.org/10.1145/3159652.3159681.
[14] S. Zhang, Y. Tay, L. Yao, B. Wu, A. Sun, DeepRec: An open-source toolkit for deep learning based recommendation, 2019. arXiv:1905.10536.
[15] P. Kouki, I. Fountalis, N. Vasiloglou, X. Cui, E. Liberty, K. Al Jadda, From the lab to production: A case study of session-based recommendations in the home-improvement domain, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 140-149.
[16] D. Jannach, M. Jugovac, Measuring the business value of recommender systems, ACM Trans. Manage. Inf. Syst. 10 (2019). URL: https://doi.org/10.1145/3370082.
[17] T. X. Tuan, T. M. Phuong, 3D convolutional networks for session-based recommendation with content features, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 138-146.
[18] Q. Liu, Y. Zeng, R. Mokhosi, H. Zhang, STAMP: short-term attention/memory priority model for session-based recommendation, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1831-1839.
[19] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735-1780.
[20] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation (2014).
[21] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, J. Ma, Neural attentive session-based recommendation, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1419-1428.
[22] S. Rendle, L. Zhang, Y. Koren, On the difficulty of evaluating baselines: A study on recommender systems, 2019. arXiv:1905.01395.
[23] D. Jannach, G. de Souza P. Moreira, E. Oldridge, Why are deep learning models not consistently winning recommender systems competitions yet? A position paper, in: Proceedings of the Recommender Systems Challenge 2020, RecSysChallenge '20, ACM, New York, NY, USA, 2020, pp. 44-49. URL: https://doi.org/10.1145/3415959.3416001.
[24] Z. Gantner, S. Rendle, C. Freudenthaler, L. Schmidt-Thieme, MyMediaLite: A free recommender system library, in: 5th ACM International Conference on Recommender Systems (RecSys 2011), 2011.
[25] S. Vargas, Novelty and diversity enhancement and evaluation in recommender systems and information retrieval, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, pp. 1281-1281.
[26] M. D. Ekstrand, LensKit for Python: Next-generation software for recommender systems experiments, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2999-3006.
[27] M. Kula, Metadata embeddings for user and item cold-start recommendations, arXiv preprint arXiv:1507.08439 (2015).
[28] N. Hug, Surprise: A Python library for recommender systems, Journal of Open Source Software 5 (2020) 2174.
[29] G. Guo, J. Zhang, Z. Sun, N. Yorke-Smith, LibRec: A Java library for recommender systems, in: UMAP Workshops, volume 4, Citeseer, 2015.
[30] M. Kula, Spotlight, https://github.com/maciejkula/spotlight, 2017.
[31] L. Yang, E. Bagdasaryan, J. Gruenstein, C.-K. Hsieh, D. Estrin, OpenRec: A modular framework for extensible and adaptable recommendation algorithms, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 664-672.
[32] J. Yu, M. Gao, H. Yin, J. Li, C. Gao, Q. Wang, Generating reliable friends via adversarial training to improve social recommendation, in: 2019 IEEE International Conference on Data Mining (ICDM), IEEE, 2019, pp. 768-777.
[33] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G.-Y. Wei, H.-H. S. Lee, D. Brooks, C.-J. Wu, DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference, in: 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2020, pp. 982-995.
[34] A. Salah, Q.-T. Truong, H. W. Lauw, Cornac: A comparative framework for multimodal recommender systems, Journal of Machine Learning Research 21 (2020) 1-5.
[35] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously? Benchmarking recommendation for reproducible evaluation and fair comparison, in: Fourteenth ACM Conference on Recommender Systems, RecSys '20, ACM, New York, NY, USA, 2020, pp. 23-32. URL: https://doi.org/10.1145/3383313.3412489.
[36] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, K. Li, Y. Chen, Y. Lu, H. Wang, C. Tian, X. Pan, Y. Min, Z. Feng, X. Fan, X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, J.-R. Wen, RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms, 2020. arXiv:2011.01731.
[37] P. Zhao, K. Xiao, Y. Zhang, K. Bian, W. Yan, AMER: Automatic behavior modeling and interaction exploration in recommender system, arXiv preprint arXiv:2006.05933 (2020).
[38] Y. Chen, Y. Yang, H. Sun, Y. Wang, Y. Xu, W. Shen, R. Zhou, Y. Tong, J. Bai, R. Zhang, AutoADR: Automatic model design for ad relevance, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2365-2372.
[39] T.-H. Wang, X. Hu, H. Jin, Q. Song, X. Han, Z. Liu, AutoRec: An automated recommender system, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 582-584.
[40] R. Anand, J. Beel, Auto-Surprise: An automated recommender-system (AutoRecSys) library with Tree of Parzens Estimator (TPE) optimization, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 585-587.
[41] H. Liu, X. Zhao, C. Wang, X. Liu, J. Tang, Automated embedding size search in deep recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2307-2316.
[42] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), volume 24, Neural Information Processing Systems Foundation, 2011.
[43] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).