HPT4Rec: AutoML-based Hyperparameter Self-Tuning Framework for Session-based Recommender Systems

Amir Reza Mohammadi (1), Amir Hossein Karimi (2), Mahdi Bohlouli (3), Eva Zangerle (1) and Günther Specht (1)

(1) Department of Computer Science, Universität Innsbruck, Austria
(2) Mathematics and Computer Science Department, Amirkabir University of Technology, Tehran, Iran
(3) Computer Science and Information Technology Department, IASBS, Zanjan, Iran

Abstract
Recommender systems have evolved beyond the basic user-item filtering methods in research. However, these filtering methods are still commonly used in real-world scenarios, mainly because they are easier to debug and reconfigure. Indeed, existing frameworks do not adequately support algorithmic tuning. Moreover, they are primarily focused on the reproducibility of state-of-the-art accuracy rather than on ease of algorithm development and maintenance. Rapid, iterative experimentation and debugging are therefore considerably hindered. In this work, we propose an AutoML-based framework with a modular deep session-based recommender code-base and an integrated automated HyperParameter Tuning (HPT4Rec) component. The proposed framework automates the search for the best session-based model for given data and can therefore help to consistently update the model as the type and volume of data change, which is prevalent in real-world scenarios. We demonstrate that HPT4Rec provides extensible data structures, training-service compatibility, and GPU-accelerated execution while maintaining training efficiency and recommendation accuracy. We conducted our experiments on the benchmark RecSys 2015 dataset and achieved performance on par with state-of-the-art results. Our results show the importance of continuous and iterative parameter tuning, particularly for real-world scenarios.

Keywords
AutoML, Session-based Recommender Systems, Framework, Hyperparameter Tuning

34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
amir.reza@uibk.ac.at (A. R. Mohammadi); ahkarimi@aut.ac.ir (A. H. Karimi)
ORCID: 0000-0003-3934-6941 (A. R. Mohammadi); 0009-0001-3946-6954 (A. H. Karimi)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

It is often overwhelming for an e-commerce user to see so many products available for sale. Recognizing the burden of data overload, recommender systems (RSs) substantially improve the user experience in various applications. Traditional RSs often rely on user profiles to provide personalized recommendations. Collaborative filtering approaches [1, 2, 3] may use the history of purchases to determine user similarity, or use matrix factorization to establish latent factor vectors for each user. In both cases, it is essential to identify the user when making recommendations. However, this may not always be possible: the user may not be logged in, may have deleted their tracking information, or may be a new user without a profile. Consequently, recommendation methods that require the user's history suffer from cold-start issues.

Making session-based recommendations is an alternative to using historical data [4]. In this setup, recommendations are made based only on the behavior of users in their current session, which helps tackle the cold-start problem. Session-based recommendation may become a vital component of future recommendation, especially for business and real-world applications, as there are concerns and regulations about collecting user data such as the GDPR [5].

Methods based on deep learning (DL) have shown great promise in session-based recommendation, as in other communities [6]. As stated in various literature [7, 8, 9], they outperform traditional baseline methods by around 20-30 percent. However, recent investigations have shown that many of these methods are not compelling enough [10]; moreover, their results are hard to reproduce in many cases [11], and the code is not readily available. Recent publications have addressed reproducibility by implementing several DL-based recommendation algorithms as a framework [12, 13, 14]. While these frameworks are effective and have helped to alleviate the problem, two key factors should not be overlooked. 1. Iterative algorithm optimization: if these algorithms are intended for real-world use, they should include tools for being iteratively tuned to a given dataset (not only the offline benchmark datasets). The process should be iterative and persistent, since new features may emerge and user preferences may change. 2. Modularity and ease of reproducibility: besides accuracy, several other factors must be taken into consideration when implementing literature-approved methods in production, including non-complexity, fault tolerance, real-time prediction, debuggability, resource consumption, and modularity [15, 16]. The most advanced and best-performing models are often left behind in business because they are complex and challenging to debug. As a result, businesses still opt for more straightforward methods that are less accurate but easier to manipulate and debug. In several papers [8, 10, 17, 18] (discussed in the prior-work section), various techniques were used to slightly improve performance; these may not only be of limited use for large-scale day-to-day operation, but may also cause problems in production and during debugging. It would be more practical to implement a robust and modular core structure with clear interfaces and to leave room for adding more complex mechanisms based on business demands.

Motivated by the reasons mentioned above, in this paper we present HPT4Rec, an AutoML-based framework for hyperparameter self-tuning with a modular code-base aimed at session-based recommendation. Our framework simplifies the development and manipulation of deep recommendation algorithms to meet business needs. PyTorch and Microsoft NNI (https://github.com/microsoft/nni) are used to develop the code-base; both are well known in the DL and AutoML communities and receive continuous updates. Besides being open-source, this framework can be installed easily, and all prepared data and trained models are available at https://github.com/amirreza-m95/HPT4Rec.

2. Prior Work

Background. The most commonly used deep models when dealing with sequential data are Recurrent Neural Networks (RNNs). A type of RNN known as the LSTM [19] has been shown to work particularly well; it includes additional gates regulating when to take the input into account and when to reset the hidden state. These models are not affected by the vanishing gradient problem usually associated with RNN models. A somewhat simpler alternative to the LSTM that still retains all of its properties is the Gated Recurrent Unit (GRU) [20], which we employ in this work as the core learning structure of the recommender in our experiments.

Hidasi et al. [7] suggested the RNN approach for session-based recommendation (SBR) and then proposed a parallel RNN architecture [9] to model sessions using the clicks and features of the clicked items. Further research based on RNN methods was presented to improve the accuracy of this model. The performance of the recurrent model can be boosted by taking into account temporal changes in user behavior and by data augmentation techniques [8]. By uniting the recurrent method with the neighborhood-based method, Jannach et al. [10] combined sequential patterns and co-occurrence signals to get the best of both worlds. Tuan et al. [17] fused session clicks with content features (namely, item titles and categories) to generate recommendations based on 3-dimensional Convolutional Neural Networks (CNNs). Li et al. [21] developed a neural attentive recommendation machine (NARM) using an encoder-decoder architecture; NARM can distinguish sequential behavior and the primary purposes of users using an attention mechanism on an RNN. In another study, a Short-Term Attention/Memory Priority model (STAMP) [18], which employs a simple MLP network and an attentive net, was proposed for understanding users' general interests as well as their current interests. In both NARM and STAMP, an attention mechanism emphasizes the importance of the last click.

Almost all of the aforementioned RNN-based SBR models follow the same architecture as GRU4Rec [7]; they have merely incorporated new features and mechanisms on top of the core structure to improve performance. Therefore, in HPT4Rec, a minimal code-base based on GRU4Rec was built, with all the necessary tools and modules for a methodologically simplified bottom-up approach to model development. This can remove the barrier to entry for practitioners and allow them to add other features if necessary.

Related Frameworks. In the modern RSs field, reproducibility is crucial. Recently, various researchers [10, 11, 22, 23] pointed out the need for fair evaluation of recommender models. After thorough hyperparameter tuning, their argument about the supremacy of latent-factor models over deep neural models made it necessary to develop new recommendation frameworks. Beginning in 2011, MyMediaLite [24], RankSys [25], LensKit [26], LightFM [27], and Surprise [28] established a set of integrated tools for rapid prototyping and testing of recommendation models, using standard metrics and intuitive model execution. Deep learning (DL) recommendation models achieved remarkable success and attracted growing community interest, which led to the development of new tools. The first open-source frameworks for DL-based recommenders were LibRec [29], Spotlight [30], and OpenRec [31]. Although these frameworks provided plenty of models, they lacked filtering and automated hyperparameter tuning strategies. The RecQ [32], DeepRec [33], and Cornac [34] frameworks made a significant contribution towards a more comprehensive collection of model implementations. DaisyRec [35], RecBole [36], and Elliot [12] raised the bar considerably after the reproducibility hype, making available a large number of models, data filtering and splitting operations, as well as hyperparameter tuning. Nevertheless, we observed a deficiency in two increasingly critical aspects of recommendation model development in real-world scenarios: automated hyperparameter tuning and industry-level compatibility of tools and training services. In reviewing these related frameworks, we observed the lack of an open-source recommendation framework that performs automated hyperparameter tuning while supporting various hyperparameter tuning strategies on different distributed platforms. HPT4Rec represents a step toward that goal.

Earlier studies attempted to find a universal automated solution for both architecture design [37, 38] and optimization [39, 40, 41], but that seems to be ineffective, since the problems are diverse with different characteristics and a one-size-fits-all solution is not appropriate. The goal of complete automation might be inspiring for scientific research and serve as a long-term engineering objective, but it seems likely that we will need to semi-automate the majority of these tasks and gradually reduce the human factor over time. We then expect to develop powerful tools that make machine learning, first and foremost, more systematic and, second, more efficient. Accomplishing this goal is the purpose of HPT4Rec.
3. HPT4Rec

In this section, we describe HPT4Rec's architecture and tuning pipeline. First, we describe the general architecture of the recommender. Next, we present the components and architecture of the framework. Finally, we discuss the available self-tuning methods and their best application scenarios.

3.1. Sequential Modeling with RNN

Variable-length sequence data can be modeled using RNNs. RNNs are characterized by the internal hidden state present in the units that make up the network, which sets them apart from conventional feedforward neural networks. A standard RNN updates its hidden state h according to the mechanism shown in eq. (1):

    h_t = g(W x_t + U h_{t-1})    (1)

where g is the logistic sigmoid function, a smooth bounded function, and x_t is the unit input at time t. Based on its current state h_t, an RNN provides a probability distribution over the subsequent element of the sequence.

The GRU is a form of RNN that tends to cope with vanishing gradient problems better than the vanilla RNN. In essence, GRU gates learn when to update the hidden state and by how much. GRUs have been found superior to Long Short-Term Memory (LSTM) units for session-based recommendation [7].

A linear interpolation between the prior activation and the candidate activation determines the GRU activation h_t:

    h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t    (2)

where the update gate is given by:

    z_t = \sigma(W_z x_t + U_z h_{t-1})    (3)

The candidate activation \hat{h}_t is computed in a similar manner:

    \hat{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))    (4)

and, finally, the reset gate r_t is given by:

    r_t = \sigma(W_r x_t + U_r h_{t-1})    (5)

We have presented the standard formulation of the GRU in Equations (2)-(5), but it is important to note that framework users can tweak the model using other options, such as different final activations (e.g., relu, leaky-relu, and softmax).
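To make Equations (2)-(5) concrete, the following is a minimal sketch of a single GRU update step in PyTorch, the framework's implementation language. Tensor names mirror the symbols above; biases are omitted for brevity, and a production model would normally use torch.nn.GRU rather than this hand-rolled cell.

```python
import torch

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU update mirroring eqs. (2)-(5); illustrative only, biases omitted."""
    z_t = torch.sigmoid(x_t @ Wz.T + h_prev @ Uz.T)        # update gate, eq. (3)
    r_t = torch.sigmoid(x_t @ Wr.T + h_prev @ Ur.T)        # reset gate, eq. (5)
    h_cand = torch.tanh(x_t @ W.T + (r_t * h_prev) @ U.T)  # candidate activation, eq. (4)
    return (1 - z_t) * h_prev + z_t * h_cand               # new hidden state, eq. (2)

# Example dimensions: a batch of 32 items embedded in 50-d, 100 hidden units.
d_in, d_h = 50, 100
Wz, Uz, Wr, Ur, W, U = [torch.randn(d_h, d) / d**0.5
                        for d in (d_in, d_h, d_in, d_h, d_in, d_h)]
x_t, h_prev = torch.randn(32, d_in), torch.zeros(32, d_h)
h_t = gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U)  # shape: (32, 100)
```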
[Figure 1: Overview of HPT4Rec's Session-based Recommendation Architecture. The network stacks, from bottom to top: input data, an embedding layer, one or more Gated Recurrent Unit layers, feedforward layers, and output scores on items.]

3.1.1. GRU4Rec Architecture

The network core comprises the GRU layers, and further feedforward layers may be added between the GRU layer and the output. Each item's predicted preference can be calculated to predict whether it will be the next item in the session. If more than one GRU layer is employed, the hidden state of each layer is used as input for the next layer. An option is to connect the input to higher layers of the network to improve performance [7]. We adjusted the base network to better suit the task, since recommender systems are not the principal application area of RNNs. The SBR model architecture is shown in Figure 1.

In addition, we use trainable embeddings to represent all of our inputs. With Backpropagation Through Time (BPTT), we can train our neural networks using mini-batch gradient descent, with multiple options for the loss, over a dynamic number of time steps.

Session-parallel mini-batches. Click sessions are often of varying length. It may take some users a long time to find their desired item, while others find it within seconds. The recommender system should provide accurate predictions regardless of the current session length. This problem has been addressed by different methods such as session-parallel mini-batches [9] and data augmentation [8]. Since we seek the least sophisticated approach, we have adopted the former, as sketched below.
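As an illustration of the session-parallel scheme of [9] adopted above, the sketch below forms mini-batches by running the first B sessions in parallel, one click per step, and refilling a slot with the next unused session as soon as one ends. The function names and the data layout (sessions as lists of item IDs) are our own assumptions, not HPT4Rec's exact interfaces; it assumes len(sessions) >= batch_size and that every session has at least two clicks (length-one sessions are filtered out, see Section 4.1.1).

```python
from typing import Iterator, List, Tuple

def session_parallel_batches(sessions: List[List[int]], batch_size: int
                             ) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (inputs, targets, resets) in the session-parallel style of [9].

    Each of the `batch_size` slots tracks one active session. `resets` lists
    the slots whose GRU hidden state should be zeroed because a fresh session
    was just placed there.
    """
    next_session = batch_size
    active = list(range(batch_size))   # which session each slot currently tracks
    offsets = [0] * batch_size         # current click position inside each session
    resets = list(range(batch_size))   # initially, every slot starts fresh
    while True:
        inputs, targets = [], []
        for slot in range(batch_size):
            sess = sessions[active[slot]]
            inputs.append(sess[offsets[slot]])        # current click
            targets.append(sess[offsets[slot] + 1])   # next click to predict
            offsets[slot] += 1
        yield inputs, targets, resets
        resets = []
        for slot in range(batch_size):
            if offsets[slot] + 1 >= len(sessions[active[slot]]):  # session done
                if next_session >= len(sessions):
                    return                             # no unused sessions left
                active[slot], offsets[slot] = next_session, 0
                next_session += 1
                resets.append(slot)
```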
3.2. Architecture and Data Flow

Automated tuning of hyperparameters is a key feature of HPT4Rec. We provide 11 popular self-tuning algorithms. Experiments can be run on a wide range of training platforms, including local machines, multiple servers on a distributed network, and open-source platforms such as Kubernetes and OpenPAI.

To implement a new tuning algorithm or tweak an existing one, the base tuner should be inherited. Then, by following the interface of the module (returning the experiment results, passing the new parameters, and updating the search space), the tuning module will function properly; a minimal skeleton is sketched at the end of this section.

[Figure 2: HPT4Rec's Architecture Overview, centered on the Experiment Manager.]

3.2.1. HPT4Rec's Data Flow

HPT4Rec experiments are individual attempts to apply a configuration (e.g., a set of hyperparameters) to a model. The first step in constructing an experiment is to define the search space (i.e., the parameters). The tuner samples parameters/architectures according to the search space, which is defined as a JSON file. A search space is defined by variable names, sampling strategies, and their parameters. A search space definition can be expressed as follows:

{
  "dropout_rate": {"_type": "uniform", "_value": [0.1, 0.5]},
  "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
  "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
  "lr": {"_type": "loguniform", "_value": [0.0001, 0.1]},
  "momentum": {"_type": "lognormal", "_value": [0.1, 1]}
}

There are five parameters to tune in this search space. According to this definition, the dropout rate is drawn from a uniform distribution over the range 0.1 to 0.5. The tuner uses this search space to build configurations, selecting a value from within the range for each parameter. Besides defining the search space, the only requirement is to define a configuration file containing information such as the experiment log folder, the self-tuning algorithm, the number of trials, and a duration threshold. The configuration file is in YAML format.

Table 1: Self-tuning methods performance on different proxy datasets.

              TPE                          SMAC                         Anneal
  #Samples  Recall@20  MRR@20  Time    Recall@20  MRR@20  Time    Recall@20  MRR@20  Time
  125K      0.4314     0.2069    23    0.4229     0.2114    29    0.4332     0.2030    25
  250K      0.4687     0.2250    39    0.4730     0.2235    45    0.4633     0.2311    41
  500K      0.5062     0.2426    76    0.5082     0.2442    77    0.5103     0.2487    57
  1M        0.5450     0.2559   139    0.5479     0.2636   147    0.5481     0.2619   191

3.2.2. Architecture

Experiments are instantiated by executing the experiment_runner Python script through the CLI and passing the configuration file path. The experiment manager parses the configuration file to determine the path to the search space and the target training service, and then runs the model code with the appropriate parameters from the search space. Preprocessing (e.g., one-hot encoding, embedding dropout) is performed by the experiment manager. Following the execution of the model with the first set of parameters, the self-tuner examines intermediate results (i.e., after each epoch) to determine whether results are improving. Next, it passes the model on to the evaluation module. Evaluation is conducted by the evaluator, and the results are provided to the self-tuning algorithm to update its inner state. Following the update, the self-tuning algorithm determines the next configuration to try. This iterative process is repeated until a certain time or number of experiments is reached. Figure 2 illustrates this procedure, and a trial-side sketch is given at the end of Section 3.2.3. HPT4Rec outputs results in a web UI and collects all metrics, intermediate results, best parameters, and system logs in JSON format.

3.2.3. Self-tuning

The cycle of getting hyperparameters, carrying out experiments, testing their results, and then tuning the hyperparameters again is what we call self-tuning. Recommender systems are used on various online websites with different levels of user activity, which directly affects the volume of data available for training models. Additionally, training deep models requires substantial computational resources, which is another crucial aspect since it directly impacts revenue. Thereby, different tuning strategies are needed based on the available features, the volume of data, and the available computational resources. As the review in Table 1 shows, HPT4Rec offers several tuning techniques tailored to the diverse scenarios that occur in the real world.

After a series of experiments, we have gained an early intuition about the most suitable use cases of each self-tuning algorithm. The Tree-structured Parzen Estimator (TPE) [42] is suitable when computational resources are limited and only a limited number of trials can be run; a wide range of experiments revealed that TPE outperforms random search. If the variables in the search space can be selected from a prior distribution, Anneal is useful. Likewise, naive evolution is recommended when the experiment code supports weight transfer, i.e., when a trial can inherit the converged weights of its predecessor. Training can be substantially accelerated with the right tuning method, resulting in less time and money spent, higher revenue, and better recommenders that enhance the user experience.
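To make the trial/tuner handshake of Sections 3.2.1-3.2.2 concrete, the sketch below shows a minimal NNI trial script. The NNI calls (nni.get_next_parameter, nni.report_intermediate_result, nni.report_final_result) are the library's actual trial API; the toy "training" loop is a stand-in, where a real HPT4Rec trial would train the GRU4Rec model and evaluate Recall@20 on a validation split.

```python
import math
import nni

def main():
    # One trial: receive a configuration sampled by the tuner from the JSON
    # search space, "train", and report metrics so the tuner can update
    # its inner state (the flow described in Section 3.2.2).
    params = nni.get_next_parameter() or {}
    lr = params.get("lr", 0.01)
    hidden = params.get("hidden_size", 512)
    score = 0.0
    for epoch in range(10):
        # Stand-in for one training epoch followed by validation.
        score = 1 - math.exp(-epoch * lr * hidden / 512)
        nni.report_intermediate_result(score)   # per-epoch signal for the self-tuner
    nni.report_final_result(score)              # final metric -> tuner update

if __name__ == "__main__":
    main()
```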
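As noted in Section 3.2, a new tuning algorithm is added by inheriting the base tuner. A minimal skeleton following NNI's custom-tuner interface might look as follows; the random-sampling body is only our illustration (it handles just the choice and uniform types from the example search space), not one of the eleven shipped algorithms.

```python
import random
from nni.tuner import Tuner

class RandomTuner(Tuner):
    """Skeleton of a custom self-tuning algorithm on top of NNI's base Tuner."""

    def update_search_space(self, search_space):
        # Called with the JSON search space (again if it is updated at runtime).
        self.space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # Produce the next configuration to try; naive sampling for illustration.
        config = {}
        for name, spec in self.space.items():
            if spec["_type"] == "choice":
                config[name] = random.choice(spec["_value"])
            elif spec["_type"] == "uniform":
                lo, hi = spec["_value"]
                config[name] = random.uniform(lo, hi)
        return config

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # Final metric of a finished trial; a real tuner updates its model here.
        pass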
4. Experiments

4.1. Experiment Setup

4.1.1. Dataset

We conducted our experiments on the YOOCHOOSE e-commerce dataset from the RecSys 2015 challenge (http://2015.recsyschallenge.com). This dataset contains a six-month period of click-streams from an e-commerce site; click-streams are sometimes followed by purchase events. Following preprocessing, there are 7,936,469 sessions and 31,437,691 clicks on 37,403 items left for training and testing. Each click event contains a session ID, an item ID and, if the item is a buy-item, a price tag. A shopping session can contain anywhere between 1 and 200 clicks, but most sessions contain fewer than 30 clicks. We keep only the click events from the challenge's training set, and sessions of length one are filtered out. The YOOCHOOSE dataset was chosen because, based on its features, it is the most general dataset compared to other well-known datasets in this field such as Diginetica (https://competitions.codalab.org/competitions/11161), Xing (http://2016.recsyschallenge.com/), and Last.fm (http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html). The default settings of the framework can be used for all the datasets mentioned simply by omitting some of their extra features. We deliberately employ a dataset with minimalistic data features as a means to ensure that the model generalizes robustly to diverse datasets encompassing a greater abundance of data features.

4.1.2. Evaluation Metrics

Since recommender systems can recommend only a few items at a time, the item relevant to the user should appear among the first few recommended. We therefore use Recall@20 as our main evaluation metric: the proportion of test cases in which the targeted item is among the top 20 recommended items. As long as an item is among the top N, recall does not take its rank into consideration. The second metric used in the experiments is MRR@20, determined by the reciprocal rank of the desired items; the reciprocal rank is set to zero if the rank is above 20.
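A minimal sketch of these two metrics follows; it is our own illustrative helper, not HPT4Rec's evaluator module.

```python
from typing import List, Sequence, Tuple

def recall_and_mrr_at_k(ranked: List[Sequence[int]], targets: List[int],
                        k: int = 20) -> Tuple[float, float]:
    """Compute Recall@k and MRR@k as defined in Section 4.1.2.

    `ranked[i]` is the model's ranked item list for test case i, and
    `targets[i]` is the item actually clicked next. Ranks beyond k count as 0.
    """
    hits, rr_sum = 0, 0.0
    for items, target in zip(ranked, targets):
        top_k = list(items[:k])
        if target in top_k:
            hits += 1                                  # recall ignores the rank
            rr_sum += 1.0 / (top_k.index(target) + 1)  # reciprocal rank in top-k
    n = len(targets)
    return hits / n, rr_sum / n

# Two test cases with k=3: target 2 is ranked 2nd, target 8 is missed.
# recall_and_mrr_at_k([[5, 2, 9], [1, 7, 4]], [2, 8], k=3) -> (0.5, 0.25)
```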
4.1.3. Implementation Details

For demonstration purposes, and to keep the search space quantifiable, we optimized the hidden size, batch size, learning rate, and number of GRU layers, and fixed the other hyperparameters as follows. For our model, 50-dimensional embeddings were used for the items, with a 20% embedding dropout. The optimization was conducted using Adam [43]. The GRU search space was set at 50 to 1000 hidden units for each model. At the end of a session, the GRU's hidden state is reset to zero. Models are developed in PyTorch and trained on an NVIDIA Tesla V100. The source code of the model, checkpoints, and logs are available online.
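As a rough sketch of the fixed settings above, a plausible wiring of the model might look as follows. This is our own illustration under the stated hyperparameters (50-d embeddings, 20% embedding dropout, GRU core, feedforward output), not the repository's exact code, and the layer names are hypothetical.

```python
import torch
import torch.nn as nn

class SBRModel(nn.Module):
    """Sketch of the architecture from Sections 3.1.1 and 4.1.3."""

    def __init__(self, n_items: int, hidden_size: int = 100, n_layers: int = 1):
        super().__init__()
        self.embed = nn.Embedding(n_items, 50)         # 50-d item embeddings
        self.embed_dropout = nn.Dropout(0.2)           # 20% embedding dropout
        self.gru = nn.GRU(50, hidden_size, num_layers=n_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, n_items)     # scores on items

    def forward(self, item_ids, hidden=None):
        x = self.embed_dropout(self.embed(item_ids))   # (batch, seq, 50)
        h, hidden = self.gru(x, hidden)
        return self.out(h), hidden                     # per-step item logits

model = SBRModel(n_items=37403, hidden_size=110)   # HS found by HPT4Rec (Table 2)
optimizer = torch.optim.Adam(model.parameters())   # Adam, per Section 4.1.3
```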
The comparison was made with four traditional recommendation baselines (POP, S-POP, Item-KNN and BPR-MF) and with two well-performing configurations of GRU4Rec.

- POP. In one of its simplest forms, the popularity predictor recommends the items that are most popular in the training set. Despite its simplicity, it often provides a good baseline in certain domains.
- S-POP. This baseline recommends the items that are most popular during the current session. As the session progresses, the recommendation list grows. Global popularity values are used to break ties.
- Item-KNN. This baseline measures similarity by dividing the number of times two items appear together in sessions by the square root of the product of their occurrence rates.
- BPR-MF. Matrix factorization trained with the Bayesian Personalized Ranking loss.

Table 2: Comparison of our optimized recommender against baselines (HS: hidden size).

  model / type / loss    HS     Recall@20   MRR@20
  POP                    -      0.0050      0.0012
  S-POP                  -      0.2672      0.1775
  Item-KNN               -      0.5065      0.2048
  BPR-MF                 -      0.2574      0.0618
  GRU4REC (BPR)          1000   0.6322      0.2467
  GRU4REC (TOP1)         100    0.5853      0.2305
  HPT4Rec (TOP1)         110    0.6259      0.2681

4.2. Performance and Results

4.2.1. Diverse Self-tuning Methods Effectiveness

The most likely scenario for developing a recommender system in the real world is an ongoing experiment in which different amounts of training data are collected over time; this changes as user activity increases and new users visit the website. Even on the offline RecSys 2015 dataset, training on the complete dataset yields slightly worse results than training on a recent region of the dataset, which indicates changing user behavior [8]. Thus, to make recommendations that reflect changes in user behavior over time, models must be continuously and iteratively optimized. Different approaches are possible for different quantities of data and computation when searching for the best-optimized model, as discussed in the self-tuning section (3.2.3). Our experiments were conducted using four proxy datasets that mirror the RecSys benchmark data and comprise different quantities of data. HPT4Rec's recommender model was tuned with four self-tuning methods using the proxy datasets as training data, and the evaluation metrics and tuning time were recorded to compare these methods. Table 1 shows how we found the most effective model using 30 experiments. The results do not indicate a single optimal use case per tuning method; rather, they demonstrate that each of these tuners performs well in different scenarios and that no single one outperforms the others across all proxy datasets and evaluation metrics.

4.2.2. Consistency with Published Results

A key requirement for any new tool is consistency with previously published results, since a wide range of results is possible due to varying implementation details, non-fixed seed values, and other domain-specific reasons. We therefore also used HPT4Rec's self-tuning method to optimize the base recommender model on the original RecSys dataset. Table 2 shows that HPT4Rec outperforms the baseline models by a fair margin and is almost on par with state-of-the-art models, with the advantage that it discovered parameters leading to a simpler model, which results in lower resource consumption in production mode. Such streamlined models also facilitate reproducibility, a fundamental tenet of our methodology.

5. Conclusion and Future Work

In this paper, we have released HPT4Rec, a session-based recommender system framework based on AutoML. We reviewed the recommender systems frameworks in the literature, showing their merits and shortcomings relative to HPT4Rec and emphasizing the advantages of modularity and automatic tuning. To the best of our knowledge, HPT4Rec is the first recommendation framework that provides a thorough self-tuning experimental pipeline supported by business-scale training service compatibility. We expect HPT4Rec to simplify the tuning effort of recommendation models, facilitate the development and debugging of new algorithms, and help migrate deep recommender algorithms into real-world use. Our immediate future work will emphasize automating other aspects of the recommendation pipeline, such as automated data augmentation, which has traditionally been done manually in the literature.

References

[1] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30-37.
[2] Y. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 426-434.
[3] R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann machines for collaborative filtering, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 791-798.
[4] J. B. Schafer, J. Konstan, J. Riedl, Recommender systems in e-commerce, in: Proceedings of the 1st ACM Conference on Electronic Commerce, 1999, pp. 158-166.
[5] European Commission, 2018 reform of EU data protection rules, 2018-05-25. URL: https://ec.europa.eu/commission/sites/beta-political/files/data-protection-factsheet-changes_en.pdf.
[6] A. Datar, C. Pan, M. Nazeri, X. Xiao, Toward wheeled mobility on vertically challenging terrain: Platforms, datasets, and algorithms, arXiv preprint arXiv:2303.00998 (2023).
[7] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, CoRR abs/1511.06939 (2016).
[8] Y. K. Tan, X. Xu, Y. Liu, Improved recurrent neural networks for session-based recommendations, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 17-22.
[9] B. Hidasi, M. Quadrana, A. Karatzoglou, D. Tikk, Parallel recurrent neural network architectures for feature-rich session-based recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 241-248.
[10] D. Jannach, M. Ludewig, When recurrent neural networks meet the neighborhood for session-based recommendation, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 306-310.
[11] M. F. Dacrema, P. Cremonesi, D. Jannach, Are we really making much progress? A worrying analysis of recent neural recommendation approaches, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 101-109.
[12] V. W. Anelli, A. Bellogín, A. Ferrara, D. Malitesta, F. A. Merra, C. Pomo, F. M. Donini, T. D. Noia, Elliot: a comprehensive and rigorous framework for reproducible recommender systems evaluation, 2021. arXiv:2103.02590.
[13] L. Yang, E. Bagdasaryan, J. Gruenstein, C.-K. Hsieh, D. Estrin, OpenRec: A modular framework for extensible and adaptable recommendation algorithms, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, ACM, New York, NY, USA, 2018, pp. 664-672. URL: https://doi.org/10.1145/3159652.3159681.
[14] S. Zhang, Y. Tay, L. Yao, B. Wu, A. Sun, DeepRec: An open-source toolkit for deep learning based recommendation, 2019. arXiv:1905.10536.
[15] P. Kouki, I. Fountalis, N. Vasiloglou, X. Cui, E. Liberty, K. Al Jadda, From the lab to production: A case study of session-based recommendations in the home-improvement domain, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 140-149.
[16] D. Jannach, M. Jugovac, Measuring the business value of recommender systems, ACM Trans. Manage. Inf. Syst. 10 (2019). URL: https://doi.org/10.1145/3370082.
[17] T. X. Tuan, T. M. Phuong, 3D convolutional networks for session-based recommendation with content features, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 138-146.
[18] Q. Liu, Y. Zeng, R. Mokhosi, H. Zhang, STAMP: short-term attention/memory priority model for session-based recommendation, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1831-1839.
[19] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735-1780.
[20] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation (2014).
[21] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, J. Ma, Neural attentive session-based recommendation, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1419-1428.
[22] S. Rendle, L. Zhang, Y. Koren, On the difficulty of evaluating baselines: A study on recommender systems, 2019. arXiv:1905.01395.
[23] D. Jannach, G. de Souza P. Moreira, E. Oldridge, Why are deep learning models not consistently winning recommender systems competitions yet? A position paper, in: Proceedings of the Recommender Systems Challenge 2020, RecSysChallenge '20, ACM, New York, NY, USA, 2020, pp. 44-49. URL: https://doi.org/10.1145/3415959.3416001.
[24] Z. Gantner, S. Rendle, C. Freudenthaler, L. Schmidt-Thieme, MyMediaLite: A free recommender system library, in: 5th ACM International Conference on Recommender Systems (RecSys 2011), 2011.
[25] S. Vargas, Novelty and diversity enhancement and evaluation in recommender systems and information retrieval, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, pp. 1281-1281.
[26] M. D. Ekstrand, LensKit for Python: Next-generation software for recommender systems experiments, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2999-3006.
[27] M. Kula, Metadata embeddings for user and item cold-start recommendations, arXiv preprint arXiv:1507.08439 (2015).
[28] N. Hug, Surprise: A Python library for recommender systems, Journal of Open Source Software 5 (2020) 2174.
[29] G. Guo, J. Zhang, Z. Sun, N. Yorke-Smith, LibRec: A Java library for recommender systems, in: UMAP Workshops, volume 4, Citeseer, 2015.
[30] M. Kula, Spotlight, https://github.com/maciejkula/spotlight, 2017.
[31] L. Yang, E. Bagdasaryan, J. Gruenstein, C.-K. Hsieh, D. Estrin, OpenRec: A modular framework for extensible and adaptable recommendation algorithms, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 664-672.
[32] J. Yu, M. Gao, H. Yin, J. Li, C. Gao, Q. Wang, Generating reliable friends via adversarial training to improve social recommendation, in: 2019 IEEE International Conference on Data Mining (ICDM), IEEE, 2019, pp. 768-777.
[33] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G.-Y. Wei, H.-H. S. Lee, D. Brooks, C.-J. Wu, DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference, in: 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2020, pp. 982-995.
[34] A. Salah, Q.-T. Truong, H. W. Lauw, Cornac: A comparative framework for multimodal recommender systems, Journal of Machine Learning Research 21 (2020) 1-5.
[35] Z. Sun, D. Yu, H. Fang, J. Yang, X. Qu, J. Zhang, C. Geng, Are we evaluating rigorously? Benchmarking recommendation for reproducible evaluation and fair comparison, in: Fourteenth ACM Conference on Recommender Systems, RecSys '20, ACM, New York, NY, USA, 2020, pp. 23-32. URL: https://doi.org/10.1145/3383313.3412489.
[36] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, K. Li, Y. Chen, Y. Lu, H. Wang, C. Tian, X. Pan, Y. Min, Z. Feng, X. Fan, X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, J.-R. Wen, RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms, 2020. arXiv:2011.01731.
[37] P. Zhao, K. Xiao, Y. Zhang, K. Bian, W. Yan, AMER: Automatic behavior modeling and interaction exploration in recommender system, arXiv preprint arXiv:2006.05933 (2020).
[38] Y. Chen, Y. Yang, H. Sun, Y. Wang, Y. Xu, W. Shen, R. Zhou, Y. Tong, J. Bai, R. Zhang, AutoADR: Automatic model design for ad relevance, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2365-2372.
[39] T.-H. Wang, X. Hu, H. Jin, Q. Song, X. Han, Z. Liu, AutoRec: An automated recommender system, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 582-584.
[40] R. Anand, J. Beel, Auto-Surprise: An automated recommender-system (AutoRecSys) library with Tree of Parzens Estimator (TPE) optimization, in: Fourteenth ACM Conference on Recommender Systems, 2020, pp. 585-587.
[41] H. Liu, X. Zhao, C. Wang, X. Liu, J. Tang, Automated embedding size search in deep recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2307-2316.
[42] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, in: 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), volume 24, Neural Information Processing Systems Foundation, 2011.
[43] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).