Utilizing Metadata to Select a Recommendation Algorithm for a User or an Item

Alexander Nechaev1[0000-0002-0464-9961], Nataly Zhukova2[0000-0001-5877-4461], and Vasily Meltsov1[0000-0001-5479-9979]

1 Vyatka State University, Kirov, Russia
2 ITMO University, St. Petersburg, Russia
dapqa@yandex.ru, nazhukova@mail.ru, meltsov69@mail.ru

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. In general, recommender systems solve the problem of information overload by helping users of services to find items in which they are interested. There are plenty of algorithms and models that can be used to build such a subset of items, and their performance may vary not only across different datasets but across separate parts of a single one. This issue leads to the "algorithm selection problem". Many state-of-the-art solutions to this problem are based on meta-learning, which associates task features with the performance of the base-level algorithms and models. This paper presents a method of recommendation algorithm selection for particular users (or items) that uses binary representations of both their explicit metadata and computable statistical meta-features. The proposed method comprises two different techniques, based on classification and clustering of such binary data, respectively. The meta-learning process is almost fully automated. The findings of the experiments show that using the method for recommendation algorithm selection is reasonable and effective. In most cases, a recommender system that uses the metamodel shows lower rating prediction errors than any system utilizing a single model or algorithm for all users (or items), while in a small number of tests their performance is the same. A detailed analysis of the evaluation results allows for affirming that the described metamodels can be used in real-world systems to improve the experience of particular users.

Keywords: Recommender System · Algorithm Selection Problem · Meta-Learning · Collaborative Filtering

1 Introduction

Numerous modern companies provide their clients with a great variety of services, products, and content. Such diversity leads to the problem of information overload, which has an adverse effect on both user experience and service growth. Recommender systems (RSs) are used to solve this problem effectively. They automatically create suggestions of items that most likely lie in a user's area of interest.

Modern recommendation algorithms are actively used in real-world systems [1]. Even a moderate improvement in the personalization of collections of items is beneficial for both a service's clients and its owners. Moreover, the existence of instruments predicting a user's interests makes it possible to use recommender systems in such non-standard areas as medicine and law. Nonetheless, the accuracy of these algorithms is relatively low. The need to function in the poorly formalizable domain of people's interests makes the development of effective algorithms highly complex and precludes the creation of universal ones. Usually, RS designers implement and evaluate different models and several versions of their compositions. This approach requires a significant amount of computational resources, and the result can still be suboptimal [2].
The crucial problem that arises is that the performance of a model may vary for different users and items within a single dataset. For example, a matrix factorization model can predict the interests of a particular service user precisely, while the k-Nearest-Neighbors algorithm is much more applicable for making recommendations for another individual. A state-of-the-art solution to this problem is the use of meta-learning.

Meta-learning lies in the creation of metamodels that associate task features with algorithms' performance. Such metamodels can be separated into global-level (of a dataset), middle-level (of particular users or items), and micro-level (of particular rows) ones [3]. Micro-level metamodels have been proposed only recently and have not been studied enough, since they need massive and dense datasets to be effective. There is plenty of research within the first two levels, including meta-recommender system methods. Within this field, the contribution of Cunha et al. [4,5] and their meta-research [2] is particularly notable. The authors' studies provide an in-depth review of the existing meta-learning approaches to the problem of algorithm selection in the domain of recommender systems.

In most cases, a metamodel utilizes only the data about the base models' performance. Real-world systems, however, are generally aware of users' and items' features, even if those are not used in the recommendation generation process. The core idea behind this research is that a metamodel can take a cue from both such metadata and standard meta-features. Moreover, when a metamodel training dataset is represented in a binary format, the same algorithms can be applied in different business cases.

This paper proposes a middle-level automated meta-learning method that uses the recommendation algorithms' evaluation results as well as the binary representation of both users' (or items') metadata and their computable meta-features.

2 Method

The formal statement of the research problem is as follows. There is a dataset consisting of $(u, i, r_{ui})$ rows that show what rating $r_{ui}$ the user $u$ has given to the item $i$. Also, there are users' and items' metadata stored as binary vectors. Several recommendation models (base models) are trained on one part of the dataset and evaluated on another part, which yields the predictions $(u, i, \hat{r}_{ui})$. The task is to create a metamodel that selects the best base model for a particular user (or item), i.e., makes a personalized selection. Below, the solution that utilizes users' metadata is described; the solution that uses items' metadata is identical.

At this stage, two different solutions to the problem are proposed. On the one hand, one can take the evaluation results as the starting point and make a dataset showing the correspondence between the metadata of a user and the best model for them. In this case, the task is to solve a classification problem. On the other hand, one can start from the nature of users, group them using particular features, and select the best model for each built group. In this case, the task is to solve a clustering problem [6].

While a metamodel is being trained, its input is built from two parts. The first one is a set of explicit users' metadata taken from a dataset, e.g., age, sex, or profession. The second one contains meta-features computed on the basis of users' rating information, i.e., some standard values used in meta-learning. Fig. 1 shows the proposed recommender system structure.

Fig. 1. The proposed recommender system structure
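To make the setting concrete, here is a minimal sketch of the data involved (Python with NumPy; the toy values and the names `ratings`, `predictions`, and `user_meta` are illustrative, not taken from the paper):

```python
import numpy as np

# Rating rows (u, i, r_ui): the rating r that the user u has given to the item i.
ratings = np.array([(0, 10, 4.0), (0, 11, 3.0), (1, 10, 5.0)],
                   dtype=[("u", int), ("i", int), ("r", float)])

# Predictions r_hat_ui of each trained base model for the same held-out rows.
predictions = {
    "SVD":           np.array([3.6, 3.2, 4.1]),
    "KNN Baseline":  np.array([4.2, 2.5, 4.8]),
    "Baseline Only": np.array([3.9, 3.4, 4.4]),
}

# Explicit per-user metadata, already one-hot/binary encoded (e.g., sex and
# profession flags); binarized computed meta-features are appended to these vectors.
user_meta = np.array([[1, 0, 0, 1, 0],
                      [0, 1, 1, 0, 0]], dtype=np.int8)
```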
Both parts are represented as binary vectors. It is assumed that users' metadata are already binary, while the computable features are converted after calculation.

Both proposed meta-learning techniques suggest that the users are grouped. The best base model must be chosen for each users' group $U$ from the whole set of models $M$. The selection is performed by the equation

$$m_{best} = \operatorname*{argmax}_{m \in M} \sum_{u \in U} f(m, u), \qquad (1)$$

where $f(m, u)$ is the utility function for the model $m$ and the user $u$ that grows together with the model's performance. The definition of such a function depends on the error metric in use. Since RMSE is highly applicable to the task of rating prediction, the utility function is defined as follows. Let $r \in \mathbb{R}^N$ be the vector of actual ratings $r_j$, and $\hat{r}_m \in \mathbb{R}^N$ be the vectors of predictions $\hat{r}_{mj}$ which the model $m$ has given for the row $j$. The vector $e^{min} \in \mathbb{R}^N$ contains values

$$e^{min}_j = \min_{m \in M} \left( (\hat{r}_{mj} - r_j)^2 \right) \qquad (2)$$

equal to the minimal squared error for each row. Thus, the utility function for the model $m$ and the user $u$ is computed by the formula

$$f(m, u) = \sum_{u,\, j \in 1..N} \left( e^{min}_j - (\hat{r}_{mj} - r_j)^2 \right), \qquad (3)$$

i.e., as the cumulative difference between the minimal squared errors and the model's squared errors, where the sum is taken over the rows $j$ that belong to the user $u$.

When one chooses an algorithm for a users' group, they should consider the fact that several base-level models can show roughly equal performance, but one of them is the best on the whole dataset, i.e., fits the task and its domain better than the others. Another model may show a higher level of utility not only because of its accordance with this users' group but due to the existence of outlying values in the training dataset, causing errors in the modeling of interests. It may lead to a situation when a wrong base-level model, one that has learned several random records rather than the actual patterns of users' behavior, is selected. Consequently, there should be a strong reason to prefer a particular algorithm for a users' group over the overall best one.

In this research, the stated problem is solved as follows. After the model $m^{U}_{best}$ that has the highest utility for a group is chosen, it is calculated how much higher its utility is compared to that of the overall best model $m^{overall}_{best}$. If this value is lower than the constant $\gamma$, the model $m^{overall}_{best}$ is used for the users' group. Since a utility function may take strictly non-negative values as well as strictly non-positive ones, the final selection of a model $m_U$ for a users' group $U$ is performed by the equations:

$$utility_U = \sum_{u \in U} f(m^{U}_{best}, u), \qquad (4)$$

$$utility_{overall} = \sum_{u \in U} f(m^{overall}_{best}, u), \qquad (5)$$

$$k_{imp} = \max\left( \frac{utility_U}{utility_{overall}}, \frac{utility_{overall}}{utility_U} \right), \qquad (6)$$

$$m_U = \begin{cases} m^{U}_{best} & \text{if } k_{imp} \ge \gamma \\ m^{overall}_{best} & \text{if } k_{imp} < \gamma \end{cases} \qquad (7)$$
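Formulas (1)-(7) translate directly into code. The following sketch (Python with NumPy, reusing the illustrative names from the earlier snippet; the zero-utility guard is an addition of this sketch, not specified by the formulas) computes the per-row minimal squared errors, the group utilities, and the γ-restricted selection:

```python
import numpy as np

def select_model_for_group(r, preds, group_rows, overall_best, gamma=1.25):
    """Select a base model for one users' group per formulas (1)-(7).

    r            -- vector of actual ratings for all evaluation rows
    preds        -- dict: model name -> vector of predictions for the same rows
    group_rows   -- boolean mask of rows that belong to the group U
    overall_best -- name of the model that is best on the whole dataset
    """
    sq_err = {m: (p - r) ** 2 for m, p in preds.items()}
    e_min = np.min(np.stack(list(sq_err.values())), axis=0)      # formula (2)

    # Summing formula (3) over all rows of the group equals the sum of f(m, u)
    # over u in U used in formula (1); for RMSE, utilities are non-positive.
    utility = {m: np.sum((e_min - se)[group_rows]) for m, se in sq_err.items()}

    m_u_best = max(utility, key=utility.get)                      # formula (1)
    u_group = utility[m_u_best]                                   # formula (4)
    u_overall = utility[overall_best]                             # formula (5)

    if u_group == 0 or u_overall == 0:   # guard added by this sketch
        return overall_best
    k_imp = max(u_group / u_overall, u_overall / u_group)         # formula (6)
    return m_u_best if k_imp >= gamma else overall_best           # formula (7)
```

Applied to every group produced by the classification or clustering technique described below, this routine yields the per-group model assignment.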
The meta-learning techniques used within the method are described below.

2.1 Computation of Meta-Features

Meta-features, which are commonly used in metamodel training, can be separated into three groups [7].

1. Statistical and/or information-theoretical meta-features describe a dataset or its parts using particular metrics from the corresponding sciences as applied to data analysis. Such metrics provide certain knowledge about users' behavior and can be used for distinguishing and grouping users. Examples of these features are the record count, the expected value of separate columns, their entropy, skewness, and kurtosis.
2. Base-level model features. This group includes hyperparameters and the features of trained base-level models. Examples of the latter include the leaf count in a decision tree or the weights of input neurons in a feed-forward network. These metrics can be used to discover implicit dependencies between a model and separate parts of a dataset.
3. The last kind of metrics commonly used in meta-feature computation is "landmarkers", i.e., fast and rough estimations of different models' performance. To calculate them, one can use simplified models or small parts of a whole dataset.

In terms of the current research, the meta-features from the first group are of particular interest. Base-level model features are not studied, since it is assumed that all the models are of different kinds, not models of a single kind that differ only in hyperparameters. The usage of "landmarkers" is also not studied, as the stated task implies the hybridization of several models that are already trained.

Several statistical features have been chosen; they describe the parts of a rating dataset that correspond to separate users. For each of these features, the algorithm of conversion to 0-or-1 columns is described. The meta-feature count has been restricted to three to avoid increasing the width of the metamodel training dataset. The meta-features are shown in Table 1; a conversion sketch follows the table.

Table 1. Meta-features used in the research

| Meta-Feature | Conversion to the Binary Format |
|---|---|
| Relative count of user's ratings: equals the count of the user's ratings divided by the average rating count per user in the whole dataset. | Three columns for the ranges [0; 0.75), [0.75; 1.25), [1.25; +∞). |
| Standard deviation of user's ratings. | Two columns for the ranges [0; 0.1 of the maximal possible rating), [0.1 of the maximal possible rating; +∞). |
| User's ratings skewness. | Two columns for positive and negative values, respectively. |
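A possible implementation of the conversion rules from Table 1 (Python with NumPy and SciPy; the function name and the handling of degenerate cases, e.g. zero or undefined skewness, are assumptions of this sketch):

```python
import numpy as np
from scipy.stats import skew

def binarize_user_meta_features(user_ratings, avg_count_per_user, max_rating):
    """Convert one user's rating statistics to the 7 binary columns of Table 1."""
    ratings = np.asarray(user_ratings, dtype=float)

    # Relative rating count -> three range columns.
    rel_count = len(ratings) / avg_count_per_user
    count_cols = [rel_count < 0.75,
                  0.75 <= rel_count < 1.25,
                  rel_count >= 1.25]

    # Standard deviation -> two range columns (threshold: 0.1 of max rating).
    std = ratings.std()
    std_cols = [std < 0.1 * max_rating, std >= 0.1 * max_rating]

    # Skewness -> two sign columns (zero/undefined skew treated as positive here).
    sk = float(np.nan_to_num(skew(ratings))) if len(ratings) > 2 else 0.0
    skew_cols = [sk >= 0, sk < 0]

    return np.array(count_cols + std_cols + skew_cols, dtype=np.int8)
```

For example, a user with few, nearly constant, slightly left-skewed ratings would map to the vector [1, 0, 0, 1, 0, 0, 1].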
2.2 User Classification

Since the input consists of binary vectors, a classifier can distinguish any two users having non-equal metadata vectors. Thus, the set $U$ is defined as the group of users having equal metadata vectors. The best model is selected for each set $U$ using formulas (1)-(7). These actions build the dataset for a classifier, which associates a metadata vector with the best base model as a class label. Obviously, if this dataset included all the possible varieties of binary metadata vectors, the classifier would not be needed. However, most likely, the dataset does not cover all of them. It is considered that the model count is no less than three, so it is better to train an ensemble of binary classifiers using the "One-vs-Rest" strategy. Since the metadata consists of binary vectors, the ensemble can be effectively founded on any decision-tree-based classifier. In this research, the random forest classifier is used.

2.3 User Clustering

The clustering-based technique brings a specific problem: a clusterer can consider some users "noise" and not attach them to any group. In this case, the meta-learning process is organized as follows. At first, the overall set of users' metadata is clustered. As a result, several clusters are obtained, each of which can be considered a target group $U$, along with a separate part of users not belonging to any cluster. The best models for clustered users are chosen using formulas (1)-(7). For the "noise" group, a metamodel uses the overall best base model, since it is incorrect to consider these users a neighborhood of any kind.

The task of clustering binary vectors can be considered as one of point clustering in a multi-dimensional Euclidean space. It is possible if at least one of the two following conditions is met [8]. The first one is that the dataset must not be very sparse. The second one is that the dataset must not be very wide. It is challenging to provide more precise restrictions, but when both conditions are violated, a clustering model trains on a wide, sparse dataset. In this case, distance metrics applied in Euclidean spaces become almost useless and cannot be utilized to distinguish points properly. Nonetheless, it is assumed in this research that the count of binary columns is less than 100, while acceptable density can at least be provided by the computable meta-features.

Thus, the OPTICS model with cosine similarity as a distance metric was chosen initially to perform clustering. It was stated heuristically that the two farthest users can differ in no more than two columns and must coincide in at least one other column. So, according to the distance metric, the maximal cluster size equals $1 - 1/(\sqrt{2} \cdot \sqrt{2}) = 0.5$.

However, the OPTICS model has a significant disadvantage. Its accuracy is high, but the training process on the datasets used in the experimental study can take an unacceptably long time. This problem can be solved by using the HDBSCAN model [9]. Like OPTICS, it is an improved version of DBSCAN that performs hierarchical clustering without a fixed restriction on the maximal cluster radius. It has been discovered empirically that the HDBSCAN model with the L2 distance metric shows almost the same level of performance on the used datasets as OPTICS in the configuration stated above, but learns much faster. So, it is the HDBSCAN model that is used further; a sketch of both metamodel variants is given below.
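Both techniques can be sketched with off-the-shelf components: scikit-learn's One-vs-Rest wrapper around a random forest for Section 2.2, and the hdbscan package for Section 2.3. The wrapper functions, the hyperparameters, and the `select_model_for_cluster` callback (e.g., the γ-selection routine above applied to a cluster's rows) are illustrative, not prescribed by the paper:

```python
import hdbscan
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

def train_classifier_metamodel(meta_vectors, best_model_labels):
    """Section 2.2: One-vs-Rest ensemble over a decision-tree-based classifier.

    meta_vectors      -- one binary metadata vector per distinct-vector group
    best_model_labels -- best base model per group, chosen by formulas (1)-(7)
    """
    clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
    clf.fit(meta_vectors, best_model_labels)
    return clf  # clf.predict([vector])[0] -> base model label for a user

def train_clustering_metamodel(meta_vectors, select_model_for_cluster,
                               overall_best):
    """Section 2.3: HDBSCAN with the L2 metric; noise falls back to overall best."""
    clusterer = hdbscan.HDBSCAN(metric="euclidean", prediction_data=True)
    labels = clusterer.fit_predict(meta_vectors)
    model_per_cluster = {}
    for c in set(labels):
        if c == -1:
            model_per_cluster[c] = overall_best   # "noise" users
        else:
            model_per_cluster[c] = select_model_for_cluster(labels == c)
    # New users can be assigned via hdbscan.approximate_predict(clusterer, X).
    return clusterer, model_per_cluster
```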
3 Results and Discussion

The initial pool of algorithms in this research included five collaborative filtering models implemented in the Surpriselib library [10]: SVD, SVD++, Baseline Only, KNN Baseline, and Co-Clustering.

Five datasets [11,12,13] were used; they are described in Table 2. The presented features do not include computable ones. Metadata rows were converted to binary vectors manually in such a way that most of the information was saved, but the width of the dataset did not increase drastically.

Table 2. Summary description of the datasets used in the experimental study

| Dataset | No. of ratings | No. of users | No. of user features | No. of items | No. of item features | Rating scale |
|---|---|---|---|---|---|---|
| Movielens 100k | 100 000 | 943 | 32 | 1 682 | 19 | 1-5 |
| Movielens 1M | 1 000 000 | 6 040 | 29 | 3 883 | 18 | 1-5 |
| Amazon Phones | 82 816 | 55 975 | 6 | 793 | 20 | 1-5 |
| Rent the Runway | 192 463 | 105 509 | 41 | 5 851 | 68 | 1-10 |
| Modcloth | 82 723 | 47 911 | 17 | 1 377 | 8 | 1-5 |

Each dataset was used in the following way. It was shuffled randomly and split into three parts A, B, and C with 60%, 20%, and 20% of the records from the initial dataset, respectively. Each model from the initial pool was trained on part A and evaluated on part B. Based on the evaluation results, the pool of the best base models was built. Next, four metamodels were trained: one classifying users, one classifying items, one clustering users, and one clustering items. The γ constant was set to 1.25. During the classifier training, only 70% of the users/items from part B were used. Finally, all the base models and the created metamodels were evaluated on part C. RMSE was used as the evaluation metric; a sketch of this pipeline is given below.
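A condensed sketch of this pipeline for one dataset, using the Surprise library's documented raw-ratings splitting pattern (the pandas DataFrame `df` and the 1-5 rating scale are assumptions; the metamodel training step itself is elided):

```python
import random
from surprise import (Dataset, Reader, SVD, SVDpp, BaselineOnly,
                      KNNBaseline, CoClustering, accuracy)

reader = Reader(rating_scale=(1, 5))          # assumption: a 1-5 scale dataset
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)

# Shuffle and split the raw ratings into parts A (60%), B (20%), and C (20%).
raw = data.raw_ratings
random.shuffle(raw)
n_a, n_b = int(0.6 * len(raw)), int(0.8 * len(raw))
part_a, part_b, part_c = raw[:n_a], raw[n_a:n_b], raw[n_b:]

data.raw_ratings = part_a                     # train base models on part A
trainset = data.build_full_trainset()
testset_b = data.construct_testset(part_b)    # source of metamodel training data
testset_c = data.construct_testset(part_c)    # final evaluation

base_models = {"SVD": SVD(), "SVD++": SVDpp(), "Baseline Only": BaselineOnly(),
               "KNN Baseline": KNNBaseline(), "Co-Clustering": CoClustering()}

predictions_b = {}
for name, algo in base_models.items():
    algo.fit(trainset)
    predictions_b[name] = algo.test(testset_b)
    accuracy.rmse(predictions_b[name])        # RMSE on part B
# predictions_b then feeds the per-group selection (gamma = 1.25) and the
# metamodel training; base models and metamodels are finally scored on testset_c.
```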
The evaluation results are presented in Table 3. Additionally, the table includes the results for a perfect abstract metamodel that chooses the truly best algorithm for each user or item.

Table 3. Evaluation results (RMSE). The best results among the base models and among the metamodels are bold-faced. MM stands for "Metamodel".

| Model | Movielens 100k | Movielens 1M | Amazon Phones | Rent the Runway | Modcloth |
|---|---|---|---|---|---|
| SVD | 0.9433 | 0.8941 | 1.5569 | 1.3941 | 0.9522 |
| SVD++ | **0.9306** | **0.8795** | 1.5992 | 1.4043 | 0.9547 |
| KNN Baseline | 0.9434 | 0.9133 | 1.5649 | - | **0.9506** |
| Baseline Only | 0.9350 | 0.9039 | **1.5525** | **1.3889** | 0.9877 |
| Co-Clustering | 0.9713 | 0.9224 | 1.6088 | 1.5531 | 1.0263 |
| Perfect (by user) | 0.8947 | 0.8558 | 1.3921 | 1.2280 | 0.7877 |
| Perfect (by item) | 0.8866 | 0.8634 | 1.5003 | 1.3374 | 0.9310 |
| MM (user class.) | **0.9296** | 0.8798 | **1.5371** | 1.3752 | **0.9144** |
| MM (item class.) | 0.9324 | 0.8796 | 1.5463 | 1.3897 | 0.9505 |
| MM (user clust.) | 0.9317 | **0.8795** | 1.5375 | **1.3746** | **0.9144** |
| MM (item clust.) | 0.9320 | **0.8795** | 1.5492 | 1.3894 | 0.9504 |

In four out of five cases, the best metamodel performed better than the best base-level model, since its RMSE on the whole dataset was lower. The greatest benefit from meta-learning was obtained on the Modcloth dataset. In the case of the Movielens 1M dataset, the best metamodel showed exactly the same result as the best base-level model.

The improvement achieved by applying the proposed method differed between datasets. It is possible to suggest that it depended both on the accuracy of the best base-level model and on the maximal theoretical improvement that could be obtained by per-user (or per-item) algorithm selection. Figure 2 illustrates the relation between these values.

Fig. 2. The relation between the relative best model error and the metamodel improvement per dataset. Metamodel improvements (actual and maximal possible) are on the primary axis, and relative best model errors are on the secondary axis.

According to the graphs, the actual improvement from metamodel usage could indeed depend on the maximal possible one. At the same time, a dependency on the relative best base-level model error was questionable due to the results for the Rent the Runway dataset. Importantly, previous experiments performed without statistical meta-features and without the constant γ suggested that this dependency could exist, and their results were generally worse. This allows for affirming that combining explicit metadata with computable meta-features and performing a restricted selection of the best model for a users' (or items') group is reasonable and increases the accuracy of metamodels.

Both the numerical results and the graphs showed that the actual improvement was far from the theoretically possible one. Along with that, training accuracy was reasonably good. The graphs in Figure 3 demonstrate how often the metamodels selected the truly best algorithm for a user or an item.

Fig. 3. The perfect algorithm selection rate per metamodel and dataset

According to these measurements, classification and clustering models had rather similar performance, with the best selection rate equaling 24-28%. The worse results shown by item-based models for three out of five datasets can be explained by the relatively small item counts in those datasets. This allows for concluding that the bottleneck lies both in dataset size and in the complexity of determining the truly best model for users' or items' groups before training.

A pending question within this research is how to choose an optimal value for the constant γ. Figure 4 shows the ratios between the lowest metamodel RMSE and the lowest base-level model RMSE for different γ values. When γ is less than or equal to 1, it is equivalent to γ being ignored (following formulas (4)-(7) from Section 2).

Fig. 4. The lowest metamodel RMSE divided by the lowest base-level model RMSE for different γ values (γ = 1.0, 1.25, 1.5)

In three out of five cases, the usage of the γ constant was effective. In the case of the Modcloth dataset, there was almost no difference between the evaluated γ values. For the Amazon Phones dataset, the restrictions in algorithm selection brought by the γ constant affected the metamodels' performance negatively. Consequently, the optimal γ value depends on the dataset used and on the performance of the base-level models. Since the evaluation of several metamodels with different γ values has a high computational cost, it is highly preferable to determine the value before training. Possibly, this can be done on the basis of the base-level models' errors. This problem requires additional research.

The impact of the metamodels' usage on the time needed for recommendation generation has also been considered. Figure 5 shows the average time the base-level models and the metamodels need to make a single rating prediction on the used datasets. The experiments were performed using an Intel Core i7-7700HQ CPU and DDR4 RAM at a 2133 MHz frequency. It is significant that the Surpriselib framework evaluates test requests individually, not in batches. This peculiarity allows the measured time values to be as realistic as possible.

Fig. 5. Average prediction time per record (microseconds)

Depending on the dataset, when a metamodel was used, the average prediction time ranged between 143 and 413 microseconds. This interval allows an RS to process thousands of requests per second. The speed of the base-level models was in some cases comparable to the metamodels' speed, and in other cases noticeably higher. However, the base-level models have largely native implementations, while the metamodels are implemented in pure Python without any optimizations. Besides, as is evident from the graphs, the speed of a hybrid model directly depends on the speeds of all the models it combines. Thus, the prediction time of a recommender system that uses the metamodel is sufficient for production use, and it can be decreased further by optimizations and a native implementation.

The results show that all the metamodels are roughly equally performant. To choose one of them for real-world tasks, one should consider the number of users and items in the available data. The classification-based metamodels are preferred in this case, since they can be trained faster than the clustering-based ones. To employ the proposed method in an existing system that includes several recommenders, one can convert users' metadata to the binary vector format, train the classifier using the existing recommenders' evaluation results, and redirect recommendation requests to the metamodel; a routing sketch is given below.
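A minimal serving-side sketch of that integration (pure Python; `metamodel` stands for the trained classifier from Section 2.2, and `recommenders` maps class labels to already-trained Surprise estimators — all names are illustrative):

```python
class MetaRouter:
    """Routes a rating-prediction request to the base model chosen for a user."""

    def __init__(self, metamodel, recommenders, overall_best):
        self.metamodel = metamodel          # classifier from Section 2.2
        self.recommenders = recommenders    # e.g., {"SVD": svd, "KNN Baseline": knn}
        self.overall_best = overall_best    # fallback label

    def predict(self, user_id, item_id, user_meta_vector):
        # Select a base model from the user's binary metadata vector.
        label = self.metamodel.predict([user_meta_vector])[0]
        algo = self.recommenders.get(label, self.recommenders[self.overall_best])
        # Surprise estimators expose per-pair prediction via .predict(uid, iid).
        return algo.predict(user_id, item_id).est
```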
Although the improvement is moderate, the usage of a metamodel that utilizes the peculiarities of a particular user can noticeably increase the quality of their experience. To examine this assumption, one should conduct additional research aimed at the evaluation of ranking measures and using the appropriate datasets.

4 Conclusions

The presented work describes a solution to the problem of recommendation algorithm selection for particular users or items within a single service. The paper proposes a middle-level automated meta-learning method that uses the rating prediction algorithms' evaluation results as well as the binary representation of both users' (or items') metadata and their computable meta-features. Two techniques of creating the metamodel are suggested; they are based on classification and clustering, respectively. Both techniques are applicable to users' features as well as to items' features.

The metamodels have been evaluated on five open datasets, with collaborative filtering models used at the base level. The experiments prove that the usage of the proposed method is reasonable and effective. The performance in comparison to the base-level models improved by up to 3.81% of the RMSE value in the best-case scenario and did not decrease in any case. The measures obtained from the evaluation have been analyzed in detail, and the possible directions of future work have been described, including ones aimed at performance enhancement. The current results can be used in real-world systems to improve the experience of their users. An additional advantage of the method is that the process of creating a metamodel is almost fully automated; the only manual task is to convert explicit metadata to the binary format.
References

1. Ricci, F., Rokach, L., Shapira, B. (eds.): Recommender Systems Handbook. Springer US (2015). https://doi.org/10.1007/978-1-4899-7637-6
2. Cunha, T., Soares, C., de Carvalho, A.C.P.L.F.: Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering. Information Sciences 423, 128-144 (2018). https://doi.org/10.1016/j.ins.2017.09.050
3. Collins, A., Beel, J., Tkaczyk, D.: One-at-a-time: A Meta-Learning Recommender-System for Recommendation-Algorithm Selection on Micro Level. arXiv preprint abs/1805.12118 (2018)
4. Cunha, T., Soares, C., de Carvalho, A.C.P.L.F.: Selecting Collaborative Filtering Algorithms Using Metalearning. In: Machine Learning and Knowledge Discovery in Databases, pp. 393-409. Springer International Publishing (2016). https://doi.org/10.1007/978-3-319-46227-1_25
5. Cunha, T., Soares, C., de Carvalho, A.C.P.L.F.: CF4CF. In: Proceedings of the 12th ACM Conference on Recommender Systems - RecSys '18. ACM Press (2018). https://doi.org/10.1145/3240323.3240378
6. Meltsov, V., Novokshonov, P., Repkin, D., Nechaev, A., Zhukova, N.: Development of an intelligent module for monitoring and analysis of client's bank transactions. In: Conference of Open Innovations Association, FRUCT, no. 24, pp. 255-262 (2019). https://doi.org/10.23919/fruct.2019.8711931
7. Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning. Springer Berlin Heidelberg (2009). https://doi.org/10.1007/978-3-540-73263-1
8. Smieja, M., Hajto, K., Tabor, J.: Efficient mixture model for clustering of sparse high dimensional binary data. Data Mining and Knowledge Discovery 33, 1583-1624 (2019). https://doi.org/10.1007/s10618-019-00635-1
9. Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-Based Clustering Based on Hierarchical Density Estimates. In: Advances in Knowledge Discovery and Data Mining, pp. 160-172. Springer Berlin Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_14
10. Surprise - A Python scikit for recommender systems. http://surpriselib.com/
11. Harper, F.M., Konstan, J.A.: The MovieLens Datasets. ACM Transactions on Interactive Intelligent Systems 5, 1-19 (2015). https://doi.org/10.1145/2827872
12. Amazon Cell Phones Reviews - Kaggle. https://www.kaggle.com/grikomsn/amazon-cell-phones-reviews
13. Misra, R., Wan, M., McAuley, J.: Decomposing fit semantics for product size recommendation in metric spaces. In: Proceedings of the 12th ACM Conference on Recommender Systems - RecSys '18. ACM Press (2018). https://doi.org/10.1145/3240323.3240398