Evaluating Stereotype and Non-Stereotype Recommender Systems

Nourah A. ALRossais
University of York, UK
nar537@york.ac.uk

Daniel Kudenko
University of York, UK
National Research University Higher School of Economics, St Petersburg, Russia
JetBrains Research, St Petersburg, Russia
daniel.kudenko@york.ac.uk

ABSTRACT
Stereotype-based user modeling was proposed by Elaine Rich in 1979 and has been applied to recommender systems on numerous occasions since its conception. The key motivations for applying stereotyping in user modeling are resolution of the new user problem and space efficiency. Several claims have been made in the literature regarding the effectiveness of stereotyping, but only a few studies have validated them empirically. Furthermore, to the best of our knowledge, there has been no empirical study of item-based stereotype models for recommender systems. Our research empirically substantiates the efficacy of using stereotypes in item modeling and user modeling, compared with not using stereotypes. The empirical evaluation was performed with a state-of-the-art machine learning algorithm (gradient boosted decision trees) applied to two datasets integrating MovieLens, IMDb and TMDB movie data.

KEYWORDS
Recommender Systems; Stereotypes; New Item; User-Item Modeling; Performance Evaluation

ACM Reference Format:
Nourah A. ALRossais and Daniel Kudenko. 2019. Evaluating Stereotype and Non-Stereotype Recommender Systems. In Proceedings of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop 2018 (co-located with RecSys 2018). ACM, New York, NY, USA, 6 pages.

1 INTRODUCTION
Elaine Rich was the first to propose the utilization of stereotypes in user modeling and recommender systems as a method for resolving the new user problem [14].
A stereotype depicts a collection of attributes that are relevant to a collection of users [12] and represents the "frequently occurring characteristics of users" [14]. A stereotype may or may not be a precise representation of the user group or any specific group member; it may simply be an estimation of certain characteristics of the group.

The fundamental motivation for applying stereotyping is to provide personalization despite having insufficient information about new users, by assigning them to a stereotype. In this context, stereotypes are usually regarded as similar to other models [12]. The objective was that recommendations could be presented to new users without the requirement to gather a set of ratings from the users for the purpose of user model training. Moreover, Rich mentioned that an additional benefit of stereotyping is its space efficiency, because characteristics that are applicable to several users need to be stored only once and can be applied to all members belonging to a stereotype. Stereotyping has been deployed in user modeling (see e.g. [3, 5, 10, 12-15]), but to the best of our knowledge there has been no application of the concept of stereotypes to item modeling, apart from our previous study, in which we developed techniques for an item-based recommender system employing stereotypes based on item characteristics [1].

Furthermore, several claims have been made in the literature regarding the effectiveness of stereotyping in user modeling, but only a few studies have validated them empirically [12]. A performance comparison between item modeling with and without stereotypes has not been carried out to date.

1.1 Contribution
In previous work [1], we proposed a technique for utilizing stereotypes in item modeling. However, that study did not include a performance comparison of item modeling with and without the application of stereotypes. This paper provides, for the first time, a comparative analysis of the performance of stereotype-based item modeling against non-stereotype-based item modeling. Furthermore, this paper contributes to the literature by presenting experimental results comparing the effectiveness of user models with and without the application of stereotypes.

Specifically, the presented research addresses the following research questions:

(1) Using a state-of-the-art machine learning method, do recommendations based on stereotype-based user modeling achieve better accuracy than recommendations that do not utilize stereotypes in user modeling?
(2) Do item-based recommender systems achieve improved accuracy when they employ stereotypes for item modeling, compared to not utilizing stereotypes?

For evaluation, we use two integrated datasets combining MovieLens, IMDb and TMDB information.

The remainder of this paper is organized as follows: Section 2 summarizes related work. A preliminary design of the prediction algorithm for building a manual stereotype-based item model offline is presented in Section 3. Experimental results are presented in Section 4, followed by a discussion in Section 5. Lastly, Section 6 presents concluding remarks alluding to future work.

2 RELATED WORK
The Grundy system developed by Rich [14] is a pioneering work in the field of stereotype-based recommender systems. It was the first system of its kind to recommend items to users, and it uses a hierarchical structure for the creation of stereotypes to make recommendations to users. The results of experiments conducted by Rich revealed that users were more satisfied with recommendations made by Grundy than with randomly generated ones. However, the empirical evaluation did not provide ample evidence in favour of a stereotype-based user model over individualized user modeling, as the latter type of modeling was not implemented and evaluated [12].

Even though resolution of the new user problem and space efficiency [14] are the key objectives of stereotyping, the authors of [5] stated that stereotyping offers yet another advantage, as it allows "knowledge acquisition and debugging to occur in a highly modular and incremental way, thus facilitating the job of the knowledge engineer (which turns out to be especially hard in the particular domain of user modeling)".

An evaluation of the Personal Program Guide (PPG) by Ardissono et al. [3] showed that the overall user modeling in PPG, consisting of three user modeling components, displayed good performance, but the performance of the stereotypical user model was poor, which might be attributed to the incompleteness of the knowledge base underlying the stereotypes and to inappropriate assignment of stereotypes. Gena and Ardissono [8] noted that "stereotypical knowledge does not correctly handle users matching different lifestyles in different aspects of their behaviours, because of the major selectivity of the personal data in the classification of users, in spite of interests". Still, Ardissono et al. believed that stereotype-based user models are useful when interacting with users [3], even when they lead to weaker recommendation performance.

Kurapati and Gutta proposed, in the domain of TV personalization, that a stereotype-based approach to recommendations displays similar performance to the individualized recommender system they had developed previously [11], even though the comparison was performed only for a single user (User K). The estimated error rate for the individualized user model of User K was 22%, while it was 13% for the stereotype-based model. However, the study did not include a detailed and direct comparison between stereotype-based user modeling and a single component approach, neither at the individual user level nor averaged over all users [12].

Krulwich, while evaluating LIFESTYLEFINDER, found that a random recommendation approach was outperformed by a stereotype-based system. Krulwich noted that "the ability to operate on a small amount of innocuous information comes at the expense of the accuracy that the system is able to achieve" [10]. Yet this claim is unsupported, as no direct comparison between the stereotype-based and an individualized approach was made.

Lock [12], in a rare empirical study on the performance of a stereotype-based approach (in the context of the development of the stereotype-based recommender system GERMANE), found that, on average, stereotype-based user modeling is comparable to a single component approach. However, Lock also noted that stereotyping can be effectively employed in recommender systems for known users from whom relevance feedback has been collected, and that this enhancement in flexibility will not lead to lower performance. Lock [12] added that online user evaluation is required to substantiate the advantages of stereotype-based user modeling over single component user models.

The shortage of empirical studies into the performance advantages of stereotype-based recommendations is a key gap in this field. In support of this, the authors of [15] stated that "experiments must be conducted to compare the results with and without the use of stereotypes for the same users and data". However, they added that "such experiments are not easily carried out".
3 STEREOTYPE-BASED RECOMMENDATION ALGORITHM
To address the research questions, in previous work [1] we proposed an algorithm that assigns users and items to stereotypes through our user and item preference models (i.e. $P_u$, $P_i$) as well as a clustering technique. In this work, we validate the algorithms empirically. The stereotypes in this study are of a 'double stereotype' nature, which implies that the information that was used to recommend an item to a target user also influences the allocation of items to stereotypes. Double stereotypes were suggested by Chin [6] in terms of user-based stereotypes for information filtering, where the information that a user chooses to view is also examined as a way to allocate users to stereotypes. We apply a similar concept for item-based stereotypes. Sections 3.1 and 3.2 explain our proposed algorithm in more detail.

3.1 User-Based Stereotype Recommendation
Let $P_u(i)$ be the preference function of user $u \in U$ for item $i \in I$, and $P_{US}(i)$ the preference function, for item $i \in I$, of a user stereotype (US) of which user $u$ is a member. Then:

$$P_u(i) = P_{US}(i) \quad \forall i \in I$$

Since user $u$ can belong to multiple user stereotypes (USs), in the recommendation setting $P_u(i)$ is the sum of the weighted preference functions of these user stereotypes for item $i$, given by

$$P_u(i) = \sum_{S \in US(u)} w_S \, P_S(i) \qquad (1)$$

where $US(u)$ is the set of user stereotypes of which user $u$ is a member, $w_S$ is the weight of the preference function of stereotype $S$ as defined by an expert in the field, and

$$\sum_{S \in US(u)} w_S = 1, \qquad w_S \in [0, 1] \quad \forall S \in US(u)$$
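To make the aggregation in Equation (1) concrete, the following minimal Python sketch computes $P_u(i)$ from the preference functions of the stereotypes a user belongs to; Equation (2) below is the symmetric item-side computation. The dictionary structure, the stereotype names and the `user_preference` helper are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict

def user_preference(
    stereotype_models: Dict[str, Callable[[int], float]],  # S -> P_S(.)
    weights: Dict[str, float],                              # S -> w_S
    item_id: int,
) -> float:
    """Equation (1): P_u(i) = sum over S in US(u) of w_S * P_S(i)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * stereotype_models[s](item_id) for s, w in weights.items())

# Hypothetical usage: a user belonging to the 'age:25-34' and 'gender:F'
# stereotypes, each model being e.g. the predict function of a trained
# regressor over that stereotype's pooled training data.
# p = user_preference({"age:25-34": age_model, "gender:F": gender_model},
#                     {"age:25-34": 0.2, "gender:F": 0.8}, item_id=42)
```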
3.2 Item-Based Stereotype Recommendation
Let $P_i(u)$ be the preference function of item $i \in I$ for user $u \in U$, and $P_{IS}(u)$ the preference function, for user $u \in U$, of an item stereotype (IS) of which item $i$ is a member. Then:

$$P_i(u) = P_{IS}(u) \quad \forall u \in U$$

Since item $i$ can belong to multiple item stereotypes (ISs), in the recommendation setting $P_i(u)$ is the sum of the weighted preference functions of these item stereotypes for user $u$, given by

$$P_i(u) = \sum_{S \in IS(i)} w_S \, P_S(u) \qquad (2)$$

where $IS(i)$ is the set of item stereotypes of which item $i$ is a member, $w_S$ is the weight of the preference function of stereotype $S$ as defined by an expert in the field, and

$$\sum_{S \in IS(i)} w_S = 1, \qquad w_S \in [0, 1] \quad \forall S \in IS(i)$$

4 EXPERIMENTAL EVALUATION
This section details the two datasets used in our investigation into the performance of stereotype-based user/item models. They differ in terms of the number of stereotypes assigned to users/items. We run a direct comparison between user/item models and stereotype-based user/item models, as the same datasets are used to construct and evaluate both model types. The main concern of this paper is to compare the single user/item model and the stereotype-based approaches. The performance levels reported for both experiments are for known users (i.e. users from whom training feedback has been obtained).

4.1 Dataset
The MovieLens dataset is quite popular among the research community. GroupLens Research has collected and made available different versions of the MovieLens dataset. For the purpose of this study, experiments were conducted on two different versions of MovieLens: (1) the MovieLens 1 Million dataset and (2) the MovieLens 20 Million dataset [9].

Demographic features (e.g. age, gender, occupation) of users were extracted from the MovieLens 1 Million dataset, and supplementary item features were extracted from Kaggle (https://www.kaggle.com/), based on the TMDB dataset. The combined dataset contains 6,040 users, 3,827 movies, 1,000,209 ratings, 35,052 cast members, and 28,541 crew members, along with other movie data and user-generated features like keywords. We refer to this dataset as Dataset 1 in the remainder of this paper.

Unlike the MovieLens 1 Million dataset, the MovieLens 20M dataset does not contain demographic features. Instead, we interpreted a user's average rating per item feature as a user feature. More precisely, in previous work [2], we integrated the MovieLens 20M dataset with the IMDb dataset and generated a dataset from this integrated data. This dataset includes a feature vector that represents useful information about users and movies that is not explicitly contained in the raw data. More specifically, our dataset contains information about user interest in movie genres, actors, etc. The dataset differs from other data in that the interest of users in movie features is calculated implicitly from their overall ratings, rather than by explicitly asking users for their preferences.

A total of 20M ratings applied to 27,242 movies by 138,000 users, where each user rated at least 20 movies, were extracted. In our experiment, we applied our algorithm to 150,567 ratings applied to 9,734 movies by 1,000 users. There is a wide variance in performance between the users, as each user has a different set of interests. We refer to this dataset as Dataset 2 in the remainder of this paper.
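As a concrete illustration of how such implicit user features can be derived, the following sketch computes a user's average rating per genre with pandas. The column names follow the public MovieLens 20M CSV layout; the actual feature construction is described in [2] and may differ in detail.

```python
# Minimal sketch: derive implicit user-interest features as the user's mean
# rating per genre, in the spirit of the Dataset 2 construction (details in [2]).
# Assumes the MovieLens 20M CSV files; the exact pipeline may differ.
import pandas as pd

ratings = pd.read_csv("ratings.csv")   # userId, movieId, rating, timestamp
movies = pd.read_csv("movies.csv")     # movieId, title, genres ("A|B|C")

df = ratings.merge(movies, on="movieId")
df = df.assign(genre=df["genres"].str.split("|")).explode("genre")

# One row per user, one column per genre, holding the mean rating given by
# that user to movies of that genre.
user_features = df.pivot_table(index="userId", columns="genre",
                               values="rating", aggfunc="mean")
user_features = user_features.fillna(0.0)  # no ratings in a genre -> 0
```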
4.2 Evaluation Procedure
In this section, we set up the experimental studies that investigate the following research question: "Does the use of stereotypes help to improve accuracy over not using stereotypes?"

We first have to build a user model and an item model. These models can be computed automatically by applying machine learning techniques to the ratings given by the users to the items viewed. A machine learning algorithm (gradient boosted decision trees [7]) was deployed to build the user/item and the user/item-based stereotype models. In our experiments, the baseline for evaluating stereotyping is a single user/item model constructed using the same machine learning algorithm. This makes it possible to directly compare the individualized and stereotype-based models.

Experiments were conducted offline, considering two different predictive accuracy measures: (1) Mean Absolute Error (MAE) and (2) Mean Squared Error (MSE). MAE and MSE are appropriate metrics for assessing models that output scores with similar ranges and distributions, and they have been used in previous studies [3, 16]. User satisfaction is not measured in our experiments.

For our investigation, we performed two experiments using distinct settings, to generalize the findings and to show that the proposed stereotype approach is applicable to any user attributes, whether subject to change or not. The experimental settings are presented in Table 1.

Table 1: Experiment Settings

| Setting | Experiment on Dataset 1 | Experiment on Dataset 2 |
|---|---|---|
| Dataset | MovieLens 1M | MovieLens 20M |
| User features | Demographic (not subject to change) | Preferences (subject to change) |
| Data splitting | Train/validate/test (no overlap) | k-fold cross validation |
| Machine learning algorithm | Gradient boosted decision trees | Bagged decision trees |
| No. of stereotypes assigned to a user | 2: age (7 exclusive groups), gender (2 exclusive groups) | Varies between 1 and 477, representing user preferences for: genres (28 groups), actors (248 groups), directors (101 groups), writers (100 groups) |
| No. of stereotypes assigned to an item | Varies between 1 and 647: genres (23 groups), cast (132 groups), crew (192 groups), keywords (300 groups) | Varies between 1 and 477: genres (28 groups), actors (248 groups), directors (101 groups), writers (100 groups) |
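For reference, the two error measures used throughout the evaluation can be computed directly (or via scikit-learn's `mean_absolute_error` and `mean_squared_error`); the ratings and predictions below are made-up values for illustration only.

```python
# MAE and MSE on illustrative rating predictions.
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error: average magnitude of prediction errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Squared Error: average squared prediction error."""
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([4.0, 3.5, 5.0, 2.0])  # hypothetical observed ratings
y_pred = np.array([3.7, 3.9, 4.4, 2.5])  # hypothetical model predictions
print(mae(y_true, y_pred), mse(y_true, y_pred))
```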
4.3 Experiment on Dataset 1
To build the user model, we treat the movie ratings of a user as the labels of training examples, and the features of a movie (e.g. genre, actor, etc.) form the training example itself. The user model is the output of the applied machine learning method when fed with this training data.

In the case of user-based stereotypes, we train a stereotype model in the same way as the user model, but now using the combined training data from all users that fit the stereotype, as indicated in Equation (1). We split Dataset 1 by items to train the user model (i.e. every movie is in exactly one of the train (70%), validate (10%), or test (20%) sets; there is no overlap). We repeat the process five times using simple random sampling to ensure unbiased results. Performance was averaged over all five runs.

The input to the user model is a matrix consisting of the following item features: genres, id, adult, budget, imdb_id, original_language, popularity, production_companies, production_countries, release_date (converted into release_year and release_month), revenue, runtime, spoken_languages, title, vote_average, vote_count, keyword, cast and crew.

The input to the user-based stereotype model is the same matrix used in the user model, but here we combined the training data from all users, using gender and age to define the user-based stereotypes. As Equation (1) uses weighted preferences of user-based stereotypes, we experimented with different weights for the age and gender stereotypes over all five samples to ensure unbiased results. Table 2 summarizes the average results. Overall, changing the weights has no significant impact on accuracy; the best result is achieved when we assign weights of 0.2 and 0.8 to age and gender, respectively.

Table 2: Different weights for the user-based stereotype model

| Age weight | Gender weight | MAE | MSE |
|---|---|---|---|
| 0.1 | 0.9 | 0.79133 | 1.2236 |
| 0.2 | 0.8 | 0.79131 | 1.2235 |
| 0.3 | 0.7 | 0.79147 | 1.22420 |
| 0.4 | 0.6 | 0.79180 | 1.22570 |
| 0.5 | 0.5 | 0.79200 | 1.22629 |
| 0.6 | 0.4 | 0.79329 | 1.23012 |
| 0.7 | 0.3 | 0.79403 | 1.23183 |
| 0.8 | 0.2 | 0.79610 | 1.23720 |
| 0.9 | 0.1 | 0.79697 | 1.23951 |
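The splitting and training protocol described above can be sketched as follows: a disjoint 70/10/20 train/validate/test split over rated movies, followed by fitting gradient boosted decision trees [7] with scikit-learn. The feature matrix, ratings and hyperparameters are synthetic placeholders, not the paper's exact setup.

```python
# Sketch of one run of the Dataset 1 protocol: disjoint 70/10/20 split,
# then fit a gradient boosted decision tree model. Synthetic stand-in data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).random((500, 20))    # stand-in movie features
y = 1 + 4 * np.random.default_rng(1).random(500)  # stand-in ratings in [1, 5]

# 70% train, 30% held out; the holdout is split 1/3 validate, 2/3 test,
# giving the 70/10/20 partition with no overlap.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=2/3, random_state=0)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X_train, y_train)

val_mae = np.mean(np.abs(y_val - model.predict(X_val)))     # tune on validation
test_mae = np.mean(np.abs(y_test - model.predict(X_test)))  # report on test
print(val_mae, test_mae)
```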
To build the item model, we treat the movie ratings of a user as the labels of training examples, and the features of a user (gender, age, occupation) form the training example itself. The item model is the output of the applied machine learning method when fed with this training data.

In the case of the item-based stereotypes, we train a stereotype model in the same way as the item model, but now using the combined training data from all items that fit the stereotype, as indicated in Equation (2). We split Dataset 1 by users to train the item model (i.e. every user is in exactly one of the train (70%), validate (10%), or test (20%) sets; there is no overlap). We repeat the process five times using simple random sampling to ensure unbiased results. Performance was averaged over all five runs.

Although cross-validation is commonly used to estimate generalization performance, it is not always appropriate for recommender system evaluation. Random assignment of items to folds was found inappropriate by other authors [4]. Billsus and Pazzani found that a user's rating of an item is influenced by the items they have already seen and rated; the ordering of items is therefore critical. We consequently preserved the chronological ordering of the relevance feedback data by sampling every user into either the train, validate or test set.

The input to the item model is a matrix consisting of the following user features: gender, age, occupation and zip code.

The input to the item-based stereotype model is the same matrix used in the item model, but here we combined the training data from all items, using genres, cast, crew and keywords to define the different item-based stereotypes. The choice of features on which the stereotypes are based was made using our domain expertise. Equation (2) includes weighted preference functions of item-based stereotypes; however, in our experiment we used uniform stereotype weights for simplicity, as assigning weights manually is not practical. Instead, it should be done automatically, which we leave for future work.

Table 3 summarizes the accuracy of the stereotype and non-stereotype based models for Dataset 1. The accuracy of user-based stereotype modeling is promising and in line with findings in the literature. As for the item models, which represent the "preference" of an item for a user (i.e. a mapping of users to preference values), our expectation that this would provide additional useful information for the recommender system was confirmed. Moreover, designing item stereotypes analogously to user stereotypes proved promising, as the item-based stereotypes achieve an improvement in accuracy compared to the raw item model.

Table 3: Accuracy of stereotype and non-stereotype models for Dataset 1

| Model | MAE | MSE |
|---|---|---|
| User model | 0.794 | 1.234 |
| User-based stereotype | 0.791 | 1.223 |
| Item model | 0.878 | 1.450 |
| Item-based stereotype | 0.876 | 1.449 |

4.4 Experiment on Dataset 2
This experiment was run and validated with 5-fold and 10-fold cross-validation to avoid over-fitting. The reason for using a different algorithm and different validation methods in this experiment is to demonstrate the impact of stereotypes on recommendation performance irrespective of the methods and algorithms.

To build the user model, we treat the movie ratings of a user as the labels of training examples, and the features of a movie (e.g. genre, duration, etc.) form the training example itself. The user model is the output of the applied machine learning method when fed with this training data.

In the case of user-based stereotypes, we train a stereotype model in the same way as a user model, but now using the combined training data from all users that fit the stereotype, as indicated in Equation (1).

The input to the user model is a matrix that consists of the following item features: genres, release_year and duration.

The input to the user-based stereotype model is the same matrix used in the user model, but here we combined the training data from all users, using preferences for genres, actors, directors and writers. The choice of features used to create the stereotypes was based on our domain expertise. Equation (1) indicates weighted preference functions of user-based stereotypes; however, in our experiment we used uniform stereotype weights for simplicity, as we assume all preferences are equally important (which may not be the case).

As noted in Section 4.1, the MovieLens 20 Million dataset does not contain any user features. Hence, we implicitly calculated the interest of a user in given movie features, in the form of an average rating, and treat this as a user feature. Details are in our previous work [2]. To build the item model, we treat the movie ratings of a user as the labels of training examples, and the interest of a user in various genres forms the training example itself. The item model is the output of the applied machine learning method when fed with this training data.

In the case of the item-based stereotypes, we train a stereotype model in the same way as the item model, but now using the combined training data from all items that fit the stereotype, as indicated in Equation (2).

The input to the item model is a matrix consisting of the user features corresponding to the user's preferences for various genres.

The input to the item-based stereotype model is the same matrix used in the item model, but here we combined the training data from all items, using genres, actors, directors and writers to define the item-based stereotypes. Equation (2) indicates weighted preference functions of item-based stereotypes; however, in our experiment we used uniform stereotype weights, assuming that all item features are equally important.

Table 4: Accuracy of stereotype and non-stereotype models for Dataset 2

| Model | 5-fold MAE | 5-fold MSE | 10-fold MAE | 10-fold MSE |
|---|---|---|---|---|
| User model | 0.779 | 0.988 | 0.778 | 0.986 |
| User-based stereotype | 0.768 | 0.970 | 0.769 | 0.970 |
| Item model | 0.774 | 0.978 | 0.774 | 0.977 |
| Item-based stereotype | 0.742 | 0.906 | 0.742 | 0.905 |

Table 4 presents the accuracy of the stereotype and non-stereotype based models for Dataset 2. The user-based stereotype modeling achieves better accuracy than the single user model. The same applies to the item-based stereotype models when compared to the item model. In this experiment, we achieved even better accuracy for the item-based stereotype than for the user-based stereotype. This indicates that stereotype-based item modeling is a promising approach.
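A minimal sketch of the Dataset 2 validation loop described above: 5-fold and 10-fold cross-validation of a bagged decision tree regressor, scored with MAE and MSE via scikit-learn. Data and estimator settings are synthetic placeholders, not the paper's exact configuration.

```python
# Sketch of the Dataset 2 validation: 5-fold and 10-fold cross-validation of
# bagged decision trees, scored with MAE and MSE. Synthetic stand-in data.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X = np.random.default_rng(0).random((1000, 28))    # e.g. per-genre interest features
y = 1 + 4 * np.random.default_rng(1).random(1000)  # stand-in ratings in [1, 5]

est = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
for k in (5, 10):
    cv = cross_validate(est, X, y, cv=k,
                        scoring=("neg_mean_absolute_error", "neg_mean_squared_error"))
    print(f"{k}-fold: MAE={-cv['test_neg_mean_absolute_error'].mean():.3f} "
          f"MSE={-cv['test_neg_mean_squared_error'].mean():.3f}")
```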
5 DISCUSSION
Our research questions have been addressed by the experimental results in Section 4. Two experiments conducted on different datasets in the movie domain demonstrated that the performance of stereotype-based user models is slightly better than that of single-component models for existing users.

Moreover, a performance comparison of item modeling with and without stereotypes has been shown for the first time. The generated item-based stereotype models are models of the target market for a given group of items, i.e. they denote how much an item "likes" a user (rather than the other way around, as in user modeling). The results are promising for solving the new item problem.

Nevertheless, it may be neither effective nor efficient for a recommender system to manually define stereotypes from a restricted list of item features such as size, sold quantity, price, etc. An alternative is the automatic and dynamic generation of stereotypes from a collection of features: for example, in one case feedback, price and similarity might be used to group products, and in another case quantity sold, click-through rate and popularity could be employed. Automated stereotype generation can thus better support models that focus on the requirements of the user, increasing revenue through the identification of items which users may find more interesting.

Therefore, to overcome the limitations of manual stereotypes, we intend to develop an automatic item-based recommender system in the next phase of the project.

6 CONCLUSION AND FUTURE WORK
In this work, we evaluated user and item models with and without stereotypes on two movie recommendation datasets. The results demonstrate the effectiveness of stereotypes in significantly improving the accuracy of recommendations.

In future work, we aim to evaluate our model on other datasets collected from an online business and in an online user study. Furthermore, we intend to develop a hybrid method that combines stereotype-based user and item models to achieve higher recommendation accuracy.

REFERENCES
[1] Nourah A. AlRossais. 2018. Integrating Item Based Stereotypes in Recommender Systems. In UMAP'18 Adjunct: 26th Conference on User Modeling, Adaptation and Personalization, July 8-11, 2018, Singapore, Singapore. ACM, 4 pages.
[2] Nourah A. AlRossais and Daniel Kudenko. 2018. iSynchronizer: A Tool for Extracting, Integration and Analysis of MovieLens and IMDb Datasets. In UMAP'18 Adjunct: 26th Conference on User Modeling, Adaptation and Personalization Adjunct, July 8-11, 2018, Singapore, Singapore. ACM, 5 pages.
[3] Liliana Ardissono, Alfred Kobsa, and Mark T. Maybury. 2004. Personalized digital television: targeting programs to individual viewers. Vol. 6. Springer Science & Business Media.
[4] Daniel Billsus and Michael J. Pazzani. 1999. A hybrid user model for news story classification. In UM99 User Modeling. Springer, 99-108.
[5] Giorgio Brajnik, Giovanni Guida, and Carlo Tasso. 1990. User modeling in expert man-machine interfaces: A case study in intelligent information retrieval. IEEE Transactions on Systems, Man, and Cybernetics 20, 1 (1990), 166-185.
[6] David N. Chin. 1989. KNOME: Modeling what the user knows in UC. In User Models in Dialog Systems. Springer, 74-107.
[7] Jerome H. Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189-1232.
[8] Cristina Gena and Liliana Ardissono. 2001. On the construction of TV viewer stereotypes starting from lifestyles surveys. In Workshop on Personalization in Future TV.
[9] F. Maxwell Harper and Joseph A. Konstan. 2016. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2016), 19.
[10] Bruce Krulwich. 1997. Lifestyle Finder: Intelligent user profiling using large-scale demographic data. AI Magazine 18, 2 (1997), 37.
[11] Kaushal Kurapati and Srinivas Gutta. 2002. Instant personalization via clustering TV viewing patterns. IASTED's ASC (2002).
[12] Zoe Lock. 2005. Performance and Flexibility of Stereotype-based User Models. Ph.D. Dissertation. University of York.
[13] Jon Orwant. 1994. Heterogeneous learning in the Doppelgänger user modeling system. User Modeling and User-Adapted Interaction 4, 2 (1994), 107-130.
[14] Elaine Rich. 1979. User modeling via stereotypes. Cognitive Science 3, 4 (1979), 329-354.
[15] Bracha Shapira, Peretz Shoval, and Uri Hanani. 1997. Stereotypes in information filtering systems. Information Processing & Management 33, 3 (1997), 273-287.
[16] Manolis Vozalis and Konstantinos G. Margaritis. 2004. Unison-CF: a multiple-component, adaptive collaborative filtering system. In International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems. Springer, 255-264.