Featuristic: An interactive hybrid system for generating explainable recommendations – beyond system accuracy

Sidra Naveed, University of Duisburg-Essen, Duisburg, Germany, sidra.naveed@uni-due.de
Jürgen Ziegler, University of Duisburg-Essen, Duisburg, Germany, juergen.ziegler@uni-due.de

ABSTRACT
Hybrid recommender systems (RS) have been shown to improve system accuracy by combining the benefits of the collaborative filtering (CF) and content-based (CB) approaches. Recently, the increasing complexity of such algorithms has fueled a demand for researchers to focus more on user-oriented aspects such as explainability, user interaction, and control mechanisms. Even in cases where explanations are provided, the systems mostly fall short in explaining the connection between the recommended items and users' preferred features. Additionally, rating or re-evaluating items is typically the only option for users to specify or manipulate their preferences. To provide advanced explanations, we implemented a prototype system called Featuristic, applying a hybrid approach that uses content features in a CF approach and exploits feature-based similarities. Addressing important user-oriented aspects, we integrated interactive mechanisms into the system to improve both preference elicitation and preference manipulation, and we integrated explanations for the recommendations into these interactive mechanisms. We evaluated our prototype system in two user studies to investigate the impact of the interactive explanations on the user-oriented aspects. The results show that the Featuristic System with interactive explanations significantly improved users' perception of the system in terms of preference elicitation, explainability, and preference manipulation, compared to systems that provide non-interactive explanations.
Author Keywords
Hybrid Recommender System; Explanations; Interactive Recommending; User Experience

CCS Concepts
• Information systems → Recommender systems;

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntRS '20 – Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, September 26, 2020, Virtual Event.

INTRODUCTION
Recommender systems (RS) based on collaborative filtering (CF) or content-based (CB) techniques have mainly focused on improving the accuracy of predictions, mostly using the ratings users provide for items. Recently, with the increasing complexity of RS algorithms, user-oriented aspects have gained more attention from the research community, and it has been shown that improving these aspects leads to a commensurate level of user satisfaction and user experience with the system [18, 33].

One aspect that may contribute to the actual user experience is the degree of control users have over the system and their preference profiles [16, 14, 18]. Yet, from a user's perspective, today's automated RS, such as those used by Amazon [19] or Netflix [3], provide limited ways to influence the recommendation generation process. Usually, the only means to actively influence the results is by rating or re-rating single items, which raises the risk of users being stuck in a "filter bubble" [6, 29, 42, 28]. This effect makes it difficult for users to explore new areas of potential interest and to adapt their preferences to situational needs and goals [25].

A further problem is the general lack of explainability in most current RS, which can negatively impact users' subjective system assessment and overall user experience. For instance, a lack of explanations can make recommendations difficult to understand, which may hinder users in making their decisions [41, 22]; these aspects consequently affect the overall user experience negatively [33]. Moreover, it is often unclear to users how their expressed preferences actually correspond to the system's representation of the user model, i.e., how manipulating the preference model affects the system's output [36, 46]. Hence, adding more interactivity to the system by letting users influence the recommendation process and their preference profiles is considered a possible solution in RS research to improve the system's explainability [16, 14, 18]. In this regard, merely presenting users with the matching recommendations is not very supportive, and it has been observed that users require additional information and interactive mechanisms to fully benefit from the system [43].

To address the limitations of state-of-the-art CF and CB approaches, a small number of hybrid approaches exist that focus on user-oriented aspects and user experience beyond algorithmic accuracy [6, 20, 30]. Such approaches are still limited in terms of providing explanations, however, as the connection between the recommended items and the user's preferences for item features is not clearly explained to the user. Additionally, these systems rarely explore whether a combination of explanations with interaction tools has a positive influence on the user-oriented aspects.

In this paper, we present an interactive hybrid system called Featuristic, in the domain of digital cameras, that exploits content features in a CF approach. The recommendations and corresponding explanations are generated based on users that are similar to the current user in terms of shared feature-based preferences. The implemented approach is inspired by the approach proposed in [26]. We exploited multiple data sources to provide explainable recommendations rather than relying only on item ratings (CF approach) or item features (CB approach). We further integrated these advanced explanations with interactive mechanisms with the purpose of improving the proposed prototype system with respect to three main user-oriented aspects: 1) the preference elicitation process, 2) the explainability of recommendations, and 3) the preference manipulation of users. In this regard, we aim at addressing the following research question:

RQ: Does integrating the hybrid-style explanations with interaction tools improve the preference elicitation, explainability of recommendations, and preference manipulation for users, compared to a conventional filtering system with simple and non-interactive explanations?

To address the research question, we ran a user study in which we evaluated the Featuristic System with advanced interactive explanations against a conventional filtering approach with rather simple and non-interactive explanations. In a subsequent study, we validated the results of our first study by isolating the effect of the underlying algorithms and focusing only on the effect of interactive explanations on the user-oriented aspects. For this purpose, we compared two versions of our prototype system, with and without interactive explanations.
RELATED WORK
Among other user-oriented factors, increasing the transparency of an RS has been shown to improve the perceived recommendation quality, decision support, trust, overall satisfaction, and acceptance of the recommendations [47, 33, 41, 22]. Several studies have investigated the aspect of transparency by comparing different explanation styles [4], combining different explanation styles [38], and considering factors like personalization [39, 40], tags [46], rankings [24], and natural-language presentations [9]. However, current RS often fall short in explaining to users how the system generates recommendations and why it recommends certain items [35, 41].

In the context of CB approaches, for instance, item attributes can be used to textually explain the relevance of recommended items to the user's personal preferences, though this requires the availability of content data. The most common example of such explanations is Tagsplanations, where recommended movies are explained based on the user's preferred tags, showing how the movies are relevant to these tags [46]. Billsus et al. [5] proposed a news RS where the explanations are presented by means of textual keywords.

In conventional CF approaches, users and items are represented through vectors containing the item ratings. The algorithm tries to predict the missing ratings of items that have not yet been rated by the user based on, for instance, the weighted average of the ratings provided by similar users (user-based CF) or of similar items (item-based CF). Explaining these predictions to users is sometimes very complicated, and they might be difficult for users to understand. Herlocker et al. recognized this problem and compared 21 different explanation interfaces for CF that convey how users with similar tastes rated the recommended items [13]. Their study indicated that users preferred rating histograms over other explanation styles. Numerous attempts have been made to increase the transparency of RS through visual explanations such as flowcharts [15], Venn diagrams [30], graph-based representations [45], cluster maps [45], concentric circles [17, 27], paths among columns [6], and map visualizations [23, 10]. Approaches such as PeerChooser [27] and SmallWorlds [11] presented complex interactive visualizations with the aim of explaining the output of CF: similar users are displayed by means of connected nodes, where the distance between the nodes reflects the similarity between two users.

Hybrid approaches have emerged to benefit from both CF and CB approaches when generating recommendations and their corresponding explanations [7]. Some of these approaches combine ratings with content features [37, 12], and others additionally take social data into account [6, 30, 44, 34]. However, these systems rarely focus on making the recommendation process more transparent and explainable. Where they do attempt to provide explanations, these explanations are mostly presented visually. A prominent example is TalkExplorer [44], which uses cluster maps that allow the user to explore the connections of conference talks to user bookmarks, user tags, and social data. SetFusion [30] is a hybrid system based on TalkExplorer that uses Venn diagrams instead of cluster maps.

The aspects of user control and interactivity have also been integrated into hybrid systems. A common example of such systems is TasteWeights [6], which exploits social, content, and expert data to provide interactive music recommendations. The system not only visually presents the relation between the user profile, the data sources, and the recommendations, but also allows users to manipulate the recommendation process by changing the weights associated with individual items and by expressing their relative trust in each context source. These interactions are dynamically reflected in the recommendations in real time. In the same context, MyMovieMixer [21] is a hybrid approach that allows users to control their recommendation process.
The system provides immediate feedback, highlighting the criteria used to generate the recommendations. MoodPlay [1] is another example, combining content- and mood-based data for recommending music. The recommendations and an avatar representing the user profile are displayed in a visualization, enabling users to understand why certain songs are recommended by means of their position in the latent space, presenting the relation to different moods, and allowing users to influence the recommendation process by moving the avatar [1].

While these works have attempted to increase transparency, user control, and interactive mechanisms, mostly through advanced visualizations, they usually fall short of explaining the connection between the user's preference profile in terms of item features and the relevance of the recommended items to this profile. Additionally, users are provided with limited mechanisms to modify their preference profile or manipulate the recommendation process, mostly in terms of rating or re-rating items. The current work aims to focus on the user-oriented aspects by combining advanced hybrid-style explanations with interaction mechanisms.

A HYBRID SYSTEM BASED ON FEATURE-BASED SIMILARITY
The following steps are used to implement the hybrid approach and are briefly discussed in this section: 1) creating a feature-based profile of the current user; 2) creating the other users' feature-based profiles, implicitly predicted from their item ratings; 3) computing user-user similarities based on shared feature preferences; 4) generating recommendations and corresponding explanations from similar users' feature-based preferences.
Creating feature-based profile of the current user
In the first step, a feature-based profile of the current user is required for use in the feature-based CF approach. For this purpose, the user first selects a feature value and then specifies how important this value is for him/her on a five-point Likert scale (from "not important" = 0 to "very important" = 1). For binary features, e.g., WLAN, selecting a feature and assigning it an importance score adds this feature to the user vector. For continuous features such as Pixel Number, the user can select any range value and an importance score; the section "Similarity between users in terms of continuous features with range-value categories" explains how these values are mapped and stored in the user vector.

Additionally, we used knowledge-based data from a camera website (https://cameradecision.com/) to identify the set of features that are important for the five most common photography modes, i.e., sports, landscape, filming, street, and portrait photography. The current user can explore any photography mode in terms of the pre-defined set of features associated with it, and can select one of the photography modes, either excluding individual features from the feature set of that mode or adding the entire feature set directly to his/her preference profile as part of the mode. To increase control over the system and to enable users to adjust their profile at any time, both the feature values and the importance scores can be modified.

Predicting feature-weights for users using ordinal regression model
The second step is to compute the feature-based profiles of all other users by implicitly predicting them from item ratings. Several techniques have been proposed in the literature to predict feature weightings from item ratings, including the TF-IDF (term frequency-inverse document frequency) method and the entropy-based feature-weighting method proposed in [8, 2]. On one hand, TF-IDF does not provide satisfactory results, as the items have features of mixed data types. On the other hand, an entropy-based feature-weighting method is also limited in computing the relevance between two continuous features with mutual information, due to the loss of information incurred when discretizing non-nominal into nominal data [48].

To overcome these limitations, we applied an ordinal regression model, which predicts an ordinal dependent variable (the item ratings on a five-point Likert scale) given one or more categorical or continuous independent variables (the item features). The model is able to determine which of the features have a statistically significant effect on the item ratings. It also allows determining how much of the variation in the item ratings can be explained by the item features, as well as the relative contribution of each feature in explaining this variance. The steps applied for the ordinal regression model are briefly described below.

Selecting specific features for the model
When constructing a regression model, it is important to identify the predictor variables (item features) that contribute significantly to the model. To do so, the correlations of the item features with the item ratings are computed on the overall ratings dataset by applying Spearman's rank-order correlation. The 15 features with the highest significant correlations with the ratings are then considered for the model.

Predicting ratings from features
In the next step, the PLUM procedure in SPSS is used to apply an ordinal regression model (the technical details and steps applied in SPSS for the PLUM procedure can be found at https://statistics.laerd.com/spss-tutorials/ordinal-regression-using-spss-statistics-2.php). The model was applied separately for each user in the dataset, taking only features into account that have a significant correlation with the user's ratings.

Interpreting the output
For each user, we want to determine which features have a statistically significant effect on the item ratings. For this purpose, the parameter estimates table is used to interpret the results and to identify the features and feature values that have a statistically significant effect in predicting the item ratings, as well as the contribution of each feature value to this prediction.
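For illustration, the following minimal Python sketch reproduces the same idea outside of SPSS, using a proportional-odds model; it is not the pipeline used in the system. It condenses the Spearman-based feature selection and the per-user fit into one helper; the data-frame layout and all names are hypothetical, and the item features are assumed to be numerically coded.

```python
# A minimal sketch (not the authors' SPSS/PLUM pipeline) of predicting a
# user's feature weights from his/her item ratings: Spearman-based feature
# selection followed by an ordinal (proportional-odds) regression.
# `user_ratings` (ratings 1-5, indexed by item) and `item_features`
# (numerically coded features, same item index) are hypothetical inputs.
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.miscmodels.ordinal_model import OrderedModel

def implicit_feature_weights(user_ratings: pd.Series,
                             item_features: pd.DataFrame,
                             top_k: int = 15,
                             alpha: float = 0.05) -> dict:
    # 1) Keep the top-k features with a significant Spearman rank
    #    correlation to the ratings.
    scored = []
    for col in item_features.columns:
        rho, p = spearmanr(item_features[col], user_ratings)
        if p < alpha:
            scored.append((abs(rho), col))
    selected = [col for _, col in sorted(scored, reverse=True)[:top_k]]
    if not selected:
        return {}

    # 2) Ordinal regression of the ordered ratings on the selected features.
    model = OrderedModel(user_ratings.astype(int),
                         item_features[selected], distr="logit")
    result = model.fit(method="bfgs", disp=False)

    # 3) Keep only coefficients with a statistically significant effect and
    #    use them as the user's implicit feature weights.
    return {f: result.params[f] for f in selected
            if result.pvalues[f] < alpha}
```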
Computing user-user similarity based on feature-preferences
The feature-based profile explicitly created for the current user and the profiles implicitly computed using ordinal regression for all other users are then used to identify peer users whose taste in item features is similar to that of the current user. As the camera features are of mixed data types, categorical and continuous, separate measures are used that take the data type into account when computing the similarity between two users, as explained below.

Similarity between users in terms of categorical features
To compute the similarity between two users in terms of the categorical feature values and their corresponding weightings, we applied the Mean Squared Error (MSE), which provides a quantitative score describing the degree of dissimilarity between two profiles.

Similarity between users in terms of continuous features with range-value categories
In the case of continuous features with range values, traditional similarity measures fail to address the question of whether the partial presence of a range value should be treated as the presence or absence of the feature. To address this issue, we compute the similarity between two user vectors in terms of continuous features with range-value categories in a two-step process.

1) Percentage similarity measure: For applying the regression model, the continuous values are categorized into fixed pre-defined bins, where each binned category receives a different weight for the respective user (see section "Predicting feature-weights for users using ordinal regression model"). As the active user can select any customized range value that might not exactly correspond to these binned categories, we express the customized range selected by the active user as the percentage of each binned category that it covers. If the range value completely covers a binned category, that category is assigned a value of 1, and 0 if it is not covered at all. For a range value partially covering a binned category, the percentage similarity is computed using one of the following formulas, matching each condition:

i_{cu,f} = \begin{cases}
  \frac{v_j - v_i}{v_{max} - v_{min}} \cdot r_{cu,f} & \text{if } v_{min} < v_i < v_j < v_{max} \\
  \frac{v_j - v_{min}}{v_{max} - v_{min}} \cdot r_{cu,f} & \text{if } v_i < v_{min} < v_j < v_{max} \\
  \frac{v_{max} - v_i}{v_{max} - v_{min}} \cdot r_{cu,f} & \text{if } v_{min} < v_i < v_{max} < v_j
\end{cases}    (1)

Here [v_i, v_j] is the range value selected by the current user (cu) and [v_{min}, v_{max}] are the minimum and maximum values of the binned category. To compute the importance weighting i_{cu,f} of each binned category for the current user, we multiply the computed percentage similarity by the current user's feature-specific weight r_{cu,f} for the selected range.

2) Applying MSE on percentage similarity: Once the current user's range values are mapped to the percentage to which they cover each binned category, the dissimilarity between the current user and another user in terms of the categories defined by range values is computed by applying the MSE to these values.
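A minimal sketch of this two-step computation, assuming each user's profile for a continuous feature is held as a vector of per-bin weights; the bin boundaries and all variable names below are hypothetical. Note that clipping the overlap of the two intervals covers all three cases of equation (1) in a single expression.

```python
# Step 1 (equation 1): importance weight of one binned category
# [bin_min, bin_max] for a user-selected range [v_i, v_j], scaled by the
# user's feature-specific weight r_cu_f.
def bin_weight(v_i, v_j, bin_min, bin_max, r_cu_f):
    overlap = max(0.0, min(v_j, bin_max) - max(v_i, bin_min))
    return overlap / (bin_max - bin_min) * r_cu_f

# Step 2: MSE between two equal-length per-bin weight vectors as the
# dissimilarity of the two users on this feature.
def profile_mse(weights_a, weights_b):
    return sum((a - b) ** 2
               for a, b in zip(weights_a, weights_b)) / len(weights_a)

# Usage: hypothetical bins for Pixel Number and a user range of 12-30 MP
# with importance 0.8, compared against another user's per-bin weights.
bins = [(0, 10), (10, 20), (20, 40)]
cu_vector = [bin_weight(12, 30, lo, hi, 0.8) for lo, hi in bins]
other_vector = [0.0, 0.6, 0.9]
print(profile_mse(cu_vector, other_vector))
```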
Generating item recommendations
The final dissimilarity score between the active user and another user in terms of categorical and continuous features is computed by taking the average of the scores computed in section "Computing user-user similarity based on feature-preferences". The 10 users with the lowest MSE scores are considered for the recommendation process. From these similar users' profiles, the highest-rated items are taken as the potential list of recommendations. However, to filter this list down to items that not only match the active user's feature preferences (user-item similarity) but also match the feature requirements of the preferred photography mode (item-mode similarity), we applied a post-filtering mechanism in a three-step process to generate the final list of recommendations.

Gower's similarity measure for categorical features
To compute the similarities between the current user's preferred features and the potential items in terms of categorical features, we applied Gower's similarity measure, which takes the type of the variables into account; details of the method can be found in [31]. Let the current user be defined by cu = {cu_f | f = 1, 2, ..., F} and the item by item = {item_f | f = 1, 2, ..., F}. The similarity between the two profiles is computed using Gower's similarity measure:

S_{(cu,item)} = \frac{\sum_{f=1}^{F} s_{(cu,item)f} \cdot \delta_{(cu,item)f}}{\sum_{f=1}^{F} \delta_{(cu,item)f}}    (2)

The coefficient \delta_{(cu,item)f} determines whether the comparison can be made for the f-th feature between cu and item; it equals 1 if the comparison can be made between the two objects for feature f and 0 otherwise. s_{(cu,item)f} is the similarity coefficient that determines the contribution provided by the f-th feature, where the way this coefficient is computed depends on the data type of the feature, i.e., categorical or numeric. For categorical features, i.e., nominal or ordinal, the coefficient takes the value 1 if both objects have observed the same state for feature f and 0 otherwise.

Linear modification of Gower's similarity measure for continuous features
The second step of the post-filtering process for item recommendations is to compute the similarity between the current user (cu) and the item in terms of continuous features, where cu has a range value and the item has one discrete value for feature f. In this case, Gower's coefficient of similarity s_{(cu,item)f} for numeric features fails to address the issue, as it assumes only one distinct value per object [31]. To deal with this limitation, we propose a linear modification of Gower's similarity coefficient s_{(cu,item)f} that assigns a similarity score decreasing linearly with a feature value's distance from the user's desired range whenever it lies outside this range. The idea is to assign a similarity score to the feature of the item depending on how close its value is to the active user's selected range. Let v be the distinct value of feature f in an item, [v_i, v_j] the minimum and maximum values of the range selected by the active user, and [v_{min}, v_{max}] the minimum and maximum values available in the dataset for feature f. The linear function for Gower's similarity coefficient s_{(cu,item)f} is then computed using one of the following formulas, matching each condition:

s_{(cu,item)f} = \begin{cases}
  \frac{v - v_{min}}{v_i - v_{min}} & \text{if } v_{min} < v < v_i \\
  1 & \text{if } v_i \le v \le v_j \\
  \frac{v_{max} - v}{v_{max} - v_j} & \text{if } v_j < v < v_{max}
\end{cases}    (3)

The final user-item similarity score over all of the current user's selected features is then computed by inserting the values of the respective similarity coefficients s_{(cu,item)f} for categorical and numeric features (computed in the sections "Gower's similarity measure for categorical features" and "Linear modification of Gower's similarity measure for continuous features") and of \delta_{(cu,item)f} into equation (2); the top 10 items are then selected for recommendation.
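A minimal sketch of equations (2) and (3), assuming the current user's profile is held as a dictionary of per-feature preference specifications; all field names are hypothetical.

```python
# Equation (3): linearly modified Gower coefficient for a continuous
# feature; 1 inside the user's desired range [v_i, v_j], decreasing
# linearly to 0 at the dataset bounds [v_min, v_max] outside of it.
# Assumes v_min <= v <= v_max.
def s_continuous(v, v_i, v_j, v_min, v_max):
    if v < v_i:
        return (v - v_min) / (v_i - v_min)
    if v > v_j:
        return (v_max - v) / (v_max - v_j)
    return 1.0

# Categorical coefficient: 1 if both objects observe the same state.
def s_categorical(user_value, item_value):
    return 1.0 if user_value == item_value else 0.0

# Equation (2): Gower's similarity over all features for which a
# comparison can be made (delta = 1, i.e. the item has the feature).
def gower_similarity(cu_profile: dict, item: dict) -> float:
    num = den = 0.0
    for f, pref in cu_profile.items():
        if f not in item:            # delta_{(cu,item)f} = 0
            continue
        if pref["type"] == "categorical":
            s = s_categorical(pref["value"], item[f])
        else:
            s = s_continuous(item[f], pref["lo"], pref["hi"],
                             pref["min"], pref["max"])
        num += s
        den += 1.0
    return num / den if den else 0.0
```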
FEATURISTIC: PROTOTYPE AND INTERACTION POSSIBILITIES
To implement the prototype system based on the method described in the previous section, we collected our own explicit item-ratings dataset. For this purpose, we conducted an online study with Amazon Mechanical Turk (https://www.mturk.com/) users, providing them with 60 digital cameras, where each camera was described in terms of a list of 90-95 features extracted from a website with editorial product reviews (https://www.test.de/). Each participant was asked to evaluate at least 20 cameras with a five-star rating based on the available features, which resulted in a total of 5765 ratings on 60 cameras by 150 users. The implemented prototype system, called Featuristic, is shown in Figure 1; it extends the conventional CF and CB approaches in terms of three main aspects, described below.

Preference Elicitation
Conventional CF or CB approaches mostly elicit users' preferences for items by having them rate or re-rate single items. The filtering process of such approaches often assumes that all features are equally important to users and does not take this aspect into account. In the Featuristic System, we elicit a new user's preferences for item features by explicitly asking the user to select the preferred feature values and to indicate the importance of each feature value using an importance slider (Figure 1a). This enables users to specify their preferences more precisely, especially in high-risk domains, e.g., digital cameras, where the features of the items play a vital role in users' decision-making processes. The system further assists users in indicating their preferences more clearly, especially when users have limited domain knowledge, their preferences are not well defined, or they are unaware of the context in which the camera can be used. This is done by providing users with the option to indicate their preferences for one of the five most common photography modes (Figure 1b). The system provides users with the feature set along with suggested values, explaining why these features with certain values are important for a particular mode (Figure 2a).

Explainable Recommendations
Current CF or CB approaches fail to explain the connection between the recommended items and the user's preferences for item features. This is addressed in the Featuristic System by showing a table that compares the features of each recommended item with the user's preferred features (Figure 1c). Additionally, it is mostly unclear to users how their expressed feature preferences actually correspond to the system's representation of their preference model; even in cases where explanations are provided, the rationale behind the recommendations is mostly not explained to users. The Featuristic System visually explains how other users are similar to the current user in terms of shared feature preferences (Figure 1d) and how the recommendations are generated based on similar users' feature-based profiles.

Most current RS do not provide any insight into the distribution of the feature values in the feature space, or even into how the offered items are distributed over the feature space. Such insight might be useful for users to detect relevant features and to inform their own decisions by thoroughly narrowing down the list of items based on the item features. In the Featuristic System, this aspect is integrated by showing the distribution of the feature values selected by similar users (Figure 1d). The recommended items are then mapped on top of this distribution (Figure 1e). This visually explains how the recommended items are generated from similar users' feature-based profiles, as most of the recommended items lie within the feature values most preferred by the similar users.

Since users of the Featuristic System can indicate their preference for one of the five photography modes, the approach also considers the feature set of the selected mode when computing similar users. For each item, a suitability score for each mode is computed; it can be explored by clicking on "suitability for other modes", which opens a bar chart in a pop-up window (Figure 2b). Clicking on any bar expands the window with an explanation of how the scores are computed, in terms of a one-to-one comparison of the features of the item with the required features of the mode.
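The exact scoring function behind the suitability chart is not detailed in the paper; as an illustrative assumption only, such a one-to-one comparison can be read as the fraction of the mode's required feature values that the item satisfies, as in the following sketch with hypothetical feature names.

```python
# A hedged sketch of a per-mode suitability score as a one-to-one feature
# comparison; the concrete scoring used in Featuristic may differ, so this
# is an assumption for illustration. `mode_features` maps each required
# feature of a photography mode to its suggested value.
def mode_suitability(item: dict, mode_features: dict) -> float:
    matched = sum(1 for f, required in mode_features.items()
                  if item.get(f) == required)
    return matched / len(mode_features)

# Usage with hypothetical feature names:
sports_mode = {"high_burst_rate": True, "autofocus_tracking": True}
print(mode_suitability({"high_burst_rate": True}, sports_mode))  # 0.5
```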
Manipulation of Preferences
In most conventional CF approaches, the only way for users to indicate or modify their preferences is by (re)rating items. In the case of filtering systems, users can specify their preferences by selecting the desired value or value range for a specific attribute of the items. In complex domains, e.g., digital cameras, where users mostly lack precise knowledge of the domain, providing explanations can be considered an important factor. Moreover, providing interactivity and direct manipulation within an explanation might offer users a flexible and comprehensible way to manipulate their preferences.

In this respect, the Featuristic System integrates sliders (for continuous features) and toggle buttons (for binary features) into the explanations (Figure 1g) to facilitate the direct manipulation of preferences from the explanations the system provides. The interactive explanations are further combined with the recommendations, visually showing the location of the recommended items distributed over the feature space (Figure 1e). The system allows users to manipulate their preferences directly from the explanations by changing either the feature value or the feature rating, which results in dynamically updated recommendations.

EMPIRICAL EVALUATION
To investigate the impact of the developed explanation method, when integrated with interaction mechanisms, on the user-oriented aspects, we designed a user study. Accordingly, we formulated hypotheses with respect to the user-oriented aspects, focusing on preference elicitation (H1), explainable recommendations (H2a and H2b), preference modification (H3a and H3b), and user experience (H4).
Figure 1: Screenshot of the Featuristic system. Filtering area for selecting features (a) and choosing modes (b); one-to-one comparison of the recommended item with the user's selected feature values (c); graphical explanations showing the comparison of the current user's shared feature preferences with similar users (d); recommended items mapped on top of the similar users' selections (e); sliders to modify the importance of a feature value (f); sliders and toggle buttons to modify feature values (g).

Figure 2: (a) The list of features along with an explanation of why these features are required for the photography mode; (b) the suitability scores of all modes for the recommended item, based on the features available in the item.

Hypotheses
Integrating the feature-based CF-style explanations with interaction tools, when compared to a conventional filtering system, leads to:

• H1: More concrete preference elicitation
• H2a: Better explained recommendations
• H2b: More comprehensible recommendations
• H3a: More direct manipulation of user preferences
• H3b: More controllable manipulation of user preferences
• H4: An improved user experience

User Study 1
To address our hypotheses, we conducted an online crowd-sourced study via Prolific (https://www.prolific.co/). In this study, the Featuristic System, which provides advanced interactive explanations, is compared with a conventional Filtering System that only provides simple and non-interactive explanations.

Table 1: Self-created items used for the constructs during the user study.

Preference Elicitation
- The system allows me to indicate my preferences for the camera-features efficiently.
- The system allows me to indicate my preferences for the camera-features precisely.
- The system allows me to specify how important the specific camera-features are to me.

Understandability
- The information provided for the recommended cameras is easy to understand.
- Overall, I find it difficult to understand the information provided for the recommended cameras.

Decision Support
- The information provided helps me decide quickly.
- Overall, I find it difficult to decide which camera to select.

Direct Manipulation
- Seeing other users' feature-selection helps me in modifying my preferred features.
- I am able to determine suitable feature-values for my selection.
- I am confident in modifying my selected feature-values.
- I am able to directly compare features present in given recommendations with features that other users have selected.
- I am able to directly see the recommended cameras that lie within my feature selection.
Table 2: Mean values and standard deviations for the subjective system assessment of the two conditions in both studies. Significant differences are marked by *; the better-rated system in each significant comparison is named in parentheses.

User Study 1 (df = 54): Featuristic vs. Conventional Filtering
- Preference Elicitation (H1): M = 3.90, SD = 0.65 vs. M = 3.67, SD = 0.85; p = .032* (Featuristic)
- Transparency (H2a): M = 4.26, SD = 0.50 vs. M = 3.82, SD = 0.89; p = .003* (Featuristic)
- Information Sufficiency (H2a): M = 3.27, SD = 0.87 vs. M = 2.94, SD = 0.85; p = .019* (Featuristic)
- Understandability (H2b): M = 3.43, SD = 0.99 vs. M = 3.64, SD = 0.91; p = .069
- Decision Support (H2b): M = 3.10, SD = 1.03 vs. M = 3.15, SD = 1.04; p = .734
- Direct Manipulation (H3a): M = 3.83, SD = 0.52 vs. M = 3.67, SD = 0.58; p = .034* (Featuristic)
- User Control (H3b): M = 3.93, SD = 0.74 vs. M = 3.93, SD = 0.70; p > .999

Follow-up User Study (df = 36): Featuristic vs. Non-Interactive Featuristic
- Preference Elicitation (H1): M = 3.80, SD = 0.85 vs. M = 4.02, SD = 0.64; p = .006* (Non-Interactive)
- Transparency (H2a): M = 3.89, SD = 0.80 vs. M = 3.89, SD = 0.85; p > .999
- Information Sufficiency (H2a): M = 4.02, SD = 0.56 vs. M = 3.10, SD = 1.01; p < .001* (Featuristic)
- Understandability (H2b): M = 3.52, SD = 0.95 vs. M = 3.48, SD = 1.04; p = .760
- Decision Support (H2b): M = 3.18, SD = 1.00 vs. M = 3.33, SD = 1.00; p = .260
- Direct Manipulation (H3a): M = 4.05, SD = 0.47 vs. M = 3.76, SD = 0.60; p = .016* (Featuristic)
- User Control (H3b): M = 4.40, SD = 0.36 vs. M = 3.97, SD = 0.81; p = .003* (Featuristic)

Method
The study was conducted in a within-subject design, where participants were presented with the two prototype systems in counterbalanced order:

• Featuristic System: The interface design of the system is depicted in Figure 1. The interaction possibilities are further described in the section "Featuristic: Prototype and Interaction Possibilities".
• Conventional Filtering System: The system allowed participants to indicate their preferences in terms of features by simply selecting feature values. The system then generates recommendations and explanations showing only a comparison of the recommended items with the participant's selected features and values (Figure 3A).

In each of the two resulting conditions, participants were provided with the same task scenario. In the system, they were first asked to indicate their preferences in terms of features according to the task scenario; the system then generated recommendations and corresponding explanations. Participants were required to explore the system's recommendations and each of the presented explanations and functionalities in order to understand the rationale behind the recommendations and their explanations, and to select the camera(s) matching their preferences according to the task scenario. After interacting with each system, they were asked to evaluate it by answering a series of questions.

Participants and Questionnaire. A total of 55 Prolific users were recruited online (19 female), with ages ranging from 18 to 54 years (M = 28, SD = 8.7). The study completion time was approximately 15-20 minutes. To address our hypotheses, we mostly used self-created items to evaluate both systems in terms of the three aspects mentioned above; these items are shown in Table 1. For Preference Elicitation, we used the self-created items. The aspect of Explainable Recommendations was measured in terms of two sub-aspects, i.e., Explainability (H2a) and Comprehensibility (H2b). For Explainability, we used the items related to Transparency and Information Sufficiency from [32]. For Comprehensibility, we used our self-created items related to Understandability and Decision Support. Furthermore, the aspect of Preference Modification was measured in terms of self-created items specifically related to the interactive mechanisms allowing the participants to directly manipulate their preferences (H3a).
Additionally, Moreover, participants indicated their likes/dislikes for each we used items for User Control (H3b) taken from [32]. All system. When asked about the Filtering System, majority of participants liked the system because of its simple and clean questionnaire items were rated on a 1-5 Likert response scale. design which is easy to understand and use the system and its Additionally, to test our fourth hypothesis, we used the short functionalities. In comparison to the Featuristic System, some version of User Experience Questionnaire (UEQ) (7-point participants indicated their dislike about the Filtering System bipolar scale ranging from -3 to 3). For qualitative feedback, in terms of not being able to indicate the importance for the we provided open-ended questions asking the participants feature-values. For some participants the reason for not liking about their likes and dislikes for both systems in terms of the the Filtering System is because it does not show the graphs of information provided on the interfaces. features or does not include reviews from other people. On Results the other hand, when asked about the likes and dislikes for the Hypothesis 1. To test our hypothesis, we conducted a one- Featuristic System, majority of participants liked the system way repeated measure ANOVA (α = 0.05), revealing that Fea- because the system was clear, precise, intuitive, and innovative. turistic performed significantly better than the Conventional Many participants liked the graph comparisons, where one Filtering System for Preference Elicitation. Therefore, we participant indicated that "The graphs feel like I have a more can accept our H1, indicating that Featuristic leads to more accurate decision", the other stated that: "The graphs and the concrete preference elicitation (Table 2). bar diagrams are innovative which is useful for more focused and serious buyers". Others also liked the option of selecting Hypothesis 2a and 2b. To test H2a, which refers to the as- the importance of feature-values. Even though majority liked pect of Explainability measured in terms of two sub-aspects various functionalities of the Featuristic System, however, for i.e. Transparency and Information Sufficiency, we applied some participants, the interface was quite complex with lots one-way repeated measure MANOVA (α = 0.05). The results of information. One participant wrote that "There is a lot of showed significant differences between two systems in terms information for a novice". For some participants, the graphs of the two aggregated variables (F(2, 54) = 5.59, p < .006, were also difficult to understand. Wilk’s λ = 0.826). Univariate test results further revealed that for both Transparency and Information Sufficiency, the Featur- istic system significantly performed better than the Filtering Discussion. system, indicating that the Featuristic leads to better explained The results show that the Featuristic System significantly im- recommendations. Therefore, we can accept H2a. proved the Preference Elicitation of users as compared to the Filtering System (H1). This might be due to the Featuristic However, in terms of Comprehensibility (H2b) which is mea- System’s ability, allowing users to not only select features and sured in terms of two sub-aspects i.e. Understandability and its values but also indicate the importance for each individual Decision Support, we found no significant differences between feature-value. 
Qualitative feedback. Moreover, participants indicated their likes and dislikes for each system. When asked about the Filtering System, the majority of participants liked it because of its simple and clean design, which made the system and its functionalities easy to understand and use. In comparison to the Featuristic System, some participants disliked that the Filtering System did not allow them to indicate the importance of the feature values. For some participants, the reason for not liking the Filtering System was that it does not show the feature graphs or include reviews from other people. On the other hand, when asked about their likes and dislikes for the Featuristic System, the majority of participants liked the system because it was clear, precise, intuitive, and innovative. Many participants liked the graph comparisons; one participant indicated that "The graphs feel like I have a more accurate decision", and another stated: "The graphs and the bar diagrams are innovative which is useful for more focused and serious buyers". Others also liked the option of selecting the importance of the feature values. Even though the majority liked various functionalities of the Featuristic System, for some participants the interface was quite complex, with a lot of information; one participant wrote that "There is a lot of information for a novice". For some participants, the graphs were also difficult to understand.

Discussion
The results show that the Featuristic System significantly improved the Preference Elicitation of users as compared to the Filtering System (H1). This might be due to the Featuristic System's ability to let users not only select features and their values but also indicate the importance of each individual feature value. This might have made the preference indication more precise and efficient for users as compared to conventional CF and CB systems, where it is mostly assumed that all features are equally important to users. This is also reflected in the participants' qualitative feedback. For example, one participant stated that "I like specifying how important a feature was and not only if I wanted it or not", and another wrote: "I like being able to select how important a feature is with the sliding bar".

Additionally, we investigated the second main aspect of the Featuristic System, i.e., Explainable Recommendations, which was measured in terms of two sub-aspects: Explainability (H2a) and Comprehensibility (H2b). With respect to Explainability (H2a), the Featuristic System was perceived significantly better than the Filtering System. This indicates that the more advanced explanations in the Featuristic System made the recommendations more transparent and explainable for users, which can be validated from the participants' qualitative feedback. One participant indicated that "It gave you the information and segregation of data in an easy to read (graphical) format.", and another stated: "You can see at a glance whether or not a specific camera has these features". For others it was useful to compare their choices with other users; one participant wrote: "I love the fact that I had to compare my choices with recommendations of others", and another: "I like the fact that the system brings other users' choice for me and also gave me detailed information about my search."

However, we found no significant differences between the two systems in terms of Comprehensibility (H2b) for the aggregated variables, where the Filtering System showed slightly better results.
This might be explained by the fact that the two systems were quite different in terms of the functionalities and the level of information provided. On one hand, the Filtering System provides rather simple and non-interactive explanations; on the other hand, the Featuristic System is more complex in terms of the interactive functionalities and advanced explanations it provides. This may have made the Filtering System seem more comprehensible to users, which is also reflected in the participants' qualitative feedback about the Filtering System: they found it much simpler, cleaner, and easier to understand than the Featuristic System. For some of the participants, the Featuristic System provided too much information, which was rather complex for them to comprehend.

With respect to Direct Manipulation of Preferences (H3a), the Featuristic System was perceived significantly better than the Filtering System, suggesting that integrating the interactive mechanisms with our explanations allowed users to directly manipulate their preferences through these explanations. Surprisingly, in terms of User Control (H3b), both systems were perceived as being of equal quality. In the Filtering System, the only way provided to users to control the system's output is by selecting features or re-adjusting the feature values, and it has been shown that such user control mechanisms are easy to use compared to mechanisms that allow users to indicate relative preferences [14] (e.g., the feature-rating slider in Featuristic). In such cases, it is sometimes not clear to users whether having the slider in the middle position means the same as having the slider at the maximum level. This might have made the interpretation of such control mechanisms complicated for users of the Featuristic System, so that it was not perceived better than the Filtering System.

Additionally, for User Experience (H4), we found no significant differences between the two systems. This might indicate that, regardless of the more advanced explanations with interactive mechanisms provided in Featuristic compared to the much simpler explanations of the Filtering System, participants perceived both systems to be of similar quality in terms of the user experience. On the other hand, this might also be explained under the assumption that participants differ in their domain knowledge and in their ability to perceive and understand the information and functionalities the system provides: for some participants the information and its functionalities might be easy to understand, and for others too complicated. As one of the participants stated about the Featuristic System: "The information was easy to understand for me, but I can imagine less technical people would find information and graphics confusing."

Figure 3: (A) Filtering System, (B) Featuristic System without interactive explanations.

Follow-up User Study
To verify that integrating the developed explanation method with interaction tools has a positive impact on the user-oriented aspects independently of the type of underlying algorithm, we conducted a follow-up user study. In this study, we isolated the underlying algorithm by focusing only on the type of explanations provided. For this, we compared two versions of the Featuristic System that apply the same underlying hybrid approach; the only difference lies in whether the explanations provided by the system are interactive or non-interactive.

Method
The study was conducted via Prolific in a within-subject design and followed the same procedure and design as the first study described in the section "User Study 1". We again tested the same hypotheses described in the section "Hypotheses", but this time isolating the type of explanation as the independent variable. We created two versions of the Featuristic System, described below:

• Featuristic System: The interface design and its interaction possibilities are described in "User Study 1" and shown in Figure 1.
• Featuristic System without interactive explanations: The prototype is similar to the one shown in Figure 1. The only major difference is that the user is not provided with the functionality to modify or critique the selected feature values or ratings through the graphical explanations of the recommendations (see Figure 3B).

Participants and Questionnaire. A total of 37 Prolific users were recruited online (15 female), with ages ranging from 18 to 50 years (M = 24.86, SD = 6.9). The study completion time was approximately 15-20 minutes. To address our hypotheses, we used the same questionnaire items as in the first user study.

Results
To compare the two versions of the Featuristic System, we applied a one-way repeated-measures MANOVA; the results can be seen in Table 2. With respect to Preference Elicitation (H1), the results showed a significant difference, where the Non-Interactive version of the system was perceived significantly better than the Interactive version. Therefore, we have to reject H1.

For the Explainability of recommendations (H2a), which is measured in terms of Transparency and Information Sufficiency, we found significant differences between the two systems for the aggregated variables (F(2, 35) = 16.30, p < .001, Wilks' λ = 0.518). However, the result of the univariate test showed a significant difference only in terms of Information Sufficiency, where the Interactive Featuristic performed better. Overall, we can accept H2a.

Regarding Comprehensibility (H2b), which is measured in terms of Understandability and Decision Support, we found no significant differences between the two systems. Overall, we cannot accept H2b.

Additionally, in terms of Direct Manipulation and User Control, we again found significant differences between the two systems, where the Interactive Featuristic performed significantly better than the Non-Interactive system. Therefore, we can accept H3a and H3b. However, with respect to the UEQ, we found no significant differences between the two systems, which leads to rejecting H4.
Discussion
The results of the follow-up study showed that, in terms of Explainability, User Control, and Direct Manipulation, the Interactive version of Featuristic performed significantly better than the Non-Interactive version. This clearly shows the positive impact of integrating interactive mechanisms with explanations on these aspects. The results are similar to the results of the first user study for most of the factors, where the Interactive Featuristic performed better. This verifies that our advanced explanations have a positive impact on the user-oriented aspects independently of the underlying algorithms.

The insignificant differences in terms of Comprehensibility and User Experience might be due to the fact that both systems provided the same functionalities and level of explanations; the only difference lies in the interactivity or non-interactivity of the explanations. This might explain why both systems were perceived equally in terms of Comprehensibility and User Experience. However, the qualitative feedback showed that most of the participants liked the interactive functionality of the Featuristic System. One participant stated: "In my opinion, this system is more clear and clean than the other one. Although they look almost the same, I feel this one can be a bit more efficient. It is very helpful and intuitive".

CONCLUSION AND OUTLOOK
In this paper, we showcased the possibility of integrating our proposed feature-based CF-style explanations with interaction tools through a prototype system called Featuristic. To study the impact from a user perspective in terms of Preference Elicitation, Explainable Recommendations, Preference Manipulation, and User Experience, we first compared our Featuristic System with a Conventional Filtering System that only provides simple and non-interactive explanations. The results showed that the Featuristic System was perceived significantly better than the Conventional Filtering System with respect to the aspects of Preference Elicitation, Explainability, and Preference Manipulation. However, we found no significant differences between the two systems in terms of User Experience and Comprehensibility, which might be due to the complex structure of the explanations and the system design, as stated by many participants in their qualitative feedback.

We further conducted a follow-up user study to verify that the results from the first study are independent of the underlying algorithms. For this, we compared two versions of the Featuristic System, isolating the underlying algorithm and focusing only on the type of explanations provided, i.e., interactive and non-interactive explanations. The results showed that the Interactive version of Featuristic performed significantly better than the Non-Interactive version in terms of Explainability, User Control, and Direct Manipulation.

To summarize, the current work clearly showed the positive impact of integrating advanced explanations with interaction tools to improve the user-oriented aspects, especially in complex product domains. However, the current work has some limitations in terms of the complex system design, which could be simplified further to improve the overall User Experience. Additionally, factors like the user's cognitive effort and experience with the product domain might also impact the user's perception of the system with respect to the user-oriented aspects, and thus require further investigation in future work.

REFERENCES
[1] Ivana Andjelkovic, Denis Parra, and John O'Donovan. 2016. MoodPlay: Interactive mood-based music discovery and recommendation. In Proceedings of the 2016 Conference on User Modeling, Adaptation and Personalization. ACM, 275–279.
[2] Manuel J. Barranco and Luis Martínez. 2010. A method for weighting multi-valued features in content-based filtering. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 409–418.
[3] James Bennett, Stan Lanning, and others. 2007. The Netflix Prize. In Proceedings of KDD Cup and Workshop, Vol. 2007. New York, 35.
[4] Mustafa Bilgic and Raymond J. Mooney. 2005. Explaining recommendations: Satisfaction vs. promotion. In Beyond Personalization Workshop, IUI, Vol. 5. 153.
[5] Daniel Billsus and Michael J. Pazzani. 1999. A personal news agent that talks, learns and explains. In Proceedings of the Third Annual Conference on Autonomous Agents. Citeseer, 268–275.
[6] Svetlin Bostandjiev, John O'Donovan, and Tobias Höllerer. 2012. TasteWeights: a visual interactive hybrid recommender system. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 35–42.
[7] Robin Burke. 2007. Hybrid web recommender systems. In The Adaptive Web. Springer, 377–408.
[8] Jorge Castro, Rosa M. Rodriguez, and Manuel J. Barranco. 2014. Weighting of features in content-based filtering with entropy and dependence measures. International Journal of Computational Intelligence Systems 7, 1 (2014), 80–89.
[9] Shuo Chang, F. Maxwell Harper, and Loren Gilbert Terveen. 2016. Crowd-based personalized natural language explanations for recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 175–182.
[10] Emden Gansner, Yifan Hu, Stephen Kobourov, and Chris Volinsky. 2009. Putting recommendations on the map: visualizing clusters and relations. In Proceedings of the Third ACM Conference on Recommender Systems. ACM, 345–348.
[11] Brynjar Gretarsson, John O'Donovan, Svetlin Bostandjiev, Christopher Hall, and Tobias Höllerer. 2010. SmallWorlds: visualizing social recommendations. In Computer Graphics Forum, Vol. 29. Wiley Online Library, 833–842.
[12] Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. 1661–1670.
[13] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work. ACM, 241–250.
[14] Dietmar Jannach, Sidra Naveed, and Michael Jugovac. 2016. User control in recommender systems: Overview and interaction challenges. In International Conference on Electronic Commerce and Web Technologies. Springer, 21–33.
[15] Yucheng Jin, Karsten Seipp, Erik Duval, and Katrien Verbert. 2016. Go with the flow: effects of transparency and user control on targeted advertising using flow charts. In Proceedings of the International Working Conference on Advanced Visual Interfaces. ACM, 68–75.
[16] Michael Jugovac and Dietmar Jannach. 2017. Interacting with recommenders – overview and research directions. ACM Transactions on Interactive Intelligent Systems (TiiS) 7, 3 (2017), 10.
[17] Antti Kangasrääsiö, Dorota Glowacka, and Samuel Kaski. 2015. Improving controllability and predictability of interactive recommendation interfaces for exploratory search. In Proceedings of the 20th International Conference on Intelligent User Interfaces. ACM, 247–251.
[18] Joseph A. Konstan and John Riedl. 2012. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction 22, 1-2 (2012), 101–123.
[19] Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76–80.
[20] Benedikt Loepp, Tim Donkers, Timm Kleemann, and Jürgen Ziegler. 2019. Interactive recommending with tag-enhanced matrix factorization (TagMF). International Journal of Human-Computer Studies 121 (2019), 21–41.
[21] Benedikt Loepp, Katja Herrmanny, and Jürgen Ziegler. 2015. Blended recommending: Integrating interactive information filtering and algorithmic recommender techniques. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 975–984.
[22] Martijn Millecamp, Sidra Naveed, Katrien Verbert, and Jürgen Ziegler. 2019. To explain or not to explain: the effects of personal characteristics when explaining feature-based recommendations in different domains. In CEUR Workshop Proceedings. CEUR.
[23] Afshin Moin. 2014. A unified approach to collaborative data visualization. In Proceedings of the 29th Annual ACM Symposium on Applied Computing. ACM, 280–286.
[24] Khalil Muhammad, Aonghus Lawlor, and Barry Smyth. 2016. On the use of opinionated explanations to rank and justify recommendations. In The Twenty-Ninth International FLAIRS Conference.
[25] Sayooran Nagulendra and Julita Vassileva. 2014. Understanding and controlling the filter bubble through interactive visualization: a user study. In Proceedings of the 25th ACM Conference on Hypertext and Social Media. 107–115.
[26] Sidra Naveed and Jürgen Ziegler. 2019. Feature-driven interactive recommendations and explanations with collaborative filtering approach. In ComplexRec@RecSys. 10–15.
[27] John O'Donovan, Barry Smyth, Brynjar Gretarsson, Svetlin Bostandjiev, and Tobias Höllerer. 2008. PeerChooser: visual interactive recommendation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1085–1088.
[28] Eli Pariser. 2011. The Filter Bubble: What the Internet Is Hiding from You. Penguin UK.
[29] Denis Parra and Peter Brusilovsky. 2015. User-controllable personalization: A case study with SetFusion. International Journal of Human-Computer Studies 78 (2015), 43–67.
[30] Denis Parra, Peter Brusilovsky, and Christoph Trattner. 2014. See what you want to see: visual user-driven approach for hybrid recommendation. In Proceedings of the 19th International Conference on Intelligent User Interfaces. ACM, 235–240.
[31] János Podani. 1999. Extending Gower's general coefficient of similarity to ordinal characters. Taxon 48, 2 (1999), 331–340.
[32] Pearl Pu, Li Chen, and Rong Hu. 2011. A user-centric evaluation framework for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems. ACM, 157–164.
[33] Pearl Pu, Li Chen, and Rong Hu. 2012. Evaluating recommender systems from the user's perspective: survey of the state of the art. User Modeling and User-Adapted Interaction 22, 4-5 (2012), 317–355.
[34] Amit Sharma and Dan Cosley. 2013. Do social explanations work?: studying and modeling the effects of social explanations in recommender systems. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 1133–1144.
[35] Rashmi Sinha and Kirsten Swearingen. 2002. The role of transparency in recommender systems. In CHI'02 Extended Abstracts on Human Factors in Computing Systems. ACM, 830–831.
[36] Kirsten Swearingen and Rashmi Sinha. 2001. Beyond algorithms: An HCI perspective on recommender systems. In ACM SIGIR 2001 Workshop on Recommender Systems, Vol. 13. Citeseer, 1–11.
[37] Panagiotis Symeonidis, Alexandros Nanopoulos, and Yannis Manolopoulos. 2008. Providing justifications in recommender systems. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans 38, 6 (2008), 1262–1272.
[38] Panagiotis Symeonidis, Alexandros Nanopoulos, and Yannis Manolopoulos. 2009. MoviExplain: a recommender system with explanations. RecSys 9 (2009), 317–320.
[39] Nava Tintarev and Judith Masthoff. 2007. Effective explanations of recommendations: user-centered design. In Proceedings of the 2007 ACM Conference on Recommender Systems. ACM, 153–156.
[40] Nava Tintarev and Judith Masthoff. 2012. Evaluating the effectiveness of explanations for recommender systems. User Modeling and User-Adapted Interaction 22, 4-5 (2012), 399–439.
[41] Nava Tintarev and Judith Masthoff. 2015. Explaining recommendations: Design and evaluation. In Recommender Systems Handbook. Springer, 353–382.
[42] Chun-Hua Tsai and Peter Brusilovsky. 2017. Providing control and transparency in a social recommender system for academic conferences. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. ACM, 313–317.
[43] Chun-Hua Tsai and Peter Brusilovsky. 2018. Beyond the ranked list: User-driven exploration and diversification of social recommendation. In 23rd International Conference on Intelligent User Interfaces. ACM, 239–250.
[44] Katrien Verbert, Denis Parra, and Peter Brusilovsky. 2014. The effect of different set-based visualizations on user exploration of recommendations. In CEUR Workshop Proceedings, Vol. 1253. University of Pittsburgh, 37–44.
[45] Katrien Verbert, Denis Parra, Peter Brusilovsky, and Erik Duval. 2013. Visualizing recommendations to support exploration, transparency and controllability. In Proceedings of the 2013 International Conference on Intelligent User Interfaces. ACM, 351–362.
[46] Jesse Vig, Shilad Sen, and John Riedl. 2009. Tagsplanations: explaining recommendations using tags. In Proceedings of the 14th International Conference on Intelligent User Interfaces. ACM, 47–56.
[47] Bo Xiao and Izak Benbasat. 2007. E-commerce product recommendation agents: use, characteristics, and impact. MIS Quarterly 31, 1 (2007), 137–209.
[48] Kai Zeng, Kun She, and Xinzheng Niu. 2014. Feature selection with neighborhood entropy-based cooperative game theory. Computational Intelligence and Neuroscience 2014 (2014), 11.