=Paper=
{{Paper
|id=Vol-1884/paper1
|storemode=property
|title=Enhancing an Interactive Recommendation System with Review-based Information Filtering
|pdfUrl=https://ceur-ws.org/Vol-1884/paper1.pdf
|volume=Vol-1884
|authors=Jan Feuerbach,Benedikt Loepp,Catalin-Mihai Barbu,Jürgen Ziegler
|dblpUrl=https://dblp.org/rec/conf/recsys/FeuerbachLB017
}}
==Enhancing an Interactive Recommendation System with Review-based Information Filtering==
Jan Feuerbach (University of Duisburg-Essen, Duisburg, Germany, jan.feuerbach@stud.uni-due.de), Benedikt Loepp (University of Duisburg-Essen, Duisburg, Germany, benedikt.loepp@uni-due.de), Catalin-Mihai Barbu (University of Duisburg-Essen, Duisburg, Germany, catalin.barbu@uni-due.de), Jürgen Ziegler (University of Duisburg-Essen, Duisburg, Germany, juergen.ziegler@uni-due.de)

IntRS ’17, August 27, 2017, Como, Italy. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
Integrating interactive faceted filtering with intelligent recommendation techniques has been shown to be a promising means for increasing user control in Recommender Systems. In this paper, we extend the concept of blended recommending by automatically extracting meaningful facets from social media by means of Natural Language Processing. Concretely, we allow users to influence the recommendations by selecting facet values and weighting them based on information other users provided in their reviews. We conducted a user study with an interactive recommender implemented in the hotel domain. This evaluation shows that users are consequently able to find items fitting interests that are typically difficult to take into account when only structured content data is available. For instance, the extracted facets representing the opinions of hotel visitors make it possible to effectively search for hotels with comfortable beds or that are located in quiet surroundings without having to read the user reviews.

CCS CONCEPTS
• Information systems → Recommender systems; Data mining; Information retrieval; Search interfaces; • Computing methodologies → Natural language processing;

KEYWORDS
Interactive Recommending, Faceted Filtering, User Reviews

1 INTRODUCTION
Conventional automated Recommender Systems (RS) that pro-actively suggest items of potential interest to users often make it difficult to influence and to understand their outcome [21]. Interactive RS have been proposed that particularly aim at giving users more control over the recommendation process and at improving transparency. For instance, TasteWeights [4], SetFusion [27], MyMovieMixer [23], or uRank [10] allow users to vary the degree to which datasources, algorithms, product facets, or mined keywords are taken into account when generating recommendations. However, they often require up-front availability of information such as existing user preference profiles or rich item data. In contrast, the increasing amount of user-provided content that is available online today has not yet been extensively exploited for interactive recommending. Social media, such as product reviews written by users in online shops or opinions about hotels on booking platforms, has been used so far primarily to deal with data sparsity and to increase algorithmic precision [7].

The concept of blended recommending [23] combines advantages of conventional automated RS, e.g. low user effort and high accuracy, with those of interactive information filtering, e.g. a high level of control and transparency. For this, faceted filtering [15], which has been shown to be an intuitive and efficient means for browsing large item spaces [15, 32], is integrated with different recommender techniques in a hybrid fashion. Consequently, users are enabled to select and weight criteria from facets, leading to items being recommended based on their weighted relevance as determined by Collaborative Filtering (CF) or content-based techniques. In MyMovieMixer [23], an interactive RS based on this concept, users can select a movie as a facet value that then serves to suggest other movies that are similar with respect to their latent factor representation as derived from ordinary rating data. Users can also express that they want, for example, movies from a particular director, starring certain actors, or related in terms of user-generated tags, while being able to specify the weight of each of these criteria. MyMovieMixer was found especially promising for cold-start situations, i.e. without an existing rating profile for the current user, and when users only have a vague search goal in mind. Moreover, the approach allowed the perceived level of user control to be significantly increased [23, 24].

In this paper, we build upon our prior work and extend the concept by extracting meaningful facets and corresponding values from user reviews by means of Natural Language Processing (NLP). So far, blended recommending has only been implemented based on ratings and structured content information. Relying on social media has several advantages. User-written product reviews, in particular, play an important role in buying decisions [9, 31]. They form a useful source of information about what users liked or disliked, especially in the case of “experience products”. In the hotel domain, for instance, reviews may contain references to amenities that are typically not available as filters on online booking websites. Manually looking through dozens of reviews to check whether the recommended hotels have “comfortable beds” or are in “quiet locations” would be time-consuming and require a lot of cognitive effort. By automatically exploiting review data, in contrast, we allow users to directly express their preferences with respect to such subjective dimensions, consequently being able to better take their actual interests into account. Besides, from an information provider’s perspective, user-generated content can be considered a useful addition to conventional objective product data that might be difficult to prepare for each item at the same level of quality.

The remainder of this paper is organized as follows: First, we discuss relevant related work. Next, we describe our proposed method for extracting facets from reviews using NLP. Then, we elaborate on how these facets can be used in blended recommending and introduce a demonstrator system we implemented in the hotel domain based on a real-world dataset¹. Afterwards, we present a user study we conducted to examine the value of facets extracted according to our method in comparison to facets based on typical features as defined by content providers in terms of user experience. Finally, we conclude the paper and discuss avenues for further research.

¹ We crawled metadata and overall 838,780 user reviews for 11,544 hotels located in five major European cities from Booking.com (http://www.booking.com/).
2 RELATED WORK
Today’s RS pro-actively suggest items that match individual preferences based on long-term user models. Producing well-fitting results can thereby reduce interaction effort and cognitive load [29]. However, the process of generating recommendations is often not controllable by users. Generally improving user experience, giving users more control, and increasing system transparency have therefore been identified as important goals [21, 29], which are still only partially addressed in many real-world systems. RS research has for a long time been focused on algorithmic issues as well [21, 29]. For instance, in order to improve accuracy, several attempts have been made to integrate CF algorithms with additional data, including user-generated content such as tags, and in a few cases also topics or opinions automatically extracted from user reviews [e.g. 1, 7, 18, 26, 37]. Only more recently has model-based CF been enhanced for other purposes. One example is TagMF [12], a method that allows users to select and weight tags in order to manipulate the set of recommendations generated as in conventional RS based on ratings.

In general, many interactive recommending approaches have been proposed to overcome the issues of automated RS [14, 17, 22]. The cold-start problem, which occurs when no historical data is available for new users, has, for instance, been addressed algorithmically [e.g. 38], by taking reviews into account [e.g. 7], and by eliciting user preferences in an interactive manner [e.g. 25]. One prominent type of interactive RS are critique-based variants that allow users to iteratively refine the results by critiquing features of currently recommended items [8]. While this usually requires the availability of well-defined product data, more recently several attempts have been made that instead rely on, for instance, latent factors automatically derived from user ratings [25] or user-generated tags [35]. MovieTuner [35], as an example, first determines the relevance of tags and presents users with the most important ones. Then, users can indicate their preferences by critiquing recommendations in terms of these well understandable dimensions—which could represent subjective aspects not adequately describable by the often more technical objective product attributes. While it seems promising to elicit preferences this way, interaction is still typically limited to one single kind of item features, i.e. predefined metadata, latent factors or tags. Besides, user-provided content has only been exploited to a limited extent. The richness of social media such as user reviews has, to our knowledge, not yet been used to integrate more interactivity into RS.

In hybrid RS, multiple algorithms, and often multiple datasources, are combined to generate results with higher precision. This has consequently led to the development of corresponding interactive approaches that give users control over the recommendation process in terms of several dimensions at once. TasteWeights [4], a hybrid music recommender, allows users to directly weight different information types and social datasources, thereby increasing perceived recommendation quality and comprehensibility. The datasources used include social media artifacts such as Wikipedia articles, Facebook profiles or Twitter tweets, but without processing the content data on a semantic level to e.g. infer inherent user preferences. SetFusion [27] also employs a standard hybridization strategy, but enables users to weight each algorithm individually. In addition, the system provides a number of interactive features. However, it requires a persistent user profile and does not offer interaction with respect to any content-related criteria. MyMovieMixer [23] lets users select and weight facet values based on different types of recommender algorithms and related background data. The system increases the perceived amount of control by successfully enabling users to manipulate the result set not only with respect to explicitly defined product features, but also latent factors as well as user-generated data such as tags. Although the underlying concept of blended recommending is easily extendable, it has as yet only been implemented using structured data that is directly given through the applied datasets. uRank [10] is one exception where keywords are extracted from background data specifically to promote interactivity. The system focuses on the exploration of document collections and supports users when their interests shift while browsing. The extraction is performed after some preprocessing steps by creating a vector space model using TF-IDF. Then, the keywords are presented to users as an interactive means to influence the recommendations by selecting and weighting them. To increase transparency, their occurrences in the documents are visualized by means of a stacked bar chart and they are also used to generate an overview of the collection. Other approaches that make extensive use of visualizations for the purpose of increasing system transparency comprise, among others, MoodPlay [2] or Conference Navigator [34]. Overall, all of these works attempt to give users a high degree of control over hybrid RS.

While this line of RS research to some extent converges with information filtering, there exists a wide range of manual filtering approaches outside the field of RS that can also be considered highly supportive for users finding the right items. Faceted filtering is one of the most prominent methods that supports exploration and discovery in large product spaces [15]. By selecting values from facets, the product space is iteratively constrained until the desired product is found. This principle also allows to e.g. facilitate keyword search and navigation in digital libraries or online shops [15]. Early attempts as well as contemporary real-world examples that can be found on many websites (e.g. accommodation booking platforms) usually rely on predefined features, support only Boolean filtering and conjunctive queries, and consider all selected facet values with equal importance [30, 32, 33, 36]. Only few exceptions employ fuzzy methods for value matching [13]. More recently, facets and facet values have also been automatically extracted, and adaptive techniques have been applied to faceted search based on e.g. semantic or social datasources [6, 16, 33]. For instance, RevMiner [16] extracts attribute-value pairs from restaurant reviews (e.g. “delicious pizza”), associates each value with a positive-negative score representing the sentiment, groups the attributes, and eventually presents them to the user in the form of facets and facet values. When applying filter criteria, the restaurants in the results are ranked according to sentiment, strength and frequency of the selected value. Moreover, users can receive recommendations for other places with similar attributes. In general, previous attempts have however been focused on supporting users to select appropriate filter criteria and to deal with a lack of metadata. Yet, the user’s influence on the current filter setting is still limited. VizBoard [36], in contrast, allows users to prioritize selected criteria. Other work has also investigated the user experience of faceted search as well as integrating visualizations. For example, in [32], a matrix visualization is used to display documents and their relevance with respect to the selected facets. While research in faceted filtering has thus brought numerous advances, the respective methods have not yet made extensive use of either recommender functionalities or social media.

Summarizing the state of the art, there exist various attempts that give users more control over RS, also in complex hybrid scenarios. So far, however, social media has been utilized only to a limited extent as a means to increase interactivity. Building on MyMovieMixer and extending the concept of blended recommending seems promising to go beyond integrating the datasources used so far (i.e. rating data, structured content information and explicit user-generated data such as tags), and to exploit the rich knowledge found in unstructured user-provided information such as reviews.
3 EXTRACTING MEANINGFUL FACETS FROM USER REVIEWS
In order to apply the concept of blended recommending based on a social datasource, we propose the procedure illustrated in Figure 1 to extract facets from user reviews.

[Figure 1 illustrates the extraction pipeline: a review sentence such as “The bed was comfortable, but the location was loud at night.” yields the pairs (comfortable, bed), (noisy, location) and (loud, location), which are then merged, classified and grouped under facets such as “Room”.]

Figure 1: First, we identify attribute-value pairs such as “comfortable bed” in the reviews. Then, these pairs are merged with others that have the same meaning, e.g. “comfy bed”. Next, using sentiment analysis, pairs are classified as positive or negative item properties. Finally, pairs that describe properties all related to e.g. hotel rooms are classified and grouped together to serve as values of a corresponding facet.

In the following, we elaborate on each of the steps involved, and also describe how we actually implemented them.

3.1 Identifying Attribute-Value Pairs
First, we split user reviews into sentences. Then, in each sentence, we need to identify nouns that describe properties of the respective item as well as adjectives which represent the opinion of the user who has written the review regarding these properties, e.g. “the bed was comfortable”.

With the help of a Part-of-Speech tagger, we determine the word form of each term in a sentence. Based on the results, we establish grammatical relations using a dependency parser. Especially the relations amod (adjectival modifier) and nsubj (nominal subject) are of interest as they describe relations between adjectives and nouns. Besides, we take other relations into account to analyze more complex sentence structures, e.g. negations and relative clauses. Eventually, this gives us a set of attribute-value pairs which we then reduce to those pairs that appear a minimum number of times across all reviews.

For actually implementing this, we decided to use the Stanford CoreNLP toolkit², a lightweight framework with all the NLP functionality required in this step.

² https://stanfordnlp.github.io/CoreNLP/
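The pair-extraction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes dependency triples are already available (the paper obtains them from Stanford CoreNLP), and the names `extract_pairs` and `min_count` are ours.

```python
from collections import Counter

# Hypothetical input: dependency triples (relation, head, dependent) as
# produced by a dependency parser. The paper keeps amod (adjectival
# modifier) and nsubj (nominal subject) relations, which connect nouns
# to the adjectives describing them.

def extract_pairs(parsed_sentences, min_count=2):
    """Collect (adjective, noun) pairs and keep those seen at least min_count times."""
    counts = Counter()
    for triples in parsed_sentences:
        for rel, head, dep in triples:
            if rel == "amod":        # e.g. amod(bed, comfortable)
                counts[(dep, head)] += 1
            elif rel == "nsubj":     # e.g. nsubj(comfortable, bed)
                counts[(head, dep)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

sentences = [
    [("nsubj", "comfortable", "bed")],   # "The bed was comfortable"
    [("amod", "bed", "comfortable")],    # "a comfortable bed"
    [("amod", "location", "noisy")],     # occurs only once -> filtered out
]
pairs = extract_pairs(sentences, min_count=2)
```

The frequency threshold mirrors the reduction step described above: rarely occurring pairs are discarded before the later merging and classification stages.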
3.2 Merging Values
After attribute-value pairs have been identified, there might be multiple values sharing the same meaning, e.g. “large room” and “big room”. Thus, we need to merge values that describe the same concept (i.e. synonyms) and replace them with a representative value to avoid confusion and redundancy.

For this purpose, we employ a lexical database providing links between synonyms. Since values may have different meanings depending on context (e.g. “big” and “heavy”), we do not directly use these links as criteria for merging, but instead take the proportion of the intersection of the sets of synonyms for each value into account. A value is then classified as similar if a specified threshold is met. Otherwise, it is assumed to be a new representative itself. To improve the quality even further, we manually define groups of values and related representatives for terms that carry very different meanings across contexts, e.g. “good”.

In the same step, we also identify pairs with opposite meaning, e.g. “comfortable bed” and “uncomfortable bed”, and associate them by looking up the respective terms (i.e. antonyms) in a lexical database as well. Although we assume negative pairs to be less meaningful as filter criteria that can be selected by the user, we need them for later calculating item relevances (see Section 4).

As lexical database, we use WordNet 2.1³.

³ https://wordnet.princeton.edu/
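The synonym-overlap test can be sketched roughly as below. This is our own minimal reading of the method, not the paper's code: the synonym sets are hard-coded toy data (in the actual system they would come from WordNet), and the threshold value is illustrative.

```python
# Toy synonym sets; in the real system these would be WordNet synonyms.
SYNONYMS = {
    "large": {"large", "big", "spacious"},
    "big":   {"big", "large", "heavy"},
    "heavy": {"heavy", "weighty", "massive"},
}

def similar(a, b, threshold=0.5):
    """Proportion of shared synonyms, measured against the smaller set."""
    sa, sb = SYNONYMS[a], SYNONYMS[b]
    overlap = len(sa & sb) / min(len(sa), len(sb))
    return overlap >= threshold

def merge(values, threshold=0.5):
    """Greedily assign each value to an existing representative,
    or make it a new representative itself."""
    representatives = {}
    for v in values:
        for rep in representatives:
            if similar(v, rep, threshold):
                representatives[rep].append(v)
                break
        else:
            representatives[v] = [v]
    return representatives

groups = merge(["large", "big", "heavy"], threshold=2/3)
```

With these toy sets, “big” is merged under “large” while “heavy” becomes its own representative, reflecting the point above that a direct synonym link (“big”/“heavy”) is not sufficient grounds for merging.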
3.3 Analyzing Sentiments
Next, in order to explicitly distinguish between positive and negative pairs for the recommendation process, we need to detect the sentiment of the pairs.

Adjectives often already carry a certain sentiment, e.g. “comfortable” can be considered positive. Thus, we decided to use an approach which determines sentiments for single words. Sentiment lexicons are databases where each term is assigned a sentiment value or class based on a certain algorithm or by human judgment. In contrast, in [16], a computational method specifically for reviews is proposed: By averaging helpfulness scores of the user reviews in which a pair occurs, the respective values are classified as positive or negative based on a specified threshold.

To obtain adequate results on the dataset we used¹, we compared SentiWordNet 3.0⁴, an algorithmically labeled sentiment lexicon, the Stanford CoreNLP toolkit, which uses the Sentiment Treebank, a manually labeled dataset, and the method from [16]. By choosing the “right” threshold value, the latter achieved perfect accuracy, i.e. every value was labeled correctly in our test. Among the other two approaches, the Stanford CoreNLP toolkit yielded better results (accuracy of .933) than SentiWordNet (.667). Consequently, in cases where a threshold leading to adequate results could not be set for the computational method, we use the Stanford CoreNLP toolkit for performing the sentiment analysis as well.

⁴ http://sentiwordnet.isti.cnr.it/
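The review-based method summarized above (averaging helpfulness scores of the reviews a pair occurs in, then thresholding) can be sketched as follows. The data and the threshold value are illustrative assumptions, not taken from the paper.

```python
def classify_sentiment(pair, reviews, threshold=0.6):
    """Label a pair positive/negative by the mean helpfulness score of
    the reviews it occurs in; reviews: list of (helpfulness, set_of_pairs)."""
    scores = [h for h, pairs in reviews if pair in pairs]
    if not scores:
        return None  # pair never observed; fall back to a sentiment lexicon
    return "positive" if sum(scores) / len(scores) >= threshold else "negative"

reviews = [
    (0.9, {("comfortable", "bed"), ("great", "location")}),
    (0.7, {("comfortable", "bed")}),
    (0.2, {("noisy", "location")}),
]
label = classify_sentiment(("comfortable", "bed"), reviews)  # mean score 0.8
```

The `None` branch reflects the fallback described in the text: where no adequate threshold-based decision is possible, a lexicon-based classifier (Stanford CoreNLP in the paper) takes over.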
3.4 Classifying Pairs
Finally, to support users in finding facet values that fit their goal and to reduce cognitive load, we aim at assigning the pairs to predefined categories based on their attributes.

For this purpose, we rely on the categories presented in [11], which resulted from collecting and grouping relevant hotel properties. Accordingly, we distinguish between pairs that describe qualities of the “Hotel”, the “Room”, or the “Service”. As classification method, we employ a semantic similarity metric utilizing the graph structure of a lexical database. There exist several interchangeable metrics that rely on different aspects of the underlying database. In any case, to assign an attribute to one of the categories, e.g. “bed” to “Room”, a term to calculate the similarity with is required. Comparing with the class name itself would make it difficult to distinctly assign attributes. Thus, we employ typical terms from [11] and from the taxonomies of popular booking websites for each category (e.g. the set for “Room” contains “bathroom”, “bed” and “air conditioning”), and determine their average similarity with the respective attribute.

For implementing the classification, we use the WS4J⁵ API, which offers a range of similarity metrics based on WordNet. When examining the results obtained on the dataset we used¹ against a set of manually labeled pairs, HirstStOnge yielded the highest accuracy (.737).

⁵ https://github.com/Sciss/ws4j/

4 BLENDED RECOMMENDING WITH EXTRACTED FACETS
In order to use the previously extracted attribute-value pairs for blended recommending, the corresponding facet values (e.g. “comfortable bed”) have to be taken into account for calculating the relevance of the items when selected by the user. In the following, we describe how individual relevances are determined for each facet type, and how these relevance values eventually lead to an aggregated score for each hotel.

Standard Facets. As in [23], we use Boolean filtering for nominal facets (“Location”) and fuzzy filtering for numerical facets (“Price”, “Stars”, “Score”). In the case of Boolean filtering, items matching a selected value are additionally ranked using a criterion that establishes an ordering, e.g. score. In the case of fuzzy filtering, the distance between a selected value and the respective item property determines the relevance (e.g. if the user selects the facet value “score = 8.0”, items with a score of 7.5 are considered more relevant than items with 7.0).

Extracted Facets. For the extracted facets, i.e. those related to “Hotel”, “Room” and “Service”, we in principle follow the way keywords are treated in [23], i.e. the relevance calculation is based on the TF-IDF heuristic. Pairs are therefore considered as terms, and the sets of pairs associated with the hotels as documents. Table 1 shows an example where a user is looking for a hotel with a “comfortable bed”.

Table 1: Results of different TF-IDF variants for an example where a user is looking for a hotel with a “comfortable bed”.

                        Hotel A   Hotel B   Hotel C
  number of pairs             5        10         2
  “comfortable bed”           3         2         1
  “uncomfortable bed”         2         0         0
  tfidf_baseline            .90       .60       .30
  tfidf_norm                .13       .04       .11
  tfidf_pair                .06       .06       .15

The baseline heuristic tfidf_baseline results in hotel B being more relevant than C. Although hotel B is associated with the desired criterion more often than C, more pairs are associated with B in total, i.e. “comfortable bed” cannot be assumed to be a very distinctive characteristic of this hotel. Consequently, we additionally normalize the frequency using the overall number of associated pairs. According to the modified heuristic tfidf_norm, hotel A is still more relevant than C. However, even though “comfortable bed” is relatively associated more often with hotel A than with C, hotel A is also associated with the opposite pair. Hence, we consider the TF-IDF value not only for the positive, but also for the negative pair:

    tfidf_pair = tfidf_positive − tfidf_negative    (1)

Item Relevance. Finally, the individual relevance scores rel_i⁶ for each facet value f_i are aggregated using the corresponding weights w_i by means of the arithmetic mean as in [23]. The overall relevance rel of a hotel h is thus calculated as follows:

    rel(h, f_1, …, f_n, w_1, …, w_n) = ( Σ_{i=1}^{n} w_i · rel_i(h, f_i) ) / ( Σ_{i=1}^{n} w_i )    (2)

⁶ Scores are determined as described above. Note that this might not be possible when a criterion is not referred to in the reviews of a hotel.
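The relevance computation can be sketched in a few lines. This is a toy illustration under assumptions: the paper does not fully specify its TF-IDF variant, so a plain normalized-tf × idf is used here, and all function names are ours; equations (1) and (2) are implemented directly.

```python
import math

def tfidf(pair, hotel_pairs, all_hotels):
    """Normalized tf * idf; hotel_pairs maps pair -> count for one hotel,
    all_hotels is the list of such dicts (hotels act as documents)."""
    total = sum(hotel_pairs.values())
    tf = hotel_pairs.get(pair, 0) / total if total else 0.0
    df = sum(1 for h in all_hotels if pair in h)
    idf = math.log(len(all_hotels) / df) if df else 0.0
    return tf * idf

def pair_relevance(pos_pair, neg_pair, hotel, hotels):
    # Equation (1): positive minus negative TF-IDF.
    return tfidf(pos_pair, hotel, hotels) - tfidf(neg_pair, hotel, hotels)

def overall_relevance(relevances, weights):
    # Equation (2): weighted arithmetic mean of the facet-value relevances.
    return sum(w * r for w, r in zip(weights, relevances)) / sum(weights)
```

Penalizing a hotel for its negative counterpart pair, as in `pair_relevance`, is what separates hotels A and C in the Table 1 example despite A's higher raw frequency of “comfortable bed”.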
5 DEMONSTRATOR SYSTEM
To demonstrate how blended recommending can be implemented based on facets extracted from user reviews, we developed a web application in the hotel domain using the dataset we crawled from Booking.com¹. The demonstrator system, which generally follows the design of MyMovieMixer [23], is shown in Figure 2.

On the left side, a list comprising all facets users can choose from is presented. Clicking on a facet expands it and shows the corresponding facet values in the form of tiles (A). Initially, the values of each facet are hidden (B) to reduce cognitive load. For the facets “Price”, “Stars” and “Score”, some tiles represent predefined values (e.g. “30 − 45 Euro”). In addition, users can create tiles themselves by manually specifying preferred ranges (e.g. “25 − 45 Euro”). The “Location” facet presents users with predefined tiles for each location in the dataset. For the extracted facets, i.e. those related to “Hotel”, “Room” and “Service”, we initially show those attribute-value pairs that occur most often in the reviews, assuming they are generally more important for users. Thereby, we consider only positive values because it is unlikely that users want to receive recommendations related to negative properties. Yet, users can request more tiles, i.e. the next most frequent pairs, by clicking the respective button (C). Moreover, to look for specific values which might be useful to pursue a particular search goal, users can also perform a text-based search.

As soon as users drag a tile into the preference area in the middle of the screen, the corresponding facet value is considered for generating recommendations, i.e. its individual relevance rel_i is now used in (2) when calculating the overall item relevances. In this area, each tile is accompanied by a slider that allows users to weight the respective criterion, i.e. to modify w_i (D). In case users are no longer interested in applying a specific criterion, tiles can be removed from the preference area (E). Adding or removing tiles as well as changing their weight immediately updates the results.

The resulting recommendations are shown on the right side (F). We deliberately limit their number to reduce choice difficulty and to motivate users to manipulate the results by selecting criteria and weighting them, this way being able to explore the effects of their preference settings in an interactive manner. However, if users are not satisfied with particular recommendations, the respective hotels can be removed from the list so that the next most relevant item appears. Each recommendation is displayed with a photo and the top-3 attribute-value pairs⁷ occurring in the related reviews (the number of occurrences is shown in brackets). By clicking on a recommendation, a list of all associated pairs as well as further metadata is shown in a dialog. Since negative opinions can also have an impact on the decision-making process [9], users are here presented with both pairs having positive (green) and negative (red) sentiments.

When users are satisfied with a recommendation, the respective hotel can be dragged into the basket in the top-right corner (G). This area serves to store items that users would like to take into consideration for their final decision. To enable further refinement, the system also suggests new tiles as soon as an item is put into the basket: The last values from each of the extracted facets on the left side are replaced by the top-3 pairs of the respective hotel. Moreover, the basket is used for evaluation purposes.

⁷ Note that these are pairs after performing all steps described in Section 3, i.e. they do not necessarily appear in exactly the same way in all underlying reviews (some reviews may refer to e.g. an “excellent location”, which however would increase the count for “great location”).

Figure 2: Screenshot of our demonstrator system: Facet values are shown as tiles on the left side (A). Users can expand and collapse each facet (B). For some facets, users can search and ask for more tiles (C). As soon as users drag tiles into the preference area in the middle, an accompanying slider allows them to weight the corresponding value (D). If users do not want to consider a criterion anymore, they can remove it (E). Recommended hotels are shown on the right side with images and the attribute-value pairs most frequently occurring in the related reviews (F). Users can put items they like from the results into a basket (G).

6 EVALUATION
To examine the benefits that extracting facets from user reviews has in terms of user experience, in comparison to using explicitly defined features as they are typically found on booking websites, we performed a user study. In this study, we compared the demonstrator system described in the previous section⁸ with an almost identical variant where, instead of facets extracted according to our method, we used facets based on well-defined provided features.

⁸ For the user study, we used an earlier version of the demonstrator system which was slightly different, in particular, with respect to the level of detail for presented recommendations (among others, frequent attribute-value pairs were not visible right away).

6.1 Method
Participants and Materials. We recruited 30 participants (24 female) with an average age of M = 21.63 (SD = 3.32). Most of the participants were students; only two of them were employed. Participants were asked to use the system under controlled conditions in a lab-based setting. During the course of the study, they used a desktop PC with a 24′′ LCD (1920×1200 px resolution) and a standard web browser to fill in a questionnaire and to perform several tasks.

Procedure. Participants were assigned (in counter-balanced order) in a between-subject design⁹ to one of the two following conditions (15 per condition), which varied regarding the source used for the “Hotel”, “Room”, and “Service” facets:

Feature-Based Facets (FF): Demonstrator system with values for the “Hotel” (89 values), “Room” (124), and “Service” (109) facets based on predefined features¹⁰.

Extracted Facets (EF): Demonstrator system with values for the “Hotel” (562 values), “Room” (266), and “Service” (1038) facets based on attribute-value pairs extracted from a dataset of reviews¹ as described in Section 3.

⁹ We decided against a within-subject design to avoid carry-over effects and to reduce participants’ workload.
¹⁰ Hotel features were crawled from Booking.com and manually assigned to the three facets. For example, we associated the feature “non-smoking room”, which according to their taxonomy is explicitly given to hotels on the Booking.com website, with the “Room” facet.

The user study comprised four tasks which were presented (in random order) as scenarios described as follows:

Task 1: “You want to do a weekend trip to London or Berlin with a friend of yours. You want to spend a maximum of 40 Euros per night. The accommodation should have reliable Wifi and a bar.”

Task 2: “You are going on vacation to Brussels with your parents. The accommodation should have 3 stars and should cost about 80 Euros. Furthermore, your parents want a nice view and a good breakfast.”

Task 3: “You want to surprise your partner with a short getaway to Rome. You have some money, so you can spend 60 Euros per night. Since you would like to have some private time, the place should be quiet and you want to have your own (large and clean) bathroom.”

Task 4: “You get a one-week holiday as a gift. Money plays no role and you are able to freely choose the location.”
The accommodation should have ceived usefulness and overall satisfaction. Furthermore, we reliable Wifi and a bar.” formulated questionnaire items ourselves to particularly ad- Task 2 “You are going on vacation to Brussels with your dress helpfulness and understandability of the facets, their parents. The accommodation should have 3 stars and suitability for expressing preferences, and participants’ sat- should cost about 80 Euros. Furthermore, your parents isfaction with them. Finally, at the end of each session, we want a nice view and a good breakfast.” asked the same questions again, now regarding participants’ Task 3 “You want to surprise your partner with a short get- general impression independent of specific tasks. In addition, away to Rome. You have some money, so you can spend we applied the System Usability Scale (SUS) [5]. All items 60 Euros per night. Since you would like to have some were assessed on a positive 5-point Likert scale. private time, the place should be quiet and you want to have your own (large and clean) bathroom.” 6.2 Results Domain Knowledge and Usability. Overall, participants 10 Hotel features were crawled from Booking.com and manually assigned reported average domain knowledge with no significant differ- to the three facets. For example, we associated the feature “non- smoking room”, that relying on their taxonomy is explicitly given to ence (𝑡(28) = .54, 𝑝 = .595) between conditions (FF: M = 2.87, hotels at the Booking.com website, with the “Room” facet. SD = 1.13; EF: M = 2.67, SD = 0.90). Regarding usability, both Enhancing an Interactive Rec. System with Review-based Info. Filtering IntRS ’17, August 27, 2017, Como, Italy variants of the demonstrator system received “good” scores Interaction Behavior. Concerning actual user behavior, no on the SUS, with 76 in the FF, and 83 in the EF condition. 
significant differences were found for the number of facet values being selected (FF: M = 3.77, SD = 1.53; EF: M = 3.57, SD = 1.55; t(28) = .36, p = .72), the number of times more facet values were requested (FF: M = 24.93, SD = 39.53; EF: M = 4.67, SD = 13.06; t(17) = 1.87, p = .08), and the number of recommendations removed from the results (FF: M = 5.67, SD = 12.84; EF: M = 5.87, SD = 12.71; t(28) = -.04, p = .97).

User Experience. Table 2 shows the results regarding participants' general impression. We conducted t-tests (α = .05) to assess differences between conditions. EF was rated superior to FF with respect to all constructs. As highlighted, there were significant differences (with medium to large effect sizes) in terms of perceived variety of the recommendation set, choice difficulty, and intention to provide feedback.

Footnote 11: We examined effects of condition and task using two-way RM ANOVA. Interaction terms were only significant for two variables (variety, effort), each showing differences in only one pairwise comparison. Since the results obtained after each task were overall tendentially similar, we omit reporting them separately. Instead, we present the scores from the final assessment, where participants were asked about their general impression after completing all tasks.

Table 2: t-test (df = 28) results with means and SDs for the overall comparison of the two conditions (* indicates significance at the 5 % level; d represents Cohen's effect size value).

                       Feat.-Based Facets   Extracted Facets
                       M      SD            M      SD          T      p      d
Perc. Rec. Quality     3.97   0.83          4.03   0.79       -0.23   .824   .07
Perc. Set Variety      3.60   0.51          4.13   0.83       -2.12   .043*  .77
Choice Satisfaction    4.27   0.70          4.40   0.63       -0.55   .590   .20
Choice Difficulty      2.73   1.34          3.67   0.98       -2.19   .037*  .80
Perc. Effectiveness    3.60   1.18          4.07   1.10       -1.12   .273   .41
Usage Effort           3.70   0.94          4.07   0.86       -1.11   .276   .41
Feedback Intention     3.20   0.78          3.87   0.83       -2.27   .031*  .83
Usefulness             3.69   0.73          4.18   0.85       -1.69   .103   .62
Overall Satisfaction   3.73   1.10          4.07   0.96       -0.88   .384   .33

Moreover, we found significant gender differences (t(28) = -2.48, p = .019) regarding recommendation quality, with men giving higher ratings (M = 4.67, SD = 0.41) than women (M = 3.83, SD = 0.79), with a large effect size (d = 1.34). Women (M = 367.00 sec, SD = 138.91) also took on average significantly longer (t(28) = 2.08, p = .047) to accomplish the tasks than men (M = 245.17 sec, SD = 63.77), with a large effect size (d = 1.13).

Facets. After each task, we assessed participants' opinions specifically on the facets. We conducted two-way RM ANOVAs to examine the effects of condition and task. Interaction terms were not significant, and we found only few small differences between tasks. Table 3 shows that participants perceived the suitability of the facets for expressing their preferences as significantly higher in the EF condition than in the other, independent of the task. With respect to all other variables, EF was also assessed superior to FF, but without significant differences.

Table 3: ANOVA (df1 = 1, df2 = 28) results with means and SEs for the comparison of the conditions across tasks (* indicates significance at the 5 % level; we aggregated the scores assessed individually for the "Hotel", "Room", and "Service" facets).

                             Feat.-Based Facets   Extracted Facets
                             M      SE            M      SE          F       p
Suit. for Exp. Preferences   3.38   0.179         3.93   0.179       4.604   .041*
Helpfulness                  3.82   0.185         4.12   0.185       1.322   .260
Understandability            4.26   0.176         4.27   0.176       0.002   .965
Satisfaction                 3.96   0.161         4.16   0.161       0.776   .386

6.3 Discussion

In conclusion, the user study shows that the concept of blended recommending can be successfully applied relying on social data sources. When compared to the system variant with facets based on features from a well-established taxonomy (FF), the variant with facets extracted from user reviews (EF) obtained overall superior results after the individual tasks as well as at the end, after participants had finished all tasks. Regarding participants' general impression, significant differences were identified for several relevant variables. Concretely, set diversity was perceived to be higher, and it was easier for participants to settle on one of the recommended hotels (which is in line with earlier research [3]). Furthermore, participants' feedback intention was higher, i.e., they preferred to provide feedback in the EF condition. This is corroborated by their answers to the questionnaire items that specifically addressed the perception of the facets. Apparently, in all tasks, participants valued the possibility to express their preferences with respect to the more subjective dimensions represented through the facets extracted from reviews. At the same time, it can be considered promising that we did not find a negative effect in terms of facet understandability. Rather, our proposed method seems able to extract facets from a real-world review dataset in a meaningful way.

By showing an effect of gender on perceived quality, our study partly validates earlier findings that such factors influence how important reviews are considered as an information source by individual users [19]. Thus, it is a subject of future studies to explore more deeply the impact of personality variables on the use of different information sources. In this regard, it is important to note that there might have been confounding factors. In particular, with the present experimental design, participants did not know the source of the facet values, i.e., that they were extracted from user reviews. However, as reviews, particularly in the hotel domain, influence the perception of trust [31], knowing where the information comes from might positively affect perceived usefulness and possibly also trustworthiness, in particular if it were possible to trace back from the mined attribute-value pairs to the underlying reviews. Besides, we were not able to establish all relations between pairs and hotels due to time restrictions, which potentially contributed negatively to the assessment in the EF condition. The lack of differences in actual user behavior may, in contrast, be attributed to the almost identical interfaces of the two demonstrator variants. In summary, however, the assessment yielded promising results with respect to all variables. Since the interaction terms of condition and task were significant only in two cases, we deduce that this applies independent of the task and its complexity. Nevertheless, further investigating task-related differences is a subject of future work. Overall, participants seemed more satisfied in the EF condition, which is reflected accordingly in the effect sizes.

7 CONCLUSIONS AND OUTLOOK

In this paper, we have presented an extension to the concept of blended recommending. Relying on social media, in particular reviews for hotels, we allow users to specify their preferences with respect to meaningful criteria that usually are not available as filter options, but are especially useful for adequately choosing from sets of recommended "experience products". In line with that, the user study we conducted has shown significant improvements with respect to diversity and choice difficulty, but also promising results in terms of other relevant variables. Thus, it seems that user reviews can be successfully exploited for interactive recommending, and that the information they contain is of value for users even without actually reading the reviews. Besides, the study demonstrates that the concept can be applied, also in the absence of structured content data, in domains other than movies.

In future work, we aim at improving and possibly using different NLP methods to extract the facets. Moreover, reviews could be further exploited to improve the presentation of recommendations. For instance, by identifying users with shared interests, results could be accompanied by summaries of their opinions in order to make suggestions easier to understand and to increase the system's trustworthiness. Finally, all of this would go hand in hand with conducting more user studies, e.g. to evaluate the specific improvements.

REFERENCES
[1] A. Almahairi, K. Kastner, K. Cho, and A. Courville. 2015. Learning distributed representations from reviews for collaborative filtering. In RecSys '15. ACM, 147–154.
[2] I. Andjelkovic, D. Parra, and J. O'Donovan. 2016. Moodplay: Interactive mood-based music discovery and recommendation. In UMAP '16. ACM, 275–279.
[3] D. Bollen, B. P. Knijnenburg, M. C. Willemsen, and M. P. Graus. 2010. Understanding choice overload in recommender systems. In RecSys '10. ACM, 63–70.
[4] S. Bostandjiev, J. O'Donovan, and T. Höllerer. 2012. TasteWeights: A visual interactive hybrid recommender system. In RecSys '12. ACM, 35–42.
[5] J. Brooke. 1996. SUS – A quick and dirty usability scale. In Usability Evaluation in Industry. Taylor & Francis, 189–194.
[6] I. Celik, F. Abel, and P. Siehndel. 2011. Towards a framework for adaptive faceted search on twitter. In DAH '11. 11–22.
[7] L. Chen, G. Chen, and F. Wang. 2015. Recommender systems based on user reviews: The state of the art. UMUAI 25, 2 (2015), 99–154.
[8] L. Chen and P. Pu. 2012. Critiquing-based recommenders: Survey and emerging trends. UMUAI 22, 1-2 (2012), 125–150.
[9] J. A. Chevalier and D. Mayzlin. 2006. The effect of word of mouth on sales: Online book reviews. J. Marketing Res. 43, 3 (2006), 345–354.
[10] C. di Sciascio, V. Sabol, and E. E. Veas. 2016. Rank as you go: User-driven exploration of search results. In IUI '16. ACM, 118–129.
[11] S. Dolnicar and T. Otter. 2003. Which hotel attributes matter? A review of previous and a framework for future research. Technical Report. University of Wollongong.
[12] T. Donkers, B. Loepp, and J. Ziegler. 2016. Tag-enhanced collaborative filtering for increasing transparency and interactive control. In UMAP '16. ACM, 169–173.
[13] A. Girgensohn, F. Shipman, F. Chen, and L. Wilcox. 2010. DocuBrowse: Faceted searching, browsing, and recommendations in an enterprise context. In IUI '10. ACM, 189–198.
[14] C. He, D. Parra, and K. Verbert. 2016. Interactive recommender systems: A survey of the state of the art and future research challenges and opportunities. Expert Syst. Appl. 56, 1 (2016), 9–27.
[15] M. A. Hearst. 2009. Search user interfaces. Cambridge University Press.
[16] J. Huang, O. Etzioni, L. Zettlemoyer, K. Clark, and C. Lee. 2012. RevMiner: An extractive interface for navigating reviews on a smartphone. In UIST '12. ACM, 3–12.
[17] M. Jugovac and D. Jannach. 2017. Interacting with recommenders - Overview and research directions. ACM TiiS (2017).
[18] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. 2010. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In RecSys '10. ACM, 79–86.
[19] E. E. K. Kim, A. S. Mattila, and S. Baloglu. 2011. Effects of gender and expertise on consumers' motivation to read online hotel reviews. Cornell Hosp. Q. 52, 4 (2011), 399–406.
[20] B. P. Knijnenburg, M. C. Willemsen, and A. Kobsa. 2011. A pragmatic procedure to support the user-centric evaluation of recommender systems. In RecSys '11. ACM, 321–324.
[21] J. A. Konstan and J. Riedl. 2012. Recommender systems: From algorithms to user experience. UMUAI 22, 1-2 (2012), 101–123.
[22] B. Loepp, C.-M. Barbu, and J. Ziegler. 2016. Interactive recommending: Framework, state of research and future challenges. In EnCHIReS '16. 3–13.
[23] B. Loepp, K. Herrmanny, and J. Ziegler. 2015. Blended recommending: Integrating interactive information filtering and algorithmic recommender techniques. In CHI '15. ACM, 975–984.
[24] B. Loepp, K. Herrmanny, and J. Ziegler. 2015. Merging interactive information filtering and recommender algorithms – Model and concept demonstrator. i-com 14, 1 (2015), 5–17.
[25] B. Loepp, T. Hussein, and J. Ziegler. 2014. Choice-based preference elicitation for collaborative filtering recommender systems. In CHI '14. ACM, 3085–3094.
[26] J. McAuley and J. Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys '13. ACM, 165–172.
[27] D. Parra, P. Brusilovsky, and C. Trattner. 2014. See what you want to see: Visual user-driven approach for hybrid recommendation. In IUI '14. ACM, 235–240.
[28] P. Pu, L. Chen, and R. Hu. 2011. A user-centric evaluation framework for recommender systems. In RecSys '11. ACM, 157–164.
[29] P. Pu, L. Chen, and R. Hu. 2012. Evaluating recommender systems from the user's perspective: Survey of the state of the art. UMUAI 22, 4-5 (2012), 317–355.
[30] G. M. Sacco. 2006. Dynamic taxonomies and guided searches. J. Am. Soc. Inf. Sci. Tec. 57, 6 (2006), 792–796.
[31] B. A. Sparks and V. Browning. 2011. The impact of online reviews on hotel booking intentions and perception of trust. Tourism Manage. 32, 6 (2011), 1310–1323.
[32] V. T. Thai, P.-Y. Rouille, and S. Handschuh. 2012. Visual abstraction and ordering in faceted browsing of text collections. ACM TIST 3, 2 (2012), 21:1–21:24.
[33] M. Tvarožek, M. Barla, G. Frivolt, M. Tomša, and M. Bieliková. 2008. Improving semantic search via integrated personalized faceted and visual graph navigation. In SOFSEM '08. Springer, 778–789.
[34] K. Verbert, D. Parra, and P. Brusilovsky. 2016. Agents vs. users: Visual recommendation of research talks with multiple dimensions of relevance. ACM TiiS 6, 2 (2016), 11:1–11:42.
[35] J. Vig, S. Sen, and J. Riedl. 2011. Navigating the tag genome. In IUI '11. ACM, 93–102.
[36] M. Voigt, A. Werstler, J. Polowinski, and K. Meißner. 2012. Weighted faceted browsing for characteristics-based visualization selection through end users. In EICS '12. ACM, 151–156.
[37] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In SIGIR '14. ACM, 83–92.
[38] K. Zhou, S.-H. Yang, and H. Zha. 2011. Functional matrix factorizations for cold-start recommendation. In SIGIR '11. ACM, 315–324.
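Appendix note (added by the editor, not part of the paper): the group comparisons in Section 6.2 report t(28) for n = 15 participants per condition, which is consistent with a standard pooled-variance independent-samples t-test, together with Cohen's d based on the pooled SD. The minimal Python sketch below (the helper name `t_and_d` is ours) reproduces the "Perc. Set Variety" row of Table 2 from its summary statistics; the small deviation in t (-2.11 here vs. the reported -2.12) stems from the means and SDs being rounded to two decimals in the table.

```python
import math

def t_and_d(m1, sd1, m2, sd2, n1=15, n2=15):
    """Pooled-variance independent-samples t statistic and Cohen's d
    computed from summary statistics (df = n1 + n2 - 2)."""
    # Pooled standard deviation across the two groups
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    d = abs(m1 - m2) / sp
    return t, d

# "Perc. Set Variety": FF M = 3.60, SD = 0.51 vs. EF M = 4.13, SD = 0.83
t, d = t_and_d(3.60, 0.51, 4.13, 0.83)
print(f"t = {t:.2f}, d = {d:.2f}")  # t = -2.11 (reported: -2.12), d = 0.77
```

The same helper reproduces the other rows of Table 2 to within rounding, e.g. "Feedback Intention" (3.20/0.78 vs. 3.87/0.83) yields d ≈ .83 as reported.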