=Paper=
{{Paper
|id=Vol-1884/paper1
|storemode=property
|title=Enhancing an Interactive Recommendation System with Review-based Information Filtering
|pdfUrl=https://ceur-ws.org/Vol-1884/paper1.pdf
|volume=Vol-1884
|authors=Jan Feuerbach,Benedikt Loepp,Catalin-Mihai Barbu,Jürgen Ziegler
|dblpUrl=https://dblp.org/rec/conf/recsys/FeuerbachLB017
}}
==Enhancing an Interactive Recommendation System with Review-based Information Filtering==
Jan Feuerbach (University of Duisburg-Essen, Duisburg, Germany, jan.feuerbach@stud.uni-due.de), Benedikt Loepp (University of Duisburg-Essen, Duisburg, Germany, benedikt.loepp@uni-due.de), Catalin-Mihai Barbu (University of Duisburg-Essen, Duisburg, Germany, catalin.barbu@uni-due.de), Jürgen Ziegler (University of Duisburg-Essen, Duisburg, Germany, juergen.ziegler@uni-due.de)

IntRS ’17, August 27, 2017, Como, Italy. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

ABSTRACT
Integrating interactive faceted filtering with intelligent recommendation techniques has been shown to be a promising means for increasing user control in Recommender Systems. In this paper, we extend the concept of blended recommending by automatically extracting meaningful facets from social media by means of Natural Language Processing. Concretely, we allow users to influence the recommendations by selecting facet values and weighting them based on information other users provided in their reviews. We conducted a user study with an interactive recommender implemented in the hotel domain. This evaluation shows that users are consequently able to find items fitting interests that are typically difficult to take into account when only structured content data is available. For instance, the extracted facets representing the opinions of hotel visitors make it possible to effectively search for hotels with comfortable beds or that are located in quiet surroundings without having to read the user reviews.

CCS CONCEPTS
• Information systems → Recommender systems; Data mining; Information retrieval; Search interfaces; • Computing methodologies → Natural language processing;

KEYWORDS
Interactive Recommending, Faceted Filtering, User Reviews

1 INTRODUCTION
Conventional automated Recommender Systems (RS) that pro-actively suggest items of potential interest to users often make it difficult to influence and to understand their outcome [21]. Interactive RS have been proposed that particularly aim at giving users more control over the recommendation process and at improving transparency. For instance, TasteWeights [4], SetFusion [27], MyMovieMixer [23], or uRank [10] allow users to vary the degree to which datasources, algorithms, product facets, or mined keywords are taken into account when generating recommendations. However, they often require up-front availability of information such as existing user preference profiles or rich item data. In contrast, the increasing amount of user-provided content that is available online today has not yet been extensively exploited for interactive recommending. Social media, such as product reviews written by users in online shops or opinions about hotels on booking platforms, has been used so far primarily to deal with data sparsity and to increase algorithmic precision [7].

The concept of blended recommending [23] combines advantages of conventional automated RS, e.g. low user effort and high accuracy, with those of interactive information filtering, e.g. a high level of control and transparency. For this, faceted filtering [15], which has been shown to be an intuitive and efficient means for browsing large item spaces [15, 32], is integrated with different recommender techniques in a hybrid fashion. Consequently, users are enabled to select and weight criteria from facets, leading to items being recommended based on their weighted relevance as determined by Collaborative Filtering (CF) or content-based techniques. In MyMovieMixer [23], an interactive RS based on this concept, users can select a movie as a facet value that then serves to suggest other movies that are similar with respect to their latent factor representation as derived from ordinary rating data. Users can also express that they want, for example, movies from a particular director, starring certain actors, or related in terms of user-generated tags, while being able to specify the weight of each of these criteria. MyMovieMixer was found especially promising for cold-start situations, i.e. without an existing rating profile for the current user, and when users only have a vague search goal in mind. Moreover, the approach allowed the perceived level of user control to be significantly increased [23, 24].

In this paper, we build upon our prior work and extend the concept by extracting meaningful facets and corresponding values from user reviews by means of Natural Language Processing (NLP). So far, blended recommending has only been implemented based on ratings and structured content information. Relying on social media has several advantages. User-written product reviews, in particular, play an important role in buying decisions [9, 31]. They form a useful source of information about what users liked or disliked, especially in the case of “experience products”. In the hotel domain, for instance, reviews may contain references to amenities that are typically not available as filters on online booking websites. Manually looking through dozens of reviews to check whether the recommended hotels have “comfortable beds” or are in “quiet locations” would be time-consuming and require a lot of cognitive effort. By automatically exploiting review data, in contrast, we allow users to directly express their preferences with respect to such subjective dimensions, consequently being able to better take their actual interests into account. Besides, from an information provider’s perspective, user-generated content can be considered a useful addition to conventional objective product data that might be difficult to prepare for each item at the same level of quality.

The remainder of this paper is organized as follows: First, we discuss relevant related work. Next, we describe our proposed method for extracting facets from reviews using NLP. Then, we elaborate on how these facets can be used in blended recommending and introduce a demonstrator system we implemented in the hotel domain based on a real-world dataset¹. Afterwards, we present a user study we conducted to examine the value of facets extracted according to our method in comparison to facets based on typical features as defined by content providers in terms of user experience. Finally, we conclude the paper and discuss avenues for further research.

¹ We crawled metadata and overall 838,780 user reviews for 11,544 hotels located in five major European cities from Booking.com (http://www.booking.com/).
2 RELATED WORK
Today’s RS pro-actively suggest items that match individual preferences based on long-term user models. Producing well-fitting results can thereby reduce interaction effort and cognitive load [29]. However, the process of generating recommendations is often not controllable by users. Generally improving user experience, giving users more control, and increasing system transparency have therefore been identified as important goals [21, 29], which are still only partially addressed in many real-world systems. RS research has for a long time been focused on algorithmic issues as well [21, 29]. For instance, in order to improve accuracy, several attempts have been made to integrate CF algorithms with additional data, including user-generated content such as tags, and in a few cases also topics or opinions automatically extracted from user reviews [e.g. 1, 7, 18, 26, 37]. Only more recently has model-based CF been enhanced for other purposes. One example is TagMF [12], a method that allows users to select and weight tags in order to manipulate the set of recommendations generated as in conventional RS based on ratings.

In general, many interactive recommending approaches have been proposed to overcome the issues of automated RS [14, 17, 22]. The cold-start problem, which occurs when no historical data is available for new users, has, for instance, been addressed algorithmically [e.g. 38], by taking reviews into account [e.g. 7], and by eliciting user preferences in an interactive manner [e.g. 25]. One prominent type of interactive RS are critique-based variants that allow users to iteratively refine the results by critiquing features of currently recommended items [8]. While this usually requires the availability of well-defined product data, more recently several attempts have been made that instead rely on, for instance, latent factors automatically derived from user ratings [25] or user-generated tags [35]. MovieTuner [35], as an example, first determines the relevance of tags and presents users with the most important ones. Then, users can indicate their preferences by critiquing recommendations in terms of these well understandable dimensions—which could represent subjective aspects not adequately describable by the often more technical objective product attributes. While it seems promising to elicit preferences this way, interaction is still typically limited to one single kind of item features, i.e. predefined metadata, latent factors or tags. Besides, user-provided content has only been exploited to a limited extent. The richness of social media such as user reviews has, to our knowledge, not yet been used to integrate more interactivity into RS.

In hybrid RS, multiple algorithms, and often multiple datasources, are combined to generate results with higher precision. This has consequently led to the development of corresponding interactive approaches that give users control over the recommendation process in terms of several dimensions at once. TasteWeights [4], a hybrid music recommender, allows users to directly weight different information types and social datasources, thereby increasing perceived recommendation quality and comprehensibility. The datasources used include social media artifacts such as Wikipedia articles, Facebook profiles or Twitter tweets, but without processing the content data on a semantic level to e.g. infer inherent user preferences. SetFusion [27] also employs a standard hybridization strategy, but enables users to weight each algorithm individually. In addition, the system provides a number of interactive features. However, it requires a persistent user profile and does not offer interaction with respect to any content-related criteria. MyMovieMixer [23] lets users select and weight facet values based on different types of recommender algorithms and related background data. The system increases the perceived amount of control by successfully enabling users to manipulate the result set not only with respect to explicitly defined product features, but also latent factors as well as user-generated data such as tags. Although the underlying concept of blended recommending is easily extendable, it has as yet only been implemented using structured data that is directly given through the applied datasets. uRank [10] is one exception where keywords are extracted from background data specifically to promote interactivity. The system focuses on the exploration of document collections and supports users when their interests shift while browsing. The extraction is performed after some preprocessing steps by creating a vector space model using TF-IDF. Then, the keywords are presented to users as an interactive means to influence the recommendations by selecting and weighting them. To increase transparency, their occurrences in the documents are visualized by means of a stacked bar chart and they are also used to generate an overview of the collection. Other approaches that make extensive use of visualizations for the purpose of increasing system transparency comprise, among others, MoodPlay [2] or Conference Navigator [34]. Overall, all of these works attempt to give users a high degree of control over hybrid RS.

While this line of RS research to some extent converges with information filtering, there exists a wide range of manual filtering approaches outside the field of RS that can also be considered highly supportive for users finding the right items. Faceted filtering is one of the most prominent methods that supports exploration and discovery in large product spaces [15]. By selecting values from facets, the product space is iteratively constrained until the desired product is found. This principle also allows to e.g. facilitate keyword search and navigation in digital libraries or online shops [15]. Early attempts as well as contemporary real-world examples that can be found on many websites (e.g. accommodation booking platforms) usually rely on predefined features, support only Boolean filtering and conjunctive queries, and consider all selected facet values with equal importance [30, 32, 33, 36]. Only few exceptions employ fuzzy methods for value matching [13]. More recently, facets and facet values have also been automatically extracted, and adaptive techniques have been applied to faceted search based on e.g. semantic or social datasources [6, 16, 33]. For instance, RevMiner [16] extracts attribute-value pairs from restaurant reviews (e.g. “delicious pizza”), associates each value with a positive-negative score representing the sentiment, groups the attributes, and eventually presents them to the user in the form of facets and facet values. When applying filter criteria, the restaurants in the results are ranked according to sentiment, strength and frequency of the selected value. Moreover, users can receive recommendations for other places with similar attributes. In general, previous attempts have however been focused on supporting users to select appropriate filter criteria and to deal with a lack of metadata. Yet, the user’s influence on the current filter setting is still limited. VizBoard [36], in contrast, allows users to prioritize selected criteria. Other work has also investigated the user experience of faceted search as well as integrating visualizations. For example, in [32], a matrix visualization is used to display documents and their relevance with respect to the selected facets. While research in faceted filtering has thus brought numerous advances, the respective methods have not yet made extensive use of either recommender functionalities or social media.

Summarizing the state of the art, there exist various attempts that give users more control over RS, also in complex hybrid scenarios. So far, however, social media has been utilized only to a limited extent as a means to increase interactivity. Building on MyMovieMixer and extending the concept of blended recommending seems promising to go beyond integrating the datasources used so far (i.e. rating data, structured content information and explicit user-generated data such as tags), and to exploit the rich knowledge found in unstructured user-provided information such as reviews.
3 EXTRACTING MEANINGFUL FACETS FROM USER REVIEWS
In order to apply the concept of blended recommending based on a social datasource, we propose the procedure illustrated in Figure 1 to extract facets from user reviews.

[Figure 1 illustrates the extraction pipeline: a review sentence such as “The bed was comfortable, but the location was loud at night.” yields the pairs (comfortable, bed), (noisy, location) and (loud, location), which are then merged, classified and grouped under facets such as “Room”.]

Figure 1: First, we identify attribute-value pairs such as “comfortable bed” in the reviews. Then, these pairs are merged with others that have the same meaning, e.g. “comfy bed”. Next, using sentiment analysis, pairs are classified as positive or negative item properties. Finally, pairs that describe properties all related to e.g. hotel rooms are classified and grouped together to serve as values of a corresponding facet.

In the following, we elaborate on each of the steps involved, and also describe how we actually implemented them.

3.1 Identifying Attribute-Value Pairs
First, we split user reviews into sentences. Then, in each sentence, we need to identify nouns that describe properties of the respective item as well as adjectives which represent the opinion of the user who has written the review regarding these properties, e.g. “the bed was comfortable”.

With the help of a Part-of-Speech tagger, we determine the word form of each term in a sentence. Based on the results, we establish grammatical relations using a dependency parser. Especially the relations amod (adjectival modifier) and nsubj (nominal subject) are of interest as they describe relations between adjectives and nouns. Besides, we take other relations into account to analyze more complex sentence structures, e.g. negations and relative clauses. Eventually, this gives us a set of attribute-value pairs which we then reduce to those pairs that appear a minimum number of times across all reviews.

For actually implementing this, we decided to use the Stanford CoreNLP toolkit², a lightweight framework with all the NLP functionality required in this step.

² https://stanfordnlp.github.io/CoreNLP/
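The pair-extraction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes dependency triples are already available (the paper obtains them from Stanford CoreNLP), and the names `extract_pairs` and `min_count` are ours.

```python
from collections import Counter

# Hypothetical input: dependency triples (relation, head, dependent) as
# produced by a dependency parser. The paper keeps amod (adjectival
# modifier) and nsubj (nominal subject) relations, which connect nouns
# to the adjectives describing them.

def extract_pairs(parsed_sentences, min_count=2):
    """Collect (adjective, noun) pairs and keep those seen at least min_count times."""
    counts = Counter()
    for triples in parsed_sentences:
        for rel, head, dep in triples:
            if rel == "amod":        # e.g. amod(bed, comfortable)
                counts[(dep, head)] += 1
            elif rel == "nsubj":     # e.g. nsubj(comfortable, bed)
                counts[(head, dep)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

sentences = [
    [("nsubj", "comfortable", "bed")],   # "The bed was comfortable"
    [("amod", "bed", "comfortable")],    # "a comfortable bed"
    [("amod", "location", "noisy")],     # occurs only once -> filtered out
]
pairs = extract_pairs(sentences, min_count=2)
```

The frequency threshold mirrors the reduction step described above: rarely occurring pairs are discarded before the later merging and classification stages.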
3.2 Merging Values
After attribute-value pairs have been identified, there might be multiple values sharing the same meaning, e.g. “large room” and “big room”. Thus, we need to merge values that describe the same concept (i.e. synonyms) and replace them with a representative value to avoid confusion and redundancy.

For this purpose, we employ a lexical database providing links between synonyms. Since values may have different meanings depending on context (e.g. “big” and “heavy”), we do not directly use these links as criteria for merging, but instead take the proportion of the intersection of the sets of synonyms for each value into account. A value is then classified as similar if a specified threshold is met. Otherwise, it is assumed to be a new representative itself. To improve the quality even further, we manually define groups of values and related representatives for terms that carry very different meanings across contexts, e.g. “good”.

In the same step, we also identify pairs with opposite meaning, e.g. “comfortable bed” and “uncomfortable bed”, and associate them by looking up the respective terms (i.e. antonyms) in a lexical database as well. Although we assume negative pairs to be less meaningful as filter criteria that can be selected by the user, we need them for later calculating item relevances (see Section 4).

As lexical database, we use WordNet 2.1³.

³ https://wordnet.princeton.edu/
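The synonym-overlap test can be sketched roughly as below. This is our own minimal reading of the method, not the paper's code: the synonym sets are hard-coded toy data (in the actual system they would come from WordNet), and the threshold value is illustrative.

```python
# Toy synonym sets; in the real system these would be WordNet synonyms.
SYNONYMS = {
    "large": {"large", "big", "spacious"},
    "big":   {"big", "large", "heavy"},
    "heavy": {"heavy", "weighty", "massive"},
}

def similar(a, b, threshold=0.5):
    """Proportion of shared synonyms, measured against the smaller set."""
    sa, sb = SYNONYMS[a], SYNONYMS[b]
    overlap = len(sa & sb) / min(len(sa), len(sb))
    return overlap >= threshold

def merge(values, threshold=0.5):
    """Greedily assign each value to an existing representative,
    or make it a new representative itself."""
    representatives = {}
    for v in values:
        for rep in representatives:
            if similar(v, rep, threshold):
                representatives[rep].append(v)
                break
        else:
            representatives[v] = [v]
    return representatives

groups = merge(["large", "big", "heavy"], threshold=2/3)
```

With these toy sets, “big” is merged under “large” while “heavy” becomes its own representative, reflecting the point above that a direct synonym link (“big”/“heavy”) is not sufficient grounds for merging.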
3.3 Analyzing Sentiments
Next, in order to explicitly distinguish between positive and negative pairs for the recommendation process, we need to detect the sentiment of the pairs.

Adjectives often already carry a certain sentiment, e.g. “comfortable” can be considered positive. Thus, we decided to use an approach which determines sentiments for single words. Sentiment lexicons are databases where each term is assigned a sentiment value or class based on a certain algorithm or by human judgment. In contrast, in [16], a computational method specifically for reviews is proposed: By averaging helpfulness scores of the user reviews in which a pair occurs, the respective values are classified as positive or negative based on a specified threshold.

To obtain adequate results on the dataset we used¹, we compared SentiWordNet 3.0⁴, an algorithmically labeled sentiment lexicon, the Stanford CoreNLP toolkit, which uses the Sentiment Treebank, a manually labeled dataset, and the method from [16]. By choosing the “right” threshold value, the latter achieved perfect accuracy, i.e. every value was labeled correctly in our test. Among the other two approaches, the Stanford CoreNLP toolkit yielded better results (accuracy of .933) than SentiWordNet (.667). Consequently, in cases where a threshold leading to adequate results could not be set for the computational method, we use the Stanford CoreNLP toolkit for performing the sentiment analysis as well.

⁴ http://sentiwordnet.isti.cnr.it/
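The review-based method summarized above (averaging helpfulness scores of the reviews a pair occurs in, then thresholding) can be sketched as follows. The data and the threshold value are illustrative assumptions, not taken from the paper.

```python
def classify_sentiment(pair, reviews, threshold=0.6):
    """Label a pair positive/negative by the mean helpfulness score of
    the reviews it occurs in; reviews: list of (helpfulness, set_of_pairs)."""
    scores = [h for h, pairs in reviews if pair in pairs]
    if not scores:
        return None  # pair never observed; fall back to a sentiment lexicon
    return "positive" if sum(scores) / len(scores) >= threshold else "negative"

reviews = [
    (0.9, {("comfortable", "bed"), ("great", "location")}),
    (0.7, {("comfortable", "bed")}),
    (0.2, {("noisy", "location")}),
]
label = classify_sentiment(("comfortable", "bed"), reviews)  # mean score 0.8
```

The `None` branch reflects the fallback described in the text: where no adequate threshold-based decision is possible, a lexicon-based classifier (Stanford CoreNLP in the paper) takes over.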
3.4 Classifying Pairs
Finally, to support users in finding facet values that fit their goal and to reduce cognitive load, we aim at assigning the pairs to predefined categories based on their attributes.

For this purpose, we rely on the categories presented in [11], which resulted from collecting and grouping relevant hotel properties. Accordingly, we distinguish between pairs that describe qualities of the “Hotel”, the “Room”, or the “Service”. As classification method, we employ a semantic similarity metric utilizing the graph structure of a lexical database. There exist several interchangeable metrics that rely on different aspects of the underlying database. In any case, to assign an attribute to one of the categories, e.g. “bed” to “Room”, a term to calculate the similarity with is required. Comparing with the class name itself would make it difficult to distinctly assign attributes. Thus, we employ typical terms from [11] and from the taxonomies of popular booking websites for each category (e.g. the set for “Room” contains “bathroom”, “bed” and “air conditioning”), and determine their average similarity with the respective attribute.

For implementing the classification, we use the WS4J⁵ API, which offers a range of similarity metrics based on WordNet. When examining the results obtained on the dataset we used¹ against a set of manually labeled pairs, HirstStOnge yielded the highest accuracy (.737).

⁵ https://github.com/Sciss/ws4j/

4 BLENDED RECOMMENDING WITH EXTRACTED FACETS
In order to use the previously extracted attribute-value pairs for blended recommending, the corresponding facet values (e.g. “comfortable bed”) have to be taken into account for calculating the relevance of the items when selected by the user. In the following, we describe how individual relevances are determined for each facet type, and how these relevance values eventually lead to an aggregated score for each hotel.

Standard Facets. As in [23], we use Boolean filtering for nominal facets (“Location”) and fuzzy filtering for numerical facets (“Price”, “Stars”, “Score”). In the case of Boolean filtering, items matching a selected value are additionally ranked using a criterion that establishes an ordering, e.g. score. In the case of fuzzy filtering, the distance between a selected value and the respective item property determines the relevance (e.g. if the user selects the facet value “score = 8.0”, items with a score of 7.5 are considered more relevant than items with 7.0).

Extracted Facets. For the extracted facets, i.e. those related to “Hotel”, “Room” and “Service”, we in principle follow the way keywords are treated in [23], i.e. the relevance calculation is based on the TF-IDF heuristic. Pairs are therefore considered as terms, and the sets of pairs associated with the hotels as documents. Table 1 shows an example where a user is looking for a hotel with a “comfortable bed”.

Table 1: Results of different TF-IDF variants for an example where a user is looking for a hotel with a “comfortable bed”.

                        Hotel A   Hotel B   Hotel C
  number of pairs             5        10         2
  “comfortable bed”           3         2         1
  “uncomfortable bed”         2         0         0
  tfidf_baseline            .90       .60       .30
  tfidf_norm                .13       .04       .11
  tfidf_pair                .06       .06       .15

The baseline heuristic tfidf_baseline results in hotel B being more relevant than C. Although hotel B is associated with the desired criterion more often than C, more pairs are associated with B in total, i.e. “comfortable bed” cannot be assumed to be a very distinctive characteristic of this hotel. Consequently, we additionally normalize the frequency using the overall number of associated pairs. According to the modified heuristic tfidf_norm, hotel A is still more relevant than C. However, even though “comfortable bed” is relatively associated more often with hotel A than with C, hotel A is also associated with the opposite pair. Hence, we consider the TF-IDF value not only for the positive, but also for the negative pair:

    tfidf_pair = tfidf_positive − tfidf_negative    (1)

Item Relevance. Finally, the individual relevance scores rel_i⁶ for each facet value f_i are aggregated using the corresponding weights w_i by means of the arithmetic mean as in [23]. The overall relevance rel of a hotel h is thus calculated as follows:

    rel(h, f_1, …, f_n, w_1, …, w_n) = ( Σ_{i=1}^{n} w_i · rel_i(h, f_i) ) / ( Σ_{i=1}^{n} w_i )    (2)

⁶ Scores are determined as described above. Note that this might not be possible when a criterion is not referred to in the reviews of a hotel.
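The relevance computation can be sketched in a few lines. This is a toy illustration under assumptions: the paper does not fully specify its TF-IDF variant, so a plain normalized-tf × idf is used here, and all function names are ours; equations (1) and (2) are implemented directly.

```python
import math

def tfidf(pair, hotel_pairs, all_hotels):
    """Normalized tf * idf; hotel_pairs maps pair -> count for one hotel,
    all_hotels is the list of such dicts (hotels act as documents)."""
    total = sum(hotel_pairs.values())
    tf = hotel_pairs.get(pair, 0) / total if total else 0.0
    df = sum(1 for h in all_hotels if pair in h)
    idf = math.log(len(all_hotels) / df) if df else 0.0
    return tf * idf

def pair_relevance(pos_pair, neg_pair, hotel, hotels):
    # Equation (1): positive minus negative TF-IDF.
    return tfidf(pos_pair, hotel, hotels) - tfidf(neg_pair, hotel, hotels)

def overall_relevance(relevances, weights):
    # Equation (2): weighted arithmetic mean of the facet-value relevances.
    return sum(w * r for w, r in zip(weights, relevances)) / sum(weights)
```

Penalizing a hotel for its negative counterpart pair, as in `pair_relevance`, is what separates hotels A and C in the Table 1 example despite A's higher raw frequency of “comfortable bed”.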
5 DEMONSTRATOR SYSTEM
To demonstrate how blended recommending can be implemented based on facets extracted from user reviews, we developed a web application in the hotel domain using the dataset we crawled from Booking.com¹. The demonstrator system, which generally follows the design of MyMovieMixer [23], is shown in Figure 2.

On the left side, a list comprising all facets users can choose from is presented. Clicking on a facet expands it and shows the corresponding facet values in the form of tiles (A). Initially, the values of each facet are hidden (B) to reduce cognitive load. For the facets “Price”, “Stars” and “Score”, some tiles represent predefined values (e.g. “30 − 45 Euro”). In addition, users can create tiles themselves by manually specifying preferred ranges (e.g. “25 − 45 Euro”). The “Location” facet presents users with predefined tiles for each location in the dataset. For the extracted facets, i.e. those related to “Hotel”, “Room” and “Service”, we initially show those attribute-value pairs that occur most often in the reviews, assuming they are generally more important for users. Thereby, we consider only positive values because it is unlikely that users want to receive recommendations related to negative properties. Yet, users can request more tiles, i.e. the next most frequent pairs, by clicking the respective button (C). Moreover, to look for specific values which might be useful to pursue a particular search goal, users can also perform a text-based search.

As soon as users drag a tile into the preference area in the middle of the screen, the corresponding facet value is considered for generating recommendations, i.e. its individual relevance rel_i is now used in (2) when calculating the overall item relevances. In this area, each tile is accompanied by a slider that allows users to weight the respective criterion, i.e. to modify w_i (D). In case users are no longer interested in applying a specific criterion, tiles can be removed from the preference area (E). Adding or removing tiles as well as changing their weight immediately updates the results.

The resulting recommendations are shown on the right side (F). We deliberately limit their number to reduce choice difficulty and to motivate users to manipulate the results by selecting criteria and weighting them, this way being able to explore the effects of their preference settings in an interactive manner. However, if users are not satisfied with particular recommendations, the respective hotels can be removed from the list so that the next most relevant item appears. Each recommendation is displayed with a photo and the top-3 attribute-value pairs⁷ occurring in the related reviews (the number of occurrences is shown in brackets). By clicking on a recommendation, a list of all associated pairs as well as further metadata is shown in a dialog. Since negative opinions can also have an impact on the decision-making process [9], users are here presented with both pairs having positive (green) and negative (red) sentiments.

When users are satisfied with a recommendation, the respective hotel can be dragged into the basket in the top-right corner (G). This area serves to store items that users would like to take into consideration for their final decision. To enable further refinement, the system also suggests new tiles as soon as an item is put into the basket: The last values from each of the extracted facets on the left side are replaced by the top-3 pairs of the respective hotel. Moreover, the basket is used for evaluation purposes.

⁷ Note that these are pairs after performing all steps described in Section 3, i.e. they do not necessarily appear in exactly the same way in all underlying reviews (some reviews may refer to e.g. an “excellent location”, which however would increase the count for “great location”).

Figure 2: Screenshot of our demonstrator system: Facet values are shown as tiles on the left side (A). Users can expand and collapse each facet (B). For some facets, users can search and ask for more tiles (C). As soon as users drag tiles into the preference area in the middle, an accompanying slider allows them to weight the corresponding value (D). If users do not want to consider a criterion anymore, they can remove it (E). Recommended hotels are shown on the right side with images and the attribute-value pairs most frequently occurring in the related reviews (F). Users can put items they like from the results into a basket (G).

6 EVALUATION
To examine the benefits that extracting facets from user reviews has in terms of user experience, in comparison to using explicitly defined features as they are typically found on booking websites, we performed a user study. In this study, we compared the demonstrator system described in the previous section⁸ with an almost identical variant where, instead of facets extracted according to our method, we used facets based on well-defined provided features.

⁸ For the user study, we used an earlier version of the demonstrator system which was slightly different, in particular, with respect to the level of detail for presented recommendations (among others, frequent attribute-value pairs were not visible right away).

6.1 Method
Participants and Materials. We recruited 30 participants (24 female) with an average age of M = 21.63 (SD = 3.32). Most of the participants were students; only two of them were employed. Participants were asked to use the system under controlled conditions in a lab-based setting. During the course of the study, they used a desktop PC with a 24′′ LCD (1920×1200 px resolution) and a standard web browser to fill in a questionnaire and to perform several tasks.

Procedure. Participants were assigned (in counter-balanced order) in a between-subject design⁹ to one of the two following conditions (15 per condition), which varied regarding the source used for the “Hotel”, “Room”, and “Service” facets:

Feature-Based Facets (FF): Demonstrator system with values for the “Hotel” (89 values), “Room” (124), and “Service” (109) facets based on predefined features¹⁰.

Extracted Facets (EF): Demonstrator system with values for the “Hotel” (562 values), “Room” (266), and “Service” (1038) facets based on attribute-value pairs extracted from a dataset of reviews¹ as described in Section 3.

⁹ We decided against a within-subject design to avoid carry-over effects and to reduce participants’ workload.
¹⁰ Hotel features were crawled from Booking.com and manually assigned to the three facets. For example, we associated the feature “non-smoking room”, which according to their taxonomy is explicitly given to hotels on the Booking.com website, with the “Room” facet.

The user study comprised four tasks which were presented (in random order) as scenarios described as follows:

Task 1: “You want to do a weekend trip to London or Berlin with a friend of yours. You want to spend a maximum of 40 Euros per night. The accommodation should have reliable Wifi and a bar.”

Task 2: “You are going on vacation to Brussels with your parents. The accommodation should have 3 stars and should cost about 80 Euros. Furthermore, your parents want a nice view and a good breakfast.”

Task 3: “You want to surprise your partner with a short getaway to Rome. You have some money, so you can spend 60 Euros per night. Since you would like to have some private time, the place should be quiet and you want to have your own (large and clean) bathroom.”

Task 4: “You get a one-week holiday as a gift. Money plays no role and you are able to freely choose the location.”
The accommodation should have ceived usefulness and overall satisfaction. Furthermore, we reliable Wifi and a bar.” formulated questionnaire items ourselves to particularly ad- Task 2 “You are going on vacation to Brussels with your dress helpfulness and understandability of the facets, their parents. The accommodation should have 3 stars and suitability for expressing preferences, and participants’ sat- should cost about 80 Euros. Furthermore, your parents isfaction with them. Finally, at the end of each session, we want a nice view and a good breakfast.” asked the same questions again, now regarding participants’ Task 3 “You want to surprise your partner with a short get- general impression independent of specific tasks. In addition, away to Rome. You have some money, so you can spend we applied the System Usability Scale (SUS) [5]. All items 60 Euros per night. Since you would like to have some were assessed on a positive 5-point Likert scale. private time, the place should be quiet and you want to have your own (large and clean) bathroom.” 6.2 Results Domain Knowledge and Usability. Overall, participants 10 Hotel features were crawled from Booking.com and manually assigned reported average domain knowledge with no significant differ- to the three facets. For example, we associated the feature “non- smoking room”, that relying on their taxonomy is explicitly given to ence (𝑡(28) = .54, 𝑝 = .595) between conditions (FF: M = 2.87, hotels at the Booking.com website, with the “Room” facet. SD = 1.13; EF: M = 2.67, SD = 0.90). Regarding usability, both Enhancing an Interactive Rec. System with Review-based Info. Filtering IntRS ’17, August 27, 2017, Como, Italy variants of the demonstrator system received “good” scores Interaction Behavior. Concerning actual user behavior, no on the SUS, with 76 in the FF, and 83 in the EF condition. 
significant differences were found for the number of facet values being selected (FF: M = 3.77, SD = 1.53; EF: M = 3.57, SD = 1.55; t(28) = .36, p = .72), the number of times more facet values were requested (FF: M = 24.93, SD = 39.53; EF: M = 4.67, SD = 13.06; t(17) = 1.87, p = .08), and the number of recommendations removed from the results (FF: M = 5.67, SD = 12.84; EF: M = 5.87, SD = 12.71; t(28) = -.04, p = .97).

User Experience. Table 2 shows the results regarding participants' general impression. We conducted t-tests (α = .05) to assess differences between conditions. EF was rated superior to FF with respect to all constructs. As highlighted, there were significant differences (with medium to large effect sizes) in terms of perceived variety of the recommendation set, choice difficulty, and intention to provide feedback.

Footnote 11: We examined effects of condition and task using two-way RM ANOVA. Interaction terms were only significant for two variables (variety, effort), each showing differences in only one pairwise comparison. Since the results obtained after each task were overall tendentially similar, we omit reporting them separately. Instead, we present the scores from the final assessment, where participants were asked about their general impression after completing all tasks.

Table 2: t-test (df = 28) results with means and SDs for the overall comparison of the two conditions (* indicates significance at the 5 % level; d represents Cohen's effect size value).

                       Feat.-Based Facets   Extracted Facets
                       M      SD            M      SD          T      p      d
Perc. Rec. Quality     3.97   0.83          4.03   0.79       -0.23   .824   .07
Perc. Set Variety      3.60   0.51          4.13   0.83       -2.12   .043*  .77
Choice Satisfaction    4.27   0.70          4.40   0.63       -0.55   .590   .20
Choice Difficulty      2.73   1.34          3.67   0.98       -2.19   .037*  .80
Perc. Effectiveness    3.60   1.18          4.07   1.10       -1.12   .273   .41
Usage Effort           3.70   0.94          4.07   0.86       -1.11   .276   .41
Feedback Intention     3.20   0.78          3.87   0.83       -2.27   .031*  .83
Usefulness             3.69   0.73          4.18   0.85       -1.69   .103   .62
Overall Satisfaction   3.73   1.10          4.07   0.96       -0.88   .384   .33

Moreover, we found significant gender differences (t(28) = -2.48, p = .019) regarding recommendation quality, with men giving higher ratings (M = 4.67, SD = 0.41) than women (M = 3.83, SD = 0.79), with a large effect size (d = 1.34). Women (M = 367.00 sec, SD = 138.91) also took on average significantly longer (t(28) = 2.08, p = .047) to accomplish the tasks than men (M = 245.17 sec, SD = 63.77), with a large effect size (d = 1.13).

Facets. After each task, we assessed participants' opinions specifically on the facets. We conducted two-way RM ANOVAs to examine the effects of condition and task. Interaction terms were not significant, and we found only few small differences between tasks. Table 3 shows that participants perceived the suitability of the facets for expressing their preferences as significantly higher in the EF condition than in the other, independent of the task. With respect to all other variables, EF was also assessed superior to FF, but without significant differences.

Table 3: ANOVA (df1 = 1, df2 = 28) results with means and SEs for the comparison of the conditions across tasks (* indicates significance at the 5 % level; we aggregated the scores assessed individually for the "Hotel", "Room", and "Service" facets).

                             Feat.-Based Facets   Extracted Facets
                             M      SE            M      SE          F       p
Suit. for Exp. Preferences   3.38   0.179         3.93   0.179       4.604   .041*
Helpfulness                  3.82   0.185         4.12   0.185       1.322   .260
Understandability            4.26   0.176         4.27   0.176       0.002   .965
Satisfaction                 3.96   0.161         4.16   0.161       0.776   .386

6.3 Discussion

In conclusion, the user study shows that the concept of blended recommending can be successfully applied relying on social data sources. When compared to the system variant with facets based on features from a well-established taxonomy (FF), the variant with facets extracted from user reviews (EF) obtained overall superior results after the individual tasks as well as at the end, after participants had finished all tasks. Regarding participants' general impression, significant differences were identified for several relevant variables. Concretely, set diversity was perceived to be higher, and it was easier for participants to settle on one of the recommended hotels (which is in line with earlier research [3]). Furthermore, participants' feedback intention was higher, i.e., they preferred to provide feedback in the EF condition. This is corroborated by their answers to the questionnaire items that specifically addressed the perception of the facets. Apparently, in all tasks, participants valued the possibility to express their preferences with respect to the more subjective dimensions represented through the facets extracted from reviews. At the same time, it can be considered promising that we did not find a negative effect in terms of facet understandability. Rather, our proposed method seems able to extract facets from a real-world review dataset in a meaningful way.

By showing an effect of gender on perceived quality, our study partly validates earlier findings that such factors influence how important reviews are considered as an information source by individual users [19]. Thus, it is a subject of future studies to explore more deeply the impact of personality variables on the use of different information sources. In this regard, it is important to note that there might have been confounding factors. In particular, with the present experimental design, participants did not know the source of the facet values, i.e., that they were extracted from user reviews. However, as reviews, particularly in the hotel domain, influence the perception of trust [31], knowing where the information comes from might positively affect perceived usefulness and possibly also trustworthiness, in particular if it were possible to trace back from the mined attribute-value pairs to the underlying reviews. Besides, we were not able to establish all relations between pairs and hotels due to time restrictions, which potentially contributed negatively to the assessment in the EF condition. The lack of differences in actual user behavior may, in contrast, be attributed to the almost identical interfaces of the two demonstrator variants. In summary, however, the assessment yielded promising results with respect to all variables. Since the interaction terms of condition and task were significant only in two cases, we deduce that this applies independent of the task and its complexity. Nevertheless, further investigating task-related differences is a subject of future work. Overall, participants seemed more satisfied in the EF condition, which is reflected accordingly in the effect sizes.

7 CONCLUSIONS AND OUTLOOK

In this paper, we have presented an extension to the concept of blended recommending. Relying on social media, in particular reviews for hotels, we allow users to specify their preferences with respect to meaningful criteria that usually are not available as filter options, but are especially useful for adequately choosing from sets of recommended "experience products". In line with that, the user study we conducted has shown significant improvements with respect to diversity and choice difficulty, but also promising results in terms of other relevant variables. Thus, it seems that user reviews can be successfully exploited for interactive recommending, and that the information they contain is of value for users even without actually reading the reviews. Besides, the study demonstrates that the concept can be applied, also in the absence of structured content data, in domains other than movies.

In future work, we aim at improving and possibly using different NLP methods to extract the facets. Moreover, reviews could be further exploited to improve the presentation of recommendations. For instance, by identifying users with shared interests, results could be accompanied by summaries of their opinions in order to make suggestions easier to understand and to increase the system's trustworthiness. Finally, all of this would go hand in hand with conducting more user studies, e.g. to evaluate the specific improvements.

REFERENCES
[1] A. Almahairi, K. Kastner, K. Cho, and A. Courville. 2015. Learning distributed representations from reviews for collaborative filtering. In RecSys '15. ACM, 147–154.
[2] I. Andjelkovic, D. Parra, and J. O'Donovan. 2016. Moodplay: Interactive mood-based music discovery and recommendation. In UMAP '16. ACM, 275–279.
[3] D. Bollen, B. P. Knijnenburg, M. C. Willemsen, and M. P. Graus. 2010. Understanding choice overload in recommender systems. In RecSys '10. ACM, 63–70.
[4] S. Bostandjiev, J. O'Donovan, and T. Höllerer. 2012. TasteWeights: A visual interactive hybrid recommender system. In RecSys '12. ACM, 35–42.
[5] J. Brooke. 1996. SUS – A quick and dirty usability scale. In Usability Evaluation in Industry. Taylor & Francis, 189–194.
[6] I. Celik, F. Abel, and P. Siehndel. 2011. Towards a framework for adaptive faceted search on twitter. In DAH '11. 11–22.
[7] L. Chen, G. Chen, and F. Wang. 2015. Recommender systems based on user reviews: The state of the art. UMUAI 25, 2 (2015), 99–154.
[8] L. Chen and P. Pu. 2012. Critiquing-based recommenders: Survey and emerging trends. UMUAI 22, 1-2 (2012), 125–150.
[9] J. A. Chevalier and D. Mayzlin. 2006. The effect of word of mouth on sales: Online book reviews. J. Marketing Res. 43, 3 (2006), 345–354.
[10] C. di Sciascio, V. Sabol, and E. E. Veas. 2016. Rank as you go: User-driven exploration of search results. In IUI '16. ACM, 118–129.
[11] S. Dolnicar and T. Otter. 2003. Which hotel attributes matter? A review of previous and a framework for future research. Technical Report. University of Wollongong.
[12] T. Donkers, B. Loepp, and J. Ziegler. 2016. Tag-enhanced collaborative filtering for increasing transparency and interactive control. In UMAP '16. ACM, 169–173.
[13] A. Girgensohn, F. Shipman, F. Chen, and L. Wilcox. 2010. DocuBrowse: Faceted searching, browsing, and recommendations in an enterprise context. In IUI '10. ACM, 189–198.
[14] C. He, D. Parra, and K. Verbert. 2016. Interactive recommender systems: A survey of the state of the art and future research challenges and opportunities. Expert Syst. Appl. 56, 1 (2016), 9–27.
[15] M. A. Hearst. 2009. Search user interfaces. Cambridge University Press.
[16] J. Huang, O. Etzioni, L. Zettlemoyer, K. Clark, and C. Lee. 2012. RevMiner: An extractive interface for navigating reviews on a smartphone. In UIST '12. ACM, 3–12.
[17] M. Jugovac and D. Jannach. 2017. Interacting with recommenders - Overview and research directions. ACM TiiS (2017).
[18] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. 2010. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In RecSys '10. ACM, 79–86.
[19] E. E. K. Kim, A. S. Mattila, and S. Baloglu. 2011. Effects of gender and expertise on consumers' motivation to read online hotel reviews. Cornell Hosp. Q. 52, 4 (2011), 399–406.
[20] B. P. Knijnenburg, M. C. Willemsen, and A. Kobsa. 2011. A pragmatic procedure to support the user-centric evaluation of recommender systems. In RecSys '11. ACM, 321–324.
[21] J. A. Konstan and J. Riedl. 2012. Recommender systems: From algorithms to user experience. UMUAI 22, 1-2 (2012), 101–123.
[22] B. Loepp, C.-M. Barbu, and J. Ziegler. 2016. Interactive recommending: Framework, state of research and future challenges. In EnCHIReS '16. 3–13.
[23] B. Loepp, K. Herrmanny, and J. Ziegler. 2015. Blended recommending: Integrating interactive information filtering and algorithmic recommender techniques. In CHI '15. ACM, 975–984.
[24] B. Loepp, K. Herrmanny, and J. Ziegler. 2015. Merging interactive information filtering and recommender algorithms – Model and concept demonstrator. i-com 14, 1 (2015), 5–17.
[25] B. Loepp, T. Hussein, and J. Ziegler. 2014. Choice-based preference elicitation for collaborative filtering recommender systems. In CHI '14. ACM, 3085–3094.
[26] J. McAuley and J. Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys '13. ACM, 165–172.
[27] D. Parra, P. Brusilovsky, and C. Trattner. 2014. See what you want to see: Visual user-driven approach for hybrid recommendation. In IUI '14. ACM, 235–240.
[28] P. Pu, L. Chen, and R. Hu. 2011. A user-centric evaluation framework for recommender systems. In RecSys '11. ACM, 157–164.
[29] P. Pu, L. Chen, and R. Hu. 2012. Evaluating recommender systems from the user's perspective: Survey of the state of the art. UMUAI 22, 4-5 (2012), 317–355.
[30] G. M. Sacco. 2006. Dynamic taxonomies and guided searches. J. Am. Soc. Inf. Sci. Tec. 57, 6 (2006), 792–796.
[31] B. A. Sparks and V. Browning. 2011. The impact of online reviews on hotel booking intentions and perception of trust. Tourism Manage. 32, 6 (2011), 1310–1323.
[32] V. T. Thai, P.-Y. Rouille, and S. Handschuh. 2012. Visual abstraction and ordering in faceted browsing of text collections. ACM TIST 3, 2 (2012), 21:1–21:24.
[33] M. Tvarožek, M. Barla, G. Frivolt, M. Tomša, and M. Bieliková. 2008. Improving semantic search via integrated personalized faceted and visual graph navigation. In SOFSEM '08. Springer, 778–789.
[34] K. Verbert, D. Parra, and P. Brusilovsky. 2016. Agents vs. users: Visual recommendation of research talks with multiple dimensions of relevance. ACM TiiS 6, 2 (2016), 11:1–11:42.
[35] J. Vig, S. Sen, and J. Riedl. 2011. Navigating the tag genome. In IUI '11. ACM, 93–102.
[36] M. Voigt, A. Werstler, J. Polowinski, and K. Meißner. 2012. Weighted faceted browsing for characteristics-based visualization selection through end users. In EICS '12. ACM, 151–156.
[37] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In SIGIR '14. ACM, 83–92.
[38] K. Zhou, S.-H. Yang, and H. Zha. 2011. Functional matrix factorizations for cold-start recommendation. In SIGIR '11. ACM, 315–324.
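Appendix note (added by the editor, not part of the paper): the group comparisons in Section 6.2 report t(28) for n = 15 participants per condition, which is consistent with a standard pooled-variance independent-samples t-test, together with Cohen's d based on the pooled SD. The minimal Python sketch below (the helper name `t_and_d` is ours) reproduces the "Perc. Set Variety" row of Table 2 from its summary statistics; the small deviation in t (-2.11 here vs. the reported -2.12) stems from the means and SDs being rounded to two decimals in the table.

```python
import math

def t_and_d(m1, sd1, m2, sd2, n1=15, n2=15):
    """Pooled-variance independent-samples t statistic and Cohen's d
    computed from summary statistics (df = n1 + n2 - 2)."""
    # Pooled standard deviation across the two groups
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    d = abs(m1 - m2) / sp
    return t, d

# "Perc. Set Variety": FF M = 3.60, SD = 0.51 vs. EF M = 4.13, SD = 0.83
t, d = t_and_d(3.60, 0.51, 4.13, 0.83)
print(f"t = {t:.2f}, d = {d:.2f}")  # t = -2.11 (reported: -2.12), d = 0.77
```

The same helper reproduces the other rows of Table 2 to within rounding, e.g. "Feedback Intention" (3.20/0.78 vs. 3.87/0.83) yields d ≈ .83 as reported.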