    Interactive Preference Elicitation for Scientific and Cultural Recommendations

                                  Eduardo Veas1,2 and Cecilia di Sciascio2
                1 Information and Communications Technologies, National University of Cuyo
                          2 Knowledge Visualization, Know-Center GmbH
                              eduveas@gmail.com, cdissciascio@know-center.at


                          Abstract

     This paper presents a visual interface developed on the basis of control and transparency to elicit preferences in the scientific and cultural domain. Preference elicitation is a recognized challenge in user modeling for personalized recommender systems. The amount of feedback the user is willing to provide depends on how trustworthy the system seems to be and how invasive the elicitation process is. Our approach ranks a collection of items with a controllable text analytics model. It integrates control with the ranking and uses it as implicit preference for content-based recommendations.

1   Introduction

A recommender system (RS) depends on a model of a user to be accurate. To build a model of the user, behavioral recommenders collect preferences from browsing and purchasing history, whereas rating recommenders require a user to rate a set of items to state their preferences (implicit and explicit methods, respectively) [Pu et al., 2011]. Preference elicitation is fundamental for the whole operational lifecycle of an RS: it affects the recommendations for a new user and also those of the whole system community, given what the RS learns from each new user [Cremonesi et al., 2012]. Whichever method is chosen, preference elicitation represents an added effort, which may be willingly avoided to the detriment of user satisfaction. The amount of feedback the user is willing to provide is a tradeoff between system aspects and personal characteristics, for example privacy vs. recommendation quality [Knijnenburg et al., 2012].

   In their seminal work, Swearingen et al. pointed out one challenge: the recommender has to convince the user to try the recommended items [Swearingen and Sinha, 2001]. To do so, the recommendation algorithm has to propose items effectively, but the interfaces must also deliver recommendations in a way that can be compared and explained [Ricci et al., 2011]. The willingness to provide feedback is directly related to the overall perception and satisfaction the user has of the RS [Knijnenburg et al., 2012]. Explanation interfaces increase confidence in the system (trust) by explaining how the system works (transparency) [Tintarev and Masthoff, 2012] and by allowing users to tell the system when it is wrong (scrutability) [Kay, 2006]. Hence, to warrant increased user involvement, the RS has to justify recommendations and let the user customize their generation. Transparency and controllability are key facilities of a self-explanatory RS that promote trust and satisfaction [Tintarev and Masthoff, 2012].

   Our work is set in the scientific and cultural domain. In this frame, users are most often engaged in exploration and production tasks that involve gathering and organizing large collections in preparatory steps (e.g., for writing, or preparing a lecture or presentation). A federated system (FS) compiles scientific documents or electronic cultural content (images) upon an explicit or implicit query, with little control over the way results are generated. Content takes the form of text document surrogates comprising title and abstract. They also include minimal additional metadata, like creator, URL, provider and year of publication.

   This paper introduces a visual tool to support exploration of scientific and cultural collections. The approach includes a metaphor to represent a set of documents, with which the user interacts to understand and define themes of interest. The contribution of this work is the interactive personalization feature that, instead of presenting a static ranked list, allows users to dynamically re-sort the document set in the visual representation and re-calculate relevance scores with regard to their own interests. The visual interface employs controllable methods and represents their results in a transparent manner which, rather than adding effort, reduces the complexity of the overall task.

2   The Approach

The proposed approach was designed to quickly reorganize a large collection in terms of its relevance to a set of keywords expressing the choice of topic. In a nutshell, the goal is to interactively discover the topics in a collection, building up knowledge in the user. But instead of trying to infer a hidden topic structure fully automatically (as in [Blei, 2012]), we propose an interactive approach, which works as a conversation between the user and the RS to build a personalized theme structure. Controllability and transparency are crucial for the user to understand how a topic came about from their personal exploration. The challenge for the interface is to clearly explain the recommendation process, and for the analytics method to reduce the computational problem to interactive terms.
Figure 1: (Left) TagBox. A summary of the collection contents as a bag of words. (Right) The RankView is updated as two
terms have been selected. As the user points at a third keyword (employment), a hint shows which documents would be affected
by picking it (3 highlighted documents).


2.1   Visual Interface

To search and explore documents based on the themes that run through them, we build an interface that allows the user to establish a conversation with the RS. The interface comprises two main parts: the topic summary and the recommendation pane. The topic summary is built from keywords extracted from the whole collection. Keywords are presented in a Tag Box, organized and encoded in terms of their frequency of occurrence in the collection (tf-idf); see Fig. 1. The recommendation list initially shows the unranked collection.

   As the user interacts with the contents, choosing words to express her information needs, the recommendation list is ranked on-the-fly (see Fig. 1). The RankView shows the contribution each keyword makes to the overall score of a document. With a slider, the user can assign a weight to a keyword and modify its contribution to the score. Furthermore, the TagBox and RankView illustrate the possible effect of user actions in a quick overview: hovering over a keyword in the TagBox shows a micro-chart with the proportion of documents affected, and the RankView highlights those documents in view that would be affected by choosing the keyword.

   It is important to note that the user is aware and in control of the ranking and organization of the document collection at all times. With the visual interface, the user describes her information needs and chooses documents from the collection that best reflect those needs. Chosen items can be assigned to a collection. The act of choosing an item is considered an expression of preference. With the collection, the system stores the keywords and score of each document. Although this feedback is not yet incorporated in our ranking approach, we analyze its effects with a user study and outline future directions to integrate this additional information in the system.

2.2   Text Analytics and Ranking

Keyword extraction plays two roles: it summarizes the topics in the collection, and it provides the basis for the fast ranking of documents. Preprocessing involves part-of-speech tagging, singularizing plural nouns, and stemming with a Porter stemmer. The resulting terms form a document vector, which also constitutes its index. Subsequently, individual terms are scored with TF-IDF (term frequency - inverse document frequency), which captures how important a term is to a document in a collection, relating its frequency within the document to the logarithm of the number of times it is repeated across the collection of documents. The more frequent a term is in a document and the fewer times it appears in the corpus, the higher its score will be. TF-IDF-scored terms are added to the metadata of each document. To provide an overview of the contents of the collection, keywords from all documents are collected into a global set of keywords. Global keywords are sorted by accumulated document frequency (DF), calculated as the number of documents in which a keyword appears, regardless of the frequency within the documents.

   Quick exploration of content depends on quickly re-sorting the documents according to the information needs of the user, expressed with a query built from a subset of the global keyword collection. We assume that some keywords are more important to the topic model than others and allow the user to assign weights to them.
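   For illustration, the extraction step could be sketched as follows in Python. This is a simplified sketch assuming plain whitespace tokenization and the standard tf x log(N/df) weighting, which the paper does not prescribe; the actual pipeline additionally applies part-of-speech tagging, singularization and Porter stemming as described above:

```python
# Minimal sketch of per-document TF-IDF scoring and the global keyword
# overview. Tokenization is simplified; a real pipeline would add POS
# tagging, singularization and Porter stemming (e.g., via NLTK).
import math
from collections import Counter

def tfidf_scores(docs):
    """Return per-document {term: tf-idf} dicts and global document frequencies."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # in how many documents each term occurs
    for tokens in tokenized:
        df.update(set(tokens))
    scored = []
    for tokens in tokenized:
        tf = Counter(tokens)            # term frequency within this document
        scored.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scored, df

def global_keywords(df, top=50):
    """Global keyword overview, sorted by accumulated document frequency (DF)."""
    return [term for term, _ in df.most_common(top)]
```

   The per-document dictionaries correspond to the tf-idf metadata attached to each document, while the document frequencies drive the ordering of the global keyword overview in the Tag Box.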
   The documents in the set are then ranked and sorted as follows. Given a set of documents $D = d_1, \ldots, d_n$, a set of keywords $K = k_1, \ldots, k_m$, and a set of selected keywords $T = t_1, \ldots, t_p$, $T \subseteq K$, the overall score for document $d_i$ is calculated as the sum of the weighted scores of its keywords matching selected keywords:

$$s_{d_i} = \sum_{j=1}^{p} w_{t_j} \times m_{d_i t_j},$$

where $w_{t_j}$ is the weight assigned by the user to the selected keyword $t_j$, such that $\forall j : 0 \leq w_{t_j} \leq 1$, and $m_{d_i t_j}$ is the tf-idf score for keyword $t_j$ in document $d_i$. $D$ is next sorted by overall score using the quicksort algorithm. Documents in $D$ are now elements of a sequence $Q$ with order determined by:

$$Q = (q_i)_{i=1}^{n}, \quad q_i, q_{i+1} \in D \wedge s_{q_i} \geq s_{q_{i+1}}.$$

   Finally, the ranking position is calculated in such a way that items with an equivalent overall score share the same position. The position for a sorted document $q_i$ is calculated as

$$r_{q_i} = \begin{cases} 1 & \text{if } i = 0 \\ r_{q_{i-1}} & \text{if } s_{q_i} = s_{q_{i-1}} \\ r_{q_{i-1}} + |C| & \text{if } s_{q_i} < s_{q_{i-1}} \end{cases}$$

where $C = \{ q_j \mid s_{q_j} = s_{q_{j-1}}, \; 0 \leq j \leq i \}$ represents the set of all items with the immediately superior overall score to $q_i$.
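   The scoring and position rules above translate directly into code. The following minimal Python sketch is not the authors' implementation; the data structures (doc_ids, weights, scored) are illustrative:

```python
# Sketch of the weighted scoring and tie-aware position assignment.
# `weights` maps each selected keyword t_j to its user weight w_{t_j};
# `scored` maps each document id to its {keyword: tf-idf} dict (m_{d_i t_j}).
def rank_documents(doc_ids, weights, scored):
    # Overall score s_{d_i}: sum of weighted tf-idf scores of matching keywords.
    s = {d: sum(w * scored[d].get(t, 0.0) for t, w in weights.items())
         for d in doc_ids}
    # Sequence Q: documents sorted by overall score, descending (the paper
    # uses quicksort; any comparison sort yields the same sequence).
    q = sorted(doc_ids, key=lambda d: s[d], reverse=True)
    # Tied documents share a position; the next distinct score skips ahead
    # by the size |C| of the tied block immediately above it.
    positions = {}
    for i, d in enumerate(q):
        if i > 0 and s[d] == s[q[i - 1]]:
            positions[d] = positions[q[i - 1]]
        else:
            positions[d] = i + 1
    return q, s, positions
```

   For example, with weights = {"employment": 1.0, "wage": 0.5}, re-running rank_documents re-sorts the view on-the-fly as the user moves a slider, which is what keeps the ranking interactive.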
   The current approach employs a term-frequency-based scheme to compute document scores, as it is more appropriate for computing and highlighting individual term contributions than a single similarity measure.

3   Experimental Setup

We performed a preliminary study to determine whether controllability and transparency increase complexity and pose an extra effort in the task of building topic-oriented document collections. Thus, participants had the task to "gather relevant items" using our tool (U) or using a recommendation list (L) with usual tools (keyword search). We chose two variations of dataset size in terms of item count: S (30) and L (60).

3.1   Method

The study was structured as a repeated measures design, with four iterations of the same tasks, each with a different combination of the independent variables (e.g., US-LL-UL-LS). To counter the effects of habituation, we used four topics covering a spectrum of cultural, technical and scientific content: women in the workforce (WW), robotics (RO), augmented reality (AR), and circular economy (CE). Each of these topics has a well-defined Wikipedia page, which was used as seed to retrieve a collection from a federated system. The system creates a query from the text of the page and forwards it to a number of content providers. The result is a joint list of items from each provider. The federated system cannot establish how relevant the items are. Furthermore, the resulting collection refers to the whole text, but there is no indication of subtopics. We collected sets of 60 and 30 items as static datasets for each topic. We simulated the proposed scenario of reorganizing the collection by choosing subtopics for each task in the study. The combinations were randomized and assigned using a balanced Latin square.

   Each condition had two fundamental tasks: find the items most relevant to a set of given keywords, and find the items most relevant to a short text. In the former, participants were given the keywords and just had to explore the collection. There were two iterations of this task for each condition. The short text task required participants to come up with the keywords describing the topic by themselves.

   Twenty-four (24) participants took part in the study (11 female, 13 male, between 22 and 37 years old). They were recruited from the medical university and from the computer science university graduate population. None of them was majoring in the topic areas selected for the study.

Procedure

A study session started with an intro video, which explained the functionality of the tool. Each participant got exactly the same instructions. There was a short training session on a dummy dataset to let participants familiarize themselves with the tool. Thereafter, the first condition started. The system showed a short text to introduce the topic. After reading the text, participants pressed start, opening the interface for the first task. At the beginning of the task, the items in the collection were ordered randomly, ensuring that an item would not appear in the same position again. The instructions for the task were shown in the upper part of the screen. In all conditions participants were able to collect items and inspect their collections. In the (L) condition the main interface was a list of items, whereas the (U) condition used the proposed interface. Participants had to click the finished button to conclude the task. It was possible to finish without collecting all items. After each condition, participants had to fill in a NASA TLX questionnaire to assess cognitive load, performance and effort, among others.

   The procedure was repeated for each of the four iterations. Thereafter, participants were interviewed for comments.

3.2   Results

NASA TLX data were analyzed using a repeated measures ANOVA with independent variables tool and dataset size. Post-hoc effects were computed using Bonferroni-corrected pairwise comparisons. The two-by-two experimental design ensures that sphericity is necessarily met. A repeated measures ANOVA revealed a significant effect of tool on perceived workload, F(1,23) = 35.254, p < 0.01, ϵ = 0.18. A post-hoc paired-samples t-test revealed a significantly lower workload when using uRank (p < 0.001). Further, repeated measures ANOVAs in each dimension of the workload measure showed significant effects of tool in all dimensions, as shown in Table 1.

   To test the proposed recommender, we gathered and compared, for each topic (WW, RO, AR, CE), the most popular items collected using the list (L-MP) and our approach (U-MP) with the scores received by our ranking algorithm (U).

Table 1: Complexity: people found that our tool incurs a significantly lower workload in all dimensions.

      Dimension          F(1,23)    p           ϵ
      Mental Demand      19.700     p < 0.05    0.10
      Physical Demand    14.520     p < 0.01    0.07
      Temporal Demand     7.720     p < 0.05    0.05
      Performance        11.800     p < 0.01    0.10
      Effort             48.600     p < 0.001   0.22
      Frustration        15.120     p < 0.01    0.07
      Workload           35.254     p < 0.01    0.20
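   For reference, the workload analysis reported above can be reproduced along the following lines. This is a hedged sketch, assuming the responses are stored in a long-format table with hypothetical columns participant, tool, size and workload; the file and column names are illustrative, not from the study material:

```python
# Sketch of the repeated measures ANOVA and post-hoc paired t-test.
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

tlx = pd.read_csv("tlx_responses.csv")          # hypothetical data file

# Repeated measures ANOVA with within-subject factors tool and dataset size.
print(AnovaRM(tlx, depvar="workload", subject="participant",
              within=["tool", "size"], aggregate_func="mean").fit())

# Post-hoc paired-samples t-test: workload with uRank (U) vs. the list (L).
wide = tlx.groupby(["participant", "tool"])["workload"].mean().unstack()
print(ttest_rel(wide["U"], wide["L"]))
# Bonferroni correction: multiply each p-value by the number of comparisons.
```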
[Figure 2 heatmap image: panels WW, RO, AR, CE; rows U, U_MP and L_MP per query (q1, q2, q3); x-axis: items.]
Figure 2: Correlation heatmap. Most popular items collected with our tool (U MP) had high scores in the topic-based ranking (U). Most popular items collected with the list (L MP) are more widespread. The ranking (U) produced many high-scoring items (WW-q1, RO-q1, RO-q2), indicating that a personalized ranking may be more appropriate.

Table 2: Correlation analysis: ICCs established good to excellent correlations between most popular items collected with the list (L-MP), our tool (U-MP), and the ranking scores.

      Task   ww-q1   ww-q2   ww-q3   ro-q1   ro-q2   ro-q3
      ICC     .658    .862    .751    .857    .835    .738
      Task   ar-q1   ar-q2   ar-q3   ce-q1   ce-q2   ce-q3
      ICC     .814    .869    .813    .911    .866    .707

   We performed intra-class correlations (ICC) using a two-way, consistency, average-measures model. Results are summarized in Table 2. For broad exploration (q1 and q2), we found good to excellent ICCs. A closer look at the distribution of scores in Fig. 2 underlines the fact that high-ranked documents (U) were a popular choice with U MP and also relatively popular with L MP. For q3, the ranking (U) produced widespread scores with fewer individual favorites, and items in L MP were generally least popular. U MP resulted in the most focused of the three (fewer blocks with higher intensity).
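   The ICC computation can be sketched as follows, assuming a long-format table per task with hypothetical columns item, method (U, U-MP, L-MP) and value (the score or collection count per item); in pingouin's output, the ICC3k row corresponds to a two-way, consistency, average-measures model:

```python
# Sketch of the intra-class correlation analysis for one task.
import pandas as pd
import pingouin as pg

data = pd.read_csv("popularity_ww_q1.csv")      # hypothetical per-task file
icc = pg.intraclass_corr(data=data, targets="item",
                         raters="method", ratings="value")
print(icc.set_index("Type").loc["ICC3k", ["ICC", "CI95%"]])
```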
4   Discussion and Outlook

Results show that the fast-ranking method in our content recommender helps users quickly reorganize collections. The preference elicitation method was well received and quickly adopted. Participants experienced less effort and lower overall workload using our tool. Still, they took time to check their choices carefully in both U and L conditions. Comparing the most popular choices after the experiment reinforces our assumptions: the fast-ranking method (U) correlates with the most popular choices made with the tool (U MP) but also without it (L MP). Yet, widespread results in some cases call for a personalized recommendation method. Our preference elicitation forms the backbone of personalized recommendations. In the future we will explore recommendations of related items in context, showing the keywords used to collect an item, other items collected together with it, and the collections it was assigned to.

Acknowledgments

This work is partially funded by CONICET (project Visual-I-Lab, Res. 4050-13) and by Know-Center. Know-Center is funded by the Austrian Research Promotion Agency (FFG) under the COMET Program.

References

[Blei, 2012] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012.

[Cremonesi et al., 2012] Paolo Cremonesi, Franca Garzotto, and Roberto Turrin. User effort vs. accuracy in rating-based elicitation. In Proc. of the Sixth ACM Conf. on Recommender Systems, RecSys '12, pages 27–34, New York, NY, USA, 2012. ACM.

[Kay, 2006] Judy Kay. Scrutable adaptation: Because we can and must. In Vincent P. Wade, Helen Ashman, and Barry Smyth, editors, AH, volume 4018 of Lecture Notes in Computer Science, pages 11–19. Springer, 2006.

[Knijnenburg et al., 2012] Bart P. Knijnenburg, Martijn C. Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):441–504, October 2012.

[Pu et al., 2011] Pearl Pu, Li Chen, and Rong Hu. A user-centric evaluation framework for recommender systems. In Proc. of the Fifth ACM Conf. on Recommender Systems, RecSys '11, pages 157–164, New York, NY, USA, 2011. ACM.

[Ricci et al., 2011] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. Introduction to Recommender Systems Handbook, pages 1–35. Springer US, 2011.

[Swearingen and Sinha, 2001] K. Swearingen and R. Sinha. Beyond algorithms: An HCI perspective on recommender systems. In ACM SIGIR Workshop on Recommender Systems, volume 13, numbers 5-6, pages 393–408, 2001.

[Tintarev and Masthoff, 2012] Nava Tintarev and Judith Masthoff. Evaluating the effectiveness of explanations for recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):399–439, October 2012.