Comparing User Interfaces for Customizing Multi-Objective Recommender Systems

Patrik Dokoupil (1), Ludovico Boratto (2), and Ladislav Peska (1)
(1) Faculty of Mathematics and Physics, Charles University, Prague, Czechia
(2) University of Cagliari, Italy

Abstract
The goal of Multi-Objective Recommender Systems (MORSs) is to adapt to the needs and preferences of the users from different beyond-accuracy perspectives. When a MORS operates at the local level, it tailors its results to the needs of each individual user. However, recent studies have highlighted that the self-declared propensity of users towards the different objectives does not always match the characteristics of the accepted recommendations. Therefore, in this study, we delve into different ways for users to express their preference toward multi-objective goals and observe whether these have an impact on declared propensities and overall user satisfaction. In particular, we explore four different user interface (UI) designs and perform a user study focused on the interactions with both the UI and the recommendations. Results show that multiple UIs lead to similar results w.r.t. usage statistics, but users' perceptions of these UIs often differ. These results highlight the importance of examining MORSs from multiple perspectives to accommodate the users' actual needs when producing recommendations. Study data and detailed results are available from https://osf.io/pbd54/.

Keywords
Multi-objective recommender systems, User study, Recommender systems UI

1. Introduction
Multi-Objective Recommender Systems (MORSs) produce results that account for the effectiveness perspective but also go beyond it so as to tackle perspectives such as novelty, diversity, and fairness (to name a few) [1]. The optimization for these objectives can happen at the aggregate level, so that the system can guarantee certain properties (e.g., all providers receive a certain exposure in the recommendation lists). Another alternative is to build MORSs that operate at the local (individual) level so as to shape results towards the prominence of different goals for individual users (e.g., each user would receive recommendations with a different level of diversity) [2].

With a local MORS, one may aim to provide users with additional control over the recommendations and allow them to set their propensities towards individual objectives [3]. This is in line with the general trend of a growing need for understanding of and control over recommendations, as illustrated, e.g., by the recent EU Digital Services Act (https://eur-lex.europa.eu/eli/reg/2022/2065/oj). The regulation requires that the main driving forces of the recommendation process are disclosed and that users should be allowed to select their preferred options (Article 27). However, recent literature revealed a mismatch between the self-declared propensity of users for the different objectives and the characteristics of the recommendations they accept (i.e., the items they choose among the recommendations are less novel or diverse than what they believe they would like) [3].

In this work, we explore the issue of self-declared propensities from the perspective of UI design. Specifically, we focus on a widely used combination of relevance, diversity, and novelty criteria and design four different UIs that allow users to express their propensity toward the objectives. In a user study (Section 3), we allowed users to interact with both the UI and the recommendations themselves and evaluated the impact of the different customization UIs.
In particular, we observed whether the UI designs affect how users perceive individual objectives, how they interact with recommendations, and whether there is some impact on perceived recommendation quality and overall satisfaction. Our results (Section 4) show that there is a trade-off between the perceived usability of the different UIs and their effectiveness at indicating user propensity. Moreover, no UI clearly emerged as the most effective, as users exploited three of our designs with similar effectiveness.

2. Background and Related Work

2.1. Customization UIs in Recommenders
We are not aware of any previous studies focusing on the comparison of UI designs for the local MORS setting. However, in the context of MORS, the work that most closely aligns with ours is by Harper et al. [4], where the authors propose an algorithm allowing users to control item popularity and recency. In addition to the algorithm and its offline evaluation, the authors conducted a user study in the movie domain, finding that the tuned recommendations were rated more positively by users. They also highlighted the importance of individual-level optimization, as no single global setting worked equally well for all users. The tuning was done using buttons labeled neutrally as "left" and "right." While this choice was intentional and justified, users responded negatively when asked about the ease of use of the tuning interface. Therefore, we focused on different UI designs for RS tuning in our work.

Several additional UI variants have been considered for value setting in RS as well as in other HCI tasks [5, 3, 6, 7]. In web design practice, sliders are considered a primary design choice for value specification as long as the values do not have to be very precise [7]. This is well reflected in UIs used for RS tuning, as illustrated, e.g., by the work of Liang and Willemsen [5] on a tuneable exploration-oriented music RS. Similarly, sliders were also used in [3] for the customization of local MORS. Nonetheless, some researchers pointed out the slider's inferior performance (e.g., w.r.t. response times) in situations with limited options and advocated standard HTML radio buttons instead [6]. However, unlike in other scenarios, the particular value of a MORS objective does not carry an inherent meaning for the user (compared, e.g., to a price setting in an e-shop's faceted search). Therefore, users can only perceive the values relative to their previous settings (i.e., incremental increase/decrease; also denoted as "relative" in the literature) or through the comparison with the weights of other criteria (also denoted as "absolute" [8]). Naturally, UIs can be tailored to better reflect one of these views. Another open question is the optimal level of response granularity [9], so that the task complexity is minimized while the UI's expressive power remains sufficient.
From these points of view, we can understand sliders as a fine-grained UI collecting absolute feedback. To cover other design options, we propose and evaluate two alternatives to the sliders UI. In the options UI, we provide users with several prompts to relatively increase/decrease the importance of individual criteria (as such, coarse-grained feedback with relative answers is received). In a way, this layout is most similar to the left/right buttons described in [4], but without the obfuscated labeling. The plus-minus buttons UI is inspired by common RPG gaming designs for character stats and, as such, provides coarse-grained absolute feedback. Finally, the authors of [3] reported an extensive over-weighting of beyond-relevance criteria by the users, so the particular interpretation of user-provided weights can be questioned as well. This was the main driving force behind the sliders_shifted UI variant, which reduces the weights of beyond-relevance criteria.

2.2. Objectives in MORS
MORS typically aim to balance recommendations' relevance with various beyond-accuracy objectives, including diversity, novelty, serendipity, or fairness [10]. Out of the available options, we adopted the approach from [3], focusing on the following variants of relevance, novelty, and diversity. A brief code sketch illustrating these criteria is included in Section 3.1.

• Estimated relevance $rel$ of a recommendation list $L$ was set to the mean of the estimated relevance scores $\hat{r}_{u,i}$ predicted by the relevance-only baseline: $rel(L) = \frac{1}{|L|} \sum_{i \in L} \hat{r}_{u,i}$.
• Novelty was defined as the mean popularity complement: $nov(L) = \frac{1}{|L|} \sum_{i \in L} \left(1 - \frac{|\{u \in U : r_{u,i} \text{ exists}\}|}{|U|}\right)$, where $r_{u,i}$ is the feedback of user $u$ on item $i$ and $U$ is the set of all users.
• Diversity was set to collaborative intra-list diversity: $\text{CF-ILD}(L) = \frac{1}{|L| \cdot (|L|-1)} \sum_{i,j \in L,\, i \neq j} d(i,j)$, where $d(i,j)$ is the cosine similarity on items' ratings.

Figure 1: Schema of the study flow. Informed consent and basic demographics are required first, followed by preference elicitation. Then, participants are directed to a total of 6 iterations of recommendations, where single-objective and multi-objective results are displayed side-by-side (i.e., a within-user variable). After each iteration, users may modify their propensities towards individual objectives via a designated GUI (a between-user variable). Finally, users are directed to the post-study questionnaire.

3. User Study
The study was conducted in the movie domain using the EasyStudy framework [11], and the experimental setup was largely based on [3]. In particular, we utilized the same filtered MovieLens Latest [12] dataset, preference elicitation process, objective criteria definitions, item presentation, and task definition. In the rest of this section, we provide details about the data pre-processing, describe the study flow, and specify the customization UI variants we evaluated.

3.1. Dataset
For the purpose of the study, we utilized an augmented version of the MovieLens-Latest dataset [12]. The dataset was selected for its relative novelty and the general familiarity and popularity of the movie domain among the general public. Both factors should contribute to the realism of the study. The dataset was utilized both to train the collaborative filtering algorithms and as a starting point to gather the necessary item metadata.
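As a concrete illustration of the objective criteria defined in Section 2.2, the following minimal sketch computes them for a single recommendation list over a (binarized) user-item feedback matrix. All names are illustrative assumptions rather than the study's actual implementation, and d(i, j) is taken here as the cosine distance between item rating vectors, which is one common reading of the CF-ILD definition.

```python
import numpy as np

def objective_scores(L, scores_u, R):
    """Illustrative computation of the Section 2.2 criteria for one user.

    L        -- indices of the items in the recommendation list
    scores_u -- estimated relevance scores of the relevance-only baseline
                for the given user (one value per item in the catalog)
    R        -- binary user-item feedback matrix (users x items)
    """
    L = np.asarray(L)

    # Estimated relevance: mean predicted score of the listed items.
    rel = scores_u[L].mean()

    # Novelty: mean popularity complement, where popularity is the fraction
    # of users with recorded feedback on the item.
    popularity = (R[:, L] > 0).mean(axis=0)
    nov = (1.0 - popularity).mean()

    # CF-ILD: mean pairwise distance between the listed items; the distance
    # is assumed to be 1 - cosine similarity of the items' rating vectors.
    V = R[:, L].T.astype(float)                      # item rating vectors
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    dist = 1.0 - V @ V.T
    n = len(L)
    cf_ild = dist[~np.eye(n, dtype=bool)].sum() / (n * (n - 1))

    return rel, nov, cf_ild
```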
As the feedback collected during the study was binary, we binarized the dataset as well (ratings of 4 stars and above count as positive). Furthermore, we only considered the more recent and less obscure portion of the data. In particular, we removed movies released before 1990, ratings older than 2010, movies with fewer than 50 ratings per year, and users with fewer than 100 ratings. This resulted in 9K users, 2K movies, and 1.5M ratings. In order to properly visualize the items, additional metadata were collected from the respective IMDb profiles: movie descriptions, posters, and links to movie trailers.

3.2. Study flow
The user study was organized in four phases: informed consent, preference elicitation, recommendation comparison, and post-study questionnaire. The schematic of the study flow is depicted in Figure 1, while the detailed description of the individual steps follows.

Figure 2: Screenshot of a recommendation iteration. Instructions and the head of the two lists are visible, as well as a description of the movie with a mouse-hover focus.

3.2.1. Pre-study
Prior to the study's commencement, users were shown a study mission statement and detailed instructions and were asked for informed consent to the publication of anonymized data. Since all participants were recruited using the Prolific service, we relied on the demographic information they submitted there and did not ask for additional demographics.

3.2.2. Preference elicitation
After the initial step, participants were routed to the preference elicitation page to collect information to train the collaborative recommenders. In this phase, 24 movies were displayed to the users, who were asked to select the movies they had previously watched and liked. The displayed movies were sampled from the dataset using the following procedure. We calculated estimated relevance (w.r.t. the average user profile), novelty (w.r.t. the item's mean popularity complement [13]), and diversity (w.r.t. CF-ILD [14]) characteristics for each item in the dataset. For each characteristic, we divided items into "low" and "high" buckets and sampled four items from each bucket (note that we first sampled from the relevance and novelty buckets and only then calculated the diversity w.r.t. the already selected items). The displayed items were organized in a grid, and each item was represented by its poster image, title, and genres. Users could express their preferences by simply clicking on the ones they had watched and liked before. We recommended selecting at least 5-10 items, but users were allowed to continue even if fewer items were selected. To make the elicitation more thorough, users could dynamically load additional items (repeating the procedure above) or search for specific ones via a text prompt.

After the elicitation phase, we estimated users' initial propensities toward individual objectives based on the normalized marginal gains calculated for each selected vs. each displayed movie, i.e., how relevant/novel/diverse the movies selected by the user were compared to all displayed movies. Please see [3, 15] for more details on the procedure.

3.2.3. Recommendation comparison
The main part of the study comprised six iterations, where users received two lists of top-10 recommendations side-by-side. One list of recommendations was supplied by the relevance-only baseline algorithm, while the other was provided by the multi-objective RS.
In particular, we employed generalized matrix factorization as the relevance-only baseline (based on the implementation from https://www.tensorflow.org/recommenders/examples/basic_retrieval). Note that the relevance-only baseline will also be referred to as single-objective RS, or simply SORS, in the results. For the multi-objective RS, we employed the RLprop algorithm [15], which aims to maintain the proportionality between user-defined propensities and the fraction of individual objectives in the results. The relevance, novelty, and diversity objectives were used as defined in Section 2.2; note that internally, the RLprop algorithm normalizes the objectives using an empirical cumulative distribution function (CDF) to make them comparable against each other.

Note that each participant received their own copy of the recommending algorithms (and objective weights), which were updated after each step. That is, for each participant, the algorithms were gradually fine-tuned w.r.t. the recommended items the user selected in previous iterations (unlike in [3], algorithms were only updated by selections originating from that particular algorithm). However, these "sandbox" updates did not affect the recommendations given to other users. Also note that if an item was recommended in one iteration, it was removed from the set of candidates in the subsequent ones, so that the user was not overwhelmed with repeating recommendations. Regarding the number of iterations, we opted for a rather low number to maintain a reasonable study duration; otherwise, too much of the users' attention could be lost, which would compromise the results.

Following the findings of [16], recommendation lists were organized into columns, and the placement (i.e., left or right) of the RS variants was randomized. Each item was represented with its poster image, title, genres, a short plot summary, and a link to its trailer to allow users to thoroughly inspect previously unknown items (only the movie's poster was initially visible; other information was displayed on mouse hover). See Figure 2 for a screenshot of the layout.

At each iteration, users were asked for both low-level and high-level feedback. Similarly to the preference elicitation phase, the low-level feedback was a simple click on relevant items. However, we used a different prompt: "Select items that you would consider watching tonight." As for the high-level feedback, users were required to assign 1-5 stars to both recommendation lists so as to compare their overall quality.

Once the users finished the feedback provision, they were directed to the customization UI, where they could modify their propensities towards individual objectives. These were then supplied to the MORS algorithm to generate the next list of recommendations. We evaluated four variants of the customization UI in total (details in Section 3.3), and one variant was assigned to each user for the whole duration of the study (i.e., a between-user variable). We opted for this design due to the presumably substantial carry-over effects that could otherwise compromise our results.

3.2.4. Post-study questionnaire
During the final phase, users were asked to fill out a post-study questionnaire containing 19 questions together with 6 attention checks (instruction manipulation checks, nonsensical questions, and memory-based questions). The questionnaire was inspired by the ResQue framework [17], but we altered it to primarily cover the effect of the customization UIs (see Figure 3 for the exact wording). Users were allowed to reply in the form of a 5-point Likert scale. The exact prompts were "Strongly Disagree", "Disagree", "Neutral", "Agree", and "Strongly Agree", and we also allowed users to answer "I don't understand".
In the subsequent analysis, we assigned the numeric values of -2, -1, 0, 1, and 2 to these prompts, while discarding the "I don't understand" answers. In the evaluation, we grouped the questions into individual evaluated aspects of users' attitudes towards the system: through perceived relevance, novelty, and diversity, we aim to observe to what extent these correspond to the users' feedback and the measurable characteristics of the resulting recommendations. Several questions aim to determine whether users received sufficient information to participate in the study and whether the study interface was easy to use. Then, a series of questions focused on the customization UIs: whether the initial state (i.e., estimated propensities) already provided good recommendations, whether the effect of changing propensities was both positive and substantial, and whether the UI was understandable, easy to use, and gave the users sufficient control to express their preferences. Finally, we also enquired about the overall perceived satisfaction of the users.

Figure 3: The exact wording of the questionnaire questions. The evaluated aspect corresponding to each question is displayed in brackets. Note that "slider" was replaced with other UI names where appropriate.

Figure 4: Different variants of the customization UIs: sliders/sliders_shifted, options, and buttons. The following prompt was displayed for all layouts: "How much of the specified quality should be present in the next recommendation iteration?"

3.3. Customization UIs
The user study evaluated four different UI variants: sliders, sliders_shifted, options, and buttons (see Figure 4). The sliders layout comprised three sliders, one for each objective, that automatically normalized the values to unit sum, i.e., when one value was increased, the others decreased proportionally. Note that the sliders were initialized with the previous values of each objective. The sliders_shifted layout appeared the same from the user's point of view, but in line with the findings of [3], the relative weight of relevance was increased. In particular, upon receiving the user-defined weights, we silently increased the relevance weight by the factor of f = 0.5 and then re-normalized the weights again. As such, both sliders variants provide users with an interface with a well-perceivable tradeoff between individual objectives and a chance to provide fine-grained preferences.

The options layout provided five radio buttons for each objective that allowed users to manipulate the objectives relative to their previous weights. At the k-th iteration, objective weights w[k] were calculated as w[k-1] * f, where the factor f was derived from the selected options (less: 1/2, slightly less: 2/3, same: 1/1, slightly more: 3/2, and more: 2/1). As such, the options UI gives users a chance to relate their feedback to the previous recommendations while only allowing them to supply coarse-grained responses.

Table 1: Overall comparison of single-objective and multi-objective RS. Average results per user and recommending algorithm are displayed. Note that "IMP" stands for metrics evaluated on impressed (displayed) items, and "SEL" stands for metrics evaluated on items selected by the user. Statistically significant results (paired t-test p-value ≤ 0.05) are marked with an asterisk (*).

      Algorithm          Estimated relevance   CF-ILD   CB-ILD   Novelty   Recency   Topic coverage
IMP   Single-objective   *1.537                0.881    0.324    0.972     2014.0    0.739
IMP   Multi-objective    1.136                 *0.957   *0.375   *0.990    *2016.8   *0.791
SEL   Single-objective   *1.551                0.857    0.299    0.966     2013.7    *0.569
SEL   Multi-objective    1.282                 *0.931   *0.338   *0.984    *2016.2   0.535

Finally, the plus-minus buttons layout utilized "virtual coins" to allow users to increase/decrease an objective's importance. At the very beginning, ten coins were assigned w.r.t. the preference elicitation, and at each iteration, the user received four additional coins to assign. Users could also transfer coins already assigned to other objectives (via a minus button). This setting is very similar to many RPG games, where players define their avatar's statistics when the game begins and then incrementally update them after some level-ups are accumulated. Therefore, we believe users may be quite familiar with such a UI as well. Similarly to the sliders, the buttons UI is tuned to visualize the tradeoff between individual objectives. However, it only allows for a coarse-grained response and nudges users towards smaller, incremental changes.

4. Results
The study was conducted in June 2023. In total, 142 participants were recruited using the Prolific.com service. Participants were pre-screened for fluent English, no less than 10 previous submissions, and a 99% approval rate. Twelve users did not finish the study, and, in addition, we rejected 9 participants due to failed attention checks, which resulted in 121 participants uniformly distributed across the individual UIs (i.e., at least 30 participants evaluated each UI). The study sample size was constrained by the funds allocated for participants' compensation. Nonetheless, we also conducted a sensitivity analysis in the G*Power software [18] using ANOVA with four groups, α = 0.05, and 1 − β = 0.8, concluding that the study should be capable of discovering medium effects (Cohen's f = 0.305) with reasonable probability.

As for the participants' demographics, the sample was rather well-balanced regarding gender (50% female, 49% male, 1% unspecified). Participants were rather young in general (mean age = 27, standard deviation = 7.7, median age = 24), mostly white (64%) or black (21%), and mostly from South Africa (21%) or several European countries (55% in total). The average time to complete the study was 15 minutes.

In the analysis of the results, we focused on three main aspects: (i) whether different UIs affected the received implicit and explicit user feedback, (ii) whether the UIs affected the perceived RS qualities as expressed in the questionnaire, and (iii) how individual questionnaire answers correlate with each other.

4.1. User Feedback Analysis

4.1.1. Comparison of single-objective and multi-objective RS
Let us first focus on the overall difference between the results of the single- and multi-objective RS. We first analyzed whether the single- and multi-objective RS actually supplied users with different lists of recommendations. To do so, we compared the corresponding pairs of lists given to the user at each iteration w.r.t. the size of their intersection. Depending on the customization UI, the mean intersection ranged from 12% (buttons UI) to 28% (sliders_shifted). Therefore, we can conclude that the lists were sufficiently different to perform the subsequent analyses.

Table 2: Overall results of the multi-objective RS w.r.t. customization UIs.
The highest results are in bold, while the lowest results are in italics. Results significantly lower (p-value < 0.05 w.r.t. Fisher's exact test for hit rate and a one-sided t-test for ratings and weights) than the highest ones are marked with an asterisk (*). Results significantly higher than the lowest ones are denoted with a circle (∘).

                   Feedback                                      Mean propensity scores
UI variant         Selects fraction   Hit ratio   Mean rating   Relevance   Diversity   Novelty
Sliders            0.744              ∘0.316      2.737         *∘0.508     *∘0.257     *∘0.235
Sliders_shifted    0.774              ∘0.312      ∘2.946        ∘0.618      *0.195      *0.187
Options            0.728              ∘0.307      ∘2.853        *∘0.527     *∘0.254     *∘0.219
Buttons            0.597              *0.247      *2.571        *0.415      ∘0.324      ∘0.262

Next, Table 1 contains the estimated relevance and beyond-accuracy metrics evaluated on the resulting lists. Similarly to [3], we observed that MORS provided recommendations of higher diversity (CF-ILD) and novelty. MORS also provided more diverse recommendations w.r.t. content-based ILD (cosine similarity of associated genres; denoted as CB-ILD), more recent movies (mean year of release), and had a higher coverage of topics (w.r.t. associated genres). We evaluated these metrics both w.r.t. individual lists and w.r.t. all items the algorithm recommended to a particular user throughout the six recommendation iterations, yet the conclusions were the same. Also, all considered customization UIs exhibited the same trend.

However, the improvements in beyond-accuracy metrics were achieved at the expense of a significant drop in the estimated relevance of the recommended items (1.537 vs. 1.136). It seemed that the deficiency w.r.t. relevance was also perceived by the users, who selected items with significantly higher estimated relevance than the average values (1.282 vs. 1.136, t-test p-value: 1.8e-50). While a similar trend was also observed for SORS selections, its magnitude was much smaller. Overall, the single-objective RS obtained more user selections (3096 vs. 2200) and also received a higher average rating from participants (3.36 vs. 2.78). On the other hand, significantly higher diversity, novelty, and recency were also maintained within the selected items recommended by MORS as compared to those recommended by SORS. Furthermore, the ratio of selected items recommended by SORS tended to drop with subsequent iterations, while the volume of selected items recommended by MORS remained roughly the same throughout all iterations. As such, we can conclude that despite not beating the single-objective RS w.r.t. short-term utility, MORS brings favorable features that might pay off in the long run.

4.1.2. Comparison of different customization UIs
Table 2 depicts the results of the MORS separately for the individual customization UIs. Here, the selects fraction depicts the ratio between the volume of selections for MORS and for SORS, while the hit ratio depicts the fraction between the number of selections and the number of impressions for MORS. As such, these represent the relative and absolute relevance w.r.t. implicit feedback data. Notably, the buttons UI attracted the least selections both absolutely (hit rate) and relatively (selects fraction) and also produced the lowest explicit ratings on average. The results of the three remaining variants were mostly comparable with each other. The options variant attracted slightly fewer selections than both sliders variants, but the difference was not significant. Similar results were also obtained w.r.t. the overall algorithm ratings.
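Since the rest of this analysis revolves around the objective weights produced by each UI, it may help to restate the update rules from Section 3.3 in code form. The sketch below is only an approximation under the factors stated there; the function names, the weight ordering (relevance, diversity, novelty), and the exact reading of the sliders_shifted boost are our assumptions, not the study's implementation.

```python
import numpy as np

def normalize(w):
    """Re-normalize non-negative objective weights to unit sum."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)
    return w / w.sum()

def sliders_update(slider_values):
    """sliders: the user sets the weights directly; values are normalized to unit sum."""
    return normalize(slider_values)

def sliders_shifted_update(slider_values, f=0.5):
    """sliders_shifted: as sliders, but the relevance weight (index 0) is silently
    boosted before re-normalization (one possible reading of 'increased by f = 0.5')."""
    w = normalize(slider_values)
    w[0] *= 1.0 + f
    return normalize(w)

# options: multiplicative factors attached to the five radio buttons.
OPTION_FACTORS = {"less": 1 / 2, "slightly less": 2 / 3, "same": 1.0,
                  "slightly more": 3 / 2, "more": 2.0}

def options_update(prev_weights, choices):
    """options: w[k] = w[k-1] * f per objective, with f derived from the selected option."""
    w = [wp * OPTION_FACTORS[c] for wp, c in zip(prev_weights, choices)]
    return normalize(w)

def buttons_update(coins, delta, budget=4):
    """buttons: integer coin counts per objective; each iteration adds up to `budget`
    new coins, and already assigned coins may be moved away via the minus button."""
    new_coins = np.asarray(coins, dtype=int) + np.asarray(delta, dtype=int)
    assert int(np.sum(delta)) <= budget and np.all(new_coins >= 0)
    return normalize(new_coins)
```

For instance, options_update([0.5, 0.3, 0.2], ["slightly more", "same", "less"]) yields weights of roughly [0.65, 0.26, 0.09] after normalization.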
In order to better understand the inferior performance of the buttons UI, we investigated the weights assigned to each objective (depicted in Table 2 as well). Notable differences were observed for the relevance objective, where the buttons UI had, on average, the lowest values, the options and sliders UIs represented an approximate midpoint, and sliders_shifted ended with the highest values on average. The inverse ordering was observed for both novelty and diversity. The reasoning behind sliders_shifted is straightforward, as we intentionally manipulated its interpretation towards adding more relevance. However, it is not so clear why such low relevance weights were used in the buttons UI. We did not find any substantial differences in the initial weights (i.e., after preference elicitation), so we trust this was a deliberate act of the users.

Figure 5: Distribution of the propensity scores provided by the users in each iteration.

Figure 6: Results of the questionnaire analysis. The mean of the numeric values corresponding to individual answers is displayed.

Furthermore, Figure 5 depicts the distribution of propensity scores in each iteration and for each UI. It can be seen that the change is rather gradual for all UIs, but the average vector of the change in the buttons UI is opposite to those of the other UIs (i.e., demoting rather than promoting relevance). Also, while the other UIs tend to disperse the propensities more, these remain fairly compact in the case of the buttons UI. As an additional observation, we can see that while the feedback manipulation introduced by the sliders_shifted UI had a visible effect on the propensity scores, it did not fully translate to the users' feedback. While the fraction of selections and the mean rating were slightly higher for sliders_shifted than for sliders, the difference was not significant, and sliders also achieved a slightly higher hit ratio. We hypothesize that the measured difference in the resulting recommendations was simply not substantial enough to trigger a significantly different response from the users. This is in line with the observations of [19] regarding perceived diversity, where users often perceived the diversity of the presented lists inversely or indifferently, despite relatively large differences in the measured diversity levels.

4.2. Questionnaire Analysis
The feedback analysis revealed that the buttons UI leads to inferior results w.r.t. short-term relevance, while the other three UIs perform comparably, with a slight preference towards sliders and sliders_shifted. However, it is not yet clear whether the users perceived the results alike. So, in the questionnaire analysis, we focused on evaluating additional axes of the RS's and the customization UI's quality. Figure 6 depicts the results of the questionnaire analysis.

Let us start with overall remarks. Generally, users were able to understand and answer the required questions; we received less than 1% of "I don't understand" answers in total. The only question with more (9) such answers was Q16, targeting UI satisfaction. This might be partially caused by the fact that it was the only question with negative phrasing. Therefore, we approach Q16 cautiously here and plan to rephrase it in future studies. Overall, users answered that the recommendations were sufficiently relevant (Q1) and diverse (Q3), with both means significantly above the neutral point (one-sample t-test p-values < 2.6e-19), but not quite as novel (Q2). This might be an effect of using a somewhat older dataset or of not taking movie recency directly into account. The experiment environment's validity is supported by overly positive answers to RS ease-of-use and information sufficiency (Q4-Q8), but the sufficiency and effect of the customization UIs may be questioned to some extent, given the slightly less positive answers for Q11, Q12, and Q14-Q16. Nevertheless, answers to all questions except Q16 were above the neutral point (p-values < 0.0002). We plan to explore this issue in the future by providing users with more options to tune the recommendations. Finally, users of all UI variants agreed that tweaking the values had a visible effect on the resulting recommendations (Q13) and that they were generally satisfied with the recommendations (Q19).

Moving on to comparing the different UIs, one of the main results was the superiority of the options and buttons UIs w.r.t. information sufficiency. In particular, participants perceived the description of relevance, novelty, and diversity as clearer (Q7) and better understood the purpose of tweaking their values (Q8). This seemingly affected the perceived usefulness of the UI usage (Q10) and the understandability of the UIs' mechanisms (Q17). In detail, in Q7, options improved over sliders (one-sided t-test p-value: 0.002) and sliders_shifted (p-value: 0.03), and buttons improved over sliders (p-value: 0.02); in Q8, options and buttons improved over sliders (p-values: 0.002 and 0.02, respectively); in Q10, options improved over sliders (p-value: 0.027); and in Q17, options improved over sliders (p-value: 0.041). Also, if all information sufficiency answers are merged together, the options UI significantly outperforms sliders and sliders_shifted, while the buttons UI outperforms sliders. These findings can be, to some extent, backed by the work of Funke [6], if we accept that users internally perceive the task as one with a limited option space.

Let us now briefly mention some more speculative observations. Despite its inferior effectiveness, the buttons UI surpassed both sliders and sliders_shifted in the perceived ease of setting proper weights for the objectives (Q18). This might be an artifact of the finer-grained slider scale [20, 21], but the same was not sufficiently corroborated for the options UI. The perceived diversity (Q3) of buttons-based recommendations was significantly higher than for sliders_shifted, in accordance with the differences in the user-defined diversity weights. In contrast, although the average weights of the relevance criterion were much higher for sliders_shifted than for sliders, users perceived sliders-based recommendations as matching their interests significantly better (Q1). This supports our previous hypothesis on the somewhat inconsistent perception of individual objectives. However, a dedicated future study is needed to quantify the magnitude of such inconsistencies.

4.3. Questionnaire correlations
Finally, let us focus on the interdependence of the questionnaire answers. Figure 7 depicts the correlation matrix of the responses to individual questions in the post-study questionnaire.
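A matrix such as the one in Figure 7 can be reproduced with a few lines of pandas. The sketch below assumes a hypothetical per-participant table with one column per question (answers already mapped to the -2 to 2 scale, with "I don't understand" stored as a missing value) plus the per-user rating difference between MORS and SORS.

```python
import numpy as np
import pandas as pd

# Hypothetical per-participant answers; NaN marks discarded "I don't understand" replies.
answers = pd.DataFrame({
    "ratings_diff": [0.5, -0.3, 1.0, 0.2],
    "q1_relevance": [2, 1, 0, 1],
    "q16_ui_sufficiency": [-1, np.nan, 0, -2],
    "q19_satisfaction": [1, 1, 2, 0],
    # ... remaining questions q2-q18 ...
})

# Pearson's correlation over all pairs of columns; pandas handles the missing
# values pairwise, so discarded answers simply do not contribute to the estimate.
corr = answers.corr(method="pearson")
print(corr.round(2))
```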
We derive several interesting observations from the results. First, considering overall satisfaction (Q19) as a target variable, we can see that no other evaluated quality criterion exhibited a substantial negative correlation with satisfaction (note that Q16 was itself negatively formulated, so negative values actually indicate a positive effect). Also, while the perceived relevance (Q1) was strongly correlated with the overall satisfaction (ρ = 0.5), several questions related to the UI's effect and sufficiency had an even larger impact (Q11, Q12, Q14, Q16). This also translates to compound statistics (i.e., the mean of all answers targeting the same evaluated aspect), where the mean UI effect and mean UI sufficiency answers are strongly correlated with the overall satisfaction (ρ = 0.57 and ρ = 0.62, respectively). Furthermore, the UI's understandability (Q17) and ease of use (Q18) also exhibited a non-negligible correlation with overall satisfaction.

Figure 7: Pearson's correlation between answers to individual questions. In addition, correlations w.r.t. the Ratings diff, i.e., the difference between per-user mean ratings of MORS and SORS recommendations, are displayed in the first row/column.

To sum up, these findings indicate a possible strong influence of UI controls and their function on the overall user experience with recommender systems. In this study, we only evaluated limited graphical user interfaces. However, in light of emerging conversational RS powered by large language models, it may be crucial to focus on this aspect of user experience also in connection with additional UI and interaction designs.

Second, some of the considered quality axes (information sufficiency, RS ease-of-use, UI effect, and UI sufficiency) were targeted by multiple questions. However, while the questions targeting the UI's effect and the UI's sufficiency were highly correlated in most cases, this was not true for information sufficiency and RS ease-of-use. As such, a finer-grained division of these objectives may be considered in future work.

In addition, we also focused on whether the difference in the feedback users provided on MORS and SORS recommendations can be explained by some of the questionnaire answers. To do so, we also calculated the correlations for the per-user differences in mean ratings of MORS and SORS. In most cases, we obtained close-to-zero results, with the exception of two moderate correlations: Q8 and Q17, both targeting possible understandability issues (Q8: "I understood the purpose of tweaking relevance, diversity, and novelty."; Q17: "The mechanism (slider) for tweaking the objectives was understandable and intuitive."). Therefore, we can preliminarily conclude that the main driving force behind the adoption of customizable individual MORS is actually whether we did a good job of explaining why and how users should tune their propensities.

5. Conclusions and Limitations
We tackled the problem of allowing users to indicate their propensity towards different recommendation objectives, so as to shape more effective and better-tailored MORS. To this end, we conducted a user study that allowed users to customize MORS via the sliders, sliders_shifted, buttons, and options UIs. The results show that while multiple UIs can lead to similarly effective recommendations (w.r.t. user feedback), they can significantly vary in some of the user-perceived quality criteria. In particular, the buttons UI resulted in the lowest consumption-related statistics as well as the lowest user ratings, while the other three UIs performed without significant differences from each other. The main driving force behind this inferiority was the different distribution of propensities the users set through this UI. Further research is needed on the causes of this behavior and on possible ways to support users in setting the best possible values for their current needs.

A subsequent questionnaire analysis revealed certain advantages of the options UI variant as compared to the more standard slider-based UIs. In particular, the options UI dominated over sliders and sliders_shifted in the information sufficiency, perceived usefulness, and UI ease-of-use aspects. As such, we can tentatively recommend the options UI with prompts relative to the previous criteria values as a good variant for customizing local MORS.

As an initial work on a rather complex topic, the study has numerous limitations, which we plan to address in the future. First, when designing the evaluated UIs, we primarily aimed at the most commonly used UI components. Even so, there was a plethora of parameters and design options that we could not test due to the limits imposed on the number of participants. In particular, the current study could not disentangle whether the differences between the sliders and options UIs were mainly caused by the different "grounding" of the choices (i.e., relative to other criteria vs. relative to previous choices), the different response granularity, or the different appearance. So, although we can conclude that there are viable alternatives to the most common choice (i.e., the sliders UI), the selection of the best such alternative is a matter for future work. Similarly to some related works [3], the study revealed several features that should favor MORS over single-objective RS in the long term. However, a truly long-term study should be conducted to verify these assumptions. As indicated by the not-so-positive scores for Q14-Q16, there is some space for revisiting the objectives by incorporating additional criteria or re-defining the current ones.
Finally, while the pool of participants was sufficient to reveal the differences in user feedback and to corroborate medium-sized effects in the questionnaire analysis, subtle effects might have been overlooked; this could be remedied by recruiting more participants. Overall, we plan a series of larger follow-up studies that will focus on a more detailed long-term analysis of user interaction with and perception of MORS. This should also include studying the impact of different domains, dataset properties, recommending algorithms, and study designs.

Acknowledgments
This paper has been supported by Czech Science Foundation (GAČR) project 22-21696S, Charles University grant SVV-260698/2023, and Charles University Grant Agency (GA UK) project number 188322.

References
[1] Y. Zheng, D. X. Wang, A survey of recommender systems with multi-objective optimization, Neurocomputing 474 (2022) 141–153. URL: https://doi.org/10.1016/j.neucom.2021.11.041. doi:10.1016/j.neucom.2021.11.041.
[2] D. Jannach, Multi-objective recommendation: Overview and challenges, in: H. Abdollahpouri, S. Sahebi, M. Elahi, M. Mansoury, B. Loni, Z. Nazari, M. Dimakopoulou (Eds.), Proceedings of the 2nd Workshop on Multi-Objective Recommender Systems co-located with the 16th ACM Conference on Recommender Systems (RecSys 2022), Seattle, WA, USA, 18th-23rd September 2022, volume 3268 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3268/paper1.pdf.
[3] P. Dokoupil, L. Peska, L. Boratto, Looks can be deceiving: Linking user-item interactions and user's propensity towards multi-objective recommendations, in: Proceedings of the Seventeenth ACM Conference on Recommender Systems, RecSys '23, Association for Computing Machinery, New York, NY, USA, 2023. URL: https://doi.org/10.1145/3604915.3608848. doi:10.1145/3604915.3608848.
[4] F. M. Harper, F. Xu, H. Kaur, K. Condiff, S. Chang, L. Terveen, Putting users in control of their recommendations, in: Proceedings of the 9th ACM Conference on Recommender Systems, RecSys '15, Association for Computing Machinery, New York, NY, USA, 2015, p. 3–10. URL: https://doi.org/10.1145/2792838.2800179. doi:10.1145/2792838.2800179.
[5] Y. Liang, M. C. Willemsen, Personalized recommendations for music genre exploration, in: Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, UMAP '19, Association for Computing Machinery, New York, NY, USA, 2019, p. 276–284. URL: https://doi.org/10.1145/3320435.3320455. doi:10.1145/3320435.3320455.
[6] F. Funke, A web experiment showing negative effects of slider scales compared to visual analogue scales and radio button scales, Social Science Computer Review 34 (2016) 244–254. doi:10.1177/0894439315575477.
[7] B. Shneiderman, Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd ed., Addison-Wesley Longman Publishing Co., Inc., USA, 1997.
[8] Q. Zhao, The superior psychological impact of absolute (vs. relative) standing feedback does not depend on the reward criterion, Social Psychology of Education 26 (2023) 473–484. URL: https://doi.org/10.1007/s11218-023-09758-2. doi:10.1007/s11218-023-09758-2.
[9] L. Peska, S. Balcar, The effect of feedback granularity on recommender systems performance, in: Proceedings of the 16th ACM Conference on Recommender Systems, RecSys '22, Association for Computing Machinery, New York, NY, USA, 2022, p. 586–591. URL: https://doi.org/10.1145/3523227.3551479. doi:10.1145/3523227.3551479.
[10] D. Jannach, H. Abdollahpouri, A survey on multi-objective recommender systems, Frontiers in Big Data 6 (2023). URL: https://www.frontiersin.org/articles/10.3389/fdata.2023.1157899. doi:10.3389/fdata.2023.1157899.
[11] P. Dokoupil, L. Peska, EasyStudy: Framework for easy deployment of user studies on recommender systems, in: Proceedings of the 17th ACM Conference on Recommender Systems, RecSys '23, Association for Computing Machinery, New York, NY, USA, 2023, p. 1196–1199. URL: https://doi.org/10.1145/3604915.3610640. doi:10.1145/3604915.3610640.
[12] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[13] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, Association for Computing Machinery, New York, NY, USA, 2011, p. 109–116. URL: https://doi.org/10.1145/2043932.2043955. doi:10.1145/2043932.2043955.
[14] K. Bradley, B. Smyth, Improving recommendation diversity, in: Proceedings of the twelfth Irish conference on artificial intelligence and cognitive science, Maynooth, Ireland, volume 85, Citeseer, 2001, pp. 141–152.
[15] L. Peska, P. Dokoupil, Towards results-level proportionality for multi-objective recommender systems, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, Association for Computing Machinery, New York, NY, USA, 2022, p. 1963–1968. URL: https://doi.org/10.1145/3477495.3531787. doi:10.1145/3477495.3531787.
[16] P. Dokoupil, L. Peska, L. Boratto, Rows or columns? Minimizing presentation bias when comparing multiple recommender systems, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, p. 2354–2358. URL: https://doi.org/10.1145/3539618.3592056. doi:10.1145/3539618.3592056.
[17] P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, Association for Computing Machinery, New York, NY, USA, 2011, p. 157–164. URL: https://doi.org/10.1145/2043932.2043962. doi:10.1145/2043932.2043962.
[18] F. Faul, E. Erdfelder, A. Buchner, A.-G. Lang, Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses, Behavior Research Methods 41 (2009) 1149–1160. URL: https://doi.org/10.3758/BRM.41.4.1149. doi:10.3758/BRM.41.4.1149.
[19] P. Dokoupil, L. Boratto, L. Peska, User perceptions of diversity in recommender systems, in: Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, UMAP '24, Association for Computing Machinery, New York, NY, USA, 2024, p. 212–222. URL: https://doi.org/10.1145/3627043.3659555. doi:10.1145/3627043.3659555.
[20] C. C. Preston, A. M. Colman, Optimal number of response categories in rating scales: reliability, validity, discriminating power, and respondent preferences, Acta Psychologica 104 (2000) 1–15. URL: https://www.sciencedirect.com/science/article/pii/S0001691899000505. doi:10.1016/S0001-6918(99)00050-5.
[21] E. I. Sparling, S. Sen, Rating: How difficult is it?, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, Association for Computing Machinery, New York, NY, USA, 2011, p. 149–156. URL: https://doi.org/10.1145/2043932.2043961. doi:10.1145/2043932.2043961.