Development of an Instrument for Measuring Users' Perception of Transparency in Recommender Systems

Marco Hellmann, Diana C. Hernandez-Bocanegra and Jürgen Ziegler
University of Duisburg-Essen, Forsthausweg 2, 47057 Duisburg, Germany

Abstract
Transparency is increasingly seen as a critical requirement for achieving the goal of human-centered AI systems in general and also, specifically, recommender systems (RS). However, defining and operationalizing the concept is still difficult, due to its multi-faceted nature. Currently, there are hardly any measurement instruments to adequately assess the perceived transparency of RS in user studies. Thus, we present the development of a measurement instrument that aims at capturing perceived transparency as a multidimensional construct. The results of our validation show that transparency can be distinguished with respect to input (what data does the system use?), functionality (how and why is an item recommended?), output (why and how well does an item fit one's preferences?), and interaction (what needs to be changed for a different prediction?). The study is intended as a first iteration in the development of a reliable and fully validated measurement tool for assessing transparency in RS.

Keywords
Recommender systems, transparency, explanations, user study

Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland
marco.hellmann@stud.uni-due.de (M. Hellmann); diana.hernandez-bocanegra@uni-due.de (D. C. Hernandez-Bocanegra); juergen.ziegler@uni-due.de (J. Ziegler)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The request for more transparency in intelligent systems has become steadily louder in recent years, formulated in academic research as well as in most public and corporate policies concerning the ethics of artificial intelligence [1, 2]. Although there is now broad agreement that transparency is of high relevance for developing human-centred AI systems, the concept is still elusive due to its multi-faceted nature and the different objectives it is intended to serve. The questions raised when asking for transparency include, for example, the system aspects that should be made transparent, or the riskiness of an AI function at an individual or societal level.

A need for greater transparency has also been noted for recommender systems (RS), a frequent, user-facing type of AI-driven technology, to better support users in their decision-making and to avoid potentially negative consequences, e.g. users getting trapped in filter bubbles [3]. Various methods have been proposed to this end, ranging from disclosing the user profile on which a recommendation is based to providing explicit explanations. Still, the multi-facetedness of the concept makes it difficult to design effective transparent RS.

A central question that must be solved to this end is how the transparency of a RS can be measured and evaluated. While different aspects of the system, for example, the input data, the recommendation algorithm, or features of the recommended items, may be exposed to the user, transparency as a user-centric quality can only be assessed by measuring users' perception and understanding of those system aspects that are relevant for their decision making and trust in the system [4].

Despite the acclaimed relevance of transparency in RS, the instruments available for measuring it from a user perspective are still very limited. Some instruments for assessing overall recommendation quality include a small number of items related to perceived transparency [5], but these measures still seem far from covering the multiple facets involved. To the best of our knowledge, there is no instrument focusing specifically on RS transparency. A further shortcoming of existing instruments is that they do not sufficiently consider the cognitive processes involved in users' understanding of recommendations and in their ability to influence the system according to their needs, if such influence is possible.

In this paper, we describe steps towards a more holistic and cognitively grounded psychometric instrument for measuring perceived transparency in RS. We first explain the questionnaire development process that resulted in a validated set of items specifically focused on RS transparency. The candidate items for this development were chosen to reflect the different steps involved in cognitively processing the information provided about the recommendation process and its output. To further validate the instrument, we performed an analysis of the effects of perceived transparency, as measured by our new instrument, on factors related to trust in the RS and the effectiveness of the recommendations. An influence of transparency on users' trust in the system and on the acceptance of the recommendations has been suggested in prior research, e.g. in [5]. We analyzed these influences through structural equation modeling to show that the construct 'transparency' as measured by our instrument has in fact the assumed effects.

Our contribution is thus twofold: we provide a systematically derived and validated measurement instrument for transparency in RS, and we can show that the different transparency factors represented in the questionnaire have an impact on the effectiveness of recommendations and trust in the system, albeit to different degrees.

2. Related work

Users' perception of the transparency of a RS may be influenced by several factors. Providing explanations is one important aspect, and some studies have shown that transparency is positively influenced by the quality of the explanations given ([5], [6]) and that it is related to control over the system [7]. The effect of systematically varied levels and styles of explanation on perceived transparency has been studied and assessed via questionnaires (see e.g. [8], [9], [10]). Also, a positive influence of interaction possibilities as well as perceived control on the perceived transparency of the system was reported by [5].
Transparency perception seems to be enhanced both by the perceived quality of explanations and the perceived accuracy or quality of recommendations. In addition, the authors show a positive effect of transparency on trust and, through trust, an indirect effect on purchase intentions. According to [11], this can be related to evaluating the effectiveness of the RS. Moreover, studies suggest that perceived transparency promotes satisfaction with the system [12], [7].

The influence of personal factors on the perception of recommender systems has often been investigated in the light of the general decision-making behavior of users (see [13]). [9] showed that individuals with a rational decision-making style trusted the recommender system tested more and rated its efficiency and effectiveness higher. Furthermore, they showed that individuals with an intuitive decision-making style rate the quality of explanations better.

To date, however, few measurement tools exist to quantitatively assess the transparency of a RS as perceived by users. [6] surveyed perceived transparency using two items ("I understand why the system recommended the artworks it did"; "I understand what the system bases its recommendations on"), in the domain of art objects. [14] use a single item ("I did not understand why the items were recommended to me (reverse scale)"), for event recommendations. [8] proposed an item that explicitly refers to explanations: "Which explanation interfaces are considered to be transparent by the users?". [5] proposed an evaluation framework for RS, involving different domains and applications, and formulate the measurement of the construct transparency using only a single item ("I understood why the items were recommended to me"), this latter being a frequently used item for the evaluation of RS transparency.

Consequently, we set out to formulate and validate a more comprehensive way to measure the perceived transparency of a RS, as described in the methods section. The procedure followed the typical procedure for developing psychometric measurement instruments (e.g. [15]):

(1) To operationalize a target construct, first a larger number of candidate items is formulated and compiled. Here, we draw on the basic structure of RS ([16], [17]) and typical user questions related to artificial intelligence algorithms [18]. Second, items were also derived from a qualitative preliminary study, to further analyze the uncertainties in users' mental models, which can be understood as the notion that users have about how a system or a certain type of systems works [19].

(2) We examined the factor structure of the transparency construct, which was formed as a reflective factor in the sense of classical test theory (see also [20]). We considered 4 factors that could group individual questionnaire items, and that might contribute to variances in perceived transparency, inspired by the dimensions defined by [18]: input ("what kind of data does the system learn from"), output ("what kind of output does the system give"), functionality ("how / why does the system make predictions") and interaction ("what if" / "how to be that": "what would the system predict if this instance changes to..").

(3) The developed measurement instrument was validated. For this purpose, the framework model of [7] was used.

2.0.1. Mental models and stages of cognitive processing

Transparency is frequently discussed as an objective property of a system. A system only becomes transparent, however, if its users can understand the transparency-related information, such as explicit explanations, and evaluate it with respect to their goals. The degree of comprehension may depend on the mental model users have about how the system works [21], based either on preconceptions, previous experiences with similar systems, or on the interaction with and perception of the present system [22]. As discussed in [19], mental models that drift considerably from actual system functioning may result in broadening the "gulfs" described by [22]: 1) the gulf of execution, when the user's mental model is inaccurate in terms of how the system can be used to execute a task, and 2) the gulf of evaluation, when the output (as a consequence of a user's action) differs from what is expected according to the user's mental model.

To bridge these gulfs, users must process the information provided by the system at different cognitive levels. The items of the proposed questionnaire were formulated to reflect the action levels according to [23]. According to their model, the quality of interaction with the system can be described through a cycle of evaluation and execution. For example, at first, the user may perceive the output of the system (e.g., the recommendations and explanations), then interpret the information gathered (e.g., how the system works), and thereby evaluate the state of the system (e.g., performance of the system and quality of the output). As a consequence, the user formulates goals they aim to achieve with the system, or matches their goals with the evaluation of the system (e.g., get more accurate or diverse recommendations). The user then pursues an intention (e.g., improve recommendations), which is translated into planning actions (e.g., change input), which they finally execute. While this cognitive cycle is well-known in the HCI field, it has hardly been applied in the investigation of transparency for AI-based systems.

The authors in [23] assume that there are gaps between the users' goals and their knowledge about the system, and the extent to which the system provides descriptions of its functioning (the gulfs of execution and of evaluation, as mentioned beforehand). By taking actions to bridge those gaps (making system functions match goals, and making the output represent a "good conceptual model of the system that is easily perceived, interpreted and evaluated" [23]), system designers may contribute to minimizing cognitive effort by users [23], and to decreasing the discrepancy between the mental model of the system and its actual functioning, which may have an impact on the perception of transparency, as discussed by [19]. We argue, then, that a more comprehensive instrument to measure perceived transparency is still needed, so that such impact can be evaluated not only on the basis of general perceived understanding ("I understood why recommended"), but also on the basis of the extent to which output and functionalities that reflect the conceptual model of the system are perceived, interpreted and evaluated by users.

3. Methods

To operationalize the construct of perceived transparency, we conducted the following steps, based on the typical procedure for developing measurement instruments (e.g., [15]): 1. Formulation and compilation of questionnaire items. 2. Examination of item quality and factor structure, based on an online study. 3. Validation of the measurement instrument. We describe each step below.

3.1. Formulation and compilation of questionnaire items

Here, we draw on the basic structure of RS ([16], [17]) and typical user questions to AI algorithms [18]. Candidate items were also chosen to cover different stages of the cognitive action cycle described in related work. Second, items were also derived from a qualitative pre-study, consisting of interviews with users to further analyze the uncertainties in users' mental models [19], in regard to different commercial RS, like Netflix, Spotify or Amazon.

A total of 6 interviews were conducted via video call, with voluntary participants. When selecting the interview partners, care was taken to represent different age groups and different levels of experience with Internet applications in the sample. Students and non-students from different age groups (20 to 50 years) were interviewed. Overall, previous exposure to recommender systems was equally strong among all participants. Only one interviewee had lower experience and one interviewee had slightly higher experience.

The aim of the interviews was to capture the experience, perception and evaluation as well as possible questions of users regarding the functionality or transparency of recommender systems. The subjects were asked to explain the functionality of RS from their perspective and to create a corresponding sketch. Following this, uncertainties and possible lack of transparency were discussed. Finally, prototypical explanations from [24] for increasing the perceived transparency were evaluated by the interview partners. The explanations refer differently to the input used, the functionality and the output. In addition, they use different visual forms of representation, e.g. star ratings, profile lines, text. In this way, uncertainties as well as wishes for more transparency by users could be identified. Each question encountered in the interviews was directly transformed into one or more items.

A resulting set of 92 items was collected and discussed by the research team, where linguistic revision and elimination of redundancies were also performed. The discussions led to a reduction of the set to 34 items, which were used as input for the online validation described in the next section.
3.2. Online user study

We conducted a user study to examine item quality and factor structure, as described below.

Participants. We recruited 171 participants (89 female, mean age 29, range 18 to 69) through the crowdsourcing platform Prolific. We restricted the task to workers in the U.S. and the U.K. with an approval rate greater than 98%. Participants were rewarded with £1.15. Time devoted to the survey (in minutes): M = 13.2, SD = 7.33.

We applied a quality check to select participants with quality survey responses (we included attention checks in the survey, e.g. "This is an attention check. Please click the option 'Disagree'"). We discarded participants with at least 1 failed attention check, as well as those who did not finish the survey. Thus, the responses of 17 of the 192 initial Prolific respondents were discarded and not paid. 4 additional cases were removed due to suspicious response behavior, e.g. answering all questions on the same page with the same value. Thus, 171 cases were used for further analysis.

The target sample size was chosen to allow performing a CFA analysis. [25], p. 389, recommend a minimum of n > 50 or three times the number of indicators. [26], p. 102, recommend a minimum of n > 100 or five times the number of indicators. Thus, given that we wanted to evaluate a set of 34 items, the sample size was set to a minimum of 170 participants.

Questionnaires. We utilized the set of 34 items resulting from the item formulation step described above. Additionally, aiming to further validate the final measurement instrument (4.3), we used items from [5] to evaluate perception of control (how much users think they can influence the system), interaction adequacy, interface adequacy, information sufficiency and recommendation accuracy. Furthermore, we included items from [7] to evaluate the perception of system effectiveness (construct perceived system effectiveness: the system is useful and helps the user to make better choices) and of trust in the system [27] (constructs trusting beliefs, with subconstructs benevolence, integrity, and competence: the user considers the system to be honest and trusts its recommendations; and trusting intentions: the user is willing to share information and to follow advice). We used items described in [28, 29] for explanation quality, and from [30] to evaluate decision-making style. All items were measured on a 1-5 Likert scale (1: Strongly disagree, 5: Strongly agree).

Procedure. Participants were asked to choose a service from five applications, for which they were required to have an active account: Amazon, Spotify, Netflix, Tripadvisor, and Booking. Participants were instructed to open the application and browse it at their own discretion. They were explicitly told to select an item that was relevant to them and which they would actually buy or consume. An actual purchase of items was explicitly not requested. Participants were asked to return to the survey after completing the task and to answer questions about the system they used.

Data analysis. We performed an exploratory factor analysis (EFA) to further reduce the initial set of items and a confirmatory factor analysis (CFA) to test internal reliability and convergent validity. Furthermore, we evaluated the discriminant validity of the resulting set of items in relation to other constructs of the subjective evaluation of RS, for example explanation quality, effectiveness and overall satisfaction, according to the frameworks defined by [7] and [5].

4. Results

4.1. Exploratory Factor Analysis (EFA)

The factor structure was examined exploratively, aiming to further reduce the set of items. A total of 5 EFAs with principal axis factor analysis and promax rotation were performed. First, items that did not have a unique principal loading or had a principal loading that was too low (<.40) were removed. In the first 4 EFAs, 11 items were removed based on this criterion. Subsequently, a more stringent criterion was used (factor loadings <.50); the guideline values are based on [31]. Thus, 2 further items were removed. This resulted in a 6-factorial structure with a total of 21 items and an explained variance of 62.45%. The reliability of the factors falls in the range 'good' to 'very good' (.782 to .888), as defined by [32]. The internal consistency across all items is .867.

4.2. Confirmatory Factor Analysis (CFA)

Following the exploration of the factor structure, the result obtained was tested for internal reliability and convergent validity using confirmatory analysis. A first CFA was performed, resulting in 8 items with low factor loadings, which were eliminated from the set. Two factors were removed in the process because they did not load on a second-level overall transparency factor. A final CFA with 4 factors was performed (model fit: χ² = 86.997, df = 61, p = .016; χ²/df = 1.426; CFI = .975; TLI = .968; RMSEA = .050; SRMR = .047). Reliability across all items is equal to .884. This model comprises a final set of four factors and 13 items, which are reported in Table 1 along with factor loadings.

The four factors identified can be associated with the concepts Input, composed of 3 items, Output, also with 3 items, Functionality with 5 items, and Interaction with only 2 items. Although the initial item set comprised questions for all stages of the cognitive action cycle, after CFA, items related to the perception level were only left in the factor Functionality, comprising questions about whether users are aware of transparency-related information if provided by the system (e.g.: "The system provided information about how well the recommendations match my preferences"). This factor covers mostly perception-related questions.

Table 1
Test results of internal reliability and convergent validity of our proposed transparency questionnaire (Cronbach's alpha per factor; factor loading per item).

Input (alpha = .842)
  It was clear to me what kind of data the system uses to generate recommendations.           .817
  I understood what data was used by the system to infer my preferences.                      .901
  I understood which item characteristics were considered to generate recommendations.        .712

Output (alpha = .801)
  I understood why the items were recommended to me.                                          .771
  I understood why the system determined that the recommended items would suit me.            .794
  I can tell how well the recommendations match my preferences.                               .710

Functionality (alpha = .847)
  The system provided information to understand why the items were recommended.               .731
  The system provided information about how the quality of the items was determined.          .705
  The system provided information about how my preferences were inferred.                     .736
  The system provided information about how well the recommendations match my preferences.    .696
  I understood how the quality of the items was determined by the system.                     .760

Interaction (alpha = .888)
  I know what actions to perform in the system so that it generates better recommendations.   .896
  I know what needs to be changed in order to get better recommendations.                     .892
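Two of the statistics reported above follow standard closed-form definitions: Cronbach's alpha (internal consistency) and the RMSEA fit index derived from the CFA chi-square. The sketch below, which is not the authors' analysis code, implements both textbook formulas in plain Python; the Likert responses in the example are fabricated for illustration, while the RMSEA call uses the chi-square, degrees of freedom, and sample size reported for the final CFA.

```python
import math

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item).

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(items)
    n = len(items[0])
    def var(xs):  # population variance, as in the classical formula
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    total = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(total))

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation from the chi-square statistic."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# The final CFA reported chi-square = 86.997 with df = 61 and n = 171;
# the formula reproduces the reported RMSEA = .050.
print(round(rmsea(86.997, 61, 171), 3))  # 0.05

# Toy 1-5 Likert responses for a 3-item factor (fabricated numbers):
toy_items = [[5, 4, 4, 2, 5], [4, 4, 5, 2, 4], [5, 3, 4, 1, 5]]
print(round(cronbach_alpha(toy_items), 3))  # 0.923
```

Running the RMSEA formula against the reported fit statistics is a quick consistency check on the numbers in the text; the alpha function can be applied per factor to a response matrix restricted to that factor's items.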
The missing coverage of fect of higher overall transparency on the perception of perception-related items in other factors is likely due the recommendations and the overall system. Further- to limitations of the systems used for the online study more, we assumed that transparency is influenced by which do not, for example, provide access to the data system-related aspects (accuracy, interaction quality, and on which recommendations are based, thus preventing explanation quality) as well as by personal characteristics users to become aware of input data . The factor Output such as decision-making behavior) as described in the comprises items related to the interpretation and evalua- related work section. Some of these factors can also be tion stages. The factor Interaction has the smallest scope expected to influence perceived control over the system, with 2 items and covers only the facets of action planning a construct that may mediate the impact of these factors or action execution. This factor thus describes whether on transparency perception. This led us to formulating users know which actions they would have to perform if the hypotheses shown in table 3. they wanted to receive other recommendations. In the following, the relationships of the factors in the structural equation model are presented (see fig. 1). 4.3. Discriminant validity of Only significant paths with standardized path coefficients are shown. Indirect effects are only considered for the measurement instrument transparency factors relevant here. The final model is We determined discriminant validity of the instrument in shown to have a very good fit: X = 75.767, df = 57, p = 2 relation to other constructs of the subjective evaluation .049; X /df = 1.329; CFI = .980; TLI = .965; RMSEA = .044; 2 of RS, for example explanation quality, effectiveness and SRMR = .072. The model is thus adequate to describe the overall satisfaction, according to the frameworks defined relationships in the data set. 
by [7] and [5]. Discriminant validity was assessed using Influences on perceived transparency of the sys- inter-construct correlations (see results in table 2). We tem. Transparency with respect to interaction is rated found that the squared correlations between pairs of con- higher when users are more likely to exhibit an intuitive structs were all less than the value of average variances decision-making style (0.186, p <.05) and users report that are shown in the diagonal, representing “a level of higher perceived control (0.293, p <.001). The latter is appropriate discriminant validity” [5]. increased by the quality of interaction (0.502, p <.001) and explanations (0.341, p <.001). Users thus know better how to influence recommendations when they have more op- 5. Structural Equation Model portunities to interact with the system, and can gather in- (SEM) formation about the system through explanations as well as through ’trial and error’ (indirect: explanation quality To explore the relation between the transparency factors →control →Transparency-interaction: 0.100, p < .01; in- assessed by the questionnaire and the effects of perceived teraction quality →control →Transparency-interaction: transparency on recommendation effectiveness and trust 0.147, p < .001). in the system, as well as the impact of factors influenc- Similar observations can be made for functionality. ing transparency, we set up a Structural Equation Model Again, transparency is rated higher when users are more (SEM). The model is based on hypotheses we derived likely to exhibit an intuitive decision-making style (0.141, from existing research that has shown the positive ef- p <.05) and users report higher perceived control (0.261, Table 2 Inter-construct correlation matrix. Average Variance Extracted (AVE) on the main diagonal; correlations below the diagonal; quadratic correlations above the diagonal. Target value for AVE ≥.5. p<0.05*, p<0.01**. 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 1 Transp. - input 0.662 0.227 0.235 0.136 0.111 0.023 0.009 0.125 0.053 0.051 0.075 0.040 0.065 0.063 0.159 0.002 0.026 0.081 0.021 2 Transp. - output 0,476** 0.577 0.231 0.121 0.157 0.039 0.019 0.146 0.187 0.054 0.094 0.291 0.071 0.041 0.239 0.001 0.186 0.240 0.074 3 Transp. - function 0,485** 0,481** 0.527 0.155 0.246 0.021 0.061 0.341 0.153 0.022 0.114 0.094 0.147 0.153 0.183 0.021 0.127 0.168 0.094 4 Transp. - interaction 0,369** 0,348** 0,394** 0.799 0.119 0.000 0.055 0.048 0.072 0.000 0.056 0.030 0.060 0.106 0.070 0.008 0.064 0.035 0.038 5 Control 0,333** 0,396** 0,496** 0,345** 0.775 0.004 0.018 0.242 0.366 0.052 0.154 0.090 0.153 0.198 0.156 0.007 0.144 0.141 0.118 6 DM style - rational 0,153* 0,197** 0.146 0.016 0.061 0.454 0.041 0.062 0.004 0.073 0.018 0.032 0.017 0.017 0.026 0.004 0.036 0.058 0.030 7 DM style - intuitive 0.092 0.138 0,246** 0,234** 0.136 -0,203** 0.502 0.022 0.027 0.000 0.000 0.019 0.006 0.011 0.012 0.012 0.001 0.001 0.005 8 Explanation quality 0,353** 0,382** 0,584** 0,220** 0,492** 0,248** 0.148 0.557 0.091 0.080 0.265 0.151 0.112 0.100 0.230 0.030 0.199 0.177 0.171 9 Interaction adequacy 0,230** 0,432** 0,391** 0,269** 0,605** 0.064 0,163* 0,301** 0.791 0.082 0.065 0.048 0.116 0.151 0.101 0.020 0.147 0.118 0.084 10 Interface adequacy 0,226** 0,232** 0.147 0.008 0,228** 0,270** -0.001 0,282** 0,286** 0.618 0.123 0.054 0.052 0.043 0.207 0.020 0.108 0.130 0.187 11 Info. sufficiency 0,273** 0,307** 0,337** 0,236** 0,393** 0.133 -0.001 0,515** 0,254** 0,350** — 0.104 0.064 0.063 0.182 0.048 0.216 0.188 0.170 12 Recomm. 
accuracy 0,201** 0,539** 0,307** 0,174* 0,300** 0,180* 0.137 0,389** 0,220** 0,232** 0,323** — 0.086 0.062 0.259 0.021 0.187 0.326 0.221 13 Trust - benevolence 0,254** 0,266** 0,384** 0,245** 0,391** 0.130 0.079 0,334** 0,341** 0,228** 0,252** 0,293** 0.666 0.661 0.366 0.095 0.162 0.332 0.282 14 Trust - integrity 0,250** 0,202** 0,391** 0,326** 0,445** 0.129 0.106 0,316** 0,388** 0,207** 0,251** 0,249** 0,813** 0.476 0.332 0.088 0.179 0.238 0.272 15 Trust - competence 0,399** 0,489** 0,428** 0,265** 0,395** 0,162* 0.111 0,480** 0,318** 0,455** 0,427** 0,509** 0,605** 0,576** 0.608 0.030 0.278 0.440 0.358 16 Trust - share info. 0.040 0.028 0.146 0.091 0.086 0.060 0.109 0,174* 0.141 0.140 0,219** 0.143 0,308** 0,297** 0,174* — 0.064 0.062 0.078 17 Trust - follow advice 0,160* 0,431** 0,356** 0,253** 0,379** 0,189* 0.028 0,446** 0,384** 0,328** 0,465** 0,433** 0,402** 0,423** 0,527** 0,252** — 0.213 0.269 18 Effectiveness 0,284** 0,490** 0,410** 0,187* 0,375** 0,241** 0.036 0,421** 0,344** 0,360** 0,434** 0,571** 0,576** 0,488** 0,663** 0,249** 0,461** 0.545 0.389 19 Overall satisfaction 0.145 0,272** 0,306** 0,194* 0,343** 0,174* 0.069 0,414** 0,289** 0,432** 0,412** 0,470** 0,531** 0,522** 0,598** 0,280** 0,519** 0,624** — Table 3 Overview of hypothesis addressed in SEM Hypotheses Reference Relevant factor Explanation Factors influencing perceived transparency (X →perceived transparency) H-1.1 [6],[5] Explanation quality Comprehensibility and contribution of the explanations to the understanding of the system H-1.2 [5] Accuracy Match between the items and the user’s preferences H-1.3 [5] (indirect effect) Interaction quality Possibilities of adaptation and feedback H-1.4 [5] Control Possibilities of personalization H-1.5 [10] decision-making styles Rational / Intuitive Effects of perceived transparency (perceived transparency →Y). 
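The discriminant-validity criterion applied in Section 4.3 (often called the Fornell-Larcker criterion) can be stated compactly: for every pair of constructs, the squared inter-construct correlation must be smaller than the AVE of each construct. The following sketch is an illustration, not the authors' analysis code; the example numbers are the Transparency-input and Transparency-output entries from Table 2.

```python
def discriminant_ok(ave_a, ave_b, corr_ab):
    """Fornell-Larcker check for one construct pair: the squared correlation
    must stay below the AVE of both constructs involved."""
    return corr_ab ** 2 < ave_a and corr_ab ** 2 < ave_b

# Transparency-input (AVE = .662) vs Transparency-output (AVE = .577),
# correlation .476; the squared correlation reproduces the .227 in Table 2.
print(round(0.476 ** 2, 3))                  # 0.227
print(discriminant_ok(0.662, 0.577, 0.476))  # True
```

Applied over all pairs in Table 2, this is exactly the statement in the text that every off-diagonal squared correlation lies below the diagonal AVE values.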
The quality of interaction, promoting perceived control, has a positive effect on transparency concerning how the system works (indirect: interaction quality → control → Transparency-functionality: 0.131, p<.001). Explanation quality both indirectly and directly (0.416, p<.001) increases the transparency of the functionality when users rate explanations positively (indirect: explanation quality → control → Transparency-functionality: 0.089, p<.01).

Figure 1: Structural model. p<0.05*, p<0.01**, p<0.001***

The input is perceived as more transparent the better users can interact with the system. Thus, here again, perceived control has a direct positive effect (0.200, p<.01). The quality of the interaction thus repeatedly has an indirect effect (indirect: interaction quality → control → Transparency-input: 0.100, p<.05). Similarly to what has already been shown with regard to functionality, the quality of the explanations also has a direct, positive effect on the transparency of the input (0.209, p<.01), in addition to the indirect effect (explanation quality → control → Transparency-input: 0.068, p<.05).

The transparency of the output reflects how well users can assess why a recommendation is made or should match their preferences. It is directly increased by the quality of the interaction with the system (0.311, p<.001), i.e. when possibilities are offered or used to indicate one's own preferences. On the other hand, there are no direct or indirect influences of the explanations. Instead, the accuracy of the recommendations has a positive influence on the transparency of the output (0.454, p<.001). Accordingly, the output is easier to understand if it is rated as suitable; unsuitable recommendations would thus be more difficult for the user to comprehend.

As shown, transparency is positively influenced by the quality of explanations, the accuracy of recommendations, opportunities for interaction, and perceived control. Hypotheses 1.1, 1.2, 1.3 and 1.4 can thus be considered confirmed. The influence of decision-making style is limited to the intuitive style. Therefore, hypothesis 1.5 can only be partially confirmed.

Effects of perceived transparency of the system. No effects can be observed for transparency with regard to interaction. It is possible that effects exist on factors that were not surveyed in this study. For the other transparency factors, however, significant positive effects can be observed.

Transparency regarding the functionality has the strongest and most diverse effects. If users can understand the internal mechanisms, they trust the recommendation system more. Direct positive effects can be observed on benevolence (0.248, p<.01) and trust in the competence (0.188, p<.05) of the system. Indirectly, such transparency thus contributes to a better evaluation of the system's effectiveness (indirect: Transparency-functionality → Trust-benevolence → effectiveness: 0.074, p<.01; indirect: Transparency-functionality → Trust-competence → effectiveness: 0.055, p<.05). Via the increase in effectiveness, overall satisfaction with the system is also promoted (indirect: Transparency-functionality → Trust-benevolence → effectiveness → overall satisfaction: 0.024, p<.05). Via the increase in perceived benevolence, the willingness to share information about oneself is also increased (indirect: Transparency-functionality → Trust-benevolence → Trust-information sharing: 0.072, p<.05). Moreover, via trust in competence, the willingness to follow the advice of the recommendation system is increased (indirect: Transparency-functionality → Trust-competence → Trust-follow advice: 0.071, p<.05). Thus, it is clear that an understanding of the internal mechanisms of a recommender system leads to trusting beliefs and thus to trusting actions and a positive overall evaluation.

Transparency with regard to the input has a negative effect. If users can see which data is used, this has a negative effect on the willingness to follow the advice of the recommendation system in this model (-0.144, p<.05). Thus, this shows a certain counterbalance to a transparent functionality, possibly triggered by too much information or a general distrust regarding data privacy. This shows that transparency can also have negative consequences; however, these turn out to be comparatively small.

Transparent output again has strong positive effects. If users can understand why the recommended item matches their preferences, this increases trust in the competence of the system (0.194, p<.01). Indirectly, transparency also promotes overall satisfaction via this increase in trust (indirect: Transparency-output → Trust-competence → overall satisfaction: 0.055, p<.05). Furthermore, the increase in transparency indirectly (indirect: Transparency-output → Trust-competence → effectiveness: 0.056, p<.05), but also directly (0.127, p<.05), contributes to a higher rating of the system's effectiveness. Indirectly, this in turn increases overall satisfaction with the system (indirect: Transparency-output → Trust-competence → effectiveness → overall satisfaction: 0.019, p<.05). Additionally, a better understanding of the output increases the willingness to follow the advice of the recommendation system (direct: 0.268, p<.001; indirect: Transparency-output → Trust-competence → Trust-follow advice: 0.073, p<.05).

As shown, the transparency factors have clear effects on trust in the system, on the evaluation of effectiveness and on overall satisfaction. Therefore, hypotheses 2.1, 2.2 and 2.3 can be considered confirmed. Thus, perceived transparency can also be viewed as a mediator of perceived control over the system, user characteristics, and other qualities of the system. The importance of the different factors of perceived transparency is shown by this differentiated assessment.
Discussion fectiveness, overall satisfaction with the system is also We aimed at developing a measurement tool that is specif- promoted (indirect: Transparency-functionality →Trust- ically focused on capturing the transparency of RS as benevolence →effectiveness →overall satisfaction: 0.024, perceived by users. In an initial interview study, con- p <.05). Via the increase in perceived benevolence, the cerns and uncertainties in relation to RS transparency willingness to share information about oneself is also were identified, which are well in line with the general increased (indirect: Transparency-functionality →Trust- AI-related questions compiled by [18]. This indicates that benevolence →Trust-information sharing: 0.072, p <.05). the scheme developed by these authors can be a useful Moreover, via trust in competence, the willingness to starting point for developing measures also for specific system state to their own goal to decide about the next systems such as RS, which address a wider range of users action, a stage defined by Norman [22] as evaluation. The beyond more expert users as in the original work by [18]. item “I can tell how well the recommendations match my Our confirmatory analyses confirmed our hypothesis preferences” from our scale relates to this stage, by assess- that subjective perceived transparency can be charac- ing explicitly the correspondence of the recommended terized by the factors: input, output, functionality and items with one’s own preferences. Items from the inter- interaction. Adequate reliability as well as convergent action group ("I know what needs to be changed in order and divergent validity was demonstrated, which indicates to get better recommendations") can be associated with that identified transparency factors can clearly be consid- intent formation and the downstream path in the action ered as independent, and they can be distinguished from cycle. 
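The four factors and the item types quoted above suggest a straightforward way of organizing responses. The sketch below rests on two labeled assumptions: the item stems are abbreviated from the quotations in the text (the full 13-item set appears in Table 1 of the paper and is not reproduced here), and averaging Likert ratings per factor is common scale-scoring practice, not a rule prescribed by the authors.

```python
# Organizing the validated item types by transparency factor. Stems are
# abbreviated from the paper's quotations; the full 13 items are in Table 1.
# Per-factor mean scoring is a common-practice ASSUMPTION, not the paper's rule.
from statistics import mean

FACTOR_ITEMS = {
    "input":         ["I understood what data was used ..."],
    "functionality": ["The system provided information about how ...",
                      "I understood how the system determined ..."],
    "output":        ["I can tell how well the recommendations match my preferences"],
    "interaction":   ["I know what needs to be changed in order to get better recommendations"],
}

def factor_scores(responses):
    """Mean Likert rating per transparency factor.

    responses: dict mapping factor name -> list of item ratings (e.g. 1-5).
    """
    return {factor: mean(ratings) for factor, ratings in responses.items()}

# One participant's hypothetical ratings, grouped by factor
scores = factor_scores({"input": [4], "functionality": [5, 4],
                        "output": [3], "interaction": [2]})
print(scores["functionality"])  # 4.5
```

Keeping the four factors separate, rather than collapsing them into one transparency score, is what allows the differentiated analysis described in this section.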
each other and also from other factors of the subjective As discussed by [23], designers can contribute to close evaluation of RS (trust, effectiveness, etc.). the gap between mental models (users’ idea on how the The identified factors in our analysis reflect the basic system works [22]), and the actual system’s functioning, components of RS as defined by [16], i.e., the input (what by providing output and functionalities reflecting an ad- data does the system use?), the functionality (how and equate system’s conceptual model, that can be “easily why is an item recommended?), and the output (why and perceived, interpreted and evaluated” [23]. The above how well does an item match one’s own preferences?). can in turn impact perceived transparency [19]. Conse- Additionally, the factor interaction could be extracted. quently, our instrument can contribute to a more compre- This factor is consistent with the category interaction hensive assessment of subjective perceived transparency, (what if / how to be that, what has to be changed for a by going beyond the one-dimensional construct address- different prediction?) of the prototypical questions to AI, ing a general "why-recommended" understanding, and formulated by [18]. assessing instead, the extent to which output and func- Furthermore, the final set of items can also be consid- tionalities reflecting the system’s conceptual model are ered through the lens of the different interaction stages in fact perceived, interpreted and evaluated. as defined by Norman [22]. In our examined context, for example, the stage perception relates to the presence of system functions that explicitly reveal information on 7. Conclusion and Outlook how the recommendations were derived, e.g. through The instrument developed can be seen as a first step explanations. Items of the type “The system provided in- towards assessing transparency in RS in a more com- formation about how. . . 
”, grouped under the factor func- prehensive and cognitively meaningful manner. Overall, tionality, could be validated, indicating that making in- reliability and construct validity of the developed mea- formation about the recommendation process observable surement instrument could be confirmed, identifying is a prerequisite for further cognitive processing. This four transparency factors (input, output, functionality, indicates that the evaluation of perceived transparency interaction) and resulting in a 13 item questionnaire (see should consider not only items related to users’ inter- Table 1). The expected influence of system aspects and pretation (i.e. “user understands”, as it has traditionally personal characteristics on the transparency factors could been evaluated in RS research), but also items related to be demonstrated for the developed factors with the excep- the presence and perception of transparency-related sys- tion of transparency regarding interaction, which may be tem functions (e.g. “user notices that the system actually due to the limited interaction possibilities in the applica- explains”). tions used by participants. Furthermore, we could show Once the user perceives a system output (e.g. the fea- the impact of different transparency aspects on trust in tures of a recommendation or an explanation), the next the system and on the overall evaluation of the system. stage is the interpretation of the system state, in which The differentiated assessment of transparency makes it users use their knowledge to interpret the new system possible to elaborate the significance of individual aspects state [22]: in our context, to assess the recommendation of transparency in more detail than it was possible with inferred by the system. Our validated final set includes previous measurement instruments. 
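Reliability of the kind reported for the questionnaire is conventionally estimated with Cronbach's coefficient alpha (cf. [32]). A minimal sketch of the standard formula follows; the example data are invented and the snippet is independent of the study's actual analysis.

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the
# respondents' total scores). Standard reliability estimate (cf. [32]);
# the example data below are invented.
from statistics import variance

def cronbach_alpha(items):
    """items: one list of respondent scores per questionnaire item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sums
    sum_item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

# Three perfectly consistent items yield maximal internal consistency
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))  # 1.0
```

In practice alpha is computed per factor, since averaging items across distinct factors would understate the internal consistency of each subscale.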
Thus, it could be items which are related to the interpretation stage, and shown that transparency with respect to functioning are of the type “I understood what data was used . . . ”, and output is of greater importance for the dependent which can be grouped under the factor input), or “I under- variables considered than transparency with respect to stood how the system determined . . . ”, grouped under the interaction and input. factor functionality of our developed scale. This group of The findings obtained here should be considered under items is also consistent with the definition of perceived the following limitations. Real systems were tested for transparency by [5], which focuses on the perceived un- this online study. On the one hand, this allowed us to ob- derstanding of the inner processes of RS. tain users’ views with respect to applications they were In a subsequent stage, users compare the interpreted familiar with and that were fully functional. On the other tended Abstracts on Human Factors in Computing hand, no controlled manipulation of influencing variables Systems (2002) 830–831. was possible. We also did not analyze the differences be- [4] N. Tintarev, J. Masthoff, Explaining Recommenda- tween the systems which would have required a larger tions: Design and Evaluation, in: F. Ricci, L. Rokach, sample, also addressing questions outside the scope of B. Shapira (Eds.), Recommender Systems Hand- the present study. An effect of explanations could only book, Springer, 2015, pp. 353–382. URL: https://doi. be shown for the factors input and functionality, partly org/10.1007/978-1-4899-7637-6_10. doi:10.1007/ mediated by perceived control, which may also be due to 978-1-4899-7637-6_10. the limited explanations provided by the systems used. [5] P. Pu, L. Chen, R. Hu, A user-centric evaluation In addition, only systems that were already known to the framework for recommender systems, in: Proceed- users were tested. 
Thus, a stronger expression of trust ings of the fifth ACM conference on Recommender and overall more positive evaluation might be expected. systems - RecSys 11, 2011, pp. 157–164. In terms of social desirability or self-overestimation, per- [6] H. Cramer, V. Evers, S. Ramlal, M. van Someren, ceived understanding might be valued higher than actual L. Rutledge, N. Stash, L. Aroyo, B. Wielinga, The understanding would lead one to expect. effects of transparency on trust in and acceptance Follow-up research should be guided by the limitations of a content-based art recommender, User Model. mentioned here for further validation of the measure- User-Adap. Inter. 18 (2008) 455–496. ment instrument. The degree of perceived transparency [7] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, should also be compared with actual, genuine understand- H. Soncu, C. Newell, Explaining the user experience ing using parallel qualitative methods [6]. Furthermore, of recommender systems, in: User Modeling and it is important to check to what extent the questionnaire User-Adapted Interaction, 2012, p. 441–504. is also able to evaluate systems that are unknown to [8] F. Gedikli, D. Jannach, M. Ge, How should i ex- the users. Assessing unfamiliar systems or specifically plain? a comparison of different explanation types designed prototypes would provide the opportunity to for recommender systems, International Journal of systematically vary components of the recommender Human-Computer Studies 72 (2014) 367–382. system (input, functionality, output), the quality of expla- [9] D. C. Hernandez-Bocanegra, J. Ziegler, Ex- nations, and/or the interaction possibilities [9]. Thus, the plaining review-based recommendations: Effects influence of these features on the transparency factors of profile transparency, presentation style and and likewise possible differences in their manifestation user characteristics, Journal of Interactive Me- should be further explored. dia 19 (2020) 181–200. 
doi:https://doi.org/10. Overall, a first validated version of a questionnaire 1515/icom-2020-0021. to assess perceived transparency can be presented. The [10] D. C. Hernandez-Bocanegra, J. Ziegler, Effects findings presented here also provide starting points for of interactivity and presentation on review-based research into further elucidating the multi-faceted con- explanations for recommendations, in: Human- cept of transparency. Computer Interaction – INTERACT 2021, Springer International Publishing, 2021, pp. 597–618. [11] N. Tintarev, J. Masthoff, Evaluating the effective- Acknowledgments ness of explanations for recommender systems, User Modeling and User-Adapted Interaction 22 This work was funded by the German Research Founda- (2012) 399–439. tion (DFG) under grant No. GRK 2167, Research Training [12] C.-H. Tsai, P. Brusilovsky, Explaining recommenda- Group “User-Centred Social Media”. tions in an interactive hybrid social recommender, in: 24th International Conference on Intelligent References User Interfaces (IUI 19), 2019, pp. 391–396. [13] A. Jameson, M. C. Willemsen, A. Felfernig, [1] N. Bostrom, E. Yudkowski, The Ethics of Artificial M. de Gemmis, P. Lops, G. Semeraro, L. Chen, Hu- Intelligence, in: W. Ramsey, K. Frankish (Eds.), Cam- man decision making and recommender systems, bridge Handbook of Artificial Intelligence, Cam- Recommender Systems Handbook (2015) 611–648. bridge University Press, 2014, pp. 316–334. [14] S. Dooms, T. D. Pessemier, L. Martens, A user- [2] U. S. a. H. S. C. (SHS), Recommendation on the centric evaluation of recommender algorithms for Ethics of Artificial Intelligence, Technical Report, an event recommendation system, in: Proceedings UNESCO, 2021. URL: https://unesdoc.unesco.org/ of the RecSys 2011: Workshop on Human Deci- ark:/48223/pf0000379920.page=14. sion Making in Recommender Systems (Decisions [3] R. Sinha, K. 
Swearingen, The role of transparency RecSys 11) and User-Centric Evaluation of Recom- in recommender systems, CHI EA ’02 CHI ’02 Ex- mender Systems and Their Interfaces - 2 (UCERSTI 2) affiliated with the 5th ACM Conference on Rec- recommender systems, in: Proceedings of 24th ommender Systems (RecSys 2011), 2011, p. 67–73. International Conference on Intelligent User Inter- [15] M. Bühner, Einführung in die Test- und Fragebo- faces (IUI 19), ACM, 2019, p. 379–390. genkonstruktion, Pearson Studium, Aufl. München, [29] T. Donkers, T. Kleemann, J. Ziegler, Explaining 2011. recommendations by means of aspect-based trans- [16] D. Jannach, M. Zanker, A. Felfernig, G. Friedrich, parent memories, in: Proceedings of the 25th Inter- Recommender Systems. An introduction, Cam- national Conference on Intelligent User Interfaces, bridge University Press, 2011. 2020, p. 166–176. [17] J. Lu, Q. Zhang, G. quan Zhang, Recommender Sys- [30] K. Hamilton, S.-I. Shih, S. Mohammed, The devel- tems. Advanced Developments, World Scientific opment and validation of the rational and intuitive Publishing, 2021. decision styles scale, Journal of Personality Assess- [18] Q. V. Liao, D. Gruen, S. Miller, Questioning the ment 98 (2016) 523–535. ai: Informing design practices for explainable ai [31] A. G. Yong, S. Pearce, A beginner’s guide to factor user experiences, Proceedings of the 2020 CHI analysis: Focusing on exploratory factor analysis, Conference on Human Factors in Computing Sys- Tutorials in Quantitative Methods for Psychology tems 9042 (2020) 1–15. doi:https://doi.org/10. 9 (2013) 79–94. 1145/3313831.3376590. [32] R. A. Peterson, A meta-analysis of cronbach’s co- [19] T. Ngo, J. Kunkel, J. Ziegler, Exploring mental mod- efficient alpha, Journal of Consumer Research 21 els for transparent and controllable recommender (1994) 381–391. systems: A qualitative study, in: Proceedings of the 28th ACM Conference on User Modeling, Adapta- tion and Personalization UMAP 20, 2020, pp. 183– 191. 
[20] D. Borsboom, G. J. Mellenbergh, J. van Heerden, The theoretical status of latent variables, Psycho- logical Review 110 (2003) 203–219. [21] J. Kunkel, T. Ngo, J. Ziegler, N. Krämer, Iden- tifying Group-Specific Mental Models of Recom- mender Systems: A Novel Quantitative Approach, in: Human-Computer Interaction – INTERACT 2021, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2021, pp. 383–404. doi:10.1007/978-3-030-85610-6_23. [22] D. A. Norman, Some Observations on Mental Mod- els, In Mental Models, Dedre Gentner and Albert L. Stevens (Eds.). Psychology Press, New York, NY, USA, 1983. [23] E. Hutchins, J. D. Hollan, D. A. Norman, Direct manipulation interfaces, Human-Computer Inter- action 1 (1985) 311–338. [24] Y. Zhang, X. Chen, Explainable recommendation: A survey and new perspectives, Foundations and Trends in Information Retrieval 14 (2020) 1–101. [25] K. Backhaus, B. Erichson, R. Weiber, Multivari- ate Analysemethoden. Eine anwendungsorientierte Einführung, Berlin 13. Aufl, 2011. [26] J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson, Multivariate data analysis. A global perspective, Boston 7. Aufl., 2010. [27] D. H. McKnight, V. Choudhury, C. Kacmar, Develop- ing and validating trust measures for e-commerce: An integrative typology, in: Information Systems Research, volume 13, 2002. [28] P. Kouki, J. Schaffer, J. Pujara, J. O’Donovan, L. Getoor, Personalized explanations for hybrid