<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Development of an Instrument for Measuring Users' Perception of Transparency in Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Hellmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diana C. Hernandez-Bocanegra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jürgen Ziegler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Duisburg-Essen</institution>
          ,
          <addr-line>Forsthausweg 2, 47057 Duisburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Transparency is increasingly seen as a critical requirement for achieving the goal of human-centered AI systems in general and also, specifically, recommender systems (RS). However, defining and operationalizing the concept is still difficult, due to its multi-faceted nature. Currently, there are hardly any measurement instruments to adequately assess the perceived transparency of RS in user studies. Thus, we present the development of a measurement instrument that aims at capturing perceived transparency as a multidimensional construct. The results of our validation show that transparency can be distinguished with respect to input (what data does the system use?), functionality (how and why is an item recommended?), output (why and how well does an item fit one's preferences?), and interaction (what needs to be changed for a different prediction?). The study is intended as a first iteration in the development of a reliable and fully validated measurement tool for assessing transparency in RS.</p>
      </abstract>
      <kwd-group>
<kwd>Recommender systems</kwd>
        <kwd>transparency</kwd>
        <kwd>explanations</kwd>
        <kwd>user study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The request for more transparency in intelligent systems has become steadily louder in recent years, formulated in academic research as well as in most public and corporate policies concerning the ethics of artificial intelligence [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. Although there is now broad agreement that transparency is of high relevance for developing human-centred AI systems, the concept is still elusive due to its multi-faceted nature and the different objectives it is intended to serve. The questions raised when asking for transparency include, for example, the system aspects that should be made transparent, or the riskiness of an AI function at an individual or societal level.</p>
      <p>A need for greater transparency has also been noted for recommender systems (RS), a frequent, user-facing type of AI-driven technology, to better support users in their decision-making and to avoid potentially negative consequences, e.g. users getting trapped in filter bubbles [<xref ref-type="bibr" rid="ref3">3</xref>]. Various methods have been proposed to this end, ranging from disclosing the user profile on which a recommendation is based to providing explicit explanations. Still, the multi-facetedness of the concept makes it difficult to design effective transparent RS. A central question that must be solved to this end is how the transparency of a RS can be measured and evaluated. While different aspects of the system, for example, the input data, the recommendation algorithm, or features of the recommended items, may be exposed to the user, transparency as a user-centric quality can only be assessed by measuring users' perception and understanding of those system aspects that are relevant for their decision making and trust in the system [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
      <p>Despite the acclaimed relevance of transparency in RS, the instruments available for measuring it from a user perspective are still very limited. Some instruments for assessing overall recommendation quality include a small number of items related to perceived transparency [<xref ref-type="bibr" rid="ref5">5</xref>], but these measures still seem far from covering the multiple facets involved. To the best of our knowledge, there is no instrument focusing specifically on RS transparency. A further shortcoming of existing instruments is the lack of sufficient consideration of the cognitive processes involved in users' understanding of recommendations and in their ability to influence the system according to their needs, if such influence is possible. Moreover, prior work often formulates the measurement of the construct transparency using only a single item ("I understood why the items were recommended to me"), this latter being a frequently used item for the evaluation of RS transparency.</p>
      <p>In this paper, we describe steps towards a more holistic and cognitively grounded psychometric instrument for measuring perceived transparency in RS. We first explain the questionnaire development process that resulted in a validated set of items specifically focused on RS transparency. The candidate items for this development were chosen to reflect the different steps involved in cognitively processing the information provided about the recommendation process and its output. To further validate the instrument, we performed an analysis of the effects of perceived transparency, as measured by our new instrument, on factors related to trust in the RS and effectiveness of the recommendations. An influence of transparency on users' trust in the system and on the acceptance of the recommendations has been suggested in prior research, e.g., in [<xref ref-type="bibr" rid="ref5">5</xref>]. We analyzed these influences through structural equation modeling to show that the construct 'transparency' as measured by our instrument has in fact the assumed effects.</p>
      <p>Consequently, we set out to formulate and validate a more comprehensive way to measure the perceived transparency of a RS, as described in the methods section. The development followed the typical procedure for psychometric measurement instruments (e.g. [15]):</p>
      <p>(1) To operationalize a target construct, first a larger number of candidate items is formulated and compiled. Here, we draw on the basic structure of RS ([16], [17]) and typical user questions related to artificial intelligence algorithms [18]. Second, items were also derived from a qualitative preliminary study, to further analyze the uncertainties in users' mental models, which can be understood as the notion that users have about how a system or a certain type of systems works [19].</p>
      <p>(2) We examined the factor structure of the transparency construct, which was formed as a reflective factor in the sense of classical test theory (see also [20]). We considered 4 factors that could group individual questionnaire items, and that might contribute to variances in perceived transparency, inspired by the dimensions defined by [18]: input ("what kind of data does the system learn from"), output ("what kind of output does the system give"), functionality ("how / why does the system make predictions") and interaction ("what if / how to be that": "what would the system predict if this instance changes to ..").</p>
      <p>(3) The developed measurement instrument was validated. For this purpose, the framework model of [<xref ref-type="bibr" rid="ref7">7</xref>] was used.</p>
      <p>Our contribution is thus twofold: we provide a systematically derived and validated measurement instrument for transparency in RS, and we can show that the different transparency factors represented in the questionnaire have an impact on the effectiveness of recommendations and trust in the system, albeit to different degrees.</p>
      <p>Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland. Contact: marco.hellmann@stud.uni-due.de (M. Hellmann); diana.hernandez-bocanegra@uni-due.de (D. C. Hernandez-Bocanegra); juergen.ziegler@uni-due.de (J. Ziegler). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>2. Related work</p>
      <p>
        Users’ perception of the transparency of a RS may be
influenced by several factors. Providing explanations is
one important aspect, and some studies have shown that
transparency is positively influenced by the quality of
the explanations given ([
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) and that it is related to
control over the system [
        <xref ref-type="bibr" rid="ref7">7</xref>
]. The effect of systematically
varied levels and styles of explanation on perceived
transparency has been studied and assessed via questionnaires
(see e.g. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Also, a positive influence of
interaction possibilities as well as perceived control on the
perceived transparency of the system was reported by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Transparency perception seems to be enhanced both by
the perceived quality of explanations and the perceived
accuracy or quality of recommendations. In addition, the
authors show a positive effect of transparency on trust
and, through trust, an indirect effect on purchase
intentions. According to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], this can be related to evaluating
the effectiveness of the RS. Moreover, studies suggest
that perceived transparency promotes satisfaction with
the system [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
<p>The influence of personal factors on the perception of recommender systems has often been investigated in the light of the general decision-making behavior of users (see [<xref ref-type="bibr" rid="ref13">13</xref>]). [<xref ref-type="bibr" rid="ref9">9</xref>] showed that individuals with a rational decision-making style trusted the recommender system tested more and rated its efficiency and effectiveness higher. Furthermore, they showed that individuals with an intuitive decision-making style rate the quality of explanations better.</p>
      <p>To date, however, few measurement tools exist to quantitatively assess the transparency of a RS as perceived by users. [<xref ref-type="bibr" rid="ref6">6</xref>] surveyed perceived transparency using two items (“I understand why the system recommended the artworks it did”; “I understand what the system bases its recommendations on”), in the domain of art objects. [<xref ref-type="bibr" rid="ref14">14</xref>] use a single item ("I did not understand why the items were recommended to me (reverse scale)"), for event recommendations. [<xref ref-type="bibr" rid="ref8">8</xref>] proposed an item that explicitly refers to explanations: "Which explanation interfaces are considered to be transparent by the users?". [<xref ref-type="bibr" rid="ref5">5</xref>] proposed an evaluation framework for RS, involving different domains and applications.</p>
      <p>2.0.1. Mental models and stages of cognitive processing</p>
      <p>Transparency is frequently discussed as if it were an objective property of a system. A system becomes transparent, however, only if its users can understand the transparency-related information, such as explicit explanations, and evaluate it with respect to their goals. The degree of comprehension may depend on the mental model users have about how the system works [21], based either on preconceptions, on previous experiences with similar systems, or on the interaction with and perception of the present system [22]. As discussed in [19], mental models that drift considerably from actual system functioning may result in broadening the "gulfs" described by [22]: 1) the gulf of execution, when the user's mental model is inaccurate in terms of how the system can be used to execute a task; 2) the gulf of evaluation, when the output (as a consequence of a user's action) differs from what is expected, according to the user's mental model.</p>
<p>To bridge these gulfs, users must process the information provided by the system at different cognitive levels. The items of the proposed questionnaire were formulated to reflect the action levels according to [23]. According to their model, the quality of interaction with the system can be described through a cycle of evaluation and execution. For example, at first, the user may perceive the output of the system (e.g., the recommendations and explanations), then interpret the information gathered (e.g., how the system works), and thereby evaluate the state of the system (e.g., performance of the system and quality of the output). As a consequence, the user formulates goals they aim to achieve with the system or matches their goals with the evaluation of the system (e.g., get more accurate or diverse recommendations). The user then pursues an intention (e.g., improve recommendations), which is translated into planning actions (e.g., change input), which they finally execute. While this cognitive cycle is well-known in the HCI field, it has hardly been applied in the investigation of transparency for AI-based systems.</p>
      <p>The authors in [23] assume that there are gaps between the users' goals and their knowledge about the system, and the extent to which the system provides descriptions of its functioning (the gulfs of execution and of evaluation, as mentioned beforehand). By taking actions to bridge those gaps (making system functions match goals, and making the output represent a “good conceptual model of the system that is easily perceived, interpreted and evaluated” [23]), system designers may contribute to minimizing cognitive effort by users [23], and to decreasing the discrepancy between the mental model of the system and its actual functioning, which may have an impact on the perception of transparency, as discussed by [19]. We argue, then, that a more comprehensive instrument to measure perceived transparency is still needed, so that such impact can be evaluated not only on the basis of general perceived understanding ("I understood why recommended"), but also on the basis of the extent to which output and functionalities that reflect the conceptual model of the system are perceived, interpreted and evaluated by users.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <p>To operationalize the construct of perceived transparency, we conducted the following steps, based on the typical procedure for developing measurement instruments (e.g., [15]): 1. Formulation and compilation of questionnaire items. 2. Examination of item quality and factor structure, based on an online study. 3. Validation of the measurement instrument. We describe each step below.</p>
      <p>3.1. Formulation and compilation of questionnaire items</p>
      <p>Here, we draw on the basic structure of RS ([16], [17]) and typical user questions to AI algorithms [18]. Candidate items were also chosen to cover different stages of the cognitive action cycle described in related work. Second, items were also derived from a qualitative pre-study, consisting of interviews with users, to further analyze the uncertainties in users' mental models [19] with regard to different commercial RS, like Netflix, Spotify or Amazon. A total of 6 interviews were conducted via video call, with voluntary participants. When selecting the interview partners, care was taken to represent different age groups and different levels of experience with Internet applications in the sample. Students and non-students from different age groups (20 to 50 years) were interviewed. Overall, previous exposure to recommender systems was equally strong among all participants. Only one interviewee had lower experience and one interviewee had slightly higher experience.</p>
      <p>The aim of the interviews was to capture the experience, perception and evaluation as well as possible questions of users regarding the functionality or transparency of recommender systems. The subjects were asked to explain the functionality of RS from their perspective and to create a corresponding sketch. Following this, uncertainties and possible lack of transparency were discussed. Finally, prototypical explanations from [24] for increasing the perceived transparency were evaluated by the interview partners. The explanations refer differently to the input used, the functionality and the output. In addition, they use different visual forms of representation, e.g. star ratings, profile lines, text. In this way, uncertainties as well as wishes for more transparency by users could be identified. Each question encountered in the interviews was directly transformed into one or more items.</p>
      <p>A resulting set of 92 items was collected and discussed by the research team, where linguistic revision and elimination of redundancies were also performed. The discussions led to a reduction of the set to 34 items, which were used as input for the online validation described in the next section.</p>
      <p>3.2. Online user study</p>
      <p>We conducted a user study to examine item quality and factor structure, as described below.</p>
      <p>Participants. We recruited 171 participants (89 female, mean age 29, age range 18 to 69) through the crowdsourcing platform Prolific. We restricted the task to workers in the U.S. and the U.K. with an approval rate greater than 98%. Participants were rewarded with £1.15.</p>
<p>Time devoted to the survey (in minutes): M = 13.2, SD = 7.33.</p>
      <p>We applied a quality check to select participants with quality survey responses: we included attention checks in the survey (e.g. “This is an attention check. Please click the option ‘Disagree’”). We discarded participants with at least 1 failed attention check, as well as those who did not finish the survey. Thus, the responses of 17 of the 192 initial Prolific respondents were discarded and not paid. 4 additional cases were removed due to suspicious response behavior, e.g. answering all questions within the same page with the same value. Thus, 171 cases were used for further analysis.</p>
      <p>The target sample size was chosen to allow performing CFA analysis. [25] (p. 389) recommends a minimum of n &gt; 50, or three times the number of indicators; [26] (p. 102) recommends a minimum of n &gt; 100, or five times the number of indicators. Thus, given that we wanted to evaluate a set of 34 items, the sample size was set to a minimum of 170 participants.</p>
      <p>Questionnaires. We utilized the set of 34 items resulting from the item formulation step described above. Additionally, aiming to further validate the final measurement instrument (4.3), we used items from [<xref ref-type="bibr" rid="ref5">5</xref>] to evaluate perception of control (how much users think they can influence the system), interaction adequacy, interface adequacy, information sufficiency and recommendation accuracy. Furthermore, we included items from [<xref ref-type="bibr" rid="ref7">7</xref>] to evaluate the perception of system effectiveness (construct perceived system effectiveness: the system is useful and helps the user to make better choices) and of trust in the system [27] (constructs trusting beliefs, with subconstructs benevolence, integrity, and competence: the user considers the system to be honest and trusts its recommendations; and trusting intentions: the user is willing to share information and to follow advice). We used items described in [28, 29] for explanation quality, and from [30] to evaluate decision-making style. All items were measured on a 1-5 Likert scale (1: Strongly disagree, 5: Strongly agree).</p>
      <p>Procedure. Participants were asked to choose a service from five applications, for which they were required to have an active account: Amazon, Spotify, Netflix, Tripadvisor, and Booking. Participants were instructed to open the application and browse it at their own discretion. They were explicitly told to select an item that was relevant to them and which they would actually buy or consume. A real purchase of items was explicitly not requested. Participants were asked to return to the survey after completing the task and to answer questions about the system they used.</p>
      <p>Data analysis. We performed an exploratory factor analysis (EFA) to further reduce the initial set of items and a confirmatory factor analysis (CFA) to test internal reliability and convergent validity. Furthermore, we evaluated discriminant validity of the resultant set of items in relation to other constructs of the subjective evaluation of RS, for example explanation quality, effectiveness and overall satisfaction, according to the frameworks defined by [<xref ref-type="bibr" rid="ref7">7</xref>] and [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>4. Results</p>
      <p>4.1. Exploratory Factor Analysis (EFA)</p>
      <p>The factor structure was examined exploratively, aiming to further reduce the set of items. A total of 5 EFAs with principal axis factoring and promax rotation were performed. First, items that did not have a unique principal loading or had a principal loading that was too low (&lt;.40) were removed. In the first 4 EFAs, 11 items were removed based on this criterion. Subsequently, a more stringent criterion was used (factor loadings &lt;.50); the guideline values are based on [31]. Thus, 2 further items were removed. This resulted in a 6-factorial structure with a total of 21 items and an explained variance of 62.45%. Reliability of the factors falls in the range ‘good’ to ‘very good’ (.782 to .888), as defined by [32]. The internal consistency across all items is .867.</p>
      <p>4.2. Confirmatory Factor Analysis (CFA)</p>
      <p>Following the exploration of the factor structure, the result obtained was tested for internal reliability and convergent validity using confirmatory analysis. A first CFA was performed, resulting in 8 items with low factor loadings, which were eliminated from the set. Two factors were removed in the process because they did not load on a second-level overall transparency factor. A final CFA with 4 factors was performed (model fit: X2 = 86.997, df = 61, p = .016; X2/df = 1.426; CFI = .975; TLI = .968; RMSEA = .050; SRMR = .047). Reliability across all items is equal to .884. This model comprises a final set of four factors and 13 items, which are reported in Table 1 along with factor loadings.</p>
      <p>The four factors identified can be associated with the concepts Input, composed of 3 items, Output, also with 3 items, Functionality, with 5 items, and Interaction, with only 2 items. Although the initial item set comprised questions for all stages of the cognitive action cycle, after CFA, items related to the perception level were only left for the factor Functionality, comprising questions about whether users are aware of transparency-related information if provided by the system (e.g.: "The system provided information about how well the recommendations match my preferences"). This factor covers mostly perception-related questions.</p>
      <p>Table 1. Items of the final measurement instrument (factor loadings and Cronbach's alpha per factor are reported in the original table):</p>
      <p>It was clear to me what kind of data the system uses to generate recommendations.</p>
      <p>I understood what data was used by the system to infer my preferences.</p>
      <p>I understood which item characteristics were considered to generate recommendations.</p>
      <p>I understood why the items were recommended to me.</p>
      <p>I understood why the system determined that the recommended items would suit me.</p>
      <p>I can tell how well the recommendations match my preferences.</p>
      <p>The system provided information to understand why the items were recommended.</p>
      <p>The system provided information about how the quality of the items was determined.</p>
      <p>The system provided information about how my preferences were inferred.</p>
      <p>The system provided information about how well the recommendations match my preferences.</p>
      <p>I understood how the quality of the items was determined by the system.</p>
<p>I know what actions to perform in the system so that it generates better recommendations.</p>
      <p>I know what needs to be changed in order to get better recommendations.</p>
      <p>
The missing coverage of perception-related items in other factors is likely due to limitations of the systems used for the online study, which do not, for example, provide access to the data on which recommendations are based, thus preventing users from becoming aware of input data. The factor Output comprises items related to the interpretation and evaluation stages. The factor Interaction has the smallest scope, with 2 items, and covers only the facets of action planning and action execution. This factor thus describes whether users know which actions they would have to perform if they wanted to receive other recommendations.
      </p>
      <p>4.3. Discriminant validity of the measurement instrument</p>
      <p>We determined discriminant validity of the instrument in relation to other constructs of the subjective evaluation of RS, for example explanation quality, effectiveness and overall satisfaction, according to the frameworks defined by [<xref ref-type="bibr" rid="ref7">7</xref>] and [<xref ref-type="bibr" rid="ref5">5</xref>]. Discriminant validity was assessed using inter-construct correlations (see results in Table 2). We found that the squared correlations between pairs of constructs were all less than the average variances shown on the diagonal, representing “a level of appropriate discriminant validity” [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>Table 2. Inter-construct correlation matrix. Average Variance Extracted (AVE) on the main diagonal; correlations below the diagonal; squared correlations above the diagonal. Target value for AVE ≥ .5. p&lt;0.05*, p&lt;0.01**.</p>
      <p>5. Structural Equation Model (SEM)</p>
      <p>To explore the relation between the transparency factors assessed by the questionnaire and the effects of perceived transparency on recommendation effectiveness and trust in the system, as well as the impact of factors influencing transparency, we set up a Structural Equation Model (SEM). The model is based on hypotheses we derived from existing research that has shown a positive effect of higher overall transparency on the perception of the recommendations and the overall system. Furthermore, we assumed that transparency is influenced by system-related aspects (accuracy, interaction quality, and explanation quality) as well as by personal characteristics (such as decision-making behavior), as described in the related work section. Some of these factors can also be expected to influence perceived control over the system, a construct that may mediate the impact of these factors on transparency perception. This led us to formulating the hypotheses shown in Table 3.</p>
      <p>In the following, the relationships of the factors in the structural equation model are presented (see Fig. 1). Only significant paths with standardized path coefficients are shown. Indirect effects are only considered for the transparency factors relevant here. The final model is shown to have a very good fit: X2 = 75.767, df = 57, p = .049; X2/df = 1.329; CFI = .980; TLI = .965; RMSEA = .044; SRMR = .072. The model is thus adequate to describe the relationships in the data set.</p>
      <p>Influences on perceived transparency of the system. Transparency with respect to interaction is rated higher when users are more likely to exhibit an intuitive decision-making style (0.186, p &lt;.05) and when users report higher perceived control (0.293, p &lt;.001). The latter is increased by the quality of interaction (0.502, p &lt;.001) and of explanations (0.341, p &lt;.001). Users thus know better how to influence recommendations when they have more opportunities to interact with the system, and can gather information about the system through explanations as well as through ‘trial and error’ (indirect: explanation quality → control → Transparency-interaction: 0.100, p &lt; .01; interaction quality → control → Transparency-interaction: 0.147, p &lt; .001). Similar observations can be made for functionality: again, transparency is rated higher when users are more likely to exhibit an intuitive decision-making style (0.141, p &lt;.05) and when users report higher perceived control (0.261).</p>
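      <p>The indirect effects reported for the SEM are the products of the standardized path coefficients along the mediating chain. As a quick numerical check, the values quoted above can be reproduced with a few lines of code (a generic sketch; the helper function is ours for illustration, and the SEM estimation itself, which also yields the significance levels, is of course not replicated here):</p>
      <p>
```python
# Indirect effect along a mediation chain = product of the
# standardized path coefficients on that chain.
def indirect_effect(*path_coefficients):
    product = 1.0
    for coefficient in path_coefficients:
        product *= coefficient
    return product

# Explanation quality -> control -> Transparency-interaction
print(round(indirect_effect(0.341, 0.293), 3))  # 0.1
# Interaction quality -> control -> Transparency-interaction
print(round(indirect_effect(0.502, 0.293), 3))  # 0.147
```
</p>
      <p>The same check holds for the input paths reported below (e.g. 0.341 × 0.200 ≈ 0.068 and 0.502 × 0.200 ≈ 0.100), which is a useful sanity check when reading SEM path diagrams.</p>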
      <p>The input is perceived as more transparent the better users can interact with the system. Thus, here again, perceived control has a direct positive effect (0.200, p &lt;.01), and the quality of the interaction again has an indirect effect (indirect: interaction quality → control → Transparency-input: 0.100, p &lt; .05). Similarly to what has already been shown with regard to functionality, the quality of the explanations also has a direct, positive effect on transparency of the input (0.209, p &lt;.01), in addition to the indirect effect (explanation quality → control → Transparency-input: 0.068, p &lt; .05).</p>
      <p>The transparency of the output shows how well users can assess why a recommendation is made or should match the user's preferences. This is directly increased by the quality of the interaction with the system (0.311, p &lt;.001), i.e. when possibilities are offered or used to indicate one's own preferences. On the other hand, there are no direct or indirect influences of the explanations. Instead, the accuracy of the recommendation has a positive influence on the transparency of the output (0.454, p &lt;.001). Accordingly, the output is easier to understand if it is rated as suitable. Unsuitable recommendations would thus be more difficult for the user to comprehend.</p>
      <p>As shown, transparency is positively influenced by the quality of explanations, accuracy of recommendations, opportunities for interaction, and perceived control. Hypotheses 1.1, 1.2, 1.3 and 1.4 can thus be considered confirmed. The influence of the decision-making style is limited to the intuitive style. Therefore, hypothesis 1.5 can only be partially confirmed.</p>
      <p>Effects of perceived transparency of the system. No effects can be observed for transparency with regard to interaction. It is possible that effects exist on factors that were not surveyed in this study. For the other transparency factors, however, significant positive effects can be observed.</p>
      <p>Transparency regarding the functionality has the strongest and most diverse effect. If users can understand the internal mechanisms, they trust the recommendation system more. Direct positive effects can be observed on benevolence (0.248, p &lt;.01) and on trust in the competence of the system; in addition, the willingness to follow the advice of the recommendation system is increased (indirect: Transparency-functionality → Trust-competence → Trust-follow advice: 0.071, p &lt;.05). Thus, it is clear that an understanding of the internal mechanisms of recommender systems leads to trusting beliefs and thus to trusting actions and a positive overall evaluation.</p>
      <p>Transparency with regard to the input has a negative effect. If users can see which data is used, this has a negative effect on the willingness to follow the advice of the recommendation system in this model (-0.144, p &lt;.05). This constitutes a certain counterbalance to a transparent functionality, possibly triggered by too much information or a general distrust regarding data privacy. Transparency can thus also have negative consequences; however, these turn out to be comparatively small.</p>
      <p>Transparent output again has strong positive effects. If users can understand why the recommended item matches their preferences, this increases trust in the competence of the system (0.194, p &lt;.01). Indirectly, transparency also promotes overall satisfaction via this increase in trust (indirect: Transparency-output → Trust-competence → overall satisfaction: 0.055, p &lt;.05). Furthermore, the increase in transparency contributes indirectly (indirect: Transparency-output → Trust-competence → effectiveness: 0.056, p &lt;.05), but also directly (0.127, p &lt;.05), to a higher rating of the system's effectiveness. Indirectly, this in turn increases overall satisfaction with the system (indirect: Transparency-output → Trust-competence → effectiveness → overall satisfaction: 0.019, p &lt;.05). Additionally, it increases the willingness to follow the advice of the recommendation system when users better understand the output (direct: 0.268, p &lt;.001; indirect: Transparency-output → Trust-competence → Trust-follow advice: 0.073, p &lt;.05).</p>
      <p>As shown, the transparency factors have clear effects on trust in the system, on the evaluation of effectiveness and on overall satisfaction. Therefore, hypotheses 2.1, 2.2 and 2.3 can be considered confirmed. Thus, perceived transparency can also be viewed as a mediator of perceived control over the system, user characteristics, and other
tence (0.188, p &lt;.05) of the system. Indirectly, such trans- qualities of the system. The importance of the diferent
parency thus contributes to a better evaluation of the sys- factors of perceived transparency can be shown by the
tem’s efectiveness (indirect: Transparency-functionality diferentiated assessment.
→Trust-benevolence →efectiveness: 0.074, p &lt;.01;
indirect: Transparency-functionality →Trust-competence
→efectiveness: 0.055, p &lt;.05). Via the increase in ef- 6. Discussion
fectiveness, overall satisfaction with the system is also
promoted (indirect: Transparency-functionality →Trust- We aimed at developing a measurement tool that is
specifbenevolence →efectiveness →overall satisfaction: 0.024, ically focused on capturing the transparency of RS as
p &lt;.05). Via the increase in perceived benevolence, the perceived by users. In an initial interview study,
conwillingness to share information about oneself is also cerns and uncertainties in relation to RS transparency
increased (indirect: Transparency-functionality →Trust- were identified, which are well in line with the general
benevolence →Trust-information sharing: 0.072, p &lt;.05). AI-related questions compiled by [18]. This indicates that
Moreover, via trust in competence, the willingness to the scheme developed by these authors can be a useful
starting point for developing measures also for specific system state to their own goal to decide about the next
systems such as RS, which address a wider range of users action, a stage defined by Norman [ 22] as evaluation. The
beyond more expert users as in the original work by [18]. item “I can tell how well the recommendations match my</p>
      <p>Our confirmatory analyses confirmed our hypothesis preferences” from our scale relates to this stage, by
assessthat subjective perceived transparency can be charac- ing explicitly the correspondence of the recommended
terized by the factors: input, output, functionality and items with one’s own preferences. Items from the
interinteraction. Adequate reliability as well as convergent action group ("I know what needs to be changed in order
and divergent validity was demonstrated, which indicates to get better recommendations") can be associated with
that identified transparency factors can clearly be consid- intent formation and the downstream path in the action
ered as independent, and they can be distinguished from cycle.
each other and also from other factors of the subjective As discussed by [23], designers can contribute to close
evaluation of RS (trust, efectiveness, etc.). the gap between mental models (users’ idea on how the</p>
      <p>The identified factors in our analysis reflect the basic system works [22]), and the actual system’s functioning,
components of RS as defined by [ 16], i.e., the input (what by providing output and functionalities reflecting an
addata does the system use?), the functionality (how and equate system’s conceptual model, that can be “easily
why is an item recommended?), and the output (why and perceived, interpreted and evaluated” [23]. The above
how well does an item match one’s own preferences?). can in turn impact perceived transparency [19].
ConseAdditionally, the factor interaction could be extracted. quently, our instrument can contribute to a more
compreThis factor is consistent with the category interaction hensive assessment of subjective perceived transparency,
(what if / how to be that, what has to be changed for a by going beyond the one-dimensional construct
addressdiferent prediction?) of the prototypical questions to AI, ing a general "why-recommended" understanding, and
formulated by [18]. assessing instead, the extent to which output and
func</p>
      <p>Furthermore, the final set of items can also be consid- tionalities reflecting the system’s conceptual model are
ered through the lens of the diferent interaction stages in fact perceived, interpreted and evaluated.
as defined by Norman [ 22]. In our examined context, for
example, the stage perception relates to the presence of
system functions that explicitly reveal information on 7. Conclusion and Outlook
how the recommendations were derived, e.g. through
explanations. Items of the type “The system provided in- The instrument developed can be seen as a first step
formation about how. . . ”, grouped under the factor func- towards assessing transparency in RS in a more
comtionality, could be validated, indicating that making in- prehensive and cognitively meaningful manner. Overall,
formation about the recommendation process observable reliability and construct validity of the developed
meais a prerequisite for further cognitive processing. This surement instrument could be confirmed, identifying
indicates that the evaluation of perceived transparency four transparency factors (input, output, functionality,
should consider not only items related to users’ inter- interaction) and resulting in a 13 item questionnaire (see
pretation (i.e. “user understands”, as it has traditionally Table 1). The expected influence of system aspects and
been evaluated in RS research), but also items related to personal characteristics on the transparency factors could
the presence and perception of transparency-related sys- be demonstrated for the developed factors with the
exceptem functions (e.g. “user notices that the system actually tion of transparency regarding interaction, which may be
explains”). due to the limited interaction possibilities in the
applica</p>
      <p>
        Once the user perceives a system output (e.g. the fea- tions used by participants. Furthermore, we could show
tures of a recommendation or an explanation), the next the impact of diferent transparency aspects on trust in
stage is the interpretation of the system state, in which the system and on the overall evaluation of the system.
users use their knowledge to interpret the new system The diferentiated assessment of transparency makes it
state [22]: in our context, to assess the recommendation possible to elaborate the significance of individual aspects
inferred by the system. Our validated final set includes of transparency in more detail than it was possible with
items which are related to the interpretation stage, and previous measurement instruments. Thus, it could be
are of the type “I understood what data was used . . . ”, shown that transparency with respect to functioning
which can be grouped under the factor input), or “I under- and output is of greater importance for the dependent
stood how the system determined . . . ”, grouped under the variables considered than transparency with respect to
factor functionality of our developed scale. This group of interaction and input.
items is also consistent with the definition of perceived The findings obtained here should be considered under
transparency by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which focuses on the perceived un- the following limitations. Real systems were tested for
derstanding of the inner processes of RS. this online study. On the one hand, this allowed us to
ob
      </p>
      <p>In a subsequent stage, users compare the interpreted tain users’ views with respect to applications they were
familiar with and that were fully functional. On the other
hand, no controlled manipulation of influencing variables
was possible. We also did not analyze the diferences
between the systems which would have required a larger
sample, also addressing questions outside the scope of
the present study. An efect of explanations could only
be shown for the factors input and functionality, partly
mediated by perceived control, which may also be due to
the limited explanations provided by the systems used.</p>
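      <p>The serial indirect effects reported in Section 5 follow the usual product-of-paths logic for mediation: a mediated effect is the product of the standardized coefficients along its chain. A minimal sketch of this bookkeeping (only the 0.194 path is taken from our results; the other two coefficients are illustrative placeholders, not estimates from this study):

```python
# Serial mediation: an indirect effect equals the product of the
# standardized path coefficients along the mediation chain.
paths = {
    ("transparency_output", "trust_competence"): 0.194,  # reported direct effect
    ("trust_competence", "effectiveness"): 0.29,         # illustrative placeholder
    ("effectiveness", "overall_satisfaction"): 0.34,     # illustrative placeholder
}

def indirect_effect(chain, paths):
    """Multiply the coefficients of consecutive variable pairs in the chain."""
    effect = 1.0
    for src, dst in zip(chain, chain[1:]):
        effect *= paths[(src, dst)]
    return effect

chain = ["transparency_output", "trust_competence",
         "effectiveness", "overall_satisfaction"]
print(round(indirect_effect(chain, paths), 3))  # prints 0.019
```

With these placeholder values the three-path chain multiplies out to 0.019, i.e. the same order of magnitude as the serial indirect effects reported above, which is why such effects are small even when every individual path is substantial.</p>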
      <p>In addition, only systems that were already known to the
users were tested. Thus, a stronger expression of trust
and overall more positive evaluation might be expected.</p>
      <p>In terms of social desirability or self-overestimation,
perceived understanding might be valued higher than actual
understanding would lead one to expect.</p>
      <p>
        Follow-up research should be guided by the limitations mentioned here for further validation of the measurement instrument. The degree of perceived transparency should also be compared with actual understanding using parallel qualitative methods [<xref ref-type="bibr" rid="ref6">6</xref>]. Furthermore, it is important to check to what extent the questionnaire is also able to evaluate systems that are unknown to the users. Assessing unfamiliar systems or specifically designed prototypes would provide the opportunity to systematically vary components of the recommender system (input, functionality, output), the quality of explanations, and/or the interaction possibilities [<xref ref-type="bibr" rid="ref9">9</xref>]. Thus, the influence of these features on the transparency factors, and likewise possible differences in their manifestation, should be further explored.
      </p>
      <p>Overall, a first validated version of a questionnaire to assess perceived transparency can be presented. The findings presented here also provide starting points for research into further elucidating the multi-faceted concept of transparency.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work was funded by the German Research
Foundation (DFG) under grant No. GRK 2167, Research Training
Group “User-Centred Social Media”.
</p>
    </sec>
    <sec id="sec-4">
      <title>References (continued)</title>
      <p>[15] M. Bühner, Einführung in die Test- und Fragebogenkonstruktion, Pearson Studium, München, 2011.</p>
      <p>[16] D. Jannach, M. Zanker, A. Felfernig, G. Friedrich, Recommender Systems: An Introduction, Cambridge University Press, 2011.</p>
      <p>[17] J. Lu, Q. Zhang, G. Zhang, Recommender Systems: Advanced Developments, World Scientific Publishing, 2021.</p>
      <p>[18] Q. V. Liao, D. Gruen, S. Miller, Questioning the AI: Informing design practices for explainable AI user experiences, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–15. doi:10.1145/3313831.3376590.</p>
      <p>[19] T. Ngo, J. Kunkel, J. Ziegler, Exploring mental models for transparent and controllable recommender systems: A qualitative study, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization (UMAP 20), 2020, pp. 183–191.</p>
      <p>[20] D. Borsboom, G. J. Mellenbergh, J. van Heerden, The theoretical status of latent variables, Psychological Review 110 (2003) 203–219.</p>
      <p>[21] J. Kunkel, T. Ngo, J. Ziegler, N. Krämer, Identifying group-specific mental models of recommender systems: A novel quantitative approach, in: Human-Computer Interaction – INTERACT 2021, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2021, pp. 383–404. doi:10.1007/978-3-030-85610-6_23.</p>
      <p>[22] D. A. Norman, Some observations on mental models, in: D. Gentner, A. L. Stevens (Eds.), Mental Models, Psychology Press, New York, NY, USA, 1983.</p>
      <p>[23] E. Hutchins, J. D. Hollan, D. A. Norman, Direct manipulation interfaces, Human-Computer Interaction 1 (1985) 311–338.</p>
      <p>[24] Y. Zhang, X. Chen, Explainable recommendation: A survey and new perspectives, Foundations and Trends in Information Retrieval 14 (2020) 1–101.</p>
      <p>[25] K. Backhaus, B. Erichson, R. Weiber, Multivariate Analysemethoden: Eine anwendungsorientierte Einführung, 13. Aufl., Berlin, 2011.</p>
      <p>[26] J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson, Multivariate Data Analysis: A Global Perspective, 7. Aufl., Boston, 2010.</p>
      <p>[27] D. H. McKnight, V. Choudhury, C. Kacmar, Developing and validating trust measures for e-commerce: An integrative typology, Information Systems Research 13 (2002).</p>
      <p>[28] P. Kouki, J. Schafer, J. Pujara, J. O’Donovan, L. Getoor, Personalized explanations for hybrid recommender systems, in: Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI 19), ACM, 2019, pp. 379–390.</p>
      <p>[29] T. Donkers, T. Kleemann, J. Ziegler, Explaining recommendations by means of aspect-based transparent memories, in: Proceedings of the 25th International Conference on Intelligent User Interfaces, 2020, pp. 166–176.</p>
      <p>[30] K. Hamilton, S.-I. Shih, S. Mohammed, The development and validation of the rational and intuitive decision styles scale, Journal of Personality Assessment 98 (2016) 523–535.</p>
      <p>[31] A. G. Yong, S. Pearce, A beginner’s guide to factor analysis: Focusing on exploratory factor analysis, Tutorials in Quantitative Methods for Psychology 9 (2013) 79–94.</p>
      <p>[32] R. A. Peterson, A meta-analysis of Cronbach’s coefficient alpha, Journal of Consumer Research 21 (1994) 381–391.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bostrom</surname>
          </string-name>
, E. Yudkowsky,
          <source>The Ethics of Artificial Intelligence</source>
          , in: W. Ramsey,
K. Frankish (Eds.),
          <source>Cambridge Handbook of Artificial Intelligence</source>
          , Cambridge University Press,
          <year>2014</year>
          , pp.
          <fpage>316</fpage>
          -
          <lpage>334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>U. S. a. H. S. C.</surname>
          </string-name>
          (SHS),
          <source>Recommendation on the Ethics of Artificial Intelligence</source>
          ,
          <source>Technical Report, UNESCO</source>
          ,
          <year>2021</year>
. URL: https://unesdoc.unesco.org/ark:/48223/pf0000379920.page=14.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swearingen</surname>
          </string-name>
          ,
          <article-title>The role of transparency in recommender systems</article-title>
          ,
          <source>CHI EA '02 CHI '02 Extended Abstracts on Human Factors in Computing Systems</source>
          (
          <year>2002</year>
          )
          <fpage>830</fpage>
          -
          <lpage>831</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tintarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthof</surname>
          </string-name>
          , Explaining Recommendations:
          <article-title>Design and Evaluation</article-title>
          , in: F.
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          Shapira (Eds.),
          <source>Recommender Systems Handbook</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>382</lpage>
. URL: https://doi.org/10.1007/978-1-4899-7637-6_10. doi:10.1007/978-1-4899-7637-6_10.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>A user-centric evaluation framework for recommender systems</article-title>
          ,
          <source>in: Proceedings of the fifth ACM conference on Recommender systems - RecSys 11</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Evers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramlal</surname>
          </string-name>
          , M. van
          <string-name>
            <surname>Someren</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Rutledge</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Stash</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Aroyo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Wielinga</surname>
          </string-name>
          ,
<article-title>The effects of transparency on trust in and acceptance of a content-based art recommender</article-title>, <source>User Modeling and User-Adapted Interaction</source>
          <volume>18</volume>
          (
          <year>2008</year>
          )
          <fpage>455</fpage>
          -
          <lpage>496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Knijnenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Willemsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Soncu</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
Newell, <article-title>Explaining the user experience of recommender systems</article-title>, <source>User Modeling and User-Adapted Interaction</source>
          ,
          <year>2012</year>
          , p.
          <fpage>441</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gedikli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <article-title>How should i explain? a comparison of diferent explanation types for recommender systems</article-title>
          ,
          <source>International Journal of Human-Computer Studies</source>
          <volume>72</volume>
          (
          <year>2014</year>
          )
          <fpage>367</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Hernandez-Bocanegra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <article-title>Explaining review-based recommendations: Efects of profile transparency, presentation style and user characteristics</article-title>
          ,
          <source>Journal of Interactive Media</source>
          <volume>19</volume>
          (
          <year>2020</year>
          )
          <fpage>181</fpage>
          -
          <lpage>200</lpage>
          . doi:https://doi.org/10. 1515/icom-2020-0021.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Hernandez-Bocanegra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <article-title>Efects of interactivity and presentation on review-based explanations for recommendations</article-title>
          ,
          <source>in: HumanComputer Interaction - INTERACT 2021</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>597</fpage>
          -
          <lpage>618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tintarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthof</surname>
          </string-name>
          ,
          <article-title>Evaluating the efectiveness of explanations for recommender systems, User Modeling and User-Adapted Interaction 22 (</article-title>
          <year>2012</year>
          )
          <fpage>399</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>C.-H. Tsai</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <article-title>Explaining recommendations in an interactive hybrid social recommender</article-title>
          ,
          <source>in: 24th International Conference on Intelligent User Interfaces (IUI 19)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>391</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jameson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Willemsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Felfernig</surname>
          </string-name>
          , M. de Gemmis, P. Lops,
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          , L. Chen,
          <article-title>Human decision making and recommender systems, Recommender Systems Handbook (</article-title>
          <year>2015</year>
          )
          <fpage>611</fpage>
          -
          <lpage>648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dooms</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Pessemier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <article-title>A usercentric evaluation of recommender algorithms for an event recommendation system</article-title>
          ,
          <source>in: Proceedings of the RecSys</source>
          <year>2011</year>
          :
<article-title>Workshop on Human Decision Making in Recommender Systems (Decisions RecSys 11) and User-Centric Evaluation of Recommender Systems and Their Interfaces - 2 (UCERSTI 2)</article-title>, affiliated with the 5th ACM Conference on Recommender Systems (RecSys 2011), 2011, pp. 67–73.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>