=Paper= {{Paper |id=Vol-3124/paper17 |storemode=property |title=Development of an Instrument for Measuring Users' Perception of Transparency in Recommender Systems |pdfUrl=https://ceur-ws.org/Vol-3124/paper17.pdf |volume=Vol-3124 |authors=Marco Hellman,Diana C. Hernandez-Bocanegra,Jürgen Ziegler |dblpUrl=https://dblp.org/rec/conf/iui/HellmanH022 }} ==Development of an Instrument for Measuring Users' Perception of Transparency in Recommender Systems== https://ceur-ws.org/Vol-3124/paper17.pdf
Development of an Instrument for Measuring Users’
Perception of Transparency in Recommender Systems
Marco Hellmann, Diana C. Hernandez-Bocanegra and Jürgen Ziegler
University of Duisburg-Essen, Forsthausweg 2, 47057 Duisburg, Germany


Abstract
Transparency is increasingly seen as a critical requirement for achieving the goal of human-centered AI systems in general and also, specifically, recommender systems (RS). However, defining and operationalizing the concept is still difficult, due to its multi-faceted nature. Currently, there are hardly any measurement instruments to adequately assess the perceived transparency of RS in user studies. Thus, we present the development of a measurement instrument that aims at capturing perceived transparency as a multidimensional construct. The results of our validation show that transparency can be distinguished with respect to input (what data does the system use?), functionality (how and why is an item recommended?), output (why and how well does an item fit one's preferences?), and interaction (what needs to be changed for a different prediction?). The study is intended as a first iteration in the development of a reliable and fully validated measurement tool for assessing transparency in RS.

Keywords
Recommender systems, transparency, explanations, user study



Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland
marco.hellmann@stud.uni-due.de (M. Hellmann); diana.hernandez-bocanegra@uni-due.de (D. C. Hernandez-Bocanegra); juergen.ziegler@uni-due.de (J. Ziegler)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

The request for more transparency in intelligent systems has become steadily louder in recent years, formulated in academic research as well as in most public and corporate policies concerning the ethics of artificial intelligence [1, 2]. Although there is now broad agreement that transparency is of high relevance for developing human-centred AI systems, the concept is still elusive due to its multi-faceted nature and the different objectives it is intended to serve. The questions raised when asking for transparency include, for example, which system aspects should be made transparent, or the riskiness of an AI function at an individual or societal level.

A need for greater transparency has also been noted for recommender systems (RS), a frequent, user-facing type of AI-driven technology, to better support users in their decision-making and to avoid potentially negative consequences, e.g. users getting trapped in filter bubbles [3]. Various methods have been proposed to this end, ranging from disclosing the user profile on which a recommendation is based to providing explicit explanations. Still, the multi-facetedness of the concept makes it difficult to design effective transparent RS. A central question that must be solved to this end is how the transparency of a RS can be measured and evaluated. While different aspects of the system, for example, the input data, the recommendation algorithm, or features of the recommended items, may be exposed to the user, transparency as a user-centric quality can only be assessed by measuring users' perception and understanding of those system aspects that are relevant for their decision making and trust in the system [4].

Despite the acclaimed relevance of transparency in RS, the instruments available for measuring it from a user perspective are still very limited. Some instruments for assessing overall recommendation quality include a small number of items related to perceived transparency [5], but these measures still seem far from covering the multiple facets involved. To the best of our knowledge, there is no instrument focusing specifically on RS transparency. A further shortcoming of existing instruments is that they do not sufficiently consider the cognitive processes involved in users' understanding of recommendations and in their ability to influence the system according to their needs, if such influence is possible.

In this paper, we describe steps towards a more holistic and cognitively grounded psychometric instrument for measuring perceived transparency in RS. We first explain the questionnaire development process that resulted in a validated set of items specifically focused on RS transparency. The candidate items for this development were chosen to reflect the different steps involved in cognitively processing the information provided about the recommendation process and its output. To further validate the instrument, we performed an analysis of the effects of perceived transparency, as measured by our new instrument, on factors related to trust in the RS and effectiveness of the recommendations. An influence of transparency on users' trust in the system and on the acceptance of the recommendations has been suggested in
prior research, e.g., in [5]. We analyzed these influences through structural equation modeling to show that the construct 'transparency' as measured by our instrument has in fact the assumed effects.

Our contribution is thus twofold: we provide a systematically derived and validated measurement instrument for transparency in RS, and we can show that the different transparency factors represented in the questionnaire have an impact on the effectiveness of recommendations and trust in the system, albeit to different degrees.


2. Related work

Users' perception of the transparency of a RS may be influenced by several factors. Providing explanations is one important aspect, and some studies have shown that transparency is positively influenced by the quality of the explanations given ([5], [6]) and that it is related to control over the system [7]. The effect of systematically varied levels and styles of explanation on perceived transparency has been studied and assessed via questionnaires (see e.g. [8], [9], [10]). Also, a positive influence of interaction possibilities as well as perceived control on the perceived transparency of the system was reported by [5]. Transparency perception seems to be enhanced both by the perceived quality of explanations and the perceived accuracy or quality of recommendations. In addition, the authors show a positive effect of transparency on trust and, through trust, an indirect effect on purchase intentions. According to [11], this can be related to evaluating the effectiveness of the RS. Moreover, studies suggest that perceived transparency promotes satisfaction with the system [12], [7].

The influence of personal factors on the perception of recommender systems has often been investigated in the light of the general decision-making behavior of users (see [13]). [9] showed that individuals with a rational decision-making style trusted the tested recommender system more and rated its efficiency and effectiveness higher. Furthermore, they showed that individuals with an intuitive decision-making style rate the quality of explanations better.

To date, however, few measurement tools exist to quantitatively assess the transparency of a RS as perceived by users. [6] surveyed perceived transparency using two items ("I understand why the system recommended the artworks it did"; "I understand what the system bases its recommendations on"), in the domain of art objects. [14] use a single item ("I did not understand why the items were recommended to me (reverse scale)"), for event recommendations. [8] proposed an item that explicitly refers to explanations: "Which explanation interfaces are considered to be transparent by the users?". [5] proposed an evaluation framework for RS, involving different domains and applications, and formulate the measurement of the construct transparency using only a single item ("I understood why the items were recommended to me"), the latter being a frequently used item for the evaluation of RS transparency.

Consequently, we set out to formulate and validate a more comprehensive way to measure the perceived transparency of a RS, as described in the methods section. We followed the typical procedure for developing psychometric measurement instruments (e.g. [15]):

(1) To operationalize a target construct, a larger number of candidate items is first formulated and compiled. Here, we draw on the basic structure of RS ([16], [17]) and on typical user questions related to artificial intelligence algorithms [18]. Second, items were also derived from a qualitative preliminary study, conducted to further analyze the uncertainties in users' mental models, which can be understood as the notions that users have about how a system or a certain type of system works [19].

(2) We examined the factor structure of the transparency construct, which was formed as a reflective factor in the sense of classical test theory (see also [20]). We considered four factors that could group individual questionnaire items and that might contribute to variance in perceived transparency, inspired by the dimensions defined by [18]: input ("what kind of data does the system learn from?"), output ("what kind of output does the system give?"), functionality ("how / why does the system make predictions?") and interaction, covering what-if and how-to-be-that questions ("what would the system predict if this instance changes to..?").

(3) The developed measurement instrument was validated. For this purpose, the framework model of [7] was used.

2.0.1. Mental models and stages of cognitive processing

Transparency is frequently discussed as an objective property of a system. A system only becomes transparent, however, if its users can understand the transparency-related information, such as explicit explanations, and evaluate it with respect to their goals. The degree of comprehension may depend on the mental model users have about how the system works [21], based either on preconceptions, previous experiences with similar systems, or on the interaction with and perception of the present system [22]. As discussed in [19], mental models that drift considerably from actual system functioning may result in broadening the "gulfs" described by [22]: 1) the gulf of execution, when the user's mental model is inaccurate in terms of how the system can be used to execute a task; 2) the gulf of evaluation, when the output (as a consequence of a user's action) differs from what is expected according to the user's mental model.
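The grouping of candidate items into the four transparency dimensions can be sketched as a small lookup structure. This is purely an illustrative aid, not part of the validated instrument; the guiding questions follow the dimensions inspired by [18], and each example item is taken from the final questionnaire (Table 1), though pairing them in code is our own illustration.

```python
# Illustrative mapping of the four transparency dimensions to their
# guiding user questions and one example questionnaire item each.
TRANSPARENCY_DIMENSIONS = {
    "input": {
        "guiding_question": "What kind of data does the system learn from?",
        "example_item": "It was clear to me what kind of data the system "
                        "uses to generate recommendations.",
    },
    "output": {
        "guiding_question": "What kind of output does the system give?",
        "example_item": "I understood why the items were recommended to me.",
    },
    "functionality": {
        "guiding_question": "How / why does the system make predictions?",
        "example_item": "The system provided information to understand why "
                        "the items were recommended.",
    },
    "interaction": {
        "guiding_question": "What would the system predict if this "
                            "instance changes to..?",
        "example_item": "I know what needs to be changed in order to get "
                        "better recommendations.",
    },
}

def items_for(dimension: str) -> str:
    """Return the example questionnaire item for a given dimension."""
    return TRANSPARENCY_DIMENSIONS[dimension]["example_item"]
```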
To bridge these gulfs, users must process the information provided by the system at different cognitive levels. The items of the proposed questionnaire were formulated to reflect the action levels according to [23]. According to their model, the quality of interaction with the system can be described through a cycle of evaluation and execution. For example, at first, the user may perceive the output of the system (e.g., the recommendations and explanations), then interpret the information gathered (e.g., how the system works), and thereby evaluate the state of the system (e.g., the performance of the system and the quality of the output). As a consequence, the user formulates goals they aim to achieve with the system or matches their goals with the evaluation of the system (e.g., getting more accurate or diverse recommendations). The user then pursues an intention (e.g., improving recommendations), which is translated into planning actions (e.g., changing the input), which they finally execute. While this cognitive cycle is well known in the HCI field, it has hardly been applied in the investigation of transparency for AI-based systems.

The authors in [23] assume that there are gaps between the users' goals and their knowledge about the system, and the extent to which the system provides descriptions of its functioning (the gulfs of execution and of evaluation mentioned beforehand). By taking actions to bridge those gaps (making system functions match goals, and making the output represent a "good conceptual model of the system that is easily perceived, interpreted and evaluated" [23]), system designers may contribute to minimizing users' cognitive effort [23] and to decreasing the discrepancy between the mental model of the system and its actual functioning, which may have an impact on the perception of transparency, as discussed by [19]. We argue, then, that a more comprehensive instrument to measure perceived transparency is still needed, so that such impact can be evaluated not only on the basis of general perceived understanding ("I understood why recommended"), but also on the basis of the extent to which the output and functionalities that reflect the conceptual model of the system are perceived, interpreted and evaluated by users.


3. Methods

To operationalize the construct of perceived transparency, we conducted the following steps, based on the typical procedure for developing measurement instruments (e.g., [15]): 1. Formulation and compilation of questionnaire items. 2. Examination of item quality and factor structure, based on an online study. 3. Validation of the measurement instrument. We describe each step below.

3.1. Formulation and compilation of questionnaire items

Here, we draw on the basic structure of RS ([16], [17]) and typical user questions to AI algorithms [18]. Candidate items were also chosen to cover the different stages of the cognitive action cycle described in the related work. Second, items were also derived from a qualitative pre-study, consisting of interviews with users, to further analyze the uncertainties in users' mental models [19] with regard to different commercial RS, such as Netflix, Spotify or Amazon.

A total of 6 interviews were conducted via video call, with voluntary participants. When selecting the interview partners, care was taken to represent different age groups and levels of experience with Internet applications in the sample. Students and non-students from different age groups (20 to 50 years) were interviewed. Overall, previous exposure to recommender systems was about equally strong among all participants; only one interviewee had lower experience and one had slightly higher experience.

The aim of the interviews was to capture the experience, perception and evaluation, as well as possible questions, of users regarding the functionality or transparency of recommender systems. The subjects were asked to explain the functionality of RS from their perspective and to create a corresponding sketch. Following this, uncertainties and possible lacks of transparency were discussed. Finally, prototypical explanations from [24] for increasing perceived transparency were evaluated by the interview partners. The explanations refer differently to the input used, the functionality and the output. In addition, they use different visual forms of representation, e.g. star ratings, profile lines, and text. In this way, uncertainties as well as users' wishes for more transparency could be identified. Each question encountered in the interviews was directly transformed into one or more items.

A resulting set of 92 items was collected and discussed by the research team, during which linguistic revision and elimination of redundancies were also performed. The discussions led to a reduction of the set to 34 items, which were used as input for the online validation described in the next section.

3.2. Online user study

We conducted a user study to examine item quality and factor structure, as described below.

Participants We recruited 171 participants (89 female, mean age 29, range 18 to 69) through the crowdsourcing platform Prolific. We restricted the task to workers in the U.S. and the U.K. with an approval rate greater than 98%. Participants were rewarded with £1.15.
Time devoted to the survey (in minutes): M = 13.2, SD = 7.33.

We applied a quality check to select participants with quality survey responses (we included attention checks in the survey, e.g. "This is an attention check. Please click here the option 'Disagree'"). We discarded participants with at least one failed attention check, and those who did not finish the survey. Thus, the responses of 17 of the 192 initial Prolific respondents were discarded and not paid. Four additional cases were removed due to suspicious response behavior, e.g. answering all questions on the same page with the same value. Thus, 171 cases were used for further analysis.

The target sample size was chosen to allow performing CFA. [25], p. 389, recommends a minimum of n > 50 or three times the number of indicators; [26], p. 102, recommends a minimum of n > 100 or five times the number of indicators. Thus, given that we wanted to evaluate a set of 34 items, the sample size was set to a minimum of 170 participants.

Questionnaires We utilized the set of 34 items resulting from the item formulation step described above. Additionally, aiming to further validate the final measurement instrument (Section 4.3), we used items from [5] to evaluate perception of control (how much users think they can influence the system), interaction adequacy, interface adequacy, information sufficiency and recommendation accuracy. Furthermore, we included items from [7] to evaluate the perception of system effectiveness (construct perceived system effectiveness: the system is useful and helps the user to make better choices), and of trust in the system [27] (constructs trusting beliefs, with subconstructs benevolence, integrity, and competence: the user considers the system to be honest and trusts its recommendations; and trusting intentions: the user is willing to share information and to follow advice). We used items from [28, 29] for explanation quality, and from [30] to evaluate decision-making style. All items were measured on a 5-point Likert scale (1: Strongly disagree, 5: Strongly agree).

Procedure Participants were asked to choose a service from five applications for which they were required to have an active account: Amazon, Spotify, Netflix, Tripadvisor, and Booking. Participants were instructed to open the application and browse it at their own discretion. They were explicitly told to select an item that was relevant to them and which they would actually buy or consume; an actual purchase of items was explicitly not requested. Participants were asked to return to the survey after completing the task and to answer questions about the system they used.

Data analysis We performed an exploratory factor analysis (EFA) to further reduce the initial set of items, and a confirmatory factor analysis (CFA) to test internal reliability and convergent validity. Furthermore, we evaluated the discriminant validity of the resulting set of items in relation to other constructs of the subjective evaluation of RS, for example explanation quality, effectiveness and overall satisfaction, according to the frameworks defined by [7] and [5].


4. Results

4.1. Exploratory Factor Analysis (EFA)

The factor structure was examined exploratively, aiming to further reduce the set of items. A total of 5 EFAs with principal axis factoring and promax rotation were performed. First, items that did not have a unique principal loading or had a principal loading that was too low (<.40) were removed. In the first 4 EFAs, 11 items were removed based on this criterion. Subsequently, a more stringent criterion was used (factor loadings <.50), based on the guideline values of [31], and 2 more items were removed. This resulted in a 6-factor structure with a total of 21 items and an explained variance of 62.45%. The reliability of the factors falls in the range 'good' to 'very good' (.782 to .888), as defined by [32]. The internal consistency across all items is .867.

4.2. Confirmatory Factor Analysis (CFA)

Following the exploration of the factor structure, the resulting model was tested for internal reliability and convergent validity using confirmatory analysis. A first CFA was performed, resulting in 8 items with low factor loadings, which were eliminated from the set. Two factors were removed in the process because they did not load on a second-level overall transparency factor. A final CFA with 4 factors was performed (model fit: X2 = 86.997, df = 61, p = .016; X2/df = 1.426; CFI = .975; TLI = .968; RMSEA = .050; SRMR = .047). Reliability across all items is .884. This model comprises a final set of four factors and 13 items, which are reported in Table 1 along with their factor loadings.

The four factors identified can be associated with the concepts Input, composed of 3 items, Output, also with 3 items, Functionality with 5 items, and Interaction with only 2 items. Although the initial item set comprised questions for all stages of the cognitive action cycle, after CFA, items related to the perception level were only left in the factor Functionality, comprising questions about whether users are aware of transparency-related information if provided by the system (e.g.: "The system provided information about how well the recommendations match my preferences"). This factor covers mostly
Table 1
Test results of internal reliability and convergent validity of our proposed transparency questionnaire. Each factor is listed with its Cronbach's alpha; each item with its factor loading.

Input (alpha = .842):
- It was clear to me what kind of data the system uses to generate recommendations. (0.817)
- I understood what data was used by the system to infer my preferences. (0.901)
- I understood which item characteristics were considered to generate recommendations. (0.712)

Output (alpha = .801):
- I understood why the items were recommended to me. (0.771)
- I understood why the system determined that the recommended items would suit me. (0.794)
- I can tell how well the recommendations match my preferences. (0.710)

Functionality (alpha = .847):
- The system provided information to understand why the items were recommended. (0.731)
- The system provided information about how the quality of the items was determined. (0.705)
- The system provided information about how my preferences were inferred. (0.736)
- The system provided information about how well the recommendations match my preferences. (0.696)
- I understood how the quality of the items was determined by the system. (0.760)

Interaction (alpha = .888):
- I know what actions to perform in the system so that it generates better recommendations. (0.896)
- I know what needs to be changed in order to get better recommendations. (0.892)


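The Cronbach's alpha values reported in Table 1 can be reproduced from raw item responses with a few lines of code. The sketch below is a generic implementation of the standard formula, not the actual analysis script used in the study; the response matrix is hypothetical.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an (n_respondents, n_items) response matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of sum scores),
    using sample variances (ddof=1) throughout.
    """
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 5-point Likert responses to the two Interaction items:
responses = np.array([
    [4, 5],
    [3, 3],
    [5, 5],
    [2, 3],
    [4, 4],
])
print(round(cronbach_alpha(responses), 3))  # prints 0.93
```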
perception-related questions. The missing coverage of perception-related items in other factors is likely due to limitations of the systems used for the online study, which do not, for example, provide access to the data on which recommendations are based, thus preventing users from becoming aware of input data. The factor Output comprises items related to the interpretation and evaluation stages. The factor Interaction has the smallest scope, with 2 items, and covers only the facets of action planning and action execution. This factor thus describes whether users know which actions they would have to perform if they wanted to receive other recommendations.

4.3. Discriminant validity of the measurement instrument

We determined the discriminant validity of the instrument in relation to other constructs of the subjective evaluation of RS, for example explanation quality, effectiveness and overall satisfaction, according to the frameworks defined by [7] and [5]. Discriminant validity was assessed using inter-construct correlations (see results in Table 2). We found that the squared correlations between pairs of constructs were all less than the average variances shown in the diagonal, representing "a level of appropriate discriminant validity" [5].


5. Structural Equation Model (SEM)

To explore the relation between the transparency factors assessed by the questionnaire and the effects of perceived transparency on recommendation effectiveness and trust in the system, as well as the impact of factors influencing transparency, we set up a Structural Equation Model

fect of higher overall transparency on the perception of the recommendations and the overall system. Furthermore, we assumed that transparency is influenced by system-related aspects (accuracy, interaction quality, and explanation quality) as well as by personal characteristics (such as decision-making behavior), as described in the related work section. Some of these factors can also be expected to influence perceived control over the system, a construct that may mediate the impact of these factors on transparency perception. This led us to formulate the hypotheses shown in Table 3.

In the following, the relationships of the factors in the structural equation model are presented (see Fig. 1). Only significant paths with standardized path coefficients are shown. Indirect effects are only considered for the transparency factors relevant here. The final model is shown to have a very good fit: X2 = 75.767, df = 57, p = .049; X2/df = 1.329; CFI = .980; TLI = .965; RMSEA = .044; SRMR = .072. The model is thus adequate to describe the relationships in the data set.

Influences on perceived transparency of the system. Transparency with respect to interaction is rated higher when users are more likely to exhibit an intuitive decision-making style (0.186, p < .05) and when users report higher perceived control (0.293, p < .001). The latter is increased by the quality of interaction (0.502, p < .001) and of explanations (0.341, p < .001). Users thus know better how to influence recommendations when they have more opportunities to interact with the system, and can gather information about the system through explanations as well as through 'trial and error' (indirect effects: explanation quality → control → Transparency-interaction: 0.100, p < .01; interaction quality → control → Transparency-interaction: 0.147, p < .001).

Similar observations can be made for functionality. Again, transparency is rated higher when users are more
(SEM). The model is based on hypotheses we derived likely to exhibit an intuitive decision-making style (0.141,
from existing research that has shown the positive ef- p <.05) and users report higher perceived control (0.261,
Table 2
Inter-construct correlation matrix. Average Variance Extracted (AVE) on the main diagonal; correlations below the diagonal; squared correlations above the diagonal. Target value for AVE ≥ .5. p<0.05*, p<0.01**.
                               1         2         3         4         5          6         7        8         9         10        11        12        13        14        15        16        17        18      19
 1 Transp. - input           0.662     0.227     0.235     0.136     0.111     0.023      0.009    0.125     0.053     0.051     0.075     0.040     0.065     0.063     0.159     0.002     0.026     0.081     0.021
 2 Transp. - output          0.476**   0.577     0.231     0.121     0.157     0.039      0.019    0.146     0.187     0.054     0.094     0.291     0.071     0.041     0.239     0.001     0.186     0.240     0.074
 3 Transp. - function        0.485**   0.481**   0.527     0.155     0.246     0.021      0.061    0.341     0.153     0.022     0.114     0.094     0.147     0.153     0.183     0.021     0.127     0.168     0.094
 4 Transp. - interaction     0.369**   0.348**   0.394**   0.799     0.119     0.000      0.055    0.048     0.072     0.000     0.056     0.030     0.060     0.106     0.070     0.008     0.064     0.035     0.038
 5 Control                   0.333**   0.396**   0.496**   0.345**   0.775     0.004      0.018    0.242     0.366     0.052     0.154     0.090     0.153     0.198     0.156     0.007     0.144     0.141     0.118
 6 DM style - rational       0.153*    0.197**   0.146     0.016     0.061     0.454      0.041    0.062     0.004     0.073     0.018     0.032     0.017     0.017     0.026     0.004     0.036     0.058     0.030
 7 DM style - intuitive      0.092     0.138     0.246**   0.234**   0.136     -0.203**   0.502    0.022     0.027     0.000     0.000     0.019     0.006     0.011     0.012     0.012     0.001     0.001     0.005
 8 Explanation quality       0.353**   0.382**   0.584**   0.220**   0.492**   0.248**    0.148    0.557     0.091     0.080     0.265     0.151     0.112     0.100     0.230     0.030     0.199     0.177     0.171
 9 Interaction adequacy      0.230**   0.432**   0.391**   0.269**   0.605**   0.064      0.163*   0.301**   0.791     0.082     0.065     0.048     0.116     0.151     0.101     0.020     0.147     0.118     0.084
 10 Interface adequacy       0.226**   0.232**   0.147     0.008     0.228**   0.270**    -0.001   0.282**   0.286**   0.618     0.123     0.054     0.052     0.043     0.207     0.020     0.108     0.130     0.187
 11 Info. sufficiency        0.273**   0.307**   0.337**   0.236**   0.393**   0.133      -0.001   0.515**   0.254**   0.350**   —         0.104     0.064     0.063     0.182     0.048     0.216     0.188     0.170
 12 Recomm. accuracy         0.201**   0.539**   0.307**   0.174*    0.300**   0.180*     0.137    0.389**   0.220**   0.232**   0.323**   —         0.086     0.062     0.259     0.021     0.187     0.326     0.221
 13 Trust - benevolence      0.254**   0.266**   0.384**   0.245**   0.391**   0.130      0.079    0.334**   0.341**   0.228**   0.252**   0.293**   0.666     0.661     0.366     0.095     0.162     0.332     0.282
 14 Trust - integrity        0.250**   0.202**   0.391**   0.326**   0.445**   0.129      0.106    0.316**   0.388**   0.207**   0.251**   0.249**   0.813**   0.476     0.332     0.088     0.179     0.238     0.272
 15 Trust - competence       0.399**   0.489**   0.428**   0.265**   0.395**   0.162*     0.111    0.480**   0.318**   0.455**   0.427**   0.509**   0.605**   0.576**   0.608     0.030     0.278     0.440     0.358
 16 Trust - share info.      0.040     0.028     0.146     0.091     0.086     0.060      0.109    0.174*    0.141     0.140     0.219**   0.143     0.308**   0.297**   0.174*    —         0.064     0.062     0.078
 17 Trust - follow advice    0.160*    0.431**   0.356**   0.253**   0.379**   0.189*     0.028    0.446**   0.384**   0.328**   0.465**   0.433**   0.402**   0.423**   0.527**   0.252**   —         0.213     0.269
 18 Effectiveness            0.284**   0.490**   0.410**   0.187*    0.375**   0.241**    0.036    0.421**   0.344**   0.360**   0.434**   0.571**   0.576**   0.488**   0.663**   0.249**   0.461**   0.545     0.389
 19 Overall satisfaction     0.145     0.272**   0.306**   0.194*    0.343**   0.174*     0.069    0.414**   0.289**   0.432**   0.412**   0.470**   0.531**   0.522**   0.598**   0.280**   0.519**   0.624**   —
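The discriminant-validity criterion applied to Table 2 — each construct's AVE must exceed its squared correlation with every other construct (the Fornell-Larcker criterion) — can be sketched as a simple check. The values below are a small excerpt from Table 2; the function name is illustrative, not from the paper:

```python
# Sketch of the Fornell-Larcker discriminant-validity check behind Table 2:
# a pair of constructs is adequately discriminated if the squared correlation
# between them is smaller than each construct's AVE.

def fornell_larcker_ok(ave: dict, corr: dict) -> bool:
    """ave: AVE per construct; corr: correlation per construct pair."""
    for (a, b), r in corr.items():
        # squared correlation must stay below the smaller of the two AVEs
        if r ** 2 >= min(ave[a], ave[b]):
            return False
    return True

# Excerpt of the values reported in Table 2 (AVE diagonal, correlations below it).
ave = {"input": 0.662, "output": 0.577, "function": 0.527, "interaction": 0.799}
corr = {
    ("input", "output"): 0.476,
    ("input", "function"): 0.485,
    ("output", "function"): 0.481,
    ("input", "interaction"): 0.369,
}

print(fornell_larcker_ok(ave, corr))  # True: all squared correlations < AVEs
```

For example, the largest excerpted correlation (.485) squares to .235, which is below both relevant AVEs (.662 and .527), so the criterion holds.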




Table 3
Overview of hypotheses addressed in the SEM
 Hypotheses   Reference               Relevant factor           Explanation
 Factors influencing perceived transparency (X → perceived transparency)
 H-1.1        [6], [5]                Explanation quality       Comprehensibility and contribution of the explanations to the understanding of the system
 H-1.2        [5]                     Accuracy                  Match between the items and the user's preferences
 H-1.3        [5] (indirect effect)   Interaction quality       Possibilities of adaptation and feedback
 H-1.4        [5]                     Control                   Possibilities of personalization
 H-1.5        [10]                    Decision-making styles    Rational / intuitive
 Effects of perceived transparency (perceived transparency → Y)
 H-2.1        [3], [14]               Trust                     Trusting beliefs and intentions
 H-2.2        [11]                    Effectiveness             Usefulness of the system
 H-2.3        [14], [12]              Overall satisfaction      Satisfaction with the system
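The global fit statistics reported for the final model can be compared against conventional cutoffs (χ²/df ≤ 3, CFI/TLI ≥ .95, RMSEA ≤ .06, SRMR ≤ .08). A minimal sketch, noting that these cutoff values are common rules of thumb from the SEM literature, not thresholds stated in this paper:

```python
# Sketch: checking reported global fit statistics against conventional cutoffs.
# The cutoffs (chi2/df <= 3, CFI/TLI >= .95, RMSEA <= .06, SRMR <= .08) are
# common rules of thumb, not values taken from this paper.

def fit_acceptable(chi2, df, cfi, tli, rmsea, srmr):
    return (chi2 / df <= 3.0 and cfi >= 0.95 and tli >= 0.95
            and rmsea <= 0.06 and srmr <= 0.08)

# Fit of the final model as reported in the text.
print(fit_acceptable(chi2=75.767, df=57, cfi=0.980, tli=0.965,
                     rmsea=0.044, srmr=0.072))  # True
```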




p < .001). The quality of interaction, promoting perceived control, has a positive effect on transparency concerning how the system works (indirect: interaction quality → control → Transparency-functionality: 0.131, p < .001). The quality of the explanations both directly (0.416, p < .001) and indirectly increases the transparency of the functionality when users rate the explanations positively (indirect: explanation quality → control → Transparency-functionality: 0.089, p < .01).
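In standard SEM practice, a standardized indirect effect along a mediation path is the product of the standardized direct coefficients on that path; the indirect effects reported here are consistent with the direct paths shown in fig. 1. A minimal sketch (the helper name is illustrative):

```python
# Sketch: in SEM, the standardized indirect effect of a mediation path is the
# product of the standardized coefficients along that path. The reported
# indirect effects are consistent with the direct paths in the model.

def indirect_effect(*paths: float) -> float:
    out = 1.0
    for p in paths:
        out *= p
    return round(out, 3)

# explanation quality -> control (0.341), control -> Transparency-interaction (0.293)
print(indirect_effect(0.341, 0.293))  # 0.1   (reported: 0.100)
# interaction quality -> control (0.502), control -> Transparency-interaction (0.293)
print(indirect_effect(0.502, 0.293))  # 0.147 (reported: 0.147)
# interaction quality -> control (0.502), control -> Transparency-functionality (0.261)
print(indirect_effect(0.502, 0.261))  # 0.131 (reported: 0.131)
```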




Figure 1: Structural model. p<0.05*, p<0.01**, p<0.001***
   The input is perceived as more transparent the better users can interact with the system. Thus, here again, perceived control has a direct positive effect (0.200, p < .01), and the quality of the interaction again has an indirect effect (indirect: interaction quality → control → Transparency-input: 0.100, p < .05). Similarly to what has already been shown with regard to functionality, the quality of the explanations also has a direct, positive effect on transparency of the input (0.209, p < .01) in addition to the indirect effect (explanation quality → control → Transparency-input: 0.068, p < .05).
   The transparency of the output shows how well users can assess why a recommendation is made or should match the user's preferences. This is directly increased by the quality of the interaction with the system (0.311, p < .001), i.e. when possibilities are offered or used to indicate one's own preferences. On the other hand, there are no direct or indirect influences of the explanations. Instead, the accuracy of the recommendations has a positive influence on the transparency of the output (0.454, p < .001). Accordingly, the output is easier to understand if it is rated as suitable; unsuitable recommendations would thus be more difficult for the user to comprehend.
   As shown, transparency is positively influenced by the quality of explanations, the accuracy of recommendations, opportunities for interaction, and perceived control. Hypotheses 1.1, 1.2, 1.3 and 1.4 can thus be considered confirmed. The influence of the decision-making style is limited to the intuitive style; therefore, hypothesis 1.5 can only be partially confirmed.
   Effects of perceived transparency of the system. No effects can be observed for transparency with regard to interaction. It is possible that effects exist on factors that were not surveyed in this study. For the other transparency factors, however, significant positive effects can be observed.
   Transparency regarding the functionality has the strongest and most diverse effect. If users can understand the internal mechanisms, they trust the recommendation system more. Direct positive effects can be observed on benevolence (0.248, p < .01) and trust in the competence (0.188, p < .05) of the system. Indirectly, such transparency thus contributes to a better evaluation of the system's effectiveness (indirect: Transparency-functionality → Trust-benevolence → effectiveness: 0.074, p < .01; indirect: Transparency-functionality → Trust-competence → effectiveness: 0.055, p < .05). Via the increase in effectiveness, overall satisfaction with the system is also promoted (indirect: Transparency-functionality → Trust-benevolence → effectiveness → overall satisfaction: 0.024, p < .05). Via the increase in perceived benevolence, the willingness to share information about oneself is also increased (indirect: Transparency-functionality → Trust-benevolence → Trust-information sharing: 0.072, p < .05). Moreover, via trust in competence, the willingness to follow the advice of the recommendation system is increased (indirect: Transparency-functionality → Trust-competence → Trust-follow advice: 0.071, p < .05). Thus, an understanding of the internal mechanisms of recommender systems leads to trusting beliefs and, in turn, to trusting actions and a positive overall evaluation.
   Transparency with regard to the input has a negative effect. If users can see which data is used, this reduces the willingness to follow the advice of the recommendation system in this model (-0.144, p < .05). This shows a certain counterbalance to a transparent functionality, possibly triggered by too much information or a general distrust regarding data privacy; transparency can thus also have negative consequences, although these turn out to be comparatively small.
   Transparent output again has strong positive effects. If users can understand why the recommended item matches their preferences, this increases trust in the competence of the system (0.194, p < .01). Indirectly, transparency also promotes overall satisfaction via this increase in trust (indirect: Transparency-output → Trust-competence → overall satisfaction: 0.055, p < .05). Furthermore, the increase in transparency contributes indirectly (indirect: Transparency-output → Trust-competence → effectiveness: 0.056, p < .05), but also directly (0.127, p < .05), to a higher rating of the system's effectiveness. Indirectly, this in turn increases overall satisfaction with the system (indirect: Transparency-output → Trust-competence → effectiveness → overall satisfaction: 0.019, p < .05). Additionally, it increases the willingness to follow the advice of the recommendation system when users better understand the output (direct: 0.268, p < .001; indirect: Transparency-output → Trust-competence → Trust-follow advice: 0.073, p < .05).
   As shown, the transparency factors have clear effects on trust in the system, on the evaluation of effectiveness, and on overall satisfaction. Therefore, hypotheses 2.1, 2.2 and 2.3 can be considered confirmed. Thus, perceived transparency can also be viewed as a mediator of perceived control over the system, user characteristics, and other qualities of the system. The importance of the different factors of perceived transparency can be shown by the differentiated assessment.

6. Discussion

We aimed at developing a measurement tool that is specifically focused on capturing the transparency of RS as perceived by users. In an initial interview study, concerns and uncertainties in relation to RS transparency were identified, which are well in line with the general AI-related questions compiled by [18]. This indicates that the scheme developed by these authors can be a useful
starting point for developing measures also for specific systems such as RS, which address a wider range of users beyond the more expert users considered in the original work by [18].
   Our confirmatory analyses confirmed our hypothesis that subjectively perceived transparency can be characterized by the factors input, output, functionality and interaction. Adequate reliability as well as convergent and divergent validity was demonstrated, which indicates that the identified transparency factors can clearly be considered as independent, and that they can be distinguished from each other and also from other factors of the subjective evaluation of RS (trust, effectiveness, etc.).
   The identified factors in our analysis reflect the basic components of RS as defined by [16], i.e., the input (what data does the system use?), the functionality (how and why is an item recommended?), and the output (why and how well does an item match one's own preferences?). Additionally, the factor interaction could be extracted. This factor is consistent with the category interaction (what if / how to be that, what has to be changed for a different prediction?) of the prototypical questions to AI formulated by [18].
   Furthermore, the final set of items can also be considered through the lens of the different interaction stages as defined by Norman [22]. In our examined context, for example, the stage perception relates to the presence of system functions that explicitly reveal information on how the recommendations were derived, e.g. through explanations. Items of the type "The system provided information about how. . . ", grouped under the factor functionality, could be validated, indicating that making information about the recommendation process observable is a prerequisite for further cognitive processing. This indicates that the evaluation of perceived transparency should consider not only items related to users' interpretation (i.e. "user understands", as it has traditionally been evaluated in RS research), but also items related to the presence and perception of transparency-related system functions (e.g. "user notices that the system actually explains").
   Once the user perceives a system output (e.g. the features of a recommendation or an explanation), the next stage is the interpretation of the system state, in which users use their knowledge to interpret the new system state [22]: in our context, to assess the recommendation inferred by the system. Our validated final set includes items which are related to the interpretation stage and are of the type "I understood what data was used . . . " (grouped under the factor input) or "I understood how the system determined . . . " (grouped under the factor functionality of our developed scale). This group of items is also consistent with the definition of perceived transparency by [5], which focuses on the perceived understanding of the inner processes of RS.
   In a subsequent stage, users compare the interpreted system state to their own goal to decide about the next action, a stage defined by Norman [22] as evaluation. The item "I can tell how well the recommendations match my preferences" from our scale relates to this stage, by explicitly assessing the correspondence of the recommended items with one's own preferences. Items from the interaction group ("I know what needs to be changed in order to get better recommendations") can be associated with intent formation and the downstream path in the action cycle.
   As discussed by [23], designers can contribute to closing the gap between mental models (users' ideas of how the system works [22]) and the actual functioning of the system, by providing output and functionalities that reflect an adequate conceptual model of the system, one that can be "easily perceived, interpreted and evaluated" [23]. This can in turn impact perceived transparency [19]. Consequently, our instrument can contribute to a more comprehensive assessment of subjectively perceived transparency, by going beyond the one-dimensional construct addressing a general "why-recommended" understanding, and assessing instead the extent to which output and functionalities reflecting the system's conceptual model are in fact perceived, interpreted and evaluated.

7. Conclusion and Outlook

The instrument developed can be seen as a first step towards assessing transparency in RS in a more comprehensive and cognitively meaningful manner. Overall, the reliability and construct validity of the developed measurement instrument could be confirmed, identifying four transparency factors (input, output, functionality, interaction) and resulting in a 13-item questionnaire (see Table 1). The expected influence of system aspects and personal characteristics on the transparency factors could be demonstrated, with the exception of transparency regarding interaction, which may be due to the limited interaction possibilities in the applications used by participants. Furthermore, we could show the impact of different transparency aspects on trust in the system and on the overall evaluation of the system.
   The differentiated assessment of transparency makes it possible to elaborate the significance of individual aspects of transparency in more detail than was possible with previous measurement instruments. Thus, it could be shown that transparency with respect to functioning and output is of greater importance for the dependent variables considered than transparency with respect to interaction and input.
   The findings obtained here should be considered under the following limitations. Real systems were tested for this online study. On the one hand, this allowed us to obtain users' views with respect to applications they were
familiar with and that were fully functional. On the other hand, no controlled manipulation of influencing variables was possible. We also did not analyze the differences between the systems, which would have required a larger sample and would have addressed questions outside the scope of the present study. An effect of explanations could only be shown for the factors input and functionality, partly mediated by perceived control, which may also be due to the limited explanations provided by the systems used. In addition, only systems that were already known to the users were tested; thus, a stronger expression of trust and an overall more positive evaluation might be expected. In terms of social desirability or self-overestimation, perceived understanding might be valued higher than actual understanding would lead one to expect.
   Follow-up research should be guided by the limitations mentioned here for further validation of the measurement instrument. The degree of perceived transparency should also be compared with actual, genuine understanding using parallel qualitative methods [6]. Furthermore, it is important to check to what extent the questionnaire is also able to evaluate systems that are unknown to the users. Assessing unfamiliar systems or specifically designed prototypes would provide the opportunity to systematically vary components of the recommender system (input, functionality, output), the quality of explanations, and/or the interaction possibilities [9]. Thus, the influence of these features on the transparency factors, and likewise possible differences in their manifestation, should be further explored.
   Overall, a first validated version of a questionnaire to assess perceived transparency can be presented. The findings presented here also provide starting points for research into further elucidating the multi-faceted concept of transparency.

Acknowledgments

This work was funded by the German Research Foundation (DFG) under grant No. GRK 2167, Research Training Group "User-Centred Social Media".

References

 [1] N. Bostrom, E. Yudkowski, The Ethics of Artificial Intelligence, in: W. Ramsey, K. Frankish (Eds.), Cambridge Handbook of Artificial Intelligence, Cambridge University Press, 2014, pp. 316–334.
 [2] U. S. a. H. S. C. (SHS), Recommendation on the Ethics of Artificial Intelligence, Technical Report, UNESCO, 2021. URL: https://unesdoc.unesco.org/ark:/48223/pf0000379920.page=14.
 [3] R. Sinha, K. Swearingen, The role of transparency in recommender systems, in: CHI '02 Extended Abstracts on Human Factors in Computing Systems, 2002, pp. 830–831.
 [4] N. Tintarev, J. Masthoff, Explaining Recommendations: Design and Evaluation, in: F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems Handbook, Springer, 2015, pp. 353–382. doi:10.1007/978-1-4899-7637-6_10.
 [5] P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11), 2011, pp. 157–164.
 [6] H. Cramer, V. Evers, S. Ramlal, M. van Someren, L. Rutledge, N. Stash, L. Aroyo, B. Wielinga, The effects of transparency on trust in and acceptance of a content-based art recommender, User Modeling and User-Adapted Interaction 18 (2008) 455–496.
 [7] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, C. Newell, Explaining the user experience of recommender systems, User Modeling and User-Adapted Interaction 22 (2012) 441–504.
 [8] F. Gedikli, D. Jannach, M. Ge, How should I explain? A comparison of different explanation types for recommender systems, International Journal of Human-Computer Studies 72 (2014) 367–382.
 [9] D. C. Hernandez-Bocanegra, J. Ziegler, Explaining review-based recommendations: Effects of profile transparency, presentation style and user characteristics, Journal of Interactive Media 19 (2020) 181–200. doi:10.1515/icom-2020-0021.
[10] D. C. Hernandez-Bocanegra, J. Ziegler, Effects of interactivity and presentation on review-based explanations for recommendations, in: Human-Computer Interaction – INTERACT 2021, Springer International Publishing, 2021, pp. 597–618.
[11] N. Tintarev, J. Masthoff, Evaluating the effectiveness of explanations for recommender systems, User Modeling and User-Adapted Interaction 22 (2012) 399–439.
[12] C.-H. Tsai, P. Brusilovsky, Explaining recommendations in an interactive hybrid social recommender, in: 24th International Conference on Intelligent User Interfaces (IUI '19), 2019, pp. 391–396.
[13] A. Jameson, M. C. Willemsen, A. Felfernig, M. de Gemmis, P. Lops, G. Semeraro, L. Chen, Human decision making and recommender systems, Recommender Systems Handbook (2015) 611–648.
[14] S. Dooms, T. D. Pessemier, L. Martens, A user-centric evaluation of recommender algorithms for an event recommendation system, in: Proceedings of the RecSys 2011 Workshop on Human Decision Making in Recommender Systems (Decisions@RecSys '11) and User-Centric Evaluation of Recommender Systems and Their Interfaces-2 (UCERSTI 2), affiliated with the 5th ACM Conference on Recommender Systems (RecSys 2011), 2011, pp. 67–73.
[15] M. Bühner, Einführung in die Test- und Fragebogenkonstruktion, Pearson Studium, München, 2011.
[16] D. Jannach, M. Zanker, A. Felfernig, G. Friedrich, Recommender Systems: An Introduction, Cambridge University Press, 2011.
[17] J. Lu, Q. Zhang, G. Zhang, Recommender Systems: Advanced Developments, World Scientific Publishing, 2021.
[18] Q. V. Liao, D. Gruen, S. Miller, Questioning the AI: Informing design practices for explainable AI user experiences, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–15. doi:10.1145/3313831.3376590.
[19] T. Ngo, J. Kunkel, J. Ziegler, Exploring mental models for transparent and controllable recommender systems: A qualitative study, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization (UMAP '20), 2020, pp. 183–191.
[20] D. Borsboom, G. J. Mellenbergh, J. van Heerden, The theoretical status of latent variables, Psychological Review 110 (2003) 203–219.
[21] J. Kunkel, T. Ngo, J. Ziegler, N. Krämer, Identifying Group-Specific Mental Models of Recommender Systems: A Novel Quantitative Approach, in: Human-Computer Interaction – INTERACT 2021, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2021, pp. 383–404. doi:10.1007/978-3-030-85610-6_23.
[22] D. A. Norman, Some Observations on Mental Models, in: D. Gentner, A. L. Stevens (Eds.), Mental Models, Psychology Press, New York, NY, USA, 1983.
[23] E. Hutchins, J. D. Hollan, D. A. Norman, Direct manipulation interfaces, Human-Computer Interaction 1 (1985) 311–338.
[24] Y. Zhang, X. Chen, Explainable recommendation: A survey and new perspectives, Foundations and Trends in Information Retrieval 14 (2020) 1–101.
[25] K. Backhaus, B. Erichson, R. Weiber, Multivariate Analysemethoden: Eine anwendungsorientierte Einführung, 13th ed., Berlin, 2011.
[26] J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson, Multivariate Data Analysis: A Global Perspective, 7th ed., Boston, 2010.
[27] D. H. McKnight, V. Choudhury, C. Kacmar, Developing and validating trust measures for e-commerce: An integrative typology, Information Systems Research 13 (2002).
[28] P. Kouki, J. Schaffer, J. Pujara, J. O'Donovan, L. Getoor, Personalized explanations for hybrid recommender systems, in: Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI '19), ACM, 2019, pp. 379–390.
[29] T. Donkers, T. Kleemann, J. Ziegler, Explaining recommendations by means of aspect-based transparent memories, in: Proceedings of the 25th International Conference on Intelligent User Interfaces, 2020, pp. 166–176.
[30] K. Hamilton, S.-I. Shih, S. Mohammed, The development and validation of the rational and intuitive decision styles scale, Journal of Personality Assessment 98 (2016) 523–535.
[31] A. G. Yong, S. Pearce, A beginner's guide to factor analysis: Focusing on exploratory factor analysis, Tutorials in Quantitative Methods for Psychology 9 (2013) 79–94.
[32] R. A. Peterson, A meta-analysis of Cronbach's coefficient alpha, Journal of Consumer Research 21 (1994) 381–391.