    Assessing the Value of Transparency in Recommender Systems:
                       An End-User Perspective
                                              Eric S. Vorm (esvorm@iu.edu)
                                            Andrew D. Miller (andrewm@iupui.edu)
                                  Indiana University Purdue University Indianapolis
                                                   Indianapolis, IN
                                  IntRS Workshop, October 2018, Vancouver, Canada
ABSTRACT
Recommender systems, especially those built on machine learning, are increasing in popularity, as well as in complexity and scope. Systems that cannot explain their reasoning to end-users risk losing trust with users and failing to achieve acceptance. Users demand interfaces that afford them insights into internal workings, allowing them to build appropriate mental models and calibrated trust. Building interfaces that provide this level of transparency, however, is a significant design challenge, with many competing design features and little empirical research to guide implementation. We investigated how end-users of recommender systems value different categories of information when deciding what to do with computer-generated recommendations in contexts involving high risk to themselves or others. Our findings will inform the future design of decision support in high-criticality contexts.

Figure 1: ONNPAR is a simulated clinical decision support system built on machine learning. It was used as the testbed for this study, serving the role of a highly-critical decision context.

1 INTRODUCTION
New machines are being invested with increasing levels of authority and unprecedented scope. Decisions previously made by humans are increasingly being made by computers, often with little or no explanation, raising concerns over a plethora of social, legal, and ethical issues such as privacy, bias, and safety.
   Transparency is often discussed in terms of back-end programming or troubleshooting. End-users, especially novice users interacting with recommender systems, are seldom studied. Yet recent developments in AI suggest that automated recommendations will become an increasingly common component of users' daily lives as technologies such as self-driving cars and IoT-enabled smart homes become commonplace. Developing methods to increase the transparency of computer-generated recommendations, and to understand user information needs as a means of increasing trust in and engagement with recommendations, is therefore crucial. Transparent interface design is often complicated by a series of trade-offs that seek to balance and prioritize several competing design principles. Striking the appropriate balance between too much and not enough information is often more art than science, and is becoming more difficult with the growing prevalence of data-driven paradigms such as machine learning [1].
   Efforts to improve the transparency of recommender systems commonly involve programming system-generated explanations that seek to justify the recommendation to users, often through the use of system logic [2]. Providing explanations and justifications of system behavior to users has proven to be a highly effective means of increasing user acceptance and enhancing user attitudes towards recommender systems [3]. Studies have shown that providing explanations to users tends to increase trust [4], improve user comprehension [5], calibrate appropriate reliance on decision aids [6], and enable better detection and correction of system errors [7]. Generating explanations that users find both useful and satisfactory, however, can be a complicated task, and much research has been conducted to answer the question of "what makes a good explanation" [8].
   While system-generated explanations represent the most common approach to transparency in recommender systems, in many cases simply providing users access to certain types of information can also improve transparency, and can dramatically improve user experience and the likelihood of further interaction [5]. In some contexts, affording users the opportunity to see into the system's dependencies, policies, limitations, or information about how the user is modeled and considered by the system can facilitate the same level of user understanding (and subsequent trust) as an explicit explanation [9].
   Providing targeted information as a means of improving a user's mental model and trust (i.e., transparency) has two potential benefits over building explanation interfaces. First, it affords users an opportunity to use deductive reasoning to determine the merit and validity of system recommendations, which has been demonstrated to improve usability and user trust in many contexts. For instance, Swearingen and Sinha reported that recommender systems whose interfaces provided information that could help users understand the system were preferred over those that did not [10]. Research in cognitive agents has also demonstrated that providing users access to underlying system information, such as system dependencies or the provenance of data, can greatly improve human-machine performance and reduce the likelihood of users acting on recommendations that are erroneous, known as "errors of commission" [11]. A second benefit of affording users the opportunity to see into the system in order to understand its processes is that it requires little to no additional programming. This is often because much of the information that could enhance user understanding of system functions and behaviors is already present in the system, but is often hidden from front-end interfaces in order to reduce clutter and streamline layouts.
   This trade-off between providing adequate information to communicate a system's intent and achieving a user-friendly interface design is a common challenge, often resolved through iterative design evaluations involving user testing. While research involving transparency in system design frequently focuses on behavioral outcomes, such as modeling the appropriateness of a user's interaction with a recommender system, little is known about what information is most efficacious to users in terms of improving mental models, resolving conflicts caused by unexpected or unanticipated system behaviors, or improving user trust and technology acceptance. Answering these questions requires investigating how users subjectively value and prioritize different categories of information, whether to resolve conflicts between expected and observed system behaviors or to evaluate the validity or accuracy of a recommendation and so decide whether to accept or reject it.
   To accomplish this, we used an approach known as Q-methodology, commonly referred to as the systematic study of subjectivity [13]. To constrain our work and prevent overgeneralization of findings, we chose to investigate what information users value most when engaged with recommender systems in a highly critical decision scenario. We hypothesize that users involved in tasks carrying a high degree of risk to themselves or others are more likely to critically interrogate computer-generated recommendations before accepting and acting upon them. This suggests that systems providing recommendations in highly critical decision contexts, such as the medical, legal, financial, or automotive domains, would benefit most from interfaces that enable users to quickly and accurately discern whether or not to trust those recommendations. Using the decision-criticality framework as a guide, we developed a hypothetical recommender system named the Oncological Neural Network Prognosis and Recommendation (ONNPAR) System. ONNPAR was modeled after modern clinical decision support systems that offer recommendations, and was designed to serve as the highly-critical decision scenario for our research.

2 METHODS
2.1 A brief introduction to Q-Methodology
Q-methodology is distinctly different from "R" methodology, and several of the distinctions should be addressed. R-methodology samples respondents who are representative of a larger population and measures them for the presence or expression of certain characteristics or traits. These measurements are made objectively, as the opinions of respondents are seen as potentially confounding and are therefore controlled. Using inferential statistics, findings are then abstracted to predict prevalence and to generalize to a larger target population [50].
   Q-methodology, on the other hand, invites participants to directly express their subjective opinions on a given topic by sorting statements (or questions) into a hierarchy that represents what is most or least important to them. Each participant's arrangement of statements or questions represents an individual's point of view about a given topic, which ordinarily would not be of much value beyond understanding the points of view present in that particular group of individuals. Through the use of factor analysis, however, patterns of subjective opinion are uncovered, revealing a structure of thoughts and beliefs surrounding a given topic and context. We can use these findings to understand or model a phenomenon, or, in our case, to infer the potential value of different design features through user input that is both qualitatively rich and statistically sound.
   In Q-methodology, participants are given a bank of statements, each on a separate card (or presented electronically using specialized software), and asked to rank-order them in a forced-distribution grid according to some measure of affinity or agreement, depending on the context of the study [13]. For our study, we employed Q-methodology as a design-elicitation tool, similar to traditional iterative design strategies involving user evaluation of prototype designs. In this way, we provided participants with questions, each representing a design feature or suite of features that could be provided through a user interface (UI). We asked participants to sort these statements in a forced distribution, such as the one shown in Figure 2, ranking them from most important to least important to them. Then, through the use of factor analysis, we analyzed the different ways that users value and prioritize these questions, thus inferring which design elements may add to or detract from an optimal user experience [15] and quantifying the potential value of different categories of information to users in the context of improving the transparency of recommender system interfaces.

Figure 2: Example forced-sort matrix used for our study. Participants sorted all 36 questions into the array, ranking them according to personal value and significance in the context of information that could help them understand how the ONNPAR system works, and determine whether or not to accept or reject the computer-generated recommendation.

2.2 Model Development
The first step for our study was to ensure that our approach was representative of the technical and theoretical issues related to transparency in recommender systems (i.e., ontological homogeneity). To accomplish this, we used a combination of analytic and inductive techniques, combining findings from a systematic literature review with user input from a user-centered design workshop conducted for a previous project [16].
   We also sought the advice and guidance of subject matter experts (SMEs) to ensure that all technical and theoretical aspects of the concept of transparency in recommender systems had been addressed. We conducted informal interviews with a combination of academics who regularly conduct research in the fields of machine learning and intelligent systems, as well as applied researchers currently engaged in the development and design of recommender systems for industry. In total, nine SMEs were consulted and asked to review our preliminary categorization structure and to offer suggestions for other technical or theoretical issues not already captured by our approach.
   The result was a five-factor model of transparency in recommender systems. These categories consist of Data, Personal, System, Options, and Social. We briefly describe and discuss the relevance of these categories below.

Figure 3: A five-factor model of system transparency. Each factor represents a category of information that can assist users in understanding and trusting computer-generated recommendations.

   System Parameters and Logic: Understanding the perspective of another in order to anticipate their actions or understand their intentions is the process known as building a mental model [17]. Information related to how a system works, including its policies, logic, and limitations, can help users build an appropriate mental model of the system. This is often critical, as many accidents, particularly in high-risk domains such as aviation, have resulted from users having an inappropriate or inaccurate mental model of system functionality [18]-[20].
   Knowledge of how a system functions can also help in determining when the system may be in error. Numerous studies have demonstrated that providing information about how the system processes information can improve the detection of system errors and faults [21]-[23], and can thereby lower so-called 'errors of commission' [24]. These studies indicate that providing users with information that assists their understanding of system functionality may be a viable way to improve the transparency of recommender systems.
   Qualities of Data: In many instances, understanding the relationship of dependencies present in a system can provide meaningful insights into that system's functionality. A computer program may be functioning perfectly, but if the data on which it operates is exceedingly noisy or corrupt, its outputs may still be incorrect or inappropriate. Real-world accidents such as those involving the Space Shuttle Challenger and the Navy warship USS Vincennes serve as a testament to the importance of providing decision makers with information on the quality and provenance of the underlying data [25].
   Efforts to make data-related information available to users of machine learning applications have been shown to result in higher user ratings of ease of understanding, meaningfulness, and convincingness [26]. Advances in visual analytics have also improved the comprehensibility and intelligibility of data by presenting it in a manner that is more readily understood [27]. Different visualization techniques have likewise been demonstrated to improve users' understanding of cause-and-effect relationships between variables, even among users with little to no data-analysis background (i.e., data novices [28]).
   Just as it is important to consider the source as well as the quality of information, so too must users be able to see into the system and understand the data on which it is operating. The current data-driven paradigm of machine learning therefore necessitates information that can help users answer questions about the qualities of the system's data. Affording users the ability to see this data may well improve the transparency of a system's interface from the user's perspective.
   User Representation: The concept of personalization is central to the discussion of transparency in a variety of intelligent system domains, such as context-aware and automated collaborative filtering applications [4], [29]-[31]. Users often want to understand how they are modeled by a system, if at all, and to what extent system outputs are personalized for them. While commercial applications such as personalized targeted advertising algorithms are an important component of this category, the importance of user representation extends well beyond the suitability of computer-generated recommendations like movies or music titles.
   Future machine learning applications are expected to encompass a variety of domains that may well necessitate extensive explanation of how users are represented by computer systems in order to achieve user buy-in and acceptance. For example, in the domain of personal financial trading, a machine learning algorithm may possess a model of risk that is very different from its user's, and may perhaps prioritize one aspect of financial growth, such as diversification, over other aspects that the user may prioritize more, such as long-term stability. Understanding what a system knows about its user, and how that information is subsequently used to derive recommendations, is therefore of potentially critical importance for applications to achieve acceptable levels of user trust, engagement, and technology acceptance.
   Social Influence: The power of social media has been displayed in a variety of contexts over the past decade of its modern existence, and it has become a powerful tool for marketers and influencers. As of August 2017, two thirds of Americans (67%) reported that they
received at least some of their news from social media [32]. Systems that group users according to online behavior in order to predict future interests and purchases, such as automated collaborative filtering algorithms, are abundant and represent a foundational approach to modern marketing and sales [33]. In many cases, a user's understanding of how they are grouped by a system using social media information can provide meaningful insights into why a system output, such as a targeted advertisement, was generated. This is most important when conflicts arise between a user and an inappropriate system output. Such conflicts are often the result of loose affiliations on social media with others who may hold radically opposing philosophical or political viewpoints, which some recommender systems incorrectly associate into their models. Providing users opportunities to see into a system and understand how they, the user, are categorized and represented in a social group may improve user experience and trust, leading some users to remain more willing to interact with a system after such a conflict arises. There is also some evidence that some decision making may be socially mediated as well.
   Scientists have long studied the broad range of influences that social factors can have on decision making and behavior. These include various social biases [34], which can explain in limited cases how some people sometimes defer their decision making to a group or another individual, even when it would seem prudent not to do so [35]. Additionally, many people report that social relationships are important in guiding and assisting decision-making. In a 2017 Pew Research poll, 74% of American respondents reported that their social circles played at least a small role in their decision making; 37% reported that they played a significant role [36]. Systems that afford information connecting a user's system interaction with their social circles may well improve user satisfaction and usability. For example, if we imagine a user attempting to determine whether to accept or reject a recommendation, then in some contexts social information, such as the prevalence of that recommendation among others in their social circle, or a ratio of accept/reject decisions from their friends or family, could prove valuable to some people and could be used as a decision heuristic.
   Options: People often express a preference for choice over no choice in most decision-making contexts [37]. Accordingly, many systems strive to offer choices to users as a means of increasing engagement and satisfaction [38]. There are times, however, when providing multiple choices to a user may be undesirable.
   For example, most navigation systems output at most three route choices to the user, and typically highlight the one recommended by the system. There may be, of course, several hundred or even thousands more options available, but displaying them all would be unlikely to benefit the user, and may in fact lead them to discard the technology due to its confusing and cluttered interface.
   This "tyranny of choices" [39] is even more evident in light of the size and scope of many machine learning models, especially those involving deep learning. In these circumstances, it is practically infeasible to display every possible output to the user.
   Common interface design strategies involve efforts to reduce choices in order to lessen cognitive load and improve the speed and efficiency of decision making [40]. Determining the trade-offs between interface aesthetics (i.e., clutter) and user preference for options is often a challenge for engineers and designers alike. Sometimes these decisions are determined by external factors, such as corporate policy or mandated safety requirements [41]. But in some contexts users may want more options than they are typically given, or, at the very least, they may want to know whether other options exist before engaging in a decision. Closely related to this is the importance of providing some justification of why one option is deemed better than another.
   Much has been written about the role that system explanations or justifications can play in a person's interaction with, or sentiment towards, intelligent systems [42], [43]. Users often demand some form of justification from a system to help them determine the merit of an output such as a recommendation [10]. There are a variety of sub-categories of this concept as well, such as explaining why one option is NOT the best (known as counter-factual explanation).
   How precisely to engineer explanation systems in a format that is meaningful to and understood by the user under different circumstances is the subject of much current discussion in the intelligent systems communities of practice, especially in relation to machine learning (for an exhaustive review, see [8]). Much of this is beyond the scope of the current paper, but for the purposes of this discussion, suffice it to say that the ability of systems to offer explanations of their outputs is central to the concept of transparency in recommender systems.

2.3 Concourse and Q-sort development
Having identified these five factors, we then created a bank of questions for our participants to sort. This bank is known in Q-methodology parlance as a 'concourse.' A goal of developing a concourse is to create as many statements as possible to ensure a comprehensive and saturated pool of opinions or sentiments from which to sample. We used Ram's taxonomy of question types as an initial starting point to ensure that we used a variety of question types [44]. This was then refined using Silveira et al.'s taxonomy of users' frequent doubts [45]. The initial concourse consisted of 71 questions. We then refined this concourse down to a manageable bank of 36 questions with the help of five subject-matter experts in recommender systems (either professors of Cognitive Psychology with experience with recommender systems, or programmers of recommender systems). Questions that appeared redundant were combined, and those deemed irrelevant or unrelated were discarded. Each of the five factors had a roughly equivalent number of representative questions.
   This final bank of 36 questions was randomized and assigned numbers, then printed on 3x5 index cards. Each participant received their own deck of 36 individual question cards. Participants were given instructions on how to sort the cards from most-to-least valuable or important to them. Participants were then shown a vignette on a computer screen or projector. The vignette described an interaction with ONNPAR, ending with the user being given a recommendation that they must decide whether to act on or reject. Participants then sorted their cards and recorded their arrangement on a form, along with two additional questions on a questionnaire: "In a few words, please explain WHY you chose your MOST/LEAST important question to ask."
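   As an illustration of the forced-distribution sort just described, the following Python sketch validates a completed 36-card sort against an 11-column grid. It is a minimal sketch: the quasi-normal column capacities below are an assumption for illustration (the study's actual grid shape is shown in Figure 2), and the dictionary-based sort representation is hypothetical.

# Sketch: validating a forced-distribution Q-sort like the one in Figure 2.
# The 11-column quasi-normal capacities below are an illustrative assumption;
# the study's exact grid shape is the one shown in Figure 2.
from collections import Counter

COLUMNS = range(-5, 6)  # ranking values, least (-5) to most (+5) important
CAPACITY = {-5: 1, -4: 2, -3: 3, -2: 4, -1: 5, 0: 6,
            1: 5, 2: 4, 3: 3, 4: 2, 5: 1}   # sums to 36 cards

def validate_sort(sort: dict) -> None:
    """sort maps each question id (1..36) to the column it was placed in."""
    if len(sort) != 36:
        raise ValueError("all 36 questions must be placed")
    counts = Counter(sort.values())
    for col in COLUMNS:
        if counts.get(col, 0) != CAPACITY[col]:
            raise ValueError(f"column {col:+d} needs {CAPACITY[col]} cards, "
                             f"got {counts.get(col, 0)}")

# A participant's completed sort becomes a 36-element vector of column values,
# one per question, ready for the by-person correlation described in Section 3.1.

   The forced distribution is what makes sorts comparable across participants: every participant places the same number of cards in each column, so differences between sorts reflect priorities rather than response styles.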
3 RESULTS
Our participant sample comprised n=22 (16 males, 6 females), aged 22-59, with an average age of 33 years. Expertise was evaluated by self-report. Participants were classified as novices if they had no knowledge of or personal experience with recommender systems, and as experts if they had participated in either the design or programming of recommender systems.
   In the following sections we briefly describe the methodological analysis of Q-methodology and then present the findings from our ONNPAR study. We describe interpretations and insights from each of the factor groups of our factor analysis in the discussion section.

3.1 Q-method Analysis Overview
The analysis of Q-methodology is quite straightforward. Each question from the set is assigned a numerical value according to the column in which it was placed (-5 to +5 for our study). Each participant's arrangement of cards is then combined to create a by-person correlation matrix. This matrix describes the relationship of each participant's arrangement of questions with every other participant's arrangement (NOT the relationship between items within each participant). This matrix is then submitted for factor analysis, which produces factors onto which participants load based on their arrangements of questions. Two or more participants who load on the same factor, therefore, will have arranged their questions in a very similar manner, which represents similar reasoning styles or prioritization. These factors, or clusters of participants, are then analyzed by examining which questions were ranked highest and lowest by each group, as well as by examining the similarities and differences between factor groups.
   For simplicity's sake, we will henceforth refer to factors as factor groups, since in the context of Q-methodology, factor analysis identifies groups of individuals. The term "factor group" is not to be confused with the five-factor model of transparency used to guide our investigation.
   Several statistical packages are freely available to aid in the analysis of Q-methodology studies. We used one known as Ken-Q Analysis [46].

3.2 Factor Analysis
Once all sorts had been entered into our database, they were factor analyzed using the Ken-Q software. We used principal components analysis (PCA) because it has been shown to better account for random, specific, and common error variances [47]. The unrotated factor matrix was then analyzed to determine how many factors to retain for rotation. A significant factor loading at p < 0.01 is calculated using the equation 2.58(1/√n), where n is the number of questions in our set (36). Individuals with factor loadings of ±.48 were considered to have loaded on a factor and were arranged into a factor group.
   For factor extraction, we used the common practice of evaluating only factors with an eigenvalue greater than one [13]. We also determined that only factors with three or more participants loading on them would be retained. These steps resulted in four factors, which were then submitted to rotation according to mathematical criteria (e.g., varimax). With this four-factor solution, all but one participant loaded clearly on at least one factor, resulting in four distinct viewpoints of the information priorities and preferences of 21 individuals.

Table 1: Characteristics of factors after rotation.

                             Factor 1   Factor 2   Factor 3   Factor 4
No. of Defining Variables        8          5          5          3
Avg. Rel. Coef.                0.8        0.8        0.8        0.8
Composite Reliability        0.966       0.96      0.952      0.941
S.E. of Factor Z-scores      0.184        0.2      0.219      0.243
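   To make the pipeline of Sections 3.1 and 3.2 concrete, the following Python sketch reproduces its main steps on placeholder data. It is an illustration only: the study itself used the Ken-Q Analysis package, the `sorts` array below is hypothetical, and the varimax rotation step is omitted for brevity.

# Sketch of the analysis pipeline in Sections 3.1-3.2 on placeholder data.
import numpy as np

rng = np.random.default_rng(0)
sorts = rng.integers(-5, 6, size=(22, 36))   # hypothetical 22 sorts x 36 questions

# By-person correlation: correlate participants with each other, not items.
corr = np.corrcoef(sorts)                    # 22 x 22 matrix

# Principal components of the by-person correlation matrix.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Retain factors with eigenvalue > 1 (Section 3.2); rotation (varimax) omitted.
n_factors = int(np.sum(eigenvalues > 1.0))

# Loadings of each participant on each retained factor.
loadings = eigenvectors[:, :n_factors] * np.sqrt(eigenvalues[:n_factors])

# Significance threshold for a loading at p < .01: 2.58 * (1 / sqrt(n items)),
# i.e., 2.58 / sqrt(36) ~= 0.43; the study applied the stricter cutoff of +/-.48.
threshold = 2.58 / np.sqrt(36)
significant = np.abs(loadings) > 0.48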
3.3 Factor Interpretation
Once factor extraction and rotation were complete, we analyzed each factor group to interpret its meaning. This was first accomplished by producing a weighted average of each participant's arrangement of cards from within their factor group, and combining those arrangements into one exemplar composite arrangement, which serves as the model arrangement of questions for that factor group. Once these composite arrangements, or "factor arrays," have been developed for each factor group, they can be analyzed for deeper interpretation. We next evaluated the questions that were ranked highest and lowest for each factor array. This provides an early indication of information priorities, and allows us to begin crafting a picture of how participants in each factor group tend to think about the value of each category of information.
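   As a rough illustration of how a factor array can be computed, the sketch below weights each defining participant's sort by their factor loading and averages them. Brown's weighting formula w = f/(1 - f²) is a common Q-methodology convention and is an assumption here; the paper does not specify the exact weighting applied by Ken-Q.

# Sketch: building a "factor array" (Section 3.3) for one factor group.
import numpy as np

def factor_array(sorts: np.ndarray, loadings: np.ndarray) -> np.ndarray:
    """sorts: (k participants x 36 questions); loadings: (k,) on this factor."""
    # Assumed convention: Brown's weighting, so higher loaders count for more.
    w = loadings / (1.0 - loadings**2)
    composite = np.average(sorts, axis=0, weights=w)   # weighted average per question
    # Convert to z-scores; the ranked z-scores can then be mapped back onto
    # the -5..+5 grid to give the group's exemplar arrangement.
    return (composite - composite.mean()) / composite.std()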
3.4 Factor Groups
Here we report the findings from the factor analysis. To do this we describe each factor group's arrangement of the questions in terms of its highest- and lowest-ranked questions, as well as its positive and negative distinguishing questions. Distinguishing questions are those where the absolute difference between factor z-scores is larger than the standard error of differences for a given pair of factors. All distinguishing questions are significant at p < .01.
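   This test can be sketched as follows, assuming the standard error of differences for a pair of factors is combined from the per-factor standard errors reported in Table 1; the 2.58 multiplier is an assumption inferred from the stated p < .01 significance level.

# Sketch: flagging distinguishing questions between two factor arrays (Section 3.4).
import numpy as np

def distinguishing(z_a, z_b, se_a, se_b, crit=2.58):
    """z_a, z_b: (36,) factor z-scores; se_a, se_b: S.E. of each factor's
    z-scores (e.g., 0.184 and 0.2 for Factors 1 and 2 in Table 1)."""
    se_diff = np.sqrt(se_a**2 + se_b**2)   # standard error of differences
    return np.abs(np.asarray(z_a) - np.asarray(z_b)) > crit * se_diff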
   Factor Group One was defined by eight participants and explained 22% of the study variance with an eigenvalue of 6.7. Three of the factor-loading participants were females and five were males, with an average age of 37.5 years. Knowledge of recommender systems was split between five novices and three experts.
   The highest ranked question of this factor group was "Why is this recommendation the best option?" (+5). The lowest ranked was "Is there anyone in my social network that has received a similar recommendation?" (-5). Other positive distinguishing questions for the factor one group were (in descending order): "What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?" (+4), "Precisely what information about me does the system know?", and "What does the system think is MY level of 'acceptable risk'?" (+1). Negative distinguishing questions for Factor Group One were (in ascending order): "How much data was used to train this system?" (-4), "How many other people have received this recommendation from this system?" (-2), and "What does the system think I want to achieve?" (-1).
   Factor Group Two was defined by five participants and explained 13% of the study variance with an eigenvalue of 2.8.
All of the factor-loading participants were males, with an average age of 42 years. All but one member of this factor group were considered experts in recommender systems. The highest ranked question of this factor group was "What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?" (+5). The lowest ranked was "Was this recommendation made specifically for ME (based on my profile/interests), or was it made based on something else (based on some other model, such as corporate profit, or my friend's interests, etc.)?" (-5). Positive distinguishing questions for the factor two group were (in descending order): "How is this data weighted, or what data does the system prioritize?" (+4), "How much data was used to train this system?" (+2), and "Is my data uniquely different from the data on which the system has been trained?" (+1). Negative distinguishing questions for the factor two group were (in ascending order): "Is there anyone in my social network that has received a similar recommendation?" (-4), "What does the system think is MY level of 'acceptable risk'?" (-2), "What if I decline? How will that decision be used in future recommendations by this system?" (-1), and "How is my information measured and weighted in this recommendation?" (-1).
   Factor Group Three was defined by five participants and explained 9% of the study variance with an eigenvalue of 1.9. Two of the factor-loading participants were females and three were males, with an average age of 34 years. All but one of the participants in this group were considered experts in recommender systems.
   The highest ranked question of this factor group was "Under what circumstances has this system been wrong in the past?" (+5). The lowest ranked was "What if I decline? How will that decision be used in future recommendations by this system?" (-5). Other positive distinguishing questions for the factor three group were (in descending order): "What data does the system depend on in order to work properly, and do we know if those dependencies are functioning properly?" (+4), "Is my data uniquely different from the data on which the system has been trained?" (+3), and "What have other people like me done in response to this recommendation?" (+2). Negative distinguishing questions for the factor three group were (in ascending order): "What is the system's level of confidence in this recommendation?" (-2), "Are there any other options not presented here?" (-2), "How much data was used to train this system?" (-1), and "How does the system consider risk, and what is its level of 'acceptable risk'?" (-1).
   Factor Group Four was defined by three participants and explained 8% of the study variance with an eigenvalue of 1.7. There were two males and one female, with an average age of 20 years. Knowledge of recommender systems was split between two novices and one expert.
   The highest ranked question of this factor group was "What is the history of the reliability of this system?" (+5). The lowest ranked was "What does the system THINK I want to achieve? (How does the system represent my priorities and goals?)" (-5). Positive distinguishing questions for the factor four group were (in descending order): "How many other people have accepted or rejected this recommendation from this system? (What is the ratio of approve to disapprove?)" (+4), "Is the system working with solid data, or is the system inferring or making assumptions on 'fuzzy' information?" (+3), and "How many other people have received this recommendation from this system?" (+1). Negative distinguishing questions for the factor four group were (in ascending order): "Is my data uniquely different from the data on which the system has been trained?" (-3), "What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?" (-2), and "What have other people like me done in response to this recommendation?" (-1).

Table 2: Highest and lowest ranking questions of each factor group. Although this is only the most superficial analysis, distinguishing differences amongst groups begin to emerge as we analyze each group's prioritization and valuation of transparency information.

Relative Rankings of Questions by Factor Group
Factor Group 1
   Highest: Why is this recommendation the best option?
   Lowest: Is there anyone in my social network that has received a similar recommendation?
Factor Group 2
   Highest: What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?
   Lowest: Was this recommendation made specifically for me (based on my profile/interests), or something else?
Factor Group 3
   Highest: Under what circumstances has this system been wrong in the past?
   Lowest: What if I decline? How will that decision be used in future recommendations by this system?
Factor Group 4
   Highest: What is the history of the reliability of this system?
   Lowest: What does the system think I want to achieve? (How does the system represent my priorities and goals?)

4 DISCUSSION
Findings from our factor analysis yielded several surprising insights. We begin with a discussion of questions that produced a high degree of either consensus or disagreement amongst factor groups, and then conclude with a discussion of each factor group.
4.1 Analysis of Consensus vs. Disagreement Findings
A common technique for examining these data is to explore questions that created either consensus or a large amount of disagreement in our sample. By examining the variance between all item rankings, we can explore which questions were generally agreed upon (i.e., consensus) and which items produced large disagreement. For instance, all participants ranked "How clean or accurate is the data used in making this recommendation?" as either 0 or -1, indicating that this question was only moderately valuable to them in the context of a clinical decision support system. This is potentially valuable information for designers to consider, given that the fuzziness of data is sometimes displayed to users as a method of enhancing system transparency [48]. Given these findings, it may be useful to reconsider displaying information about the qualities of data to users in favor of other types of information deemed more useful or valuable.
   Similarly, we can learn much from these data by evaluating questions that produced a great deal of disagreement between factor groups. For instance, the question "Was this recommendation made specifically for ME (based on my profile/interests), or was it made based on something else (based on some other model, such as corporate profit, or my friend's interests, etc.)?" had the largest variance, with factor groups one and three assigning it a positive value (4 and 3), and factor groups two and four assigning it a negative value (-5 and -4). Interestingly, factor group two ranked this question as the least valuable or important question of their q-set, while factor group one ranked it as their second most valuable or important question.
   Interpreting these findings can, at first glance, appear confounding to a designer looking for clear guidance. Clearly, some individuals would prefer to have information that could indicate how they, as a user, are modeled and considered (if at all) in system-generated recommendations as a means of improving their trust, while others clearly discount the value of this kind of information. These findings suggest that social influence information, such as what other users are doing in response to recommendations, may at times be valuable to some users in helping determine whether or not to accept or reject a recommendation.
   Two other questions also produced wide disagreement across factor groups. "How many other people have accepted or rejected this recommendation from this system? (What is the ratio of approve to disapprove?)" and "Is there anyone in my social network that has received a similar recommendation?" were ranked near the poles by different factor groups. This indicates that social media-related information in highly-critical contexts, while not important to some, is still considered valuable by other users, who may find it an important component for enhancing their understanding of and trust in system-generated recommendations.

Table 3: Consensus questions are those which all participants agreed were of relevant importance, as indicated by low Z-score variance in their arrangements. Disagreement questions are those which polarized opinion, as indicated by high Z-score variance in their arrangements.

Consensus (Z-score variance):
- Can I influence the system by providing feedback? Will it listen and consider my input? (0.024)
- How clean or accurate is the data used in making this recommendation? (0.029)
- How often is the system checked to make sure it is functioning as it was designed (i.e., for model accuracy)? (0.046)

Disagreement (Z-score variance):
- How many other people have accepted or rejected this recommendation from this system? (What is the ratio of approve to disapprove?) (1.179)
- Is there anyone in my social network that has received a similar recommendation? (1.246)
- Was this recommendation made specifically for ME, or was it based on something else? (2.261)
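   The consensus/disagreement computation in Table 3 can be sketched as follows, assuming the per-question z-scores of the four factor arrays are available; the 4 x 36 array below is a hypothetical placeholder (Table 3 reports the study's actual variances).

# Sketch of the consensus/disagreement analysis in Section 4.1.
import numpy as np

rng = np.random.default_rng(1)
factor_arrays = rng.normal(size=(4, 36))   # placeholder z-scores, 4 factors x 36 questions

# Variance of each question's z-scores across the four factor arrays:
# low variance -> consensus; high variance -> disagreement.
z_variance = factor_arrays.var(axis=0)

consensus = np.argsort(z_variance)[:3]          # most agreed-upon questions
disagreement = np.argsort(z_variance)[-3:][::-1]  # most polarizing questions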
5 CONCLUSION
We have illustrated our five-factor model of information categories that can be used to increase the transparency of recommender systems to end users. We developed a bank of 36 questions representing information-gathering strategies that users could employ to interrogate system-generated recommendations in an effort to understand the system's reasoning and decide whether to accept or reject the recommendation. Using this bank of questions, participants sorted the questions according to those they found most valuable or useful in helping them determine whether to accept or reject a computer-generated recommendation. We analyzed how participants arranged these questions using a factor-analytic technique. Our findings support other studies in finding that transparency is a multi-dimensional construct, and that achieving it depends on multiple variables, including, to some extent, the user's preferences for and valuation of certain categories of information. Our findings are intended to inform the future interface design of recommender systems, as well as to broaden the discussion of the importance of building systems whose outputs and recommendations are easily understood by their users.

REFERENCES
 [1] F. Doshi-Velez and B. Kim, "Towards A Rigorous Science of Interpretable Machine Learning," arXiv, 2017.
 [2] B. Buchanan and E. Shortliffe, Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley, 1984.
 [3] L. R. Ye and P. E. Johnson, "The Impact of Explanation Facilities on User Acceptance of Expert Systems Advice," MIS Quarterly, vol. 19, no. 2, p. 157, Jun. 1995.
 [4] J. L. Herlocker, J. A. Konstan, and J. Riedl, "Explaining collaborative filtering recommendations," presented at the 2000 ACM conference, New York, New York, USA, 2000, pp. 241-250.
 [5] T. Kulesza, M. Burnett, W.-K. Wong, and S. Stumpf, "Principles of Explanatory Debugging to Personalize Interactive Machine Learning," presented at the 20th International Conference, New York, New York, USA, 2015, pp. 126-137.
 [6] M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck, "The role of trust in automation reliance," International Journal of Human-Computer Studies, vol. 58, no. 6, pp. 697-718, Jun. 2003.
 [7] B. Lorenz, F. Di Nocera, and R. Parasuraman, "Display Integration Enhances Information Sampling and Decision Making in Automated Fault Management in a Simulated Spaceflight Micro-World," Proceedings of the Human Factors and Ergonomics Society Annual Meeting, pp. 31-35, 2002.
 [8] T. Miller, "Explanation in Artificial Intelligence: Insights from the Social Sciences," arXiv, pp. 1-57, Jun. 2017.
 [9] G. B. Duggan, S. Banbury, A. Howes, J. Patrick, and S. M. Waldron, "Too Much, Too Little, or Just Right: Designing Data Fusion for Situation Awareness," Proceedings of the Human Factors and Ergonomics Society Annual Meeting, pp. 528-532, 2004.
[10] K. Swearingen and R. Sinha, "Beyond algorithms: An HCI perspective on recommender systems," ACM SIGIR 2001 Workshop on Recommender Systems, 2001.
[11] H. F. Neyedli, J. G. Hollands, and G. A. Jamieson, "Beyond Identity: Incorporating System Reliability Information Into an Automated Combat Identification System," Human Factors, vol. 53, no. 4, pp. 338-355, Jul. 2011.
[12] W. Stephenson, The Study of Behavior: Q-Technique and Its Methodology. Chicago, IL: University of Chicago Press, 1953.
[13] S. R. Brown, "A primer on Q methodology," Operant Subjectivity, vol. 16, no. 3/4, pp. 91-138, 1993.
[14] S. Watts and P. Stenner, "Doing Q Methodology: theory, method and interpretation," Qualitative Research in Psychology, vol. 2, no. 1, pp. 67-91, Jan. 2005.
[15] K. O'Leary, J. O. Wobbrock, and E. A. Riskin, "Q-methodology as a research and design tool for HCI," presented at CHI 2013, Paris, France, 2013, pp. 1941-1950.
[16] E. S. Vorm, "Assessing Demand for Transparency in Intelligent Systems Using Machine Learning," presented at the IEEE Innovations in Intelligent Systems and Applications (INISTA), Thessaloniki, 2018, pp. 41-48.
[17] W. B. Rouse and N. M. Morris, "On looking into the black box: Prospects and limits in the search for mental models," Psychological Bulletin, vol. 100, no. 3, pp. 349-363, 1986.
[18] N. B. Sarter and D. D. Woods, "How in the World Did We Ever Get into That Mode? Mode Error and Awareness in Supervisory Control," Human Factors, vol. 37, no. 1, pp. 5-19, 1995.
[19] A. F. Zeller, "Accidents and Safety," in Systems Psychology, K. B. DeGreene, Ed. New York, NY, 1970, pp. 131-150.
[20] National Transportation Safety Board, "Loss of Control on Approach, Colgan Air, Inc., Operating as Continental Connection Flight 3407, Bombardier DHC-8-400, N200WQ, Clarence Center, New York, February 12, 2009," NTSB/AAR-10/01 PB2010-910401, Feb. 2010.
[21] G. G. Sadler, H. Battiste, N. Ho, L. C. Hoffmann, W. Johnson, R. Shively, J. B. Lyons, and D. Smith, "Effects of transparency on pilot trust and agreement in the autonomous constrained flight planner," presented at the 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), 2016, pp. 1-9.
[22] A. Sebok and C. D. Wickens, "Implementing Lumberjacks and Black Swans Into Model-Based Tools to Support Human-Automation Interaction," Human Factors, vol. 59, no. 2, pp. 189-203, Mar. 2017.
[23] J. Y. C. Chen, K. Procci, M. Boyce, J. L. Wright, A. Garcia, and M. J. Barnes, "Situation Awareness-Based Transparency," ARL-TR-6905, Apr. 2014.
[24] K. L. Mosier and L. J. Skitka, "Human Decision Makers and Automated Decision Aids: Made for Each Other?," in Automation and Human Performance: Theory and Applications, R. Parasuraman and M. Mouloua, Eds. NJ: Lawrence Erlbaum, 1996, pp. 201-220.
[25] C. W. Fisher and B. R. Kingma, "Criticality of data quality as exemplified in two disasters," Information and Management, vol. 39, pp. 109-116, 2001.
[26] J. Zhou, M. A. Khawaja, Z. Li, J. Sun, Y. Wang, and F. Chen, "Making machine learning useable by revealing internal states update - a transparent approach," International Journal of Computational Science and Engineering, vol. 13, no. 4, pp. 378-389, 2016.
[27] T. Muhlbacher, H. Piringer, S. Gratzl, M. Sedlmair, and M. Streit, "Opening the Black Box: Strategies for Increased User Involvement in Existing Algorithm Implementations," IEEE Trans. Visual. Comput. Graphics, vol. 20, no. 12, pp. 1643-1652.
[28] J. Bae, E. Ventocilla, M. Riveiro, T. Helldin, and G. Falkman, "Evaluating Multi-attributes on Cause and Effect Relationship Visualization," presented at the International Conference on Information Visualization Theory and Applications, 2017, pp. 64-74.
[29] V. Bellotti and K. Edwards, "Intelligibility and Accountability: Human Considerations in Context-Aware Systems," Human-Computer Interaction, vol. 16, no. 2, pp. 193-212, 2001.
[30] B. Y. Lim and A. K. Dey, "Assessing demand for intelligibility in context-aware applications," presented at the 11th international conference, New York, New York, USA, 2009, p. 195.
[31] A. S. Clare, M. L. Cummings, and N. P. Repenning, "Influencing Trust for Human-Automation Collaborative Scheduling of Multiple Unmanned Vehicles," Human Factors, vol. 57, no. 7, pp. 1208-1218, Oct. 2015.
[32] E. Shearer and J. Gottfried, "News Use Across Social Media Platforms 2017," Pew Research Center, Sep. 2017.
[33] Adobe Inc., "Digital Intelligence Briefing: 2018 Digital Trends," Adobe Inc., Feb. 2018.
[34] A. Tversky and D. Kahneman, "Judgment under Uncertainty: Heuristics and Biases," Science, vol. 185, no. 4157, pp. 1124-1131, Sep. 1974.
[35] S. Fiske and S. Taylor, Social Cognition. Reading, MA: Addison-Wesley, 1991.
[36] J. B. Horrigan, "How People Approach Facts and Information," Pew Research Center, Aug. 2017.
[37] L. E. Blume and D. Easley, "Rationality," in The New Palgrave Dictionary of Economics, S. Durlauf and L. E. Blume, Eds., 2008.
[38] J. Preece, H. Sharp, and Y. Rogers, Interaction Design: Beyond Human-Computer Interaction, 4th ed. Wiley, 2015.
[39] B. Schwartz, The Paradox of Choice: Why More Is Less. Harper Perennial, 2004.
[40] Rose, "Human-Centered Design Meets Cognitive Load Theory: Designing Interfaces that Help People Think," pp. 1-10, Oct. 2006.
[41] M. Zahabi, D. B. Kaber, and M. Swangnetr, "Usability and Safety in Electronic Medical Records Interface Design: A Review of Recent Literature and Guideline Formulation," Human Factors, vol. 57, no. 5, pp. 805-834, Aug. 2015.
[42] S. Gregor and I. Benbasat, "Explanations from Intelligent Systems: Theoretical Foundations and Implications for Practice," MIS Quarterly, vol. 23, no. 4, p. 497, Dec. 1999.
[43] D. L. McGuinness, A. Glass, M. Wolverton, and P. P. Da Silva, "A Categorization of Explanation Questions for Task Processing Systems," presented at the AAAI Workshop on Explanation-Aware Computing (ExaCt), 2007.
[44] A. Ram, AQUA: Questions that Drive the Explanation Process. Lawrence Erlbaum, 1993.
[45] M. S. Silveira, C. S. de Souza, and S. D. J. Barbosa, "Semiotic engineering contributions for designing online help systems," presented at the 19th annual international conference, New York, New York, USA, 2001, p. 31.
[46] S. Banasick, "Ken-Q Analysis."
[47] J. K. Ford, R. C. MacCallum, and M. Tait, "The Application of Exploratory Factor Analysis in Applied Psychology: A critical review and analysis," Personnel Psychology, vol. 39, no. 2, pp. 291-314, Jun. 1986.
[48] W. Yuji, "The Trust Value Calculating for Social Network Based on Machine Learning," presented at the 2017 9th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2017, pp. 133-136.
[49] R. S. Amant and R. M. Young, "Interface Agents in Model World Environments," AI Magazine, vol. 22, no. 4, p. 95, Dec. 2001.
[50] J. Devore, Probability and Statistics for Engineering and the Sciences, 4th ed. Brooks/Cole, 1995.