    Assessing the Value of Transparency in Recommender Systems:
                       An End-User Perspective
                                              Eric S. Vorm (esvorm@iu.edu)
                                            Andrew D. Miller (andrewm@iupui.edu)
                                  Indiana University Purdue University Indianapolis
                                                   Indianapolis, IN
                                  IntRS Workshop, October 2018, Vancouver, Canada
ABSTRACT
Recommender systems, especially those built on machine learning, are increasing in popularity, as well as in complexity and scope. Systems that cannot explain their reasoning to end-users risk losing trust with users and failing to achieve acceptance. Users demand interfaces that afford them insights into internal workings, allowing them to build appropriate mental models and calibrated trust. Building interfaces that provide this level of transparency, however, is a significant design challenge, with many competing design features and little empirical research to guide implementation. We investigated how end-users of recommender systems value different categories of information when deciding what to do with computer-generated recommendations in contexts involving high risk to themselves or others. Our findings will inform the future design of decision support in high-criticality contexts.

Figure 1: ONNPAR is a simulated clinical decision support system built on machine learning. It was used as the testbed for this study, serving the role of a highly-critical decision context.

1 INTRODUCTION
New machines are being invested with increasing levels of authority and unprecedented scope. Decisions previously made by humans are increasingly being made by computers, often with little or no explanation, raising concerns over a plethora of social, legal, and ethical issues such as privacy, bias, and safety.
   Transparency is often discussed in terms of back-end programming or troubleshooting. End-users, especially novice users interacting with recommender systems, are seldom studied. Yet recent developments in AI suggest that automated recommendations will become an increasingly common component of users' daily lives as technologies such as self-driving cars and IoT-enabled smart homes become commonplace. Developing methods to increase the transparency of computer-generated recommendations, and to understand user information needs as a means of increasing trust in and engagement with recommendations, is therefore crucial. Transparent interface design is often complicated by a series of trade-offs that seek to balance and prioritize several competing design principles. Striking the appropriate balance between too much and not enough information is often more art than science, and is becoming more difficult with the growing prevalence of data-driven paradigms such as machine learning [1].
   Efforts to improve the transparency of recommender systems commonly involve programming system-generated explanations that seek to justify the recommendation to users, often through the use of system logic [2]. Providing explanations and justifications of system behavior to users has proven to be a highly effective means of increasing user acceptance and enhancing user attitudes towards recommender systems [3]. Studies have shown that providing explanations to users tends to increase trust [4], improve user comprehension [5], calibrate appropriate reliance on decision aids [6], and enable better detection and correction of system errors [7]. Generating explanations that users find both useful and satisfactory, however, can be a complicated task, and much research has been conducted to answer the question of "what makes a good explanation" [8].
   While system-generated explanations represent the most common approach to transparency in recommender systems, in many cases simply providing users access to certain types of information can also improve transparency, and can dramatically improve user experience and the likelihood of further interaction [5]. In some contexts, affording users the opportunity to see into the system's dependencies, policies, limitations, or information about how the user is modeled and considered by the system can facilitate the same level of user understanding (and subsequent trust) as an explicit explanation [9].
   Providing targeted information as a means of improving a user's mental model and trust (i.e., transparency) has two potential benefits over building explanation interfaces. First, it affords users an opportunity to use deductive reasoning to determine the merit and validity of system recommendations, which has been demonstrated to improve usability and user trust in many contexts. For instance, Swearingen and Sinha reported that recommender systems whose interfaces provided information that could help users understand the system were preferred over those that did not [10]. Research in cognitive agents has also demonstrated that providing users access to underlying system information, such as system dependencies or the provenance of data, can greatly improve human-machine performance and reduce the likelihood of users acting on recommendations that are erroneous, known as "errors of commission" [11]. A second benefit of affording users the opportunity to see into the system in order to understand its processes is that it requires little to no additional programming. This is often because much of the information that could enhance user understanding of system functions and behaviors is already present in the system, but is often hidden from front-end interfaces in order to reduce clutter and streamline layouts.
   This trade-off between providing adequate information to communicate a system's intent and achieving a user-friendly interface design is a common challenge, often resolved through iterative design evaluations involving user testing. While research involving transparency in system design frequently focuses on behavioral outcomes, such as modeling the appropriateness of a user's interaction with a recommender system, little is known about what information is most efficacious to users in terms of improving mental models, resolving conflicts caused by unexpected or unanticipated system behaviors, or improving user trust and technology acceptance. Answering these questions requires investigating how users subjectively value and prioritize different categories of information, whether to resolve conflicts between expected and observed system behaviors or to evaluate the validity or accuracy of a recommendation and so decide whether to accept or reject it.
   To accomplish this, we used an approach known as Q-methodology, commonly referred to as the systematic study of subjectivity [13]. To constrain our work and prevent overgeneralization of findings, we chose to investigate what information users value most when engaged with recommender systems in a highly critical decision scenario. We hypothesize that users involved in tasks carrying a high degree of risk to themselves or others are more likely to critically interrogate computer-generated recommendations before accepting and acting upon them. This suggests that systems providing recommendations in highly critical decision contexts, such as the medical, legal, financial, or automotive domains, would benefit most from interfaces that enable users to quickly and accurately discern whether or not to trust those recommendations. Using the decision-criticality framework as a guide, we developed a hypothetical recommender system named the Oncological Neural Network Prognosis and Recommendation (ONNPAR) System. ONNPAR was modeled after modern clinical decision support systems that offer recommendations, and was designed to serve as the highly-critical decision scenario for our research.

2 METHODS
2.1 A brief introduction to Q-Methodology
Q-methodology is distinctly different from "R" methodology, and several of the distinctions should be addressed. R-methodology samples respondents who are representative of a larger population and measures them for the presence or expression of certain characteristics or traits. These measurements are made objectively, as the opinions of respondents are seen as potentially confounding and are therefore controlled. Using inferential statistics, findings are then abstracted to predict prevalence and to generalize to a larger target population [50].
   Q-methodology, on the other hand, invites participants to directly express their subjective opinions on a given topic by sorting statements (or questions) into a hierarchy that represents what is most or least important to them. Each participant's arrangement of statements or questions represents an individual's point of view about a given topic, which ordinarily would not be of much value beyond understanding the points of view present in that particular group of individuals. Through the use of factor analysis, however, patterns of subjective opinion are uncovered, revealing a structure of thoughts and beliefs surrounding a given topic and context. We can use these findings to understand or model a phenomenon, or, in our case, to infer the potential value of different design features through user input that is both qualitatively rich and statistically sound.
   In Q-methodology, participants are given a bank of statements, each on a separate card (or presented electronically using specialized software), and asked to rank-order them in a forced-distribution grid according to some measure of affinity or agreement, depending on the context of the study [13]. For our study, we employed Q-methodology as a design-elicitation tool, similar to traditional iterative design strategies involving user evaluation of prototype designs. In this way, we provided participants with questions, each representing a design feature or suite of features that could be provided through a user interface (UI). We asked participants to sort these statements in a forced distribution, such as the one shown in Figure 2, ranking them from most important to least important to them. Then, through the use of factor analysis, we analyzed the different ways that users value and prioritize these questions, thus inferring which design elements may add to or detract from an optimal user experience [15] and quantifying the potential value of different categories of information to users in the context of improving the transparency of recommender system interfaces.

Figure 2: Example forced-sort matrix used for our study. Participants sorted all 36 questions into the array, ranking them according to personal value and significance in the context of information that could help them understand how the ONNPAR system works, and determine whether or not to accept or reject the computer-generated recommendation.

2.2 Model Development
The first step for our study was to ensure that our approach was representative of the technical and theoretical issues related to transparency in recommender systems (i.e., ontological homogeneity). To accomplish this, we used a combination of analytic and inductive techniques, combining findings from a systematic literature review with user input from a user-centered design workshop conducted for a previous project [16].
   We also sought the advice and guidance of subject matter experts (SMEs) to ensure that all technical and theoretical aspects of the concept of transparency in recommender systems had been addressed. We conducted informal interviews with a combination of academics who regularly conduct research in the fields of machine learning and intelligent systems, as well as applied researchers currently engaged in the development and design of recommender systems for industry. In total, nine SMEs were consulted and asked to review our preliminary categorization structure and to offer suggestions for other technical or theoretical issues not already captured by our approach.
   The result was a five-factor model of transparency in recommender systems. These categories consist of Data, Personal, System, Options, and Social. We briefly describe and discuss the relevance of these categories below.

Figure 3: A five-factor model of system transparency. Each factor represents a category of information that can assist users in understanding and trusting computer-generated recommendations.

   System Parameters and Logic: Understanding the perspective of another in order to anticipate their actions or understand their intentions is the process known as building a mental model [17]. Information related to how a system works, including its policies, logic, and limitations, can help users build an appropriate mental model of the system. This is often critical, as many accidents, particularly in high-risk domains such as aviation, have resulted from users having an inappropriate or inaccurate mental model of system functionality [18]-[20].
   Knowledge of how a system functions can also help in determining when the system may be in error. Numerous studies have demonstrated that providing information about how the system processes information can improve the detection of system errors and faults [21]-[23], and can thereby lower so-called 'errors of commission' [24]. These studies indicate that providing users with information that assists their understanding of system functionality may be a viable way to improve the transparency of recommender systems.
   Qualities of Data: In many instances, understanding the relationship of dependencies present in a system can provide meaningful insights into that system's functionality. A computer program may be functioning perfectly, but if the data on which it operates is exceedingly noisy or corrupt, its outputs may still be incorrect or inappropriate. Real-world accidents such as those involving the Space Shuttle Challenger and the Navy warship USS Vincennes serve as a testament to the importance of providing decision makers with information on the quality and provenance of the underlying data [25].
   Efforts to make data-related information available to users of machine learning applications have been shown to result in higher user ratings of ease of understanding, meaningfulness, and convincingness [26]. Advances in visual analytics have also improved the comprehensibility and intelligibility of data by presenting it in a manner that is more readily understood [27]. Different visualization techniques have likewise been demonstrated to improve users' understanding of cause-and-effect relationships between variables, even among users with little to no data-analysis background (i.e., data novices [28]).
   Just as it is important to consider the source as well as the quality of information, so too must users be able to see into the system and understand the data on which it is operating. The current data-driven paradigm of machine learning therefore necessitates information that can help users answer questions about the qualities of the system's data. Affording users the ability to see this data may well improve the transparency of a system's interface from the user's perspective.
   User Representation: The concept of personalization is central to the discussion of transparency in a variety of intelligent system domains, such as context-aware and automated collaborative filtering applications [4], [29]-[31]. Users often want to understand how they are modeled by a system, if at all, and to what extent system outputs are personalized for them. While commercial applications such as personalized targeted advertising algorithms are an important component of this category, the importance of user representation extends well beyond the suitability of computer-generated recommendations like movies or music titles.
   Future machine learning applications are expected to encompass a variety of domains that may well necessitate extensive explanation of how users are represented by computer systems in order to achieve user buy-in and acceptance. For example, in the domain of personal financial trading, a machine learning algorithm may possess a model of risk that is very different from its user's, and may perhaps prioritize one aspect of financial growth, such as diversification, over other aspects that the user may prioritize more, such as long-term stability. Understanding what a system knows about its user, and how that information is subsequently used to derive recommendations, is therefore of potentially critical importance for applications to achieve acceptable levels of user trust, engagement, and technology acceptance.
   Social Influence: The power of social media has been displayed in a variety of contexts over the past decade of its modern existence, and it has become a powerful tool for marketers and influencers. As of August 2017, two thirds of Americans (67%) reported that they
received at least some of their news from social media [32]. Systems that group users according to online behavior in order to predict future interests and purchases, such as automated collaborative filtering algorithms, are abundant and represent a foundational approach to modern marketing and sales [33]. In many cases, a user's understanding of how they are grouped by a system using social media information can provide meaningful insights into why a system output, such as a targeted advertisement, was generated. This is most important when conflicts arise between a user and an inappropriate system output. Such conflicts are often the result of loose affiliations on social media with others who may hold radically opposing philosophical or political viewpoints, which some recommender systems incorrectly associate into their models. Providing users opportunities to see into a system and understand how they, the user, are categorized and represented in a social group may improve user experience and trust, leading some users to remain more willing to interact with a system after such a conflict arises. There is also some evidence that some decision making may be socially mediated as well.
   Scientists have long studied the broad range of influences that social factors can have on decision making and behavior. These include various social biases [34], which can explain in limited cases how some people sometimes defer their decision making to a group or another individual, even when it would seem prudent not to do so [35]. Additionally, many people report that social relationships are important in guiding and assisting decision-making. In a 2017 Pew Research poll, 74% of American respondents reported that their social circles played at least a small role in their decision making; 37% reported that they played a significant role [36]. Systems that afford information connecting a user's system interaction with their social circles may well improve user satisfaction and usability. For example, if we imagine a user attempting to determine whether to accept or reject a recommendation, then in some contexts social information, such as the prevalence of that recommendation among others in their social circle, or a ratio of accept/reject decisions from their friends or family, could prove valuable to some people and could be used as a decision heuristic.
   Options: People often express a preference for choice over no choice in most decision-making contexts [37]. Accordingly, many systems strive to offer choices to users as a means of increasing engagement and satisfaction [38]. There are times, however, when providing multiple choices to a user may be undesirable.
   For example, most navigation systems output at most three route choices to the user, and typically highlight the one recommended by the system. There may be, of course, several hundred or even thousands more options available, but displaying them all would be unlikely to benefit the user, and may in fact lead them to discard the technology due to its confusing and cluttered interface.
   This "tyranny of choices" [39] is even more evident in light of the size and scope of many machine learning models, especially those involving deep learning. In these circumstances, it is practically infeasible to display every possible output to the user.
   Common interface design strategies involve efforts to reduce choices in order to lessen cognitive load and improve the speed and efficiency of decision making [40]. Determining the trade-offs between interface aesthetics (i.e., clutter) and user preference for options is often a challenge for engineers and designers alike. Sometimes these decisions are determined by external factors, such as corporate policy or mandated safety requirements [41]. But in some contexts users may want more options than they are typically given, or, at the very least, they may want to know whether other options exist before engaging in a decision. Closely related to this is the importance of providing some justification of why one option is deemed better than another.
   Much has been written about the role that system explanations or justifications can play in a person's interaction with, or sentiment towards, intelligent systems [42], [43]. Users often demand some form of justification from a system to help them determine the merit of an output such as a recommendation [10]. There are a variety of sub-categories of this concept as well, such as explaining why one option is NOT the best (known as counter-factual explanation).
   How precisely to engineer explanation systems in a format that is meaningful to and understood by the user under different circumstances is the subject of much current discussion in the intelligent systems communities of practice, especially in relation to machine learning (for an exhaustive review, see [8]). Much of this is beyond the scope of the current paper, but for the purposes of this discussion, suffice it to say that the ability of systems to offer explanations of their outputs is central to the concept of transparency in recommender systems.

2.3 Concourse and Q-sort development
Having identified these five factors, we then created a bank of questions for our participants to sort. This bank is known in Q-methodology parlance as a 'concourse.' A goal of developing a concourse is to create as many statements as possible to ensure a comprehensive and saturated pool of opinions or sentiments from which to sample. We used Ram's taxonomy of question types as an initial starting point to ensure that we used a variety of question types [44]. This was then refined using Silveira et al.'s taxonomy of users' frequent doubts [45]. The initial concourse consisted of 71 questions. We then refined this concourse down to a manageable bank of 36 questions with the help of five subject-matter experts in recommender systems (either professors of Cognitive Psychology with experience with recommender systems, or programmers of recommender systems). Questions that appeared redundant were combined, and those deemed irrelevant or unrelated were discarded. Each of the five factors had a roughly equivalent number of representative questions.
   This final bank of 36 questions was randomized and assigned numbers, then printed on 3x5 index cards. Each participant received their own deck of 36 individual question cards. Participants were given instructions on how to sort the cards from most-to-least valuable or important to them. Participants were then shown a vignette on a computer screen or projector. The vignette described an interaction with ONNPAR, ending with the user being given a recommendation that they must decide whether to act on or reject. Participants then sorted their cards and recorded their arrangement on a form, along with two additional questions on a questionnaire: "In a few words, please explain WHY you chose your MOST/LEAST important question to ask."
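   As an illustration of the forced-distribution sort just described, the following Python sketch validates a completed 36-card sort against an 11-column grid. It is a minimal sketch: the quasi-normal column capacities below are an assumption for illustration (the study's actual grid shape is shown in Figure 2), and the dictionary-based sort representation is hypothetical.

# Sketch: validating a forced-distribution Q-sort like the one in Figure 2.
# The 11-column quasi-normal capacities below are an illustrative assumption;
# the study's exact grid shape is the one shown in Figure 2.
from collections import Counter

COLUMNS = range(-5, 6)  # ranking values, least (-5) to most (+5) important
CAPACITY = {-5: 1, -4: 2, -3: 3, -2: 4, -1: 5, 0: 6,
            1: 5, 2: 4, 3: 3, 4: 2, 5: 1}   # sums to 36 cards

def validate_sort(sort: dict) -> None:
    """sort maps each question id (1..36) to the column it was placed in."""
    if len(sort) != 36:
        raise ValueError("all 36 questions must be placed")
    counts = Counter(sort.values())
    for col in COLUMNS:
        if counts.get(col, 0) != CAPACITY[col]:
            raise ValueError(f"column {col:+d} needs {CAPACITY[col]} cards, "
                             f"got {counts.get(col, 0)}")

# A participant's completed sort becomes a 36-element vector of column values,
# one per question, ready for the by-person correlation described in Section 3.1.

   The forced distribution is what makes sorts comparable across participants: every participant places the same number of cards in each column, so differences between sorts reflect priorities rather than response styles.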
3 RESULTS
Our participant sample comprised n=22 (16 males, 6 females), aged 22-59, with an average age of 33 years. Expertise was evaluated by self-report. Participants were classified as novices if they had no knowledge of or personal experience with recommender systems, and as experts if they had participated in either the design or programming of recommender systems.
   In the following sections we briefly describe the methodological analysis of Q-methodology and then present the findings from our ONNPAR study. We describe interpretations and insights from each of the factor groups of our factor analysis in the discussion section.

3.1 Q-method Analysis Overview
The analysis of Q-methodology is quite straightforward. Each question from the set is assigned a numerical value according to the column in which it was placed (-5 to +5 for our study). Each participant's arrangement of cards is then combined to create a by-person correlation matrix. This matrix describes the relationship of each participant's arrangement of questions with every other participant's arrangement (NOT the relationship between items within each participant). This matrix is then submitted for factor analysis, which produces factors onto which participants load based on their arrangements of questions. Two or more participants who load on the same factor, therefore, will have arranged their questions in a very similar manner, which represents similar reasoning styles or prioritization. These factors, or clusters of participants, are then analyzed by examining which questions were ranked highest and lowest by each group, as well as by examining the similarities and differences between factor groups.
   For simplicity's sake, we will henceforth refer to factors as factor groups, since in the context of Q-methodology, factor analysis identifies groups of individuals. The term "factor group" is not to be confused with the five-factor model of transparency used to guide our investigation.
   Several statistical packages are freely available to aid in the analysis of Q-methodology studies. We used one known as Ken-Q Analysis [46].

3.2 Factor Analysis
Once all sorts had been entered into our database, they were factor analyzed using the Ken-Q software. We used principal components analysis (PCA) because it has been shown to better account for random, specific, and common error variances [47]. The unrotated factor matrix was then analyzed to determine how many factors to retain for rotation. A significant factor loading at p < 0.01 is calculated using the equation 2.58(1/√n), where n is the number of questions in our set (36). Individuals with factor loadings of ±.48 were considered to have loaded on a factor and were arranged into a factor group.
   For factor extraction, we used the common practice of evaluating only factors with an eigenvalue greater than one [13]. We also determined that only factors with three or more participants loading on them would be retained. These steps resulted in four factors, which were then submitted to rotation according to mathematical criteria (e.g., varimax). With this four-factor solution, all but one participant loaded clearly on at least one factor, resulting in four distinct viewpoints of the information priorities and preferences of 21 individuals.

Table 1: Characteristics of factors after rotation.

                             Factor 1   Factor 2   Factor 3   Factor 4
No. of Defining Variables        8          5          5          3
Avg. Rel. Coef.                0.8        0.8        0.8        0.8
Composite Reliability        0.966       0.96      0.952      0.941
S.E. of Factor Z-scores      0.184        0.2      0.219      0.243
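   To make the pipeline of Sections 3.1 and 3.2 concrete, the following Python sketch reproduces its main steps on placeholder data. It is an illustration only: the study itself used the Ken-Q Analysis package, the `sorts` array below is hypothetical, and the varimax rotation step is omitted for brevity.

# Sketch of the analysis pipeline in Sections 3.1-3.2 on placeholder data.
import numpy as np

rng = np.random.default_rng(0)
sorts = rng.integers(-5, 6, size=(22, 36))   # hypothetical 22 sorts x 36 questions

# By-person correlation: correlate participants with each other, not items.
corr = np.corrcoef(sorts)                    # 22 x 22 matrix

# Principal components of the by-person correlation matrix.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Retain factors with eigenvalue > 1 (Section 3.2); rotation (varimax) omitted.
n_factors = int(np.sum(eigenvalues > 1.0))

# Loadings of each participant on each retained factor.
loadings = eigenvectors[:, :n_factors] * np.sqrt(eigenvalues[:n_factors])

# Significance threshold for a loading at p < .01: 2.58 * (1 / sqrt(n items)),
# i.e., 2.58 / sqrt(36) ~= 0.43; the study applied the stricter cutoff of +/-.48.
threshold = 2.58 / np.sqrt(36)
significant = np.abs(loadings) > 0.48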
3.3 Factor Interpretation
Once factor extraction and rotation were complete, we analyzed each factor group to interpret its meaning. This was first accomplished by producing a weighted average of each participant's arrangement of cards from within their factor group, and combining those arrangements into one exemplar composite arrangement, which serves as the model arrangement of questions for that factor group. Once these composite arrangements, or "factor arrays," have been developed for each factor group, they can be analyzed for deeper interpretation. We next evaluated the questions that were ranked highest and lowest for each factor array. This provides an early indication of information priorities, and allows us to begin crafting a picture of how participants in each factor group tend to think about the value of each category of information.
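   As a rough illustration of how a factor array can be computed, the sketch below weights each defining participant's sort by their factor loading and averages them. Brown's weighting formula w = f/(1 - f²) is a common Q-methodology convention and is an assumption here; the paper does not specify the exact weighting applied by Ken-Q.

# Sketch: building a "factor array" (Section 3.3) for one factor group.
import numpy as np

def factor_array(sorts: np.ndarray, loadings: np.ndarray) -> np.ndarray:
    """sorts: (k participants x 36 questions); loadings: (k,) on this factor."""
    # Assumed convention: Brown's weighting, so higher loaders count for more.
    w = loadings / (1.0 - loadings**2)
    composite = np.average(sorts, axis=0, weights=w)   # weighted average per question
    # Convert to z-scores; the ranked z-scores can then be mapped back onto
    # the -5..+5 grid to give the group's exemplar arrangement.
    return (composite - composite.mean()) / composite.std()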
3.4 Factor Groups
Here we report the findings from the factor analysis. To do this we describe each factor group's arrangement of the questions in terms of its highest- and lowest-ranked questions, as well as its positive and negative distinguishing questions. Distinguishing questions are those where the absolute difference between factor z-scores is larger than the standard error of differences for a given pair of factors. All distinguishing questions are significant at p < .01.
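   This test can be sketched as follows, assuming the standard error of differences for a pair of factors is combined from the per-factor standard errors reported in Table 1; the 2.58 multiplier is an assumption inferred from the stated p < .01 significance level.

# Sketch: flagging distinguishing questions between two factor arrays (Section 3.4).
import numpy as np

def distinguishing(z_a, z_b, se_a, se_b, crit=2.58):
    """z_a, z_b: (36,) factor z-scores; se_a, se_b: S.E. of each factor's
    z-scores (e.g., 0.184 and 0.2 for Factors 1 and 2 in Table 1)."""
    se_diff = np.sqrt(se_a**2 + se_b**2)   # standard error of differences
    return np.abs(np.asarray(z_a) - np.asarray(z_b)) > crit * se_diff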
   Factor Group One was defined by eight participants and explained 22% of the study variance with an eigenvalue of 6.7. Three of the factor-loading participants were females and five were males, with an average age of 37.5 years. Knowledge of recommender systems was split between five novices and three experts.
   The highest ranked question of this factor group was "Why is this recommendation the best option?" (+5). The lowest ranked was "Is there anyone in my social network that has received a similar recommendation?" (-5). Other positive distinguishing questions for the factor one group were (in descending order): "What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?" (+4), "Precisely what information about me does the system know?", and "What does the system think is MY level of 'acceptable risk'?" (+1). Negative distinguishing questions for Factor Group One were (in ascending order): "How much data was used to train this system?" (-4), "How many other people have received this recommendation from this system?" (-2), and "What does the system think I want to achieve?" (-1).
   Factor Group Two was defined by five participants and explained 13% of the study variance with an eigenvalue of 2.8.
All of the factor-loading participants were males, with an average age of 42 years. All but one member of this factor group were considered experts in recommender systems. The highest ranked question of this factor group was "What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?" (+5). The lowest ranked was "Was this recommendation made specifically for ME (based on my profile/interests), or was it made based on something else (based on some other model, such as corporate profit, or my friend's interests, etc.)?" (-5). Positive distinguishing questions for the factor two group were (in descending order): "How is this data weighted, or what data does the system prioritize?" (+4), "How much data was used to train this system?" (+2), and "Is my data uniquely different from the data on which the system has been trained?" (+1). Negative distinguishing questions for the factor two group were (in ascending order): "Is there anyone in my social network that has received a similar recommendation?" (-4), "What does the system think is MY level of 'acceptable risk'?" (-2), "What if I decline? How will that decision be used in future recommendations by this system?" (-1), and "How is my information measured and weighted in this recommendation?" (-1).
   Factor Group Three was defined by five participants and explained 9% of the study variance with an eigenvalue of 1.9. Two of the factor-loading participants were females and three were males, with an average age of 34 years. All but one of the participants in this group were considered experts in recommender systems.
   The highest ranked question of this factor group was "Under what circumstances has this system been wrong in the past?" (+5). The lowest ranked was "What if I decline? How will that decision be used in future recommendations by this system?" (-5). Other positive distinguishing questions for the factor three group were (in descending order): "What data does the system depend on in order to work properly, and do we know if those dependencies are functioning properly?" (+4), "Is my data uniquely different from the data on which the system has been trained?" (+3), and "What have other people like me done in response to this recommendation?" (+2). Negative distinguishing questions for the factor three group were (in ascending order): "What is the system's level of confidence in this recommendation?" (-2), "Are there any other options not presented here?" (-2), "How much data was used to train this system?" (-1), and "How does the system consider risk, and what is its level of 'acceptable risk'?" (-1).
   Factor Group Four was defined by three participants and explained 8% of the study variance with an eigenvalue of 1.7. There were two males and one female, with an average age of 20 years. Knowledge of recommender systems was split between two novices and one expert.
   The highest ranked question of this factor group was "What is the history of the reliability of this system?" (+5). The lowest ranked was "What does the system THINK I want to achieve? (How does the system represent my priorities and goals?)" (-5). Positive distinguishing questions for the factor four group were (in descending order): "How many other people have accepted or rejected this recommendation from this system? (What is the ratio of approve to disapprove?)" (+4), "Is the system working with solid data, or is the system inferring or making assumptions on 'fuzzy' information?" (+3), and "How many other people have received this recommendation from this system?" (+1). Negative distinguishing questions for the factor four group were (in ascending order): "Is my data uniquely different from the data on which the system has been trained?" (-3), "What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?" (-2), and "What have other people like me done in response to this recommendation?" (-1).

Table 2: Highest and lowest ranking questions of each factor group. Although this is only the most superficial analysis, distinguishing differences amongst groups begin to emerge as we analyze each group's prioritization and valuation of transparency information.

Relative Rankings of Questions by Factor Group
Factor Group 1
   Highest: Why is this recommendation the best option?
   Lowest: Is there anyone in my social network that has received a similar recommendation?
Factor Group 2
   Highest: What are all of the factors (or indicators) that were considered in this recommendation, and how are they weighted?
   Lowest: Was this recommendation made specifically for me (based on my profile/interests), or something else?
Factor Group 3
   Highest: Under what circumstances has this system been wrong in the past?
   Lowest: What if I decline? How will that decision be used in future recommendations by this system?
Factor Group 4
   Highest: What is the history of the reliability of this system?
   Lowest: What does the system think I want to achieve? (How does the system represent my priorities and goals?)

4 DISCUSSION
Findings from our factor analysis yielded several surprising insights. We begin with a discussion of questions that produced a high degree of either consensus or disagreement amongst factor groups, and then conclude with a discussion of each factor group.
4.1 Analysis of Consensus vs. Disagreement Findings
A common technique for examining these data is to explore questions that created either consensus or a large amount of disagreement in our sample. By examining the variance between all item rankings, we can explore which questions were generally agreed upon (i.e., consensus) and which items produced large disagreement. For instance, all participants ranked "How clean or accurate is the data used in making this recommendation?" as either 0 or -1, indicating that this question was only moderately valuable to them in the context of a clinical decision support system. This is potentially valuable information for designers to consider, given that the fuzziness of data is sometimes displayed to users as a method of enhancing system transparency [48]. Given these findings, it may be useful to reconsider displaying information about the qualities of data to users in favor of other types of information deemed more useful or valuable.
   Similarly, we can learn much from these data by evaluating questions that produced a great deal of disagreement between factor groups. For instance, the question "Was this recommendation made specifically for ME (based on my profile/interests), or was it made based on something else (based on some other model, such as corporate profit, or my friend's interests, etc.)?" had the largest variance, with factor groups one and three assigning it a positive value (4 and 3), and factor groups two and four assigning it a negative value (-5 and -4). Interestingly, factor group two ranked this question as the least valuable or important question of their q-set, while factor group one ranked it as their second most valuable or important question.
   Interpreting these findings can, at first glance, appear confounding to a designer looking for clear guidance. Clearly, some individuals would prefer to have information that could indicate how they, as a user, are modeled and considered (if at all) in system-generated recommendations as a means of improving their trust, while others clearly discount the value of this kind of information. These findings suggest that social influence information, such as what other users are doing in response to recommendations, may at times be valuable to some users in helping determine whether or not to accept or reject a recommendation.
   Two other questions also produced wide disagreement across factor groups. "How many other people have accepted or rejected this recommendation from this system? (What is the ratio of approve to disapprove?)" and "Is there anyone in my social network that has received a similar recommendation?" were ranked near the poles by different factor groups. This indicates that social media-related information in highly-critical contexts, while not important to some, is still considered valuable by other users, who may find it an important component for enhancing their understanding of and trust in system-generated recommendations.

Table 3: Consensus questions are those which all participants agreed were of relevant importance, as indicated by low Z-score variance in their arrangements. Disagreement questions are those which polarized opinion, as indicated by high Z-score variance in their arrangements.

Consensus (Z-score variance):
- Can I influence the system by providing feedback? Will it listen and consider my input? (0.024)
- How clean or accurate is the data used in making this recommendation? (0.029)
- How often is the system checked to make sure it is functioning as it was designed (i.e., for model accuracy)? (0.046)

Disagreement (Z-score variance):
- How many other people have accepted or rejected this recommendation from this system? (What is the ratio of approve to disapprove?) (1.179)
- Is there anyone in my social network that has received a similar recommendation? (1.246)
- Was this recommendation made specifically for ME, or was it based on something else? (2.261)
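   The consensus/disagreement computation in Table 3 can be sketched as follows, assuming the per-question z-scores of the four factor arrays are available; the 4 x 36 array below is a hypothetical placeholder (Table 3 reports the study's actual variances).

# Sketch of the consensus/disagreement analysis in Section 4.1.
import numpy as np

rng = np.random.default_rng(1)
factor_arrays = rng.normal(size=(4, 36))   # placeholder z-scores, 4 factors x 36 questions

# Variance of each question's z-scores across the four factor arrays:
# low variance -> consensus; high variance -> disagreement.
z_variance = factor_arrays.var(axis=0)

consensus = np.argsort(z_variance)[:3]          # most agreed-upon questions
disagreement = np.argsort(z_variance)[-3:][::-1]  # most polarizing questions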
5 CONCLUSION
We have illustrated our five-factor model of information categories that can be used to increase the transparency of recommender systems to end users. We developed a bank of 36 questions representing information-gathering strategies that users could employ to interrogate system-generated recommendations in an effort to understand the system's reasoning and decide whether to accept or reject the recommendation. Using this bank of questions, participants sorted the questions according to those they found most valuable or useful in helping them determine whether to accept or reject a computer-generated recommendation. We analyzed how participants arranged these questions using a factor-analytic technique. Our findings support other studies in finding that transparency is a multi-dimensional construct, and that achieving it depends on multiple variables, including, to some extent, the user's preferences for and valuation of certain categories of information. Our findings are intended to inform the future interface design of recommender systems, as well as to broaden the discussion of the importance of building systems whose outputs and recommendations are easily understood by their users.

REFERENCES
 [1] F. Doshi-Velez and B. Kim, "Towards A Rigorous Science of Interpretable Machine Learning," arXiv, 2017.
 [2] B. Buchanan and E. Shortliffe, Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley, 1984.
 [3] L. R. Ye and P. E. Johnson, "The Impact of Explanation Facilities on User Acceptance of Expert Systems Advice," MIS Quarterly, vol. 19, no. 2, p. 157, Jun. 1995.
 [4] J. L. Herlocker, J. A. Konstan, and J. Riedl, "Explaining collaborative filtering recommendations," presented at the 2000 ACM conference, New York, New York, USA, 2000, pp. 241-250.
 [5] T. Kulesza, M. Burnett, W.-K. Wong, and S. Stumpf, "Principles of Explanatory Debugging to Personalize Interactive Machine Learning," presented at the 20th International Conference, New York, New York, USA, 2015, pp. 126-137.
 [6] M. T. Dzindolet, S. A. Peterson, R. A. Pomranky, L. G. Pierce, and H. P. Beck, "The role of trust in automation reliance," International Journal of Human-Computer Studies, vol. 58, no. 6, pp. 697-718, Jun. 2003.
 [7] B. Lorenz, F. Di Nocera, and R. Parasuraman, "Display Integration Enhances Information Sampling and Decision Making in Automated Fault Management in a Simulated Spaceflight Micro-World," Proceedings of the Human Factors and Ergonomics Society Annual Meeting, pp. 31-35, 2002.
 [8] T. Miller, "Explanation in Artificial Intelligence: Insights from the Social Sciences," arXiv, pp. 1-57, Jun. 2017.
 [9] G. B. Duggan, S. Banbury, A. Howes, J. Patrick, and S. M. Waldron, "Too Much, Too Little, or Just Right: Designing Data Fusion for Situation Awareness," Proceedings of the Human Factors and Ergonomics Society Annual Meeting, pp. 528-532, 2004.
[10] K. Swearingen and R. Sinha, "Beyond algorithms: An HCI perspective on recommender systems," ACM SIGIR 2001 Workshop on Recommender Systems, 2001.
[11] H. F. Neyedli, J. G. Hollands, and G. A. Jamieson, "Beyond Identity: Incorporating System Reliability Information Into an Automated Combat Identification System," Human Factors, vol. 53, no. 4, pp. 338-355, Jul. 2011.
[12] W. Stephenson, The Study of Behavior: Q-Technique and Its Methodology. Chicago, IL: University of Chicago Press, 1953.
[13] S. R. Brown, "A primer on Q methodology," Operant Subjectivity, vol. 16, no. 3/4, pp. 91-138, 1993.
[14] S. Watts and P. Stenner, "Doing Q Methodology: theory, method and interpretation," Qualitative Research in Psychology, vol. 2, no. 1, pp. 67-91, Jan. 2005.
[15] K. O'Leary, J. O. Wobbrock, and E. A. Riskin, "Q-methodology as a research and design tool for HCI," presented at CHI 2013, Paris, France, 2013, pp. 1941-1950.
[16] E. S. Vorm, "Assessing Demand for Transparency in Intelligent Systems Using Machine Learning," presented at the IEEE Innovations in Intelligent Systems and Applications (INISTA), Thessaloniki, 2018, pp. 41-48.
[17] W. B. Rouse and N. M. Morris, "On looking into the black box: Prospects and limits in the search for mental models," Psychological Bulletin, vol. 100, no. 3, pp. 349-363, 1986.
[18] N. B. Sarter and D. D. Woods, "How in the World Did We Ever Get into That Mode? Mode Error and Awareness in Supervisory Control," Human Factors, vol. 37, no. 1, pp. 5-19, 1995.
[19] A. F. Zeller, "Accidents and Safety," in Systems Psychology, K. B. DeGreene, Ed. New York, NY, 1970, pp. 131-150.
[20] National Transportation Safety Board, "Loss of Control on Approach, Colgan Air, Inc., Operating as Continental Connection Flight 3407, Bombardier DHC-8-400, N200WQ, Clarence Center, New York, February 12, 2009," NTSB/AAR-10/01 PB2010-910401, Feb. 2010.
[21] G. G. Sadler, H. Battiste, N. Ho, L. C. Hoffmann, W. Johnson, R. Shively, J. B. Lyons, and D. Smith, "Effects of transparency on pilot trust and agreement in the autonomous constrained flight planner," presented at the 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), 2016, pp. 1-9.
[22] A. Sebok and C. D. Wickens, "Implementing Lumberjacks and Black Swans Into Model-Based Tools to Support Human-Automation Interaction," Human Factors, vol. 59, no. 2, pp. 189-203, Mar. 2017.
[23] J. Y. C. Chen, K. Procci, M. Boyce, J. L. Wright, A. Garcia, and M. J. Barnes, "Situation Awareness-Based Transparency," ARL-TR-6905, Apr. 2014.
[24] K. L. Mosier and L. J. Skitka, "Human Decision Makers and Automated Decision Aids: Made for Each Other?," in Automation and Human Performance: Theory and Applications, R. Parasuraman and M. Mouloua, Eds. NJ: Lawrence Erlbaum, 1996, pp. 201-220.
[25] C. W. Fisher and B. R. Kingma, "Criticality of data quality as exemplified in two disasters," Information and Management, vol. 39, pp. 109-116, 2001.
[26] J. Zhou, M. A. Khawaja, Z. Li, J. Sun, Y. Wang, and F. Chen, "Making machine learning useable by revealing internal states update - a transparent approach," International Journal of Computational Science and Engineering, vol. 13, no. 4, pp. 378-389, 2016.
[27] T. Muhlbacher, H. Piringer, S. Gratzl, M. Sedlmair, and M. Streit, "Opening the Black Box: Strategies for Increased User Involvement in Existing Algorithm Implementations," IEEE Trans. Visual. Comput. Graphics, vol. 20, no. 12, pp. 1643-1652.
[28] J. Bae, E. Ventocilla, M. Riveiro, T. Helldin, and G. Falkman, "Evaluating Multi-attributes on Cause and Effect Relationship Visualization," presented at the International Conference on Information Visualization Theory and Applications, 2017, pp. 64-74.
[29] V. Bellotti and K. Edwards, "Intelligibility and Accountability: Human Considerations in Context-Aware Systems," Human-Computer Interaction, vol. 16, no. 2, pp. 193-212, 2001.
[30] B. Y. Lim and A. K. Dey, "Assessing demand for intelligibility in context-aware applications," presented at the 11th international conference, New York, New York, USA, 2009, p. 195.
[31] A. S. Clare, M. L. Cummings, and N. P. Repenning, "Influencing Trust for Human-Automation Collaborative Scheduling of Multiple Unmanned Vehicles," Human Factors, vol. 57, no. 7, pp. 1208-1218, Oct. 2015.
[32] E. Shearer and J. Gottfried, "News Use Across Social Media Platforms 2017," Pew Research Center, Sep. 2017.
[33] Adobe Inc., "Digital Intelligence Briefing: 2018 Digital Trends," Adobe Inc., Feb. 2018.
[34] A. Tversky and D. Kahneman, "Judgment under Uncertainty: Heuristics and Biases," Science, vol. 185, no. 4157, pp. 1124-1131, Sep. 1974.
[35] S. Fiske and S. Taylor, Social Cognition. Reading, MA: Addison-Wesley, 1991.
[36] J. B. Horrigan, "How People Approach Facts and Information," Pew Research Center, Aug. 2017.
[37] L. E. Blume and D. Easley, "Rationality," in The New Palgrave Dictionary of Economics, S. Durlauf and L. E. Blume, Eds., 2008.
[38] J. Preece, H. Sharp, and Y. Rogers, Interaction Design: Beyond Human-Computer Interaction, 4th ed. Wiley, 2015.
[39] B. Schwartz, The Paradox of Choice: Why More Is Less. Harper Perennial, 2004.
[40] Rose, "Human-Centered Design Meets Cognitive Load Theory: Designing Interfaces that Help People Think," pp. 1-10, Oct. 2006.
[41] M. Zahabi, D. B. Kaber, and M. Swangnetr, "Usability and Safety in Electronic Medical Records Interface Design: A Review of Recent Literature and Guideline Formulation," Human Factors, vol. 57, no. 5, pp. 805-834, Aug. 2015.
[42] S. Gregor and I. Benbasat, "Explanations from Intelligent Systems: Theoretical Foundations and Implications for Practice," MIS Quarterly, vol. 23, no. 4, p. 497, Dec. 1999.
[43] D. L. McGuinness, A. Glass, M. Wolverton, and P. P. Da Silva, "A Categorization of Explanation Questions for Task Processing Systems," presented at the AAAI Workshop on Explanation-Aware Computing (ExaCt), 2007.
[44] A. Ram, AQUA: Questions that Drive the Explanation Process. Lawrence Erlbaum, 1993.
[45] M. S. Silveira, C. S. de Souza, and S. D. J. Barbosa, "Semiotic engineering contributions for designing online help systems," presented at the 19th annual international conference, New York, New York, USA, 2001, p. 31.
[46] S. Banasick, "Ken-Q Analysis."
[47] J. K. Ford, R. C. MacCallum, and M. Tait, "The Application of Exploratory Factor Analysis in Applied Psychology: A critical review and analysis," Personnel Psychology, vol. 39, no. 2, pp. 291-314, Jun. 1986.
[48] W. Yuji, "The Trust Value Calculating for Social Network Based on Machine Learning," presented at the 2017 9th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2017, pp. 133-136.
[49] R. S. Amant and R. M. Young, "Interface Agents in Model World Environments," AI Magazine, vol. 22, no. 4, p. 95, Dec. 2001.
[50] J. Devore, Probability and Statistics for Engineering and the Sciences, 4th ed. Brooks/Cole, 1995.