<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>On the Perception of Difficulty: Differences between Humans and AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philipp Spitzer</string-name>
          <email>Philipp.Spitzer@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joshua Holstein</string-name>
          <email>Joshua.Holstein@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Vössing</string-name>
          <email>Michael.Voessing@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niklas Kühl</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Artificial Intelligence, Human-AI Interaction, Confidence Estimation, Instance Dificulty</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Karlsruhe Institute of Technology</institution>
          ,
          <addr-line>Kaiserstraße 89-93, Karlsruhe, 76133</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bayreuth</institution>
          ,
          <addr-line>Wittelsbacherring 10, Bayreuth, 95444</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ing Automation Experiences</institution>
          ,
          <addr-line>CHI '23</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>With the increased adoption of artificial intelligence (AI) in industry and society, effective human-AI interaction systems are becoming increasingly important. A central challenge in the interaction of humans with AI is the estimation of difficulty for human and AI agents for single task instances. These estimations are crucial to evaluate each agent's capabilities and, thus, are required to facilitate effective collaboration. So far, research in the field of human-AI interaction has estimated the perceived difficulty of humans and AI independently from each other. However, the effective interaction of human and AI agents depends on metrics that accurately reflect each agent's perceived difficulty in achieving valuable outcomes. Research to date has not yet adequately examined the differences in the perceived difficulty of humans and AI. Thus, this work reviews recent research on perceived difficulty in human-AI interaction and the factors required to compare each agent's perceived difficulty consistently, e.g., by creating the same prerequisites for both agents. Furthermore, we present an experimental design to thoroughly examine the perceived difficulty of both agents and contribute to a better understanding of the design of such systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Human-AI Interaction</kwd>
        <kwd>Confidence Estimation</kwd>
        <kwd>Instance Difficulty</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. Introduction
In recent decades, technological advances have led to
artificial intelligence (AI) applications becoming part of
our everyday lives, e.g., when learning a new language
[1] or driving autonomous cars [2]. Like many other
methods and terms, most prominently uncertainty,
conifdence , performance (e.g., in [19]), for measuring the
dificulty of human and AI agents, which is why we aim
to delimit our research in the following and create a
shared understanding of the relevant terms. Before
diving into the frequently used methods, we elaborate on the
examples of human-AI interaction, it comes down to ap- commonly used terms to describe the dificulty.
Perforpropriately assessing the dificulty of diferent situations
for each agent (human and AI). The consequences for
incorrect estimates can range from rejecting such sys- stance [10, 19].
mance represents the aggregated accuracy over multiple
instances for a task or over multiple agents for an
intems, e.g., when the human learner is given too dificult
words or grammar without being ready, to potentially
severe consequences, e.g., autonomously driving cars on
a foggy night. Consequently, it is necessary to estimate
each agent’s dificulty for an instance adequately.</p>
    </sec>
    <sec id="sec-2">
      <title>Further examples of human-AI interaction that draw from an estimation of instance dificulty are</title>
      <sec id="sec-2-1">
        <title>AI complementarity [3–11], curriculum learning [12–14],</title>
        <p>
          and machine teaching [
          <xref ref-type="bibr" rid="ref8">12, 13, 15–18</xref>
          ]. Accurately
assessing the dificulty of single instances for both human and
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>AI interaction to fully exploit their complementary capa</title>
      <p>bilities while creating pleasant automation experiences.</p>
    </sec>
    <sec id="sec-4">
      <title>By reviewing related literature, we observe diferent</title>
      <p>∗Corresponding author.
†These authors contributed equally.
nEvelop-O
of human and AI agents is assessed diferently. First, a
potential issue arises from an existing gap in access to
relevant information. Usually, the AI agent is trained and
having information on the label distribution. However,
AI agents is central to developing these forms of human- of a given task [21]. For the perceived dificulty , one must
this is often not the case for humans (e.g., in [ 10, 22, 23]). advancing [27, 28]. Hereby, various forms of human-AI
Therefore, it remains unclear whether and how, amongst interaction rely on estimating an instance’s dificulty for
others, this afects humans’ perception of dificulty. Sec- efective collaboration. Following, we outline the three
ond, the dificulty of single instances is assessed difer- forms of human-AI interaction most relevant to our
reently. For AI agents, the distribution of the softmax out- search: human-AI complementarity, curriculum learning,
puts is often used to determine its uncertainty [ 10, 22]. and machine teaching.</p>
      <p>
        Contrarily, the human’s perceived instance dificulty is In the field of human-AI complementarity, recent
often measured by observing the distribution of predic- research studies complementary team performance—
tions over groups of humans for single instances or by exceeding the performance each agent (human or AI)
their average performance for an instance [19, 23]. Con- can achieve on their own [
        <xref ref-type="bibr" rid="ref4">3, 5</xref>
        ]. In this collaboration, it
sequently, individual skills and capabilities of humans are is crucial to properly delegate tasks to each agent to
exneglected, potentially resulting in poor experiences in ploit their complementary capabilities [9]. Steyvers et al.
human-AI interaction settings [24]. As related literature [10] establish a framework to facilitate both human and
shows, humans have distinct cognitive styles which can AI agents’ confidence scores to investigate factors that
afect their perceived dificulty [ 25]. Hence, neglecting influence complementary capabilities of human-AI
coltheir individual traits and generalizing their predictions laborations. Lai et al. [11] suggest using uncertainty as a
to determine the perceived dificulty can result in poor measure to delegate tasks between human and AI agents.
estimation for individuals. In the work of Fügener et al. [29], the authors evaluate
      </p>
      <p>As we observe inconsistencies in the measurement of diferent delegation strategies based on the performance
the perceived dificulty between human and AI agents, of both agents for single instances. They find that
huwe outline existing metrics to measure their perceived dif- mans’ perception of task dificulty difers from the actual
ifculty as a first step. Moreover, we scrutinize methods to task dificulty. Lubars and Tan [6] investigate, amongst
compare both agents adequately. Based upon this, we are others, the efect of the dificulty of single instances to
interested in adequately examining the diference in the delegate tasks.
perceived dificulty between humans and AI. Therefore, Curriculum learning denotes another form of
humanwe state the following research question: AI interaction in which the perceived dificulty is relevant
to the overall process. This form of learning is based on
RQ: What are the diferences in the perceived dificulty of human learning and incorporates the idea that the order
humans and AI for single instances? is crucial in which training instances are presented to
a learner [12]. A central aspect of curriculum learning</p>
      <p>To answer this research question, we conduct a liter- is the assertion of dificulty levels of single instances.
ature review to evaluate existing research fields relying Wei et al. [13] use the annotator agreement in an image
on an accurate measurement of the perceived instance classification task to determine the dificulty of instances.
dificulty. Furthermore, we present an experimental de- In the field of machine teaching, a human or an AI agent
sign that avoids the previously mentioned inconsisten- is trained by selecting samples to achieve high learning
cies. Through our experiment, we want to analyze the outcomes [15]. The selection of training instances can be
perceived dificulty of human and AI agents for single grounded on dificulty estimation. For example, Zhang
instances, using established metrics like confidence [ 10] et al. [16] presents an interactive learning procedure in
and PVI [26] adequately. We support our endeavor to which crowd workers are trained based on an
approxestablish adequate methods to consistently measure the imated dificulty for instances. Similarly, Singla et al.
perceived instance dificulty of human and AI agents with [17] select training instances for learners based on an
ifrst empirical results based on an existing, public dataset. expected uncertainty measured by an AI agent.
Overall, with our experiment, we aim to contribute to
a better and more integrated understanding of how to
adequately compare human and AI agents’ perceived dif- 2.2. Measuring Perceived Dificulty of
ifculty leading to a thorough understanding of the design Humans and AI
of human-AI interaction systems.</p>
    </sec>
    <sec id="sec-5">
      <title>AI’s perceived dificulty. In Ståhl et al. [30], the authors</title>
      <p>evaluate diferent metrics to compare the uncertainty of
2. Related Work deep learning models. One of these metrics is a Bayesian
network-based approach using dropout [31]. Further,
2.1. Human-AI Interaction and Instance Xu et al. [32] present a metric that builds on Shannon
Dificulty entropy [33] to compare the dificulty of diferent datasets.</p>
      <p>Moreover, Ethayarajh et al. [26] extend this metric, called
With the latest ascent in research on human-AI inter-  -usable information, to apply it to single instances. This
action, the deployment of AI in automated systems is metric, the pointwise  -information (PVI), is used to
compare the dificulty of single instances with respect with the confidence of human and AI agents available,
to a model family  . According to the authors, PVI, in we compare their performance and confidence for single
contrast to related metrics, quantifies the dificulty of instances.
single instances accounting for how much information Figure 1 illustrates and compares instance performance
can be extracted beyond the label distribution. and confidence for ten randomly sampled instances of</p>
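        <p>For reference, and following the definitions in Ethayarajh et al. [26], PVI contrasts a model g′ from the family 𝒱 that is fine-tuned on input-label pairs with a null model g∅ fine-tuned on the label distribution alone. The pointwise 𝒱-information of an instance (x, y) is then</p>
        <disp-formula id="eq-pvi">
          <tex-math><![CDATA[\mathrm{PVI}(x \rightarrow y) = -\log_2 g_{\varnothing}[y \mid \varnothing] + \log_2 g'[y \mid x],]]></tex-math>
        </disp-formula>
        <p>so that an instance is easy for the model family exactly when its label is much more predictable from the input than from the label distribution alone, i.e., when its PVI is high.</p>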
        <p><bold>Human's perceived difficulty.</bold> Most works focus on estimating the perceived difficulty of humans by aggregating over multiple humans. For example, Peterson et al. [23] assess the disagreement of two decision-makers. In their work, the authors define the difficulty of a single instance by using the disagreement of crowdsourcing annotators. To measure the individual perceived difficulty of instances, Steyvers et al. [10] use a different approach. The authors use the ordinal responses of humans to determine their confidence. Similarly, Bıyık et al. [34] determine human difficulty by asking participants about their perceived task difficulty.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Empirical Validation Using Public Datasets</title>
      <p>Before our experiment, we examine reports of other studies to investigate the differences in the perceived difficulty of single instances. Therefore, we utilize publicly available datasets, e.g., CIFAR10-H [23], modelvshuman [22], or ImageNet-16H [10]. However, the first two datasets, CIFAR10-H and modelvshuman, do not contain individual human confidence or uncertainty measurements. Instead, the authors of the datasets [22, 23] estimate the instance difficulty by aggregating the performance of multiple human annotators for instances. ImageNet-16H is the only dataset containing human difficulty measurements in the form of self-reported confidence levels, e.g., low, medium, and high. To compare these reported confidence levels with the commonly used technique of average instance performance, we transformed the confidence levels to 0 (low), 0.5 (medium), and 1 (high).</p>
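      <p>As a minimal sketch of this aggregation, the following Python snippet maps the self-reported levels to numeric scores and contrasts average instance performance with the confidence spread; the table layout and column names are illustrative assumptions, not the schema of the released dataset.</p>
      <preformat><![CDATA[
import pandas as pd

# Illustrative annotation table: one row per (participant, instance) pair.
# Column names are assumptions, not the released ImageNet-16H schema.
df = pd.DataFrame({
    "instance_id": [1, 1, 2, 2, 3, 3],
    "confidence":  ["low", "high", "medium", "high", "low", "low"],
    "correct":     [0, 1, 1, 1, 0, 1],  # 1 if the annotator was right
})

# Transform the reported confidence levels to 0 (low), 0.5 (medium), 1 (high).
df["conf_score"] = df["confidence"].map({"low": 0.0, "medium": 0.5, "high": 1.0})

# Per instance: average performance vs. mean and spread of human confidence.
per_instance = df.groupby("instance_id").agg(
    mean_performance=("correct", "mean"),
    mean_confidence=("conf_score", "mean"),
    confidence_std=("conf_score", "std"),
)
print(per_instance)
]]></preformat>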
      <p>Further, we fine-tune an EfficientNet model with the dataset for two epochs and use Monte-Carlo Dropout to receive the perceived confidence of the AI agent. Finally, with the confidence of human and AI agents available, we compare their performance and confidence for single instances.</p>
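      <p>A sketch of how such Monte-Carlo Dropout confidence estimates can be obtained is shown below; it assumes a PyTorch classifier with dropout layers (e.g., a fine-tuned EfficientNet) and illustrates the general technique [31], not the authors' exact training setup.</p>
      <preformat><![CDATA[
import torch

def mc_dropout_confidence(model, x, n_samples=30):
    """Estimate predictive confidence via Monte-Carlo Dropout [31]."""
    model.eval()
    # Re-activate dropout layers only, keeping e.g. batch norm in eval mode.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()
    with torch.no_grad():
        # Average the softmax outputs over stochastic forward passes.
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    mean_probs = probs.mean(dim=0)               # predictive distribution
    confidence = mean_probs.max(dim=-1).values   # confidence in the top class
    uncertainty = probs.std(dim=0)               # spread across passes
    return confidence, uncertainty
]]></preformat>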
      <p>Figure 1 illustrates and compares instance performance and confidence for ten randomly sampled instances of the ImageNet-16H dataset. The left part represents the AI agent's output, while the right part shows the human's self-reported confidence. Based on this, we can make several observations. First, task performance is not necessarily a reliable factor to determine the perceived difficulty of an instance. For example, instances seven to nine have the same performance but differ greatly in their reported confidence. Second, human and AI agents can perceive different instances as easy, e.g., the AI agent has low confidence for instances seven and eight, while the humans have medium to high confidence. Third, the human self-reported confidence scores differ among participants, as can be seen from the standard deviation of confidence. We argue that these observations represent first evidence in the direction of our hypotheses. More specifically, we can see that the average performance of an instance cannot be used to determine the perceived difficulty of an instance for individual humans. Instead, other metrics need to be considered.</p>
      <p>Moreover, the high standard deviation of human confidence for almost all instances indicates that humans differ in their perceived difficulty. Consequently, the diversity of humans must be taken into consideration when designing human-AI interaction systems.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Design</title>
      <p>Our experiment is based on a mixed-effects model that combines a between-subject and a within-subject design [35]. We follow the notion of existing works and understand confidence as a proxy for difficulty [36]. More precisely, we measure the difficulty of the human and the AI agent by two metrics: the commonly used confidence [10] and the PVI score [26] as a novel metric that considers the label distribution. We measure the confidence of AI agents by Monte-Carlo Dropout [31] and for humans via probabilities, e.g., using a scale between 0% and 100%. We use a binary classification task to avoid participants having to assign multiple probabilities. The binary classification allows us to observe one probability, e.g., an image showing a cat with a probability of 80%, and calculate the complementary probability, e.g., the complementary probability that the image does not represent a cat is 20%.</p>
      <p>[Figure 2: Preliminary experiment design. Part I: start &amp; consent, attention check. Part II: Task 1, image classification; Task 2, classification on tabular data; the difficulty is measured for each instance.]</p>
      <p>The preliminary experiment design is illustrated in Figure 2. The experiment is composed of three parts. Part I includes consent, instructions, and a demographics questionnaire. Next, Part II comprises two binary classification tasks, one visual and one textual, and, finally, Part III is a questionnaire on cognitive styles. In both tasks, we measure the perceived difficulty of participants and AI for single instances.</p>
      <sec id="sec-5-1">
        <title>Hypothesis 2. There are instances for which human and AI agents make the same prediction but difer in their perceived dificulty.</title>
        <p>Within our experiment, we leverage two datasets for
the tasks of Part II to compare the perceived dificulty
of human and AI agents. Both conditions comprise the
same tasks. We chose two diferent tasks: one visual
classification task and one based on tabular data. Research
shows the impact of diferent cognitive styles on
participants’ task performance (i.e., [25, 37, 38]). By choosing a
visual and a text-based task, we account for participants’
diferent cognitive styles and individual perceptions of
dificulty. Accordingly, participants will be asked to
conduct a questionnaire in which we determine their
cognitive styles.We assess these styles by using the validated
items of Kirby et al. [37] (initially presented by
Richardson [39]). The items of the cognitive style questionnaire
are randomly arranged as suggested by Kirby et al. [37].
All items are measured on a five-point Likert scale. We
hypothesize:</p>
      </sec>
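      <p>To make this hypothesis operational, instances on which both agents agree in their prediction but clearly diverge in confidence could be flagged as follows; the per-instance table and the divergence threshold are illustrative assumptions, not part of our protocol.</p>
      <preformat><![CDATA[
import pandas as pd

# Illustrative per-instance summary of both agents' outputs.
df = pd.DataFrame({
    "instance_id": [1, 2, 3],
    "human_pred":  ["cat", "no cat", "cat"],
    "ai_pred":     ["cat", "no cat", "no cat"],
    "human_conf":  [0.90, 0.55, 0.80],
    "ai_conf":     [0.60, 0.52, 0.95],
})

same_prediction = df["human_pred"] == df["ai_pred"]
confidence_gap = (df["human_conf"] - df["ai_conf"]).abs() > 0.25  # assumed threshold

# Candidate instances for Hypothesis 2: same prediction, diverging difficulty.
print(df[same_prediction & confidence_gap])
]]></preformat>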
      <sec id="sec-5-2">
        <title>Hypothesis 3. Humans with distinct cognitive styles perceive the dificulty of single instances diferently.</title>
      </sec>
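      <p>As a sketch of how the resulting data could be analyzed under the stated mixed-effects design [35], one option is a random-intercept model per participant, e.g., via statsmodels; the formula and column names below are assumptions for illustration, not our final model specification.</p>
      <preformat><![CDATA[
import pandas as pd
import statsmodels.formula.api as smf

# Long-format results assumed: one row per participant-instance response,
# with the between-subject treatment (label distribution shown or not),
# the within-subject task type, and a cognitive style score.
data = pd.read_csv("responses.csv")  # hypothetical file

# Confidence as a proxy for perceived difficulty; a random intercept per
# participant captures individual differences (cf. Hypotheses 1-3).
model = smf.mixedlm(
    "confidence ~ treatment + task_type + cognitive_style",
    data=data,
    groups=data["participant_id"],
)
print(model.fit().summary())
]]></preformat>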
    </sec>
    <sec id="sec-6">
      <title>The preliminary experiment design is illustrated in</title>
      <p>Figure 2. The experiment is composed of three parts.</p>
      <p>Part I includes consent, instructions, and a demographics
questionnaire. Next, Part II comprises two binary classi- 5. Discussion
ifcation tasks—one visual and one textual—and, finally,
Part III is a questionnaire on cognitive styles. In both In this work, we propose an experimental design to
intasks, we measure the perceived dificulty of participants vestigate the diference in perceived dificulty between
and AI for single instances. human and AI agents for single instances. To build a</p>
      <p>In our experiment, we have two treatments. First, foundation, we assess related work and common
metas we want a consistent comparison of the perceived rics to estimate instance dificulty. Yet, these studies
dificulty between humans and AI, we must ensure they insuficiently scrutinize consistent dificulty estimations
have access to the relevant information. However, in between humans and AI. By first examining a related
contrast to humans, the AI agent has access to the label dataset, we show the discrepancies in dificulty
estimadistribution through its training prior to the task. As we tion by applying conventional approaches. Thus, we
want to examine this efect, we show humans the label propose an experiment design that paves the way for a
distribution before conducting the task in one condition. broad main study in which we: (I) Develop a consistent
Thus, we hypothesize: way to measure the perceived dificulty of instances, (II)
Examine the diferences in the perceived dificulty of
huHypothesis 1. Access to the information on label distri- man and AI agents, (III) Investigate a potential cause in
bution has an impact on humans’ perceived dificulty of varying perceived dificulty of humans.
single instances. Through our main study, we expect to contribute to</p>
      <p>After providing a consistent way to measure the the ongoing discussion on developing automated and
confidence—as a proxy for the perceived dificulty—of the reliable AI agents interacting with humans with diverse
human and the AI agent, we want to examine the difer- skills and capabilities. Moreover, our results will provide
ences in their perceived dificulty of instances. Previous guidance not only in research but also in practice on
research identified subsets of data on which either human designing human-AI interaction systems. A promising
or AI agent has a better performance, e.g., [22]. As the field of research lies ahead.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref26">
        <label>26</label>
        <mixed-citation>K. Ethayarajh, Y. Choi, S. Swayamdipta, Understanding dataset difficulty with 𝒱-usable information, in: International Conference on Machine Learning, PMLR, 2022, pp. 5988–6008.</mixed-citation>
      </ref>
      <ref id="ref27">
        <label>27</label>
        <mixed-citation>E. Barboni, J.-F. Ladry, D. Navarre, P. Palanque, […] systems models, in: Proceedings of the 2nd ACM SIGCHI Symposium on Engineering Interactive Computing Systems, 2010, pp. 165–174.</mixed-citation>
      </ref>
      <ref id="ref28">
        <label>28</label>
        <mixed-citation>V. Roto, P. Palanque, H. Karvonen, Engaging automation: 5th IFIP WG 13.6 Working Conference, HWID 2018, Espoo, Finland, August 20–21, 2018, Revised Selected Papers 5, Springer, 2019, pp. 158–172.</mixed-citation>
      </ref>
      <ref id="ref29">
        <label>29</label>
        <mixed-citation>A. Fügener, J. Grahl, A. Gupta, W. Ketter, Collabo[…], 2019.</mixed-citation>
      </ref>
      <ref id="ref30">
        <label>30</label>
        <mixed-citation>N. Ståhl, G. Falkman, A. Karlsson, G. Mathiason, […], in: Information Processing and Management of Uncertainty in Knowledge-Based Systems: 18th International Conference, IPMU 2020, Lisbon, Portugal, June 15–19, 2020, Proceedings, Part I 18, Springer, 2020, pp. 556–568.</mixed-citation>
      </ref>
      <ref id="ref31">
        <label>31</label>
        <mixed-citation>Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.</mixed-citation>
      </ref>
      <ref id="ref32">
        <label>32</label>
        <mixed-citation>Y. Xu, S. Zhao, J. Song, R. Stewart, S. Ermon, A theory of usable information under computational constraints, arXiv preprint arXiv:2002.10689 (2020).</mixed-citation>
      </ref>
      <ref id="ref33">
        <label>33</label>
        <mixed-citation>C. E. Shannon, A mathematical theory of communication, ACM SIGMOBILE Mobile Computing and Communications Review 5 (2001) 3–55.</mixed-citation>
      </ref>
      <ref id="ref34">
        <label>34</label>
        <mixed-citation>E. Bıyık, M. Palan, N. C. Landolfi, D. P. Losey, D. Sadigh, Asking easy questions: A user-friendly approach to active reward learning, arXiv preprint arXiv:1910.04365 (2019).</mixed-citation>
      </ref>
      <ref id="ref35">
        <label>35</label>
        <mixed-citation>L. Riefle, C. Benz, T. Tomar, "May I help you?": Ex[…] use of conversational agents, ICIS 2022 Proceedings (2022).</mixed-citation>
      </ref>
      <ref id="ref36">
        <label>36</label>
        <mixed-citation>B. Kompa, J. Snoek, A. L. Beam, Second opinion needed: communicating uncertainty in medical machine learning, NPJ Digital Medicine 4 (2021) 4.</mixed-citation>
      </ref>
      <ref id="ref37">
        <label>37</label>
        <mixed-citation>J. R. Kirby, P. J. Moore, N. J. Schofield, Verbal and visual learning styles, Contemporary Educational Psychology 13 (1988) 169–184.</mixed-citation>
      </ref>
      <ref id="ref38">
        <label>38</label>
        <mixed-citation>L. Riefle, P. Hemmer, C. Benz, M. Vössing, J. Pries, […] understanding of explanations, ICIS 2022 Proceedings (2022).</mixed-citation>
      </ref>
      <ref id="ref39">
        <label>39</label>
        <mixed-citation>A. Richardson, Verbalizer-visualizer: A cognitive style dimension, Journal of Mental Imagery (1977).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>