
A Comparison of Human and Machine Learning Errors in Face Recognition

Marina Estévez-Almenzar

Ricardo Baeza-Yates

Carlos Castillo

ICREA, Barcelona, Spain; Universitat Pompeu Fabra, Barcelona, Spain

2025

Machine learning applications in high-stakes scenarios should always operate under human oversight. Developing an optimal combination of human and machine intelligence requires an understanding of their complementarities, particularly regarding the similarities and differences in the way they make mistakes. We perform extensive experiments in the area of face recognition and compare two automated face recognition systems against human annotators through a demographically balanced user study. Our research uncovers important ways in which machine learning errors and human errors differ from each other, and suggests potential strategies in which human-machine collaboration can improve accuracy in face recognition.

Human factors, User studies, Human-centered computing

1. Introduction

Decision support systems powered by machine learning (ML) are increasingly used in high-stakes scenarios including immigration, healthcare, justice, and access to labor and education, among many others. In these application domains, ML systems should not be autonomous, but rely on a human operator or expert, who should be responsible for the final decision. An in-depth understanding of the dynamics of human-algorithm interactions is crucial for developing safe, trustworthy systems [1]. Understanding the complementarities between human and machine intelligence is crucial. In an “ideal” scenario, there is perfect complementarity: cases challenging for the ML system are easily handled by the human operator. Conversely, the worst case is when there is total overlap: cases that are difficult or uncertain for the ML system also lead to human errors. In practice, we may find applications that are somewhere between these extremes, as our empirical findings demonstrate within the context of face recognition.

Among other areas that need exploration, little has been done to understand when human and machine errors are similar and when they are different. Analyzing these similarities and differences is particularly important because the presence of algorithmic errors influences well-known patterns of human-machine interaction, such as algorithmic aversion [2, 3], a biased and overly negative human evaluation of an algorithm. The question arises as to whether this aversion also varies depending on whether or not the errors presented by the model resemble those that a human agent might make. If we are able to avoid this type of bias, there is another risk: automation bias, an over-reliance on automated decision support mechanisms [4]. In this case we can ask an analogous question: Does a high similarity between human and machine errors influence the human agent’s ability to judge the accuracy of the model?

The main goal of this research is to compare ML errors and human errors. The results obtained from this comparative study serve as inputs for the development of straightforward yet impactful strategies to combine human and machine intelligence.

The use of the concept “human error” suggests a homogeneity that is almost non-existent in real life. Human perception varies from individual to individual, either due to variations in physiological structures or external influences such as culture. It becomes even more complex when it comes to assessing human perception in distinguishing between other human identities, such as in the context of a face recognition task. We tackle this complexity through a demographically diverse user study for face matching in which possible inter-individual differences and disagreements are considered. Proposing hybrid human-machine strategies in the field of face recognition is crucial. For instance, in an automated system integrated in a police surveillance scenario, errors deserve special attention. In January 2020, Robert Julian-Borchak Williams became the first documented person in the U.S. to be wrongfully arrested based on a false hit produced by facial recognition technology.1 Many more cases have been documented, and often there are racial biases.2

Our main findings for the face recognition task we study are: (1) humans rarely produce false positives; (2) the ML similarity score is a potential error predictor; and (3) humans find it easier to address mistakes made by an individual model than to address errors shared by two models. These findings provide a method for detecting potential errors in automated facial recognition, and help us find those errors that a human annotator has a high chance of correcting. Applying this approach in a practical setting enables us to develop an effective evaluation strategy that maximizes joint human-machine accuracy while controlling human annotation effort. Unlike other approaches that strictly emphasize the enhancement of accuracy through algorithmic advancements, this work underscores not only the importance of incorporating the human factor in this race for accuracy maximization, but also the effectiveness of this approach.

This work contributes to the broader agenda of developing trustworthy AI systems, which require not only technical robustness and accuracy, but also fairness, transparency, and meaningful human oversight. Understanding when and why human and machine errors overlap is central to assessing the trustworthiness of AI-based decision support: if both human and algorithmic judgments fail in similar cases, this may compromise system reliability and propagate bias. Conversely, identifying complementary strengths between human and machine reasoning enhances both accuracy and fairness, two key dimensions of trustworthy AI. By empirically analyzing patterns of human-AI error alignment in face recognition, our study provides insights into how hybrid decision-making can be designed to strengthen trustworthiness through improved error detection, bias mitigation, and accountability.

The rest of the paper is organized as follows. In §2 we review related work, followed by our research questions and our methodology in §3 and §4, respectively. In §5 we present our results, while in §6 we discuss our results, limitations, and give our conclusions.

2. Related Work

Human and ML performance. While machine learning (ML) systems may surpass human performance in simple facial recognition tasks, they often struggle under complex, real-world conditions. In such challenging scenarios, non-expert humans perform comparably to some algorithms, and experts often outperform them [5]. In particular, humans have been shown to exceed random chance where facial recognition systems failed [6].

Historically, although competitions have framed human and algorithmic facial recognition as a contest rather than a collaboration, they have brought out potential human-AI complementarities. Phillips et al. [7] found that algorithms excelled at simple, static, frontal images, whereas humans were better at interpreting difficult images and videos. Similarly, Rice et al. [6] and White et al. [5] documented situations where humans, especially experts, outperformed algorithms in complex identification tasks.

While AI models outperform humans in data-intensive tasks, human abilities remain essential for contextual reasoning, abstract judgment, and perceptual robustness [8, 9]. The emerging consensus suggests that hybrid approaches may offer the most promising path forward in advancing decision-making and performance across domains.

Combining human and machine intelligence. Researchers have explored how to effectively combine human and machine intelligence, with a particular focus on addressing algorithm aversion, that is, users’ reluctance to trust algorithms after observing errors [10, 11, 3]. This aversion can be mitigated when users retain some control over the prediction process [12, 13].

The deployment of facial recognition systems illustrates the broader need for more than just technical performance. The EU AI Act [14] emphasizes that such technologies should be used only when strictly necessary and in clearly defined scenarios. In line with this, Negri et al. [15] propose a framework for assessing the appropriateness of facial recognition interventions based on contextual needs.

Effective implementation must also consider application context, including user characteristics and demographics. This aligns with research on human-centered machine learning [16], human oversight strategies [17, 18], and mechanisms to preserve the human role in decision-making, especially in rights-sensitive domains [19].

Human factors in decision support. Several studies have focused on incorporating human factors into machine learning systems. Han et al. [20] improved emotion detection by using inter-annotator agreement to align predictions with human judgment. Andrews et al. [21] questioned the adequacy of categorical labels for representing human phenotypes, advocating for more continuous representations. In medical imaging, Makino et al. [22] showed that deep neural networks rely on features not typically used by experts, highlighting the need to integrate domain knowledge when comparing human and AI decisions.

Transparency has also been addressed by Huber et al. [23], who proposed propagating model uncertainties to improve user understanding in facial recognition. Papenmeier et al. [16] found that users are more averse to models that fail on simple tasks, suggesting that the context of errors affects perceived reliability.

However, mimicking human cognition can also introduce biases [24]. The well-documented other-race effect in face recognition [25, 26] has been replicated in algorithms [27].

Recent work also explores hybrid human-AI systems, where the model defers to human judgment under certain conditions, such as low confidence [ 28, 29, 30, 31]. Research has also examined how to structure annotation workflows to improve collaboration [ 32], and how human-AI pairing based on perceived similarity can influence decision-making [ 33 ].

To the best of our knowledge, most efforts to integrate human factors into technology have focused on encoding specific human traits and enhancing model performance, that is, on observing humans and refining models independently. Some more recent efforts have gone further, proposing novel techniques to combine human and machine performance. However, there is still a lack of understanding of the key differences between decision-makers, especially in contexts where the task involves a certain subjectivity. In a scenario where there is no longer only the final decision related to the task at hand, but also the decision as to which agent (human, algorithmic, or a combination of both) should decide, it is important to know the strengths and weaknesses of both agents, which of these are shared and which diverge, and how these differences and similarities can be exploited.

3. Research Questions

The main goal of this work is to compare model errors with human errors in face recognition.

Error consistency. We would like to characterize similarities and differences between human errors and ML system errors. To achieve this, first we need to determine if errors and successes are consistent, i.e., if we can determine which are the subsets of face recognition tasks in which errors and successes are concentrated.

RQ1a Are humans consistent in a face recognition task?

RQ1b Are ML systems consistent in a face recognition task?

Error alignment. We want to uncover whether there are common difficulties between ML systems and human annotators. We expect these common difficulties to manifest as incorrect human annotations on those face recognition tasks where the ML system erred. We also expect to obtain more incorrect annotations in cases where more than one ML system errs. If human annotators and ML systems provide a low-confidence annotation, then we would like to test whether their confidence in annotation aligns.

RQ2a Are human annotators more likely to make a mistake in a face recognition task if an ML system also gives an incorrect answer, compared to tasks for which the system is correct?

RQ2b Are human annotators even more likely to make a mistake if more than one ML system is incorrect?

RQ2c Are human annotators’ perception of similarity and ML computations of similarity correlated?

Error-based human-machine collaboration. We want to investigate if we can develop a strategy to optimise human-machine collaboration in the context of solving a face recognition task. The consistency raised in the first question would allow us to generalise in this context, while the study of the alignment between human and machine error patterns would allow us to detect the key points of complementarity for the development of a successful strategy.

RQ3 Can we design a novel human-computer collaboration strategy based on the results of our comparative study?

4. Experimental Setup and Ethical Considerations

Our experimental setting is based on a number of face recognition tasks that are performed by two automated systems, as well as by human annotators. Given a pair of facial images, the task consists of determining whether they belong to the same individual or not. Both the two automated models and the set of annotators performed this task independently.

4.1. Datasets

Training data. We used two pre-trained face recognition models. Both were trained by their respective authors on MS-Celeb-1M [34]. According to its authors, it was the largest publicly available face recognition dataset in the world. It contains about 10M images of nearly 100K people. MS-Celeb-1M is fairly unbalanced demographically (more than 70% of the images correspond to white people). For this reason, we investigate whether the models pre-trained on MS-Celeb-1M present the other-race effect.

Testing data. We used DemogPairs [24] as the evaluation dataset. It contains 10,800 facial images corresponding to 600 people divided into 6 balanced, demographically labeled folds: { female, male } × { Asian, Black, White }. DemogPairs was created and released by its authors with the explicit objective of being used as a tool to test for demographic biases in face recognition models.

4.2. Models

The two face recognition models used in this work were IR50+ArcFace [ 35 ] and LightCNN [ 36 ], both trained over MS-Celeb-1M by the respective authors. We did not conduct additional training or finetuning for this work. Both pre-trained models can be found online [ 37, 36 ].

IR50+ArcFace is an extension of ResNet50 [34, 38], a residual network that has been extensively applied to many image tasks, with an ArcFace loss function [35]. It reaches an accuracy of 99.78% in LFW [39], 97.53% in AgeDB [40], and 95.22% in VGGFace2 [41], all well-known public benchmarks for pair matching. LightCNN was created to learn a compact embedding on large-scale face data with noisy labels. It has been reported to achieve state-of-the-art results on various face benchmarks without fine-tuning [36]. In this work, we used the 29-layer version, which reaches an accuracy of 99.40% in LFW.

For evaluation, we used face.evoLVe [ 37 ], a face recognition library for face-related analytics and applications. For the purposes of this research, the library was instrumented to keep track of individual errors. The instrumented library is available with our code release.

4.3. Methodology

We performed an online user study [ 42 ], with the following structure.

Participant recruitment. We recruited participants through the crowdsourcing platform Prolific.3 We considered four countries in continental Europe in which Prolific has large user bases (France, Germany, Italy, and Spain), plus the United Kingdom and Turkey. The crowdsourcing platform provides gender information and allows users to self-identify with a “simplified ethnic group” (Asian, Black, and White). We made sure that our sets of participants were gender balanced, and that for each pair of images, one person from each simplified ethnic group participated in its evaluation. So, for every pair of images, we collected 3 annotations. In total, we recruited 235 participants, 2 of whom were excluded from our data due to failed attention checks. For the subsequent analysis, we selected 162 participants based on ethnic self-identification. Participants were paid 0.70 GBP (about 0.82€) to label 10 pairs of images, with an average completion time of 5 minutes, which amounts to 8.4 GBP per hour. Participants were asked about their age, gender identity, and ethnic background in an initial demographic questionnaire.

Face recognition tasks. Participants evaluated one pair of images at a time. The participant had to answer the question “Are they the same person?” with one of five options: No, Probably not, Not sure, Probably yes, or Yes.

Task selection. We found that the joint output of the face recognition models (see §4.4) was correct on more than 93% of the tasks. Hence, due to budget constraints, we annotated all the cases where the models were wrong (“misses”), and a sample of cases in which both models were right (“hits”). We annotated 363 “misses” (237 false negatives and 126 false positives), which were shown to a total of 164 participants, from which we selected a demographically balanced set of 108 participants. We also annotated 180 model “hits”, which were shown to a total of 69 participants, from which we selected a demographically balanced set of 54 participants. This selection of “hits” was a random sample that was demographically balanced for the true positive set (90 pairs) and for the true negative set (90 pairs).

4.4. Measurements

Accuracy. Accuracy is defined as the fraction of correct responses with respect to the ground truth. We distinguish between (1) Machine accuracy: joint accuracy of the models. Given a pair, we calculate the average of the calibrated similarity scores of the two models, and the label is decided based on this average; and (2) Human accuracy: accuracy of the human annotators, as a group of three annotators. This is computed as a macro average, i.e., first the three evaluations on a pair are averaged, and then we determine whether that average is correct or not.
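The two accuracy definitions can be sketched as follows. This is a minimal illustration and not the authors' code: the 0.5 decision threshold on the averaged calibrated score, the numeric coding of the five answer options (0 for No up to 1 for Yes), and counting an average of exactly 0.5 as undecided (hence incorrect) are all assumptions made here.

```python
# Hypothetical numeric coding of the five answer options (an assumption).
ANSWER_SCORE = {"No": 0.0, "Probably not": 0.25, "Not sure": 0.5,
                "Probably yes": 0.75, "Yes": 1.0}


def machine_label(sim_a, sim_b, threshold=0.5):
    """Joint label: average the two calibrated similarity scores, then threshold."""
    return (sim_a + sim_b) / 2 > threshold


def human_label(answers):
    """Group label: average the annotations; None when the mean is exactly 0.5."""
    mean = sum(ANSWER_SCORE[a] for a in answers) / len(answers)
    if mean == 0.5:
        return None  # undecided, never equal to a True/False ground truth
    return mean > 0.5


def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)
```

Under this coding, a group answering Yes, Probably yes, and No averages above 0.5 and is therefore labeled as a positive pair.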

Similarity. This is a measurement of how similar the model or the human annotator perceives the persons in the images. We distinguish between (1) ML similarity score: Given two images, the model computes two embeddings. The normalized distance between these embeddings, $d$, is compared against a threshold $t$ to determine the output (if $d < t$, the pair is labeled as positive, and as negative otherwise). We take the calibration of $1 - d$ as the similarity of the pair, which can be interpreted as a probability. Scores close to 0.5 can be interpreted as low model confidence; and (2) Human perception of similarity: this is inferred from the annotator’s actual answer to the question in the face recognition tasks (see §4.3). From this measurement we can infer human confidence: the answers at the extremes (No and Yes) correspond to the highest confidence, while the answer Not sure corresponds to the lowest confidence.
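A small sketch of how a decision and a confidence value could be derived from these measurements. The calibration function itself is not specified in the text, so it is left out; the linear rescaling of confidence to [0, 1] and the numeric coding of the answers are assumptions made for illustration.

```python
# Hypothetical numeric coding of the five answer options (an assumption).
ANSWER_SCORE = {"No": 0.0, "Probably not": 0.25, "Not sure": 0.5,
                "Probably yes": 0.75, "Yes": 1.0}


def label_from_distance(distance, threshold):
    """Positive (same person) when the normalized distance is below the threshold."""
    return distance < threshold


def machine_confidence(similarity):
    """Confidence grows as the calibrated similarity moves away from 0.5."""
    return abs(similarity - 0.5) * 2


def human_confidence(answer):
    """Extreme answers (No, Yes) give 1.0; Not sure gives 0.0."""
    return abs(ANSWER_SCORE[answer] - 0.5) * 2
```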

4.5. Ethical Considerations

Our research plan was reviewed and approved by the Institutional Committee for Ethical Review of Projects (CIREP) at Universitat Pompeu Fabra. The review included compliance with international ethical principles and personal data protection guided by the EU General Data Protection Regulation (2016/679).

We acknowledge that comparing performance across self-identified ethnic groups raises ethical concerns, particularly considering the historical misuse of racial and ethnic classification in research. The purpose of our analysis is not to essentialize group differences, but to identify potential biases in human and algorithmic performance, which is crucial for ensuring trustworthy AI. The “simplified ethnic group” categories provided by the Prolific platform are broad and self-reported; we use them only to determine if systematic disparities that could affect equitable system performance emerge. We believe that exploring such differences is justified only if it aids in reducing algorithmic discrimination and enhances the fairness of human–AI interactions.

5. Results

In what follows, we will consider a human error / success when the mean response of the three annotators solving the same task corresponds to an incorrect / correct response, respectively. Similarly, we will consider a machine error / success when the joint evaluation of the two models is incorrect / correct, respectively. For brevity, we will use “false/true positives/negatives” when we refer to the responses given by the models. We will also show some results of significance tests (p-values, denoted p). All of these are Kolmogorov-Smirnov tests.
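For readers unfamiliar with the test, the sketch below computes the Kolmogorov-Smirnov statistic for two samples, i.e., the largest gap between their empirical distribution functions. That the two-sample variant is the one used here is an assumption; the sketch returns only the statistic, and a library routine such as SciPy's `ks_2samp` would normally be used to obtain the p-value as well.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Identical samples give a statistic of 0, and samples with no overlap in their ranges give 1.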

Participant Demographics. Participants were on average 27.3 years old (SD=12.0 years). Out of the participants that indicated their gender, 55% identified as female, 41% as male, and 4% as non-binary. The majority of the participants that indicated an ethnicity identified as “White” (46%), followed by “NonArab African” (19%), “South Asian” (13%), and “East Asian” (9%). The remaining ethnicities accounted for less than 5% of the participants each.

Error Consistency (RQ1). We now consider the agreement of human annotations, i.e., the extent to which multiple people agree on whether a pair of images represents the same person or not. Annotators were shown a total of 543 pairs of face images: 363 machine errors and 180 machine successes. Since the successes shown to the annotators are only a sample of all the successes from the models, we oversampled them (applying a correction of 1 success = 28 successes) to balance the workload. We also transformed every human annotation, originally based on a numeric 5-point scale, into a binary annotation in order to establish a fair comparison between human and machine agreement. The overall inter-annotator agreement was moderate (Fleiss’ κ = 0.47), which suggests that there is a mixture of agreement and disagreement between annotators (RQ1a). This is driven primarily by consensus on model successes (κ = 0.51), whereas participants’ agreement on model errors was no better than chance (κ = −0.05).
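As a reminder of how the agreement statistic is obtained, here is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. The input layout (a list of per-item category counts) is a choice made for this example; how "Not sure" answers were binarized is not stated in the text and is not modeled here.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters assigning item i to category j.

    Every item must be rated by the same number of raters (at least two).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that chose the same category.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)
```

With three raters and two categories, the items `[[2, 1], [1, 2]]` yield a negative value, which illustrates how kappa drops below zero when observed agreement is lower than chance agreement.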

As Figure 1a shows, human annotators are almost always correct in negative pairs (different people), as less than 5% of pairs are incorrectly classified as positive by the annotators. However, when images represent the same person, results are mixed. Although most of the positive pairs were correctly classified by the annotators, approximately 30% of those pairs were incorrectly categorized as negative. Differences in the distributions of labels on negative and positive pairs are significant at p ≪ 0.0001.

The inter-rater agreement between the facial recognition models IR50 and LightCNN was almost perfect (κ = 0.92), indicating very high consistency in their outputs (RQ1b). Both humans and models showed a similar error pattern, with over 65% of model errors being false negatives. However, human annotators demonstrated poor agreement when analyzing only the cases they misclassified (κ = −0.05), and model agreement dropped even further in error cases (κ = −0.29), revealing substantial disagreement when mistakes occurred. The interpretation of negative values for Fleiss’ kappa is based on [43].

Error Alignment (RQ2). Next, we studied the extent to which human successes/errors are aligned with machine successes/errors. We considered four categories of model outcomes: True Positives, False Negatives, True Negatives, and False Positives.

Human performance differed significantly between True Negatives and False Positives (p ≪ 0.0001), as well as between the two types of positive pairs (p ≪ 0.0001). For negative pairs (see Figure 1b), humans showed more uncertainty when the model made a False Positive error, often choosing "Probably not" instead of a definitive "No." Similarly, for positive pairs (see Figure 1c), human errors tended to occur in the same cases where the models also failed. These findings, put together with the significance above, suggest that humans find machine error cases more challenging than those where the model is correct (RQ2a).

In general, human annotators are more likely to make mistakes on image pairs where both models failed (RQ2b). Human certainty also declined in these cases: annotators preferred Probably not over No for false positives made by both models, and were more prone to errors on false negatives.

For False Positives, human evaluations of errors made only by IR50 significantly differed from those on errors shared by both models (p < 0.001), whereas no significant difference was observed between LightCNN-only errors and joint errors. We depict these differences in Figure 2a.

For False Negatives, human assessments significantly differed between errors made solely by either IR50 (p < 0.001) or LightCNN (p ≪ 0.0001), compared to those made by both models. We depict these differences in Figure 2b.

We examined human annotators’ perception of similarity and compared it with model-computed similarity scores. This time we distinguished between eight overlapping categories of human and model errors and successes: { Human, Machine } × { True Positives, False Positives, True Negatives, False Negatives }. When both models and humans answered correctly, significant differences were found between human and machine similarity scores for both positive and negative pairs (p ≪ 0.0001). Machine scores showed bimodal distributions, whereas human judgments were more spread out (RQ2c).

When both models and annotators were incorrect, a significant difference was observed for negative pairs (p ≪ 0.0001), but not for positive ones (see orange violins in Figure 3). Notably, in Machine False Positives, human similarity judgments clustered around 0.5, indicating low confidence (akin to “Not sure”) when incorrectly identifying different individuals as the same. The yellow band around similarity 0.5 in Figure 3 includes machine errors that, based on these observations, could be predicted in advance as potential errors.


Other-race effect. In our experiments, we found partial evidence of the “other-race” effect (see Table 1). We calculated the error rates for the three self-ascribed ethnicities: White, Black, and Asian. We considered only pairs of images with the same ethnicity label in both images and computed the error rate for each of these sets of pairs. “White” and “Black” annotators are the most accurate when annotating images of their own group, but this was not the case for “Asian” annotators. No “other-race” effect was found in the models. We investigate this effect as a way to anticipate potential biases. We acknowledge that the name of the “other-race effect” itself may carry outdated terminology, but we chose it for consistency with prior literature.

Exploratory study of error-based human-machine collaboration (RQ3). We conducted a study with the aim of illustrating how to apply certain improvements based on previously obtained results in a “human as overseer” scenario [31]. We studied the improvement in model accuracy (93.5%) that would be obtained by manually reviewing all the pairs evaluated by the machine in an order determined by the results obtained.

The first improvement is based on the results obtained related to RQ2c: the use of machine confidence to prioritize those cases that have a high probability of being corrected by the human annotator (see Figure 3). Note that machine confidence can be inferred from the similarity score (the further the similarity is from 50%, the higher the confidence, see Figure 3). The evolution of joint accuracy when the prioritization of low-confidence machine outputs is implemented can be seen in the top pink line of Figure 4a. We can observe that the pairs that human annotators are able to solve correctly are concentrated at the beginning of the workflow, leading to an early and rapid growth of the joint accuracy. This marked improvement in accuracy contrasts with the results we would obtain if this strategy were not taken into account (see the bottom black line in Figure 4a).

The second improvement is based on the results obtained when investigating RQ2b: prioritizing those pairs where the models gave different answers, i.e., only one of the two models correctly classified the pair. As we can see in Figure 4b, the joint accuracy obtained if these pairs are prioritized (blue line) exceeds the one obtained when this priority is not implemented (orange line) during most of the human annotation flow.
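Both prioritization ideas can be sketched as a review queue whose order is set by a priority function. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the record fields (`similarity`, `machine_correct`, `human_correct`, `models_disagree`) and the rule that a reviewed pair takes the human label are hypothetical.

```python
def joint_accuracy_curve(pairs, priority):
    """Joint accuracy after reviewing 0, 1, ..., n pairs in priority order.

    pairs: list of dicts with boolean 'machine_correct' and 'human_correct',
    plus whatever fields the priority function reads.
    """
    ordered = sorted(pairs, key=priority)  # lowest key is reviewed first
    correct = sum(p["machine_correct"] for p in pairs)
    curve = [correct / len(pairs)]
    for p in ordered:
        # A reviewed pair takes the human label instead of the machine label.
        correct += int(p["human_correct"]) - int(p["machine_correct"])
        curve.append(correct / len(pairs))
    return curve


def low_confidence_first(pair):
    """Review pairs whose similarity is closest to 0.5 first."""
    return abs(pair["similarity"] - 0.5)


def disagreement_first(pair):
    """Review pairs on which the two models gave different answers first."""
    return 0 if pair["models_disagree"] else 1
```

Plotting the curve for a given priority function against the curve for an arbitrary order would show how early in the review flow the accuracy gains arrive.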

6. Discussion and Conclusions

This study investigates human-machine complementarities in facial recognition tasks by comparing human and model errors. The analysis highlights how human strengths can complement machine weaknesses, informing strategies for more effective joint systems.

Three main findings emerge: (1) Consistency of Errors: Both humans and models showed consistent behaviors, enabling error characterization. Humans excel at negative pairs, with only 4.8% errors on 126 machine false positives, indicating areas where human oversight can effectively correct model failures; (2) Shared Difficulties: Humans are more likely to err when both models (IR50 and LightCNN) make the same mistake, especially on false negatives. This correlation reveals shared difficulties and helps prioritize cases requiring human review; and (3) Similarity Judgments: Human and model similarity scores diverge significantly except when both err on positive pairs. Models exhibit sharp score contrasts between correct and incorrect predictions, which can signal likely errors and guide targeted human intervention.

Based on these observations, we propose a hybrid human-in-the-loop approach that boosts system accuracy by 3% with minimal annotation cost (10% of total), correcting 148 misclassified pairs. This demonstrates how strategic manual review can yield significant gains.

This study also reflects on real-world implications. Since most machine errors corrected by humans are false positives, domains with high sensitivity to such errors (such as security or law enforcement scenarios) must implement robust human oversight. Regulatory guidance, such as the European AI Act [14], underscores the importance of human involvement in high-risk AI systems.

Our limitations include the use of symmetric costs and reliance on two pre-trained models trained on the same dataset (MS-Celeb-1M [34]). While findings are scenario-specific, they provide a framework for error analysis and oversight strategies in other contexts. The results reinforce that current facial recognition systems are not ready to replace human judgment and that ethical, legal, and societal concerns necessitate permanent human oversight. Future work could examine how biases such as algorithm aversion or overconfidence affect collaborative decision-making, especially when machine and human resolutions diverge.

Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT and Writefull to check grammar and spelling and to improve writing style. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

[10] M. T. Dzindolet, L. G. Pierce, H. P. Beck, L. A. Dawe, The perceived utility of human and automated aids in a visual detection task, Human Factors 44 (2002) 79–94.

[11] T. Reich, A. Kaju, S. J. Maglio, How to overcome algorithm aversion: Learning from mistakes, Journal of Consumer Psychology 33 (2023) 285–302.

[12] B. J. Dietvorst, J. P. Simmons, C. Massey, Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them, Management Science 64 (2018) 1155–1170.

[13] Q. Roy, F. Zhang, D. Vogel, Automation accuracy is good, but high controllability may be better, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–8.

[14] European Union, EU AI Act, 2024. URL: https://artificialintelligenceact.eu/the-act/.

[15] P. Negri, I. Hupont, E. Gomez, A framework for assessing proportionate intervention with face recognition systems in real-life scenarios, arXiv preprint arXiv:2402.05731 (2024).

[16] A. Papenmeier, D. Kern, D. Hienert, Y. Kammerer, C. Seifert, How accurate does it feel?–human perception of different types of classification mistakes, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–13.

[17] I. Hupont, S. Tolan, H. Gunes, E. Gómez, The landscape of facial processing applications in the context of the european ai act and the development of trustworthy systems, Scientific Reports 12 (2022) 10688.

[18] K. Kyriakou, J. Otterbacher, In humans, we trust: Multidisciplinary perspectives on the requirements for human oversight in algorithmic processes, Discover Artificial Intelligence 3 (2023) 44.

[19] R. Koulu, Proceduralizing control and discretion: Human oversight in artificial intelligence policy, Maastricht Journal of European and Comparative Law 27 (2020) 720–735.

[20] J. Han, Z. Zhang, M. Schmitt, M. Pantic, B. Schuller, From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty, in: Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 890–897.

[21] J. T. Andrews, P. Joniak, A. Xiang, A view from somewhere: Human-centric face representations, arXiv preprint arXiv:2303.17176 (2023).

[22] T. Makino, S. Jastrzebski, W. Oleszkiewicz, C. Chacko, R. Ehrenpreis, N. Samreen, C. Chhor, E. Kim, J. Lee, K. Pysarenko, et al., Differences between human and machine perception in medical diagnosis, Scientific Reports 12 (2022) 6877.

[23] M. Huber, P. Terhörst, F. Kirchbuchner, N. Damer, A. Kuijper, Stating comparison score uncertainty and verification decision confidence towards transparent face recognition, arXiv preprint arXiv:2210.10354 (2022).

[24] I. Hupont, C. Fernández, Demogpairs: Quantifying the impact of demographic imbalance in deep face recognition, in: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–7.

[25] C. A. Meissner, J. C. Brigham, Thirty years of investigating the own-race bias in memory for faces:

A meta-analytic review., Psychology, Public Policy, and Law 7 (2001) 3. [26] C. Feliciano, Shades of race: How phenotype and observer characteristics shape racial classification,

American Behavioral Scientist 60 (2016) 390–419. [27] P. J. Phillips, F. Jiang, A. Narvekar, J. Ayyad, A. J. O’Toole, An other-race efect for face recognition algorithms, ACM Transactions on Applied Perception (TAP) 8 (2011) 1–11. [28] P. Hemmer, L. Thede, M. Vössing, J. Jakubik, N. Kühl, Learning to defer with limited expert predictions, Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023) 6002–6011. [29] H. Mozannar, H. Lang, D. Wei, P. Sattigeri, S. Das, D. Sontag, Who should predict? exact algorithms for learning to defer to humans, in: International conference on artificial intelligence and statistics, PMLR, 2023, pp. 10520–10545. [30] V. Keswani, M. Lease, K. Kenthapadi, Towards unbiased and accurate deferral to multiple experts, in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021, pp. 154–165. [31] C. Punzi, R. Pellungrini, M. Setzu, F. Giannotti, D. Pedreschi, Ai, meet human: Learning paradigms for hybrid decision making systems, arXiv preprint arXiv:2402.06287 (2024). [32] M. H. Lee, D. P. Siewiorek, A. Smailagic, A. Bernardino, S. Bermúdez i Badia, Towards eficient

A. Online Resources

Code and Data is available in our GitHub repository

[1]

J. N.

Matias , Humans and algorithms work together-so study them together , Nature 617 ( 2023 ) 248 - 251 .

[2]

Jussupow , I. Benbasat ,

Heinzl , Why are we averse towards algorithms? a comprehensive literature review on algorithm aversion , in: F. Rowe (Ed.), 28th European Conference on Information Systems - Liberty , Equality, and Fraternity in a Digitizing World, ECIS 2020 , Marrakech, Morocco, June 15-17, 2020 : Proceedings, AISeL, Atlanta, GA , 2020 , p. RP 168 .

[3]

B. J.

Dietvorst ,

J. P.

Simmons ,

Massey , Algorithm aversion: people erroneously avoid algorithms after seeing them err ., Journal of Experimental Psychology: General 144 ( 2015 ) 114 .

[4]

Lyell , E. Coiera, Automation bias and verification complexity: a systematic review , Journal of the American Medical Informatics Association 24 ( 2017 ) 423 - 431 .

[5]

White ,

P. J.

Phillips ,

C. A.

Hahn ,

Hill , A. J. O'Toole , Perceptual expertise in forensic facial image comparison , Proceedings of the Royal Society B: Biological Sciences 282 ( 2015 ) 20151292 .

[6]

Rice ,

P. J.

Phillips ,

Natu ,

An , A. J. O'Toole , Unaware person recognition from the body when face identification fails , Psychological Science 24 ( 2013 ) 2235 - 2243 .

[7]

P. J.

Phillips , A. J. O'toole, Comparison of human and computer performance across face recognition experiments , Image and Vision Computing 32 ( 2014 ) 74 - 85 .

[8]

J. E.

Korteling , G. C. van de Boer-Visschedijk,

R. A.

Blankendaal ,

R. C.

Boonekamp ,

A. R.

Eikelboom , Human-versus artificial intelligence , Frontiers in artificial intelligence 4 ( 2021 ) 622364 .

[9]

Vaccaro ,

Almaatouq , T. Malone, When combinations of humans and ai are useful: A systematic review and meta-analysis , Nature Human Behaviour 8 ( 2024 ) 2293 - 2303 . annotations for a human-ai collaborative, clinical decision support system: A case study on physical stroke rehabilitation assessment , in: Proceedings of the 27th International Conference on Intelligent User Interfaces , 2022 , pp. 4 - 14 .

[33]

P. J.

Phillips ,

A. N.

Yates ,

Hu ,

C. A.

Hahn ,

Noyes ,

Jackson ,

J. G.

Cavazos , G. Jeckeln,

Ranjan ,

Sankaranarayanan , et al., Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms , Proceedings of the National Academy of Sciences 115 ( 2018 ) 6171 - 6176 .

[34]

Guo ,

Zhang ,

Hu ,

He ,

Gao , Ms-celeb-1m: A dataset and benchmark for large-scale face recognition , in: Computer Vision-ECCV 2016 : 14th European Conference, Amsterdam, The Netherlands, October 11-14 , 2016 , Proceedings, Part III 14 , Springer, 2016 , pp. 87 - 102 .

[35]

Deng ,

Guo ,

Xue ,

Zafeiriou , Arcface: Additive angular margin loss for deep face recognition , in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2019 , pp. 4690 - 4699 .

[36]

Wu ,

He ,

Sun ,

Tan , A light cnn for deep face representation with noisy labels , IEEE Transactions on Information Forensics and Security 13 ( 2018 ) 2884 - 2896 .

[37]

Wang ,

Zhang ,

Xiong ,

Zhao , Face.evolve: A high-performance face recognition library , arXiv preprint arXiv:2107.08621 ( 2021 ).

[38]

He ,

Zhang , S. Ren,

Sun , Deep residual learning for image recognition , in: Proceedings of the IEEE conference on computer vision and pattern recognition , 2016 , pp. 770 - 778 .

[39]

G. B.

Huang ,

Mattar ,

Berg ,

Learned-Miller , Labeled faces in the wild: A database forstudying face recognition in unconstrained environments , in: Workshop on faces in' Real-Life'Images: detection, alignment, and recognition , 2008 .

[40]

Moschoglou ,

Papaioannou ,

Sagonas ,

Deng , I. Kotsia,

Zafeiriou , Agedb: the first manually collected, in-the-wild age database , in: proceedings of the IEEE conference on computer vision and pattern recognition workshops , 2017 , pp. 51 - 59 .

[41]

Cao ,

Shen ,

Xie ,

O. M.

Parkhi , A. Zisserman, Vggface2: A dataset for recognising faces across pose and age , in: 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018 ), IEEE, 2018 , pp. 67 - 74 .

[42]

D. L.

Chen ,

Schonger , C. Wickens, otree-an open-source platform for laboratory, online, and ifeld experiments , Journal of Behavioral and Experimental Finance 9 ( 2016 ) 88 - 97 .

[43]

J. R.

Landis , G. G. Koch,

The measurement of observer agreement for categorical data, biometrics (

1977 ) 159 - 174 .