<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparison of Human and Machine Learning Errors in Face Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marina Estévez-Almenzar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Baeza-Yates</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Castillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ICREA</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Machine learning applications in high-stakes scenarios should always operate under human oversight. Developing an optimal combination of human and machine intelligence requires an understanding of their complementarities, particularly regarding the similarities and differences in the way they make mistakes. We perform extensive experiments in the area of face recognition and compare two automated face recognition systems against human annotators through a demographically balanced user study. Our research uncovers important ways in which machine learning errors and human errors differ from each other, and suggests potential strategies in which human-machine collaboration can improve accuracy in face recognition.</p>
      </abstract>
      <kwd-group>
        <kwd>Human factors</kwd>
        <kwd>User studies</kwd>
        <kwd>Human-centered computing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Decision support systems powered by machine learning (ML) are increasingly used in high-stakes
scenarios including immigration, healthcare, justice, and access to labor and education, among many
others. In these application domains, ML systems should not be autonomous, but rely on a human
operator or expert, who should be responsible for the final decision. An in-depth understanding of
the dynamics of human-algorithm interactions is crucial for developing safe, trustworthy systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Understanding the complementarities between human and machine intelligence is crucial. In an “ideal”
scenario, there is perfect complementarity: cases challenging for the ML system are easily handled by
the human operator. Conversely, the worst case is when there is total overlap: cases that are difficult
or uncertain for the ML also lead to human errors. In practice, we may find applications that are
somewhere between these extremes, as our empirical findings demonstrate within the context of face
recognition.
      </p>
      <p>
        Among other areas that need exploration, little has been done to understand when human and
machine errors are similar and when they are different. Analyzing these similarities and differences is
particularly important because the presence of algorithmic errors influences well-known patterns of
human-machine interaction, such as algorithmic aversion [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], a biased and overly negative human
evaluation of an algorithm. The question arises as to whether this aversion also varies depending on
whether or not the errors presented by the model resemble those that a human agent might make. If we
are able to avoid this type of bias, there is another risk: automation bias, an over-reliance on automated
decision support mechanisms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this case we can ask an analogous question: Does a high similarity
between human and machine errors influence the human agent’s ability to judge the accuracy of the
model?
      </p>
      <p>The main goal of this research is to compare ML errors and human errors. The results obtained from
this comparative study serve as inputs for the development of straightforward yet impactful strategies
to combine human and machine intelligence.</p>
      <p>The term “human error” suggests a homogeneity that is almost non-existent in real
life. Human perception varies from individual to individual, due either to variations in physiological
structures or to external influences such as culture. It becomes even more complex when it comes to
assessing human perception in distinguishing between other human identities, such as in the context
of a face recognition task. We tackle this complexity through a demographically diverse user study
for face matching in which possible inter-individual differences and disagreements are considered.
Proposing hybrid human-machine strategies in the field of face recognition is crucial. For instance, in
an automated system integrated in a police surveillance scenario, errors deserve special attention. In
January 2020, Robert Julian-Borchak Williams became the first documented person in the U.S. to be
wrongfully arrested based on a false hit produced by facial recognition technology.<sup>1</sup> Many more cases
have been documented, and they often involve racial biases.<sup>2</sup></p>
      <p>Our main findings for the face recognition task we study are: (1) humans rarely produce false positives;
(2) the ML similarity score is a potential error predictor; and (3) humans find it easier to address mistakes
made by an individual model than to address errors shared by two models. These findings
provide a method for detecting potential errors in automated facial recognition, and help us find those
errors that a human annotator has a high chance of correcting. Applying this approach in a practical
setting enables us to develop an effective evaluation strategy that maximizes joint human-machine
accuracy while controlling human annotation effort. Unlike other approaches that strictly emphasize
the enhancement of accuracy through algorithmic advancements, this work underscores not only the
importance of incorporating the human factor in this race for accuracy maximization, but also the
effectiveness of this approach.</p>
      <p>This work contributes to the broader agenda of developing trustworthy AI systems, which require
not only technical robustness and accuracy, but also fairness, transparency, and meaningful human
oversight. Understanding when and why human and machine errors overlap is central to assessing
the trustworthiness of AI-based decision support: if both human and algorithmic judgments fail in
similar cases, this may compromise system reliability and propagate bias. Conversely, identifying
complementary strengths between human and machine reasoning enhances both accuracy and fairness,
two key dimensions of trustworthy AI. By empirically analyzing patterns of human-AI error alignment
in face recognition, our study provides insights into how hybrid decision-making can be designed to
strengthen trustworthiness through improved error detection, bias mitigation, and accountability.</p>
      <p>The rest of the paper is organized as follows. In §2 we review related work, followed by our research
questions and our methodology in §3 and §4, respectively. In §5 we present our results, and in §6 we
discuss them along with their limitations and give our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Human and ML performance. While machine learning (ML) systems may surpass human
performance in simple facial recognition tasks, they often struggle under complex, real-world conditions. In
such challenging scenarios, non-expert humans perform comparably to some algorithms, and experts
often outperform them [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In particular, humans have been shown to exceed random chance where
facial recognition systems failed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Historically, although competitions have framed human and algorithmic facial recognition as
competition rather than collaboration, they have brought out the potential human-AI complementarities.
Phillips et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] found that algorithms excelled at simple, static, frontal images, whereas humans were
better at interpreting difficult images and videos. Similarly, Rice et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and White et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
documented situations where humans, especially experts, outperformed algorithms in complex identification
tasks.
      </p>
      <p>
        While AI models outperform humans in data-intensive tasks, human abilities remain essential for
contextual reasoning, abstract judgment, and perceptual robustness [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. The emerging consensus
suggests that hybrid approaches may offer the most promising path forward in advancing
decision-making and performance across domains.
      </p>
      <p>
        Combining human and machine intelligence. Researchers have explored how to effectively
combine human and machine intelligence, with a particular focus on addressing algorithm aversion, i.e., users’
reluctance to trust algorithms after observing errors [
        <xref ref-type="bibr" rid="ref3">10, 11, 3</xref>
        ]. This aversion can be mitigated when
users retain some control over the prediction process [12, 13].
      </p>
      <p>The deployment of facial recognition systems illustrates the broader need for more than just technical
performance. The EU AI Act [14] emphasizes that such technologies should be used only when strictly
necessary and in clearly defined scenarios. In line with this, Negri et al. [15] propose a framework for
assessing the appropriateness of facial recognition interventions based on contextual needs.</p>
      <p>Effective implementation must also consider application context, including user characteristics and
demographics. This aligns with research on human-centered machine learning [16], human oversight
strategies [17, 18], and mechanisms to preserve the human role in decision-making, especially in
rights-sensitive domains [19].</p>
      <p>Human factors in decision support. Several studies have focused on incorporating human factors
into machine learning systems. Han et al. [20] improved emotion detection by using inter-annotator
agreement to align predictions with human judgment. Andrews et al. [21] questioned the adequacy of
categorical labels for representing human phenotypes, advocating for more continuous representations.
In medical imaging, Makino et al. [22] showed that deep neural networks rely on features not typically
used by experts, highlighting the need to integrate domain knowledge when comparing human and AI
decisions.</p>
      <p>Transparency has also been addressed by Huber et al. [23], who proposed propagating model
uncertainties to improve user understanding in facial recognition. Papenmeier et al. [16] found that
users are more averse to models that fail on simple tasks, suggesting that the context of errors affects
perceived reliability.</p>
      <p>However, mimicking human cognition can also introduce biases [24]. The well-documented other-race
effect in face recognition [25, 26] has been replicated in algorithms [27].</p>
      <p>
        Recent work also explores hybrid human-AI systems, where the model defers to human judgment
under certain conditions, such as low confidence [28, 29, 30, 31]. Research has also examined how to
structure annotation workflows to improve collaboration [32], and how human-AI pairing based on
perceived similarity can influence decision-making [
        <xref ref-type="bibr" rid="ref10">33</xref>
        ].
      </p>
      <p>To the best of our knowledge, most efforts to integrate human factors into technology have
focused on encoding specific human traits and enhancing model performance, that is, on observing
humans and refining models independently. Some more recent efforts have gone further, proposing novel
techniques to combine human and machine performance. However, there is still a lack of understanding
of the key differences between decision-makers, especially in contexts where the task involves a certain
subjectivity. In a scenario where there is no longer only the final decision related to the task at hand,
but also the decision as to which agent (human, algorithmic, or a combination of both) should decide,
it is important to know the strengths and weaknesses of both agents, which of these are shared and
which diverge, and how these differences and similarities can be exploited.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Research Questions</title>
      <p>The main goal of this work is to compare model errors with human errors in face recognition.</p>
      <p>Error consistency. We would like to characterize similarities and differences between human errors
and ML system errors. To achieve this, we first need to determine whether errors and successes are consistent,
i.e., whether we can identify the subsets of face recognition tasks in which errors and successes
are concentrated.</p>
      <sec id="sec-3-1">
        <title>RQ1a Are humans consistent in a face recognition task?</title>
      </sec>
      <sec id="sec-3-2">
        <title>RQ1b Are ML systems consistent in a face recognition task?</title>
        <p>Error alignment. We want to uncover whether there are common difficulties between ML systems
and human annotators. We expect these common difficulties to manifest as incorrect human annotations
on those face recognition tasks where the ML system erred. We also expect to obtain more incorrect
annotations in cases where more than one ML system errs. If human annotators and ML systems
provide a low-confidence annotation, then we would like to test whether their confidence in the annotation
aligns.</p>
        <p>RQ2a Are human annotators more likely to make a mistake in a face recognition task if an ML
system also gives an incorrect answer, compared to tasks for which the system is correct?</p>
        <p>RQ2b Are human annotators even more likely to make a mistake if more than one ML system
is incorrect?</p>
        <p>RQ2c Are human annotators’ perception of similarity and ML computations of similarity
correlated?</p>
        <p>Error-based human-machine collaboration. We want to investigate whether we can develop a strategy to
optimize human-machine collaboration in the context of solving a face recognition task. The consistency
raised in the first question would allow us to generalize in this context, while the study of the alignment
between human and machine error patterns would allow us to detect the key points of complementarity
for the development of a successful strategy.</p>
        <p>RQ3 Can we design a novel human-computer collaboration strategy based on the results of
our comparative study?</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup and Ethical Considerations</title>
      <p>Our experimental setting is based on a number of face recognition tasks that are performed by two
automated systems, as well as by human annotators. Given a pair of facial images, the task consists of
determining whether they belong to the same individual or not. The two automated models and
the set of annotators each performed this task independently.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>
          Training data. We used two pre-trained face recognition models. Both were trained by their
respective authors on MS-Celeb-1M [
          <xref ref-type="bibr" rid="ref11">34</xref>
          ]. According to its authors, it was the largest publicly available face
recognition dataset in the world. It contains about 10M images of nearly 100K people. MS-Celeb-1M is
fairly unbalanced demographically (more than 70% of the images correspond to white people). For this
reason, we investigate whether the models pre-trained on MS-Celeb-1M exhibit the other-race effect.
        </p>
        <p>Testing data. We used DemogPairs [24] as the evaluation dataset. It contains 10,800 facial images
corresponding to 600 people divided into six balanced, demographically labeled folds: {female, male} × {Asian,
Black, White}. DemogPairs was created and released by its authors with the explicit objective of
being used as a tool to test for demographic biases in face recognition models.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Models</title>
        <p>
          The two face recognition models used in this work were IR50+ArcFace [
          <xref ref-type="bibr" rid="ref12">35</xref>
          ] and LightCNN [
          <xref ref-type="bibr" rid="ref13">36</xref>
          ], both
trained on MS-Celeb-1M by the respective authors. We did not conduct additional training or
fine-tuning for this work. Both pre-trained models can be found online [
          <xref ref-type="bibr" rid="ref13 ref14">37, 36</xref>
          ].
        </p>
        <p>
          IR50+ArcFace is an extension of ResNet50 [
          <xref ref-type="bibr" rid="ref11 ref15">34, 38</xref>
          ], a residual network that has been extensively
applied to many image tasks, with an ArcFace loss function [
          <xref ref-type="bibr" rid="ref12">35</xref>
          ]. It reaches an accuracy of 99.78% in
LFW [
          <xref ref-type="bibr" rid="ref16">39</xref>
          ], 97.53% in AgeDB [
          <xref ref-type="bibr" rid="ref17">40</xref>
          ], 95.22% in VGGFace2 [
          <xref ref-type="bibr" rid="ref18">41</xref>
          ], well-known public benchmarks for pair
matching. LightCNN was created to learn a compact embedding on large-scale face data with noisy
labels. It has been reported to achieve state-of-the-art results on various face benchmarks without
fine-tuning [
          <xref ref-type="bibr" rid="ref13">36</xref>
          ]. In this work, we used the 29-layer version, which reaches an accuracy of 99.40% in
LFW.
        </p>
        <p>
          For evaluation, we used face.evoLVe [
          <xref ref-type="bibr" rid="ref14">37</xref>
          ], a face recognition library for face-related analytics and
applications. For the purposes of this research, the library was instrumented to keep track of individual
errors. The instrumented library is available with our code release.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Methodology</title>
        <p>
          We performed an online user study [
          <xref ref-type="bibr" rid="ref19">42</xref>
          ], with the following structure.
        </p>
        <p>Participant recruitment. We recruited participants through the crowdsourcing platform Prolific.<sup>3</sup> We
considered four countries in continental Europe in which Prolific has large user bases: France, Germany,
Italy, and Spain, plus the United Kingdom and Turkey. The crowdsourcing platform provides gender
information and allows users to self-identify with a “simplified ethnic group” (Asian, Black, and White).
We made sure that our sets of participants were gender balanced, and that for each pair of images,
one person from each simplified ethnic group participated in its evaluation. So, for every pair of
images, we collected 3 annotations. In total, we recruited 235 participants, excluding 2 of them from
our data due to failed attention checks. For the subsequent analysis, based on ethnic self-identification,
we selected 162 participants. Participants were paid 0.70 GBP (about 0.82€) to label 10 pairs of images,
with an average completion time of 5 minutes. This amounts to 8.4 GBP per hour. Participants were
asked about their age, gender identity, and ethnic background in an initial demographic questionnaire.</p>
        <p>Face recognition tasks. Participants evaluated one pair of images at a time. The participant had to
answer the question “Are they the same person?”, with the possible options: No, Probably not, Not sure,
Probably yes, or Yes.</p>
        <p>Task selection. We found that the face recognition models, evaluated jointly (see §4.4), were correct
on more than 93% of the tasks. Hence, due to budget constraints, we annotated all the cases where the models
were wrong (“misses”), and a sample of cases in which both models were right (“hits”). We annotated
363 “misses” (237 false negatives and 126 false positives), which were shown to a total of 164 participants,
from whom we selected a demographically balanced set of 108 participants. We also annotated 180
model “hits”, which were shown to a total of 69 participants, from whom we selected a demographically
balanced set of 54 participants. This selection of “hits” was a random sample that was demographically
balanced for the true positive set (90 pairs) and for the true negative set (90 pairs).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Measurements</title>
        <p>Accuracy. Accuracy is defined as the fraction of correct responses with respect to the ground truth.
We distinguish between (1) Machine accuracy: the joint accuracy of the models. Given a pair, we calculate
the average of the calibrated similarity scores of the two models, and the label is decided based on this
average. And (2) Human accuracy: the accuracy of the human annotators, taken as a group of three annotators.
This is computed as a macro average, i.e., first the three evaluations on a pair are averaged, and then we
determine whether that average is correct or not.</p>
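        <p>The two accuracy definitions above can be illustrated with a minimal Python sketch. Several details are assumptions made only for this example: the 0.5 cutoff applied to the averaged calibrated score, the numeric encoding of the five answer options, and the choice to treat a group average of exactly 0.5 as undecided (and hence incorrect). All function names and the toy data are hypothetical.</p>
        <preformat>
```python
from statistics import mean

# Assumed numeric encoding of the five answer options (illustrative only).
ANSWER_SCORE = {
    "No": 0.0,
    "Probably not": 0.25,
    "Not sure": 0.5,
    "Probably yes": 0.75,
    "Yes": 1.0,
}


def machine_label(score_a, score_b, cutoff=0.5):
    """Joint machine decision: average the two calibrated scores, then apply a cutoff."""
    return mean([score_a, score_b]) >= cutoff


def human_label(answers, cutoff=0.5):
    """Group decision of the annotators; None when the average sits exactly on the cutoff."""
    avg = mean(ANSWER_SCORE[a] for a in answers)
    if avg == cutoff:
        return None  # undecided group response (assumption: counted as incorrect)
    return avg > cutoff


def accuracy(predicted, truth):
    """Fraction of pairs whose predicted label equals the ground truth."""
    hits = sum(1 for p, t in zip(predicted, truth) if p is not None and p == t)
    return hits / len(truth)


# Toy data (hypothetical): three pairs, all of them true matches.
truth = [True, True, True]
machine = [machine_label(0.9, 0.7), machine_label(0.6, 0.2), machine_label(0.8, 0.9)]
human = [
    human_label(["Yes", "Probably yes", "Yes"]),
    human_label(["Yes", "Not sure", "Probably yes"]),
    human_label(["No", "Probably not", "Not sure"]),
]
print(accuracy(machine, truth), accuracy(human, truth))
```
        </preformat>
        <p>In this toy example the machine and the human group each get two of the three pairs right, but they fail on different pairs, which is the kind of complementarity the comparison is meant to expose.</p>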
        <p>Similarity. This is a measurement of how similar the model or the human annotator perceives the
persons in the images to be. We distinguish between (1) ML similarity score: Given two images, the model
computes two embeddings. The normalized distance between these embeddings, <italic>d</italic>, is compared against
a threshold <italic>t</italic> to determine the output (if <italic>d</italic> &lt; <italic>t</italic>, the pair is labeled as positive, and as negative otherwise). We
take the calibration of 1 − <italic>d</italic> as the similarity of the pair, which can be interpreted as a probability. Scores
close to 0.5 can be interpreted as a low model confidence. And (2) Human perception of similarity:
this is inferred from the annotator’s actual answer to the questions in the face recognition tasks (see
§4.3). From this measurement we can infer human confidence: the answers at the extremes (No and Yes)
correspond to the highest confidence, while the answer Not sure corresponds to the lowest confidence.</p>
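        <p>A small sketch of how confidence could be read off these two similarity measurements is given below. It is illustrative only: the linear rescaling of the machine score and the intermediate confidence value assigned to the two “Probably” answers are assumptions, the calibration step itself is not reproduced, and the function names are hypothetical.</p>
        <preformat>
```python
def machine_decision(distance, threshold):
    """Positive (same person) when the normalized distance is below the threshold."""
    return threshold > distance


def machine_confidence(similarity):
    """Map a calibrated similarity in [0, 1] to a confidence in [0, 1].

    Assumed rescaling: 0.0 at similarity 0.5 (least confident),
    1.0 at similarity 0 or 1 (most confident).
    """
    return abs(similarity - 0.5) * 2.0


# Assumed confidence per answer option: extremes highest, "Not sure" lowest.
HUMAN_CONFIDENCE = {
    "No": 1.0,
    "Probably not": 0.5,
    "Not sure": 0.0,
    "Probably yes": 0.5,
    "Yes": 1.0,
}

print(machine_decision(0.3, 0.4), machine_confidence(0.5), HUMAN_CONFIDENCE["Not sure"])
```
        </preformat>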
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Ethical Considerations</title>
        <p>Our research plan was reviewed and approved by the Institutional Committee for Ethical Review of
Projects (CIREP) at Universitat Pompeu Fabra. The review included compliance with international
ethical principles and personal data protection guided by the EU General Data Protection Regulation
(2016/679).</p>
        <p>We acknowledge that comparing performance across self-identified ethnic groups raises ethical
concerns, particularly considering the historical misuse of racial and ethnic classification in research.
The purpose of our analysis is not to essentialize group differences, but to identify potential biases in
human and algorithmic performance, which is crucial for ensuring trustworthy AI. The “simplified
ethnic group” categories provided by the Prolific platform are broad and self-reported; we use them
only to determine if systematic disparities that could affect equitable system performance emerge. We
believe that exploring such differences is justified only if it aids in reducing algorithmic discrimination
and enhances the fairness of human–AI interactions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In what follows, we will consider a human error / success when the mean response of the three annotators
solving the same task corresponds to an incorrect / correct response, respectively. Similarly, we will
consider a machine error / success when the joint evaluation of the two models is incorrect / correct,
respectively. For brevity, we will use “false/true positives/negatives” when we refer to the responses
given by the models. We will also show some results of significance tests (<italic>p</italic>-values, denoted <italic>p</italic>). All these
tests are Kolmogorov-Smirnov tests.</p>
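        <p>For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of two samples. The sketch below computes only this statistic, in plain Python; it does not compute the p-value, for which a statistics library would normally be used. The function name and the toy samples are hypothetical.</p>
        <preformat>
```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    points = sorted(set(a) | set(b))
    # bisect_right(a, x) counts the values in a that are at most x.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Toy samples (hypothetical): answers on a 1-5 scale for two groups of pairs.
answers_group_1 = [1, 1, 2, 1, 2, 1]
answers_group_2 = [4, 5, 5, 3, 5, 4]
print(ks_statistic(answers_group_1, answers_group_2))
```
        </preformat>
        <p>A statistic close to 1 means the two response distributions barely overlap, while a value close to 0 means they are nearly indistinguishable.</p>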
      <p>Participant Demographics. Participants were on average 27.3 years old (SD=12.0 years). Out of the
participants that indicated their gender, 55% identified as female, 41% as male, and 4% as non-binary. The
majority of the participants that indicated an ethnicity identified as “White” (46%), followed by
“Non-Arab African” (19%), “South Asian” (13%), and “East Asian” (9%). The remaining ethnicities accounted
for less than 5% of the participants each.</p>
      <p>Error Consistency (RQ1). We now consider the agreement of human annotations, i.e., the extent to
which multiple people agree on whether a pair of images represents the same person or not. Annotators
were shown a total of 543 pairs of face images: 363 machine errors and 180 machine successes. Since
the successes shown to the annotators are only a sample of all the successes from the models, we
oversampled them (applying a correction in which 1 sampled success counts as 28 successes) to balance the workload. We also transformed
every human annotation, originally based on a numeric 5-point scale, into a binary annotation in order
to establish a fair comparison between human and machine agreement. The overall inter-annotator
agreement was moderate (Fleiss’ κ = 0.47), which suggests that there is a mixture of agreement and
disagreement between annotators (RQ1a). This is driven primarily by consensus on model successes
(κ = 0.51), whereas participants’ agreement on model errors was no better than chance (κ = −0.05).</p>
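        <p>As a reminder of how the agreement statistic is obtained, the following sketch computes Fleiss’ kappa from binary annotations using its standard textbook definition. It is a minimal illustration rather than the analysis code: the helper names and the toy labels are hypothetical, every item is assumed to have the same number of raters, and the oversampling correction described above is not applied.</p>
        <preformat>
```python
def to_counts(binary_labels_per_item):
    """Turn per-item lists of 0/1 labels into [count of 0s, count of 1s] rows."""
    return [[labels.count(0), labels.count(1)] for labels in binary_labels_per_item]


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] is the number of raters assigning item i to category j.

    Assumes the same number of raters (at least 2) for every item and that
    the ratings are not all in one category (otherwise the denominator is zero).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy annotations (hypothetical): four pairs, three annotators each.
labels = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
print(fleiss_kappa(to_counts(labels)))
```
        </preformat>
        <p>Values near 1 indicate near-unanimous annotators, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would produce.</p>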
      <p>As Figure 1a shows, human annotators are almost always correct in negative pairs (different people),
as less than 5% of pairs are incorrectly classified as positive by the annotators. However, when images
represent the same person, results are mixed. Although most of the positive pairs were correctly
classified by the annotators, approximately 30% of those pairs were incorrectly categorized as negative.
Differences in the distributions of labels on negative and positive pairs are significant at <italic>p</italic> ≪ 0.0001.</p>
      <p>
        The inter-rater agreement between the facial recognition models IR50 and LightCNN was almost
perfect (κ = 0.92), indicating very high consistency in their outputs (RQ1b). Both humans and
models showed a similar error pattern, with over 65% of model errors being false negatives. However,
human annotators demonstrated poor agreement when analyzing only the cases they misclassified
(κ = −0.05), and model agreement dropped even further in error cases (κ = −0.29), revealing
substantial disagreement when mistakes occurred. The interpretation of negative values of Fleiss’
kappa is based on [
        <xref ref-type="bibr" rid="ref20">43</xref>
        ].
      </p>
      <p>Error Alignment (RQ2) Next, we studied the extent to which human successes/errors are aligned
with machine successes/errors. We considered four categories of model outcomes: True Positives, False
Negatives, True Negatives, and False Positives.</p>
      <p>Human performance differed significantly between True Negatives and False Positives (<italic>p</italic> ≪ 0.0001),
as well as between the two types of positive pairs (<italic>p</italic> ≪ 0.0001). For negative pairs (see Figure 1b),
humans showed more uncertainty when the model made a False Positive error, often choosing “Probably
not” instead of a definitive “No.” Similarly, for positive pairs (see Figure 1c), human errors tended to
occur in the same cases where the models also failed. These findings, put together with the significance
above, suggest that humans find machine error cases more challenging than those where the model is
correct (RQ2a).</p>
      <p>In general, human annotators are more likely to make mistakes on image pairs where both models
failed (RQ2b). Human certainty also declined in these cases: annotators preferred Probably not over No
for false positives made by both models, and were more prone to errors on false negatives.</p>
      <p>For False Positives, human evaluations of errors made only by IR50 differed significantly from those
of errors shared by both models (<italic>p</italic> &lt; 0.001), whereas no significant difference was observed between
LightCNN-only errors and joint errors. We depict these differences in Figure 2a.</p>
      <p>For False Negatives, human assessments of errors made solely by either
IR50 (<italic>p</italic> &lt; 0.001) or LightCNN (<italic>p</italic> ≪ 0.0001) differed significantly from those of errors made by both models. We depict these
differences in Figure 2b.</p>
      <p>We examined human annotators’ perception of similarity and compared it with model-computed
similarity scores. This time we distinguished between eight overlapping categories of human and
model errors and successes: {Human, Machine} × {True Positives, False Positives, True Negatives,
False Negatives}. When both models and humans answered correctly, significant differences were
found between human and machine similarity scores for both positive and negative pairs (<italic>p</italic> ≪ 0.0001).
Machine scores showed bimodal distributions, whereas human judgments were more spread out (RQ2c).</p>
      <p>When both models and annotators were incorrect, a significant difference was observed for negative
pairs (<italic>p</italic> ≪ 0.0001), but not for positive ones (see orange violins in Figure 3). Notably, in Machine False
Positives, human similarity judgments clustered around 0.5, indicating low confidence (akin to “Not
sure”) when incorrectly identifying different individuals as the same. The yellow band around similarity
0.5 in Figure 3 includes machine errors that, based on these observations, could be flagged in advance
as potential errors.</p>
      <p>Other-race effect. In our experiments, we found partial evidence of the “other-race” effect (see Table
1). We calculated the error rates for the three self-ascribed ethnicities: White, Black, and Asian. We
considered only pairs of images with the same ethnicity label in both images and computed the error rate
for each of these sets of pairs. “White” and “Black” annotators are the most accurate when annotating
images of people from their own group, but this was not the case for “Asian” annotators. No “other-race” effect was found in
the models. We investigate this effect as a way to anticipate potential biases. We acknowledge that the
name “other-race effect” itself may carry outdated terminology, but we chose it for consistency
with prior literature.</p>
      <sec id="sec-5-1">
        <title>Exploratory study of error-based human-machine collaboration (RQ3)</title>
        <p>We conducted a study with the aim of illustrating how to apply certain improvements based on the results obtained above in
a “human as overseer” scenario [31]. We studied the improvement over the model accuracy (93.5%) that would
be obtained by manually reviewing all the pairs evaluated by the machine in an order determined by
those results.</p>
        <p>The first improvement builds on the results for RQ2c: machine confidence is used
to prioritize the cases that have a high probability of being corrected by the human annotator (see
Figure 3). Machine confidence can be inferred from the similarity score: the further the
similarity is from 50%, the higher the confidence. The evolution of joint accuracy when
low-confidence machine outputs are prioritized is shown by the top pink
line of Figure 4a. The pairs that human annotators are able to solve correctly
are concentrated at the beginning of the workflow, leading to an early and rapid growth of the joint
accuracy. This marked improvement contrasts with the results obtained when this
strategy is not applied (see the bottom black line in Figure 4a). The second improvement
builds on the results for RQ2b: prioritizing the pairs on which the models
gave different answers, i.e., only one of the two models classified the pair correctly. As shown in
Figure 4b, the joint accuracy obtained when these pairs are prioritized (blue line) exceeds the one obtained
without this prioritization (orange line) during most of the human annotation flow.</p>
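        <p>The two prioritization strategies can be sketched as review orderings over the evaluated pairs. The code below is an illustrative sketch, not the authors' implementation: the Pair fields, the function names, the assumption that a reviewed pair simply takes the human answer, and the toy values are all hypothetical.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass
class Pair:
    similarity: float      # machine similarity score in [0, 1]
    machine_correct: bool  # did the machine classify the pair correctly?
    human_correct: bool    # would the human reviewer classify it correctly?
    models_disagree: bool  # did the two models give different answers?


def confidence(pair):
    """Distance of the similarity score from 0.5; larger means more confident."""
    return abs(pair.similarity - 0.5)


def joint_accuracy_curve(pairs, order):
    """Joint accuracy after each review step when pairs are reviewed in `order`.

    Before review a pair counts as correct if the machine was right; once
    reviewed, the human answer replaces the machine answer (an assumption).
    """
    correct = sum(p.machine_correct for p in pairs)
    curve = [correct / len(pairs)]
    for index in order:
        pair = pairs[index]
        correct += int(pair.human_correct) - int(pair.machine_correct)
        curve.append(correct / len(pairs))
    return curve


def by_low_confidence(pairs):
    """Review order: least confident machine outputs first."""
    return sorted(range(len(pairs)), key=lambda i: confidence(pairs[i]))


def by_disagreement(pairs):
    """Review order: pairs where the two models disagree first (stable sort)."""
    return sorted(range(len(pairs)), key=lambda i: not pairs[i].models_disagree)


# Toy data, invented for illustration only.
pairs = [
    Pair(0.95, True, True, False),
    Pair(0.55, False, True, True),
    Pair(0.10, True, True, False),
    Pair(0.48, False, True, True),
    Pair(0.90, True, False, False),
]

print(joint_accuracy_curve(pairs, by_low_confidence(pairs)))
print(joint_accuracy_curve(pairs, by_disagreement(pairs)))
print(joint_accuracy_curve(pairs, range(len(pairs))))  # unprioritized baseline
```
        </preformat>
        <p>Comparing the prioritized curves with the unprioritized baseline mirrors the comparison between the lines of Figure 4. Note that in this sketch a review can also lower joint accuracy when the human is wrong on a pair the machine classified correctly.</p>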
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusions</title>
      <p>This study investigates human-machine complementarities in facial recognition tasks by comparing
human and model errors. The analysis highlights how human strengths can complement machine
weaknesses, informing strategies for more effective joint systems.</p>
      <p>Three main findings emerge: (1) Consistency of Errors – Both humans and models showed
consistent behaviors, enabling error characterization. Humans excel at negative pairs, with only 4.8%
errors on 126 machine false positives, indicating areas where human oversight can effectively correct
model failures; (2) Shared Difficulties – Humans are more likely to err when both models (IR-50
and LightCNN) make the same mistake, especially on false negatives. This correlation reveals shared
difficulties and helps prioritize cases requiring human review; and (3) Similarity Judgments – Human
and model similarity scores diverge significantly except when both err on positive pairs. Models exhibit
sharp score contrasts between correct and incorrect predictions, which can signal likely errors and guide
targeted human intervention. Based on these observations, we propose a hybrid human-in-the-loop
approach that boosts system accuracy by 3% with minimal annotation cost (10% of total), correcting
148 misclassified pairs. This demonstrates how strategic manual review can yield significant gains.</p>
      <p>This study also reflects on real-world implications. Since most machine errors corrected by humans
are false positives, domains with high sensitivity to such errors (such as security or law enforcement
scenarios) must implement robust human oversight. Regulatory guidance, such as the European AI Act
[14], underscores the importance of human involvement in high-risk AI systems.</p>
      <p>
        Our limitations include the use of symmetric costs and reliance on two pre-trained models trained on
the same dataset (MS-Celeb-1M [
        <xref ref-type="bibr" rid="ref11">34</xref>
        ]). While findings are scenario-specific, they provide a framework
for error analysis and oversight strategies in other contexts. The results reinforce that current facial
recognition systems are not ready to replace human judgment and that ethical, legal, and societal
concerns necessitate permanent human oversight. Future work could examine how biases such as
algorithm aversion or overconfidence affect collaborative decision-making, especially when machine
and human resolutions diverge.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Writefull to check grammar and
spelling and to improve writing style. After using these tools, the authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
      <p>[10] M. T. Dzindolet, L. G. Pierce, H. P. Beck, L. A. Dawe, The perceived utility of human and automated aids in a visual detection task, Human factors 44 (2002) 79–94.</p>
      <p>[11] T. Reich, A. Kaju, S. J. Maglio, How to overcome algorithm aversion: Learning from mistakes, Journal of Consumer Psychology 33 (2023) 285–302.</p>
      <p>[12] B. J. Dietvorst, J. P. Simmons, C. Massey, Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them, Management science 64 (2018) 1155–1170.</p>
      <p>[13] Q. Roy, F. Zhang, D. Vogel, Automation accuracy is good, but high controllability may be better, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–8.</p>
      <p>[14] European Union, EU AI Act, 2024. URL: https://artificialintelligenceact.eu/the-act/.</p>
      <p>[15] P. Negri, I. Hupont, E. Gomez, A framework for assessing proportionate intervention with face recognition systems in real-life scenarios, arXiv preprint arXiv:2402.05731 (2024).</p>
      <p>[16] A. Papenmeier, D. Kern, D. Hienert, Y. Kammerer, C. Seifert, How accurate does it feel?–human perception of different types of classification mistakes, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–13.</p>
      <p>[17] I. Hupont, S. Tolan, H. Gunes, E. Gómez, The landscape of facial processing applications in the context of the european ai act and the development of trustworthy systems, Scientific Reports 12 (2022) 10688.</p>
      <p>[18] K. Kyriakou, J. Otterbacher, In humans, we trust: Multidisciplinary perspectives on the requirements for human oversight in algorithmic processes, Discover Artificial Intelligence 3 (2023) 44.</p>
      <p>[19] R. Koulu, Proceduralizing control and discretion: Human oversight in artificial intelligence policy, Maastricht Journal of European and Comparative Law 27 (2020) 720–735.</p>
      <p>[20] J. Han, Z. Zhang, M. Schmitt, M. Pantic, B. Schuller, From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty, in: Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 890–897.</p>
      <p>[21] J. T. Andrews, P. Joniak, A. Xiang, A view from somewhere: Human-centric face representations, arXiv preprint arXiv:2303.17176 (2023).</p>
      <p>[22] T. Makino, S. Jastrzebski, W. Oleszkiewicz, C. Chacko, R. Ehrenpreis, N. Samreen, C. Chhor, E. Kim, J. Lee, K. Pysarenko, et al., Differences between human and machine perception in medical diagnosis, Scientific reports 12 (2022) 6877.</p>
      <p>[23] M. Huber, P. Terhörst, F. Kirchbuchner, N. Damer, A. Kuijper, Stating comparison score uncertainty and verification decision confidence towards transparent face recognition, arXiv preprint arXiv:2210.10354 (2022).</p>
      <p>[24] I. Hupont, C. Fernández, Demogpairs: Quantifying the impact of demographic imbalance in deep face recognition, in: 2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–7.</p>
      <p>[25] C. A. Meissner, J. C. Brigham, Thirty years of investigating the own-race bias in memory for faces: A meta-analytic review, Psychology, Public Policy, and Law 7 (2001) 3.</p>
      <p>[26] C. Feliciano, Shades of race: How phenotype and observer characteristics shape racial classification, American Behavioral Scientist 60 (2016) 390–419.</p>
      <p>[27] P. J. Phillips, F. Jiang, A. Narvekar, J. Ayyad, A. J. O’Toole, An other-race effect for face recognition algorithms, ACM Transactions on Applied Perception (TAP) 8 (2011) 1–11.</p>
      <p>[28] P. Hemmer, L. Thede, M. Vössing, J. Jakubik, N. Kühl, Learning to defer with limited expert predictions, Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023) 6002–6011.</p>
      <p>[29] H. Mozannar, H. Lang, D. Wei, P. Sattigeri, S. Das, D. Sontag, Who should predict? exact algorithms for learning to defer to humans, in: International conference on artificial intelligence and statistics, PMLR, 2023, pp. 10520–10545.</p>
      <p>[30] V. Keswani, M. Lease, K. Kenthapadi, Towards unbiased and accurate deferral to multiple experts, in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021, pp. 154–165.</p>
      <p>[31] C. Punzi, R. Pellungrini, M. Setzu, F. Giannotti, D. Pedreschi, Ai, meet human: Learning paradigms for hybrid decision making systems, arXiv preprint arXiv:2402.06287 (2024).</p>
      <p>[32] M. H. Lee, D. P. Siewiorek, A. Smailagic, A. Bernardino, S. Bermúdez i Badia, Towards efficient</p>
    </sec>
    <sec id="sec-8">
      <title>A. Online Resources</title>
      <p>Code and data are available in our GitHub repository.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Matias</surname>
          </string-name>
          ,
          <article-title>Humans and algorithms work together-so study them together</article-title>
          ,
          <source>Nature</source>
          <volume>617</volume>
          (
          <year>2023</year>
          )
          <fpage>248</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jussupow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Benbasat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heinzl</surname>
          </string-name>
          ,
          <article-title>Why are we averse towards algorithms? a comprehensive literature review on algorithm aversion</article-title>
          , in: F. Rowe (Ed.),
          <source>28th European Conference on Information Systems - Liberty, Equality, and Fraternity in a Digitizing World, ECIS 2020, Marrakech, Morocco, June 15-17, 2020: Proceedings</source>
          , AISeL, Atlanta, GA,
          <year>2020</year>
          , p. RP 168.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Dietvorst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Simmons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Massey</surname>
          </string-name>
          ,
          <article-title>Algorithm aversion: people erroneously avoid algorithms after seeing them err</article-title>
          .,
          <source>Journal of Experimental Psychology: General</source>
          <volume>144</volume>
          (
          <year>2015</year>
          )
          <fpage>114</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lyell</surname>
          </string-name>
          , E. Coiera,
          <article-title>Automation bias and verification complexity: a systematic review</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>24</volume>
          (
          <year>2017</year>
          )
          <fpage>423</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Hahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. J. O'Toole</surname>
          </string-name>
          ,
          <article-title>Perceptual expertise in forensic facial image comparison</article-title>
          ,
          <source>Proceedings of the Royal Society B: Biological Sciences</source>
          <volume>282</volume>
          (
          <year>2015</year>
          )
          <fpage>20151292</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Natu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. J. O'Toole</surname>
          </string-name>
          ,
          <article-title>Unaware person recognition from the body when face identification fails</article-title>
          ,
          <source>Psychological Science</source>
          <volume>24</volume>
          (
          <year>2013</year>
          )
          <fpage>2235</fpage>
          -
          <lpage>2243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>O'Toole</surname>
          </string-name>
          ,
          <article-title>Comparison of human and computer performance across face recognition experiments</article-title>
          ,
          <source>Image and Vision Computing</source>
          <volume>32</volume>
          (
          <year>2014</year>
          )
          <fpage>74</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Korteling</surname>
          </string-name>
          , G. C. van de Boer-Visschedijk,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Blankendaal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Boonekamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Eikelboom</surname>
          </string-name>
          ,
          <article-title>Human-versus artificial intelligence</article-title>
          ,
          <source>Frontiers in artificial intelligence</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <fpage>622364</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vaccaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almaatouq</surname>
          </string-name>
          , T. Malone,
          <article-title>When combinations of humans and ai are useful: A systematic review and meta-analysis</article-title>
          ,
          <source>Nature Human Behaviour</source>
          <volume>8</volume>
          (
          <year>2024</year>
          )
          <fpage>2293</fpage>
          -
          <lpage>2303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32] M. H. Lee, D. P. Siewiorek, A. Smailagic, A. Bernardino, S. Bermúdez i Badia,
          <article-title>Towards efficient annotations for a human-ai collaborative, clinical decision support system: A case study on physical stroke rehabilitation assessment</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Hahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Noyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Cavazos</surname>
          </string-name>
          , G. Jeckeln,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sankaranarayanan</surname>
          </string-name>
          , et al.,
          <article-title>Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>115</volume>
          (
          <year>2018</year>
          )
          <fpage>6171</fpage>
          -
          <lpage>6176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Ms-celeb-1m: A dataset and benchmark for large-scale face recognition</article-title>
          , in:
          <source>Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          ,
          <article-title>Arcface: Additive angular margin loss for deep face recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4690</fpage>
          -
          <lpage>4699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>A light cnn for deep face representation with noisy labels</article-title>
          ,
          <source>IEEE Transactions on Information Forensics and Security</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <fpage>2884</fpage>
          -
          <lpage>2896</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Face.evolve: A high-performance face recognition library</article-title>
          ,
          <source>arXiv preprint arXiv:2107.08621</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mattar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Learned-Miller</surname>
          </string-name>
          ,
          <article-title>Labeled faces in the wild: A database for studying face recognition in unconstrained environments</article-title>
          , in:
          <source>Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Moschoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Papaioannou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sagonas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          , I. Kotsia,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          ,
          <article-title>Agedb: the first manually collected, in-the-wild age database</article-title>
          ,
          <source>in: proceedings of the IEEE conference on computer vision and pattern recognition workshops</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Vggface2: A dataset for recognising faces across pose and age</article-title>
          ,
          <source>in: 2018 13th IEEE international conference on automatic face &amp; gesture recognition (FG 2018)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schonger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wickens</surname>
          </string-name>
          ,
          <article-title>otree-an open-source platform for laboratory, online, and field experiments</article-title>
          ,
          <source>Journal of Behavioral and Experimental Finance</source>
          <volume>9</volume>
          (
          <year>2016</year>
          )
          <fpage>88</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Landis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <article-title>The measurement of observer agreement for categorical data</article-title>
          ,
          <source>Biometrics</source>
          (
          <year>1977</year>
          )
          <fpage>159</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>