<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How Data Quality Determines AI Fairness: The Case of Automated Interviewing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lou Therese Brandner</string-name>
          <email>lou.brandner@uni-tuebingen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Mahlow</string-name>
          <email>p.mahlow@uni-koeln.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Wilken</string-name>
          <email>awilken2@uni-koeln.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annika Wölke</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hazar Harmouch</string-name>
          <email>hazar.harmouch@hpi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon David Hirsbrunner</string-name>
          <email>simon.hirsbrunner@uni-tuebingen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso Plattner Institute, University of Potsdam</institution>
          ,
          <addr-line>Prof.-Dr.-Helmert Straße 2-3, 14482 Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Centre for Ethics in the Sciences and Humanities, University of Tübingen</institution>
          ,
          <addr-line>Wilhelmstraße 19, 72074 Tübingen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Cologne</institution>
          ,
          <addr-line>Albertus-Magnus-Platz, 50923 Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Artificial Intelligence (AI) supported job interviewing, i.e., one-sided automated applicant interviews assessed by AI-based systems, presents itself as a new mainstream solution in hiring, promising to be more efficient and effective than human recruiters, but also fairer and more objective. Selecting this technology as an illustrative case, we focus on a central element in the development of fair AI: the issue of (training) data quality (DQ). Training ML models with unsuitable, biased, or erroneous data is a major source of bias in AI-based applications and therefore of potentially discriminatory, unfair outcomes. However, DQ is often cast aside as one of many technical factors contributing to the overall quality of ML-based systems; this approach runs the risk of understating its crucial relevance. We select salient issues along the technology lifecycle to take a detailed look at the interrelation of fairness and DQ, illustrating how both fairness and DQ must be understood in a broad sense, taking into account normative considerations beyond technical aspects, to facilitate desirable outcomes such as the promotion of diversity, the prevention of discrimination, and the protection of workers' rights.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated hiring</kwd>
        <kwd>data quality</kwd>
        <kwd>AI ethics</kwd>
        <kwd>EU law</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Artificial Intelligence (AI) supported job interviewing presents itself as a new mainstream solution
in the human resources (HR) industry. Combining machine learning (ML) techniques such as
speech-to-text and text clustering, these products and services promise to be more efficient and effective than
human recruiters, but also fairer and more objective. Fairness concerns linked to the development and
use of this technology have been discussed from ethical and legal perspectives, considering the
increasing use of such systems in the European Union (EU) and simultaneously emerging regulatory
frameworks, most famously the EU AI Act draft [9]. We argue that these discussions often neglect a
central element when negotiating the development of ethical AI: the issue of (training) data quality
(DQ). With a case study approach, we analyze automated job interviews to center DQ and show how
various DQ dimensions directly impact – and often determine – the fairness of AI-based systems. This
short article summarizes research based on the analysis of public information available about three
companies offering automated interviewing – HireVue, Knockri, and myInterview – as examples to
assess the technology and its implications, particularly in view of its use within the EU (HireVue, for
instance, is already in use by companies in Germany, France, and the Netherlands; see
https://enlyft.com/tech/products/hirevue, accessed 08/05/2023).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background: Fairness in Automated Job Interviews</title>
      <p>
        Automated job interviews, i.e. one-sided structured behavioral interviews with applicants that are
recorded in front of a computer camera without human recruiters present, are increasingly being
addressed in scientific debates [13][14][17][18][22]. Given the ubiquity of job interviews in conjunction
with the time, cost, and effort that go into them, automating this process has disruptive potential for the
HR industry [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The technology largely relies on language recognition in the form of speech-to-text
transcription and Natural Language Processing (NLP), the computational analysis of language.
Candidates are evaluated and ultimately scored or ranked for human recruiters who can base their
further decisions on this automated assessment. A central claim of (semi-)automated hiring is that AI
systems are less prone to bias than human recruiters, whose decisions might be unconsciously
influenced by stereotypes or even consciously swayed by discriminatory behavior [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Given that these
technical systems depend on human input and on data stemming from often discriminatory social
contexts, AI applications can, however, reproduce existing biases and automate them [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][8][21]. This
bears the risk of perpetuating existing job market discrimination against women, racial minorities, and other
marginalized communities [13][16][20]. The concept of discriminatory bias increasingly dominates
discussions about real-world implications of AI involving data about human beings [19], with the term
AI fairness or algorithmic fairness describing statistical methods intended to mitigate or eliminate these
biases. Given the highly context-specific nature of fairness metrics and their dependence on different
notions of fairness and societal values, historical and structural power dynamics must be taken into
account when negotiating AI fairness [15].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Centering Data Quality</title>
      <p>
        Well-established computer science DQ criteria typically include dimensions of accuracy,
completeness, redundancy, readability, accessibility, consistency, usefulness, and trust [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Wang and
Strong [23] additionally differentiate between intrinsic, contextual, representational, and accessibility
aspects of datasets. Given the growing importance of ML applications, recent work tailors DQ
dimensions to the application field of ML [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Training ML models with unsuitable, biased, or erroneous
training data can lead to unfair, inaccurate, and unsafe models and therefore low-quality downstream
applications. The EU Agency for Fundamental Rights [10] defines training DQ as a central pillar of
preventing discrimination and other unintended damages caused by AI technology. However, DQ is
often viewed as one of many technical factors contributing to the overall quality of systems; this
approach runs the risk of understating its essential relevance. This is all the more crucial given that the
AI Act proposes several requirements for the fairness of training data for high-risk AI systems – a
category an interviewing system would fall under (Art. 6 para. 2 AIA in conjunction with Annex III
no. 4 lit. a AIA): relevance, representativity, freedom of errors, and completeness (Art. 10 para. 3
sen. 1 AIA). These criteria seem to be informed by the computer science literature on DQ, which
means they will likely become even more important for datasets that are meant to be used within the EU.
      </p>
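      <p>To make these criteria more tangible, the following minimal Python sketch illustrates how
completeness and freedom of errors could be operationalized as simple pre-training checks on a
hypothetical tabular training dataset; the field names, allowed values, file name, and thresholds are
illustrative assumptions rather than details of any analyzed platform.</p>
      <preformat>
# Minimal sketch: operationalizing two AI Act-inspired DQ criteria
# (completeness, freedom of errors) on a hypothetical applicant training table.
# Field names, allowed values, file name, and thresholds are assumptions.
import csv

REQUIRED_FIELDS = ["transcript", "job_family", "language", "age_band"]
ALLOWED_LANGUAGES = {"de", "en", "fr", "nl"}   # hypothetical target markets

def completeness(rows):
    """Share of records with a non-empty value for each required field."""
    total = len(rows)
    return {field: sum(1 for r in rows if r.get(field, "").strip()) / total
            for field in REQUIRED_FIELDS}

def error_rate(rows):
    """Share of records whose language code lies outside the documented target set."""
    total = len(rows)
    invalid = sum(1 for r in rows if r.get("language", "") not in ALLOWED_LANGUAGES)
    return invalid / total

with open("training_sample.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    rows = list(csv.DictReader(f))

for field, share in completeness(rows).items():
    status = "OK" if share >= 0.99 else "REVIEW"   # 99% threshold is an assumption
    print(f"completeness[{field}]: {share:.1%} [{status}]")
print(f"error rate (invalid language codes): {error_rate(rows):.1%}")
      </preformat>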
      <p>Considering the various DQ dimensions found in the literature, we analyzed information provided
by the aforementioned automated interviewing platforms in the context of potential fairness concerns.
Given the limited scope of this short contribution, in the following sections we will summarize a
selection of salient issues to demonstrate the interrelation of DQ and fairness.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Dataset Design and Data Collection</title>
      <p>Especially for datasets used to train models meant to assess populations with diverse socio-cultural
origins and backgrounds, as is the case for hiring purposes, ensuring DQ begins before the actual data
collection with the specification and documentation of use cases and target groups, for example in the
form of datasheets [11]. Training data used for application processes should represent the anticipated
language variations of the target population(s), i.e. future job applicants in the respective geographical
and professional areas the system is intended to operate in. Furthermore, if training data for hiring
purposes only includes certain types of industries, job roles or experience levels, the resulting system
might not be able to evaluate deviating backgrounds accurately. Involving HR experts or social
scientists who can research and specify target populations can thus be beneficial to avoid insufficient
coverage and ensure representative, balanced, and diverse training data.</p>
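      <p>As an illustration of how such representativity and coverage considerations could be checked during
dataset design, the following sketch compares the demographic composition of a hypothetical training set
against an assumed reference distribution of future applicants; the group labels, reference shares, and
the five-percentage-point tolerance are our own illustrative assumptions, not values taken from the
analyzed platforms.</p>
      <preformat>
# Sketch: flag demographic groups that are under-represented in training data
# relative to an assumed reference distribution of the target applicant population.
from collections import Counter

def group_shares(records, key):
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical reference distribution, e.g. derived from labour-market statistics.
reference = {"18-29": 0.35, "30-39": 0.30, "40-49": 0.20, "50+": 0.15}

# Hypothetical training sample, heavily skewed toward younger applicants.
training = ([{"age_band": "18-29"}] * 600 + [{"age_band": "30-39"}] * 320
            + [{"age_band": "40-49"}] * 60 + [{"age_band": "50+"}] * 20)

observed = group_shares(training, "age_band")
for group, target in reference.items():
    actual = observed.get(group, 0.0)
    if target - actual >= 0.05:   # more than 5 percentage points below target
        print(f"Under-covered group: {group} ({actual:.0%} observed vs. {target:.0%} expected)")
      </preformat>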
      <p>
        But for all analyzed companies, we observe a lack of transparency regarding the original training
data; crucial DQ questions – such as the composition of datasets, when they were collected and by
whom – thus remain unanswered. Available information about test datasets used to evaluate the
performance of HireVue’s system shows a lack of people over the age of forty and makes no mention
of including people with disabilities; this potentially indicates insufficient coverage of these groups,
which can lead to discriminatory biases and therefore an unfair treatment of individuals and groups
during the assessment [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Labeling and Annotation</title>
      <p>Quality flaws of labels and annotations, connected to the DQ dimensions of accuracy and objectivity,
are another major source of discriminatory bias when human actors introduce their own – conscious or
unconscious – social and cultural biases into the data [19], for example due to differing perceived
ground truths. Automated interviewing algorithms are typically trained with pre-labeled interview data,
but the analyzed companies only provide vague information about who labels these data; for instance, the
myInterview website reads "our machine learning models learn from our team of diverse psychologists
across the world" (https://www.myinterview.com/product-intelligence/, accessed 10/05/2023) without
providing concrete information regarding specific training or demographic and geographic
backgrounds.</p>
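      <p>One way to surface such disagreements between labelers – and thus differing perceived ground
truths – is to have a sample of interview answers labeled twice and to compute an agreement statistic.
The following sketch computes Cohen's kappa on fabricated example labels; the label values and the
interpretation threshold are illustrative assumptions.</p>
      <preformat>
# Sketch: inter-annotator agreement (Cohen's kappa) on doubly-labeled answers.
# Low agreement can indicate unclear guidelines or diverging perceived ground truths.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Fabricated example labels from two annotators for the same eight answers.
annotator_1 = ["hire", "hire", "reject", "hire", "reject", "reject", "hire", "reject"]
annotator_2 = ["hire", "reject", "reject", "hire", "reject", "hire", "hire", "reject"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa >= 0.6:   # commonly cited rule of thumb, used here as an assumption
    print("Agreement acceptable for this sample.")
else:
    print("Low agreement: review labeling guidelines and annotator training.")
      </preformat>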
      <p>Connected to this issue, the working conditions, employment status or salaries of labelers – a “blind
spot” [12] of AI ethics which has recently started receiving more attention [24] – are not disclosed by
the analyzed companies either, but they can significantly impact the overall labor quality and thus result in
incorrect or inconsistent labels, which can introduce biases if they remain uncorrected and lower the
overall reliability of the model. These factors also influence fairness in a broader sense: can a
hypothetical bias-free and fairness-calibrated system truly be “fair” if those who labored for its training
data were treated unethically? Can it be considered societally acceptable to use training data whose
labeling has been remunerated with wages far below the EU minimum threshold?</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Post-System Deployment</title>
      <p>DQ concerns do not end with system deployment. Since biases often become apparent when systems
are in active use, models require monitoring and, if biases are discovered, additional, potentially more
diverse, balanced and representative training data. Systems can also continue learning during their use:
Knockri, myInterview, and HireVue all utilize candidate data to improve their services, which likely
includes using it to further train the models (“[…] we use certain personal data, such as User Profile
Information, Video Data […] for our own purposes, such as improving and enhancing our platform and
Services.”, https://www.myinterview.com/privacy/#privacy, accessed 10/05/2023).
This on the one hand means they can potentially improve their accuracy for groups which were not
sufficiently represented during training, but also that new biases can be introduced during operation,
emphasizing the need for continuous monitoring and evaluation.</p>
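      <p>A simple form of such continuous monitoring could track, per demographic group, the rate at which
candidates are passed on to human recruiters and flag large disparities, for instance using the
"four-fifths" rule of thumb. The following sketch does this on fabricated decision logs; the group names,
numbers, and the 0.8 ratio threshold are illustrative assumptions rather than values from the analyzed
systems.</p>
      <preformat>
# Sketch: post-deployment monitoring of per-group pass rates with a
# four-fifths-rule style disparity flag, run on fabricated decision logs.
from collections import defaultdict

def pass_rates(decisions):
    totals, passed = defaultdict(int), defaultdict(int)
    for group, advanced in decisions:
        totals[group] += 1
        passed[group] += int(advanced)
    return {group: passed[group] / totals[group] for group in totals}

# Fabricated decision log: (demographic group, advanced to human review?)
decisions = ([("group_a", True)] * 80 + [("group_a", False)] * 20
             + [("group_b", True)] * 55 + [("group_b", False)] * 45)

rates = pass_rates(decisions)
best = max(rates.values())
for group, rate in sorted(rates.items()):
    ratio = rate / best
    status = "OK" if ratio >= 0.8 else "DISPARITY: investigate"
    print(f"{group}: pass rate {rate:.0%}, ratio to highest group {ratio:.2f} [{status}]")
      </preformat>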
      <p>
Particularly where the technology is used on target groups for which models cannot be sufficiently trained,
rigorous post-deployment monitoring to detect limitations, combined with accommodations for those who
cannot be assessed fairly, might represent best practice. An example in the context of automated
interviewing is people with disabilities – especially those with disabilities impacting speech and thus
their responses in an automated interview setting – an inherently heterogeneous group which AI-based
systems currently cannot reliably recognize and categorize [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For instance, HireVue works around this
issue by offering disabled candidates certain accommodations such as longer response times for
interview questions and the option to directly contact the business which conducts the interview.
      </p>
    </sec>
    <sec id="sec-7">
      <title>4. Discussion</title>
      <p>Our analysis underlines the intrinsic interconnection between DQ dimensions and fairness in
AI-based applications in a hiring context. DQ remains relevant in all technology development stages, from
dataset design to post-deployment and often directly determines how bias-free and therefore fair
systems can be. It is furthermore of crucial relevance for the legal compliance of automated job
interviewing within the EU under the General Data Protection Regulation and anti-discrimination
directives.</p>
      <p>Transparency can be viewed as a prerequisite and bottleneck in this regard: given the lack of
transparency regarding, for example, dataset composition and collection, the claimed efforts of
automated hiring companies to address ethical and legal fairness concerns can only rarely be externally
validated. This starts with simple issues such as a lack of information or contradictory statements on
websites, for example regarding whether body language is analyzed (myInterview), as well as missing
resources such as reports that cannot be downloaded (Knockri). While HireVue publishes the most
extensive information about its systems compared to its competitors, it still falls short of
detailing important DQ considerations that could considerably increase corporate accountability and
public trust in the technology. Taking the described issue with modeling disability in ML systems as an
example, comprehensively explaining why there is a lack of people with disabilities in training data and
why systems cannot (yet) reliably and fairly evaluate these individuals could increase public
understanding of ML-based technology and acceptance of alternative workarounds.</p>
      <p>DQ and fairness must both be understood in a broad sense, taking into account normative
considerations beyond technical aspects, to facilitate desirable outcomes such as the promotion of
diversity, the prevention of discrimination, as well as the protection of workers’ rights. Based on the
insights of the study, we propose further research into scenarios beyond the discussed case of
recruitment and into data synthetization, which provides both considerable opportunities and risks
regarding the DQ and fairness of datasets.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Acknowledgements</title>
      <p>This contribution is based on research in the KITQAR project, funded by the Policy Lab Digital,
Work &amp; Society of the German Federal Ministry of Labor and Social Affairs (BMAS).</p>
    </sec>
    <sec id="sec-9">
      <title>6. References</title>
      <p>[8] V. Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor, St. Martin’s Press, New York, NY, 2018.</p>
      <p>[9] European Commission, Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on Artificial Intelligence (Artificial Intelligence Act) and amending certain union legislative acts, COM(2021) 206 final.</p>
      <p>[10] European Union Agency for Fundamental Rights, Data Quality and Artificial Intelligence: Mitigating Bias and Error to Protect Fundamental Rights, Publications Office, Luxembourg, 2019. https://data.europa.eu/doi/10.2811/546219.</p>
      <p>[11] T. Gebru, J. Morgenstern, B. Vecchione, J. Wortman Vaughan, H. Wallach, H. Daumé III, K. Crawford, Datasheets for Datasets, arXiv:1803.09010v8 [cs.DB], 2021. https://doi.org/10.48550/arXiv.1803.09010.</p>
      <p>[12] T. Hagendorff, Blind spots in AI ethics, AI Ethics 2 (2022) 851–867. https://doi.org/10.1007/s43681-021-00122-8.</p>
      <p>[13] K. Houser, Can AI solve the diversity problem in the tech industry? Mitigating noise and bias in employment decision-making, 22 Stan. Tech. L. Rev. 290 (2019). URL: https://papers.ssrn.com/abstract=3344751.</p>
      <p>[14] A. L. Hunkenschroer, C. Luetge, Ethics of AI-enabled recruiting and selection: A review and research agenda, Journal of Business Ethics 178 (2022) 977–1007. https://doi.org/10.1007/s10551-022-05049-6.</p>
      <p>[15] J. John-Mathews, D. Cardon, C. Balagué, From reality to world: A critical perspective on AI fairness, Journal of Business Ethics 178 (2022) 945–959. https://doi.org/10.1007/s10551-022-05055-8.</p>
      <p>[16] S. Johnson, D. R. Hekman, E. T. Chan, If there’s only one woman in your candidate pool, there’s statistically no chance she’ll be hired, Harvard Business Review, 2016. URL: https://hbr.org/2016/04/if-theres-only-one-woman-in-your-candidate-pool-theres-statistically-no-chance-shell-be-hired.</p>
      <p>[17] A. Köchling, S. Riazy, M. C. Wehner, K. Simbeck, Highly accurate, but still discriminatory: A fairness evaluation of algorithmic video analysis in the recruitment context, Business &amp; Information Systems Engineering 63 (2021) 39–54. https://doi.org/10.1007/s12599-020-00673-w.</p>
      <p>[18] L. Li, T. Lassiter, J. Oh, M. K. Lee, Algorithmic hiring in practice: Recruiter and HR professional's perspectives on AI use in hiring, in: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES '21, ACM Press, New York, NY, 2021, pp. 166–176. https://doi.org/10.1145/3461702.3462531.</p>
      <p>[19] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Comput. Surv. 54 (2022). https://doi.org/10.1145/3457607.</p>
      <p>[20] C. Schumann, J. S. Foster, N. Mattei, J. P. Dickerson, We need fairness and explainability in algorithmic hiring, in: Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’20, IFAAMAS, Auckland, New Zealand, 2020, pp. 1716–1720. https://dl.acm.org/doi/abs/10.5555/3398761.3398960.</p>
      <p>[21] A. Selbst, Disparate impact in big data policing, Georgia Law Review 52 (2017) 109–195. http://dx.doi.org/10.2139/ssrn.2819182.</p>
      <p>[22] N. Tippins, F. Oswald, S. M. McPhail, Scientific, legal, and ethical concerns about AI-based personnel selection tools: A call to action, Personnel Assessment and Decisions 7 (2021). https://doi.org/10.25035/pad.2021.02.001.</p>
      <p>[23] R. Wang, D. Strong, Beyond accuracy: What data quality means to data consumers, Journal of Management Information Systems 12 (1996) 5–33. https://doi.org/10.1080/07421222.1996.11518099.</p>
      <p>[24] A. Williams, M. Miceli, T. Gebru, The exploited labor behind artificial intelligence: Supporting transnational worker organizing should be at the center of the fight for “ethical AI”, Noēma 13 (2022). URL: https://www.noemamag.com/the-exploited-labor-behind-artificial-intelligence/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          , Data Quality Dimensions, Springer, Cham,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barocas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <source>Fairness and Machine Learning: Limitations and Opportunities</source>
          ,
          <year>2019</year>
          . URL: https://fairmlbook.org/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Brenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Ortner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fay</surname>
          </string-name>
          ,
          <article-title>Asynchronous video interviewing as a new technology in personnel selection: The applicant's point of view, Frontiers in Psychology 7 (</article-title>
          <year>2016</year>
          ). URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2016.00863.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Budach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feuerpfeil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ihde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nathansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Noack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Patzlaff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Harmouch</surname>
          </string-name>
          ,
          <article-title>The effects of data quality on machine learning performance</article-title>
          ,
          <source>arXiv:2207.14529v4 [cs.DB]</source>
          ,
          <year>2022</year>
          . https://doi.org/10.48550/arxiv.2207.14529.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buyl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cociancig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Frattone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Roekens</surname>
          </string-name>
          ,
          <article-title>Tackling Algorithmic Disability Discrimination in the Hiring Process: An Ethical, Legal and Technical Analysis</article-title>
          ,
          <source>in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <source>FAccT '22</source>
          , ACM Press, New York, NY,
          <year>2022</year>
          , pp.
          <fpage>1071</fpage>
          -
          <lpage>1082</lpage>
          . https://doi.org/10.1145/3531146.3533169.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chamorro-Premuzic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Winsborough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Sherman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <article-title>New talent signals: Shiny new objects or a brave new world?</article-title>
          ,
          <source>Industrial and Organizational Psychology</source>
          <volume>9</volume>
          , (
          <year>2016</year>
          )
          <fpage>621</fpage>
          -
          <lpage>40</lpage>
          . https://doi.org/10.1017/iop.2016.6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chamorro-Premuzic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          ,
          <article-title>Should companies use AI to assess job candidates? Harvard Business Review</article-title>
          ,
          <year>2019</year>
          . URL: https://hbr.org/2019/05/should-companies-use-ai-to-assess-job-candidates.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>