Responsible Data Management for Human Resources

Dimitrios Vogiatzis∗
The American College of Greece, Deree & NCSR "Demokritos"
Athens, Greece
dimitrv@acg.edu

Olivia Kyriakidou∗
The American College of Greece, Deree
Athens, Greece
OKyriakidou@acg.edu

ABSTRACT
Human resources (HR) departments rely increasingly on recommender systems (RS) for most of their processes, such as recruiting, selecting and developing their employees. However, RS often discriminate unfairly, based on biases in the data that may perpetuate and even amplify existing biases in the workplace. An important part of an HR department is its data ecosystem, comprising raw and derived data, related to potentially different stakeholders, and subject to laws and regulations. In this work we propose the characteristics of a data ecosystem that facilitates data transparency through traceability, as a way of detecting potential biases in the data.

CCS CONCEPTS
• Information systems → Data management systems; • Social and professional topics → Employment issues; User characteristics.

KEYWORDS
Human Resources, Fairness, Bias, Data Ecosystems

1 INTRODUCTION
Recommender systems (RS) are widely used by human resources (HR) departments to facilitate their business processes, both to save time and to minimise human intervention in an effort to achieve fairness. RS can be applied in the recruitment, hiring and promotion of employees, among other tasks. For instance, they can be used to match CVs against job posts, to rank CVs, which determines the order of interviews, or to compare CVs against the past CVs of employees that are deemed successful. A RS can also be used for segmenting job applications into categories, detecting and recording long-term trends, etc.

Eventually, a RS in HR is an information system that is directly related to the professional life of people, and as such it should be subject to ethical and legal regulations, in addition to technical ones such as prediction accuracy. ACM 1 and IEEE 2 have issued codes of ethics that refer to the need for fairness in information systems. In particular, section 1.4 of the ACM code of ethics is entitled Be fair and take action not to discriminate, and section II of the IEEE code of ethics states: To treat all persons fairly and with respect, to not engage in harassment or discrimination, and to avoid injuring others.

In reality, RS are often fraught with elements of discrimination and unfairness. The unfairness may stem from biases in the data that misrepresent the actual population, and subsequent analytic algorithms often amplify these data biases. Lack of fairness can have legal consequences, especially in employment, as it might violate anti-discrimination laws. It might also have financial consequences, as usage of such systems might drop. See [10] for a recent tutorial on the origin and forms of fairness in RS.

A data ecosystem is a network of data in potentially many forms (e.g. unstructured, structured), together with accompanying rules that permit their acquisition, storage, maintenance, and retrieval. A data ecosystem is of potential interest to many stakeholders, including data providers and data users that try to create value out of the ecosystem. A data ecosystem includes metadata, as well as legal, organizational or ethical regulations. Moreover, ecosystems evolve as their constituent components change. Finally, derived data also form part of an ecosystem; for instance, clusters, predictions etc. are derived data produced by statistical or machine learning methods. See [15] for an overview of data ecosystems.

Our contribution is to focus on the data component of a RS and to examine how a data ecosystem would facilitate data transparency through data traceability, so that potential biases are made explicit or at least easier to track and detect.
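By data traceability we mean, informally, being able to answer where each datum came from and by which operation it was produced. As a minimal illustration of the kind of record that makes this possible, consider the following sketch; the record layout, field names and identifiers are our own hypothetical choices, not part of the proposed framework:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProvenanceRecord:
    """Where a datum came from and how it was obtained."""
    datum_id: str
    source: str    # e.g. "cv_upload" for raw data, "derived" otherwise
    operator: str  # e.g. "manual_submission", "ner_extraction"
    inputs: List[str] = field(default_factory=list)  # ids of input data

def lineage(records: Dict[str, ProvenanceRecord], datum_id: str) -> List[str]:
    """Walk back through the inputs of a (possibly derived) datum."""
    chain, stack = [], [datum_id]
    while stack:
        current = stack.pop()
        chain.append(current)
        stack.extend(records[current].inputs)
    return chain

# A raw CV, and a skill entity derived from it by an extraction operator.
records = {
    "cv_42": ProvenanceRecord("cv_42", "cv_upload", "manual_submission"),
    "skill_7": ProvenanceRecord("skill_7", "derived", "ner_extraction", ["cv_42"]),
}
print(lineage(records, "skill_7"))  # ['skill_7', 'cv_42']
```

With such records in place, a suspicious derived datum (e.g. a cluster assignment) can be traced back to the raw items and operators that produced it, which is exactly what makes bias easier to track.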
Moreover, they can be used by prospective employees that seek employment.

Although RS seem to remove human intervention by automating HR processes, often but not exclusively through advanced machine learning algorithms, segments of the population can still be discriminated against. The data upon which the analysis is based may contain biases towards age groups, gender, ethnic origin, etc. Such biases stem from specific data items, specific data features, data distributions, and data sampling methods. Bias in the data can also be very subtle and difficult to detect, as it may appear in derived data stemming from an analytics process.

Our approach is based on similar work for data transparency in the biomedical domain [11].

∗ Both authors contributed equally to this research.
RecSys '21, 2021.
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://www.acm.org/code-of-ethics
2 https://www.ieee.org/about/corporate/governance/p7-8.html

2 RELATED WORK
Responsible data management has been discussed in the context of automated decision systems (ADS), which are systems that make decisions about humans that might affect their socio-economic life [17], [18]. The authors refer to the ethical challenges faced in all phases of a data science pipeline and to the need for fair, transparent and responsible data management.

Our current work focuses on the data representation part of an ADS, which is essentially an RS. In particular, we refer to the features of a data ecosystem, how it can be supported with semantic web technologies, and their relevance to the issues of fairness in HR.

A very similar approach to the one we propose in the current work has been developed for a biomedical system in the context of the EU-funded project BigMedilytics 3 for a lung-cancer pilot application. The pilot integrates structured and unstructured information, open and sensitive data, in a knowledge graph. This constitutes an example of a data ecosystem.

3 MOTIVATING EXAMPLES
Next we mention some examples that indicate the forms of bias in raw data, in data associations, as well as in derived data produced by machine learning algorithms. The examples refer to HR department cases.

Biased Data based on human behavioral biases. HR algorithmic recommendations may sustain existing inequities when they are trained on data that do not include specific groups of individuals [9]. For example, many selection algorithms try to identify the criteria that characterize the ideal employee and use them for the selection of newcomers. For this task they utilize performance data to identify the best performing employees within the organization, and then identify the traits that distinguish them. However, there is the danger that, if the performance data favor men due to existing biases within the organization [16], the selection algorithm might include gender as a preferred characteristic of the ideal candidate and prefer men rather than women applicants. In this sense, existing biases could be reified by limiting the number of certain, possibly underrepresented, groups who are alerted, selected, and hired for specific job openings [6]. Moreover, HR recommendation systems utilized for the automated screening of candidates' CVs against certain preferred selection criteria may also generate biased results when they are trained on data from past hiring decisions that are based on individual, organizational and structural biases against certain underrepresented groups of employees [14], [4]. The use of natural language processing (NLP) tools in chatbots that evaluate candidates' competencies and fit to the job and the organization may also preserve existing societal inequities when the tools are trained on biased data and exclude certain categories of candidates. The association of African-American names with negative feelings, and of female names with the household and with non-technical jobs, has already been documented in the literature [19].

Proxies. Recommendation systems can replicate biases in other, subtler ways, especially through the use of proxies. Certain hiring criteria could serve as proxies for categorizing individuals in specific groups and drive discrimination. For example, the use of gaps in employment as a hiring criterion could discriminate against women applicants, as women disproportionately leave the workplace to provide child or elderly care [1]. Moreover, job matching platforms and job recommendation systems use proxies for "relevance" that reproduce biases. Such systems, for example, could show to women specific jobs at specific hierarchical levels (e.g., senior or junior positions in management) according to their own search history, but also according to the search history of women similar to them. Accordingly, they might end up with fewer recommendations for senior positions if they themselves and others similar to them tend to look for lower-level jobs. Proxies are also included in HR data that train employee selection recommendation systems in order to offer the most appropriate remuneration package to prospective employees. Such suggestions, however, may reinforce gender or racial pay gaps, especially when they reflect the existence of strong proxies that signal certain gender representations (e.g., male employees as breadwinners) and status inequalities.

Facial analysis that is used in virtual interviews may also create disparate impact on specific sub-groups of employees across gender and racial lines. In [5] it was shown that the faces of women with darker skin cannot be reliably recognized by facial analysis systems; nor can the emotions of people with disabilities and of people in different cultural contexts [3]. Finally, in employee selection, most recruiters use a number of candidate characteristics as proxies of culture fit [7], defined as the degree to which the values of the individual match those of the organization. However, there is the danger that these proxies will become hard rules, ignoring their subjective character, and in this way exclude certain individuals who are thought a priori not to "fit" the organizational culture.

Segregation of individuals. Biases could also persist when algorithms segregate employees into groups, drawing inferences about individuals from their group memberships. Selection recommendation systems, for instance, may erroneously attribute to people with disabilities [20] certain characteristics based on their group membership, without properly assessing the candidates, and consequently offer them lower status job positions. Moreover, categorizing individuals into certain gender groups could unfairly marginalize non-binary and transgender employees, while their classification into certain race groups could signify status inequalities [12].

Human computer interaction. Most HR recommendation systems run on platforms that require employees' and candidates' active involvement, which is determined merely by the rules set by the platform that controls all processes [13]. For instance, job candidates do not have any control over how their application will be presented to possible employers, and they have to provide all the information required by the platform if they want to be considered for future job opportunities. Moreover, employee selection recommendation systems tend to present numerical rankings of candidates to employers, generating the perception that there are actual substantial differences between the candidates for a certain position, while in reality the differences might be minimal [2].

4 REQUIREMENTS FOR A DATA ECOSYSTEM IN HR
Next, based on the previously mentioned examples, we sketch the requirements that would be necessary for a data ecosystem in HR.

Data management requirements: The data ecosystem should allow data sharing for structured (e.g. CSV files) and unstructured data (e.g. text). The data should be accessible and retrievable by all stakeholders. Also, the data have to be of high quality; for instance, data items that have missing values, or data items that are very old, could be rejected. As an example we could mention a CV that does not contain any information about education or past employment. The data management requirements fall into the following categories: DM1: Data management of multiple document types should be supported. DM2: Quality of data items should be supported at all levels of the data pipeline, e.g. for the raw data, but also for derived data.

Organizational requirements: The data should be stored, accessed and processed according to the organization's rules and regulations. The organizational requirements fall into the following categories: O1: Data governance should be enforced by the organization. Thus the data acquisition process, the data storage and retention, access rights, and data obsolescence are items related to data governance. The HR department may have business rules that stipulate the recruitment policy, and what the requested documents will be. O2: Data sovereignty, which specifies who owns the original data and the derived data, and for what purpose. This will increase the trust in the system; for instance, it will be clearer how a submitted CV will be handled.

Legal & Ethical requirements: The data management should be in accordance with the requirements of the European GDPR. 4 Moreover, the data management should address bias. For instance, the execution of algorithms should be independent of sensitive attributes (like ethnicity, age, gender). In addition, the data should be owned and used for the intended purposes; for instance, CVs of job applicants should not be used to generate business value by selling them without the applicants' consent. Finally, traceability is an important aspect of the data, which essentially allows one to know where the data were obtained from, and how they were obtained.

The above can be summarized into the following ethical requirements: E1: Data protection & ownership, which specifies the existence of ownership for each stakeholder. E2: Sensitive attributes, which clearly states the sensitive attributes, with the foresight that they should not be used by prediction algorithms. Typically, they represent age, gender, ethnic background etc. The sensitive attributes are typically associated with the provisions of GDPR. E3: Discrimination attributes, which may lead to discrimination in non-obvious ways. For instance, the name of a job applicant might inadvertently facilitate discrimination, as it may reveal ethnic origin. Moreover, some derived attributes fall in this category, for instance employment gaps.

3 https://www.bigmedilytics.eu/
4 https://gdpr-info.eu/

5 DESIGN OF A DATA ECOSYSTEM
We present in detail the concept of a data ecosystem that will serve as the infrastructure for an HR department. A data ecosystem (DE) can be defined as a 4-tuple: DE = ⟨Data sets, Data operators, Meta-Data, Mappings⟩ [8].

Data sets: the ecosystem is composed of potentially multiple data sets. Data sets can be comprised of structured or unstructured information; also, they may have different formats, e.g., CSV, JSON or tabular relations, and can be managed using different management systems.

Data operators: the set of operators that can be executed against the data sets. For instance, anonymization, data quality checks, and recency checks can be considered data operators.

Meta-Data: provide the semantics of the data stored in the data sets of the data ecosystem. It comprises: (1) A domain ontology, which provides a unified view of the concepts, relationships, and constraints of the domain of knowledge. It associates formal elements from the domain ontology to concepts. For instance, a specific job post and a specific applicant can be part of the concepts in a domain ontology. (2) Properties, which enable the definition of data quality, provenance, and data access regulations of the data in the ecosystem. For instance, last updated and other non-domain properties (quality etc.). (3) Descriptions of the main characteristics of a data set. No specific formal language or vocabulary is required; in fact, a data set could be described using natural language. For instance: Data set D is a collection of CVs and cover letters.

Mappings, expressing correspondences among the different components of a data ecosystem. The mappings are as follows: Mappings between ontologies: they represent associations between the concepts in the different ontologies that compose the domain ontology of the ecosystem. For instance, there can be a mapping between the personnel ontology and the candidate employees ontology. Mappings between data sets: they represent relations among the data sets of the ecosystem and the domain ontology.

6 DATA ECOSYSTEM IN HR
The role of a data ecosystem (DE) is to provide an explicit description of the data and of the applicable operations on them, through meta-data and mapping rules. Next we provide some examples that refer to the usage of the elements of the DE as presented in the previous section. The description of a data ecosystem that we give next does not cover all the activities of HR; rather, it addresses some essential parts that refer to the recruitment and hiring of employees. Thus we will assume a scenario where there are applicants' CVs and job posts. The DE for this example is depicted in Figure 1.

Figure 1: The HR Data Ecosystem

One of the data sources are the CVs of the applicants. Typically, they contain textual information, possibly with some keywords (e.g. education, past employment) which can be helpful as annotations. Thus a CV represents a piece of unstructured or partially structured information.

Structured information has to be extracted from the CV. Typically, named entity recognition (NER) and relation extraction (RE) will have to be performed, resulting in triplets comprising two entities and a relation. The named entities (NE) in CVs can be things like: skills, past employment, educational achievements and demographic data. The relations connect the NE to the person in question, while being labeled with time annotations. This will form the job applicant's graph. The extracted entities will then be annotated with metadata derived from a domain ontology, commonly described in OWL. For example, NAICS 5 can be used to characterise the entities that refer to the industry of employment.

The NER and RE processes can on occasion be of low precision. The Meta-Data properties can represent the quality of the NLP process as a numerical score per NE or per relation.

To the best of our knowledge, there is no single ontology that is complete enough to annotate a CV for the requirements of an HR department.

The distinction between sensitive and non-sensitive attributes can be represented, for instance, as classes by expanding one of the existing ontologies. Thus it will be clearer which attributes should be used by subsequent machine learning algorithms that perform job recommendations.

Subsequently, a similar distinction can be made between soft and hard skills in job posts.
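A splitting operator of this kind admits a very simple sketch. The keyword list and function names below are our own hypothetical choices, intended only to illustrate the idea of a data operator acting on extracted skill entities; in a real deployment the split would more plausibly be driven by the skill classes of an ontology than by keyword matching:

```python
# Hypothetical data operator: split extracted skill entities into
# "hard" (technical) and "soft" (interpersonal) skills.
SOFT_SKILL_KEYWORDS = {"communication", "teamwork", "leadership", "negotiation"}

def split_skills(skills):
    """Return (hard_skills, soft_skills) for a list of skill strings."""
    soft = [s for s in skills if s.lower() in SOFT_SKILL_KEYWORDS]
    hard = [s for s in skills if s.lower() not in SOFT_SKILL_KEYWORDS]
    return hard, soft

hard, soft = split_skills(["Python", "Teamwork", "SQL", "Negotiation"])
print(hard, soft)  # ['Python', 'SQL'] ['Teamwork', 'Negotiation']
```

Once the two classes exist as annotations, cross-tabulating them against seniority or against a protected attribute is a straightforward query over the knowledge graph.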
Naturally, this will require a Data operator to split the skills into two classes. This will facilitate an association of soft and hard skills with the level of seniority of the position, and also with the applicants' gender. This can reveal subtle forms of bias.

Apart from NAICS, for instance, it may be necessary to also use resumeRDF 6 and the Human Resources Ontology. 7 This results in the need to also have mappings between the ontologies for the common concepts (i.e. classes) and for the common object properties. The mapping rules can be stated in RML. 8

The second major source of information is the job post, which is typically in textual form, possibly split into sections, each with a meaningful keyword (like company culture, required skills etc.). This usually constitutes a partially structured piece of information. As in the case of CVs, information has to be extracted in the form of triplets, resulting in the job posts graph. However, it may not be necessary to extract structure from all parts of the document; for instance, a company's culture could fall under the Meta-Data Descriptions of the data set.

Finally, the merging of the two graphs into the integrated knowledge graph can also be achieved with mapping rules. Furthermore, business regulations and regulations derived from ethical data management can be set as constraints on the integrated knowledge graph, and be expressed in the SHACL 9 language.

Typically, the DE can be accessed via SPARQL endpoints. Normally, the end user has access via web services accessible through a dashboard. The web services can also allow for different user roles, thus implementing data access control.

7 CONCLUSIONS
In the current work we proposed a framework for responsible data management for a human resources department. The framework is based on the concept of a DE, which comprises data, meta-data and data operators. It can be implemented with semantic technologies (RDF Schema, OWL, RML rules, etc.).
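One of the constraints discussed above, that prediction algorithms must not consume sensitive attributes (requirement E2), can be sketched as a programmatic check. The check below is only a stand-in for a declarative SHACL shape, and the attribute names are hypothetical:

```python
# Hypothetical guard enforcing requirement E2: the feature set that feeds
# a recommender must not include sensitive attributes.
SENSITIVE_ATTRIBUTES = {"name", "gender", "age", "ethnic_origin"}

def validate_features(features):
    """Raise ValueError if any sensitive attribute is used as a feature."""
    used = SENSITIVE_ATTRIBUTES.intersection(features)
    if used:
        raise ValueError(f"sensitive attributes used as features: {sorted(used)}")
    return True

validate_features(["education", "skills", "past_employment"])  # passes
```

In the ecosystem itself the same rule would be attached to the integrated knowledge graph as a constraint, so that violations are caught at the data layer rather than in application code.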
The mapping rules, as well as the ontology selection and its possible expansion, have to be designed in cooperation between a knowledge engineer and a representative of the HR department.

The issue of detecting possible bias in the data can be assisted through data transparency, especially at the stage of NE annotation. Thus, CV attributes can be split into sensitive and non-sensitive ones, the former comprising name, gender, ethnic origin, age etc., whereas the latter would comprise entities like education or skills.

The implementation of a data ecosystem will require a substantial investment, both from a knowledge engineering and from an HR perspective. The benefits can be important, especially in the field of data transparency. Moreover, a DE can also facilitate the deployment of explainable machine learning algorithms.

ACKNOWLEDGMENTS
The authors would like to acknowledge the support of Deree - The American College of Greece in the current article.

5 North American Industry Classification System (NAICS), https://www.census.gov/naics/
6 http://rdfs.org/resume-rdf/
7 https://github.com/motapinto/cv-ontology/blob/main/cv-ontology.owl
8 https://rml.io/specs/rml/
9 https://www.w3.org/TR/shacl/

REFERENCES
[1] Ifeoma Ajunwa. 2020. The Paradox of Automation as Anti-Bias Intervention. 41 Cardozo L. Rev. (2020).
[2] Ifeoma Ajunwa and Daniel Greene. 2019. Platforms at work: Automated hiring platforms and other new intermediaries in the organization of work. In Work and Labor in the Digital Age. Emerald Publishing Limited.
[3] Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M Martinez, and Seth D Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest 20, 1 (2019), 1–68.
[4] Miranda Bogen and Aaron Rieke. 2018. Help wanted: An examination of hiring algorithms, equity, and bias. (2018).
[5] Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. PMLR, 77–91.
[6] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced neighborhoods for multi-sided fairness in recommendation. In Conference on Fairness, Accountability and Transparency. PMLR, 202–214.
[7] Hege H Bye, Henrik Herrebrøden, Gunnhild J Hjetland, Guro Ø Røyset, and Linda L Westby. 2014. Stereotypes of Norwegian social groups. Scandinavian Journal of Psychology 55, 5 (2014), 469–476.
[8] Cinzia Capiello, Avigdor Gal, Matthias Jarke, and Jakob Rehof. 2020. Data ecosystems: sovereign data exchange among organizations (Dagstuhl Seminar 19391). In Dagstuhl Reports, Vol. 9. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[9] Kate Crawford. 2013. The hidden biases in big data. Harvard Business Review 1, 4 (2013).
[10] Michael D Ekstrand, Robin Burke, and Fernando Diaz. 2019. Fairness and discrimination in recommendation and retrieval. In Proceedings of the 13th ACM Conference on Recommender Systems. 576–577.
[11] Sandra Geisler, Maria-Esther Vidal, Cinzia Capiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, et al. 2021. Knowledge-driven Data Ecosystems Towards Data Transparency. arXiv preprint arXiv:2105.09312 (2021).
[12] Os Keyes. 2018. The misgendering machines: Trans/HCI implications of automatic gender recognition. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 1–22.
[13] Karen Levy and Solon Barocas. 2017. Designing against discrimination in online markets. Berkeley Technology Law Journal 32, 3 (2017), 1183–1238.
[14] Kirsten Martin. 2019. Ethical implications and accountability of algorithms. Journal of Business Ethics 160, 4 (2019), 835–850.
[15] Marcelo Iury S Oliveira and Bernadette Farias Lóscio. 2018. What is a data ecosystem?. In Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age. 1–9.
[16] Lauren A Rivera. 2015. Go with your gut: Emotion and evaluation in job interviews. American Journal of Sociology 120, 5 (2015), 1339–1389.
[17] Julia Stoyanovich, Bill Howe, Serge Abiteboul, Gerome Miklau, Arnaud Sahuguet, and Gerhard Weikum. 2017. Fides: Towards a platform for responsible data science. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. 1–6.
[18] Julia Stoyanovich, Bill Howe, and HV Jagadish. 2020. Responsible data management. Proceedings of the VLDB Endowment 13, 12 (2020), 3474–3488.
[19] Adam Sutton, Thomas Lansdall-Welfare, and Nello Cristianini. 2018. Biased embeddings from wild data: Measuring, understanding and removing. In International Symposium on Intelligent Data Analysis. Springer, 328–339.
[20] Shari Trewin. 2018. AI fairness for people with disabilities: Point of view. arXiv preprint arXiv:1811.10670 (2018).