=Paper=
{{Paper
|id=Vol-2967/paper7
|storemode=property
|title=Responsible Data Management for Human Resources
|pdfUrl=https://ceur-ws.org/Vol-2967/paper_7.pdf
|volume=Vol-2967
|authors=Dimitrios Vogiatzis,Olivia Kyriakidou
|dblpUrl=https://dblp.org/rec/conf/hr-recsys/VogiatzisK21
}}
==Responsible Data Management for Human Resources==
Dimitrios Vogiatzis∗
The American College of Greece, Deree & NCSR "Demokritos"
Athens, Greece
dimitrv@acg.edu

Olivia Kyriakidou
The American College of Greece, Deree
Athens, Greece
OKyriakidou@acg.edu
∗ Both authors contributed equally to this research.
RecSys '21, 2021,
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Human resources (HR) departments rely increasingly on recommender systems (RS) for most of their processes, such as recruiting, selecting and developing their employees. However, RS often discriminate unfairly based on biases in the data, which may perpetuate and reinforce existing biases in the workplace. An important part of an HR department is the data ecosystem, comprising raw and derived data, related to potentially different stakeholders, while being subject to laws and regulations. In this work we propose the characteristics of a data ecosystem that will facilitate data transparency through traceability, as a way of detecting potential biases in the data.

CCS CONCEPTS
• Information systems → Data management systems; • Social and professional topics → Employment issues; User characteristics.

KEYWORDS
Human Resources, Fairness, Bias, Data Ecosystems

1 INTRODUCTION
Recommender systems (RS) are widely used by Human Resources (HR) departments to facilitate their business processes, both to optimize time and to minimize human intervention, in an effort to achieve fairness. RS can be applied in the recruitment, hiring and promotion of employees, among other processes. For instance, they can be used to match CVs against job posts, to rank CVs, which determines the order of interviews, or to compare CVs against the past CVs of employees that are deemed successful. An RS can also be used for segmenting job applications into categories, detecting and recording long-term trends, etc. Moreover, RS can be used by prospective employees who seek employment.

Although RS seem to remove human intervention by automating HR processes, often but not exclusively through advanced machine learning algorithms, segments of the population can still be discriminated against. The data upon which the analysis is based may contain biases with respect to age groups, gender, ethnic origin, etc. The biases stem from specific data items, specific data features, data distributions and data sampling methods. Bias in data can also be very subtle and difficult to detect, as it may appear in derived data stemming from an analytics process.

Ultimately, an RS in HR is an information system that is directly related to the professional life of people, and as such it should be subject to ethical and legal regulations, apart from technical ones, like prediction accuracy. The ACM 1 and the IEEE 2 have issued codes of ethics that refer to the need for fairness in information systems. In particular, section 1.4 of the ACM code of ethics is entitled Be fair and take action not to discriminate, and section II of the IEEE code of ethics states: To treat all persons fairly and with respect, to not engage in harassment or discrimination, and to avoid injuring others.

1 https://www.acm.org/code-of-ethics
2 https://www.ieee.org/about/corporate/governance/p7-8.html

In reality, RS are often fraught with elements of discrimination and unfairness. The unfairness may stem from biases in the data that misrepresent the actual population, and subsequent analytic algorithms often amplify the data biases. Lack of fairness can have legal consequences, especially in employment, as it might violate anti-discrimination laws. It might also have financial consequences, as usage of such systems might drop. See also [10] for a recent tutorial on the origin and form of fairness in RS.

A data ecosystem is a network of data in potentially many forms (e.g. unstructured, structured), together with accompanying rules that govern their acquisition, storage, maintenance, and retrieval. The data ecosystem is of potential interest to many stakeholders, including data providers and data users that try to create value out of the ecosystem. A data ecosystem includes metadata, as well as legal, organizational or ethical regulations. Moreover, ecosystems evolve as their constituent components change. Finally, derived data also form part of the ecosystem; for instance, clusters and predictions are examples of derived data produced by statistical or machine learning methods. See [15] for an overview of data ecosystems.

Our contribution is to focus on the data component of an RS and to examine how a data ecosystem would facilitate data transparency through data traceability, so that potential biases are made explicit or at least easier to track and detect. Our approach is based on similar work for data transparency in the biomedical domain [11].

2 RELATED WORK
Responsible data management has been discussed in the context of automated decision systems (ADS), which are systems that make decisions about humans that might affect their socio-economic life [17], [18]. The authors refer to the ethical challenges faced in all phases of a data science pipeline and the need for fair, transparent and responsible data management.

Our current work focuses on the data representation part of an ADS, which is essentially an RS. In particular, we refer to the features of a data ecosystem and how it can be supported with
semantic web technologies, and their relevance to the issues of fairness in HR.

A very similar approach to the one we propose in the current work has been developed for a biomedical system in the context of the EU funded project BigMedilytics 3 for a lung-cancer pilot application. The pilot integrates structured and unstructured information, open and sensitive data, in a knowledge graph. This constitutes an example of a data ecosystem.

3 https://www.bigmedilytics.eu/

3 MOTIVATING EXAMPLES
Next we mention some examples that indicate the form of bias in raw data, in data associations, as well as in derived data produced by machine learning algorithms. The examples refer to HR department cases.

Biased data based on human behavioral biases. HR algorithmic recommendations may sustain existing inequities when they are trained on data that do not include specific groups of individuals [9]. For example, many selection algorithms try to identify the criteria that characterize the ideal employee and use them for the selection of newcomers. For this task they utilize performance data that identify the best performing employees within the organization and then identify the traits that distinguish them. However, there is the danger that if the performance data favor men due to existing biases within the organization [16], then the selection algorithm might include gender as a preferred characteristic for the ideal candidate and prefer men over women applicants. In this sense, existing biases could be reified by limiting the number of certain, possibly underrepresented, groups who are alerted, selected, and hired for specific job openings [6]. Moreover, HR recommendation systems utilized for the automated screening of candidates' CVs against certain preferred selection criteria may also generate biased results when they are trained on data from past hiring decisions that are based on individual, organizational and structural biases against certain underrepresented groups of employees [14], [4]. The use of natural language processing (NLP) tools in chatbots that evaluate candidates' competencies and fit to the job and the organization may also preserve existing societal inequities when the tools are trained on biased data and exclude certain categories of candidates. The association of African-American names with negative feelings, and of female names with the household and non-technical jobs, has already been documented in the literature [19].

Proxies. Recommendation systems can replicate biases in other, subtler ways, especially through the use of proxies. Certain hiring criteria could serve as proxies for categorizing individuals into specific groups and drive discrimination. For example, the use of gaps in employment as a hiring criterion could discriminate against women applicants, as women disproportionately leave the workplace to provide child or elderly care [1]. Moreover, job matching platforms and job recommendation systems use proxies for "relevance" that reproduce biases. Such systems, for example, could show women specific jobs at specific hierarchical levels (e.g., senior or junior positions in management) according to their own search history but also according to the search history of women similar to them. Accordingly, they might end up with fewer recommendations for senior positions if they themselves and others like them tend to look for lower-level jobs [4]. Proxies are also included in the HR data that train employee selection recommendation systems in order to offer the most appropriate remuneration package to prospective employees. Such suggestions, however, may reinforce gender or racial pay gaps, especially when they reflect the existence of strong proxies that signal certain gender representations (e.g., male employees as breadwinners) and status inequalities.

Facial analysis that is used in virtual interviews may also create disparate impact on specific sub-groups of employees across gender and racial lines. In [5] it was shown that the faces of women with darker skin cannot be reliably recognized by facial analysis systems, nor can the emotions of people with disabilities and in different cultural contexts [3]. Finally, in employee selection, most recruiters use a number of candidate characteristics as proxies of culture fit [7], defined as the degree to which the values of the individual match those of the organization. However, there is the danger that these proxies will become hard rules, ignoring their subjective character, and in this way exclude certain individuals who are thought a priori not to "fit" the organizational culture.

Segregation of individuals. Biases could also persist when algorithms segregate employees into groups, drawing inferences about individuals from their group memberships. Selection recommendation systems, for instance, may erroneously attribute to people with disabilities [20] certain characteristics based on their group membership, without properly assessing the candidates, and consequently offer lower status job positions. Moreover, categorizing individuals into certain gender groups could unfairly marginalize non-binary and transgender employees, while their classification into certain race groups could signify status inequalities [12].

Human computer interaction. Most HR recommendation systems run on platforms that require employees' and candidates' active involvement, which is determined merely by the rules set by the platform that controls all processes [13]. For instance, job candidates do not have any control over how their application will be presented to possible employers, and they have to provide all the information required by the platform if they want to be considered for future job opportunities. Moreover, employee selection recommendation systems tend to present numerical rankings of candidates to employers, generating the perception that there are actual substantial differences between the candidates for a certain position, while in reality the differences might be minimal [2].

4 REQUIREMENTS FOR A DATA ECOSYSTEM IN HR
Next, based on the previously mentioned examples, we sketch the requirements that would be necessary for a data ecosystem in HR.

Data management requirements: The data ecosystem should allow data sharing for structured (e.g. CSV files) and unstructured data (e.g. text). The data should be accessible and retrievable by all stakeholders. Also, the data have to be of high quality. For instance, data items that have missing values, or data items that are very old, could be rejected. As an example we could mention a CV that does not contain any information about education or past employment.
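The quality and recency checks just described can be sketched as a simple data operator. This is a minimal, purely illustrative sketch: the CV record format, field names (`education`, `past_employment`, `last_updated`) and the recency threshold are all assumptions, not part of the proposed ecosystem.

```python
# Illustrative data-quality operator: reject CV records that lack essential
# fields (education, past employment) or that are too old.
# All field names and thresholds are hypothetical.

from datetime import date

MAX_AGE_YEARS = 5  # hypothetical recency threshold

def passes_quality_check(cv: dict, today: date) -> bool:
    """Return True if the CV has the essential fields and is recent enough."""
    if not cv.get("education") or not cv.get("past_employment"):
        return False  # missing essential information -> reject
    age_years = (today - cv["last_updated"]).days / 365.25
    return age_years <= MAX_AGE_YEARS

cvs = [
    {"education": ["BSc"], "past_employment": ["Acme"],
     "last_updated": date(2020, 6, 1)},
    {"education": [], "past_employment": ["Acme"],        # no education info
     "last_updated": date(2020, 6, 1)},
    {"education": ["MSc"], "past_employment": ["Initech"],
     "last_updated": date(2010, 1, 1)},                   # stale record
]

accepted = [cv for cv in cvs if passes_quality_check(cv, today=date(2021, 9, 1))]
print(len(accepted))  # only the first CV passes
```

Framing the check as a standalone operator, rather than embedding it in an analytics script, is what lets the ecosystem apply it uniformly to raw and derived data alike.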
The data management requirements fall into the following categories: DM1: Data management of multiple document types should be supported. DM2: Quality of data items should be supported at all levels of the data pipeline, e.g. for the raw data, but also for derived data.

Organizational requirements: The data should be stored, accessed and processed according to the organization's rules and regulations. The organizational requirements fall into the following categories: O1: Data governance should be enforced by the organization. Thus the data acquisition process, data storage and retention, access rights, and data obsolescence are items related to data governance. The HR department may have business rules that stipulate the recruitment policy and what the requested documents will be. O2: Data sovereignty, which specifies who owns the original data and the derived data, and for what purpose. This will increase the trust in the system. For instance, it will be clearer how a submitted CV will be handled.

Legal & Ethical requirements: The data management should be in accordance with the requirements of the European GDPR. 4 Moreover, the data management should address bias. For instance, the execution of algorithms should be independent of sensitive attributes (like ethnicity, age, gender). In addition, the data should be owned and used for the intended purposes. For instance, CVs of job applicants should not be used to generate business value by selling them without the applicants' consent. Finally, traceability is an important aspect of the data, which essentially allows one to know where the data were obtained from, and how they were obtained.

4 https://gdpr-info.eu/

The above can be summarized into the following ethical requirements: E1: Data protection & ownership, which specifies the extent of ownership for each stakeholder. E2: Sensitive attributes, which clearly states the sensitive attributes, with the foresight that they should not be used by prediction algorithms. Typically, they represent age, gender, ethnic background etc. The sensitive attributes are typically associated with the provisions of GDPR. E3: Discrimination attributes, which may lead to discrimination in non-obvious ways. For instance, the name of a job applicant might inadvertently facilitate discrimination as it may reveal ethnic origin. Moreover, some derived attributes fall in this category, for instance employment gaps.

5 DESIGN OF A DATA ECOSYSTEM
We present in detail the concept of a data ecosystem that will serve as the infrastructure for an HR department. A data ecosystem (DE) can be defined as a 4-tuple: DE = ⟨Data sets, Data operators, Meta-Data, Mappings⟩ [8].

Data sets: the ecosystem is composed of potentially multiple data sets. Data sets can comprise structured or unstructured information; also, they may have different formats, e.g., CSV, JSON or tabular relations, and can be managed using different management systems.

Data operators: the set of operators that can be executed against the data sets. For instance, anonymization, data quality checks and recency checks can be considered data operators.

Meta-Data: provide the semantics of the data stored in the data sets of the data ecosystem. It comprises:
(1) A Domain ontology, which provides a unified view of the concepts, relationships, and constraints of the domain of knowledge. It associates formal elements from the domain ontology to concepts. For instance, a specific job post and a specific applicant can be part of the concepts in a domain ontology.
(2) Properties, which enable the definition of data quality, provenance, and data access regulations of the data in the ecosystem. For instance, last updated and other non-domain properties (quality etc.).
(3) Descriptions of the main characteristics of a data set. No specific formal language or vocabulary is required; in fact, a data set could be described using natural language. For instance, Data set D is a collection of CVs and cover letters.

Mappings: express correspondences among the different components of a data ecosystem. The mappings are as follows:
Mappings between ontologies: they represent associations between the concepts in the different ontologies that compose the domain ontology of the ecosystem. For instance, there can be a mapping between the personnel ontology and the candidate employees ontology.
Mappings between data sets: they represent relations among data sets of the ecosystem and the domain ontology.

6 DATA ECOSYSTEM IN HR
The role of a data ecosystem (DE) is to provide an explicit description of the data and the applicable operations on them, through metadata and mapping rules. Next we provide some examples that refer to the usage of the elements of the DE as described in the previous section. The description of a data ecosystem that we provide next does not cover all the activities of HR; rather, it addresses some essential parts that refer to the recruitment and hiring of employees. Thus we will assume a scenario where there are applicants' CVs and job posts. The DE for this example is depicted in Figure 1.

Figure 1: The HR Data Ecosystem

One of the data sources is the CVs of the applicants. Typically, they contain textual information, possibly with some keywords (e.g. education, past employment) which can be helpful as annotations. Thus a CV represents a piece of unstructured or partially structured information.

Structured information then has to be extracted from the CV. Typically, named entity recognition (NER) and relation extraction (RE) will have to be performed, resulting in triplets comprising two entities and a relation. The named entities (NE) in CVs can be things like skills, past employment, educational achievements and demographic data. The relations connect the NE to the person in question, while being labeled with time annotations. This will form the job applicant's graph. The extracted entities will then be annotated with metadata derived from a domain ontology, commonly described in OWL.
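The applicant graph described above can be sketched as a small set of subject–predicate–object triples, each carrying the time annotation and a per-triple confidence score recording the quality of the NLP extraction step. All entity names, predicates and scores below are illustrative assumptions, not part of the proposed ecosystem.

```python
# Illustrative sketch of a job applicant's graph: NER/RE output as
# (subject, predicate, object) triples with a time annotation and a
# confidence score (the Meta-Data quality property per relation).
# All names, predicates and scores are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str      # the applicant
    predicate: str    # relation produced by relation extraction (RE)
    obj: str          # named entity produced by NER (skill, degree, employer)
    time: str         # time annotation of the relation
    confidence: float # quality of the NLP process, per triple

applicant_graph = [
    Triple("applicant:42", "hasSkill",  "skill:Python",  "2018-2023", 0.93),
    Triple("applicant:42", "hasDegree", "degree:BSc_CS", "2014-2018", 0.88),
    Triple("applicant:42", "workedAt",  "org:AcmeCorp",  "2019-2022", 0.71),
]

def low_confidence(graph, threshold=0.8):
    """Flag triples whose extraction quality is below a threshold,
    so they can be reviewed instead of silently feeding a recommender."""
    return [t for t in graph if t.confidence < threshold]

flagged = low_confidence(applicant_graph)
print([t.obj for t in flagged])  # the AcmeCorp employment triple is flagged
```

Keeping the confidence score on each triple, rather than on the data set as a whole, is what allows downstream operators to trace a biased or unreliable recommendation back to an individual low-quality extraction.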
For example, the NAICS 5 can be used to characterise the entities that refer to the industry of employment.

The NER and RE processes can on occasion be of low precision. The Meta-Data properties can represent the quality of the NLP process as a numerical score per NE or per relation.

To the best of our knowledge, there is no single ontology that is complete enough to annotate a CV for the requirements of an HR department. For instance, it may be necessary, apart from NAICS, to also use resumeRDF 6 and the Human Resources Ontology. 7 This results in the need to also have mappings between the ontologies for the common concepts (i.e. classes) and for the common object properties. The mapping rules can be stated in RML. 8

The second major source of information is the job post, which is typically in textual form, possibly split into sections, each with a meaningful keyword (like company culture, required skills etc.). This usually constitutes a partially structured piece of information. As in the case of CVs, information has to be extracted in the form of triplets, resulting in the job posts graph. However, it may not be necessary to extract structure from all parts of the document. For instance, a company's culture could fall under the Meta-Data Descriptions of a data set.

Finally, the merging of the two graphs into the integrated knowledge graph can also be achieved with mapping rules. The mapping rules, as well as the ontology selection, and possibly its expansion, have to be designed in cooperation between a knowledge engineer and a representative of the HR department.

The issue of detecting possible bias in the data can be assisted through data transparency, especially at the stage of NE annotation. Thus CV attributes can be split into sensitive and non-sensitive ones, the former comprising name, gender, ethnic origin, age etc., whereas the latter would comprise entities like education or skills. The distinction between attributes can be represented, for instance, as classes, by expanding one of the existing ontologies. Thus it will be clearer which attributes should be used by subsequent machine learning algorithms that perform job recommendations.

Subsequently, a similar distinction can be made between soft and hard skills in job posts. Naturally, this will require Data operators to split the skills into two classes. This will facilitate an association of soft and hard skills with the level of seniority of the position, and with the applicants' gender. This can reveal subtle forms of bias.

Finally, business regulations and regulations derived from ethical data management can be set as constraints on the integrated knowledge graph, and be expressed in the SHACL 9 language. Typically, the DE can be accessed via SPARQL endpoints. Normally, the end user has access via web services accessible through a dashboard. The web services can also allow for different user roles, thus implementing data access control.

7 CONCLUSIONS
In the current work we proposed a framework for responsible data management for a human resources department. The framework is based on the concept of a DE, which comprises data, meta-data and data operators. It can be implemented with semantic technologies (RDF Schema, OWL, RML rules, etc.). The implementation of a data ecosystem will require a substantial investment from both the knowledge engineering and the HR perspective. The benefits can be important, especially in the field of data transparency. Moreover, a DE can also facilitate the deployment of explainable machine learning algorithms.

ACKNOWLEDGMENTS
The authors would like to acknowledge the support of the Deree - The American College of Greece in the current article.

5 North American Industry Classification System (NAICS) https://www.census.gov/naics/
6 http://rdfs.org/resume-rdf/
7 https://github.com/motapinto/cv-ontology/blob/main/cv-ontology.owl
8 https://rml.io/specs/rml/
9 https://www.w3.org/TR/shacl/
REFERENCES
[1] Ifeoma Ajunwa. 2020. The Paradox of Automation as Anti-Bias Intervention. 41 Cardozo L. Rev.
[2] Ifeoma Ajunwa and Daniel Greene. 2019. Platforms at work: Automated hiring platforms and other new intermediaries in the organization of work. In Work and labor in the digital age. Emerald Publishing Limited.
[3] Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M Martinez, and Seth D Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest 20, 1 (2019), 1–68.
[4] Miranda Bogen and Aaron Rieke. 2018. Help wanted: An examination of hiring algorithms, equity, and bias. (2018).
[5] Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. PMLR, 77–91.
[6] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced neighborhoods for multi-sided fairness in recommendation. In Conference on Fairness, Accountability and Transparency. PMLR, 202–214.
[7] Hege H Bye, Henrik Herrebrøden, Gunnhild J Hjetland, Guro Ø Røyset, and Linda L Westby. 2014. Stereotypes of Norwegian social groups. Scandinavian Journal of Psychology 55, 5 (2014), 469–476.
[8] Cinzia Capiello, Avigdor Gal, Matthias Jarke, and Jakob Rehof. 2020. Data ecosystems: sovereign data exchange among organizations (Dagstuhl Seminar 19391). In Dagstuhl Reports, Vol. 9. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[9] Kate Crawford. 2013. The hidden biases in big data. Harvard Business Review 1, 4 (2013).
[10] Michael D Ekstrand, Robin Burke, and Fernando Diaz. 2019. Fairness and discrimination in recommendation and retrieval. In Proceedings of the 13th ACM Conference on Recommender Systems. 576–577.
[11] Sandra Geisler, Maria-Esther Vidal, Cinzia Capiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, et al. 2021. Knowledge-driven Data Ecosystems Towards Data Transparency. arXiv preprint arXiv:2105.09312 (2021).
[12] Os Keyes. 2018. The misgendering machines: Trans/HCI implications of automatic gender recognition. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 1–22.
[13] Karen Levy and Solon Barocas. 2017. Designing against discrimination in online markets. Berkeley Technology Law Journal 32, 3 (2017), 1183–1238.
[14] Kirsten Martin. 2019. Ethical implications and accountability of algorithms. Journal of Business Ethics 160, 4 (2019), 835–850.
[15] Marcelo Iury S Oliveira and Bernadette Farias Lóscio. 2018. What is a data ecosystem?. In Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age. 1–9.
[16] Lauren A Rivera. 2015. Go with your gut: Emotion and evaluation in job interviews. American Journal of Sociology 120, 5 (2015), 1339–1389.
[17] Julia Stoyanovich, Bill Howe, Serge Abiteboul, Gerome Miklau, Arnaud Sahuguet, and Gerhard Weikum. 2017. Fides: Towards a platform for responsible data science. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. 1–6.
[18] Julia Stoyanovich, Bill Howe, and HV Jagadish. 2020. Responsible data management. Proceedings of the VLDB Endowment 13, 12 (2020), 3474–3488.
[19] Adam Sutton, Thomas Lansdall-Welfare, and Nello Cristianini. 2018. Biased embeddings from wild data: Measuring, understanding and removing. In International Symposium on Intelligent Data Analysis. Springer, 328–339.
[20] Shari Trewin. 2018. AI fairness for people with disabilities: Point of view. arXiv preprint arXiv:1811.10670 (2018).