Responsible Data Management for Human Resources

Dimitrios Vogiatzis∗
The American College of Greece, Deree & NCSR "Demokritos"
Athens, Greece
dimitrv@acg.edu

Olivia Kyriakidou∗
The American College of Greece, Deree
Athens, Greece
OKyriakidou@acg.edu

ABSTRACT
Human resources (HR) departments rely increasingly on recommender systems (RS) for most of their processes, such as recruiting, selecting and developing their employees. However, RS often discriminate unfairly, based on biases in the data that may perpetuate and even amplify existing biases in the workplace. An important part of an HR department is its data ecosystem, comprising raw and derived data, related to potentially different stakeholders, and subject to laws and regulations. In this work we propose the characteristics of a data ecosystem that facilitates data transparency through traceability, as a way of detecting potential biases in the data.

CCS CONCEPTS
• Information systems → Data management systems; • Social and professional topics → Employment issues; User characteristics.

KEYWORDS
Human Resources, Fairness, Bias, Data Ecosystems

1 INTRODUCTION
Recommender systems (RS) are widely used by human resources (HR) departments to facilitate their business processes, both to save time and to minimise human intervention in an effort to achieve fairness. RS can be applied in the recruitment, hiring and promotion of employees, among other tasks. For instance, they can be used to match CVs against job posts, to rank CVs, which determines the order of interviews, or to compare CVs against the past CVs of employees that are deemed successful. A RS can also be used for segmenting job applications into categories, detecting and recording long-term trends, etc.

Eventually, a RS in HR is an information system that is directly related to the professional life of people, and as such it should be subject to ethical and legal regulations, in addition to technical ones such as prediction accuracy. ACM 1 and IEEE 2 have issued codes of ethics that refer to the need for fairness in information systems. In particular, section 1.4 of the ACM code of ethics is entitled Be fair and take action not to discriminate, and section II of the IEEE code of ethics states: To treat all persons fairly and with respect, to not engage in harassment or discrimination, and to avoid injuring others.

In reality, RS are often fraught with elements of discrimination and unfairness. The unfairness may stem from biases in the data that misrepresent the actual population, and subsequent analytic algorithms often amplify these data biases. Lack of fairness can have legal consequences, especially in employment, as it might violate anti-discrimination laws. It might also have financial consequences, as usage of such systems might drop. See [10] for a recent tutorial on the origin and forms of fairness in RS.

A data ecosystem is a network of data in potentially many forms (e.g. unstructured, structured), together with accompanying rules that permit their acquisition, storage, maintenance, and retrieval. A data ecosystem is of potential interest to many stakeholders, including data providers and data users that try to create value out of the ecosystem. A data ecosystem includes metadata, as well as legal, organizational or ethical regulations. Moreover, ecosystems evolve as their constituent components change. Finally, derived data also form part of an ecosystem; for instance, clusters, predictions etc. are derived data produced by statistical or machine learning methods. See [15] for an overview of data ecosystems.

Our contribution is to focus on the data component of a RS and to examine how a data ecosystem would facilitate data transparency through data traceability, so that potential biases are made explicit or at least easier to track and detect.
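By data traceability we mean, informally, being able to answer where each datum came from and by which operation it was produced. As a minimal illustration of the kind of record that makes this possible, consider the following sketch; the record layout, field names and identifiers are our own hypothetical choices, not part of the proposed framework:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProvenanceRecord:
    """Where a datum came from and how it was obtained."""
    datum_id: str
    source: str    # e.g. "cv_upload" for raw data, "derived" otherwise
    operator: str  # e.g. "manual_submission", "ner_extraction"
    inputs: List[str] = field(default_factory=list)  # ids of input data

def lineage(records: Dict[str, ProvenanceRecord], datum_id: str) -> List[str]:
    """Walk back through the inputs of a (possibly derived) datum."""
    chain, stack = [], [datum_id]
    while stack:
        current = stack.pop()
        chain.append(current)
        stack.extend(records[current].inputs)
    return chain

# A raw CV, and a skill entity derived from it by an extraction operator.
records = {
    "cv_42": ProvenanceRecord("cv_42", "cv_upload", "manual_submission"),
    "skill_7": ProvenanceRecord("skill_7", "derived", "ner_extraction", ["cv_42"]),
}
print(lineage(records, "skill_7"))  # ['skill_7', 'cv_42']
```

With such records in place, a suspicious derived datum (e.g. a cluster assignment) can be traced back to the raw items and operators that produced it, which is exactly what makes bias easier to track.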
Moreover, they can be used by prospective employees that seek employment.

Although RS seem to remove human intervention by automating HR processes, often but not exclusively through advanced machine learning algorithms, segments of the population can still be discriminated against. The data upon which the analysis is based may contain biases towards age groups, gender, ethnic origin, etc. Such biases stem from specific data items, specific data features, data distributions, and data sampling methods. Bias in the data can also be very subtle and difficult to detect, as it may appear in derived data stemming from an analytics process.

Our approach is based on similar work for data transparency in the biomedical domain [11].

∗ Both authors contributed equally to this research.
RecSys '21, 2021.
Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://www.acm.org/code-of-ethics
2 https://www.ieee.org/about/corporate/governance/p7-8.html

2 RELATED WORK
Responsible data management has been discussed in the context of automated decision systems (ADS), which are systems that make decisions about humans that might affect their socio-economic life [17], [18]. The authors refer to the ethical challenges faced in all phases of a data science pipeline and to the need for fair, transparent and responsible data management.

Our current work focuses on the data representation part of an ADS, which is essentially an RS. In particular, we refer to the features of a data ecosystem, how it can be supported with semantic web technologies, and their relevance to the issues of fairness in HR.

A very similar approach to the one we propose in the current work has been developed for a biomedical system in the context of the EU-funded project BigMedilytics 3 for a lung-cancer pilot application. The pilot integrates structured and unstructured information, open and sensitive data, in a knowledge graph. This constitutes an example of a data ecosystem.

3 MOTIVATING EXAMPLES
Next we mention some examples that indicate the forms of bias in raw data, in data associations, as well as in derived data produced by machine learning algorithms. The examples refer to HR department cases.

Biased Data based on human behavioral biases. HR algorithmic recommendations may sustain existing inequities when they are trained on data that do not include specific groups of individuals [9]. For example, many selection algorithms try to identify the criteria that characterize the ideal employee and use them for the selection of newcomers. For this task they utilize performance data to identify the best performing employees within the organization, and then identify the traits that distinguish them. However, there is the danger that, if the performance data favor men due to existing biases within the organization [16], the selection algorithm might include gender as a preferred characteristic of the ideal candidate and prefer men rather than women applicants. In this sense, existing biases could be reified by limiting the number of certain, possibly underrepresented, groups who are alerted, selected, and hired for specific job openings [6]. Moreover, HR recommendation systems utilized for the automated screening of candidates' CVs against certain preferred selection criteria may also generate biased results when they are trained on data from past hiring decisions that are based on individual, organizational and structural biases against certain underrepresented groups of employees [14], [4]. The use of natural language processing (NLP) tools in chatbots that evaluate candidates' competencies and fit to the job and the organization may also preserve existing societal inequities when the tools are trained on biased data and exclude certain categories of candidates. The association of African-American names with negative feelings, and of female names with the household and with non-technical jobs, has already been documented in the literature [19].

Proxies. Recommendation systems can replicate biases in other, subtler ways, especially through the use of proxies. Certain hiring criteria could serve as proxies for categorizing individuals in specific groups and drive discrimination. For example, the use of gaps in employment as a hiring criterion could discriminate against women applicants, as women disproportionately leave the workplace to provide child or elderly care [1]. Moreover, job matching platforms and job recommendation systems use proxies for "relevance" that reproduce biases. Such systems, for example, could show to women specific jobs at specific hierarchical levels (e.g., senior or junior positions in management) according to their own search history, but also according to the search history of women similar to them. Accordingly, they might end up with fewer recommendations for senior positions if they themselves and others similar to them tend to look for lower-level jobs. Proxies are also included in HR data that train employee selection recommendation systems in order to offer the most appropriate remuneration package to prospective employees. Such suggestions, however, may reinforce gender or racial pay gaps, especially when they reflect the existence of strong proxies that signal certain gender representations (e.g., male employees as breadwinners) and status inequalities.

Facial analysis that is used in virtual interviews may also create disparate impact on specific sub-groups of employees across gender and racial lines. In [5] it was shown that the faces of women with darker skin cannot be reliably recognized by facial analysis systems; nor can the emotions of people with disabilities and of people in different cultural contexts [3]. Finally, in employee selection, most recruiters use a number of candidate characteristics as proxies of culture fit [7], defined as the degree to which the values of the individual match those of the organization. However, there is the danger that these proxies will become hard rules, ignoring their subjective character, and in this way exclude certain individuals who are thought a priori not to "fit" the organizational culture.

Segregation of individuals. Biases could also persist when algorithms segregate employees into groups, drawing inferences about individuals from their group memberships. Selection recommendation systems, for instance, may erroneously attribute to people with disabilities [20] certain characteristics based on their group membership, without properly assessing the candidates, and consequently offer them lower status job positions. Moreover, categorizing individuals into certain gender groups could unfairly marginalize non-binary and transgender employees, while their classification into certain race groups could signify status inequalities [12].

Human computer interaction. Most HR recommendation systems run on platforms that require employees' and candidates' active involvement, which is determined merely by the rules set by the platform that controls all processes [13]. For instance, job candidates do not have any control over how their application will be presented to possible employers, and they have to provide all the information required by the platform if they want to be considered for future job opportunities. Moreover, employee selection recommendation systems tend to present numerical rankings of candidates to employers, generating the perception that there are actual substantial differences between the candidates for a certain position, while in reality the differences might be minimal [2].

4 REQUIREMENTS FOR A DATA ECOSYSTEM IN HR
Next, based on the previously mentioned examples, we sketch the requirements that would be necessary for a data ecosystem in HR.

Data management requirements: The data ecosystem should allow data sharing for structured (e.g. CSV files) and unstructured data (e.g. text). The data should be accessible and retrievable by all stakeholders. Also, the data have to be of high quality; for instance, data items that have missing values, or data items that are very old, could be rejected. As an example we could mention a CV that does not contain any information about education or past employment. The data management requirements fall into the following categories: DM1: Data management of multiple document types should be supported. DM2: Quality of data items should be supported at all levels of the data pipeline, e.g. for the raw data, but also for derived data.

Organizational requirements: The data should be stored, accessed and processed according to the organization's rules and regulations. The organizational requirements fall into the following categories: O1: Data governance should be enforced by the organization. Thus the data acquisition process, the data storage and retention, access rights, and data obsolescence are items related to data governance. The HR department may have business rules that stipulate the recruitment policy, and what the requested documents will be. O2: Data sovereignty, which specifies who owns the original data and the derived data, and for what purpose. This will increase the trust in the system; for instance, it will be clearer how a submitted CV will be handled.

Legal & Ethical requirements: The data management should be in accordance with the requirements of the European GDPR. 4 Moreover, the data management should address bias. For instance, the execution of algorithms should be independent of sensitive attributes (like ethnicity, age, gender). In addition, the data should be owned and used for the intended purposes; for instance, CVs of job applicants should not be used to generate business value by selling them without the applicants' consent. Finally, traceability is an important aspect of the data, which essentially allows one to know where the data were obtained from, and how they were obtained.

The above can be summarized into the following ethical requirements: E1: Data protection & ownership, which specifies the existence of ownership for each stakeholder. E2: Sensitive attributes, which clearly states the sensitive attributes, with the foresight that they should not be used by prediction algorithms. Typically, they represent age, gender, ethnic background etc. The sensitive attributes are typically associated with the provisions of GDPR. E3: Discrimination attributes, which may lead to discrimination in non-obvious ways. For instance, the name of a job applicant might inadvertently facilitate discrimination, as it may reveal ethnic origin. Moreover, some derived attributes fall in this category, for instance employment gaps.

3 https://www.bigmedilytics.eu/
4 https://gdpr-info.eu/

5 DESIGN OF A DATA ECOSYSTEM
We present in detail the concept of a data ecosystem that will serve as the infrastructure for an HR department. A data ecosystem (DE) can be defined as a 4-tuple: DE = ⟨Data sets, Data operators, Meta-Data, Mappings⟩ [8].

Data sets: the ecosystem is composed of potentially multiple data sets. Data sets can be comprised of structured or unstructured information; also, they may have different formats, e.g., CSV, JSON or tabular relations, and can be managed using different management systems.

Data operators: the set of operators that can be executed against the data sets. For instance, anonymization, data quality checks, and recency checks can be considered data operators.

Meta-Data: provide the semantics of the data stored in the data sets of the data ecosystem. It comprises: (1) A domain ontology, which provides a unified view of the concepts, relationships, and constraints of the domain of knowledge. It associates formal elements from the domain ontology to concepts. For instance, a specific job post and a specific applicant can be part of the concepts in a domain ontology. (2) Properties, which enable the definition of data quality, provenance, and data access regulations of the data in the ecosystem. For instance, last updated and other non-domain properties (quality etc.). (3) Descriptions of the main characteristics of a data set. No specific formal language or vocabulary is required; in fact, a data set could be described using natural language. For instance: Data set D is a collection of CVs and cover letters.

Mappings, expressing correspondences among the different components of a data ecosystem. The mappings are as follows: Mappings between ontologies: they represent associations between the concepts in the different ontologies that compose the domain ontology of the ecosystem. For instance, there can be a mapping between the personnel ontology and the candidate employees ontology. Mappings between data sets: they represent relations among the data sets of the ecosystem and the domain ontology.

6 DATA ECOSYSTEM IN HR
The role of a data ecosystem (DE) is to provide an explicit description of the data and of the applicable operations on them, through meta-data and mapping rules. Next we provide some examples that refer to the usage of the elements of the DE as presented in the previous section. The description of a data ecosystem that we give next does not cover all the activities of HR; rather, it addresses some essential parts that refer to the recruitment and hiring of employees. Thus we will assume a scenario where there are applicants' CVs and job posts. The DE for this example is depicted in Figure 1.

Figure 1: The HR Data Ecosystem

One of the data sources are the CVs of the applicants. Typically, they contain textual information, possibly with some keywords (e.g. education, past employment) which can be helpful as annotations. Thus a CV represents a piece of unstructured or partially structured information.

Structured information has to be extracted from the CV. Typically, named entity recognition (NER) and relation extraction (RE) will have to be performed, resulting in triplets comprising two entities and a relation. The named entities (NE) in CVs can be things like: skills, past employment, educational achievements and demographic data. The relations connect the NE to the person in question, while being labeled with time annotations. This will form the job applicant's graph. The extracted entities will then be annotated with metadata derived from a domain ontology, commonly described in OWL. For example, NAICS 5 can be used to characterise the entities that refer to the industry of employment.

The NER and RE processes can on occasion be of low precision. The Meta-Data properties can represent the quality of the NLP process as a numerical score per NE or per relation.

To the best of our knowledge, there is no single ontology that is complete enough to annotate a CV for the requirements of an HR department.

The distinction between sensitive and non-sensitive attributes can be represented, for instance, as classes by expanding one of the existing ontologies. Thus it will be clearer which attributes should be used by subsequent machine learning algorithms that perform job recommendations.

Subsequently, a similar distinction can be made between soft and hard skills in job posts.
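A splitting operator of this kind admits a very simple sketch. The keyword list and function names below are our own hypothetical choices, intended only to illustrate the idea of a data operator acting on extracted skill entities; in a real deployment the split would more plausibly be driven by the skill classes of an ontology than by keyword matching:

```python
# Hypothetical data operator: split extracted skill entities into
# "hard" (technical) and "soft" (interpersonal) skills.
SOFT_SKILL_KEYWORDS = {"communication", "teamwork", "leadership", "negotiation"}

def split_skills(skills):
    """Return (hard_skills, soft_skills) for a list of skill strings."""
    soft = [s for s in skills if s.lower() in SOFT_SKILL_KEYWORDS]
    hard = [s for s in skills if s.lower() not in SOFT_SKILL_KEYWORDS]
    return hard, soft

hard, soft = split_skills(["Python", "Teamwork", "SQL", "Negotiation"])
print(hard, soft)  # ['Python', 'SQL'] ['Teamwork', 'Negotiation']
```

Once the two classes exist as annotations, cross-tabulating them against seniority or against a protected attribute is a straightforward query over the knowledge graph.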
Naturally, this will require a Data operator to split the skills into two classes. This will facilitate an association of soft and hard skills with the level of seniority of the position, and also with the applicants' gender. This can reveal subtle forms of bias.

Apart from NAICS, for instance, it may be necessary to also use resumeRDF 6 and the Human Resources Ontology. 7 This results in the need to also have mappings between the ontologies for the common concepts (i.e. classes) and for the common object properties. The mapping rules can be stated in RML. 8

The second major source of information is the job post, which is typically in textual form, possibly split into sections, each with a meaningful keyword (like company culture, required skills etc.). This usually constitutes a partially structured piece of information. As in the case of CVs, information has to be extracted in the form of triplets, resulting in the job posts graph. However, it may not be necessary to extract structure from all parts of the document; for instance, a company's culture could fall under the Meta-Data Descriptions of the data set.

Finally, the merging of the two graphs into the integrated knowledge graph can also be achieved with mapping rules. Furthermore, business regulations and regulations derived from ethical data management can be set as constraints on the integrated knowledge graph, and be expressed in the SHACL 9 language.

Typically, the DE can be accessed via SPARQL endpoints. Normally, the end user has access via web services accessible through a dashboard. The web services can also allow for different user roles, thus implementing data access control.

7 CONCLUSIONS
In the current work we proposed a framework for responsible data management for a human resources department. The framework is based on the concept of a DE, which comprises data, meta-data and data operators. It can be implemented with semantic technologies (RDF Schema, OWL, RML rules, etc.).
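One of the constraints discussed above, that prediction algorithms must not consume sensitive attributes (requirement E2), can be sketched as a programmatic check. The check below is only a stand-in for a declarative SHACL shape, and the attribute names are hypothetical:

```python
# Hypothetical guard enforcing requirement E2: the feature set that feeds
# a recommender must not include sensitive attributes.
SENSITIVE_ATTRIBUTES = {"name", "gender", "age", "ethnic_origin"}

def validate_features(features):
    """Raise ValueError if any sensitive attribute is used as a feature."""
    used = SENSITIVE_ATTRIBUTES.intersection(features)
    if used:
        raise ValueError(f"sensitive attributes used as features: {sorted(used)}")
    return True

validate_features(["education", "skills", "past_employment"])  # passes
```

In the ecosystem itself the same rule would be attached to the integrated knowledge graph as a constraint, so that violations are caught at the data layer rather than in application code.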
The mapping rules, as well as the ontology selection and its possible expansion, have to be designed in cooperation between a knowledge engineer and a representative of the HR department.

The issue of detecting possible bias in the data can be assisted through data transparency, especially at the stage of NE annotation. Thus, CV attributes can be split into sensitive and non-sensitive ones, the former comprising name, gender, ethnic origin, age etc., whereas the latter would comprise entities like education or skills.

The implementation of a data ecosystem will require a substantial investment, both from a knowledge engineering and from an HR perspective. The benefits can be important, especially in the field of data transparency. Moreover, a DE can also facilitate the deployment of explainable machine learning algorithms.

ACKNOWLEDGMENTS
The authors would like to acknowledge the support of Deree - The American College of Greece in the current article.

5 North American Industry Classification System (NAICS), https://www.census.gov/naics/
6 http://rdfs.org/resume-rdf/
7 https://github.com/motapinto/cv-ontology/blob/main/cv-ontology.owl
8 https://rml.io/specs/rml/
9 https://www.w3.org/TR/shacl/

REFERENCES
[1] Ifeoma Ajunwa. 2020. The Paradox of Automation as Anti-Bias Intervention. 41 Cardozo L. Rev. (2020).
[2] Ifeoma Ajunwa and Daniel Greene. 2019. Platforms at work: Automated hiring platforms and other new intermediaries in the organization of work. In Work and Labor in the Digital Age. Emerald Publishing Limited.
[3] Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M Martinez, and Seth D Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest 20, 1 (2019), 1–68.
[4] Miranda Bogen and Aaron Rieke. 2018. Help wanted: An examination of hiring algorithms, equity, and bias. (2018).
[5] Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. PMLR, 77–91.
[6] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced neighborhoods for multi-sided fairness in recommendation. In Conference on Fairness, Accountability and Transparency. PMLR, 202–214.
[7] Hege H Bye, Henrik Herrebrøden, Gunnhild J Hjetland, Guro Ø Røyset, and Linda L Westby. 2014. Stereotypes of Norwegian social groups. Scandinavian Journal of Psychology 55, 5 (2014), 469–476.
[8] Cinzia Capiello, Avigdor Gal, Matthias Jarke, and Jakob Rehof. 2020. Data ecosystems: sovereign data exchange among organizations (Dagstuhl Seminar 19391). In Dagstuhl Reports, Vol. 9. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[9] Kate Crawford. 2013. The hidden biases in big data. Harvard Business Review 1, 4 (2013).
[10] Michael D Ekstrand, Robin Burke, and Fernando Diaz. 2019. Fairness and discrimination in recommendation and retrieval. In Proceedings of the 13th ACM Conference on Recommender Systems. 576–577.
[11] Sandra Geisler, Maria-Esther Vidal, Cinzia Capiello, Bernadette Farias Lóscio, Avigdor Gal, Matthias Jarke, Maurizio Lenzerini, Paolo Missier, Boris Otto, Elda Paja, et al. 2021. Knowledge-driven Data Ecosystems Towards Data Transparency. arXiv preprint arXiv:2105.09312 (2021).
[12] Os Keyes. 2018. The misgendering machines: Trans/HCI implications of automatic gender recognition. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 1–22.
[13] Karen Levy and Solon Barocas. 2017. Designing against discrimination in online markets. Berkeley Technology Law Journal 32, 3 (2017), 1183–1238.
[14] Kirsten Martin. 2019. Ethical implications and accountability of algorithms. Journal of Business Ethics 160, 4 (2019), 835–850.
[15] Marcelo Iury S Oliveira and Bernadette Farias Lóscio. 2018. What is a data ecosystem?. In Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age. 1–9.
[16] Lauren A Rivera. 2015. Go with your gut: Emotion and evaluation in job interviews. American Journal of Sociology 120, 5 (2015), 1339–1389.
[17] Julia Stoyanovich, Bill Howe, Serge Abiteboul, Gerome Miklau, Arnaud Sahuguet, and Gerhard Weikum. 2017. Fides: Towards a platform for responsible data science. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. 1–6.
[18] Julia Stoyanovich, Bill Howe, and HV Jagadish. 2020. Responsible data management. Proceedings of the VLDB Endowment 13, 12 (2020), 3474–3488.
[19] Adam Sutton, Thomas Lansdall-Welfare, and Nello Cristianini. 2018. Biased embeddings from wild data: Measuring, understanding and removing. In International Symposium on Intelligent Data Analysis. Springer, 328–339.
[20] Shari Trewin. 2018. AI fairness for people with disabilities: Point of view. arXiv preprint arXiv:1811.10670 (2018).