=Paper= {{Paper |id=Vol-3606/71 |storemode=property |title=Enhancing Human Resources through Data Science: a Case in Recruiting |pdfUrl=https://ceur-ws.org/Vol-3606/paper71.pdf |volume=Vol-3606 |authors=Paolo Frazzetto,Muhammad Uzair Ul Haq,Alessandro Sperduti,Massimiliano Gervasi,Nicolò G. Totaro,Giorgia Specchia,Maria Elena Latino,Loredana Caruccio,Stefano Cirillo,Tullio Pizzuti,Giuseppe Polese,Angelo Marchese,Orazio Tomarchio,Lorenzo Di Rocco,Umberto Ferraro Petrillo,Giorgio Grani,Alessandro La Ferlita,Yan Qi,Emanuel Di Nardo,Simon Mewes,Ould el Moctar,Angelo Ciaramella,Claudia Diamantini,Alex Mircoli,Domenico Potena,Simone Vagnoni,Claudia Cavallaro,Vincenzo Cutello,Mario Pavone,Patrik Cavina,Federico Manzella,Giovanni Pagliarini,Guido Sciavicco,Eduard I. Stan,Paola Barra,Zied Mnasri,Danilo Greco,Valerio Bellandi,Silvana Castano,Alfio Ferrara,Stefano Montanelli,Davide Riva,Stefano Siccardi,Alessia Antelmi,Massimo Torquati,Daniele Gregori,Francesco Polzella,Gianmarco Spinatelli,Marco Aldinucci |dblpUrl=https://dblp.org/rec/conf/itadata/FrazzettoUS23 }} ==Enhancing Human Resources through Data Science: a Case in Recruiting== https://ceur-ws.org/Vol-3606/paper71.pdf
                                Enhancing Human Resources through Data Science: a
                                Case in Recruiting
                                Paolo Frazzetto1,2 , Muhammad Uzair Ul Haq1,2 and Alessandro Sperduti1
                                1
                                    Department of Mathematics “Tullio Levi-Civita”, University of Padova, 35121 Padua, Italy
                                2
                                    Amajor S.r.l. SB, via Noventana 192, 35027 Noventa Padovana, Italy


                                                                         Abstract
                                                                         The role of Human Resources (HR) in the recruitment process is undergoing great technological changes
                                                                         with the increasing use of web-based job portals for candidate selection. While these portals have
                                                                         improved efficiency for applicants, they have also presented challenges to recruiters in managing the
                                                                         large volume of applications received. Automated systems that use machine learning have been proposed
                                                                         to address this, but the lack of publicly available annotated datasets hinders progress in this field. To
                                                                         overcome this limitation, we introduce a novel dataset of Italian CV embeddings encoded with binary
                                                                         targets, making it publicly accessible for research purposes. The study further explores the performance
                                                                         of data science techniques on the proposed dataset and the application of a Network Science perspective.

                                                                         Keywords
                                                                         HR, Personnel selection, CV Dataset, AutoML, Graph Neural Network, Machine Learning




                                1. Introduction
                                Human Resources (HR) play a central role in any business—finding and hiring the best candidates
                                is of paramount importance to ensure the long-term growth, productivity, and competitiveness
                                of an organization [1].
                                   The process of identifying and selecting suitable candidates for a job has traditionally relied
                                on manual assessment, interviews, and subjective judgments. However, in recent years there has
                                been increased interest in recruiting through web-based job portals [2]. Online job portals allow
                                Human Resource Management (HRM) to target a larger audience and reach more candidates.
                                With the help of these platforms, candidates can easily upload their data, supporting documents,
                                such as Curriculum Vitae (CV) or video presentations, or fill in assessment questionnaires [3].
                                   By using such systems, the HR department can receive a large number of
                                applications, even for a single job posting. On the one hand, these systems have made the job
                                application process efficient for candidates; on the other hand, they have made the screening
                                process time-consuming and labor-intensive for recruiters. Therefore, an automated system
                                is desired. Recently, there have been many efforts to employ machine learning to
                                solve such tasks [4, 5, 6]. However, an annotated dataset is required to use the full potential
                                of Machine Learning. Due to privacy concerns, most companies tend not to release their
                                datasets to the public, thus hampering basic research in this field.

                                ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
                                paolo.frazzetto@phd.unipd.it (P. Frazzetto); muhammaduzair.ulhaq@phd.unipd.it (M. U. U. Haq);
                                alessandro.sperduti@unipd.it (A. Sperduti)
                                ORCID: 0000-0002-3227-0019 (P. Frazzetto); 0000-0001-9660-8982 (M. U. U. Haq); 0000-0002-8686-850X (A. Sperduti)
                                © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   In this paper, we analyze and release a novel dataset of CV embeddings. The CVs are in the
Italian language, embedded using a multilingual Large Language Model (LLM), and their targets
are encoded in a binary fashion. In this way, an alternative solution is proposed to make the
dataset publicly available. The embeddings can easily be used for data analysis and building
machine learning models. Each candidate is labeled by their progress in the corresponding
selection. We also explore how Data Science techniques can enhance the candidate selection
process, improve decision-making, and ultimately contribute to the success of organizations. We
further investigate how we can exploit questionnaire data from a Network Science perspective,
testing whether similar questionnaires result in similar candidates’ outcomes. Our ultimate
goal is to contribute to this field’s growing body of knowledge and guide HR professionals in
adopting data-driven approaches to optimize their talent acquisition strategies.
Contributions of the paper: (i) we propose a novel publicly available HR recruitment dataset
obtained by collecting job candidates’ data from real-world cases; (ii) we present results on the
proposed dataset by many strong baseline models, including Neural Networks (NN) and Graph
Neural Networks (GNN).


2. Related Work
Nowadays, automated extraction of information from CVs, job postings, and related HR docu-
ments has caught the attention of many researchers and companies [5, 6, 7]. However, the lack
of publicly available datasets bottlenecks most of the progress.
   Despite the scarcity of datasets, there have been some efforts to extract information from
resumes and job postings with the aim of boosting HR performance. For example, [8] proposed
the use of an average word embedding model for CV retrieval based on the job description.
The embeddings of CVs are trained from scratch and combined with the pre-trained word2vec
embeddings using a hybrid embedding model. The job postings are also embedded using a pre-
trained word2vec model, and cosine similarity is used as a measure to find relevance between
the CV and job description. However, to the best of our knowledge, the dataset used in the
research is not publicly available. Jiang et al. [9] proposed the use of machine learning to match
candidates with job postings on online platforms. The authors used real-case data collected
by the recruitment team between January 2019 and October 2019. They collected about 13K
jobs with 580K candidates and 1.3 million resumes. However, the authors claim the dataset
to be sensitive and do not release it publicly. Yao et al. [10] proposed a method to quantify
the matching between candidate qualifications and the job requirements of a position. They
model the task with distantly supervised skill extraction to identify the skill entities from job
postings and resumes using skill entity dictionaries. The relevance between a resume and a job
description is measured according to the matching score of the skill entities. The dataset used
in the research contains 21K job postings and 86K resumes provided by a high-tech company.
So far, the authors have not released the dataset publicly.
   Zhang et al. [11] released a novel dataset for skill extraction on English job postings called
SKILLSPAN. The authors also outlined the annotation guidelines created by domain experts to
annotate hard and soft skills in job postings. The dataset consists of 14.5K sentences, of which
12.5K are annotated. The dataset is divided into three categories: BIG, HOUSE, and TECH. The
authors only released the HOUSE and TECH categories of the dataset to the public. However,
the dataset is limited to job postings, and only skill entities are annotated.
  Given the scarcity of publicly available datasets, in this study we address this problem
and release a novel dataset of CV embeddings. More details of the dataset are
explained in Section 4.


3. Data Collection
The gathering of reliable data in HR recruitment is a complex task that goes beyond the realm of
academia, as it requires external support from the industry. While academic research provides
valuable insights and theoretical frameworks, it often falls short of capturing the intricacies and
practicalities of the recruitment process in real-world settings. Additionally, HR recruitment
involves gathering a wide range of data, including resumes, application forms, psychometric
assessments, interviews, and performance evaluations. This data collection process requires
collaboration with organizations willing to share their recruitment data and provide access to
their internal information systems.
   This work has been made possible by the partnership and support of Amajor S.r.l. SB¹, an
Italian business development consulting firm. Its main activities concern proprietary consulting
methods to guide small- and medium-sized enterprises, improve their business model through
the entrepreneurs’ values, and find new candidates that best fit within the client’s organization.
   This partnership enabled us to obtain authentic and diverse datasets, allowing us to analyze
and develop models that closely mirror the challenges and complexities faced by HR practitioners.
This collaborative approach ensures that our research findings are relevant, applicable, and
aligned with the practical needs of the industry, ultimately leading to more effective and
impactful HR recruitment strategies.

3.1. Privacy Regulations
Privacy is a paramount concern in the field of HR, especially when dealing with the sensitive
personal data of job applicants. This sensitive data falls under specific legal frameworks,
most notably the General Data Protection Regulation (GDPR) in the European Union [12].
This regulation mandates that organizations handle personal data with care, ensuring its
confidentiality, security, and lawful processing.
  Furthermore, with the increasing integration of AI technologies in HR processes, additional
privacy concerns arise, such as the potential for unintended biases, discriminatory practices,
and unauthorized access to personal information. These and broader applications are being
regulated in the EU by the so-called AI Act [13] which, once approved, will constitute the
world’s first governmental regulation on Artificial Intelligence. Specifically, AI applications in
HR are classified as High-Risk AI Systems as they are mentioned in Annex III.4.(a):


      “AI systems intended to be used for recruitment or selection of natural persons,
      notably for advertising vacancies, screening or filtering applications, evaluating
      candidates in the course of interviews or tests;”

¹ Corporate website: https://www.amajorsb.com/en/

Therefore, HR professionals must navigate these specific legal requirements by implementing
robust data protection measures, obtaining informed consent from candidates, and ensuring
secure storage and transmission of personal information.
   In the scope of this work, we have taken extensive measures to handle data in a compliant
manner. Prior to data collection and questionnaire administration, candidates have been duly
informed about the purpose and scope of data processing, providing their informed consent.
In addition, we ensure anonymity by removing any personally identifiable information from
the dataset and releasing the CV embeddings rather than their original form. We analyzed the
data in its aggregate structure without any additional bias. Besides, we adhere to data retention
policies and will promptly delete the original records once they are no longer necessary for
the intended purposes. In this fashion, we aim to maintain the confidentiality and trust of the
candidates’ personal data, along with promoting open science and productive discussion in the
field.


4. Dataset Description
The data have been collected from real-case candidates’ applications for 195 different job postings
for vacancies, mostly in North-East Italy. The positions vary in terms of seniority, role, business
area, and corporate size. The time frame spans from 01-01-2021 to 31-05-2023. From a starting
pool of more than 13,000 applications, we filtered out those candidates who did not fill in the
privacy consent for this research, those who did not complete the assessment questionnaire,
and those who did not submit their CV.
   This is the first release of such a dataset, and we plan to provide future releases with
more entries, features, and meta-data. In fact, we intend to extend our study by incorporating
meta-data from the job description and the entire selection process. The dataset is available at
the repository Research Data Unipd.

4.1. Classes Identification
The job selection process typically involves several steps that candidates must navigate. In
our scenario, candidates are first required to submit their application, which includes their
CV and other basic data, and fill in the assessment questionnaire. After the initial screening
by HR recruiters, candidates are invited to participate in one or more interviews, which can
be conducted in various formats, such as video interviews or in-person meetings. From the
outcome of the interviews, only a handful of candidates are shortlisted and presented to the
future employer. In the final stage, the employer selects one candidate, negotiates the
job offer with them, reviews the employment contract, and completes the necessary paperwork before
the candidate officially joins the organization.
   In a Machine Learning framework, we can consider each of the previous states as a label, thus
modeling the process as a multi-class classification problem. Alternatively, since the process is
sequential, one could model it as a regression task over an interval. Nonetheless, for this
first analysis and release of the dataset, we opted to consider binary labels. We realized that
in real-case applications, there are many deviations from this standard process; besides, the
data recorded in the information systems may be misclassified or missing for some candidates
and stages. Therefore, we focus on those candidates who completed all stages up to the first
interview: an HR specialist has checked their CV, questionnaire, and interview and either
decided to bring them further in the selection (positive labels) or deemed that they were
not the best-suited candidates for that role (negative labels). These steps left us with $N = 2647$
valid candidates, of which 1674 (63%) have positive labels and 973 (37%) negative ones.
Meanwhile, we are currently working on cleaning and pre-processing the datasets to get more
reliable data points.

4.2. CV Embeddings
State-of-the-art Large Language Models (LLMs) have succeeded in various natural language
processing tasks. These models can be pre-trained on large corpora to capture contextual
information of words in a text. CVs are unstructured documents that consist of long textual
information. In this study, we use an XLM-RoBERTa-Longformer [14], a multilingual model with
an input size of 4096 tokens. The multilingual characteristic allows us to capture information
in different languages, whereas the larger input size enables processing long documents.
   The documents are preprocessed by removing stopwords, extra spaces, and special charac-
ters. These documents are tokenized and then passed through the pre-trained XLM-RoBERTa-
Longformer to extract the word embeddings of all the tokens. Each token is represented by
a 768-dimensional feature vector. Therefore, a document consisting of 𝑁 tokens returns a
𝑁 × 768 dimensional feature matrix. Processing such large matrices is computationally expen-
sive; therefore, we average the embedding vector of all the tokens in the document, resulting in
a 768-dimensional feature vector representing the document in an embedding space.
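
The averaging step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used for the released dataset: the function name `mean_pool` and the toy 4-dimensional vectors are assumptions made for readability, whereas the actual token embeddings are 768-dimensional outputs of the pre-trained model.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Average an (n_tokens x dim) matrix over its tokens into one dim-sized vector."""
    if not token_embeddings:
        raise ValueError("document has no tokens")
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vec in token_embeddings:
        if len(vec) != dim:
            raise ValueError("inconsistent embedding size")
        for j, x in enumerate(vec):
            sums[j] += x
    return [s / len(token_embeddings) for s in sums]


# Toy stand-in for the model output: 3 tokens, 4 dimensions (hypothetical values).
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
doc_vector = mean_pool(tokens)
print(doc_vector)  # [2.0, 1.0, 2.0, 2.0]
```

The result has the same size regardless of document length, which is what makes the representation usable as a fixed-width feature vector; the price is that word order and token-level detail are lost.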

4.3. Questionnaire Data
Personality and behavior assessments through questionnaires are a valuable tool in HR
selection processes and organizational psychology [3]. These assessments aim to gain insight
into candidates’ individual traits, characteristics, and behavioral tendencies, providing a deeper
understanding of their fit within the organization and their job role. Some questionnaires are
designed specifically for personality assessment, others for measuring some ability or behavior.
There exists a plethora of different kinds of tests, with different validity and usage among HR
practitioners [15]. Nevertheless, the quantitative nature of questionnaire data, along with its
ease of collection, allows for rigorous statistical analysis and enables standardized evaluation
across candidates, promoting fairness and consistency in the selection process.
   We collected questionnaire data following Amajor’s business model. The tool used for the
candidates’ assessment is the so-called A+ Questionnaire: a set of 242 questions with 3-point
Likert-type answers (yes/maybe/no or similar) covering various aspects of one’s behavior, habits,
and personality, developed by the company team after working alongside more than 120 clients
over a period of 3 years [16]. The answers are grouped and processed following a proprietary
factor model that gives an estimate of one’s hidden traits; however, its analysis and discussion
go beyond the scope of this work. In the following section, we describe a novel and general
approach to exploit Likert-type questionnaire data to find patterns among respondents.

4.3.1. From Questionnaires to Graph
Likert-type scales are widely employed in academic and industrial settings to capture human
facets due to their user-friendly nature, simplicity of development, and ease of administration
[17]. They enable respondents to answer questions in a closed-form way, picking only one value
on an ordered scale according to some sort of preference or agreement. Because the
perceived distance between two consecutive items cannot be defined or presumed equal [18],
such a scale cannot be analyzed by classical statistical methods defined on a metric space or
by parametric tests, but requires specific modeling and assumptions [19].
   In order to link candidates that give similar answers, we resort to Network Science. In fact,
networks, also called graphs, enable more expressive data structure and occur in many fields of
science and engineering [20]. However, translating tabular data to graphs is not trivial [21, 22]
as it requires domain knowledge and heuristics to define the nodes and their relationships.
   Our approach to tackling these issues is straightforward and takes advantage of the specific
structure of Likert-type data. Given any Likert-scale questionnaire $\mathcal{Q} = \{q_1, q_2, \dots, q_n\}$ made
up of $n$ questions, each possible answer $a_i$ takes value in an ordered set that w.l.o.g. we can
define as $\mathcal{A} = \{1, 2, \dots, L\} \subset \mathbb{N}$, so $a_i \in \mathcal{A}, \forall i = 1, \dots, n$. The order relation depends on
each $q_i$, and we assume that it is universal, i.e., the questionnaire is well-written, and each
question is understood by all respondents. In this way, each completed questionnaire can be
formulated as a specific collection of all the possible answers $\mathbf{a} = \{a_1, \dots, a_n\}, \forall a_i \in \mathcal{A}$, and a
respondent can be described as a function $r : \mathcal{Q} \to \mathcal{A} \times \dots \times \mathcal{A} = \mathcal{A}^n$, $r(\mathcal{Q}) = \mathbf{a}$. We desire
to link candidates/respondents that provide similar answers, thus having similar behaviors
and personalities, without resorting to the hidden variables given by factor analysis. Next, we
have to define a distance function $d(\mathbf{u}, \mathbf{v})$ between two responses $\mathbf{u}, \mathbf{v}$. Our desiderata are that
respondents who give the exact same answers will be closer, whereas when the answers are on
the opposite side of the scale, the distances should be greater. Additionally, we want to avoid
the Euclidean metric, since it scales quadratically with $L$, but Likert scales are perceived as
linear [18]. Therefore, the ideal candidate is the Manhattan distance or $\ell_1$ norm:
$$ d_M(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{n} |u_i - v_i|, \qquad u_i, v_i \in \mathcal{A}. \tag{1} $$

We brought this idea further by considering that Likert-type answers are usually contrasting,
i.e., one end of the scale is the opposite of the other, with all the ranges in between. Therefore,
in the case of an odd number of choices $L$, a middle value is perceived as a neutral or indefinite
answer [23]. We want to emphasize this contrast and penalize the neutral answers, as they
provide little insight into the analysis. For these reasons, we center our answer set in zero,
$\tilde{\mathcal{A}} = \{-\lfloor L/2 \rfloor, \dots, 0, \dots, \lfloor L/2 \rfloor\}$, and we exploit this symmetry with a redefined Bray-Curtis
similarity [24]:
$$ d_{BC}(\mathbf{u}, \mathbf{v}) = 1 - \frac{\sum_i |u_i - v_i|}{\sum_i |u_i + v_i|}. \tag{2} $$
Figure 1: Boxenplot distribution of the Bray-Curtis similarity for all $N^2$ questionnaire pairs; before
(left) and after (right) clipping the negative values.


Notice that this measure of similarity has the desired properties, being normalized to 1 when the
answers of two questionnaires are exactly the same. On the other hand, it diverges to $-\infty$ when
they are always at the opposite. In practice, $d_{BC}(\mathbf{u}, \mathbf{v}) \le 0 \iff \sum_i |u_i - v_i| \ge \sum_i |u_i + v_i|$,
and the latter holds when the majority of answers have opposite signs, hence opposite meaning.
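
The properties of Eq. (2) can be checked on small examples with the sketch below. The function names `center` and `bray_curtis_similarity` are illustrative. The value returned when both responses are entirely neutral (the ratio is then 0/0) is a convention assumed here, not one stated in the text.

```python
from typing import List, Sequence


def center(answers: Sequence[int], L: int) -> List[int]:
    """Shift answers from {1, ..., L} to the zero-centered set (odd L only)."""
    if L % 2 == 0:
        raise ValueError("a neutral middle value requires an odd L")
    mid = (L + 1) // 2
    return [a - mid for a in answers]


def bray_curtis_similarity(u: Sequence[int], v: Sequence[int]) -> float:
    """Redefined Bray-Curtis similarity of Eq. (2) on zero-centered answers."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    if den == 0:
        # num > 0: every answer is opposite or cancels out, the measure diverges.
        # num == 0: both responses are all-neutral; returning 0.0 is an
        # assumption of this sketch (the formula itself is undefined there).
        return float("-inf") if num > 0 else 0.0
    return 1.0 - num / den


u = center([3, 3, 1, 2], L=3)                            # [1, 1, -1, 0]
print(bray_curtis_similarity(u, u))                      # 1.0, identical answers
print(bray_curtis_similarity(u, [1, 0, -1, 1]))          # 1 - 2/6, mostly similar
print(bray_curtis_similarity([1, 1, -1], [-1, -1, -1]))  # -1.0, mostly opposite
print(bray_curtis_similarity(u, [-1, -1, 1, 0]))         # -inf, always opposite
```

In the third pair, two of the three answers have opposite signs, so the numerator (4) exceeds the denominator (2) and the similarity becomes negative, as stated above.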
   In our specific business case, we have $L = 3$ and thus $\tilde{\mathcal{A}} = \{-1, 0, 1\}$, where 1 stands for
the positive/affirmative answer, 0 is maybe/neutral, and $-1$ is no/negative. We then computed
the pairwise similarity of Eq. (2) for all pairs. These values directly translate into a graph in
which each node is a candidate, who is connected to all other candidates by means of a weighted
adjacency matrix $A$, whose entries are consequently defined as $A_{uv} = d_{BC}(u, v) = A_{vu}$. In this
way, we obtain a fully connected graph with a number of links on the order of $N^2$; counting each
undirected pair once gives $N(N-1)/2 \simeq 3.5 \times 10^6$ links. We also apply
two heuristics to reduce its complexity and keep only the most significant connections. First, we
set the negative values to zero, enforcing no similarity between such different questionnaires.
This results in the removal of $1.47 \times 10^5$ links. The corresponding boxenplot distribution is
shown in Figure 1: we retain a homogeneous distribution of similarities, together with a
right tail of candidates with almost identical answers. Secondly, we aim to drop the links with
a weight close to zero, starting from the lowest values. Therefore, we apply edge percolation
[25] and heuristically stop when the largest component has 95% of the total nodes. As shown
in Figure 2, an additional $1.1 \times 10^5$ edges can be removed. The corresponding threshold of
the similarity $d_{BC}(u, v)$ is 0.07, so all the remaining links connect candidates with a similarity
greater than this threshold. The basic statistics on the obtained graph are reported in Table 1.
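
The two pruning heuristics can be sketched on a toy graph as follows. This is an illustration under stated assumptions, not the code used for the paper: the names `percolation_threshold` and `prune_links` are hypothetical, and the percolation is run in reverse (links are added from the strongest down with a union-find structure until the largest component covers the requested share of nodes), which is assumed here to give the same threshold as removing links in ascending order.

```python
from typing import List, Tuple

Edge = Tuple[int, int, float]  # (node u, node v, similarity weight)


def percolation_threshold(n_nodes: int, edges: List[Edge], keep_fraction: float) -> float:
    """Weight of the last link needed for the largest component to reach keep_fraction."""
    parent = list(range(n_nodes))
    size = [1] * n_nodes

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    largest = 1
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
            largest = max(largest, size[ru])
        if largest >= keep_fraction * n_nodes:
            return w
    raise ValueError("the graph never reaches the requested component size")


def prune_links(n_nodes: int, edges: List[Edge], keep_fraction: float = 0.95):
    """Step 1: drop non-positive similarities. Step 2: drop links below the threshold."""
    positive = [e for e in edges if e[2] > 0]
    threshold = percolation_threshold(n_nodes, positive, keep_fraction)
    kept = [e for e in positive if e[2] >= threshold]
    return kept, threshold


# Toy graph with 5 candidates; the weights are made-up similarities.
toy_edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.5), (3, 4, 0.05),
             (0, 2, 0.3), (0, 4, -0.4), (1, 3, 0.02)]
kept, threshold = prune_links(5, toy_edges, keep_fraction=0.8)
print(threshold)  # 0.5
print(kept)       # [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.5)]
```

In the toy example, candidate 4 ends up isolated: its only positive links are weaker than the threshold, which mirrors the 5% of nodes allowed to fall outside the largest component.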
   Such a process allowed us to find a reasonable graph of candidates based on their responses
on a Likert-type questionnaire. Our research question is to test whether such an approach can
improve the identification of patterns and the prediction of the class of new candidates, given
that they provided similar answers to other labeled candidates.


5. Candidates Classification
This section explores how Machine Learning can be leveraged to perform candidate classification
based on this novel dataset. Two different approaches have been investigated, relying on
unstructured or structured data.

Table 1
Basic Properties of the Candidates Graph
    Property                   Value
    #Nodes                     2647
    #Links                     518994
    Avg. Degree                392
    Avg. Clustering Coeff.     0.615
    Avg. Path Length           2.141
    Graph Diameter             8
    Graph Density              0.148

   The first approach focused on tabular data analysis, where traditional machine learning
algorithms were applied to extract insights and patterns from unstructured data. This pipeline
usually involves feature engineering, model selection, architecture search, hyper-parameter
optimizations, and training on the tabular dataset to make predictions and classifications. Each
of these steps may be challenging on its own; therefore, we resorted to off-the-shelf AutoML
tools to automate this procedure. In particular, we exploited AutoGluon [26] for its simplicity
and the availability of Neural Networks among its models.
   In addition to the tabular approach, we also explored the use of Graph Neural Networks (GNNs)
[27, 28] to analyze the dataset’s graph structures as described in Section 4.3.1. By representing
the data as a graph, we leveraged GNNs to capture the relationships and dependencies that
emerge from questionnaire data among the entities. The GNN model enables us to learn from
both the nodes’ attributes (i.e., the CV embeddings) and the relational information present in the
graph, thereby capturing complex patterns and interactions that might be missed by traditional
tabular approaches. Unlike for tabular or multi-modal data, AutoML tools for GNNs are still
in their infancy and under active development [29]. Nevertheless, we adapted our graph to the
AutoGL framework [30], which enables us to test some of the most common GNN layers for
the node classification task.
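
To make the idea of learning from both node attributes and graph links concrete, the sketch below performs one round of weighted neighborhood averaging, the basic operation on which message-passing GNNs build. It is a simplified illustration with no learnable parameters and does not reproduce GraphSAGE, GAT, GCN, or the AutoGL API; the function name `aggregate` and the toy values are assumptions.

```python
from typing import Dict, List


def aggregate(features: Dict[int, List[float]],
              adjacency: Dict[int, Dict[int, float]]) -> Dict[int, List[float]]:
    """Mix each node's vector with the similarity-weighted mean of its neighbors."""
    updated = {}
    for node, own in features.items():
        neighbors = adjacency.get(node, {})
        total_weight = sum(neighbors.values())
        if total_weight == 0:
            updated[node] = list(own)  # isolated node keeps its own features
            continue
        neighbor_mean = [
            sum(w * features[n][j] for n, w in neighbors.items()) / total_weight
            for j in range(len(own))
        ]
        updated[node] = [(a + b) / 2 for a, b in zip(own, neighbor_mean)]
    return updated


# Toy data: 2-dimensional stand-ins for CV embeddings, links weighted by similarity.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adjacency = {0: {1: 0.5, 2: 0.5}, 1: {0: 0.5}, 2: {0: 0.5}}
print(aggregate(features, adjacency))
# {0: [0.75, 0.5], 1: [0.5, 0.5], 2: [1.0, 0.5]}
```

After one round, each candidate's representation already reflects the CVs of candidates with similar questionnaire answers; actual GNN layers add trainable transformations and non-linearities on top of this step.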
Figure 2: (a): Size of the largest connected component by removal of the links with ascending similarity.
The dotted line indicates the point where the largest component has 95% of the original nodes; thus, we
retain all the connections up to that point. (b): Degree distribution of the resulting graph.

Table 2
Experimental results on the test set for different models and data structures.
    Data                         Model                     Accuracy [%]   Std. Dev.
    Baseline                     Class Prior Classifier    63.24          -
    Tabular (CV Embeddings)      NN                        63.40          2.16
    Tabular (CV Embeddings)      RandomForest              61.51          1.79
    Tabular (CV Embeddings)      CatBoost                  64.53          1.51
    Tabular (Questionnaire)      NN                        75.78          2.54
    Tabular (Questionnaire)      RandomForest              76.99          1.42
    Tabular (Questionnaire)      CatBoost                  77.16          1.20
    Tabular (CV Emb. + Qst.)     NN                        76.45          2.44
    Tabular (CV Emb. + Qst.)     RandomForest              75.27          1.09
    Tabular (CV Emb. + Qst.)     CatBoost                  76.91          1.46
    Graph (CV Emb.)              GraphSAGE                 77.35          1.77
    Graph (CV Emb.)              GAT                       76.60          1.96
    Graph (CV Emb.)              GCN                       74.33          2.77

   By employing these two complementary approaches, we aimed to gain a comprehensive
understanding of the dataset and extract meaningful insights from different perspectives. All
the experiments have been conducted in the same environment and by employing open-source
libraries. We tested different models for each scenario, with 10-fold cross-validation on a
[80, 10, 10] train/validation/test split. We disabled any other feature selection/engineering
techniques, since we already employ features extracted from raw HR data and the models’ comparison
would otherwise be unfair. For the same reason, we turned off bagging and multi-layer stack ensembling,
which are otherwise useful to boost predictive accuracy [31].
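
One possible reading of this protocol is sketched below: the samples are partitioned into ten folds, and in each round one fold serves as the test set, the next one as the validation set, and the remaining eight as the training set. The rotation scheme, the function name `rotating_splits`, and the fixed seed are assumptions of this sketch; the text does not specify how the folds were assigned.

```python
import random
from typing import Iterator, List, Tuple


def rotating_splits(n_samples: int, n_folds: int = 10, seed: int = 0
                    ) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, validation, test) index lists, one triple per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        held_out = (k, (k + 1) % n_folds)  # test fold, validation fold
        test = folds[held_out[0]]
        val = folds[held_out[1]]
        train = [i for j, fold in enumerate(folds) if j not in held_out for i in fold]
        yield train, val, test


splits = list(rotating_splits(2647))
train, val, test = splits[0]
print(len(train), len(val), len(test))  # 2117 265 265
```

With 2647 candidates, each round trains on roughly 80% of the data and holds out about 10% each for validation and testing, and every candidate appears in a test set exactly once.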

5.1. Results
The experimental results are reported in Table 2. Our baseline model is a naive class prior
probability classifier that always predicts the most common class (the “positive” candidates)
with an expected accuracy of 63.24%. We considered RandomForest [32] and CatBoost [33]
since they have been proven to be fast to train and effective on tabular data in many domains.
Concerning the GNNs, we selected the modules GraphSAGE [34], GAT [35], and GCN [36]. The
Neural Networks and GNNs were trained with the default hyperparameters and architecture
spaces; therefore, their performance could improve with more extensive model selection.
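
The baseline figure follows directly from the class proportions reported in Section 4.1, as the short sketch below shows. The function name `class_prior_accuracy` is illustrative, and the accuracy is computed on the full set of labels rather than on a specific test fold.

```python
from collections import Counter
from typing import Sequence, Tuple


def class_prior_accuracy(labels: Sequence[int]) -> Tuple[int, float]:
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# 1674 positive and 973 negative candidates, as reported for the dataset.
labels = [1] * 1674 + [0] * 973
label, accuracy = class_prior_accuracy(labels)
print(label, round(100 * accuracy, 2))  # 1 63.24
```

Any model scoring close to this value has learned little beyond the class imbalance, which is the yardstick used when reading Table 2.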
   CV embeddings alone are substantially equivalent to the class prior classifier. This suggests
that CV embeddings should be improved, and that current general-purpose LLMs are unable to
grasp essential information without any domain knowledge or fine-tuning. Considering that
the Questionnaire is an essential element in the considered HR selection process, the fact that
the answers alone are a better predictor is unsurprising. CatBoost performs the best, being
designed for categorical data such as Likert-type questionnaires. Adding the embeddings to the
questionnaire data does not lead to improvements but rather slightly degrades performance.
The CV embeddings are therefore less informative than the answers and, given their high
dimensionality, they deteriorate classification.
   Nevertheless, the graph topology we proposed turns out to be valuable for classification,
performing slightly better than the corresponding tabular dataset of embeddings plus answers
to the questionnaire. It seems that our rationale for connecting Likert-type data effectively links
similar candidates in a meaningful fashion.


6. Conclusions
In conclusion, we collected, processed, and released an HR recruitment dataset. We employed
LLMs both to preserve anonymity and to study the current boundaries of such models in this
domain. We analyzed this dataset as a data-driven process, modeling it as a binary classification
task. Standard Machine Learning techniques proved effective when combined with assessment
questionnaire data, and we demonstrated a way to convert Likert-type data to graphs while
preserving their intrinsic patterns and relations.
   In the future, we plan to improve CV embeddings in order to outperform baselines. Addi-
tionally, we are already working on collecting more data and meta-data to enlarge the current
dataset. Given these promising results, we plan to investigate further how to translate
questionnaire data into graphs by analyzing different metrics and pruning techniques, and by
testing the validity of the approach on other questionnaires. Ultimately, we wish to contribute to the advancement of knowledge in this area
and provide guidance to HR professionals in adopting data-driven approaches for optimizing
talent acquisition strategies.


Acknowledgments
The authors would like to thank the HR recruiters and employees of Amajor for making this
research possible.


References
 [1] R. A. Noe, J. R. Hollenbeck, B. A. Gerhart, P. M. Wright, Fundamentals of human resource
     management, McGraw-Hill Education New York, NY, 2016.
 [2] S. Strohmeier, Concepts of e-hrm consequences: a categorisation, review and suggestion,
     The International Journal of Human Resource Management 20 (2009) 528–543.
 [3] R. Bailey, Hr applications of psychometrics, Psychometric Testing: Critical Perspectives
     (2017) 85–111.
 [4] S. Strohmeier, Handbook of Research on Artificial Intelligence in Human Resource Man-
     agement, Edward Elgar Publishing, 2022.
 [5] C. Bizer, R. Heese, M. Mochól, R. Oldakowski, R. Tolksdorf, R. Eckstein, The impact of
     semantic web technologies on job recruitment processes, in: Wirtschaftsinformatik, 2005.
 [6] K. Yu, G. Guan, M. Zhou, Resume information extraction with cascaded hybrid model
     (2005).
 [7] X. Yi, J. Allan, W. B. Croft, Matching resumes and jobs based on relevance models, in:
     Annual International ACM SIGIR Conference on Research and Development in Information
     Retrieval, 2007.
 [8] F. C. Fernández-Reyes, S. Shinde, Cv retrieval system based on job description matching
     using hybrid word embeddings, Computer Speech & Language 56 (2019) 73–79. URL:
     https://www.sciencedirect.com/science/article/pii/S0885230817302851.
     doi:10.1016/j.csl.2019.01.003.
 [9] J. Jiang, S. Ye, W. Wang, J. Xu, X. Luo, Learning effective representations for person-
     job fit by feature fusion, in: Proceedings of the 29th ACM International Conference on
     Information & Knowledge Management, CIKM ’20, Association for Computing Machinery,
     New York, NY, USA, 2020, p. 2549–2556. URL: https://doi.org/10.1145/3340531.3412717.
     doi:10.1145/3340531.3412717.
[10] K. Yao, J. Zhang, C. Qin, P. Wang, H. Zhu, H. Xiong, Knowledge enhanced person-job fit
     for talent recruitment, in: 2022 IEEE 38th International Conference on Data Engineering
     (ICDE), 2022, pp. 3467–3480. doi:10.1109/ICDE53745.2022.00325.
[11] M. Zhang, K. N. Jensen, S. D. Sonniks, B. Plank, Skillspan: Hard and soft skill extraction from
     english job postings, in: North American Chapter of the Association for Computational
     Linguistics, 2022.
[12] Regulation (EU) 2016/679 of the european parliament and of the council of 27 april 2016 on
     the protection of natural persons with regard to the processing of personal data and on the
     free movement of such data, and repealing directive 95/46/EC (General Data Protection
     Regulation), Official Journal of the European Union, 2016. URL:
     https://eur-lex.europa.eu/eli/reg/2016/679/oj, GDPR consolidated version.
[13] Proposal for a regulation of the european parliament and of the council laying down
     harmonised rules on artificial intelligence (artificial intelligence act) and amending
     certain union legislative acts, European Commission, 2021. URL:
     https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A52021PC0206, COM(2021) 206 final.
[14] Hugging Face, Longformer, https://huggingface.co/markussagen/xlm-roberta-longformer-base-4096,
     2021. Accessed: 2023-06-29.
[15] A. Furnham, HR professionals’ beliefs about, and knowledge of, assessment techniques and
     psychometric tests, International Journal of Selection and Assessment 16 (2008) 300–305.
[16] E. Peronato, F. Fabris, N. D’Agnolo, R. D’Orazio, Entrepreneurial values as a key for
     csr in smes, Presented at the XXXIII ISPIM, Copenhagen, 2022. URL:
     https://www.conferencesubmissions.com/ispim/copenhagen2022/index.html.
[17] A. Joshi, S. Kale, S. Chandel, D. K. Pal, Likert scale: Explored and explained, British journal
     of applied science & technology 7 (2015) 396.
[18] J. Munshi, A method for constructing likert scales, 2014.
[19] M. Disegna, N. Biasetton, E. Barzizza, L. Salmaso, A new adaptive membership function
     with CUB uncertainty with application to cluster analysis of Likert-Type data, Available
     at SSRN 4115553 (2022).
[20] A.-L. Barabási, Network science, Philosophical Transactions of the Royal Society A:
     Mathematical, Physical and Engineering Sciences 371 (2013) 20120375.
[21] J. Liu, Y. Chabot, R. Troncy, V.-P. Huynh, T. Labbé, P. Monnin, From tabular data to
     knowledge graphs: A survey of semantic table interpretation tasks and methods, Journal
     of Web Semantics 76 (2023) 100761. URL:
     https://www.sciencedirect.com/science/article/pii/S1570826822000452.
     doi:10.1016/j.websem.2022.100761.
[22] K. Zhou, Z. Liu, R. Chen, L. Li, S.-H. Choi, X. Hu, Table2graph: Transforming tabular
     data to unified weighted graph, in: L. D. Raedt (Ed.), Proceedings of the Thirty-First
     International Joint Conference on Artificial Intelligence, IJCAI-22, International Joint
     Conferences on Artificial Intelligence Organization, 2022, pp. 2420–2426. URL:
     https://doi.org/10.24963/ijcai.2022/336. doi:10.24963/ijcai.2022/336, Main Track.
[23] G. O. Boateng, T. B. Neilands, E. A. Frongillo, H. R. Melgar-Quiñonez, S. L. Young, Best
     practices for developing and validating scales for health, social, and behavioral research: a
     primer, Frontiers in public health 6 (2018) 149.
[24] J. R. Bray, J. T. Curtis, An ordination of the upland forest communities of southern
     wisconsin, Ecological monographs 27 (1957) 326–349.
[25] M. E. J. Newman, R. M. Ziff, Fast monte carlo algorithm for site or bond percolation,
     Phys. Rev. E 64 (2001) 016706. URL: https://link.aps.org/doi/10.1103/PhysRevE.64.016706.
     doi:10.1103/PhysRevE.64.016706.
[26] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, A. Smola, Autogluon-tabular:
     Robust and accurate automl for structured data, arXiv preprint arXiv:2003.06505 (2020).
[27] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph
     neural networks, IEEE Transactions on Neural Networks and Learning Systems 32 (2021)
     4–24. doi:10.1109/TNNLS.2020.2978386.
[28] A. Sperduti, A. Starita, Supervised neural networks for the classification of structures,
     IEEE Transactions on Neural Networks 8 (1997) 714–735. doi:10.1109/72.572108.
[29] K. Cao, J. You, J. Liu, J. Leskovec, Autotransfer: Automl with knowledge transfer – an
     application to graph neural networks, 2023. arXiv:2303.07669.
[30] C. Guan, Z. Zhang, H. Li, H. Chang, Z. Zhang, Y. Qin, J. Jiang, X. Wang, W. Zhu, AutoGL: A
     library for automated graph learning, in: ICLR 2021 Workshop on Geometrical and Topolog-
     ical Representation Learning, 2021. URL: https://openreview.net/forum?id=0yHwpLeInDn.
[31] R. Caruana, A. Niculescu-Mizil, G. Crew, A. Ksikes, Ensemble selection from libraries
     of models, in: Proceedings of the Twenty-First International Conference on Machine
     Learning, ICML ’04, Association for Computing Machinery, New York, NY, USA, 2004,
     p. 18. URL: https://doi.org/10.1145/1015330.1015432. doi:10.1145/1015330.1015432.
[32] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
[33] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, A. Gulin, Catboost: unbiased
     boosting with categorical features, Advances in neural information processing systems 31
     (2018).
[34] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs,
     Advances in neural information processing systems 30 (2017).
[35] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention
     networks, arXiv preprint arXiv:1710.10903 (2017).
[36] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks,
     2017. arXiv:1609.02907.