
Deep Semi-supervised Graph Representation Learning Model for Resume Classification

Wissem Inoubli

wissem.inoubli@loria.fr

Armelle Brun

armelle.brun@loria.fr

Keep In Touch, Strasbourg, France
University of Lorraine, CNRS, LORIA, France

2022

18 23

The main goal of job seekers is to identify job offers that match their profile. The same holds for human resource departments, which aim to identify candidates, through their resumes, that match the recruiter's expectations. However, the number of job seekers and job offers is so large that neither human resource employees nor job seekers are able to go through all the resumes and offers manually. Recommender systems have emerged in recent years with the goal of recommending job offers to job seekers and resumes to human resource departments. One of the approaches adopted by the literature relies on the identification of content elements in the offers and resumes that contribute to the matching. We propose to represent data in the form of graphs and to approach this problem as a classification problem. We present DGL4C, a semi-supervised graph deep learning model that learns an adequate representation from a graph and trains a classifier on this latent representation. Experiments are carried out on an open dataset of anonymous resumes. Results show that DGL4C significantly improves on the precision and accuracy of traditional deep learning models, such as sBERT, and confirm the pertinence of relying on a graph structure for the classification task in the HR domain.

Keywords: classification, graph representation learning, deep learning, semi-supervised learning, resume classification

1. Introduction

Recruitment is the process of matching job offers with resumes. It is performed by both job seekers and human resource (HR) departments, through the use of Applicant Tracking Systems (ATS), e.g. JobSCAN (footnote 1).

Due to the huge amount of resumes and offers, matching can no longer be performed manually. Information retrieval (IR) algorithms have been traditionally used to perform this task. They combine feature extraction techniques and a retrieval model (e.g. the standard boolean model). For example, in [1] the goal is to find relevant resumes in a [...]. [...] designed to recommend either to HR the resumes that match a given job offer or to a job seeker the relevant offers for his/her profile [2].

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on [...]
ORCID: 0000-0001-5121-9043 (W. Inoubli); 0000-0002-9876-6906 (A. Brun)

Although job offers can be easily collected to form a dataset, resume collection is a trickier task, due to privacy issues. Such datasets are generally collected by private firms and are not freely available.

Worse, very few datasets contain the ground-truth matching between offers and resumes. Thus, the content-based recommendation approach, that identifies resumes and [...] approach to perform this matching.

Besides, due to this lack of ground truth, the evaluation of recommendation models remains a challenging task. To cope with this limitation, we propose to rely on higher-level information: the occupation of a job offer, e.g. computer scientist, can be used to perform this matching. We propose to view the identification of this higher-level information as a classification problem, as proposed by [3]. The challenge here is thus to learn a classifier dedicated to resumes or job offers, i.e. to unstructured plain texts.

Text classification traditionally takes place in two steps. First, the texts are pre-processed to extract features. For example, TF-IDF (term frequency-inverse document frequency), LDA (Latent Dirichlet Allocation) [4], and Word2vec [5] are traditional models for feature extraction and text representation. Second, classification is performed by supervised machine learning algorithms that exploit such representations. The literature has shown that the performance of classifiers is highly dependent on the quality of the representation [6].

Deep learning, which performs feature extraction and classification in a single step, has also been studied. Many neural network variants were studied, like long short-term memory (LSTM) [7], based on a recurrent neural network (RNN) architecture [8], or a convolution neural network (CNN) [9]. Deep learning has proven to perform better than classical machine learning approaches. However, in the context of HR, deep learning still suffers from high error rates and low classification accuracy, especially for resume classification [1]. This can probably be explained by the limited size of the training datasets used [10].

Besides, graph structures have been traditionally adopted to manage rich data. Recently, deep graph learning models [3], that allow to learn a non-Euclidean space of data, have emerged. Surprisingly, they have not been studied in the HR context, especially for resume classification.

Thus, in this work, we propose DGL4C, that stands for Deep Graph Representation Learning for Classification, a new classification model based on deep graph learning. DGL4C is a semi-supervised model, designed for resume classification, that manages both labeled (resumes) and unlabeled (elements of resumes) data. Concretely, we propose two variants of DGL4C. DGL4C-GCN, Deep Graph Representation Learning for Classification with a Graph Convolution Network, is an end-to-end model, that learns all the stages between the initial input phase and the final output result (resume classification). DGL4C-GRL is made up of two stages: (i) text (resume) representation made by a GCN architecture, and (ii) a machine learning-based classifier.

The remainder of this paper is organized as follows. Section 2 introduces the literature related to resume classification. In Section 3, we introduce the two variants of DGL4C: DGL4C-GCN and DGL4C-GRL. Then, in Section 4, experimental results are described and analysed. Last, in Section 5, we conclude and propose perspectives.

Footnote 1: https://www.jobscan.co/applicant-tracking-systems

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.

2. Related Work

In this section, we briefly review some works related to resume classification in the area of human resources.

The literature has proposed several approaches, that can be divided into three categories: (i) ontology-based models, (ii) machine learning models and (iii) deep learning models.

Let us first consider the ontology-based models. After a feature extraction step, ontology-based models use ontologies to perform classification. An ontology is a conceptual meta-model that represents a domain knowledge [11]. Few international and national knowledge bases in HR are released. The most famous ones are DISCO (footnote 2), ISCO (footnote 3) and ESCO (footnote 4) [12]. Those knowledge bases represent occupational groups at various granularity levels. An occupation is defined as a set of jobs whose main tasks and duties are characterized by a high degree of similarity [12], and a skill is defined as the ability to apply knowledge and know-how to complete tasks and solve problems. For example, the ESCO standard contains 13,485 skills related to 2,942 occupations, sorted on four granularity levels. In [12, 13], the authors use the Textkernel Extract parser, an industrial tool (footnote 5), to extract skills from resumes. The ESCO base is then used to build a classifier based on the matching of the skills defined by ESCO with the skills extracted from the resumes.

Footnote 2: European Dictionary of Skills and Competences
Footnote 3: International Standard Classification of Occupations
Footnote 4: European Skills, Competences, Qualifications and Occupations
Footnote 5: https://www.textkernel.com/

Regarding machine learning models, which are widely used, they rely on training data, and also require a preprocessing step dedicated to feature extraction. Machine learning models, such as Random Forests, Decision Trees, Support Vector Machines, etc. have shown high efficiency and performance on the resume classification task [14]. In such models, the accuracy of the feature extraction step strongly impacts the classification performance.

Contrary to ontology-based and machine learning models, deep learning models consider both feature extraction and classification in a single step, which reduces the possible loss of information in the feature extraction step. Deep learning models are highly popular, and have shown a significant improvement of performance. Several works have been proposed for job classification [9, 2, 13, 15], where both 1-D convolution neural network (CNN) and recurrent neural network (RNN) architectures were adapted to the HR context.

Graph representation learning techniques have recently emerged and are used in many applications. Graphs are a traditional way to represent data, but models that rely on such a representation suffer from data sparsity and a lack of robustness to noise, which decreases the performance of predictive models. To overcome those limits, graph representation learning has been designed to represent data in a low-dimensional space, and has shown its efficiency for unstructured data such as images, texts and graphs. Graph representation learning can be categorized into three families: matrix factorization based models [16, 17], random-walk based models [18, 19] and neural network based models [20, 21]. The latter are neural networks that are used to learn node embeddings by aggregating information from neighboring nodes through edges. Neighborhood aggregation consists in forwarding and receiving back data between nodes, throughout their neighborhood. In a GNN, a node has an unlimited number of direct neighbors, whereas in the other neural network architectures, the number of direct neighbors is limited (e.g. two for RNN architectures and eight in the case of 2D-CNN architectures). This unlimited number of direct neighbors has shown its ability to encode both structural and semantic (node features) information, which makes it successful. This neighborhood information leads to a neural network that learns better than other architectures, which is confirmed by the work of Ding Yao et al. [22] in the case of hyperspectral image classification.

Graph Neural Network (GNN) architectures are receiving growing attention [20] and many models are proposed in the literature. Starting from the initial Graph Convolution Networks (GCN), GraphSage [23] was then proposed to overcome the scalability issue of GCN, by changing the convolution method. In the same context, an attention mechanism was proposed by the model called GAT [24].

To the best of our knowledge, the GNN architecture has not been studied for modeling resumes or job offers, and we assume that it could be of interest.

3. DGL4C: a GNN-based Architecture for Classification

Considering that data representation is a core step of classification models, we propose a new approach for resume representation, based on both a graph structure and semantic information. Concretely, we propose a semi-supervised graph representation learning model, based on a GNN architecture. This representation is used in DGL4C, the classification model that we design.

As previously mentioned, the main motivation for the choice of graph-based representation learning, and especially of a GNN architecture, comes from the neighborhood information (neighboring resumes) taken into consideration during the training step.

As deep learning requires a lot of training data to be effective, and since resume datasets are generally small, we decide to adopt a semi-supervised learning algorithm, which takes both labeled and unlabeled examples; the training process therefore runs on both example types: labeled and unlabeled [21] data. DGL4C aims to form a high quality latent representation of resumes, further used in the classification phase. In the following subsections, we present the way we propose to construct a graph of resumes, then the core idea of graph representation learning, and the way we develop DGL4C, designed to encode a dataset of resumes into a latent vector space.

3.1. Graph Construction

Before learning the graph representation, a first phase consists in constructing the graph of resumes. We take inspiration from the recent work conducted in [3]. Let $D = (R, Y)$ denote a dataset. $R$ is the set of resumes, and $r \in R$ is a resume, with $r = \{wd_1, wd_2, \dots\}$ the set of words of the resume. $Y$ is the set of labels, with $y_r \in Y$ the label of resume $r$ and $|Y|$ the number of labels (classes). $W = \bigcup_{r \in R} r$ is the vocabulary of $D$, i.e. the set of distinct words in $R$.

Definition of an R-graph. Let $G$ be a heterogeneous, attributed and unweighted graph built from $R$, the set of resumes. $G = (V, E)$, with $V$ and $E$ representing the nodes and edges of $G$ respectively. The set of nodes $V = R \cup W$ is made up of the union of the set of resumes and the set of unique words of the resume dataset. As a consequence, $V$ is made up of two types of nodes: resume nodes and word nodes.

The set of edges $E$ is also divided into two types of edges, word-to-resume edges and word-to-word edges.

An edge between two nodes exists if the similarity between those nodes is positive, similarly to [3]. Recall that the edges are unweighted.

The way this similarity is evaluated depends on the type of edge. Word-to-resume edges are evaluated by the well-known term frequency-inverse document frequency (TF-IDF), which is computed as follows:

$$\text{TF-IDF}(wd, r, R) = \text{TF}(wd, r) \times \text{IDF}(wd, R) \tag{1}$$

where $\text{TF}(wd, r)$ denotes the number of times the word $wd$ appears in a resume $r$, and $\text{IDF}(wd, R)$ is computed from the number of resumes that contain the word $wd$.

Word-to-word edges are evaluated by the Point-wise Mutual Information (PMI), as in [3]. A positive PMI value means a high semantic correlation of the pair of words in the corpus. A negative PMI value indicates that both words are not semantically close. Therefore, only the edges that are associated with a positive PMI value exist in the graph, under an unweighted form. The PMI value is computed as follows:

$$\text{PMI}(wd_i, wd_j) = \log \frac{p(wd_i, wd_j)}{p(wd_i)\, p(wd_j)} \tag{2}$$

$$p(wd_i, wd_j) = \frac{\#R(wd_i, wd_j)}{\#R} \tag{3}$$

$$p(wd_i) = \frac{\#R(wd_i)}{\#R} \tag{4}$$

where $\#R(wd_i)$ is the number of resumes that contain the word $wd_i$, $\#R(wd_i, wd_j)$ is the number of resumes that contain both words $wd_i$ and $wd_j$, and $\#R$ is the number of resumes in the corpus.

Equation (5) represents the adjacency matrix of the graph $G$:

$$A_{ij} = \begin{cases} 1 & i, j \text{ are words, and } \text{PMI}(i, j) > 0 \\ 1 & i \text{ is a document and } j \text{ is a word, and } \text{TF-IDF}_{i,j} > 0 \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
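The graph construction of this subsection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_r_graph` is hypothetical, the IDF is assumed to be the plain `log(N / df)` variant (the paper does not state which variant is used), and the self-loops on the diagonal are an assumption borrowed from the text-graph construction of [3]. Co-occurrence is counted at the resume level, as in the PMI definition above.

```python
import math
from itertools import combinations


def build_r_graph(resumes):
    """Build an unweighted resume/word graph from tokenized resumes.

    resumes: list of token lists. Returns (nodes, adjacency), where nodes are
    ("resume", i) or ("word", w) tuples and adjacency is a 0/1 matrix.
    """
    vocab = sorted({word for resume in resumes for word in resume})
    n_resumes = len(resumes)
    nodes = [("resume", i) for i in range(n_resumes)] + [("word", w) for w in vocab]
    index = {node: k for k, node in enumerate(nodes)}
    size = len(nodes)
    # Assumption: self-loops (i = j), as in the text graph of [3].
    adjacency = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    # Number of resumes that contain each word.
    doc_freq = {w: sum(1 for resume in resumes if w in resume) for w in vocab}

    def connect(a, b):
        adjacency[index[a]][index[b]] = 1
        adjacency[index[b]][index[a]] = 1

    # Word-to-resume edges: kept when TF-IDF is positive.
    for i, resume in enumerate(resumes):
        for w in set(resume):
            tf = resume.count(w)
            idf = math.log(n_resumes / doc_freq[w])  # assumed IDF variant
            if tf * idf > 0:
                connect(("resume", i), ("word", w))

    # Word-to-word edges: kept when PMI is positive.
    for w1, w2 in combinations(vocab, 2):
        both = sum(1 for resume in resumes if w1 in resume and w2 in resume)
        if both == 0:
            continue  # PMI is undefined (log of 0); no edge
        p_joint = both / n_resumes
        p1, p2 = doc_freq[w1] / n_resumes, doc_freq[w2] / n_resumes
        if math.log(p_joint / (p1 * p2)) > 0:
            connect(("word", w1), ("word", w2))
    return nodes, adjacency
```

With this assumed IDF variant, a word that appears in every resume gets a TF-IDF of zero and therefore no word-to-resume edge; a smoothed IDF would keep such edges.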

3.2. Graph Representation Learning

After building the graph from the set of resumes $R$, graph representation learning is the second step of the proposed model. Most neural networks share the same universal architecture, namely a set of connected multi-layer perceptrons that operate on the input data. GCN [21] is a convolutional neural network on graphs that performs operations similar to those of a CNN, except that it applies convolution over a graph instead of a 2-D array. GCN learns a latent representation by propagating information from direct neighbors in the graph and applying a linear transformation. The information propagation procedure consists in aggregating information from the direct and $k$-hop neighborhood. Next, like perceptrons, GCN applies a linear transformation followed by a pointwise non-linearity. By stacking $k$ GCN layers, each node aggregates information from nodes $k$ hops away. The GCN [21] propagation rule is defined as follows:

$$H^{(l)} = \sigma\left(\hat{A}\, H^{(l-1)}\, W^{(l)}\right), \quad H^{(0)} = X \tag{6}$$

where $\hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix ($D$ being the degree matrix of $A$), $X$ is the input node feature matrix, $H^{(l-1)}$ and $H^{(l)}$ are the previous and the new hidden state matrices respectively, $W^{(l)}$ is a trainable weight matrix for layer $l$, and $\sigma$ denotes any non-linear activation function (e.g., ReLU). The convolution step in GCN is based on message passing, which is divided into two sub-steps: (i) message gathering and (ii) aggregation. Message gathering consists of getting messages from $n$-hop neighbors, and aggregation consists of normalizing all messages in order to get an embedding of a node $v$. The message passing form of equation (6) can be written as follows:

$$m_v^{(l)} = \sum_{u \in N(v)} \frac{1}{\sqrt{|N(v)|}\sqrt{|N(u)|}}\, h_u^{(l-1)} \tag{7}$$

$$h_v^{(l)} = \sigma\left(W^{(l)} m_v^{(l)} + b^{(l)}\right) \tag{8}$$

where $N(v)$ is the set of neighbors of node $v$ and $b^{(l)}$ is a bias term.

The loss function is defined over the set of resume indices that have labels, and the dimension of the last layer of the GCN is the number of profiles. As a consequence, only the resume nodes are used by the loss function, whereas the graph contains both resumes and elements of resumes (words); this is what makes the training of DGL4C semi-supervised.

In addition, the experiments that we conducted showed that the GraphSage aggregation method performs better than the GCN [21] and GAT [24] aggregation methods; thus the GraphSage aggregation (mean aggregation) is the one kept for DGL4C.

We propose two variants of DGL4C, that differ in the number of steps they are made up of. DGL4C-GCN is an end-to-end model with a single step that includes both representation and classification. DGL4C-GRL is made up of two stages: (i) text (resume) representation, and (ii) a machine learning classifier.

4. Experiments

In this section, we aim at evaluating the performance of both DGL4C-GCN and DGL4C-GRL.

4.1. Experimental Setup

4.1.1. Dataset

The dataset used is a corpus of 2,484 anonymous resumes (footnote 6). Each resume is associated with one label, that represents the resume profile, and that we consider as being the class label. 24 profiles (classes) are available. Each resume is written in natural language and contains personal information, education, experience, etc.

In the experiments conducted, the aim is to evaluate the performance of the proposed models and to compare them to several baseline models from the literature. In addition, we are interested in evaluating the impact of the number of classes (profiles) on the accuracy of the models. Thus, we form several datasets so that they fit a predefined number of classes. Statistics about each of the resulting datasets and associated graphs are presented in [...].

4.2. Parameters Settings

DGL4C-GCN and DGL4C-GRL were implemented using the DGL framework (footnote 7), with two convolution layers of the GraphSage [23] architecture to allow message passing among nodes, and the mean aggregation. From an architectural point of view, we set the embedding size of the first convolution layer to 500, a value fixed from initial experiments. We tuned the other parameters and set the learning rate to 0.001 and the dropout to 0.2. Both models have been trained over 200 epochs with a batch size of 32 training samples. For each dataset, we randomly use 80% of the resumes for training and the remaining resumes for testing, and we perform this selection 10 times. As a consequence, the accuracy reported in the experiments is the mean test accuracy over these 10 runs.
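To make the two-layer mean-aggregation architecture more concrete, here is a minimal, untrained forward pass in plain Python. It is only a sketch, not the DGL-based implementation used in the experiments: the function names are hypothetical, the layer form `h_v' = act(W_self h_v + W_neigh mean(h_u))` is an assumed GraphSage-style mean aggregator, weights are randomly initialised, and training (learning rate, dropout, epochs) is deliberately left out. A small hidden size is used in the example instead of the 500 dimensions reported above.

```python
import random


def matvec(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def sage_mean_layer(features, neighbors, w_self, w_neigh, relu=True):
    """One mean-aggregation layer over all nodes.

    features: list of node feature vectors; neighbors: list of neighbor index
    lists. Assumed form: h_v' = act(W_self h_v + W_neigh mean_{u in N(v)} h_u).
    """
    out = []
    for v, h_v in enumerate(features):
        neigh = neighbors[v]
        dim = len(h_v)
        if neigh:
            mean = [sum(features[u][k] for u in neigh) / len(neigh) for k in range(dim)]
        else:
            mean = [0.0] * dim  # isolated node: nothing to aggregate
        z = [a + b for a, b in zip(matvec(w_self, h_v), matvec(w_neigh, mean))]
        out.append([max(0.0, x) for x in z] if relu else z)
    return out


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(features, neighbors, hidden_dim, n_classes, seed=0):
    """Two stacked layers: hidden embedding, then one logit per class."""
    rng = random.Random(seed)
    in_dim = len(features[0])
    w1_self = random_matrix(hidden_dim, in_dim, rng)
    w1_neigh = random_matrix(hidden_dim, in_dim, rng)
    w2_self = random_matrix(n_classes, hidden_dim, rng)
    w2_neigh = random_matrix(n_classes, hidden_dim, rng)
    hidden = sage_mean_layer(features, neighbors, w1_self, w1_neigh, relu=True)
    return sage_mean_layer(hidden, neighbors, w2_self, w2_neigh, relu=False)


# Toy graph: 4 nodes with one-hot features and symmetric neighbor lists.
toy_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
toy_neighbors = [[1, 2], [0], [0, 3], [2]]
toy_logits = forward(toy_features, toy_neighbors, hidden_dim=8, n_classes=3)
```

In a semi-supervised setting such as the one described in this paper, only the logits of labeled resume nodes would enter the loss, while word nodes still contribute through the aggregation.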

4.3. Experimental Results

To evaluate the effectiveness of DGL4C-GCN and DGL4C-GRL, we compare their performance with several models from the literature, that differ in either the text representation or the classifier step. Each model is a pair made of a text representation model and a classifier, except for the end-to-end model DGL4C-GCN. The popular text representation models mentioned in the related work section are used. The list of models is presented in Table 2.

4.3.1. Impact of the Text Representation

We first focus on the evaluation of the impact of the text representation, by fixing the classifier. We choose to use the popular random forest (RF) algorithm. The models studied are listed in the two upper parts of Table 2. Table 3 presents the test accuracy of these models.

Let us first compare accuracy across models, on the complete dataset (D5). As expected, TF-IDF+RF is the worst-performing model (30.09 accuracy), TF-IDF being the historical and quite simple way to represent texts. Deep-learning-based representations, Word2Vec+RF and sBERT+RF, perform better, with an accuracy of 49.65 and 60.23 respectively. sBERT+RF performs better than Word2Vec+RF, which is in line with the literature, sBERT being the current best performing model in NLP, specifically on the semantic textual similarity task [25].

Footnote 7: https://www.dgl.ai/

Let us now consider the graph-based representation models. DGL4C-GCN, the end-to-end model we propose, performs slightly better than sBERT+RF, but this increase is not statistically significant. DGL4C-GRL+RF, in contrast, performs significantly better than sBERT+RF. We can conclude that graph-based representations are adequate for the resume classification task and that the use of neighborhood information in the representation learning, which combines both semantic and structural information, is useful. In particular, this information can be viewed as a way to compensate for the lack of data faced by deep learning models.

Let us now focus on the impact of the number of classes on the performance of the models, by studying performance on the D1 to D5 datasets, i.e. from 5 to 24 classes. As expected, we can see that the performance of each model is negatively impacted by the increase of the number of classes. For example, the accuracy of DGL4C-GRL+RF is 94.38 with 5 classes and decreases to 67.87 with 24 classes. However, this performance does not decrease linearly with the number of classes. In particular, the performance between 11 and 20 classes remains stable. A significant decrease occurs between 20 and 24 classes, from 75.76 to 67.87. A similar decrease also occurs for the other graph-based model, DGL4C-GCN. However, this is not the case for deep-learning-based models, nor for TF-IDF.

We can conclude that graph representation based models perform better than traditional deep learning based models. However, they seem to be less robust as the number of classes grows. Additional experiments should be conducted to identify whether the decrease in performance is due to the number of classes or to characteristics of the 4 additional classes of D5.

4.3.2. Impact of the Classifier

We now focus on the evaluation of the impact of the classifier on the performance of DGL4C-GRL. We evaluate several well-known classifiers, namely Support Vector Classification (SVC), Multi-layer Perceptron (MLP) and Logistic Regression (LR), that we compare to the previously studied Random Forest. The list of models studied is presented in the two lower parts of Table 2. The mean accuracy of these models is presented in Table 4, which also recalls the performance of the best deep learning model sBERT+RF and the graph-based end-to-end DGL4C-GCN model.

First of all, we can see that whatever the classifier used, DGL4C-GRL still performs better than sBERT+RF on most of the dataset versions. If we focus on the impact of the classifier on the performance of DGL4C-GRL, LR and SVC are the two best performing classifiers, and slightly outperform RF. However, this improvement is not statistically significant. We can thus conclude that the nature of the classifier does not significantly impact the performance of the model. On the contrary, the graph representation seems to be the most influential step for the performance, which confirms the findings of the literature.

Considering DGL4C-GCN, it is the best performing model for two of the five datasets (D2 and D3). However, DGL4C-GCN has a significantly lower performance on D5 (62.43 accuracy) compared to the best performing model DGL4C-GRL+SVC (69.04 accuracy). This can be explained by the fact that an end-to-end model has one optimization function that optimises both the representation learning and the classification, whereas DGL4C-GRL has two optimization functions used separately, which makes the classifier more flexible.

[Table 4: mean accuracy. Rows recovered: DGL4C-GRL+LR, DGL4C-GRL+SVC, DGL4C-GRL+MLP. Values recovered: 94.38, 73.65, 73.76, 67.87.]

5. Conclusion and Future Work

In this paper we have proposed DGL4C, a deep semi-supervised graph representation learning based model for resume classification. This model can be used to provide recommendations to ATSs, human resource departments, and professional online social networks (e.g. LinkedIn, Viadeo, Meetup, JobCase, etc.).

DGL4C relies on a deep learning approach and adapts the GNN architecture to textual data. The experiments conducted demonstrate the performance of the two variants of DGL4C: DGL4C-GCN and DGL4C-GRL. In particular, both variants perform better than machine learning-based and deep learning-based models from the literature, including sBERT, which has shown good performance on close use cases. Experiments thus confirm the relevance of relying on a graph-based representation in the HR context.

In future work, we plan to adopt unsupervised graph representation learning instead of semi-supervised learning, which will be associated with the possibility of collecting and evaluating on larger datasets.

References

[1] A. Zaroor, M. Maree, M. Sabha, JRC: a job post and resume classification system for online recruitment, in: 29th ICTAI, IEEE, 2017, pp. 780–787.
[2] A. Giabelli, L. Malandri, F. Mercorio, M. Mezzanzanica, A. Seveso, Skills2job: A recommender system that encodes job offer embeddings on graph databases, Applied Soft Computing 101 (2021) 107049.
[3] Yao, Mao, Luo, Graph convolutional networks for text classification, in: AAAI, volume 33, 2019, pp. 7370–7377.
[14] Fareri, Melluso, Chiarello, G. Fantoni, Skillner: Mining and mapping soft skills from any text, Expert Systems with Applications 184 (2021) 115544.
[15] Abdollahnejad, Kalman, B. H. Far, A deep learning bert-based approach to person-job fit in talent recruitment, in: CSCI, IEEE, 2021, pp. 98–104.