Deep Unordered Composition for Multi-label
Classification Applied to Skill Prediction
Yann Duperis¹,², Adrian-Gabriel Chifu¹, Bernard Espinasse¹, Sébastien Fournier¹ and Arthur Kuehn²
¹ LIS UMR CNRS 7020, Aix-Marseille Université, France
² MOBEN & ROOSTER, France


Abstract
Today, many recruitment processes are digitalized. Job offers are posted on job boards and candidates
apply by submitting their resumes. To select an appropriate candidate for a job, recruiters rely mostly
on the evaluation of the professional skills of the individual. However, research has shown that
individuals tend to omit some skills from their professional profile. A human recruiter, knowledgeable in
a given activity sector, is often able to fill the gaps and infer the missing skills. In this paper, our aim is
to support this human recruiter by automatically inferring these missing skills, a non-trivial task. To
solve this task, we first propose a method to tackle the skill prediction problem by transforming it from a
multi-label classification task to a binary classification task. We then implement this method with a
deep learning model inspired by the Deep Unordered Composition approach. Two variants of
this model, one with the Deep Averaging Network architecture and the other with the Set-Transformer
architecture, are evaluated on an open data set of IT resumes, and the results are promising.

Keywords
Job recommender systems, Neural networks, Natural language processing


1. Introduction
Nowadays, most recruitment processes are partially digitalized. In our broader research, we
focus on two closely related phases of the recruitment process: the sourcing and screening steps.
The sourcing step consists in searching for new candidates for a given job. The screening step
consists in filtering the applications for a published job offer. Both steps rely on the evaluation
of the adequacy of the Candidate/Job pair. Often, this evaluation is performed manually by a
recruiter more or less knowledgeable in the related activity sector. Automated systems, like job
recommender systems, have arisen in the last decades to assist both phases of the recruitment
process, thus allowing recruiters to focus on activities where they can add significant
value.
   Numerous research works have been, and still are, conducted and published on this matter.
A key factor in the success of such endeavors is the appropriate modeling of the professional
needs expressed in a job offer, written in natural language, and of the professional abilities of a
candidate, also expressed in natural language in digitized documents like resumes. Different
approaches exist for such modeling, from the most explicit, with ontologies [1], to Machine
Learning based black boxes [2]. Most of the works presented in this literature are either not specialized
for a given activity sector or focused on the IT activity sector. Hence, most existing solutions
do not yield appropriate results for the activity sector of our industrial partner: the process
industry sector. Like other industry sectors, this one exhibits two interesting features:

        • A significant specialization of practices, skills, equipment and tools;
        • The emergence of a shared sub-language with its own lexicon, concepts and document
          structure.

CIRCLE (Joint Conference of the Information Retrieval Communities in Europe) 2022
yann.duperis@lis-lab.fr (Y. Duperis)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   Another constraint expressed by our industrial partner is the need to identify candidates
for specialized or expert-level jobs. Hence, our research project aims at producing a candidate
recommendation system for expert-level jobs in specific activity sectors. Since human recruiters
and most automated systems rely on an individual’s skills to evaluate their adequacy for a given
job, our own recommendation system should do the same. The novelty of our research project
is to create a recommendation system working with specialized and meaningful skills for a
given activity sector (i.e. not skills like Team spirit, Management¹ or Microsoft Office).
   Research has shown that these skills are not always declared by candidates the way an
automated system (or a recruiter) expects them, or are even not declared at all [3]. Hence, some
pre-processing of candidates’ professional profiles is required to detect these skills. Although
some approaches rely on text pre-processing strategies to extract skills and normalize them to a
canonical form [4], others use the information contained in a profile to predict the
missing or misspelled skills. This task is referred to as skill prediction or skills inference and can
be treated as a multi-label classification task.
   We hypothesize that the text written in a professional profile is a good predictor of the skills
of an individual. Related work presented in Section 2.2 introduces recent proposals of methods
to build job recommender systems based entirely or partly on the processing of natural language
documents - i.e. professional profiles and job offers. Most of the proposed approaches rely on
syntactic composition functions implemented in neural network architectures like Convolutional
Neural Network, Long Short Term Memory or Transformer based architectures [5]. However,
models based on unordered composition functions, like the Deep Averaging Network (DAN [6]),
have demonstrated the ability to rival syntactic approaches on some natural language processing
tasks.
   In this paper, we present a method to predict the skills of an individual from their professional
profile (resume or professional networking platform profile). Our contributions are:

        • A method to tackle the skill prediction problem by transforming it from a multi-label
          classification task to a binary classification task. The core idea is that the model predicts
          whether a given pair (profile, skill) is likely to be realistic. Unlike the One versus Rest
          approach to multi-label classification, the proposed method does not require training
          one model per skill, but it does require one inference per candidate skill;
        • A modular deep learning architecture that performs this binary classification. The module
          responsible for producing the hidden representation of the profile text is inspired by the
          Deep Unordered Composition approach [6]. We evaluated two different implementations
          compatible with this approach: DAN [6] and the Set-Transformer [7].

¹ Though the ability to manage a team or a process is a real skill, we see two problems in its usage for recommendation:
  as a soft skill it can hardly be measured, and a significant proportion of profiles declare possessing it.

   This paper is structured as follows. In Section 2 we present related works concerning
skill prediction and job/profile recommendation. In Section 3 we introduce a formal definition
of the skill prediction problem, by formally defining profile and skill prediction. In Section 4
we present our method to tackle the skill prediction problem by transforming it from a
multi-label classification task to a binary classification task, and two different deep learning-based
implementations of this method, with the DAN architecture and with the Set-Transformer
architecture. Section 5 is dedicated to experiments with our method on an open data set of IT
resumes, whose results are discussed in Section 6. Finally, we conclude in Section 7 by presenting
some perspectives of this research.


2. Related work
Although most recruitment and professional activity related information retrieval systems
make use of skills, only a few research works have focused specifically on the skill prediction (or
inference) problem; we present them in Section 2.1. However, the literature on job/profile
recommender systems provides valuable insights on the usage of professional profile data for
information retrieval purposes, and Section 2.2 introduces the most recent methods using
deep learning to extract useful features from profile text.

2.1. Skill prediction
To predict the skills of an individual, it is first required to define the concept of skill. Different
domains have different definitions. Information retrieval approaches usually rely on an extensional
paradigm where skills are modeled as elements of ad hoc sets. In this paradigm, a skill is a
property of an individual (either binary - the individual possesses the skill or not - or associated with a
proficiency score) useful to model their professional abilities. Here, the set of supported skills is
often called a dictionary and is built with data mining approaches [3, 8]. Other approaches to skill
modeling integrate more knowledge through the usage of ontologies [1, 9] (with subsumption
relationships between skills). Even these approaches still consider skills as labels associated
with a profile. There is no intensional definition nor rich semantic relationships between skills or
other domain concepts, although relations are sometimes defined between jobs and commonly
required skills (like in the ESCO ontology [9]). Such knowledge has been implicitly encoded in
a latent space - along with job titles - with graph embedding techniques in [10].
   In [3], the authors present the construction of the "Skills and Expertise" folksonomy in use at
LinkedIn.com. After the introduction of this section in the profiles, the authors noticed a lack of
declarations and developed a skills recommendation system to help users fill this new section.
The best results were obtained by a Naive Bayes classifier using the following features: industry
sector, company, function, title and group.
   The authors of [11] propose a joint prediction factor graph to predict the skills of individuals
in a professional social network by using relationships like: same company, same university or
same job title. This approach was evaluated only on the 10 most frequent skills, so we cannot assess
its performance in our setting.
   In [12], the authors propose an explainable Convolutional Neural Network (CNN) model for
high-level skills prediction in the IT domain. The introduced concept of high-level skill is a
composition of low-level skills. Based on the provided examples - css, html, php, etc. for low-level
skills and front-end developer, web developer, etc. for high-level skills - we interpret high-level
skills as job titles. Although the tackled problem is different, the presented work shows that
interesting professional information can be inferred by a deep learning model operating on a
profile text.

2.2. Job/Profile recommendation
In [13], the authors propose a co-teaching neural network consisting of a relation matching
component inspired by the Relational Graph Convolutional Network (R-GCN [14]) approach and a
text matching component relying on a pre-trained BERT model [15] for sentence encoding. The
document encoding is produced with a hierarchical architecture similar to the HIBERT model
[16]. The relation matching component also relies on shared keywords with a high TF-IDF score
as a relation between two objects.
    [17] also relies on a hierarchical architecture to build document representations. The
authors use a Bidirectional Long Short Term Memory network (BiLSTM) model to contextually
encode the words in the profile and job sentences. Different attention layers applied either to
the profile or to the job extract latent representations of the offered and requested abilities.
    CNNs can also be used to extract representations from the texts of profiles and job offers. In
[18], the authors propose a deep siamese network based on CNNs to maximize the similarity
between related profiles and job offers, thus allowing the twin models to learn how to build
representations of the text contained in these documents.
    The related work presented in this section demonstrates the ability of deep learning models to
capture meaningful information in the text of a professional profile. The power of Transformer
based architectures [5] (like BERT [15]) encouraged us to experiment with an unordered variant,
the Set-Transformer architecture [7], for one of our model’s implementations.


3. Formal definition of skill prediction problem
In this section we define the main concepts used throughout this paper. Section 3.1 introduces
the concept of Skill, Section 3.2 introduces the concept of Profile and finally Section 3.3 defines
the skill prediction problem tackled in this paper.
   Throughout this paper, the following notations are used:

       Symbol                          Definition
       $E$                             a set of elements $\{e_1, e_2, ..., e_{|E|}\}$
       $|E|$                           the cardinality of the set $E$
       $e$                             an element of $E$
       $E_k$                           a subset of $E$ related to a specific element $k$ (element of
                                       another set $K$), with $E_k \subset E$
       $N_a$, $\theta$ or $\theta_b$   a constant number
       $\mathbf{v}$                    a vector
       $\mathbf{T}$                    a tensor with $rank(\mathbf{T}) > 1$

3.1. Skill definition
We define a Skill as a property of an individual that allows the prediction of their ability to
perform specific tasks. Understanding what a skill is, how it is developed and how it
manifests itself during task execution is out of the scope of our work. We consider
skills from the perspective of information retrieval. Thus, we can define a skill as a binary property of an
individual - the person possesses it or not. With this definition, a skill can be treated as a
tag attached to a professional profile.
    Contrary to a simple keyword that might be a mere statistical artifact, a skill is usually meant
to refer to an implicit concept, whose extension can vary. Indeed, when a recruiter writes a job
offer or when candidates write their resumes, the skills are often written, in specific parts of the
documents, as short surface forms referring to implicit concepts (especially for abstract skills
like Management, Machine Learning or Information security). Thus, skills cannot be considered
as mere keywords, and the canonization of the myriad surface forms to a normalized form
associated with the concept is far from trivial [4].
    We define $S$ as the set of all skills ($s \in S$) supported by our information system.

3.2. Profile definition
We define a (professional) Profile as a digitized document describing the professional life of an
individual. Such documents are not standardized and vary from a multi-page, mostly textual, PDF
resume to a de facto bag of keywords on professional networking platforms like LinkedIn.com.
Even with different formats and contents, most of these documents aim at fulfilling a common
function: advertising the capacity of an individual to adequately perform some economic
activity².
   Most profiles contain information that advertises the individual’s capacities, such as the history of past
professional experiences, self-assessed skills, education and associated degrees. As discussed
in Section 3.1, the skills declared in a profile are surface forms of concepts embedded in wide
implicit knowledge graphs. The other information contained in a profile is mostly semi-structured
text written in natural language.
   Our approach focuses on the lexical properties of a profile (the domain-specific terms
mentioned), thus we simplify the profile definition by concatenating the declared professional
experiences and education into a bag of words (BOW):

$$P = \{(W_p, S_p)\}, \quad \forall W_p \subset V,\ \forall S_p \subset S, \tag{1}$$

with $W_p$ the most significant terms of a profile, $V$ a domain vocabulary and $S_p \subset S$ the declared
skills.

² Most research works focus on jobs but some aim at modeling economic activity at a higher granularity, like task
  allocation in an organization.

3.3. Skill prediction definition
The given definition of a profile considers that all information contained in it is declared by the
individual whose professional abilities are described. Research has shown that the resulting
profile is often incomplete [3].
   In this paper we focus on the prediction of an individual’s skills using the text contained in
their profile. More specifically, we model the text as a bag of words. Thus, for a profile $p$ with
unknown skills $S_p$ and a set of significant domain terms $W_p$, our research hypothesis is that the
skill prediction problem can be reduced to estimating $P(s \mid W_p)$ with a function $f_{sp}(W_p, s)$.
   A skill $s$ is considered a predicted skill of the profile $p$ if $f_{sp}(W_p, s) \geq \theta_{sp}$, with $\theta_{sp}$ the skill
prediction confidence threshold. The aim of the work presented in this paper is to build a deep
learning model able to estimate the function $f_{sp}$.
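
   The decision rule above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: `score_fn` stands in for the learned function f_sp, and the toy scorer, the skill names and the threshold value are hypothetical.

```python
from typing import Callable, Iterable, Set


def predict_skills(
    profile_terms: Set[str],
    candidate_skills: Iterable[str],
    score_fn: Callable[[Set[str], str], float],
    threshold: float,
) -> Set[str]:
    """Return every skill s whose score f_sp(W_p, s) reaches the threshold."""
    return {s for s in candidate_skills if score_fn(profile_terms, s) >= threshold}


def toy_score(profile_terms: Set[str], skill: str) -> float:
    # Hypothetical stand-in for the trained model: 1.0 if the skill name
    # appears among the profile terms, 0.0 otherwise.
    return 1.0 if skill in profile_terms else 0.0


predicted = predict_skills(
    {"python", "docker", "api"}, ["python", "java", "docker"], toy_score, 0.5
)
```

   Note that one call to `score_fn` is made per candidate skill, which is the inference cost mentioned in the introduction.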


4. A method to predict skills with deep learning
In this section, we present our method for skill prediction with deep learning. This method
relies on our two main contributions. In Section 4.1, we introduce the first one - i.e. the transformation
of the initial multi-label classification task to a binary classification task by using the skill to
predict as an input. In Section 4.2 we present the second one - i.e. the deep learning model performing
the binary classification with a deep unordered composition approach for the handling of profile
text. Finally, Section 4.3 presents how the model is trained.

4.1. From a multi-label classification task to a binary classification task
Predicting all possible skills that an individual might possess, from their textual professional
profile, is a multi-label classification task. The number of classes is the number of entries in the
skills dictionary. A granular skills taxonomy would then produce a large number of classes
(possibly thousands). As an example, using the European Skills, Competences, Qualifications and
Occupations ontology (ESCO [9]) as a skills dictionary would require the model to classify examples amongst
13.5K classes³, most of which are entangled in semantic relationships. In such a scenario, a
training example presented to the model would be a profile as input with its declared skills as
outputs. Such an output vector is high-dimensional and sparse.
   To address the shortcomings of such a multi-label classification task, we shifted to a binary
classification task with a few adjustments. In this task, the model is trained to predict the
probability that a given profile possesses a given skill. This choice is motivated by the following
research hypothesis:

      Using the skill whose ownership is to be inferred as an input allows the model to
      project it in a latent space, hence allowing it to learn relationships amongst skills
      and with the hidden representations of profiles.

³ https://ec.europa.eu/esco/portal/skill

   A training example presented to the model is composed of a profile and a skill as inputs and a
binary label as expected output. The label is 0 if the individual whose profile
is provided as input to the model does not possess the skill, and 1 if the individual possesses
the skill.
   Formally, let $E$ be the set of all examples and $e \in E$ an example modeled as a tuple:

$$E = \{(p, s, l_{sp})\}, \tag{2}$$

with $l_{sp} \in \{0, 1\}$ the classification label. Positive examples are generated from profiles by using
the declared skills, $S_p$. The set of positive examples derived from a profile $p$, $E_p^+$, can be defined as:

$$\begin{aligned}
E_p^+ &= \{(p, s, 1)\},\ \forall s \in S_p \\
|E_p^+| &= |S_p| \\
E^+ &= \bigcup_{p \in P} E_p^+
\end{aligned} \tag{3}$$

   Due to the nature of our data set, i.e. a collection of professional profiles with a set of declared
skills, we can only extract positive examples. To overcome this limitation, we perform a random
negative sampling step during training. We introduce a training hyper-parameter $n_s$ and, for
each positive example, we generate $n_s$ negative examples. The skills used to generate these
negative examples are sampled uniformly at random from the skills not declared in the profile.
   If we take $n_s = 1$, the examples set, $E = E^+ \cup E^-$, is balanced for the binary classification
task. However, the binary classification task is a proxy for the multi-label classification task.
The balance of the data set, in terms of class (i.e. skill) populations, is not guaranteed and can
have an impact on the performance of the method.
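
   The construction of positive and negative examples can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in our experiments: the function name, the toy profiles and the choice of sampling negatives with replacement are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Example = Tuple[str, str, int]  # (profile id, skill, label)


def build_examples(
    declared: Dict[str, Set[str]],
    all_skills: Set[str],
    n_s: int,
    rng: random.Random,
) -> List[Example]:
    """Emit one positive example per declared skill and n_s negatives for each."""
    examples: List[Example] = []
    for profile_id, skills in declared.items():
        # Only skills that are NOT declared in the profile can be negatives.
        undeclared = sorted(all_skills - skills)
        for skill in sorted(skills):
            examples.append((profile_id, skill, 1))
            if undeclared:
                # Uniform sampling; drawing with replacement is an assumption.
                for negative in rng.choices(undeclared, k=n_s):
                    examples.append((profile_id, negative, 0))
    return examples


profiles = {"p1": {"python", "sql"}, "p2": {"java"}}
skills = {"python", "sql", "java", "docker", "linux"}
examples = build_examples(profiles, skills, n_s=1, rng=random.Random(0))
```

   With n_s = 1 the sketch yields as many negatives as positives, which matches the balanced case discussed above. A sampled negative may still be a skill the individual actually possesses but did not declare, so the negative labels are noisy by construction.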

4.2. Deep unordered composition based model
Inspired by the good performance and lower complexity of deep unordered composition based
models on some natural language processing tasks [6] - compared to syntactic ones - we
decided to model profile text as a bag of words. Our decision is also motivated by a hypothesis:

      Specific terms in a profile are used by recruiters to infer implicit skills. Thus, a deep
      learning model should be able to do the same.

  Our proposed model takes two inputs: the profile text (as a BOW representation) and the skill
whose ownership by the profile must be inferred. Both inputs are first embedded by using
pre-trained embeddings. Each of them is then projected in a latent space by one of two parallel
modules: respectively the Profile Encoder and the Skill Encoder. The hidden representations
are then merged by a Combination Module. The resulting aggregated representation is further
processed by the Combination Encoder before the final binary classification layer (a single neuron
with a sigmoid activation function). The Combination Encoder can be omitted and its impact
is evaluated by ablation studies in our experiments. The general architecture of the model is
depicted in Figure 1.


Figure 1: General architecture of the proposed model


  Some of these modules have multiple implementations, each with their own variants and
hyper-parameters.
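
  To make the data flow of Figure 1 concrete, the forward pass can be sketched in plain Python. This is a deliberately reduced illustration, not our Keras implementation: the Profile Encoder is limited to an embedding lookup followed by average pooling, the Skill Encoder to an embedding lookup, the Combination Encoder is omitted, and all parameter values are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def encode_profile(terms: List[str], term_emb: Dict[str, Vector]) -> Vector:
    """Profile Encoder: embed each known term, then average-pool (DAN-style)."""
    vectors = [term_emb[t] for t in terms if t in term_emb]
    # Assumes at least one term of the profile is in the vocabulary.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def score(
    terms: List[str],
    skill: str,
    term_emb: Dict[str, Vector],
    skill_emb: Dict[str, Vector],
    w_out: Vector,
    b_out: float,
) -> float:
    h_p = encode_profile(terms, term_emb)  # profile hidden representation
    h_s = skill_emb[skill]                 # Skill Encoder reduced to a lookup
    combined = h_p + h_s                   # Combination Module: concatenation
    logit = sum(w * x for w, x in zip(w_out, combined)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # single neuron, sigmoid activation


# Hypothetical toy parameters; a real model would learn them.
TERM_EMB = {"python": [0.2, 0.1], "docker": [0.4, -0.3], "api": [0.0, 0.5]}
SKILL_EMB = {"devops": [0.3, -0.2]}
W_OUT, B_OUT = [0.5, -0.5, 1.0, 1.0], 0.0

p = score(["python", "docker", "api"], "devops", TERM_EMB, SKILL_EMB, W_OUT, B_OUT)
p_shuffled = score(["api", "python", "docker"], "devops", TERM_EMB, SKILL_EMB, W_OUT, B_OUT)
```

  Shuffling the profile terms leaves the output unchanged (up to floating-point rounding), which is the property expected from an unordered composition function.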

4.2.1. Profile encoder
The profile encoder is the module responsible for the encoding of the BOW profile into a single
aggregated hidden representation. Figure 2 illustrates the architecture of this module.
  Formally:

$$h_p = PE(W_p, T_p), \tag{4}$$

with $h_p$ the profile hidden representation, $PE$ the profile encoder function, $W_p$ the profile
terms with the highest scores and $T_p$ the scores of these terms (computed with TF-IDF).
   We experimented with two integration strategies for the scores:

    • Weighting - multiplying each term vector by its score and normalizing the vectors before
      passing them through the next layers;
    • Ignoring - ignoring the scores and using only the term vectors for profile encoding.

  The profile encoder comprises multiple layers (see Figure 2):

    • $Embedding$ - an embedding layer (trainable or not);
    • $Scoring$ - a layer responsible for the aggregation of term embeddings and scores (it can
      simply return the term vectors if the score integration strategy is Ignoring);
    • $TWE$ - a term-wise encoder (possibly comprised of stacked layers) that further
      encodes each term in the profile BOW. This layer has two implementations based on deep
      unordered composition functions and is presented in the next paragraph;
    • $Pooling$ - a pooling layer that aggregates the latent representations of the terms into a single
      vector;
    • $PoolingEnc$ - an optional pooling-encoder (comprised of stacked dense layers) that
      further projects the pooled vector to produce a latent representation of the profile [6]. Our
      experiments evaluate the utility of such a layer for our task.

   Since profile text is encoded as a bag of words - hence a set of terms - we restricted our choice of
layer architectures for the term-wise encoder implementation to those featuring permutation
invariance [7] - i.e. elements of an input sequence can be swapped without consequences on the
function output. We experimented with the following architectures for the implementation of the
term-wise encoder layer:

    • Deep Pooling - Stacked row-wise dense layers that encode each term embedding
      independently⁴;
    • Set Transformer - Stacked multi-head self-attention blocks without positional
      embeddings for permutation invariance (implementing the Set Transformer architecture
      proposed in [7]). We also experimented with the ISAB block proposed by the authors
      to reduce the computational complexity from $O(n^2)$ to $O(nm)$, with $m$ the number of
      trainable inducing points used in each ISAB block.

  To produce a single vector aggregating all the information of the profile, the profile encoder
module relies on a pooling strategy (implemented in the pooling layer). We
experimented with the following ones:

       • Average pooling;
       • Maximum pooling;
       • Sum pooling with normalization;
       • Pooling by multi-head attention - this pooling strategy relies on a multi-head attention
         block with a trainable vector used as query and the profile terms hidden representations
         as keys and values. This pooling mechanism has been proposed in [7].
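
  The first three pooling strategies can be sketched in a few lines of plain Python. This sketch is illustrative only: the function names are hypothetical, and the use of an L2 normalization for the sum pooling is an assumption, since the kind of normalization is not specified above. Pooling by multi-head attention is not covered because it requires trainable parameters.

```python
import math
from typing import List, Sequence

Vector = List[float]


def average_pooling(terms: Sequence[Vector]) -> Vector:
    """Mean of the term vectors, dimension by dimension."""
    return [sum(column) / len(terms) for column in zip(*terms)]


def max_pooling(terms: Sequence[Vector]) -> Vector:
    """Maximum of the term vectors, dimension by dimension."""
    return [max(column) for column in zip(*terms)]


def normalized_sum_pooling(terms: Sequence[Vector]) -> Vector:
    """Sum of the term vectors, rescaled to unit L2 norm (assumed normalization)."""
    summed = [sum(column) for column in zip(*terms)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed] if norm > 0 else summed


# Three hypothetical 2-dimensional term representations.
term_vectors = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
reordered = list(reversed(term_vectors))

avg = average_pooling(term_vectors)
mx = max_pooling(term_vectors)
nsum = normalized_sum_pooling(term_vectors)
```

  Each function returns the same vector for `term_vectors` and `reordered`, which illustrates why such poolings are suitable for a profile modeled as a set of terms.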


4.2.2. Skill encoder
The proposed method relies on the usage of skills as input to allow the model to learn how to
correlate the hidden representations of profiles with those of skills. The skill encoder
module is responsible for the projection of the input skill in a latent space. This module
is composed of an embedding layer (trainable or not) and of optional stacked
fully connected layers with dropout. The further projection of the skill embedding by stacked
layers is controlled by hyper-parameters and is also evaluated by an ablation study in our
experiments.
⁴ Implemented with the Keras library TimeDistributed layer.

Figure 2: Architecture of the profile encoder module (the Term-wise Encoder can be comprised of $N_r$
stacked layers and the Pooling Encoder of $N_p$ stacked layers; $N_p$ can be zero)


4.2.3. Profile/skills Combination module
To evaluate the interaction between a profile and a skill, our model needs a module that
combines both hidden representations. We have only experimented with the following strategy:

    • Concatenation - simple concatenation of the features of both hidden representations into
      a single vector. This strategy preserves all available information and lets the following
      layers of the model handle them accordingly to minimize the classification error.

4.2.4. Profile/skills Combination Encoder
The last module of the model projects the profile/skill combination into a latent space that eases
the final classification. This module is implemented with stacked fully connected layers with
dropout and is optional. Its impact on the model’s performance is evaluated by ablation studies
in our experiments.

4.3. Model training
4.3.1. Random inputs masking
Real profiles are lexically very diverse, and a data set would have to be very large to cover
this diversity. Such diversity can hardly be obtained with a data set of regular size, so we
emulate it by randomly masking some terms in the profile bag of words. This pre-processing
step is controlled by a hyper-parameter $r_m$, the masking ratio. To reduce overfitting, each profile
bag of words used during training undergoes a masking operation when it is sampled. Masked
terms are ignored by the profile encoder.
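
   A possible implementation of this masking step is sketched below. It is an illustration under assumptions, not our training code: the function name is hypothetical, masked terms are represented by `None`, and we assume that a fixed share of the terms (rounded to the nearest integer) is masked, whereas masking each term independently with probability r_m would be an equally plausible reading.

```python
import random
from typing import List, Optional


def mask_terms(
    terms: List[str], r_m: float, rng: random.Random
) -> List[Optional[str]]:
    """Mask round(r_m * len(terms)) randomly chosen terms (None = masked)."""
    n_masked = round(r_m * len(terms))
    masked_positions = set(rng.sample(range(len(terms)), n_masked))
    return [None if i in masked_positions else t for i, t in enumerate(terms)]


profile_terms = ["python", "docker", "api", "sql", "linux",
                 "git", "java", "rest", "aws", "ci"]
masked = mask_terms(profile_terms, r_m=0.3, rng=random.Random(42))
# Only the visible terms would be seen by the profile encoder.
visible = [t for t in masked if t is not None]
```

   Because a new mask is drawn each time a profile is sampled, the model sees a different subset of the profile terms at every epoch.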

4.3.2. Loss function
We use the binary cross-entropy loss to optimize the model, since it is a common choice for
binary classification tasks.
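   For reference, with the notations of Sections 3.3 and 4.1, the binary cross-entropy over a set of examples $E$ can be written as below. Averaging over $|E|$ is assumed here because it is a common convention; the way per-example losses are aggregated is not otherwise specified.

```latex
\mathcal{L}(E) = -\frac{1}{|E|} \sum_{(p, s, l_{sp}) \in E}
  \Big[ l_{sp} \log f_{sp}(W_p, s) + (1 - l_{sp}) \log\big(1 - f_{sp}(W_p, s)\big) \Big]
```

   For instance, a positive example scored 0.9 contributes $-\log 0.9 \approx 0.105$ to the sum, while a positive example scored 0.1 contributes $-\log 0.1 \approx 2.303$.
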
4.3.3. Optimizer
Based on several experimental runs, the best-performing optimizer was Nadam [19] with a learning rate of
0.01 and exponential decay rates of 0.9 and 0.999.


5. Experiments
In this section we introduce the experimental work undertaken to evaluate the proposed method.
The conducted experiments aim at answering the following research questions:

     • RQ-1: can a Deep Unordered Composition based model capture meaningful information
       for skill prediction from a professional profile modeled as a set of significant domain
       terms?
     • RQ-2: can domain-specific rare skills be predicted with a machine learning approach?
     • RQ-3: is the usage of classes as model inputs in a binary classification task a good proxy
       for a multi-label classification task?

  Section 5.1 presents the data set selected for the evaluation, Section 5.2 defines
the evaluation metrics suited to the task at hand and Section 5.3 describes the evaluated
models. Finally, Section 5.4 shows the results of some of the evaluated models for our task. To
the best of our knowledge, no other work has proposed a skill prediction method that could be
used as a baseline for a fair comparison with our method.

5.1. Data set used
We chose to evaluate our approach on a public data set of professional profiles to allow
reproducibility and fair comparison with future works. Such a data set is not easy to find, since only a few
economic actors can collect such data.
   We decided to use the data set published in [12] for our work to promote open research. This
data set comprises 30K professional profiles in the IT domain⁵. Since the work published in
[12] tackled the task of high-level skill prediction - i.e. skills families, job titles or job categories
- we had to convert the data set to a format compatible with our own task⁶. This pre-processing
step is not perfect and affects the performance discussed in Section 6.

5.2. Evaluation metrics
Although the proposed method transformed the initial multi-label classification task to a binary
classification task during model inference and training, we must evaluate it as the former. Hence,
we compute the number of false negatives 𝐹 𝑁 , true negatives 𝑇 𝑁 , false positives 𝐹 𝑃 and true
positives 𝑇 𝑃 for each class. With these metrics, we can compute Recall, Precision, Specificity,
Accuracy and F1-Score.

⁵ Although our domain of interest is the process industry sector, this data set allows us to validate the adequacy of
  our approach for specialized domains.
⁶ This adaptation of the data set is available here: https://github.com/yannduperis/circle-2022-it-profiles-dataset

  To ease the comparison between different models, we can also compute micro, macro and
weighted aggregation of these metrics across all classes (i.e. skills).
  For a metric 𝑀 , we can compute these aggregations with the following formulas:
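\begin{equation}
\begin{aligned}
M_{macro} &= \sum_{s \in S} M_s \cdot \frac{1}{|S|} \\
M_{weighted} &= \sum_{s \in S} M_s \cdot \frac{|E_s|}{|E|},
\end{aligned}
\tag{5}
\end{equation}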

with 𝑆 the set of skills, 𝑀𝑠 the value of the metric for the skill 𝑠, 𝐸 the set of all evaluation
examples and 𝐸𝑠 the set of evaluation examples with the skill 𝑠. For the micro version of these
metrics, we pool the 𝑇𝑃, 𝐹𝑃, 𝑇𝑁 and 𝐹𝑁 counts of all classes before computing the metric.
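
The three aggregations can be sketched as follows, using recall as the metric. The function names are hypothetical, and the sketch assumes that each evaluation example belongs to exactly one skill, so that |E| is the sum of the |E_s|.

```python
from typing import Dict, Tuple


def macro_average(metric: Dict[str, float]) -> float:
    """M_macro: every skill contributes with the same weight 1/|S|."""
    return sum(metric.values()) / len(metric)


def weighted_average(metric: Dict[str, float], n_examples: Dict[str, int]) -> float:
    """M_weighted: each skill s is weighted by |E_s| / |E|."""
    # Assumption: |E| is the sum of the per-skill example counts |E_s|.
    total = sum(n_examples.values())
    return sum(metric[s] * n_examples[s] / total for s in metric)


def micro_recall(counts: Dict[str, Tuple[int, int]]) -> float:
    """Micro recall: pool (TP, FN) over all skills, then compute recall once."""
    tp = sum(c[0] for c in counts.values())
    fn = sum(c[1] for c in counts.values())
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical toy values for two skills "a" and "b".
recall = {"a": 0.5, "b": 1.0}
n_examples = {"a": 30, "b": 10}
tp_fn = {"a": (15, 15), "b": (10, 0)}

macro = macro_average(recall)
weighted = weighted_average(recall, n_examples)
micro = micro_recall(tp_fn)
```

In this toy case the macro average treats both skills equally, whereas the weighted and micro versions are dominated by the more frequent skill.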

5.3. Evaluated models
This section presents the evaluated models, which are named after the term-wise encoder used in
the profile encoder. Due to the small size of our data set, the evaluated models are small and
regularized (with dropout on both attention layers and fully connected layers). For all our
models, we fine-tuned the pre-trained skills embeddings and used a profile BOW maximum length of
200 and a masking ratio (𝑟𝑚) of 30%.
   A prior hyper-parameter exploration step guided us to the best embeddings (GloVe with 20
dimensions for terms and Word2Vec with 10 dimensions for skills), BOW maximum length and
masking ratio. We also noticed during this early experiment that taking 𝑛𝑠 > 1 caused a
dramatic drop in recall.
   Except for the DAN model, all models use the pooling by multi-head attention strategy.
   ST - Set-Transformer - These models use the Set-Transformer implementation of the term-wise
encoder with 4 attention heads (except for ST-III, which has 5 because its 50-dimensional terms
embeddings cannot be split evenly across 4 heads), the ISAB version of the multi-head
self-attention block [7] with 𝑚 = 25 and 1 layer. For all models except ST-VII, the evaluated
terms scores integration strategy is the ignoring strategy.

    • ST-I: simple model without profile pooling encoder, skill embedding encoder or
      combination encoder;
    • ST-II: ST-I with trainable terms embedding layer;
    • ST-III: ST-I with pre-trained FastText 50 dimensions embeddings for terms (not trainable);
    • ST-IV: ST-I with profile pooling encoder;
    • ST-V: ST-IV with profile/skill combination encoder;
    • ST-VI: ST-V with skill embedding encoder;
    • ST-VII: ST-I with the weighting terms scores integration strategy.

   DP - Deep-Pooling - These models use the Deep-Pooling implementation of the term-wise
encoder with 1 fully connected layer of 20 units. For all models except DP-IV, the evaluated
terms scores integration strategy is the ignoring strategy.

    • DP-I: simple model without profile pooling encoder, skill embedding encoder or
      combination encoder;
    • DP-II: DP-I with trainable terms embedding layer;
    • DP-III: DP-I with a second fully connected layer of 20 units in the term-wise encoder;
    • DP-IV: DP-I with the weighting terms scores integration strategy.

   DAN - This model uses a profile encoder similar to DAN [6]: a Deep-Pooling term-wise encoder
(1 layer with 20 units), the average pooling strategy, a pooling encoder (1 layer with 20 units),
and neither skill embedding encoder nor combination encoder.
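
The DAN-style profile encoder described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the implementation evaluated here: the ReLU activation, the random weights, the omission of dropout and all names (`dense`, `dan_profile_encoder`, `params`) are hypothetical. Only the layer sizes (20 units), the embedding size (20) and the BOW maximum length (200) come from the text.

```python
import numpy as np

EMB_DIM, UNITS, MAX_LEN = 20, 20, 200  # sizes quoted in the text


def dense(x, w, b):
    # Assumption: ReLU activation (the activation is not stated in this section).
    return np.maximum(x @ w + b, 0.0)


def dan_profile_encoder(term_emb, mask, params):
    """Encode a bag of terms into one profile vector.

    term_emb: (MAX_LEN, EMB_DIM) term embeddings, padded with arbitrary rows.
    mask:     (MAX_LEN,) 1.0 for real terms, 0.0 for padding.
    """
    h = dense(term_emb, params["w_term"], params["b_term"])   # term-wise encoder
    n_terms = max(float(mask.sum()), 1.0)
    pooled = (h * mask[:, None]).sum(axis=0) / n_terms        # masked average pooling
    return dense(pooled, params["w_pool"], params["b_pool"])  # pooling encoder


rng = np.random.default_rng(0)
params = {
    "w_term": rng.normal(size=(EMB_DIM, UNITS)), "b_term": np.zeros(UNITS),
    "w_pool": rng.normal(size=(UNITS, UNITS)), "b_pool": np.zeros(UNITS),
}
term_emb = rng.normal(size=(MAX_LEN, EMB_DIM))
mask = (np.arange(MAX_LEN) < 50).astype(float)  # a profile with 50 terms

profile_vec = dan_profile_encoder(term_emb, mask, params)

# Shuffling the terms (and their mask) leaves the profile vector unchanged.
perm = rng.permutation(MAX_LEN)
shuffled_vec = dan_profile_encoder(term_emb[perm], mask[perm], params)
```

Because the pooling is an average, the output does not depend on the order of the terms, which is the property that makes the composition unordered.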

5.4. Results
We trained the previously introduced models on the training split of our data set (310K
profile/skill positive examples and the same number of randomly generated negative examples;
see section 4.1 for details on the negative sampling procedure) and then evaluated them on the
test split (44K positive examples and the same number of randomly generated negative examples,
with a fixed random seed for all evaluations).
   Table 1 displays the evaluation metrics introduced in section 5.2 for each model. Figure 3
depicts the distribution of these metrics across all classes. We can see that the aggregated
metrics of all models are close.
   We noticed that many classes had a Precision, Recall and F1-Score of 0. Further analysis of
the per-class performances showed that the model was systematically predicting either 1 or 0
for some of them. We call these classes invalid classes and discuss this phenomenon in
section 6. The number of invalid classes can be a good metric to compare the models. For our
best performing model (DP-II), 165 classes out of 1113 (14.8%) are affected by this phenomenon;
most of them (157, i.e. 95%) are biased towards the negative class. Since this phenomenon is
easily detectable during evaluation, we know exactly which skills cannot be predicted and can
also report results without these classes.
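
Detecting invalid classes during evaluation can be sketched as follows. The function name and the input format (pairs of skill and predicted label) are hypothetical; the sketch only encodes the definition given above, namely a class for which every prediction is 0 or every prediction is 1.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def find_invalid_classes(
    predictions: Iterable[Tuple[str, int]]
) -> Tuple[List[str], List[str]]:
    """Return (always-negative skills, always-positive skills).

    predictions: (skill, predicted_label) pairs with labels in {0, 1},
    one pair per evaluation example.
    """
    seen = defaultdict(set)
    for skill, label in predictions:
        seen[skill].add(label)
    always_negative = sorted(s for s, labels in seen.items() if labels == {0})
    always_positive = sorted(s for s, labels in seen.items() if labels == {1})
    return always_negative, always_positive


# Hypothetical toy predictions.
toy = [("java", 1), ("java", 0), ("cobol", 0), ("cobol", 0), ("html", 1)]
always_negative, always_positive = find_invalid_classes(toy)
```

Here `cobol` is flagged as biased towards the negative class and `html` as biased towards the positive class, while `java` is valid.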
   Table 2 displays the evaluation metrics when invalid classes are removed, and Figure 4 depicts
the distribution of the evaluation metrics across classes for the best performing model (DP-II).
Even with the removal of invalid classes, the model performances remain widely spread across
classes. The most variable metric is recall: a quarter of the classes have a recall below 40%,
and half of the valid classes have a recall between 40% and 90%. The precision scores are less
variable, since the first and third quartiles are closer to the median: precision lies between
50% and 80% for 50% of the classes after removal of the invalid classes.
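
The quartile reading used above can be reproduced from a list of per-class scores with a short sketch such as the one below. The function name and the toy scores are hypothetical, and NumPy's default linear interpolation between order statistics is assumed.

```python
import numpy as np


def quartiles(scores):
    """Return (first quartile, median, third quartile) of per-class scores."""
    q1, median, q3 = np.percentile(np.asarray(scores, dtype=float), [25, 50, 75])
    return float(q1), float(median), float(q3)


# Hypothetical per-class recall scores (not taken from the paper).
toy_recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
q1, median, q3 = quartiles(toy_recalls)
interquartile_range = q3 - q1  # width of the interval holding the middle 50% of classes
```

A narrow interquartile range corresponds to a metric that varies little across classes, as observed for precision.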


6. Discussion
In this section we discuss the experimental results and use them to answer the research questions.
First we analyze the invalid classes problem in 6.1 and then we conclude on the performances
of the method in 6.2.

          |          F1            |         Recall         |       Precision        | Invalid
 Model    | micro  macro  weighted | micro  macro  weighted | micro  macro  weighted | classes
 DP-II    | 0.84   0.50   0.66     | 0.86   0.52   0.69     | 0.81   0.54   0.67     | 165
 DP-I     | 0.84   0.49   0.65     | 0.87   0.51   0.69     | 0.81   0.53   0.66     | 203
 DP-III   | 0.84   0.49   0.65     | 0.87   0.51   0.69     | 0.81   0.51   0.65     | 210
 ST-VI    | 0.84   0.48   0.65     | 0.86   0.51   0.68     | 0.82   0.52   0.66     | 202
 ST-IV    | 0.84   0.48   0.65     | 0.86   0.50   0.68     | 0.82   0.51   0.65     | 254
 DP-IV    | 0.84   0.47   0.64     | 0.86   0.49   0.68     | 0.81   0.51   0.65     | 238
 ST-VII   | 0.83   0.47   0.64     | 0.85   0.49   0.67     | 0.82   0.50   0.64     | 221
 ST-I     | 0.84   0.47   0.64     | 0.86   0.49   0.68     | 0.81   0.50   0.64     | 236
 ST-III   | 0.84   0.46   0.64     | 0.86   0.48   0.67     | 0.82   0.50   0.64     | 263
 ST-II    | 0.83   0.46   0.64     | 0.85   0.47   0.66     | 0.82   0.51   0.65     | 240
 ST-V     | 0.83   0.43   0.62     | 0.84   0.46   0.65     | 0.81   0.45   0.61     | 325
 DAN      | 0.82   0.41   0.60     | 0.82   0.44   0.63     | 0.82   0.45   0.61     | 290
Table 1: Model performances.


Figure 3: Metrics distribution across classes for the best model (DP-II)


6.1. The invalid classes problem
The aggregated performances of the different model variants are close and, according to the
experimental results, the main difference between them is the number of invalid classes. Indeed,
this metric varies by a factor of two between the best and the worst performing models. Due
to the small size of our data set and the stochastic nature of the negative example sampling
during training, we cannot provide a definitive explanation of this phenomenon, although
we can formulate some hypotheses.

          |          F1            |         Recall         |       Precision
 Model    | micro  macro  weighted | micro  macro  weighted | micro  macro  weighted
 ST-IV    | 0.83   0.60   0.71     | 0.86   0.63   0.74     | 0.80   0.65   0.72
 DAN      | 0.83   0.56   0.70     | 0.85   0.59   0.74     | 0.82   0.61   0.72
 DP-II    | 0.83   0.58   0.70     | 0.86   0.60   0.73     | 0.80   0.63   0.71
 ST-III   | 0.82   0.59   0.70     | 0.85   0.61   0.73     | 0.79   0.63   0.71
 ST-VI    | 0.82   0.58   0.69     | 0.85   0.60   0.73     | 0.79   0.62   0.70
 ST-II    | 0.82   0.57   0.69     | 0.84   0.59   0.72     | 0.80   0.63   0.71
 DP-IV    | 0.81   0.58   0.69     | 0.85   0.61   0.73     | 0.78   0.63   0.70
 DP-III   | 0.82   0.58   0.69     | 0.86   0.61   0.73     | 0.78   0.61   0.69
 ST-VII   | 0.82   0.57   0.68     | 0.84   0.59   0.72     | 0.79   0.61   0.69
 ST-V     | 0.80   0.58   0.68     | 0.84   0.62   0.72     | 0.77   0.61   0.69
 ST-I     | 0.81   0.58   0.68     | 0.84   0.61   0.72     | 0.77   0.62   0.69
 DP-I     | 0.81   0.57   0.67     | 0.85   0.60   0.71     | 0.77   0.62   0.69
Table 2: Model performances after invalid classes removal.


Figure 4: Metrics distribution across classes for the best model (DP-II) after invalid classes removal.


   The first hypothesis, already envisioned in section 4.1, links the imbalance of class
populations in the data set to the possible oversampling of some classes in the negative examples.
   A second hypothesis relates the model complexity (number of layers) to the number of invalid
classes. Indeed, models with more trainable layers tend to have more invalid classes, which
leads us to suspect over-fitting of the models on our small data set.
   Further work is required to correlate the class distributions in the training data set with
the performances of the model for these classes and, if a link is established, to develop a new
negative sampling method. To confirm or rule out over-fitting of the model, our approach should
be applied to a larger data set. For now, we consider that this shortcoming can be overcome by
temporarily removing the unpredictable skills from the dictionary when re-using the model for
inference.
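
The workaround described above amounts to filtering the candidate skills at inference time. A minimal sketch is given below; the function name, the per-skill score dictionary and the 0.5 decision threshold are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def predict_skills(
    scores: Dict[str, float], invalid_skills: Set[str], threshold: float = 0.5
) -> List[str]:
    """Return the predicted skills for one profile, skipping invalid classes.

    scores: hypothetical mapping from each skill to the binary classifier output.
    invalid_skills: skills detected as invalid classes during evaluation.
    """
    # Assumption: a skill is predicted when its score reaches the threshold.
    return sorted(
        skill
        for skill, score in scores.items()
        if skill not in invalid_skills and score >= threshold
    )


# Hypothetical scores for one profile.
toy_scores = {"java": 0.9, "cobol": 0.8, "html": 0.2}
predicted = predict_skills(toy_scores, invalid_skills={"cobol"})
```

In this toy case `cobol` is never proposed, whatever its score, because the model is known to be unreliable for it.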

6.2. Performances of the skill prediction
The distributions of per-class performances of our best model are wide, and a significant number
of skills cannot be predicted reliably. However, for 50% of the skills (556), our best model
achieves an F1-Score of at least 60%. For 25% of the skills (278), the F1-Score even exceeds
80%. We believe these performances to be encouraging for a single model, especially when
compared with the small number of skills predictable in related works [12, 11]. Furthermore, the
work presented here did not focus on building the skills dictionary, hence the performances
of the model have been hindered by the lack of pre-processing of the extracted skills. Indeed,
the skills dictionary was built from free-text surface forms without any normalization.
This left our skills dictionary with duplicates such as xsl, xslt and xsl/xslt, or vbscript
and vb script. Although we cannot demonstrate the impact of the absence of canonicalization of
the skills surface forms, we can hypothesize that:

    • The existence of duplicate skills prevents the skills embedding step from adequately
      capturing the similarities between them;
    • A duplicate skill can be sampled for a negative example while a synonymous skill is used
      as a positive example for the same profile.
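
To make the duplicate problem concrete, the sketch below groups surface forms that share a naive canonical key. This normalization was not applied in the work presented here; the key (lowercasing and dropping every character other than letters, digits, `+` and `#`) and the function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def canonical_key(surface_form: str) -> str:
    # Assumption: keep only letters, digits, '+' and '#' (so "c++" and "c#" stay distinct).
    return re.sub(r"[^a-z0-9+#]", "", surface_form.lower())


def group_duplicates(surface_forms: Iterable[str]) -> Dict[str, List[str]]:
    """Map each canonical key shared by several surface forms to those forms."""
    groups = defaultdict(set)
    for form in surface_forms:
        groups[canonical_key(form)].add(form)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


duplicates = group_duplicates(["xsl", "xslt", "xsl/xslt", "vbscript", "vb script"])
```

Such a key merges vbscript and vb script, but it leaves xsl, xslt and xsl/xslt apart, which shows that spelling-level rules alone would not remove every duplicate.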

  We conclude that the answers to our research questions are:

    • RQ-1: the architecture of our model is able to capture valuable information for skill
      prediction for a significant number of skills;
    • RQ-3: for a significant number of skills, our approach to multi-label classification yields
      encouraging results. Further work is required to correlate the similarity between skills
      hidden representations with the per-class model performances.

  An answer to RQ-2 can be provided by analyzing per-class performances. To discuss this
research question, we provide the performances of the best model on a few rare skills along
with the number of examples in the test data set:

    • zookeeper: 49 negative examples, 6 positive examples - F1-Score 90.9%;
    • windows server 2008: 32 negative examples, 4 positive examples - F1-Score 88.9%;
    • wan: 43 negative examples, 4 positive examples - F1-Score 66.7%;
    • visual force: 32 negative examples, 2 positive examples - F1-Score 100%.

  Although the model fails for some rare skills, it performs well for others. This is an
encouraging result for RQ-2, as it shows that our approach has the potential to predict rare
skills accurately.


7. Conclusion
In this paper, we presented a method to tackle skill prediction with a deep learning model using
unordered composition functions for the handling of textual data. We also introduced a method
for transforming a multi-label classification task to a binary classification task.
   To evaluate this method, we carried out an experimental evaluation on a small open data set
and showed that this approach is promising for skill prediction. The proposed evaluation
framework allows for a thorough evaluation of the model performances, both overall and for
specific skills.
   The approach to skill prediction introduced in this paper allows a job recommender system
to close the gap between the declared skills of an individual and their actual skills. The proposed
deep learning architecture allows for more lightweight models than the syntactic models of
the state of the art, especially the ones relying on document modeling through a hierarchical
architecture, thus allowing faster inference.
   Further work will study the integration of this model in a job recommender system for skill
prediction of both profiles and job offers prior to the matching between them. As part of future
work, we also plan to enhance this method by addressing the discussed shortcomings, and
to evaluate it on other multi-label text classification tasks similar to the one tackled in this
paper, i.e. tasks with numerous possible classes, implicit relationships between them and
specialized texts to classify.


8. Acknowledgments
This research has been funded by the French government through the Association Nationale
Recherche et Technologie (ANRT) under the CIFRE convention 2018/0400 between Aix-Marseille
University and MOBEN & ROOSTER7 .


References
    [1] E. Tinelli, A. Cascone, M. Ruta, T. Di Noia, E. Di Sciascio, F. M. Donini, I.M.P.A.K.T.: An
        innovative semantic-based skill management system exploiting standard SQL, in: ICEIS
        2009 - 11th International Conference on Enterprise Information Systems, Proceedings,
        volume AIDSS, 2009, pp. 224–229. doi:10.5220/0002008802240229.
    [2] K. Stencel, A. Janusz, K. Ciebiera, D. Ślȩzak, S. Stawicki, M. Drewniak, How to Match
        Jobs and Candidates - A Recruitment Support System Based on Feature Engineering and
        Advanced Analytics, 2018, pp. 503–514. URL: https://www.researchgate.net/publication/
        325207539. doi:10.1007/978-3-319-91476-3_42.
    [3] M. Bastian, M. Hayes, W. Vaughan, S. Shah, P. Skomoroch, H. J. Kim, S. Uryasev, C. Lloyd,
        LinkedIn skills: large-scale topic extraction and inference, in: RecSys ’14, 2014.
    [4] F. Javed, P. Hoang, T. Mahoney, M. McNair, Large-scale occupational skills normalization
        for online recruitment, in: AAAI, 2017.

7
    https://www.mobenrooster.com/
 [5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
     I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
     R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing
     Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/
     paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
 [6] M. Iyyer, V. Manjunatha, J. L. Boyd-Graber, H. Daumé, Deep unordered composition rivals
     syntactic methods for text classification, in: ACL, 2015.
 [7] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, Y. Teh, Set transformer: A framework for
     attention-based permutation-invariant neural networks, in: ICML, 2019.
 [8] M. Zhao, F. Javed, F. Jacob, M. McNair, Skill: A system for skill identification and normal-
     ization, in: AAAI, 2015.
 [9] M. le Vrang, A. Papantoniou, E. Pauwels, P. Fannes, D. Vandensteen, J. Smedt, ESCO:
     Boosting job matching in Europe with semantic interoperability, Computer 47 (2014)
     57–64.
[10] V. S. Dave, B. Zhang, M. Hasan, K. AlJadda, M. Korayem, A combined representation
     learning approach for better job and skill recommendation, Proceedings of the 27th ACM
     International Conference on Information and Knowledge Management (2018).
[11] Z. Wang, S. Li, H. Shi, G. Zhou, Skill inference with personal and skill connections, in:
     COLING, 2014.
[12] F. F. J. Kameni, N. Tsopzé, Skills prediction based on multi-label resume classification
     using CNN with model predictions explanation, Neural Comput. Appl. 33 (2021) 5069–5087.
[13] S. Bian, X. Chen, W. X. Zhao, K. Zhou, Y. Hou, Y. Song, T. Zhang, J.-R. Wen, Learning
     to match jobs with resumes from sparse interaction data using multi-view co-teaching
     network, Proceedings of the 29th ACM International Conference on Information &
     Knowledge Management (2020).
[14] M. Schlichtkrull, T. Kipf, P. Bloem, R. van den Berg, I. Titov, M. Welling, Modeling relational
     data with graph convolutional networks, ArXiv abs/1703.06103 (2018).
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: NAACL, 2019.
[16] X. Zhang, F. Wei, M. Zhou, HIBERT: Document level pre-training of hierarchical bidirectional
     transformers for document summarization, in: ACL, 2019.
[17] C. Qin, H. Zhu, T. Xu, C. Zhu, L. Jiang, E. Chen, H. Xiong, Enhancing person-job fit for
     talent recruitment: An ability-aware neural network approach, The 41st International
     ACM SIGIR Conference on Research & Development in Information Retrieval (2018).
[18] S. Maheshwary, H. Misra, Matching resumes to jobs via deep siamese network, Companion
     Proceedings of the The Web Conference 2018 (2018).
[19] T. Dozat, Incorporating Nesterov momentum into Adam, 2016.