Design of Negative Sampling Strategies
for Distantly Supervised Skill Extraction
    Jens-Joris Decorte1,2,∗ , Jeroen Van Hautte2 , Johannes Deleu1 , Chris Develder1 and
    Thomas Demeester1
1 Ghent University – imec, 9052 Gent, Belgium
2 TechWolf, 9000 Gent, Belgium


Abstract

Skills play a central role in the job market and many human resources (HR) processes. In the wake of other digital experiences, today's online job market has candidates expecting to see the right opportunities based on their skill set. Similarly, enterprises increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured information about skills is often missing, and processes building on self- or manager-assessment have been shown to struggle with issues around adoption, completeness, and freshness of the resulting data. These challenges can be tackled using automated techniques for skill extraction. Extracting skills is a highly challenging task, given the many thousands of possible skill labels mentioned either explicitly or merely described implicitly, and the lack of finely annotated training corpora. Previous work on skill extraction overly simplifies the task to an explicit entity detection task or builds on manually annotated training data that would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction, based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements, and combining three different strategies in one model further increases the performance, up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for research purposes to stimulate further research on the task.

Keywords
Skill Extraction, Information Extraction, Distant Supervision, Extreme Multi-Label Classification



RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
∗ Corresponding author.
jensjoris@techwolf.ai (Jens-Joris Decorte); jeroen@techwolf.ai (Jeroen Van Hautte); johannes.deleu@ugent.be (Johannes Deleu); chris.develder@ugent.be (Chris Develder); thomas.demeester@ugent.be (Thomas Demeester)
https://www.techwolf.ai (Jens-Joris Decorte); https://www.techwolf.ai (Jeroen Van Hautte)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).


1. Introduction

Skill extraction is an information extraction task that aims to identify all skills mentioned in a text. It is essential for many HR applications, such as resume screening and job recommendation systems. A comparative survey on skill extraction indicates that research interest has steadily grown over the last decade [1]. Traditionally, skill extraction has been approached as finding and disambiguating entities in texts. These methods typically rely on a named entity recognition (NER) component based on phrase matching or a trained LSTM model [2, 3, 4]. However, skills are often present implicitly, as longer sequences of words (which we refer to as spans) or full sentences, rather than being mentioned explicitly: over 85% of unique required skills in job ads have been reported never to be explicitly mentioned [5]. Very recently, the work titled SkillSpan has reformulated skill extraction as a more flexible span detection task [6]. The authors released a dataset of job postings with span annotations and trained SpanBERT-based models to detect skill spans as a sequence labeling task. In follow-up work, the authors developed a classification model to link such a span to the corresponding coarse-grained skill group in ESCO [7]. To overcome the difficulty of labeling these spans, the authors relied on weak supervision by automatically selecting labels based on the ESCO search API [7]. Another study manually annotated job ads with soft skills, which were consolidated into a released dataset called FIJO [8]. However, instead of using an exhaustive list of soft skills, they only incorporated four broad labels to decrease the difficulty of the annotation. The skill extraction task can also be reduced to binary skill detection, again reducing the challenge compared to fine-grained skill extraction [9]. These works follow a more relaxed formulation of skill extraction, but they all suffer from the difficulty of annotating a fine-grained training dataset.
[Figure 1 (diagram): job posting corpus (1) → distant supervision with ESCO → distantly supervised training corpus (2) → data sampling → binary training data (3) → per-skill binary classifiers on top of a fixed RoBERTa encoder (4).]

Figure 1: Overview of our method. Using the ESCO skill taxonomy, the distantly supervised training corpus (2) is created from our job posting corpus (1). Based on the negative sampling strategy, the positive data is combined with negative examples (3), and finally a classifier is trained for each skill (4).



Some work avoids this labeling difficulty completely by using readily available labeled datasets. For example, in [5], an eXtreme Multi-Label Classification (XMLC) model was trained based on a corpus of job ads with attached skills provided by an online job ads platform. However, the authors reported that for that corpus, at least 40% of the vacancies missed 20% of the explicitly stated skills in their labels. Recent work [10] successfully reconstructed the BERT-XMLC approach on Dutch vacancy texts using the Dutch RobBERT model [11]. The training dataset used for this work is, however, based on the output of an existing commercial skill extraction solution.

We propose a new end-to-end approach to fine-grained skill extraction that does not rely on a large hand-labeled training corpus. Instead, we ease the requirements on the training data such that it can be automatically collected through distant supervision. We cast the multi-label skill classification task into independent binary classification problems, with skills labeled on the sentence level, to encompass both explicit and implicit skill descriptions. To the best of our knowledge, our work is the first one to tackle fine-grained skill extraction in such a flexible distant supervision setup. Our distant supervision training set contains few false positives, due to the literal matching of known skills, which is a task with low ambiguity. However, we expect many false negatives, for skills not literally mentioned. This is quantified in Section 4.1. We investigate to what extent the distantly supervised training set can be leveraged at maximum effectiveness to train a fine-grained skill extraction system. To that end, we design a number of negative sampling strategies that can be used to tune the extraction model training process on a small annotated development set, covering only a fraction of all potential skills (0.2%, to be precise, in our experimental setting). Finally, in order to stimulate research on automated skill extraction, and to facilitate the comparison of future models with our results, we release¹ our development and test data, which is constructed on top of the SkillSpan dataset [6], adding annotations with the ESCO [12] skill labels.

¹ https://github.com/jensjorisdecorte/Skill-Extraction-benchmark


2. Related Work

Multi-label classification datasets often have a skewed label distribution, with many labels occurring only a few times or even being completely absent in the training data. Some works have focused on improving the few-shot and zero-shot classification performance of multi-label text classification on these rare or unseen labels. Typically, the information in structured label graphs (such as label descriptions or relations) or word embeddings is used as an input to the system in order to generalize to unseen labels [13, 14]. However, these methods still rely on a large labeled training dataset to work. In the absence of any supervision, [15] uses a novel self-supervision training objective to train a dense sentence representation model that is used to assign labels based on cosine similarity in the learned space. Yin et al. [16] propose an entailment approach to zero-shot text classification, where the input text is called the premise, and a hypothesis is constructed for each label using the template "the text is about label". The premise and hypothesis are concatenated before being presented to a BERT-based model for prediction, making this method slow at inference for large label spaces.

Multi-label classification datasets not only suffer from the rare label problem; many labels are also simply missing, since datasets are usually only partially labeled: instances without labels thus may either be truly negative, or positive but not identified as such during labeling. The "Single Positive Labels" scenario is an extreme case of missing labels, where only one positive label is available for each training instance [17]. Research on this topic is limited, and typically focuses on designing custom loss functions [18] or online estimation of the missing labels during training [17]. This line of work is closely related to "Positive-Unlabeled" (PU) binary classification, which is typically also tackled using custom loss functions [19].
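To make the entailment formulation of [16] described above concrete, the sketch below shows how such zero-shot labeling is typically run with an off-the-shelf NLI model. The model name, sentence, and candidate labels are our own illustrative choices, not taken from [16] or from our experiments; note that every label costs one premise–hypothesis forward pass, which is why the approach scales poorly to large label spaces.

# Illustrative sketch of entailment-based zero-shot labeling in the style of [16].
# Model name, sentence, and labels are arbitrary examples (not from the paper).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = "You will design and maintain our CI/CD pipelines."
candidate_labels = ["use scripting programming", "manage musical staff", "disarm land mine"]

# Every label becomes a hypothesis scored against the sentence (the premise) by the
# NLI model, one premise-hypothesis pair per label.
result = classifier(sentence, candidate_labels,
                    hypothesis_template="The text is about {}.")
print(result["labels"], result["scores"])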
Typically, in a distant supervision setup, the labeling function is followed by a filtering step that aims to reduce the number of false positives in the labels [20]. However, we find that the number of false positives produced by the distant supervision step is low in our case of literal skill mentions. This has been shown previously by [21], where literal skill mentions were successfully used as distant supervision for the task of job title representation learning. Rather than focusing on a filtering step, we draw inspiration from the idea of "hard negative examples" in representation learning to improve the learning process. In contrastive learning, hard negative examples refer to samples that are difficult to distinguish from an anchor point [22]. This approach improves the discriminative abilities and downstream performance of unsupervised representation learning methods. We adapt this idea to the multi-label classification setup by oversampling negative examples from related labels. More details on this approach are contained in the following section.


3. Skill Extraction Approach

We approach the task of skill extraction as a sentence-level multi-label classification task. A high-level overview of the method is shown in Fig. 1. Our method uses distant supervision based on the ESCO skill taxonomy to automatically assign (partial) skill labels for a given set of sentences from the HR domain (in particular, mined from vacancies). Negative sampling strategies are used to combine 'positive sentences' for a given skill (i.e., sentences labeled with that skill during the distant supervision step) with sentences not containing that skill (referred to as 'negative sentences'). Finally, a binary classifier f_s is trained for each skill s, based on the constructed positive and negative sentences for that skill. It consists of a logistic regression classifier on top of a (frozen) representation for the sentences, as described in more detail below.

Distantly supervised training set: Given a set S of skills and a background corpus of sentences D, for each skill s ∈ S, a set P_s of positive sentences is collected from D through distant supervision. In particular, we use the ESCO [12] skills taxonomy as the set of classification labels. The set P_s of positive sentences for each skill s consists of those sentences in D that literally mention the skill s or any of its alternative forms, as provided in the taxonomy. This assumes that there are no ambiguous skill names, which holds in most cases as skill names tend to be specific. The positive labels are very precise, due to the distant supervision process based on literal matches with the highly specific ESCO skill names. However, this means potentially many skills remain unlabeled, i.e., the training data is prone to false negatives. After the distant supervision step, on average 365 sentences were labeled per skill (for the set of 13,891 ESCO skills). This dataset follows a long-tail distribution, with 75.1% of skills occurring in only ten or fewer sentences.

Skill extraction model: The model architecture is depicted in Fig. 1. We use a frozen pre-trained RoBERTa [23] model with mean pooling to transform input sentences into fixed-length contextual representations, before presenting them for classification. The classification is performed by separate binary text classification models f_s, each generating an independent prediction value for their respective skill label s. In contrast to a typical multi-label model, we optimize each classification model separately on a different corresponding dataset, instead of training all weights together.

Training with negative sampling: P_s serves as positive training data for classifier f_s, and negative examples are sampled from the union of all positive sentence datasets of all other skills. The basic mechanism for sampling negatives is uniform sampling from this union. However, following the ideas in representation learning [22], we hypothesize that sentences from related skills are more informative, harder to distinguish from the positive sentences (i.e., closer to the decision boundary), and could thus improve the learning process. As such, a fraction of the negative examples is sampled specifically from sentences that are labeled with a related (but different) label to skill s. We refer to these sentences as "hard" negative samples. Our negative sampling strategy is thus defined by two important factors. First, the fraction of uniformly sampled negatives versus hard negative samples is important. Secondly, how we define whether two skills are related is crucial to the learning process. We introduce three different strategies for selecting the related skills in Section 3.1.

Inference and evaluation: The final model is used to rank the relevance of all skills for a given sentence. Similar to [14], we use the macro-averaged R-Precision@K (RP@K) metric to evaluate the performance of the method. Since predictions are made on a sentence basis, we restrict the evaluation to low values of K. RP@K is defined in (1), where the quantity Rel(n, k) is a binary indicator of whether the k-th ranked label is a correct label for data sample n, and R_n is the number of gold labels for sample n. In addition, we use the mean reciprocal rank (MRR) of the highest-ranked correct label as an indicator of the ranking quality. More information on the evaluation is presented in Section 4.1.

    RP@K = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{Rel(n, k)}{\min(K, R_n)}    (1)
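As a rough illustration of the pipeline described above, the following sketch (our own simplification, not the authors' code) builds the distantly supervised positive sets P_s by literal matching of skill names and alternative forms, and trains one binary classifier per skill on frozen sentence embeddings, mixing uniformly sampled negatives with a small fraction of hard negatives drawn from related skills. The related-skill sets come from the strategies of Section 3.1; the toy skill records, corpus sentences, and the choice of sentence-transformers model are assumptions made purely for the example.

# Sketch (not the authors' code): distant supervision by literal matching, and
# per-skill binary classifiers with mixed uniform / hard negative sampling.
import random
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Assumed inputs: ESCO-like skill records (name -> alternative forms) and a corpus.
skills = {
    "manage musical staff": ["manage musical staff", "coordinate musical staff"],
    "Haskell": ["Haskell"],
}
corpus = [
    "You will manage musical staff for our touring productions.",
    "We are looking for a developer with solid Haskell experience.",
]

# 1) Distant supervision: a sentence is a positive for skill s if it literally
#    mentions the skill name or one of its alternative forms.
positives = {
    s: [sent for sent in corpus
        if any(form.lower() in sent.lower() for form in forms)]
    for s, forms in skills.items()
}

# 2) Frozen sentence encoder (a mean-pooled RoBERTa model in the paper; any
#    sentence-transformers model serves to illustrate the idea).
encoder = SentenceTransformer("all-distilroberta-v1")

def train_classifier(skill, related_skills, hard_fraction=0.05, neg_per_pos=10):
    """Train one binary classifier f_s; `related_skills` is produced by one of the
    strategies of Section 3.1 (siblings / Levenshtein / embedding)."""
    pos = positives[skill]
    uniform_pool = [sent for s, sents in positives.items() if s != skill for sent in sents]
    hard_pool = [sent for s in related_skills for sent in positives.get(s, [])]

    n_neg = neg_per_pos * len(pos)
    n_hard = min(int(hard_fraction * n_neg), len(hard_pool))
    negatives = (random.sample(hard_pool, n_hard)
                 + random.choices(uniform_pool, k=n_neg - n_hard))

    X = encoder.encode(pos + negatives)
    y = [1] * len(pos) + [0] * len(negatives)
    return LogisticRegression(C=0.1, max_iter=1000).fit(X, y)  # C = 0.1, cf. Appendix B

# Example usage (toy data, so the related skill is arbitrary here).
clf = train_classifier("manage musical staff", related_skills=["Haskell"])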
Skill                     Siblings                        Levenshtein                  Embedding
disarm land mine          ensure flock safety             find land mines              repair mine machinery
                          protect important clients       search for land mines        handle mining plant waste
                          signal for explosion            identify land mines          management of mine ventilation
                          deal with challenging people    dismantle machines           construct road base
Haskell                   DevOps                          add smell                    PostgreSQL
                          XQuery                          upsell                       Erlang
                          Windows Phone                   sink wells                   JavaScript
                          SPARK                           speak well                   C++
manage musical staff      discharge employees             manage musical groups        manage agricultural staff
                          manage volunteers               manage musical events        manage staff
                          supervise nursing staff         manage musicians             manage dental staff
                          guide staff                     manage educational staff     manage educational staff

Table 1
Examples of related skill labels for the three different selection strategies. The "siblings" examples are in no particular order as they form a set of siblings, rather than an ordered list.
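Table 1 shows example outputs of the three related-skill selection strategies defined in Section 3.1 below. A minimal sketch of how such related-skill sets could be computed is given here; it is our own illustration (the paper does not publish this code), assuming ESCO-style records with a broader-concept field, and using the python-Levenshtein and sentence-transformers packages as stand-ins for the string and embedding similarities.

# Sketch (our illustration) of the three related-skill selection strategies of Section 3.1.
import Levenshtein                      # "python-Levenshtein" package, used as a stand-in
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed ESCO-like records: skill name -> broader (parent) concept identifiers.
# The parent concept names below are invented for the example.
broader = {
    "disarm land mine": ["handling explosives"],
    "ensure flock safety": ["handling explosives"],
    "manage musical staff": ["managing artistic staff"],
}
skill_names = list(broader)
encoder = SentenceTransformer("all-distilroberta-v1")   # stand-in for mean-pooled RoBERTa

def siblings(skill):
    """All skills sharing at least one broader concept with `skill` (ESCO hierarchy)."""
    parents = set(broader[skill])
    return [s for s in skill_names if s != skill and parents & set(broader[s])]

def levenshtein_top(skill, k=100):
    """The k skills with the smallest edit distance to the skill name."""
    others = [s for s in skill_names if s != skill]
    return sorted(others, key=lambda s: Levenshtein.distance(skill, s))[:k]

def embedding_top(skill, k=100):
    """The k skills whose name embeddings have the highest cosine similarity."""
    vectors = encoder.encode(skill_names)
    sims = cosine_similarity(vectors[[skill_names.index(skill)]], vectors)[0]
    ranked = [skill_names[i] for i in np.argsort(-sims) if skill_names[i] != skill]
    return ranked[:k]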



3.1. Negative Sampling Strategies

Rather than randomly sampling negative examples for training each binary skill classifier, we assume that sampling more informative negatives will likely lead to a more efficient training procedure. Instead of sampling hard negative sentences directly, we first identify related (yet different) skills, and then sample sentences with those labels. We introduce three different strategies for identifying such related skills, which we analyze through the experiments defined in Section 4. The considered sets of related skills, given a particular skill s, are obtained as follows:

• Siblings: all skills that share a parent concept with s, as indicated by the "broader concepts" field in ESCO.
• Levenshtein: the top 100 skills closest to s, according to their Levenshtein distance.
• Embedding: the top 100 skills closest to s in terms of cosine similarity with their mean-pooled RoBERTa-encoded skill name representations.

For each of the negative sampling strategies, some example ESCO skills with their related labels according to the strategy are shown in Table 1.


4. Experimental setup

4.1. Evaluation

While hand-labeling a training dataset for skill extraction is infeasible (given the huge number of skills, e.g., over 13k in ESCO), we argue that with reasonable manual work, it is possible to construct a benchmark that can be used to compare the performance of different models. We build upon the test set of the SkillSpan dataset from [6], which contains job posting sentences annotated with skill spans. We manually annotate each span in SkillSpan with its corresponding ESCO skill (if it exists). This span-based multi-class annotation is less complex than annotating complete sentences with multiple labels. The process is performed on the test sets of the publicly released subsets TECH and HOUSE. Details on the annotation guidelines can be found in Appendix A. The annotation effort results in fine-grained ESCO skill labels for 64.5% of the spans. We split this dataset into a validation and test set using a 20%/80% split. The validation set contains 165 unique skill labels, and over 80% of the unique skill labels in the test set never occur in the validation set. A more detailed breakdown of the number of spans and annotations is shown in Table 2.

                                TECH              HOUSE
                              val    test       val    test
    # sentences               470    1882       243    973
    # spans                   262    1024       191    786
    # spans with ESCO label   152    644        131    532

Table 2
Benchmark dataset statistics on the number of sentences, spans, and ESCO-labeled spans, for both the TECH and HOUSE partitions of the SkillSpan dataset. Numbers of the validation and test splits are indicated in the table.

In order to verify our hypothesis that the distant supervision labeling leads to quite precise positive labels, at the cost of many false negatives, we validated the distant supervision labeling of the test set against the manual annotations. The automatically assigned labels are indeed rather precise (overall precision of 79%), but at the cost of low coverage (i.e., a recall of 14.6%).
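For concreteness, the sketch below shows how predictions could be ranked and scored with the metrics used in this evaluation: macro-averaged RP@K from Eq. (1) and MRR. It is our own reference implementation of those definitions, with hypothetical input structures (a dictionary of per-skill classifiers, per-sentence label rankings, and gold label sets).

# Sketch of ranking and of the evaluation metrics: macro-averaged RP@K (Eq. (1)) and MRR.
def rank_skills(sentence_embedding, classifiers):
    """Rank all skills for one sentence by the positive-class probability of each f_s."""
    scores = {s: clf.predict_proba([sentence_embedding])[0, 1]
              for s, clf in classifiers.items()}
    return sorted(scores, key=scores.get, reverse=True)

def rp_at_k(rankings, gold_sets, k):
    """rankings: ranked label list per sample (best first); gold_sets: gold labels per sample."""
    total = 0.0
    for ranking, gold in zip(rankings, gold_sets):
        hits = sum(1 for label in ranking[:k] if label in gold)   # sum_k Rel(n, k)
        total += hits / min(k, len(gold))                         # / min(K, R_n)
    return total / len(rankings)

def mrr(rankings, gold_sets):
    """Mean reciprocal rank of the highest-ranked correct label (0 if none is ranked)."""
    total = 0.0
    for ranking, gold in zip(rankings, gold_sets):
        for rank, label in enumerate(ranking, start=1):
            if label in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)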
[Figure 2 (plots): panel (a) TECH, panel (b) HOUSE.]

Figure 2: Evaluating the effect of the fraction of hard negatives used during training, for each of the three strategies (siblings, levenshtein and embedding-based similarity with the considered positive skill) separately. The baseline model performance without hard negative sampling is shown by the horizontal red line. Metrics are reported on the validation sets.



4.2. Experiments

The sentences used for training are collected from a large proprietary corpus of public job postings. This dataset has been collected from different public job boards and contains a large number of English job postings. ESCO is used for the distant supervision step: a skill label is assigned when the skill itself, or one of its alternative forms provided by ESCO, is literally mentioned in a sentence. For each skill classifier f_s, a maximum of one thousand positive sentences is retained. The number of negative examples per positive example is set to 10. We train a baseline classifier without hard negative sampling. In this case, all negative examples are sampled uniformly from the other positive corpora. To investigate the optimal hard negative sampling procedure, we conduct a hyper-parameter search for the fraction of negatives sampled using the three strategies (siblings, levenshtein, embedding) versus uniform sampling. Based on the performance on the validation sets, we decide on an optimal value for this percentage. Finally, we report the contribution of each of the negative sampling strategies when combined. This is reported based on performance on the unseen test set, and contributions of the strategies are shown through ablations, by leaving one strategy out at a time. We refer to Appendix B for more details on the training procedure.


5. Results and Discussion

The results of the hyper-parameter search for each of the negative sampling strategies are shown in Fig. 2. From these results, it is clear that the different strategies have different effects on the model performance. Most notably, we find that the optimal fraction of hard negative sampling is no higher than 5% for any strategy. This is in line with previous findings on hard negative sampling [22]. Sampling large amounts of hard negatives even has a large negative impact on the performance of the model. Secondly, the "levenshtein" strategy brings the least improvements out of all three strategies.

Finally, we trained a model that combines all strategies. Based on the results of the above hyper-parameter search, we chose 5% as an optimal value for the fraction of negatives sampled through the combined hard negative strategies. To assess the impact of each of the strategies within this combination, we trained three more models in which each of the three strategies is left out respectively. The performance of these final models is shown in Table 3. The combination of all three strategies yielded the overall best model. This model has large performance gains across the MRR and RP@K metrics for both the TECH and HOUSE datasets.

Leaving out the "Levenshtein" strategy has a relatively low impact on the performance. This might be understood by looking at the examples in Table 1: string similarity surfaces unrelated skills, for example for proper nouns such as Haskell. This could partially explain the relatively low utility of this negative sampling strategy. On the other hand, leaving out the "siblings" strategy takes away the largest part of the performance improvements. This strategy makes use of the hierarchy defined in the ESCO taxonomy, and thus is a reliable method for selecting informative hard negatives. The effect of the "embedding" strategy is comparable to the "siblings" strategy, and it thus proves a good alternative in case a hierarchy such as the one in ESCO is not available.
                                                     TECH                       HOUSE
Model                                     MRR     RP@5    RP@10     MRR     RP@5    RP@10
Baseline classifier                       0.246   23.65   33.71     0.255   26.66   34.19
Classifier_neg                            0.326   31.71   39.09     0.299   30.82   38.69
Classifier_neg without embeddings         0.323   31.43   39.19     0.298   29.09   37.70
Classifier_neg without Levenshtein        0.339   31.11   38.55     0.298   30.14   37.22
Classifier_neg without siblings           0.303   30.57   37.07     0.281   29.20   35.91

Table 3
Evaluation metrics of the final skill extraction models on the TECH and HOUSE test sets. Reported metrics are mean reciprocal rank (MRR) and R-Precision at 5 and at 10 (RP@5, RP@10).



6. Conclusion and Future Work

We propose an end-to-end approach to skill extraction using distant supervision. The method is able to make fine-grained skill predictions (using 13,891 skills from ESCO) for a given input sentence. We introduce the idea of hard negative sampling through related labels in a multi-label classification setup and propose three different strategies to select these related labels. We investigate the impact of each of the strategies, and find that all three strategies combined yield the highest increase on top of a baseline model without hard negative sampling. Both the distant supervision and the hard negative sampling are designed to work well without manual labeling, which makes the whole method very flexible. To the best of our knowledge, we are the first to design such a system for skill extraction, and we improve on prior work by providing methods that relax the requirements on ground-truth data and that have the ability to make very fine-grained skill predictions. Finally, we release our hand-labeled test and validation dataset for skill extraction to stimulate further research on the task.

Future work could entail a more extensive investigation of other hyper-parameters, such as the number of negatives per positive sentence (k), which was fixed to 10 in this work. Secondly, more performance gains could be made if the RoBERTa weights were fine-tuned during training, but this requires changes in the training setup which should be carefully investigated. Lastly, it could be interesting to investigate how limited manual labor can maximally improve the performance of the method even further with techniques such as active learning.


Acknowledgments

We thank the anonymous reviewers for their valuable feedback. This project was funded by the Flemish Government, through Flanders Innovation & Entrepreneurship (VLAIO, project HBC.2020.2893).


References

[1] I. Khaouja, I. Kassou, M. Ghogho, A survey on skill identification from online job ads, IEEE Access 9 (2021) 118134–118153.
[2] M. Zhao, F. Javed, F. Jacob, M. McNair, SKILL: A system for skill identification and normalization, in: Twenty-Seventh IAAI Conference, 2015.
[3] L. Sayfullina, E. Malmi, J. Kannala, Learning representations for soft skill matching, in: International Conference on Analysis of Images, Social Networks and Texts, Springer, 2018, pp. 141–152.
[4] S. Jia, X. Liu, P. Zhao, C. Liu, L. Sun, T. Peng, Representation of job-skill in artificial intelligence with knowledge graph analysis, in: 2018 IEEE Symposium on Product Compliance Engineering - Asia (ISPCE-CN), IEEE, 2018, pp. 1–6.
[5] A. Bhola, K. Halder, A. Prasad, M.-Y. Kan, Retrieving skills from job descriptions: A language model based extreme multi-label classification framework, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5832–5842. URL: https://aclanthology.org/2020.coling-main.513. doi:10.18653/v1/2020.coling-main.513.
[6] M. Zhang, K. N. Jensen, S. D. Sonniks, B. Plank, SkillSpan: Hard and soft skill extraction from English job postings, arXiv preprint arXiv:2204.12811 (2022).
[7] M. Zhang, K. N. Jensen, B. Plank, Kompetencer: Fine-grained skill classification in Danish job postings via distant supervision and transfer learning, arXiv preprint arXiv:2205.01381 (2022).
[8] D. Beauchemin, J. Laumonier, Y. L. Ster, M. Yassine, "FIJO": a French insurance soft skill detection dataset, arXiv preprint arXiv:2204.05208 (2022).
[9] D. A. Tamburri, W.-J. Van Den Heuvel, M. Garriga, DataOps for societal intelligence: A data pipeline for labor market skills extraction and matching, in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), IEEE, 2020, pp. 391–394.
[10] N. Vermeer, V. Provatorova, D. Graus, T. Rajapakse, S. Mesbah, Using RobBERT and eXtreme multi-label classification to extract implicit and explicit skills from Dutch job descriptions (2022).
[11] P. Delobelle, T. Winters, B. Berendt, RobBERT: A Dutch RoBERTa-based language model, arXiv preprint arXiv:2001.06286 (2020).
[12] ESCO, European skills, competences, qualifications and occupations, EC Directorate E (2017).
[13] J. Lu, L. Du, M. Liu, J. Dipnall, Multi-label few/zero-shot learning with knowledge aggregated from multiple label graphs, arXiv preprint arXiv:2010.07459 (2020).
[14] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, arXiv preprint arXiv:1905.10892 (2019).
[15] Y. Xiong, W.-C. Chang, C.-J. Hsieh, H.-F. Yu, I. Dhillon, Extreme zero-shot learning for extreme text classification, arXiv preprint arXiv:2112.08652 (2021).
[16] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, arXiv preprint arXiv:1909.00161 (2019).
[17] E. Cole, O. Mac Aodha, T. Lorieul, P. Perona, D. Morris, N. Jojic, Multi-label learning from single positive labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 933–942.
[18] D. Zhou, P. Chen, Q. Wang, G. Chen, P.-A. Heng, Acknowledging the unknown for multi-label learning with single positive labels, arXiv preprint arXiv:2203.16219 (2022).
[19] M. C. Du Plessis, G. Niu, M. Sugiyama, Analysis of learning from positive and unlabeled data, Advances in Neural Information Processing Systems 27 (2014).
[20] L. Sterckx, T. Demeester, J. Deleu, C. Develder, Knowledge base population using semantic label propagation, Knowledge-Based Systems 108 (2016) 79–91.
[21] J.-J. Decorte, J. Van Hautte, T. Demeester, C. Develder, JobBERT: Understanding job titles through skills, arXiv preprint arXiv:2109.09605 (2021).
[22] J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Contrastive learning with hard negative samples, arXiv preprint arXiv:2010.04592 (2020).
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[25] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019).


A. Annotation guidelines

Each item that needs to be annotated is a span, thus a part of a longer job posting sentence. Both the span and the complete sentence are shown to provide the right context for annotation. When a span is ambiguous, the full sentence must be read to understand the meaning of the span.

The task is to annotate the correct and most specific skill that is mentioned or implied by the span. The place of the candidate labels within the shortlist has no importance during annotation. In the case that no correct skill is found in the shortlist, you may search for the correct skill using the ESCO interface [12]. If you still cannot find a correct label, select LABEL NOT PRESENT. If you find that the span can generally not be interpreted as a skill, select UNDERSPECIFIED.

A.1. Examples

• Given the span "partner continuously with your many stakeholders" and the candidate labels "Communicate With Stakeholders", "Negotiate With Stakeholders" and "Liaise With Shareholders", only the first two labels are considered correct. "Communicate With Stakeholders" is most specific with regard to the span, so this label should be selected.

• Spans such as "apply your depth of knowledge" or "apply your expertise" are classified as UNDERSPECIFIED.


B. Training details

The separate classifiers are implemented as a simple logistic regression model, using the popular scikit-learn toolkit [24]. All parameters are set to their default values, except for the inverse regularization strength parameter C, which is set to 0.1 for stronger regularization. The RoBERTa model and the mean pooling operation are implemented using the Sentence-BERT library [25].
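A minimal sketch of this configuration follows; the specific RoBERTa checkpoint is our assumption, since the paper only states a pre-trained RoBERTa encoder with mean pooling built with the Sentence-BERT library and a scikit-learn logistic regression with C = 0.1.

# Sketch of the training configuration described in Appendix B.
from sentence_transformers import SentenceTransformer, models
from sklearn.linear_model import LogisticRegression

# Frozen RoBERTa encoder with mean pooling, built with the Sentence-BERT library [25].
word_embedding = models.Transformer("roberta-base")      # checkpoint choice is our assumption
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
encoder = SentenceTransformer(modules=[word_embedding, pooling])

# Per-skill logistic regression with stronger regularization (C = 0.1); all other
# scikit-learn parameters keep their default values, as stated above.
classifier = LogisticRegression(C=0.1)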