=Paper=
{{Paper
|id=Vol-3218/paper4
|storemode=property
|title=Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction
|pdfUrl=https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_4.pdf
|volume=Vol-3218
|authors=Jens-Joris Decorte,Jeroen Van Hautte,Johannes Deleu,Chris Develder,Thomas Demeester
|dblpUrl=https://dblp.org/rec/conf/hr-recsys/DecorteHDDD22
}}
==Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction==
Jens-Joris Decorte¹,²,∗, Jeroen Van Hautte², Johannes Deleu¹, Chris Develder¹ and Thomas Demeester¹
¹ Ghent University – imec, 9052 Gent, Belgium
² TechWolf, 9000 Gent, Belgium
Abstract
Skills play a central role in the job market and many human resources (HR) processes. In the wake of other digital experiences,
today’s online job market has candidates expecting to see the right opportunities based on their skill set. Similarly, enterprises
increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured
information about skills is often missing, and processes building on self- or manager-assessment have been shown to struggle with
issues around adoption, completeness, and freshness of the resulting data. These challenges can be tackled using automated
techniques for skill extraction. Extracting skills is a highly challenging task, given the many thousands of possible skill labels
mentioned either explicitly or merely described implicitly and the lack of finely annotated training corpora. Previous work
on skill extraction overly simplifies the task to an explicit entity detection task or builds on manually annotated training data
that would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction,
based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned
on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the
lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative
examples from related skills yields the biggest improvements, and combining three different strategies in one model further
increases the performance, up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark
for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for
research purposes to stimulate further research on the task.
Keywords
Skill Extraction, Information Extraction, Distant Supervision, Extreme Multi-Label Classification
RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
∗ Corresponding author.
jensjoris@techwolf.ai (Jens-Joris Decorte); jeroen@techwolf.ai (Jeroen Van Hautte); johannes.deleu@ugent.be (Johannes Deleu); chris.develder@ugent.be (Chris Develder); thomas.demeester@ugent.be (Thomas Demeester)
https://www.techwolf.ai (Jens-Joris Decorte); https://www.techwolf.ai (Jeroen Van Hautte)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Skill extraction is an information extraction task that aims to identify all skills mentioned in a text. It is essential for many HR applications, such as resume screening and job recommendation systems. A comparative survey on skill extraction indicates that research interest has steadily grown over the last decade [1]. Traditionally, skill extraction has been approached as finding and disambiguating entities in texts. These methods typically rely on a named entity recognition (NER) component based on phrase-matching or a trained LSTM model [2, 3, 4]. However, skills are often present implicitly as longer sequences of words (which we refer to as spans) or full sentences rather than being mentioned explicitly: over 85% of unique required skills in job ads have been reported never to be explicitly mentioned [5].

Very recently, the work titled SkillSpan has reformulated skill extraction as a more flexible span detection task [6]. The authors released a dataset of job postings with span annotations and trained SpanBERT-based models to detect skill spans as a sequence labeling task. In follow-up work, the authors developed a classification model to link such a span to the corresponding coarse-grained skill group in ESCO [7]. To overcome the difficulty of labeling these spans, the authors relied on weak supervision by automatically selecting labels based on the ESCO search API [7]. Another study manually annotated job ads with soft skills, which were consolidated into a released dataset called FIJO [8]. However, instead of using an exhaustive list of soft skills, they only incorporated four broad labels to decrease the difficulty of the annotation. The skill extraction task can also be reduced to binary skill detection, again reducing the challenge compared to fine-grained skill extraction [9]. These works follow a more relaxed formulation of skill extraction, but they all suffer from the difficulty of annotating a fine-grained training dataset.
Figure 1: Overview of our method. Using the ESCO skill taxonomy, the distantly supervised training corpus (2) is created from our job posting corpus (1). Based on the negative sampling strategy, the positive data is combined with negative examples (3), and finally a classifier is trained for each skill (4). (Diagram: job posting corpus → distant supervision → distantly supervised training corpus → data sampling → binary training data → binary classifiers on top of a fixed RoBERTa encoder, with ESCO providing the skill labels.)
Some work avoids this labeling difficulty completely by using readily available labeled datasets. For example, in [5], an eXtreme Multi-Label Classification (XMLC) model was trained based on a corpus of job ads with attached skills provided by an online job ads platform. However, the authors reported that for that corpus, at least 40% of the vacancies missed 20% of explicitly stated skills in their labels. Recent work [10] successfully reconstructed the BERT-XMLC approach on Dutch vacancy texts using the Dutch RobBERT model [11]. The training dataset used for this work is, however, based on the output of an existing commercial skill extraction solution.

We propose a new end-to-end approach to fine-grained skill extraction that does not rely on a large hand-labeled training corpus. Instead, we ease the requirements on the training data such that it can be automatically collected through distant supervision. We cast the multi-label skill classification task into independent binary classification problems, with skills labeled on the sentence level, to encompass both explicit and implicit skill descriptions. To the best of our knowledge, our work is the first one to tackle fine-grained skill extraction in such a flexible distant supervision setup. Our distant supervision training set contains few false positives, due to the literal matching of known skills, which is a task with low ambiguity. However, we expect many false negatives, for skills not literally mentioned. This is quantified in Section 4.1. We investigate to what extent the distantly supervised training set can be leveraged at maximum effectiveness to train a fine-grained skill extraction system. To that end, we design a number of negative sampling strategies that can be used to tune the extraction model training process on a small annotated development set, covering only a fraction of all potential skills (0.2%, to be precise, in our experimental setting). Finally, in order to stimulate research on automated skill extraction, and to facilitate the comparison of future models with our results, we release¹ our development and test data, which is constructed on top of the "SkillSpan" dataset [6], adding annotations with the ESCO [12] skill labels.

¹ https://github.com/jensjorisdecorte/Skill-Extraction-benchmark

2. Related Work

Multi-label classification datasets often have a skewed label distribution, with many labels occurring only a few times or even being completely absent in the training data. Some works have focused on improving the few-shot and zero-shot classification performance of multi-label text classification on these rare or unseen labels. Typically, the information in structured label graphs (such as label descriptions or relations) or word embeddings is used as an input to the system in order to generalize to unseen labels [13, 14]. However, these methods still rely on a large labeled training dataset to work. In the absence of any supervision, [15] uses a novel self-supervision training objective to train a dense sentence representation model that is used to assign labels based on cosine similarity in the learned space. Yin et al. [16] propose an entailment approach to zero-shot text classification, where the input text is called the premise, and a hypothesis is constructed for each label using the template "the text is about label". The premise and hypothesis are concatenated before being presented to a BERT-based model for prediction, making this method slow at inference for large label spaces.

Multi-label classification datasets not only suffer from the rare label problem; many labels are also simply missing, since datasets are usually only partially labeled: instances without labels may thus be either truly negative, or positive but not identified as such during labeling. The "Single Positive Labels" scenario is an extreme case of missing labels, where only one positive label is available for each training instance [17]. Research on this topic is limited, and typically focuses on designing custom loss functions [18] or online estimation of the missing labels during training [17]. This line of work is closely related to "Positive-Unlabeled" (PU) binary classification, which is typically also tackled using custom loss functions [19].
Typically, in a distant supervision setup, the labeling function is followed by a filtering step that aims to reduce the number of false positives in the labels [20]. However, we find that the number of false positives produced by the distant supervision step is low in our case of literal skill mentions. This has been shown previously by [21], where literal skill mentions have been successfully used as distant supervision for the task of job title representation learning. Rather than focusing on a filtering step, we draw inspiration from the idea of "hard negative examples" in representation learning to improve the learning process. In contrastive learning, hard negative examples refer to samples that are difficult to distinguish from an anchor point [22]. This approach improves the discriminative abilities and downstream performance of unsupervised representation learning methods. We adapt this idea to the multi-label classification setup by oversampling negative examples from related labels. More details on this approach are given in the following section.
3. Skill Extraction Approach

We approach the task of skill extraction as a sentence-level multi-label classification task. A high-level overview of the method is shown in Fig. 1. Our method uses distant supervision based on the ESCO skill taxonomy to automatically assign (partial) skill labels for a given set of sentences from the HR domain (in particular, mined from vacancies). Negative sampling strategies are used to combine 'positive sentences' for a given skill (i.e., sentences labeled with that skill during the distant supervision step) with sentences not containing that skill (referred to as 'negative sentences'). Finally, a binary classifier 𝑓𝑠 is trained for each skill 𝑠, based on the constructed positive and negative sentences for that skill. It consists of a logistic regression classifier on top of a (frozen) representation for the sentences, as described in more detail below.
Distantly supervised training set: Given a set 𝑆 of skills and a background corpus of sentences 𝐷, for each skill 𝑠 ∈ 𝑆, a set 𝑃𝑠 of positive sentences is collected from 𝐷 through distant supervision. In particular, we use the ESCO [12] skills taxonomy as the set of classification labels. The set 𝑃𝑠 of positive sentences for each skill 𝑠 consists of those sentences in 𝐷 that literally mention the skill 𝑠 or any of its alternative forms, as provided in the taxonomy. This assumes that there are no ambiguous skill names, which holds in most cases as skill names tend to be specific. The positive labels are very precise, due to the distant supervision process based on literal matches with the highly specific ESCO skill names. However, this means potentially many skills remain unlabeled, i.e., the training data is prone to false negatives. After the distant supervision step, on average 365 sentences were labeled per skill (for the set of 13,891 ESCO skills). This dataset follows a long tail distribution, with 75.1% of skills occurring in only ten or fewer sentences.
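To make the distant supervision step concrete, the following is a minimal sketch of collecting the positive sets 𝑃𝑠 by literal matching of skill names and their alternative forms against sentences. It is an illustration rather than the authors' implementation; the skill inventory and sentences are toy placeholders standing in for ESCO and the job posting corpus.

```python
import re
from collections import defaultdict

# Toy ESCO-style skill inventory: preferred name -> alternative surface forms.
skills = {
    "manage musical staff": ["manage music staff"],
    "Haskell": ["haskell programming"],
}

sentences = [
    "You will manage musical staff and plan rehearsals.",
    "Experience with Haskell is a plus.",
    "You coordinate volunteers for our festival.",
]

def build_positive_sets(skills, sentences):
    """Return P_s: for each skill, the sentences that literally mention it."""
    positives = defaultdict(list)
    for skill, alternatives in skills.items():
        # Word-boundary, case-insensitive match on the name and all alternative forms.
        patterns = [re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
                    for form in [skill] + alternatives]
        for sentence in sentences:
            if any(p.search(sentence) for p in patterns):
                positives[skill].append(sentence)
    return positives

print(build_positive_sets(skills, sentences))
# {'manage musical staff': ['You will manage musical staff and plan rehearsals.'],
#  'Haskell': ['Experience with Haskell is a plus.']}
```

The third sentence receives no label at all, which illustrates the false-negative issue discussed above: implicit skills are simply not matched.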
Skill extraction model: The model architecture is depicted in Fig. 1. We use a frozen pre-trained RoBERTa [23] model with mean pooling to transform input sentences into fixed-length contextual representations, before presenting them for classification. The classification is performed by separate binary text classification models 𝑓𝑠, each generating an independent prediction value for their respective skill label 𝑠. In contrast to a typical multi-label model, we optimize each classification model separately on a different corresponding dataset, instead of training all weights together.
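Appendix B notes that the RoBERTa encoder and the mean pooling operation are implemented with the Sentence-BERT library [25]. A minimal sketch of such a frozen sentence encoder could look as follows; the 'roberta-base' checkpoint and the example sentences are illustrative assumptions, not the authors' exact configuration.

```python
from sentence_transformers import SentenceTransformer, models

# Wrap a pre-trained RoBERTa encoder with a mean-pooling layer; the encoder is
# only used for inference, so every sentence maps to one fixed-length vector.
word_embedding_model = models.Transformer("roberta-base")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
encoder = SentenceTransformer(modules=[word_embedding_model, pooling_model])

sentences = [
    "You will manage musical staff and plan rehearsals.",
    "Experience with Haskell is a plus.",
]
embeddings = encoder.encode(sentences)
print(embeddings.shape)  # (2, 768) for roberta-base
```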
Training with negative sampling: 𝑃𝑠 serves as positive training data for classifier 𝑓𝑠, and negative examples are sampled from the union of all positive sentence datasets of all other skills. The basic mechanism for sampling negatives is uniform sampling from this union. However, following the ideas in representation learning [22], we hypothesize that sentences from related skills are more informative, harder to distinguish from the positive sentences (i.e., closer to the decision boundary), and could thus improve the learning process. As such, a fraction of the negative examples is sampled specifically from sentences that are labeled with a related (but different) label to skill 𝑠. We refer to these sentences as "hard" negative samples. Our negative sampling strategy is thus defined by two important factors. First, the fraction of uniformly sampled negatives versus hard negative samples is important. Secondly, how we define whether two skills are related is crucial to the learning process. We introduce three different strategies for selecting the related skills in Section 3.1.
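A minimal sketch of how such a per-skill training set could be assembled, mixing uniformly sampled negatives with hard negatives drawn from related skills. The helper name and data structures are illustrative; the defaults (10 negatives per positive, a 5% hard fraction) mirror the settings reported in Sections 4.2 and 5.

```python
import random

def build_training_set(skill, positives, related_skills,
                       neg_per_pos=10, hard_fraction=0.05, rng=None):
    """Combine positive sentences for `skill` with uniform and hard negatives.

    positives: dict mapping every skill to its list of positive sentences (P_s).
    related_skills: dict mapping every skill to a list of related skill labels.
    """
    rng = rng or random.Random(0)
    pos = positives[skill]
    n_neg = neg_per_pos * len(pos)
    n_hard = int(hard_fraction * n_neg)

    # Pools: sentences of related skills ("hard") vs. all other skills (uniform).
    hard_pool = [s for rel in related_skills.get(skill, []) for s in positives.get(rel, [])]
    uniform_pool = [s for other, sents in positives.items() if other != skill for s in sents]

    hard_negs = rng.choices(hard_pool, k=n_hard) if hard_pool else []
    uniform_negs = (rng.choices(uniform_pool, k=n_neg - len(hard_negs))
                    if uniform_pool else [])

    sentences = pos + hard_negs + uniform_negs
    labels = [1] * len(pos) + [0] * (len(hard_negs) + len(uniform_negs))
    return sentences, labels
```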
Inference and evaluation: The final model is used to rank the relevance of all skills for a given sentence. Similar to [14], we use the macro-averaged R-Precision@K (RP@K) metric to evaluate the performance of the method. Since predictions are made on a sentence basis, we restrict the evaluation to low values of K. RP@K is defined in (1), where the quantity 𝑅𝑒𝑙(𝑛, 𝑘) is a binary indicator of whether the k-th ranked label is a correct label for data sample 𝑛, and 𝑅𝑛 is the number of gold labels for sample 𝑛. In addition, we use the mean reciprocal rank (MRR) of the highest ranked correct label as an indicator of the ranking quality. More information on the evaluation is presented in Section 4.1.

$RP@K = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{Rel(n, k)}{\min(K, R_n)}$   (1)
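A small, self-contained sketch of how RP@K (Eq. 1) and the MRR of the highest ranked correct label can be computed from per-sentence label rankings; the function names and the toy example are illustrative.

```python
def rp_at_k(ranked_labels, gold_labels, k):
    """Macro-averaged R-Precision@K over N samples (Eq. 1).

    ranked_labels: per sample, skill labels sorted by decreasing score.
    gold_labels: per sample, the set of correct skill labels.
    """
    total = 0.0
    for ranked, gold in zip(ranked_labels, gold_labels):
        hits = sum(1 for label in ranked[:k] if label in gold)
        total += hits / min(k, len(gold))
    return total / len(ranked_labels)

def mrr(ranked_labels, gold_labels):
    """Mean reciprocal rank of the highest ranked correct label."""
    total = 0.0
    for ranked, gold in zip(ranked_labels, gold_labels):
        for rank, label in enumerate(ranked, start=1):
            if label in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_labels)

# Tiny example: two sentences, each with a full ranking over three skills.
ranked = [["Haskell", "Erlang", "manage staff"], ["manage staff", "Haskell", "Erlang"]]
gold = [{"Haskell"}, {"Erlang", "manage staff"}]
print(rp_at_k(ranked, gold, k=2), mrr(ranked, gold))  # 0.75 1.0
```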
Table 1: Examples of related skill labels for the three different selection strategies. The "siblings" examples are in no particular order as they form a set of siblings, rather than an ordered list.

Skill: disarm land mine
  Siblings: ensure flock safety; protect important clients; signal for explosion; deal with challenging people
  Levenshtein: find land mines; search for land mines; identify land mines; dismantle machines
  Embedding: repair mine machinery; handle mining plant waste; management of mine ventilation; construct road base

Skill: Haskell
  Siblings: DevOps; XQuery; Windows Phone; SPARK
  Levenshtein: add smell; upsell; sink wells; speak well
  Embedding: PostgreSQL; Erlang; JavaScript; C++

Skill: manage musical staff
  Siblings: discharge employees; manage volunteers; supervise nursing staff; guide staff
  Levenshtein: manage musical groups; manage musical events; manage musicians; manage educational staff
  Embedding: manage agricultural staff; manage staff; manage dental staff; manage educational staff
3.1. Negative Sampling Strategies

Rather than randomly sampling negative examples for training each binary skill classifier, we assume that sampling more informative negatives will likely lead to a more efficient training procedure. Instead of sampling hard negative sentences directly, we first identify related (yet different) skills, and then sample sentences with those labels. We introduce three different strategies for identifying such related skills, which we analyze through the experiments defined in Section 4. The considered sets of related skills, given a particular skill 𝑠, are obtained as follows:
• Siblings: all skills that share a parent concept with 𝑠, as indicated by the "broader concepts" field in ESCO.
• Levenshtein: the top 100 skills closest to 𝑠, according to their Levenshtein distance.
• Embedding: the top 100 skills closest to 𝑠 in terms of cosine similarity with their mean-pooled RoBERTa-encoded skill name representations.
For each of the negative sampling strategies, some example ESCO skills with their related labels according to the strategy are shown in Table 1.
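The three selection strategies can be sketched as follows. This is a simplified illustration rather than the authors' code: the ESCO "broader concepts" mapping, the skill-name embeddings, and the function names are placeholder assumptions, and the top-100 cut-off follows the description above.

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two skill names."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def siblings(skill, broader):
    """All skills sharing at least one broader concept with `skill`.

    broader: dict mapping each skill to its list of ESCO broader-concept IDs."""
    parents = set(broader[skill])
    return [s for s, p in broader.items() if s != skill and parents & set(p)]

def levenshtein_top(skill, all_skills, n=100):
    """Top-n skills closest to `skill` by edit distance."""
    others = [s for s in all_skills if s != skill]
    return sorted(others, key=lambda s: levenshtein(skill, s))[:n]

def embedding_top(skill, all_skills, name_embeddings, n=100):
    """Top-n skills by cosine similarity of (e.g. mean-pooled RoBERTa) name embeddings."""
    unit = {s: v / np.linalg.norm(v) for s, v in name_embeddings.items()}
    others = [s for s in all_skills if s != skill]
    return sorted(others, key=lambda s: -float(unit[skill] @ unit[s]))[:n]
```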
4. Experimental setup

4.1. Evaluation

While hand-labeling a training dataset for skill extraction is infeasible (given the huge number of skills, e.g., over 13k in ESCO), we argue that with reasonable manual work, it is possible to construct a benchmark that can be used to compare the performance of different models. We build upon the test set of the SkillSpan dataset from [6], which contains job posting sentences annotated with skill spans. We manually annotate each span in SkillSpan with its corresponding ESCO skill (if it exists). This span-based multi-class annotation is less complex than annotating complete sentences with multiple labels. The process is performed on the test sets of the publicly released subsets TECH and HOUSE. Details on the annotation guidelines can be found in Appendix A. The annotation effort results in fine-grained ESCO skill labels for 64.5% of the spans. We split this dataset into a validation and test set using a 20%/80% split. The validation set contains 165 unique skill labels, and over 80% of the unique skill labels in the test set never occur in the validation set. A more detailed breakdown of the number of spans and annotations is shown in Table 2.

Table 2: Benchmark dataset statistics on the number of sentences, spans, and ESCO-labeled spans, for both the TECH and HOUSE partitions of the SkillSpan dataset. Numbers of the validation and test split are indicated in the table.

                              TECH              HOUSE
                           val     test      val     test
  # sentences              470     1882      243      973
  # spans                  262     1024      191      786
  # spans with ESCO label  152      644      131      532

In order to verify our hypothesis that the distant supervision labeling leads to quite precise positive labels, at the cost of many false negatives, we validated the distant supervision labeling of the test set against the manual annotations. The automatically assigned labels are indeed rather precise (overall precision of 79%), but at the cost of low coverage (i.e., a recall of 14.6%).
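This validation amounts to precision and recall over (sentence, skill) label pairs; a minimal sketch under that (assumed) micro-averaged pair counting, with toy label sets:

```python
def precision_recall(distant, gold):
    """distant, gold: per sentence, the set of distant / manually annotated skill labels."""
    tp = sum(len(d & g) for d, g in zip(distant, gold))       # pairs found by both
    predicted = sum(len(d) for d in distant)                  # pairs assigned by distant supervision
    actual = sum(len(g) for g in gold)                        # pairs in the manual annotations
    return tp / predicted if predicted else 0.0, tp / actual if actual else 0.0

distant = [{"Haskell"}, set(), {"manage staff"}]
gold = [{"Haskell", "Erlang"}, {"team leadership"}, {"manage staff"}]
print(precision_recall(distant, gold))  # (1.0, 0.5): precise but low coverage
```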
Figure 2: Evaluating the effect of the fraction of hard negatives used during training, for each of the three strategies (siblings, Levenshtein, and embedding-based similarity with the considered positive skill) separately. The baseline model performance without hard negative sampling is shown by the horizontal red line. Metrics are reported on the validation sets. (a) TECH, (b) HOUSE.
4.2. Experiments

The sentences used for training are collected from a large proprietary corpus of public job postings. This dataset has been collected from different public job boards and contains a large number of English job postings. ESCO is used for the distant supervision step: a skill label is assigned when the skill itself, or one of its alternative forms provided by ESCO, is literally mentioned in a sentence. For each skill classifier 𝑓𝑠, a maximum of one thousand positive sentences is retained. The number of negative examples per positive example is set to 10. We train a baseline classifier without hard negative sampling. In this case, all negative examples are sampled uniformly from the other positive corpora. To investigate the optimal hard negative sampling procedure, we conduct a hyper-parameter search for the fraction of negatives sampled using the three strategies (siblings, Levenshtein, embedding) versus uniform sampling. Based on the performance on the validation sets, we decide on an optimal value for this percentage. Finally, we report the contribution of each of the negative sampling strategies when combined. This is reported based on performance on the unseen test set, and contributions of the strategies are shown through ablations, by leaving one strategy out at a time. We refer to Appendix B for more details on the training procedure.
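The hyper-parameter search can be sketched as a simple grid over the hard-negative fraction, scored on the validation set. This sketch builds on the illustrative helpers introduced earlier (build_training_set, rp_at_k, and the sentence encoder) and on placeholder data (positives, related, val_sentences, val_gold) that the reader would have to supply; the candidate grid is likewise illustrative.

```python
from sklearn.linear_model import LogisticRegression

def validation_rp5(hard_fraction):
    # Rebuild every per-skill training set with the given hard fraction and
    # fit one logistic regression classifier per skill on frozen embeddings.
    classifiers = {}
    for skill in positives:
        sents, labels = build_training_set(skill, positives, related,
                                           neg_per_pos=10, hard_fraction=hard_fraction)
        classifiers[skill] = LogisticRegression(C=0.1).fit(encoder.encode(sents), labels)
    # Rank all skills for each validation sentence and score with RP@5.
    rankings = []
    for vec in encoder.encode(val_sentences):
        scores = {s: clf.predict_proba(vec.reshape(1, -1))[0, 1]
                  for s, clf in classifiers.items()}
        rankings.append(sorted(scores, key=scores.get, reverse=True))
    return rp_at_k(rankings, val_gold, k=5)

# Illustrative grid over the fraction of hard negatives (0.0 = uniform-only baseline).
results = {f: validation_rp5(f) for f in [0.0, 0.01, 0.05, 0.10, 0.25]}
best_fraction = max(results, key=results.get)
```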
5. Results and Discussion

The results of the hyper-parameter search for each of the negative sampling strategies are shown in Fig. 2. From these results, it is clear that the different strategies have different effects on the model performance. Most notably, we find that the optimal fraction of hard negative sampling is no higher than 5% for any strategy. This is in line with previous findings on hard negative sampling [22]. Sampling large amounts of hard negatives even has a large negative impact on the performance of the model. Secondly, the "Levenshtein" strategy brings the least improvements out of all three strategies.

Finally, we trained a model that combines all strategies. Based on the results of the above hyper-parameter search, we chose 5% as an optimal value for the fraction of negatives sampled through the combined hard negative strategies. To assess the impact of each of the strategies within this combination, we trained three more models in which each of the three strategies is left out respectively. The performance of these final models is shown in Table 3. The combination of all three strategies yielded the overall best model. This model has large performance gains across the MRR and RP@K metrics for both the TECH and HOUSE datasets.

Leaving out the "Levenshtein" strategy has a relatively low impact on the performance. This might be understood by looking at the examples in Table 1: string similarity surfaces unrelated skills, for example for proper nouns such as Haskell. This could partially explain the relatively low utility of this negative sampling strategy. On the other hand, leaving out the "siblings" strategy takes away the largest part of the performance improvements. This strategy makes use of the hierarchy defined in the ESCO taxonomy, and thus is a reliable method for selecting informative hard negatives. The effect of the "embedding" strategy is comparable to the "siblings" strategy and thus proves a good alternative in case a hierarchy such as the one in ESCO is not available.
Table 3: Evaluation metrics of final skill extraction models on the TECH and HOUSE test sets. Reported metrics are mean reciprocal rank (MRR), R-Precision at 5 and at 10 (RP@5, RP@10).

                                            TECH                     HOUSE
                                     MRR    RP@5   RP@10      MRR    RP@5   RP@10
  Baseline classifier               0.246   23.65  33.71     0.255   26.66  34.19
  Classifier_neg                    0.326   31.71  39.09     0.299   30.82  38.69
  Classifier_neg without embeddings 0.323   31.43  39.19     0.298   29.09  37.70
  Classifier_neg without Levenshtein 0.339  31.11  38.55     0.298   30.14  37.22
  Classifier_neg without siblings   0.303   30.57  37.07     0.281   29.20  35.91
6. Conclusion and Future Work

We propose an end-to-end approach to skill extraction using distant supervision. The method is able to make fine-grained skill predictions (using 13,891 skills from ESCO) for a given input sentence. We introduce the idea of hard negative sampling through related labels in a multi-label classification setup and propose three different strategies to select these related labels. We investigate the impact of each of the strategies, and find that all three strategies combined yield the highest increase on top of a baseline model without hard negative sampling. Both the distant supervision and the hard negative sampling are designed to work well without manual labeling, which makes the whole method very flexible. To the best of our knowledge, we are the first to design such a system for skill extraction, and we improve on prior work by providing methods that have relaxed requirements on ground-truth data and that are able to make very fine-grained skill predictions. Finally, we release our hand-labeled test and validation dataset for skill extraction to stimulate further research on the task.

Future work could entail a more extensive investigation of other hyper-parameters, such as the number of negatives per positive sentence (𝑘), which was fixed to 10 in this work. Secondly, more performance gains could be made if the RoBERTa weights were fine-tuned during training, but this requires changes in the training setup which should be carefully investigated. Lastly, it could be interesting to investigate how limited manual labor can maximally improve the performance of the method even further with techniques such as active learning.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback. This project was funded by the Flemish Government, through Flanders Innovation & Entrepreneurship (VLAIO, project HBC.2020.2893).

References

[1] I. Khaouja, I. Kassou, M. Ghogho, A survey on skill identification from online job ads, IEEE Access 9 (2021) 118134–118153.
[2] M. Zhao, F. Javed, F. Jacob, M. McNair, Skill: A system for skill identification and normalization, in: Twenty-Seventh IAAI Conference, 2015.
[3] L. Sayfullina, E. Malmi, J. Kannala, Learning representations for soft skill matching, in: International conference on analysis of images, social networks and texts, Springer, 2018, pp. 141–152.
[4] S. Jia, X. Liu, P. Zhao, C. Liu, L. Sun, T. Peng, Representation of job-skill in artificial intelligence with knowledge graph analysis, in: 2018 IEEE symposium on product compliance engineering-asia (ISPCE-CN), IEEE, 2018, pp. 1–6.
[5] A. Bhola, K. Halder, A. Prasad, M.-Y. Kan, Retrieving skills from job descriptions: A language model based extreme multi-label classification framework, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5832–5842. URL: https://aclanthology.org/2020.coling-main.513. doi:10.18653/v1/2020.coling-main.513.
[6] M. Zhang, K. N. Jensen, S. D. Sonniks, B. Plank, SkillSpan: Hard and soft skill extraction from English job postings, arXiv preprint arXiv:2204.12811 (2022).
[7] M. Zhang, K. N. Jensen, B. Plank, Kompetencer: Fine-grained skill classification in Danish job postings via distant supervision and transfer learning, arXiv preprint arXiv:2205.01381 (2022).
[8] D. Beauchemin, J. Laumonier, Y. L. Ster, M. Yassine, "FIJO": a French insurance soft skill detection dataset, arXiv preprint arXiv:2204.05208 (2022).
[9] D. A. Tamburri, W.-J. Van Den Heuvel, M. Garriga, DataOps for societal intelligence: A data pipeline for labor market skills extraction and matching, in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), IEEE, 2020, pp. 391–394.
[10] N. Vermeer, V. Provatorova, D. Graus, T. Rajapakse, S. Mesbah, Using RobBERT and eXtreme multi-label classification to extract implicit and explicit skills from Dutch job descriptions (2022).
[11] P. Delobelle, T. Winters, B. Berendt, RobBERT: a Dutch RoBERTa-based language model, arXiv preprint arXiv:2001.06286 (2020).
[12] ESCO, European skills, competences, qualifications and occupations, EC Directorate E (2017).
[13] J. Lu, L. Du, M. Liu, J. Dipnall, Multi-label few/zero-shot learning with knowledge aggregated from multiple label graphs, arXiv preprint arXiv:2010.07459 (2020).
[14] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, arXiv preprint arXiv:1905.10892 (2019).
[15] Y. Xiong, W.-C. Chang, C.-J. Hsieh, H.-F. Yu, I. Dhillon, Extreme zero-shot learning for extreme text classification, arXiv preprint arXiv:2112.08652 (2021).
[16] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, arXiv preprint arXiv:1909.00161 (2019).
[17] E. Cole, O. Mac Aodha, T. Lorieul, P. Perona, D. Morris, N. Jojic, Multi-label learning from single positive labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 933–942.
[18] D. Zhou, P. Chen, Q. Wang, G. Chen, P.-A. Heng, Acknowledging the unknown for multi-label learning with single positive labels, arXiv preprint arXiv:2203.16219 (2022).
[19] M. C. Du Plessis, G. Niu, M. Sugiyama, Analysis of learning from positive and unlabeled data, Advances in Neural Information Processing Systems 27 (2014).
[20] L. Sterckx, T. Demeester, J. Deleu, C. Develder, Knowledge base population using semantic label propagation, Knowledge-Based Systems 108 (2016) 79–91.
[21] J.-J. Decorte, J. Van Hautte, T. Demeester, C. Develder, JobBERT: Understanding job titles through skills, arXiv preprint arXiv:2109.09605 (2021).
[22] J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Contrastive learning with hard negative samples, arXiv preprint arXiv:2010.04592 (2020).
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[25] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019).

A. Annotation guidelines

Each item that needs to be annotated is a span, thus a part of a longer job posting sentence. Both the span and the complete sentence are shown to provide the right context for annotation. When a span is ambiguous, the full sentence must be read to understand the meaning of the span.

The task is to annotate the correct and most specific skill that is mentioned or implied by the span. The place of the candidate labels within the shortlist has no importance during annotation. In the case that no correct skill is found in the shortlist, you may search for the correct skill using the ESCO interface [12]. If you still cannot find a correct label, select LABEL NOT PRESENT. If you find that the span can generally not be interpreted as a skill, select UNDERSPECIFIED.

A.1. Examples

• Given the span "partner continuously with your many stakeholders" and the candidate labels "Communicate With Stakeholders", "Negotiate With Stakeholders" and "Liaise With Shareholders", only the first two labels are considered correct. "Communicate With Stakeholders" is most specific with regards to the span, so this label should be selected.
• Spans such as "apply your depth of knowledge" or "apply your expertise" are classified as UNDERSPECIFIED.

B. Training details

The separate classifiers are implemented as a simple logistic regression model, using the popular scikit-learn toolkit [24]. All parameters are set to their default values, except for the inverse regularization strength parameter 𝐶, which is set to 0.1 for stronger regularization. The RoBERTa model and the mean pooling operation are implemented using the Sentence-BERT library [25].
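A minimal sketch matching this description, with randomly generated stand-ins for the sentence embeddings and binary labels of one skill classifier 𝑓𝑠 (the real inputs are the frozen mean-pooled RoBERTa embeddings of the distantly supervised corpus described in Section 3):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: 20 "sentence embeddings" (768-dim, as for roberta-base)
# and their positive/negative labels for a single skill classifier.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 768))
y = rng.integers(0, 2, size=20)

# scikit-learn defaults, except the inverse regularization strength C = 0.1.
classifier = LogisticRegression(C=0.1)
classifier.fit(X, y)
print(classifier.predict_proba(X[:2])[:, 1])  # probability of the skill being present
```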