
Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction

Jens-Joris Decorte (1, 2)

Jeroen Van Hautte (2), jeroen@techwolf.ai

Johannes Deleu (1), johannes.deleu@ugent.be

Chris Develder (1), chris.develder@ugent.be

Thomas Demeester (1), thomas.demeester@ugent.be

(1) Ghent University - imec, 9052 Gent, Belgium
(2) TechWolf, 9000 Gent, Belgium

2022

18 23

Skills play a central role in the job market and many human resources (HR) processes. In the wake of other digital experiences, today's online job market has candidates expecting to see the right opportunities based on their skill set. Similarly, enterprises increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured information about skills is often missing, and processes building on self- or manager-assessment have been shown to struggle with issues around adoption, completeness, and freshness of the resulting data. These challenges can be tackled using automated techniques for skill extraction. Extracting skills is a highly challenging task, given the many thousands of possible skill labels mentioned either explicitly or merely described implicitly, and the lack of finely annotated training corpora. Previous work on skill extraction overly simplifies the task to an explicit entity detection task, or builds on manually annotated training data that would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction, based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements, and that combining three different strategies in one model further increases performance, by up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for research purposes to stimulate further research on the task.

1. Introduction

Skill extraction is an information extraction task that is essential for many HR applications, such as resume screening and job recommendation systems. A comparative survey on skill extraction indicates that research interest has steadily grown over the last decade [1]. Traditionally, skill extraction has been approached as finding and disambiguating entities in texts. These methods typically rely on a named entity recognition (NER) component based

However, skills are often present implicitly as longer sequences of words (which we refer to as spans) or full sentences rather than being mentioned explicitly: over 85%


RecSys in HR’22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on

Some work avoids this labeling difficulty completely by using readily available labeled datasets. For example, the BERT-XLMC model was trained on a corpus of job ads with attached skills provided by an online job ads platform. However, the authors reported that for that corpus, at least 40% of the vacancies missed 20% of explicitly stated skills in their labels. Recent work [10] successfully reconstructed the BERT-XLMC approach on Dutch vacancy texts using the Dutch RobBERT model [11]. The training dataset used for this work is, however, based on the output of an existing commercial skill extraction solution.

We propose a new end-to-end approach to fine-grained skill extraction that does not rely on a large hand-labeled training corpus. Instead, we ease the requirements on the training data such that it can be automatically collected through distant supervision. We cast the multi-label skill classification task into independent binary classification problems, with skills labeled on the sentence level, to encompass both explicit and implicit skill descriptions. To the best of our knowledge, our work is the first to tackle fine-grained skill extraction in such a flexible distant supervision setup. Our distant supervision training set contains few false positives, due to the literal matching of known skills, which is a task with low ambiguity. However, we expect many false negatives, for skills not literally mentioned. This is quantified in Section 4.1. We investigate to what extent the distantly supervised training set can be leveraged at maximum effectiveness to train a fine-grained skill extraction system. To that end, we design a number of negative sampling strategies that can be used to tune the extraction model training process on a small annotated development set, covering only a fraction of all potential skills (0.2%, to be precise, in our experimental setting). Finally, in order to stimulate research on automated skill extraction, and to facilitate the comparison of future models with our results, we release¹ our development and test data, which is constructed on top of the “SkillSpan” dataset [6], adding annotations with the ESCO [12] skill labels.

¹ https://github.com/jensjorisdecorte/Skill-Extraction-benchmark

[Figure 1: Overview of the method. Labels recoverable from the figure: Job posting corpus, ESCO, Distant supervision, Distantly supervised training corpus, Data sampling, Binary training data, RoBERTa (fixed), Binary classifiers.]

2. Related Work

Multi-label classification datasets often have a skewed label distribution, with many labels occurring only a few times or even being completely absent in the training data. Some works have focused on improving the few-shot and zero-shot classification performance of multi-label text classification on these rare or unseen labels. Typically, the information in structured label graphs (such as label descriptions or relations) or word embeddings is used as an input to the system in order to generalize to unseen labels [13, 14]. However, these methods still rely on a large labeled training dataset to work. In the absence of any supervision, [15] uses a novel self-supervision training objective to train a dense sentence representation model that is used to assign labels based on cosine similarity in the learned space. Yin et al. [16] propose an entailment approach to zero-shot text classification, where the input text is called the premise, and a hypothesis is constructed for each label using the template “the text is about label”. The premise and hypothesis are concatenated before being presented to a BERT-based model for prediction, making this method slow at inference for large label spaces.

Multi-label classification datasets not only suffer from the rare label problem; many labels are also simply missing, since the datasets are usually only partially labeled: instances without labels may thus either be truly negative, or positive but not identified as such during labeling. The “Single Positive Labels” scenario is an extreme case of missing labels, where only one positive label is available for each training instance [17]. Research on this topic is limited, and typically focuses on designing custom loss functions [18] or online estimation of the missing labels during training [17]. This line of work is closely related to “Positive-Unlabeled” (PU) binary classification, which is typically also tackled using custom loss functions [19].

Typically in a distant supervision setup, the labeling function is followed by a filtering step that aims to reduce the number of false positives in the labels [20]. However, we find that the number of false positives produced by the distant supervision step is low in our case of literal skill mentions. This was shown previously by [21], where literal skill mentions were successfully used as distant supervision for the task of job title representation learning. Rather than focusing on a filtering step, we draw inspiration from the idea of “hard negative examples” in representation learning to improve the learning process. In contrastive learning, hard negative examples refer to samples that are difficult to distinguish from an anchor point [22]. This approach improves the discriminative abilities and downstream performance of unsupervised representation learning methods. We adapt this idea to the multi-label classification setup by oversampling negative examples from related labels. More details on this approach are given in the following section.

3. Skill Extraction Approach

We approach the task of skill extraction as a sentence-level multi-label classification task. A high-level overview of the method is shown in Fig. 1. Our method uses distant supervision based on the ESCO skill taxonomy to automatically assign (partial) skill labels for a given set of sentences from the HR domain (in particular, mined from vacancies). Negative sampling strategies are used to combine ‘positive sentences’ for a given skill (i.e., sentences labeled with that skill during the distant supervision step) with sentences not containing that skill (referred to as ‘negative sentences’). Finally, a binary classifier is trained for each skill $s$, based on the constructed positive and negative sentences for that skill. It consists of a logistic regression classifier on top of a (frozen) representation for the sentences, as described in more detail below.

Distantly supervised training set: Given a set of skills $S$ and a background corpus of sentences $D$, for each skill $s \in S$, a set of positive sentences $D_s$ is collected from $D$ through distant supervision. In particular, we use the ESCO [12] skills taxonomy as the set of classification labels. The set of positive sentences $D_s$ for each skill $s$ consists of those sentences in $D$ that literally mention the skill or any of its alternative forms, as provided in the taxonomy. This assumes that there are no ambiguous skill names, which holds in most cases as skill names tend to be specific. The positive labels are very precise, due to the distant supervision process based on literal matches with the highly specific ESCO skill names. However, this means that potentially many skills remain unlabeled, i.e., the training data is prone to false negatives. After the distant supervision step, on average 365 sentences were labeled per skill (for the set of 13,891 ESCO skills). This dataset follows a long tail distribution, with 75.1% of skills occurring in only ten or fewer sentences.

Skill extraction model: The model architecture is depicted in Fig. 1. We use a frozen pre-trained RoBERTa [23] model with mean pooling to transform input sentences into fixed-length contextual representations, before presenting them for classification. The classification is performed by separate binary text classification models, each generating an independent prediction value for its respective skill label. In contrast to a typical multi-label model, we optimize each classification model separately on a different corresponding dataset, instead of training all weights together.

Training with negative sampling: $D_s$ serves as positive training data for the classifier of skill $s$, and negative examples are sampled from the union of the positive sentence datasets of all other skills. The basic mechanism for sampling negatives is uniform sampling from this union. However, following the ideas in representation learning [22], we hypothesize that sentences from related skills are more informative, harder to distinguish from the positive sentences (i.e., closer to the decision boundary), and could thus improve the learning process. As such, a fraction of the negative examples is sampled specifically from sentences that are labeled with a related (but different) label to skill $s$. We refer to these sentences as “hard” negative samples. Our negative sampling strategy is thus defined by two important factors. First, the fraction of uniformly sampled negatives versus hard negative samples is important. Secondly, how we define whether two skills are related is crucial to the learning process. We introduce three different strategies for selecting the related skills in Section 3.1.

Inference and evaluation: The final model is used to rank the relevance of all skills for a given sentence. Similar to [14], we use the macro-averaged R-Precision@K (RP@K) metric to evaluate the performance of the method. Since predictions are made on a sentence basis, we restrict the evaluation to low values of $K$. RP@K is defined in (1), where the quantity $\mathrm{Rel}(n, k)$ is a binary indicator of whether the $k$-th ranked label is a correct label for data sample $n$, and $R_n$ is the number of gold labels for sample $n$. In addition, we use the mean reciprocal rank (MRR) of the highest ranked correct label as an indicator of the ranking quality. More information on the evaluation is presented in Section 4.1.
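The two ranking metrics described here can be illustrated with a minimal Python sketch. It assumes that each sample comes with a ranked list of predicted skill labels and a non-empty set of gold labels; the function names and the toy rankings are hypothetical and only serve the illustration.

```python
def r_precision_at_k(ranked, gold, k):
    """RP@K for one sample: correct labels in the top k, divided by min(k, #gold)."""
    hits = sum(1 for label in ranked[:k] if label in gold)
    return hits / min(k, len(gold))


def reciprocal_rank(ranked, gold):
    """1 / rank of the highest ranked correct label (0.0 if none is found)."""
    for rank, label in enumerate(ranked, start=1):
        if label in gold:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Toy rankings (hypothetical, for illustration only).
samples = [
    (["Haskell", "Erlang", "C++", "DevOps", "XQuery"], {"Haskell", "C++"}),
    (["manage staff", "guide staff", "manage volunteers",
      "discharge employees", "manage musicians"], {"guide staff"}),
]

rp_at_1 = mean(r_precision_at_k(r, g, 1) for r, g in samples)
rp_at_5 = mean(r_precision_at_k(r, g, 5) for r, g in samples)
mrr = mean(reciprocal_rank(r, g) for r, g in samples)
print(rp_at_1, rp_at_5, mrr)  # 0.5 1.0 0.75
```

Dividing by min(K, number of gold labels) rather than by K keeps the score at 1.0 reachable for sentences with fewer than K gold skills, which is the common case for sentence-level predictions.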

$$\mathrm{RP@}K = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{\mathrm{Rel}(n, k)}{\min(K, R_n)} \qquad (1)$$

3.1. Negative Sampling Strategies

Rather than randomly sampling negative examples for training each binary skill classifier, we assume that sampling more informative negatives will likely lead to a more efficient training procedure. Instead of sampling hard negative sentences directly, we first identify related (yet different) skills, and then sample sentences with those labels. We introduce three different strategies for identifying such related skills, which we analyze through the experiments defined in Section 4. The considered sets of related skills, given a particular skill $s$, are obtained as follows:

• Siblings: all skills that share a parent concept with $s$, as indicated by the “broader concepts” field in ESCO.
• Levenshtein: the top 100 skills closest to $s$, according to their Levenshtein distance.
• Embedding: the top 100 skills closest to $s$ in terms of cosine similarity between their mean-pooled RoBERTa-encoded skill name representations.

For each of the negative sampling strategies, some example ESCO skills with their related labels according to the strategy are shown in Table 1.

Table 1: Example ESCO skills with related skills according to each strategy.

Strategy | disarm land mine | Haskell | manage musical staff
Siblings | ensure flock safety; protect important clients; signal for explosion; deal with challenging people | DevOps; XQuery; Windows Phone; SPARK | discharge employees; manage volunteers; supervise nursing staff; guide staff
Levenshtein | find land mines; search for land mines; identify land mines; dismantle machines | add smell; upsell; sink wells; speak well | manage agricultural staff; manage staff; manage dental staff; manage educational staff
Embedding | repair mine machinery; handle mining plant waste; management of mine ventilation; construct road base | PostgreSQL; Erlang; JavaScript; C++ | manage musical groups; manage musical events; manage musicians; manage educational staff

4. Experimental setup

4.1. Evaluation

While hand-labeling a training dataset for skill extraction is infeasible (given the huge number of skills, e.g., over 13k in ESCO), we argue that with reasonable manual work, it is possible to construct a benchmark that can be used to compare the performance of different models. We build upon the test set of the SkillSpan dataset from [6], which contains job posting sentences annotated with skill spans. We manually annotate each span in SkillSpan with its corresponding ESCO skill (if it exists). This span-based multi-class annotation is less complex than annotating complete sentences with multiple labels. The process is performed on the test sets of the publicly released subsets TECH and HOUSE. Details on the annotation guidelines can be found in Appendix A. The annotation effort results in fine-grained ESCO skill labels for 64.5% of the spans. We split this dataset into a validation and test set using a 20%/80% split. The validation set contains 165 unique skill labels, and over 80% of the unique skill labels in the test set never occur in the validation set. A more detailed breakdown of the number of spans and annotations is shown in Table 2.

[Table 2: Breakdown of the benchmark. Recoverable column headers: # sentences, # spans, # spans with ESCO label. Recoverable row labels: TECH, HOUSE, val. The cell values are not recoverable from the extracted text.]

In order to verify our hypothesis that the distant supervision labeling leads to quite precise positive labels, at the cost of many false negatives, we validated the distant supervision labeling of the test set against the manual annotations. The automatically assigned labels are indeed rather precise (overall precision of 79%), but at the cost of low coverage (i.e., a recall of 14.6%).

4.2. Experiments

The sentences used for training are collected from a large proprietary corpus of public job postings. This dataset has been collected from different public job boards and contains a large number of English job postings. ESCO is used for the distant supervision step: a skill label is assigned when the skill itself, or one of its alternative forms provided by ESCO, is literally mentioned in a sentence. For each skill classifier, a maximum of one thousand positive sentences is retained. The number of negative examples per positive example is set to 10. We train a baseline classifier without hard negative sampling. In this case, all negative examples are sampled uniformly from the other positive corpora. To investigate the optimal hard negative sampling procedure, we conduct a hyper-parameter search for the fraction of negatives sampled using each of the three strategies (siblings, Levenshtein, embedding) versus uniform sampling. Based on the performance on the validation sets, we decide on an optimal value for this percentage. Finally, we report the contribution of each of the negative sampling strategies when combined. This is reported based on performance on the unseen test set, and contributions of the strategies are shown through ablations, by leaving one strategy out at a time. We refer to Appendix B for more details on the training procedure.

5. Results and Discussion

The results of the hyper-parameter search for each of the negative sampling strategies are shown in Fig. 2. From these results, it is clear that the different strategies have different effects on the model performance. Most notably, we find that the optimal fraction of hard negative sampling is no higher than 5% for any strategy. This is in line with previous findings on hard negative sampling [22]. Sampling large amounts of hard negatives even has a large negative impact on the performance of the model. Secondly, the “Levenshtein” strategy brings the least improvement out of all three strategies.

[Figure 2: Results of the hyper-parameter search for each negative sampling strategy. Panels: (a) TECH, (b) HOUSE.]

Finally, we trained a model that combines all strategies. Based on the results of the above hyper-parameter search, we chose 5% as an optimal value for the fraction of negatives sampled through the combined hard negative strategies. To assess the impact of each of the strategies within this combination, we trained three more models in which each of the three strategies is left out in turn. The performance of these final models is shown in Table 3. The combination of all three strategies yielded the overall best model. This model has large performance gains across the MRR and RP@K metrics for both the TECH and HOUSE datasets.

Leaving out the “Levenshtein” strategy has a relatively low impact on the performance. This might be understood by looking at the examples in Table 1: string similarity surfaces unrelated skills, for example for proper nouns such as Haskell. This could partially explain the relatively low utility of this negative sampling strategy. On the other hand, leaving out the “siblings” strategy takes away the largest part of the performance improvements. This strategy makes use of the hierarchy defined in the ESCO taxonomy, and thus is a reliable method for selecting informative hard negatives. The effect of the “embedding” strategy is comparable to that of the “siblings” strategy and thus proves a good alternative in case a hierarchy such as the one in ESCO is not available.
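The mix of uniform and hard negatives discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_negatives`, the toy sentences, sampling with replacement, and the exclusion of sentences that are also positive for the target skill are all assumptions made for the example. Only the values of 10 negatives per positive and a 5% hard negative fraction are taken from the text.

```python
import random


def sample_negatives(skill, positives, related, n_negatives, hard_fraction, rng):
    """Sample negatives for one skill; returns (hard_negatives, uniform_negatives).

    positives: dict mapping each skill to its distantly labeled sentences.
    related:   skills related to `skill` (siblings, Levenshtein, embedding, ...).
    """
    own = set(positives[skill])
    # Uniform pool: union of the positive sentences of all other skills.
    uniform_pool = sorted(
        {s for other, sents in positives.items() if other != skill for s in sents} - own
    )
    # Hard pool: sentences labeled with a related (but different) skill.
    hard_pool = sorted(
        {s for other in related if other != skill for s in positives.get(other, [])} - own
    )
    n_hard = round(hard_fraction * n_negatives) if hard_pool else 0
    # Sampling with replacement is an assumption of this sketch.
    hard = [rng.choice(hard_pool) for _ in range(n_hard)]
    uniform = [rng.choice(uniform_pool) for _ in range(n_negatives - n_hard)]
    return hard, uniform


# Toy data (hypothetical sentences, not taken from the training corpus).
positives = {
    "Haskell": ["Experience with Haskell is a plus.", "You write Haskell services."],
    "Erlang": ["We run Erlang in production."],
    "manage staff": ["You will manage staff across two sites."],
    "guide staff": ["You guide staff through onboarding."],
}
related = ["Erlang"]          # e.g. the output of one related-skill strategy
negatives_per_positive = 10   # value reported in the text
hard_fraction = 0.05          # value selected on the validation set

n_negatives = negatives_per_positive * len(positives["Haskell"])
hard, uniform = sample_negatives(
    "Haskell", positives, related, n_negatives, hard_fraction, random.Random(0)
)
print(len(hard), len(uniform))  # 1 19
```

With a combined model, `related` would hold the union of the related skills returned by the individual strategies, so that the single `hard_fraction` budget is shared between them.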

Table 3: Performance of the final models. The extracted text preserves two blocks of values, each headed by MRR; the header of the second column in each block and the dataset name of each block are not recoverable.

Model | Block 1: MRR | Block 1: (unlabeled) | Block 2: MRR | Block 2: (unlabeled)
Baseline classifier | 23.65 | 33.71 | 26.66 | 34.19
Classifier_neg | 31.71 | 39.09 | 30.82 | 38.69
Classifier_neg without embeddings | 31.43 | 39.19 | 29.09 | 37.70
Classifier_neg without Levenshtein | 31.11 | 38.55 | 30.14 | 37.22
Classifier_neg without siblings | 30.57 | 37.07 | 29.20 | 35.91

6. Conclusion and Future Work

We propose an end-to-end approach to skill extraction using distant supervision. The method is able to make fine-grained skill predictions (using 13,891 skills from ESCO) for a given input sentence. We introduce the idea of hard negative sampling through related labels in a multi-label classification setup and propose three different strategies to select these related labels. We investigate the impact of each of the strategies, and find that all three strategies combined yield the highest increase on top of a baseline model without hard negative sampling. Both the distant supervision and the hard negative sampling are designed to work well without manual labeling, which makes the whole method very flexible. To the best of our knowledge, we are the first to design such a system for skill extraction, and we improve on prior work by providing methods that relax the requirements on ground-truth data and that are able to make very fine-grained skill predictions. Finally, we release our hand-labeled test and validation dataset for skill extraction to stimulate further research on the task.

Future work could entail a more extensive investigation of other hyper-parameters, such as the number of negatives per positive sentence, which was fixed to 10 in this work. Secondly, more performance gains could be made if the RoBERTa weights were fine-tuned during training, but this requires changes in the training setup which should be carefully investigated. Lastly, it could be interesting to investigate how limited manual labor can improve the performance of the method even further with techniques such as active learning.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback. This project was funded by the Flemish Government, through Flanders Innovation & Entrepreneurship (VLAIO, project HBC.2020.2893).

References

[1] Khaouja, I. Kassou, Ghogho, A survey on skill identification from online job ads, IEEE Access 9 (2021) 118134–118153.
[2] Zhao, Javed, Jacob, M. McNair, Skill: A system for skill identification and normalization, in: Twenty-Seventh IAAI Conference, 2015.
[3] Sayfullina, Malmi, Kannala, Learning representations for soft skill matching, in: International conference on analysis of images, social networks and texts, Springer, 2018, pp. 141–152.
[4] Jia, Liu, Zhao, Liu, Sun, T. Peng, Representation of job-skill in artificial intelligence with knowledge graph analysis, in: 2018 IEEE symposium on product compliance engineering-asia (ISPCE-CN), IEEE, 2018, pp. 1–6.
[5] Bhola, Halder, Prasad, M.- Kan, Retrieving skills from job descriptions: A language model based extreme multi-label classification framework, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5832–5842. URL: https://aclanthology.org/2020.coling-main.513. doi:10.18653/v1/2020.coling-main.513.
[6] Zhang, K. N. Jensen, S. D. Sonniks, Plank, Skillspan: Hard and soft skill extraction from English job postings, arXiv preprint arXiv:2204.12811 (2022).
[7] Zhang, K. N. Jensen, Plank, Kompetencer: Fine-grained skill classification in Danish job postings via distant supervision and transfer learning, arXiv preprint arXiv:2205.01381 (2022).
[8] Beauchemin, Laumonier, Y. L. Ster, M. Yassine, “FIJO”: a French insurance soft skill detection dataset, arXiv preprint arXiv:2204.05208 (2022).
[9] D. A. Tamburri, W.-J. Van Den Heuvel, M. Garriga, DataOps for societal intelligence: A data pipeline for labor market skills extraction and matching, in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), IEEE, 2020, pp. 391–394.
[10] Vermeer, Provatorova, Graus, T. Rajapakse, S. Mesbah, Using RobBERT and eXtreme multi-label classification to extract implicit and explicit skills from Dutch job descriptions (2022).
[11] P. Delobelle, T. Winters, B. Berendt, RobBERT: a Dutch Roberta-based language model, arXiv preprint arXiv:2001.06286 (2020).
[12] ESCO, European skills, competences, qualifications and occupations, EC Directorate E (2017).
[13] J. Lu, L. Du, M. Liu, J. Dipnall, Multi-label few/zero-shot learning with knowledge aggregated from multiple label graphs, arXiv preprint arXiv:2010.07459 (2020).
[14] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, arXiv preprint arXiv:1905.10892 (2019).
[15] Y. Xiong, W.-C. Chang, C.-J. Hsieh, H.-F. Yu, I. Dhillon, Extreme zero-shot learning for extreme text classification, arXiv preprint arXiv:2112.08652 (2021).
[16] W. Yin, J. Hay, D. Roth, Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach, arXiv preprint arXiv:1909.00161 (2019).
[17] E. Cole, O. Mac Aodha, T. Lorieul, P. Perona, D. Morris, N. Jojic, Multi-label learning from single positive labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 933–942.
[18] D. Zhou, P. Chen, Q. Wang, G. Chen, P.-A. Heng, Acknowledging the unknown for multi-label learning with single positive labels, arXiv preprint arXiv:2203.16219 (2022).
[19] M. C. Du Plessis, G. Niu, M. Sugiyama, Analysis of learning from positive and unlabeled data, Advances in neural information processing systems 27 (2014).
[20] L. Sterckx, T. Demeester, J. Deleu, C. Develder, Knowledge base population using semantic label propagation, Knowledge-Based Systems 108 (2016) 79–91.
[21] J.-J. Decorte, J. Van Hautte, T. Demeester, C. Develder, Jobbert: Understanding job titles through skills, arXiv preprint arXiv:2109.09605 (2021).
[22] J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Contrastive learning with hard negative samples, arXiv preprint arXiv:2010.04592 (2020).
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[25] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019).

A. Annotation guidelines

Each item that needs to be annotated is a span, thus a part of a longer job posting sentence. Both the span and the complete sentence are shown to provide the right context for annotation. When a span is ambiguous, the full sentence must be read to understand the meaning of the span.

The task is to annotate the correct and most specific skill that is mentioned or implied by the span. The place of the candidate labels within the shortlist has no importance during annotation. In the case that no correct skill is found in the shortlist, you may search for the correct skill using the ESCO interface [12]. If you still cannot find a correct label, select LABEL NOT PRESENT. If you find that the span can generally not be interpreted as a skill, select UNDERSPECIFIED.

A.1. Examples

• Given the span “partner continuously with your many stakeholders” and the candidate labels “Communicate With Stakeholders”, “Negotiate With Stakeholders” and “Liaise With Shareholders”, only the first two labels are considered correct. “Communicate With Stakeholders” is most specific with regard to the span, so this label should be selected.
• Spans such as “apply your depth of knowledge” or “apply your expertise” are classified as UNDERSPECIFIED.

B. Training details

The separate classifiers are implemented as simple logistic regression models, using the popular scikit-learn toolkit [24]. All parameters are set to their default values, except for the inverse regularization strength parameter C, which is set to 0.1 for stronger regularization. The RoBERTa model and the mean pooling operation are implemented using the Sentence-BERT library [25].
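As a rough illustration of the training details in Appendix B, the sketch below fits one binary logistic regression classifier for a single skill with scikit-learn, using default parameters except C=0.1. It assumes that sentence embeddings have already been computed; random vectors stand in for the frozen, mean-pooled RoBERTa representations, and the function name `train_skill_classifier`, the embedding dimension, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_skill_classifier(pos_embeddings, neg_embeddings, C=0.1):
    """Fit a binary classifier for one skill on precomputed sentence embeddings."""
    X = np.vstack([pos_embeddings, neg_embeddings])
    y = np.concatenate([
        np.ones(len(pos_embeddings), dtype=int),
        np.zeros(len(neg_embeddings), dtype=int),
    ])
    # Defaults everywhere except the inverse regularization strength C.
    clf = LogisticRegression(C=C)
    clf.fit(X, y)
    return clf


# Synthetic stand-ins for sentence embeddings (assumption: 16 dimensions).
rng = np.random.default_rng(0)
dim = 16
pos = rng.normal(loc=1.0, scale=0.5, size=(20, dim))    # 20 positive sentences
neg = rng.normal(loc=-1.0, scale=0.5, size=(200, dim))  # 10 negatives per positive

clf = train_skill_classifier(pos, neg)

# Score a new sentence embedding; the probability of the positive class
# is the per-skill prediction value used for ranking skills.
query = np.ones((1, dim))
score = clf.predict_proba(query)[0, 1]
print(score > 0.5)
```

Repeating this for every skill and sorting the resulting scores for a given sentence would yield the skill ranking that the evaluation metrics operate on.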