=Paper=
{{Paper
|id=Vol-3218/paper10
|storemode=property
|title=Skill Extraction from Job Postings using Weak Supervision
|pdfUrl=https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_10.pdf
|volume=Vol-3218
|authors=Mike Zhang,Kristian Nørgaard Jensen,Rob van der Goot,Barbara Plank
|dblpUrl=https://dblp.org/rec/conf/hr-recsys/ZhangJGP22
}}
==Skill Extraction from Job Postings using Weak Supervision==
Mike Zhang¹*, Kristian Nørgaard Jensen¹, Rob van der Goot¹ and Barbara Plank¹,²
¹ IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen, Denmark
² Ludwig Maximilian University of Munich, Akademiestraße 7, 80799 Munich, Germany
* Corresponding author: mikz@itu.dk (M. Zhang)
Contact: krnj@itu.dk (K. N. Jensen); robv@itu.dk (R. van der Goot); b.plank@lmu.de (B. Plank)
Web: https://jjzha.github.io/ (M. Zhang); http://kris927b.github.io/ (K. N. Jensen); http://robvanderg.github.io/ (R. van der Goot); http://bplank.github.io/ (B. Plank)
Abstract
Aggregated data obtained from job postings provide powerful insights into labor market demands and emerging skills, and can aid job matching. However, most extraction approaches are supervised and thus need costly and time-consuming annotation. To overcome this, we propose Skill Extraction with Weak Supervision. We leverage the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy to find similar skills in job ads via latent representations. The method shows a strong positive signal, outperforming baselines based on token-level and syntactic patterns.
Keywords
Skill Extraction, Weak Supervision, Information Extraction, Job Postings, Skill Taxonomy, ESCO
1. Introduction

The labor market is under constant development—often due to changes in technology, migration, and digitization—and so are the skill sets required [1, 2]. Consequently, large quantities of job vacancy data are emerging on a variety of platforms. Insights from this data on labor market skill set demands could aid, for instance, job matching [3]. The task of automatic skill extraction (SE) is to extract the competences necessary for any occupation from unstructured text.

Previous work on supervised SE frames it as a sequence labeling task (e.g., [4, 5, 6, 7, 8, 9, 10]) or as multi-label classification [11]. Annotation is a costly and time-consuming process with few annotation guidelines to work with. This could be alleviated by using predefined skill inventories.

In this work, we approach span-level SE with weak supervision: We leverage the European Skills, Competences, Qualifications and Occupations (ESCO; [12]) taxonomy and find similar spans that relate to ESCO skills in embedding space (Figure 1). The advantages are twofold: First, labeling skills becomes obsolete, which mitigates the cumbersome process of annotation. Second, by extracting skill phrases, this could possibly enrich skill inventories (e.g., ESCO) by finding paraphrases of existing skills. We seek to answer: How viable is Weak Supervision in the context of SE? We contribute: (1) a novel weakly supervised method for SE; (2) a linguistic analysis of ESCO skills and their presence in job postings; (3) an empirical analysis of different embedding pooling methods for SE on two skill-based datasets.¹

¹ https://github.com/jjzha/skill-extraction-weak-supervision

Figure 1: Weakly Supervised Skill Extraction. All ESCO skills and n-grams are extracted and embedded through a language model, e.g., RoBERTa [13], to get representations. We label spans from job postings close in vector space to the ESCO skill.

RecSys in HR’22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

2. Methodology

Formally, we consider a set of job postings 𝒟, where 𝑑 ∈ 𝒟 is a set of sequences (e.g., job posting sentences) with the 𝑖-th input sequence 𝒯_𝑑^𝑖 = [𝑡_1, 𝑡_2, ..., 𝑡_𝑛] and a target sequence of BIO labels 𝒴_𝑑^𝑖 = [𝑦_1, 𝑦_2, ..., 𝑦_𝑛] (e.g., "B-SKILL", "I-SKILL", "O").² The goal is to use an algorithm that predicts skill spans by assigning an output label sequence 𝒴_𝑑^𝑖 to each token sequence 𝒯_𝑑^𝑖 from a job posting, based on the representational similarity of a span to any skill in ESCO.

² Definition of labels can be found in [8].
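To make the sequence-labeling format above concrete, here is a minimal sketch with a made-up job posting sentence; the tokens and labels are illustrative only and are not taken from the datasets:

```python
# Hypothetical example of one token sequence T and its BIO label sequence Y.
tokens = ["Experience", "in", "Python", "and", "project", "management", "is", "a", "plus"]
labels = ["O", "O", "B-SKILL", "O", "B-SKILL", "I-SKILL", "O", "O", "O"]

assert len(tokens) == len(labels)  # T and Y are aligned token by token
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```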
Figure 2: Surface-level Statistics of ESCO. We show various statistics of ESCO. (A) ESCO skills token length, the mode
is three tokens. (B) Most frequent unigrams of ESCO skills. (C) Most frequent bigrams of ESCO skills. (D) Most frequent
trigrams of ESCO skills. (E) Most frequent POS sequences of ESCO skills. Last, we show the POS sequences of unique skills in
both train sets of Sayfullina and SkillSpan (F-G).
Table 1: Statistics of Datasets. Indicated is each dataset and their respective number of sentences, tokens, skill spans, and the average length of skills in tokens.

          Statistics          Sayfullina    SkillSpan
  Train   # Sentences         3,703         5,866
          # Tokens            53,095        122,608
          # Skill Spans       3,703         3,325
  Dev.    # Sentences         1,856         3,992
          # Tokens            26,519        52,084
          # Skill Spans       1,856         2,697
  Test    # Sentences         1,848         4,680
          # Tokens            26,569        57,528
          # Skill Spans       1,848         3,093
          Avg. Len. Skills    1.77          2.92

2.1. Data

We use the datasets from [8] (SkillSpan) and a modification of [4] (Sayfullina).³ In Table 1, we show the statistics of both. SkillSpan contains nested labels for skill and knowledge components [12]. To make it fit our weak supervision approach, we simplify their dataset by considering both skill and knowledge labels as one label (i.e., B-KNOWLEDGE becomes B-SKILL).

ESCO Statistics: We use ESCO as a weak supervision signal for discovering skills in job postings. There are 13,890 ESCO skills.⁴ In Figure 2, we show statistics of the taxonomy: (A) on average, most skills are three tokens long. In (B-D), we show n-gram frequencies with range [1; 3]. We can see that the most frequent uni- and bigrams are verbs, while the most frequent trigrams consist of nouns.

Additionally, we show an analysis of ESCO skills from a linguistic perspective. We tag the training data using the publicly available MaChAmp v0.2 model [14] trained on all Universal Dependencies 2.7 treebanks [15].⁵ Then, we count the most frequent Part-of-Speech (POS) tags in all sources of data (E-G). ESCO's most frequent tag sequences are VERB-NOUN; these are not as frequent in Sayfullina nor SkillSpan. Sayfullina mostly consists of adjectives, which is attributed to its categorization of soft skills. SkillSpan mostly consists of NOUN sequences. Overall, we observe that most skills consist of verb and noun phrases.

³ In contrast to SkillSpan, Sayfullina has a skill in every sentence, where they focus on categorizing sentences for soft skills.
⁴ Per 25-03-2022, taking ESCO v1.0.9.
⁵ A Udify-based [16] multi-task model for POS, lemmatization, and dependency parsing, built on top of the transformers library [17], specifically using mBERT [18].
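Surface statistics of the kind shown in Figure 2 can be approximated with a short script. The sketch below is illustrative only: it assumes a plain-text file with one ESCO skill label per line (the filename esco_skills.txt is hypothetical) and uses spaCy's English model as a stand-in for the MaChAmp tagger used in the paper.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in tagger; the paper uses MaChAmp v0.2

length_dist, ngram_dist, pos_dist = Counter(), Counter(), Counter()

with open("esco_skills.txt", encoding="utf-8") as f:  # hypothetical: one skill per line
    skills = [line.strip() for line in f if line.strip()]

for doc in nlp.pipe(skills):
    tokens = [t.text.lower() for t in doc]
    length_dist[len(tokens)] += 1          # token-length distribution (Figure 2, A)
    for n in range(1, 4):                  # n-gram frequencies for n in [1; 3] (B-D)
        for i in range(len(tokens) - n + 1):
            ngram_dist[" ".join(tokens[i:i + n])] += 1
    pos_dist[" ".join(t.pos_ for t in doc)] += 1  # POS sequence of the skill phrase (E)

print(length_dist.most_common(5))
print(ngram_dist.most_common(10))
print(pos_dist.most_common(10))
```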
Figure 3: Skill Representations. We show different methods to embed ESCO skill phrases. The approaches are inspired
by Litschko et al. [19]. We embed a skill by encoding it directly without surrounding context (left). We aggregate different
contextual representations of the same skill term (middle). Last, we encode the skill phrase via a weighted sum of embeddings
with each token’s inverse document frequency as weight (right). For the middle and right methods, 𝒮 is the number of
sentences where the ESCO skill appears.
2.2. Baselines

As our approach is to find similar n-grams based on ESCO skills, we choose an n-gram range of [1; 4] (where 4 is the median) derived from Figure 2 (A). For higher matching probability, we apply an additional pre-processing step to the ESCO skills by removing non-tokens (e.g., brackets) and words between brackets (e.g., “Java (programming)” becomes “Java”). We have three baselines:

Exact Match: We do exact substring matching with ESCO and the sentences in both datasets.

Lemmatized Match: ESCO skills are written in the infinitive form. We take the same approach as exact match on the training sets, now with the lemmatized data of both. The data is lemmatized with MaChAmp v0.2 [14].

POS Sequence Match: Motivated by the observation that certain POS sequences often overlap between sources (Figure 2, E-G), we attempt to match POS sequences within ESCO with the POS sequences in the datasets. For example, NOUN-NOUN, NOUN, VERB-NOUN, and ADJ-NOUN sequences are commonly occurring in all three sources.
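As an illustration of the pre-processing and the Exact Match baseline, the minimal sketch below strips bracketed material from ESCO skill labels and looks them up as substrings. The skill strings and the sentence are made-up examples, and this is a sketch rather than the authors' released code.

```python
import re

def preprocess_skill(skill: str) -> str:
    """Remove bracketed parts, e.g. "Java (programming)" -> "Java"."""
    return re.sub(r"\s*\([^)]*\)", "", skill).strip()

def exact_match(sentence: str, esco_skills: list[str]) -> list[str]:
    """Exact Match baseline: return ESCO skills occurring verbatim in the sentence."""
    sentence_lower = sentence.lower()
    return [skill for skill in esco_skills
            if preprocess_skill(skill).lower() in sentence_lower]

# Hypothetical usage:
skills = ["Java (computer programming)", "Python", "team management"]
print(exact_match("Experience in Java and Python is a plus", skills))
# -> ['Java (computer programming)', 'Python']
```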
Algorithm 1: Weakly Supervised Skill Extraction
Require: M ∈ {RoBERTa, JobBERT}
Require: E ∈ {ISO, AOC, WSE}
Require: τ ∈ [0, 1]
  P ← 𝒟            ▷ A set of sentences from job postings
  S ← S_E           ▷ ESCO skill embeddings of type E
  L ← ∅
  for p ∈ P do
      θ ← 0
      for n ∈ p do                    ▷ Each n-gram n of size 1-4
          e ← M(n)
          Θ ← CosSim(S, e)
          if max(Θ) > τ ∧ max(Θ) > θ then
              θ ← max(Θ)
          end if
      end for
      L ← [L, θ]
  end for
  return L
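A minimal Python sketch of the matching loop in Algorithm 1 is given below. It assumes pre-computed ESCO skill embeddings stacked in a matrix (of type ISO, AOC, or WSE), uses mean-pooled RoBERTa subword embeddings for candidate n-grams, and keeps, per sentence, the highest-scoring n-gram above the threshold τ. Function and variable names are ours, not taken from the authors' implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

def embed_span(text: str) -> torch.Tensor:
    """Mean-pool the subword embeddings of a span (ISO-style encoding)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state[0]      # (num_subwords, hidden)
    return out.mean(dim=0)

def ngrams(tokens, max_n=4):
    """All n-grams of size 1-4, following the range chosen from Figure 2 (A)."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def extract_skills(sentences, esco_embeddings, tau=0.8):
    """For each sentence, return the n-gram most similar to any ESCO skill,
    or None if no candidate exceeds the CosSim threshold tau."""
    esco = torch.nn.functional.normalize(esco_embeddings, dim=-1)
    predictions = []
    for sentence in sentences:
        best_score, best_span = 0.0, None
        for gram in ngrams(sentence.split()):
            g = torch.nn.functional.normalize(embed_span(gram), dim=-1)
            score = (esco @ g).max().item()          # highest CosSim to any ESCO skill
            if score > tau and score > best_score:
                best_score, best_span = score, gram
        predictions.append((best_span, best_score))
    return predictions

# Hypothetical usage with two ESCO skills encoded in isolation:
esco_matrix = torch.stack([embed_span(s) for s in ["Python", "team management"]])
print(extract_skills(["Experience in Python is a plus"], esco_matrix, tau=0.8))
```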
2.3. Skill Representations

We investigate several encoding strategies to match n-gram representations to embedded ESCO skills; the approaches are inspired by Litschko et al. [19], who applied them to Information Retrieval. The language models (LMs) used to encode the data are RoBERTa [13] and the domain-specific JobBERT [8]. All obtained vector representations of skill phrases with the three encoding methods are compared pairwise with each n-gram created from Sayfullina and SkillSpan. An explanation of the methods (see Figure 3):

Span in Isolation (ISO): We encode skill phrases t from ESCO in isolation using the aforementioned LMs, without surrounding context.

Average over Contexts (AOC): We leverage the surrounding context of a skill phrase t by collecting all the sentences containing t. We use all available sentences in the job postings dataset (excluding Test). For a given job posting sentence, we encode t using one of the previously mentioned LMs. We average the embeddings of its constituent subwords to obtain the final embedding of t.

Weighted Span Embedding (WSE): We obtain the inverse document frequency (idf) value of each token t_i via

    \mathrm{idf}(t_i) = -\log \frac{n_{t_i}}{N},

where n_{t_i} is the number of occurrences of t_i and N is the total number of tokens in our dataset. We encode the input sentence and compute the weighted sum of the embeddings (s⃗_j) of the specific skill phrase in the sentence, where each t_i's idf score is used as weight. Again, we only use the first subword token for each tokenized word. Formally, this is

    \vec{s}_j = \sum_{i} \left(-\log \frac{n_{t_i}}{N}\right) \cdot \vec{t}_i .

Matching: We rank pairs of ESCO embeddings t⃗ and encoded candidate n-grams g⃗ in decreasing order of cosine similarity (CosSim), calculated as

    \mathrm{CosSim}(\vec{t}, \vec{g}) = \frac{\vec{t}^{\,T}\vec{g}}{\lVert\vec{t}\rVert \, \lVert\vec{g}\rVert} .

We show the pseudocode of the matching algorithm in Algorithm 1. Note that for SkillSpan we have to set a threshold for CosSim, as there are sentences with no skills. A threshold allows us to have a “no skill” option. As seen in Figure 5 (Appendix A), the threshold sensitivity on SkillSpan differs for JobBERT: performance fluctuates compared to RoBERTa. Precision goes up with a higher threshold, while recall goes down. For RoBERTa, it stays similar until CosSim = 0.9. We use CosSim = 0.8, as over 2 LMs and 3 methods it provides the best cutoff.
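The three pooling strategies can be sketched as follows. The sketch abstracts the LM behind a hypothetical helper encode(tokens) that returns one vector per token (e.g., mean-pooled subword embeddings from RoBERTa or JobBERT); it therefore simplifies the paper's subword handling (averaging subwords for AOC, first subword with idf weights for WSE), and all names are ours.

```python
import math
import torch

def find_subsequence(sentence_tokens, span_tokens):
    """Return the start index of span_tokens inside sentence_tokens (assumed present)."""
    for i in range(len(sentence_tokens) - len(span_tokens) + 1):
        if sentence_tokens[i:i + len(span_tokens)] == span_tokens:
            return i
    raise ValueError("span not found in sentence")

def iso_embedding(skill_tokens, encode):
    """ISO: encode the ESCO skill phrase in isolation and average its token vectors."""
    return torch.stack(encode(skill_tokens)).mean(dim=0)

def aoc_embedding(skill_tokens, context_sentences, encode):
    """AOC: average the skill-span vectors over all sentences containing the skill."""
    span_vectors = []
    for sentence in context_sentences:          # each sentence is a list of tokens
        vectors = encode(sentence)
        i = find_subsequence(sentence, skill_tokens)
        span_vectors.append(torch.stack(vectors[i:i + len(skill_tokens)]).mean(dim=0))
    return torch.stack(span_vectors).mean(dim=0)

def wse_embedding(skill_tokens, sentence, encode, token_counts, total_tokens):
    """WSE: weight each token vector of the skill span by idf = -log(n_t / N).
    token_counts maps a token to its corpus frequency n_t; total_tokens is N."""
    vectors = encode(sentence)
    i = find_subsequence(sentence, skill_tokens)
    weighted = [
        -math.log(token_counts[tok] / total_tokens) * vectors[i + j]
        for j, tok in enumerate(skill_tokens)
    ]
    return torch.stack(weighted).sum(dim=0)
```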
Figure 4: Results of Methods. Results on Sayfullina and SkillSpan are indicated by “Baseline” showing performance of Exact,
Lemmatized (Lemma), and Part-of-Speech (POS). The performance of ISO, AOC, and WSE are separated by model, indicated
by “RoBERTa” and “JobBERT”. The performance of RoBERTa and JobBERT on SkillSpan is determined by the best performing
CosSim threshold (0.8).
Table 2: Qualitative Examples of Predicted Spans. We show the gold versus predicted spans of the best performing model on both datasets. The first 5 qualitative examples are from Sayfullina (RoBERTa with WSE), the last 5 are from SkillSpan. Yellow indicates the gold span and pink indicates the predicted span. The examples show many partial overlaps with the gold spans (but also incorrect ones), hence the high loose-F1.

Sayfullina (Gold | Predicted):
  ...a dynamic customer focused person to join... | ...a dynamic customer focused person to join...
  ...strong leadership and team management skills... | ...strong leadership and team management skills...
  ...speak and written english skills... | ...speak and written english skills...
  ...a team environment and working independently skills... | ...a team environment and working independently skills...
  ...tangible business benefit extremely articulate and... | ...tangible business benefit extremely articulate and...

SkillSpan (Gold | Predicted):
  ...researcher within machine learning and sensory system design... | ...researcher within machine learning and sensory system design...
  ...standards and procedures accessing and updating records... | ...standards and procedures accessing and updating records...
  ...with a passion for education to... | ...with a passion for education to...
  ...understands Agile as a mindset... | ...understands Agile as a mindset...
  ...experience with AWS GCP Microsoft Azure... | ...experience with AWS GCP Microsoft...
3. Analysis of Results

Results: Our main results (Figure 4) show the baselines against ISO, AOC, and WSE on both datasets. We evaluate with two types of F1, following van der Goot et al. [20]: strict-F1 and loose-F1. For full model fine-tuning, RoBERTa achieves 91.31 and 98.55 strict and loose F1 on Sayfullina, respectively. For SkillSpan, this is 23.21 and 44.72 strict and loose F1 (on the available subsets of SkillSpan). JobBERT achieves 90.18 and 98.19 strict and loose F1 on Sayfullina, and 49.44 and 74.41 strict and loose F1 on SkillSpan. The large difference between results is most likely due to the lack of negatives in Sayfullina, i.e., all sentences contain a skill, which makes the task easier. These results highlight the difficulty of SE on SkillSpan, where there are negatives as well (sentences with no skills).
The exact match baseline on SkillSpan is higher than on Sayfullina. We attribute this to SkillSpan also containing “hard skills” (e.g., “Python”), which are easier to match as substrings than “soft skills”.⁶

For the performance of the skill representations on Sayfullina, RoBERTa and JobBERT outperform the Exact and Lemmatized baselines on strict-F1. For the POS baseline, only the ISO method of both models is slightly better. JobBERT performs better than RoBERTa in strict-F1 on both datasets.

There is a substantial difference between strict and loose-F1 on both datasets. This indicates that there is partial overlap among the predicted and gold spans. RoBERTa performs best for Sayfullina, achieving 59.61 loose-F1 with WSE. In addition, the best performing method for JobBERT is also WSE (52.69 loose-F1). For SkillSpan we see a drop; JobBERT outperforms RoBERTa with AOC (32.30 vs. 26.10 loose-F1) given a threshold of CosSim = 0.8. We hypothesize that this drop in performance compared to Sayfullina could again be attributed to SkillSpan containing negative examples as well (i.e., sentences with no skill).

Qualitative Analysis: A qualitative analysis (Table 2) reveals that there is strong partial overlap between gold and predicted spans on both datasets, e.g., “...strong leadership and team management skills...” vs. “...strong leadership and team management skills...”, indicating the viability of this method.

4. Conclusion

We investigate whether the ESCO skill taxonomy is suitable as a weak supervision signal for Skill Extraction. We apply several skill representation methods based on previous work. We show that using representations of ESCO skills can aid us in this task. We achieve high loose-F1, indicating there is partial overlap between the predicted and gold spans, but we need refined offset methods to get the correct span out (e.g., human post-editing or automatic methods such as candidate filtering). Nevertheless, we see this approach as a strong alternative for supervised Skill Extraction from job postings.

Future work could include moving towards multilingual Skill Extraction; as ESCO is available in 27 languages, exact matching should be trivial. For the other methods, several considerations need to be taken into account, e.g., a POS tagger and/or lemmatizer for the other language and a language-specific model.

⁶ The exact numbers (+ precision and recall) are in Table 3, Appendix A, including the definition of strict and loose-F1.

Acknowledgments

We thank the NLPnorth group for feedback on an earlier version of this paper—in particular, Elisa Bassignana and Max Müller-Eberstein for insightful discussions. We would also like to thank the anonymous reviewers for their comments to improve this paper. Last, we also thank NVIDIA and the ITU High-performance Computing cluster for computing resources. This research is supported by the Independent Research Fund Denmark (DFF) grant 9131-00019B.

References

[1] E. Brynjolfsson, A. McAfee, Race against the machine: How the digital revolution is accelerating innovation, driving productivity, and irreversibly transforming employment and the economy, Brynjolfsson and McAfee, 2011.
[2] E. Brynjolfsson, A. McAfee, The second machine age: Work, progress, and prosperity in a time of brilliant technologies, WW Norton & Company, 2014.
[3] K. Balog, Y. Fang, M. De Rijke, P. Serdyukov, L. Si, Expertise retrieval, Foundations and Trends in Information Retrieval 6 (2012) 127–256.
[4] L. Sayfullina, E. Malmi, J. Kannala, Learning representations for soft skill matching, in: International Conference on Analysis of Images, Social Networks and Texts, 2018, pp. 141–152.
[5] D. A. Tamburri, W.-J. Van Den Heuvel, M. Garriga, Dataops for societal intelligence: a data pipeline for labor market skills extraction and matching, in: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), IEEE, 2020, pp. 391–394.
[6] M. Chernova, Occupational skills extraction with FinBERT, Master’s Thesis (2020).
[7] M. Zhang, K. N. Jensen, B. Plank, Kompetencer: Fine-grained skill classification in Danish job postings via distant supervision and transfer learning, Under Review, LREC 2022 (2022).
[8] M. Zhang, K. N. Jensen, S. Sonniks, B. Plank, SkillSpan: Hard and soft skill extraction from English job postings, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 4962–4984.
[9] T. Green, D. Maynard, C. Lin, Development of a benchmark corpus to support entity recognition in job descriptions, in: Proceedings of the Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 1201–1208. URL: https://aclanthology.org/2022.lrec-1.128.
[10] A.-S. Gnehm, E. Bühlmann, S. Clematide, Evaluation of transfer learning and domain adaptation for analyzing German-speaking job advertisements, in: Proceedings of the Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 3892–3901. URL: https://aclanthology.org/2022.lrec-1.414.
[11] A. Bhola, K. Halder, A. Prasad, M.-Y. Kan, Retrieving skills from job descriptions: A language model based extreme multi-label classification framework, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5832–5842.
[12] M. le Vrang, A. Papantoniou, E. Pauwels, P. Fannes, D. Vandensteen, J. De Smedt, ESCO: Boosting job matching in Europe with semantic interoperability, Computer 47 (2014) 57–64.
[13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[14] R. van der Goot, A. Üstün, A. Ramponi, I. Sharaf, B. Plank, Massive choice, ample tasks (MaChAmp): A toolkit for multi-task learning in NLP, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 176–197.
[15] D. Zeman, J. Nivre, M. Abrams, et al., Universal Dependencies 2.8.1, 2021. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[16] D. Kondratyuk, M. Straka, 75 languages, 1 model: Parsing Universal Dependencies universally, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2779–2795.
[17] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[19] R. Litschko, I. Vulić, S. P. Ponzetto, G. Glavaš, On cross-lingual retrieval with multilingual text encoders, Information Retrieval Journal (2022) 1–35.
[20] R. van der Goot, I. Sharaf, A. Imankulova, A. Üstün, M. Stepanovic, A. Ramponi, S. O. Khairunnisa, M. Komachi, B. Plank, From masked-language modeling to translation: Non-English auxiliary tasks improve zero-shot spoken language understanding, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Mexico City, Mexico, 2021.
Table 3: We show the exact numbers of the performance of the methods.

Dataset →                     Sayfullina                                      SkillSpan
↓ Method, Metric →            Strict (P | R | F1)     Loose (P | R | F1)      Strict (P | R | F1)    Loose (P | R | F1)
Baseline   Exact              9.27 | 1.30 | 2.28      25.48 | 3.95 | 6.84     23.82 | 3.21 | 5.62    43.68 | 8.27 | 13.79
           Lemmatized         8.49 | 1.19 | 2.09      25.87 | 4.00 | 6.93     23.90 | 2.97 | 5.21    41.09 | 7.49 | 12.52
           POS                5.99 | 5.95 | 5.97      36.55 | 34.51 | 35.50   5.97 | 7.88 | 6.79     19.34 | 34.71 | 24.80
RoBERTa    ISO                6.26 | 6.25 | 6.26      26.90 | 28.98 | 27.90   2.90 | 4.24 | 3.43     12.69 | 28.61 | 17.56
           AOC                3.24 | 3.24 | 3.24      64.04 | 55.53 | 59.48   2.23 | 2.93 | 2.53     20.08 | 37.56 | 26.10
           WSE                3.67 | 3.67 | 3.67      64.64 | 55.32 | 59.61   2.29 | 2.93 | 2.57     20.90 | 37.79 | 26.85
JobBERT    ISO                7.71 | 7.72 | 7.71      27.76 | 29.95 | 28.82   4.17 | 4.65 | 4.39     17.07 | 29.48 | 21.61
           AOC                4.04 | 4.05 | 4.05      56.50 | 48.41 | 52.14   4.44 | 2.96 | 3.54     33.64 | 31.28 | 32.30
           WSE                4.15 | 4.16 | 4.15      56.98 | 49.00 | 52.69   4.78 | 3.08 | 3.74     34.01 | 30.33 | 31.95
Figure 5: Results of Methods. Results of the baselines are in (X), the performance of ISO, AOC, and WSE on Sayfullina
in (A-C), and the same performance on SkillSpan in (D-I) based on the model (RoBERTa or JobBERT). In D–F, we show the
precision (P), recall (R), and F1 differences when taking an increasing CosSim.
A. Exact Results

Definition F1: As mentioned, we evaluate with two types of F1-scores, following van der Goot et al. [20]. The first type is the commonly used span-F1, where only the correct span and label are counted towards true positives. This is called strict-F1. In the second variant, we look for partial matches, i.e., overlap between the predicted and gold span including the correct label, which counts towards true positives for precision and recall. This is called loose-F1. We consider the loose variant as well, because we want to analyze whether the span is “almost correct”.
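A minimal sketch of the two metrics is shown below, assuming gold and predicted spans are given as (start, end, label) tuples per sentence; this is our reading of the definitions above, not the evaluation script of van der Goot et al. [20].

```python
def span_f1(gold_spans, pred_spans, loose=False):
    """Compute span-F1 over lists of (start, end, label) tuples.
    strict: a prediction counts only if span boundaries and label match exactly.
    loose:  a prediction counts if it overlaps a gold span with the same label."""
    def match(pred, gold):
        if loose:
            return pred[2] == gold[2] and pred[0] < gold[1] and gold[0] < pred[1]
        return pred == gold

    tp_pred = sum(any(match(p, g) for g in gold_spans) for p in pred_spans)
    tp_gold = sum(any(match(p, g) for p in pred_spans) for g in gold_spans)
    precision = tp_pred / len(pred_spans) if pred_spans else 0.0
    recall = tp_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one gold skill span, one overlapping prediction.
gold = [(2, 5, "SKILL")]
pred = [(3, 5, "SKILL")]
print(span_f1(gold, pred, loose=False))  # 0.0  (boundaries differ)
print(span_f1(gold, pred, loose=True))   # 1.0  (partial overlap, same label)
```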
Exact Numbers Results: We show the exact numbers of Figure 4 in Table 3 and more detailed results in Figure 5. Results show that there is high precision among the baseline approaches compared to recall. This is balanced using the representation methods for Sayfullina. However, we observe that for SkillSpan recall is much higher than precision.