=Paper=
{{Paper
|id=Vol-3878/52_main_long
|storemode=property
|title=Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models
|pdfUrl=https://ceur-ws.org/Vol-3878/52_main_long.pdf
|volume=Vol-3878
|authors=Hamit Kavas,Marc Serra-Vidal,Leo Wanner
|dblpUrl=https://dblp.org/rec/conf/clic-it/KavasSW24
}}
==Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models==
Enhancing Job Posting Classification with Multilingual
Embeddings and Large Language Models
Hamit Kavas¹,²,*, Marc Serra-Vidal² and Leo Wanner¹,³
¹ NLP Group, Pompeu Fabra University, C/ Roc Boronat, 138, 08018, Spain
² Adevinta Spain, C/ de la Ciutat de Granada, 150, Barcelona, 08018, Spain
³ Catalan Institute for Research and Advanced Studies (ICREA), Passeig Lluís Companys, 23, Barcelona, 08010, Spain
Abstract
In the modern labour market, taxonomies such as the European Skills, Competences, Qualifications and Occupations (ESCO)
classification are used as an interlingua to match job postings with job seeker profiles. Both are classified with respect to
ESCO occupations, and match if they align with the same occupation and the same skills assigned to the occupation. However,
matching models usually struggle with the classification because of overlapping skills and similar definitions of occupations
defined in the ESCO taxonomy. This often leads to imprecise classification outcomes. In this paper, we focus on the challenge
of the classification of job postings written in Italian or Spanish against ESCO occupations written in English. We experiment
with multilingual embeddings, zero-shot classification, and the use of a large language model (LLM), and show that the use of
an LLM leads to the best results. Furthermore, we also explore an alternative automatic labeling method by prompting three
top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic
labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.
Keywords
ESCO labour market taxonomy, job posting classification, class embeddings, text embeddings, LLM
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
Email: hamit.kavas@upf.edu (H. Kavas); marc.serrav@adevinta.com (M. Serra-Vidal); leo.wanner@upf.edu (L. Wanner)
ORCID: 0009-0009-7027-7367 (H. Kavas); 0009-0000-0120-381X (M. Serra-Vidal); 0000-0002-9446-3748 (L. Wanner)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

The modern labour market is becoming more and more diverse. High-tech jobs demand novel skills and competences, which in turn keep undergoing adaptations and modifications. Under these circumstances, accurately classifying job postings and CVs of job seekers (henceforth candidate experiences) that contain detailed technological specifications with remarkably similar yet distinct skills and experiences has evolved into a complex challenge.

The overwhelming majority of job portals and employment agencies use either the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy¹ or its US equivalent, the O*Net taxonomy,² to classify job postings and candidate experiences in terms of ESCO/O*Net occupations labeled with job titles. Most proposals for the automatic alignment of job postings with candidate experiences (or vice versa) also use ESCO or O*Net [1, 2, 3]. However, despite their wide use, both the ESCO and O*Net taxonomies exhibit fundamental limitations for the task of automatic classification of job postings and candidate experiences: owing to their tree structure, they often fail to adequately distinguish between occupations that exhibit substantial skill overlaps. For instance, two job postings labeled as ‘data analyst’ may appear similar but require different skills if one focuses on market research while the other concentrates on healthcare trends analysis. This issue is particularly pronounced when classification relies on a single label, such as the job title of an ESCO occupation, where skill overlaps undermine precise classification. Hence, employing multiple job titles and framing the problem as a multi-label classification task is imperative.

¹ https://esco.ec.europa.eu/en/classification
² https://www.onetonline.org/

This paper addresses the challenge of multilingual multi-label classification using Large Language Models (LLMs) for the alignment of Italian and Spanish job postings with English job titles encountered in the ESCO taxonomy. Multilingual class embeddings are explored to improve classification accuracy, aiming to provide the necessary contextual awareness and to address the core limitations of taxonomies such as ESCO.

Furthermore, we explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.

To provide LLMs with domain-specific information and to mitigate hallucinations in the course of the classification of the job postings, we employ Retrieval Augmented Generation (RAG) [4], which combines information retrieval with a generative model. RAG serves
two critical functions in our methodology. Firstly, it provides detailed definitions, including essential skills and synonyms for each ESCO occupation, selected through vector similarity as outlined in [5]. Secondly, it ensures that the assigned job titles are restricted to titles within our predefined label space, i.e., standardized job titles defined in the ESCO taxonomy.

The contributions of our work are:
∙ We explore the impact of using multilingual class embeddings derived from the ESCO taxonomy for the task of job posting classification;
∙ We integrate RAG to provide LLMs with domain-specific information and eliminate the dependency on fine-tuning;
∙ We show how the LLM response can be restricted to standardized job titles and thus how LLMs can be used for high-quality job title classification that outperforms state-of-the-art proposals for this task.

The remainder of the paper is structured as follows. In Section 2, we present a concise overview of the related work. In Section 3, the model on which our work is based is outlined. Section 4 describes the experiments we carried out, the results we obtained in these experiments, and their discussion. Section 5, finally, draws some conclusions from the presented work and outlines some directions for future research. In Appendix A, we present an ablation study in which we assess the comprehension of English ESCO job titles and their Spanish equivalents by our model. Appendix B provides, for illustration, examples of Italian job postings and predicted ESCO job titles. In Appendix E, we present the signature used to prompt Large Language Models for pre-processing.

2. Related Work

A number of works have been carried out in the domain of job title classification, focusing on various facets of the problem. Shi et al. [6] introduce Job2Skills, a model developed for LinkedIn. The model significantly improves job recommendation performance metrics; however, it raises questions about its effectiveness beyond LinkedIn. Li et al. [7] propose a two-step job title normalization, also at LinkedIn, which is based on tokenization and matching of the original job title provided by the user with a lookup table. The use of a lookup table instead of a standard occupation taxonomy such as ESCO or O*Net significantly limits the generalization potential of this strategy. Zhang et al. [8] extract soft and hard skills from job posting descriptions, showing that domain-specific pre-training significantly enhances performance in skills and knowledge extraction. Javed et al. [3] introduce a semi-supervised machine learning approach that utilizes hierarchical classifiers and the O*NET Standard Occupational Classification (SOC) taxonomy for the classification of online recruitment data. Similarly, Wang et al. [9] propose a model based on multi-stream convolutional neural networks, aiming to classify noisy user-generated job titles by considering different elements such as characters and words within job titles. Yamashita et al. [10] and Zbib et al. [1] conduct studies on the classification of job titles, focusing on job title alignment and job similarity training, respectively. JobBERT (Decorte et al. [2]) classifies job titles against the ESCO taxonomy, treating the task as a semantic text similarity (STS) exercise. In particular, JobBERT emphasizes the understanding of the semantics of job titles through the skills inferred from the associated vacancies and descriptions, thus alleviating the need for an extensive labeled dataset or a continuously updated list of standardized titles. Before the recent proposals [11] and [12], JobBERT used to be referenced as the state-of-the-art baseline. In general, all of these works draw upon some of the information encoded in the ESCO taxonomy. However, none of them uses detailed descriptions of ESCO occupations, as we propose.

3. The Model

3.1. The Basics

The proposed model is based on the notion of distinctiveness, which specifies the difference between the prompt concept $\theta^*$ and other concepts within the conceptual space $\Theta$ [13]. The notion is crucial for distinguishing in-context learning concepts that are to be learned by analogy. $\theta^*$ acts as a latent parameter in a Hidden Markov Model that defines a distribution over observed tokens, represented by selected ESCO job titles as labels. As proposed by Xie et al. [13], the error of the in-context predictor approaches optimality under the condition that $\theta^*$ is distinguishable from other concepts in $\Theta \setminus \{\theta^*\}$. When RAG is adapted as a few-shot reasoning (or in-context learning) framework for job posting classification [14], $\theta^*$ is represented by the top-selected ESCO labels and ensures that the LLM can effectively differentiate between closely related job categories.

The explanation-enriched prompts enhance the LLM’s ability to learn more from each example. According to Xie et al. [13], the expected error decreases as the length and informational content of each example increase, contributing to the richness of the input-output mapping for a more robust in-context learning environment. This assumption holds under the condition of distinguishability of in-context examples and can be mathematically expressed as a reduction of the expected error $E[\epsilon]$, correlated with an increase in the information content $I$ of the examples:

$$E[\epsilon] \propto \frac{1}{I(S_n, x_{\text{test}})} \qquad (1)$$

where $S_n$ represents the sequence of training examples in the prompt and $x_{\text{test}}$ is the test input.

Figure 1: Model Architecture

The use of RAG helps avoid hallucination, since LLMs that are directly prompted with job postings have been observed to sometimes produce non-existent labels [5].
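The restriction to the label space can also be enforced after generation. The following is a minimal sketch, not the authors' implementation (which, per Section 3.2, relies on a hint in the DSPy prompt): it assumes the LLM output has already been parsed into a list of strings, and the function name `restrict_to_label_space` and the example titles are hypothetical.

```python
def restrict_to_label_space(predicted_titles, candidate_titles):
    """Keep only predictions that match one of the retrieved candidate titles.

    Matching is case-insensitive and ignores surrounding whitespace; this
    normalisation rule is an assumption made for the sketch.
    """
    # Map a normalised form of each candidate to its canonical spelling.
    canonical = {title.strip().lower(): title for title in candidate_titles}
    kept = []
    for title in predicted_titles:
        match = canonical.get(title.strip().lower())
        # Drop labels outside the label space and avoid duplicates.
        if match is not None and match not in kept:
            kept.append(match)
    return kept


# Hypothetical candidates retrieved for one posting, and a raw LLM answer.
candidates = ["data analyst", "market research analyst", "data scientist"]
raw_output = ["Data Analyst", "healthcare data wizard", "data scientist "]
print(restrict_to_label_space(raw_output, candidates))
# ['data analyst', 'data scientist']
```

A filter of this kind only removes non-existent labels; it does not replace them, so a prompt-side restriction remains useful for keeping the number of returned titles stable.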
Figure 2: Prompt Template

3.2. Design of the Model

The proposed model (see Figure 1) uses multilingual class embeddings of the E5-large model [15] to retrieve pertinent ESCO occupation definitions in English. The definitions serve as contextual information to prompt language models for selection of the most suitable job titles. To this end, we incorporate the DSPy library’s Chain-of-Thought mechanism,³ augmented by a hint to restrict the model output to a specified list of job titles. The signature used in this methodology (cf. Figure 2) is inspired by [16].

³ https://github.com/stanfordnlp/dspy

To implement the RAG model, we initially established a vector database,⁴ in which English ESCO occupation definitions were inserted as multilingual embedding vectors. Acknowledging the reported significance of chunking in many NLP applications, we conducted a series of ablation studies to determine the optimal chunk size. These studies revealed that subdividing the ESCO occupation definitions into smaller segments adversely affects the performance of vector-based similarity matching. Therefore, we opted for storing each of the 3,015 occupations represented in the ESCO taxonomy in its entirety.

⁴ https://www.trychroma.com/

Table 1
Recall values for classification with E5-large Text Embeddings vector similarity

Precision @ K | @5     | @10    | @30    | @40
Value         | 0.4238 | 0.9004 | 0.9627 | 0.9817

To accurately classify a given job posting with respect to the ESCO taxonomy, we include 30 ESCO occupation documents (i.e., 30 nodes of the taxonomy) in the LLM’s context as potential job titles. The rationale for choosing 30 documents is that we aim to strike a balance between computational efficiency and the accuracy of the retrieved documents. The precision of the LLM would naturally decrease when it is presented with inaccurate labels. Although, as shown in Table 1, the precision of the model slightly increases with 40 documents in the context, we accepted this trade-off in favor of a lower VRAM requirement.

Upon the retrieval of the 30 ESCO occupations that are most closely aligned with a given job posting description, a composite prompt (see Figure 2) is constructed as input to the LLM. The prompt integrates the actual text data encompassing job titles, descriptions, and skills pertinent to the selected occupations. The design of the simplified composite prompt aims to minimize bias by focusing only on the core elements. The prompt is then processed by a locally stored Llama-3 LLM⁵ in an isolated environment.⁶

⁵ https://llama.meta.com/llama3/
⁶ We use dockerized models from the open-source Ollama library (https://ollama.com/) for all experiments.

As a few-shot predictor, the LLM evaluates the composite prompt to accurately classify job postings by examining the semantic nuances of the selected ESCO occupations, aligning them with the actual job titles within the offers. To quantitatively assess the alignment between a job posting vector $J$ and each occupation embedding $E_{\text{ESCO}}$ derived from the ESCO taxonomy, cosine similarity $a(J, E_{\text{ESCO}})$ is used:

$$a(J, E_{\text{ESCO}}) = \frac{J \cdot E_{\text{ESCO}}}{\|J\| \, \|E_{\text{ESCO}}\|} \qquad (2)$$

The similarity scores yielded through $a(J, E_{\text{ESCO}})$ for each $E_{\text{ESCO}}$ facilitate the identification and selection of
the ESCO occupation embeddings that are most pertinent to the job posting in question. Armed with this information, the LLM proceeds to classify the job posting by selecting the ESCO occupation that exhibits the highest degree of semantic and contextual relevance.

For a specific job posting $J$, an embedding function $E$ is employed, such that $E(J)$ produces the corresponding embedding for $J$. The degree of similarity between the job posting’s embedding $E(J)$ and any ESCO occupation embedding $e_i$ from $E_{\text{ESCO}}$ (where $E_{\text{ESCO}}$ stands for the ensemble of occupation embeddings derived from the ESCO taxonomy) is determined through the similarity function $S(E(J), e_i)$ (in our case cosine).

The similarity scores for each occupation embedding $e_i$ within $E_{\text{ESCO}}$ relative to $E(J)$ are computed. The ten class embeddings that exhibit the highest similarity to $E(J)$, denoted as $E_{\text{top}}$, are selected. Formally, $E_{\text{top}}$ is defined as the subset $\{e_1, e_2, \ldots, e_{10}\}$ from $E_{\text{ESCO}}$, where each $e_i$ is selected based on the top 10 similarity scores $S(E(J), e_i)$.

The last stage entails a decision-making process enacted by the Llama-3 LLM, represented by the function $D$. This function accepts the composite prompt, including the candidates $\{e_1, e_2, \ldots, e_{10}\}$ accumulated in $E_{\text{top}}$ and the job posting $J$, to render the final selected occupation embedding. The chosen occupation embedding $e^*$ is determined by $e^* = D(E_{\text{top}}, J)$, representing the ESCO occupation best matched by the model.

The entire algorithm can be presented by the following equation, which encapsulates the embedding generation, similarity assessment, and decision-making process by the LLM, culminating in the selection of the most suitable ESCO occupation embedding $e^*$ for the given job posting description.

$$e^* = D\big(\{e_1, e_2, \ldots, e_k \mid e_i \in E_{\text{ESCO}};\ \text{top } k \text{ by } S(E(J), e_i)\},\ J\big) \qquad (3)$$

4. Experiments

To evaluate the effectiveness of the proposed model in handling multilingual job postings, experiments were conducted separately on Italian and Spanish datasets.

4.1. Test dataset

To have a reliable test dataset, we use three high-performing LLMs as initial annotators of 100 Italian and 100 Spanish real-world job postings with the most extensive descriptions from the InfoJobs⁷ database. Non-informative elements such as company descriptions and promotional content were removed using a DSPy module (cf. Appendix E for the prompt), which employs zero-shot Llama-3 LM inference to anonymize sensitive information in job postings and candidate experiences. The preprocessed postings were annotated by the three top-performing LLMs according to LmSys Arena [17]: GPT-4o,⁸ Gemini 1.5 Pro,⁹ and Claude 3.5 Sonnet.¹⁰ In this context, the ESCO job titles are presented to each model separately, each model is asked to select the appropriate job titles, and the level of agreement between the models on these labels is then measured. The agreement between the LLMs was assessed using Cohen’s kappa coefficient [18]. The average kappa score between Gemini and GPT-4o was found to be 0.6386, indicating a substantial level of agreement. The agreement between Gemini and Claude was lower, with an average kappa of 0.5798, suggesting a moderate level of agreement. Similarly, the kappa score between GPT-4o and Claude was 0.6497, also indicative of substantial agreement. Overall, the average kappa score across all “annotators” was 0.6227, reflecting a general trend towards substantial inter-annotator agreement among the models.

⁷ https://www.infojobs.net/
⁸ https://openai.com/index/hello-gpt-4o/
⁹ https://deepmind.google/technologies/gemini/pro/
¹⁰ https://www.anthropic.com/news/claude-3-5-sonnet

To establish ground truth labels, we incorporated a dual-layer labelling process. Although the test set consists of only 200 items, labeling them from scratch would be time-consuming due to the complexity of the ESCO taxonomy, which includes 3,015 distinct occupations. Human annotators would require extensive training to accurately navigate this taxonomy. Therefore, we first annotate the occupations automatically using LLMs and then have the initial annotations cross-examined by a human expert annotator. Since each data point was reviewed by one annotator only, inter-annotator agreement among human annotators was not quantified. Instead, we conducted an analysis to identify job titles that consistently showed agreement or disagreement across the three LLMs, where domain-specific professionals from InfoJobs reviewed label discrepancies. This analysis, detailed in Appendix C, suggests that certain occupations are inherently more challenging to classify, possibly due to overlapping skills or ambiguous descriptions.

Furthermore, we repeated the experiments using ground truth labels where any two of the three automatic LLMs agreed on the label. The results showed alignment between the models’ predictions and the automatic labeling process, indicating consistency with the patterns recognized by the automatic methods when there is partial agreement. A detailed analysis of this alignment can be found in Appendix D.
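The two agreement computations described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used in the paper: the text does not specify how kappa was computed for multi-label annotations, so the sketch treats each item as carrying one categorical label per annotator, and the function names and example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def majority_labels(annotations, min_votes=2):
    """Labels chosen by at least `min_votes` annotators for one posting."""
    votes = Counter(label for labels in annotations for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}


# Hypothetical single-label annotations of four postings by two models.
model_1 = ["data analyst", "data analyst", "web developer", "web developer"]
model_2 = ["data analyst", "web developer", "web developer", "web developer"]
print(cohen_kappa(model_1, model_2))  # 0.5

# Hypothetical multi-label annotations of one posting by three models.
per_model = [{"data analyst", "data scientist"},
             {"data analyst"},
             {"data scientist", "statistician"}]
print(sorted(majority_labels(per_model)))  # ['data analyst', 'data scientist']
```

With `min_votes=2`, `majority_labels` mirrors the "any two of the three" criterion used for the repeated experiments; raising it to 3 would keep only unanimously chosen titles.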
4.2. Baselines

4.2.1. SkillGPT

SkillGPT [5] has been introduced as a tool for skill extraction and classification, with vector similarity search against LLM-precomputed ESCO embeddings. The authors employ embeddings generated by an LLM, although they do not directly use the LLM to select among candidate embeddings. Instead, they rely on embedding similarity to assign the most closely related ESCO class to the job descriptions under consideration.

4.2.2. Zero-Shot Classification

By transforming the classification task into a Natural Language Inference (NLI) problem, any model pretrained on NLI tasks can be utilized as a text classifier without the need for fine-tuning, effectively achieving zero-shot text classification. This is particularly beneficial when we deal with classes unseen during training, making it a robust solution for a variety of text classification scenarios [19].

In the implementation that we use as a baseline, we utilize the BART-MNLI model [20], which showed high performance in summarization tasks and, when pretrained for NLI on the MNLI dataset [21], is leveraged for its capability to understand entailment relations in order to classify the given sequence into one of the specified categories. We also apply the same methodology with the Llama-3 model.

4.3. Model Optimization

To optimize LLMs with a minimal set of manually crafted examples, we use the DSPy library [22]. We initialize the classifier module with a Llama-3 model and use a GPT-4o model as the teacher. Our optimization of the classification is aimed at achieving high F1 scores for each dataset individually. In each run, we use 10 labeled training examples and 30 labeled validation examples. We employ DSPy’s BootstrapFewShot, configuring it to perform a maximum of 2 rounds with up to 8 bootstrapped demonstrations. We define a custom metric, the F1 score, to guide the bootstrapping process. For the optimization of the LLMs, we use data points that had high agreement among the automatic methods and were reviewed by human annotators. We perform a validation/test split to ensure that the optimization does not bias the evaluation results.

4.4. Outcome of the experiments

For the evaluation of the results of the experiments, we used the micro recall and micro precision metrics, which are suitable for our multi-class classification task. We report evaluation scores separately on the Spanish and Italian test sets.

Table 2
Italian Performance Metrics for Top 5 and Top 10 Predictions

Model                     | Precision@5 | Precision@10 | Recall@5 | Recall@10
llama-3-8b (CoT opt.)     | 0.32        | 0.13         | 0.76     | 0.80
llama-3-8b (CoT)          | 0.26        | 0.12         | 0.62     | 0.64
llama-3-8b (SkillGPT)     | 0.19        | 0.19         | 0.36     | 0.82
mBart-large-mnli (0-shot) | 0.13        | 0.12         | 0.29     | 0.58
multilingual-e5-large     | 0.16        | 0.19         | 0.36     | 0.88

Table 3
Spanish Performance Metrics for Top 5 and Top 10 Predictions

Model                     | Precision@5 | Precision@10 | Recall@5 | Recall@10
llama-3-8b (CoT opt.)     | 0.28        | 0.20         | 0.72     | 0.90
llama-3-8b (CoT)          | 0.26        | 0.16         | 0.64     | 0.68
llama-3-8b (SkillGPT)     | 0.09        | 0.12         | 0.36     | 0.62
mBart-large-mnli (0-shot) | 0.15        | 0.14         | 0.39     | 0.70
multilingual-e5-large     | 0.20        | 0.19         | 0.48     | 0.92

Tables 2 and 3 display the results on the Italian and Spanish datasets, respectively. The results indicate that prompting techniques outperform SkillGPT in both languages. Specifically, the optimized Llama-3-8b model with chain-of-thought (CoT) achieves the highest precision and recall at @5 for Italian, with values of 0.32 and 0.76, respectively, and for Spanish, with values of 0.28 and 0.72. This supports our assumption that optimization enhances performance. The multilingual E5-large model achieves the highest precision at @10 for Italian (0.19) and the highest recall at @10 for Spanish (0.92), underscoring the efficacy of embeddings in classification. This implies that semantically less similar labels can confuse models, whereas embeddings ensure higher recall, particularly in wider retrieval scenarios. Although both models exhibit similar precision, indicating comparable accuracy in their predictions, the optimized model’s capacity to capture a broader range of relevant job titles ensures greater alignment with expert human preferences. This enhances the model’s ability to make relevant job title suggestions, thereby improving the overall matching process.

4.5. Discussion

In Tables 2 and 3 we observe that the combined use of general text embeddings and language models significantly outperforms current classification techniques, which rely
on language models specifically tailored to the field of the labour market, such as [12]. We see that using vector similarity with the text embeddings created by the E5-large text embeddings model alone does not surpass the baseline. However, it is worth noting that the results are quite close, despite the fact that this model was not specifically fine-tuned on labour market data or adapted to the ESCO taxonomy, as is the case of [12]. Furthermore, we can observe how text embeddings indeed provide significant value for filtering the k occupations closest to a job posting within the taxonomy. Using these k occupations as input to various language models for few-shot classification significantly improves over the baselines. Table 6 in the Appendix illustrates the decisions of the LLMs in the case of four sample job postings.

We also evaluated the effectiveness of a large language model for classification of job titles based on provided descriptions, as shown in Table 4, even when the correct titles were not explicitly listed among the initial ESCO job titles. The model’s ability to select accurate titles reflects its functionality in processing and understanding the contextual and semantic aspects of the job descriptions. For instance, when presented with a job description focused on the management of comprehensive water and wastewater services, the model correctly identified “Operations Manager” as the correct title. This identification was made despite the presence of several closely related but distinct labels (such as “Water treatment plant manager”) within the pool of ESCO job titles. This indicates that the model’s decisions are more influenced by a comprehensive understanding of the job responsibilities and sectors than by the mere presence of keywords or phrases in the ESCO job titles.

The model’s capacity to differentiate between job titles with more specific definitions enhances its comprehension of job postings and assigned labels, thereby improving the precision of suggesting relevant skills. Upon integration into an operational job platform, this model will better understand the requirements of job postings and accurately assign job titles that align with the specific needs of companies. Similarly, in the context of parsing of job candidate experiences, keywords tend to appear more frequently in semantically related ESCO definitions, enabling parsers to incorporate these keywords to enhance parsing performance.

Overall, we can thus state that the integration of class embeddings generated using the multilingual E5-large model, with subsequent application of few-shot classification techniques through LLMs, significantly improves the accuracy of job title classification, clearly surpassing that of the baselines.

4.6. Computational Cost of Compared Methods

In addition to evaluating performance metrics, we analyzed the computational cost and environmental impact of each method. The Llama-3-8b model, with 8 billion parameters, requires significant resources for inference, necessitating a GPU with at least 16 GB of VRAM (e.g., NVIDIA RTX 3090). Its average inference time per job posting is approximately 1.5 seconds, and its high energy consumption leads to increased CO2 emissions, making large-scale deployment less environmentally sustainable without optimizations.

In contrast, the mBART-large-mnli model has about 610 million parameters and operates on GPUs with 8 GB of VRAM, offering faster inference times of under 0.5 seconds per job posting. The embeddings-based method using the multilingual E5-large model, with 330 million parameters, allows for precomputed embeddings and efficient CPU-based vector similarity searches, reducing inference time to less than 0.2 seconds per job posting. These smaller models consume less energy, providing more resource-efficient and eco-friendly alternatives suitable for production environments where computational cost and environmental impact are critical considerations.

5. Conclusions and future work

In this paper, we argued that the use of multilingual embeddings in combination with LLMs significantly enhances our ability to distinguish between very similar (or even identical) job titles that suggest different skills and competencies. Our experiments have shown that this is indeed the case, demonstrating that the combination of multilingual text embedding similarity with the Llama-3 LLM markedly exceeds the performance of other leading approaches in the field.

In the future, we plan to apply the same approach to the analysis and classification of job candidate experiences. Once it is ensured that both job postings and candidate experiences can accurately be modeled using the embedded representation of the ESCO taxonomy, we plan to set the stage for a more direct and efficient alignment process between job postings and experiences of job seekers.

Another interesting direction for future research is to analyze the lexical overlap between English domain-specific terms that appear in Italian and Spanish job postings and the English occupation descriptions in the ESCO taxonomy. Such an analysis would reveal whether job types with higher lexical overlap affect model accuracy, providing deeper insights into the multilingual nature of the task.
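One simple way to operationalize such a lexical overlap analysis is sketched below. It is an assumption of this sketch, not a method proposed in the paper, that overlap is measured as the Jaccard similarity of lowercased word sets; the function names and the example texts are hypothetical.

```python
import re


def word_set(text):
    """Lowercased set of word tokens; \\w also matches accented letters."""
    return set(re.findall(r"\w+", text.lower()))


def lexical_overlap(posting, description):
    """Jaccard similarity between the word sets of two texts (0.0 to 1.0)."""
    posting_words = word_set(posting)
    description_words = word_set(description)
    if not posting_words or not description_words:
        return 0.0
    shared = posting_words & description_words
    combined = posting_words | description_words
    return len(shared) / len(combined)


# Hypothetical Spanish posting containing English domain terms,
# compared with a hypothetical English occupation description.
posting = "Buscamos data analyst con SQL"
description = "data analyst SQL reporting"
print(lexical_overlap(posting, description))  # 0.5
```

Grouping postings by such a score and comparing classification accuracy across the groups would be one possible starting point; a measure restricted to domain-specific terms would require a term list that this sketch does not assume.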
References

[1] R. Zbib, L. L. Alvarez, F. Retyk, R. Poves, J. Aizpuru, H. Fabregat, V. Šimkus, E. G. Casademont, Learning job titles similarity from noisy skill labels, ArXiv abs/2207.00494 (2022). URL: https://api.semanticscholar.org/CorpusID:250243975.
[2] J.-J. Decorte, J. V. Hautte, T. Demeester, C. Develder, Jobbert: Understanding job titles through skills, ArXiv abs/2109.09605 (2021). URL: https://api.semanticscholar.org/CorpusID:237572142.
[3] F. Javed, M. McNair, F. Jacob, M. Zhao, Towards a job title classification system, 2016. arXiv:1606.00917.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[5] N. Li, B. Kang, T. D. Bie, Skillgpt: a restful api service for skill extraction and standardization using a large language model, 2023. arXiv:2304.11060.
[6] B. Shi, J. Yang, F. Guo, Q. He, Salience and market-aware skill extraction for job targeting, 2020. arXiv:2005.13094.
[7] S. Li, B. Shi, J. Yang, J. Yan, S. Wang, F. Chen, Q. He, Deep job understanding at linkedin, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2020. URL: http://dx.doi.org/10.1145/3397271.3401403. doi:10.1145/3397271.3401403.
[8] M. Zhang, K. N. Jensen, S. D. Sonniks, B. Plank, Skillspan: Hard and soft skill extraction from english job postings, ArXiv abs/2204.12811 (2022). URL: https://api.semanticscholar.org/CorpusID:248405777.
[9] J. Wang, K. Abdelfatah, M. Korayem, J. Balaji, Deepcarotene - job title classification with multi-stream convolutional neural network, 2019, pp. 1953-1961. doi:10.1109/BigData47090.2019.9005673.
[10] M. Yamashita, J. T. Shen, H. Ekhtiari, T. Tran, D. Lee, James: Job title mapping with multi-aspect embeddings and reasoning, 2022. arXiv:2202.10739.
[11] M. Zhang, R. van der Goot, B. Plank, Escoxlm-r: Multilingual taxonomy-driven pre-training for the job market domain, in: Annual Meeting of the Association for Computational Linguistics, 2023. URL: https://api.semanticscholar.org/
applicant cv classification using rich information from a labour market taxonomy, SSRN Electronic Journal (2023). doi:10.2139/ssrn.4519766.
[13] S. M. Xie, A. Raghunathan, P. Liang, T. Ma, An explanation of in-context learning as implicit bayesian inference, 2022. arXiv:2111.02080.
[14] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[15] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards General Text Embeddings with Multi-stage Contrastive Learning, arXiv e-prints (2023). doi:10.48550/arXiv.2308.03281. arXiv:2308.03281.
[16] K. D’Oosterlinck, O. Khattab, F. Remy, T. Demeester, C. Develder, C. Potts, In-context learning for extreme multi-label classification, ArXiv abs/2401.12178 (2024). URL: https://api.semanticscholar.org/CorpusID:267068618.
[17] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, I. Stoica, Chatbot arena: An open platform for evaluating llms by human preference, ArXiv abs/2403.04132 (2024). URL: https://api.semanticscholar.org/CorpusID:268264163.
[18] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37-46. URL: https://api.semanticscholar.org/CorpusID:15926286.
[19] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning based text classification: A comprehensive review, CoRR abs/2004.03705 (2020). URL: https://arxiv.org/abs/2004.03705. arXiv:2004.03705.
[20] L. Shu, J. Chen, B. Liu, H. Xu, Zero-shot aspect-based sentiment analysis, ArXiv abs/2202.01924 (2022).
[21] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, 2018, pp. 1112-1122. URL: http://aclweb.org/anthology/N18-1101.
[22] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, C. Potts, Dspy: Compiling declarative language model calls into self-improving pipelines, ArXiv abs/2310.03714 (2023). URL: https://api.
CorpusID:258832782. semanticscholar.org/CorpusID:263671701.
[12] H. Kavas, M. Serra-vidal, L. Wanner, Job offer and
A. Ablation Study

In our ablation study, we pursued two primary objectives. First, we wanted to evaluate the model’s comprehension of ESCO job titles and its decision-making process. To achieve this, we prompted the model to articulate its underlying rationale. Second, so far we have reported the performance of our model when Italian and Spanish data were matched against English job titles and occupations in the ESCO taxonomy. Here we wanted to explore whether its comprehension extends to data in different languages. We selected Spanish for this purpose and found that the model’s understanding was consistent, irrespective of the language; see Table 4.

As illustrated in Figure 3, the LLM shows a comprehensive understanding of the task at hand, effectively narrowing down potential ESCO job titles to identify the most suitable label. Additionally, the LLM is observed to generate a novel job title, “fast food shift team leader”. This can be attributed to the absence of constraints imposed on the LLM regarding structured output for classification, which grants it the autonomy to propose the most fitting job title. The analysis initially excludes broader or less related job titles such as “business manager”, “hospitality revenue manager”, and “accommodation manager”, which are not specific to quick-service restaurant operations. Subsequently, the model considers and ultimately selects titles that emphasize leadership within this specific restaurant context, narrowing down to “quick service restaurant team leader” and “fast food shift team leader” as the most apt job titles. The model’s reasoning in choosing these titles is sound, as they precisely reflect the managerial and leadership responsibilities pertinent to the restaurant environment.

Figure 3: LLM’s Rationale

Gold Label Job Title: Quick Service Restaurant Team Leader
Posting Job Title: Encargado de Franquicias
Posting Description:
- Responsable de garantizar la satisfacción de los huéspedes y de gestionar y superar los objetivos financieros y operativos de los restaurantes a mi cargo.
- Garantizar una excelente atención a los huéspedes en base a las promesas y estándares definidos.
- Liderar, motivar y desarrollar equipos.
- Facilitar los recursos y el apoyo necesario a los equipos en sus restaurantes.
- Utilizar de manera eficaz los diferentes recursos de la Compañía.
- Identificar oportunidades y amenazas de negocio en el mercado.
- Aportar ideas y ejecutando proyectos en el corto y medio plazo.
- Difundir las mejores practicas y resolver problemas comunes en los restaurantes.
- Cumplir los protocolos y políticas de la Marca y la Compañía.
- Garantizar y difundir los valores y principios definidos por la Compañía.
Skills: SAP Girnet Gtock, Cuiner
ESCO Job Titles: Restaurant Manager, Business Manager, Hospitality Revenue Manager, Accommodation Manager, Delicatessen Shop Manager, Rooms Division Manager, Customer Experience Manager, Quick Service Restaurant Team Leader, Destination Manager, Membership Manager
Table 4: Spanish job posting example

B. Job postings and Predicted ESCO job titles

The following tables provide examples of job titles and job posting descriptions, together with the corresponding gold labels in Table 5 and the optimized Llama-3 job titles in Table 6. These examples illustrate how the job titles assigned by recruiters may not always capture the specific nature of the job described in the postings. The gold labels and the optimized Llama-3 job titles offer a more accurate representation of the job roles based on the detailed job descriptions.

Posting Job Title | Job Posting Description | Gold Labels
Commessa | Commessa; Commessa; - Presentazione e vendita di attrezzature per telecomunicazioni ai clienti; - Servizio e supporto clienti; - Gestione delle transazioni di vendita; - Gestione dello stock e dell’inventario. | Telecommunications equipment specialised seller
Project Engineer | Project Engineer; Project Engineer; PROJECT MANAGER / PROJECT ENGINEER Divisione: Amministrazione Tecnica - Coordinamento delle attività di gestione progetti in ambito tecnico; - Supporto al Product Development; - Pianificazione e monitoraggio delle attività progettuali; - Supervisione del team tecnico; - Assistenza alla gestione dei fornitori e del budget di progetto. | Project manager, Product development manager
Table 5: Examples of Job Titles, Descriptions, and Gold Labels

The job title “Commessa” (Salesperson) is generic and does not specify the specialization required for the job. The gold label “telecommunications equipment specialised seller” fits better because the job description clearly focuses on selling telecommunications equipment, which requires specific knowledge and skills related to this type of product. The gold label accurately reflects the specialized nature of the role. The job title “Project engineer” given by the recruiter suggests a technical and
engineering-focused role. However, the job description emphasizes project management, coordination of project activities, support to product development, and supervision of the technical team. The gold label “project manager” fits better as it captures the overall management and coordination responsibilities described, which are more aligned with the duties of a project manager than with those of a project engineer.

Posting Job Title | Job Posting Description | Optimized Llama-3 Job Titles
Addetto alle vendite | Addetto alle vendite; Addetto alle vendite; Salumiere: servizio clientela, tagli di formaggi e salumi, preparazione confezioni, gestione banco gastronomia. | Meat and meat products specialised seller, Deli worker, Food and beverage server
IT Specialist | IT Specialist; IT Specialist; Responsabile della gestione dei progetti ICT; Coordinamento del team di sviluppo software; Pianificazione e monitoraggio delle attività di sviluppo; Gestione delle risorse ICT e del budget; Assistenza tecnica avanzata e risoluzione dei problemi. | ICT project manager, Software development manager
Sales Manager | Sales Manager; Sales Manager; Sviluppo del business aziendale; Definizione delle strategie di vendita; Gestione del team di vendita; Monitoraggio delle performance e raggiungimento degli obiettivi di vendita; Gestione delle relazioni con i clienti chiave e i partner strategici. | Business development manager, Sales director
Table 6: Examples of Job Titles, Descriptions, and Optimized Job Titles

The job title “Addetto alle vendite” (Sales Assistant) is too generic and does not capture the specialized nature of the role described in the vacancy. The description specifies duties typical of a deli worker, such as serving customers, slicing cheeses and cured meats, preparing packages, and managing the deli counter. Our model’s titles “meat and meat products specialised seller” and “deli worker” are more precise, indicating a specialized role in food handling and customer service, which goes beyond the general sales assistant title. This demonstrates our model’s ability to interpret the specific context and responsibilities of the job accurately.

The job title “IT Specialist” is generic and could encompass various IT roles. However, the job description clearly indicates responsibilities such as managing ICT projects, coordinating a software development team, planning and monitoring development activities, managing ICT resources and budget, and providing advanced technical support. The optimized titles “ICT project manager” and “software development manager” are more accurate as they reflect the leadership, coordination, and project management aspects of the role, which go beyond the scope of a general IT specialist.

The job title “Sales Manager” suggests a mid-level management role. However, the job description highlights responsibilities such as business development, defining sales strategies, managing the sales team, monitoring performance, and managing relationships with key clients and strategic partners. These responsibilities are more aligned with a higher-level role such as “business development manager” or “sales director”, which involve strategic planning and high-level management.

C. Ambiguity from Specialized and Contextual Factors

To further understand the complexity of job classification in a multilingual context, we conducted an ablation study focusing on cases where both human annotators and LLMs demonstrated shared uncertainty in assigning definitive labels. These cases were particularly challenging due to specialized terminology, regional language variations, or overlapping responsibilities within job postings. Table 7 highlights key examples where annotators, despite their recruitment expertise, aligned with the LLMs in experiencing ambiguity.

Table 7: Examples of Job Postings with Ambiguous Classification due to Multilingual and Contextual Challenges
Job Title | Description Excerpt | Labels Suggested
Junior Project Manager | Applicare i metodi e gli strumenti propri del Project Management a commesse specifiche per il settore dell’automazione industriale, di cui l’azienda fornisce sistemi di visione artificiale. | Project Manager, ICT Project Manager, Programme Manager
Assistente Amministrativo (Healthcare) | Gestione dei flussi delle segnalazioni dei cittadini per prenotazioni vaccinazioni e assistenza pandemica, inclusa la verifica del "certificato verde" per la conformità alle normative sanitarie. | Healthcare Assistant, Administrative Assistant, Contact Tracing Agent
Commesso di Negozio (Retail) | Creazione di vetrine accattivanti con abbinamenti di tendenza e assistenza alla clientela nella scelta dei prodotti. | Shop Assistant, Sales Assistant, Visual Merchandiser
Team Leader (Energy Sector) | Predisposizione documenti formativi e aggiornamento processi operativi presso sede Enel, inclusa l’implementazione e il collaudo di software per la gestione energetica. | Team Leader, Energy Analyst, Business Process Analyst
Assistente Amministrativo (Legal and Fiscal) | Compiti legati al Registro Nazionale delle Varietà Vegetali e mansioni fiscali complesse come Dichiarazioni IRAP. | Accounting Assistant, Administrative Assistant, Compliance Officer

As presented in Table 7, each example illustrates specific challenges encountered in classifying job postings across multilingual and sector-specific contexts. The Junior Project Manager job posting, for instance, combines general project management with specialized tasks such as machine vision, but without enough specific context, it is unclear whether the focus should be on technical expertise or managerial skills. The Project Engineer example shows the impact of technical terminology and sector-specific language on classification. Terms such as “SCADA” and “Modbus TCP” are common in international engineering contexts but may not align with the typical understanding of recruiters, leading to the selection of varied labels by both LLMs and annotators. The example of the Assistente Amministrativo with a legal and fiscal focus involves highly specialized processes such as “Registro Nazionale delle Varietà Vegetali” and complex fiscal duties like “Dichiarazioni IRAP.” These terms relate to specific Italian government and regulatory compliance, which could exceed the annotators’ typical recruitment experience, thus resulting in generalized labels that do not fully capture the compliance and accounting complexity.

These cases emphasize that job postings, as human-created documents, often do not provide enough context for a definitive classification, resulting in ambiguity across specialized and regional terms.

D. Analysis of Model Alignment with Partial Agreement Ground Truth Labels

In our evaluation, we established two levels of ground truth labels: gold and silver. Gold labels represent unanimous agreement among all three annotators (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), validated by human experts. Silver labels indicate a strong majority consensus, assigned when any two annotators agree, even if the third disagrees.

Table 8: Performance Metrics for Top 5 and Top 10 Predictions
Model | Precision@5 | Precision@10 | Recall@5 | Recall@10
Spanish (SPA)
llama-3-8b (CoT opt.) | 0.12 | 0.06 | 0.58 | 0.62
llama-3-8b (CoT) | 0.22 | 0.16 | 0.64 | 0.68
llama-3-8b (SkillGPT) | 0.19 | 0.12 | 0.36 | 0.62
mBart-large-mnli (0-shot) | 0.15 | 0.14 | 0.39 | 0.70
multilingual-e5-large | 0.20 | 0.19 | 0.48 | 0.92
Italian (ITA)
llama-3-8b (CoT opt.) | 0.12 | 0.06 | 0.56 | 0.60
llama-3-8b (CoT) | 0.23 | 0.07 | 0.55 | 0.59
llama-3-8b (SkillGPT) | 0.22 | 0.06 | 0.53 | 0.59
mBart-large-mnli (0-shot) | 0.27 | 0.06 | 0.31 | 0.58
multilingual-e5-large | 0.35 | 0.08 | 0.39 | 0.79

We assessed our model’s performance on both silver and gold labels to understand its effectiveness under different levels of agreement. We reported results for gold labels in Tables 2 and 3; results for silver labels are presented in Table 8. For the Spanish dataset, the model’s performance was relatively consistent between silver and gold labels, with only minor variations in precision and recall. This consistency suggests that the model robustly captures underlying patterns in the job postings, regardless of labeling strictness.

In contrast, the Italian dataset exhibited more significant differences between performances on silver and gold labels. For example, in some cases, the precision was higher for silver labels while recall was higher for gold labels. This disparity may indicate that the model better captures broader classifications aligning with majority consensus in Italian but struggles with the stricter criteria required for unanimous agreement.

An interesting observation is that optimization using gold label ground truth data had a negative effect on the models’ scores derived from silver labels. This could be explained by the fact that during optimization, the language models became more attuned to the patterns present in the gold labels, potentially diverging from those in the silver labels. As a result, the models may have become less effective at predicting labels where only partial agreement (silver labels) was present among the automatic methods.

E. DSPy Signature

We utilize DSPy signatures to prompt large language models (LLMs) for performing downstream tasks. To optimize the script, recursive LLM calls were employed, resulting in its final form based on empirical observations.

Figure 4: Pre-processing Signature
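
To make the evaluation setup of Appendix D more concrete, the following minimal sketch derives gold and silver labels from three annotators and computes precision and recall at k against them. It rests on several assumptions that are not stated in the text: each annotator returns a set of ESCO titles per posting, silver labels are all titles with at least two votes (so they include the gold ones), and the metrics follow the usual set-based definitions. The function names and the example annotations are hypothetical, and the human validation step is not modelled.

```python
from collections import Counter


def consensus_labels(annotations, min_votes):
    """Return the labels proposed by at least `min_votes` annotators."""
    votes = Counter(
        label for labels in annotations.values() for label in set(labels)
    )
    return {label for label, count in votes.items() if count >= min_votes}


def precision_recall_at_k(ranked_predictions, reference, k):
    """Precision@k and recall@k of a ranked list against a reference label set.

    Assumes the ranked list contains no duplicate labels.
    """
    top_k = ranked_predictions[:k]
    hits = sum(1 for label in top_k if label in reference)
    precision = hits / k
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical annotations for one posting (not taken from the paper's data).
annotations = {
    "gpt-4o": {"ICT project manager", "software development manager"},
    "gemini-1.5-pro": {"ICT project manager", "software development manager"},
    "claude-3.5-sonnet": {"ICT project manager", "ICT consultant"},
}
gold = consensus_labels(annotations, min_votes=3)    # unanimous agreement
silver = consensus_labels(annotations, min_votes=2)  # at least two of three agree

# Hypothetical ranked model output for the same posting.
ranked = [
    "ICT project manager",
    "project manager",
    "software development manager",
    "ICT consultant",
    "software developer",
]
p5_gold, r5_gold = precision_recall_at_k(ranked, gold, k=5)
p5_silver, r5_silver = precision_recall_at_k(ranked, silver, k=5)
```

Under these definitions, precision at k cannot exceed the number of reference labels divided by k, so a posting with one or two reference titles yields low precision even when every reference title is retrieved. This may be one reason why precision values are much lower than recall values in Table 8, although the exact metric definitions used for that table are not given here.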