Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models
Hamit Kavas1,2,*, Marc Serra-Vidal2 and Leo Wanner1,3
1 NLP Group, Pompeu Fabra University, C/ Roc Boronat, 138, 08018, Spain
2 Adevinta Spain, C/ de la Ciutat de Granada, 150, Barcelona, 08018, Spain
3 Catalan Institute for Research and Advanced Studies (ICREA), Passeig Lluís Companys, 23, Barcelona, 08010, Spain


                                                Abstract
                                                In the modern labour market, taxonomies such as the European Skills, Competences, Qualifications and Occupations (ESCO)
                                                classification are used as an interlingua to match job postings with job seeker profiles. Both are classified with respect to
                                                ESCO occupations, and match if they align with the same occupation and the same skills assigned to the occupation. However,
                                                matching models usually struggle with the classification because of overlapping skills and similar definitions of occupations
                                                defined in the ESCO taxonomy. This often leads to imprecise classification outcomes. In this paper, we focus on the challenge
                                                of the classification of job postings written in Italian or Spanish against ESCO occupations written in English. We experiment
                                                with multilingual embeddings, zero-shot classification, and use of a large language model (LLM) and show that the use of
                                                an LLM leads to the best results. Furthermore, we also explore an alternative automatic labeling method by prompting three
                                                top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic
                                                labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.

                                                Keywords
                                                ESCO labour market taxonomy, job posting classification, class embeddings, text embeddings, LLM


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
Email: hamit.kavas@upf.edu (H. Kavas); marc.serrav@adevinta.com (M. Serra-Vidal); leo.wanner@upf.edu (L. Wanner)
ORCID: 0009-0009-7027-7367 (H. Kavas); 0009-0000-0120-381X (M. Serra-Vidal); 0000-0002-9446-3748 (L. Wanner)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

The modern labour market is becoming more and more diverse. High-tech jobs demand novel skills and competences, which in turn keep undergoing adaptations and modifications. Under these circumstances, accurately classifying job postings and CVs of job seekers (henceforth candidate experiences) that contain detailed technological specifications with remarkably similar yet distinct skills and experiences has evolved into a complex challenge.

The overwhelming majority of job portals and employment agencies use either the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy1 or its US equivalent O*Net taxonomy2 to classify job postings and candidate experiences in terms of job-title-labeled ESCO/O*Net occupations. Most proposals for the automatic alignment of job postings with candidate experiences (or vice versa) also use ESCO or O*Net [1, 2, 3]. However, despite their wide use, both ESCO and O*Net taxonomies exhibit principal limitations for the task of automatic classification of job postings and candidate experiences because, due to their tree structure, they often fail to adequately distinguish between occupations that exhibit substantial skill overlaps. For instance, two job postings labeled as ‘data analyst’ may appear similar but require different skills if one focuses on market research while the other concentrates on healthcare trends analysis. This issue is particularly pronounced when classification relies on a single label, such as the job title of an ESCO occupation, where skill overlaps undermine precise classification. Hence, employing multiple job titles and framing the problem as a multi-label classification task is imperative.
1 https://esco.ec.europa.eu/en/classification
2 https://www.onetonline.org/

This paper addresses the challenge of multilingual multi-label classification using Large Language Models (LLMs) for the alignment of Italian and Spanish job postings with English job titles encountered in the ESCO taxonomy. Multilingual class embeddings are explored to improve classification accuracy, aiming to provide the necessary contextual awareness and addressing the core limitations of taxonomies such as ESCO.

Furthermore, we explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.

To provide LLMs with domain-specific information and to mitigate hallucinations in the course of the classification of the job postings, we employ Retrieval Augmented Generation (RAG) [4], which combines information retrieval with a generative model. RAG serves
two critical functions in our methodology. Firstly, it provides detailed definitions, including essential skills and synonyms for each ESCO occupation, selected through vector similarity as outlined in [5]. Secondly, it ensures that the assigned job titles are restricted to titles within our predefined label space, i.e., standardized job titles defined in the ESCO taxonomy.

The contributions of our work are:
∙ We explore the impact of using multilingual class embeddings derived from the ESCO taxonomy for the task of job posting classification;
∙ We integrate RAG to provide LLMs with domain-specific information and eliminate the dependency on fine-tuning;
∙ We show how the LLM response can be restricted to standardized job titles and thus how LLMs can be used for high quality job title classification that outperforms state-of-the-art proposals for this task.

The remainder of the paper is structured as follows. In Section 2, we present a concise overview of the related work. In Section 3, the model on which our work is based is outlined. Section 4 describes the experiments we carried out, the results we obtained in these experiments, and their discussion. Finally, Section 5 draws some conclusions from the presented work and outlines some directions for future research. In Appendix A, we present an ablation study in which we assess the comprehension of English ESCO job titles and their Spanish equivalents by our model. Appendix B provides, for illustration, examples of Italian job postings and predicted ESCO job titles. In Appendix E, we present the signature used to prompt Large Language Models for pre-processing.

2. Related Work

A number of works have been carried out in the domain of job title classification, focusing on various facets of the problem. Shi et al. [6] introduce Job2Skills, a model developed for LinkedIn. The model significantly improves job recommendation performance metrics but raises questions about its effectiveness beyond LinkedIn. Li et al. [7] propose a two-step job title normalization, also in LinkedIn, which is based on tokenization and matching of the original job title provided by the user with a lookup table. The use of a lookup table instead of a standard occupation taxonomy such as ESCO or O*Net significantly limits the generalization potential of this strategy. Zhang et al. [8] extract soft and hard skills from job posting descriptions, showing that domain-specific pre-training significantly enhances performance in skills and knowledge extraction. Javed et al. [3] introduce a semi-supervised machine learning approach that utilizes hierarchical classifiers and the O*NET Standard Occupational Classification (SOC) taxonomy for the classification of online recruitment data. Similarly, Wang et al. [9] propose a model based on multi-stream convolutional neural networks, aiming to classify noisy user-generated job titles by considering different elements such as characters and words within job titles. Yamashita et al. [10] and Zbib et al. [1] conduct studies on the classification of job titles, focusing on job title alignment and job similarity training, respectively. JobBERT (Decorte et al. [2]) classifies job titles against the ESCO taxonomy, treating the task as a semantic text similarity (STS) exercise. In particular, JobBERT emphasizes the understanding of the semantics of job titles through the skills inferred from the associated vacancies and descriptions, thus alleviating the need for an extensive labeled dataset or a continuously updated list of standardized titles. Before the recent proposals [11] and [12], JobBERT used to be referenced as the state-of-the-art baseline. In general, all of these works draw upon some of the information encoded in the ESCO taxonomy. However, none of them uses detailed descriptions of ESCO occupations, as we propose.

3. The Model

3.1. The Basics

The proposed model is based on the notion of distinctiveness, which specifies the difference between the prompt concept 𝜃* and other concepts within the conceptual space Θ [13]. The notion is crucial for distinguishing in-context learning concepts that are aimed to be learned by analogy. 𝜃* acts as a latent parameter in a Hidden Markov Model that defines a distribution over observed tokens, represented by selected ESCO job titles as labels. As proposed by Xie et al. [13], the error of the in-context predictor approaches optimality under the condition that 𝜃* is distinguishable from other concepts in Θ ∖ {𝜃*}. When RAG is adapted as a few-shot reasoning (or in-context learning) framework for job posting classification [14], 𝜃* is represented by the top-selected ESCO labels and ensures that the LLM can effectively differentiate between closely related job categories.

The explanation-enriched prompts enhance the LLM’s ability to learn more from each example. According to Xie et al. [13], the expected error decreases as the length and informational content of each example increase, contributing to the richness of the input–output mapping for a more robust in-context learning environment. This assumption is proven to be true under the condition of distinguishability of in-context examples and can be mathematically expressed as a reduction of the expected error 𝐸[𝜖], correlated with an increase in the information content 𝐼 of the examples:

\[ E[\epsilon] \propto \frac{1}{I(S_n, x_{\text{test}})} \tag{1} \]

where 𝑆𝑛 represents the sequence of training examples in the prompt and 𝑥test is the test input.

Figure 1: Model Architecture

The use of RAG helps avoid hallucination: when directly prompted with job postings, LLMs have been observed to sometimes produce non-existent labels [5].
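The two functions that RAG serves here, supplying candidate occupations chosen by vector similarity and keeping the output inside the label space, can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses toy two-dimensional vectors; the function names are hypothetical, and the post-hoc filter is only one possible way to discard non-existent labels, not necessarily the mechanism used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_titles(posting_vec, occupation_vecs, k):
    """Return the k occupation titles whose vectors are closest to the posting."""
    ranked = sorted(
        occupation_vecs,
        key=lambda title: cosine(posting_vec, occupation_vecs[title]),
        reverse=True,
    )
    return ranked[:k]


def restrict_to_candidates(predicted, candidates):
    """Keep only predicted titles that appear among the retrieved candidates."""
    allowed = {c.lower(): c for c in candidates}
    kept = []
    for title in predicted:
        canonical = allowed.get(title.strip().lower())
        if canonical is not None and canonical not in kept:
            kept.append(canonical)
    return kept


# Toy vectors (assumption); real embeddings would come from an embedding model.
occupations = {
    "data analyst": [0.9, 0.1],
    "data scientist": [0.8, 0.3],
    "nurse": [0.1, 0.9],
}
posting = [1.0, 0.2]

candidates = top_k_titles(posting, occupations, k=2)
# The second "prediction" is an invented label and is filtered out.
final = restrict_to_candidates(["Data Analyst", "insight wizard"], candidates)
print(candidates, final)
```

Filtering after generation is a conservative safeguard: even if a prompt already lists the admissible titles, a check of this kind guarantees that nothing outside the retrieved candidates is passed on.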
3.2. Design of the Model

The proposed model (see Figure 1) uses multilingual class embeddings of the E5-large model [15] to retrieve pertinent ESCO occupation definitions in English. The definitions serve as contextual information to prompt language models for selection of the most suitable job titles. To this end, we incorporate the DSPy library’s Chain-of-Thought mechanism,3 augmented by a hint to restrict the model output to a specified list of job titles. The signature used in this methodology (cf. Figure 2) is inspired by [16].
3 https://github.com/stanfordnlp/dspy

Figure 2: Prompt Template

To implement the RAG model, we initially established a vector database,4 in which English ESCO occupation definitions were inserted as multilingual embedding vectors. Acknowledging the reported significance of chunking in many NLP applications, we conducted a series of ablation studies to determine the optimal chunk size. These studies revealed that subdividing the ESCO occupation definitions into smaller segments adversely affects the performance of vector-based similarity matching. Therefore, we opted for storing each of the 3,015 occupations represented in the ESCO taxonomy in its entirety.
4 https://www.trychroma.com/

Table 1
Recall values for classification with E5-large Text Embeddings vector similarity
    Precision @ K     @5        @10       @30       @40
    Value             0.4238    0.9004    0.9627    0.9817

To accurately classify a given job posting with respect to the ESCO taxonomy, we include 30 ESCO occupation documents (i.e., 30 nodes of the taxonomy) into the LLM’s context as potential job titles. The rationale for choosing 30 documents is that we aim to strike a balance between computational efficiency and the accuracy of the retrieved documents. The precision of the LLM would naturally decrease when it is presented with inaccurate labels. Although, as shown in Table 1, the precision of the model slightly increases with 40 documents in the context, we accepted this trade-off in favor of a lower VRAM requirement.

Upon the retrieval of the 30 ESCO occupations that are most closely aligned with a given job posting description, a composite prompt (see Figure 2) is constructed as input to the LLM. The prompt integrates the actual text data encompassing job titles, descriptions, and skills pertinent to the selected occupations. The design of the simplified composite prompt aims to minimize the bias by focusing only on the core elements. The prompt is then processed by using a locally stored Llama-3 LLM5 in an isolated environment6.
5 https://llama.meta.com/llama3/
6 We use dockerized models from the open-source Ollama library https://ollama.com/ for all experiments

As a few-shot predictor, the LLM evaluates the composite prompt to accurately classify job postings by examining the semantic nuances of the selected ESCO occupations, aligning them with the actual job titles within the offers. To quantitatively assess the alignment between a job posting vector 𝐽 and each occupation embedding 𝐸ESCO derived from the ESCO taxonomy, cosine similarity 𝑎(𝐽, 𝐸ESCO) is used:

\[ a(J, E_{\text{ESCO}}) = \frac{J \cdot E_{\text{ESCO}}}{\lVert J \rVert \, \lVert E_{\text{ESCO}} \rVert} \tag{2} \]

The similarity scores yielded through 𝑎(𝐽, 𝐸ESCO) for each 𝐸ESCO facilitate the identification and selection of
the ESCO occupation embeddings that are most pertinent to the job posting in question. Armed with this information, the LLM proceeds to classify the job posting by selecting the ESCO occupation that exhibits the highest degree of semantic and contextual relevance.

For a specific job posting 𝐽, an embedding function 𝐸 is employed, such that 𝐸(𝐽) produces the corresponding embedding for 𝐽. The degree of similarity between the job posting’s embedding 𝐸(𝐽) and any ESCO occupation embedding 𝑒𝑖 from 𝐸ESCO (where 𝐸ESCO stands for the ensemble of occupation embeddings derived from the ESCO taxonomy) is determined through the similarity function 𝑆(𝐸(𝐽), 𝑒𝑖) (in our case cosine).

The similarity scores for each occupation embedding 𝑒𝑖 within 𝐸ESCO relative to 𝐸(𝐽) are computed. The ten class embeddings that exhibit the highest similarity to 𝐸(𝐽), denoted as 𝐸top, are selected. Formally, 𝐸top is defined as the subset {𝑒1, 𝑒2, . . . , 𝑒10} from 𝐸ESCO, where each 𝑒𝑖 is selected based on the top 10 similarity scores 𝑆(𝐸(𝐽), 𝑒𝑖).

The last stage entails a decision-making process enacted by the Llama-3 LLM, represented by the function 𝐷. This function accepts the composite prompt, which includes the candidates {𝑒1, 𝑒2, . . . , 𝑒10} collected in 𝐸top and the job posting 𝐽, to render the final selected occupation embedding. The chosen occupation embedding 𝑒* is determined by 𝑒* = 𝐷(𝐸top, 𝐽), representing the ESCO occupation best matched by the model.

The entire algorithm can be presented by the following equation, which encapsulates the embedding generation, similarity assessment, and decision-making process by the LLM, culminating in the selection of the most suitable ESCO occupation embedding 𝑒* for the given job posting description.

\[ e^{*} = D\big(\{e_1, e_2, \ldots, e_k \mid e_i \in E_{\text{ESCO}};\ \text{top } k \text{ by } S(E(J), e_i)\},\ J\big) \tag{3} \]

4. Experiments

To evaluate the effectiveness of the proposed model in handling multilingual job postings, experiments were conducted separately on Italian and Spanish datasets.

4.1. Test dataset

To have a reliable test dataset, we use three high-performing LLMs as initial annotators of 100 real-world Italian and 100 real-world Spanish job postings with the most extensive descriptions from the InfoJobs7 database. Non-informative elements such as company descriptions and promotional content were removed using a DSPy module (cf. Appendix E for prompt), which employs zero-shot Llama-3 LM inference to anonymize sensitive information in job postings and candidate experiences. The preprocessed postings were annotated by the top three performing LLMs: GPT-4o8, Gemini 1.5 Pro9, and Claude 3.5 Sonnet10, according to LmSys Arena [17]. In this context, the ESCO job titles are presented to each model separately, each model is requested to select the appropriate job titles, and we then measure their level of agreement on these labels. The agreement between LLM models was assessed using Cohen’s kappa coefficient [18]. The average kappa score between Gemini and GPT-4o was found to be 0.6386, indicating a substantial level of agreement. The agreement between Gemini and Claude was lower, with an average kappa of 0.5798, suggesting a moderate level of agreement. Similarly, the kappa score between GPT-4o and Claude was 0.6497, also indicative of substantial agreement. Overall, the average kappa score across all “annotators” was 0.6227, reflecting a general trend towards substantial inter-annotator agreement among the models.
7 https://www.infojobs.net/
8 https://openai.com/index/hello-gpt-4o/
9 https://deepmind.google/technologies/gemini/pro/
10 https://www.anthropic.com/news/claude-3-5-sonnet

To establish ground truth labels, we incorporated a dual-layer labelling process. Although the test set consists of only 200 items, labeling them from scratch would be time-consuming due to the complexity of the ESCO taxonomy, which includes 3,015 distinct occupations. Human annotators would require extensive training to accurately navigate this taxonomy. Therefore, we first annotate the occupations automatically using LLMs and then have the initial annotations cross-examined by a human expert annotator. Since each data point was reviewed by one annotator only, inter-annotator agreement among human annotators was not quantified. Instead, we conducted an analysis to identify job titles that consistently showed agreement or disagreement across the three LLMs, where domain-specific professionals from InfoJobs reviewed label discrepancies. This analysis, detailed in Appendix C, suggests that certain occupations are inherently more challenging to classify, possibly due to overlapping skills or ambiguous descriptions.

Furthermore, we repeated experiments using ground truth labels where any two of the three automatic LLMs agreed on the label. The results showed alignment between the models’ predictions and the automatic labeling process, indicating consistency with the patterns recognized by the automatic methods when there is partial agreement. A detailed analysis of this alignment can be found in Appendix D.
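The agreement analysis of Section 4.1 can be illustrated with a small sketch of pairwise Cohen's kappa and of the two-of-three agreement rule. For simplicity it assumes one label per posting and per annotator, which this section does not confirm for the multi-label setting; the annotator names, labels, and function names are hypothetical toy values, not data from the paper.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the annotators' marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def majority_label(votes):
    """Return the label chosen by at least two annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


# Toy annotations (assumption): one label per posting and annotator.
annotations = {
    "model_a": ["cook", "cook", "nurse", "nurse"],
    "model_b": ["cook", "nurse", "nurse", "nurse"],
    "model_c": ["cook", "cook", "nurse", "cook"],
}

pairwise = {
    (a, b): cohens_kappa(annotations[a], annotations[b])
    for a, b in combinations(sorted(annotations), 2)
}
mean_kappa = sum(pairwise.values()) / len(pairwise)
gold = [majority_label(votes) for votes in zip(*annotations.values())]
print(pairwise, mean_kappa, gold)
```

Items for which `majority_label` returns `None` have no two-of-three agreement and would need to be resolved by a human reviewer.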
4.2. Baselines

4.2.1. SkillGPT

SkillGPT [5] has been introduced as a tool for skill extraction and classification, with vector similarity search against LLM-precomputed ESCO embeddings. The authors employ embeddings generated by an LLM, although they do not directly use an LLM to select among candidate embeddings. Instead, they rely on embedding similarity to assign the most closely related ESCO class to job descriptions under consideration.

4.2.2. Zero-Shot Classification

By transforming the classification task into a Natural Language Inference (NLI) problem, any model pretrained on NLI tasks can be utilized as a text classifier without the need for fine-tuning, effectively achieving zero-shot text classification. This is particularly beneficial when we deal with classes unseen during training, making it a robust solution for a variety of text classification scenarios [19].

In the implementation that we use as a baseline, we utilize the BART-MNLI model [20], which showed high performance in summarization tasks when pretrained for various NLI tasks on an MNLI dataset [21], and which is leveraged for its capability to understand entailment relations for the classification of the given sequence into one of the specified categories. We also apply the same methodology with the Llama-3 model.

4.3. Model Optimization

To optimize LLMs with a minimal set of manually crafted examples, we use the DSPy library [22]. We initialize the classifier module with a Llama-3 model and use a GPT-4o model as the teacher. Our optimization of the classification is aimed at achieving high F1 scores for each dataset individually. In each run, we use 10 labeled training examples and 30 labeled validation examples. We employ DSPy’s BootstrapFewShot, configuring it to perform a maximum of 2 rounds with up to 8 bootstrapped demonstrations. We define a custom metric, the F1 score, to guide the bootstrapping process. For the optimization of the LLMs, we use data points that had high inter-agreement among the automatic methods and were reviewed by human annotators. We perform a validation/test split to ensure that the optimization does not bias the evaluation results.

4.4. Outcome of the experiments

For the evaluation of the results of the experiments, we used the micro recall and micro precision metrics, which are suitable for our multi-class classification task. We report evaluation scores separately on Spanish and Italian test sets.

Table 2
Italian Performance Metrics for Top 5 and Top 10 Predictions
    Model                        Precision@5   Precision@10   Recall@5   Recall@10
    llama-3-8b (CoT opt.)        0.32          0.13           0.76       0.80
    llama-3-8b (CoT)             0.26          0.12           0.62       0.64
    llama-3-8b (SkillGPT)        0.19          0.19           0.36       0.82
    mBart-large-mnli (0-shot)    0.13          0.12           0.29       0.58
    multilingual-e5-large        0.16          0.19           0.36       0.88

Table 3
Spanish Performance Metrics for Top 5 and Top 10 Predictions
    Model                        Precision@5   Precision@10   Recall@5   Recall@10
    llama-3-8b (CoT opt.)        0.28          0.20           0.72       0.90
    llama-3-8b (CoT)             0.26          0.16           0.64       0.68
    llama-3-8b (SkillGPT)        0.09          0.12           0.36       0.62
    mBart-large-mnli (0-shot)    0.15          0.14           0.39       0.70
    multilingual-e5-large        0.20          0.19           0.48       0.92

Tables 2 and 3 display the results on the Italian and Spanish datasets, respectively. The results indicate that prompting techniques outperform SkillGPT in both languages. Specifically, the optimized Llama-3-8b model with chain-of-thought (CoT) achieves the highest precision and recall @5 for Italian, with values of 0.32 and 0.76, respectively, and for Spanish, with values of 0.28 and 0.72. This supports our assumption that optimization enhances performance. The multilingual E5-large model achieves the highest precision @10 for Italian (0.19) and the highest recall @10 for Spanish (0.92), underscoring the efficacy of embeddings in classification. This implies that semantically less similar labels can confuse models, whereas embeddings ensure higher recall accuracy, particularly in wider retrieval scenarios. Although both models exhibit similar precision, indicating comparable accuracy in their predictions, the optimized model’s capacity to capture a broader range of relevant job titles ensures greater alignment with expert human preferences. This enhances the model’s ability to make relevant job title suggestions, thereby improving the overall matching process.

4.5. Discussion

In Tables 2 and 3 we observe that the combined use of general text embeddings and language models significantly outperforms current classification techniques, which rely
on language models specifically tailored to the field of the labour market, such as [12]. We see
that using vector similarity with the text embeddings created by the E5-large text embeddings model
alone does not surpass the baseline. However, it is worth noting that the results are quite close,
despite the fact that this model was not specifically fine-tuned on labour market data or adapted to
the ESCO taxonomy, as is the case of [12]. Furthermore, we can observe how text embeddings indeed
provide significant value for filtering the k occupations closest to a job posting within the
taxonomy. Using these k occupations as input to various language models for few-shot classification
significantly improves over the baselines. Table 6 in the Appendix illustrates the decisions of the
LLMs in the case of four sample job postings.
   We also evaluated the effectiveness of a large language model for classification of job titles
based on provided descriptions, as shown in Table 4, even when the correct titles were not explicitly
listed among the initial ESCO job titles. The model’s ability to select accurate titles reflects its
functionality in processing and understanding the contextual and semantic aspects of the job
descriptions. For instance, when presented with a job description focused on the management of
comprehensive water and wastewater services, the model correctly identified “Operations Manager” as
the correct title. This identification was made despite the presence of several closely related but
distinct labels (such as “Water treatment plant manager”) within the pool of ESCO job titles. This
indicates that the model’s decisions are more influenced by a comprehensive understanding of the job
responsibilities and sectors than by the mere presence of keywords or phrases in the ESCO job titles.
   The model’s capacity to differentiate between job titles with more specific definitions enhances
its comprehension of job postings and assigned labels, thereby improving the precision of suggesting
relevant skills. Upon integration into an operational job platform, this model will better understand
the requirements of job postings and accurately assign job titles that align with the specific needs
of companies. Similarly, in the context of parsing of job candidate experiences, keywords tend to
appear more frequently in semantically related ESCO definitions, enabling parsers to incorporate
these keywords to enhance parsing performance.
   Overall, we can thus state that the integration of class embeddings generated using the
multilingual E5-large model, with subsequent application of few-shot classification techniques
through LLMs, significantly improves the accuracy of job title classification, clearly surpassing
that of the baselines.

4.6. Computational Cost of Compared Methods
In addition to evaluating performance metrics, we analyzed the computational cost and environmental
impact of each method. The Llama-3-8b model, with 8 billion parameters, requires significant
resources for inference, necessitating a GPU with at least 16 GB of VRAM (e.g., NVIDIA RTX 3090). Its
average inference time per job posting is approximately 1.5 seconds, and its high energy consumption
leads to increased CO2 emissions, making large-scale deployment less environmentally sustainable
without optimizations.
   In contrast, the mBART-large-mnli model has about 610 million parameters and operates on GPUs with
8 GB of VRAM, offering faster inference times under 0.5 seconds per job posting. The embeddings-based
method using the multilingual E5-large model, with 330 million parameters, allows for precomputed
embeddings and efficient CPU-based vector similarity searches, reducing inference time to less than
0.2 seconds per job posting. These smaller models consume less energy, providing more
resource-efficient and eco-friendly alternatives suitable for production environments where
computational cost and environmental impact are critical considerations.

5. Conclusions and future work
In this paper, we argued that the use of multilingual embeddings in combination with LLMs
significantly enhances our ability to distinguish between very similar (or even identical) job titles
that suggest different skills and competencies. Our experiments have shown that this is indeed the
case, demonstrating that the combination of multilingual text embeddings similarity with Llama-3
markedly exceeds the performance of other leading approaches in the field.
   In the future, we plan to apply the same approach to the analysis and classification of job
candidate experiences. Once it is ensured that both job postings and candidate experiences can
accurately be modeled using the embedded representation of the ESCO taxonomy, we plan to set the stage
for a more direct and efficient alignment process between job postings and experiences of job
seekers.
   Another interesting direction for future research is to analyze the lexical overlap between
English domain-specific terms that appear in Italian and Spanish job postings and the English
occupation descriptions in the ESCO taxonomy. Such an analysis would reveal whether job types with
higher lexical overlap affect model accuracy, providing deeper insights into the multilingual nature
of the task.
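The two-stage approach summarized in these conclusions (an embedding-based shortlist of ESCO occupations followed by LLM selection) can be illustrated for its first stage with a minimal sketch. The three-dimensional vectors below are toy values standing in for multilingual E5-large embeddings, and the names `cosine` and `top_k_occupations` are hypothetical helpers introduced only for this example, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_occupations(posting_vec, occupation_vecs, k=10):
    """Return the k occupation titles whose vectors are closest to the posting."""
    ranked = sorted(
        occupation_vecs,
        key=lambda title: cosine(posting_vec, occupation_vecs[title]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors (assumption): real embeddings would be precomputed per ESCO occupation.
toy_occupations = {
    "restaurant manager": [0.9, 0.1, 0.0],
    "quick service restaurant team leader": [0.8, 0.6, 0.0],
    "ICT project manager": [0.0, 0.1, 0.9],
}
toy_posting = [0.7, 0.7, 0.0]

shortlist = top_k_occupations(toy_posting, toy_occupations, k=2)
```

The resulting shortlist would then be passed to the language model as the candidate set for few-shot classification. Because occupation vectors can be computed once and reused, only the posting needs to be embedded at query time.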
References

 [1] R. Zbib, L. L. Alvarez, F. Retyk, R. Poves, J. Aizpuru, H. Fabregat, V. Šimkus, E. G. Casademont,
     Learning job titles similarity from noisy skill labels, ArXiv abs/2207.00494 (2022).
     URL: https://api.semanticscholar.org/CorpusID:250243975.
 [2] J.-J. Decorte, J. V. Hautte, T. Demeester, C. Develder, Jobbert: Understanding job titles through
     skills, ArXiv abs/2109.09605 (2021). URL: https://api.semanticscholar.org/CorpusID:237572142.
 [3] F. Javed, M. McNair, F. Jacob, M. Zhao, Towards a job title classification system, 2016.
     arXiv:1606.00917.
 [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
     G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
     D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess,
     J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are
     few-shot learners, 2020. arXiv:2005.14165.
 [5] N. Li, B. Kang, T. D. Bie, Skillgpt: a restful api service for skill extraction and
     standardization using a large language model, 2023. arXiv:2304.11060.
 [6] B. Shi, J. Yang, F. Guo, Q. He, Salience and market-aware skill extraction for job targeting,
     2020. arXiv:2005.13094.
 [7] S. Li, B. Shi, J. Yang, J. Yan, S. Wang, F. Chen, Q. He, Deep job understanding at linkedin, in:
     Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in
     Information Retrieval, ACM, 2020. URL: http://dx.doi.org/10.1145/3397271.3401403.
     doi:10.1145/3397271.3401403.
 [8] M. Zhang, K. N. Jensen, S. D. Sonniks, B. Plank, Skillspan: Hard and soft skill extraction from
     english job postings, ArXiv abs/2204.12811 (2022).
     URL: https://api.semanticscholar.org/CorpusID:248405777.
 [9] J. Wang, K. Abdelfatah, M. Korayem, J. Balaji, Deepcarotene - job title classification with
     multi-stream convolutional neural network, 2019, pp. 1953–1961.
     doi:10.1109/BigData47090.2019.9005673.
[10] M. Yamashita, J. T. Shen, H. Ekhtiari, T. Tran, D. Lee, James: Job title mapping with
     multi-aspect embeddings and reasoning, 2022. arXiv:2202.10739.
[11] M. Zhang, R. van der Goot, B. Plank, Escoxlm-r: Multilingual taxonomy-driven pre-training for the
     job market domain, in: Annual Meeting of the Association for Computational Linguistics, 2023.
     URL: https://api.semanticscholar.org/CorpusID:258832782.
[12] H. Kavas, M. Serra-vidal, L. Wanner, Job offer and applicant cv classification using rich
     information from a labour market taxonomy, SSRN Electronic Journal (2023).
     doi:10.2139/ssrn.4519766.
[13] S. M. Xie, A. Raghunathan, P. Liang, T. Ma, An explanation of in-context learning as implicit
     bayesian inference, 2022. arXiv:2111.02080.
[14] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang,
     Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[15] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards General Text Embeddings with
     Multi-stage Contrastive Learning, arXiv e-prints (2023). doi:10.48550/arXiv.2308.03281.
     arXiv:2308.03281.
[16] K. D’Oosterlinck, O. Khattab, F. Remy, T. Demeester, C. Develder, C. Potts, In-context learning
     for extreme multi-label classification, ArXiv abs/2401.12178 (2024).
     URL: https://api.semanticscholar.org/CorpusID:267068618.
[17] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan,
     J. E. Gonzalez, I. Stoica, Chatbot arena: An open platform for evaluating llms by human
     preference, ArXiv abs/2403.04132 (2024).
     URL: https://api.semanticscholar.org/CorpusID:268264163.
[18] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological
     Measurement 20 (1960) 37–46. URL: https://api.semanticscholar.org/CorpusID:15926286.
[19] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning based text
     classification: A comprehensive review, CoRR abs/2004.03705 (2020).
     URL: https://arxiv.org/abs/2004.03705. arXiv:2004.03705.
[20] L. Shu, J. Chen, B. Liu, H. Xu, Zero-shot aspect-based sentiment analysis, ArXiv abs/2202.01924
     (2022).
[21] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding
     through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the
     Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers),
     Association for Computational Linguistics, 2018, pp. 1112–1122.
     URL: http://aclweb.org/anthology/N18-1101.
[22] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma,
     T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, C. Potts, Dspy: Compiling declarative language
     model calls into self-improving pipelines, ArXiv abs/2310.03714 (2023).
     URL: https://api.semanticscholar.org/CorpusID:263671701.
Figure 3: LLM’s Rationale


A. Ablation Study

In our ablation study, we pursued two primary objectives. Firstly, we aimed to evaluate the model’s
comprehension of ESCO job titles and its decision-making process. To achieve this, we prompted the
model to articulate its underlying rationale. Secondly, so far we have reported the performance of our
model when Italian and Spanish data were matched against English job titles and occupations in the
ESCO taxonomy. Here we wanted to explore whether its comprehension was extendable to data in different
languages. We selected Spanish for this purpose and discovered that the model’s understanding was
consistent, irrespective of the language; see Table 4.
   As illustrated in Figure 3, the LLM showcases a comprehensive understanding of the task at hand,
effectively narrowing down potential ESCO job titles to identify the most suitable label.
Additionally, the LLM is observed to generate a novel job title, referred to as “fast food shift team
leader”. This can be attributed to the absence of constraints imposed on the LLM regarding structured
output for classification, thereby granting it the autonomy to propose the most fitting job title. The
analysis initially excludes broader or less related job titles such as “business manager”,
“hospitality revenue manager”, and “accommodation manager”, which are not specific to quick-service
restaurant operations. Subsequently, the model considers and ultimately selects titles that emphasize
leadership within this specific restaurant context, narrowing down to “quick service restaurant team
leader” and “fast food shift team leader” as the most apt job titles. The reasoning of the model is
correct in choosing these titles for their precise reflection of the managerial and leadership
responsibilities pertinent to the restaurant environment.

B. Job postings and Predicted ESCO job titles

The following tables provide examples of job titles, job posting descriptions, and the corresponding
gold labels in Table 5 and optimized Llama-3 job titles in Table 6. These examples illustrate how the
job titles assigned by recruiters may not always capture the specific nature of the job described in
the postings. The gold labels and the optimized Llama-3 job titles offer a more accurate
representation of the job roles based on the detailed job descriptions.
   The job title “Commessa” (Salesperson) is generic and does not specify the specialization required
for the job. The gold label “telecommunications equipment specialised seller” fits better because the
job description clearly focuses on selling telecommunications equipment, which requires specific
knowledge and skills related to this type of product. The gold label accurately reflects the
specialized nature of the role. The job title “Project engineer” given by the recruiter suggests a
technical and
   Gold Label Job Title
   Quick Service Restaurant Team Leader
   Posting Job Title
   Encargado de Franquicias
   Posting Description:
   - Responsable de garantizar la satisfacción de los huéspedes y de gestionar y superar los objetivos financieros y operativos
   de los restaurantes a mi cargo.
   - Garantizar una excelente atención a los huéspedes en base a las promesas y estándares definidos.
   - Liderar, motivar y desarrollar equipos.
   - Facilitar los recursos y el apoyo necesario a los equipos en sus restaurantes.
   - Utilizar de manera eficaz los diferentes recursos de la Compañía.
   - Identificar oportunidades y amenazas de negocio en el mercado.
   - Aportar ideas y ejecutando proyectos en el corto y medio plazo.
   - Difundir las mejores practicas y resolver problemas comunes en los restaurantes.
   - Cumplir los protocolos y políticas de la Marca y la Compañía.
   - Garantizar y difundir los valores y principios definidos por la Compañía.
   Skills: SAP Girnet Gtock, Cuiner
   ESCO Job Titles:
   Restaurant Manager, Business Manager, Hospitality Revenue Manager, Accommodation Manager, Delicatessen Shop
   Manager, Rooms Division Manager, Customer Experience Manager, Quick Service Restaurant Team Leader, Destination
   Manager, Membership Manager

  Table 4
  Spanish job posting Example


 Posting Job Title       Job Posting Description                                                         Gold Labels
 Commessa                Commessa; Commessa; - Presentazione e vendita di attrezzature per               Telecommunications
                         telecomunicazioni ai clienti; - Servizio e supporto clienti; - Gestione delle   equipment specialised
                         transazioni di vendita; - Gestione dello stock e dell’inventario.               seller
 Project Engineer        Project Engineer; Project Engineer; PROJECT MANAGER / PROJECT                   Project manager,
                         ENGINEER Divisione: Amministrazione Tecnica - Coordinamento delle               Product development
                         attività di gestione progetti in ambito tecnico; - Supporto al Product          manager
                         Development; - Pianificazione e monitoraggio delle attività progettuali; -
                         Supervisione del team tecnico; - Assistenza alla gestione dei fornitori e
                         del budget di progetto.
Table 5
Examples of Job Titles, Descriptions, and Gold Labels


engineering-focused role. However, the job description emphasizes project management, coordination of
project activities, support to product development, and supervision of the technical team. The gold
label “project manager” fits better as it captures the overall management and coordination
responsibilities described, which are more aligned with the duties of a project manager than just a
project engineer.
   The job title “Addetto alle vendite” (Sales Assistant) is too generic and does not capture the
specialized nature of the role described in the vacancy. The description specifies duties typical of a
deli worker, such as serving customers, slicing cheeses and cured meats, preparing packages, and
managing the deli counter. Our model’s titles “meat and meat products specialised seller” and “deli
worker” are more precise, indicating a specialized role in food handling and customer service, which
goes beyond the general sales assistant title. This demonstrates our model’s ability to interpret the
specific context and responsibilities of the job accurately.
   The job title “IT Specialist” is generic and could encompass various IT roles. However, the job
description clearly indicates responsibilities such as managing ICT projects, coordinating a software
development team, planning and monitoring development activities, managing ICT resources and budget,
and providing advanced technical support. The optimized titles “ICT project manager” and “software
development manager” are more accurate as they reflect the leadership, coordination, and project
management aspects of the role, which go beyond the scope of a general IT specialist.
   The job title “Sales Manager” suggests a mid-level management role. However, the job description
highlights responsibilities such as business development, defining sales strategies, managing the
sales team, monitoring performance, and managing relationships with key clients and strategic
partners. These responsibilities are more aligned with a higher-level role such as “business
development manager” or “sales director”, which involve strategic planning and high-level management.

 Posting Job Title      Job Posting Description                                                        Optimized Llama-3
                                                                                                       Job Titles
 Addetto alle vendite   Addetto alle vendite; Addetto alle vendite; Salumiere: servizio clientela,     Meat and meat products
                        tagli di formaggi e salumi, preparazione confezioni, gestione banco            specialised seller, Deli
                        gastronomia.                                                                   worker, Food and
                                                                                                       beverage server
 IT Specialist          IT Specialist; IT Specialist; Responsabile della gestione dei progetti ICT;    ICT project manager,
                        Coordinamento del team di sviluppo software; Pianificazione e                  Software development
                        monitoraggio delle attività di sviluppo; Gestione delle risorse ICT e del      manager
                        budget; Assistenza tecnica avanzata e risoluzione dei problemi.
 Sales Manager          Sales Manager; Sales Manager; Sviluppo del business aziendale;                 Business development
                        Definizione delle strategie di vendita; Gestione del team di vendita;          manager, Sales director
                        Monitoraggio delle performance e raggiungimento degli obiettivi di vendita;
                        Gestione delle relazioni con i clienti chiave e i partner strategici.
Table 6
Examples of Job Titles, Descriptions, and Optimized Job Titles


C. Ambiguity from Specialized and Contextual Factors

To further understand the complexity of job classification in a multilingual context, we conducted an
ablation study focusing on cases where both human annotators and LLMs demonstrated shared uncertainty
in assigning definitive labels. These cases were particularly challenging due to specialized
terminology, regional language variations, or overlapping responsibilities within job postings.
Table 7 highlights key examples where annotators, despite their recruitment expertise, aligned with
the LLMs in experiencing ambiguity.

Table 7
Examples of Job Postings with Ambiguous Classification due to Multilingual and Contextual Challenges
 Job Title              Description Excerpt                                                            Labels Suggested
 Junior Project         Applicare i metodi e gli strumenti propri del Project Management a             Project Manager, ICT
 Manager                commesse specifiche per il settore dell’automazione industriale, di cui        Project Manager,
                        l’azienda fornisce sistemi di visione artificiale.                             Programme Manager
 Assistente             Gestione dei flussi delle segnalazioni dei cittadini per prenotazioni          Healthcare Assistant,
 Amministrativo         vaccinazioni e assistenza pandemica, inclusa la verifica del “certificato      Administrative Assistant,
 (Healthcare)           verde” per la conformità alle normative sanitarie.                             Contact Tracing Agent
 Commesso di Negozio    Creazione di vetrine accattivanti con abbinamenti di tendenza e assistenza     Shop Assistant, Sales
 (Retail)               alla clientela nella scelta dei prodotti.                                      Assistant, Visual
                                                                                                       Merchandiser
 Team Leader            Predisposizione documenti formativi e aggiornamento processi operativi         Team Leader, Energy
 (Energy Sector)        presso sede Enel, inclusa l’implementazione e il collaudo di software per      Analyst, Business
                        la gestione energetica.                                                        Process Analyst
 Assistente             Compiti legati al Registro Nazionale delle Varietà Vegetali e mansioni         Accounting Assistant,
 Amministrativo (Legal  fiscali complesse come Dichiarazioni IRAP.                                     Administrative Assistant,
 and Fiscal)                                                                                           Compliance Officer

   As presented in Table 7, each example illustrates specific challenges encountered in classifying
job postings across multilingual and sector-specific contexts. The Junior Project Manager job posting,
for instance, combines general project management with specialized tasks such as machine vision, but
without enough specific context,
it is unclear whether the focus should be on technical expertise or managerial skills. The Project
Engineer example shows the impact of technical terminology and sector-specific language on
classification. Terms such as “SCADA” and “Modbus TCP” are common in international engineering
contexts but may not align with the typical understanding of recruiters, leading to the selection of
varied labels by both LLMs and annotators. The example of the Assistente Amministrativo with a legal
and fiscal focus involves highly specialized processes such as “Registro Nazionale delle Varietà
Vegetali” and complex fiscal duties like “Dichiarazioni IRAP.” These terms relate to specific Italian
government and regulatory compliance, which could exceed the annotators’ typical recruitment
experience, thus resulting in generalized labels that do not fully capture the compliance and
accounting complexity.
   These cases emphasize that job postings, as human-created documents, often do not provide enough
context for a definitive classification, resulting in ambiguity across specialized and regional terms.


D. Analysis of Model Alignment with Partial Agreement Ground Truth Labels

   In our evaluation, we established two levels of ground truth labels: gold and silver. Gold labels
represent unanimous agreement among all three annotators (GPT-4o, Gemini 1.5 Pro, and Claude 3.5
Sonnet), validated by human experts. Silver labels indicate a strong majority consensus, assigned when
any two annotators agree, even if the third disagrees.
   We assessed our model’s performance on both silver and gold labels to understand its effectiveness
under different levels of agreement. We reported results for gold labels in Tables 2 and 3; results
for silver labels are presented in Table 8. For the Spanish dataset, the model’s performance was
relatively consistent between silver and gold labels, with only minor variations in precision and
recall. This consistency suggests that the model robustly captures underlying patterns in the job
postings, regardless of labeling strictness.
   In contrast, the Italian dataset exhibited more significant differences between performances on
silver and gold labels. For example, in some cases, the precision was higher for silver labels while
recall was higher for gold labels. This disparity may indicate that the model better captures broader
classifications aligning with majority consensus in Italian but struggles with the stricter criteria
required for unanimous agreement.
   An interesting observation is that optimization using gold label ground truth data had a negative
effect on the models’ scores derived from silver labels. This could be explained by the fact that
during optimization, the language models became more attuned to the patterns present in the gold
labels, potentially diverging from those in the silver labels. As a result, the models may have become
less effective at predicting labels where only partial agreement (silver labels) was present among the
automatic methods.

Table 8
Performance Metrics for Top 5 and Top 10 Predictions
                                Precision        Recall
  Model                        @5      @10    @5       @10
                      Spanish (SPA)
  llama-3-8b (CoT opt.)       0.12     0.06   0.58     0.62
  llama-3-8b (CoT)            0.22     0.16   0.64     0.68
  llama-3-8b (SkillGPT)       0.19     0.12   0.36     0.62
  mBart-large-mnli (0-shot)   0.15     0.14   0.39     0.70
  multilingual-e5-large       0.20     0.19   0.48     0.92
                       Italian (ITA)
  llama-3-8b (CoT opt.)       0.12     0.06   0.56     0.60
  llama-3-8b (CoT)            0.23     0.07   0.55     0.59
  llama-3-8b (SkillGPT)       0.22     0.06   0.53     0.59
  mBart-large-mnli (0-shot)   0.27     0.06   0.31     0.58
  multilingual-e5-large       0.35     0.08   0.39     0.79


E. DSPy Signature

We utilize DSPy signatures to prompt large language models (LLMs) for performing downstream tasks. To
optimize the script, recursive LLM calls were employed, resulting in its final form based on empirical
observations.

Figure 4: Pre-processing Signature
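The gold and silver tiers defined in Appendix D can be sketched as a simple agreement rule. The function name `agreement_tier` is hypothetical, and the sketch simplifies each annotator's output to a single title per posting; the human validation step that the paper requires for gold labels is outside the scope of this example and is only noted in a comment.

```python
from collections import Counter


def agreement_tier(annotations):
    """Map the titles proposed by the automatic annotators to a label tier.

    `annotations` holds one title per annotator (an assumption of this
    sketch; the real outputs may contain several titles per posting).
    """
    normalized = [title.strip().lower() for title in annotations]
    title, votes = Counter(normalized).most_common(1)[0]
    if votes == len(normalized):
        # Unanimous: a gold candidate, still subject to human validation.
        return title, "gold"
    if votes >= 2:
        # Majority consensus with one dissenting annotator.
        return title, "silver"
    # No two annotators agree, so no label is assigned.
    return None, None


unanimous = agreement_tier(["Deli worker", "deli worker", "Deli worker "])
majority = agreement_tier(["Sales director", "Sales director", "Sales manager"])
split = agreement_tier(["Team leader", "Energy analyst", "Business process analyst"])
```

Under this rule every gold label would also satisfy the silver criterion, so a real pipeline has to decide whether the silver set includes or excludes the gold cases; the paper does not state which.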