<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hamit</forename><surname>Kavas</surname></persName>
							<email>hamit.kavas@upf.edu</email>
							<affiliation key="aff0">
								<orgName type="laboratory">NLP Group</orgName>
								<orgName type="institution">Pompeu Fabra University</orgName>
								<address>
									<addrLine>C/ Roc Boronat, 138</addrLine>
									<postCode>08018</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Adevinta Spain</orgName>
								<address>
									<addrLine>C/ de la Ciutat de Granada, 150</addrLine>
									<postCode>08018</postCode>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marc</forename><surname>Serra-Vidal</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Adevinta Spain</orgName>
								<address>
									<addrLine>C/ de la Ciutat de Granada, 150</addrLine>
									<postCode>08018</postCode>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Leo</forename><surname>Wanner</surname></persName>
							<email>leo.wanner@upf.edu</email>
							<affiliation key="aff0">
								<orgName type="laboratory">NLP Group</orgName>
								<orgName type="institution">Pompeu Fabra University</orgName>
								<address>
									<addrLine>C/ Roc Boronat, 138</addrLine>
									<postCode>08018</postCode>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Catalan Institute for Research and Advanced Studies (ICREA)</orgName>
								<address>
									<addrLine>Passeig Lluís Companys, 23</addrLine>
									<postCode>08010</postCode>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
						<meeting>Tenth Italian Conference on Computational Linguistics
							<address>
								<settlement>Pisa</settlement>
								<country key="IT">Italy</country>
							</address>
						</meeting>
						<imprint>
							<date type="published" when="2024-12-04">Dec 04-06, 2024</date>
						</imprint>
					</monogr>
					<idno type="MD5">573D11A030519E239924306358A6196C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>ESCO labour market taxonomy</term>
					<term>job posting classification</term>
					<term>class embeddings</term>
					<term>text embeddings</term>
					<term>LLM</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the modern labour market, taxonomies such as the European Skills, Competences, Qualifications and Occupations (ESCO) classification are used as an interlingua to match job postings with job seeker profiles. Both are classified with respect to ESCO occupations, and match if they align with the same occupation and the same skills assigned to the occupation. However, matching models usually struggle with the classification because of overlapping skills and similar definitions of occupations defined in the ESCO taxonomy. This often leads to imprecise classification outcomes. In this paper, we focus on the challenge of the classification of job postings written in Italian or Spanish against ESCO occupations written in English. We experiment with multilingual embeddings, zero-shot classification, and the use of a large language model (LLM), and show that the use of an LLM leads to the best results. Furthermore, we also explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The modern labour market is becoming more and more diverse. High-tech jobs demand novel skills and competences, which, in their turn, keep undergoing adaptations and modifications. Under these circumstances, accurately classifying job postings and CVs of job seekers (henceforth candidate experiences) that contain detailed technological specifications with remarkably similar yet distinct skills and experiences has evolved into a complex challenge.</p><p>The overwhelming majority of job portals and employment agencies use either the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy 1 or its US equivalent O*Net taxonomy 2 to classify job postings and candidate experiences in terms of job-title-labeled ESCO/O*Net occupations. Most of the proposals for the automatic alignment of job postings with candidate experiences (or vice versa) also use ESCO or O*Net <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. However, despite their wide use, both the ESCO and O*Net taxonomies exhibit principal limitations for the task of automatic classification of job postings and candidate experiences: due to their tree structure, they often fail to adequately distinguish between occupations that exhibit substantial skill overlaps. For instance, two job postings labeled as 'data analyst' may appear similar but require different skills if one focuses on market research while the other concentrates on healthcare trends analysis. This issue is particularly pronounced when classification relies on a single label, such as the job title of an ESCO occupation, where skill overlaps undermine precise classification. Hence, employing multiple job titles and framing the problem as a multi-label classification task is imperative.</p><p>This paper addresses the challenge of multilingual multi-label classification using Large Language Models (LLMs) for the alignment of Italian and Spanish job postings with English job titles encountered in the ESCO taxonomy. Multilingual class embeddings are explored to improve classification accuracy, aiming to provide the necessary contextual awareness and to address the core limitations of taxonomies such as ESCO.</p><p>Furthermore, we explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.</p><p>To provide LLMs with domain-specific information and to mitigate hallucinations in the course of the classification of the job postings, we employ Retrieval Augmented Generation (RAG) <ref type="bibr" target="#b3">[4]</ref>, which combines information retrieval with a generative model. RAG serves two critical functions in our methodology. Firstly, it provides detailed definitions, including essential skills and synonyms for each ESCO occupation, selected through vector similarity as outlined in <ref type="bibr" target="#b4">[5]</ref>. 
Secondly, it ensures that the assigned job titles are restricted to titles within our predefined label space, i.e., standardized job titles defined in the ESCO taxonomy.</p><p>The contributions of our work are:</p><p>• We explore the impact of using multilingual class embeddings derived from the ESCO taxonomy for the task of job posting classification.</p><p>• We integrate RAG to provide LLMs with domain-specific information and eliminate the dependency on fine-tuning.</p><p>• We show how the LLM response can be restricted to standardized job titles and thus how LLMs can be used for high-quality job title classification that outperforms state-of-the-art proposals for this task.</p><p>The remainder of the paper is structured as follows. In Section 2, we present a concise overview of the related work. In Section 3, the model on which our work is based is outlined. Section 4 describes the experiments we carried out, the results we obtained in these experiments, and their discussion. Section 5, finally, draws some conclusions from the presented work and outlines some directions for future research. In Appendix A, we present an ablation study in which we assess the comprehension of English ESCO job titles and their Spanish equivalents by our model. Appendix B provides, for illustration, examples of Italian job postings and predicted ESCO job titles.</p><p>In Appendix E, we present the signature used to prompt Large Language Models for pre-processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>A number of works have been carried out in the domain of job title classification, focusing on various facets of the problem. Shi et al. <ref type="bibr" target="#b5">[6]</ref> introduce Job2Skills, a model developed for LinkedIn. The model significantly improves job recommendation performance metrics; however, it raises questions about its effectiveness beyond LinkedIn. Li et al. <ref type="bibr" target="#b6">[7]</ref> propose a two-step job title normalization, also for LinkedIn, which is based on tokenization and matching of the original job title provided by the user against a lookup table. The use of a lookup table instead of a standard occupation taxonomy such as ESCO or O*Net significantly limits the generalization potential of this strategy. Zhang et al. <ref type="bibr" target="#b7">[8]</ref> extract soft and hard skills from job posting descriptions, showing that domain-specific pre-training significantly enhances performance in skills and knowledge extraction. Javed et al. <ref type="bibr" target="#b2">[3]</ref> introduce a semi-supervised machine learning approach that utilizes hierarchical classifiers and the O*NET Standard Occupational Classification (SOC) taxonomy for the classification of online recruitment data. Similarly, Wang et al. <ref type="bibr" target="#b8">[9]</ref> propose a model based on multi-stream convolutional neural networks, aiming to classify noisy user-generated job titles by considering different elements such as characters and words within job titles. Yamashita et al. <ref type="bibr" target="#b9">[10]</ref> and Zbib et al. <ref type="bibr" target="#b0">[1]</ref> conduct studies on the classification of job titles, focusing on job title alignment and job similarity training, respectively. JobBERT, introduced by Decorte et al. <ref type="bibr" target="#b1">[2]</ref>, classifies job titles against the ESCO taxonomy, treating the task as a semantic text similarity (STS) exercise. In particular, JobBERT emphasizes the understanding of the semantics of job titles through the skills inferred from the associated vacancies and descriptions, thus alleviating the need for an extensive labeled dataset or a continuously updated list of standardized titles. Before the recent proposals <ref type="bibr" target="#b10">[11]</ref> and <ref type="bibr" target="#b11">[12]</ref>, JobBERT used to be referenced as the state-of-the-art baseline. In general, all of these works draw upon some of the information encoded in the ESCO taxonomy. However, none of them uses detailed descriptions of ESCO occupations, as we propose.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Model</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">The Basics</head><p>The proposed model is based on the notion of distinctiveness, which specifies the difference between the prompt concept 𝜃* and other concepts within the conceptual space Θ <ref type="bibr" target="#b12">[13]</ref>. The notion is crucial for distinguishing in-context learning concepts that are aimed to be learned by analogy. 𝜃* acts as a latent parameter in a Hidden Markov Model that defines a distribution over observed tokens, represented by selected ESCO job titles as labels. As proposed by Xie et al. <ref type="bibr" target="#b12">[13]</ref>, the error of the in-context predictor approaches optimality under the condition that 𝜃* is distinguishable from other concepts in Θ ∖ {𝜃*}. When RAG is adapted as a few-shot reasoning (or in-context learning) framework for job posting classification <ref type="bibr" target="#b13">[14]</ref>, 𝜃* is represented by the top-selected ESCO labels and ensures that the LLM can effectively differentiate between closely related job categories.</p><p>The explanation-enriched prompts enhance the LLM's ability to learn more from each example. According to Xie et al. <ref type="bibr" target="#b12">[13]</ref>, the expected error decreases as the length and informational content of each example increase, contributing to the richness of the input-output mapping for a more robust in-context learning environment. This assumption is proven to be true under the condition of distinguishability of in-context examples and can be mathematically expressed as a reduction of the expected error 𝐸[𝜖], correlated with an increase in the information content 𝐼 of the examples:</p><formula xml:id="formula_0">𝐸[𝜖] ∝ 1 / 𝐼(𝑆𝑛, 𝑥test)<label>(1)</label></formula><p>The use of RAG also helps to avoid hallucination: when directly prompted with job postings, LLMs have been observed to sometimes produce non-existent labels <ref type="bibr" target="#b4">[5]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Design of the Model</head><p>The proposed model (see Figure <ref type="figure" target="#fig_0">1</ref>) uses multilingual class embeddings of the E5-large model <ref type="bibr" target="#b14">[15]</ref> to retrieve pertinent ESCO occupation definitions in English. The definitions serve as contextual information to prompt language models for selection of the most suitable job titles. To this end, we incorporate the DSPy library's Chain-of-Thought mechanism, <ref type="foot" target="#foot_0">3</ref> augmented by a hint to restrict the model output to a specified list of job titles. The signature used in this methodology (cf. Figure <ref type="figure" target="#fig_1">2</ref>) is inspired by <ref type="bibr" target="#b15">[16]</ref>.</p><p>To implement the RAG model, we initially established a vector database, <ref type="foot" target="#foot_1">4</ref> in which English ESCO occupation definitions were inserted as multilingual embedding vectors. Acknowledging the reported significance of chunking in many NLP applications, we conducted a series of ablation studies to determine the optimal chunk size. These studies revealed that subdividing the ESCO occupation definitions into smaller segments adversely affects the performance of vector-based similarity matching. Therefore, we opted for storing each of the 3,015 occupations represented in the ESCO taxonomy in its entirety.</p></div>
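<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the following sketch shows how such an index could be built. It assumes the sentence-transformers release of intfloat/multilingual-e5-large and a local Chroma collection; the loader load_esco_occupations() is a hypothetical placeholder, and the snippet approximates our setup rather than reproducing its exact implementation.</p><p>
# Sketch of the retrieval index described above: one embedding per ESCO
# occupation definition (no chunking), stored in a Chroma collection.
# load_esco_occupations() is hypothetical.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
client = chromadb.PersistentClient(path="esco_index")
collection = client.get_or_create_collection(name="esco_occupations")

occupations = load_esco_occupations()  # hypothetical: dicts with uri, title, description, skills
for occ in occupations:
    # E5 models expect a "passage: " prefix for documents and "query: " for queries.
    text = f"{occ['title']}. {occ['description']} Skills: {occ['skills']}"
    embedding = model.encode("passage: " + text, normalize_embeddings=True)
    collection.add(ids=[occ["uri"]], embeddings=[embedding.tolist()], documents=[text])

# At query time, a job posting is embedded with the "query: " prefix and the
# 30 most similar occupation definitions are retrieved as candidate labels.
posting = "Cercasi commessa per la vendita di attrezzature per telecomunicazioni ..."
query_emb = model.encode("query: " + posting, normalize_embeddings=True)
candidates = collection.query(query_embeddings=[query_emb.tolist()], n_results=30)
</p></div>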
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Recall values for classification with E5-large text embeddings vector similarity.</p><p>Precision @ K: @5 = 0.4238; @10 = 0.9004; @30 = 0.9627; @40 = 0.9817</p><p>To accurately classify a given job posting with respect to the ESCO taxonomy, we include 30 ESCO occupation documents (i.e., 30 nodes of the taxonomy) into the LLM's context as potential job titles. The rationale for choosing 30 documents is that we aim to strike a balance between computational efficiency and the accuracy of the retrieved documents. The precision of the LLM would naturally decrease when it is presented with inaccurate labels. Although, as shown in Table <ref type="table">1</ref>, the precision of the model slightly increases with 40 documents in the context, we accepted this trade-off in favor of a lower VRAM requirement.</p><p>Upon the retrieval of the 30 ESCO occupations that are most closely aligned with a given job posting description, a composite prompt (see Figure <ref type="figure" target="#fig_1">2</ref>) is constructed as input to the LLM. The prompt integrates the actual text data encompassing job titles, descriptions, and skills pertinent to the selected occupations. The design of the simplified composite prompt aims to minimize bias by focusing only on the core elements. The prompt is then processed using a locally stored Llama-3 LLM<ref type="foot" target="#foot_2">5</ref> in an isolated environment<ref type="foot" target="#foot_3">6</ref>.</p><p>As a few-shot predictor, the LLM evaluates the composite prompt to accurately classify job postings by examining the semantic nuances of the selected ESCO occupations, aligning them with the actual job titles within the offers. To quantitatively assess the alignment between a job posting vector 𝐽 and each occupation embedding 𝐸ESCO derived from the ESCO taxonomy, cosine similarity 𝑎(𝐽, 𝐸ESCO) is used:</p><formula xml:id="formula_1">𝑎(𝐽, 𝐸ESCO) = (𝐽 • 𝐸ESCO) / (‖𝐽‖ ‖𝐸ESCO‖)<label>(2)</label></formula><p>The similarity scores yielded through 𝑎(𝐽, 𝐸ESCO) for each 𝐸ESCO facilitate the identification and selection of the ESCO occupation embeddings that are most pertinent to the job posting in question. Armed with this information, the LLM proceeds to classify the job posting by selecting the ESCO occupation that exhibits the highest degree of semantic and contextual relevance.</p><p>For a specific job posting 𝐽, an embedding function 𝐸 is employed, such that 𝐸(𝐽) produces the corresponding embedding for 𝐽. The degree of similarity between the job posting's embedding 𝐸(𝐽) and any ESCO occupation embedding 𝑒𝑖 from 𝐸ESCO (where 𝐸ESCO stands for the ensemble of occupation embeddings derived from the ESCO taxonomy) is determined through the similarity function 𝑆(𝐸(𝐽), 𝑒𝑖) (in our case cosine).</p><p>The similarity scores for each occupation embedding 𝑒𝑖 within 𝐸ESCO relative to 𝐸(𝐽) are computed. The ten class embeddings that exhibit the highest similarity to 𝐸(𝐽), denoted as 𝐸top, are selected. Formally, 𝐸top is defined as the subset {𝑒1, 𝑒2, . . . , 𝑒10} from 𝐸ESCO, where each 𝑒𝑖 is selected based on the top 10 similarity scores 𝑆(𝐸(𝐽), 𝑒𝑖).</p><p>The last stage entails a decision-making process enacted by the Llama-3 LLM, represented by the function 𝐷. This function accepts the composite prompt, including the candidates {𝑒1, 𝑒2, . . . , 𝑒10} accumulated in 𝐸top, and the job posting 𝐽, to render the final selected occupation embedding. 
The chosen occupation embedding 𝑒* is determined by 𝑒* = 𝐷(𝐸top, 𝐽), representing the ESCO occupation best matched by the model.</p><p>The entire algorithm can be presented by the following equation, which encapsulates the embedding generation, similarity assessment, and decision-making process by the LLM, culminating in the selection of the most suitable ESCO occupation embedding 𝑒* for the given job posting description.</p><formula xml:id="formula_2">𝑒* = 𝐷({𝑒1, 𝑒2, . . . , 𝑒𝑘 | 𝑒𝑖 ∈ 𝐸ESCO; top 𝑘 by 𝑆(𝐸(𝐽), 𝑒𝑖)}, 𝐽)<label>(3)</label></formula></div>
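<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of Equations (2) and (3) is given below; embed() and llm_decide() are hypothetical placeholders for the embedding model and the Llama-3 decision step, so the snippet illustrates the procedure rather than the exact implementation.</p><p>
# Sketch of Equations (2)-(3): rank ESCO occupation embeddings by cosine
# similarity to the posting embedding, then let an LLM pick among the top-k
# candidates. embed(), esco_embeddings and llm_decide() are hypothetical.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_candidates(posting_text, esco_embeddings, k=10):
    """esco_embeddings: dict mapping an ESCO job title to its embedding vector."""
    j = embed(posting_text)  # hypothetical embedding function E(J)
    scored = sorted(
        esco_embeddings.items(),
        key=lambda item: cosine_similarity(j, item[1]),
        reverse=True,
    )
    return [title for title, _ in scored[:k]]

def classify_posting(posting_text, esco_embeddings, esco_definitions, k=10):
    candidates = top_k_candidates(posting_text, esco_embeddings, k)
    # Compose the prompt from the candidate titles, definitions and skills,
    # and delegate the final choice e* = D(E_top, J) to the LLM.
    context = "\n".join(f"{t}: {esco_definitions[t]}" for t in candidates)
    return llm_decide(posting_text, context, allowed_labels=candidates)  # hypothetical LLM call
</p></div>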
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>To evaluate the effectiveness of the proposed model in handling multilingual job postings, experiments were conducted separately on Italian and Spanish datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Test dataset</head><p>To have a reliable test dataset, we use three high-performing LLMs as initial annotators of 100 real-world Italian and 100 Spanish job postings with the most extensive descriptions from the InfoJobs<ref type="foot" target="#foot_4">7</ref> database. Non-informative elements such as company descriptions and promotional content were removed using a DSPy module (cf. Appendix E for the prompt), which employs zero-shot Llama-3 inference to anonymize sensitive information in job postings and candidate experiences. The preprocessed postings were annotated by the top three performing LLMs: GPT-4o<ref type="foot" target="#foot_5">8</ref>, Gemini 1.5 Pro<ref type="foot" target="#foot_6">9</ref>, and Claude 3.5 Sonnet<ref type="foot" target="#foot_7">10</ref>, according to LmSys Arena <ref type="bibr" target="#b16">[17]</ref>. In this context, the ESCO job titles are presented to each model separately, each model is asked to select the appropriate job titles, and we then measure their level of agreement on these labels. The agreement between LLM models was assessed using Cohen's kappa coefficient <ref type="bibr" target="#b17">[18]</ref>. The average kappa score between Gemini and GPT-4o was found to be 0.6386, indicating a substantial level of agreement. The agreement between Gemini and Claude was lower, with an average kappa of 0.5798, suggesting a moderate level of agreement. Similarly, the kappa score between GPT-4o and Claude was 0.6497, also indicative of substantial agreement. Overall, the average kappa score across all "annotators" was 0.6227, reflecting a general trend towards substantial inter-annotator agreement among the models.</p><p>To establish ground truth labels, we incorporated a dual-layer labelling process. Although the test set consists of only 200 items, labeling them from scratch would be time-consuming due to the complexity of the ESCO taxonomy, which includes 3,015 distinct occupations. Human annotators would require extensive training to accurately navigate this taxonomy. Therefore, we first annotate the occupations automatically using LLMs and then have the initial annotations cross-examined by a human expert annotator. Since each data point was reviewed by one annotator only, inter-annotator agreement among human annotators was not quantified. Instead, we conducted an analysis to identify job titles that consistently showed agreement or disagreement across the three LLMs, where domain-specific professionals from InfoJobs reviewed label discrepancies. This analysis, detailed in Appendix C, suggests that certain occupations are inherently more challenging to classify, possibly due to overlapping skills or ambiguous descriptions. Furthermore, we repeated the experiments using ground truth labels where any two of the three automatic LLMs agreed on the label. The results showed alignment between the models' predictions and the automatic labeling process, indicating consistency with the patterns recognized by the automatic methods when there is partial agreement. A detailed analysis of this alignment can be found in Appendix D.</p></div>
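<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the pairwise agreement scores could be computed as in the following sketch, which assumes scikit-learn's cohen_kappa_score; the per-posting labels shown are hypothetical placeholders, not our data.</p><p>
# Sketch of the pairwise agreement computation between the three LLM
# "annotators" using Cohen's kappa (scikit-learn). The label lists are
# hypothetical placeholders for the per-posting ESCO titles chosen by each model.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotations = {
    "gpt-4o": ["restaurant manager", "project manager", "deli worker"],
    "gemini-1.5-pro": ["restaurant manager", "project engineer", "deli worker"],
    "claude-3.5-sonnet": ["quick service restaurant team leader", "project manager", "deli worker"],
}

for (name_a, labels_a), (name_b, labels_b) in combinations(annotations.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.4f}")
</p></div>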
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Baselines</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">SkillGPT</head><p>SkillGPT <ref type="bibr" target="#b4">[5]</ref> has been introduced as a tool for skill extraction and classification that relies on vector similarity search against LLM-precomputed ESCO embeddings. The authors employ embeddings generated by an LLM, although they do not directly use the LLM to select among candidate embeddings. Instead, they rely on embedding similarity to assign the most closely related ESCO class to the job descriptions under consideration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Zero-Shot Classification</head><p>By transforming the classification task into a Natural Language Inference (NLI) problem, any model pretrained on NLI tasks can be utilized as a text classifier without the need for fine-tuning, effectively achieving zero-shot text classification. This is particularly beneficial when we deal with classes unseen during training, making it a robust solution for a variety of text classification scenarios <ref type="bibr" target="#b18">[19]</ref>.</p><p>In the implementation that we use as a baseline, we utilize the BART-MNLI model <ref type="bibr" target="#b19">[20]</ref>, which showed high performance in summarization tasks and was further trained on the MNLI dataset <ref type="bibr" target="#b20">[21]</ref> for NLI; we leverage its capability to understand entailment relations for the classification of a given sequence into one of the specified categories. We also apply the same methodology with the Llama-3 model.</p></div>
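<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following sketch illustrates this zero-shot setup with the Hugging Face zero-shot-classification pipeline. The model facebook/bart-large-mnli is used for illustration (the tables report a multilingual mBART-MNLI variant), and the posting and candidate titles are hypothetical.</p><p>
# Sketch of the zero-shot NLI baseline: each candidate ESCO job title is
# treated as an entailment hypothesis for the job posting.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

posting = "Presentazione e vendita di attrezzature per telecomunicazioni ai clienti ..."
candidate_titles = [
    "telecommunications equipment specialised seller",
    "shop assistant",
    "sales account manager",
]

result = classifier(posting, candidate_labels=candidate_titles, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
</p></div>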
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Model Optimization</head><p>To optimize LLMs with a minimal set of manually crafted examples, we use the DSPy library <ref type="bibr" target="#b21">[22]</ref>. We initialize the classifier module with a Llama-3 model and use a GPT-4o model as the teacher. Our optimization of the classification is aimed at achieving high F1 scores for each dataset individually. In each run, we use 10 labeled training examples and 30 labeled validation examples. We employ DSPy's BootstrapFewShot, configuring it to perform a maximum of 2 rounds with up to 8 bootstrapped demonstrations. We define a custom metric, the F1 score, to guide the bootstrapping process. For the optimization of the LLMs, we use data points that had high inter-agreement among the automatic methods and were reviewed by human annotators. We perform a validation/test split to ensure that the optimization did not bias the evaluation results.</p></div>
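<div xmlns="http://www.tei-c.org/ns/1.0"><p>A hedged sketch of this optimization step is shown below. The signature, field names, metric and training examples are hypothetical illustrations, and the snippet assumes a DSPy version that exposes the OllamaLocal and OpenAI clients and the BootstrapFewShot teleprompter; it approximates, rather than reproduces, our configuration.</p><p>
# Sketch: compile a Llama-3 student with BootstrapFewShot, using a GPT-4o
# teacher and an F1-based metric, with at most 2 rounds and 8 bootstrapped
# demonstrations. Field names and train_examples are hypothetical.
import dspy
from dspy.teleprompt import BootstrapFewShot

class ClassifyPosting(dspy.Signature):
    """Select the most suitable ESCO job titles for the job posting."""
    job_posting = dspy.InputField()
    candidate_titles = dspy.InputField(desc="retrieved ESCO titles with definitions")
    job_titles = dspy.OutputField(desc="chosen titles, restricted to the candidates")

student_lm = dspy.OllamaLocal(model="llama3")  # local Llama-3 served by Ollama
teacher_lm = dspy.OpenAI(model="gpt-4o")       # teacher model
dspy.settings.configure(lm=student_lm)

classifier = dspy.ChainOfThought(ClassifyPosting)

def f1_metric(example, prediction, trace=None):
    gold = set(example.job_titles)
    pred = set(t.strip() for t in prediction.job_titles.split(","))
    if not pred or not gold:
        return 0.0
    overlap = len(gold.intersection(pred))
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

optimizer = BootstrapFewShot(
    metric=f1_metric,
    teacher_settings=dict(lm=teacher_lm),
    max_bootstrapped_demos=8,
    max_rounds=2,
)
# train_examples: a small list of labeled dspy.Example objects (hypothetical)
optimized_classifier = optimizer.compile(classifier, trainset=train_examples)
</p></div>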
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Outcome of the experiments</head><p>For the evaluation of the results of the experiments, we used the micro recall and micro precision metrics, which are suitable for our multi-class classification task. We report evaluation scores separately on the Spanish and Italian test sets. Tables 2 and 3 display the results on the Italian and Spanish datasets, respectively. The results indicate that prompting techniques outperform SkillGPT in both languages. Specifically, the optimized Llama-3-8b model with chain-of-thought (CoT) achieves the highest precision and recall at @5 for Italian, with values of 0.32 and 0.76, respectively, and for Spanish, with values of 0.28 and 0.72. This supports our assumption that optimization enhances performance. The multilingual E5-large model achieves the highest precision at @10 for Italian (0.19) and the highest recall at @10 for Spanish (0.92), underscoring the efficacy of embeddings in classification. This implies that semantically less similar labels can confuse models, whereas embeddings ensure higher recall accuracy, particularly in wider retrieval scenarios. Although both models exhibit similar precision, indicating comparable accuracy in their predictions, the optimized model's capacity to capture a broader range of relevant job titles ensures greater alignment with expert human preferences. This enhances the model's ability to make relevant job title suggestions, thereby improving the overall matching process.</p></div>
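<div xmlns="http://www.tei-c.org/ns/1.0"><p>For clarity, the sketch below shows one way the reported micro precision@k and recall@k could be computed for multi-label predictions; the data structures and values are hypothetical illustrations.</p><p>
# Sketch of micro-averaged precision@k and recall@k for multi-label job title
# predictions. predictions_at_k holds the ranked titles per posting and
# gold_labels the reference ESCO titles; both are hypothetical.
def micro_precision_recall_at_k(predictions_at_k, gold_labels, k):
    true_positives = 0
    predicted_total = 0
    gold_total = 0
    for posting_id, gold in gold_labels.items():
        predicted = set(predictions_at_k[posting_id][:k])
        gold = set(gold)
        true_positives += len(predicted.intersection(gold))
        predicted_total += len(predicted)
        gold_total += len(gold)
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / gold_total if gold_total else 0.0
    return precision, recall

# Hypothetical example:
preds = {"post_1": ["project manager", "project engineer", "ict project manager"]}
gold = {"post_1": ["project manager"]}
print(micro_precision_recall_at_k(preds, gold, k=3))  # (0.333..., 1.0)
</p></div>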
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Discussion</head><p>In Tables <ref type="table" target="#tab_0">2 and</ref> 3, we observe that the combined use of general text embeddings and language models significantly outperforms current classification techniques, which rely on language models specifically tailored to the field of the labour market, such as <ref type="bibr" target="#b11">[12]</ref>. We see that using vector similarity with the text embeddings created by the E5-large text embeddings model alone does not surpass the baseline. However, it is worth noting that the results are quite close, despite the fact that this model was not specifically fine-tuned on labour market data or adapted to the ESCO taxonomy, as is the case for <ref type="bibr" target="#b11">[12]</ref>. Furthermore, we can observe how text embeddings indeed provide a significant value for filtering the k occupations closest to a job posting within the taxonomy. Using these k occupations as input to various language models for few-shot classification significantly improves over the baselines. Table <ref type="table" target="#tab_3">6</ref> in the Appendix illustrates the decisions of the LLMs in the case of four sample job postings.</p><p>We also evaluated the effectiveness of a large language model for classification of job titles based on provided descriptions, as shown in Table <ref type="table" target="#tab_2">4</ref>, even when the correct titles were not explicitly listed among the initial ESCO job titles. The model's ability to select accurate titles reflects its functionality in processing and understanding the contextual and semantic aspects of the job descriptions. For instance, when presented with a job description focused on the management of comprehensive water and wastewater services, the model identified "Operations Manager" as the correct title. This identification was made despite the presence of several closely related but distinct labels (such as "Water treatment plant manager") within the pool of ESCO job titles. This indicates that the model's decisions are more influenced by a comprehensive understanding of the job responsibilities and sectors than by the mere presence of keywords or phrases in the ESCO job titles.</p><p>The model's capacity to differentiate between job titles with more specific definitions enhances its comprehension of job postings and assigned labels, thereby improving the precision of suggesting relevant skills. Upon integration into an operational job platform, this model will better understand the requirements of job postings and accurately assign job titles that align with the specific needs of companies. Similarly, in the context of parsing job candidate experiences, keywords tend to appear more frequently in semantically related ESCO definitions, enabling parsers to incorporate these keywords to enhance parsing performance.</p><p>Overall, we can thus state that the integration of class embeddings generated using the multilingual E5-large model, with subsequent application of few-shot classification techniques through LLMs, significantly improves the accuracy of job title classification, clearly surpassing the baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Computational Cost of Compared Methods</head><p>In addition to evaluating performance metrics, we analyzed the computational cost and environmental impact of each method. The Llama-3-8b model, with 8 billion parameters, requires significant resources for inference, necessitating a GPU with at least 16 GB of VRAM (e.g., NVIDIA RTX 3090). Its average inference time per job posting is approximately 1.5 seconds, and its high energy consumption leads to increased CO2 emissions, making large-scale deployment less environmentally sustainable without optimizations.</p><p>In contrast, the mBART-large-mnli model has about 610 million parameters and operates on GPUs with 8 GB of VRAM, offering faster inference times under 0.5 seconds per job posting. The embeddings-based method using the multilingual E5-large model, with 330 million parameters, allows for precomputed embeddings and efficient CPU-based vector similarity searches, reducing inference time to less than 0.2 seconds per job posting. These smaller models consume less energy, providing more resource-efficient and eco-friendly alternatives suitable for production environments where computational cost and environmental impact are critical considerations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and future work</head><p>In this paper, we argued that the use of multilingual embeddings in combination with LLMs significantly enhances our ability to distinguish between very similar (or even identical) job titles that suggest different skills and competencies. Our experiments have shown that this is indeed the case, demonstrating that the combination of multilingual text embedding similarity with Llama-3 markedly exceeds the performance of other leading approaches in the field.</p><p>In the future, we plan to apply the same approach to the analysis and classification of job candidate experiences. Once it is ensured that both job postings and candidate experiences can accurately be modeled using the embedded representation of the ESCO taxonomy, we plan to set the stage for a more direct and efficient alignment process between job postings and experiences of job seekers.</p><p>Another interesting direction for future research is to analyze the lexical overlap between English domain-specific terms that appear in Italian and Spanish job postings and the English occupation descriptions in the ESCO taxonomy. Such an analysis would reveal whether job types with higher lexical overlap affect model accuracy, providing deeper insights into the multilingual nature of the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Ablation Study</head><p>In our ablation study, we pursued two primary objectives. Firstly, we wanted to evaluate the model's comprehension of ESCO job titles and its decision-making process; to achieve this, we prompted the model to articulate its underlying rationale. Secondly, so far we have reported the performance of our model when Italian and Spanish data were matched against English job titles and occupations in the ESCO taxonomy; here, we wanted to explore whether its comprehension extends to data in different languages. We selected Spanish for this purpose and discovered that the model's understanding was consistent, irrespective of the language; see Table <ref type="table" target="#tab_2">4</ref>.</p><p>As illustrated in Figure <ref type="figure" target="#fig_2">3</ref>, the LLM showcases a comprehensive understanding of the task at hand, effectively narrowing down potential ESCO job titles to identify the most suitable label. Additionally, the LLM is observed to generate a novel job title, referred to as "fast food shift team leader". This can be attributed to the absence of constraints imposed on the LLM regarding structured output for classification, thereby granting it the autonomy to propose the most fitting job title. The analysis initially excludes broader or less related job titles such as "business manager", "hospitality revenue manager", and "accommodation manager", which are not specific to quick-service restaurant operations. Subsequently, the model considers and ultimately selects titles that emphasize leadership within this specific restaurant context, narrowing down to "quick service restaurant team leader" and "fast food shift team leader" as the most apt job titles. The model's reasoning in choosing these titles is correct, as they precisely reflect the managerial and leadership responsibilities pertinent to the restaurant environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Job postings and Predicted ESCO job titles</head><p>The following tables provide examples of job titles, job posting descriptions, and the corresponding gold labels in Table <ref type="table">5</ref> and optimized LLama-3 job titles in Table <ref type="table" target="#tab_3">6</ref>. These examples illustrate how the job titles assigned by recruiters may not always capture the specific nature of the job described in the postings. The gold labels and the optimized LLama-3 job titles offer a more accurate representation of the job roles based on the detailed job descriptions. The job title "Commessa" (Salesperson) is generic and does not specify the specialization required for the job. The gold label "telecommunications equipment specialised seller" fits better because the job description clearly focuses on selling telecommunications equipment, which requires specific knowledge and skills related to this type of product. The gold label accurately reflects the specialized nature of the role. The job title "Project engineer" given by the recruiter suggests a technical and</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>engineering-focused role. However, the job description emphasizes project management, coordination of project activities, support to product development, and supervision of the technical team. The gold label "project manager" fits better as it captures the overall management and coordination responsibilities described, which are more aligned with the duties of a project manager than just a project engineer. The job title "Addetto alle vendite" (Sales Assistant) is too generic and does not capture the specialized nature of the role described in the vacancy. The description specifies duties typical of a deli worker, such as serving customers, slicing cheeses and cured meats, preparing packages, and managing the deli counter. Our model's titles "meat and meat products specialised seller" and "deli worker" are more precise, indicating a specialized role in food handling and customer service, which goes beyond the general sales assistant title. This demonstrates our model's ability to interpret the specific context and responsibilities of the job accurately.</p><p>The job title "IT Specialist" is generic and could encompass various IT roles. However, the job description clearly indicates responsibilities such as managing ICT projects and providing technical support. The optimized titles "ICT project manager" and "software development manager" are more accurate as they reflect the leadership, coordination, and project management aspects of the role, which go beyond the scope of a general IT specialist.</p><p>The job title "Sales Manager" suggests a mid-level management role. However, the job description highlights responsibilities such as business development, defining sales strategies, managing the sales team, monitoring performance, and managing relationships with key clients and strategic partners. These responsibilities are more aligned with a higher-level role such as "business development manager" or "sales director", which involve strategic planning and high-level management.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Spanish Job Posting Example (Table 4)</head><p>Gold label job title: Quick Service Restaurant Team Leader. Posting job title: Encargado de Franquicias.</p><p>Posting description: -Responsable de garantizar la satisfacción de los huéspedes y de gestionar y superar los objetivos financieros y operativos de los restaurantes a mi cargo. -Garantizar una excelente atención a los huéspedes en base a las promesas y estándares definidos. -Liderar, motivar y desarrollar equipos. -Facilitar los recursos y el apoyo necesario a los equipos en sus restaurantes. -Utilizar de manera eficaz los diferentes recursos de la Compañía. -Identificar oportunidades y amenazas de negocio en el mercado. -Aportar ideas y ejecutando proyectos en el corto y medio plazo. -Difundir las mejores practicas y resolver problemas comunes en los restaurantes. -Cumplir los protocolos y políticas de la Marca y la Compañía. -Garantizar y difundir los valores y principios definidos por la Compañía.</p><p>Skills: SAP Girnet Gtock, Cuiner.</p><p>ESCO job titles (retrieved candidates): Restaurant Manager, Business Manager, Hospitality Revenue Manager, Accommodation Manager, Delicatessen Shop Manager, Rooms Division Manager, Customer Experience Manager, Quick Service Restaurant Team Leader, Destination Manager, Membership Manager.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Ambiguity from Specialized and Contextual Factors</head><p>To further understand the complexity of job classification in a multilingual context, we conducted an ablation study focusing on cases where both human annotators and LLMs demonstrated shared uncertainty in assigning definitive labels. These cases were particularly challenging due to specialized terminology, regional language variations, or overlapping responsibilities within job postings. Table <ref type="table" target="#tab_4">7</ref> highlights key examples where annotators, despite their recruitment expertise, aligned with the LLMs in experiencing ambiguity.</p><p>As presented in Table <ref type="table" target="#tab_4">7</ref>, each example illustrates specific challenges encountered in classifying job postings across multilingual and sector-specific contexts. The Junior Project Manager job posting, for instance, combines general project management with specialized tasks such as machine vision, but without enough specific context, it is unclear whether the focus should be on technical expertise or managerial skills. The Project Engineer example shows the impact of technical terminology and sector-specific language on classification. Terms such as "SCADA" and "Modbus TCP" are common in international engineering contexts but may not align with the typical understanding of recruiters, leading to the selection of varied labels by both LLMs and annotators. The example of the Assistente Amministrativo with a legal and fiscal focus involves highly specialized processes such as "Registro Nazionale delle Varietà Vegetali" and complex fiscal duties like "Dichiarazioni IRAP". These terms relate to specific Italian governmental and regulatory compliance requirements, which could exceed the annotators' typical recruitment experience, thus resulting in generalized labels that do not fully capture the compliance and accounting complexity.</p><p>These cases emphasize that job postings, as human-created documents, often do not provide enough context for a definitive classification, resulting in ambiguity across specialized and regional terms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Analysis of Model Alignment with Partial Agreement Ground Truth Labels</head><p>In our evaluation, we established two levels of ground truth labels: gold and silver. Gold labels represent unanimous agreement among all three annotators (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), validated by human experts. Silver labels indicate a strong majority consensus, assigned when any two annotators agree, even if the third disagrees.</p><p>We assessed our model's performance on both silver and gold labels to understand its effectiveness under different levels of agreement. Results for gold labels were reported in Tables <ref type="table" target="#tab_1">2 and 3</ref>; results for silver labels are presented in Table <ref type="table" target="#tab_5">8</ref>. For the Spanish dataset, the model's performance was relatively consistent between silver and gold labels, with only minor variations in precision and recall. This consistency suggests that the model robustly captures underlying patterns in the job postings, regardless of labeling strictness.</p><p>In contrast, the Italian dataset exhibited more significant differences between performances on silver and gold labels. For example, in some cases, the precision was higher for silver labels while recall was higher for gold labels. This disparity may indicate that the model better captures broader classifications aligning with majority consensus in Italian but struggles with the stricter criteria required for unanimous agreement.</p><p>An interesting observation is that optimization using gold label ground truth data had a negative effect on the models' scores derived from silver labels. This could be explained by the fact that during optimization, the language models became more attuned to the patterns present in the gold labels, potentially diverging from those in the silver labels. As a result, the models may have become less effective at predicting labels where only partial agreement (silver labels) was present among the automatic methods.</p></div>
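<div xmlns="http://www.tei-c.org/ns/1.0"><p>The silver-label rule can be illustrated with the following sketch, in which a label is accepted when at least two of the three LLM annotators propose it; the annotation sets shown are hypothetical.</p><p>
# Sketch of the silver/gold label derivation described above: silver labels
# require agreement of at least two LLM annotators, gold labels require all
# three. The annotation sets are hypothetical illustrations.
from collections import Counter

def consensus_labels(annotations_per_model, min_votes=2):
    """annotations_per_model: list of label sets, one per LLM annotator."""
    votes = Counter(label for labels in annotations_per_model for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}

llm_annotations = [
    {"operations manager"},                       # e.g., GPT-4o
    {"operations manager", "utilities manager"},  # e.g., Gemini 1.5 Pro
    {"water treatment plant manager"},            # e.g., Claude 3.5 Sonnet
]
silver = consensus_labels(llm_annotations, min_votes=2)  # {"operations manager"}
gold = consensus_labels(llm_annotations, min_votes=3)    # empty set here: no unanimous label
</p></div>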
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. DSPy Signature</head><p>We utilize DSPy signatures to prompt large language models (LLMs) for performing downstream tasks. The signature was optimized through recursive LLM calls, and its final form is based on empirical observations.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Model Architecture</figDesc><graphic coords="3,89.29,84.19,208.35,101.17" type="bitmap" /></figure>
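<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration, the following sketch mirrors the pre-processing step described above with a DSPy signature; the field names are hypothetical and it is not the exact signature shown in Figure 4.</p><p>
# Hedged sketch of a pre-processing signature in the spirit of Appendix E:
# a zero-shot predictor that strips promotional content and anonymizes
# sensitive information. Field names are hypothetical.
import dspy

class PreprocessPosting(dspy.Signature):
    """Remove company descriptions, promotional content and personal data
    from the job posting, keeping only the job-relevant text."""
    raw_posting = dspy.InputField(desc="original job posting text")
    cleaned_posting = dspy.OutputField(desc="anonymized, job-relevant text only")

# Assumed client name for a local Llama-3 served by Ollama.
dspy.settings.configure(lm=dspy.OllamaLocal(model="llama3"))
preprocess = dspy.Predict(PreprocessPosting)
cleaned = preprocess(raw_posting="Somos una empresa lider ... Buscamos tecnico de redes ...").cleaned_posting
</p></div>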
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Prompt Template</figDesc><graphic coords="3,302.62,84.19,203.36,189.37" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: LLM's Rationale</figDesc><graphic coords="8,89.29,84.18,416.70,234.39" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Pre-processing Signature</figDesc><graphic coords="11,328.04,488.12,152.52,162.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Italian Performance Metrics for Top 5 and Top 10 Predictions</figDesc><table><row><cell>Model</cell><cell cols="2">Precision</cell><cell cols="2">Recall</cell></row><row><cell></cell><cell>@5</cell><cell>@10</cell><cell>@5</cell><cell>@10</cell></row><row><cell>llama-3-8b (CoT opt.)</cell><cell>0.32</cell><cell>0.13</cell><cell>0.76</cell><cell>0.80</cell></row><row><cell>llama-3-8b (CoT)</cell><cell>0.26</cell><cell>0.12</cell><cell>0.62</cell><cell>0.64</cell></row><row><cell>llama-3-8b (SkillGPT)</cell><cell>0.19</cell><cell>0.19</cell><cell>0.36</cell><cell>0.82</cell></row><row><cell>mBart-large-mnli (0-shot)</cell><cell>0.13</cell><cell>0.12</cell><cell>0.29</cell><cell>0.58</cell></row><row><cell>multilingual-e5-large</cell><cell>0.16</cell><cell>0.19</cell><cell>0.36</cell><cell>0.88</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Spanish Performance Metrics for Top 5 and Top 10 Predictions</figDesc><table><row><cell>Model</cell><cell cols="2">Precision</cell><cell cols="2">Recall</cell></row><row><cell></cell><cell>@5</cell><cell>@10</cell><cell>@5</cell><cell>@10</cell></row><row><cell>llama-3-8b (CoT opt.)</cell><cell>0.28</cell><cell>0.20</cell><cell>0.72</cell><cell>0.90</cell></row><row><cell>llama-3-8b (CoT)</cell><cell>0.26</cell><cell>0.16</cell><cell>0.64</cell><cell>0.68</cell></row><row><cell>llama-3-8b (SkillGPT)</cell><cell>0.09</cell><cell>0.12</cell><cell>0.36</cell><cell>0.62</cell></row><row><cell>mBart-large-mnli (0-shot)</cell><cell>0.15</cell><cell>0.14</cell><cell>0.39</cell><cell>0.70</cell></row><row><cell>multilingual-e5-large</cell><cell>0.20</cell><cell>0.19</cell><cell>0.48</cell><cell>0.92</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 5</head><label>5</label><figDesc>Examples of Job Titles, Descriptions, and Gold Labels</figDesc><table><row><cell>Posting Job Title</cell><cell>Job Posting Description</cell><cell>Gold Labels</cell></row><row><cell>Commessa</cell><cell>Commessa; Commessa; -Presentazione e vendita di attrezzature per telecomunicazioni ai clienti; -Servizio e supporto clienti; -Gestione delle transazioni di vendita; -Gestione dello stock e dell'inventario.</cell><cell>Telecommunications equipment specialised seller</cell></row><row><cell>Project Engineer</cell><cell>Project Engineer; Project Engineer; PROJECT MANAGER / PROJECT ENGINEER Divisione: Amministrazione Tecnica -Coordinamento delle attività di gestione progetti in ambito tecnico; -Supporto al Product Development; -Pianificazione e monitoraggio delle attività progettuali; -Supervisione del team tecnico; -Assistenza alla gestione dei fornitori e del budget di progetto.</cell><cell>Project manager, Product development manager</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 6</head><label>6</label><figDesc>Examples of Job Titles, Descriptions, and Optimized Job Titles</figDesc><table><row><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 7</head><label>7</label><figDesc>Examples of Job Postings with Ambiguous Classification due to Multilingual and Contextual Challenges</figDesc><table><row><cell>Job Title</cell><cell>Description Excerpt</cell><cell>Labels Suggested</cell></row><row><cell>Junior Project Manager</cell><cell>Applicare i metodi e gli strumenti propri del Project Management a commesse specifiche per il settore dell'automazione industriale, di cui l'azienda fornisce sistemi di visione artificiale.</cell><cell>Project Manager, ICT Project Manager, Programme Manager</cell></row><row><cell>Assistente Amministrativo (Healthcare)</cell><cell>Gestione dei flussi delle segnalazioni dei cittadini per prenotazioni vaccinazioni e assistenza pandemica, inclusa la verifica del "certificato verde" per la conformità alle normative sanitarie.</cell><cell>Healthcare Assistant, Administrative Assistant, Contact Tracing Agent</cell></row><row><cell>Commesso di Negozio (Retail)</cell><cell>Creazione di vetrine accattivanti con abbinamenti di tendenza e assistenza alla clientela nella scelta dei prodotti.</cell><cell>Shop Assistant, Sales Assistant, Visual Merchandiser</cell></row><row><cell>Team Leader (Energy Sector)</cell><cell>Predisposizione documenti formativi e aggiornamento processi operativi presso sede Enel, inclusa l'implementazione e il collaudo di software per la gestione energetica.</cell><cell>Team Leader, Energy Analyst, Business Process Analyst</cell></row><row><cell>Assistente Amministrativo (Legal and Fiscal)</cell><cell>Compiti legati al Registro Nazionale delle Varietà Vegetali e mansioni fiscali complesse come Dichiarazioni IRAP.</cell><cell>Accounting Assistant, Administrative Assistant, Compliance Officer</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 8</head><label>8</label><figDesc>Performance Metrics for Top 5 and Top 10 Predictions (silver labels)</figDesc><table><row><cell>Model</cell><cell cols="2">Precision</cell><cell cols="2">Recall</cell></row><row><cell></cell><cell>@5</cell><cell>@10</cell><cell>@5</cell><cell>@10</cell></row><row><cell cols="5">Spanish (SPA)</cell></row><row><cell>llama-3-8b (CoT opt.)</cell><cell>0.12</cell><cell>0.06</cell><cell>0.58</cell><cell>0.62</cell></row><row><cell>llama-3-8b (CoT)</cell><cell>0.22</cell><cell>0.16</cell><cell>0.64</cell><cell>0.68</cell></row><row><cell>llama-3-8b (SkillGPT)</cell><cell>0.19</cell><cell>0.12</cell><cell>0.36</cell><cell>0.62</cell></row><row><cell>mBart-large-mnli (0-shot)</cell><cell>0.15</cell><cell>0.14</cell><cell>0.39</cell><cell>0.70</cell></row><row><cell>multilingual-e5-large</cell><cell>0.20</cell><cell>0.19</cell><cell>0.48</cell><cell>0.92</cell></row><row><cell cols="5">Italian (ITA)</cell></row><row><cell>llama-3-8b (CoT opt.)</cell><cell>0.12</cell><cell>0.06</cell><cell>0.56</cell><cell>0.60</cell></row><row><cell>llama-3-8b (CoT)</cell><cell>0.23</cell><cell>0.07</cell><cell>0.55</cell><cell>0.59</cell></row><row><cell>llama-3-8b (SkillGPT)</cell><cell>0.22</cell><cell>0.06</cell><cell>0.53</cell><cell>0.59</cell></row><row><cell>mBart-large-mnli (0-shot)</cell><cell>0.27</cell><cell>0.06</cell><cell>0.31</cell><cell>0.58</cell></row><row><cell>multilingual-e5-large</cell><cell>0.35</cell><cell>0.08</cell><cell>0.39</cell><cell>0.79</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://github.com/stanfordnlp/dspy</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://www.trychroma.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">https://llama.meta.com/llama3/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">We use dockerized models from the open-source Ollama library https://ollama.com/ for all experiments</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">https://www.infojobs.net/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">https://openai.com/index/hello-gpt-4o/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">https://deepmind.google/technologies/gemini/pro/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">https://www.anthropic.com/news/claude-3-5-sonnet</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Learning job titles similarity from noisy skill labels</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zbib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">L</forename><surname>Alvarez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Retyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Poves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Aizpuru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fabregat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Šimkus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">G</forename><surname>Casademont</surname></persName>
		</author>
		<idno>ArXiv abs/2207.00494</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:250243975" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">J.-J</forename><surname>Decorte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">V</forename><surname>Hautte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Demeester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Develder</surname></persName>
		</author>
		<idno>ArXiv abs/2109.09605</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:237572142" />
		<title level="m">JobBERT: Understanding job titles through skills</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Javed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mcnair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Jacob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1606.00917</idno>
		<title level="m">Towards a job title classification system</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.14165</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">D</forename><surname>Bie</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.11060</idno>
		<title level="m">Skillgpt: a restful api service for skill extraction and standardization using a large language model</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Salience and market-aware skill extraction for job targeting</title>
		<author>
			<persName><forename type="first">B</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.13094</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Deep job understanding at linkedin</title>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<idno type="DOI">10.1145/3397271.3401403</idno>
		<ptr target="http://dx.doi.org/10.1145/3397271.3401403" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">N</forename><surname>Jensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">D</forename><surname>Sonniks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<idno>ArXiv abs/2204.12811</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:248405777" />
		<title level="m">Skillspan: Hard and soft skill extraction from english job postings</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Deepcarotene -job title classification with multi-stream convolutional neural network</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Abdelfatah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Korayem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Balaji</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigData47090.2019.9005673</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1953" to="1961" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Yamashita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ekhtiari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.10739</idno>
		<title level="m">James: Job title mapping with multi-aspect embeddings and reasoning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Escoxlmr: Multilingual taxonomy-driven pre-training for the job market domain</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van Der Goot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:258832782" />
	</analytic>
	<monogr>
		<title level="m">Annual Meeting of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Job offer and applicant cv classification using rich information from a labour market taxonomy</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kavas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Serra-Vidal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wanner</surname></persName>
		</author>
		<idno type="DOI">10.2139/ssrn.4519766</idno>
	</analytic>
	<monogr>
		<title level="j">SSRN Electronic Journal</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">An explanation of in-context learning as implicit bayesian inference</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raghunathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ma</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.02080</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.10997</idno>
		<title level="m">Retrieval-augmented generation for large language models: A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.03281</idno>
		<idno type="arXiv">arXiv:2308.03281</idno>
		<title level="m">Towards General Text Embeddings with Multi-stage Contrastive Learning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Oosterlinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Remy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Demeester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Develder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<idno>ArXiv abs/2401.12178</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:267068618" />
		<title level="m">In-context learning for extreme multi-label classification</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Angelopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<idno>ArXiv abs/2403.04132</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:268264163" />
		<title level="m">Chatbot arena: An open platform for evaluating llms by human preference</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A coefficient of agreement for nominal scales</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cohen</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:15926286" />
	</analytic>
	<monogr>
		<title level="j">Educational and Psychological Measurement</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="37" to="46" />
			<date type="published" when="1960">1960</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Minaee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kalchbrenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cambria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nikzad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chenaghlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<idno>CoRR abs/2004.03705</idno>
		<ptr target="https://arxiv.org/abs/2004.03705.arXiv:2004.03705" />
		<title level="m">Deep learning based text classification: A comprehensive review</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<idno>ArXiv abs/2202.01924</idno>
		<title level="m">Zero-shot aspectbased sentiment analysis</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A broadcoverage challenge corpus for sentence understanding through inference</title>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
		<ptr target="http://aclweb.org/anthology/N18-1101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1112" to="1122" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singhvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Maheshwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Santhanam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vardhamanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Haq</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Moazam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<idno>ArXiv abs/2310.03714</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:263671701" />
		<title level="m">Dspy: Compiling declarative language model calls into self-improving pipelines</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
