Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models

Hamit Kavas [0, 2, 3]
Marc Serra-Vidal [0, 3]
Leo Wanner [1, 2, 3]

[0] Adevinta Spain, C/ de la Ciutat de Granada, 150, Barcelona, 08018, Spain
[1] Catalan Institute for Research and Advanced Studies (ICREA), Passeig Lluís Companys, 23, Barcelona, 08010, Spain
[2] NLP Group, Pompeu Fabra University, C/ Roc Boronat, 138, 08018, Spain

In the modern labour market, taxonomies such as the European Skills, Competences, Qualifications and Occupations (ESCO) classification are used as an interlingua to match job postings with job seeker profiles. Both are classified with respect to ESCO occupations, and they match if they align with the same occupation and the same skills assigned to that occupation. However, matching models usually struggle with the classification because of overlapping skills and similar definitions of occupations in the ESCO taxonomy. This often leads to imprecise classification outcomes. In this paper, we focus on the challenge of classifying job postings written in Italian or Spanish against ESCO occupations written in English. We experiment with multilingual embeddings, zero-shot classification, and the use of a large language model (LLM), and show that the use of an LLM yields the best results. Furthermore, we explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.

Keywords: ESCO, labour market taxonomy, job posting classification, class embeddings, text embeddings, LLM

1. Introduction

The modern labour market is becoming increasingly diverse. High-tech jobs demand novel skills and competences, which in turn keep undergoing adaptations and modifications. Under these circumstances, accurately classifying job postings and CVs of job seekers (henceforth candidate experiences) that contain detailed technological specifications with remarkably similar yet distinct skills and experiences has evolved into a complex challenge.

The overwhelming majority of job portals and employment agencies use either the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy (https://esco.ec.europa.eu/en/classification) or its US equivalent, the O*Net taxonomy (https://www.onetonline.org/), to classify job postings and candidate experiences in terms of job-title-labeled ESCO/O*Net occupations. Most proposals for the automatic alignment of job postings with candidate experiences (or vice versa) also use ESCO or O*Net [1, 2, 3]. However, despite their wide use, both taxonomies exhibit fundamental limitations for the task of automatic classification of job postings and candidate experiences: owing to their tree structure, they often fail to adequately distinguish between occupations that exhibit substantial skill overlaps. For instance, two job postings labeled as ‘data analyst’ may appear similar but require different skills if one focuses on market research while the other concentrates on healthcare trends analysis. This issue is particularly pronounced when classification relies on a single label, such as the job title of an ESCO occupation, where skill overlaps undermine precise classification. Hence, employing multiple job titles and framing the problem as a multi-label classification task is imperative.

This paper addresses the challenge of multilingual multi-label classification using Large Language Models (LLMs) for the alignment of Italian and Spanish job postings with English job titles encountered in the ESCO taxonomy. Multilingual class embeddings are explored to improve classification accuracy, aiming to provide the necessary contextual awareness and to address the core limitations of taxonomies such as ESCO.

Furthermore, we explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.

To provide LLMs with domain-specific information and to mitigate hallucinations in the course of the classification of the job postings, we employ Retrieval Augmented Generation (RAG) [4], which combines information retrieval with a generative model. RAG serves two critical functions in our methodology. Firstly, it provides detailed definitions, including essential skills and synonyms for each ESCO occupation, selected through vector similarity as outlined in [5]. Secondly, it ensures that the assigned job titles are restricted to titles within our predefined label space, i.e., standardized job titles defined in the ESCO taxonomy.

The contributions of our work are:
∙ We explore the impact of using multilingual class embeddings derived from the ESCO taxonomy for the task of job posting classification;
∙ We integrate RAG to provide LLMs with domain-specific information and eliminate the dependency on fine-tuning;
∙ We show how the LLM response can be restricted to standardized job titles and thus how LLMs can be used for high quality job title classification that outperforms state-of-the-art proposals for this task.

The remainder of the paper is structured as follows. In Section 2, we present a concise overview of the related work. In Section 3, the model on which our work is based is outlined. Section 4 describes the experiments we carried out, the results we obtained in these experiments, and their discussion. Finally, Section 5 draws some conclusions from the presented work and outlines some directions for future research. In Appendix A, we present an ablation study in which we assess the comprehension of English ESCO job titles and their Spanish equivalents by our model. Appendix B provides, for illustration, examples of Italian job postings and predicted ESCO job titles. In Appendix E, we present the signature used to prompt Large Language Models for pre-processing.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
hamit.kavas@upf.edu (H. Kavas); marc.serrav@adevinta.com (M. Serra-Vidal); leo.wanner@upf.edu (L. Wanner)
ORCID: 0009-0009-7027-7367 (H. Kavas); 0009-0000-0120-381X (M. Serra-Vidal); 0000-0002-9446-3748 (L. Wanner)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

A number of works have been carried out in the domain of job title classification, focusing on various facets of the problem. Shi et al. [6] introduce Job2Skills, a model developed for LinkedIn. The model significantly improves job recommendation performance metrics, but its effectiveness beyond LinkedIn remains an open question. Li et al. [7] propose a two-step job title normalization, also at LinkedIn, which is based on tokenization and matching of the original job title provided by the user with a lookup table. The use of a lookup table instead of a standard occupation taxonomy such as ESCO or O*Net significantly limits the generalization potential of this strategy. Zhang et al. [8] extract soft and hard skills from job posting descriptions, showing that domain-specific pre-training significantly enhances performance in skills and knowledge extraction. Javed et al. [3] introduce a semi-supervised machine learning approach that utilizes hierarchical classifiers and the O*NET Standard Occupational Classification (SOC) taxonomy for the classification of online recruitment data. Similarly, Wang et al. [9] propose a model based on multi-stream convolutional neural networks, aiming to classify noisy user-generated job titles by considering different elements such as characters and words within job titles. Yamashita et al. [10] and Zbib et al. [1] conduct studies on the classification of job titles, focusing on job title alignment and job similarity training, respectively. JobBERT (Decorte et al. [2]) classifies job titles against the ESCO taxonomy, treating the task as a semantic text similarity (STS) exercise. In particular, JobBERT emphasizes the understanding of the semantics of job titles through the skills inferred from the associated vacancies and descriptions, thus alleviating the need for an extensive labeled dataset or a continuously updated list of standardized titles. Before the recent proposals [11] and [12], JobBERT used to be referenced as the state-of-the-art baseline. In general, all of these works draw upon some of the information encoded in the ESCO taxonomy. However, none of them uses detailed descriptions of ESCO occupations, as we propose.

3. The Model

3.1. The Basics

The proposed model is based on the notion of distinctiveness, which specifies the difference between the prompt concept $\theta^{*}$ and other concepts within the conceptual space $\Theta$ [13]. The notion is crucial for distinguishing in-context learning concepts that are aimed to be learned by analogy. $\theta^{*}$ acts as a latent parameter in a Hidden Markov Model that defines a distribution over observed tokens, represented by selected ESCO job titles as labels. As proposed by Xie et al. [13], the error of the in-context predictor approaches optimality under the condition that $\theta^{*}$ is distinguishable from other concepts in $\Theta \setminus \{\theta^{*}\}$. When RAG is adapted as a few-shot reasoning (or in-context learning) framework for job posting classification [14], $\theta^{*}$ is represented by the top-selected ESCO labels and ensures that the LLM can effectively differentiate between closely related job categories.

The explanation-enriched prompts enhance the LLM's ability to learn more from each example. According to Xie et al. [13], the expected error decreases as the length and informational content of each example increase, contributing to the richness of the input-output mapping for a more robust in-context learning environment. This assumption is proven to be true under the condition of distinguishability of in-context examples and can be mathematically expressed as a reduction of the expected error $\mathbb{E}[\epsilon]$, correlated with an increase in the information content of the examples:

$$\mathbb{E}[\epsilon] \propto \frac{1}{I(S_n, x_{\text{test}})} \tag{1}$$

where $S_n$ represents the sequence of training examples in the prompt and $x_{\text{test}}$ is the test input.

The use of RAG helps avoid hallucination: when directly prompted with job postings, LLMs have been observed to sometimes produce non-existent labels [5].

3.2. Design of the Model

The proposed model (see Figure 1) uses multilingual class embeddings of the E5-large model [15] to retrieve pertinent ESCO occupation definitions in English. The definitions serve as contextual information to prompt language models for the selection of the most suitable job titles. To this end, we incorporate the DSPy library's Chain-of-Thought mechanism (https://github.com/stanfordnlp/dspy), augmented by a hint to restrict the model output to a specified list of job titles. The signature used in this methodology (cf. Figure 2) is inspired by [16].

To implement the RAG model, we first set up a vector database (https://www.trychroma.com/), in which the English ESCO occupation definitions were inserted as multilingual embedding vectors. Acknowledging the reported significance of chunking in many NLP applications, we conducted a series of ablation studies to determine the optimal chunk size. These studies revealed that subdividing the ESCO occupation definitions into smaller segments adversely affects the performance of vector-based similarity matching. Therefore, we opted for storing each of the 3,015 occupations represented in the ESCO taxonomy in its entirety.

Table 1
Recall values for classification with E5-large Text Embeddings vector similarity
@10: 0.9004

To accurately classify a given job posting with respect to the ESCO taxonomy, we include 30 ESCO occupation documents (i.e., 30 nodes of the taxonomy) in the LLM's context as potential job titles. The rationale for choosing 30 documents is that we aim to strike a balance between computational efficiency and the accuracy of the retrieved documents. The precision of the LLM would naturally decrease when it is presented with inaccurate labels. Although, as shown in Table 1, the precision of the model slightly increases with 40 documents in the context, we accepted this trade-off in favor of a lower VRAM requirement.

Upon the retrieval of the 30 ESCO occupations that are most closely aligned with a given job posting description, a composite prompt (see Figure 2) is constructed as input to the LLM. The prompt integrates the actual text data encompassing job titles, descriptions, and skills pertinent to the selected occupations. The design of the simplified composite prompt aims to minimize bias by focusing only on the core elements. The prompt is then processed by a locally stored Llama-3 LLM (https://llama.meta.com/llama3/) in an isolated environment (we use dockerized models from the open-source Ollama library, https://ollama.com/, for all experiments).

As a few-shot predictor, the LLM evaluates the composite prompt to accurately classify job postings by examining the semantic nuances of the selected ESCO occupations, aligning them with the actual job titles within the offers. To quantitatively assess the alignment between a job posting vector $J$ and each occupation embedding $e_{\text{ESCO}}$ derived from the ESCO taxonomy, the cosine similarity $\mathrm{sim}(J, e_{\text{ESCO}})$ is used:

$$\mathrm{sim}(J, e_{\text{ESCO}}) = \frac{J \cdot e_{\text{ESCO}}}{\lVert J \rVert \, \lVert e_{\text{ESCO}} \rVert} \tag{2}$$
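To make the similarity computation concrete, the following is a minimal sketch of cosine similarity and of ranking occupation embeddings by it. It is not the authors' implementation: the function names, the toy three-dimensional vectors, and the occupation titles are illustrative assumptions, whereas the model described here relies on E5-large embeddings stored in a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_occupations(posting_vec, occupation_vecs, k):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    scored = [
        (title, cosine_similarity(posting_vec, vec))
        for title, vec in occupation_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
occupations = {
    "data analyst": [0.9, 0.1, 0.0],
    "restaurant manager": [0.0, 0.2, 0.9],
    "project manager": [0.5, 0.5, 0.1],
}
posting = [1.0, 0.2, 0.0]
top2 = rank_occupations(posting, occupations, k=2)
```

In this toy setting the posting vector is closest to "data analyst", followed by "project manager". With real embeddings the same ranking step would typically be delegated to the vector database rather than computed in a Python loop.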

The similarity scores computed for each ESCO occupation embedding facilitate the identification and selection of the ESCO occupation embeddings that are most pertinent to the job posting in question. Armed with this information, the LLM proceeds to classify the job posting by selecting the ESCO occupation that exhibits the highest degree of semantic and contextual relevance.

For a specific job posting $j$, an embedding function $f$ is employed, such that $f(j)$ produces the corresponding embedding for $j$. The degree of similarity between the job posting's embedding $f(j)$ and any ESCO occupation embedding $e$ from $E_{\text{ESCO}}$ (where $E_{\text{ESCO}}$ stands for the ensemble of occupation embeddings derived from the ESCO taxonomy) is determined through the similarity function $\mathrm{sim}(f(j), e)$ (in our case, cosine).

The similarity scores for each occupation embedding $e$ within $E_{\text{ESCO}}$ relative to $f(j)$ are computed. The ten class embeddings that exhibit the highest similarity to $f(j)$, denoted as $E_{\text{top}}$, are selected. Formally, $E_{\text{top}}$ is defined as the subset $\{e_1, e_2, \ldots, e_{10}\}$ of $E_{\text{ESCO}}$, where each $e_i$ is selected based on the top 10 similarity scores $\mathrm{sim}(f(j), e_i)$.

The last stage entails a decision-making process enacted by the Llama-3 LLM, represented by the function $g$. This function accepts the composite prompt, including the candidates $\{e_1, e_2, \ldots, e_{10}\}$ accumulated in $E_{\text{top}}$ and the job posting $j$, to render the final selected occupation embedding. The chosen occupation embedding $e^{*}$ is determined by $e^{*} = g(E_{\text{top}}, j)$, representing the ESCO occupation best matched by the model.

The entire algorithm can be presented by the following equation, which encapsulates the embedding generation, similarity assessment, and decision-making process by the LLM, culminating in the selection of the most suitable ESCO occupation embedding $e^{*}$ for the given job posting description:

$$e^{*} = g\big(\{e_1, e_2, \ldots, e_k \mid e_i \in E_{\text{ESCO}};\ \text{top } k \text{ by } \mathrm{sim}(f(j), e_i)\},\ j\big) \tag{3}$$

4. Experiments

To evaluate the effectiveness of the proposed model in handling multilingual job postings, experiments were conducted separately on Italian and Spanish datasets.

4.1. Test dataset

To obtain a reliable test dataset, we use three high-performing LLMs as initial annotators of 100 Italian and 100 Spanish real-world job postings with the most extensive descriptions from the InfoJobs database (https://www.infojobs.net/). Non-informative elements such as company descriptions and promotional content were removed using a DSPy module (cf. Appendix E for the prompt), which employs zero-shot Llama-3 LM inference to anonymize sensitive information in job postings and candidate experiences. The preprocessed postings were annotated by the top three performing LLMs according to LmSys Arena [17]: GPT-4o (https://openai.com/index/hello-gpt-4o/), Gemini 1.5 Pro (https://deepmind.google/technologies/gemini/pro/), and Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet). In this context, the ESCO job titles are presented to each model separately, requesting it to select the appropriate job titles; we then measure the models' level of agreement on these labels. The agreement between the LLMs was assessed using Cohen's kappa coefficient [18]. The average kappa score between Gemini and GPT-4o was found to be 0.6386, indicating a substantial level of agreement. The agreement between Gemini and Claude was lower, with an average kappa of 0.5798, suggesting a moderate level of agreement. The kappa score between GPT-4o and Claude was 0.6497, also indicative of substantial agreement. Overall, the average kappa score across all “annotators” was 0.6227, reflecting a general trend towards substantial inter-annotator agreement among the models.

To establish ground truth labels, we incorporated a dual-layer labelling process. Although the test set consists of only 200 items, labeling them from scratch would be time-consuming due to the complexity of the ESCO taxonomy, which includes 3,015 distinct occupations. Human annotators would require extensive training to accurately navigate this taxonomy. Therefore, we first annotate the occupations automatically using LLMs and then have the initial annotations cross-examined by a human expert annotator. Since each data point was reviewed by one annotator only, inter-annotator agreement among human annotators was not quantified. Instead, we conducted an analysis to identify job titles that consistently showed agreement or disagreement across the three LLMs, in which domain-specific professionals from InfoJobs reviewed label discrepancies. This analysis, detailed in Appendix C, suggests that certain occupations are inherently more challenging to classify, possibly due to overlapping skills or ambiguous descriptions.

Furthermore, we repeated the experiments using ground truth labels on which any two of the three automatic LLMs agreed. The results showed alignment between the models' predictions and the automatic labeling process, indicating consistency with the patterns recognized by the automatic methods when there is partial agreement. A detailed analysis of this alignment can be found in Appendix D.
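The agreement analysis and the two-of-three labeling rule described in this section can be sketched as follows. This is an illustrative simplification, not the authors' code: it assumes exactly one label per posting and annotator (the task in the paper is multi-label), and the function names and toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def majority_label(votes):
    """Return the label chosen by at least two of the annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


# Toy annotations (hypothetical labels).
annotator_1 = ["chef", "chef", "waiter", "waiter"]
annotator_2 = ["chef", "waiter", "waiter", "waiter"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Mean of the three pairwise kappa scores reported in the text.
mean_kappa = round((0.6386 + 0.5798 + 0.6497) / 3, 4)
```

For the toy annotations, observed agreement is 0.75 and chance agreement is 0.5, which gives a kappa of 0.5. The mean of the three reported pairwise scores reproduces the overall value of 0.6227.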

4.2. Baselines

4.2.1. SkillGPT

SkillGPT [5] has been introduced as a tool for skill extraction and classification, with vector similarity search against LLM-precomputed ESCO embeddings. The authors employ embeddings generated by an LLM, although they do not directly use an LLM to select among candidate embeddings. Instead, they rely on embedding similarity to assign the most closely related ESCO class to the job descriptions under consideration.

Table 2
Italian Performance Metrics for Top 5 and Top 10 Predictions
Model | Precision@5 | Precision@10 | Recall@5 | Recall@10
0.13 0.12 0.19 0.12 0.19 0.76 0.62 0.36 0.29 0.36 0.80 0.64 0.82 0.58 0.88

4.2.2. Zero-Shot Classification

By transforming the classification task into a Natural Language Inference (NLI) problem, any model pretrained on NLI tasks can be utilized as a text classifier without the need for fine-tuning, effectively achieving zero-shot text classification. This is particularly beneficial when we deal with classes unseen during training, making it a robust solution for a variety of text classification scenarios [19].

As a baseline, we use the BART-MNLI model [20], a BART model, which has shown high performance in summarization tasks, trained for NLI on the MNLI dataset [21]. We leverage its capability to understand entailment relations to classify a given sequence into one of the specified categories. We also apply the same methodology with the Llama-3 model.
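The reduction of classification to NLI can be sketched in a few lines: each candidate label is turned into a hypothesis, an NLI model scores how strongly the posting entails it, and the entailment scores are normalized across labels. In the sketch below, the hypothesis template and `toy_entailment_logit` are assumptions made purely for illustration; the toy scorer only counts shared words and stands in for a real NLI model such as BART-MNLI.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(text, labels, entailment_logit,
                       template="This job posting is for a {}."):
    """Rank labels by the normalized entailment score of their hypotheses."""
    hypotheses = [template.format(label) for label in labels]
    logits = [entailment_logit(text, hypothesis) for hypothesis in hypotheses]
    probabilities = softmax(logits)
    ranked = sorted(zip(labels, probabilities), key=lambda p: p[1], reverse=True)
    return ranked


def toy_entailment_logit(premise, hypothesis):
    """Hypothetical stand-in for an NLI model: counts shared lowercase words."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().rstrip(".").split())
    return float(len(premise_words & hypothesis_words))


ranking = zero_shot_classify(
    "We are hiring a waiter for our restaurant",
    ["waiter", "data analyst"],
    toy_entailment_logit,
)
```

Replacing the toy scorer with the entailment logit of an actual NLI model leaves the rest of the procedure unchanged, which is why no task-specific fine-tuning is needed.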

4.3. Model Optimization

To optimize LLMs with a minimal set of manually crafted examples, we use the DSPy library [22]. We initialize the classifier module with a Llama-3 model and use a GPT-4o model as the teacher. Our optimization of the classification is aimed at achieving high F1 scores for each dataset individually. In each run, we use 10 labeled training examples and 30 labeled validation examples. We employ DSPy's BootstrapFewShot, configuring it to perform a maximum of 2 rounds with up to 8 bootstrapped demonstrations. We define a custom metric, the F1 score, to guide the bootstrapping process. For the optimization of the LLMs, we use data points that had high inter-agreement among the automatic methods and were reviewed by human annotators. We perform a validation/test split to ensure that the optimization does not bias the evaluation results.

4.4. Outcome of the experiments

For the evaluation of the results of the experiments, we used the micro recall and micro precision metrics, which are suitable for our multi-class classification task. We report evaluation scores separately on the Spanish and Italian test sets.

Tables 2 and 3 display the results on the Italian and Spanish datasets, respectively. The results indicate that prompting techniques outperform SkillGPT in both languages. Specifically, the optimized Llama-3-8b model with chain-of-thought (CoT) achieves the highest precision and recall at @5 for Italian, with values of 0.32 and 0.76, respectively, and for Spanish, with values of 0.28 and 0.72. This supports our assumption that optimization enhances performance. The multilingual E5-large model achieves the highest precision at @10 for Italian (0.19) and the highest recall at @10 for Spanish (0.92), underscoring the efficacy of embeddings in classification. This implies that semantically less similar labels can confuse models, whereas embeddings ensure higher recall, particularly in wider retrieval scenarios. Although both models exhibit similar precision, indicating comparable accuracy in their predictions, the optimized model's capacity to capture a broader range of relevant job titles ensures greater alignment with expert human preferences. This enhances the model's ability to make relevant job title suggestions, thereby improving the overall matching process.
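For reference, micro-averaged precision and recall at a cutoff k can be computed as in the following sketch. The exact evaluation code is not given in the paper, so the function names and the toy data are assumptions; each prediction is taken to be a ranked list of job titles without duplicates, and each gold entry is the set of accepted titles for that posting.

```python
def micro_precision_recall_at_k(predictions, gold, k):
    """Micro-averaged precision@k and recall@k over all postings.

    predictions: one ranked list of labels per posting (best first, no duplicates)
    gold: one collection of correct labels per posting
    """
    hits = predicted = relevant = 0
    for ranked, truth in zip(predictions, gold):
        top_k = ranked[:k]
        truth_set = set(truth)
        hits += sum(1 for label in top_k if label in truth_set)
        predicted += len(top_k)
        relevant += len(truth_set)
    precision = hits / predicted if predicted else 0.0
    recall = hits / relevant if relevant else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with two postings (hypothetical labels).
predictions = [["a", "b", "c"], ["d", "e", "f"]]
gold = [["a", "c"], ["x"]]
p_at_2, r_at_2 = micro_precision_recall_at_k(predictions, gold, k=2)
p_at_3, r_at_3 = micro_precision_recall_at_k(predictions, gold, k=3)
```

Because precision@k divides by the number of predicted titles, it cannot exceed the number of gold titles divided by k for a given posting. With only a few gold titles per posting, precision is therefore expected to stay well below recall as k grows.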
4.5. Discussion

In Tables 2 and 3, we observe that the combined use of general text embeddings and language models significantly outperforms current classification techniques, which rely on language models specifically tailored to the field of the labour market, such as [12]. We see that using vector similarity with the text embeddings created by the E5-large text embeddings model alone does not surpass the baseline. However, it is worth noting that the results are quite close, despite the fact that this model was not specifically fine-tuned on labour market data or adapted to the ESCO taxonomy, as is the case for [12]. Furthermore, we can observe that text embeddings indeed provide significant value for filtering the k occupations closest to a job posting within the taxonomy. Using these k occupations as input to various language models for few-shot classification significantly improves over the baselines. Table 6 in the Appendix illustrates the decisions of the LLMs in the case of four sample job postings.

We also evaluated the effectiveness of a large language model in classifying job titles based on the provided descriptions, even when the correct titles were not explicitly listed among the initial ESCO job titles (see Table 4). The model's ability to select accurate titles reflects its functionality in processing and understanding the contextual and semantic aspects of the job descriptions. For instance, when presented with a job description focused on the management of comprehensive water and wastewater services, the model correctly identified “Operations Manager” as the correct title. This identification was made despite the presence of several closely related but distinct labels (such as “Water treatment plant manager”) within the pool of ESCO job titles. This indicates that the model's decisions are influenced more by a comprehensive understanding of the job responsibilities and sectors than by the mere presence of keywords or phrases in the ESCO job titles.

The model's capacity to differentiate between job titles with more specific definitions enhances its comprehension of job postings and assigned labels, thereby improving the precision of suggesting relevant skills. Upon integration into an operational job platform, this model will better understand the requirements of job postings and accurately assign job titles that align with the specific needs of companies. Similarly, in the context of parsing job candidate experiences, keywords tend to appear more frequently in semantically related ESCO definitions, enabling parsers to incorporate these keywords to enhance parsing performance.

Overall, we can thus state that the integration of class embeddings generated using the multilingual E5-large model, with subsequent application of few-shot classification techniques through LLMs, significantly improves the accuracy of job title classification, clearly surpassing that of the baselines.

4.6. Computational Cost of Compared Methods

In addition to evaluating performance metrics, we analyzed the computational cost and environmental impact of each method. The Llama-3-8b model, with 8 billion parameters, requires significant resources for inference, necessitating a GPU with at least 16 GB of VRAM (e.g., NVIDIA RTX 3090). Its average inference time per job posting is approximately 1.5 seconds, and its high energy consumption leads to increased CO2 emissions, making large-scale deployment less environmentally sustainable without optimizations.

In contrast, the mBART-large-mnli model has about 610 million parameters and operates on GPUs with 8 GB of VRAM, offering faster inference times of under 0.5 seconds per job posting. The embeddings-based method using the multilingual E5-large model, with 330 million parameters, allows for precomputed embeddings and efficient CPU-based vector similarity searches, reducing inference time to less than 0.2 seconds per job posting. These smaller models consume less energy, providing more resource-efficient and eco-friendly alternatives suitable for production environments where computational cost and environmental impact are critical considerations.

5. Conclusions and future work

In this paper, we argued that the use of multilingual embeddings in combination with LLMs significantly enhances our ability to distinguish between very similar (or even identical) job titles that suggest different skills and competencies. Our experiments have shown that this is indeed the case, demonstrating that the combination of multilingual text embeddings similarity with Llama-3 markedly exceeds the performance of other leading approaches in the field.

In the future, we plan to apply the same approach to the analysis and classification of job candidate experiences. Once it is ensured that both job postings and candidate experiences can accurately be modeled using the embedded representation of the ESCO taxonomy, we plan to set the stage for a more direct and efficient alignment process between job postings and experiences of job seekers.

Another interesting direction for future research is to analyze the lexical overlap between English domain-specific terms that appear in Italian and Spanish job postings and the English occupation descriptions in the ESCO taxonomy. Such an analysis would reveal whether job types with higher lexical overlap affect model accuracy, providing deeper insights into the multilingual nature of the task.

A. Ablation Study

In our ablation study, we pursued two primary objectives. Firstly, we wanted to evaluate the model's comprehension of ESCO job titles and its decision-making process. To achieve this, we prompted the model to articulate its underlying rationale. Secondly, so far we have reported the performance of our model when Italian and Spanish data were matched against English job titles and occupations in the ESCO taxonomy. Here we wanted to explore whether its comprehension was extendable to data in different languages. We selected Spanish for this purpose and discovered that the model's understanding was consistent, irrespective of the language; see Table 4.

As illustrated in Figure 3, the LLM showcases a comprehensive understanding of the task at hand, effectively narrowing down potential ESCO job titles to identify the most suitable label. Additionally, the LLM is observed to generate a novel job title, referred to as “fast food shift team leader”. This can be attributed to the absence of constraints imposed on the LLM regarding structured output for classification, thereby granting it the autonomy to propose the most fitting job title. The analysis initially excludes broader or less related job titles such as “business manager”, “hospitality revenue manager”, and “accommodation manager”, which are not specific to quick-service restaurant operations. Subsequently, the model considers and ultimately selects titles that emphasize leadership within this specific restaurant context, narrowing down to “quick service restaurant team leader” and “fast food shift team leader” as the most apt job titles. The reasoning of the model is correct in choosing these titles for their precise reflection of the managerial and leadership responsibilities pertinent to the restaurant environment.

Gold Label Job Title: Quick Service Restaurant Team Leader
Posting Job Title: Encargado de Franquicias
Posting Description:
- Responsable de garantizar la satisfacción de los huéspedes y de gestionar y superar los objetivos financieros y operativos de los restaurantes a mi cargo.
- Garantizar una excelente atención a los huéspedes en base a las promesas y estándares definidos.
- Liderar, motivar y desarrollar equipos.
- Facilitar los recursos y el apoyo necesario a los equipos en sus restaurantes.
- Utilizar de manera eficaz los diferentes recursos de la Compañía.
- Identificar oportunidades y amenazas de negocio en el mercado.
- Aportar ideas y ejecutando proyectos en el corto y medio plazo.
- Difundir las mejores practicas y resolver problemas comunes en los restaurantes.
- Cumplir los protocolos y políticas de la Marca y la Compañía.
- Garantizar y difundir los valores y principios definidos por la Compañía.
Skills: SAP Girnet Gtock, Cuiner
ESCO Job Titles: Restaurant Manager, Business Manager, Hospitality Revenue Manager, Accommodation Manager, Delicatessen Shop Manager, Rooms Division Manager, Customer Experience Manager, Quick Service Restaurant Team Leader, Destination Manager, Membership Manager

B. Job postings and Predicted ESCO job titles

The following tables provide examples of job titles, job posting descriptions, and the corresponding gold labels in Table 5 and optimized Llama-3 job titles in Table 6. These examples illustrate how the job titles assigned by recruiters may not always capture the specific nature of the job described in the postings. The gold labels and the optimized Llama-3 job titles offer a more accurate representation of the job roles based on the detailed job descriptions.

Job Posting Description
Commessa; Commessa; - Presentazione e vendita di attrezzature per telecomunicazioni ai clienti; - Servizio e supporto clienti; - Gestione delle transazioni di vendita; - Gestione dello stock e dell’inventario.
Project Engineer; Project Engineer; PROJECT MANAGER / PROJECT ENGINEER Divisione: Amministrazione Tecnica - Coordinamento delle attività di gestione progetti in ambito tecnico; - Supporto al Product Development; - Pianificazione e monitoraggio delle attività progettuali; Supervisione del team tecnico; - Assistenza alla gestione dei fornitori e del budget di progetto.
Addetto alle vendite
IT Specialist
Sales Manager

The job title “Commessa” (Salesperson) is generic and does not specify the specialization required for the job. The gold label “telecommunications equipment specialised seller” fits better because the job description clearly focuses on selling telecommunications equipment, which requires specific knowledge and skills related to this type of product. The gold label accurately reflects the specialized nature of the role.

The job title “Project engineer” given by the recruiter suggests a technical and engineering-focused role. However, the job description emphasizes project management, coordination of project activities, support to product development, and supervision of the technical team. The gold label “project manager” fits better as it captures the overall management and coordination responsibilities described, which are more aligned with the duties of a project manager than with those of a project engineer alone.

The job title “Addetto alle vendite” (Sales Assistant) is too generic and does not capture the specialized nature of the role described in the vacancy. The description specifies duties typical of a deli worker, such as serving customers, slicing cheeses and cured meats, preparing packages, and managing the deli counter. Our model's titles “meat and meat products specialised seller” and “deli worker” are more precise, indicating a specialized role in food handling and customer service, which goes beyond the general sales assistant title. This demonstrates our model's ability to interpret the specific context and responsibilities of the job accurately.

The job title “IT Specialist” is generic and could encompass various IT roles. However, the job description clearly indicates responsibilities such as managing ICT projects, coordinating a software development team, planning and monitoring development activities, managing ICT resources and budget, and providing advanced techni

Addetto alle vendite; Addetto alle vendite; Salumiere: servizio clientela, tagli di formaggi e salumi, preparazione confezioni, gestione banco gastronomia.

IT Specialist; IT Specialist; Responsabile della gestione dei progetti ICT; Coordinamento del team di sviluppo software; Pianificazione e monitoraggio delle attività di sviluppo; Gestione delle risorse ICT e del budget; Assistenza tecnica avanzata e risoluzione dei problemi.

Sales Manager; Sales Manager; Sviluppo del business aziendale; Definizione delle strategie di vendita; Gestione del team di vendita; Monitoraggio delle performance e raggiungimento degli obiettivi di vendita;

Gestione delle relazioni con i clienti chiave e i partner strategici. cal support. The optimized titles “ICT project manager” and “software development manager” are more accurate as they reflect the leadership, coordination, and project management aspects of the role, which go beyond the To further understand the complexity of job classificascope of a general IT specialist. tion in a multilingual context, we conducted an ablation

The job title “Sales Manager” suggests a mid-level man- study focusing on cases where both human annotators agement role. However, the job description highlights and LLMs demonstrated shared uncertainty in assigning responsibilities such as business development, defining definitive labels. These cases were particularly challengsales strategies, managing the sales team, monitoring per- ing due to specialized terminology, regional language formance, and managing relationships with key clients variations, or overlapping responsibilities within job postand strategic partners. These responsibilities are more ings. Table 7 highlights key examples where annotaaligned with a higher-level role such as “business de- tors, despite their recruitment expertise, aligned with the velopment manager” or “sales director”, which involve LLMs in experiencing ambiguity. strategic planning and high-level management. As presented in Table 7, each example illustrates specific challenges encountered in classifying job postings across multilingual and sector-specific contexts. The Junior Project Manager job posting, for instance, combines general project management with specialized tasks such as machine vision, but without enough specific context, C. Ambiguity from Specialized and Contextual Factors it is unclear whether the focus should be on technical We assessed our model’s performance on both silver expertise or managerial skills. The Project Engineer ex- and gold labels to understand its efectiveness under difample shows the impact of technical terminology and ferent levels of agreement. We had reported results for sector-spesific language on classification. Terms such gold labels in Table 2 and 3, results for silver label are as “SCADA” and “Modbus TCP” are common in inter- presented in Table 8. 
For the Spanish dataset, the model’s national engineering contexts but may not align with performance was relatively consistent between silver and typical understanding of recruiters, leading to the selec- gold labels, with only minor variations in precision and tion of varied labels by both LLMs and annotators. The recall. This consistency suggests that the model robustly example of the Assistente Amministrativo with a legal and captures underlying patterns in the job postings, regardifscal focus involves highly specialized processes such as less of labeling strictness. “Registro Nazionale delle Varietà Vegetali” and complex In contrast, the Italian dataset exhibited more signififscal duties like “Dichiarazioni IRAP.” These terms relate icant diferences between performances on silver and to specific Italian government and regulatory compliance, gold labels. For example, in some cases, the precision which could exceed the annotators’ typical recruitment was higher for silver labels while recall was higher for experience, thus resulting in generalized labels that do gold labels. This disparity may indicate that the model not fully capture the compliance and accounting com- better captures broader classifications aligning with maplexity. jority consensus in Italian but struggles with the stricter

These cases emphasize that job postings, as human- criteria required for unanimous agreement. created documents, often do not provide enough con- An interesting observation is that optimization using text for a definitive classification, resulting in ambiguity gold label ground truth data had a negative efect on across specialized and regional terms. the models’ scores derived from silver labels. This could be explained by the fact that during optimization, the language models became more attuned to the patterns D. Analysis of Model Alignment present in the gold labels, potentially diverging from with Partial Agreement Ground those in the silver labels. As a result, the models may Truth Labels have become less efective at predicting labels where only partial agreement (silver labels) was present among the automatic methods. 0.12 0.22 0.19 0.15 0.20 0.12 0.23 0.22 0.27 0.35 0.06 0.16 0.12 0.14 0.19 0.06 0.07 0.06 0.06 0.08 0.58 0.64 0.36 0.39 0.48 0.56 0.55 0.53 0.31 0.39 0.62 0.68 0.62 0.70 0.92 0.60 0.59 0.59 0.58 0.79

In our evaluation, we established two levels of ground truth labels: gold and silver. Gold labels represent unanimous agreement among all three annotators (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), validated by human experts. Silver labels indicate a strong majority consensus, assigned when any two annotators agree, even if the third disagrees.
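The gold and silver definitions used in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `consensus_labels`, the posting identifiers, and the example annotations are hypothetical, and the sketch assumes that each annotator assigns exactly one label per posting and that a unanimous label also counts as a silver label. The human validation step applied to gold labels is not modelled.

```python
from collections import Counter
from typing import Optional


def consensus_labels(annotations: list[str]) -> tuple[Optional[str], Optional[str]]:
    """Return (gold, silver) for one posting, given one label per annotator.

    gold   : the label shared by all annotators, else None
    silver : the label shared by at least two annotators, else None
    (Assumption: a unanimous label is also treated as a silver label.)
    """
    # Most frequent label and the number of annotators that chose it.
    label, votes = Counter(annotations).most_common(1)[0]
    gold = label if votes == len(annotations) else None
    silver = label if votes >= 2 else None
    return gold, silver


# Hypothetical annotations from the three LLM annotators for three postings.
annotations = {
    "posting_1": ["project manager", "project manager", "project manager"],
    "posting_2": ["ICT project manager", "software development manager", "ICT project manager"],
    "posting_3": ["sales director", "business development manager", "sales manager"],
}

labels = {pid: consensus_labels(votes) for pid, votes in annotations.items()}
for pid, (gold, silver) in labels.items():
    print(pid, "gold:", gold, "silver:", silver)
```

Under these assumptions, a posting without any two matching annotations receives neither label, so the gold set is always a subset of the silver set.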

E. DSPy Signature

We use DSPy signatures to prompt the large language models (LLMs) for the downstream tasks. The prompt script was optimized through recursive LLM calls, and its final form was settled on the basis of empirical observations.
