Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models

Hamit Kavas1,2,*, Marc Serra-Vidal2 and Leo Wanner1,3

1 NLP Group, Pompeu Fabra University, C/ Roc Boronat, 138, 08018, Spain
2 Adevinta Spain, C/ de la Ciutat de Granada, 150, Barcelona, 08018, Spain
3 Catalan Institute for Research and Advanced Studies (ICREA), Passeig Lluís Companys, 23, Barcelona, 08010, Spain

Abstract
In the modern labour market, taxonomies such as the European Skills, Competences, Qualifications and Occupations (ESCO) classification are used as an interlingua to match job postings with job seeker profiles. Both are classified with respect to ESCO occupations, and match if they align with the same occupation and the same skills assigned to the occupation. However, matching models usually struggle with the classification because of overlapping skills and similar definitions of occupations defined in the ESCO taxonomy. This often leads to imprecise classification outcomes. In this paper, we focus on the challenge of the classification of job postings written in Italian or Spanish against ESCO occupations written in English. We experiment with multilingual embeddings, zero-shot classification, and the use of a large language model (LLM), and show that the use of an LLM leads to the best results. Furthermore, we also explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.

Keywords
ESCO labour market taxonomy, job posting classification, class embeddings, text embeddings, LLM

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
Email: hamit.kavas@upf.edu (H. Kavas); marc.serrav@adevinta.com (M. Serra-Vidal); leo.wanner@upf.edu (L. Wanner)
ORCID: 0009-0009-7027-7367 (H. Kavas); 0009-0000-0120-381X (M. Serra-Vidal); 0000-0002-9446-3748 (L. Wanner)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

The modern labour market is becoming more and more diverse. High-tech jobs demand novel skills and competences, which in their turn keep undergoing adaptations and modifications. Under these circumstances, accurately classifying job postings and CVs of job seekers (henceforth candidate experiences) that contain detailed technological specifications with remarkably similar yet distinct skills and experiences has evolved into a complex challenge.

The overwhelming majority of job portals and employment agencies use either the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy (https://esco.ec.europa.eu/en/classification) or its US equivalent, the O*Net taxonomy (https://www.onetonline.org/), to classify job postings and candidate experiences in terms of job-title-labeled ESCO/O*Net occupations. Most of the proposals for the automatic alignment of job postings with candidate experiences (or vice versa) also use ESCO or O*Net [1, 2, 3]. However, despite their wide use, both the ESCO and O*Net taxonomies exhibit principal limitations for the task of automatic classification of job postings and candidate experiences because, due to their tree structure, they often fail to adequately distinguish between occupations that exhibit substantial skill overlaps. For instance, two job postings labeled as ‘data analyst’ may appear similar but require different skills if one focuses on market research while the other concentrates on healthcare trends analysis. This issue is particularly pronounced when classification relies on a single label, such as the job title of an ESCO occupation, where skill overlaps undermine precise classification. Hence, employing multiple job titles and framing the problem as a multi-label classification task is imperative.

This paper addresses the challenge of multilingual multi-label classification using Large Language Models (LLMs) for the alignment of Italian and Spanish job postings with English job titles encountered in the ESCO taxonomy. Multilingual class embeddings are explored to improve classification accuracy, aiming to provide the necessary contextual awareness and to address the core limitations of taxonomies such as ESCO.

Furthermore, we explore an alternative automatic labeling method by prompting three top-performing LLMs to annotate the test dataset. This approach serves both as an experiment on the usability of automatic labeling and as an evaluation of the reliability of the automatically assigned labels, involving human annotators.

To provide LLMs with domain-specific information and to mitigate hallucinations in the course of the classification of the job postings, we employ Retrieval Augmented Generation (RAG) [4], which combines information retrieval with a generative model. RAG serves two critical functions in our methodology. Firstly, it provides detailed definitions, including essential skills and synonyms for each ESCO occupation, selected through vector similarity as outlined in [5]. Secondly, it ensures that the assigned job titles are restricted to titles within our predefined label space, i.e., standardized job titles defined in the ESCO taxonomy.

The contributions of our work are:
∙ We explore the impact of using multilingual class embeddings derived from the ESCO taxonomy for the task of job posting classification;
∙ We integrate RAG to provide LLMs with domain-specific information and eliminate the dependency on fine-tuning;
∙ We show how the LLM response can be restricted to standardized job titles and thus how LLMs can be used for high quality job title classification that outperforms state-of-the-art proposals for this task.

The remainder of the paper is structured as follows. In Section 2, we present a concise overview of the related work. In Section 3, the model on which our work is based is outlined. Section 4 describes the experiments we carried out, the results we obtained in these experiments, and their discussion. Finally, Section 5 draws some conclusions from the presented work and outlines some directions for future research. In Appendix A, we present an ablation study in which we assess the comprehension of English ESCO job titles and their Spanish equivalents by our model. Appendix B provides, for illustration, examples of Italian job postings and predicted ESCO job titles. In Appendix E, we present the signature used to prompt Large Language Models for pre-processing.

2. Related Work

A number of works have been carried out in the domain of job title classification, focusing on various facets of the problem. Shi et al. [6] introduce Job2Skills, a model developed for LinkedIn. The model significantly improves job recommendation performance metrics, but raises questions about its effectiveness beyond LinkedIn. Li et al. [7] propose a two-step job title normalization, also at LinkedIn, which is based on tokenization and matching of the original job title provided by the user with a lookup table. The use of a lookup table instead of a standard occupation taxonomy such as ESCO or O*Net significantly limits the generalization potential of this strategy. Zhang et al. [8] extract soft and hard skills from job posting descriptions, showing that domain-specific pre-training significantly enhances performance in skills and knowledge extraction. Javed et al. [3] introduce a semi-supervised machine learning approach that utilizes hierarchical classifiers and the O*Net Standard Occupational Classification (SOC) taxonomy for the classification of online recruitment data. Similarly, Wang et al. [9] propose a model based on multi-stream convolutional neural networks, aiming to classify noisy user-generated job titles by considering different elements such as characters and words within job titles. Yamashita et al. [10] and Zbib et al. [1] conduct studies on the classification of job titles, focusing on job title alignment and job similarity training, respectively. JobBERT (Decorte et al. [2]) classifies job titles against the ESCO taxonomy, treating the task as a semantic text similarity (STS) exercise. In particular, JobBERT emphasizes the understanding of the semantics of job titles through the skills inferred from the associated vacancies and descriptions, thus alleviating the need for an extensive labeled dataset or a continuously updated list of standardized titles. Before the recent proposals [11] and [12], JobBERT used to be referenced as the state-of-the-art baseline. In general, all of these works draw upon some of the information encoded in the ESCO taxonomy. However, none of them uses detailed descriptions of ESCO occupations, as we propose.

3. The Model

3.1. The Basics

The proposed model is based on the notion of distinctiveness, which specifies the difference between the prompt concept $\theta^*$ and other concepts within the conceptual space $\Theta$ [13]. The notion is crucial for distinguishing the concepts that in-context learning aims to learn by analogy. $\theta^*$ acts as a latent parameter in a Hidden Markov Model that defines a distribution over observed tokens, represented by selected ESCO job titles as labels. As proposed by Xie et al. [13], the error of the in-context predictor approaches optimality under the condition that $\theta^*$ is distinguishable from other concepts in $\Theta \setminus \{\theta^*\}$. When RAG is adapted as a few-shot reasoning (or in-context learning) framework for job posting classification [14], $\theta^*$ is represented by the top-selected ESCO labels and ensures that the LLM can effectively differentiate between closely related job categories.

The explanation-enriched prompts enhance the LLM’s ability to learn more from each example. According to Xie et al. [13], the expected error decreases as the length and informational content of each example increase, contributing to the richness of the input-output mapping for a more robust in-context learning environment. This assumption is proven to be true under the condition of distinguishability of in-context examples and can be mathematically expressed as a reduction of the expected error $E[\epsilon]$, correlated with an increase in the information content $I$ of the examples:

$$ E[\epsilon] \propto \frac{1}{I(S_n, x_{\text{test}})} \qquad (1) $$

where $S_n$ represents the sequence of training examples in the prompt and $x_{\text{test}}$ is the test input.

The use of RAG helps to avoid hallucination: when directly prompted with job postings, LLMs have been observed to sometimes produce non-existent labels [5].

[Figure 1: Model Architecture]

[Figure 2: Prompt Template]

3.2. Design of the Model

The proposed model (see Figure 1) uses multilingual class embeddings of the E5-large model [15] to retrieve pertinent ESCO occupation definitions in English. The definitions serve as contextual information to prompt language models for selection of the most suitable job titles. To this end, we incorporate the DSPy library’s Chain-of-Thought mechanism (https://github.com/stanfordnlp/dspy), augmented by a hint to restrict the model output to a specified list of job titles. The signature used in this methodology (cf. Figure 2) is inspired by [16].

To implement the RAG model, we initially established a vector database (https://www.trychroma.com/), in which English ESCO occupation definitions were inserted as multilingual embedding vectors. Acknowledging the reported significance of chunking in many NLP applications, we conducted a series of ablation studies to determine the optimal chunk size. These studies revealed that subdividing the ESCO occupation definitions into smaller segments adversely affects the performance of vector-based similarity matching. Therefore, we opted for storing each of the 3,015 occupations represented in the ESCO taxonomy in its entirety.

Table 1
Recall values for classification with E5-large Text Embeddings vector similarity

Precision @ K    @5        @10       @30       @40
Value            0.4238    0.9004    0.9627    0.9817

To accurately classify a given job posting with respect to the ESCO taxonomy, we include 30 ESCO occupation documents (i.e., 30 nodes of the taxonomy) into the LLM’s context as potential job titles. The rationale for choosing 30 documents is that we aim to strike a balance between computational efficiency and the accuracy of the retrieved documents. The precision of the LLM would naturally decrease when it is presented with inaccurate labels. Although, as shown in Table 1, the precision of the model slightly increases with 40 documents in the context, we accepted this trade-off in favor of a lower VRAM requirement.

Upon the retrieval of the 30 ESCO occupations that are most closely aligned with a given job posting description, a composite prompt (see Figure 2) is constructed as input to the LLM. The prompt integrates the actual text data encompassing job titles, descriptions, and skills pertinent to the selected occupations. The design of the simplified composite prompt aims to minimize the bias by focusing only on the core elements. The prompt is then processed by a locally stored Llama-3 LLM (https://llama.meta.com/llama3/) in an isolated environment. We use dockerized models from the open-source Ollama library (https://ollama.com/) for all experiments.

As a few-shot predictor, the LLM evaluates the composite prompt to accurately classify job postings by examining the semantic nuances of the selected ESCO occupations, aligning them with the actual job titles within the offers. To quantitatively assess the alignment between a job posting vector $J$ and each occupation embedding $E_{\mathrm{ESCO}}$ derived from the ESCO taxonomy, cosine similarity $a(J, E_{\mathrm{ESCO}})$ is used:

$$ a(J, E_{\mathrm{ESCO}}) = \frac{J \cdot E_{\mathrm{ESCO}}}{\lVert J \rVert \, \lVert E_{\mathrm{ESCO}} \rVert} \qquad (2) $$

The similarity scores yielded through $a(J, E_{\mathrm{ESCO}})$ for each $E_{\mathrm{ESCO}}$ facilitate the identification and selection of the ESCO occupation embeddings that are most pertinent to the job posting in question. Armed with this information, the LLM proceeds to classify the job posting by selecting the ESCO occupation that exhibits the highest degree of semantic and contextual relevance.

For a specific job posting $J$, an embedding function $E$ is employed, such that $E(J)$ produces the corresponding embedding for $J$. The degree of similarity between the job posting’s embedding $E(J)$ and any ESCO occupation embedding $e_i$ from $E_{\mathrm{ESCO}}$ (where $E_{\mathrm{ESCO}}$ stands for the ensemble of occupation embeddings derived from the ESCO taxonomy) is determined through the similarity function $S(E(J), e_i)$ (in our case, cosine similarity).

The similarity scores for each occupation embedding $e_i$ within $E_{\mathrm{ESCO}}$ relative to $E(J)$ are computed. The ten class embeddings that exhibit the highest similarity to $E(J)$, denoted as $E_{\text{top}}$, are selected. Formally, $E_{\text{top}}$ is defined as the subset $\{e_1, e_2, \ldots, e_{10}\}$ from $E_{\mathrm{ESCO}}$, where each $e_i$ is selected based on the top 10 similarity scores $S(E(J), e_i)$.

The last stage entails a decision-making process enacted by the Llama-3 LLM, represented by the function $D$. This function accepts the composite prompt, which includes the candidates $\{e_1, e_2, \ldots, e_{10}\}$ collected in $E_{\text{top}}$ and the job posting $J$, to render the final selected occupation embedding. The chosen occupation embedding $e^*$ is determined by $e^* = D(E_{\text{top}}, J)$, representing the ESCO occupation best matched by the model.

The entire algorithm can be presented by the following equation, which encapsulates the embedding generation, similarity assessment, and decision-making process by the LLM, culminating in the selection of the most suitable ESCO occupation embedding $e^*$ for the given job posting description:

$$ e^* = D\big(\{e_1, e_2, \ldots, e_k \mid e_i \in E_{\mathrm{ESCO}};\ \text{top } k \text{ by } S(E(J), e_i)\},\ J\big) \qquad (3) $$

4. Experiments

To evaluate the effectiveness of the proposed model in handling multilingual job postings, experiments were conducted separately on Italian and Spanish datasets.

4.1. Test dataset

To have a reliable test dataset, we use three high-performing LLMs as initial annotators of 100 real-world Italian and 100 real-world Spanish job postings with the most extensive descriptions from the InfoJobs database (https://www.infojobs.net/). Non-informative elements such as company descriptions and promotional content were removed using a DSPy module (cf. Appendix E for the prompt), which employs zero-shot Llama-3 LM inference to anonymize sensitive information in job postings and candidate experiences. The preprocessed postings were annotated by the top three performing LLMs according to LmSys Arena [17]: GPT-4o (https://openai.com/index/hello-gpt-4o/), Gemini 1.5 Pro (https://deepmind.google/technologies/gemini/pro/), and Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet). In this setting, the ESCO job titles are presented to each model separately with a request to select the appropriate job titles; we then measure the models' level of agreement on these labels. The agreement between the LLMs was assessed using Cohen’s kappa coefficient [18]. The average kappa score between Gemini and GPT-4o was found to be 0.6386, indicating a substantial level of agreement. The agreement between Gemini and Claude was lower, with an average kappa of 0.5798, suggesting a moderate level of agreement. Similarly, the kappa score between GPT-4o and Claude was 0.6497, also indicative of substantial agreement. Overall, the average kappa score across all “annotators” was 0.6227, reflecting a general trend towards substantial inter-annotator agreement among the models.

To establish ground truth labels, we incorporated a dual-layer labelling process. Although the test set consists of only 200 items, labeling them from scratch would be time-consuming due to the complexity of the ESCO taxonomy, which includes 3,015 distinct occupations. Human annotators would require extensive training to accurately navigate this taxonomy. Therefore, we first annotate the occupations automatically using LLMs and then have the initial annotations cross-examined by a human expert annotator. Since each data point was reviewed by one annotator only, inter-annotator agreement among human annotators was not quantified. Instead, we conducted an analysis to identify job titles that consistently showed agreement or disagreement across the three LLMs, where domain-specific professionals from InfoJobs reviewed label discrepancies. This analysis, detailed in Appendix C, suggests that certain occupations are inherently more challenging to classify, possibly due to overlapping skills or ambiguous descriptions.

Furthermore, we repeated experiments using ground truth labels where any two of the three automatic LLMs agreed on the label. The results showed alignment between the models’ predictions and the automatic labeling process, indicating consistency with the patterns recognized by the automatic methods when there is partial agreement. A detailed analysis of this alignment can be found in Appendix D.

4.2. Baselines

4.2.1. SkillGPT

SkillGPT [5] has been introduced as a tool for skill extraction and classification, with vector similarity search against LLM-precomputed ESCO embeddings. The authors employ embeddings generated by an LLM, although they do not directly use the LLM to select among candidate embeddings. Instead, they rely on embedding similarity to assign the most closely related ESCO class to the job descriptions under consideration.

4.2.2. Zero-Shot Classification

By transforming the classification task into a Natural Language Inference (NLI) problem, any model pretrained on NLI tasks can be utilized as a text classifier without the need for fine-tuning, effectively achieving zero-shot text classification. This is particularly beneficial when we deal with classes unseen during training, making it a robust solution for a variety of text classification scenarios [19].

In the implementation that we use as a baseline, we utilize the BART-MNLI model [20], which showed high performance in summarization tasks when pretrained for various NLI tasks on the MNLI dataset [21], and which is leveraged for its capability to understand entailment relations in order to classify a given sequence into one of the specified categories. We also apply the same methodology with the Llama-3 model.

4.3. Model Optimization

To optimize LLMs with a minimal set of manually crafted examples, we use the DSPy library [22]. We initialize the classifier module with a Llama-3 model and use a GPT-4o model as the teacher. Our optimization of the classification is aimed at achieving high F1 scores for each dataset individually. In each run, we use 10 labeled training examples and 30 labeled validation examples. We employ DSPy’s BootstrapFewShot, configuring it to perform a maximum of 2 rounds with up to 8 bootstrapped demonstrations. We define a custom metric, the F1 score, to guide the bootstrapping process. For the optimization of the LLMs, we use data points that had high inter-agreement among the automatic methods and were reviewed by human annotators. We perform a validation/test split to ensure that the optimization does not bias the evaluation results.

4.4. Outcome of the experiments

For the evaluation of the results of the experiments, we used the micro recall and micro precision metrics, which are suitable for our multi-class classification task. We report evaluation scores separately on the Spanish and Italian test sets.

Table 2
Italian Performance Metrics for Top 5 and Top 10 Predictions

Model                        Precision@5   Precision@10   Recall@5   Recall@10
llama-3-8b (CoT opt.)        0.32          0.13           0.76       0.80
llama-3-8b (CoT)             0.26          0.12           0.62       0.64
llama-3-8b (SkillGPT)        0.19          0.19           0.36       0.82
mBart-large-mnli (0-shot)    0.13          0.12           0.29       0.58
multilingual-e5-large        0.16          0.19           0.36       0.88

Table 3
Spanish Performance Metrics for Top 5 and Top 10 Predictions

Model                        Precision@5   Precision@10   Recall@5   Recall@10
llama-3-8b (CoT opt.)        0.28          0.20           0.72       0.90
llama-3-8b (CoT)             0.26          0.16           0.64       0.68
llama-3-8b (SkillGPT)        0.09          0.12           0.36       0.62
mBart-large-mnli (0-shot)    0.15          0.14           0.39       0.70
multilingual-e5-large        0.20          0.19           0.48       0.92

Tables 2 and 3 display the results on the Italian and Spanish datasets, respectively. The results indicate that prompting techniques outperform SkillGPT in both languages. Specifically, the optimized Llama-3-8b model with chain-of-thought (CoT) achieves the highest precision and recall at @5 for Italian, with values of 0.32 and 0.76, respectively, and for Spanish, with values of 0.28 and 0.72. This supports our assumption that optimization enhances performance. The multilingual E5-large model achieves the highest precision at @10 for Italian (0.19) and the highest recall at @10 for Spanish (0.92), underscoring the efficacy of embeddings in classification. This implies that semantically less similar labels can confuse models, whereas embeddings ensure higher recall, particularly in wider retrieval scenarios. Although both models exhibit similar precision, indicating comparable accuracy in their predictions, the optimized model’s capacity to capture a broader range of relevant job titles ensures greater alignment with expert human preferences. This enhances the model’s ability to make relevant job title suggestions, thereby improving the overall matching process.

4.5. Discussion

In Tables 2 and 3, we observe that the combined use of general text embeddings and language models significantly outperforms current classification techniques, which rely on language models specifically tailored to the field of the labour market, such as [12]. We see that using vector similarity with the text embeddings created by the E5-large text embeddings model alone does not surpass the baseline. However, it is worth noting that the results are quite close, despite the fact that this model was not specifically fine-tuned on labour market data or adapted to the ESCO taxonomy, as is the case of [12]. Furthermore, we can observe how text embeddings indeed provide significant value for filtering the k occupations closest to a job posting within the taxonomy. Using these k occupations as input to various language models for few-shot classification significantly improves over the baselines. Table 6 in the Appendix illustrates the decisions of the LLMs in the case of four sample job postings.

We also evaluated the effectiveness of a large language model for the classification of job titles based on provided descriptions, even when the correct titles were not explicitly listed among the initial ESCO job titles (see Table 4). The model’s ability to select accurate titles reflects its functionality in processing and understanding the contextual and semantic aspects of the job descriptions. For instance, when presented with a job description focused on the management of comprehensive water and wastewater services, the model correctly identified “Operations Manager” as the correct title. This identification was made despite the presence of several closely related but distinct labels (such as “Water treatment plant manager”) within the pool of ESCO job titles. This indicates that the model’s decisions are more influenced by a comprehensive understanding of the job responsibilities and sectors than by the mere presence of keywords or phrases in the ESCO job titles.

The model’s capacity to differentiate between job titles with more specific definitions enhances its comprehension of job postings and assigned labels, thereby improving the precision of suggesting relevant skills. Upon integration into an operational job platform, this model will better understand the requirements of job postings and accurately assign job titles that align with the specific needs of companies. Similarly, in the context of parsing of job candidate experiences, keywords tend to appear more frequently in semantically related ESCO definitions, enabling parsers to incorporate these keywords to enhance parsing performance.

Overall, we can thus state that the integration of class embeddings generated using the multilingual E5-large model, with subsequent application of few-shot classification techniques through LLMs, significantly improves the accuracy of job title classification, clearly surpassing that of the baselines.

4.6. Computational Cost of Compared Methods

In addition to evaluating performance metrics, we analyzed the computational cost and environmental impact of each method. The Llama-3-8b model, with 8 billion parameters, requires significant resources for inference, necessitating a GPU with at least 16 GB of VRAM (e.g., NVIDIA RTX 3090). Its average inference time per job posting is approximately 1.5 seconds, and its high energy consumption leads to increased CO2 emissions, making large-scale deployment less environmentally sustainable without optimizations.

In contrast, the mBART-large-mnli model has about 610 million parameters and operates on GPUs with 8 GB of VRAM, offering faster inference times of under 0.5 seconds per job posting. The embeddings-based method using the multilingual E5-large model, with 330 million parameters, allows for precomputed embeddings and efficient CPU-based vector similarity searches, reducing inference time to less than 0.2 seconds per job posting. These smaller models consume less energy, providing more resource-efficient and eco-friendly alternatives suitable for production environments where computational cost and environmental impact are critical considerations.

5. Conclusions and future work

In this paper, we argued that the use of multilingual embeddings in combination with LLMs significantly enhances our ability to distinguish between very similar (or even identical) job titles that suggest different skills and competencies. Our experiments have shown that this is indeed the case, demonstrating that the combination of multilingual text embeddings similarity with Llama-3 markedly exceeds the performance of other leading approaches in the field.

In the future, we plan to apply the same approach to the analysis and classification of job candidate experiences. Once it is ensured that both job postings and candidate experiences can accurately be modeled using the embedded representation of the ESCO taxonomy, we plan to set the stage for a more direct and efficient alignment process between job postings and experiences of job seekers.

Another interesting direction for future research is to analyze the lexical overlap between English domain-specific terms that appear in Italian and Spanish job postings and the English occupation descriptions in the ESCO taxonomy. Such an analysis would reveal whether job types with higher lexical overlap affect model accuracy, providing deeper insights into the multilingual nature of the task.

References

[1] R. Zbib, L. L. Alvarez, F. Retyk, R. Poves, J. Aizpuru, H. Fabregat, V. Šimkus, E. G. Casademont, Learning job titles similarity from noisy skill labels, ArXiv abs/2207.00494 (2022). URL: https://api.semanticscholar.org/CorpusID:250243975.
[2] J.-J. Decorte, J. V. Hautte, T. Demeester, C. Develder, Jobbert: Understanding job titles through skills, ArXiv abs/2109.09605 (2021). URL: https://api.semanticscholar.org/CorpusID:237572142.
[3] F. Javed, M. McNair, F. Jacob, M. Zhao, Towards a job title classification system, 2016. arXiv:1606.00917.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[5] N. Li, B. Kang, T. D. Bie, Skillgpt: a restful api service for skill extraction and standardization using a large language model, 2023. arXiv:2304.11060.
[6] B. Shi, J. Yang, F. Guo, Q. He, Salience and market-aware skill extraction for job targeting, 2020. arXiv:2005.13094.
[7] S. Li, B. Shi, J. Yang, J. Yan, S. Wang, F. Chen, Q. He, Deep job understanding at linkedin, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2020. URL: http://dx.doi.org/10.1145/3397271.3401403. doi:10.1145/3397271.3401403.
[8] M. Zhang, K. N. Jensen, S. D. Sonniks, B. Plank, Skillspan: Hard and soft skill extraction from english job postings, ArXiv abs/2204.12811 (2022). URL: https://api.semanticscholar.org/CorpusID:248405777.
[9] J. Wang, K. Abdelfatah, M. Korayem, J. Balaji, Deepcarotene - job title classification with multi-stream convolutional neural network, 2019, pp. 1953-1961. doi:10.1109/BigData47090.2019.9005673.
[10] M. Yamashita, J. T. Shen, H. Ekhtiari, T. Tran, D. Lee, James: Job title mapping with multi-aspect embeddings and reasoning, 2022. arXiv:2202.10739.
[11] M. Zhang, R. van der Goot, B. Plank, Escoxlm-r: Multilingual taxonomy-driven pre-training for the job market domain, in: Annual Meeting of the Association for Computational Linguistics, 2023. URL: https://api.semanticscholar.org/CorpusID:258832782.
[12] H. Kavas, M. Serra-vidal, L. Wanner, Job offer and applicant cv classification using rich information from a labour market taxonomy, SSRN Electronic Journal (2023). doi:10.2139/ssrn.4519766.
[13] S. M. Xie, A. Raghunathan, P. Liang, T. Ma, An explanation of in-context learning as implicit bayesian inference, 2022. arXiv:2111.02080.
[14] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[15] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards General Text Embeddings with Multi-stage Contrastive Learning, arXiv e-prints (2023). doi:10.48550/arXiv.2308.03281. arXiv:2308.03281.
[16] K. D’Oosterlinck, O. Khattab, F. Remy, T. Demeester, C. Develder, C. Potts, In-context learning for extreme multi-label classification, ArXiv abs/2401.12178 (2024). URL: https://api.semanticscholar.org/CorpusID:267068618.
[17] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, I. Stoica, Chatbot arena: An open platform for evaluating llms by human preference, ArXiv abs/2403.04132 (2024). URL: https://api.semanticscholar.org/CorpusID:268264163.
[18] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37-46. URL: https://api.semanticscholar.org/CorpusID:15926286.
[19] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning based text classification: A comprehensive review, CoRR abs/2004.03705 (2020). URL: https://arxiv.org/abs/2004.03705. arXiv:2004.03705.
[20] L. Shu, J. Chen, B. Liu, H. Xu, Zero-shot aspect-based sentiment analysis, ArXiv abs/2202.01924 (2022).
[21] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, 2018, pp. 1112-1122. URL: http://aclweb.org/anthology/N18-1101.
[22] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, C. Potts, Dspy: Compiling declarative language model calls into self-improving pipelines, ArXiv abs/2310.03714 (2023). URL: https://api.semanticscholar.org/CorpusID:263671701.

[Figure 3: LLM’s Rationale]

A. Ablation Study

In our ablation study, we pursued two primary objectives. Firstly, we wanted to evaluate the model’s comprehension of ESCO job titles and its decision-making process. To achieve this, we prompted the model to articulate its underlying rationale. Secondly, so far we have reported the performance of our model when Italian and Spanish data were matched against English job titles and occupations in the ESCO taxonomy. Here we wanted to explore whether its comprehension was extendable to data in different languages. We selected Spanish for this purpose and discovered that the model’s understanding was consistent, irrespective of the language; see Table 4.

As illustrated in Figure 3, the LLM showcases a comprehensive understanding of the task at hand, effectively narrowing down potential ESCO job titles to identify the most suitable label. Additionally, the LLM is observed to generate a novel job title, referred to as “fast food shift team leader”. This can be attributed to the absence of constraints imposed on the LLM regarding structured output for classification, thereby granting it the autonomy to propose the most fitting job title. The analysis narrows the candidates down to “quick service restaurant team leader” and “fast food shift team leader” as the most apt job titles. The reasoning of the model is correct in choosing these titles for their precise reflection of the managerial and leadership responsibilities pertinent to the restaurant environment.

B. Job postings and Predicted ESCO job titles

The following tables provide examples of job titles, job posting descriptions, and the corresponding gold labels in Table 5 and optimized Llama-3 job titles in Table 6. These examples illustrate how the job titles assigned by recruiters may not always capture the specific nature of the job described in the postings. The gold labels and the optimized Llama-3 job titles offer a more accurate representation of the job roles based on the detailed job descriptions.

The job title “Commessa” (Salesperson) is generic and does not specify the specialization required for the job.
The gold label “telecommunications equipment spe- initially excludes broader or less related job titles such cialised seller” fits better because the job description as “bussiness manager”, “hospitality revenue manager”, clearly focuses on selling telecommunications equipment, and “accomodation manager”, which are not spesific to which requires specific knowledge and skills related to quick-service restaurant operations. Subsequently, the this type of product. The gold label accurately reflects model considers and ultimately selects titles that em- the specialized nature of the role. The job title “Project phasize leadership within this spesific restaurant context, engineer” given by the recruiter suggests a technical and Gold Label Job Title Quick Service Restaurant Team Leader Posting Job Title Encargado de Franquicias Posting Description: - Responsable de garantizar la satisfacción de los huéspedes y de gestionar y superar los objetivos financieros y operativos de los restaurantes a mi cargo. - Garantizar una excelente atención a los huéspedes en base a las promesas y estándares definidos. - Liderar, motivar y desarrollar equipos. - Facilitar los recursos y el apoyo necesario a los equipos en sus restaurantes. - Utilizar de manera eficaz los diferentes recursos de la Compañía. - Identificar oportunidades y amenazas de negocio en el mercado. - Aportar ideas y ejecutando proyectos en el corto y medio plazo. - Difundir las mejores practicas y resolver problemas comunes en los restaurantes. - Cumplir los protocolos y políticas de la Marca y la Compañía. - Garantizar y difundir los valores y principios definidos por la Compañía. 
Skills: SAP Girnet Gtock, Cuiner ESCO Job Titles: Restaurant Manager, Business Manager, Hospitality Revenue Manager, Accommodation Manager, Delicatessen Shop Manager, Rooms Division Manager, Customer Experience Manager, Quick Service Restaurant Team Leader, Destination Manager, Membership Manager Table 4 Spanish job posting Example Posting Job Title Job Posting Description Gold Labels Commessa Commessa; Commessa; - Presentazione e vendita di attrezzature per Telecommunications telecomunicazioni ai clienti; - Servizio e supporto clienti; - Gestione delle equipment specialised transazioni di vendita; - Gestione dello stock e dell’inventario. seller Project Engineer Project Engineer; Project Engineer; PROJECT MANAGER / PROJECT Project manager, Prod- ENGINEER Divisione: Amministrazione Tecnica - Coordinamento delle uct development man- attività di gestione progetti in ambito tecnico; - Supporto al Product ager Development; - Pianificazione e monitoraggio delle attività progettuali; - Supervisione del team tecnico; - Assistenza alla gestione dei fornitori e del budget di progetto. Table 5 Examples of Job Titles, Descriptions, and Gold Labels engineering-focused role. However, the job description packages, and managing the deli counter. Our model’s ti- emphasizes project management, coordination of project tles “meat and meat products specialised seller” and “deli activities, support to product development, and supervi- worker” are more precise, indicating a specialized role sion of the technical team. The gold label “project man- in food handling and customer service, which goes be- ager” fits better as it captures the overall management yond the general sales assistant title. This demonstrates and coordination responsibilities described, which are our model’s ability to interpret the specific context and more aligned with the duties of a project manager than responsibilities of the job accurately. just a project engineer. 
The job title “IT Specialist” is generic and could encom- The job title "Addetto alle vendite" (Sales Assistant) is pass various IT roles. However, the job description clearly too generic and does not capture the specialized nature indicates responsibilities such as managing ICT projects, of the role described in the vacancy. The description coordinating a software development team, planning specifies duties typical of a deli worker, such as serving and monitoring development activities, managing ICT customers, slicing cheeses and cured meats, preparing resources and budget, and providing advanced techni- Posting Job Title Job Posting Description Optimized LLama-3 Job Titles Addetto alle ven- Addetto alle vendite; Addetto alle vendite; Salumiere: servizio clientela, Meat and meat products dite tagli di formaggi e salumi, preparazione confezioni, gestione banco gas- specialised seller, Deli tronomia. worker, Food and bever- age server IT Specialist IT Specialist; IT Specialist; Responsabile della gestione dei progetti ICT; ICT project manager, Soft- Coordinamento del team di sviluppo software; Pianificazione e monitor- ware development man- aggio delle attività di sviluppo; Gestione delle risorse ICT e del budget; ager Assistenza tecnica avanzata e risoluzione dei problemi. Sales Manager Sales Manager; Sales Manager; Sviluppo del business aziendale; Business development Definizione delle strategie di vendita; Gestione del team di vendita; Mon- manager, Sales director itoraggio delle performance e raggiungimento degli obiettivi di vendita; Gestione delle relazioni con i clienti chiave e i partner strategici. 
Table 6 Examples of Job Titles, Descriptions, and Optimized Job Titles Table 7 Examples of Job Postings with Ambiguous Classification due to Multilingual and Contextual Challenges Job Title Description Excerpt Labels Suggested Junior Project Applicare i metodi e gli strumenti propri del Project Management a Project Manager, ICT Manager commesse specifiche per il settore dell’automazione industriale, di cui Project Manager, Pro- l’azienda fornisce sistemi di visione artificiale. gramme Manager Assistente Am- Gestione dei flussi delle segnalazioni dei cittadini per prenotazioni vacci- Healthcare Assistant, ministrativo nazioni e assistenza pandemica, inclusa la verifica del "certificato verde" Administrative Assistant, (Healthcare) per la conformità alle normative sanitarie. Contact Tracing Agent Commesso di Ne- Creazione di vetrine accattivanti con abbinamenti di tendenza e assistenza Shop Assistant, Sales As- gozio (Retail) alla clientela nella scelta dei prodotti. sistant, Visual Merchan- diser Team Leader (En- Predisposizione documenti formativi e aggiornamento processi operativi Team Leader, Energy Ana- ergy Sector) presso sede Enel, inclusa l’implementazione e il collaudo di software per lyst, Business Process An- la gestione energetica. alyst Assistente Ammin- Compiti legati al Registro Nazionale delle Varietà Vegetali e mansioni Accounting Assistant, istrativo (Legal and fiscali complesse come Dichiarazioni IRAP. Administrative Assistant, Fiscal) Compliance Officer cal support. The optimized titles “ICT project manager” C. Ambiguity from Specialized and “software development manager” are more accurate as they reflect the leadership, coordination, and project and Contextual Factors management aspects of the role, which go beyond the To further understand the complexity of job classifica- scope of a general IT specialist. 
tion in a multilingual context, we conducted an ablation The job title “Sales Manager” suggests a mid-level man- study focusing on cases where both human annotators agement role. However, the job description highlights and LLMs demonstrated shared uncertainty in assigning responsibilities such as business development, defining definitive labels. These cases were particularly challeng- sales strategies, managing the sales team, monitoring per- ing due to specialized terminology, regional language formance, and managing relationships with key clients variations, or overlapping responsibilities within job post- and strategic partners. These responsibilities are more ings. Table 7 highlights key examples where annota- aligned with a higher-level role such as “business de- tors, despite their recruitment expertise, aligned with the velopment manager” or “sales director”, which involve LLMs in experiencing ambiguity. strategic planning and high-level management. As presented in Table 7, each example illustrates spe- cific challenges encountered in classifying job postings across multilingual and sector-specific contexts. The Ju- nior Project Manager job posting, for instance, combines general project management with specialized tasks such as machine vision, but without enough specific context, it is unclear whether the focus should be on technical We assessed our model’s performance on both silver expertise or managerial skills. The Project Engineer ex- and gold labels to understand its effectiveness under dif- ample shows the impact of technical terminology and ferent levels of agreement. We had reported results for sector-spesific language on classification. Terms such gold labels in Table 2 and 3, results for silver label are as “SCADA” and “Modbus TCP” are common in inter- presented in Table 8. 
For the Spanish dataset, the model’s national engineering contexts but may not align with performance was relatively consistent between silver and typical understanding of recruiters, leading to the selec- gold labels, with only minor variations in precision and tion of varied labels by both LLMs and annotators. The recall. This consistency suggests that the model robustly example of the Assistente Amministrativo with a legal and captures underlying patterns in the job postings, regard- fiscal focus involves highly specialized processes such as less of labeling strictness. “Registro Nazionale delle Varietà Vegetali” and complex In contrast, the Italian dataset exhibited more signif- fiscal duties like “Dichiarazioni IRAP.” These terms relate icant differences between performances on silver and to specific Italian government and regulatory compliance, gold labels. For example, in some cases, the precision which could exceed the annotators’ typical recruitment was higher for silver labels while recall was higher for experience, thus resulting in generalized labels that do gold labels. This disparity may indicate that the model not fully capture the compliance and accounting com- better captures broader classifications aligning with ma- plexity. jority consensus in Italian but struggles with the stricter These cases emphasize that job postings, as human- criteria required for unanimous agreement. created documents, often do not provide enough con- An interesting observation is that optimization using text for a definitive classification, resulting in ambiguity gold label ground truth data had a negative effect on across specialized and regional terms. the models’ scores derived from silver labels. This could be explained by the fact that during optimization, the language models became more attuned to the patterns D. Analysis of Model Alignment present in the gold labels, potentially diverging from with Partial Agreement Ground those in the silver labels. 
As a result, the models may have become less effective at predicting labels where only Truth Labels partial agreement (silver labels) was present among the automatic methods. Table 8 Performance Metrics for Top 5 and Top 10 Predictions E. DSPy Signature Precision Recall Model We utilize DSPy signatures to prompt large language @5 @10 @5 @10 models (LLMs) for performing downstream tasks. To Spanish (SPA) optimize the script, recursive LLM calls were employed, resulting in its final form based on empirical observa- llama-3-8b (CoT opt.) 0.12 0.06 0.58 0.62 llama-3-8b (CoT) 0.22 0.16 0.64 0.68 tions. llama-3-8b (SkillGPT) 0.19 0.12 0.36 0.62 mBart-large-mnli (0-shot) 0.15 0.14 0.39 0.70 multilingual-e5-large 0.20 0.19 0.48 0.92 Italian (ITA) llama-3-8b (CoT opt.) 0.12 0.06 0.56 0.60 llama-3-8b (CoT) 0.23 0.07 0.55 0.59 llama-3-8b (SkillGPT) 0.22 0.06 0.53 0.59 mBart-large-mnli (0-shot) 0.27 0.06 0.31 0.58 multilingual-e5-large 0.35 0.08 0.39 0.79 In our evaluation, we established two levels of ground truth labels: gold and silver. Gold labels represent unan- imous agreement among all three annotators (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), validated by human experts. Silver labels indicate a strong major- ity consensus, assigned when any two annotators agree, Figure 4: Pre-processing Signature even if the third disagrees.
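The two ground-truth levels described in Appendix D and the metrics reported in Table 8 can be illustrated with a short sketch. It rests on assumptions that are not stated in the paper: each annotator is assumed to return a set of ESCO job titles per posting, gold labels are taken to be the titles proposed by all three annotators, silver labels the titles proposed by at least two, and precision@k and recall@k are computed over the top-k ranked predictions. The function names and the toy data are hypothetical and serve only as an illustration.

```python
from collections import Counter


def consensus_labels(annotations, min_votes):
    """Return the labels proposed by at least `min_votes` annotators.

    `annotations` is a list of label sets, one per annotator (assumed format).
    """
    votes = Counter(label for labels in annotations for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}


def precision_recall_at_k(ranked_predictions, reference, k):
    """Compute precision@k and recall@k of a ranked list against a label set."""
    top_k = ranked_predictions[:k]
    hits = sum(1 for label in top_k if label in reference)
    precision = hits / k if k else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy example (hypothetical data): three annotators label one job posting.
annotations = [
    {"ict project manager", "software development manager"},
    {"ict project manager", "software development manager", "ict consultant"},
    {"ict project manager", "project manager"},
]
gold = consensus_labels(annotations, min_votes=3)    # unanimous agreement
silver = consensus_labels(annotations, min_votes=2)  # majority agreement

# Hypothetical ranked model output for the same posting.
ranked = [
    "ict project manager",
    "project manager",
    "software development manager",
    "ict consultant",
    "software developer",
]
print("gold  P@5, R@5:", precision_recall_at_k(ranked, gold, k=5))
print("silver P@5, R@5:", precision_recall_at_k(ranked, silver, k=5))
```

Under this definition, precision@k cannot exceed the number of reference labels divided by k, so a posting with a single gold label caps precision@5 at 0.2 even when the correct title is ranked first. This may be worth keeping in mind when comparing precision and recall values of the kind shown in Table 8.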