<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Retyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Gascó</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Casimiro Pio Carrino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Deniz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rabih Zbib</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Entity Linking</institution>
          ,
          <addr-line>Entity Normalization, Taxonomy Alignment, Cross-lingual, Multilingual</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proceedings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>4</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using high-quality, pre-existing human annotations. We conduct experiments with simple lexical models and general-purpose sentence encoders, evaluated as bi-encoders in a zero-shot setup, to establish baselines for future research. The datasets and source code for standardized evaluation are publicly available at https://github.com/Avature/melo-benchmark.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The current trend in the digital transformation of human resources (HR) processes is the integration of artificial intelligence (AI) components that can improve automation and operational efficiency. These systems often need to process input data in the form of natural language text, which can be noisy and diverse in terms of language and other domain-specific aspects.</p>
      <p>Previous research in the application of AI within the HR domain has made extensive use of taxonomies, such as occupation and skill classifications [5, 6, 7, 8, 9, 10]. These HR-specific taxonomies have been used for normalizing raw data [11, 12, 13, 14, 15, 16, 17], removing noise and enabling AI models to operate on standardized information, which in turn leads to more accurate and reliable outcomes. Substantial progress has been made, particularly in the normalization of occupational information, which is key to ensuring the consistency and effectiveness of digitalized HR systems in a global setting. Despite these developments, there is still a surprising lack of high-quality public evaluation benchmarks for measuring progress consistently in this important area.</p>
      <p>To address this gap, we propose the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets designed to evaluate multilingual EL tasks. This benchmark leverages pre-existing, high-quality human annotations and covers 21 languages. The task is framed as a ranking problem, where queries and corpus elements are occupation names taken from a source and a target taxonomy, respectively, and binary-relevance annotations are derived from high-quality crosswalks between the taxonomies. Additionally, we release code for standardizing the evaluation of models on this benchmark.</p>
      <p>We provide experimental results for both simple lexical systems and state-of-the-art deep learning models evaluated as zero-shot bi-encoders on MELO, to serve as baselines for future research. We find that, while the lexical baselines perform fairly well, the semantic baselines generally achieve better results, particularly in cross-lingual tasks. However, there remains significant room for improvement.</p>
      <p>RecSys in HR’24: The 4th Workshop on Recommender Systems for Human Resources, in conjunction with the 18th ACM Conference on Recommender Systems. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>To the best of our knowledge, MELO is the first public evaluation benchmark to address the task of multilingual entity linking in the HR domain. In this Section, we introduce the context necessary for understanding the subsequent task definition (§3), the methodology employed in constructing the benchmark (§4), and the related work (§6).</p>
      <p><bold>Entity Linking.</bold> Given a knowledge base E and a query mention q, the task of Entity Linking (EL) involves identifying the correct entity e ∈ E to which the mention is referring. In principle, the structure of the knowledge base E can range from a flat catalog of unrelated entities to a complex and heterogeneous ontology. In this work we focus on taxonomies of a single type of entity (i.e. occupations).</p>
      <p>Obtaining annotated data for training such systems is costly, particularly for tasks involving custom taxonomies or low-resource languages [4]. To mitigate this problem, many techniques have been proposed for leveraging transfer learning to obtain good performance in zero-shot EL scenarios [1, 2]. State-of-the-art methods typically use a bi-encoder for the candidate generation stage, and a cross-encoder for the re-ranking stage.</p>
      <p><bold>Multilingual Taxonomies.</bold> For the purposes of this work, we define a taxonomy E as a directed acyclic graph (DAG) where nodes are concepts and edges represent binary IS-A relationships [26] between concepts. The tail concept (child) is a hyponym of the head concept (parent) and therefore represents a narrower meaning. Conversely, the parent is a hypernym of the child and represents a broader meaning, i.e. a category to which the child belongs. Concepts are allowed to have many parents.</p>
      <p>Inspired by the multilingual formulation proposed by Botha et al. [3], we consider each entity as a language-agnostic concept with associated language-specific textual information. For each language in a set of supported languages L, any entity may have a set of names (synonymous with each other), a description, and example sentences where the concept is used. The query q is a text string in some language l_q, with no prior assumptions about the relationship between l_q and the set L of supported languages¹.</p>
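      <p>To make this formulation concrete, the following minimal sketch (with illustrative identifiers and names, not taken from ESCO's actual data model) represents each entity as a language-agnostic concept carrying one synonym set per supported language and possibly several parents:</p>
      <preformat>
```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A language-agnostic taxonomy entity with per-language name sets."""
    concept_id: str
    # language code -> set of names; all names in one language are synonyms
    names: dict[str, set[str]] = field(default_factory=dict)
    # IS-A edges; a concept is allowed to have many parents
    parents: list[str] = field(default_factory=list)

nurse = Concept(
    concept_id="occ-001",  # hypothetical identifier
    names={"en": {"nurse", "registered nurse"},
           "es": {"enfermero", "enfermera"}},
    parents=["occ-root"],
)
```
      </preformat>
      <p>A query is then just a text string in some language l_q, matched against these name sets with no assumption that l_q is among the languages for which the concept has names.</p>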
      <p>In a multilingual taxonomy, concepts are language-agnostic but they have language-specific properties, such as a set of names, a description, or usage examples. In other words, every concept has one set of names for each language supported in the taxonomy. The names of a concept in a given language are considered synonyms of each other. If a lexical entry is attached to more than one concept, this implies polysemy.</p>
      <p>In principle, the system may receive a query mention that refers to an entity that does not exist in the target taxonomy, or that does not refer to any entity at all. This problem, known as out-of-KB or NIL prediction [23], falls outside the scope of this work.</p>
      <p><bold>Occupation Taxonomies.</bold> Several public occupation taxonomies were developed to classify, standardize, and organize information related to job titles and roles found in the workforce.</p>
      <p>One popular and influential occupations taxonomy is the European Skills, Competences, Qualifications, and Occupations (ESCO) ontology, a collection of multilingual and interrelated taxonomies created and maintained by the European Union [27, 28]. It includes 3,039 occupation concepts in its latest version, each with names and definitions (descriptions) in 28 languages. Every concept has one or more names in every supported language. The names are compliant with the terminological guidelines defined by ESCO [29]. All the names of a particular concept in a particular language are considered synonyms with each other. Also, for a particular concept, the language-specific name sets can be considered parallel data from a translation point of view.</p>
      <p>Additionally, it is typical in the EL community to allow the system to know the textual context in which the mention occurs, aiding in the resolution of ambiguity [24]. This aspect is also beyond the scope of our work, as the data we use to build our datasets only includes unnormalized occupation names as queries.</p>
      <p>Entity linking can be framed as a ranking task [25]: given a query q, the system produces a score s(q, e) for each e ∈ E, and the predicted entity ê(q) is computed as ê(q) = argmax_{e ∈ E} s(q, e).</p>
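      <p>This ranking formulation can be sketched in a few lines. The scoring function below is a deliberately simple character-trigram overlap standing in for s(q, e); it is only an illustration, not one of the baselines evaluated later:</p>
      <preformat>
```python
def trigrams(s: str) -> set:
    """Set of character trigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 0))}

def score(q: str, e: str) -> float:
    """Toy s(q, e): Jaccard overlap of character trigrams."""
    a, b = trigrams(q), trigrams(e)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def predict(query: str, corpus: dict) -> str:
    """Compute the argmax of s(q, e) over all corpus surface forms and
    return the concept id attached to the best-scoring form."""
    best_form = max(corpus, key=lambda form: score(query, form))
    return corpus[best_form]

corpus = {"data scientist": "C1", "software developer": "C2", "nurse": "C3"}
prediction = predict("sofware developper", corpus)  # misspelled query
```
      </preformat>
      <p>In a two-stage system, a cheap scorer like this one would produce the candidate-generation ranking, and a more expensive model would then re-rank the top elements.</p>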
      <p>¹For example, setting L = {l_q} would result in a monolingual task, and L = {l} with l ≠ l_q involves a cross-lingual task. More generally, a set L with higher cardinality can define a multilingual task.</p>
      <p>Rank-based evaluation metrics can be used to study the performance on this ranking formulation of EL. A typical approach breaks the task into two stages. The first is the Candidate Generation Stage, where an initial ranking is obtained using a low-latency method, trying to optimize for recall. In the second stage, the Re-ranking Stage, a more costly but higher-precision method is applied to evaluate the top elements in the preliminary rank.</p>
      <p>Another important example is the O*NET-SOC taxonomy. The Occupational Information Network (O*NET) is developed and maintained by the United States government [30, 31] to standardize information relevant to the labor market, based on the 2018 Standard Occupational Classification (SOC) system². It contains information in English about 1,016 occupations, each with a set of names and a description.</p>
      <p>²https://www.bls.gov/soc/</p>
      <p>Additionally, many other countries have developed their own national taxonomies or terminologies for occupations. For example, the Federal Employment Agency in Germany developed the Klassifikation der Berufe 2010 (KldB 2010), a terminology used to standardize the information in the German language about occupations [32].</p>
      <p>To achieve interoperability between some of these taxonomies, mappings (also called crosswalks) were developed and made public. These mappings establish an alignment between two given taxonomies. In particular, the European Union published many crosswalks³ [33] that map concepts from national taxonomies, which are typically monolingual, into ESCO. The process described in Section 4 uses this information as a gold standard to create the datasets for the MELO Benchmark.</p>
      <p>³https://esco.ec.europa.eu/en/use-esco/eures-countries-mapping-tables</p>
      <sec id="sec-3-4">
        <title>3. Task</title>
        <p>As mentioned already, the task consists of multilingual Entity Linking of occupations into the ESCO taxonomy, which we denote by E. Given a query mention q, which is a text string expressing the non-normalized name of an occupation without surrounding context, we need to find the best semantic match in ESCO, namely the correct entity e ∈ E. Every occupation in the taxonomy has textual information in all languages l ∈ L. The query is expressed in language l_q, which we make no prior assumptions about.</p>
        <p>For evaluation, we operationalize the task as a ranking problem with binary-relevance annotations, where a query is used to rank all the strings in a corpus C. The corpus is a collection of lexical terms denoting occupation names, and it is derived from the taxonomy E.</p>
        <p>To build the corpus C, we first define the set of target languages for the corpus as a subset L_C ⊂ L. Then, we collect every surface form (name) for every occupation corresponding to those languages. That is, starting from an empty set, we traverse E and, for each occupation e, we add every name available in any language in L_C. As a result, C is the collection of every name of every occupation in every target language.</p>
        <p>The annotations consist of the set of relevant corpus elements for each query. Given the correct entity e for a query q, those corpus elements that were obtained from the surface forms of e are considered relevant, while any other element in the corpus is considered irrelevant.</p>
        <p>Because the goal is to find the relevant concept in the taxonomy for the given query (i.e. to solve the entity linking formulation of the task), obtaining at least one surface form associated with the relevant concept at the top of the ranking is sufficient for correctly performing the task. In other words, when ranking the corpus elements for a query, the position in the ranking of the highest-ranked relevant surface form is the measure we aim to evaluate. For this reason, we evaluate the baseline models with the following metrics: mean reciprocal rank (MRR) and top-k accuracy (A@k).</p>
        <p><bold>4. Datasets</bold></p>
        <p>The MELO Benchmark consists of 48 datasets, where each is an instance of the ranking task described in Section 3. While the set of queries differs among the datasets, the target taxonomy is always ESCO Occupations. Although the underlying concepts in the corpus are the same, the surface forms (specifically, the occupation names) vary across datasets, since they are presented in different subsets of ESCO languages.</p>
        <p>We leverage existing crosswalks, which are high-quality mappings between ESCO Occupations and other taxonomies [34, 33], to build the datasets. Two datasets are derived from the mapping between ESCO and the O*NET-SOC Taxonomy, while the remaining ones are derived from the mapping between ESCO and the official occupation terminologies from several European countries. While ESCO is a multilingual taxonomy, the national terminologies are monolingual. Elements between the taxonomies are assigned SKOS relationships [35] such as exact match, narrow match, broad match, or close match.</p>
        <p>For each crosswalk, we build two evaluation datasets: a monolingual dataset and a cross-lingual dataset. In both cases, the set of queries are those elements in the national terminologies (or O*NET) that either have only one exact match in ESCO, or have zero exact matches and only one narrow match. We therefore filter out queries that are semantically ambiguous, e.g. because they have more than one exact match, as well as queries that cannot be assigned to a specific concept in ESCO because they are not specific enough, for example if they only have broad or close matches.</p>
        <p>The language of the set of queries, l_q, depends on the national terminology. Regarding the languages used for the corpus, we select a different subset of the languages in ESCO for each modality. For the monolingual task we set L_C = {l_q}, and for the cross-lingual task we set L_C = {English}. Exceptionally, since for O*NET the query language is already English, in this case we define a multilingual task instead of a cross-lingual one, where the corpus languages are English, German, Spanish, French, Italian, Dutch, Portuguese, and Polish (we intentionally include English, the query language). As mentioned in the previous Section, the annotations are derived from the surface forms of the matched ESCO concept.</p>
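        <p>The corpus construction described in Section 3 can be sketched as follows, assuming a mapping from concept identifiers to per-language name sets is available (the identifiers and names below are illustrative):</p>
        <preformat>
```python
def build_corpus(taxonomy: dict, target_languages: set) -> dict:
    """Collect every surface form of every concept in the target languages.

    Returns a mapping surface form -> set of concept ids, so that the
    relevant corpus elements for a query with gold concept e are exactly
    the forms obtained from the name sets of e.
    """
    corpus = {}
    for concept_id, names_by_lang in taxonomy.items():
        for lang in target_languages:
            for name in names_by_lang.get(lang, set()):
                corpus.setdefault(name, set()).add(concept_id)
    return corpus

taxonomy = {
    "C1": {"en": {"nurse"}, "es": {"enfermera"}},
    "C2": {"en": {"software developer", "programmer"}},
}
# Monolingual setup: corpus languages = {query language}
mono = build_corpus(taxonomy, {"en"})
# Multilingual setup: a larger subset of supported languages
multi = build_corpus(taxonomy, {"en", "es"})
```
        </preformat>
        <p>Note that a surface form attached to more than one concept (polysemy) naturally maps to several concept ids under this representation.</p>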
        <p>[Figure 1: Histograms of the normalized edit distance between each query and its closest relevant corpus element, for a selection of monolingual tasks.]</p>
        <p>For further detail on the construction and composition of these datasets, as well as example queries and relevant corpus elements, please refer to Appendix A.</p>
        <p>The benchmark is intended to represent realistic use cases, such as linking mentions into a taxonomy, enriching a custom taxonomy with new synonyms for the existing concepts, or aligning two taxonomies. It is also intended to study the cross-lingual and multilingual capabilities of proposed systems. Using extra information for solving this task, such as context for the mentions or descriptions and examples for the taxonomy concepts, is out of the scope of this work but represents an interesting line of future research that can take advantage of the MELO Benchmark.</p>
        <p>To assess the lexical overlap between the surface forms in any national terminology and ESCO, we use the monolingual tasks, and measure the normalized edit distance between each query and the closest relevant corpus element. In Figure 1 we show a histogram with the distribution of such distances in a selection of tasks.</p>
        <p>The lexical overlap is considerable in some cases, like with the Danish terminology. In the histogram, a big concentration of examples in the left-most bin implies that many queries are lexically very close to their relevant corpus elements. This, in principle, would make these tasks easier to solve using simple lexical scoring functions. In Appendix A we explain the procedure used to compute the lexical distances and we also present the same analysis for every task in the benchmark.</p>
      </sec>
      <p><bold>5. Experiments</bold></p>
      <p>To demonstrate the MELO Benchmark in use, we study the performance of several models when evaluated on the tasks we defined above. We explore both simple lexical baselines and advanced deep learning models in a bi-encoder, zero-shot setting.</p>
      <p><bold>Lexical Baselines.</bold> We evaluate the following baselines: edit distance, word-level TF-IDF, word-level TF-IDF on lemmas, char-level TF-IDF, char-level TF-IDF on lemmas, BM25, and BM25 on lemmas. These models rely on surface-level text features.</p>
      <p><bold>Semantic Baselines.</bold> Additionally, we provide results for zero-shot evaluations using state-of-the-art deep learning models employed as symmetric bi-encoders. Under this setup, we use a sentence encoder to obtain a fixed-size representation for each surface form, and the score for a query and each corpus element is computed as the cosine similarity of their corresponding representations. This allows the system to capture deeper semantic relationships.</p>
      <p>We experiment with the following pre-trained models in a zero-shot setup, without fine-tuning or in-context examples: ESCOXLM-R [10], mUSE-CNN [36], a multilingual variant of MPNet [37], BGE-M3 [38], GIST-Embedding [39], Multilingual E5 [40], E5 [41, 42], and the model text-embedding-3-large from OpenAI⁴. This selection of models represents a spectrum of trade-offs between performance and model complexity. We refer the reader to Appendix B and Table 3 for further details on the models and the inference procedure.</p>
      <p>As described in Section 3, the goal of each task is to find the relevant concept in the taxonomy for the given query. Therefore, obtaining at least one surface form associated with the relevant concept at the top of the ranking is sufficient to achieve this goal. With that in mind, we use mean reciprocal rank (MRR) and top-k accuracy (A@k) as evaluation metrics. Due to space constraints, in Table 2 we present results in terms of MRR for a selected set of tasks, while the complete set of results is provided in Table 5 and Table 6 in Appendix C.</p>
      <p>In most monolingual datasets, the top-performing lexical baselines achieved MRR values ranging from 30% to 55%. Notably, in the French and Danish datasets⁵, these baselines performed extraordinarily well, in large part due to substantial lexical overlap, as indicated by the left-skewed distributions in Figure 4. In contrast, the Lithuanian, Norwegian, and Romanian datasets exhibited lower performance. Char-based TF-IDF variants deliver the highest performance among this group of baselines.</p>
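      <p>The two evaluation metrics can be implemented directly from their definitions; the following is a generic sketch, not necessarily the exact implementation released with the benchmark:</p>
      <preformat>
```python
def reciprocal_rank(ranked_forms: list, relevant: set) -> float:
    """1 / position of the highest-ranked relevant surface form (0 if none)."""
    for position, form in enumerate(ranked_forms, start=1):
        if form in relevant:
            return 1.0 / position
    return 0.0

def top_k_accuracy(ranked_forms: list, relevant: set, k: int) -> float:
    """1.0 if any relevant surface form appears among the top k, else 0.0."""
    return float(any(form in relevant for form in ranked_forms[:k]))

# Dataset-level MRR and A@k are the means of these per-query values.
ranking = ["infirmier", "nurse", "registered nurse"]
rr = reciprocal_rank(ranking, {"nurse", "registered nurse"})
```
      </preformat>
      <p>Because only the highest-ranked relevant surface form matters, a system is not penalized for ranking one synonym of the gold concept above another.</p>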
      <p>⁴https://openai.com/index/new-embedding-models-and-api-updates</p>
      <p>⁵Results for every dataset are presented in Appendix C.</p>
      <p>In a zero-shot setup, ESCOXLM-R performs poorly, even falling behind simple lexical baselines across both monolingual and cross-lingual datasets. This result is consistent with previous research that has shown that encoders trained with masked language modeling (MLM) objectives often struggle to produce effective sentence representations when directly evaluated as sentence encoders [43]. In contrast, the other bi-encoders evaluated in this study were specifically optimized for generating useful sentence embeddings, which explains their superior performance in these tasks.</p>
      <p>The mUSE-CNN model demonstrates fair performance on most monolingual tasks for languages included in its pre-training, especially when considering its relatively small model size and architecture type (see Table 3). However, as anticipated, its performance drops significantly for languages that were not included during pre-training. Furthermore, its performance falls below the lexical baselines in almost all datasets. This can be observed in Figure 2b.</p>
      <p>MPNet exhibits poor performance across all monolingual datasets, a surprising result given its larger model size, architecture type, and the fact that it was pre-trained on all the languages used in this experiment. Despite these advantages, it is generally outperformed by the smaller mUSE-CNN model, with the notable exception of the English datasets.</p>
      <p>BGE-M3 and Multilingual E5 have similar characteristics, as described in Table 3, and both deliver strong performance across most monolingual tasks. In these cases, they generally outperform all lexical baselines and smaller bi-encoders. However, in the English datasets, Multilingual E5 outperforms BGE-M3.</p>
      <p>GIST-Embedding demonstrates strong performance in English, outperforming many larger models. It also achieves reasonable results in most other languages, which is surprising considering its primary training was focused on English.</p>
      <p>E5, a significantly larger decoder-only model, outperforms the previously mentioned models across most tasks. This is also surprising since E5 was mainly trained in English. Finally, although limited details are publicly available about OpenAI's text-embedding-3-large model, its performance is generally on par with or even surpasses that of E5. OpenAI's model delivers the highest overall performance among all the models evaluated in our experiments.</p>
      <p>The performance of the models in each monolingual dataset is correlated with the lexical overlap in the dataset, as measured by the median of the distributions presented in Figure 4. As expected, lexical baselines exhibit a particularly strong correlation, with Spearman's coefficients of -0.74 for Char TF-IDF and -0.80 for Edit Distance. Interestingly, bi-encoders also demonstrate a moderate correlation, such as mE5 (-0.65) and OpenAI (-0.62). In Figure 2 we visualize this correlation, as well as the correlation between the lexical overlap and the difference in performance between some bi-encoders and a lexical baseline⁶. We observe that, the less lexical overlap in the dataset, the more the OpenAI model outperforms the lexical baseline.</p>
      <p>⁶The same figure is displayed in full size in Appendix C.</p>
      <p>Comparing the results of datasets USA-en-en and USA-en-xx, which share the same queries, we observe that most methods significantly enhance their performance when the corpus elements visible to the system are expanded to include multiple languages, surpassing their performance in the monolingual task. An implication of this is that, when linking mentions into a multilingual taxonomy, the surface forms in other languages are valuable even if the taxonomy includes entity names in the language of the query.</p>
      <p>Figure 2: Correlation between model performance and the median of the minimum edit distance between queries and relevant corpus elements in monolingual datasets. (a) Absolute performance (in MRR). (b) Performance relative to the lexical baseline Char TF-IDF.</p>
      <p>As expected, the performance drop when moving from monolingual to cross-lingual datasets (excluding O*NET) is significantly more pronounced for the lexical baselines than for the bi-encoders. The capacity for (zero-shot) cross-lingual EL of occupations varies for different models: ESCOXLM-R, MPNet, and GIST-Embedding exhibit very low cross-lingual performance; mUSE-CNN, BGE-M3, and Multilingual E5 demonstrate fair cross-lingual performance; while E5 and OpenAI achieve the highest cross-lingual performance.</p>
      <p>Since the techniques we experiment with (lexical scorers and bi-encoders) are commonly used for candidate generation in the first stage of EL [1, 2], it is interesting to measure the top-k accuracy (A@k) for different values of k to assess how well such techniques recover the first relevant item. Figure 3 presents these results for the same subset of tasks for the following systems: Edit Distance, Char-level TF-IDF, mUSE-CNN, and OpenAI. The complete set of A@k results is available in Appendix C, in Figure 6 and Figure 7. The results observed for top-k accuracy are consistent with those for mean reciprocal rank (MRR), particularly in terms of the relative ranking and comparative performance of the models.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Related Work</title>
      <sec id="sec-4-1">
        <p>There has been significant research interest in systems that normalize HR information into ESCO and other taxonomies. Decorte et al. [14] explore the extraction of ESCO skills from segmented job descriptions. They approach this problem as a massive multi-label classification task, and present a human-annotated evaluation set for it. More recently, Decorte et al. [17] approach the same problem from an EL perspective. They use a large language model (LLM) to produce synthetic annotations and train a bi-encoder to extract ESCO skills from job description segments. Finally, Zhang et al. [11] apply and compare two supervised EL methods for solving the same task: BLINK [2] and GENRE [44]. In contrast to these studies, our work focuses on occupations instead of skills, explores cross-lingual and multilingual scenarios, and the task as we formulate it does not use context for linking the query mentions.</p>
        <p>There has also been a substantial amount of research focused on occupations. Decorte et al. [20] developed an unsupervised approach to fine-tune BERT [45] to encode the semantics of occupation names. Furthermore, they create a dataset for the normalization of free-form English occupation names into ESCO and use it to evaluate their model. It has been reported that this dataset contains ambiguous input queries [20] as well as some mislabeled elements [46]. Closely related works by Zbib et al. [47] and by Bocharova et al. [48] propose alternative unsupervised representation learning schemes. They both release evaluation datasets, the former for occupation name ranking, and the latter for EL of unnormalized occupation names into ESCO.</p>
        <p>Lake [16] studies the application of bi-encoders and cross-encoders to EL of occupations into a custom taxonomy. Yamashita et al. [21] work on a normalization task for occupations, which closely resembles our formulation of EL. They create a non-public dataset by collecting a large number of unnormalized occupation names and then automatically mapping them to ESCO occupations via exact match after removing proper nouns. Vrolijk et al. [22] build a synthetic dataset for zero-shot evaluation and fine-tuning of several language models, using information from ESCO that includes the synonyms for each entity name, the relationships between entities, and their definitions. In particular, they use the set of name synonyms for each ESCO occupation to pose a binary relevance classification problem, where positive pairs involve two names belonging to the same synonym set.</p>
        <p>Two important use cases of the EL task under study are enriching and aligning taxonomies. In order to maintain up-to-date but well-curated taxonomies, it is common to automatically identify new candidate concepts to be included, and to use human annotators to validate their inclusion. Similarly, when aligning two taxonomies (i.e. building a crosswalk), it is common to use automatic systems to propose and explore candidate matches between the concepts in each taxonomy.</p>
        <p>Giabelli and colleagues [19] have worked on several approaches for enriching [49] and aligning [18, 50] taxonomies, using word embeddings to model concepts via their names, together with structural information about the taxonomy. All these methods automatically score candidates for inclusion or mapping, and can be used within a human-in-the-loop framework for further validation.</p>
        <p>During the creation of the crosswalk between O*NET and ESCO, the teams responsible for maintaining both taxonomies worked together to ensure a high-quality mapping [33]. Interestingly, they report employing a human-in-the-loop methodology where a fine-tuned BERT model [45] is used as a bi-encoder to rank the ESCO occupations for each O*NET occupation. They explore different methods for encoding each, leveraging occupation names (and synonyms) as well.</p>
        <p>More recently, the ESCO team presented an analysis [46] on a task that is very similar to the one we present here. They fine-tune an XLM-RoBERTa model [51] on HR-related data, including the textual information from ESCO, but with no supervision signal for any specific EL task. They then use this model as a bi-encoder to suggest ESCO occupations for elements taken from the national terminologies of Latvia, Spain, Sweden, and Italy, as well as from O*NET. Using the respective crosswalks, they evaluate this as an EL task. They explore monolingual and cross-lingual (to English) modalities. A key difference between this work and ours is that they consider any SKOS relationship as a legitimate annotation, while we only use exact and narrow matches. We also filter out semantically ambiguous queries for which experts determined that they should be related as an exact match to more than one ESCO concept. For those reasons, their results are not comparable to those we present in this work.</p>
        <p><bold>7. Conclusion</bold></p>
        <p>We have introduced the MELO Benchmark, a suite of 48 datasets for multilingual entity linking of occupations in 21 languages. We experimented with several out-of-the-box lexical and semantic baselines, demonstrating that there is still significant room for improvement. Our aim is that MELO will serve as a valuable resource for the research community, providing a standardized benchmark for assessing progress in multilingual EL within the HR domain, and fostering innovation and the development of new methodologies in this important area of research.</p>
        <p>In future work, several research directions could be explored. First, the current evaluation scheme can be extended to incorporate NIL prediction or prediction using entity descriptions rather than relying solely on entity names, with the presented source code being easily adaptable for such modifications. Second, domain-adapting or fine-tuning encoders specifically for this task, in a manner similar to ESCOXLM-R but optimized for semantic text similarity, presents another possible direction. Third, exploring advanced deep learning techniques beyond bi-encoders, such as cross-encoders combined with re-ranking stages, could enhance model performance. Finally, investigating the meta-learning paradigm, by dividing MELO tasks into meta-training and meta-testing tasks and applying meta-learning to solve the meta-testing tasks, exploiting the multi-lingual transfer capabilities of modern deep-learning models, offers another interesting direction for future work.</p>
        <p><bold>Acknowledgment</bold></p>
        <p>[…] Recommender Systems (2021). URL: https://ceur-ws.org/Vol-2967/paper_3.pdf</p>
        <p>[6] S. Tu, O. Cannon, Beyond Human-in-the-loop: Scaling Occupation Taxonomy at Indeed,</p>
The 2nd Workshop on Recommender
SysThis publication uses the ESCO classification of the Euro- tems for Human Resources (RecSys in HR’22),
pean Commission. We gratefully acknowledge the work in conjunction with the 16th ACM
Conferpdaotnieonbys ttahexotneoammyi,navsowlveeldl ains ctuhreatteianmgstrheesEpoSCnsOibOleccfuo-r ence on Recommender Systems (2022). URL:
https://recsyshr.aau.dk/wp-content/uploads/2022/
the O*NET-SOC 2019 taxonomy and the other national 09/RecSysHR2022-paper_2.pdf.
taxonomies used in this work. Furthermore, we would</p>
        <p>[7] S. Avlonitis, D. Lavi, M. Mansoury, D. Graus,
Caalso like to thank the teams responsible for creating thereer Path Recommendations for Long-term
Incrosswalks between ESCO and these taxonomies. come Maximization: A Reinforcement
Learning Approach, The 3rd Workshop on
RecReferences ommender Systems for Human Resources
(RecSys in HR’23), in conjunction with the 17th
[1] L. Logeswaran, M.-W. Chang, K. Lee, K. Toutanova, ACM Conference on Recommender Systems (2023).</p>
        <p>J. Devlin, H. Lee, Zero-Shot Entity Linking by URL: https://recsyshr.aau.dk/wp-content/uploads/
Reading Entity Descriptions, in: Proceedings of 2023/09/RecSysHR2023-paper_2.pdf.
the 57th Annual Meeting of the Association fo[8r] J.-J. Decorte, J. V. Hautte, J. Deleu, C.
DeComputational Linguistics, Association for Com- velder, T. Demeester, Career Path Prediction
usputational Linguistics, Florence, Italy, 2019, pp. ing Resume Representation Learning and
Skill3449–3460. URL: https://aclanthology.org/P19-1.335 based Matching, The 3rd Workshop on
Recdoi:10.18653/v1/P19-1335. ommender Systems for Human Resources
(Rec[2] L. Wu, F. Petroni, M. Josifoski, S. Riedel, L. Zettle- Sys in HR’23), in conjunction with the 17th
moyer, Scalable Zero-shot Entity Linking with ACM Conference on Recommender Systems (2023).
Dense Entity Retrieval, in: Proceedings of URL: https://recsyshr.aau.dk/wp-content/uploads/
the 2020 Conference on Empirical Methods in 2023/09/RecSysHR2023-paper_1.pdf.</p>
        <p>Natural Language Processing (EMNLP), Asso-[9] M. Zhang, K. Jensen, S. Sonniks, B. Plank, SkillSpan:
ciation for Computational Linguistics, Online, Hard and Soft Skill Extraction from English Job
2020, pp. 6397–6407. URL: https://aclanthology. Postings, in: Proceedings of the 2022 Conference
org/2020.emnlp-main.51 9. doi:10.18653/v1/2020. of the North American Chapter of the Association
emnlp-main.519. for Computational Linguistics: Human Language
[3] J. A. Botha, Z. Shan, D. Gillick, Entity Link- Technologies, Association for Computational
Lining in 100 Languages, in: Proceedings of guistics, Seattle, United States, 2022, pp. 4962–4984.
the 2020 Conference on Empirical Methods in URL: https://aclanthology.org/2022.naacl-main..366
Natural Language Processing (EMNLP), Asso- doi:10.18653/v1/2022.naacl-main.366.
ciation for Computational Linguistics, Onli[n10e], M. Zhang, R. van der Goot, B. Plank,
ESCOXLM2020, pp. 7833–7845. URL: https://aclanthology. R: Multilingual Taxonomy-driven Pre-training for
org/2020.emnlp-main.63 0. doi:10.18653/v1/2020. the Job Market Domain, in: Proceedings of
emnlp-main.630. the 61st Annual Meeting of the Association for
[4] X. Fu, W. Shi, X. Yu, Z. Zhao, D. Roth, Design Computational Linguistics (Volume 1: Long
PaChallenges in Low-resource Cross-lingual Entity pers), Association for Computational Linguistics,
Linking, in: Proceedings of the 2020 Conference Toronto, Canada, 2023, pp. 11871–11890. URL:
on Empirical Methods in Natural Language Pro- https://aclanthology.org/2023.acl-lon.gd.6o6i2:10.
cessing (EMNLP), Association for Computational 18653/v1/2023.acl-long.662.</p>
        <p>Linguistics, Online, 2020, pp. 6418–6432. URLh:ttps: [11] M. Zhang, R. van der Goot, B. Plank, Entity Linking
//aclanthology.org/2020.emnlp-main.5.21doi:10. in the Job Market Domain, in: Findings of the
18653/v1/2020.emnlp-main.521. Association for Computational Linguistics: EACL
[5] M. de Groot, J. Schutte, D. Graus, Job Posting- 2024, Association for Computational Linguistics,
Enriched Knowledge Graph for Skills-based Match- St. Julian’s, Malta, 2024, pp. 410–419. URhLt:tps:
ing, The 1st Workshop on Recommender Systems //aclanthology.org/2024.findings-eac.l.28
for Human Resources (RecSys in HR’21), in con-[12] E. Senger, M. Zhang, R. van der Goot, B. Plank,
junction with the 15th ACM Conference on Recom- Deep Learning-based Computational Job Market</p>
        <p>Analysis: A Survey on Skill Extraction and
Classiifcation from Job Postings, in: Proceedings of the velder, JobBERT: Understanding Job Titles
First Workshop on Natural Language Processing through Skills, FEAST, ECML-PKDD 2021
Workfor Human Resources (NLP4HR 2024), Association shop (2021). URL: https://feast-ecmlpkdd.github.io/
for Computational Linguistics, St. Julian’s, Malta, archive/2021/papers/FEAST2021_paper_6.pd. f
2024, pp. 1–15. URL: https://aclanthology.org/2024[2.1] M. Yamashita, J. T. Shen, T. Tran, H. Ekhtiari,
nlp4hr-1.1. D. Lee, JAMES: Normalizing Job Titles with
Multi[13] M. Zhang, K. N. Jensen, B. Plank, Kompetencer: Aspect Graph Embeddings and Reasoning, in: 2023
Fine-grained Skill Classification in Danish Job Post- IEEE 10th International Conference on Data
Sciings via Distant Supervision and Transfer Learn- ence and Advanced Analytics (DSAA), 2023, pp.
ing, in: Proceedings of the Thirteenth Language 1–10. URL: https://arxiv.org/abs/2202.107.3d9oi:10.
Resources and Evaluation Conference, European 1109/DSAA60987.2023.10302559.</p>
        <p>Language Resources Association, Marseille, Fran[c2e2,] J. Vrolijk, D. Graus, Enhancing PLM Performance
2022, pp. 436–447. URL: https://aclanthology.org/ on Labour Market Tasks via Instruction-based
2022.lrec-1.46. Finetuning and Prompt-tuning with Rules,
[14] J.-J. Decorte, J. V. Hautte, J. Deleu, C. De- The 3rd Workshop on Recommender
Sysvelder, T. Demeester, Design of Negative Sam- tems for Human Resources (RecSys in HR’23),
pling Strategies for Distantly Supervised Skill in conjunction with the 17th ACM
ConferExtraction, The 2nd Workshop on Recom- ence on Recommender Systems (2023). URL:
mender Systems for Human Resources (RecSys https://recsyshr.aau.dk/wp-content/uploads/2023/
in HR’22), in conjunction with the 16th ACM 09/RecSysHR2023-paper_4.pdf.</p>
        <p>Conference on Recommender Systems (2022).[23] F. Zhu, J. Yu, H. Jin, L. Hou, J. Li, Z. Sui, Learn
URL: https://recsyshr.aau.dk/wp-content/uploads/ to Not Link: Exploring NIL Prediction in Entity
2022/09/RecSysHR2022-paper_4.pdf. Linking, in: Findings of the Association for
[15] M. Zhang, K. N. Jensen, R. van der Goot, B. Plank, Computational Linguistics: ACL 2023, Association
Skill Extraction from Job Postings using Weak for Computational Linguistics, Toronto, Canada,
Supervision, The 2nd Workshop on Recom- 2023, pp. 10846–10860. URL: https://aclanthology.
mender Systems for Human Resources (RecSys org/2023.findings-acl.69.0doi:10.18653/v1/2023.
in HR’22), in conjunction with the 16th ACM findings-acl.690.</p>
        <p>Conference on Recommender Systems (2022).[24] N. Gupta, S. Singh, D. Roth, Entity Linking via
URL: https://recsyshr.aau.dk/wp-content/uploads/ Joint Encoding of Types, Descriptions, and
Con2022/09/RecSysHR2022-paper_10.pdf. text, in: Proceedings of the 2017 Conference
[16] T. Lake, Flexible Job Classification with Zero- on Empirical Methods in Natural Language
ProShot Learning, The 2nd Workshop on Rec- cessing, Association for Computational Linguistics,
ommender Systems for Human Resources (Rec- Copenhagen, Denmark, 2017, pp. 2681–2690. URL:
Sys in HR’22), in conjunction with the 16th https://aclanthology.org/D17-1.2d8o4i:10.18653/
ACM Conference on Recommender Systems (2022). v1/D17-1284.</p>
        <p>URL: https://recsyshr.aau.dk/wp-content/upload[2s/5] Z. Zheng, F. Li, M. Huang, X. Zhu, Learning to
2022/09/RecSysHR2022-paper_8.pdf. Link Entities with Knowledge Base, in: Human
[17] J.-J. Decorte, S. Verlinden, J. V. Hautte, J. Deleu, Language Technologies: The 2010 Annual
ConferC. Develder, T. Demeester, Extreme Multi-Label ence of the North American Chapter of the
AssociaSkill Extraction Training using Large Language tion for Computational Linguistics, Association for
Models, 2023. URL: https://arxiv.org/abs/2307. Computational Linguistics, Los Angeles, California,
10778. arXiv:2307.10778. 2010, pp. 483–491. URL: https://aclanthology.org/
[18] A. Giabelli, L. Malandri, F. Mercorio, M. Mez- N10-1072.</p>
        <p>zanzanica, WETA: Automatic taxonomy[26] R. J. Brachman, What IS-A Is and Isn’t: An
Analalignment via word embeddings, Com- ysis of Taxonomic Links in Semantic Networks,
puters in Industry 138 (2022) 103626. URL: Computer 16 (1983) 30–36. doi1:0.1109/MC.1983.
https://www.sciencedirect.com/science/ 1654194.
article/pii/S016636152200021.5 doi:https: [27] M. le Vrang, A. Papantoniou, E. Pauwels, P. Fannes,
//doi.org/10.1016/j.compind.2022.103626. D. Vandensteen, J. De Smedt, ESCO: Boosting Job
[19] A. Giabelli, Integrating Word Embeddings and Tax- Matching in Europe with Semantic Interoperability,
onomy Learning for Enhanced Lexical Domain Computer 47 (2014) 57–64. doi1:0.1109/MC.2014.
Modelling, Phd thesis, Università degli Studi di 283.</p>
        <p>Milano-Bicocca, 2024. [28] European Commission, ESCO Handbook:
Eu[20] J.-J. Decorte, J. V. Hautte, T. Demeester, C. De- ropean Skills, Competences, Qualifications and
Occupations, Technical Report, European Union, c3a690be93aa602ee2dc0ccab5b7b67e-Paper.p.df
2019. URL: https://esco.ec.europa.eu/system/files[/38] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian,
2021-07/Handbook.pd.f Z. Liu, BGE M3-Embedding: Multi-Lingual,
[29] European Commission, ESCO Terminologi- Multi-Functionality, Multi-Granularity Text
cal Guidelines, Technical Report, European Embeddings Through Self-Knowledge Distillation,
Union, 2021. URL: https://esco.ec.europa. 2024. URL: https://arxiv.org/abs/2402.032.16
eu/en/about-esco/publications/publication/ arXiv:2402.03216.</p>
        <p>esco-terminological-guideli.nes [39] A. V. Solatorio, GISTEmbed: Guided In-sample
Se[30] E. C. Dierdorf, D. W. Drewes, J. J. Norton, O*NET lection of Training Negatives for Text Embedding
Tools and Technology: A Synopsis of Data De- Fine-tuning, 2024. URLh:ttps://arxiv.org/abs/2402.
velopment Procedures, Technical Report, North 16829. arXiv:2402.16829.</p>
        <p>Carolina State University, 2006. UhRtLt: ps://www. [40] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder,
onetcenter.org/dl_files/T2Development..pdf F. Wei, Multilingual E5 Text Embeddings: A
Tech[31] M. J. Handel, The O*NET Content Model: nical Report, 2024. URLh:ttps://arxiv.org/abs/2402.</p>
        <p>Strengths and Limitations, Journal for Labour 05672. arXiv:2402.05672.</p>
        <p>Market Research 49 (2016) 157–176. do10i:.1007/ [41] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder,
s12651-016-0199-8. F. Wei, Improving Text Embeddings with Large
[32] W. Paulus, B. Matthes, Klassifikation der Language Models, 2023. URLh: ttps://arxiv.org/abs/
Berufe: Struktur, Codierung und Um- 2401.00368. arXiv:2401.00368.
steigeschlüssel, Technical Report, Bun[4-2] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang,
desagentur für Arbeit, 2013. URLh:ttps: D. Jiang, R. Majumder, F. Wei, Text
Embed//doku.iab.de/fdz/reporte/2013/MR_08-13.p.df dings by Weakly-Supervised Contrastive
Pre[33] European Commission, The Crosswalk training, 2022. URL:https://arxiv.org/abs/2212.</p>
        <p>Between ESCO and O*NET, Technical 03533. arXiv:2212.03533.</p>
        <p>Report, European Union, 2022. URL:[43] B. Li, H. Zhou, J. He, M. Wang, Y. Yang, L. Li, On the
https://esco.ec.europa.eu/system/files/2022-12/ Sentence Embeddings from Pre-trained Language
ONET%20ESCO%20Technical%20Report.p.df Models, in: Proceedings of the 2020 Conference
[34] European Commission, ESCO implementation man- on Empirical Methods in Natural Language
Proual, Technical Report, European Union, 2018. URL: cessing (EMNLP), Association for Computational
https://esco.ec.europa.eu/system/files/2021-07/ Linguistics, Online, 2020, pp. 9119–9130. URLh:ttps:
425b7a5f-3048-4377-a816-5402c00e9a9505_A_ //aclanthology.org/2020.emnlp-main.7.33doi:10.</p>
        <p>Annex_Draft_ESCO_Implementation_manual.pdf 18653/v1/2020.emnlp-main.733.
[35] A. Miles, S. Bechhofer, SKOS Simple Knowledge[44] N. D. Cao, G. Izacard, S. Riedel, F. Petroni,
AuOrganization System Reference, W3C Recommen- toregressive Entity Retrieval, International
Condation, World Wide Web Consortium, 2009. URL: ference on Learning Representations (2021). URL:
https://www.w3.org/TR/skos-referenc,ew/3C Rec- https://openreview.net/forum?id=5k8F6UU3.9V
ommendation. [45] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
[36] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Con- Pre-training of Deep Bidirectional Transformers
stant, G. Hernandez Abrego, S. Yuan, C. Tar, Y.-h. for Language Understanding, in: Proceedings
Sung, B. Strope, R. Kurzweil, Multilingual Univer- of the 2019 Conference of the North American
sal Sentence Encoder for Semantic Retrieval, in: Chapter of the Association for Computational
LinProceedings of the 58th Annual Meeting of the guistics: Human Language Technologies, Volume
Association for Computational Linguistics: Sys- 1 (Long and Short Papers), Association for
Comtem Demonstrations, Association for Computa- putational Linguistics, Minneapolis, Minnesota,
tional Linguistics, Online, 2020, pp. 87–94. URL: 2019, pp. 4171–4186. URL: https://aclanthology.org/
https://aclanthology.org/2020.acl-demo. sd.1o2i:10. N19-1423. doi:10.18653/v1/N19-1423.
18653/v1/2020.acl-demos.12. [46] European Commission, Machine Learning Assisted
[37] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MP- Mapping of Multilingual Occupational Data to
Net: Masked and Permuted Pre-training for ESCO, Technical Report, European Union, 2022.
Language Understanding, in: H. Larochelle, URL: https://shorturl.at/REc. Dd
M. Ranzato, R. Hadsell, M. Balcan, H. Li[n47] R. Zbib, L. A. Lacasa, F. Retyk, R. Poves,
(Eds.), Advances in Neural Information Pro- J. Aizpuru, H. Fabregat, V. Šimkus, E.
Garcíacessing Systems, volume 33, Curran Associates, Casademont, Learning Job Titles Similarity from
Inc., 2020, pp. 16857–16867. URL: https:// Noisy Skill Labels, FEAST, ECML-PKDD 2021
Workproceedings.neurips.cc/paper_files/paper/2020/file/ shop (2022). URL: https://feast-ecmlpkdd.github.io/
archive/2022/papers/FEAST2022_paper_4972.pd. f A. Details on the Datasets
[48] M. Bocharova, E. Malakhov, V. Mezhuyev,
VacancySBERT: the approach for representation of titWleesrelease the source code used to build the da7t,asets
and skills for semantic similarity search in theprroev-iding researchers with a tool to easily generate new
cruitment domain, Applied Aspects of Informad-atasets by combining diferent sets of languages for
tion Technology 6 (2023) 52–59. URLh:ttps://aait.od.query and corpus elements. Using this code, new
instanua/index.php/journal/article/view/161/.2d1o2i:10. tiations of the task can be derived from the input data by
15276/aait.06.2023.4. defining custom language combinations. For example, it
[49] A. Giabelli, L. Malandri, F. Mercorio, M. Mezzains-possible to use the Italian national terminology to set
zanica, A. Seveso, NEO: A Tool for Taxonomy En-up an Italian-to-Greek cross-lingual task, or even
comrichment with New Emerging Occupations, in: Thebine the query sets of several national classifications and
Semantic Web – ISWC 2020, Springer Internationalelverage all languages in ESCO to create a more complex
Publishing, Cham, 2020, pp. 568–584. multilingual task.
[50] A. Giabelli, L. Malandri, F. Mercorio, M. Mez-The input data consists of files with the multilingual
zanzanica, JoTA: Aligning Multilingual Job TaExS-CO Occupations taxonomy (one for each relevant
veronomies through Word Embeddings (Student Abs-ion) and files containing the queries in each national
stract), Proceedings of the AAAI Conferentceerminology, which are mapped to the ESCO concept ID
on Artificial Intelligence 36 (2022) 12955–12956o.f the relevant occupation. To create a dataset, the user
URL: https://ojs.aaai.org/index.php/AAAI/articlce/an select a national terminology and a set of languages
view/21614. doi:10.1609/aaai.v36i11.21614. for the corpus (any subset of the languages supported by
[51] A. Conneau, K. Khandelwal, N. Goyal, V. ChaudE-SCO).</p>
        <p>hary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, In Table4 we present example queries and their
releL. Zettlemoyer, V. Stoyanov, Unsupervisedvant corpus elements, sampled from the NLD-nl-nl,
PRTCross-lingual Representation Learning at Scpatl-ep,t, and PRT-pt-en datasets.
in: Proceedings of the 58th Annual Meet-Finally, we analyze the lexical overlap between the
naing of the Association for Computational Ltinio-nal classifications and ESCO. In Figu4r,ewe present a
guistics, Association for Computational Linguhiiss-togram showing the normalized edit distance between
tics, Online, 2020, pp. 8440–8451. URL:https:// queries and their closest relevant corpus element, for all
aclanthology.org/2020.acl-main.7d4o7i:10.18653/ the tasks in MELO.</p>
        <p>v1/2020.acl-main.747. To compute the distances, we first lowercase the
sur[52] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco,face forms of both the query and the corpus element,
R. St. John, N. Constant, M. Guajardo-Cespedeas,nd we use the methordatio from the Python package
S. Yuan, C. Tar, B. Strope, R. Kurzweil, Univerr-apidfuzz 8. This is a measure of the normalized edit
sal Sentence Encoder for English, in: Proceeddis-tance between the two strings. In the histograms, for
ings of the 2018 Conference on Empirical Methodesach query, we compute the distance for all its relevant
in Natural Language Processing: System Demonco-rpus elements and report the minimum distance.
strations, Association for Computational LinguiIsn- the histograms, the left-most bin represents the
fractics, Brussels, Belgium, 2018, pp. 169–174. URL: tion of queries for which the closest relevant element is
https://aclanthology.org/D18-2.0d2o9i:10.18653/ either identical or very similar. The Danish national
terv1/D18-2029. minology has the highest concentration of such cases. To
[53] N. Reimers, I. Gurevych, Sentence-BERT: Sentencea lesser extent, this is also true for Hungarian, Estonian,
Embeddings using Siamese BERT-Networks, in:and Polish.</p>
        <p>Proceedings of the 2019 Conference on Empirical Excluding those lexically trivial cases, the more the
Methods in Natural Language Processing and tdhisetribution is skewed to the left, the easier the task. For
9th International Joint Conference on NaturaleLxaanm-ple, comparing the Belgian (in the French language)
guage Processing (EMNLP-IJCNLP), Associationand the French tasks, the queries from the French
termifor Computational Linguistics, Hong Kong, Chinnao, logy show greater lexical overlap with their relevant
2019, pp. 3982–3992. URL: https://aclanthology.orgc/orpus elements.</p>
        <p>D19-1410. doi:10.18653/v1/D19-1410. In Appendix C, we use this analysis to compare the
performance of lexical baselines across diferent
monolingual tasks.
7https://github.com/Avature/melo-benchmark
8https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html
BGR-bg-bg</p>
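<p>The lexical-overlap statistic described above can be sketched in plain Python. The following is an illustration rather than the benchmark's actual code: it replaces rapidfuzz's ratio with an explicit LCS-based indel similarity (which, divided by 100, is what fuzz.ratio computes); function names are our own.</p>

```python
def indel_similarity(a: str, b: str) -> float:
    """Normalized indel similarity in [0, 1] between two lowercased
    strings, based on the length of their longest common subsequence
    (the quantity underlying rapidfuzz's fuzz.ratio)."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    # LCS length via dynamic programming (one row at a time).
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            cur.append(prev[j] + 1 if ca == cb else max(prev[j + 1], cur[j]))
        prev = cur
    lcs = prev[-1]
    return 2.0 * lcs / (len(a) + len(b))

def min_distance_to_relevant(query: str, relevant: list) -> float:
    """Minimum normalized edit distance from a query to its relevant
    corpus elements, as reported in the histograms."""
    return min(1.0 - indel_similarity(query, c) for c in relevant)
```

<p>Taking the minimum over the relevant elements, as above, is what places a query in the left-most histogram bin when at least one relevant surface form is (near-)identical.</p>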
      </sec>
    </sec>
    <sec id="sec-5">
      <title>B. Details on the Models</title>
<p>Here, we provide further details about the models explored in this work.</p>
      <p>Regarding the lexical baselines, we always apply a simple preprocessing in which we lowercase the input strings and, for all languages except Bulgarian, also perform ASCII normalization. For the edit distance baseline, we use rapidfuzz as described above. For the TF-IDF baselines, we use the scikit-learn<sup>9</sup> Python package, while for the BM25 variants, we use the Okapi BM25 implementation from rank-bm25<sup>10</sup>.</p>
      <p>For the baseline variants that involve lemmatization, we use spacy<sup>11</sup> models whenever available. However, spacy models were not available for the following languages: Bulgarian, Czech, Estonian, Hungarian, Latvian, and Slovak. Lemmatization is applied before ASCII normalization.</p>
      <p>In the case of bi-encoders, we experiment with several deep learning sentence encoders that have demonstrated strong performance in other semantic text similarity tasks.</p>
      <p>The first model is ESCOXLM-R, proposed by Zhang et al. [10], which is based on XLM-RoBERTa. We use the PyTorch implementation and the pre-trained weights available on HuggingFace under the model name jjzha/esco-xlm-roberta-large. The base model was pre-trained on data in 88 languages, including all those involved in our datasets, and the fine-tuning by Zhang and colleagues involved learning objectives that leverage information in ESCO. Although it is usual to experiment with the XLM-RoBERTa family of models only after fine-tuning, in our experiments we use it out-of-the-box in a zero-shot setup. During inference, the input to the model is the surface form of the query or the corpus element, with no preprocessing.</p>
      <p>We also present results for the Multilingual Universal Sentence Encoder (mUSE-CNN) model variant with a CNN architecture, proposed by Cer et al. [52, 36]. In our experiments, we use the TensorFlow implementation and the pre-trained weights available on TensorFlow Hub under the handle google/universal-sentence-encoder-multilingual/3. This model was pre-trained on data in Arabic, Chinese, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, and Russian. (Note that, during training, mUSE-CNN has not seen text for languages such as Bulgarian, Czech, or Danish.) During inference, the input to the model is the surface form of the query or the corpus element, without any preprocessing or enclosing prompt template.</p>
      <p>Other open-source models we experiment with are implemented in PyTorch within the HuggingFace package sentence-transformers [53]. These models are the following: a multilingual model based on MPNet [37] that was pre-trained on 50 languages, including all of the MELO languages<sup>12</sup>; the BGE-M3 model [38], which supports more than 100 languages, also including all MELO languages<sup>13</sup>; GIST Embedding [39], a model reported to be primarily trained in English<sup>14</sup>; Multilingual E5 [40], which was pre-trained on 94 languages, including all of the MELO languages<sup>15</sup>; and E5 [41, 42], pre-trained on many languages but reported to perform best on English-language input<sup>16</sup>.</p>
      <p>Finally, we also experiment with the text-embedding-3-large model from OpenAI<sup>17</sup>, which is reported to be state-of-the-art for many semantic text similarity tasks.</p>
      <p>9: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html; 10: https://pypi.org/project/rank-bm25/; 11: https://spacy.io/api/lemmatizer; 12: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2; 13: https://huggingface.co/BAAI/bge-m3; 14: https://huggingface.co/avsolatorio/GIST-Embedding-v0; 15: https://huggingface.co/intfloat/multilingual-e5-large; 16: https://huggingface.co/intfloat/e5-mistral-7b-instruct</p>
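<p>To illustrate how a character-level lexical baseline of this kind operates, the sketch below implements a minimal character-trigram TF-IDF ranker with cosine scoring in plain Python. The experiments use scikit-learn's TfidfVectorizer; the trigram size, padding, and weighting details here are illustrative assumptions, not the exact configuration.</p>

```python
import math
from collections import Counter

def trigrams(text: str):
    """Lowercased character 3-grams of a space-padded string."""
    t = " " + text.lower() + " "
    return [t[i : i + 3] for i in range(len(t) - 2)]

class CharTfidfRanker:
    """Minimal char-trigram TF-IDF + cosine ranker (illustrative)."""

    def __init__(self, corpus):
        self.corpus = corpus
        docs = [Counter(trigrams(c)) for c in corpus]
        n = len(docs)
        df = Counter()
        for d in docs:
            df.update(d.keys())
        # Smoothed idf, following scikit-learn's default formulation.
        self.idf = {g: math.log((1 + n) / (1 + df[g])) + 1 for g in df}
        self.doc_vecs = [self._vec(d) for d in docs]

    def _vec(self, counts):
        v = {g: tf * self.idf.get(g, 0.0) for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {g: w / norm for g, w in v.items()}

    def rank(self, query: str):
        """Corpus indices sorted by descending cosine similarity."""
        q = self._vec(Counter(trigrams(query)))
        scores = [
            sum(w * d.get(g, 0.0) for g, w in q.items())
            for d in self.doc_vecs
        ]
        return sorted(range(len(scores)), key=lambda i: -scores[i])
```

<p>Semantic bi-encoders follow the same retrieve-by-score pattern, with the sparse trigram vectors replaced by dense sentence embeddings.</p>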
      <p>For HuggingFace and OpenAI models, during
inference, we wrap the input text (the surface form of the
query or corpus element) with the following prompt
template:</p>
      <p>The candidate’s job title is “{{surf_form}}”. What skills are likely required for this job?</p>
      <p>where {{surf_form}} is replaced with the surface form of the element that is being encoded.</p>
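<p>Applying this template is a plain string substitution; the helper name below is ours, and we use a single-brace format field to stand in for the {{surf_form}} placeholder:</p>

```python
PROMPT_TEMPLATE = (
    "The candidate\u2019s job title is \u201c{surf_form}\u201d. "
    "What skills are likely required for this job?"
)

def wrap_surface_form(surf_form: str) -> str:
    """Wrap a query or corpus surface form in the prompt template
    used for the HuggingFace and OpenAI encoders."""
    return PROMPT_TEMPLATE.format(surf_form=surf_form)
```
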
      <p>This decision was informed by preliminary experiments in which we evaluated various models with different wrapping prompt templates, including no template (as with ESCOXLM-R and mUSE-CNN). We speculate that such prompts are particularly beneficial for LLM-based encoders, as they may better capture the semantics of the occupation names we aim to rank.</p>
      <p>Although we also experimented with prompts in the
same language as each query, this did not improve
performance. Consistently using a single prompt ensures a
language-agnostic and symmetric bi-encoder approach.</p>
    </sec>
    <sec id="sec-6">
      <title>C. Full Results</title>
      <sec id="sec-6-1">
        <p>This section presents the full set of experimental results. Table 5 and Table 6 include the mean reciprocal rank (MRR) for each model across all tasks in MELO.</p>
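<p>MRR as reported here can be computed from each query's ranked candidate list; a minimal sketch, with illustrative names:</p>

```python
def reciprocal_rank(ranked_ids, relevant_ids) -> float:
    """1 / rank of the highest-ranked relevant corpus element,
    or 0.0 if no relevant element appears in the ranking."""
    relevant = set(relevant_ids)
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs) -> float:
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```
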
        <p>Although not included with the main results, we also evaluated a random baseline for each dataset, where the score s(q, c) for any query q and any corpus element c is drawn from a uniform distribution. The performance of this baseline varies depending on the number of corpus elements and the distribution of relevant elements per query, but in general, its MRR is close to 0.020.</p>
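<p>As a sanity check (our own derivation, not from the results above): when each query has exactly one relevant element whose rank under a random scorer is uniform over 1..n, the expected MRR is the harmonic number H_n divided by n. With multiple relevant elements per query, or smaller corpora, the value rises, which is consistent with the variability noted above.</p>

```python
def expected_random_mrr(n: int) -> float:
    """Expected MRR of a uniformly random scorer when each query has
    exactly one relevant element among n corpus elements: H_n / n."""
    return sum(1.0 / k for k in range(1, n + 1)) / n
```
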
        <p>Additionally, Figure 5 shows scatterplots illustrating the correlation between model performance and the median of the lexical overlap index described in Appendix A: the minimum normalized edit distance per query.</p>
        <p>Finally, in Figure 6 and Figure 7 we show the top-k accuracy (A@k) for a selection of models in every task in MELO.</p>
        <p>17: https://openai.com/index/new-embedding-models-and-api-updates</p>
        <p>[Figure 5: scatterplots of per-task performance (MRR) against median lexical distance, for mUSE-CNN and OpenAI, with points labeled by country code. (a) Absolute performance (in MRR). (b) Performance relative to the lexical baseline Char TF-IDF, over queries and corpus elements in monolingual datasets.]</p>
        <p>[Figures 6 and 7: A@k curves per task (including USA-en-xx, NLD-nl-en, NOR-no-en, and POL-pl-en) for the models OpenAI, mUSE-CNN, Char TF-IDF, and Edit Distance. Panels: (a) results for the tasks of Latvia, the Netherlands, Norway, Poland, and Portugal; (b) Czechia, Germany, Denmark, Spain, and Estonia; (c) France, Croatia, Hungary, Italy, and Lithuania.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>