<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Retyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Gascó</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Casimiro Pio Carrino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Deniz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rabih Zbib</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Entity Linking</institution>
          ,
          <addr-line>Entity Normalization, Taxonomy Alignment, Cross-lingual, Multilingual</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proceedings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>4</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using high-quality, pre-existing human annotations. We conduct experiments with simple lexical models and general-purpose sentence encoders, evaluated as bi-encoders in a zero-shot setup, to establish baselines for future research. The datasets and source code for standardized evaluation are publicly available at https://github.com/Avature/melo-benchmark.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The current trend in the digital transformation of human resources (HR) processes is the integration of artificial intelligence (AI) components that can improve automation and operational efficiency. These systems often need to process input data in the form of natural language text, which can be noisy and diverse in terms of language and other domain-specific aspects.</p>
      <p>Previous research in the application of AI within the HR domain has made extensive use of taxonomies, such as occupation and skill classifications [5, 6, 7, 8, 9, 10]. These HR-specific taxonomies have been used for normalizing raw data [11, 12, 13, 14, 15, 16, 17], removing noise and enabling AI models to operate on standardized information, which in turn leads to more accurate and reliable outcomes. Substantial progress has been made, particularly in the normalization of occupational information, which is key to ensuring the consistency and effectiveness of digitalized HR systems in a global setting. Despite these developments, there is still a surprising lack of high-quality public evaluation benchmarks for measuring progress consistently in this important area.</p>
      <p>To address this gap, we propose the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets designed to evaluate multilingual EL tasks. This benchmark leverages pre-existing, high-quality human annotations and covers 21 languages. The task is framed as a ranking problem, where queries and corpus elements are occupation names taken from a source and a target taxonomy, respectively, and binary-relevance annotations are derived from high-quality crosswalks between the taxonomies. Additionally, we release code for standardizing the evaluation of models on this benchmark.</p>
      <p>We provide experimental results for both simple lexical systems and state-of-the-art deep learning models evaluated as zero-shot bi-encoders on MELO, to serve as baselines for future research. We find that, while the lexical baselines perform fairly well, the semantic baselines generally achieve better results, particularly in cross-lingual tasks. However, there remains significant room for improvement.</p>
      <p>RecSys in HR’24: The 4th Workshop on Recommender Systems for Human Resources, in conjunction with the 18th ACM Conference on Recommender Systems. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>To the best of our knowledge, MELO is the first public evaluation benchmark to address the task of multilingual entity linking in the HR domain. In this Section, we introduce the context necessary for understanding the subsequent task definition (§3), the methodology employed in constructing the benchmark (§4), and the related work (§6).</p>
      <p><bold>Entity Linking.</bold> Given a knowledge base E and a query mention q, the task of Entity Linking (EL) involves identifying the correct entity e ∈ E to which the mention is referring. In principle, the structure of the knowledge base E can range from a flat catalog of unrelated entities to a complex and heterogeneous ontology. In this work we focus on taxonomies of a single type of entity (i.e. occupations).</p>
      <p>Obtaining annotated data for training such systems is costly, particularly for tasks involving custom taxonomies or low-resource languages [4]. To mitigate this problem, many techniques have been proposed for leveraging transfer learning to obtain good performance in zero-shot EL scenarios [1, 2]. State-of-the-art methods typically use a bi-encoder for the candidate generation stage, and a cross-encoder for the re-ranking stage.</p>
      <p><bold>Multilingual Taxonomies.</bold> For the purposes of this work, we define a taxonomy E as a directed acyclic graph (DAG) where nodes are concepts and edges represent binary IS-A relationships [26] between concepts. The tail concept (child) is a hyponym of the head concept (parent) and therefore represents a narrower meaning. Conversely, the parent is a hypernym of the child and represents a broader meaning, i.e. a category to which the child belongs. Concepts are allowed to have many parents.</p>
      <p>Inspired by the multilingual formulation proposed by Botha et al. [3], we consider each entity as a language-agnostic concept with associated language-specific textual information. For each language in a set of supported languages L, any entity may have a set of names (synonymous with each other), a description, and example sentences where the concept is used. The query q is a text string in some language l_q, with no prior assumptions about the relationship between l_q and the set L of supported languages¹.</p>
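      <p>To make this formulation concrete, the following minimal sketch (with illustrative identifiers and names, not taken from ESCO's actual data model) represents each entity as a language-agnostic concept carrying one synonym set per supported language and possibly several parents:</p>
      <preformat>
```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A language-agnostic taxonomy entity with per-language name sets."""
    concept_id: str
    # language code -> set of names; all names in one language are synonyms
    names: dict[str, set[str]] = field(default_factory=dict)
    # IS-A edges; a concept is allowed to have many parents
    parents: list[str] = field(default_factory=list)

nurse = Concept(
    concept_id="occ-001",  # hypothetical identifier
    names={"en": {"nurse", "registered nurse"},
           "es": {"enfermero", "enfermera"}},
    parents=["occ-root"],
)
```
      </preformat>
      <p>A query is then just a text string in some language l_q, matched against these name sets with no assumption that l_q is among the languages for which the concept has names.</p>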
      <p>In a multilingual taxonomy, concepts are language-agnostic but they have language-specific properties, such as a set of names, a description, or usage examples. In other words, every concept has one set of names for each language supported in the taxonomy. The names of a concept in a given language are considered synonyms of each other. If a lexical entry is attached to more than one concept, this implies polysemy.</p>
      <p>In principle, the system may receive a query mention that refers to an entity that does not exist in the target taxonomy, or that does not refer to any entity at all. This problem, known as out-of-KB or NIL prediction [23], falls outside the scope of this work.</p>
      <p><bold>Occupation Taxonomies.</bold> Several public occupation taxonomies were developed to classify, standardize, and organize information related to job titles and roles found in the workforce.</p>
      <p>One popular and influential occupations taxonomy is the European Skills, Competences, Qualifications, and Occupations (ESCO) ontology, a collection of multilingual and interrelated taxonomies created and maintained by the European Union [27, 28]. It includes 3,039 occupation concepts in its latest version, each with names and definitions (descriptions) in 28 languages. Every concept has one or more names in every supported language. The names are compliant with the terminological guidelines defined by ESCO [29]. All the names of a particular concept in a particular language are considered synonyms with each other. Also, for a particular concept, the language-specific name sets can be considered parallel data from a translation point of view.</p>
      <p>Additionally, it is typical in the EL community to allow the system to know the textual context in which the mention occurs, aiding in the resolution of ambiguity [24]. This aspect is also beyond the scope of our work, as the data we use to build our datasets only includes unnormalized occupation names as queries.</p>
      <p>Entity linking can be framed as a ranking task [25]: given a query q, the system produces a score s(q, e) for each e ∈ E, and the predicted entity ê(q) is computed as ê(q) = argmax_{e ∈ E} s(q, e).</p>
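      <p>This ranking formulation can be sketched in a few lines. The scoring function below is a deliberately simple character-trigram overlap standing in for s(q, e); it is only an illustration, not one of the baselines evaluated later:</p>
      <preformat>
```python
def trigrams(s: str) -> set:
    """Set of character trigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 0))}

def score(q: str, e: str) -> float:
    """Toy s(q, e): Jaccard overlap of character trigrams."""
    a, b = trigrams(q), trigrams(e)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def predict(query: str, corpus: dict) -> str:
    """Compute the argmax of s(q, e) over all corpus surface forms and
    return the concept id attached to the best-scoring form."""
    best_form = max(corpus, key=lambda form: score(query, form))
    return corpus[best_form]

corpus = {"data scientist": "C1", "software developer": "C2", "nurse": "C3"}
prediction = predict("sofware developper", corpus)  # misspelled query
```
      </preformat>
      <p>In a two-stage system, a cheap scorer like this one would produce the candidate-generation ranking, and a more expensive model would then re-rank the top elements.</p>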
      <p>¹For example, setting L = {l_q} would result in a monolingual task, and L = {l} with l ≠ l_q involves a cross-lingual task. More generally, a set L with higher cardinality can define a multilingual task.</p>
      <p>Rank-based evaluation metrics can be used to study the performance on this ranking formulation of EL. A typical approach breaks the task into two stages. The first is the Candidate Generation Stage, where an initial ranking is obtained using a low-latency method, trying to optimize for recall. In the second stage, the Re-ranking Stage, a more costly but higher-precision method is applied to evaluate the top elements in the preliminary rank.</p>
      <p>Another important example is the O*NET-SOC taxonomy. The Occupational Information Network (O*NET) is developed and maintained by the United States government [30, 31] to standardize information relevant to the labor market, based on the 2018 Standard Occupational Classification (SOC) system². It contains information in English about 1,016 occupations, each with a set of names and a description.</p>
      <p>²https://www.bls.gov/soc/</p>
      <p>Additionally, many other countries have developed their own national taxonomies or terminologies for occupations. For example, the Federal Employment Agency in Germany developed the Klassifikation der Berufe 2010 (KldB 2010), a terminology used to standardize the information in the German language about occupations [32].</p>
      <p>To achieve interoperability between some of these taxonomies, mappings (also called crosswalks) were developed and made public. These mappings establish an alignment between two given taxonomies. In particular, the European Union published many crosswalks³ [33] that map concepts from national taxonomies, which are typically monolingual, into ESCO. The process described in Section 4 uses this information as a gold standard to create the datasets for the MELO Benchmark.</p>
      <p>³https://esco.ec.europa.eu/en/use-esco/eures-countries-mapping-tables</p>
      <sec id="sec-3-4">
        <title>3. Task</title>
        <p>As mentioned already, the task consists of multilingual Entity Linking of occupations into the ESCO taxonomy, which we denote by E. Given a query mention q, which is a text string expressing the non-normalized name of an occupation without surrounding context, we need to find the best semantic match in ESCO, namely the correct entity e ∈ E. Every occupation in the taxonomy has textual information in all languages l ∈ L. The query is expressed in language l_q, which we make no prior assumptions about.</p>
        <p>For evaluation, we operationalize the task as a ranking problem with binary-relevance annotations, where a query is used to rank all the strings in a corpus C. The corpus is a collection of lexical terms denoting occupation names, and it is derived from the taxonomy E.</p>
        <p>To build the corpus C, we first define the set of target languages for the corpus as a subset L_C ⊂ L. Then, we collect every surface form (name) for every occupation corresponding to those languages. That is, starting from an empty set, we traverse E and, for each occupation e, we add every name available in any language in L_C. As a result, C is the collection of every name of every occupation in every target language.</p>
        <p>The annotations consist of the set of relevant corpus elements for each query. Given the correct entity e for a query q, those corpus elements that were obtained from the surface forms of e are considered relevant, while any other element in the corpus is considered irrelevant.</p>
        <p>Because the goal is to find the relevant concept in the taxonomy for the given query (i.e. to solve the entity linking formulation of the task), obtaining at least one surface form associated with the relevant concept at the top of the ranking is sufficient for correctly performing the task. In other words, when ranking the corpus elements for a query, the position in the ranking of the highest-ranked relevant surface form is the measure we aim to evaluate. For this reason, we evaluate the baseline models with the following metrics: mean reciprocal rank (MRR) and top-k accuracy (A@k).</p>
        <p><bold>4. Datasets</bold></p>
        <p>The MELO Benchmark consists of 48 datasets, where each is an instance of the ranking task described in Section 3. While the set of queries differs among the datasets, the target taxonomy is always ESCO Occupations. Although the underlying concepts in the corpus are the same, the surface forms (specifically, the occupation names) vary across datasets, since they are presented in different subsets of ESCO languages.</p>
        <p>We leverage existing crosswalks, which are high-quality mappings between ESCO Occupations and other taxonomies [34, 33], to build the datasets. Two datasets are derived from the mapping between ESCO and the O*NET-SOC Taxonomy, while the remaining ones are derived from the mapping between ESCO and the official occupation terminologies from several European countries. While ESCO is a multilingual taxonomy, the national terminologies are monolingual. Elements between the taxonomies are assigned SKOS relationships [35] such as exact match, narrow match, broad match, or close match.</p>
        <p>For each crosswalk, we build two evaluation datasets: a monolingual dataset and a cross-lingual dataset. In both cases, the set of queries are those elements in the national terminologies (or O*NET) that either have only one exact match in ESCO, or have zero exact matches and only one narrow match. We therefore filter out queries that are semantically ambiguous, e.g. because they have more than one exact match, as well as queries that cannot be assigned to a specific concept in ESCO because they are not specific enough, for example if they only have broad or close matches.</p>
        <p>The language of the set of queries, l_q, depends on the national terminology. Regarding the languages used for the corpus, we select a different subset of the languages in ESCO for each modality. For the monolingual task we set L_C = {l_q}, and for the cross-lingual task we set L_C = {English}. Exceptionally, since for O*NET the query language is already English, in this case we define a multilingual task instead of a cross-lingual one, where the corpus languages are English, German, Spanish, French, Italian, Dutch, Portuguese, and Polish (we intentionally include English, the query language). As mentioned in the previous Section, the annotations are derived from the surface forms of the matched ESCO concept.</p>
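        <p>The corpus construction described in Section 3 can be sketched as follows, assuming a mapping from concept identifiers to per-language name sets is available (the identifiers and names below are illustrative):</p>
        <preformat>
```python
def build_corpus(taxonomy: dict, target_languages: set) -> dict:
    """Collect every surface form of every concept in the target languages.

    Returns a mapping surface form -> set of concept ids, so that the
    relevant corpus elements for a query with gold concept e are exactly
    the forms obtained from the name sets of e.
    """
    corpus = {}
    for concept_id, names_by_lang in taxonomy.items():
        for lang in target_languages:
            for name in names_by_lang.get(lang, set()):
                corpus.setdefault(name, set()).add(concept_id)
    return corpus

taxonomy = {
    "C1": {"en": {"nurse"}, "es": {"enfermera"}},
    "C2": {"en": {"software developer", "programmer"}},
}
# Monolingual setup: corpus languages = {query language}
mono = build_corpus(taxonomy, {"en"})
# Multilingual setup: a larger subset of supported languages
multi = build_corpus(taxonomy, {"en", "es"})
```
        </preformat>
        <p>Note that a surface form attached to more than one concept (polysemy) naturally maps to several concept ids under this representation.</p>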
        <p>[Figure 1: Histograms of the normalized edit distance between each query and its closest relevant corpus element, for a selection of monolingual tasks.]</p>
        <p>For further detail on the construction and composition of these datasets, as well as example queries and relevant corpus elements, please refer to Appendix A.</p>
        <p>The benchmark is intended to represent realistic use cases, such as linking mentions into a taxonomy, enriching a custom taxonomy with new synonyms for the existing concepts, or aligning two taxonomies. It is also intended to study the cross-lingual and multilingual capabilities of proposed systems. Using extra information for solving this task, such as context for the mentions or descriptions and examples for the taxonomy concepts, is out of the scope of this work but represents an interesting line of future research that can take advantage of the MELO Benchmark.</p>
        <p>To assess the lexical overlap between the surface forms in any national terminology and ESCO, we use the monolingual tasks, and measure the normalized edit distance between each query and the closest relevant corpus element. In Figure 1 we show a histogram with the distribution of such distances in a selection of tasks.</p>
        <p>The lexical overlap is considerable in some cases, like with the Danish terminology. In the histogram, a big concentration of examples in the left-most bin implies that many queries are lexically very close to their relevant corpus elements. This, in principle, would make these tasks easier to solve using simple lexical scoring functions. In Appendix A we explain the procedure used to compute the lexical distances and we also present the same analysis for every task in the benchmark.</p>
      </sec>
      <p><bold>5. Experiments</bold></p>
      <p>To demonstrate the MELO Benchmark in use, we study the performance of several models when evaluated on the tasks we defined above. We explore both simple lexical baselines and advanced deep learning models in a bi-encoder, zero-shot setting.</p>
      <p><bold>Lexical Baselines.</bold> We evaluate the following baselines: edit distance, word-level TF-IDF, word-level TF-IDF on lemmas, char-level TF-IDF, char-level TF-IDF on lemmas, BM25, and BM25 on lemmas. These models rely on surface-level text features.</p>
      <p><bold>Semantic Baselines.</bold> Additionally, we provide results for zero-shot evaluations using state-of-the-art deep learning models employed as symmetric bi-encoders. Under this setup, we use a sentence encoder to obtain a fixed-size representation for each surface form, and the score for a query and each corpus element is computed as the cosine similarity of their corresponding representations. This allows the system to capture deeper semantic relationships.</p>
      <p>We experiment with the following pre-trained models in a zero-shot setup, without fine-tuning or in-context examples: ESCOXLM-R [10], mUSE-CNN [36], a multilingual variant of MPNet [37], BGE-M3 [38], GIST-Embedding [39], Multilingual E5 [40], E5 [41, 42], and the model text-embedding-3-large from OpenAI⁴. This selection of models represents a spectrum of trade-offs between performance and model complexity. We refer the reader to Appendix B and Table 3 for further details on the models and the inference procedure.</p>
      <p>As described in Section 3, the goal of each task is to find the relevant concept in the taxonomy for the given query. Therefore, obtaining at least one surface form associated with the relevant concept at the top of the ranking is sufficient to achieve this goal. With that in mind, we use mean reciprocal rank (MRR) and top-k accuracy (A@k) as evaluation metrics. Due to space constraints, in Table 2 we present results in terms of MRR for a selected set of tasks, while the complete set of results is provided in Table 5 and Table 6 in Appendix C.</p>
      <p>In most monolingual datasets, the top-performing lexical baselines achieved MRR values ranging from 30% to 55%. Notably, in the French and Danish datasets⁵, these baselines performed extraordinarily well, in large part due to substantial lexical overlap, as indicated by the left-skewed distributions in Figure 4. In contrast, the Lithuanian, Norwegian, and Romanian datasets exhibited lower performance. Char-based TF-IDF variants deliver the highest performance among this group of baselines.</p>
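      <p>The two evaluation metrics can be implemented directly from their definitions; the following is a generic sketch, not necessarily the exact implementation released with the benchmark:</p>
      <preformat>
```python
def reciprocal_rank(ranked_forms: list, relevant: set) -> float:
    """1 / position of the highest-ranked relevant surface form (0 if none)."""
    for position, form in enumerate(ranked_forms, start=1):
        if form in relevant:
            return 1.0 / position
    return 0.0

def top_k_accuracy(ranked_forms: list, relevant: set, k: int) -> float:
    """1.0 if any relevant surface form appears among the top k, else 0.0."""
    return float(any(form in relevant for form in ranked_forms[:k]))

# Dataset-level MRR and A@k are the means of these per-query values.
ranking = ["infirmier", "nurse", "registered nurse"]
rr = reciprocal_rank(ranking, {"nurse", "registered nurse"})
```
      </preformat>
      <p>Because only the highest-ranked relevant surface form matters, a system is not penalized for ranking one synonym of the gold concept above another.</p>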
      <p>⁴https://openai.com/index/new-embedding-models-and-api-updates</p>
      <p>⁵Results for every dataset are presented in Appendix C.</p>
      <p>In a zero-shot setup, ESCOXLM-R performs poorly, even falling behind simple lexical baselines across both monolingual and cross-lingual datasets. This result is consistent with previous research that has shown that encoders trained with masked language modeling (MLM) objectives often struggle to produce effective sentence representations when directly evaluated as sentence encoders [43]. In contrast, the other bi-encoders evaluated in this study were specifically optimized for generating useful sentence embeddings, which explains their superior performance in these tasks.</p>
      <p>The mUSE-CNN model demonstrates fair performance on most monolingual tasks for languages included in its pre-training, especially when considering its relatively small model size and architecture type (see Table 3). However, as anticipated, its performance drops significantly for languages that were not included during pre-training. Furthermore, its performance falls below the lexical baselines in almost all datasets. This can be observed in Figure 2b.</p>
      <p>MPNet exhibits poor performance across all monolingual datasets, a surprising result given its larger model size, architecture type, and the fact that it was pre-trained on all the languages used in this experiment. Despite these advantages, it is generally outperformed by the smaller mUSE-CNN model, with the notable exception of the English datasets.</p>
      <p>BGE-M3 and Multilingual E5 have similar characteristics, as described in Table 3, and both deliver strong performance across most monolingual tasks. In these cases, they generally outperform all lexical baselines and smaller bi-encoders. However, in the English datasets, Multilingual E5 outperforms BGE-M3.</p>
      <p>GIST-Embedding demonstrates strong performance in English, outperforming many larger models. It also achieves reasonable results in most other languages, which is surprising considering its primary training was focused on English.</p>
      <p>E5, a significantly larger decoder-only model, outperforms the previously mentioned models across most tasks. This is also surprising since E5 was mainly trained in English. Finally, although limited details are publicly available about OpenAI's text-embedding-3-large model, its performance is generally on par with or even surpasses that of E5. OpenAI's model delivers the highest overall performance among all the models evaluated in our experiments.</p>
      <p>The performance of the models in each monolingual dataset is correlated with the lexical overlap in the dataset, as measured by the median of the distributions presented in Figure 4. As expected, lexical baselines exhibit a particularly strong correlation, with Spearman's coefficients of -0.74 for Char TF-IDF and -0.80 for Edit Distance. Interestingly, bi-encoders also demonstrate a moderate correlation, such as mE5 (-0.65) and OpenAI (-0.62). In Figure 2 we visualize this correlation, as well as the correlation between the lexical overlap and the difference in performance between some bi-encoders and a lexical baseline⁶. We observe that, the less lexical overlap in the dataset, the more the OpenAI model outperforms the lexical baseline.</p>
      <p>⁶The same figure is displayed in full size in Appendix C.</p>
      <p>Comparing the results of datasets USA-en-en and USA-en-xx, which share the same queries, we observe that most methods significantly enhance their performance when the corpus elements visible to the system are expanded to include multiple languages, surpassing their performance in the monolingual task. An implication of this is that, when linking mentions into a multilingual taxonomy, the surface forms in other languages are valuable even if the taxonomy includes entity names in the language of the query.</p>
      <p>Figure 2: Correlation between model performance and the median of the minimum edit distance between queries and relevant corpus elements in monolingual datasets. (a) Absolute performance (in MRR). (b) Performance relative to the lexical baseline Char TF-IDF.</p>
      <p>As expected, the performance drop when moving from monolingual to cross-lingual datasets (excluding O*NET) is significantly more pronounced for the lexical baselines than for the bi-encoders. The capacity for (zero-shot) cross-lingual EL of occupations varies for different models: ESCOXLM-R, MPNet, and GIST-Embedding exhibit very low cross-lingual performance; mUSE-CNN, BGE-M3, and Multilingual E5 demonstrate fair cross-lingual performance; while E5 and OpenAI achieve the highest cross-lingual performance.</p>
      <p>Since the techniques we experiment with (lexical scorers and bi-encoders) are commonly used for candidate generation in the first stage of EL [1, 2], it is interesting to measure the top-k accuracy (A@k) for different values of k to assess how well such techniques recover the first relevant item. Figure 3 presents these results for the same subset of tasks for the following systems: Edit Distance, Char-level TF-IDF, mUSE-CNN, and OpenAI. The complete set of A@k results is available in Appendix C, in Figure 6 and Figure 7. The results observed for top-k accuracy are consistent with those for mean reciprocal rank (MRR), particularly in terms of the relative ranking and comparative performance of the models.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Related Work</title>
      <sec id="sec-4-1">
        <p>There has been significant research interest in systems that normalize HR information into ESCO and other taxonomies. Decorte et al. [14] explore the extraction of ESCO skills from segmented job descriptions. They approach this problem as a massive multi-label classification task, and present a human-annotated evaluation set for it. More recently, Decorte et al. [17] approach the same problem from an EL perspective. They use a large language model (LLM) to produce synthetic annotations and train a bi-encoder to extract ESCO skills from job description segments. Finally, Zhang et al. [11] apply and compare two supervised EL methods for solving the same task: BLINK [2] and GENRE [44]. In contrast to these studies, our work focuses on occupations instead of skills, explores cross-lingual and multilingual scenarios, and the task as we formulate it does not use context for linking the query mentions.</p>
        <p>There has also been a substantial amount of research focused on occupations. Decorte et al. [20] developed an unsupervised approach to fine-tune BERT [45] to encode the semantics of occupation names. Furthermore, they create a dataset for the normalization of free-form English occupation names into ESCO and use it to evaluate their model. It has been reported that this dataset contains ambiguous input queries [20] as well as some mislabeled elements [46]. Closely related works by Zbib et al. [47] and by Bocharova et al. [48] propose alternative unsupervised representation learning schemes. They both release evaluation datasets, the former for occupation name ranking, and the latter for EL of unnormalized occupation names into ESCO.</p>
        <p>Lake [16] studies the application of bi-encoders and cross-encoders to EL of occupations into a custom taxonomy. Yamashita et al. [21] work on a normalization task for occupations, which closely resembles our formulation of EL. They create a non-public dataset by collecting a large number of unnormalized occupation names and then automatically mapping them to ESCO occupations via exact match after removing proper nouns. Vrolijk et al. [22] build a synthetic dataset for zero-shot evaluation and fine-tuning of several language models, using information from ESCO that includes the synonyms for each entity name, the relationships between entities, and their definitions. In particular, they use the set of name synonyms for each ESCO occupation to pose a binary relevance classification problem, where positive pairs involve two names belonging to the same synonym set.</p>
        <p>Two important use cases of the EL task under study are enriching and aligning taxonomies. In order to maintain up-to-date but well-curated taxonomies, it is common to automatically identify new candidate concepts to be included, and to use human annotators to validate their inclusion. Similarly, when aligning two taxonomies (i.e. building a crosswalk), it is common to use automatic systems to propose and explore candidate matches between the concepts in each taxonomy.</p>
        <p>Giabelli and colleagues [19] have worked on several approaches for enriching [49] and aligning [18, 50] taxonomies, using word embeddings to model concepts via their names, together with structural information about the taxonomy. All these methods automatically score candidates for inclusion or mapping, and can be used within a human-in-the-loop framework for further validation.</p>
        <p>During the creation of the crosswalk between O*NET and ESCO, the teams responsible for maintaining both taxonomies worked together to ensure a high-quality mapping [33]. Interestingly, they report employing a human-in-the-loop methodology where a fine-tuned BERT model [45] is used as a bi-encoder to rank the ESCO occupations for each O*NET occupation. They explore different methods for encoding each, leveraging occupation names (and synonyms) as well.</p>
        <p>More recently, the ESCO team presented an analysis [46] on a task that is very similar to the one we present here. They fine-tune an XLM-RoBERTa model [51] on HR-related data, including the textual information from ESCO, but with no supervision signal for any specific EL task. They then use this model as a bi-encoder to suggest ESCO occupations for elements taken from the national terminologies of Latvia, Spain, Sweden, and Italy, as well as from O*NET. Using the respective crosswalks, they evaluate this as an EL task. They explore monolingual and cross-lingual (to English) modalities. A key difference between this work and ours is that they consider any SKOS relationship as a legitimate annotation, while we only use exact and narrow matches. We also filter out semantically ambiguous queries for which experts determined that they should be related as an exact match to more than one ESCO concept. For those reasons, their results are not comparable to those we present in this work.</p>
        <p><bold>7. Conclusion</bold></p>
        <p>We have introduced the MELO Benchmark, a suite of 48 datasets for multilingual entity linking of occupations in 21 languages. We experimented with several out-of-the-box lexical and semantic baselines, demonstrating that there is still significant room for improvement. Our aim is that MELO will serve as a valuable resource for the research community, providing a standardized benchmark for assessing progress in multilingual EL within the HR domain, and fostering innovation and the development of new methodologies in this important area of research.</p>
        <p>In future work, several research directions could be explored. First, the current evaluation scheme can be extended to incorporate NIL prediction or prediction using entity descriptions rather than relying solely on entity names, with the presented source code being easily adaptable for such modifications. Second, domain-adapting or fine-tuning encoders specifically for this task, in a manner similar to ESCOXLM-R but optimized for semantic text similarity, presents another possible direction. Third, exploring advanced deep learning techniques beyond bi-encoders, such as cross-encoders combined with re-ranking stages, could enhance model performance. Finally, investigating the meta-learning paradigm, by dividing MELO tasks into meta-training and meta-testing tasks and applying meta-learning to solve the meta-testing tasks, exploiting the multi-lingual transfer capabilities of modern deep-learning models, offers another interesting direction for future work.</p>
        <p><bold>Acknowledgment</bold></p>
        <p>[…] Recommender Systems (2021). URL: https://ceur-ws.org/Vol-2967/paper_3.pdf</p>
        <p>[6] S. Tu, O. Cannon, Beyond Human-in-the-loop: Scaling Occupation Taxonomy at Indeed,</p>
The 2nd Workshop on Recommender
SysThis publication uses the ESCO classification of the Euro- tems for Human Resources (RecSys in HR’22),
pean Commission. We gratefully acknowledge the work in conjunction with the 16th ACM
Conferpdaotnieonbys ttahexotneoammyi,navsowlveeldl ains ctuhreatteianmgstrheesEpoSCnsOibOleccfuo-r ence on Recommender Systems (2022). URL:
https://recsyshr.aau.dk/wp-content/uploads/2022/
the O*NET-SOC 2019 taxonomy and the other national 09/RecSysHR2022-paper_2.pdf.
taxonomies used in this work. Furthermore, we would</p>
        <p>[7] S. Avlonitis, D. Lavi, M. Mansoury, D. Graus,
Caalso like to thank the teams responsible for creating thereer Path Recommendations for Long-term
Incrosswalks between ESCO and these taxonomies. come Maximization: A Reinforcement
Learning Approach, The 3rd Workshop on
RecReferences ommender Systems for Human Resources
(RecSys in HR’23), in conjunction with the 17th
[1] L. Logeswaran, M.-W. Chang, K. Lee, K. Toutanova, ACM Conference on Recommender Systems (2023).</p>
        <p>J. Devlin, H. Lee, Zero-Shot Entity Linking by URL: https://recsyshr.aau.dk/wp-content/uploads/
Reading Entity Descriptions, in: Proceedings of 2023/09/RecSysHR2023-paper_2.pdf.
the 57th Annual Meeting of the Association fo[8r] J.-J. Decorte, J. V. Hautte, J. Deleu, C.
DeComputational Linguistics, Association for Com- velder, T. Demeester, Career Path Prediction
usputational Linguistics, Florence, Italy, 2019, pp. ing Resume Representation Learning and
Skill3449–3460. URL: https://aclanthology.org/P19-1.335 based Matching, The 3rd Workshop on
Recdoi:10.18653/v1/P19-1335. ommender Systems for Human Resources
(Rec[2] L. Wu, F. Petroni, M. Josifoski, S. Riedel, L. Zettle- Sys in HR’23), in conjunction with the 17th
moyer, Scalable Zero-shot Entity Linking with ACM Conference on Recommender Systems (2023).
Dense Entity Retrieval, in: Proceedings of URL: https://recsyshr.aau.dk/wp-content/uploads/
the 2020 Conference on Empirical Methods in 2023/09/RecSysHR2023-paper_1.pdf.</p>
        <p>Natural Language Processing (EMNLP), Asso-[9] M. Zhang, K. Jensen, S. Sonniks, B. Plank, SkillSpan:
ciation for Computational Linguistics, Online, Hard and Soft Skill Extraction from English Job
2020, pp. 6397–6407. URL: https://aclanthology. Postings, in: Proceedings of the 2022 Conference
org/2020.emnlp-main.51 9. doi:10.18653/v1/2020. of the North American Chapter of the Association
emnlp-main.519. for Computational Linguistics: Human Language
[3] J. A. Botha, Z. Shan, D. Gillick, Entity Link- Technologies, Association for Computational
Lining in 100 Languages, in: Proceedings of guistics, Seattle, United States, 2022, pp. 4962–4984.
the 2020 Conference on Empirical Methods in URL: https://aclanthology.org/2022.naacl-main..366
Natural Language Processing (EMNLP), Asso- doi:10.18653/v1/2022.naacl-main.366.
ciation for Computational Linguistics, Onli[n10e], M. Zhang, R. van der Goot, B. Plank,
ESCOXLM2020, pp. 7833–7845. URL: https://aclanthology. R: Multilingual Taxonomy-driven Pre-training for
org/2020.emnlp-main.63 0. doi:10.18653/v1/2020. the Job Market Domain, in: Proceedings of
emnlp-main.630. the 61st Annual Meeting of the Association for
[4] X. Fu, W. Shi, X. Yu, Z. Zhao, D. Roth, Design Computational Linguistics (Volume 1: Long
PaChallenges in Low-resource Cross-lingual Entity pers), Association for Computational Linguistics,
Linking, in: Proceedings of the 2020 Conference Toronto, Canada, 2023, pp. 11871–11890. URL:
on Empirical Methods in Natural Language Pro- https://aclanthology.org/2023.acl-lon.gd.6o6i2:10.
cessing (EMNLP), Association for Computational 18653/v1/2023.acl-long.662.</p>
        <p>Linguistics, Online, 2020, pp. 6418–6432. URLh:ttps: [11] M. Zhang, R. van der Goot, B. Plank, Entity Linking
//aclanthology.org/2020.emnlp-main.5.21doi:10. in the Job Market Domain, in: Findings of the
18653/v1/2020.emnlp-main.521. Association for Computational Linguistics: EACL
[5] M. de Groot, J. Schutte, D. Graus, Job Posting- 2024, Association for Computational Linguistics,
Enriched Knowledge Graph for Skills-based Match- St. Julian’s, Malta, 2024, pp. 410–419. URhLt:tps:
ing, The 1st Workshop on Recommender Systems //aclanthology.org/2024.findings-eac.l.28
for Human Resources (RecSys in HR’21), in con-[12] E. Senger, M. Zhang, R. van der Goot, B. Plank,
junction with the 15th ACM Conference on Recom- Deep Learning-based Computational Job Market</p>
        <p>Analysis: A Survey on Skill Extraction and
Classiifcation from Job Postings, in: Proceedings of the velder, JobBERT: Understanding Job Titles
First Workshop on Natural Language Processing through Skills, FEAST, ECML-PKDD 2021
Workfor Human Resources (NLP4HR 2024), Association shop (2021). URL: https://feast-ecmlpkdd.github.io/
for Computational Linguistics, St. Julian’s, Malta, archive/2021/papers/FEAST2021_paper_6.pd. f
2024, pp. 1–15. URL: https://aclanthology.org/2024[2.1] M. Yamashita, J. T. Shen, T. Tran, H. Ekhtiari,
nlp4hr-1.1. D. Lee, JAMES: Normalizing Job Titles with
Multi[13] M. Zhang, K. N. Jensen, B. Plank, Kompetencer: Aspect Graph Embeddings and Reasoning, in: 2023
Fine-grained Skill Classification in Danish Job Post- IEEE 10th International Conference on Data
Sciings via Distant Supervision and Transfer Learn- ence and Advanced Analytics (DSAA), 2023, pp.
ing, in: Proceedings of the Thirteenth Language 1–10. URL: https://arxiv.org/abs/2202.107.3d9oi:10.
Resources and Evaluation Conference, European 1109/DSAA60987.2023.10302559.</p>
        <p>Language Resources Association, Marseille, Fran[c2e2,] J. Vrolijk, D. Graus, Enhancing PLM Performance
2022, pp. 436–447. URL: https://aclanthology.org/ on Labour Market Tasks via Instruction-based
2022.lrec-1.46. Finetuning and Prompt-tuning with Rules,
[14] J.-J. Decorte, J. V. Hautte, J. Deleu, C. De- The 3rd Workshop on Recommender
Sysvelder, T. Demeester, Design of Negative Sam- tems for Human Resources (RecSys in HR’23),
pling Strategies for Distantly Supervised Skill in conjunction with the 17th ACM
ConferExtraction, The 2nd Workshop on Recom- ence on Recommender Systems (2023). URL:
mender Systems for Human Resources (RecSys https://recsyshr.aau.dk/wp-content/uploads/2023/
in HR’22), in conjunction with the 16th ACM 09/RecSysHR2023-paper_4.pdf.</p>
        <p>Conference on Recommender Systems (2022).[23] F. Zhu, J. Yu, H. Jin, L. Hou, J. Li, Z. Sui, Learn
URL: https://recsyshr.aau.dk/wp-content/uploads/ to Not Link: Exploring NIL Prediction in Entity
2022/09/RecSysHR2022-paper_4.pdf. Linking, in: Findings of the Association for
[15] M. Zhang, K. N. Jensen, R. van der Goot, B. Plank, Computational Linguistics: ACL 2023, Association
Skill Extraction from Job Postings using Weak for Computational Linguistics, Toronto, Canada,
Supervision, The 2nd Workshop on Recom- 2023, pp. 10846–10860. URL: https://aclanthology.
mender Systems for Human Resources (RecSys org/2023.findings-acl.69.0doi:10.18653/v1/2023.
in HR’22), in conjunction with the 16th ACM findings-acl.690.</p>
        <p>Conference on Recommender Systems (2022).[24] N. Gupta, S. Singh, D. Roth, Entity Linking via
URL: https://recsyshr.aau.dk/wp-content/uploads/ Joint Encoding of Types, Descriptions, and
Con2022/09/RecSysHR2022-paper_10.pdf. text, in: Proceedings of the 2017 Conference
[16] T. Lake, Flexible Job Classification with Zero- on Empirical Methods in Natural Language
ProShot Learning, The 2nd Workshop on Rec- cessing, Association for Computational Linguistics,
ommender Systems for Human Resources (Rec- Copenhagen, Denmark, 2017, pp. 2681–2690. URL:
Sys in HR’22), in conjunction with the 16th https://aclanthology.org/D17-1.2d8o4i:10.18653/
ACM Conference on Recommender Systems (2022). v1/D17-1284.</p>
        <p>URL: https://recsyshr.aau.dk/wp-content/upload[2s/5] Z. Zheng, F. Li, M. Huang, X. Zhu, Learning to
2022/09/RecSysHR2022-paper_8.pdf. Link Entities with Knowledge Base, in: Human
[17] J.-J. Decorte, S. Verlinden, J. V. Hautte, J. Deleu, Language Technologies: The 2010 Annual
ConferC. Develder, T. Demeester, Extreme Multi-Label ence of the North American Chapter of the
AssociaSkill Extraction Training using Large Language tion for Computational Linguistics, Association for
Models, 2023. URL: https://arxiv.org/abs/2307. Computational Linguistics, Los Angeles, California,
10778. arXiv:2307.10778. 2010, pp. 483–491. URL: https://aclanthology.org/
[18] A. Giabelli, L. Malandri, F. Mercorio, M. Mez- N10-1072.</p>
        <p>zanzanica, WETA: Automatic taxonomy[26] R. J. Brachman, What IS-A Is and Isn’t: An
Analalignment via word embeddings, Com- ysis of Taxonomic Links in Semantic Networks,
puters in Industry 138 (2022) 103626. URL: Computer 16 (1983) 30–36. doi1:0.1109/MC.1983.
https://www.sciencedirect.com/science/ 1654194.
article/pii/S016636152200021.5 doi:https: [27] M. le Vrang, A. Papantoniou, E. Pauwels, P. Fannes,
//doi.org/10.1016/j.compind.2022.103626. D. Vandensteen, J. De Smedt, ESCO: Boosting Job
[19] A. Giabelli, Integrating Word Embeddings and Tax- Matching in Europe with Semantic Interoperability,
onomy Learning for Enhanced Lexical Domain Computer 47 (2014) 57–64. doi1:0.1109/MC.2014.
Modelling, Phd thesis, Università degli Studi di 283.</p>
        <p>Milano-Bicocca, 2024. [28] European Commission, ESCO Handbook:
Eu[20] J.-J. Decorte, J. V. Hautte, T. Demeester, C. De- ropean Skills, Competences, Qualifications and
Occupations, Technical Report, European Union, c3a690be93aa602ee2dc0ccab5b7b67e-Paper.p.df
2019. URL: https://esco.ec.europa.eu/system/files[/38] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian,
2021-07/Handbook.pd.f Z. Liu, BGE M3-Embedding: Multi-Lingual,
[29] European Commission, ESCO Terminologi- Multi-Functionality, Multi-Granularity Text
cal Guidelines, Technical Report, European Embeddings Through Self-Knowledge Distillation,
Union, 2021. URL: https://esco.ec.europa. 2024. URL: https://arxiv.org/abs/2402.032.16
eu/en/about-esco/publications/publication/ arXiv:2402.03216.</p>
        <p>esco-terminological-guideli.nes [39] A. V. Solatorio, GISTEmbed: Guided In-sample
Se[30] E. C. Dierdorf, D. W. Drewes, J. J. Norton, O*NET lection of Training Negatives for Text Embedding
Tools and Technology: A Synopsis of Data De- Fine-tuning, 2024. URLh:ttps://arxiv.org/abs/2402.
velopment Procedures, Technical Report, North 16829. arXiv:2402.16829.</p>
        <p>Carolina State University, 2006. UhRtLt: ps://www. [40] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder,
onetcenter.org/dl_files/T2Development..pdf F. Wei, Multilingual E5 Text Embeddings: A
Tech[31] M. J. Handel, The O*NET Content Model: nical Report, 2024. URLh:ttps://arxiv.org/abs/2402.</p>
        <p>Strengths and Limitations, Journal for Labour 05672. arXiv:2402.05672.</p>
        <p>Market Research 49 (2016) 157–176. do10i:.1007/ [41] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder,
s12651-016-0199-8. F. Wei, Improving Text Embeddings with Large
[32] W. Paulus, B. Matthes, Klassifikation der Language Models, 2023. URLh: ttps://arxiv.org/abs/
Berufe: Struktur, Codierung und Um- 2401.00368. arXiv:2401.00368.
steigeschlüssel, Technical Report, Bun[4-2] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang,
desagentur für Arbeit, 2013. URLh:ttps: D. Jiang, R. Majumder, F. Wei, Text
Embed//doku.iab.de/fdz/reporte/2013/MR_08-13.p.df dings by Weakly-Supervised Contrastive
Pre[33] European Commission, The Crosswalk training, 2022. URL:https://arxiv.org/abs/2212.</p>
        <p>Between ESCO and O*NET, Technical 03533. arXiv:2212.03533.</p>
        <p>Report, European Union, 2022. URL:[43] B. Li, H. Zhou, J. He, M. Wang, Y. Yang, L. Li, On the
https://esco.ec.europa.eu/system/files/2022-12/ Sentence Embeddings from Pre-trained Language
ONET%20ESCO%20Technical%20Report.p.df Models, in: Proceedings of the 2020 Conference
[34] European Commission, ESCO implementation man- on Empirical Methods in Natural Language
Proual, Technical Report, European Union, 2018. URL: cessing (EMNLP), Association for Computational
https://esco.ec.europa.eu/system/files/2021-07/ Linguistics, Online, 2020, pp. 9119–9130. URLh:ttps:
425b7a5f-3048-4377-a816-5402c00e9a9505_A_ //aclanthology.org/2020.emnlp-main.7.33doi:10.</p>
        <p>Annex_Draft_ESCO_Implementation_manual.pdf 18653/v1/2020.emnlp-main.733.
[35] A. Miles, S. Bechhofer, SKOS Simple Knowledge[44] N. D. Cao, G. Izacard, S. Riedel, F. Petroni,
AuOrganization System Reference, W3C Recommen- toregressive Entity Retrieval, International
Condation, World Wide Web Consortium, 2009. URL: ference on Learning Representations (2021). URL:
https://www.w3.org/TR/skos-referenc,ew/3C Rec- https://openreview.net/forum?id=5k8F6UU3.9V
ommendation. [45] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
[36] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Con- Pre-training of Deep Bidirectional Transformers
stant, G. Hernandez Abrego, S. Yuan, C. Tar, Y.-h. for Language Understanding, in: Proceedings
Sung, B. Strope, R. Kurzweil, Multilingual Univer- of the 2019 Conference of the North American
sal Sentence Encoder for Semantic Retrieval, in: Chapter of the Association for Computational
LinProceedings of the 58th Annual Meeting of the guistics: Human Language Technologies, Volume
Association for Computational Linguistics: Sys- 1 (Long and Short Papers), Association for
Comtem Demonstrations, Association for Computa- putational Linguistics, Minneapolis, Minnesota,
tional Linguistics, Online, 2020, pp. 87–94. URL: 2019, pp. 4171–4186. URL: https://aclanthology.org/
https://aclanthology.org/2020.acl-demo. sd.1o2i:10. N19-1423. doi:10.18653/v1/N19-1423.
18653/v1/2020.acl-demos.12. [46] European Commission, Machine Learning Assisted
[37] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, MP- Mapping of Multilingual Occupational Data to
Net: Masked and Permuted Pre-training for ESCO, Technical Report, European Union, 2022.
Language Understanding, in: H. Larochelle, URL: https://shorturl.at/REc. Dd
M. Ranzato, R. Hadsell, M. Balcan, H. Li[n47] R. Zbib, L. A. Lacasa, F. Retyk, R. Poves,
(Eds.), Advances in Neural Information Pro- J. Aizpuru, H. Fabregat, V. Šimkus, E.
Garcíacessing Systems, volume 33, Curran Associates, Casademont, Learning Job Titles Similarity from
Inc., 2020, pp. 16857–16867. URL: https:// Noisy Skill Labels, FEAST, ECML-PKDD 2021
Workproceedings.neurips.cc/paper_files/paper/2020/file/ shop (2022). URL: https://feast-ecmlpkdd.github.io/
archive/2022/papers/FEAST2022_paper_4972.pd. f A. Details on the Datasets
[48] M. Bocharova, E. Malakhov, V. Mezhuyev,
VacancySBERT: the approach for representation of titWleesrelease the source code used to build the da7t,asets
and skills for semantic similarity search in theprroev-iding researchers with a tool to easily generate new
cruitment domain, Applied Aspects of Informad-atasets by combining diferent sets of languages for
tion Technology 6 (2023) 52–59. URLh:ttps://aait.od.query and corpus elements. Using this code, new
instanua/index.php/journal/article/view/161/.2d1o2i:10. tiations of the task can be derived from the input data by
15276/aait.06.2023.4. defining custom language combinations. For example, it
[49] A. Giabelli, L. Malandri, F. Mercorio, M. Mezzains-possible to use the Italian national terminology to set
zanica, A. Seveso, NEO: A Tool for Taxonomy En-up an Italian-to-Greek cross-lingual task, or even
comrichment with New Emerging Occupations, in: Thebine the query sets of several national classifications and
Semantic Web – ISWC 2020, Springer Internationalelverage all languages in ESCO to create a more complex
Publishing, Cham, 2020, pp. 568–584. multilingual task.
[50] A. Giabelli, L. Malandri, F. Mercorio, M. Mez-The input data consists of files with the multilingual
zanzanica, JoTA: Aligning Multilingual Job TaExS-CO Occupations taxonomy (one for each relevant
veronomies through Word Embeddings (Student Abs-ion) and files containing the queries in each national
stract), Proceedings of the AAAI Conferentceerminology, which are mapped to the ESCO concept ID
on Artificial Intelligence 36 (2022) 12955–12956o.f the relevant occupation. To create a dataset, the user
URL: https://ojs.aaai.org/index.php/AAAI/articlce/an select a national terminology and a set of languages
view/21614. doi:10.1609/aaai.v36i11.21614. for the corpus (any subset of the languages supported by
[51] A. Conneau, K. Khandelwal, N. Goyal, V. ChaudE-SCO).</p>
        <p>hary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, In Table4 we present example queries and their
releL. Zettlemoyer, V. Stoyanov, Unsupervisedvant corpus elements, sampled from the NLD-nl-nl,
PRTCross-lingual Representation Learning at Scpatl-ep,t, and PRT-pt-en datasets.
in: Proceedings of the 58th Annual Meet-Finally, we analyze the lexical overlap between the
naing of the Association for Computational Ltinio-nal classifications and ESCO. In Figu4r,ewe present a
guistics, Association for Computational Linguhiiss-togram showing the normalized edit distance between
tics, Online, 2020, pp. 8440–8451. URL:https:// queries and their closest relevant corpus element, for all
aclanthology.org/2020.acl-main.7d4o7i:10.18653/ the tasks in MELO.</p>
        <p>v1/2020.acl-main.747. To compute the distances, we first lowercase the
sur[52] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco,face forms of both the query and the corpus element,
R. St. John, N. Constant, M. Guajardo-Cespedeas,nd we use the methordatio from the Python package
S. Yuan, C. Tar, B. Strope, R. Kurzweil, Univerr-apidfuzz 8. This is a measure of the normalized edit
sal Sentence Encoder for English, in: Proceeddis-tance between the two strings. In the histograms, for
ings of the 2018 Conference on Empirical Methodesach query, we compute the distance for all its relevant
in Natural Language Processing: System Demonco-rpus elements and report the minimum distance.
strations, Association for Computational LinguiIsn- the histograms, the left-most bin represents the
fractics, Brussels, Belgium, 2018, pp. 169–174. URL: tion of queries for which the closest relevant element is
https://aclanthology.org/D18-2.0d2o9i:10.18653/ either identical or very similar. The Danish national
terv1/D18-2029. minology has the highest concentration of such cases. To
[53] N. Reimers, I. Gurevych, Sentence-BERT: Sentencea lesser extent, this is also true for Hungarian, Estonian,
Embeddings using Siamese BERT-Networks, in:and Polish.</p>
        <p>Proceedings of the 2019 Conference on Empirical Excluding those lexically trivial cases, the more the
Methods in Natural Language Processing and tdhisetribution is skewed to the left, the easier the task. For
9th International Joint Conference on NaturaleLxaanm-ple, comparing the Belgian (in the French language)
guage Processing (EMNLP-IJCNLP), Associationand the French tasks, the queries from the French
termifor Computational Linguistics, Hong Kong, Chinnao, logy show greater lexical overlap with their relevant
2019, pp. 3982–3992. URL: https://aclanthology.orgc/orpus elements.</p>
        <p>D19-1410. doi:10.18653/v1/D19-1410. In Appendix C, we use this analysis to compare the
performance of lexical baselines across diferent
monolingual tasks.
7https://github.com/Avature/melo-benchmark
8https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html
BGR-bg-bg</p>
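<p>The lexical-overlap statistic described above can be sketched in plain Python. The following is an illustration rather than the benchmark's actual code: it replaces rapidfuzz's ratio with an explicit LCS-based indel similarity (which, divided by 100, is what fuzz.ratio computes); function names are our own.</p>

```python
def indel_similarity(a: str, b: str) -> float:
    """Normalized indel similarity in [0, 1] between two lowercased
    strings, based on the length of their longest common subsequence
    (the quantity underlying rapidfuzz's fuzz.ratio)."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    # LCS length via dynamic programming (one row at a time).
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            cur.append(prev[j] + 1 if ca == cb else max(prev[j + 1], cur[j]))
        prev = cur
    lcs = prev[-1]
    return 2.0 * lcs / (len(a) + len(b))

def min_distance_to_relevant(query: str, relevant: list) -> float:
    """Minimum normalized edit distance from a query to its relevant
    corpus elements, as reported in the histograms."""
    return min(1.0 - indel_similarity(query, c) for c in relevant)
```

<p>Taking the minimum over the relevant elements, as above, is what places a query in the left-most histogram bin when at least one relevant surface form is (near-)identical.</p>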
      </sec>
    </sec>
    <sec id="sec-5">
      <title>B. Details on the Models</title>
<p>Here, we provide further details about the models explored in this work.</p>
      <p>Regarding the lexical baselines, we always apply a simple preprocessing in which we lowercase the input strings and, for all languages except Bulgarian, also perform ASCII normalization. For the edit distance baseline, we use rapidfuzz as described above. For the TF-IDF baselines, we use the scikit-learn<sup>9</sup> Python package, while for the BM25 variants, we use the Okapi BM25 implementation from rank-bm25<sup>10</sup>.</p>
      <p>For the baseline variants that involve lemmatization, we use spacy<sup>11</sup> models whenever available. However, spacy models were not available for the following languages: Bulgarian, Czech, Estonian, Hungarian, Latvian, and Slovak. Lemmatization is applied before ASCII normalization.</p>
      <p>In the case of bi-encoders, we experiment with several deep learning sentence encoders that have demonstrated strong performance in other semantic text similarity tasks.</p>
      <p>The first model is ESCOXLM-R, proposed by Zhang et al. [10], which is based on XLM-RoBERTa. We use the PyTorch implementation and the pre-trained weights available on HuggingFace under the model name jjzha/esco-xlm-roberta-large. The base model was pre-trained on data in 88 languages, including all those involved in our datasets, and the fine-tuning by Zhang and colleagues involved learning objectives that leverage information in ESCO. Although it is usual to experiment with the XLM-RoBERTa family of models only after fine-tuning, in our experiments we use it out-of-the-box in a zero-shot setup. During inference, the input to the model is the surface form of the query or the corpus element, with no preprocessing.</p>
      <p>We also present results for the Multilingual Universal Sentence Encoder (mUSE-CNN) model variant with a CNN architecture, proposed by Cer et al. [52, 36]. In our experiments, we use the TensorFlow implementation and the pre-trained weights available on TensorFlow Hub under the handle google/universal-sentence-encoder-multilingual/3. This model was pre-trained on data in Arabic, Chinese, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, and Russian. (Note that, during training, mUSE-CNN has not seen text for languages such as Bulgarian, Czech, or Danish.) During inference, the input to the model is the surface form of the query or the corpus element, without any preprocessing or enclosing prompt template.</p>
      <p>Other open-source models we experiment with are implemented in PyTorch within the HuggingFace package sentence-transformers [53]. These models are the following: a multilingual model based on MPNet [37] that was pre-trained on 50 languages, including all of the MELO languages<sup>12</sup>; the BGE-M3 model [38], which supports more than 100 languages, also including all MELO languages<sup>13</sup>; GIST Embedding [39], a model reported to be primarily trained in English<sup>14</sup>; Multilingual E5 [40], which was pre-trained on 94 languages, including all of the MELO languages<sup>15</sup>; and E5 [41, 42], pre-trained on many languages but reported to perform best on English-language input<sup>16</sup>.</p>
      <p>Finally, we also experiment with the text-embedding-3-large model from OpenAI<sup>17</sup>, which is reported to be state-of-the-art for many semantic text similarity tasks.</p>
      <p>9: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html; 10: https://pypi.org/project/rank-bm25/; 11: https://spacy.io/api/lemmatizer; 12: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2; 13: https://huggingface.co/BAAI/bge-m3; 14: https://huggingface.co/avsolatorio/GIST-Embedding-v0; 15: https://huggingface.co/intfloat/multilingual-e5-large; 16: https://huggingface.co/intfloat/e5-mistral-7b-instruct</p>
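<p>To illustrate how a character-level lexical baseline of this kind operates, the sketch below implements a minimal character-trigram TF-IDF ranker with cosine scoring in plain Python. The experiments use scikit-learn's TfidfVectorizer; the trigram size, padding, and weighting details here are illustrative assumptions, not the exact configuration.</p>

```python
import math
from collections import Counter

def trigrams(text: str):
    """Lowercased character 3-grams of a space-padded string."""
    t = " " + text.lower() + " "
    return [t[i : i + 3] for i in range(len(t) - 2)]

class CharTfidfRanker:
    """Minimal char-trigram TF-IDF + cosine ranker (illustrative)."""

    def __init__(self, corpus):
        self.corpus = corpus
        docs = [Counter(trigrams(c)) for c in corpus]
        n = len(docs)
        df = Counter()
        for d in docs:
            df.update(d.keys())
        # Smoothed idf, following scikit-learn's default formulation.
        self.idf = {g: math.log((1 + n) / (1 + df[g])) + 1 for g in df}
        self.doc_vecs = [self._vec(d) for d in docs]

    def _vec(self, counts):
        v = {g: tf * self.idf.get(g, 0.0) for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {g: w / norm for g, w in v.items()}

    def rank(self, query: str):
        """Corpus indices sorted by descending cosine similarity."""
        q = self._vec(Counter(trigrams(query)))
        scores = [
            sum(w * d.get(g, 0.0) for g, w in q.items())
            for d in self.doc_vecs
        ]
        return sorted(range(len(scores)), key=lambda i: -scores[i])
```

<p>Semantic bi-encoders follow the same retrieve-by-score pattern, with the sparse trigram vectors replaced by dense sentence embeddings.</p>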
      <p>For HuggingFace and OpenAI models, during
inference, we wrap the input text (the surface form of the
query or corpus element) with the following prompt
template:</p>
      <p>The candidate’s job title is “{{surf_form}}”. What skills are likely required for this job?</p>
      <p>where {{surf_form}} is replaced with the surface form of the element that is being encoded.</p>
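<p>Applying this template is a plain string substitution; the helper name below is ours, and we use a single-brace format field to stand in for the {{surf_form}} placeholder:</p>

```python
PROMPT_TEMPLATE = (
    "The candidate\u2019s job title is \u201c{surf_form}\u201d. "
    "What skills are likely required for this job?"
)

def wrap_surface_form(surf_form: str) -> str:
    """Wrap a query or corpus surface form in the prompt template
    used for the HuggingFace and OpenAI encoders."""
    return PROMPT_TEMPLATE.format(surf_form=surf_form)
```
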
      <p>This decision was informed by preliminary experiments in which we evaluated various models with different wrapping prompt templates, including no template (as with ESCOXLM-R and mUSE-CNN). We speculate that such prompts are particularly beneficial for LLM-based encoders, as they may better capture the semantics of the occupation names we aim to rank.</p>
      <p>Although we also experimented with prompts in the
same language as each query, this did not improve
performance. Consistently using a single prompt ensures a
language-agnostic and symmetric bi-encoder approach.</p>
    </sec>
    <sec id="sec-6">
      <title>C. Full Results</title>
      <sec id="sec-6-1">
        <p>This section presents the full set of experimental results. Table 5 and Table 6 include the mean reciprocal rank (MRR) for each model across all tasks in MELO.</p>
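<p>MRR as reported here can be computed from each query's ranked candidate list; a minimal sketch, with illustrative names:</p>

```python
def reciprocal_rank(ranked_ids, relevant_ids) -> float:
    """1 / rank of the highest-ranked relevant corpus element,
    or 0.0 if no relevant element appears in the ranking."""
    relevant = set(relevant_ids)
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs) -> float:
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```
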
        <p>Although not included with the main results, we also evaluated a random baseline for each dataset, where the score s(q, c) for any query q and any corpus element c is drawn from a uniform distribution. The performance of this baseline varies depending on the number of corpus elements and the distribution of relevant elements per query, but in general, its MRR is close to 0.020.</p>
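<p>As a sanity check (our own derivation, not from the results above): when each query has exactly one relevant element whose rank under a random scorer is uniform over 1..n, the expected MRR is the harmonic number H_n divided by n. With multiple relevant elements per query, or smaller corpora, the value rises, which is consistent with the variability noted above.</p>

```python
def expected_random_mrr(n: int) -> float:
    """Expected MRR of a uniformly random scorer when each query has
    exactly one relevant element among n corpus elements: H_n / n."""
    return sum(1.0 / k for k in range(1, n + 1)) / n
```
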
        <p>Additionally, Figure 5 shows scatterplots illustrating the correlation between model performance and the median of the lexical overlap index described in Appendix A: the minimum normalized edit distance per query.</p>
        <p>Finally, in Figure 6 and Figure 7 we show the top-k accuracy (A@k) for a selection of models in every task in MELO.</p>
        <p>17: https://openai.com/index/new-embedding-models-and-api-updates</p>
        <p>[Figure 5: scatterplots of per-task performance (MRR) against median lexical distance, for mUSE-CNN and OpenAI, with points labeled by country code. (a) Absolute performance (in MRR). (b) Performance relative to the lexical baseline Char TF-IDF, over queries and corpus elements in monolingual datasets.]</p>
        <p>[Figures 6 and 7: A@k curves per task (including USA-en-xx, NLD-nl-en, NOR-no-en, and POL-pl-en) for the models OpenAI, mUSE-CNN, Char TF-IDF, and Edit Distance. Panels: (a) results for the tasks of Latvia, the Netherlands, Norway, Poland, and Portugal; (b) Czechia, Germany, Denmark, Spain, and Estonia; (c) France, Croatia, Hungary, Italy, and Lithuania.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>