<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>HULAT-UC3M at TalentCLEF: Artificial Intelligence and Natural Language Processing Applied to HR Management</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alvaro Tejera Villar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Segura Bedmar</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Human Language and Accessibility Technologies Group (HULAT), Computer Science and Engineering Department, Universidad Carlos III de Madrid, Av. de la Universidad</institution>
          ,
          <addr-line>30, 28911 Leganés, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper describes the participation in the TalentCLEF 2025 lab, which focuses on Natural Language Processing methods for Human Capital Management through two tasks: multilingual job title matching (Task A) and job skill prediction (Task B). We explored a range of approaches combining dense semantic representations via sentence embeddings with reranking techniques that leverage Large Language Models (LLMs). In particular, we implemented an LLM-based reranking strategy in which top candidates retrieved via embeddings are reordered based on contextual reasoning. Our systems were designed to tackle two main challenges: the semantic matching of job titles across languages and the accurate prediction of relevant skills in knowledge-intensive scenarios. We also focused on making the models efficient and easy to use in real-world applications. The experiments included both fine-tuned and zero-shot models, tested in several languages and evaluation settings. The work provides insights into the comparative performance of embedding-based and LLM-based methods in multilingual and skill inference scenarios, contributing to a better understanding of their respective strengths and trade-offs in real-world talent management applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Job Title Matching</kwd>
        <kwd>Skill Prediction</kwd>
        <kwd>Sentence Embeddings</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>Human Capital Management</kwd>
        <kwd>Multilingual NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing availability of labor market data and the growing demand for personalized talent
management solutions have highlighted the importance of Natural Language Processing (NLP) in the
field of Human Capital Management (HCM) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>
        In this context, the TalentCLEF 2025 lab [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposes two tasks aimed at advancing multilingual
information retrieval methods for real-world recruitment and skill identification scenarios. Both tasks
are rooted in real-world applications within HCM and use standardized taxonomies such as ISCO
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and ESCO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which are international classification systems for occupations and skills, to ensure
relevance and consistency across languages and domains.
      </p>
      <p>The first task (Task A) addresses the challenge of identifying equivalent job titles across different
languages. Given a job title as a query, the objective is to retrieve and rank the most similar titles from
a predefined list of candidates (the corpus_elements file). This list is part of a broader multilingual
corpus that includes job titles in English, Spanish, German, and Chinese. The overall corpus is divided
into training, development, and test datasets.</p>
      <p>The second task (Task B) focuses on predicting the most relevant professional skills for a given job
title in English. The system must retrieve and rank the most appropriate skills from a fixed set of
candidates also provided in a corpus_elements file. This file is derived from a curated corpus of
ESCO skills, each enriched with lexical variants and aliases to simulate realistic job description language.
As in Task A, the data is organized into training, development, and test datasets. Importantly, while
Task A provides training and evaluation data based on multilingual job titles aligned with ESCO and
ISCO taxonomies, Task B follows a hybrid data setup: the training data is derived from structured ESCO
sources, whereas the development and test sets are manually annotated. This distinction is crucial,
as it introduces realistic and linguistically diverse skill expressions that go beyond formal taxonomy
definitions—challenging systems to generalize effectively in practical, knowledge-intensive scenarios.</p>
      <p>
        This paper presents the approaches and results of our participation in both tasks. The core of
our methodology explores the combination of dense semantic representations (sentence embeddings)
with Large Language Models (LLMs) as rerankers, in a pipeline where top candidates retrieved via
embeddings are reordered based on contextual reasoning. While this resembles Retrieval-Augmented
Generation (RAG) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], our system does not perform text generation, but rather uses the LLM to rerank
a retrieved list of candidates. This LLM-based reranking approach provides a flexible middle ground
between pure semantic retrieval and generative reasoning, enhancing both performance and scalability.
      </p>
      <p>We implemented several model variants for each task and evaluated their performance using the
official evaluation metric, Mean Average Precision (MAP), as proposed by the organizers. This allowed
us to systematically compare the models and analyze their behavior in multilingual and
knowledge-intensive retrieval scenarios.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets</title>
      <p>The datasets used in TalentCLEF 2025 were specifically designed to reflect real-world scenarios in
multilingual job retrieval and skill inference, with data aligned to the ESCO and ISCO taxonomies to
ensure consistency across languages and professions.</p>
      <sec id="sec-2-1">
        <title>2.1. Task A – Multilingual Job Title Matching</title>
        <p>The dataset for Task A includes job titles in English, Spanish, German, and Chinese. Each
title is associated with an ESCO occupation ID and a broader ISCO family ID, enabling semantic
grouping across languages. For instance, the job title air commodore is linked to the ESCO ID
f2cc5978-e45c-4f28-b859-7f89221b0505 and belongs to the ISCO family C0110 (Armed Forces
Officers).</p>
        <p>The scope of the task is restricted to a fixed reference set of 2,500 job titles, provided by the organizers
in a file named corpus_elements. This file defines the complete retrieval space for the task: each
entry consists of a unique identifier and its associated job title. All system predictions must be selected
from this predefined list, ensuring consistency and comparability across submissions.</p>
        <p>The dataset is divided as follows:
• Training set: Contains 15,000 job title pairs per language (English, Spanish, German). No training
data is provided for Chinese. Both job titles in the same pair share the same ESCO occupation ID,
indicating they are equivalent terms for the same occupation. Each pair is represented by four
fields: family_id (the corresponding ISCO family), id (the ESCO occupation ID), jobtitle_1
(the text for the first job title), and jobtitle_2 (the text for the second job title).
• Development set: Organized into three files:
– queries: includes a list of 100 job titles. Each entry represents a real-world job
title for which the system must retrieve similar titles from the reference corpus
(corpus_elements).
– corpus_elements: contains the full set of candidate job titles available for retrieval. Each
entry in this file includes a unique identifier and the corresponding job title. All predictions
must be selected from this predefined set.
– q_rels: Establishes the relationships linking each job title in the queries file to its relevant
job titles within corpus_elements. This file follows the TREC-style format: each line
contains a query ID, a candidate job title ID from corpus_elements, and a relevance label
(1 for relevant, 0 for not relevant).
• Test set: Comprises 5,000 job titles per language, with 100 manually selected queries used for
evaluation. Unlike the development set, the test set does not include q_rels.</p>
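        <p>As an illustration of the TREC-style format described above, the following stdlib-only sketch (our own, with hypothetical IDs; not part of the organizers' tooling) parses q_rels lines into a mapping from query IDs to their relevant candidate IDs:

```python
from collections import defaultdict

def parse_qrels(lines):
    """Parse TREC-style q_rels lines: a query ID, a candidate job title ID
    from corpus_elements, and a relevance label (1 relevant, 0 not)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        query_id, candidate_id, label = parts[0], parts[1], parts[-1]
        if label == "1":
            relevant[query_id].add(candidate_id)
    return dict(relevant)

# Hypothetical example lines (IDs are illustrative only):
example = ["q1 c17 1", "q1 c42 0", "q2 c17 1"]
print(parse_qrels(example))  # {'q1': {'c17'}, 'q2': {'c17'}}
```
        </p>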
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task B – Job Title-Based Skill Prediction</title>
        <p>The dataset for Task B focuses on predicting skills relevant to English-language job titles, with
information drawn directly from the ESCO taxonomy. It is divided into three splits:
• Training set: Over 5,000 job titles are linked to their most representative skills. Both job titles
and skills are represented by their corresponding ESCO URIs. This set also specifies whether
each skill is essential or optional for a given job.
• Development set: Comprises 200 job titles, each associated with a curated set of relevant skills.</p>
        <p>Skills are defined not only by their ESCO ID but also by a list of lexical variants (aliases), simulating
realistic and noisy job descriptions.</p>
        <p>• Test set: Includes 500 job titles for which participants must predict a ranked list of relevant skills.</p>
        <p>Although the training set for Task B is constructed using structured information from the ESCO
taxonomy, the development and test sets are manually curated. This ensures realistic and diverse
skill representations that reflect practical usage scenarios, which is crucial for fair and comprehensive
evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>To address the two tasks proposed in TalentCLEF 2025, we designed a flexible architecture combining
dense vector representations with LLM-based reranking techniques. Our methodology was guided by
the goal of balancing semantic accuracy, multilingual robustness, and computational feasibility. This
section presents in detail the two implemented approaches: sentence embedding-based retrieval and
LLM-based reranking of retrieved candidates, as well as their shared components such as preprocessing,
retrieval, and output handling.</p>
      <sec id="sec-3-1">
        <title>3.1. Method 1: Sentence Embedding-Based Retrieval</title>
        <p>To semantically represent both queries and the predefined list of candidates (job titles in Task A and
skills in Task B), we used a sentence embedding model that maps each textual input into a dense vector
within a shared semantic space. This representation enables comparisons based on meaning rather than
surface form, making it well suited for semantic retrieval.</p>
        <p>
          We selected snowflake-arctic-embed-l-v2.0 as our base sentence embedding model. This
multilingual model, with 568 million parameters and 1024-dimensional output vectors, was chosen for
its top-ranked performance in the retrieval task of the Massive Text Embedding Benchmark (MTEB) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
Among all models evaluated, it achieved the highest overall score in this category, making it the
strongest available model for retrieval-based applications at the time of writing. The retrieval category
in MTEB evaluates how well a model can represent both queries and documents so that items with
similar meaning end up close to each other in the model’s internal representation. This allows the
system to find the most relevant results based on meaning, not just exact word matches. In our case,
this is essential for retrieving job titles or skills that are most closely related to the given query. The
model also supports all four target languages of the challenge: English, Spanish, German, and Chinese.
        </p>
        <p>
          In our pipeline, both queries and the predefined list of candidates were independently
normalized—lowercased and stripped of punctuation—and then encoded into dense vectors using the sentence
embedding model. These vectors were indexed using FAISS (Facebook AI Similarity Search) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a
library optimized for fast similarity search in high-dimensional spaces. We used the IndexFlatL2
structure, which performs exact nearest neighbor search using Euclidean distance. While not optimized
for large-scale datasets with millions of entries, it was appropriate given the moderate sizes of the
predefined lists of candidates: approximately 2,500 job titles in Task A and 2,500 skills in Task B. To
accommodate multilingualism and task specificity, we created five FAISS indices: one per language
(English, Spanish, German and Chinese) for Task A, and one for Task B, which was conducted exclusively
in English.
        </p>
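        <p>The indexing-and-search step can be sketched as follows. Loading the actual snowflake-arctic-embed-l-v2.0 model is outside the scope of this sketch, so a random mock encoder stands in for it, and exact L2 search is re-implemented in numpy to mirror what faiss.IndexFlatL2 computes (on unit-normalized vectors, L2 distance and cosine similarity produce the same ranking):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(texts, dim=8):
    """Mock encoder standing in for the 1024-dim sentence embedding model;
    returns unit-normalized random vectors so the sketch runs stand-alone."""
    vecs = rng.normal(size=(len(texts), dim)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

corpus = ["air commodore", "squadron leader", "chief surgeon"]
corpus_vecs = encode(corpus)  # real system: index = faiss.IndexFlatL2(dim); index.add(corpus_vecs)

def search(query_vec, vectors, k=2):
    """Exact nearest-neighbor search by Euclidean distance (IndexFlatL2 semantics)."""
    dists = np.linalg.norm(vectors - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

idx, d = search(encode(["pilot officer"])[0], corpus_vecs)
print([corpus[i] for i in idx])  # the two nearest titles, closest first
```
        </p>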
        <p>We evaluated two embedding configurations. First, a zero-shot setup, where the base semantic
embedding model was used without additional training. Second, a fine-tuned version for Task A. The
aim of fine-tuning was to adapt the semantic embedding model to better capture subtle semantic
differences between job titles.</p>
        <p>The training set for Task A contained 15,000 job title pairs per language, where each pair consists
of two diferent job titles that refer to the same ESCO occupation. For example, the titles “pilot
officer” and “squadron leader” are distinct surface forms but both map to the ESCO occupation
http://data.europa.eu/esco/occupation/f2cc5978-e45c-4f28-b859-7f89221b0505.
These pairs were used as positive examples to fine-tune the sentence embedding model using a
contrastive learning approach.</p>
        <p>We employed the sentence-transformers library and optimized the model with the
MultipleNegativesRankingLoss objective. This loss function is particularly effective for retrieval
tasks. During training, it takes a batch of positive pairs (e.g., “pilot officer” – “squadron leader”, “chief
surgeon” – “medical director”, etc.) and encourages the model to bring the embeddings of each positive
pair closer together in the vector space.</p>
        <p>Fine-tuning was performed independently for English, Spanish, and German, resulting in three
language-specific sentence embedding models. Each was trained for five epochs using the AdamW
optimizer, with a batch size of 32 and a learning rate of 2e-5.</p>
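        <p>To make the objective concrete, the following numpy sketch re-implements the in-batch logic of MultipleNegativesRankingLoss (our own illustration, not the sentence-transformers code): each jobtitle_1 embedding is scored against every jobtitle_2 embedding in the batch, and the loss is the cross-entropy of selecting the true partner, with the other pairs acting as negatives:

```python
import numpy as np

def mnr_loss(emb1, emb2, scale=20.0):
    """In-batch MultipleNegativesRankingLoss (illustrative re-implementation).
    emb1[i] and emb2[i] form a positive pair; emb2[j], j != i, act as negatives."""
    a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    scores = scale * (a @ b.T)                    # cosine similarity matrix
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal = correct matches

rng = np.random.default_rng(0)
emb1 = rng.normal(size=(4, 8))
loss_random = mnr_loss(emb1, rng.normal(size=(4, 8)))            # unrelated pairs
loss_aligned = mnr_loss(emb1, emb1 + 0.01 * rng.normal(size=(4, 8)))
print(loss_aligned, loss_random)  # aligned pairs yield a much lower loss
```

Minimizing this loss pulls the two titles of each positive pair together while pushing apart titles from different pairs, which is exactly the geometry the retrieval step relies on.
        </p>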
        <p>After fine-tuning, these models were used during inference to semantically represent both the input
queries and the predefined list of candidates. Each query and each corpus item was passed through
the semantic embedding model to obtain a dense vector representation. Then, semantic similarity
between the query and all candidate embeddings was computed using cosine similarity. The system
ranked the candidate job titles according to their distance to the query in the vector space, assuming
that more relevant job titles will have embeddings closer to the query vector. For the Chinese track, no
fine-tuning was applied due to the lack of training data, so the base sentence embedding model was
used in zero-shot mode.</p>
        <p>It is important to understand that the goal of the fine-tuning process is to improve how the model
represents the meaning of job titles. The model does not group or classify new queries into categories,
nor does it directly say how similar two titles are. What it does is learn to place job titles with similar
meanings close together in a shared vector space. Thanks to this, when the system later searches for
the most relevant titles, it can simply find those whose vectors are closest to the query—making the
retrieval more accurate and better aligned with human intuition.</p>
        <p>For Task B, fine-tuning was not applied due to hardware limitations and the large scale of the
candidate set. Unlike Task A, the available data for Task B did not consist of pre-defined positive
pairs, but rather a large list of skills linked to each job title. Generating meaningful training pairs from
this data would have required significant preprocessing and computational resources. As a result, we
used the base sentence embedding model (snowflake-arctic-embed-l-v2.0) in zero-shot mode.
While this limited task-specific adaptation, the model still provided useful semantic representations for
retrieving relevant skills via FAISS.</p>
        <p>At inference time, semantic similarity between the query vector and corpus vectors was computed
using cosine distance. The top-k most similar items (k = 100) were selected and ranked. The results
were exported in the TREC-style run file format required by the organizers.</p>
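        <p>The ranking-and-export step can be sketched as below; the run-line layout follows the common TREC convention (query ID, Q0, candidate ID, rank, score, run name), though the organizers' exact column requirements may differ:

```python
import numpy as np

def rank_candidates(query_vec, cand_vecs, k=100):
    """Rank candidates by cosine similarity to the query, highest first."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

def to_trec_run(query_id, ranked, cand_ids, run_name="SE-ZeroShot"):
    """Format one query's ranking as TREC-style run lines."""
    return [f"{query_id} Q0 {cand_ids[i]} {rank} {score:.4f} {run_name}"
            for rank, (i, score) in enumerate(ranked, start=1)]

rng = np.random.default_rng(1)
cand_vecs = rng.normal(size=(5, 8))
query_vec = cand_vecs[2] + 0.01 * rng.normal(size=8)  # query near candidate 2
lines = to_trec_run("q1", rank_candidates(query_vec, cand_vecs, k=3),
                    [f"c{i}" for i in range(5)])
print(lines[0])  # candidate c2 should rank first
```
        </p>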
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Method 2: LLM-Based Reranking of Retrieved Candidates</title>
        <p>This method combines embedding-based retrieval with generative ranking using an LLM. Its primary
goal is twofold: first, to provide the model with contextual information that it does not possess inherently;
and second, to leverage the LLM’s inference and reasoning capabilities for ranking semantically relevant
items. In this case, the relevant context consists of domain-specific knowledge (e.g., job titles or skills)
that is not part of the LLM’s internal parameters. By retrieving and injecting only the most relevant
candidates into the prompt, this approach allows the LLM to operate over a semantically focused subset
of external data at runtime, thereby enhancing performance while keeping the input size manageable
and reducing inference cost.</p>
        <p>As in sentence embedding-based retrieval, the input query was normalized and embedded using
the same sentence embedding model. The resulting vector was used to retrieve the top 500 most
relevant entries from the FAISS index, built from the corresponding predefined list of candidates. These
candidates, along with the original query, were inserted into a structured prompt, which was submitted
to the Gemini LLM via API.</p>
        <p>Prompts were concise and task-specific. For Task A, the prompt included the original job title and the
list of retrieved candidates, with clear instructions to return a ranked list of IDs in valid JSON format.
Here is part of our prompt:</p>
        <p>You are a multilingual job title matching expert. Given the job title: "X",
rank the following job titles in order of most to least similar based on
professional role similarity.</p>
        <p>Important Instructions: 1. No gender filtering in reasoning, ignore any
grammatical gender constraints when comparing professional similarity.
2. The query job title may appear in the list, you must exclude it from
your ranking.
3. There might be duplicate job titles with different numbering in the list
below. Treat these duplicates as a single unique job title when ranking.
Only include the first unique occurrence.
4. You (the model) get to decide how many job titles to return. Provide only
the most relevant ones according to professional role similarity, return
at least 10 jobs and a maximum of 110 jobs in the list.
5. Return only valid JSON with the key "ranking", a list of the numbers
(from the provided list) for the job titles you chose in descending order
of similarity (most similar first).
6. You don’t need to return 110 jobs, just the most relevant ones according
to professional role similarity.
7. Don’t repeat job titles in the ranking list, just one job title with
different unique IDs.</p>
        <p>Return only a valid JSON object in the format:
{"ranking": [list of IDs]}</p>
        <p>As an additional strategy, we modified the prompt to include the similarity scores computed between
the query and each candidate. These scores were provided as guidance for the LLM, helping it refine
the ranking, particularly in cases involving near-duplicate or closely related entries.
8. Use the provided **similarity score** (lower = more similar) to break
ties or help refine ranking of near-identical job titles.</p>
        <p>For Task B, we developed an LLM-based reranking system following the same embedding
representation and prompting pipeline described for Task A. The prompt was adapted to the skill prediction task:
it included the job title and the top 500 retrieved skills from the FAISS index of the predefined list of
candidates, along with structured instructions guiding the LLM to rerank and return the most relevant
skills in JSON format.</p>
        <p>You are a multilingual expert in professional skill prediction with deep
understanding of industry job functions.</p>
        <p>Your task:
Given the job title: X, select the most relevant skills from the list below
that align with the actual responsibilities, competencies, and expectations
typically associated with the job.</p>
        <p>Guidelines:
1. Select no fewer than 10 and no more than 110 skills — choose only those
that are truly necessary for the job.
2. Choose a diverse set of skills reflecting technical, functional, and
soft skills.
3. Ignore surface-level word matches — focus on true professional alignment.
4. Use only the numbers shown before each skill to refer to them.
5. Rank the selected skills from most relevant (first) to least relevant
(last).
6. Do not return duplicate skills.
7. Don’t repeat skills in the ranking list; do not include the same skill
with the same c_id more than once. Each skill in skill_list_str has a
unique c_id.</p>
        <p>Additionally, we tested a full-prompt LLM variant for Task B, in which the complete predefined list
of 2,500 candidate skills—including all aliases—was directly included in the prompt. This configuration
bypassed the retrieval step entirely, relying on the LLM’s ability to reason over the entire candidate set
without preselection. We used the same prompt structure as in the LLM-based reranking configuration,
but applied it to the full list of candidates. While this approach increased input size and inference cost,
it ensured that the model considered all possible skills during ranking. This variant was submitted as
an alternative run.</p>
        <p>To ensure that the outputs generated by the LLM were structurally valid and compatible with the
evaluation scripts, we applied schema-based validation during post-processing. This step enforced
a consistent output format—a single JSON object with a "ranking" key mapped to a list of unique
item IDs—reducing the risk of formatting errors and improving reliability. Using output schemas in
LLM pipelines is particularly important when integrating LLMs into structured evaluation workflows,
where even minor inconsistencies can lead to parsing failures or incorrect scoring. We enforced a lower
bound of 10 elements to guarantee compatibility with Precision@10 and an upper bound of 110 to
support evaluation up to Precision@100. These constraints improved robustness and helped limit
the length of the LLM’s output, reducing inference cost. This method made it possible to work with
just the relevant parts of the data, without having to include the whole dataset in the prompt, which
helped reduce cost and improve scalability.</p>
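        <p>Our validation step relied on pydantic; the sketch below re-implements the same checks with only the standard library so it can be read stand-alone (the "ranking" key and the 10–110 bounds come from the prompt and constraints described above):

```python
import json

def validate_ranking(raw, min_len=10, max_len=110):
    """Validate an LLM reply: one JSON object {"ranking": [IDs]} with
    between min_len and max_len unique item IDs."""
    obj = json.loads(raw)
    if set(obj) != {"ranking"}:
        raise ValueError("expected exactly one key: 'ranking'")
    ranking = obj["ranking"]
    if not isinstance(ranking, list):
        raise ValueError("'ranking' must be a list")
    if len(set(ranking)) != len(ranking):
        raise ValueError("duplicate IDs in ranking")
    if not min_len <= len(ranking) <= max_len:
        raise ValueError(f"ranking length must be in [{min_len}, {max_len}]")
    return ranking

ok = validate_ranking('{"ranking": [5, 12, 3, 44, 7, 19, 101, 8, 2, 60]}')
print(len(ok))  # 10 IDs: passes both the lower and upper bound
```
        </p>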
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Resources Employed</title>
      <p>The systems developed for both tasks were executed on a local workstation with the following hardware
specifications:
• Processor: AMD Ryzen 5 3600XT (6 cores, 12 threads)
• Memory: 16 GB DDR4 RAM
• GPU: NVIDIA GeForce RTX 3060 Ti with 8 GB VRAM</p>
      <p>This configuration was sufficient to train fine-tuned sentence embedding models (as done in Task A),
generate dense vector indexes with FAISS, and perform inference on moderately sized corpora. In Task
B, although no fine-tuning was performed, local embedding and retrieval were still executed efficiently.
Most of the experiments were run locally, with GPU acceleration via CUDA used to speed up both
embedding and training processes.</p>
      <p>In terms of software, we relied primarily on Python, using the following libraries:
• sentence-transformers for embedding generation and fine-tuning.
• FAISS for dense vector indexing and similarity search.
• pandas and numpy for data handling and preprocessing.</p>
      <p>• pydantic for output validation and schema enforcement.</p>
      <p>Additionally, we used the Google Gemini API to access the LLM for both our LLM-based reranking
and full-prompt systems. Query batching and prompt formatting were handled automatically via a
custom wrapper, which also managed error handling and retries.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <p>This section presents and analyzes the official results obtained by the submitted systems for both Task
A (Multilingual Job Title Matching) and Task B (Job Title-Based Skill Prediction), based on the official
evaluation metric: Mean Average Precision (MAP). All results correspond to the final test sets held by
the organizers, who performed the evaluation and released the MAP scores without disclosing the test
data itself.</p>
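      <p>For reference, MAP is the mean over queries of average precision; a compact stdlib-only sketch (our own illustration, not the organizers' evaluation script; IDs are hypothetical):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: average of precision@k over ranks k that hold a
    relevant item, divided by the total number of relevant items."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, qrels):
    """MAP: mean AP over every query in the relevance judgments."""
    return sum(average_precision(runs.get(q, []), rel)
               for q, rel in qrels.items()) / len(qrels)

runs = {"q1": ["c2", "c9", "c4"], "q2": ["c1", "c3"]}
qrels = {"q1": {"c2", "c4"}, "q2": {"c7"}}
print(round(mean_average_precision(runs, qrels), 4))  # (5/6 + 0) / 2 ≈ 0.4167
```
      </p>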
      <sec id="sec-5-1">
        <title>5.1. Task A – Multilingual Job Title Matching</title>
        <p>Task A focuses on semantic matching across languages, requiring systems to handle lexical variation,
gendered occupational forms, and cultural nuances. The evaluation was carried out over both
monolingual tracks (en-en, es-es, de-de, zh-zh) and cross-lingual tracks (en-es, en-de, en-zh). The submitted
models, as mentioned in previous sections, are:
• SE-FineTune-Snow: Sentence embedding model (snowflake-arctic-embed-l-v2.0)
fine-tuned for English, Spanish, and German. Chinese was evaluated in zero-shot mode. This approach
encodes both queries and candidate job titles into dense vectors and uses cosine similarity to
rank the most semantically similar items.
• SE-ZeroShot: Same model as above, but used without any task-specific fine-tuning. It also relies
on sentence embeddings and cosine similarity to retrieve the most relevant results.
• LLM-Reranker-1: An LLM-based reranking approach that used FAISS to retrieve 500
candidates and passed them to Gemini via a structured prompt.
• LLM-Reranker-2: Identical to LLM-Reranker-1, but using semantic similarity scores included in
the prompt to help the LLM refine its ranking.</p>
        <sec id="sec-5-1-1">
          <title>Analysis:</title>
          <p>As evidenced in Table 1, sentence embedding-based systems consistently outperform LLM-based
reranking approaches across all evaluation tracks in Task A. This performance gap is observed in both
monolingual (en-en, es-es, de-de, zh-zh) and cross-lingual (en-es, en-de, en-zh) settings, underscoring
the relative robustness and reliability of sentence embedding models in multilingual retrieval tasks.
The highest MAP score was obtained by the fine-tuned version of the sentence embedding model
(SE-FineTune-Snow), which highlights the benefits of task-specific adaptation. A detailed analysis of
the results yields several relevant insights:
• Effectiveness of Language-Specific Fine-Tuning: The marked improvement of the fine-tuned
model over its zero-shot variant demonstrates the added value of even modest task-specific
training data. Fine-tuning allowed the sentence embedding model to better capture subtle
semantic distinctions between job titles in each language, resulting in significantly higher retrieval
precision.
• Robustness and Generalization Across Languages: Although performance decreases in
cross-lingual settings—as expected due to added semantic and syntactic complexity—fine-tuned sentence
embedding models maintain relatively strong MAP scores. The strong zero-shot performance on
zh-zh (0.433) further illustrates the capacity of the base model to generalize across typologically
distant languages, a likely consequence of extensive multilingual pretraining.
• Limitations of LLM-Based Reranking Approaches: The comparatively low scores of the
LLM-based reranking systems (LLM-Reranker-1 and LLM-Reranker-2), both under 0.19, indicate
that LLMs struggle in this retrieval-centered task. This may be attributed to the accumulation of
noise in the retrieval stage and to the intrinsic variability of LLM outputs, even when structured
prompts and schema validation are employed. Unlike embedding-based approaches, reranking
with LLMs relies heavily on the model’s interpretation of the input context, which introduces
unpredictability and potential drift from the intended ranking criteria.
• Language-Specific Challenges – The Case of German: Among the evaluated languages,
German consistently yielded the lowest MAP scores. This could be due to its morphological
richness, frequent compound word formations, and orthographic variation, which complicate
semantic matching. Sentence embedding models may partially mitigate these effects, but the
structural properties of the language still pose notable challenges, especially in cross-lingual
contexts.
• Impact of Gendered Language on Evaluation: A particularly salient issue arises from the use
of gendered occupational terms in languages such as Spanish and German. LLM-based systems
tend to treat masculine and feminine variants as semantically equivalent, an arguably correct
behavior in real-world applications, but this equivalence is not always captured in the gold
standard (q_rels) relevance annotations.</p>
          <p>To illustrate this misalignment, Table 2 presents a qualitative example based on the development
set for Task A, since the official relevance judgments for the test set (q_rels) were not publicly
available. Specifically, we use the query "ingeniera de automatización" and compare
the top predictions returned by the LLM with a subset of job titles annotated as relevant in the
corresponding q_rels file. Since the gold standard provides binary relevance labels without
any predefined ranking, we manually extracted the first five relevant job titles as they appear in
the q_rels file for this query. This visual comparison reveals that several masculine or lexical
variants predicted by the model are not included among the annotated relevant items, despite
being semantically appropriate. This suggests that systems focusing on semantic equivalence
may be penalized under a strictly lexical evaluation framework.</p>
          <p>Sentence embedding-based models demonstrate strong effectiveness for multilingual job title
matching, offering consistent and robust performance across both monolingual and cross-lingual settings.
Their reliability, especially when fine-tuned with limited domain-specific data, highlights their
scalability and practical utility in real-world retrieval scenarios. By contrast, LLM-based reranking approaches
show greater variability and are more sensitive to evaluation design—particularly when semantic
equivalence is not explicitly reflected in the relevance annotations.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Task B – Job Title-Based Skill Prediction</title>
        <p>Task B focused on predicting relevant skills from a predefined list of candidates for a given job title.
Unlike Task A, this task was conducted exclusively in English. We submitted the following approaches:
• SE-ZeroShot: Sentence embedding model (snowflake-arctic-embed-l-v2.0) used without
task-specific fine-tuning. Both the job title query and candidate skills are encoded into dense
vectors, and similarity is computed using cosine distance to rank the most relevant skills.
• LLM-Reranker: Combines embedding-based retrieval with LLM-based reranking. The top
500 most relevant skills are retrieved using FAISS, based on cosine similarity with the query
embedding, and then re-ranked by the Gemini LLM using a structured prompt that incorporates
both the query and retrieved candidates.
• LLM-FullPrompt: A non-retrieval approach where the entire list of 2,500 candidate skills is
directly embedded into the prompt. The Gemini LLM processes the full input and outputs a
ranked list of relevant skills, without relying on prior filtering or similarity scoring.</p>
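<p>The retrieve-then-rerank flow of the LLM-Reranker run can be sketched as below. This is a self-contained toy, not the submitted system: a deterministic hash-based embed stands in for snowflake-arctic-embed-l-v2.0, a NumPy dot product stands in for FAISS over the 2,500 skills, and llm_rerank only marks where the Gemini prompt would be issued.</p>

```python
import zlib
import numpy as np

def embed(texts, dim=64):
    """Toy deterministic unit-norm embeddings (stand-in for a real encoder)."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(zlib.crc32(t.encode("utf-8")))
        v = rng.standard_normal(dim)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

def retrieve_top_k(query, candidates, k=3):
    """Dense retrieval step: cosine similarity between query and candidates
    (a dot product, since the vectors are unit-norm); the submitted system
    uses FAISS with k=500."""
    sims = embed(candidates) @ embed([query])[0]
    order = np.argsort(-sims)[:k]
    return [(candidates[i], float(sims[i])) for i in order]

def llm_rerank(query, shortlist):
    """Placeholder for the LLM call: the real system sends the query and the
    retrieved candidates to Gemini in a structured prompt. Here we keep the
    retrieval order, only to show where the call slots into the pipeline."""
    return [skill for skill, score in shortlist]

skills = ["python programming", "welding", "data analysis", "pastry baking"]
shortlist = retrieve_top_k("data scientist", skills, k=2)
ranking = llm_rerank("data scientist", shortlist)
```

<p>The LLM-FullPrompt variant skips retrieve_top_k entirely and passes the whole skill list to the prompt, which is the design difference analyzed below.</p>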
        <sec id="sec-5-2-1">
          <title>Analysis:</title>
          <p>As presented in Table 3, the performance differences among the evaluated systems in Task B are
less pronounced than those observed in Task A. However, the LLM-Reranker approach achieved the
highest MAP score (0.141), while the LLM-FullPrompt model performed slightly worse (0.111), just
below the embedding-based system in zero-shot mode (0.112). Despite the narrow margins, these results
reveal important contrasts in how embedding-based models and LLM-based approaches address the
underlying challenge of predicting relevant skills from job titles—a task that requires not only
surface-level similarity but also a degree of contextual and inferential reasoning. The following observations
are particularly relevant:
• Knowledge-Intensive Inference: Skill prediction is inherently a knowledge-driven task. The
relationship between a job title and its associated skills is often implicit and context-dependent,
rather than lexically explicit. This places greater demands on models to perform semantic
inference, favoring approaches—such as LLMs—that are capable of leveraging contextual cues
and domain knowledge beyond what is captured in dense vector representations.
• Relative Effectiveness of LLM-Based Reranking: Contrary to its performance in Task A,
the LLM-Reranker approach achieves the highest MAP score in this task. This suggests that the
combination of targeted retrieval and LLM-driven reranking enables the system to better capture
the latent associations between job titles and relevant skills. While the overall improvement
is modest, it indicates that this approach can provide meaningful advantages in semantically
complex scenarios.
• Limitations of Sentence Embedding Models in Complex Associations: The sentence
embedding model, used in zero-shot mode, performs similarly to the LLM-FullPrompt model
but does not reach the performance of the LLM-Reranker approach. This may be because the
model struggles to capture connections between job titles and skills when those relationships
are not directly visible in the text. In many cases, knowing which skills go with a job requires
background knowledge that goes beyond the words themselves.
• Context Overload in Full-Prompt LLMs: The LLM-FullPrompt variant, which processed all
2,500 candidate skills in a single input, did not yield performance gains over the LLM-Reranker
configuration. This outcome highlights the practical limitations of large-context inputs: increasing
the input size may introduce noise, reduce the model’s focus, and exceed effective attention
capacity. It is also plausible that positional bias played a role: transformer
models tend to allocate more weight to tokens that appear earlier in the prompt, and our
skills were fed in a fixed catalogue order that was unrelated to relevance. Although we did not
isolate this factor experimentally, such ordering effects could have further blunted the model’s
effectiveness. Together, these considerations underline the need for selective context curation
and careful ordering when constructing prompts for knowledge-intensive tasks.</p>
          <p>While no approach achieved particularly high scores in absolute terms, the results suggest that
LLM-based reranking approaches hold greater potential for tasks requiring semantic inference over
external knowledge sources. Sentence embedding models remain competitive due to their efficiency and
simplicity, but may require further adaptation or hybridization to match the performance of
context-aware LLM systems. Careful prompt engineering and retrieval design emerge as critical factors for
maximizing LLM effectiveness in this setting.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>The results obtained in both tasks highlight the complementary strengths of sentence
embedding-based retrieval and LLM-based reranking approaches in multilingual Human Capital Management
scenarios. Sentence embedding models, particularly when fine-tuned, demonstrated robust performance
in semantic retrieval tasks such as multilingual job title matching (Task A), combining efficiency,
consistency, and scalability. By contrast, LLM-based reranking approaches showed greater flexibility
and capacity for contextual reasoning, particularly in knowledge-driven tasks like skill prediction (Task
B). However, these systems are more expensive to run, as they depend on external LLM APIs, which
often charge based on the amount of text processed. They are also more sensitive to how the prompt is
written: small changes in wording or structure can lead to noticeable differences in the results.</p>
      <p>Hybrid architectures that integrate sentence embeddings with LLM-based reranking—specifically,
our LLM-Reranker approach—emerged as a promising middle ground. This was particularly evident in
Task B, where the LLM-Reranker system recorded the highest MAP score (0.141), only marginally ahead of
both the zero-shot sentence-embedding model and the LLM-FullPrompt configuration. Meanwhile, in
Task A, the fine-tuned sentence embedding model achieved the best results (0.42 MAP), demonstrating
the benefits of lightweight, language-specific adaptation in multilingual semantic retrieval.</p>
      <p>Although LLMs did not perform as well in structured retrieval tasks—like job title matching in Task
A—they still offered valuable insights in our experiments. By using LLMs through an API, we were able
to explore different prompt formulations, test the impact of including similarity scores, and evaluate
alternative ranking strategies, all without needing heavy infrastructure. This flexibility allowed us
to experiment quickly and better understand the model’s behavior in both well-structured and more
ambiguous tasks. However, we also observed that LLM performance was highly sensitive to prompt
design and input length, particularly in full-prompt settings like Task B. These observations suggest
that while LLMs are powerful, effectively using them for ranking in real-world talent management tasks
still requires careful prompt engineering and data curation—more so than is often assumed. Future
improvements could involve domain adaptation through lightweight fine-tuning techniques like LoRA
or adapter layers, which would allow better performance without large computational costs.</p>
      <p>There are several promising directions for future work. One of them is improving the LLM-based
reranking approach by building on the strategies already applied. In our current system, we included
similarity scores in the prompt to help the LLM rank the most relevant results. This could be further
enhanced by adding an intermediate reranking step before prompt generation. A lightweight model
trained on a small set of annotated examples could learn to reorder the retrieved candidates based on
informative features such as word overlap or semantic similarity. This additional step may help address
the relatively low performance observed in Task A, improving the quality of the final candidate list
submitted to the LLM.</p>
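<p>A minimal sketch of such an intermediate reranker, assuming just two hand-picked features (Jaccard token overlap and the retrieval similarity) combined with fixed weights; the features, weights, and function names are illustrative, and a trained model would instead learn the combination from the annotated examples.</p>

```python
# Hypothetical intermediate reranker: combine token overlap with the
# retrieval similarity score and reorder candidates before the prompt is
# built. Features and weights are illustrative, not the submitted system.
def token_overlap(a, b):
    """Jaccard overlap between lowercased token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta.intersection(tb)) / max(1, len(ta.union(tb)))

def feature_rerank(query, candidates_with_sim, w_overlap=0.5, w_sim=0.5):
    """Score each (candidate, similarity) pair and sort by combined score."""
    scored = [(c, w_overlap * token_overlap(query, c) + w_sim * s)
              for c, s in candidates_with_sim]
    return [c for c, score in sorted(scored, key=lambda x: -x[1])]
```

<p>With a query like "automation engineer", a candidate sharing a token can overtake a candidate with a slightly lower retrieval score, which is the kind of reordering the intermediate step is meant to provide.</p>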
      <p>Another avenue is making the system more adaptive depending on the input. For instance, when the
job title is vague or uncommon, the system could rely more on semantic reasoning; in contrast, when
the title includes specific or well-known terms, it might prioritize exact word matches. Adjusting this
balance dynamically could enhance ranking quality, especially in ambiguous or noisy cases. Finally,
Task B could be extended to other languages to evaluate the cross-lingual generalization capabilities of
current systems. This would test whether the approach can adapt to different linguistic structures and
cultural expressions of professional skills, which is key for real-world multilingual applications.</p>
      <p>In addition to architectural improvements, refining evaluation practices could lead to fairer and
more realistic assessments. During development—specifically while reviewing the q_rels files from
the validation set—we observed that some semantically equivalent job titles, particularly gendered
variants (e.g., ingeniero vs. ingeniera), were not consistently annotated as relevant. In such cases,
LLM-based reranking approaches, which rely on semantic reasoning, may be penalized for returning
lexically different but conceptually correct outputs. Although this observation is based on limited
evidence and should be interpreted with caution—especially considering the modest performance
of LLM-based reranking in Task A—it points to a potential mismatch between system behavior and
evaluation criteria. Updating gold relevance annotations to better account for gender variation could
improve alignment with real-world expectations, especially in multilingual contexts. Finally, grouping
job titles into broader semantic categories could help reduce data sparsity and enhance alignment
across languages, particularly in low-resource cases like the Chinese subset, where no training data
was available. This would support the development of more robust and scalable multilingual systems
for talent management applications.</p>
      <p>
        Another important aspect that deserves further attention is the impact of item ordering in the prompt
on LLM-based ranking. In the reranking runs, we passed the top 500 candidates to the LLM in descending
order of cosine similarity, whereas in the Full-Prompt variant every one of the 2,500 skills was fed in
its fixed catalogue order; neither list was shuffled. Transformer models, however, seem to show some
positional bias, often giving more attention to tokens that come earlier in the sequence [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Even when
the entire prompt fits within the model’s context window, this ordering can influence the final ranking
output. Recent studies in retrieval-augmented generation and LLM-based QA (e.g., [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) show that
simply reordering retrieved passages can measurably affect generation quality and factual accuracy. This
sensitivity exposes a reproducibility and interpretability limitation of current LLM ranking pipelines.
Although ordering by cosine similarity provides an inductive bias aligned with retrieval scores, and
catalogue order is convenient in the Full-Prompt setup, neither choice appears clearly optimal in light
of our results. Future research should therefore test alternative strategies—random shuffling, semantic
clustering, or lightweight learned re-ordering—to quantify and mitigate positional effects in both
with-retrieval and no-retrieval scenarios. Addressing this limitation is key to building more robust and
transparent hybrid architectures for talent-intelligence systems.
      </p>
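<p>The three ordering regimes discussed above can be made explicit as below; this is a sketch of the prompt-construction choice only, and measuring its effect would mean issuing one prompt per ordering and comparing the LLM's rankings, which we did not do in this work.</p>

```python
import random

# Candidate-ordering strategies for prompt construction: similarity order
# (used in our reranking runs), fixed catalogue order (used in the
# Full-Prompt runs), and a seeded random shuffle as a positional-bias probe.
def order_candidates(cands_with_scores, strategy, seed=0):
    if strategy == "similarity":   # descending retrieval score
        return [c for c, s in sorted(cands_with_scores, key=lambda x: -x[1])]
    if strategy == "catalogue":    # keep the fixed input order
        return [c for c, s in cands_with_scores]
    if strategy == "shuffle":      # reproducible random permutation
        items = [c for c, s in cands_with_scores]
        random.Random(seed).shuffle(items)
        return items
    raise ValueError(strategy)
```

<p>Running the same downstream prompt over all three orderings, with the model and decoding settings held fixed, would isolate the positional effect from every other source of variance.</p>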
      <p>In summary, this work highlights the importance of selecting the right combination of semantic
retrieval and contextual reasoning depending on the specific demands of the task. While sentence
embedding models offer speed, scalability, and solid performance—especially when fine-tuned—LLMs
introduce powerful reasoning capabilities that are particularly useful for more abstract or
knowledge-intensive tasks. Our findings suggest that hybrid approaches like LLM-based reranking strike a practical
balance between these two paradigms. Moving forward, the development of multilingual Human Capital
Management systems will benefit from integrating flexible architectures, improving evaluation protocols
to reflect real-world semantic variation (e.g., gendered job titles), and enabling better adaptation to
diverse linguistic and cultural contexts. These improvements are essential not only to increase model
accuracy, but also to ensure fairness, inclusivity, and practical utility in real-world talent intelligence
applications across languages and regions.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Grant PID2023-148577OB-C21 (Human-Centered AI: User-Driven Adapted Language Models,
HUMAN_AI) by MICIU/AEI/10.13039/501100011033 and by FEDER/UE.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI’s GPT-4 in order to: assist with text
translation from Spanish to English, and to improve writing style by suggesting alternative phrasings
and enhancing overall clarity and coherence. After using this tool, the authors carefully reviewed,
edited, and verified all content to ensure its accuracy and originality, and take full responsibility for the
publication’s content.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hruschka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Otani</surname>
          </string-name>
          , T. Mitchell,
          <source>Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Graus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-J.</given-names>
            <surname>Decorte</surname>
          </string-name>
          , T. De Bie,
          <article-title>Fourth workshop on recommender systems for human resources (recsys in hr 2024)</article-title>
          ,
          <source>in: Proceedings of the 18th ACM Conference on Recommender Systems</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1222</fpage>
          -
          <lpage>1226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fabregat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>García-Sardiña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Estrella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zbib</surname>
          </string-name>
          ,
          <article-title>Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management, in: International Conference of the Cross-Language Evaluation Forum for European Languages</article-title>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <collab>International Labour Organization</collab>
          ,
          <article-title>International standard classification of occupations (ISCO)</article-title>
          ,
          <year>2008</year>
          . URL: https://ilostat.ilo.org/methods/concepts-and-definitions/classification-occupation/, accessed: 2025-05-27.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <collab>European Commission</collab>
          ,
          <article-title>European skills, competences, qualifications and occupations (ESCO) classification</article-title>
          ,
          <year>2024</year>
          . URL: https://esco.ec.europa.eu/en/classification, accessed: 2025-05-27.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W. tau Yih, T. Rocktäschel,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2005.11401. arXiv:2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          , Mteb: Massive text embedding benchmark,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2210.07316. arXiv:2210.07316.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guzhva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          , J. Johnson, G. Szilvasy, P.-E. Mazaré,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          , The faiss library,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2401.08281. arXiv:2401.08281.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paranjape</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bevilacqua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Lost in the middle: How language models use long contexts</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>157</fpage>
          -
          <lpage>173</lpage>
          . URL: https://aclanthology.org/2024.tacl-1.9/. doi:10.1162/tacl_a_00638.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>R4: Reinforced retriever-reorder-responder for retrieval-augmented large language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.02659. arXiv:2405.02659.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gascó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Hermenegildo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-S.</given-names>
            <surname>Laura</surname>
          </string-name>
          , D. C. Daniel, P. Estrella,
          <string-name>
            <given-names>R.</given-names>
            <surname>Alvaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rabih</surname>
          </string-name>
          ,
          <article-title>Talentclef 2025 corpus: Skill and job title intelligence for human capital management</article-title>
          ,
          <year>2025</year>
          . URL: https://doi.org/10.5281/zenodo.15292308. doi:10.5281/zenodo.15292308, dataset, version 0.5.0.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Deniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Retyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>García-Sardiña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fabregat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zbib</surname>
          </string-name>
          ,
          <article-title>Combined unsupervised and contrastive learning for multilingual job recommendation</article-title>
          ,
          <source>in: Proceedings of the 4th Workshop on Recommender Systems for Human Resources (RecSys in HR'24)</source>
          , volume
          <volume>3788</volume>
          , CEUR Workshop Proceedings, Bari, Italy,
          <year>2024</year>
          . URL: https://ceur-ws.org/Vol-3788/RecSysHR2024-paper_3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-J.</given-names>
            <surname>Decorte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lowie</surname>
          </string-name>
          ,
          <article-title>Is it required? ranking the skills required for a job-title</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2212.08553. arXiv:2212.08553.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Halder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Retrieving skills from job descriptions: A language model based extreme multi-label classification framework</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 28th International Conference on Computational Linguistics</source>
          , International Committee on Computational Linguistics, Barcelona, Spain (Online),
          <year>2020</year>
          , pp.
          <fpage>5832</fpage>
          -
          <lpage>5842</lpage>
          . URL: https://aclanthology.org/2020.coling-main.513/. doi:10.18653/v1/2020.coling-main.513.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Laosaengpha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tativannarat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chuangsuwanich</surname>
          </string-name>
          ,
          <article-title>Mitigating language bias in cross-lingual job retrieval: A recruitment platform perspective</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.03220. arXiv:2502.03220.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Laosaengpha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tativannarat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piansaddhayanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chuangsuwanich</surname>
          </string-name>
          ,
          <article-title>Learning job title representation from job description aggregation network</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.08055. arXiv:2406.08055.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.-J.</given-names>
            <surname>Decorte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V.</given-names>
            <surname>Hautte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Develder</surname>
          </string-name>
          ,
          <article-title>SkillMatch: Evaluating self-supervised learning of skill relatedness</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.05006. arXiv:2410.05006.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hruschka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Otani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024)</source>
          , Association for Computational Linguistics, St. Julian's, Malta,
          <year>2024</year>
          . URL: https://aclanthology.org/volumes/2024.nlp4hr-1/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fabregat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>García-Sardiña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Estrella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zbib</surname>
          </string-name>
          ,
          <article-title>TalentCLEF at CLEF 2025: Skill and job title intelligence for human capital management</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>479</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <year>2013</year>
          . URL: https://arxiv.org/abs/1301.3781. arXiv:1301.3781.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>