<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Multilingual JobBERT for Cross-Lingual Job Title Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jens-Joris Decorte</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias De Lange</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeroen Van Hautte</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TechWolf</institution>
          ,
          <addr-line>Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We introduce JobBERT-V3, a contrastive learning-based model for cross-lingual job title matching. Building on the state-of-the-art monolingual JobBERT-V2, our approach extends support to English, German, Spanish, and Chinese by leveraging synthetic translations and a balanced multilingual dataset of over 21 million job titles. The model retains the efficiency-focused architecture of its predecessor while enabling robust alignment across languages without requiring task-specific supervision. Extensive evaluations on the TalentCLEF 2025 benchmark demonstrate that JobBERT-V3 outperforms strong multilingual baselines and achieves consistent performance across both monolingual and cross-lingual settings. While not the primary focus, we also show that the model can be effectively used to rank relevant skills for a given job title, demonstrating its broader applicability in multilingual labor market intelligence. The model is publicly available: https://huggingface.co/TechWolf/JobBERT-v3.</p>
      </abstract>
      <kwd-group>
        <kwd>Job Title Normalisation</kwd>
        <kwd>Multilingual Language Models</kwd>
        <kwd>Labor Market Analysis</kwd>
        <kwd>Contrastive Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Job title normalization is a critical task in labor market analysis, facilitating the standardization of
heterogeneous job titles into a unified taxonomy to improve job matching, skill inference, and labor
market analytics. Although substantial advancements have been achieved in monolingual normalization
tasks, particularly within (semi-)supervised learning frameworks [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], these approaches typically
suffer from data scarcity due to high labeling costs. To address this challenge, JobBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced
large-scale unsupervised representation learning techniques, from which subsequent studies [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]
have further validated the effectiveness of leveraging job title embeddings at scale without relying
heavily on labeled datasets. More recently, the JobBERT-V2 model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] has demonstrated significant
improvements in monolingual performance by employing contrastive learning strategies.
Nonetheless, extending monolingual normalization techniques to multilingual contexts introduces additional
complexities that require systematic exploration.
      </p>
      <p>
        In this paper, we present JobBERT-V3, an extension of the English-focused JobBERT-V2 model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
that addresses the challenge of cross-lingual job title normalisation. The model is designed to handle
job titles in English, German, Spanish, and Chinese, making it a valuable tool for international labor
market analysis and talent matching.
      </p>
      <p>
        Our approach builds upon the contrastive learning framework employed by JobBERT-V2 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
demonstrating that this methodology effectively scales to multilingual contexts. However, the scarcity of
cross-lingual data poses a significant challenge. To overcome this limitation, we use synthetic
translations generated from the extensive English dataset originally developed for JobBERT-V2 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Consequently, we establish a balanced multilingual dataset comprising 21 million job titles, enabling robust
experimentation and evaluation of our multilingual normalization capabilities.
      </p>
      <p>
        The key contributions of this work can be summarized as follows:
• We release the open-source JobBERT-V3, an extension of JobBERT-V2 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] supporting cross-lingual
job title normalisation in English, German, Spanish, and Chinese.
• We construct a large-scale training dataset comprising over 21 million job titles, balanced across
the four target languages through synthetic data generation.
• We evaluate the model in cross-lingual job title matching scenarios.
• We analyze the model’s ability to capture job title semantics across different languages.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Base Model Selection</title>
        <p>
          Given that the original JobBERT-V2 model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is focused on English only, we apply the same JobBERT-V2
training paradigm from scratch to the multilingual MPNET base model. We selected this model for
its strong multilingual understanding capabilities across our four target languages. It is a
Sentence-BERT model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] that generates 768-dimensional embeddings for sentences or paragraphs in over
50 languages. The model, based on the MPNet architecture [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and fine-tuned on a large corpus of
multilingual sentence pairs, is particularly effective for tasks such as semantic similarity, paraphrase
detection, and cross-lingual retrieval. The asymmetric linear projection layer – a core part of the
JobBERT-V2 training method – is added on top of the MPNET model and projects the 768-dimensional
embeddings to 1024-dimensional ones.
        </p>
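<p>The encoder-plus-projection setup described above can be sketched as follows. This is a minimal NumPy stand-in: the random weight matrix and the helper name are illustrative assumptions; in the actual model the 768-dimensional inputs come from multilingual MPNet and the projection weights are learned.</p>

```python
import numpy as np

# Hypothetical stand-in for the trained asymmetric projection: JobBERT-V3
# maps 768-dimensional multilingual MPNet sentence embeddings to 1024
# dimensions on the job-title side. The weights here are random, purely
# for illustration; in the real model they are learned.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(768, 1024))

def project_title_embedding(emb: np.ndarray) -> np.ndarray:
    """Apply the linear projection to a batch of job-title embeddings."""
    return emb @ W

titles = rng.normal(size=(4, 768))        # stand-in MPNet embeddings
projected = project_title_embedding(titles)
print(projected.shape)                    # (4, 1024)
```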
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Training Data</title>
        <p>
          To train JobBERT-V3, we leverage the same foundational dataset used in the original JobBERT-V2
model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], consisting of 5,579,240 English job advertisements collected from the TechWolf market
data lake. These job ads, posted between January 2020 and December 2024 in the United States, contain
tuples of job titles paired with sets of annotated ESCO skills. After applying additional preprocessing
steps — including filtering out titles shorter than three characters and ensuring a minimum of five
unique ESCO skills per record—we retain a total of 5,280,967 high-quality English tuples.
        </p>
        <p>
          To create high-quality multilingual training data, we translated each English job title into German,
Spanish, and Simplified Chinese using prompt-based machine translation. These prompts were carefully
designed to preserve professional tone and retain technical terminology commonly used in the respective
local labor markets. We avoided adding extraneous instructions or formatting to ensure clean, consistent
outputs suitable for downstream modeling. Table 1 provides an overview of the system and user prompts
used for each target language. As OpenAI’s models are shown to be performant translators [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ],
we use the gpt-4.1-nano model to perform the translations and keep all default parameters. The final
training dataset consists of 21,123,868 job titles, evenly distributed across the four languages.
        </p>
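<p>The prompt construction can be sketched as follows. Only the prompt wording is taken from Table 1; the helper name and message structure are illustrative assumptions (and for Chinese the actual prompt says "Chinese (Simplified)", so the simple template below is an approximation).</p>

```python
def build_translation_messages(job_title: str, target_language: str) -> list:
    """Assemble chat messages for prompt-based job title translation.
    Hypothetical helper; prompt wording follows Table 1 of the paper."""
    system = (
        "You are a professional translator specializing in job ad titles and "
        "professional language. Translate the following job ad title from "
        f"English to {target_language}. Preserve any technical terms that are "
        f"commonly used in English within the {target_language} job market. "
        "Do not include any other text or commentary."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": job_title},
    ]

messages = build_translation_messages("Software Developer – NYC fulltime", "German")
print(len(messages))  # 2
```

These messages would then be sent to the chat completions endpoint of the translation model (gpt-4.1-nano in the paper), one job title per request.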
        <p>This prompt-based approach enables consistent multilingual data generation at scale without
requiring costly human annotation. The resulting dataset retains key domain-specific cues across languages,
providing a robust foundation for cross-lingual model training.</p>
        <p>To support effective cross-lingual training, we adopt a shuffled batching strategy that ensures each
batch contains job titles from multiple languages. This encourages the model to learn language-agnostic
job title representations while retaining sensitivity to language-specific nuances when necessary.</p>
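<p>A minimal sketch of such a shuffled batching strategy, assuming a per-language pool of examples (the function and data below are illustrative, not the actual training code):</p>

```python
import random

def shuffled_batches(examples_by_lang: dict, batch_size: int, seed: int = 0):
    """Pool examples from all languages, shuffle globally, and yield batches
    that naturally mix languages. Minimal illustrative sketch."""
    pool = [(lang, ex) for lang, items in examples_by_lang.items() for ex in items]
    random.Random(seed).shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]

data = {
    "en": ["software developer", "nurse"],
    "de": ["softwareentwickler", "pflegekraft"],
    "es": ["desarrollador de software", "enfermera"],
    "zh": ["软件开发工程师", "护士"],
}
first = next(shuffled_batches(data, batch_size=4))
print(len(first))  # 4
```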
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Training Methodology</title>
        <p>
          We maintain the core contrastive learning approach from JobBERT-V2 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], adapting it for the multilingual
setting:
        </p>
        <sec id="sec-2-3-1">
          <title>Table 1: Translation prompts and examples per target language</title>
          <p>Footnotes: (1) https://huggingface.co/TechWolf/JobBERT-v3 (2) https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2 (3) https://platform.openai.com/docs/models/gpt-4.1-nano</p>
          <p>System (German): You are a professional translator specializing in job ad titles and
professional language. Translate the following job ad title from English to
German. Preserve any technical terms that are commonly used in English
within the German job market. Do not include any other text or commentary.
Input: Software Developer – NYC fulltime (JobID ja164956189)
Output: Softwareentwickler – New York, Vollzeit (JobID ja164956189)</p>
          <p>System (Spanish): You are a professional translator specializing in job ad titles and
professional language. Translate the following job ad title from English to
Spanish. Preserve any technical terms that are commonly used in English
within the Spanish job market. Do not include any other text or commentary.
Input: Software Developer – NYC fulltime (JobID ja164956189)
Output: Desarrollador de Software – Nueva York, tiempo completo (JobID ja164956189)</p>
          <p>System (Chinese): You are a professional translator specializing in job ad titles and
professional language. Translate the following job ad title from English to
Chinese (Simplified). Preserve any technical terms that are commonly used
in English within the Chinese job market. Do not include any other text or
commentary.
Input: Software Developer – NYC fulltime (JobID ja164956189)
Output: [Simplified Chinese translation] (JobID ja164956189)</p>
          <p>• Contrasting Job Title and Skills: Job titles and their corresponding skill sets are processed
through the same encoder, with a linear projection applied to job title embeddings to account for
semantic differences.
• Cross-Lingual Alignment: The model learns to align job title representations across languages
through shared skill annotations, effectively creating a language-agnostic semantic space.
• InfoNCE Loss: We use the InfoNCE loss function to bring semantically similar job titles closer
in the embedding space, regardless of their source language.</p>
          <p>The training process was carefully designed to preserve the model’s strong performance on
monolingual tasks while introducing robust cross-lingual capabilities. Achieving balanced performance across
all four languages required precise weighting of the loss objective. To support this, we constructed a
dataset evenly distributed across the four languages. Combined with a large batch size of 2048 and
random batch sampling, this approach proved highly effective.</p>
        </sec>
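<p>The InfoNCE objective can be illustrated with a minimal NumPy sketch. This is an assumed batch-wise formulation with in-batch negatives and the matching skill-set embedding as the positive; the temperature value and function are illustrative, not the published training code.</p>

```python
import numpy as np

def info_nce_loss(title_emb: np.ndarray, skill_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """InfoNCE over a batch: row i of title_emb is the positive for row i of
    skill_emb; the other rows in the batch act as negatives.
    The temperature of 0.05 is an illustrative assumption."""
    t = title_emb / np.linalg.norm(title_emb, axis=1, keepdims=True)
    s = skill_emb / np.linalg.norm(skill_emb, axis=1, keepdims=True)
    logits = (t @ s.T) / temperature             # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
titles = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(titles, titles)   # identical positives: loss near 0
loss_random = info_nce_loss(titles, rng.normal(size=(8, 16)))
print(loss_aligned, loss_random)
```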
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>
        Our methods are evaluated as part of the shared task introduced in TalentCLEF [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. TalentCLEF
advances research in Human Capital Management (HCM) by establishing benchmarks for multilingual,
fair, and cross-industry adaptable NLP systems in HR. The organisation provides two tasks: Multilingual
Job Title Matching (Task A) and Job Title-Based Skill Prediction (Task B). While our focus is on Task
A, we also report results on Task B for completeness. Note that while TalentCLEF provided training,
validation, and test sets for the tasks, JobBERT-V3 is trained on TechWolf’s proprietary dataset instead
of the benchmark training data. Additionally, while the test set results are made available, only the
MAP scores are shared. Therefore, we provide comprehensive validation set results to enable baseline
comparison.
      </p>
      <sec id="sec-3-1">
        <title>3.1. TalentCLEF Task A: Multilingual Job Title Matching</title>
        <p>Task A requires systems to identify and rank similar job titles across multiple languages. Task A is
evaluated in two settings:
• Monolingual Job Title Matching: Measuring the model’s ability to identify related job titles
within each supported language. This setup is provided in both the validation and test sets.
• Cross-lingual Job Title Matching: Evaluating the model’s capability to match similar job titles
across different languages. This setup is only provided in the blind test set.</p>
        <p>Following the evaluation strategy set forth by TalentCLEF, we use the following metrics:
• Mean Average Precision (MAP) – the official metric used to rank systems.
• Mean Reciprocal Rank (MRR) – provides insight into how early the first relevant job title
appears in the ranked list.
• Normalized Discounted Cumulative Gain (nDCG) – evaluates the overall quality of the
ranked list by considering the position of relevant job titles, giving higher scores to relevant items
appearing earlier and discounting those that appear lower in the ranking.</p>
        <p>• Precision@5 – measures the proportion of correct job titles among the top 5 retrieved results.</p>
        <p>These metrics are computed both for monolingual and cross-lingual scenarios to provide a
comprehensive view of the model’s performance. However, the validation data does not provide annotations
for the cross-lingual setting, hence we only report the final test set scores.</p>
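<p>For concreteness, the per-query ranking metrics above can be sketched as follows (nDCG is omitted for brevity; the functions and the example query are illustrative, not the official evaluation script):</p>

```python
def average_precision(relevant: set, ranked: list) -> float:
    """AP for one query: mean of precision@k over ranks k holding a relevant item."""
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def reciprocal_rank(relevant: set, ranked: list) -> float:
    """1/rank of the first relevant item, 0 if none is retrieved."""
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / k
    return 0.0

def precision_at_k(relevant: set, ranked: list, k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Illustrative query with relevant hits at ranks 1, 3 and 5
ranked = ["media planner", "buyers agent", "broadcast buyer", "chef", "media manager"]
relevant = {"media planner", "broadcast buyer", "media manager"}
print(round(average_precision(relevant, ranked), 4))  # (1/1 + 2/3 + 3/5) / 3 = 0.7556
print(reciprocal_rank(relevant, ranked))              # 1.0
print(precision_at_k(relevant, ranked, k=5))          # 0.6
```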
      </sec>
      <sec id="sec-3-2">
        <title>3.2. TalentCLEF Task B: Job Title-Based Skill Prediction</title>
        <p>Task B focuses on developing systems that can accurately predict professional skills associated with a
given job title. The task makes use of ESCO skills, and provides evaluation and test datasets of job titles
linked with relevant skills. Task B is evaluated in a single setting:
• Job Title-to-Skill Prediction: Assessing the model’s ability to retrieve and rank the most
relevant skills for a given job title, normalized against a predefined skills gazetteer of ESCO skills.</p>
        <p>Following the evaluation strategy set forth by TalentCLEF, we use the following metrics on the
validation set:
• Mean Average Precision (MAP) – the official metric used to rank systems.
• Mean Reciprocal Rank (MRR) – provides insight into how early the first relevant skill appears
in the ranked list.
• Normalized Discounted Cumulative Gain (nDCG) – evaluates the overall quality of the ranked
list by considering the position of relevant skills, giving higher scores to relevant items appearing
earlier and discounting those that appear lower in the ranking.
• Precision@K (K=5,10) – measures the proportion of correct skills among the top-K retrieved
results.</p>
        <p>The blind test set only reports the MAP score.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Baselines</title>
        <p>
          To isolate the added value of our training setup, we use as a first baseline the 278M-parameter MPNET-base model4 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the pretrained multilingual model from which we
start training. Secondly, we also evaluate the 560M-parameter E5-Instruct model5 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which is
roughly twice as large as our JobBERT-V3 model. The E5-Instruct model requires a task description to be passed
along with the queries. Based on the official instruction documentation, we set the instruction to “Given
a job title, retrieve similar job titles”, adapted to the task at hand.
4https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
5https://huggingface.co/intfloat/multilingual-e5-large-instruct
        </p>
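<p>The E5-Instruct query format can be sketched as follows. The "Instruct: ... / Query: ..." template reflects the format described on the model card; the helper name is an illustrative assumption, so verify against the official documentation before use.</p>

```python
def format_e5_query(task: str, query: str) -> str:
    """Prepend the task description in E5-Instruct's expected query format
    (template as documented on the model card; hypothetical helper)."""
    return f"Instruct: {task}\nQuery: {query}"

prompt = format_e5_query("Given a job title, retrieve similar job titles",
                         "media buyer")
print(prompt.splitlines()[0])  # Instruct: Given a job title, retrieve similar job titles
```

Candidate job titles on the corpus side are encoded without the instruction prefix; only queries carry it.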
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Monolingual Job Title Matching</title>
        <p>Table 2 shows the performance of JobBERT-V3 on monolingual job title normalisation tasks. The results
demonstrate that JobBERT-V3 maintains strong performance across all languages, outperforming its
base model on all metrics. Moreover, it shows competitive performance compared to the E5-Instruct
model that has nearly twice the model size. We refer to Appendix A for a qualitative analysis on an
observed trade-off between precision (MRR) and overall relevance (MAP, nDCG).</p>
        <p>As an additional ablation, Table 3 shows the performance of the multilingual training objective
compared to the English-only JobBERT-V2 model, showing a marginal decrease of 1.6% MAP in English
to support all four languages.</p>
        <p>
          Table 2 (Task A validation set): results for MPNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], E5-Instruct [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and JobBERT-V3. Each block of four rows lists MAP, MRR, nDCG and Precision@5 for one language; the per-block language labels could not be recovered from the extracted layout.
MPNet | E5-Instruct | JobBERT-V3
0.5382 | 0.5815 | 0.6302
0.8006 | 0.8413 | 0.8056
0.7970 | 0.8206 | 0.8417
0.6990 | 0.7181 | 0.7429
0.2982 | 0.3918 | 0.4562
0.4985 | 0.5710 | 0.5058
0.6384 | 0.7124 | 0.7349
0.4798 | 0.5852 | 0.5685
0.4170 | 0.4459 | 0.5090
0.5514 | 0.6105 | 0.5441
0.7195 | 0.7463 | 0.7700
0.6400 | 0.6465 | 0.6649
0.4535 | 0.5434 | 0.5845
0.7827 | 0.8312 | 0.8035
0.7447 | 0.7973 | 0.8156
0.6000 | 0.6796 | 0.7184
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Cross-Lingual Job Title Matching</title>
        <p>To evaluate the model’s effectiveness in a cross-lingual setting, we report the official TalentCLEF test
set results for Task A. These include both monolingual and cross-lingual job title matching scenarios.</p>
        <p>Table 4 summarizes the model’s performance in terms of Mean Average Precision (MAP) for each
language pair. We observe that JobBERT-V3 performs consistently across both monolingual and
cross-lingual settings, with limited degradation in cross-lingual transfer scenarios. The English-English and
Spanish-Spanish pairs yield the highest monolingual performance, while English-Chinese (en-zh) shows
the strongest cross-lingual alignment.</p>
        <p>These results confirm the model’s ability to generalize across languages, highlighting its applicability
for international labor market use cases where job title normalization must operate in a multilingual setting.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Job Title-Based Skill Prediction</title>
        <p>
          While our primary focus is Task A, we also evaluated JobBERT-V3 on TalentCLEF’s Task B to predict
relevant professional skills for a given job title. It is important to note that the JobBERT-V2 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] method
does not explicitly train for this task. Instead, it is optimized to learn high-quality job title representations,
with no direct supervision for individual skill embeddings. As a result, individual skill embeddings are
inherently out-of-distribution for the model.
        </p>
        <p>Nonetheless, JobBERT-V2’s shared encoder architecture allows job titles and ESCO skills to be
represented in the same embedding space. Specifically, for this task, we use the representations
from the penultimate layer and omit the asymmetric projection layer used during training. These
768-dimensional representations of jobs and skills are compared against each other by computing
cosine similarity. Given a job title query, we generate a complete ranking of all unique ESCO aliases.
Afterwards, this ranking is filtered into a ranking over all ESCO skills by keeping only the highest-ranking
alias for each ESCO skill. This approach proves surprisingly effective. A detailed qualitative analysis of
the skill prediction results can be found in Appendix B.</p>
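<p>The alias-to-skill ranking procedure can be sketched as follows (a minimal NumPy illustration; the function name, embeddings and alias mapping are hypothetical stand-ins for the actual encoder outputs and the ESCO alias table):</p>

```python
import numpy as np

def rank_skills(title_emb, alias_embs, alias_to_skill):
    """Rank ESCO skills for one job title by cosine similarity, keeping only
    the best-scoring alias per skill. Illustrative sketch of the procedure
    described in the text."""
    t = title_emb / np.linalg.norm(title_emb)
    a = alias_embs / np.linalg.norm(alias_embs, axis=1, keepdims=True)
    scores = a @ t                       # cosine similarity per alias
    best = {}
    for alias_idx in np.argsort(-scores):
        skill = alias_to_skill[int(alias_idx)]
        if skill not in best:            # keep the highest-ranking alias only
            best[skill] = float(scores[alias_idx])
    return sorted(best, key=best.get, reverse=True)

rng = np.random.default_rng(1)
title = rng.normal(size=16)
aliases = np.vstack([title + 0.05 * rng.normal(size=16),  # near-duplicate alias
                     rng.normal(size=16),
                     rng.normal(size=16)])
mapping = {0: "serve beverages", 1: "serve beverages", 2: "operate a forklift"}
print(rank_skills(title, aliases, mapping)[0])  # serve beverages
```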
        <p>
          Table 5 (Task B validation set): results for MPNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], JobBERT-V2 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and JobBERT-V3.
Metric | MPNet | JobBERT-V2 | JobBERT-V3
MAP | 0.1852 | 0.2531 | 0.2449
MRR | 0.7061 | 0.7652 | 0.7828
nDCG | 0.6656 | 0.7166 | 0.7115
Precision@5 | 0.4493 | 0.5296 | 0.5467
Precision@10 | 0.3809 | 0.4813 | 0.4865
        </p>
        <p>Despite not being trained specifically for this task, Table 5 shows that both JobBERT variants
outperform the underlying base model by a large margin. Interestingly, JobBERT-V3 performs on par
with, or slightly better than, the English-only version on the MRR, Precision@5, and Precision@10 metrics,
highlighting the generalizability and robustness of our multilingual setup. This demonstrates that even
without explicit supervision, the contrastive learning objective enables the model to effectively link job
titles and relevant skills. The official result on the TalentCLEF Task B test set is a MAP score of 0.255,
which is in line with the validation performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>
        We have presented JobBERT-V3, a multilingual extension of the state-of-the-art English JobBERT-V2
model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The results demonstrate that the model effectively maintains strong performance in
monolingual scenarios while adding robust cross-lingual capabilities. Additionally, the model is
of practical use when ranking relevant skills for job titles. We acknowledge that the primary
limitation of our approach lies in its reliance on automated translations generated by a GPT model,
without human review. This introduces a potential risk of cultural misalignment or semantic
inaccuracies in job title translations. Assessing and mitigating such risks remains an open area for future research.
      </p>
      <sec id="sec-5-1">
        <title>Future work will focus on:</title>
        <p>• Expanding language coverage to include more languages;
• Improving performance on low-resource languages;
• Human review of job title translation quality;
• Investigating methods to reduce the performance gap in cross-lingual scenarios; and
• Exploring applications in multilingual skill extraction and job market analysis.</p>
        <p>The model’s strong performance across languages makes it a valuable tool for international labor
market analysis and cross-border talent matching applications.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by TechWolf. We thank our colleagues for their valuable feedback and the
TalentCLEF organizers for providing the evaluation framework. Special thanks to the open-source
community for their contributions to the tools and libraries used in this research.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools in the development of the model or the
analysis of results. The authors used GPT-4o for formatting assistance and for grammar and spelling checks.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Qualitative Analysis of Job Title Matching</title>
      <p>Our analysis compares JobBERT-V3 versus the larger E5-Instruct model to understand their performance
differences. The quantitative metrics on the Task A validation set in Table 2 reveal two distinct patterns:
• Precision at Top Results: E5-Instruct excels at identifying near-duplicate job titles with high
precision in the top retrieved results, as evidenced by its superior MRR scores.
• Overall Relevance: JobBERT demonstrates better general performance through higher MAP
and nDCG scores, indicating more consistently relevant results throughout the ranked list.
To illustrate these patterns, consider the following example query:</p>
      <sec id="sec-8-1">
        <title>Query: “media buyer”</title>
        <p>JobBERT-V3 Results:
1. media planner
2. digital media planner
3. media manager
4. media planning supervisor
5. broadcast buyer
E5-Instruct Results:
1. broadcast buyer
2. media associate
3. buyers agent (irrelevant)
4. media production specialist (irrelevant)
5. media manager</p>
        <p>This example demonstrates the key trade-off between the models: E5-Instruct prioritizes exact
matches (broadcast buyer at rank 1) but includes irrelevant results, while JobBERT maintains consistent
relevance (all relevant) but may rank the closest match lower.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Qualitative Analysis of Skill Prediction</title>
      <p>To better understand the limitations of the skill prediction benchmark, we manually reviewed the top-25
skills retrieved by the model for the job title “bar person / waitress”. The table below compares whether
each predicted ESCO skill was marked as correct in the official benchmark and whether we consider it
correct upon manual inspection:</p>
      <p>We observe that only 11 out of the 25 top predicted skills were marked as correct by the official
benchmark. However, upon manual inspection, we consider at least 16 of them to be valid and
contextually relevant to the bar person / waitress role. This reveals that several practical and commonly
expected workplace activities (e.g., handling glassware, cleaning surfaces, welcoming guests) are missing
from the benchmark labels despite being well-aligned with real-world job expectations.</p>
      <p>Top-25 predicted ESCO skills for “bar person / waitress” (✓ = marked correct by the official benchmark; the manual-judgment column could not be recovered from the extracted layout):
1. Mix and serve alcoholic and non-alcoholic beverages ✓
2. Serve beverages (alcoholic and non-alcoholic) ✓
3. Serve beer (bottle/draught) ✓
4. Stock and restock bar supplies ✓
5. Handover and close bar/service area ✓
6. Knowledge of alcoholic beverages ✓
7. Prepare and serve hot drinks (tea, coffee) ✓
8. Brewhouse operations knowledge
9. Take and process beverage orders ✓
10. Handle and polish glassware
11. Prepare fruit for cocktails
12. Work in a hospitality team
13. Match coffee grind to type
14. Sit for long periods
15. Assist with check-out procedures
16. Clean surfaces and tables
17. Show polite behaviour
18. Serve food and drinks to customers
19. Prepare speciality coffee
20. Communicate in English (spoken/written) ✓
21. Prepare vegetables for dishes
22. Apply food safety principles
23. Welcome guests at restaurant
24. Recommend food and wine pairings ✓
25. Apply hygienic work practices
In total, 11 of the 25 predictions are benchmark-correct, a precision of 0.44.</p>
      <p>Missed Gold Labels. In addition to examining the predicted top-25 skills, we also reviewed the
gold-standard skills that were expected to be predicted for “bar person / waitress” but were not retrieved
by the model. This set of missed gold labels includes a wide variety of skills, ranging from highly
relevant to arguably overly generic or even role-inappropriate.</p>
      <p>On the one hand, we acknowledge several high-value false negatives that would be desirable for the
model to retrieve. These include:
• Soft skills and customer care: such as “demonstrate concern for others”, “exceed customer
expectations”, “demonstrate professional attitude”, and “deal with public”. These are important
attributes in hospitality work and should ideally be present in the top predictions.
• Core restaurant tasks: such as “organise customer seating plan”, “prepare snacks and
sandwiches”, “perform cleaning activities”, “serve food in table service”, and “manage service in a
restaurant”—all of which are aligned with real-world expectations for waitstaff roles.
• Communication and responsiveness: e.g., “respond to customers”, “communicating”, “greet
guests”, and “customer servicing”. These reflect interpersonal and service-oriented responsibilities
often observed in bar and waitress positions.</p>
      <p>On the other hand, a non-trivial portion of the missed gold labels appears to be questionable:
• Generic or overly broad skills: such as “support people”, “carry objects”, “communicating”,
“present new employees”, and “support cultural diversity”. While applicable in many workplace
settings, these are not specific to bar staff or waitresses and may dilute the discriminative power
of skill-based models if overemphasized.
• Irrelevant or dubious entries: for example, “operate a forklift” and “oversee catalogue collection”
seem entirely unrelated to the role and likely reflect noise in the validation data.</p>
      <p>While our qualitative analysis is based on a single sample, it offers preliminary indications that the
benchmark’s definition of relevance may at times be overly broad. Specifically, it appears to include a
number of skills that are either too generic or misaligned with the specific job title under consideration.
Although the limited sample size precludes drawing any definitive conclusions, these observations
suggest that a more curated and role-sensitive gold standard, perhaps one that differentiates between
"core," "contextual," and "generic" skills, could improve the practical evaluation of job-to-skill models.
Such a framework may also help avoid unfairly penalizing models that correctly prioritize
domain-relevant over generic or out-of-scope skills.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>R.</given-names> <surname>Bekkerman</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Gavish</surname></string-name>
          ,
          <article-title>High-precision phrase-based document classification on a modern scale</article-title>
          ,
          <source>in: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '11, Association for Computing Machinery, New York, NY, USA,
          <year>2011</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>239</lpage>
          . URL: https://doi.org/10.1145/2020408.2020449. doi:10.1145/2020408.2020449.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>F.</given-names> <surname>Javed</surname></string-name>
          ,
          <string-name><given-names>Q.</given-names> <surname>Luo</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>McNair</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Jacob</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Zhao</surname></string-name>
          ,
          <string-name><given-names>T. S.</given-names> <surname>Kang</surname></string-name>
          ,
          <article-title>Carotene: A job title classification system for the online recruitment domain</article-title>
          ,
          <source>in: 2015 IEEE First International Conference on Big Data Computing Service and Applications</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>286</fpage>
          -
          <lpage>293</lpage>
          . doi:10.1109/BigDataService.2015.61.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Abdelfatah</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Korayem</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Balaji</surname></string-name>,
          <article-title>DeepCarotene - job title classification with multi-stream convolutional neural network</article-title>,
          in: <source>2019 IEEE International Conference on Big Data (Big Data)</source>,
          <year>2019</year>, pp. <fpage>1953</fpage>-<lpage>1961</lpage>.
          doi:10.1109/BigData47090.2019.9005673.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>J.-J.</given-names> <surname>Decorte</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Van Hautte</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Demeester</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Develder</surname></string-name>,
          <article-title>JobBERT: understanding job titles through skills</article-title>,
          in: <source>FEAST, ECML-PKDD 2021 Workshop, Proceedings</source>,
          <year>2021</year>, p. <fpage>9</fpage>.
          URL: https://feast-ecmlpkdd.github.io/papers/FEAST2021_paper_6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>R.</given-names> <surname>Zbib</surname></string-name>,
          <string-name><given-names>L. A.</given-names> <surname>Lacasa</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Retyk</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Poves</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Aizpuru</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Fabregat</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Šimkus</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>García-Casademont</surname></string-name>,
          <article-title>Learning Job Titles Similarity from Noisy Skill Labels</article-title>,
          in: <source>FEAST, ECML-PKDD 2022 Workshop, Proceedings</source>,
          <year>2022</year>.
          URL: https://feast-ecmlpkdd.github.io/archive/2022/papers/FEAST2022_paper_4972.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>M. Y.</given-names> <surname>Bocharova</surname></string-name>,
          <string-name><given-names>E. V.</given-names> <surname>Malakhov</surname></string-name>,
          <string-name><given-names>V. I.</given-names> <surname>Mezhuyev</surname></string-name>,
          <article-title>VacancySBERT: the approach for representation of titles and skills for semantic similarity search in the recruitment domain</article-title>,
          <source>Applied Aspects of Information Technology</source> <volume>6</volume> (<year>2023</year>) <fpage>52</fpage>-<lpage>59</lpage>.
          URL: http://dx.doi.org/10.15276/aait.06.2023.4. doi:10.15276/aait.06.2023.4.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>N.</given-names> <surname>Laosaengpha</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Tativannarat</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Piansaddhayanon</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rutherford</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Chuangsuwanich</surname></string-name>,
          <article-title>Learning job title representation from job description aggregation network</article-title>,
          in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2024</source>,
          Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>, pp. <fpage>1319</fpage>-<lpage>1329</lpage>.
          URL: https://aclanthology.org/2024.findings-acl.77. doi:10.18653/v1/2024.findings-acl.77.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>J.-J.</given-names> <surname>Decorte</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Van Hautte</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Develder</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Demeester</surname></string-name>,
          <article-title>Efficient text encoders for labor market analysis</article-title>,
          <year>2025</year>.
          URL: https://arxiv.org/abs/2505.24640. arXiv:2505.24640.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>N.</given-names> <surname>Reimers</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>,
          in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>,
          Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>, pp. <fpage>3982</fpage>-<lpage>3992</lpage>.
          URL: https://aclanthology.org/D19-1410/. doi:10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>K.</given-names> <surname>Song</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Tan</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Qin</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>T.-Y.</given-names> <surname>Liu</surname></string-name>,
          <article-title>MPNet: masked and permuted pre-training for language understanding</article-title>,
          <source>NIPS '20</source>, Curran Associates Inc., Red Hook, NY, USA,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>S.</given-names> <surname>Manakhimova</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Avramidis</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Macketanz</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Lapshinova-Koltunski</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bagdasarov</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Möller</surname></string-name>,
          <article-title>Linguistically motivated evaluation of the 2023 state-of-the-art machine translation: Can ChatGPT outperform NMT?</article-title>,
          in: P. Koehn, B. Haddow, T. Kocmi, C. Monz (Eds.),
          <source>Proceedings of the Eighth Conference on Machine Translation</source>,
          Association for Computational Linguistics, Singapore,
          <year>2023</year>, pp. <fpage>224</fpage>-<lpage>245</lpage>.
          URL: https://aclanthology.org/2023.wmt-1.23/. doi:10.18653/v1/2023.wmt-1.23.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>J.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <article-title>GPT-4 vs. human translators: A comprehensive evaluation of translation quality across languages, domains, and expertise levels</article-title>,
          <source>arXiv preprint arXiv:2407.03658</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>L.</given-names> <surname>Gasco</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Fabregat</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>García-Sardiña</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Estrella</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Deniz</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rodrigo</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Zbib</surname></string-name>,
          <article-title>Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management</article-title>,
          in: <source>International Conference of the Cross-Language Evaluation Forum for European Languages</source>, Springer,
          <year>2025</year>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Majumder</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Wei</surname></string-name>,
          <article-title>Multilingual E5 text embeddings: A technical report</article-title>,
          <source>arXiv preprint arXiv:2402.05672</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>