<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>pjmathematician at TalentCLEF 2025: Enhancing Job Title and Skill Matching with GISTEmbed and LLM-Augmented Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Poojan Vachharajani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Netaji Subhas University of Technology</institution>
          ,
          <addr-line>New Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper details the pjmathematician team's participation in the TalentCLEF 2025 shared task, focusing on Task A (Multilingual Job Title Matching) and Task B (Job Title-Based Skill Prediction). Our approach primarily leveraged state-of-the-art sentence embedding models fine-tuned using the GISTEmbed technique. For Task A, various multilingual and English-specific encoder models were adapted, including a distilled version and a LoRA-fine-tuned 7B parameter model. Data augmentation for Chinese was performed using Qwen2.5 32B Instruct. For Task B, we employed data augmentation using Qwen2.5 32B Instruct to generate descriptive texts for jobs and skills, significantly enriching the training data. Models like BGE-Large and GTE-Qwen2-7B (LoRA) were fine-tuned on this augmented data. Our submissions demonstrate the effectiveness of these strategies, achieving competitive results in both tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>TalentCLEF</kwd>
        <kwd>Job Title Matching</kwd>
        <kwd>Skill Prediction</kwd>
        <kwd>Sentence Embeddings</kwd>
        <kwd>GISTEmbed</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>LoRA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Task A: Multilingual Job Title Matching</title>
        <p>Our approach for Task A centered on fine-tuning various sentence embedding models to capture
semantic similarities between job titles across multiple languages.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Models and Fine-tuning</title>
          <p>
            We experimented with several base encoder models:
• BAAI/bge-small-en-v1.5 (33.4M parameters) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]
• BAAI/bge-m3 (569M parameters) [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]
• Alibaba-NLP/gte-multilingual-base (305M parameters) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]
• Alibaba-NLP/gte-Qwen2-7B-instruct (7B parameters) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]
• A distilled version of Alibaba-NLP/gte-multilingual-base (approx. 60M parameters).
          </p>
          <p>
            All models, except the distilled one, were fine-tuned using the GISTEmbed loss [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], with ‘all-MiniLM-L12-v2‘ [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] as the guide model. For the GTE-Qwen2-7B-instruct model, CachedGISTEmbed was used along with Low-Rank Adaptation (LoRA) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] to manage computational resources. The distilled ‘gte-multilingual-base‘ model (retaining 3 layers) was fine-tuned using Mean Squared Error (MSE) loss to mimic the embeddings of its GISTEmbed-fine-tuned, full-layer counterpart, which served as the teacher model.
          </p>
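          <p>A minimal sketch of this GISTEmbed fine-tuning setup with the ‘sentence-transformers‘ library is shown below; the toy training pairs and settings are placeholders rather than our exact configuration.</p>
          <preformat>
# Illustrative sketch of GISTEmbed fine-tuning (data and settings are placeholders).
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import GISTEmbedLoss

# Base encoder to be fine-tuned and the smaller guide model.
model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)
guide = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# Pairs of related job titles; other in-batch items act as negatives,
# and the guide model filters out likely false negatives.
train_dataset = Dataset.from_dict({
    "anchor": ["software developer", "registered nurse"],
    "positive": ["software engineer", "staff nurse"],
})

loss = GISTEmbedLoss(model=model, guide=guide)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
          </preformat>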
        </sec>
        <sec id="sec-2-1-1a">
          <title>2.1.2. Data</title>
          <p>The provided training set of related job title pairs for English, Spanish, and German was used. For Chinese, where no training data was provided, we augmented the ESCO English job title dataset by translating it to Chinese using the Qwen2.5 32B Instruct model. All language pairs were used to train a single multilingual model for each respective base encoder.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.3. Implementation</title>
          <p>
            All fine-tuning and inference were performed using the ‘sentence-transformers‘ library [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. Bias handling involved shuffling the training data and using random validation splits.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task B: Job Title-Based Skill Prediction</title>
        <p>For Task B, our strategy focused on robust data augmentation using a Large Language Model (LLM)
and fine-tuning powerful English-specific sentence encoders.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Data Augmentation</title>
          <p>We utilized the Qwen2.5 32B Instruct (AWQ quantized) model to generate descriptive text for both job
titles and skills. The process involved prompting the LLM to create concise descriptions:
• For jobs: "Given a job role (and its synonyms), briefly (1-2 sentences) describe the skills needed
for that job". This generated a ‘skill_brief‘ for each job title.
• For skills: "Given a skill (and its synonyms), briefly (1-2 sentences) describe the job roles that
require that skill". This generated a ‘job_brief‘ for each skill.</p>
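          <p>A minimal sketch of this augmentation step is shown below, assuming a Hugging Face ‘transformers‘ text-generation pipeline; the prompt wrapper and generation settings are illustrative, not the exact production setup.</p>
          <preformat>
# Illustrative sketch: generating a skill_brief for a job title with an
# instruction-tuned LLM (prompt wording and settings are assumptions).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    device_map="auto",
)

def skill_brief(job_title, synonyms):
    messages = [{
        "role": "user",
        "content": (
            "Given a job role (and its synonyms), briefly (1-2 sentences) "
            "describe the skills needed for that job.\n"
            f"Job role: {job_title}\nSynonyms: {', '.join(synonyms)}"
        ),
    }]
    out = generator(messages, max_new_tokens=96, do_sample=False)
    return out[0]["generated_text"][-1]["content"]

print(skill_brief("data engineer", ["ETL developer", "data pipeline engineer"]))
          </preformat>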
          <p>This process was applied to the training, validation, and test sets provided by the organizers. Two main
augmented training datasets were created from the original job-skill mapping files:
1. A dataset of pairs (‘skill_brief‘, ‘job_brief‘).
2. A dataset of mixed-format pairs (‘skill_brief‘ + newline + job synonyms list, ‘job_brief‘ + newline
+ skill synonyms list).</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Models and Fine-tuning</title>
          <p>
            We employed two main base models:
• BAAI/bge-large-en-v1.5 (335M parameters) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]:
– One version was fine-tuned using CachedGISTEmbedLoss on the dataset composed of
augmented skill and job descriptions. Inference used these generated descriptions.
– Another version was fine-tuned using CachedGISTEmbedLoss on the dataset of mixed
augmented descriptions and synonym lists. Inference used inputs combining the generated
description with the original job title or skill aliases.
• Alibaba-NLP/gte-Qwen2-7B-instruct (7B parameters) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]: This model was fine-tuned using
LoRA and CachedGISTEmbedLoss on the dataset of mixed augmented descriptions and synonym
lists. Inference input combined the generated description with the original job title or skill aliases.
For all GISTEmbed fine-tuning, ‘all-MiniLM-L12-v2‘ [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] served as the guide model.
          </p>
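          <p>The sketch below illustrates this LoRA-plus-CachedGISTEmbed setup in the ‘sentence-transformers‘ library; the adapter hyperparameters and toy pairs are illustrative defaults, not the submitted configuration.</p>
          <preformat>
# Illustrative sketch: LoRA adapters on the 7B encoder trained with
# CachedGISTEmbedLoss (hyperparameters and data are placeholders).
from datasets import Dataset
from peft import LoraConfig, TaskType
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedGISTEmbedLoss

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)
guide = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# Attach LoRA adapters so only a small set of weights is trained.
model.add_adapter(LoraConfig(task_type=TaskType.FEATURE_EXTRACTION, r=8, lora_alpha=16))

# Mixed-format pairs: generated brief plus newline-separated synonym list.
train_dataset = Dataset.from_dict({
    "anchor": ["Designs and maintains data pipelines.\ndata engineer; ETL developer"],
    "positive": ["Used by engineers building data platforms.\ndata modelling; SQL"],
})

# The cached variant allows a large effective batch with small GPU mini-batches.
loss = CachedGISTEmbedLoss(model=model, guide=guide, mini_batch_size=4)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
          </preformat>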
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. Implementation and Inference</title>
          <p>
            The ‘sentence-transformers‘ library [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] was used for training and inference. Submissions were generated
by encoding the augmented query and corpus texts and then computing cosine similarity scores to
rank corpus elements. No external data beyond the LLM-generated augmentations was used for Task B.
Shuffling and random validation splits were standard practice.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <p>The models were evaluated based on Mean Average Precision (MAP) as the official metric.</p>
      <sec id="sec-3-1">
        <title>3.1. Task A: Multilingual Job Title Matching</title>
        <p>We submitted five systems for Task A, varying the base model and fine-tuning techniques. The
GTE-Qwen2-7B model fine-tuned with LoRA and CachedGIST achieved the best overall performance across
English, Spanish, and German with an Avg. MAP of 0.52. The results are summarized in Tables 1, 2, and
3.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Training Procedure</title>
          <p>All models were fine-tuned using the sentence-transformers library with a batch size of 64 and
the AdamW optimizer. We used the CachedGISTEmbedLoss to align model outputs with embeddings
generated from the all-MiniLM-L12-v2 guide model. The training ran for 1–3 epochs with early
stopping based on MAP scores on a 10% validation split.</p>
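          <p>A sketch of this configuration using the ‘sentence-transformers‘ trainer arguments is shown below; the values are indicative of the setup described above rather than the exact runs.</p>
          <preformat>
# Illustrative training configuration (values indicative, not the exact runs).
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="models/task-a-gist",   # hypothetical output path
    per_device_train_batch_size=64,    # batch size of 64
    num_train_epochs=3,                # upper bound; early stopping on validation MAP
    optim="adamw_torch",               # AdamW optimizer
    warmup_ratio=0.1,                  # indicative warmup, not the exact value
)
          </preformat>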
          <p>For the GTE-Qwen2-7B model, we applied LoRA fine-tuning with a rank of 8 and trained only adapter
layers to reduce GPU memory requirements. For the distilled version of gte-multilingual-base,
we used a teacher–student setup, training the student with MSE loss to mimic embeddings from the
GISTEmbed-fine-tuned full model.</p>
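          <p>The distillation step can be sketched as follows, with MSE between student and teacher embeddings; the model paths are hypothetical placeholders and the 3-layer truncation is assumed to have been applied when exporting the student.</p>
          <preformat>
# Illustrative teacher-student distillation sketch: a truncated student mimics the
# teacher's embeddings via MSE loss (model paths are hypothetical placeholders).
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MSELoss

teacher = SentenceTransformer("path/to/gist-finetuned-gte-multilingual-base")
student = SentenceTransformer("path/to/gte-multilingual-base-3-layers")

titles = ["software developer", "staff nurse", "Softwareentwickler"]
train_dataset = Dataset.from_dict({
    "sentence": titles,
    "label": [emb.tolist() for emb in teacher.encode(titles)],  # teacher embeddings as targets
})

loss = MSELoss(model=student)  # MSE between student and teacher embeddings
trainer = SentenceTransformerTrainer(model=student, train_dataset=train_dataset, loss=loss)
trainer.train()
          </preformat>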
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Inference Strategy</title>
          <p>At inference time, both query and candidate job titles were encoded using the fine-tuned model, and
cosine similarity was computed to rank candidate titles. We used the same multilingual model per base
encoder across all supported languages, without applying any task-specific heuristics. For Chinese job
titles, LLM-generated translations of ESCO data ensured consistent structure and terminology with the
English training examples.</p>
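          <p>A minimal version of this ranking step is sketched below; the model path and titles are placeholders.</p>
          <preformat>
# Illustrative inference sketch: encode the query and candidate job titles,
# then rank candidates by cosine similarity (model path and titles are placeholders).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/to/finetuned-task-a-model")

query = "software developer"
candidates = ["Softwareentwickler", "desarrollador de software", "enfermero", "软件开发人员"]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

scores = util.cos_sim(query_emb, cand_embs)[0]
for title, score in sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{title}")
          </preformat>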
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Motivation for Cross-Lingual Setup and Translation Strategy</title>
          <p>A major challenge in Task A was the lack of labeled training data for Chinese. To overcome this,
we translated English job titles from ESCO into Chinese using the Qwen2.5 32B Instruct model. The
motivation was to create aligned examples that preserved semantic structure while leveraging a powerful
LLM’s cross-lingual generation capabilities. By training a single multilingual model for all languages
(en, es, de, zh), we aimed to ensure consistent semantic space alignment and reduce the complexity of
maintaining separate models.</p>
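          <p>The translation step can be sketched in the same way as the Task B augmentation, with a translation-only prompt; the prompt wording below is an assumption.</p>
          <preformat>
# Illustrative sketch: translating ESCO English job titles to Chinese with an
# instruction-tuned LLM (prompt wording and settings are assumptions).
from transformers import pipeline

translator = pipeline("text-generation", model="Qwen/Qwen2.5-32B-Instruct-AWQ", device_map="auto")

def translate_title(title):
    messages = [{
        "role": "user",
        "content": f"Translate this job title into Chinese. Reply with the translation only.\n{title}",
    }]
    out = translator(messages, max_new_tokens=32, do_sample=False)
    return out[0]["generated_text"][-1]["content"].strip()

# Each (English title, Chinese translation) pair is then added as a training pair.
print(translate_title("software developer"))
          </preformat>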
          <p>This design choice enabled the model to learn language-agnostic representations of job titles,
facilitating strong cross-lingual performance as reflected in the MAP scores.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task B: Job Title-Based Skill Prediction</title>
        <p>For Task B, we submitted three systems, leveraging LLM-augmented data. The results are shown in
Table 4.</p>
        <p>The GTE-Qwen2-7B model fine-tuned with LoRA on the mixed augmented data (descriptions and
synonym lists) yielded the highest MAP of 0.36. This suggests that the combination of a large instruction-tuned
base model, LoRA, and rich augmented input was most effective. The BGE-Large model trained
solely on the augmented descriptions performed competitively (MAP 0.34). When BGE-Large was
trained on the mixed augmented data, the performance was slightly lower (MAP 0.33). This might
indicate that for BGE-Large, the simpler augmented descriptions were more directly beneficial, or
that the mixed data format required further hyperparameter optimization. The use of LLM-generated
descriptions proved crucial for providing rich textual context.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Training Procedure</title>
          <p>All models were fine-tuned using the sentence-transformers library with a batch size of 64, using
the AdamW optimizer and a cosine learning rate schedule. We employed early stopping based on
validation MAP. For GISTEmbed-based training, the guide model was set to all-MiniLM-L12-v2,
and the CachedGISTEmbedLoss was used to align the student model’s embeddings with cached guide
embeddings.</p>
          <p>For the GTE-Qwen2-7B model, we used Low-Rank Adaptation (LoRA) with a rank of 8 and bias
training enabled, to efficiently fine-tune the model without updating all parameters. Fine-tuning was
performed for 1–3 epochs depending on convergence behavior, monitored using a 10% validation split
from the training set.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Inference Strategy</title>
          <p>During inference, we used the same augmentation templates as during training. Each job title query
was transformed into a prompt-generated description (optionally combined with a list of aliases), and
similarly for each skill. These texts were encoded into embeddings using the fine-tuned model, and
cosine similarity was computed between the job title and all skills. The top-N most similar skills were
returned as the model’s output for ranking.</p>
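          <p>The sketch below illustrates this step with mixed-format inputs; the model path, texts, and value of N are placeholders.</p>
          <preformat>
# Illustrative Task B inference sketch: build mixed-format inputs (generated brief
# plus alias list), encode, and return the top-N skills by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/to/finetuned-task-b-model")  # hypothetical path

def mixed_input(brief, aliases):
    return brief + "\n" + ", ".join(aliases)  # newline-separated description and aliases

job_text = mixed_input("Builds and maintains data pipelines.", ["data engineer", "ETL developer"])
skill_texts = [
    mixed_input("Needed by engineers modelling relational data.", ["SQL", "database querying"]),
    mixed_input("Required in roles caring for patients.", ["patient care"]),
]

job_emb = model.encode(job_text, convert_to_tensor=True)
skill_embs = model.encode(skill_texts, convert_to_tensor=True)

top_n = 1
scores = util.cos_sim(job_emb, skill_embs)[0]
best = scores.topk(k=min(top_n, len(skill_texts)))
for score, idx in zip(best.values.tolist(), best.indices.tolist()):
    print(f"{score:.3f}\t{skill_texts[idx]}")
          </preformat>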
          <p>In mixed-format submissions, we used newline-separated concatenations of generated descriptions
and alias lists. This format provided the model with richer and more consistent context and led to
improved generalization.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Motivation Behind Data Augmentation</title>
          <p>The core motivation for data augmentation was to enrich the semantic content of both job titles and
skills, which are otherwise short and ambiguous. By prompting the Qwen2.5 32B Instruct model to
generate compact but expressive descriptions, we aimed to reduce lexical sparsity and improve the
model’s ability to match based on conceptual similarity.</p>
          <p>Additionally, including known aliases in the input helped align representations across synonymous
phrases. The combination of these strategies allowed the model to learn more generalizable embeddings,
making it better suited for real-world applications where job titles and skill names vary significantly in
wording.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In our participation in TalentCLEF 2025, we explored the efficacy of fine-tuning sentence embedding
models using GISTEmbed for both multilingual job title matching (Task A) and job title-based skill
prediction (Task B). For Task A, larger models like GTE-Qwen2-7B, fine-tuned with LoRA,
demonstrated superior performance, particularly when augmented with LLM-translated data for low-resource
languages like Chinese. For Task B, data augmentation via LLM-generated job and skill descriptions
was a key strategy. The GTE-Qwen2-7B (LoRA) model trained on mixed augmented data (descriptions
and synonyms) achieved the best results, underscoring the value of rich, contextualized training inputs.
Our experiments highlight the potential of combining advanced fine-tuning techniques like GISTEmbed
and LoRA with LLM-driven data augmentation for complex semantic matching tasks in the HR domain.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Qwen2.5 32B Instruct in order to: Generate
training data through textual descriptions for jobs and skills (Task B), and Translate English job titles
to Chinese for training data augmentation (Task A). After using these tool(s)/service(s), the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fabregat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>García-Sardiña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Estrella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zbib</surname>
          </string-name>
          ,
          <article-title>Overview of the TalentCLEF 2025 Shared Task: Skill and Job Title Intelligence for Human Capital Management, in: International Conference of the Cross-Language Evaluation Forum for European Languages</article-title>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Solatorio</surname>
          </string-name>
          ,
          <article-title>GISTEmbed: Guided in-sample selection of training negatives for text embedding fine-tuning</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.16829. arXiv:2402.16829.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <article-title>C-Pack: Packaged resources to advance general Chinese embedding</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.07597.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.03216.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          , et al.,
          <article-title>mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1393</fpage>
          -
          <lpage>1412</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Towards general text embeddings with multi-stage contrastive learning</article-title>
          ,
          <source>arXiv preprint arXiv:2308.03281</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.09685.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>