<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Terminology Augmented Generation: A Systematic Review of Terminology Formats for In-Context Learning in LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Lackner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alena Vega-Wilson</string-name>
          <email>alena.vega-wilson@eurocom.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Lang</string-name>
          <email>christian.lang@kaleidoscope.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kaleidoscope GmbH</institution>
          ,
          <addr-line>Landstraße 99-101, 1030 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present our on-going work on a specialized extension of the Retrieval Augmented Generation (RAG) framework focusing on providing knowledge from enterprise terminology databases to generative LLMs: Terminology Augmented Generation (TAG). This study specifically focuses on the role that terminology formatting plays for TAG across common NLP downstream tasks such as translation and terminology revision of texts. By conducting empirical evaluations using OpenAI's GPT-4o, GPT-4o-mini, and the open-source Llama 3.3 and Mistral 7b models, we systematically explore various established terminology formats (including TBXv3) and compare the results to alternative structured and prose formats and their impact on generation quality. Preliminary findings indicate that specific formatting strategies significantly improve model accuracy and recall of in-context knowledge, as well as the disambiguation capabilities in linguistically ambiguous scenarios. This research provides valuable insights into the design of terminology integration methodologies for LLMs, contributing to the development of more effective language processing systems that meet the nuanced demands of professional and technical communication.</p>
      </abstract>
      <kwd-group>
        <kwd>NLP</kwd>
        <kwd>LLM</kwd>
        <kwd>RAG</kwd>
        <kwd>Retrieval Augmented Generation</kwd>
        <kwd>TAG</kwd>
        <kwd>Terminology Augmented Generation</kwd>
        <kwd>Neural Machine Translation</kwd>
        <kwd>terminology management</kwd>
        <kwd>terminology evaluation</kwd>
        <kwd>terminology revision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Limitations of Retrieval Augmented Generation</title>
        <p>Generic RAG pipelines face several limitations when applied to terminology:</p>
        <list list-type="bullet">
          <list-item><p>The retrieval process is comparatively slow.</p></list-item>
          <list-item><p>Retrieval is generally quite fuzzy, leading to noisy data.</p></list-item>
          <list-item><p>Arbitrary chunking of data may lead to critical information loss.</p></list-item>
          <list-item><p>Retrieval methods are often limited to top-k hits, potentially leading to silence in the retrieved data.</p></list-item>
          <list-item><p>Typical terminology formats (XML) are not well suited for vector-based semantic search.</p></list-item>
        </list>
        <p>
          We aim to explore two major components for efficient terminology augmented generation (TAG): Firstly, we explore retrieval for TAG as a specialized extension of RAG, using readily available terminology APIs in Kalcium Quickterm. We explore the impact of TAG in terms of speed, reliability and general feasibility for various downstream tasks with LLMs. We describe TAG in detail in our German publication in the conference proceedings of the DTT-Symposion 2025 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], but briefly summarized, TAG methods should be able to retrieve terminology in real-time from terminology management systems (TMS) like Kalcium Quickterm and/or format the terminological context in a way that it can be efficiently parsed by the LLM for in-context learning. Our second focus is the question of how to format the retrieved terminological entries when providing them as in-context knowledge to LLMs. While LLMs demonstrate remarkable abilities to parse XML – the typical terminology exchange format that is also standardized in the TBX XML specification [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] – we examine whether the verbose nature of the XML structure is detrimental for providing terminology to LLMs and, if it is, find viable alternatives for providing structured terminological knowledge to LLMs at run-time.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>We explore two distinct use cases for TAG: Machine Translation and Automatic Terminology Revision. Since these are different tasks, we follow a slightly different experimental setup for each, described in the following sub-sections.</p>
      <sec id="sec-2-1">
        <title>2.1. LLM setup</title>
        <p>To instruct the LLMs for each task, we set up one system prompt per task, shared between all models. For the terminology augmentation, we explore a variety of possible formats, ranging from the native XML output of terminology systems to other structured outputs such as JSON, YAML and Markdown, as well as ad-hoc generated “prose” instructions for using the relevant terminology. Since most of our prior testing was done using OpenAI models, our prompting techniques are likely to favor OpenAI-trained models. This evaluation is therefore not to be interpreted as a comparison between different models, but rather as an exploration of the effects of different prompting formats for TAG with different LLM backends. Nevertheless, we experiment with four popular models: OpenAI’s closed-source GPT-4o and GPT-4o-mini (model snapshots GPT-4o 2024-11-20 and GPT-4o-mini 2024-07-18), as well as the instruction-tuned open-source Llama-3.3 70b (4-bit integer quantized; https://ollama.com/library/llama3.3:70b-instruct-q4_0) and Mistral 7b (4-bit integer quantized; https://ollama.com/library/mistral:7b-instruct-q4_0) models. For all models we fix the three major hyper-parameters controlling the variance of the generated output: Temperature is set to 0.2, Top-P is kept at the default value of 1 and the seed is set to 42. While this still allows for some variance, we found that setting Temperature lower tends to reduce the perceived and measured quality of the output. (Please note that we might update the models or hyper-parameters for the final release of this paper, to reflect our actual testing conditions.)</p>
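        <p>As a minimal sketch of this configuration, the fixed decoding parameters can be attached to an OpenAI-compatible chat-completions request as follows; the function and model names are illustrative assumptions, and only the Temperature, Top-P and seed values mirror our setup:</p>

```python
# Sketch: fixed decoding hyper-parameters shared by all models under test.
DECODING_PARAMS = {
    "temperature": 0.2,  # low but non-zero: lower values reduced output quality
    "top_p": 1.0,        # default nucleus-sampling value
    "seed": 42,          # fixed seed to limit run-to-run variance
}

def build_chat_request(model: str, system_prompt: str, user_prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        **DECODING_PARAMS,
    }
```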
        <p>
          To access the models, we plan to employ the open-source AI interface Open WebUI [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (https://github.com/open-webui/open-webui) in combination with the open-source ollama framework (https://github.com/ollama/ollama), which allows us to access both open-source and proprietary models via a single OpenAI-conformant REST-API interface. While Open WebUI also allows us to set up custom pipelines which include TAG via the Kalcium REST-API, we implement the actual TAG code in Python to make the testing reproducible with other OpenAI-compatible endpoints. The TAG code itself is based on the Kalcium REST-API (https://demo.kaleidoscope.at/kalcrest/swagger/index.html), which provides ready-made endpoints for advanced term recognition using various established term recognition methods such as fuzzy matching and stemming. We use the “/kalcrest/terminology/analyze-sentence” endpoint for all of our tests and parse the returned JSON as required.
        </p>
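        <p>The retrieval step can be sketched as a single POST against the analyze-sentence endpoint; note that the request body and response field names shown here are illustrative assumptions rather than the documented Kalcium schema:</p>

```python
# Sketch: retrieve terminology for one sentence via the Kalcium REST-API.
import json
from urllib import request

KALCREST = "https://demo.kaleidoscope.at/kalcrest"

def analyze_sentence(sentence: str, timeout: float = 10.0) -> dict:
    """POST a sentence to /terminology/analyze-sentence and parse the JSON reply."""
    req = request.Request(
        f"{KALCREST}/terminology/analyze-sentence",
        data=json.dumps({"sentence": sentence}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def extract_hits(analysis: dict) -> list:
    """Flatten the recognized terms out of the (assumed) response shape."""
    return [hit["term"] for hit in analysis.get("hits", [])]
```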
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task-specific setup</title>
        <p>
          For Machine Translation, we follow the experimental setup of Dinu et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], using the WMT 2017 English-German news translation task (https://www.statmt.org/wmt17/translation-task.html) to evaluate our approach against a valid baseline (however, we use LLMs instead of custom NMT models). Additionally, we examine a small custom test set in three language directions: German (Austria) → Italian, German (Austria) → Czech and German (Austria) → English (US/GB). We chose these language pairs since they are common pairs for our customer base. In line with Dinu et al., we evaluate the Machine Translation results with BLEU [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and additionally use COMET [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Furthermore, we apply a fuzzy matching strategy similar to Exel et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to detect whether the correct terminology was used in the translation. Specifically, we stem words using the stemming engine in Kalcium Quickterm and perform a fuzzy search with a similarity rate of 80%. We acknowledge that this approach is not perfect and might produce false positives or negatives, e.g. for discontinuous terms or morphological variants that fail to be stemmed correctly. For this reason, we also manually sample the results to detect any irregularities stemming from erroneous term recognition.
        </p>
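        <p>A simplified version of this adherence check can be sketched as follows; difflib and a naive suffix stripper stand in for Kalcium Quickterm's stemming engine, which we use in the actual experiments:</p>

```python
# Sketch: fuzzy term-adherence check with an 80% similarity threshold.
from difflib import SequenceMatcher

def naive_stem(word: str) -> str:
    """Crude stemmer stand-in: lowercase and strip common German/English suffixes."""
    word = word.lower()
    for suffix in ("ungen", "ung", "en", "er", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_used(target_term: str, translation: str, threshold: float = 0.8) -> bool:
    """True if a token window of the translation fuzzy-matches the stemmed term."""
    words = target_term.split()
    n = len(words)
    term_stem = " ".join(naive_stem(w) for w in words)
    tokens = [naive_stem(t.strip(".,;:!?\"'()")) for t in translation.split()]
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i : i + n])
        if SequenceMatcher(None, term_stem, window).ratio() >= threshold:
            return True
    return False
```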
        <p>For terminology revision, we focus on monolingual revision for the same languages, i.e. German, Czech, Italian and English (US/GB). However, to the best of our knowledge, no terminology revision test set exists, so we create our own based on sentences from public translation projects. Since the terminology revision task requires the model to replace only the invalid terminology with the correct terminology and, where necessary, to adapt the sentence grammar, we evaluate the generated sentences against the “correct” ground-truth sentences and only consider an exact match to the ground-truth sentences to be a successful revision.</p>
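        <p>The revision metric itself thus reduces to an exact string comparison; a minimal sketch (with whitespace normalization as an assumption on our side):</p>

```python
# Sketch: exact-match evaluation for the terminology revision task.
def revision_success(prediction: str, ground_truth: str) -> bool:
    """Exact match after collapsing runs of whitespace."""
    norm = lambda s: " ".join(s.split())
    return norm(prediction) == norm(ground_truth)

def revision_accuracy(pairs: list) -> float:
    """Fraction of (prediction, ground_truth) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(revision_success(p, g) for p, g in pairs) / len(pairs)
```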
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Dataset</title>
        <p>As described in Section 2.2, for the MT evaluation we use the WMT 2017 English-German news translation task as a baseline; however, we also create a custom dataset to evaluate languages and terminology actually used by our customers. In prior unreleased work, we have already prepared a test set for terminology revision, consisting of various sentences from prior translation projects, which we modified with flawed terminology to be corrected by the LLM. For that first examination, we provided the term replacement pairs directly with the test data. For this work, we plan to rework the dataset to include approximately 200 source and target language sentences for each of the language pairs. Typical test sets are sentence-based, so we make an effort to sentence-align any equivalent sentences, but we also align the dataset on a paragraph and a document level, allowing for the evaluation of long-context performance with TAG.</p>
        <p>
          As for the terminology, we create three separate termbases: For the WMT 2017 test set, we import the glossary created by Dinu et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (https://github.com/mtresearcher/terminology_dataset), based on Wiktionary and IATE terminology. For our custom test set, we use existing terminological resources curated by the customer and our team, or extract and create terminological entries from the test set as needed.
        </p>
        <p>Note that the premise of TAG goes beyond term pairs and glossaries: it aims to augment the generation with concept-oriented terminology, which means that terminological metadata like definitions, usage status, usage notes and other relevant information from the termbase is used during the terminological augmentation. This information is not present in the glossaries created by Dinu et al., so to evaluate the effectiveness of TAG for additional capabilities like disambiguation, we purposefully include homographs in the custom test sets and the corresponding termbase. The test set creation is currently an on-going process. We plan to release the test set for reproducibility, but since it will contain customer data, the legal feasibility needs to be assessed for the finished dataset.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary results and observations</title>
      <p>
        As this evaluation project is still a work in progress, we present preliminary results here; the full evaluation will follow in the camera-ready version of this paper. For the non-augmented models, we translate the WMT 17 test data (EN-DE) from Dinu et al. with a simple system prompt (“Translate the text provided by the user from and into the language specified by the user. Only return the translation.”) followed by the user prompt (“Translate from English to German: {text}”). For the model with TAG, we use a more complex prompt, which can be found in Annex B. As the terminology format, we used the Markdown format described in the system prompt (note that for this evaluation set, no information but the terms themselves was available). We measure terminology adherence by looking for perfect matches of all terms present in the translation output of each sentence. This is possible because the terminology contains the translated terminology (mostly) in the same morphological form in which it is present in the text. In Table 1, we compare the results of GPT-4o, GPT-4o-Mini and GPT-4o with TAG against the train-by-replacement approach of Dinu et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as a baseline, as it achieved the highest term adherence in their work. For reference, we also include all sentences where terminology adherence was not achieved with TAG in Appendix A.
      </p>
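      <p>To illustrate, retrieved term pairs can be rendered into such a Markdown block along the following lines; the exact layout used in our experiments follows the system prompt in Annex B, so the field names here should be read as assumptions:</p>

```python
# Sketch: render retrieved term entries as a Markdown block for in-context use.
def terms_to_markdown(entries: list) -> str:
    """Render term entries (dicts with source/target and optional metadata)."""
    lines = ["# Terminology"]
    for e in entries:
        line = f"* {e['source']} -> {e['target']}"
        if e.get("definition"):  # optional metadata, when the termbase provides it
            line += f" (definition: {e['definition']})"
        lines.append(line)
    return "\n".join(lines)
```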
      <sec id="sec-3-1">
        <title>Results</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Preliminary results on the WMT 2017 EN-DE test set: terminology adherence (Term %), BLEU and COMET, compared against the train-by-replacement baseline of Dinu et al.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Term %</th><th>BLEU</th><th>COMET</th></tr>
            </thead>
            <tbody>
              <tr><td>Baseline</td><td>94.5</td><td>26.0</td><td>–</td></tr>
              <tr><td>GPT-4o-Mini</td><td>87.4</td><td>33.7</td><td>0.876</td></tr>
              <tr><td>GPT-4o</td><td>87.2</td><td>35.7</td><td>0.880</td></tr>
              <tr><td>GPT-4o with TAG</td><td>96.37</td><td>35.5</td><td>0.877</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>While the baseline approach outperforms the non-augmented LLMs in terminology adherence, the LLMs achieve notably higher BLEU scores, and crucially, the TAG approach further improves on the baseline's already high terminology adherence, failing in only 15 of the 414 sentences tested. Similarly to Dinu et al., we observe high terminological adherence even for the non-augmented LLMs (52 and 53 failed examples, respectively). However, while exploring the terminology used in the baseline experiments and this preliminary evaluation, we observed several inherent issues with the terminological data itself:</p>
        <list list-type="order">
          <list-item><p>The terminology extracted from IATE contains many common nouns like “month”, “eggs” or “tobacco” and country names like “Syria”, which would generally be translated correctly by systems trained on common-domain texts.</p></list-item>
          <list-item><p>The terminology is not always in the base singular form (e.g. “eggs”, “victories”, “Schweizerin”), is translated as a noun when it should be a verb (e.g. “covering” → “Bezug” or “arrest” → “Festnahme”) or contains articles (e.g. “Die Republikaner”). The last two examples account for 4 of the 15 issues encountered with TAG (Appendix A: 5, 7, 11, 15).</p></list-item>
          <list-item><p>Around 30 homographs (out of 232 total terms) with inconsistent translations are present (e.g. “office” → “Büro” and “office” → “Amt”); since the terminology does not provide any kind of definition or usage recommendation, the LLM has no way to disambiguate the meaning or choose a specific translation. This accounts for 9 of the 15 issues encountered with TAG (Appendix A: 1, 2, 3, 4, 6, 8, 9, 10, 12).</p></list-item>
        </list>
        <p>Points 2 and 3 show that the quality and completeness of the terminological data are of high importance for TAG, especially when disambiguation is required. We also observed two issues that were likely not caused exclusively by deficiencies in the terminological data, but rather stem from either the generative nature of the model or our retrieval method: In one instance, the model translated “night” as “Abend” (or rather composited it into “Donnerstagabend”), even though “night” was provided with the German translation “Nacht” (Appendix A: 14). In another instance, the model was provided with both the “election campaign” and “campaign” source terms and chose the wrong translation “Wahlkampf” instead of “Kampagne”; this is somewhat related to the homograph issue and might thus be resolved with a proper definition of the terms or by using a lower fuzziness for the retrieval (Appendix A: 13).</p>
        <p>All remaining issues were caused by our preliminary way of checking terminology adherence: the LLM may have produced a morphological variant or a similar compound instead of the expected terminology verbatim (e.g. “Vormonat” instead of “Monat”, Appendix A: 7). For the final results, we will likely use more refined NLP approaches or manual checks to filter out these false negatives.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Future work</title>
      <p>
        After the creation of the dataset, we aim to finalize our automated testing pipeline as described in Section 2. From our experience employing TAG and the preliminary results shown in this paper, we expect TAG to match or beat the baseline results achieved by prior NMT methods ([
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), but we are curious to see how well our hypothesis holds up for more complex or ambiguous examples in the custom test set. In particular, we want to see how well our hypothesis regarding specific terminology formats for TAG holds up on a larger test set, and how the effectiveness of these formats varies between LLMs. However, this evaluation covers only a small part of the various systems that come into play during TAG. For example, the terminology retrieval method used for this work is efficient for the tasks at hand, but rather limited for more open-ended downstream tasks. Future work could focus on more advanced ways to retrieve terminology from termbases, such as dense or sparse vector retrieval or graph-based approaches that make use of relational information in advanced terminological systems. These methods could enhance the accuracy of term recognition, but also allow autonomous AI agents to better navigate terminological resources to complete various downstream tasks, such as generating terminologically correct and hallucination-free text from scratch, e.g. for technical documentation, interactive support systems or efficiently navigating a specialized UI.
      </p>
      <p>We hope this work can serve both as a foundation for future refinements and evaluations of TAG approaches and as an inspiration to explore new applications of terminology in AI. We look forward to sharing our results and discussing them at MDTT 2025 in Thessaloniki, Greece.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-4 for grammar and spelling checks. The authors have subsequently reviewed and edited the content and take full responsibility for the publication’s final version.</p>
      <p>A. Qualitative evaluation of terminology adherence with TAG</p>
      <table-wrap id="tabA1">
        <table>
          <thead>
            <tr><th>#</th><th>Term</th><th>Comment</th><th>Source</th><th>Prediction (with TAG)</th></tr>
          </thead>
          <tbody>
            <tr>
              <td>1</td>
              <td>Notstand</td>
              <td>Two entries for "state of emergency" with "Notstand" and "Ausnahmezustand" as translation (ids: 198, 202)</td>
              <td>Heavy rain and widespread flooding in Louisiana lead the governor to declare a state of emergency on Friday, with more rain expected over the state through Saturday.</td>
              <td>Starker Regen und weit verbreitete Überschwemmungen in Louisiana führten dazu, dass der Gouverneur am Freitag den Ausnahmezustand ausrief, wobei für Samstag weiterer Regen über dem Staat erwartet wird.</td>
            </tr>
            <tr>
              <td>2</td>
              <td>Stärke</td>
              <td>Three entries for "strength" with "Widerstandsfähigkeit", "Kraft" and "Stärke" in German (ids: 169, 36, 76)</td>
              <td>Gymnastics develops strength, flexibility and coordination for the body and hard work, discipline and determination for the mind.</td>
              <td>Gymnastik entwickelt Kraft, Flexibilität und Koordination für den Körper sowie harte Arbeit, Disziplin und Entschlossenheit für den Geist.</td>
            </tr>
            <tr>
              <td>3</td>
              <td>Zeichen</td>
              <td>Two entries for "sign" with "Signal" and "Zeichen" in German (ids: 133, 220)</td>
              <td>Employers were hopeful that the continued positive engagement on other important topics - such as deployment, flexibility in training, additional training for those returning from career breaks, costs of training, mutual recognition of syllabus, study leave and the gender pay gap in medicine - were a sign of how serious employers, Health Education England and the Department of Health were about honouring the agreements reached with the BMA in November, February and May.</td>
              <td>Arbeitgeber hofften, dass das anhaltend positive Engagement bei anderen wichtigen Themen wie Einsatz, Flexibilität in der Ausbildung, zusätzliche Schulungen für Rückkehrer nach einer beruflichen Auszeit, Ausbildungskosten, gegenseitige Anerkennung des Lehrplans, Studienurlaub und das geschlechtsspezifische Entlohnungsgefälle in der Medizin - ein Signal dafür war, wie ernst es den Arbeitgebern, Health Education England und dem Gesundheitsministerium damit war, die mit der BMA im November, Februar und Mai getroffenen Vereinbarungen einzuhalten.</td>
            </tr>
            <tr>
              <td>4</td>
              <td>Wahlkampagne</td>
              <td>Two entries for "election campaign" with "Wahlkampf" and "Wahlkampagne" in German (ids: 78, 11)</td>
              <td>Presidential candidate Donald Trump has now admitted to the weaknesses of his election campaign in Utah, formerly a Republican stronghold.</td>
              <td>Präsidentschaftskandidat Donald Trump hat nun die Schwächen seines Wahlkampfs in Utah, ehemals eine Hochburg der Republikaner, eingeräumt.</td>
            </tr>
            <tr>
              <td>5</td>
              <td>Die Republikaner</td>
              <td>Partly correct translated with "Republikaner" but the article wasn't used (which would not have made sense in this context). Only "republicans" (id: 3) is in the termbase and it still correctly translated "senate-republicans" with "Senats-Republikaner".</td>
              <td>Senate Republicans have blocked Garland's confirmation since President Barack Obama nominated him in March.</td>
              <td>Senats-Republikaner haben die Bestätigung von Garland blockiert, seit Präsident Barack Obama ihn im März nominiert hat.</td>
            </tr>
            <tr>
              <td>6</td>
              <td>Signal</td>
              <td>Two entries for "sign" with "Signal" and "Zeichen" in German (ids: 133, 220)</td>
              <td>The Federal Association of German Banks believes this is an isolated case, and does not see it as a sign: normal savers "need not worry about being hit with penalty interest on the pennies they've put aside".</td>
              <td>Der Bundesverband deutscher Banken glaubt, dass dies ein Einzelfall ist und sieht es nicht als ein Zeichen: Normalsparer "müssen sich keine Sorgen machen, dass sie mit Strafzinsen auf die Pfennige, die sie beiseitegelegt haben, belastet werden".</td>
            </tr>
            <tr>
              <td>7</td>
              <td>Monat</td>
              <td>"Vormonat" instead of "Monat"</td>
              <td>An ICM poll in April revealed that almost 50% of people who bought a vinyl album the previous month had yet to listen to it.</td>
              <td>Eine ICM-Umfrage im April ergab, dass fast 50 % der Personen, die im Vormonat ein Vinylalbum gekauft hatten, es noch nicht angehört hatten.</td>
            </tr>
            <tr>
              <td>8</td>
              <td>Kraft</td>
              <td>–</td>
              <td>Fiji gave a masterclass in handling, off-loading, sidestepping, support play, running lines and brute strength to win gold - their first Olympic medal of any colour.</td>
              <td>Fidschi gab eine Meisterklasse im Handling, Abspielen, Ausweichen, Unterstützungsspiel, Laufwege und brutale Stärke, um Gold zu gewinnen ihre erste olympische Medaille in irgendeiner Farbe.</td>
            </tr>
            <tr>
              <td>9</td>
              <td>Gegenüberstellung</td>
              <td>–</td>
              <td>She added that "it's hard to argue with the side by side comparison of the very gross Asian caricature and the filter's effects."</td>
              <td>Sie fügte hinzu, dass "es schwer ist, gegen den direkten Vergleich der sehr groben asiatischen Karikatur und der Effekte des Filters zu argumentieren."</td>
            </tr>
            <tr>
              <td>10</td>
              <td>–</td>
              <td>Two entries for "series" with "Serie" and "Reihe" in German (ids: 156, 62)</td>
              <td>–</td>
              <td>–</td>
            </tr>
            <tr>
              <td>11</td>
              <td>–</td>
              <td>Term was used as a verb (beziehen instead of Bezug), the term in English was also used as a verb</td>
              <td>–</td>
              <td>–</td>
            </tr>
            <tr>
              <td>12</td>
              <td>–</td>
              <td>–</td>
              <td>–</td>
              <td>–</td>
            </tr>
            <tr>
              <td>13</td>
              <td>–</td>
              <td>–</td>
              <td>Bordini's complaint seeks compensation from the campaign for negligent supervision, and from Phillip for assault, battery and infliction of emotional distress.</td>
              <td>–</td>
            </tr>
            <tr>
              <td>14</td>
              <td>–</td>
              <td>–</td>
              <td>For people hoping to spot shooting stars in south-west Germany on Thursday night, the weather put paid to their plans but all is not lost.</td>
              <td>–</td>
            </tr>
            <tr>
              <td>15</td>
              <td>–</td>
              <td>–</td>
              <td>Plain clothes officers from Dusseldorf's police force managed to arrest two women and two men, aged between 50 and 61, on Thursday.</td>
              <td>–</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>B. System prompt and terminology format for preliminary results</p>
      <preformat>You are a translator and author. The user will provide text to be translated and indications which terminology to use.
# Task description
* Translate the text provided by the user from and into the language the user specifies.
* Make sure the translation sounds natural.
* The user specifies the translation direction by prefixing the text to be translated with the following string: `Translate {sourceLanguage} to {targetLanguage}:`
* Use the definition to disambiguate the meaning of the terminology passed by the user and translate accordingly
* Only return the translation
# Terminology
* If available, the user will provide indication on what terminology to use
* Follow the suggestions provided within the &lt;tag&gt;-XML elements of the user message
# Rules
* Use the definition to disambiguate the meaning of term pairs
* Follow the usageNote of each possible translation to choose the most suitable translation, if more than one translation is provided
* If the system returns a term that is not present in the source text, ignore the term.</preformat>
      <p>Prompt 1: System prompt for GPT-4o with TAG</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” 2020, arXiv. doi: 10.48550/ARXIV.2005.11401.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Gupta, R. Ranjan, and S. N. Singh, “A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions,” 2024, arXiv. doi: 10.48550/ARXIV.2410.12837.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] K. Fleischmann and C. Lang, “Terminologie für die KI: Wie mit Terminologie der Output von LLMs und GenAI optimiert werden kann,” in Akten des Symposions, 27.–29. März 2025, P. Drewer, F. Mayer, and D. Pulitano, Eds., Worms: Deutscher Terminologie-Tag e.V., 2025.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] “Management of terminology resources – TermBase eXchange (TBX),” ISO, Standard ISO 30042:2019, 2019. [Online]. Available: https://www.iso.org/standard/62510.html</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>[5] “Designing an open-source LLM interface and social platforms for collectively driven LLM evaluation and auditing”.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] G. Dinu, P. Mathur, M. Federico, and Y. Al-Onaizan, “Training Neural Machine Translation to Apply Terminology Constraints,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy: Association for Computational Linguistics, 2019, pp. 3063-3068. doi: 10.18653/v1/P19-1294.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , “
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          ,”
          <source>in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02</source>
          , Philadelphia, Pennsylvania: Association for Computational Linguistics,
          <year>2002</year>
          , p.
          <fpage>311</fpage>
          . doi: 10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          , “
          <article-title>COMET: A Neural Framework for MT Evaluation</article-title>
          ,”
          <source>in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Online: Association for Computational Linguistics
          ,
          <year>2020</year>
          , pp.
          <fpage>2685</fpage>
          -
          <lpage>2702</lpage>
          . doi: 10.18653/v1/2020.emnlp-main.213.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Exel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Buschbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Brandt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Doneva</surname>
          </string-name>
          , “
          <article-title>Terminology-Constrained Neural Machine Translation at SAP</article-title>
          ,”
          <source>in Proceedings of the 22nd Annual Conference of the European Association for Machine Translation</source>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Moniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fumega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Batista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coheur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Parra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Trancoso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bisazza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moorkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guerberof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nurminen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Forcada</surname>
          </string-name>
          , Eds., Lisboa, Portugal: European Association for Machine Translation, Nov.
          <year>2020</year>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          . [Online]. Available: https://aclanthology.org/2020.eamt-1.29/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>