<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>X (L. De Santis);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An LLM-based Approach for Translating Keywords in Scientific Publications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca De Santis</string-name>
          <email>desantis@netseven.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Pedinotti</string-name>
          <email>pedinotti@netseven.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Net7 Srl</institution>
          ,
          <addr-line>via Chiassatello 57, 56121 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>We present herein a methodology and a working implementation for translating textual keywords of scientific publications. Using descriptive metadata to construct the context, this approach leverages Large Language Models (LLMs) to map keywords to entities of multilingual knowledge bases and controlled vocabularies, Wikidata in particular. By integrating these sources, it is not only possible to obtain keyword translations in multiple languages, but also to map them to Linked Data entities, disambiguating their meaning and improving the identification and classification of the associated publications. The methodology, developed during the ATRIUM research project, produced promising results when used with a commercial Large Language Model like ChatGPT. At the same time, our research highlights the challenges of reconciling free-form keywords, since the results can vary depending on the quality of the original metadata. While initially designed for the GoTriple discovery platform, this approach, along with its open-source example implementation, can be generalized to all situations where it is necessary to extract multilingual knowledge from text-based keywords.</p>
      </abstract>
      <kwd-group>
        <kwd>Metadata Enrichment</kwd>
        <kwd>Text Processing</kwd>
        <kwd>Multilingualism</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>Social Sciences and Humanities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        of an English translation when absent, to facilitate access to documents in local languages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Moreover, the GoTriple website is localized in 10 languages.
      </p>
      <p>
        In GoTriple, the annotation of disciplines and controlled vocabulary terms enriches the original
metadata of documents, which, after processing, include labels in multiple European languages.
This facilitates the discovery of relevant content in a local language other than
English. These added metadata are, as noted, the result of an automatic process, based on Machine
Learning and Natural Language Processing (NLP) techniques, which, while effective, is not
completely reliable (e.g. in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] it is shown that the discipline classifier achieves an average F1-score of only
around 50% across all of its 11 supported languages).
      </p>
      <p>Moreover, these classifications, while undoubtedly useful, cannot be considered as valuable as
those originally applied by the document authors in the “keywords” attribute, that is, the free-text
descriptions added to provide a simple categorization of the paper’s content.</p>
      <p>
        As indicated in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], this specific piece of document metadata proved problematic for automatic
curation in GoTriple. In particular, the option of translating keywords automatically via
a dedicated service (eTranslation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in the case of GoTriple) was discarded: on the one hand,
automatic systems may perform poorly on short texts, particularly when these are considered outside of
a larger, more meaningful context (think of a term like “rock”, which can apply to subjects as
distant as Geology and Music).
      </p>
      <p>On the other hand, for GoTriple metadata, it has been observed that articles often include keywords
in multiple languages: in particular, when the text of an article is in a language other than
English, the authors often add keywords in both the document language and English, to ease the
discovery of the article in scientific repositories.</p>
      <p>
        A specific subtask (T3.4.1) of the ongoing EU-funded research project ATRIUM
(Advancing fronTier Research In the arts and hUManities) has been dedicated to the keyword
translation problem. The goal of ATRIUM is to “bridge leading research infrastructures in arts and humanities
(DARIAH), archaeology (ARIADNE), languages (CLARIN), and open scholarly communication in
the social sciences and humanities (OPERAS)” [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The keyword translation task, albeit simple and straightforward in theory, presents numerous
challenges, such as short keywords, lack of context, unidentified languages, and the use of multiple
languages within a single publication's keywords.</p>
      <p>
        In this article we present the work done by our team in this context. We start by presenting a
review of interesting LLM-based approaches for metadata enrichment (Section 2). In Section 3, the
methodology proposed in the ATRIUM project is presented, followed by a description of its
implementation (Section 4), in the form of a publicly available Python code repository and an SSH
Open Marketplace workflow [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which provides step-by-step documentation on how to use the
aforementioned code. In Section 5, we present the experimental results obtained by applying the
methodology on a selection of multilingual documents extracted from the GoTriple platform, while
the conclusions provide a summary and suggest possible directions for future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Review of the use of LLMs for metadata enrichment</title>
      <p>Numerous studies demonstrate the great potential of using an LLM, and ChatGPT
in particular, for enriching the descriptions of textual documents.</p>
      <p>For example, in [11] ChatGPT was used for automatic genre classification on texts in English
and Slovenian, producing results that, even with a zero-shot approach, outperformed a machine
learning model fine-tuned on manually annotated datasets. In [12], the use of ChatGPT to classify
hate speech was tested empirically, showing a positive result in 80% of the cases, even in the
presence of implicit hateful content. In [13], it is stated that “ChatGPT outperforms crowd workers
for several annotation tasks, including relevance, stance, topics, and frame detection” with “the
zero-shot accuracy of ChatGPT exceeding that of crowd workers by about 25 percentage points on
average”.</p>
      <p>Closer to our goals is the research described in [14], which showed how LLMs can be
used to annotate subject metadata by providing classification examples through an in-context
learning approach. The results obtained were considered “promising”, although the
experiments were conducted using ChatGPT-3.5 which, as mentioned by the authors,
performed poorly for the categorization of documents in specific disciplines (e.g. History and
archaeology).</p>
      <p>While the potential of LLMs is widely recognized and proven, it is important at the same
time not to ignore the problems that can arise in their use, in particular the issue of
generating so-called hallucinations.</p>
      <p>The research in [15], which focuses on the use of LLMs to create systematic reviews, highlights
problems in obtaining accurate references, with a risk of having hallucinations “at a rate between
28% to 91%”, according to the model used, stating also that “any references generated by such
models warrant thorough validation by researchers”.</p>
      <p>In order to limit the risk of hallucination, [16] proposes an interesting approach based on
refining the original LLM response by searching for supporting documents to verify and enforce
any citation contained in it.</p>
      <p>The idea of using an external corpus to support LLM responses is defined in the article as
“citation augmented strategies”, which can be either “parametric”, that is, based on “information
internalized from the training data”, or “non-parametric”, that is, methods that “involve querying
relevant information and seamlessly integrating the retrieved content from outside corpus” to
enrich the original LLM response.</p>
      <p>The validity of this approach is confirmed by other studies, such as [17] and [18].</p>
    </sec>
    <sec id="sec-3">
      <title>3. The proposed methodology</title>
      <p>Keyword translation has been approached by using an LLM to map the keywords to entities of
multilingual controlled vocabularies, Wikidata in particular. Bibliographic keywords are included
in publications in a “bag of words” manner, using concepts composed of one or more words,
typically separated by commas, that describe the article’s content but are not necessarily in a strict
semantic relationship with each other: their meaning emerges when considered in relation to the
content of the article.</p>
      <p>Our methodology recreates a meaningful context for the interpretation of
keywords by using the publication’s other metadata, in particular its title, abstract and the
language in which it is written. With this context, we build a prompt asking the LLM to
recognize, for each keyword, a concept from Wikidata, returning also the URL of the corresponding
Wikidata page.</p>
      <p>We started our initial experiments by using ChatGPT with a prompt similar to the one indicated
below.</p>
      <p>We process a scientific article of which we have the TITLE, the ABSTRACT and the KEYWORDS
separated by commas.</p>
      <p>The language of the article is &lt;document_language&gt; but the KEYWORDS can be in different
languages.</p>
      <p>The goal is to map each keyword to a corresponding entity of Wikidata.</p>
      <p>Use the TITLE and ABSTRACT as context. Use this context to suggest a mapping of each keyword to
a Wikidata entity, returning also its URL.</p>
      <p>TITLE: &lt;document_title&gt;
ABSTRACT: &lt;document_abstract&gt;
KEYWORDS: &lt;document_keywords&gt;.</p>
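      <p>As an illustrative sketch (the helper name is ours, not part of the project code), the prompt above can be assembled from the document metadata as follows:</p>

```python
def build_prompt(title, abstract, keywords, language):
    """Assemble the reconciliation prompt from the article's metadata.
    `keywords` is a list of free-text keyword strings."""
    return (
        "We process a scientific article of which we have the TITLE, "
        "the ABSTRACT and the KEYWORDS separated by commas.\n"
        f"The language of the article is {language} but the KEYWORDS "
        "can be in different languages.\n"
        "The goal is to map each keyword to a corresponding entity of Wikidata.\n"
        "Use the TITLE and ABSTRACT as context. Use this context to suggest "
        "a mapping of each keyword to a Wikidata entity, returning also its URL.\n"
        f"TITLE: {title}\n"
        f"ABSTRACT: {abstract}\n"
        f"KEYWORDS: {', '.join(keywords)}"
    )
```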
      <p>From the very start we noticed two important aspects. On the one hand, the results obtained
were very often accurate, with the LLM able to identify and reconcile keywords to their exact
Wikidata counterpart or to logically very close entities. On the other, the URLs provided were
always wrong: ChatGPT returned syntactically valid Wikidata URLs that actually correspond to different
entities.</p>
      <p>As the retrieved concepts were accurate, we decided to apply to the LLM response a
non-parametric citation-augmented strategy, as defined in [16]: in our case, we query Wikidata via its
API to obtain the correct URLs of the entities recognized by the LLM.</p>
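      <p>A minimal sketch of this verification step, using Wikidata's public wbsearchentities endpoint (the function names are ours; the project code may differ):</p>

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_params(label, language="en"):
    """Query string for the wbsearchentities endpoint."""
    return urllib.parse.urlencode({
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "format": "json",
        "limit": 1,
    })

def wikidata_url(label, language="en"):
    """Return the Wikidata page URL of the best match for `label`, or None."""
    with urllib.request.urlopen(f"{WIKIDATA_API}?{search_params(label, language)}") as resp:
        hits = json.load(resp).get("search", [])
    return f"https://www.wikidata.org/wiki/{hits[0]['id']}" if hits else None
```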
      <p>If the keyword text directly matches a label of a Wikidata entity, we can export its translations
for all the languages that we need. More generally, we can establish a strong semantic association
between the keyword and the Wikidata entity by using a predicate of the SKOS ontology [19], such
as “exactMatch”.</p>
      <p>If the keyword doesn’t literally correspond to the associated Wikidata entity’s labels, we will
not use them as translations, but we can, in any case, create a weaker semantic link using the SKOS
predicate “relatedMatch”.</p>
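      <p>The distinction between the two cases can be sketched as a simple label comparison (an illustrative helper, not the repository's actual code):</p>

```python
def skos_predicate(keyword, entity_labels):
    """Choose the SKOS predicate for a keyword/entity pair:
    skos:exactMatch when the keyword literally matches one of the entity's
    labels (ignoring case and surrounding whitespace), skos:relatedMatch
    otherwise. Only exact matches are safe sources of translations."""
    normalized = keyword.strip().lower()
    if any(normalized == label.strip().lower() for label in entity_labels):
        return "skos:exactMatch"
    return "skos:relatedMatch"
```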
    </sec>
    <sec id="sec-4">
      <title>4. The experimental implementation</title>
      <p>The methodology described herein has been implemented using the Python programming
language: all of the source code is freely available on GitHub [20].</p>
      <p>The most significant code file is main_functions.py, which encapsulates the logic required to
perform keyword translation tasks.</p>
      <p>The first step involves text preprocessing, which, given an article, identifies and extracts the
relevant metadata, in particular the title, the abstract and the keywords.</p>
      <p>Then the LLM prompt is created, which includes the context built with the extracted metadata.
To interact with LLMs, the Groq APIs [21] have been used, as they provide a convenient way to
interact with multiple language models, both commercial and open source.</p>
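      <p>A sketch of such a call against Groq's OpenAI-compatible REST endpoint (the model name and helper functions here are our assumptions, not necessarily those used in the repository):</p>

```python
import json
import os
import urllib.request

GROQ_ENDPOINT = "https://api.groq.com/openai/v1/chat/completions"

def chat_payload(prompt, model="llama-3.1-8b-instant"):
    """JSON body for an OpenAI-compatible chat completions request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_llm(prompt, model="llama-3.1-8b-instant"):
    """Send the prompt to Groq and return the model's reply text.
    Requires the GROQ_API_KEY environment variable to be set."""
    request = urllib.request.Request(
        GROQ_ENDPOINT,
        data=json.dumps(chat_payload(prompt, model)).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + os.environ["GROQ_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```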
      <p>The prompt instructs the model to return recognized entities enclosed in square brackets, so
that they can be easily retrieved using a regular expression.</p>
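      <p>For instance, the bracket convention can be parsed as follows (a sketch; the repository's exact pattern may differ):</p>

```python
import re

def extract_entities(llm_response):
    """Return the entity names that the model wrapped in square brackets."""
    return re.findall(r"\[([^\]]+)\]", llm_response)
```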
      <p>Each entity is then used to query Wikidata’s APIs, in order to retrieve its URL along with the
available translations provided on the platform.</p>
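      <p>Retrieving the URL together with the multilingual labels can be sketched with the wbgetentities endpoint (the helper names and the language selection are ours):</p>

```python
import json
import urllib.parse
import urllib.request

def labels_params(entity_id, languages=("en", "fr", "de", "it")):
    """Query string for wbgetentities, restricted to label properties."""
    return urllib.parse.urlencode({
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "labels",
        "languages": "|".join(languages),
        "format": "json",
    })

def translations(entity_id, languages=("en", "fr", "de", "it")):
    """Map language code to label for a Wikidata entity id such as Q1783171."""
    url = "https://www.wikidata.org/w/api.php?" + labels_params(entity_id, languages)
    with urllib.request.urlopen(url) as resp:
        labels = json.load(resp)["entities"][entity_id]["labels"]
    return {lang: value["value"] for lang, value in labels.items()}
```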
      <p>The implementation is designed to be flexible, supporting both commercial and open-source
large language models to accommodate diverse deployment requirements.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Measurements and results</title>
      <p>The effectiveness of the proposed algorithm was measured on an annotated dataset of 200 articles
extracted from GoTriple. This dataset was constructed by selecting documents in 21 languages,
including English, belonging to 23 SSH disciplines, each containing at least four keywords (9.53 on
average).</p>
      <p>The annotation, made by the authors of this paper, consisted of manually mapping each keyword
to a corresponding Wikidata entity. Both "exact matches" and "related matches" were considered,
with the latter category including similar, but not entirely precise, correspondences found in
Wikidata. More than one Wikidata URL could be included in those situations in which several
entities could be significantly associated with a keyword. When no match was possible, the original
keyword was left without any Wikidata association.</p>
      <p>Examples of these annotations follow:
</p>
      <p>Exact match: nation navajo -&gt; https://www.wikidata.org/wiki/Q1783171 (Navajo
Nation)</p>
      <p>The experiment was limited to verifying the effectiveness of the algorithm in performing the
reconciliation of keywords with Wikidata entities. Once the correspondence is created, the
translations can be easily obtained, in multiple languages, using Wikidata APIs: therefore, this last
step was not included in the test.</p>
      <p>While annotating the keywords for the experiment, we noted that only a fraction of the
keywords could be safely associated with a Wikidata entity. In around 80% of the cases, it was
possible to find a real association, either exact or related, the former in 64.72% of the cases.</p>
      <p>The manual annotation represented the ground truth against which the results of the algorithm
were evaluated. The metrics of precision and recall have been used for the evaluation.</p>
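      <p>As an illustration of how such metrics can be computed over keyword-to-entity mappings (a sketch; the actual evaluation script is in the repository [20]):</p>

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted keyword-to-URL mappings
    against a manually annotated ground truth. Both arguments map a
    keyword to the set of Wikidata URLs associated with it."""
    tp = fp = fn = 0
    for keyword, gold_urls in gold.items():
        pred_urls = predicted.get(keyword, set())
        tp += len(pred_urls.intersection(gold_urls))
        fp += len(pred_urls.difference(gold_urls))
        fn += len(gold_urls.difference(pred_urls))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```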
      <p>The algorithm was tested on this dataset by using gpt-4o-mini. Results are shown in Table 1.</p>
      <p>The algorithm's performance proves to be more effective in identifying an exact match, with a
precision of 0.66 and a recall of 0.64 against the ground truth. On the other hand, the results for the
looser matches were quite disappointing, as their choice inevitably carries greater uncertainty and,
possibly, the personal bias of the human who performs the annotation.</p>
      <p>The script and dataset used for this experiment have been made freely available in the
"evaluation_files" directory of the software's GitHub repository [20].</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and future work</title>
      <p>We presented a methodology for translating the keywords of scientific publications by leveraging a
Large Language Model to reconcile them with Wikidata entities. This reconciliation enables the
retrieval of translations by utilizing the multilingual capabilities of this collaborative knowledge
base.</p>
      <p>The methodology and its associated implementation were evaluated against a testbed of manual
annotations, demonstrating strong performance (around 65%) on a limited set of keywords that
correspond readily to Wikidata concepts.</p>
      <p>Of course, the benefits of this approach are not limited to translations. Reconciling keywords
with Wikidata entities facilitates article classification and enhances the understanding of its main
subjects. This is particularly useful in a multilingual discovery platform like GoTriple, which
features articles in many different languages.</p>
      <p>On the other hand, the lack of standard workflows for creating keywords, along with noise
introduced by data aggregators that may include classification codes (such as the Dewey Decimal
Classification - DDC) as keywords, makes processing this metadata particularly challenging. In
fact, manual annotation of our test data was generally feasible for 80% of the keywords, but an
exact correspondence with Wikidata entities was achieved in only 64.72% of the cases.</p>
      <p>Future directions for this work include exploring the possibility of reconciling keywords with
other standard classification taxonomies and controlled vocabularies, such as the Library of
Congress Subject Headings or DDC. Additionally, experimenting with open-source LLMs to
compare their performance with that of ChatGPT will also be pursued.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>
        This research was funded by the European Union, grant agreement number 101132163 (ATRIUM
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT 4o for grammar and
spelling checking. After using this tool, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
      <p>[11] T. Kuzman, I. Mozetič, N. Ljubešić, ChatGPT: Beginning of an End of Manual Linguistic Data
Annotation? Use Case of Automatic Genre Identification, (2023). doi:10.48550/arXiv.2303.03953.</p>
      <p>[12] F. Huang, H. Kwak, J. An, Is ChatGPT better than human annotators? Potential and limitations
of ChatGPT in explaining implicit hate speech, in: Companion Proceedings of the ACM Web
Conference (2023) 294-297. doi:10.48550/arXiv.2302.07736.</p>
      <p>[13] F. Gilardi, M. Alizadeh, M. Kubli, ChatGPT outperforms crowd workers for text-annotation
tasks, Proceedings of the National Academy of Sciences 120.30 (2023).</p>
      <p>[14] S. Zhang, M. Wu, X. Zhang, Utilising a Large Language Model to Annotate Subject
Metadata: A Case Study in an Australian National Research Data Catalogue, (2023).
doi:10.48550/arXiv.2310.11318.</p>
      <p>[15] M. Chelli, J. Descamps, V. Lavoué, C. Trojani, M. Azar, M. Deckert, J. Raynier, G. Clowez, P.
Boileau, C. Ruetsch-Chelli, Hallucination Rates and Reference Accuracy of ChatGPT and Bard
for Systematic Reviews: Comparative Analysis, J Med Internet Res (2024). doi:10.2196/53164.</p>
      <p>[16] W. Li et al., Citation-Enhanced Generation for LLM-based Chatbot, (2024).
doi:10.48550/arXiv.2402.16063.</p>
      <p>[17] T. Gao, H. Yen, J. Yu, D. Chen, Enabling Large Language Models to Generate Text with
Citations, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, Singapore, 2023, pp. 6465–6488.</p>
      <p>[18] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song, M. Chadwick, N. McAleese, Teaching
language models to support answers with verified quotes, (2022). doi:10.48550/arXiv.2203.11147.</p>
      <p>[19] SKOS ontology. URL: https://www.w3.org/2009/08/skos-reference/skos.html.</p>
      <p>[20] Keyword translation code. URL: https://github.com/atrium-research/T3.4.1_KeywordsTranslation.</p>
      <p>[21] Groq. URL: https://groq.com/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kulczycki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C. E.</given-names>
            <surname>Engels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pölönen</surname>
          </string-name>
          ,
          <article-title>Multilingualism of social sciences</article-title>
          , in: E. Kulczycki, T. C. E. Engels (Eds.), Handbook on Research Assessment in the Social Sciences, Edward Elgar Publishing, Cheltenham, UK,
          <year>2022</year>
          , pp.
          <fpage>350</fpage>
          -
          <lpage>366</lpage>
          . doi:10.4337/9781800372559.00031.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kulczycki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pölönen</surname>
          </string-name>
          , et al.,
          <article-title>Multilingual publishing in the social sciences and humanities: A seven-country European study</article-title>
          ,
          <source>J. Assoc. Inf. Sci. Technol</source>
          .
          <volume>71</volume>
          (
          <year>2020</year>
          )
          <fpage>1371</fpage>
          -
          <lpage>1385</lpage>
          . doi:10.1002/asi.24336.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kancewicz-Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pölönen</surname>
          </string-name>
          ,
          <article-title>Does excellence have to be in English? Language diversity and internationalisation in SSH research evaluation</article-title>
          ,
          <year>2020</year>
          . URL: https://enressh.eu/wpcontent/uploads/2017/09/OverviewPeerReviewENRESSH-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Delfim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Angelaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bertino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>OPERAS Multilingualism White Paper</article-title>
          , (
          <year>2018</year>
          ). doi:10.5281/zenodo.1324026.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>GoTriple Discovery Platform</article-title>
          . URL: https://gotriple.eu.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Santis</surname>
          </string-name>
          ,
          <source>TRIPLE Deliverable: D2.5 - Report on Data Enrichment</source>
          , (
          <year>2022</year>
          ). doi:10.5281/zenodo.7359654.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Santis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Giacomi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Agosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Homo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ardizzone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Lamata</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <source>TRIPLE Deliverable D4.4 Technical and User Documentation for the TRIPLE system</source>
          , (
          <year>2023</year>
          ). doi:10.5281/zenodo.7708784.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>[8] eTranslation, The European Commission's Machine Translation system</article-title>
          . URL: https://commission.europa.eu/resources-partners/etranslation_en.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <article-title>ATRIUM Project website</article-title>
          . URL: https://atrium-research.eu/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pedinotti</surname>
          </string-name>
          ,
          <article-title>LLM-Powered Mapping of Keywords of a Research Article to Linked Data, SSH Open Marketplace workflow</article-title>
          ,
          <year>2024</year>
          . URL: https://marketplace.sshopencloud.eu/workflow/rEet9L.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>