<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large Language Models for Research Data Management?! 2025 (LLMs4RDM 2025)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Magnus Bender</string-name>
          <email>magnus@mgmt.au.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sylvia Melzer</string-name>
          <email>sylvia.melzer@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Möller</string-name>
          <email>ralf.moeller@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Thiemann</string-name>
          <email>stefan.thiemann@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aarhus University, Center for Contemporary Cultures of Text</institution>
          ,
          <addr-line>Jens Chr. Skous Vej 4, 8000 Aarhus C</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Aarhus University, Department of Management</institution>
          ,
          <addr-line>Fuglesangs Allé 4, 8210 Aarhus V</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Hamburg, Center for Sustainable Research Data Management</institution>
          ,
          <addr-line>Monetastraße 4, 20146 Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Hamburg, Centre for the Study of Manuscript Cultures (CSMC)</institution>
          ,
          <addr-line>Warburgstraße 26, 20354 Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Hamburg, Institute for Humanities-Centered AI (CHAI)</institution>
          ,
          <addr-line>Warburgstraße 28, 20354 Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Research data management (RDM) has become an important discipline that enables researchers to effectively organise, preserve and share their research results. RDM is a recent development that aims to prepare researchers for the future by building on the principles of open science. It utilises innovative approaches such as generative artificial intelligence (genAI), powered by large language models (LLMs), to complement traditional research methods. As data-driven research becomes increasingly complex, researchers often have to spend considerable time learning how to manage, analyse and interpret large amounts of information. Traditional data literacy training can be time-consuming and does not always keep pace with evolving technologies and methods of analysis. Foundation models based on generative AI offer the potential to streamline this learning process. By automating data pre-processing, pattern recognition and even hypothesis generation, these models can lower the technical barriers to entry, allowing researchers to focus more on insights and discovery rather than on mastering data skills. The objective of this workshop is an exchange of perspectives on the implementation of novel RDM approaches, with or without LLMs, both past and prospective, in research and practice.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.1. Workshop Organisation</title>
      <p>The LLMs4RDM 2025 Workshop was held as part of the INFORMATIK Festival 2025 (55th Annual Conference of the German Informatics Society) on September 18, 2025, in Potsdam, Germany.</p>
      <sec id="sec-1-1">
        <title>1.1.1. Organisers</title>
        <list list-type="bullet">
          <list-item><p>Magnus Bender, Aarhus University, Denmark</p></list-item>
          <list-item><p>Sylvia Melzer, University of Hamburg, Germany</p></list-item>
          <list-item><p>Ralf Möller, University of Hamburg, Germany</p></list-item>
          <list-item><p>Stefan Thiemann, University of Hamburg, Germany</p></list-item>
        </list>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Programme Committee of LLMs4RDM 2025</title>
        <list list-type="bullet">
          <list-item><p>Thomas Asselborn, University of Hamburg, Germany</p></list-item>
          <list-item><p>Magnus Bender, Aarhus University, Denmark</p></list-item>
          <list-item><p>Mahdi Jampour, University of Hamburg, Germany</p></list-item>
          <list-item><p>Sylvia Melzer, University of Hamburg, Germany</p></list-item>
          <list-item><p>Stefan Thiemann, University of Hamburg, Germany</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>1.3. Overview of papers</title>
      <p>One keynote and five papers were presented at the workshop.</p>
      <p>The keynote focuses on the crucial next step in Research Data Management (RDM), advocating for a transition from simple data immersion to structured scientific argumentation. The speaker presents historical examples, such as the 3D reconstruction of the theatre of Miletus, to illustrate how researchers formulate hypotheses about past human decisions, pointing out that these visualisations can be based either on empirical data or on imaginary 3D data generated via language models. The central argument is that RDM systems must evolve to support the formal representation of these scientific arguments so that they are machine-processable and the data used can be verifiably represented. The keynote calls for new RDM systems that not only host diverse data but also enable researchers to use these resources directly for the development and validation of formal scientific hypotheses.</p>
      <p>The first paper, Large Language Models in Labor Market Research Data Management: Potentials and Limitations, presents the application of LLMs within research data management (RDM), focusing specifically on tasks related to occupational data and labor market text interpretation. Through empirical studies, the researchers determined that LLMs struggle with the automated classification of job titles, often producing results that were less reliable and reproducible than those generated by traditional machine learning classifiers. LLM tests using hermeneutical methods produced fundamentally inconsistent and unstable interpretations. The authors argue that LLMs are inadequate for tasks demanding methodological rigor or scientifically defensible classification due to their lack of consistency and interpretative depth, and should therefore be used only as assistive tools for preliminary support functions.</p>
      <p>The second paper, Challenges in Automatic Speech Recognition in the Research on Multilingualism, examines the significant challenges faced when applying Automatic Speech Recognition (ASR) technology, specifically the Whisper model, to complex spoken data collected for multilingualism research. The authors note that while commercial applications require clean, monolingual transcripts, linguistic studies require highly accurate transcriptions that capture every acoustic detail, including speech disorders and complex switching between languages. Using Polish-German bilingual recordings from the LangGener corpus, the study identifies key shortcomings in ASR output, such as hallucinations and a problematic tendency towards code unification, which mis-transcribes or mis-translates embedded language elements.</p>
      <p>The third paper, Improving Accessibility and Reproducibility by Guiding Large Language Models, proposes a method that combines general-purpose large language models (LLMs) with specialized research data stored in Research Data Repositories (RDRs) by leveraging the expert knowledge of the data creators. The core innovation is the interpretation prompt, a field added during the data upload process that allows the expert data creator to provide specific instructions. When a user queries the RDR’s LLM chatbot, this expert-generated prompt is prepended to the user’s query, effectively guiding the LLM toward a project-specific understanding. The authors demonstrate that these prompts result in more accurate, tailored responses by focusing the LLM’s output, improving data accessibility and utility. Furthermore, the interpretation prompt facilitates automated reproducibility of research experiments by instructing the LLM to execute relevant algorithms or code associated with the data entry.</p>
      <p>The fourth paper, Talk to your Database: An open-source in-context Learning Approach to interact with Relational Databases through LLMs, presents an open-source large language model (LLM) framework designed to solve the Text-to-SQL problem through in-context learning. The researchers compared the performance of this method against a simpler default prompting technique using a PostgreSQL database. The results decisively show that in-context learning improved accuracy, boosting successful query execution rates from approximately 35% to over 85%.</p>
      <p>The fifth paper, Verbalisation Process of a RAG-Based Chatbot to Support Tabular Data Evaluation for Humanities Researchers, presents the verbalization process of a RAG-based chatbot (ChatHA) engineered to support tabular data evaluation for humanities researchers. The core motivation for this research is to enable scholars to conduct free-form and semantic searches on structured data, moving beyond the limitations of simple string matching. To address the need for verbalizing database entries into natural language, the authors propose a hybrid verbalization method that minimizes the computational cost and risk of hallucination associated with LLMs.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Presentations</title>
      <p>Abstracts and presentations are available at: https://doi.org/10.25592/uhhfdm.17955</p>
      <sec id="sec-3-1">
        <title>Magnus Bender</title>
        <p>Aarhus University, Denmark
Welcome</p>
      </sec>
      <sec id="sec-3-2">
        <title>Ralf Möller</title>
        <p>University of Hamburg, Germany
Keynote</p>
      </sec>
      <sec id="sec-3-3">
        <title>Jens Dörpinghaus (1,2,3), Michael Tiemann (1,2)</title>
        <p>1: University of Koblenz, Germany; 2: Federal Institute for Vocational Education and Training (BIBB), Germany; 3: Linnaeus University, Sweden
Large Language Models in Labor Market Research Data Management: Potentials and Limitations</p>
      </sec>
      <sec id="sec-3-4">
        <title>Edyta Jurkiewicz-Rohrbacher (1,2), Thomas Asselborn (2)</title>
        <p>1: University of Regensburg, Germany; 2: University of Hamburg, Germany
Challenges in Automatic Speech Recognition in the Research on Multilingualism</p>
      </sec>
      <sec id="sec-3-5">
        <title>Florian Marwitz, Marcel Gehrke</title>
        <p>University of Hamburg, Germany
Improving Accessibility and Reproducibility by Guiding Large Language Models</p>
      </sec>
      <sec id="sec-3-6">
        <title>Maximilian Plazotta, Meike Klettke</title>
        <p>University of Regensburg, Germany
Talk to your Database: An open-source in-context Learning Approach to interact with Relational Databases through LLMs</p>
      </sec>
      <sec id="sec-3-7">
        <title>Thomas Asselborn (1), Magnus Bender (2), Florian Marwitz (1), Ralf Möller (1), Sylvia Melzer (1)</title>
        <p>1: University of Hamburg, Germany; 2: Aarhus University, Denmark
Verbalisation Process of a RAG-Based Chatbot to Support Tabular Data Evaluation for Humanities Researchers</p>
      </sec>
      <sec id="sec-3-8">
        <title>Magnus Bender</title>
        <p>Aarhus University, Denmark
Farewell</p>
      </sec>
      <sec id="sec-3-9">
        <title>2.1.1. Acknowledgments</title>
        <p>The organisers of the LLMs4RDM 2025 workshop would like to thank the organisers of the INFORMATIK Festival conference in Potsdam for their excellent support. We would also like to thank the members of the programme committee for their help in carefully evaluating and selecting the submitted papers, and all participants of the workshop for their contributions. We hope that new inspirations and collaborations between the contributing disciplines will emerge from this workshop.</p>
      </sec>
      <sec id="sec-3-10">
        <title>Funding Information</title>
        <p>This contribution was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2176 ‘Understanding Written Artefacts: Material, Interaction and Transmission in Manuscript Cultures’, project no. 390893796. The research was mainly conducted within the scope of the Centre for the Study of Manuscript Cultures (CSMC) at the University of Hamburg.</p>
        <p>This contribution was also partially funded by the Danish National Research Foundation (DNRF193) through TEXT: Centre for Contemporary Cultures of Text at Aarhus University.</p>
      </sec>
      <sec id="sec-3-11">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used DeepL for grammar and spelling checking. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>