<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/s11604</article-id>
      <title-group>
        <article-title>Verbalisation Process of a RAG-Based Chatbot to Support Tabular Data Evaluation for Humanities Researchers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Asselborn</string-name>
          <email>thomas.asselborn@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Magnus Bender</string-name>
          <email>magnus@mgmt.au.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Andreas Marwitz</string-name>
          <email>florian.marwitz@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Möller</string-name>
          <email>ralf.moeller@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sylvia Melzer</string-name>
          <email>sylvia.melzer@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aarhus University, Department of Management</institution>
          ,
          <addr-line>Fuglesangs Allé 4, 8210 Aarhus V</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Hamburg, Centre for the Study of Manuscript Cultures (CSMC)</institution>
          ,
          <addr-line>Warburgstraße 26, 20354 Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Hamburg, Institute of Humanities-Centered Artificial Intelligence (CHAI)</institution>
          ,
          <addr-line>Warburgstraße 28, 20354 Hamburg</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Scholars have access to large amounts of data and publications stored in RDRs (Research Data Repositories). LLMs (Large Language Models) can efficiently work with textual data. However, since LLMs are pretrained and have a limited context window, they cannot work with large amounts of text directly. The standard approach here is RAG (Retrieval Augmented Generation), where an embedding space is built for the text corpus; during answering, the most suitable texts are retrieved and provided to the LLM as context. However, data in tables is not evaluated correctly, because the embedding treats the tabular data as plain text and thus fails to model its semantics, i.e., the context of the tabular data. In this article, we show how tabular data can be used in a RAG-like approach: i) a static cloze text is generated and then modified once by an LLM, and ii) presented to the scholar for possible modifications. Afterwards, iii) the whole data set is verbalised according to the cloze text and is therefore iv) usable for RAG. In particular, step iii) is crucial for our system, as it adds the missing context to the data. Our feasibility study shows how to efficiently build a chatbot over a large amount of structured data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Research in the humanities often involves textual data and frequently generates new textual content.
This data is not just plain text but is often integrated into more sophisticated visualisations, such as
“tabular-based” information systems. For example, imagine a table visualisation: Each row represents the
results of humanities research for a particular object of interest, e.g., an ancient artefact. Each column
contains the results of an analysis focusing on a particular aspect of the artefact, such as its date and
place. Exemplary information systems are presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A published information system that
visualises tabular-like results constitutes an important part of the research output in the humanities.
Consequently, other scholars may analyse and combine different results from such systems for their
own research.
      </p>
      <p>However, efficiently accessing and utilising such research outputs can be challenging due to the
vast amount of available information. This is where modern chatbots based on LLMs (Large Language
Models) come into play: they can assist scholars during their research work, e.g., by enhancing
accessibility to information. Scholars can interact with research outputs of their own and those of other
scholars in their field without being required to read every detail. Thus, with LLMs, many documents
become easily accessible to scholars. To facilitate this, Asselborn et al. present ChatHA (Humanities
Aligned Chatbot), a chatbot capable of using scholar-defined resources for answering questions [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
However, ChatHA requires a corpus of text documents to work with and cannot contextualise tabular
data from the information systems described above. ChatHA can extract insights from published papers,
but the likewise important insights from information systems cannot be accessed. Therefore,
this paper presents an approach to make visualisations of table-like results from information systems
accessible for LLM-based chatbots like ChatHA.</p>
      <p>Large Language Models for Research Data Management?! (LLMs4RDM), Workshop at the Conference of the German Informatics. CEUR Workshop Proceedings, ISSN 1613-0073.</p>
      <p>Given tabular data, the data itself consists mostly of non-continuous text fragments, and the actual
format differs heavily between information systems and research projects. Thus, building a large
database combining all the data from different projects, information systems, and RDRs (Research Data
Repositories) is not easily possible. As a solution, this paper contributes a technique to incorporate
non-continuous textual data from different sources into ChatHA. ChatHA is a conversational agent
built with LLMs and tailored to assist and engage in humanities research and education. Our proposed
technique takes a set of table-like files as input and prepares them for use by i) generating a static cloze
text with an LLM and ii) presenting it to the scholar for possible modifications. Afterwards, iii) the whole
data set is verbalised according to the static cloze text and is therefore iv) usable in a RAG (Retrieval
Augmented Generation)-based chatbot such as ChatHA.</p>
      <p>The remainder of this paper is structured as follows: First, we look at related work. Second, we recap
the basics of LLMs and ChatHA. Afterwards, we formalise the problem of presenting non-continuous
text to LLMs and introduce our solution. Next, we put our approach in relation to humanities research
and integration with RDR systems and demonstrate its capabilities in a feasibility study. Finally, we
discuss challenges and limitations and conclude.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        LLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have changed the way the humanities work. Scholars are increasingly approaching text
analysis, historical documents and the retrieval of bibliographies differently. Traditional computational
approaches in the humanities have relied on rule-based methods and statistical techniques, but LLMs
offer deeper, more context-aware interpretations. Models such as OpenAI’s GPT-4, Google’s Gemini and
Meta’s LLaMA have been used in literature analysis, semantic search and historical text reconstruction.
These models allow researchers to uncover hidden patterns and create summaries that give a first
impression of what the data is about. Studies have demonstrated the capabilities of LLMs in the context
of the humanities, such as those presented in [5, 6, 7].
      </p>
      <p>Before the advent of chatbots that offer different LLMs for answering queries, text mining in the
humanities relied on NLP (Natural Language Processing) techniques such as topic modelling (LDA) [8]
and named entity recognition (NER) [9]. While these approaches were effective, they required extensive
pre-processing and lacked the sophisticated understanding offered by modern LLMs. Modern research
utilises transformer-based models that can perform zero-shot and few-shot learning, allowing them
to be adapted to different research questions without needing large annotated datasets. In addition,
advances in machine translation and multilingual models have facilitated the study of historical texts
that were previously inaccessible due to language barriers.</p>
      <p>NotebookLM [10], developed by Google, represents a novel approach to research support by
integrating LLMs into a structured, interactive notebook environment. Unlike generic LLM interfaces,
NotebookLM allows users to upload research documents – such as PDFs – and interact with them
dynamically. NotebookLM can prove its worth in humanities research for digital archiving and manuscript
analysis. The ability to interact with source materials in real time allows researchers to create structured
overviews of complex texts, compare versions of historical documents and retrieve information with
greater specificity. In addition, the ability to synthesise multiple sources supports interdisciplinary
research by connecting ideas from different fields. However, despite these advantages, NotebookLM is
not immune to the problem of hallucinations, where it may generate misleading or incorrect information.
Such inaccuracies can be particularly problematic in humanities research, where the authenticity
and precision of historical data are crucial. Therefore, while NotebookLM can be a powerful tool for
researchers, it requires careful cross-referencing with original sources and human oversight to ensure
the validity of its results. The same applies to other chatbots that use LLMs. Therefore, an approach is
needed that verbalises the data in a way that ensures the highest possible degree of precision.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries and Enhancing ChatHA</title>
      <p>
        Using LLMs in the humanities offers various advantages: Scholars can interact with their own research
publications or with publications from other scholars in their field. In this section, we briefly recap the inner
workings of ChatHA [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>We split the description into an offline preparation step and the online query-answering chatbot.
In the preparation step, the scholar must provide the documents ChatHA should use for answering
questions. These documents can originate from different sources, e.g., multiple text documents in
different RDRs. However, ChatHA assumes a corpus of natural language text documents.
Then, the documents are pre-processed to provide a search index usable by the chatbot. In the online
phase, the chatbot receives a query, extracts the most likely search results from the documents and
provides an answer based on them. Additionally, ChatHA provides citations indicating which exact text
locations were used for which part of the answer.</p>
      <p>In the preparation step, the selected documents are embedded into an embedding space. Later,
ChatHA uses the embedding space to find the closest paragraphs next to the query asked.</p>
      <p>In the online phase, ChatHA receives a query. First, the closest paragraphs to the query are extracted
using the embedding space. Second, the paragraphs are provided as additional context for a base model,
which can be any available LLM. Third, the returned answer is post-processed by checking which
output sentence matches which paragraph in the embedding space. The findings are then displayed so that the
scholar can see which sentence in the answer originates from where in the initially provided documents.</p>
      <p>The use of an embedding space for augmenting an LLM’s input with additional context makes
ChatHA a RAG approach [11]. RAG is widely used to combine custom context and data with the general
text generation and question-answering capabilities of LLMs. Moreover, with LLMs, a large number of
documents becomes accessible to scholars, supporting interdisciplinary insights.</p>
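      <p>The augmentation idea behind RAG can be illustrated with a small Python sketch. The prompt format and the function name are hypothetical illustrations, not ChatHA’s actual implementation:</p>

```python
def augment_query(query: str, passages: list[str]) -> str:
    """Build an augmented LLM prompt from retrieved passages (illustrative format)."""
    context = "\n".join("- " + p for p in passages)
    return ("Context:\n" + context
            + "\n\nQuestion: " + query
            + "\nAnswer based only on the context above.")

# The retrieved passages would come from the embedding-space search described above.
prompt = augment_query(
    "Who is the student with matriculation number 12?",
    ["My student ID is 12. My first name is Harald and my last name is Toepfer."],
)
print(prompt)
```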
      <p>However, while ChatHA provides an easy-to-use way to provide documents as a searchable resource
for LLMs, it still lacks the ability to search across actual non-continuous research data, e.g., table-like
data from spreadsheets. The requirement of having corpora of natural language text documents mostly
comes from using RAG: The LLM input is augmented with excerpts from the corpus of text documents,
and based on these excerpts, the LLM needs to infer the context and generate a response. Hence,
augmenting the input with non-continuous data prevents the LLM from getting enough context and
results in faulty responses.</p>
      <p>On the other hand, ChatHA is required to work with data from various sources, i.e., different projects,
information systems, and RDRs, which do not share an equal representation or schema. In such diverse
environments, it is not possible to manually define rules for transforming the data and integrating it
into one large data set. Hence, ChatHA needs to transform the diverse data automatically to be used
with RAG.</p>
      <p>In the next section, we explain how we modify ChatHA to be able to fully work with continuous and
non-continuous text documents as a searchable resource for answering queries.</p>
    </sec>
    <sec id="sec-4">
      <title>4. ChatHA’s Processes and Application</title>
      <sec id="sec-4-1">
        <title>4.1. Process Overview</title>
        <p>In the following, ChatHA’s new processes and the application are described.</p>
        <p>Research data in the humanities can be not only in an unstructured format like prose text but also in
structured formats like JSON, CSV, or a database. Because the standard version of RAG works best
with text, some form of transforming structured data into text is needed. The first idea is to
take every entry and use an LLM to generate a unique text per entry. While this approach may work
well, it has the disadvantage of taking a lot of resources during generation in terms of time, money and
electricity. Integrating this approach into a shared environment such as an RDR may present certain
challenges. To address this, we propose a slightly modified approach. Fig. 1
illustrates the process, which consists of the following steps:
1. Input Files: A researcher begins by selecting an input file (e.g., a JSON or CSV file) that represents
the dataset, or one is chosen randomly.
2. Generate Static Cloze Text: The process then generates a static cloze text by recursively extracting
the JSON keys (or the CSV’s column names and foreign keys) from the input file and incorporating
them into the resulting text.
3. Refine Cloze Text Using an LLM: Since the initial static cloze text may not be naturally readable,
a user-selected LLM, such as Llama 3 or Gemma, can refine it. This refinement is performed once
per structure rather than for each JSON instance (CSV row), optimising resource usage.
4. Show Cloze Text and Allow Modification: The generated cloze text is presented to the user,
who can modify or accept it as needed.
5. Apply to All Files: For each file, the cloze text is filled with the data from that file.
6. Output: Verbalised Input Files: Finally, the filled-in cloze texts are stored for further processing
as verbalised output files.</p>
        <p>This process of generating and filling cloze texts can now be used in a standard RAG pipeline, as seen
in Fig. 2:
1. Input Files: Instead of prose text, there is structured data at the beginning of the pipeline, like
CSV or JSON files.
2. Verbalise Input Files: The files are verbalised according to the pipeline described in Fig. 1.
3. Generate Embeddings: From the generated verbalised texts, embeddings are computed as for
standard RAG, using an embedding function of choice, like SentenceBERT [12].</p>
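        <p>Step 2, generating the static cloze text from the structure, can be sketched in Python as follows. This is an illustrative reconstruction of the recursive key extraction, mirroring the template shown later, and not the authors’ actual code:</p>

```python
def build_cloze(record: dict, prefix: str = "") -> str:
    """Recursively turn JSON keys into a static cloze text with {{...}} placeholders."""
    parts = []
    for key, value in record.items():
        slot = prefix + key
        if isinstance(value, dict):
            # Nested objects contribute their keys with an underscore-joined prefix.
            parts.append(build_cloze(value, slot + "_"))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # Lists of objects become one sentence enumerating the item keys.
            inner = ", ".join(k + ": {{" + slot + "_" + k + "}}" for k in value[0])
            parts.append("The field '" + key + "' is a list where each item contains: " + inner + ".")
        else:
            parts.append("The value for '" + key + "' is {{" + slot + "}}.")
    return " ".join(parts)

record = {"Matr_Nr": 12, "SemesterID": [{"Year": 2024, "Semester": "WS"}], "Subject": "Data Science"}
print(build_cloze(record))
# The value for 'Matr_Nr' is {{Matr_Nr}}. The field 'SemesterID' is a list where each item
# contains: Year: {{SemesterID_Year}}, Semester: {{SemesterID_Semester}}. The value for
# 'Subject' is {{Subject}}.
```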
        <sec id="sec-4-1-15">
          <title>Runtime Pipeline (Fig. 3)</title>
          <p>During the runtime of the chatbot, the process also largely follows the standard RAG approach. The
full process is outlined in Fig. 3.</p>
          <p>1. Input: User Query: The user query is taken as the input to this pipeline.
2. Get Top N Results from Vector DB: The user query is embedded using the same embedding
function as during the RAG generation. This query embedding is then used to retrieve the top N entries, N ∈ ℕ.
The exact value for N can be chosen by the developer of the chatbot as deemed appropriate. The top
N elements can be retrieved based on some distance metric, e.g., the cosine similarity.
3. Augment Query: The original query is then augmented using the retrieved entries from the
vector database.
4. LLM Generate Answer: The retrieved information can then be used by the LLM to answer the
question grounded in the information provided by the user-specific dataset.
5. Get Citations Using Vector DB: After the LLM has answered the query, the retrieval step is
performed again, using the generated output. This generates references to the dataset to indicate
which entries from the vector database most likely backed the generated text.
6. Combine to Final Output: The citation markers are added to the output from the LLM to produce
the final output shown to the user.</p>
          <p>4.2. Application</p>
          <p>The method of first verbalising structured files and then running a standard RAG pipeline was evaluated
using a small, synthetic dataset. It consists of 12 entries with student data, which include, e.g.,
first and last names, matriculation numbers, semester of study and subject. The information regarding each
student is stored in a separate JSON file. The verbalisation of the JSON files and the generation of the
embeddings were performed on a standard MacBook Pro M3 with 16 GB of RAM. The smoothing of
the static cloze text was performed using the Gemma 2 language model with 9 billion parameters,
quantised to 4 bit1. This model is small enough to run on a consumer laptop while providing good
results. As an embedding function, SentenceBERT [12] was used, which is specifically designed for
embedding full sentences and paragraphs, and the results were then stored in FAISS [14]. The static
cloze text, generated with the recursive method, for the synthetic dataset looks as follows. The values
in {{ }} are the variables:</p>
          <p>“The value for ’Matr_Nr’ is {{Matr_Nr}}. The value for ’First_Name’ is {{First_Name}}. The value
for ’Last_Name’ is {{Last_Name}}. The field ’SemesterID’ is a list where each item contains:
SemesterID: {{SemesterID_SemesterID}}, Year: {{SemesterID_Year}}, Semester: {{SemesterID_Semester}}. The
field ’Building’ is a list where each item contains: Building: {{Building_Building}}, Location:
{{Building_Location}}, Address: {{Building_Address}}, PLZ: {{Building_PLZ}}, City: {{Building_City}}, Country:
{{Building_Country}}. The value for ’Subject’ is {{Subject}}.”</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Application Example</title>
        <p>1 https://huggingface.co/bartowski/gemma-2-9b-it-GGUF</p>
        <p>Further processing with Gemma 2 then gives, for example, the following text, which will then be
filled out with the concrete values from the JSON files:</p>
        <p>“My student ID is {{Matr_Nr}}. My first name is {{First_Name}} and my last name is {{Last_Name}}.
I’m currently in the {{SemesterID_Semester}}, {{SemesterID_Year}} semester. My building is
{{Building_Building}}, located at {{Building_Address}}, {{Building_PLZ}} {{Building_City}}, {{Building_Country}}.
The subject I’m studying is {{Subject}}.”</p>
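        <p>Filling the refined cloze text with the values of each file (step 5 of Fig. 1) might look like the following sketch. The flattening convention mirrors the underscore-joined key names in the template above; it is an assumption for illustration, not the authors’ implementation:</p>

```python
import re

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested JSON into the underscore-joined key names used in the template."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            flat.update(flatten(value[0], name + "_"))  # first item, as in the template
        else:
            flat[name] = str(value)
    return flat

def fill_cloze(template: str, record: dict) -> str:
    """Replace every {{slot}} in the template with the record's value, if present."""
    flat = flatten(record)
    return re.sub(r"\{\{(\w+)\}\}", lambda m: flat.get(m.group(1), m.group(0)), template)

template = "My student ID is {{Matr_Nr}}. The subject I'm studying is {{Subject}}."
print(fill_cloze(template, {"Matr_Nr": 12, "Subject": "Data Science"}))
# My student ID is 12. The subject I'm studying is Data Science.
```

        <p>Slots without a matching value are left untouched, so gaps in a file remain visible in the verbalised text.</p>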
        <p>A few test queries have been tried using this method. As our base LLM for answering questions,
we have used Llama 3.1 with 70 billion parameters, quantised to 4 bit2, because this model provides a
good balance between overall quality and, due to its quantisation, moderate resource
requirements. The execution of the LLM on the queries was performed on a server with an Nvidia
DGX L40s GPU.</p>
        <p>• Query: Are there students studying Data Science?
• Answer: Yes, according to the provided information, there are students studying Data Science.
(Citation: File 7) Specifically, the document mentions that the subject of study for some individuals
(including the ones with the first name Erna and last name Pfeifer, and the one studying at
Audimax with the matriculation number 12 is not specified but another individual has Data
Science as their subject of study) is Data Science, and they are located in Hamburg, Germany.
(Citation: File 6)</p>
        <p>This answer is factually correct, but it may lack information regarding, e.g., the student with
matriculation number 12. Thus, one follow-up question may be:
• Query: Who is the student with matriculation number 12?
• Answer: The student with matriculation number 12 is Harald Töpfer. (Citation: File 11)
As before, this answer is factually correct, and due to the mention of the specific file the data comes
from, it is verifiable by the user.</p>
        <p>Analogously, questions about different attributes, like the location or year of study, can be asked,
with the answers being similar to those presented here.</p>
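        <p>Underlying all of these answers is the top-N retrieval step of Fig. 3. A minimal, dependency-free sketch of cosine-similarity ranking over toy vectors could look as follows; the actual system uses SentenceBERT embeddings stored in FAISS, so this is only an illustration of the ranking idea:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n(query_vec, doc_vecs, n=3):
    """Indices of the n verbalised texts closest to the query embedding."""
    order = sorted(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return order[:n]

# Toy 2-dimensional "embeddings" standing in for SentenceBERT vectors.
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_n([1.0, 0.1], docs, n=2))
# [0, 1]
```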
        <p>This method works well with questions where the answer is expected to be prose text. In Section 4.3,
more details are provided regarding the challenges of this method.</p>
        <p>The feasibility study showed that the results are correct, although they may not always be complete.
This is sufficient for getting an overview of the dataset at hand. Additionally, generating the cloze texts
and filling them with the data is resource-saving compared to verbalising all entries individually. This
makes the approach suitable for use on a server with moderate hardware and thus allows the process to be
integrated into an RDR with all processing performed on local hardware.</p>
        <p>4.3. Challenges and Limitations</p>
        <p>This method works well with questions related to one specific entry of the dataset or at least only
a small subset of entries. Additionally, it works best when the answer is not expected to be of an exact
quantitative nature, like “How many?” or “How often?” Answering such a query requires that all data
records are evaluated, not just a specific entry of a data record in a JSON file or a row in a CSV file,
which is the issue here. This would require an extension of the processes, which could be considered in
future work.</p>
        <p>When working with texts and data from the humanities, nuances in the language can be important.
This may lead to misunderstandings between the human and the LLM. One such example is the following:
The user asks whether a consul is mentioned in the dataset. The dataset may contain an entry where
the word consul appears, but not as a complete word. Based on the context, the humanities
researcher creating this entry wrote consul[tatio] in this entry. The LLM may or may not detect that
consultatio is meant instead of consul and may, thus, incorrectly return this entry.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Model Source</title>
        <p>2 https://huggingface.co/MaziyarPanahi/Meta-Llama-3.1-70B-Instruct-GGUF</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Outlook</title>
      <p>LLM-based chatbots are excellent for processing textual data. However, such chatbots have limitations
in processing large amounts of structured data, e.g., tabular data, which poses a challenge for
retrieval-based approaches. Our proposed processes, integrated into the RAG-based chatbot ChatHA,
address this gap by converting tabular data into a structured text format once, in a verbalisation
process. By integrating a static cloze text generated and refined with scholarly input, we ensure that the
contextual meaning of the data is preserved, making it suitable for RAG-based retrieval. Our feasibility
study demonstrates the effectiveness of this approach in creating a chatbot capable of processing large
structured datasets, opening the way for improved interaction with RDRs.</p>
      <p>While ChatHA with the new verbaliser process works well with questions where the answers are
expected to be prose text, it has limited capabilities for answering questions where the answer is
quantitative. One such question could be: “How many students are at the university in total?” These
types of questions could easily be answered by querying an SQL database with the necessary queries.
Thus, in the future, we plan to extend the capabilities of ChatHA by incorporating a classifier that
classifies the questions and decides which method is best: using the embedded texts or querying a
database (or possibly both) to generate a good answer.</p>
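      <p>Such a routing classifier could start as simply as a keyword heuristic. The following sketch is purely hypothetical; the planned classifier may well be a learned model instead:</p>

```python
# Cue phrases suggesting a quantitative question (illustrative, not exhaustive).
QUANT_CUES = ("how many", "how often", "count", "total", "average")

def route(question: str) -> str:
    """Naive router: quantitative questions go to the SQL backend, others to RAG."""
    q = question.lower()
    return "sql" if any(cue in q for cue in QUANT_CUES) else "rag"

print(route("How many students are in total at the university?"))  # sql
print(route("Who is the student with matriculation number 12?"))   # rag
```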
      <p>In addition, we plan to test such a chatbot for other application areas, such as the field of epigraphic
data from Asia Minor. The overarching goal is integrating a chatbot into the University of Hamburg’s
productive RDR system.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)
under Germany’s Excellence Strategy – EXC 2176 ‘Understanding Written Artefacts: Material,
Interaction and Transmission in Manuscript Cultures’, project no. 390893796. The research was conducted
within the scope of the Centre for the Study of Manuscript Cultures (CSMC) at the University of Hamburg.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DeepL for grammar and spelling checking.
After using this tool, the authors reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Peukert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Voges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Asselborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Melzer</surname>
          </string-name>
          ,
          <article-title>Humanities in the center of data usability: Data visualization in institutional research repositories</article-title>
          ,
          <source>in: Proceedings of the Workshop on Humanities-Centred Artificial Intelligence (CHAI@KI)</source>
          , volume
          <volume>3814</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>74</lpage>
          . URL: https://ceur-ws.org/Vol-3814/paper6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Asselborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Melzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aljoumani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Marwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <article-title>Fine-tuning BERT models on demand for information systems explained using training data from pre-modern Arabic</article-title>
          ,
          <source>in: Proceedings of the Workshop on Humanities-Centred Artificial Intelligence (CHAI@KI)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Asselborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Melzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schiff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Marwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aljoumani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thiemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <article-title>Building sustainable information systems and transformer models on demand</article-title>
          ,
          <source>Humanities and Social Sciences Communications</source>
          <volume>12</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Annepaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pakray</surname>
          </string-name>
          ,
          <article-title>Large language models: A survey of their development, capabilities, and applications</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>67</volume>
          (
          <year>2025</year>
          )
          <fpage>2967</fpage>
          -
          <lpage>3022</lpage>
          . doi:10.1007/s10115-024-02310-4.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>