<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large language models in requirements engineering for digital twins</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nico Blasek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karl Eichenmüller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bastian Ernst</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niklas Götz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Nast</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kurt Sandkuhl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jönköping University</institution>
          ,
          <addr-line>Gjuterigatan 5, 55111 Jönköping</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rostock University</institution>
          ,
          <addr-line>Albert-Einstein-Str. 22, 18059 Rostock</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Can large language models (LLMs) be used for digital twin engineering (DTE)? Engineering digital twins (DTs) is a complex process consisting of several phases and involving diferent disciplines. We argue that an investigation of LLM use in DTE has to define what kinds of DTs are in focus and what DTE phases shall be supported. In our work, we concentrate on the early phases of DTE, with a particular focus on requirements engineering (RE), and we focus on supervisory and operational DTs. This paper investigates the quality of LLM output for defining requirements for DTs. The main contributions of our work are results from an experiment comparing requirements to a DT of an air conditioning facility of a domain expert and ChatGPT and conclusions for prompt engineering resulting from this experiment.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Digital Twin Engineering</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>Requirements Engineering</kwd>
        <kwd>Air Conditioning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>concentrate on the early phases of DTE, with a particular focus on requirements engineering
(RE), and we focus on supervisory and operational DTs.</p>
      <p>More concretely, we aim to investigate the quality of LLM output for defining requirements
for DTs. For this purpose, we used one of the currently most popular LLMs, OpenAI’s
ChatGPT4, to elicit requirements for DTs. As our ambition is to explore issues relevant to future research,
we limited our study in a first step to only one application domain. The application domain
selected is air conditioning and ventilation systems. We operationalize quality as completeness
and correctness from the perspective of a domain expert.</p>
      <p>The main contributions of our work are results from an experiment comparing requirements
for a DT of an air conditioning facility of a domain expert and ChatGPT and conclusions for
prompt engineering resulting from this experiment.</p>
      <p>The paper is structured as follows: section 2 describes the background for our work from
digital twin engineering and large language models. Section 3 summarizes the research method,
and section 4 results of a literature study on related work. Section 5 describes the requirement
elicitation for DT of air conditioning systems and compares ChatGPT output with the view
of a domain expert. Section 6 discusses results and addresses the issue of prompt design. A
conclusion and implications for future work are given in section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical background</title>
      <sec id="sec-2-1">
        <title>2.1. Large language models</title>
        <p>
          LLMs belong to the broader category of deep learning models, address the area of natural
language processing, and are designed to interpret and generate human-like text. Essential
concepts of LLMs and their evolution have been widely documented, for example, in the
publication by Brown et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The most influential architecture in recent times for building
LLMs is the Transformer. It uses attention mechanisms [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to weigh the importance of diferent
words or tokens in a sequence when producing an output. LLMs are trained on vast amounts of
text data to be able to generate coherent and contextually relevant text across a wide range of
topics. For instance, models like OpenAI’s GPT series have been trained on books, articles, and
web pages. The pre-training of LLMs is supposed to be task-agnostic [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Capabilities of LLMs include tasks such as translation, question-answering, summarization,
and text generation without needing task-specific training data. One of the most important
features of models like GPT (Generative Pre-trained Transformer) is their ability to generate
coherent, diverse, and contextually relevant text over long passages. One of the currently most
popular LLMs, OpenAI’s GPT-4 with its Chatbot frontend - ChatGPT1 can also be used for
translation, grammar correction, or email composition [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. While LLMs are powerful, they can
sometimes produce incorrect or nonsensical answers, which often are termed “hallucinations”.
The use of LLMs starts from inputs (called prompts) stating the task to be completed by the LLM.
LLMs are sensitive to the input phrasing. Thus, prompt engineering and prompting methods
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] have developed into a critical topic of study for LLMs as they investigate the techniques by
which end-users can use LLMs to perform tasks.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Digital twin engineering</title>
        <p>
          An analysis of research on DTE showed various procedural and method approaches, for example,
discussed in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] or [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. While the specific procedures in DTE can vary depending on the
application domain, industry, and specific objectives, the following general steps in DTE are
visible in the literature:
• Scoping and Goal Definition: Setting the scope of DT development includes deciding
on the physical system, process, or entity that is the subject of the DT, i.e., the scope
of the project. Furthermore, there should be a clear understanding of what objective
to achieve, e.g., support of predictive maintenance, process optimization, simulation, or
other objectives. Defining clear goals is the first step.
• Requirements Engineering: In the RE phase, the way of data collection has to be
defined. This includes instrumentation, i.e., what sensors are in the physical entity to
collect relevant data. This could include sensors for temperature or pressure, cameras,
accelerometers, and more. If historical data must be considered, this should be defined.
        </p>
        <p>Furthermore, the functionality required to reach the goals has to be specified.
• Develop the Digital Representation: Modeling the physical entity might involve CAD
models, process diagrams, or other types of digital schematics. How to integrate the
model with real-time data feeds from sensors and other sources has to be part of this step.
• Calibration and Validation: The DT requires validation by comparing its outputs and
predictions to actual outcomes in the physical world. This typically includes refinement
of the model until its performance meets acceptable accuracy levels.
• Integration with Other Systems: DTs can be more efective when integrated with other
systems, such as enterprise resource planning (ERP) systems, manufacturing execution
systems (MES), or building management systems.
• Iterative Refinement: Over time, as more data becomes available and the physical
entity evolves or changes, the DT should be updated and refined to reflect these changes.
This requires the implementation of a feedback mechanism that continuously updates the
DT based on real-world data. This ensures the twin remains an accurate representation
of its physical counterpart.</p>
        <p>
          When engineering a DT, it’s essential to maintain a close alignment between the digital model
and the real-world entity it represents. Collaboration across disciplines, including domain
experts, IT professionals, data scientists, and engineers, is often crucial for successful DT
implementations. Furthermore, some scholars propose to distinguish diferent kinds of DT
according to their capabilities [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]:
• Supervisory DTs, sometimes also called digital shadows, allow the real-time monitoring
of the status and events of a physical object. For this purpose, the parts of the object to
be monitored, necessary information, and adequate sensing equipment must be known.
• Operational DTs add the possibility to perform certain operations on the physical object,
which can be used to control the operations. This requires actuators to initiate functional
changes and the knowledge about possible and required operations.
• Simulation DTs have the ability to simulate the physical object based on the developed
models. This type of DT can also carry out predictions to support design or operational
decisions.
• Intelligent DTs are supposed also to show learning abilities. They are supposed to learn
from operational data, for example, using machine learning (ML), which also realizes
some abilities of decision support and scenario planning.
• Autonomous DTs additionally have the ability to perform all decision-making based on
predefined parameters and manage the physical object concerned with minimal human
intervention.
        </p>
        <p>In this paper, we focus on RE for supervisory and operational DTs.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research method</title>
      <p>The starting point of our work is the questions presented in the introduction and the decision
to focus on the phase of RE. With this background, two research questions (RQ) were defined
for this paper:
• RQ 1: In the context of RE for DT, how consistent and complete is the output of ChatGPT
with the information provided by domain experts?
• RQ 2: How to design the prompt chain to improve the output of ChatGPT?
The overall research strategy for work presented in this paper is of an explorative nature, i.e.,
we aim to gather new knowledge by exploring the potential of ChatGPT use in RE of DTs. More
concretely, the work combines literature studies with quasi-experiments and
argumentativedeductive work.</p>
      <p>
        The literature search aimed to identify related work and results from other scholars to be
taken into account when investigating the potential of LLMs for DTE. For this step, we used
Kitchenham’s approach for systematic literature reviews (SLR). Kitchenham [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] suggests six
steps, which we briefly introduce in the following and document in detail in section 4. The first
step is to develop research questions (RQ) to be answered by the SLR. The process of paper
identification starts with defining the overall search space (step 2), which basically consists of
determining the literature sources to take into account in the light of the research questions.
Paper identification continues with the population phase (step 3). In this step, the search string is
developed and applied by searching the literature sources. Afterward, the step “paper selection”
follows by defining inclusion and exclusion criteria and a manual selection of relevant papers
found in the population phase (step 4). The data collection phase (step 5) focuses on extracting
the information relevant to answering the research question from the set of identified relevant
papers. The last step is the analysis of data and interpretation, i.e., to answer the research
question defined in step 1 by using collected data from relevant papers.
      </p>
      <p>As the SLR returned no previous work on using LLM for DTE, we structured the field of
DTE along with the tasks to perform during an engineering project (see section 2.2). This is the
argumentative-deductive part of our work.</p>
      <p>
        In the next step, RE for a DT in a defined application area was the subject of a quasi-experiment.
A controlled experiment in software engineering and information systems development is “a
randomized or quasi-experiment in which individuals or teams (the study units) conduct one or
more [. . . ] tasks for the sake of comparing diferent populations, processes, methods, techniques,
languages or tools (the treatments)” [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In our work, we perform a quasi-experiment; the study
units are ChatGPT and domain experts, and the treatment is the task of eliciting requirements
for a DT of an air conditioning system. A quasi-experiment is “an experiment in which units are
not assigned to conditions randomly” [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The experiment does not aim at testing a specific
hypothesis but is exploratory research to answer the research questions defined. The experiment
design is described in detail in section 5.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Related work</title>
      <p>Related work was identified by performing a systematic literature study based on the six steps
proposed in the method of Kitchenham (see section 3). The research questions (step 1) were
already introduced in section 3. The search space (step 2) consisted of the literature databases
Scopus, IEEExplore, and AISeL. The search string (step 3) used in these databases combines
the term “Digital Twin” in combination with “large language model” and synonyms for this
term (“LLM”, “neural text”, “ChatGPT”). Thus, the final search string used was “Digital Twin”
AND (“large language model” OR “LLM” OR “neural text” OR “ChatGPT”). The search in title,
abstract, and keywords resulted in a total of 19 papers. The inclusion criterion (step 4) was that
the papers had to discuss LLM use in the context of DT.</p>
      <p>
        Of the 8 hits in Scopus, 5 were excluded because they were conference proceedings that
included DT papers and LLM papers, but no paper with both topics in the same paper. The
remaining three papers used DT and LLM on the abstract to position the work but did not
address DT development. In AISeL, the query interface only allowed for search in all meta-data.
All 9 papers did not mention DT and LLM in title, abstract, or keywords; none of the nine papers
concerned DT development or LLM use in a DT. In IEEE Xplore, two papers were found, and
one of them was relevant to our work: [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] discuss a DT for an industrial automation system
and use LLM-agents to interpret descriptive information in the DTs and to control the physical
system through service interfaces. Thus, this paper is not exactly about engineering the DT but
about using LLM as a component for a specific task in a DT.
      </p>
      <p>Table 1 summarizes the number of papers found in the diferent databases and the relevant
ones. In conclusion, the SLR did not return any paper addressing LLM use for DTE. Thus, the
identified literature does not provide any information to aid in investigating the potential of
LLM use for DTE or to answer our RQs.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment on ChatGPT use in requirements engineering</title>
      <sec id="sec-5-1">
        <title>5.1. Experiment design</title>
        <p>
          Based on our experience from previous work [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] in the field of air conditioning systems and
an internet research, we prepared a questionnaire for the interview consisting of 17 questions.
These can be divided into General Structure and Functions of Air Conditioning Systems, Energy
Optimizations, and Special Case of a Facility in a Swimming Pool Hall (see Table 2). This list of
questions not only served as a guide during the interview with a domain expert but was also used
to create prompts for ChatGPT. By means of prompt engineering, in particular role prompting,
it was thus possible to simulate an interview situation with ChatGPT as the interview partner.
This ofers the possibility of comparing statements of the domain expert with the answers of
ChatGPT to have them evaluated later and thus generate further expertise and set requirements
for a DT.
        </p>
        <p>General Structure and Functions of Air Conditioning Systems</p>
        <p>What is/How does an air conditioning system work?
What are the main processes in an air conditioning system?</p>
        <p>What parts does an air conditioning system consist of?
What possibilities exist for modeling air conditioning systems?
Are there essential connections between sensors and actuators?</p>
        <p>Which parts are monitored by sensors?</p>
        <p>Which parts can be controlled externally?
Which environmental influences afect the air conditioning system?</p>
        <p>Which factors afect the durability of an air conditioning system?</p>
        <p>Which parts are maintenance-intensive?</p>
        <p>Energy Optimizations
What are the possibilities for saving energy in air conditioning systems?</p>
        <p>How does the heat recovery work?
Are there peaks in the possibilities for energy savings?</p>
        <p>Special Case of a Facility in a Swimming Pool Hall
Are there any parts and conditions in the air conditioning system for the swimming pool hall
that difer from conventional systems?
What special environmental influences afect the ventilation system in the swimming pool hall?</p>
        <p>In what form is the sensor data collected, and how does the storage work?
Are there components that are particularly stressed by the special environmental influences in
a swimming pool hall?</p>
        <sec id="sec-5-1-1">
          <title>5.1.1. Interview with domain expert</title>
          <p>An expert in the field of air conditioning systems was available as a participant in the interview.
He has been working for several years in a medium-sized company in this domain, where
he checks existing facilities, evaluates measurement data, and advises operators on energy
optimization. In addition, the expert has basic knowledge and initial experience with DTs and
digital shadows of those systems.</p>
          <p>The interview was conducted with the domain expert as the participant and four people who
prepared the interview. One person acted as the interviewer and guided the participant through
the interview on the basis of the questionnaire, while the remaining three people took notes,
asked follow-up questions, or checked the course of the interview guide. For later transcription
and analysis, the interview was recorded with the consent of all persons in the room.</p>
          <p>Before the interview began, the project and the project group were briefly introduced to
the expert. This was followed by an introduction of the expert, in which he described his
professional career and experience with air conditioning systems and the associated energy
optimization. The expert was able to answer each question from the list of questions, as well as
intermediate questions, in detail, and in some cases, also with case studies, experience reports,
and further explanations. In addition, the expert was able to interpret illustrations and circuit
diagrams of an air conditioning system, as well as to name and explain the components and
their functions.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Interview simulation with ChatGPT</title>
          <p>Using ChatGPT-4, role prompting was applied with an initial prompt to assign a role to the
Artificial Intelligence (AI) as an expert in the field of DTs for air conditioning systems. Here,
the assignment of the role was primarily done to ensure that experience in the domain and
professional background matched the domain expert. The prompts were entered in German.
“Hello, thank you for finding the time for an interview with us as a domain expert on digital twins
of air conditioning systems. Thanks to your expertise and years of experience related to air
conditioning systems, you are the ideal candidate for our interview. You have studied mechanical
engineering and are a project engineer in a renowned East German company in the air
conditioning and refrigeration sector, and your job is to look after, maintain, and improve/optimize
existing air conditioning systems. Below, we have a few questions for you. The questions should be
answered in as much detail as possible and be easy to understand, as we have no expertise in
mechanical engineering or in air conditioning and refrigeration technology.”</p>
          <p>With the simulation of the interview, each question was asked as a single prompt to ChatGPT,
where no follow-up questions were asked. It could be observed that the AI’s answers follow a
similar pattern, which starts with an introductory sentence. This is always followed by a list
of examples with a short explanation and further information until the answer ends with a
concluding paragraph. Unlike the domain expert, ChatGPT did not use specific case studies or
testimonials to reflect the information better. In addition, the version of ChatGPT used does not
ofer the possibility to interpret illustrations and describe them if necessary.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. Evaluation</title>
          <p>
            After conducting both interviews, the recordings and texts had to be processed and evaluated
accordingly for later analysis. The interview with the domain expert was transcribed, and the
transcript was reviewed for errors and corrections. Afterward, a qualitative content analysis
was conducted according to [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] (see section 5.2.1). Supercategories and subcategories were
inductively formed independently based on individual markers. The formed categories were
then compared within the group. The suitability of the categories discussed was often measured
within the groups based on the goal and task. The supercategories were unanimously defined
as follows:
• General Information About Air Conditioning Systems: Questions concerning the
construction and operation
• Energy Optimization: Questions concerning the possibilities for energy optimization
of an air conditioning system, both structural and process-related changes
• Use Case Swimming Pool Hall: Questions concerning specifics of an air conditioning
system in an indoor swimming pool
          </p>
          <p>Once the broad topics of the questions were identified, subcategories had to be created, which
helped to classify the answers to the corresponding questions and the evaluation by the domain
expert in a meaningful way. Therefore, the following categories were unanimously agreed upon
for the subcategories:
• Missing Information: Information from ChatGPT, which should have been included in
the generated answer according to the domain expert.
• Additional Helpful Information: Information from ChatGPT, which, according to
the domain expert, is outside the scope of the posed question and represents additional,
non-trivial, useful background knowledge.
• Additional Unhelpful Information: Information from ChatGPT, which, according
to the domain expert, is outside the scope of the question asked and does not provide
additional useful information to the subject matter.</p>
          <p>• Mistake: Marked as misinformation by the domain expert.</p>
          <p>The following subcategories were also added after a discussion with the agreement of the
majority of the working group. They were added because they allow for a more in-depth
evaluation of the interview in relation to the research questions.</p>
          <p>• Software: Information on software that enables digital realization of air conditioning
systems.
• Components: Details of parts that a working air conditioning system requires.
• Sensor Data: Data that can be collected via sensors and is relevant for system
functionality. Only metrics that can be assigned to physical quantities are to be recorded.</p>
          <p>The following subcategories were rejected by the majority after discussion and were not
adopted: Wear Factors, Processes in a System, Factors Influencing Durability of Parts , and Factors
Influencing Comfort/Well-Being . These subcategories were rejected either because they lacked
reference to the task or overlapped with other already defined categories and were, therefore,
dificult to distinguish from one another.</p>
          <p>Using the definitions, all four group members were then tasked with marking all text passages
individually according to the definitions they had developed. These markings were compared
with each other again afterward so that the coding could be completed. The preceding definition
was decisive for the coding. The coding was done in great unanimity, and there was no
incident of explicit disagreement in the assignment of text passages to the respective categories.
Diferences in markings were limited to careless errors where consensus was quickly reached.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>In this chapter, the results of the qualitative content analysis are presented first. Based on
this, the requirements of a DT for an air conditioning system are defined, and the preceding
interview with ChatGPT is examined with regard to these requirements. Subsequently, newly
created prompts are presented to provide a detailed analysis of the feasibility of using ChatGPT
as a substitute for a domain expert.</p>
        <sec id="sec-5-2-1">
          <title>5.2.1. Qualitative content analysis</title>
          <p>Figure 1 shows the assessment of the domain expert, sorted by supercategories and subcategories.
The 17 questions from the questionnaire (see Table 2) are divided into eleven questions on
General Information about air conditioning systems, three questions about Energy Optimization,
and three questions on information relevant to air conditioning systems Case Swimming Pool.
For each of these questions, it was checked whether a text passage was defined, which could
be assigned to one of the subcategories. The subcategories, which deal with the evaluation of
ChatGPT statements by the domain expert, are counted out here. Figure 1 shows the evaluation
of the statements of ChatGPT by the domain expert.
12
10
8
6
4
2
0</p>
          <p>General Information</p>
          <p>Energy Optimization</p>
          <p>Case Swimming Pool
Total</p>
          <p>Additional Helpful</p>
          <p>Additional Unhelpful</p>
          <p>Missing</p>
          <p>Mistake</p>
          <p>Of the eleven General Information questions, five responses obtained Additional Helpful
Information. One response included Additional Unhelpful Information. Information was
Missing in nine responses, and three contained Mistakes.</p>
          <p>Of three questions on Energy Optimization, one response provided Additional Helpful
Information, and one provided Additional Unhelpful Information. In this block of questions, no
Missing information or Mistakes were identified.</p>
          <p>In three questions about the Case Swimming Pool, Additional Helpful Information was
identiifed in one response and Additional Unhelpful Information in two responses. Additionally, two
Mistakes were found, but no Missing information in this block of questions.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Identified requirements for a digital twin</title>
          <p>Based on the interviews and our research, it was possible to identify a total of 28 elements that
are important for the realization of an air conditioning system as a DT and for it to function well.
These 28 elements can be divided into 23 general (Component/Factor) and five specific factors
(Type). The specific factors are specifications of individual components whose diferentiation is
important for the successful design of an air conditioning system as a DT.</p>
          <p>The interview with ChatGPT was examined for these parts and components listed in Table 3
to explore whether a structured interview with the AI application was suficient to realize a DT
of a technical system. Of the 23 general parts, ChatGPT correctly named 19 during the interview,
representing about 83% coverage. The correct and complete naming of all relevant sensors is
to be emphasized here. However, the important actuator pump was not named, nor was the
frost protection monitor, which protects the air conditioning system from frost damage and
thus ensures its functionality. The fact that the supply and exhaust systems are two separate
systems within a facility was also not apparent from the interview. For specific components
(Type), coverage is two out of five correctly named items (40%), more than 40% less than for
general parts. The diferent types of valves were missing, and it was not highlighted that there
are diferent types of airflow controllers, which would have been necessary in a DT due to the
diferent operations and impact on airflow.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Subsequent prompt engineering</title>
        <p>The following section deals with three new prompts that were created to get a more detailed
picture of the extent to which ChatGPT can be used as a substitute for a domain expert. The
main goal was to reproduce the list introduced in Table 3 and to extend it for the implementation
as a DT with information not mentioned before.</p>
        <p>Prompt 1: To model a DT, each modeling tool needs the exact mechanical units and sizes
of the individual components. These range from the number of channels in millimeters to the
power consumption of the motors in watts. Since none of these units were apparent during the
structured interview with ChatGPT, a customized prompt was created that named the categories
introduced in Table 3 and asked for the associated units.
“Please name all actuators, sensors, and passive components, as well as other passive factors that
are needed from an air conditioning system to realize it as a digital twin that should have a focus
on energy optimization. For all relevant components, please name their required physical
quantities.”</p>
        <p>The quality of the answers can be described as very good. However, the initial question lacked
the physical units for the respective variables, which were supplemented on request. It should
be emphasized that by directly addressing passive factors, a higher output was achieved than
in the structured interview. Thus, outdoor environmental conditions in the form of outdoor
temperature and outdoor humidity, as well as building characteristics and operating hours,
were mentioned again. However, from an energy optimization standpoint, the lack of motors
on the valves should be considered negative. In addition, ChatGPT also provided the point
that special software and algorithms for data analysis, ML, and optimization algorithms are
needed for energy savings in addition to the physical components. Another negative assessment
is that these are vague statements that could have provided more insight by providing more
detailed information, such as the use of Message Queuing Telemetry Transport (MQTT) as an
energy-saving protocol for transmitting sensor data, and do not contribute to the facts of the
case by providing additional information that is not really helpful.</p>
        <p>Prompt 2: In a further attempt, we entered the results from prompt 1 and asked ChatGPT
whether it was suficient.
“I have realized the following actuators, sensors, components and passive factors as well as their
physical parameters of a ventilation system in a digital twin: (...) Please tell me whether these
components are suficient to optimize the ventilation system in terms of energy and, if not, name
the missing parts and the corresponding physical values.”.</p>
        <p>On the positive side, new detailed information has again emerged, such as the status of air
duct insulation, which can make a positive contribution to energy eficiency and, thus, energy
optimization. ChatGPT also adds further facts for passive factors such as solar radiation or wind
speed, which are negligible for an air conditioning system DT. Otherwise, the quality of the
responses must be rated rather negatively. Both the consistency of the answers and the actual
task of realizing a functional DT must be rated as poor. The previously mentioned parts, valves,
and heat exchanger, which are both elementary for a DT, especially for energy optimization,
were not mentioned. The humidity sensor was correctly detected. However, ChatGPT also
provides rather poor answers for the sensors. The accuracy of the temperature sensors cannot
really be determined, and the addition of noise level sensors to the sensors would help detect
fan and other mechanical component failures or malfunctions, but ChatGPT sees these more in
the function of noise level for a more comfortable environment, including fan speed. Building
features have also been significantly reduced.</p>
        <p>Prompt 3: This prompt is an extension of the second prompt to test the efects of role
prompting on the quality of responses. The only adaptation of this prompt is to tell the language
model to imagine being an expert in the air conditioning domain.
“Imagine you are an expert in the ventilation system industry and have to assess whether these
components are suficient to optimize the ventilation system in terms of energy. If they are not
suficient, please name the missing components with the corresponding physical values.”
The output here is better judged than for the second prompt. The actuators were expanded
to include a previously unmentioned humidifier and the previously missing heat exchanger
was described more generally in terms of heat recovery systems but was therefore listed as an
actuator. The missing valves were again not mentioned, but the list of passive components was
at least expanded to include adjustable air difusers. Questionable components such as noise
sensors were no longer listed.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and evaluation</title>
      <sec id="sec-6-1">
        <title>6.1. Limitations</title>
        <p>While our research has led to a number of results, it also has many limitations that concern
various aspects of the experiment and the process of DTE. The three main categories of our
questionnaire were not represented with the same number of questions (evaluation is less
meaningful regarding the more detailed facts about air conditioning systems). ChatGPT-4 only
works with data through the end of 2021. More recent requirements, findings, and frameworks
cannot be considered. In addition, the topic of DT, as well as air conditioning, has proven to be
highly complex in the course of this work. Especially with regard to the physical conditions and
the mechanical as well as thermodynamic processes, a sound knowledge base is indispensable.
The consistency in queries at diferent times is mostly not given, and ChatGPT often provides
a lot of information without any real classification, and often, there are only more detailed
answers for very detailed questions that require prior knowledge.</p>
        <p>The experimental design itself also imposes limitations on the work. Due to the framework
condition that no prior knowledge is allowed, it was determined that prompting in ChatGPT
must not be a fixed part of the experimental setup. For this reason, the usual response pattern
with respect to prompt engineering was omitted.</p>
        <p>It is not possible to conduct an unstructured interview with ChatGPT, which is why structuring
the interview remains the interviewer’s task. Further limitations are that ChatGPT, in its current
version, cannot interpret or analyze visual material, making it impossible to let ChatGPT
evaluate information in a visual structure.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Interpretation of the results</title>
        <p>The statements that can be derived from this work are manifold. This section first discusses the
qualitative content analysis to answer the research questions; thus, the findings can be taken
from the domain experts’ evaluations.</p>
        <p>RQ 1: In the context of RE for DT, how consistent and complete is the output of
ChatGPT with the information provided by domain experts? There is missing information
or mistakes in each category. Therefore, it is important to carefully evaluate and use ChatGPT’s
generated responses with discretion. Specific inquiries often include superfluous information,
leading to confusion. In contrast, ChatGPT presents valuable supplementary information for
more general topics. The domain expert confirms that most of the answers here are simple but
correct in content but do not go beyond normal textbooks. At one point, he would have liked to
swap his answer for that of ChatGPT. It can be concluded that additional unhelpful information,
missing information, and errors were identified in many of the responses generated.</p>
        <p>Given the presented data limitations, creating a DT through the assistance of ChatGPT in this
experiment was not feasible. The absence of data on facility power consumption, actuator status
data, 3D models of the facility, and real-time data directly fed from a network only allowed
for the development of a dashboard that solely processes the available data. Nonetheless, an
attempt was made to program a DT in Modelica using ChatGPT as the sole tool. However,
ChatGPT is unable to perform or guide the development of a DT independently. This is because
ChatGPT releases the required code in an outdated and unsupported version. Consequently, it is
necessary to engage a domain expert with the required expertise to successfully implement a DT.
ChatGPT could not determine, through inquiries, that implementing a DT of an air conditioning
facility is feasible with the provided resources, and the proposed solution is not viable.</p>
        <p>RQ 2: How to design the prompt chain to improve the output of ChatGPT?
Based on this work, we can initially derive a few things to answer RQ 2, but this can only
be seen as a starting point and must be continued in future work. In our first attempts, we
found that the use of role prompting leads to more nuanced and detailed responses. By trying
out diferent approaches, we found that clarification and specificity, as well as encouraging
ChatGPT for more detailed responses, are important aspects to consider. For example, adding
the fact that we also need physical quantities of the components in prompt 1 helped improve
the answers. Using these results in prompt 2 (iterative prompting) again helped to refine and
clarify the gained information. We also added a feedback loop there, which leads ChatGPT to
evaluate and refine the answers.</p>
        <p>These results are promising and give reason to go further in this direction in future work.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and future work</title>
      <p>The work presented in this paper addressed the questions of how consistent and complete the
output of ChatGPT can be in the context of RE for DT compared to the knowledge of a domain
expert and how to design the prompt chain to improve the output of ChatGPT. To answer these
questions, we conducted an experiment in which we used ChatGPT to elicit requirements in
order to compare them with the knowledge of a domain expert. We then used the knowledge
gained to try to change the prompt in order to improve the output.</p>
      <p>A first implication is that the AI tool ChatGPT should not be seen as a substitute but as a
useful building block to better design DTs. ChatGPT can compensate for human errors through
a synthetic approach with the domain expert and can be used to supplement the expert’s
remarks at many points due to its constant availability, as long as ChatGPT’s statements are
always critically scrutinized afterward. The advantage here is also a more efective use of the
domain expert’s time. The most efective way to get all the data in ChatGPT to create a DT is
to create a list of components and have ChatGPT check them for completeness. With a larger
input set, ChatGPT returns a more diverse output set, which can lead to the expansion of some
components and the avoidance of errors due to the addition of forgotten components. Some
open questions for future work derived from our results are:</p>
      <p>(1) Does working out a topic with ChatGPT and domain experts save time compared to
working it out without ChatGPT? Since ChatGPT qualitatively supports the early phases of
DTE, the question arises whether a quantitative benefit (e.g., time) can be measured.</p>
      <p>(2) How can the evaluation scale be adapted to the existing knowledge of the domain experts?
On the conceptual level, the evaluation of the efort of ChatGPT by the domain expert depends
on the domain expert’s existing knowledge. The comparability of the statements of ChatGPT
sufers from this since their evaluation depends directly on the person evaluating the statements.</p>
      <p>(3) How does ChatGPT perform among several domain experts? As already mentioned, the
evaluation of the statements of ChatGPT by the domain expert depends on the domain expert.
For a more informed evaluation, it would make sense to use several domain experts for the
evaluation in order to check whether the core statements difer among the various experts.</p>
      <p>(4) How does ChatGPT perform in other disciplines? Besides DTs and air conditioning systems,
a whole range of topics could be considered for processing with ChatGPT. It would be interesting
to know if there are significant diferences in the level of knowledge and expressiveness of
ChatGPT.</p>
      <p>In conclusion, a more comprehensive investigation of the benefits of ChatGPT for more
and diferent use cases seems relevant and necessary to better understand its potential and
limitations. It is essential that a diverse range of domain experts play a role in this research. The
contribution of this paper serves to confirm the importance and promise of continued research
in this area.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sniderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cotteleer</surname>
          </string-name>
          ,
          <article-title>Industry 4.0 and manufacturing ecosystems: Exploring the world of connected enterprises</article-title>
          ,
          <source>Deloitte Consulting</source>
          <volume>1</volume>
          (
          <year>2016</year>
          )
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Mordatch</surname>
          </string-name>
          ,
          <article-title>Language models as zero-shot planners: Extracting actionable knowledge for embodied agents</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>9118</fpage>
          -
          <lpage>9147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chiriatti</surname>
          </string-name>
          , Gpt-3
          <article-title>: Its nature, scope, limits, and consequences</article-title>
          ,
          <source>Minds and Machines</source>
          <volume>30</volume>
          (
          <year>2020</year>
          )
          <fpage>681</fpage>
          -
          <lpage>694</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          , G. Neubig,
          <article-title>Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G. N.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Steinmetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V. B.</given-names>
            <surname>Henriques</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rettberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>A methodology for digital twin modeling and deployment for industry 4.0</article-title>
          ,
          <string-name>
            <surname>Proceedings</surname>
            <given-names>of</given-names>
          </string-name>
          <source>the IEEE</source>
          <volume>109</volume>
          (
          <year>2020</year>
          )
          <fpage>556</fpage>
          -
          <lpage>567</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qamsane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Moyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Toothman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kovalenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Balta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Faris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Tilbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Barton</surname>
          </string-name>
          ,
          <article-title>A methodology to develop and implement digital twin solutions for manufacturing systems</article-title>
          ,
          <source>IEEE Access 9</source>
          (
          <year>2021</year>
          )
          <fpage>44247</fpage>
          -
          <lpage>44265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <article-title>Construction method of shop-floor digital twin based on mbse</article-title>
          ,
          <source>Journal of Manufacturing Systems</source>
          <volume>60</volume>
          (
          <year>2021</year>
          )
          <fpage>93</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Wagg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Worden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Barthorpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gardner</surname>
          </string-name>
          , Digital Twins:
          <article-title>State-of-the-Art and Future Directions for Modeling and</article-title>
          Simulation in Engineering Dynamics Applications,
          <source>ASCE-ASME J Risk and Uncert in Engrg Sys Part B Mech Engrg</source>
          <volume>6</volume>
          (
          <year>2020</year>
          )
          <fpage>030901</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kitchenham</surname>
          </string-name>
          ,
          <article-title>Procedures for performing systematic reviews</article-title>
          , Keele,
          <string-name>
            <surname>UK</surname>
          </string-name>
          , Keele University 33 (
          <year>2004</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Sjøberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Hannay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. B.</given-names>
            <surname>Kampenes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karahasanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.-K.</given-names>
            <surname>Liborg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Rekdal</surname>
          </string-name>
          ,
          <article-title>A survey of controlled experiments in software engineering</article-title>
          ,
          <source>IEEE transactions on software engineering 31</source>
          (
          <year>2005</year>
          )
          <fpage>733</fpage>
          -
          <lpage>753</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>T. D. Cook</surname>
          </string-name>
          , D. T. Campbell, W. Shadish,
          <article-title>Experimental and quasi-experimental designs for generalized causal inference</article-title>
          ,
          <source>Houghton</source>
          Miflin Boston, MA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jazdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weyrich</surname>
          </string-name>
          ,
          <article-title>Towards autonomous system: flexible modular production system enhanced with large language model agents</article-title>
          ,
          <source>in: IEEE 28th International Conference on Emerging Technologies and Factory Automation (ETFA)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ivanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sandkuhl</surname>
          </string-name>
          ,
          <article-title>Technologies for a diagnostic technique for hvac systems using iot and cloud-based architecture</article-title>
          , in: 2023
          <source>International Interdisciplinary PhD Workshop</source>
          (IIPhDW),
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mayring</surname>
          </string-name>
          ,
          <article-title>Qualitative inhaltsanalyse: Grundlagen und techniken, Dt</article-title>
          . Studien-Verlag,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>