<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1016/B978-0-12-818630-5.10063-6</article-id>
      <title-group>
        <article-title>Designing and Evaluating Evidence-Centered Design based Conversations for Assessment with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Zapata-Rivera</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carol Forsyth</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edith Aurora Graf</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Jiang</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ETS</institution>
          ,
          <addr-line>Rosedale Road, Princeton, NJ 08541</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>504</fpage>
      <lpage>518</lpage>
      <abstract>
        <p>This paper discusses conversation-based assessments (CBAs) created with prompt engineering for LLMs based on Evidence-Centered Design (ECD). Conversation-based assessments provide students the opportunity to discuss a given topic with artificial agent(s). These conversations elicit evidence of students' knowledge, skills, and abilities that may not be uncovered by traditional tests. We discuss our previous method of creating such conversations with regular expressions and latent semantic analysis, an expensive methodology requiring substantial time and varied expertise. In this novel work, we therefore created a prompt-engineered version of CBAs, based on evidence-centered design, that remains on the domain topic throughout the conversation and provides evidence of student knowledge at lower cost. We present the methodology for creating these prompts, compare responses to various student speech acts between the previous version and the prompt-engineered version, and discuss the evidence gleaned from the conversation and based on the prompt. Finally, limitations, conclusions, and implications of this work are discussed.</p>
      </abstract>
      <kwd-group>
        <kwd>Conversation-based Assessment</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Dialogue Systems</kwd>
        <kwd>Evidence-Centered Design</kwd>
        <kwd>AutoTutor</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Advances in Artificial Intelligence are reducing the design
and development complexity that is usually required when
using dialogue systems in educational contexts. These
advances can have a positive impact on the adoption and
scalability of conversation-based learning and assessment
activities. Conversation-based assessments (CBAs) have
been explored as innovative mechanisms to assess skills in
a natural context. In particular, we consider skills that can
be assessed in a conversational context such as
argumentation, scientific inquiry, language skills, and collaboration.</p>
      <p>
        Researchers have explored the use of Large Language
Models (LLMs) for a variety of purposes including
evaluating LLM generated hints to human-created ones [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and
creating dialogue-based tutoring interactions with
students [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This paper explores the use of LLMs and
Evidence-Centered Design (ECD) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] in the creation and
evaluation of conversation for assessment purposes (e.g.,
formative assessment) as a mechanism to gather evidence
of students’ knowledge, skills, and other attributes. ECD
offers a principled methodology for assessment design as it
provides an approach to explicitly represent an
evidence-based chain of reasoning, with the goal of supporting
assessment validity. This evidence-based chain connects
responses to particular tasks to the constructs that are
assessed. We discuss a use case that illustrates the types of
conversations produced by the LLM approach compared to
a previous approach that relies on regular
expressions/RegExp [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We
elaborate on trade-offs of applying these two approaches
to generating conversations for assessment purposes
including methods for evaluating them.
      </p>
    </sec>
    <sec id="sec-cba">
      <title>2. Conversation-based assessments</title>
      <p>
        Conversation-based assessments (CBAs) build on
innovations in areas such as conversational agents and dialogue
systems for improving student learning
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][2][4][6][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref10">10</xref>
        ][
        <xref ref-type="bibr" rid="ref12">12</xref>
        ][
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref16">16</xref>
        ][
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. These conversations between
human students and artificial agents are a mechanism to
gather evidence of students’ knowledge, skills, and other
attributes following an evidence-centered approach [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
CBAs have been used to assess communication skills (e.g.,
English language skills, science inquiry skills, and
mathematical argumentation) in formative contexts [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>CBAs have been leveraged to gather additional
explanations about students’ decisions in technology-rich
environments involving interactive simulations and
interactions with virtual agents and other students. They
provide students with multiple opportunities to elaborate
on their responses. For example, virtual agents can
rephrase questions and ask students to provide additional
information, if necessary.</p>
      <p>
        The design and development process of CBAs involves
defining the construct and the type of evidence that is
expected to be elicited by the conversation (see Figure 1). To
create natural situations to elicit the evidence, aspects of
the conversation scene need to be defined (e.g., context of
the conversation, main question, conversation moves,
response categories and types of interactions) thus creating
the scene. In addition to this information, the
conversational dialogue and scoring model is created. Specifically,
construct information, conversation paths based on user
response categories resulting in differing responses by
artificial agents, closing statements, and partial scoring rules
are documented in conversation diagrams that are used as a
communication mechanism among various members of the
team including dialogue and assessment developers (See
Figure 2). The script development and testing of these
conversations were designed, developed, and tested using
additional authoring tools, automated testing techniques,
and data collected with the conversation prototype
administered to the intended audience via cognitive labs, wizard
of oz studies, pilot studies, and crowdsourcing efforts
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ][
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        Conversation diagrams are of particular importance during
the development process, as they capture many of the
elements needed for designing ECD-based conversations
(see Figure 2). They start with the definition of the
construct and include the opening and main question, the
introduction of the character(s) and their roles (e.g.,
student vs. teacher) that will provide the information, the
user response categories handled by the system, closing
statements, and partial scores for the constructs
involved. The response categories and how artificial agent(s)
react to them are based on AutoTutor’s framework [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
referred to as Expectation-Misconception Tailored Dialogue.
In this framework, a main question is always posed that
has a pre-programmed, complete expected answer. If the
student does not fully satisfy the requirements of the
correct answer, the agent begins by providing a pump such
as “can you tell me more” and, after the student’s answer,
launches into a series of hints (open-ended clues) and then
prompts (clues asking for a single word or phrase),
followed by an assertion. If at any point a
student states a misconception, the agent corrects this
particular misconception. If the student provides the correct
answer at any point, the system provides an assertion
(restating the correct answer) and moves on to the next
question. CBAs are very similar but provide less
information in follow-up scaffolding moves such as pumps,
hints, and prompts. Additional speech acts have been
defined as common during the tutoring process with agents
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, CBAs differ from tutorial dialogues in an
important way: the goal is not to give away the
answer but rather to probe the student to reveal more of
what they already know. For example, assertions, which
are common in tutoring, are often withheld in CBAs
because the goal is to elicit more evidence. How the
agent(s) respond to common speech act categories of the
human student is therefore very important in a CBA, and
the identification of these speech acts
is integral to determining an appropriate response. Some
of these categories include:
      </p>
      <p>Correct response (with an appropriate explanation
and/or evidence): A correct response typically
includes some key elements of the expected answer to
a main question posed by the artificial agent. The
“correctness” is often defined by experts in the
particular domain at hand.</p>
      <p>Incomplete response: An incomplete or partial
response usually includes some portion of the correct
answer, but critical elements are still lacking. To
address this issue, if at any point in the conversation
the student gives such a response, the artificial agent
may provide the above given scaffolding moves in
attempts to get the student to give the complete
response. The conversations take note of students’
previous answers: if, for example, a
student provides one part of the answer early in the
conversation and the rest of the answer
separately after a hint, the system puts the two answers
together and considers the student’s answer correct.</p>
      <p>Irrelevant response: An irrelevant response is a
completely off-topic remark in reply to any of the scaffolding
moves posed by the agent. In these instances, the
artificial agent provides a response such as “this is not
relevant to our conversation” and then continues
with a closing statement and the next follow-up
scaffolding move.</p>
      <p>
        Meta-communicative response: A
meta-communicative response is a common discourse move that
students say during tutoring [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. An example is when
a student asks the artificial agent or tutor to please
repeat the question. When this occurs, the question
can be repeated or rephrased.
      </p>
      <p>
        Meta-cognitive response: A meta-cognitive response
is when the student says something similar to “I
don’t know”, another common discourse move in
tutoring [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. When this happens, the artificial agent
can acknowledge the situation and attempt to get the
student to say anything by providing scaffolding and
introducing the next discourse move.
      </p>
      <p>No response: In the event the student does not
respond at all, the system is prepared to acknowledge
the lack of a response and move forward and provide
the next scaffolding moves. The wait time is usually
determined through iterative refinement based on data
gathered from real students and is highly dependent on
the length of the required response.</p>
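      <p>The response categories above can be illustrated with a small rule-based classifier. The following Python sketch is purely illustrative: the patterns, keywords, and function name are hypothetical stand-ins, not the RegExp/LSA rules used in the actual system.</p>

```python
import re

# Hypothetical patterns for a few of the speech-act categories described
# above; real CBAs derive such rules from expert review and pilot data.
CATEGORY_PATTERNS = [
    ("meta_cognitive", re.compile(r"\b(i|we) (don'?t|do not) know\b|\bno idea\b", re.I)),
    ("meta_communicative", re.compile(r"\b(repeat|rephrase|say (that|it) again)\b", re.I)),
]

# Illustrative key elements of an expected answer for one main question.
EXPECTATION_KEYWORDS = {"seismic", "magma", "ground swelling"}

def classify(response: str) -> str:
    """Assign a student response to one of the speech-act categories."""
    if not response.strip():
        return "no_response"
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(response):
            return category
    matched = {kw for kw in EXPECTATION_KEYWORDS if kw in response.lower()}
    if matched == EXPECTATION_KEYWORDS:
        return "correct"      # all key elements of the expected answer present
    if matched:
        return "incomplete"   # some portion of the correct answer present
    return "irrelevant"
```

      <p>A production classifier would also accumulate partial answers across turns, as described above, rather than judging each turn in isolation.</p>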
      <p>
        Conversation diagrams were converted into XML
scripts implemented with the AutoTutor Script
Authoring Tool for Assessment (ASATA) to run the
conversations, handling both the human input (via NLP)
and the output of the agent speech. This system makes use
of LSA and RegExp to parse students’ responses and
determine how to react to them. The final results were CBAs
which included one or two virtual agents interacting with
the student and lasted between 2 and 5 turns. This process took
weeks for conversations to go from the initial descriptions
to conversation diagrams to actual script conversations
implemented into ASATA and tested with user data [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        Testing CBA conversations can be a time-consuming
process of manually entering possible student responses
and observing whether the conversation flows as expected.
Evaluation approaches include wizard of oz studies and
pilot studies to refine regular expressions as well as
crowdsourcing efforts to validate response categories.
Automated testing of regular expressions using a script-based
approach with sample responses was implemented and
used to speed up the development of CBAs. This approach
utilizes sample responses gathered from experts, from the
target audience via small-scale pilot studies and
crowdsourcing efforts, and an XML representation of the
conversation encoded in the conversation diagram to
traverse conversation paths, comparing generated responses
with expected responses for particular response categories.
This approach reduced the number of iterations and testing
time required to implement CBAs [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Although these
automated approaches were useful for quickly finding
unexpected responses and making changes to the system
to address them, the resulting
improvements are limited by the time needed to develop regular
expressions and make any additional changes.
      </p>
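      <p>The script-based testing idea can be sketched as follows. The function and sample data are hypothetical; the sketch shows only the core idea of running labeled sample responses through a classifier and flagging mismatches.</p>

```python
# Sketch of the script-based testing approach: run sample responses (from
# experts, pilot studies, and crowdsourcing) through a response classifier
# and flag any that land in an unexpected category. `classify_response` is
# a stand-in for the RegExp/LSA matcher.
def run_regression_suite(classify_response, labeled_samples):
    """Return the samples whose predicted category differs from the label."""
    failures = []
    for response, expected_category in labeled_samples:
        predicted = classify_response(response)
        if predicted != expected_category:
            failures.append((response, expected_category, predicted))
    return failures

# Toy example with an intentionally crude stub classifier.
samples = [
    ("I don't know", "meta_cognitive"),
    ("please repeat the question", "meta_communicative"),
]
stub = lambda r: "meta_cognitive" if "know" in r else "irrelevant"
print(run_regression_suite(stub, samples))
```

      <p>Each reported failure points the developers at a regular expression that needs refinement, which is what reduced the number of testing iterations in practice.</p>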
      <p>
        The next section describes the process of designing
ECD-based conversations using Large Language Models
via prompt engineering.
      </p>
    </sec>
    <sec id="sec-prompt">
      <title>3. Designing conversations with prompt engineering</title>
      <p>
        Designing conversations using prompt engineering could
help assessment developers (ADs) by reducing the cost of
creating such tasks, providing an additional source of
evidence of a kind that has a long history of aiding learning (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for a
review), and providing evidence of specific constructs (see
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]).
      </p>
      <sec id="sec-prompt-1">
        <title>3.1. LLM conversation design process</title>
        <p>
          The LLM conversation design process starts with a clear
definition of the construct and the behaviors/interactions
needed to make claims about students’ mastery of the
construct. This is similar to the CBA design approach
mentioned above, in which domain analysis and evidence
identification are important initial aspects of the process.
These activities are followed by an iterative approach
comprised of identifying aspects of the conversation setting
(e.g., number and types of virtual agents, and general
context for the conversation), designing and testing
prompts, and generating supporting evidence from the
resulting conversations for scoring purposes (see Figure 3).
        </p>
      </sec>
      <sec id="sec-prompt-2">
        <title>3.2. Prompt design process</title>
        <p>
          We began our prompt design process by
augmenting previous work on developing LLM conversations
using the AutoTutor framework [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. However, as with the
original CBAs, we needed to adapt this prompt
to the assessment context. During this process of
refinement, we applied ECD principles for assessment design in
structuring the prompt. Below, we describe some of the
components of the resulting prompt structure:
        </p>
      </sec>
      <sec id="sec-1-1">
        <title>3.2.1. Introduction and domain</title>
        <p>In the introduction, we explicitly give the LLM three main
components: (1) who the LLM should pretend to be (i.e., a
role), which follows the persona pattern, (2) who the LLM
is chatting with (the audience pattern), and (3) the
domain (topic/construct). For example, the following prompt
includes these components and produces aligned output:
“Your name is John. You are
a humorous and reliable study partner. You can assume that
I know basic concepts about volcanoes (e.g., definition,
types of volcanoes, and examples of volcanoes). You will
help me learn about science inquiry in the context of
volcano eruptions.”</p>
      </sec>
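        <p>The three introduction components can be assembled programmatically when authoring prompts for new domains. The following Python sketch is illustrative only; the function and parameter names are our own, not part of any prompt framework described above.</p>

```python
def build_introduction(name, persona, audience_knowledge, domain):
    """Compose the introduction segment of a prompt: role (persona
    pattern), audience (audience pattern), and domain/construct."""
    return (
        f"Your name is {name}. You are {persona}. "
        f"You can assume that I know {audience_knowledge}. "
        f"You will help me learn about {domain}."
    )

# Reproducing the volcano example from the text.
intro = build_introduction(
    name="John",
    persona="a humorous and reliable study partner",
    audience_knowledge=("basic concepts about volcanoes (e.g., definition, "
                        "types of volcanoes, and examples of volcanoes)"),
    domain="science inquiry in the context of volcano eruptions",
)
print(intro)
```

        <p>Parameterizing the introduction this way keeps the persona, audience, and domain components explicit and reusable across constructs.</p>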
      <sec id="sec-1-2">
        <title>3.2.2. Conversational schema</title>
        <p>
          In the Conversational Schema, the main rules of the
conversation are included, which tend to break down into four
main components. The first component is the instructional
dialogue or pedagogical aspect. Specifically, we borrowed
some of the language from Hu et al.’s [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] work to create
conversations that have an artificial agent conversing with a
human student using the Socratic method. Therefore, the
prompt is, “Your teaching style suits my needs: the Socratic
method of questioning, where the required answers are not
just a simple ‘yes’ or ‘no’” [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], allowing researchers to glean
meaningful evidence from students [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In the next
segment, we borrow from the Expectation-Misconception
Tailored dialogue framework to ensure that misconceptions
are addressed. An example prompt for this component is
“You remain alert for any possible misunderstandings or
omissions of key points in my answers. If you identify
these, you guide me to address them with targeted
questions and provide adequate feedback.” Next, we move on to
ensure that the artificial agent’s turns or
discourse moves are not too long before it waits for an answer
from the human, as LLMs often produce a vast amount of
information. Therefore, we instruct the LLM to “Please
keep your turn short. Your turn should be 100 words or
less.”
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>3.2.3. Modeling behavior of persona</title>
        <p>The goal of modeling the behavior of the persona is to
ensure that the artificial agent (character typing through the
LLM) remains on task following the domain model and
ensures that evidence of knowledge of the construct is
collected. The main component of this module is derived
from iterative refinement and includes a long list of
negations. For example, “do not provide sources or links” or “do
not mention your teaching style.” The second aspect
includes instructions on how to handle the specific response
types mentioned earlier. Given that the instructions are
now in prompts rather than comprised of regular
expressions, we can simply tell the LLM how to handle various
responses (e.g., correct response, irrelevant,
meta-cognitive). For example, to handle meta-cognitive responses
from students, the prompt includes “If I say ‘I don’t know’,
don’t give me the answer or provide sources or links.
Instead acknowledge the situation and motivate the student
to say something.” Successfully handling response types is
also iterative in nature and therefore negations for specific
response types are also included. This process of iterative
refining can result in long prompts which may be difficult
to manage.</p>
      </sec>
      <sec id="sec-1-4">
        <title>3.2.4. More description and restating</title>
        <p>
          Nearing the end of the prompt, additional description and
restating becomes of paramount importance to ensure that
the LLM retains the instructions needed to produce the
required conversation-based assessment. In this module,
components include reminding and specifying the
pedagogical instruction to the LLM. Part of this prompt is
borrowed from Hu et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] but aspects are modified for
CBA. An example includes a four-step process for the
artificial agent conversing with the human student as follows.
“Remember, our process is a four-step approach: 1. You
pose a question and then wait for my answer. Humor is
welcomed. 2. You assess my understanding by examining
the correctness and amount of evidence provided towards
the topic selected. 3. Based on my answer, you adjust your
subsequent questions. 4. When enough evidence is
provided, you select a different topic within the same domain
area and repeat the process.” After providing this
instruction, the prompt then reminds the LLM of its persona as
well as additional rules to maintain the conversation.
        </p>
      </sec>
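        <p>The modules described in Sections 3.2.1 through 3.2.4 are concatenated in order to form the working prompt. The sketch below is a simplified stand-in: the module texts are abbreviated and the function name is our own.</p>

```python
# Abbreviated stand-ins for the prompt modules described in the text,
# kept in the order in which they appear in the full prompt.
PROMPT_MODULES = [
    ("introduction_and_domain",
     "Your name is John. You are a humorous and reliable study partner."),
    ("conversational_schema",
     "Your teaching style suits my needs: the Socratic method of questioning. "
     "Please keep your turn short. Your turn should be 100 words or less."),
    ("persona_behavior",
     "Do not provide sources or links. Do not mention your teaching style. "
     "If I say 'I don't know', acknowledge the situation and motivate me."),
    ("description_and_restating",
     "Remember, our process is a four-step approach: 1. You pose a question "
     "and wait for my answer. 2. You assess my understanding. 3. Based on my "
     "answer, you adjust your subsequent questions. 4. When enough evidence "
     "is provided, you select a different topic and repeat the process."),
]

def assemble_prompt(modules=PROMPT_MODULES):
    """Join the named modules into a single system prompt, in order."""
    return "\n\n".join(text for _, text in modules)
```

        <p>Keeping the modules as named pieces makes the iterative refinement described above more manageable than editing one long prompt string.</p>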
      <sec id="sec-1-5">
        <title>3.2.5. What to do with evidence</title>
        <p>
          Finally, the prompt directs the LLM on when to produce
the evidence and explicit instructions on the evidence
produced. At this point it’s important to create a stopping
point in the conversation or the LLM will continue on for
many turns. Therefore, the LLM is instructed “When I type
terminate, end the conversation and produce the evidence.”
This is helpful as a trigger to end the conversation and
provide evidence. As we are attempting to align the
conversations with ECD, we explicitly ask for an XML file with
a Toulmin diagram [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] to gain information on the specific
claim and associated evidence as well as alignment to the
construct and known standards gleaned from student
input. Figure 4 shows a sample XML description of a diagram
connecting the conversation to relevant standards. The
prompt also asks the LLM to produce a score for mastery
on these standards, but the meaning of this score has not
yet been validated.
        </p>
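          <p>Once the LLM emits the evidence XML, it can be processed automatically. The sketch below assumes a hypothetical element layout (claim, evidence, and standard alignments); the actual element names depend on the prompt and are not standardized, so this is illustrative only.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical shape of the Toulmin-style evidence XML requested from the
# LLM; the real structure produced in our study may differ.
SAMPLE_EVIDENCE = """
<toulmin>
  <claim>The student can identify precursors of a volcanic eruption.</claim>
  <evidence>Mentioned seismic activity and ground swelling unprompted.</evidence>
  <standard id="NGSS-MS-ESS3-2" mastery="0.8"/>
</toulmin>
"""

def summarize_evidence(xml_text):
    """Extract the claim, supporting evidence, and standard alignments."""
    root = ET.fromstring(xml_text)
    return {
        "claim": root.findtext("claim"),
        "evidence": root.findtext("evidence"),
        "standards": [
            (s.get("id"), float(s.get("mastery", "0")))
            for s in root.findall("standard")
        ],
    }
```

          <p>Parsing the evidence into a structured record is a first step toward the human-in-the-loop validation of mastery scores discussed later.</p>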
        <p>
          Table 1 shows a fragment of an LLM conversation
designed to assess science inquiry skills in the context of a
volcano eruption. This conversation was produced with a
prompt that follows the prompt design process described
above. The conversation has similarities with
conversations designed using the design process for CBAs
described in Section 2 [
          <xref ref-type="bibr" rid="ref23 ref24">23-24</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Case study</title>
      <p>We investigated how CBAs based on RegExp/LSA
compared to LLM conversations in terms of how they handled
certain common types of user responses or speech acts
(e.g., relevant responses – correct or incomplete, irrelevant
responses, meta-communicative, and meta-cognitive – see
Section 2).</p>
      <p>
        The conversation diagram for the RegExp/LSA
conversation included two agents and conversation paths for each
response category. These conversations included a main
question and 2-5 turns, and a closing which typically
resulted in partial score assignments for the constructs
involved. The LLM conversation was designed,
implemented, and refined using the approach described in
Section 3. The prompt used in this case study includes the
modules mentioned in Section 3.2. We used Microsoft
Copilot GPT-4 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Table 2 shows sample conversation
fragments using a CBA designed to assess science inquiry
skills in the area of volcano eruption prediction. Sample
responses for the RegExp/LSA column were randomly
selected from a dataset of 105 middle-school students who
completed the scientific inquiry volcano scenario in 2014
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Sample student responses for the LLM conversation
were based on student responses collected in the same
previous study [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Sample conversation fragments (Table 2)</title>
        <p>RegExp/LSA (meta-cognitive): Agent: Please look at the
alert level table. Do you agree with Art? Why or why not?
Student: I don’t know. Agent: Come on, you can tell me
something.</p>
        <p>RegExp/LSA (irrelevant): Agent: Can you compare these
two notes? Which one do you think we should keep for
making predictions later? Student: because I do. Agent:
How is this related to the conversation? Can you compare
these two notes and tell us which one you think we should
keep?</p>
        <p>RegExp/LSA (meta-communicative): Agent: Please look at
the alert level table. Do you agree with Art? Why or why
not? Student: Please repeat the question. Agent: Ok. Please
look at the alert level table. Do you agree with Art? Why
or why not?</p>
        <p>LLM (opening): Copilot: Hello! They call me John, the
volcano whisperer! Now, let’s dive into the world of
volcanoes. Can you tell me what signs might indicate that
a volcano is about to erupt?</p>
        <p>LLM (meta-cognitive): You: I don't know. Copilot: That’s
perfectly okay! Let’s think about it together. Volcanoes
often give off warning signs before an eruption. Can you
think of any changes in the environment around a volcano
that might occur?</p>
        <p>LLM (follow-up): Copilot: Interesting point! Animals can
sometimes sense changes in their environment. How might
animals behave differently if a volcano was about to
erupt?</p>
        <p>LLM (irrelevant): You: I think about friends. Copilot:
Ah, friends! They’re great, aren’t they? But let’s steer
back to our volcano. How about seismic activity? Do you
think it could be a sign of an impending eruption?</p>
        <p>LLM (scaffolding): Copilot: Exactly, you’re getting
warmer! So, when magma rises, it can cause the ground to
swell. What kind of tools or methods might scientists use
to detect this swelling or any other unusual movement
around a volcano? Think about the instruments that could
measure these changes.</p>
        <p>LLM (meta-communicative): You: please repeat the
question. Copilot: Certainly! I’ll rephrase it for clarity:
Scientists have tools to monitor a volcano’s behavior. If
you were one of those scientists, what kind of instruments
might you use to detect signs like ground swelling or
seismic shifts that indicate a volcano might erupt soon?</p>
        <p>There are some obvious differences between the
regular expression/LSA versions of the conversations vs. those
generated by the LLM, the most notable being that the LLM
conversations appear more naturalistic. For example, in the
irrelevant category, when the student wants to talk about
friends, the agent brings the topic back to the domain by
directly addressing the comment “Ah friends, they are
great, aren’t they, but let’s steer back to volcanoes”.
Conversely, in the Regex/LSA conversations, the agent simply
says, “how is this relevant to our conversation”. Another
interesting aspect is the ability to create analogies in the
LLM conversations such as “This is similar to the fizz that
comes out when you open a shaken soda can.” This type
of response is generated on the fly and is not as rigid as
the RegExp/LSA approach. However, there is a downside in
that currently the LLM-produced conversations may provide more
information than desired during the conversation-based
assessment. Furthermore, the ability to continue
conversations in less pre-structured ways may present challenges
for scoring. The RegExp/LSA approach handled the “No
response” category by setting a timer and displaying a
message asking students if they needed additional time to respond to the
response category in the LLM approach.</p>
        <p>In addition, we investigated how LLM conversations
responded to attempts to game the system. In our initial
attempts, we found out that it was possible to derail the
LLM from the topic during the CBA. In one such attempt,
the user claimed to already know about the causes of
volcanic eruptions and asked to discuss another topic, in this
case math. Initially, the LLM responded by trying to stay
on the topic of volcanoes, at which point the user asked if
math could be discussed in the context of volcanoes. The
LLM concurred, and asked the following question: “If a
volcano erupts and spews out lava at a rate of 100 cubic meters
per second, how much lava would it release in one hour?
Remember, I’m looking for your reasoning along with the
numerical answer.” Note that at this point the construct
assessed had changed from volcano prediction to
mathematics, but the context was maintained. After some
back and forth about this question, the user asked for
another rate question in a context besides volcanoes, claiming
the need for further practice. The LLM asked the user a rate
question involving cars, and at this point both the
construct assessed and the context had changed, which may
not be ideal for assessment purposes.</p>
        <p>Revisions to the prompt to encourage the LLM to stay
on topic met with success in a subsequent iteration. For
example, after several attempts to thwart it by asking to
change the topic, the LLM replied humorously with, “Ah,
trying to dodge the lava flow, are we? Nice try! But
remember, we’re here to talk about volcanoes.” This said,
we acknowledge that there are numerous ways that a
student could change the topic, and this one example does not
demonstrate that we were successful in changing the LLM’s
behavior or that the result is replicable—that would require
testing with many conversations.</p>
        <p>These exchanges raised an interesting question around
pedagogy as well as prompt engineering. In natural
conversations, people segue from one topic to the next. In
classroom contexts, teachers use their expertise to facilitate
discussions that balance exploration of ideas with staying
on topic. To what extent should prompt engineers attempt
to constrain academic conversation? Allowing some
breadth in the scope of ideas explored may foster
motivation. This question needs to be explored further, in
discussions with teachers, cognitive scientists, and
assessment developers.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Discussion</title>
      <p>We discuss general issues that resulted from our work on
designing and implementing evidence-based conversations
using RegEx/LSA and LLM approaches. These issues
include:</p>
      <sec id="sec-3-1">
        <title>5.1.1. Natural conversations v. predefined conversations</title>
        <p>Compared to RegExp/LSA conversations, LLMs produce
more natural and longer interactions. RegExp/LSA
conversations were more focused and shorter (about 2 to 5 turns).
Also, after some iterative prompt refinement, appropriate
LLM mixed-initiative conversations were obtained. The
resulting prompts kept conversations focused on the target
construct. However, some challenges include developing
scoring approaches that can deal with the complexity of
natural conversations, and evaluating potential semantic
drift that could arise in long conversations. Path-based
partial scoring approaches developed using conversation
diagrams (see Section 2) can be used to inform the
development of scoring approaches for these conversations.
LLMs could support the analysis and scoring of the
conversations produced.</p>
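        <p>The path-based partial scoring idea can be sketched as follows. The expectations and weights are hypothetical stand-ins; in practice they would come from the conversation diagrams and expert review.</p>

```python
# Path-based partial scoring sketch: credit accrues as the student covers
# the expected key elements anywhere across the conversation, echoing how
# the conversation diagrams assign partial scores. Weights are illustrative.
EXPECTATIONS = {
    "seismic activity": 0.4,
    "ground swelling": 0.4,
    "gas emissions": 0.2,
}

def partial_score(student_turns):
    """Sum weights of expectations mentioned anywhere in the conversation."""
    transcript = " ".join(turn.lower() for turn in student_turns)
    return sum(w for key, w in EXPECTATIONS.items() if key in transcript)
```

        <p>Because credit is pooled over the whole transcript, a student who covers one expectation early and another after a hint still receives full credit for both, mirroring the answer-accumulation behavior described in Section 2.</p>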
      </sec>
      <sec id="sec-3-2">
        <title>5.1.2. Transparency, fairness, and bias</title>
        <p>Prompts have been designed to address particular aspects
of the construct. However, LLMs may perform differently
depending on the training material available for
particular constructs, and LLM conversations may include
bias and hallucinations. Human-in-the-loop approaches
are needed to evaluate fairness and bias issues. In addition,
approaches to improving explainability (e.g., documentation
of how the conversation addresses aspects of the
construct) and the comparability of conversations across students
should be investigated.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.1.3. Time and effort</title>
        <p>Although the cost and time needed to generate
evidence-based conversations have been significantly reduced using
LLMs, additional time and effort are required to ensure that
the resulting conversations are appropriate and free of bias,
and that the scores reflect those assigned by human
experts. We expect that, with further advances and the
development of new tools in this area, the complexity and
cost of designing and using LLM conversations for
assessment purposes will continue to decrease.</p>
      </sec>
      <sec id="sec-3-4">
        <title>5.1.4. Pedagogical issues</title>
        <p>This activity raised an interesting pedagogical question: To
what extent should the prompt constrain the LLM to the
topic at hand? Answering this question goes beyond
identifying effective prompt engineering strategies for
producing the desired result; the deeper question is what the
desired result should be. Allowing for some breadth in the
conversation as a topic is explored may support engagement and
give students a sense of agency, but conversations that
veer too far off-topic can be unproductive. Teachers and
assessment developers regularly make decisions about
whether a conversation or a task strays too far from the
construct being assessed. As a next step, we propose
consulting with teachers and assessment developers on
this issue.</p>
      </sec>
      <sec id="sec-3-5">
        <title>5.1.5. Personalized conversations</title>
        <p>
          A question for further research is whether LLM
conversations can be further personalized to the learner. As
discussed, at the end of the conversation the LLM can
produce an XML-formatted Toulmin evidence diagram,
including an estimate of mastery and alignment to the
relevant standard. But suppose the LLM could adapt its
interaction based on the status of a learner model (e.g.,
proficiency estimates or other aspects of the learner such as
engagement, persistence, and interest levels). This
might be accomplished through prompt engineering, or by
passing evidence from a Bayesian student model to the
LLM to inform its next best conversational move.
Approaches have been proposed for triggering conversations
based on the status of a learner model [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ][
          <xref ref-type="bibr" rid="ref19">19</xref>
          ][
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
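        <p>To make the idea concrete, the following minimal sketch (not the paper's implementation) shows how a Bayesian mastery estimate could be updated from each student response and mapped to a next conversational move for the prompt. The slip/guess parameters, thresholds, and move labels are all assumptions for illustration:</p>

```python
# Illustrative sketch: Bayesian update of P(mastery) from one observed
# response, in the style of knowledge tracing. Parameter values are
# hypothetical.

def update_mastery(p_mastery, correct, p_slip=0.1, p_guess=0.2):
    """Bayes rule: posterior P(mastery) given a correct/incorrect response."""
    if correct:
        likelihood_master = 1 - p_slip   # a master answers correctly
        likelihood_other = p_guess       # a non-master guesses correctly
    else:
        likelihood_master = p_slip
        likelihood_other = 1 - p_guess
    numerator = p_mastery * likelihood_master
    return numerator / (numerator + (1 - p_mastery) * likelihood_other)

def next_move(p_mastery):
    """Map the current estimate to a conversational move for the LLM prompt."""
    if p_mastery > 0.8:
        return "pose a transfer question"
    if p_mastery > 0.4:
        return "ask for elaboration"
    return "give a hint"
```

        <p>In such a loop, the updated estimate (or the full evidence diagram) would be passed back into the prompt so the LLM's next turn reflects the learner model's current state.</p>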
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Summary</title>
      <p>
        ECD-based conversation-based assessments have great
potential for creating assessments that provide
students with appropriate and engaging opportunities to
demonstrate what they know and can do. Advances in
Generative AI are helping to reduce the complexity of designing
and implementing CBAs, which can contribute positively to
the adoption and scalability of these systems.
Conversational assessment is a good match for measuring skills
in a natural context; for example, argumentation, science
inquiry, collaboration, and language skills
can all be assessed more naturally using conversational
methods. However, important challenges remain to be
addressed in this area (see the Discussion section).
We expect that innovative solutions to these challenges
will become available in the near future given the rapid
progress in this field. Future work will include a quantitative
comparison between the LSA/RegEx and LLM approaches
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and evaluation of LLM conversations with students
and/or teachers.
      </p>
    </sec>
    <sec id="sec-5">
      <title>7. Acknowledgements</title>
      <p>This material is based upon work supported by the
National Science Foundation and the Institute of Education
Sciences under Grant #2229612. Any opinions, findings,
and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily
reflect the views of the National Science Foundation or the
U.S. Department of Education.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinnebrew</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sulcer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roscoe</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Measuring self-regulated learning skills through social interactions in a teachable agent environment</article-title>
          .
          <source>Res. Pract. Technol. Enhanc. Learn</source>
          .
          <volume>5</volume>
          ,
          <fpage>123</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Millis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Predicting learning in a multi-component serious game</article-title>
          .
          <source>Technology, Knowledge, and Learning</source>
          ,
          <volume>25</volume>
          ,
          <fpage>251</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zapata-Rivera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graf</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2024</year>
          )
          <article-title>Complex Conversations: LLMs vs. Knowledge Engineered Conversation-based Assessment</article-title>
          .
          <source>In The Proceedings of the International Conference on Educational Data Mining.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Conversations with AutoTutor help students learn</article-title>
          .
          <source>Int. J. Artif. Intell. Educ</source>
          .
          <volume>26</volume>
          ,
          <fpage>124</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Person</surname>
            ,
            <given-names>N.K.</given-names>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>Question asking during tutoring</article-title>
          .
          <source>Am. Educ. Res. J</source>
          .
          <volume>31</volume>
          ,
          <fpage>104</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Person</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harter</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and the Tutoring Research Group.
          <year>2001</year>
          .
          <article-title>Teaching tactics and dialog in AutoTutor</article-title>
          .
          <source>International Journal of Artificial Intelligence in Education</source>
          ,
          <volume>12</volume>
          ,
          <fpage>257</fpage>
          -
          <lpage>279</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2023</year>
          .
          <article-title>Empowering education with LLMs - the next-gen interface and content generation</article-title>
          .
          <source>[demo] Presented at The Inter. Conference on Artificial Intelligence in Education</source>
          . (Tokyo, Japan,
          <source>July 03-07</source>
          ,
          <year>2023</year>
          ).
          <source>AIED</source>
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>W.L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lester</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Face-to-face interaction with pedagogical agents, Twenty years later</article-title>
          .
          <source>International Journal of Artificial Intelligence in Education</source>
          .
          <volume>26</volume>
          ,
          <issue>1</issue>
          ,
          <fpage>25</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2008</year>
          .
          <source>Speech and Language Processing</source>
          . Prentice Hall, Englewood,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albacete</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chounta</surname>
            ,
            <given-names>I.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McLaren</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zapata-Rivera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Linking dialog with student modeling to create an adaptive tutoring system for conceptual physics</article-title>
          .
          <source>Int. J. Artif. Intell. Educ</source>
          .
          <fpage>1</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamara</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dennis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kintsch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Handbook of Latent Semantic Analysis</article-title>
          . Erlbaum, Mahwah, NJ.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>The impact of conversational agents' language on summary writing</article-title>
          .
          <source>J. Res. Technol. Educ</source>
          .
          <volume>53</volume>
          ,
          <issue>1</issue>
          . doi:10.1080/15391523.2020.1826022.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qureshi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bejar</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Conversation-based Assessments: An Innovative Approach to Measure Scientific Reasoning</article-title>
          .
          <source>Bulletin of the IEEE Tech. Comm. on Learn. Tech</source>
          .
          <volume>18</volume>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>10</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Millis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Halpern</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Operation ARIES! A serious game for teaching scientific inquiry</article-title>
          .
          <source>In Serious Games and Edutainment Applications</source>
          .
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oikonomou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lakhmi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          Eds. Springer-Verlag, London,
          <fpage>169</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Mislevy</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinberg</surname>
            ,
            <given-names>L.S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Almond</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>On the structure of educational assessments</article-title>
          .
          <source>Meas. Interdiscip. Res. Perspect. 1</source>
          ,
          <fpage>3</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Olney</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Mello</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Person</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cade</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Guru: A computer tutor that models expert human tutors</article-title>
          .
          <source>In Proceedings of the 11th International Conference on Intelligent Tutoring Systems</source>
          .
          <string-name>
            <surname>Cerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clancey</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadourakis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panourgia</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Eds. Springer-Verlag, Berlin Heidelberg,
          <fpage>256</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          .
          <year>2024</year>
          .
          <article-title>Microsoft Copilot (GPT-4)</article-title>
          [Software].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Pardos</surname>
            ,
            <given-names>Z. A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bhandari</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2023</year>
          .
          <article-title>Learning gain differences between ChatGPT and human tutor generated algebra hints</article-title>
          .
          <source>Retrieved from arXiv:2302.06871v1</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Redondo-Hernández</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pérez-Marín</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>A procedure to automatically adapt questions in student-pedagogic conversational agent dialogues</article-title>
          .
          <source>In International conference on user modeling, adaptation, and personalization</source>
          .
          <fpage>122</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Rosé</surname>
            ,
            <given-names>C. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howley</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ferschke</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Assessment of discussion in learning contexts</article-title>
          .
          <source>In Innovative Assessment of Collaboration</source>
          . von Davier, A.,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp; Kyllonen, P. Eds. New York, NY: Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Toulmin</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <year>1958</year>
          .
          <article-title>The uses of argument</article-title>
          . Cambridge, United Kingdom: University Press.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Zapata-Rivera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          <year>2022</year>
          .
          <article-title>Learner Modeling in Conversation-Based Assessment</article-title>
          .
          <source>In Adaptive Instructional Systems. HCII 2022. Lecture Notes in Computer Science</source>
          , vol
          <volume>13332</volume>
          .
          <string-name>
            <surname>Sottilare</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwarz</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          . (eds). Springer, Cham. https://doi.org/10.1007/978-3-031-05887-5_6
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Zapata-Rivera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jackson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>I.R.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Authoring conversation-based assessment scenarios</article-title>
          .
          <source>In Design Recommendations for Intelligent Tutoring Systems, Volume 3: Authoring Tools and Expert Modeling Techniques</source>
          .
          <string-name>
            <surname>Sottilare</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Brawner</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Eds. U.S. Army Research Laboratory,
          <fpage>169</fpage>
          -
          <lpage>178</lpage>
          .
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Zapata-Rivera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jackson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vezzu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>I. R.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Assessing science inquiry skills using Trialogues</article-title>
          .
          <source>12th International Conference on Intelligent Tutoring Systems</source>
          .
          <fpage>625</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Zapata-Rivera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sparks</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forsyth</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lehman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2023</year>
          .
          <article-title>Conversation-based assessment:</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>