<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Communication with Individuals with Disabilities and the Role of LLMs: Progress Report</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>May Lutzen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcello Balduccini</string-name>
        </contrib>
        <aff>Saint Joseph's University</aff>
      </contrib-group>
      <abstract>
        <p>This paper is part of a project aimed at leveraging neuro-symbolic architectures to achieve a sophisticated interaction between humanoid robots and individuals with cognitive impairments. In our architecture, a symbolic reasoning module based on Answer Set Programming creates a sequence of appropriate activities and monitors their execution in real-time. Large Language Models (LLMs) are used to enhance the user experience in various ways. Among those, in this paper we investigate methods for using LLMs to reword text produced by the reasoning component, while keeping length and vocabulary level consistent with the original text. The study explores the effectiveness of metrics such as length and frequency of use in comparing the vocabulary level of the input with that of the output. We present a comparative analysis of free or moderately priced LLMs, such as GPT-3.5, Google Gemini Pro, and Claude 3 Opus. A continuous validation process is also introduced, utilizing a critic that evaluates the appropriateness of the generated output at run-time. Although preliminary, the findings appear to indicate that while LLMs can often produce outputs with a vocabulary level comparable to the inputs, there are areas needing improvement, particularly in handling specific domain knowledge or less common phrases. This research contributes to the exploration of novel neuro-symbolic architectures and to the practical application of LLMs in contexts where controlled language use is essential for effective communication.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Neuro-Symbolic Robot Architectures</kwd>
        <kwd>Knowledge Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Humanoid robots employed in healthcare represent a diverse and rapidly evolving research field that
encompasses various areas of investigation. For instance, studies have explored the use of AI-enabled
devices in providing companionship and social interaction for seniors, as highlighted in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], underscoring
the potential of humanoid robots to address social isolation and promote well-being among elderly
populations. Moreover, recent research [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] highlighted how AI can subtly guide individuals towards
specific choices with more effective nudging capabilities than humans, suggesting that an automated
assistant might be a useful tool for guiding patients towards healthier behaviors.
      </p>
      <p>This paper is part of a project stemming from a partnership with Bancroft (https://bancroft.org), a
non-profit organization dedicated to helping individuals with autism, intellectual and developmental
disabilities, brain injury, and neurological conditions. A challenge faced by service providers like
Bancroft is that direct care facility staff may become overwhelmed with routine tasks, leading to low job
satisfaction and retention, and hindering their performance in higher-level duties and communication
with clinical staff. High turnover exacerbates these issues, as new and temporary staff lack the specific
knowledge needed for effective patient care. All of these staffing issues, in the long term, cause reduced
quality of care, thus impacting the quality of life of the individuals served. The goal of the project is to
research methods at the intersection of symbolic reasoning, machine learning, and robotics that can
lead to enhancing both staff satisfaction and resident care.</p>
      <p>
        The project augments Iggy, a Pepper humanoid robot [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], with a neuro-symbolic architecture called IR4D (Intelligent Robotics for Disabilities), which combines the powerful knowledge representation and reasoning capabilities of Answer
Set Programming (ASP) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with Large Language Models (LLMs).
      </p>
      <p>Given information on a group of patients, the reasoning module builds a sequence of appropriate
activities and monitors their execution, making real-time adjustments based on user responses. Whenever
a menu is presented on the robot’s display, the reasoning module generates text that is read out by the
robot to introduce the menu. This is done not only for a more natural interaction but also because some
of the patients have minimal or no reading skills. Given that the robot controlled by the architecture
interacts with individuals with potentially impaired cognitive skills, the text produced by the symbolic
reasoning component is based on fragments that are carefully crafted by psychologists in order to
ensure that the vocabulary level is suitable for the audience.</p>
      <p>Depending on the circumstances, the same menu may be displayed multiple times in a row. In order
to make the interaction more engaging for the audience and in order to make the robot sound more
natural, the robot automatically produces variations of the text generated by the symbolic reasoning
component. While LLMs are a natural choice for generating these variations, the variations must still
match the audience’s capabilities, and in particular their (receptive) vocabulary level.</p>
      <p>In this paper, we present an initial evaluation of methods for controlling the vocabulary level of the
text produced by LLMs in rewording tasks, with specific attention to use with individuals with
cognitive impairment. We begin by considering different kinds of metrics normally associated with
the vocabulary level of text, and then evaluate the use of a continuous validation process, in which the
output produced by the LLM is vetted at runtime by a critic that determines if the output is acceptable.
Due to considerations on practical use, in this paper we focus our evaluation on LLMs that are either
free or moderately priced, but also of sufficient power and ease of use, i.e., GPT-3.5, Google Gemini Pro,
and Claude 3 Opus. In a later phase of the study, we will evaluate more powerful versions of these as
well as self-hosted LLMs.</p>
      <p>The paper is organized as follows. We begin with a discussion on methods of measuring the
vocabulary of text and clarify the research questions that our analysis aims at answering. The following
sections analyze metrics of progressively increasing sophistication. We conclude the paper with final
considerations and a discussion on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Measuring the Vocabulary Level of Text</title>
      <p>
        Typically, the evaluation of the vocabulary level of text takes into account metrics such as length,
frequency of use, morphological complexity and semantic complexity, as discussed, e.g., in [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ].
Length refers to the number of characters, words, and syllables in the text, as it is conceivable that
longer text might require stronger cognitive efforts. Frequency of use refers to how often words are used
in everyday language. If they are less common, they might be more complex and thereby more difficult
to understand. Morphological complexity refers to the complexity of a word’s internal structure; words
with more complex structures might be more difficult to understand. Lastly, semantic complexity
refers to the number of meanings or senses a word has. If a word has multiple meanings, reaching the
desired understanding of the word can be much more complex. In this paper, we focus on the first two
metrics, and use the following research questions to guide the analysis:
• How capable are the LLMs of producing output that is of comparable length to the input if
instructed to do so? And how reliable is length as an indicator of vocabulary level in the context
of our application domain?
• How capable are LLMs of producing output in which frequency of use is comparable to that of
the input?
• Is there an algorithmic way of leveraging frequency of use to improve the vocabulary
level of output text?
      </p>
      <sec id="sec-2-1">
        <title>Benchmark Set</title>
        <p>We conduct our analysis over a benchmark set of 10 passages that covers all the categories of passages
generated by the reasoning module during a session. Categories are represented by a uniform number
of passages in the benchmark set. The passages are shown in Figure 1.</p>
        <p>For each input passage, we produce three output passages for each LLM. We chose to produce
multiple output passages in order to limit the impact of statistical anomalies on our analysis.</p>
        <p>It is important to note that some of the passages contain the names of activities that the audience
can choose from using the robot’s display. Those names are familiar to the audience. In preliminary
experiments, we observed that the LLMs have a tendency to reword the activity names, often making
the activities difficult to recognize and sometimes drastically altering the semantics of the names (for
example, turning “a name that sound game” into “a name-based sound game”, and even rewording
“an emotion recognition game” into “a mind-reading game”). To our surprise, we were only partially
successful at designing prompts that directly prevented the LLMs from rewording the activity names,
and resorted to an algorithmic approach for preserving them. In our algorithmic preservation approach,
the activity names are replaced by strings that the LLMs do not appear to reword, such as “X1X”. After
receiving the LLM’s output, they are then converted back to produce the final output. While in our
analysis we considered both kinds of approaches to activity name preservation, in this paper we focus
on the analysis of the passages obtained through the algorithmic preservation approach. Additionally,
due to space considerations, in the sections that follow we will be providing only excerpts of some of
the tables. The complete tables can be found at https://g34r.short.gy/dataset.</p>
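        <p>The algorithmic preservation approach described above can be sketched in a few lines of Python. The sketch below is ours, not code from the project; the function names and the plain string substitution are assumptions, and a production version would need to guard against placeholder collisions with the passage text:</p>
        <preformat>
```python
def protect_names(text, activity_names):
    """Swap each activity name for a placeholder string that, in our
    experience, the LLMs do not attempt to reword (e.g. "X1X")."""
    mapping = {}
    # Longest names first, so that "a name that sound game for David"
    # is matched before its substring "a name that sound game".
    for i, name in enumerate(sorted(activity_names, key=len, reverse=True), 1):
        token = f"X{i}X"
        mapping[token] = name
        text = text.replace(name, token)
    return text, mapping

def restore_names(text, mapping):
    """Convert placeholders in the LLM's output back to activity names."""
    for token, name in mapping.items():
        text = text.replace(token, name)
    return text
```
        </preformat>
        <p>The input passage is run through protect_names before prompting, and the LLM's reply through restore_names afterwards, producing the final output.</p>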
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Length-Based Metrics</title>
      <p>Length of text is a readily-usable metric in the context of LLMs, as phrases such as “of length comparable
to the input” can be easily included in the prompt to an LLM. In this section, we evaluate the output
produced by the LLMs under consideration when the prompt instructs the LLM to constrain the length
of the output. Specifically, we use the prompt:
“Take what I say after the new-line character and paraphrase it. The new text should have a
similar length to the one in input as well as the same tone and language.”
The complete list of the output passages produced by the LLMs can be found at the link provided earlier.
We provide sample output passages, specifically for P10, in Figure 2.</p>
      <p>In this section, we compare the various LLMs in terms of number of characters, words, and syllables
of the output with respect to those of the input.</p>
      <p>For every metric and LLM, we calculate the average change (i.e., the difference between input length and
output length) for every passage, as well as the average change across all passages. The results of the analysis
on length in characters are shown in Figure 3. The analysis of the average change, per LLM, on each
input passage shows a clear trend. GPT has the lowest average change, followed by Gemini and then
Claude, which has the highest average change. This is also the case for the aggregate average change
on all passages, which is shown in the 3 bottom rectangles for each metric, where GPT has 0% average
change, Gemini 10%, and Claude 44%. The results from the analysis of length measured in characters
are confirmed by the average change in length measured by word count, where the average change
across all passages is -4% for GPT, 5% for Gemini, and 34% for Claude.</p>
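      <p>The per-passage comparison above can be reproduced with a short script. The sketch below is ours; in particular, the syllable counter is a rough vowel-group heuristic that merely approximates a proper syllable count:</p>
      <preformat>
```python
import re

def syllables(text):
    """Rough syllable count: one syllable per vowel group in each word."""
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
               for w in text.split())

def pct_changes(inp, out):
    """Percent change from input to output in characters, words, syllables."""
    pairs = [
        (len(inp), len(out)),                  # characters
        (len(inp.split()), len(out.split())),  # words
        (syllables(inp), syllables(out)),      # syllables (heuristic)
    ]
    return tuple(round(100 * (o - i) / i) for i, o in pairs)
```
      </preformat>
      <p>Averaging these percentages over the three outputs produced per passage, and then over all passages, yields the kind of per-LLM aggregate figures reported above.</p>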
      <p>While length and average change in length might show how well a large language model is able to
follow directions from a prompt, our analysis highlighted that, at least for our target population, the
length of the text is not a good indicator of the vocabulary level. Consider for instance the input text:
“Before we say goodbye, I have a little surprise for you. Here are some activities that I hope you’ll
like: ‘a name that sound game for David’, ‘some dancing fun’, and ‘a name that sound game’.
Pick from one of the boxes on my screen to start an activity!”</p>
      <sec id="sec-3-1">
        <title>The output created by Claude is:</title>
        <p>(One might argue that more sophisticated ways of phrasing the prompt should be considered. While that is indeed possible,
we intentionally stay away from the route of prompt engineering and focus on analyzing what LLMs are capable of producing
given a reasonably precise prompt that an average human would likely be able to understand and act upon.)</p>
        <p>Hold your horses! Before we part ways, I’ve got a treat in store. I’ve handpicked some cracking
activities that I reckon you’ll dig: “a name that sound game for David”, “some dancing fun”, and
“a name that sound game”. Go ahead and click on any of the boxes on your screen to get the party
started!
The vocabulary level of phrases such as “hold your horses”, “handpicked some cracking activities” and
“I reckon you’ll dig” is significantly different from that of “I have a little surprise” and “I hope you’ll like”, even
though the change in number of characters (13%) and in number of words (12%) might be considered
modest by comparison.</p>
        <p>When it comes to syllables, Figure 3 shows that the overall average change in number of syllables
follows trends similar to those of the number of characters and words. GPT has the overall
smallest average change, at only 2% from input to output; Gemini is second, with a 10% overall
average change; and Claude performed the worst, with a 49% overall average change.
This analysis shows that GPT is best at following directions in the prompt
about the desired length of a paraphrased output, whereas
Gemini is slightly worse but can still be considered acceptable, as it is within a 10% change. However, in
many of the 30 Claude cases in our analysis, the LLM includes part of the prompt in its output, such as
“with the specified variables”. One reason for Claude’s large increase in length is therefore that it tends
to include part of the prompt in the output; this also shows that it is unable to follow the directions in
the prompt.</p>
        <p>However, the number of syllables does not appear to have a strong relationship with the vocabulary
level, at least for the types of passages and population considered here. Consider one of the outputs
produced by Claude for P5:
• Input: Before we say goodbye, I have a little surprise for you. Here are some activities that I hope
you’ll like: “a Frosty the Snowman sing-along”, “a name that sound game for David”, and “a name
that tune game”. Pick from one of the boxes on my screen to start an activity!
• Output: Before we part ways, I’ve got a special treat in store for you. Take a look at these fun options
I’ve prepared: “a Frosty the Snowman sing-along”, “a name that sound game for David”, and “a name
that tune game”. Simply select one of the boxes displayed to embark on an exciting activity!
Here, phrases such as “part ways”, “special treat in store for you” and “embark on” have a significantly
different vocabulary level than “goodbye”, “little surprise” and “to start”.</p>
        <p>In conclusion, while the change in length between input and output can tell us how well an LLM
follows directions related to the length of the output (and is useful in preventing the output from
being boring or wordy), it does not help ensure that the vocabulary level of the output matches that of
the input.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Frequency-of-Use Metric</title>
      <p>Of the other possible options for estimating the vocabulary level of a passage, the frequency of use
(abbreviated uf in this paper) of its words appears to be a practically viable metric. The intuition behind
this metric is simple: frequency of use refers to how often certain words are used in a corpus of text
that is used as a reference; if words are less common, it is conceivable that they are more complex or
difficult to understand.</p>
      <p>The minimum frequency of use (abbreviated min uf) across all the words of a passage seems
particularly useful in our application domain. After all, we are looking for text that, when spoken by the
robot, can be promptly understood by our target population. Even a single word that is outside of the
receptive vocabulary of the audience can be sufficient to prevent them from promptly understanding
the passage. It must be noted, however, that it is not sufficient to consider the uf of individual words:
phrases, especially if idiomatic, may have a meaning that is different from that of their constituent words,
and a frequency of use that is also different from that of the constituent words. Consider for example “let the
fun begin”. While the individual words of this phrase are fairly common, the specific combination is not
as frequently used as, for example, “let’s have fun.” For this reason, we adopt the following approach.</p>
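      <p>To see why word-level frequencies are not enough, compare the minimum frequency over the individual words of “let the fun begin” with the frequency of the phrase as a whole. The numbers in the sketch below are hypothetical placeholders for illustration, not values from the ngrams tool:</p>
      <preformat>
```python
# Hypothetical frequencies for illustration only; in our analysis, values
# come from the absTotalMatchCount metric of the ngrams tool.
UF = {
    "let": 9_000_000, "the": 50_000_000, "fun": 2_000_000,
    "begin": 1_500_000, "let the fun begin": 1_200,
}

phrase = "let the fun begin"
word_level = min(UF[w] for w in phrase.split())  # min over individual words
phrase_level = UF[phrase]                        # the idiom as a whole
# The idiom is far rarer than any of its constituent words, so a
# word-level minimum would badly overestimate its frequency of use.
assert word_level > phrase_level
```
      </preformat>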
      <p>Let T(p) denote the parse tree associated with passage p (if multiple parse trees exist, we assume that
a parse tree has been arbitrarily selected). For a node n ∈ T(p), let uf(n) denote the frequency of use
of the text associated with n. uf(n) is undefined if no frequency of use is available for the text.
Definition 1.</p>
      <sec id="sec-4-1">
        <title>A node n of T(p) is a uf-top node if: • uf(n) is defined, and • uf(n′) is undefined for every ancestor n′ of n in T(p).</title>
        <sec id="sec-4-1-1">
          <title>We can now give the following:</title>
          <p>Definition 2. Given a passage p:
• The min uf of p, written min-uf(p), is the minimum of uf(n) over the uf-top nodes of T(p):
min-uf(p) = min { uf(n) : n is a uf-top node of T(p) }
• The min-uf phrase of p is the fragment of text associated with a uf-top node of T(p) that has
minimum uf. (For simplicity, if multiple such nodes exist, we select the first node in a leftmost traversal
of the parse tree.)</p>
          <p>For the analysis that follows, the parse trees were generated using the spacy library with the
en_core_web_lg model. The uf of phrases has been obtained from the absTotalMatchCount metric
of the ngrams tool with default settings. ngrams uses the Google Books Ngram Dataset v3 as the
source of ngram data.</p>
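          <p>Definitions 1 and 2 translate directly into a tree traversal. The sketch below is ours and uses a toy tree and a hypothetical uf table in place of the spacy parse tree and the ngrams lookup (the value for “screen” is invented; those for “on my screen” and “our prior encounter” are the ones discussed later in this section):</p>
          <preformat>
```python
from dataclasses import dataclass, field

# Hypothetical uf lookup standing in for the ngrams absTotalMatchCount.
UF = {"on my screen": 33090, "our prior encounter": 72, "screen": 422_000}

@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)

def uf(node):
    """Frequency of use of the node's text; None means undefined."""
    return UF.get(node.text)

def uf_top_nodes(root, ancestor_defined=False):
    """Nodes with a defined uf all of whose ancestors have undefined uf."""
    tops = []
    if uf(root) is not None:
        if not ancestor_defined:
            tops.append(root)
        ancestor_defined = True  # descendants can no longer be uf-top
    for child in root.children:
        tops.extend(uf_top_nodes(child, ancestor_defined))
    return tops

def min_uf(root):
    """The min uf of a passage: minimum uf over its uf-top nodes."""
    tops = uf_top_nodes(root)
    return min(uf(n) for n in tops) if tops else None
```
          </preformat>
          <p>On a tree whose root covers a whole passage (undefined uf), with subtrees for “on my screen” (which dominates “screen”) and “our prior encounter”, the uf-top nodes are the two phrase nodes and the min uf is 72; “screen” is excluded because an ancestor already has a defined uf.</p>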
        </sec>
        <sec id="sec-4-1-2">
          <title>Notes on Tooling</title>
          <p>The spacy library is available at https://spacy.io and the ngrams tool at https://ngrams.dev/.
It is certainly possible that different settings may affect the results of our analysis, particularly if one selects specific year
ranges for the frequency of the ngrams. We defer an evaluation of the impact of these settings to a later phase of our study.</p>
          <p>At least at first sight, the min uf appears to provide a meaningful indication of the vocabulary level of
passages. For example, the phrase “on my screen”, which one might consider fairly simple to understand,
has a comparatively high min uf of 33090, while phrases such as “our prior encounter” and “infringing
on any copyrights”, which arguably belong to a higher vocabulary level, have comparatively low min
ufs of 72 and 42, respectively.</p>
          <p>So, the question is: how capable are LLMs of producing output whose min uf is comparable to the
min uf of the input given our prompt?</p>
          <p>To answer the question, we compared the min uf of our input and output passages, and report the
results of our analysis in Figure 4.</p>
          <p>Out of the 90 different cases we analyzed, in 68 of them the output’s min uf is either equal to or
greater than the input’s min uf. This means that in 76% of the cases the output’s min uf is appropriate
compared to the input’s.</p>
          <p>Breaking the results down further by considering each LLM separately, we see that in GPT’s case the
output’s min uf is greater than or equal to that of the input in 83% of cases. In Gemini’s case, this holds
in 77% of cases, and in Claude’s case in 67% of cases. Thus,
GPT appears to have a fair margin over Gemini, while Claude is the worst performer.</p>
          <p>These numbers show that LLMs are rather satisfactory in their ability to produce an output whose
min uf is comparable to that of the input. However, upon further investigation, we noticed that the
observed good performance of the LLMs occurs for the most part in cases in which the min-uf phrase
of the input is an activity name. Table 5 provides a comparison of the min uf of input and output for
the passages whose min-uf phrases do not contain an activity name. Interestingly, in those cases the
output’s min uf is almost always less than the input’s min uf.</p>
          <p>Specifically, the min uf of the output is less than that of the input in 16 cases out of 18, or 89% of the
cases, thereby showing a complete switch in observed behavior compared to the previous comparison.</p>
          <p>The switch can be explained as follows. It so happens that some activity names are rather infrequent
in the corpus used for the calculation of min uf (even though they are promptly understood by the
audience), causing them to be the min-uf phrases. Since the rewording process ensures that the activity
names are kept verbatim in the output, in many cases those are also the min-uf phrases of the output.
However, the fact that the min uf of the input and of the output coincide does not tell us much about
the quality of the output produced by the LLMs, only that the uf of the phrases they produced is no
worse than that of the activity names. And because the uf of the activity names is fairly low, that is not
a particularly significant achievement. (One might argue that the min uf of the output should not simply
be greater than, or equal to, that of the input, but that the difference should be within a certain threshold.
We defer such an evaluation to a later phase of our study.)</p>
          <p>For this reason, we decided to re-evaluate the passages disregarding the activity names in the
calculation of the min uf in those cases in which the activity names would have otherwise been the
min-uf phrases of both input and output, leading to the comparison outlined in Figure 6. Looking at the
secondary min uf phrase for these cases, we observed that in 50% of cases the min uf of the output was
worse than that of the input. That is, in only 50% of the cases the passage produced by the LLM was
satisfactory in terms of min uf.</p>
          <p>It is worth noting a potential issue with the ngrams used for estimating frequency of use. As
seen from Figure 6, in a number of cases in which the min uf of the output is greater than or equal to
the min uf of the input, the min-uf phrase of the input is “Which activity would you like.” For some
reason, that phrase is assigned a rather low uf, which causes most outputs to have a better min uf
than the input passage – even in cases where the min-uf phrase of the output is “with the specified
variables”, which intuitively seems to be at a higher vocabulary level. As we discuss later in this paper,
we hypothesize that better quality ngram data may reduce this type of noise.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. A Vocabulary Level Critic: an Evaluation</title>
      <p>In the previous section, we saw how LLMs have some capability of producing output of comparable
vocabulary level to the input as estimated via uf, although the performance of LLMs is less than ideal.
In this section, we consider the possibility of algorithmically filtering the output of the LLMs based on
this metric, and evaluate whether uf indeed provides a reliable filtering mechanism.</p>
      <p>For our evaluation, we consider a simple filtering mechanism based on a continuous validation of
the output of the LLM being used. That is, we can leverage min uf to develop a critic that compares
the min uf of the output with that of the input. If the min uf of the output is lower than that of the
input, the critic rejects the output, and the architecture may decide to either have the LLM produce
a new output, or use the input directly. In practice, the decision will be likely based on timeliness
considerations: given the statistical nature of LLMs, there is no telling how many times the output will
need to be regenerated before something of acceptable quality is produced, and long delays before the
robot speaks might make for an unnatural experience.</p>
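      <p>A minimal sketch of this critic, assuming a min_uf function like the one defined in the previous section and the LLM wrapped as a generate callable (both names are ours, not part of the architecture's API):</p>
      <preformat>
```python
def critic_accepts(input_min_uf, output_min_uf):
    """Accept an output only if its min uf is at least the input's."""
    return output_min_uf is not None and output_min_uf >= input_min_uf

def reword(passage, generate, min_uf, max_attempts=3):
    """Ask the LLM for a rewording, retrying a bounded number of times.

    The attempt bound reflects the timeliness considerations above: rather
    than regenerate indefinitely, fall back to the original passage so the
    robot never pauses unnaturally before speaking.
    """
    target = min_uf(passage)
    for _ in range(max_attempts):
        candidate = generate(passage)
        if critic_accepts(target, min_uf(candidate)):
            return candidate
    return passage
```
      </preformat>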
      <p>Our goal in this section is to answer the question: suppose the critic rejects the output produced by the
LLM; based on human judgement, is that a good decision or a bad decision? This evaluation will give us
some insights both on the reliability of min uf as a filtering metric and on whether the critic is a practically
viable algorithmic filtering method.</p>
      <p>The results of this evaluation are summarized in Figure 7, where a judgement of “good decision” is
indicated as a true positive, and a bad decision as a false positive. At this early stage of the project,
we relied on our own judgment to determine whether a decision was good or bad. In the future, we
plan to conduct a more rigorous evaluation with independent test subjects. In spite of the simplicity
of our validation method, we believe that the evaluation provided some interesting results. As the
figure illustrates, out of 53 cases flagged by the critic, 33 are true positives, which means that in 62%
of cases min uf is a reliable indicator of whether the vocabulary level of the output produced by an
LLM is acceptable. Figure 8 provides a breakdown of the information for each LLM. For GPT, the true
positive rate is 61%, for Gemini it is 79%, and for Claude it is 52%. One consideration to make is that the
generation of the parse tree, and possibly also the ngram data, appear to introduce some noise in the
filtering process. For example, some of the cases marked as false positives during our validation are
due to the string “’s have some fun” being the min-uf phrase in the output and being associated with a
low uf. Those cases were marked as false positives because it is clear that the
string refers to “let’s have some fun”, and the vocabulary level of that phrase is low. We speculate that,
had the parse tree correctly extracted the string “let’s have some fun”, its uf might have been higher
and the output would not have been rejected on its account. (A similar result would have been
obtained if the ngram data had not included the entry “’s have some fun”.) That is, we hypothesize that
better quality of the parsing process and of the ngram data might lead to higher accuracy of the
filtering process.</p>
      <p>The data also highlights another interesting aspect. In a certain number of cases marked as false
positives during the human validation, the min-uf phrases in question contain place names related to
the location of the robot, such as “Innovation Center” and “Saint Joseph’s University.” Clearly, those
names occur infrequently in the text corpus used for determining the uf, but are well known to the
audience. The results obtained after excluding those place names from the calculation of the min uf
are illustrated in Figure 9. With this refinement, the percentage of true positives is substantially higher,
standing at 70%, with 67% for GPT, 91% for Gemini, and 61% for Claude.</p>
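      <p>This refinement amounts to filtering known names out of the min uf computation. A sketch, assuming the uf-top phrases and their frequencies have already been extracted (the frequency values below are hypothetical):</p>
      <preformat>
```python
# Place names that are rare in the reference corpus but familiar to the
# audience, so they should not drag the min uf down.
KNOWN_TO_AUDIENCE = {"Innovation Center", "Saint Joseph's University"}

def min_uf_filtered(phrase_ufs, known=KNOWN_TO_AUDIENCE):
    """min uf over uf-top phrases, ignoring names the audience knows."""
    kept = [f for p, f in phrase_ufs.items() if p not in known]
    return min(kept) if kept else None
```
      </preformat>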
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this paper, we presented an initial evaluation of methods for controlling the vocabulary level of the
text produced by LLMs in rewording tasks, with specific attention to use with individuals whose
cognitive capabilities are impaired. We considered length-based metrics as well as frequency of use,
and defined an approach for identifying which nodes of the parse tree should be considered when
measuring frequency of use. While there is room for improvement, we hope our results provide an
informative initial account of the topic. For practical reasons, at this stage we considered a set of
free or moderately-priced LLMs. In the future, we plan to evaluate more powerful versions of those
LLMs as well as self-hosted LLMs. We will also conduct a human validation with independent test
subjects, investigate morphological and semantic complexity measures, and explore the ability of more
sophisticated prompt designs to ensure a consistent vocabulary level.</p>
      <p>Acknowledgments. This paper was made possible in part by funding provided by Bancroft. The passages evaluated
in this paper were assembled from fragments prepared by Sara Girotto.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] 6ABC - Action News Philadelphia, Seniors using AI robots to combat loneliness (https://rebrand.ly/seniors-AI), 2023. URL: 6abc.com/seniors-ai-robot-companionship/14154002/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Ganapini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fabiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Horesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Loreggia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murugesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pallagani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Venable</surname>
          </string-name>
          ,
          <article-title>Value-based fast and slow AI nudging</article-title>
          ,
          <year>2023</year>
          . arXiv:2307.07628.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Russo Spena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kaartemo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Marzullo</surname>
          </string-name>
          ,
          <article-title>Smart nudging: How cognitive technologies enable choice architectures for value co-creation</article-title>
          ,
          <source>Journal of Business Research</source>
          <volume>129</volume>
          (
          <year>2021</year>
          )
          <fpage>949</fpage>
          -
          <lpage>960</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0148296320305786. doi:10.1016/j.jbusres.2020.09.004.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>SoftBank Robotics</string-name>
          ,
          <article-title>Meet Pepper: The robot built for people</article-title>
          ,
          <year>2023</year>
          . URL: https://us.softbankrobotics.com/pepper.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gelfond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lifschitz</surname>
          </string-name>
          ,
          <article-title>Classical Negation in Logic Programs and Disjunctive Databases</article-title>
          ,
          <source>New Generation Computing</source>
          <volume>9</volume>
          (
          <year>1991</year>
          )
          <fpage>365</fpage>
          -
          <lpage>385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>Readability and the Common Core's Staircase of Text Complexity</article-title>
          ,
          <year>2012</year>
          . URL: https://textproject.org/paper/readability-and-the-common-core-staircase-of-text-complexity/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rayner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Duffy</surname>
          </string-name>
          ,
          <article-title>Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity</article-title>
          ,
          <source>Memory &amp; Cognition</source>
          <volume>14</volume>
          (
          <year>1986</year>
          )
          <fpage>191</fpage>
          -
          <lpage>201</lpage>
          . URL: https://doi.org/10.3758/BF03197692. doi:10.3758/BF03197692.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shardlow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Predicting lexical complexity in English texts: the Complex 2.0 dataset</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>56</volume>
          (
          <year>2022</year>
          )
          <fpage>1153</fpage>
          -
          <lpage>1194</lpage>
          . URL: https://doi.org/10.1007/s10579-022-09588-2. doi:10.1007/s10579-022-09588-2.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>