Communication with Individuals with Disabilities and the Role of LLMs: Progress Report

May Lutzen, Marcello Balduccini
Saint Joseph's University
Workshop on Symbolic and Neuro-Symbolic Architectures for Intelligent Robotics Technology
ISSN 1613-0073
Keywords: Large Language Models, Neuro-Symbolic Robot Architectures, Knowledge Representation

This paper is part of a project aimed at leveraging neuro-symbolic architectures to achieve a sophisticated interaction between humanoid robots and individuals with cognitive impairments. In our architecture, a symbolic reasoning module based on Answer Set Programming creates a sequence of appropriate activities and monitors their execution in real time. Large Language Models (LLMs) are used to enhance the user experience in various ways. Among those, in this paper we investigate methods for using LLMs to reword text produced by the reasoning component, while keeping length and vocabulary level consistent with the original text. The study explores the effectiveness of metrics such as length and frequency of use in comparing the vocabulary level of the input with that of the output. We present a comparative analysis of free or moderately priced LLMs, such as GPT-3.5, Google Gemini Pro, and Claude 3 Opus. A continuous validation process is also introduced, utilizing a critic that evaluates the appropriateness of the generated output at run-time. Although preliminary, the findings appear to indicate that while LLMs can often produce outputs with a vocabulary level comparable to the inputs, there are areas needing improvement, particularly in handling specific domain knowledge or less common phrases. This research contributes to the exploration of novel neuro-symbolic architectures and to the practical application of LLMs in contexts where controlled language use is essential for effective communication.

Introduction

Humanoid robots employed in healthcare represent a diverse and rapidly evolving research field that encompasses various areas of investigation. For instance, studies have explored the use of AI-enabled devices in providing companionship and social interaction for seniors, as highlighted in [1], underscoring the potential of humanoid robots to address social isolation and promote well-being among elderly populations. Moreover, recent research [2,3] highlighted how AI can subtly guide individuals towards specific choices with more effective nudging capabilities than humans, suggesting that an automated assistant might be a useful tool for guiding patients towards healthier behaviors.

This paper is part of a project stemming from a partnership with Bancroft (https://bancroft.org), a non-profit organization dedicated to helping individuals with autism, intellectual and developmental disabilities, brain injury, and neurological conditions. A challenge faced by service providers like Bancroft is that direct care facility staff may become overwhelmed with routine tasks, leading to low job satisfaction and retention, and hindering their performance in higher-level duties and communication with clinical staff. High turnover exacerbates these issues, as new and temporary staff lack the specific knowledge needed for effective patient care. In the long term, all of these staff issues reduce the quality of care, thus impacting the quality of life of the individuals served. The goal of the project is to research methods at the intersection of symbolic reasoning, machine learning, and robotics that can lead to enhancing both staff satisfaction and resident care.

The project augments Iggy, a Pepper humanoid robot [4], with a neuro-symbolic architecture, called IR4D, which combines the powerful knowledge representation and reasoning capabilities of Answer Set Programming (ASP) [5] with Large Language Models (LLMs). Given information on a group of patients, the reasoning module builds a sequence of appropriate activities and monitors their execution, making real-time adjustments based on user responses. Whenever a menu is presented on the robot's display, the reasoning module generates text that is read out by the robot to introduce the menu. This is done not only for a more natural interaction but also because some of the patients have minimal or no reading skills. Given that the robot controlled by the architecture interacts with individuals with potentially impaired cognitive skills, the text produced by the symbolic reasoning component is based on fragments that are carefully crafted by psychologists in order to ensure that the vocabulary level is suitable for the audience.

Depending on the circumstances, the same menu may be displayed multiple times in a row. In order to make the interaction more engaging for the audience and to make the robot sound more natural, the robot automatically produces variations of the text generated by the symbolic reasoning component. While LLMs are a natural choice for generating these variations, the variations must still match the audience's capabilities, and in particular their (receptive) vocabulary level.

In this paper, we present an initial evaluation of methods for controlling the vocabulary level of the text produced by LLMs in rewording tasks, with specific attention to their use with individuals with cognitive impairment. We begin by considering different kinds of metrics normally associated with the vocabulary level of text, and then evaluate the use of a continuous validation process, in which the output produced by the LLM is vetted at runtime by a critic that determines if the output is acceptable. For practical reasons, in this paper we focus our evaluation on LLMs that are either free or moderately priced, but also of sufficient power and ease of use, i.e., GPT-3.5, Google Gemini Pro, and Claude 3 Opus. In a later phase of the study, we will evaluate more powerful versions of these as well as self-hosted LLMs.

The paper is organized as follows. We begin with a discussion on methods of measuring the vocabulary of text and clarify the research questions that our analysis aims at answering. The following sections analyze metrics of progressively increasing sophistication. We conclude the paper with final considerations and a discussion on future work.

Measuring the Vocabulary Level of Text

Typically, the evaluation of the vocabulary level of text takes into account metrics such as length, frequency of use, morphological complexity and semantic complexity, as discussed, e.g., in [6,7,8]. Length refers to the number of characters, words, and syllables in the text, as it is conceivable that longer text might require greater cognitive effort. Frequency of use refers to how often words are used in everyday language. If they are less common, they might be more complex and thereby more difficult to understand. Morphological complexity refers to the complexity of a word's internal structure; words with more complex structures might be more difficult to understand. Lastly, semantic complexity refers to the number of meanings or senses a word has. If a word has multiple meanings, reaching the desired understanding of the word can be much more difficult. In this paper, we focus on the first two metrics, and use the following research questions to guide the analysis:

• How capable are the LLMs of producing output that is of comparable length to the input if instructed to do so? And how reliable is length as an indicator of vocabulary level in the context of our application domain?
• How capable are LLMs of producing output in which frequency-of-use is comparable to that of the input?
• Is there an algorithmic way of leveraging frequency-of-use as a way to improve the vocabulary level of output text?

We conduct our analysis over a benchmark set of 10 passages that covers all the categories of passages generated by the reasoning module during a session. Categories are represented by a uniform number of passages in the benchmark set. The passages are shown in Figure 1. For each input passage, we produce three output passages for each LLM. We chose to produce multiple output passages in order to limit the impact of statistical anomalies on our analysis.

It is important to note that some of the passages contain the names of activities that the audience can choose from using the robot's display. Those names are familiar to the audience. In preliminary experiments, we observed that the LLMs have a tendency to reword the activity names, often making the activities difficult to recognize and sometimes drastically altering the semantics of the names: for example, turning "a name that sound game" into "a name-based sound game", and even rewording "an emotion recognition game" into "a mind-reading game". To our surprise, we were only partially successful at designing prompts that directly prevented the LLMs from rewording the activity names, and resorted to an algorithmic approach for preserving them. In our algorithmic preservation approach, the activity names are replaced by strings that the LLMs do not appear to reword, such as "X1X". After the LLM's output is received, these strings are converted back to the original activity names to produce the final output. While in our analysis we considered both kinds of approaches to activity name preservation, in this paper we focus on the analysis of the passages obtained through the algorithmic preservation approach. Additionally, due to space considerations, in the sections that follow we provide only excerpts of some of the tables. The complete tables can be found at https://g34r.short.gy/dataset.
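The algorithmic preservation approach described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `mask_activity_names` and `restore_activity_names` are hypothetical, and the sketch assumes that activity names appear verbatim in the passage and that the LLM returns the placeholders unchanged.

```python
def mask_activity_names(text, activity_names):
    """Replace each activity name with a placeholder of the form X1X, X2X, ...

    Returns the masked text and a mapping from placeholder to original name.
    """
    mapping = {}
    masked = text
    # Process longer names first, so that a name that is a prefix of another
    # (e.g. "a name that sound game" vs. "a name that sound game for David")
    # does not get masked inside the longer one.
    ordered = sorted(activity_names, key=len, reverse=True)
    for i, name in enumerate(ordered, start=1):
        placeholder = f"X{i}X"
        if name in masked:
            masked = masked.replace(name, placeholder)
            mapping[placeholder] = name
    return masked, mapping


def restore_activity_names(text, mapping):
    """Convert placeholders in the LLM's output back to the activity names."""
    for placeholder, name in mapping.items():
        text = text.replace(placeholder, name)
    return text


passage = ('Here are some activities: "a name that sound game for David", '
           '"some dancing fun", and "a name that sound game".')
names = ["a name that sound game",
         "a name that sound game for David",
         "some dancing fun"]

masked, mapping = mask_activity_names(passage, names)
# `masked` is what would be sent to the LLM; here we restore it directly.
restored = restore_activity_names(masked, mapping)
```

Sorting the names by decreasing length is a design choice of this sketch; it keeps the round trip exact when one activity name contains another.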

Length-Based Metrics

Length of text is a readily usable metric in the context of LLMs, as phrases such as "of length comparable to the input" can be easily included in the prompt to an LLM. In this section, we evaluate the output produced by the LLMs under consideration when the prompt instructs the LLM to constrain the length of the output. Specifically, we use the prompt:²

"Take what I say after the new-line character and paraphrase it. The new text should have a similar length to the one in input as well as the same tone and language."

The complete list of the output passages produced by the LLMs can be found at the link provided earlier. We provide sample output passages (specifically for P10) in Figure 2.

In this section, we compare the various LLMs in terms of the number of characters, words, and syllables of the output with respect to those of the input. For every metric and LLM, we calculate the average change (i.e., the difference between input length and output length) for every passage, as well as the average change across all passages. The results of the analysis on length in characters are shown in Figure 3. The analysis of the average change, per LLM, on each input passage shows a clear trend: GPT has the lowest average change, followed by Gemini and then Claude, which has the highest. The same holds for the aggregate average change over all passages, which is shown in the three bottom rectangles for each metric, where GPT has 0% average change, Gemini 10%, and Claude 44%. The results for length measured in characters are confirmed by the average change in length measured by word count, where the average change over all passages is -4% for GPT, 5% for Gemini, and 34% for Claude.
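As a rough illustration of the calculation described above, the following sketch computes the percentage change in characters and words between an input passage and its output passages, and averages it over several outputs. The function names are hypothetical, the sign convention (positive means the output is longer) is an assumption, and syllable counting is omitted because it requires a pronunciation resource or heuristic that the text does not specify.

```python
def length_metrics(text):
    """Length of a passage in characters and in whitespace-separated words."""
    return {"characters": len(text), "words": len(text.split())}


def percent_change(input_text, output_text):
    """Percentage change of each metric from input to output.

    Assumption: a positive value means that the output is longer.
    """
    before = length_metrics(input_text)
    after = length_metrics(output_text)
    return {k: 100.0 * (after[k] - before[k]) / before[k] for k in before}


def average_change(input_text, outputs):
    """Average percentage change over all outputs produced for one input."""
    changes = [percent_change(input_text, o) for o in outputs]
    return {k: sum(c[k] for c in changes) / len(changes) for k in changes[0]}


# Toy example: one shorter and one longer rewording of a two-word input.
avg = average_change("one two", ["one", "one two three"])
```

Note that averaging signed changes lets a shorter and a longer output cancel out; averaging absolute values would be an alternative if the magnitude of the deviation is what matters.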

While length and average change in length might show how well a large language model is able to follow directions from a prompt, our analysis highlighted that, at least for our target population, the length of the text is not a good indicator of the vocabulary level. Consider for instance the input text:

Before we say goodbye, I have a little surprise for you. Here are some activities that I hope you'll like: "a name that sound game for David", "some dancing fun", and "a name that sound game". Pick from one of the boxes on my screen to start an activity!

The output created by Claude is:

Hold your horses! Before we part ways, I've got a treat in store. I've handpicked some cracking activities that I reckon you'll dig: "a name that sound game for David", "some dancing fun", and "a name that sound game". Go ahead and click on any of the boxes on your screen to get the party started!

The vocabulary level of phrases such as "hold your horses", "handpicked some cracking activities" and "I reckon you'll dig" is significantly different from "I have a little surprise" and "I hope you'll like", even though the change in number of characters (13%) and in number of words (12%) might be considered modest by comparison.

When it comes to syllables, Figure 3 shows that the overall average change in the number of syllables follows trends similar to those of the number of characters and words. GPT has the smallest overall average change, with only 2% from input to output; Gemini is second with 10%; and Claude performs the worst with 49%. This analysis suggests that GPT is best at following the directions in the prompt about the desired length of the paraphrased output, whereas Gemini is slightly worse but can still be considered acceptable, as it is within a 10% change. In many of the 30 Claude cases in our analysis, however, the LLM includes part of the prompt in its output, such as 'with the specified variables'. One reason for Claude's large positive change in length is therefore that it tends to include part of the prompt in the output, thereby increasing the length; this also shows that it does not reliably follow the directions in the prompt.

However, the number of syllables does not appear to have a strong relationship with the vocabulary level, at least for the types of passages and population considered here. Consider one of the outputs produced by Claude for P5:

• Input: Before we say goodbye, I have a little surprise for you. Here are some activities that I hope you'll like: "a Frosty the Snowman sing-along", "a name that sound game for David", and "a name that tune game". Pick from one of the boxes on my screen to start an activity!

• Output: Before we part ways, I've got a special treat in store for you. Take a look at these fun options I've prepared: "a Frosty the Snowman sing-along", "a name that sound game for David", and "a name that tune game". Simply select one of the boxes displayed to embark on an exciting activity!

Here, phrases such as "part ways", "special treat in store for you" and "embark on" have a significantly different vocabulary level than "goodbye", "little surprise" and "to start".

In conclusion, while the change in length between input and output can tell us how good an LLM is at following directions related to the length of the output (and is useful in preventing the output from being boring or wordy), it does not help ensure that the vocabulary level of the output matches that of the input.

Frequency-of-Use Metric

Of the other possible options for estimating the vocabulary level of a passage, the frequency of use (abbreviated uf in this paper) of its words appears to be a practically viable metric. The intuition behind this metric is simple: frequency of use refers to how often certain words are used in a corpus of text that is used as a reference; if words are less common, it is conceivable that they are more complex or difficult to understand.

The minimum frequency of use (abbreviated min uf) across all the words of a passage seems particularly useful in our application domain. After all, we are looking for text that, when spoken by the robot, can be promptly understood by our target population. Even a single word that is outside of the receptive vocabulary of the audience can be sufficient to prevent them from promptly understanding the passage. It must be noted, however, that it is not sufficient to consider the uf of individual words, as phrases, especially if idiomatic, may have a meaning that is different from that of their constituent words and a frequency of use that is also different from that of the constituent words. Consider for example "let the fun begin". While the individual words of this phrase are fairly common, the specific combination is not as frequently used as, for example, "let's have fun." For this reason, we adopt the following approach.

Let $\tau(p)$ denote the parse tree associated with passage $p$ (if multiple parse trees exist, we assume that a parse tree has been arbitrarily selected). For a node $n \in \tau(p)$, let $\operatorname{uf}(n)$ denote the frequency of use of the text associated with $n$. $\operatorname{uf}(n)$ is undefined if no frequency of use is available for the text.

Definition 1. A node $n$ of $\tau(p)$ is a uf-top node if:

• $\operatorname{uf}(n)$ is defined, and
• $\operatorname{uf}(n')$ is undefined for every ancestor $n'$ of $n$ in $\tau(p)$.

We can now give the following:

Definition 2. Given a passage $p$:

• The min uf of $p$, written $\operatorname{min-uf}(p)$, is

$$\operatorname{min-uf}(p) = \min_{\substack{n \in \tau(p) \\ n \text{ is a uf-top node}}} \operatorname{uf}(n)$$

• The min-uf phrase of $p$ is the fragment of text associated with a uf-top node of $\tau(p)$ that has minimum uf. (For simplicity, if multiple such nodes exist, we select the first such node in a leftmost traversal of the parse tree.)
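To make the two definitions concrete, here is a minimal sketch that finds the uf-top nodes of a toy parse tree and returns the min-uf phrase together with its uf. The `Node` class, the dictionary-based `uf` lookup, and all counts except the one for "on my screen" (which is quoted in this paper) are assumptions made for illustration; the actual analysis uses parse trees and ngram counts obtained from external tools.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A parse-tree node; `text` is the fragment of the passage it spans."""
    text: str
    children: List["Node"] = field(default_factory=list)


def uf_top_nodes(root: Node, uf: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (text, uf) for every uf-top node, in left-to-right order.

    A fragment missing from `uf` is treated as having undefined uf.
    """
    found: List[Tuple[str, int]] = []

    def visit(node: Node) -> None:
        if node.text in uf:
            # uf is defined here and, because we only descend through
            # undefined nodes, it is undefined for every ancestor.
            found.append((node.text, uf[node.text]))
            return
        for child in node.children:
            visit(child)

    visit(root)
    return found


def min_uf(root: Node, uf: Dict[str, int]) -> Optional[Tuple[str, int]]:
    """Return (min-uf phrase, min uf), or None if no node has a defined uf."""
    tops = uf_top_nodes(root, uf)
    # min() keeps the first of several equal minima, i.e. the leftmost node.
    return min(tops, key=lambda item: item[1]) if tops else None


tree = Node("let the fun begin on my screen", [
    Node("let the fun begin", [Node("let"), Node("the fun"), Node("begin")]),
    Node("on my screen", [Node("on"), Node("my screen")]),
])
# Hypothetical counts, except "on my screen" (33090).
counts = {"let the fun begin": 500, "on my screen": 33090,
          "let": 1_000_000, "on": 10_000_000}

phrase, value = min_uf(tree, counts)
```

In this toy tree the single words "let" and "on" are ignored even though their counts are known, because an ancestor of each already has a defined uf.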

For the analysis that follows, the parse trees were generated using the spaCy library³ with the en_core_web_lg model. The uf of phrases has been obtained from the absTotalMatchCount metric of the ngrams tool⁴ with default settings.⁵ ngrams uses the Google Books Ngram Dataset v3 as the source of ngram data.

At least at first sight, the min uf appears to provide a meaningful indication of the vocabulary level of passages. For example, the phrase "on my screen", which one might consider fairly simple to understand, has a comparatively high min uf of 33090, while phrases such as "our prior encounter" and "infringing on any copyrights," which arguably belong to a higher vocabulary level, have a comparatively low min uf of 72 and 42, respectively.

So, the question is: how capable are LLMs of producing output whose min uf is comparable to the min uf of the input given our prompt?

To answer the question, we compared the min uf of our input and output passages, and report the results of our analysis in Figure 4. Out of the 90 cases we analyzed, in 68 the output's min uf is equal to or greater than the input's min uf. This means that in 76% of the cases the output's min uf is appropriate compared to the input's.⁶ Breaking the results down further by LLM, the output's min uf is greater than or equal to that of the input in 83% of cases for GPT, 77% for Gemini, and 67% for Claude. Thus, GPT appears to have a fair margin over Gemini, while Claude is the worst performer.

These numbers suggest that LLMs are rather satisfactory in their ability to produce an output whose min uf is comparable to that of the input. However, upon further investigation, we noticed that the observed good performance of the LLMs occurs for the most part in cases in which the min-uf phrase of the input is an activity name. Figure 5 provides a comparison of the min uf of input and output for the passages whose min-uf phrases do not contain an activity name. Interestingly, in those cases the output's min uf is almost always less than the input's min uf.

Specifically, the min uf of the output is less than that of the input in 16 cases out of 18, or 89% of the cases, thereby showing a complete switch in observed behavior compared to the previous comparison.

The switch can be explained as follows. It so happens that some activity names are rather infrequent in the corpus used for the calculation of min uf (even though they are promptly understood by the audience), causing them to be the min-uf phrases. Since the rewording process ensures that the activity names are kept verbatim in the output, in many cases those are also the min-uf phrases of the output. However, the fact that the min uf of the input and of the output coincide does not tell us much about the quality of the output produced by the LLMs, only that the uf of the phrases they produced is no worse than that of the activity names. And because the uf of the activity names is fairly low, that is not a particularly significant achievement.

For this reason, we decided to re-evaluate the passages disregarding the activity names in the calculation of the min uf in those cases in which the activity names would have otherwise been the min-uf phrases of both input and output, leading to the comparison outlined in Figure 6. Looking at the secondary min uf phrase for these cases, we observed that in 50% of cases the min uf of the output was worse than that of the input. That is, in only 50% of the cases the passage produced by the LLM was satisfactory in terms of min uf.
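The re-evaluation described above can be illustrated with a small sketch that recomputes the min uf after setting aside phrases containing names the audience already knows. This is a simplification under stated assumptions: the function name `min_uf_excluding` is hypothetical, the sketch always excludes familiar names (rather than only when they would be the min-uf phrase of both input and output), and all counts except the one for "on my screen" are made up.

```python
def min_uf_excluding(top_nodes, familiar_names):
    """Return the (phrase, uf) pair with minimum uf, skipping familiar names.

    `top_nodes` is a list of (phrase, uf) pairs for the uf-top nodes of a
    passage; a phrase is skipped if it contains any of `familiar_names`.
    """
    candidates = [
        (text, count)
        for text, count in top_nodes
        if not any(name in text for name in familiar_names)
    ]
    if not candidates:
        return None
    # min() keeps the first of several equal minima.
    return min(candidates, key=lambda item: item[1])


# Hypothetical counts, except "on my screen" (33090).
input_nodes = [
    ("Which activity would you like", 40),
    ("a name that sound game", 2),
    ("on my screen", 33090),
]

with_names = min_uf_excluding(input_nodes, [])
without_names = min_uf_excluding(input_nodes, ["a name that sound game"])
```

Once the activity name is set aside, the next-lowest phrase becomes the min-uf phrase, which is the "secondary" phrase used in the comparison.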

It is worth noting a potential issue with the ngram data used for estimating frequency of use. As seen from Figure 6, in a number of cases in which the min uf of the output is greater than or equal to the min uf of the input, the min-uf phrase of the input is "Which activity would you like." For some reason, that phrase is assigned a rather low uf, which causes most outputs to have a better min uf than the input passage, even in cases where the min-uf phrase of the output is "with the specified variables", which intuitively seems to be at a higher vocabulary level. As we discuss later in this paper, we hypothesize that better quality ngram data may reduce this type of noise.

A Vocabulary Level Critic: an Evaluation

In the previous section, we saw that LLMs have some capability of producing output of comparable vocabulary level to the input as estimated via uf, although their performance is less than ideal. In this section, we consider the possibility of algorithmically filtering the output of the LLMs based on this metric, and evaluate whether uf indeed provides a reliable filtering mechanism.

For our evaluation, we consider a simple filtering mechanism based on a continuous validation of the output of the LLM being used. That is, we can leverage min uf to develop a critic that compares the min uf of the output with that of the input. If the min uf of the output is lower than that of the input, the critic rejects the output, and the architecture may decide to either have the LLM produce a new output, or use the input directly. In practice, the decision will be likely based on timeliness considerations: given the statistical nature of LLMs, there is no telling how many times the output will need to be regenerated before something of acceptable quality is produced, and long delays before the robot speaks might make for an unnatural experience.
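The continuous validation loop described above can be sketched as follows. This is only an illustration under stated assumptions: `reword` stands for a call to an LLM, `min_uf_of` stands for a function that computes the min uf of a passage, and the bound `max_attempts` is a hypothetical way of addressing the timeliness concern; none of these names come from the architecture itself.

```python
from typing import Callable


def reword_with_critic(input_text: str,
                       reword: Callable[[str], str],
                       min_uf_of: Callable[[str], int],
                       max_attempts: int = 3) -> str:
    """Return an accepted rewording, or the input itself as a fallback.

    The critic rejects a candidate whose min uf is lower than the input's.
    """
    threshold = min_uf_of(input_text)
    for _ in range(max_attempts):
        candidate = reword(input_text)
        if min_uf_of(candidate) >= threshold:
            return candidate  # accepted by the critic
    # Every attempt was rejected: fall back to the vetted input text.
    return input_text


# Stubs standing in for the LLM and the frequency lookup (made-up values).
scores = {"hello friend": 100, "greetings, comrade": 5, "hi friend": 250}
llm_outputs = iter(["greetings, comrade", "hi friend"])

accepted = reword_with_critic("hello friend",
                              lambda text: next(llm_outputs),
                              scores.__getitem__)
```

Falling back to the input is safe because the input is assembled from fragments already vetted for the audience; bounding the number of attempts trades variety for a predictable delay before the robot speaks.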

Our goal in this section is to answer the question: suppose the critic rejects the output produced by the LLM; based on human judgement, is that a good decision or a bad decision? This evaluation will give us some insights both on the reliability of min uf as a filtering metric, and on whether the critic is a practically viable algorithmic filtering method.

The results of this evaluation are summarized in Figure 7, where a judgement of "good decision" is indicated as a true positive, and a bad decision as a false positive. At this early stage of the project, we relied on our own judgment to determine whether a decision was good or bad. In the future, we plan to conduct a more rigorous evaluation with independent test subjects. In spite of the simplicity of our validation method, we believe that the evaluation provided some interesting results. As the figure illustrates, out of 53 cases flagged by the critic, 33 are true positives, which means that in 62% of cases, min uf is a reliable indicator of whether the vocabulary level of the output produced by an LLM is acceptable. Figure 8 provides a breakdown of the information for each LLM. For GPT, the true positive rate is 61%, for Gemini it is 79%, and for Claude it is 52%.

One consideration to make is that the generation of the parse tree, and possibly also the ngram data, appears to introduce some noise in the filtering process. For example, some of the cases marked as false positives during our validation are due to the string "'s have some fun" being the min-uf phrase in the output, and being associated with a low uf. Those cases were marked as false positives during our validation because it is clear that the string refers to "let's have some fun", and the vocabulary level of that phrase is low. We speculate that, had the parse tree correctly extracted the string "let's have some fun", its uf might have been higher and thus the output would not have been rejected on its account. (A similar result would have been obtained if the ngram data had not included the entry "'s have some fun".) That is, we hypothesize that better quality of the parsing process and of ngram data might lead to a higher accuracy of the filtering process.

The data also highlights another interesting aspect. In a certain number of cases marked as false positives during the human validation, the min-uf phrases in question contain place names related to the location of the robot, such as "Innovation Center" and "Saint Joseph's University." Clearly, those names occur infrequently in the text corpus used for determining the uf, but are well known to the audience. The results obtained after excluding those place names from the calculation of the min uf are illustrated in Figure 9. With this refinement, the percentage of true positives is substantially higher, standing at 70% overall, with 67% for GPT, 91% for Gemini, and 61% for Claude.

Conclusions and Future Work

In this paper, we presented an initial evaluation of methods for controlling the vocabulary level of the text produced by LLMs in rewording tasks, with specific attention to the use with individuals whose cognitive capabilities are impaired. We considered length-based metrics as well as frequency of use, and defined an approach for identifying which nodes of the parse tree should be considered when measuring frequency of use. While there is room for improvement, we hope our results provide an informative initial account of the topic. For practical reasons, at this stage we considered a set of free or moderately priced LLMs. In the future, we plan to evaluate more powerful versions of those LLMs as well as self-hosted LLMs. We will also conduct a human validation with independent test subjects, investigate morphological and semantic complexity measures, and explore the ability of more sophisticated prompt designs to ensure a consistent vocabulary level.

Figure 1: Benchmark set used in the analysis

Figure 2: Output passages produced for passage P10

Figure 3: Change in length measured in characters, word count and syllables

Figure 4: Comparative change in min uf

Figure 5: Change in min uf from input to output

Figure 6: Change in min uf from revised input to revised output (excerpt)

Figure 7: Human validation of revised min uf phrases

Figure 8: Comparative results of the human validation process

Figure 9: Comparative results of the human validation process excluding place names

² One might argue that more sophisticated ways of phrasing the prompt should be considered. While that is indeed possible, we intentionally stay away from the route of prompt engineering and focus on analyzing what LLMs are capable of producing given a reasonably precise prompt that an average human would likely be able to understand and act upon.

³ https://spacy.io

⁴ https://ngrams.dev/

⁵ It is certainly possible that different settings may affect the results of our analysis, particularly if one selects specific year ranges for the frequency of the ngrams. We defer an evaluation of the impact of these settings to a later phase of our study.

⁶ One might argue that the min uf of the output should not simply be greater than, or equal to, that of the input, but that the difference should be within a certain threshold. We defer such evaluation to a later phase of our study.

Acknowledgments. This paper was made possible in part by funding provided by Bancroft. The passages evaluated in this paper were assembled from fragments prepared by Sara Girotto.

References

[1] 6ABC - Action News Philadelphia, Seniors using AI robots to combat loneliness.

[2] M. B. Ganapini, F. Fabiano, L. Horesh, A. Loreggia, N. Mattei, K. Murugesan, V. Pallagani, F. Rossi, B. Srivastava, B. Venable, Value-based fast and slow AI nudging, arXiv:2307.07628, 2023.

[3] C. Mele, T. Russo Spena, V. Kaartemo, M. L. Marzullo, Smart nudging: How cognitive technologies enable choice architectures for value co-creation, Journal of Business Research 129 (2021). doi:10.1016/j.jbusres.2020.09.004.

[4] Softbank Robotics, Meet Pepper: The robot built for people, 2023.

[5] M. Gelfond, V. Lifschitz, Classical Negation in Logic Programs and Disjunctive Databases, New Generation Computing 9 (1991).

[6] Readability and the Common Core's Staircase of Text Complexity, 2012.

[7] K. Rayner, S. A. Duffy, Lexical complexity and fixation times in reading: Effects of word frequency, verb complexity, and lexical ambiguity, Memory & Cognition 14 (1986). doi:10.3758/BF03197692.

[8] M. Shardlow, R. Evans, M. Zampieri, Predicting lexical complexity in English texts: the Complex 2.0 dataset, Language Resources and Evaluation 56 (2022). doi:10.1007/s10579-022-09588-2.