<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Joint Proceedings of IS-EUD</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Voice-based Direct Manipulation to Foster Inclusion in Intent-driven User Interfaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Colazzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Pucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maristella Matera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano, Department of Electronics</institution>
          ,
          <addr-line>Information and Bioengineering, Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>10</volume>
      <fpage>16</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>This paper discusses voice-based direct manipulation as a way to extend conversational interactions with LLMs and, more generally, with intent-driven user interfaces. Inspired by recent work on visual direct manipulation in AI-assisted tools, we explore whether similar patterns can enhance Voice User Interfaces (VUIs). Such approaches can improve the usability of intent formulation, especially in contexts where voice is the primary or the only interaction modality, with important implications for accessibility and social inclusion.</p>
      </abstract>
      <kwd-group>
        <kwd>Intent-driven UIs</kwd>
        <kwd>Voice-based direct manipulation</kwd>
        <kwd>LLM interfaces</kwd>
        <kwd>Accessibility</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advancements in Generative AI and its widespread availability are impacting professional and
private lives. Generative AI has also contributed to a revolutionary shift in the way humans
interact with technology: a paradigm Nielsen described as “intent-based outcome specification” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This
is fundamentally different from other paradigms that have emerged in the history of Human-Computer
Interaction (HCI). Indeed, compared to command-based interactions, the intent-based paradigm shifts
the control over how the computation is performed from the user to the underlying AI model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Since the launch of ChatGPT in November 2022, conversational intent-based interactions have
quickly become the standard way to interact with Large Language Models (LLMs). In this paradigm, humans
and AI engage in multi-turn conversations, with the user focusing on expressing the desired outcome
in the form of a prompt in natural language, while the model is responsible for capturing the user
intent and converting it into a meaningful result [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Despite the exceptional popularity of LLMs and the
unprecedented opportunities they have unlocked in many fields, ensuring that user intent is effectively
and accurately captured by the LLM, starting from a prompt expressed in natural language, still poses
significant usability challenges. Prompt-engineering techniques, such as few-shot prompting and
chain-of-thought, have emerged as strategies to improve the alignment between the output produced by
the model and the user intent. However, the effectiveness of such techniques is inherently limited [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Additionally, when a situational or permanent disability demands voice interaction, prompt refinement
may present barriers due to the lack of adequate interaction and content-manipulation mechanisms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Alongside text-based prompting, other techniques that leverage the manipulation of visual elements
to streamline prompting and achieve more direct interactions are emerging. This paper discusses these
new strategies and introduces possible ways to bring voice-based direct manipulation into intent-driven
interfaces. Shifting the focus to the voice modality is fundamental to ensure advancements in AI remain
human-centered, preventing social exclusions and encouraging participation while embracing diversity.
After illustrating the current panorama of conversational intent-based interaction, especially applied to
the interaction with LLMs, the paper discusses how typical patterns for direct manipulation in visual
intent-based interfaces can be translated into voice-based interaction mechanisms. The paper also
discusses a preliminary prototype informed by a user study that involved Blind and Visually Impaired
(BVI) participants.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Beyond Text-based Prompting in Human-LLM Interaction</title>
      <p>
        Although standard prompting, especially text-based prompting, is currently the most common
interaction mechanism in Generative AI systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], novel approaches are emerging. According to the
taxonomy of the most common patterns in Generative AI UIs proposed by Luera et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
prompting, whether text-based, visual, audio, or multimodal, is the most prominent example of what the
authors define as “user-guided interactions” in the landscape of Generative AI. In addition to chat-based
interfaces, other categories feature a canvas area at the center of the screen where the majority of
interactions occur. More specifically, information visualization canvases represent a subcategory where
elements inside the central area can be directly manipulated to alter the state of the system.
      </p>
      <p>
        DirectGPT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (Figure 1) is an example of a system featuring an information visualization canvas
interface [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], implemented on top of an LLM. In DirectGPT, users are allowed to build prompts by
manipulating visual components inside the UI, using mechanisms such as multi-modal prompting
and multi-selection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The authors have demonstrated how direct manipulation principles can be
effectively combined with traditional text-based prompting strategies to streamline the interaction
with LLMs and make it more direct, while overcoming some of the limitations of purely conversational
UIs. Indeed, as the authors point out, while it is true that traditional conversational LLM interfaces
enable users to create content like text, code, or images, it may take several conversational turns to
achieve a satisfactory result, thus slowing down interactions. Furthermore, error-recovery strategies
are not supported, and unsatisfactory results negatively affect subsequent interactions. For this iterative
adjustment process to be truly effective, then, users must be able to reference specific elements from
the model’s previous response, something purely conversational interfaces do not allow.
      </p>
      <p>DirectGPT, as opposed to conversational interfaces, builds upon the idea that to enable direct
manipulation the objects of interest must be continuously accessible inside the interface. For this reason,
the long history of messages is replaced by a fixed area at the center of the screen, where the model
output is continuously displayed after each modification. Furthermore, the dynamic population of a
toolbar of commands makes available the most recent user prompts in the form of buttons, fostering
their re-usability. Increased directness is also achieved through physical actions, such as dragging or
highlighting, augmenting the expressivity and the flexibility of purely textual prompts. Additionally,
error recovery is achieved through undo and redo commands, displayed as buttons located at the top of
the interface. The system also highlights the elements affected by the LLM-generated modifications,
allowing users to quickly assess the effect of their prompts and immediately revert them if needed.</p>
      <p>
        Content creation is one of the primary applications of Generative AI. As Luera et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] point
out in their survey, the most common UI layouts for AI-assisted content creation tasks, involving the
generation or editing of visual, written or audio content, are conversational and canvas UIs, with the
latter being more frequently adopted for the generation of visual content. However, this pattern is
not absolute and canvas UIs have also been developed for AI-assisted writing tasks. An example is
given by Canvas [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (Figure 2): a tool integrated in ChatGPT, introduced by OpenAI in October 2024.
Canvas is designed to support users in their writing and coding projects, with a UI that goes beyond
the traditional chat layout. Indeed, as emphasized in the article about its launch, working with a chat
can be limiting for tasks that require editing and revisions. The tool, instead, introduces a canvas area,
separate from the chat, that ChatGPT users can leverage to directly engage with the text or code to be
edited. Supported actions include the possibility to highlight specific portions of the text to restrict
the context of the action to be performed; the chance for the user to directly edit the text or code, for
those cases that do not require the support of the underlying LLM; the access to a menu of shortcuts for
quickly tuning parameters such as the length or the reading level of the text; the option to perform
undo and redo actions through dedicated buttons; the opportunity to check the difference between the
current version of the text and the previous one. These features facilitate the collaboration between the
user and ChatGPT, with greater flexibility compared to standard interfaces used for the same tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Voice-based Direct Manipulation</title>
      <p>
        Our research explores how direct manipulation principles, most commonly associated with GUIs,
can be extended to voice-based LLM interfaces. The direct-manipulation paradigm was first introduced
to describe interactive systems that apply three main principles [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: (i) continuous representation of
the object of interest; (ii) physical actions or labeled button presses instead of complex syntax; (iii)
rapid, incremental, reversible operations whose impact on the object of interest is immediately visible.
Our research aims to identify how these principles can inspire the definition of new mechanisms to
achieve more expressive and direct vocal interactions in the context of LLM-powered Voice User Interfaces
(VUIs). The research is informed by a user study we performed between November 2023 and July
2024 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to explore the accessibility potential and challenges of voice interaction when accessing LLMs.
Our research began with an exploratory interview with a BVI expert in assistive technologies, which
provided firsthand insights into the needs of BVI users interacting with LLMs. The conversation focused
on the use of tools such as ChatGPT for information access. Building on the interview findings, we first
designed and distributed an online questionnaire targeting BVI individuals, which received 116 responses;
we then conducted individual semi-structured interviews. A thematic analysis highlighted the main
aspects that emerged from the gathered data. Related voice interaction patterns were identified as solutions
to the main challenges and validated in focus groups with 11 participants.
      </p>
      <p>Overall, the study revealed a range of design opportunities in voice interaction. It also emphasized
the need for better ways to manage and organize conversations, making it easier for users to navigate
content, revisit important information, and maintain context across sessions. Additionally, the findings
highlighted the importance of supporting more effective consumption of generated responses, by
enabling smoother navigation through long or complex outputs and direct manipulation of specific
output portions. These design insights informed the development of a prototype incorporating the
new voice interaction patterns. We are now building on this experience to extend direct manipulation
principles to the vocal channel. The discussion presented below is inspired by the previously-discussed
examples of canvas UIs and considers some well-known challenges of VUI design from the literature.</p>
      <p>Continuous representation of the object of interest. The system could read the most up-to-date
version of the output (the LLM-generated text) aloud. In this way, the effect of user-requested
commands could be perceived immediately, without the user having to explicitly ask for the full text
to be read out loud every time a new transformation is applied. However, while this mechanism
could be effective when the model’s response is relatively short, for lengthier outputs it could
decrease usability. In such circumstances, providing a summary of the applied changes might be more
appropriate, while still preserving the option for the user to listen to the full modified text upon request.</p>
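<p>As a minimal sketch of this trade-off (the threshold, function names, and command phrase are illustrative assumptions, not part of the study's prototype), the feedback policy could be expressed as:</p>

```python
# Illustrative policy for the "continuous representation" principle:
# short outputs are read back in full, longer ones are acknowledged
# with a summary of the changes. The 400-character threshold is an
# assumption to be tuned per speech rate and user preference.

FULL_READOUT_LIMIT = 400  # characters

def feedback_for(updated_text: str, change_summary: str) -> str:
    """Return the utterance the VUI speaks after applying an edit."""
    if len(updated_text) <= FULL_READOUT_LIMIT:
        # Short output: read the whole updated text back immediately
        return updated_text
    # Long output: summarize, keeping the full version available on demand
    return change_summary + " Say 'read it all' to hear the full updated text."
```

<p>The summary branch preserves the immediacy of feedback without forcing the user to sit through a full re-reading after every transformation.</p>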
      <p>
        Physical actions or labeled button presses instead of complex syntax. While the definition
of visual abstractions acting like shortcuts is relatively easy and effective in GUIs, in voice interfaces
it requires careful consideration. Such an abstraction could be implemented as pre-defined vocal
commands that act as proxies for longer, more complex prompts. A challenge, however, is to inform
users about the availability of such shortcuts; once aware, they should also be able to recall them.
To address these issues, the system could actively remind users of the availability of such commands, asking
at strategic points in the conversation whether they want to hear the list of such shortcuts. A less
intrusive alternative would be to provide the list only upon request. In both cases, to help the user remember
all the proposed options, no more than three elements should be provided at a time [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Overall, the
pattern appears to be consistent with the findings of previous work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Additionally, identifying a
voice-based version of gestures that in GUIs are naturally associated with a semantic meaning, such
as a selection mechanism to restrict the context of a given prompt to a smaller portion of text, is far
more challenging. It requires the ability to physically point to a specific element of text, an action that
does not have a direct equivalent in the voice domain. However, equipping the user with a fine-grained
navigation mechanism, through which the text can be linearly explored and potentially modified along
the way, may give users an analogous degree of expressiveness in crafting prompts through speech. By
considering the limitations of human short-term memory, this pattern might give users the chance to
focus on smaller and more manageable portions of text for an easier editing and review experience. One
may argue that having speech as primary means for expressing intent inevitably introduces a certain
degree of indirectness, forcing the user to precisely describe in words what the expected outcome is
and which portions of the previous model’s output the prompt should affect. With this navigation
mechanism, instead, users could gain finer-grained control over the generation process, having access to
an in-place tool for a more precise and accurate definition of intents that removes the need for extensive
descriptions. However, to prevent users from feeling lost and help them form a mental model of the
text being explored, this method should be combined with a suitable strategy that promotes location
awareness, such as explicitly assigning each navigation node a progressive numerical identifier.
      </p>
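<p>The navigation mechanism with progressive numerical identifiers could be sketched as follows; splitting at sentence granularity and the announcement format are simplifying assumptions for illustration:</p>

```python
# Sketch of the fine-grained navigation pattern: the model's output is
# split into numbered nodes (here, sentences) so a voice command can
# reference "node 2" instead of describing a text span in words.

import re

def build_nodes(text: str) -> list[tuple[int, str]]:
    """Assign each sentence a progressive identifier for location awareness."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return list(enumerate(sentences, start=1))

def announce(node: tuple[int, str]) -> str:
    """Compose the utterance spoken when landing on a node."""
    ident, content = node
    return f"Node {ident}: {content}"
```

<p>For instance, <code>build_nodes("Dear team. The launch moved to May.")</code> yields two numbered nodes, and landing on the second is announced as "Node 2: The launch moved to May."</p>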
      <p>Rapid, incremental, reversible operations whose impact on the object of interest is
immediately visible. While reversibility can easily be achieved by introducing vocal commands that undo
or redo recent modifications, the possibility of quickly performing incremental operations is the most
critical part of the interaction. This is because vocal commands can be relatively rapid to issue, but the
efficiency with which the user can evaluate their effect on the system strongly depends on the
way such actions are acknowledged. When the target text can be visually inspected, it is easier for the
user to check the difference between the current version of the output and the previous one, especially
if the interface provides a dedicated mechanism to inspect and browse the editing history, like the one
available in Canvas. In VUIs, instead, the efficiency with which users can evaluate the effect of issued
commands depends on the trade-offs discussed for the first principle.</p>
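<p>A possible sketch of voice-driven reversibility, assuming a simple two-stack history behind the "undo"/"redo" vocal commands (this is not the authors' implementation; a real VUI would pair each step with the spoken feedback discussed above):</p>

```python
# Two-stack edit history: each new edit pushes the previous version
# onto the undo stack; undo/redo move between versions.

class EditHistory:
    def __init__(self, initial: str):
        self.current = initial
        self._undo: list[str] = []
        self._redo: list[str] = []

    def apply(self, new_text: str) -> None:
        self._undo.append(self.current)
        self._redo.clear()  # a fresh edit invalidates the redo chain
        self.current = new_text

    def undo(self) -> str:
        if self._undo:
            self._redo.append(self.current)
            self.current = self._undo.pop()
        return self.current

    def redo(self) -> str:
        if self._redo:
            self._undo.append(self.current)
            self.current = self._redo.pop()
        return self.current
```
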
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary Prototype</title>
      <p>
        To assess the potential of the patterns presented in the previous section, we integrated them into a new
paradigm for vocal interaction with LLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For this purpose, we considered a use case in which
the user’s goal is to iteratively refine a message received from the model within an ongoing chat (e.g.
adjusting the tone and structure of an LLM-generated email). A sketch of the interaction flow
is provided in Figure 3. Since our focus is on voice interfaces, the visual elements in the figure are used
only to depict the objects of the vocal interactions.
      </p>
      <p>
        At first, the user chooses a message from a chat that will act as a target for editing actions. To support
the refinement process, an editing mode is introduced. This is conceptually analogous to the canvas
area seen in some visual LLM interfaces and can be accessed using a dedicated vocal command, such as
“Enter edit mode”. Since this space is virtually separated from the main chat with the model, it can be
leveraged by the user to iteratively carry out a series of LLM-assisted transformations of the text,
while keeping the main chat clean and concise. In fact, in VUIs, the longer the conversation
history, the more frustrating it becomes to fully traverse it [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Moreover, if some messages in the
chat only represent transient intermediate results, they can contribute to a more cluttered history of
messages, which might be even more tedious to navigate. Given that in edit mode the exchange of
messages between the user and the model only serves the purpose of producing a new, satisfactory
version of the message selected as the target for modification, only the final output of such a procedure
persists to the main chat once the user is out of edit mode.
      </p>
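<p>The edit-mode contract described above, where intermediate turns stay in a transient scratch space and only the final draft persists to the main chat, could be sketched as follows (an illustrative model, not the prototype's actual code):</p>

```python
# Edit-mode session: the working draft and the exchange of refinement
# turns live outside the chat; exiting commits only the final version.

class EditSession:
    def __init__(self, chat: list[str], target_index: int):
        self.chat = chat
        self.target_index = target_index
        self.draft = chat[target_index]  # working copy, separate from the chat
        self.turns: list[str] = []       # transient exchange, never persisted

    def refine(self, prompt: str, new_draft: str) -> None:
        """Record one LLM-assisted transformation of the draft."""
        self.turns.append(prompt)        # kept only while edit mode is active
        self.draft = new_draft

    def exit(self) -> list[str]:
        """Leaving edit mode commits only the final draft to the chat."""
        self.chat[self.target_index] = self.draft
        return self.chat
```

<p>However many refinement turns occur, the main chat history grows by nothing: the selected message is simply replaced in place, keeping the history clean and easy to traverse.</p>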
      <p>In edit mode, we can classify user requests to modify text into two categories:
• Global editing actions (Figure 4a): they affect the target message in its entirety. Examples include
requests to summarize the message or change its tone. Any time a request of this kind is sent
to the model, the full modified message is read aloud. In the case of particularly long messages,
only a summary of the applied modifications is provided instead.
• Localized editing actions (Figure 4b): they affect only a restricted portion of the text, such as a
single sentence or word. Examples include rewriting a given sentence using a different style
or replacing a word with a synonym. To perform this class of actions the user must be able
to both traverse the message using a finer-grained and adjustable granularity (e.g. at sentence
level), and select the specific portion to be transformed. This behaviour can be achieved through
the previously-discussed navigation pattern, implemented as a navigation mode that can be
activated using a vocal command like “Enter navigation mode”. Users can optionally specify
the desired navigation granularity, otherwise the most appropriate one is selected based on the
length of the message. In navigation mode, users can also access helper commands that facilitate
message traversal. Landing on a given node automatically triggers a series of actions. At first, the
node’s identifier is read aloud followed by its content. Then, the user receives suggestions from
the system for possible actions to take. At this stage, any request involving an editing action
automatically considers the node currently in focus as the target of the transformation to be
applied. Moreover, while traversing the message, any time a request to modify a node is issued,
the updated version of its content is read aloud.</p>
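<p>A hypothetical dispatcher for the two categories can make the distinction concrete: with no node in focus the request is treated as global, otherwise it is localized to the focused node. All names are illustrative, and <code>rewrite</code> stands in for the LLM call:</p>

```python
# Dispatch an editing request as global or localized, depending on
# whether a navigation node is currently in focus.

def handle_edit(message, request, focus, rewrite):
    """Apply rewrite(text, request) globally or to the focused node.

    `message` is a list of node strings; `focus` is a 0-based node
    index or None. Returns (scope, spoken acknowledgement)."""
    if focus is None:
        # Global action, e.g. "summarize" or "change the tone"
        for i, node in enumerate(message):
            message[i] = rewrite(node, request)
        return "global", "Done. Here is the updated message."
    # Localized action: only the node currently in focus is affected
    message[focus] = rewrite(message[focus], request)
    return "localized", f"Node {focus + 1} updated: {message[focus]}"
```
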
      <p>Additional complementary strategies to improve the overall interaction include:
• Auditory cues: sounds can be used to reinforce and complement verbal feedback about specific
actions. E.g., a specific non-verbal sound can be played after the message “You are now in
edit mode”, when entering this mode. Such an additional layer of feedback could help increase the
users’ awareness of the system’s state and guide them more intuitively through the interaction.
• Keyboard shortcuts: they can be introduced as an alternative to commands that already have a
voice-based counterpart. For instance, inside navigation mode, users could use the right arrow
key to move to the next node instead of pronouncing the word “Next”. The availability of
such keyboard shortcuts could streamline the interaction. However, to preserve the hands-free
modality, these mechanisms should only represent a form of redundancy.
• Help command: to facilitate the exploration of all actions and shortcuts available from a given
stage of the interaction, a help command can additionally be introduced.</p>
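<p>The redundancy principle behind the keyboard shortcuts can be modeled as a registry in which every key binding mirrors an existing vocal command, so hands-free use is never compromised (the commands and bindings below are illustrative assumptions):</p>

```python
# Command registry: each entry has a spoken phrase and optional key
# aliases; both routes resolve to the same action.

COMMANDS = {
    "next":     {"keys": ["ArrowRight"], "action": "move to the next node"},
    "previous": {"keys": ["ArrowLeft"],  "action": "move to the previous node"},
    "undo":     {"keys": ["ctrl+z"],     "action": "revert the last edit"},
    "help":     {"keys": ["F1"],         "action": "list available commands"},
}

def resolve(input_event: str):
    """Map either a spoken phrase or a key press to the same action."""
    for phrase, spec in COMMANDS.items():
        if input_event == phrase or input_event in spec["keys"]:
            return spec["action"]
    return None  # unrecognized input; the help command can list options
```

<p>Because every key has a vocal twin, the keyboard path remains pure redundancy rather than a requirement.</p>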
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper has illustrated preliminary insights on defining vocal interaction patterns for direct
manipulation in chat-based interfaces. Sketching the interaction, as illustrated in the previous section, represents
the first step toward validating the feasibility of the proposed patterns. The next steps involve testing
the interaction with users. A fast-prototyping approach was envisioned for this purpose, leveraging
OpenAI’s custom GPTs as tools for demonstrating the identified interaction patterns. This approach
presents several advantages, including access to a pre-built interface with a ready-to-use voice mode
available on top of OpenAI’s models. The configuration required to share and run the prototype is thus
limited to the definition of a well-formatted system prompt. Despite its limitations, the low resource
demand of this solution makes it promising for quickly testing and improving the interaction. However,
more sophisticated prototyping techniques will also be explored before approaching the validation step,
with the aim of defining a more robust artifact that can lead to a higher quality of the gathered data.</p>
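<p>A sketch of the kind of system prompt envisioned for the custom-GPT prototype might look as follows; the wording and command names are entirely illustrative, not the authors' actual configuration:</p>

```text
You simulate a voice assistant supporting direct manipulation of a
target message. "Enter edit mode" starts an editing session on the
selected message; "Enter navigation mode" splits it into numbered
nodes announced as "Node N: ..."; "Next"/"Previous" move between
nodes; "Undo"/"Redo" revert or reapply the last change; "Help" lists
the commands available at the current stage. Read short updated texts
in full; for long ones, give only a summary of the changes. When the
user says "Exit edit mode", output only the final version of the text.
```
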
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors used Writefull for grammar and spelling checking. However, the authors extensively
reviewed and edited the text; therefore, they take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <source>AI: First New UI Paradigm in 60 Years</source>
          ,
          <year>2023</year>
          . URL: https://www.nngroup.com/articles/ai-paradigm/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Gebreegziabher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. T. W.</given-names>
            <surname>Choo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Perrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Malone</surname>
          </string-name>
          ,
          <article-title>A Taxonomy for Human-LLM Interaction Modes: An Initial Exploration</article-title>
          , in: CHI EA '24, ACM,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Masson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malacria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Casiez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vogel</surname>
          </string-name>
          ,
          <article-title>DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models</article-title>
          ,
          , in:
          <source>Proc. of CHI '24</source>
          , ACM
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Piro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Andolina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matera</surname>
          </string-name>
          ,
          <article-title>From Conversational Web to Inclusive Conversations with LLMs</article-title>
          , in: C. Conati, G. Volpe, I. Torre (Eds.),
          <source>Proc. of AVI</source>
          <year>2024</year>
          , ACM,
          <year>2024</year>
          , pp.
          <fpage>87:1</fpage>
          -
          <lpage>87:3</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luera</surname>
          </string-name>
          et al.,
          <source>Survey of User Interface Design and Interaction Techniques in Generative AI Applications</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] OpenAI, Introducing Canvas,
          <year>2024</year>
          . URL: https://openai.com/index/introducing-canvas/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shneiderman</surname>
          </string-name>
          ,
          <article-title>Direct Manipulation: A Step Beyond Programming Languages</article-title>
          ,
          <source>Computer</source>
          <volume>16</volume>
          (
          <year>1983</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Landay</surname>
          </string-name>
          ,
          <article-title>Evaluating Speech-Based Smart Devices Using New Usability Heuristics</article-title>
          ,
          <source>IEEE Pervasive Computing</source>
          <volume>17</volume>
          (
          <year>2018</year>
          )
          <fpage>84</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Luebs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Tigwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shinohara</surname>
          </string-name>
          ,
          <article-title>Understanding Expert Crafting Practices of Blind and Low Vision Creatives</article-title>
          , in: CHI EA '24, ACM,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>