<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IUI Workshops’19, March 20, 2019, Los Angeles, USA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Evaluating Voice Applications by User-Aware Design Guidelines Using an Automatic Voice Crawler</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xu Han</string-name>
          <email>xu.han-1@colorado.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom Yeh</string-name>
          <email>tom.yeh@colorado.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Colorado Boulder</institution>
          ,
          <addr-line>Boulder</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>20</volume>
      <issue>2019</issue>
      <abstract>
        <p>Adaptive voice applications supported by conversational agents (CAs) are increasingly popular (e.g., Alexa Skills and Google Home Actions). However, much work remains in the area of voice interaction evaluation, especially in terms of user-awareness. In our study, we developed a voice skill crawler to collect responses from the 100 most popular Alexa skills across 10 categories. We then evaluated these responses to assess their compliance with three user-aware design guidelines published by Amazon. Our findings show that more than 50% of voice applications do not follow some of these guidelines and that guideline compliance varies across skill categories. As voice interaction continues to grow in consumer settings, our crawler can evaluate CA-based voice applications with high efficiency and scalability.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Human-centered computing → HCI design and evaluation
methods; Interactive systems and tools; Systems and tools for
interaction design.</p>
      <p>KEYWORDS: conversational agents; voice applications; user-awareness evaluation</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION AND MOTIVATION</title>
      <p>
        Voice-powered conversational agent (CA) devices have recently
achieved significant commercial success. In the U.S.A., 47.3 million
households (19.7%) now own CA devices as of March 2018, an
increase from less than 1% two years earlier [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Amazon’s Echo series
devices make up 71.9% of the market, followed by Google’s devices
with 18.4% [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        One key characteristic that makes this new generation of CA
devices adaptive is their API platform for third-party developers. Here,
developers design and build voice applications and publish them on
a marketplace with the potential to reach millions of users.
Amazon’s Alexa skills [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Google’s Home Actions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are the two
most popular examples. Yet, many third-party developers may not
have prior experience in designing and building voice applications,
especially in terms of user-awareness. A well-designed user-aware
voice application should adapt its interaction mode to different
users and satisfy their individual needs. To help educate
developers, Amazon and Google have published design guidelines [
        <xref ref-type="bibr" rid="ref1 ref8">1, 8</xref>
        ]
to establish a set of design practices a voice application should
comply with. These official design guidelines cover a variety of
topics ranging from how to clearly communicate the purpose of a
voice application to users to how to design a natural and adaptive
interaction flow.
      </p>
      <p>
        Despite the existence of official design guidelines, the quality of
CA-based voice applications in terms of user-awareness varies widely.
An example of a highly-rated (4.9 out of 5 stars based on 3209 user
reviews) Alexa skill is Would You Rather for Family. This skill is
an interactive Q&amp;A game that exhibits several user-aware design
features following Amazon’s guidelines, including remembering
where the last interaction ends and giving a personalized opening
prompt to users. In contrast is the skill AccuWeather. This skill’s
average rating is low—2.2 out of 5 stars based on 182 user reviews.
Within a user interaction, the skill’s design violates several
user-aware design guidelines, such as handling errors properly. Users
also complain about these violations in their reviews.
At the same time, although several user experience evaluation
methodologies have been adopted for CAs and voice applications,
efficient ones are still lacking. Traditional usability studies are
useful for gathering feedback and conducting evaluation analysis on
CAs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], while longitudinal studies are another effective methodology for
elucidating scenarios and situations involving the use of CAs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], researchers interviewed 14 users of CAs to understand the
factors affecting everyday use. The study in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] deployed lab-based usability
studies by developing CA prototypes of varying fidelity. However,
by March 2018, more than 30,000 Alexa voice skills [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] had been
published by thousands of third-party developers. This large
number of voice applications requires far more efficient evaluation
methodologies; traditional approaches such as usability studies and
longitudinal studies cannot satisfy this need.
      </p>
      <p>The variability in user-aware design quality among third-party
voice applications and the lack of efficient evaluation methods
inspire these research questions:
(1) What is the current state of CA-based voice applications in
following (or violating) the user-aware design guidelines?
(2) How can we efficiently evaluate CA-based voice applications
against the user-aware design guidelines?
To study these questions, we focused on Alexa skills and selected
the 100 most popular Alexa skills from ten categories. We developed a
voice skill crawler that can efficiently collect responses from a
wide variety of CA-based voice applications with difering input
commands. We then analyzed the collected responses to determine
whether particular guidelines were followed. Regarding the
first research question, we focused on the three design guidelines
most relevant to user-awareness and assessed
compliance. These three design guidelines are: a skill needs to memorize
a user’s previous interaction mode to provide more personalized
service; a skill needs to adaptively re-prompt users to continue the
interaction when it receives no input; and a skill needs to reword
the re-prompt messages with more detailed information based on
previous personalized interaction. The key findings show that more
than 50% of voice applications do not follow some of these
guidelines and variation in guideline compliance across skill categories
exists. Regarding the second research question, we argue that our
voice skill crawler is a research prototype with great potential for
efficient user-awareness evaluation of CA-based voice applications.</p>
    </sec>
    <sec id="sec-3">
      <title>2 VOICE CRAWLER DESIGN</title>
      <p>We developed a crawler to automate the task of collecting a large
sample of skills’ responses to voice commands. The crawler
follows the basic Alexa interaction mode of "open-command-stop"
to initiate a skill and exit it. As indicated in Figure 1, the crawler
simulates the voice interaction between users and Alexa devices.
It uses a synthetic voice generated by Google’s Text-to-Speech
package for Python (https://gtts.readthedocs.io/en/latest/) to speak
a command to a skill. It then listens to the response using a speech
recognition package (https://pypi.org/project/SpeechRecognition/).
Finally, it saves the responses to a file for further analysis. The
crawler iterated through our sample of 100 skills to perform this
sequence for each skill. The data collection process is described as
pseudocode in Algorithm 1.</p>
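      <p>As a concrete companion to Algorithm 1, the collection loop can be sketched in Python. This is a minimal sketch rather than our exact implementation: the speak and listen parameters stand in for the text-to-speech and speech-recognition layers (e.g., the gTTS and SpeechRecognition packages), so the loop logic can be exercised without audio hardware.</p>
      <preformat>
```python
def crawl_skills(skills, commands, speak, listen):
    """Collect each skill's responses to a list of spoken commands.

    speak(text) plays a synthesized utterance; listen() returns the
    transcribed response. Both are injected so the audio layer can
    be swapped out or stubbed for testing.
    """
    responses = {}
    for skill in skills:
        responses[skill] = []
        speak(f"Alexa, open {skill}")          # initiate the skill
        for command in commands:
            speak(command)                     # speak one command
            responses[skill].append(listen())  # record the reply
        speak("stop")                          # exit the skill
    return responses
```
      </preformat>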
    </sec>
    <sec id="sec-4">
      <title>3 METHOD</title>
    </sec>
    <sec id="sec-5">
      <title>3.1 Skills Selection</title>
      <p>
        By March 2018, more than 30,000 Alexa voice skills [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] had been
published and organized by category on Alexa’s website.
To collect a representative sample for the purpose of our research,
we first identified the 10 categories with the greatest number of
skills. The top 10 categories (and their subcategories) are: 1. Daily
Algorithm 1 Collect Responses to m Commands by n Skills
1: for skill in [s1,s2,...,sn ] do
2: speech ← TextToSpeech("Alexa, open {{skill’s name}}");
3: play speech
4: for command in [c1,c2,...,cm ] do
5: speech ← TextToSpeech(command);
6: play speech;
7: audio ← listen;
8: text ← SpeechToText(audio);
9: save text;
10: end for
11: end for
Activities (News, Weather), 2. Entertainment (Movies &amp; TV,
Music &amp; Audio, Novelty &amp; Humor, Sports), 3. Education &amp; Reference,
4. Health &amp; Fitness, 5. Travel &amp; Transportation, 6. Games, Trivia &amp;
Accessories, 7. Food &amp; Drink, 8. Shopping &amp; Finance (Shopping,
Business &amp; Finance), 9. Communication &amp; Social and 10. Kids. We
wrote a script to scrape Alexa’s website for the top 10 skills for
each category based on the number of reviews. For categories with
subcategories, we tried to balance the number across the
subcategories manually. For example, the ten skills we selected to represent
the Entertainment category consist of three in the Movies &amp; TV
subcategory, three in the Music &amp; Audio subcategory, two in the
Novelty &amp; Humor subcategory, and two in the Sports subcategory. In
all, we selected a total of 100 skills for evaluation.
      </p>
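      <p>The per-category selection step can be illustrated with a short Python sketch. The record fields (name, category, reviews) are illustrative, not our scraper’s actual schema:</p>
      <preformat>
```python
from collections import defaultdict

def top_skills_per_category(skills, n=10):
    """Pick the n most-reviewed skills in each category."""
    by_category = defaultdict(list)
    for skill in skills:
        by_category[skill["category"]].append(skill)
    selected = {}
    for category, entries in by_category.items():
        # rank by review count, descending, and keep the top n
        entries.sort(key=lambda s: s["reviews"], reverse=True)
        selected[category] = [s["name"] for s in entries[:n]]
    return selected
```
      </preformat>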
    </sec>
    <sec id="sec-6">
      <title>3.2 Guideline-Specific Response Elicitation Design</title>
      <p>
        Based on Amazon’s voice design guidelines [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we chose three
representative user-aware design guidelines to focus on. In this
paper, we use G1 to denote the guideline wherein a skill needs
to adaptively re-prompt users when it receives no input; G2
denotes the guideline wherein the re-prompt messages are supposed
to be slightly reworded with more detailed information based on
previous user interaction, and G3 denotes the guideline wherein
a skill needs to memorize a user’s previous interaction mode to
provide more personalized service. For each design guideline, we
derived appropriate testing commands in order to elicit responses
that we could evaluate with respect to that guideline. The details
are presented below.
      </p>
      <p>Problem handling support (G1, G2): Problem handling is one
of the most important aspects of user-aware design. One typical
problem handling scenario we chose to evaluate was how the Alexa
skill would react when it does not receive an answer from the
user. In order to evaluate the compliance situation of these 100
skills with respect to G1 and G2, we first designed crawler loops by
setting the basic commands as elicitation commands. Within one
round of the crawler loop, the crawler will say "open", "help", "stop"
commands and listen to the responses in turn (this crawler loop is
denoted as "open-help-stop" loop in the rest of the paper). In our
response collection process, we implemented an "open-help-stop"
loop and then repeated the "open-elicitation command-stop" loop
three times to make sure each skill was fully explored (using
self-generated commands as elicitation). After that, we enabled the
skill again and gave no further command, waiting to see how it
would respond. Our crawler then repeated this process for all skills
in the sample.</p>
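      <p>The elicitation procedure above amounts to a fixed command schedule per skill. The sketch below is an illustrative reconstruction (the repeats parameter and exact command wording are assumptions):</p>
      <preformat>
```python
def elicitation_schedule(skill, elicitation_commands, repeats=3):
    """Build the ordered utterance list used to probe one skill:
    one "open-help-stop" loop, then the "open-elicitation
    command-stop" loop repeated, then a final open after which the
    crawler stays silent to observe the skill's re-prompt."""
    schedule = [f"Alexa, open {skill}", "help", "stop"]
    for _ in range(repeats):
        for command in elicitation_commands:
            schedule.extend([f"Alexa, open {skill}", command, "stop"])
    schedule.append(f"Alexa, open {skill}")  # then wait for the re-prompt
    return schedule
```
      </preformat>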
      <p>Remember what was said (G3): According to the design guidelines,
users appreciate it when skills remember what they previously
said and use it to provide more personalized service. To test
this, we first fully explored the skills (as we did for G1 and G2)
and then ran our "open-help-stop" loop one more time to see
whether each skill remembered the last interaction and would
change its responses accordingly.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3 Crawled Data Correction and Coding</title>
      <p>After response data was collected, transcribed into text, and saved
using our crawler, we manually analyzed the data as follows. First,
we compared this dataset to a small pilot dataset of 20 skills we
previously collected by hand in order to identify any discrepancy
between machine and human transcribed responses. In doing so we
were able to detect and correct problems caused by limitations of
speech-to-text technology, such as typos and missing punctuation.
After data correction, two researchers independently coded each
response’s compliance with respect to design guidelines. In terms of
problem handling support, we first determined if the skill supports
G1 and then compared with the previous welcome message to
determine if the re-prompt messages were reworded and personalized.
For G3, by comparing the last and the very first "open-help-stop"
loop’s responses, we judged whether the skill memorized previous
interaction. Afterwards, the two researchers compared their coding
results and resolved any discrepancies.</p>
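      <p>In our study this coding was performed manually by two researchers. As a hedged illustration, the G2 comparison of a re-prompt against the welcome message could be approximated in code as follows; the case and whitespace normalization is an illustrative simplification of the manual judgment:</p>
      <preformat>
```python
def is_reworded(welcome, reprompt):
    """Heuristic G2 check: code the re-prompt as reworded only if
    it is not a verbatim repeat of the welcome message, after
    normalizing case and whitespace."""
    def normalize(s):
        return " ".join(s.lower().split())
    return normalize(welcome) != normalize(reprompt)
```
      </preformat>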
    </sec>
    <sec id="sec-9">
      <title>4 FINDINGS AND ANALYSIS</title>
      <p>Out of our sample of the 100 most popular skills from 10 categories,
we did not retrieve responses from six skills due to account linking
errors or access permission issues. Hence, our findings presented
below are based on 94 skills.
</p>
    </sec>
    <sec id="sec-10">
      <title>4.1 User-Aware Design Guidelines’ Compliance</title>
      <p>Table 1 shows the compliance rate for each user-aware design
guideline. For problem handling support (G1, G2), we manually
determined that 82 skills (of 94 total) should have support for G1.
(Some skills are not expected to support G1, such as those meant
for passive listening and "one-shot" skills, with which users
complete their task in a single utterance and have no chance to
say more commands). Among
these 82 skills, 74 (90.2%) of them supported G1. During the
evaluation process, we encountered some skills which did not support
re-prompting, such as Scryb and Mastermind from the
Communication &amp; Social category. When they did not receive an input from
the users, these skills quit automatically. Among the 74 skills that
supported re-prompting, only 23 supported G2 in their
re-prompts. Some skills, such as Bring from the Shopping &amp; Finance
category and I’m Driving from the Travel &amp; Transportation category,
did not reword their responses and simply repeated their previous
response when lacking user input.</p>
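      <p>Compliance rates such as the 90.2% figure above follow directly from the coded labels. A small sketch (the True/False/None coding scheme is our illustrative representation, with None marking skills for which a guideline does not apply):</p>
      <preformat>
```python
def compliance_rate(codes):
    """Share of applicable skills coded as compliant. codes maps a
    skill name to True, False, or None (guideline not applicable,
    e.g. one-shot skills for G1)."""
    applicable = [v for v in codes.values() if v is not None]
    if not applicable:
        return 0.0
    return sum(applicable) / len(applicable)
```
      </preformat>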
      <p>In terms of personalizing based on previous interactions (G3), we
found that only 32 (34.0%) skills memorized previous interactions
and changed their interaction modes accordingly. Some example
skills we encountered during the evaluation could help us identify
the characteristics of skills which did not follow G3. An example
of compliance is the skill Lemonade Stand. It is a CA-based game
application. This skill consistently remembered where the game
was previously interrupted and, when users reopened this skill, the
skill would briefly summarize the current game status and prompt
users to continue the game. On the contrary, skills such as 5-min
Plank Workout from Health &amp; Fitness category and Short Bedtime
Story from Entertainment category were not compliant. These skills
restarted without communicating the previous interaction status.
As we examined further, we identified certain legitimate exceptions
in which past interactions were not remembered. In some skills, the
content is updated so frequently that the skill does not rely on user
responses to function. For example, This Day in History is a skill
that helps users learn more about historical events and is updated
with new information daily. Similarly, some skills’ operations are
simple and do not require user input to update the experience (like
5-min Plank Workout and Short Bedtime Story).
The Games category has the highest compliance both with adaptive
re-prompting (G1) and with remembering what was said by users (G3).
Correspondingly, Games also achieved high user ratings on
Amazon’s Alexa skill webpage (the 10 selected skills have an average
rating of 4.5 out of 5).
This result is consistent with Games skills being expected to involve
more interaction with users and to require more complicated user
interface designs, such as remembering users’ previous scores and
providing personalized game dynamics.</p>
      <p>As for categories with low G2 and G3 compliance, many of them
were simple, straightforward applications that missed
opportunities for personalized experiences. For example, skills in the Daily
Activities category are meant to be used frequently on a daily basis;
skills in the Entertainment category are meant to quickly entertain
users with music, jokes, etc.; skills in the Health &amp; Fitness and
Education &amp; Reference categories are meant to provide direct and
accurate information upon inquiry. Although their interaction modes
tend to be simple and straightforward, they could still benefit from
personalized services. However, our results indicate that these
categories do not perform well in this respect, implying that
more attention must be paid in future development.</p>
    </sec>
    <sec id="sec-11">
      <title>5 DISCUSSION</title>
    </sec>
    <sec id="sec-12">
      <title>5.1 Category-specific Evaluation</title>
      <p>The finding that variation in user-aware design guideline
compliance across skill categories exists help address our first research
question. The variation suggests that each skill category has its
own specific requirements and characteristics which need to be
considered during evaluation. For example, Games category tend
to involve more personalized design while categories which are
meant to be used on frequent basis prefer more straightforward
interaction. This opens up a further research question for future
investigation: How should user-awareness evaluation be adapted
for different categories and, even further, application scenarios? We
can start from studying which design guidelines are more integral
to each category. Additionally, we can also study the category
variations in terms of interaction flows and understand challenges that
may arise.</p>
    </sec>
    <sec id="sec-13">
      <title>5.2 Automating User-Aware Evaluation</title>
      <p>To address our second research question of eficient evaluation, we
ifnd that the automatic voice application crawler we introduced
in this paper could evolve to an automatic user-aware evaluation
system in the future. On the technical side, our crawler focuses
on automatic response data collection, which is the first step of an
automatic evaluation system. Next, we would need to automate the
data coding/labeling process. Innovative labeling algorithms could
be developed based on our manual-labeling process; the study of
common patterns contained in various collected responses could
help conduct better topic analysis to increase the labeling accuracy.</p>
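      <p>As a thought experiment, such automated coding could start from simple rules distilled from our manual labels. The sketch below is hypothetical and would need the topic analysis discussed above to be reliable:</p>
      <preformat>
```python
def auto_code_g2(welcome, reprompts):
    """Rule-based first pass at G2 coding: a skill passes if at
    least one re-prompt contains words absent from the welcome
    message. Purely illustrative; real responses would need
    richer analysis."""
    welcome_words = set(welcome.lower().split())
    for reprompt in reprompts:
        extra_words = set(reprompt.lower().split()) - welcome_words
        if extra_words:
            return True
    return False
```
      </preformat>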
    </sec>
    <sec id="sec-14">
      <title>6 CONCLUSIONS</title>
      <p>With the popularity of CA-based voice applications, evaluating
them with a focus on user-awareness is increasingly important. In
this work, we contribute an automatic voice application crawler
to evaluate the compliance of CA-based voice applications with
user-aware design guidelines. Our findings revealed that only a
small portion of the selected voice applications implemented particular
user-aware design guidelines such as G2 and G3. This suggests a need
for more robust evaluation tools, especially to support developers in
assessing the usability of their own applications. Our findings also
show the necessity of taking categories into consideration when
doing user-aware evaluation. In sum, our research identified
directions for automating the evaluation process, and we showed
that our CA-based voice application crawler can serve as a
foundational research prototype for user-aware evaluation tool
design.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Amazon</given-names>
            <surname>Alexa</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Voice Design Guide</article-title>
          . https://developer.amazon.com/designing-for-voice/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Noor</given-names>
            <surname>Ali-Hasan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Evaluating Smartphone Voice Assistants: A Review of UX Methods and Challenges</article-title>
          . https://voiceux.files.wordpress.com/2018/03/ali-hasan.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Corey</given-names>
            <surname>Badcock</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>First Alexa Third-Party Skills Now Available for Amazon Echo</article-title>
          . https://developer.amazon.com/blogs/post/TxC2VHKFEIZ9SG/First-Alexa-Third-Party-Skills-NowAvailable-for-Amazon-Echo
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Douglas</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Start building Actions on Google</article-title>
          . https://developers.googleblog.com/2016/12/start-building-actions-on-google.html
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Bret</given-names>
            <surname>Kinsella</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Amazon Alexa Skill Count Surpasses 30,000 in the U.S.</article-title>
          . https://voicebot.ai/2018/03/22/amazon-alexa-skill-count-surpasses-30000-u-s/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Bret</given-names>
            <surname>Kinsella</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ava</given-names>
            <surname>Mutchler</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <source>Smart Speaker Consumer Adoption Report</source>
          <year>2018</year>
          . https://voicebot.ai/wp-content/uploads/2018/03/smart_speaker_consumer_adoption_report_2018.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ewa</given-names>
            <surname>Luger</surname>
          </string-name>
          and
          <string-name>
            <given-names>Abigail</given-names>
            <surname>Sellen</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Like having a really bad PA: the gulf between user expectation and experience of conversational agents</article-title>
          .
          <source>In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems</source>,
          <fpage>5286</fpage>-<lpage>5297</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <source>Actions on Google</source>.
          <year>2018</year>.
          <article-title>Conversation Design</article-title>. https://designguidelines.withgoogle.com/conversation/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>