<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Smart speaker design and implementation with biometric authentication and advanced voice interaction capability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bharath Sudharsan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Corcoran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Intizar Ali</string-name>
          <email>ali.intizarg@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Institute, National University of Ireland Galway</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Advancements in semiconductor technology have reduced the dimensions and cost of chipsets while improving their performance and capacity. In addition, advances in AI frameworks and libraries make it possible to accommodate more AI at the resource-constrained edge of consumer IoT devices. Sensors are nowadays an integral part of our environment, providing continuous data streams on which to build intelligent applications; an example is a smart home with multiple interconnected devices. In such smart environments, for convenience and quick access to web-based services and personal information such as calendars, notes, emails, reminders, banking, etc., users link third-party skills or skills from the Amazon store to their smart speakers. In addition, in current smart home scenarios, several smart home products such as smart security cameras, video doorbells, smart plugs, smart carbon monoxide monitors, and smart door locks are interlinked to a modern smart speaker by means of custom skill addition. Since smart speakers are linked to such services and devices via the smart speaker user's account, they can be used by anyone with physical access to the smart speaker via voice commands. If this happens, the user's data privacy, home security, and other interests are compromised. The recently launched Tensor Cam AI Camera, Toshiba's Symbio, and Facebook's Portal are camera-enabled smart speakers with AI functionalities. Although they are camera-enabled, they have no authentication scheme beyond calling out the wake-word. This paper provides an overview of the cybersecurity risks faced by smart speaker users due to the lack of an authentication scheme and discusses the development of a state-of-the-art camera-enabled, microphone array-based modern Alexa smart speaker prototype that addresses these risks.</p>
      </abstract>
      <kwd-group>
        <kwd>Alexa Voice Service</kwd>
        <kwd>Snowboy hotword detection</kwd>
        <kwd>Smart speaker design</kwd>
        <kwd>Microphone array</kwd>
        <kwd>ReSpeaker</kwd>
        <kwd>Voice algorithms</kwd>
        <kwd>OpenCV</kwd>
        <kwd>Smart speaker authentication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The recent advancements in technology (particularly IoT and AI) are having
a great impact on our day-to-day lives [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In a smart home scenario, multiple
smart devices are interlinked and collaborate with each other to serve
a common goal. Smart speakers are one such device: they are
being widely adopted by ordinary users and are becoming an integral part of smart
homes. The AI assistants built into recent smart speakers can understand
voice-based commands and control the complex integrated systems of a smart home.
While voice-based commands provide an easy mechanism for interacting with complex
systems, they also introduce a security risk: control
of those systems is handed to any user who has access to the smart speaker and can deliver
voice-based commands. There is therefore a strong need for biometrics-based
authentication mechanisms for smart speakers, to strengthen the security of the
integrated systems without compromising the rich user experience. Given the
limited reliability of existing voice authentication systems, the ideal solution
is to introduce additional authentication techniques.
      </p>
      <p>
        When a person claims to be the registered smart speaker user, there is a need
for a factor that proves "the user is who she says she is". This factor can be
something the user knows (a PIN or password), something the user has (a physical token),
or something the user is (biometrics). Biometric authentication is best suited here,
since the credential is part of the user, which keeps the authentication process
of the smart speaker hands-free. Voice authentication analyzes the user's voice to
verify identity based on the user's unique vocal attributes. Voice authentication is
ideal for hands-free use of standalone devices such as smartphones, smart speakers,
and voice-based systems in automobiles, since its integration is cost-effective,
familiar and convenient for most users, less invasive (contactless), and more
hygienic. Its downsides are that it is not as accurate as other biometric modalities
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], that it requires an additional liveness detection system, and that background noise degrades
its voice matching performance [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Biometric authentication solutions such
as Knomi [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] provide a family of biometric matching and liveness detection
algorithms that use both face &amp; voice for authentication. Likewise, Sensory's
TrulyHandsfree [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses proprietary face recognition, voice recognition, and biometric fusion
algorithms that leverage computer vision, speech processing, and machine learning
to provide on-device, almost instantaneous authentication. The SDKs
of such multi-modal authentication systems are suited to building applications for
smartphones and tablets, not for smart speakers, because of smart speakers' limited hardware
specifications.
      </p>
      <p>The recently launched Tensor Cam AI Camera, Toshiba's Symbio, and Facebook's
Portal are camera-enabled smart speakers with AI functionalities. Although they
are camera-enabled, they have no authentication scheme beyond
calling out the wake-word. The modern Alexa smart speaker discussed in
this paper is constructed from off-the-shelf hardware components (Raspberry
Pi, ReSpeaker v2, Raspberry Pi camera, and a regular speaker). A biometrics-based
authentication system for such an Alexa smart speaker is designed by adding a
camera module and introducing a face recognition algorithm. This face recognition
algorithm was trained using a Deep Neural Network and can detect and identify
human faces for authentication. Additionally, it can identify and recognize
faces during a human gaze, waking up Alexa only when a known face is
recognized. To provide seamless, full-duplex user-Alexa interaction, a
microphone array with an on-board chip hosting DSP-based speech algorithms was
selected and used to capture, process, and provide a noise-suppressed voice feed
to Alexa. Our proof-of-concept prototype demonstrates a rich user experience for
interacting with smart speakers by providing an extra layer of authentication and
facilitating improved voice interaction with the device.</p>
    </sec>
    <sec id="sec-2">
      <title>Cybersecurity risks due to lack of authentication schemes in smart speakers</title>
      <p>
        Users start interacting with a regular Alexa smart speaker by waking up the
Alexa AI voice assistant by calling out the "Alexa" wake-word, followed by
regular dialogue-based interaction. At present, a few Alexa devices
support voice profiles [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to provide a personalized interaction experience with
the supported features. For this, the user has to train Alexa by voice and then
link the trained voice with a corresponding Alexa user account. However, this
feature provides only voice-based user identification rather than authentication. Firstly,
this existing voice-biometric feature is limited to a few Alexa-supported features
and does not act as a voice biometric authentication method for the whole smart
speaker system. Secondly, it has been shown that a similar voice may be able to fool
Amazon's and Google's voice recognition [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and Google itself warns that
a similar voice might be able to access your information when the user sets
up voice recognition for the first time. According to a guide to the security of
voice-activated smart speakers (an ISTR Special Report published in
2017 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) and other similar research articles, the following are a few cybersecurity
risks to which a smart speaker user is exposed in the absence of a user
authentication scheme.
a. The curious child attack: There is always a risk that a child makes a
purchase via voice commands to the smart speaker without the knowledge
of the linked account owner.
b. Mischievous neighbor's tale: A neighbor who wants to cause mischief
could send commands to the smart speaker at ultrasonic frequencies, which
cannot be heard by humans but can be detected by smart speakers.
c. "This parrot keeps trying to buy food by speaking to Alexa" [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]:
A parrot managed to successfully add items such as strawberries, a light bulb,
and a kettle to its owner's online shopping cart. Such activities could be
avoided by using a PIN, but the parrot could potentially learn and repeat the
PIN too.
d. Talking television troubles: Simply watching television or listening to the
radio can wake up and interact with the smart speaker.
e. Physical access: Anyone in proximity to the smart speaker can wake it up,
interact with it, and extract information from the actual user's calendar, reminders,
and other linked applications.
f. Biometric-override attack [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]: An attacker can inject voice commands
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] by replaying a previously recorded clip of the victim's voice, or by
impersonating the victim's voice.
g. Malicious commands: Someone can generate malicious commands that
are heard as garbled sounds by human ears while the smart speaker
interprets them as commands. Such commands can be embedded in online
videos or TV advertisements to attack devices [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As smart speakers are
always listening, they are susceptible to such attacks from devices
that can generate malicious audio; audio from a television news segment triggered an
Amazon Echo to place an order for a dollhouse [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        To address these issues, one existing approach is to provide voice
biometrics-based authentication for crucial third-party applications such as
calendars, email, banking, etc. that are linked to Alexa. This can be done by
integrating a third-party voice biometric API such as ArmorVox [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. By doing so,
the raw voice file captured by the smart speaker is exposed to the third-party
API, which could well give rise to privacy and data security
challenges in the future. Again, these approaches do not provide an authentication
method for the whole smart speaker system and still leave the system exposed
to risks. Unfortunately, since most state-of-the-art smart speakers have no
authentication method, they are largely ineffective in alleviating the
issues above. The prototype developed and described in this paper interacts with the Alexa
API by providing a noise-suppressed audio feed captured from a microphone array
and, in addition, is capable of performing biometrics (facial recognition) based
system wakeup on top of calling out the Alexa wake-word. The importance of
biometrics-based authentication for smart speakers was discussed in this section;
the development of such a biometrics-enabled smart speaker prototype is
discussed in the upcoming sections.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        VAuth [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] was proposed for continuous authentication of voice assistants, to defend
against the threats caused by the open nature of the smart speaker's voice
channel. VAuth is a separate embedded system worn on
devices such as eyeglasses, earphones/buds, and necklaces. It senses
the body-surface vibrations of the user and matches them against the speech signal
received by the voice assistant's microphone. Although VAuth achieved 97%
detection accuracy, it is not feasible to charge, maintain, and carry a
separate embedded system attached to the user's body just to authenticate
a smart speaker. Daon's IdentityX [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a multi-modal, vendor-agnostic identity
services platform that provides additional biometrics-based authentication, but only via
a smartphone and only when using financial services apps through Alexa. This process
involves a secondary gadget (the smartphone) and is still not an authentication
scheme for the entire smart speaker system, which leaves Alexa exposed to the risks
discussed in Section 1. EchoSafe [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a sonar-based defense against attacks
that occur through malicious voice commands from nearby devices while the user
is away. When the user sends a critical command to the smart
speaker, an audio pulse is emitted from the smart speaker, followed by post-processing
to determine whether the user is present in the room. The authors claim the EchoSafe
system can detect the user's presence during critical commands with 93.13%
accuracy. EchoSafe addresses only attacks via malicious voice commands
and is not suited to the other vulnerabilities.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Overview of biometric authentication and speech algorithm based smart speaker</title>
      <p>The first objective of this work is to provide a face biometrics-based
authentication scheme for the entire smart speaker system. To achieve this, a camera
module is added to the smart speaker prototype as shown in Fig. 2. The lack of
authentication schemes in regular smart speakers leaves an open door for anyone in the
vicinity to access the user's private information. Since the prototype discussed
in this paper has a camera module and is equipped with face-recognition-based
Alexa wakeup scripts, it provides an extra layer of authentication. As shown
in Fig. 1, the registered user first gazes at the camera to authenticate to the
system, then calls out the Alexa wake-word and starts the regular dialogue-based
interaction with Alexa. Section 4.3 discusses the algorithms involved in waking
up the system when a known face gazes at the prototype. The second objective is
to capture and provide high-quality, noise-suppressed voice input to Alexa to
achieve a seamless, full-duplex user-Alexa speech interaction. For this,
the ReSpeaker v2 microphone array is used rather than a single microphone,
since it can segregate speech from noise. This mic array also has a built-in
high-performance processor loaded with on-chip advanced DSP (Digital Signal
Processing) based speech algorithms, which enables users to interact with Alexa
from five meters or further away from the smart speaker, to interact while walking
around the room, and so on. This mic array's role and the benefits of using it to capture,
process, and provide voice input for Alexa are discussed in Section 4.2. The third
objective is to improve the user experience by ensuring the smart speaker is
not activated accidentally when the wake-word is not called out, and by ensuring
the Alexa wake-word is spotted in the input audio streams with high accuracy.
For this, a third-party wake-word engine, discussed in Section 4.5, is
integrated with the Alexa Voice Service C++ SDK as discussed in Section 4.4.</p>
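      <p>The gaze-then-wake flow above can be sketched as a simple gating predicate. This is an illustrative sketch rather than the authors' code: the function name and the ten-second validity window for a face match are assumptions made for this example.

```python
from typing import Optional
import time

FACE_AUTH_WINDOW_S = 10.0  # assumed validity window for a recognized face


def should_wake(last_face_match_ts: float, wake_word_heard: bool,
                now: Optional[float] = None) -> bool:
    """Return True only if a registered face was recognized within the
    validity window AND the Alexa wake-word was just spotted.

    Both layers must pass: biometric authentication (camera) gates the
    normal wake-word flow (microphone array, channel 0)."""
    now = time.monotonic() if now is None else now
    face_ok = (now - last_face_match_ts) <= FACE_AUTH_WINDOW_S
    return face_ok and wake_word_heard
```

In the prototype this predicate would sit between the face recognition script and the Alexa Sample App: the wake-word callback fires Alexa only when the face gate is open.</p>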
      <sec id="sec-4-1">
        <title>Hardware components of the smart speaker prototype</title>
        <p>This modern smart speaker prototype is constructed from a commercial off-the-shelf
advanced microphone array with built-in DSP (ReSpeaker v2), a camera
module (Raspberry Pi camera), and a regular speaker, all interfaced to a single board
computer as shown in Fig. 1.
a. Selection of single board computer: BeagleBone Black, Orange Pi 3,
LattePanda 2G/32G, and Banana Pi M4 are the SBCs of interest. The
Raspberry Pi 3 Model B+ (the Pi 4 had not yet been released) was chosen considering its
form factor, price-performance balance, low power consumption, compatibility
with off-the-shelf devices, and community-created guides, tutorials, and support.
As illustrated in Fig. 1, Python scripts written using external libraries
are deployed on this Raspberry Pi Linux SBC. The scripts deployed here
are responsible for waking up the Alexa Sample App when a known face is
recognized in the captured live frames.
b. Selection of camera unit: For real-time computer vision applications, the
Raspberry Pi Camera V2 is preferred since it is capable of 1080p 30fps video
encoding and 5MP stills. Since the camera is connected directly to the
GPU via the CSI connector as shown in Fig. 1, there is only a small impact on the Pi's
CPU, leaving it available for other processing. Most cost-effective web cameras
do not have built-in encoding like the Pi camera; web cameras therefore use
additional CPU resources, reducing the overall performance of the
system.
c. Selection of microphone array: The microphone is a crucial part of a smart
speaker system. Since we require pre-processing of sound using speech
algorithms, the focus is on a microphone array with built-in advanced DSP
algorithms. ReSpeaker v2, Matrix Creator, PS3 Eye, the Conexant 4-mic
development kit, MiniDSP UMA-8, and Microsemi AcuEdge ZLK38AVS are the
microphone arrays of interest. ReSpeaker v2 has a good success rate for
hot-word detection as distance increases, tested in a silent room,
a room with white noise, and a room with background music [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The PS3 Eye
has an edge over the ReSpeaker v2, but the ReSpeaker v2 was chosen for this project
because the Raspberry Pi camera with its CSI interface has better support in the
OpenCV environment than the PS3 Eye camera. The second reason is that the
ReSpeaker has a pixel ring of 12 RGB LEDs which can be used for visual
feedback in addition to the speaker unit.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Speech algorithms based microphone array for advanced voice interaction capability</title>
        <p>The firmware on the XVF-3000 chip (present on the ReSpeaker v2 hardware)
produces six-channel mic output via USB to the Linux system. Channel zero
contains audio that has been processed using advanced DSP algorithms. Channels one
to four contain raw data from the microphones corresponding to the channel
number. Channel five provides raw audio which is a combination of all raw audio
signals from the four microphones on the ReSpeaker v2. A high-level illustration of
this mic array's role is shown in Fig. 3. Here, the audio feed from channel zero
is used for wake-word spotting and also fed as voice input to Alexa. The
benefits of using the ReSpeaker v2 with Alexa are listed below.
a. Far-field voice capture: Wake up and interact with the smart speaker by
capturing and processing raw microphone inputs at distances of five
meters or further.
b. USB Audio Class 1.0 (UAC 1.0): USB audio is used to send digital audio
from the Raspberry Pi to the digital-to-analog converter (DAC) built into the
ReSpeaker v2. Class 1.0 can carry up to 24-bit/96kHz hi-res
files. By utilizing this we can bypass the internal sound card of the Raspberry
Pi and let the USB DAC play Alexa's audio responses with much
better quality.
c. Twelve programmable RGB LED pixel ring: The RGB LED pixel
ring on the ReSpeaker v2 is used to visually indicate the direction of speech
signal arrival (the source). The pixel ring library is used to address the LED pixels via
the USB interface, changing color and brightness according to requirements
from the main program.
d. Digital Signal Processing algorithms on the ReSpeaker v2:
i. Beamforming: All MEMS microphones have an omnidirectional pickup
response, i.e., their response is the same for sound coming from anywhere
around the microphone. A directional response, or beam pattern, can
be formed by configuring multiple microphones in an array, enabling
us to detect and track the position of the smart speaker user's voice
across the room. As the user interacts with the smart
speaker and walks around the room, the angle of the microphone beam
adjusts automatically to track their voice. Hence, it is effectively possible
to point towards the user's direction and suppress noise or reverberation
from other directions.
ii. Noise suppression: In acoustic beamforming, the spatial relationship
of the microphones in the array achieves active microphone
noise suppression and control. If the direction of the sound source relative
to the microphone array is known, an acoustic beamformer can be
designed to pass signals coming from the sound source of interest and
filter out sound picked up from other directions. This
approach to microphone array noise reduction is most applicable to a
situation in which one person's voice needs to be heard while multiple
people are talking. Noise suppression removes both stationary (point-noise)
and non-stationary background sounds.
iii. De-reverberation: In any room, one's voice reverberates (reflects)
off hard surfaces around the room, e.g. a window or TV screen.
De-reverberation removes these reflections and cleans up the voice signal.
iv. Acoustic Echo Cancellation: While interacting with electronic devices,
users sometimes hear their own voice (often with a significant delay).
This experience is known as acoustic echo. Controlling and canceling
acoustic echo is essential for voice-based systems such as smart speakers.
For example, if the smart speaker user is watching a film on a TV at
low volume and simultaneously gives voice input to the smart speaker,
the microphones capture both the user's voice and the sound
of the film (the acoustic echo). This acoustic echo is canceled from the
voice input so that text can be extracted from the captured audio with better
accuracy.</p>
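        <p>The channel layout described above can be illustrated with a short de-interleaving sketch. This is an example under the usual USB audio convention (interleaved 16-bit samples, one frame per sample period), not code from the prototype:

```python
import numpy as np

# ReSpeaker v2 channel map: 0 = DSP-processed output (used for wake-word
# spotting and as Alexa's voice input), 1-4 = raw mics, 5 = merged raw audio.
NUM_CHANNELS = 6


def processed_channel(interleaved: np.ndarray) -> np.ndarray:
    """Return the DSP-processed mono feed (channel 0) from an interleaved
    int16 capture buffer of shape (frames * NUM_CHANNELS,)."""
    frames = interleaved.reshape(-1, NUM_CHANNELS)  # one row per sample frame
    return frames[:, 0].copy()
```

Feeding only this single channel downstream is what lets the AVS SDK's signal-processing stage be bypassed, as discussed in Section 4.3.</p>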
      </sec>
      <sec id="sec-4-3">
        <title>Biometric authentication based Alexa wakeup</title>
        <p>
          As illustrated in Fig. 5, when the face recognition script runs, faces are
detected in the live frames captured from the Pi camera, and a 128-d face
embedding is computed via a deep metric network for each detected face. The
computed 128-d face embedding is then compared against a database of
already computed face encodings of registered faces to recognize
faces in the live frame. Once a known face is recognized, this script
wakes up Alexa, and simultaneously the ReSpeaker's RGB LED pixel ring
provides visual feedback to the user by turning green. Before running this
face recognition script, a sub-script as shown in Fig. 4 has to be run in order
to encode 128-d vectors for the faces in the dataset (a directory with .jpg files of
faces) &amp; store the encodings in a .pickle file, which is later used as the database
(while running the main face recognition script) to compare faces detected in
live frames and check for a match. Since the Pi has limited computation power,
memory &amp; GPU, its resources have to be left free for other scripts to run.
Hence, algorithms such as Eigenfaces and LBPs (Local Binary
Patterns), which can achieve frame rates greater than 10 FPS, were not used.
In parallel with setting up the smart speaker hardware and deploying the scripts,
the Pi has to be registered as a device at the Amazon developer console &amp; a
security profile has to be created. Follow the detailed step-by-step instructions
for the cloud-side setup at [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and provide the path of the generated config.json
while building the Alexa AVS Sample App from its SDK. This C++ library-based
AVS Device SDK enables us to integrate Alexa into the smart speaker
prototype. The smart speaker's interaction with AVS is performed using
this Alexa Sample App, built for the Raspberry Pi from the official SDK.
Before proceeding with the Alexa AVS C++ SDK, the Python version of the Alexa Voice
Service app [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] was tested with the Raspberry Pi &amp; ReSpeaker v2. The following
results were observed.
a. After interacting with Alexa for some time, Alexa's voice turned
blurred &amp; muffled. This was resolved only by restarting the Pi.
b. After spotting the Alexa wake-word, there is a short delay
(approx. 0.5 seconds) before audio is streamed to the
Alexa cloud.
As mentioned in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and shown in Fig. 6, multiple components comprise the C++
AVS SDK through which the audio data flows. Initially, signal processing
algorithms are applied to the input and output audio channels to produce
processed, clear audio. If the raw audio data from the four microphones of the ReSpeaker
were provided as input, this third-party Audio Signal Processor would combine
them and provide a single audio stream to the next component in the architecture.
Here, however, we already provide a single-channel audio stream
processed by the DSP on the XVF-3000 chip of the ReSpeaker v2. The remaining
subparts of the architecture perform their functionality as described in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and
Fig. 6. Snowboy from KITT.ai and Sensory [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] are two third-party wake-word
engines, one of which has to be part of the SDK build to spot the
Alexa wake-word in the input streams and provide hands-free interaction.
Both engines were tested with this ReSpeaker v2 based Alexa smart
speaker setup. The Snowboy wake-word engine was selected and used as a plugin
when building the Alexa AVS Sample App, since it consumes less than 8%
of the Raspberry Pi's CPU and had a better success rate for wake-word detection.
        </p>
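        <p>The matching step described above, comparing a freshly computed 128-d embedding against the pickled database of registered encodings, can be sketched as follows. This is an illustrative sketch rather than the authors' script; the 0.6 Euclidean-distance threshold is the conventional default for dlib-style face embeddings and is assumed here.

```python
from typing import Optional
import numpy as np

MATCH_THRESHOLD = 0.6  # assumed dlib-style distance threshold


def match_face(embedding: np.ndarray,
               known: "dict[str, np.ndarray]") -> Optional[str]:
    """Return the name of the closest registered face, or None if no
    stored 128-d encoding is within the threshold (Alexa stays asleep)."""
    best_name, best_dist = None, MATCH_THRESHOLD
    for name, enc in known.items():
        dist = float(np.linalg.norm(embedding - enc))  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```

In the prototype flow, a non-None result would trigger the Alexa wakeup and turn the ReSpeaker's pixel ring green; the `known` dictionary stands in for the encodings loaded from the .pickle file.</p>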
      </sec>
      <sec id="sec-4-4">
        <title>Snowboy wake-word engine to spot Alexa wake-word</title>
        <p>
          The Snowboy engine ensures the smart speaker is not activated accidentally
when the wake-word has not been called out. The accuracy of wake-word detection
engines is measured by plotting false alarms per hour (the number of false
positives) against miss detection rate (the percentage of wake-word utterances an
engine rejects incorrectly). The ROC curves of four different wake-word
detection engines are shown in Fig. 7. Here, the Snowboy wake-word engine has
the lowest miss detection rate and is more accurate than the other engines.
The reasons for integrating wake-word engines with voice-based AI assistants
and smart speakers are as follows [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
a. Privacy: The microphones do not have to stream audio at all times.
b. Cost: It is impractical &amp; expensive to stream data to the cloud all the
time.
c. Power consumption: Voice assistants run on smartphones, wearables
&amp; smart speakers, where maximum standby time is expected.
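The two accuracy axes used in the ROC comparison above can be computed as follows; this is a minimal sketch with function names of our own choosing, not code from the benchmark.

```python
def false_alarms_per_hour(false_positives: int, audio_hours: float) -> float:
    """False activations per hour of non-wake-word audio."""
    return false_positives / audio_hours


def miss_rate(missed: int, total_utterances: int) -> float:
    """Fraction of true wake-word utterances the engine rejected."""
    return missed / total_utterances
```

Sweeping an engine's detection sensitivity and recomputing both quantities at each setting traces out one ROC curve of the kind shown in Fig. 7.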
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        Over 50% of the Irish population is expected to own a regular smart speaker by 2023
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and it is predicted that smart speaker ownership will overtake that of tablets
globally by 2021 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Likewise, camera-enabled smart speakers will soon replace
regular smart speakers and become an integral part of our daily lives. This paper
provided an overview of the cybersecurity risks faced by smart speaker users due to
the lack of an authentication scheme and discussed the development of a state-of-the-art
camera-enabled, microphone array based modern Alexa smart speaker prototype
that addresses these risks. In addition to biometrics-based system wakeup and
microphone array based interaction, since this smart speaker prototype is a
camera-enabled Linux-based system, it is capable of hosting custom skills that
perform audio processing and computer vision based tasks when requested by the
user. The development process involved in the implementation of such custom
skills is considered future work. We also plan to extend our existing work to
multiple use-cases requiring voice commands, such as smart enterprises (online
meetings) and smart manufacturing (human-machine interaction) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This publication has emanated from research supported by a research grant from
Science Foundation Ireland (SFI) under Grant Number SFI/16/RC/3918
(Confirm), and SFI/12/RC/2289 P2 (Insight) co-funded by the European Regional
Development Fund.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. A guide to the security of voice-activated smart speakers: An ISTR Special Report, https://www.symantec.com/content/dam/symantec/docs/security-center/white-papers/istr-security-voice-activated-smart-speakers-en.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Picovoice benchmark, https://github.com/Picovoice/wakeword-benchmark</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Home | sensory (
          <year>2014</year>
          ), https://www.sensory.com/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Trulysecure | sensory (
          <year>2014</year>
          ), https://www.sensory.com/products/ technologies/trulysecure/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Amazon.com help: About Alexa voice profiles (
          <year>2019</year>
          ), https://www.amazon.com/gp/help/customer/display.html?nodeId=202199440
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Daon | multi-factor mobile biometric authentication | IdentityX platform (
          <year>07 2019</year>
          ), https://www.daon.com/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Welcome - ArmorVox (<year>2019</year>), https://cloud.armorvox.com/</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. <string-name><surname>Alanwar</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Balaji</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Tian</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Srivastava</surname>, <given-names>M.</given-names></string-name>: <article-title>EchoSafe</article-title>. <source>Proceedings of the 1st ACM Workshop on the Internet of Safe Things - SafeThings'17</source> (<year>2017</year>)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Alexa: alexa/avs-device-sdk, https://github.com/alexa/avs-device-sdk</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. <string-name><surname>Ali</surname>, <given-names>M.I.</given-names></string-name>, <string-name><surname>Ono</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Kaysar</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Griffin</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Mileo</surname>, <given-names>A.</given-names></string-name>: <article-title>A semantic processing framework for IoT-enabled communication systems</article-title>. <source>In: ISWC 2015</source></mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Aware Biometrics: <article-title>KnomiTM - mobile biometric authentication framework - Aware Biometrics</article-title> (<year>2017</year>), https://www.aware.com/knomi-mobile-biometric-authentication/</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Aware Biometrics: <article-title>Voice authentication technology - Aware Biometrics software</article-title> (<year>2018</year>), https://www.aware.com/voice-authentication/</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. <string-name><surname>Charlton</surname>, <given-names>A.</given-names></string-name>: <article-title>This parrot keeps trying to buy food by speaking to Alexa</article-title> (12 <year>2018</year>), https://www.gearbrain.com/parrot-uses-amazon-alexa-2623633611.html</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. <string-name><surname>Feng</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Fawaz</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Shin</surname>, <given-names>K.G.</given-names></string-name>: <article-title>Continuous authentication for voice assistants</article-title>. <source>Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking - MobiCom '17</source> (<year>2017</year>), https://dl.acm.org/citation.cfm?doid=3117811.3117823</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. <string-name><surname>Gebhart</surname>, <given-names>A.</given-names></string-name>: <article-title>Fooling Amazon and Google's voice recognition isn't hard</article-title> (11 <year>2017</year>), https://www.cnet.com/news/fooling-amazon-and-googles-voice-recognition-isnt-hard/</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. <string-name><surname>Kenarsari</surname>, <given-names>A.</given-names></string-name>: <article-title>Yet another wake-word detection engine</article-title> (04 <year>2018</year>), https://medium.com/@alirezakenarsarianhari/yet-another-wake-word-detection-engine-a2486d36d8d4</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. <string-name><surname>Liptak</surname>, <given-names>A.</given-names></string-name>: <article-title>Amazon's Alexa started ordering people dollhouses after hearing its name on TV</article-title> (01 <year>2017</year>), https://www.theverge.com/2017/1/7/14200210/amazon-alexa-tech-news-anchor-order-dollhouse</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. <string-name><surname>Mills</surname>, <given-names>T.</given-names></string-name>: <article-title>The impact of artificial intelligence in the everyday lives of consumers</article-title>. <source>Forbes</source> (03 <year>2018</year>), https://www.forbes.com/sites/forbestechcouncil/2018/03/07/the-impact-of-artificial-intelligence-in-the-everyday-lives-of-consumers/#4f6ae8446f31</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. <string-name><surname>Panjwani</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Prakash</surname>, <given-names>A.</given-names></string-name>: <article-title>Crowdsourcing attacks on biometric systems</article-title>, https://www.usenix.org/system/files/conference/soups2014/soups14-paper-panjwani.pdf</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. <string-name><surname>Patel</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Ali</surname>, <given-names>M.I.</given-names></string-name>, <string-name><surname>Sheth</surname>, <given-names>A.P.</given-names></string-name>: <article-title>From raw data to smart manufacturing: AI and semantic web of things for industry 4.0</article-title>. <source>IEEE Intelligent Systems</source> (<year>2018</year>)</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Respeaker: Alexa, https://github.com/respeaker/Alexa</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. <string-name><surname>Rouchon</surname>, <given-names>C.</given-names></string-name>: <article-title>Benchmarking microphone arrays: ReSpeaker, Conexant, Microsemi AcuEdge, Matrix Creator, miniDSP...</article-title> (08 <year>2017</year>), https://medium.com/snipsai/benchmarking-microphone-arrays-respeaker-conexant-microsemi-acuedge-matrix-creator-minidsp-950de8876fda</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>23. <string-name><surname>Taylor</surname>, <given-names>C.</given-names></string-name>: <article-title>Over 50% of Irish people expected to own a smart speaker by 2023</article-title> (04 <year>2019</year>), https://www.irishtimes.com/business/technology/over-50-of-irish-people-expected-to-own-a-smart-speaker-by-2023-1.3869382</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>