Research of Voice Assistants Safety

Nikita Burym, Mikhail Belenko and Pavel Balakshin
ITMO University, Kronverksky Pr. 49, bldg. A, St. Petersburg, 197101, Russian Federation

Abstract
Internet-connected gadgets with voice assistants are becoming more popular due to their convenience for everyday tasks such as asking about the weather forecast, playing music or controlling other smart devices in the house. However, this convenience comes with privacy risks: smart gadgets have to listen constantly in order to activate when the "wake word" is spoken, and they are known to transmit recorded audio from their environment and store it on cloud servers. This article focuses on the privacy risks associated with using smart gadgets with voice assistants.

Keywords
voice assistants, smart speakers, gadgets, privacy, IoT, voice command, voice recording, wake word

Proceedings of the 12th Majorov International Conference on Software Engineering and Computer Systems, December 10–11, 2020, Online & Saint Petersburg, Russia
Emails: CHVRCHES@mail.ru (N. Burym); mikael0bmv@gmail.com (M. Belenko); pvbalakshin@gmail.com (P. Balakshin)
ORCID: 0000-0002-4343-6408 (N. Burym); 0000-0002-5060-1512 (M. Belenko); 0000-0002-9421-8566 (P. Balakshin)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

In the field of information technology, the means of interaction between a user and a technical system is called an interface. Interfaces vary and are implemented by different means and methods. One of the most important tasks in the development of modern technical systems is to provide the most intuitive and natural user interface possible. One of the natural forms of human interaction is speech. The voice interface is a key part of human-machine interaction: it can improve existing user interfaces and provide a more convenient way of human-computer communication. Google's voice assistant [1] and Apple's Siri [2] are prime examples, highlighting the urgent need to introduce speech technologies such as speech recognition and voice interfaces.

Today, voice assistants are in demand in the customer support segment, where most large companies use them to improve the quality of customer service. Voice assistants also increasingly appear in the home appliances that surround us. Notable examples are the smart speakers Google Nest, previously named Google Home [3], Amazon Echo Dot [4], Apple HomePod [5] and Harman/Kardon Invoke [6], which provide a voice control interface. People can use them to turn music on and off, ask for the weather forecast, adjust the room temperature, order goods online, and much more.

To provide a better user experience, most devices with a built-in voice assistant use an always-listening mechanism that receives voice commands all the time [7]. In particular, users are not required to press or hold a physical button on the device before speaking a command. However, this advantage may expose users to security threats due to the openness of voice channels. A minimal sketch of such an always-listening loop is given below.
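The following Python sketch illustrates, under simplifying assumptions, how such a loop is typically structured: audio is buffered on the device in a short rolling window, a small keyword spotter scores each window for the wake word, and only after a match does audio leave the device for full (often cloud-based) recognition. The spot_wake_word function is a placeholder, not the API of any real assistant.

```python
# Illustrative sketch of an always-listening wake-word loop.
# spot_wake_word() is a placeholder for a small on-device
# keyword-spotting model; no real assistant API is shown here.
import collections
from typing import Iterable, Iterator, List

SAMPLE_RATE = 16_000            # 16 kHz mono is typical for speech models
WINDOW_SAMPLES = SAMPLE_RATE    # score one second of audio at a time

def spot_wake_word(window: List[int]) -> bool:
    """Placeholder keyword spotter: a real model would return True
    when the window sounds like 'OK Google', 'Alexa', etc."""
    return False

def always_listening(frames: Iterable[List[int]]) -> Iterator[List[int]]:
    """Consume short PCM frames from the microphone and yield the
    buffered audio each time the wake word is (thought to be) heard."""
    window = collections.deque(maxlen=WINDOW_SAMPLES)
    for frame in frames:
        window.extend(frame)                  # rolling one-second buffer
        if len(window) == WINDOW_SAMPLES and spot_wake_word(list(window)):
            yield list(window)                # only now does audio leave the device
            window.clear()                    # avoid re-triggering on the same audio
```

The issues discussed below all arise at the yield step: whenever the spotter fires, intentionally or not, recorded audio leaves the device.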
2. Issue 1: Unauthorized access

Most voice assistants require a voice command, the so-called "wake word", to initiate user interaction. These wake words differ between devices: for the Google Assistant you say the phrase "OK Google" [1, 3], while an Amazon assistant listens for "Alexa" [4]. The wake word tells the assistant that the user is ready to ask a question or give a command. As a result, anything said on the radio, on television, or during normal human dialogue can accidentally wake an assistant. On the one hand, this may seem harmless, but some voice assistants transfer the recording or its text version and other data to a cloud server in order to execute the user's command. This data can be stored on cloud servers for quite a long time and used by the voice assistant company, for example, to check the quality of speech recognition, which leads to the following risks:

1. Voice assistants can store more information than intended. They should only record voice after they hear the wake word, but they may react to similar words, to speech on TV, or wake for no reason at all.
2. Employees can gain access to personal information, because they are the ones who check the quality of the voice assistants. They can learn personal data, for example, a medical history dictated to monitor a patient's condition in a hospital, or bank card details spoken aloud during an online purchase.
3. Criminals can take advantage of the data. Like any other collected information, voice recordings stored on a cloud server are at risk of hacker attacks. They can be stolen and used, for example, to imitate the user's voice and break into devices protected by voice biometrics [8].

3. Issue 2: Anyone can control the device

It should be noted that voice assistants are designed to be at the center of the Internet of Things ecosystem. Thus, while they allow users to access the Internet and execute various commands, they can also communicate with and control all the other smart gadgets in the home. Recently, a new method was found that allows an attacker to control Apple's Siri using inaudible ultrasonic waves [9]. Ultrasonic waves are sound waves with a frequency higher than a human can hear [10]; smart gadget microphones, however, can still pick up these higher frequencies. This method can activate the voice assistant and make it use various functions of the smartphone, for example, place a phone call or transfer commands to other devices in the IoT ecosystem, all without touching the gadget, since the assistant believes it is hearing a spoken command and proceeds to act on it. For this reason it is advised not to connect the device to any IoT security solutions such as smart door locks, as hackers can use the voice assistant to instruct the lock to open the front door and enter the house. To protect against this kind of attack, voice assistant companies are encouraged to develop software that analyzes the received signal to discriminate between ultrasonic injections and genuine human voices [9]. A simplistic version of such a check is sketched below.
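As a rough illustration of that idea (and only of one simplistic interpretation of it, since the attack in [9] exploits microphone nonlinearity and real detection is more subtle), the sketch below uses NumPy to flag recordings in which a suspicious share of the signal energy lies above the band occupied by a normal human voice. The cutoff frequency and threshold are illustrative assumptions, not tuned parameters.

```python
# Simplistic spectral check inspired by the countermeasure in [9]:
# flag a recorded command if too much of its energy lies above the
# human-voice band. Cutoff and threshold values are illustrative.
import numpy as np

def looks_like_human_voice(samples: np.ndarray,
                           sample_rate: int = 48_000,
                           voice_cutoff_hz: float = 8_000.0,
                           max_high_ratio: float = 0.05) -> bool:
    """Return False when more than max_high_ratio of the signal's
    energy sits above voice_cutoff_hz, which a genuine spoken
    command would not normally exhibit."""
    spectrum = np.abs(np.fft.rfft(samples)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sample_rate)
    total_energy = spectrum.sum() + 1e-12                  # avoid division by zero
    high_energy = spectrum[freqs > voice_cutoff_hz].sum()
    return high_energy / total_energy <= max_high_ratio
```

A production filter would have to work on overlapping short frames and combine this with other liveness cues, but the basic idea of rejecting energy outside the voice band is the same.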
4. Issue 3: Misactivation of smart speakers

Researchers from Northwestern University and Imperial College London conducted experiments showing that smart speakers with embedded voice assistants can be activated while a TV series is playing and thereby spy on users [11]. The purpose of that work was to find out whether smart speakers record random sounds from their environment and, if so, how and when. The researchers also tried to identify which false wake words, which types of dialogue, and which other factors typically misactivate the voice assistants. In particular, the work answered the following questions:

1. How often do smart speakers misactivate? This is characterized by how often the smart speaker is incorrectly activated during a conversation. The more cases of incorrect activation, the higher the risk of unexpected audio recordings.
2. How long does a smart speaker record environmental sound after misactivating? A prolonged misactivation represents a higher privacy risk than a short one, as more data (such as context and conversation details) is recorded over a long period.
3. Are there certain TV shows that cause more misactivations than others, and if so, why? Each selected TV show contains different conversational characteristics (accent, context, etc.). Measuring which ones cause the most misactivations helps to understand which characteristics correspond to an increased risk.
4. Which words other than the proper wake words, such as "Hey Alexa" or "Okay Google", consistently cause misactivation? This helps to find undocumented wake words or sounds that users should avoid.

During the experiment, the researchers played 134 hours of content from the American entertainment company Netflix next to smart speakers. They selected TV series of different genres with many dialogues and observed whether phrases from those dialogues could activate the voice assistants in the Google Home Mini, Apple HomePod, Harman Kardon Invoke (with Microsoft Cortana) and two generations of the Amazon Echo Dot; each device has a different number of microphones, which can affect the accuracy of speech recognition. The researchers repeated the tests several times in order to determine which words not intended to wake the assistant regularly activated the smart speakers [11]. Table 1 shows the smart speakers tested and their characteristics.

Table 1. Smart speakers tested and their characteristics

Device                  Assistant            Wake word
Google Home Mini        Google Assistant     "OK/Hey Google"
Apple HomePod           Apple Siri           "Hey Siri"
Amazon Echo Dot         Amazon Alexa         "Alexa", "Amazon", "Echo", "Computer"
Harman Kardon Invoke    Microsoft Cortana    "Cortana" (US only)

As it turned out, an assistant can misactivate up to 19 times a day, with Siri and Cortana the most likely to misactivate and record environmental sounds. Most often, the assistants misactivated while the TV series "Gilmore Girls" and "The Office" were playing. The researchers also identified patterns by which words not intended for the assistant can activate it; for example, these turned out to be words that rhyme with the activation words (in particular, the Amazon Echo mistook the phrase "kevin's car" for "Alexa"). Table 2 provides a list of some patterns of phrases that misactivate voice assistants [11]. A toy calculation in the spirit of these measurements is shown below.
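To make the metrics behind questions 1 and 2 concrete, the following snippet (not the authors' actual measurement code) computes a misactivation rate and the share of long recordings from a hypothetical log of activation events observed during playback; the six-second boundary between "short" and "long" activations is an illustrative choice.

```python
# Toy post-processing of a misactivation log, in the spirit of the
# metrics in [11]. The event log and the 6-second boundary between
# "short" and "long" activations are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Activation:
    started_at_s: float     # playback time when the speaker woke up
    recorded_s: float       # how long it kept recording afterwards

def summarize(events: list[Activation], playback_hours: float,
              long_threshold_s: float = 6.0) -> None:
    per_day = len(events) / playback_hours * 24              # question 1: rate
    long_share = sum(e.recorded_s >= long_threshold_s
                     for e in events) / max(len(events), 1)  # question 2: duration
    print(f"{per_day:.1f} misactivations/day, "
          f"{long_share:.0%} of them {long_threshold_s:.0f}s or longer")

# Example with made-up numbers: 9 activations over ~1 hour of playback.
summarize([Activation(60.0 * i, 2.0 + i % 5) for i in range(9)], 1.05)
```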
5. Testing: Misactivation of Yandex's Alice

In this part of the work, misactivation of the Russian-language voice assistant "Alice" was researched. This assistant is activated by the wake phrases "Listen, Alice", "Alice" and "Hello, Alice" [12]. The research started with choosing consonant words and phrases to test for misactivations; in total, 25 words and phrases were chosen for dictation. Dictating these consonant words did not cause a single misactivation. The next stage involved verifying misactivations without the participation of a live speaker. For this task, a Russian-language cartoon based on the fairy tale "Alice's Adventures in Wonderland" (about 1 hour and 3 minutes long) and the first season of the TV series "The Alienist" were chosen. No misactivations were detected during playback of the series. During playback of the cartoon, the assistant was activated 9 times, and only by the word "Alice". It should be noted that the assistant was triggered by sound reproduced through loudspeakers rather than by a real human voice, which confirms that it performs no analysis of the incoming signal to distinguish reproduced audio from live speech. In addition, after each activation the voice assistant recorded the sounds of the environment together with the sound of the cartoon, which could potentially contain people's dialogues or confidential information.

Based on the results of this part of the work, it can be concluded that the voice assistant "Alice" has a good speech recognition module, since dictating consonant words does not cause misactivations. It can also be noticed that in wake mode this voice assistant often recognizes the Russian word for "fox" (lisa, which sounds close to "Alisa") as "Alice".

6. Conclusion

The fast introduction of smart voice assistants into homes, businesses and public places has raised a number of concerns among privacy advocates. While these devices offer comfortable voice interaction, their microphones always listen for the wake words. As smart speakers become more common in everyday life, there is an urgent need to understand the behavior of this ecosystem and its impact on consumers. In this work, several security vulnerabilities in smart speakers and voice assistants were reviewed. The main disadvantage of modern voice assistants is that, lacking access control based on physical presence, they can receive voice commands even when there are no people nearby.

Table 2. Some misactivation patterns among repeatable misactivations

OK/Hey Google. Pattern: words rhyming with "Hey" or "Hi" (e.g., "They" or "I"), followed by a hard "G" or something containing "ol". Examples from the subtitles: "Okay ... to go", "maybe I don't like the cold", "they're capable of", "yeah ... good weird", "hey ... you told", "A-P ... I won't hold".

Hey Siri. Pattern: words rhyming with "Hey" or "Hi" (e.g., "They" or "I"), followed by a voiceless "s"/"f"/"th" sound and an "i"/"ee" vowel. Examples: "Hey ... missy", "they ... sex, right?", "hey, Charity", "they ... secretly", "I'm sorry", "hey ... is here", "yeah. I was thinking", "Hi. Mrs. Kim", "they say ... was a sign", "hey, how you feeling".

Alexa. Pattern: sentences starting with "I" followed by a "K" or a voiceless "S". Examples: "I care about", "I messed up", "I got something", "it feels like I'm".

Echo. Pattern: words containing a vowel plus "k" or "g" sounds. Examples: "Head coach", "he was quiet", "I got", "picking", "that cool", "pickle", "Hey, Co.".

Computer. Pattern: words starting with "comp" or rhyming with "here"/"ear". Examples: "Comparisons", "I can't live here", "come here", "come onboard", "nuclear accident", "going camping", "what about here?".

Amazon. Pattern: sentences containing combinations of "was"/"as"/"goes"/"some" or "I'm" followed by "s", or words ending in "on"/"om". Examples: "it was a", "I'm sorry", "just ... you swear you won't", "I was in", "what was off", "life goes on", "have you come as", "want some water?", "he was home".

Cortana. Pattern: words containing a "K" sound closely followed by an "R" or a "T". Examples: "take a break ... take a", "lecture on", "quartet", "courtesy", "according to".
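The patterns in Table 2 all describe phrases that are phonetically close to a wake word. As a toy illustration of that closeness (real keyword spotters compare acoustic features, not spelling, so this is only an intuition pump), the snippet below scores overheard phrases against a list of wake words using a character-level similarity ratio.

```python
# Toy illustration of Table 2's "sounds-like" patterns: score phrases
# against wake words with character-level similarity. Real assistants
# match acoustic features, not spelling; this is only an intuition pump.
import difflib

WAKE_WORDS = ["alexa", "ok google", "hey siri", "echo", "computer", "cortana"]

def closest_wake_word(phrase: str) -> tuple[str, float]:
    """Return the wake word most similar to the phrase and its score in [0, 1]."""
    phrase = phrase.lower()
    scores = {w: difflib.SequenceMatcher(None, phrase, w).ratio()
              for w in WAKE_WORDS}
    best = max(scores, key=scores.get)
    return best, scores[best]

for phrase in ["kevin's car", "comparisons", "according to"]:
    print(phrase, "->", closest_wake_word(phrase))
```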
References

[1] Google Voice Assistant, https://assistant.google.com. Last accessed 10 October 2020.
[2] Apple Siri Voice Assistant, https://www.apple.com/ru/siri. Last accessed 10 October 2020.
[3] Google Nest, https://en.wikipedia.org/wiki/Google_Nest_(smart_speakers). Last accessed 12 October 2020.
[4] Amazon Echo Dot, https://www.amazon.com/Echo-Dot/dp/B07FZ8S74R. Last accessed 12 October 2020.
[5] Apple HomePod, https://www.apple.com/homepod. Last accessed 12 October 2020.
[6] Harman/Kardon Invoke, http://www.harmansound.ru/product/harman-kardon-invoke-black. Last accessed 12 October 2020.
[7] Xinyu Lei, Guan-Hua Tu, Alex X. Liu, Chi-Yu Li, Tian Xie: The Insecurity of Home Digital Voice Assistants - Amazon Alexa as a Case Study. In: 2018 IEEE Conference on Communications and Network Security (CNS).
[8] Prospects and Problems of Voice Assistants, https://blog.dti.team/voice-assistants-3/. Last accessed 18 October 2020.
[9] Qiben Yan, Kehai Liu, Qin Zhou, Hanqing Guo, Ning Zhang: SurfingAttack: Interactive Hidden Attack on Voice Assistants Using Ultrasonic Guided Waves. Computer Science & Engineering, Washington University in St. Louis.
[10] Ultrasound, https://en.wikipedia.org/wiki/Ultrasound. Last accessed 19 October 2020.
[11] Daniel J. Dubois, Roman Kolcun, Anna Maria Mandalari, Muhammad Talha Paracha, David Choffnes, Hamed Haddadi: When Speakers Are All Ears: Characterizing Misactivations of IoT Smart Speakers. In: Proceedings on Privacy Enhancing Technologies, vol. 2020(4), pp. 255-276.
[12] Yandex Alice, https://yandex.ru/alice. Last accessed 25 October 2020.