An accelerometer-based privacy attack on
smartphones
Roberto De Prisco¹, Alfredo De Santis¹ and Rocco Zaccagnino¹
¹ University of Salerno, Computer Science Department, Via Giovanni Paolo II, 132 - 84084 Fisciano (SA), Italy


Abstract
Most smartphones are equipped with an accelerometer sensor. There are numerous scenarios in which this sensor can be very useful. However, it can also represent a privacy threat. Indeed, the measurement of the device vibrations can be exploited to detect private information. The attack is favored by the fact that this specific sensor is normally not considered a “dangerous” one, and by the fact that the measurements of today’s sensors are quite accurate.
Recently, many research studies have focused on the task of inferring information from accelerometer measurements. Several settings and several final goals can be considered; in this paper we consider the specific case of recognizing words that the device itself is reproducing through its loudspeakers. A recent paper has considered this scenario and has proposed a recognizer, based on Convolutional Neural Networks, for single digits, single letters and a small set of “hot words”.
Following such a research direction, in this paper we provide an improved recognizer for single letters and digits. We performed an evaluation study to assess the effectiveness of the proposed attack. Results show that the system outperforms the previous approach. We also propose a generalization whose goal is that of recognizing entire words, or even sentences, not by means of a dictionary, but by first recognizing syllables and then locating sequences of syllables that correspond to words. We provide preliminary results in this direction.

Keywords
Mobile security, Speech privacy attack, Deep learning




1. Introduction
The idea of a smartphone, i.e., a device integrating both telephony and some computer capabilities, dates back to 1993, when IBM designed the first smartphone ever: Simon¹. Since 1993, smartphones have increasingly become an essential component of our daily life, to the point of assuming the role of interface with the rest of the world in several situations, thanks to the many communication possibilities made available. Among these, voice communication is clearly the main one. Because of this, operating systems usually restrict access to the microphone by placing its usage at the highest permission level². The search for security vulnerabilities associated with smartphones has moved over time towards other types of sensors,
Itasec21: Italian Conference on Cybersecurity, April 07–09, 2021, Online
Email: robdep@unisa.it (R. D. Prisco); ads@unisa.it (A. D. Santis); rzaccagnino@unisa.it (R. Zaccagnino)
Homepage: https://docenti.unisa.it/003550/home (R. D. Prisco); https://docenti.unisa.it/000769/home (A. D. Santis); https://docenti.unisa.it/023039/home (R. Zaccagnino)
ORCID: 0000-0003-0559-6897 (R. D. Prisco); 0000-0001-8962-1919 (A. D. Santis); 0000-0002-9089-5957 (R. Zaccagnino)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




¹ https://www.businessinsider.com/worlds-first-smartphone-simon-launched-before-iphone-2015-6
² https://developer.android.com/guide/topics/sensors/sensorsoverview.
and in particular towards motion sensors. This is probably due to the fact that such sensors are normally considered not dangerous and thus their access is generally unrestricted (for example, in Android, at least up to the current version, there is no need to ask for permission to access the accelerometer).
    The technological evolution of motion sensors integrated in smartphones has resulted in
the development of many smart applications which infer information from them to solve
tasks. For example, human activity recognition [1, 2], age group detection [3], health moni-
toring and diagnosis [4], gender recognition [5], and continuous authentication with privacy
preservation [6].
    It is easy to find in recent literature many research papers that focus on inferring private
information from “innocuous” sensors, among which the accelerometer is one of the most
studied. Several research studies exploit motion sensors for eavesdropping on keystrokes, touch
input and speech, without the requirement of system permissions, and without hacking into
the operating system for gaining access to the administrator authority [7, 8, 9, 10, 11, 12].
   Most of the works that focus on speech recognition consider the scenario in which the
speech to be recognized comes from external speakers. Although this is certainly an interesting
case, it is quite different from the case in which the speech to be recognized comes from the
measuring device itself. In this paper we focus on this latter case: the device that measures the
accelerometer signals is the same device that produces the speech to be recognized. Although
this specific case doesn’t capture the case in which a user is speaking through the device, it
captures many other situations, in particular the reproduction of vocal messages on the device.
In order to have accelerometer measurements that depend only on the sounds being reproduced
it is necessary that the device not be held in the hand of the user. We will assume that the device
is placed by itself on a table; this scenario is referred to as the table setting. The table setting
scenario doesn’t seem to have been considered before, apart from [9]. In that paper the authors focus on the recognition of single letters and digits and on a small set of hotwords, leaving the problem of recognizing entire words in real conversations almost entirely unexplored.
    In this paper we make a small step in this direction. We provide an accelerometer-based
recognizer that works in the table setting scenario. For the specific problem considered, we
designed a recognizer for single letters and digits based on a more compact deep learning model
with respect to the one provided in [9], and we make a step towards the recognition of entire,
arbitrary words. More specifically, the main contributions of the paper are the following:
    • a novel deep learning-based system that, using a simple, custom-built CNN and starting from the spectrogram representation of acceleration signals, learns to recognize the speech units, i.e., the basic components of the language which, combined, produce words (letters, digits and syllables).
    • a proposal of a general method to recognize words in speech conversations; specifically, we define a set of basic syllables (speech units) and build a recognizer with the same technique used for single letters and digits; then we try to recognize specific sequences of syllables that make up the words. We also provide a simple implementation of such a method. More clever and effective implementations are left as future work.
   The tests have been conducted on an Android smartphone. We have written an Android app to collect the accelerometer readings. Although we have used Android as the test system,
the overall method described is general and does not depend on the specific operating system
(except for the access to the accelerometer which, in Android, is allowed without requesting any
permission). Results on different systems can vary as a function of the hardware characteristics
of the devices.
   The rest of the paper is organized as follows. In Section 2, we describe relevant work in the field of speech privacy attacks on Android using motion sensors and present the setting and threat model. In Section 3, we describe the methodology followed to define the learning-based speech-units recognizer for single letters and digits. In Section 4, we discuss a generalization to words, and in Section 5 we outline better segmentation strategies. Finally, in Section 6 we provide concluding remarks with some future directions.


2. Related work and threat model
2.1. Related work
Several speech privacy attacks using motion sensors have been proposed. The discriminating
elements between the various studies are the type of motion sensors exploited and the setup in which the sensor is stimulated to collect information regarding the speech signals. An
extensive study of the accelerometers and gyroscopes response to speech signals in various
setups is proposed in [8]. The authors stimulate both sensors with human-rendered, laptop-
rendered and (external) loudspeaker-rendered speech signals traveling through the air or a solid
surface. Results show that only loudspeaker-rendered speech signals traveling through a solid
surface can create noticeable impacts on motion sensors.
   In [11] the authors proposed a study in which (i) the smartphone is placed on the same
solid surface as the external loudspeaker used to reproduce the audio, (ii) the smartphone’s
gyroscope is used to collect the surface vibrations caused by the speech signals emitted by the
loudspeaker, and (iii) the captured information was used to conduct speech recognition and speaker identification. Due to the low sensitivity of the gyroscope to the surface vibrations, and to its limited sampling rate (200 Hz), the performance for the recognition task does not achieve high success rates (65% for the speaker-dependent case and up to 26% for the speaker-independent case).
   In [12] the authors proposed a setup in which (i) the user speaks to a smartphone held in
her hand or placed on a desk, (ii) the accelerometer is used to collect speech signals traveling
through the air, and (iii) the accelerometer readings are used to conduct hot words detection
(“Okay Google” and “Hi Galaxy”). However, results show that the accelerometer may not be
able to collect sufficient information through airborne vibrations, suggesting that the speech
signals traveling through the air are unlikely to have any noticeable impact on motion sensors.
   In the works cited above, the device emitting the sound (external speaker, smartphone, computer) is different from the one (smartphone) that captures the accelerometer readings. An
interesting case is the one in which the two devices are the same; that is the smartphone is
used both to reproduce the speech signal and to record the accelerometer readings. This setting
is referred to as the table setting imagining that the device is placed by itself on a table while
performing the experiments. The table setting has been considered by [9], where the readings
coming from the accelerometer are analyzed using deep learning techniques. In [9] the au-
thors investigate a comprehensive set of factors and address them with effective preprocessing
approaches. The specific technique used in [9] transforms the accelerometer readings into
images and then uses standard approaches from computer vision to recognize the images; more
specifically, a recognizer based on DenseNet is used. The recognizer built in [9] focuses on the recognition of single letters and single digits and on words coming from a small set (namely “password”, “username”, “social”, “security”, “number”, “email”, “credit” and “card”).

2.2. Setting
We follow the case considered in [9], i.e., we consider the table setting in which the targeted
smartphone is used both to reproduce the speech signal and to record the accelerometer signals.

Threat model. We assume that the victim’s smartphone contains our SpyApp, which exploits the accelerometer to record its measurements during the reproduction of a speech while the smartphone is placed on a table. This can be the case in office or home environments, where conversations are often based on the exchange of voice messages. Thus the captured speeches include voice messages from the contacts of the victim, since the spy app on the smartphone will record data coming from any source, such as voice memos listened to by the user, location information emitted by the smartphone speaker during voice guidance, and music/video preferences, which can be analyzed with the goal of profiling the user’s listening and watching habits.
   We are not concerned here with how the SpyApp can be installed on the smartphone, nor with the fact that the victim must play the speech signal with the smartphone placed on a surface. The goal of the research is that of understanding whether, in such a situation, it is possible to infer the words from the captured measurements of the accelerometer. We first consider the basic case in which the speech units that we want to recognize are only single letters and single digits. As a second step we want to generalize the recognition to arbitrary words.
   The SpyApp continuously collects the accelerometer measurements and sends the data to a
server. On the server we implement the recognizer that we will describe in later sections.
   Figure 1 is a graphical representation of the system considered.

Implementation. For the actual implementation we used an Android smartphone, requesting the operating system to use the fastest sampling rate (SENSOR_DELAY_FASTEST). Thus the actual sampling rate is determined by the hardware. For the specific device that we used, a Samsung S8 (2017, quad-core 2.3 GHz + quad-core 1.7 GHz, Samsung Exynos 8895 chipset, 64-bit), we obtained a sampling rate of 420 Hz, which is more than enough to sample the human voice.


3. A deep learning-based speech-units recognizer
The speech-units recognizer has to solve the following recognition or classification problem:
given the accelerometer measurements of a speech unit reproduced by the smartphone, recognize
the speech unit. The recognizer we propose is based on the same methodology described in [9];
however we experimented with several possible variants and identified a solution that works
better (at least in the tests we conducted) than the one described in [9].

Figure 1: Threat model of the proposed side-channel attack.

The idea at the base of the methodology described in [9] is that of representing the accelerometer signals as images and exploiting the powerful deep learning models studied in computer vision in order to solve the classification problem for the speech units. Among the deep learning models commonly used in computer vision we have VGG [13], ResNet [14], Wide-ResNet [15], and DenseNet [16]. All of these are CNNs with different characteristics. The one used in [9] is a DenseNet, but the exact variant is not specified. All of the above CNNs are quite powerful but also computationally expensive, since they are made up of a considerable number of layers. For example, DenseNets can have 121, 169 or more layers. We propose to use a CNN with a considerably smaller number of layers, namely 12. Thus the CNN we propose to use for the speech-units recognizer is not a standard one. We call the proposed model AccCNN. In order to evaluate AccCNN, we have implemented other speech-units recognizers based on the above-cited CNNs: VGG, ResNet, Wide-ResNet and DenseNet. The last one is thus an implementation of the approach described in [9]; we remark that we used a DenseNet with 121 layers, as in [9] the number of layers used is not specified. The tests that we have conducted show that, despite being much simpler than the others, AccCNN exhibits better performance.
    In this section we provide details about the construction of the network and about the tests.

3.1. The CNNs
We have experimented with several alternative CNNs: some standard CNNs and a custom
designed CNN, which we describe in this section. In later sections we report the results of the
experiments.
3.1.1. Standard CNNs
Convolutional Neural Networks (CNNs) are multi-layer neural networks designed to recognize visual patterns directly from pixel images with minimal preprocessing. In recent years, we have witnessed the birth of numerous CNNs. Among those, we find VGG, ResNet, Wide-ResNet and DenseNet. These networks have become so deep that it is extremely difficult to visualize the entire model. Since they are standard and there are public libraries that implement them, we use them as black boxes. There exist two variants of VGG, namely VGG16 and VGG19. We have considered VGG19, which has 16 convolutional layers and 3 dense layers, for a total of 19 layers. ResNets were introduced to answer the following question: why does adding more layers to a deep neural network not improve the accuracy, but actually make it worse? Intuitively, deeper neural networks should not perform worse than shallow ones, or at least not during training, when there is no risk of overfitting. However, as the depth of the network grows this is not always true. Thanks to the innovation introduced by ResNet, we can now build networks with countless layers. Several variants have been proposed in the literature. In this work we have considered ResNet50, consisting of 50 layers. ResNets were shown to be able to scale up to thousands of layers and still improve in performance. However, each fraction of a percent of improved accuracy costs nearly a doubling of the number of layers, so training very deep residual networks suffers from diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, WideResNets have been introduced. We have used a WideResNet consisting of 40 layers. In a DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers; the feature maps are combined by concatenation. The idea is that each layer receives a “collective knowledge” from all preceding layers. We have considered a well-known DenseNet, namely DenseNet121, consisting of 121 layers.

3.1.2. AccCNN: a custom CNN
Running several tests and experimenting with various ad-hoc combinations of layers, we identified a custom CNN that, for the specific problem we are considering and at least for the preliminary tests that we have run, outperforms the standard ones. Such a CNN, which we name AccCNN, is shown in Figure 2. As we can see, the structure is not particularly complex. AccCNN first takes as input the image corresponding to the spectrogram of the accelerometer measurements, represented by a 224 × 224 × 3 matrix, and resizes it to a 32 × 32 × 3 matrix (applying a bilinear interpolation). Then, through a sequence of three Conv2D/MaxPooling2D layer pairs (with ReLU activation) and one dropout of 0.2, it produces a vector of 1024 elements, which is given as input to a Flatten layer. Such a layer is then fully connected to two Dense layers (of size 128 and 64), in turn connected to an output layer of size 51 (the number of speech units considered in our study).

Figure 2: The AccCNN used for the speech-units recognition: input image (224 × 224 × 3), resizing to 32 × 32 × 3, three pairs of Conv2D (16, 32 and 64 filters) and MaxPooling2D (2, 2) layers, dropout 0.2, Flatten (1024), two Dense layers, and an output layer with classes 0, …, 50.
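As a concrete reference, the following is a minimal Keras sketch of an architecture matching the description above; the paper does not specify the framework, kernel sizes, padding, or training hyper-parameters, so those details are our assumptions.

```python
from tensorflow.keras import layers, models

def build_acccnn(num_classes=51):
    """Sketch of the AccCNN architecture described in the text.
    With 'same' padding, the three Conv/MaxPooling pairs shrink the
    32x32x3 resized input down to 4x4x64 = 1024 values."""
    return models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        layers.Resizing(32, 32, interpolation="bilinear"),  # bilinear downscaling
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),                         # -> 16 x 16 x 16
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),                         # -> 8 x 8 x 32
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),                         # -> 4 x 4 x 64
        layers.Dropout(0.2),
        layers.Flatten(),                                    # 1024-element vector
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),     # 51 speech units
    ])

model = build_acccnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```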

3.2. Data collection
To collect the accelerometer signal measurements used to train the deep learning models, we have used a Samsung S8 smartphone. We wrote the SpyApp that records the accelerometer measurements. We have used the “table setting”, that is, during the experiments the smartphone is placed on a table. In this setting the accelerometer is solicited by the audio signal played through the loudspeaker of the device. As observed also in [9], acceleration signals collected in this setting show a strong audio response along all axes. The speech units that we considered are the 10 digits plus the 21 letters of the Italian alphabet (Table 1).

                            Speech units                    Count
                            Digits                             10
                            Letters (Italian alphabet)         21
Table 1
The 31 speech units.

  For each speech unit we have collected samples, that is accelerometer measurements during
the reproduction of the speech unit. In total, we have collected 1200 samples for each speech
unit. Since the number of speech units that we considered is 31, the total number of samples
collected is 1200 × 31 = 37200.

3.3. Pre-processing
The goal of this phase is to find a representation of the accelerometer measurements that can be effectively learned by deep learning models. Given the accelerometer measurements in a time interval, they are transformed into a spectrogram. The spectrogram representation reflects the multi-scale information of a signal in the frequency domain, and it has proved to be useful for widely adopted models in computer vision. The following describes the details of the transformation of the raw acceleration measurements into spectrograms.
         1. Interpolation: in order to generate acceleration signals with a fixed sampling rate of 1000
            Hz, (i) first, we used linear interpolation to deal with unstable intervals of accelerometer
            measurements, (ii) then, we upsampled the accelerometer measurements to 1000 Hz,
            and (iii) finally, we used timestamps to locate all time points that have no accelerometer
            measurement and used linear interpolation to fill in the missing data.
   2. High-pass filtering: A high-pass filter has been used to eliminate (possible) significant
      distortions in the signals, and to obtain filtered signals mainly consisting of the target
      speech information and the self-noise of the accelerometer; specifically, we first convert
      the acceleration signal along each axis to the frequency domain using the Short-Time
      Fourier Transform (STFT), which divides the long signal into equal-length segments and
      calculates the Fourier transform on each segment separately; we then set the coefficients
       of all frequency components below the cut-off frequency (set to 80 Hz to cover adult male and female voice frequencies and to minimize the impact of noise components) to zero
      and convert the signal back to the time domain using inverse STFT.
   3. Signal-to-spectrogram: since we have acceleration signals along three axes, three spectrograms can be obtained for each speech-unit signal; to this end, we first divide the signal into multiple short segments with a fixed overlap (as proposed in [9], we used 128 and 120 as signal and overlap lengths respectively); we then window each segment with a Hamming window and calculate its spectrum through STFT; the signal along each axis is now converted into an STFT matrix that records the magnitude and phase for each time and frequency. Finally, the 2D spectrogram can be calculated as spect(s) = |STFT(s)|², where s and |STFT(s)|² represent, respectively, a single-axis acceleration signal and the squared magnitude of its corresponding STFT matrix.
   4. Spectrogram-to-image: to feed the spectrograms into the deep learning models chosen for our experiments, we convert the three 2-D spectrograms of a signal into one RGB image in PNG format; to this end, (i) we fit the three 𝑚 × 𝑛 spectrograms into one 𝑚 × 𝑛 × 3 tensor, (ii) we take the square root of all the elements in the tensor and map the obtained values to integers between 0 and 255 (taking the square root compresses the dynamic range, so that the quantization does not cause considerable information loss), (iii) we export the 𝑚 × 𝑛 × 3 tensor as an image in PNG format, (iv) the spectrogram images are cropped to the frequency range from 80 Hz to 300 Hz in order to reduce the impact of self-noise; (v) finally, to feed those images into standardized computer vision models, we resize them into 𝑛 × 𝑛 × 3 images (to preserve sufficient information, usually 𝑛 = 224). A sketch of these preprocessing steps is given below.
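The following is a minimal Python sketch of this preprocessing pipeline (interpolation, high-pass filtering via STFT, spectrogram computation, and export to an RGB image); the parameter values follow the text, while function names and the normalization details are our assumptions.

```python
import numpy as np
from scipy import signal
from PIL import Image

FS = 1000                     # sampling rate after interpolation (Hz)
CUTOFF = 80                   # high-pass cut-off frequency (Hz)
NPERSEG, NOVERLAP = 128, 120  # segment and overlap lengths used in the paper

def resample_to_1khz(timestamps, values):
    """Step 1: linear interpolation of the raw readings onto a uniform 1000 Hz grid."""
    t_uniform = np.arange(timestamps[0], timestamps[-1], 1.0 / FS)
    return np.interp(t_uniform, timestamps, values)

def highpass(x):
    """Step 2: zero the STFT coefficients below the cut-off and go back to the time domain."""
    f, _, Z = signal.stft(x, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    Z[f < CUTOFF, :] = 0
    _, x_filt = signal.istft(Z, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    return x_filt

def spectrogram(x):
    """Step 3: spect(s) = |STFT(s)|^2, computed with a Hamming window."""
    f, _, Z = signal.stft(x, fs=FS, window="hamming", nperseg=NPERSEG, noverlap=NOVERLAP)
    return f, np.abs(Z) ** 2

def to_rgb_image(acc_xyz, out_path="speech_unit.png"):
    """Step 4: stack the three axis spectrograms, take the square root, quantize
    to 0-255, crop to the 80-300 Hz band and resize to 224 x 224 x 3."""
    specs, freqs = [], None
    for axis in range(3):
        freqs, s = spectrogram(highpass(acc_xyz[:, axis]))
        specs.append(s)
    tensor = np.sqrt(np.stack(specs, axis=-1))               # m x n x 3
    tensor = (255 * tensor / tensor.max()).astype(np.uint8)  # map values to 0..255
    band = (freqs >= 80) & (freqs <= 300)
    img = Image.fromarray(tensor[band, :, :]).resize((224, 224), Image.BILINEAR)
    img.save(out_path)
    return img
```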

3.4. Training, validation, and testing
Each of the models we have considered has been trained, validated, and tested as follows. As a first step, using stratification on the set of labels, we partitioned the dataset of 37200 images into two subsets (a sketch of this split is given after the list):
   1. a training set with 80% of the images
   2. a testing set, with 20% of the images.
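A minimal sketch of this split, assuming the spectrogram images and their labels are available as arrays images and labels (illustrative names); the further carving of a validation portion out of the training set is our assumption, since the text does not detail how validation data were obtained.

```python
from sklearn.model_selection import train_test_split

# Stratified 80/20 split of the 37200 spectrogram images over the 31 classes.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=42)

# Assumed: a further stratified split of the training portion for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.10, stratify=y_train, random_state=42)
```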
   Tables 2 and 3 show the results of the validation and testing experiments. The ad-hoc network that we propose, AccCNN, outperforms the other models with respect to all metrics, behaving slightly better than the DenseNet used in [9] and better than the other standard models. Specifically, in the validation phase the values achieved for accuracy, precision, recall and f-score are, respectively, 0.94, 0.91, 0.91 and 0.91, very close to and slightly better than those of the model based on a DenseNet, while in the testing phase the values are 0.89, 0.88, 0.86, 0.86, again with a slight improvement over the DenseNet. The other models show worse performances, as can be seen from the tables.
   It is worth noting that the improvement in the performance of AccCNN becomes more evident in the testing phase. Recalling that AccCNN has a much simpler structure (only 12 layers), this leads to the following observation: (at least) for the specific problem that we are considering and for the training set used in the training phase, a model with a simpler structure, such as AccCNN, can achieve a higher generalization capacity on the testing set. The reason why AccCNN works better than the others lies in the over-parametrization issues affecting DNNs. Often, the choice of very complex models does not necessarily ensure better performance. Models with many parameters, such as pre-trained CNNs, have a high capability to fit the noise at the expense of a lower generalization capacity. This is especially evident when the representations used for the samples are not sophisticated enough. In general, more complex models would probably have needed more data or somewhat more sophisticated representations.

                           Model         Accuracy Precision Recall F-score
                           VGG           0.87         0.83         0.83        0.85
                           ResNet        0.87         0.85         0.84        0.84
                           WideResNet 0.87            0.84         0.84        0.86
                           DenseNet      0.92         0.91         0.89        0.91
                           AccCNN        0.94         0.91         0.91        0.93

                          Table 2
                          Performance in the validation phase: letters + digits.



                           Model         Accuracy Precision Recall F-score
                           VGG           0.75         0.73         0.73        0.70
                           ResNet        0.75         0.70         0.70        0.71
                           WideResNet 0.79            0.77         0.76        0.78
                           DenseNet      0.86         0.83         0.86        0.86
                           AccCNN        0.89         0.88         0.86        0.86

                          Table 3
                          Performance in the testing phase: letters + digits.




4. Generalization
The speech units recognizer that we have described in the previous section has the specific goal
of identifying single digits or single letters. Clearly it is desirable to generalize the recognizing
capabilities of the system to words or even sentences. Just as clearly, the task is not easy. In [9], besides the single digits and the single letters, a small set of “hot” keywords has also been considered, namely “password”, “username”, “social”, “security”, “number”, “email”, “credit” and “card”. Adding new words requires re-building the model, and reaching a rich enough set can be quite difficult. Instead of targeting a specific set of words we propose an alternative approach: build a speech-units recognizer for the syllables and then use segmentation techniques to identify sequences of syllables corresponding to actual words. As we can see in Figure 3, given the accelerometer measurements corresponding to a sequence of words pronounced during a conversation, the proposed strategy consists of the following steps: (i) a segmentation technique is applied to extract the measurements corresponding to the syllables composing the words, (ii) each extracted measurement is given as input to AccCNN, (iii) the recognized syllables are assembled to reconstruct the original words.

Figure 3: The steps of the proposed approach for recognizing entire words.
   As a first step we have considered a set of “dummy” syllables: all the ones that can be obtained by appending a vowel to the consonants b, d, r and s, namely: {ba, be, bi, bo, bu, da, de, di, do, du, ra, re, ri, ro, ru, sa, se, si, so, su}. Of course this means that also the “words” that we will consider are “dummy” words (although a few meaningful words can be constructed with the syllables that we are considering). We have made this simplifying assumption in order to have a small set of similar syllables and to understand whether the approach can actually work. It goes without saying that the approach needs to be expanded to consider the set of all possible (real) syllables.
   Having established the set of syllables, the next step is that of taking an entire conversation and identifying the syllables used, in order to check for specific sequences of consecutive syllables that make up words. In order to do so we need to face the problem of “segmenting” the entire conversation into pieces that correspond to the syllables. We explore a simple approach: dividing the entire conversation into small pieces, each one corresponding to a syllable. Syllables have different lengths, so it is not clear how the conversation should be split into pieces. We tried a very simple approach: use segments of the same fixed length, trying 5 different lengths (0.50, 0.55, 0.60, 0.65, 0.70 seconds). A sketch of this fixed-length segmentation and recognition procedure is given below.
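The following is a minimal sketch of this naive pipeline, assuming a trained AccCNN model, a list class_names holding the label of each output class, and a function acc_to_image implementing the preprocessing of Section 3.3 (all three names are illustrative); the 420 Hz rate is that of our test device.

```python
import numpy as np

FS_ACC = 420  # accelerometer sampling rate of the test device (Hz)

def recognize_sentence(acc_xyz, model, class_names, seg_seconds=0.55):
    """Cut the recording into equal-length segments, turn each segment into a
    spectrogram image (acc_to_image, assumed to implement Section 3.3) and
    classify it with AccCNN; the predicted syllable sequence can then be
    matched against the sequences of syllables that make up the target words."""
    seg_len = int(seg_seconds * FS_ACC)
    syllables = []
    for start in range(0, len(acc_xyz) - seg_len + 1, seg_len):
        segment = acc_xyz[start:start + seg_len]
        image = acc_to_image(segment)                       # (224, 224, 3) array
        probs = model.predict(image[np.newaxis, ...], verbose=0)[0]
        syllables.append(class_names[int(np.argmax(probs))])
    return syllables    # e.g. ['do', 'da', 'ba', 'be', ...]
```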
   In order to consider the syllables we have to train the network on the syllables. So we repeated
the training, validation and testing phases described in Section 3 considering the chosen 20
syllables instead of the digits and the letters of Table 1.
   Tables 4 and 5 show the results of the validation and testing experiments. The results are
similar to those obtained for the digits and letters (Tables 2 and 3). It is possible to notice that in
this case the performance of AccCNN and DenseNet are almost the same.

                           Model         Accuracy Precision Recall F-score
                           VGG           0.88        0.84         0.85      0.85
                           ResNet        0.88        0.84         0.86      0.86
                           WideResNet 0.88           0.87         0.86      0.87
                           DenseNet      0.93        0.91         0.91      0.90
                           AccCNN        0.95        0.91         0.91      0.92

                          Table 4
                          Performance in the validation phase: syllables.



                           Model         Accuracy Precision Recall F-score
                           VGG           0.83        0.73         0.76      0.76
                           ResNet        0.85        0.77         0.75      0.76
                           WideResNet 0.86           0.80         0.81      0.82
                           DenseNet      0.90        0.87         0.87      0.88
                           AccCNN        0.90        0.88         0.87      0.88

                          Table 5
                          Performance in the testing phase: syllables.


   To test the recognizer we have used a set of 100 “sentences” of varying length, from 5 to 60 seconds (roughly 8 for each length). Each sentence is simply a sequence of (dummy) words built with the dummy syllables, using 2, 3 or 4 syllables per word. An example of a 5-second “sentence” is dodababe dore babesa, and an example of a 25-second sentence is babada direro doredo disa sasasesa bubiduda da da sese babababi bibi suso sasasasa siredomi dada.
   Since to use the recognizer we need to segment the sentences, we have to decide the length of
the segments. To do so, we analyzed the lengths of the 24000 samples of the speech units (1200
per each of the 20 speech units): they range (roughly) from 0.5 to 0.7 seconds. Thus we tried to
segment the sentences with the following values: 0.50, 0.55, 0.60, 0.65 and 0.70 seconds.
   Figure 4 shows the results in terms of the percentage of recognized words as a function of the length of the sentence and of the length of the segments. The recognizer seems to perform badly with very short sentences (of 5 and 10 seconds) and better with longer ones (15 to 45 seconds); the performance tends to degrade for very long sentences (50 to 60 seconds). Moreover, the segmentation with segments of 0.55 seconds seems to be the one that works best. Overall, the percentage of words recognized is low, always less than 35%. This is probably due to the fixed segmentation, which does not correctly capture the single syllables. A more clever and sophisticated approach needs to be used.

Figure 4: Percentage of recognized words as a function of the length of the sentences (from 5s to 60s) and, for each length, as a function of the length of the segments (from 0.50s to 0.70s).


5. Better segmentation strategies
The segmentation approach presented in the previous section is quite naive; it would be interesting to design a more “intelligent” segmentation technique. A first step in this direction could be that of (i) defining a metric for measuring how good a segmentation is, i.e., how much it consists of segments that can be “effectively” classified by AccCNN, and then (ii) showing that this metric is correlated to the number of words correctly recognized in the conversation.

Assessing the goodness of the segmentation. In order to define such a metric, one could exploit the fact that AccCNN computes the probability that an input image “belongs” to each of the 51 classes (speech units). As a step in this direction we performed a preliminary analysis, in which we analyzed these probabilities on the images in both the training and testing sets defined in Section 3.4, and on images extracted at random from recorded conversations. As a result, we observed that images corresponding exactly to speech units are always characterized by only one probability close to 1 (corresponding to the speech unit in which the image will be classified by AccCNN), while the others are close to 0. Conversely, in the case of images that do not correspond precisely to speech units, such a distinction is not so evident. This observation can help in defining a measure for the goodness of a segmentation. Given a segmentation (list of segments): (i) for each segment one can consider the list of probabilities of belonging to each class (51 values) and compute the “value” of such a segment, i.e., the difference between the highest value and the average of the remaining values; then (ii) it is possible to compute the “value” of the entire segmentation as the average of the values of its segments. A sketch of this candidate metric follows.
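A minimal sketch of this candidate metric, assuming the per-segment class probabilities have already been obtained from AccCNN:

```python
import numpy as np

def segment_value(probs):
    """'Value' of a single segment: the highest class probability minus the
    average of the remaining ones (close to 1 for clean speech units)."""
    probs = np.sort(np.asarray(probs))[::-1]
    return probs[0] - probs[1:].mean()

def segmentation_value(prob_matrix):
    """'Value' of a whole segmentation: the average of its segment values.
    prob_matrix holds one row of 51 class probabilities per segment."""
    return float(np.mean([segment_value(p) for p in prob_matrix]))
```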
Statistical relation with the capability of recognizing words. The next step that we plan to take is that of studying the correlation between the segmentation quality metric and the number of words correctly recognized in speech conversations. In order to do so we plan to do the following: (i) use the table setting described in Section 3.2 to collect the acceleration signal measurements corresponding to a number of conversations (combinations of speech units) of variable length; (ii) for each conversation generate random segmentations; (iii) give each segmentation obtained as input to AccCNN (one segment at a time) and then count the number of words correctly recognized; (iv) for each segmentation compute the segmentation quality and the number of words correctly recognized; (v) study the correlation between these two distributions of data by using the Shapiro-Wilk goodness-of-fit test [17] to assess the normality of the data (non-normality of the distributions would lead us to apply the well-known non-parametric Spearman’s rho test). A sketch of this statistical check is given after this paragraph.
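A minimal sketch of step (v), assuming the segmentation-quality values and the counts of correctly recognized words are available as two equal-length sequences; the use of Pearson's correlation when both distributions pass the normality check is our assumption.

```python
from scipy import stats

def correlate_quality_with_recognition(qualities, recognized_counts, alpha=0.05):
    """Shapiro-Wilk normality check on both distributions; fall back to the
    non-parametric Spearman's rho when either of them is not normal."""
    _, p_q = stats.shapiro(qualities)
    _, p_r = stats.shapiro(recognized_counts)
    if p_q < alpha or p_r < alpha:
        rho, p = stats.spearmanr(qualities, recognized_counts)
        return "spearman", rho, p
    r, p = stats.pearsonr(qualities, recognized_counts)  # assumed for normal data
    return "pearson", r, p
```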

Strategies to find good segmentations. Once we have defined the above-cited metric and have shown that it is actually a good indicator for identifying the segmentations that allow us to obtain the best recognition of the words, we plan to exploit it to define algorithms that identify such segmentations. We believe that genetic algorithms could be effective.
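Purely as an illustration of this last idea, the following toy sketch evolves candidate segmentations (lists of segment lengths, in samples) using the segmentation value defined above as the fitness function; the population size, mutation step and segment-length bounds are arbitrary assumptions.

```python
import random

def random_segmentation(total_len, min_seg=210, max_seg=294):
    """A random list of segment lengths covering the recording
    (210-294 samples correspond to roughly 0.5-0.7 s at 420 Hz)."""
    cuts, pos = [], 0
    while pos < total_len:
        seg = random.randint(min_seg, max_seg)
        cuts.append(min(seg, total_len - pos))
        pos += seg
    return cuts

def mutate(cuts, delta=20):
    """Perturb one segment length by at most `delta` samples."""
    out = cuts[:]
    i = random.randrange(len(out))
    out[i] = max(1, out[i] + random.randint(-delta, delta))
    return out

def evolve(total_len, fitness, pop_size=30, generations=50):
    """Toy genetic search: keep the best half, refill with mutated copies.
    `fitness` maps a segmentation to its segmentation value (Section 5)."""
    pop = [random_segmentation(total_len) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        pop = pop[:pop_size // 2]
        pop += [mutate(random.choice(pop)) for _ in range(pop_size - len(pop))]
    return max(pop, key=fitness)
```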


6. Conclusion
In this paper we have tackled the problem of inferring private information exploiting the
accelerometer of a smartphone by measuring the vibrations caused by speeches reproduced on
the device itself. We have designed an approach based on deep-learning methods and assessed
its behavior through experimental data. The system is designed for recognizing single letters
and single digits. We have also designed a generalization to words based on the recognition of
the syllables providing preliminary results. Although the applicability of the proposed system
seems restricted, and the generalization technique is still embryonic, the results obtained are
interesting and suggest future directions to follow in order to improve the effectiveness of
the attack. As future work we plan to study in more detail the proposed generalization to words, as explained in Section 5. The study presented in this paper uses only a small set of dummy syllables. It would be interesting to expand this set to include all the syllables of a given language and then try to recognize words in real conversations. The currently proposed approach uses quite a straightforward (and not very clever) method to segment a long conversation into units that correspond to syllables. It would be interesting to study better ways of performing the segmentation, for example using dynamic approaches that could adapt to the syllables that have been recognized.


References
 [1] Y. Chen, C. Shen, Performance analysis of smartphone-sensor behavior for human activity
     recognition, IEEE Access 5 (2017) 3095–3110.
 [2] C. Shen, Y. Chen, G. Yang, On motion-sensor behavior analysis for human-activity recognition via smartphones, in: 2016 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA), IEEE, 2016, pp. 1–6.
 [3] E. Davarci, B. Soysal, I. Erguler, S. O. Aydin, O. Dincer, E. Anarim, Age group detection
     using smartphone motion sensors, in: 2017 25th European Signal Processing Conference
     (EUSIPCO), IEEE, 2017, pp. 2201–2205.
 [4] S. Majumder, M. J. Deen, Smartphone sensors for health monitoring and diagnosis, Sensors
     19 (2019). URL: https://www.mdpi.com/1424-8220/19/9/2164. doi:10.3390/s19092164.
 [5] A. Sharshar, A. Fayez, Y. Ashraf, W. Gomaa, Activity with gender recognition using
     accelerometer and gyroscope, in: 2021 15th International Conference on Ubiquitous
     Information Management and Communication (IMCOM), IEEE, 2021, pp. 1–7.
 [6] L. Hernández-Álvarez, J. María de Fuentes, L. González-Manzano, L. H. Encinas, Smartcampp – smartphone-based continuous authentication leveraging motion sensors with privacy preservation, Pattern Recognition Letters (2021).
 [7] A. Al-Haiqi, M. Ismail, R. Nordin, On the best sensor for keystrokes inference attack on Android, Procedia Technology 11 (2013) 989–995. doi:10.1016/j.protcy.2013.12.285, 4th International Conference on Electrical Engineering and Informatics, ICEEI 2013.
 [8] S. A. Anand, N. Saxena, Speechless: Analyzing the threat to speech privacy from smart-
     phone motion sensors, in: 2018 IEEE Symposium on Security and Privacy (SP), IEEE, 2018,
     pp. 1000–1017.
 [9] Z. Ba, T. Zheng, X. Zhang, Z. Qin, B. Li, X. Liu, K. Ren, Learning-based practical smartphone
     eavesdropping with built-in accelerometer, in: Proceedings of the Network and Distributed
     Systems Security (NDSS) Symposium, 2020, pp. 23–26.
[10] P. Marquardt, A. Verma, H. Carter, P. Traynor, (sp) iphone: Decoding vibrations from
     nearby keyboards using mobile phone accelerometers, in: Proceedings of the 18th ACM
     conference on Computer and communications security, 2011, pp. 551–562.
[11] Y. Michalevsky, D. Boneh, G. Nakibly, Gyrophone: Recognizing speech from gyroscope
     signals, SEC’14, USENIX Association, USA, 2014, p. 1053–1067.
[12] L. Zhang, P. H. Pathak, M. Wu, Y. Zhao, P. Mohapatra, Accelword: Energy efficient
     hotword detection through accelerometer, in: Proceedings of the 13th Annual International
     Conference on Mobile Systems, Applications, and Services, 2015, pp. 301–315.
[13] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image
     recognition, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning
     Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, 2015.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016
     IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[15] S. Zagoruyko, N. Komodakis, Wide residual networks, in: E. R. Hancock, R. C. Wilson, W. A. P. Smith (Eds.), Proceedings of the British Machine Vision Conference (BMVC), BMVA Press, 2016, pp. 87.1–87.12.
[16] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolu-
     tional networks, in: Proceedings of the IEEE conference on computer vision and pattern
     recognition, 2017, pp. 4700–4708.
[17] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (complete samples),
     Biometrika 52 (1965) 591–611.