<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>RTA-CSIT 2021: 4th International Conference Recent Trends and Applications In Computer Science And Information Technology</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Italian sign language alphabet recognition from surface EMG and IMU sensors with a deep neural network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Sernani</string-name>
          <email>p.sernani@univpm.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iacopo Pacifici</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Falcionelli</string-name>
          <email>n.falcionelli@pm.univpm.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Selene Tomassini</string-name>
          <email>s.tomassini@pm.univpm.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aldo Franco Dragoni</string-name>
          <email>a.f.dragoni@univpm.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Engineering Department, Università Politecnica delle Marche</institution>
          ,
          <addr-line>Via Brecce Bianche 12, 60131 Ancona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RTA-CSIT 2021: 4th International Conference Recent Trends and Applications In Computer Science And Information Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1746</volume>
      <fpage>105</fpage>
      <lpage>114</lpage>
      <abstract>
        <p>The use of surface electromyography (EMG) and Inertial Measurement Unit (IMU) data emerged as a possible alternative to computer vision-based gesture recognition. As a consequence, the convenience of using such data in the automatic recognition of sign languages, a natural application of gesture recognition, has been investigated in scientific literature. Most of the methodologies and evaluations are based on traditional machine learning techniques, such as SVMs, relying on selected handcrafted features. Instead, leveraging on the findings about deep Long Short Term Memory (LSTM) architectures to process time series, we propose a deep LSTM-based neural network for the recognition of the Italian Sign Language alphabet with surface EMG and IMU data. To preliminarily validate our methodology, we collected a dataset recording gesture samples with the Myo Gesture Control Armband. We obtained a 97% accuracy on the proposed dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Sign Language Recognition</kwd>
        <kwd>Bidirectional LSTM</kwd>
        <kwd>Long Short Term Memory</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Surface Electromyography</kwd>
        <kwd>EMG</kwd>
        <kwd>Inertial Measurement Unit</kwd>
        <kwd>IMU</kwd>
        <kwd>Italian Sign Language</kwd>
        <kwd>LIS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>mode of non-verbal communication with
computer interfaces [2]. The possible applications
In the last three decades, automatic gesture are countless, including touchless interaction
recognition has been investigated in many with smart objects [3], rehabilitation and
perapplications domains. In fact, hand gestures sonal health systems [4, 5], human-robot
colare recognized as a natural, ubiquitous and laboration [6], interaction with smart home
meaningful part of communicating [1]. There- reasoning systems [7, 8], and many others.
fore, extensive research has been devoted to
making hand gestures a natural and efective</p>
      <sec id="sec-1-1">
        <title>Obviously, the automatic recognition of</title>
        <p>sign language gestures is an eminent
application field for the advancements in gesture
recognition. To this end, the earliest researches
in computer vision [9] evolved with the use of
depth sensors, such as those of the Microsoft
Kinect [10] and Leap Motion [11]. An
alternative methodology is emerging in recent years:
the use of wearable devices with surface
electromyography (EMG) and Inertial
Measurement Unit (IMU) sensors [12]. Using EMG
and IMU sensors has the disadvantage of
forcing a user to wear the device (on both hands,
for complex gestures). However, it does not work, explaining the setup of the experiments
require a fixed camera which might be vul- and presenting the results. Finally, Section 5
nerable to varying lighting conditions, in ad- draws the conclusions of this research work.
dition to having a limited range of vision and
causing privacy issues.</p>
        <p>In this regard, we present a deep learning 2. Related Works
methodology for the recognition of the Italian
Sign Language (LIS) alphabet using EMG and
IMU data. Specifically, this paper adds the
following contributions to the state of the art
about sign language gesture recognition:
The use of EMG and IMU data for the
recognition of sign language gestures has been
validated by several studies. For example, Savur
and Sahin [15] got 91% accuracy on the
American Sign Language (ASL) alphabet, using a
• we propose a deep neural network ar- Support Vector Machine (SVM) classifier. Wu
chitecture to classify the EMG and IMU et al. [12] proposed the design of a wearable
data corresponding to the 26 letters of device and a feature selection method to
colthe LIS alphabet. We based our network lect EMG and IMU data for the recognition
on the bidirectional Long Short Term of gestures. They validated their proposal on
Memory (LSTM) architecture, as it has the ASL gestures, getting a top accuracy of
been already proven useful to process 96% with a comparison of traditional machine
time series, e.g. in speech [13] and ges- learning approaches (Nearest Neighbor, Naive
ture recognition [14]; Bayes, Decision Tree, and SVM). In [16] Abreu
et al. evaluated the use of the Myo Armband
• we propose a dataset with 30 gesture for the Brazilian Sign Language alphabet by
samples for each letter of the LIS alpha- defining 20 SVM binary classifiers to
recogbet, collected to preliminary evaluate nize 20 letters, in a one-vs-all strategy.
Simour approach. Each sample includes ilarly to these works, we use EMG and IMU
the data from the 8 EMG sensors and data (from the Myo Armband) to recognize
the IMU of the Myo Gesture Control the letters of the LIS alphabet. However,
inArmband, a commercial wearable de- stead of relying on traditional machine
learnvice designed to collect EMG signals ing methods and feature selection, we propose
and IMU data when moving the hand a deep neural network, leveraging on a deep
and the arm. architecture to learn the gesture
representaTo guarantee the reproducibility of our ap- tion which allows the classification.
proach, as well as encourage further develop- Recurrent Neural Networks, in particular
ments of the research in this field, the experi- those based on the LSTM and bidirectional
ments and the dataset are publicly available LSTM architectures, have been validated for
in two dedicated GitHub repositories. representing and classifying complex
sequen</p>
        <p>The rest of the paper is structured as fol- tial data simultaneously, such as in
modellows. Section 2 lists some studies related to ing human gesture structure and temporal
the presented research. Section 3 explains the dynamics [14]. Some research works are
preproposed approach, with the necessary back- senting LSTM-based architectures for sign
ground about the LSTM architecture, and de- language recognition. For example, Liu et
scribes the dataset collected to evaluate our al. [17] propose to use the LSTM
architecmethod. Section 4 discusses a preliminary ture to perform recognition by analyzing the
experimental evaluation of our neural net- trajectory of skeleton joints provided by the</p>
      </sec>
      <sec id="sec-1-2">
        <title>Microsoft Kinect; Guo et al. [18] combine a</title>
        <p>3D Convolutional Neural Network with the
LSTM to classify gestures from videos, in a
transfer-learning approach; Mittal et al.
design a LSTM-based architecture to recognize
words and sentences of the Indian Sign
Language from Leap Motion data [19]. Similarly
to these works, we also based our system on
the LSTM architecture, but we rely on EMG
and IMU data, instead of visual data. In the
need of data to train our method, we
synthetically augmented our dataset to preliminary
validate our method, using data
augmentation also to add intra-class variation in our
samples and prevent overfitting.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Materials and Methods</title>
Recurrent Neural Networks (RNN) use recurrent connections to model the flow of time in a sequence of data [20], and are therefore particularly suited to work with time series.</p>
      <p>[Figure 1: An LSTM unit, with the input/output of the memory cell regulated by the input, output, and forget gates.]</p>
      <p>LSTMs are a type of RNN capable of learning long-time dependencies in the data. As we want to recognize gestures from a sequence of time-ordered EMG and IMU data, our system is based on the LSTM architecture. Moreover, we also collected a dataset to test the accuracy of the proposed system in the recognition of the LIS gestures.</p>
      <sec id="sec-2-1">
        <title>3.1. LSTM and Bidirectional LSTM</title>
        <p>LSTM is a well-known RNN architecture, proposed by Hochreiter and Schmidhuber [21]. As shown in Figure 1, the basic hidden unit of a LSTM network is composed of a self-recurrent cell, called memory cell, whose input/output is regulated by three multiplicative gates, i.e. the input gate, the output gate, and the forget gate. A LSTM layer is composed by a series of such units, and the network interacts with the memory cells only by using the gates.</p>
        <p>As pointed out in [13], the output $h_t$ at time point $t$ of an LSTM hidden unit is regulated by the following equations:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
where $i_t$, $f_t$, $o_t$, and $c_t$ are the activation vectors of the input gate, forget gate, output gate, and memory cell at time point $t$, $\sigma$ is the sigmoid function, $b$ denotes the bias of each gate/cell, and the weight matrices connecting the memory cell to the gates ($W_{ci}$, $W_{cf}$, $W_{co}$) are diagonal. The output vector $y_t$ at time point $t$ of a hidden layer is therefore given by:
$$y_t = W_{hy} h_t + b_y$$
where $W_{hy}$ is the weight matrix and $b_y$ the bias vector.</p>
        <p>Traditional LSTMs, as RNNs in general, process input data in ascending temporal order. Therefore, their outputs are mostly based on previous context. However, when data is processed at once, as it might happen with the classification of gestures, the recognition of a pattern might be more effective with the use of future context as well. To this end, Bidirectional RNNs [22] and, specifically, Bidirectional LSTMs [20] have been proposed. The basic idea of such models is to present the training sequences both forwards and backwards, using two separate recurrent nets, which are connected to the same output layer.</p>
        <p>Therefore, we based our deep neural network on the Bidirectional LSTM architecture, as the gestures are processed once completed, taking advantage of both previous and future context.</p>
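        <p>As a concrete illustration, the following NumPy sketch implements one time step of such an LSTM unit, directly following the equations above (with the diagonal peephole weights implemented as element-wise products). It is a minimal sketch for illustration only: the weight initialization, the dimensions, and the dummy input sequence are hypothetical, not the values used by our network.</p>
        <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM step as in Section 3.1: input (i), forget (f), output (o)
    # gates and memory cell (c); W holds the input (x*), recurrent (h*),
    # and diagonal peephole (c*) weights, b the biases.
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c_t + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_h = 14, 64  # hypothetical sizes, e.g. our 14-dimensional input vectors
W = {k: 0.1 * rng.standard_normal((n_h, n_in)) for k in ("xi", "xf", "xc", "xo")}
W.update({k: 0.1 * rng.standard_normal((n_h, n_h)) for k in ("hi", "hf", "hc", "ho")})
W.update({k: 0.1 * rng.standard_normal(n_h) for k in ("ci", "cf", "co")})
b = {k: np.zeros(n_h) for k in ("i", "f", "c", "o")}
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.standard_normal((400, n_in)):  # a dummy 400-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
</preformat>
        <p>A bidirectional layer simply runs two such recurrences, one over the sequence in ascending order and one in descending order, and concatenates the two resulting hidden state sequences before the output layer.</p>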
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Proposed Dataset</title>
        <p>To evaluate the proposed architecture, we developed a dataset including all the 26 gestures of the LIS alphabet. Most of the letters of the alphabet are represented with static gestures, while the “G”, “H”, and “Z” are performed by moving the hand as well. We recorded 30 samples for each letter, building a dataset composed of 780 samples. The dataset is publicly available as a GitHub repository (https://github.com/airtlab/An-EMG-and-IMU-Dataset-for-the-Italian-Sign-Language-Alphabet).</p>
        <p>All the collected gestures were performed by the same person (male, 24 years old) wearing a Myo Gesture Control Armband (https://web.archive.org/web/20200528111822/https://support.getmyo.com/hc/en-us/articles/202648103-Myo-Gesture-Control-Armband-tech-specs) on his right arm, always in the same position. In fact, each sample of the dataset is composed of the raw data produced by the 8 EMG sensors and the IMU of the Myo Armband. The time window for the acquisition of each sample was 2 seconds, sampling both the EMG and IMU data at 200 Hz. The subject was required to self-collect the samples with a desktop application that we developed specifically for the gesture acquisition.</p>
        <p>Each data sample for each gesture representing a letter is included in a json file containing both the EMG and the IMU data. The EMG data is organized into an emg object including the following fields:</p>
        <p>• frequency, i.e. the sampling frequency (in Hz) of the values from the EMG sensors. This value is 200 for all the samples;</p>
        <p>• data, a 400 x 8 integer matrix. Each row is an 8-dimensional array including the values from the 8 EMG sensors of the Myo Armband. Therefore, data is the time series of the values from the EMG sensors during the acquisition of the gesture.</p>
        <p>Similarly, the IMU data of the sample is organized into an imu object with the following fields:</p>
        <p>• frequency, i.e. the sampling frequency (in Hz) of the values from the IMU. This value is 200 for all the samples;</p>
        <p>• data, a 400-element object array. Each object has three fields, namely gyroscope (an array composed by 3 floating point values), acceleration (an array composed by 3 floating point values), and rotation (an array composed by 4 floating point values).</p>
        <p>In addition, each json file includes a timestamp, representing the date and time of the gesture acquisition, and the duration of each acquisition, which is 2000 for all the samples. The information about the acquisition duration and the sampling frequency is redundant in the current version of the dataset, as it is the same for all the gestures. However, this information might be useful in the future, when we might add samples varying the acquisition time window or the sampling frequency. The complete dataset specification is available in a dedicated open-access data paper [23].</p>
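        <p>As an example of how a sample can be consumed, the following sketch loads one json file into NumPy arrays according to the structure described above. The file name is hypothetical, and the concatenation order of the 14 input features is an assumption of this sketch (see Section 3.3 for the network input).</p>
        <preformat>
import json
import numpy as np

def load_sample(path):
    # Load one gesture sample of the dataset.
    with open(path) as f:
        sample = json.load(f)
    emg = np.array(sample["emg"]["data"], dtype=np.int32)               # (400, 8)
    imu = sample["imu"]["data"]
    acc = np.array([r["acceleration"] for r in imu], dtype=np.float64)  # (400, 3)
    gyro = np.array([r["gyroscope"] for r in imu], dtype=np.float64)    # (400, 3)
    rot = np.array([r["rotation"] for r in imu], dtype=np.float64)      # (400, 4)
    # The network of Section 3.3 uses EMG, accelerometer, and gyroscope,
    # i.e. a 400 x 14 matrix per sample (the rotation quaternion is unused).
    x = np.concatenate([emg, acc, gyro], axis=1)                        # (400, 14)
    return x

x = load_sample("a/sample_01.json")  # hypothetical file name
print(x.shape)  # (400, 14)
</preformat>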
      </sec>
      <sec id="sec-2-4">
        <title>3.3. System Architecture</title>
        <p>Figure 2 depicts the architecture of the proposed gesture recognition system, used to identify the gestures of the LIS alphabet. The user performs the gesture wearing the Myo Armband; the data from the EMG sensors and the IMU are the input for our deep neural network, based on the Bidirectional LSTM architecture. The system labels the input data with one of the 26 letters of the alphabet, identifying the gesture made by the user. As explained in Section 4, to evaluate our system, we synthetically augmented the data in our dataset during the training process, trying to use more samples and reduce overfitting.</p>
        <p>[Figure 2: The architecture of the proposed gesture recognition system.]</p>
        <p>Table 1 lists all the layers included in our deep neural network. Among the available data, we used the 8 series with the values from the 8 EMG sensors of the Myo Armband. Concerning the IMU, we took the two 3-dimensional vectors with values from the accelerometer and the gyroscope. Therefore, each sample is fed into the network as a 400 x 14 matrix, i.e. there are 400 14-dimensional vectors for each sample. The first network layer is a bidirectional LSTM. It processes the input with 64 hidden units, returning in output 128 hidden state values (64 for the forward sequence, 64 for the backward sequence) for each of the 400 vectors in a sample. In fact, each hidden unit is configured to output a value for each vector in the sample matrix, as proposed by Graves et al. [24] to stack multiple LSTM layers. Thus, the second layer is also a bidirectional LSTM. However, being the last recurrent layer, each of the 32 hidden units returns a single value for the entire sample. Therefore, the output of the second layer is composed of 64 values (32 for the forward sequence, 32 for the backward one). A 50% dropout performs the dilution of the LSTM output, to prevent overfitting. The output is then processed by two fully connected layers: the first includes 64 hidden units, using the rectifier as the activation function. After another 50% dropout for regularization, the output is processed by the 26 units of the second fully connected layer. The softmax activation function of each unit computes the probability distribution over the 26 classes, i.e. the letters of the LIS alphabet.</p>
        <p>Table 1: The deep neural network model used for the gesture recognition. The total number of trainable parameters is 87,514.
Layer            Output Shape   Param #
Bi-LSTM          (400, 128)     40,448
Bi-LSTM          (64)           41,216
Dropout (0.5)    (64)           0
Fc1 (ReLU)       (64)           4,160
Dropout (0.5)    (64)           0
Fc2 (Softmax)    (26)           1,690</p>
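        <p>For reference, the following Keras sketch reproduces Table 1 layer by layer; it is an illustration of the architecture, not necessarily the exact code of our repository. With an input shape of (400, 14), model.summary() reports the 87,514 trainable parameters of Table 1.</p>
        <preformat>
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout

model = Sequential([
    # First recurrent layer: 64 units per direction, one output per time step
    Bidirectional(LSTM(64, return_sequences=True), input_shape=(400, 14)),
    # Last recurrent layer: 32 units per direction, one output per sample
    Bidirectional(LSTM(32)),
    Dropout(0.5),
    Dense(64, activation="relu"),     # Fc1
    Dropout(0.5),
    Dense(26, activation="softmax"),  # Fc2, one unit per LIS letter
])
model.summary()
</preformat>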
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Evaluation</title>
      <sec id="sec-3-1">
        <title>We evaluate our model by collecting prelim</title>
        <p>inary results on the proposed dataset. We
actually want understand to which extent our
deep neural network is a viable solution to
recognize the LIS gestures based on EMG and
IMU sensor data. As the collected dataset
includes only 780 samples, which might be too
rotated by multiplying with a rotation matrix:
⎡1
⎢
⎢
⎣
0
0</p>
        <p>0
cos( )
− sin( )</p>
        <p>0 ⎤
sin( )⎥
cos( )⎥⎦
Such transformation rotates the coordinate
system of  degrees, counterclockwise, around
the x-axis. Ohashi et al also propose the
following formulation to apply the same rotation
to the data of 8 EMG sensors:</p>
        <p>( )


( )</p>
        <p>=
 = ⌊ / ⌋
 = 360/
 =  / − 
( )
 −</p>
        <p>+  (1 −  )
 ( ) +  (1 −  )
( )
 − −1
jective of testing with more data and prevent
overfitting. Even if data augmentation has
few for a deep learning approach, we also ap- Here, 
plied data augmentation, with the twofold ob- sensor when rotating the armband of 
de( ) is the reading of the ℎ</p>
        <p>EMG
grees; 
sor in the original data; 
( ) is the reading of the  -th
senis the number of
some threats to validity.
stage, and therefore inevitably sufer from
the experiments should be considered as early
been proven useful to get general results [25], available EMG sensors;  ( ) is the polynomial
function  ( ) =  2. Intuitively, if the rotation
places the
 -th sensor between the original</p>
        <sec id="sec-3-1-1">
          <title>4.1. Data Augmentation</title>
          <p>To augment the proposed dataset, we apply
the technique presented in [26]. Ohashi et
al. point out that, during the gesture
recognipositions of the  -th and ( + 1)-th sensors, the
reading of the  -th sensor in the rotated data
is computed as the interpolation of the
readdistance from those sensors.
ings of the  -th and ( + 1)-th sensors and the</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Therefore, we apply such rotation technique</title>
        <p>tion with a wearable device such as the Myo to our data, given that, with this approach,
Armband, the user is supposed to wear the de- Oshahi et al. got better performance than
vice always with the same configuration (i.e.
augmenting data with gaussian noise, with
identical placement and rotation). In this way, rotating data around all the three axis, and
the sensors would be attached to the user’s
arm in the same positions every time the
dewith linear interpolation. As in their work,
we rotate the data with the angles in the
folvice is used. However, a displacement is very lowing set:
likely to happen when detaching and
attaching the device again. Therefore, samples with
various rotation angles are desirable in the
training data of a gesture recognition model. By rotating the data, we get 780 samples for
{−30◦, −22.5◦, −15◦, −7.5◦, 7.5◦, 15◦, 22.5◦, 30◦}
each angle, adding the 6,240 synthetic samples
to the 780 originally collected with the Myo
Armband.</p>
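        <p>The sketch below implements this augmentation as reconstructed above: the IMU vectors are rotated with the matrix R_x(θ), while the EMG channels are interpolated between neighboring sensors. The indexing convention (modulo the number of sensors) is an assumption of this sketch, as the exact convention of [26] may differ.</p>
        <preformat>
import numpy as np

def rotate_imu(vectors, theta_deg):
    # Rotate 3D accelerometer/gyroscope readings by theta degrees,
    # counterclockwise, around the x-axis.
    t = np.radians(theta_deg)
    r = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), np.sin(t)],
                  [0.0, -np.sin(t), np.cos(t)]])
    return vectors @ r.T

def rotate_emg(emg, theta_deg, p=lambda x: x ** 2):
    # Simulate rotating the armband by theta degrees: each rotated channel
    # interpolates the two original sensors it falls between, weighted by
    # the polynomial p(x) = x^2 of the normalized distances.
    n = emg.shape[1]                    # number of EMG sensors (8 on the Myo)
    phi = 360.0 / n                     # angle between adjacent sensors
    k = int(np.floor(theta_deg / phi))  # whole-sensor shift
    alpha = theta_deg / phi - k         # fractional shift in [0, 1)
    out = np.empty_like(emg, dtype=float)
    for i in range(n):
        j = (i - k) % n                 # sensor indices wrap around the arm
        out[:, i] = p(1 - alpha) * emg[:, j] + p(alpha) * emg[:, (j - 1) % n]
    return out

# e.g. one of the eight angles used for the augmentation
emg_aug = rotate_emg(np.zeros((400, 8)), 22.5)
acc_aug = rotate_imu(np.zeros((400, 3)), 22.5)
</preformat>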
        <sec id="sec-3-2-1">
          <title>4.2. Experimental Setup</title>
        <p>We tested the proposed deep neural network on the original dataset, as well as on the augmented dataset. We applied a stratified shuffle split cross-validation scheme to validate the accuracy of our model. To this end, we firstly repeated a randomized 80-20 split 5 times, using the 80% of the data as the training set and the 20% as the test set, preserving the percentage of samples from each class in each split. The 12.5% of the training data, i.e. the 10% of the entire dataset, was used as validation data for the training of the neural network. Then, we repeated the same randomized split 30 times on each dataset, to collect more general results.</p>
        <p>We used the Root Mean Square Propagation (RMSProp) optimizer to minimize the Categorical Cross-Entropy loss function during the training of the neural network. The number of training epochs varied for each split, as we early stopped the training after 5 epochs without an improvement on the minimum validation loss, restoring the weights corresponding to the best validation loss. Table 2 shows the number of training epochs in each split, in the 5 split experiments. For the 30 split experiments, the mean number of training epochs was 42.77 (± 9.01) for the original dataset, and 37.67 (± 7.80) on the augmented dataset. The batch size was 32 samples in each split of each experiment.</p>
        <p>[Table 2: Number of training epochs in each split of the 5-split experiments, with and without Data Augmentation (DA).]</p>
        <p>A Jupyter notebook with the described experiments is available in a public GitHub repository (https://github.com/airtlab/italian-sign-language-recognition/), in order to guarantee the reproducibility of the tests. The tests ran on Google Colab with the GPU runtime, using Keras 2.4.3, TensorFlow 2.4.1, and scikit-learn 0.22.2.post1.</p>
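        <p>For clarity, the following sketch outlines the validation scheme and training configuration described above. The variable names and the epoch limit are hypothetical, and build_model() is assumed to return the network of Table 1; the exact code is in the linked notebook.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from tensorflow.keras.callbacks import EarlyStopping

# X: (n_samples, 400, 14) inputs, y: one-hot labels, y_cls: integer labels;
# build_model() returns the network of Table 1 (all assumed to be available).
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
accuracies = []
for train_idx, test_idx in splitter.split(X, y_cls):
    model = build_model()
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = EarlyStopping(monitor="val_loss", patience=5,
                               restore_best_weights=True)
    model.fit(X[train_idx], y[train_idx],
              validation_split=0.125,  # 12.5% of training = 10% of all data
              batch_size=32, epochs=200, callbacks=[early_stop])
    accuracies.append(model.evaluate(X[test_idx], y[test_idx])[1])
print(np.mean(accuracies), np.std(accuracies))
</preformat>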
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Results</title>
        <p>Table 3 shows the prediction accuracy on the test set obtained by repeating 5 times the stratified shuffle split of the dataset. With the 780 samples of the original dataset, the mean accuracy is 57.44%, with a standard deviation of 5.46% over the 5 splits of the experiment. In other words, around half of the test samples gets misclassified. In fact, using only 546 samples for the network training (with 78 samples used as validation data) results in a poor performance of our model.</p>
        <p>Instead, with the 7,020 samples of the augmented dataset, the mean accuracy increases to 97.36%, and the standard deviation decreases to 0.62% over the 5 splits. Using 4,914 samples for training (with 702 samples used for validation) significantly improves the performance of our model. The lower standard deviation shows that the model trained on the augmented dataset exhibits a better generalization. Intuitively, most of the misclassification errors occur with gestures which look similar. For example, in the first split, the “V” is erroneously identified as the “U” 9 times and as the “F” one time, while the other 44 samples are correctly identified. Similarly, 3 “U” samples are wrongly identified as “V”. In the same split, the “W” is misclassified only one time, being identified as the “V”.</p>
        <p>The results are similar when repeating the tests on 30 random stratified shuffle splits of the dataset, as showed in Table 4. The mean value of accuracy is 58.69% (± 4.37%) for the original dataset and 97.07% (± 1.32%) on the augmented dataset. Therefore, both in the experiments with 5 splits and 30 splits, the training on augmented data is more stable than with the original data, resulting in a lower standard deviation on the test accuracy.</p>
        <p>Table 4: Mean number of training epochs and mean accuracy on 30 random stratified shuffle splits, with and without Data Augmentation (DA).
            Epoch #        Accuracy
without DA  42.77 ± 9.01   58.69 ± 4.37%
with DA     37.67 ± 7.80   97.07 ± 1.32%</p>
        <p>Moreover, the tests did not highlight any significant difference in the recognition of static gestures (most of the letters) with respect to the dynamic ones (“G”, “H”, and “Z”), scoring similar class-wise precision and recall values.</p>
        <p>These preliminary results encourage the use of wearable devices equipped with EMG and IMU sensors to execute the recognition of the LIS with deep neural networks. Most of the samples get correctly identified by our LSTM-based model. As expected, the data augmentation improves the performance, and our model gets better results with more data, highlighting the need of expanding the collected dataset.</p>
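        <p>The per-letter confusions reported above can be inspected with a standard confusion matrix; a minimal sketch, assuming the integer labels y_true and y_pred of one test split are available:</p>
        <preformat>
import numpy as np
from sklearn.metrics import confusion_matrix

letters = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # the 26 classes
cm = confusion_matrix(y_true, y_pred, labels=list(range(26)))
off_diag = cm - np.diag(np.diag(cm))  # keep only the misclassifications
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print("most frequent confusion:", letters[i], "predicted as", letters[j],
      off_diag[i, j], "times")
</preformat>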
      </sec>
      <sec id="sec-3-3">
        <title>3https://github.com/airtlab/italian-sign-language</title>
        <p>recognition/
Table 4 the gesture acquisition to 2 seconds. Such
Mean number of training epochs and mean accu- time window is worth of further research, as
racy on 30 random stratified shufle splits, with this time might vary from person to person
and without Data Augmentation (DA). and also for more complex gestures.</p>
        <p>Epoch # Accuracy Concerning the presented results, we built
without DA 42.77 ± 9.01 58.69 ± 4.37% our model on the results of existing literature
with DA 37.67 ± 7.80 97.07 ± 1.32% about LSTMs to process time series, especially
in speech and gesture recognition. However, a
augmented dataset. Therefore, both in the systematic study on alternative models as well
experiments with 5 splits and 30 splits, the as a comparison on more datasets should be
training on augmented data is more stable performed to get more results, and therefore
than with the original data, resulting in a validate our method.
lower standard deviation on the test accuracy.</p>
        <p>Moreover, the tests did not highlight any sig- 5. Conclusions
nificant diference in the recognition of static
gestures (most of the letters) with respect to We presented a deep learning approach for
the dynamic ones (“G”, “H”, and “Z”), scoring the recognition of the LIS alphabet, based on
similar class-wise precision and recall values. surface EMG and IMU data. Specifically, we</p>
        <p>These preliminary results encourage the developed a deep neural network based on the
use of wearable devices equipped with EMG bidirectional LSTM architecture. To validate
and IMU sensors to execute the recognition our method, we built a dataset including 30
of the LIS with deep neural networks. Most gesture samples for each letter of the alphabet.
of the samples gets correctly identified by our The gestures were recorded from the 8 EMG
LSTM-based model. As expected, the data sensors and the IMU of the Myo Armband.
augmentation improves the performance, and To ensure the proper training of our model,
our model gets better results with more data, with enough samples, we used data
augmentahighlighting the need of expanding the col- tion, simulating the rotation of the armband.
lected dataset. The results are preliminary, but promising:
on the augmented dataset, our model got 97%
4.4. Threats to validity accuracy, showing few classification errors
Being in early stage, the presented research on very similar gestures. The source code of
inevitably sufers from some threats to valid- the experiments and the dataset are available
ity. Concerning the collected dataset, all the as public GitHub repositories, to guarantee
gesture samples were performed by the same the reproducibility of the tests. Moreover, the
subject. Samples from more subjects are nec- public dataset is available for further tests.
essary to get more general conclusions. More- The presented research is in early stage,
over, we arbitrary fixed the time window for since a systematic study of alternative deep</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>