                                         Recognition of Voice and Hand Activity through
                                              Fusion of Acceleration and Speech1

                                                    Young-Giu Jung, ChangSeok Bae and Mun-Sung Han

                                                      Electronics and Telecommunications Research Institute,
                                                         138 Gajeongno Yuseong-gu, Daejeon, Korea,
                                                                   {reraj, csbae, msh}@etri.re.kr



                                         Abstract. Hand activity and speech are among the most important modalities of
                                         human-to-agent interaction, so a multimodal interface can achieve more natural
                                         and effective human-agent interaction. In this paper, we propose a novel
                                         technique for improving the performance of an accelerometer-based hand
                                         activity recognition system by fusing it with speech. In our experiments, the
                                         speech data is used as a complementary sensor stream to the acceleration data
                                         in order to improve the performance of the hand activity recognizer. The
                                         recognizer is designed to classify nineteen hand activities: ten natural gestures,
                                         e.g., ‘go left’ and ‘over here’, and nine emotional expressions performed by
                                         hand, e.g., ‘I feel hot’ and ‘I love you’. To improve the recognition performance
                                         through feature fusion, we propose a modified Time Delay Neural Network
                                         (TDNN) architecture with a dedicated fusion layer and a time normalization
                                         layer. Our experimental results show that the system yields an improvement of
                                         about 6.96% compared to the use of accelerometers alone.

                                         Keywords: multimodal interaction, hand activity recognition, modified TDNN,
                                         human cognition




                                1       Introduction

Interface technology using activity or gesture is one of the key functions for agent
systems in ubiquitous computing environments. Accelerometers are currently among
the most widely studied wearable sensors for activity or gesture recognition, thanks to
their accuracy in detecting human body movements, their small size, and their
reasonable power consumption [1]. Ling Bao and others [2] presented algorithms to
detect physical activities from data acquired using five small biaxial accelerometers
worn simultaneously on different parts of the body.
   In the study of Bharatula and others [3], a low-power sensor hardware system was
presented, including an accelerometer, a light sensor, a microphone, and wireless
communication.
1   This work was supported by the IT R&D program of MIC/IITA. [2006-S032-02,
    Development of an Intelligent Service Technology based on the Personal Life Log], [2008-
    P1-15-07J31, Research on Human-friendly Next Generation PC Technology Standardization
    on u-Computing].







Accelerometers have been widely used in many pattern recognition methods for
assessing physical activities. In the study of Kiani and others [4], artificial neural
networks were used. There have been several research efforts to enhance the
performance of accelerometer-based activity recognition.
   We use a human cognition-based technique to improve the performance of activity
recognition. Humans do not depend only on their hearing to recognize information.
This fact is illustrated by the McGurk effect [5], a perceptual phenomenon that
demonstrates an interaction between hearing and vision in speech perception. The
presentation of an audio /p/ with a synchronized incongruent visual /k/ often leads
listeners to identify what they hear as /t/, a phenomenon referred to as ‘fusion’.
   The recognition of human input using data fusion has already been partially
achieved in lip-reading systems. The fusion algorithm can be carried out either at the
feature level or at the class level [6]. Figure 1 shows the block diagram of class-level
fusion. The two input signals are classified separately, and the results of each
classifier are combined in the next step. The fusion module has a set of algorithms to
integrate the individual decision of each sensor. Several methods of class-level fusion
have been proposed and studied extensively, such as the voting method, the behavior-
knowledge space method, and the soft-output classifier fusion method [6].




                             Fig. 1. The fusion at class level
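
   The voting method is the simplest of the class-level schemes listed above. The
following sketch is only an illustration of that idea, not part of the system described in
this paper: each classifier contributes one label and the majority wins, with a
tie-breaking rule that is our own assumption.

```python
from collections import Counter

def vote_fusion(labels):
    """Class-level fusion by majority voting over per-classifier decisions.

    labels: decisions from the individual classifiers, ordered by how much
            we trust them (the order is used only to break ties in this sketch).
    """
    counts = Counter(labels)
    top = max(counts.values())
    tied = {label for label, c in counts.items() if c == top}
    # Majority wins; ties fall back to the most trusted (earliest) classifier.
    return next(label for label in labels if label in tied)

# Example: two of three classifiers agree on 'catch'.
print(vote_fusion(['catch', 'catch', 'release']))  # -> 'catch'
```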

   Figure 2 shows the block diagram of feature-level fusion. The feature vectors of the
accelerometer and speech signals are combined into a joint feature vector, which is
then used as the input of the fusion classifier. As already mentioned, feature fusion
uses a single classifier to fuse the two modalities. Several approaches have been
proposed: fuzzy logic, Artificial Neural Networks (ANN), Hidden Markov Models
(HMM), hybrid ANN-DTW (Dynamic Time Warping), hybrid ANN-HMM, genetic
algorithms, Support Vector Machines (SVM), etc. [7]. In recent years, ANNs based on
back propagation (BP) or radial basis function (RBF) networks have been widely used
for feature-level fusion modeling.




                            Fig. 2. The fusion at feature level
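
   In contrast to Figure 1, feature-level fusion builds a single joint vector before any
classification takes place. The snippet below is a minimal schematic of that
concatenation step with made-up dimensionalities; the actual 64x16 speech and 120x6
acceleration feature matrices of our system are introduced in Section 3.

```python
import numpy as np

def joint_feature_vector(speech_feats: np.ndarray, accel_feats: np.ndarray) -> np.ndarray:
    """Concatenate speech and acceleration features into one joint vector.

    Both feature matrices are flattened and stacked; the combined vector is then
    fed to a single fusion classifier (ANN, HMM, SVM, ...), as described above.
    """
    return np.concatenate([speech_feats.ravel(), accel_feats.ravel()])

# Toy example with made-up sizes: 4 speech frames x 16 coeffs, 6 accel frames x 6 coeffs.
speech = np.random.randn(4, 16)
accel = np.random.randn(6, 6)
joint = joint_feature_vector(speech, accel)
print(joint.shape)  # (100,) -> input dimension of the fusion classifier
```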




   In this paper, we propose a feature fusion method to improve the performance of
accelerometer-based hand activity recognition. To develop a hand activity recognizer
with high performance, we present a modified TDNN architecture with a dedicated
fusion layer and a time normalization layer.


2      Accelerometer-based Hand Activity and Voice Representation

In this section, we describe the hand activities and the speech and acceleration feature
extraction modules used in this paper.


2.1     Speech Feature Extraction

The Speech Feature Extraction (SFE) module extracts feature vectors from the speech
signal. This module is comprised of the following components: an End Point
Detection (EPD) module based on the frame energy and a Feature Extraction (FE)
module based on Zero-Crossing with Peak Amplitude (ZCPA) [8] and the RelAtive
SpecTrAl algorithm (RASTA) [9].
   The ZCPA model is more robust in noisy environments than other popular feature
extraction methods such as LPCC or MFCC. It is composed of a cochlear bandpass
filter bank, a zero-crossing detector, and a peak detector. Frequency information is
obtained by the zero-crossing detector, and intensity information is incorporated by
the peak detector. RASTA processing of speech is a bandpass modulation filtering
operating on the log spectral domain; slow channel variations should in principle be
removed. Finally, the SFE module captures 16 features per frame.
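
   The EPD step is specified above only as being based on the frame energy, so the
following sketch fills in assumed details: frames whose short-time log energy exceeds
an estimated noise floor by a fixed margin are treated as speech, and the first and last
such frames are returned as the end points.

```python
import numpy as np

def endpoint_detection(signal, sr=16000, frame_ms=10, margin_db=10.0):
    """Toy frame-energy end point detection (assumed details, not the paper's exact EPD).

    Splits the signal into 10 ms frames, computes the log energy per frame, estimates
    the noise floor from the quietest frames, and keeps frames whose energy exceeds
    the noise floor plus a margin. Returns (start_frame, end_frame) or None.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

    noise_floor = np.percentile(energy_db, 10)   # assume the quietest 10% is background
    speech = np.where(energy_db > noise_floor + margin_db)[0]
    if speech.size == 0:
        return None
    return int(speech[0]), int(speech[-1])

# Example: silence with a louder burst in the middle of a 16 kHz signal.
sig = np.concatenate([0.001 * np.random.randn(8000),
                      0.5 * np.random.randn(4000),
                      0.001 * np.random.randn(8000)])
print(endpoint_detection(sig))  # roughly frames (50, 74)
```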


2.2     Accelerometer-based Hand Activity Feature Extraction

The acceleration data of the subject was collected using two MTx (Xsens
Technologies) accelerometers. These 3-axis accelerometers are accurate to ±1.7 G
with tolerances within 0.2%. The accelerometers were mounted on the wrists and
sampled at 100 Hz. Figure 3 shows the 3-axis MTx accelerometer and how it is
attached to the wrist.




    Fig. 3. (a) The 3-axis MTx accelerometer, (b) the attached MTx, (c) accelerometers
                                      mounted on the wrists




Fig. 4. Acceleration signals from the 3-axis accelerometer on the right hand: (a) Handshaking,
                                          (b) Catching

   Figure 4 shows the acceleration signals from the 3-axis accelerometer for
movements of the right hand. In figure 4, (a) is the acceleration signal from the MTx
on the right wrist for the ‘Handshaking’ activity and (b) is the acceleration signal for
the ‘Catching’ activity. The blue, red, and green lines are the acceleration signals in
the x-axis, y-axis, and z-axis directions, respectively. From the graphs, it is clear that
the acceleration signal of each axis is highly sensitive to any movement of the hand.
   The Accelerometer-based Activity Feature Extraction (AAFE) module extracts
feature vectors from the 3-axis accelerometer signals on the two wrists. This module
consists of two components: a Start Point Detection (STD) module based on a
threshold and a Feature Extraction module using the differences between the
accelerations of each axis. The STD module detects the start point using the sum of
signal differences over a 10-frame window.
   After finding the start point of an activity, the AAFE module extracts feature
vectors using the differences between the accelerations of each axis. The six
coefficients are computed every 10 ms and fed to the fusion classifier as input. The
feature vectors are obtained as

                   Lx'_t = |x_{t+1} - x_t|,   Ly'_t = |y_{t+1} - y_t|,   Lz'_t = |z_{t+1} - z_t|
                                                                                          (2)
                   Rx'_t = |x_{t+1} - x_t|,   Ry'_t = |y_{t+1} - y_t|,   Rz'_t = |z_{t+1} - z_t|

   where x_t, y_t, and z_t are the accelerations of each axis at time t, and L and R
denote the accelerometer sensors on the left and right hands, respectively.
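
   The computation above can be summarized in a short sketch: Eq. (2) reduces to
absolute first differences along each axis of the two sensors, and the STD step
thresholds the sum of those differences over a 10-frame window. The threshold value
used here is an assumption, since the paper does not specify one.

```python
import numpy as np

def aafe_features(left_xyz, right_xyz, threshold=0.5, window=10):
    """Accelerometer-based activity features, following Eq. (2).

    left_xyz, right_xyz : arrays of shape (T, 3), 3-axis acceleration per 10 ms frame.
    Returns a matrix of per-frame coefficients [Lx', Ly', Lz', Rx', Ry', Rz'],
    starting at the detected start point. The threshold is an assumed value.
    """
    # Eq. (2): absolute first differences along each axis, left hand then right hand.
    diffs = np.hstack([np.abs(np.diff(left_xyz, axis=0)),
                       np.abs(np.diff(right_xyz, axis=0))])   # shape (T-1, 6)

    # Start point detection: sum of the differences over a sliding 10-frame window.
    activity = np.array([diffs[t:t + window].sum()
                         for t in range(len(diffs) - window + 1)])
    above = np.where(activity > threshold)[0]
    start = int(above[0]) if above.size else 0

    return diffs[start:]

# Toy example: 120 frames of rest followed by motion of the right hand along x.
T = 240
left = np.zeros((T, 3))
right = np.zeros((T, 3))
right[120:, 0] = np.sin(np.linspace(0, 6 * np.pi, T - 120))
feats = aafe_features(left, right)
print(feats.shape)   # roughly (239 - start, 6)
```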


2.3    Specifics of Hand Activity and Speech Data

   In our experiments, we examine natural gestures and emotional expressions
performed by hand. A natural gesture is defined as an action that everyone can
understand in human-to-human communication. The supported natural gestures are as
follows: ‘go right’, ‘go left’, ‘go up’, ‘go down’, ‘over here’, ‘go away’, ‘catch’,
‘release’, ‘open’ and ‘close’.




    An emotional expression by hand activity is defined as a hand action that follows
an emotional change, such as ‘I love you’, ‘I feel cold’, ‘I feel hot’, ‘I feel so-cold’,
‘I feel so-hot’, ‘I feel real-hot’, ‘I feel real-cold’, ‘Handshaking’ and ‘Good-bye’.
However, the linguistic definitions of emotional expressions are often ambiguous. To
address ambiguities in hand activity labels, test subjects were provided with image
descriptions and short sentence descriptions of each hand activity. Figure 5 shows an
example of the image descriptions of hand activities. Table 1 lists the Korean
utterance for each hand activity along with its short sentence description.




                  Fig. 5. An example of image descriptions of hand activities



                   Table 1.   Descriptions of utterance and hand activity

     Utterance                         Hand activity
     (Korean-English)                  (short sentence description)
     Jap-A (catch)                     Catching
     Noh-A (release)                   Releasing
     Dad-A (close)                     Closing
     Yeol-Eo (open)                    Opening
     I-Ri-Wa (over here)               Over here
     Jeo-Ri-Ka (go away)               Go away
     A-Rea (go down)                   Go down
     Wi (go up)                        Go up
     O-Reun-Jok (go right)             Go right
     Woen-Jok (go left)                Go left
     Jal-Ga (say good-bye)             Good-bye gesture
     Ban-Ga-Ueo-Yo (handshake)         Give a person the right hand of fellowship
     Sa-Rang-Hae-Yo (I love you)       Making a heart figure by raising two hands
     Jin-Ja-Deb-Da (real-hot)          Waving the collar back and forth toward the face
     Ne-Mu-Deb-Da (so-hot)             Waving two hands back and forth toward the face
     A-Deb-Da (hot)                    Rolling up one’s sleeves
     O-Chub-Da (real-cold)             Rubbing one’s ear with a hand
     Chub-Da (cold)                    Rubbing one’s hands
     A-Chub-Da (so-cold)               Stamping one’s feet




3     Modified TDNN Architecture for Data Fusion

In 1989, it was shown that a time-delay neural network model yields high performance
in speech recognition. The main goal of the TDNN was to provide a neural network
architecture for non-linear feature classification that is invariant under translation in
time or space. The TDNN uses time-delay steps to represent temporal relationships.
Translation-invariant classification is realized by sharing the connection weights of
the time-delay steps [10]. The activation of a unit is normally computed by passing the
weighted sum of its inputs through an activation function (e.g., a sigmoid function).
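
   To make the shared-weight idea concrete, the sketch below implements a single
time-delay layer: one weight matrix is slid over every window of delayed input frames
and a sigmoid is applied to each weighted sum, so the response to a feature pattern
does not depend on when it occurs. The sizes in the example merely echo the speech
network described below; the code is a generic illustration, not our exact network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_delay_layer(inputs, weights, bias):
    """One TDNN layer with weights shared across time positions.

    inputs  : (n_frames, n_features) feature matrix.
    weights : (window, n_features, n_units) weights, shared by every time position.
    bias    : (n_units,) bias, also shared.
    Returns : (n_frames - window + 1, n_units) activations.
    """
    window, n_features, n_units = weights.shape
    n_out = inputs.shape[0] - window + 1
    out = np.empty((n_out, n_units))
    for t in range(n_out):
        # The same weights are applied to every window of delayed frames.
        segment = inputs[t:t + window]                       # (window, n_features)
        out[t] = sigmoid(np.tensordot(segment, weights, axes=([0, 1], [0, 1])) + bias)
    return out

# Example: 64 frames of 16 speech coefficients, a 3-frame window, 8 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16))
w = rng.standard_normal((3, 16, 8)) * 0.1
b = np.zeros(8)
print(time_delay_layer(x, w, b).shape)   # (62, 8)
```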
   We now explain the modified TDNN architecture used to improve the performance
of the hand activity recognition system. One of the most difficult challenges in
feature-level fusion is the synchronization between the accelerometer and speech data.
In our system, the speech features are extracted with dimensions 64 x 16 (where 64 is
the number of frames and 16 is the number of coefficients per frame), whereas the
accelerometer-based hand activity features are extracted with dimensions 120 x 6.
Therefore, the method chosen to synchronize the two feature spaces has a significant
effect on the recognition rate. We solve the synchronization problem by using a
dedicated fusion layer and a time normalization layer. Figure 6 shows the modified
TDNN architecture for data fusion.




             Fig. 6. Architecture of the parameter optimized TDNN for data fusion

  As shown in Figure 6, the fusion module is comprised of four layers: an input layer,
a Time Normalization (TN) layer, a fusion layer, and an output layer. The TN layer
controls the number of input frames to the fusion layer. Using the TN layer, the output
of each node at the fusion layer is given as

                 z'_{F_j} = f( \sum_{S_i = F_j}^{N_S + F_j - 1} w_{S_i F_j} z_{S_i} + \sum_{A_i = F_j}^{N_A + F_j - 1} w_{A_i F_j} z_{A_i} )            (3)


   where F_j is the index of a node at the fusion layer, A denotes the hand activity
features, S denotes the speech features, and f is a sigmoid function. N is the number of
windows at the TN layer, i is the index of a node at the TN layer, and j is the index of
a node at the fusion layer. z is the output value of the TN layer, z' is the output value
of the fusion layer, and w is a weight.
   In figure 6, the input layer of the Speech Network (SN) has 16 feature values at
each 10 ms interval, 64 frames, and overlapping windows of 3 frames. The input layer
of the Accelerometer-based Activity Network (AAN) has 6 feature values at each
10 ms interval, 120 frames, and overlapping windows of 59 frames. The TN layer of
the SN consists of 62 frames, 8 units per frame, and overlapping windows of 5 frames,
and the TN layer of the AAN consists of 62 frames, 3 units per frame, and overlapping
windows of 5 frames. In the case of the SN, the 48 units of the input layer window are
fully interconnected to a layer of 8 TN units. In the case of the AAN, the 354 units of
the input layer window are fully interconnected to a layer of 3 TN units. Finally, the
fusion layer consists of 58 frames and 4 units per frame.
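
   Under the dimensions just given, Eq. (3) can be sketched as follows: each fusion
node F_j reads a window of N_S = 5 speech TN frames and N_A = 5 acceleration TN
frames starting at frame F_j, so the two 62-frame TN layers yield 58 fusion frames of
4 units each. The weight shapes and initialization below are placeholders; only the
windowed summation and the sigmoid follow the equation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_layer(z_speech, z_accel, w_speech, w_accel, n_s=5, n_a=5):
    """Fusion layer output z'_{F_j} of Eq. (3).

    z_speech : (62, 8) speech TN-layer outputs (frames x units).
    z_accel  : (62, 3) acceleration TN-layer outputs.
    w_speech : (n_units, n_s, 8) weights from speech TN windows to each fusion unit.
    w_accel  : (n_units, n_a, 3) weights from acceleration TN windows to each fusion unit.
    Returns  : (58, n_units) fusion-layer activations.
    """
    n_units = w_speech.shape[0]
    n_frames = z_speech.shape[0] - n_s + 1          # 62 - 5 + 1 = 58 fusion frames
    out = np.empty((n_frames, n_units))
    for j in range(n_frames):
        s_win = z_speech[j:j + n_s]                 # speech window starting at F_j
        a_win = z_accel[j:j + n_a]                  # acceleration window starting at F_j
        for u in range(n_units):
            out[j, u] = sigmoid(np.sum(w_speech[u] * s_win) + np.sum(w_accel[u] * a_win))
    return out

# Shapes only; in the real system the weights would be learned by back propagation.
rng = np.random.default_rng(1)
zs = rng.random((62, 8))
za = rng.random((62, 3))
ws = rng.standard_normal((4, 5, 8)) * 0.1
wa = rng.standard_normal((4, 5, 3)) * 0.1
print(fusion_layer(zs, za, ws, wa).shape)   # (58, 4) -> 58 frames, 4 units per frame
```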



4     Experimental Results

The experimental data consists of 19 utterances recorded by one male subject. The
subject was directed to perform each activity accompanied by the corresponding
speech in a quiet office environment. For training, 50 sets of activity and speech data
are used, and another 50 sets are used as test patterns. Speech was recorded with a
SHURE microphone, and the accelerometer-based activity was recorded by the two
MTx sensors on the subject's wrists. To train on gesture and speech, the data set is
presented to the system with a learning rate of 0.1.
   Table 2 compares the performance of the proposed fusion system to the systems
that use a single modality. As shown in Table 2, when the signal-to-noise ratio (SNR)
decreases, the fusion method degrades far less than the speech-alone case and remains
above the accelerometers-alone case.

         Table 2. Comparison of recognition rates (%) at various SNRs (using white noise)

          SNR         Accelerometers alone        Speech alone        Fusion
          -5 dB               92.3                   97.15            99.26
          -10 dB              92.3                   89.47            99.05
          -15 dB              92.3                   53.68            96.21

   As shown in Table 2, our system shows a performance improvement of 6.96% at
-5 dB, 6.75% at -10 dB, and 3.91% at -15 dB when compared to the use of the
accelerometers alone.




5     Conclusion

An accelerometer is one of the most useful wearable sensors for activity recognition.
The accuracy of previous work using only accelerometers was around 85%~90%,
which may be good enough for some applications. In this paper, we presented a
multisensor fusion recognition system with better performance than accelerometer-
based activity recognition alone. In our work, we designed a hand activity recognizer
that classifies acceleration data and utterances into nineteen activities: ten natural
gestures and nine emotional expressions by hand activity.
   To improve the performance of the hand activity recognition system, we proposed
a modified TDNN architecture with a dedicated fusion layer and a time normalization
layer. Using the proposed fusion layer, we solved the synchronization problem in
feature-level fusion. Our experiments show a performance improvement of 6.96%
when compared to an activity recognition system using only accelerometers.


References

1.  Jeonghwa Yang, B. N. Schilit, and D. W. McDonald, “Activity Recognition for the Digital
    Home,” Computer, vol. 41, pp. 102-104, April 2008.
2.  L. Bao and S. Intille, “Activity Recognition from User-Annotated Acceleration Data,” In
    Proc. Pervasive, pp. 1-17, April 2004.
3.  N. B. Bharatula, M. Stager, P. Lukowicz, and G. Troster, “Power and Size Optimized
    Multisensor Context Recognition Platform,” In ISWC 2005, pp. 194-195, Oct. 2005.
4.  K. Kiani, C. J. Snijders, and E. S. Gelsema, “Computerized Analysis of Daily Life Motor
    Activity for Ambulatory Monitoring,” Technol. Health Care, vol. 5, pp. 307-318, Oct. 1997.
5.  S. Lafon, Y. Keller, and R. R. Coifman, “Data Fusion and Multicue Data Matching by
    Diffusion Maps,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, pp.
    1784-1797, Nov. 2006.
6.  D. Ruta and B. Gabrys, “An Overview of Classifier Fusion Methods,” Computing and
    Information Systems, vol. 7, pp. 1-10, February 2000.
7.  J. W. Zhang, L. P. Sun, and J. Cao, “SVM for Sensor Fusion - a Comparison with
    Multilayer Perceptron Networks,” In Proc. Machine Learning and Cybernetics, pp.
    2979-2984, Aug. 2006.
8.  Y.-G. Jung, M.-S. Han, and S.-J. Lee, “Development of an Optimized Feature Extraction
    Algorithm for Throat Signal Analysis,” ETRI Journal, vol. 29, pp. 292-299, June 2007.
9.  H. Hermansky and N. Morgan, “RASTA Processing of Speech,” IEEE Trans. Speech and
    Audio Processing, vol. 2, pp. 578-589, Oct. 1994.
10. N. Mache, M. Reczko, and A. Hatzigeorgiou, “Multistate Time-Delay Neural Networks
    for the Recognition of Pol II Promoter Sequences,” In Proc. 10th Conf. Intelligent Systems
    for Molecular Biology, St. Louis, 1996.



