<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognition of Voice and Hand Activities through Fusion of Acceleration and Speech</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Young-Giu Jung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ChangSeok Bae</string-name>
          <email>csbae@etri.re.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mun-Sung Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Electronics and Telecommunications Research Institute</institution>
          ,
          <addr-line>138 Gajeongno Yuseong-gu, Daejeon</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <fpage>67</fpage>
      <lpage>74</lpage>
      <abstract>
        <p>Hand activity and speech are among the most important modalities of human-to-agent interaction, so a multimodal interface can make that interaction more natural and effective. In this paper, we propose a novel technique that improves the performance of an accelerometer-based hand activity recognition system by fusing it with speech. The speech data serves as complementary sensor data to the acceleration data. The recognizer is designed to classify nineteen hand activities: 10 natural gestures, e.g., 'go left' and 'over here', and 9 emotional expressions by hand activity, e.g., 'I feel hot' and 'I love you'. To improve recognition performance through feature fusion, we propose a modified Time Delay Neural Network (TDNN) architecture with a dedicated fusion layer and a time normalization layer. Our experimental results show an improvement of about 6.96% over the use of accelerometers alone.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal interaction</kwd>
        <kwd>hand activity recognition</kwd>
        <kwd>modified TDNN</kwd>
        <kwd>human cognitive</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Interface technology based on activity or gesture is one of the key functions of an
agent system in a ubiquitous computing environment. Accelerometers are currently
among the most widely studied wearable sensors for activity or gesture recognition,
thanks to their accuracy in detecting human body movements, small size, and
reasonable power consumption [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Ling Bao and others [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] presented algorithms to
detect physical activities from data acquired using five small biaxial accelerometers
worn simultaneously on different parts of the body.
      </p>
      <p>
        Bharatula and others [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] presented a low-power sensor hardware system that
includes an accelerometer, a light sensor, a microphone, and wireless
communication. Accelerometers have been widely used with many pattern recognition
methods in order to assess physical activities. In the study of Kiani and others [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
artificial neural networks were used. There have been several research efforts to
enhance the performance of accelerometer-based activity recognition.
        <fn>
          <p>This work was supported by the IT R&amp;D program of MIC/IITA [2006-S032-02, Development of an Intelligent Service Technology Based on the Personal Life Log] and [2008P1-15-07J31, Research on Human-friendly Next Generation PC Technology Standardization on u-Computing].</p>
        </fn>
      </p>
      <p>
        We use a technique inspired by human cognition to improve the performance of
activity recognition. Humans do not depend only on their hearing in order to
recognize information. This fact is illustrated by the McGurk effect [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a perceptual
phenomenon that demonstrates an interaction between hearing and vision in speech
perception. The presentation of an audio /p/ with a synchronized incongruent visual
/k/ often leads listeners to identify what they hear as /t/, a phenomenon referred to as
‘fusion’.
      </p>
      <p>
        Currently, the recognition of human input using data fusion has been partially
achieved in lip-reading systems. Fusion can be carried out either at the
feature level or at the class level [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Figure 1 shows the block diagram of class-level
fusion. The two input signals are classified separately, and the results of the classifiers
are combined in the next step. The fusion module has a set of algorithms to integrate the
individual decisions of each sensor. Several methods of class-level fusion
have been proposed and studied extensively, such as the voting method, the
behavior-knowledge space method, and the soft-output classifier fusion method [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
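      <p>As a minimal sketch of class-level fusion, the Python fragment below combines the decisions of two separately trained classifiers, first by voting and then by a weighted sum of soft outputs. The function names, the example scores, and the weight <monospace>weight_a</monospace> are illustrative assumptions and are not taken from the system described in this paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def majority_vote(decisions):
    """Return the most frequent label; ties go to the earliest classifier."""
    counts = Counter(decisions)
    best = max(counts.values())
    for label in decisions:
        if counts[label] == best:
            return label


def soft_output_fusion(scores_a, scores_b, weight_a=0.5):
    """Fuse per-class scores of two classifiers by a weighted sum.

    weight_a is a hypothetical tuning parameter in [0, 1].
    """
    fused = {
        label: weight_a * scores_a[label] + (1.0 - weight_a) * scores_b[label]
        for label in scores_a
    }
    return max(fused, key=fused.get)


# Hypothetical per-class scores from an accelerometer and a speech classifier.
accel_scores = {"go left": 0.55, "go right": 0.40, "over here": 0.05}
speech_scores = {"go left": 0.20, "go right": 0.70, "over here": 0.10}

print(majority_vote(["go left", "go right", "go left"]))
print(soft_output_fusion(accel_scores, speech_scores))
```
]]></preformat>
      <p>With equal weights the fused decision follows the more confident speech classifier; raising <monospace>weight_a</monospace> shifts the decision toward the accelerometer classifier.</p>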
      <p>
        Figure 2 shows the block diagram of feature-level fusion. The feature vectors
of the accelerometer and speech signals are combined into a joint feature vector, which
is used as the input vector of the fusion classifier. Feature-level fusion therefore
uses a single classifier to fuse the two modalities. Several approaches
have been proposed: fuzzy logic, Artificial Neural Networks (ANN), Hidden Markov
Models (HMM), hybrid ANN-DTW (Dynamic Time Warping), hybrid ANN-HMM,
genetic algorithms, Support Vector Machines (SVM), etc. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In recent years, ANNs
based on back propagation (BP) or radial basis function (RBF) networks have been
widely used for feature-level fusion modeling.
      </p>
      <p>Fig. 2. Fusion at the feature level</p>
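      <p>The joint feature vector used in feature-level fusion can be sketched as a per-frame concatenation. This fragment assumes that both streams already have the same number of frames, which is not true of the raw features in this paper, where alignment is handled inside the network. The function name and the sample values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def joint_feature_vectors(accel_frames, speech_frames):
    """Concatenate per-frame feature vectors of two time-aligned streams."""
    if len(accel_frames) != len(speech_frames):
        raise ValueError("streams must have the same number of frames")
    return [list(a) + list(s) for a, s in zip(accel_frames, speech_frames)]


# Two frames with 3 acceleration features and 2 speech features each
# (made-up values for illustration only).
accel = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speech = [[1.0, 2.0], [3.0, 4.0]]
joint = joint_feature_vectors(accel, speech)
print(joint[0])
```
]]></preformat>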
      <p>In this paper, we propose a feature fusion method to improve the
performance of accelerometer-based hand activity recognition. To build a
high-performance hand activity recognizer, we present a modified TDNN architecture
with a dedicated fusion layer and a time normalization layer.</p>
    </sec>
    <sec id="sec-2">
      <title>Accelerometer-based Hand Activity and Voice Representation</title>
      <p>This section describes the hand activities and the feature extraction modules for speech
and acceleration used in this paper.</p>
      <sec id="sec-2-1">
        <title>Speech Feature Extraction</title>
        <p>
          The Speech Feature Extraction (SFE) module extracts feature vectors from the speech
signal. This module comprises the following components: an End Point
Detection (EPD) module based on the frame energy, and a Feature Extraction (FE) module
based on Zero-Crossing with Peak Amplitude (ZCPA) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and the RelAtive SpecTrAl
(RASTA) algorithm [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
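        <p>End point detection based on frame energy can be sketched as follows. This is a simplified illustration: the non-overlapping framing, the fixed threshold, and the function names are assumptions, since the paper does not specify these details of its EPD module.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def frame_energies(samples, frame_len):
    """Sum of squared samples for each non-overlapping frame."""
    return [
        sum(x * x for x in samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len, threshold):
    """Return (first, last) indices of frames whose energy exceeds
    the threshold, or None if no frame does."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Silence, a short burst, then silence again (synthetic samples).
signal = [0.0] * 8 + [1.0, -1.0] * 4 + [0.0] * 8
print(detect_endpoints(signal, frame_len=4, threshold=1.0))
```
]]></preformat>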
        <p>The ZCPA model is more robust in noisy environments than other widely used
feature extraction methods, such as LPCC or MFCC. It is composed of a cochlear
bandpass filter, a zero-crossing detector, and a peak detector. Frequency information
is obtained by the zero-crossing detector, and intensity information is
incorporated by the peak detector. RASTA processing of speech is a bandpass
modulation filtering that operates in the log spectral domain, so slow channel variations
should in principle be removed. Finally, the SFE module captures 16 features per
frame.</p>
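        <p>The roles of the zero-crossing detector and the peak detector can be illustrated with a single-band sketch. It omits the cochlear filter bank, the frequency histogram, and RASTA filtering, and the logarithmic compression of the peak is an assumption made for illustration; the function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def upward_zero_crossings(signal):
    """Indices n where the signal rises from non-positive to positive."""
    return [n for n in range(1, len(signal)) if signal[n - 1] <= 0.0 < signal[n]]


def zcpa_pairs(signal, sample_rate):
    """For each interval between successive upward zero crossings, return
    (frequency estimate in Hz, log-compressed peak amplitude)."""
    crossings = upward_zero_crossings(signal)
    pairs = []
    for start, end in zip(crossings, crossings[1:]):
        frequency = sample_rate / (end - start)   # from the zero-crossing detector
        peak = max(signal[start:end])             # from the peak detector
        pairs.append((frequency, math.log(1.0 + peak)))
    return pairs


# Synthetic 100 Hz tone sampled at 8 kHz; the half-sample phase offset keeps
# the samples away from exact zeros.
rate = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * (n + 0.5) / rate) for n in range(400)]
pairs = zcpa_pairs(tone, rate)
print(pairs[0][0])
```
]]></preformat>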
      </sec>
      <sec id="sec-2-2">
        <title>Accelerometer-based Hand Activity Feature Extraction</title>
        <p>The acceleration data of the subject was collected using two MTx (Xsens
Technologies) accelerometers. These 3-axis accelerometers are accurate to ±1.7 G
with tolerances within 0.2%. The accelerometers were mounted on the wrists and sampled
at 100 Hz. Figure 3 shows the 3-axis MTx accelerometer and how it is attached to the
wrist.</p>
        <p>In our experiment, we examine natural gestures and emotional expressions by hand
activity. A natural gesture is defined as an action that everyone can understand in
human-to-human communication. The supported natural gestures are as follows: ‘go
right’, ‘go left’, ‘go up’, ‘go down’, ‘over here’, ‘go away’, ‘catch’, ‘release’, ‘open’
and ‘close’.</p>
        <p>An emotional expression by hand activity is defined as a hand action
that accompanies a change in emotion, such as ‘I love you’, ‘I feel cold’, ‘I feel hot’, ‘I feel
so-cold’, ‘I feel so-hot’, ‘I feel real-hot’, ‘I feel real-cold’, ‘Handshaking’ and
‘Goodbye’. However, the linguistic definitions of emotional expressions are often ambiguous. To
address ambiguities in the hand activity labels, test subjects were provided with an image
description and a short sentence description of each hand activity. Figure 5 shows an
example of the image descriptions. Table 1 lists the Korean utterance for
each hand activity along with its short sentence description.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Modified TDNN Architecture for Data Fusion</title>
      <p>
        In 1989, it was shown that a neural network model yields high performance in
speech recognition. The main goal of the TDNN was to provide a neural network
architecture for non-linear feature classification that is invariant under translation in time or
space. A TDNN uses time-delay steps to represent temporal relationships.
Translation-invariant classification is realized by sharing the connection weights across the
time-delay steps [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The activation of a unit is normally computed by passing the
weighted sum of its inputs to an activation function (e.g., a sigmoid function).
      </p>
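      <p>A minimal sketch of a time-delay layer with shared weights is given below. The same weights are applied at every time position, so shifting the input in time shifts the output by the same amount. The layer sizes, weight values, and function names are illustrative assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tdnn_layer(frames, weights, bias):
    """Apply a time-delay layer with weights shared across time.

    frames:  list of T frames, each a list of feature values
    weights: weights[unit][delay][feature]
    bias:    one bias value per unit
    Returns T - window + 1 output frames, where window = number of delays.
    """
    window = len(weights[0])
    outputs = []
    for t in range(len(frames) - window + 1):
        row = []
        for unit_weights, b in zip(weights, bias):
            total = b
            for d in range(window):
                for x, w in zip(frames[t + d], unit_weights[d]):
                    total += w * x
            row.append(sigmoid(total))
        outputs.append(row)
    return outputs


# One unit, a window of 2 delays, one feature per frame (toy values).
w = [[[1.0], [2.0]]]
frames = [[0.0], [1.0], [0.0], [0.0], [0.0]]
shifted = [[0.0], [0.0], [1.0], [0.0], [0.0]]   # same pattern, one frame later
out = tdnn_layer(frames, w, [0.0])
out_shifted = tdnn_layer(shifted, w, [0.0])
print(out[:3] == out_shifted[1:])
```
]]></preformat>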
      <p>We now describe the modified TDNN architecture used to improve the performance of the hand activity
recognition system. One of the most difficult challenges in feature-level fusion is the
synchronization between accelerometer and speech data. In our system, speech
features are extracted with dimensions 64 x 16 (where 64 is the number of frames
and 16 is the number of coefficients), whereas the accelerometer-based hand activity
features are extracted with dimensions 120 x 6. The method chosen to
synchronize the two feature spaces therefore has a significant effect on
the recognition rate. We solve the synchronization problem by using
a dedicated fusion layer and a time normalization (TN) layer. Figure 6 shows the modified TDNN
architecture for data fusion.
The output value at the fusion layer is given as</p>
      <disp-formula id="eq3">
        <label>(3)</label>
        <tex-math><![CDATA[z'_{F_j} = f\left( \sum_{S_i = F_j}^{N_S + F_j - 1} w_{S_i F_j}\, z_{S_i} + \sum_{A_i = F_j}^{N_A + F_j - 1} w_{A_i F_j}\, z_{A_i} \right)]]></tex-math>
      </disp-formula>
      <p>where F<sub>j</sub> is the index of a node at the fusion layer, A denotes the hand activity
features, S denotes the speech features, and f is a sigmoid function. N is the number of windows
at the TN layer, i is the index of a node at the TN layer, and j is the index of a node at the fusion layer. z
is the output value of the TN layer, z' is the output value of the fusion layer, and w is a weight.</p>
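      <p>The fusion-layer computation can be sketched as follows, under one reading of Eq. (3): each fusion unit sums the weighted TN outputs of both streams over a sliding window and passes the result through a sigmoid. The weight layout, the absence of a bias term, and the function names are assumptions, and the all-zero inputs are placeholders used only to check the output shape.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fusion_layer(z_speech, z_accel, w_speech, w_accel):
    """Fuse two TN-layer outputs frame by frame.

    z_speech, z_accel: TN outputs, one list of unit values per frame
    w_speech[k][d][u]: weight from speech TN unit u at delay d to fusion unit k
    w_accel[k][d][u]:  same for the accelerometer stream
    """
    n_s = len(w_speech[0])   # window length over the speech TN frames
    n_a = len(w_accel[0])    # window length over the accelerometer TN frames
    n_out = min(len(z_speech) - n_s, len(z_accel) - n_a) + 1
    fused = []
    for j in range(n_out):
        frame = []
        for ws, wa in zip(w_speech, w_accel):
            total = 0.0
            for d in range(n_s):
                total += sum(w * z for w, z in zip(ws[d], z_speech[j + d]))
            for d in range(n_a):
                total += sum(w * z for w, z in zip(wa[d], z_accel[j + d]))
            frame.append(sigmoid(total))
        fused.append(frame)
    return fused


# Shapes taken from the text: 62 TN frames, 8 speech units and 3 accelerometer
# units per frame, windows of 5 frames, 4 fusion units. Values are placeholders.
z_s = [[0.0] * 8 for _ in range(62)]
z_a = [[0.0] * 3 for _ in range(62)]
w_s = [[[0.1] * 8 for _ in range(5)] for _ in range(4)]
w_a = [[[0.1] * 3 for _ in range(5)] for _ in range(4)]
fused = fusion_layer(z_s, z_a, w_s, w_a)
print(len(fused), len(fused[0]))
```
]]></preformat>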
      <p>In Figure 6, the input layer of the Speech Network (SN) has 16 feature values per
10 ms frame, 64 frames, and overlapping windows of 3 frames. The input layer
of the Accelerometer-based Activity Network (AAN) has 6 feature values per
10 ms frame, 120 frames, and overlapping windows of 59 frames. The TN layer of the SN therefore
consists of 62 frames with 8 units per frame and overlapping windows of 5 frames, and the TN layer
of the AAN consists of 62 frames with 3 units per frame and overlapping windows of 5 frames. In
the SN, the 48 units of each input window (3 frames x 16 features) are fully connected to a layer of 8 TN units.
In the AAN, the 354 units of each input window (59 frames x 6 features) are fully connected to a
layer of 3 TN units. Finally, the fusion layer consists of 58 frames with 4 units per
frame.</p>
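      <p>The layer sizes quoted above follow from sliding each window one frame at a time, which is assumed here. The short check below reproduces the arithmetic; the helper names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def frames_after_window(n_frames, window):
    """Number of output frames when a window slides one frame at a time."""
    return n_frames - window + 1


def units_per_window(features_per_frame, window):
    """Number of input units seen by one unit of the next layer."""
    return features_per_frame * window


speech_tn_frames = frames_after_window(64, 3)     # SN input -> TN layer
accel_tn_frames = frames_after_window(120, 59)    # AAN input -> TN layer
fusion_frames = frames_after_window(62, 5)        # TN layer -> fusion layer

print(speech_tn_frames, accel_tn_frames, fusion_frames)
print(units_per_window(16, 3), units_per_window(6, 59))
```
]]></preformat>
      <p>Both streams arrive at the TN layer with 62 frames, which is what allows the fusion layer to combine them with a common 5-frame window.</p>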
    </sec>
    <sec id="sec-4">
      <title>Experimental result</title>
      <p>The experimental data consists of 19 utterances recorded by a male subject. The subject was
directed to perform one activity at a time, accompanied by the corresponding speech, in a
quiet office environment. For training, 50 sets of activity and speech data are used,
and the other 50 sets are used as test patterns. Speech is recorded with a SHURE
microphone, and the accelerometer-based activity is recorded with two MTx sensors on the subject's
wrists. The network is trained on the gesture and speech data with a
learning rate of 0.1.</p>
      <p>Table 2 compares the performance of the proposed fusion system with that of a system that
uses accelerometers alone. In Table 2, as the signal-to-noise ratio (SNR) decreases,
the fusion method does not degrade as much as the accelerometer-only case.</p>
      <p>An accelerometer is one of the most useful wearable sensors for activity recognition.
The accuracy of previous work using only accelerometers was around 85% to 90%,
which may be good enough for some applications. In this paper, we present a
multisensor fusion recognition system that outperforms
accelerometer-based activity recognition. We designed a hand
activity recognizer that classifies acceleration data and utterances into nineteen
activities: 10 natural activities and 9 emotional expressions by hand activity.</p>
      <p>To improve the performance of the hand activity recognition system, we propose a
modified TDNN architecture with a dedicated fusion layer and a time normalization
layer. Using the proposed fusion layer, we solve the synchronization problem in
feature-level fusion. Our experiment shows a performance improvement of 6.96%
compared with an activity recognition system that uses only accelerometers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Jeonghwa</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Schilit</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.W.</given-names>
            <surname>McDonald</surname>
          </string-name>
          , “
          <article-title>Activity Recognition for the Digital Home</article-title>
          ,” in Computer. vol.
          <volume>41</volume>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>April 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Bao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Intille</surname>
          </string-name>
          , “
          <article-title>Activity Recognition from user-Annotated Acceleration Data,”</article-title>
          <source>In Proc. Pervasive</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          ,
          <year>April 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N. B.</given-names>
            <surname>Bharatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lukowicz</surname>
          </string-name>
          , and G. Troster, “
          <article-title>Power and Size Optimized Multisensor Context Recognition Platform,”</article-title>
          <source>In ISWC</source>
          <year>2005</year>
          , pp.
          <fpage>194</fpage>
          -
          <lpage>195</lpage>
          , Oct.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>K.</given-names>
            <surname>Kiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Snijders</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Gelsema</surname>
          </string-name>
          , “
          <article-title>Computerized analysis of daily life motor activity for ambulatory monitoring</article-title>
          ,” Technol. Health Care vol.
          <volume>5</volume>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>318</lpage>
          Oct.
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lafon</surname>
          </string-name>
          , Y. Keller, and
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Coifman</surname>
          </string-name>
          , “
          <article-title>Data Fusion and Multicue Data Matching by Diffusion Maps,”</article-title>
          <source>IEEE Trans. Pattern Analysis and Machine Intelligence</source>
          , vol
          <volume>28</volume>
          , pp.
          <fpage>1784</fpage>
          -
          <lpage>1797</lpage>
          , Nov.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Ruta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gabrys</surname>
          </string-name>
          , “
          <article-title>An Overview of Classifier Fusion Method,”</article-title>
          <source>Computing and Information Systems</source>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>February 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          , “
          <article-title>SVM for Sensor Fusion-a Comparison with Multilayer Perceptron Networks,”</article-title>
          <source>In Proc. Machine Learning and Cybernetics</source>
          , pp.
          <fpage>2979</fpage>
          -
          <lpage>2984</lpage>
          , Aug.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Young-Giu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mun-Sung</surname>
          </string-name>
          , and L. San Jo, “
          <article-title>Development of an Optimized Feature Extraction Algorithm for Throat Signal Analysis,”</article-title>
          <source>ETRI Journal</source>
          , vol.
          <volume>29</volume>
          , pp.
          <fpage>292</fpage>
          -
          <lpage>299</lpage>
          , June 2007
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>H.</given-names>
            <surname>Hermansky</surname>
          </string-name>
          , and N. Morgan, “RASTA Processing of Speech,”
          <source>IEEE Trans. Speech Audio Processing</source>
          , vol
          <volume>2</volume>
          , pp.
          <fpage>578</fpage>
          -
          <lpage>589</lpage>
          , Oct.
          <year>1994</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>N.</given-names>
            <surname>Mache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reczko</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hatzigeorgiou</surname>
          </string-name>
          , “
          <article-title>Multistate Time-Delay Neural Networks for the Recognition of Pol II Promoter Sequences,”</article-title>
          <source>In Proc. 10th Conf. Intelligent Systems for Molecular Biology, St. Louis</source>
          ,
          <year>1996</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>