   SWS task: Articulatory Phonetic Units and Sliding DTW

Gautam Varma Mantena, Bajibabu B, Kishore Prahallad
Speech and Vision Lab, International Institute of Information Technology, Hyderabad, Andhra Pradesh, India
gautam.mantena@research.iiit.ac.in, bajibabu.b@research.iiit.ac.in, kishore@iiit.ac.in

ABSTRACT
This paper describes the experiments conducted for the spoken web search task at the MediaEval 2011 evaluations. The task consists of searching for audio segments within audio content using an audio query. The current approach uses broad articulatory phonetic units to index the audio files and to obtain candidate audio segments. A sliding DTW search is then applied on these segments to determine the time instants of the query.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Spoken Term Detection, Articulatory Phonetic Units, Sliding Dynamic Time Warping

1. INTRODUCTION
The approach aims at identifying audio segments within audio content using an audio query. Language independence is one of the primary constraints for the spoken web search task [1]. We have implemented a two-stage process to obtain the most likely audio segments. First, all the audio files are indexed by their corresponding articulatory phonetic units. The input audio query is then decoded into its articulatory phonetic units, and the audio segments containing a similar unit sequence are selected. Finally, a sliding-window search based on the dynamic time warping (DTW) algorithm is used to determine the time stamps within each audio segment. The approach is described in more detail in Section 2.

2. TASK DESCRIPTION
A variant of the DTW search is used to identify audio segments within the audio content. The DTW algorithm is time consuming, so the system must select candidate segments beforehand. It therefore includes an indexing step, which improves retrieval time by selecting the required audio segments within the audio content. The approach is a two-level process that prunes segments at each level. The two levels implemented are as follows:

1. First level: index the audio data in terms of its articulatory phonetic units and use them to obtain the most likely segments for the input audio query.

2. Second level: use a sliding window DTW search and k-means clustering to obtain the segments with the best scores.

The procedure for the audio search task is described in Sections 2.1 and 2.2.

2.1 Indexing using Articulatory Phonetic Units
The primary motivation for this approach is to use speech-specific features rather than language-specific features such as phone models. The advantage is that well-chosen articulatory phonetic units can represent a broader set of languages, which enables us to build articulatory units from one language and use them for other languages.

The articulatory units selected are shown in Table 1. For example, the articulatory unit CON VEL UN is a consonant velar unvoiced sound. A more detailed description of the tags is given in Table 2.

Table 1: Articulatory Phonetic Units derived from a Telugu database.

    Articulatory Unit    Phones
    CON VEL UN           /k/, /kh/
    CON VEL VO           /g/, /gh/
    CON PAL UN           /ch/, /chh/
    CON PAL VO           /j/, /jh/
    CON ALV UN           /t:/, /t:h/
    CON ALV VO           /d:/, /d:h/
    CON DEN UN           /t/, /th/
    CON DEN VO           /d/, /dh/
    CON BIL UN           /p/, /ph/
    CON BIL VO           /b/, /bh/
    NASAL                /ng~/, /nd~/, /nj~/, /n/, /m/
    FRICATIVE            /f/, /h/, /h:/, /sh/, /shh/, /s/
    VOW LO CEN           /a/, /aa/
    VOW HI FRO           /i/, /ii/
    VOW HI BAC           /u/, /uu/
    VOW MI FRO           /e/, /ei/
    VOW MI BAC           /o/, /oo/
    /y/                  /y/
    /r/                  /r/
    /l/                  /l/
    /v/                  /v/
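As a concrete illustration of how a decoded phone sequence collapses into the broad units of Table 1, the sketch below transcribes a few rows of the table into a lookup. The dictionary and helper are purely illustrative and not part of the original system, which decodes the units directly with HMMs.

    # Illustrative sketch only: a lookup from phone labels to broad
    # articulatory units, transcribed from a few rows of Table 1.
    # The actual system decodes units directly with HMM models (Section 2.1).
    UNIT_OF_PHONE = {
        "/k/": "CON VEL UN", "/kh/": "CON VEL UN",
        "/g/": "CON VEL VO", "/gh/": "CON VEL VO",
        "/a/": "VOW LO CEN", "/aa/": "VOW LO CEN",
        "/n/": "NASAL", "/m/": "NASAL",
        "/s/": "FRICATIVE", "/sh/": "FRICATIVE",
    }

    def to_units(phones):
        # Phones without a broad class (/y/, /r/, /l/, /v/) map to
        # themselves, as in Table 1.
        return [UNIT_OF_PHONE.get(p, p) for p in phones]

    print(to_units(["/k/", "/aa/", "/m/"]))  # ['CON VEL UN', 'VOW LO CEN', 'NASAL']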
Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
Table 2: Articulatory tags and their corresponding description.

    Articulatory Tag    Description
    CON                 Consonant
    VEL                 Velar
    PAL                 Palatal
    ALV                 Alveolar
    DEN                 Dental
    BIL                 Bilabial
    VO                  Voiced
    UN                  Unvoiced
    VOW                 Vowel
    HI                  High
    MI                  Mid
    LO                  Low
    FRO                 Front
    CEN                 Center
    BAC                 Back
Audio content is decoded into its corresponding articulatory units using HMM models with 64 Gaussian mixture components, built with the HTK toolkit [2]. The models were trained on 15 hours of telephone Telugu data [3] from 200 speakers. Trigrams over the decoded articulatory phonetic output were then used for indexing. The audio query was decoded in the same way, and audio segments were selected if any of their trigrams matched a trigram from the query. Let tstart and tend be the start and end time stamps of a trigram in the audio content that matches one of the trigrams from the audio query. The likely segment from the audio content is then (tstart − audio query length) to (tend + audio query length). This enables the capture of speech segments with varying speaking rates.

These time stamps provide the audio segments that are likely to contain the audio query. A sliding DTW search, explained in detail in Section 2.2, is applied on these segments to obtain the appropriate time stamps for the query.
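The sketch below illustrates this trigram indexing and candidate selection, assuming each file has already been decoded into a list of (unit, t_start, t_end) tuples. The data structures and function names are our own illustration, not the original implementation.

    from collections import defaultdict

    def build_index(decoded_files):
        # decoded_files: {filename: [(unit, t_start, t_end), ...]}
        # Map each articulatory-unit trigram to the locations where it occurs.
        index = defaultdict(list)
        for fname, units in decoded_files.items():
            for i in range(len(units) - 2):
                trigram = tuple(u for u, _, _ in units[i:i + 3])
                t_start, t_end = units[i][1], units[i + 2][2]
                index[trigram].append((fname, t_start, t_end))
        return index

    def candidate_segments(index, query_units, query_len):
        # Expand each matching trigram by the query length on both sides,
        # as described above, to allow for varying speaking rates.
        query_trigrams = {tuple(u for u, _, _ in query_units[i:i + 3])
                          for i in range(len(query_units) - 2)}
        candidates = []
        for tri in query_trigrams:
            for fname, t_start, t_end in index.get(tri, []):
                candidates.append((fname, max(0.0, t_start - query_len),
                                   t_end + query_len))
        return candidates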
2.2 Sliding Window DTW Search                                        Sheetal K. Agarwal, and Amit Anil Nanavati,
                                                                     “WWTW: The World Wide Telecom Web,” in NSDR
In a regular DTW algorithm, the two audio segments are assumed to differ only in their timing, and the algorithm performs the time normalization, i.e. it fixes the beginning and the end of the audio segments. In spoken term detection we also need to identify the right audio segment within the audio, with appropriate time stamps.

We propose an approach in which an audio content segment of length twice that of the audio query is considered and a DTW is performed. After a segment has been compared, the window is moved by one feature shift and the DTW search is computed again. MFCC features, with a window length of 20 ms and a window shift of 10 ms, are used to represent the speech signal. Consider an audio content segment S and an audio query Q. Construct a substitution matrix M of size q × sq, where q is the length of Q and sq = 2q. We define M[i, j] as the node measuring the optimal alignment of the segments Q[1:i] and S[1:j].

During the DTW search, at some instant a node M[q, j] with j < sq is reached. The time instants from column j to column sq are then the possible end points for the audio segment. A Euclidean distance measure is used to calculate the costs for the matrix M. The above procedure was adapted from a similar approach described in [4].
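A minimal sketch of one pass of this sliding search is given below, assuming the query and the content are numpy arrays of MFCC frames (one row per 10 ms frame). The simplified end-point handling, which scores every column of the last row, and the function names are our own reading of the procedure above, not the authors' code.

    import numpy as np

    def window_dtw(Q, S):
        # One window comparison: Q is (q, d) query frames, S is (2q, d)
        # content frames. Build the q x 2q matrix M with Euclidean local
        # costs, where M[i, j] scores the alignment of Q[:i+1] and S[:j+1].
        q, sq = len(Q), len(S)
        M = np.full((q, sq), np.inf)
        M[0, 0] = np.linalg.norm(Q[0] - S[0])
        for i in range(q):
            for j in range(sq):
                if i == 0 and j == 0:
                    continue
                cost = np.linalg.norm(Q[i] - S[j])
                M[i, j] = cost + min(
                    M[i - 1, j] if i > 0 else np.inf,
                    M[i, j - 1] if j > 0 else np.inf,
                    M[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
        # Columns of the last row are the possible end points; the lowest
        # accumulated cost gives the end point and score for this window.
        j_best = int(np.argmin(M[q - 1]))
        return j_best, M[q - 1, j_best]

    def sliding_search(Q, content, shift=1):
        # Slide a window of twice the query length over the content,
        # one feature shift (one 10 ms frame) at a time.
        q, results = len(Q), []
        for start in range(0, len(content) - 2 * q + 1, shift):
            j_best, score = window_dtw(Q, content[start:start + 2 * q])
            results.append((start, start + j_best, score))  # (begin, end, score)
        return results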
represent the speech signal. Consider an audio content seg-          voices,” in IEEE Transactions on Audio, Speech and
ment S and an audio query Q. Construct a substitution ma-            Language Processing, 2010.
trix M of size q x sq where q is the size of Q and sq = 2 ∗ q.   [5] N. Dhananjaya and B. Yegnanarayana,
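A sketch of this selection step, assuming scipy is available: k-means with k = 3 on the end-point scores, the lowest cluster mean as the threshold, and a simple suppression of segments that overlap by 70% or more. The helper names and the exact overlap definition are our own assumptions.

    import numpy as np
    from scipy.cluster.vq import kmeans

    def overlap_fraction(a, b):
        # Fractional overlap of two (start, end, score) segments, measured
        # against the shorter segment (our assumption; the paper does not
        # specify the reference length).
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        shorter = min(a[1] - a[0], b[1] - b[0])
        return inter / shorter if shorter > 0 else 0.0

    def select_segments(segments):
        # segments: list of (start, end, score); a lower score is a better match.
        scores = np.array([s[2] for s in segments], dtype=float).reshape(-1, 1)
        centroids, _ = kmeans(scores, 3)        # k = 3 cluster means
        threshold = centroids.min()             # minimum mean score
        kept = sorted((s for s in segments if s[2] <= threshold),
                      key=lambda s: s[2])
        selected = []
        for seg in kept:                        # keep the lowest-scoring segment
            if all(overlap_fraction(seg, other) < 0.7 for other in selected):
                selected.append(seg)
        return selected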
2.3 Experimental Results
Speech and non-speech segments are first detected in both the audio content and the audio query; indexing and the DTW search are then applied only on the speech segments. A zero-frequency-filtered signal is generated from the audio signal using a zero-frequency resonator, and this signal is used to detect voiced and unvoiced regions [5]. If the duration of an unvoiced segment is more than 300 ms, it is treated as a non-speech segment.
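Only the 300 ms duration rule lends itself to a short sketch; it assumes a per-frame voiced/unvoiced decision (10 ms frames) is already available from the zero-frequency-resonator method of [5].

    def nonspeech_regions(voiced, frame_ms=10, min_unvoiced_ms=300):
        # voiced: one boolean per 10 ms frame. Unvoiced runs longer than
        # 300 ms are returned as (start_frame, end_frame) non-speech regions.
        regions, run_start = [], None
        for i, v in enumerate(list(voiced) + [True]):  # sentinel closes a final run
            if not v and run_start is None:
                run_start = i
            elif v and run_start is not None:
                if (i - run_start) * frame_ms > min_unvoiced_ms:
                    regions.append((run_start, i))
                run_start = None
        return regions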
The system was evaluated with the NIST spoken term detection evaluation scheme [6]. On the development data, the miss probability ranges from 70% to 98% and the false alarm probability from 0.1% to 0.6%. On the evaluation data, the miss probability ranges from 96% to 98% and the false alarm probability from 0.1% to 0.2%.

3. DISCUSSIONS
For indexing the audio content, trigrams were used. The approach can be extended to bigrams or four-grams. Bigrams would return too many segments, which is a problem given the slow speed of the sliding DTW search. With four-grams, the recall over the audio content files drops drastically. Trigrams appear to strike a balance between bigram and four-gram indexing.

For estimating the end point, k-means clustering was applied on the DTW scores obtained from all the audio segments. We suspect this may be the reason certain segments are lost: DTW scores obtained from one kind of pronunciation might mask the scores from other pronunciation variations.

4. REFERENCES
[1] Arun Kumar, Nitendra Rajput, Dipanjan Chakraborty, Sheetal K. Agarwal, and Amit Anil Nanavati, "WWTW: The World Wide Telecom Web," in NSDR 2007 (SIGCOMM workshop), 2007.
[2] Steve J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book Version 3.4, Cambridge University Press, 2006.
[3] Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R.N.V. Sitaram, and S. P. Kishore, "Development of Indian language speech databases for large vocabulary speech recognition systems," in Proc. SPECOM, 2005.
[4] Kishore Prahallad and A. W. Black, "Segmentation of monologues in audio books for building synthetic voices," IEEE Transactions on Audio, Speech and Language Processing, 2010.
[5] N. Dhananjaya and B. Yegnanarayana, "Voiced/nonvoiced detection based on robustness of voiced epochs," IEEE Signal Processing Letters, 2010.
[6] J. Fiscus, J. Ajot, J. Garofolo, and G. Doddington, "Results of the 2006 spoken term detection evaluation," in Proc. ACM SIGIR 2007 Workshop on Searching Spontaneous Conversational Speech, 2007.