=Paper=
{{Paper
|id=None
|storemode=property
|title=SWS task: Articulatory phonetic units and sliding DTW
|pdfUrl=https://ceur-ws.org/Vol-807/Gautam_IIIT_SWS_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MantenaBP11
}}
==SWS task: Articulatory phonetic units and sliding DTW==
Gautam Varma Mantena, Bajibabu B, and Kishore Prahallad
Speech and Vision Lab, International Institute of Information Technology, Hyderabad, Andhra Pradesh, India
gautam.mantena@research.iiit.ac.in, bajibabu.b@research.iiit.ac.in, kishore@iiit.ac.in
ABSTRACT

This paper describes the experiments conducted for the spoken web search task at the MediaEval 2011 evaluations. The task consists of searching for audio segments within audio content using an audio query. The current approach uses broad articulatory phonetic units to index the audio files and to obtain candidate audio segments. Sliding DTW is then applied on the audio segments to determine the time instants.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Spoken Term Detection, Articulatory Phonetic Units, Sliding Dynamic Time Warping

1. INTRODUCTION

The approach aims at identifying audio segments within audio content using an audio query. Language independence is one of the primary constraints for the spoken web search task [1]. We have implemented a two-stage process to obtain the most likely audio segments.
First, all the audio files are indexed based on their corresponding articulatory phonetic units. The input audio query is decoded into its corresponding articulatory phonetic units, and the audio segments that contain a similar sequence are selected. A sliding-window dynamic time warping (DTW) search is then used to determine the time stamps within each audio segment. The approach is described in more detail in section 2.

2. TASK DESCRIPTION

A variant of the DTW search is used to identify the audio segments within the audio content. The DTW algorithm is time consuming, so the system needs to select the candidate segments beforehand. The system therefore includes an indexing step that improves retrieval time by selecting the required audio segments within the audio content. The approach is a two-level process that prunes segments at each level. The two levels are as follows:

1. First level: index the audio data in terms of its articulatory phonetic units and use the index to obtain the most likely segments for the input audio query.

2. Second level: use a sliding-window DTW search and k-means clustering to obtain the segments with the best scores.

The procedure for the audio search task is described in sections 2.1 and 2.2.

2.1 Indexing using Articulatory Phonetic Units

The primary motivation for this approach is to use speech-specific features rather than language-specific features such as phone models. The advantage is that articulatory phonetic units, if selected well, can represent a broader set of languages. This enables us to build articulatory units from one language and use them for other languages.

The articulatory units selected are shown in Table 1. For example, the articulatory unit CON VEL UN is a consonant velar unvoiced sound. A more detailed description of the tags is given in Table 2.

Table 1: Articulatory Phonetic Units derived from a Telugu database.

Articulatory Unit   Phones
CON VEL UN          /k/, /kh/
CON VEL VO          /g/, /gh/
CON PAL UN          /ch/, /chh/
CON PAL VO          /j/, /jh/
CON ALV UN          /t:/, /t:h/
CON ALV VO          /d:/, /d:h/
CON DEN UN          /t/, /th/
CON DEN VO          /d/, /dh/
CON BIL UN          /p/, /ph/
CON BIL VO          /b/, /bh/
NASAL               /ng~/, /nd~/, /nj~/, /n/, /m/
FRICATIVE           /f/, /h/, /h:/, /sh/, /shh/, /s/
VOW LO CEN          /a/, /aa/
VOW HI FRO          /i/, /ii/
VOW HI BAC          /u/, /uu/
VOW MI FRO          /e/, /ei/
VOW MI BAC          /o/, /oo/
/y/                 /y/
/r/                 /r/
/l/                 /l/
/v/                 /v/
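To make the unit inventory concrete, the following is a minimal sketch of how a decoded phone sequence could be collapsed into the broad articulatory units of Table 1. The mapping is abridged from Table 1, and the decoder output format (a list of phone labels) is an assumption for illustration, not the authors' implementation.

```python
# Abridged phone-to-articulatory-unit mapping from Table 1.
PHONE_TO_UNIT = {
    "/k/": "CON VEL UN", "/kh/": "CON VEL UN",
    "/g/": "CON VEL VO", "/gh/": "CON VEL VO",
    "/p/": "CON BIL UN", "/ph/": "CON BIL UN",
    "/n/": "NASAL", "/m/": "NASAL",
    "/s/": "FRICATIVE", "/sh/": "FRICATIVE",
    "/a/": "VOW LO CEN", "/aa/": "VOW LO CEN",
    "/i/": "VOW HI FRO", "/ii/": "VOW HI FRO",
    # ... remaining rows of Table 1
}

def to_articulatory_units(phones):
    """Collapse a decoded phone sequence into broad articulatory units.
    Phones such as /y/, /r/, /l/, /v/ map to themselves (see Table 1)."""
    return [PHONE_TO_UNIT.get(p, p) for p in phones]

print(to_articulatory_units(["/k/", "/aa/", "/m/"]))
# -> ['CON VEL UN', 'VOW LO CEN', 'NASAL']
```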
Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
Table 2: Articulatory tags and their corresponding description.

Articulatory Tag   Description
CON                Consonant
VEL                Velar
PAL                Palatal
ALV                Alveolar
DEN                Dental
BIL                Bilabial
VO                 Voiced
UN                 Unvoiced
VOW                Vowel
HI                 High
MI                 Mid
LO                 Low
FRO                Front
CEN                Center
BAC                Back

The audio content is decoded into its corresponding articulatory units using HMM models with 64 Gaussian mixture components, built with the HTK toolkit [2]. The models were trained on 15 hours of telephone Telugu data [3] covering 200 speakers. Trigrams over the decoded articulatory phonetic output are used for indexing. The audio query is also decoded, and an audio segment is selected if any of its trigrams matches one of the trigrams of the query. Let tstart and tend be the start and end time stamps of a trigram in the audio content that matches one of the trigrams from the audio query. The likely segment from the audio content is then (tstart − audio query length) to (tend + audio query length). This widening enables the capture of speech segments with varying speaking rates.

These time stamps provide the audio segments that are likely to contain the audio query. A sliding DTW search is applied on these segments to obtain the exact time stamps of the query, as explained in detail in section 2.2.
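To make the first-level search concrete, the following is a minimal sketch of the trigram indexing and candidate selection described above. It assumes each file has been decoded into (unit, t_start, t_end) tuples and that the index is an in-memory dictionary; the function and variable names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_trigram_index(decoded_files):
    """Map each articulatory-unit trigram to its (file_id, t_start, t_end)
    occurrences. decoded_files: {file_id: [(unit, t_start, t_end), ...]}."""
    index = defaultdict(list)
    for file_id, units in decoded_files.items():
        for i in range(len(units) - 2):
            trigram = tuple(u for u, _, _ in units[i:i + 3])
            # span from the start of the first unit to the end of the last
            index[trigram].append((file_id, units[i][1], units[i + 2][2]))
    return index

def candidate_segments(index, query_units, query_len):
    """Return segments whose trigrams match a query trigram, widened by
    the query length on both sides to tolerate speaking-rate variation."""
    labels = [u for u, _, _ in query_units]
    hits = []
    for i in range(len(labels) - 2):
        for file_id, t_start, t_end in index.get(tuple(labels[i:i + 3]), []):
            hits.append((file_id, max(0.0, t_start - query_len),
                         t_end + query_len))
    return hits
```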
2.2 Sliding Window DTW Search

A regular DTW algorithm assumes that the two audio segments differ only in timing and helps in time normalization; that is, it fixes the beginning and the end of the audio segments. In spoken term detection we also need to identify the right audio segment within an audio file, with appropriate time stamps.

We propose an approach in which we consider an audio content segment of length twice that of the audio query and perform DTW on it. After a segment has been compared, the window is moved by one feature shift and the DTW search is computed again. MFCC features with a window length of 20 ms and a window shift of 10 ms are used to represent the speech signal.
Consider an audio content segment S and an audio query Q. Construct a substitution matrix M of size q × sq, where q is the length of Q and sq = 2q. We define M[i, j] as the node measuring the optimal alignment of the segments Q[1:i] and S[1:j]. During the DTW search, nodes M[q, j] (j < sq) are reached at some instants; the time instants from column j to column sq are then the possible end points for the audio segment. A Euclidean distance measure is used to calculate the costs for the matrix M. The above procedure was adapted from a similar approach described in [4].
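The sliding-window search can be sketched as follows, assuming MFCC feature matrices as NumPy arrays of shape (frames, dimensions). The one-frame hop, the q × 2q window, the Euclidean local cost, and reading candidate end points off the last row of M follow the description above; restricting the end point to at least q frames is an added simplification, and all names are illustrative.

```python
import numpy as np

def dtw_end_scores(Q, S):
    """Fill the q x sq matrix M between query frames Q (q x d) and a
    content window S (sq x d, sq = 2q) using a Euclidean local cost.
    The last row holds the alignment score for each candidate end frame."""
    q, sq = len(Q), len(S)
    cost = np.linalg.norm(Q[:, None, :] - S[None, :, :], axis=2)  # local costs
    M = np.full((q, sq), np.inf)
    M[0, 0] = cost[0, 0]
    for j in range(1, sq):
        M[0, j] = cost[0, j] + M[0, j - 1]
    for i in range(1, q):
        M[i, 0] = cost[i, 0] + M[i - 1, 0]
        for j in range(1, sq):
            M[i, j] = cost[i, j] + min(M[i - 1, j], M[i, j - 1], M[i - 1, j - 1])
    return M[q - 1, :]

def sliding_dtw(query_feats, content_feats):
    """Slide a window of twice the query length over the content one
    feature frame at a time and record (start, end, score) per window."""
    q = len(query_feats)
    hits = []
    for start in range(len(content_feats) - 2 * q + 1):
        end_scores = dtw_end_scores(query_feats,
                                    content_feats[start:start + 2 * q])
        # Simplification: require the end point to be at least q frames in.
        j = int(np.argmin(end_scores[q - 1:])) + (q - 1)
        hits.append((start, start + j, float(end_scores[j])))
    return hits
```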
The scores corresponding to all the possible end points are collected for k-means clustering. For k = 3, the cluster mean scores are calculated, and the minimum mean score is used as a threshold to select segments. Among segments that overlap by more than 70%, only the one with the lowest score is kept.
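A sketch of this second-level pruning under stated assumptions: a plain one-dimensional k-means with k = 3 on the DTW scores, the minimum cluster mean as threshold, and a greedy pass that keeps the lowest-scoring segment among those overlapping by more than 70%. The overlap definition (relative to the shorter span) and the quantile initialisation are assumptions, not the authors' implementation.

```python
import numpy as np

def select_by_kmeans(hits, k=3, n_iter=50):
    """Cluster DTW scores with a plain 1-D k-means (k = 3) and keep the
    hits whose score is at or below the lowest cluster mean.
    hits: list of (start, end, score) tuples."""
    scores = np.array([h[2] for h in hits])
    centers = np.quantile(scores, [0.25, 0.50, 0.75])  # init for k = 3
    for _ in range(n_iter):
        labels = np.argmin(np.abs(scores[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = scores[labels == c].mean()
    return [h for h in hits if h[2] <= centers.min()]

def overlap_fraction(a, b):
    """Overlap of two (start, end, ...) spans, relative to the shorter one."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / max(1e-9, min(a[1] - a[0], b[1] - b[0]))

def dedupe(hits, max_overlap=0.7):
    """Among hits overlapping by more than 70%, keep the lowest-scoring one."""
    kept = []
    for h in sorted(hits, key=lambda x: x[2]):  # best (lowest) score first
        if all(overlap_fraction(h, other) <= max_overlap for other in kept):
            kept.append(h)
    return kept
```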
2.3 Experimental Results

Speech and non-speech segments are first detected in both the audio content and the audio query; indexing and the DTW search are then applied only on the speech segments. A zero-frequency-filtered signal is generated from the audio signal using a zero-frequency resonator, and this signal is used to detect voiced and unvoiced regions [5]. If the duration of an unvoiced segment is more than 300 ms, it is treated as a non-speech segment.
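The 300 ms rule can be sketched as below, assuming per-frame voiced/unvoiced decisions (e.g., obtained with the zero-frequency method of [5]) are already available as boolean flags; the flag representation and the 10 ms frame shift are assumptions for illustration.

```python
def nonspeech_regions(voiced_flags, frame_shift=0.010, min_unvoiced=0.300):
    """Mark runs of unvoiced frames longer than 300 ms as non-speech.
    voiced_flags: per-frame voiced/unvoiced decisions (True = voiced)."""
    regions, run_start = [], None
    for i, voiced in enumerate(list(voiced_flags) + [True]):  # sentinel flush
        if not voiced and run_start is None:
            run_start = i
        elif voiced and run_start is not None:
            if (i - run_start) * frame_shift > min_unvoiced:
                regions.append((run_start * frame_shift, i * frame_shift))
            run_start = None
    return regions
```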
The system was evaluated with the NIST spoken term detection evaluation scheme [6]. On the development data, the miss probability ranges from 70% to 98% and the false alarm probability from 0.1% to 0.6%. On the evaluation data, the miss probability ranges from 96% to 98% and the false alarm probability from 0.1% to 0.2%.
3. DISCUSSIONS

For indexing the audio content, trigrams were used. The approach can be extended to bigrams or four-grams. A bigram index would return a large number of segments, which is a problem given the slow speed of the sliding DTW search. With four-grams, the recall of audio content files drops drastically. Trigrams seem to strike a balance between bigram and four-gram indexing.

For estimating the end point, k-means clustering was applied on the DTW scores obtained from all the audio segments. We suspect that this might be the reason certain segments are lost: DTW scores obtained from one kind of pronunciation might mask the scores from other pronunciation variants.

4. REFERENCES

[1] Arun Kumar, Nitendra Rajput, Dipanjan Chakraborty, Sheetal K. Agarwal, and Amit Anil Nanavati, "WWTW: The World Wide Telecom Web," in NSDR 2007 (SIGCOMM workshop), 2007.

[2] Steve J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book Version 3.4, Cambridge University Press, 2006.

[3] Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R.N.V. Sitaram, and S. P. Kishore, "Development of Indian language speech databases for large vocabulary speech recognition systems," in Proc. SPECOM, 2005.

[4] Kishore Prahallad and A. W. Black, "Segmentation of monologues in audio books for building synthetic voices," IEEE Transactions on Audio, Speech and Language Processing, 2010.

[5] N. Dhananjaya and B. Yegnanarayana, "Voiced/nonvoiced detection based on robustness of voiced epochs," IEEE Signal Processing Letters, 2010.

[6] J. Fiscus, J. Ajot, J. Garofolo, and G. Doddington, "Results of the 2006 spoken term detection evaluation," in Proc. ACM SIGIR 2007 Workshop on Searching Spontaneous Conversational Speech, 2007.