=Paper=
{{Paper
|id=None
|storemode=property
|title=Phone recognition for Spoken Web Search
|pdfUrl=https://ceur-ws.org/Vol-807/Barnard_MUST_SWS_me11wn.pdf
|volume=Vol-807
}}
==Phone recognition for Spoken Web Search==
Phone recognition for Spoken Web Search

Etienne Barnard, Marelie Davel — Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa — {etienne.barnard, marelie.davel}@gmail.com
Charl van Heerden, Neil Kleynhans — HLT Research Group, CSIR Meraka Institute, Pretoria, South Africa — {cvheerden, ntkleynhans}@gmail.com
Kalika Bali — Microsoft Research Lab India, Bangalore, India — kalikab@microsoft.com

ABSTRACT
Aiming at both speaker independence and robustness with respect to recognition errors in the spoken queries, we have implemented a two-pass system for spoken web search. In the first pass, unconstrained phone recognition of both the query terms and the content audio is employed to represent these recordings as phone strings. A dynamic-programming approach then finds regions in the content phone strings that correspond closely to one or more query strings. In the second pass, each of these regions is again processed with a phone recognizer, but now a lattice is extracted; this lattice is compared against similar lattices extracted for each of the queries. We find our approach to be somewhat successful in identifying the query terms in both the development and evaluation sets, but not to generalize well between these sets.

Categories and Subject Descriptors
I.2 [Artificial Intelligence]: Natural Language Processing—Speech recognition and synthesis

General Terms
Spoken term detection, under-resourced languages, confidence measures

1. INTRODUCTION
The ‘spoken web search’ task of MediaEval 2011 [4] involved searching for audio content in one of 4 under-resourced languages (Gujarati, Telugu, Hindi and Indian English), using only an audio version of the content query. All audio content [3] was collected over a mobile connection and acoustic quality varied. Our approach to this task was guided by three principles: (1) Since the search task requires speaker independence, we preferred to use standard speaker-independent ASR technology, rather than (say) template-based methods. (2) Any pronunciation model derived from a single spoken example of a search term is likely to be quite fragile; hence, specific care must be taken to model variability around such a model. (3) Only limited resources are available in the target languages / dialects; we therefore focused our efforts on approaches that did not rely on any textual data (or derived language models), and could produce results when closely matched ASR systems are not available.

2. APPROACH
As we did not have access to closely matched ASR systems for any of the target dialects, we focused our approach on obtaining data and building a set of acoustic models for at least one of the languages. These acoustic models were then adapted to the other languages and used during both spoken term detection and confidence scoring. Without time alignments for the development set, we generated our own using a grapheme-based system in order to evaluate our results during development.

2.1 Acoustic modeling
The main data set used for acoustic modelling was a Hindi corpus obtained from Microsoft. Several additional corpora were also considered, but were not as directly suited to this task.

2.1.1 Hindi Models
The Microsoft Hindi Corpus consists of 60 hours of spontaneous conversations in colloquial Hindi, recorded on Appen’s telephony recording platform and sampled at 8 kHz. There are 996 native Hindi speakers and all conversations range between 1 and 4 minutes in duration. All conversations are transcribed and time-aligned on speaker turns, and a basic pronunciation dictionary is provided.
An initial acoustic model trained on all the audio was used to further process the corpus, using techniques described in [1]. The initial acoustic models trained were standard 3-state left-to-right tied triphone models, with 8 mixtures per state and semi-tied transforms. A garbage model was then trained and combined with this initial model. The models were further refined by doing MAP adaptation using the target audio (in all 4 languages). Target audio transcriptions were generated using the initial acoustic models to decode the target audio, using a flat phone grammar.
The list of monophones was reduced rather aggressively from 62 to two smaller sets (of 43 and 21 monophones, respectively) in order to work with a small set of broad but reliable classes, appropriate to the later scoring tasks. The reduction included, amongst others, modeling aspiration separately, splitting diphthongs as well as affricates, combining some allophones and merging all the nasalized vowels with the corresponding non-nasal phonemes.
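To make the reduction step concrete, the following is a minimal sketch of how such a phone-set reduction table could be applied to a decoded phone string. The symbols and mappings shown are illustrative assumptions only; they are not the actual 62-, 43- or 21-phone sets used in this work.

<pre>
# Illustrative sketch: reducing a decoded phone string to a smaller, broader phone set.
# The symbols and mappings below are hypothetical examples, not the sets used in the paper.

REDUCTION_MAP = {
    # merge nasalized vowels with their non-nasal counterparts
    "aa~": ["aa"],
    "ii~": ["ii"],
    # split diphthongs and affricates into their constituent parts
    "ai":  ["a", "i"],
    "ch":  ["t", "sh"],
    # model aspiration separately
    "kh":  ["k", "h"],
    # combine allophones into a single broad class
    "dx":  ["d"],
}

def reduce_phone_string(phones):
    """Map each recognized phone to one or more broad-class phones."""
    reduced = []
    for p in phones:
        reduced.extend(REDUCTION_MAP.get(p, [p]))  # phones not in the table pass through
    return reduced

# Example with a hypothetical decoded string
print(reduce_phone_string(["kh", "aa~", "ch", "ai"]))  # ['k', 'h', 'aa', 't', 'sh', 'a', 'i']
</pre>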
2.2 Spoken term detection
Spoken term detection was performed using a dynamic programming (DP) approach: audio data and query data are decoded by the ASR system using a flat phone-loop grammar, and the resulting phoneme strings are matched against one another using a dynamic programming algorithm and a variable cost matrix. A phone set is used where transitional sounds (such as affricates or diphthongs) are split into their constituent parts. The phone string generated from the audio data is then segmented into detection candidates using a shifting window with a size matching the query phone string (plus or minus a leniency factor), and an alignment cost is generated. The alignment cost, normalised by the phoneme length of the longer phone string, is used directly as the DP score. This approach is influenced by both the granularity of the phoneme set used and the scoring matrix. In this work a linguistically motivated matrix, a matrix derived from the posterior probabilities obtained from an ASR confusion matrix, and a flat matrix were used. (A sketch of this matching procedure is given at the end of this section.)
Two additional approaches to detection were considered:
• Grammar-based term detection: A constrained decoding network grammar is constructed by placing repeatable phoneme fillers before and after the desired search term and allowing multiple search term detections within an utterance.
• Lattice-based methods: A phone lattice is constructed from the entire utterance within which search is to be performed, and the phoneme string corresponding to each of the queries is matched against the lattice. However, the acoustic ambiguities of phone recognition in low-quality audio caused practical difficulties – both the computation required and the size of the resulting lattices were found to be unmanageable.
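The sketch below illustrates the sliding-window DP matching described above, assuming a simple substitution-cost table and uniform insertion/deletion costs; the cost values, window step and leniency factor are assumptions for illustration and not the parameters used in this work.

<pre>
# Illustrative sketch of DP-based spoken term detection over phone strings.
# Cost values and the leniency factor are hypothetical; the paper's actual
# scoring matrices (linguistic, confusion-based, flat) are not reproduced here.

def align_cost(query, window, sub_cost, ins_del_cost=1.0):
    """Edit-distance style DP alignment cost between two phone sequences."""
    n, m = len(query), len(window)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * ins_del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = sub_cost.get((query[i - 1], window[j - 1]),
                               0.0 if query[i - 1] == window[j - 1] else 1.0)
            d[i][j] = min(d[i - 1][j - 1] + sub,          # substitution / match
                          d[i - 1][j] + ins_del_cost,     # deletion
                          d[i][j - 1] + ins_del_cost)     # insertion
    return d[n][m] / max(n, m)  # normalise by the longer string

def detect(query, content, sub_cost, leniency=2):
    """Slide windows of roughly the query's length over the content phone string."""
    hits = []
    for width in range(len(query) - leniency, len(query) + leniency + 1):
        for start in range(0, max(1, len(content) - width + 1)):
            window = content[start:start + width]
            hits.append((align_cost(query, window, sub_cost), start, width))
    return sorted(hits)  # lowest normalised cost = best detection candidate

# Hypothetical usage with a tiny cost matrix that treats 'p'/'b' as similar
sub_cost = {("p", "b"): 0.3, ("b", "p"): 0.3}
print(detect(["p", "a", "n", "i"], ["b", "a", "n", "i", "k", "a"], sub_cost)[:3])
</pre>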
2.3 Confidence scoring
The DP scores generated during the detection phase can be used directly as confidence measures. In addition, two other confidence measures were calculated using the terms flagged during DP scoring: lattice-to-lattice matching and dynamic time warping. Since standard lattice-based confidence measures are difficult to utilise without reliable language models, a direct lattice-to-lattice matching measure was implemented. This is a direct extension of the DP-based string matching process, and can be implemented efficiently using an algorithm as described in [2], combined with similar scoring matrices as for DP scoring. Posterior probabilities are obtained directly from the lattices, and a series of start and end points (relative to the initial detection) can be evaluated efficiently. Finally, Dynamic Time Warping (DTW) was used to match the query and detection on a frame-to-frame basis, and the corresponding DTW distance (normalized by number of frames) used as the confidence measure.
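As a minimal illustration of the DTW-based confidence measure, the sketch below aligns two sequences of per-frame feature vectors and normalises the accumulated distance by the number of aligned frames. The Euclidean local distance, the normalisation choice and the toy feature vectors are assumptions made for the example; the acoustic front end used in this work is not specified here.

<pre>
# Illustrative sketch of a DTW confidence score between a query and a detection,
# each represented as a sequence of per-frame feature vectors (lists of floats).
# The Euclidean local distance is an assumption made for this example.
import math

def frame_dist(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw_confidence(query_frames, detection_frames):
    """DTW distance between the two sequences, normalised by the number of frames."""
    n, m = len(query_frames), len(detection_frames)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(query_frames[i - 1], detection_frames[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m] / (n + m)  # one possible frame-count normalisation

# Hypothetical usage with two short, two-dimensional feature sequences
q = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]]
det = [[0.1, 0.2], [0.2, 0.1], [0.4, 0.4], [0.5, 0.5]]
print(dtw_confidence(q, det))
</pre>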
2.4 Evaluating results
To analyse our results, it was necessary to generate alignments for the development and evaluation data sets. Since the Indo-Aryan and Dravidian languages have a high letter-to-sound correlation, it was decided to use a grapheme-based recognition system to generate alignments. The grapheme-based alignment system was represented by 8-mixture context-dependent tri-letter HMM acoustic models. A pronunciation dictionary was created by letter-splitting the words, which resulted in 26 sub-word units (ASCII ‘a’ to ‘z’ in the English alphabet).

3. RESULTS
Our experimental results are based on detection of the development queries in the development data. Fig. 1 shows the standard DET curve obtained when using the confidence scores of the DP alignment to score hypotheses, as well as the DET curve when the lattice-to-lattice measure is employed. We see that the DP confidence scores yield better DET curves and OTWVs. A separate analysis showed that the lattice scores for valid matches are generally higher than those for false detections; we therefore searched for linear combinations between DP and lattice scores that would outperform either on its own. However, no consistent improvement was found. We therefore decided to use the DP scores in our submission.

[Figure 1: DET curves when using confidence scores based on DP alignment vs lattice-to-lattice posteriors.]

The ATWV scores obtained in the four task conditions are summarized in Table 1. These results confirm that our system is somewhat successful in detecting the desired search terms; however, the negative ATWV scores across the dev/eval divide suggest that our system is quite sensitive to the differences between the two sets.

Table 1: ATWV scores for four task conditions
Task    dev/dev   dev/eval   eval/dev   eval/eval
ATWV    0.102     -0.021     -0.13      0.114

4. CONCLUSION
We have presented a DP-based approach to spoken term detection, and a lattice-based scoring mechanism that was intended to refine the DP scores. The former approach was somewhat successful, but we have not been able to obtain a benefit from the latter; thus, our second assumption in Section 1 has not been confirmed. Given our time constraints, we have not been able to experiment systematically with numerous variables which are clearly important to the performance of both stages of the system (e.g. different phone sets, scoring matrices, lattice-extraction parameters, etc.) – it is likely that significant improvements within the current framework can be achieved by paying closer attention to each of these factors. Once that has been achieved, further score improvements by using additional acoustic front ends should be a straightforward (though computationally expensive) step.

5. REFERENCES
[1] M. H. Davel, C. van Heerden, N. Kleynhans, and E. Barnard. Efficient harvesting of Internet audio for resource-scarce ASR. In Proc. Interspeech, Florence, Italy, August 2011.
[2] Y. Kobayashi and Y. Niimi. Matching algorithms between a phonetic lattice and two types of templates – lattice and graph. In Proc. ICASSP, pages 1597–1600, Tampa, Florida, April 1985.
[3] A. Kumar, N. Rajput, D. Chakraborty, S. K. Agarwal, and A. A. Nanavati. WWTW: The World Wide Telecom Web. In NSDR 2007 (SIGCOMM Workshop), Kyoto, Japan, August 2007.
[4] N. Rajput and F. Metze. Spoken Web Search. In MediaEval 2011 Workshop, Pisa, Italy, September 2011.