                     Phone recognition for Spoken Web Search

Etienne Barnard, Marelie Davel
Multilingual Speech Technologies, North-West University
Vanderbijlpark, South Africa
{etienne.barnard, marelie.davel}@gmail.com

Charl van Heerden, Neil Kleynhans
HLT Research Group, CSIR Meraka Institute
Pretoria, South Africa
{cvheerden, ntkleynhans}@gmail.com

Kalika Bali
Microsoft Research Lab India
Bangalore, India
kalikab@microsoft.com


ABSTRACT

Aiming at both speaker independence and robustness with respect to recognition errors in the spoken queries, we have implemented a two-pass system for spoken web search. In the first pass, unconstrained phone recognition of both the query terms and the content audio is employed to represent these recordings as phone strings. A dynamic-programming approach then finds regions in the content phone strings that correspond closely to one or more query strings. In the second pass, each of these regions is again processed with a phone recognizer, but now a lattice is extracted; this lattice is compared against similar lattices extracted for each of the queries. We find our approach to be somewhat successful in identifying the query terms in both the development and evaluation sets, but not to generalize well between these sets.

Categories and Subject Descriptors

I.2 [Artificial Intelligence]: Natural Language Processing—Speech recognition and synthesis

General Terms

Spoken term detection, under-resourced languages, confidence measures

1. INTRODUCTION

The 'spoken web search' task of MediaEval 2011 [4] involved searching for audio content in one of 4 under-resourced languages (Gujarati, Telugu, Hindi and Indian English), using only an audio version of the content query. All audio content [3] was collected over a mobile connection and acoustic quality varied. Our approach to this task was guided by three principles: (1) Since the search task requires speaker independence, we preferred to use standard speaker-independent ASR technology, rather than (say) template-based methods. (2) Any pronunciation model derived from a single spoken example of a search term is likely to be quite fragile; hence, specific care must be taken to model variability around such a model. (3) Only limited resources are available in the target languages / dialects; we therefore focused our efforts on approaches that did not rely on any textual data (or derived language models), and could produce results when closely matched ASR systems are not available.

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

2. APPROACH

As we did not have access to closely matched ASR systems for any of the target dialects, we focused our approach on obtaining data and building a set of acoustic models for at least one of the languages. These acoustic models were then adapted to the other languages and used during both spoken term detection and confidence scoring. Without time alignments for the development set, we generated our own using a grapheme-based system in order to evaluate our results during development.

2.1 Acoustic modeling

The main data set used for acoustic modelling was a Hindi corpus obtained from Microsoft. Several additional corpora were also considered, but were not as directly suited to this task.

2.1.1 Hindi Models

The Microsoft Hindi Corpus consists of 60 hours of spontaneous conversations in colloquial Hindi, recorded on Appen's telephony recording platform and sampled at 8 kHz. There are 996 native Hindi speakers, and all conversations range between 1 and 4 minutes in duration. All conversations are transcribed and time-aligned on speaker turns, and a basic pronunciation dictionary is provided.

An initial acoustic model trained on all the audio was used to further process the corpus, using techniques described in [1]. The initial acoustic models were standard 3-state left-to-right tied triphone models, with 8 mixtures per state and semi-tied transforms. A garbage model was then trained and combined with this initial model. The models were further refined by MAP adaptation using the target audio (in all 4 languages); transcriptions of the target audio were generated by decoding it with the initial acoustic models, using a flat phone grammar.

The list of monophones was reduced rather aggressively, from 62 to two smaller sets (of 43 and 21 monophones, respectively), in order to work with a small set of broad but reliable classes appropriate to the later scoring tasks. The reduction included, amongst others, modelling aspiration separately, splitting diphthongs as well as affricates, combining some allophones, and merging all the nasalized vowels with the corresponding non-nasal phonemes.

2.2 Spoken term detection

Spoken term detection was performed using a dynamic programming (DP) approach: the audio data and the query data are decoded by the ASR system using a flat phone-loop grammar, and the resulting phoneme strings are matched against one another using a dynamic programming algorithm and a variable cost matrix. A phone set is used in which transitional sounds (such as affricates or diphthongs) are split into their constituent parts. The phone string generated from the audio data is then segmented into detection candidates using a shifting window whose size matches that of the query phone string (plus or minus a leniency factor), and an alignment cost is generated for each candidate. The alignment cost, normalised by the phoneme length of the longer phone string, is used directly as the DP score. This approach is influenced by both the granularity of the phoneme set used and the scoring matrix; in this work, three scoring matrices were used: a linguistically motivated matrix, a matrix derived from the posterior probabilities obtained from an ASR confusion matrix, and a flat matrix.
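To make the matching step concrete, the following is a minimal sketch of DP alignment of phone strings with a sliding window; the phone symbols, substitution costs and leniency value are illustrative placeholders, not the settings used in this work.

```python
# Minimal sketch of DP-based matching of query and audio phone strings.
# Phone symbols, substitution costs and the leniency value are illustrative.

def align_cost(query, window, sub_cost, ins_del_cost=1.0):
    """Dynamic-programming alignment cost between two phone strings."""
    m, n = len(query), len(window)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * ins_del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_del_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = sub_cost.get((query[i - 1], window[j - 1]),
                               0.0 if query[i - 1] == window[j - 1] else 1.0)
            d[i][j] = min(d[i - 1][j] + ins_del_cost,   # deletion
                          d[i][j - 1] + ins_del_cost,   # insertion
                          d[i - 1][j - 1] + sub)        # substitution or match
    # Normalise by the phoneme length of the longer string: this is the DP score.
    return d[m][n] / max(m, n)

def detect(query, audio_phones, sub_cost, leniency=2):
    """Score every window of length len(query) +/- leniency over the decoded audio."""
    candidates = []
    for width in range(len(query) - leniency, len(query) + leniency + 1):
        for start in range(len(audio_phones) - width + 1):
            window = audio_phones[start:start + width]
            candidates.append((align_cost(query, window, sub_cost), start, start + width))
    return sorted(candidates)   # lowest normalised cost first

# Toy example with a tiny "linguistically motivated" substitution-cost matrix.
sub_cost = {("p", "b"): 0.3, ("b", "p"): 0.3, ("aa", "a"): 0.2, ("a", "aa"): 0.2}
query = ["p", "aa", "n", "ii"]
audio = ["sil", "k", "a", "b", "a", "n", "ii", "sil", "t", "aa"]
print(detect(query, audio, sub_cost)[:3])
```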
   Two additional approaches to detection were considered:

   • Grammar-based term detection: A constrained decoding network grammar is constructed by placing repeatable phoneme fillers before and after the desired search term, and allowing multiple search term detections within an utterance (a sketch of such a grammar is given after this list).

   • Lattice-based methods: A phone lattice is constructed from the entire utterance within which the search is to be performed, and the phoneme strings corresponding to each of the queries are matched against the lattice. However, the acoustic ambiguities of phone recognition in low-quality audio caused practical difficulties: both the computation required and the size of the resulting lattices were found to be unmanageable.
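As an illustration of the first of these approaches, the sketch below writes out a filler-loop grammar; the HTK HParse-style notation, the filler phone set and the query phone string are all assumptions made for the example, since the decoder and grammar format are not specified here.

```python
# Sketch of a constrained decoding network for grammar-based term detection.
# The HParse-style grammar syntax, filler phone set and query phone string
# below are illustrative assumptions, not details taken from this work.

filler_phones = ["a", "i", "u", "k", "t", "n", "s"]   # repeatable phone-loop filler
query_phones = ["p", "aa", "n", "ii"]                 # phones decoded from the spoken query

grammar = (
    "$filler = " + " | ".join(filler_phones) + ";\n"
    "$term = " + " ".join(query_phones) + ";\n"
    # One or more repetitions of filler-or-term, so fillers may precede and
    # follow the term, and the term may be detected several times per utterance.
    "( < $filler | $term > )\n"
)

with open("term_detection.gram", "w") as handle:
    handle.write(grammar)
print(grammar)
```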

2.3 Confidence scoring
   The DP scores generated during the detection phase can be used
directly as confidence measures. In addition, two other confidence
measures were calculated using the terms flagged during DP scoring:
lattice-to-lattice matching and dynamic time warping. Since standard lattice-based confidence measures are difficult to utilise without reliable language models, a direct lattice-to-lattice matching measure was implemented. This is a direct extension of the DP-based string matching process, and can be implemented efficiently using an algorithm as described in [2], combined with similar scoring matrices as for DP scoring. Posterior probabilities are obtained directly from the lattices, and a series of start and end points (relative to the initial detection) can be evaluated efficiently. Finally, Dynamic Time Warping (DTW) was used to match the query and the detection on a frame-to-frame basis, and the corresponding DTW distance (normalized by the number of frames) was used as the confidence measure.
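The DTW measure can be sketched as follows: an accumulated frame-to-frame distance, normalised by the number of frames. The two-dimensional "features", the Euclidean local distance and the exact normalisation are assumptions made for the example; in practice frame-level acoustic features would be used.

```python
import math

# Sketch of the DTW confidence measure between a query and a detected region.
# Features and local distance are illustrative, not details from this work.

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_confidence(query_frames, detection_frames):
    """Accumulated DTW distance, normalised by the number of frames."""
    m, n = len(query_frames), len(detection_frames)
    inf = float("inf")
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = euclidean(query_frames[i - 1], detection_frames[j - 1])
            d[i][j] = cost + min(d[i - 1][j],        # insertion
                                 d[i][j - 1],        # deletion
                                 d[i - 1][j - 1])    # match
    return d[m][n] / (m + n)   # lower value = closer match

# Toy example with two-dimensional "frames".
query = [(0.1, 0.2), (0.4, 0.1), (0.3, 0.3)]
detection = [(0.1, 0.25), (0.35, 0.1), (0.3, 0.3), (0.2, 0.2)]
print(dtw_confidence(query, detection))
```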
2.4 Evaluating results

   To analyse our results, it was necessary to generate alignments for the development and evaluation data sets. Since the Indo-Aryan and Dravidian languages have a high letter-to-sound correlation, it was decided to use a grapheme-based recognition system to generate the alignments. The grapheme-based alignment system was represented by 8-mixture context-dependent tri-letter HMM acoustic models. A pronunciation dictionary was created by letter-splitting the words, which resulted in 26 sub-word units (ASCII 'a' to 'z' in the English alphabet).
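The letter-splitting step amounts to mapping each word to a sequence of single-letter sub-word units, as in the sketch below; the (romanised) word list is made up for illustration, and any transliteration of the Indic scripts is assumed to have happened beforehand.

```python
import string

# Sketch of building a grapheme ("letter-split") pronunciation dictionary with
# the 26 ASCII letters as sub-word units. Example words are illustrative only.

def letter_split(word):
    """Map a word to a sequence of single-letter sub-word units (a-z only)."""
    return [ch for ch in word.lower() if ch in string.ascii_lowercase]

def build_dictionary(words):
    return {w: " ".join(letter_split(w)) for w in words}

for word, pron in build_dictionary(["namaste", "pani", "gaon"]).items():
    print(f"{word}\t{pron}")
# namaste   n a m a s t e
# pani      p a n i
# gaon      g a o n
```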
3. RESULTS

   Our experimental results are based on detection of the development queries in the development data. Fig. 1 shows the standard DET curve obtained when using the confidence scores of the DP alignment to score hypotheses, as well as the DET curve when the lattice-to-lattice measure is employed. We see that the DP confidence scores yield better DET curves and OTWVs. A separate analysis showed that the lattice scores for valid matches are generally higher than those for false detections; we therefore searched for linear combinations of the DP and lattice scores that would outperform either on its own. However, no consistent improvement was found. We therefore decided to use the DP scores in our submission. The ATWV scores obtained in the four task conditions are summarized in Table 1. These results confirm that our system is somewhat successful in detecting the desired search terms; however, the negative ATWV scores across the dev/eval divide suggest that our system is quite sensitive to the differences between the two sets.

[Figure 1: DET curves when using confidence scores based on DP alignment vs lattice-to-lattice posteriors.]

Table 1: ATWV scores for four task conditions

  Task    dev/dev   dev/eval   eval/dev   eval/eval
  ATWV     0.102     -0.021     -0.13       0.114

4. CONCLUSION

   We have presented a DP-based approach to spoken term detection, and a lattice-based scoring mechanism that was intended to refine the DP scores. The former approach was somewhat successful, but we have not been able to obtain a benefit from the latter; thus, our second assumption in Section 1 has not been confirmed. Given our time constraints, we have not been able to experiment systematically with the numerous variables which are clearly important to the performance of both stages of the system (e.g. different phone sets, scoring matrices, lattice-extraction parameters, etc.); it is likely that significant improvements within the current framework can be achieved by paying closer attention to each of these factors. Once that has been achieved, further score improvements by using additional acoustic front ends should be a straightforward (though computationally expensive) step.

5. REFERENCES

[1] M. H. Davel, C. van Heerden, N. Kleynhans, and E. Barnard. Efficient harvesting of Internet audio for resource-scarce ASR. In Proc. Interspeech, Florence, Italy, August 2011.
[2] Y. Kobayashi and Y. Niimi. Matching algorithms between a phonetic lattice and two types of templates – lattice and graph. In Proc. ICASSP, pages 1597–1600, Tampa, Florida, April 1985.
[3] A. Kumar, N. Rajput, D. Chakraborty, S. K. Agarwal, and A. A. Nanavati. WWTW: The World Wide Telecom Web. In NSDR 2007 (SIGCOMM Workshop), Kyoto, Japan, August 2007.
[4] N. Rajput and F. Metze. Spoken Web Search. In MediaEval 2011 Workshop, Pisa, Italy, September 2011.