=Paper=
{{Paper
|id=None
|storemode=property
|title=Bypassing Words in Automatic Speech Recognition
|pdfUrl=https://ceur-ws.org/Vol-841/submission_9.pdf
|volume=Vol-841
|dblpUrl=https://dblp.org/rec/conf/maics/PalmaLSW12
}}
==Bypassing Words in Automatic Speech Recognition==

Paul De Palma, Department of Computer Science, Gonzaga University, Spokane, WA, depalma@gonzaga.edu

George Luger, Department of Computer Science, University of New Mexico, Albuquerque, NM, luger@cs.unm.edu

Caroline Smith, Department of Linguistics, University of New Mexico, Albuquerque, NM, caroline@unm.edu

Charles Wooters, Next It Corporation, Spokane, WA, cwooters@nextit.com
Abstract

Automatic speech recognition (ASR) is usually defined as the transformation of an acoustic signal to words. Though there are cases where the transformation to words is useful, the definition does not exhaust all contexts in which ASR could be used. Once the constraint that an ASR system outputs words is relaxed, modifications that reduce the search space become possible: 1) the use of syllables instead of words in the recognizer's language model; 2) the addition of a concept model that transforms syllable strings to concept strings, where a concept collects related words and phrases. The paper presents preliminary positive results on the use of syllables and concepts in speech recognition and outlines our current efforts to verify the Syllable-Concept Hypothesis (SCH).

Introduction

The speech recognition problem is conventionally formulated as the transformation of an acoustic speech signal to word strings. Yet this formulation dramatically underspecifies what counts as word strings. Here is a "33-year-old business woman" speaking to a reporter from The New York Times: "We have never seen anything like this in our history. Even the British colonial rule, they stopped chasing people around when they ran into a monastery" (Sang-Hun 2007: 1). The reporter has certainly transformed an acoustic signal into words. Though it would be nice to have a recording and transcription of the actual interview, we can get a sense of what the reporter left out (and put in) by looking at any hand-transcribed corpus of spontaneous speech. Here is the very first segment from the Buckeye Corpus:

yes i uh um uh
lordy um
grew up on the westside i went
to my husband went to
um proximity wise
is probably within a mile of each other we were
high school sweethearts and
the whole bit um
his dad still lives in grove city
my mom lives still at our old family
house there on the westside and
we moved um also on the
westside probably couple miles from my mom.

While we recognize the benefits of solving the speech recognition problem as described, the research presented here begins with the observation that human language performance does not include transcription from an acoustic signal to words—either in the sanitized form found in The New York Times quote or in the raw form found in the Buckeye Corpus. We do not suggest that AI research limit itself to human performance. We do claim that there is much to be gained by relaxing the constraint that the output of automatic speech recognition be a word string. Consider a speech recognition system designed to handle spoken plane reservations via telephone or, for that matter, just about any spoken-dialog system. The recognizer need only pass on the sense of the caller's speech to an appropriately constructed domain knowledge system to solve a problem of significant scope.

The question of what is meant by the sense of an utterance is central to this research. As a first approximation, one can think of the sense of an utterance as a sequence of concepts, where a concept is an equivalence class of words and phrases that seem to mean the same thing. A conventional recognizer generates a word string given a sequence of acoustic observations. The first stage in our research is to generate a syllable string given the same sequence of acoustic observations. Notice that the search space is much reduced. There are fewer syllables to search through (and mistake) than words. Of course, this syllable string must undergo a further transformation to be useful. One possibility would be to probabilistically map it to word strings. We have experimented with this. The results have not been encouraging. We propose, instead, to generate a concept string given the syllable string. Once again, the search space is reduced. There are fewer concepts to search through and mistake than words.
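The two-stage architecture just sketched can be summarized in a few lines. This is an illustrative sketch only: the stage functions are placeholders for the probabilistic models developed in the remainder of the paper, not the authors' implementation.

```python
# The proposal in miniature: recognition becomes two chained searches,
# acoustic signal -> syllable string -> concept string, with words
# bypassed entirely. Both stage functions are placeholders.

def proposed_asr(audio, decode_syllables, map_to_concepts):
    syllables = decode_syllables(audio)   # stage 1: syllable language model
    return map_to_concepts(syllables)     # stage 2: concept model

def conventional_asr(audio, decode_words):
    return decode_words(audio)            # baseline: word language model
```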
The Syllable-Concept Hypothesis (SCH) claims that this dual reduction in search space will result in better recognition accuracy over a standard recognizer. Though SCH can be argued using the axioms of probability, at bottom it is an empirical hypothesis. Preliminary experimental results have been promising. This paper is the first in a four-phase, multi-year research effort to test SCH:

Phase I: Gather preliminary data about SCH using small corpora.
Phase II: Reproduce the results from Phase I using a much larger corpus.
Phase III: Introduce a probabilistic concept generator and concept model.
Phase IV: Introduce an existing domain knowledge system and speech synthesizer to provide response to the user.

Background

The goal of probabilistic speech recognition is to answer this question: "What is the most likely string of words, W, from a language, L, given some acoustic input, A?" This is formulated in equation (1):

\hat{W} = \arg\max_{W \in L} P(W \mid A)   (1)

Since words have no place in SCH, we speak instead of symbol strings drawn from some set of legal symbols, with the sole constraint that the symbols be encoded in ASCII format. So, equation (1) becomes:

\hat{S} = \arg\max_{S \in L} P(S \mid A)   (2)

Equation (2) is read: "The hypothesized symbol string is the one with the greatest probability given the sequence of acoustic observations" (De Palma 2010: 16). Bayes Theorem lets us rewrite equation (2) as:

\hat{S} = \arg\max_{S \in L} \frac{P(A \mid S)\,P(S)}{P(A)}   (3)

Since P(A) does not affect the computation of the most probable symbol string (the acoustic observation is the acoustic observation, no matter the potential string of symbols), we arrive at a variation of the standard formulation of probabilistic speech recognition (Jurafsky and Martin 2009):

\hat{S} = \arg\max_{S \in L} P(A \mid S)\,P(S)   (4)

The difference is that the formulation has been generalized from words to any symbol string. P(A|S), known as the likelihood probability in Bayesian inference, is called the acoustic model in the context of automatic speech recognition. P(S), known as the prior probability in Bayesian inference, is called the language model in ASR. The acoustic model expresses the probability that a string of symbols—words, syllables, whatever—is associated with an acoustic signal in a training corpus. The language model expresses the probability that a sequence of symbols—again, words, syllables, whatever—is found in a training corpus.
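As a concrete reading of equation (4), the following sketch scores two hypothetical symbol strings with toy acoustic-model and language-model probabilities and picks the more probable one. The tables are invented for illustration; a real recognizer searches an enormous hypothesis space with trained models.

```python
import math

# Toy tables standing in for trained models; the entries are invented.
ACOUSTIC = {                 # P(A | S): likelihood of the audio given S
    ("t_uw", "f_l_ay"): 0.20,
    ("t_uw", "f_r_ay"): 0.15,
}
LANGUAGE = {                 # P(S): prior probability of the symbol string
    ("t_uw", "f_l_ay"): 0.010,
    ("t_uw", "f_r_ay"): 0.002,
}

def best_hypothesis(candidates):
    # Equation (4): argmax over S of P(A|S) * P(S), computed in log space.
    return max(candidates,
               key=lambda s: math.log(ACOUSTIC[s]) + math.log(LANGUAGE[s]))

print(best_hypothesis(list(ACOUSTIC)))  # -> ('t_uw', 'f_l_ay')
```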
The attractiveness of syllables for the acoustic model of speech recognition has been noted for some time. A study of the SWITCHBOARD corpus found that over 20% of the manually annotated phones are never realized acoustically, since phone deletion is common in fluent speech. On the other hand, the same study showed that 99% of canonical syllables are realized in speech. Syllables also have attractive distributional properties. The statistical distributions of the 300 most frequently occurring words in English and the most common syllables are almost identical. Though monosyllabic words account for only 22% of SWITCHBOARD by type, they account for a full 81% of tokens (Greenberg 1999; Greenberg 2001; Greenberg et al. 2002). All of this suggests that the use of syllables in the acoustic model might avoid some of the difficulties associated with word pronunciation variation due to dialect, idiolect, speaking rate, acoustic environment, and pragmatic/semantic context. Nevertheless, most studies indicate positive but not dramatic improvement when using a syllable-based acoustic model (Ganapathiraju et al. 1997 and 2001; Sethy and Narayanan 2003; Hamalainen et al. 2007). This has been disappointing given the theoretical attractiveness of syllables in the acoustic model. Since this paper is concerned exclusively with the language model and post-language-model processing, conjectures about the performance of syllables in the acoustic model are beyond its scope.

Still, many of the reasons that make syllables attractive in the acoustic model also make them attractive in the language model, including another not mentioned in the literature on acoustic model research: there are fewer syllables than words, a topic explored later in this paper. Since the output of a recognizer using a syllable language model is a syllable string, studies of speech recognition using syllable language models have been limited to special-purpose systems where output word strings are not necessary. These include reading trackers, audio indexing systems, and spoken name recognizers. Investigations report significant improvement over word language models (Bolanos et al. 2007; Schrumpf, Larson, and Eickler 2005; Sethy and Narayanan 1998). The system proposed here, however, does not end with a syllable string, but, rather, passes this output to a concept model, which transforms it to concept strings, as described later.

Researchers have recognized the potential usefulness of concepts in speech recognition: since the early nineties at Bell Labs, later at the University of Colorado, and still later at Microsoft Research (Pieraccini et al. 1991; Hacioglu and Ward 2001; Yaman et al. 2008). The system proposed here does not use words in any fashion (unlike the Bell Labs system), proposes the use of probabilistically generated concepts (unlike the Colorado system), and is more general than the utterance classification system developed at Microsoft. Further, it couples the use of sub-word units in the language model, specifically syllables, with concepts, an approach that appears to be novel.

Syllables, Perplexity, and Error Rate

One of the first things that a linguist might notice in the literature on the use of the syllable in the acoustic model is that its complexity is underappreciated. Rabiner and Juang (1993), an early text on speech recognition, has only two index entries for "syllable" and treats it as just another easily-defined sub-word unit. This is peculiar, since the number of English syllables varies by a factor of 30 depending on whom one reads (Rabiner and Juang 1993; Ganapathiraju et al. 1997; Huang et al. 2001). In fact, there is a substantial linguistic literature on the syllable and how to define it across languages. This is important since any piece of software that claims to syllabify words embodies a theory of the syllable. Thus, the syllabifier that is cited most frequently in the speech recognition literature, and the one used in the work described in this paper, implements a dissertation that is firmly in the tradition of generative linguistics (Kahn 1976). Since our work is motivated by more recent research in functional and cognitive linguistics (see, for example, Tomasello 2003), a probabilistic syllabifier might be more appropriate. We defer that to a later stage of the project, but note in passing that probabilistic syllabifiers have been developed (Marchand et al. 2007).

Still, even though researchers disagree on the number of syllables in English, that number is significantly smaller than the number of words. And therein lies part of their attractiveness for this research. Simply put, the syllable search space is significantly smaller than the word search space. Suppose language A has a words and language B has b words, where a > b. All other things being equal, the probability of correctly guessing a word from B is greater than guessing one from A. Suppose further that these words are not useful in and of themselves, but contribute to some downstream task, the accuracy of which is proportional to the accuracy of the word recognition task. Substitute syllables for words in language B—since both are symbols—and this is exactly the argument being made here.

Now, one might ask, if syllables work so nicely in the language model of speech recognition, why not use another sub-word unit with an even smaller symbol set, say a phone or demi-syllable? Though the question is certainly worth investigating empirically, the proposed project uses syllables because they represent a compromise between a full word and a sound. By virtue of their length, they preserve more linguistic information than a phone and, unlike words, they represent a relatively closed set. Syllables tend not to change much over time.

A standard a priori indicator of the probable success of a language model is lower perplexity, where perplexity is defined as the Nth inverse root of the probability of a sequence of words (Jurafsky and Martin 2009; Ueberla 1994):

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}   (5)
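Equation (5) is straightforward to state in code. The sketch below computes perplexity in log space; the logprob callable stands in for any trained n-gram model (it is a parameter here, not the authors' language modeling software). The uniform-model check at the end shows why the perplexity of an equal-probability model is just the vocabulary size.

```python
import math

def perplexity(tokens, logprob):
    """tokens: words or syllables; logprob(history, token) -> ln P(token | history)."""
    total = sum(logprob(tuple(tokens[max(0, i - 1):i]), tok)  # bigram history
                for i, tok in enumerate(tokens))
    return math.exp(-total / len(tokens))  # equation (5) in log space

# Sanity check: under a uniform model over V symbols, perplexity is V.
V = 5000
print(perplexity(["a"] * 100, lambda hist, tok: math.log(1.0 / V)))  # ~5000.0
```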
Because there are fewer syllables than words, we would expect both their perplexity in a language model to be lower and their recognition accuracy to be higher. Since the history of science is littered with explanations whose self-evidence turned out to have been incorrect upon examination, we offer a first pass at an empirical investigation.

To compare the perplexity of syllable and word language models, we used two corpora, the Air Travel Information System (Hemphill 1993) and a smaller corpus (SC) of human-computer dialogs captured using the Wizard-of-Oz protocol at Next It (Next It 2012), where subjects thought they were interacting with a computer but in fact were conversing with a human being. The corpora were syllabified using software available from the National Institute of Standards and Technology (NIST 2012).

Test and training sets were created from the same collection of utterances, with the fraction of the collection used in the test set as a parameter. The results reported here use a randomly chosen 10% of the collection in the test set and the remaining 90% in the training set. The system computed the mean, median, and standard deviation over twenty runs. These computations were done for both word and syllable language models for unigrams, bigrams, trigrams, and quadrigrams (sequences of one, two, three, and four words or syllables). As a baseline, the perplexity of the unweighted language model—one in which any word/syllable has the same probability as any other—was computed.
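The protocol just described can be sketched as follows, under stated simplifications: twenty random 90/10 splits, with mean, median, and standard deviation of test-set perplexity. For brevity the sketch uses an add-one-smoothed unigram model; the actual experiments used unigram through quadrigram models built with the authors' own software.

```python
import math
import random
import statistics
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)          # add-one denominator
    log_p = sum(math.log((counts[t] + 1) / total) for t in test_tokens)
    return math.exp(-log_p / len(test_tokens))      # equation (5)

def evaluate(utterances, runs=20, test_fraction=0.10):
    """utterances: list of token lists (words or syllables)."""
    scores = []
    for _ in range(runs):
        shuffled = random.sample(utterances, len(utterances))
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(unigram_perplexity(
            [t for u in train for t in u], [t for u in test for t in u]))
    return (statistics.mean(scores), statistics.median(scores),
            statistics.stdev(scores))
```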
For bigrams, trigrams, and quadrigrams, the perplexity of a syllable language model was less than that of a word language model. Of course, in comparing the perplexity of syllable and word language models, we are comparing sample spaces of different sizes. This can introduce error based on the way perplexity computations assign probability mass to out-of-vocabulary tokens. It must be recognized, however, that syllable and word language models are not simply language models of different sizes of the kind that Ueberla (1994) considered. Rather, they are functionally related to one another. This suggests that the well-understood caution against comparing the perplexity of language models with different vocabularies might not apply completely in the case of syllables and words. Nevertheless, the drop in perplexity was so substantial in a few cases (37.8% for SC quadrigrams, 85.7% for ATIS bigrams) that it invited empirical investigation with audio data.

Recognition Accuracy

Symbol Error Rate (SER) is the familiar Word Error Rate (WER) generalized so that context clarifies whether we are talking about syllables, words, or concepts. The use of SER raises a potential problem. The number of syllables (either by type or token) differs from the number of words in the training corpus. Further, in all but monosyllabic training corpora, syllables will, on average, be shorter than words. How then can we compare error rates? The answer, as before, is that 1) words are functionally related to syllables and 2) improved accuracy in syllable recognition will contribute to downstream accuracy in concept recognition.
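Since SER is WER computed over an arbitrary symbol inventory, it can be stated directly as a Levenshtein alignment: (substitutions + deletions + insertions) divided by reference length. This is a sketch of the standard metric; the NIST scoring software used in the experiments presumably implements the same idea with more bookkeeping.

```python
def symbol_error_rate(reference, hypothesis):
    """reference, hypothesis: lists of symbols (words, syllables, or concepts)."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # deletions
    for j in range(n + 1):
        d[0][j] = j                               # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n] / m

print(symbol_error_rate("t_uw f_l_ay t_uw".split(),
                        "t_uw f_r_ay t_uw".split()))  # -> 0.333...
```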
To test the hypothesis that a syllable language model would perform more accurately than a word language model, we gathered eighteen short audio recordings, evenly distributed by gender, and recorded over both the public switched telephone network and mobile phones. The recognizer used was SONIC from the Center for Spoken Language Research of the University of Colorado (SONIC 2010). The acoustic model was trained on the MACROPHONE corpus (Bernstein et al. 1994). Additional tools included a syllabifier and scoring software available from the National Institute of Standards and Technology (NIST 2012), and language modeling software developed by one of the authors.

The word-level transcripts in the training corpora were transformed to phone sequences via a dictionary look-up. The phone-level transcripts were then syllabified using the NIST syllabifier. The pronunciation lexicon, a mapping of words to phone sequences, was similarly transformed to map syllables to phone sequences. The word-level reference files against which the recognizer's hypotheses were scored were also run through the same process as the training transcripts to produce syllable-level reference files.

With these alterations, the recognizer transformed acoustic input into syllable output represented as a flavor of Arpabet. Figure 1 shows an extract from a reference file represented both in word and in phone-based syllable form.

i want to fly from spokane to seattle
ay waantd tuw flay frahm spow kaen tuw si ae dxaxl

i would like to fly from seattle to san francisco
ay wuhdd laykd tuw flay frahm siy ae dxaxl tuw saen fraen sih skow

Figure 1: Word and Syllable References

The recognizer equipped with a syllable language model showed a mean improvement in SER over all N-gram sizes of 14.6% when compared to one equipped with a word language model. Though the results are preliminary, and await confirmation with other corpora, and with the caveats already noted, they suggest that a recognizer equipped with a syllable language model will perform more accurately than one equipped with a word language model.[1] This will contribute to the downstream accuracy of the system described below. Of course, it must be pointed out that some of this extraordinary gain in recognition accuracy will necessarily be lost in the probabilistic transformation to concept strings.

[1] Though it might be interesting and useful to look at individual errors, the point to keep in mind is that we are looking for broad improvement. The components of SCH were not so much arguments as the initial justification for empirical investigations, investigations that will support or falsify SCH.

Concepts

At this point one might wonder about the usefulness of syllable strings, no matter how accurately they are recognized. We observe that the full range of a natural language is redundant in certain pre-specified domains, say a travel reservation system. Thus the words and phrases ticket, to book a flight, to book a ticket, to book some travel, to buy a ticket, to buy an airline ticket, to depart, to fly, to get to, all taken from the reference files for the audio used in this study, describe what someone wants in this constrained context, namely to go somewhere. With respect to a single word, we collapse morphology and auxiliary words used to denote person, tense, aspect, and mood into a base word. So fly, flying, going to fly, flew, go to, travelling to are grouped, along with certain formulaic phrases (book a ticket to), in the equivalence class GO. Similarly, the equivalence class WANT contains the elements buy, can I, can I have, could I, could I get, I like, I need, I wanna, I want, I would like, I'd like, I'd like to have, I'm planning on, looking for, need, wanna, want, we need, we would like, we'd like, we'll need, would like. We refer to these equivalence classes as concepts.
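The two equivalence classes quoted above can be written down directly as a lookup table. The word spellings are shown here for readability; in the experiments the class members are syllable strings.

```python
# The WANT and GO equivalence classes exactly as listed in the text.
CONCEPTS = {
    "WANT": ["buy", "can i", "can i have", "could i", "could i get",
             "i like", "i need", "i wanna", "i want", "i would like",
             "i'd like", "i'd like to have", "i'm planning on",
             "looking for", "need", "wanna", "want", "we need",
             "we would like", "we'd like", "we'll need", "would like"],
    "GO": ["fly", "flying", "going to fly", "flew", "go to",
           "travelling to", "book a ticket to"],
}

# Inverted index from phrase to concept name, useful for tagging.
PHRASE_TO_CONCEPT = {phrase: name
                     for name, phrases in CONCEPTS.items()
                     for phrase in phrases}
```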
For example, a sentence from the language model (I want to fly to Spokane) was syllabified, giving:

ay w_aa_n_td t_uw f_l_ay t_uw s_p_ow k_ae_n
Then concepts were mapped to the syllable strings, producing:

WANT GO s_p_ow k_ae_n

The mapping from concepts to syllable strings was rigid and chosen in order to generate baseline results. The mapping rules required that at least one member of an equivalence class of syllable strings had to appear in the output string for the equivalence class name to be inserted in its place in the output file. For example, k_ae_n ay hh_ae_v (can I have) had to appear in its entirety in the output file for it to be replaced with the concept WANT.
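The rigid baseline mapping just described fits in a few lines: a concept name is substituted only when a member of its equivalence class appears verbatim in the output string. The class contents below are truncated to the examples given in the text.

```python
# Minimal version of the rigid baseline mapping. Class contents here
# are placeholders limited to the members quoted in the paper.
RIGID_CLASSES = {
    "WANT": ["k_ae_n ay hh_ae_v", "ay w_uh_dd l_ay_kd", "ay w_aa_n_td"],
    "GO": ["t_uw f_l_ay t_uw"],
}

def apply_concepts(syllable_string):
    out = syllable_string
    for concept, members in RIGID_CLASSES.items():
        for member in members:
            if member in out:            # entire member must appear verbatim
                out = out.replace(member, concept)
    return out

print(apply_concepts("ay w_aa_n_td t_uw f_l_ay t_uw s_p_ow k_ae_n"))
# -> WANT GO s_p_ow k_ae_n
```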
The experiment required that we:

1. Develop concepts/equivalence classes from the training transcript used in the language model experiments.

2. Map the equivalence classes onto the reference files used to score the output of the recognizer. For each distinct syllable string that appears in one of the concept/equivalence classes, we substituted the name of the equivalence class for the syllable string. We did this for each of the 18 reference files that correspond to each of the 18 audio files. For example, WANT is substituted for every occurrence of ay w_uh_dd l_ay_kd (I would like).

3. Map the equivalence classes onto the output of the recognizer when using a syllable language model for N-gram sizes 1 through 4. We mapped the equivalence class names onto the content of each of the 72 output files (4 x 18) generated by the recognizer.

4. Determine the error rate of the output in step 3 with respect to the reference files in step 2.

As before, the SONIC recognizer, the NIST syllabifier and scoring software, and our own language modeling software were used. The experiments showed a mean increase in SER over all N-gram sizes of just 1.175%. Given the rigid mapping scheme, these results were promising enough to encourage us to begin work on: 1) reproducing the results on the much larger ATIS2 corpus (Garofalo 1993) and 2) a probabilistic concept model.

Current Work

We are currently building the system illustrated in Figure 2. The shaded portions describe our work.

Figure 2: Acoustic features are decoded into syllable strings using a syllable language model. The syllable strings are probabilistically mapped to concept strings. The N-best syllable list is rescored using concepts. The Intent Scorer enables comparison of performance with a conventional recognizer. (Diagram components: User, Acoustic Feature Analysis, Acoustic Model, Decoder, Syllable Language Model, Phone-to-Syllable Lexicon, N-Best Syllable List, Concept Generator, Concept Model, Concepts, Domain Knowledge System, Speech Synthesizer, Intent Scorer, Intent Accuracy Score.)

A crucial component is the concept generator. Under our definition, concepts are purely collocations of words and phrases, effectively, equivalence classes. In order for the system to be useful for multiple domains, we must go beyond our preliminary investigations: the concepts must be machine-generated. This will be done using a boot-strapping procedure, first described for word-sense disambiguation. The algorithm takes advantage of "the strong tendency of words to exhibit only one sense per collocation and per discourse" (Yarowsky 1995: 50). The technique will begin with a hand-tagged seed set of concepts. These will be used to incrementally train a classifier to augment the seed concepts. The output of a speech recognizer equipped with a syllable language model is the most probable sequence of syllables given an acoustic event. The formalisms used to probabilistically map concepts to syllable strings are reworkings of equations (1) to (4), resulting in:

\hat{C} = \arg\max_{C \in M} P(S \mid C)\,P(C)   (6)

where M is just the set of legal concepts created for a domain by the concept generator. Equation (6) is an extension of a classic problem in computational linguistics: probabilistic part-of-speech tagging. That is, given a string of words, what is the most probable string of parts-of-speech? In the case at hand, given a syllable string, what is the most probable concept string?
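Because equation (6) has the same shape as HMM part-of-speech tagging, the intended search can be sketched with a small Viterbi decoder. Everything numeric here is invented toy data: the emission table for P(S|C), the transition table standing in for the concept prior, and the pre-chunked syllable input are all assumptions for illustration, not the proposed concept model.

```python
import math

EMIT = {   # toy P(syllable chunk | concept)
    ("WANT", "ay w_aa_n_td"): 0.30, ("WANT", "t_uw f_l_ay"): 0.01,
    ("GO", "t_uw f_l_ay"): 0.40,    ("GO", "ay w_aa_n_td"): 0.01,
}
TRANS = {  # toy P(concept | previous concept); "<s>" marks sentence start
    ("<s>", "WANT"): 0.60, ("<s>", "GO"): 0.40,
    ("WANT", "WANT"): 0.30, ("WANT", "GO"): 0.70,
    ("GO", "WANT"): 0.50,   ("GO", "GO"): 0.50,
}
STATES = ("WANT", "GO")

def viterbi(chunks):
    # best[c] = (log probability, best concept path ending in c)
    best = {c: (math.log(TRANS["<s>", c]) + math.log(EMIT[c, chunks[0]]), [c])
            for c in STATES}
    for chunk in chunks[1:]:
        step = {}
        for c in STATES:
            step[c] = max((score + math.log(TRANS[prev, c])
                           + math.log(EMIT[c, chunk]), path + [c])
                          for prev, (score, path) in best.items())
        best = step
    return max(best.values())[1]

print(viterbi(["ay w_aa_n_td", "t_uw f_l_ay"]))  # -> ['WANT', 'GO']
```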
Using equation (6), the Syllable-Concept Hypothesis, introduced early in the paper, can be formalized. If equation (1) describes how a recognizer goes about choosing a word string given a string of acoustic observations, then our enhanced recognizer can be described in equation (7):

\hat{C} = \arg\max_{C \in M} P(C \mid A)   (7)

That is, we are looking for the legal concept string with the greatest probability given a sequence of acoustic observations. SCH, in effect, argues that P(C|A) exceeds P(W|A).

Finally, the question of how to judge the accuracy of the system, from the initial utterance to the output of the concept model, must be addressed. Notice that the concept strings themselves are human readable. So,

I WANT TO FLY TO SPOKANE

becomes:

WANT GO s_p_ow k_ae_n

Amazon Mechanical Turk[2] workers will be presented with both the initial utterance as text and the output of the concept model as text and asked to offer an opinion about accuracy based on an adaptation of the Likert scale. To judge how the proposed system performs relative to a conventional recognizer, the same test will be made, substituting the output of a recognizer equipped with a word language model and no concept model for the output of the proposed system.

[2] The Amazon Mechanical Turk allows computational linguists (and just about anyone else who needs a task that requires human intelligence) to crowd-source their data for human judgment. See https://www.mturk.com/mturk/welcome

Conclusion

We have argued that the speech recognition problem as conventionally formulated—the transformation of an acoustic signal to words—neither emulates human performance nor exhausts the uses to which ASR might be put. This suggests that we could bypass words in some ASR applications, going from an acoustic signal to probabilistically generated syllable strings and from there to probabilistically generated concept strings. Our experiments with syllables on small corpora have been promising:

- 37.8% drop in perplexity with quadrigrams on the SC corpus
- 85.7% drop in perplexity with bigrams on the ATIS corpus
- 14.6% mean increase in recognition accuracy over bigram, trigram, and quadrigram language models

But as has been pointed out, a syllable string is not useful in a dialog system. Concepts must be mapped to syllables. A concept, as we define it, is an equivalence class of words and phrases that seem to mean the same thing in a given context. To date, we have hand-generated concepts from reference files and mapped them to syllables using a rigid mapping scheme intended as a baseline.

But to be truly useful, any recognizer using concepts must automatically generate them. Since concepts, under our definition, are no more than collocations of words, we propose a technique first developed for word-sense disambiguation: incrementally generate a collection of concepts from a hand-generated set of seed concepts. The idea in both phases of our work—probabilistically generating syllable strings and probabilistically generating concept strings—is to reduce the search space from what conventional recognizers encounter. At the very end of this process, we propose scoring how closely the generated concepts match the intent of the speaker using Mechanical Turk workers and a modified Likert scale. Ultimately the output of the system will be sent on to a domain knowledge system, from there onto a speech synthesizer, and finally to the user, who, having heard the output, will respond, thus starting the cycle over again.

Our results to date suggest that the use of syllables and concepts in ASR will result in improved recognition accuracy over a conventional word-based speech recognizer. This improved accuracy has the potential to be used in fully functional dialog systems. The impact of such systems could be as far-reaching as the invention of the mouse and windowing software, opening up computing to persons with coordination difficulties or sight impairment, freeing digital devices from manual input, and transforming the structure of call centers. One application, often overlooked in catalogues of the uses to which ASR might be put, is surveillance.[3] The Defense Advanced Research Projects Agency (DARPA) helped give ASR its current shape. According to some observers, the NSA, as a metonym for all intelligence agencies, is drowning in unprocessed data, much of which is almost certainly speech (Bamford 2008). The kinds of improvements described in this paper, the kinds that promise to go beyond the merely incremental, are what are needed to take voice recognition to the next step.

[3] Please note that this paper is not necessarily an endorsement of all uses to which ASR might be put. It merely recognizes what is in fact the case.

References

Bamford, J. 2008. The Shadow Factory: The Ultra-Secret NSA from 9/11 to the Eavesdropping on America. NY: Random House.

Bernstein, J., Taussig, K., Godfrey, J. 1994. MACROPHONE. Linguistics Data Consortium, Philadelphia, PA.
Bolanos, B., Ward, W., Van Vuuren, S., Garrido, J. 2007. Syllable Lattices as a Basis for a Children's Speech Reading Tracker. Proceedings of Interspeech-2007, 198-201.

De Palma, P. 2010. Syllables and Concepts in Large Vocabulary Speech Recognition. Ph.D. dissertation, Department of Linguistics, University of New Mexico, Albuquerque, NM.

Ganapathiraju, A., Goel, V., Picone, J., Corrada, A., Doddington, G., Kirchof, K., Ordowski, M., Wheatley, B. 1997. Syllable—A Promising Recognition Unit for LVCSR. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 207-214.

Ganapathiraju, A., Hamaker, J., Picone, J., Ordowski, M., Doddington, G. 2001. Syllable-Based Large Vocabulary Continuous Speech Recognition. IEEE Transactions on Speech and Audio Processing, vol. 9, no. 4, 358-366.

Garofalo, J. 1993. ATIS2. Linguistics Data Consortium, Philadelphia, PA.

Greenberg, S. 1999. Speaking in Shorthand—A Syllable-Centric Perspective for Understanding Pronunciation Variation. Speech Communication, 29, 159-176.

Greenberg, S. 2001. From Here to Utility—Melding Insight with Speech Technology. Proceedings of the 7th European Conference on Speech Communication and Technology, 2485-2488.

Greenberg, S., Carvey, H., Hitchcock, L., Chang, S. 2002. Beyond the Phoneme: A Juncture-Accent Model of Spoken Language. Proceedings of the 2nd International Conference on Human Language Technology Research, 36-43.

Hacioglu, K., Ward, W. 2001. Dialog-Context Dependent Language Modeling Combining N-Grams and Stochastic Context-Free Grammars. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 537-540.

Hamalainen, A., Boves, L., de Veth, J., Bosch, L. 2007. On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modeling. EURASIP Journal on Audio, Speech, and Music Processing, 46460, 1-11.

Hemphill, C. 1993. ATIS0. Linguistics Data Consortium, Philadelphia, PA.

Huang, X., Acero, A., Hsiao-Wuen, H. 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall.

Jurafsky, D., Martin, J. 2009. Speech and Language Processing. Upper Saddle River, NJ: Pearson/Prentice Hall.

Kahn, D. 1976. Syllable-Based Generalizations in English Phonology. Ph.D. dissertation, Department of Linguistics, University of Indiana. Bloomington, IN: Indiana University Linguistics Club.

Marchand, Y., Adsett, C., Damper, R. 2007. Evaluating Automatic Syllabification Algorithms for English. Proceedings of the 6th International Conference of the Speech Communication Association, 316-321.

Next It Corporation. 2012. Web Customer Service with Intelligent Virtual Agents. Retrieved 3/37/2012 from: http://www.nextit.com.

NIST. 2012. Language Technology Tools/Multimodal Information Group—Tools. Retrieved 2/19/2012 from: http://www.nist.gov.

Pieraccini, R., Levin, E., Lee, C. 1991. Stochastic Representation of Conceptual Structure in the ATIS Task. Proceedings of the DARPA Speech and Natural Language Workshop, 121-124.

Rabiner, L., Juang, B. 1993. Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall.

Sang-Hun, C. 10/21/2007. Myanmar, Fear Is Ever Present. The New York Times.

Schrumpf, C., Larson, M., Eickler, S. 2005. Syllable-Based Language Models in Speech Recognition for English Spoken Document Retrieval. Proceedings of the 7th International Workshop of the EU Network of Excellence DELOS on Audio-Visual Content and Information Visualization in Digital Libraries, 196-205.

Sethy, A., Narayanan, S. 2003. Split-Lexicon Based Hierarchical Recognition of Speech Using Syllable and Word Level Acoustic Units. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, I, 772-775.

SONIC. 2010. SONIC: Large Vocabulary Continuous Speech Technology. Retrieved 3/8/2010 from: http://techexplorer.ucsys.edu/show_NCSum.cfm?NCS=258626.

Tomasello, M. (ed.) 2003. The New Psychology of Language: Cognitive and Functional Approaches to Language Structure. Mahwah, NJ: Lawrence Erlbaum Associates.

Ueberla, J. 1994. Analyzing and Improving Statistical Language Models for Speech Recognition. Ph.D. dissertation, School of Computing Science, Simon Fraser University.

Yaman, S., Deng, L., Yu, D., Wang, W., Acero, A. 2008. An Integrative and Discriminative Technique for Spoken Utterance Classification. IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, 1207-1214.

Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 189-196.