                   Bypassing Words in Automatic Speech Recognition


Paul De Palma
Department of Computer Science, Gonzaga University, Spokane, WA
depalma@gonzaga.edu

George Luger
Department of Computer Science, University of New Mexico, Albuquerque, NM
luger@cs.unm.edu

Caroline Smith
Department of Linguistics, University of New Mexico, Albuquerque, NM
caroline@unm.edu

Charles Wooters
Next It Corporation, Spokane, WA
cwooters@nextit.com

Abstract

Automatic speech recognition (ASR) is usually defined as the transformation of an acoustic signal to words. Though there are cases where the transformation to words is useful, the definition does not exhaust all contexts in which ASR could be used. Once the constraint that an ASR system outputs words is relaxed, modifications that reduce the search space become possible: 1) the use of syllables instead of words in the recognizer's language model; 2) the addition of a concept model that transforms syllable strings to concept strings, where a concept collects related words and phrases. The paper presents preliminary positive results on the use of syllables and concepts in speech recognition and outlines our current efforts to verify the Syllable-Concept Hypothesis (SCH).

Introduction

The speech recognition problem is conventionally formulated as the transformation of an acoustic speech signal to word strings. Yet this formulation dramatically underspecifies what counts as word strings. Here is a "33-year-old business woman" speaking to a reporter from The New York Times: "We have never seen anything like this in our history. Even the British colonial rule, they stopped chasing people around when they ran into a monastery" (Sang-Hun 2007: 1). The reporter has certainly transformed an acoustic signal into words. Though it would be nice to have a recording and transcription of the actual interview, we can get a sense of what the reporter left out (and put in) by looking at any hand-transcribed corpus of spontaneous speech. Here is the very first segment from the Buckeye Corpus:

    yes i uh um uh
    lordy um
    grew up on the westside i went
    to my husband went to
    um proximity wise
    is probably within a mile of each other we were
    kind of high school sweethearts and
    the whole bit um
    his dad still lives in grove city
    my mom lives still at our old family
    house there on the westside and
    we moved um also on the
    westside probably couple miles from my mom.

While we recognize the benefits of solving the speech recognition problem as described, the research presented here begins with the observation that human language performance does not include transcription from an acoustic signal to words—either in the sanitized form found in The New York Times quote or in the raw form found in the Buckeye Corpus. We do not suggest that AI research limit itself to human performance. We do claim that there is much to be gained by relaxing the constraint that the output of automatic speech recognition be a word string. Consider a speech recognition system designed to handle spoken plane reservations via telephone or, for that matter, just about any spoken-dialog system. The recognizer need only pass on the sense of the caller's speech to an appropriately constructed domain knowledge system to solve a problem of significant scope.

The question of what is meant by the sense of an utterance is central to this research. As a first approximation, one can think of the sense of an utterance as a sequence of concepts, where a concept is an equivalence class of words and phrases that seem to mean the same thing. A conventional recognizer generates a word string given a sequence of acoustic observations. The first stage in our research is to generate a syllable string given the same sequence of acoustic observations. Notice that the search space is much reduced.
There are fewer syllables to search through (and mistake) than words. Of course, this syllable string must undergo a further transformation to be useful. One possibility would be to probabilistically map it to word strings. We have experimented with this; the results have not been encouraging. We propose, instead, to generate a concept string given the syllable string. Once again, the search space is reduced: there are fewer concepts to search through (and mistake) than words.

The Syllable-Concept Hypothesis (SCH) claims that this dual reduction in search space will result in better recognition accuracy than that of a standard recognizer. Though SCH can be argued from the axioms of probability, at bottom it is an empirical hypothesis. Preliminary experimental results have been promising. This paper reports the first phase of a four-phase, multi-year research effort to test SCH:

• Phase I: Gather preliminary data about SCH using small corpora.
• Phase II: Reproduce the results from Phase I using a much larger corpus.
• Phase III: Introduce a probabilistic concept generator and concept model.
• Phase IV: Introduce an existing domain knowledge system and speech synthesizer to provide responses to the user.

Background

The goal of probabilistic speech recognition is to answer this question: "What is the most likely string of words, W, from a language, L, given some acoustic input, A?" This is formulated in equation (1):

    \hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid A)    (1)

Since words have no place in SCH, we speak instead of symbol strings drawn from some set of legal symbols, with the sole constraint that the symbols be encoded in ASCII format. So, equation (1) becomes:

    \hat{S} = \operatorname*{argmax}_{S \in L} P(S \mid A)    (2)

Equation (2) is read: "The hypothesized symbol string is the one with the greatest probability given the sequence of acoustic observations" (De Palma 2010: 16). Bayes' Theorem lets us rewrite equation (2) as:

    \hat{S} = \operatorname*{argmax}_{S \in L} \frac{P(A \mid S)\, P(S)}{P(A)}    (3)

Since P(A) does not affect the computation of the most probable symbol string (the acoustic observation is the acoustic observation, no matter the potential string of symbols), we arrive at a variation of the standard formulation of probabilistic speech recognition (Jurafsky and Martin 2009):

    \hat{S} = \operatorname*{argmax}_{S \in L} P(A \mid S)\, P(S)    (4)

The difference is that the formulation has been generalized from words to any symbol string. P(A|S), known as the likelihood in Bayesian inference, is called the acoustic model in the context of automatic speech recognition. P(S), known as the prior probability in Bayesian inference, is called the language model in ASR. The acoustic model expresses the probability that a string of symbols—words, syllables, whatever—is associated with an acoustic signal in a training corpus. The language model expresses the probability that a sequence of symbols—again, words, syllables, whatever—is found in a training corpus.
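To make the decision rule in equation (4) concrete, consider a minimal Python sketch. It is ours, not the system's implementation: the candidate strings, probabilities, and function names are invented for illustration. A real decoder searches a lattice rather than a short list, but the arithmetic is the same.

    import math

    # Toy acoustic likelihoods P(A|S): how well each candidate symbol
    # string matches the (fixed) acoustic observation sequence A.
    acoustic = {
        ("ay", "w_aa_n_td", "t_uw", "f_l_ay"): 0.020,
        ("ay", "w_aa_n_td", "t_uw", "f_r_ay"): 0.015,
        ("ay", "w_ah_n",    "t_uw", "f_l_ay"): 0.010,
    }

    # Toy bigram language model P(S), stored as log-probabilities.
    bigram_logprob = {
        ("<s>", "ay"): math.log(0.5),
        ("ay", "w_aa_n_td"): math.log(0.4),
        ("ay", "w_ah_n"): math.log(0.1),
        ("w_aa_n_td", "t_uw"): math.log(0.6),
        ("w_ah_n", "t_uw"): math.log(0.2),
        ("t_uw", "f_l_ay"): math.log(0.3),
        ("t_uw", "f_r_ay"): math.log(0.05),
    }

    def lm_logprob(symbols):
        """log P(S) under the toy bigram model (unseen bigrams get a floor)."""
        logp, prev = 0.0, "<s>"
        for sym in symbols:
            logp += bigram_logprob.get((prev, sym), math.log(1e-6))
            prev = sym
        return logp

    def decode(candidates):
        """Equation (4): argmax over S of log P(A|S) + log P(S)."""
        return max(candidates,
                   key=lambda s: math.log(acoustic[s]) + lm_logprob(s))

    print(decode(list(acoustic)))  # -> ('ay', 'w_aa_n_td', 't_uw', 'f_l_ay')

Note that nothing in the sketch cares whether the symbols are words or syllables; only the sizes of the probability tables change.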
The attractiveness of syllables for the acoustic model of speech recognition has been noted for some time. A study of the SWITCHBOARD corpus found that over 20% of manually annotated phones are never realized acoustically, since phone deletion is common in fluent speech. On the other hand, the same study showed that 99% of canonical syllables are realized in speech. Syllables also have attractive distributional properties: the statistical distributions of the 300 most frequently occurring words in English and of the most common syllables are almost identical. Though monosyllabic words account for only 22% of SWITCHBOARD by type, they account for a full 81% of tokens (Greenberg 1999; Greenberg 2001; Greenberg et al. 2002). All of this suggests that the use of syllables in the acoustic model might avoid some of the difficulties associated with word pronunciation variation due to dialect, idiolect, speaking rate, acoustic environment, and pragmatic/semantic context.

Nevertheless, most studies indicate positive but not dramatic improvement when using a syllable-based acoustic model (Ganapathiraju et al. 1997 and 2001; Sethy and Narayanan 2003; Hamalainen et al. 2007). This has been disappointing given the theoretical attractiveness of syllables in the acoustic model. Since this paper is concerned exclusively with the language model and post-language-model processing, conjectures about the performance of syllables in the acoustic model are beyond its scope.

Still, many of the reasons that make syllables attractive in the acoustic model also make them attractive in the language model, along with another not mentioned in the literature on acoustic model research: there are fewer syllables than words, a topic explored later in this paper. Since the output of a recognizer using a syllable language model is a syllable string, studies of speech recognition using syllable language models have been limited to special-purpose systems where output word strings are not necessary.
These include reading trackers, audio indexing systems, and spoken name recognizers. Investigations report significant improvement over word language models (Bolanos et al. 2007; Schrumpf, Larson, and Eickler 2005; Sethy and Narayanan 1998). The system proposed here, however, does not end with a syllable string but, rather, passes this output to a concept model that transforms syllable strings to concept strings, as described later.

Researchers have recognized the potential usefulness of concepts in speech recognition: since the early nineties at Bell Labs, later at the University of Colorado, and still later at Microsoft Research (Pieraccini et al. 1991; Hacioglu and Ward 2001; Yaman et al. 2008). The system proposed here does not use words in any fashion (unlike the Bell Labs system), proposes the use of probabilistically generated concepts (unlike the Colorado system), and is more general than the utterance classification system developed at Microsoft. Further, it couples the use of sub-word units in the language model, specifically syllables, with concepts, an approach that appears to be novel.

Syllables, Perplexity, and Error Rate

One of the first things that a linguist might notice in the literature on the use of the syllable in the acoustic model is that its complexity is underappreciated. Rabiner and Juang (1993), an early text on speech recognition, has only two index entries for "syllable" and treats it as just another easily defined sub-word unit. This is peculiar, since the number of English syllables varies by a factor of 30 depending on whom one reads (Rabiner and Juang 1993; Ganapathiraju et al. 1997; Huang et al. 2001). In fact, there is a substantial linguistic literature on the syllable and how to define it across languages. This is important, since any piece of software that claims to syllabify words embodies a theory of the syllable. Thus, the syllabifier that is cited most frequently in the speech recognition literature, and the one used in the work described in this paper, implements a dissertation that is firmly in the tradition of generative linguistics (Kahn 1976). Since our work is motivated by more recent research in functional and cognitive linguistics (see, for example, Tomasello 2003), a probabilistic syllabifier might be more appropriate. We defer that to a later stage of the project, but note in passing that probabilistic syllabifiers have been developed (Marchand et al. 2007).

Still, even though researchers disagree on the number of syllables in English, that number is significantly smaller than the number of words. And therein lies part of their attractiveness for this research. Simply put, the syllable search space is significantly smaller than the word search space. Suppose language A has a words and language B has b words, where a > b. All other things being equal, the probability of correctly guessing a word from B is greater than the probability of guessing one from A. Suppose, further, that these words are not useful in and of themselves but contribute to some downstream task, the accuracy of which is proportional to the accuracy of the word recognition task. Substitute syllables for words in language B—since both are symbols—and this is exactly the argument being made here.

Now, one might ask: if syllables work so nicely in the language model of speech recognition, why not use another sub-word unit with an even smaller symbol set, say a phone or demi-syllable? Though the question is certainly worth investigating empirically, the proposed project uses syllables because they represent a compromise between a full word and a sound. By virtue of their length, they preserve more linguistic information than a phone and, unlike words, they represent a relatively closed set. Syllables tend not to change much over time.

A standard a priori indicator of the probable success of a language model is lower perplexity, where perplexity is defined as the Nth inverse root of the probability of a sequence of words (Jurafsky and Martin 2009; Ueberla 1994):

    PP(W) = P(w_1 w_2 \ldots w_n)^{-1/n}    (5)

Because there are fewer syllables than words, we would expect both their perplexity in a language model to be lower and their recognition accuracy to be higher. Since the history of science is littered with explanations whose self-evidence turned out to be incorrect upon examination, we offer a first pass at an empirical investigation.
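As an illustration of equation (5), the following Python sketch (ours, not the authors' toolkit; the toy corpus and smoothing floor are invented) computes the perplexity of a test sequence under a bigram model estimated from a training sequence.

    import math
    from collections import Counter

    def bigram_model(tokens):
        """Maximum-likelihood bigram probabilities from a training sequence."""
        pairs = Counter(zip(tokens, tokens[1:]))
        unigrams = Counter(tokens[:-1])
        return {(a, b): c / unigrams[a] for (a, b), c in pairs.items()}

    def perplexity(model, tokens, floor=1e-6):
        """Equation (5): PP = P(w_1 ... w_n)^(-1/n), computed in log space.
        Unseen bigrams get a small floor probability instead of zero."""
        logp = sum(math.log(model.get(bg, floor))
                   for bg in zip(tokens, tokens[1:]))
        return math.exp(-logp / len(tokens))

    train = "i want to fly to spokane i want to fly to seattle".split()
    test = "i want to fly to seattle".split()
    print(round(perplexity(bigram_model(train), test), 3))  # -> 1.414

The same code runs unchanged over syllable tokens; the comparison in this section amounts to calling it on two tokenizations of the same corpus.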
To compare the perplexity of syllable and word language models, we used two corpora: the Air Travel Information System corpus (Hemphill 1993) and a smaller corpus (SC) of human-computer dialogs captured using the Wizard-of-Oz protocol at Next It (Next It 2012), in which subjects thought they were interacting with a computer but in fact were conversing with a human being. The corpora were syllabified using software available from the National Institute of Standards and Technology (NIST 2012).

Test and training sets were created from the same collection of utterances, with the fraction of the collection used in the test set as a parameter. The results reported here use a randomly chosen 10% of the collection in the test set and the remaining 90% in the training set. The system computed the mean, median, and standard deviation over twenty runs. These computations were done for both word and syllable language models for unigrams, bigrams, trigrams, and quadrigrams (sequences of one, two, three, and four words or syllables). As a baseline, the perplexity of the unweighted language model—one in which any word/syllable has the same probability as any other—was computed.
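A sketch of that evaluation protocol (our reconstruction; build_model and perplexity are hypothetical stand-ins for the language-modeling toolkit): hold out a random 10% of the utterances, train on the rest, and aggregate over twenty runs.

    import random
    import statistics

    def run_trials(utterances, build_model, perplexity, trials=20, held_out=0.10):
        """Repeat the 90/10 split `trials` times; return the mean, median,
        and standard deviation of test-set perplexity."""
        scores = []
        for _ in range(trials):
            shuffled = random.sample(utterances, len(utterances))
            cut = int(len(shuffled) * held_out)
            test, train = shuffled[:cut], shuffled[cut:]
            scores.append(perplexity(build_model(train), test))
        return (statistics.mean(scores),
                statistics.median(scores),
                statistics.stdev(scores))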
For bigrams, trigrams, and quadrigrams, the perplexity of a syllable language model was less than that of a word language model. Of course, in comparing the perplexity of syllable and word language models, we are comparing sample spaces of different sizes.
This can introduce error based on the way perplexity computations assign probability mass to out-of-vocabulary tokens. It must be recognized, however, that syllable and word language models are not simply language models of different sizes of the kind that Ueberla (1994) considered. Rather, they are functionally related to one another. This suggests that the well-understood caution against comparing the perplexity of language models with different vocabularies might not apply completely in the case of syllables and words. Nevertheless, the drop in perplexity was so substantial in a few cases (37.8% for SC quadrigrams, 85.7% for ATIS bigrams) that it invited empirical investigation with audio data.

Recognition Accuracy

Symbol Error Rate (SER) is the familiar Word Error Rate (WER) generalized so that context clarifies whether we are talking about syllables, words, or concepts. The use of SER raises a potential problem. The number of syllables (either by type or token) differs from the number of words in the training corpus. Further, in all but monosyllabic training corpora, syllables will, on average, be shorter than words. How, then, can we compare error rates? The answer, as before, is that 1) words are functionally related to syllables and 2) improved accuracy in syllable recognition will contribute to downstream accuracy in concept recognition.
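SER, like WER, is conventionally computed from the Levenshtein alignment of a hypothesis against a reference: (substitutions + deletions + insertions) divided by reference length. A minimal sketch (ours), which works identically whether the symbols are words, syllables, or concepts:

    def symbol_error_rate(reference, hypothesis):
        """Levenshtein distance over symbols, divided by reference length."""
        n, m = len(reference), len(hypothesis)
        # dist[i][j]: edit distance between reference[:i] and hypothesis[:j]
        dist = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dist[i][0] = i
        for j in range(m + 1):
            dist[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution
        return dist[n][m] / n

    ref = "ay w_aa_n_td t_uw f_l_ay t_uw s_p_ow k_ae_n".split()
    hyp = "ay w_ah_n t_uw f_l_ay t_uw s_p_ow k_ae_n".split()
    print(round(symbol_error_rate(ref, hyp), 3))  # one substitution in seven symbols -> 0.143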
To test the hypothesis that a syllable language model would perform more accurately than a word language model, we gathered eighteen short audio recordings, evenly distributed by gender and recorded over both the public switched telephone network and mobile phones. The recognizer used was SONIC, from the Center for Spoken Language Research at the University of Colorado (SONIC 2010). The acoustic model was trained on the MACROPHONE corpus (Bernstein et al. 1994). Additional tools included a syllabifier and scoring software available from the National Institute of Standards and Technology (NIST 2012), and language modeling software developed by one of the authors.

The word-level transcripts in the training corpora were transformed to phone sequences via a dictionary look-up. The phone-level transcripts were then syllabified using the NIST syllabifier. The pronunciation lexicon, a mapping of words to phone sequences, was similarly transformed to map syllables to phone sequences. The word-level reference files against which the recognizer's hypotheses were scored were also run through the same process as the training transcripts to produce syllable-level reference files.
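As we read it, the transcript preparation amounts to two table look-ups. In the sketch below (ours), pron_dict (word to phone sequence) and syllabify (phone sequence to syllable groups, standing in for the NIST tool) are hypothetical stand-ins, not the actual interfaces used in the study.

    def words_to_syllables(transcript, pron_dict, syllabify):
        """Word transcript -> phone sequences (dictionary look-up) ->
        syllable strings, each rendered as an underscore-joined token."""
        syllables = []
        for word in transcript.lower().split():
            phones = pron_dict[word]          # e.g. ['s','p','ow','k','ae','n']
            for syl in syllabify(phones):     # e.g. [['s','p','ow'], ['k','ae','n']]
                syllables.append("_".join(syl))
        return syllables

    # Toy demonstration with a one-word lexicon and a placeholder syllabifier.
    pron = {"spokane": ["s", "p", "ow", "k", "ae", "n"]}
    split_in_two = lambda ph: [ph[:3], ph[3:]]  # placeholder, not a real syllabifier
    print(words_to_syllables("Spokane", pron, split_in_two))  # -> ['s_p_ow', 'k_ae_n']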
With these alterations, the recognizer transformed acoustic input into syllable output represented as a flavor of Arpabet. Figure 1 shows an extract from a reference file represented both in word and in phone-based syllable form.

    i want to fly from spokane to seattle
    ay waantd tuw flay frahm spow kaen tuw si ae dxaxl

    i would like to fly from seattle to san francisco
    ay wuhdd laykd tuw flay frahm siy ae dxaxl tuw saen fraen sih skow

    Figure 1: Word and Syllable References

The recognizer equipped with a syllable language model showed a mean improvement in SER over all N-gram sizes of 14.6% when compared to one equipped with a word language model. Though the results are preliminary and await confirmation with other corpora, and with the caveats already noted, they suggest that a recognizer equipped with a syllable language model will perform more accurately than one equipped with a word language model.[1] This will contribute to the downstream accuracy of the system described below. Of course, it must be pointed out that some of this extraordinary gain in recognition accuracy will necessarily be lost in the probabilistic transformation to concept strings.

[1] Though it might be interesting and useful to look at individual errors, the point to keep in mind is that we are looking for broad improvement. The components of SCH were not so much arguments as the initial justification for empirical investigations, investigations that will support or falsify SCH.

Concepts

At this point one might wonder about the usefulness of syllable strings, no matter how accurately they are recognized. We observe that the full range of a natural language is redundant in certain pre-specified domains, say a travel reservation system. Thus the words and phrases ticket, to book a flight, to book a ticket, to book some travel, to buy a ticket, to buy an airline ticket, to depart, to fly, to get to, all taken from the reference files for the audio used in this study, describe what someone wants in this constrained context, namely to go somewhere. With respect to a single word, we collapse morphology and auxiliary words used to denote person, tense, aspect, and mood into a base word. So fly, flying, going to fly, flew, go to, travelling to are grouped, along with certain formulaic phrases (book a ticket to), in the equivalence class GO. Similarly, the equivalence class WANT contains the elements buy, can I, can I have, could I, could I get, I like, I need, I wanna, I want, I would like, I'd like, I'd like to have, I'm planning on, looking for, need, wanna, want, we need, we would like, we'd like, we'll need, would like. We refer to these equivalence classes as concepts.

For example, a sentence from the language model (I want to fly to Spokane) was syllabified, giving:

    ay w_aa_n_td t_uw f_l_ay t_uw s_p_ow k_ae_n
Then concepts were mapped onto the syllable strings, producing:

    WANT GO s_p_ow k_ae_n

The mapping from concepts to syllable strings was rigid and was chosen in order to generate baseline results. The mapping rules required that at least one member of an equivalence class of syllable strings appear in the output string for the equivalence class name to be inserted in its place in the output file. For example, k_ae_n ay hh_ae_v (can I have) had to appear in its entirety in the output file for it to be replaced with the concept WANT.
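A minimal sketch of this rigid mapping (our reconstruction; the two-concept table is built from the paper's examples, and the GO phrase grouping is illustrative): each concept is an equivalence class of syllable strings, and a member must match exactly and in its entirety before the class name is substituted.

    CONCEPTS = {
        "WANT": ["ay w_aa_n_td",           # "i want"
                 "ay w_uh_dd l_ay_kd",     # "i would like"
                 "k_ae_n ay hh_ae_v"],     # "can i have"
        "GO":   ["t_uw f_l_ay t_uw"],      # "to fly to" (illustrative grouping)
    }

    def apply_concepts(syllable_string, concepts=CONCEPTS):
        """Rigid mapping: a concept name replaces a syllable substring only
        if one member of its equivalence class appears in its entirety."""
        out = syllable_string
        for name, phrases in concepts.items():
            # Try longer phrases first so subphrases do not pre-empt them.
            for phrase in sorted(phrases, key=len, reverse=True):
                out = out.replace(phrase, name)
        return out

    print(apply_concepts("ay w_aa_n_td t_uw f_l_ay t_uw s_p_ow k_ae_n"))
    # -> WANT GO s_p_ow k_ae_n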
The experiment required that we:

1. Develop concepts/equivalence classes from the training transcript used in the language model experiments.
2. Map the equivalence classes onto the reference files used to score the output of the recognizer. For each distinct syllable string that appears in one of the concept/equivalence classes, we substituted the name of the equivalence class for the syllable string. We did this for each of the 18 reference files that correspond to the 18 audio files. For example, WANT is substituted for every occurrence of ay w_uh_dd l_ay_kd (I would like).
3. Map the equivalence classes onto the output of the recognizer when using a syllable language model for N-gram sizes 1 through 4. We mapped the equivalence class names onto the content of each of the 72 output files (4 x 18) generated by the recognizer.
4. Determine the error rate of the output in step 3 with respect to the reference files in step 2.

As before, the SONIC recognizer, the NIST syllabifier and scoring software, and our own language modeling software were used. The experiments showed a mean increase in SER over all N-gram sizes of just 1.175%. Given the rigid mapping scheme, these results were promising enough to encourage us to begin work on: 1) reproducing the results on the much larger ATIS2 corpus (Garofalo 1993) and 2) a probabilistic concept model.

Current Work

We are currently building the system illustrated in Figure 2. The shaded portions describe our work. A crucial component is the concept generator. Under our definition, concepts are purely collocations of words and phrases—effectively, equivalence classes. In order for the system to be useful for multiple domains, we must go beyond our preliminary investigations: the concepts must be machine-generated. This will be done using a bootstrapping procedure first described for word-sense disambiguation. The algorithm takes advantage of "the strong tendency of words to exhibit only one sense per collocation and per discourse" (Yarowsky 1995: 50). The technique will begin with a hand-tagged seed set of concepts. These will be used to incrementally train a classifier to augment the seed concepts, as sketched below.
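A schematic of that Yarowsky-style bootstrapping loop (our sketch; the scoring function and threshold are placeholders, not the authors' design): phrases that collocate strongly with already-labeled members of a concept are pulled into the class, and the process repeats until nothing clears the threshold.

    def bootstrap_concepts(seed, corpus_phrases, score, threshold=0.8, max_rounds=10):
        """Yarowsky-style bootstrapping. `seed` maps concept name -> set of
        phrases; `score(phrase, members, corpus)` returns a collocation-based
        confidence that `phrase` belongs with `members`. Both are placeholders
        for a trained classifier."""
        concepts = {name: set(members) for name, members in seed.items()}
        for _ in range(max_rounds):
            grew = False
            for name, members in concepts.items():
                for phrase in corpus_phrases:
                    if phrase in members:
                        continue
                    if score(phrase, members, corpus_phrases) >= threshold:
                        members.add(phrase)  # promote a confident phrase
                        grew = True
            if not grew:                     # converged: no additions this round
                break
        return concepts

    # Toy demo: phrases sharing a first token with a seed member count as
    # collocates (a real system would use decision-list scores).
    seed = {"WANT": {"i want", "i would like"}}
    phrases = ["i need", "i wanna", "to fly", "grove city"]
    first = lambda p: p.split()[0]
    score = lambda p, members, _: 1.0 if first(p) in {first(m) for m in members} else 0.0
    print(bootstrap_concepts(seed, phrases, score))  # adds "i need" and "i wanna" to WANT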
The output of a speech recognizer equipped with a syllable language model is the most probable sequence of syllables given an acoustic event. The formalisms used to probabilistically map concepts to syllable strings are reworkings of equations (1) through (4), resulting in:

    \hat{C} = \operatorname*{argmax}_{C \in M} P(C \mid S) = \operatorname*{argmax}_{C \in M} P(S \mid C)\, P(C)    (6)

[Figure 2: system diagram. Components: User, Acoustic Feature Analysis, Acoustic Model, Decoder, Syllable Language Model, Phone-to-Syllable Lexicon, N-Best Syllable List, Concept Model, Concept Generator, Concepts, Domain Knowledge System, Speech Synthesizer, Intent Scorer, Intent Accuracy Score.]

Figure 2: Acoustic features are decoded into syllable strings using a syllable language model. The syllable strings are probabilistically mapped to concept strings. The N-best syllable list is rescored using concepts. The Intent Scorer enables comparison of performance with a conventional recognizer.

M is just the set of legal concepts created for a domain by the concept generator. Equation (6) is an extension of a classic problem in computational linguistics: probabilistic part-of-speech tagging. That is, given a string of words, what is the most probable string of parts-of-speech? In the case at hand, given a syllable string, what is the most probable concept string?
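By analogy with HMM part-of-speech tagging, equation (6) can be decoded with the Viterbi algorithm, with concepts playing the role of tags and syllables the role of words. The toy sketch below is ours: the probability tables are invented, and the one-syllable-per-concept emission model is a simplification of the multi-syllable phrases used above.

    import math

    def viterbi(syllables, states, log_start, log_trans, log_emit, floor=-20.0):
        """Most probable concept sequence given a syllable sequence: a bigram
        prior over concepts (log_trans) times per-syllable emissions
        (log_emit), decoded left to right."""
        best = {c: (log_start.get(c, floor) + log_emit[c].get(syllables[0], floor), [c])
                for c in states}
        for syl in syllables[1:]:
            nxt = {}
            for c in states:
                # best predecessor for state c
                p = max(states, key=lambda q: best[q][0] + log_trans[q].get(c, floor))
                score = best[p][0] + log_trans[p].get(c, floor) + log_emit[c].get(syl, floor)
                nxt[c] = (score, best[p][1] + [c])
            best = nxt
        return max(best.values(), key=lambda v: v[0])[1]

    states = ["WANT", "GO", "OTHER"]
    log_start = {"WANT": math.log(0.6), "GO": math.log(0.2), "OTHER": math.log(0.2)}
    log_trans = {
        "WANT":  {"WANT": math.log(0.5), "GO": math.log(0.4), "OTHER": math.log(0.1)},
        "GO":    {"GO": math.log(0.5), "OTHER": math.log(0.4), "WANT": math.log(0.1)},
        "OTHER": {"OTHER": math.log(0.8), "GO": math.log(0.1), "WANT": math.log(0.1)},
    }
    log_emit = {
        "WANT":  {"ay": math.log(0.5), "w_aa_n_td": math.log(0.5)},
        "GO":    {"t_uw": math.log(0.5), "f_l_ay": math.log(0.5)},
        "OTHER": {"s_p_ow": math.log(0.5), "k_ae_n": math.log(0.5)},
    }
    syls = "ay w_aa_n_td t_uw f_l_ay t_uw s_p_ow k_ae_n".split()
    print(viterbi(syls, states, log_start, log_trans, log_emit))
    # -> ['WANT', 'WANT', 'GO', 'GO', 'GO', 'OTHER', 'OTHER']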
Using equation (6), the Syllable-Concept Hypothesis introduced early in the paper can be formalized. If equation (1) describes how a recognizer goes about choosing a word string given a string of acoustic observations, then our enhanced recognizer can be described by equation (7):

    \hat{C} = \operatorname*{argmax}_{C \in M} P(C \mid A)    (7)

That is, we are looking for the legal concept string with the greatest probability given a sequence of acoustic observations. SCH, in effect, argues that P(C|A) exceeds P(W|A).

Finally, the question of how to judge the accuracy of the system, from the initial utterance to the output of the concept model, must be addressed. Notice that the concept strings themselves are human-readable. So,

    I WANT TO FLY TO SPOKANE

becomes:

    WANT GO s_p_ow k_ae_n

Amazon Mechanical Turk[2] workers will be presented with both the initial utterance as text and the output of the concept model as text, and asked to offer an opinion about accuracy based on an adaptation of the Likert scale. To judge how the proposed system performs relative to a conventional recognizer, the same test will be made, substituting the output of a recognizer equipped with a word language model and no concept model for the output of the proposed system.

[2] The Amazon Mechanical Turk allows computational linguists (and just about anyone else who needs a task that requires human intelligence) to crowd-source their data for human judgment. See https://www.mturk.com/mturk/welcome

Conclusion

We have argued that the speech recognition problem as conventionally formulated—the transformation of an acoustic signal to words—neither emulates human performance nor exhausts the uses to which ASR might be put. This suggests that we could bypass words in some ASR applications, going from an acoustic signal to probabilistically generated syllable strings and from there to probabilistically generated concept strings. Our experiments with syllables on small corpora have been promising:

• a 37.8% drop in perplexity with quadrigrams on the SC corpus
• an 85.7% drop in perplexity with bigrams on the ATIS corpus
• a 14.6% mean increase in recognition accuracy across bigram, trigram, and quadrigram models

But as has been pointed out, a syllable string is not useful in a dialog system. Concepts must be mapped to syllables. A concept, as we define it, is an equivalence class of words and phrases that seem to mean the same thing in a given context. To date, we have hand-generated concepts from reference files and mapped them to syllables using a rigid mapping scheme intended as a baseline.

But to be truly useful, any recognizer using concepts must automatically generate them. Since concepts, under our definition, are no more than collocations of words, we propose a technique first developed for word-sense disambiguation: incrementally generate a collection of concepts from a hand-generated set of seed concepts. The idea in both phases of our work—probabilistically generating syllable strings and probabilistically generating concept strings—is to reduce the search space below what conventional recognizers encounter. At the very end of this process, we propose scoring how closely the generated concepts match the intent of the speaker, using Mechanical Turk workers and a modified Likert scale. Ultimately the output of the system will be sent on to a domain knowledge system, from there to a speech synthesizer, and finally to the user, who, having heard the output, will respond, thus starting the cycle over again.

Our results to date suggest that the use of syllables and concepts in ASR will result in improved recognition accuracy over a conventional word-based speech recognizer. This improved accuracy has the potential to be used in fully functional dialog systems. The impact of such systems could be as far-reaching as the invention of the mouse and windowing software: opening up computing to persons with coordination difficulties or sight impairment, freeing digital devices from manual input, and transforming the structure of call centers. One application, often overlooked in catalogues of the uses to which ASR might be put, is surveillance.[3] The Defense Advanced Research Projects Agency (DARPA) helped give ASR its current shape. According to some observers, the NSA, as a metonym for all intelligence agencies, is drowning in unprocessed data, much of which is almost certainly speech (Bamford 2008). The kinds of improvements described in this paper, the kinds that promise to go beyond the merely incremental, are what are needed to take voice recognition to the next step.

[3] Please note that this paper is not necessarily an endorsement of all uses to which ASR might be put. It merely recognizes what is in fact the case.
References

Bamford, J. 2008. The Shadow Factory: The Ultra-Secret NSA from 9/11 to the Eavesdropping on America. New York: Random House.

Bernstein, J., Taussig, K., and Godfrey, J. 1994. MACROPHONE. Linguistic Data Consortium, Philadelphia, PA.

Bolanos, B., Ward, W., Van Vuuren, S., and Garrido, J. 2007. Syllable Lattices as a Basis for a Children's Speech Reading Tracker. Proceedings of Interspeech-2007, 198-201.

De Palma, P. 2010. Syllables and Concepts in Large Vocabulary Speech Recognition. Ph.D. dissertation, Department of Linguistics, University of New Mexico, Albuquerque, NM.

Ganapathiraju, A., Goel, V., Picone, J., Corrada, A., Doddington, G., Kirchof, K., Ordowski, M., and Wheatley, B. 1997. Syllable—A Promising Recognition Unit for LVCSR. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 207-214.

Ganapathiraju, A., Hamaker, J., Picone, J., Ordowski, M., and Doddington, G. 2001. Syllable-Based Large Vocabulary Continuous Speech Recognition. IEEE Transactions on Speech and Audio Processing 9(4): 358-366.

Garofalo, J. 1993. ATIS2. Linguistic Data Consortium, Philadelphia, PA.

Greenberg, S. 1999. Speaking in Shorthand—A Syllable-Centric Perspective for Understanding Pronunciation Variation. Speech Communication 29: 159-176.

Greenberg, S. 2001. From Here to Utility—Melding Insight with Speech Technology. Proceedings of the 7th European Conference on Speech Communication and Technology, 2485-2488.

Greenberg, S., Carvey, H., Hitchcock, L., and Chang, S. 2002. Beyond the Phoneme: A Juncture-Accent Model of Spoken Language. Proceedings of the 2nd International Conference on Human Language Technology Research, 36-43.

Hacioglu, K., and Ward, W. 2001. Dialog-Context Dependent Language Modeling Combining N-Grams and Stochastic Context-Free Grammars. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 537-540.

Hamalainen, A., Boves, L., de Veth, J., and Bosch, L. 2007. On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modeling. EURASIP Journal on Audio, Speech, and Music Processing, 46460, 1-11.

Hemphill, C. 1993. ATIS0. Linguistic Data Consortium, Philadelphia, PA.

Huang, X., Acero, A., and Hon, H.-W. 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall.

Jurafsky, D., and Martin, J. 2009. Speech and Language Processing. Upper Saddle River, NJ: Pearson/Prentice Hall.

Kahn, D. 1976. Syllable-Based Generalizations in English Phonology. Ph.D. dissertation, distributed by the Indiana University Linguistics Club, Bloomington, IN.

Marchand, Y., Adsett, C., and Damper, R. 2007. Evaluating Automatic Syllabification Algorithms for English. Proceedings of the 6th International Conference of the Speech Communication Association, 316-321.

Next It Corporation. 2012. Web Customer Service with Intelligent Virtual Agents. Retrieved March 2012 from http://www.nextit.com.

NIST. 2012. Language Technology Tools/Multimodal Information Group—Tools. Retrieved 2/19/2012 from http://www.nist.gov.

Pieraccini, R., Levin, E., and Lee, C. 1991. Stochastic Representation of Conceptual Structure in the ATIS Task. Proceedings of the DARPA Speech and Natural Language Workshop, 121-124.

Rabiner, L., and Juang, B. 1993. Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall.

Sang-Hun, C. 2007. Myanmar, Fear Is Ever Present. The New York Times, October 21, 2007.

Schrumpf, C., Larson, M., and Eickler, S. 2005. Syllable-Based Language Models in Speech Recognition for English Spoken Document Retrieval. Proceedings of the 7th International Workshop of the EU Network of Excellence DELOS on Audio-Visual Content and Information Visualization in Digital Libraries, 196-205.

Sethy, A., and Narayanan, S. 2003. Split-Lexicon Based Hierarchical Recognition of Speech Using Syllable and Word Level Acoustic Units. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, I, 772-775.

SONIC. 2010. SONIC: Large Vocabulary Continuous Speech Technology. Retrieved 3/8/2010 from http://techexplorer.ucsys.edu/show_NCSum.cfm?NCS=258626.

Tomasello, M., ed. 2003. The New Psychology of Language: Cognitive and Functional Approaches to Language Structure. Mahwah, NJ: Lawrence Erlbaum Associates.

Ueberla, J. 1994. Analyzing and Improving Statistical Language Models for Speech Recognition. Ph.D. dissertation, School of Computing Science, Simon Fraser University.

Yaman, S., Deng, L., Yu, D., Wang, W., and Acero, A. 2008. An Integrative and Discriminative Technique for Spoken Utterance Classification. IEEE Transactions on Audio, Speech, and Language Processing 16(6): 1207-1214.

Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 189-196.