<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recent improvements on error detection for automatic speech recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yannick Estève</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nathalie Camelin</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Automatic speech recognition (ASR) offers the ability to access the semantic content present in spoken language within audio and video documents. While acoustic models based on deep neural networks have recently significantly improved the performance of ASR systems, automatic transcriptions still contain errors. These errors hamper the exploitation of ASR outputs by introducing noise into the text. To reduce this noise, ASR error detection can be applied in order to remove recognized words labelled as errors. This paper presents an approach that outperforms previous state-of-the-art approaches. It is based on a neural approach, and more specifically on a study of acoustic and linguistic word embeddings, which are representations of words in a continuous space. In comparison to the previous state-of-the-art approach, which was based on Conditional Random Fields, our approach reduces the classification error rate by 7.2%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Advances in speech processing and the
availability of powerful computing devices have led to better performance in
speech recognition. However, recognition errors remain
unavoidable, whatever the quality of the ASR system. This reflects
the sensitivity of these systems to variability in the acoustic environment, the speaker,
the language style and the theme of the speech. These errors can have a
considerable impact on downstream automatic processes
such as information retrieval or speech-to-speech translation.</p>
      <p>Errors can be due to a misinterpretation of the
signal: for example, noise from the
environment or a problem with the quality of the recording channel may be
interpreted as speech by the system. Errors may
also come from the mispronunciation of a word. Overlapping speech,
when two speakers do not respect turn-taking and talk at the same time, also
disturbs the sound signal.</p>
      <p>The efficient generation of speech transcriptions in any condition
(e.g. noise-free environment, etc.) remains the ultimate goal, which has
not yet been reached. Error detection can help to improve the
exploitation of ASR outputs by downstream applications, but it is a difficult
task because there are several types of errors, which
range from the simple substitution of a word with a homophone to
the insertion of a word that is irrelevant to the overall understanding of
the word sequence. Errors can also affect neighboring words and
create a whole region of erroneous words.</p>
      <p>Error detection can be performed in three steps. First, a
set of features is generated, either from the ASR system or from other
sources of knowledge. Then, based on these features, correctness
probabilities (confidence measures) are estimated. Finally, a decision is
made by applying a threshold to these probabilities.</p>
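      <p>The three steps above can be sketched as follows. This is only an illustrative sketch: the two features (ASR posterior and word length), the logistic model, its weights and the 0.5 threshold are assumptions chosen for the example, not the system described in this paper.</p>
      <preformat>
```python
import math


def extract_features(word, asr_posterior):
    """Step 1: build a small feature vector (hypothetical features)."""
    return [asr_posterior, float(len(word))]


def correctness_probability(features, weights, bias):
    """Step 2: map the features to a confidence in [0, 1] (logistic model, assumed)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def label_words(hypothesis, weights, bias, threshold=0.5):
    """Step 3: threshold the confidence to decide 'correct' or 'error'."""
    labels = []
    for word, posterior in hypothesis:
        features = extract_features(word, posterior)
        confidence = correctness_probability(features, weights, bias)
        labels.append("correct" if confidence >= threshold else "error")
    return labels


# Toy ASR hypothesis: (recognized word, posterior probability).
hypothesis = [("the", 0.95), ("whether", 0.20), ("today", 0.90)]
labels = label_words(hypothesis, weights=[6.0, 0.0], bias=-3.0)
print(labels)
```
      </preformat>
      <p>Raising the threshold flags more words as errors (higher recall on errors, lower precision), which is why the decision step is kept separate from the estimation of the confidence measure.</p>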
      <p>
        Many studies have focused on ASR error detection. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors
applied error detection to filter data for
unsupervised learning of an acoustic model. Their approach was based on
applying two thresholds to the linear combination of two
confidence measures. The first one is derived from the language model and
takes into account backoff behavior during ASR decoding. This
measure differs from the language model score because it
provides information about the word context. The second is the posterior
probability extracted from the confusion network. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors
addressed the issue of error region detection and characterization in
Large Vocabulary Continuous Speech Recognition (LVCSR)
transcriptions. They proposed to classify error regions into four classes;
in particular, they were interested in person name errors, which carry
critical information in many information retrieval applications. They
proposed several sequential detection and classification approaches, as well as
an integrated sequence labeling approach. The ASR error detection
problem is related to the out-of-vocabulary (OOV) detection task,
considering that the behavior and impact of OOV errors differ from those of other
errors, and assuming that OOV words contribute to recognition errors on
surrounding words. Many studies have focused on detecting OOV errors.
More recently, in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the authors also focused on detecting
error regions generated by OOV words. They proposed an approach
based on a CRF tagger, which takes into account contextual
information from neighboring regions instead of considering only the local
region of OOV words. This approach led to a significant
improvement over the state of the art. The generalization of this
approach to other ASR errors was presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which proposes
an error detection system based on a CRF tagger using various ASR,
lexical and syntactic features. Experiments performed
on two English corpora from the DARPA BOLT project showed
the validity of this approach for the detection of important errors.
In [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], new features gathered from knowledge sources other than
the decoder itself were explored for ASR error detection:
a binary feature that compares the outputs of two different ASR
systems (word by word), a feature based on the number of hits of the
hypothesized bigrams, obtained by queries submitted to a very
popular Web search engine, and finally a feature related to automatically
inferred topics at sentence and word levels. Two of these three new
features, the binary word match feature and the bigram hit feature, led to
significant improvements, with both a maximum entropy model and
linear-chain conditional random fields, compared to a
baseline using only decoder-based features. A neural network classifier
trained to locate errors in an utterance using a variety of features
is presented in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Two approaches are proposed to extract
confidence measures: the first one is based on Recurrent Neural Network
Language Model (RNNLM) features to capture long-distance
context within and across previous utterances. The second one consists
of combining complementary state-of-the-art DNN and GMM ASR systems
for effective error detection, by leveraging DNN and GMM
confusion networks that store word confusion information from multiple
systems for feature extraction.
      </p>
      <p>The ASR error detection method presented in this paper
incorporates a set of features into a confidence classifier built on
neural network architectures, including MLP and DNN, which
assigns a label (error or correct) to each word of an ASR
hypothesis.</p>
      <p>
        A combination approach based on an auto-encoder is
applied to combine well-known word embeddings: this combination
takes advantage of the complementarity of these different
word embeddings, as recently shown in one of our previous studies
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>ASR error detection based on word embeddings</title>
      <p>The error detection system has to attribute the label correct or
error to each word in the ASR transcript. Each decision is based on a
set of heterogeneous features. In our approach, this classification is
performed by analyzing each recognized word within its context.</p>
      <p>The proposed ASR error detection system is based on a feed-forward
neural network and is designed to be fed with different kinds of
features, including word embeddings.</p>
    </sec>
    <sec id="sec-3">
      <title>Architecture</title>
      <p>
        This ASR error detection system is based on a multi-stream
strategy to train the network, named multilayer perceptron multi stream
(MLP-MS). The MLP-MS architecture is used in order to better
integrate the contextual information from neighboring words. This
architecture is inspired by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where word and semantic features are
integrated for topic identification in telephone conversations. The
training of the MLP-MS is based on pre-training the hidden layers
separately and then fine-tuning the whole network. The proposed
architecture, depicted in Figure 1, is detailed as follows: three feature
vectors, described in the next section, are used as input to the network.
These vectors are respectively the feature
vector representing the two left words (L), the feature vector
representing the current word (W) and the feature vector for the two right
words (R). Each feature vector is used separately in order to train a
multilayer perceptron (MLP) with a single hidden layer. Formally,
the architecture is described by the following equations:
<disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[H_{1,X} = f(P_{1,X} \times X + b_{1,X})]]></tex-math></disp-formula>
where X represents respectively the three feature vectors (L, W and
R), P<sub>i</sub> is the weight matrix and b<sub>i</sub> is the bias vector.</p>
      <p>The resulting vectors H1;L, H1;W and H1;R are concatenated to
form the first hidden layer H1. The H1 vector is presented as the
input of the second MLP-MS hidden layer H2 computed according
to the equation:</p>
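      <p><disp-formula id="eq2"><label>(2)</label><tex-math><![CDATA[H_2 = g(P_2 \times H_1 + b_2)]]></tex-math></disp-formula></p>
      <p>The output layer is then computed from H2 as:
<disp-formula id="eq3"><label>(3)</label><tex-math><![CDATA[O_k = q(P_O \times H_2 + b_O)]]></tex-math></disp-formula>
Note that in our experiments f and g are respectively the rectified linear
unit (ReLU) and hyperbolic tangent (tanh) activation functions,
and q is the softmax function.</p>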
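      <p>A minimal sketch of the forward pass described by equations (1) to (3) is given below, using plain Python lists. The layer sizes, the random initialisation, the dictionary layout of the parameters and the order of the two output labels are assumptions made for illustration; pre-training and fine-tuning are not shown.</p>
      <preformat>
```python
import math
import random


def affine(matrix, vector, bias):
    """Return matrix * vector + bias for plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]


def relu(values):
    return [max(0.0, x) for x in values]


def tanh(values):
    return [math.tanh(x) for x in values]


def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_ms_forward(streams, params):
    """streams maps 'L', 'W' and 'R' to the three input feature vectors."""
    h1 = []
    for name in ("L", "W", "R"):
        p1, b1 = params["P1"][name]
        h1.extend(relu(affine(p1, streams[name], b1)))  # equation (1), f = ReLU
    p2, b2 = params["P2"]
    h2 = tanh(affine(p2, h1, b2))                       # equation (2), g = tanh
    po, bo = params["PO"]
    return softmax(affine(po, h2, bo))                  # equation (3), q = softmax


rng = random.Random(0)
dims = {"L": 6, "W": 3, "R": 6}   # hypothetical feature sizes
hidden1, hidden2 = 4, 5           # hypothetical hidden layer sizes
params = {
    "P1": {name: random_layer(rng, hidden1, d) for name, d in dims.items()},
    "P2": random_layer(rng, hidden2, 3 * hidden1),
    "PO": random_layer(rng, 2, hidden2),
}
streams = {name: [rng.random() for _ in range(d)] for name, d in dims.items()}
output = mlp_ms_forward(streams, params)
print(output)  # two probabilities, assumed here to be ordered (correct, error)
```
      </preformat>
      <p>Keeping one weight matrix per stream in the first layer is what makes the network multi-stream: the left context, the current word and the right context are each projected separately before being concatenated into H1.</p>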
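      <p>[Figure 1: MLP-MS architecture. The words Wi-2 and Wi-1 feed the hidden layer H1-L, the current word Wi feeds H1-W, and the words Wi+1 and Wi+2 feed H1-R; these three layers feed H2, which feeds the output layer.]</p>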
      <p>ASR confidence scores: confidence scores are the posterior
probabilities generated by the ASR system (PAP). The word
posterior probability is computed over confusion networks, and is
approximated by the sum of the posterior probabilities of all
transitions through the word that are in competition with it.</p>
      <p>Lexical features: lexical features are derived from the word
hypothesis output by the ASR system. They include the word
length, that is, the number of letters in the word, and three
binary features indicating whether the three 3-grams containing the
current word have been seen in the training corpus of the ASR
language model.</p>
      <p>
        Syntactic features: we obtain syntactic features by automatically
assigning part-of-speech tags (POS tags), dependency labels (a
dependency label is the grammatical relation that holds between a governor,
or head, and a dependent) and word governors, which are
extracted from the word hypothesis output by using the MACAON
NLP tool chain<sup>2</sup> [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to process the ASR outputs.
      </p>
      <p>
        Linguistic word representation (embedding or symbol): The
orthographic representation of a word is used in CRF approaches
as for instance in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Using our neural approach we can handle
different word embeddings, which permits us to take advantage of
the generalizations extracted during the construction of the
continuous vectors.
      </p>
      <p>
        Acoustic word embeddings: these vectors represent the
pronunciation of a word as a projection into a high-dimensional space.
Words projected close to each other are acoustically
similar [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Finally, the output layer is a vector Ok of k = 2 nodes corresponding
to the two labels correct and error (see equation 3).
      </p>
      <p><sup>2</sup> http://macaon.lif.univ-mrs.fr</p>
      <p>Different approaches have been proposed to create word embeddings
through neural networks. These approaches can differ in the type of
architecture and in the data used to train the model. In this study, we
distinguish two categories of word embeddings: those estimated
on unlabeled data, and those estimated on labeled data
(dependency-based word embeddings). These representations are detailed
respectively in the next subsections.</p>
      <p>
        This section presents three types of word embeddings coming from
two available implementations (word2vec [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and GloVe [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]):
Skip-gram: This architecture from [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] takes as input the target
word w<sub>i</sub> and outputs the preceding and following words.
The target word w<sub>i</sub> is at the input layer, and the context words
C are at the output layer. The task consists in predicting the context
words C given the current word w<sub>i</sub>.
      </p>
      <p>The skip-gram model with negative sampling seeks to
represent each word w<sub>i</sub> and each context C as d-dimensional vectors
(V<sub>wi</sub>, V<sub>C</sub>) such that similar words have similar vector
representations. This is done by maximizing the dot product V<sub>wi</sub> · V<sub>C</sub>
for the good word-context pairs that occur in the
document D and minimizing it for negative examples, which do not
necessarily exist in D. These negative examples are created by
stochastically corrupting the pairs (w<sub>i</sub>, C), hence the name negative
sampling.</p>
      <p>Also, the context is not limited to the immediate context:
training instances can be created by skipping a constant number of
words in the context, for instance w<sub>i-3</sub>, w<sub>i-4</sub>, w<sub>i+3</sub>, w<sub>i+4</sub>, hence
the name skip-gram.</p>
      <p>
        GloVe: This approach is introduced by [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and relies on
constructing a global co-occurrence matrix X of words, by
processing the corpus using a sliding context window. Here, each element
Xij represents the number of times the word j appears in the
context of word i.
      </p>
      <p>The model is based on the global co-occurrence matrix X instead
of the actual corpus, thus the name GloVe, for Global Vectors.
This model seeks to build vectors Vi and Vj that retain some useful
information about how every pair of words i and j co-occur, such
as:</p>
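      <p><disp-formula id="eq4"><label>(4)</label><tex-math><![CDATA[V_i^T V_j + b_i + b_j = \log X_{ij}]]></tex-math></disp-formula>
where b<sub>i</sub> and b<sub>j</sub> are the bias terms associated with words i and j,
respectively.</p>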
      <p>This is accomplished by minimizing a cost function J , which
evaluates the sum of all squared errors:</p>
      <p>
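        <disp-formula id="eq5"><label>(5)</label><tex-math><![CDATA[J = \sum_{i} \sum_{j} f(X_{ij}) \left( V_i^T V_j + b_i + b_j - \log X_{ij} \right)^2]]></tex-math></disp-formula>
where f is a weighting function used to prevent learning
only from very common word pairs. The authors define f as
follows [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]:
<disp-formula><tex-math><![CDATA[f(X_{ij}) = \begin{cases} X_{ij} / X_{max} & \text{if } X_{ij} < X_{max} \\ 1 & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
      </p>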
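      <p>The weighting function and the cost J can be sketched as below. This is an illustrative sketch, not the GloVe implementation: the function names, the toy co-occurrence counts and the value of x_max are assumptions. The original GloVe formulation additionally raises the ratio to a power alpha (0.75 in its reference setting); the sketch exposes alpha as a parameter and defaults it to 1.0 to match the formula as written here. Only observed (non-zero) co-occurrences are summed, since log X<sub>ij</sub> is undefined for zero counts.</p>
      <preformat>
```python
import math


def glove_weight(x_ij, x_max=100.0, alpha=1.0):
    """Weighting f(X_ij): grows with the count and is capped at 1 from x_max on."""
    if x_ij >= x_max:
        return 1.0
    return (x_ij / x_max) ** alpha


def glove_cost(cooc, vectors, biases, x_max=100.0, alpha=1.0):
    """Weighted sum of squared errors over the observed co-occurrence counts."""
    cost = 0.0
    for (i, j), x_ij in cooc.items():
        dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
        error = dot + biases[i] + biases[j] - math.log(x_ij)
        cost += glove_weight(x_ij, x_max, alpha) * error ** 2
    return cost


# Toy example with two words and 2-dimensional vectors (hypothetical values).
cooc = {(0, 1): 50.0, (1, 0): 50.0, (0, 0): 100.0}
vectors = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
cost = glove_cost(cooc, vectors, biases)
print(cost)
```
      </preformat>
      <p>In this toy case the pairs seen 50 times contribute with weight 0.5, whereas the pair seen 100 times reaches the cap and contributes with weight 1, which illustrates how f limits the influence of very frequent pairs.</p>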
      <sec id="sec-3-1">
        <title>Dependency-based word embeddings</title>
        <p>
          Levy et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposed an extension of word2vec, called
word2vecf and denoted w2vf-deps, which allows linear
bag-of-words contexts to be replaced with arbitrary features.
        </p>
        <p>
          This model is a generalization of the skip-gram model with
negative sampling introduced by [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and it requires labeled data for
training. As in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we derive contexts from dependency trees: a
word is used to predict its governor and dependents, jointly with their
dependency labels. This effectively allows for variable window size.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Word embedding combination</title>
        <p>
          In this work, we experimented with different ways
to combine the word embeddings presented above. As described in
a previous paper [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the use of an auto-encoder is very effective.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acoustic word embeddings</title>
    </sec>
    <sec id="sec-5">
      <title>Building acoustic word embeddings</title>
      <p>
        The approach we used to build acoustic word embeddings is inspired
by the one proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Word embeddings are trained through
a deep neural architecture, depicted in Figure 2, which relies on a
convolutional neural network (CNN) classifier over words and on a
deep neural network (DNN) trained by using a triplet ranking loss [
        <xref ref-type="bibr" rid="ref21 ref22 ref3">3,
21, 22</xref>
        ]. This architecture was proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with the purpose of using
the scores derived from the word classifier for lattice rescoring. The
two architectures are trained using different inputs: the speech signal and
the orthographic representation of the word.
      </p>
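      <p>[Figure 2: Deep neural architecture used to train acoustic word embeddings. A CNN (convolution and max pooling layers followed by fully connected layers) with a softmax output produces the embedding s; a DNN trained with a triplet ranking loss maps the orthographic representations o+ and o- to the embeddings w+ and w-.]</p>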
      <p>A word embedding derived from an orthographic representation can be
perceived as a canonical acoustic representation for a word, since
different pronunciations imply different embeddings s.</p>
      <p>
        As in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the orthographic word representation consists of a bag of
n-grams (n ≤ 3) of letters, composed of 10222 trigrams, bigrams,
and unigrams of letters, including the special symbols [ and ] to specify
the start and the end of a word. Then, we use an auto-encoder to
reduce the size of this bag-of-n-grams vector to d dimensions. To check
the quality of the resulting orthographic representation, a neural
network is trained to predict a word given this orthographic
representation. It reaches 99.99% accuracy on the training set composed of
the 52k words of the vocabulary, showing the richness of this
representation.
      </p>
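      <p>The bag of letter n-grams can be illustrated with the short sketch below. It is a sketch under assumptions: the function names are hypothetical, counting the boundary symbols [ and ] as unigrams is a choice made for the example, and the vocabulary is built from two words only rather than from the 10222 n-grams mentioned above. The auto-encoder that reduces the vector to d dimensions is not shown.</p>
      <preformat>
```python
from collections import Counter


def letter_ngrams(word, max_n=3):
    """Bag (multiset) of letter n-grams for n = 1..max_n, with [ and ] as boundaries."""
    marked = "[" + word + "]"
    bag = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(marked) - n + 1):
            bag[marked[start:start + n]] += 1
    return bag


def to_vector(bag, vocabulary):
    """Count vector of the bag over a fixed, ordered n-gram vocabulary."""
    return [bag.get(ngram, 0) for ngram in vocabulary]


bag = letter_ngrams("seau")
# Toy vocabulary: every n-gram observed in two example words.
vocabulary = sorted(set(letter_ngrams("seau")) | set(letter_ngrams("sot")))
vector = to_vector(bag, vocabulary)
print(sorted(bag))
print(vector)
```
      </preformat>
      <p>The boundary symbols let the representation distinguish, for instance, a word-initial "s" (n-gram "[s") from an "s" elsewhere in the word.</p>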
      <p>
        Similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a DNN was trained by using the triplet ranking
loss [
        <xref ref-type="bibr" rid="ref21 ref22 ref3">3, 21, 22</xref>
        ] in order to project the orthographic word
representation to the acoustic embeddings s obtained from the CNN
architecture, which is trained independently. It takes as input a word
orthographic representation and outputs an embedding vector of the same
size as s. During the training process, this model takes as inputs an
acoustic embedding s selected randomly from the training set, the
orthographic representation of the matching word o+, and the
orthographic representation of a randomly selected word different from the
first word, o-. These two orthographic representations are processed with shared
parameters in the DNN.
      </p>
      <p>We call t = (s, w+, w-) a triplet, where s is the acoustic signal
embedding, w+ is the embedding obtained through the DNN for the
matching word, and w- is the embedding obtained for the wrong
word. The triplet ranking loss is defined as:
<disp-formula id="eq6"><label>(6)</label><tex-math><![CDATA[Loss = \max(0, m - Sim_{dot}(s, w^{+}) + Sim_{dot}(s, w^{-}))]]></tex-math></disp-formula>
where Sim<sub>dot</sub>(x, y) is the dot product function used to compute the
similarity between two vectors x and y, and m is a margin
parameter that regularizes the margin between the two similarities
Sim<sub>dot</sub>(s, w+) and Sim<sub>dot</sub>(s, w-). This loss is weighted according
to the rank, in the CNN output, of the word matching the audio signal.</p>
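      <p>Equation (6) can be checked on a small numerical example. The sketch below is illustrative only: the vectors and the margin value of 1.0 are assumptions, and the rank-based weighting of the loss mentioned above is left out.</p>
      <preformat>
```python
def dot(x, y):
    """Dot-product similarity between two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))


def triplet_ranking_loss(s, w_pos, w_neg, margin=1.0):
    """max(0, m - Sim(s, w+) + Sim(s, w-)) with dot-product similarity."""
    return max(0.0, margin - dot(s, w_pos) + dot(s, w_neg))


s = [1.0, 0.0]        # acoustic embedding (toy values)
w_pos = [2.0, 0.0]    # embedding of the matching word
w_neg = [0.5, 1.0]    # embedding of a wrong word

# The matching word is already more similar than the wrong one by more
# than the margin, so the triplet contributes no loss.
satisfied = triplet_ranking_loss(s, w_pos, w_neg)
# Swapping the two words violates the ranking and yields a positive loss.
violated = triplet_ranking_loss(s, w_neg, w_pos)
print(satisfied, violated)
```
      </preformat>
      <p>The loss is zero as soon as the matching word is ranked above the wrong word by at least the margin m, so training only updates the model on triplets that are still badly ordered.</p>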
      <p>The resulting trained model can then be used to build an
acoustic embedding (w+) for any word, as long as one can extract an
orthographic representation from it.</p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <sec id="sec-6-1">
        <title>Experimental data</title>
        <p>
          Experimental data for ASR error detection is based on the entire
official ETAPE corpus [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], composed of audio recordings of French
broadcast news shows, with manual transcriptions (reference). This
corpus is enriched with automatic transcriptions generated by the
LIUM ASR system, which is a multi-pass system based on the CMU
Sphinx decoder, using GMM/HMM acoustic models. This ASR
system won the ETAPE evaluation campaign in 2012. A detailed
description is presented in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>The automatic transcriptions have been aligned with reference
transcriptions using the sclite<sup>3</sup> tool. From this alignment, each word
in the corpora has been labeled as correct (C) or error (E). The
description of the experimental data, in terms of size, word error rate
(WER) as well as percentage of substitution (Sub), deletion (Del)
and insertion (Ins), is reported in Table 1.</p>
        <p>[Table 1: Description of the experimental data. Rows: Train, Dev, Test; recoverable column headers: Name, WER.]</p>
        <p>
          The performance of the proposed approach is compared with a
state-of-the-art system based on CRFs [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] provided by the Wapiti
tagger<sup>4</sup> [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and applied to the set of features presented in Section 2.2.
The ASR error detection systems (MLP-MS and CRF) are trained
on the training corpus (Train) and applied to the test set (Test).
The development set (Dev) was used to tune all the parameters: the
learning rate, the batch size and the hidden layer sizes of the MLP-MS,
and the feature template of the CRF, which describes which features are
used in training and testing.
        </p>
        <p>The performance is evaluated by using recall (R), precision (P)
and F-measure (F) for the misrecognized word prediction and global
Classification Error Rate (CER). CER is defined as the ratio of the
number of misclassifications over the number of recognized words.</p>
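        <p>These measures can be computed as in the sketch below, where the error label (E) is treated as the positive class. The label sequences are toy data, the function name is hypothetical, and the F-measure is assumed to be the balanced harmonic mean of precision and recall.</p>
        <preformat>
```python
def detection_metrics(reference, predicted):
    """reference and predicted hold one 'C' (correct) or 'E' (error) label per recognized word."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == "E" and p == "E")  # errors detected
    fp = sum(1 for r, p in pairs if r == "C" and p == "E")  # false alarms
    fn = sum(1 for r, p in pairs if r == "E" and p == "C")  # missed errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    cer = (fp + fn) / len(pairs)  # misclassifications over recognized words
    return {"P": precision, "R": recall, "F": f_measure, "CER": cer}


reference = ["C", "C", "E", "C", "E", "C", "C", "C", "E", "C"]
predicted = ["C", "E", "E", "C", "C", "C", "C", "C", "E", "C"]
metrics = detection_metrics(reference, predicted)
print(metrics)
```
        </preformat>
        <p>In this toy example two of the three errors are detected with one false alarm, so precision and recall are both 2/3 and the CER is 2 misclassifications out of 10 words.</p>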
        <p>The linguistic word embeddings described in Section 2.3 have
200 dimensions. They were computed from a large textual corpus
of about 2 billion words. This corpus was built from
articles of the French newspaper “Le Monde”, the French Gigaword
corpus, articles provided by Google News, and manual transcriptions
of about 400 hours of French broadcast news.</p>
        <p>
          The training set for the convolutional neural network used to
compute acoustic word embeddings consists of 488 hours of French
Broadcast News with manual transcriptions. This dataset is
composed of data coming from the ESTER1 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], ESTER2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and
EPAC [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] corpora.
        </p>
        <p>It contains 52k unique words that are each seen at least twice in
the corpus, for a total of 5.75 million
occurrences. In French, many words have the same pronunciation
without sharing the same spelling, and they can have different
meanings; e.g. the sound [so] corresponds to four homophones: sot (fool),
saut (jump), sceau (seal) and seau (bucket), and twice as many when
taking into account their plural forms, which have the same pronunciation:
sots, sauts, sceaux, and seaux. When a CNN is trained to predict a
word given an acoustic sequence, these frequent homophones can
bias the evaluation of the recognition error. To avoid this, we
merged all the homophones existing among the 52k unique words of
the training corpus. As a result, we obtained a new reduced dictionary
containing 45k words and classes of homophones.</p>
        <p>Acoustic features provided to the CNN are log-filterbanks,
computed every 10 ms over a 25 ms window, yielding a 23-dimensional
vector for each frame. A forced alignment between manual
transcriptions and speech signal was performed on the training set in order
to detect word boundaries. The statistics computed from this
alignment reveal that 99% of words are shorter than 1 second. Hence we
decided to represent each word by 100 frames, and thus by a vector of
2300 dimensions. Shorter words are padded with zeros
equally on both ends, while longer words are cut equally on both
ends.</p>
        <p>[...] the CRF, especially by using a combination of embeddings. It yields
a 5.8% CER reduction compared to the CRF by using only linguistic
word embeddings. By also using acoustic word embeddings, the CER
reduction reaches 7.2%. One can also notice that the use of an
auto-encoder to combine word embeddings is really useful to capture
the complementarities of the different individual linguistic word embeddings.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper, we have investigated the use of a neural network to
detect ASR errors. Specifically, we proposed to effectively represent
words through linguistic and acoustic word embeddings.</p>
      <p>Experiments were carried out on automatic transcriptions generated by
the LIUM ASR system applied to the ETAPE corpus (French broadcast
news). They show that the proposed neural architecture, using
acoustic word embeddings as additional features, outperforms the
state-of-the-art approach based on Conditional Random Fields,
with a reduction of the classification error rate of 7.2%.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work was partially funded by the European Commission
through the EUMSSI project, under the contract number 611057, in
the framework of the FP7-ICT-2013-10 call and by the Région Pays
de la Loire.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Frédéric</given-names>
            <surname>Béchet</surname>
          </string-name>
          and Benoit Favre, '
          <article-title>ASR error segment localisation for spoken recovery strategy</article-title>
          ', in
          <source>Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference</source>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Frédéric</given-names>
            <surname>Béchet</surname>
          </string-name>
          and Benoit Favre, '
          <article-title>ASR error segment localization for spoken recovery strategy</article-title>
          ', in
          <source>Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on</source>
          , pp.
          <fpage>6837</fpage>
          -
          <lpage>6841</lpage>
          , (May
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Samy</given-names>
            <surname>Bengio</surname>
          </string-name>
          and Georg Heigold, '
          <article-title>Word embeddings for speech recognition</article-title>
          ', in
          <source>INTERSPEECH</source>
          , pp.
          <fpage>1053</fpage>
          -
          <lpage>1057</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Paul</given-names>
            <surname>Deléglise</surname>
          </string-name>
          , Yannick Estève, Sylvain Meignier, and Teva Merlin, '
          <article-title>Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate?</article-title>
          ', in
          <source>Interspeech</source>
          , Brighton, UK, (September
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Dufour</surname>
          </string-name>
          , Géraldine Damnati, and Delphine Charlet, '
          <article-title>Automatic error region detection and characterization in LVCSR transcriptions of TV news shows</article-title>
          ', in
          <source>Acoustics, Speech and Signal Processing (ICASSP)</source>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Yannick</given-names>
            <surname>Estève</surname>
          </string-name>
          , Thierry Bazillon, Jean-Yves Antoine, Frédéric Béchet, and Jérôme Farinas, '
          <article-title>The EPAC Corpus: Manual and Automatic Annotations of Conversational Speech in French Broadcast News</article-title>
          ', in
          <source>LREC</source>
          . Citeseer, (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yannick</given-names>
            <surname>Estève</surname>
          </string-name>
          , Mohamed Bouallegue, Carole Lailler, Mohamed Morchid, Richard Dufour, Georges Linarès, Driss Matrouf, and Renato De Mori, '
          <article-title>Integration of word and semantic features for theme identification in telephone conversations</article-title>
          ', in
          <source>6th International Workshop on Spoken Dialog Systems (IWSDS 2015)</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sylvain</given-names>
            <surname>Galliano</surname>
          </string-name>
          , Edouard Geoffrois, Djamel Mostefa, Khalid Choukri, Jean-François Bonastre, and Guillaume Gravier, '
          <article-title>The ESTER phase II evaluation campaign for the rich transcription of French Broadcast News</article-title>
          ', in
          <source>Interspeech</source>
          , pp.
          <fpage>1149</fpage>
          -
          <lpage>1152</lpage>
          , (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Sylvain</given-names>
            <surname>Galliano</surname>
          </string-name>
          , Guillaume Gravier, and Laura Chaubard, '
          <article-title>The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts</article-title>
          ', in
          <source>Interspeech</source>
          , volume
          <volume>9</volume>
          , pp.
          <fpage>2583</fpage>
          -
          <lpage>2586</lpage>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Sahar</given-names>
            <surname>Ghannay</surname>
          </string-name>
          , Yannick Estève, Nathalie Camelin, Benoit Favre, Camille Dutrey, Fabian Santiago, and
          <string-name>
            <given-names>Martine</given-names>
            <surname>Adda-Decker</surname>
          </string-name>
          , '
          <article-title>Word embeddings for ASR error detection: combinations and evaluation</article-title>
          .',
          <source>in LREC</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Gravier</surname>
          </string-name>
          , Gilles Adda, Niklas Paulsson, Matthieu Carré, Aude Giraudel, and Olivier Galibert, '
          <article-title>The ETAPE corpus for the evaluation of speech-based TV content processing in the French language</article-title>'
          ,
          <source>in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Lavergne</surname>
          </string-name>
          , Olivier Cappé, and François Yvon, '
          <article-title>Practical very large scale CRFs</article-title>'
          ,
          <source>in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pp.
          <fpage>504</fpage>
          -
          <lpage>513</lpage>
          . Association for Computational Linguistics, (July
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          and Yoav Goldberg, '
          <article-title>Dependency-based word embeddings</article-title>'
          ,
          <source>in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</source>
          , volume
          <volume>2</volume>
          , pp.
          <fpage>302</fpage>
          -
          <lpage>308</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Julie</given-names>
            <surname>Mauclair</surname>
          </string-name>
          , Yannick Estève, Simon Petit-Renaud, and Paul Deléglise, '
          <article-title>Automatic detection of well recognized words in automatic speech transcription</article-title>'
          ,
          <source>in Proceedings of the International Conference on Language Resources and Evaluation (LREC)</source>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          , '
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ', (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Nasr</surname>
          </string-name>
          , Frédéric Béchet, Jean-François Rey, Benoît Favre, and Joseph Le Roux, '
          <article-title>Macaon: An NLP tool suite for processing word lattices</article-title>'
          ,
          <source>in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations</source>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>91</lpage>
          . Association for Computational Linguistics, (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Parada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Filimonov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Jelinek</surname>
          </string-name>
          , '
          <article-title>Contextual information improves OOV detection in speech</article-title>
          ',
          <source>in North American Chapter of the Association for Computational Linguistics (NAACL)</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Pellegrini</surname>
          </string-name>
          and Isabel Trancoso, '
          <article-title>Improving ASR error detection with non-decoder based features</article-title>
          .', in Interspeech, pp.
          <fpage>1950</fpage>
          -
          <lpage>1953</lpage>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , '
          <article-title>GloVe: Global vectors for word representation</article-title>
          .',
          <source>in EMNLP</source>
          , volume
          <volume>14</volume>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yik-Cheung</given-names>
            <surname>Tam</surname>
          </string-name>
          , Yun Lei, Jing Zheng, and Wen Wang, '
          <article-title>ASR error detection using recurrent neural network language model and complementary ASR</article-title>'
          ,
          <source>in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pp.
          <fpage>2312</fpage>
          -
          <lpage>2316</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Jiang</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Song</surname>
          </string-name>
          , Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu, '
          <article-title>Learning fine-grained image similarity with deep ranking</article-title>'
          ,
          <source>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>1386</fpage>
          -
          <lpage>1393</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          , Samy Bengio, and Nicolas Usunier, '
          <article-title>Wsabie: Scaling up to large vocabulary image annotation</article-title>'
          ,
          <source>in IJCAI</source>
          , volume
          <volume>11</volume>
          , pp.
          <fpage>2764</fpage>
          -
          <lpage>2770</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>