<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Re-ranking ASR Outputs for Spoken Sentence Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yeongkil Song</string-name>
          <email>nlpyksong@kangwon.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hyeokju Ahn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harksoo Kim</string-name>
          <email>nlpdrkim@kangwon.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Program of Computer and Communications Engineering, College of IT, Kangwon National University</institution>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In spoken information retrieval, users' spoken queries are converted into text queries by ASR engines. If the top-1 results of the ASR engines are incorrect, the errors propagate to the information retrieval system. If the document collection is a small set of short texts, these errors affect retrieval performance even more severely. To improve the top-1 accuracy of ASR engines, we propose a post-processing model that rearranges the top-n outputs of an ASR engine by using Ranking SVM. To improve re-ranking performance, the proposed model uses various features such as ASR ranking information, morphological information, and domain-specific lexical information. In the experiments, the proposed model achieved 4.4% higher precision and 6.4% higher recall than the baseline model without any post-processing. This result suggests that the proposed model can be used as a post-processor for improving the performance of a spoken information retrieval system when the document collection is a restricted set of sentences.</p>
      </abstract>
      <kwd-group>
        <kwd>Re-ranking</kwd>
        <kwd>ASR outputs</kwd>
        <kwd>spoken sentence retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the rapid evolution of smartphones, the need for information retrieval based on
spoken queries is increasing. Many information retrieval systems use automatic
speech recognition (ASR) systems to convert users’ spoken queries into text
queries. During query conversion, ASR systems often make recognition
errors, and these errors cause irrelevant documents to be returned. If the retrieval target
documents (the so-called document collection) are a small set of short texts such as
frequently asked questions (FAQs) and restricted chatting sentences (i.e., a chatting corpus for
implementing an intelligent personal assistant such as Siri, S-Voice, and Q-Voice),
information retrieval systems will not perform well because a few incorrectly
recognized keywords critically affect the ranking of documents, as shown in Fig. 1
[1].</p>
      <p>To resolve this problem, many post-processing methods for revising ASR errors have
been proposed. Ringger and Allen [2] proposed a statistical model for detecting and
correcting ASR error patterns. Brandow and Strzalkowski [3] proposed a rule-based
method that generates a set of correction rules from ASR results. Jeong et al. [4] proposed
a noisy channel model to detect error patterns in ASR results. These previous
models share a weakness: they need a parallel corpus that includes ASR result texts
and their correct transcriptions. To overcome this problem, Choi et al. [5] proposed an
ASR-engine-independent error correction method and reported a precision of
about 72% in recognizing named entities in spoken sentences. Although the previous
models showed reasonable performance, they deal only with the first-ranked
sentences among the ASR results. As a result, low-ranked sentences are not
considered even when they are correct ASR outputs, as shown in the following
Romanized Korean example.</p>
      <p>Spoken query: mwol ipgo inni (What are you wearing?)</p>
      <p>Rank 1: meorigo inni (Is a head?)</p>
      <p>Rank 2: mwol ipgo inni (What are you wearing?)</p>
      <p>To resolve this problem, we propose a machine learning model that re-ranks the top-n
outputs of an ASR system. In the above example, we expect the proposed model
to move Rank 2 to Rank 1. If the document collection is large, it may
not be easy to apply supervised machine learning models to re-rank ASR outputs
because the models need a large training data set annotated by humans.
However, if the document collection is a small set of short messages such as FAQs or a
chatting corpus, we think that supervised machine learning models can be applied
because the document collection is small enough to be annotated by
humans.</p>
    </sec>
    <sec id="sec-2">
      <title>Re-ranking Model of ASR Outputs</title>
      <sec id="sec-2-1">
        <title>Overview of the Proposed Model</title>
        <p>The proposed model consists of two parts: a training part and a re-ranking part. Fig. 1
shows the overall architecture of the proposed model.</p>
        <p>As shown in Fig. 1, we first collect the top-n ASR outputs for a document collection (a
set of sentences in this paper) in which each sentence is uttered by six people. We use
Google’s ASR engine, which returns the top-5 outputs per utterance. Then, we
manually annotate the collected corpus with correct ranks. Next, the proposed system
generates a training model based on Ranking SVM (support vector machine), which is an
application of SVM used for solving certain ranking problems [6]. When users input
spoken queries, the proposed system re-ranks the ASR outputs of the spoken queries
based on the training model. Then, the system hands over the first-ranked output among the
re-ranked results to an information retrieval system.</p>
        <p>To rearrange the top-n ASR outputs, we use Ranking SVM, a modification of
the traditional SVM algorithm that allows it to rank instances instead of classifying
them [7]. Given a small collection of ASR outputs ranked according to a preference <italic>R</italic>*,
two ASR outputs <italic>d</italic><sub>i</sub>, <italic>d</italic><sub>j</sub> ∈ <italic>R</italic>*, and a linear learning function <italic>f</italic>:</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[d_i \succ d_j \Leftrightarrow f(d_i) > f(d_j)]]></tex-math>
        </disp-formula>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[f(d_i) > f(d_j) \Leftrightarrow \mathbf{w} \cdot d_i > \mathbf{w} \cdot d_j]]></tex-math>
        </disp-formula>
        <p>where the ASR outputs are represented as a set of features. The linear learning
function <italic>f</italic> is defined as <italic>f</italic>(<italic>d</italic>) = <italic>w</italic> · <italic>d</italic>, as shown in Equation (<xref ref-type="disp-formula" rid="eq2">2</xref>).
In Equation (<xref ref-type="disp-formula" rid="eq2">2</xref>), the vector <italic>w</italic> can be learned by the standard SVM learning method
using slack variables, as shown in Equation (<xref ref-type="disp-formula" rid="eq3">3</xref>).</p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math><![CDATA[\begin{aligned}
\text{minimize:}\quad & \mathbf{w} \cdot \mathbf{w} + C \sum_{i,j \in |R|} \xi_{ij} \\
\text{subject to:}\quad & \forall (d_i, d_j) \in R^{*}:\ \mathbf{w} \cdot d_i \ge \mathbf{w} \cdot d_j + 1 - \xi_{ij} \\
& \forall (i, j):\ \xi_{ij} \ge 0
\end{aligned}]]></tex-math>
        </disp-formula>
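        <p>The pairwise formulation in Equation (3) can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes plain subgradient descent on the equivalent hinge-loss objective instead of a dedicated SVM solver, and the two-dimensional feature vectors, the hyperparameters, and all function names are hypothetical and chosen only for illustration.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def dot(u, v):
    """Inner product w . d of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def preference_pairs(candidates):
    """Turn one query's candidates into (preferred, other) feature-vector pairs.

    candidates: list of (feature_vector, annotated_rank); rank 1 is best.
    """
    pairs = []
    for (xa, ra), (xb, rb) in combinations(candidates, 2):
        if ra < rb:
            pairs.append((xa, xb))
        elif rb < ra:
            pairs.append((xb, xa))
    return pairs


def train_ranking_svm(pairs, dim, C=1.0, epochs=200, lr=0.01):
    """Subgradient descent on  w.w + C * sum(max(0, 1 - w.(d_i - d_j)))."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [2.0 * wk for wk in w]  # gradient of the w.w term
        for di, dj in pairs:
            diff = [a - b for a, b in zip(di, dj)]
            if dot(w, diff) < 1.0:  # margin violated, so the slack is positive
                grad = [g - C * d for g, d in zip(grad, diff)]
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w


def rerank(w, outputs):
    """Sort (text, feature_vector) outputs by f(d) = w.d, highest first."""
    return sorted(outputs, key=lambda o: dot(w, o[1]), reverse=True)


# Two toy training queries; each candidate is (feature_vector, annotated_rank).
# The two features are hypothetical: [scaled ASR rank, share of in-domain words].
training_queries = [
    [([1.0, 0.2], 2), ([0.5, 1.0], 1)],
    [([1.0, 0.9], 1), ([0.5, 0.3], 2)],
]
pairs = [p for q in training_queries for p in preference_pairs(q)]
w = train_ranking_svm(pairs, dim=2)

candidates = [("meorigo inni", [1.0, 0.0]), ("mwol ipgo inni", [0.5, 1.0])]
reranked = rerank(w, candidates)
print(reranked[0][0])
```
]]></preformat>
        <p>In this toy setting the learned weight vector favors the second (hypothetical) feature, so the lower-ranked but correct ASR output is moved to the top, which mirrors the behavior expected from the re-ranking part.</p>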
        <p>To represent ASR outputs in the vector space of Ranking SVM, we convert
each ASR output into a feature vector. Table 1 shows the defined feature set.</p>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Feature Name</th></tr>
            </thead>
            <tbody>
              <tr><td>ASR-Rank</td></tr>
              <tr><td>ASR-Score</td></tr>
              <tr><td>MOR-Bigram</td></tr>
              <tr><td>POS-Bigram</td></tr>
              <tr><td>NUM-DUW</td></tr>
              <tr><td>LEX-DUW</td></tr>
              <tr><td>NUM-GUW</td></tr>
              <tr><td>LEX-GUW</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In Table 1, ASR-Rank is an integer from 1 to 5 because Google’s ASR
engine returns five ASR outputs ranked in descending order. ASR-Score is the ASR score
mapped onto a 10-point scale from 0.1 through 1.0. For example, an ASR score of
0.35 is mapped to 0.4 on the 10-point scale. MOR-Bigram and POS-Bigram
are the morpheme bigrams and POS bigrams obtained from the result of
morphological analysis. For example, if the result of morphological analysis is “I/prop can/aux
understand/verb you/prop”, MOR-Bigram is the set { ^;I I;can can;understand
understand;you you;$ }, and POS-Bigram is the set { ^;prop prop;aux aux;verb verb;prop
prop;$ }. In the example, ‘^’ and ‘$’ are symbols that represent the beginning and
the end of a sentence, respectively. NUM-DUW and LEX-DUW are features associated
with domain-specific lexical knowledge. The domain dictionary used in NUM-DUW
and LEX-DUW is a set of content words (i.e., nouns and verbs) that is
automatically extracted from training data annotated with POS tags by a morphological
analyzer. NUM-GUW and LEX-GUW are features associated with general lexical
knowledge. The general dictionary used in NUM-GUW and LEX-GUW is a set of
content words that are registered as entry words in the general-purpose dictionary of a
conventional morphological analyzer.</p>
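        <p>The bigram and score features described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the input is assumed to be a whitespace-separated string of morpheme/POS tokens as in the example, and the 10-point mapping is assumed to round up, which is consistent with the single example given (0.35 is mapped to 0.4). The dictionary-based features (NUM-DUW, LEX-DUW, NUM-GUW, LEX-GUW) are omitted because their exact definitions are not given here.</p>
        <preformat><![CDATA[
```python
import math


def parse_morphemes(analysis):
    """Split 'I/prop can/aux' into [('I', 'prop'), ('can', 'aux')]."""
    return [tuple(tok.rsplit("/", 1)) for tok in analysis.split()]


def bigrams(items):
    """Bigrams over items padded with '^' (sentence start) and '$' (sentence end)."""
    padded = ["^"] + list(items) + ["$"]
    return [f"{a};{b}" for a, b in zip(padded, padded[1:])]


def asr_score_bucket(score):
    """Map an ASR score in (0, 1] onto the 10-point scale 0.1 .. 1.0.

    Assumption: scores are rounded up to the next tenth (0.35 -> 0.4).
    round() removes floating-point noise before taking the ceiling.
    """
    return math.ceil(round(score * 10, 6)) / 10


def extract_features(analysis, asr_rank, asr_score):
    """Build the ASR-Rank, ASR-Score, MOR-Bigram and POS-Bigram features."""
    pairs = parse_morphemes(analysis)
    return {
        "ASR-Rank": asr_rank,
        "ASR-Score": asr_score_bucket(asr_score),
        "MOR-Bigram": set(bigrams(m for m, _ in pairs)),
        "POS-Bigram": set(bigrams(p for _, p in pairs)),
    }


features = extract_features(
    "I/prop can/aux understand/verb you/prop", asr_rank=1, asr_score=0.35
)
print(sorted(features["MOR-Bigram"]))
print(sorted(features["POS-Bigram"]))
```
]]></preformat>
        <p>To feed such features to a linear model, each distinct bigram would still have to be mapped to one dimension of the feature vector; that indexing step is left out of the sketch.</p>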
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Data Set and Experimental Settings</title>
        <p>We collected a chatting corpus that contains 1,000 sentences. Then, we asked six
university students (three males and three females) to utter the short sentences by
using a smartphone application that saves the top-5 outputs of Google’s ASR engine.
Next, we manually annotated the outputs with new rankings according to the lexical agreement rate
between the user’s input utterance and each ASR output. In other words, the more an
ASR output lexically coincides with the user’s input utterance, the higher the ASR output
is ranked. Finally, we divided the annotated corpus into training data (800 sentences)
and testing data (200 sentences). To evaluate the proposed model, we used precision
at one (P@1) and recall rate at one (R@1) as performance
measures, as shown in Equation (<xref ref-type="disp-formula" rid="eq4">4</xref>). We performed 5-fold cross validation.</p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math><![CDATA[\begin{aligned}
P@1 &= \frac{\#\text{ of sentences correctly ranked in top-1 by the proposed model}}{\#\text{ of sentences ranked in top-1 by the proposed model}} \\
R@1 &= \frac{\#\text{ of sentences correctly ranked in top-1 by the proposed model}}{\#\text{ of sentences correctly ranked in top-1 by an ASR engine}}
\end{aligned}]]></tex-math>
        </disp-formula>
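        <p>The annotation rule (the more an ASR output lexically coincides with the input utterance, the higher it is ranked) can be sketched as follows. The exact definition of the lexical agreement rate is not specified here, so the sketch assumes a Jaccard overlap between whitespace-separated tokens; the function names are hypothetical, and the data are the Romanized Korean example from the introduction.</p>
        <preformat><![CDATA[
```python
def lexical_agreement(reference, hypothesis):
    """Jaccard overlap between whitespace tokens (an assumed agreement measure)."""
    ref, hyp = set(reference.split()), set(hypothesis.split())
    if not ref and not hyp:
        return 1.0
    return len(ref & hyp) / len(ref | hyp)


def annotate_ranks(reference, asr_outputs):
    """Order ASR outputs from highest to lowest agreement with the reference.

    sorted() is stable, so outputs with equal agreement keep their ASR order.
    """
    return sorted(
        asr_outputs,
        key=lambda out: lexical_agreement(reference, out),
        reverse=True,
    )


ranked = annotate_ranks("mwol ipgo inni", ["meorigo inni", "mwol ipgo inni"])
print(ranked)
```
]]></preformat>
        <p>A character-level or morpheme-level measure could be substituted for the token overlap without changing the rest of the sketch.</p>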
      </sec>
      <sec id="sec-3-2">
        <title>Experimental Results</title>
        <p>We computed the performance of the proposed model for each user, as shown in
Table 2.
In Table 2, ASR-only is the baseline model that returns the top-1 output of the ASR engine
without any re-ranking. The recall rate at five (R@5) of Google’s ASR
engine was 0.705. That is, Google’s ASR engine failed to correctly
recognize 29.5% of the testing data: 29.5% of the users’ utterances were not
included in the top-5 outputs of Google’s ASR engine. As shown in Table 2, the proposed
model achieved 4.4% higher precision and 6.4% higher recall than the
baseline model. This result suggests that the proposed model can help improve
the performance of a spoken sentence retrieval system when the document collection is a
small set of short texts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We proposed a re-ranking model to improve the top-1 performance of an ASR engine.
The proposed model rearranges ASR outputs based on Ranking SVM. To improve
re-ranking performance, the proposed model uses various features such as ASR
ranking information, morphological information, and domain-specific lexical information.
In the experiments with a restricted set of sentences, the proposed model
outperformed the baseline model (4.4% higher precision and 6.4% higher
recall). This result suggests that the proposed model can be
used as a post-processor for improving the performance of a spoken sentence retrieval
system.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was supported by the IT R&amp;D program of MOTIE/MSIP/KEIT
[10041678, The Original Technology Development of Interactive Intelligent Personal
Assistant Software for the Information Service on multiple domains]. This research
was also supported by the Basic Science Research Program through the National
Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and
Technology (2013R1A1A4A01005074).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Cluster-Based FAQ Retrieval Using Latent Term Weights</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ),
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ringger</surname>
            ,
            <given-names>E. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allen</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          :
          <article-title>Error Correction via a Post-processor for Continuous Speech Recognition</article-title>
          .
          <source>In: Proceedings of IEEE International Conference on the Acoustics, Speech and Signal Processing</source>
          , pp.
          <fpage>427</fpage>
          -
          <lpage>430</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brandow</surname>
            ,
            <given-names>R. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strzalkowski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improving Speech Recognition through Text-Based Linguistic Post-processing</article-title>
          .
          <source>United States Patent</source>
          <volume>6064957</volume>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          :
          <article-title>Speech Recognition Error Correction Using Maximum Entropy Language Model</article-title>
          .
          <source>In: Proceedings of the International Speech Communication Association</source>
          , pp.
          <fpage>2137</fpage>
          -
          <lpage>2140</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ryu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          :
          <article-title>Engine-Independent ASR Error Management for Dialog Systems</article-title>
          .
          <source>In: Proceedings of the 5th International Workshop on Spoken Dialog System</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Optimizing Search Engines Using Clickthrough Data</article-title>
          .
          <source>In: Proceedings of ACM SIGKDD</source>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Arens</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          :
          <article-title>Learning to Rank Documents with Support Vector Machines via Active Learning</article-title>
          .
          <source>Ph.D dissertation</source>
          , University of Iowa (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>